| Age | Commit message (Collapse) | Author |
|
iommu_device_register() walks every device on the PCI bus via
bus_for_each_dev() and calls amd_iommu_probe_device() for each. The
inlined check_device() path computes the device's sbdf, calls
rlookup_amd_iommu() to find the owning IOMMU, and only afterwards
verifies devid <= pci_seg->last_bdf. __rlookup_amd_iommu() indexes
rlookup_table[devid] with no bounds check of its own, so for a PCI
device whose BDF is not described by the IVRS, the lookup reads past
the end of the allocation before the caller's bounds check can run.
This was harmless before commit e874c666b15b ("iommu/amd: Change
rlookup, irq_lookup, and alias to use kvalloc()"): the table was a
zeroed page-order allocation, so the over-read returned NULL and the
caller's NULL check skipped the device. After that commit the table is
a tight kvcalloc() and the over-read returns adjacent slab contents,
which check_device() then dereferences as a struct amd_iommu *,
causing a boot-time GPF.
Seen on Google Compute Engine ct6e VMs, where the virtualized IVRS
describes only the four TPU endpoints 00:04.0-07.0; the gVNIC at
00:08.0 (devid 0x40) indexes 56 bytes past the 456-byte allocation,
into the adjacent kmalloc-512 slab object:
pci 0000:00:04.0: Adding to iommu group 0
pci 0000:00:05.0: Adding to iommu group 1
pci 0000:00:06.0: Adding to iommu group 2
pci 0000:00:07.0: Adding to iommu group 3
Oops: general protection fault, probably for non-canonical address 0x3a64695f78746382: 0000 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.18.22 #1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 12/06/2025
RIP: 0010:amd_iommu_probe_device+0x54/0x3a0
Call Trace:
__iommu_probe_device+0x107/0x520
probe_iommu_group+0x29/0x50
bus_for_each_dev+0x7e/0xe0
iommu_device_register+0xc9/0x240
iommu_go_to_state+0x9c0/0x1c60
amd_iommu_init+0x14/0x40
pci_iommu_init+0x16/0x60
do_one_initcall+0x47/0x2f0
Guard the array access in __rlookup_amd_iommu(). With the fix applied
on 6.18.22, the gVNIC at 00:08.0 is skipped cleanly and the VM boots.
Fixes: e874c666b15b ("iommu/amd: Change rlookup, irq_lookup, and alias to use kvalloc()")
Cc: stable@vger.kernel.org
Reported-by: Ziyuan Chen <zc@anthropic.com>
Tested-by: Ziyuan Chen <zc@anthropic.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Assisted-by: Claude:unspecified
Signed-off-by: Jose Fernandez (Anthropic) <jose.fernandez@linux.dev>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
In iommu_mmio_write() and iommu_capability_write(), the variables
dbg_mmio_offset and dbg_cap_offset are declared as int. However, they
are populated using kstrtou32_from_user(). If a user provides a
sufficiently large value, it can become a negative integer.
Prior to this patch, the AMD IOMMU debugfs implementation was already
protected by different mechanisms.
1. #define OFS_IN_SZ 8 ensures the user string <= 8 bytes, so
e.g. 0xffffffff isn't a valid input.
if (cnt > OFS_IN_SZ)
return -EINVAL;
2. Implicit type promotion in iommu_mmio_write(), dbg_mmio_offset is int
and iommu->mmio_phys_end is u64
if (dbg_mmio_offset > iommu->mmio_phys_end - sizeof(u64))
return -EINVAL;
3. The show handlers would currently catch the negative number and
refuse to perform the read.
Replace kstrtou32_from_user() with kstrtos32_from_user() to parse the
input, and check for negative values to explicitly prevent out-of-bounds
memory accesses directly in iommu_mmio_write() and
iommu_capability_write().
Signed-off-by: Eder Zulian <ezulian@redhat.com>
Fixes: 7a4ee419e8c1 ("iommu/amd: Add debugfs support to dump IOMMU MMIO registers")
Cc: stable@vger.kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Bitwise OR | operator has a higher precedence than the ternary ?:
operatior. It will be incorrectly evaluated as:
new->data[1] |= (FIELD_PREP(...) | dev_data->ats_enabled) ? DTE_FLAG_IOTLB : 0;
Wrap the conditional operation in parentheses to enforce the
correct evaluation order.
Fixes: 93eee2a49c1b ("iommu/amd: Refactor logic to program the host page table in DTE")
Signed-off-by: Weinan Liu <wnliu@google.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Due to CVE-2023-20585, the PPR log buffer must use the maximum supported
size (512K) on Genoa (Family 0x19, model >= 0x10) systems when SNP is
enabled, to mitigate a potential security vulnerability. Note that Family
0x19 models below 0x10 (Milan) do not support PPR when SNP is enabled.
Hence the PPR log size increase is only applied for model >= 0x10.
All other systems continue to use the default PPR log buffer size (8K).
Apply the errata fix by making the following changes:
- Introduce global new variable (amd_iommu_pprlog_size) to have PPR log buffer
size. Adjust variable size for Genoa family.
- Extend 'amd_iommu_apply_erratum_snp()' to also set the PPR log buffer
size to maximum for Family 0x19 model >= 0x10 when SNP is enabled.
- Rename PPR_* macros to make it more readable.
Link: https://www.amd.com/en/resources/product-security/bulletin/amd-sb-3016.html
Cc: Borislav Petkov <bp@alien8.de>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Due to CVE-2023-20585, the Event log buffer must use the maximum supported
size (512K) on Milan/Genoa (Family 0x19) systems when SNP is enabled,
to mitigate a potential security vulnerability. All other systems continue to
use the default Event log buffer size (8K).
Apply the errata fix by making the following changes:
* Introduce new global variable (amd_iommu_evtlog_size) to have event log
buffer size. Adjust variable size for family 0x19.
* Since 'iommu_snp_enable()' must be called after the core IOMMU subsystem
is initialized, it cannot be moved to the early init stage. The SNP errata
must also be applied after the 'iommu_snp_enable()' check. Therefore,
'alloc_event_buffer()' and 'iommu_enable_event_buffer()' are now called
in the IOMMU_ENABLED state, after the errata is applied.
* Adjust alloc_event_buffer() and iommu_enable_event_buffer() to handle
all IOMMU instances.
* Also rename EVT_* macros to make it more readable.
Link: https://www.amd.com/en/resources/product-security/bulletin/amd-sb-3016.html
Cc: Borislav Petkov <bp@alien8.de>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
'intel/vt-d', 'amd/amd-vi' and 'core' into next
|
|
DMA aliasing causes interrupt remapping table entries (IRTEs) to be shared
between multiple device IDs. See commit 3c124435e8dd
("iommu/amd: Support multiple PCI DMA aliases in IRQ Remapping") for more
information on this. However, the AMD IOMMU driver currently invalidates
IRTE cache entries on a per-device basis whenever an IRTE is updated, not
for each alias.
This approach leaves stale IRTE cache entries when an IRTE is cached under
one DMA alias but later updated and invalidated through a different alias.
In such cases, the original device ID is never invalidated, since it is
programmed via aliasing.
This incoherency bug has been observed when IRTEs are cached for one
Non-Transparent Bridge (NTB) DMA alias, later updated via another.
Fix this by invalidating the interrupt remapping table cache for all DMA
aliases when updating an IRTE.
Co-developed-by: Lars B. Kristiansen <larsk@dolphinics.com>
Signed-off-by: Lars B. Kristiansen <larsk@dolphinics.com>
Co-developed-by: Jonas Markussen <jonas@dolphinics.com>
Signed-off-by: Jonas Markussen <jonas@dolphinics.com>
Co-developed-by: Tore H. Larsen <torel@simula.no>
Signed-off-by: Tore H. Larsen <torel@simula.no>
Signed-off-by: Magnus Kalland <magnus@dolphinics.com>
Link: https://lore.kernel.org/linux-iommu/9204da81-f821-4034-b8ad-501e43383b56@amd.com/
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Currently clone_alias() assumes first argument (pdev) is always the
original device pointer. This function is called by
pci_for_each_dma_alias() which based on topology decides to send
original or alias device details in first argument.
This meant that the source devid used to look up and copy the DTE
may be incorrect, leading to wrong or stale DTE entries being
propagated to alias device.
Fix this by passing the original pdev as the opaque data argument to
both the direct clone_alias() call and pci_for_each_dma_alias(). Inside
clone_alias(), retrieve the original device from data and compute devid
from it.
Fixes: 3332364e4ebc ("iommu/amd: Support multiple PCI DMA aliases in device table")
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
iommupt always supports the semantics required for DMA-FQ, when drivers
are converted to use it they automatically get support.
Detect iommpt directly instead of using IOMMU_CAP_DEFERRED_FLUSH and
remove IOMMU_CAP_DEFERRED_FLUSH from converted drivers.
This will also enable DMA-FQ on RISC-V.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
In the current AMD IOMMU debugfs, when multiple processes simultaneously
access the IOMMU mmio/cap registers using the IOMMU debugfs, illegal
access issues can occur in the following execution flow:
1. CPU1: Sets a valid access address using iommu_mmio/capability_write,
and verifies the access address's validity in iommu_mmio/capability_show
2. CPU2: Sets an invalid address using iommu_mmio/capability_write
3. CPU1: accesses the IOMMU mmio/cap registers based on the invalid
address, resulting in an illegal access.
This patch modifies the execution process to first verify the address's
validity and then access it based on the same address, ensuring
correctness and robustness.
Signed-off-by: Guanghui Feng <guanghuifeng@linux.alibaba.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
In the current AMD IOMMU debugFS, when multiple processes use the IOMMU
debugFS process simultaneously, illegal access issues can occur in the
following execution flow:
1. CPU1: Sets a valid sbdf via devid_write, then checks the sbdf's
validity in execution flows such as devid_show, iommu_devtbl_show,
and iommu_irqtbl_show.
2. CPU2: Sets an invalid sbdf via devid_write, at which point the sbdf
value is -1.
3. CPU1: accesses the IOMMU device table, IRQ table, based on the
invalid SBDF value of -1, resulting in illegal access.
This is especially problematic in monitoring scripts, where multiple
scripts may access debugFS simultaneously, and some scripts may
unexpectedly set invalid values, which triggers illegal access in
debugfs.
This patch modifies the execution flow of devid_show,
iommu_devtbl_show, and iommu_irqtbl_show to ensure that these
processes determine the validity and access based on the
same device-id, thus guaranteeing correctness and robustness.
Signed-off-by: Guanghui Feng <guanghuifeng@linux.alibaba.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
PCIe ATS may be disabled by platform firmware, root complex limitations,
or kernel policy even when a device advertises the ATS capability in its
PCI configuration space.
Add a new IOMMU_CAP_PCI_ATS_SUPPORTED capability to allow IOMMU drivers
to report the effective ATS decision for a device.
When this capability is true for a device, ATS may be enabled for that
device, but it does not imply that ATS is currently enabled.
A subsequent patch will extend iommufd to expose the effective ATS
status to userspace.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Previously, commit 8388f7df936b ("iommu/amd: Do not support
IOMMU_DOMAIN_IDENTITY after SNP is enabled") prevented users from
changing the IOMMU domain to identity if SNP was enabled.
This resulted in an error when writing to sysfs:
# echo "identity" > /sys/kernel/iommu_groups/50/type
-bash: echo: write error: Cannot allocate memory
However, commit 4402f2627d30 ("iommu/amd: Implement global identity
domain") changed the flow of the code, skipping the SNP guard and
allowing users to change the IOMMU domain to identity after a machine
has booted.
Once the user does that, they will probably try to bind and the
device/driver will start to do DMA which will trigger errors:
iommu ivhd3: AMD-Vi: Event logged [ILLEGAL_DEV_TABLE_ENTRY device=0000:43:00.0 pasid=0x00000 address=0x3737b01000 flags=0x0020]
iommu ivhd3: AMD-Vi: Control Reg : 0xc22000142148d
AMD-Vi: DTE[0]: 6000000000000003
AMD-Vi: DTE[1]: 0000000000000001
AMD-Vi: DTE[2]: 2000003088b3e013
AMD-Vi: DTE[3]: 0000000000000000
bnxt_en 0000:43:00.0 (unnamed net_device) (uninitialized): Error (timeout: 500015) msg {0x0 0x0} len:0
iommu ivhd3: AMD-Vi: Event logged [ILLEGAL_DEV_TABLE_ENTRY device=0000:43:00.0 pasid=0x00000 address=0x3737b01000 flags=0x0020]
iommu ivhd3: AMD-Vi: Control Reg : 0xc22000142148d
AMD-Vi: DTE[0]: 6000000000000003
AMD-Vi: DTE[1]: 0000000000000001
AMD-Vi: DTE[2]: 2000003088b3e013
AMD-Vi: DTE[3]: 0000000000000000
bnxt_en 0000:43:00.0: probe with driver bnxt_en failed with error -16
To prevent this from happening, create an attach wrapper for
identity_domain_ops which returns EINVAL if amd_iommu_snp_en is true.
With this commit applied:
# echo "identity" > /sys/kernel/iommu_groups/62/type
-bash: echo: write error: Invalid argument
Fixes: 4402f2627d30 ("iommu/amd: Implement global identity domain")
Signed-off-by: Joe Damato <joe@dama.to>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Currently, PPR Log and GA logs for AMD IOMMU are allocated using
iommu_alloc_pages_sz(), which does not account for NUMA affinity. This can
lead to remote memory access latencies if the memory is allocated on a
different node than the IOMMU hardware.
Switch to iommu_alloc_pages_node_sz() to ensure that these data structures
are allocated on the same NUMA node as the IOMMU device. If the node
information is unavailable, it defaults to NUMA_NO_NODE.
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Ankit Soni <Ankit.Soni@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Conversion performed via this Coccinelle script:
// SPDX-License-Identifier: GPL-2.0-only
// Options: --include-headers-for-types --all-includes --include-headers --keep-comments
virtual patch
@gfp depends on patch && !(file in "tools") && !(file in "samples")@
identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex,
kzalloc_obj,kzalloc_objs,kzalloc_flex,
kvmalloc_obj,kvmalloc_objs,kvmalloc_flex,
kvzalloc_obj,kvzalloc_objs,kvzalloc_flex};
@@
ALLOC(...
- , GFP_KERNEL
)
$ make coccicheck MODE=patch COCCI=gfp.cocci
Build and boot tested x86_64 with Fedora 42's GCC and Clang:
Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Joerg Roedel:
"Core changes:
- Rust bindings for IO-pgtable code
- IOMMU page allocation debugging support
- Disable ATS during PCI resets
Intel VT-d changes:
- Skip dev-iotlb flush for inaccessible PCIe device
- Flush cache for PASID table before using it
- Use right invalidation method for SVA and NESTED domains
- Ensure atomicity in context and PASID entry updates
AMD-Vi changes:
- Support for nested translations
- Other minor improvements
ARM-SMMU-v2 changes:
- Configure SoC-specific prefetcher settings for Qualcomm's "MDSS"
ARM-SMMU-v3 changes:
- Improve CMDQ locking fairness for pathetically small queue sizes
- Remove tracking of the IAS as this is only relevant for AArch32 and
was causing C_BAD_STE errors
- Add device-tree support for NVIDIA's CMDQV extension
- Allow some hitless transitions for the 'MEV' and 'EATS' STE fields
- Don't disable ATS for nested S1-bypass nested domains
- Additions to the kunit selftests"
* tag 'iommu-updates-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (54 commits)
iommupt: Always add IOVA range to iotlb_gather in gather_range_pages()
iommu/amd: serialize sequence allocation under concurrent TLB invalidations
iommu/amd: Fix type of type parameter to amd_iommufd_hw_info()
iommu/arm-smmu-v3: Do not set disable_ats unless vSTE is Translate
iommu/arm-smmu-v3-test: Add nested s1bypass/s1dssbypass coverage
iommu/arm-smmu-v3: Mark EATS_TRANS safe when computing the update sequence
iommu/arm-smmu-v3: Mark STE MEV safe when computing the update sequence
iommu/arm-smmu-v3: Add update_safe bits to fix STE update sequence
iommu/arm-smmu-v3: Add device-tree support for CMDQV driver
iommu/tegra241-cmdqv: Decouple driver from ACPI
iommu/arm-smmu-qcom: Restore ACTLR settings for MDSS on sa8775p
iommu/vt-d: Fix race condition during PASID entry replacement
iommu/vt-d: Clear Present bit before tearing down context entry
iommu/vt-d: Clear Present bit before tearing down PASID entry
iommu/vt-d: Flush piotlb for SVM and Nested domain
iommu/vt-d: Flush cache for PASID table before using it
iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode
iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode
rust: iommu: fix `srctree` link warning
rust: iommu: fix Rust formatting
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq cleanups from Thomas Gleixner:
"A series of treewide cleanups to ensure interrupt request consistency.
- Add the missing IRQF_COND_ONESHOT flag to devm_request_irq()
This is inconsistent vs request_irq() and causes the same issues
which where addressed with the introduction of this flag
- Cleanup IRQF_ONESHOT and IRQF_NO_THREAD usage
Quite some drivers have inconsistent interrupt request flags
related to interrupt threading namely IRQF_ONESHOT and
IRQF_NO_THREAD. This leads to warnings and/or malfunction when
forced interrupt threading is enabled.
- Remove stub primary (hard interrupt) handlers
A bunch of drivers implement a stub primary (hard interrupt)
handler which just returns IRQ_WAKE_THREAD. The same functionality
is provided by the core code when the primary handler argument of
request_thread_irq() is set to NULL"
* tag 'irq-cleanups-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
media: pci: mg4b: Use IRQF_NO_THREAD
mfd: wm8350-core: Use IRQF_ONESHOT
thermal/qcom/lmh: Replace IRQF_ONESHOT with IRQF_NO_THREAD
rtc: amlogic-a4: Remove IRQF_ONESHOT
usb: typec: fusb302: Remove IRQF_ONESHOT
EDAC/altera: Remove IRQF_ONESHOT
char: tpm: cr50: Remove IRQF_ONESHOT
ARM: versatile: Remove IRQF_ONESHOT
scsi: efct: Use IRQF_ONESHOT and default primary handler
Bluetooth: btintel_pcie: Use IRQF_ONESHOT and default primary handler
bus: fsl-mc: Use default primary handler
mailbox: bcm-ferxrm-mailbox: Use default primary handler
iommu/amd: Use core's primary handler and set IRQF_ONESHOT
platform/x86: int0002: Remove IRQF_ONESHOT from request_irq()
genirq: Set IRQF_COND_ONESHOT in devm_request_irq().
|
|
'core' into next
|
|
With concurrent TLB invalidations, completion wait randomly gets timed out
because cmd_sem_val was incremented outside the IOMMU spinlock, allowing
CMD_COMPL_WAIT commands to be queued out of sequence and breaking the
ordering assumption in wait_on_sem().
Move the cmd_sem_val increment under iommu->lock so completion sequence
allocation is serialized with command queuing.
And remove the unnecessary return.
Fixes: d2a0cac10597 ("iommu/amd: move wait_on_sem() out of spinlock")
Tested-by: Srikanth Aithal <sraithal@amd.com>
Reported-by: Srikanth Aithal <sraithal@amd.com>
Signed-off-by: Ankit Soni <Ankit.Soni@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
request_threaded_irq() is invoked with a primary and a secondary handler
and no flags are passed. The primary handler is the same as
irq_default_primary_handler() so there is no need to have an identical
copy.
The lack of the IRQF_ONESHOT can be dangerous because the interrupt
source is not masked while the threaded handler is active. This means,
especially on LEVEL typed interrupt lines, the interrupt can fire again
before the threaded handler had a chance to run.
Use the default primary interrupt handler by specifying NULL and set
IRQF_ONESHOT so the interrupt source is masked until the secondary
handler is done.
Fixes: 72fe00f01f9a3 ("x86/amd-iommu: Use threaded interupt handler")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260128095540.863589-4-bigeasy@linutronix.de
|
|
When building with -Wincompatible-function-pointer-types-strict, a
warning designed to catch kernel control flow integrity (kCFI) issues at
build time, there is an instance around amd_iommufd_hw_info():
drivers/iommu/amd/iommu.c:3141:13: error: incompatible function pointer types initializing 'void *(*)(struct device *, u32 *, enum iommu_hw_info_type *)' (aka 'void *(*)(struct device *, unsigned int *, enum iommu_hw_info_type *)') with an expression of type 'void *(struct device *, u32 *, u32 *)' (aka 'void *(struct device *, unsigned int *, unsigned int *)') [-Werror,-Wincompatible-function-pointer-types-strict]
3141 | .hw_info = amd_iommufd_hw_info,
| ^~~~~~~~~~~~~~~~~~~
While 'u32 *' and 'enum iommu_hw_info_type *' are ABI compatible, hence
no regular warning from -Wincompatible-function-pointer-types, the
mismatch will trigger a kCFI violation when amd_iommufd_hw_info() is
called indirectly.
Update the type parameter of amd_iommufd_hw_info() to be
'enum iommu_hw_info_type *' to match the prototype in
'struct iommu_ops', clearing up the warning and kCFI violation.
Fixes: 7d8b06ecc45b ("iommu/amd: Add support for hw_info for iommu capability query")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
This fixes warning reported by 0-DAY CI Kernel Test Service.
Fixes: 757d2b1fdf5b ("iommu/amd: Introduce gDomID-to-hDomID Mapping and handle parent domain invalidation")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202601190634.bl7Mjx5Q-lkp@intel.com/
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Currently, the error path of amd_iommu_probe_device() unconditionally
references dev_data, which may not be initialized if an early failure
occurs (like iommu_init_device() fails).
Move the out_err label to ensure the function exits immediately on
failure without accessing potentially uninitialized dev_data.
Fixes: 19e5cc156cb ("iommu/amd: Enable support for up to 2K interrupts per function")
Cc: Rakuram Eswaran <rakuram.e96@gmail.com>
Cc: Jörg Rödel <joro@8bytes.org>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202512191724.meqJENXe-lkp@intel.com/
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Introduce set_dte_nested() to program guest translation settings in
the host DTE when attaches the nested domain to a device.
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Introduce the amd_iommu_set_dte_v1() helper function to configure
IOMMU host (v1) page table into DTE. This will be used later
when attaching nested doamin.
Also, remove obsolete warning when SNP is enabled and domain id
is zero since this check is no longer applicable.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
amd_iommu_make_clear_dte()
To help avoid duplicate logic when programing DTE for nested translation.
Note that this commit changes behavior of when the IOMMU driver is
switching domain during attach and the blocking domain, where DTE bit
fields for interrupt pass-through (i.e. Lint0, Lint1, NMI, INIT, ExtInt)
and System management message could be affected. These DTE bits are
specified in the IVRS table for specific devices, and should be persistent.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
invalidation
Each nested domain is assigned guest domain ID (gDomID), which guest OS
programs into guest Device Table Entry (gDTE). For each gDomID, the driver
assigns a corresponding host domain ID (hDomID), which will be programmed
into the host Device Table Entry (hDTE).
The hDomID is allocated during amd_iommu_alloc_domain_nested(),
and free during nested_domain_free(). The gDomID-to-hDomID mapping info
(struct guest_domain_mapping_info) is stored in a per-viommu xarray
(struct amd_iommu_viommu.gdomid_array), which is indexed by gDomID.
Note also that parent domain can be shared among struct iommufd_viommu.
Therefore, when hypervisor invalidates the nest parent domain, the AMD
IOMMU command INVALIDATE_IOMMU_PAGES must be issued for each hDomID in
the gdomid_array. This is handled by the iommu_flush_pages_v1_hdom_ids(),
where it iterates through struct protection_domain.viommu_list.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
The nested domain is allocated with IOMMU_DOMAIN_NESTED type to store
stage-1 translation (i.e. GVA->GPA). This includes the GCR3 root pointer
table along with guest page tables. The struct iommu_hwpt_amd_guest
contains this information, and is passed from user-space as a parameter
of the struct iommu_ops.domain_alloc_nested().
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Which stores reference to nested parent domain assigned during the call to
struct iommu_ops.viommu_init(). Information in the nest parent is needed
when setting up the nested translation.
Note that the viommu initialization will be introduced in subsequent
commit.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
To support nested translation, the nest parent domain is allocated with
IOMMU_HWPT_ALLOC_NEST_PARENT flag, and stores information of the v1 page
table for stage 2 (i.e. GPA->SPA).
Also, only support nest parent domain on AMD system, which can support
the Guest CR3 Table (GCR3TRPMode) feature. This feature is required in
order to program DTE[GCR3 Table Root Pointer] with the GPA.
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
The GCR3TRPMode feature allows the DTE[GCR3TRP] field to be configured
with GPA (instead of SPA). This simplifies the implementation, and is
a pre-requisite for nested translation support.
Therefore, always enable this feature if available.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Which includes DTE update, clone_aliases, DTE flush and completion-wait
commands to avoid code duplication when reuse to setup DTE for nested
translation.
Also, make amd_iommu_update_dte() non-static to reuse in
in a new nested.c file for nested translation.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
This will be reused in a new nested.c file for nested translation.
Also, remove unused function parameter ptr.
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Also change the define to use GENMASK_ULL instead.
There is no functional change.
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
AMD IOMMU Extended Feature (EFR) and Extended Feature 2 (EFR2) registers
specify features supported by each IOMMU hardware instance.
The IOMMU driver checks each feature-specific bits before enabling
each feature at run time.
For IOMMUFD, the hypervisor passes the raw value of amd_iommu_efr and
amd_iommu_efr2 to VMM via iommufd IOMMU_DEVICE_GET_HW_INFO ioctl.
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
alloc_irq_table() contains a conditional check for a NULL iommu pointer
when computing the NUMA node, but the function dereferences iommu
in multiple places afterwards.
All callers ensure that a valid iommu pointer is passed in, and a NULL
iommu is not expected by the current callers. Remove the incorrect
NULL check to make the assumptions consistent and address the Smatch
warning.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202512191724.meqJENXe-lkp@intel.com/
Signed-off-by: Rakuram Eswaran <rakuram.e96@gmail.com>
Reviewed-by: Ankit Soni <Ankit.Soni@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
With iommu.strict=1, the existing completion wait path can cause soft
lockups under stressed environment, as wait_on_sem() busy-waits under the
spinlock with interrupts disabled.
Move the completion wait in iommu_completion_wait() out of the spinlock.
wait_on_sem() only polls the hardware-updated cmd_sem and does not require
iommu->lock, so holding the lock during the busy wait unnecessarily
increases contention and extends the time with interrupts disabled.
Signed-off-by: Ankit Soni <Ankit.Soni@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
So that both iommu.c and init.c can utilize them. Also define a new
function 'pdom_id_destroy()' to destroy 'pdom_ids' instead of directly
calling ida functions.
Signed-off-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Currently AMD IOMMU driver does not reserve domain ids programmed in the
DTE while reusing the device table inside kdump kernel. This can cause
reallocation of these domain ids for newer domains that are created by
the kdump kernel, which can lead to potential IO_PAGE_FAULTs
Hence reserve these ids inside pdom_ids.
Fixes: 38e5f33ee359 ("iommu/amd: Reuse device table for kdump")
Signed-off-by: Sairaj Kodilkar <sarunkod@amd.com>
Reported-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/devsec/tsm
Pull PCIe Link Encryption and Device Authentication from Dan Williams:
"New PCI infrastructure and one architecture implementation for PCIe
link encryption establishment via platform firmware services.
This work is the result of multiple vendors coming to consensus on
some core infrastructure (thanks Alexey, Yilun, and Aneesh!), and
three vendor implementations, although only one is included in this
pull. The PCI core changes have an ack from Bjorn, the crypto/ccp/
changes have an ack from Tom, and the iommu/amd/ changes have an ack
from Joerg.
PCIe link encryption is made possible by the soup of acronyms
mentioned in the shortlog below. Link Integrity and Data Encryption
(IDE) is a protocol for installing keys in the transmitter and
receiver at each end of a link. That protocol is transported over Data
Object Exchange (DOE) mailboxes using PCI configuration requests.
The aspect that makes this a "platform firmware service" is that the
key provisioning and protocol is coordinated through a Trusted
Execution Envrionment (TEE) Security Manager (TSM). That is either
firmware running in a coprocessor (AMD SEV-TIO), or quasi-hypervisor
software (Intel TDX Connect / ARM CCA) running in a protected CPU
mode.
Now, the only reason to ask a TSM to run this protocol and install the
keys rather than have a Linux driver do the same is so that later, a
confidential VM can ask the TSM directly "can you certify this
device?".
That precludes host Linux from provisioning its own keys, because host
Linux is outside the trust domain for the VM. It also turns out that
all architectures, save for one, do not publish a mechanism for an OS
to establish keys in the root port. So "TSM-established link
encryption" is the only cross-architecture path for this capability
for the foreseeable future.
This unblocks the other arch implementations to follow in v6.20/v7.0,
once they clear some other dependencies, and it unblocks the next
phase of work to implement the end-to-end flow of confidential device
assignment. The PCIe specification calls this end-to-end flow Trusted
Execution Environment (TEE) Device Interface Security Protocol
(TDISP).
In the meantime, Linux gets a link encryption facility which has
practical benefits along the same lines as memory encryption. It
authenticates devices via certificates and may protect against
interposer attacks trying to capture clear-text PCIe traffic.
Summary:
- Introduce the PCI/TSM core for the coordination of device
authentication, link encryption and establishment (IDE), and later
management of the device security operational states (TDISP).
Notify the new TSM core layer of PCI device arrival and departure
- Add a low level TSM driver for the link encryption establishment
capabilities of the AMD SEV-TIO architecture
- Add a library of helpers TSM drivers to use for IDE establishment
and the DOE transport
- Add skeleton support for 'bind' and 'guest_request' operations in
support of TDISP"
* tag 'tsm-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/devsec/tsm: (23 commits)
crypto/ccp: Fix CONFIG_PCI=n build
virt: Fix Kconfig warning when selecting TSM without VIRT_DRIVERS
crypto/ccp: Implement SEV-TIO PCIe IDE (phase1)
iommu/amd: Report SEV-TIO support
psp-sev: Assign numbers to all status codes and add new
ccp: Make snp_reclaim_pages and __sev_do_cmd_locked public
PCI/TSM: Add 'dsm' and 'bound' attributes for dependent functions
PCI/TSM: Add pci_tsm_guest_req() for managing TDIs
PCI/TSM: Add pci_tsm_bind() helper for instantiating TDIs
PCI/IDE: Initialize an ID for all IDE streams
PCI/IDE: Add Address Association Register setup for downstream MMIO
resource: Introduce resource_assigned() for discerning active resources
PCI/TSM: Drop stub for pci_tsm_doe_transfer()
drivers/virt: Drop VIRT_DRIVERS build dependency
PCI/TSM: Report active IDE streams
PCI/IDE: Report available IDE streams
PCI/IDE: Add IDE establishment helpers
PCI: Establish document for PCI host bridge sysfs attributes
PCI: Add PCIe Device 3 Extended Capability enumeration
PCI/TSM: Establish Secure Sessions and Link Encryption
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC driver updates from Arnd Bergmann:
"This is the first half of the driver changes:
- A treewide interface change to the "syscore" operations for power
management, as a preparation for future Tegra specific changes
- Reset controller updates with added drivers for LAN969x, eic770 and
RZ/G3S SoCs
- Protection of system controller registers on Renesas and Google
SoCs, to prevent trivially triggering a system crash from e.g.
debugfs access
- soc_device identification updates on Nvidia, Exynos and Mediatek
- debugfs support in the ST STM32 firewall driver
- Minor updates for SoC drivers on AMD/Xilinx, Renesas, Allwinner, TI
- Cleanups for memory controller support on Nvidia and Renesas"
* tag 'soc-drivers-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (114 commits)
memory: tegra186-emc: Fix missing put_bpmp
Documentation: reset: Remove reset_controller_add_lookup()
reset: fix BIT macro reference
reset: rzg2l-usbphy-ctrl: Fix a NULL vs IS_ERR() bug in probe
reset: th1520: Support reset controllers in more subsystems
reset: th1520: Prepare for supporting multiple controllers
dt-bindings: reset: thead,th1520-reset: Add controllers for more subsys
dt-bindings: reset: thead,th1520-reset: Remove non-VO-subsystem resets
reset: remove legacy reset lookup code
clk: davinci: psc: drop unused reset lookup
reset: rzg2l-usbphy-ctrl: Add support for RZ/G3S SoC
reset: rzg2l-usbphy-ctrl: Add support for USB PWRRDY
dt-bindings: reset: renesas,rzg2l-usbphy-ctrl: Document RZ/G3S support
reset: eswin: Add eic7700 reset driver
dt-bindings: reset: eswin: Documentation for eic7700 SoC
reset: sparx5: add LAN969x support
dt-bindings: reset: microchip: Add LAN969x support
soc: rockchip: grf: Add select correct PWM implementation on RK3368
soc/tegra: pmc: Add USB wake events for Tegra234
amba: tegra-ahb: Fix device leak on SMMU enable
...
|
|
The SEV-TIO switch in the AMD BIOS is reported to the OS via
the IOMMU Extended Feature 2 register (EFR2), bit 1.
Add helper to parse the bit and report the feature presence.
Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
Link: https://patch.msgid.link/20251202024449.542361-4-aik@amd.com
Acked-by: Joerg Roedel <joerg.roedel@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
'nvidia/tegra', 'intel/vt-d', 'amd/amd-vi' and 'core' into next
|
|
If the IOVA is limited to less than 48 the page table will be constructed
with a 3 level configuration which is unsupported by hardware.
Like the second stage the caller needs to pass in both the top_level an
the vasz to specify a table that has more levels than required to hold the
IOVA range.
Fixes: 6cbc09b7719e ("iommu/vt-d: Restore previous domain::aperture_end calculation")
Reported-by: Calvin Owens <calvin@wbinvd.org>
Closes: https://lore.kernel.org/r/8f257d2651eb8a4358fcbd47b0145002e5f1d638.1764237717.git.calvin@wbinvd.org
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
modify_irte_ga()
The return type of __modify_irte_ga() is int, but modify_irte_ga()
treats it as a bool. Casting the int to bool discards the error code.
To fix the issue, change the type of ret to int in modify_irte_ga().
Fixes: 57cdb720eaa5 ("iommu/amd: Do not flush IRTE when only updating isRun and destination fields")
Cc: stable@vger.kernel.org
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Several drivers can benefit from registering per-instance data along
with the syscore operations. To achieve this, move the modifiable fields
out of the syscore_ops structure and into a separate struct syscore that
can be registered with the framework. Add a void * driver data field for
drivers to store contextual data that will be passed to the syscore ops.
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Signed-off-by: Thierry Reding <treding@nvidia.com>
|
|
Fix a memory leak of struct amd_iommu_pci_segment in alloc_pci_segment()
when system memory (or contiguous memory) is insufficient.
Fixes: 04230c119930 ("iommu/amd: Introduce per PCI segment device table")
Fixes: eda797a27795 ("iommu/amd: Introduce per PCI segment rlookup table")
Fixes: 99fc4ac3d297 ("iommu/amd: Introduce per PCI segment alias_table")
Cc: stable@vger.kernel.org
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Current IOMMU driver prints "Completion-wait Time-out" error message with
insufficient information to further debug the issue.
Enhancing the error message as following:
1. Log IOMMU PCI device ID in the error message.
2. With "amd_iommu_dump=1" kernel command line option, dump entire
command buffer entries including Head and Tail offset.
Dump the entire command buffer only on the first 'Completion-wait Time-out'
to avoid dmesg spam.
Signed-off-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Reviewed-by: Ankit Soni <Ankit.Soni@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|