linux.git/drivers/vfio/pci, branch master

vfio/pci: Expose latched module parameter policy in debugfs

2026-06-29T20:26:31+00:00

The nointxmask and disable_idle_d3 module parameters remain writable,
but vfio-pci now latches their values into each device at init.  Once a
device is registered, changing the module parameter only affects future
devices, leaving no direct way to confirm the effective policy for an
existing device.

Add a pci debugfs directory under the VFIO device debugfs root and
report the per-device nointxmask and disable_idle_d3 values.  These are
read-only debugfs views and use the same Y/N bool output convention as
the module parameters.

Read-only vfio-pci parameters, such as disable_vga, are not exposed here
because they cannot drift from the latched device value, therefore the
existing module parameter exposure via sysfs is sufficient.

Note that while only vfio-pci currently provides these options, the
implementation is in vfio-pci-core and therefore properly reflects the
device policy in the core, regardless of driver.

Assisted-by: OpenAI Codex:gpt-5
Cc: Guixin Liu 
Signed-off-by: Alex Williamson 
Reviewed-by: Kevin Tian 
Link: https://lore.kernel.org/r/20260615191241.688297-7-alex.williamson@nvidia.com
Signed-off-by: Alex Williamson

vfio/pci: Latch all module parameters per device

2026-06-29T19:46:50+00:00

The vfio-pci module parameters of disable_idle_d3, nointxmask, and
disable_vga latch vfio-pci policy into vfio-pci-core globals each time
the vfio-pci module is initialized.  The disable_idle_d3 parameter has
already migrated to a per-device flag in order to provide consistency
for refcounted PM operations for the lifetime of the device
registration.

Pull the remaining vfio-pci module-parameter policy out of vfio-pci-core
into per-device flags set at device initialization.

This also restores the mutable aspect of the disable_idle_d3 and
nointxmask module parameters for vfio-pci, with the caveat that the
parameters are latched into the device at probe.

A notable change for variant drivers is that their devices are no longer
affected by vfio-pci module parameters and those drivers may need to
adopt similar module parameters if any devices have a hidden dependency
on vfio-pci setting non-default policy.

Assisted-by: Claude:claude-opus-4-8
Acked-by: Chengwen Feng 
Signed-off-by: Alex Williamson 
Reviewed-by: Kevin Tian 
Link: https://lore.kernel.org/r/20260615191241.688297-6-alex.williamson@nvidia.com
Signed-off-by: Alex Williamson

vfio/mlx5: Fix racy bitfields and tighten struct layout

2026-06-29T19:46:50+00:00

Bitfield operations are not atomic, they use a read-modify-write
pattern, therefore we should be careful not to pack bitfields that
can be concurrently updated into the same storage unit.

This split takes a binary approach: flags that are only modified
pre/post open/close remain bitfields, flags modified from user
action, including actions that reach across to another device (ex.
reset) use dedicated storage units.

Note mlx5_vhca_page_tracker.status is relocated to fill the alignment
hole this split exposes.

Bitfield justifications:

  migrate_cap: written only in mlx5vf_cmd_set_migratable() at probe
  chunk_mode: written only in mlx5vf_cmd_set_migratable() at probe
  mig_state_cap: written only in mlx5vf_cmd_set_migratable() at probe

Dedicated storage units:

  mdev_detach: written in the VF attach/detach event notifier
               mlx5fv_vf_event() at runtime
  log_active: written in mlx5vf_start_page_tracker()/
              mlx5vf_stop_page_tracker() during runtime dirty tracking
  deferred_reset: written in mlx5vf_state_mutex_unlock()/
                  mlx5vf_pci_aer_reset_done() during runtime reset handling
  is_err: set by tracker error handling and dirty-log polling at runtime
  object_changed: set by tracker event handling and cleared by dirty-log
                  polling at runtime

Fixes: 61a2f1460fd0 ("vfio/mlx5: Manage the VF attach/detach callback from the PF")
Fixes: 79c3cf279926 ("vfio/mlx5: Init QP based resources for dirty tracking")
Fixes: f886473071d6 ("vfio/mlx5: Add support for tracker object change event")
Cc: Yishai Hadas 
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Alex Williamson 
Reviewed-by: Kevin Tian 
Link: https://lore.kernel.org/r/20260615191241.688297-5-alex.williamson@nvidia.com
Signed-off-by: Alex Williamson

vfio/pci: Release the VGA arbiter client on register_device() failure

2026-06-29T19:46:49+00:00

The re-order in the Fixes commit below displaced vfio_pci_vga_init() as
the last failure point of what is now vfio_pci_core_register_device()
without introducing an unwind for the VGA arbiter registration.

In current kernels this is mostly benign because vfio_pci_set_decode()
only uses pci_dev state, but the original failure path could leave a
callback with a freed vdev cookie.  The stale registration also becomes
unsafe again once the callback follows drvdata to the vfio device.

Add the required VGA unwind callout.

Fixes: 4aeec3984ddc ("vfio/pci: Re-order vfio_pci_probe()")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Alex Williamson 
Reviewed-by: Kevin Tian 
Link: https://lore.kernel.org/r/20260615191241.688297-3-alex.williamson@nvidia.com
Signed-off-by: Alex Williamson

vfio/pci: Latch disable_idle_d3 per device

2026-06-29T19:46:49+00:00

When disable_idle_d3 was introduced in vfio-pci, it directly manipulated
the device power state with pci_set_power_state().  There were no
refcounts to maintain or balanced operations, we could unconditionally
bring the device to D0 and conditionally move it to D3hot.  Therefore
the module parameter was made writable.

Later, in commit c61302aa48f7 ("vfio/pci: Move module parameters to
vfio_pci.c"), as part of the vfio-pci-core split, the writable aspect
of the module parameter was nullified.  The parameter value could still
be changed through sysfs, but the vfio-pci driver latched the values
into vfio-pci-core globals at module init.  Loading the vfio-pci module,
or unloading and reloading, with non-default or different values could
change the globals relative to existing devices bound to vfio-pci
variant drivers.

Runtime PM was introduced in commit 7ab5e10eda02 ("vfio/pci: Move the
unused device into low power state with runtime PM"), which marks the
point where power states became refcounted.  PM get and put operations
need to be balanced, but the same module operations noted above can
change the global variables relative to those devices already bound to
vfio-pci variant drivers.  This introduces a window where PM operations
can now become unbalanced.

To resolve this with a narrow footprint for stable backports, the
disable_idle_d3 flag is latched into the vfio_pci_core_device at the
time of initialization, such that the device always operates with a
consistent value.

NB. vfio_pci_dev_set_try_reset() now unconditionally raises the
runtime PM usage count around bus reset to account for disable_idle_d3
becoming a per-device rather than global flag.  When this flag is set,
the additional get/put pair is harmless and allows continued use of the
shared vfio_pci_dev_set_pm_runtime_get() helper.

Fixes: 7ab5e10eda02 ("vfio/pci: Move the unused device into low power state with runtime PM")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Alex Williamson 
Reviewed-by: Kevin Tian 
Link: https://lore.kernel.org/r/20260615191241.688297-2-alex.williamson@nvidia.com
Signed-off-by: Alex Williamson

vfio/qat: fix f_pos race in qat_vf_resume_write()

2026-06-10T20:33:05+00:00

qat_vf_resume_write() checks filp->f_pos before taking migf->lock, but
copies into the migration-state buffer after taking the lock and
re-reading the shared file position.

Two concurrent writers could therefore pass the bounds check with the
old offset, then have the second writer copy after the first advanced
f_pos, writing past the end of the migration-state buffer.

Take migf->lock before doing the boundary checks.

Fixes: bb208810b1ab ("vfio/qat: Add vfio_pci driver for Intel QAT SR-IOV VF devices")
Reviewed-by: Ahsan Atta 
Signed-off-by: Giovanni Cabiddu 
Link: https://lore.kernel.org/r/20260608151317.136613-1-giovanni.cabiddu@intel.com
Signed-off-by: Alex Williamson

vfio/nvgrace-gpu: Add Blackwell-Next GPU readiness check via CXL DVSEC

2026-06-05T16:43:32+00:00

Add a CXL DVSEC-based readiness check for Blackwell-Next GPUs alongside
the existing legacy BAR0 polling path. The CXL Device DVSEC offset is
discovered at probe time. Probe, fault and read/write paths then branch
on that to use either the legacy BAR0 polling or the CXL DVSEC polling.

The CXL path polls Memory_Active, requiring MEM_INFO_VALID within 1s and
MEM_ACTIVE within Memory_Active_Timeout (up to 256s) as per CXL spec r4.0
sec 8.1.3.8.2. Given the long worst-case wait, the CXL poll runs outside
memory_lock with only a quick readiness check is done under the lock.

The poll loops sleep with schedule_timeout_killable() and return -EINTR
on a fatal signal. This avoids hung-task panics during the long
uninterruptible wait. Extend this to the legacy based wait as well for
improvement.

In the fault handler the wait runs locklessly before memory_lock. If a
reset races in, the in-lock recheck returns -EAGAIN and the wait is
retried rather than returning a spurious VM_FAULT_SIGBUS.

Add PCI_DVSEC_CXL_MEM_ACTIVE_TIMEOUT to pci_regs.h for the timeout field.

Cc: Ilpo Järvinen 
Cc: Kevin Tian 
Suggested-by: Alex Williamson 
Signed-off-by: Ankit Agrawal 
Reviewed-by: Kevin Tian 
Link: https://lore.kernel.org/r/20260602063015.3915-1-ankita@nvidia.com
Signed-off-by: Alex Williamson

vfio/pci: Use a private flag to prevent power state change with VFs

2026-05-22T15:14:16+00:00

The current implementation uses pci_num_vf() while holding the
memory_lock to prevent changing the power state of a PF when
VFs are enabled. This creates a lockdep circular dependency
warning because memory_lock is held during device probing.

[  286.997167] ======================================================
[  287.003363] WARNING: possible circular locking dependency detected
[  287.009562] 7.0.0-dbg-DEV #3 Tainted: G S
[  287.015074] ------------------------------------------------------
[  287.021270] vfio_pci_sriov_/18636 is trying to acquire lock:
[  287.026942] ff45bea2294d4968 (&vdev->memory_lock){+.+.}-{4:4}, at:
vfio_pci_core_runtime_resume+0x1f/0xa0
[  287.036530]
[  287.036530] but task is already holding lock:
[  287.042383] ff45bea3a96b8230 (&new_dev_set->lock){+.+.}-{4:4}, at:
vfio_group_fops_unl_ioctl+0x44d/0x7b0
[  287.051879]
[  287.051879] which lock already depends on the new lock.
[  287.051879]
[  287.060070]
[  287.060070] the existing dependency chain (in reverse order) is:
[  287.067568]
[  287.067568] -> #2 (&new_dev_set->lock){+.+.}-{4:4}:
[  287.073941]        __mutex_lock+0x92/0xb80
[  287.078058]        vfio_assign_device_set+0x66/0x1b0
[  287.083042]        vfio_pci_core_register_device+0xd1/0x2a0
[  287.088638]        vfio_pci_probe+0xd2/0x100
[  287.092933]        local_pci_probe_callback+0x4d/0xa0
[  287.098001]        process_scheduled_works+0x2ca/0x680
[  287.103158]        worker_thread+0x1e8/0x2f0
[  287.107452]        kthread+0x10c/0x140
[  287.111230]        ret_from_fork+0x18e/0x360
[  287.115519]        ret_from_fork_asm+0x1a/0x30
[  287.119983]
[  287.119983] -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
[  287.127219]        __flush_work+0x345/0x490
[  287.131429]        pci_device_probe+0x2e3/0x490
[  287.135979]        really_probe+0x1f9/0x4e0
[  287.140180]        __driver_probe_device+0x77/0x100
[  287.145079]        driver_probe_device+0x1e/0x110
[  287.149803]        __device_attach_driver+0xe3/0x170
[  287.154789]        bus_for_each_drv+0x125/0x150
[  287.159346]        __device_attach+0xca/0x1a0
[  287.163720]        device_initial_probe+0x34/0x50
[  287.168445]        pci_bus_add_device+0x6e/0x90
[  287.172995]        pci_iov_add_virtfn+0x3c9/0x3e0
[  287.177719]        sriov_add_vfs+0x2c/0x60
[  287.181838]        sriov_enable+0x306/0x4a0
[  287.186038]        vfio_pci_core_sriov_configure+0x184/0x220
[  287.191715]        sriov_numvfs_store+0xd9/0x1c0
[  287.196351]        kernfs_fop_write_iter+0x13f/0x1d0
[  287.201338]        vfs_write+0x2be/0x3b0
[  287.205286]        ksys_write+0x73/0x100
[  287.209233]        do_syscall_64+0x14d/0x750
[  287.213529]        entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  287.219120]
[  287.219120] -> #0 (&vdev->memory_lock){+.+.}-{4:4}:
[  287.225491]        __lock_acquire+0x14c6/0x2800
[  287.230048]        lock_acquire+0xd3/0x2f0
[  287.234168]        down_write+0x3a/0xc0
[  287.238019]        vfio_pci_core_runtime_resume+0x1f/0xa0
[  287.243436]        __rpm_callback+0x8c/0x310
[  287.247730]        rpm_resume+0x529/0x6f0
[  287.251765]        __pm_runtime_resume+0x68/0x90
[  287.256402]        vfio_pci_core_enable+0x44/0x310
[  287.261216]        vfio_pci_open_device+0x1c/0x80
[  287.265947]        vfio_df_open+0x10f/0x150
[  287.270148]        vfio_group_fops_unl_ioctl+0x4a4/0x7b0
[  287.275476]        __se_sys_ioctl+0x71/0xc0
[  287.279679]        do_syscall_64+0x14d/0x750
[  287.283975]        entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  287.289559]
[  287.289559] other info that might help us debug this:
[  287.289559]
[  287.297582] Chain exists of:
[  287.297582]   &vdev->memory_lock --> (work_completion)(&arg.work)
--> &new_dev_set->lock
[  287.297582]
[  287.310023]  Possible unsafe locking scenario:
[  287.310023]
[  287.315961]        CPU0                    CPU1
[  287.320510]        ----                    ----
[  287.325059]   lock(&new_dev_set->lock);
[  287.328917]
lock((work_completion)(&arg.work));
[  287.336153]                                lock(&new_dev_set->lock);
[  287.342523]   lock(&vdev->memory_lock);
[  287.346382]
[  287.346382]  *** DEADLOCK ***
[  287.346382]
[  287.352315] 2 locks held by vfio_pci_sriov_/18636:
[  287.357125]  #0: ff45bea208ed3e18 (&group->group_lock){+.+.}-{4:4},
at: vfio_group_fops_unl_ioctl+0x3e3/0x7b0
[  287.367048]  #1: ff45bea3a96b8230 (&new_dev_set->lock){+.+.}-{4:4},
at: vfio_group_fops_unl_ioctl+0x44d/0x7b0
[  287.376976]
[  287.376976] stack backtrace:
[  287.381353] CPU: 191 UID: 0 PID: 18636 Comm: vfio_pci_sriov_
Tainted: G S                  7.0.0-dbg-DEV #3 PREEMPTLAZY
[  287.381355] Tainted: [S]=CPU_OUT_OF_SPEC
[  287.381356] Call Trace:
[  287.381357]  
[  287.381358]  dump_stack_lvl+0x54/0x70
[  287.381361]  print_circular_bug+0x2e1/0x300
[  287.381363]  check_noncircular+0xf9/0x120
[  287.381364]  ? __lock_acquire+0x5b4/0x2800
[  287.381366]  __lock_acquire+0x14c6/0x2800
[  287.381368]  ? pci_mmcfg_read+0x4f/0x220
[  287.381370]  ? pci_mmcfg_write+0x57/0x220
[  287.381371]  ? lock_acquire+0xd3/0x2f0
[  287.381373]  ? pci_mmcfg_write+0x57/0x220
[  287.381374]  ? lock_release+0xef/0x360
[  287.381376]  ? vfio_pci_core_runtime_resume+0x1f/0xa0
[  287.381377]  lock_acquire+0xd3/0x2f0
[  287.381378]  ? vfio_pci_core_runtime_resume+0x1f/0xa0
[  287.381379]  ? lock_is_held_type+0x76/0x100
[  287.381382]  down_write+0x3a/0xc0
[  287.381382]  ? vfio_pci_core_runtime_resume+0x1f/0xa0
[  287.381383]  vfio_pci_core_runtime_resume+0x1f/0xa0
[  287.381384]  ? __pfx_pci_pm_runtime_resume+0x10/0x10
[  287.381385]  __rpm_callback+0x8c/0x310
[  287.381386]  ? ktime_get_mono_fast_ns+0x3d/0xb0
[  287.381389]  ? __pfx_pci_pm_runtime_resume+0x10/0x10
[  287.381390]  rpm_resume+0x529/0x6f0
[  287.381392]  ? lock_is_held_type+0x76/0x100
[  287.381394]  __pm_runtime_resume+0x68/0x90
[  287.381396]  vfio_pci_core_enable+0x44/0x310
[  287.381398]  vfio_pci_open_device+0x1c/0x80
[  287.381399]  vfio_df_open+0x10f/0x150
[  287.381401]  vfio_group_fops_unl_ioctl+0x4a4/0x7b0
[  287.381402]  __se_sys_ioctl+0x71/0xc0
[  287.381404]  do_syscall_64+0x14d/0x750
[  287.381405]  ? entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  287.381406]  ? trace_irq_disable+0x25/0xd0
[  287.381409]  entry_SYSCALL_64_after_hwframe+0x77/0x7f

Introduce a private flag 'sriov_active' in the vfio_pci_core_device
struct. This  allows the driver to track the SR-IOV power state requirement
without  relying on pci_num_vf() while holding the memory_lock. The lock is
now  only held to set the flag and ensure the device is in D0, after which
pci_enable_sriov() can be called without the lock.

Fixes: f4162eb1e2fc ("vfio/pci: Change the PF power state to D0 before enabling VFs")
Cc: stable@vger.kernel.org
Suggested-by: Jason Gunthorpe 
Suggested-by: Alex Williamson 
Signed-off-by: Raghavendra Rao Ananta 
Link: https://lore.kernel.org/r/20260514173449.3282188-1-rananta@google.com
[promote bitfield to plain bool to avoid storage-unit races]
Signed-off-by: Alex Williamson

vfio/xe: avoid duplicate reset in xe_vfio_pci_reset_done

2026-05-20T20:52:21+00:00

xe_vfio_pci_reset_done() sets deferred_reset and, when it manages to
acquire state_mutex itself, hands the cleanup off to
xe_vfio_pci_state_mutex_unlock().

That helper already clears deferred_reset and runs xe_vfio_pci_reset()
before dropping the mutex. Calling xe_vfio_pci_reset() again right
afterwards repeats the reset handling unnecessarily.

Fixes: 1f5556ec8b9e ("vfio/xe: Add device specific vfio_pci driver variant for Intel graphics")
Signed-off-by: GuoHan Zhao 
Reviewed-by: Kevin Tian 
Acked-by: Michał Winiarski 
Link: https://lore.kernel.org/r/20260427012128.117051-1-zhaoguohan@kylinos.cn
Signed-off-by: Alex Williamson

hisi_acc_vfio_pci: simplify the command for reading device information

2026-05-20T20:52:09+00:00

The mailbox operation for the Hisi accelerator device now provides a
new read function that supports direct information retrieval by
specifying commands, thereby simplifying the related mailbox command
handling in the driver.

Signed-off-by: Weili Qian 
Signed-off-by: Longfang Liu 
Link: https://lore.kernel.org/r/20260514092026.2018844-1-liulongfang@huawei.com
Signed-off-by: Alex Williamson