linux-stable.git/drivers/vfio, branch v4.9.26

vfio/spapr: Postpone default window creation

2017-03-22T11:43:38+00:00

[ Upstream commit d9c728949ddc9de5734bf3b12ea906ca8a77f2a0 ]

We are going to allow the userspace to configure container in
one memory context and pass container fd to another so
we are postponing memory allocations accounted against
the locked memory limit. One of previous patches took care of
it_userspace.

At the moment we create the default DMA window when the first group is
attached to a container; this is done for the userspace which is not
DDW-aware but familiar with the SPAPR TCE IOMMU v2 in the part of memory
pre-registration - such client expects the default DMA window to exist.

This postpones the default DMA window allocation till one of
the folliwing happens:
1. first map/unmap request arrives;
2. new window is requested;
This adds noop for the case when the userspace requested removal
of the default window which has not been created yet.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
Acked-by: Alex Williamson 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

vfio/spapr: Add a helper to create default DMA window

2017-03-22T11:43:37+00:00

[ Upstream commit 6f01cc692a16405235d5c34056455b182682123c ]

There is already a helper to create a DMA window which does allocate
a table and programs it to the IOMMU group. However
tce_iommu_take_ownership_ddw() did not use it and did these 2 calls
itself to simplify error path.

Since we are going to delay the default window creation till
the default window is accessed/removed or new window is added,
we need a helper to create a default window from all these cases.

This adds tce_iommu_create_default_window(). Since it relies on
a VFIO container to have at least one IOMMU group (for future use),
this changes tce_iommu_attach_group() to add a group to the container
first and then call the new helper.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
Acked-by: Alex Williamson 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown

2017-03-22T11:43:37+00:00

[ Upstream commit 4b6fad7097f883335b6d9627c883cb7f276d94c9 ]

At the moment the userspace tool is expected to request pinning of
the entire guest RAM when VFIO IOMMU SPAPR v2 driver is present.
When the userspace process finishes, all the pinned pages need to
be put; this is done as a part of the userspace memory context (MM)
destruction which happens on the very last mmdrop().

This approach has a problem that a MM of the userspace process
may live longer than the userspace process itself as kernel threads
use userspace process MMs which was runnning on a CPU where
the kernel thread was scheduled to. If this happened, the MM remains
referenced until this exact kernel thread wakes up again
and releases the very last reference to the MM, on an idle system this
can take even hours.

This moves preregistered regions tracking from MM to VFIO; insteads of
using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
added so each container releases regions which it has pre-registered.

This changes the userspace interface to return EBUSY if a memory
region is already registered in a container. However it should not
have any practical effect as the only userspace tool available now
does register memory region once per container anyway.

As tce_iommu_register_pages/tce_iommu_unregister_pages are called
under container->lock, this does not need additional locking.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: Nicholas Piggin 
Acked-by: Alex Williamson 
Reviewed-by: David Gibson 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

vfio/spapr: Reference mm in tce_container

2017-03-22T11:43:37+00:00

[ Upstream commit bc82d122ae4a0e9f971f13403995898fcfa0c09e ]

In some situations the userspace memory context may live longer than
the userspace process itself so if we need to do proper memory context
cleanup, we better have tce_container take a reference to mm_struct and
use it later when the process is gone (@current or @current->mm is NULL).

This references mm and stores the pointer in the container; this is done
in a new helper - tce_iommu_mm_set() - when one of the following happens:
- a container is enabled (IOMMU v1);
- a first attempt to pre-register memory is made (IOMMU v2);
- a DMA window is created (IOMMU v2).
The @mm stays referenced till the container is destroyed.

This replaces current->mm with container->mm everywhere except debug
prints.

This adds a check that current->mm is the same as the one stored in
the container to prevent userspace from making changes to a memory
context of other processes.

DMA map/unmap ioctls() do not check for @mm as they already check
for @enabled which is set after tce_iommu_mm_set() is called.

This does not reference a task as multiple threads within the same mm
are allowed to ioctl() to vfio and supposedly they will have same limits
and capabilities and if they do not, we'll just fail with no harm made.

Signed-off-by: Alexey Kardashevskiy 
Acked-by: Alex Williamson 
Reviewed-by: David Gibson 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

powerpc/iommu: Stop using @current in mm_iommu_xxx

2017-03-22T11:43:37+00:00

[ Upstream commit d7baee6901b34c4895eb78efdbf13a49079d7404 ]

This changes mm_iommu_xxx helpers to take mm_struct as a parameter
instead of getting it from @current which in some situations may
not have a valid reference to mm.

This changes helpers to receive @mm and moves all references to @current
to the caller, including checks for !current and !current->mm;
checks in mm_iommu_preregistered() are removed as there is no caller
yet.

This moves the mm_iommu_adjust_locked_vm() call to the caller as
it receives mm_iommu_table_group_mem_t but it needs mm.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
Acked-by: Alex Williamson 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

vfio/spapr: Postpone allocation of userspace version of TCE table

2017-03-22T11:43:37+00:00

[ Upstream commit 39701e56f5f16ea0cf8fc9e8472e645f8de91d23 ]

The iommu_table struct manages a hardware TCE table and a vmalloc'd
table with corresponding userspace addresses. Both are allocated when
the default DMA window is created and this happens when the very first
group is attached to a container.

As we are going to allow the userspace to configure container in one
memory context and pas container fd to another, we have to postpones
such allocations till a container fd is passed to the destination
user process so we would account locked memory limit against the actual
container user constrainsts.

This postpones the it_userspace array allocation till it is used first
time for mapping. The unmapping patch already checks if the array is
allocated.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
Acked-by: Alex Williamson 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

vfio/pci: Fix integer overflows, bitmask check

2016-10-26T19:49:29+00:00

The VFIO_DEVICE_SET_IRQS ioctl did not sufficiently sanitize
user-supplied integers, potentially allowing memory corruption. This
patch adds appropriate integer overflow checks, checks the range bounds
for VFIO_IRQ_SET_DATA_NONE, and also verifies that only single element
in the VFIO_IRQ_SET_DATA_TYPE_MASK bitmask is set.
VFIO_IRQ_SET_ACTION_TYPE_MASK is already correctly checked later in
vfio_pci_set_irqs_ioctl().

Furthermore, a kzalloc is changed to a kcalloc because the use of a
kzalloc with an integer multiplication allowed an integer overflow
condition to be reached without this patch. kcalloc checks for overflow
and should prevent a similar occurrence.

Signed-off-by: Vlad Tsyrklevich 
Signed-off-by: Alex Williamson

vfio_pci: use pci_alloc_irq_vectors

2016-09-29T19:36:38+00:00

Simplify the interrupt setup by using the new PCI layer helpers.

Signed-off-by: Christoph Hellwig 
Signed-off-by: Alex Williamson

vfio-pci: Disable INTx after MSI/X teardown

2016-09-26T19:52:19+00:00

The MSI/X shutdown path can gratuitously enable INTx, which is not
something we want to happen if we're dealing with broken INTx device.

Signed-off-by: Alex Williamson

vfio-pci: Virtualize PCIe & AF FLR

2016-09-26T19:52:16+00:00

We use a BAR restore trick to try to detect when a user has performed
a device reset, possibly through FLR or other backdoors, to put things
back into a working state.  This is important for backdoor resets, but
we can actually just virtualize the "front door" resets provided via
PCIe and AF FLR.  Set these bits as virtualized + writable, allowing
the default write to set them in vconfig, then we can simply check the
bit, perform an FLR of our own, and clear the bit.  We don't actually
have the granularity in PCI to specify the type of reset we want to
do, but generally devices don't implement both PCIe and AF FLR and
we'll favor these over other types of reset, so we should generally
lineup.  We do test whether the device provides the requested FLR type
to stay consistent with hardware capabilities though.

This seems to fix several instance of devices getting into bad states
with userspace drivers, like dpdk, running inside a VM.

Signed-off-by: Alex Williamson 
Reviewed-by: Greg Rose