<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/virt, branch v4.12</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>KVM: arm/arm64: Handle possible NULL stage2 pud when ageing pages</title>
<updated>2017-06-06T13:28:40+00:00</updated>
<author>
<name>Marc Zyngier</name>
<email>marc.zyngier@arm.com</email>
</author>
<published>2017-06-05T18:17:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=d6dbdd3c8558cad3b6d74cc357b408622d122331'/>
<id>d6dbdd3c8558cad3b6d74cc357b408622d122331</id>
<content type='text'>
Under memory pressure, we start ageing pages, which amounts to walking
the page tables. Since we don't want to allocate any extra level,
we pass NULL for our private allocation cache, which means that
stage2_get_pud() is allowed to fail. This results in the following
splat:

[ 1520.409577] Unable to handle kernel NULL pointer dereference at virtual address 00000008
[ 1520.417741] pgd = ffff810f52fef000
[ 1520.421201] [00000008] *pgd=0000010f636c5003, *pud=0000010f56f48003, *pmd=0000000000000000
[ 1520.429546] Internal error: Oops: 96000006 [#1] PREEMPT SMP
[ 1520.435156] Modules linked in:
[ 1520.438246] CPU: 15 PID: 53550 Comm: qemu-system-aar Tainted: G        W       4.12.0-rc4-00027-g1885c397eaec #7205
[ 1520.448705] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB12A 10/26/2016
[ 1520.463726] task: ffff800ac5fb4e00 task.stack: ffff800ce04e0000
[ 1520.469666] PC is at stage2_get_pmd+0x34/0x110
[ 1520.474119] LR is at kvm_age_hva_handler+0x44/0xf0
[ 1520.478917] pc : [&lt;ffff0000080b137c&gt;] lr : [&lt;ffff0000080b149c&gt;] pstate: 40000145
[ 1520.486325] sp : ffff800ce04e33d0
[ 1520.489644] x29: ffff800ce04e33d0 x28: 0000000ffff40064
[ 1520.494967] x27: 0000ffff27e00000 x26: 0000000000000000
[ 1520.500289] x25: ffff81051ba65008 x24: 0000ffff40065000
[ 1520.505618] x23: 0000ffff40064000 x22: 0000000000000000
[ 1520.510947] x21: ffff810f52b20000 x20: 0000000000000000
[ 1520.516274] x19: 0000000058264000 x18: 0000000000000000
[ 1520.521603] x17: 0000ffffa6fe7438 x16: ffff000008278b70
[ 1520.526940] x15: 000028ccd8000000 x14: 0000000000000008
[ 1520.532264] x13: ffff7e0018298000 x12: 0000000000000002
[ 1520.537582] x11: ffff000009241b93 x10: 0000000000000940
[ 1520.542908] x9 : ffff0000092ef800 x8 : 0000000000000200
[ 1520.548229] x7 : ffff800ce04e36a8 x6 : 0000000000000000
[ 1520.553552] x5 : 0000000000000001 x4 : 0000000000000000
[ 1520.558873] x3 : 0000000000000000 x2 : 0000000000000008
[ 1520.571696] x1 : ffff000008fd5000 x0 : ffff0000080b149c
[ 1520.577039] Process qemu-system-aar (pid: 53550, stack limit = 0xffff800ce04e0000)
[...]
[ 1521.510735] [&lt;ffff0000080b137c&gt;] stage2_get_pmd+0x34/0x110
[ 1521.516221] [&lt;ffff0000080b149c&gt;] kvm_age_hva_handler+0x44/0xf0
[ 1521.522054] [&lt;ffff0000080b0610&gt;] handle_hva_to_gpa+0xb8/0xe8
[ 1521.527716] [&lt;ffff0000080b3434&gt;] kvm_age_hva+0x44/0xf0
[ 1521.532854] [&lt;ffff0000080a58b0&gt;] kvm_mmu_notifier_clear_flush_young+0x70/0xc0
[ 1521.539992] [&lt;ffff000008238378&gt;] __mmu_notifier_clear_flush_young+0x88/0xd0
[ 1521.546958] [&lt;ffff00000821eca0&gt;] page_referenced_one+0xf0/0x188
[ 1521.552881] [&lt;ffff00000821f36c&gt;] rmap_walk_anon+0xec/0x250
[ 1521.558370] [&lt;ffff000008220f78&gt;] rmap_walk+0x78/0xa0
[ 1521.563337] [&lt;ffff000008221104&gt;] page_referenced+0x164/0x180
[ 1521.569002] [&lt;ffff0000081f1af0&gt;] shrink_active_list+0x178/0x3b8
[ 1521.574922] [&lt;ffff0000081f2058&gt;] shrink_node_memcg+0x328/0x600
[ 1521.580758] [&lt;ffff0000081f23f4&gt;] shrink_node+0xc4/0x328
[ 1521.585986] [&lt;ffff0000081f2718&gt;] do_try_to_free_pages+0xc0/0x340
[ 1521.592000] [&lt;ffff0000081f2a64&gt;] try_to_free_pages+0xcc/0x240
[...]

The trivial fix is to handle this NULL pud value early, rather than
dereferencing it blindly.
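The shape of the fix, sketched as a toy Python model of the walker (illustrative names and types; the real change is an early NULL check in C, in stage2_get_pmd()):

```python
def stage2_get_pmd(pud):
    """Toy model: descend from a pud to a pmd.

    The ageing path walks with no allocation cache, so the pud lookup
    may legitimately come back empty; bail out early instead of
    dereferencing it (the NULL-pointer oops in the splat above).
    """
    if pud is None:           # the early check the fix adds
        return None           # nothing mapped here, nothing to age
    return pud["pmd"]         # placeholder for the real descent
```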

Cc: stable@vger.kernel.org
Signed-off-by: Marc Zyngier &lt;marc.zyngier@arm.com&gt;
Reviewed-by: Christoffer Dall &lt;cdall@linaro.org&gt;
Signed-off-by: Christoffer Dall &lt;cdall@linaro.org&gt;
</content>
</entry>
<entry>
<title>KVM: arm/arm64: vgic-v3: Fix nr_pre_bits bitfield extraction</title>
<updated>2017-06-06T08:16:53+00:00</updated>
<author>
<name>Christoffer Dall</name>
<email>cdall@linaro.org</email>
</author>
<published>2017-06-04T20:17:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=d68356cc51e304ff9a389f006b6249d41f2c2319'/>
<id>d68356cc51e304ff9a389f006b6249d41f2c2319</id>
<content type='text'>
We used to extract PRIbits from ICH_VTR_EL2, which is the uppermost
field in the register word, so a mask wasn't necessary. But now that we
look at PREbits, which sits at bits 28:26 with the potentially non-zero
PRIbits field above it, we really do need to mask off the field value,
otherwise fun things may happen.
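The gist, as a toy Python model: ICH_VTR_EL2 carries PRIbits in bits 31:29, directly above PREbits in bits 28:26, so a bare shift drags PRIbits along (the modulo below stands in for the bitwise mask the fix adds):

```python
PRE_SHIFT = 26   # ICH_VTR_EL2.PREbits sits at bits 28:26
PRE_WIDTH = 3    # a 3-bit field, so 8 possible values

def pre_bits_buggy(vtr):
    # A bare shift keeps whatever sits above the field (PRIbits).
    return vtr >> PRE_SHIFT

def pre_bits_fixed(vtr):
    # Shift, then reduce to the 3-bit field (equivalent to masking
    # with 0x7, written here without a bitwise AND).
    return (vtr >> PRE_SHIFT) % (2 ** PRE_WIDTH)
```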

Signed-off-by: Christoffer Dall &lt;cdall@linaro.org&gt;
Acked-by: Marc Zyngier &lt;marc.zyngier@arm.com&gt;
</content>
</entry>
<entry>
<title>KVM: arm/arm64: Fix issues with GICv2 on GICv3 migration</title>
<updated>2017-05-24T07:44:07+00:00</updated>
<author>
<name>Christoffer Dall</name>
<email>cdall@linaro.org</email>
</author>
<published>2017-05-20T12:12:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=28232a4317be7ad615f0f1b69dc8583fd580a8e3'/>
<id>28232a4317be7ad615f0f1b69dc8583fd580a8e3</id>
<content type='text'>
We have been a little loose with our intermediate VMCR representation
where we had a 'ctlr' field, but we failed to differentiate between the
GICv2 GICC_CTLR and ICC_CTLR_EL1 layouts, and therefore ended up mapping
the wrong bits into the individual fields of the ICH_VMCR_EL2 when
emulating a GICv2 on a GICv3 system.

Fix this by using explicit fields for the VMCR bits instead.
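A toy illustration of the layout mismatch (the field positions used here, bit 9 in a GICv2-style word versus bit 1 in a GICv3-style word, are examples, not a full account of the architected layouts; decoding into a named field first, as the fix does, makes the target layout explicit):

```python
# Example positions for one conceptual field in the two views.
V2_EOI_SHIFT = 9   # where the field sits in the GICv2-style word
V3_EOI_SHIFT = 1   # where the same field sits in the GICv3-style word

def eoimode_from_v2(ctlr):
    # Decode: extract the single-bit field from the GICv2-style word.
    return (ctlr >> V2_EOI_SHIFT) % 2

def vmcr_from_fields(eoimode):
    # Encode: place the named field at its position in the target
    # layout, instead of copying a raw 'ctlr' word bit-for-bit.
    return eoimode * 2 ** V3_EOI_SHIFT
```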

Cc: Eric Auger &lt;eric.auger@redhat.com&gt;
Reported-by: wanghaibin &lt;wanghaibin.wang@huawei.com&gt;
Signed-off-by: Christoffer Dall &lt;cdall@linaro.org&gt;
Reviewed-by: Marc Zyngier &lt;marc.zyngier@arm.com&gt;
Tested-by: Marc Zyngier &lt;marc.zyngier@arm.com&gt;
</content>
</entry>
<entry>
<title>KVM: arm/arm64: Hold slots_lock when unregistering kvm io bus devices</title>
<updated>2017-05-18T09:18:16+00:00</updated>
<author>
<name>Christoffer Dall</name>
<email>cdall@linaro.org</email>
</author>
<published>2017-05-17T19:16:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=fa472fa91a5a0b241f5ddae927d2e235d07545df'/>
<id>fa472fa91a5a0b241f5ddae927d2e235d07545df</id>
<content type='text'>
We were not holding the kvm-&gt;slots_lock when calling
kvm_io_bus_unregister_dev(), as required.

This only affects the error path, but still, let's do our due
diligence.
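A toy sketch of the locking contract involved (simplified names; in the kernel this is taking and releasing the slots lock around the unregister call):

```python
class Bus:
    """Toy model of the contract: unregister must only run with the
    slots lock held (a boolean stands in for the kernel mutex)."""
    def __init__(self):
        self.slots_lock_held = False
        self.ndevs = 1

    def io_bus_unregister_dev(self):
        # lockdep-style contract check
        assert self.slots_lock_held, "slots lock must be held"
        self.ndevs -= 1

def error_path(bus):
    bus.slots_lock_held = True    # mutex_lock(slots lock)
    bus.io_bus_unregister_dev()
    bus.slots_lock_held = False   # mutex_unlock(slots lock)
```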

Reported by: Eric Auger &lt;eric.auger@redhat.com&gt;
Signed-off-by: Christoffer Dall &lt;cdall@linaro.org&gt;
Reviewed-by: Eric Auger &lt;eric.auger@redhat.com&gt;
</content>
</entry>
<entry>
<title>KVM: arm/arm64: Fix bug when registering redist iodevs</title>
<updated>2017-05-18T09:18:12+00:00</updated>
<author>
<name>Christoffer Dall</name>
<email>cdall@linaro.org</email>
</author>
<published>2017-05-17T11:12:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=552c9f47f8d451830a6b47151c6d2db77f77cc3e'/>
<id>552c9f47f8d451830a6b47151c6d2db77f77cc3e</id>
<content type='text'>
If userspace creates the VCPUs after initializing the VGIC, then we end
up in a situation where we trigger a bug in kvm_vcpu_get_idx(), because
it is called prior to adding the VCPU into the vcpus array on the VM.

There is no tight coupling between the VCPU index and the area of the
redistributor region used for the VCPU, so we can simply ensure that all
creations of redistributors are serialized per VM, and increment an
offset when we successfully add a redistributor.

The vgic_register_redist_iodev() function can be called from two paths.
The first is vgic_register_all_redist_iodevs(), which is called via the
kvm_vgic_addr() device attribute handler; this path already holds the
kvm-&gt;lock mutex.

The other path is via kvm_vgic_vcpu_init, which is called through a
longer chain from kvm_vm_ioctl_create_vcpu(), which releases the
kvm-&gt;lock mutex just before calling kvm_arch_vcpu_create(), so we can
simply take this mutex again later for our purposes.
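The serialization idea, as a toy Python model (illustrative names; a per-VM lock plus an offset that grows by one redistributor stride per successful registration, so the placement no longer depends on the possibly not-yet-valid VCPU index):

```python
REDIST_STRIDE = 0x20000   # two 64K frames per GICv3 redistributor

class VM:
    """Toy per-VM state: registrations are serialized by a lock, and
    each successful one bumps the offset into the redist region."""
    def __init__(self):
        self.lock_held = False          # stand-in for the per-VM mutex
        self.next_redist_offset = 0

    def register_redist_iodev(self):
        assert self.lock_held, "callers must hold the per-VM lock"
        off = self.next_redist_offset
        self.next_redist_offset += REDIST_STRIDE  # bump only on success
        return off
```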

Fixes: ab6f468c10 ("KVM: arm/arm64: Register iodevs when setting redist base and creating VCPUs")
Signed-off-by: Christoffer Dall &lt;cdall@linaro.org&gt;
Tested-by: Jean-Philippe Brucker &lt;jean-philippe.brucker@arm.com&gt;
Reviewed-by: Eric Auger &lt;eric.auger@redhat.com&gt;
</content>
</entry>
<entry>
<title>kvm: arm/arm64: Fix use after free of stage2 page table</title>
<updated>2017-05-16T09:54:25+00:00</updated>
<author>
<name>Suzuki K Poulose</name>
<email>suzuki.poulose@arm.com</email>
</author>
<published>2017-05-16T09:34:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=0c428a6a9256fcd66817e12db32a50b405ed2e5c'/>
<id>0c428a6a9256fcd66817e12db32a50b405ed2e5c</id>
<content type='text'>
We yield the kvm-&gt;mmu_lock occasionally while performing an operation
(e.g. unmap or permission changes) on a large area of stage2 mappings.
However, this can allow another thread to clear and free up the stage2
page tables while we are waiting to regain the lock, so the original
thread can end up accessing memory that has been freed. This patch fixes
the problem by making sure that the stage2 page table is still valid
after we regain the lock. The fact that mmu_notifier-&gt;release() can be
called twice (via __mmu_notifier_release and mmu_notifier_unregister)
increases the chance of hitting this race, with two threads trying to
unmap the entire guest shadow pages.

While at it, clean up the redundant checks around cond_resched_lock in
stage2_wp_range(), as cond_resched_lock already performs the same
checks.
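The shape of the fix, as a toy single-threaded Python model: after any point where the walk may have dropped and re-taken the lock, re-validate the PGD before touching it:

```python
class KVM:
    """Toy model of a stage2 walk that may yield the mmu lock."""
    def __init__(self):
        self.pgd = object()   # stand-in for a live stage2 PGD

def walk_range(kvm, entries, freed_while_yielding=False):
    """Visit `entries` slots, potentially yielding between slots.

    Another thread may free the page tables in the yield window, so
    re-check the PGD after each potential yield (the check the fix
    adds) instead of touching freed memory.
    """
    visited = 0
    for i in range(entries):
        if freed_while_yielding and i == 1:
            kvm.pgd = None    # what can happen while the lock is dropped
        if kvm.pgd is None:   # re-validate after regaining the lock
            break
        visited += 1
    return visited
```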

Cc: Mark Rutland &lt;mark.rutland@arm.com&gt;
Cc: Radim Krčmář &lt;rkrcmar@redhat.com&gt;
Cc: andreyknvl@google.com
Cc: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Cc: stable@vger.kernel.org
Acked-by: Marc Zyngier &lt;marc.zyngier@arm.com&gt;
Signed-off-by: Suzuki K Poulose &lt;suzuki.poulose@arm.com&gt;
Reviewed-by: Christoffer Dall &lt;cdall@linaro.org&gt;
Signed-off-by: Christoffer Dall &lt;cdall@linaro.org&gt;
</content>
</entry>
<entry>
<title>kvm: arm/arm64: Force reading uncached stage2 PGD</title>
<updated>2017-05-16T09:54:00+00:00</updated>
<author>
<name>Suzuki K Poulose</name>
<email>suzuki.poulose@arm.com</email>
</author>
<published>2017-05-16T09:34:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=2952a6070e07ebdd5896f1f5b861acad677caded'/>
<id>2952a6070e07ebdd5896f1f5b861acad677caded</id>
<content type='text'>
Make sure we don't use a cached value of the KVM stage2 PGD while
resetting the PGD.
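A toy Python sketch of the idea (in the C fix the fresh value is obtained with an uncached, READ_ONCE-style load; Python has no compiler caching, so this only models the read-at-point-of-use structure):

```python
class MMU:
    """Toy model: the PGD is shared state other threads may clear."""
    def __init__(self):
        self.pgd = object()

def reset_pgd(mmu):
    """Re-read mmu.pgd at the point of use rather than trusting a copy
    taken earlier, then tear it down only if it is still live."""
    pgd = mmu.pgd             # fresh read, not a stale cached copy
    if pgd is None:
        return False          # someone already reset it
    mmu.pgd = None
    return True
```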

Cc: Marc Zyngier &lt;marc.zyngier@arm.com&gt;
Cc: stable@vger.kernel.org
Signed-off-by: Suzuki K Poulose &lt;suzuki.poulose@arm.com&gt;
Reviewed-by: Christoffer Dall &lt;cdall@linaro.org&gt;
Signed-off-by: Christoffer Dall &lt;cdall@linaro.org&gt;
</content>
</entry>
<entry>
<title>kvm: arm/arm64: Fix race in resetting stage2 PGD</title>
<updated>2017-05-15T10:05:25+00:00</updated>
<author>
<name>Suzuki K Poulose</name>
<email>suzuki.poulose@arm.com</email>
</author>
<published>2017-05-03T14:17:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=6c0d706b563af732adb094c5bf807437e8963e84'/>
<id>6c0d706b563af732adb094c5bf807437e8963e84</id>
<content type='text'>
In kvm_free_stage2_pgd() we check the stage2 PGD before holding
the lock and proceed to take the lock if it is valid. And we unmap
the page tables, followed by releasing the lock. We reset the PGD
only after dropping this lock, which could cause a race condition
where another thread waiting on or even holding the lock, could
potentially see that the PGD is still valid and proceed to perform
a stage2 operation and later encounter a NULL PGD.

[223090.242280] Unable to handle kernel NULL pointer dereference at
virtual address 00000040
[223090.262330] PC is at unmap_stage2_range+0x8c/0x428
[223090.262332] LR is at kvm_unmap_hva_handler+0x2c/0x3c
[223090.262531] Call trace:
[223090.262533] [&lt;ffff0000080adb78&gt;] unmap_stage2_range+0x8c/0x428
[223090.262535] [&lt;ffff0000080adf40&gt;] kvm_unmap_hva_handler+0x2c/0x3c
[223090.262537] [&lt;ffff0000080ace2c&gt;] handle_hva_to_gpa+0xb0/0x104
[223090.262539] [&lt;ffff0000080af988&gt;] kvm_unmap_hva+0x5c/0xbc
[223090.262543] [&lt;ffff0000080a2478&gt;]
kvm_mmu_notifier_invalidate_page+0x50/0x8c
[223090.262547] [&lt;ffff0000082274f8&gt;]
__mmu_notifier_invalidate_page+0x5c/0x84
[223090.262551] [&lt;ffff00000820b700&gt;] try_to_unmap_one+0x1d0/0x4a0
[223090.262553] [&lt;ffff00000820c5c8&gt;] rmap_walk+0x1cc/0x2e0
[223090.262555] [&lt;ffff00000820c90c&gt;] try_to_unmap+0x74/0xa4
[223090.262557] [&lt;ffff000008230ce4&gt;] migrate_pages+0x31c/0x5ac
[223090.262561] [&lt;ffff0000081f869c&gt;] compact_zone+0x3fc/0x7ac
[223090.262563] [&lt;ffff0000081f8ae0&gt;] compact_zone_order+0x94/0xb0
[223090.262564] [&lt;ffff0000081f91c0&gt;] try_to_compact_pages+0x108/0x290
[223090.262569] [&lt;ffff0000081d5108&gt;] __alloc_pages_direct_compact+0x70/0x1ac
[223090.262571] [&lt;ffff0000081d64a0&gt;] __alloc_pages_nodemask+0x434/0x9f4
[223090.262572] [&lt;ffff0000082256f0&gt;] alloc_pages_vma+0x230/0x254
[223090.262574] [&lt;ffff000008235e5c&gt;] do_huge_pmd_anonymous_page+0x114/0x538
[223090.262576] [&lt;ffff000008201bec&gt;] handle_mm_fault+0xd40/0x17a4
[223090.262577] [&lt;ffff0000081fb324&gt;] __get_user_pages+0x12c/0x36c
[223090.262578] [&lt;ffff0000081fb804&gt;] get_user_pages_unlocked+0xa4/0x1b8
[223090.262579] [&lt;ffff0000080a3ce8&gt;] __gfn_to_pfn_memslot+0x280/0x31c
[223090.262580] [&lt;ffff0000080a3dd0&gt;] gfn_to_pfn_prot+0x4c/0x5c
[223090.262582] [&lt;ffff0000080af3f8&gt;] kvm_handle_guest_abort+0x240/0x774
[223090.262584] [&lt;ffff0000080b2bac&gt;] handle_exit+0x11c/0x1ac
[223090.262586] [&lt;ffff0000080ab99c&gt;] kvm_arch_vcpu_ioctl_run+0x31c/0x648
[223090.262587] [&lt;ffff0000080a1d78&gt;] kvm_vcpu_ioctl+0x378/0x768
[223090.262590] [&lt;ffff00000825df5c&gt;] do_vfs_ioctl+0x324/0x5a4
[223090.262591] [&lt;ffff00000825e26c&gt;] SyS_ioctl+0x90/0xa4
[223090.262595] [&lt;ffff000008085d84&gt;] el0_svc_naked+0x38/0x3c

This patch moves the stage2 PGD manipulation under the lock.
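In outline, the fixed ordering looks like this (toy Python model; a boolean stands in for the mmu spinlock, and the actual freeing still happens after the unlock):

```python
class KVM2:
    def __init__(self):
        self.mmu_lock_held = False   # stand-in for the mmu spinlock
        self.pgd = object()          # stand-in for a live stage2 PGD

def free_stage2_pgd(kvm):
    """Toy model of the fixed free path: the PGD check, the unmap and
    the PGD reset all happen under the lock, so no other lock holder
    can ever observe a stale non-empty PGD."""
    to_free = None
    kvm.mmu_lock_held = True         # lock
    if kvm.pgd is not None:
        to_free = kvm.pgd            # the range unmap goes here
        kvm.pgd = None               # reset while still holding the lock
    kvm.mmu_lock_held = False        # unlock
    return to_free                   # caller frees outside the lock
```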

Reported-by: Alexander Graf &lt;agraf@suse.de&gt;
Cc: Mark Rutland &lt;mark.rutland@arm.com&gt;
Cc: Marc Zyngier &lt;marc.zyngier@arm.com&gt;
Cc: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Cc: Radim Krčmář &lt;rkrcmar@redhat.com&gt;
Reviewed-by: Christoffer Dall &lt;cdall@linaro.org&gt;
Reviewed-by: Marc Zyngier &lt;marc.zyngier@arm.com&gt;
Signed-off-by: Suzuki K Poulose &lt;suzuki.poulose@arm.com&gt;
Signed-off-by: Christoffer Dall &lt;cdall@linaro.org&gt;
</content>
</entry>
<entry>
<title>KVM: arm/arm64: vgic-v3: Use PREbits to infer the number of ICH_APxRn_EL2 registers</title>
<updated>2017-05-15T09:32:04+00:00</updated>
<author>
<name>Marc Zyngier</name>
<email>marc.zyngier@arm.com</email>
</author>
<published>2017-05-02T13:30:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=15d2bffdde6268883647c6112970f74d3e1af651'/>
<id>15d2bffdde6268883647c6112970f74d3e1af651</id>
<content type='text'>
The GICv3 documentation is extremely confusing, as it talks about
the number of priorities represented by the ICH_APxRn_EL2 registers,
while it should really talk about the number of preemption levels.

This leads to a bug where we may access undefined ICH_APxRn_EL2
registers, since PREbits is allowed to be smaller than PRIbits.
Thankfully, nobody seems to have taken this path so far...

The fix is to use ICH_VTR_EL2.PREbits instead.
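The register-count arithmetic, as a small sketch (the PREbits field encodes the number of preemption bits minus one, and each 32-bit ICH_APxRn_EL2 register tracks 32 preemption levels):

```python
def nr_apr_regs(prebits_field):
    """Number of ICH_APxRn_EL2 registers needed for a given
    ICH_VTR_EL2.PREbits field value: prebits_field + 1 preemption
    bits give 2 ** (prebits_field + 1) preemption levels, and each
    32-bit register covers 32 of them."""
    levels = 2 ** (prebits_field + 1)
    return levels // 32
```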

Signed-off-by: Marc Zyngier &lt;marc.zyngier@arm.com&gt;
Reviewed-by: Christoffer Dall &lt;cdall@linaro.org&gt;
Signed-off-by: Christoffer Dall &lt;cdall@linaro.org&gt;
</content>
</entry>
<entry>
<title>KVM: arm/arm64: vgic-v3: Do not use Active+Pending state for a HW interrupt</title>
<updated>2017-05-15T09:31:51+00:00</updated>
<author>
<name>Marc Zyngier</name>
<email>marc.zyngier@arm.com</email>
</author>
<published>2017-05-02T13:30:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=3d6e77ad1489650afa20da92bb589c8778baa8da'/>
<id>3d6e77ad1489650afa20da92bb589c8778baa8da</id>
<content type='text'>
When an interrupt is injected with the HW bit set (indicating that
deactivation should be propagated to the physical distributor),
special care must be taken so that we never mark the corresponding
LR with the Active+Pending state (as the pending state is kept in
the physical distributor).
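Schematically, the LR state selection looks like this (toy model with illustrative state names; the real code clears the pending bit in the LR word for HW-mapped interrupts):

```python
def lr_state(hw, pending, active):
    """Toy model of LR state selection: for a HW-mapped interrupt the
    pending state lives in the physical distributor, so an interrupt
    that is both active and pending goes into the LR as plain 'active',
    never as Active+Pending."""
    if hw and pending and active:
        return "active"           # pending stays in the physical GIC
    if pending and active:
        return "active+pending"   # fine for purely virtual interrupts
    if active:
        return "active"
    return "pending" if pending else "inactive"
```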

Cc: stable@vger.kernel.org
Fixes: 59529f69f504 ("KVM: arm/arm64: vgic-new: Add GICv3 world switch backend")
Signed-off-by: Marc Zyngier &lt;marc.zyngier@arm.com&gt;
Reviewed-by: Christoffer Dall &lt;cdall@linaro.org&gt;
Signed-off-by: Christoffer Dall &lt;cdall@linaro.org&gt;
</content>
</entry>
</feed>
