linux-stable.git/arch/x86/include/asm, branch linux-3.12.y

x86/vdso: Plug race between mapping and ELF header setup

2017-04-28T17:30:42+00:00

commit 6fdc6dd90272ce7e75d744f71535cfbd8d77da81 upstream.

The vsyscall32 sysctl can racy against a concurrent fork when it switches
from disabled to enabled:

    arch_setup_additional_pages()
	if (vdso32_enabled)
           --> No mapping
                                        sysctl.vsysscall32()
                                          --> vdso32_enabled = true
    create_elf_tables()
      ARCH_DLINFO_IA32
        if (vdso32_enabled) {
           --> Add VDSO entry with NULL pointer

Make ARCH_DLINFO_IA32 check whether the VDSO mapping has been set up for
the newly forked process or not.

Signed-off-by: Thomas Gleixner 
Acked-by: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Mathias Krause 
Link: http://lkml.kernel.org/r/20170410151723.602367196@linutronix.de
Signed-off-by: Thomas Gleixner 
Signed-off-by: Jiri Slaby

x86/apic: Order irq_enter/exit() calls correctly vs. ack_APIC_irq()

2017-01-27T16:15:00+00:00

commit b0f48706a176b71a6e54f399d7404bbeeaa7cfab upstream.

===============================
[ INFO: suspicious RCU usage. ]
4.8.0-rc6+ #5 Not tainted
-------------------------------
./arch/x86/include/asm/msr-trace.h:47 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

RCU used illegally from idle CPU!
rcu_scheduler_active = 1, debug_locks = 0
RCU used illegally from extended quiescent state!
no locks held by swapper/2/0.

stack backtrace:
CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.8.0-rc6+ #5
Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
 0000000000000000 ffff8d1bd6003f10 ffffffff94446949 ffff8d1bd4a68000
 0000000000000001 ffff8d1bd6003f40 ffffffff940e9247 ffff8d1bbdfcf3d0
 000000000000080b 0000000000000000 0000000000000000 ffff8d1bd6003f70
Call Trace:
   [] dump_stack+0x99/0xd0
 [] lockdep_rcu_suspicious+0xe7/0x120
 [] do_trace_write_msr+0x135/0x140
 [] native_write_msr+0x20/0x30
 [] native_apic_msr_eoi_write+0x1d/0x30
 [] smp_trace_call_function_interrupt+0x1e/0x270
 [] trace_call_function_interrupt+0x96/0xa0
   [] ? cpuidle_enter_state+0xe4/0x360
 [] ? cpuidle_enter_state+0xcf/0x360
 [] cpuidle_enter+0x17/0x20
 [] cpu_startup_entry+0x338/0x4d0
 [] start_secondary+0x154/0x180

This can be reproduced readily by running ftrace test case of kselftest.

Move the irq_enter() call before ack_APIC_irq(), because irq_enter() tells
the RCU susbstems to end the extended quiescent state, so that the
following trace call in ack_APIC_irq() works correctly. The same applies to
exiting_ack_irq() which calls ack_APIC_irq() after irq_exit().

[ tglx: Massaged changelog ]

Signed-off-by: Wanpeng Li 
Cc: Peter Zijlstra 
Cc: Wanpeng Li 
Link: http://lkml.kernel.org/r/1474198491-3738-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Thomas Gleixner 
Acked-by: Joerg Roedel 
Signed-off-by: Jiri Slaby

x86/mm/xen: Suppress hugetlbfs in PV guests

2016-11-24T15:23:39+00:00

commit 103f6112f253017d7062cd74d17f4a514ed4485c upstream.

Huge pages are not normally available to PV guests. Not suppressing
hugetlbfs use results in an endless loop of page faults when user mode
code tries to access a hugetlbfs mapped area (since the hypervisor
denies such PTEs to be created, but error indications can't be
propagated out of xen_set_pte_at(), just like for various of its
siblings), and - once killed in an oops like this:

  kernel BUG at .../fs/hugetlbfs/inode.c:428!
  invalid opcode: 0000 [#1] SMP
  ...
  RIP: e030:[]  [] remove_inode_hugepages+0x25b/0x320
  ...
  Call Trace:
   [] hugetlbfs_evict_inode+0x15/0x40
   [] evict+0xbd/0x1b0
   [] __dentry_kill+0x19a/0x1f0
   [] dput+0x1fe/0x220
   [] __fput+0x155/0x200
   [] task_work_run+0x60/0xa0
   [] do_exit+0x160/0x400
   [] do_group_exit+0x3b/0xa0
   [] get_signal+0x1ed/0x470
   [] do_signal+0x14/0x110
   [] prepare_exit_to_usermode+0xe9/0xf0
   [] retint_user+0x8/0x13

This is CVE-2016-3961 / XSA-174.

Reported-by: Vitaly Kuznetsov 
Signed-off-by: Jan Beulich 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Boris Ostrovsky 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: David Vrabel 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Juergen Gross 
Cc: Linus Torvalds 
Cc: Luis R. Rodriguez 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Toshi Kani 
Cc: xen-devel 
Link: http://lkml.kernel.org/r/57188ED802000078000E431C@prv-mh.provo.novell.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Jiri Slaby

Fix potential infoleak in older kernels

2016-11-24T15:20:50+00:00

Not upstream as it is not needed there.

So a patch something like this might be a safe way to fix the
potential infoleak in older kernels.

THIS IS UNTESTED. It's a very obvious patch, though, so if it compiles
it probably works. It just initializes the output variable with 0 in
the inline asm description, instead of doing it in the exception
handler.

It will generate slightly worse code (a few unnecessary ALU
operations), but it doesn't have any interactions with the exception
handler implementation.

Signed-off-by: Jiri Slaby

Revert "fix minor infoleak in get_user_ex()"

2016-11-08T15:38:29+00:00

This reverts commit d42924ab1ec523c0671f5560d51750996be31d3a which is
1c109fabbd51863475cd12ac206bdd249aee35af upstream.

Signed-off-by: Jiri Slaby 
Cc: Al Viro 
Cc: Linus Torvalds

fix minor infoleak in get_user_ex()

2016-09-29T09:14:28+00:00

commit 1c109fabbd51863475cd12ac206bdd249aee35af upstream.

get_user_ex(x, ptr) should zero x on failure.  It's not a lot of a leak
(at most we are leaking uninitialized 64bit value off the kernel stack,
and in a fairly constrained situation, at that), but the fix is trivial,
so...

Signed-off-by: Al Viro 
[ This sat in different branch from the uaccess fixes since mid-August ]
Signed-off-by: Linus Torvalds 
Signed-off-by: Jiri Slaby

x86/mm: Disable preemption during CR3 read+write

2016-09-21T11:40:07+00:00

commit 5cf0791da5c162ebc14b01eb01631cfa7ed4fa6e upstream.

There's a subtle preemption race on UP kernels:

Usually current->mm (and therefore mm->pgd) stays the same during the
lifetime of a task so it does not matter if a task gets preempted during
the read and write of the CR3.

But then, there is this scenario on x86-UP:

TaskA is in do_exit() and exit_mm() sets current->mm = NULL followed by:

 -> mmput()
 -> exit_mmap()
 -> tlb_finish_mmu()
 -> tlb_flush_mmu()
 -> tlb_flush_mmu_tlbonly()
 -> tlb_flush()
 -> flush_tlb_mm_range()
 -> __flush_tlb_up()
 -> __flush_tlb()
 ->  __native_flush_tlb()

At this point current->mm is NULL but current->active_mm still points to
the "old" mm.

Let's preempt taskA _after_ native_read_cr3() by taskB. TaskB has its
own mm so CR3 has changed.

Now preempt back to taskA. TaskA has no ->mm set so it borrows taskB's
mm and so CR3 remains unchanged. Once taskA gets active it continues
where it was interrupted and that means it writes its old CR3 value
back. Everything is fine because userland won't need its memory
anymore.

Now the fun part:

Let's preempt taskA one more time and get back to taskB. This
time switch_mm() won't do a thing because oldmm (->active_mm)
is the same as mm (as per context_switch()). So we remain
with a bad CR3 / PGD and return to userland.

The next thing that happens is handle_mm_fault() with an address for
the execution of its code in userland. handle_mm_fault() realizes that
it has a PTE with proper rights so it returns doing nothing. But the
CPU looks at the wrong PGD and insists that something is wrong and
faults again. And again. And one more time…

This pagefault circle continues until the scheduler gets tired of it and
puts another task on the CPU. It gets little difficult if the task is a
RT task with a high priority. The system will either freeze or it gets
fixed by the software watchdog thread which usually runs at RT-max prio.
But waiting for the watchdog will increase the latency of the RT task
which is no good.

Fix this by disabling preemption across the critical code section.

Signed-off-by: Sebastian Andrzej Siewior 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Rik van Riel 
Acked-by: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Mel Gorman 
Cc: Peter Zijlstra 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1470404259-26290-1-git-send-email-bigeasy@linutronix.de
[ Prettified the changelog. ]
Signed-off-by: Ingo Molnar 
Signed-off-by: Jiri Slaby

x86/mm: Improve switch_mm() barrier comments

2016-08-19T07:51:05+00:00

commit 4eaffdd5a5fe6ff9f95e1ab4de1ac904d5e0fa8b upstream.

My previous comments were still a bit confusing and there was a
typo. Fix it up.

Reported-by: Peter Zijlstra 
Signed-off-by: Andy Lutomirski 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: stable@vger.kernel.org
Fixes: 71b3c126e611 ("x86/mm: Add barriers and document switch_mm()-vs-flush synchronization")
Link: http://lkml.kernel.org/r/0a0b43cdcdd241c5faaaecfbcc91a155ddedc9a1.1452631609.git.luto@kernel.org
Signed-off-by: Ingo Molnar 
Signed-off-by: Jiri Slaby

x86/mm: Add barriers and document switch_mm()-vs-flush synchronization

2016-07-21T06:36:14+00:00

commit 71b3c126e61177eb693423f2e18a1914205b165e upstream.

When switch_mm() activates a new PGD, it also sets a bit that
tells other CPUs that the PGD is in use so that TLB flush IPIs
will be sent.  In order for that to work correctly, the bit
needs to be visible prior to loading the PGD and therefore
starting to fill the local TLB.

Document all the barriers that make this work correctly and add
a couple that were missing.

Signed-off-by: Andy Lutomirski 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar 
Cc: Charles (Chas) Williams 
Signed-off-by: Jiri Slaby

x86/apic: Fix suspicious RCU usage in smp_trace_call_function_interrupt()

2016-04-11T14:44:01+00:00

commit 7834c10313fb823e538f2772be78edcdeed2e6e3 upstream.

Since 4.4, I've been able to trigger this occasionally:

===============================
[ INFO: suspicious RCU usage. ]
4.5.0-rc7-think+ #3 Not tainted
Cc: Andi Kleen 
Link: http://lkml.kernel.org/r/20160315012054.GA17765@codemonkey.org.uk
Signed-off-by: Thomas Gleixner 
Signed-off-by: Jiri Slaby 

-------------------------------
./arch/x86/include/asm/msr-trace.h:47 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

RCU used illegally from idle CPU!
rcu_scheduler_active = 1, debug_locks = 1
RCU used illegally from extended quiescent state!
no locks held by swapper/3/0.

stack backtrace:
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.5.0-rc7-think+ #3
 ffffffff92f821e0 1f3e5c340597d7fc ffff880468e07f10 ffffffff92560c2a
 ffff880462145280 0000000000000001 ffff880468e07f40 ffffffff921376a6
 ffffffff93665ea0 0000cc7c876d28da 0000000000000005 ffffffff9383dd60
Call Trace:
   [] dump_stack+0x67/0x9d
 [] lockdep_rcu_suspicious+0xe6/0x100
 [] do_trace_write_msr+0x127/0x1a0
 [] native_apic_msr_eoi_write+0x23/0x30
 [] smp_trace_call_function_interrupt+0x38/0x360
 [] trace_call_function_interrupt+0x90/0xa0
   [] ? cpuidle_enter_state+0x1b4/0x520

Move the entering_irq() call before ack_APIC_irq(), because entering_irq()
tells the RCU susbstems to end the extended quiescent state, so that the
following trace call in ack_APIC_irq() works correctly.

Suggested-by: Andi Kleen 
Fixes: 4787c368a9bc "x86/tracing: Add irq_enter/exit() in smp_trace_reschedule_interrupt()"
Signed-off-by: Dave Jones 
Signed-off-by: Thomas Gleixner