linux-stable.git/kernel, branch v3.2.99

blktrace: fix unlocked access to init/start-stop/teardown

2018-02-13T18:32:15+00:00

commit 1f2cac107c591c24b60b115d6050adc213d10fc0 upstream.

sg.c calls into the blktrace functions without holding the proper queue
mutex for doing setup, start/stop, or teardown.

Add internal unlocked variants, and export the ones that do the proper
locking.

Fixes: 6da127ad0918 ("blktrace: Add blktrace ioctls to SCSI generic devices")
Tested-by: Dmitry Vyukov 
Signed-off-by: Jens Axboe 
Signed-off-by: Ben Hutchings

blktrace: Fix potential deadlock between delete & sysfs ops

2018-02-13T18:32:15+00:00

commit 5acb3cc2c2e9d3020a4fee43763c6463767f1572 upstream.

The lockdep code had reported the following unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(s_active#228);
                               lock(&bdev->bd_mutex/1);
                               lock(s_active#228);
  lock(&bdev->bd_mutex);

 *** DEADLOCK ***

The deadlock may happen when one task (CPU1) is trying to delete a
partition in a block device and another task (CPU0) is accessing
tracing sysfs file (e.g. /sys/block/dm-1/trace/act_mask) in that
partition.

The s_active isn't an actual lock. It is a reference count (kn->count)
on the sysfs (kernfs) file. Removal of a sysfs file, however, require
a wait until all the references are gone. The reference count is
treated like a rwsem using lockdep instrumentation code.

The fact that a thread is in the sysfs callback method or in the
ioctl call means there is a reference to the opended sysfs or device
file. That should prevent the underlying block structure from being
removed.

Instead of using bd_mutex in the block_device structure, a new
blk_trace_mutex is now added to the request_queue structure to protect
access to the blk_trace structure.

Suggested-by: Christoph Hellwig 
Signed-off-by: Waiman Long 
Acked-by: Steven Rostedt (VMware) 

Fix typo in patch subject line, and prune a comment detailing how
the code used to work.

Signed-off-by: Jens Axboe 
Signed-off-by: Ben Hutchings

kprobes, x86/alternatives: Use text_mutex to protect smp_alt_modules

2018-02-13T18:32:14+00:00

commit e846d13958066828a9483d862cc8370a72fadbb6 upstream.

We use alternatives_text_reserved() to check if the address is in
the fixed pieces of alternative reserved, but the problem is that
we don't hold the smp_alt mutex when call this function. So the list
traversal may encounter a deleted list_head if another path is doing
alternatives_smp_module_del().

One solution is that we can hold smp_alt mutex before call this
function, but the difficult point is that the callers of this
functions, arch_prepare_kprobe() and arch_prepare_optimized_kprobe(),
are called inside the text_mutex. So we must hold smp_alt mutex
before we go into these arch dependent code. But we can't now,
the smp_alt mutex is the arch dependent part, only x86 has it.
Maybe we can export another arch dependent callback to solve this.

But there is a simpler way to handle this problem. We can reuse the
text_mutex to protect smp_alt_modules instead of using another mutex.
And all the arch dependent checks of kprobes are inside the text_mutex,
so it's safe now.

Signed-off-by: Zhou Chengming 
Reviewed-by: Masami Hiramatsu 
Acked-by: Steven Rostedt (VMware) 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: bp@suse.de
Fixes: 2cfa197 "ftrace/alternatives: Introducing *_text_reserved functions"
Link: http://lkml.kernel.org/r/1509585501-79466-1-git-send-email-zhouchengming1@huawei.com
Signed-off-by: Ingo Molnar 
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings

x86/smp: Don't ever patch back to UP if we unplug cpus

2018-02-13T18:32:14+00:00

commit 816afe4ff98ee10b1d30fd66361be132a0a5cee6 upstream.

We still patch SMP instructions to UP variants if we boot with a
single CPU, but not at any other time.  In particular, not if we
unplug CPUs to return to a single cpu.

Paul McKenney points out:

 mean offline overhead is 6251/48=130.2 milliseconds.

 If I remove the alternatives_smp_switch() from the offline
 path [...] the mean offline overhead is 550/42=13.1 milliseconds

Basically, we're never going to get those 120ms back, and the
code is pretty messy.

We get rid of:

 1) The "smp-alt-once" boot option. It's actually "smp-alt-boot", the
    documentation is wrong. It's now the default.

 2) The skip_smp_alternatives flag used by suspend.

 3) arch_disable_nonboot_cpus_begin() and arch_disable_nonboot_cpus_end()
    which were only used to set this one flag.

Signed-off-by: Rusty Russell 
Cc: Paul McKenney 
Cc: Suresh Siddha 
Cc: Linus Torvalds 
Cc: Andrew Morton 
Cc: Peter Zijlstra 
Link: http://lkml.kernel.org/r/87vcgwwive.fsf@rustcorp.com.au
Signed-off-by: Ingo Molnar 
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings

KAISER: Kernel Address Isolation

2018-01-07T01:46:49+00:00

This patch introduces our implementation of KAISER (Kernel Address
Isolation to have Side-channels Efficiently Removed), a kernel isolation
technique to close hardware side channels on kernel address information.

More information about the original patch can be found at:
https://github.com/IAIK/KAISER
http://marc.info/?l=linux-kernel&m=149390087310405&w=2

Daniel Gruss 
Richard Fellner 
Michael Schwarz 



That original was then developed further by
Dave Hansen 
Hugh Dickins 
then others after this snapshot.

This combined patch for 3.2.96 was derived from hughd's patches below
for 3.18.72, in 2017-12-04's kaiser-3.18.72.tar; except for the last,
which was sent in 2017-12-09's nokaiser-3.18.72.tar.  They have been
combined in order to minimize the effort of rebasing: most of the
patches in the 3.18.72 series were small fixes and cleanups and
enhancements to three large patches.  About the only new work in this
backport is a simple reimplementation of kaiser_remove_mapping():
since mm/pageattr.c changed a lot between 3.2 and 3.18, and the
mods there for Kaiser never seemed necessary.

KAISER: Kernel Address Isolation
kaiser: merged update
kaiser: do not set _PAGE_NX on pgd_none
kaiser: stack map PAGE_SIZE at THREAD_SIZE-PAGE_SIZE
kaiser: fix build and FIXME in alloc_ldt_struct()
kaiser: KAISER depends on SMP
kaiser: fix regs to do_nmi() ifndef CONFIG_KAISER
kaiser: fix perf crashes
kaiser: ENOMEM if kaiser_pagetable_walk() NULL
kaiser: tidied up asm/kaiser.h somewhat
kaiser: tidied up kaiser_add/remove_mapping slightly
kaiser: kaiser_remove_mapping() move along the pgd
kaiser: align addition to x86/mm/Makefile
kaiser: cleanups while trying for gold link
kaiser: name that 0x1000 KAISER_SHADOW_PGD_OFFSET
kaiser: delete KAISER_REAL_SWITCH option
kaiser: vmstat show NR_KAISERTABLE as nr_overhead
kaiser: enhanced by kernel and user PCIDs
kaiser: load_new_mm_cr3() let SWITCH_USER_CR3 flush user
kaiser: PCID 0 for kernel and 128 for user
kaiser: x86_cr3_pcid_noflush and x86_cr3_pcid_user
kaiser: paranoid_entry pass cr3 need to paranoid_exit
kaiser: _pgd_alloc() without __GFP_REPEAT to avoid stalls
kaiser: fix unlikely error in alloc_ldt_struct()
kaiser: drop is_atomic arg to kaiser_pagetable_walk()

Signed-off-by: Hugh Dickins 
[bwh:
 - Fixed the #undef in arch/x86/boot/compressed/misc.h
 - Add missing #include in arch/x86/mm/kaiser.c]
Signed-off-by: Ben Hutchings

sched/core: Idle_task_exit() shouldn't use switch_mm_irqs_off()

2018-01-07T01:46:48+00:00

commit 252d2a4117bc181b287eeddf848863788da733ae upstream.

idle_task_exit() can be called with IRQs on x86 on and therefore
should use switch_mm(), not switch_mm_irqs_off().

This doesn't seem to cause any problems right now, but it will
confuse my upcoming TLB flush changes.  Nonetheless, I think it
should be backported because it's trivial.  There won't be any
meaningful performance impact because idle_task_exit() is only
used when offlining a CPU.

Signed-off-by: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: stable@vger.kernel.org
Fixes: f98db6013c55 ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler")
Link: http://lkml.kernel.org/r/ca3d1a9fa93a0b49f5a8ff729eda3640fb6abdf9.1497034141.git.luto@kernel.org
Signed-off-by: Ingo Molnar 
Signed-off-by: Hugh Dickins 
Signed-off-by: Ben Hutchings

sched/core: Add switch_mm_irqs_off() and use it in the scheduler

2018-01-07T01:46:47+00:00

commit f98db6013c557c216da5038d9c52045be55cd039 upstream.

By default, this is the same thing as switch_mm().

x86 will override it as an optimization.

Signed-off-by: Andy Lutomirski 
Reviewed-by: Borislav Petkov 
Cc: Borislav Petkov 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/df401df47bdd6be3e389c6f1e3f5310d70e81b2c.1461688545.git.luto@kernel.org
Signed-off-by: Ingo Molnar 
Signed-off-by: Hugh Dickins 
Signed-off-by: Ben Hutchings

ptrace: change __ptrace_unlink() to clear ->ptrace under ->siglock

2018-01-01T20:51:04+00:00

commit 1333ab03150478df8d6f5673a91df1e50dc6ab97 upstream.

This test-case (simplified version of generated by syzkaller)

	#include 
	#include 
	#include 

	void test(void)
	{
		for (;;) {
			if (fork()) {
				wait(NULL);
				continue;
			}

			ptrace(PTRACE_SEIZE, getppid(), 0, 0);
			ptrace(PTRACE_INTERRUPT, getppid(), 0, 0);
			_exit(0);
		}
	}

	int main(void)
	{
		int np;

		for (np = 0; np < 8; ++np)
			if (!fork())
				test();

		while (wait(NULL) > 0)
			;
		return 0;
	}

triggers the 2nd WARN_ON_ONCE(!signr) warning in do_jobctl_trap().  The
problem is that __ptrace_unlink() clears task->jobctl under siglock but
task->ptrace is cleared without this lock held; this fools the "else"
branch which assumes that !PT_SEIZED means PT_PTRACED.

Note also that most of other PTRACE_SEIZE checks can race with detach
from the exiting tracer too.  Say, the callers of ptrace_trap_notify()
assume that SEIZED can't go away after it was checked.

Signed-off-by: Oleg Nesterov 
Reported-by: Dmitry Vyukov 
Cc: Tejun Heo 
Cc: syzkaller 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Ben Hutchings

kernel/params.c: align add_sysfs_param documentation with code

2018-01-01T20:50:55+00:00

commit 630cc2b30a42c70628368a412beb4a5e5dd71abe upstream.

This parameter is named kp, so the documentation should use that.

Fixes: 9b473de87209 ("param: Fix duplicate module prefixes")
Link: http://lkml.kernel.org/r/20170919142656.64aea59e@endymion
Signed-off-by: Jean Delvare 
Acked-by: Rusty Russell 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Ben Hutchings

sched/sysctl: Check user input value of sysctl_sched_time_avg

2018-01-01T20:50:53+00:00

commit 5ccba44ba118a5000cccc50076b0344632459779 upstream.

System will hang if user set sysctl_sched_time_avg to 0:

  [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0

  Stack traceback for pid 0
  0xffff883f6406c600 0 0 1 3 R 0xffff883f6406cf50 *swapper/3
  ffff883f7ccc3ae8 0000000000000018 ffffffff810c4dd0 0000000000000000
  0000000000017800 ffff883f7ccc3d78 0000000000000003 ffff883f7ccc3bf8
  ffffffff810c4fc9 ffff883f7ccc3c08 00000000810c5043 ffff883f7ccc3c08
  Call Trace:
   [] ? update_group_capacity+0x110/0x200
  [] ? update_sd_lb_stats+0x109/0x600
  [] ? find_busiest_group+0x47/0x530
  [] ? load_balance+0x194/0x900
  [] ? update_rq_clock.part.83+0x1a/0xe0
  [] ? rebalance_domains+0x152/0x290
  [] ? run_rebalance_domains+0xdc/0x1d0
  [] ? __do_softirq+0xfb/0x320
  [] ? irq_exit+0x125/0x130
  [] ? scheduler_ipi+0x97/0x160
  [] ? smp_reschedule_interrupt+0x29/0x30
  [] ? reschedule_interrupt+0x6e/0x80
    [] ? cpuidle_enter_state+0xcc/0x230
  [] ? cpuidle_enter_state+0x9c/0x230
  [] ? cpuidle_enter+0x17/0x20
  [] ? cpu_startup_entry+0x38c/0x420
  [] ? start_secondary+0x173/0x1e0

Because divide-by-zero error happens in function:

update_group_capacity()
  update_cpu_capacity()
    scale_rt_capacity()
     {
          ...
          total = sched_avg_period() + delta;
          used = div_u64(avg, total);
          ...
     }

To fix this issue, check user input value of sysctl_sched_time_avg, keep
it unchanged when hitting invalid input, and set the minimum limit of
sysctl_sched_time_avg to 1 ms.

Reported-by: James Puthukattukaran 
Signed-off-by: Ethan Zhao 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: efault@gmx.de
Cc: ethan.kernel@gmail.com
Cc: keescook@chromium.org
Cc: mcgrof@kernel.org
Link: http://lkml.kernel.org/r/1504504774-18253-1-git-send-email-ethan.zhao@oracle.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Ben Hutchings