linux-stable.git/kernel, branch v4.1.45

ftrace: Fix memleak when unregistering dynamic ops when tracing disabled

2017-10-04T11:43:05+00:00

[ Upstream commit edb096e00724f02db5f6ec7900f3bbd465c6c76f ]

If function tracing is disabled by the user via the function-trace option or
the proc sysctl file, and a ftrace_ops that was allocated on the heap is
unregistered, then the shutdown code exits out without doing the proper
clean up. This was found via kmemleak and running the ftrace selftests, as
one of the tests unregisters with function tracing disabled.

 # cat kmemleak
unreferenced object 0xffffffffa0020000 (size 4096):
  comm "swapper/0", pid 1, jiffies 4294668889 (age 569.209s)
  hex dump (first 32 bytes):
    55 ff 74 24 10 55 48 89 e5 ff 74 24 18 55 48 89  U.t$.UH...t$.UH.
    e5 48 81 ec a8 00 00 00 48 89 44 24 50 48 89 4c  .H......H.D$PH.L
  backtrace:
    [] kmemleak_vmalloc+0x85/0xf0
    [] __vmalloc_node_range+0x281/0x3e0
    [] module_alloc+0x4f/0x90
    [] arch_ftrace_update_trampoline+0x160/0x420
    [] ftrace_startup+0xe7/0x300
    [] register_ftrace_function+0x72/0x90
    [] trace_selftest_ops+0x204/0x397
    [] trace_selftest_startup_function+0x394/0x624
    [] run_tracer_selftest+0x15c/0x1d7
    [] init_trace_selftests+0x75/0x192
    [] do_one_initcall+0x90/0x1e2
    [] kernel_init_freeable+0x350/0x3fe
    [] kernel_init+0x13/0x122
    [] ret_from_fork+0x2a/0x40
    [] 0xffffffffffffffff
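
The early-exit problem described above can be modelled in userspace C. This is only a sketch under stated assumptions, not the ftrace source: `struct model_ops` and `model_shutdown()` are hypothetical names, and `free()` stands in for releasing the trampoline allocated at registration time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a registered ops structure. */
struct model_ops {
	bool dynamic;       /* ops was heap-allocated and owns a trampoline */
	void *trampoline;   /* memory obtained when the ops was registered */
};

/* Unregister path: the early exit must not skip the cleanup of dynamic ops. */
static int model_shutdown(struct model_ops *ops, bool tracing_enabled)
{
	assert(ops != NULL);

	if (!tracing_enabled) {
		/* Nothing to patch, but dynamic ops still own memory. */
		if (ops->dynamic)
			goto free_ops;
		return 0;
	}

	/* (A real implementation would update the call sites here.) */

free_ops:
	if (ops->dynamic) {
		free(ops->trampoline);
		ops->trampoline = NULL;
	}
	return 0;
}
```

Calling `model_shutdown()` with `tracing_enabled` set to false on a dynamic ops still releases the trampoline, which is the behaviour the fix restores.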

Cc: stable@vger.kernel.org
Fixes: 12cce594fa ("ftrace/x86: Allow !CONFIG_PREEMPT dynamic ops to use allocated trampolines")
Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Sasha Levin

tracing: Apply trace_clock changes to instance max buffer

2017-10-04T11:43:02+00:00

[ Upstream commit 170b3b1050e28d1ba0700e262f0899ffa4fccc52 ]

Currently trace_clock timestamps are applied to both regular and max
buffers only for global trace. For instance trace, trace_clock
timestamps are applied only to regular buffer. But, regular and max
buffers can be swapped, for example, following a snapshot. So, for
instance trace, bad timestamps can be seen following a snapshot.
Let's apply trace_clock timestamps to instance max buffer as well.
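
Why both buffers need the clock can be shown with a small userspace model. The names `model_buffer`, `model_trace_array`, `model_set_clock()` and `model_snapshot_swap()` are hypothetical and exist only for this sketch; an integer id stands in for the clock function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced model: each buffer remembers which clock stamps it. */
struct model_buffer {
	int clock_id;
};

struct model_trace_array {
	struct model_buffer *buffer;      /* regular (live) buffer */
	struct model_buffer *max_buffer;  /* max/snapshot buffer, may be NULL */
};

/* Apply the clock to both buffers so a later swap cannot mix clocks. */
static void model_set_clock(struct model_trace_array *tr, int clock_id)
{
	assert(tr != NULL && tr->buffer != NULL);
	tr->buffer->clock_id = clock_id;
	if (tr->max_buffer)
		tr->max_buffer->clock_id = clock_id;
}

/* A snapshot exchanges the regular and max buffers. */
static void model_snapshot_swap(struct model_trace_array *tr)
{
	struct model_buffer *tmp = tr->buffer;

	assert(tr->max_buffer != NULL);
	tr->buffer = tr->max_buffer;
	tr->max_buffer = tmp;
}
```

If `model_set_clock()` updated only `buffer`, the buffer that becomes live after `model_snapshot_swap()` would still carry the old clock id.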

Link: http://lkml.kernel.org/r/ebdb168d0be042dcdf51f81e696b17fabe3609c1.1504642143.git.tom.zanussi@linux.intel.com

Cc: stable@vger.kernel.org
Fixes: 277ba0446 ("tracing: Add interface to allow multiple trace buffers")
Signed-off-by: Baohong Liu 
Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Sasha Levin

ftrace: Fix selftest goto location on error

2017-10-04T11:43:02+00:00

[ Upstream commit 46320a6acc4fb58f04bcf78c4c942cc43b20f986 ]

In the second iteration of trace_selftest_ops(), the error goto label is
wrong in the case where trace_selftest_test_global_cnt is off. In the
case of error, it leaks the dynamic ops that was allocated.

Cc: stable@vger.kernel.org
Fixes: 95950c2e ("ftrace: Add self-tests for multiple function trace users")
Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Sasha Levin

locktorture: Fix potential memory leak with rw lock test

2017-10-04T01:36:47+00:00

[ Upstream commit f4dbba591945dc301c302672adefba9e2ec08dc5 ]

When running the locktorture module with the commands below and kmemleak enabled:

$ modprobe locktorture torture_type=rw_lock_irq
$ rmmod locktorture

kmemleak reports the following leaks:

root@10:~# echo scan > /sys/kernel/debug/kmemleak
[  323.197029] kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
root@10:~# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffffffc07592d500 (size 128):
  comm "modprobe", pid 368, jiffies 4294924118 (age 205.824s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 c3 7b 02 00 00 00 00 00  .........{......
    00 00 00 00 00 00 00 00 d7 9b 02 00 00 00 00 00  ................
  backtrace:
    [] create_object+0x110/0x288
    [] kmemleak_alloc+0x58/0xa0
    [] __kmalloc+0x234/0x318
    [] 0xffffff80006fa130
    [] do_one_initcall+0x44/0x138
    [] do_init_module+0x68/0x1cc
    [] load_module+0x1a68/0x22e0
    [] SyS_finit_module+0xe0/0xf0
    [] el0_svc_naked+0x24/0x28
    [] 0xffffffffffffffff
unreferenced object 0xffffffc07592d480 (size 128):
  comm "modprobe", pid 368, jiffies 4294924118 (age 205.824s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 3b 6f 01 00 00 00 00 00  ........;o......
    00 00 00 00 00 00 00 00 23 6a 01 00 00 00 00 00  ........#j......
  backtrace:
    [] create_object+0x110/0x288
    [] kmemleak_alloc+0x58/0xa0
    [] __kmalloc+0x234/0x318
    [] 0xffffff80006fa22c
    [] do_one_initcall+0x44/0x138
    [] do_init_module+0x68/0x1cc
    [] load_module+0x1a68/0x22e0
    [] SyS_finit_module+0xe0/0xf0
    [] el0_svc_naked+0x24/0x28
    [] 0xffffffffffffffff

This is because cxt.lwsa and cxt.lrsa are not freed on module exit. Free
them in lock_torture_cleanup(), and free writer_tasks if the allocation of
reader_tasks fails.
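
The allocate-then-unwind pattern can be sketched in plain userspace C. This is a simplified model, not the locktorture source: `struct torture_ctx`, `model_alloc()`, `ctx_init()` and `ctx_cleanup()` are hypothetical names, `calloc`/`free` stand in for the kernel allocators, and `fail_at` exists only to simulate an allocation failure.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the test context; not the kernel's struct. */
struct torture_ctx {
	long *write_stats;   /* plays the role of cxt.lwsa */
	long *read_stats;    /* plays the role of cxt.lrsa */
	int *writer_tasks;
	int *reader_tasks;
};

/* Release everything; safe on a partially initialised context. */
static void ctx_cleanup(struct torture_ctx *ctx)
{
	free(ctx->reader_tasks);
	free(ctx->writer_tasks);
	free(ctx->read_stats);
	free(ctx->write_stats);
	*ctx = (struct torture_ctx){ 0 };
}

/* Allocation wrapper that fails on the fail_at-th call (0 = never). */
static void *model_alloc(size_t n, size_t size, int *step, int fail_at)
{
	(*step)++;
	if (*step == fail_at)
		return NULL;
	return calloc(n, size);
}

/* Allocate in order; on any failure, undo the earlier allocations. */
static int ctx_init(struct torture_ctx *ctx, size_t nwriters,
		    size_t nreaders, int fail_at)
{
	int step = 0;

	assert(ctx != NULL && nwriters > 0 && nreaders > 0);
	*ctx = (struct torture_ctx){ 0 };

	ctx->write_stats = model_alloc(nwriters, sizeof(long), &step, fail_at);
	if (!ctx->write_stats)
		goto fail;
	ctx->read_stats = model_alloc(nreaders, sizeof(long), &step, fail_at);
	if (!ctx->read_stats)
		goto fail;
	ctx->writer_tasks = model_alloc(nwriters, sizeof(int), &step, fail_at);
	if (!ctx->writer_tasks)
		goto fail;
	ctx->reader_tasks = model_alloc(nreaders, sizeof(int), &step, fail_at);
	if (!ctx->reader_tasks)
		goto fail;
	return 0;

fail:
	ctx_cleanup(ctx);
	return -1;
}
```

Routing every failure through one cleanup function means a failed reader_tasks allocation cannot leave writer_tasks or the statistics arrays behind.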

Signed-off-by: Yang Shi 
Signed-off-by: Paul E. McKenney 
Reviewed-by: Josh Triplett 
Signed-off-by: Sasha Levin

perf/core: Fix group {cpu,task} validation

2017-10-04T01:36:35+00:00

[ Upstream commit 64aee2a965cf2954a038b5522f11d2cd2f0f8f3e ]

Regardless of which events form a group, it does not make sense for the
events to target different tasks and/or CPUs, as this leaves the group
inconsistent and impossible to schedule. The core perf code assumes that
these are consistent across (successfully initialised) groups.

Core perf code only verifies this when moving SW events into a HW
context. Thus, we can violate this requirement for pure SW groups and
pure HW groups, unless the relevant PMU driver happens to perform this
verification itself. These mismatched groups subsequently wreak havoc
elsewhere.

For example, we handle watchpoints as SW events, and reserve watchpoint
HW on a per-CPU basis at pmu::event_init() time to ensure that any event
that is initialised is guaranteed to have a slot at pmu::add() time.
However, the core code only checks the group leader's cpu filter (via
event_filter_match()), and can thus install follower events onto CPUs
violating their (mismatched) CPU filters, potentially installing them
into a CPU without sufficient reserved slots.

This can be triggered with the below test case, resulting in warnings
from arch backends.

  #define _GNU_SOURCE
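  #include <linux/hw_breakpoint.h>
  #include <linux/perf_event.h>
  #include <sched.h>
  #include <stdio.h>
  #include <sys/prctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>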

  static int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
			   int group_fd, unsigned long flags)
  {
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
  }

  char watched_char;

  struct perf_event_attr wp_attr = {
	.type = PERF_TYPE_BREAKPOINT,
	.bp_type = HW_BREAKPOINT_RW,
	.bp_addr = (unsigned long)&watched_char,
	.bp_len = 1,
	.size = sizeof(wp_attr),
  };

  int main(int argc, char *argv[])
  {
	int leader, ret;
	cpu_set_t cpus;

	/*
	 * Force use of CPU0 to ensure our CPU0-bound events get scheduled.
	 */
	CPU_ZERO(&cpus);
	CPU_SET(0, &cpus);
	ret = sched_setaffinity(0, sizeof(cpus), &cpus);
	if (ret) {
		printf("Unable to set cpu affinity\n");
		return 1;
	}

	/* open leader event, bound to this task, CPU0 only */
	leader = perf_event_open(&wp_attr, 0, 0, -1, 0);
	if (leader < 0) {
		printf("Couldn't open leader: %d\n", leader);
		return 1;
	}

	/*
	 * Open a follower event that is bound to the same task, but a
	 * different CPU. This means that the group should never be possible to
	 * schedule.
	 */
	ret = perf_event_open(&wp_attr, 0, 1, leader, 0);
	if (ret < 0) {
		printf("Couldn't open mismatched follower: %d\n", ret);
		return 1;
	} else {
		printf("Opened leader/follower with mismatched CPUs\n");
	}

	/*
	 * Open as many independent events as we can, all bound to the same
	 * task, CPU0 only.
	 */
	do {
		ret = perf_event_open(&wp_attr, 0, 0, -1, 0);
	} while (ret >= 0);

	/*
	 * Force enable/disable all events to trigger the erroneous
	 * installation of the follower event.
	 */
	printf("Opened all events. Toggling..\n");
	for (;;) {
		prctl(PR_TASK_PERF_EVENTS_DISABLE, 0, 0, 0, 0);
		prctl(PR_TASK_PERF_EVENTS_ENABLE, 0, 0, 0, 0);
	}

	return 0;
  }

Fix this by validating this requirement regardless of whether we're
moving events.
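
The requirement being validated can be written as a small predicate. The sketch below is a userspace model under stated assumptions: `struct model_event` and `group_target_matches()` are hypothetical names, and an integer task id stands in for the task the event context is attached to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of an event: only what the check needs. */
struct model_event {
	int cpu;       /* -1 means "any CPU" */
	int task_id;   /* -1 means "no task" (a per-CPU event) */
};

/* A follower may join a group only if it targets the same task and CPU. */
static bool group_target_matches(const struct model_event *leader,
				 const struct model_event *follower)
{
	assert(leader != NULL && follower != NULL);
	return leader->cpu == follower->cpu &&
	       leader->task_id == follower->task_id;
}
```

With this check applied to every group, the second perf_event_open() call in the test case (same task, CPU 1, leader on CPU 0) would be rejected rather than accepted.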

Signed-off-by: Mark Rutland 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Alexander Shishkin 
Cc: Arnaldo Carvalho de Melo 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Zhou Chengming 
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1498142498-15758-1-git-send-email-mark.rutland@arm.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Sasha Levin

tracing: Fix freeing of filter in create_filter() when set_str is false

2017-10-04T01:36:34+00:00

[ Upstream commit 8b0db1a5bdfcee0dbfa89607672598ae203c9045 ]

Performing the following task with kmemleak enabled:

 # cd /sys/kernel/tracing/events/irq/irq_handler_entry/
 # echo 'enable_event:kmem:kmalloc:3 if irq >' > trigger
 # echo 'enable_event:kmem:kmalloc:3 if irq > 31' > trigger
 # echo scan > /sys/kernel/debug/kmemleak
 # cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff8800b9290308 (size 32):
  comm "bash", pid 1114, jiffies 4294848451 (age 141.139s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [] kmemleak_alloc+0x4a/0xa0
    [] kmem_cache_alloc_trace+0x158/0x290
    [] create_filter_start.constprop.28+0x99/0x940
    [] create_filter+0xa9/0x160
    [] create_event_filter+0xc/0x10
    [] set_trigger_filter+0xe5/0x210
    [] event_enable_trigger_func+0x324/0x490
    [] event_trigger_write+0x1a2/0x260
    [] __vfs_write+0xd7/0x380
    [] vfs_write+0x101/0x260
    [] SyS_write+0xab/0x130
    [] entry_SYSCALL_64_fastpath+0x1f/0xbe
    [] 0xffffffffffffffff

The function create_filter() is passed a 'filterp' pointer that gets
allocated, and if "set_str" is true, it is up to the caller to free it, even
on error. The problem is that the pointer is not freed by create_filter()
when set_str is false. This is a bug, and it is not up to the caller to free
the filter on error if it doesn't care about the string.
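
The ownership rule can be illustrated with a userspace model. Everything here is an assumption made for the sketch: `struct model_filter`, `model_parse()`, `model_create_filter()` and `model_free_filter()` are hypothetical, and the toy parser simply rejects a string that ends in a dangling '>'.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's filter object. */
struct model_filter {
	char *filter_string;   /* kept only when the caller asked for it */
};

static void model_free_filter(struct model_filter *f)
{
	if (!f)
		return;
	free(f->filter_string);
	free(f);
}

/* Toy parser: treats a string ending in a dangling '>' as malformed. */
static int model_parse(const char *s)
{
	size_t len = strlen(s);

	return (len > 0 && s[len - 1] == '>') ? -1 : 0;
}

static int model_create_filter(const char *str, bool set_str,
			       struct model_filter **filterp)
{
	struct model_filter *f;
	int err;

	assert(str != NULL && filterp != NULL);
	*filterp = NULL;

	f = calloc(1, sizeof(*f));
	if (!f)
		return -1;

	if (set_str) {
		/* Keep a copy so the caller can inspect it, even on error. */
		size_t n = strlen(str) + 1;

		f->filter_string = malloc(n);
		if (f->filter_string)
			memcpy(f->filter_string, str, n);
	}

	err = model_parse(str);
	if (err && !set_str) {
		/* The caller never sees this filter, so release it here. */
		model_free_filter(f);
		f = NULL;
	}

	*filterp = f;
	return err;
}
```

In this model a caller that passes `set_str == false` gets a NULL `*filterp` on error and has nothing to free, while a caller that passes `set_str == true` always owns the returned filter.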

Link: http://lkml.kernel.org/r/1502705898-27571-2-git-send-email-chuhu@redhat.com

Cc: stable@vger.kernel.org
Fixes: 38b78eb85 ("tracing: Factorize filter creation")
Reported-by: Chunyu Hu 
Tested-by: Chunyu Hu 
Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Sasha Levin

audit: Fix use after free in audit_remove_watch_rule()

2017-10-04T01:36:29+00:00

[ Upstream commit d76036ab47eafa6ce52b69482e91ca3ba337d6d6 ]

audit_remove_watch_rule() drops watch's reference to parent but then
continues to work with it. That is not safe as parent can get freed once
we drop our reference. The following is a trivial reproducer:

mount -o loop image /mnt
touch /mnt/file
auditctl -w /mnt/file -p wax
umount /mnt
auditctl -D


Grab our own reference in audit_remove_watch_rule() earlier to make sure
mark does not get freed under us.
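
The reference ordering can be sketched with a toy refcount in userspace C. The names `model_parent`, `parent_new()`, `parent_get()`, `parent_put()` and `remove_watch()` are hypothetical, and the `freed_flag` pointer exists only so the sketch can observe when the object is released.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted parent; not the kernel's audit_parent. */
struct model_parent {
	int refcount;
	int watches;       /* watches still attached to this parent */
	int *freed_flag;   /* set to 1 when the parent is released */
};

static struct model_parent *parent_new(int *freed_flag)
{
	struct model_parent *p = calloc(1, sizeof(*p));

	if (!p)
		return NULL;
	p->refcount = 1;   /* the reference held by the single watch */
	p->watches = 1;
	p->freed_flag = freed_flag;
	return p;
}

static void parent_get(struct model_parent *p)
{
	assert(p->refcount > 0);
	p->refcount++;
}

static void parent_put(struct model_parent *p)
{
	assert(p->refcount > 0);
	if (--p->refcount == 0) {
		*p->freed_flag = 1;
		free(p);
	}
}

/* Remove the last watch without touching the parent after it is freed. */
static void remove_watch(struct model_parent *p)
{
	parent_get(p);   /* our own reference, taken first */
	parent_put(p);   /* the watch's reference; cannot reach zero here */
	p->watches--;    /* still safe: we hold a reference */
	parent_put(p);   /* drop ours; the parent may be freed now */
}
```

Without the initial `parent_get()`, the first `parent_put()` would free the object and the following `p->watches--` would be a use after free.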

CC: stable@vger.kernel.org
Reported-by: Tony Jones 
Signed-off-by: Jan Kara 
Tested-by: Tony Jones 
Signed-off-by: Paul Moore 
Signed-off-by: Sasha Levin

signal: protect SIGNAL_UNKILLABLE from unintentional clearing.

2017-09-10T20:36:07+00:00

[ Upstream commit 2d39b3cd34e6d323720d4c61bd714f5ae202c022 ]

Since commit 00cd5c37afd5 ("ptrace: permit ptracing of /sbin/init") we
can now trace init processes.  init is initially protected with
SIGNAL_UNKILLABLE which will prevent fatal signals such as SIGSTOP, but
there are a number of paths during tracing where SIGNAL_UNKILLABLE can
be implicitly cleared.

This can result in init becoming stoppable/killable after tracing.  For
example, running:

  while true; do kill -STOP 1; done &
  strace -p 1

and then stopping strace and the kill loop will result in init being
left in state TASK_STOPPED.  Sending SIGCONT to init will resume it, but
init will now respond to future SIGSTOP signals rather than ignoring
them.

Make sure that when setting SIGNAL_STOP_CONTINUED/SIGNAL_STOP_STOPPED
that we don't clear SIGNAL_UNKILLABLE.
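
The idea of updating only the stop bits can be sketched as a pure function. The bit values and the `M_` prefixed names below are illustrative assumptions for this model, not the kernel's definitions, and `set_stop_flags()` is a hypothetical helper.

```c
#include <assert.h>

/* Illustrative bit values only; the kernel's definitions may differ. */
#define M_SIGNAL_STOP_STOPPED    0x01u
#define M_SIGNAL_STOP_CONTINUED  0x02u
#define M_SIGNAL_UNKILLABLE      0x40u
#define M_SIGNAL_STOP_MASK \
	(M_SIGNAL_STOP_STOPPED | M_SIGNAL_STOP_CONTINUED)

/*
 * Replace only the stop-related bits of cur with stop_flags,
 * leaving every other bit (including UNKILLABLE) untouched.
 */
static unsigned int set_stop_flags(unsigned int cur, unsigned int stop_flags)
{
	assert((stop_flags & ~M_SIGNAL_STOP_MASK) == 0);
	return (cur & ~M_SIGNAL_STOP_MASK) | stop_flags;
}
```

A plain assignment of the new stop flag would discard the unkillable bit; masking first keeps it across stop and continue transitions.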

Link: http://lkml.kernel.org/r/20170104122017.25047-1-jamie.iles@oracle.com
Signed-off-by: Jamie Iles 
Acked-by: Oleg Nesterov 
Cc: Alexander Viro 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin

workqueue: restore WQ_UNBOUND/max_active==1 to be ordered

2017-09-10T20:35:58+00:00

[ Upstream commit 5c0338c68706be53b3dc472e4308961c36e4ece1 ]

The combination of WQ_UNBOUND and max_active == 1 used to imply
ordered execution.  After the NUMA affinity commit 4c16bd327c74
("workqueue: implement NUMA affinity for unbound workqueues"), this is
no longer true due to per-node worker pools.

While the right way to create an ordered workqueue is
alloc_ordered_workqueue(), the documentation has been misleading for a
long time and people do use WQ_UNBOUND and max_active == 1 for ordered
workqueues which can lead to subtle bugs which are very difficult to
trigger.

It's unlikely that we'd see noticeable performance impact by enforcing
ordering on WQ_UNBOUND / max_active == 1 workqueues.  Let's
automatically set __WQ_ORDERED for those workqueues.
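
The rule itself is a one-line flag adjustment, sketched here as a userspace function. The bit positions and the `M_` prefixed names are illustrative assumptions, not values copied from the kernel headers; `M_WQ_ORDERED` stands in for __WQ_ORDERED and `normalize_wq_flags()` is a hypothetical helper.

```c
#include <assert.h>

/* Illustrative flag bits; not copied from the kernel headers. */
#define M_WQ_UNBOUND  (1u << 1)
#define M_WQ_ORDERED  (1u << 17)   /* stands in for __WQ_ORDERED */

/* Unbound workqueues limited to one active item are treated as ordered. */
static unsigned int normalize_wq_flags(unsigned int flags, int max_active)
{
	assert(max_active >= 0);
	if ((flags & M_WQ_UNBOUND) && max_active == 1)
		flags |= M_WQ_ORDERED;
	return flags;
}
```

Other combinations pass through unchanged, so only callers that relied on the old implicit ordering are affected.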

Signed-off-by: Tejun Heo 
Reported-by: Christoph Hellwig 
Reported-by: Alexei Potashnik 
Fixes: 4c16bd327c74 ("workqueue: implement NUMA affinity for unbound workqueues")
Cc: stable@vger.kernel.org # v3.10+
Signed-off-by: Sasha Levin

/proc/iomem: only expose physical resource addresses to privileged users

2017-09-10T20:35:50+00:00

[ Upstream commit 51d7b120418e99d6b3bf8df9eb3cc31e8171dee4 ]

In commit c4004b02f8e5b ("x86: remove the kernel code/data/bss resources
from /proc/iomem") I was hoping to remove the physical kernel address
data from /proc/iomem entirely, but that had to be reverted because some
system programs actually use it.

This limits all the detailed resource information to properly
credentialed users instead.
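
One way to picture the result is a formatter that zeroes the address range for unprivileged readers. This is a userspace sketch under stated assumptions: `format_resource()` is a hypothetical helper, the `privileged` flag stands in for whatever credential check is performed, and the output layout is only an approximation of a /proc/iomem line.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Format one resource line into buf. Unprivileged readers get the
 * resource name but see zeroes instead of the real addresses.
 */
static int format_resource(char *buf, size_t len,
			   unsigned long long start, unsigned long long end,
			   const char *name, bool privileged)
{
	assert(buf != NULL && name != NULL);
	if (!privileged)
		start = end = 0;
	return snprintf(buf, len, "%08llx-%08llx : %s", start, end, name);
}
```

The resource names stay visible in both cases, so programs that only parse the layout of the file keep working.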

Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin