linux-stable.git/kernel, branch linux-4.8.y

fgraph: Handle a case where a tracer ignores set_graph_notrace

2017-01-09T07:21:58+00:00

commit 794de08a16cf1fc1bf785dc48f66d36218cf6d88 upstream.

Both the wakeup and irqsoff tracers can use the function graph tracer when
the display-graph option is set. The problem is that they ignore the notrace
file, and record the entry of functions that would be ignored by the
function_graph tracer. This causes the trace->depth to be recorded into the
ring buffer. The set_graph_notrace uses a trick by adding a large negative
number to the trace->depth when a graph function is to be ignored.

On trace output, the graph function uses the depth to record a stack of
functions. But since the depth is negative, it accesses the array with a
negative number and causes an out of bounds access that can cause a kernel
oops or corrupt data.

Have the print functions handle cases where a tracer still records functions
even when they are in set_graph_notrace.

Also add warnings if the depth is below zero before accessing the array.

Note, the function graph logic will still prevent the return of these
functions from being recorded, which means that they will be left hanging
without a return. For example:

   # echo '*spin*' > set_graph_notrace
   # echo 1 > options/display-graph
   # echo wakeup > current_tracer
   # cat trace
   [...]
      _raw_spin_lock() {
        preempt_count_add() {
        do_raw_spin_lock() {
      update_rq_clock();

Where it should look like:

      _raw_spin_lock() {
        preempt_count_add();
        do_raw_spin_lock();
      }
      update_rq_clock();

Cc: Namhyung Kim 
Fixes: 29ad23b00474 ("ftrace: Add set_graph_notrace filter")
Signed-off-by: Steven Rostedt 
Signed-off-by: Greg Kroah-Hartman

timekeeping_Force_unsigned_clocksource_to_nanoseconds_conversion

2017-01-09T07:21:47+00:00

commit 9c1645727b8fa90d07256fdfcc45bf831242a3ab upstream.

The clocksource delta to nanoseconds conversion is using signed math, but
the delta is unsigned. This makes the conversion space smaller than
necessary and in case of a multiplication overflow the conversion can
become negative. The conversion is done with scaled math:

    s64 nsec_delta = ((s64)clkdelta * clk->mult) >> clk->shift;

Shifting a signed integer right obvioulsy preserves the sign, which has
interesting consequences:

 - Time jumps backwards

 - __iter_div_u64_rem() which is used in one of the calling code pathes
   will take forever to piecewise calculate the seconds/nanoseconds part.

This has been reported by several people with different scenarios:

David observed that when stopping a VM with a debugger:

 "It was essentially the stopped by debugger case.  I forget exactly why,
  but the guest was being explicitly stopped from outside, it wasn't just
  scheduling lag.  I think it was something in the vicinity of 10 minutes
  stopped."

 When lifting the stop the machine went dead.

The stopped by debugger case is not really interesting, but nevertheless it
would be a good thing not to die completely.

But this was also observed on a live system by Liav:

 "When the OS is too overloaded, delta will get a high enough value for the
  msb of the sum delta * tkr->mult + tkr->xtime_nsec to be set, and so
  after the shift the nsec variable will gain a value similar to
  0xffffffffff000000."

Unfortunately this has been reintroduced recently with commit 6bd58f09e1d8
("time: Add cycles to nanoseconds translation"). It had been fixed a year
ago already in commit 35a4933a8959 ("time: Avoid signed overflow in
timekeeping_get_ns()").

Though it's not surprising that the issue has been reintroduced because the
function itself and the whole call chain uses s64 for the result and the
propagation of it. The change in this recent commit is subtle:

   s64 nsec;

-  nsec = (d * m + n) >> s:
+  nsec = d * m + n;
+  nsec >>= s;

d being type of cycle_t adds another level of obfuscation.

This wouldn't have happened if the previous change to unsigned computation
would have made the 'nsec' variable u64 right away and a follow up patch
had cleaned up the whole call chain.

There have been patches submitted which basically did a revert of the above
patch leaving everything else unchanged as signed. Back to square one. This
spawned a admittedly pointless discussion about potential users which rely
on the unsigned behaviour until someone pointed out that it had been fixed
before. The changelogs of said patches added further confusion as they made
finally false claims about the consequences for eventual users which expect
signed results.

Despite delta being cycle_t, aka. u64, it's very well possible to hand in
a signed negative value and the signed computation will happily return the
correct result. But nobody actually sat down and analyzed the code which
was added as user after the propably unintended signed conversion.

Though in sensitive code like this it's better to analyze it proper and
make sure that nothing relies on this than hunting the subtle wreckage half
a year later. After analyzing all call chains it stands that no caller can
hand in a negative value (which actually would work due to the s64 cast)
and rely on the signed math to do the right thing.

Change the conversion function to unsigned math. The conversion of all call
chains is done in a follow up patch.

This solves the starvation issue, which was caused by the negative result,
but it does not solve the underlying problem. It merily procrastinates
it. When the timekeeper update is deferred long enough that the unsigned
multiplication overflows, then time going backwards is observable again.

It does neither solve the issue of clocksources with a small counter width
which will wrap around possibly several times and cause random time stamps
to be generated. But those are usually not found on systems used for
virtualization, so this is likely a non issue.

I took the liberty to claim authorship for this simply because
analyzing all callsites and writing the changelog took substantially
more time than just making the simple s/s64/u64/ change and ignore the
rest.

Fixes: 6bd58f09e1d8 ("time: Add cycles to nanoseconds translation")
Reported-by: David Gibson 
Reported-by: Liav Rehana 
Signed-off-by: Thomas Gleixner 
Reviewed-by: David Gibson 
Acked-by: Peter Zijlstra (Intel) 
Cc: Parit Bhargava 
Cc: Laurent Vivier 
Cc: "Christopher S. Hall" 
Cc: Chris Metcalf 
Cc: Richard Cochran 
Cc: John Stultz 
Link: http://lkml.kernel.org/r/20161208204228.688545601@linutronix.de
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman

kernel/debug/debug_core.c: more properly delay for secondary CPUs

2017-01-06T10:16:24+00:00

commit 2d13bb6494c807bcf3f78af0e96c0b8615a94385 upstream.

We've got a delay loop waiting for secondary CPUs.  That loop uses
loops_per_jiffy.  However, loops_per_jiffy doesn't actually mean how
many tight loops make up a jiffy on all architectures.  It is quite
common to see things like this in the boot log:

  Calibrating delay loop (skipped), value calculated using timer
  frequency.. 48.00 BogoMIPS (lpj=24000)

In my case I was seeing lots of cases where other CPUs timed out
entering the debugger only to print their stack crawls shortly after the
kdb> prompt was written.

Elsewhere in kgdb we already use udelay(), so that should be safe enough
to use to implement our timeout.  We'll delay 1 ms for 1000 times, which
should give us a full second of delay (just like the old code wanted)
but allow us to notice that we're done every 1 ms.

[akpm@linux-foundation.org: simplifications, per Daniel]
Link: http://lkml.kernel.org/r/1477091361-2039-1-git-send-email-dianders@chromium.org
Signed-off-by: Douglas Anderson 
Reviewed-by: Daniel Thompson 
Cc: Jason Wessel 
Cc: Brian Norris 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

kernel/watchdog: use nmi registers snapshot in hardlockup handler

2017-01-06T10:16:24+00:00

commit 4d1f0fb096aedea7bb5489af93498a82e467c480 upstream.

NMI handler doesn't call set_irq_regs(), it's set only by normal IRQ.
Thus get_irq_regs() returns NULL or stale registers snapshot with IP/SP
pointing to the code interrupted by IRQ which was interrupted by NMI.
NULL isn't a problem: in this case watchdog calls dump_stack() and
prints full stack trace including NMI.  But if we're stuck in IRQ
handler then NMI watchlog will print stack trace without IRQ part at
all.

This patch uses registers snapshot passed into NMI handler as arguments:
these registers point exactly to the instruction interrupted by NMI.

Fixes: 55537871ef66 ("kernel/watchdog.c: perform all-CPU backtrace in case of hard lockup")
Link: http://lkml.kernel.org/r/146771764784.86724.6006627197118544150.stgit@buzz
Signed-off-by: Konstantin Khlebnikov 
Cc: Jiri Kosina 
Cc: Ulrich Obergfell 
Cc: Aaron Tomlin 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

ptrace: Capture the ptracer's creds not PT_PTRACE_CAP

2017-01-06T10:16:21+00:00

commit 64b875f7ac8a5d60a4e191479299e931ee949b67 upstream.

When the flag PT_PTRACE_CAP was added the PTRACE_TRACEME path was
overlooked.  This can result in incorrect behavior when an application
like strace traces an exec of a setuid executable.

Further PT_PTRACE_CAP does not have enough information for making good
security decisions as it does not report which user namespace the
capability is in.  This has already allowed one mistake through
insufficient granulariy.

I found this issue when I was testing another corner case of exec and
discovered that I could not get strace to set PT_PTRACE_CAP even when
running strace as root with a full set of caps.

This change fixes the above issue with strace allowing stracing as
root a setuid executable without disabling setuid.  More fundamentaly
this change allows what is allowable at all times, by using the correct
information in it's decision.

Fixes: 4214e42f96d4 ("v2.4.9.11 -> v2.4.9.12")
Signed-off-by: "Eric W. Biederman" 
Signed-off-by: Greg Kroah-Hartman

mm: Add a user_ns owner to mm_struct and fix ptrace permission checks

2017-01-06T10:16:21+00:00

commit bfedb589252c01fa505ac9f6f2a3d5d68d707ef4 upstream.

During exec dumpable is cleared if the file that is being executed is
not readable by the user executing the file.  A bug in
ptrace_may_access allows reading the file if the executable happens to
enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER).

This problem is fixed with only necessary userspace breakage by adding
a user namespace owner to mm_struct, captured at the time of exec, so
it is clear in which user namespace CAP_SYS_PTRACE must be present in
to be able to safely give read permission to the executable.

The function ptrace_may_access is modified to verify that the ptracer
has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
This ensures that if the task changes it's cred into a subordinate
user namespace it does not become ptraceable.

The function ptrace_attach is modified to only set PT_PTRACE_CAP when
CAP_SYS_PTRACE is held over task->mm->user_ns.  The intent of
PT_PTRACE_CAP is to be a flag to note that whatever permission changes
the task might go through the tracer has sufficient permissions for
it not to be an issue.  task->cred->user_ns is always the same
as or descendent of mm->user_ns.  Which guarantees that having
CAP_SYS_PTRACE over mm->user_ns is the worst case for the tasks
credentials.

To prevent regressions mm->dumpable and mm->user_ns are not considered
when a task has no mm.  As simply failing ptrace_may_attach causes
regressions in privileged applications attempting to read things
such as /proc//stat

Acked-by: Kees Cook 
Tested-by: Cyrill Gorcunov 
Fixes: 8409cca70561 ("userns: allow ptrace from non-init user namespaces")
Signed-off-by: "Eric W. Biederman" 
Signed-off-by: Greg Kroah-Hartman

exec: Ensure mm->user_ns contains the execed files

2017-01-06T10:16:21+00:00

commit f84df2a6f268de584a201e8911384a2d244876e3 upstream.

When the user namespace support was merged the need to prevent
ptrace from revealing the contents of an unreadable executable
was overlooked.

Correct this oversight by ensuring that the executed file
or files are in mm->user_ns, by adjusting mm->user_ns.

Use the new function privileged_wrt_inode_uidgid to see if
the executable is a member of the user namespace, and as such
if having CAP_SYS_PTRACE in the user namespace should allow
tracing the executable.  If not update mm->user_ns to
the parent user namespace until an appropriate parent is found.

Reported-by: Jann Horn 
Fixes: 9e4a36ece652 ("userns: Fail exec for suid and sgid binaries with ids outside our user namespace.")
Signed-off-by: "Eric W. Biederman" 
Signed-off-by: Greg Kroah-Hartman

hotplug: Make register and unregister notifier API symmetric

2016-12-15T16:50:36+00:00

commit 777c6e0daebb3fcefbbd6f620410a946b07ef6d0 upstream.

Yu Zhao has noticed that __unregister_cpu_notifier only unregisters its
notifiers when HOTPLUG_CPU=y while the registration might succeed even
when HOTPLUG_CPU=n if MODULE is enabled. This means that e.g. zswap
might keep a stale notifier on the list on the manual clean up during
the pool tear down and thus corrupt the list. Resulting in the following

[  144.964346] BUG: unable to handle kernel paging request at ffff880658a2be78
[  144.971337] IP: [] raw_notifier_chain_register+0x1b/0x40

[  145.122628] Call Trace:
[  145.125086]  [] __register_cpu_notifier+0x18/0x20
[  145.131350]  [] zswap_pool_create+0x273/0x400
[  145.137268]  [] __zswap_param_set+0x1fc/0x300
[  145.143188]  [] ? trace_hardirqs_on+0xd/0x10
[  145.149018]  [] ? kernel_param_lock+0x28/0x30
[  145.154940]  [] ? __might_fault+0x4f/0xa0
[  145.160511]  [] zswap_compressor_param_set+0x17/0x20
[  145.167035]  [] param_attr_store+0x5c/0xb0
[  145.172694]  [] module_attr_store+0x1d/0x30
[  145.178443]  [] sysfs_kf_write+0x4f/0x70
[  145.183925]  [] kernfs_fop_write+0x149/0x180
[  145.189761]  [] __vfs_write+0x18/0x40
[  145.194982]  [] vfs_write+0xb2/0x1a0
[  145.200122]  [] SyS_write+0x52/0xa0
[  145.205177]  [] entry_SYSCALL_64_fastpath+0x12/0x17

This can be even triggered manually by changing
/sys/module/zswap/parameters/compressor multiple times.

Fix this issue by making unregister APIs symmetric to the register so
there are no surprises.

Fixes: 47e627bc8c9a ("[PATCH] hotplug: Allow modules to use the cpu hotplug notifiers even if !CONFIG_HOTPLUG_CPU")
Reported-and-tested-by: Yu Zhao 
Signed-off-by: Michal Hocko 
Cc: linux-mm@kvack.org
Cc: Andrew Morton 
Cc: Dan Streetman 
Link: http://lkml.kernel.org/r/20161207135438.4310-1-mhocko@kernel.org
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman

sched/autogroup: Fix 64-bit kernel nice level adjustment

2016-12-15T16:50:35+00:00

commit 83929cce95251cc77e5659bf493bd424ae0e7a67 upstream.

Michael Kerrisk reported:

> Regarding the previous paragraph...  My tests indicate
> that writing *any* value to the autogroup [nice priority level]
> file causes the task group to get a lower priority.

Because autogroup didn't call the then meaningless scale_load()...

Autogroup nice level adjustment has been broken ever since load
resolution was increased for 64-bit kernels.  Use scale_load() to
scale group weight.

Michael Kerrisk tested this patch to fix the problem:

> Applied and tested against 4.9-rc6 on an Intel u7 (4 cores).
> Test setup:
>
> Terminal window 1: running 40 CPU burner jobs
> Terminal window 2: running 40 CPU burner jobs
> Terminal window 1: running  1 CPU burner job
>
> Demonstrated that:
> * Writing "0" to the autogroup file for TW1 now causes no change
>   to the rate at which the process on the terminal consume CPU.
> * Writing -20 to the autogroup file for TW1 caused those processes
>   to get the lion's share of CPU while TW2 TW3 get a tiny amount.
> * Writing -20 to the autogroup files for TW1 and TW3 allowed the
>   process on TW3 to get as much CPU as it was getting as when
>   the autogroup nice values for both terminals were 0.

Reported-by: Michael Kerrisk 
Tested-by: Michael Kerrisk 
Signed-off-by: Mike Galbraith 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-man 
Link: http://lkml.kernel.org/r/1479897217.4306.6.camel@gmx.de
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

locking/rtmutex: Use READ_ONCE() in rt_mutex_owner()

2016-12-15T16:50:34+00:00

commit 1be5d4fa0af34fb7bafa205aeb59f5c7cc7a089d upstream.

While debugging the rtmutex unlock vs. dequeue race Will suggested to use
READ_ONCE() in rt_mutex_owner() as it might race against the
cmpxchg_release() in unlock_rt_mutex_safe().

Will: "It's a minor thing which will most likely not matter in practice"

Careful search did not unearth an actual problem in todays code, but it's
better to be safe than surprised.

Suggested-by: Will Deacon 
Signed-off-by: Thomas Gleixner 
Acked-by: Peter Zijlstra (Intel) 
Cc: David Daney 
Cc: Linus Torvalds 
Cc: Mark Rutland 
Cc: Peter Zijlstra 
Cc: Sebastian Siewior 
Cc: Steven Rostedt 
Link: http://lkml.kernel.org/r/20161130210030.431379999@linutronix.de
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman