linux-stable.git/kernel, branch linux-3.4.y

time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge

2016-10-26T15:15:45+00:00

commit 833f32d763028c1bb371c64f457788b933773b3e upstream.

Currently, leapsecond adjustments are done at tick time. As a result,
the leapsecond was applied at the first timer tick *after* the
leapsecond (~1-10ms late depending on HZ), rather then exactly on the
second edge.

This was in part historical from back when we were always tick based,
but correcting this since has been avoided since it adds extra
conditional checks in the gettime fastpath, which has performance
overhead.

However, it was recently pointed out that ABS_TIME CLOCK_REALTIME
timers set for right after the leapsecond could fire a second early,
since some timers may be expired before we trigger the timekeeping
timer, which then applies the leapsecond.

This isn't quite as bad as it sounds, since behaviorally it is similar
to what is possible w/ ntpd made leapsecond adjustments done w/o using
the kernel discipline. Where due to latencies, timers may fire just
prior to the settimeofday call. (Also, one should note that all
applications using CLOCK_REALTIME timers should always be careful,
since they are prone to quirks from settimeofday() disturbances.)

However, the purpose of having the kernel do the leap adjustment is to
avoid such latencies, so I think this is worth fixing.

So in order to properly keep those timers from firing a second early,
this patch modifies the ntp and timekeeping logic so that we keep
enough state so that the update_base_offsets_now accessor, which
provides the hrtimer core the current time, can check and apply the
leapsecond adjustment on the second edge. This prevents the hrtimer
core from expiring timers too early.

This patch does not modify any other time read path, so no additional
overhead is incurred. However, this also means that the leap-second
continues to be applied at tick time for all other read-paths.

Apologies to Richard Cochran, who pushed for similar changes years
ago, which I resisted due to the concerns about the performance
overhead.

While I suspect this isn't extremely critical, folks who care about
strict leap-second correctness will likely want to watch
this. Potentially a -stable candidate eventually.

Originally-suggested-by: Richard Cochran 
Reported-by: Daniel Bristot de Oliveira 
Reported-by: Prarit Bhargava 
Signed-off-by: John Stultz 
Cc: Richard Cochran 
Cc: Jan Kara 
Cc: Jiri Bohac 
Cc: Shuah Khan 
Cc: Ingo Molnar 
Link: http://lkml.kernel.org/r/1434063297-28657-4-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner 
[Yadi: Move do_adjtimex to timekeeping.c and solve context issues]
Signed-off-by: Hu 
Signed-off-by: Zefan Li

genirq: Prevent chip buslock deadlock

2016-10-26T15:15:37+00:00

commit abc7e40c81d113ef4bacb556f0a77ca63ac81d85 upstream.

If a interrupt chip utilizes chip->buslock then free_irq() can
deadlock in the following way:

CPU0				CPU1
				interrupt(X) (Shared or spurious)
free_irq(X)			interrupt_thread(X)
chip_bus_lock(X)
				   irq_finalize_oneshot(X)
				     chip_bus_lock(X)
synchronize_irq(X)

synchronize_irq() waits for the interrupt thread to complete,
i.e. forever.

Solution is simple: Drop chip_bus_lock() before calling
synchronize_irq() as we do with the irq_desc lock. There is nothing to
be protected after the point where irq_desc lock has been released.

This adds chip_bus_lock/unlock() to the remove_irq() code path, but
that's actually correct in the case where remove_irq() is called on
such an interrupt. The current users of remove_irq() are not affected
as none of those interrupts is on a chip which requires buslock.

Reported-by: Fredrik Markström 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Zefan Li

sched/core: Clear the root_domain cpumasks in init_rootdomain()

2016-10-26T15:15:33+00:00

commit 8295c69925ad53ec32ca54ac9fc194ff21bc40e2 upstream.

root_domain::rto_mask allocated through alloc_cpumask_var()
contains garbage data, this may cause problems. For instance,
When doing pull_rt_task(), it may do useless iterations if
rto_mask retains some extra garbage bits. Worse still, this
violates the isolated domain rule for clustered scheduling
using cpuset, because the tasks(with all the cpus allowed)
belongs to one root domain can be pulled away into another
root domain.

The patch cleans the garbage by using zalloc_cpumask_var()
instead of alloc_cpumask_var() for root_domain::rto_mask
allocation, thereby addressing the issues.

Do the same thing for root_domain's other cpumask memembers:
dlo_mask, span, and online.

Signed-off-by: Xunlei Pang 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Steven Rostedt 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1449057179-29321-1-git-send-email-xlpang@redhat.com
Signed-off-by: Ingo Molnar 
[lizf: there's no rd->dlo_mask, so remove the change to it]
Signed-off-by: Zefan Li

ring-buffer: Update read stamp with first real commit on page

2016-10-26T15:15:30+00:00

commit b81f472a208d3e2b4392faa6d17037a89442f4ce upstream.

Do not update the read stamp after swapping out the reader page from the
write buffer. If the reader page is swapped out of the buffer before an
event is written to it, then the read_stamp may get an out of date
timestamp, as the page timestamp is updated on the first commit to that
page.

rb_get_reader_page() only returns a page if it has an event on it, otherwise
it will return NULL. At that point, check if the page being returned has
events and has not been read yet. Then at that point update the read_stamp
to match the time stamp of the reader page.

Signed-off-by: Steven Rostedt 
Signed-off-by: Zefan Li

perf: Fix inherited events vs. tracepoint filters

2016-10-26T15:15:28+00:00

commit b71b437eedaed985062492565d9d421d975ae845 upstream.

Arnaldo reported that tracepoint filters seem to misbehave (ie. not
apply) on inherited events.

The fix is obvious; filters are only set on the actual (parent)
event, use the normal pattern of using this parent event for filters.
This is safe because each child event has a reference to it.

Reported-by: Arnaldo Carvalho de Melo 
Tested-by: Arnaldo Carvalho de Melo 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Adrian Hunter 
Cc: Arnaldo Carvalho de Melo 
Cc: David Ahern 
Cc: Frédéric Weisbecker 
Cc: Jiri Olsa 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Steven Rostedt 
Cc: Thomas Gleixner 
Cc: Wang Nan 
Link: http://lkml.kernel.org/r/20151102095051.GN17308@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar 
Signed-off-by: Zefan Li

clocksource: Fix abs() usage w/ 64bit values

2016-04-27T10:55:25+00:00

commit 67dfae0cd72fec5cd158b6e5fb1647b7dbe0834c upstream.

This patch fixes one cases where abs() was being used with 64-bit
nanosecond values, where the result may be capped at 32-bits.

This potentially could cause watchdog false negatives on 32-bit
systems, so this patch addresses the issue by using abs64().

Signed-off-by: John Stultz 
Cc: Prarit Bhargava 
Cc: Richard Cochran 
Cc: Ingo Molnar 
Link: http://lkml.kernel.org/r/1442279124-7309-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner 
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li

genirq: Fix race in register_irq_proc()

2016-04-27T10:55:25+00:00

commit 95c2b17534654829db428f11bcf4297c059a2a7e upstream.

Per-IRQ directories in procfs are created only when a handler is first
added to the irqdesc, not when the irqdesc is created.  In the case of
a shared IRQ, multiple tasks can race to create a directory.  This
race condition seems to have been present forever, but is easier to
hit with async probing.

Signed-off-by: Ben Hutchings 
Link: http://lkml.kernel.org/r/1443266636.2004.2.camel@decadent.org.uk
Signed-off-by: Thomas Gleixner 
Signed-off-by: Zefan Li

sched/core: Fix TASK_DEAD race in finish_task_switch()

2016-04-27T10:55:21+00:00

commit 95913d97914f44db2b81271c2e2ebd4d2ac2df83 upstream.

So the problem this patch is trying to address is as follows:

        CPU0                            CPU1

        context_switch(A, B)
                                        ttwu(A)
                                          LOCK A->pi_lock
                                          A->on_cpu == 0
        finish_task_switch(A)
          prev_state = A->state  <-.
          WMB                      |
          A->on_cpu = 0;           |
          UNLOCK rq0->lock         |
                                   |    context_switch(C, A)
                                   `--  A->state = TASK_DEAD
          prev_state == TASK_DEAD
            put_task_struct(A)
                                        context_switch(A, C)
                                        finish_task_switch(A)
                                          A->state == TASK_DEAD
                                            put_task_struct(A)

The argument being that the WMB will allow the load of A->state on CPU0
to cross over and observe CPU1's store of A->state, which will then
result in a double-drop and use-after-free.

Now the comment states (and this was true once upon a long time ago)
that we need to observe A->state while holding rq->lock because that
will order us against the wakeup; however the wakeup will not in fact
acquire (that) rq->lock; it takes A->pi_lock these days.

We can obviously fix this by upgrading the WMB to an MB, but that is
expensive, so we'd rather avoid that.

The alternative this patch takes is: smp_store_release(&A->on_cpu, 0),
which avoids the MB on some archs, but not important ones like ARM.

Reported-by: Oleg Nesterov 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-kernel@vger.kernel.org
Cc: manfred@colorfullife.com
Cc: will.deacon@arm.com
Fixes: e4a52bcb9a18 ("sched: Remove rq->lock from the first half of ttwu()")
Link: http://lkml.kernel.org/r/20150929124509.GG3816@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar 
[lizf: Backported to 3.4: use smb_mb() instead of smp_store_release(), which
 is not defined in 3.4.y]
Signed-off-by: Zefan Li

module: Fix locking in symbol_put_addr()

2016-04-27T10:55:19+00:00

commit 275d7d44d802ef271a42dc87ac091a495ba72fc5 upstream.

Poma (on the way to another bug) reported an assertion triggering:

  [] module_assert_mutex_or_preempt+0x49/0x90
  [] __module_address+0x32/0x150
  [] __module_text_address+0x16/0x70
  [] symbol_put_addr+0x29/0x40
  [] dvb_frontend_detach+0x7d/0x90 [dvb_core]

Laura Abbott  produced a patch which lead us to
inspect symbol_put_addr(). This function has a comment claiming it
doesn't need to disable preemption around the module lookup
because it holds a reference to the module it wants to find, which
therefore cannot go away.

This is wrong (and a false optimization too, preempt_disable() is really
rather cheap, and I doubt any of this is on uber critical paths,
otherwise it would've retained a pointer to the actual module anyway and
avoided the second lookup).

While its true that the module cannot go away while we hold a reference
on it, the data structure we do the lookup in very much _CAN_ change
while we do the lookup. Therefore fix the comment and add the
required preempt_disable().

Reported-by: poma 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Rusty Russell 
Fixes: a6e6abd575fc ("module: remove module_text_address()")
Signed-off-by: Zefan Li

fs: create and use seq_show_option for escaping

2016-04-27T10:55:18+00:00

commit a068acf2ee77693e0bf39d6e07139ba704f461c3 upstream.

Many file systems that implement the show_options hook fail to correctly
escape their output which could lead to unescaped characters (e.g.  new
lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files.  This
could lead to confusion, spoofed entries (resulting in things like
systemd issuing false d-bus "mount" notifications), and who knows what
else.  This looks like it would only be the root user stepping on
themselves, but it's possible weird things could happen in containers or
in other situations with delegated mount privileges.

Here's an example using overlay with setuid fusermount trusting the
contents of /proc/mounts (via the /etc/mtab symlink).  Imagine the use
of "sudo" is something more sneaky:

  $ BASE="ovl"
  $ MNT="$BASE/mnt"
  $ LOW="$BASE/lower"
  $ UP="$BASE/upper"
  $ WORK="$BASE/work/ 0 0
  none /proc fuse.pwn user_id=1000"
  $ mkdir -p "$LOW" "$UP" "$WORK"
  $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
  $ cat /proc/mounts
  none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
  none /proc fuse.pwn user_id=1000 0 0
  $ fusermount -u /proc
  $ cat /proc/mounts
  cat: /proc/mounts: No such file or directory

This fixes the problem by adding new seq_show_option and
seq_show_option_n helpers, and updating the vulnerable show_option
handlers to use them as needed.  Some, like SELinux, need to be open
coded due to unusual existing escape mechanisms.

[akpm@linux-foundation.org: add lost chunk, per Kees]
[keescook@chromium.org: seq_show_option should be using const parameters]
Signed-off-by: Kees Cook 
Acked-by: Serge Hallyn 
Acked-by: Jan Kara 
Acked-by: Paul Moore 
Cc: J. R. Okajima 
Signed-off-by: Kees Cook 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
[lizf: Backported to 3.4:
 - adjust context
 - one more place in ceph needs to be changed
 - drop changes to overlayfs
 - drop showing vers in cifs]
Signed-off-by: Zefan Li