linux-stable.git/kernel, branch v3.12.63

module: Invalidate signatures on force-loaded modules

2016-08-19T10:02:43+00:00

commit bca014caaa6130e57f69b5bf527967aa8ee70fdd upstream.

Signing a module should only make it trusted by the specific kernel it
was built for, not anything else.  Loading a signed module meant for a
kernel with a different ABI could have interesting effects.
Therefore, treat all signatures as invalid when a module is
force-loaded.

Signed-off-by: Ben Hutchings 
Signed-off-by: Rusty Russell 
Signed-off-by: Jiri Slaby

tracing: Handle NULL formats in hold_module_trace_bprintk_format()

2016-08-19T07:50:45+00:00

commit 70c8217acd4383e069fe1898bbad36ea4fcdbdcc upstream.

If a task uses a non constant string for the format parameter in
trace_printk(), then the trace_printk_fmt variable is set to NULL. This
variable is then saved in the __trace_printk_fmt section.

The function hold_module_trace_bprintk_format() checks to see if duplicate
formats are used by modules, and reuses them if so (saves them to the list
if it is new). But this function calls lookup_format() that does a strcmp()
to the value (which is now NULL) and can cause a kernel oops.

This wasn't an issue till 3debb0a9ddb ("tracing: Fix trace_printk() to print
when not using bprintk()") which added "__used" to the trace_printk_fmt
variable, and before that, the kernel simply optimized it out (no NULL value
was saved).

The fix is simply to handle the NULL pointer in lookup_format() and have the
caller ignore the value if it was NULL.

Link: http://lkml.kernel.org/r/1464769870-18344-1-git-send-email-zhengjun.xing@intel.com

Reported-by: xingzhen 
Acked-by: Namhyung Kim 
Fixes: 3debb0a9ddb ("tracing: Fix trace_printk() to print when not using bprintk()")
Signed-off-by: Steven Rostedt 
Signed-off-by: Jiri Slaby

printk: do cond_resched() between lines while outputting to consoles

2016-07-21T06:36:15+00:00

commit 8d91f8b15361dfb438ab6eb3b319e2ded43458ff upstream.

@console_may_schedule tracks whether console_sem was acquired through
lock or trylock.  If the former, we're inside a sleepable context and
console_conditional_schedule() performs cond_resched().  This allows
console drivers which use console_lock for synchronization to yield
while performing time-consuming operations such as scrolling.

However, the actual console outputting is performed while holding
irq-safe logbuf_lock, so console_unlock() clears @console_may_schedule
before starting outputting lines.  Also, only a few drivers call
console_conditional_schedule() to begin with.  This means that when a
lot of lines need to be output by console_unlock(), for example on a
console registration, the task doing console_unlock() may not yield for
a long time on a non-preemptible kernel.

If this happens with a slow console devices, for example a serial
console, the outputting task may occupy the cpu for a very long time.
Long enough to trigger softlockup and/or RCU stall warnings, which in
turn pile more messages, sometimes enough to trigger the next cycle of
warnings incapacitating the system.

Fix it by making console_unlock() insert cond_resched() between lines if
@console_may_schedule.

Signed-off-by: Tejun Heo 
Reported-by: Calvin Owens 
Acked-by: Jan Kara 
Cc: Dave Jones 
Cc: Kyle McMartin 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Cc: Charles (Chas) Williams 
Signed-off-by: Jiri Slaby

panic: release stale console lock to always get the logbuf printed out

2016-07-21T06:36:14+00:00

commit 08d78658f393fefaa2e6507ea052c6f8ef4002a2 upstream.

In some cases we may end up killing the CPU holding the console lock
while still having valuable data in logbuf. E.g. I'm observing the
following:

- A crash is happening on one CPU and console_unlock() is being called on
  some other.

- console_unlock() tries to print out the buffer before releasing the lock
  and on slow console it takes time.

- in the meanwhile crashing CPU does lots of printk()-s with valuable data
  (which go to the logbuf) and sends IPIs to all other CPUs.

- console_unlock() finishes printing previous chunk and enables interrupts
  before trying to print out the rest, the CPU catches the IPI and never
  releases console lock.

This is not the only possible case: in VT/fb subsystems we have many other
console_lock()/console_unlock() users.  Non-masked interrupts (or
receiving NMI in case of extreme slowness) will have the same result.
Getting the whole console buffer printed out on crash should be top
priority.

[akpm@linux-foundation.org: tweak comment text]
Signed-off-by: Vitaly Kuznetsov 
Cc: HATAYAMA Daisuke 
Cc: Masami Hiramatsu 
Cc: Jiri Kosina 
Cc: Baoquan He 
Cc: Prarit Bhargava 
Cc: Xie XiuQi 
Cc: Seth Jennings 
Cc: "K. Y. Srinivasan" 
Cc: Jan Kara 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Jiri Slaby

signal: remove warning about using SI_TKILL in rt_[tg]sigqueueinfo

2016-07-21T06:36:13+00:00

commit 69828dce7af2cb6d08ef5a03de687d422fb7ec1f upstream.

Sending SI_TKILL from rt_[tg]sigqueueinfo was deprecated, so now we issue
a warning on the first attempt of doing it.  We use WARN_ON_ONCE, which is
not informative and, what is worse, taints the kernel, making the trinity
syscall fuzzer complain false-positively from time to time.

It does not look like we need this warning at all, because the behaviour
changed quite a long time ago (2.6.39), and if an application relies on
the old API, it gets EPERM anyway and can issue a warning by itself.

So let us zap the warning in kernel.

Signed-off-by: Vladimir Davydov 
Acked-by: Oleg Nesterov 
Cc: Richard Weinberger 
Cc: "Paul E. McKenney" 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Jiri Slaby

ktime: export ktime_divns

2016-07-21T06:36:05+00:00

ktime_divns was exported in upstream as a side-effect of commit
166afb64511eef08e13331b970c44fe91cea45ef (ktime: Sanitize
ktime_to_us/ms conversion). But we do not want the commit given ktime
is not nanoseconds in 3.12 yet.

So we only export the function here as it is needed by upstream commit
d2c5cf88d5282de258f4eb6ab40040b80a075cd8 (ALSA: hrtimer: Handle
start/stop more properly):
ERROR: "ktime_divns" [sound/core/snd-hrtimer.ko] undefined!

Signed-off-by: Jiri Slaby 
Cc: Takashi Iwai 
Cc: Thomas Gleixner 
Cc: John Stultz

ring-buffer: Prevent overflow of size in ring_buffer_resize()

2016-06-15T07:32:03+00:00

commit 59643d1535eb220668692a5359de22545af579f6 upstream.

If the size passed to ring_buffer_resize() is greater than MAX_LONG - BUF_PAGE_SIZE
then the DIV_ROUND_UP() will return zero.

Here's the details:

  # echo 18014398509481980 > /sys/kernel/debug/tracing/buffer_size_kb

tracing_entries_write() processes this and converts kb to bytes.

 18014398509481980 << 10 = 18446744073709547520

and this is passed to ring_buffer_resize() as unsigned long size.

 size = DIV_ROUND_UP(size, BUF_PAGE_SIZE);

Where DIV_ROUND_UP(a, b) is (a + b - 1)/b

BUF_PAGE_SIZE is 4080 and here

 18446744073709547520 + 4080 - 1 = 18446744073709551599

where 18446744073709551599 is still smaller than 2^64

 2^64 - 18446744073709551599 = 17

But now 18446744073709551599 / 4080 = 4521260802379792

and size = size * 4080 = 18446744073709551360

This is checked to make sure its still greater than 2 * 4080,
which it is.

Then we convert to the number of buffer pages needed.

 nr_page = DIV_ROUND_UP(size, BUF_PAGE_SIZE)

but this time size is 18446744073709551360 and

 2^64 - (18446744073709551360 + 4080 - 1) = -3823

Thus it overflows and the resulting number is less than 4080, which makes

  3823 / 4080 = 0

an nr_pages is set to this. As we already checked against the minimum that
nr_pages may be, this causes the logic to fail as well, and we crash the
kernel.

There's no reason to have the two DIV_ROUND_UP() (that's just result of
historical code changes), clean up the code and fix this bug.

Fixes: 83f40318dab00 ("ring-buffer: Make removal of ring buffer pages atomic")
Signed-off-by: Steven Rostedt 
Signed-off-by: Jiri Slaby

ring-buffer: Use long for nr_pages to avoid overflow failures

2016-06-15T07:32:03+00:00

commit 9b94a8fba501f38368aef6ac1b30e7335252a220 upstream.

The size variable to change the ring buffer in ftrace is a long. The
nr_pages used to update the ring buffer based on the size is int. On 64 bit
machines this can cause an overflow problem.

For example, the following will cause the ring buffer to crash:

 # cd /sys/kernel/debug/tracing
 # echo 10 > buffer_size_kb
 # echo 8556384240 > buffer_size_kb

Then you get the warning of:

 WARNING: CPU: 1 PID: 318 at kernel/trace/ring_buffer.c:1527 rb_update_pages+0x22f/0x260

Which is:

  RB_WARN_ON(cpu_buffer, nr_removed);

Note each ring buffer page holds 4080 bytes.

This is because:

 1) 10 causes the ring buffer to have 3 pages.
    (10kb requires 3 * 4080 pages to hold)

 2) (2^31 / 2^10  + 1) * 4080 = 8556384240
    The value written into buffer_size_kb is shifted by 10 and then passed
    to ring_buffer_resize(). 8556384240 * 2^10 = 8761737461760

 3) The size passed to ring_buffer_resize() is then divided by BUF_PAGE_SIZE
    which is 4080. 8761737461760 / 4080 = 2147484672

 4) nr_pages is subtracted from the current nr_pages (3) and we get:
    2147484669. This value is saved in a signed integer nr_pages_to_update

 5) 2147484669 is greater than 2^31 but smaller than 2^32, a signed int
    turns into the value of -2147482627

 6) As the value is a negative number, in update_pages_handler() it is
    negated and passed to rb_remove_pages() and 2147482627 pages will
    be removed, which is much larger than 3 and it causes the warning
    because not all the pages asked to be removed were removed.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=118001

Fixes: 7a8e76a3829f1 ("tracing: unified trace buffer")
Reported-by: Hao Qin 
Signed-off-by: Steven Rostedt 
Signed-off-by: Jiri Slaby

sched: Remove lockdep check in sched_move_task()

2016-05-19T09:00:14+00:00

commit f7b8a47da17c9ee4998f2ca2018fcc424e953c0e upstream.

sched_move_task() is the only interface to change sched_task_group:
cpu_cgrp_subsys methods and autogroup_move_group() use it.

Everything is synchronized by task_rq_lock(), so cpu_cgroup_attach()
is ordered with other users of sched_move_task(). This means we do no
need RCU here: if we've dereferenced a tg here, the .attach method
hasn't been called for it yet.

Thus, we should pass "true" to task_css_check() to silence lockdep
warnings.

Fixes: eeb61e53ea19 ("sched: Fix race between task_group and sched_task_group")
Reported-by: Oleg Nesterov 
Reported-by: Fengguang Wu 
Signed-off-by: Kirill Tkhai 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/1414473874.8574.2.camel@tkhai
Signed-off-by: Ingo Molnar 
Signed-off-by: Jiri Slaby

workqueue: fix ghost PENDING flag while doing MQ IO

2016-05-02T17:54:56+00:00

commit 346c09f80459a3ad97df1816d6d606169a51001a upstream.

The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list
with the following backtrace:

[  601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
[  601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
[  601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
[  601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
[  601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
[  601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
[  601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
[  601.350965] Call Trace:
[  601.351203]  [] ? bit_wait+0x60/0x60
[  601.351444]  [] schedule+0x35/0x80
[  601.351709]  [] schedule_timeout+0x192/0x230
[  601.351958]  [] ? blk_flush_plug_list+0xc7/0x220
[  601.352208]  [] ? ktime_get+0x37/0xa0
[  601.352446]  [] ? bit_wait+0x60/0x60
[  601.352688]  [] io_schedule_timeout+0xa4/0x110
[  601.352951]  [] ? _raw_spin_unlock_irqrestore+0xe/0x10
[  601.353196]  [] bit_wait_io+0x1b/0x70
[  601.353440]  [] __wait_on_bit+0x5d/0x90
[  601.353689]  [] wait_on_page_bit+0xc0/0xd0
[  601.353958]  [] ? autoremove_wake_function+0x40/0x40
[  601.354200]  [] __filemap_fdatawait_range+0xe4/0x140
[  601.354441]  [] filemap_fdatawait_range+0x14/0x30
[  601.354688]  [] filemap_write_and_wait_range+0x3f/0x70
[  601.354932]  [] blkdev_fsync+0x1b/0x50
[  601.355193]  [] vfs_fsync_range+0x49/0xa0
[  601.355432]  [] blkdev_write_iter+0xca/0x100
[  601.355679]  [] __vfs_write+0xaa/0xe0
[  601.355925]  [] vfs_write+0xa9/0x1a0
[  601.356164]  [] kernel_write+0x38/0x50

The underlying device is a null_blk, with default parameters:

  queue_mode    = MQ
  submit_queues = 1

Verification that nullb0 has something inflight:

root@pserver8:~# cat /sys/block/nullb0/inflight
       0        1
root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
...
/sys/block/nullb0/mq/0/cpu2/rq_list
CTX pending:
        ffff8838038e2400
...

During debug it became clear that stalled request is always inserted in
the rq_list from the following path:

   save_stack_trace_tsk + 34
   blk_mq_insert_requests + 231
   blk_mq_flush_plug_list + 281
   blk_flush_plug_list + 199
   wait_on_page_bit + 192
   __filemap_fdatawait_range + 228
   filemap_fdatawait_range + 20
   filemap_write_and_wait_range + 63
   blkdev_fsync + 27
   vfs_fsync_range + 73
   blkdev_write_iter + 202
   __vfs_write + 170
   vfs_write + 169
   kernel_write + 56

So blk_flush_plug_list() was called with from_schedule == true.

If from_schedule is true, that means that finally blk_mq_insert_requests()
offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue,
i.e. it calls kblockd_schedule_delayed_work_on().

That means, that we race with another CPU, which is about to execute
__blk_mq_run_hw_queue() work.

Further debugging shows the following traces from different CPUs:

  CPU#0                                  CPU#1
  ----------------------------------     -------------------------------
  reqeust A inserted
  STORE hctx->ctx_map[0] bit marked
  kblockd_schedule...() returns 1
  
                                         request B inserted
                                         STORE hctx->ctx_map[1] bit marked
                                         kblockd_schedule...() returns 0
  *** WORK PENDING bit is cleared ***
  flush_busy_ctxs() is executed, but
  bit 1, set by CPU#1, is not observed

As a result request B pended forever.

This behaviour can be explained by speculative LOAD of hctx->ctx_map on
CPU#0, which is reordered with clear of PENDING bit and executed _before_
actual STORE of bit 1 on CPU#1.

The proper fix is an explicit full barrier , which guarantees
that clear of PENDING bit is to be executed before all possible
speculative LOADS or STORES inside actual work function.

Signed-off-by: Roman Pen 
Cc: Gioh Kim 
Cc: Michael Wang 
Cc: Tejun Heo 
Cc: Jens Axboe 
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo 
Signed-off-by: Jiri Slaby