linux-stable.git/kernel, branch v3.18.46

cpuset: handle race between CPU hotplug and cpuset_hotplug_work

2016-12-23T14:41:11+00:00

[ Upstream commit 28b89b9e6f7b6c8fef7b3af39828722bca20cfee ]

A discrepancy between cpu_online_mask and cpuset's effective_cpus
mask is inevitable during hotplug since cpuset defers updating of
effective_cpus mask using a workqueue, during which time nothing
prevents the system from more hotplug operations.  For that reason
guarantee_online_cpus() walks up the cpuset hierarchy until it finds
an intersection under the assumption that top cpuset's effective_cpus
mask intersects with cpu_online_mask even with such a race occurring.

However a sequence of CPU hotplugs can open a time window, during which
none of the effective CPUs in the top cpuset intersect with
cpu_online_mask.

For example when there are 4 possible CPUs 0-3 and only CPU0 is online:

  ========================  ===========================
   cpu_online_mask           top_cpuset.effective_cpus
  ========================  ===========================
   echo 1 > cpu2/online.
   CPU hotplug notifier woke up hotplug work but not yet scheduled.
      [0,2]                     [0]

   echo 0 > cpu0/online.
   The workqueue is still runnable.
      [2]                       [0]
  ========================  ===========================

  Now there is no intersection between cpu_online_mask and
  top_cpuset.effective_cpus.  Thus invoking sys_sched_setaffinity() at
  this moment can cause following:

   Unable to handle kernel NULL pointer dereference at virtual address 000000d0
   ------------[ cut here ]------------
   Kernel BUG at ffffffc0001389b0 [verbose debug info unavailable]
   Internal error: Oops - BUG: 96000005 [#1] PREEMPT SMP
   Modules linked in:
   CPU: 2 PID: 1420 Comm: taskset Tainted: G        W       4.4.8+ #98
   task: ffffffc06a5c4880 ti: ffffffc06e124000 task.ti: ffffffc06e124000
   PC is at guarantee_online_cpus+0x2c/0x58
   LR is at cpuset_cpus_allowed+0x4c/0x6c
   
   Process taskset (pid: 1420, stack limit = 0xffffffc06e124020)
   Call trace:
   [] guarantee_online_cpus+0x2c/0x58
   [] cpuset_cpus_allowed+0x4c/0x6c
   [] sched_setaffinity+0xc0/0x1ac
   [] SyS_sched_setaffinity+0x98/0xac
   [] el0_svc_naked+0x24/0x28

The top cpuset's effective_cpus are guaranteed to be identical to
cpu_online_mask eventually.  Hence fall back to cpu_online_mask when
there is no intersection between top cpuset's effective_cpus and
cpu_online_mask.

Signed-off-by: Joonwoo Park 
Acked-by: Li Zefan 
Cc: Tejun Heo 
Cc: cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc:  # 3.17+
Signed-off-by: Tejun Heo 
Signed-off-by: Sasha Levin

genirq: Generic chip: Change irq_reg_{readl,writel} arguments

2016-10-06T02:40:20+00:00

[ Upstream commit 332fd7c4fef5f3b166e93decb07fd69eb24f7998 ]

Pass in the irq_chip_generic struct so we can use different readl/writel
settings for each irqchip driver, when appropriate.  Compute
(gc->reg_base + reg_offset) in the helper function because this is pretty
much what all callers want to do anyway.

Compile-tested using the following configurations:

    at91_dt_defconfig (CONFIG_ATMEL_AIC_IRQ=y)
    sama5_defconfig (CONFIG_ATMEL_AIC5_IRQ=y)
    sunxi_defconfig (CONFIG_ARCH_SUNXI=y)

tb10x (ARC) is untested.

Signed-off-by: Kevin Cernekee 
Acked-by: Thomas Gleixner 
Acked-by: Acked-by: Arnd Bergmann 
Link: https://lkml.kernel.org/r/1415342669-30640-3-git-send-email-cernekee@gmail.com
Signed-off-by: Jason Cooper 
Signed-off-by: Sasha Levin

sched/core: Fix a race between try_to_wake_up() and a woken up task

2016-10-06T02:40:20+00:00

[ Upstream commit 135e8c9250dd5c8c9aae5984fde6f230d0cbfeaf ]

The origin of the issue I've seen is related to
a missing memory barrier between check for task->state and
the check for task->on_rq.

The task being woken up is already awake from a schedule()
and is doing the following:

	do {
		schedule()
		set_current_state(TASK_(UN)INTERRUPTIBLE);
	} while (!cond);

The waker, actually gets stuck doing the following in
try_to_wake_up():

	while (p->on_cpu)
		cpu_relax();

Analysis:

The instance I've seen involves the following race:

 CPU1					CPU2

 while () {
   if (cond)
     break;
   do {
     schedule();
     set_current_state(TASK_UN..)
   } while (!cond);
					wakeup_routine()
					  spin_lock_irqsave(wait_lock)
   raw_spin_lock_irqsave(wait_lock)	  wake_up_process()
 }					  try_to_wake_up()
 set_current_state(TASK_RUNNING);	  ..
 list_del(&waiter.list);

CPU2 wakes up CPU1, but before it can get the wait_lock and set
current state to TASK_RUNNING the following occurs:

 CPU3
 wakeup_routine()
 raw_spin_lock_irqsave(wait_lock)
 if (!list_empty)
   wake_up_process()
   try_to_wake_up()
   raw_spin_lock_irqsave(p->pi_lock)
   ..
   if (p->on_rq && ttwu_wakeup())
   ..
   while (p->on_cpu)
     cpu_relax()
   ..

CPU3 tries to wake up the task on CPU1 again since it finds
it on the wait_queue, CPU1 is spinning on wait_lock, but immediately
after CPU2, CPU3 got it.

CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and
the task is spinning on the wait_lock. Interestingly since p->on_rq
is checked under pi_lock, I've noticed that try_to_wake_up() finds
p->on_rq to be 0. This was the most confusing bit of the analysis,
but p->on_rq is changed under runqueue lock, rq_lock, the p->on_rq
check is not reliable without this fix IMHO. The race is visible
(based on the analysis) only when ttwu_queue() does a remote wakeup
via ttwu_queue_remote. In which case the p->on_rq change is not
done uder the pi_lock.

The result is that after a while the entire system locks up on
the raw_spin_irqlock_save(wait_lock) and the holder spins infintely

Reproduction of the issue:

The issue can be reproduced after a long run on my system with 80
threads and having to tweak available memory to very low and running
memory stress-ng mmapfork test. It usually takes a long time to
reproduce. I am trying to work on a test case that can reproduce
the issue faster, but thats work in progress. I am still testing the
changes on my still in a loop and the tests seem OK thus far.

Big thanks to Benjamin and Nick for helping debug this as well.
Ben helped catch the missing barrier, Nick caught every missing
bit in my theory.

Signed-off-by: Balbir Singh 
[ Updated comment to clarify matching barriers. Many
  architectures do not have a full barrier in switch_to()
  so that cannot be relied upon. ]
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Benjamin Herrenschmidt 
Cc: Alexey Kardashevskiy 
Cc: Linus Torvalds 
Cc: Nicholas Piggin 
Cc: Nicholas Piggin 
Cc: Oleg Nesterov 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: 
Link: http://lkml.kernel.org/r/e02cce7b-d9ca-1ad0-7a61-ea97c7582b37@gmail.com
Signed-off-by: Ingo Molnar 

Signed-off-by: Sasha Levin

cpuset: make sure new tasks conform to the current config of the cpuset

2016-09-12T18:53:07+00:00

[ Upstream commit 06f4e94898918bcad00cdd4d349313a439d6911e ]

A new task inherits cpus_allowed and mems_allowed masks from its parent,
but if someone changes cpuset's config by writing to cpuset.cpus/cpuset.mems
before this new task is inserted into the cgroup's task list, the new task
won't be updated accordingly.

Signed-off-by: Zefan Li 
Signed-off-by: Tejun Heo 
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin

timekeeping: Cap array access in timekeeping_debug

2016-09-01T02:05:44+00:00

[ Upstream commit a4f8f6667f099036c88f231dcad4cf233652c824 ]

It was reported that hibernation could fail on the 2nd attempt, where the
system hangs at hibernate() -> syscore_resume() -> i8237A_resume() ->
claim_dma_lock(), because the lock has already been taken.

However there is actually no other process would like to grab this lock on
that problematic platform.

Further investigation showed that the problem is triggered by setting
/sys/power/pm_trace to 1 before the 1st hibernation.

Since once pm_trace is enabled, the rtc becomes unmeaningful after suspend,
and meanwhile some BIOSes would like to adjust the 'invalid' RTC (e.g, smaller
than 1970) to the release date of that motherboard during POST stage, thus
after resumed, it may seem that the system had a significant long sleep time
which is a completely meaningless value.

Then in timekeeping_resume -> tk_debug_account_sleep_time, if the bit31 of the
sleep time happened to be set to 1, fls() returns 32 and we add 1 to
sleep_time_bin[32], which causes an out of bounds array access and therefor
memory being overwritten.

As depicted by System.map:
0xffffffff81c9d080 b sleep_time_bin
0xffffffff81c9d100 B dma_spin_lock
the dma_spin_lock.val is set to 1, which caused this problem.

This patch adds a sanity check in tk_debug_account_sleep_time()
to ensure we don't index past the sleep_time_bin array.

[jstultz: Problem diagnosed and original patch by Chen Yu, I've solved the
 issue slightly differently, but borrowed his excelent explanation of the
 issue here.]

Fixes: 5c83545f24ab "power: Add option to log time spent in suspend"
Reported-by: Janek Kozicki 
Reported-by: Chen Yu 
Signed-off-by: John Stultz 
Cc: linux-pm@vger.kernel.org
Cc: Peter Zijlstra 
Cc: Xunlei Pang 
Cc: "Rafael J. Wysocki" 
Cc: stable 
Cc: Zhang Rui 
Link: http://lkml.kernel.org/r/1471993702-29148-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner 
Signed-off-by: Sasha Levin

uprobes: Fix the memcg accounting

2016-09-01T02:05:44+00:00

[ Upstream commit 6c4687cc17a788a6dd8de3e27dbeabb7cbd3e066 ]

__replace_page() wronlgy calls mem_cgroup_cancel_charge() in "success" path,
it should only do this if page_check_address() fails.

This means that every enable/disable leads to unbalanced mem_cgroup_uncharge()
from put_page(old_page), it is trivial to underflow the page_counter->count
and trigger OOM.

Reported-and-tested-by: Brenden Blanco 
Signed-off-by: Oleg Nesterov 
Reviewed-by: Johannes Weiner 
Acked-by: Michal Hocko 
Cc: Alexander Shishkin 
Cc: Alexei Starovoitov 
Cc: Arnaldo Carvalho de Melo 
Cc: Arnaldo Carvalho de Melo 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Vladimir Davydov 
Cc: stable@vger.kernel.org # 3.17+
Fixes: 00501b531c47 ("mm: memcontrol: rewrite charge API")
Link: http://lkml.kernel.org/r/20160817153629.GB29724@redhat.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Sasha Levin

module: Invalidate signatures on force-loaded modules

2016-08-22T16:23:18+00:00

[ Upstream commit bca014caaa6130e57f69b5bf527967aa8ee70fdd ]

Signing a module should only make it trusted by the specific kernel it
was built for, not anything else.  Loading a signed module meant for a
kernel with a different ABI could have interesting effects.
Therefore, treat all signatures as invalid when a module is
force-loaded.

Signed-off-by: Ben Hutchings 
Cc: stable@vger.kernel.org
Signed-off-by: Rusty Russell 
Signed-off-by: Sasha Levin

audit: fix a double fetch in audit_log_single_execve_arg()

2016-08-22T16:23:15+00:00

[ Upstream commit 43761473c254b45883a64441dd0bc85a42f3645c ]

There is a double fetch problem in audit_log_single_execve_arg()
where we first check the execve(2) argumnets for any "bad" characters
which would require hex encoding and then re-fetch the arguments for
logging in the audit record[1].  Of course this leaves a window of
opportunity for an unsavory application to munge with the data.

This patch reworks things by only fetching the argument data once[2]
into a buffer where it is scanned and logged into the audit
records(s).  In addition to fixing the double fetch, this patch
improves on the original code in a few other ways: better handling
of large arguments which require encoding, stricter record length
checking, and some performance improvements (completely unverified,
but we got rid of some strlen() calls, that's got to be a good
thing).

As part of the development of this patch, I've also created a basic
regression test for the audit-testsuite, the test can be tracked on
GitHub at the following link:

 * https://github.com/linux-audit/audit-testsuite/issues/25

[1] If you pay careful attention, there is actually a triple fetch
problem due to a strnlen_user() call at the top of the function.

[2] This is a tiny white lie, we do make a call to strnlen_user()
prior to fetching the argument data.  I don't like it, but due to the
way the audit record is structured we really have no choice unless we
copy the entire argument at once (which would require a rather
wasteful allocation).  The good news is that with this patch the
kernel no longer relies on this strnlen_user() value for anything
beyond recording it in the log, we also update it with a trustworthy
value whenever possible.

Reported-by: Pengfei Wang 
Cc: 
Signed-off-by: Paul Moore 
Signed-off-by: Sasha Levin

Fix broken audit tests for exec arg len

2016-08-22T16:23:15+00:00

[ Upstream commit 45820c294fe1b1a9df495d57f40585ef2d069a39 ]

The "fix" in commit 0b08c5e5944 ("audit: Fix check of return value of
strnlen_user()") didn't fix anything, it broke things.  As reported by
Steven Rostedt:

 "Yes, strnlen_user() returns 0 on fault, but if you look at what len is
  set to, than you would notice that on fault len would be -1"

because we just subtracted one from the return value.  So testing
against 0 doesn't test for a fault condition, it tests against a
perfectly valid empty string.

Also fix up the usual braindamage wrt using WARN_ON() inside a
conditional - make it part of the conditional and remove the explicit
unlikely() (which is already part of the WARN_ON*() logic, exactly so
that you don't have to write unreadable code.

Reported-and-tested-by: Steven Rostedt 
Cc: Jan Kara 
Cc: Paul Moore 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin

audit: Fix check of return value of strnlen_user()

2016-08-22T16:23:14+00:00

[ Upstream commit 0b08c5e59441d08ab4b5e72afefd5cd98a4d83df ]

strnlen_user() returns 0 when it hits fault, not -1. Fix the test in
audit_log_single_execve_arg(). Luckily this shouldn't ever happen unless
there's a kernel bug so it's mostly a cosmetic fix.

CC: Paul Moore 
Signed-off-by: Jan Kara 
Signed-off-by: Paul Moore 
Signed-off-by: Sasha Levin