linux-stable.git/kernel/cpuset.c, branch v4.1.45

cpuset: consider dying css as offline

2017-06-26T02:02:19+00:00

[ Upstream commit 41c25707d21716826e3c1f60967f5550610ec1c9 ]

In most cases, a cgroup controller doesn't care about the lifetimes of
cgroups.  For the controller, a css becomes online when ->css_online()
is called on it and offline when ->css_offline() is called.

However, cpuset is special in that the user interface it exposes cares
whether certain cgroups exist or not.  Combined with the RCU delay
between cgroup removal and css offlining, this can lead to user
visible behavior oddities where operations which should succeed after
cgroup removals fail for some time period.  The effects of cgroup
removals are delayed when seen from userland.

This patch adds css_is_dying() which tests whether offline is pending
and updates is_cpuset_online() so that the function returns false also
while offline is pending.  This gets rid of the userland visible
delays.
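
The check described above can be sketched in userspace C.  This is only a
model under stated assumptions: struct model_css and both helper names are
hypothetical stand-ins, and the two booleans replace the kernel's flag bit
and reference-count state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a css; not the kernel structure. */
struct model_css {
	bool online;	/* ->css_online() has run, ->css_offline() has not */
	bool dying;	/* removal was requested, offline is still pending */
};

/* Models css_is_dying(): true while offline is pending. */
bool model_css_is_dying(const struct model_css *css)
{
	assert(css != NULL);
	return css->dying;
}

/* Models the updated is_cpuset_online(): a dying css counts as offline. */
bool model_is_cpuset_online(const struct model_css *css)
{
	assert(css != NULL);
	return css->online && !model_css_is_dying(css);
}
```

With this model, a css that is still marked online but already dying is
reported as offline, which is the window the patch closes.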

Signed-off-by: Tejun Heo 
Reported-by: Daniel Jordan 
Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2@oracle.com
Cc: stable@vger.kernel.org
Signed-off-by: Tejun Heo 
Signed-off-by: Sasha Levin

cpuset: handle race between CPU hotplug and cpuset_hotplug_work

2016-12-22T03:45:42+00:00

[ Upstream commit 28b89b9e6f7b6c8fef7b3af39828722bca20cfee ]

A discrepancy between cpu_online_mask and cpuset's effective_cpus
mask is inevitable during hotplug since cpuset defers updating of
effective_cpus mask using a workqueue, during which time nothing
prevents the system from more hotplug operations.  For that reason
guarantee_online_cpus() walks up the cpuset hierarchy until it finds
an intersection under the assumption that top cpuset's effective_cpus
mask intersects with cpu_online_mask even with such a race occurring.

However a sequence of CPU hotplugs can open a time window, during which
none of the effective CPUs in the top cpuset intersect with
cpu_online_mask.

For example when there are 4 possible CPUs 0-3 and only CPU0 is online:

  ========================  ===========================
   cpu_online_mask           top_cpuset.effective_cpus
  ========================  ===========================
   echo 1 > cpu2/online.
   CPU hotplug notifier woke up hotplug work but not yet scheduled.
      [0,2]                     [0]

   echo 0 > cpu0/online.
   The workqueue is still runnable.
      [2]                       [0]
  ========================  ===========================

  Now there is no intersection between cpu_online_mask and
  top_cpuset.effective_cpus.  Thus invoking sys_sched_setaffinity() at
  this moment can cause the following:

   Unable to handle kernel NULL pointer dereference at virtual address 000000d0
   ------------[ cut here ]------------
   Kernel BUG at ffffffc0001389b0 [verbose debug info unavailable]
   Internal error: Oops - BUG: 96000005 [#1] PREEMPT SMP
   Modules linked in:
   CPU: 2 PID: 1420 Comm: taskset Tainted: G        W       4.4.8+ #98
   task: ffffffc06a5c4880 ti: ffffffc06e124000 task.ti: ffffffc06e124000
   PC is at guarantee_online_cpus+0x2c/0x58
   LR is at cpuset_cpus_allowed+0x4c/0x6c
   
   Process taskset (pid: 1420, stack limit = 0xffffffc06e124020)
   Call trace:
   [] guarantee_online_cpus+0x2c/0x58
   [] cpuset_cpus_allowed+0x4c/0x6c
   [] sched_setaffinity+0xc0/0x1ac
   [] SyS_sched_setaffinity+0x98/0xac
   [] el0_svc_naked+0x24/0x28

The top cpuset's effective_cpus are guaranteed to be identical to
cpu_online_mask eventually.  Hence fall back to cpu_online_mask when
there is no intersection between top cpuset's effective_cpus and
cpu_online_mask.
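
The fallback can be modelled in userspace C as follows.  This is a sketch,
not the kernel code: struct model_cpuset and model_guarantee_online_cpus()
are hypothetical names, and a 64-bit integer stands in for struct cpumask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a cpuset: one bit per CPU, parent is NULL at the top. */
struct model_cpuset {
	uint64_t effective_cpus;
	const struct model_cpuset *parent;
};

/*
 * Walk up the hierarchy until effective_cpus intersects the online mask.
 * If even the top cpuset has no online CPU (the hotplug race described
 * above), fall back to the online mask itself.
 */
uint64_t model_guarantee_online_cpus(const struct model_cpuset *cs,
				     uint64_t online_mask)
{
	assert(cs != NULL);
	while ((cs->effective_cpus & online_mask) == 0) {
		cs = cs->parent;
		if (cs == NULL)
			return online_mask;
	}
	return cs->effective_cpus & online_mask;
}
```

In the example above, effective_cpus is [0] and cpu_online_mask is [2], so
the walk runs off the top of the hierarchy and the model returns [2] rather
than dereferencing a NULL parent.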

Signed-off-by: Joonwoo Park 
Acked-by: Li Zefan 
Cc: Tejun Heo 
Cc: cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc:  # 3.17+
Signed-off-by: Tejun Heo 
Signed-off-by: Sasha Levin

cpuset: make sure new tasks conform to the current config of the cpuset

2016-09-12T13:30:26+00:00

[ Upstream commit 06f4e94898918bcad00cdd4d349313a439d6911e ]

A new task inherits cpus_allowed and mems_allowed masks from its parent,
but if someone changes cpuset's config by writing to cpuset.cpus/cpuset.mems
before this new task is inserted into the cgroup's task list, the new task
won't be updated accordingly.

Signed-off-by: Zefan Li 
Signed-off-by: Tejun Heo 
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin

cpuset: use trialcs->mems_allowed as a temp variable

2015-09-13T16:07:46+00:00

commit 24ee3cf89bef04e8bc23788aca4e029a3f0f06d9 upstream.

The comment says the code uses trialcs->mems_allowed as a temp variable,
but the code did not do so. Change the code to match the comment.

This fixes an issue when writing to cpuset.mems while a sub-directory
exists: the write has to be repeated several times before the value persists:

| root@alban:/sys/fs/cgroup/cpuset# mkdir footest9
| root@alban:/sys/fs/cgroup/cpuset# cd footest9
| root@alban:/sys/fs/cgroup/cpuset/footest9# mkdir aa
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
|
| root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > cpuset.mems
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
|
| root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > cpuset.mems
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
| 0
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat aa/cpuset.mems
|
| root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > aa/cpuset.mems
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat aa/cpuset.mems
| 0
| root@alban:/sys/fs/cgroup/cpuset/footest9#

This should help to fix the following issue in Docker:
https://github.com/opencontainers/runc/issues/133
In some conditions, a Docker container needs to be started twice in
order to work.

Signed-off-by: Alban Crequy 
Tested-by: Iago López Galeiras 
Acked-by: Li Zefan 
Signed-off-by: Tejun Heo 
Signed-off-by: Greg Kroah-Hartman

kernel, cpuset: remove exception for __GFP_THISNODE

2015-04-14T23:49:03+00:00

Nothing calls __cpuset_node_allowed() with __GFP_THISNODE set anymore, so
remove the obscure comment about it and its special-case exception.

Signed-off-by: David Rientjes 
Acked-by: Vlastimil Babka 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: Joonsoo Kim 
Cc: Johannes Weiner 
Cc: Mel Gorman 
Cc: Pravin Shelar 
Cc: Jarno Rajahalme 
Cc: Li Zefan 
Cc: Greg Thelen 
Cc: Tejun Heo 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

cpusets, isolcpus: exclude isolcpus from load balancing in cpusets

2015-03-19T18:28:19+00:00

Ensure that cpus specified with the isolcpus= boot commandline
option stay outside of the load balancing in the kernel scheduler.

Operations like load balancing can introduce unwanted latencies,
which is exactly what the isolcpus= commandline is there to prevent.

Previously, simply creating a new cpuset, without even touching the
cpuset.cpus field inside the new cpuset, would undo the effects of
isolcpus=, by creating a scheduler domain spanning the whole system,
and setting up load balancing inside that domain. The cpuset root
cpuset.cpus file is read-only, so there was not even a way to undo
that effect.

This does not impact the majority of cpusets users, since isolcpus=
is a fairly specialized feature used for realtime purposes.
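
The mask arithmetic behind this can be sketched in userspace C.  The
function names are hypothetical and a 64-bit integer stands in for a
cpumask; the sketch only illustrates removing the isolated CPUs from the
span that gets load balanced.

```c
#include <assert.h>
#include <stdint.h>

/* CPUs that may take part in load balancing: possible minus isolated. */
uint64_t model_non_isolated_cpus(uint64_t possible, uint64_t isolated)
{
	return possible & ~isolated;
}

/*
 * Span of the scheduler domain built for the top cpuset in this model:
 * its effective CPUs restricted to the non-isolated ones.
 */
uint64_t model_top_domain_span(uint64_t top_effective, uint64_t possible,
			       uint64_t isolated)
{
	uint64_t span = top_effective &
			model_non_isolated_cpus(possible, isolated);

	/* No isolated CPU may end up in the balanced span. */
	assert((span & isolated) == 0);
	return span;
}
```

For four possible CPUs 0-3 with isolcpus=2,3, the model yields a span of
CPUs 0-1 regardless of how many cpusets are created.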

Cc: Peter Zijlstra 
Cc: Clark Williams 
Cc: Li Zefan 
Cc: Ingo Molnar 
Cc: Luiz Capitulino 
Cc: Mike Galbraith 
Cc: cgroups@vger.kernel.org
Signed-off-by: Rik van Riel 
Tested-by: David Rientjes 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: David Rientjes 
Acked-by: Zefan Li 
Signed-off-by: Tejun Heo

cpuset: Fix cpuset sched_relax_domain_level

2015-03-02T16:55:04+00:00

The cpuset.sched_relax_domain_level can control how far we do
immediate load balancing on a system. However, it was found on recent
kernels that echo'ing a value into cpuset.sched_relax_domain_level
did not reduce any immediate load balancing.

This occurred because the update_domain_attr_tree() traversal skipped the
"top_cpuset", so modifying the sched_relax_domain_level parameter changed
nothing.

This patch addresses the problem by having update_domain_attr_tree()
include the root in the cpuset traversal.
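
A userspace sketch of a traversal that includes the root is shown below.
It is a model only: struct model_cs_node and the function name are
hypothetical, and recursion replaces the kernel's descendant iterator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a cpuset node in the hierarchy. */
struct model_cs_node {
	int relax_domain_level;		/* -1 means no request */
	bool has_cpus;
	bool load_balance;
	const struct model_cs_node *first_child;
	const struct model_cs_node *next_sibling;
};

/*
 * Pre-order walk that starts with the root itself, so a level set on the
 * root is taken into account.  *level ends up as the largest level found.
 */
void model_update_domain_attr_tree(int *level, const struct model_cs_node *cs)
{
	const struct model_cs_node *child;

	assert(level != NULL);
	if (cs == NULL || !cs->has_cpus)
		return;		/* skip the whole subtree without CPUs */
	if (cs->load_balance && *level < cs->relax_domain_level)
		*level = cs->relax_domain_level;
	for (child = cs->first_child; child; child = child->next_sibling)
		model_update_domain_attr_tree(level, child);
}
```

If the walk skipped the root, a value written only to the root's
sched_relax_domain_level would never reach *level, which matches the
symptom described above.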

Fixes: fc560a26acce ("cpuset: replace cpuset->stack_list with cpuset_for_each_descendant_pre()")
Cc:  # 3.9+
Signed-off-by: Jason Low 
Signed-off-by: Zefan Li 
Signed-off-by: Tejun Heo 
Tested-by: Serge Hallyn

cpuset: fix a warning when clearing configured masks in old hierarchy

2015-03-02T16:55:04+00:00

When we clear cpuset.cpus, cpuset.effective_cpus won't be cleared:

  # mount -t cgroup -o cpuset xxx /mnt
  # mkdir /mnt/tmp
  # echo 0 > /mnt/tmp/cpuset.cpus
  # echo > /mnt/tmp/cpuset.cpus
  # cat cpuset.cpus

  # cat cpuset.effective_cpus
  0-15

And a kernel warning in update_cpumasks_hier() is triggered:

 ------------[ cut here ]------------
 WARNING: CPU: 0 PID: 4028 at kernel/cpuset.c:894 update_cpumasks_hier+0x471/0x650()

Cc:  # 3.17+
Signed-off-by: Zefan Li 
Signed-off-by: Tejun Heo 
Tested-by: Serge Hallyn

cpuset: initialize effective masks when clone_children is enabled

2015-03-02T16:55:04+00:00

If clone_children is enabled, effective masks won't be initialized
due to a bug:

  # mount -t cgroup -o cpuset xxx /mnt
  # echo 1 > cgroup.clone_children
  # mkdir /mnt/tmp
  # cd /mnt/tmp/
  # cat cpuset.effective_cpus

  # cat cpuset.cpus
  0-15

And then this cpuset won't constrain the tasks in it.

Neither the bug nor the fix has any effect on the unified hierarchy, as
there's no clone_children flag there any more.

Reported-by: Christian Brauner 
Reported-by: Serge Hallyn 
Cc:  # 3.17+
Signed-off-by: Zefan Li 
Signed-off-by: Tejun Heo 
Tested-by: Serge Hallyn

cpuset: use %*pb[l] to print bitmaps including cpumasks and nodemasks

2015-02-14T05:21:37+00:00

printk and friends can now format bitmaps using '%*pb[l]'.  cpumask
and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
respectively which can be used to generate the two printf arguments
necessary to format the specified cpu/nodemask.

* kernel/cpuset.c::cpuset_print_task_mems_allowed() used a static
  buffer which is protected by a dedicated spinlock.  Removed.
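
Userspace printf has no '%*pbl', so the list form it produces (for example
"0-15") can only be illustrated with a hand-written formatter.  The sketch
below is such an illustration; model_format_cpu_list() is a hypothetical
name and a 64-bit integer stands in for the bitmap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format the set bits of mask as a comma-separated list of ranges.
 * Returns the number of characters written, not counting the NUL.
 * A range that does not fit in buf is dropped entirely.
 */
size_t model_format_cpu_list(uint64_t mask, char *buf, size_t len)
{
	size_t used = 0;
	int bit = 0;

	assert(buf != NULL);
	if (len == 0)
		return 0;
	buf[0] = '\0';
	while (bit < 64) {
		int start, end, n;

		if (!((mask >> bit) & 1)) {
			bit++;
			continue;
		}
		start = bit;
		while (bit < 64 && ((mask >> bit) & 1))
			bit++;
		end = bit - 1;
		if (start == end)
			n = snprintf(buf + used, len - used, "%s%d",
				     used ? "," : "", start);
		else
			n = snprintf(buf + used, len - used, "%s%d-%d",
				     used ? "," : "", start, end);
		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';	/* drop the partial range */
			return used;
		}
		used += (size_t)n;
	}
	return used;
}
```

A mask with bits 0-3 and 5 set formats as "0-3,5" with this helper.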

Signed-off-by: Tejun Heo 
Cc: Li Zefan 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds