linux-stable.git/kernel/cgroup, branch linux-6.17.y

cpuset: Treat cpusets in attaching as populated

2025-12-18T12:59:50+00:00

[ Upstream commit b1bcaed1e39a9e0dfbe324a15d2ca4253deda316 ]

Currently, the check for whether a partition is populated does not
account for tasks in the cpuset of attaching. This is a corner case
that can leave a task stuck in a partition with no effective CPUs.

The race condition occurs as follows:

cpu0				cpu1
				//cpuset A  with cpu N
migrate task p to A
cpuset_can_attach
// with effective cpus
// check ok

// cpuset_mutex is not held	// clear cpuset.cpus.exclusive
				// making effective cpus empty
				update_exclusive_cpumask
				// tasks_nocpu_error check ok
				// empty effective cpus, partition valid
cpuset_attach
...
// task p stays in A, with non-effective cpus.

To fix this issue, this patch introduces cs_is_populated, which considers
tasks in the attaching cpuset. This new helper is used in validate_change
and partition_is_populated.

Fixes: e2d59900d936 ("cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective")
Signed-off-by: Chen Ridong 
Reviewed-by: Waiman Long 
Signed-off-by: Tejun Heo 
Signed-off-by: Sasha Levin

bpf: Do not limit bpf_cgroup_from_id to current's namespace

2025-11-13T20:36:51+00:00

[ Upstream commit 2c895133950646f45e5cf3900b168c952c8dbee8 ]

The bpf_cgroup_from_id kfunc relies on cgroup_get_from_id to obtain the
cgroup corresponding to a given cgroup ID. This helper can be called in
a lot of contexts where the current thread can be random. A recent
example was its use in sched_ext's ops.tick(), to obtain the root cgroup
pointer. Since the current task can be whatever random user space task
preempted by the timer tick, this makes the behavior of the helper
unreliable.

Refactor out __cgroup_get_from_id as the non-namespace aware version of
cgroup_get_from_id, and change bpf_cgroup_from_id to make use of it.

There is no compatibility breakage here, since changing the namespace
against which the lookup is being done to the root cgroup namespace only
permits a wider set of lookups to succeed now. The cgroup IDs across
namespaces are globally unique, and thus don't need to be retranslated.

Reported-by: Dan Schatzberg 
Signed-off-by: Kumar Kartikeya Dwivedi 
Acked-by: Tejun Heo 
Link: https://lore.kernel.org/r/20250915032618.1551762-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov 
Signed-off-by: Sasha Levin

cpuset: Use new excpus for nocpu error check when enabling root partition

2025-11-02T13:18:04+00:00

[ Upstream commit 59d5de3655698679ad8fd2cc82228de4679c4263 ]

A previous patch fixed a bug where new_prs should be assigned before
checking housekeeping conflicts. This patch addresses another potential
issue: the nocpu error check currently uses the xcpus which is not updated.
Although no issue has been observed so far, the check should be performed
using the new effective exclusive cpus.

The comment has been removed because the function returns an error if
nocpu checking fails, which is unrelated to the parent.

Signed-off-by: Chen Ridong 
Reviewed-by: Waiman Long 
Signed-off-by: Tejun Heo 
Signed-off-by: Sasha Levin

cpuset: fix failure to enable isolated partition when containing isolcpus

2025-10-15T10:03:18+00:00

[ Upstream commit 216217ebee16afc4d79c3e86a736d87175c18e68 ]

The 'isolcpus' parameter specified at boot time can be assigned to an
isolated partition. While it is valid put the 'isolcpus' in an isolated
partition, attempting to change a member cpuset to an isolated partition
will fail if the cpuset contains any 'isolcpus'.

For example, the system boots with 'isolcpus=9', and the following
configuration works correctly:

  # cd /sys/fs/cgroup/
  # mkdir test
  # echo 1 > test/cpuset.cpus
  # echo isolated > test/cpuset.cpus.partition
  # cat test/cpuset.cpus.partition
  isolated
  # echo 9 > test/cpuset.cpus
  # cat test/cpuset.cpus.partition
  isolated
  # cat test/cpuset.cpus
  9

However, the following steps to convert a member cpuset to an isolated
partition will fail:

  # cd /sys/fs/cgroup/
  # mkdir test
  # echo 9 > test/cpuset.cpus
  # echo isolated > test/cpuset.cpus.partition
  # cat test/cpuset.cpus.partition
  isolated invalid (partition config conflicts with housekeeping setup)

The issue occurs because the new partition state (new_prs) is used for
validation against housekeeping constraints before it has been properly
updated. To resolve this, move the assignment of new_prs before the
housekeeping validation check when enabling a root partition.

Fixes: 4a74e418881f ("cgroup/cpuset: Check partition conflict with housekeeping setup")
Signed-off-by: Chen Ridong 
Reviewed-by: Waiman Long 
Signed-off-by: Tejun Heo 
Signed-off-by: Sasha Levin

cgroup/psi: Set of->priv to NULL upon file release

2025-08-22T17:47:43+00:00

Setting of->priv to NULL when the file is released enables earlier bug
detection. This allows potential bugs to manifest as NULL pointer
dereferences rather than use-after-free errors[1], which are generally more
difficult to diagnose.

[1] https://lore.kernel.org/cgroups/38ef3ff9-b380-44f0-9315-8b3714b0948d@huaweicloud.com/T/#m8a3b3f88f0ff3da5925d342e90043394f8b2091b
Signed-off-by: Chen Ridong 
Signed-off-by: Tejun Heo

cgroup: split cgroup_destroy_wq into 3 workqueues

2025-08-22T17:44:11+00:00

A hung task can occur during [1] LTP cgroup testing when repeatedly
mounting/unmounting perf_event and net_prio controllers with
systemd.unified_cgroup_hierarchy=1. The hang manifests in
cgroup_lock_and_drain_offline() during root destruction.

Related case:
cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio

Call Trace:
	cgroup_lock_and_drain_offline+0x14c/0x1e8
	cgroup_destroy_root+0x3c/0x2c0
	css_free_rwork_fn+0x248/0x338
	process_one_work+0x16c/0x3b8
	worker_thread+0x22c/0x3b0
	kthread+0xec/0x100
	ret_from_fork+0x10/0x20

Root Cause:

CPU0                            CPU1
mount perf_event                umount net_prio
cgroup1_get_tree                cgroup_kill_sb
rebind_subsystems               // root destruction enqueues
				// cgroup_destroy_wq
// kill all perf_event css
                                // one perf_event css A is dying
                                // css A offline enqueues cgroup_destroy_wq
                                // root destruction will be executed first
                                css_free_rwork_fn
                                cgroup_destroy_root
                                cgroup_lock_and_drain_offline
                                // some perf descendants are dying
                                // cgroup_destroy_wq max_active = 1
                                // waiting for css A to die

Problem scenario:
1. CPU0 mounts perf_event (rebind_subsystems)
2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
3. A dying perf_event CSS gets queued for offline after root destruction
4. Root destruction waits for offline completion, but offline work is
   blocked behind root destruction in cgroup_destroy_wq (max_active=1)

Solution:
Split cgroup_destroy_wq into three dedicated workqueues:
cgroup_offline_wq – Handles CSS offline operations
cgroup_release_wq – Manages resource release
cgroup_free_wq – Performs final memory deallocation

This separation eliminates blocking in the CSS free path while waiting for
offline operations to complete.

[1] https://github.com/linux-test-project/ltp/blob/master/runtest/controllers
Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
Reported-by: Gao Yingjie 
Signed-off-by: Chen Ridong 
Suggested-by: Teju Heo 
Signed-off-by: Tejun Heo

cgroup: avoid null de-ref in css_rstat_exit()

2025-08-09T18:46:32+00:00

css_rstat_exit() may be called asynchronously in scenarios where preceding
calls to css_rstat_init() have not completed. One such example is this
sequence below:

css_create(...)
{
	...
	init_and_link_css(css, ...);

	err = percpu_ref_init(...);
	if (err)
		goto err_free_css;
	err = cgroup_idr_alloc(...);
	if (err)
		goto err_free_css;
	err = css_rstat_init(css, ...);
	if (err)
		goto err_free_css;
	...
err_free_css:
	INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
	queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
	return ERR_PTR(err);
}

If any of the three goto jumps are taken, async cleanup will begin and
css_rstat_exit() will be invoked on an uninitialized css->rstat_cpu.

Avoid accessing the unitialized field by returning early in
css_rstat_exit() if this is the case.

Signed-off-by: JP Kobryn 
Suggested-by: Michal Koutný 
Fixes: 5da3bfa029d68 ("cgroup: use separate rstat trees for each subsystem")
Cc: stable@vger.kernel.org # v6.16
Reported-by: syzbot+8d052e8b99e40bc625ed@syzkaller.appspotmail.com
Acked-by: Shakeel Butt 
Signed-off-by: Tejun Heo

cgroup/cpuset: Remove the unnecessary css_get/put() in cpuset_partition_write()

2025-08-09T18:42:47+00:00

The css_get/put() calls in cpuset_partition_write() are unnecessary as
an active reference of the kernfs node will be taken which will prevent
its removal and guarantee the existence of the css. Only the online
check is needed.

Signed-off-by: Waiman Long 
Reviewed-by: Michal Koutný 
Signed-off-by: Tejun Heo

cgroup/cpuset: Fix a partition error with CPU hotplug

2025-08-09T18:42:28+00:00

It was found during testing that an invalid leaf partition with an
empty effective exclusive CPU list can become a valid empty partition
with no CPU afer an offline/online operation of an unrelated CPU. An
empty partition root is allowed in the special case that it has no
task in its cgroup and has distributed out all its CPUs to its child
partitions. That is certainly not the case here.

The problem is in the cpumask_subsets() test in the hotplug case
(update with no new mask) of update_parent_effective_cpumask() as it
also returns true if the effective exclusive CPU list is empty. Fix that
by addding the cpumask_empty() test to root out this exception case.
Also add the cpumask_empty() test in cpuset_hotplug_update_tasks()
to avoid calling update_parent_effective_cpumask() for this special case.

Fixes: 0c7f293efc87 ("cgroup/cpuset: Add cpuset.cpus.exclusive.effective for v2")
Signed-off-by: Waiman Long 
Signed-off-by: Tejun Heo

cgroup/cpuset: Use static_branch_enable_cpuslocked() on cpusets_insane_config_key

2025-08-09T18:41:39+00:00

The following lockdep splat was observed.

[  812.359086] ============================================
[  812.359089] WARNING: possible recursive locking detected
[  812.359097] --------------------------------------------
[  812.359100] runtest.sh/30042 is trying to acquire lock:
[  812.359105] ffffffffa7f27420 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_enable+0xe/0x20
[  812.359131]
[  812.359131] but task is already holding lock:
[  812.359134] ffffffffa7f27420 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_write_resmask+0x98/0xa70
     :
[  812.359267] Call Trace:
[  812.359272]  
[  812.359367]  cpus_read_lock+0x3c/0xe0
[  812.359382]  static_key_enable+0xe/0x20
[  812.359389]  check_insane_mems_config.part.0+0x11/0x30
[  812.359398]  cpuset_write_resmask+0x9f2/0xa70
[  812.359411]  cgroup_file_write+0x1c7/0x660
[  812.359467]  kernfs_fop_write_iter+0x358/0x530
[  812.359479]  vfs_write+0xabe/0x1250
[  812.359529]  ksys_write+0xf9/0x1d0
[  812.359558]  do_syscall_64+0x5f/0xe0

Since commit d74b27d63a8b ("cgroup/cpuset: Change cpuset_rwsem
and hotplug lock order"), the ordering of cpu hotplug lock
and cpuset_mutex had been reversed. That patch correctly
used the cpuslocked version of the static branch API to enable
cpusets_pre_enable_key and cpusets_enabled_key, but it didn't do the
same for cpusets_insane_config_key.

The cpusets_insane_config_key can be enabled in the
check_insane_mems_config() which is called from update_nodemask()
or cpuset_hotplug_update_tasks() with both cpu hotplug lock and
cpuset_mutex held. Deadlock can happen with a pending hotplug event that
tries to acquire the cpu hotplug write lock which will block further
cpus_read_lock() attempt from check_insane_mems_config(). Fix that by
switching to use static_branch_enable_cpuslocked().

Fixes: d74b27d63a8b ("cgroup/cpuset: Change cpuset_rwsem and hotplug lock order")
Signed-off-by: Waiman Long 
Reviewed-by: Juri Lelli 
Signed-off-by: Tejun Heo