<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/kernel/cgroup, branch linux-6.12.y</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>cgroup: fix race between task migration and iteration</title>
<updated>2026-03-25T10:08:30+00:00</updated>
<author>
<name>Qingye Zhao</name>
<email>zhaoqingye@honor.com</email>
</author>
<published>2026-02-11T09:24:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=9cca530c7cc1b3e02cb8fa7f80060dd4b38562ce'/>
<id>9cca530c7cc1b3e02cb8fa7f80060dd4b38562ce</id>
<content type='text'>
commit 5ee01f1a7343d6a3547b6802ca2d4cdce0edacb1 upstream.

When a task is migrated out of a css_set, cgroup_migrate_add_task()
first moves it from cset-&gt;tasks to cset-&gt;mg_tasks via:

    list_move_tail(&amp;task-&gt;cg_list, &amp;cset-&gt;mg_tasks);

If a css_task_iter currently has it-&gt;task_pos pointing to this task,
css_set_move_task() calls css_task_iter_skip() to keep the iterator
valid. However, since the task has already been moved to -&gt;mg_tasks,
the iterator is advanced relative to the mg_tasks list instead of the
original tasks list. As a result, remaining tasks on cset-&gt;tasks, as
well as tasks queued on cset-&gt;mg_tasks, can be skipped by iteration.

Fix this by calling css_set_skip_task_iters() before unlinking
task-&gt;cg_list from cset-&gt;tasks. This advances all active iterators to
the next task on cset-&gt;tasks, so iteration continues correctly even
when a task is concurrently being migrated.

This race is hard to hit in practice without instrumentation, but it
can be reproduced by artificially slowing down cgroup_procs_show().
For example, on an Android device a temporary
/sys/kernel/cgroup/cgroup_test knob can be added to inject a delay
into cgroup_procs_show(), and then:

  1) Spawn three long-running tasks (PIDs 101, 102, 103).
  2) Create a test cgroup and move the tasks into it.
  3) Enable a large delay via /sys/kernel/cgroup/cgroup_test.
  4) In one shell, read cgroup.procs from the test cgroup.
  5) Within the delay window, in another shell migrate PID 102 by
     writing it to a different cgroup.procs file.

Under this setup, cgroup.procs can intermittently show only PID 101
while skipping PID 103. Once the migration completes, reading the
file again shows all tasks as expected.
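
The broken vs. fixed ordering can be sketched with a small userspace
model (Python; all names are illustrative and this is not the kernel
implementation, which operates on struct css_set and css_task_iter):

```python
# Toy model of the css_set iterator race. `tasks` models cset->tasks,
# `mg_tasks` models cset->mg_tasks; the iterator walks `tasks` by index.
class CsetModel:
    def __init__(self, pids):
        self.tasks = list(pids)
        self.mg_tasks = []

def iterate_during_migration(cset, migrating_pid, fixed):
    seen = []
    i = 0
    while i != len(cset.tasks):
        pid = cset.tasks[i]
        if pid == migrating_pid:
            if fixed:
                # Fixed ordering: skip iterators while the task is still
                # on ->tasks, then move it; `i` now names the next task.
                cset.tasks.remove(pid)
                cset.mg_tasks.append(pid)
                continue
            # Buggy ordering: the task is moved first, so the skip
            # advances relative to ->mg_tasks, whose tail is empty;
            # the walk ends early and the rest of ->tasks is missed.
            cset.tasks.remove(pid)
            cset.mg_tasks.append(pid)
            break
        seen.append(pid)
        i += 1
    return seen
```

With PIDs 101, 102, 103 and 102 migrating mid-walk, the buggy ordering
yields only [101] (103 is skipped) while the fixed ordering yields
[101, 103], matching the cgroup.procs symptom described above.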

Note that this change does not allow removing the existing
css_set_skip_task_iters() call in css_set_move_task(). The new call
in cgroup_migrate_add_task() only handles iterators that are racing
with migration while the task is still on cset-&gt;tasks. Iterators may
also start after the task has been moved to cset-&gt;mg_tasks. If we
dropped css_set_skip_task_iters() from css_set_move_task(), such
iterators could keep task_pos pointing to a migrating task, causing
css_task_iter_advance() to malfunction on the destination css_set,
up to and including crashes or infinite loops.

The race window between migration and iteration is very small, and
css_task_iter is not on a hot path. In the worst case, when an
iterator is positioned on the first thread of the migrating process,
cgroup_migrate_add_task() may have to skip multiple tasks via
css_set_skip_task_iters(). However, this only happens when migration
and iteration actually race, so the performance impact is negligible
and well worth the correctness fix.

Fixes: b636fd38dc40 ("cgroup: Implement css_task_iter_skip()")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Qingye Zhao &lt;zhaoqingye@honor.com&gt;
Reviewed-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in update_cpumasks_hier()</title>
<updated>2026-03-13T16:20:17+00:00</updated>
<author>
<name>Waiman Long</name>
<email>longman@redhat.com</email>
</author>
<published>2026-02-21T18:54:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=6e4f70034387bfdea0027692a34599dc88f3a03e'/>
<id>6e4f70034387bfdea0027692a34599dc88f3a03e</id>
<content type='text'>
[ Upstream commit 68230aac8b9aad243626fbaf3ca170012c17fec5 ]

Commit e2ffe502ba45 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2")
incorrectly changed the 2nd parameter of cpuset_update_tasks_cpumask()
from tmp-&gt;new_cpus to cp-&gt;effective_cpus. This second parameter is just
a temporary cpumask for internal use. The cpuset_update_tasks_cpumask()
function was originally called update_tasks_cpumask() before commit
381b53c3b549 ("cgroup/cpuset: rename functions shared between v1
and v2").

This mistake can incorrectly change the effective_cpus of the
cpuset when it is the top_cpuset, or on the arm64 architecture where
task_cpu_possible_mask() may differ from cpu_possible_mask. So far
top_cpuset is never passed to update_cpumasks_hier(), but arm64 can
still be affected. Fix it by reverting the incorrect change.

Fixes: e2ffe502ba45 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2")
Signed-off-by: Waiman Long &lt;longman@redhat.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>cpuset: Fix missing adaptation for cpuset_is_populated</title>
<updated>2026-02-19T15:29:55+00:00</updated>
<author>
<name>Chen Ridong</name>
<email>chenridong@huawei.com</email>
</author>
<published>2026-01-14T01:51:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=7b7829839dac277f5a69fb6838aabdb67e366227'/>
<id>7b7829839dac277f5a69fb6838aabdb67e366227</id>
<content type='text'>
Commit b1bcaed1e39a ("cpuset: Treat cpusets in attaching as populated")
was backported to the long-term support (LTS) branches. However, because
commit d5cf4d34a333 ("cgroup/cpuset: Don't track # of local child
partitions") was not backported, a corresponding adaptation to the
backported code is still required.

To ensure correct behavior, replace cgroup_is_populated with
cpuset_is_populated in the partition_is_populated function.

Cc: stable@vger.kernel.org	# 6.1+
Fixes: b1bcaed1e39a ("cpuset: Treat cpusets in attaching as populated")
Cc: Waiman Long &lt;longman@redhat.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Chen Ridong &lt;chenridong@huawei.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>cgroup: Fix kernfs_node UAF in css_free_rwork_fn</title>
<updated>2026-02-06T15:55:49+00:00</updated>
<author>
<name>T.J. Mercier</name>
<email>tjmercier@google.com</email>
</author>
<published>2026-01-29T19:10:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=993047031c9fcb7f4a58389692e844f7be423cc7'/>
<id>993047031c9fcb7f4a58389692e844f7be423cc7</id>
<content type='text'>
This fix is not upstream. It is applicable only to kernels 6.10
(where the cgroup_rstat_lock tracepoint was added) through 6.15;
after that, commit 5da3bfa029d6 ("cgroup: use separate rstat trees
for each subsystem") reordered cgroup_rstat_flush as part of a new
feature and inadvertently fixed this UAF.

css_free_rwork_fn first releases the last reference on the cgroup's
kernfs_node, and then calls cgroup_rstat_exit which attempts to use it
in the cgroup_rstat_lock tracepoint:

kernfs_put(cgrp-&gt;kn);
cgroup_rstat_exit
  cgroup_rstat_flush
    __cgroup_rstat_lock
      trace_cgroup_rstat_locked:
        TP_fast_assign(
          __entry-&gt;root = cgrp-&gt;root-&gt;hierarchy_id;
          __entry-&gt;id = cgroup_id(cgrp);

Where cgroup_id is:
static inline u64 cgroup_id(const struct cgroup *cgrp)
{
	return cgrp-&gt;kn-&gt;id;
}

Fix this by reordering the kernfs_put after cgroup_rstat_exit.
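
The ordering dependency can be shown with a minimal userspace sketch
(Python; the names are illustrative stand-ins for the kernel objects,
not the actual implementation):

```python
# Minimal model of the free/use ordering bug: dropping the last
# reference before a later step that still dereferences the object.
class KernfsNodeModel:
    def __init__(self, node_id):
        self.id = node_id
        self.freed = False

def kernfs_put_last(kn):
    kn.freed = True          # models the final put freeing the node

def rstat_exit_tracepoint(kn):
    # models trace_cgroup_rstat_locked reading cgroup_id(cgrp),
    # i.e. cgrp->kn->id, during the final flush
    if kn.freed:
        raise RuntimeError("use-after-free")
    return kn.id

def css_free(kn, fixed):
    if fixed:
        ident = rstat_exit_tracepoint(kn)   # flush first...
        kernfs_put_last(kn)                 # ...then drop the node
    else:
        kernfs_put_last(kn)                 # buggy order
        ident = rstat_exit_tracepoint(kn)   # reads freed memory
    return ident
```

The buggy order trips the use-after-free check; the fixed order reads
the id safely and only then drops the node, mirroring the reordering
of kernfs_put after cgroup_rstat_exit.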

[78782.605161][ T9861] BUG: KASAN: slab-use-after-free in trace_event_raw_event_cgroup_rstat+0x110/0x1dc
[78782.605182][ T9861] Read of size 8 at addr ffffff890270e610 by task kworker/6:1/9861
[78782.605199][ T9861] CPU: 6 UID: 0 PID: 9861 Comm: kworker/6:1 Tainted: G        W  OE      6.12.23-android16-5-gabaf21382e8f-4k #1 0308449da8ad70d2d3649ae989c1d02f0fbf562c
[78782.605220][ T9861] Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[78782.605226][ T9861] Hardware name: Qualcomm Technologies, Inc. Alor QRD + WCN7750 WLAN + Kundu PD2536F_EX (DT)
[78782.605235][ T9861] Workqueue: cgroup_destroy css_free_rwork_fn
[78782.605251][ T9861] Call trace:
[78782.605254][ T9861]  dump_backtrace+0x120/0x170
[78782.605267][ T9861]  show_stack+0x2c/0x40
[78782.605276][ T9861]  dump_stack_lvl+0x84/0xb4
[78782.605286][ T9861]  print_report+0x144/0x7a4
[78782.605301][ T9861]  kasan_report+0xe0/0x140
[78782.605315][ T9861]  __asan_load8+0x98/0xa0
[78782.605329][ T9861]  trace_event_raw_event_cgroup_rstat+0x110/0x1dc
[78782.605339][ T9861]  __traceiter_cgroup_rstat_locked+0x78/0xc4
[78782.605355][ T9861]  __cgroup_rstat_lock+0xe8/0x1dc
[78782.605368][ T9861]  cgroup_rstat_flush_locked+0x7dc/0xaec
[78782.605383][ T9861]  cgroup_rstat_flush+0x34/0x108
[78782.605396][ T9861]  cgroup_rstat_exit+0x2c/0x120
[78782.605409][ T9861]  css_free_rwork_fn+0x504/0xa18
[78782.605421][ T9861]  process_scheduled_works+0x378/0x8e0
[78782.605435][ T9861]  worker_thread+0x5a8/0x77c
[78782.605446][ T9861]  kthread+0x1c4/0x270
[78782.605455][ T9861]  ret_from_fork+0x10/0x20
[78782.605470][ T9861] Allocated by task 2864 on cpu 7 at 78781.564561s:
[78782.605481][ T9861]  kasan_save_track+0x44/0x9c
[78782.605497][ T9861]  kasan_save_alloc_info+0x40/0x54
[78782.605507][ T9861]  __kasan_slab_alloc+0x70/0x8c
[78782.605521][ T9861]  kmem_cache_alloc_noprof+0x1a0/0x428
[78782.605534][ T9861]  __kernfs_new_node+0xd4/0x3e4
[78782.605545][ T9861]  kernfs_new_node+0xbc/0x168
[78782.605554][ T9861]  kernfs_create_dir_ns+0x58/0xe8
[78782.605565][ T9861]  cgroup_mkdir+0x25c/0xc9c
[78782.605576][ T9861]  kernfs_iop_mkdir+0x130/0x214
[78782.605586][ T9861]  vfs_mkdir+0x290/0x388
[78782.605599][ T9861]  do_mkdirat+0xfc/0x27c
[78782.605612][ T9861]  __arm64_sys_mkdirat+0x5c/0x78
[78782.605625][ T9861]  invoke_syscall+0x90/0x1e8
[78782.605634][ T9861]  el0_svc_common+0x134/0x168
[78782.605643][ T9861]  do_el0_svc+0x34/0x44
[78782.605652][ T9861]  el0_svc+0x38/0x84
[78782.605667][ T9861]  el0t_64_sync_handler+0x70/0xbc
[78782.605681][ T9861]  el0t_64_sync+0x19c/0x1a0
[78782.605695][ T9861] Freed by task 69 on cpu 1 at 78782.573275s:
[78782.605705][ T9861]  kasan_save_track+0x44/0x9c
[78782.605719][ T9861]  kasan_save_free_info+0x54/0x70
[78782.605729][ T9861]  __kasan_slab_free+0x68/0x8c
[78782.605743][ T9861]  kmem_cache_free+0x118/0x488
[78782.605755][ T9861]  kernfs_free_rcu+0xa0/0xb8
[78782.605765][ T9861]  rcu_do_batch+0x324/0xaa0
[78782.605775][ T9861]  rcu_nocb_cb_kthread+0x388/0x690
[78782.605785][ T9861]  kthread+0x1c4/0x270
[78782.605794][ T9861]  ret_from_fork+0x10/0x20
[78782.605809][ T9861] Last potentially related work creation:
[78782.605814][ T9861]  kasan_save_stack+0x40/0x70
[78782.605829][ T9861]  __kasan_record_aux_stack+0xb0/0xcc
[78782.605839][ T9861]  kasan_record_aux_stack_noalloc+0x14/0x24
[78782.605849][ T9861]  __call_rcu_common+0x54/0x390
[78782.605863][ T9861]  call_rcu+0x18/0x28
[78782.605875][ T9861]  kernfs_put+0x17c/0x28c
[78782.605884][ T9861]  css_free_rwork_fn+0x4f4/0xa18
[78782.605897][ T9861]  process_scheduled_works+0x378/0x8e0
[78782.605910][ T9861]  worker_thread+0x5a8/0x77c
[78782.605923][ T9861]  kthread+0x1c4/0x270
[78782.605932][ T9861]  ret_from_fork+0x10/0x20
[78782.605947][ T9861] The buggy address belongs to the object at ffffff890270e5b0
[78782.605947][ T9861]  which belongs to the cache kernfs_node_cache of size 144
[78782.605957][ T9861] The buggy address is located 96 bytes inside of
[78782.605957][ T9861]  freed 144-byte region [ffffff890270e5b0, ffffff890270e640)

Fixes: fc29e04ae1ad ("cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints")
Signed-off-by: T.J. Mercier &lt;tjmercier@google.com&gt;
Acked-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>cpuset: Treat cpusets in attaching as populated</title>
<updated>2025-12-18T12:55:03+00:00</updated>
<author>
<name>Chen Ridong</name>
<email>chenridong@huawei.com</email>
</author>
<published>2025-11-14T02:08:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=a0e2c571e8504dcdf8661e46941028ac8750988c'/>
<id>a0e2c571e8504dcdf8661e46941028ac8750988c</id>
<content type='text'>
[ Upstream commit b1bcaed1e39a9e0dfbe324a15d2ca4253deda316 ]

Currently, the check for whether a partition is populated does not
account for tasks that are in the middle of attaching to a cpuset.
This corner case can leave a task stuck in a partition with no
effective CPUs.

The race condition occurs as follows:

cpu0				cpu1
				//cpuset A  with cpu N
migrate task p to A
cpuset_can_attach
// with effective cpus
// check ok

// cpuset_mutex is not held	// clear cpuset.cpus.exclusive
				// making effective cpus empty
				update_exclusive_cpumask
				// tasks_nocpu_error check ok
				// empty effective cpus, partition valid
cpuset_attach
...
// task p stays in A, with non-effective cpus.

To fix this issue, this patch introduces cs_is_populated, which considers
tasks in the attaching cpuset. This new helper is used in validate_change
and partition_is_populated.
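
The window between cpuset_can_attach and cpuset_attach can be modelled
sequentially with a toy sketch (Python; hypothetical names, not the
kernel code):

```python
# Toy model: update_exclusive clears cpuset.cpus.exclusive, which
# empties the effective cpus; a populated partition must reject that.
# The fix is to also count tasks that are mid-attach as population.
def update_exclusive(cs, fixed):
    populated = cs["ntasks"] != 0 or (fixed and cs["attaching"] != 0)
    if populated:
        return "rejected"        # models the tasks_nocpu_error path
    cs["effective"] = set()
    return "cleared"

def migrate_task(cs, fixed):
    cs["attaching"] = 1          # cpuset_can_attach passed, lock dropped
    result = update_exclusive(cs, fixed)   # races in on the other CPU
    cs["ntasks"] += 1            # cpuset_attach completes
    cs["attaching"] = 0
    return result, len(cs["effective"])
```

Starting from one effective CPU and no resident tasks, the unfixed run
clears the exclusive cpus and strands the attaching task with zero
effective CPUs, while the fixed run rejects the clear because the
mid-attach task now counts as population.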

Fixes: e2d59900d936 ("cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective")
Signed-off-by: Chen Ridong &lt;chenridong@huawei.com&gt;
Reviewed-by: Waiman Long &lt;longman@redhat.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Do not limit bpf_cgroup_from_id to current's namespace</title>
<updated>2025-11-13T20:34:06+00:00</updated>
<author>
<name>Kumar Kartikeya Dwivedi</name>
<email>memxor@gmail.com</email>
</author>
<published>2025-09-15T03:26:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=dea5e008d5c7236bbddebf1339bf4d0153144928'/>
<id>dea5e008d5c7236bbddebf1339bf4d0153144928</id>
<content type='text'>
[ Upstream commit 2c895133950646f45e5cf3900b168c952c8dbee8 ]

The bpf_cgroup_from_id kfunc relies on cgroup_get_from_id to obtain
the cgroup corresponding to a given cgroup ID. This helper can be
called in many contexts where the current task is essentially random.
A recent example is its use in sched_ext's ops.tick() to obtain the
root cgroup pointer. Since the current task can be whatever user
space task happened to be preempted by the timer tick, this makes
the helper's behavior unreliable.

Refactor out __cgroup_get_from_id as the non-namespace aware version of
cgroup_get_from_id, and change bpf_cgroup_from_id to make use of it.

There is no compatibility breakage here, since changing the namespace
against which the lookup is being done to the root cgroup namespace only
permits a wider set of lookups to succeed now. The cgroup IDs across
namespaces are globally unique, and thus don't need to be retranslated.
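
The compatibility argument can be sketched as follows (Python toy
model; the real lookup works on a global IDR of live cgroups plus the
caller's cgroup namespace, and all names here are illustrative):

```python
# Toy model: cgroup IDs are globally unique.  The namespace-aware
# lookup only succeeds for IDs visible in the caller's namespace;
# the raw-by-ID lookup succeeds for any live ID, so every lookup
# that succeeded before still succeeds (a strict superset).
LIVE_CGROUPS = {1: "root", 42: "container-a/leaf"}
NS_VISIBLE = {42}    # what a random preempted task's ns happens to see

def get_from_id_ns(cgid):
    if cgid in NS_VISIBLE:
        return LIVE_CGROUPS.get(cgid)
    return None

def get_from_id_raw(cgid):
    return LIVE_CGROUPS.get(cgid)
```

Here the namespace-aware lookup of ID 1 fails even though the root
cgroup is live, which is exactly the unreliability seen from
ops.tick(); the raw lookup succeeds for both IDs without changing the
result of any lookup that already worked.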

Reported-by: Dan Schatzberg &lt;dschatzberg@meta.com&gt;
Signed-off-by: Kumar Kartikeya Dwivedi &lt;memxor@gmail.com&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Link: https://lore.kernel.org/r/20250915032618.1551762-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>cpuset: Use new excpus for nocpu error check when enabling root partition</title>
<updated>2025-11-02T13:15:21+00:00</updated>
<author>
<name>Chen Ridong</name>
<email>chenridong@huawei.com</email>
</author>
<published>2025-09-19T01:12:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=cf9459ce31c3d5b67c570e05e65fe1ee5fef5082'/>
<id>cf9459ce31c3d5b67c570e05e65fe1ee5fef5082</id>
<content type='text'>
[ Upstream commit 59d5de3655698679ad8fd2cc82228de4679c4263 ]

A previous patch fixed a bug where new_prs must be assigned before
checking housekeeping conflicts. This patch addresses another
potential issue: the nocpu error check currently uses xcpus, which
has not yet been updated. Although no issue has been observed so far,
the check should be performed against the new effective exclusive
cpus.

The comment has been removed because the function returns an error
when the nocpu check fails, which is unrelated to the parent.

Signed-off-by: Chen Ridong &lt;chenridong@huawei.com&gt;
Reviewed-by: Waiman Long &lt;longman@redhat.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>cgroup: split cgroup_destroy_wq into 3 workqueues</title>
<updated>2025-09-25T09:13:42+00:00</updated>
<author>
<name>Chen Ridong</name>
<email>chenridong@huawei.com</email>
</author>
<published>2025-08-19T01:07:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=ded4d207a3209a834b6831ceec7f39b934c74802'/>
<id>ded4d207a3209a834b6831ceec7f39b934c74802</id>
<content type='text'>
[ Upstream commit 79f919a89c9d06816dbdbbd168fa41d27411a7f9 ]

A hung task can occur during LTP cgroup testing [1] when repeatedly
mounting/unmounting the perf_event and net_prio controllers with
systemd.unified_cgroup_hierarchy=1. The hang manifests in
cgroup_lock_and_drain_offline() during root destruction.

Related case:
cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio

Call Trace:
	cgroup_lock_and_drain_offline+0x14c/0x1e8
	cgroup_destroy_root+0x3c/0x2c0
	css_free_rwork_fn+0x248/0x338
	process_one_work+0x16c/0x3b8
	worker_thread+0x22c/0x3b0
	kthread+0xec/0x100
	ret_from_fork+0x10/0x20

Root Cause:

CPU0                            CPU1
mount perf_event                umount net_prio
cgroup1_get_tree                cgroup_kill_sb
rebind_subsystems               // root destruction enqueues
				// cgroup_destroy_wq
// kill all perf_event css
                                // one perf_event css A is dying
                                // css A offline enqueues cgroup_destroy_wq
                                // root destruction will be executed first
                                css_free_rwork_fn
                                cgroup_destroy_root
                                cgroup_lock_and_drain_offline
                                // some perf descendants are dying
                                // cgroup_destroy_wq max_active = 1
                                // waiting for css A to die

Problem scenario:
1. CPU0 mounts perf_event (rebind_subsystems)
2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
3. A dying perf_event CSS gets queued for offline after root destruction
4. Root destruction waits for offline completion, but offline work is
   blocked behind root destruction in cgroup_destroy_wq (max_active=1)

Solution:
Split cgroup_destroy_wq into three dedicated workqueues:
cgroup_offline_wq – Handles CSS offline operations
cgroup_release_wq – Manages resource release
cgroup_free_wq – Performs final memory deallocation

This separation eliminates blocking in the CSS free path while waiting for
offline operations to complete.
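As an editorial aside, the ordering hazard above can be modeled with a toy FIFO simulation (plain Python, not kernel code; the names "destroy_root" and "offline_A" come from the scenario above, not the kernel source):

```python
# Illustrative model of why one max_active=1 workqueue deadlocks: a queued
# item that must wait for a later item in the same FIFO never runs, while
# separate queues let both complete.
from collections import deque

def process(queues):
    # Repeatedly run each queue's head item; an item completes only once
    # everything it waits for has completed, otherwise it blocks its queue.
    done = set()
    progress = True
    while progress:
        progress = False
        for q in queues:
            if q and q[0][1].issubset(done):
                done.add(q.popleft()[0])
                progress = True
    return done

# One cgroup_destroy_wq: root destruction sits ahead of the offline work
# it waits for, so neither completes.
single = [deque([("destroy_root", {"offline_A"}), ("offline_A", set())])]

# Dedicated queues: the offline work runs independently, and root
# destruction then proceeds.
split = [deque([("destroy_root", {"offline_A"})]),
         deque([("offline_A", set())])]
```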

[1] https://github.com/linux-test-project/ltp/blob/master/runtest/controllers
Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
Reported-by: Gao Yingjie &lt;gaoyingjie@uniontech.com&gt;
Signed-off-by: Chen Ridong &lt;chenridong@huawei.com&gt;
Suggested-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 79f919a89c9d06816dbdbbd168fa41d27411a7f9 ]

A hung task can occur during [1] LTP cgroup testing when repeatedly
mounting/unmounting perf_event and net_prio controllers with
systemd.unified_cgroup_hierarchy=1. The hang manifests in
cgroup_lock_and_drain_offline() during root destruction.

Related case:
cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio

Call Trace:
	cgroup_lock_and_drain_offline+0x14c/0x1e8
	cgroup_destroy_root+0x3c/0x2c0
	css_free_rwork_fn+0x248/0x338
	process_one_work+0x16c/0x3b8
	worker_thread+0x22c/0x3b0
	kthread+0xec/0x100
	ret_from_fork+0x10/0x20

Root Cause:

CPU0                            CPU1
mount perf_event                umount net_prio
cgroup1_get_tree                cgroup_kill_sb
rebind_subsystems               // root destruction enqueues
				// cgroup_destroy_wq
// kill all perf_event css
                                // one perf_event css A is dying
                                // css A offline enqueues cgroup_destroy_wq
                                // root destruction will be executed first
                                css_free_rwork_fn
                                cgroup_destroy_root
                                cgroup_lock_and_drain_offline
                                // some perf descendants are dying
                                // cgroup_destroy_wq max_active = 1
                                // waiting for css A to die

Problem scenario:
1. CPU0 mounts perf_event (rebind_subsystems)
2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
3. A dying perf_event CSS gets queued for offline after root destruction
4. Root destruction waits for offline completion, but offline work is
   blocked behind root destruction in cgroup_destroy_wq (max_active=1)

Solution:
Split cgroup_destroy_wq into three dedicated workqueues:
cgroup_offline_wq – Handles CSS offline operations
cgroup_release_wq – Manages resource release
cgroup_free_wq – Performs final memory deallocation

This separation eliminates blocking in the CSS free path while waiting for
offline operations to complete.

[1] https://github.com/linux-test-project/ltp/blob/master/runtest/controllers
Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
Reported-by: Gao Yingjie &lt;gaoyingjie@uniontech.com&gt;
Signed-off-by: Chen Ridong &lt;chenridong@huawei.com&gt;
Suggested-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cgroup/cpuset: Fix a partition error with CPU hotplug</title>
<updated>2025-08-28T14:31:11+00:00</updated>
<author>
<name>Waiman Long</name>
<email>longman@redhat.com</email>
</author>
<published>2025-08-06T17:24:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=7a60c21384c881cee33e12e9e93a6eb4bf1e5ba6'/>
<id>7a60c21384c881cee33e12e9e93a6eb4bf1e5ba6</id>
<content type='text'>
[ Upstream commit 150e298ae0ccbecff2357a72fbabd80f8849ea6e ]

It was found during testing that an invalid leaf partition with an
empty effective exclusive CPU list can become a valid empty partition
with no CPU after an offline/online operation of an unrelated CPU. An
empty partition root is allowed in the special case that it has no
task in its cgroup and has distributed out all its CPUs to its child
partitions. That is certainly not the case here.

The problem is in the cpumask_subset() test in the hotplug case
(update with no new mask) of update_parent_effective_cpumask() as it
also returns true if the effective exclusive CPU list is empty. Fix that
by adding the cpumask_empty() test to root out this exception case.
Also add the cpumask_empty() test in cpuset_hotplug_update_tasks()
to avoid calling update_parent_effective_cpumask() for this special case.
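As an editorial aside, the trap can be shown with plain Python sets standing in for cpumasks (illustrative names, not the kernel API):

```python
# Illustrative sketch: a subset test alone is fooled by an empty effective
# exclusive CPU list, since the empty set is a subset of every set. The
# fix adds an explicit emptiness test, mirroring the added cpumask_empty()
# checks described above.

def partition_can_become_valid(effective_xcpus, parent_cpus):
    if not effective_xcpus:      # the added cpumask_empty()-style test
        return False             # no exclusive CPUs: remain invalid
    return effective_xcpus.issubset(parent_cpus)

# Without the emptiness test, set().issubset(parent) is True, which is
# exactly how the invalid partition slipped through after the hotplug.
```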

Fixes: 0c7f293efc87 ("cgroup/cpuset: Add cpuset.cpus.exclusive.effective for v2")
Signed-off-by: Waiman Long &lt;longman@redhat.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 150e298ae0ccbecff2357a72fbabd80f8849ea6e ]

It was found during testing that an invalid leaf partition with an
empty effective exclusive CPU list can become a valid empty partition
with no CPU after an offline/online operation of an unrelated CPU. An
empty partition root is allowed in the special case that it has no
task in its cgroup and has distributed out all its CPUs to its child
partitions. That is certainly not the case here.

The problem is in the cpumask_subset() test in the hotplug case
(update with no new mask) of update_parent_effective_cpumask() as it
also returns true if the effective exclusive CPU list is empty. Fix that
by adding the cpumask_empty() test to root out this exception case.
Also add the cpumask_empty() test in cpuset_hotplug_update_tasks()
to avoid calling update_parent_effective_cpumask() for this special case.

Fixes: 0c7f293efc87 ("cgroup/cpuset: Add cpuset.cpus.exclusive.effective for v2")
Signed-off-by: Waiman Long &lt;longman@redhat.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cgroup/cpuset: Use static_branch_enable_cpuslocked() on cpusets_insane_config_key</title>
<updated>2025-08-28T14:31:11+00:00</updated>
<author>
<name>Waiman Long</name>
<email>longman@redhat.com</email>
</author>
<published>2025-08-06T17:24:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=68da1fac48f0cf32b008bc4991d7791306bdf8ef'/>
<id>68da1fac48f0cf32b008bc4991d7791306bdf8ef</id>
<content type='text'>
[ Upstream commit 65f97cc81b0adc5f49cf6cff5d874be0058e3f41 ]

The following lockdep splat was observed.

[  812.359086] ============================================
[  812.359089] WARNING: possible recursive locking detected
[  812.359097] --------------------------------------------
[  812.359100] runtest.sh/30042 is trying to acquire lock:
[  812.359105] ffffffffa7f27420 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_enable+0xe/0x20
[  812.359131]
[  812.359131] but task is already holding lock:
[  812.359134] ffffffffa7f27420 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_write_resmask+0x98/0xa70
     :
[  812.359267] Call Trace:
[  812.359272]  &lt;TASK&gt;
[  812.359367]  cpus_read_lock+0x3c/0xe0
[  812.359382]  static_key_enable+0xe/0x20
[  812.359389]  check_insane_mems_config.part.0+0x11/0x30
[  812.359398]  cpuset_write_resmask+0x9f2/0xa70
[  812.359411]  cgroup_file_write+0x1c7/0x660
[  812.359467]  kernfs_fop_write_iter+0x358/0x530
[  812.359479]  vfs_write+0xabe/0x1250
[  812.359529]  ksys_write+0xf9/0x1d0
[  812.359558]  do_syscall_64+0x5f/0xe0

Since commit d74b27d63a8b ("cgroup/cpuset: Change cpuset_rwsem
and hotplug lock order"), the ordering of cpu hotplug lock
and cpuset_mutex had been reversed. That patch correctly
used the cpuslocked version of the static branch API to enable
cpusets_pre_enable_key and cpusets_enabled_key, but it didn't do the
same for cpusets_insane_config_key.

The cpusets_insane_config_key can be enabled in the
check_insane_mems_config() which is called from update_nodemask()
or cpuset_hotplug_update_tasks() with both cpu hotplug lock and
cpuset_mutex held. Deadlock can happen with a pending hotplug event that
tries to acquire the cpu hotplug write lock which will block further
cpus_read_lock() attempt from check_insane_mems_config(). Fix that by
switching to use static_branch_enable_cpuslocked().
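As an editorial aside, the locking rule behind the fix can be sketched with a toy lock (plain Python, not the kernel API; the function names only mirror static_key_enable and its _cpuslocked variant):

```python
# Minimal sketch: code that already holds cpu_hotplug_lock for reading
# must use the *_cpuslocked variant, which asserts the lock is held
# rather than re-acquiring it, since a second read acquisition can block
# behind a pending writer.

class CpuHotplugLock:
    def __init__(self):
        self.readers = 0
        self.writer_pending = False

    def read_lock(self):
        # New readers queue behind a pending writer, as with the real lock.
        if self.writer_pending:
            raise RuntimeError("reader blocked behind pending writer")
        self.readers += 1

def static_key_enable(lock):
    lock.read_lock()             # takes the lock itself: unsafe here
    lock.readers -= 1

def static_key_enable_cpuslocked(lock):
    assert lock.readers, "caller must already hold cpu_hotplug_lock"

lock = CpuHotplugLock()
lock.read_lock()                 # the cpuset path already holds it
lock.writer_pending = True       # a hotplug writer is now waiting
static_key_enable_cpuslocked(lock)   # safe: no re-acquisition
```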

Fixes: d74b27d63a8b ("cgroup/cpuset: Change cpuset_rwsem and hotplug lock order")
Signed-off-by: Waiman Long &lt;longman@redhat.com&gt;
Reviewed-by: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 65f97cc81b0adc5f49cf6cff5d874be0058e3f41 ]

The following lockdep splat was observed.

[  812.359086] ============================================
[  812.359089] WARNING: possible recursive locking detected
[  812.359097] --------------------------------------------
[  812.359100] runtest.sh/30042 is trying to acquire lock:
[  812.359105] ffffffffa7f27420 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_enable+0xe/0x20
[  812.359131]
[  812.359131] but task is already holding lock:
[  812.359134] ffffffffa7f27420 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_write_resmask+0x98/0xa70
     :
[  812.359267] Call Trace:
[  812.359272]  &lt;TASK&gt;
[  812.359367]  cpus_read_lock+0x3c/0xe0
[  812.359382]  static_key_enable+0xe/0x20
[  812.359389]  check_insane_mems_config.part.0+0x11/0x30
[  812.359398]  cpuset_write_resmask+0x9f2/0xa70
[  812.359411]  cgroup_file_write+0x1c7/0x660
[  812.359467]  kernfs_fop_write_iter+0x358/0x530
[  812.359479]  vfs_write+0xabe/0x1250
[  812.359529]  ksys_write+0xf9/0x1d0
[  812.359558]  do_syscall_64+0x5f/0xe0

Since commit d74b27d63a8b ("cgroup/cpuset: Change cpuset_rwsem
and hotplug lock order"), the ordering of cpu hotplug lock
and cpuset_mutex had been reversed. That patch correctly
used the cpuslocked version of the static branch API to enable
cpusets_pre_enable_key and cpusets_enabled_key, but it didn't do the
same for cpusets_insane_config_key.

The cpusets_insane_config_key can be enabled in the
check_insane_mems_config() which is called from update_nodemask()
or cpuset_hotplug_update_tasks() with both cpu hotplug lock and
cpuset_mutex held. Deadlock can happen with a pending hotplug event that
tries to acquire the cpu hotplug write lock which will block further
cpus_read_lock() attempt from check_insane_mems_config(). Fix that by
switching to use static_branch_enable_cpuslocked().

Fixes: d74b27d63a8b ("cgroup/cpuset: Change cpuset_rwsem and hotplug lock order")
Signed-off-by: Waiman Long &lt;longman@redhat.com&gt;
Reviewed-by: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
