summaryrefslogtreecommitdiff
path: root/kernel/cgroup
AgeCommit message (Collapse)Author
2026-03-19cgroup: fix race between task migration and iterationQingye Zhao
commit 5ee01f1a7343d6a3547b6802ca2d4cdce0edacb1 upstream. When a task is migrated out of a css_set, cgroup_migrate_add_task() first moves it from cset->tasks to cset->mg_tasks via: list_move_tail(&task->cg_list, &cset->mg_tasks); If a css_task_iter currently has it->task_pos pointing to this task, css_set_move_task() calls css_task_iter_skip() to keep the iterator valid. However, since the task has already been moved to ->mg_tasks, the iterator is advanced relative to the mg_tasks list instead of the original tasks list. As a result, remaining tasks on cset->tasks, as well as tasks queued on cset->mg_tasks, can be skipped by iteration. Fix this by calling css_set_skip_task_iters() before unlinking task->cg_list from cset->tasks. This advances all active iterators to the next task on cset->tasks, so iteration continues correctly even when a task is concurrently being migrated. This race is hard to hit in practice without instrumentation, but it can be reproduced by artificially slowing down cgroup_procs_show(). For example, on an Android device a temporary /sys/kernel/cgroup/cgroup_test knob can be added to inject a delay into cgroup_procs_show(), and then: 1) Spawn three long-running tasks (PIDs 101, 102, 103). 2) Create a test cgroup and move the tasks into it. 3) Enable a large delay via /sys/kernel/cgroup/cgroup_test. 4) In one shell, read cgroup.procs from the test cgroup. 5) Within the delay window, in another shell migrate PID 102 by writing it to a different cgroup.procs file. Under this setup, cgroup.procs can intermittently show only PID 101 while skipping PID 103. Once the migration completes, reading the file again shows all tasks as expected. Note that this change does not allow removing the existing css_set_skip_task_iters() call in css_set_move_task(). The new call in cgroup_migrate_add_task() only handles iterators that are racing with migration while the task is still on cset->tasks. Iterators may also start after the task has been moved to cset->mg_tasks. If we dropped css_set_skip_task_iters() from css_set_move_task(), such iterators could keep task_pos pointing to a migrating task, causing css_task_iter_advance() to malfunction on the destination css_set, up to and including crashes or infinite loops. The race window between migration and iteration is very small, and css_task_iter is not on a hot path. In the worst case, when an iterator is positioned on the first thread of the migrating process, cgroup_migrate_add_task() may have to skip multiple tasks via css_set_skip_task_iters(). However, this only happens when migration and iteration actually race, so the performance impact is negligible compared to the correctness fix provided here. Fixes: b636fd38dc40 ("cgroup: Implement css_task_iter_skip()") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Qingye Zhao <zhaoqingye@honor.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-12cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in ↵Waiman Long
update_cpumasks_hier() [ Upstream commit 68230aac8b9aad243626fbaf3ca170012c17fec5 ] Commit e2ffe502ba45 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2") incorrectly changed the 2nd parameter of cpuset_update_tasks_cpumask() from tmp->new_cpus to cp->effective_cpus. This second parameter is just a temporary cpumask for internal use. The cpuset_update_tasks_cpumask() function was originally called update_tasks_cpumask() before commit 381b53c3b549 ("cgroup/cpuset: rename functions shared between v1 and v2"). This mistake can incorrectly change the effective_cpus of the cpuset when it is the top_cpuset or in arm64 architecture where task_cpu_possible_mask() may differ from cpu_possible_mask. So far top_cpuset hasn't been passed to update_cpumasks_hier() yet, but arm64 arch can still be impacted. Fix it by reverting the incorrect change. Fixes: e2ffe502ba45 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04mm/vmscan: fix demotion targets checks in reclaim/demotionBing Jiao
[ Upstream commit 1aceed565ff172fc0331dd1d5e7e65139b711139 ] Patch series "mm/vmscan: fix demotion targets checks in reclaim/demotion", v9. This patch series addresses two issues in demote_folio_list(), can_demote(), and next_demotion_node() in reclaim/demotion. 1. demote_folio_list() and can_demote() do not correctly check demotion target against cpuset.mems_effective, which will cause (a) pages to be demoted to not-allowed nodes and (b) pages fail demotion even if the system still has allowed demotion nodes. Patch 1 fixes this bug by updating cpuset_node_allowed() and mem_cgroup_node_allowed() to return effective_mems, allowing directly logic-and operation against demotion targets. 2. next_demotion_node() returns a preferred demotion target, but it does not check the node against allowed nodes. Patch 2 ensures that next_demotion_node() filters against the allowed node mask and selects the closest demotion target to the source node. This patch (of 2): Fix two bugs in demote_folio_list() and can_demote() due to incorrect demotion target checks against cpuset.mems_effective in reclaim/demotion. Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim") introduces the cpuset.mems_effective check and applies it to can_demote(). However: 1. It does not apply this check in demote_folio_list(), which leads to situations where pages are demoted to nodes that are explicitly excluded from the task's cpuset.mems. 2. It checks only the nodes in the immediate next demotion hierarchy and does not check all allowed demotion targets in can_demote(). This can cause pages to never be demoted if the nodes in the next demotion hierarchy are not set in mems_effective. These bugs break resource isolation provided by cpuset.mems. This is visible from userspace because pages can either fail to be demoted entirely or are demoted to nodes that are not allowed in multi-tier memory systems. To address these bugs, update cpuset_node_allowed() and mem_cgroup_node_allowed() to return effective_mems, allowing directly logic-and operation against demotion targets. Also update can_demote() and demote_folio_list() accordingly. Bug 1 reproduction: Assume a system with 4 nodes, where nodes 0-1 are top-tier and nodes 2-3 are far-tier memory. All nodes have equal capacity. Test script: echo 1 > /sys/kernel/mm/numa/demotion_enabled mkdir /sys/fs/cgroup/test echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control echo "0-2" > /sys/fs/cgroup/test/cpuset.mems echo $$ > /sys/fs/cgroup/test/cgroup.procs swapoff -a # Expectation: Should respect node 0-2 limit. # Observation: Node 3 shows significant allocation (MemFree drops) stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1 Bug 2 reproduction: Assume a system with 6 nodes, where nodes 0-2 are top-tier, node 3 is a far-tier node, and nodes 4-5 are the farthest-tier nodes. All nodes have equal capacity. Test script: echo 1 > /sys/kernel/mm/numa/demotion_enabled mkdir /sys/fs/cgroup/test echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems echo $$ > /sys/fs/cgroup/test/cgroup.procs swapoff -a # Expectation: Pages are demoted to Nodes 4-5 # Observation: No pages are demoted before oom. stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2 Link: https://lkml.kernel.org/r/20260114205305.2869796-1-bingjiao@google.com Link: https://lkml.kernel.org/r/20260114205305.2869796-2-bingjiao@google.com Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim") Signed-off-by: Bing Jiao <bingjiao@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Gregory Price <gourry@gourry.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04cgroup/cpuset: Don't fail cpuset.cpus change in v2Waiman Long
[ Upstream commit 6e6f13f6d5095f3a432da421e78f4d7d51ef39c8 ] Commit fe8cd2736e75 ("cgroup/cpuset: Delay setting of CS_CPU_EXCLUSIVE until valid partition") introduced a new check to disallow the setting of a new cpuset.cpus.exclusive value that is a superset of a sibling's cpuset.cpus value so that there will at least be one CPU left in the sibling in case the cpuset becomes a valid partition root. This new check does have the side effect of failing a cpuset.cpus change that make it a subset of a sibling's cpuset.cpus.exclusive value. With v2, users are supposed to be allowed to set whatever value they want in cpuset.cpus without failure. To maintain this rule, the check is now restricted to only when cpuset.cpus.exclusive is being changed not when cpuset.cpus is changed. The cgroup-v2.rst doc file is also updated to reflect this change. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-19cpuset: Fix missing adaptation for cpuset_is_populatedChen Ridong
Commit b1bcaed1e39a ("cpuset: Treat cpusets in attaching as populated") was backported to the long‑term support (LTS) branches. However, because commit d5cf4d34a333 ("cgroup/cpuset: Don't track # of local child partitions") was not backported, a corresponding adaptation to the backported code is still required. To ensure correct behavior, replace cgroup_is_populated with cpuset_is_populated in the partition_is_populated function. Cc: stable@vger.kernel.org # 6.1+ Fixes: b1bcaed1e39a ("cpuset: Treat cpusets in attaching as populated") Cc: Waiman Long <longman@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-11cgroup/dmem: avoid pool UAFChen Ridong
commit 99a2ef500906138ba58093b9893972a5c303c734 upstream. An UAF issue was observed: BUG: KASAN: slab-use-after-free in page_counter_uncharge+0x65/0x150 Write of size 8 at addr ffff888106715440 by task insmod/527 CPU: 4 UID: 0 PID: 527 Comm: insmod 6.19.0-rc7-next-20260129+ #11 Tainted: [O]=OOT_MODULE Call Trace: <TASK> dump_stack_lvl+0x82/0xd0 kasan_report+0xca/0x100 kasan_check_range+0x39/0x1c0 page_counter_uncharge+0x65/0x150 dmem_cgroup_uncharge+0x1f/0x260 Allocated by task 527: Freed by task 0: The buggy address belongs to the object at ffff888106715400 which belongs to the cache kmalloc-512 of size 512 The buggy address is located 64 bytes inside of freed 512-byte region [ffff888106715400, ffff888106715600) The buggy address belongs to the physical page: Memory state around the buggy address: ffff888106715300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff888106715380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff888106715400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888106715480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888106715500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb The issue occurs because a pool can still be held by a caller after its associated memory region is unregistered. The current implementation frees the pool even if users still hold references to it (e.g., before uncharge operations complete). This patch adds a reference counter to each pool, ensuring that a pool is only freed when its reference count drops to zero. Fixes: b168ed458dde ("kernel/cgroup: Add "dmem" memory accounting cgroup") Cc: stable@vger.kernel.org # v6.14+ Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-11cgroup/dmem: avoid rcu warning when unregister regionChen Ridong
commit 592a68212c5664bcaa88f24ed80bf791282790fe upstream. A warnning was detected: WARNING: suspicious RCU usage 6.19.0-rc7-next-20260129+ #1101 Tainted: G O kernel/cgroup/dmem.c:456 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by insmod/532: #0: ffffffff85e78b38 (dmemcg_lock){+.+.}-dmem_cgroup_unregister_region+ stack backtrace: CPU: 2 UID: 0 PID: 532 Comm: insmod Tainted: 6.19.0-rc7-next- Tainted: [O]=OOT_MODULE Call Trace: <TASK> dump_stack_lvl+0xb0/0xd0 lockdep_rcu_suspicious+0x151/0x1c0 dmem_cgroup_unregister_region+0x1e2/0x380 ? __pfx_dmem_test_init+0x10/0x10 [dmem_uaf] dmem_test_init+0x65/0xff0 [dmem_uaf] do_one_initcall+0xbb/0x3a0 The macro list_for_each_rcu() must be used within an RCU read-side critical section (between rcu_read_lock() and rcu_read_unlock()). Using it outside that context, as seen in dmem_cgroup_unregister_region(), triggers the lockdep warning because the RCU protection is not guaranteed. Replace list_for_each_rcu() with list_for_each_entry_safe(), which is appropriate for traversal under spinlock protection where nodes may be deleted. Fixes: b168ed458dde ("kernel/cgroup: Add "dmem" memory accounting cgroup") Cc: stable@vger.kernel.org # v6.14+ Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-11cgroup/dmem: fix NULL pointer dereference when setting maxChen Ridong
commit 43151f812886be1855d2cba059f9c93e4729460b upstream. An issue was triggered: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP NOPTI CPU: 15 UID: 0 PID: 658 Comm: bash Tainted: 6.19.0-rc6-next-2026012 Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), RIP: 0010:strcmp+0x10/0x30 RSP: 0018:ffffc900017f7dc0 EFLAGS: 00000246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff888107cd4358 RDX: 0000000019f73907 RSI: ffffffff82cc381a RDI: 0000000000000000 RBP: ffff8881016bef0d R08: 000000006c0e7145 R09: 0000000056c0e714 R10: 0000000000000001 R11: ffff888107cd4358 R12: 0007ffffffffffff R13: ffff888101399200 R14: ffff888100fcb360 R15: 0007ffffffffffff CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000105c79000 CR4: 00000000000006f0 Call Trace: <TASK> dmemcg_limit_write.constprop.0+0x16d/0x390 ? __pfx_set_resource_max+0x10/0x10 kernfs_fop_write_iter+0x14e/0x200 vfs_write+0x367/0x510 ksys_write+0x66/0xe0 do_syscall_64+0x6b/0x390 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f42697e1887 It was trriggered setting max without limitation, the command is like: "echo test/region0 > dmem.max". To fix this issue, add check whether options is valid after parsing the region_name. Fixes: b168ed458dde ("kernel/cgroup: Add "dmem" memory accounting cgroup") Cc: stable@vger.kernel.org # v6.14+ Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08cpuset: fix warning when disabling remote partitionChen Ridong
[ Upstream commit aa7d3a56a20f07978d9f401e13637a6479b13bd0 ] A warning was triggered as follows: WARNING: kernel/cgroup/cpuset.c:1651 at remote_partition_disable+0xf7/0x110 RIP: 0010:remote_partition_disable+0xf7/0x110 RSP: 0018:ffffc90001947d88 EFLAGS: 00000206 RAX: 0000000000007fff RBX: ffff888103b6e000 RCX: 0000000000006f40 RDX: 0000000000006f00 RSI: ffffc90001947da8 RDI: ffff888103b6e000 RBP: ffff888103b6e000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: ffff88810b2e2728 R12: ffffc90001947da8 R13: 0000000000000000 R14: ffffc90001947da8 R15: ffff8881081f1c00 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f55c8bbe0b2 CR3: 000000010b14c000 CR4: 00000000000006f0 Call Trace: <TASK> update_prstate+0x2d3/0x580 cpuset_partition_write+0x94/0xf0 kernfs_fop_write_iter+0x147/0x200 vfs_write+0x35d/0x500 ksys_write+0x66/0xe0 do_syscall_64+0x6b/0x390 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7f55c8cd4887 Reproduction steps (on a 16-CPU machine): # cd /sys/fs/cgroup/ # mkdir A1 # echo +cpuset > A1/cgroup.subtree_control # echo "0-14" > A1/cpuset.cpus.exclusive # mkdir A1/A2 # echo "0-14" > A1/A2/cpuset.cpus.exclusive # echo "root" > A1/A2/cpuset.cpus.partition # echo 0 > /sys/devices/system/cpu/cpu15/online # echo member > A1/A2/cpuset.cpus.partition When CPU 15 is offlined, subpartitions_cpus gets cleared because no CPUs remain available for the top_cpuset, forcing partitions to share CPUs with the top_cpuset. In this scenario, disabling the remote partition triggers a warning stating that effective_xcpus is not a subset of subpartitions_cpus. Partitions should be invalidated in this case to inform users that the partition is now invalid(cpus are shared with top_cpuset). To fix this issue: 1. Only emit the warning only if subpartitions_cpus is not empty and the effective_xcpus is not a subset of subpartitions_cpus. 2. During the CPU hotplug process, invalidate partitions if subpartitions_cpus is empty. Fixes: f62a5d39368e ("cgroup/cpuset: Remove remote_partition_check() & make update_cpumasks_hier() handle remote partition") Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-02cgroup: rstat: use LOCK CMPXCHG in css_rstat_updatedShakeel Butt
commit 3309b63a2281efb72df7621d60cc1246b6286ad3 upstream. On x86-64, this_cpu_cmpxchg() uses CMPXCHG without LOCK prefix which means it is only safe for the local CPU and not for multiple CPUs. Recently the commit 36df6e3dbd7e ("cgroup: make css_rstat_updated nmi safe") make css_rstat_updated lockless and uses lockless list to allow reentrancy. Since css_rstat_updated can invoked from process context, IRQ and NMI, it uses this_cpu_cmpxchg() to select the winner which will inset the lockless lnode into the global per-cpu lockless list. However the commit missed one case where lockless node of a cgroup can be accessed and modified by another CPU doing the flushing. Basically llist_del_first_init() in css_process_update_tree(). On a cursory look, it can be questioned how css_process_update_tree() can see a lockless node in global lockless list where the updater is at this_cpu_cmpxchg() and before llist_add() call in css_rstat_updated(). This can indeed happen in the presence of IRQs/NMI. Consider this scenario: Updater for cgroup stat C on CPU A in process context is after llist_on_list() check and before this_cpu_cmpxchg() in css_rstat_updated() where it get interrupted by IRQ/NMI. In the IRQ/NMI context, a new updater calls css_rstat_updated() for same cgroup C and successfully inserts rstatc_pcpu->lnode. Now concurrently CPU B is running the flusher and it calls llist_del_first_init() for CPU A and got rstatc_pcpu->lnode of cgroup C which was added by the IRQ/NMI updater. Now imagine CPU B calling init_llist_node() on cgroup C's rstatc_pcpu->lnode of CPU A and on CPU A, the process context updater calling this_cpu_cmpxchg(rstatc_pcpu->lnode) concurrently. The CMPXCNG without LOCK on CPU A is not safe and thus we need LOCK prefix. In Meta's fleet running the kernel with the commit 36df6e3dbd7e, we are observing on some machines the memcg stats are getting skewed by more than the actual memory on the system. On close inspection, we noticed that lockless node for a workload for specific CPU was in the bad state and thus all the updates on that CPU for that cgroup was being lost. To confirm if this skew was indeed due to this CMPXCHG without LOCK in css_rstat_updated(), we created a repro (using AI) at [1] which shows that CMPXCHG without LOCK creates almost the same lnode corruption as seem in Meta's fleet and with LOCK CMPXCHG the issue does not reproduces. Link: http://lore.kernel.org/efiagdwmzfwpdzps74fvcwq3n4cs36q33ij7eebcpssactv3zu@se4hqiwxcfxq [1] Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: stable@vger.kernel.org # v6.17+ Fixes: 36df6e3dbd7e ("cgroup: make css_rstat_updated nmi safe") Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-12-18cpuset: Treat cpusets in attaching as populatedChen Ridong
[ Upstream commit b1bcaed1e39a9e0dfbe324a15d2ca4253deda316 ] Currently, the check for whether a partition is populated does not account for tasks in the cpuset of attaching. This is a corner case that can leave a task stuck in a partition with no effective CPUs. The race condition occurs as follows: cpu0 cpu1 //cpuset A with cpu N migrate task p to A cpuset_can_attach // with effective cpus // check ok // cpuset_mutex is not held // clear cpuset.cpus.exclusive // making effective cpus empty update_exclusive_cpumask // tasks_nocpu_error check ok // empty effective cpus, partition valid cpuset_attach ... // task p stays in A, with non-effective cpus. To fix this issue, this patch introduces cs_is_populated, which considers tasks in the attaching cpuset. This new helper is used in validate_change and partition_is_populated. Fixes: e2d59900d936 ("cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective") Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18cgroup: add cgroup namespace to tree after owner is setChristian Brauner
[ Upstream commit 768b1565d9d1e1eebf7567f477f7f46c05a98a4d ] Otherwise we trip VFS_WARN_ON_ONC() in __ns_tree_add_raw(). Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-6-2e6f823ebdc0@kernel.org Fixes: 7c6059398533 ("cgroup: support ns lookup") Tested-by: syzbot@syzkaller.appspotmail.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-20Merge tag 'cgroup-for-6.18-rc2-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix seqcount lockdep assertion failure in cgroup freezer on PREEMPT_RT. Plain seqcount_t expects preemption disabled, but PREEMPT_RT spinlocks don't disable preemption. Switch to seqcount_spinlock_t to properly associate css_set_lock with the freeze timing seqcount. - Misc changes including kernel-doc warning fix for misc_res_type enum and improved selftest diagnostics. * tag 'cgroup-for-6.18-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/misc: fix misc_res_type kernel-doc warning selftests: cgroup: Use values_close_report in test_cpu selftests: cgroup: add values_close_report helper cgroup: Fix seqcount lockdep assertion in cgroup freezer
2025-10-04Merge tag 'rcu.2025.09.26a' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux Pull RCU updates from Paul McKenney: "Documentation updates: - Update whatisRCU.rst and checklist.rst for recent RCU API additions - Fix RCU documentation formatting and typos - Replace dead Ottawa Linux Symposium links in RTFP.txt Miscellaneous RCU updates: - Document that rcu_barrier() hurries RCU_LAZY callbacks - Remove redundant interrupt disabling from rcu_preempt_deferred_qs_handler() - Move list_for_each_rcu from list.h to rculist.h, and adjust the include directive in kernel/cgroup/dmem.c accordingly - Make initial set of changes to accommodate upcoming system_percpu_wq changes SRCU updates: - Create an srcu_read_lock_fast_notrace() for eventual use in tracing, including adding guards - Document the reliance on per-CPU operations as implicit RCU readers in __srcu_read_{,un}lock_fast() - Document the srcu_flip() function's memory-barrier D's relationship to SRCU-fast readers - Remove a redundant preempt_disable() and preempt_enable() pair from srcu_gp_start_if_needed() Torture-test updates: - Fix jitter.sh spin time so that it actually varies as advertised. It is still quite coarse-grained, but at least it does now vary - Update torture.sh help text to include the not-so-new --do-normal parameter, which permits (for example) testing KCSAN kernels without doing non-debug kernels - Fix a number of false-positive diagnostics that were being triggered by rcutorture starting before boot completed. Running multiple near-CPU-bound rcutorture processes when there is only the boot CPU is after all a bit excessive - Substitute kcalloc() for kzalloc() - Remove a redundant kfree() and NULL out kfree()ed objects" * tag 'rcu.2025.09.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (31 commits) rcu: WQ_UNBOUND added to sync_wq workqueue rcu: WQ_PERCPU added to alloc_workqueue users rcu: replace use of system_wq with system_percpu_wq refperf: Set reader_tasks to NULL after kfree() refperf: Remove redundant kfree() after torture_stop_kthread() srcu/tiny: Remove preempt_disable/enable() in srcu_gp_start_if_needed() srcu: Document srcu_flip() memory-barrier D relation to SRCU-fast srcu: Document __srcu_read_{,un}lock_fast() implicit RCU readers rculist: move list_for_each_rcu() to where it belongs refscale: Use kcalloc() instead of kzalloc() rcutorture: Use kcalloc() instead of kzalloc() docs: rcu: Replace multiple dead OLS links in RTFP.txt doc: Fix typo in RCU's torture.rst documentation Documentation: RCU: Retitle toctree index Documentation: RCU: Reduce toctree depth Documentation: RCU: Wrap kvm-remote.sh rerun snippet in literal code block rcu: docs: Requirements.rst: Abide by conventions of kernel documentation doc: Add RCU guards to checklist.rst doc: Update whatisRCU.rst for recent RCU API additions rcutorture: Delay forward-progress testing until boot completes ...
2025-10-03cgroup: Fix seqcount lockdep assertion in cgroup freezerNirbhay Sharma
The commit afa3701c0e45 ("cgroup: cgroup.stat.local time accounting") introduced a seqcount to track freeze timing but initialized it as a plain seqcount_t using seqcount_init(). However, the write-side critical section in cgroup_do_freeze() holds the css_set_lock spinlock while calling write_seqcount_begin(). On PREEMPT_RT kernels, spinlocks do not disable preemption, causing the lockdep assertion for a plain seqcount_t, which checks for preemption being disabled, to fail. This triggers the following warning: WARNING: CPU: 0 PID: 9692 at include/linux/seqlock.h:221 Fix this by changing the type to seqcount_spinlock_t and initializing it with seqcount_spinlock_init() to associate css_set_lock with the seqcount. This allows lockdep to correctly validate that the spinlock is held during write operations, resolving the assertion failure on all kernel configurations. Reported-by: syzbot+27a2519eb4dad86d0156@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=27a2519eb4dad86d0156 Fixes: afa3701c0e45 ("cgroup: cgroup.stat.local time accounting") Signed-off-by: Nirbhay Sharma <nirbhay.lkd@gmail.com> Link: https://lore.kernel.org/r/20251002165510.KtY3IT--@linutronix.de/ Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-30Merge tag 'bpf-next-6.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: - Support pulling non-linear xdp data with bpf_xdp_pull_data() kfunc (Amery Hung) Applied as a stable branch in bpf-next and net-next trees. - Support reading skb metadata via bpf_dynptr (Jakub Sitnicki) Also a stable branch in bpf-next and net-next trees. - Enforce expected_attach_type for tailcall compatibility (Daniel Borkmann) - Replace path-sensitive with path-insensitive live stack analysis in the verifier (Eduard Zingerman) This is a significant change in the verification logic. More details, motivation, long term plans are in the cover letter/merge commit. - Support signed BPF programs (KP Singh) This is another major feature that took years to materialize. Algorithm details are in the cover letter/marge commit - Add support for may_goto instruction to s390 JIT (Ilya Leoshkevich) - Add support for may_goto instruction to arm64 JIT (Puranjay Mohan) - Fix USDT SIB argument handling in libbpf (Jiawei Zhao) - Allow uprobe-bpf program to change context registers (Jiri Olsa) - Support signed loads from BPF arena (Kumar Kartikeya Dwivedi and Puranjay Mohan) - Allow access to union arguments in tracing programs (Leon Hwang) - Optimize rcu_read_lock() + migrate_disable() combination where it's used in BPF subsystem (Menglong Dong) - Introduce bpf_task_work_schedule*() kfuncs to schedule deferred execution of BPF callback in the context of a specific task using the kernel’s task_work infrastructure (Mykyta Yatsenko) - Enforce RCU protection for KF_RCU_PROTECTED kfuncs (Kumar Kartikeya Dwivedi) - Add stress test for rqspinlock in NMI (Kumar Kartikeya Dwivedi) - Improve the precision of tnum multiplier verifier operation (Nandakumar Edamana) - Use tnums to improve is_branch_taken() logic (Paul Chaignon) - Add support for atomic operations in arena in riscv JIT (Pu Lehui) - Report arena faults to BPF error stream (Puranjay Mohan) - Search for tracefs at /sys/kernel/tracing first in bpftool (Quentin Monnet) - Add bpf_strcasecmp() kfunc (Rong Tao) - Support lookup_and_delete_elem command in BPF_MAP_STACK_TRACE (Tao Chen) * tag 'bpf-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (197 commits) libbpf: Replace AF_ALG with open coded SHA-256 selftests/bpf: Add stress test for rqspinlock in NMI selftests/bpf: Add test case for different expected_attach_type bpf: Enforce expected_attach_type for tailcall compatibility bpftool: Remove duplicate string.h header bpf: Remove duplicate crypto/sha2.h header libbpf: Fix error when st-prefix_ops and ops from differ btf selftests/bpf: Test changing packet data from kfunc selftests/bpf: Add stacktrace map lookup_and_delete_elem test case selftests/bpf: Refactor stacktrace_map case with skeleton bpf: Add lookup_and_delete_elem for BPF_MAP_STACK_TRACE selftests/bpf: Fix flaky bpf_cookie selftest selftests/bpf: Test changing packet data from global functions with a kfunc bpf: Emit struct bpf_xdp_sock type in vmlinux BTF selftests/bpf: Task_work selftest cleanup fixes MAINTAINERS: Delete inactive maintainers from AF_XDP bpf: Mark kfuncs as __noclone selftests/bpf: Add kprobe multi write ctx attach test selftests/bpf: Add kprobe write ctx attach test selftests/bpf: Add uprobe context ip register change test ...
2025-09-30Merge tag 'cgroup-for-6.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Extensive cpuset code cleanup and refactoring work with no functional changes: CPU mask computation logic refactoring, introducing new helpers, removing redundant code paths, and improving error handling for better maintainability. - A few bug fixes to cpuset including fixes for partition creation failures when isolcpus is in use, missing error returns, and null pointer access prevention in free_tmpmasks(). - Core cgroup changes include replacing the global percpu_rwsem with per-threadgroup rwsem when writing to cgroup.procs for better scalability, workqueue conversions to use WQ_PERCPU and system_percpu_wq to prepare for workqueue default switching from percpu to unbound, and removal of unused code including the post_attach callback. - New cgroup.stat.local time accounting feature that tracks frozen time duration. - Misc changes including selftests updates (new freezer time tests and backward compatibility fixes), documentation sync, string function safety improvements, and 64-bit division fixes. * tag 'cgroup-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (39 commits) cpuset: remove is_prs_invalid helper cpuset: remove impossible warning in update_parent_effective_cpumask cpuset: remove redundant special case for null input in node mask update cpuset: fix missing error return in update_cpumask cpuset: Use new excpus for nocpu error check when enabling root partition cpuset: fix failure to enable isolated partition when containing isolcpus Documentation: cgroup-v2: Sync manual toctree cpuset: use partition_cpus_change for setting exclusive cpus cpuset: use parse_cpulist for setting cpus.exclusive cpuset: introduce partition_cpus_change cpuset: refactor cpus_allowed_validate_change cpuset: refactor out validate_partition cpuset: introduce cpus_excl_conflict and mems_excl_conflict helpers cpuset: refactor CPU mask buffer parsing logic cpuset: Refactor exclusive CPU mask computation logic cpuset: change return type of is_partition_[in]valid to bool cpuset: remove unused assignment to trialcs->partition_root_state cpuset: move the root cpuset write check earlier cgroup/cpuset: Remove redundant rcu_read_lock/unlock() in spin_lock cgroup: Remove redundant rcu_read_lock/unlock() in spin_lock ...
2025-09-29Merge tag 'namespace-6.18-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull namespace updates from Christian Brauner: "This contains a larger set of changes around the generic namespace infrastructure of the kernel. Each specific namespace type (net, cgroup, mnt, ...) embedds a struct ns_common which carries the reference count of the namespace and so on. We open-coded and cargo-culted so many quirks for each namespace type that it just wasn't scalable anymore. So given there's a bunch of new changes coming in that area I've started cleaning all of this up. The core change is to make it possible to correctly initialize every namespace uniformly and derive the correct initialization settings from the type of the namespace such as namespace operations, namespace type and so on. This leaves the new ns_common_init() function with a single parameter which is the specific namespace type which derives the correct parameters statically. This also means the compiler will yell as soon as someone does something remotely fishy. The ns_common_init() addition also allows us to remove ns_alloc_inum() and drops any special-casing of the initial network namespace in the network namespace initialization code that Linus complained about. Another part is reworking the reference counting. The reference counting was open-coded and copy-pasted for each namespace type even though they all followed the same rules. This also removes all open accesses to the reference count and makes it private and only uses a very small set of dedicated helpers to manipulate them just like we do for e.g., files. In addition this generalizes the mount namespace iteration infrastructure introduced a few cycles ago. As reminder, the vfs makes it possible to iterate sequentially and bidirectionally through all mount namespaces on the system or all mount namespaces that the caller holds privilege over. This allow userspace to iterate over all mounts in all mount namespaces using the listmount() and statmount() system call. Each mount namespace has a unique identifier for the lifetime of the systems that is exposed to userspace. The network namespace also has a unique identifier working exactly the same way. This extends the concept to all other namespace types. The new nstree type makes it possible to lookup namespaces purely by their identifier and to walk the namespace list sequentially and bidirectionally for all namespace types, allowing userspace to iterate through all namespaces. Looking up namespaces in the namespace tree works completely locklessly. This also means we can move the mount namespace onto the generic infrastructure and remove a bunch of code and members from struct mnt_namespace itself. There's a bunch of stuff coming on top of this in the future but for now this uses the generic namespace tree to extend a concept introduced first for pidfs a few cycles ago. For a while now we have supported pidfs file handles for pidfds. This has proven to be very useful. This extends the concept to cover namespaces as well. It is possible to encode and decode namespace file handles using the common name_to_handle_at() and open_by_handle_at() apis. As with pidfs file handles, namespace file handles are exhaustive, meaning it is not required to actually hold a reference to nsfs in able to decode aka open_by_handle_at() a namespace file handle. Instead the FD_NSFS_ROOT constant can be passed which will let the kernel grab a reference to the root of nsfs internally and thus decode the file handle. Namespaces file descriptors can already be derived from pidfds which means they aren't subject to overmount protection bugs. IOW, it's irrelevant if the caller would not have access to an appropriate /proc/<pid>/ns/ directory as they could always just derive the namespace based on a pidfd already. It has the same advantage as pidfds. It's possible to reliably and for the lifetime of the system refer to a namespace without pinning any resources and to compare them trivially. Permission checking is kept simple. If the caller is located in the namespace the file handle refers to they are able to open it otherwise they must hold privilege over the owning namespace of the relevant namespace. The namespace file handle layout is exposed as uapi and has a stable and extensible format. For now it simply contains the namespace identifier, the namespace type, and the inode number. The stable format means that userspace may construct its own namespace file handles without going through name_to_handle_at() as they are already allowed for pidfs and cgroup file handles" * tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (65 commits) ns: drop assert ns: move ns type into struct ns_common nstree: make struct ns_tree private ns: add ns_debug() ns: simplify ns_common_init() further cgroup: add missing ns_common include ns: use inode initializer for initial namespaces selftests/namespaces: verify initial namespace inode numbers ns: rename to __ns_ref nsfs: port to ns_ref_*() helpers net: port to ns_ref_*() helpers uts: port to ns_ref_*() helpers ipv4: use check_net() net: use check_net() net-sysfs: use check_net() user: port to ns_ref_*() helpers time: port to ns_ref_*() helpers pid: port to ns_ref_*() helpers ipc: port to ns_ref_*() helpers cgroup: port to ns_ref_*() helpers ...
2025-09-29Merge tag 'kernel-6.18-rc1.clone3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull copy_process updates from Christian Brauner: "This contains the changes to enable support for clone3() on nios2 which apparently is still a thing. The more exciting part of this is that it cleans up the inconsistency in how the 64-bit flag argument is passed from copy_process() into the various other copy_*() helpers" [ Fixed up rv ltl_monitor 32-bit support as per Sasha Levin in the merge ] * tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: nios2: implement architecture-specific portion of sys_clone3 arch: copy_thread: pass clone_flags as u64 copy_process: pass clone_flags as u64 across calltree copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)
2025-09-25ns: move ns type into struct ns_commonChristian Brauner
It's misplaced in struct proc_ns_operations and ns->ops might be NULL if the namespace is compiled out but we still want to know the type of the namespace for the initial namespace struct. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-22cpuset: remove is_prs_invalid helperChen Ridong
The is_prs_invalid helper function is redundant as it serves a similar purpose to is_partition_invalid. It can be fully replaced by the existing is_partition_invalid function, so this patch removes the is_prs_invalid helper. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-22cpuset: remove impossible warning in update_parent_effective_cpumaskChen Ridong
If the parent is not a valid partition, an error will be returned before any partition update command is processed. This means the WARN_ON_ONCE(!is_partition_valid(parent)) can never be triggered, so it is safe to remove. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-22cpuset: remove redundant special case for null input in node mask updateChen Ridong
The nodelist_parse function already handles empty nodemask input appropriately, making it unnecessary to handle this case separately during the node mask update process. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-22ns: simplify ns_common_init() furtherChristian Brauner
Simply derive the ns operations from the namespace type. Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19cpuset: fix missing error return in update_cpumaskChen Ridong
The commit c6366739804f ("cpuset: refactor cpus_allowed_validate_change") inadvertently removed the error return when cpus_allowed_validate_change() fails. This patch restores the proper error handling by returning retval when the validation check fails. Fixes: c6366739804f ("cpuset: refactor cpus_allowed_validate_change") Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-19cpuset: Use new excpus for nocpu error check when enabling root partitionChen Ridong
A previous patch fixed a bug where new_prs should be assigned before checking housekeeping conflicts. This patch addresses another potential issue: the nocpu error check currently uses the xcpus which is not updated. Although no issue has been observed so far, the check should be performed using the new effective exclusive cpus. The comment has been removed because the function returns an error if nocpu checking fails, which is unrelated to the parent. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-19cpuset: fix failure to enable isolated partition when containing isolcpusChen Ridong
The 'isolcpus' parameter specified at boot time can be assigned to an isolated partition. While it is valid put the 'isolcpus' in an isolated partition, attempting to change a member cpuset to an isolated partition will fail if the cpuset contains any 'isolcpus'. For example, the system boots with 'isolcpus=9', and the following configuration works correctly: # cd /sys/fs/cgroup/ # mkdir test # echo 1 > test/cpuset.cpus # echo isolated > test/cpuset.cpus.partition # cat test/cpuset.cpus.partition isolated # echo 9 > test/cpuset.cpus # cat test/cpuset.cpus.partition isolated # cat test/cpuset.cpus 9 However, the following steps to convert a member cpuset to an isolated partition will fail: # cd /sys/fs/cgroup/ # mkdir test # echo 9 > test/cpuset.cpus # echo isolated > test/cpuset.cpus.partition # cat test/cpuset.cpus.partition isolated invalid (partition config conflicts with housekeeping setup) The issue occurs because the new partition state (new_prs) is used for validation against housekeeping constraints before it has been properly updated. To resolve this, move the assignment of new_prs before the housekeeping validation check when enabling a root partition. Fixes: 4a74e418881f ("cgroup/cpuset: Check partition conflict with housekeeping setup") Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-19ns: use inode initializer for initial namespacesChristian Brauner
Just use the common helper we have. Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19ns: rename to __ns_refChristian Brauner
Make it easier to grep and rename to ns_count. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19ns: add ns_common_free()Christian Brauner
And drop ns_free_inum(). Anything common that can be wasted centrally should be wasted in the new common helper. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19nscommon: simplify initializationChristian Brauner
There's a lot of information that namespace implementers don't need to know about at all. Encapsulate this all in the initialization helper. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19ns: add to_<type>_ns() to respective headersChristian Brauner
Every namespace type has a container_of(ns, <ns_type>, ns) static inline function that is currently not exposed in the header. So we have a bunch of places that open-code it via container_of(). Move it to the headers so we can use it directly. Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19cgroup: support ns lookupChristian Brauner
Support the generic ns lookup infrastructure to support file handles for namespaces. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19Merge branch 'no-rebase-mnt_ns_tree_remove'Christian Brauner
Bring in the fix for removing a mount namespace from the mount namespace rbtree and list. Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19cgroup: use ns_common_init()Christian Brauner
Don't cargo-cult the same thing over and over. Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-17cpuset: use partition_cpus_change for setting exclusive cpusChen Ridong
A previous patch has introduced a new helper function partition_cpus_change(). Now replace the exclusive cpus setting logic with this helper function. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17cpuset: use parse_cpulist for setting cpus.exclusiveChen Ridong
Previous patches made parse_cpulist handle empty cpu mask input. Now use this helper for exclusive cpus setting. Also, compute_trialcs_xcpus can be called with empty cpus and handles it correctly. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17cpuset: introduce partition_cpus_changeChen Ridong
Introduce the partition_cpus_change function to handle both regular CPU set updates and exclusive CPU modifications, either of which may trigger partition state changes. This generalized function will also be utilized for exclusive CPU updates in subsequent patches. With the introduction of compute_trialcs_excpus in a previous patch, the trialcs->effective_xcpus field is now consistently computed and maintained. Consequently, the legacy logic which assigned **trialcs->allowed_cpus to a local 'xcpus' variable** when trialcs->effective_xcpus was empty has been removed. This removal is safe because when trialcs is not a partition member, trialcs->effective_xcpus is now correctly populated with the intersection of user_xcpus and the parent's effective_xcpus. This calculation inherently covers the scenario previously handled by the removed code. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17cpuset: refactor cpus_allowed_validate_changeChen Ridong
Refactor cpus_allowed_validate_change to handle the special case where cpuset.cpus can be set even when violating partition sibling CPU exclusivity rules. This differs from the general validation logic in validate_change. Add a wrapper function to properly handle this exceptional case. The trialcs->prs_err field is cleared before performing validation checks for both CPU changes and partition errors. If cpus_allowed_validate_change fails its validation, trialcs->prs_err is set to PERR_NOTEXCL. If partition validation fails, the specific error code returned by validate_partition is assigned to trialcs->prs_err. With the partition validation status now directly available through trialcs->prs_err, the local boolean variable 'invalidate' becomes redundant and can be safely removed. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17cpuset: refactor out validate_partitionChen Ridong
Refactor the validate_partition function to handle cpuset partition validation when modifying cpuset.cpus. This refactoring also makes the function reusable for handling cpuset.cpus.exclusive updates in subsequent patches. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17cpuset: introduce cpus_excl_conflict and mems_excl_conflict helpersChen Ridong
This patch adds cpus_excl_conflict() and mems_excl_conflict() helper functions to improve code readability and maintainability. The exclusive conflict checking follows these rules: 1. If either cpuset has the 'exclusive' flag set, their user_xcpus must not have any overlap. 2. If neither cpuset has the 'exclusive' flag set, their 'cpuset.cpus.exclusive' (only for v2) values must not intersect. 3. The 'cpuset.cpus' of one cpuset must not form a subset of another cpuset's 'cpuset.cpus.exclusive'. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17cpuset: refactor CPU mask buffer parsing logicChen Ridong
The current implementation contains redundant handling for empty mask inputs, as cpulist_parse() already properly handles these cases. This refactoring introduces a new helper function parse_cpuset_cpulist() to consolidate CPU list parsing logic and eliminate special-case checks for empty inputs. Additionally, the effective_xcpus computation for trial cpusets has been simplified. Rather than computing effective_xcpus only when exclusive_cpus is set or when the cpuset forms a valid partition, we now recalculate it on every cpuset.cpus update. This approach ensures consistency and allows removal of redundant effective_xcpus logic in subsequent patches. The trial cpuset's effective_xcpus calculation follows two distinct cases: 1. For member cpusets: effective_xcpus is determined by the intersection of cpuset->exclusive_cpus and the parent's effective_xcpus. 2. For non-member cpusets: effective_xcpus is derived from the intersection of user_xcpus and the parent's effective_xcpus. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17cpuset: Refactor exclusive CPU mask computation logicChen Ridong
The current compute_effective_exclusive_cpumask function handles multiple scenarios with different input parameters, making the code difficult to follow. This patch refactors it into two separate functions: compute_excpus and compute_trialcs_excpus. The compute_excpus function calculates the exclusive CPU mask for a given input and excludes exclusive CPUs from sibling cpusets when cs's exclusive_cpus is not explicitly set. The compute_trialcs_excpus function specifically handles exclusive CPU computation for trial cpusets used during CPU mask configuration updates, and always excludes exclusive CPUs from sibling cpusets. This refactoring significantly improves code readability and clarity, making it explicit which function to call for each use case and what parameters should be provided. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17cpuset: change return type of is_partition_[in]valid to boolChen Ridong
The functions is_partition_valid() and is_partition_invalid() logically return boolean values, but were previously declared with return type 'int'. This patch changes their return type to 'bool' to better reflect their semantic meaning and improve type safety. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17cpuset: remove unused assignment to trialcs->partition_root_stateChen Ridong
The trialcs->partition_root_state field is not used during the configuration of 'cpuset.cpus' or 'cpuset.cpus.exclusive'. Therefore, the assignment of values to this field can be safely removed. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17cpuset: move the root cpuset write check earlierChen Ridong
The 'cpus' or 'mems' lists of the top_cpuset cannot be modified. This check can be moved before acquiring any locks as a common code block to improve efficiency and maintainability. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-16cgroup/cpuset: Remove redundant rcu_read_lock/unlock() in spin_lockpengdonglin
Since commit a8bb74acd8efe ("rcu: Consolidate RCU-sched update-side function definitions") there is no difference between rcu_read_lock(), rcu_read_lock_bh() and rcu_read_lock_sched() in terms of RCU read section and the relevant grace period. That means that spin_lock(), which implies rcu_read_lock_sched(), also implies rcu_read_lock(). There is no need no explicitly start a RCU read section if one has already been started implicitly by spin_lock(). Simplify the code and remove the inner rcu_read_lock() invocation. Cc: Waiman Long <longman@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: pengdonglin <pengdonglin@xiaomi.com> Signed-off-by: pengdonglin <dolinux.peng@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-16cgroup: Remove redundant rcu_read_lock/unlock() in spin_lockpengdonglin
Since commit a8bb74acd8efe ("rcu: Consolidate RCU-sched update-side function definitions") there is no difference between rcu_read_lock(), rcu_read_lock_bh() and rcu_read_lock_sched() in terms of RCU read section and the relevant grace period. That means that spin_lock(), which implies rcu_read_lock_sched(), also implies rcu_read_lock(). There is no need no explicitly start a RCU read section if one has already been started implicitly by spin_lock(). Simplify the code and remove the inner rcu_read_lock() invocation. Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Waiman Long <longman@redhat.com> Signed-off-by: pengdonglin <pengdonglin@xiaomi.com> Signed-off-by: pengdonglin <dolinux.peng@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-15bpf: Do not limit bpf_cgroup_from_id to current's namespaceKumar Kartikeya Dwivedi
The bpf_cgroup_from_id kfunc relies on cgroup_get_from_id to obtain the cgroup corresponding to a given cgroup ID. This helper can be called in a lot of contexts where the current thread can be random. A recent example was its use in sched_ext's ops.tick(), to obtain the root cgroup pointer. Since the current task can be whatever random user space task preempted by the timer tick, this makes the behavior of the helper unreliable. Refactor out __cgroup_get_from_id as the non-namespace aware version of cgroup_get_from_id, and change bpf_cgroup_from_id to make use of it. There is no compatibility breakage here, since changing the namespace against which the lookup is being done to the root cgroup namespace only permits a wider set of lookups to succeed now. The cgroup IDs across namespaces are globally unique, and thus don't need to be retranslated. Reported-by: Dan Schatzberg <dschatzberg@meta.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250915032618.1551762-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-10cgroup: replace global percpu_rwsem with per threadgroup resem when writing ↵Yi Tao
to cgroup.procs The static usage pattern of creating a cgroup, enabling controllers, and then seeding it with CLONE_INTO_CGROUP doesn't require write locking cgroup_threadgroup_rwsem and thus doesn't benefit from this patch. To avoid affecting other users, the per threadgroup rwsem is only used when the favordynmods is enabled. As computer hardware advances, modern systems are typically equipped with many CPU cores and large amounts of memory, enabling the deployment of numerous applications. On such systems, container creation and deletion become frequent operations, making cgroup process migration no longer a cold path. This leads to noticeable contention with common process operations such as fork, exec, and exit. To alleviate the contention between cgroup process migration and operations like process fork, this patch modifies lock to take the write lock on signal_struct->group_rwsem when writing pid to cgroup.procs/threads instead of holding a global write lock. Cgroup process migration has historically relied on signal_struct->group_rwsem to protect thread group integrity. In commit <1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem"), this was changed to a global cgroup_threadgroup_rwsem. The advantage of using a global lock was simplified handling of process group migrations. This patch retains the use of the global lock for protecting process group migration, while reducing contention by using per thread group lock during cgroup.procs/threads writes. The locking behavior is as follows: write cgroup.procs/threads | process fork,exec,exit | process group migration ------------------------------------------------------------------------------ cgroup_lock() | down_read(&g_rwsem) | cgroup_lock() down_write(&p_rwsem) | down_read(&p_rwsem) | down_write(&g_rwsem) critical section | critical section | critical section up_write(&p_rwsem) | up_read(&p_rwsem) | up_write(&g_rwsem) cgroup_unlock() | up_read(&g_rwsem) | cgroup_unlock() g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes signal_struct->group_rwsem. This patch eliminates contention between cgroup migration and fork operations for threads that belong to different thread groups, thereby reducing the long-tail latency of cgroup migrations and lowering system load. With this patch, under heavy fork and exec interference, the long-tail latency of cgroup migration has been reduced from milliseconds to microseconds. Under heavy cgroup migration interference, the multi-CPU score of the spawn test case in UnixBench increased by 9%. tj: Update comment in cgroup_favor_dynmods() and switch WARN_ONCE() to pr_warn_once(). Signed-off-by: Yi Tao <escape@linux.alibaba.com> Signed-off-by: Tejun Heo <tj@kernel.org>