linux.git/kernel/cgroup, branch v7.2-rc1

Merge tag 'cgroup-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

2026-06-17T11:03:56+00:00

Pull cgroup updates from Tejun Heo:

 - Last cycle deferred css teardown on cgroup removal until the cgroup
   depopulated, so a css is not taken offline while tasks can still
   reference it. Disabling a controller through cgroup.subtree_control
   still had the same problem. This cycle reworks the deferral from
   per-cgroup to per-css so that the controller-disable path is covered
   too.

 - New RDMA controller monitoring files: rdma.peak for per-device peak
   usage and rdma.events / rdma.events.local for resource-limit
   exhaustion. The max-limit parser was rewritten, fixing two input
   parsing bugs.

 - cpuset: fix a sched-domain leak on the domain-rebuild failure path
   and skip a redundant hardwall ancestor scan on v2.

 - Misc: pair the remaining lockless cgroup.max.* reads with WRITE_ONCE,
   assorted selftest robustness fixes, and doc path corrections.

* tag 'cgroup-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (22 commits)
  cgroup: Migrate tasks to the root css when a controller is rebound
  docs: cgroup: Fix stale source file paths
  cgroup/cpuset: Free sched domains on rebuild guard failure
  cgroup: pair max limit READ_ONCE() with WRITE_ONCE()
  selftests/cgroup: enable memory controller in hugetlb memcg test
  cgroup/rdma: Drop unnecessary READ_ONCE() on event counters
  cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()
  cgroup: Add per-subsys-css kill_css_finish deferral
  cgroup: Move populated counters to cgroup_subsys_state
  cgroup: Annotate unlocked nr_populated_* accesses with READ_ONCE/WRITE_ONCE
  cgroup: Inline cgroup_has_tasks() in cgroup.h
  cgroup/rdma: document rdma.peak, rdma.events and rdma.events.local
  cgroup/rdma: add rdma.events.local for per-cgroup allocation failure attribution
  cgroup/rdma: add rdma.events to track resource limit exhaustion
  cgroup/rdma: add rdma.peak for per-device peak usage tracking
  selftests/cgroup: check malloc return value in alloc_anon functions
  cgroup/cpuset: Skip hardwall ancestor scan in cpuset v2 in cpuset_current_node_allowed()
  selftests/cgroup: fix misleading debug message in test_cgfreezer_time_child
  selftests/cgroup: fix child process escaping to parent cleanup in test_cpucg_nice
  selftests/cgroup: Add NULL check after malloc in cgroup_util.c
  ...

Merge tag 'cgroup-for-7.1-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

2026-06-03T15:59:24+00:00

Pull cgroup fixes from Tejun Heo:
 "One cpuset fix and a maintenance update, both low-risk:

   - Fix cpuset partition CPU accounting under sibling CPU exclusion
     that could produce wrong CPU assignments and trigger
     scheduling-domain warnings. Includes selftests.

   - Update an email address in MAINTAINERS"

* tag 'cgroup-for-7.1-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset: Change Ridong's email
  cgroup/cpuset: Add test cases for sibling CPU exclusion on partition update
  cgroup/cpuset: Use effective_xcpus in partcmd_update add/del mask calculation

cgroup: Migrate tasks to the root css when a controller is rebound

2026-06-02T18:25:29+00:00

cgroup_apply_control_disable() defers kill_css_finish() while a css is
still populated, relying on css_update_populated() to fire the deferred
kill once the populated count reaches zero.

This deadlocks when a controller is rebound out of a hierarchy. Mounting
an implicit_on_dfl controller such as perf_event as a v1 hierarchy steals
it off the default hierarchy, and rebind_subsystems() kills its
per-cgroup csses while they are still populated. The migration run in the
same step keeps the old css for a controller no longer in the hierarchy's
mask, so no task is migrated off the dying csses. Their populated count
never reaches zero, the deferred kill_css_finish() never fires, and the
next cgroup_lock_and_drain_offline() hangs forever under cgroup_mutex.

That migration is already a no-op pass over the rebound subtree. Add
cgroup_rebind_ss_mask so find_existing_css_set() resolves the leaving
controllers to the root css. Their tasks are migrated there, the
per-cgroup csses depopulate, and cgroup_apply_control_disable() kills
them synchronously. The deferral stays correct for the rmdir and
controller-disable paths it was meant for.

Fixes: 1dffd95575eb ("cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()")
Reported-by: Mark Brown
Closes: https://lore.kernel.org/all/41cd159c-54e5-45e0-81df-eaf36a6c028e@sirena.org.uk/
Reported-by: Bert Karwatzki
Closes: https://lore.kernel.org/all/4e986b4ed7e16547805d54b6e67d09120bc4d2f2.camel@web.de/
Tested-by: Mark Brown
Tested-by: Bert Karwatzki
Signed-off-by: Tejun Heo

cgroup/cpuset: Free sched domains on rebuild guard failure

2026-05-29T18:23:18+00:00

generate_sched_domains() returns sched-domain masks and optional
attributes that are normally handed to partition_sched_domains(), which
takes ownership of them.

rebuild_sched_domains_locked() has a WARN guard after
generate_sched_domains() and before partition_sched_domains() to avoid
passing offline CPUs into the scheduler domain rebuild path. If that
guard fires, the function currently returns directly without freeing
the generated doms and attr.

Free the generated sched-domain masks and attributes before returning
from the guard failure path.
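
The ownership rule above can be modelled in userspace. This is a minimal
sketch, not the cpuset code: model_generate(), model_partition(),
model_free(), model_rebuild() and the live_allocs counter are hypothetical
stand-ins, and the consumer here frees its buffers immediately, whereas
the scheduler keeps them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Outstanding buffers; exists only so the sketch can detect a leak. */
static int live_allocs;

/* Stand-in for generate_sched_domains(): hands back doms and attr. */
static int model_generate(int **doms, int **attr)
{
	*doms = malloc(sizeof(**doms));
	*attr = malloc(sizeof(**attr));
	if (!*doms || !*attr) {
		free(*doms);
		free(*attr);
		*doms = *attr = NULL;
		return 0;
	}
	live_allocs += 2;
	return 1;
}

/* Releases both buffers produced by model_generate(). */
static void model_free(int *doms, int *attr)
{
	free(doms);
	free(attr);
	live_allocs -= 2;
}

/* Stand-in for partition_sched_domains(): takes ownership of both. */
static void model_partition(int ndoms, int *doms, int *attr)
{
	(void)ndoms;
	model_free(doms, attr);	/* the model simply drops them */
}

/* Returns true if the domains were handed over, false if the guard fired. */
static bool model_rebuild(bool guard_fires)
{
	int *doms, *attr;
	int ndoms = model_generate(&doms, &attr);

	if (!ndoms)
		return false;

	if (guard_fires) {
		/* Nobody took ownership yet, so this path must free. */
		model_free(doms, attr);
		return false;
	}

	model_partition(ndoms, doms, attr);
	return true;
}
```

Calling model_rebuild(true) leaves live_allocs at zero only because of the
model_free() call in the guard branch; dropping that call reproduces the
leak in the model.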

Signed-off-by: Guopeng Zhang 
Reviewed-by: Waiman Long 
Signed-off-by: Tejun Heo

cgroup: pair max limit READ_ONCE() with WRITE_ONCE()

2026-05-28T15:40:06+00:00

cgroup.max.descendants and cgroup.max.depth are shown through seq_file.
Their show callbacks read cgrp->max_descendants and cgrp->max_depth,
respectively, with READ_ONCE().

The corresponding write callbacks update the same scalar fields while
holding the cgroup lock, but the seq_file show path does not serialize
against those stores. This leaves the lockless show-side loads annotated
with READ_ONCE(), while the corresponding stores remain plain stores.

Use WRITE_ONCE() for the updates so the intended lockless access is marked
consistently on both sides. This does not change locking, ordering, or
user-visible semantics.
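
A minimal userspace sketch of the pairing, for illustration only. The two
macros are simplified stand-ins built on a volatile cast (the kernel
versions carry extra type checks), and struct cgroup_limits,
limits_set_max_depth() and limits_show_max_depth() are hypothetical names,
not the ones in the patch. It relies on the GCC/Clang __typeof__ extension.

```c
#include <assert.h>
#include <limits.h>

/* Simplified stand-ins for the kernel macros (assumption, see above). */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical stand-in for the two scalar limits kept in struct cgroup. */
struct cgroup_limits {
	int max_descendants;
	int max_depth;
};

/* Write side: in the kernel this runs with the cgroup lock held. */
static void limits_set_max_depth(struct cgroup_limits *l, int depth)
{
	assert(depth >= 0);
	/* Marked store, pairing with the marked load below. */
	WRITE_ONCE(l->max_depth, depth);
}

/* Show side: takes no lock, so the load is marked as well. */
static int limits_show_max_depth(struct cgroup_limits *l)
{
	return READ_ONCE(l->max_depth);
}
```

The marking documents that the access is intentionally lockless and keeps
the compiler from tearing or refetching it; it adds no ordering, which
matches the statement above that semantics are unchanged.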

Assisted-by: OpenAI-Codex:gpt-5.5
Signed-off-by: Ren Tamura 
Signed-off-by: Tejun Heo

cgroup/cpuset: Use effective_xcpus in partcmd_update add/del mask calculation

2026-05-27T18:58:59+00:00

When sibling CPU exclusion occurs, a partition's user_xcpus may contain
CPUs that were never actually granted to it. These CPUs are present in
user_xcpus(cs) but not in cs->effective_xcpus.

The partcmd_update path in update_parent_effective_cpumask() uses
user_xcpus(cs) (via the local variable xcpus) to compute the addmask
(CPUs to return to parent) and delmask (CPUs to request from parent).
This is incorrect:

 1) When newmask removes a CPU that was previously excluded by a
    sibling, addmask incorrectly includes that CPU and tries to return
    it to the parent even though the partition never actually owned it,
    causing CPU overlap with sibling partitions and triggering warnings
    in generate_sched_domains().

 2) When newmask adds a previously excluded CPU that is now available,
    delmask fails to request it from the parent because user_xcpus(cs)
    already includes it.

Fix this by using cs->effective_xcpus instead of user_xcpus(cs) in all
partcmd_update paths that calculate addmask or delmask, including the
PERR_NOCPUS error handling paths.

Reproducers:

  Example 1 - Removing a sibling-excluded CPU incorrectly returns it:

    # cd /sys/fs/cgroup
    # echo "0-1" > a1/cpuset.cpus
    # echo "root" > a1/cpuset.cpus.partition
    # echo "0-2" > b1/cpuset.cpus
    # echo "root" > b1/cpuset.cpus.partition
    # echo "2" > b1/cpuset.cpus
    # cat cpuset.cpus.effective
    # Actual: 0-1,3    Expected: 3

  Example 2 - Expanding to a previously excluded CPU fails to request it:

    # cd /sys/fs/cgroup
    # echo "0-1" > a1/cpuset.cpus
    # echo "root" > a1/cpuset.cpus.partition
    # echo "0-2" > b1/cpuset.cpus
    # echo "root" > b1/cpuset.cpus.partition
    # echo "member" > a1/cpuset.cpus.partition
    # echo "1-2" > b1/cpuset.cpus
    # cat cpuset.cpus.effective
    # Actual: 0-1,3    Expected: 0,3
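
Both reproducers can be checked with a bitmask model. This is a simplified
sketch under stated assumptions, not the kernel code: a 4-CPU machine
(CPUs 0-3), one bit per CPU in place of struct cpumask, hypothetical
helper names, and none of the extra masking against the parent that
update_parent_effective_cpumask() performs. The "owned" argument is the
mask used as the partition's current CPUs: user_xcpus(cs) before the fix,
cs->effective_xcpus after it.

```c
#include <assert.h>

/* One bit per CPU; toy replacement for struct cpumask (assumption). */
typedef unsigned int cpumask_t;

/* Mask covering CPUs lo..hi inclusive. */
#define CPUS(lo, hi) ((cpumask_t)(((1u << ((hi) - (lo) + 1)) - 1u) << (lo)))

/* CPUs to return to the parent: owned before, not in newmask. */
static cpumask_t calc_addmask(cpumask_t owned, cpumask_t newmask)
{
	assert(newmask != 0);
	return owned & ~newmask;
}

/* CPUs to request from the parent: in newmask, not owned before. */
static cpumask_t calc_delmask(cpumask_t owned, cpumask_t newmask)
{
	assert(newmask != 0);
	return newmask & ~owned;
}

/* Parent's effective CPUs once addmask and delmask are applied. */
static cpumask_t parent_after(cpumask_t parent, cpumask_t owned,
			      cpumask_t newmask)
{
	return (parent | calc_addmask(owned, newmask)) &
	       ~calc_delmask(owned, newmask);
}
```

In Example 1, b1 has user_xcpus 0-2 but effective_xcpus 2, and the parent
holds CPU 3. With newmask 2, parent_after(CPUS(3, 3), CPUS(0, 2),
CPUS(2, 2)) gives 0-1,3 while parent_after(CPUS(3, 3), CPUS(2, 2),
CPUS(2, 2)) gives 3. In Example 2 the parent holds 0-1,3 after a1 becomes
a member; with newmask 1-2 the user_xcpus variant leaves 0-1,3 and the
effective_xcpus variant yields 0,3.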

Fixes: 2a3602030d80 ("cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict")
Cc: stable@vger.kernel.org # v7.0+
Suggested-by: Zhang Guopeng 
Signed-off-by: Sun Shaojie 
Reviewed-by: Waiman Long 
Signed-off-by: Tejun Heo

Merge tag 'cgroup-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

2026-05-22T23:28:47+00:00

Pull cgroup fixes from Tejun Heo:
 "Two rstat fixes:

   - Out-of-bounds access in the css_rstat_updated() BPF kfunc when
     called with an unchecked user-supplied cpu

   - Over-strict NMI guard after the recent switch to try_cmpxchg left
     sparc and ppc64 unable to queue rstat updates from NMI"

* tag 'cgroup-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: rstat: relax NMI guard after switch to try_cmpxchg
  cgroup/rstat: validate cpu before css_rstat_cpu() access

cgroup: rstat: relax NMI guard after switch to try_cmpxchg

2026-05-20T19:44:35+00:00

Commit 36df6e3dbd7e ("cgroup: make css_rstat_updated nmi safe") used
this_cpu_cmpxchg() for the lockless insertion, and therefore required
both ARCH_HAVE_NMI_SAFE_CMPXCHG and ARCH_HAS_NMI_SAFE_THIS_CPU_OPS in
the NMI guard: on archs without the latter, this_cpu_cmpxchg() falls
back to "local_irq_save() + plain cmpxchg", and local_irq_save()
cannot mask NMIs.

Commit 3309b63a2281 ("cgroup: rstat: use LOCK CMPXCHG in
css_rstat_updated") later replaced this_cpu_cmpxchg() with plain
try_cmpxchg() to fix cross-CPU lockless-list corruption, but left the
NMI guard untouched.  After that switch, css_rstat_updated() no longer
performs any this_cpu_*() RMW operations and only relies on the arch
having NMI-safe cmpxchg, so ARCH_HAS_NMI_SAFE_THIS_CPU_OPS is no
longer required in the guard.

Relax the guard accordingly so that archs which have HAVE_NMI and
ARCH_HAVE_NMI_SAFE_CMPXCHG but not ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
(e.g. sparc, powerpc on PPC64/BOOK3S) can benefit from the existing
CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC path.  Without this, the css
is never queued in NMI on those archs, and the atomics staged by
account_{slab,kmem}_nmi_safe() are not drained by flush_nmi_stats().
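
The effect of relaxing the guard can be sketched as two predicates. This
is an illustrative model only: struct arch_caps and both function names
are hypothetical, and the booleans stand in for the Kconfig symbols named
above rather than reproducing the actual condition in rstat.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of the two arch capabilities discussed above. */
struct arch_caps {
	bool nmi_safe_cmpxchg;		/* ARCH_HAVE_NMI_SAFE_CMPXCHG */
	bool nmi_safe_this_cpu_ops;	/* ARCH_HAS_NMI_SAFE_THIS_CPU_OPS */
};

/* Old guard: bail out in NMI unless both capabilities are present. */
static bool old_guard_skips(struct arch_caps c, bool in_nmi)
{
	return in_nmi && (!c.nmi_safe_cmpxchg || !c.nmi_safe_this_cpu_ops);
}

/* Relaxed guard: only NMI-safe cmpxchg is still required. */
static bool new_guard_skips(struct arch_caps c, bool in_nmi)
{
	return in_nmi && !c.nmi_safe_cmpxchg;
}
```

For an arch with NMI-safe cmpxchg but without NMI-safe this_cpu ops, the
old predicate skips queueing in NMI context and the relaxed one does not.
Outside NMI context neither predicate skips.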

Fixes: 3309b63a2281 ("cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated")
Signed-off-by: Cunlong Li 
Signed-off-by: Tejun Heo

cgroup/rstat: validate cpu before css_rstat_cpu() access

2026-05-18T19:31:52+00:00

css_rstat_updated() is exposed as a BPF kfunc and accepts a
caller-provided cpu argument. The function uses cpu for per-cpu rstat
lookups without checking whether it refers to a valid possible CPU.

A BPF iter/cgroup program with CAP_BPF and CAP_PERFMON can pass an
invalid cpu value. On an unfixed UBSAN_BOUNDS test kernel, cpu ==
0x7fffffff triggers:

  UBSAN: array-index-out-of-bounds in kernel/cgroup/rstat.c:31:9
  index 2147483647 is out of range for type 'long unsigned int [64]'
  Call Trace:
    css_rstat_updated
    bpf_iter_run_prog
    cgroup_iter_seq_show
    bpf_seq_read

Add cpu validation to the BPF-facing css_rstat_updated() kfunc and
move the common implementation to __css_rstat_updated() for in-kernel
callers.
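
The split between a checked entry point and an unchecked common helper
can be sketched in userspace as follows. All names here are hypothetical,
the fixed 64-entry array stands in for the per-cpu lookup, cpu_is_valid()
stands in for a possible-CPU check, and the bool return value exists only
so the sketch can be tested; it is not the kfunc's signature.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 64	/* same size as the array in the report above */

/* Hypothetical per-cpu state indexed by cpu number. */
static unsigned long rstat_updated[MODEL_NR_CPUS];

/* Stand-in for a possible-CPU check: every index below the limit is valid. */
static bool cpu_is_valid(int cpu)
{
	return cpu >= 0 && cpu < MODEL_NR_CPUS;
}

/* Common implementation for trusted callers that already hold a valid cpu. */
static void model_rstat_updated_unchecked(int cpu)
{
	assert(cpu_is_valid(cpu));
	rstat_updated[cpu]++;
}

/* Externally facing entry point: reject a bad cpu before any indexing. */
static bool model_rstat_updated(int cpu)
{
	if (!cpu_is_valid(cpu))
		return false;
	model_rstat_updated_unchecked(cpu);
	return true;
}
```

With this shape, model_rstat_updated(0x7fffffff) returns without touching
the array, while in-kernel style callers keep using the unchecked helper
and pay no extra cost.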

Fixes: a319185be9f5 ("cgroup: bpf: enable bpf programs to integrate with rstat")
Signed-off-by: Qing Ming 
Signed-off-by: Tejun Heo

cgroup/rdma: Drop unnecessary READ_ONCE() on event counters

2026-05-18T19:24:50+00:00

All accesses to the event counters are serialized by rdmacg_mutex,
making the READ_ONCE() annotations unnecessary. Remove them.

Signed-off-by: Tao Cui 
Signed-off-by: Tejun Heo