<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/kernel/sched, branch v7.1-rc4</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>Merge tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext</title>
<updated>2026-05-13T22:00:40+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-05-13T22:00:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=59a62ea4583e0f740bb3576ec210b23f39754327'/>
<id>59a62ea4583e0f740bb3576ec210b23f39754327</id>
<content type='text'>
Pull sched_ext fixes from Tejun Heo:
 "The bulk of this is hardening of the new sub-scheduler infrastructure.

   - UAFs and lifecycle bugs on the sub-sched attach/detach paths:
     parent sub_kset freed under a racing child, list_del_rcu on an
     uninitialized list head, ops-&gt;priv stomped by concurrent
     attach/detach, and a UAF in the init-failure error path

   - Task state-machine reorg closing concurrent enable-vs-dead races: a
     task exiting during the unlocked init window could trip NULL ops
     derefs or skip exit_task() cleanup

   - A scx_link_sched() self-deadlock on scx_sched_lock

   - isolcpus: stop dereferencing the now-RCU-protected HK_TYPE_DOMAIN
     cpumask without RCU, and stop rejecting BPF schedulers when only
     cpuset isolated partitions are active

   - PREEMPT_RT: disable irq_work runs in hardirq context so dumps show
     the failing task rather than the irq_work kthread

   - Assorted !CONFIG_EXT_SUB_SCHED, randconfig, and selftest build
     fixes"

* tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation
  sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work
  sched_ext: INIT_LIST_HEAD() &amp;sch-&gt;all in scx_alloc_and_add_sched()
  sched_ext: Drop NONE early return in scx_disable_and_exit_task()
  sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path
  sched_ext: Clear ops-&gt;priv on scx_alloc_and_add_sched() error paths
  sched_ext: Fix ops-&gt;priv clobber on concurrent attach/detach
  selftests/sched_ext: Fix build error in dequeue selftest
  sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths
  sched_ext: Close sub-sched init race with post-init DEAD recheck
  sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN
  sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state
  sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state()
  sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work
  sched_ext: Use IRQ_WORK_INIT_HARD() to initialize sch-&gt;disable_irq_work
  sched_ext: Fix !CONFIG_EXT_SUB_SCHED build warnings
  sched_ext: Drop unused scx_find_sub_sched() stub
  sched_ext: Move scx_error() out of scx_link_sched()'s lock region
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull sched_ext fixes from Tejun Heo:
 "The bulk of this is hardening of the new sub-scheduler infrastructure.

   - UAFs and lifecycle bugs on the sub-sched attach/detach paths:
     parent sub_kset freed under a racing child, list_del_rcu on an
     uninitialized list head, ops-&gt;priv stomped by concurrent
     attach/detach, and a UAF in the init-failure error path

   - Task state-machine reorg closing concurrent enable-vs-dead races: a
     task exiting during the unlocked init window could trip NULL ops
     derefs or skip exit_task() cleanup

   - A scx_link_sched() self-deadlock on scx_sched_lock

   - isolcpus: stop dereferencing the now-RCU-protected HK_TYPE_DOMAIN
     cpumask without RCU, and stop rejecting BPF schedulers when only
     cpuset isolated partitions are active

   - PREEMPT_RT: disable irq_work runs in hardirq context so dumps show
     the failing task rather than the irq_work kthread

   - Assorted !CONFIG_EXT_SUB_SCHED, randconfig, and selftest build
     fixes"

* tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation
  sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work
  sched_ext: INIT_LIST_HEAD() &amp;sch-&gt;all in scx_alloc_and_add_sched()
  sched_ext: Drop NONE early return in scx_disable_and_exit_task()
  sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path
  sched_ext: Clear ops-&gt;priv on scx_alloc_and_add_sched() error paths
  sched_ext: Fix ops-&gt;priv clobber on concurrent attach/detach
  selftests/sched_ext: Fix build error in dequeue selftest
  sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths
  sched_ext: Close sub-sched init race with post-init DEAD recheck
  sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN
  sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state
  sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state()
  sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work
  sched_ext: Use IRQ_WORK_INIT_HARD() to initialize sch-&gt;disable_irq_work
  sched_ext: Fix !CONFIG_EXT_SUB_SCHED build warnings
  sched_ext: Drop unused scx_find_sub_sched() stub
  sched_ext: Move scx_error() out of scx_link_sched()'s lock region
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup</title>
<updated>2026-05-13T21:56:31+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-05-13T21:56:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=0913b580f8490caaaf08dd1591e0bc07ac2720cb'/>
<id>0913b580f8490caaaf08dd1591e0bc07ac2720cb</id>
<content type='text'>
Pull cgroup fixes from Tejun Heo:

 - cpuset fixes:
     - Partition invalidation could return CPUs still in use by sibling
       partitions, producing overlapping effective_cpus
     - cpuset_can_attach() over-reserved DL bandwidth on moves that
       stayed within the same root domain
     - Pending DL migration state leaked into later attaches when a
       later can_attach() check failed
     - Reorder PF_EXITING and __GFP_HARDWALL checks so dying tasks can
       allocate from any node and exit quickly

 - dmem: propagate -ENOMEM instead of spinning forever when the fallback
   pool allocation also fails

 - selftests/cgroup: percpu test error-path leak, bogus numeric
   comparison of cpuset strings, and a zero-length read() that silently
   passed OOM-kill tests

* tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset: Return only actually allocated CPUs during partition invalidation
  selftests/cgroup: Fix error path leaks in test_percpu_basic
  cgroup/cpuset: Reserve DL bandwidth only for root-domain moves
  cgroup/cpuset: Reset DL migration state on can_attach() failure
  selftests/cgroup: Fix string comparison in write_test
  selftests/cgroup: Fix cg_read_strcmp() empty string comparison
  cgroup/dmem: Return -ENOMEM on failed pool preallocation
  cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed()
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull cgroup fixes from Tejun Heo:

 - cpuset fixes:
     - Partition invalidation could return CPUs still in use by sibling
       partitions, producing overlapping effective_cpus
     - cpuset_can_attach() over-reserved DL bandwidth on moves that
       stayed within the same root domain
     - Pending DL migration state leaked into later attaches when a
       later can_attach() check failed
     - Reorder PF_EXITING and __GFP_HARDWALL checks so dying tasks can
       allocate from any node and exit quickly

 - dmem: propagate -ENOMEM instead of spinning forever when the fallback
   pool allocation also fails

 - selftests/cgroup: percpu test error-path leak, bogus numeric
   comparison of cpuset strings, and a zero-length read() that silently
   passed OOM-kill tests

* tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset: Return only actually allocated CPUs during partition invalidation
  selftests/cgroup: Fix error path leaks in test_percpu_basic
  cgroup/cpuset: Reserve DL bandwidth only for root-domain moves
  cgroup/cpuset: Reset DL migration state on can_attach() failure
  selftests/cgroup: Fix string comparison in write_test
  selftests/cgroup: Fix cg_read_strcmp() empty string comparison
  cgroup/dmem: Return -ENOMEM on failed pool preallocation
  cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed()
</pre>
</div>
</content>
</entry>
<entry>
<title>sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation</title>
<updated>2026-05-13T20:02:57+00:00</updated>
<author>
<name>Andrea Righi</name>
<email>arighi@nvidia.com</email>
</author>
<published>2026-05-13T11:24:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=6ae315d37924435516d697ea7dde0b799a5928e0'/>
<id>6ae315d37924435516d697ea7dde0b799a5928e0</id>
<content type='text'>
scx_enable() refuses to attach a BPF scheduler when isolcpus=domain is
in effect by comparing housekeeping_cpumask(HK_TYPE_DOMAIN) against
cpu_possible_mask.

Since commit 27c3a5967f05 ("sched/isolation: Convert housekeeping
cpumasks to rcu pointers"), HK_TYPE_DOMAIN's cpumask is RCU protected
and dereferencing it requires either RCU read lock, the cpu_hotplug
write lock, or the cpuset lock; scx_enable() holds none of these, so
booting with isolcpus=domain and attaching any BPF scheduler triggers
the following lockdep splat:

  =============================
  WARNING: suspicious RCU usage
  -----------------------------
  kernel/sched/isolation.c:60 suspicious rcu_dereference_check() usage!

  1 lock held by scx_flash/281:
   #0: ffffffff8379fce0 (update_mutex){+.+.}-{4:4}, at:
       bpf_struct_ops_link_create+0x134/0x1c0

  Call Trace:
   dump_stack_lvl+0x6f/0xb0
   lockdep_rcu_suspicious.cold+0x37/0x70
   housekeeping_cpumask+0xcd/0xe0
   scx_enable.isra.0+0x17/0x120
   bpf_scx_reg+0x5e/0x80
   bpf_struct_ops_link_create+0x151/0x1c0
   __sys_bpf+0x1e4b/0x33c0
   __x64_sys_bpf+0x21/0x30
   do_syscall_64+0x117/0xf80
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

In addition, commit 03ff73510169 ("cpuset: Update HK_TYPE_DOMAIN cpumask
from cpuset") made HK_TYPE_DOMAIN include cpuset isolated partitions as
well, which means the current check also rejects BPF schedulers when a
cpuset partition is active. That contradicts the original intent of
commit 9f391f94a173 ("sched_ext: Disallow loading BPF scheduler if
isolcpus= domain isolation is in effect"), which explicitly noted that
cpuset partitions are honored through per-task cpumasks and should not
be rejected.

Switch to housekeeping_enabled(HK_TYPE_DOMAIN_BOOT), which reads only
the housekeeping flag bit (no RCU dereference) and reflects exactly the
boot-time isolcpus= configuration that the error message refers to.

Fixes: 27c3a5967f05 ("sched/isolation: Convert housekeeping cpumasks to rcu pointers")
Cc: stable@vger.kernel.org # v7.0+
Signed-off-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
scx_enable() refuses to attach a BPF scheduler when isolcpus=domain is
in effect by comparing housekeeping_cpumask(HK_TYPE_DOMAIN) against
cpu_possible_mask.

Since commit 27c3a5967f05 ("sched/isolation: Convert housekeeping
cpumasks to rcu pointers"), HK_TYPE_DOMAIN's cpumask is RCU protected
and dereferencing it requires either RCU read lock, the cpu_hotplug
write lock, or the cpuset lock; scx_enable() holds none of these, so
booting with isolcpus=domain and attaching any BPF scheduler triggers
the following lockdep splat:

  =============================
  WARNING: suspicious RCU usage
  -----------------------------
  kernel/sched/isolation.c:60 suspicious rcu_dereference_check() usage!

  1 lock held by scx_flash/281:
   #0: ffffffff8379fce0 (update_mutex){+.+.}-{4:4}, at:
       bpf_struct_ops_link_create+0x134/0x1c0

  Call Trace:
   dump_stack_lvl+0x6f/0xb0
   lockdep_rcu_suspicious.cold+0x37/0x70
   housekeeping_cpumask+0xcd/0xe0
   scx_enable.isra.0+0x17/0x120
   bpf_scx_reg+0x5e/0x80
   bpf_struct_ops_link_create+0x151/0x1c0
   __sys_bpf+0x1e4b/0x33c0
   __x64_sys_bpf+0x21/0x30
   do_syscall_64+0x117/0xf80
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

In addition, commit 03ff73510169 ("cpuset: Update HK_TYPE_DOMAIN cpumask
from cpuset") made HK_TYPE_DOMAIN include cpuset isolated partitions as
well, which means the current check also rejects BPF schedulers when a
cpuset partition is active. That contradicts the original intent of
commit 9f391f94a173 ("sched_ext: Disallow loading BPF scheduler if
isolcpus= domain isolation is in effect"), which explicitly noted that
cpuset partitions are honored through per-task cpumasks and should not
be rejected.

Switch to housekeeping_enabled(HK_TYPE_DOMAIN_BOOT), which reads only
the housekeeping flag bit (no RCU dereference) and reflects exactly the
boot-time isolcpus= configuration that the error message refers to.

Fixes: 27c3a5967f05 ("sched/isolation: Convert housekeeping cpumasks to rcu pointers")
Cc: stable@vger.kernel.org # v7.0+
Signed-off-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work</title>
<updated>2026-05-12T21:28:56+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2026-05-11T23:18:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=cceb874eee46fe4b3d3c6c496f19125d9a3a9a8f'/>
<id>cceb874eee46fe4b3d3c6c496f19125d9a3a9a8f</id>
<content type='text'>
scx_sub_enable_workfn() pins parent-&gt;kobj before dropping scx_sched_lock,
but that does not pin parent-&gt;sub_kset. Concurrent disable can
kset_unregister and free sub_kset before scx_alloc_and_add_sched()
dereferences it.

Split sub_kset teardown: kobject_del() at disable keeps sysfs removal; defer
kobject_put() to scx_sched_free_rcu_work so the memory survives. A racing
child sees state_in_sysfs=0 with valid memory, sysfs_create_dir() fails, and
the existing exit_kind gate in scx_link_sched() turns it away with -ENOENT.

Fixes: 411d3ef1a705 ("sched_ext: Unregister sub_kset on scheduler disable")
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
scx_sub_enable_workfn() pins parent-&gt;kobj before dropping scx_sched_lock,
but that does not pin parent-&gt;sub_kset. Concurrent disable can
kset_unregister and free sub_kset before scx_alloc_and_add_sched()
dereferences it.

Split sub_kset teardown: kobject_del() at disable keeps sysfs removal; defer
kobject_put() to scx_sched_free_rcu_work so the memory survives. A racing
child sees state_in_sysfs=0 with valid memory, sysfs_create_dir() fails, and
the existing exit_kind gate in scx_link_sched() turns it away with -ENOENT.

Fixes: 411d3ef1a705 ("sched_ext: Unregister sub_kset on scheduler disable")
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched_ext: INIT_LIST_HEAD() &amp;sch-&gt;all in scx_alloc_and_add_sched()</title>
<updated>2026-05-12T21:28:56+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2026-05-11T23:18:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=b273b75b8d677aea06dd06d80b61b3bb06e94680'/>
<id>b273b75b8d677aea06dd06d80b61b3bb06e94680</id>
<content type='text'>
On scx_link_sched() error paths (parent disabled, hash insert failure),
&amp;sch-&gt;all is never added to scx_sched_all. The cleanup path runs
scx_unlink_sched() unconditionally, which calls list_del_rcu(&amp;sch-&gt;all) on a
list_head that was never initialized triggering a corruption warning.

Initialize &amp;sch-&gt;all.

Fixes: 54be8de4236a ("sched_ext: Factor out scx_link_sched() and scx_unlink_sched()")
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
On scx_link_sched() error paths (parent disabled, hash insert failure),
&amp;sch-&gt;all is never added to scx_sched_all. The cleanup path runs
scx_unlink_sched() unconditionally, which calls list_del_rcu(&amp;sch-&gt;all) on a
list_head that was never initialized triggering a corruption warning.

Initialize &amp;sch-&gt;all.

Fixes: 54be8de4236a ("sched_ext: Factor out scx_link_sched() and scx_unlink_sched()")
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched_ext: Drop NONE early return in scx_disable_and_exit_task()</title>
<updated>2026-05-12T21:13:58+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2026-05-12T20:30:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=39e25a2100604320e8d9df54c6c31258f7a3df29'/>
<id>39e25a2100604320e8d9df54c6c31258f7a3df29</id>
<content type='text'>
d3e73a0808dd ("sched_ext: Handle SCX_TASK_NONE in disable/switched_from
paths") skipped the trailing scx_set_task_sched(p, NULL) on NONE tasks.
After scx_fail_parent() parks a task at NONE/sched=parent and the parent
is later freed via queue_rcu_work() during root_disable, the preserved
p-&gt;scx.sched dangles - print_scx_info() from sched_show_task() reads
sch-&gt;ops.name from freed memory.

Drop the early return. __scx_disable_and_exit_task() already short-
circuits on NONE and the SUB_INIT block was cleared by
scx_fail_parent()'s earlier call, so clearing p-&gt;scx.sched is the only
work left - and the one thing the path actually needs.

v2: Extend the SUB_INIT block comment to note that the flag is only
    set on the sub-enable path, so it's always clear on the NONE
    re-entry (Andrea).

Fixes: d3e73a0808dd ("sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths")
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reviewed-by: Andrea Righi &lt;arighi@nvidia.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
d3e73a0808dd ("sched_ext: Handle SCX_TASK_NONE in disable/switched_from
paths") skipped the trailing scx_set_task_sched(p, NULL) on NONE tasks.
After scx_fail_parent() parks a task at NONE/sched=parent and the parent
is later freed via queue_rcu_work() during root_disable, the preserved
p-&gt;scx.sched dangles - print_scx_info() from sched_show_task() reads
sch-&gt;ops.name from freed memory.

Drop the early return. __scx_disable_and_exit_task() already short-
circuits on NONE and the SUB_INIT block was cleared by
scx_fail_parent()'s earlier call, so clearing p-&gt;scx.sched is the only
work left - and the one thing the path actually needs.

v2: Extend the SUB_INIT block comment to note that the flag is only
    set on the sub-enable path, so it's always clear on the NONE
    re-entry (Andrea).

Fixes: d3e73a0808dd ("sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths")
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reviewed-by: Andrea Righi &lt;arighi@nvidia.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path</title>
<updated>2026-05-11T22:05:48+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2026-05-11T22:05:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=9a415cc53711f2238e0f0ca8a6bcc796c003b127'/>
<id>9a415cc53711f2238e0f0ca8a6bcc796c003b127</id>
<content type='text'>
In scx_root_enable_workfn(), put_task_struct(p) is called before scx_error()
dereferences p-&gt;comm and p-&gt;pid. If the iterator's reference is the last
drop, the task is freed synchronously and the deref becomes a UAF.

Move put_task_struct() past scx_error().

Reported-by: Sashiko &lt;sashiko-bot@kernel.org&gt;
Closes: https://lore.kernel.org/all/20260511214031.AF5E9C2BCB0@smtp.kernel.org/
Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In scx_root_enable_workfn(), put_task_struct(p) is called before scx_error()
dereferences p-&gt;comm and p-&gt;pid. If the iterator's reference is the last
drop, the task is freed synchronously and the deref becomes a UAF.

Move put_task_struct() past scx_error().

Reported-by: Sashiko &lt;sashiko-bot@kernel.org&gt;
Closes: https://lore.kernel.org/all/20260511214031.AF5E9C2BCB0@smtp.kernel.org/
Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cgroup/cpuset: Reserve DL bandwidth only for root-domain moves</title>
<updated>2026-05-11T20:27:14+00:00</updated>
<author>
<name>Guopeng Zhang</name>
<email>zhangguopeng@kylinos.cn</email>
</author>
<published>2026-05-09T10:20:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=5dd74441cbf42c22e874450eb6a6bbb19390a216'/>
<id>5dd74441cbf42c22e874450eb6a6bbb19390a216</id>
<content type='text'>
cpuset_can_attach() currently adds the bandwidth of all migrating
SCHED_DEADLINE tasks to sum_migrate_dl_bw. If the source and destination
cpuset effective CPU masks do not overlap, the whole sum is then
reserved in the destination root domain.

set_cpus_allowed_dl(), however, subtracts bandwidth from the source
root domain only when the affinity change really moves the task between
root domains. A DL task can move between cpusets that are still in the
same root domain, so including that task in sum_migrate_dl_bw can reserve
destination bandwidth without a matching source-side subtraction.

Share the root-domain move test with set_cpus_allowed_dl(). Keep
nr_migrate_dl_tasks counting all migrating deadline tasks for cpuset DL
task accounting, but add to sum_migrate_dl_bw only for tasks that need a
root-domain bandwidth move. Keep using the destination cpuset effective
CPU mask and leave the broader can_attach()/attach() transaction model
unchanged.

Fixes: 2ef269ef1ac0 ("cgroup/cpuset: Free DL BW in case can_attach() fails")
Cc: stable@vger.kernel.org # v6.10+
Signed-off-by: Guopeng Zhang &lt;zhangguopeng@kylinos.cn&gt;
Reviewed-by: Waiman Long &lt;longman@redhat.com&gt;
Acked-by: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Tested-by: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
cpuset_can_attach() currently adds the bandwidth of all migrating
SCHED_DEADLINE tasks to sum_migrate_dl_bw. If the source and destination
cpuset effective CPU masks do not overlap, the whole sum is then
reserved in the destination root domain.

set_cpus_allowed_dl(), however, subtracts bandwidth from the source
root domain only when the affinity change really moves the task between
root domains. A DL task can move between cpusets that are still in the
same root domain, so including that task in sum_migrate_dl_bw can reserve
destination bandwidth without a matching source-side subtraction.

Share the root-domain move test with set_cpus_allowed_dl(). Keep
nr_migrate_dl_tasks counting all migrating deadline tasks for cpuset DL
task accounting, but add to sum_migrate_dl_bw only for tasks that need a
root-domain bandwidth move. Keep using the destination cpuset effective
CPU mask and leave the broader can_attach()/attach() transaction model
unchanged.

Fixes: 2ef269ef1ac0 ("cgroup/cpuset: Free DL BW in case can_attach() fails")
Cc: stable@vger.kernel.org # v6.10+
Signed-off-by: Guopeng Zhang &lt;zhangguopeng@kylinos.cn&gt;
Reviewed-by: Waiman Long &lt;longman@redhat.com&gt;
Acked-by: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Tested-by: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched_ext: Clear ops-&gt;priv on scx_alloc_and_add_sched() error paths</title>
<updated>2026-05-11T08:50:31+00:00</updated>
<author>
<name>Andrea Righi</name>
<email>arighi@nvidia.com</email>
</author>
<published>2026-05-11T08:31:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=86ecb1c1a1f5c1bf4a45b91f54f8220c3121bd3b'/>
<id>86ecb1c1a1f5c1bf4a45b91f54f8220c3121bd3b</id>
<content type='text'>
scx_alloc_and_add_sched() can fail after @sch has been assigned to
ops-&gt;priv. In those cases @sch is torn down (either via kfree() through
the err_free_* chain or via kobject_put() -&gt; scx_kobj_release() -&gt; RCU
work), but @ops-&gt;priv is left pointing at the about-to-be-freed pointer.

With the recent -EBUSY gate in scx_root_enable_workfn() and
scx_sub_enable_workfn() that rejects an attach when @ops-&gt;priv is still
non-NULL, see commit bbf30b383cf6 ("sched_ext: Fix ops-&gt;priv clobber on
concurrent attach/detach"), a dangling @ops-&gt;priv permanently locks the
kdata out: every future attach attempt sees a stale binding and returns
-EBUSY even though no scheduler is actually attached.

Clear @ops-&gt;priv on the post-assign failure paths so that the kdata
returns to its pre-attach state when the function returns ERR_PTR().

Fixes: bbf30b383cf6 ("sched_ext: Fix ops-&gt;priv clobber on concurrent attach/detach")
Suggested-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
scx_alloc_and_add_sched() can fail after @sch has been assigned to
ops-&gt;priv. In those cases @sch is torn down (either via kfree() through
the err_free_* chain or via kobject_put() -&gt; scx_kobj_release() -&gt; RCU
work), but @ops-&gt;priv is left pointing at the about-to-be-freed pointer.

With the recent -EBUSY gate in scx_root_enable_workfn() and
scx_sub_enable_workfn() that rejects an attach when @ops-&gt;priv is still
non-NULL, see commit bbf30b383cf6 ("sched_ext: Fix ops-&gt;priv clobber on
concurrent attach/detach"), a dangling @ops-&gt;priv permanently locks the
kdata out: every future attach attempt sees a stale binding and returns
-EBUSY even though no scheduler is actually attached.

Clear @ops-&gt;priv on the post-assign failure paths so that the kdata
returns to its pre-attach state when the function returns ERR_PTR().

Fixes: bbf30b383cf6 ("sched_ext: Fix ops-&gt;priv clobber on concurrent attach/detach")
Suggested-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched_ext: Fix ops-&gt;priv clobber on concurrent attach/detach</title>
<updated>2026-05-11T07:40:03+00:00</updated>
<author>
<name>Andrea Righi</name>
<email>arighi@nvidia.com</email>
</author>
<published>2026-05-11T06:18:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=bbf30b383cf6e87f2fe57c292fbd640b1d88b4c3'/>
<id>bbf30b383cf6e87f2fe57c292fbd640b1d88b4c3</id>
<content type='text'>
Under heavy concurrent attach/detach operations, scx_claim_exit() can
trigger a NULL pointer dereference. This can be reproduced running the
reload_loop kselftests inside a virtme-ng session:

 $ vng -v -- ./tools/testing/selftests/sched_ext/runner -t reload_loop
 ...
 BUG: kernel NULL pointer dereference, address: 0000000000000400
 RIP: 0010:scx_claim_exit+0x3b/0x120
 Call Trace:
  &lt;TASK&gt;
  bpf_scx_unreg+0x45/0xb0
  bpf_struct_ops_map_link_dealloc+0x39/0x50
  bpf_link_release+0x18/0x20
  __fput+0x10b/0x2e0
  __x64_sys_close+0x47/0xa0

The underlying race (diagnosed by Tejun Heo) is a stomp of @ops-&gt;priv,
not a missing NULL check:

  T2 unreg(K)                       T1 reg(K)
  -----------                       ---------
  sch = ops-&gt;priv = sch_b800
  scx_disable; flush_disable_work
    [scx_root_disable: scx_root=NULL,
     mutex_unlock, state=DISABLED]
                                    mutex_lock; state ok
                                    scx_alloc_and_add_sched:
                                      ops-&gt;priv = sch_a800
                                    scx_root = sch_a800; init=0
                                    state=ENABLED; mutex_unlock
    [flush returns]
  RCU_INIT_POINTER(ops-&gt;priv, NULL) &lt;-- clobbers sch_a800
  kobject_put(sch_b800)

T1 acquires scx_enable_mutex inside scx_root_disable()'s mutex_unlock
window and starts a fresh attach on the same kdata, assigning sch_a800
to @ops-&gt;priv. T2 then continues out of scx_disable()/flush_disable_work
and clobbers @ops-&gt;priv to NULL, leaking sch_a800; the bpf_link is gone
but state stays SCX_ENABLED, so all future attaches fail with -EBUSY
permanently. The next bpf_scx_unreg() on that kdata then reads NULL
@ops-&gt;priv and dereferences it in scx_claim_exit().

Make @ops-&gt;priv the lifecycle binding: in scx_root_enable_workfn() and
scx_sub_enable_workfn(), after the existing state check and still under
scx_enable_mutex, refuse with -EBUSY if @ops-&gt;priv is non-NULL. This
rejects an attempt to reuse a kdata that is still bound to a previous
scheduler instance, closing the race without changing the unreg side.

Fixes: 105dcd005be2 ("sched_ext: Introduce scx_prog_sched()")
Suggested-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Under heavy concurrent attach/detach operations, scx_claim_exit() can
trigger a NULL pointer dereference. This can be reproduced running the
reload_loop kselftests inside a virtme-ng session:

 $ vng -v -- ./tools/testing/selftests/sched_ext/runner -t reload_loop
 ...
 BUG: kernel NULL pointer dereference, address: 0000000000000400
 RIP: 0010:scx_claim_exit+0x3b/0x120
 Call Trace:
  &lt;TASK&gt;
  bpf_scx_unreg+0x45/0xb0
  bpf_struct_ops_map_link_dealloc+0x39/0x50
  bpf_link_release+0x18/0x20
  __fput+0x10b/0x2e0
  __x64_sys_close+0x47/0xa0

The underlying race (diagnosed by Tejun Heo) is a stomp of @ops-&gt;priv,
not a missing NULL check:

  T2 unreg(K)                       T1 reg(K)
  -----------                       ---------
  sch = ops-&gt;priv = sch_b800
  scx_disable; flush_disable_work
    [scx_root_disable: scx_root=NULL,
     mutex_unlock, state=DISABLED]
                                    mutex_lock; state ok
                                    scx_alloc_and_add_sched:
                                      ops-&gt;priv = sch_a800
                                    scx_root = sch_a800; init=0
                                    state=ENABLED; mutex_unlock
    [flush returns]
  RCU_INIT_POINTER(ops-&gt;priv, NULL) &lt;-- clobbers sch_a800
  kobject_put(sch_b800)

T1 acquires scx_enable_mutex inside scx_root_disable()'s mutex_unlock
window and starts a fresh attach on the same kdata, assigning sch_a800
to @ops-&gt;priv. T2 then continues out of scx_disable()/flush_disable_work
and clobbers @ops-&gt;priv to NULL, leaking sch_a800; the bpf_link is gone
but state stays SCX_ENABLED, so all future attaches fail with -EBUSY
permanently. The next bpf_scx_unreg() on that kdata then reads NULL
@ops-&gt;priv and dereferences it in scx_claim_exit().

Make @ops-&gt;priv the lifecycle binding: in scx_root_enable_workfn() and
scx_sub_enable_workfn(), after the existing state check and still under
scx_enable_mutex, refuse with -EBUSY if @ops-&gt;priv is non-NULL. This
rejects an attempt to reuse a kdata that is still bound to a previous
scheduler instance, closing the race without changing the unreg side.

Fixes: 105dcd005be2 ("sched_ext: Introduce scx_prog_sched()")
Suggested-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
