<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/kernel/sched/ext.c, branch v7.1-rc5</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>Merge tag 'sched_ext-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext</title>
<updated>2026-05-22T23:43:33+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-05-22T23:43:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=79bd2dded182b1d458b18e62684b7f82ffc682e5'/>
<id>79bd2dded182b1d458b18e62684b7f82ffc682e5</id>
<content type='text'>
Pull sched_ext fixes from Tejun Heo:

 - Spurious WARN in ops_dequeue() racing with concurrent dispatch

 - Self-deadlock between scheduler disable and a concurrent sub-sched
   enable

* tag 'sched_ext-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Fix spurious WARN on stale ops_state in ops_dequeue()
  sched_ext: Fix deadlock between scx_root_disable() and concurrent forks
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull sched_ext fixes from Tejun Heo:

 - Spurious WARN in ops_dequeue() racing with concurrent dispatch

 - Self-deadlock between scheduler disable and a concurrent sub-sched
   enable

* tag 'sched_ext-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Fix spurious WARN on stale ops_state in ops_dequeue()
  sched_ext: Fix deadlock between scx_root_disable() and concurrent forks
</pre>
</div>
</content>
</entry>
<entry>
<title>sched_ext: Fix spurious WARN on stale ops_state in ops_dequeue()</title>
<updated>2026-05-21T16:27:44+00:00</updated>
<author>
<name>Samuele Mariotti</name>
<email>smariotti@disroot.org</email>
</author>
<published>2026-05-21T10:59:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=0c1a9dce208b4dc265925898e5da98934f7f9266'/>
<id>0c1a9dce208b4dc265925898e5da98934f7f9266</id>
<content type='text'>
ops_dequeue() can race with finish_dispatch() and spuriously trigger the
"queued task must be in BPF scheduler's custody" warning.

ops_dequeue() snapshots p-&gt;scx.ops_state via atomic_long_read_acquire()
and then, in the SCX_OPSS_QUEUED arm, asserts that SCX_TASK_IN_CUSTODY
is set. The two reads are not atomic w.r.t. a concurrent
finish_dispatch() running on another CPU:

CPU 1                                    CPU 2
=====                                    =====
                                         dequeue_task_scx()
                                           ops_dequeue()
                                             opss = read_acquire(ops_state)
                                                  = SCX_OPSS_QUEUED
finish_dispatch()
  cmpxchg ops_state:
    SCX_OPSS_QUEUED -&gt; SCX_OPSS_DISPATCHING  [succeeds]
  dispatch_enqueue(SCX_DSQ_GLOBAL,
                   SCX_ENQ_CLEAR_OPSS)
    call_task_dequeue()
      p-&gt;scx.flags &amp;= ~SCX_TASK_IN_CUSTODY
                                             WARN_ON_ONCE(!(p-&gt;scx.flags &amp;
                                                     SCX_TASK_IN_CUSTODY))
                                            /* opss is stale: QUEUED,
                                             * but task already claimed */
    set_release(ops_state, SCX_OPSS_NONE)

The race has been observed via two distinct call chains: the most common
goes through sched_setaffinity(), a rarer variant through
sched_change_begin().

For SCX_DSQ_GLOBAL / SCX_DSQ_BYPASS, dispatch_enqueue() clears
SCX_TASK_IN_CUSTODY before clearing ops_state to SCX_OPSS_NONE
(intentional, to avoid concurrent non-atomic RMW of p-&gt;scx.flags against
ops_dequeue()). The window between those two writes is exactly what
ops_dequeue() observes as "QUEUED without custody".

The observed state is not actually inconsistent, it just means CPU 1 has
already claimed the task and the QUEUED value held by CPU 2 is stale.
Re-read ops_state in that case; the next read is guaranteed to return
SCX_OPSS_DISPATCHING or SCX_OPSS_NONE, both of which exit the switch
cleanly. The retry is bounded: once IN_CUSTODY is cleared, ops_state has
already advanced past QUEUED for this dispatch cycle, and a fresh QUEUED
would require re-enqueue under p's rq lock, which CPU 2 holds.

Changes in v2:
- Use READ_ONCE() for p-&gt;scx.flags to ensure fresh reads and prevent
  compiler reordering in the lockless path
- Add cpu_relax() to reduce power consumption and improve performance
  during the spin-wait
- Use unlikely() to optimize branch prediction for the common case
- Expand the in-code comment to document the race condition and
  bounded retry guarantee

Fixes: ebf1ccff79c4 ("sched_ext: Fix ops.dequeue() semantics")
Suggested-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Samuele Mariotti &lt;smariotti@disroot.org&gt;
Signed-off-by: Paolo Valente &lt;paolo.valente@unimore.it&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
ops_dequeue() can race with finish_dispatch() and spuriously trigger the
"queued task must be in BPF scheduler's custody" warning.

ops_dequeue() snapshots p-&gt;scx.ops_state via atomic_long_read_acquire()
and then, in the SCX_OPSS_QUEUED arm, asserts that SCX_TASK_IN_CUSTODY
is set. The two reads are not atomic w.r.t. a concurrent
finish_dispatch() running on another CPU:

CPU 1                                    CPU 2
=====                                    =====
                                         dequeue_task_scx()
                                           ops_dequeue()
                                             opss = read_acquire(ops_state)
                                                  = SCX_OPSS_QUEUED
finish_dispatch()
  cmpxchg ops_state:
    SCX_OPSS_QUEUED -&gt; SCX_OPSS_DISPATCHING  [succeeds]
  dispatch_enqueue(SCX_DSQ_GLOBAL,
                   SCX_ENQ_CLEAR_OPSS)
    call_task_dequeue()
      p-&gt;scx.flags &amp;= ~SCX_TASK_IN_CUSTODY
                                             WARN_ON_ONCE(!(p-&gt;scx.flags &amp;
                                                     SCX_TASK_IN_CUSTODY))
                                            /* opss is stale: QUEUED,
                                             * but task already claimed */
    set_release(ops_state, SCX_OPSS_NONE)

The race has been observed via two distinct call chains: the most common
goes through sched_setaffinity(), a rarer variant through
sched_change_begin().

For SCX_DSQ_GLOBAL / SCX_DSQ_BYPASS, dispatch_enqueue() clears
SCX_TASK_IN_CUSTODY before clearing ops_state to SCX_OPSS_NONE
(intentional, to avoid concurrent non-atomic RMW of p-&gt;scx.flags against
ops_dequeue()). The window between those two writes is exactly what
ops_dequeue() observes as "QUEUED without custody".

The observed state is not actually inconsistent, it just means CPU 1 has
already claimed the task and the QUEUED value held by CPU 2 is stale.
Re-read ops_state in that case; the next read is guaranteed to return
SCX_OPSS_DISPATCHING or SCX_OPSS_NONE, both of which exit the switch
cleanly. The retry is bounded: once IN_CUSTODY is cleared, ops_state has
already advanced past QUEUED for this dispatch cycle, and a fresh QUEUED
would require re-enqueue under p's rq lock, which CPU 2 holds.

Changes in v2:
- Use READ_ONCE() for p-&gt;scx.flags to ensure fresh reads and prevent
  compiler reordering in the lockless path
- Add cpu_relax() to reduce power consumption and improve performance
  during the spin-wait
- Use unlikely() to optimize branch prediction for the common case
- Expand the in-code comment to document the race condition and
  bounded retry guarantee

Fixes: ebf1ccff79c4 ("sched_ext: Fix ops.dequeue() semantics")
Suggested-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Samuele Mariotti &lt;smariotti@disroot.org&gt;
Signed-off-by: Paolo Valente &lt;paolo.valente@unimore.it&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched_ext: Fix deadlock between scx_root_disable() and concurrent forks</title>
<updated>2026-05-17T19:06:38+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2026-05-17T17:43:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=515e3996a4c26e7f955c13b3b19522a2c8642af9'/>
<id>515e3996a4c26e7f955c13b3b19522a2c8642af9</id>
<content type='text'>
scx_root_disable() enters SCX_DISABLING before it grabs scx_enable_mutex to
clear __scx_switched_all and scx_switching_all. task_should_scx() short-circuits on DISABLING,
so forks in that window land on fair while next_active_class() still skips
fair - the new tasks stall.

This can deadlock the disable path itself: scx_alloc_and_add_sched() runs
under scx_enable_mutex and creates a helper kthread; if that new kthread is
one of the stalled fair tasks, the mutex holder waits forever and
scx_root_disable() can never make progress. Only sub-sched support exposes
this, since sub-sched enables are the only path where
scx_alloc_and_add_sched() can race the root's disable.

Move the DISABLING check after @scx_switching_all. @scx_switching_all
serves as a proxy for __scx_switched_all, so while it's set, forks keep
going to scx. Once cleared, DISABLING applies normally.

v2: Reword in-source comment and description. (Andrea)

Fixes: 337ec00b1d9c ("sched_ext: Implement cgroup sub-sched enabling and disabling")
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reviewed-by: Andrea Righi &lt;arighi@nvidia.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
scx_root_disable() enters SCX_DISABLING before it grabs scx_enable_mutex to
clear __scx_switched_all and scx_switching_all. task_should_scx() short-circuits on DISABLING,
so forks in that window land on fair while next_active_class() still skips
fair - the new tasks stall.

This can deadlock the disable path itself: scx_alloc_and_add_sched() runs
under scx_enable_mutex and creates a helper kthread; if that new kthread is
one of the stalled fair tasks, the mutex holder waits forever and
scx_root_disable() can never make progress. Only sub-sched support exposes
this, since sub-sched enables are the only path where
scx_alloc_and_add_sched() can race the root's disable.

Move the DISABLING check after @scx_switching_all. @scx_switching_all
serves as a proxy for __scx_switched_all, so while it's set, forks keep
going to scx. Once cleared, DISABLING applies normally.

v2: Reword in-source comment and description. (Andrea)

Fixes: 337ec00b1d9c ("sched_ext: Implement cgroup sub-sched enabling and disabling")
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reviewed-by: Andrea Righi &lt;arighi@nvidia.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext</title>
<updated>2026-05-13T22:00:40+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-05-13T22:00:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=59a62ea4583e0f740bb3576ec210b23f39754327'/>
<id>59a62ea4583e0f740bb3576ec210b23f39754327</id>
<content type='text'>
Pull sched_ext fixes from Tejun Heo:
 "The bulk of this is hardening of the new sub-scheduler infrastructure.

   - UAFs and lifecycle bugs on the sub-sched attach/detach paths:
     parent sub_kset freed under a racing child, list_del_rcu on an
     uninitialized list head, ops-&gt;priv stomped by concurrent
     attach/detach, and a UAF in the init-failure error path

   - Task state-machine reorg closing concurrent enable-vs-dead races: a
     task exiting during the unlocked init window could trip NULL ops
     derefs or skip exit_task() cleanup

   - A scx_link_sched() self-deadlock on scx_sched_lock

   - isolcpus: stop dereferencing the now-RCU-protected HK_TYPE_DOMAIN
     cpumask without RCU, and stop rejecting BPF schedulers when only
     cpuset isolated partitions are active

   - PREEMPT_RT: disable irq_work runs in hardirq context so dumps show
     the failing task rather than the irq_work kthread

   - Assorted !CONFIG_EXT_SUB_SCHED, randconfig, and selftest build
     fixes"

* tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation
  sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work
  sched_ext: INIT_LIST_HEAD() &amp;sch-&gt;all in scx_alloc_and_add_sched()
  sched_ext: Drop NONE early return in scx_disable_and_exit_task()
  sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path
  sched_ext: Clear ops-&gt;priv on scx_alloc_and_add_sched() error paths
  sched_ext: Fix ops-&gt;priv clobber on concurrent attach/detach
  selftests/sched_ext: Fix build error in dequeue selftest
  sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths
  sched_ext: Close sub-sched init race with post-init DEAD recheck
  sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN
  sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state
  sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state()
  sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work
  sched_ext: Use IRQ_WORK_INIT_HARD() to initialize sch-&gt;disable_irq_work
  sched_ext: Fix !CONFIG_EXT_SUB_SCHED build warnings
  sched_ext: Drop unused scx_find_sub_sched() stub
  sched_ext: Move scx_error() out of scx_link_sched()'s lock region
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull sched_ext fixes from Tejun Heo:
 "The bulk of this is hardening of the new sub-scheduler infrastructure.

   - UAFs and lifecycle bugs on the sub-sched attach/detach paths:
     parent sub_kset freed under a racing child, list_del_rcu on an
     uninitialized list head, ops-&gt;priv stomped by concurrent
     attach/detach, and a UAF in the init-failure error path

   - Task state-machine reorg closing concurrent enable-vs-dead races: a
     task exiting during the unlocked init window could trip NULL ops
     derefs or skip exit_task() cleanup

   - A scx_link_sched() self-deadlock on scx_sched_lock

   - isolcpus: stop dereferencing the now-RCU-protected HK_TYPE_DOMAIN
     cpumask without RCU, and stop rejecting BPF schedulers when only
     cpuset isolated partitions are active

   - PREEMPT_RT: disable irq_work runs in hardirq context so dumps show
     the failing task rather than the irq_work kthread

   - Assorted !CONFIG_EXT_SUB_SCHED, randconfig, and selftest build
     fixes"

* tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation
  sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work
  sched_ext: INIT_LIST_HEAD() &amp;sch-&gt;all in scx_alloc_and_add_sched()
  sched_ext: Drop NONE early return in scx_disable_and_exit_task()
  sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path
  sched_ext: Clear ops-&gt;priv on scx_alloc_and_add_sched() error paths
  sched_ext: Fix ops-&gt;priv clobber on concurrent attach/detach
  selftests/sched_ext: Fix build error in dequeue selftest
  sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths
  sched_ext: Close sub-sched init race with post-init DEAD recheck
  sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN
  sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state
  sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state()
  sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work
  sched_ext: Use IRQ_WORK_INIT_HARD() to initialize sch-&gt;disable_irq_work
  sched_ext: Fix !CONFIG_EXT_SUB_SCHED build warnings
  sched_ext: Drop unused scx_find_sub_sched() stub
  sched_ext: Move scx_error() out of scx_link_sched()'s lock region
</pre>
</div>
</content>
</entry>
<entry>
<title>sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation</title>
<updated>2026-05-13T20:02:57+00:00</updated>
<author>
<name>Andrea Righi</name>
<email>arighi@nvidia.com</email>
</author>
<published>2026-05-13T11:24:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=6ae315d37924435516d697ea7dde0b799a5928e0'/>
<id>6ae315d37924435516d697ea7dde0b799a5928e0</id>
<content type='text'>
scx_enable() refuses to attach a BPF scheduler when isolcpus=domain is
in effect by comparing housekeeping_cpumask(HK_TYPE_DOMAIN) against
cpu_possible_mask.

Since commit 27c3a5967f05 ("sched/isolation: Convert housekeeping
cpumasks to rcu pointers"), HK_TYPE_DOMAIN's cpumask is RCU protected
and dereferencing it requires either RCU read lock, the cpu_hotplug
write lock, or the cpuset lock; scx_enable() holds none of these, so
booting with isolcpus=domain and attaching any BPF scheduler triggers
the following lockdep splat:

  =============================
  WARNING: suspicious RCU usage
  -----------------------------
  kernel/sched/isolation.c:60 suspicious rcu_dereference_check() usage!

  1 lock held by scx_flash/281:
   #0: ffffffff8379fce0 (update_mutex){+.+.}-{4:4}, at:
       bpf_struct_ops_link_create+0x134/0x1c0

  Call Trace:
   dump_stack_lvl+0x6f/0xb0
   lockdep_rcu_suspicious.cold+0x37/0x70
   housekeeping_cpumask+0xcd/0xe0
   scx_enable.isra.0+0x17/0x120
   bpf_scx_reg+0x5e/0x80
   bpf_struct_ops_link_create+0x151/0x1c0
   __sys_bpf+0x1e4b/0x33c0
   __x64_sys_bpf+0x21/0x30
   do_syscall_64+0x117/0xf80
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

In addition, commit 03ff73510169 ("cpuset: Update HK_TYPE_DOMAIN cpumask
from cpuset") made HK_TYPE_DOMAIN include cpuset isolated partitions as
well, which means the current check also rejects BPF schedulers when a
cpuset partition is active. That contradicts the original intent of
commit 9f391f94a173 ("sched_ext: Disallow loading BPF scheduler if
isolcpus= domain isolation is in effect"), which explicitly noted that
cpuset partitions are honored through per-task cpumasks and should not
be rejected.

Switch to housekeeping_enabled(HK_TYPE_DOMAIN_BOOT), which reads only
the housekeeping flag bit (no RCU dereference) and reflects exactly the
boot-time isolcpus= configuration that the error message refers to.

Fixes: 27c3a5967f05 ("sched/isolation: Convert housekeeping cpumasks to rcu pointers")
Cc: stable@vger.kernel.org # v7.0+
Signed-off-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
scx_enable() refuses to attach a BPF scheduler when isolcpus=domain is
in effect by comparing housekeeping_cpumask(HK_TYPE_DOMAIN) against
cpu_possible_mask.

Since commit 27c3a5967f05 ("sched/isolation: Convert housekeeping
cpumasks to rcu pointers"), HK_TYPE_DOMAIN's cpumask is RCU protected
and dereferencing it requires either RCU read lock, the cpu_hotplug
write lock, or the cpuset lock; scx_enable() holds none of these, so
booting with isolcpus=domain and attaching any BPF scheduler triggers
the following lockdep splat:

  =============================
  WARNING: suspicious RCU usage
  -----------------------------
  kernel/sched/isolation.c:60 suspicious rcu_dereference_check() usage!

  1 lock held by scx_flash/281:
   #0: ffffffff8379fce0 (update_mutex){+.+.}-{4:4}, at:
       bpf_struct_ops_link_create+0x134/0x1c0

  Call Trace:
   dump_stack_lvl+0x6f/0xb0
   lockdep_rcu_suspicious.cold+0x37/0x70
   housekeeping_cpumask+0xcd/0xe0
   scx_enable.isra.0+0x17/0x120
   bpf_scx_reg+0x5e/0x80
   bpf_struct_ops_link_create+0x151/0x1c0
   __sys_bpf+0x1e4b/0x33c0
   __x64_sys_bpf+0x21/0x30
   do_syscall_64+0x117/0xf80
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

In addition, commit 03ff73510169 ("cpuset: Update HK_TYPE_DOMAIN cpumask
from cpuset") made HK_TYPE_DOMAIN include cpuset isolated partitions as
well, which means the current check also rejects BPF schedulers when a
cpuset partition is active. That contradicts the original intent of
commit 9f391f94a173 ("sched_ext: Disallow loading BPF scheduler if
isolcpus= domain isolation is in effect"), which explicitly noted that
cpuset partitions are honored through per-task cpumasks and should not
be rejected.

Switch to housekeeping_enabled(HK_TYPE_DOMAIN_BOOT), which reads only
the housekeeping flag bit (no RCU dereference) and reflects exactly the
boot-time isolcpus= configuration that the error message refers to.

Fixes: 27c3a5967f05 ("sched/isolation: Convert housekeeping cpumasks to rcu pointers")
Cc: stable@vger.kernel.org # v7.0+
Signed-off-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work</title>
<updated>2026-05-12T21:28:56+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2026-05-11T23:18:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=cceb874eee46fe4b3d3c6c496f19125d9a3a9a8f'/>
<id>cceb874eee46fe4b3d3c6c496f19125d9a3a9a8f</id>
<content type='text'>
scx_sub_enable_workfn() pins parent-&gt;kobj before dropping scx_sched_lock,
but that does not pin parent-&gt;sub_kset. Concurrent disable can
kset_unregister and free sub_kset before scx_alloc_and_add_sched()
dereferences it.

Split sub_kset teardown: kobject_del() at disable keeps sysfs removal; defer
kobject_put() to scx_sched_free_rcu_work so the memory survives. A racing
child sees state_in_sysfs=0 with valid memory, sysfs_create_dir() fails, and
the existing exit_kind gate in scx_link_sched() turns it away with -ENOENT.

Fixes: 411d3ef1a705 ("sched_ext: Unregister sub_kset on scheduler disable")
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
scx_sub_enable_workfn() pins parent-&gt;kobj before dropping scx_sched_lock,
but that does not pin parent-&gt;sub_kset. Concurrent disable can
kset_unregister and free sub_kset before scx_alloc_and_add_sched()
dereferences it.

Split sub_kset teardown: kobject_del() at disable keeps sysfs removal; defer
kobject_put() to scx_sched_free_rcu_work so the memory survives. A racing
child sees state_in_sysfs=0 with valid memory, sysfs_create_dir() fails, and
the existing exit_kind gate in scx_link_sched() turns it away with -ENOENT.

Fixes: 411d3ef1a705 ("sched_ext: Unregister sub_kset on scheduler disable")
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched_ext: INIT_LIST_HEAD() &amp;sch-&gt;all in scx_alloc_and_add_sched()</title>
<updated>2026-05-12T21:28:56+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2026-05-11T23:18:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=b273b75b8d677aea06dd06d80b61b3bb06e94680'/>
<id>b273b75b8d677aea06dd06d80b61b3bb06e94680</id>
<content type='text'>
On scx_link_sched() error paths (parent disabled, hash insert failure),
&amp;sch-&gt;all is never added to scx_sched_all. The cleanup path runs
scx_unlink_sched() unconditionally, which calls list_del_rcu(&amp;sch-&gt;all) on a
list_head that was never initialized triggering a corruption warning.

Initialize &amp;sch-&gt;all.

Fixes: 54be8de4236a ("sched_ext: Factor out scx_link_sched() and scx_unlink_sched()")
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
On scx_link_sched() error paths (parent disabled, hash insert failure),
&amp;sch-&gt;all is never added to scx_sched_all. The cleanup path runs
scx_unlink_sched() unconditionally, which calls list_del_rcu(&amp;sch-&gt;all) on a
list_head that was never initialized triggering a corruption warning.

Initialize &amp;sch-&gt;all.

Fixes: 54be8de4236a ("sched_ext: Factor out scx_link_sched() and scx_unlink_sched()")
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched_ext: Drop NONE early return in scx_disable_and_exit_task()</title>
<updated>2026-05-12T21:13:58+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2026-05-12T20:30:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=39e25a2100604320e8d9df54c6c31258f7a3df29'/>
<id>39e25a2100604320e8d9df54c6c31258f7a3df29</id>
<content type='text'>
d3e73a0808dd ("sched_ext: Handle SCX_TASK_NONE in disable/switched_from
paths") skipped the trailing scx_set_task_sched(p, NULL) on NONE tasks.
After scx_fail_parent() parks a task at NONE/sched=parent and the parent
is later freed via queue_rcu_work() during root_disable, the preserved
p-&gt;scx.sched dangles - print_scx_info() from sched_show_task() reads
sch-&gt;ops.name from freed memory.

Drop the early return. __scx_disable_and_exit_task() already short-
circuits on NONE and the SUB_INIT block was cleared by
scx_fail_parent()'s earlier call, so clearing p-&gt;scx.sched is the only
work left - and the one thing the path actually needs.

v2: Extend the SUB_INIT block comment to note that the flag is only
    set on the sub-enable path, so it's always clear on the NONE
    re-entry (Andrea).

Fixes: d3e73a0808dd ("sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths")
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reviewed-by: Andrea Righi &lt;arighi@nvidia.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
d3e73a0808dd ("sched_ext: Handle SCX_TASK_NONE in disable/switched_from
paths") skipped the trailing scx_set_task_sched(p, NULL) on NONE tasks.
After scx_fail_parent() parks a task at NONE/sched=parent and the parent
is later freed via queue_rcu_work() during root_disable, the preserved
p-&gt;scx.sched dangles - print_scx_info() from sched_show_task() reads
sch-&gt;ops.name from freed memory.

Drop the early return. __scx_disable_and_exit_task() already short-
circuits on NONE and the SUB_INIT block was cleared by
scx_fail_parent()'s earlier call, so clearing p-&gt;scx.sched is the only
work left - and the one thing the path actually needs.

v2: Extend the SUB_INIT block comment to note that the flag is only
    set on the sub-enable path, so it's always clear on the NONE
    re-entry (Andrea).

Fixes: d3e73a0808dd ("sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths")
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reviewed-by: Andrea Righi &lt;arighi@nvidia.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path</title>
<updated>2026-05-11T22:05:48+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2026-05-11T22:05:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=9a415cc53711f2238e0f0ca8a6bcc796c003b127'/>
<id>9a415cc53711f2238e0f0ca8a6bcc796c003b127</id>
<content type='text'>
In scx_root_enable_workfn(), put_task_struct(p) is called before scx_error()
dereferences p-&gt;comm and p-&gt;pid. If the iterator's reference is the last
drop, the task is freed synchronously and the deref becomes a UAF.

Move put_task_struct() past scx_error().

Reported-by: Sashiko &lt;sashiko-bot@kernel.org&gt;
Closes: https://lore.kernel.org/all/20260511214031.AF5E9C2BCB0@smtp.kernel.org/
Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In scx_root_enable_workfn(), put_task_struct(p) is called before scx_error()
dereferences p-&gt;comm and p-&gt;pid. If the iterator's reference is the last
drop, the task is freed synchronously and the deref becomes a UAF.

Move put_task_struct() past scx_error().

Reported-by: Sashiko &lt;sashiko-bot@kernel.org&gt;
Closes: https://lore.kernel.org/all/20260511214031.AF5E9C2BCB0@smtp.kernel.org/
Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched_ext: Clear ops-&gt;priv on scx_alloc_and_add_sched() error paths</title>
<updated>2026-05-11T08:50:31+00:00</updated>
<author>
<name>Andrea Righi</name>
<email>arighi@nvidia.com</email>
</author>
<published>2026-05-11T08:31:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=86ecb1c1a1f5c1bf4a45b91f54f8220c3121bd3b'/>
<id>86ecb1c1a1f5c1bf4a45b91f54f8220c3121bd3b</id>
<content type='text'>
scx_alloc_and_add_sched() can fail after @sch has been assigned to
ops-&gt;priv. In those cases @sch is torn down (either via kfree() through
the err_free_* chain or via kobject_put() -&gt; scx_kobj_release() -&gt; RCU
work), but @ops-&gt;priv is left pointing at the about-to-be-freed pointer.

With the recent -EBUSY gate in scx_root_enable_workfn() and
scx_sub_enable_workfn() that rejects an attach when @ops-&gt;priv is still
non-NULL, see commit bbf30b383cf6 ("sched_ext: Fix ops-&gt;priv clobber on
concurrent attach/detach"), a dangling @ops-&gt;priv permanently locks the
kdata out: every future attach attempt sees a stale binding and returns
-EBUSY even though no scheduler is actually attached.

Clear @ops-&gt;priv on the post-assign failure paths so that the kdata
returns to its pre-attach state when the function returns ERR_PTR().

Fixes: bbf30b383cf6 ("sched_ext: Fix ops-&gt;priv clobber on concurrent attach/detach")
Suggested-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
scx_alloc_and_add_sched() can fail after @sch has been assigned to
ops-&gt;priv. In those cases @sch is torn down (either via kfree() through
the err_free_* chain or via kobject_put() -&gt; scx_kobj_release() -&gt; RCU
work), but @ops-&gt;priv is left pointing at the about-to-be-freed pointer.

With the recent -EBUSY gate in scx_root_enable_workfn() and
scx_sub_enable_workfn() that rejects an attach when @ops-&gt;priv is still
non-NULL, see commit bbf30b383cf6 ("sched_ext: Fix ops-&gt;priv clobber on
concurrent attach/detach"), a dangling @ops-&gt;priv permanently locks the
kdata out: every future attach attempt sees a stale binding and returns
-EBUSY even though no scheduler is actually attached.

Clear @ops-&gt;priv on the post-assign failure paths so that the kdata
returns to its pre-attach state when the function returns ERR_PTR().

Fixes: bbf30b383cf6 ("sched_ext: Fix ops-&gt;priv clobber on concurrent attach/detach")
Suggested-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
