linux.git/io_uring/tw.c, branch master

io_uring: Use system_dfl_wq instead of system_unbound_wq

2026-06-16T21:59:10+00:00

Commit de7341ffe49e ("io_uring: switch normal task_work to a mpscq")
added a use of system_unbound_wq, which is deprecated in favor of
system_dfl_wq added by commit 128ea9f6ccfb ("workqueue: Add
system_percpu_wq and system_dfl_wq"). An upcoming warning in the
workqueue tree flags this with:

  workqueue: work func io_tctx_fallback_work enqueued on deprecated workqueue. Use system_{percpu|dfl}_wq instead.

Switch to system_dfl_wq to clear up the warning.

Fixes: de7341ffe49e ("io_uring: switch normal task_work to a mpscq")
Signed-off-by: Nathan Chancellor 
Link: https://patch.msgid.link/20260616-io_uring-fix-wq-warning-v1-1-cfc9d934eedb@kernel.org
Signed-off-by: Jens Axboe

io_uring: get rid of tw_pending for !DEFER task work

2026-06-16T15:48:00+00:00

The normal task_work path used a tw_pending bit to ensure the callback
was only added once: the mpscq drains incrementally, so a single
tctx_task_work() run can take the queue through empty -> non-empty
several times, and each transition would otherwise re-add the already
pending callback_head. That would corrupt the task_work list, and is
what tw_pending protects against.

The tw_pending bit can go away if we stop running the task_work as soon
as the queue empties.

Suggested-by: Caleb Sander Mateos 
Reviewed-by: Caleb Sander Mateos 
Signed-off-by: Jens Axboe

io_uring: remove the per-ctx fallback task_work machinery

2026-06-13T12:27:20+00:00

With the tctx fallback running its entries directly, the per-ctx
fallback work has a single user left: moving local (DEFER_TASKRUN)
task_work entries out of a ring that is going away. Both of its call
sites are process context and don't hold ->uring_lock, the same
conditions the deferred fallback work itself ran under - so run the
entries in cancel mode right there instead, and rename the helper to
io_cancel_local_task_work() to match what it now does.

With that, ->fallback_llist, ->fallback_work, io_fallback_req_func()
and __io_fallback_tw() can all go away, along with the fallback work
flushing in the ring exit and cancel paths. Requests that get
orphaned by an exiting task now run via the tctx fallback work, which
the ring exit side implicitly waits on through the ctx refs those
requests hold.

Signed-off-by: Jens Axboe

io_uring: run the tctx task_work fallback directly

2026-06-13T12:27:15+00:00

The fallback work drains the tctx queue only to redistribute the entries
into the per-ctx fallback lists, bouncing them through a second
(per-ctx) work item before they finally run. That made sense when the
producer side did the draining and could be in any context, but the
fallback work is a regular process context kworker: it can just run the
entries itself. Reuse the normal run loop: when it is run from the
fallback kernel thread, ts.cancel gets set and the work is terminated.

Signed-off-by: Jens Axboe

io_uring: switch normal task_work to a mpscq

2026-06-13T12:27:11+00:00

Like the local task_work list, the normal (tctx) task_work list is an
llist, and hence needs the O(n) llist_reverse_order() pass before
running entries in queue order. On top of that, capped runs - sqpoll
processing IORING_TW_CAP_ENTRIES_VALUE entries at a time - need the
claimed-but-unprocessed leftovers carried in a separate retry_list,
as they can't be pushed back to the shared list.
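
The capped-run bookkeeping described above can be sketched in user space
as follows. This is a single-threaded model under stated assumptions, not
the kernel code: struct tw_node, struct tw_lists and the tw_*() helpers
are hypothetical names, and the shared list is a plain pointer rather
than a lock-free llist.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a task_work entry; not a kernel type. */
struct tw_node {
	struct tw_node *next;
	int id;
};

/* Models the old scheme: a shared list plus a consumer-private retry list. */
struct tw_lists {
	struct tw_node *shared;	/* producers push here, newest first */
	struct tw_node *retry;	/* claimed but not yet run, oldest first */
};

/* Producer side of the model: push at the head, so the list is LIFO. */
static void tw_push(struct tw_lists *l, struct tw_node *n)
{
	n->next = l->shared;
	l->shared = n;
}

/* The O(n) pass that turns the LIFO chain back into queue order. */
static struct tw_node *tw_reverse(struct tw_node *head)
{
	struct tw_node *prev = NULL;

	while (head) {
		struct tw_node *next = head->next;

		head->next = prev;
		prev = head;
		head = next;
	}
	return prev;
}

/*
 * Run at most 'cap' entries, recording their ids in 'out'. Whatever is
 * left of a claimed batch is parked on ->retry and picked up first by
 * the next run.
 */
static int tw_run_capped(struct tw_lists *l, int cap, int *out)
{
	int done = 0;

	assert(cap > 0);
	while (done < cap) {
		struct tw_node *n = l->retry;

		if (!n) {
			/* Claim the whole shared list, then fix the order. */
			n = tw_reverse(l->shared);
			l->shared = NULL;
			if (!n)
				break;
		}
		l->retry = n->next;
		out[done++] = n->id;
	}
	return done;
}
```

In this model, pushing five entries and running with a cap of two leaves
three entries on ->retry; a later run drains those before it claims
anything newer from ->shared.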

Switch tctx->task_list to a mpscq, as was done for the DEFER_TASKRUN
paths.

Signed-off-by: Jens Axboe

io_uring: switch local task_work to a mpscq

2026-06-13T12:27:06+00:00

The local (DEFER_TASKRUN) task_work list is an llist, which is LIFO
ordered, and hence __io_run_local_work() has to restore the right
running order with an O(n) llist_reverse_order() pass first. On top of
that, a batch that gets capped by max_events needs the leftover entries
parked on a separate ->retry_llist, as they can't be pushed back to the
shared list.
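
For illustration, the LIFO behaviour and the reversal pass can be modelled
in user space with C11 atomics. This is a sketch under assumptions, not
the kernel's llist implementation: struct lnode, struct lstack and the
lstack_*() helpers are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical entry type for this sketch. */
struct lnode {
	struct lnode *next;
	int id;
};

struct lstack {
	_Atomic(struct lnode *) first;
};

/* Add: compare-and-swap retry loop; the newest entry becomes the head. */
static void lstack_add(struct lstack *s, struct lnode *n)
{
	struct lnode *old = atomic_load_explicit(&s->first,
						 memory_order_relaxed);

	assert(n != NULL);
	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak_explicit(&s->first, &old, n,
							memory_order_release,
							memory_order_relaxed));
}

/* Consumer: detach the whole chain with one exchange (newest first). */
static struct lnode *lstack_del_all(struct lstack *s)
{
	return atomic_exchange_explicit(&s->first, (struct lnode *)NULL,
					memory_order_acquire);
}

/* O(n) pass needed before entries can run in the order they were added. */
static struct lnode *lstack_reverse(struct lnode *head)
{
	struct lnode *prev = NULL;

	while (head) {
		struct lnode *next = head->next;

		head->next = prev;
		prev = head;
		head = next;
	}
	return prev;
}
```

Adding entries 1, 2, 3 and detaching the chain yields 3, 2, 1; only after
lstack_reverse() does the walk visit 1, 2, 3.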

Switch it to the FIFO mpscq. Adds are wait-free instead of a cmpxchg
retry loop, entries are popped in queue order with no reversal pass,
capping a run simply leaves the remainder on the queue, and
->retry_llist goes away entirely. The consumer cursor, ->work_head,
lives with the rest of the ->uring_lock protected state rather than
next to the queue, so that popping entries doesn't dirty the producer
side cacheline.
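
The properties listed above can be illustrated with a well-known intrusive
multi-producer, single-consumer queue design (often attributed to Dmitry
Vyukov). The commit message does not show the kernel's mpscq internals, so
treat this as a stand-in under assumptions: struct mnode, struct mpscq and
the mpscq_*() helpers are hypothetical. For brevity the consumer cursor
sits in the same struct as the producer side, and popping the last entry
re-adds a stub node, which touches the producer side; the patch keeps
->work_head elsewhere precisely so that pops avoid that.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical entry type; 'next' is atomic because producers publish it. */
struct mnode {
	_Atomic(struct mnode *) next;
	int id;
};

struct mpscq {
	_Atomic(struct mnode *) tail;	/* producer side: most recent add */
	struct mnode *head;		/* consumer cursor, single consumer */
	struct mnode stub;		/* placeholder, chain is never empty */
};

static void mpscq_init(struct mpscq *q)
{
	atomic_init(&q->stub.next, (struct mnode *)NULL);
	atomic_init(&q->tail, &q->stub);
	q->head = &q->stub;
}

/* Add: one exchange plus one store, with no retry loop. */
static void mpscq_add(struct mpscq *q, struct mnode *n)
{
	struct mnode *prev;

	atomic_store_explicit(&n->next, (struct mnode *)NULL,
			      memory_order_relaxed);
	prev = atomic_exchange_explicit(&q->tail, n, memory_order_acq_rel);
	atomic_store_explicit(&prev->next, n, memory_order_release);
}

/* Pop the oldest entry, or NULL if empty or an add is still in flight. */
static struct mnode *mpscq_pop(struct mpscq *q)
{
	struct mnode *head = q->head;
	struct mnode *next = atomic_load_explicit(&head->next,
						  memory_order_acquire);

	if (head == &q->stub) {
		if (!next)
			return NULL;
		q->head = head = next;
		next = atomic_load_explicit(&head->next, memory_order_acquire);
	}
	if (next) {
		q->head = next;
		return head;
	}
	/* 'head' looks like the last entry; bail if a producer is mid-add. */
	if (head != atomic_load_explicit(&q->tail, memory_order_acquire))
		return NULL;
	/* Re-add the stub so 'head' gets a successor and can be handed out. */
	mpscq_add(q, &q->stub);
	next = atomic_load_explicit(&head->next, memory_order_acquire);
	if (!next)
		return NULL;
	q->head = next;
	return head;
}

/* Capped run: stop after 'cap' entries; the rest simply stay queued. */
static int mpscq_run(struct mpscq *q, int cap, int *out)
{
	int done = 0;
	struct mnode *n;

	assert(cap > 0);
	while (done < cap && (n = mpscq_pop(q)) != NULL)
		out[done++] = n->id;
	return done;
}
```

With five entries queued, mpscq_run() with a cap of two returns the two
oldest and leaves the other three in place, in order, with no retry list
and no reversal pass.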

For low amounts of task_work, this ends up being a bit more efficient
than the existing scheme. As an example of that, doing multishot
receives for 8 clients has the following task_work overhead:

     1.02%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     0.88%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work_loop
     0.60%  sock-test  [kernel.kallsyms]  [k] llist_reverse_order
     0.14%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     2.64% at ~46Gb/sec

and after this change:

     1.08%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     1.03%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     2.11% at ~53Gb/sec

which has less overhead even though that test run was faster. For a case
of having 1024 clients on a single ring:

     2.22%  sock-test  [kernel.kallsyms]  [k] llist_reverse_order
     0.84%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work_loop
     0.42%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     0.02%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     3.50% at ~24Gb/sec

we start to see the llist reversing taking a considerable amount of
time, and the total add+run task_work overhead is around 3.5%. After
the change:

     0.90%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     0.42%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     1.32% at ~26Gb/sec

most of that overhead is gone, and performance is better as well.

Caleb Sander Mateos reports that it improves the performance of a ublk
4kb workload by 4% [1], while testing v1 of this patchset.

[1] https://lore.kernel.org/io-uring/CADUfDZr-MMYBaP-e+y9+xuRhuiunO2sBTUCmwZyd7AgT8sVtiQ@mail.gmail.com/

Signed-off-by: Jens Axboe

io_uring: grab RCU read lock marking task run

2026-06-13T12:26:53+00:00

Not required right now, as io_req_local_work_add() already calls this
helper with the RCU read lock held. But in preparation for that not
being the case, grab it locally.

Reviewed-by: Caleb Sander Mateos 
Signed-off-by: Jens Axboe

io_uring/tw: serialize ctx->retry_llist with ->uring_lock

2026-04-30T12:57:20+00:00

The DEFER_TASKRUN local task work paths all run under ctx->uring_lock,
which serializes them with each other and with the rest of the ring's
hot paths. io_move_task_work_from_local() is the exception - it's called
from io_ring_exit_work() on a kworker without holding the lock and from
the iopoll cancelation side right after dropping it.

->work_llist is fine with this, as it's only ever updated via the
expected paths. But the ->retry_llist is updated while running, and hence
it could potentially race between normal task_work running and the
task-has-exited shutdown path.

Simply grab ->uring_lock while moving the local work to the fallback
list for exit purposes, which nicely serializes it across both the
normal additions and the exit prune path.

Cc: stable@vger.kernel.org
Fixes: f46b9cdb22f7 ("io_uring: limit local tw done")
Reported-by: Robert Femmer 
Reported-by: Christian Reitter 
Reported-by: Michael Rodler 
Signed-off-by: Jens Axboe

io_uring: mark known and harmless racy ctx->int_flags uses

2026-03-16T21:33:10+00:00

There are a few of these, where flags are read outside of the
uring_lock, yet it's harmless to race on them.

Signed-off-by: Jens Axboe

io_uring: switch struct io_ring_ctx internal bitfields to flags

2026-03-16T21:32:59+00:00

Bitfields cannot be set and checked atomically, and switching to flags
makes it clearer that these are indeed in shared storage and must be
checked and set in a sane fashion. This is in preparation for annotating
a few of the known racy, but harmless, flag checks.

No intended functional changes in this patch.
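
As a rough user-space illustration of the difference, a flags word can be
updated and tested with single atomic operations, while adjacent
bitfields share storage and are written with a compiler-generated
read-modify-write. This is a sketch under assumptions: CTX_FLAG_A,
CTX_FLAG_B, struct ctx_flags and the ctx_*() helpers are hypothetical,
C11 atomics stand in for the kernel's own primitives, and the commit
message does not say which accessors the patch actually uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag names; the real ones are not given in the message. */
enum {
	CTX_FLAG_A = 1u << 0,
	CTX_FLAG_B = 1u << 1,
};

/* Before: two bitfields packed into the same storage unit. */
struct ctx_bitfields {
	unsigned int a : 1;
	unsigned int b : 1;
};

/* After: one explicit flags word (type assumed for this sketch). */
struct ctx_flags {
	atomic_uint int_flags;
};

/* Set bits with a single atomic OR. */
static void ctx_set(struct ctx_flags *c, unsigned int f)
{
	assert(f != 0);
	atomic_fetch_or_explicit(&c->int_flags, f, memory_order_relaxed);
}

/* Clear bits with a single atomic AND. */
static void ctx_clear(struct ctx_flags *c, unsigned int f)
{
	assert(f != 0);
	atomic_fetch_and_explicit(&c->int_flags, ~f, memory_order_relaxed);
}

/* Test bits with one atomic load of the whole word. */
static bool ctx_test(struct ctx_flags *c, unsigned int f)
{
	return (atomic_load_explicit(&c->int_flags, memory_order_relaxed) & f) != 0;
}
```

With the flags word, every access is an explicit operation on a named
piece of shared storage, which is also what makes individual racy reads
easy to find and annotate later.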

Reviewed-by: Gabriel Krisman Bertazi 
Signed-off-by: Jens Axboe