path: root/io_uring/tw.c
Age | Commit message | Author
13 days | io_uring: Use system_dfl_wq instead of system_unbound_wq | Nathan Chancellor
Commit de7341ffe49e ("io_uring: switch normal task_work to a mpscq") added a use of system_unbound_wq, which is deprecated in favor of system_dfl_wq added by commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq"). An upcoming warning in the workqueue tree flags this with:

  workqueue: work func io_tctx_fallback_work enqueued on deprecated workqueue. Use system_{percpu|dfl}_wq instead.

Switch to system_dfl_wq to clear up the warning.

Fixes: de7341ffe49e ("io_uring: switch normal task_work to a mpscq")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/20260616-io_uring-fix-wq-warning-v1-1-cfc9d934eedb@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
13 days | io_uring: get rid of tw_pending for !DEFER task work | Jens Axboe
The normal task_work path used a tw_pending bit to ensure the callback was only added once: the mpscq drains incrementally, so a single tctx_task_work() run can take the queue through empty -> non-empty several times, and each transition would otherwise re-add the already pending callback_head. This corrupts the task_work list, and is what tw_pending protects against.

This can go away if we stop running the task_work as soon as the queue empties.

Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-13 | io_uring: remove the per-ctx fallback task_work machinery | Jens Axboe
With the tctx fallback running its entries directly, the per-ctx fallback work has a single user left: moving local (DEFER_TASKRUN) task_work entries out of a ring that is going away. Both of its call sites are process context and don't hold ->uring_lock, the same conditions the deferred fallback work itself ran under - so run the entries in cancel mode right there instead, and rename the helper to io_cancel_local_task_work() to match what it now does.

With that, ->fallback_llist, ->fallback_work, io_fallback_req_func() and __io_fallback_tw() can all go away, along with the fallback work flushing in the ring exit and cancel paths. Requests that get orphaned by an exiting task now run via the tctx fallback work, which the ring exit side implicitly waits on through the ctx refs those requests hold.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-13 | io_uring: run the tctx task_work fallback directly | Jens Axboe
The fallback work drains the tctx queue only to redistribute the entries into the per-ctx fallback lists, bouncing them through a second (per-ctx) work item before they finally run. That made sense when the producer side did the draining and could be in any context, but the fallback work is a regular process context kworker: it can just run the entries itself.

Reuse the normal run loop - if run from the fallback kernel thread, ts.cancel will get set, and the work terminated.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-13 | io_uring: switch normal task_work to a mpscq | Jens Axboe
Like the local task_work list, the normal (tctx) task_work list is an llist, and hence needs the O(n) llist_reverse_order() pass before running entries in queue order. On top of that, capped runs - sqpoll processing IORING_TW_CAP_ENTRIES_VALUE entries at a time - need the claimed-but-unprocessed leftovers carried in a separate retry_list, as they can't be pushed back to the shared list.

Switch tctx->task_list to a mpscq, like what was done for the DEFER_TASKRUN paths as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
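The cost described above can be sketched in userspace C11. This is an illustration only: `struct node`, `lifo_add()`, `lifo_del_all()`, `reverse_order()` and `run_capped()` are hypothetical stand-ins written for this example, not the kernel's llist API. The sketch shows the three pieces the message names: a cmpxchg retry loop on add, an O(n) reversal before running, and leftovers from a capped run that have to be parked somewhere other than the shared list.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's llist API. */
struct node {
    struct node *next;
    int seq;                        /* order in which the entry was added */
};

struct lifo_list {
    _Atomic(struct node *) first;   /* newest entry */
};

static void lifo_init(struct lifo_list *l)
{
    atomic_init(&l->first, (struct node *)NULL);
}

/* Producer: push onto the head with a compare-and-swap retry loop. */
static void lifo_add(struct lifo_list *l, struct node *n)
{
    struct node *old = atomic_load_explicit(&l->first, memory_order_relaxed);

    do {
        n->next = old;              /* 'old' is refreshed on each failed CAS */
    } while (!atomic_compare_exchange_weak_explicit(&l->first, &old, n,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}

/* Consumer: claim the whole chain in one exchange, newest entry first. */
static struct node *lifo_del_all(struct lifo_list *l)
{
    return atomic_exchange_explicit(&l->first, (struct node *)NULL,
                                    memory_order_acquire);
}

/* O(n) pass that flips the claimed chain back into queue order. */
static struct node *reverse_order(struct node *head)
{
    struct node *prev = NULL;

    while (head) {
        struct node *next = head->next;

        head->next = prev;
        prev = head;
        head = next;
    }
    return prev;
}

/*
 * Run at most 'cap' entries, recording their seq in 'out'. The return value
 * is the unprocessed remainder: it is already claimed, so the caller has to
 * keep it on a private "retry" list rather than push it back.
 */
static struct node *run_capped(struct node *head, int cap, int *out, int *nr)
{
    *nr = 0;
    while (head && *nr < cap) {
        out[(*nr)++] = head->seq;
        head = head->next;
    }
    return head;
}
```

Adding four entries and then calling `lifo_del_all()` hands back the newest entry first; only after `reverse_order()` does `run_capped()` see them in the order they were queued.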
2026-06-13 | io_uring: switch local task_work to a mpscq | Jens Axboe
The local (DEFER_TASKRUN) task_work list is an llist, which is LIFO ordered, and hence __io_run_local_work() has to restore the right running order with an O(n) llist_reverse_order() pass first. On top of that, a batch that gets capped by max_events needs the leftover entries parked on a separate ->retry_llist, as they can't be pushed back to the shared list.

Switch it to the FIFO mpscq. Adds are wait-free instead of a cmpxchg retry loop, entries are popped in queue order with no reversal pass, capping a run simply leaves the remainder on the queue, and ->retry_llist goes away entirely. The consumer cursor, ->work_head, lives with the rest of the ->uring_lock protected state rather than next to the queue, so that popping entries doesn't dirty the producer side cacheline.

For low amounts of task_work, this ends up being a bit more efficient than the existing scheme. As an example of that, doing multishot receives for 8 clients has the following task_work overhead:

  1.02%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
  0.88%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work_loop
  0.60%  sock-test  [kernel.kallsyms]  [k] llist_reverse_order
  0.14%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
  2.64%

at ~46Gb/sec and after this change:

  1.08%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
  1.03%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
  2.11%

at ~53Gb/sec which has less overhead even though that test run was faster. For a case of having 1024 clients on a single ring:

  2.22%  sock-test  [kernel.kallsyms]  [k] llist_reverse_order
  0.84%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work_loop
  0.42%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
  0.02%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
  3.50%

at ~24Gb/sec we start to see the llist reversing taking a considerable amount of time, and the total add+run task_work overhead is around 3.5%. After the change:

  0.90%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
  0.42%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
  1.32%

at ~26Gb/sec most of that overhead is gone, and performance is better as well.

Caleb Sander Mateos <csander@purestorage.com> reports that it improves the performance of a ublk 4kb workload by 4% [1], while testing v1 of this patchset.

[1] https://lore.kernel.org/io-uring/CADUfDZr-MMYBaP-e+y9+xuRhuiunO2sBTUCmwZyd7AgT8sVtiQ@mail.gmail.com/

Signed-off-by: Jens Axboe <axboe@kernel.dk>
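The properties listed in this message (adds without a retry loop, FIFO pops with no reversal, a capped run that leaves the remainder queued, and a consumer cursor kept apart from the producer side) can be illustrated with a well-known intrusive multi-producer single-consumer queue design built around a stub node. The kernel's mpscq implementation is not shown on this page, so the sketch below is an assumption about the general technique rather than a copy of it; every name (`tw_node`, `tw_queue`, `tw_cursor`, `tw_add()`, `tw_pop()`, `tw_run_capped()`) is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the kernel's mpscq API. */
struct tw_node {
    _Atomic(struct tw_node *) next;
    int seq;                          /* payload stand-in */
};

/* Producer side: shared by all adders. */
struct tw_queue {
    _Atomic(struct tw_node *) tail;   /* most recently added node */
    struct tw_node stub;              /* placeholder so the chain is never empty */
};

/* Consumer side: kept apart from the queue, used by the single consumer. */
struct tw_cursor {
    struct tw_node *head;
};

static void tw_init(struct tw_queue *q, struct tw_cursor *c)
{
    atomic_init(&q->stub.next, (struct tw_node *)NULL);
    q->stub.seq = -1;
    atomic_init(&q->tail, &q->stub);
    c->head = &q->stub;
}

/* Add: one exchange plus one store, no compare-and-swap retry loop. */
static void tw_add(struct tw_queue *q, struct tw_node *n)
{
    struct tw_node *prev;

    atomic_store_explicit(&n->next, (struct tw_node *)NULL, memory_order_relaxed);
    prev = atomic_exchange_explicit(&q->tail, n, memory_order_acq_rel);
    atomic_store_explicit(&prev->next, n, memory_order_release);
}

/* Pop the oldest entry, or NULL if the queue is empty or an add is in flight. */
static struct tw_node *tw_pop(struct tw_queue *q, struct tw_cursor *c)
{
    struct tw_node *head = c->head;
    struct tw_node *next = atomic_load_explicit(&head->next, memory_order_acquire);

    if (head == &q->stub) {           /* skip over the placeholder */
        if (!next)
            return NULL;
        c->head = head = next;
        next = atomic_load_explicit(&head->next, memory_order_acquire);
    }
    if (next) {
        c->head = next;
        return head;
    }
    if (head != atomic_load_explicit(&q->tail, memory_order_acquire))
        return NULL;                  /* a producer is mid-add; retry later */
    tw_add(q, &q->stub);              /* re-link the stub behind the last entry */
    next = atomic_load_explicit(&head->next, memory_order_acquire);
    if (next) {
        c->head = next;
        return head;
    }
    return NULL;
}

/* Capped run: whatever is not popped simply stays on the queue. */
static int tw_run_capped(struct tw_queue *q, struct tw_cursor *c, int cap, int *out)
{
    struct tw_node *n;
    int nr = 0;

    while (nr < cap && (n = tw_pop(q, c)) != NULL)
        out[nr++] = n->seq;
    return nr;
}
```

With five entries queued, a run capped at two returns the two oldest and a later run picks up the remaining three in order, with no separate retry list involved. In this sketch the stub lives inside the queue structure, so draining the last entry still writes to the producer side; how the kernel code handles that case is not stated in the message.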
2026-06-13 | io_uring: grab RCU read lock marking task run | Jens Axboe
Not required right now, as io_req_local_work_add() already calls this helper with the RCU read lock held. But in preparation for that not being the case, grab it locally.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-30 | io_uring/tw: serialize ctx->retry_llist with ->uring_lock | Jens Axboe
The DEFER_TASKRUN local task work paths all run under ctx->uring_lock, which serializes them with each other and with the rest of the ring's hot paths. io_move_task_work_from_local() is the exception - it's called from io_ring_exit_work() on a kworker without holding the lock and from the iopoll cancelation side right after dropping it.

->work_llist is fine with this, as it's only ever updated via the expected paths. But the ->retry_llist is updated while running, and hence it could potentially race between normal task_work running and the task-has-exited shutdown path.

Simply grab ->uring_lock while moving the local work to the fallback list for exit purposes, which nicely serializes it across both the normal additions and the exit prune path.

Cc: stable@vger.kernel.org
Fixes: f46b9cdb22f7 ("io_uring: limit local tw done")
Reported-by: Robert Femmer <robert.femmer@x41-dsec.de>
Reported-by: Christian Reitter <invd@inhq.net>
Reported-by: Michael Rodler <michael.rodler@x41-dsec.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-16 | io_uring: mark known and harmless racy ctx->int_flags uses | Jens Axboe
There are a few of these, where flags are read outside of the uring_lock, yet it's harmless to race on them.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-16 | io_uring: switch struct io_ring_ctx internal bitfields to flags | Jens Axboe
Bitfields cannot be set and checked atomically, and this makes it more clear that these are indeed in shared storage and must be checked and set in a sane fashion. This is in preparation for annotating a few of the known racy, but harmless, flags checking.

No intended functional changes in this patch.

Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
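The difference between bitfields and a flags word can be sketched in userspace C11. This is an illustration under assumptions: the flag names and helpers below are hypothetical, and the sketch uses C11 atomics as a stand-in, whereas the patch itself may rely on plain accesses under ->uring_lock plus annotations for the known racy reads. The point it shows is that adjacent bitfields share one storage unit and offer no way to address a single bit atomically, while a flags word can be updated and tested with a single well-defined operation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag names; not the kernel's actual int_flags bits. */
enum {
    CTX_F_A = 1u << 0,
    CTX_F_B = 1u << 1,
    CTX_F_C = 1u << 2,
};

/*
 * Bitfield layout: a, b and c share one storage unit. Writing any one of
 * them compiles to a read-modify-write of the whole unit, and the language
 * offers no way to make that update atomic.
 */
struct ctx_bitfields {
    unsigned int a : 1;
    unsigned int b : 1;
    unsigned int c : 1;
};

/* Flags layout: one word that is visibly shared state. */
struct ctx_flags {
    atomic_uint int_flags;
};

static void ctx_flags_init(struct ctx_flags *ctx)
{
    atomic_init(&ctx->int_flags, 0u);
}

/* Set bits without disturbing concurrent updates to other bits. */
static void ctx_set_flag(struct ctx_flags *ctx, unsigned int flag)
{
    atomic_fetch_or_explicit(&ctx->int_flags, flag, memory_order_relaxed);
}

/* Clear bits without disturbing concurrent updates to other bits. */
static void ctx_clear_flag(struct ctx_flags *ctx, unsigned int flag)
{
    atomic_fetch_and_explicit(&ctx->int_flags, ~flag, memory_order_relaxed);
}

/* A single load gives a consistent snapshot of the whole word. */
static bool ctx_test_flag(struct ctx_flags *ctx, unsigned int flag)
{
    return (atomic_load_explicit(&ctx->int_flags, memory_order_relaxed) & flag) != 0;
}
```

Setting `CTX_F_A` and `CTX_F_C`, then clearing `CTX_F_A`, leaves only `CTX_F_C` set, and each step is one operation on the shared word.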
2026-03-11 | io_uring: ensure ctx->rings is stable for task work flags manipulation | Jens Axboe
If DEFER_TASKRUN | SETUP_TASKRUN is used and task work is added while the ring is being resized, it's possible for the OR'ing of IORING_SQ_TASKRUN to happen in the small window of swapping into the new rings and the old rings being freed.

Prevent this by adding a 2nd ->rings pointer, ->rings_rcu, which is protected by RCU. The task work flags manipulation is inside RCU already, and if the resize ring freeing is done post an RCU synchronize, then there's no need to add locking to the fast path of task work additions.

Note: this is only done for DEFER_TASKRUN, as that's the only setup mode that supports ring resizing. If this ever changes, then they too need to use the io_ctx_mark_taskrun() helper.

Link: https://lore.kernel.org/io-uring/20260309062759.482210-1-naup96721@gmail.com/
Cc: stable@vger.kernel.org
Fixes: 79cfe9e59c2a ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS")
Reported-by: Hao-Yu Yang <naup96721@gmail.com>
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 | io_uring: split out CQ waiting code into wait.c | Jens Axboe
Move the completion queue waiting and scheduling code out of io_uring.c into a dedicated wait.c file. This further removes code out of the main io_uring C and header file, and into a topical new file.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 | io_uring: split out task work code into tw.c | Jens Axboe
Move the task work handling code out of io_uring.c into a new tw.c file. This includes the local work, normal work, and fallback work handling infrastructure.

The associated tw.h header contains io_should_terminate_tw() as a static inline helper, along with the necessary function declarations.

Signed-off-by: Jens Axboe <axboe@kernel.dk>