| Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring updates from Jens Axboe:
- Rework the task_work infrastructure.
Both the local (DEFER_TASKRUN) and the normal (tctx) task_work lists
were llist based, which is LIFO ordered, and hence each run had to do
an O(n) list reversal pass first to restore queue order.
Additionally, to cap the amount of task_work run, each method needed
a retry list as well.
Add a lockless MPCS FIFO queue (based on Dmitry Vyukov's intrusive
MPSC algorithm) and switch both task_work lists to it. It performs
better than llists and we can then also ditch the retry lists as well
as entries are popped one-at-the-time.
On top of those changes, run the tctx fallback task_work directly and
remove the now-unused per-ctx fallback machinery entirely.
- zcrx user notifications.
Add a mechanism for zcrx to communicate conditions back to userspace
via a dedicated CQE, with the initial users being notification on
running out of buffers and on a frag copy fallback, plus
shared-memory notification statistics.
Alongside that, a series of zcrx reliability and cleanup fixes: more
reliable scrubbing, poisoning pointers on unregistration, dropping an
extra ifq close, adding a ctx back-pointer, reordering fd allocation
in the export path, and killing a dead 'sock' member.
- Allow using io_uring registered buffers for plain SEND and RECV, not
just for the zero-copy send path.
This enables targets like ublk's NBD backend to push/pull IO data
directly to/from a registered buffer over a plain send/recv on a TCP
socket.
- Registered buffer improvements: account huge pages correctly, bump
the io_mapped_ubuf length field to size_t, and raise the previous 1GB
registered buffer size limit.
- Restrict the ctx access exposed to io_uring BPF struct_ops programs
by handing them an opaque type rather than the full io_ring_ctx, and
add a separate MAINTAINERS entry for the bpf-ops code.
- Allow opcode filtering on IORING_OP_CONNECT.
- Validate ring-provided buffer addresses with access_ok(), and align
the legacy buffer add limit with MAX_BIDS_PER_BGID.
- Various other cleanups and minor fixes, including avoiding msghdr
async data on connect/bind, dropping async_size for OP_LISTEN, making
the POLL_FIRST receive side checks consistent, re-checking
IO_WQ_BIT_EXIT for each linked work item, and using
trace_call__##name() at guarded tracepoint call sites.
* tag 'for-7.2/io_uring-20260615' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (31 commits)
io_uring/bpf-ops: add a separate maintainer entry
io_uring/net: make POLL_FIRST receive side checks consistent
io_uring: remove the per-ctx fallback task_work machinery
io_uring: run the tctx task_work fallback directly
io_uring: switch normal task_work to a mpscq
io_uring: switch local task_work to a mpscq
io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue
io_uring: grab RCU read lock marking task run
io_uring/zcrx: kill dead 'sock' member in struct io_zcrx_args
io_uring/kbuf: validate ring provided buffer addresses with access_ok()
io_uring/net: support registered buffer for plain send and recv
io_uring/nop: Drop a wrong comment in struct io_nop
io_uring/net: Remove async_size for OP_LISTEN
io_uring/net: Avoid msghdr on op_connect/op_bind async data
io_uring/bpf-ops: restrict ctx access to BPF
io_uring/io-wq: re-check IO_WQ_BIT_EXIT for each linked work item
io_uring/kbuf: align legacy buffer add limit with MAX_BIDS_PER_BGID
io_uring/zcrx: add shared-memory notification statistics
io_uring/zcrx: notify user on frag copy fallback
io_uring/zcrx: notify user when out of buffers
...
|
|
The local (DEFER_TASKRUN) task_work list is an llist, which is LIFO
ordered, and hence __io_run_local_work() has to restore the right
running order with an O(n) llist_reverse_order() pass first. On top of
that, a batch that gets capped by max_events needs the leftover entries
parked on a separate ->retry_llist, as they can't be pushed back to the
shared list.
Switch it to the FIFO mpscq. Adds are wait-free instead of a cmpxchg
retry loop, entries are popped in queue order with no reversal pass,
capping a run simply leaves the remainder on the queue, and
->retry_llist goes away entirely. The consumer cursor, ->work_head,
lives with the rest of the ->uring_lock protected state rather than
next to the queue, so that popping entries doesn't dirty the producer
side cacheline.
For low amounts of task_work, this ends up being a bit more efficient
than the existing scheme. As an example of that, doing multishot
receives for 8 clients has the following task_work overhead:
1.02% sock-test [kernel.kallsyms] [k] io_req_local_work_add
0.88% sock-test [kernel.kallsyms] [k] __io_run_local_work_loop
0.60% sock-test [kernel.kallsyms] [k] llist_reverse_order
0.14% sock-test [kernel.kallsyms] [k] __io_run_local_work
2.64% at ~46Gb/sec
and after this change:
1.08% sock-test [kernel.kallsyms] [k] io_req_local_work_add
1.03% sock-test [kernel.kallsyms] [k] __io_run_local_work
2.11% at ~53Gb/sec
which has less overhead even though that test run was faster. For a case
of having 1024 clients on a single ring:
2.22% sock-test [kernel.kallsyms] [k] llist_reverse_order
0.84% sock-test [kernel.kallsyms] [k] __io_run_local_work_loop
0.42% sock-test [kernel.kallsyms] [k] io_req_local_work_add
0.02% sock-test [kernel.kallsyms] [k] __io_run_local_work
3.50% at ~24Gb/sec
we start to see the llist reversing taking a considerable amount of
time, and the total add+run task_work overhead is around 3.5%. After
the change:
0.90% sock-test [kernel.kallsyms] [k] __io_run_local_work
0.42% sock-test [kernel.kallsyms] [k] io_req_local_work_add
1.32% at ~26Gb/sec
most of that overhead is gone, and performance is better as well.
Caleb Sander Mateos <csander@purestorage.com> reports that it improves
the performance of a ublk 4kb workload by 4% [1], while testing v1 of
this patchset.
[1] https://lore.kernel.org/io-uring/CADUfDZr-MMYBaP-e+y9+xuRhuiunO2sBTUCmwZyd7AgT8sVtiQ@mail.gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The wakeup condition if a min timeout is present and has expired is that
at least _one_ CQE was posted. Thus set the cq_tail target to
->cq_min_tail + 1. Without this commit a spurious wakeup can result in a
premature wakeup because io_should_wake() will return true even if _no_
CQE was posted at all.
Cc: Tip ten Brink <tip@tenbrinkmeijs.com>
Fixes: e15cb2200b93 ("io_uring: fix min_wait wakeups for SQPOLL")
Cc: stable@vger.kernel.org
Signed-off-by: Christian A. Ehrhardt <lk@c--e.de>
Link: https://patch.msgid.link/20260606201120.1441447-1-lk@c--e.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_uring_enter() with IORING_ENTER_ABS_TIMER takes an absolute
timespec from the caller via ext_arg->ts. It arms an ABS mode
hrtimer in __io_cqring_wait_schedule(). The conversion path in
io_uring/wait.c parses ext_arg->ts inline rather than going
through io_parse_user_time(). It therefore does not pick up the
time namespace conversion added by the previous patch.
Apply timens_ktime_to_host() to the parsed time on the
IORING_ENTER_ABS_TIMER branch. This mirrors the IORING_TIMEOUT_ABS
fix in io_parse_user_time(). Use ctx->clockid as the clock id.
ctx->clockid is set either at ring creation or via
IORING_REGISTER_CLOCK.
timens_ktime_to_host() is a no-op for clocks not affected by time
namespaces. It is also a no-op for callers in the initial time
namespace. The fast path is unchanged.
Reproducer: in unshare --user --time, with a -10s monotonic
offset, call io_uring_enter with min_complete=1,
IORING_ENTER_ABS_TIMER, and ts = now + 1s. The call returns
-ETIME after <1ms instead of after the expected ~1s.
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg>
Link: https://patch.msgid.link/20260504153755.1293932-3-maoyi.xie@ntu.edu.sg
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 96189080265e addressed one case of ctx->rings being potentially
accessed while a resize is happening on the ring, but there are still
a few others that need handling. Add a helper for retrieving the
rings associated with an io_uring context, and add some sanity checking
to that to catch bad uses. ->rings_rcu is always valid, as long as it's
used within RCU read lock. Any use of ->rings_rcu or ->rings inside
either ->uring_lock or ->completion_lock is sane as well.
Do the minimum fix for the current kernel, but set it up such that this
basic infra can be extended for later kernels to make this harder to
mess up in the future.
Thanks to Junxi Qian for finding and debugging this issue.
Cc: stable@vger.kernel.org
Fixes: 79cfe9e59c2a ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS")
Reviewed-by: Junxi Qian <qjx1298677004@gmail.com>
Tested-by: Junxi Qian <qjx1298677004@gmail.com>
Link: https://lore.kernel.org/io-uring/20260330172348.89416-1-qjx1298677004@gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Move the completion queue waiting and scheduling code out of io_uring.c
into a dedicated wait.c file. This further removes code out of the
main io_uring C and header file, and into a topical new file.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|