linux-stable.git/io_uring, branch v7.0.11

io_uring/nop: pass all errors to userspace

2026-06-01T15:54:54+00:00

[ Upstream commit e97ff8b62d4690c69297f0f6de874f0564cc01a4 ]

This fixes an inconsistency where io_nop() called req_set_fail()
based on ret, but passed just nop->result to userspace.
Initially, ret is a plain copy of nop->result, but it is overwritten with
an error code if a later step fails. That error is now passed to
userspace as well.
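
The inconsistency can be illustrated with a small userspace model. This is only a sketch: struct nop_model, nop_complete() and the later_err parameter are hypothetical names, not the kernel's code. It shows the failure flag and the completion value both being derived from the same ret.

```c
#include <errno.h>

/* Hypothetical stand-in for the request state touched by io_nop(). */
struct nop_model {
    int result;   /* value requested by userspace (models nop->result) */
    int failed;   /* models the effect of req_set_fail() */
    int cqe_res;  /* value that userspace sees in the completion */
};

/* later_err models a failure in a subsequent step; 0 means no failure. */
static void nop_complete(struct nop_model *n, int later_err)
{
    int ret = n->result;     /* starts as a copy of the requested result */

    if (later_err < 0)
        ret = later_err;     /* a later step overrides it with an error */
    if (ret < 0)
        n->failed = 1;       /* the failure flag is derived from ret ... */
    n->cqe_res = ret;        /* ... and so is the value passed to userspace */
}
```

With the pre-fix behaviour the last assignment would have used n->result, so a request could be marked failed while still reporting the original result.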

Fixes: a85f31052bce ("io_uring/nop: add support for testing registered files and buffers")
Signed-off-by: Alexander A. Klimov 
Link: https://patch.msgid.link/20260520180045.538533-1-grandmaster@al2klimov.de
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

io_uring: propagate array_index_nospec opcode into req->opcode

2026-06-01T15:54:48+00:00

[ Upstream commit cf18e36455603d65d4745de83e2d1743c54ada47 ]

Commit 1e988c3fe126 ("io_uring: prevent opcode speculation") added
array_index_nospec() to io_init_req(), but applied it only to a local
opcode variable. req->opcode is initialized from sqe->opcode before the
bounds check and remains the raw value.

Keep req->opcode as the canonical opcode in io_init_req(): reject
out-of-range values architecturally, then write the array_index_nospec()
result back to req->opcode before any table lookup. This keeps downstream
users of req->opcode from observing the raw user byte on a mispredicted
path.

No functional change: array_index_nospec() is a no-op for opcodes in
[0, IORING_OP_LAST), and out-of-range opcodes are still rejected at the
bounds check above the assignment.
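
A userspace sketch of the idea follows. index_mask() mirrors the branch-free mask used by the generic array_index_nospec() fallback. OP_LAST, clamp_index() and init_opcode() are hypothetical names for illustration, and the real io_init_req() does considerably more.

```c
#include <errno.h>
#include <limits.h>

#define OP_LAST 64UL  /* hypothetical stand-in for IORING_OP_LAST */

/*
 * All-ones when index < size, zero otherwise, computed without a branch.
 * Relies on arithmetic right shift of a negative long, which gcc and
 * clang provide (the C standard leaves it implementation-defined).
 */
static unsigned long index_mask(unsigned long index, unsigned long size)
{
    return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
                           (sizeof(long) * CHAR_BIT - 1));
}

/* In-range indexes pass through unchanged; out-of-range ones become 0. */
static unsigned long clamp_index(unsigned long index, unsigned long size)
{
    return index & index_mask(index, size);
}

/* Reject architecturally first, then store the clamped value. */
static int init_opcode(unsigned char raw, unsigned char *req_opcode)
{
    if (raw >= OP_LAST)
        return -EINVAL;
    *req_opcode = (unsigned char)clamp_index(raw, OP_LAST);
    return 0;
}
```

Because the clamped value is what gets stored, any later reader of the stored opcode sees either the validated value or 0, even on a mispredicted path past the bounds check.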

Fixes: 1e988c3fe126 ("io_uring: prevent opcode speculation")
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito 
Link: https://patch.msgid.link/20260517213010.696135-1-michael.bommarito@gmail.com
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

io_uring/net: punt IORING_OP_BIND async if it needs file create

2026-06-01T15:54:45+00:00

[ Upstream commit ccd25890f73c082fe2657ed227b497d6ac5fdc40 ]

For two reasons:

1) An opcode cannot block inside io_uring_enter() doing submissions, as
   it'll stall the submission side pipeline.

2) Ending up in sb_start_write() -> __sb_start_write() ->
   percpu_down_read_freezable() introduces a new lockdep edge, which it
   correctly complains about.

Check if the socket address family is AF_UNIX and the address has a
non-empty pathname. If it does, mark the request REQ_F_FORCE_ASYNC to punt
the submission to io-wq rather than attempt to do it inline.
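
The check can be sketched in userspace terms roughly as below. bind_needs_file_create() is a hypothetical helper, not the function the patch adds. The sketch assumes that a "non-empty pathname" means at least one path byte is present and the first one is not NUL, since abstract socket names start with a NUL byte.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Returns true when binding this address would create a filesystem node. */
static bool bind_needs_file_create(const struct sockaddr *addr, socklen_t len)
{
    const struct sockaddr_un *sun = (const struct sockaddr_un *)addr;

    /* Only AF_UNIX addresses can name a path. */
    if (len < sizeof(sa_family_t) || addr->sa_family != AF_UNIX)
        return false;
    /* No path bytes at all: unnamed socket, nothing to create. */
    if (len <= offsetof(struct sockaddr_un, sun_path))
        return false;
    /* A leading NUL marks an abstract name, which has no file. */
    return sun->sun_path[0] != '\0';
}
```

A caller modelling the prep path would set its force-async marker whenever this returns true, so that the potentially blocking file creation runs from a worker rather than inline.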

Fixes: 7481fd93fa0a ("io_uring: Introduce IORING_OP_BIND")
Reviewed-by: Gabriel Krisman Bertazi 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

io_uring/waitid: clear waitid info before copying it to userspace

2026-06-01T15:54:21+00:00

commit 93d93f5f8da791e98159795c6ef683f45bd95d13 upstream.

IORING_OP_WAITID stores its result fields in struct io_waitid::info and
later copies them to userspace siginfo. The prep path initializes the
request arguments, but it does not initialize info itself.

If the wait operation completes without reporting a child event, the common
wait code can return without writing wo_info. In that case io_waitid_finish()
still copies iw->info to userspace, exposing stale bytes from the reused
io_kiocb command storage.

Clear the result storage during prep so the io_uring path matches the
regular waitid syscall, which uses a zero-initialized struct waitid_info.
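
As a rough model of the fix, the sketch below zeroes the result storage in a prep step. The struct layouts and waitid_prep() are hypothetical and simplified; only the memset() of the info member reflects what the commit describes.

```c
#include <string.h>

/* Simplified, hypothetical stand-in for struct waitid_info. */
struct waitid_info_model {
    int pid;
    unsigned int uid;
    int status;
    int cause;
};

/* Hypothetical stand-in for the per-request command storage. */
struct waitid_req_model {
    int which;
    int upid;
    int options;
    struct waitid_info_model info;
};

static void waitid_prep(struct waitid_req_model *iw, int which, int upid,
                        int options)
{
    iw->which = which;
    iw->upid = upid;
    iw->options = options;
    /* Reused storage may hold stale bytes; clear the result fields. */
    memset(&iw->info, 0, sizeof(iw->info));
}
```

If the wait later returns without filling in info, the copy to userspace then carries zeroes instead of leftovers from an earlier request.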

Fixes: f31ecf671ddc ("io_uring: add IORING_OP_WAITID support")
Cc: stable@vger.kernel.org # 6.7+
Signed-off-by: Heechan Kang 
Link: https://patch.msgid.link/20260516184709.852814-1-gganji11@naver.com
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

io-wq: check that the predecessor is hashed in io_wq_remove_pending()

2026-05-23T11:09:40+00:00

commit d6a2d7b04b5a093021a7a0e2e69e9d5237dfa8cc upstream.

io_wq_remove_pending() needs to fix up wq->hash_tail[] if the cancelled
work was the tail of its hash bucket. When doing this, it checks whether
the preceding entry in acct->work_list has the same hash value, but
never checks that the predecessor is hashed at all. io_get_work_hash()
is simply atomic_read(&work->flags) >> IO_WQ_HASH_SHIFT, and the hash
bits are never set for non-hashed work, so it returns 0. Thus, when a
hashed bucket-0 work is cancelled while a non-hashed work is its list
predecessor, the check spuriously passes and a pointer to the non-hashed
io_kiocb is stored in wq->hash_tail[0].

Because non-hashed work is dequeued via the fast path in
io_get_next_work(), which never touches hash_tail[], the stale pointer
is never cleared. Therefore, after the non-hashed io_kiocb completes and
is freed back to req_cachep, wq->hash_tail[0] is a dangling pointer. The
io_wq is per-task (tctx->io_wq) and survives ring open/close, so the
dangling pointer persists for the lifetime of the task; the next hashed
bucket-0 enqueue dereferences it in io_wq_insert_work() and
wq_list_add_after() writes through freed memory.

Add the missing io_wq_is_hashed() check so a non-hashed predecessor
never inherits a hash_tail[] slot.
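
The spurious match can be reproduced with a small model of the flag layout. The constants WORK_HASHED and HASH_SHIFT are assumed values chosen for illustration, and the two predicate functions are hypothetical names for the old and fixed comparisons.

```c
#include <stdbool.h>

#define WORK_HASHED 2u   /* assumed value of the "hashed" flag bit */
#define HASH_SHIFT  24   /* assumed position of the hash bits */

/* Models io_get_work_hash(): non-hashed work has no hash bits, so 0. */
static unsigned int work_hash(unsigned int flags)
{
    return flags >> HASH_SHIFT;
}

/* Models io_wq_is_hashed(). */
static bool work_is_hashed(unsigned int flags)
{
    return (flags & WORK_HASHED) != 0;
}

/* Old check: compares hash values only. */
static bool prev_inherits_tail_old(unsigned int prev_flags, unsigned int hash)
{
    return work_hash(prev_flags) == hash;
}

/* Fixed check: the predecessor must be hashed before comparing. */
static bool prev_inherits_tail_fixed(unsigned int prev_flags, unsigned int hash)
{
    return work_is_hashed(prev_flags) && work_hash(prev_flags) == hash;
}
```

With prev_flags equal to 0 (non-hashed) and a cancelled work in bucket 0, the old predicate is true while the fixed one is false, which is exactly the case the commit describes.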

Cc: stable@vger.kernel.org
Fixes: 204361a77f40 ("io-wq: fix hang after cancelling pending hashed work")
Signed-off-by: Nicholas Carlini 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

io_uring/napi: cap busy_poll_to 10 msec

2026-05-23T11:09:32+00:00

[ Upstream commit df8599ee18c0e5fe343ffe0b4c379636b8bb839a ]

Currently there's no cap on the maximum amount of time that napi is
allowed to poll if no events are found, which can lead to kernel
complaints on a task being stuck as there's no conditional rescheduling
done within that loop.

Cap it to 10 msec in total. That is already far above any sane value that
would yield a benefit, yet low enough that it is nowhere near being able
to trigger preemption complaints.
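
The cap amounts to a simple clamp. The sketch below assumes the timeout is expressed in microseconds. BUSY_POLL_CAP_USEC and cap_busy_poll_to() are hypothetical names, and where exactly the kernel applies the limit is not stated in this message.

```c
/* 10 msec expressed in microseconds (the unit is an assumption here). */
#define BUSY_POLL_CAP_USEC 10000u

/* Clamp a requested busy poll timeout to the cap. */
static unsigned int cap_busy_poll_to(unsigned int usec)
{
    return usec > BUSY_POLL_CAP_USEC ? BUSY_POLL_CAP_USEC : usec;
}
```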

Fixes: 8d0c12a80cde ("io-uring: add napi busy poll support")
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

io_uring/zcrx: warn on freelist violations

2026-05-17T15:16:33+00:00

commit 770594e78c3964cf23cf5287f849437cdde9b7d0 upstream.

The freelist is appropriately sized to always be able to take a free
niov, but let's be more defensive and check the invariant with a
warning. That should help to catch any double-free issues.

Suggested-by: Kai Aizen 
Signed-off-by: Pavel Begunkov 
Link: https://patch.msgid.link/2f3cea363b04649755e3b6bb9ab66485a95936d5.1776760901.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe 
Signed-off-by: Harshit Mogalapalli 
Signed-off-by: Greg Kroah-Hartman

io_uring/zcrx: use guards for locking

2026-05-17T15:16:33+00:00

commit 898ad80d1207cbdb22b21bafb6de4adfd7627bd0 upstream.

Convert the last several places that use manual locking to guards to
simplify the code.

Signed-off-by: Pavel Begunkov 
Link: https://patch.msgid.link/eb4667cfaf88c559700f6399da9e434889f5b04a.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe 
Signed-off-by: Harshit Mogalapalli 
Signed-off-by: Greg Kroah-Hartman

io_uring/tw: serialize ctx->retry_llist with ->uring_lock

2026-05-14T13:31:06+00:00

commit 17666e2d7592c3e85260cafd3950121524acc2c5 upstream.

The DEFER_TASKRUN local task work paths all run under ctx->uring_lock,
which serializes them with each other and with the rest of the ring's
hot paths. io_move_task_work_from_local() is the exception - it's called
from io_ring_exit_work() on a kworker without holding the lock and from
the iopoll cancelation side right after dropping it.

->work_llist is fine with this, as it's only ever updated via the
expected paths. But the ->retry_llist is updated while running, and hence
it could potentially race between normal task_work running and the
task-has-exited shutdown path.

Simply grab ->uring_lock while moving the local work to the fallback
list for exit purposes, which nicely serializes it across both the
normal additions and the exit prune path.

Cc: stable@vger.kernel.org
Fixes: f46b9cdb22f7 ("io_uring: limit local tw done")
Reported-by: Robert Femmer 
Reported-by: Christian Reitter 
Reported-by: Michael Rodler 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

io_uring/kbuf: support min length left for incremental buffers

2026-05-14T13:31:06+00:00

commit 7deba791ad495ce1d7921683f4f7d1190fa210d1 upstream.

Incrementally consumed buffer rings are generally fully consumed, but
it's quite possible that the application has a minimum size it needs to
meet to avoid truncation. Currently that minimum limit is 1 byte, but
this should be a setting that is in the hands of the application. For
recvmsg multishot, a prime use case for incrementally consumed buffers,
the application may get spurious -EFAULT returned at the end of an
incrementally consumed buffer, as less space is available than the
headers need.

Grab a u32 field in struct io_uring_buf_reg, which the application can
use to inform the kernel of the minimum size that should be available
in an incrementally consumed buffer. If less than that is available,
the current buffer is fully processed and the next one will be picked.
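
The selection rule can be modelled as follows. struct inc_buf_model and inc_buf_commit() are hypothetical names. Treating a min_left of 0 as the previous 1-byte minimum is an assumption made for this sketch, not something the message states.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one incrementally consumed buffer. */
struct inc_buf_model {
    uint32_t len;   /* bytes still available in the current buffer */
};

/*
 * Account for `used` bytes and report whether the buffer is done, i.e.
 * whether fewer than `min_left` bytes remain and the next buffer should
 * be picked.
 */
static bool inc_buf_commit(struct inc_buf_model *b, uint32_t used,
                           uint32_t min_left)
{
    if (used > b->len)
        used = b->len;       /* never consume more than is available */
    b->len -= used;
    if (min_left == 0)
        min_left = 1;        /* assumed default: the old 1-byte minimum */
    return b->len < min_left;
}
```

For a 4096-byte buffer with 4000 bytes consumed, the buffer stays current under the 1-byte minimum but is retired when the application asks for at least 128 bytes to remain.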

Cc: stable@vger.kernel.org
Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption")
Link: https://github.com/axboe/liburing/issues/1433
Signed-off-by: Martin Michaelis 
[axboe: write commit message, change io_buffer_list member name]
Reviewed-by: Gabriel Krisman Bertazi 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman