<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/io_uring, branch v7.1-rc3</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>Merge tag 'io_uring-7.1-20260508' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux</title>
<updated>2026-05-08T20:12:48+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-05-08T20:12:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=8be01e1280912a84f6bcf963ceed6c9f13ba1986'/>
<id>8be01e1280912a84f6bcf963ceed6c9f13ba1986</id>
<content type='text'>
Pull io_uring fixes from Jens Axboe:

 - Ensure that the absolute timeouts for both the command side and the
   waiting side honor the caller's time namespace

 - Ensure tracked NAPI entries are cleared at unregistration time, as
   the NAPI polling loop checks the list state rather than the general
   NAPI state. This can lead to NAPI polling even after unregistration
   has been done. If unregistered, all NAPI polling should be disabled

 - Fix for eventfd recursive invocation handling

* tag 'io_uring-7.1-20260508' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/wait: honour caller's time namespace for IORING_ENTER_ABS_TIMER
  io_uring/timeout: honour caller's time namespace for IORING_TIMEOUT_ABS
  io_uring/eventfd: reset deferred signal state
  io_uring/napi: clear tracked NAPI entries on unregister
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull io_uring fixes from Jens Axboe:

 - Ensure that the absolute timeouts for both the command side and the
   waiting side honor the caller's time namespace

 - Ensure tracked NAPI entries are cleared at unregistration time, as
   the NAPI polling loop checks the list state rather than the general
   NAPI state. This can lead to NAPI polling even after unregistration
   has been done. If unregistered, all NAPI polling should be disabled

 - Fix for eventfd recursive invocation handling

* tag 'io_uring-7.1-20260508' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/wait: honour caller's time namespace for IORING_ENTER_ABS_TIMER
  io_uring/timeout: honour caller's time namespace for IORING_TIMEOUT_ABS
  io_uring/eventfd: reset deferred signal state
  io_uring/napi: clear tracked NAPI entries on unregister
</pre>
</div>
</content>
</entry>
<entry>
<title>io_uring/wait: honour caller's time namespace for IORING_ENTER_ABS_TIMER</title>
<updated>2026-05-06T10:58:56+00:00</updated>
<author>
<name>Maoyi Xie</name>
<email>maoyixie.tju@gmail.com</email>
</author>
<published>2026-05-04T15:37:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=45d2b37a37ab98484693533496395c610a2cab96'/>
<id>45d2b37a37ab98484693533496395c610a2cab96</id>
<content type='text'>
io_uring_enter() with IORING_ENTER_ABS_TIMER takes an absolute
timespec from the caller via ext_arg-&gt;ts. It arms an ABS mode
hrtimer in __io_cqring_wait_schedule(). The conversion path in
io_uring/wait.c parses ext_arg-&gt;ts inline rather than going
through io_parse_user_time(). It therefore does not pick up the
time namespace conversion added by the previous patch.

Apply timens_ktime_to_host() to the parsed time on the
IORING_ENTER_ABS_TIMER branch. This mirrors the IORING_TIMEOUT_ABS
fix in io_parse_user_time(). Use ctx-&gt;clockid as the clock id.
ctx-&gt;clockid is set either at ring creation or via
IORING_REGISTER_CLOCK.

timens_ktime_to_host() is a no-op for clocks not affected by time
namespaces. It is also a no-op for callers in the initial time
namespace. The fast path is unchanged.

Reproducer: in unshare --user --time, with a -10s monotonic
offset, call io_uring_enter with min_complete=1,
IORING_ENTER_ABS_TIMER, and ts = now + 1s. The call returns
-ETIME after &lt;1ms instead of after the expected ~1s.

Suggested-by: Pavel Begunkov &lt;asml.silence@gmail.com&gt;
Suggested-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Maoyi Xie &lt;maoyi.xie@ntu.edu.sg&gt;
Link: https://patch.msgid.link/20260504153755.1293932-3-maoyi.xie@ntu.edu.sg
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
io_uring_enter() with IORING_ENTER_ABS_TIMER takes an absolute
timespec from the caller via ext_arg-&gt;ts. It arms an ABS mode
hrtimer in __io_cqring_wait_schedule(). The conversion path in
io_uring/wait.c parses ext_arg-&gt;ts inline rather than going
through io_parse_user_time(). It therefore does not pick up the
time namespace conversion added by the previous patch.

Apply timens_ktime_to_host() to the parsed time on the
IORING_ENTER_ABS_TIMER branch. This mirrors the IORING_TIMEOUT_ABS
fix in io_parse_user_time(). Use ctx-&gt;clockid as the clock id.
ctx-&gt;clockid is set either at ring creation or via
IORING_REGISTER_CLOCK.

timens_ktime_to_host() is a no-op for clocks not affected by time
namespaces. It is also a no-op for callers in the initial time
namespace. The fast path is unchanged.

Reproducer: in unshare --user --time, with a -10s monotonic
offset, call io_uring_enter with min_complete=1,
IORING_ENTER_ABS_TIMER, and ts = now + 1s. The call returns
-ETIME after &lt;1ms instead of after the expected ~1s.

Suggested-by: Pavel Begunkov &lt;asml.silence@gmail.com&gt;
Suggested-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Maoyi Xie &lt;maoyi.xie@ntu.edu.sg&gt;
Link: https://patch.msgid.link/20260504153755.1293932-3-maoyi.xie@ntu.edu.sg
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>io_uring/timeout: honour caller's time namespace for IORING_TIMEOUT_ABS</title>
<updated>2026-05-06T10:58:56+00:00</updated>
<author>
<name>Maoyi Xie</name>
<email>maoyixie.tju@gmail.com</email>
</author>
<published>2026-05-04T15:37:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=9cc6bac1bebf8310d2950d1411a91479e86d69a1'/>
<id>9cc6bac1bebf8310d2950d1411a91479e86d69a1</id>
<content type='text'>
io_uring's IORING_OP_TIMEOUT and IORING_OP_LINK_TIMEOUT accept a
timespec from the caller via io_parse_user_time(). With
IORING_TIMEOUT_ABS, the timestamp is an absolute deadline on the
selected clock. The clock is CLOCK_MONOTONIC by default.
CLOCK_BOOTTIME and CLOCK_REALTIME are also selectable.

A submitter inside a CLONE_NEWTIME time namespace observes
CLOCK_MONOTONIC and CLOCK_BOOTTIME shifted by the namespace's
offsets relative to the host. Every other ABS timer interface in
the kernel converts the caller's absolute time to host view via
timens_ktime_to_host() before arming an hrtimer:

  kernel/time/posix-timers.c    -- timer_settime(TIMER_ABSTIME)
  kernel/time/posix-stubs.c     -- clock_nanosleep(TIMER_ABSTIME)
  kernel/time/alarmtimer.c      -- alarm_timer_nsleep(TIMER_ABSTIME)
  fs/timerfd.c                  -- timerfd_settime(TFD_TIMER_ABSTIME)

io_parse_user_time() does not. As a result, an absolute timeout
submitted from within a time namespace is interpreted in host
view. That is generally a different point in time. It may already
be in the past, causing the timer to fire immediately, or far in
the future, causing the timer not to fire when expected.

Reproducer: in unshare --user --time, with a -10s monotonic
offset, submit IORING_OP_TIMEOUT with IORING_TIMEOUT_ABS and
deadline = now + 1s. The CQE is delivered after &lt;1ms instead of
the expected ~1s.
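
The reproducer arithmetic can be sketched in plain Python (an
illustrative model, not the kernel code; the sign convention follows
time namespaces, where the namespace clock reads host time plus the
per-namespace offset):

```python
NSEC = 1_000_000_000

# Namespace monotonic offset from the reproducer: -10s, i.e. the
# namespace clock reads 10s behind the host clock.
offset_ns = -10 * NSEC

def timens_ktime_to_host(deadline_ns, offset_ns):
    # Model of the conversion: namespace view = host view + offset,
    # so host view = namespace view - offset.
    return deadline_ns - offset_ns

host_now = 1000 * NSEC
ns_now = host_now + offset_ns     # what the submitter's clock shows

deadline = ns_now + 1 * NSEC      # "now + 1s" in the namespace

# Without the fix the deadline is armed as-is, 9s in the host's past,
# so the timer fires immediately:
assert deadline - host_now == -9 * NSEC

# With the fix it lands 1s in the host's future, as intended:
assert timens_ktime_to_host(deadline, offset_ns) - host_now == 1 * NSEC
```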

Apply timens_ktime_to_host() to the parsed time when
IORING_TIMEOUT_ABS is set. Split the existing clock id resolver
in io_timeout_get_clock() into a flags only helper
io_flags_to_clock(), so io_parse_user_time() can resolve the
clock without a struct io_timeout_data.

timens_ktime_to_host() is a no-op for clocks not affected by time
namespaces, e.g. CLOCK_REALTIME. It is also a no-op for callers
in the initial time namespace. The fast path is unchanged.

SQPOLL is also covered. The SQPOLL kernel thread is created via
create_io_thread() with CLONE_THREAD and no CLONE_NEW* flag.
copy_namespaces() therefore shares the submitter's nsproxy by
reference. Inside the SQPOLL kthread, current-&gt;nsproxy-&gt;time_ns
is the submitter's time_ns. timens_ktime_to_host() resolves
correctly.

Suggested-by: Pavel Begunkov &lt;asml.silence@gmail.com&gt;
Suggested-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Maoyi Xie &lt;maoyi.xie@ntu.edu.sg&gt;
Link: https://patch.msgid.link/20260504153755.1293932-2-maoyi.xie@ntu.edu.sg
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
io_uring's IORING_OP_TIMEOUT and IORING_OP_LINK_TIMEOUT accept a
timespec from the caller via io_parse_user_time(). With
IORING_TIMEOUT_ABS, the timestamp is an absolute deadline on the
selected clock. The clock is CLOCK_MONOTONIC by default.
CLOCK_BOOTTIME and CLOCK_REALTIME are also selectable.

A submitter inside a CLONE_NEWTIME time namespace observes
CLOCK_MONOTONIC and CLOCK_BOOTTIME shifted by the namespace's
offsets relative to the host. Every other ABS timer interface in
the kernel converts the caller's absolute time to host view via
timens_ktime_to_host() before arming an hrtimer:

  kernel/time/posix-timers.c    -- timer_settime(TIMER_ABSTIME)
  kernel/time/posix-stubs.c     -- clock_nanosleep(TIMER_ABSTIME)
  kernel/time/alarmtimer.c      -- alarm_timer_nsleep(TIMER_ABSTIME)
  fs/timerfd.c                  -- timerfd_settime(TFD_TIMER_ABSTIME)

io_parse_user_time() does not. As a result, an absolute timeout
submitted from within a time namespace is interpreted in host
view. That is generally a different point in time. It may already
be in the past, causing the timer to fire immediately, or far in
the future, causing the timer not to fire when expected.

Reproducer: in unshare --user --time, with a -10s monotonic
offset, submit IORING_OP_TIMEOUT with IORING_TIMEOUT_ABS and
deadline = now + 1s. The CQE is delivered after &lt;1ms instead of
the expected ~1s.

Apply timens_ktime_to_host() to the parsed time when
IORING_TIMEOUT_ABS is set. Split the existing clock id resolver
in io_timeout_get_clock() into a flags only helper
io_flags_to_clock(), so io_parse_user_time() can resolve the
clock without a struct io_timeout_data.

timens_ktime_to_host() is a no-op for clocks not affected by time
namespaces, e.g. CLOCK_REALTIME. It is also a no-op for callers
in the initial time namespace. The fast path is unchanged.

SQPOLL is also covered. The SQPOLL kernel thread is created via
create_io_thread() with CLONE_THREAD and no CLONE_NEW* flag.
copy_namespaces() therefore shares the submitter's nsproxy by
reference. Inside the SQPOLL kthread, current-&gt;nsproxy-&gt;time_ns
is the submitter's time_ns. timens_ktime_to_host() resolves
correctly.

Suggested-by: Pavel Begunkov &lt;asml.silence@gmail.com&gt;
Suggested-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Maoyi Xie &lt;maoyi.xie@ntu.edu.sg&gt;
Link: https://patch.msgid.link/20260504153755.1293932-2-maoyi.xie@ntu.edu.sg
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>io_uring/eventfd: reset deferred signal state</title>
<updated>2026-05-04T05:21:40+00:00</updated>
<author>
<name>Yufan Chen</name>
<email>ericterminal@gmail.com</email>
</author>
<published>2026-05-03T17:57:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=04fe9aeb4f3c0999e6715385664c677469dfd8f4'/>
<id>04fe9aeb4f3c0999e6715385664c677469dfd8f4</id>
<content type='text'>
Recursive eventfd wakeups must defer io_uring eventfd signaling because
eventfd_signal_mask() rejects reentry from eventfd wakeup handlers. The
io_ev_fd ops bit tracks an outstanding deferred signal so that the same
rcu_head is not queued twice.

That bit is only set today. Once the first deferred callback runs, later
recursive notifications still see the bit set and skip queueing another
deferred signal. This can leave new completions without a matching
eventfd wake after the first recursive deferral.

Clear the pending bit before issuing the deferred signal. If the wakeup
path recurses while the callback runs, a new signal can be queued for
the next RCU grace period while the current callback keeps its reference
until it returns.
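
The ordering can be modeled with a simple flag (an illustrative Python
toy, not the kernel code; the names are made up, with "pending"
standing in for the io_ev_fd ops bit):

```python
class EvFd:
    """Toy model of the deferred eventfd signal bookkeeping."""
    def __init__(self):
        self.pending = False   # stands in for the io_ev_fd ops bit
        self.signals = 0

    def notify_recursive(self):
        # A recursive wakeup may queue only one outstanding deferral.
        if not self.pending:
            self.pending = True
            return True        # deferred callback queued
        return False           # already queued, skipped

    def deferred_callback(self):
        # The fix: clear the bit before signaling, so a wakeup that
        # recurses while this runs can queue a fresh deferral.
        self.pending = False
        self.signals += 1

ev = EvFd()
assert ev.notify_recursive()        # first deferral queued
assert not ev.notify_recursive()    # duplicate correctly suppressed
ev.deferred_callback()
assert ev.notify_recursive()        # with the bit cleared, a new
                                    # deferral can be queued again
```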

Signed-off-by: Yufan Chen &lt;ericterminal@gmail.com&gt;
Fixes: 60b6c075e8eb ("io_uring/eventfd: move to more idiomatic RCU free usage")
Link: https://patch.msgid.link/20260503175710.37209-1-yufan.chen@linux.dev
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Recursive eventfd wakeups must defer io_uring eventfd signaling because
eventfd_signal_mask() rejects reentry from eventfd wakeup handlers. The
io_ev_fd ops bit tracks an outstanding deferred signal so that the same
rcu_head is not queued twice.

That bit is only set today. Once the first deferred callback runs, later
recursive notifications still see the bit set and skip queueing another
deferred signal. This can leave new completions without a matching
eventfd wake after the first recursive deferral.

Clear the pending bit before issuing the deferred signal. If the wakeup
path recurses while the callback runs, a new signal can be queued for
the next RCU grace period while the current callback keeps its reference
until it returns.

Signed-off-by: Yufan Chen &lt;ericterminal@gmail.com&gt;
Fixes: 60b6c075e8eb ("io_uring/eventfd: move to more idiomatic RCU free usage")
Link: https://patch.msgid.link/20260503175710.37209-1-yufan.chen@linux.dev
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>io_uring/napi: clear tracked NAPI entries on unregister</title>
<updated>2026-05-04T05:21:23+00:00</updated>
<author>
<name>Yufan Chen</name>
<email>ericterminal@gmail.com</email>
</author>
<published>2026-05-03T17:56:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=b8c2e9e27636b92dc96c12f16894cbc60c58a306'/>
<id>b8c2e9e27636b92dc96c12f16894cbc60c58a306</id>
<content type='text'>
IORING_UNREGISTER_NAPI disables NAPI busy polling, but it currently
leaves any previously tracked NAPI IDs on the ring context. The normal
wait path only checks whether the list is empty before entering the busy
poll helper, so an unregistered ring can still observe stale entries and
run an unexpected busy poll pass.

Make unregister switch the context to inactive and free the tracked
entries. Do the same inactive transition while changing the tracking
strategy, and recheck the expected tracking mode under napi_lock before
inserting a newly learned NAPI ID. This prevents a racing poll path from
repopulating the list after unregister or reconfiguration.

Also make the busy poll dispatcher ignore inactive mode explicitly.
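
As a sketch of the dispatcher check (illustrative Python, not the
kernel code; the mode strings are made up):

```python
def should_busy_poll(tracking_mode, tracked_ids):
    # After the fix the dispatcher checks the tracking mode first:
    # an unregistered (inactive) ring never busy polls, even if stale
    # entries were still on the tracked list.
    if tracking_mode == "inactive":
        return False
    return bool(tracked_ids)

assert not should_busy_poll("inactive", [41, 42])  # stale entries ignored
assert should_busy_poll("dynamic", [41])
assert not should_busy_poll("dynamic", [])
```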

Signed-off-by: Yufan Chen &lt;ericterminal@gmail.com&gt;
Fixes: 6bf90bd8c58a ("io_uring/napi: add static napi tracking strategy")
Link: https://patch.msgid.link/20260503175610.35521-1-yufan.chen@linux.dev
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
IORING_UNREGISTER_NAPI disables NAPI busy polling, but it currently
leaves any previously tracked NAPI IDs on the ring context. The normal
wait path only checks whether the list is empty before entering the busy
poll helper, so an unregistered ring can still observe stale entries and
run an unexpected busy poll pass.

Make unregister switch the context to inactive and free the tracked
entries. Do the same inactive transition while changing the tracking
strategy, and recheck the expected tracking mode under napi_lock before
inserting a newly learned NAPI ID. This prevents a racing poll path from
repopulating the list after unregister or reconfiguration.

Also make the busy poll dispatcher ignore inactive mode explicitly.

Signed-off-by: Yufan Chen &lt;ericterminal@gmail.com&gt;
Fixes: 6bf90bd8c58a ("io_uring/napi: add static napi tracking strategy")
Link: https://patch.msgid.link/20260503175610.35521-1-yufan.chen@linux.dev
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge tag 'io_uring-7.1-20260430' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux</title>
<updated>2026-05-01T18:01:31+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-05-01T18:01:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=9d88bb929aa1193f9c0b9595a0c46930d6699647'/>
<id>9d88bb929aa1193f9c0b9595a0c46930d6699647</id>
<content type='text'>
Pull io_uring fixes from Jens Axboe:

 - Remove dead struct io_buffer_list member

 - Fix for incrementally consumed buffers with recvmsg multishot, which
   requires a minimum amount of space left in a buffer for any receive,
   to hold the headers. If there's still a bit of buffer left but it's
   smaller than that minimum, then userspace will see a spurious -EFAULT
   returned in the CQE

 - Locking fix for the DEFER_TASKRUN retry list, which otherwise could
   race with fallback cancelations. If the task is exiting with
   task_work left in both the normal and retry list AND the exit cleanup
   races with the task running task work, then entries could either be
   doubly completed or lost

 - Cap NAPI busy poll timeout to something sane, to avoid syzbot running
   into excessive polling and triggering warnings around that

* tag 'io_uring-7.1-20260430' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/tw: serialize ctx-&gt;retry_llist with -&gt;uring_lock
  io_uring/napi: cap busy_poll_to 10 msec
  io_uring/kbuf: support min length left for incremental buffers
  io_uring/kbuf: kill dead struct io_buffer_list 'nr_entries' member
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull io_uring fixes from Jens Axboe:

 - Remove dead struct io_buffer_list member

 - Fix for incrementally consumed buffers with recvmsg multishot, which
   requires a minimum amount of space left in a buffer for any receive,
   to hold the headers. If there's still a bit of buffer left but it's
   smaller than that minimum, then userspace will see a spurious -EFAULT
   returned in the CQE

 - Locking fix for the DEFER_TASKRUN retry list, which otherwise could
   race with fallback cancelations. If the task is exiting with
   task_work left in both the normal and retry list AND the exit cleanup
   races with the task running task work, then entries could either be
   doubly completed or lost

 - Cap NAPI busy poll timeout to something sane, to avoid syzbot running
   into excessive polling and triggering warnings around that

* tag 'io_uring-7.1-20260430' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/tw: serialize ctx-&gt;retry_llist with -&gt;uring_lock
  io_uring/napi: cap busy_poll_to 10 msec
  io_uring/kbuf: support min length left for incremental buffers
  io_uring/kbuf: kill dead struct io_buffer_list 'nr_entries' member
</pre>
</div>
</content>
</entry>
<entry>
<title>io_uring/tw: serialize ctx-&gt;retry_llist with -&gt;uring_lock</title>
<updated>2026-04-30T12:57:20+00:00</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2026-04-27T20:29:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=17666e2d7592c3e85260cafd3950121524acc2c5'/>
<id>17666e2d7592c3e85260cafd3950121524acc2c5</id>
<content type='text'>
The DEFER_TASKRUN local task work paths all run under ctx-&gt;uring_lock,
which serializes them with each other and with the rest of the ring's
hot paths. io_move_task_work_from_local() is the exception - it's called
from io_ring_exit_work() on a kworker without holding the lock and from
the iopoll cancelation side right after dropping it.

-&gt;work_llist is fine with this, as it's only ever updated via the
expected paths. But the -&gt;retry_llist is updated while running, and hence
it could potentially race between normal task_work running and the
task-has-exited shutdown path.

Simply grab -&gt;uring_lock while moving the local work to the fallback
list for exit purposes, which nicely serializes it across both the
normal additions and the exit prune path.
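
The shape of the fix, as a hedged Python sketch (a threading.Lock
stands in for the ring's uring_lock; names and data are illustrative):

```python
import threading

uring_lock = threading.Lock()   # stands in for the ring's uring_lock
retry_list = [1, 2, 3]          # stands in for the retry llist

def move_task_work_from_local(fallback):
    # The fix: take the lock while draining the retry list, so the
    # exit-time move cannot race with a task still running task_work
    # and entries can neither be doubly completed nor lost.
    with uring_lock:
        while retry_list:
            fallback.append(retry_list.pop())

fallback = []
move_task_work_from_local(fallback)
assert retry_list == []
assert sorted(fallback) == [1, 2, 3]
```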

Cc: stable@vger.kernel.org
Fixes: f46b9cdb22f7 ("io_uring: limit local tw done")
Reported-by: Robert Femmer &lt;robert.femmer@x41-dsec.de&gt;
Reported-by: Christian Reitter &lt;invd@inhq.net&gt;
Reported-by: Michael Rodler &lt;michael.rodler@x41-dsec.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The DEFER_TASKRUN local task work paths all run under ctx-&gt;uring_lock,
which serializes them with each other and with the rest of the ring's
hot paths. io_move_task_work_from_local() is the exception - it's called
from io_ring_exit_work() on a kworker without holding the lock and from
the iopoll cancelation side right after dropping it.

-&gt;work_llist is fine with this, as it's only ever updated via the
expected paths. But the -&gt;retry_llist is updated while running, and hence
it could potentially race between normal task_work running and the
task-has-exited shutdown path.

Simply grab -&gt;uring_lock while moving the local work to the fallback
list for exit purposes, which nicely serializes it across both the
normal additions and the exit prune path.

Cc: stable@vger.kernel.org
Fixes: f46b9cdb22f7 ("io_uring: limit local tw done")
Reported-by: Robert Femmer &lt;robert.femmer@x41-dsec.de&gt;
Reported-by: Christian Reitter &lt;invd@inhq.net&gt;
Reported-by: Michael Rodler &lt;michael.rodler@x41-dsec.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: add net_iov_init() and use it to initialize -&gt;page_type</title>
<updated>2026-04-29T23:40:08+00:00</updated>
<author>
<name>Jakub Kicinski</name>
<email>kuba@kernel.org</email>
</author>
<published>2026-04-28T02:53:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=735a309b4bfb9e1e26636ff4a3e8a146f53c54f9'/>
<id>735a309b4bfb9e1e26636ff4a3e8a146f53c54f9</id>
<content type='text'>
Commit db359fccf212 ("mm: introduce a new page type for page pool in
page type") added a page_type field to struct net_iov at the same
offset as struct page::page_type, so that page_pool_set_pp_info() can
call __SetPageNetpp() uniformly on both pages and net_iovs.

The page-type API requires the field to hold the UINT_MAX "no type"
sentinel before a type can be set; for real struct page that invariant
is established by the page allocator on free. struct net_iov is not
allocated through the page allocator, so the field is left as zero
(io_uring zcrx, which uses __GFP_ZERO) or as slab garbage (devmem,
which uses kvmalloc_objs() without zeroing). When the page pool then
calls page_pool_set_pp_info() on a freshly-bound niov,
__SetPageNetpp()'s VM_BUG_ON_PAGE(page-&gt;page_type != UINT_MAX) fires
and the kernel BUGs. Triggered in selftests by io_uring zcrx setup
through the fbnic queue restart path:

 kernel BUG at ./include/linux/page-flags.h:1062!
 RIP: 0010:page_pool_set_pp_info (./include/linux/page-flags.h:1062
                                  net/core/page_pool.c:716)
 Call Trace:
  &lt;TASK&gt;
  net_mp_niov_set_page_pool (net/core/page_pool.c:1360)
  io_pp_zc_alloc_netmems (io_uring/zcrx.c:1089 io_uring/zcrx.c:1110)
  fbnic_fill_bdq (./include/net/page_pool/helpers.h:160
                  drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:906)
  __fbnic_nv_restart (drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2470
                      drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2874)
  fbnic_queue_start (drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2903)
  netdev_rx_queue_reconfig (net/core/netdev_rx_queue.c:137)
  __netif_mp_open_rxq (net/core/netdev_rx_queue.c:234)
  io_register_zcrx (io_uring/zcrx.c:818 io_uring/zcrx.c:903)
  __io_uring_register (io_uring/register.c:931)
  __do_sys_io_uring_register (io_uring/register.c:1029)
  do_syscall_64 (arch/x86/entry/syscall_64.c:63
                 arch/x86/entry/syscall_64.c:94)
  &lt;/TASK&gt;

The same path is reachable through devmem dmabuf binding via
netdev_nl_bind_rx_doit() -&gt; net_devmem_bind_dmabuf_to_queue().

Add a net_iov_init() helper that stamps -&gt;owner, -&gt;type and the
-&gt;page_type sentinel, and use it from both the devmem and io_uring
zcrx niov init loops.
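
The invariant can be sketched as follows (illustrative Python; the
UINT_MAX sentinel mirrors the page-type API, everything else is a
made-up stand-in for the kernel structures):

```python
UINT_MAX = 0xFFFFFFFF   # the "no type" sentinel the page-type API expects

class NetIov:
    def __init__(self):
        self.owner = None
        self.type = None
        self.page_type = 0xDEAD  # pre-fix: zero or slab garbage

def net_iov_init(niov, owner, niov_type):
    # Stamp owner, type, and the page_type sentinel, as the helper does.
    niov.owner = owner
    niov.type = niov_type
    niov.page_type = UINT_MAX

def set_page_netpp(niov):
    # Mirrors the VM_BUG_ON: a type may only be set over the sentinel.
    assert niov.page_type == UINT_MAX, "page already typed"
    niov.page_type = 1  # some concrete page type

niov = NetIov()
net_iov_init(niov, owner="binding", niov_type="dmabuf")
set_page_netpp(niov)   # no longer trips the sentinel check
```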

Fixes: db359fccf212 ("mm: introduce a new page type for page pool in page type")
Acked-by: Vlastimil Babka (SUSE) &lt;vbabka@kernel.org&gt;
Acked-by: Byungchul Park &lt;byungchul@sk.com&gt;
Reviewed-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Acked-by: Pavel Begunkov &lt;asml.silence@gmail.com&gt;
Link: https://patch.msgid.link/20260428025320.853452-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Commit db359fccf212 ("mm: introduce a new page type for page pool in
page type") added a page_type field to struct net_iov at the same
offset as struct page::page_type, so that page_pool_set_pp_info() can
call __SetPageNetpp() uniformly on both pages and net_iovs.

The page-type API requires the field to hold the UINT_MAX "no type"
sentinel before a type can be set; for real struct page that invariant
is established by the page allocator on free. struct net_iov is not
allocated through the page allocator, so the field is left as zero
(io_uring zcrx, which uses __GFP_ZERO) or as slab garbage (devmem,
which uses kvmalloc_objs() without zeroing). When the page pool then
calls page_pool_set_pp_info() on a freshly-bound niov,
__SetPageNetpp()'s VM_BUG_ON_PAGE(page-&gt;page_type != UINT_MAX) fires
and the kernel BUGs. Triggered in selftests by io_uring zcrx setup
through the fbnic queue restart path:

 kernel BUG at ./include/linux/page-flags.h:1062!
 RIP: 0010:page_pool_set_pp_info (./include/linux/page-flags.h:1062
                                  net/core/page_pool.c:716)
 Call Trace:
  &lt;TASK&gt;
  net_mp_niov_set_page_pool (net/core/page_pool.c:1360)
  io_pp_zc_alloc_netmems (io_uring/zcrx.c:1089 io_uring/zcrx.c:1110)
  fbnic_fill_bdq (./include/net/page_pool/helpers.h:160
                  drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:906)
  __fbnic_nv_restart (drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2470
                      drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2874)
  fbnic_queue_start (drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2903)
  netdev_rx_queue_reconfig (net/core/netdev_rx_queue.c:137)
  __netif_mp_open_rxq (net/core/netdev_rx_queue.c:234)
  io_register_zcrx (io_uring/zcrx.c:818 io_uring/zcrx.c:903)
  __io_uring_register (io_uring/register.c:931)
  __do_sys_io_uring_register (io_uring/register.c:1029)
  do_syscall_64 (arch/x86/entry/syscall_64.c:63
                 arch/x86/entry/syscall_64.c:94)
  &lt;/TASK&gt;

The same path is reachable through devmem dmabuf binding via
netdev_nl_bind_rx_doit() -&gt; net_devmem_bind_dmabuf_to_queue().

Add a net_iov_init() helper that stamps -&gt;owner, -&gt;type and the
-&gt;page_type sentinel, and use it from both the devmem and io_uring
zcrx niov init loops.

Fixes: db359fccf212 ("mm: introduce a new page type for page pool in page type")
Acked-by: Vlastimil Babka (SUSE) &lt;vbabka@kernel.org&gt;
Acked-by: Byungchul Park &lt;byungchul@sk.com&gt;
Reviewed-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Acked-by: Pavel Begunkov &lt;asml.silence@gmail.com&gt;
Link: https://patch.msgid.link/20260428025320.853452-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>io_uring/napi: cap busy_poll_to 10 msec</title>
<updated>2026-04-28T22:09:02+00:00</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2026-04-27T20:42:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=df8599ee18c0e5fe343ffe0b4c379636b8bb839a'/>
<id>df8599ee18c0e5fe343ffe0b4c379636b8bb839a</id>
<content type='text'>
Currently there's no cap on the maximum amount of time that napi is
allowed to poll if no events are found, which can lead to kernel
complaints on a task being stuck as there's no conditional rescheduling
done within that loop.

Just cap it to 10 msec in total, that's already way above any kind of
sane value that will reap any benefits, yet low enough that it's
nowhere near being able to trigger preemption complaints.
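
The cap itself is a one-line clamp (illustrative Python; the 10 msec
constant is from this commit, the function name is made up):

```python
NAPI_BUSY_POLL_CAP_USEC = 10_000  # 10 msec total busy-poll budget

def clamp_busy_poll_to(requested_usec):
    # Cap the busy-poll budget so an event-free loop cannot run long
    # enough to trigger stuck-task or preemption warnings.
    return min(requested_usec, NAPI_BUSY_POLL_CAP_USEC)

assert clamp_busy_poll_to(500) == 500
assert clamp_busy_poll_to(1_000_000) == NAPI_BUSY_POLL_CAP_USEC
```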

Fixes: 8d0c12a80cde ("io-uring: add napi busy poll support")
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Currently there's no cap on the maximum amount of time that napi is
allowed to poll if no events are found, which can lead to kernel
complaints on a task being stuck as there's no conditional rescheduling
done within that loop.

Just cap it to 10 msec in total, that's already way above any kind of
sane value that will reap any benefits, yet low enough that it's
nowhere near being able to trigger preemption complaints.

Fixes: 8d0c12a80cde ("io-uring: add napi busy poll support")
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>io_uring/kbuf: support min length left for incremental buffers</title>
<updated>2026-04-28T22:08:56+00:00</updated>
<author>
<name>Martin Michaelis</name>
<email>code@mgjm.de</email>
</author>
<published>2026-04-23T21:54:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=7deba791ad495ce1d7921683f4f7d1190fa210d1'/>
<id>7deba791ad495ce1d7921683f4f7d1190fa210d1</id>
<content type='text'>
Incrementally consumed buffer rings are generally fully consumed, but
it's quite possible that the application has a minimum size it needs to
meet to avoid truncation. Currently that minimum limit is 1 byte, but
this should be a setting that is in the hands of the application. For
recvmsg multishot, a prime use case for incrementally consumed buffers,
the application may get spurious -EFAULT returned at the end of an
incrementally consumed buffer, as less space is available than the
headers need.

Grab a u32 field in struct io_uring_buf_reg, which the application can
use to inform the kernel of the minimum size that should be available
in an incrementally consumed buffer. If less than that is available,
the current buffer is fully processed and the next one will be picked.
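
A sketch of the resulting selection (illustrative Python, not the
kernel code; buffer_usable spells out "at least min_len bytes remain"
via min()):

```python
def buffer_usable(remaining, min_len):
    # True when at least min_len bytes are left in the current
    # incrementally consumed buffer: min(a, b) == b exactly when b
    # does not exceed a.
    return min(remaining, min_len) == min_len

def pick_buffer(remaining, min_len):
    # Too little room left: retire the current buffer and advance,
    # instead of handing userspace a spurious -EFAULT.
    if buffer_usable(remaining, min_len):
        return "use_current"
    return "advance_to_next"

assert pick_buffer(4096, 64) == "use_current"
assert pick_buffer(16, 64) == "advance_to_next"
```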

Cc: stable@vger.kernel.org
Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption")
Link: https://github.com/axboe/liburing/issues/1433
Signed-off-by: Martin Michaelis &lt;code@mgjm.de&gt;
[axboe: write commit message, change io_buffer_list member name]
Reviewed-by: Gabriel Krisman Bertazi &lt;krisman@suse.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Incrementally consumed buffer rings are generally fully consumed, but
it's quite possible that the application has a minimum size it needs to
meet to avoid truncation. Currently that minimum limit is 1 byte, but
this should be a setting that is in the hands of the application. For
recvmsg multishot, a prime use case for incrementally consumed buffers,
the application may get spurious -EFAULT returned at the end of an
incrementally consumed buffer, as less space is available than the
headers need.

Grab a u32 field in struct io_uring_buf_reg, which the application can
use to inform the kernel of the minimum size that should be available
in an incrementally consumed buffer. If less than that is available,
the current buffer is fully processed and the next one will be picked.

Cc: stable@vger.kernel.org
Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption")
Link: https://github.com/axboe/liburing/issues/1433
Signed-off-by: Martin Michaelis &lt;code@mgjm.de&gt;
[axboe: write commit message, change io_buffer_list member name]
Reviewed-by: Gabriel Krisman Bertazi &lt;krisman@suse.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
</feed>
