linux-stable.git/io_uring/io_uring.c, branch v6.17

io_uring/msg_ring: kill alloc_cache for io_kiocb allocations

2025-09-18T19:59:15+00:00

A recent commit:

fc582cd26e88 ("io_uring/msg_ring: ensure io_kiocb freeing is deferred for RCU")

fixed an issue with not deferring freeing of io_kiocb structs that
msg_ring allocates to after the current RCU grace period. But this only
covers requests that don't end up in the allocation cache. If a request
goes into the alloc cache, it can get reused before it is sane to do so.
A recent syzbot report would seem to indicate that there's something
there, however it may very well just be because of the KASAN poisoning
that the alloc_cache handles manually.

Rather than attempt to make the alloc_cache sane for that use case, just
drop the usage of the alloc_cache for msg_ring request payload data.

Fixes: 50cf5f3842af ("io_uring/msg_ring: add an alloc cache for io_kiocb entries")
Link: https://lore.kernel.org/io-uring/68cc2687.050a0220.139b6.0005.GAE@google.com/
Reported-by: syzbot+baa2e0f4e02df602583e@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe

io_uring: include dying ring in task_work "should cancel" state

2025-09-18T16:24:50+00:00

When running task_work for an exiting task, rather than perform the
issue retry attempt, the task_work is canceled. However, this isn't
done for a ring that has been closed. This can lead to requests being
successfully completed post the ring being closed, which is somewhat
confusing and surprising to an application.

Rather than just check the task exit state, also include the ring
ref state in deciding whether or not to terminate a given request when
run from task_work.

Cc: stable@vger.kernel.org # 6.1+
Link: https://github.com/axboe/liburing/discussions/1459
Reported-by: Benedek Thaler 
Signed-off-by: Jens Axboe

io_uring: clear ->async_data as part of normal init

2025-08-21T19:54:01+00:00

Opcode handlers like POLL_ADD will use ->async_data as the pointer for
double poll handling, which is a bit different than the usual case
where it's strictly gated by the REQ_F_ASYNC_DATA flag. Be a bit more
proactive in handling ->async_data, and clear it to NULL as part of
regular init. Init is touching that cacheline anyway, so might as well
clear it.

Signed-off-by: Jens Axboe

Merge tag 'for-6.17/io_uring-20250728' of git://git.kernel.dk/linux

2025-07-28T23:30:12+00:00

Pull io_uring updates from Jens Axboe:

 - Optimization to avoid reference counts on non-cloned registered
   buffers. This is how these buffers were handled prior to having
   cloning support, and we can still use that approach as long as the
   buffers haven't been cloned to another ring.

 - Cleanup and improvement for uring_cmd, where btrfs was the only user
   of storing allocated data for the lifetime of the uring_cmd. Clean
   that up so we can get rid of the need to do that.

 - Avoid unnecessary memory copies in uring_cmd usage. This is
   particularly important as a lot of uring_cmd usage necessitates the
   use of 128b SQEs.

 - A few updates for recv multishot, where it's now possible to add
   fairness limits for limiting how much is transferred for each retry
   loop. Additionally, recv multishot now supports an overall cap as
   well, where once reached the multishot recv will terminate. The
   latter is useful for buffer management and juggling many recv streams
   at the same time.

 - Add support for returning the TX timestamps via a new socket command.
   This feature can work in either singleshot or multishot mode, where
   the latter triggers a completion whenever new timestamps are
   available. This is an alternative to using the existing error queue.

 - Add support for an io_uring "mock" file, which is the start of being
   able to do 100% targeted testing in terms of exercising io_uring
   request handling. The idea is to have a file type that can be
   anything the tester would like, and behave exactly how you want it to
   behave in terms of hitting the code paths you want.

 - Improve zcrx by using sgtables to de-duplicate and improve dma
   address handling.

 - Prep work for supporting larger pages for zcrx.

 - Various little improvements and fixes.

* tag 'for-6.17/io_uring-20250728' of git://git.kernel.dk/linux: (42 commits)
  io_uring/zcrx: fix leaking pages on sg init fail
  io_uring/zcrx: don't leak pages on account failure
  io_uring/zcrx: fix null ifq on area destruction
  io_uring: fix breakage in EXPERT menu
  io_uring/cmd: remove struct io_uring_cmd_data
  btrfs/ioctl: store btrfs_uring_encoded_data in io_btrfs_cmd
  io_uring/cmd: introduce IORING_URING_CMD_REISSUE flag
  io_uring/zcrx: account area memory
  io_uring: export io_[un]account_mem
  io_uring/net: Support multishot receive len cap
  io_uring: deduplicate wakeup handling
  io_uring/net: cast min_not_zero() type
  io_uring/poll: cleanup apoll freeing
  io_uring/net: allow multishot receive per-invocation cap
  io_uring/net: move io_sr_msg->retry_flags to io_sr_msg->flags
  io_uring/net: use passed in 'len' in io_recv_buf_select()
  io_uring/zcrx: prepare fallback for larger pages
  io_uring/zcrx: assert area type in io_zcrx_iov_page
  io_uring/zcrx: allocate sgtable for umem areas
  io_uring/zcrx: introduce io_populate_area_dma
  ...

io_uring/poll: cleanup apoll freeing

2025-07-12T21:03:05+00:00

No point having REQ_F_POLLED in both IO_REQ_CLEAN_FLAGS and in
IO_REQ_CLEAN_SLOW_FLAGS, and having both io_free_batch_list() and then
io_clean_op() check for it and clean it.

Move REQ_F_POLLED to IO_REQ_CLEAN_SLOW_FLAGS and drop it from
IO_REQ_CLEAN_FLAGS, and have only io_free_batch_list() do the check and
freeing.

Link: https://lore.kernel.org/io-uring/20250712000344.1579663-2-axboe@kernel.dk
Signed-off-by: Jens Axboe

Revert "io_uring: gate REQ_F_ISREG on !S_ANON_INODE as well"

2025-07-08T17:09:01+00:00

This reverts commit 6f11adcc6f36ffd8f33dbdf5f5ce073368975bc3.

The problematic commit was fixed in mainline, so the work-around in
io_uring can be removed at this point. Anonymous inodes no longer
pretend to be regular files after:

1e7ab6f67824 ("anon_inode: rework assertions")

Signed-off-by: Jens Axboe

Merge branch 'io_uring-6.16' into for-6.17/io_uring

2025-07-06T22:42:23+00:00

Merge in 6.16 io_uring fixes, to avoid clashes with pending net and
settings changes.

* io_uring-6.16:
  io_uring: gate REQ_F_ISREG on !S_ANON_INODE as well
  io_uring/kbuf: flag partial buffer mappings
  io_uring/net: mark iov as dynamically allocated even for single segments
  io_uring: fix resource leak in io_import_dmabuf()
  io_uring: don't assume uaddr alignment in io_vec_fill_bvec
  io_uring/rsrc: don't rely on user vaddr alignment
  io_uring/rsrc: fix folio unpinning
  io_uring: make fallocate be hashed work

io_uring: gate REQ_F_ISREG on !S_ANON_INODE as well

2025-06-29T22:52:34+00:00

io_uring marks a request as dealing with a regular file on S_ISREG. This
drives things like retries on short reads or writes, which is generally
not expected on a regular file (or bdev). Applications tend to not
expect that, so io_uring tries hard to ensure it doesn't deliver short
IO on regular files.

However, a recent commit added S_IFREG to anonymous inodes. When
io_uring is used to read from various things that are backed by anon
inodes, like eventfd, timerfd, etc, then it'll now all of a sudden wait
for more data when rather than deliver what was read or written in a
single operation. This breaks applications that issue reads on anon
inodes, if they ask for more data than a single read delivers.

Add a check for !S_ANON_INODE as well before setting REQ_F_ISREG to
prevent that.

Cc: Christian Brauner 
Cc: stable@vger.kernel.org
Link: https://github.com/ghostty-org/ghostty/discussions/7720
Fixes: cfd86ef7e8e7 ("anon_inode: use a proper mode internally")
Signed-off-by: Jens Axboe

io_uring: add mshot helper for posting CQE32

2025-06-23T15:00:12+00:00

Add a helper for posting 32 byte CQEs in a multishot mode and add a cmd
helper on top. As it specifically works with requests, the helper ignore
the passed in cqe->user_data and sets it to the one stored in the
request.

The command helper is only valid with multishot requests.

Signed-off-by: Pavel Begunkov 
Link: https://lore.kernel.org/r/c29d7720c16e1f981cfaa903df187138baa3946b.1750065793.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe

io_uring: add struct io_cold_def->sqe_copy() method

2025-06-23T14:59:13+00:00

Will be called by the core of io_uring, if inline issue is not going
to be tried for a request. Opcodes can define this handler to defer
copying of SQE data that should remain stable.

Only called if IO_URING_F_INLINE is set. If it isn't set, then there's a
bug in the core handling of this, and -EFAULT will be returned instead
to terminate the request. This will trigger a WARN_ON_ONCE(). Don't
expect this to ever trigger, and down the line this can be removed.

Reviewed-by: Caleb Sander Mateos 
Signed-off-by: Jens Axboe