linux-stable.git/io_uring, branch v6.6.7

io_uring: fix mutex_unlock with unreferenced ctx

2023-12-13T17:45:20+00:00

commit f7b32e785042d2357c5abc23ca6db1b92c91a070 upstream.

Callers of mutex_unlock() have to make sure that the mutex stays alive
for the whole duration of the function call. For io_uring that means
that the following pattern is not valid unless we ensure that the
context outlives the mutex_unlock() call.

mutex_lock(&ctx->uring_lock);
req_put(req); // typically via io_req_task_submit()
mutex_unlock(&ctx->uring_lock);

Most contexts are fine: io-wq pins requests, syscalls hold the file,
task works are taking ctx references and so on. However, the task work
fallback path doesn't follow the rule.

Cc:  
Fixes: 04fc6c802d ("io_uring: save ctx put/get for task_work submit")
Reported-by: Jann Horn 
Signed-off-by: Pavel Begunkov 
Link: https://lore.kernel.org/io-uring/CAG48ez3xSoYb+45f1RLtktROJrpiDQ1otNvdR+YLQf7m+Krj5Q@mail.gmail.com/
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

io_uring/af_unix: disable sending io_uring over sockets

2023-12-13T17:45:20+00:00

commit 705318a99a138c29a512a72c3e0043b3cd7f55f4 upstream.

File reference cycles have caused lots of problems for io_uring
in the past, and it still doesn't work exactly right and races with
unix_stream_read_generic(). The safest fix would be to completely
disallow sending io_uring files via sockets via SCM_RIGHT, so there
are no possible cycles invloving registered files and thus rendering
SCM accounting on the io_uring side unnecessary.

Cc:  
Fixes: 0091bfc81741b ("io_uring/af_unix: defer registered files gc to io_uring release")
Reported-and-suggested-by: Jann Horn 
Signed-off-by: Pavel Begunkov 
Link: https://lore.kernel.org/r/c716c88321939156909cfa1bd8b0faaf1c804103.1701868795.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

io_uring/kbuf: check for buffer list readiness after NULL check

2023-12-13T17:45:17+00:00

[ Upstream commit 9865346b7e8374b57f1c3ccacdc77846c6352ff4 ]

Move the buffer list 'is_ready' check below the validity check for
the buffer list for a given group.

Fixes: 5cf4f52e6d8a ("io_uring: free io_buffer_list entries via RCU")
Reported-by: Dan Carpenter 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

io_uring/kbuf: Fix an NULL vs IS_ERR() bug in io_alloc_pbuf_ring()

2023-12-13T17:45:17+00:00

[ Upstream commit e53f7b54b1fdecae897f25002ff0cff04faab228 ]

The io_mem_alloc() function returns error pointers, not NULL.  Update
the check accordingly.

Fixes: b10b73c102a2 ("io_uring/kbuf: recycle freed mapped buffer ring entries")
Signed-off-by: Dan Carpenter 
Link: https://lore.kernel.org/r/5ed268d3-a997-4f64-bd71-47faa92101ab@moroto.mountain
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

io_uring/kbuf: recycle freed mapped buffer ring entries

2023-12-08T07:52:21+00:00

commit b10b73c102a2eab91e1cd62a03d6446f1dfecc64 upstream.

Right now we stash any potentially mmap'ed provided ring buffer range
for freeing at release time, regardless of when they get unregistered.
Since we're keeping track of these ranges anyway, keep track of their
registration state as well, and use that to recycle ranges when
appropriate rather than always allocate new ones.

The lookup is a basic scan of entries, checking for the best matching
free entry.

Fixes: c392cbecd8ec ("io_uring/kbuf: defer release of mapped buffer rings")
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

io_uring/kbuf: defer release of mapped buffer rings

2023-12-08T07:52:20+00:00

commit c392cbecd8eca4c53f2bf508731257d9d0a21c2d upstream.

If a provided buffer ring is setup with IOU_PBUF_RING_MMAP, then the
kernel allocates the memory for it and the application is expected to
mmap(2) this memory. However, io_uring uses remap_pfn_range() for this
operation, so we cannot rely on normal munmap/release on freeing them
for us.

Stash an io_buf_free entry away for each of these, if any, and provide
a helper to free them post ->release().

Cc: stable@vger.kernel.org
Fixes: c56e022c0a27 ("io_uring: add support for user mapped provided buffer ring")
Reported-by: Jann Horn 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

io_uring: enable io_mem_alloc/free to be used in other parts

2023-12-08T07:52:20+00:00

commit edecf1689768452ba1a64b7aaf3a47a817da651a upstream.

In preparation for using these helpers, make them non-static and add
them to our internal header.

Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

io_uring: don't guard IORING_OFF_PBUF_RING with SETUP_NO_MMAP

2023-12-08T07:52:19+00:00

commit 6f007b1406637d3d73d42e41d7e8d9b245185e69 upstream.

This flag only applies to the SQ and CQ rings, it's perfectly valid
to use a mmap approach for the provided ring buffers. Move the
check into where it belongs.

Cc: stable@vger.kernel.org
Fixes: 03d89a2de25b ("io_uring: support for user allocated memory for rings/sqes")
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

io_uring: free io_buffer_list entries via RCU

2023-12-08T07:52:18+00:00

commit 5cf4f52e6d8aa2d3b7728f568abbf9d42a3af252 upstream.

mmap_lock nests under uring_lock out of necessity, as we may be doing
user copies with uring_lock held. However, for mmap of provided buffer
rings, we attempt to grab uring_lock with mmap_lock already held from
do_mmap(). This makes lockdep, rightfully, complain:

WARNING: possible circular locking dependency detected
6.7.0-rc1-00009-gff3337ebaf94-dirty #4438 Not tainted
------------------------------------------------------
buf-ring.t/442 is trying to acquire lock:
ffff00020e1480a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_uring_validate_mmap_request.isra.0+0x4c/0x140

but task is already holding lock:
ffff0000dc226190 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x124/0x264

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&mm->mmap_lock){++++}-{3:3}:
       __might_fault+0x90/0xbc
       io_register_pbuf_ring+0x94/0x488
       __arm64_sys_io_uring_register+0x8dc/0x1318
       invoke_syscall+0x5c/0x17c
       el0_svc_common.constprop.0+0x108/0x130
       do_el0_svc+0x2c/0x38
       el0_svc+0x4c/0x94
       el0t_64_sync_handler+0x118/0x124
       el0t_64_sync+0x168/0x16c

-> #0 (&ctx->uring_lock){+.+.}-{3:3}:
       __lock_acquire+0x19a0/0x2d14
       lock_acquire+0x2e0/0x44c
       __mutex_lock+0x118/0x564
       mutex_lock_nested+0x20/0x28
       io_uring_validate_mmap_request.isra.0+0x4c/0x140
       io_uring_mmu_get_unmapped_area+0x3c/0x98
       get_unmapped_area+0xa4/0x158
       do_mmap+0xec/0x5b4
       vm_mmap_pgoff+0x158/0x264
       ksys_mmap_pgoff+0x1d4/0x254
       __arm64_sys_mmap+0x80/0x9c
       invoke_syscall+0x5c/0x17c
       el0_svc_common.constprop.0+0x108/0x130
       do_el0_svc+0x2c/0x38
       el0_svc+0x4c/0x94
       el0t_64_sync_handler+0x118/0x124
       el0t_64_sync+0x168/0x16c

From that mmap(2) path, we really just need to ensure that the buffer
list doesn't go away from underneath us. For the lower indexed entries,
they never go away until the ring is freed and we can always sanely
reference those as long as the caller has a file reference. For the
higher indexed ones in our xarray, we just need to ensure that the
buffer list remains valid while we return the address of it.

Free the higher indexed io_buffer_list entries via RCU. With that we can
avoid needing ->uring_lock inside mmap(2), and simply hold the RCU read
lock around the buffer list lookup and address check.

To ensure that the arrayed lookup either returns a valid fully formulated
entry via RCU lookup, add an 'is_ready' flag that we access with store
and release memory ordering. This isn't needed for the xarray lookups,
but doesn't hurt either. Since this isn't a fast path, retain it across
both types. Similarly, for the allocated array inside the ctx, ensure
we use the proper load/acquire as setup could in theory be running in
parallel with mmap.

While in there, add a few lockdep checks for documentation purposes.

Cc: stable@vger.kernel.org
Fixes: c56e022c0a27 ("io_uring: add support for user mapped provided buffer ring")
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

io_uring: don't allow discontig pages for IORING_SETUP_NO_MMAP

2023-12-08T07:52:18+00:00

commit 820d070feb668aab5bc9413c285a1dda2a70e076 upstream.

io_sqes_map() is used rather than io_mem_alloc(), if the application
passes in memory for mapping rather than have the kernel allocate it and
then mmap(2) the ranges. This then calls __io_uaddr_map() to perform the
page mapping and pinning, which checks if we end up with the same pages,
if more than one page is mapped. But this check is incorrect and only
checks if the first and last pages are the same, where it really should
be checking if the mapped pages are contigous. This allows mapping a
single normal page, or a huge page range.

Down the line we can add support for remapping pages to be virtually
contigous, which is really all that io_uring cares about.

Cc: stable@vger.kernel.org
Fixes: 03d89a2de25b ("io_uring: support for user allocated memory for rings/sqes")
Reported-by: Jann Horn 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman