summaryrefslogtreecommitdiff
path: root/drivers/block
AgeCommit message (Collapse)Author
10 daysublk: reject max_sectors smaller than PAGE_SECTORS in parameter validationMing Lei
blk_validate_limits() requires max_hw_sectors >= PAGE_SECTORS and fires a WARN_ON_ONCE if this invariant is violated. ublk_validate_params() only checked the upper bound of max_sectors against max_io_buf_bytes, allowing userspace to pass small values (including zero) that trigger the warning when blk_mq_alloc_disk() is called from ublk_ctrl_start_dev(). Before 494ea040bcb5, ublk used blk_queue_max_hw_sectors() which silently clamped small values up to PAGE_SECTORS. The conversion to passing queue_limits directly to blk_mq_alloc_disk() lost that clamping and now hits blk_validate_limits()'s WARN_ON_ONCE instead. Validate that max_sectors is at least PAGE_SECTORS in ublk_validate_params() so invalid values are rejected early with -EINVAL instead of reaching the block layer. Fixes: 494ea040bcb5 ("ublk: pass queue_limits to blk_mq_alloc_disk") Signed-off-by: Ming Lei <tom.leiming@gmail.com> Link: https://patch.msgid.link/20260510144843.769031-1-tom.leiming@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
13 daysublk: fix use-after-free in ublk_cancel_cmd()Ming Lei
When ublk_reset_ch_dev() clears io->cmd via ublk_queue_reinit() concurrently with ublk_cancel_cmd(), ublk_cancel_cmd() can read a stale pointer and pass it to io_uring_cmd_done(), causing a use-after-free. Fix by synchronizing the two paths with ubq->cancel_lock: - ublk_cancel_cmd(): read and clear io->cmd under cancel_lock, then call io_uring_cmd_done() on the saved local copy outside the lock. - ublk_reset_ch_dev(): hold cancel_lock across ublk_queue_reinit() so that io->cmd and io->flags are cleared atomically with respect to ublk_cancel_cmd(). Fixes: 216c8f5ef0f2 ("ublk: replace monitor with cancelable uring_cmd") Signed-off-by: Ming Lei <tom.leiming@gmail.com> Link: https://patch.msgid.link/20260508123746.242018-1-tom.leiming@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-06ublk: validate physical_bs_shift, io_min_shift and io_opt_shiftMing Lei
ublk_validate_params() checks logical_bs_shift is within [9, PAGE_SHIFT] but has no upper bound for physical_bs_shift, io_min_shift, or io_opt_shift. A malicious userspace can set any of these to a large value (e.g., 44), causing undefined behavior from `1 << shift` in ublk_ctrl_start_dev() since the result is stored in 32-bit unsigned int. Cap all three at ilog2(SZ_256M) (28). 256M is big enough to cover all practical block sizes, and originates from the maximum physical block size possible in NVMe (lba_size * (1 + npwg), where npwg is 16-bit). Also zero out ub->params with memset() when copy_from_user() fails or ublk_validate_params() returns error, so that no stale or partial params survive for a subsequent START_DEV to consume. Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver") Signed-off-by: Ming Lei <tom.leiming@gmail.com> Link: https://patch.msgid.link/20260506082238.22363-1-tom.leiming@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-01ublk: don't issue uring_cmd from fallback task workJens Axboe
When ublk_ch_uring_cmd_cb() runs as fallback task work (e.g., because the submitting task is exiting), the command should not be issued as current is a kworker, not the daemon task. This can cause io->task to capture the wrong task in __ublk_fetch(), leading to a task mismatch warning in ublk_uring_cmd_cancel_fn(). Check tw.cancel and return -ECANCELED instead of issuing the command from fallback context. Fixes: 3421c7f68bba ("ublk: make sure io cmd handled in submitter task context") Signed-off-by: Ming Lei <tom.leiming@gmail.com> Link: https://patch.msgid.link/20260501112312.947327-1-tom.leiming@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-24Merge tag 'block-7.1-20260424' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - Series for zloop, fixing a variety of issues - t10-pi code cleanup - Fix for a merge window regression with the bio memory allocation mask - Fix for a merge window regression in ublk, caused by an issue with the maple tree iteration code at teardown - ublk self tests additions - Zoned device pgmap fixes - Various little cleanups and fixes * tag 'block-7.1-20260424' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (21 commits) Revert "floppy: fix reference leak on platform_device_register() failure" ublk: avoid unpinning pages under maple tree spinlock ublk: refactor common helper ublk_shmem_remove_ranges() ublk: fix maple tree lockdep warning in ublk_buf_cleanup selftests: ublk: add ublk auto integrity test selftests: ublk: enable test_integrity_02.sh on fio 3.42 selftests: ublk: remove unused argument to _cleanup block: only restrict bio allocation gfp mask asked to block block/blk-throttle: Add WQ_PERCPU to alloc_workqueue users block: Add WQ_PERCPU to alloc_workqueue users block: relax pgmap check in bio_add_page for compatible zone device pages block: add pgmap check to biovec_phys_mergeable floppy: fix reference leak on platform_device_register() failure ublk: use unchecked copy helpers for bio page data t10-pi: reduce ref tag code duplication zloop: remove irq-safe locking zloop: factor out zloop_mark_{full,empty} helpers zloop: set RQF_QUIET when completing requests on deleted devices zloop: improve the unaligned write pointer warning zloop: use vfs_truncate ...
2026-04-24Merge tag 'ceph-for-7.1-rc1' of https://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph updates from Ilya Dryomov: "We have a series from Alex which extends CephFS client metrics with support for per-subvolume data I/O performance and latency tracking (metadata operations aren't included) and a good variety of fixes and cleanups across RBD and CephFS" * tag 'ceph-for-7.1-rc1' of https://github.com/ceph/ceph-client: ceph: add subvolume metrics collection and reporting ceph: parse subvolume_id from InodeStat v9 and store in inode ceph: handle InodeStat v8 versioned field in reply parsing libceph: Fix slab-out-of-bounds access in auth message processing rbd: fix null-ptr-deref when device_add_disk() fails crush: cleanup in crush_do_rule() method ceph: clear s_cap_reconnect when ceph_pagelist_encode_32() fails ceph: only d_add() negative dentries when they are unhashed libceph: update outdated comment in ceph_sock_write_space() libceph: Remove obsolete session key alignment logic ceph: fix num_ops off-by-one when crypto allocation fails libceph: Prevent potential null-ptr-deref in ceph_handle_auth_reply()
2026-04-23Revert "floppy: fix reference leak on platform_device_register() failure"Jens Axboe
This reverts commit e784f2ea0b4fd0e7b70028ff8218f22456c5dcf8. Jiri says the patch is buggy, and it looks like he is right revert it for now. Link: https://lore.kernel.org/linux-block/897f442d-4e04-4b70-b716-38fd10b8af36@kernel.org/ Reported-by: Jiri Slaby <jirislaby@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-23ublk: avoid unpinning pages under maple tree spinlockMing Lei
ublk_shmem_remove_ranges() calls unpin_user_pages() while holding the maple tree spinlock (mas_lock). Although unpin_user_pages() is safe in atomic context, holding the spinlock across potentially many page unpinning operations is not ideal. Split into __ublk_shmem_remove_ranges() which erases up to 64 ranges under mas_lock, collecting base_pfn and nr_pages into a temporary xarray. Then drop the lock and unpin pages outside spinlock context. ublk_shmem_remove_ranges() loops until all matching ranges are processed. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Link: https://patch.msgid.link/20260423033058.2805135-4-tom.leiming@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-23ublk: refactor common helper ublk_shmem_remove_ranges()Ming Lei
Extract the shared walk+erase+unpin+kfree loop into ublk_shmem_remove_ranges(). When buf_index >= 0, only ranges matching that index are removed; when buf_index < 0, all ranges are removed. Also extract ublk_unpin_range_pages() to share the page unpinning loop. Convert both __ublk_ctrl_unreg_buf() and ublk_buf_cleanup() to use the new helper. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Link: https://patch.msgid.link/20260423033058.2805135-3-tom.leiming@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-23ublk: fix maple tree lockdep warning in ublk_buf_cleanupMing Lei
ublk_buf_cleanup() iterates the maple tree with mas_for_each() without holding mas_lock, triggering a lockdep splat on CONFIG_PROVE_RCU kernels since mas_find() internally uses rcu_dereference_check() which requires either RCU or the tree lock. Fix by holding mas_lock around the iteration, and call mas_erase() before freeing each range to avoid dangling pointers in the tree. Fixes: 5e864438e285 ("ublk: replace xarray with IDA for shmem buffer index allocation") Reported-by: Jens Axboe <axboe@kernel.dk> Closes: https://lore.kernel.org/linux-block/0349d72d-dff8-4f9f-b448-919fa5ae96da@kernel.dk/ Signed-off-by: Ming Lei <tom.leiming@gmail.com> Link: https://patch.msgid.link/20260423033058.2805135-2-tom.leiming@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-22rbd: fix null-ptr-deref when device_add_disk() failsDawei Feng
do_rbd_add() publishes the device with device_add() before calling device_add_disk(). If device_add_disk() fails after device_add() succeeds, the error path calls rbd_free_disk() directly and then later falls through to rbd_dev_device_release(), which calls rbd_free_disk() again. This double teardown can leave blk-mq cleanup operating on invalid state and trigger a null-ptr-deref in __blk_mq_free_map_and_rqs(), reached from blk_mq_free_tag_set(). Fix this by following the normal remove ordering: call device_del() before rbd_dev_device_release() when device_add_disk() fails after device_add(). That keeps the teardown sequence consistent and avoids re-entering disk cleanup through the wrong path. The bug was first flagged by an experimental analysis tool we are developing for kernel memory-management bugs while analyzing v6.13-rc1. The tool is still under development and is not yet publicly available. We reproduced the bug on v7.0 with a real Ceph backend and a QEMU x86_64 guest booted with KASAN and CONFIG_FAILSLAB enabled. The reproducer confines failslab injections to the __add_disk() range and injects fail-nth while mapping an RBD image through /sys/bus/rbd/add_single_major. On the unpatched kernel, fail-nth=4 reliably triggered the fault: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 0 UID: 0 PID: 273 Comm: bash Not tainted 7.0.0-01247-gd60bc1401583 #6 PREEMPT(lazy) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014 RIP: 0010:__blk_mq_free_map_and_rqs+0x8c/0x240 Code: 00 00 48 8b 6b 60 41 89 f4 49 c1 e4 03 4c 01 e5 45 85 ed 0f 85 0a 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 e9 48 c1 e9 03 <80> 3c 01 00 0f 85 31 01 00 00 4c 8b 6d 00 4d 85 ed 0f 84 e2 00 00 RSP: 0018:ff1100000ab0fac8 EFLAGS: 00000246 RAX: dffffc0000000000 RBX: ff1100000c4806a0 RCX: 0000000000000000 RDX: 0000000000000002 RSI: 0000000000000000 RDI: ff1100000c4806f4 RBP: 0000000000000000 R08: 0000000000000001 R09: ffe21c000189001b R10: ff1100000c4800df R11: ff1100006cf37be0 R12: 0000000000000000 R13: 0000000000000000 R14: ff1100000c480700 R15: ff1100000c480004 FS: 00007f0fbe8fe740(0000) GS:ff110000e5851000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe53473b2e0 CR3: 0000000012eef000 CR4: 00000000007516f0 PKRU: 55555554 Call Trace: <TASK> blk_mq_free_tag_set+0x77/0x460 do_rbd_add+0x1446/0x2b80 ? __pfx_do_rbd_add+0x10/0x10 ? lock_acquire+0x18c/0x300 ? find_held_lock+0x2b/0x80 ? sysfs_file_kobj+0xb6/0x1b0 ? __pfx_sysfs_kf_write+0x10/0x10 kernfs_fop_write_iter+0x2f4/0x4a0 vfs_write+0x98e/0x1000 ? expand_files+0x51f/0x850 ? __pfx_vfs_write+0x10/0x10 ksys_write+0xf2/0x1d0 ? __pfx_ksys_write+0x10/0x10 do_syscall_64+0x115/0x690 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f0fbea15907 Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 RSP: 002b:00007ffe22346ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000058 RCX: 00007f0fbea15907 RDX: 0000000000000058 RSI: 0000563ace6c0ef0 RDI: 0000000000000001 RBP: 0000563ace6c0ef0 R08: 0000563ace6c0ef0 R09: 6b6435726d694141 R10: 5250337279762f78 R11: 0000000000000246 R12: 0000000000000058 R13: 00007f0fbeb1c780 R14: ff1100000c480700 R15: ff1100000c480004 </TASK> With this fix applied, rerunning the reproducer over fail-nth=1..256 yields no KASAN reports. [ idryomov: rename err_out_device_del -> err_out_device ] Cc: stable@vger.kernel.org Fixes: 27c97abc30e2 ("rbd: add add_disk() error handling") Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Dawei Feng <dawei.feng@seu.edu.cn> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2026-04-19Merge tag 'mm-stable-2026-04-18-02-14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - "Eliminate Dying Memory Cgroup" (Qi Zheng and Muchun Song) Address the longstanding "dying memcg problem". A situation wherein a no-longer-used memory control group will hang around for an extended period pointlessly consuming memory - "fix unexpected type conversions and potential overflows" (Qi Zheng) Fix a couple of potential 32-bit/64-bit issues which were identified during review of the "Eliminate Dying Memory Cgroup" series - "kho: history: track previous kernel version and kexec boot count" (Breno Leitao) Use Kexec Handover (KHO) to pass the previous kernel's version string and the number of kexec reboots since the last cold boot to the next kernel, and print it at boot time - "liveupdate: prevent double preservation" (Pasha Tatashin) Teach LUO to avoid managing the same file across different active sessions - "liveupdate: Fix module unloading and unregister API" (Pasha Tatashin) Address an issue with how LUO handles module reference counting and unregistration during module unloading - "zswap pool per-CPU acomp_ctx simplifications" (Kanchana Sridhar) Simplify and clean up the zswap crypto compression handling and improve the lifecycle management of zswap pool's per-CPU acomp_ctx resources - "mm/damon/core: fix damon_call()/damos_walk() vs kdmond exit race" (SeongJae Park) Address unlikely but possible leaks and deadlocks in damon_call() and damon_walk() - "mm/damon/core: validate damos_quota_goal->nid" (SeongJae Park) Fix a couple of root-only wild pointer dereferences - "Docs/admin-guide/mm/damon: warn commit_inputs vs other params race" (SeongJae Park) Update the DAMON documentation to warn operators about potential races which can occur if the commit_inputs parameter is altered at the wrong time - "Minor hmm_test fixes and cleanups" (Alistair Popple) Bugfixes and a cleanup for the HMM kernel selftests - "Modify memfd_luo code" (Chenghao Duan) Cleanups, simplifications and speedups to the memfd_lou code - "mm, kvm: allow uffd support in guest_memfd" (Mike Rapoport) Support for userfaultfd in guest_memfd - "selftests/mm: skip several tests when thp is not available" (Chunyu Hu) Fix several issues in the selftests code which were causing breakage when the tests were run on CONFIG_THP=n kernels - "mm/mprotect: micro-optimization work" (Pedro Falcato) A couple of nice speedups for mprotect() - "MAINTAINERS: update KHO and LIVE UPDATE entries" (Pratyush Yadav) Document upcoming changes in the maintenance of KHO, LUO, memfd_luo, kexec, crash, kdump and probably other kexec-based things - they are being moved out of mm.git and into a new git tree * tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (121 commits) MAINTAINERS: add page cache reviewer mm/vmscan: avoid false-positive -Wuninitialized warning MAINTAINERS: update Dave's kdump reviewer email address MAINTAINERS: drop include/linux/liveupdate from LIVE UPDATE MAINTAINERS: drop include/linux/kho/abi/ from KHO MAINTAINERS: update KHO and LIVE UPDATE maintainers MAINTAINERS: update kexec/kdump maintainers entries mm/migrate_device: remove dead migration entry check in migrate_vma_collect_huge_pmd() selftests: mm: skip charge_reserved_hugetlb without killall userfaultfd: allow registration of ranges below mmap_min_addr mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update mm/hugetlb: fix early boot crash on parameters without '=' separator zram: reject unrecognized type= values in recompress_store() docs: proc: document ProtectionKey in smaps mm/mprotect: special-case small folios when applying permissions mm/mprotect: move softleaf code out of the main function mm: remove '!root_reclaim' checking in should_abort_scan() mm/sparse: fix comment for section map alignment mm/page_io: use sio->len for PSWPIN accounting in sio_read_complete() selftests/mm: transhuge_stress: skip the test when thp not available ...
2026-04-18zram: reject unrecognized type= values in recompress_store()Andrew Stellman
recompress_store() parses the type= parameter with three if statements checking for "idle", "huge", and "huge_idle". An unrecognized value silently falls through with mode left at 0, causing the recompression pass to run with no slot filter — processing all slots instead of the intended subset. Add a !mode check after the type parsing block to return -EINVAL for unrecognized values, consistent with the function's other parameter validation. Link: https://lore.kernel.org/20260407153027.42425-1-astellman@stellman-greene.com Signed-off-by: Andrew Stellman <astellman@stellman-greene.com> Suggested-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18zram: do not forget to endio for partial discard requestsSergey Senozhatsky
As reported by Qu Wenruo and Avinesh Kumar, the following getconf PAGESIZE 65536 blkdiscard -p 4k /dev/zram0 takes literally forever to complete. zram doesn't support partial discards and just returns immediately w/o doing any discard work in such cases. The problem is that we forget to endio on our way out, so blkdiscard sleeps forever in submit_bio_wait(). Fix this by jumping to end_bio label, which does bio_endio(). Link: https://lore.kernel.org/20260331074255.777019-1-senozhatsky@chromium.org Fixes: 0120dd6e4e20 ("zram: make zram_bio_discard more self-contained") Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reported-by: Qu Wenruo <wqu@suse.com> Closes: https://lore.kernel.org/linux-block/92361cd3-fb8b-482e-bc89-15ff1acb9a59@suse.com Tested-by: Qu Wenruo <wqu@suse.com> Reported-by: Avinesh Kumar <avinesh.kumar@suse.com> Closes: https://bugzilla.suse.com/show_bug.cgi?id=1256530 Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Brian Geffon <bgeffon@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-17floppy: fix reference leak on platform_device_register() failureGuangshuo Li
When platform_device_register() fails in do_floppy_init(), the embedded struct device in floppy_device[drive] has already been initialized by device_initialize(), but the failure path jumps to out_remove_drives without dropping the device reference for the current drive. Previously registered floppy devices are cleaned up in out_remove_drives, but the device for the drive that fails registration is not, leading to a reference leak. The issue was identified by a static analysis tool I developed and confirmed by manual review. Fix this by calling put_device() for the current floppy device before jumping to the common cleanup path. Fixes: 94fd0db7bfb4a ("[PATCH] Floppy: Add cmos attribute to floppy driver") Cc: stable@vger.kernel.org Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com> Link: https://patch.msgid.link/20260415145708.3331818-1-lgs201920130244@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-17ublk: use unchecked copy helpers for bio page dataMing Lei
Bio pages may originate from slab caches that lack a usercopy region (e.g. jbd2 frozen metadata buffers allocated via jbd2_alloc()). When CONFIG_HARDENED_USERCOPY is enabled, copy_to_iter() calls check_copy_size() which rejects these slab pages, triggering a kernel BUG in usercopy_abort(). This is a false positive: the data is ordinary block I/O content — the same data the loop driver writes to its backing file via vfs_iter_write(). The bvec length is always trusted, so the size check in check_copy_size() is not needed either. Switch to _copy_to_iter()/_copy_from_iter() which skip the check_copy_size() wrapper while the underlying copy_to_user() remains unchanged. Acked-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: 2299ceec364e ("ublk: use copy_{to,from}_iter() for user copy") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://patch.msgid.link/20260415230246.808176-1-tom.leiming@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-15Merge tag 'mm-stable-2026-04-13-21-45' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "maple_tree: Replace big node with maple copy" (Liam Howlett) Mainly prepararatory work for ongoing development but it does reduce stack usage and is an improvement. - "mm, swap: swap table phase III: remove swap_map" (Kairui Song) Offers memory savings by removing the static swap_map. It also yields some CPU savings and implements several cleanups. - "mm: memfd_luo: preserve file seals" (Pratyush Yadav) File seal preservation to LUO's memfd code - "mm: zswap: add per-memcg stat for incompressible pages" (Jiayuan Chen) Additional userspace stats reportng to zswap - "arch, mm: consolidate empty_zero_page" (Mike Rapoport) Some cleanups for our handling of ZERO_PAGE() and zero_pfn - "mm/kmemleak: Improve scan_should_stop() implementation" (Zhongqiu Han) A robustness improvement and some cleanups in the kmemleak code - "Improve khugepaged scan logic" (Vernon Yang) Improve khugepaged scan logic and reduce CPU consumption by prioritizing scanning tasks that access memory frequently - "Make KHO Stateless" (Jason Miu) Simplify Kexec Handover by transitioning KHO from an xarray-based metadata tracking system with serialization to a radix tree data structure that can be passed directly to the next kernel - "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" (Thomas Ballasi and Steven Rostedt) Enhance vmscan's tracepointing - "mm: arch/shstk: Common shadow stack mapping helper and VM_NOHUGEPAGE" (Catalin Marinas) Cleanup for the shadow stack code: remove per-arch code in favour of a generic implementation - "Fix KASAN support for KHO restored vmalloc regions" (Pasha Tatashin) Fix a WARN() which can be emitted the KHO restores a vmalloc area - "mm: Remove stray references to pagevec" (Tal Zussman) Several cleanups, mainly udpating references to "struct pagevec", which became folio_batch three years ago - "mm: Eliminate fake head pages from vmemmap optimization" (Kiryl Shutsemau) Simplify the HugeTLB vmemmap optimization (HVO) by changing how tail pages encode their relationship to the head page - "mm/damon/core: improve DAMOS quota efficiency for core layer filters" (SeongJae Park) Improve two problematic behaviors of DAMOS that makes it less efficient when core layer filters are used - "mm/damon: strictly respect min_nr_regions" (SeongJae Park) Improve DAMON usability by extending the treatment of the min_nr_regions user-settable parameter - "mm/page_alloc: pcp locking cleanup" (Vlastimil Babka) The proper fix for a previously hotfixed SMP=n issue. Code simplifications and cleanups ensued - "mm: cleanups around unmapping / zapping" (David Hildenbrand) A bunch of cleanups around unmapping and zapping. Mostly simplifications, code movements, documentation and renaming of zapping functions - "support batched checking of the young flag for MGLRU" (Baolin Wang) Batched checking of the young flag for MGLRU. It's part cleanups; one benchmark shows large performance benefits for arm64 - "memcg: obj stock and slab stat caching cleanups" (Johannes Weiner) memcg cleanup and robustness improvements - "Allow order zero pages in page reporting" (Yuvraj Sakshith) Enhance free page reporting - it is presently and undesirably order-0 pages when reporting free memory. - "mm: vma flag tweaks" (Lorenzo Stoakes) Cleanup work following from the recent conversion of the VMA flags to a bitmap - "mm/damon: add optional debugging-purpose sanity checks" (SeongJae Park) Add some more developer-facing debug checks into DAMON core - "mm/damon: test and document power-of-2 min_region_sz requirement" (SeongJae Park) An additional DAMON kunit test and makes some adjustments to the addr_unit parameter handling - "mm/damon/core: make passed_sample_intervals comparisons overflow-safe" (SeongJae Park) Fix a hard-to-hit time overflow issue in DAMON core - "mm/damon: improve/fixup/update ratio calculation, test and documentation" (SeongJae Park) A batch of misc/minor improvements and fixups for DAMON - "mm: move vma_(kernel|mmu)_pagesize() out of hugetlb.c" (David Hildenbrand) Fix a possible issue with dax-device when CONFIG_HUGETLB=n. Some code movement was required. - "zram: recompression cleanups and tweaks" (Sergey Senozhatsky) A somewhat random mix of fixups, recompression cleanups and improvements in the zram code - "mm/damon: support multiple goal-based quota tuning algorithms" (SeongJae Park) Extend DAMOS quotas goal auto-tuning to support multiple tuning algorithms that users can select - "mm: thp: reduce unnecessary start_stop_khugepaged()" (Breno Leitao) Fix the khugpaged sysfs handling so we no longer spam the logs with reams of junk when starting/stopping khugepaged - "mm: improve map count checks" (Lorenzo Stoakes) Provide some cleanups and slight fixes in the mremap, mmap and vma code - "mm/damon: support addr_unit on default monitoring targets for modules" (SeongJae Park) Extend the use of DAMON core's addr_unit tunable - "mm: khugepaged cleanups and mTHP prerequisites" (Nico Pache) Cleanups to khugepaged and is a base for Nico's planned khugepaged mTHP support - "mm: memory hot(un)plug and SPARSEMEM cleanups" (David Hildenbrand) Code movement and cleanups in the memhotplug and sparsemem code - "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup CONFIG_MIGRATION" (David Hildenbrand) Rationalize some memhotplug Kconfig support - "change young flag check functions to return bool" (Baolin Wang) Cleanups to change all young flag check functions to return bool - "mm/damon/sysfs: fix memory leak and NULL dereference issues" (Josh Law and SeongJae Park) Fix a few potential DAMON bugs - "mm/vma: convert vm_flags_t to vma_flags_t in vma code" (Lorenzo Stoakes) Convert a lot of the existing use of the legacy vm_flags_t data type to the new vma_flags_t type which replaces it. Mainly in the vma code. - "mm: expand mmap_prepare functionality and usage" (Lorenzo Stoakes) Expand the mmap_prepare functionality, which is intended to replace the deprecated f_op->mmap hook which has been the source of bugs and security issues for some time. Cleanups, documentation, extension of mmap_prepare into filesystem drivers - "mm/huge_memory: refactor zap_huge_pmd()" (Lorenzo Stoakes) Simplify and clean up zap_huge_pmd(). Additional cleanups around vm_normal_folio_pmd() and the softleaf functionality are performed. * tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm: fix deferred split queue races during migration mm/khugepaged: fix issue with tracking lock mm/huge_memory: add and use has_deposited_pgtable() mm/huge_memory: add and use normal_or_softleaf_folio_pmd() mm: add softleaf_is_valid_pmd_entry(), pmd_to_softleaf_folio() mm/huge_memory: separate out the folio part of zap_huge_pmd() mm/huge_memory: use mm instead of tlb->mm mm/huge_memory: remove unnecessary sanity checks mm/huge_memory: deduplicate zap deposited table call mm/huge_memory: remove unnecessary VM_BUG_ON_PAGE() mm/huge_memory: add a common exit path to zap_huge_pmd() mm/huge_memory: handle buggy PMD entry in zap_huge_pmd() mm/huge_memory: have zap_huge_pmd return a boolean, add kdoc mm/huge: avoid big else branch in zap_huge_pmd() mm/huge_memory: simplify vma_is_specal_huge() mm: on remap assert that input range within the proposed VMA mm: add mmap_action_map_kernel_pages[_full]() uio: replace deprecated mmap hook with mmap_prepare in uio_info drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare mm: allow handling of stacked mmap_prepare hooks in more drivers ...
2026-04-15zloop: remove irq-safe lockingChristoph Hellwig
All of zloop runs in user context, so drop the irq-safe locking. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://patch.msgid.link/20260414081811.549755-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-15zloop: factor out zloop_mark_{full,empty} helpersChristoph Hellwig
Move a few chunks of duplicated code into helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://patch.msgid.link/20260414081811.549755-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-15zloop: set RQF_QUIET when completing requests on deleted devicesChristoph Hellwig
Reduce the dmesg spam for tests that involve device deletion. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://patch.msgid.link/20260414081811.549755-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-15zloop: improve the unaligned write pointer warningChristoph Hellwig
Use the IS_ALIGNED helper and avoid extra conversions, and tell the user what the unaligned size is. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://patch.msgid.link/20260414081811.549755-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-15zloop: use vfs_truncateChristoph Hellwig
While vfs_truncate does various extra checks that we don't really need, it is always better to use a VFS helper rather than open coding the logic. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://patch.msgid.link/20260414081811.549755-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-15zloop: fix write pointer calculation in zloop_forget_cacheChristoph Hellwig
The write pointer is absolute and in sector units, so we need to convert it to a relative byte address first. Fixes: c505448748f7 ("zloop: forget write cache on force removal") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://patch.msgid.link/20260414081811.549755-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-14Merge tag 'x86-cleanups-2026-04-13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: - Consolidate AMD and Hygon cases in parse_topology() (Wei Wang) - asm constraints cleanups in __iowrite32_copy() (Uros Bizjak) - Drop AMD Extended Interrupt LVT macros (Naveen N Rao) - Don't use REALLY_SLOW_IO for delays (Juergen Gross) - paravirt cleanups (Juergen Gross) - FPU code cleanups (Borislav Petkov) - split-lock handling code cleanups (Borislav Petkov, Ronan Pigott) * tag 'x86-cleanups-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/fpu: Correct the comment explaining what xfeatures_in_use() does x86/split_lock: Don't warn about unknown split_lock_detect parameter x86/fpu: Correct misspelled xfeaures_to_write local var x86/apic: Drop AMD Extended Interrupt LVT macros x86/cpu/topology: Consolidate AMD and Hygon cases in parse_topology() block/floppy: Don't use REALLY_SLOW_IO for delays x86/paravirt: Replace io_delay() hook with a bool x86/irqflags: Preemptively move include paravirt.h directive where it belongs x86/split_lock: Restructure the unwieldy switch-case in sld_state_show() x86/local: Remove trailing semicolon from _ASM_XADD in local_add_return() x86/asm: Use inout "+" asm onstraint modifiers in __iowrite32_copy()
2026-04-13Merge tag 'for-7.1/block-20260411' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: - Add shared memory zero-copy I/O support for ublk, bypassing per-I/O copies between kernel and userspace by matching registered buffer PFNs at I/O time. Includes selftests. - Refactor bio integrity to support filesystem initiated integrity operations and arbitrary buffer alignment. - Clean up bio allocation, splitting bio_alloc_bioset() into clear fast and slow paths. Add bio_await() and bio_submit_or_kill() helpers, unify synchronous bi_end_io callbacks. - Fix zone write plug refcount handling and plug removal races. Add support for serializing zone writes at QD=1 for rotational zoned devices, yielding significant throughput improvements. - Add SED-OPAL ioctls for Single User Mode management and a STACK_RESET command. - Add io_uring passthrough (uring_cmd) support to the BSG layer. - Replace pp_buf in partition scanning with struct seq_buf. - zloop improvements and cleanups. - drbd genl cleanup, switching to pre_doit/post_doit. - NVMe pull request via Keith: - Fabrics authentication updates - Enhanced block queue limits support - Workqueue usage updates - A new write zeroes device quirk - Tagset cleanup fix for loop device - MD pull requests via Yu Kuai: - Fix raid5 soft lockup in retry_aligned_read() - Fix raid10 deadlock with check operation and nowait requests - Fix raid1 overlapping writes on writemostly disks - Fix sysfs deadlock on array_state=clear - Proactive RAID-5 parity building with llbitmap, with write_zeroes_unmap optimization for initial sync - Fix llbitmap barrier ordering, rdev skipping, and bitmap_ops version mismatch fallback - Fix bcache use-after-free and uninitialized closure - Validate raid5 journal metadata payload size - Various cleanups - Various other fixes, improvements, and cleanups * tag 'for-7.1/block-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (146 commits) ublk: fix tautological comparison warning in ublk_ctrl_reg_buf scsi: bsg: fix buffer overflow in scsi_bsg_uring_cmd() block: refactor blkdev_zone_mgmt_ioctl MAINTAINERS: update ublk driver maintainer email Documentation: ublk: address review comments for SHMEM_ZC docs ublk: allow buffer registration before device is started ublk: replace xarray with IDA for shmem buffer index allocation ublk: simplify PFN range loop in __ublk_ctrl_reg_buf ublk: verify all pages in multi-page bvec fall within registered range ublk: widen ublk_shmem_buf_reg.len to __u64 for 4GB buffer support xfs: use bio_await in xfs_zone_gc_reset_sync block: add a bio_submit_or_kill helper block: factor out a bio_await helper block: unify the synchronous bi_end_io callbacks xfs: fix number of GC bvecs selftests/ublk: add read-only buffer registration test selftests/ublk: add filesystem fio verify test for shmem_zc selftests/ublk: add hugetlbfs shmem_zc test for loop target selftests/ublk: add shared memory zero-copy test selftests/ublk: add UBLK_F_SHMEM_ZC support for loop target ...
2026-04-10ublk: fix tautological comparison warning in ublk_ctrl_reg_bufMing Lei
On 32-bit architectures, 'unsigned long size' can never exceed UBLK_SHMEM_BUF_SIZE_MAX (1ULL << 32), causing a tautological comparison warning. Validate buf_reg.len (__u64) directly before using it, and consolidate all input validation into a single check. Also remove the unnecessary local variables 'addr' and 'size' since buf_reg.addr and buf_reg.len can be used directly. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202604101952.3NOzqnu9-lkp@intel.com/ Fixes: 23b3b6f0b584 ("ublk: widen ublk_shmem_buf_reg.len to __u64 for 4GB buffer support") Signed-off-by: Ming Lei <tom.leiming@gmail.com> Link: https://patch.msgid.link/20260410124136.3983429-1-tom.leiming@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-09ublk: allow buffer registration before device is startedMing Lei
Before START_DEV, there is no disk, no queue, no I/O dispatch, so the maple tree can be safely modified under ub->mutex alone without freezing the queue. Add ublk_lock_buf_tree()/ublk_unlock_buf_tree() helpers that take ub->mutex first, then freeze the queue if device is started. This ordering (mutex -> freeze) is safe because ublk_stop_dev_unlocked() already holds ub->mutex when calling del_gendisk() which freezes the queue. Suggested-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Link: https://patch.msgid.link/20260409133020.3780098-6-tom.leiming@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-09ublk: replace xarray with IDA for shmem buffer index allocationMing Lei
Remove struct ublk_buf which only contained nr_pages that was never read after registration. Use IDA for pure index allocation instead of xarray. Make __ublk_ctrl_unreg_buf() return int so the caller can detect invalid index without a separate lookup. Simplify ublk_buf_cleanup() to walk the maple tree directly and unpin all pages in one pass, instead of iterating the xarray by buffer index. Suggested-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Link: https://patch.msgid.link/20260409133020.3780098-5-tom.leiming@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-09ublk: simplify PFN range loop in __ublk_ctrl_reg_bufMing Lei
Use the for-loop increment instead of a manual `i++` past the last page, and fix the mtree_insert_range end key accordingly. Suggested-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Link: https://patch.msgid.link/20260409133020.3780098-4-tom.leiming@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-09ublk: verify all pages in multi-page bvec fall within registered rangeMing Lei
rq_for_each_bvec() yields multi-page bvecs where bv_page is only the first page. ublk_try_buf_match() only validated the start PFN against the maple tree, but a bvec can span multiple pages past the end of a registered range. Use mas_walk() instead of mtree_load() to obtain the range boundaries stored in the maple tree, and check that the bvec's end PFN does not exceed the range. Also remove base_pfn from struct ublk_buf_range since mas.index already provides the range start PFN. Reported-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Link: https://patch.msgid.link/20260409133020.3780098-3-tom.leiming@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-09ublk: widen ublk_shmem_buf_reg.len to __u64 for 4GB buffer supportMing Lei
The __u32 len field cannot represent a 4GB buffer (0x100000000 overflows to 0). Change it to __u64 so buffers up to 4GB can be registered. Add a reserved field for alignment and validate it is zero. The kernel enforces a default max of 4GB (UBLK_SHMEM_BUF_SIZE_MAX) which may be increased in future. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Link: https://patch.msgid.link/20260409133020.3780098-2-tom.leiming@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-07ublk: eliminate permanent pages[] array from struct ublk_bufMing Lei
The pages[] array (kvmalloc'd, 8 bytes per page = 2MB for a 1GB buffer) was stored permanently in struct ublk_buf but only needed during pin_user_pages_fast() and maple tree construction. Since the maple tree already stores PFN ranges via ublk_buf_range, struct page pointers can be recovered via pfn_to_page() during unregistration. Make pages[] a temporary allocation in ublk_ctrl_reg_buf(), freed immediately after the maple tree is built. Rewrite __ublk_ctrl_unreg_buf() to iterate the maple tree for matching buf_index entries, recovering struct page pointers via pfn_to_page() and unpinning in batches of 32. Simplify ublk_buf_erase_ranges() to iterate the maple tree by buf_index instead of walking the now-removed pages[] array. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://patch.msgid.link/20260331153207.3635125-5-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-07ublk: enable UBLK_F_SHMEM_ZC feature flagMing Lei
Add UBLK_F_SHMEM_ZC (1ULL << 19) to the UAPI header and UBLK_F_ALL. Switch ublk_support_shmem_zc() and ublk_dev_support_shmem_zc() from returning false to checking the actual flag, enabling the shared memory zero-copy feature for devices that request it. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://patch.msgid.link/20260331153207.3635125-4-ming.lei@redhat.com [axboe: ublk_buf_reg -> ublk_shmem_buf_reg errors] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-07ublk: add PFN-based buffer matching in I/O pathMing Lei
Add ublk_try_buf_match() which walks a request's bio_vecs, looks up each page's PFN in the per-device maple tree, and verifies all pages belong to the same registered buffer at contiguous offsets. Add ublk_iod_is_shmem_zc() inline helper for checking whether a request uses the shmem zero-copy path. Integrate into the I/O path: - ublk_setup_iod(): if pages match a registered buffer, set UBLK_IO_F_SHMEM_ZC and encode buffer index + offset in addr - ublk_start_io(): skip ublk_map_io() for zero-copy requests - __ublk_complete_rq(): skip ublk_unmap_io() for zero-copy requests The feature remains disabled (ublk_support_shmem_zc() returns false) until the UBLK_F_SHMEM_ZC flag is enabled in the next patch. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://patch.msgid.link/20260331153207.3635125-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-07ublk: add UBLK_U_CMD_REG_BUF/UNREG_BUF control commandsMing Lei
Add control commands for registering and unregistering shared memory buffers for zero-copy I/O: - UBLK_U_CMD_REG_BUF (0x18): pins pages from userspace, inserts PFN ranges into a per-device maple tree for O(log n) lookup during I/O. Buffer pointers are tracked in a per-device xarray. Returns the assigned buffer index. - UBLK_U_CMD_UNREG_BUF (0x19): removes PFN entries and unpins pages. Queue freeze/unfreeze is handled internally so userspace need not quiesce the device during registration. Also adds: - UBLK_IO_F_SHMEM_ZC flag and addr encoding helpers in UAPI header (16-bit buffer index supporting up to 65536 buffers) - Data structures (ublk_buf, ublk_buf_range) and xarray/maple tree - __ublk_ctrl_reg_buf() helper for PFN insertion with error unwinding - __ublk_ctrl_unreg_buf() helper for cleanup reuse - ublk_support_shmem_zc() / ublk_dev_support_shmem_zc() stubs (returning false — feature not enabled yet) Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://patch.msgid.link/20260331153207.3635125-2-ming.lei@redhat.com [axboe: fixup ublk_buf_reg -> ublk_shmem_buf_reg errors, comments] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-07drbd: use get_random_u64() where appropriateDavid Carlier
Use the typed random integer helpers instead of get_random_bytes() when filling a single integer variable. The helpers return the value directly, require no pointer or size argument, and better express intent. Signed-off-by: David Carlier <devnexen@gmail.com> Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Link: https://patch.msgid.link/20260405154704.4610-1-devnexen@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-06drbd: remove DRBD_GENLA_F_MANDATORY flag handlingChristoph Böhmwalder
DRBD used a custom mechanism to mark netlink attributes as "mandatory": bit 14 of nla_type was repurposed as DRBD_GENLA_F_MANDATORY. Attributes sent from userspace that had this bit present and that were unknown to the kernel would lead to an error. Since commit ef6243acb478 ("genetlink: optionally validate strictly/dumps"), the generic netlink layer rejects unknown top-level attributes when strict validation is enabled. DRBD never opted out of strict validation, so unknown top-level attributes are already rejected by the netlink core. The mandatory flag mechanism was required for nested attributes, because these are parsed liberally, silently dropping attributes unknown to the kernel. This prepares for the move to a new YNL-based family, which will use the now-default strict parsing. The current family is not expected to gain any new attributes, which makes this change safe. Old userspace that still sets bit 14 is unaffected: nla_type() strips it before __nla_validate_parse() performs attribute validation, so the bit never reaches DRBD. Remove all references to the mandatory flag in DRBD. Cc: Johannes Berg <johannes.berg@intel.com> Cc: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Link: https://patch.msgid.link/20260403132953.2248751-1-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-06ublk: reset per-IO canceled flag on each fetchUday Shankar
If a ublk server starts recovering devices but dies before issuing fetch commands for all IOs, cancellation of the fetch commands that were successfully issued may never complete. This is because the per-IO canceled flag can remain set even after the fetch for that IO has been submitted - the per-IO canceled flags for all IOs in a queue are reset together only once all IOs for that queue have been fetched. So if a nonempty proper subset of the IOs for a queue are fetched when the ublk server dies, the IOs in that subset will never successfully be canceled, as their canceled flags remain set, and this prevents ublk_cancel_cmd from actually calling io_uring_cmd_done on the commands, despite the fact that they are outstanding. Fix this by resetting the per-IO cancel flags immediately when each IO is fetched instead of waiting for all IOs for the queue (which may never happen). Signed-off-by: Uday Shankar <ushankar@purestorage.com> Fixes: 728cbac5fe21 ("ublk: move device reset into ublk_ch_release()") Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: zhang, the-essence-of-life <zhangweize9@gmail.com> Link: https://patch.msgid.link/20260405-cancel-v2-1-02d711e643c2@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-05zram: change scan_slots to return voidSergey Senozhatsky
scan_slots_for_writeback() and scan_slots_for_recompress() work in a "best effort" fashion, if they cannot allocate memory for a new pp-slot candidate they just return and post-processing selects slots that were successfully scanned thus far. scan_slots functions never return errors and their callers never check the return status, so convert them to return void. Link: https://lkml.kernel.org/r/20260317032349.753645-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05zram: propagate read_from_bdev_async() errorsSergey Senozhatsky
When read_from_bdev_async() fails to chain bio, for instance fails to allocate request or bio, we need to propagate the error condition so that upper layer is aware of it. zram already does that by setting BLK_STS_IOERR ->bi_status, but only for sync reads. Change async read path to return its error status so that async errors are also handled. Link: https://lkml.kernel.org/r/20260316015354.114465-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Suggested-by: Brian Geffon <bgeffon@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Richard Chang <richardycc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05zram: optimize LZ4 dictionary compression performancegao xu
Calling `LZ4_loadDict()` repeatedly in Zram causes significant overhead due to its internal dictionary pre-processing. This commit introduces a template stream mechanism to pre-process the dictionary only once when the dictionary is initially set or modified. It then efficiently copies this state for subsequent compressions. Verification Test Items: Test Platform: android16-6.12 1. Collect Anonymous Page Dataset 1) Apply the following patch: static bool zram_meta_alloc(struct zram *zram, u64 disksize) if (!huge_class_size) - huge_class_size = zs_huge_class_size(zram->mem_pool); + huge_class_size = 0; 2)Install multiple apps and monkey testing until SwapFree is close to 0. 3)Execute the following command to export data: dd if=/dev/block/zram0 of=/data/samples/zram_dump.img bs=4K 2. Train Dictionary Since LZ4 does not have a dedicated dictionary training tool, the zstd tool can be used for training[1]. The command is as follows: zstd --train /data/samples/* --split=4096 --maxdict=64KB -o /vendor/etc/dict_data 3. Test Code adb shell "dd if=/data/samples/zram_dump.img of=/dev/test_pattern bs=4096 count=131072 conv=fsync" adb shell "swapoff /dev/block/zram0" adb shell "echo 1 > /sys/block/zram0/reset" adb shell "echo lz4 > /sys/block/zram0/comp_algorithm" adb shell "echo dict=/vendor/etc/dict_data > /sys/block/zram0/algorithm_params" adb shell "echo 6G > /sys/block/zram0/disksize" echo "Start Compression" adb shell "taskset 80 dd if=/dev/test_pattern of=/dev/block/zram0 bs=4096 count=131072 conv=fsync" echo. echo "Start Decompression" adb shell "taskset 80 dd if=/dev/block/zram0 of=/dev/output_result bs=4096 count=131072 conv=fsync" echo "mm_stat:" adb shell "cat /sys/block/zram0/mm_stat" echo. Note: To ensure stable test results, it is best to lock the CPU frequency before executing the test. LZ4 supports dictionaries up to 64KB. Below are the test results for compression rates at various dictionary sizes: dict_size base patch 4 KB 156M/s 219M/s 8 KB 136M/s 217M/s 16KB 98M/s 214M/s 32KB 66M/s 225M/s 64KB 38M/s 224M/s When an LZ4 compression dictionary is enabled, compression speed is negatively impacted by the dictionary's size; larger dictionaries result in slower compression. This patch eliminates the influence of dictionary size on compression speed, ensuring consistent performance regardless of dictionary scale. Link: https://lkml.kernel.org/r/698181478c9c4b10aa21b4a847bdc706@honor.com Link: https://github.com/lz4/lz4?tab=readme-ov-file [1] Signed-off-by: gao xu <gaoxu2@honor.com> Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05zram: unify and harden algo/priority params handlingSergey Senozhatsky
We have two functions that accept algo= and priority= params - algorithm_params_store() and recompress_store(). This patch unifies and hardens handling of those parameters. There are 4 possible cases: - only priority= provided [recommended] We need to verify that provided priority value is within permitted range for each particular function. - both algo= and priority= provided We cannot prioritize one over another. All we should do is to verify that zram is configured in the way that user-space expects it to be. Namely that zram indeed has compressor algo= setup at given priority=. - only algo= provided [not recommended] We should lookup priority in compressors list. - none provided [not recommended] Just use function's defaults. Link: https://lkml.kernel.org/r/20260311084312.1766036-7-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Suggested-by: Minchan Kim <minchan@kernel.org> Cc: Brian Geffon <bgeffon@google.com> Cc: gao xu <gaoxu2@honor.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05zram: remove chained recompressionSergey Senozhatsky
Chained recompression has unpredictable behavior and is not useful in practice. First, systems usually configure just one alternative recompression algorithm, which has slower compression/decompression but better compression ratio. A single alternative algorithm doesn't need chaining. Second, even with multiple recompression algorithms, chained recompression is suboptimal. If a lower priority algorithm succeeds, the page is never attempted with a higher priority algorithm, leading to worse memory savings. If a lower priority algorithm fails, the page is still attempted with a higher priority algorithm, wasting resources on the failed lower priority attempt. In either case, the system would be better off targeting a specific priority directly. Chained recompression also significantly complicates the code. Remove it. Link: https://lkml.kernel.org/r/20260311084312.1766036-6-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Brian Geffon <bgeffon@google.com> Cc: gao xu <gaoxu2@honor.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05zram: drop ->num_active_compsSergey Senozhatsky
It's not entirely correct to use ->num_active_comps for max-prio limit, as ->num_active_comps just tells the number of configured algorithms, not the max configured priority. For instance, in the following theoretical example: [lz4] [nil] [nil] [deflate] ->num_active_comps is 2, while the actual max-prio is 3. Drop ->num_active_comps and use ZRAM_MAX_COMPS instead. Link: https://lkml.kernel.org/r/20260311084312.1766036-4-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Suggested-by: Minchan Kim <minchan@kernel.org> Cc: Brian Geffon <bgeffon@google.com> Cc: gao xu <gaoxu2@honor.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05zram: do not autocorrect bad recompression parametersSergey Senozhatsky
Do not silently autocorrect bad recompression priority parameter value and just error out. Link: https://lkml.kernel.org/r/20260311084312.1766036-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Suggested-by: Minchan Kim <minchan@kernel.org> Cc: Brian Geffon <bgeffon@google.com> Cc: gao xu <gaoxu2@honor.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05zram: do not permit params change after initSergey Senozhatsky
Patch series "zram: recompression cleanups and tweaks", v2. This series is a somewhat random mix of fixups, recompression cleanups and improvements partly based on internal conversations. A few patches in the series remove unexpected or confusing behaviour, e.g. auto correction of bad priority= param for recompression, which should have always been just an error. Then it also removes "chain recompression" which has a tricky, unexpected and confusing behaviour at times. We also unify and harden the handling of algo/priority params. There is also an addition of missing device lock in algorithm_params_store() which previously permitted modification of algo params while the device is active. This patch (of 6): First, algorithm_params_store(), like any sysfs handler, should grab device lock. Second, like any write() sysfs handler, it should grab device lock in exclusive mode. Third, it should not permit change of algos' parameters after device init, as this doesn't make sense - we cannot compress with one C/D dict and then just change C/D dict to a different one, for example. Another thing to notice is that algorithm_params_store() accesses device's ->comp_algs for algo priority lookup, which should be protected by device lock in exclusive mode in general. Link: https://lkml.kernel.org/r/20260311084312.1766036-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20260311084312.1766036-2-senozhatsky@chromium.org Fixes: 4eac932103a5 ("zram: introduce algorithm_params device attribute") Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Brian Geffon <bgeffon@google.com> Cc: gao xu <gaoxu2@honor.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05zram: use statically allocated compression algorithm namesgao xu
Currently, zram dynamically allocates memory for compressor algorithm names when they are set by the user. This requires careful memory management, including explicit `kfree` calls and special handling to avoid freeing statically defined default compressor names. This patch refactors the way zram handles compression algorithm names. Instead of storing dynamically allocated copies, `zram->comp_algs` will now store pointers directly to the static name strings defined within the `zcomp_ops` backend structures, thereby removing the need for conditional `kfree` calls. Link: https://lkml.kernel.org/r/5bb2e9318d124dbcb2b743dcdce6a950@honor.com Signed-off-by: gao xu <gaoxu2@honor.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-31zloop: add max_open_zones optionDamien Le Moal
Introduce the new max_open_zones option to allow specifying a limit on the maximum number of open zones of a zloop device. This change allows creating a zloop device that can more closely mimick the characteristics of a physical SMR drive. When set to a non zero value, only up to max_open_zones zones can be in the implicit open (BLK_ZONE_COND_IMP_OPEN) and explicit open (BLK_ZONE_COND_EXP_OPEN) conditions at any time. The transition to the implicit open condition of a zone on a write operation can result in an implicit close of an already implicitly open zone. This is handled in the function zloop_do_open_zone(). This function also handles transitions to the explicit open condition. Implicit close transitions are handled using an LRU ordered list of open zones which is managed using the helper functions zloop_lru_rotate_open_zone() and zloop_lru_remove_open_zone(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260326203245.946830-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-26drbd: Balance RCU calls in drbd_adm_dump_devices()Bart Van Assche
Make drbd_adm_dump_devices() call rcu_read_lock() before rcu_read_unlock() is called. This has been detected by the Clang thread-safety analyzer. Tested-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Andreas Gruenbacher <agruenba@redhat.com> Fixes: a55bbd375d18 ("drbd: Backport the "status" command") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://patch.msgid.link/20260326214054.284593-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-25drbd: use genl pre_doit/post_doitChristoph Böhmwalder
Every doit handler followed the same pattern: stack-allocate an adm_ctx, call drbd_adm_prepare() at the top, call drbd_adm_finish() at the bottom. This duplicated boilerplate across 25 handlers and made error paths inconsistent, since some handlers could miss sending the reply skb on early-exit paths. The generic netlink framework already provides pre_doit/post_doit hooks for exactly this purpose. An old comment even noted "this would be a good candidate for a pre_doit hook". Use them: - pre_doit heap-allocates adm_ctx, looks up per-command flags from a new drbd_genl_cmd_flags[] table, runs drbd_adm_prepare(), and stores the context in info->user_ptr[0]. - post_doit sends the reply, drops kref references for device/connection/resource, and frees the adm_ctx. - Handlers just receive adm_ctx from info->user_ptr[0], set reply_dh->ret_code, and return. All teardown is in post_doit. - drbd_adm_finish() is removed, superseded by post_doit. Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Link: https://patch.msgid.link/20260324152907.2840984-1-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>