path: root/block
Age | Commit message | Author
6 days | block: handle REQ_OP_ZONE_APPEND in __bio_integrity_action | Christoph Hellwig
Otherwise zone append commands will miss their integrity data. While this works "fine" for auto-PI, it breaks file system PI and non-PI metadata. With this, XFS on a ZNS namespace with non-PI metadata and 512 byte sectors with PI work, while 4k sector formats with PI work only when Caleb's "block: fix integrity offset/length conversions" is applied as well. Note that unlike regular writes, zone append does not need remapping as partitions are not supported on zoned block devices. Fixes: df3c485e0e60 ("block: switch on bio operation in bio_integrity_prep") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://patch.msgid.link/20260624080014.1998650-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 days | block: fix GFP_ flags confusion in bio_integrity_alloc_buf | Christoph Hellwig
bio_integrity_alloc_buf's usage of GFP_ flags is messed up. For one it mixes GFP_NOFS and GFP_NOIO for neighbouring allocations, but it also makes the allocations fail more often than needed. That code was copied from bio_alloc_bioset which needs to do that so that it can punt to the rescuer workqueue, but none of that is needed for the integrity allocations that either sit in the file system or at the very bottom of the I/O stack. Failing early means we'll do a fully waiting allocation from the mempool ->alloc callback which is usually much larger than required. Fix this by passing a gfp_t so that the file system path can pass GFP_NOFS and the auto-integrity code can pass GFP_NOIO, and don't modify the allocation type except for disabling warnings. Fixes: ec7f31b2a2d3 ("block: make bio auto-integrity deadlock safe") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://patch.msgid.link/20260624080014.1998650-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 days | block, bfq: don't grab queue_lock to initialize bfq | Yu Kuai
The request_queue is frozen and quiesced while the elevator init_sched() method runs, so queue_lock is not needed for BFQ cgroup initialization. Signed-off-by: Yu Kuai <yukuai@fygo.io> Link: https://patch.msgid.link/1965073ea20f33114a8d903816b986e483b9bb34.1780621988.git.yukuai@fygo.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 days | blk-cgroup: don't nest queue_lock under blkcg->lock in blkcg_destroy_blkgs() | Yu Kuai
The correct lock order is q->queue_lock before blkcg->lock. To prevent a deadlock in blkcg_destroy_blkgs(), trylock is used for q->queue_lock while blkcg->lock is already held, which is hacky. Refactor blkcg_destroy_blkgs() to hold blkcg->lock only long enough to get the first blkg and then release it. Then take q->queue_lock and blkcg->lock in the correct order to destroy the blkg. This is a very cold path, so the extra lock/unlock cycles are acceptable. Also prepare for protecting blkcg with blkcg_mutex instead of queue_lock. Signed-off-by: Yu Kuai <yukuai@fygo.io> Link: https://patch.msgid.link/00b03cf74a9937cb4d6dd67a189ddc00a3de0451.1780621988.git.yukuai@fygo.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 days | blk-cgroup: don't nest queue_lock under rcu in bio_associate_blkg() | Yu Kuai
If a bio is already associated with a blkg, the blkcg is already pinned until the bio is done, so there is no need for RCU protection. Otherwise, protect blkcg_css() with RCU independently. Prepare to protect blkcg with blkcg_mutex instead of queue_lock. Signed-off-by: Yu Kuai <yukuai@fygo.io> Link: https://patch.msgid.link/8496fa234b21d4b31b7f068766906d0bffcac8e6.1780621988.git.yukuai@fygo.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 days | blk-cgroup: don't nest queue_lock under rcu in blkg_lookup_create() | Yu Kuai
Change this in two steps: 1) hold the RCU read lock and do blkg_lookup() in the fast path; 2) take queue_lock directly in the slow path, without nesting it under the RCU read lock. Prepare for protecting blkcg with blkcg_mutex instead of queue_lock. Signed-off-by: Yu Kuai <yukuai@fygo.io> Link: https://patch.msgid.link/93f33cc9e5a39dddb78dcd934d0c1d04b564fb00.1780621988.git.yukuai@fygo.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 days | blk-cgroup: don't nest queue_lock under rcu in blkcg_print_blkgs() | Yu Kuai
With previous modification to delay freeing policy data after an RCU grace period, prfill() can run under RCU instead of taking queue_lock. However, policy teardown can still clear blkg->pd[plid] after blkcg_print_blkgs() observes the policy enabled bit. Load policy data once with READ_ONCE() and skip the blkg if teardown already cleared it. Do the same in recursive stat walks for descendant blkgs. Remove the stale BFQ debug queue_lock assertion because blkcg_print_blkgs() no longer calls prfill() with queue_lock held. This also lets ioc_qos_prfill() and ioc_cost_model_prfill() use IRQ-safe ioc->lock locking without re-enabling IRQs while queue_lock is still held. Signed-off-by: Yu Kuai <yukuai@fygo.io> Link: https://patch.msgid.link/db7633d5e263dd1c2bf9b901762545a84b7d714e.1780621988.git.yukuai@fygo.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
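The "load once, skip if cleared" pattern described in this message can be sketched in userspace C. This is only an illustration under stated assumptions: C11 atomics stand in for the kernel's READ_ONCE(), and `struct sketch_group`, `struct sketch_pd`, `SKETCH_MAX_POLICIES` and `sketch_sum_stats()` are hypothetical names, not kernel code. The sketch does not model the RCU grace period that keeps the pointed-to data alive in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_MAX_POLICIES 4   /* hypothetical policy slot count */

struct sketch_pd {
    long stat;                  /* hypothetical per-policy statistic */
};

struct sketch_group {
    /* Teardown may concurrently store NULL into a slot. */
    _Atomic(struct sketch_pd *) pd[SKETCH_MAX_POLICIES];
};

/* Sum one policy's stat over all groups, tolerating cleared slots. */
static long sketch_sum_stats(struct sketch_group *groups, size_t n, int plid)
{
    long total = 0;

    assert(plid >= 0 && plid < SKETCH_MAX_POLICIES);
    for (size_t i = 0; i < n; i++) {
        /*
         * Load the pointer exactly once into a local. Testing the
         * slot and then re-reading it could observe NULL on the
         * second read if teardown ran in between.
         */
        struct sketch_pd *pd =
            atomic_load_explicit(&groups[i].pd[plid], memory_order_acquire);

        if (!pd)
            continue;           /* teardown already cleared it: skip */
        total += pd->stat;
    }
    return total;
}
```

The local copy is the point of the pattern: every later dereference uses the value that was checked, never a fresh read of the shared slot.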
6 days | blk-cgroup: delay freeing policy data after rcu grace period | Yu Kuai
Currently blkcg_print_blkgs() must hold RCU to iterate blkgs from a blkcg, and prfill() must hold queue_lock to prevent policy data from being freed by policy deactivation. As a consequence, queue_lock has to be nested under RCU from blkcg_print_blkgs(). Delay freeing policy data until after an RCU grace period so prfill() can be protected by RCU alone. Signed-off-by: Yu Kuai <yukuai@fygo.io> Link: https://patch.msgid.link/e20e5d984b41a026d61851966bed35eb094c4bff.1780621988.git.yukuai@fygo.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 days | blk-cgroup: protect iterating blkgs with blkcg->lock in blkcg_print_stat() | Yu Kuai
blkcg_print_one_stat() will be called for each blkg: - access blkg->iostat, which is freed from rcu callback blkg_free_workfn(); - access policy data from pd_stat_fn(), which is freed from pd_free_fn(), while pd_free_fn() can be called by removing blkcg or deactivating policy; Take blkcg->lock while iterating so the blkgs stay online and both blkg->iostat and policy data for activated policies stay valid. Use irq-safe locking because blkcg->lock can be nested under q->queue_lock, which is used from IRQ completion paths. Prepare to convert protecting blkgs from request_queue with mutex. Signed-off-by: Yu Kuai <yukuai@fygo.io> Link: https://patch.msgid.link/05799877e720dcd300e2ddd4625e8e162959d7cc.1780621988.git.yukuai@fygo.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 days | blk-cgroup: defer blkcg css_put until blkg is unlinked from queue | Zizhi Wo
[BUG] Our fuzz testing triggered a blkcg use-after-free issue: BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0 Call Trace: ... blkcg_deactivate_policy+0x244/0x4d0 ioc_rqos_exit+0x44/0xe0 rq_qos_exit+0xba/0x120 __del_gendisk+0x50b/0x800 del_gendisk+0xff/0x190 ... [CAUSE] process1 process2 cgroup_rmdir ... css_killed_work_fn offline_css ... blkcg_destroy_blkgs ... __blkg_release css_put(&blkg->blkcg->css) blkg_free INIT_WORK(xxx, blkg_free_workfn) schedule_work css_put ... blkcg_css_free kfree(blkcg)--------blkcg has been freed!!! ====================================schedule_work blkg_free_workfn __del_gendisk rq_qos_exit ioc_rqos_exit blkcg_deactivate_policy mutex_lock(&q->blkcg_mutex) spin_lock_irq(&q->queue_lock) list_for_each_entry(blkg, xxx) blkcg = blkg->blkcg spin_lock(&blkcg->lock)-------UAF!!! mutex_lock(&q->blkcg_mutex) spin_lock_irq(&q->queue_lock) /* Only then is the blkg removed from the list */ list_del_init(&blkg->q_node) As a result, a blkg can still be reachable through q->blkg_list while its ->blkcg has already been freed. [Fix] Fix this by deferring the blkcg css_put() until after the blkg has been unlinked from q->blkg_list in blkg_free_workfn(). This ensures that the blkcg outlives every blkg still reachable through q->blkg_list, so any iterator holding q->queue_lock is guaranteed to observe a valid blkg->blkcg. 
While at it, move css_tryget_online() from blkg_create() into blkg_alloc() so that the css reference is owned by the alloc/free pair rather than straddling layers: blkg_alloc() <-> blkg_free() blkg_create() <-> blkg_destroy() Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()") Suggested-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Yu Kuai <yukuai@fygo.io> Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com> Link: https://patch.msgid.link/20260616011746.2451461-1-wozizhi@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 days | blk-cgroup: fix UAF in __blkcg_rstat_flush() | Michal Koutný
When multiple blkgs in the same blkcg are released concurrently, a use-after-free can occur. The race happens when one blkg's __blkcg_rstat_flush() removes another blkg's iostat entries via llist_del_all(). The second blkg sees an empty list and proceeds to free itself while the first is still iterating over its entries. Move the flush from __blkg_release() (RCU callback) to blkg_release() (before call_rcu). This ensures the RCU grace period waits for any concurrent flush's rcu_read_lock() section to complete before freeing. Cc: stable@vger.kernel.org Cc: Jay Shin <jaeshin@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Fixes: 20cb1c2fb756 ("blk-cgroup: Flush stats before releasing blkcg_gq") Reported-by: coregee2000@gmail.com Closes: https://lore.kernel.org/linux-block/CAHPqNmwT9oRpem3J3erS_W0uSQND47LGGSBsNxP8E6uSUish1w@mail.gmail.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Jose Fernandez (Anthropic) <jose.fernandez@linux.dev> Link: https://patch.msgid.link/20260205155425.342084-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 days | block, bfq: protect async queue reset with blkcg locks | Cen Zhang
Writing 0 to BFQ's low_latency attribute ends weight raising for active, idle and async queues. The async cgroup path walks q->blkg_list, converts each blkg to BFQ policy data and then reads bfqg->async_bfqq and bfqg->async_idle_bfqq. That walk was protected only by bfqd->lock. blkcg release work is serialized by q->blkcg_mutex and q->queue_lock instead, and blkg_free_workfn() can call BFQ's pd_free_fn before it removes blkg->q_node from q->blkg_list. A low_latency reset can therefore still find the blkg on the queue list after the BFQ policy data has been freed. The buggy scenario involves two paths, with each list showing the order within that path:
BFQ low_latency reset:
1. bfq_low_latency_store() calls bfq_end_wr().
2. bfq_end_wr_async() walks q->blkg_list.
3. blkg_to_bfqg() returns stale policy data.
4. bfq_end_wr_async_queues() reads async queue fields.
blkcg blkg release work:
1. blkg_free_workfn() takes q->blkcg_mutex.
2. BFQ pd_free_fn drops the final bfq_group reference.
3. blkg->q_node remains on the q->blkg_list until list_del_init().
Fix this by taking q->blkcg_mutex and q->queue_lock around the q->blkg_list walk, then taking bfqd->lock before touching BFQ async queues. The mutex serializes against policy-data free and queue_lock stabilizes the list. Move the async reset out of bfq_end_wr()'s existing bfqd->lock critical section so the lock order matches blkcg policy callbacks. Validation reproduced this kernel report: BUG: KASAN: slab-use-after-free in bfq_end_wr_async_queues+0x246/0x340 Call Trace: <TASK> dump_stack_lvl+0x66/0xa0 print_report+0xce/0x630 ? bfq_end_wr_async_queues+0x246/0x340 ? srso_alias_return_thunk+0x5/0xfbef5 ? __virt_addr_valid+0x20d/0x410 ? bfq_end_wr_async_queues+0x246/0x340 kasan_report+0xe0/0x110 ? bfq_end_wr_async_queues+0x246/0x340 bfq_end_wr_async_queues+0x246/0x340 bfq_end_wr_async+0xba/0x180 bfq_low_latency_store+0x4e5/0x690 ? 0xffffffffc02150da ? __pfx_bfq_low_latency_store+0x10/0x10 ?
__pfx_bfq_low_latency_store+0x10/0x10 elv_attr_store+0xc4/0x110 kernfs_fop_write_iter+0x2f5/0x4a0 vfs_write+0x604/0x11f0 ? __pfx_locks_remove_posix+0x10/0x10 ? __pfx_vfs_write+0x10/0x10 ksys_write+0xf9/0x1d0 ? __pfx_ksys_write+0x10/0x10 do_syscall_64+0x115/0x6a0 entry_SYSCALL_64_after_hwframe+0x77/0x7f Allocated by task 544: kasan_save_stack+0x33/0x60 kasan_save_track+0x14/0x30 __kasan_kmalloc+0xaa/0xb0 bfq_pd_alloc+0xc0/0x1b0 blkg_alloc+0x346/0x960 blkg_create+0x8c2/0x10d0 bio_associate_blkg_from_css+0x9f3/0xfa0 bio_associate_blkg+0xd9/0x200 bio_init+0x303/0x640 __blkdev_direct_IO_simple+0x56b/0x8a0 blkdev_direct_IO+0x8e7/0x2580 blkdev_read_iter+0x205/0x400 vfs_read+0x7b0/0xda0 ksys_read+0xf9/0x1d0 do_syscall_64+0x115/0x6a0 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 465: kasan_save_stack+0x33/0x60 kasan_save_track+0x14/0x30 kasan_save_free_info+0x3b/0x60 __kasan_slab_free+0x5f/0x80 kfree+0x307/0x580 blkg_free_workfn+0xef/0x460 process_one_work+0x8d0/0x1870 worker_thread+0x575/0xf80 kthread+0x2e7/0x3c0 ret_from_fork+0x576/0x810 ret_from_fork_asm+0x1a/0x30 Fixes: 44e44a1b329e ("block, bfq: improve responsiveness") Assisted-by: Codex:gpt-5.5 Signed-off-by: Cen Zhang <zzzccc427@gmail.com> Reviewed-by: Tao Cui <cuitao@kylinos.cn> Link: https://patch.msgid.link/20260621135930.2657810-1-zzzccc427@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 days | block: fix incorrect error injection static key decrement | Christoph Hellwig
Only decrement the static key when we had items and thus it was incremented before. Fixes: e8dcf2d142bd ("block: add configurable error injection") Reported-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260622160752.1552516-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
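The balance rule stated here (decrement only if an increment happened earlier) can be sketched as follows. This is a userspace illustration, not the kernel code: a plain integer stands in for the static key, and it assumes the key is incremented once when a list first gains an item. All names (`sketch_key_count`, `struct sketch_inject_list`, `sketch_inject_add()`, `sketch_inject_clear()`) are hypothetical.

```c
#include <assert.h>

/* Stand-in for a static key's enable count (hypothetical). */
static int sketch_key_count;

struct sketch_inject_list {
    int nr_items;   /* number of configured injection rules */
};

/* Adding the first item enables the key; later items do not. */
static void sketch_inject_add(struct sketch_inject_list *l)
{
    if (l->nr_items++ == 0)
        sketch_key_count++;
}

/* Clearing drops the key only if this list had taken it before. */
static void sketch_inject_clear(struct sketch_inject_list *l)
{
    if (l->nr_items > 0) {
        l->nr_items = 0;
        sketch_key_count--;
    }
    /* An unconditional decrement could drive the count negative. */
    assert(sketch_key_count >= 0);
}
```

Clearing an empty list is a no-op in this sketch, so the count cannot underflow even when teardown runs on a list that never had any items.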
13 days | block: respect iov_iter::nofault flag in bio_iov_iter_bounce_write() | Qu Wenruo
For the upcoming usage of IOMAP_DIO_BOUNCE in btrfs, btrfs has set iov_iter::nofault to prevent a deadlock when a page fault is needed to read out the buffer. However, bio_iov_iter_bounce_write() doesn't respect the iov_iter::nofault flag and just calls a plain copy_from_iter(), so it can still trigger a page fault and cause a deadlock in btrfs. Fix it by using copy_folio_from_iter_atomic() if the nofault flag is set, otherwise use copy_folio_from_iter(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/9c165a314022b61566eb247852eb773ca6c70889.1781597506.git.wqu@suse.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
13 days | block: revert the iov_iter after a short copy in bio_iov_iter_bounce_write() | Qu Wenruo
For the upcoming IOMAP_DIO_BOUNCE flag usage inside btrfs, it's pretty easy to hit a short copy inside bio_iov_iter_bounce_write(). This is because btrfs has disabled page faults to avoid a certain deadlock during direct writes, and instead btrfs manually faults in the pages and then retries. Inside bio_iov_iter_bounce_write(), if we hit a short write, we don't revert the iov_iter, which can cause problems like unexpected garbage for the next retry. Revert the iov_iter after a short copy. One thing to note is that the folio is allocated and then immediately queued into the bio, so the proper revert size should be (bi_size - this_len + copied). Fixes: 8dd5e7c75d7b ("block: add helpers to bounce buffer an iov_iter into bios") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/c400989f227343b134110773d5acaaacf7024574.1781597506.git.wqu@suse.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
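The revert size (bi_size - this_len + copied) can be checked with a small worked example. The helper below is a hypothetical userspace sketch, not the kernel function. It assumes bi_size already counts the just-queued folio of this_len bytes, so the iterator has actually advanced by the earlier, fully copied bytes plus the partial copy.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bytes by which the iterator advanced for this bio when the last
 * copy came up short (hypothetical helper name).
 *
 *   bi_size  - bytes queued in the bio, including the last folio
 *   this_len - length of the last folio that was queued
 *   copied   - bytes actually copied into that last folio
 */
static size_t sketch_bounce_revert_len(size_t bi_size, size_t this_len,
                                       size_t copied)
{
    assert(this_len <= bi_size);
    assert(copied <= this_len);
    return bi_size - this_len + copied;
}
```

For example, with two earlier 4096-byte folios fully copied and a third 4096-byte folio queued (bi_size = 12288) of which only 1000 bytes were copied, the sketch returns 12288 - 4096 + 1000 = 9192. That is exactly the 8192 + 1000 bytes the iterator consumed.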
13 days | block: Remove redundant plug in __submit_bio() | Wen Xiong
The patch removes the automatic plug/unplug operations from __submit_bio() that were added to cache nsecs time when no explicit plug is used. The plug mechanism is most effective when batching multiple I/O operations together. Creating a plug for every bio submission provides minimal benefit while adding function call overhead and stack usage for every I/O operation. Below is a performance comparison with the latest upstream kernel.
Iotype  qd  nj  rmix  mpstat busy  mpstat busy without plug
Randrw  1   20  100   53%          24%
Randrw  1   40  100   70%          24%
Randrw  1   20  70    40%          24%
Randrw  1   40  70    60%          26%
Randrw  1   20  0     14%          6%
Randrw  1   40  0     20%          7%
Fixes: 060406c61c7c ("block: add plug while submitting IO") Signed-off-by: Wen Xiong <wenxiong@linux.ibm.com> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Link: https://patch.msgid.link/20260616143121.878021-1-wenxiong@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
13 days | block: fix IORING_URING_CMD_REISSUE flags check in blkdev_uring_cmd | Yitang Yang
blkdev_uring_cmd() checks IORING_URING_CMD_REISSUE to determine whether this is the first issue. However, this flag lives in cmd->flags instead of issue_flags. Coincidentally, IO_URING_F_NONBLOCK shares bit 31 with IORING_URING_CMD_REISSUE. As a result, the SQE read was never performed, bic->len remained zero, and every BLOCK_URING_CMD_DISCARD failed with -EINVAL. Fix it by checking cmd->flags as intended. Cc: stable@vger.kernel.org Fixes: 212ec34e4e72 ("block: only read from sqe on initial invocation of blkdev_uring_cmd") Signed-off-by: Yitang Yang <yi1tang.yang@gmail.com> Link: https://patch.msgid.link/20260616155129.406057-1-yi1tang.yang@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
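The aliasing described here, two unrelated flags occupying bit 31 of two different flag words, can be reproduced in a few lines. This is a hedged userspace sketch: the `SKETCH_*` constants, `struct sketch_cmd` and both helper functions are hypothetical stand-ins, chosen only to mirror the bit positions named in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Two unrelated flags that happen to share bit 31 (stand-ins). */
#define SKETCH_CMD_REISSUE  (1U << 31)  /* belongs to cmd->flags   */
#define SKETCH_F_NONBLOCK   (1U << 31)  /* belongs to issue_flags  */

static_assert(SKETCH_CMD_REISSUE == SKETCH_F_NONBLOCK,
              "the sketch relies on both flags using the same bit");

struct sketch_cmd {
    uint32_t flags;     /* per-command state, survives reissue */
};

/* Buggy: tests the REISSUE bit in the wrong flag word. */
static bool sketch_first_issue_buggy(const struct sketch_cmd *cmd,
                                     uint32_t issue_flags)
{
    (void)cmd;
    return !(issue_flags & SKETCH_CMD_REISSUE);
}

/* Fixed: tests the REISSUE bit where it actually lives. */
static bool sketch_first_issue_fixed(const struct sketch_cmd *cmd,
                                     uint32_t issue_flags)
{
    (void)issue_flags;
    return !(cmd->flags & SKETCH_CMD_REISSUE);
}
```

With a fresh command (flags == 0) issued non-blocking, the buggy check reports "not the first issue" and so skips the one-time setup, while the fixed check reports a first issue. Keeping each flag namespace in a distinct type is one way to make this class of mix-up harder to write, though that is a design suggestion rather than something the commit claims to do.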
14 days | Merge tag 'for-7.2/dm-changes' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mikulas Patocka: - small cleanups in dm-vdo, dm-raid, dm-cache, dm-zoned-metadata - rework of dm-ima - introduce dm-inlinecrypt - fix wrong return value in dm-ioctl - fix rcu stall when polling * tag 'for-7.2/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-zoned-metadata: Use strscpy() to copy device name dm cache: make smq background work limit configurable dm-inlinecrypt: add support for hardware-wrapped keys dm: limit target bio polling to one shot dm-ioctl: report an error if a device has no table dm: add documentation for dm-inlinecrypt target dm-inlinecrypt: add target for inline block device encryption block: export blk-crypto symbols required by dm-inlinecrypt dm-ima: use active table's size if available dm-ima: Fail more gracefully in dm_ima_measure_on_* dm-ima: Handle race between rename and table swap dm-ima: Fix issues with dm_ima_measure_on_device_rename dm-ima: remove new_map from dm_ima_measure_on_device_clear dm-ima: Fix UAF errors and measuring incorrect context dm-ima: don't copy the active table to the inactive table dm-ima: Remove status_flags from dm_ima_measure_on_table_load() dm-ima: remove broken last_target_measured logic dm-ima: remove dm_ima_reset_data() dm-raid: only requeue bios when dm is suspending dm vdo: use get_random_u32() where appropriate
14 days | Merge tag 'for-7.2/block-20260615' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: - NVMe pull request via Keith: - Per-controller admin and IO timeout sysfs attributes, and letting the block layer set request timeouts (Maurizio, Maximilian) - Multipath passthrough iostats, and PCI P2PDMA enablement for multipath devices (Keith, Kiran) - A new diag sysfs attribute group exporting per-controller counters (retries, multipath failover, error counters, requeue and failure counts, reset and reconnect events) (Nilay) - FDP configuration validation and bounds check fixes (liuxixin) - Various nvmet fixes, including a pre-auth out-of-bounds read in the Discovery Get Log Page handler, auth payload bounds validation, and tcp error-path leak fixes (Bryam, Tianchu, Geliang) - nvme-tcp lockdep and workqueue fixes (Shin'ichiro, Kuniyuki, Eric) - Assorted other fixes and cleanups (John, Yao, Chao, Mateusz, Achkinazi, Wentao) - MD pull request via Yu Kuai: - raid1/raid10 fixes for a deadlock in the read error recovery path, error-path detection and bio accounting with cloned bios, and an nr_pending leak in the REQ_ATOMIC bad-block error path (Abd-Alrhman) - PCI P2PDMA propagation from member devices to the RAID device (Kiran) - dm-raid bio requeue fix, and various smaller fixes and cleanups (Benjamin, Chen, Li, Thorsten) - Enable Clang lock context analysis for the block layer, with the accompanying annotations across queue limits, the blk_holder_ops callbacks, crypto, cgroup, iocost, kyber and mq-deadline (Bart) - Block status code infrastructure work: a tagged status table, a str_to_blk_op() helper, a bio_endio_status() helper, and on top of that a new configurable block-layer error injection facility (Christoph) - DRBD netlink rework, replacing the genl_magic machinery with explicit netlink serialization and moving the DRBD UAPI headers to include/uapi/linux/ (Christoph Böhmwalder) - bvec improvements: a bvec_folio() helper and making the bvec_iter helpers proper 
inline functions (Willy, Christoph) - ublk cleanups and a canceling-flag fix for the disk-not-allocated case (Caleb, Ming) - Partition handling fixes: bound the AIX pp_count scan, fix an of_node refcount leak, and replace __get_free_page() with kmalloc() (Bryam, Wentao, Mike) - Convert numa_node to int in blk_mq_hw_ctx and ->init_request, and add WQ_PERCPU to the block workqueue users (Mateusz, Marco) - Block statistics and tracing: propagate in-flight to the whole disk on partition IO, export passthrough stats, and a new block_rq_tag_wait tracepoint (Tang, Keith, Aaron) - A round of removals, unexports and cleanups across bio, direct-io and the bvec helpers (Christoph) - Various driver fixes (mtip32xx use-after-free, rbd snap_count validation and strscpy conversion, nbd socket lockdep reclassify, virtio-blk zone report clamp, floppy) and a batch of MAINTAINERS email/list updates (Coly, Li, Yu, Christoph Böhmwalder) - Other little fixes and cleanups all over * tag 'for-7.2/block-20260615' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (117 commits) MAINTAINERS: Update Coly Li's email address block: check bio split for unaligned bvec nbd: Reclassify sockets to avoid lockdep circular dependency block: add configurable error injection block: add a str_to_blk_op helper block: add a "tag" for block status codes block: add a macro to initialize the status table floppy: Drop unused pnp driver data block: propagate in_flight to whole disk on partition I/O virtio-blk: clamp zone report to the report buffer capacity block: optimize I/O merge hot path with unlikely() hints drivers/block/rbd: Use strscpy() to copy strings into arrays partitions: aix: bound the pp_count scan to the ppe array block: Enable lock context analysis block/mq-deadline: Make the lock context annotations compatible with Clang block/Kyber: Make the lock context annotations compatible with Clang block/blk-mq-debugfs: Improve lock context annotations block/blk-iocost: Inline iocg_lock() and 
iocg_unlock() block/blk-iocost: Split ioc_rqos_throttle() block/crypto: Annotate the crypto functions ...
14 days | Merge tag 'slab-for-7.2' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: - Support for "allocation tokens" (currently available in Clang 22+) for smarter partitioning of kmalloc caches based on the allocated object type, which can be enabled instead of the "random" per-caller-address-hash partitioning. It should be able to deterministically separate types containing a pointer from those that do not (Marco Elver) - Improvements and simplification of the kmem_cache_alloc_bulk() and mempool_alloc_bulk() API. This includes adaptation of callers (Christoph Hellwig) - Performance improvements and cleanups related mostly to sheaves refill (Hao Li, Shengming Hu, Vlastimil Babka) - Several fixups for the slabinfo tool (Xuewen Wang) * tag 'slab-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slab: do not limit zeroing to orig_size when only red zoning is enabled mm/slub: preserve original size in _kmalloc_nolock_noprof retry path mm: simplify the mempool_alloc_bulk API mm/slab: improve kmem_cache_alloc_bulk mm/slub: detach and reattach partial slabs in batch mm/slub: introduce helpers for node partial slab state mm/slub: use empty sheaf helpers for oversized sheaves tools/mm/slabinfo: remove redundant slab->partial assignment tools/mm/slabinfo: remove dead assignment in get_obj_and_str() tools/mm/slabinfo: Fix trace disable logic inversion MAINTAINERS: add slab-related scripts and tools to SLAB ALLOCATOR mm/slub: fix typo in sheaves comment mm, slab: simplify returning slab in __refill_objects_node() mm, slab: add an optimistic __slab_try_return_freelist() slab: fix kernel-docs for mm-api slab: improve KMALLOC_PARTITION_RANDOM randomness slab: support for compiler-assisted type-based slab cache partitioning mm/slub: defer freelist construction until after bulk allocation from a new slab
2026-06-13 | block: check bio split for unaligned bvec | Keith Busch
Offsets and lengths need to be validated against the dma alignment. This check was skipped for a sufficiently small bio with a single bvec, which may allow an invalid request to be dispatched to the driver. Force the validation for an unaligned bvec by forcing the bio split path that handles this condition. Fixes: 7eac33186957 ("iomap: simplify direct io validity check") Fixes: 5ff3f74e145a ("block: simplify direct io validity check") Reported-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://patch.msgid.link/20260612223205.465913-1-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
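As a rough illustration of what validating an offset and length against an alignment can look like, the sketch below tests both values against a power-of-two mask. It is an assumption-laden userspace example, not the kernel's check: `sketch_bvec_unaligned()` and its mask convention (alignment minus one, e.g. 511 for 512-byte alignment) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Return true if either the offset or the length of a segment
 * violates the alignment described by align_mask.
 *
 * align_mask is "alignment - 1" for a power-of-two alignment
 * (hypothetical convention for this sketch).
 */
static bool sketch_bvec_unaligned(unsigned int offset, unsigned int len,
                                  unsigned int align_mask)
{
    /* A valid mask has all low bits set: 0, 1, 3, 7, ..., 511, ... */
    assert((align_mask & (align_mask + 1)) == 0);

    /* OR-ing first lets one mask test cover both values. */
    return ((offset | len) & align_mask) != 0;
}
```

In a fast path that normally skips further validation for a small single-segment request, a check of this shape could decide whether to fall back to the slower path that handles misaligned segments.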
2026-06-12 | block: add configurable error injection | Christoph Hellwig
Add a new block error injection interface that allows injecting specific status codes for specific ranges. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev> Link: https://patch.msgid.link/20260611140703.2401204-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-12 | block: add a str_to_blk_op helper | Christoph Hellwig
Add a helper to find the REQ_OP_XYZ constant from the "XYZ" string. This will be used for the error injection debugfs interface. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev> Link: https://patch.msgid.link/20260611140703.2401204-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
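A string-to-operation lookup of the kind described can be sketched as a linear scan over a name table. This is a userspace illustration only: the enum values, the table contents and `sketch_str_to_op()` are hypothetical and do not reproduce the kernel helper's signature or its actual operation list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical subset of operations, for illustration only. */
enum sketch_op {
    SKETCH_OP_READ,
    SKETCH_OP_WRITE,
    SKETCH_OP_FLUSH,
    SKETCH_OP_DISCARD,
    SKETCH_OP_COUNT
};

/* Names are stored without any prefix, i.e. just the "XYZ" part. */
static const char *const sketch_op_names[SKETCH_OP_COUNT] = {
    [SKETCH_OP_READ]    = "READ",
    [SKETCH_OP_WRITE]   = "WRITE",
    [SKETCH_OP_FLUSH]   = "FLUSH",
    [SKETCH_OP_DISCARD] = "DISCARD",
};

/* Return the matching operation, or -1 if the name is unknown. */
static int sketch_str_to_op(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < SKETCH_OP_COUNT; i++) {
        if (sketch_op_names[i] && strcmp(name, sketch_op_names[i]) == 0)
            return (int)i;
    }
    return -1;
}
```

A linear scan is adequate for a table this small, and the -1 return gives a text interface a clear way to reject unknown names.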
2026-06-12 | block: add a "tag" for block status codes | Christoph Hellwig
The full name of the status codes is not good for user interfaces as it can contain white spaces. Add the name of the status code without the BLK_STS_ prefix as a tag so that it can be used for user interfaces. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev> Link: https://patch.msgid.link/20260611140703.2401204-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-12 | block: add a macro to initialize the status table | Christoph Hellwig
Prepare for adding a new value to the error table by adding a macro to fill it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev> Link: https://patch.msgid.link/20260611140703.2401204-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
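One way a table-filling macro can make room for an extra per-entry field is to derive that field from the enumerator name by stringification. The sketch below is a userspace illustration under that assumption; the `SKETCH_STS_*` enum, `struct sketch_sts_entry`, the `SKETCH_ENT()` macro and the entry contents are hypothetical and are not the kernel's table.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes, for illustration only. */
enum sketch_sts {
    SKETCH_STS_OK,
    SKETCH_STS_NOSPC,
    SKETCH_STS_IOERR,
    SKETCH_STS_COUNT
};

struct sketch_sts_entry {
    int err;            /* errno-style value */
    const char *tag;    /* short name without prefix, no spaces */
    const char *name;   /* human-readable name, may contain spaces */
};

/*
 * One macro fills a whole entry. The tag is generated from the
 * enumerator suffix, so adding the field needs no per-entry edits.
 */
#define SKETCH_ENT(_tag, _err, _name) \
    [SKETCH_STS_##_tag] = { .err = (_err), .tag = #_tag, .name = (_name) }

static const struct sketch_sts_entry sketch_sts_table[SKETCH_STS_COUNT] = {
    SKETCH_ENT(OK,    0,       ""),
    SKETCH_ENT(NOSPC, -ENOSPC, "critical space allocation"),
    SKETCH_ENT(IOERR, -EIO,    "I/O"),
};

/* Look up the tag for a status code. */
static const char *sketch_sts_tag(enum sketch_sts s)
{
    assert(s < SKETCH_STS_COUNT);
    return sketch_sts_table[s].tag;
}
```

Because the tag comes from a C identifier, it cannot contain white space, which is the property a text-based user interface needs.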
2026-06-09 | block: propagate in_flight to whole disk on partition I/O | Tang Yizhou
Currently, when I/O is submitted to a partition, the per-CPU in_flight[] counter is incremented only on the partition's block_device, not on the underlying whole disk. This leads to a problem which can be shown by a fio test:
lsblk
NAME      MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
mydev     252:1   0  20G  0  disk
└─mydev1  259:0   0  10G  0  part
iostat -xp 1
Device  r/s        rkB/s      ...  aqu-sz  %util
mydev   128153.00  512612.00  ...  13.22   72.20
mydev1  128154.00  512616.00  ...  13.22   100.00
%util is different between mydev and mydev1, which is unexpected. This is the cumulative effect of a series of patches. The root cause is commit e016b78201a2 ("block: return just one value from part_in_flight"), which deleted the branch in part_in_flight() that aggregated the whole-disk in_flight count on top of the partition's. Then the second commit is commit 10ec5e86f9b8 ("block: merge part_{inc,dev}_in_flight into their only callers"), which folded the whole-disk in_flight accounting into generic_start_io_acct() and generic_end_io_acct(). Those two helpers were then removed by commit e722fff238bb ("block: remove generic_{start,end}_io_acct"), and from that point on the whole disk's in_flight is no longer accounted at all. In update_io_ticks(), if calling bdev_count_inflight() finds that the inflight value of the whole device is 0, the accumulation of io_ticks will be skipped, causing the reported util% value to be underestimated. Fix it by restoring the whole-disk in_flight accounting. Fixes: e016b78201a2 ("block: return just one value from part_in_flight") Suggested-by: Leon Hwang <leon.huangfu@shopee.com> Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260526021555.359500-1-yizhou.tang@shopee.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
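The accounting rule being restored, that I/O on a partition also counts as in flight on the whole disk, can be sketched as below. This is a simplified userspace model, not the kernel code: it uses a single plain counter instead of per-CPU counters, and `struct sketch_bdev`, `sketch_io_start()` and `sketch_io_end()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_bdev {
    struct sketch_bdev *whole;  /* the whole disk; points to itself
                                   when this device is the disk */
    long in_flight;             /* I/Os currently outstanding */
};

static int sketch_is_partition(const struct sketch_bdev *b)
{
    return b->whole != b;
}

/* Account a started I/O on the device and, for partitions, the disk. */
static void sketch_io_start(struct sketch_bdev *b)
{
    assert(b != NULL && b->whole != NULL);
    b->in_flight++;
    if (sketch_is_partition(b))
        b->whole->in_flight++;
}

/* Mirror image of sketch_io_start() for a completed I/O. */
static void sketch_io_end(struct sketch_bdev *b)
{
    assert(b != NULL && b->whole != NULL);
    assert(b->in_flight > 0);
    b->in_flight--;
    if (sketch_is_partition(b))
        b->whole->in_flight--;
}
```

If only the partition counter were updated, a utilization calculation that consults the whole-disk count would see zero outstanding I/O and skip accumulating busy time, which matches the %util mismatch reported in the message.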
2026-06-08 | block: optimize I/O merge hot path with unlikely() hints | Steven Feng
Remove redundant '== false' comparisons and add unlikely() branch prediction hints in block I/O merge path functions. These functions (ll_new_hw_segment, ll_merge_requests_fn, and blk_rq_merge_ok) are executed on every I/O request merge attempt, making them critical hot paths. Data integrity check failures are rare events, so marking these conditions as unlikely() helps the CPU optimize the common case by improving branch prediction. Changes: - Replace 'func() == false' with 'unlikely(!func())' for better code style and branch prediction This micro-optimization reduces branch misprediction penalties in high-frequency I/O merge paths. Signed-off-by: Steven Feng <steven@joint-cloud.com> Link: https://patch.msgid.link/tencent_79B652BD0CC23E093F27914380F161E7E505@qq.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-08 | partitions: aix: bound the pp_count scan to the ppe array | Bryam Vargas
aix_partition() reads the physical volume descriptor into a fixed-size struct pvd and then scans its physical-partition-extent array: int numpps = be16_to_cpu(pvd->pp_count); ... for (i = 0; i < numpps; i += 1) { struct ppe *p = pvd->ppe + i; ... lp_ix = be16_to_cpu(p->lp_ix); pvd points at a single kmalloc()'d struct pvd whose ppe[] member holds a fixed ARRAY_SIZE(pvd->ppe) (1016) entries, but the loop runs up to the on-disk pp_count. pp_count is an unvalidated __be16 read straight from the descriptor, so a crafted AIX image with pp_count larger than 1016 drives the loop to read pvd->ppe[i] past the end of the allocation (up to 65535 entries, ~2 MB out of bounds). The partition scan runs without mounting anything, when a block device with a crafted AIX/IBM partition table appears (an attacker-supplied image attached with losetup -P, or a device auto-scanned by udev), via msdos_partition() -> aix_partition(). Clamp the scan to the number of entries the ppe[] array can hold. Fixes: 6ceea22bbbc8 ("partitions: add aix lvm partition support files") Cc: stable@vger.kernel.org Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me> Acked-by: Philippe De Muyter <phdm@macqel.be> Link: https://patch.msgid.link/20260607064137.302574-1-hexlabsecurity@proton.me Signed-off-by: Jens Axboe <axboe@kernel.dk>
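The clamp described here can be sketched in userspace C. The structures below are simplified, hypothetical stand-ins for the on-disk layout (`struct sketch_pvd`, `struct sketch_ppe`), and the sketch ignores the big-endian conversion that the real parser performs with be16_to_cpu(). Only the array size of 1016 entries is taken from the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PPE_ENTRIES 1016  /* fixed array capacity, per the text */

struct sketch_ppe {
    uint16_t lp_ix;              /* host byte order in this sketch */
};

struct sketch_pvd {
    uint16_t pp_count;           /* untrusted, read from the image */
    struct sketch_ppe ppe[SKETCH_PPE_ENTRIES];
};

/* Number of entries that may be scanned safely. */
static size_t sketch_scan_limit(const struct sketch_pvd *pvd)
{
    size_t cap = sizeof(pvd->ppe) / sizeof(pvd->ppe[0]);
    size_t n = pvd->pp_count;

    return n < cap ? n : cap;    /* clamp to the array capacity */
}

/* Count entries with a non-zero lp_ix without leaving the array. */
static size_t sketch_count_used(const struct sketch_pvd *pvd)
{
    size_t limit = sketch_scan_limit(pvd);
    size_t used = 0;

    assert(limit <= SKETCH_PPE_ENTRIES);
    for (size_t i = 0; i < limit; i++) {
        if (pvd->ppe[i].lp_ix != 0)
            used++;
    }
    return used;
}
```

A crafted pp_count of 65535 then scans only the 1016 entries that exist, while a legitimately small pp_count still limits the scan as before.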
2026-06-05block: Enable lock context analysisBart Van Assche
Now that all block/*.c files have been annotated, enable lock context analysis for all these source files. Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/e248ca3aeead238bbc489cf3afdafcbff9e41faf.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05block/mq-deadline: Make the lock context annotations compatible with ClangBart Van Assche
While sparse ignores the __acquires() and __releases() arguments, Clang verifies these. Make the arguments of __acquires() and __releases() acceptable for Clang. Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/3b6e336ced91e27213608ffce205ccd24f4ba285.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05block/Kyber: Make the lock context annotations compatible with ClangBart Van Assche
While sparse ignores the __acquires() and __releases() arguments, Clang verifies these. Make the arguments of __acquires() and __releases() acceptable for Clang. Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/91cb8c790fc8b26b8aa742569fbf8c2c1d099dac.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05block/blk-mq-debugfs: Improve lock context annotationsBart Van Assche
Make the existing lock context annotations compatible with Clang. Add the lock context annotations that are missing. Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/f58fe220ff98f9dfddfed4573f40005c773b7fb7.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05block/blk-iocost: Inline iocg_lock() and iocg_unlock()Bart Van Assche
Both iocg_lock() and iocg_unlock() use conditional locking. Fold these functions into their callers such that unlocking becomes unconditional. Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/f8c9867788957d2e40a32e23c6d9b866e480ad9d.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05block/blk-iocost: Split ioc_rqos_throttle()Bart Van Assche
Prepare for inlining iocg_lock() and iocg_unlock() by moving the code between these two calls into a new function. No functionality has been changed. Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/a6d3ed953cef6669d23a80923bf46600733cbdae.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05block/crypto: Annotate the crypto functionsBart Van Assche
Add the lock context annotations required for Clang's thread-safety analysis. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@kernel.org> Cc: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/297b40e43a7f9b7d20e91a6c44b41a69d01f5c63.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05block/cgroup: Inline blkg_conf_{open,close}_bdev_frozen()Bart Van Assche
The blkg_conf_open_bdev_frozen() calling convention is not compatible with lock context annotations. Fold both blkg_conf_open_bdev_frozen() and blkg_conf_close_bdev_frozen() into their only caller. This patch prepares for enabling lock context analysis. The type of 'memflags' has been changed from unsigned long into unsigned int to match the type of current->flags. See also <linux/sched.h>. Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/05661d1555decc6dd5389174ba448d803b72ed9a.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05block/blk-iocost: Combine two error paths in ioc_qos_write()Bart Van Assche
Reduce code duplication by combining two error paths. No functionality has been changed. Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/80d4fc1ecd5eaf187c0a31c63a1033a7326d4c7e.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05block/cgroup: Improve lock context annotationsBart Van Assche
Add lock context annotations where these are missing. Move the blkg_conf_prep() annotation into block/blk-cgroup.h to make it visible to all blkg_conf_prep() callers. Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/58ddd6e2b960bdfa03d0007984386bc0ba351391.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05block/cgroup: Split blkg_conf_exit()Bart Van Assche
Split blkg_conf_exit() into blkg_conf_unprep() and blkg_conf_close_bdev() because blkg_conf_exit() is not compatible with the Clang thread-safety annotations. Remove blkg_conf_exit(). Rename blkg_conf_exit_frozen() into blkg_conf_close_bdev_frozen(). Add thread-safety annotations to the new functions. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/c1ec1f1c4b675bc5f187f77b3e6436234c6b244c.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05block/cgroup: Split blkg_conf_prep()Bart Van Assche
Move the blkg_conf_open_bdev() call out of blkg_conf_prep() to make it possible to add lock context annotations to blkg_conf_prep(). Change an if-statement in blkg_conf_open_bdev() into a WARN_ON_ONCE() call. Export blkg_conf_open_bdev() because it is called by the BFQ I/O scheduler and the BFQ I/O scheduler may be built as a kernel module. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/e6ea0387f413217c8561a0ca54ce7b846aa5c7c5.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05block: Add WQ_PERCPU to alloc_workqueue usersMarco Crivellari
This continues the effort to refactor workqueue APIs, which began with the introduction of new workqueues and a new alloc_workqueue flag in: commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq") commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag") The refactoring is going to alter the default behavior of alloc_workqueue() to be unbound by default. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND must now use WQ_PERCPU. For more details see the Link tag below. In order to keep alloc_workqueue() behavior identical, explicitly request WQ_PERCPU. Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/ Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://patch.msgid.link/20260604105347.168322-1-marco.crivellari@suse.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-03mm: simplify the mempool_alloc_bulk APIChristoph Hellwig
The mempool_alloc_bulk API was modelled after the alloc_pages_bulk API, including some misunderstanding of it. Remove checking for NULL slots in the array, as alloc_pages_bulk and kmem_cache_alloc_bulk always fill the array from the beginning and thus we know the offset of the first failing allocation. This removes support for working well with alloc_pages_bulk used to refill page arrays that might have an entry removed from the middle, but that is only used by sunrpc and hopefully on its way out. Also remove the allocated parameter as it is redundant because the caller can simply specify an offset into the entries array. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260602160038.3976341-1-hch@lst.de Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-06-02block/partitions/acorn: use min in {riscix,linux}_partitionThorsten Blum
Use min() to replace the open-coded implementations and to simplify riscix_partition() and linux_partition(). Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20260602160757.973736-3-thorsten.blum@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-02block, bfq: release cgroup stats with bfq_groupYu Kuai
BFQ cgroup stats contain percpu counters embedded in struct bfq_group, but the old free path destroys them from bfq_pd_free(), which is tied to blkg policy-data teardown. That is not the same lifetime as struct bfq_group. BFQ pins bfq_group while bfq_queue entities refer to it, so bfq_pd_free() can drop the policy-data reference while other bfq_group references still exist. The following blkcg change also defers policy-data release through RCU and leaves BFQ to run the final bfqg_put() from an RCU callback. For that conversion, stats teardown must belong to the last bfq_group put, not to policy-data teardown. Move stats teardown to bfqg_put() so the embedded counters are destroyed exactly when the last bfq_group reference is released, before kfree(bfqg). Without this preparatory change, the RCU-delayed policy-data free conversion reproduced the following KASAN report: BUG: KASAN: slab-use-after-free in percpu_counter_destroy_many+0xf1/0x2e0 Write of size 8 at addr ffff88811d9409e0 by task test_blkcg/535 CPU: 0 UID: 0 PID: 535 Comm: test_blkcg Not tainted 7.1.0-rc2-g1e14adca0199 #1 PREEMPT ea13f83d4b74a12510d20db4a7d9a0fe8275f05c Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x54/0x70 print_address_description+0x77/0x200 ? percpu_counter_destroy_many+0xf1/0x2e0 print_report+0x64/0x70 kasan_report+0x118/0x150 ? 
percpu_counter_destroy_many+0xf1/0x2e0 percpu_counter_destroy_many+0xf1/0x2e0 __mmdrop+0x1d8/0x350 finish_task_switch+0x3f5/0x570 __schedule+0xe8e/0x18a0 schedule+0xfe/0x1c0 schedule_timeout+0x7f/0x1d0 __wait_for_common+0x26c/0x3f0 wait_for_completion_state+0x21/0x40 call_usermodehelper_exec+0x271/0x2c0 __request_module+0x296/0x410 elv_iosched_store+0x1bc/0x2c0 queue_attr_store+0x152/0x1c0 kernfs_fop_write_iter+0x1d7/0x280 vfs_write+0x580/0x630 ksys_write+0xec/0x190 do_syscall_64+0x156/0x490 entry_SYSCALL_64_after_hwframe+0x77/0x7f Allocated by task 535: kasan_save_track+0x3e/0x80 __kasan_kmalloc+0x72/0x90 bfq_pd_alloc+0x60/0x100 [bfq] blkg_create+0x3bb/0xbe0 blkg_lookup_create+0x3a2/0x460 blkg_conf_start+0x24a/0x2d0 bfq_io_set_weight+0x17f/0x430 [bfq] cgroup_file_write+0x1c5/0x4b0 kernfs_fop_write_iter+0x1d7/0x280 vfs_write+0x580/0x630 ksys_write+0xec/0x190 do_syscall_64+0x156/0x490 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 0: kasan_save_track+0x3e/0x80 kasan_save_free_info+0x46/0x50 __kasan_slab_free+0x3a/0x60 kfree+0x14e/0x4f0 rcu_core+0x6f3/0xcd0 handle_softirqs+0x1a0/0x550 __irq_exit_rcu+0x8c/0x150 irq_exit_rcu+0xe/0x20 sysvec_apic_timer_interrupt+0x6e/0x80 asm_sysvec_apic_timer_interrupt+0x1a/0x20 Last potentially related work creation: kasan_save_stack+0x3e/0x60 kasan_record_aux_stack+0x99/0xb0 call_rcu+0x55/0x5c0 blkg_free_workfn+0x130/0x220 process_scheduled_works+0x655/0xb60 worker_thread+0x446/0x600 kthread+0x1f4/0x230 ret_from_fork+0x259/0x420 ret_from_fork_asm+0x1a/0x30 Signed-off-by: Yu Kuai <yukuai@fygo.io> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260601061502.899552-1-yukuai@fygo.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-29Merge tag 'block-7.1-20260529' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Linus Torvalds
Pull block fix from Jens Axboe: "Just a single fix for the block side, making a slight tweak to a fix from this cycle" * tag 'block-7.1-20260529' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: blk-mq: reinsert cached request to the list
2026-05-29block: Add bvec_folio()Matthew Wilcox (Oracle)
This is a simple helper which replaces page_folio(bvec->bv_page). Minor improvement in readability, but the real motivation is to reduce the number of references to bvec->bv_page so that it can be changed with less work. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Leon Romanovsky <leon@kernel.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@linux.dev> Link: https://patch.msgid.link/20260528175905.1102280-2-willy@infradead.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-28block: export passthrough stats enabledKeith Busch
A user can enable I/O accounting for passthrough requests, so export the helper that checks if the request should be tracked. This will enable stacking drivers to report iostats for passthrough workloads. Since the stacking request_queue may not be the one providing the request, the API has to add a parameter for the caller to specify which one to check. Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260528010041.1533124-2-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-28block: add a bio_endio_status helperChristoph Hellwig
Add a helper that sets bi_status and calls bio_endio(), as that is a very common pattern, and convert the core block code over to it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Link: https://patch.msgid.link/20260528084632.2505277-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
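The pattern this helper folds together might look roughly like the userspace mock below. Everything here is an assumption made for illustration: the `struct bio` and `blk_status_t` definitions are minimal stand-ins, `bio_endio()` is mocked to record completion, and the parameter list shown for `bio_endio_status()` is a guess based on the commit description rather than the signature from the patch.

```c
/* Minimal stand-ins; the real kernel types are far richer. */
typedef unsigned char blk_status_t;

struct bio {
	blk_status_t bi_status;	/* completion status */
	int ended;		/* mock-only flag: set once completed */
};

/* Mock of bio_endio(): just records that completion ran. */
static void bio_endio(struct bio *bio)
{
	bio->ended = 1;
}

/*
 * Assumed shape of the helper: set the status, then complete the bio.
 * Replaces the open-coded two-step sequence at each call site.
 */
static void bio_endio_status(struct bio *bio, blk_status_t status)
{
	bio->bi_status = status;
	bio_endio(bio);
}
```

A call site that previously assigned `bi_status` and then called `bio_endio()` would collapse into a single `bio_endio_status(bio, status)` call under this assumed interface.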
2026-05-28block: mark biovec_init_pool staticChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://patch.msgid.link/20260527150646.2349405-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-27blk-throttle: schedule parent dispatch in tg_flush_bios()Tao Cui
tg_flush_bios() schedules pending_timer on the child tg's own service_queue, which causes throtl_pending_timer_fn() to dispatch from the child's pending_tree. For leaf cgroups this tree is empty, so the timer fires and exits without dispatching the throttled bio. The throttled bio sits in the parent's pending_tree with disptime set to jiffies (THROTL_TG_CANCELING zeroes all dispatch times), but the parent's timer is never explicitly rescheduled. The bio only gets dispatched when the parent timer eventually fires at its previously scheduled expiry. Fix by calling throtl_schedule_next_dispatch(sq->parent_sq, true) instead, matching what tg_set_limit() already does. This forces the parent's dispatch cycle to run immediately and flush all canceling bios without waiting for a stale timer. For the device deletion path (blk_throtl_cancel_bios), directly complete throttled bios with EIO via bio_io_error() instead of dispatching them through the timer -> work -> submission chain. This avoids a race with the SCSI state machine where bios can reach the SCSI layer while the device is in SDEV_CANCEL state, causing ENODEV instead of the expected EIO. Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/all/ag2owaQQoigp_fSV@shinmob/ Signed-off-by: Tao Cui <cuitao@kylinos.cn> Link: https://patch.msgid.link/20260522091530.1901437-1-cuitao@kylinos.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>