| Age | Commit message (Collapse) | Author |
|
Otherwise zone append commands will miss their integrity data. While
this works "fine" for auto-PI, it break file system PI and non-PI
metadata.
With this XFS on ZNS namespace with non-PI metadata and 512 byte sectors
with PI work, while PI 4k sector formats with PI work only when Caleb's
"block: fix integrity offset/length conversions" is applied as well.
Note that unlike regular writes, zone append does need remapping as
partitions are not supported on zoned block devices.
Fixes: df3c485e0e60 ("block: switch on bio operation in bio_integrity_prep")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://patch.msgid.link/20260624080014.1998650-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bio_integrity_alloc_buf usage of GFP_ flags is messed up. For one it
mixes GFP_NOFS and GFP_NOIO for neighbouring allocations, but it also
makes the allocations fail more often than needed. That code was copied
from bio_alloc_bioset which needs to do that so that it can punt to the
rescuer workqueue, but none of that is needed for the integrity
allocations that either sits in the file system or at the very bottom
of the I/O stack. Failing early means we'll do a fully waiting
allocation from the mempool ->alloc callback which is usually much
larger than required.
Fix this by passing a gfp_t so that the file system path can pass
GFP_NOFS and the auto-integrity code can pass GFP_NOIO, and don't
modify the allocation type except for disabling warnings.
Fixes: ec7f31b2a2d3 ("block: make bio auto-integrity deadlock safe")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://patch.msgid.link/20260624080014.1998650-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The request_queue is frozen and quiesced while the elevator init_sched()
method runs, so queue_lock is not needed for BFQ cgroup initialization.
Signed-off-by: Yu Kuai <yukuai@fygo.io>
Link: https://patch.msgid.link/1965073ea20f33114a8d903816b986e483b9bb34.1780621988.git.yukuai@fygo.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The correct lock order is q->queue_lock before blkcg->lock, and in order
to prevent deadlock from blkcg_destroy_blkgs(), trylock is used for
q->queue_lock while blkcg->lock is already held, this is hacky.
Refactor blkcg_destroy_blkgs() to hold blkcg->lock only long enough to
get the first blkg and then release it. Then take q->queue_lock and
blkcg->lock in the correct order to destroy the blkg. This is a very cold
path, so the extra lock/unlock cycles are acceptable.
Also prepare to convert protecting blkcg with blkcg_mutex instead of
queue_lock.
Signed-off-by: Yu Kuai <yukuai@fygo.io>
Link: https://patch.msgid.link/00b03cf74a9937cb4d6dd67a189ddc00a3de0451.1780621988.git.yukuai@fygo.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If a bio is already associated with a blkg, the blkcg is already pinned
until the bio is done, so there is no need for RCU protection. Otherwise,
protect blkcg_css() with RCU independently. Prepare to protect blkcg with
blkcg_mutex instead of queue_lock.
Signed-off-by: Yu Kuai <yukuai@fygo.io>
Link: https://patch.msgid.link/8496fa234b21d4b31b7f068766906d0bffcac8e6.1780621988.git.yukuai@fygo.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Change this in two steps:
1) hold rcu lock and do blkg_lookup() from fast path;
2) hold queue_lock directly from slow path, and don't nest it under rcu
lock;
Prepare to convert protecting blkcg with blkcg_mutex instead of
queue_lock.
Signed-off-by: Yu Kuai <yukuai@fygo.io>
Link: https://patch.msgid.link/93f33cc9e5a39dddb78dcd934d0c1d04b564fb00.1780621988.git.yukuai@fygo.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
With previous modification to delay freeing policy data after an RCU grace
period, prfill() can run under RCU instead of taking queue_lock. However,
policy teardown can still clear blkg->pd[plid] after blkcg_print_blkgs()
observes the policy enabled bit.
Load policy data once with READ_ONCE() and skip the blkg if teardown
already cleared it. Do the same in recursive stat walks for descendant
blkgs. Remove the stale BFQ debug queue_lock assertion because
blkcg_print_blkgs() no longer calls prfill() with queue_lock held. This
also lets ioc_qos_prfill() and ioc_cost_model_prfill() use IRQ-safe
ioc->lock locking without re-enabling IRQs while queue_lock is still held.
Signed-off-by: Yu Kuai <yukuai@fygo.io>
Link: https://patch.msgid.link/db7633d5e263dd1c2bf9b901762545a84b7d714e.1780621988.git.yukuai@fygo.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently blkcg_print_blkgs() must hold RCU to iterate blkgs from a
blkcg, and prfill() must hold queue_lock to prevent policy data from
being freed by policy deactivation. As a consequence, queue_lock has to
be nested under RCU from blkcg_print_blkgs().
Delay freeing policy data until after an RCU grace period so prfill() can
be protected by RCU alone.
Signed-off-by: Yu Kuai <yukuai@fygo.io>
Link: https://patch.msgid.link/e20e5d984b41a026d61851966bed35eb094c4bff.1780621988.git.yukuai@fygo.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
blkcg_print_one_stat() will be called for each blkg:
- access blkg->iostat, which is freed from rcu callback
blkg_free_workfn();
- access policy data from pd_stat_fn(), which is freed from
pd_free_fn(), while pd_free_fn() can be called by removing blkcg or
deactivating policy;
Take blkcg->lock while iterating so the blkgs stay online and both
blkg->iostat and policy data for activated policies stay valid. Use
irq-safe locking because blkcg->lock can be nested under q->queue_lock,
which is used from IRQ completion paths.
Prepare to convert protecting blkgs from request_queue with mutex.
Signed-off-by: Yu Kuai <yukuai@fygo.io>
Link: https://patch.msgid.link/05799877e720dcd300e2ddd4625e8e162959d7cc.1780621988.git.yukuai@fygo.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
[BUG]
Our fuzz testing triggered a blkcg use-after-free issue:
BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0
Call Trace:
...
blkcg_deactivate_policy+0x244/0x4d0
ioc_rqos_exit+0x44/0xe0
rq_qos_exit+0xba/0x120
__del_gendisk+0x50b/0x800
del_gendisk+0xff/0x190
...
[CAUSE]
process1 process2
cgroup_rmdir
...
css_killed_work_fn
offline_css
...
blkcg_destroy_blkgs
...
__blkg_release
css_put(&blkg->blkcg->css)
blkg_free
INIT_WORK(xxx, blkg_free_workfn)
schedule_work
css_put
...
blkcg_css_free
kfree(blkcg)--------blkcg has been freed!!!
====================================schedule_work
blkg_free_workfn
__del_gendisk
rq_qos_exit
ioc_rqos_exit
blkcg_deactivate_policy
mutex_lock(&q->blkcg_mutex)
spin_lock_irq(&q->queue_lock)
list_for_each_entry(blkg, xxx)
blkcg = blkg->blkcg
spin_lock(&blkcg->lock)-------UAF!!!
mutex_lock(&q->blkcg_mutex)
spin_lock_irq(&q->queue_lock)
/* Only then is the blkg removed from the list */
list_del_init(&blkg->q_node)
As a result, a blkg can still be reachable through q->blkg_list while
its ->blkcg has already been freed.
[Fix]
Fix this by deferring the blkcg css_put() until after the blkg has been
unlinked from q->blkg_list in blkg_free_workfn(). This ensures that the
blkcg outlives every blkg still reachable through q->blkg_list, so any
iterator holding q->queue_lock is guaranteed to observe a valid
blkg->blkcg.
While at it, move css_tryget_online() from blkg_create() into blkg_alloc()
so that the css reference is owned by the alloc/free pair rather than
straddling layers:
blkg_alloc() <-> blkg_free()
blkg_create() <-> blkg_destroy()
Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Suggested-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai@fygo.io>
Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>
Link: https://patch.msgid.link/20260616011746.2451461-1-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When multiple blkgs in the same blkcg are released concurrently,
a use-after-free can occur. The race happens when one blkg's
__blkcg_rstat_flush() removes another blkg's iostat entries via
llist_del_all(). The second blkg sees an empty list and proceeds
to free itself while the first is still iterating over its entries.
Move the flush from __blkg_release() (RCU callback) to blkg_release()
(before call_rcu). This ensures the RCU grace period waits for any
concurrent flush's rcu_read_lock() section to complete before freeing.
Cc: stable@vger.kernel.org
Cc: Jay Shin <jaeshin@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Fixes: 20cb1c2fb756 ("blk-cgroup: Flush stats before releasing blkcg_gq")
Reported-by: coregee2000@gmail.com
Closes: https://lore.kernel.org/linux-block/CAHPqNmwT9oRpem3J3erS_W0uSQND47LGGSBsNxP8E6uSUish1w@mail.gmail.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Jose Fernandez (Anthropic) <jose.fernandez@linux.dev>
Link: https://patch.msgid.link/20260205155425.342084-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Writing 0 to BFQ's low_latency attribute ends weight raising for active,
idle and async queues. The async cgroup path walks q->blkg_list, converts
each blkg to BFQ policy data and then reads bfqg->async_bfqq and
bfqg->async_idle_bfqq.
That walk was protected only by bfqd->lock. blkcg release work is
serialized by q->blkcg_mutex and q->queue_lock instead, and
blkg_free_workfn() can call BFQ's pd_free_fn before it removes
blkg->q_node from q->blkg_list. A low_latency reset can therefore still
find the blkg on the queue list after the BFQ policy data has been freed.
The buggy scenario involves two paths, with each column showing the order
within that path:
BFQ low_latency reset: blkcg blkg release work:
1. bfq_low_latency_store() 1. blkg_free_workfn() takes
calls bfq_end_wr(). q->blkcg_mutex.
2. bfq_end_wr_async() walks 2. BFQ pd_free_fn drops the
q->blkg_list. final bfq_group reference.
3. blkg_to_bfqg() returns 3. blkg->q_node remains on
the stale policy data. q->blkg_list until list_del_init().
4. bfq_end_wr_async_queues()
reads async queue fields.
Fix this by taking q->blkcg_mutex and q->queue_lock around the
q->blkg_list walk, then taking bfqd->lock before touching BFQ async
queues. The mutex serializes against policy-data free and queue_lock
stabilizes the list. Move the async reset out of bfq_end_wr()'s existing
bfqd->lock critical section so the lock order matches blkcg policy
callbacks.
Validation reproduced this kernel report:
BUG: KASAN: slab-use-after-free in bfq_end_wr_async_queues+0x246/0x340
Call Trace:
<TASK>
dump_stack_lvl+0x66/0xa0
print_report+0xce/0x630
? bfq_end_wr_async_queues+0x246/0x340
? srso_alias_return_thunk+0x5/0xfbef5
? __virt_addr_valid+0x20d/0x410
? bfq_end_wr_async_queues+0x246/0x340
kasan_report+0xe0/0x110
? bfq_end_wr_async_queues+0x246/0x340
bfq_end_wr_async_queues+0x246/0x340
bfq_end_wr_async+0xba/0x180
bfq_low_latency_store+0x4e5/0x690
? 0xffffffffc02150da
? __pfx_bfq_low_latency_store+0x10/0x10
? __pfx_bfq_low_latency_store+0x10/0x10
elv_attr_store+0xc4/0x110
kernfs_fop_write_iter+0x2f5/0x4a0
vfs_write+0x604/0x11f0
? __pfx_locks_remove_posix+0x10/0x10
? __pfx_vfs_write+0x10/0x10
ksys_write+0xf9/0x1d0
? __pfx_ksys_write+0x10/0x10
do_syscall_64+0x115/0x6a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Allocated by task 544:
kasan_save_stack+0x33/0x60
kasan_save_track+0x14/0x30
__kasan_kmalloc+0xaa/0xb0
bfq_pd_alloc+0xc0/0x1b0
blkg_alloc+0x346/0x960
blkg_create+0x8c2/0x10d0
bio_associate_blkg_from_css+0x9f3/0xfa0
bio_associate_blkg+0xd9/0x200
bio_init+0x303/0x640
__blkdev_direct_IO_simple+0x56b/0x8a0
blkdev_direct_IO+0x8e7/0x2580
blkdev_read_iter+0x205/0x400
vfs_read+0x7b0/0xda0
ksys_read+0xf9/0x1d0
do_syscall_64+0x115/0x6a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task 465:
kasan_save_stack+0x33/0x60
kasan_save_track+0x14/0x30
kasan_save_free_info+0x3b/0x60
__kasan_slab_free+0x5f/0x80
kfree+0x307/0x580
blkg_free_workfn+0xef/0x460
process_one_work+0x8d0/0x1870
worker_thread+0x575/0xf80
kthread+0x2e7/0x3c0
ret_from_fork+0x576/0x810
ret_from_fork_asm+0x1a/0x30
Fixes: 44e44a1b329e ("block, bfq: improve responsiveness")
Assisted-by: Codex:gpt-5.5
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
Reviewed-by: Tao Cui <cuitao@kylinos.cn>
Link: https://patch.msgid.link/20260621135930.2657810-1-zzzccc427@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Only decrement the static key when we had items and thus it was
incremented before.
Fixes: e8dcf2d142bd ("block: add configurable error injection")
Reported-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260622160752.1552516-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
For the incoming usage of IOMAP_DIO_BOUNCE in btrfs, btrfs has set
iov_iter::nofault to prevent deadlock when a page fault is needed to
read out the buffer.
However bio_iov_iter_bounce_write() doesn't respect iov_iter::nofault
flag, and just call a plain copy_from_iter() so it can still trigger
page fault and cause deadlock in btrfs.
Fix it by utilizing copy_folio_from_iter_atomic() if nofault flag is
set, otherwise use copy_folio_from_iter().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/9c165a314022b61566eb247852eb773ca6c70889.1781597506.git.wqu@suse.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
For the incoming IOMAP_DIO_BOUNCE flag usage inside btrfs, it's pretty
easy to hit short copy inside bio_iov_iter_bounce_write().
This is because btrfs has disabled page fault to avoid certain deadlock
during direct writes, and instead btrfs manually fault in the pages then
retry.
And inside bio_iov_iter_bounce_write(), if we hit a short write, we
didn't revert the iov_iter, which can cause problems like unexpected
garbage for the next retry.
Revert the iov_iter after a short copy.
One thing to note is that, the folio is allocated then immediately
queued into the bio, so the proper revert size should be
(bi_size - this_len + copied).
Fixes: 8dd5e7c75d7b ("block: add helpers to bounce buffer an iov_iter into bios")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/c400989f227343b134110773d5acaaacf7024574.1781597506.git.wqu@suse.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The patch removes the automatic plug/unplug operations from __submit_bio()
that were added to cache nsecs time when no explicit plug is used.
The plug mechanism is most effective when batching multiple I/O
operations together. Creating a plug for every bio submission
provides minimal benefit while adding function call overhead and
stack usage for every I/O operation.
Below is performance comparison with the latest upstream kernel.
Iotype qd nj rmix mpstat busy mpstat busy without plug
Randrw 1 20 100 53% 24%
Randrw 1 40 100 70% 24%
Randrw 1 20 70 40% 24%
Randrw 1 40 70 60% 26%
Randrw 1 20 0 14% 6%
Randrw 1 40 0 20% 7%
Fixes: 060406c61c7c ("block: add plug while submitting IO")
Signed-off-by: Wen Xiong <wenxiong@linux.ibm.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260616143121.878021-1-wenxiong@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
blkdev_uring_cmd() checks IORING_URING_CMD_REISSUE to determine whether
this is the first issue. However, this flag lives in cmd->flags instead
of issue_flags.
Coincidentally, IO_URING_F_NONBLOCK shares bit 31 with
IORING_URING_CMD_REISSUE. As a result, the SQE read was never performed,
bic->len remained zero, and every BLOCK_URING_CMD_DISCARD failed with
-EINVAL.
Fix it by checking cmd->flags as intended.
Cc: stable@vger.kernel.org
Fixes: 212ec34e4e72 ("block: only read from sqe on initial invocation of blkdev_uring_cmd")
Signed-off-by: Yitang Yang <yi1tang.yang@gmail.com>
Link: https://patch.msgid.link/20260616155129.406057-1-yi1tang.yang@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mikulas Patocka:
- small cleanups in dm-vdo, dm-raid, dm-cache, dm-zoned-metadata
- rework of dm-ima
- introduce dm-inlinecrypt
- fix wrong return value in dm-ioctl
- fix rcu stall when polling
* tag 'for-7.2/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm-zoned-metadata: Use strscpy() to copy device name
dm cache: make smq background work limit configurable
dm-inlinecrypt: add support for hardware-wrapped keys
dm: limit target bio polling to one shot
dm-ioctl: report an error if a device has no table
dm: add documentation for dm-inlinecrypt target
dm-inlinecrypt: add target for inline block device encryption
block: export blk-crypto symbols required by dm-inlinecrypt
dm-ima: use active table's size if available
dm-ima: Fail more gracefully in dm_ima_measure_on_*
dm-ima: Handle race between rename and table swap
dm-ima: Fix issues with dm_ima_measure_on_device_rename
dm-ima: remove new_map from dm_ima_measure_on_device_clear
dm-ima: Fix UAF errors and measuring incorrect context
dm-ima: don't copy the active table to the inactive table
dm-ima: Remove status_flags from dm_ima_measure_on_table_load()
dm-ima: remove broken last_target_measured logic
dm-ima: remove dm_ima_reset_data()
dm-raid: only requeue bios when dm is suspending
dm vdo: use get_random_u32() where appropriate
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:
- NVMe pull request via Keith:
- Per-controller admin and IO timeout sysfs attributes, and
letting the block layer set request timeouts (Maurizio,
Maximilian)
- Multipath passthrough iostats, and PCI P2PDMA enablement for
multipath devices (Keith, Kiran)
- A new diag sysfs attribute group exporting per-controller
counters (retries, multipath failover, error counters, requeue
and failure counts, reset and reconnect events) (Nilay)
- FDP configuration validation and bounds check fixes (liuxixin)
- Various nvmet fixes, including a pre-auth out-of-bounds read in
the Discovery Get Log Page handler, auth payload bounds
validation, and tcp error-path leak fixes (Bryam, Tianchu,
Geliang)
- nvme-tcp lockdep and workqueue fixes (Shin'ichiro, Kuniyuki,
Eric)
- Assorted other fixes and cleanups (John, Yao, Chao, Mateusz,
Achkinazi, Wentao)
- MD pull request via Yu Kuai:
- raid1/raid10 fixes for a deadlock in the read error recovery
path, error-path detection and bio accounting with cloned bios,
and an nr_pending leak in the REQ_ATOMIC bad-block error path
(Abd-Alrhman)
- PCI P2PDMA propagation from member devices to the RAID device
(Kiran)
- dm-raid bio requeue fix, and various smaller fixes and cleanups
(Benjamin, Chen, Li, Thorsten)
- Enable Clang lock context analysis for the block layer, with the
accompanying annotations across queue limits, the blk_holder_ops
callbacks, crypto, cgroup, iocost, kyber and mq-deadline (Bart)
- Block status code infrastructure work: a tagged status table, a
str_to_blk_op() helper, a bio_endio_status() helper, and on top of
that a new configurable block-layer error injection facility
(Christoph)
- DRBD netlink rework, replacing the genl_magic machinery with explicit
netlink serialization and moving the DRBD UAPI headers to
include/uapi/linux/ (Christoph Böhmwalder)
- bvec improvements: a bvec_folio() helper and making the bvec_iter
helpers proper inline functions (Willy, Christoph)
- ublk cleanups and a canceling-flag fix for the disk-not-allocated
case (Caleb, Ming)
- Partition handling fixes: bound the AIX pp_count scan, fix an of_node
refcount leak, and replace __get_free_page() with kmalloc() (Bryam,
Wentao, Mike)
- Convert numa_node to int in blk_mq_hw_ctx and ->init_request, and add
WQ_PERCPU to the block workqueue users (Mateusz, Marco)
- Block statistics and tracing: propagate in-flight to the whole disk
on partition IO, export passthrough stats, and a new
block_rq_tag_wait tracepoint (Tang, Keith, Aaron)
- A round of removals, unexports and cleanups across bio, direct-io and
the bvec helpers (Christoph)
- Various driver fixes (mtip32xx use-after-free, rbd snap_count
validation and strscpy conversion, nbd socket lockdep reclassify,
virtio-blk zone report clamp, floppy) and a batch of MAINTAINERS
email/list updates (Coly, Li, Yu, Christoph Böhmwalder)
- Other little fixes and cleanups all over
* tag 'for-7.2/block-20260615' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (117 commits)
MAINTAINERS: Update Coly Li's email address
block: check bio split for unaligned bvec
nbd: Reclassify sockets to avoid lockdep circular dependency
block: add configurable error injection
block: add a str_to_blk_op helper
block: add a "tag" for block status codes
block: add a macro to initialize the status table
floppy: Drop unused pnp driver data
block: propagate in_flight to whole disk on partition I/O
virtio-blk: clamp zone report to the report buffer capacity
block: optimize I/O merge hot path with unlikely() hints
drivers/block/rbd: Use strscpy() to copy strings into arrays
partitions: aix: bound the pp_count scan to the ppe array
block: Enable lock context analysis
block/mq-deadline: Make the lock context annotations compatible with Clang
block/Kyber: Make the lock context annotations compatible with Clang
block/blk-mq-debugfs: Improve lock context annotations
block/blk-iocost: Inline iocg_lock() and iocg_unlock()
block/blk-iocost: Split ioc_rqos_throttle()
block/crypto: Annotate the crypto functions
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
- Support for "allocation tokens" (currently available in Clang 22+)
for smarter partitioning of kmalloc caches based on the allocated
object type, which can be enabled instead of the "random"
per-caller-address-hash partitioning.
It should be able to deterministically separate types containing a
pointer from those that do not (Marco Elver)
- Improvements and simplification of the kmem_cache_alloc_bulk() and
mempool_alloc_bulk() API. This includes adaptation of callers
(Christoph Hellwig)
- Performance improvements and cleanups related mostly to sheaves
refill (Hao Li, Shengming Hu, Vlastimil Babka)
- Several fixups for the slabinfo tool (Xuewen Wang)
* tag 'slab-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm/slab: do not limit zeroing to orig_size when only red zoning is enabled
mm/slub: preserve original size in _kmalloc_nolock_noprof retry path
mm: simplify the mempool_alloc_bulk API
mm/slab: improve kmem_cache_alloc_bulk
mm/slub: detach and reattach partial slabs in batch
mm/slub: introduce helpers for node partial slab state
mm/slub: use empty sheaf helpers for oversized sheaves
tools/mm/slabinfo: remove redundant slab->partial assignment
tools/mm/slabinfo: remove dead assignment in get_obj_and_str()
tools/mm/slabinfo: Fix trace disable logic inversion
MAINTAINERS: add slab-related scripts and tools to SLAB ALLOCATOR
mm/slub: fix typo in sheaves comment
mm, slab: simplify returning slab in __refill_objects_node()
mm, slab: add an optimistic __slab_try_return_freelist()
slab: fix kernel-docs for mm-api
slab: improve KMALLOC_PARTITION_RANDOM randomness
slab: support for compiler-assisted type-based slab cache partitioning
mm/slub: defer freelist construction until after bulk allocation from a new slab
|
|
Offsets and lengths need to be validated against the dma alignment. This
check was skipped for sufficiently a small bio with a single bvec, which
may allow an invalid request dispatched to the driver. Force the
validation for an unaligned bvec by forcing the bio split path that
handles this condition.
Fixes: 7eac33186957 ("iomap: simplify direct io validity check")
Fixes: 5ff3f74e145a ("block: simplify direct io validity check")
Reported-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://patch.msgid.link/20260612223205.465913-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a new block error injection interface that allows to inject specific
status code for specific ranges.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev>
Link: https://patch.msgid.link/20260611140703.2401204-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a helper to find the REQ_OP_XYZ constant from the "XYZ" string.
This will be used for the error injection debugfs interface.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev>
Link: https://patch.msgid.link/20260611140703.2401204-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The full name of the status codes is not good for user interfaces as it
can contain white spaces. Add the name of the status code without the
BLK_STS_ prefix as a tag so that it can be used for user interfaces.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev>
Link: https://patch.msgid.link/20260611140703.2401204-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Prepare for adding a new value to the error table by adding a macro
to fill it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev>
Link: https://patch.msgid.link/20260611140703.2401204-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Now when I/O is submitted to a partition, the per-CPU in_flight[]
counter is incremented only on the partition's block_device, not on the
underlying whole disk. This leads to a problem which can be shown by a
fio test:
lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
mydev 252:1 0 20G 0 disk
└─mydev1 259:0 0 10G 0 part
iostat -xp 1
Device r/s rkB/s ... aqu-sz %util
mydev 128153.00 512612.00 ... 13.22 72.20
mydev1 128154.00 512616.00 ... 13.22 100.00
%util is different between mydev and mydev1, which is unexpected.
This is the cumulative effect of a series of patches. The root cause is
commit e016b78201a2 ("block: return just one value from part_in_flight"),
which deleted the branch in part_in_flight() that aggregated the whole-disk
in_flight count on top of the partition's. Then the second commit is
commit 10ec5e86f9b8 ("block: merge part_{inc,dev}_in_flight into their
only callers"), which folded the whole-disk in_flight accounting into
generic_start_io_acct() and generic_end_io_acct(). Those two helpers
were then removed by commit e722fff238bb ("block: remove
generic_{start,end}_io_acct"), and from that point on the whole disk's
in_flight is no longer accounted at all.
In update_io_ticks(), if calling bdev_count_inflight() finds that the
inflight value of the whole device is 0, the accumulation of io_ticks will
be skipped, causing the reported util% value to be underestimated.
Fix it by restoring the whole-disk in_flight accounting.
Fixes: e016b78201a2 ("block: return just one value from part_in_flight")
Suggested-by: Leon Hwang <leon.huangfu@shopee.com>
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260526021555.359500-1-yizhou.tang@shopee.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Remove redundant '== false' comparisons and add unlikely() branch
prediction hints in block I/O merge path functions.
These functions (ll_new_hw_segment, ll_merge_requests_fn, and
blk_rq_merge_ok) are executed on every I/O request merge attempt,
making them critical hot paths. Data integrity check failures are
rare events, so marking these conditions as unlikely() helps the
CPU optimize the common case by improving branch prediction.
Changes:
- Replace 'func() == false' with 'unlikely(!func())' for better
code style and branch prediction
This micro-optimization reduces branch misprediction penalties in
high-frequency I/O merge paths.
Signed-off-by: Steven Feng <steven@joint-cloud.com>
Link: https://patch.msgid.link/tencent_79B652BD0CC23E093F27914380F161E7E505@qq.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
aix_partition() reads the physical volume descriptor into a fixed-size
struct pvd and then scans its physical-partition-extent array:
int numpps = be16_to_cpu(pvd->pp_count);
...
for (i = 0; i < numpps; i += 1) {
struct ppe *p = pvd->ppe + i;
...
lp_ix = be16_to_cpu(p->lp_ix);
pvd points at a single kmalloc()'d struct pvd whose ppe[] member holds a
fixed ARRAY_SIZE(pvd->ppe) (1016) entries, but the loop runs up to the
on-disk pp_count. pp_count is an unvalidated __be16 read straight from
the descriptor, so a crafted AIX image with pp_count larger than 1016
drives the loop to read pvd->ppe[i] past the end of the allocation (up
to 65535 entries, ~2 MB out of bounds).
The partition scan runs without mounting anything, when a block device
with a crafted AIX/IBM partition table appears (an attacker-supplied
image attached with losetup -P, or a device auto-scanned by udev), via
msdos_partition() -> aix_partition().
Clamp the scan to the number of entries the ppe[] array can hold.
Fixes: 6ceea22bbbc8 ("partitions: add aix lvm partition support files")
Cc: stable@vger.kernel.org
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
Acked-by: Philippe De Muyter <phdm@macqel.be>
Link: https://patch.msgid.link/20260607064137.302574-1-hexlabsecurity@proton.me
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Now that all block/*.c files have been annotated, enable lock context
analysis for all these source files.
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/e248ca3aeead238bbc489cf3afdafcbff9e41faf.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
While sparse ignores the __acquires() and __releases() arguments, Clang
verifies these. Make the arguments of __acquires() and __releases()
acceptable for Clang.
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/3b6e336ced91e27213608ffce205ccd24f4ba285.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
While sparse ignores the __acquires() and __releases() arguments, Clang
verifies these. Make the arguments of __acquires() and __releases()
acceptable for Clang.
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/91cb8c790fc8b26b8aa742569fbf8c2c1d099dac.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Make the existing lock context annotations compatible with Clang. Add
the lock context annotations that are missing.
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/f58fe220ff98f9dfddfed4573f40005c773b7fb7.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Both iocg_lock() and iocg_unlock() use conditional locking. Fold these
functions into their callers such that unlocking becomes unconditional.
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/f8c9867788957d2e40a32e23c6d9b866e480ad9d.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Prepare for inlining iocg_lock() and iocg_unlock() by moving the code
between these two calls into a new function. No functionality has been
changed.
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/a6d3ed953cef6669d23a80923bf46600733cbdae.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add the lock context annotations required for Clang's thread-safety
analysis.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Cc: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/297b40e43a7f9b7d20e91a6c44b41a69d01f5c63.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The blkg_conf_open_bdev_frozen() calling convention is not compatible
with lock context annotations. Fold both blkg_conf_open_bdev_frozen()
and blkg_conf_close_bdev_frozen() into their only caller. This patch
prepares for enabling lock context analysis.
The type of 'memflags' has been changed from unsigned long into unsigned
int to match the type of current->flags. See also <linux/sched.h>.
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/05661d1555decc6dd5389174ba448d803b72ed9a.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Reduce code duplication by combining two error paths. No functionality
has been changed.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/80d4fc1ecd5eaf187c0a31c63a1033a7326d4c7e.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add lock context annotations where these are missing. Move the
blkg_conf_prep() annotation into block/blk-cgroup.h to make it visible
to all blkg_conf_prep() callers.
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/58ddd6e2b960bdfa03d0007984386bc0ba351391.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Split blkg_conf_exit() into blkg_conf_unprep() and blkg_conf_close_bdev()
because blkg_conf_exit() is not compatible with the Clang thread-safety
annotations. Remove blkg_conf_exit(). Rename blkg_conf_exit_frozen() into
blkg_conf_close_bdev_frozen(). Add thread-safety annotations to the new
functions.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/c1ec1f1c4b675bc5f187f77b3e6436234c6b244c.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Move the blkg_conf_open_bdev() call out of blkg_conf_prep() to make it
possible to add lock context annotations to blkg_conf_prep(). Change an
if-statement in blkg_conf_open_bdev() into a WARN_ON_ONCE() call. Export
blkg_conf_open_bdev() because it is called by the BFQ I/O scheduler and
the BFQ I/O scheduler may be built as a kernel module.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/e6ea0387f413217c8561a0ca54ce7b846aa5c7c5.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.
In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.
Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://patch.msgid.link/20260604105347.168322-1-marco.crivellari@suse.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The mempool_alloc_bulk was modelled after the alloc_pages_bulk API,
including some misunderstanding of it.
Remove checking for NULL slots in the array, as alloc_pages_bulk and
kmem_cache_alloc_bulk always fill the array from the beginning and thus
we know the offset of the first failing allocation. This removes support
for working well with alloc_pages_bulk used to refill page arrays that
might have an entry removed from in the middle, but that is only used by
sunrpc and hopefully on it's way out.
Also remove the allocated parameter as it is redundant because the caller
can simply specific and offset into the entries array.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260602160038.3976341-1-hch@lst.de
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
|
|
Use min() to replace the open-coded implementations and to simplify
riscix_partition() and linux_partition().
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20260602160757.973736-3-thorsten.blum@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
BFQ cgroup stats contain percpu counters embedded in struct bfq_group,
but the old free path destroys them from bfq_pd_free(), which is tied
to blkg policy-data teardown.
That is not the same lifetime as struct bfq_group. BFQ pins bfq_group
while bfq_queue entities refer to it, so bfq_pd_free() can drop the
policy-data reference while other bfq_group references still exist. The
following blkcg change also defers policy-data release through RCU and
leaves BFQ to run the final bfqg_put() from an RCU callback. For that
conversion, stats teardown must belong to the last bfq_group put, not to
policy-data teardown.
Move stats teardown to bfqg_put() so the embedded counters are destroyed
exactly when the last bfq_group reference is released, before kfree(bfqg).
Without this preparatory change, the RCU-delayed policy-data free
conversion reproduced the following KASAN report:
BUG: KASAN: slab-use-after-free in percpu_counter_destroy_many+0xf1/0x2e0
Write of size 8 at addr ffff88811d9409e0 by task test_blkcg/535
CPU: 0 UID: 0 PID: 535 Comm: test_blkcg Not tainted 7.1.0-rc2-g1e14adca0199 #1 PREEMPT ea13f83d4b74a12510d20db4a7d9a0fe8275f05c
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x54/0x70
print_address_description+0x77/0x200
? percpu_counter_destroy_many+0xf1/0x2e0
print_report+0x64/0x70
kasan_report+0x118/0x150
? percpu_counter_destroy_many+0xf1/0x2e0
percpu_counter_destroy_many+0xf1/0x2e0
__mmdrop+0x1d8/0x350
finish_task_switch+0x3f5/0x570
__schedule+0xe8e/0x18a0
schedule+0xfe/0x1c0
schedule_timeout+0x7f/0x1d0
__wait_for_common+0x26c/0x3f0
wait_for_completion_state+0x21/0x40
call_usermodehelper_exec+0x271/0x2c0
__request_module+0x296/0x410
elv_iosched_store+0x1bc/0x2c0
queue_attr_store+0x152/0x1c0
kernfs_fop_write_iter+0x1d7/0x280
vfs_write+0x580/0x630
ksys_write+0xec/0x190
do_syscall_64+0x156/0x490
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Allocated by task 535:
kasan_save_track+0x3e/0x80
__kasan_kmalloc+0x72/0x90
bfq_pd_alloc+0x60/0x100 [bfq]
blkg_create+0x3bb/0xbe0
blkg_lookup_create+0x3a2/0x460
blkg_conf_start+0x24a/0x2d0
bfq_io_set_weight+0x17f/0x430 [bfq]
cgroup_file_write+0x1c5/0x4b0
kernfs_fop_write_iter+0x1d7/0x280
vfs_write+0x580/0x630
ksys_write+0xec/0x190
do_syscall_64+0x156/0x490
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task 0:
kasan_save_track+0x3e/0x80
kasan_save_free_info+0x46/0x50
__kasan_slab_free+0x3a/0x60
kfree+0x14e/0x4f0
rcu_core+0x6f3/0xcd0
handle_softirqs+0x1a0/0x550
__irq_exit_rcu+0x8c/0x150
irq_exit_rcu+0xe/0x20
sysvec_apic_timer_interrupt+0x6e/0x80
asm_sysvec_apic_timer_interrupt+0x1a/0x20
Last potentially related work creation:
kasan_save_stack+0x3e/0x60
kasan_record_aux_stack+0x99/0xb0
call_rcu+0x55/0x5c0
blkg_free_workfn+0x130/0x220
process_scheduled_works+0x655/0xb60
worker_thread+0x446/0x600
kthread+0x1f4/0x230
ret_from_fork+0x259/0x420
ret_from_fork_asm+0x1a/0x30
Signed-off-by: Yu Kuai <yukuai@fygo.io>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260601061502.899552-1-yukuai@fygo.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block fix from Jens Axboe:
"Just a single fix for the block side, making a slight tweak to a fix
from this cycle"
* tag 'block-7.1-20260529' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
blk-mq: reinsert cached request to the list
|
|
This is a simple helper which replaces page_folio(bvec->bv_page).
Minor improvement in readability, but the real motivation is to reduce
the number of references to bvec->bv_page so that it can be changed
with less work.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@linux.dev>
Link: https://patch.msgid.link/20260528175905.1102280-2-willy@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
A user can enable io accounting for passthrough requests, so export the
helper that checks if the request should be tracked. This will enable
stacking drivers to to report iostats for passthrough workloads. Since
the stacking request_queue may not be the one providing the request, the
API has to add a parameter for the caller to specify which one to check.
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260528010041.1533124-2-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a helper that sets bi_status and call bio_endio() as that is a very
common pattern and convert the core block code over to it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Link: https://patch.msgid.link/20260528084632.2505277-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://patch.msgid.link/20260527150646.2349405-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
tg_flush_bios() schedules pending_timer on the child tg's own
service_queue, which causes throtl_pending_timer_fn() to dispatch from
the child's pending_tree. For leaf cgroups this tree is empty, so the
timer fires and exits without dispatching the throttled bio.
The throttled bio sits in the parent's pending_tree with disptime set
to jiffies (THROTL_TG_CANCELING zeroes all dispatch times), but the
parent's timer is never explicitly rescheduled. The bio only gets
dispatched when the parent timer eventually fires at its previously
scheduled expiry.
Fix by calling throtl_schedule_next_dispatch(sq->parent_sq, true)
instead, matching what tg_set_limit() already does. This forces the
parent's dispatch cycle to run immediately and flush all canceling
bios without waiting for a stale timer.
For the device deletion path (blk_throtl_cancel_bios), directly
complete throttled bios with EIO via bio_io_error() instead of
dispatching them through the timer -> work -> submission chain.
This avoids a race with the SCSI state machine where bios can reach
the SCSI layer while the device is in SDEV_CANCEL state, causing
ENODEV instead of the expected EIO.
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/all/ag2owaQQoigp_fSV@shinmob/
Signed-off-by: Tao Cui <cuitao@kylinos.cn>
Link: https://patch.msgid.link/20260522091530.1901437-1-cuitao@kylinos.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|