linux-stable.git/block, branch v6.18.27

block: relax pgmap check in bio_add_page for compatible zone device pages

2026-05-07T04:11:42+00:00

commit 41c665aae2b5dbecddddcc8ace344caf630cc7a4 upstream.

bio_add_page() and bio_integrity_add_page() reject pages from different
dev_pagemaps entirely, returning 0 even when those pages have compatible
DMA mapping requirements. This forces callers to start a new bio when
buffers span pgmap boundaries, even though the pages could safely coexist
as separate bvec entries.

This matters for guests where memory is registered through
devm_memremap_pages() with MEMORY_DEVICE_GENERIC in multiple calls,
creating separate dev_pagemaps for each chunk. When a direct I/O buffer
spans two such chunks, bio_add_page() rejects the second page, forcing an
unnecessary bio split or I/O failure.

Introduce zone_device_pages_compatible() in blk.h to check whether two
pages can coexist in the same bio as separate bvec entries. The block DMA
iterator (blk_dma_map_iter_start) caches the P2PDMA mapping state from the
first segment and applies it to all others, so P2PDMA pages from different
pgmaps must not be mixed, and neither must P2PDMA and non-P2PDMA pages.
All other combinations (MEMORY_DEVICE_GENERIC pages from different pgmaps,
or MEMORY_DEVICE_GENERIC with normal RAM) use the same dma_map_phys path
and are safe.

Replace the blanket zone_device_pages_have_same_pgmap() rejection with
zone_device_pages_compatible(), while keeping
zone_device_pages_have_same_pgmap() as a merge guard.
Pages from different pgmaps can be added as separate bvec entries but
must not be coalesced into the same segment, as that would make
it impossible to recover the correct pgmap via page_pgmap().

Fixes: 49580e690755 ("block: add check when merging zone device pages")
Cc: stable@vger.kernel.org
Signed-off-by: Naman Jain 
Reviewed-by: Christoph Hellwig 
Link: https://patch.msgid.link/20260410153414.4159050-3-namjain@linux.microsoft.com
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

block: fix zone write plugs refcount handling in disk_zone_wplug_schedule_bio_work()

2026-05-07T04:11:41+00:00

commit 0a8b8af896e0ef83e188e1fe20f98f2bbb1c2459 upstream.

The function disk_zone_wplug_schedule_bio_work() always takes a
reference on the zone write plug of the BIO work being scheduled. This
ensures that the zone write plug cannot be freed while the BIO work is
being scheduled but has not run yet. However, this unconditional
reference taking is fragile since the reference taken is released by the
BIO work blk_zone_wplug_bio_work() function, which implies that there
always must be a 1:1 relation between the work being scheduled and the
work running.

Make sure to drop the reference taken when scheduling the BIO work if
the work is already scheduled, that is, when queue_work() returns false.

Fixes: 9e78c38ab30b ("block: Hold a reference on zone write plugs to schedule submission")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Bart Van Assche 
Reviewed-by: Johannes Thumshirn 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

block: break pcpu_alloc_mutex dependency on freeze_lock

2026-04-02T11:22:59+00:00

[ Upstream commit 539d1b47e935e8384977dd7e5cec370c08b7a644 ]

While nr_hw_update allocates tagset tags it acquires ->pcpu_alloc_mutex
after ->freeze_lock is acquired or queue is frozen. This potentially
creates a circular dependency involving ->fs_reclaim if reclaim is
triggered simultaneously in a code path which first acquires ->pcpu_
alloc_mutex. As the queue is already frozen while nr_hw_queue update
allocates tagsets, the reclaim can't forward progress and thus it could
cause a potential deadlock as reported in lockdep splat[1].

Fix this by pre-allocating tagset tags before we freeze queue during
nr_hw_queue update. Later the allocated tagset tags could be safely
installed and used after queue is frozen.

Reported-by: Yi Zhang 
Closes: https://lore.kernel.org/all/CAHj4cs8F=OV9s3La2kEQ34YndgfZP-B5PHS4Z8_b9euKG6J4mw@mail.gmail.com/ [1]
Signed-off-by: Nilay Shroff 
Reviewed-by: Ming Lei 
Tested-by: Yi Zhang 
Reviewed-by: Yu Kuai 
[axboe: fix brace style issue]
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

block: use trylock to avoid lockdep circular dependency in sysfs

2026-03-12T11:09:59+00:00

[ Upstream commit ce8ee8583ed83122405eabaa8fb351be4d9dc65c ]

Use trylock instead of blocking lock acquisition for update_nr_hwq_lock
in queue_requests_store() and elv_iosched_store() to avoid circular lock
dependency with kernfs active reference during concurrent disk deletion:

  update_nr_hwq_lock -> kn->active (via del_gendisk -> kobject_del)
  kn->active -> update_nr_hwq_lock (via sysfs write path)

Return -EBUSY when the lock is not immediately available.

Reported-and-tested-by: Yi Zhang 
Closes: https://lore.kernel.org/linux-block/CAHj4cs-em-4acsHabMdT=jJhXkCzjnprD-aQH1OgrZo4nTnmMw@mail.gmail.com/
Fixes: 626ff4f8ebcb ("blk-mq: convert to serialize updating nr_requests with update_nr_hwq_lock")
Signed-off-by: Ming Lei 
Tested-by: Yi Zhang 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

block: decouple secure erase size limit from discard size limit

2026-03-04T12:19:38+00:00

[ Upstream commit ee81212f74a57c5d2b56cf504f40d528dac6faaf ]

Secure erase should use max_secure_erase_sectors instead of being limited
by max_discard_sectors. Separate the handling of REQ_OP_SECURE_ERASE from
REQ_OP_DISCARD to allow each operation to use its own size limit.

Signed-off-by: Luke Wang 
Reviewed-by: Ulf Hansson 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

blk-mq-sched: unify elevators checking for async requests

2026-03-04T12:19:38+00:00

[ Upstream commit 1db61b0afdd7e8aa9289c423fdff002603b520b5 ]

bfq and mq-deadline consider sync writes as async requests and only
reserve tags for sync reads by async_depth, however, kyber doesn't
consider sync writes as async requests for now.

Consider the case there are lots of dirty pages, and user use fsync to
flush dirty pages. In this case sched_tags can be exhausted by sync writes
and sync reads can stuck waiting for tag. Hence let kyber follow what
mq-deadline and bfq did, and unify async requests checking for all
elevators.

Signed-off-by: Yu Kuai 
Reviewed-by: Nilay Shroff 
Reviewed-by: Hannes Reinecke 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs()

2026-03-04T12:19:38+00:00

[ Upstream commit 9d20fd6ce1ba9733cd5ac96fcab32faa9fc404dd ]

In blk_mq_update_nr_hw_queues(), debugfs_mutex is not held while
creating debugfs entries for hctxs. Hence add debugfs_mutex there,
it's safe because queue is not frozen.

Signed-off-by: Yu Kuai 
Reviewed-by: Nilay Shroff 
Reviewed-by: Ming Lei 
Reviewed-by: Hannes Reinecke 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

block,bfq: fix aux stat accumulation destination

2026-02-11T12:41:47+00:00

[ Upstream commit 04bdb1a04d8a2a89df504c1e34250cd3c6e31a1c ]

Route bfqg_stats_add_aux() time accumulation into the destination
stats object instead of the source, aligning with other stat fields.

Reviewed-by: Yu Kuai 
Signed-off-by: shechenglong 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

block: zero non-PI portion of auto integrity buffer

2026-01-23T10:21:16+00:00

[ Upstream commit ca22c566b89164f6e670af56ecc45f47ef3df819 ]

The auto-generated integrity buffer for writes needs to be fully
initialized before being passed to the underlying block device,
otherwise the uninitialized memory can be read back by userspace or
anyone with physical access to the storage device. If protection
information is generated, that portion of the integrity buffer is
already initialized. The integrity data is also zeroed if PI generation
is disabled via sysfs or the PI tuple size is 0. However, this misses
the case where PI is generated and the PI tuple size is nonzero, but the
metadata size is larger than the PI tuple. In this case, the remainder
("opaque") of the metadata is left uninitialized.
Generalize the BLK_INTEGRITY_CSUM_NONE check to cover any case when the
metadata is larger than just the PI tuple.

Signed-off-by: Caleb Sander Mateos 
Fixes: c546d6f43833 ("block: only zero non-PI metadata tuples in bio_integrity_prep")
Reviewed-by: Anuj Gupta 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

block: validate pi_offset integrity limit

2026-01-17T15:35:33+00:00

[ Upstream commit ccb8a3c08adf8121e2afb8e704f007ce99324d79 ]

The PI tuple must be contained within the metadata value, so validate
that pi_offset + pi_tuple_size <= metadata_size. This guards against
block drivers that report invalid pi_offset values.

Signed-off-by: Caleb Sander Mateos 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin