linux-stable.git/drivers/md, branch v7.0.10

md/md-bitmap: add a none backend for bitmap grow

2026-05-23T11:09:30+00:00

[ Upstream commit f2926a533d03fe70d753b512b713e06a2aa174af ]

Add a real none bitmap backend that exposes the common bitmap sysfs
group and use it to keep bitmap/location available when an array has no
bitmap.

Then switch the bitmap location sysfs path to move only between none
and the classic bitmap backend, using the no-sysfs bitmap helpers while
merging or unmerging the internal bitmap sysfs group.

This restores mdadm --grow bitmap addition through bitmap/location.

Fixes: fb8cc3b0d9db ("md/md-bitmap: delay registration of bitmap_ops until creating bitmap")
Reviewed-by: Su Yue 
Link: https://lore.kernel.org/r/20260425024615.1696892-4-yukuai@fnnas.com
Signed-off-by: Yu Kuai 
Signed-off-by: Sasha Levin

md/md-bitmap: split bitmap sysfs groups

2026-05-23T11:09:30+00:00

[ Upstream commit aba3d6d6cb55c6e1116d1215140559dd7ecdf9a9 ]

Split the classic bitmap sysfs files into a common bitmap group with
the location attribute and a separate internal bitmap group for the
remaining files.

At the same time, convert bitmap operations from a single sysfs group
to a sysfs group array so backends can share part of their sysfs
layout while adding backend-specific attributes separately.

Switch the bitmap sysfs helpers to use sysfs_update_groups() for the
add and update path, and remove groups in reverse order so shared named
groups are unmerged before the last group removes the directory.
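
The ordering rule above can be illustrated with a small userspace model.
This is only a sketch: group_model, order_trace, add_groups_model() and
remove_groups_model() are hypothetical names, not the md-bitmap helpers,
and the model records the order in which groups are visited instead of
calling into sysfs. It assumes a NULL-terminated group array whose first
entry is the shared group that creates the directory.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TRACE 8

/* Hypothetical stand-in for a named sysfs attribute group. */
struct group_model {
	const char *name;
};

/* Records the order in which groups are added or removed. */
struct order_trace {
	const struct group_model *seen[MAX_TRACE];
	size_t n;
};

static void trace_push(struct order_trace *t, const struct group_model *g)
{
	assert(t->n < MAX_TRACE);
	t->seen[t->n++] = g;
}

/* Add path: walk the NULL-terminated array front to back. */
static void add_groups_model(const struct group_model **groups,
			     struct order_trace *t)
{
	for (size_t i = 0; groups[i]; i++)
		trace_push(t, groups[i]);
}

/*
 * Remove path: walk back to front, so groups merged into the shared
 * directory go first and the group that owns the directory goes last.
 */
static void remove_groups_model(const struct group_model **groups,
				struct order_trace *t)
{
	size_t n = 0;

	while (groups[n])
		n++;
	while (n > 0)
		trace_push(t, groups[--n]);
}
```

With a two-entry array of a common group followed by an internal group,
the model adds common then internal, and removes internal then common.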

Also make bitmap operation lookup depend only on the currently selected
bitmap id matching the installed backend. This prepares the lookup path
for a later registered none backend.

Reviewed-by: Su Yue 
Link: https://lore.kernel.org/r/20260425024615.1696892-3-yukuai@fnnas.com
Signed-off-by: Yu Kuai 
Stable-dep-of: f2926a533d03 ("md/md-bitmap: add a none backend for bitmap grow")
Signed-off-by: Sasha Levin

md: factor bitmap creation away from sysfs handling

2026-05-23T11:09:30+00:00

[ Upstream commit 8776d342cf8fa0b98ca5e6fb2d956966fb5ca364 ]

Factor bitmap creation and destruction into helpers that do not touch
bitmap sysfs registration.

This prepares the bitmap sysfs rework so callers such as the sysfs
bitmap location path can create or destroy a bitmap backend without
coupling that to sysfs group lifetime management.

Reviewed-by: Su Yue 
Link: https://lore.kernel.org/r/20260425024615.1696892-2-yukuai@fnnas.com
Signed-off-by: Yu Kuai 
Stable-dep-of: f2926a533d03 ("md/md-bitmap: add a none backend for bitmap grow")
Signed-off-by: Sasha Levin

md: add fallback to correct bitmap_ops on version mismatch

2026-05-23T11:09:30+00:00

[ Upstream commit 09af773650024279a60348e7319d599e6571b15c ]

If the default bitmap version and the on-disk version don't match, and
mdadm is too old to set bitmap_type, set bitmap_ops based on the on-disk
version.

Link: https://lore.kernel.org/linux-raid/20260323054644.3351791-2-yukuai@fnnas.com/
Signed-off-by: Yu Kuai 
Stable-dep-of: f2926a533d03 ("md/md-bitmap: add a none backend for bitmap grow")
Signed-off-by: Sasha Levin

md/raid1,raid10: don't fail devices for invalid IO errors

2026-05-23T11:09:30+00:00

[ Upstream commit f7b24c7b41f23b5f9caa8b913afe79cd4c397d39 ]

BLK_STS_INVAL indicates the IO request itself was invalid, not that the
device has failed. When raid1 treats this as a device error, it retries
on alternate mirrors which fail the same way, eventually exceeding the
read error threshold and removing the device from the array.

This happens when stacking configurations bypass bio_split_to_limits()
in the IO path: dm-raid calls md_handle_request() directly without going
through md_submit_bio(), skipping the alignment validation that would
otherwise reject invalid bios early. The invalid bio reaches the
lower block layers, which fail the bio with BLK_STS_INVAL, and raid1
wrongly interprets this as a device failure.

Add BLK_STS_INVAL to raid1_should_handle_error() so that invalid IO
errors are propagated back to the caller rather than triggering device
removal. This is consistent with the previous kernel behavior when
alignment checks were done earlier in the direct-io path.
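
A minimal userspace sketch of the decision described above. The names
status_model, bio_model and should_handle_error_model() are hypothetical
and only model the status check; any other conditions the real
raid1_should_handle_error() applies are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a few block-layer completion statuses. */
enum status_model {
	STS_OK,
	STS_IOERR,	/* the device could not complete the request */
	STS_INVAL,	/* the request itself was invalid */
};

struct bio_model {
	enum status_model status;
};

/*
 * Returns true when the mirror should treat the failure as a device
 * error (retry on another mirror, count it against the device) and
 * false when the error should simply be passed back to the caller.
 */
static bool should_handle_error_model(const struct bio_model *bio)
{
	assert(bio != NULL);

	/* An invalid request fails identically on every mirror. */
	if (bio->status == STS_INVAL)
		return false;

	return bio->status != STS_OK;
}
```

In this model only STS_IOERR leads to device error handling; STS_INVAL
is returned to the submitter without touching the array membership.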

Fixes: 5ff3f74e145adc7 ("block: simplify direct io validity check")

Reported-by: Tomáš Trnka 
Closes: https://lore.kernel.org/linux-block/2982107.4sosBPzcNG@electra/
Signed-off-by: Keith Busch 
Tested-by: Tomáš Trnka 
Link: https://lore.kernel.org/r/20260416140345.3872265-1-kbusch@meta.com
Signed-off-by: Yu Kuai 
Signed-off-by: Sasha Levin

dm cache: fix missing return in invalidate_committed's error path

2026-05-23T11:08:55+00:00

[ Upstream commit 8c0ee19db81f0fa1ff25fd75b22b17c0cc2acde3 ]

In passthrough mode, dm-cache defers write submission until after
metadata commit completes via the invalidate_committed() continuation.
On commit error, invalidate_committed() calls invalidate_complete() to
end the bio and free the migration struct, after which it should return
immediately.

The patch 4ca8b8bd952d ("dm cache: fix write hang in passthrough mode")
omitted this early return, causing execution to fall through into the
success path on error. This results in use-after-free on the migration
struct in the subsequent calls.

Fix by adding the missing return after the invalidate_complete() call.
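
The control flow of the fix can be sketched in userspace as follows.
Everything here is a hypothetical model (migration_model,
complete_and_free(), committed_model(), run_model()); it is not the
dm-cache code, and the success path is reduced to a counter so the
effect of the early return is observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the migration struct. */
struct migration_model {
	bool *completed;	/* set once the bio has been ended */
	int *success_steps;	/* counts entries into the success path */
};

/* Models invalidate_complete(): ends the bio and frees the migration. */
static void complete_and_free(struct migration_model *mg)
{
	*mg->completed = true;
	free(mg);
}

/* Models the continuation that runs after the metadata commit. */
static void committed_model(struct migration_model *mg, int commit_err)
{
	if (commit_err) {
		complete_and_free(mg);
		return;	/* the missing return: mg must not be touched again */
	}

	/* Success path: still allowed to use mg here. */
	(*mg->success_steps)++;
	complete_and_free(mg);
}

/* Helper: run the model once and report how often the success path ran. */
static int run_model(int commit_err, bool *completed)
{
	int steps = 0;
	struct migration_model *mg = malloc(sizeof(*mg));

	assert(mg != NULL);
	mg->completed = completed;
	mg->success_steps = &steps;
	committed_model(mg, commit_err);
	return steps;
}
```

Without the return, the error branch would fall through and dereference
mg after it had been freed, which is the use-after-free described above.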

Fixes: 4ca8b8bd952d ("dm cache: fix write hang in passthrough mode")
Reported-by: Dan Carpenter 
Closes: https://lore.kernel.org/dm-devel/adjMq6T5RRjv_uxM@stanley.mountain/
Signed-off-by: Ming-Hung Tsai 
Signed-off-by: Mikulas Patocka 
Signed-off-by: Sasha Levin

dm init: ensure device probing has finished in dm-mod.waitfor=

2026-05-23T11:08:46+00:00

[ Upstream commit 99a2312f69805f4ba92d98a757625e0300a747ab ]

The early_lookup_bdev() function returns successfully when the disk
device is present but not necessarily its partitions. In this situation,
dm_early_create() fails as the partition block device does not exist
yet.

In my case, this phenomenon occurs quite often because the device is
an SD card with slow reading times, on which the kernel takes time to
enumerate available partitions.

Fortunately, the underlying device is back to "probing" state while
enumerating partitions. Waiting for all probing to end is enough to fix
this issue.

That's also the reason why this problem never occurs with rootwait=
parameter: the while loop inside wait_for_root() explicitly waits for
probing to be done and then the function calls async_synchronize_full().
These lines were omitted in 035641b, even though the commit says it's
based on the rootwait logic...

Anyway, calling wait_for_device_probe() after our while loop does the
job (it both waits for probing and calls async_synchronize_full).

Fixes: 035641b01e72 ("dm init: add dm-mod.waitfor to wait for asynchronously probed block devices")
Signed-off-by: Guillaume Gonnet 
Signed-off-by: Mikulas Patocka 
Signed-off-by: Sasha Levin

dm log: fix out-of-bounds write due to region_count overflow

2026-05-23T11:08:44+00:00

[ Upstream commit c20e36b7631d83e7535877f08af8b0af72c44b1a ]

The local variable region_count in create_log_context() is declared as
unsigned int (32-bit), but dm_sector_div_up() returns sector_t (64-bit).
When a device-mapper target has a sufficiently large ti->len with a small
region_size, the division result can exceed UINT_MAX. The truncated
value is then used to calculate bitset_size, causing clean_bits,
sync_bits, and recovering_bits to be allocated far smaller than needed
for the actual number of regions.

Subsequent log operations (log_set_bit, log_clear_bit, log_test_bit) use
region indices derived from the full untruncated region space, causing
out-of-bounds writes to kernel heap memory allocated by vmalloc.

This can be reproduced by creating a mirror target whose region_count
overflows 32 bits:

  dmsetup create bigzero --table '0 8589934594 zero'
  dmsetup create mymirror --table '0 8589934594 mirror \
    core 2 2 nosync 2 /dev/mapper/bigzero 0 \
    /dev/mapper/bigzero 0'

The status output confirms the truncation (sync_count=1 instead of
4294967297, because 0x100000001 was truncated to 1):

  $ dmsetup status mymirror
  0 8589934594 mirror 2 254:1 254:1 1/4294967297 ...

This leads to a kernel crash in core_in_sync:

  BUG: scheduling while atomic: (udev-worker)/9150/0x00000000
  RIP: 0010:core_in_sync+0x14/0x30 [dm_log]
  CR2: 0000000000000008
  Fixing recursive fault but reboot is needed!

Fix by widening the local region_count to sector_t and adding an
explicit overflow check before the value is assigned to lc->region_count.
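
The arithmetic behind the truncation can be checked with a userspace
sketch. The helper names below are hypothetical and sector_t is modelled
as uint64_t; the sketch assumes a 32-bit unsigned int and uses UINT_MAX
as the bound, which is an assumption about the shape of the check rather
than a copy of the upstream patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's 64-bit sector_t. */
typedef uint64_t sector_t;

/* Round-up division, modelling what dm_sector_div_up() computes. */
static sector_t sector_div_up(sector_t n, sector_t d)
{
	assert(d != 0);
	return n / d + (n % d != 0);
}

/* Pre-fix shape: the 64-bit quotient is silently narrowed to 32 bits. */
static unsigned int region_count_truncated(sector_t len, sector_t region_size)
{
	return (unsigned int)sector_div_up(len, region_size);
}

/*
 * Post-fix shape: keep the quotient wide and reject values that do not
 * fit. Returns false on overflow instead of storing a truncated count.
 */
static bool region_count_checked(sector_t len, sector_t region_size,
				 unsigned int *out)
{
	sector_t count = sector_div_up(len, region_size);

	if (count > UINT_MAX)
		return false;
	*out = (unsigned int)count;
	return true;
}
```

For the reproducer above, len is 8589934594 sectors and region_size is 2,
so the quotient is 4294967297 (0x100000001); narrowing it to 32 bits
leaves 1, which matches the sync_count shown in the status output.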

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Yuhao Jiang 
Signed-off-by: Junrui Luo 
Signed-off-by: Mikulas Patocka 
Signed-off-by: Sasha Levin

dm cache metadata: fix memory leak on metadata abort retry

2026-05-23T11:08:44+00:00

[ Upstream commit 044ca491d4086dc5bf233e9fcb71db52df32f633 ]

When failing to acquire the root_lock in dm_cache_metadata_abort because
the block_manager is read-only, the temporary block_manager created
outside the root_lock is not properly released, causing a memory leak.
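
The leak pattern can be sketched in userspace. All names below
(bm_model, cmd_model, abort_model() and the live_block_managers counter)
are hypothetical, the lock failure is reduced to a read_only flag, and
the -1 return value is a placeholder rather than the kernel's error code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a block manager. */
struct bm_model {
	int unused;
};

/* Number of block managers currently allocated; non-zero growth is a leak. */
static int live_block_managers;

static struct bm_model *bm_create_model(void)
{
	struct bm_model *bm = malloc(sizeof(*bm));

	assert(bm != NULL);
	live_block_managers++;
	return bm;
}

static void bm_destroy_model(struct bm_model *bm)
{
	free(bm);
	live_block_managers--;
}

/* Hypothetical stand-in for the cache metadata object. */
struct cmd_model {
	struct bm_model *bm;
	bool read_only;	/* models "cannot take root_lock for writing" */
};

static int abort_model(struct cmd_model *cmd)
{
	/* The replacement manager is created before the lock is taken. */
	struct bm_model *new_bm = bm_create_model();
	struct bm_model *old_bm;

	if (cmd->read_only) {
		/* The release the fix adds: drop the temporary on failure. */
		bm_destroy_model(new_bm);
		return -1;
	}

	old_bm = cmd->bm;
	cmd->bm = new_bm;
	if (old_bm)
		bm_destroy_model(old_bm);
	return 0;
}
```

In this model a failed abort leaves the live count unchanged, and
repeated successful aborts keep exactly one manager alive.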

Reproduce steps:

This can be reproduced by reloading a new table while the metadata
is read-only. While the second call to dm_cache_metadata_abort is
caused by lack of support for table preload in dm-cache, mentioned
in commit 9b1cc9f251af ("dm cache: share cache-metadata object across
inactive and active DM tables"), it exposes the memory leak in
dm_cache_metadata_abort when the function is called multiple times.
Specifically, dm-cache fails to sync the new cache object's mode during
preresume, creating the reproducer condition.

This issue could also occur through concurrent metadata_operation_failed
calls due to races in cache mode updates, but the table preload scenario
below provides a reliable reproducer.

1. Create a cache device with some faulty trailing metadata blocks

dmsetup create cmeta <
unreferenced object 0xffff8880080c2010 (size 16):
  comm "dmsetup", pid 132, jiffies 4294982580
  hex dump (first 16 bytes):
    00 38 b9 07 80 88 ff ff 6a 6b 6b 6b 6b 6b 6b a5 ...
  backtrace (crc 3118f31c):
    kmemleak_alloc+0x28/0x40
    __kmalloc_cache_noprof+0x3d9/0x510
    dm_block_manager_create+0x51/0x140
    dm_cache_metadata_abort+0x85/0x320
    metadata_operation_failed+0x103/0x1e0
    cache_preresume+0xacd/0xe70
    dm_table_resume_targets+0xd3/0x320
    __dm_resume+0x1b/0xf0
    dm_resume+0x127/0x170


Fixes: 352b837a5541 ("dm cache: Fix ABBA deadlock between shrink_slab and dm_cache_metadata_abort")
Signed-off-by: Ming-Hung Tsai 
Signed-off-by: Mikulas Patocka 
Signed-off-by: Sasha Levin

dm-mpath: don't stop probing paths at presuspend

2026-05-23T11:08:43+00:00

[ Upstream commit 51d81e14fe6788dc6463064c7517480f2acd2724 ]

Commit 5c977f102315 ("dm-mpath: Don't grab work_mutex while probing
paths") added code to make multipath quit probing paths early if it
was trying to suspend. This isn't necessary. It was just an optimization
to try to keep path probing from delaying a suspend. However it causes
problems with the intended user of this code, qemu. The path probing
code was added because failed ioctls to multipath devices don't cause
paths to fail in cases where a regular IO failure would.

If an ioctl to a path failed because the path was down, and the
multipath device had passed presuspend, the M_MPATH_PROBE_PATHS ioctl
would exit early, without probing the path. The caller would then retry
the original ioctl, hoping to use a different path. But if there was
only one path in the pathgroup, it would pick the same non-working path
again, even if there were working paths in other pathgroups.

ioctls to a suspended dm device will return -EAGAIN, notifying the
caller that the device is suspended, but ioctls to a device that is just
preparing to suspend won't (and in general, shouldn't). This means that
the caller (qemu in this case) would get into a tight loop where it
would issue an ioctl that failed, skip probing the paths because the
device had already passed presuspend, and start over issuing the ioctl
again. This would continue until the multipath device finally fully
suspended, or the caller gave up and failed the ioctl.

multipath's path probing code could return -EAGAIN in this case, and the
caller could delay a bit before retrying, but the whole purpose of
skipping the probe after presuspend was to speed things up, and that
would just slow them down. Instead, remove the is_suspending flag, and
check dm_suspended() instead to decide whether to exit the probing code
early. This means that when the probing code exits early, future ioctls
will also be delayed, because the device is fully suspended.

Fixes: 5c977f102315 ("dm-mpath: Don't grab work_mutex while probing paths")
Signed-off-by: Benjamin Marzinski 
Reviewed-by: Martin Wilck 
Reviewed-by: Hanna Czenczek 
Signed-off-by: Mikulas Patocka 
Signed-off-by: Sasha Levin