linux.git - Linux kernel source tree

Age	Commit message	Author
8 days	dm thin metadata: fix superblock refcount leak on snapshot shadow failure	Genjian Zhang
	__reserve_metadata_snap() increments THIN_SUPERBLOCK_LOCATION in the metadata space map before shadowing it. When dm_tm_shadow_block() fails, a reference is leaked in the metadata space map. Fix by adding the missing dm_sm_dec_block(). Signed-off-by: Genjian Zhang <zhanggenjian@kylinos.cn> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: cc8394d86f04 ("dm thin: provide userspace access to pool metadata") Cc: stable@vger.kernel.org
9 days	dm-stats: fix dm_jiffies_to_msec64	Mikulas Patocka
	There were wrong calculations in dm_jiffies_to_msec64 that produced incorrect output when HZ was different from 1000. This commit fixes them. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Assisted-by: Claude:claude-opus-4-6 Fixes: fd2ed4d25270 ("dm: add statistics support") Cc: stable@vger.kernel.org
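As a rough illustration of the kind of arithmetic involved, a jiffies-to-milliseconds conversion that stays correct for any tick rate might look like the sketch below. This is not the patch itself: `jiffies_to_msec64_sketch` is a hypothetical userspace helper, and its `hz` parameter stands in for the kernel's compile-time `HZ`.

```c
#include <stdint.h>

/* Hypothetical stand-in; hz plays the role of the kernel's HZ. */
static uint64_t jiffies_to_msec64_sketch(uint64_t j, uint32_t hz)
{
	uint64_t secs = j / hz;	/* whole seconds */
	uint64_t rem = j % hz;	/* leftover ticks, always < hz */

	/* rem * 1000 cannot overflow 64 bits because rem < hz <= UINT32_MAX */
	return secs * 1000 + rem * 1000 / hz;
}
```

Splitting into whole seconds and a remainder avoids multiplying the full 64-bit tick count by 1000 before dividing.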
9 days	dm-stats: fix merge accounting	Mikulas Patocka
	There were wrong parentheses when setting stats_aux->merged, so that merging was never properly accounted. This commit fixes it. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Assisted-by: Claude:claude-opus-4-6 Fixes: fd2ed4d25270 ("dm: add statistics support") Cc: stable@vger.kernel.org
9 days	dm-bufio: fix wrong count calculation in dm_bufio_issue_discard	Mikulas Patocka
	block_to_sector converts a block number to a sector number and adds c->start to the result. It is inappropriate to use this function for converting a number of blocks to a number of sectors because c->start would be incorrectly added to the result. Luckily, the only target that uses dm_bufio_issue_discard is dm-ebs, which sets c->start to 0, so this bug is latent. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Assisted-by: Claude:claude-opus-4-6 Fixes: 6fbeb0048e6b ("dm bufio: implement discard") Cc: stable@vger.kernel.org
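The difference between converting a position and converting a length can be sketched as follows. The struct and both helpers are simplified, hypothetical stand-ins (assuming a power-of-two block size expressed as a shift), not the dm-bufio definitions.

```c
#include <stdint.h>

struct client_sketch {
	uint64_t start;			/* offset of the data area, in sectors */
	unsigned sectors_per_block_bits;
};

/* Position: block number -> absolute sector, offset included. */
static uint64_t block_to_sector_sketch(const struct client_sketch *c, uint64_t block)
{
	return (block << c->sectors_per_block_bits) + c->start;
}

/* Length: number of blocks -> number of sectors, no offset. */
static uint64_t block_count_to_sectors_sketch(const struct client_sketch *c, uint64_t count)
{
	return count << c->sectors_per_block_bits;
}
```

With `start == 0` the two helpers return the same value, which is why the bug described above stays latent for dm-ebs.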
9 days	dm-verity: make error counter atomic	Mikulas Patocka
	The error counter "v->corrupted_errs" was not atomic, thus it could be subject to race conditions. The call to dm_audit_log_target("max-corrupted-errors") may be skipped due to the races. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Assisted-by: Claude:claude-opus-4.6 Fixes: 65ff5b7ddf05 ("dm verity: add error handling modes for corrupted blocks") Cc: stable@vger.kernel.org
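A minimal userspace sketch of why an atomic counter matters here, using C11 `<stdatomic.h>` rather than the kernel's `atomic_t`. The limit value and all names are hypothetical; the point is only that an atomic read-modify-write lets exactly one caller observe the threshold being reached.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_CORRUPTED_ERRS_SKETCH 100	/* hypothetical limit */

static atomic_int corrupted_errs_sketch;

/* Returns true for exactly one caller: the one whose increment hits the limit. */
static bool record_corruption_sketch(void)
{
	int now = atomic_fetch_add(&corrupted_errs_sketch, 1) + 1;

	return now == MAX_CORRUPTED_ERRS_SKETCH;
}
```

With a plain `int` and a separate increment and compare, two racing callers could both skip the threshold value, which matches the skipped audit log call described above.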
9 days	dm-verity: increase sprintf buffer size	Mikulas Patocka
	The prefix "DM_VERITY_ERR_BLOCK_NR" is 22 chars. Add '=', one digit for type, ',', up to 20 digits for a u64 block number, and a NUL terminator: that's 46 bytes. The buffer is 42 bytes. For block numbers >= 16 decimal digits (devices larger than ~16 EB with 4K blocks), snprintf silently truncates the uevent environment variable. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Assisted-by: Claude:claude-opus-4.6 Fixes: 65ff5b7ddf05 ("dm verity: add error handling modes for corrupted blocks") Cc: stable@vger.kernel.org
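The byte count above can be checked with `snprintf(NULL, 0, ...)`, which returns the formatted length without writing anything. The format string below is inferred from the description in the message (name, '=', type, ',', block number) and is an assumption, as is the helper name.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Bytes needed, including the NUL, for "NAME=<type>,<block>". */
static int env_bytes_needed_sketch(int type, uint64_t block)
{
	/* snprintf with a NULL buffer and size 0 returns the length without the NUL */
	return snprintf(NULL, 0, "%s=%d,%" PRIu64,
			"DM_VERITY_ERR_BLOCK_NR", type, block) + 1;
}
```

Calling it with a one-digit type and `UINT64_MAX` gives the worst case for a 64-bit block number.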
9 days	dm-verity: fix a possible NULL pointer dereference	Mikulas Patocka
	Fix a possible NULL pointer dereference in dm_verity_loadpin_is_bdev_trusted if the device has no table. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Assisted-by: Claude:claude-opus-4-6 Fixes: b6c1c5745ccc ("dm: Add verity helpers for LoadPin") Cc: stable@vger.kernel.org
9 days	dm-verity: avoid double increment of &use_bh_wq_enabled	Mikulas Patocka
	verity_parse_opt_args is called twice, first with only_modifier_opts == true and then with only_modifier_opts == false. Thus, the static branch &use_bh_wq_enabled was incremented twice and the destructor verity_dtr would only decrement it once. Fix this bug by incrementing it only on the first call; on the second call, when v->use_bh_wq is already true, do nothing. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Assisted-by: Claude:claude-opus-4-6 Cc: stable@vger.kernel.org Fixes: df326e7a0699 ("dm verity: allow optional args to alter primary args handling")
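The balance between the parser and the destructor can be sketched with a plain counter standing in for the static branch. Every name below is hypothetical; the real code uses the kernel's static key API, which is not reproduced here.

```c
#include <stdbool.h>

static int use_bh_wq_enabled_sketch;	/* stand-in for the static branch count */

struct verity_sketch {
	bool use_bh_wq;
};

/* Called once per parsing pass; only the first pass that enables the option counts. */
static void enable_bh_wq_sketch(struct verity_sketch *v)
{
	if (v->use_bh_wq)
		return;			/* second pass: already accounted for */
	v->use_bh_wq = true;
	use_bh_wq_enabled_sketch++;
}

/* Destructor side: one decrement per target that enabled the option. */
static void dtr_sketch(struct verity_sketch *v)
{
	if (v->use_bh_wq)
		use_bh_wq_enabled_sketch--;
}
```

Guarding on the per-target flag keeps the count at one increment per target however many times the options are parsed.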
9 days	dm-ioctl: fix a possible overflow in list_version_get_info	Mikulas Patocka
	sizeof(tt->version) is 12 bytes, but the code writes 16 bytes into the output buffer - info->vers->version[0], info->vers->version[1], info->vers->version[2] and info->vers->next. This can cause buffer overflow. Fix this buffer overflow by replacing "sizeof(tt->version)" with "sizeof(struct dm_target_versions)". Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Assisted-by: Claude:claude-opus-4.6 Cc: stable@vger.kernel.org
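The 12 versus 16 byte mismatch can be reproduced with a small stand-in struct. The layout below is an assumption modelled on the fields named in the message (three version words plus `next`, followed by a flexible name array); it is not copied from the kernel headers.

```c
#include <stddef.h>
#include <stdint.h>

struct target_versions_sketch {
	uint32_t next;
	uint32_t version[3];
	char name[];		/* flexible array, not counted by sizeof */
};

/* Size of the version array alone. */
static size_t version_bytes_sketch(void)
{
	return sizeof(((struct target_versions_sketch *)0)->version);
}

/* Size of the fixed header that is actually written to the output buffer. */
static size_t header_bytes_sketch(void)
{
	return sizeof(struct target_versions_sketch);
}
```

Reserving space by the array size leaves no room for `next`, which is the four-byte shortfall described above.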
9 days	dm_early_create: fix freeing used table on dm_resume failure	Mikulas Patocka
	If dm_resume fails, the kernel attempts to free table with dm_table_destroy, but the table was already instantiated with dm_swap_table. This commit skips the call to dm_table_destroy in this case. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Assisted-by: Claude:claude-opus-4.6 Fixes: 6bbc923dfcf5 ("dm: add support to directly boot to a mapped device") Cc: stable@vger.kernel.org
9 days	dm-integrity: fix a bug if the bio is out of limits	Mikulas Patocka
	If dm_integrity_check_limits fails, the code would exit with DM_MAPIO_KILL. However, the range would be already locked at this point, and it wouldn't be unlocked, resulting in a deadlock. Let's move the limit check up, so that when it exits, no resources are leaked. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Assisted-by: Claude:claude-opus-4.6 Fixes: fb0987682c62 ("dm-integrity: introduce the Inline mode") Cc: stable@vger.kernel.org
9 days	dm-integrity: don't increment hash_offset twice	Mikulas Patocka
	hash_offset is already incremented in the loop "for (i = 0; i < to_copy; i++, ts--)". Do not increment it again. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Assisted-by: Claude:claude-opus-4.6 Fixes: 84597a44a9d8 ("dm-integrity: dm integrity: add optional discard support") Cc: stable@vger.kernel.org
9 days	dm-integrity: fix leaking uninitialized kernel memory	Mikulas Patocka
	If hash size is less than device's tuple size, dm-integrity is supposed to zero the remaining space. There was a bug in the code that zeroing didn't work. This commit fixes it. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Assisted-by: Claude:claude-opus-4.6 Fixes: fb0987682c62 ("dm-integrity: introduce the Inline mode") Cc: stable@vger.kernel.org
9 days	dm-integrity: fix the 'fix_hmac' option	Mikulas Patocka
	When the "fix_hmac" argument is used, dm-integrity is supposed to check the superblock with the journal_mac. However, there was a logic bug in the code: it only checked the superblock mac if the bit SB_FLAG_FIXED_HMAC was set in the superblock. So, an attacker could clear this bit and trivially bypass the check. This commit changes dm-integrity so that when the user specifies the "fix_hmac" flag and the superblock doesn't have the bit SB_FLAG_FIXED_HMAC set, the activation is aborted with an error. Unfortunately, there's a bug in the integritysetup tool: when using the 'open' command, it passes the "fix_hmac" argument to the kernel even if the user specified --integrity-legacy-hmac. The bug will be fixed in the upcoming 2.8.7 release. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reported-by: Shukai Ni <shukai.ni@kuleuven.be>
11 days	dm era: fix error code propagation in era_ctr()	Cao Guanghui
	era_ctr() replaces the actual error codes returned by dm_get_device() and dm_set_target_max_io_len() with hardcoded -EINVAL, discarding the real reason for the failure (e.g. -ENODEV, -ENOMEM). This makes it harder for users to diagnose problems and is inconsistent with other dm targets (dm-thin, dm-verity, dm-flakey, dm-ebs) which propagate the original error. Fix all three sites to return 'r' instead of -EINVAL. Signed-off-by: Cao Guanghui <caoguanghui@kylinos.cn> Reviewed-by: Su Yue <glass.su@suse.com> Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
11 days	dm era: fix NULL pointer dereference in metadata_open()	Cao Guanghui
	metadata_open() returns NULL when kzalloc_obj() fails, but the caller era_ctr() only checks IS_ERR(md). Since IS_ERR(NULL) returns false, the NULL pointer is treated as a valid result and later assigned to era->md, leading to a NULL pointer dereference when the metadata is accessed. Fix this by returning ERR_PTR(-ENOMEM) on allocation failure, consistent with dm-cache-metadata.c, dm-thin-metadata.c, and dm-clone-metadata.c which all use ERR_PTR(-ENOMEM) for the same pattern. Fixes: eec40579d848 ("dm: add era target") Signed-off-by: Cao Guanghui <caoguanghui@kylinos.cn> Reviewed-by: Su Yue <glass.su@suse.com> Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
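Why a NULL return slips past an IS_ERR() check can be seen in a userspace re-creation of the error-pointer convention. The helpers below are simplified stand-ins with hypothetical names, assuming the usual encoding of errno values in the top 4095 addresses.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO_SKETCH 4095

/* Encode a negative errno value as a pointer. */
static inline void *err_ptr_sketch(long error)
{
	return (void *)(intptr_t)error;
}

/* True only for pointers in the top MAX_ERRNO_SKETCH addresses. */
static inline bool is_err_sketch(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO_SKETCH;
}
```

NULL is address zero, so it is nowhere near the error range, and a caller that checks only for an error pointer treats it as a valid object.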
11 days	dm: avoid leaking the caller's thread keyring via the table device file	Ingo Blechschmidt
	The refactoring in commit a28d893eb327 ("md: port block device access to file") accidentally causes the caller's thread keyring to be kept alive long beyond the caller's lifetime. As a result, "cryptsetup luksSuspend" silently fails to wipe the LUKS volume key from memory. In detail: "cryptsetup luksOpen" uses its supposedly ephemeral thread keyring to pass the volume key to the kernel. dm-crypt's crypt_set_keyring_key() copies the key material into its own crypt_config structure and then drops its own reference to the key in the keyring with key_put(). With this fix, restoring pre-v6.9 behavior, the copy in the thread keyring is then promptly garbage collected, such that exactly one copy of the volume key remains. This single copy is correctly wiped from memory on "cryptsetup luksSuspend". Without this fix, the thread keyring and the volume key in it remains. This second copy is only freed on "luksClose". "luksSuspend" neither knows about this copy nor has any way to remove it, so the key remains recoverable from RAM after a suspend that is documented to have wiped it. This fix should not introduce new security problems, as the code is anyway gated by CAP_SYS_ADMIN. The device-mapper core, not the calling task, is the legitimate owner of this long-lived file. Fixes: a28d893eb327 ("md: port block device access to file") Closes: https://gitlab.com/cryptsetup/cryptsetup/-/work_items/993 Link: https://www.speicherleck.de/iblech/cryptsetup-luksSuspend-issue-reproduction/ Signed-off-by: Ingo Blechschmidt <iblech@speicherleck.de> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Tested-by: Ondrej Kozina <okozina@redhat.com>
11 days	dm-inlinecrypt: Fix an error handling path in inlinecrypt_ctr()	Christophe JAILLET
	All error handling paths except this one branch to the 'bad' label. When that branch is skipped, memory is leaked and some sensitive data may be kept around. So, fix this error path and also do the needed clean-up. Also, fix missing goto in the "Wrong alignment of iv_offset sector" path. Fixes: e7f57d2c47e2 ("dm-inlinecrypt: add target for inline block device encryption") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
11 days	dm-pcache: reject option groups without values	Samuel Moelius
	The pcache target parses optional arguments as name/value pairs. A table that advertises one optional argument and supplies only a recognized option name, for example "cache_mode", reaches parse_cache_opts() with argc == 1. The parser consumes the name, decrements argc to zero, then calls dm_shift_arg() again for the value. dm_shift_arg() returns NULL when no arguments remain, and the following strcmp() dereferences that NULL pointer. Check that each recognized option has a value before consuming it. This keeps valid "cache_mode writeback" and "data_crc true/false" tables unchanged while making malformed tables fail during target construction with a precise missing-value error. Assisted-by: Codex:gpt-5.5-cyber-preview Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com> Reviewed-by: Zheng Gu <cengku@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 1d57628ff95b ("dm-pcache: add persistent cache target in device-mapper") Cc: stable@vger.kernel.org
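The missing-value check can be sketched with a tiny name/value parser. This is a hypothetical userspace model, not the pcache code: it recognises only "cache_mode" and walks a plain argv array instead of using the device-mapper argument helpers.

```c
#include <stddef.h>
#include <string.h>

/* Returns 0 on success, -1 on malformed input. */
static int parse_opts_sketch(int argc, const char **argv, int *writeback)
{
	while (argc > 0) {
		const char *name = *argv++;

		argc--;
		if (strcmp(name, "cache_mode") == 0) {
			if (argc == 0)	/* name without a value: reject before reading it */
				return -1;
			const char *value = *argv++;

			argc--;
			*writeback = strcmp(value, "writeback") == 0;
		} else {
			return -1;	/* unrecognised option */
		}
	}
	return 0;
}
```

Checking the remaining count before consuming the value is what keeps the string comparison from ever seeing a NULL pointer.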
11 days	dm thin metadata: fix metadata snapshot consistency on commit failure	Ming-Hung Tsai
	__reserve_metadata_snap() and __release_metadata_snap() modify the superblock's held_root directly in the block_manager's buffer. If the subsequent metadata commit fails, the held_root gets flushed to disk through the abort_transaction path, resulting in inconsistent metadata. Reproducer 1: __reserve_metadata_snap() 1. Create a 2 MiB metadata device and make the region after the 14th block inaccessible, to trigger metadata commit failure in the subsequent reserve_metadata_snap operation. The 14th block will be the shadow destination for the index block. dmsetup create tmeta --table "0 112 linear /dev/sdc 0 112 3984 error" 2. Create a 16 MiB thin-pool dmsetup create tdata --table "0 32768 zero" dd if=/dev/zero of=/dev/mapper/tmeta bs=4k count=1 dmsetup create tpool --table "0 32768 thin-pool /dev/mapper/tmeta \ /dev/mapper/tdata 128 0 1 skip_block_zeroing" 3. Take a metadata snapshot to trigger metadata commit failure and transaction abort. However, the held_root is written to disk, breaking metadata consistency. dmsetup message tpool 0 "reserve_metadata_snap" thin_check v1.2.2 result: Bad reference count for metadata block 6. Expected 2, but space map contains 1. Bad reference count for metadata block 7. Expected 2, but space map contains 1. Bad reference count for metadata block 13. Expected 1, but space map contains 0. Reproducer 2: __release_metadata_snap() 1. Create a 2 MiB metadata device and make the region after the 16th block inaccessible, to trigger metadata commit failure in the subsequent release_metadata_snap operation. The 16th block will be the shadow destination for the index block. dmsetup create tmeta --table "0 128 linear /dev/sdc 0 128 3968 error" 2. Create a 16 MiB thin-pool dmsetup create tdata --table "0 32768 zero" dd if=/dev/zero of=/dev/mapper/tmeta bs=4k count=1 dmsetup create tpool --table "0 32768 thin-pool /dev/mapper/tmeta \ /dev/mapper/tdata 128 0 1 skip_block_zeroing" 3. 
Reserve then release the metadata snapshot, to trigger metadata commit failure and transaction abort. The held_root gets removed from the on-disk superblock, causing inconsistent metadata. dmsetup message tpool 0 "reserve_metadata_snap" dmsetup message tpool 0 "release_metadata_snap" thin_check v1.2.2 result: Bad reference count for metadata block 6. Expected 1, but space map contains 2. Bad reference count for metadata block 7. Expected 1, but space map contains 2. 1 metadata blocks have leaked. Fix by deferring the held_root update to commit time. Additionally, move the existing-snapshot check in __reserve_metadata_snap before the shadow operation to avoid unnecessary work. In __release_metadata_snap, clear pmd->held_root before btree deletion so partial failure leaks blocks rather than leaving a stale reference, and unlock the snapshot block before decrementing its refcount. Fixes: 991d9fa02da0 ("dm: add thin provisioning target") Cc: stable@vger.kernel.org Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
11 days	dm-verity: fix buffer overflow in FEC calculation	Mikulas Patocka
	There's a buffer overflow in dm-verity-fec: if (neras && *neras <= v->fec->roots) fio->erasures[(*neras)++] = i; This allows *neras to reach roots + 1 (the post-increment pushes it past roots). This value is then passed as no_eras to decode_rs8(). Inside the RS decoder (lib/reed_solomon/decode_rs.c:113-121), the erasure locator polynomial loop writes lambda[j] where j can reach nroots + 1, one element past the end of lambda[] (which is sized nroots + 1, valid indices 0..nroots). The out-of-bounds write lands on syn[0], corrupting the syndrome buffer. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Assisted-by: Claude:claude-opus-4-6 Cc: stable@vger.kernel.org Fixes: a739ff3f543a ("dm verity: add support for forward error correction") Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
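The off-by-one can be modelled in isolation. The sketch below shows one plausible shape of a bound that keeps the erasure count at or below the number of roots; it is an assumption for illustration and not necessarily the change the patch makes.

```c
#include <stddef.h>

/* Records position i as an erasure only while the count is still below roots. */
static void note_erasure_sketch(int *erasures, unsigned *neras,
				unsigned roots, int i)
{
	/* with "<=" here, the post-increment could push *neras to roots + 1 */
	if (neras && *neras < roots)
		erasures[(*neras)++] = i;
}
```

With a strict comparison the count passed on to the decoder can never exceed `roots`, however many candidate positions are offered.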
11 days	dm era: fix out-of-bounds memory access for non-zero start sector	Samuel Moelius
	dm-era tracks writes in target-relative blocks, but era_map() calculates the writeset block before applying the target offset. Tables with a non-zero start sector can therefore pass an absolute mapped-device block to metadata_current_marked(). If the absolute block is beyond the current writeset size, writeset_marked() tests past the end of the in-core bitset. KASAN reports this as a vmalloc-out-of-bounds access. Apply the target offset before calculating the era block so writeset lookups use the target-relative block number. Assisted-by: Codex:gpt-5.5-cyber-preview Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com> Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Fixes: eec40579d848 ("dm: add era target")
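The ordering issue comes down to subtracting the target's start before dividing by the block size. A minimal sketch, with a hypothetical struct and helper in place of the dm-era code:

```c
#include <stdint.h>

struct target_sketch {
	uint64_t begin;		/* first sector of the target in the mapped device */
};

/* Target-relative block number for a bio that starts at bio_sector. */
static uint64_t era_block_sketch(const struct target_sketch *ti,
				 uint64_t bio_sector, uint64_t sectors_per_block)
{
	uint64_t relative = bio_sector - ti->begin;	/* apply the offset first */

	return relative / sectors_per_block;
}
```

For a target that begins at sector 960 with 8-sector blocks, sector 1000 maps to block 5; dividing the absolute sector instead would give 125, which may lie beyond the end of a small writeset.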
11 days	dm-log: fix a bitset_size overflow on 32bit machines	Benjamin Marzinski
	Commit c20e36b7631d ("dm log: fix out-of-bounds write due to region_count overflow") made sure that region_count could fit in an unsigned int. But the bitmap memory isn't allocated based on region_count. It uses bitset_size (a size_t variable). The first step of calculating bitset_size is to set it to region_count, rounded up to a multiple of BITS_PER_LONG. If region_count is less than BITS_PER_LONG below UINT_MAX, it will get rounded up to 2^32. On a 32bit architecture, this will make bitset_size wrap around to 0 and fail, despite region_count being valid. Since bitset_size gets divided by 8, it can hold any valid region_count. It just needs a special case to handle the rollover. If it is 0, the value rolled over, and bitset_size should be set to the number of bytes needed to hold 2^32 bits. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: c20e36b7631d ("dm log: fix out-of-bounds write due to region_count overflow") Cc: stable@vger.kernel.org
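The rollover can be reproduced on any machine by doing the arithmetic in `uint32_t`, which models a 32-bit `size_t`. The helper below is a hypothetical sketch of the special case described above; the extra `region_count != 0` guard is an addition for the sketch, not something the message mentions.

```c
#include <stdint.h>

#define BITS_PER_LONG_SKETCH 32u	/* models a 32-bit architecture */

/* Bytes needed for a bitmap of region_count bits, using 32-bit arithmetic. */
static uint32_t bitset_bytes_sketch(uint32_t region_count)
{
	/* round up to a multiple of BITS_PER_LONG; this may wrap to 0 */
	uint32_t bits = (region_count + BITS_PER_LONG_SKETCH - 1) &
			~(BITS_PER_LONG_SKETCH - 1);

	if (bits == 0 && region_count != 0)
		return 1u << (32 - 3);	/* 2^32 bits == 2^29 bytes */
	return bits >> 3;
}
```

The byte count for 2^32 bits still fits comfortably in 32 bits, so only the intermediate bit count needs the special case.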
2026-06-25	Merge tag 'block-7.2-20260625' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - blk-cgroup locking rework and fixes: - fix a use-after-free in __blkcg_rstat_flush() - defer freeing policy data until after an RCU grace period - defer the blkcg css_put until the blkg is unlinked from the queue - unwind the queue_lock nesting under RCU / blkcg->lock across the lookup, create, associate and destroy paths - NVMe fixes via Keith: - Fix a crash and memory leak during invalid cdev teardown, and related cdev cleanups (Maurizio, John) - nvmet fixes: handle TCP_CLOSING in the tcp state_change handler, reject short AUTH_RECEIVE buffers, handle inline data with a nonzero offset in rdma, fix an sq refcount leak, and allocate ana_state with the port (Maurizio, Michael, Bryam, Wentao, Rosen) - nvme-fc fix to not cancel requests on an IO target before it is initialized (Mohamed) - nvme-apple fix to prevent shared tags across queues on Apple A11 (Nick) - Various smaller fixes and cleanups (John) - MD fixes via Yu Kuai: - raid1/raid10 fixes for writes_pending and barrier reference leaks on write and discard failures, plus REQ_NOWAIT handling fixes (Abd-Alrhman) - raid5 discard accounting and validation, and a batch of fixes for stripe batch races (Yu Kuai, Chen) - Protect raid1 head_position during read balancing (Chen) - block bio-integrity fixes: correct an error injection static key decrement, fix GFP flag confusion in bio_integrity_alloc_buf(), and handle REQ_OP_ZONE_APPEND in __bio_integrity_action() (Christoph) - Fixes for bio_iov_iter_bounce_write(): revert the iov_iter after a short copy, and respect the iov_iter nofault flag (Qu) - Invalidate the cached plug timestamp after a task switch, and clear PF_BLOCK_TS in copy_process() (Usama) - Fix the IORING_URING_CMD_REISSUE flags check in blkdev_uring_cmd() (Yitang) - Remove a redundant plug in __submit_bio() (Wen) - Don't warn when reclassifying a busy socket lock in nbd (Deepanshu) * tag 'block-7.2-20260625' of 
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (45 commits) block: handle REQ_OP_ZONE_APPEND in __bio_integrity_action block: fix GFP_ flags confusion in bio_integrity_alloc_buf block, bfq: don't grab queue_lock to initialize bfq mm/page_io: don't nest queue_lock under rcu in bio_associate_blkg_from_page() blk-cgroup: don't nest queue_lock under blkcg->lock in blkcg_destroy_blkgs() blk-cgroup: don't nest queue_lock under rcu in bio_associate_blkg() blk-cgroup: don't nest queue_lock under rcu in blkg_lookup_create() blk-cgroup: don't nest queue_lock under rcu in blkcg_print_blkgs() blk-cgroup: delay freeing policy data after rcu grace period blk-cgroup: protect iterating blkgs with blkcg->lock in blkcg_print_stat() md/raid5: avoid R5_Overlap races while breaking stripe batches md/raid5: use stripe state snapshot in break_stripe_batch_list() blk-cgroup: defer blkcg css_put until blkg is unlinked from queue blk-cgroup: fix UAF in __blkcg_rstat_flush() block, bfq: protect async queue reset with blkcg locks nbd: don't warn when reclassifying a busy socket lock block: fix incorrect error injection static key decrement md/raid5: let stripe batch bm_seq comparison wrap-safe md/raid1: protect head_position for read balance md/raid1: free r1_bio when REQ_NOWAIT is set and read would block on retry ...
2026-06-23	md/raid5: avoid R5_Overlap races while breaking stripe batches	Chen Cheng
	KCSAN reports a race between break_stripe_batch_list() and raid5_make_request() on sh->dev[i].flags (plain word write vs. atomic bit op). One possible scenario is: CPU1: break_stripe_batch_list(sh1) -> handle sh2 -> lock(sh2) -> sh2->batch_head = NULL -> unlock(sh2) -> test_and_clear_bit(R5_Overlap, sh2->dev[i].flags) -> wake_up_bit(sh2->dev[i].flags). CPU2: raid5_make_request() -> add_all_stripe_bios(sh2) -> lock(sh2) -> stripe_bio_overlaps(sh2) returns true (batch_head is NULL, so the new bio overlaps an existing bio on sh2) -> set_bit(R5_Overlap, sh2->dev[i].flags) -> unlock(sh2) -> wait_on_bit(sh2->dev[i].flags). CPU1: sh2->dev[i].flags = sh1->dev[i].flags & ~R5_Overlap. No wake_up_bit() follows, so CPU2 could sit in wait_on_bit() forever. Fix by: - Expanding the protected zone. - Using a snapshot of the batch head's device flags when head_sh->stripe_lock is not held. - Moving sh/head_sh->batch_head = NULL to the end of the protected zone, so that any concurrent add_all_stripe_bios() that grabs sh->stripe_lock now either: - sees batch_head != NULL and is rejected by stripe_bio_overlaps() under the lock (no R5_Overlap wait), or - sees batch_head == NULL only after dev[i].flags has already been set and the prior R5_Overlap waiters woken. 
KCSAN report: ================================================ BUG: KCSAN: data-race in break_stripe_batch_list / raid5_make_request write (marked) to 0xffff8e89c8117548 of 8 bytes by task 4042 on cpu 0: raid5_make_request+0xea0/0x2930 md_handle_request+0x4a2/0xa40 md_submit_bio+0x109/0x1a0 __submit_bio+0x2ec/0x390 submit_bio_noacct_nocheck+0x457/0x710 submit_bio_noacct+0x2a7/0xc20 submit_bio+0x56/0x250 blkdev_direct_IO+0x54c/0xda0 blkdev_write_iter+0x38f/0x570 aio_write+0x22b/0x490 io_submit_one+0xa51/0xf70 __x64_sys_io_submit+0xf7/0x220 x64_sys_call+0x1907/0x1c60 do_syscall_64+0x130/0x570 entry_SYSCALL_64_after_hwframe+0x76/0x7e read to 0xffff8e89c8117548 of 8 bytes by task 4010 on cpu 5: break_stripe_batch_list+0x249/0x480 handle_stripe_clean_event+0x720/0x9b0 handle_stripe+0x32fb/0x4500 handle_active_stripes.isra.0+0x6e0/0xa50 raid5d+0x7e0/0xba0 md_thread+0x15a/0x2d0 kthread+0x1e3/0x220 ret_from_fork+0x37a/0x410 ret_from_fork_asm+0x1a/0x30 value changed: 0x0000000000000019 -> 0x0000000000000099 --> R5_Overlap Fixes: fb642b92c267 ("md/raid5: duplicate some more handle_stripe_clean_event code in break_stripe_batch_list") Signed-off-by: Chen Cheng <chencheng@fnnas.com> Link: https://patch.msgid.link/20260619041013.1207148-1-chencheng@fnnas.com Signed-off-by: Yu Kuai <yukuai@fygo.io>
2026-06-23	md/raid5: use stripe state snapshot in break_stripe_batch_list()	Chen Cheng
	This patch just suppresses KCSAN noise. No functional change. RAID-5 can group multiple full-stripe writes (stripe_heads) into a batch (batch_list), with one head_sh leading them. break_stripe_batch_list() is called when the batch is finished or when a stripe has to be dropped out of the batch. break_stripe_batch_list() reads stripe state several times while request paths can update those state words concurrently with lockless bitops, which KCSAN reports. Use a snapshot to guarantee that the value used for warning, copying, and handle checks is internally consistent at the moment of the read. KCSAN report: ============================================== BUG: KCSAN: data-race in __add_stripe_bio / break_stripe_batch_list write (marked) to 0xffff8e89d4f0b988 of 8 bytes by task 4323 on cpu 3: __add_stripe_bio+0x35e/0x400 raid5_make_request+0x6ac/0x2930 md_handle_request+0x4a2/0xa40 md_submit_bio+0x109/0x1a0 __submit_bio+0x2ec/0x390 submit_bio_noacct_nocheck+0x457/0x710 submit_bio_noacct+0x2a7/0xc20 submit_bio+0x56/0x250 blkdev_direct_IO+0x54c/0xda0 blkdev_write_iter+0x38f/0x570 aio_write+0x22b/0x490 io_submit_one+0xa51/0xf70 read to 0xffff8e89d4f0b988 of 8 bytes by task 4290 on cpu 4: break_stripe_batch_list+0x3ce/0x480 handle_stripe_clean_event+0x720/0x9b0 handle_stripe+0x32fb/0x4500 handle_active_stripes.isra.0+0x6e0/0xa50 raid5d+0x7e0/0xba0 Signed-off-by: Chen Cheng <chencheng@fnnas.com> Link: https://patch.msgid.link/20260618134748.1168360-1-chencheng@fnnas.com Signed-off-by: Yu Kuai <yukuai@fygo.io>
2026-06-21	Merge tag 'mm-nonmm-stable-2026-06-21-10-22' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "taskstats: fix TGID dead-thread stat retention" (Yiyang Chen) Fix a taskstats TGID aggregation bug where fields added in the TGID query path were not preserved after thread exit, and adds a kselftest covering the regression. - "lib/tests: string_helpers: Slight improvements" (Andy Shevchenko) Improve lib/tests/string_helpers_kunit.c a little - "lib/base64: decode fixes" (Josh Law) Address minor issues in lib/base64.c - "selftests/filelock: Make output more kselftestish" (Mark Brown) Make the output from the ofdlocks test a bit easier for tooling to work with. Also ignore the generated file - "uaccess: unify inline vs outline copy_{from,to}_user() selection" (Yury Norov) Simplify the usercopy code by removing the selectability of inlining copy_{from,to}_user(). - "ocfs2: validate inline xattr header consumers" (ZhengYuan Huang) Fix a number of possible issues in the ocfs2 xattr code - "lib and lib/cmdline enhancements" (Dmitry Antipov) Provide additional robustness checking in the cmdline handling code and its in-kernel testing and selftests - "cleanup the RAID6 P/Q library" (Christoph Hellwig) Clean up the RAID6 P/Q library to match the recent updates to the RAID 5 XOR library and other CRC/crypto libraries - "ocfs2: harden inode validators against forged metadata" (Michael Bommarito) Add three structural checks to OCFS2 dinode validation so malformed on-disk fields are rejected before ocfs2_populate_inode() copies them into the in-core inode - "lib/raid: replace __get_free_pages() call with kmalloc()" (Mike Rapoport) Clean up the lib/raid code by using kmalloc() in more places * tag 'mm-nonmm-stable-2026-06-21-10-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (108 commits) ocfs2: fix circular locking dependency in ocfs2_dio_end_io_write ocfs2: fix NULL h_transaction deref in ocfs2_assure_trans_credits lib: interval_tree_test: validate benchmark 
parameters ocfs2: avoid moving extents to occupied clusters treewide: fix transposed "sign" typos and update spelling.txt ocfs2: fix UBSAN array-index-out-of-bounds in ocfs2_sum_rightmost_rec fat: reject BPB volumes whose data area starts beyond total sectors selftests/uevent: increase __UEVENT_BUFFER_SIZE to avoid ENOBUFS on busy systems lib/test_firmware: allocate the configured into_buf size fs: efs: remove unneeded debug prints checkpatch: suppress warnings when Reported-by: is followed by Link: MAINTAINERS: add Alexander as a kcov reviewer mailmap: update Alexander Sverdlin's Email addresses fs: fat: inode: replace sprintf() with scnprintf() ocfs2: fix out-of-bounds write in ocfs2_remove_refcount_extent ocfs2: fix race between ocfs2_control_install_private() and ocfs2_control_release() ocfs2/dlm: require a ref for locking_state debugfs open ocfs2: reject FITRIM ranges shorter than a cluster ocfs2: validate fast symlink target during inode read ocfs2: add journal NULL check in ocfs2_checkpoint_inode() ...
2026-06-21	md/raid5: let stripe batch bm_seq comparison wrap-safe	Chen Cheng
	Once the 32-bit seq wraps, a newer bm_seq can look smaller than an older one, so convert the comparison to a wrap-safe calculation. Signed-off-by: Chen Cheng <chencheng@fnnas.com> Link: https://patch.msgid.link/20260618025735.915113-1-chencheng@fnnas.com Signed-off-by: Yu Kuai <yukuai@fygo.io>
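A common way to compare wrapping sequence numbers is to look at the sign of their difference. The helper below is a generic sketch of that technique with a hypothetical name; it is not taken from the raid5 patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a is newer than b, treating the 32-bit counter as circular. */
static bool seq_after_sketch(uint32_t a, uint32_t b)
{
	/* the unsigned subtraction wraps; the signed view gives the direction */
	return (int32_t)(a - b) > 0;
}
```

This stays correct as long as the two values are less than 2^31 apart, so a sequence number just after the wrap still compares as newer than one just before it.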
2026-06-21	md/raid1: protect head_position for read balance	Chen Cheng
	KCSAN reports a data race between raid1_end_read_request() and raid1_read_request(). The completion path updates conf->mirrors[disk].head_position in update_head_pos() without a lock, while the read-balance heuristic reads the same field locklessly in is_sequential() and choose_best_rdev(). KCSAN report: ========================= BUG: KCSAN: data-race in raid1_end_read_request / raid1_read_request write to 0xffff8f0306ba7868 of 8 bytes by interrupt on cpu 9: raid1_end_read_request+0xb5/0x440 bio_endio+0x3c9/0x3e0 blk_update_request+0x257/0x770 scsi_end_request+0x4d/0x520 scsi_io_completion+0x6f/0x990 scsi_finish_command+0x188/0x280 scsi_complete+0xac/0x160 blk_complete_reqs+0x8e/0xb0 blk_done_softirq+0x1d/0x30 [...] read to 0xffff8f0306ba7868 of 8 bytes by task 667002 on cpu 11: raid1_read_request+0x497/0x1a10 raid1_make_request+0xdf/0x1950 md_handle_request+0x2c5/0x700 md_submit_bio+0x126/0x320 __submit_bio+0x2ec/0x3a0 submit_bio_noacct_nocheck+0x572/0x890 [...] value changed: 0x0000000000000078 -> 0x00000000005fe448 Signed-off-by: Chen Cheng <chencheng@fnnas.com> Link: https://patch.msgid.link/20260619044114.1208456-1-chencheng@fnnas.com Signed-off-by: Yu Kuai <yukuai@fygo.io>
2026-06-21	md/raid1: free r1_bio when REQ_NOWAIT is set and read would block on retry	Abd-Alrhman Masalkhi
	When a read is retried, raid1_read_request() may be called with a pre-allocated r1_bio. If wait_read_barrier() fails for a REQ_NOWAIT read, the bio is completed and the function returns immediately. In this case the existing r1_bio is leaked. This fixes a leak of pre-allocated r1_bio structures for retried reads. Fixes: 5aa705039c4f ("md: raid1 add nowait support") Reported-by: sashiko-bot <sashiko-bot@kernel.org> Closes: https://sashiko.dev/#/patchset/20260611083514.754922-1-abd.masalkhi@gmail.com?part=1 Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Link: https://patch.msgid.link/20260611101350.759154-1-abd.masalkhi@gmail.com Signed-off-by: Yu Kuai <yukuai@fygo.io>
2026-06-21	md/raid1: honor REQ_NOWAIT when waiting for behind writes	Abd-Alrhman Masalkhi
	raid1 supports REQ_NOWAIT reads by avoiding waits in the barrier path through wait_read_barrier(). However, a read can still block on a WriteMostly device when the array uses a bitmap and there are outstanding behind writes. In that case raid1 unconditionally calls wait_behind_writes(), which may sleep until all behind writes complete. As a result, a REQ_NOWAIT read can block despite the caller explicitly requesting non-blocking behavior. This ensures that raid1 consistently honors REQ_NOWAIT reads across all paths that may otherwise wait for behind writes. Fixes: 5aa705039c4f ("md: raid1 add nowait support") Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Link: https://patch.msgid.link/20260611083514.754922-1-abd.masalkhi@gmail.com Signed-off-by: Yu Kuai <yukuai@fygo.io>
2026-06-21	md/raid5: always convert llbitmap bits for discard	Yu Kuai
	llbitmap discard is useful even when no underlying member device supports it. The discard still converts the llbitmap range to unwritten, so later reads and recovery do not rely on stale parity for that range. Let llbitmap discard bypass the raid5 lower discard support check. If lower discard is not safe or not supported, complete the accounted clone after md_account_bio() so the llbitmap conversion callbacks run without member discard bios. Link: https://patch.msgid.link/20260605072639.2434847-4-yukuai@kernel.org Signed-off-by: Yu Kuai <yukuai@fygo.io>
2026-06-21	md/raid5: validate discard support at request time	Yu Kuai
	Raid5 used to disable discard limits when devices_handle_discard_safely was not set or when stacked member limits could not support a full-stripe discard. That hides discard from userspace before raid5 can decide whether a request can be handled safely. Follow other virtual drivers and advertise a UINT_MAX discard limit for the md device. Cache lower discard support in r5conf when setting queue limits, and reject unsupported discard bios before queuing stripe work. Link: https://patch.msgid.link/20260605072639.2434847-3-yukuai@kernel.org Signed-off-by: Yu Kuai <yukuai@fygo.io>
2026-06-21	md/raid5: account discard IO	Yu Kuai
	Raid5 handles discard bios internally through make_discard_request() and never passes them through md_account_bio(). As a result, discard IO is missing the md-device iostat accounting that normal raid5 IO and discard IO in other raid levels get from md_account_bio(). Before accounting the bio, trim the request to the full data stripes that raid5 will actually discard. The first full stripe is the ceiling of the bio start divided by data-stripe sectors, and the last full stripe is the floor of the bio end divided by data-stripe sectors. Account that exact MD logical full-stripe range, then restore the original iterator so bio completion and iostat still cover the original request. Link: https://patch.msgid.link/20260605072639.2434847-2-yukuai@kernel.org Signed-off-by: Yu Kuai <yukuai@fygo.io>
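The ceiling/floor trimming described in the raid5 discard accounting commit above can be sketched as follows. This is only an illustration, not the kernel's code: the helper name `trim_to_full_stripes`, the struct, and the convention that the bio range is half-open (`[start, end)` in sectors, so the returned `last` index is exclusive) are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: full stripes [first, last), counted in stripes. */
struct full_stripe_range {
    uint64_t first; /* index of the first full data stripe (inclusive) */
    uint64_t last;  /* index one past the last full data stripe */
};

/*
 * Trim the sector range [start, end) to whole data stripes.
 * Returns false when the range does not cover a single full stripe.
 */
static bool trim_to_full_stripes(uint64_t start, uint64_t end,
                                 uint64_t stripe_sectors,
                                 struct full_stripe_range *out)
{
    assert(stripe_sectors > 0);
    assert(start <= end);

    /* Ceiling of start / stripe_sectors, written to avoid overflow. */
    uint64_t first = start / stripe_sectors + (start % stripe_sectors != 0);
    /* Floor of end / stripe_sectors. */
    uint64_t last = end / stripe_sectors;

    if (first >= last)
        return false;

    out->first = first;
    out->last = last;
    return true;
}
```

With 1024-sector data stripes, a request covering sectors 1000 to 5000 is trimmed to stripes 1 through 3, that is, sectors 1024 to 4096; a request that fits inside one stripe yields no full stripe at all.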
2026-06-21	md/raid1: simplify raid1_write_request() error handling	Abd-Alrhman Masalkhi
	raid1_write_request() increments rdev->nr_pending before checking the badblocks and then immediately decrements it again when a device is skipped. Move the increment so that it happens only after the checks succeed, which makes the reference accounting easier to follow. Consolidate the failure paths so that each error label releases exactly the resources acquired up to that point. err_dec_pending drops pending references and frees the r1bio, while err_allow_barrier handles the barrier release before returning. When a REQ_ATOMIC write cannot be satisfied due to a badblock range, complete the bio with BLK_STS_NOTSUPP rather than reporting an I/O error, since the operation is unsupported rather than having failed during I/O. Rename max_write_sectors to max_sectors and remove the redundant local copy. Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Link: https://patch.msgid.link/20260613182810.1317258-5-abd.masalkhi@gmail.com Signed-off-by: Yu Kuai <yukuai@fygo.io>
2026-06-21	md/raid10: fix writes_pending and barrier reference leaks on discard failures	Abd-Alrhman Masalkhi
	raid10_make_request() acquires a writes_pending reference with md_write_start() before calling raid10_handle_discard(). Several failure paths in raid10_handle_discard() complete the bio and return without releasing the corresponding reference, causing md_write_end() to be skipped. Call md_write_end() before returning from these failure paths to keep writes_pending accounting balanced. Additionally, discard split allocation failures can occur after wait_barrier() succeeds. Those paths return without calling allow_barrier(), leaking the associated barrier reference. Release the barrier before returning from those paths. Fixes: c9aa889b035f ("md: raid10 add nowait support") Fixes: 4cf58d952909 ("md/raid10: Handle bio_split() errors") Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Link: https://patch.msgid.link/20260613182810.1317258-4-abd.masalkhi@gmail.com Signed-off-by: Yu Kuai <yukuai@fygo.io>
2026-06-21	md/raid10: fix writes_pending leak on write request failures	Abd-Alrhman Masalkhi
	raid10_make_request() acquires a writes_pending reference with md_write_start() before dispatching write requests. Several failure paths in raid10_write_request() complete the bio and return without reaching the normal write completion path, causing the corresponding md_write_end() to be skipped. Make raid10_write_request() return a status indicating whether the write request was successfully queued. This allows raid10_make_request() to release the writes_pending reference with md_write_end() when a write request fails. Fixes: 4cf58d952909 ("md/raid10: Handle bio_split() errors") Fixes: c9aa889b035f ("md: raid10 add nowait support") Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Link: https://patch.msgid.link/20260613182810.1317258-3-abd.masalkhi@gmail.com Signed-off-by: Yu Kuai <yukuai@fygo.io>
2026-06-21	md/raid1: fix writes_pending and barrier reference leaks on write failures	Abd-Alrhman Masalkhi
	raid1_make_request() acquires a writes_pending reference with md_write_start() before calling raid1_write_request(). Several failure paths in raid1_write_request() complete the bio and return without reaching the normal write completion path, causing the corresponding md_write_end() to be skipped. Make raid1_write_request() return a status indicating whether the write request was successfully queued. This allows raid1_make_request() to call md_write_end() when raid1_write_request() fails. Additionally, if wait_blocked_rdev() fails after wait_barrier() succeeds, the associated barrier reference is not released. Call allow_barrier() before returning from that path to keep the barrier accounting balanced. Fixes: b1a7ad8b5c4f ("md/raid1: Handle bio_split() errors") Fixes: f2a38abf5f1c ("md/raid1: Atomic write support") Fixes: 5aa705039c4f ("md: raid1 add nowait support") Reported-by: sashiko-bot <sashiko-bot@kernel.org> Closes: https://sashiko.dev/#/patchset/20260611083514.754922-1-abd.masalkhi@gmail.com?part=1 Closes: https://sashiko.dev/#/patchset/20260611132500.763528-1-abd.masalkhi@gmail.com?part=1 Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Link: https://patch.msgid.link/20260613182810.1317258-2-abd.masalkhi@gmail.com Signed-off-by: Yu Kuai <yukuai@fygo.io>
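The "return a status so the caller can drop its reference" pattern described in the raid1 write-failure commit above might look roughly like this. It is a userspace sketch under stated assumptions: `struct array_state` and every function name here are hypothetical stand-ins for the md structures and for md_write_start(), md_write_end(), wait_barrier() and allow_barrier(); none of this is the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the array state, not the kernel's struct mddev. */
struct array_state {
    int writes_pending; /* models the md_write_start()/md_write_end() count */
    int barrier_refs;   /* models the wait_barrier()/allow_barrier() count */
};

static void write_start(struct array_state *a)
{
    a->writes_pending++;
}

static void write_end(struct array_state *a)
{
    assert(a->writes_pending > 0);
    a->writes_pending--;
}

/* Returns true when the write was queued, false when it failed early. */
static bool write_request(struct array_state *a, bool device_blocked)
{
    a->barrier_refs++;          /* barrier acquired */
    if (device_blocked) {
        a->barrier_refs--;      /* failure path releases the barrier */
        return false;
    }
    return true;                /* references are dropped at I/O completion */
}

static void make_request(struct array_state *a, bool device_blocked)
{
    write_start(a);
    if (!write_request(a, device_blocked))
        write_end(a);           /* caller balances its own reference */
}
```

The point of the boolean return is that the function which took the reference is also the one that releases it on failure, so the callee does not need to know what its caller acquired.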
2026-06-16	Merge tag 'for-7.2/dm-changes' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mikulas Patocka: - small cleanups in dm-vdo, dm-raid, dm-cache, dm-zoned-metadata - rework of dm-ima - introduce dm-inlinecrypt - fix wrong return value in dm-ioctl - fix rcu stall when polling * tag 'for-7.2/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-zoned-metadata: Use strscpy() to copy device name dm cache: make smq background work limit configurable dm-inlinecrypt: add support for hardware-wrapped keys dm: limit target bio polling to one shot dm-ioctl: report an error if a device has no table dm: add documentation for dm-inlinecrypt target dm-inlinecrypt: add target for inline block device encryption block: export blk-crypto symbols required by dm-inlinecrypt dm-ima: use active table's size if available dm-ima: Fail more gracefully in dm_ima_measure_on_* dm-ima: Handle race between rename and table swap dm-ima: Fix issues with dm_ima_measure_on_device_rename dm-ima: remove new_map from dm_ima_measure_on_device_clear dm-ima: Fix UAF errors and measuring incorrect context dm-ima: don't copy the active table to the inactive table dm-ima: Remove status_flags from dm_ima_measure_on_table_load() dm-ima: remove broken last_target_measured logic dm-ima: remove dm_ima_reset_data() dm-raid: only requeue bios when dm is suspending dm vdo: use get_random_u32() where appropriate
2026-06-16	Merge tag 'for-7.2/block-20260615' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: - NVMe pull request via Keith: - Per-controller admin and IO timeout sysfs attributes, and letting the block layer set request timeouts (Maurizio, Maximilian) - Multipath passthrough iostats, and PCI P2PDMA enablement for multipath devices (Keith, Kiran) - A new diag sysfs attribute group exporting per-controller counters (retries, multipath failover, error counters, requeue and failure counts, reset and reconnect events) (Nilay) - FDP configuration validation and bounds check fixes (liuxixin) - Various nvmet fixes, including a pre-auth out-of-bounds read in the Discovery Get Log Page handler, auth payload bounds validation, and tcp error-path leak fixes (Bryam, Tianchu, Geliang) - nvme-tcp lockdep and workqueue fixes (Shin'ichiro, Kuniyuki, Eric) - Assorted other fixes and cleanups (John, Yao, Chao, Mateusz, Achkinazi, Wentao) - MD pull request via Yu Kuai: - raid1/raid10 fixes for a deadlock in the read error recovery path, error-path detection and bio accounting with cloned bios, and an nr_pending leak in the REQ_ATOMIC bad-block error path (Abd-Alrhman) - PCI P2PDMA propagation from member devices to the RAID device (Kiran) - dm-raid bio requeue fix, and various smaller fixes and cleanups (Benjamin, Chen, Li, Thorsten) - Enable Clang lock context analysis for the block layer, with the accompanying annotations across queue limits, the blk_holder_ops callbacks, crypto, cgroup, iocost, kyber and mq-deadline (Bart) - Block status code infrastructure work: a tagged status table, a str_to_blk_op() helper, a bio_endio_status() helper, and on top of that a new configurable block-layer error injection facility (Christoph) - DRBD netlink rework, replacing the genl_magic machinery with explicit netlink serialization and moving the DRBD UAPI headers to include/uapi/linux/ (Christoph Böhmwalder) - bvec improvements: a bvec_folio() helper and making the bvec_iter helpers proper inline functions (Willy, Christoph) - ublk cleanups and a canceling-flag fix for the disk-not-allocated case (Caleb, Ming) - Partition handling fixes: bound the AIX pp_count scan, fix an of_node refcount leak, and replace __get_free_page() with kmalloc() (Bryam, Wentao, Mike) - Convert numa_node to int in blk_mq_hw_ctx and ->init_request, and add WQ_PERCPU to the block workqueue users (Mateusz, Marco) - Block statistics and tracing: propagate in-flight to the whole disk on partition IO, export passthrough stats, and a new block_rq_tag_wait tracepoint (Tang, Keith, Aaron) - A round of removals, unexports and cleanups across bio, direct-io and the bvec helpers (Christoph) - Various driver fixes (mtip32xx use-after-free, rbd snap_count validation and strscpy conversion, nbd socket lockdep reclassify, virtio-blk zone report clamp, floppy) and a batch of MAINTAINERS email/list updates (Coly, Li, Yu, Christoph Böhmwalder) - Other little fixes and cleanups all over * tag 'for-7.2/block-20260615' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (117 commits) MAINTAINERS: Update Coly Li's email address block: check bio split for unaligned bvec nbd: Reclassify sockets to avoid lockdep circular dependency block: add configurable error injection block: add a str_to_blk_op helper block: add a "tag" for block status codes block: add a macro to initialize the status table floppy: Drop unused pnp driver data block: propagate in_flight to whole disk on partition I/O virtio-blk: clamp zone report to the report buffer capacity block: optimize I/O merge hot path with unlikely() hints drivers/block/rbd: Use strscpy() to copy strings into arrays partitions: aix: bound the pp_count scan to the ppe array block: Enable lock context analysis block/mq-deadline: Make the lock context annotations compatible with Clang block/Kyber: Make the lock context annotations compatible with Clang block/blk-mq-debugfs: Improve lock context annotations block/blk-iocost: Inline iocg_lock() and iocg_unlock() block/blk-iocost: Split ioc_rqos_throttle() block/crypto: Annotate the crypto functions ...
2026-06-15	Merge tag 'vfs-7.2-rc1.bh' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull buffer_head updates from Christian Brauner: "This removes b_end_io from struct buffer_head. Instead of setting bio->bi_end_io to end_bio_bh_io_sync() which then calls bh->b_end_io(), the new bh_submit() and __bh_submit() interfaces set bio->bi_end_io to the appropriate completion handler directly, replacing two indirect function calls in the completion path with one. It is also one fewer function pointer in the middle of a writable data structure that can be corrupted, it shrinks struct buffer_head from 104 to 96 bytes allowing roughly 7% more buffer_heads to be cached in the same amount of memory, and it removes some atomic operations as the buffer refcount is no longer incremented before calling the end_io handler. All in-tree users (fs/buffer.c itself, ext4, jbd2, ocfs2, gfs2, nilfs2, and md-bitmap) are converted, and submit_bh(), mark_buffer_async_write(), and end_buffer_write_sync() are removed" * tag 'vfs-7.2-rc1.bh' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (34 commits) buffer: Remove end_buffer_write_sync() buffer: Change calling convention for end_buffer_read_sync() buffer: Remove b_end_io buffer: Remove submit_bh() md-bitmap: Convert read_file_page and write_file_page to bh_submit() nilfs2: Convert nilfs_mdt_submit_block to bh_submit() nilfs2: Convert nilfs_gccache_submit_read_data to bh_submit() nilfs2: Convert nilfs_btnode_submit_block to bh_submit() buffer: Remove mark_buffer_async_write() gfs2: Convert gfs2_aspace_write_folio to bh_submit() gfs2: Remove use of b_end_io in gfs2_meta_read_endio() gfs2: Convert gfs2_dir_readahead to bh_submit() gfs2: Convert gfs2_metapath_ra to bh_submit() ocfs2: Convert ocfs2_write_super_or_backup to bh_submit() ocfs2: Convert ocfs2_read_blocks to bh_submit() ocfs2: Convert ocfs2_read_block to bh_submit() ocfs2: Convert ocfs2_write_block to bh_submit() jbd2: Convert jbd2_write_superblock() to bh_submit() jbd2: Convert journal commit to bh_submit() ext4: Convert ext4_commit_super() to bh_submit() ...
2026-06-08	dm-zoned-metadata: Use strscpy() to copy device name	David Laight
	Replace strcpy with strscpy in drivers/md/dm-zoned-metadata.c. Signed-off-by: David Laight <david.laight.linux@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-06-04	buffer: Remove b_end_io	Matthew Wilcox (Oracle)
	This shrinks buffer_head by 8 bytes, letting us pack more buffer heads per slab. With a Debian config, it shrinks from 104 bytes to 96 bytes which is 42 objects per 4KiB page rather than 39, a 7% reduction in the amount of memory used. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20260528173150.1093780-33-willy@infradead.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
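The object counts quoted in the b_end_io removal commit above follow from simple integer division. The sketch below assumes objects are packed back to back with no per-slab metadata or alignment padding, which is a simplification; `objects_per_page` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of fixed-size objects that fit in one page, assuming no
 * allocator overhead and no padding between objects.
 */
static size_t objects_per_page(size_t page_size, size_t object_size)
{
    assert(object_size > 0);
    return page_size / object_size;
}
```

For a 4096-byte page this gives 39 objects at 104 bytes and 42 objects at 96 bytes. Caching the same number of objects therefore needs 39/42 of the pages, which is where the roughly 7% memory reduction comes from.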
2026-06-04	md-bitmap: Convert read_file_page and write_file_page to bh_submit()	Matthew Wilcox (Oracle)
	Avoid an extra indirect function call by using bh_submit() instead of submit_bh(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20260528173150.1093780-31-willy@infradead.org Reviewed-by: Jan Kara <jack@suse.cz> Cc: linux-raid@vger.kernel.org Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-06-01	Merge tag 'for-7.1/dm-fixes-3' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fix from Mikulas Patocka: - fix race condition in dm-cache-policy-smq * tag 'for-7.1/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm cache policy smq: check allocation under invalidate lock
2026-06-01	Merge tag 'md-7.2-20260531' of ↵	Jens Axboe
	https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into for-7.2/block Pull MD updates and fixes from Yu Kuai: "Bug Fixes: - Only requeue dm-raid bios when dm is suspending. (Benjamin Marzinski) - Reset raid10 read_slot when reusing r10bio for discard. (Chen Cheng) - Fix raid1/raid10 deadlock in read error recovery path. (Abd-Alrhman Masalkhi) - Fix raid1/raid10 error-path detection with md_cloned_bio(). (Abd-Alrhman Masalkhi) - Fix raid1/raid10 bio accounting for split md cloned bios. (Abd-Alrhman Masalkhi) - Fix raid1 nr_pending leak in REQ_ATOMIC bad-block path. (Abd-Alrhman Masalkhi) Improvements: - Skip redundant raid_disks updates when the value is unchanged. (Abd-Alrhman Masalkhi) Cleanups: - Update MAINTAINERS email addresses. (Yu Kuai, Li Nan) - Clean up raid1 read error handling. (Christoph Hellwig) - Move the exceed_read_errors condition out of fix_read_error(). (Christoph Hellwig) - Use str_plural() in raid0 dump_zones(). (Thorsten Blum)" * tag 'md-7.2-20260531' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux: md/raid0: use str_plural helper in dump_zones raid1: fix nr_pending leak in REQ_ATOMIC bad-block error path md/raid1: move the exceed_read_errors condition out of fix_read_error md/raid1: cleanup handle_read_error md/raid1,raid10: fix bio accounting for split md cloned bios md/raid1,raid10: fix error-path detection with md_cloned_bio() md/raid1,raid10: fix deadlock in read error recovery path md/raid10: reset read_slot when reusing r10bio for discard md: skip redundant raid_disks update when value is unchanged dm-raid: only requeue bios when dm is suspending MAINTAINERS: Update Li Nan's E-mail address MAINTAINERS: update Yu Kuai's email address
2026-06-01	dm cache: make smq background work limit configurable	Cao Guanghui
	The maximum number of concurrent background work items (promotions, demotions, writebacks) in the SMQ policy was hardcoded to 4096, with a FIXME comment noting it should be made configurable. This value was originally tuned down from 10240 to balance memory overhead (~128 bytes per entry, ~512KB at 4096 entries) against I/O parallelism. However, different workloads and cache sizes may benefit from different limits: - Write-heavy workloads may need more writeback concurrency - Very large caches (10+ TB) may need more promotion slots - Memory-constrained systems may want a lower limit Make this configurable via the module parameter "smq_max_background_work" (defaulting to 4096 to preserve existing behaviour). Clamp the value to at least 1 to prevent setting 0, which would block all background work. The parameter only affects newly created cache devices; existing caches retain their value from creation time. Signed-off-by: Cao Guanghui <caoguanghui@kylinos.cn> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
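The clamping rule and the memory estimate from the SMQ commit above can be written out as a small sketch. The function names and the `BYTES_PER_WORK_ENTRY` constant are assumptions for illustration only; the 128-byte figure is the approximate per-entry cost quoted in the commit message, not a value read from the kernel source.

```c
#include <assert.h>
#include <stddef.h>

/* Approximate per-entry cost quoted in the commit message (assumption). */
#define BYTES_PER_WORK_ENTRY 128u

/* Clamp a requested limit to at least 1, so 0 cannot block all work. */
static unsigned int clamp_background_work(unsigned int requested)
{
    return requested < 1 ? 1 : requested;
}

/* Rough memory estimate for a given limit, after clamping. */
static size_t background_work_bytes(unsigned int requested)
{
    unsigned int limit = clamp_background_work(requested);

    assert(limit >= 1);
    return (size_t)limit * BYTES_PER_WORK_ENTRY;
}
```

At the default limit of 4096 the estimate is 524288 bytes, matching the roughly 512KB mentioned in the commit message.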
2026-06-01	dm cache policy smq: check allocation under invalidate lock	Guangshuo Li
	commit 2d1f7b65f5de ("dm cache policy smq: fix missing locks in invalidating cache blocks") added mq->lock around the destructive part of smq_invalidate_mapping(), but left the e->allocated check outside the critical section. That leaves a check-then-act race. Two concurrent invalidators can both observe e->allocated as true before either of them takes mq->lock. The first invalidator that acquires the lock removes the entry from the queues and hash table and then calls free_entry(), which clears e->allocated and puts the entry back on the free list. The second invalidator can then acquire mq->lock and continue with the stale result of the unlocked check. This can corrupt the SMQ queues or hash table by deleting an entry that is no longer on those structures. It can also hit the allocation check in free_entry() when the same entry is freed again. Move the allocation check under mq->lock so the predicate and the destructive operations are serialized by the same lock. Fixes: 2d1f7b65f5de ("dm cache policy smq: fix missing locks in invalidating cache blocks") Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
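The check-then-act fix described in the smq commit above amounts to evaluating the predicate inside the same critical section as the destructive step. Below is a minimal userspace sketch using a pthread mutex in place of mq->lock; the struct layouts, the `invalidate` name and the choice of -ENODATA as the error value are assumptions for the example, not the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the policy and its entries. */
struct entry {
    bool allocated;
};

struct policy {
    pthread_mutex_t lock;
    unsigned int nr_free; /* entries returned to the free list */
};

/*
 * Invalidate an entry. The allocated check happens under the lock, so a
 * second caller cannot act on a stale "allocated" observation.
 */
static int invalidate(struct policy *p, struct entry *e)
{
    int ret = 0;

    pthread_mutex_lock(&p->lock);
    if (!e->allocated) {
        ret = -ENODATA;       /* already freed by another invalidator */
    } else {
        e->allocated = false; /* models removing and freeing the entry */
        p->nr_free++;
    }
    pthread_mutex_unlock(&p->lock);

    return ret;
}
```

If the check were done before taking the lock, two callers could both see `allocated == true` and both run the destructive branch, freeing the entry twice.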
2026-05-31	md/raid0: use str_plural helper in dump_zones	Thorsten Blum
	Replace the manual ternary "s" pluralization with str_plural() to simplify the code. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://patch.msgid.link/20260527141932.1243503-2-thorsten.blum@linux.dev Signed-off-by: Yu Kuai <yukuai@fygo.io>
2026-05-31	raid1: fix nr_pending leak in REQ_ATOMIC bad-block error path	Abd-Alrhman Masalkhi
	In raid1_write_request(), each per-mirror loop iteration begins by incrementing rdev->nr_pending. If a REQ_ATOMIC write encounters a badblock within the requested range, the code jumps to err_handle without dropping the reference taken for the current mirror. err_handle's cleanup loop only decrements nr_pending for slots where k < i and r1_bio->bios[k] is non-NULL. The current slot is therefore skipped, leaving its nr_pending reference leaked permanently. The reference prevents the rdev from ever being removed, since raid1_remove_conf() refuses to remove an rdev with nr_pending > 0. Fix this by calling rdev_dec_pending() before jumping to err_handle. Fixes: f2a38abf5f1c ("md/raid1: Atomic write support") Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Link: https://patch.msgid.link/20260530151411.4119-1-abd.masalkhi@gmail.com Signed-off-by: Yu Kuai <yukuai@fygo.io>
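The leak described in the nr_pending commit above can be reproduced in a small model. This is a simplified simulation, not the kernel function: `struct mirror`, `submit_write` and the `drop_current_ref` switch (which toggles between the buggy and the fixed behaviour) are hypothetical, and the `claimed[]` array stands in for r1_bio->bios[k] being non-NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_MIRRORS 3

/* Hypothetical, simplified mirror state. */
struct mirror {
    int nr_pending;
    bool has_badblock;
};

/*
 * Model of the per-mirror loop. Returns 0 when all usable mirrors were
 * claimed (their references stay held until completion), -1 on error.
 */
static int submit_write(struct mirror *m, bool atomic_write,
                        bool drop_current_ref)
{
    bool claimed[NR_MIRRORS] = { false };
    size_t i, k;

    for (i = 0; i < NR_MIRRORS; i++) {
        m[i].nr_pending++;               /* reference taken up front */
        if (m[i].has_badblock) {
            if (atomic_write) {
                if (drop_current_ref)
                    m[i].nr_pending--;   /* the fix: drop slot i first */
                goto err_handle;
            }
            m[i].nr_pending--;           /* non-atomic: skip this mirror */
            continue;
        }
        claimed[i] = true;
    }
    return 0;

err_handle:
    /* Cleanup only visits slots before i that were actually claimed. */
    for (k = 0; k < i; k++)
        if (claimed[k])
            m[k].nr_pending--;
    return -1;
}
```

Running the model with `drop_current_ref` set to false leaves the failing mirror with nr_pending equal to 1 after the error path; with it set to true, every counter returns to 0.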