linux-stable.git/block, branch linux-5.18.y

block: don't allow the same type rq_qos add more than once

2022-08-17T12:42:24+00:00

[ Upstream commit 14a6e2eb7df5c7897c15b109cba29ab0c4a791b6 ]

In our test of iocost, we encountered some list add/del corruptions of
inner_walk list in ioc_timer_fn.

The reason can be described as follows:

cpu 0					cpu 1
ioc_qos_write				ioc_qos_write

ioc = q_to_ioc(queue);
if (!ioc) {
        ioc = kzalloc();
					ioc = q_to_ioc(queue);
					if (!ioc) {
						ioc = kzalloc();
						...
						rq_qos_add(q, rqos);
					}
        ...
        rq_qos_add(q, rqos);
        ...
}

When the io.cost.qos file is written by two cpus concurrently, rq_qos may
be added to one disk twice. In that case, there will be two iocs enabled
and running on one disk. They own different iocgs on their active list. In
the ioc_timer_fn function, because of the iocgs from two iocs have the
same root iocg, the root iocg's walk_list may be overwritten by each other
and this leads to list add/del corruptions in building or destroying the
inner_walk list.

And so far, the blk-rq-qos framework works in case that one instance for
one type rq_qos per queue by default. This patch make this explicit and
also fix the crash above.

Signed-off-by: Jinke Han 
Reviewed-by: Muchun Song 
Acked-by: Tejun Heo 
Cc: 
Link: https://lore.kernel.org/r/20220720093616.70584-1-hanjinke.666@bytedance.com
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

block: serialize all debugfs operations using q->debugfs_mutex

2022-08-17T12:42:24+00:00

[ Upstream commit 5cf9c91ba927119fc6606b938b1895bb2459d3bc ]

Various places like I/O schedulers or the QOS infrastructure try to
register debugfs files on demans, which can race with creating and
removing the main queue debugfs directory.  Use the existing
debugfs_mutex to serialize all debugfs operations that rely on
q->debugfs_dir or the directories hanging off it.

To make the teardown code a little simpler declare all debugfs dentry
pointers and not just the main one uncoditionally in blkdev.h.

Move debugfs_mutex next to the dentries that it protects and document
what it is used for.

Signed-off-by: Christoph Hellwig 
Link: https://lore.kernel.org/r/20220614074827.458955-3-hch@lst.de
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

block: ensure iov_iter advances for added pages

2022-08-17T12:41:57+00:00

[ Upstream commit 325347d965e7ccf5424a05398807a6d801846612 ]

There are cases where a bio may not accept additional pages, and the iov
needs to advance to the last data length that was accepted. The zone
append used to handle this correctly, but was inadvertently broken when
the setup was made common with the normal r/w case.

Fixes: 576ed9135489c ("block: use bio_add_page in bio_iov_iter_get_pages")
Fixes: c58c0074c54c2 ("block/bio: remove duplicate append pages code")
Signed-off-by: Keith Busch 
Link: https://lore.kernel.org/r/20220712153256.2202024-1-kbusch@fb.com
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

block/bio: remove duplicate append pages code

2022-08-17T12:41:57+00:00

[ Upstream commit c58c0074c54c2e2bb3bb0d5a4d8896bb660cc8bc ]

The getting pages setup for zone append and normal IO are identical. Use
common code for each.

Signed-off-by: Keith Busch 
Reviewed-by: Johannes Thumshirn 
Reviewed-by: Christoph Hellwig 
Link: https://lore.kernel.org/r/20220610195830.3574005-3-kbusch@fb.com
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

blk-mq: don't create hctx debugfs dir until q->debugfs_dir is created

2022-08-17T12:40:38+00:00

[ Upstream commit f3ec5d11554778c24ac8915e847223ed71d104fc ]

blk_mq_debugfs_register_hctx() can be called by blk_mq_update_nr_hw_queues
when gendisk isn't added yet, such as nvme tcp.

Fixes the warning of 'debugfs: Directory 'hctx0' with parent '/' already present!'
which can be observed reliably when running blktests nvme/005.

Fixes: 6cfc0081b046 ("blk-mq: no need to check return value of debugfs_create functions")
Reported-by: Yi Zhang 
Signed-off-by: Ming Lei 
Tested-by: Yi Zhang 
Reviewed-by: Christoph Hellwig 
Link: https://lore.kernel.org/r/20220711090808.259682-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

block: fix infinite loop for invalid zone append

2022-08-17T12:40:30+00:00

[ Upstream commit b82d9fa257cb3725c49d94d2aeafc4677c34448a ]

Returning 0 early from __bio_iov_append_get_pages() for the
max_append_sectors warning just creates an infinite loop since 0 means
success, and the bio will never fill from the unadvancing iov_iter. We
could turn the return into an error value, but it will already be turned
into an error value later on, so just remove the warning. Clearly no one
ever hit it anyway.

Fixes: 0512a75b98f84 ("block: Introduce REQ_OP_ZONE_APPEND")
Signed-off-by: Keith Busch 
Reviewed-by: Damien Le Moal 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Johannes Thumshirn 
Link: https://lore.kernel.org/r/20220610195830.3574005-2-kbusch@fb.com
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

block: fix default IO priority handling again

2022-08-11T11:20:39+00:00

commit e589f46445960c274cc813a1cc8e2fc73b2a1849 upstream.

Commit e70344c05995 ("block: fix default IO priority handling")
introduced an inconsistency in get_current_ioprio() that tasks without
IO context return IOPRIO_DEFAULT priority while tasks with freshly
allocated IO context will return 0 (IOPRIO_CLASS_NONE/0) IO priority.
Tasks without IO context used to be rare before 5a9d041ba2f6 ("block:
move io_context creation into where it's needed") but after this commit
they became common because now only BFQ IO scheduler setups task's IO
context. Similar inconsistency is there for get_task_ioprio() so this
inconsistency is now exposed to userspace and userspace will see
different IO priority for tasks operating on devices with BFQ compared
to devices without BFQ. Furthemore the changes done by commit
e70344c05995 change the behavior when no IO priority is set for BFQ IO
scheduler which is also documented in ioprio_set(2) manpage:

"If no I/O scheduler has been set for a thread, then by default the I/O
priority will follow the CPU nice value (setpriority(2)).  In Linux
kernels before version 2.6.24, once an I/O priority had been set using
ioprio_set(), there was no way to reset the I/O scheduling behavior to
the default. Since Linux 2.6.24, specifying ioprio as 0 can be used to
reset to the default I/O scheduling behavior."

So make sure we default to IOPRIO_CLASS_NONE as used to be the case
before commit e70344c05995. Also cleanup alloc_io_context() to
explicitely set this IO priority for the allocated IO context to avoid
future surprises. Note that we tweak ioprio_best() to maintain
ioprio_get(2) behavior and make this commit easily backportable.

CC: stable@vger.kernel.org
Fixes: e70344c05995 ("block: fix default IO priority handling")
Reviewed-by: Damien Le Moal 
Tested-by: Damien Le Moal 
Signed-off-by: Jan Kara 
Reviewed-by: Christoph Hellwig 
Link: https://lore.kernel.org/r/20220623074840.5960-1-jack@suse.cz
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

block: pop cached rq before potentially blocking rq_qos_throttle()

2022-06-29T07:04:33+00:00

[ Upstream commit 2645672ffe21f0a1c139bfbc05ad30fd4e4f2583 ]

If rq_qos_throttle() ends up blocking, then we will have invalidated and
flushed our current plug. Since blk_mq_get_cached_request() hasn't
popped the cached request off the plug list just yet, we end holding a
pointer to a request that is no longer valid. This insta-crashes with
rq->mq_hctx being NULL in the validity checks just after.

Pop the request off the cached list before doing rq_qos_throttle() to
avoid using a potentially stale request.

Fixes: 0a5aa8d161d1 ("block: fix blk_mq_attempt_bio_merge and rq_qos_throttle protection")
Reported-by: Dylan Yudaken 
Tested-by: Dylan Yudaken 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

block: disable the elevator int del_gendisk

2022-06-29T07:04:30+00:00

[ Upstream commit 50e34d78815e474d410f342fbe783b18192ca518 ]

The elevator is only used for file system requests, which are stopped in
del_gendisk.  Move disabling the elevator and freeing the scheduler tags
to the end of del_gendisk instead of doing that work in disk_release and
blk_cleanup_queue to avoid a use after free on q->tag_set from
disk_release as the tag_set might not be alive at that point.

Move the blk_qos_exit call as well, as it just depends on the elevator
exit and would be the only reason to keep the not exactly cheap queue
freeze in disk_release.

Fixes: e155b0c238b2 ("blk-mq: Use shared tags for shared sbitmap support")
Reported-by: syzbot+3e3f419f4a7816471838@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig 
Tested-by: syzbot+3e3f419f4a7816471838@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20220614074827.458955-2-hch@lst.de
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

block: Fix handling of offline queues in blk_mq_alloc_request_hctx()

2022-06-22T12:28:03+00:00

[ Upstream commit 14dc7a18abbe4176f5626c13c333670da8e06aa1 ]

This patch prevents that test nvme/004 triggers the following:

UBSAN: array-index-out-of-bounds in block/blk-mq.h:135:9
index 512 is out of range for type 'long unsigned int [512]'
Call Trace:
 show_stack+0x52/0x58
 dump_stack_lvl+0x49/0x5e
 dump_stack+0x10/0x12
 ubsan_epilogue+0x9/0x3b
 __ubsan_handle_out_of_bounds.cold+0x44/0x49
 blk_mq_alloc_request_hctx+0x304/0x310
 __nvme_submit_sync_cmd+0x70/0x200 [nvme_core]
 nvmf_connect_io_queue+0x23e/0x2a0 [nvme_fabrics]
 nvme_loop_connect_io_queues+0x8d/0xb0 [nvme_loop]
 nvme_loop_create_ctrl+0x58e/0x7d0 [nvme_loop]
 nvmf_create_ctrl+0x1d7/0x4d0 [nvme_fabrics]
 nvmf_dev_write+0xae/0x111 [nvme_fabrics]
 vfs_write+0x144/0x560
 ksys_write+0xb7/0x140
 __x64_sys_write+0x42/0x50
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Cc: Christoph Hellwig 
Cc: Ming Lei 
Fixes: 20e4d8139319 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Bart Van Assche 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Ming Lei 
Link: https://lore.kernel.org/r/20220615210004.1031820-1-bvanassche@acm.org
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin