linux-stable.git/block, branch v5.4.90

block: fix use-after-free in disk_part_iter_next

2021-01-17T13:05:38+00:00

commit aebf5db917055b38f4945ed6d621d9f07a44ff30 upstream.

Make sure that bdgrab() is done on the 'block_device' instance before
referring to it for avoiding use-after-free.

Cc: 
Reported-by: syzbot+825f0f9657d4e528046e@syzkaller.appspotmail.com
Signed-off-by: Ming Lei 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

scsi: block: Fix a race in the runtime power management code

2021-01-06T13:48:37+00:00

commit fa4d0f1992a96f6d7c988ef423e3127e613f6ac9 upstream.

With the current implementation the following race can happen:

 * blk_pre_runtime_suspend() calls blk_freeze_queue_start() and
   blk_mq_unfreeze_queue().

 * blk_queue_enter() calls blk_queue_pm_only() and that function returns
   true.

 * blk_queue_enter() calls blk_pm_request_resume() and that function does
   not call pm_request_resume() because the queue runtime status is
   RPM_ACTIVE.

 * blk_pre_runtime_suspend() changes the queue status into RPM_SUSPENDING.

Fix this race by changing the queue runtime status into RPM_SUSPENDING
before switching q_usage_counter to atomic mode.

Link: https://lore.kernel.org/r/20201209052951.16136-2-bvanassche@acm.org
Fixes: 986d413b7c15 ("blk-mq: Enable support for runtime power management")
Cc: Ming Lei 
Cc: Rafael J. Wysocki 
Cc: stable 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Jens Axboe 
Acked-by: Alan Stern 
Acked-by: Stanley Chu 
Co-developed-by: Can Guo 
Signed-off-by: Can Guo 
Signed-off-by: Bart Van Assche 
Signed-off-by: Martin K. Petersen 
Signed-off-by: Greg Kroah-Hartman

blk-mq: In blk_mq_dispatch_rq_list() "no budget" is a reason to kick

2020-12-30T10:50:54+00:00

[ Upstream commit ab3cee3762e5e69f27c302c43691289fdfc12316 ]

In blk_mq_dispatch_rq_list(), if blk_mq_sched_needs_restart() returns
true and the driver returns BLK_STS_RESOURCE then we'll kick the
queue.  However, there's another case where we might need to kick it.
If we were unable to get budget we can be in much the same state as
when the driver returns BLK_STS_RESOURCE, so we should treat it the
same.

It should be noted that even if we add a whole bunch of extra kicking
to the queue in other patches this patch is still important.
Specifically any kicking that happened before we re-spliced leftover
requests into 'hctx->dispatch' wouldn't have found any work, so we
really need to make sure we kick ourselves after we've done the
splicing.

Signed-off-by: Douglas Anderson 
Reviewed-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

block: factor out requeue handling from dispatch code

2020-12-30T10:50:54+00:00

[ Upstream commit c92a41031a6d57395889b5c87cea359220a24d2a ]

Factor out the requeue handling from the dispatch code, this will make
subsequent addition of different requeueing schemes easier.

Signed-off-by: Johannes Thumshirn 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

block: Simplify REQ_OP_ZONE_RESET_ALL handling

2020-12-30T10:50:54+00:00

[ Upstream commit c7a1d926dc4076aadad187614500afcd8de78818 ]

There is no need for the function __blkdev_reset_all_zones() as
REQ_OP_ZONE_RESET_ALL can be handled directly in blkdev_reset_zones()
bio loop with an early break from the loop. This patch removes this
function and modifies blkdev_reset_zones(), simplifying the code.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Damien Le Moal 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

blk-cgroup: Pre-allocate tree node on blkg_conf_prep

2020-11-10T11:37:30+00:00

[ Upstream commit f255c19b3ab46d3cad3b1b2e1036f4c926cb1d0c ]

Similarly to commit 457e490f2b741 ("blkcg: allocate struct blkcg_gq
outside request queue spinlock"), blkg_create can also trigger
occasional -ENOMEM failures at the radix insertion because any
allocation inside blkg_create has to be non-blocking, making it more
likely to fail.  This causes trouble for userspace tools trying to
configure io weights who need to deal with this condition.

This patch reduces the occurrence of -ENOMEMs on this path by preloading
the radix tree element on a GFP_KERNEL context, such that we guarantee
the later non-blocking insertion won't fail.

A similar solution exists in blkcg_init_queue for the same situation.

Acked-by: Tejun Heo 
Signed-off-by: Gabriel Krisman Bertazi 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

blk-cgroup: Fix memleak on error path

2020-11-10T11:37:29+00:00

[ Upstream commit 52abfcbd57eefdd54737fc8c2dc79d8f46d4a3e5 ]

If new_blkg allocation raced with blk_policy change and
blkg_lookup_check fails, new_blkg is leaked.

Acked-by: Tejun Heo 
Signed-off-by: Gabriel Krisman Bertazi 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

block: ratelimit handle_bad_sector() message

2020-10-29T08:58:01+00:00

[ Upstream commit f4ac712e4fe009635344b9af5d890fe25fcc8c0d ]

syzbot is reporting unkillable task [1], for the caller is failing to
handle a corrupted filesystem image which attempts to access beyond
the end of the device. While we need to fix the caller, flooding the
console with handle_bad_sector() message is unlikely useful.

[1] https://syzkaller.appspot.com/bug?id=f1f49fb971d7a3e01bd8ab8cff2ff4572ccf3092

Signed-off-by: Tetsuo Handa 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue

2020-10-29T08:57:34+00:00

[ Upstream commit 47ce030b7ac5a5259a9a5919f230b52497afc31a ]

blk_exit_queue will free elevator_data, while blk_mq_run_work_fn
will access it. Move cancel of hctx->run_work to the front of
blk_exit_queue to avoid use-after-free.

Fixes: 1b97871b501f ("blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release")
Signed-off-by: Yang Yang 
Reviewed-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

block/diskstats: more accurate approximation of io_ticks for slow disks

2020-10-07T06:01:29+00:00

commit 2b8bd423614c595540eaadcfbc702afe8e155e50 upstream.

Currently io_ticks is approximated by adding one at each start and end of
requests if jiffies counter has changed. This works perfectly for requests
shorter than a jiffy or if one of requests starts/ends at each jiffy.

If disk executes just one request at a time and they are longer than two
jiffies then only first and last jiffies will be accounted.

Fix is simple: at the end of request add up into io_ticks jiffies passed
since last update rather than just one jiffy.

Example: common HDD executes random read 4k requests around 12ms.

fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 &
iostat -x 10 sdb

Note changes of iostat's "%util" 8,43% -> 99,99% before/after patch:

Before:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0,00     0,00   82,60    0,00   330,40     0,00     8,00     0,96   12,09   12,09    0,00   1,02   8,43

After:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0,00     0,00   82,50    0,00   330,00     0,00     8,00     1,00   12,10   12,10    0,00  12,12  99,99

Now io_ticks does not loose time between start and end of requests, but
for queue-depth > 1 some I/O time between adjacent starts might be lost.

For load estimation "%util" is not as useful as average queue length,
but it clearly shows how often disk queue is completely empty.

Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
Signed-off-by: Konstantin Khlebnikov 
Reviewed-by: Ming Lei 
Signed-off-by: Jens Axboe 
From: "Banerjee, Debabrata" 
Signed-off-by: Greg Kroah-Hartman