linux-stable.git/block, branch v4.19.26

blk-mq: fix a hung issue when fsync

2019-02-20T09:25:36+00:00

[ Upstream commit 85bd6e61f34dffa8ec2dc75ff3c02ee7b2f1cbce ]

Florian reported an I/O hang during fsync(). It appears to be
triggered by the following race condition.

data + post flush         a flush

blk_flush_complete_seq
  case REQ_FSEQ_DATA
    blk_flush_queue_rq
    issued to driver      blk_mq_dispatch_rq_list
                            try to issue a flush req
                            failed due to NON-NCQ command
                            .queue_rq return BLK_STS_DEV_RESOURCE

request completion
  req->end_io // doesn't check RESTART
  mq_flush_data_end_io
    case REQ_FSEQ_POSTFLUSH
      blk_kick_flush
        do nothing because previous flush
        has not been completed
    blk_mq_run_hw_queue
                            insert rq to hctx->dispatch
                            due to RESTART is still set, do nothing

To fix this, replace the blk_mq_run_hw_queue in mq_flush_data_end_io
with blk_mq_sched_restart to check and clear the RESTART flag.
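
The check-and-clear idea can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct toy_hctx and every toy_* function are hypothetical stand-ins, not the kernel's types or functions, and the real code paths are more involved.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a hardware queue context (hypothetical, not the kernel struct). */
struct toy_hctx {
    atomic_bool restart;  /* models the RESTART state bit */
    int dispatch_len;     /* requests parked on the dispatch list */
    int runs;             /* number of queue runs */
};

static void toy_hctx_init(struct toy_hctx *h, bool restart)
{
    atomic_init(&h->restart, restart);
    h->dispatch_len = 0;
    h->runs = 0;
}

/* Running the queue drains whatever is parked on the dispatch list. */
static void toy_run_queue(struct toy_hctx *h)
{
    assert(h != NULL);
    h->runs++;
    h->dispatch_len = 0;
}

/* Dispatcher side: park the request, rerun only if no restart is pending. */
static void toy_dispatch_failed(struct toy_hctx *h)
{
    h->dispatch_len++;
    if (!atomic_load(&h->restart))
        toy_run_queue(h);
    /* Otherwise a later completion is expected to clear RESTART and rerun. */
}

/* Completion side: check and clear RESTART in one step, then rerun the queue. */
static bool toy_sched_restart(struct toy_hctx *h)
{
    if (!atomic_exchange(&h->restart, false))
        return false;
    toy_run_queue(h);
    return true;
}
```

In this model, a request parked while RESTART is set stays on the dispatch list until toy_sched_restart() runs. A completion path that never looks at the flag would leave it there.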

Fixes: bd166ef1 (blk-mq-sched: add framework for MQ capable IO schedulers)
Reported-by: Florian Stecker 
Tested-by: Florian Stecker 
Signed-off-by: Jianchao Wang 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

block: use rcu_work instead of call_rcu to avoid sleep in softirq

2019-01-22T20:40:35+00:00

commit 94a2c3a32b62e868dc1e3d854326745a7f1b8c7a upstream.

syzkaller recently reported a stack trace like this:

BUG: sleeping function called from invalid context at mm/slab.h:361
in_atomic(): 1, irqs_disabled(): 0, pid: 6644, name: blkid
INFO: lockdep is turned off.
CPU: 1 PID: 6644 Comm: blkid Not tainted 4.4.163-514.55.6.9.x86_64+ #76
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
 0000000000000000 5ba6a6b879e50c00 ffff8801f6b07b10 ffffffff81cb2194
 0000000041b58ab3 ffffffff833c7745 ffffffff81cb2080 5ba6a6b879e50c00
 0000000000000000 0000000000000001 0000000000000004 0000000000000000
Call Trace:
   [] __dump_stack lib/dump_stack.c:15 [inline]
   [] dump_stack+0x114/0x1a0 lib/dump_stack.c:51
 [] ___might_sleep+0x291/0x490 kernel/sched/core.c:7675
 [] __might_sleep+0xb3/0x270 kernel/sched/core.c:7637
 [] slab_pre_alloc_hook mm/slab.h:361 [inline]
 [] slab_alloc_node mm/slub.c:2610 [inline]
 [] slab_alloc mm/slub.c:2692 [inline]
 [] kmem_cache_alloc_trace+0x2c3/0x5c0 mm/slub.c:2709
 [] kmalloc include/linux/slab.h:479 [inline]
 [] kzalloc include/linux/slab.h:623 [inline]
 [] kobject_uevent_env+0x2c7/0x1150 lib/kobject_uevent.c:227
 [] kobject_uevent+0x1f/0x30 lib/kobject_uevent.c:374
 [] kobject_cleanup lib/kobject.c:633 [inline]
 [] kobject_release+0x229/0x440 lib/kobject.c:675
 [] kref_sub include/linux/kref.h:73 [inline]
 [] kref_put include/linux/kref.h:98 [inline]
 [] kobject_put+0x72/0xd0 lib/kobject.c:692
 [] put_device+0x25/0x30 drivers/base/core.c:1237
 [] delete_partition_rcu_cb+0x1d4/0x2f0 block/partition-generic.c:232
 [] __rcu_reclaim kernel/rcu/rcu.h:118 [inline]
 [] rcu_do_batch kernel/rcu/tree.c:2705 [inline]
 [] invoke_rcu_callbacks kernel/rcu/tree.c:2973 [inline]
 [] __rcu_process_callbacks kernel/rcu/tree.c:2940 [inline]
 [] rcu_process_callbacks+0x59c/0x1c70 kernel/rcu/tree.c:2957
 [] __do_softirq+0x299/0xe20 kernel/softirq.c:273
 [] invoke_softirq kernel/softirq.c:350 [inline]
 [] irq_exit+0x216/0x2c0 kernel/softirq.c:391
 [] exiting_irq arch/x86/include/asm/apic.h:652 [inline]
 [] smp_apic_timer_interrupt+0x8b/0xc0 arch/x86/kernel/apic/apic.c:926
 [] apic_timer_interrupt+0xa5/0xb0 arch/x86/entry/entry_64.S:746
   [] ? audit_kill_trees+0x180/0x180
 [] fd_install+0x57/0x80 fs/file.c:626
 [] do_sys_open+0x45e/0x550 fs/open.c:1043
 [] SYSC_open fs/open.c:1055 [inline]
 [] SyS_open+0x32/0x40 fs/open.c:1050
 [] entry_SYSCALL_64_fastpath+0x1e/0x9a

In softirq context, the RCU callback delete_partition_rcu_cb() is invoked,
which may allocate memory with kzalloc() using the GFP_KERNEL flag. If the
allocation cannot be satisfied immediately, it may sleep. However, that is
not allowed in softirq context.

Although we found this problem on Linux 4.4, the latest kernel version
seems to have this problem as well. It is very similar to a previously
reported one:
	https://lkml.org/lkml/2018/7/9/391

Fix it by using an RCU workqueue item (rcu_work), which runs in process
context and is therefore allowed to sleep.
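
The deferral pattern can be sketched in userspace as follows. This is a minimal model, not the kernel's rcu_work API: all toy_* names are hypothetical, the "atomic context" is just a flag, and the worker is called by hand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy work item; the names are hypothetical and only mimic the pattern. */
struct toy_work {
    void (*fn)(struct toy_work *w);
    struct toy_work *next;
};

static struct toy_work *toy_pending;  /* singly linked list of queued work */
static bool toy_in_atomic;            /* true while "in softirq" */
static int toy_released;              /* counts completed releases */

/* Safe from atomic context: only links the item, never sleeps. */
static void toy_queue_work(struct toy_work *w)
{
    w->next = toy_pending;
    toy_pending = w;
}

/* The part that may sleep, e.g. an allocation that can block. */
static void toy_delete_partition_fn(struct toy_work *w)
{
    (void)w;
    assert(!toy_in_atomic);  /* sleeping in atomic context would be a bug */
    toy_released++;
}

/* Callback run after the grace period: defer instead of doing the work. */
static void toy_grace_period_cb(struct toy_work *w)
{
    w->fn = toy_delete_partition_fn;
    toy_queue_work(w);
}

/* Worker running in process context, where sleeping is allowed. */
static void toy_run_worker(void)
{
    while (toy_pending) {
        struct toy_work *w = toy_pending;

        toy_pending = w->next;
        w->fn(w);
    }
}
```

The callback does nothing that can block. The release itself happens later in toy_run_worker(), which stands in for a workqueue thread.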

Reviewed-by: Paul E. McKenney 
Signed-off-by: Yufen Yu 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

block: mq-deadline: Fix write completion handling

2019-01-13T08:51:07+00:00

commit 7211aef86f79583e59b88a0aba0bc830566f7e8e upstream.

For a zoned block device using mq-deadline, if a write request for a
zone is received while another write was already dispatched for the same
zone, dd_dispatch_request() will return NULL and the newly inserted
write request is kept in the scheduler queue waiting for the ongoing
zone write to complete. With this behavior, when no other request has
been dispatched, rq_list in blk_mq_sched_dispatch_requests() is empty
and blk_mq_sched_mark_restart_hctx() is not called. As a result, the
blk_mq_sched_restart() call in __blk_mq_free_request() does not run the
queue when the already dispatched write request completes. The newly
inserted request stays stuck in the scheduler queue until another
request is eventually submitted.

This problem does not affect SCSI disks, as the SCSI stack handles queue
restart on request completion. However, this problem can be triggered
with the null_blk driver with zoned mode enabled.

Fix this by always requesting a queue restart in dd_dispatch_request()
if no request was dispatched while WRITE requests are queued.
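
A toy model of this rule, assuming a single zone and using hypothetical names throughout (struct toy_deadline and the toy_* functions are not mq-deadline code):

```c
#include <assert.h>
#include <stdbool.h>

struct toy_deadline {
    int queued_writes;  /* writes waiting in the scheduler for one zone */
    bool zone_locked;   /* a write to that zone is already in flight */
    bool restart;       /* models the hctx restart mark */
};

/* Returns true if a write was dispatched. */
static bool toy_dd_dispatch(struct toy_deadline *dd)
{
    assert(dd->queued_writes >= 0);
    if (dd->queued_writes > 0 && !dd->zone_locked) {
        dd->queued_writes--;
        dd->zone_locked = true;
        return true;
    }
    /* Nothing dispatched but writes remain: ask for a restart. */
    if (dd->queued_writes > 0)
        dd->restart = true;
    return false;
}

/* Completion of the in-flight write; reruns dispatch only if a restart was requested. */
static bool toy_complete_write(struct toy_deadline *dd)
{
    dd->zone_locked = false;
    if (!dd->restart)
        return false;
    dd->restart = false;
    return toy_dd_dispatch(dd);
}
```

Without the restart request in toy_dd_dispatch(), toy_complete_write() would return without dispatching and the second write would wait for an unrelated submission.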

Fixes: 5700f69178e9 ("mq-deadline: Introduce zone locking support")
Cc: 
Signed-off-by: Damien Le Moal 
Signed-off-by: Greg Kroah-Hartman 

Add missing export of blk_mq_sched_restart()

Signed-off-by: Jens Axboe

block: deactivate blk_stat timer in wbt_disable_default()

2019-01-13T08:51:06+00:00

commit 544fbd16a461a318cd80537d1331c0df5c6cf930 upstream.

rwb_enabled() can't be changed when there is any inflight IO.

wbt_disable_default() may set rwb->wb_normal to zero; however, the
blk_stat timer may still be pending, and the timer function will update
rwb->wb_normal again.

This patch introduces blk_stat_deactivate() and applies it in
wbt_disable_default(). This fixes the following IO hang, which is
triggered when running parted while switching the IO scheduler:

[  369.937806] INFO: task parted:3645 blocked for more than 120 seconds.
[  369.938941]       Not tainted 4.20.0-rc6-00284-g906c801e5248 #498
[  369.939797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  369.940768] parted          D    0  3645   3239 0x00000000
[  369.941500] Call Trace:
[  369.941874]  ? __schedule+0x6d9/0x74c
[  369.942392]  ? wbt_done+0x5e/0x5e
[  369.942864]  ? wbt_cleanup_cb+0x16/0x16
[  369.943404]  ? wbt_done+0x5e/0x5e
[  369.943874]  schedule+0x67/0x78
[  369.944298]  io_schedule+0x12/0x33
[  369.944771]  rq_qos_wait+0xb5/0x119
[  369.945193]  ? karma_partition+0x1c2/0x1c2
[  369.945691]  ? wbt_cleanup_cb+0x16/0x16
[  369.946151]  wbt_wait+0x85/0xb6
[  369.946540]  __rq_qos_throttle+0x23/0x2f
[  369.947014]  blk_mq_make_request+0xe6/0x40a
[  369.947518]  generic_make_request+0x192/0x2fe
[  369.948042]  ? submit_bio+0x103/0x11f
[  369.948486]  ? __radix_tree_lookup+0x35/0xb5
[  369.949011]  submit_bio+0x103/0x11f
[  369.949436]  ? blkg_lookup_slowpath+0x25/0x44
[  369.949962]  submit_bio_wait+0x53/0x7f
[  369.950469]  blkdev_issue_flush+0x8a/0xae
[  369.951032]  blkdev_fsync+0x2f/0x3a
[  369.951502]  do_fsync+0x2e/0x47
[  369.951887]  __x64_sys_fsync+0x10/0x13
[  369.952374]  do_syscall_64+0x89/0x149
[  369.952819]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  369.953492] RIP: 0033:0x7f95a1e729d4
[  369.953996] Code: Bad RIP value.
[  369.954456] RSP: 002b:00007ffdb570dd48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[  369.955506] RAX: ffffffffffffffda RBX: 000055c2139c6be0 RCX: 00007f95a1e729d4
[  369.956389] RDX: 0000000000000001 RSI: 0000000000001261 RDI: 0000000000000004
[  369.957325] RBP: 0000000000000002 R08: 0000000000000000 R09: 000055c2139c6ce0
[  369.958199] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c2139c0380
[  369.959143] R13: 0000000000000004 R14: 0000000000000100 R15: 0000000000000008
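
The ordering problem between disabling and a pending timer can be modeled in a few lines. This is an illustrative sketch only: struct toy_rwb and the toy_* functions are hypothetical, and the nonzero limit written by the timer is an arbitrary value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_rwb {
    unsigned int wb_normal;  /* 0 means throttling is disabled */
    bool timer_pending;      /* a stats timer is armed */
};

/* Timer callback: recomputes the limit, re-enabling throttling as a side effect. */
static void toy_timer_fn(struct toy_rwb *rwb)
{
    if (!rwb->timer_pending)
        return;
    rwb->timer_pending = false;
    rwb->wb_normal = 8;  /* arbitrary nonzero value for illustration */
}

/* Disable throttling; 'deactivate' selects the fixed behavior. */
static void toy_disable_default(struct toy_rwb *rwb, bool deactivate)
{
    assert(rwb != NULL);
    if (deactivate)
        rwb->timer_pending = false;  /* cancel the timer first */
    rwb->wb_normal = 0;
}
```

With deactivate set to false, a later toy_timer_fn() call writes a nonzero limit back after the disable. With it set to true, the value stays at zero.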

Cc: stable@vger.kernel.org
Cc: Paolo Valente 
Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

block/bio: Do not zero user pages

2018-12-19T18:19:50+00:00

commit f55adad601c6a97c8c9628195453e0fb23b4a0ae upstream.

We don't need to zero fill the bio if not using kernel allocated pages.
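
As a rough illustration, the condition can be expressed like this. The struct, its kernel_pages flag, and toy_prepare_read() are assumptions made for the sketch, not the kernel's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy bio: 'kernel_pages' says whether the kernel allocated the buffer itself. */
struct toy_copy_bio {
    unsigned char *data;
    size_t len;
    bool kernel_pages;
};

/* Zero-fill only buffers the kernel allocated; leave caller-supplied pages alone. */
static void toy_prepare_read(struct toy_copy_bio *bio)
{
    assert(bio->data != NULL);
    if (bio->kernel_pages)
        memset(bio->data, 0, bio->len);
}
```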

Fixes: f3587d76da05 ("block: Clear kernel memory before copying to user") # v4.20-rc2
Reported-by: Todd Aiken 
Cc: Laurence Oberman 
Cc: stable@vger.kernel.org
Cc: Bart Van Assche 
Tested-by: Laurence Oberman 
Signed-off-by: Keith Busch 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

blk-mq: punt failed direct issue to dispatch list

2018-12-08T11:59:10+00:00

commit c616cbee97aed4bc6178f148a7240206dcdb85a6 upstream.

After the direct dispatch corruption fix, we permanently disallow direct
dispatch of non read/write requests. This works fine off the normal IO
path, as they will be retried like any other failed direct dispatch
request. But for the blk_insert_cloned_request() that only DM uses to
bypass the bottom level scheduler, we always first attempt direct
dispatch. For some types of requests, that's now a permanent failure,
and no amount of retrying will make that succeed. This results in a
livelock.

Instead of making special cases for what we can direct issue, and now
having to deal with DM solving the livelock while still retaining a BUSY
condition feedback loop, always just add a request that has been through
->queue_rq() to the hardware queue dispatch list. These are safe to use
as no merging can take place there. Additionally, if requests do have
prepped data from drivers, we aren't dependent on them not sharing space
in the request structure to safely add them to the IO scheduler lists.

This basically reverts ffe81d45322c and is based on a patch from Ming,
but with the list insert case covered as well.

Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
Cc: stable@vger.kernel.org
Suggested-by: Ming Lei 
Reported-by: Bart Van Assche 
Tested-by: Ming Lei 
Acked-by: Mike Snitzer 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

blk-mq: fix corruption with direct issue

2018-12-08T11:59:06+00:00

commit ffe81d45322cc3cb140f0db080a4727ea284661e upstream.

If we attempt a direct issue to a SCSI device, and it returns BUSY, then
we queue the request up normally. However, the SCSI layer may have
already set up SG tables etc. for this particular command. If we later
merge with this request, then the old tables are no longer valid. Once
we issue the IO, we only read/write the original part of the request,
not the new state of it.

This causes data corruption, and is most often noticed with the file
system complaining about the just read data being invalid:

[  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)

because most of it is garbage...

This doesn't happen from the normal issue path, as we will simply defer
the request to the hardware queue dispatch list if we fail. Once it's on
the dispatch list, we never merge with it.

Fix this from the direct issue path by flagging the request as
REQ_NOMERGE so we don't change the size of it before issue.
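
A small model of why the flag helps, under the assumption that the driver sizes its tables at prep time. TOY_NOMERGE, struct toy_rq, and the toy_* functions are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NOMERGE 0x1u  /* stands in for REQ_NOMERGE */

struct toy_rq {
    unsigned int flags;
    unsigned int sectors;          /* current size of the request */
    unsigned int prepped_sectors;  /* size the driver built its tables for */
};

/* Merging grows the request unless it has been flagged as non-mergeable. */
static bool toy_try_merge(struct toy_rq *rq, unsigned int extra)
{
    if (rq->flags & TOY_NOMERGE)
        return false;
    rq->sectors += extra;
    return true;
}

/* Direct issue: the driver preps the request, then may report busy. */
static bool toy_direct_issue(struct toy_rq *rq, bool busy)
{
    assert(rq->sectors > 0);
    rq->prepped_sectors = rq->sectors;
    if (busy) {
        rq->flags |= TOY_NOMERGE;  /* freeze the size the driver prepared for */
        return false;
    }
    return true;
}
```

After a busy direct issue, sectors and prepped_sectors can no longer diverge, which is the invariant the corruption violated.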

See also:
  https://bugzilla.kernel.org/show_bug.cgi?id=201685

Tested-by: Guenter Roeck 
Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

block: copy ioprio in __bio_clone_fast() and bounce

2018-12-01T08:37:32+00:00

[ Upstream commit ca474b73896bf6e0c1eb8787eb217b0f80221610 ]

We need to copy the io priority, too; otherwise the clone will run
with a different priority than the original one.
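
In toy form, the fix amounts to one extra assignment in the clone helper. The struct and field names below are hypothetical and only loosely modeled on a bio.

```c
#include <assert.h>
#include <stddef.h>

struct toy_bio {
    unsigned int opf;
    unsigned short ioprio;
    unsigned short write_hint;
};

/* Copy the per-request attributes from src to dst. */
static void toy_bio_clone_fast(struct toy_bio *dst, const struct toy_bio *src)
{
    assert(dst != NULL && src != NULL);
    dst->opf = src->opf;
    dst->ioprio = src->ioprio;  /* the field that was being left out */
    dst->write_hint = src->write_hint;
}
```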

Fixes: 43b62ce3ff0a ("block: move bio io prio to a new field")
Signed-off-by: Hannes Reinecke 
Signed-off-by: Jean Delvare 

Fixed up subject, and ordered stores.

Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

block: Clear kernel memory before copying to user

2018-11-27T15:13:05+00:00

[ Upstream commit f3587d76da05f68098ddb1cb3c98cc6a9e8a402c ]

If the kernel allocates a bounce buffer for user read data, this memory
needs to be cleared before copying it to the user, otherwise it may leak
kernel memory to user space.
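
The leak can be reproduced in a userspace model where the device fills only part of the buffer. This is a sketch with assumed names and fill patterns: toy_bounce_read() is hypothetical, 0xAA stands for stale kernel data, and 0x11 for data written by the device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Toy read through a bounce buffer. 'filled' bytes come from the device;
 * the rest of the buffer keeps whatever was there before.
 */
static void toy_bounce_read(unsigned char *user, size_t len, size_t filled, bool clear)
{
    unsigned char *bounce = malloc(len);

    assert(bounce != NULL && filled <= len);
    memset(bounce, 0xAA, len);      /* simulate stale kernel data */
    if (clear)
        memset(bounce, 0, len);     /* the fix: clear before use */
    memset(bounce, 0x11, filled);   /* device fills only part of the buffer */
    memcpy(user, bounce, len);      /* whole buffer is copied to the user */
    free(bounce);
}
```

With clear set to false, the tail of the user buffer ends up holding the stale pattern. With clear set to true, it holds zeros.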

Laurence Oberman 
Signed-off-by: Keith Busch 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

SCSI: fix queue cleanup race before queue initialization is done

2018-11-21T08:19:18+00:00

commit 8dc765d438f1e42b3e8227b3b09fad7d73f4ec9a upstream.

c2856ae2f315d ("blk-mq: quiesce queue before freeing queue") has
already fixed this race. However, the implied synchronize_rcu()
in blk_mq_quiesce_queue() can slow down LUN probing considerably, which
caused a performance regression.

Then 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
tried to avoid the unnecessary synchronize_rcu() by quiescing the queue
only when queue initialization is done, because it is common to see
many nonexistent LUNs that need to be probed.

However, it turns out that it isn't safe to quiesce the queue only when
queue initialization is done. When a SCSI command completes, the user
that sent the command can be woken up immediately and the SCSI device
may then be removed, while the queue run in scsi_end_request() is still
in progress, so a kernel panic can result.

In Red Hat QE lab, there are several reports about this kind of kernel
panic triggered during kernel booting.

This patch tries to address the issue by grabbing a queue usage
counter reference while freeing a request and during the following
queue run.
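
The effect of holding a reference across the completion path can be sketched as follows. The model is single-threaded and entirely hypothetical: struct toy_queue uses a plain int counter, and on_wake stands in for the submitter that is woken up and removes the device.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_queue {
    int usage;   /* usage counter */
    bool dying;  /* cleanup has been requested */
    bool freed;  /* queue resources are gone */
    int runs;    /* number of queue runs */
};

/* Drop a reference; the last put on a dying queue frees it. */
static void toy_queue_put(struct toy_queue *q)
{
    assert(q->usage > 0);
    if (--q->usage == 0 && q->dying)
        q->freed = true;
}

/* Device removal: frees at once only when nobody holds a reference. */
static void toy_cleanup_queue(struct toy_queue *q)
{
    q->dying = true;
    if (q->usage == 0)
        q->freed = true;
}

/* Completion path; on_wake models the submitter woken by freeing the request. */
static bool toy_end_request(struct toy_queue *q, void (*on_wake)(struct toy_queue *))
{
    bool ran = false;

    q->usage++;       /* grab a reference before freeing the request */
    on_wake(q);       /* the submitter may remove the device right here */
    if (!q->freed) {
        q->runs++;    /* queue run is safe: the reference keeps the queue alive */
        ran = true;
    }
    toy_queue_put(q); /* the queue may be freed only after the run */
    return ran;
}
```

Without the q->usage++ line, toy_cleanup_queue() would mark the queue freed before the run, which corresponds to the use-after-free window described above.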

Fixes: 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
Cc: Andrew Jones 
Cc: Bart Van Assche 
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen 
Cc: Christoph Hellwig 
Cc: James E.J. Bottomley 
Cc: stable 
Cc: jianchao.wang 
Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman