linux-stable.git/block, branch linux-4.16.y

blk-mq: reinit q->tag_set_list entry only after grace period

2018-06-25T23:54:01+00:00

commit a347c7ad8edf4c5685154f3fdc3c12fc1db800ba upstream.

It is not allowed to reinit q->tag_set_list list entry while RCU grace
period has not completed yet, otherwise the following soft lockup in
blk_mq_sched_restart() happens:

[ 1064.252652] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [fio:9270]
[ 1064.254445] task: ffff99b912e8b900 task.stack: ffffa6d54c758000
[ 1064.254613] RIP: 0010:blk_mq_sched_restart+0x96/0x150
[ 1064.256510] Call Trace:
[ 1064.256664]  
[ 1064.256824]  blk_mq_free_request+0xea/0x100
[ 1064.256987]  msg_io_conf+0x59/0xd0 [ibnbd_client]
[ 1064.257175]  complete_rdma_req+0xf2/0x230 [ibtrs_client]
[ 1064.257340]  ? ibtrs_post_recv_empty+0x4d/0x70 [ibtrs_core]
[ 1064.257502]  ibtrs_clt_rdma_done+0xd1/0x1e0 [ibtrs_client]
[ 1064.257669]  ib_create_qp+0x321/0x380 [ib_core]
[ 1064.257841]  ib_process_cq_direct+0xbd/0x120 [ib_core]
[ 1064.258007]  irq_poll_softirq+0xb7/0xe0
[ 1064.258165]  __do_softirq+0x106/0x2a2
[ 1064.258328]  irq_exit+0x92/0xa0
[ 1064.258509]  do_IRQ+0x4a/0xd0
[ 1064.258660]  common_interrupt+0x7a/0x7a
[ 1064.258818]  

Meanwhile another context frees other queue but with the same set of
shared tags:

[ 1288.201183] INFO: task bash:5910 blocked for more than 180 seconds.
[ 1288.201833] bash            D    0  5910   5820 0x00000000
[ 1288.202016] Call Trace:
[ 1288.202315]  schedule+0x32/0x80
[ 1288.202462]  schedule_timeout+0x1e5/0x380
[ 1288.203838]  wait_for_completion+0xb0/0x120
[ 1288.204137]  __wait_rcu_gp+0x125/0x160
[ 1288.204287]  synchronize_sched+0x6e/0x80
[ 1288.204770]  blk_mq_free_queue+0x74/0xe0
[ 1288.204922]  blk_cleanup_queue+0xc7/0x110
[ 1288.205073]  ibnbd_clt_unmap_device+0x1bc/0x280 [ibnbd_client]
[ 1288.205389]  ibnbd_clt_unmap_dev_store+0x169/0x1f0 [ibnbd_client]
[ 1288.205548]  kernfs_fop_write+0x109/0x180
[ 1288.206328]  vfs_write+0xb3/0x1a0
[ 1288.206476]  SyS_write+0x52/0xc0
[ 1288.206624]  do_syscall_64+0x68/0x1d0
[ 1288.206774]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

What happened is the following:

1. There are several MQ queues with shared tags.
2. One queue is about to be freed and now task is in
   blk_mq_del_queue_tag_set().
3. Other CPU is in blk_mq_sched_restart() and loops over all queues in
   tag list in order to find hctx to restart.

Because linked list entry was modified in blk_mq_del_queue_tag_set()
without proper waiting for a grace period, blk_mq_sched_restart()
never ends, spining in list_for_each_entry_rcu_rr(), thus soft lockup.

Fix is simple: reinit list entry after an RCU grace period elapsed.

Fixes: Fixes: 705cda97ee3a ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
Cc: stable@vger.kernel.org
Cc: Sagi Grimberg 
Cc: linux-block@vger.kernel.org
Reviewed-by: Christoph Hellwig 
Reviewed-by: Ming Lei 
Reviewed-by: Bart Van Assche 
Signed-off-by: Roman Pen 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

blk-mq: fix sysfs inflight counter

2018-06-20T19:01:26+00:00

[ Upstream commit bf0ddaba65ddbb2715af97041da8e7a45b2d8628 ]

When the blk-mq inflight implementation was added, /proc/diskstats was
converted to use it, but /sys/block/$dev/inflight was not. Fix it by
adding another helper to count in-flight requests by data direction.

Fixes: f299b7c7a9de ("blk-mq: provide internal in-flight variant")
Signed-off-by: Omar Sandoval 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

blkcg: init root blkcg_gq under lock

2018-06-20T19:01:20+00:00

[ Upstream commit 901932a3f9b2b80352896be946c6d577c0a9652c ]

The initializing of q->root_blkg is currently outside of queue lock
and rcu, so the blkg may be destroied before the initializing, which
may cause dangling/null references. On the other side, the destroys
of blkg are protected by queue lock or rcu. Put the initializing
inside the queue lock and rcu to make it safer.

Signed-off-by: Jiang Biao 
Signed-off-by: Wen Yang 
CC: Tejun Heo 
CC: Jens Axboe 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

blkcg: don't hold blkcg lock when deactivating policy

2018-06-20T19:01:18+00:00

[ Upstream commit 946b81da114b8ba5c74bb01e57c0c6eca2bdc801 ]

As described in the comment of blkcg_activate_policy(),
*Update of each blkg is protected by both queue and blkcg locks so
that holding either lock and testing blkcg_policy_enabled() is
always enough for dereferencing policy data.*
with queue lock held, there is no need to hold blkcg lock in
blkcg_deactivate_policy(). Similar case is in
blkcg_activate_policy(), which has removed holding of blkcg lock in
commit 4c55f4f9ad3001ac1fefdd8d8ca7641d18558e23.

Signed-off-by: Jiang Biao 
Signed-off-by: Wen Yang 
CC: Tejun Heo 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

blkdev_report_zones_ioctl(): Use vmalloc() to allocate large buffers

2018-06-16T07:43:41+00:00

commit 327ea4adcfa37194739f1ec7c70568944d292281 upstream.

Avoid that complaints similar to the following appear in the kernel log
if the number of zones is sufficiently large:

  fio: page allocation failure: order:9, mode:0x140c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
  Call Trace:
  dump_stack+0x63/0x88
  warn_alloc+0xf5/0x190
  __alloc_pages_slowpath+0x8f0/0xb0d
  __alloc_pages_nodemask+0x242/0x260
  alloc_pages_current+0x6a/0xb0
  kmalloc_order+0x18/0x50
  kmalloc_order_trace+0x26/0xb0
  __kmalloc+0x20e/0x220
  blkdev_report_zones_ioctl+0xa5/0x1a0
  blkdev_ioctl+0x1ba/0x930
  block_ioctl+0x41/0x50
  do_vfs_ioctl+0xaa/0x610
  SyS_ioctl+0x79/0x90
  do_syscall_64+0x79/0x1b0
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls")
Signed-off-by: Bart Van Assche 
Cc: Shaun Tancheff 
Cc: Damien Le Moal 
Cc: Christoph Hellwig 
Cc: Martin K. Petersen 
Cc: Hannes Reinecke 
Cc: 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

block: do not use interruptible wait anywhere

2018-05-01T19:47:25+00:00

commit 1dc3039bc87ae7d19a990c3ee71cfd8a9068f428 upstream.

When blk_queue_enter() waits for a queue to unfreeze, or unset the
PREEMPT_ONLY flag, do not allow it to be interrupted by a signal.

The PREEMPT_ONLY flag was introduced later in commit 3a0a529971ec
("block, scsi: Make SCSI quiesce and resume work reliably").  Note the SCSI
device is resumed asynchronously, i.e. after un-freezing userspace tasks.

So that commit exposed the bug as a regression in v4.15.  A mysterious
SIGBUS (or -EIO) sometimes happened during the time the device was being
resumed.  Most frequently, there was no kernel log message, and we saw Xorg
or Xwayland killed by SIGBUS.[1]

[1] E.g. https://bugzilla.redhat.com/show_bug.cgi?id=1553979

Without this fix, I get an IO error in this test:

# dd if=/dev/sda of=/dev/null iflag=direct & \
  while killall -SIGUSR1 dd; do sleep 0.1; done & \
  echo mem > /sys/power/state ; \
  sleep 5; killall dd  # stop after 5 seconds

The interruptible wait was added to blk_queue_enter in
commit 3ef28e83ab15 ("block: generic request_queue reference counting").
Before then, the interruptible wait was only in blk-mq, but I don't think
it could ever have been correct.

Reviewed-by: Bart Van Assche 
Cc: stable@vger.kernel.org
Signed-off-by: Alan Jenkins 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

bfq-iosched: ensure to clear bic/bfqq pointers when preparing request

2018-05-01T19:47:25+00:00

commit 72961c4e6082be79825265d9193272b8a1634dec upstream.

Even if we don't have an IO context attached to a request, we still
need to clear the priv[0..1] pointers, as they could be pointing
to previously used bic/bfqq structures. If we don't do so, we'll
either corrupt memory on dispatching a request, or cause an
imbalance in counters.

Inspired by a fix from Kees.

Reported-by: Oleksandr Natalenko 
Reported-by: Kees Cook 
Cc: stable@vger.kernel.org
Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

blk-mq: start request gstate with gen 1

2018-05-01T19:47:25+00:00

commit f4560231ec42092c6662acccabb28c6cac9f5dfb upstream.

rq->gstate and rq->aborted_gstate both are zero before rqs are
allocated. If we have a small timeout, when the timer fires,
there could be rqs that are never allocated, and also there could
be rq that has been allocated but not initialized and started. At
the moment, the rq->gstate and rq->aborted_gstate both are 0, thus
the blk_mq_terminate_expired will identify the rq is timed out and
invoke .timeout early.

For scsi, this will cause scsi_times_out to be invoked before the
scsi_cmnd is not initialized, scsi_cmnd->device is still NULL at
the moment, then we will get crash.

Cc: Bart Van Assche 
Cc: Tejun Heo 
Cc: Ming Lei 
Cc: Martin Steigerwald 
Cc: stable@vger.kernel.org
Signed-off-by: Jianchao Wang 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

blk-mq: don't keep offline CPUs mapped to hctx 0

2018-04-19T06:54:09+00:00

commit bffa9909a6b48d8ca3398dec601bc9162a4020c4 upstream.

From commit 4b855ad37194 ("blk-mq: Create hctx for each present CPU),
blk-mq doesn't remap queue after CPU topo is changed, that said when
some of these offline CPUs become online, they are still mapped to
hctx 0, then hctx 0 may become the bottleneck of IO dispatch and
completion.

This patch sets up the mapping from the beginning, and aligns to
queue mapping for PCI device (blk_mq_pci_map_queues()).

Cc: Stefan Haberland 
Cc: Keith Busch 
Cc: stable@vger.kernel.org
Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU)
Tested-by: Christian Borntraeger 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Sagi Grimberg 
Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

blk-mq: make sure that correct hctx->next_cpu is set

2018-04-19T06:54:09+00:00

commit a1c735fb790745f94a359df45c11df4a69760389 upstream.

From commit 20e4d81393196 (blk-mq: simplify queue mapping & schedule
with each possisble CPU), one hctx can be mapped from all offline CPUs,
then hctx->next_cpu can be set as wrong.

This patch fixes this issue by making hctx->next_cpu pointing to the
first CPU in hctx->cpumask if all CPUs in hctx->cpumask are offline.

Cc: Stefan Haberland 
Tested-by: Christian Borntraeger 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Sagi Grimberg 
Fixes: 20e4d81393196 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman