linux-stable.git/drivers/md/dm.c, branch linux-4.1.y

dm: flush queued bios when process blocks to avoid deadlock

2018-01-17T17:27:19+00:00

[ Upstream commit d67a5f4b5947aba4bfe9a80a2b86079c215ca755 ]

Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
stacking drivers") created a workqueue for every bio set and code
in bio_alloc_bioset() that tries to resolve some low-memory deadlocks
by redirecting bios queued on current->bio_list to the workqueue if the
system is low on memory.  However other deadlocks (see below **) may
happen, without any low memory condition, because generic_make_request
is queuing bios to current->bio_list (rather than submitting them).

** the related dm-snapshot deadlock is detailed here:
https://www.redhat.com/archives/dm-devel/2016-July/msg00065.html

Fix this deadlock by redirecting any bios on current->bio_list to the
bio_set's rescue workqueue on every schedule() call.  Consequently,
when the process blocks on a mutex, the bios queued on
current->bio_list are dispatched to independent workqueus and they can
complete without waiting for the mutex to be available.

The structure blk_plug contains an entry cb_list and this list can contain
arbitrary callback functions that are called when the process blocks.
To implement this fix DM (ab)uses the onstack plug's cb_list interface
to get its flush_current_bio_list() called at schedule() time.

This fixes the snapshot deadlock - if the map method blocks,
flush_current_bio_list() will be called and it redirects bios waiting
on current->bio_list to appropriate workqueues.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1267650
Depends-on: df2cb6daa4 ("block: Avoid deadlocks with bio allocation by stacking drivers")
Signed-off-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer 
Signed-off-by: Sasha Levin

dm: fix race between dm_get_from_kobject() and __dm_destroy()

2017-12-08T23:01:02+00:00

[ Upstream commit b9a41d21dceadf8104812626ef85dc56ee8a60ed ]

The following BUG_ON was hit when testing repeat creation and removal of
DM devices:

    kernel BUG at drivers/md/dm.c:2919!
    CPU: 7 PID: 750 Comm: systemd-udevd Not tainted 4.1.44
    Call Trace:
     [] dm_get_from_kobject+0x34/0x3a
     [] dm_attr_show+0x2b/0x5e
     [] ? mutex_lock+0x26/0x44
     [] sysfs_kf_seq_show+0x83/0xcf
     [] kernfs_seq_show+0x23/0x25
     [] seq_read+0x16f/0x325
     [] kernfs_fop_read+0x3a/0x13f
     [] __vfs_read+0x26/0x9d
     [] ? security_file_permission+0x3c/0x44
     [] ? rw_verify_area+0x83/0xd9
     [] vfs_read+0x8f/0xcf
     [] ? __fdget_pos+0x12/0x41
     [] SyS_read+0x4b/0x76
     [] system_call_fastpath+0x12/0x71

The bug can be easily triggered, if an extra delay (e.g. 10ms) is added
between the test of DMF_FREEING & DMF_DELETING and dm_get() in
dm_get_from_kobject().

To fix it, we need to ensure the test of DMF_FREEING & DMF_DELETING and
dm_get() are done in an atomic way, so _minor_lock is used.

The other callers of dm_get() have also been checked to be OK: some
callers invoke dm_get() under _minor_lock, some callers invoke it under
_hash_lock, and dm_start_request() invoke it after increasing
md->open_count.

Cc: stable@vger.kernel.org
Signed-off-by: Hou Tao 
Signed-off-by: Mike Snitzer 
Signed-off-by: Sasha Levin

dm: set DMF_SUSPENDED* _before_ clearing DMF_NOFLUSH_SUSPENDING

2016-08-20T03:08:33+00:00

[ Upstream commit eaf9a7361f47727b166688a9f2096854eef60fbe ]

Otherwise, there is potential for both DMF_SUSPENDED* and
DMF_NOFLUSH_SUSPENDING to not be set during dm_suspend() -- which is
definitely _not_ a valid state.

This fix, in conjuction with "dm rq: fix the starting and stopping of
blk-mq queues", addresses the potential for request-based DM multipath's
__multipath_map() to see !dm_noflush_suspending() during suspend.

Reported-by: Bart Van Assche 
Signed-off-by: Mike Snitzer 
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin

dm rq: fix the starting and stopping of blk-mq queues

2016-08-20T03:08:32+00:00

[ Upstream commit 7d9595d848cdff5c7939f68eec39e0c5d36a1d67 ]

Improve dm_stop_queue() to cancel any requeue_work.  Also, have
dm_start_queue() and dm_stop_queue() clear/set the QUEUE_FLAG_STOPPED
for the blk-mq request_queue.

On suspend dm_stop_queue() handles stopping the blk-mq request_queue
BUT: even though the hw_queues are marked BLK_MQ_S_STOPPED at that point
there is still a race that is allowing block/blk-mq.c to call ->queue_rq
against a hctx that it really shouldn't.  Add a check to
dm_mq_queue_rq() that guards against this rarity (albeit _not_
race-free).

Signed-off-by: Mike Snitzer 
Cc: stable@vger.kernel.org # must patch dm.c on < 4.8 kernels
Signed-off-by: Sasha Levin

dm: fix second blk_delay_queue() parameter to be in msec units not jiffies

2016-08-20T03:07:58+00:00

[ Upstream commit bd9f55ea1cf6e14eb054b06ea877d2d1fa339514 ]

Commit d548b34b062 ("dm: reduce the queue delay used in dm_request_fn
from 100ms to 10ms") always intended the value to be 10 msecs -- it
just expressed it in jiffies because earlier commit 7eaceaccab ("block:
remove per-queue plugging") did.

Signed-off-by: Tahsin Erdogan 
Signed-off-by: Mike Snitzer 
Fixes: d548b34b062 ("dm: reduce the queue delay used in dm_request_fn from 100ms to 10ms")
Cc: stable@vger.kernel.org # 4.1+ -- stable@ backports must be applied to drivers/md/dm.c
Signed-off-by: Sasha Levin

dm: fix excessive dm-mq context switching

2016-04-18T12:50:38+00:00

[ Upstream commit 6acfe68bac7e6f16dc312157b1fa6e2368985013 ]

Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower
than if an underlying null_blk device were used directly.  One of the
reasons for this drop in performance is that blk_insert_clone_request()
was calling blk_mq_insert_request() with @async=true.  This forced the
use of kblockd_schedule_delayed_work_on() to run the blk-mq hw queues
which ushered in ping-ponging between process context (fio in this case)
and kblockd's kworker to submit the cloned request.  The ftrace
function_graph tracer showed:

  kworker-2013  =>   fio-12190
  fio-12190    =>  kworker-2013
  ...
  kworker-2013  =>   fio-12190
  fio-12190    =>  kworker-2013
  ...

Fixing blk_insert_clone_request()'s blk_mq_insert_request() call to
_not_ use kblockd to submit the cloned requests isn't enough to
eliminate the observed context switches.

In addition to this dm-mq specific blk-core fix, there are 2 DM core
fixes to dm-mq that (when paired with the blk-core fix) completely
eliminate the observed context switching:

1)  don't blk_mq_run_hw_queues in blk-mq request completion

    Motivated by desire to reduce overhead of dm-mq, punting to kblockd
    just increases context switches.

    In my testing against a really fast null_blk device there was no benefit
    to running blk_mq_run_hw_queues() on completion (and no other blk-mq
    driver does this).  So hopefully this change doesn't induce the need for
    yet another revert like commit 621739b00e16ca2d !

2)  use blk_mq_complete_request() in dm_complete_request()

    blk_complete_request() doesn't offer the traditional q->mq_ops vs
    .request_fn branching pattern that other historic block interfaces
    do (e.g. blk_get_request).  Using blk_mq_complete_request() for
    blk-mq requests is important for performance.  It should be noted
    that, like blk_complete_request(), blk_mq_complete_request() doesn't
    natively handle partial completions -- but the request-based
    DM-multipath target does provide the required partial completion
    support by dm.c:end_clone_bio() triggering requeueing of the request
    via dm-mpath.c:multipath_end_io()'s return of DM_ENDIO_REQUEUE.

dm-mq fix #2 is _much_ more important than #1 for eliminating the
context switches.
Before: cpu          : usr=15.10%, sys=59.39%, ctx=7905181, majf=0, minf=475
After:  cpu          : usr=20.60%, sys=79.35%, ctx=2008, majf=0, minf=472

With these changes multithreaded async read IOPs improved from ~950K
to ~1350K for this dm-mq stacked on null_blk test-case.  The raw read
IOPs of the underlying null_blk device for the same workload is ~1950K.

Fixes: 7fb4898e0 ("block: add blk-mq support to blk_insert_cloned_request()")
Fixes: bfebd1cdb ("dm: add full blk-mq support to request-based DM")
Cc: stable@vger.kernel.org # 4.1+
Reported-by: Sagi Grimberg 
Signed-off-by: Mike Snitzer 
Acked-by: Jens Axboe 
Signed-off-by: Sasha Levin

dm: fix dm_rq_target_io leak on faults with .request_fn DM w/ blk-mq paths

2016-03-07T17:33:28+00:00

[ Upstream commit 4328daa2e79ed904a42ce00a9f38b9c36b44b21a ]

Using request-based DM mpath configured with the following stacking
(.request_fn DM mpath ontop of scsi-mq paths):

echo Y > /sys/module/scsi_mod/parameters/use_blk_mq
echo N > /sys/module/dm_mod/parameters/use_blk_mq

'struct dm_rq_target_io' would leak if a request is requeued before a
blk-mq clone is allocated (or fails to allocate).  free_rq_tio()
wasn't being called.

kmemleak reported:

unreferenced object 0xffff8800b90b98c0 (size 112):
  comm "kworker/7:1H", pid 5692, jiffies 4295056109 (age 78.589s)
  hex dump (first 32 bytes):
    00 d0 5c 2c 03 88 ff ff 40 00 bf 01 00 c9 ff ff  ..\,....@.......
    e0 d9 b1 34 00 88 ff ff 00 00 00 00 00 00 00 00  ...4............
  backtrace:
    [] kmemleak_alloc+0x4e/0xb0
    [] kmem_cache_alloc+0xc3/0x1e0
    [] mempool_alloc_slab+0x15/0x20
    [] mempool_alloc+0x6e/0x170
    [] dm_old_prep_fn+0x3c/0x180 [dm_mod]
    [] blk_peek_request+0x168/0x290
    [] dm_request_fn+0xb2/0x1b0 [dm_mod]
    [] __blk_run_queue+0x33/0x40
    [] blk_delay_work+0x25/0x40
    [] process_one_work+0x14f/0x3d0
    [] worker_thread+0x125/0x4b0
    [] kthread+0xd8/0xf0
    [] ret_from_fork+0x3f/0x70
    [] 0xffffffffffffffff

crash> struct -o dm_rq_target_io
struct dm_rq_target_io {
    ...
}
SIZE: 112

Fixes: e5863d9ad7 ("dm: allocate requests in target when stacking on blk-mq devices")
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Mike Snitzer 
Signed-off-by: Sasha Levin

dm: fix AB-BA deadlock in __dm_destroy()

2015-10-22T21:43:26+00:00

commit 2a708cff93f1845b9239bc7d6310aef54e716c6a upstream.

__dm_destroy() takes io_barrier SRCU lock (dm_get_live_table) and
suspend_lock in reverse order.  Doing so can cause AB-BA deadlock:

  __dm_destroy                    dm_swap_table
  ---------------------------------------------------
                                  mutex_lock(suspend_lock)
  dm_get_live_table()
    srcu_read_lock(io_barrier)
                                  dm_sync_table()
                                    synchronize_srcu(io_barrier)
                                      .. waiting for dm_put_live_table()
  mutex_lock(suspend_lock)
    .. waiting for suspend_lock

Fix this by taking the locks in proper order.

Signed-off-by: Jun'ichi Nomura 
Fixes: ab7c7bb6f4ab ("dm: hold suspend_lock while suspending device during device deletion")
Acked-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer 
Signed-off-by: Greg Kroah-Hartman

dm: fix dm_merge_bvec regression on 32 bit systems

2015-08-17T03:52:23+00:00

commit bd4aaf8f9b85d6b2df3231fd62b219ebb75d3568 upstream.

A DM regression on 32 bit systems was reported against v4.2-rc3 here:
https://lkml.org/lkml/2015/7/29/401

Fix this by reverting both commit 1c220c69 ("dm: fix casting bug in
dm_merge_bvec()") and 148e51ba ("dm: improve documentation and code
clarity in dm_merge_bvec").  This combined revert is done to eliminate
the possibility of a partial revert in stable@ kernels.

In hindsight the correct fix, at the time 1c220c69 was applied to fix
the regression that 148e51ba introduced, should've been to simply revert
148e51ba.

Reported-by: Josh Boyer 
Tested-by: Adam Williamson 
Acked-by: Joe Thornber 
Signed-off-by: Mike Snitzer 
Signed-off-by: Greg Kroah-Hartman

Revert "dm: only run the queue on completion if congested or no requests pending"

2015-08-10T19:21:54+00:00

commit 621739b00e16ca2d80411dc9b111cb15b91f3ba9 upstream.

This reverts commit 9a0e609e3fd8a95c96629b9fbde6b8c5b9a1456a.
(Resolved a conflict during revert due to commit bfebd1cdb4 that came
after)

This revert is motivated by a couple failure reports on request-based DM
multipath testbeds:
1) Netapp reported that their multipath fault injection test under heavy
   IO load can stall longer than 300 seconds.
2) IBM reported elevated lock contention in their testbed (likely due to
   increased back pressure due to IO not being dispatched as quickly):
   https://www.redhat.com/archives/dm-devel/2015-July/msg00057.html

Signed-off-by: Mike Snitzer 
Signed-off-by: Greg Kroah-Hartman