linux.git/drivers/md/raid1.c, branch v4.13-rc2

Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md

2017-07-08T19:50:18+00:00

Pull MD update from Shaohua Li:

 - fixed deadlock in MD suspend and a potential bug in bio allocation
   (Neil Brown)

 - fixed signal issue (Mikulas Patocka)

 - fixed typo in FailFast test (Guoqing Jiang)

 - other trival fixes

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  MD: fix sleep in atomic
  MD: fix a null dereference
  md: use a separate bio_set for synchronous IO.
  md: change the initialization value for a spare device spot to MD_DISK_ROLE_SPARE
  md/raid1: remove unused bio in sync_request_write
  md/raid10: fix FailFast test for wrong device
  md: don't use flush_signals in userspace processes
  md: fix deadlock between mddev_suspend() and md_write_start()

blk: replace bioset_create_nobvec() with a flags arg to bioset_create()

2017-06-18T18:40:59+00:00

"flags" arguments are often seen as good API design as they allow
easy extensibility.
bioset_create_nobvec() is implemented internally as a variation in
flags passed to __bioset_create().

To support future extension, make the internal structure part of the
API.
i.e. add a 'flags' argument to bioset_create() and discard
bioset_create_nobvec().

Note that the bio_split allocations in drivers/md/raid* do not need
the bvec mempool - they should have used bioset_create_nobvec().

Suggested-by: Christoph Hellwig 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Ming Lei 
Signed-off-by: NeilBrown 
Signed-off-by: Jens Axboe

md/raid1: remove unused bio in sync_request_write

2017-06-16T19:04:09+00:00

The "bio" is not used in sync_request_write after commit a68e58703575
("md/raid1: split out two sub-functions from sync_request_write").

Signed-off-by: Guoqing Jiang 
Signed-off-by: Shaohua Li

md: don't use flush_signals in userspace processes

2017-06-13T17:18:02+00:00

The function flush_signals clears all pending signals for the process. It
may be used by kernel threads when we need to prepare a kernel thread for
responding to signals. However using this function for an userspaces
processes is incorrect - clearing signals without the program expecting it
can cause misbehavior.

The raid1 and raid5 code uses flush_signals in its request routine because
it wants to prepare for an interruptible wait. This patch drops
flush_signals and uses sigprocmask instead to block all signals (including
SIGKILL) around the schedule() call. The signals are not lost, but the
schedule() call won't respond to them.

Signed-off-by: Mikulas Patocka 
Cc: stable@vger.kernel.org
Acked-by: NeilBrown 
Signed-off-by: Shaohua Li

md: fix deadlock between mddev_suspend() and md_write_start()

2017-06-13T17:18:01+00:00

If mddev_suspend() races with md_write_start() we can deadlock
with mddev_suspend() waiting for the request that is currently
in md_write_start() to complete the ->make_request() call,
and md_write_start() waiting for the metadata to be updated
to mark the array as 'dirty'.
As metadata updates done by md_check_recovery() only happen then
the mddev_lock() can be claimed, and as mddev_suspend() is often
called with the lock held, these threads wait indefinitely for each
other.

We fix this by having md_write_start() abort if mddev_suspend()
is happening, and ->make_request() aborts if md_write_start()
aborted.
md_make_request() can detect this abort, decrease the ->active_io
count, and wait for mddev_suspend().

Reported-by: Nix 
Fix: 68866e425be2(MD: no sync IO while suspended)
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown 
Signed-off-by: Shaohua Li

Merge tag 'v4.12-rc5' into for-4.13/block

2017-06-12T14:30:13+00:00

We've already got a few conflicts and upcoming work depends on some of the
changes that have gone into mainline as regression fixes for this series.

Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream
trees to continue working on 4.13 changes.

Signed-off-by: Jens Axboe

block: switch bios to blk_status_t

2017-06-09T15:27:32+00:00

Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.

Signed-off-by: Christoph Hellwig 
Signed-off-by: Jens Axboe

md: initialise ->writes_pending in personality modules.

2017-06-05T23:04:35+00:00

The new per-cpu counter for writes_pending is initialised in
md_alloc(), which is not called by dm-raid.
So dm-raid fails when md_write_start() is called.

Move the initialization to the personality modules
that need it.  This way it is always initialised when needed,
but isn't unnecessarily initialized (requiring memory allocation)
when the personality doesn't use writes_pending.

Reported-by: Heinz Mauelshagen 
Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending")
Signed-off-by: NeilBrown 
Signed-off-by: Shaohua Li

raid1: prefer disk without bad blocks

2017-05-12T21:41:15+00:00

If an array consists of two drives and the first drive has the bad
block, the read request to the region overlapping the bad block chooses
the same disk (with bad block) as device to read from over and over and
the request gets stuck. If the first disk only partially overlaps with
bad block, it becomes a candidate ("best disk") for shorter range of
sectors. The second disk is capable of reading the entire requested
range and it is updated accordingly, however it is not recorded as a
best device for the request. In the end the request is sent to the first
disk to read entire range of sectors. It fails and is re-tried in a
moment but with the same outcome.

Actually it is quite likely scenario but it had little exposure in my
test until commit 715d40b93b10 ("md/raid1: add failfast handling for
reads.") removed preference for idle disk. Such scenario had been
passing as second disk was always chosen when idle.

Reset a candidate ("best disk") to read from if disk can read entire
range. Do it only if other disk has already been chosen as a candidate
for a smaller range. The head position / disk type logic will select
the best disk to read from - it is fine as disk with bad block won't be
considered for it.

Signed-off-by: Tomasz Majchrzak 
Signed-off-by: Shaohua Li

md/raid1/10: avoid unnecessary locking

2017-05-11T22:32:17+00:00

If we add bios to block plugging list, locking is unnecessry, since the block
unplug is guaranteed not to run at that time.

Reviewed-by: NeilBrown 
Signed-off-by: Shaohua Li