linux-stable.git/drivers/md/raid10.c, branch v4.16.2

md: fix a potential deadlock of raid5/raid10 reshape

2018-02-25T18:39:15+00:00

There is a potential deadlock if mount/umount happens when
raid5_finish_reshape() tries to grow the size of emulated disk.

How the deadlock happens?
1) The raid5 resync thread finished reshape (expanding array).
2) The mount or umount thread holds VFS sb->s_umount lock and tries to
   write through critical data into raid5 emulated block device. So it
   waits for raid5 kernel thread handling stripes in order to finish it
   I/Os.
3) In the routine of raid5 kernel thread, md_check_recovery() will be
   called first in order to reap the raid5 resync thread. That is,
   raid5_finish_reshape() will be called. In this function, it will try
   to update conf and call VFS revalidate_disk() to grow the raid5
   emulated block device. It will try to acquire VFS sb->s_umount lock.
The raid5 kernel thread cannot continue, so no one can handle mount/
umount I/Os (stripes). Once the write-through I/Os cannot be finished,
mount/umount will not release sb->s_umount lock. The deadlock happens.

The raid5 kernel thread is an emulated block device. It is responible to
handle I/Os (stripes) from upper layers. The emulated block device
should not request any I/Os on itself. That is, it should not call VFS
layer functions. (If it did, it will try to acquire VFS locks to
guarantee the I/Os sequence.) So we have the resync thread to send
resync I/O requests and to wait for the results.

For solving this potential deadlock, we can put the size growth of the
emulated block device as the final step of reshape thread.

2017/12/29:
Thanks to Guoqing Jiang ,
we confirmed that there is the same deadlock issue in raid10. It's
reproducible and can be fixed by this patch. For raid10.c, we can remove
the similar code to prevent deadlock as well since they has been called
before.

Reported-by: Alex Wu 
Reviewed-by: Alex Wu 
Reviewed-by: Chung-Chiang Cheng 
Signed-off-by: BingJing Chang 
Signed-off-by: Shaohua Li

md-cluster: choose correct label when clustered layout is not supported

2018-02-25T18:36:55+00:00

r10conf is already successfully allocated before checking the layout

Signed-off-by: Lidong Zhong 
Reviewed-by: Guoqing Jiang 
Signed-off-by: Shaohua Li

md raid10: fix NULL deference in handle_write_completed()

2018-02-19T17:40:36+00:00

In the case of 'recover', an r10bio with R10BIO_WriteError &
R10BIO_IsRecover will be progressed by handle_write_completed().
This function traverses all r10bio->devs[copies].
If devs[m].repl_bio != NULL, it thinks conf->mirrors[dev].replacement
is also not NULL. However, this is not always true.

When there is an rdev of raid10 has replacement, then each r10bio
->devs[m].repl_bio != NULL in conf->r10buf_pool. However, in 'recover',
even if corresponded replacement is NULL, it doesn't clear r10bio
->devs[m].repl_bio, resulting in replacement NULL deference.

This bug was introduced when replacement support for raid10 was
added in Linux 3.3.

As NeilBrown suggested:
	Elsewhere the determination of "is this device part of the
	resync/recovery" is made by resting bio->bi_end_io.
	If this is end_sync_write, then we tried to write here.
	If it is NULL, then we didn't try to write.

Fixes: 9ad1aefc8ae8 ("md/raid10:  Handle replacement devices during resync.")
Cc: stable (V3.3+)
Suggested-by: NeilBrown 
Signed-off-by: Yufen Yu 
Signed-off-by: Shaohua Li

raid10: change the size of resync window for clustered raid

2018-02-17T21:06:13+00:00

To align with raid1's resync window, we need to
set the resync window of raid10 to 32M as well.

Fixes: 8db87912c9a8 ("md-cluster: Use a small window for raid10 resync")
Reported-by: Zhilong Liu 
Signed-off-by: Guoqing Jiang 
Signed-off-by: Shaohua Li

md/raid1,raid10: silence warning about wait-within-wait

2017-12-11T16:52:34+00:00

If you prepare_to_wait() after a previous prepare_to_wait(),
but before calling schedule(), you get warning:

  do not call blocking ops when !TASK_RUNNING; state=2

This is appropriate as it is often a bug.  The event that the
first prepare_to_wait() expects might wake up the schedule following
the second prepare_to_wait(), which could be confusing.

However if both prepare_to_wait()s are part of simple wait_event()
loops, and if the inner one is rarely called, then there is
no problem.  The inner loop is too simple to get confused by
a stray wakeup, and the outer loop won't spin unduly because the
inner doesnt affect it often.

This pattern occurs in both raid1.c and raid10.c in the use of
flush_pending_writes().

The warning can be silenced by setting current->state to TASK_RUNNING.

Signed-off-by: NeilBrown 
Signed-off-by: Shaohua Li

md/raid1/10: add missed blk plug

2017-12-01T20:19:48+00:00

flush_pending_writes isn't always called with block plug, so add it, and plug
works in nested way.

Signed-off-by: Shaohua Li

md-cluster: Use a small window for raid10 resync

2017-11-02T04:32:23+00:00

Suspending the entire device for resync could take
too long. Resync in small chunks.

cluster's resync window is maintained in r10conf as
cluster_sync_low and cluster_sync_high, and processed
in raid10's sync_request(). If the current resync is
outside the cluster resync window:

1. Set the cluster_sync_low to curr_resync_completed.
2. Set cluster_sync_high to cluster_sync_low + stripe
   size.
3. Send a message to all nodes so they may add it in
   their suspension list.

Note:
We only support "near" raid10 so far, resync a far or
offset raid10 array could have trouble. So raid10_run
checks the layout of clustered raid10, it will refuse
to run if the layout is not correct.

With the "near" layout we process one stripe at a time
progressing monotonically through the address space.
So we can have a sliding window of whole-stripes which
moves through the array suspending IO on other nodes,
and both resync which uses array addresses and recovery
which uses device addresses can stay within this window.

Signed-off-by: Guoqing Jiang 
Signed-off-by: Shaohua Li

md-cluster: Suspend writes in RAID10 if within range

2017-11-02T04:32:23+00:00

If there is a resync going on, all nodes must suspend
writes to the range. This is recorded in suspend_info
and suspend_list.

If there is an I/O within the ranges of any of the
suspend_info, area_resyncing will return 1.

Signed-off-by: Guoqing Jiang 
Signed-off-by: Shaohua Li

md-cluster/raid10: set "do_balance = 0" if area is resyncing

2017-11-02T04:32:22+00:00

Just like clustered raid1, it is impossible for cluster raid10
to choose the best device for read balance when the area of
array is resyncing. Because we cannot trust the data to be the
same on all devices at that time, so we choose just the first
one to use, so set do_balance to 0.

Signed-off-by: Guoqing Jiang 
Signed-off-by: Shaohua Li

md: remove special meaning of ->quiesce(.., 2)

2017-11-02T04:32:20+00:00

The '2' argument means "wake up anything that is waiting".
This is an inelegant part of the design and was added
to help support management of suspend_lo/suspend_hi setting.
Now that suspend_lo/hi is managed in mddev_suspend/resume,
that need is gone.
These is still a couple of places where we call 'quiesce'
with an argument of '2', but they can safely be changed to
call ->quiesce(.., 1); ->quiesce(.., 0) which
achieve the same result at the small cost of pausing IO
briefly.

This removes a small "optimization" from suspend_{hi,lo}_store,
but it isn't clear that optimization served a useful purpose.
The code now is a lot clearer.

Suggested-by: Shaohua Li 
Signed-off-by: NeilBrown 
Signed-off-by: Shaohua Li