<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/drivers/md/raid10.c, branch v4.16.2</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>md: fix a potential deadlock of raid5/raid10 reshape</title>
<updated>2018-02-25T18:39:15+00:00</updated>
<author>
<name>BingJing Chang</name>
<email>bingjingc@synology.com</email>
</author>
<published>2018-02-22T05:34:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=8876391e440ba615b10eef729576e111f0315f87'/>
<id>8876391e440ba615b10eef729576e111f0315f87</id>
<content type='text'>
There is a potential deadlock if mount/umount happens when
raid5_finish_reshape() tries to grow the size of emulated disk.

How the deadlock happens?
1) The raid5 resync thread finished reshape (expanding array).
2) The mount or umount thread holds VFS sb-&gt;s_umount lock and tries to
   write through critical data into raid5 emulated block device. So it
   waits for raid5 kernel thread handling stripes in order to finish it
   I/Os.
3) In the routine of raid5 kernel thread, md_check_recovery() will be
   called first in order to reap the raid5 resync thread. That is,
   raid5_finish_reshape() will be called. In this function, it will try
   to update conf and call VFS revalidate_disk() to grow the raid5
   emulated block device. It will try to acquire VFS sb-&gt;s_umount lock.
The raid5 kernel thread cannot continue, so no one can handle mount/
umount I/Os (stripes). Once the write-through I/Os cannot be finished,
mount/umount will not release sb-&gt;s_umount lock. The deadlock happens.

The raid5 kernel thread is an emulated block device. It is responible to
handle I/Os (stripes) from upper layers. The emulated block device
should not request any I/Os on itself. That is, it should not call VFS
layer functions. (If it did, it will try to acquire VFS locks to
guarantee the I/Os sequence.) So we have the resync thread to send
resync I/O requests and to wait for the results.

For solving this potential deadlock, we can put the size growth of the
emulated block device as the final step of reshape thread.

2017/12/29:
Thanks to Guoqing Jiang &lt;gqjiang@suse.com&gt;,
we confirmed that there is the same deadlock issue in raid10. It's
reproducible and can be fixed by this patch. For raid10.c, we can remove
the similar code to prevent deadlock as well since they has been called
before.

Reported-by: Alex Wu &lt;alexwu@synology.com&gt;
Reviewed-by: Alex Wu &lt;alexwu@synology.com&gt;
Reviewed-by: Chung-Chiang Cheng &lt;cccheng@synology.com&gt;
Signed-off-by: BingJing Chang &lt;bingjingc@synology.com&gt;
Signed-off-by: Shaohua Li &lt;sh.li@alibaba-inc.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
There is a potential deadlock if mount/umount happens when
raid5_finish_reshape() tries to grow the size of emulated disk.

How the deadlock happens?
1) The raid5 resync thread finished reshape (expanding array).
2) The mount or umount thread holds VFS sb-&gt;s_umount lock and tries to
   write through critical data into raid5 emulated block device. So it
   waits for raid5 kernel thread handling stripes in order to finish it
   I/Os.
3) In the routine of raid5 kernel thread, md_check_recovery() will be
   called first in order to reap the raid5 resync thread. That is,
   raid5_finish_reshape() will be called. In this function, it will try
   to update conf and call VFS revalidate_disk() to grow the raid5
   emulated block device. It will try to acquire VFS sb-&gt;s_umount lock.
The raid5 kernel thread cannot continue, so no one can handle mount/
umount I/Os (stripes). Once the write-through I/Os cannot be finished,
mount/umount will not release sb-&gt;s_umount lock. The deadlock happens.

The raid5 kernel thread is an emulated block device. It is responible to
handle I/Os (stripes) from upper layers. The emulated block device
should not request any I/Os on itself. That is, it should not call VFS
layer functions. (If it did, it will try to acquire VFS locks to
guarantee the I/Os sequence.) So we have the resync thread to send
resync I/O requests and to wait for the results.

For solving this potential deadlock, we can put the size growth of the
emulated block device as the final step of reshape thread.

2017/12/29:
Thanks to Guoqing Jiang &lt;gqjiang@suse.com&gt;,
we confirmed that there is the same deadlock issue in raid10. It's
reproducible and can be fixed by this patch. For raid10.c, we can remove
the similar code to prevent deadlock as well since they has been called
before.

Reported-by: Alex Wu &lt;alexwu@synology.com&gt;
Reviewed-by: Alex Wu &lt;alexwu@synology.com&gt;
Reviewed-by: Chung-Chiang Cheng &lt;cccheng@synology.com&gt;
Signed-off-by: BingJing Chang &lt;bingjingc@synology.com&gt;
Signed-off-by: Shaohua Li &lt;sh.li@alibaba-inc.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md-cluster: choose correct label when clustered layout is not supported</title>
<updated>2018-02-25T18:36:55+00:00</updated>
<author>
<name>Lidong Zhong</name>
<email>lzhong@suse.com</email>
</author>
<published>2018-01-23T15:06:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=43a521238aca0e24d50add1db125a61bda2a3527'/>
<id>43a521238aca0e24d50add1db125a61bda2a3527</id>
<content type='text'>
r10conf is already successfully allocated before checking the layout

Signed-off-by: Lidong Zhong &lt;lzhong@suse.com&gt;
Reviewed-by: Guoqing Jiang &lt;gqjiang@suse.com&gt;
Signed-off-by: Shaohua Li &lt;sh.li@alibaba-inc.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
r10conf is already successfully allocated before checking the layout

Signed-off-by: Lidong Zhong &lt;lzhong@suse.com&gt;
Reviewed-by: Guoqing Jiang &lt;gqjiang@suse.com&gt;
Signed-off-by: Shaohua Li &lt;sh.li@alibaba-inc.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md raid10: fix NULL deference in handle_write_completed()</title>
<updated>2018-02-19T17:40:36+00:00</updated>
<author>
<name>Yufen Yu</name>
<email>yuyufen@huawei.com</email>
</author>
<published>2018-02-06T09:39:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=01a69cab01c184d3786af09e9339311123d63d22'/>
<id>01a69cab01c184d3786af09e9339311123d63d22</id>
<content type='text'>
In the case of 'recover', an r10bio with R10BIO_WriteError &amp;
R10BIO_IsRecover will be progressed by handle_write_completed().
This function traverses all r10bio-&gt;devs[copies].
If devs[m].repl_bio != NULL, it thinks conf-&gt;mirrors[dev].replacement
is also not NULL. However, this is not always true.

When there is an rdev of raid10 has replacement, then each r10bio
-&gt;devs[m].repl_bio != NULL in conf-&gt;r10buf_pool. However, in 'recover',
even if corresponded replacement is NULL, it doesn't clear r10bio
-&gt;devs[m].repl_bio, resulting in replacement NULL deference.

This bug was introduced when replacement support for raid10 was
added in Linux 3.3.

As NeilBrown suggested:
	Elsewhere the determination of "is this device part of the
	resync/recovery" is made by resting bio-&gt;bi_end_io.
	If this is end_sync_write, then we tried to write here.
	If it is NULL, then we didn't try to write.

Fixes: 9ad1aefc8ae8 ("md/raid10:  Handle replacement devices during resync.")
Cc: stable (V3.3+)
Suggested-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Yufen Yu &lt;yuyufen@huawei.com&gt;
Signed-off-by: Shaohua Li &lt;sh.li@alibaba-inc.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In the case of 'recover', an r10bio with R10BIO_WriteError &amp;
R10BIO_IsRecover will be progressed by handle_write_completed().
This function traverses all r10bio-&gt;devs[copies].
If devs[m].repl_bio != NULL, it thinks conf-&gt;mirrors[dev].replacement
is also not NULL. However, this is not always true.

When there is an rdev of raid10 has replacement, then each r10bio
-&gt;devs[m].repl_bio != NULL in conf-&gt;r10buf_pool. However, in 'recover',
even if corresponded replacement is NULL, it doesn't clear r10bio
-&gt;devs[m].repl_bio, resulting in replacement NULL deference.

This bug was introduced when replacement support for raid10 was
added in Linux 3.3.

As NeilBrown suggested:
	Elsewhere the determination of "is this device part of the
	resync/recovery" is made by resting bio-&gt;bi_end_io.
	If this is end_sync_write, then we tried to write here.
	If it is NULL, then we didn't try to write.

Fixes: 9ad1aefc8ae8 ("md/raid10:  Handle replacement devices during resync.")
Cc: stable (V3.3+)
Suggested-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Yufen Yu &lt;yuyufen@huawei.com&gt;
Signed-off-by: Shaohua Li &lt;sh.li@alibaba-inc.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>raid10: change the size of resync window for clustered raid</title>
<updated>2018-02-17T21:06:13+00:00</updated>
<author>
<name>Guoqing Jiang</name>
<email>gqjiang@suse.com</email>
</author>
<published>2018-01-19T03:37:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=4b242e97d74192bbc5decd808c058cbc347af016'/>
<id>4b242e97d74192bbc5decd808c058cbc347af016</id>
<content type='text'>
To align with raid1's resync window, we need to
set the resync window of raid10 to 32M as well.

Fixes: 8db87912c9a8 ("md-cluster: Use a small window for raid10 resync")
Reported-by: Zhilong Liu &lt;zlliu@suse.com&gt;
Signed-off-by: Guoqing Jiang &lt;gqjiang@suse.com&gt;
Signed-off-by: Shaohua Li &lt;sh.li@alibaba-inc.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
To align with raid1's resync window, we need to
set the resync window of raid10 to 32M as well.

Fixes: 8db87912c9a8 ("md-cluster: Use a small window for raid10 resync")
Reported-by: Zhilong Liu &lt;zlliu@suse.com&gt;
Signed-off-by: Guoqing Jiang &lt;gqjiang@suse.com&gt;
Signed-off-by: Shaohua Li &lt;sh.li@alibaba-inc.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md/raid1,raid10: silence warning about wait-within-wait</title>
<updated>2017-12-11T16:52:34+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.com</email>
</author>
<published>2017-12-03T21:21:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=474beb575c03e0e7f1a704ac428916898f81b3cd'/>
<id>474beb575c03e0e7f1a704ac428916898f81b3cd</id>
<content type='text'>
If you prepare_to_wait() after a previous prepare_to_wait(),
but before calling schedule(), you get warning:

  do not call blocking ops when !TASK_RUNNING; state=2

This is appropriate as it is often a bug.  The event that the
first prepare_to_wait() expects might wake up the schedule following
the second prepare_to_wait(), which could be confusing.

However if both prepare_to_wait()s are part of simple wait_event()
loops, and if the inner one is rarely called, then there is
no problem.  The inner loop is too simple to get confused by
a stray wakeup, and the outer loop won't spin unduly because the
inner doesnt affect it often.

This pattern occurs in both raid1.c and raid10.c in the use of
flush_pending_writes().

The warning can be silenced by setting current-&gt;state to TASK_RUNNING.

Signed-off-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
If you prepare_to_wait() after a previous prepare_to_wait(),
but before calling schedule(), you get warning:

  do not call blocking ops when !TASK_RUNNING; state=2

This is appropriate as it is often a bug.  The event that the
first prepare_to_wait() expects might wake up the schedule following
the second prepare_to_wait(), which could be confusing.

However if both prepare_to_wait()s are part of simple wait_event()
loops, and if the inner one is rarely called, then there is
no problem.  The inner loop is too simple to get confused by
a stray wakeup, and the outer loop won't spin unduly because the
inner doesnt affect it often.

This pattern occurs in both raid1.c and raid10.c in the use of
flush_pending_writes().

The warning can be silenced by setting current-&gt;state to TASK_RUNNING.

Signed-off-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md/raid1/10: add missed blk plug</title>
<updated>2017-12-01T20:19:48+00:00</updated>
<author>
<name>Shaohua Li</name>
<email>shli@fb.com</email>
</author>
<published>2017-12-01T20:12:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=18022a1bd3709b74ca31ef0b28fccd52bcd6c504'/>
<id>18022a1bd3709b74ca31ef0b28fccd52bcd6c504</id>
<content type='text'>
flush_pending_writes isn't always called with block plug, so add it, and plug
works in nested way.

Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
flush_pending_writes isn't always called with block plug, so add it, and plug
works in nested way.

Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md-cluster: Use a small window for raid10 resync</title>
<updated>2017-11-02T04:32:23+00:00</updated>
<author>
<name>Guoqing Jiang</name>
<email>gqjiang@suse.com</email>
</author>
<published>2017-10-24T07:11:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=8db87912c9a8771c53b98845cd5516ea63b22e1e'/>
<id>8db87912c9a8771c53b98845cd5516ea63b22e1e</id>
<content type='text'>
Suspending the entire device for resync could take
too long. Resync in small chunks.

cluster's resync window is maintained in r10conf as
cluster_sync_low and cluster_sync_high, and processed
in raid10's sync_request(). If the current resync is
outside the cluster resync window:

1. Set the cluster_sync_low to curr_resync_completed.
2. Set cluster_sync_high to cluster_sync_low + stripe
   size.
3. Send a message to all nodes so they may add it in
   their suspension list.

Note:
We only support "near" raid10 so far, resync a far or
offset raid10 array could have trouble. So raid10_run
checks the layout of clustered raid10, it will refuse
to run if the layout is not correct.

With the "near" layout we process one stripe at a time
progressing monotonically through the address space.
So we can have a sliding window of whole-stripes which
moves through the array suspending IO on other nodes,
and both resync which uses array addresses and recovery
which uses device addresses can stay within this window.

Signed-off-by: Guoqing Jiang &lt;gqjiang@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Suspending the entire device for resync could take
too long. Resync in small chunks.

cluster's resync window is maintained in r10conf as
cluster_sync_low and cluster_sync_high, and processed
in raid10's sync_request(). If the current resync is
outside the cluster resync window:

1. Set the cluster_sync_low to curr_resync_completed.
2. Set cluster_sync_high to cluster_sync_low + stripe
   size.
3. Send a message to all nodes so they may add it in
   their suspension list.

Note:
We only support "near" raid10 so far, resync a far or
offset raid10 array could have trouble. So raid10_run
checks the layout of clustered raid10, it will refuse
to run if the layout is not correct.

With the "near" layout we process one stripe at a time
progressing monotonically through the address space.
So we can have a sliding window of whole-stripes which
moves through the array suspending IO on other nodes,
and both resync which uses array addresses and recovery
which uses device addresses can stay within this window.

Signed-off-by: Guoqing Jiang &lt;gqjiang@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md-cluster: Suspend writes in RAID10 if within range</title>
<updated>2017-11-02T04:32:23+00:00</updated>
<author>
<name>Guoqing Jiang</name>
<email>gqjiang@suse.com</email>
</author>
<published>2017-10-24T07:11:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=cb8a7a7e1098e74d36378b992a6d012668ec10d9'/>
<id>cb8a7a7e1098e74d36378b992a6d012668ec10d9</id>
<content type='text'>
If there is a resync going on, all nodes must suspend
writes to the range. This is recorded in suspend_info
and suspend_list.

If there is an I/O within the ranges of any of the
suspend_info, area_resyncing will return 1.

Signed-off-by: Guoqing Jiang &lt;gqjiang@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
If there is a resync going on, all nodes must suspend
writes to the range. This is recorded in suspend_info
and suspend_list.

If there is an I/O within the ranges of any of the
suspend_info, area_resyncing will return 1.

Signed-off-by: Guoqing Jiang &lt;gqjiang@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md-cluster/raid10: set "do_balance = 0" if area is resyncing</title>
<updated>2017-11-02T04:32:22+00:00</updated>
<author>
<name>Guoqing Jiang</name>
<email>gqjiang@suse.com</email>
</author>
<published>2017-10-24T07:11:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=d4098c7262a47f529765d89614484a957363d623'/>
<id>d4098c7262a47f529765d89614484a957363d623</id>
<content type='text'>
Just like clustered raid1, it is impossible for cluster raid10
to choose the best device for read balance when the area of
array is resyncing. Because we cannot trust the data to be the
same on all devices at that time, so we choose just the first
one to use, so set do_balance to 0.

Signed-off-by: Guoqing Jiang &lt;gqjiang@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Just like clustered raid1, it is impossible for cluster raid10
to choose the best device for read balance when the area of
array is resyncing. Because we cannot trust the data to be the
same on all devices at that time, so we choose just the first
one to use, so set do_balance to 0.

Signed-off-by: Guoqing Jiang &lt;gqjiang@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md: remove special meaning of -&gt;quiesce(.., 2)</title>
<updated>2017-11-02T04:32:20+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.com</email>
</author>
<published>2017-10-19T01:49:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=b03e0ccb5ab9df3efbe51c87843a1ffbecbafa1f'/>
<id>b03e0ccb5ab9df3efbe51c87843a1ffbecbafa1f</id>
<content type='text'>
The '2' argument means "wake up anything that is waiting".
This is an inelegant part of the design and was added
to help support management of suspend_lo/suspend_hi setting.
Now that suspend_lo/hi is managed in mddev_suspend/resume,
that need is gone.
These is still a couple of places where we call 'quiesce'
with an argument of '2', but they can safely be changed to
call -&gt;quiesce(.., 1); -&gt;quiesce(.., 0) which
achieve the same result at the small cost of pausing IO
briefly.

This removes a small "optimization" from suspend_{hi,lo}_store,
but it isn't clear that optimization served a useful purpose.
The code now is a lot clearer.

Suggested-by: Shaohua Li &lt;shli@kernel.org&gt;
Signed-off-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The '2' argument means "wake up anything that is waiting".
This is an inelegant part of the design and was added
to help support management of suspend_lo/suspend_hi setting.
Now that suspend_lo/hi is managed in mddev_suspend/resume,
that need is gone.
These is still a couple of places where we call 'quiesce'
with an argument of '2', but they can safely be changed to
call -&gt;quiesce(.., 1); -&gt;quiesce(.., 0) which
achieve the same result at the small cost of pausing IO
briefly.

This removes a small "optimization" from suspend_{hi,lo}_store,
but it isn't clear that optimization served a useful purpose.
The code now is a lot clearer.

Suggested-by: Shaohua Li &lt;shli@kernel.org&gt;
Signed-off-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</pre>
</div>
</content>
</entry>
</feed>
