<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/drivers/md/raid1.c, branch v4.12.2</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>md: initialise -&gt;writes_pending in personality modules.</title>
<updated>2017-06-05T23:04:35+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.com</email>
</author>
<published>2017-06-05T06:05:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=a415c0f10627913793709ddb75add09d2ea334dc'/>
<id>a415c0f10627913793709ddb75add09d2ea334dc</id>
<content type='text'>
The new per-cpu counter for writes_pending is initialised in
md_alloc(), which is not called by dm-raid.
So dm-raid fails when md_write_start() is called.

Move the initialization to the personality modules
that need it.  This way it is always initialised when needed,
but isn't unnecessarily initialized (requiring memory allocation)
when the personality doesn't use writes_pending.

Reported-by: Heinz Mauelshagen &lt;heinzm@redhat.com&gt;
Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending")
Signed-off-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The new per-cpu counter for writes_pending is initialised in
md_alloc(), which is not called by dm-raid.
So dm-raid fails when md_write_start() is called.

Move the initialization to the personality modules
that need it.  This way it is always initialised when needed,
but isn't unnecessarily initialized (requiring memory allocation)
when the personality doesn't use writes_pending.

Reported-by: Heinz Mauelshagen &lt;heinzm@redhat.com&gt;
Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending")
Signed-off-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>raid1: prefer disk without bad blocks</title>
<updated>2017-05-12T21:41:15+00:00</updated>
<author>
<name>Tomasz Majchrzak</name>
<email>tomasz.majchrzak@intel.com</email>
</author>
<published>2017-05-12T12:26:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=d82dd0e34d0347be201fd274dc84cd645dccc064'/>
<id>d82dd0e34d0347be201fd274dc84cd645dccc064</id>
<content type='text'>
If an array consists of two drives and the first drive has the bad
block, the read request to the region overlapping the bad block chooses
the same disk (with bad block) as device to read from over and over and
the request gets stuck. If the first disk only partially overlaps with
bad block, it becomes a candidate ("best disk") for shorter range of
sectors. The second disk is capable of reading the entire requested
range and it is updated accordingly, however it is not recorded as a
best device for the request. In the end the request is sent to the first
disk to read entire range of sectors. It fails and is re-tried in a
moment but with the same outcome.

Actually it is quite likely scenario but it had little exposure in my
test until commit 715d40b93b10 ("md/raid1: add failfast handling for
reads.") removed preference for idle disk. Such scenario had been
passing as second disk was always chosen when idle.

Reset a candidate ("best disk") to read from if disk can read entire
range. Do it only if other disk has already been chosen as a candidate
for a smaller range. The head position / disk type logic will select
the best disk to read from - it is fine as disk with bad block won't be
considered for it.

Signed-off-by: Tomasz Majchrzak &lt;tomasz.majchrzak@intel.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
If an array consists of two drives and the first drive has the bad
block, the read request to the region overlapping the bad block chooses
the same disk (with bad block) as device to read from over and over and
the request gets stuck. If the first disk only partially overlaps with
bad block, it becomes a candidate ("best disk") for shorter range of
sectors. The second disk is capable of reading the entire requested
range and it is updated accordingly, however it is not recorded as a
best device for the request. In the end the request is sent to the first
disk to read entire range of sectors. It fails and is re-tried in a
moment but with the same outcome.

Actually it is quite likely scenario but it had little exposure in my
test until commit 715d40b93b10 ("md/raid1: add failfast handling for
reads.") removed preference for idle disk. Such scenario had been
passing as second disk was always chosen when idle.

Reset a candidate ("best disk") to read from if disk can read entire
range. Do it only if other disk has already been chosen as a candidate
for a smaller range. The head position / disk type logic will select
the best disk to read from - it is fine as disk with bad block won't be
considered for it.

Signed-off-by: Tomasz Majchrzak &lt;tomasz.majchrzak@intel.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md/raid1/10: avoid unnecessary locking</title>
<updated>2017-05-11T22:32:17+00:00</updated>
<author>
<name>Shaohua Li</name>
<email>shli@fb.com</email>
</author>
<published>2017-05-10T15:47:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=23b245c04d0ef408087430dd4d1b214a5da1eb78'/>
<id>23b245c04d0ef408087430dd4d1b214a5da1eb78</id>
<content type='text'>
If we add bios to block plugging list, locking is unnecessry, since the block
unplug is guaranteed not to run at that time.

Reviewed-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
If we add bios to block plugging list, locking is unnecessry, since the block
unplug is guaranteed not to run at that time.

Reviewed-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md: don't return -EAGAIN in md_allow_write for external metadata arrays</title>
<updated>2017-05-08T17:32:59+00:00</updated>
<author>
<name>Artur Paszkiewicz</name>
<email>artur.paszkiewicz@intel.com</email>
</author>
<published>2017-05-08T09:56:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=2214c260c72b0bd94e6c1c19bf451686212025d3'/>
<id>2214c260c72b0bd94e6c1c19bf451686212025d3</id>
<content type='text'>
This essentially reverts commit b5470dc5fc18 ("md: resolve external
metadata handling deadlock in md_allow_write") with some adjustments.

Since commit 6791875e2e53 ("md: make reconfig_mutex optional for writes
to md sysfs files.") changing array_state to 'active' does not use
mddev_lock() and will not cause a deadlock with md_allow_write(). This
revert simplifies userspace tools that write to sysfs attributes like
"stripe_cache_size" or "consistency_policy" because it removes the need
for special handling for external metadata arrays, checking for EAGAIN
and retrying the write.

Signed-off-by: Artur Paszkiewicz &lt;artur.paszkiewicz@intel.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This essentially reverts commit b5470dc5fc18 ("md: resolve external
metadata handling deadlock in md_allow_write") with some adjustments.

Since commit 6791875e2e53 ("md: make reconfig_mutex optional for writes
to md sysfs files.") changing array_state to 'active' does not use
mddev_lock() and will not cause a deadlock with md_allow_write(). This
revert simplifies userspace tools that write to sysfs attributes like
"stripe_cache_size" or "consistency_policy" because it removes the need
for special handling for external metadata arrays, checking for EAGAIN
and retrying the write.

Signed-off-by: Artur Paszkiewicz &lt;artur.paszkiewicz@intel.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'md-next' into md-linus</title>
<updated>2017-05-01T21:09:21+00:00</updated>
<author>
<name>Shaohua Li</name>
<email>shli@fb.com</email>
</author>
<published>2017-05-01T21:09:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=e265eb3a30543a237b2ebc4e0422ac82e55b07e4'/>
<id>e265eb3a30543a237b2ebc4e0422ac82e55b07e4</id>
<content type='text'>
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
</pre>
</div>
</content>
</entry>
<entry>
<title>md/raid1: Use a new variable to count flighting sync requests</title>
<updated>2017-04-27T21:01:16+00:00</updated>
<author>
<name>Xiao Ni</name>
<email>xni@redhat.com</email>
</author>
<published>2017-04-27T08:28:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=43ac9b84a399bc10210a2d9f4e0778b7c6059c07'/>
<id>43ac9b84a399bc10210a2d9f4e0778b7c6059c07</id>
<content type='text'>
In new barrier codes, raise_barrier waits if conf-&gt;nr_pending[idx] is not zero.
After all the conditions are true, the resync request can go on be handled. But
it adds conf-&gt;nr_pending[idx] again. The next resync request hit the same bucket
idx need to wait the resync request which is submitted before. The performance
of resync/recovery is degraded.
So we should use a new variable to count sync requests which are in flight.

I did a simple test:
1. Without the patch, create a raid1 with two disks. The resync speed:
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00  166.00    0.00    10.38     0.00   128.00     0.03    0.20    0.20    0.00   0.19   3.20
sdc               0.00     0.00    0.00  166.00     0.00    10.38   128.00     0.96    5.77    0.00    5.77   5.75  95.50
2. With the patch, the result is:
sdb            2214.00     0.00  766.00    0.00   185.69     0.00   496.46     2.80    3.66    3.66    0.00   1.03  79.10
sdc               0.00  2205.00    0.00  769.00     0.00   186.44   496.52     5.25    6.84    0.00    6.84   1.30 100.10

Suggested-by: Shaohua Li &lt;shli@kernel.org&gt;
Signed-off-by: Xiao Ni &lt;xni@redhat.com&gt;
Acked-by: Coly Li &lt;colyli@suse.de&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In new barrier codes, raise_barrier waits if conf-&gt;nr_pending[idx] is not zero.
After all the conditions are true, the resync request can go on be handled. But
it adds conf-&gt;nr_pending[idx] again. The next resync request hit the same bucket
idx need to wait the resync request which is submitted before. The performance
of resync/recovery is degraded.
So we should use a new variable to count sync requests which are in flight.

I did a simple test:
1. Without the patch, create a raid1 with two disks. The resync speed:
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00  166.00    0.00    10.38     0.00   128.00     0.03    0.20    0.20    0.00   0.19   3.20
sdc               0.00     0.00    0.00  166.00     0.00    10.38   128.00     0.96    5.77    0.00    5.77   5.75  95.50
2. With the patch, the result is:
sdb            2214.00     0.00  766.00    0.00   185.69     0.00   496.46     2.80    3.66    3.66    0.00   1.03  79.10
sdc               0.00  2205.00    0.00  769.00     0.00   186.44   496.52     5.25    6.84    0.00    6.84   1.30 100.10

Suggested-by: Shaohua Li &lt;shli@kernel.org&gt;
Signed-off-by: Xiao Ni &lt;xni@redhat.com&gt;
Acked-by: Coly Li &lt;colyli@suse.de&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md: clear WantReplacement once disk is removed</title>
<updated>2017-04-25T16:36:29+00:00</updated>
<author>
<name>Guoqing Jiang</name>
<email>gqjiang@suse.com</email>
</author>
<published>2017-04-24T07:58:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=e5bc9c3c5432f5531a58e6fdd9f6c6587f2137b3'/>
<id>e5bc9c3c5432f5531a58e6fdd9f6c6587f2137b3</id>
<content type='text'>
We can clear 'WantReplacement' flag directly no
matter it's replacement existed or not since the
semantic is same as before.

Also since the disk is removed from array, then
it is straightforward to remove 'WantReplacement'
flag and the comments in raid10/5 can be removed
as well.

Signed-off-by: Guoqing Jiang &lt;gqjiang@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We can clear 'WantReplacement' flag directly no
matter it's replacement existed or not since the
semantic is same as before.

Also since the disk is removed from array, then
it is straightforward to remove 'WantReplacement'
flag and the comments in raid10/5 can be removed
as well.

Signed-off-by: Guoqing Jiang &lt;gqjiang@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md/raid1/10: remove unused queue</title>
<updated>2017-04-23T23:59:13+00:00</updated>
<author>
<name>Lidong Zhong</name>
<email>lidong.zhong@suse.com</email>
</author>
<published>2017-04-21T07:21:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=296617581eac713b3fda588216ae6d16d1e76dd5'/>
<id>296617581eac713b3fda588216ae6d16d1e76dd5</id>
<content type='text'>
A queue is declared and get from the disk of the array, but it's not
used anywhere. So removing it from the source.

Signed-off-by: Lidong Zhong &lt;lzhong@suse.com&gt;
Acted-by: Guoqing Jiang &lt;gqjiang@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
A queue is declared and get from the disk of the array, but it's not
used anywhere. So removing it from the source.

Signed-off-by: Lidong Zhong &lt;lzhong@suse.com&gt;
Acted-by: Guoqing Jiang &lt;gqjiang@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md/raid1: factor out flush_bio_list()</title>
<updated>2017-04-11T17:12:36+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.com</email>
</author>
<published>2017-04-05T04:05:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=673ca68d93879b9ffbbed874c9e70ca6e37cab15'/>
<id>673ca68d93879b9ffbbed874c9e70ca6e37cab15</id>
<content type='text'>
flush_pending_writes() and raid1_unplug() each contain identical
copies of a fairly large slab of code.  So factor that out into
new flush_bio_list() to simplify maintenance.

Signed-off-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
flush_pending_writes() and raid1_unplug() each contain identical
copies of a fairly large slab of code.  So factor that out into
new flush_bio_list() to simplify maintenance.

Signed-off-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md/raid1: simplify handle_read_error().</title>
<updated>2017-04-11T17:10:20+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.com</email>
</author>
<published>2017-04-05T04:05:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=689389a06ce79fdced85b5115717f71c71e623e0'/>
<id>689389a06ce79fdced85b5115717f71c71e623e0</id>
<content type='text'>
handle_read_error() duplicates a lot of the work that raid1_read_request()
does, so it makes sense to just use that function.
This doesn't quite work as handle_read_error() relies on the same r1bio
being re-used so that, in the case of a read-only array, setting
IO_BLOCKED in r1bio-&gt;bios[] ensures read_balance() won't re-use
that device.
So we need to allow a r1bio to be passed to raid1_read_request(), and to
have that function mostly initialise the r1bio, but leave the bios[]
array untouched.

Two parts of handle_read_error() that need to be preserved are the warning
message it prints, so they are conditionally added to raid1_read_request().

Note that this highlights a minor bug on alloc_r1bio().  It doesn't
initalise the bios[] array, so it is possible that old content is there,
which might cause read_balance() to ignore some devices with no good reason.

With this change, we no longer need inc_pending(), or the sectors_handled
arg to alloc_r1bio().

As handle_read_error() is called from raid1d() and allocates memory,
there is tiny chance of a deadlock.  All element of various pools
could be queued waiting for raid1 to handle them, and there may be no
extra memory free.
Achieving guaranteed forward progress would probably require a second
thread and another mempool.  Instead of that complexity, add
__GFP_HIGH to any allocations when read1_read_request() is called
from raid1d.

Signed-off-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
handle_read_error() duplicates a lot of the work that raid1_read_request()
does, so it makes sense to just use that function.
This doesn't quite work as handle_read_error() relies on the same r1bio
being re-used so that, in the case of a read-only array, setting
IO_BLOCKED in r1bio-&gt;bios[] ensures read_balance() won't re-use
that device.
So we need to allow a r1bio to be passed to raid1_read_request(), and to
have that function mostly initialise the r1bio, but leave the bios[]
array untouched.

Two parts of handle_read_error() that need to be preserved are the warning
message it prints, so they are conditionally added to raid1_read_request().

Note that this highlights a minor bug on alloc_r1bio().  It doesn't
initalise the bios[] array, so it is possible that old content is there,
which might cause read_balance() to ignore some devices with no good reason.

With this change, we no longer need inc_pending(), or the sectors_handled
arg to alloc_r1bio().

As handle_read_error() is called from raid1d() and allocates memory,
there is tiny chance of a deadlock.  All element of various pools
could be queued waiting for raid1 to handle them, and there may be no
extra memory free.
Achieving guaranteed forward progress would probably require a second
thread and another mempool.  Instead of that complexity, add
__GFP_HIGH to any allocations when read1_read_request() is called
from raid1d.

Signed-off-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</pre>
</div>
</content>
</entry>
</feed>
