<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/drivers/md/raid0.c, branch linux-6.10.y</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>md/raid0: don't free conf on raid0_run failure</title>
<updated>2024-08-03T06:59:10+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2024-06-04T17:25:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=7526c725d51200aab2ee0ee8c44808873c965ddb'/>
<id>7526c725d51200aab2ee0ee8c44808873c965ddb</id>
<content type='text'>
[ Upstream commit 35f20acaa3585f25f8356da0ee6bc143e0256522 ]

The core md code calls the -&gt;free method when raid0_run fails, and that
method already frees conf, so freeing it in raid0_run as well results in
a double free.

Fixes: 0c031fd37f69 ("md: Move alloc/free acct bioset in to personality")
Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
Link: https://lore.kernel.org/r/20240604172607.3185916-2-hch@lst.de
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 35f20acaa3585f25f8356da0ee6bc143e0256522 ]

The core md code calls the -&gt;free method when raid0_run fails, and that
method already frees conf, so freeing it in raid0_run as well results in
a double free.

Fixes: 0c031fd37f69 ("md: Move alloc/free acct bioset in to personality")
Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
Link: https://lore.kernel.org/r/20240604172607.3185916-2-hch@lst.de
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md: remove mddev-&gt;queue</title>
<updated>2024-03-06T16:59:53+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2024-03-03T14:01:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=396799eb5b6f87ec2d759e1a90e179f7058ab9e6'/>
<id>396799eb5b6f87ec2d759e1a90e179f7058ab9e6</id>
<content type='text'>
Just use the request_queue from the gendisk pointer in the relatively
few places that still need it.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Song Liu &lt;song@kernel.org&gt;
Tested-by: Song Liu &lt;song@kernel.org&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
Link: https://lore.kernel.org/r/20240303140150.5435-11-hch@lst.de
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Just use the request_queue from the gendisk pointer in the relatively
few places that still need it.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Song Liu &lt;song@kernel.org&gt;
Tested-by: Song Liu &lt;song@kernel.org&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
Link: https://lore.kernel.org/r/20240303140150.5435-11-hch@lst.de
</pre>
</div>
</content>
</entry>
<entry>
<title>md/raid0: use the atomic queue limit update APIs</title>
<updated>2024-03-06T16:59:53+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2024-03-03T14:01:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=56cf22d6f672453f8d3392776c9317f217e68dca'/>
<id>56cf22d6f672453f8d3392776c9317f217e68dca</id>
<content type='text'>
Build the queue limits outside the queue and apply them using
queue_limits_set.  To make the code more obvious also split the queue
limits handling into a separate helper function.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Song Liu &lt;song@kernel.org&gt;
Tested-by: Song Liu &lt;song@kernel.org&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
Link: https://lore.kernel.org/r/20240303140150.5435-6-hch@lst.de
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Build the queue limits outside the queue and apply them using
queue_limits_set.  To make the code more obvious also split the queue
limits handling into a separate helper function.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Song Liu &lt;song@kernel.org&gt;
Tested-by: Song Liu &lt;song@kernel.org&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
Link: https://lore.kernel.org/r/20240303140150.5435-6-hch@lst.de
</pre>
</div>
</content>
</entry>
<entry>
<title>md: add a mddev_is_dm helper</title>
<updated>2024-03-06T16:59:53+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2024-03-03T14:01:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=176df894d7974166c65d0cce3b3b019678f9e698'/>
<id>176df894d7974166c65d0cce3b3b019678f9e698</id>
<content type='text'>
Add a helper to check for a DM-mapped MD device instead of using
the obfuscated -&gt;gendisk or -&gt;queue NULL checks.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Song Liu &lt;song@kernel.org&gt;
Tested-by: Song Liu &lt;song@kernel.org&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
Link: https://lore.kernel.org/r/20240303140150.5435-4-hch@lst.de
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Add a helper to check for a DM-mapped MD device instead of using
the obfuscated -&gt;gendisk or -&gt;queue NULL checks.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Song Liu &lt;song@kernel.org&gt;
Tested-by: Song Liu &lt;song@kernel.org&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
Link: https://lore.kernel.org/r/20240303140150.5435-4-hch@lst.de
</pre>
</div>
</content>
</entry>
<entry>
<title>md: add a mddev_trace_remap helper</title>
<updated>2024-03-06T16:59:52+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2024-03-03T14:01:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=c396b90e502691fc6ff7b43984cfd9d1b15aaa80'/>
<id>c396b90e502691fc6ff7b43984cfd9d1b15aaa80</id>
<content type='text'>
Add a helper to trace bio remapping that hides some argument
dereferences and the check for a DM-mapped MD device.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Song Liu &lt;song@kernel.org&gt;
Tested-by: Song Liu &lt;song@kernel.org&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
Link: https://lore.kernel.org/r/20240303140150.5435-2-hch@lst.de
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Add a helper to trace bio remapping that hides some argument
dereferences and the check for a DM-mapped MD device.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Song Liu &lt;song@kernel.org&gt;
Tested-by: Song Liu &lt;song@kernel.org&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
Link: https://lore.kernel.org/r/20240303140150.5435-2-hch@lst.de
</pre>
</div>
</content>
</entry>
<entry>
<title>md: raid0: account for split bio in iostat accounting</title>
<updated>2023-08-18T04:11:31+00:00</updated>
<author>
<name>David Jeffery</name>
<email>djeffery@redhat.com</email>
</author>
<published>2023-08-16T18:13:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=cc22b5407e9ca76adb7efeed843146510b1b72a5'/>
<id>cc22b5407e9ca76adb7efeed843146510b1b72a5</id>
<content type='text'>
When a bio is split by md raid0, the newly created bio will not be tracked
by md for I/O accounting. Only the portion of I/O still assigned to the
original bio which was reduced by the split will be accounted for. This
results in md iostat data sometimes showing I/O values far below the actual
amount of data being sent through md.

md_account_bio() needs to be called for all bio generated by the bio split.

A simple example of the issue was generated using a raid0 device built on
partitions of the same device. Since all raid0 I/O then goes to one device, it is
easy to see a gap between the md device and its sd storage. Reading an lvm
device on top of the md device, the iostat output (some 0 columns and extra
devices removed to make the data more compact) was:

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read
md2               0.00         0.00         0.00         0.00          0
sde               0.00         0.00         0.00         0.00          0
md2            1364.00    411496.00         0.00         0.00     411496
sde            1734.00    646144.00         0.00         0.00     646144
md2            1699.00    510680.00         0.00         0.00     510680
sde            2155.00    802784.00         0.00         0.00     802784
md2             803.00    241480.00         0.00         0.00     241480
sde            1016.00    377888.00         0.00         0.00     377888
md2               0.00         0.00         0.00         0.00          0
sde               0.00         0.00         0.00         0.00          0

I/O was generated doing large direct I/O reads (12M) with dd to a linear
lvm volume on top of the 4 leg raid0 device.

The md2 reads were showing as roughly 2/3 of the reads to the sde device
containing all of md2's raid partitions. The sum of reads to sde was
1826816 kB, which was the expected amount as it was the amount read by
dd. With the patch, the total reads from md will match the reads from
sde and be consistent with the amount of I/O generated.
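
As a rough cross-check of the figures above, the following Python sketch
sums the kB_read column for each device across the three non-idle samples.
The list names are only for illustration; the values are copied from the
iostat output, and nothing here is part of the patch itself.

```python
# kB_read per sample, copied from the iostat output above (idle samples omitted).
md2_kb_read = [411496, 510680, 241480]
sde_kb_read = [646144, 802784, 377888]

md2_total = sum(md2_kb_read)
sde_total = sum(sde_kb_read)

# Fraction of the I/O reaching sde that md accounted for before the fix.
ratio = md2_total / sde_total

print(md2_total, sde_total, round(ratio, 3))
```

The sde total comes to 1826816 kB, matching the amount read by dd, while
md2 accounts for only 1163656 kB, or about 0.64 of it.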

Fixes: 10764815ff47 ("md: add io accounting for raid0 and raid5")
Signed-off-by: David Jeffery &lt;djeffery@redhat.com&gt;
Tested-by: Laurence Oberman &lt;loberman@redhat.com&gt;
Reviewed-by: Laurence Oberman &lt;loberman@redhat.com&gt;
Reviewed-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
Link: https://lore.kernel.org/r/20230816181433.13289-1-djeffery@redhat.com
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When a bio is split by md raid0, the newly created bio will not be tracked
by md for I/O accounting. Only the portion of I/O still assigned to the
original bio which was reduced by the split will be accounted for. This
results in md iostat data sometimes showing I/O values far below the actual
amount of data being sent through md.

md_account_bio() needs to be called for all bio generated by the bio split.

A simple example of the issue was generated using a raid0 device built on
partitions of the same device. Since all raid0 I/O then goes to one device, it is
easy to see a gap between the md device and its sd storage. Reading an lvm
device on top of the md device, the iostat output (some 0 columns and extra
devices removed to make the data more compact) was:

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read
md2               0.00         0.00         0.00         0.00          0
sde               0.00         0.00         0.00         0.00          0
md2            1364.00    411496.00         0.00         0.00     411496
sde            1734.00    646144.00         0.00         0.00     646144
md2            1699.00    510680.00         0.00         0.00     510680
sde            2155.00    802784.00         0.00         0.00     802784
md2             803.00    241480.00         0.00         0.00     241480
sde            1016.00    377888.00         0.00         0.00     377888
md2               0.00         0.00         0.00         0.00          0
sde               0.00         0.00         0.00         0.00          0

I/O was generated doing large direct I/O reads (12M) with dd to a linear
lvm volume on top of the 4 leg raid0 device.

The md2 reads were showing as roughly 2/3 of the reads to the sde device
containing all of md2's raid partitions. The sum of reads to sde was
1826816 kB, which was the expected amount as it was the amount read by
dd. With the patch, the total reads from md will match the reads from
sde and be consistent with the amount of I/O generated.

Fixes: 10764815ff47 ("md: add io accounting for raid0 and raid5")
Signed-off-by: David Jeffery &lt;djeffery@redhat.com&gt;
Tested-by: Laurence Oberman &lt;loberman@redhat.com&gt;
Reviewed-by: Laurence Oberman &lt;loberman@redhat.com&gt;
Reviewed-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
Link: https://lore.kernel.org/r/20230816181433.13289-1-djeffery@redhat.com
</pre>
</div>
</content>
</entry>
<entry>
<title>md/raid0: Fix performance regression for large sequential writes</title>
<updated>2023-08-18T04:11:31+00:00</updated>
<author>
<name>Jan Kara</name>
<email>jack@suse.cz</email>
</author>
<published>2023-08-14T09:27:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=319ff40a542736d67e5bce18635de35d0e7a0bff'/>
<id>319ff40a542736d67e5bce18635de35d0e7a0bff</id>
<content type='text'>
Commit f00d7c85be9e ("md/raid0: fix up bio splitting.") among other
things changed how a bio that needs to be split is submitted. Before
that commit, we split the bio, then mapped and submitted each part.
After that commit, we map only the first part of the split bio and
submit the second part unmapped. Due to bio sorting in
__submit_bio_noacct() this results in the following request ordering:

  9,0   18     1181     0.525037895 15995  Q  WS 1479315464 + 63392

  Split off chunk-sized (1024 sectors) request:

  9,0   18     1182     0.629019647 15995  X  WS 1479315464 / 1479316488

  Request is unaligned to the chunk so it's split in
  raid0_make_request().  This is the first part mapped and punted to
  bio_list:

  8,0   18     7053     0.629020455 15995  A  WS 739921928 + 1016 &lt;- (9,0) 1479315464

  Now raid0_make_request() returns and the second part is postponed on
  bio_list. __submit_bio_noacct() re-sorts the bio_list, and the mapped
  request is submitted to the underlying device:

  8,0   18     7054     0.629022782 15995  G  WS 739921928 + 1016

  Now we take another request from the bio_list which is the remainder
  of the original huge request. Split off another chunk-sized bit from
  it and the situation repeats:

  9,0   18     1183     0.629024499 15995  X  WS 1479316488 / 1479317512
  8,16  18     6998     0.629025110 15995  A  WS 739921928 + 1016 &lt;- (9,0) 1479316488
  8,16  18     6999     0.629026728 15995  G  WS 739921928 + 1016
  ...
  9,0   18     1184     0.629032940 15995  X  WS 1479317512 / 1479318536 [libnetacq-write]
  8,0   18     7059     0.629033294 15995  A  WS 739922952 + 1016 &lt;- (9,0) 1479317512
  8,0   18     7060     0.629033902 15995  G  WS 739922952 + 1016
  ...

  This repeats until we consume the whole original huge request. Now we
  finally get to processing the second parts of the split off requests
  (in reverse order):

  8,16  18     7181     0.629161384 15995  A  WS 739952640 + 8 &lt;- (9,0) 1479377920
  8,0   18     7239     0.629162140 15995  A  WS 739952640 + 8 &lt;- (9,0) 1479376896
  8,16  18     7186     0.629163881 15995  A  WS 739951616 + 8 &lt;- (9,0) 1479375872
  8,0   18     7242     0.629164421 15995  A  WS 739951616 + 8 &lt;- (9,0) 1479374848
  ...

This IO pattern is obviously an extremely inefficient way to perform
sequential IO. It also makes bio_list grow rather long.

Change raid0_make_request() to map both parts of the split bio. Since we
know we are provided with at most chunk-sized bios, we will always need
to split the incoming bio at most once.
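
A minimal Python sketch of the split arithmetic, using the numbers from
the trace above. It assumes a chunk size of 1024 sectors and a bio that is
at most one chunk long, and it ignores zone lookup and device mapping.
split_at_chunk_boundary() is a hypothetical helper for illustration, not
the kernel function.

```python
CHUNK_SECTORS = 1024  # chunk size from the trace above


def split_at_chunk_boundary(sector, nr_sectors, chunk_sectors=CHUNK_SECTORS):
    """Return (start, length) pieces of a bio that is at most one chunk long.

    Hypothetical helper for illustration only; it does no device mapping.
    """
    # Sectors remaining before the next chunk boundary.
    to_boundary = chunk_sectors - (sector % chunk_sectors)
    first_len = min(nr_sectors, to_boundary)
    pieces = [(sector, first_len)]
    if first_len != nr_sectors:
        # The remainder starts on a chunk boundary and is shorter than a
        # chunk, so it never needs a second split.
        pieces.append((sector + first_len, nr_sectors - first_len))
    return pieces


# First chunk-sized request from the trace: starts 8 sectors into a chunk.
print(split_at_chunk_boundary(1479315464, 1024))
```

For the first chunk-sized request in the trace this yields a 1016-sector
part followed by an 8-sector part, which are the two sizes seen in the
remap (A) events.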

Fixes: f00d7c85be9e ("md/raid0: fix up bio splitting.")
Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Reviewed-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Link: https://lore.kernel.org/r/20230814092720.3931-2-jack@suse.cz
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Commit f00d7c85be9e ("md/raid0: fix up bio splitting.") among other
things changed how a bio that needs to be split is submitted. Before
that commit, we split the bio, then mapped and submitted each part.
After that commit, we map only the first part of the split bio and
submit the second part unmapped. Due to bio sorting in
__submit_bio_noacct() this results in the following request ordering:

  9,0   18     1181     0.525037895 15995  Q  WS 1479315464 + 63392

  Split off chunk-sized (1024 sectors) request:

  9,0   18     1182     0.629019647 15995  X  WS 1479315464 / 1479316488

  Request is unaligned to the chunk so it's split in
  raid0_make_request().  This is the first part mapped and punted to
  bio_list:

  8,0   18     7053     0.629020455 15995  A  WS 739921928 + 1016 &lt;- (9,0) 1479315464

  Now raid0_make_request() returns and the second part is postponed on
  bio_list. __submit_bio_noacct() re-sorts the bio_list, and the mapped
  request is submitted to the underlying device:

  8,0   18     7054     0.629022782 15995  G  WS 739921928 + 1016

  Now we take another request from the bio_list which is the remainder
  of the original huge request. Split off another chunk-sized bit from
  it and the situation repeats:

  9,0   18     1183     0.629024499 15995  X  WS 1479316488 / 1479317512
  8,16  18     6998     0.629025110 15995  A  WS 739921928 + 1016 &lt;- (9,0) 1479316488
  8,16  18     6999     0.629026728 15995  G  WS 739921928 + 1016
  ...
  9,0   18     1184     0.629032940 15995  X  WS 1479317512 / 1479318536 [libnetacq-write]
  8,0   18     7059     0.629033294 15995  A  WS 739922952 + 1016 &lt;- (9,0) 1479317512
  8,0   18     7060     0.629033902 15995  G  WS 739922952 + 1016
  ...

  This repeats until we consume the whole original huge request. Now we
  finally get to processing the second parts of the split off requests
  (in reverse order):

  8,16  18     7181     0.629161384 15995  A  WS 739952640 + 8 &lt;- (9,0) 1479377920
  8,0   18     7239     0.629162140 15995  A  WS 739952640 + 8 &lt;- (9,0) 1479376896
  8,16  18     7186     0.629163881 15995  A  WS 739951616 + 8 &lt;- (9,0) 1479375872
  8,0   18     7242     0.629164421 15995  A  WS 739951616 + 8 &lt;- (9,0) 1479374848
  ...

This IO pattern is obviously an extremely inefficient way to perform
sequential IO. It also makes bio_list grow rather long.

Change raid0_make_request() to map both parts of the split bio. Since we
know we are provided with at most chunk-sized bios, we will always need
to split the incoming bio at most once.

Fixes: f00d7c85be9e ("md/raid0: fix up bio splitting.")
Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Reviewed-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Link: https://lore.kernel.org/r/20230814092720.3931-2-jack@suse.cz
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md/raid0: Factor out helper for mapping and submitting a bio</title>
<updated>2023-08-18T04:11:31+00:00</updated>
<author>
<name>Jan Kara</name>
<email>jack@suse.cz</email>
</author>
<published>2023-08-14T09:27:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=af50e20afb401cc203bd2a9ff62ece0ae4976103'/>
<id>af50e20afb401cc203bd2a9ff62ece0ae4976103</id>
<content type='text'>
Factor out helper function for mapping and submitting a bio out of
raid0_make_request(). We will use it later for submitting both parts of
a split bio.

Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Reviewed-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Link: https://lore.kernel.org/r/20230814092720.3931-1-jack@suse.cz
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Factor out helper function for mapping and submitting a bio out of
raid0_make_request(). We will use it later for submitting both parts of
a split bio.

Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Reviewed-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Link: https://lore.kernel.org/r/20230814092720.3931-1-jack@suse.cz
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md: move initialization and destruction of 'io_acct_set' to md.c</title>
<updated>2023-07-27T07:13:29+00:00</updated>
<author>
<name>Yu Kuai</name>
<email>yukuai3@huawei.com</email>
</author>
<published>2023-06-21T16:51:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=c567c86b90d4715081adfe5eb812141a5b6b4883'/>
<id>c567c86b90d4715081adfe5eb812141a5b6b4883</id>
<content type='text'>
'io_acct_set' is currently only used for raid0 and raid456. Prepare to
use it for raid1 and raid10 as well, so that io accounting is consistent
across levels.

Follow-up patches will also use this io clone mechanism to make sure
'active_io' represents in-flight io, not io that is being dispatched,
so that mddev_suspend will wait for io to be done as designed.

Signed-off-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Reviewed-by: Xiao Ni &lt;xni@redhat.com&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
Link: https://lore.kernel.org/r/20230621165110.1498313-2-yukuai1@huaweicloud.com
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
'io_acct_set' is currently only used for raid0 and raid456. Prepare to
use it for raid1 and raid10 as well, so that io accounting is consistent
across levels.

Follow-up patches will also use this io clone mechanism to make sure
'active_io' represents in-flight io, not io that is being dispatched,
so that mddev_suspend will wait for io to be done as designed.

Signed-off-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Reviewed-by: Xiao Ni &lt;xni@redhat.com&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
Link: https://lore.kernel.org/r/20230621165110.1498313-2-yukuai1@huaweicloud.com
</pre>
</div>
</content>
</entry>
<entry>
<title>md/raid0: add discard support for the 'original' layout</title>
<updated>2023-06-30T22:43:50+00:00</updated>
<author>
<name>Jason Baron</name>
<email>jbaron@akamai.com</email>
</author>
<published>2023-06-23T18:05:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=e836007089ba8fdf24e636ef2b007651fb4582e6'/>
<id>e836007089ba8fdf24e636ef2b007651fb4582e6</id>
<content type='text'>
We've found that using raid0 with the 'original' layout and discard
enabled with different disk sizes (such that at least two zones are
created) can result in data corruption. This is due to the fact that
the discard handling in 'raid0_handle_discard()' assumes the 'alternate'
layout. We've seen this corruption using ext4 but other filesystems are
likely susceptible as well.

More specifically, while multiple zones are necessary to create the
corruption, the corruption may not occur with multiple zones if they
are laid out in such a way that the layout matches what the 'alternate'
layout would have produced. Thus, not all raid0 devices with the
'original' layout, different size disks and discard enabled will
encounter this corruption.

The 3.14 kernel inadvertently changed the raid0 disk layout for different
size disks. Thus, running a pre-3.14 kernel and post-3.14 kernel on the
same raid0 array could corrupt data. This led to the creation of the
'original' layout (to match the pre-3.14 layout) and the 'alternate' layout
(to match the post-3.14 layout) in the 5.4 kernel time frame and an option
to tell the kernel which layout to use (since it couldn't be autodetected).
However, when the 'original' layout was added back in 5.4, discard support
for the 'original' layout was not added, leading to this issue.

I've been able to reliably reproduce the corruption with the following
test case:

1. create raid0 array with different size disks using original layout
2. mkfs
3. mount -o discard
4. create lots of files
5. remove 1/2 the files
6. fstrim -a (or just the mount point for the raid0 array)
7. umount
8. fsck -fn /dev/md0 (spews all sorts of corruptions)

Let's fix this by adding proper discard support to the 'original' layout.
The fix 'maps' the 'original' layout disks to the order in which they are
read/written such that we can compare the disks in the same way that the
current 'alternate' layout does. A 'disk_shift' field is added to
'struct strip_zone'. This could be computed on the fly in
raid0_handle_discard() but by adding this field, we save some computation
in the discard path.

Note we could also potentially fix this by re-ordering the disks in the
zones that follow the first one, and then always read/writing them using
the 'alternate' layout. However, that is seen as a more substantial change,
and we are attempting the least invasive fix at this time to remedy the
corruption.

I've verified the change using the reproducer mentioned above. Typically,
the corruption is seen after less than 3 iterations, while the patch has
run 500+ iterations.

Cc: NeilBrown &lt;neilb@suse.de&gt;
Cc: Song Liu &lt;song@kernel.org&gt;
Fixes: c84a1372df92 ("md/raid0: avoid RAID0 data corruption due to layout confusion.")
Cc: stable@vger.kernel.org
Signed-off-by: Jason Baron &lt;jbaron@akamai.com&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
Link: https://lore.kernel.org/r/20230623180523.1901230-1-jbaron@akamai.com
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We've found that using raid0 with the 'original' layout and discard
enabled with different disk sizes (such that at least two zones are
created) can result in data corruption. This is due to the fact that
the discard handling in 'raid0_handle_discard()' assumes the 'alternate'
layout. We've seen this corruption using ext4 but other filesystems are
likely susceptible as well.

More specifically, while multiple zones are necessary to create the
corruption, the corruption may not occur with multiple zones if they
are laid out in such a way that the layout matches what the 'alternate'
layout would have produced. Thus, not all raid0 devices with the
'original' layout, different size disks and discard enabled will
encounter this corruption.

The 3.14 kernel inadvertently changed the raid0 disk layout for different
size disks. Thus, running a pre-3.14 kernel and post-3.14 kernel on the
same raid0 array could corrupt data. This led to the creation of the
'original' layout (to match the pre-3.14 layout) and the 'alternate' layout
(to match the post-3.14 layout) in the 5.4 kernel time frame and an option
to tell the kernel which layout to use (since it couldn't be autodetected).
However, when the 'original' layout was added back in 5.4, discard support
for the 'original' layout was not added, leading to this issue.

I've been able to reliably reproduce the corruption with the following
test case:

1. create raid0 array with different size disks using original layout
2. mkfs
3. mount -o discard
4. create lots of files
5. remove 1/2 the files
6. fstrim -a (or just the mount point for the raid0 array)
7. umount
8. fsck -fn /dev/md0 (spews all sorts of corruptions)

Let's fix this by adding proper discard support to the 'original' layout.
The fix 'maps' the 'original' layout disks to the order in which they are
read/written such that we can compare the disks in the same way that the
current 'alternate' layout does. A 'disk_shift' field is added to
'struct strip_zone'. This could be computed on the fly in
raid0_handle_discard() but by adding this field, we save some computation
in the discard path.

Note we could also potentially fix this by re-ordering the disks in the
zones that follow the first one, and then always read/writing them using
the 'alternate' layout. However, that is seen as a more substantial change,
and we are attempting the least invasive fix at this time to remedy the
corruption.

I've verified the change using the reproducer mentioned above. Typically,
the corruption is seen after less than 3 iterations, while the patch has
run 500+ iterations.

Cc: NeilBrown &lt;neilb@suse.de&gt;
Cc: Song Liu &lt;song@kernel.org&gt;
Fixes: c84a1372df92 ("md/raid0: avoid RAID0 data corruption due to layout confusion.")
Cc: stable@vger.kernel.org
Signed-off-by: Jason Baron &lt;jbaron@akamai.com&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
Link: https://lore.kernel.org/r/20230623180523.1901230-1-jbaron@akamai.com
</pre>
</div>
</content>
</entry>
</feed>
