<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/drivers/md/md.c, branch v5.13</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>md-cluster: fix use-after-free issue when removing rdev</title>
<updated>2021-04-23T16:39:04+00:00</updated>
<author>
<name>Heming Zhao</name>
<email>heming.zhao@suse.com</email>
</author>
<published>2021-04-08T07:44:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=f7c7a2f9a23e5b6e0f5251f29648d0238bb7757e'/>
<id>f7c7a2f9a23e5b6e0f5251f29648d0238bb7757e</id>
<content type='text'>
md_kick_rdev_from_array will remove rdev, so we should
use rdev_for_each_safe to search list.

How to trigger:

env: Two nodes on kvm-qemu x86_64 VMs (2C2G with 2 iscsi luns).

```
node2=192.168.0.3

for i in {1..20}; do
    echo ==== $i `date` ====;

    mdadm -Ss &amp;&amp; ssh ${node2} "mdadm -Ss"
    wipefs -a /dev/sda /dev/sdb

    mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l 1 /dev/sda \
       /dev/sdb --assume-clean
    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
    mdadm --wait /dev/md0
    ssh ${node2} "mdadm --wait /dev/md0"

    mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
    sleep 1
done
```

Crash stack:

```
stack segment: 0000 [#1] SMP
... ...
RIP: 0010:md_check_recovery+0x1e8/0x570 [md_mod]
... ...
RSP: 0018:ffffb149807a7d68 EFLAGS: 00010207
RAX: 0000000000000000 RBX: ffff9d494c180800 RCX: ffff9d490fc01e50
RDX: fffff047c0ed8308 RSI: 0000000000000246 RDI: 0000000000000246
RBP: 6b6b6b6b6b6b6b6b R08: ffff9d490fc01e40 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
R13: ffff9d494c180818 R14: ffff9d493399ef38 R15: ffff9d4933a1d800
FS:  0000000000000000(0000) GS:ffff9d494f700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe68cab9010 CR3: 000000004c6be001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 raid1d+0x5c/0xd40 [raid1]
 ? finish_task_switch+0x75/0x2a0
 ? lock_timer_base+0x67/0x80
 ? try_to_del_timer_sync+0x4d/0x80
 ? del_timer_sync+0x41/0x50
 ? schedule_timeout+0x254/0x2d0
 ? md_start_sync+0xe0/0xe0 [md_mod]
 ? md_thread+0x127/0x160 [md_mod]
 md_thread+0x127/0x160 [md_mod]
 ? wait_woken+0x80/0x80
 kthread+0x10d/0x130
 ? kthread_park+0xa0/0xa0
 ret_from_fork+0x1f/0x40
```

Fixes: dbb64f8635f5d ("md-cluster: Fix adding of new disk with new reload code")
Fixes: 659b254fa7392 ("md-cluster: remove a disk asynchronously from cluster environment")
Cc: stable@vger.kernel.org
Reviewed-by: Gang He &lt;ghe@suse.com&gt;
Signed-off-by: Heming Zhao &lt;heming.zhao@suse.com&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
md_kick_rdev_from_array will remove rdev, so we should
use rdev_for_each_safe to search list.

How to trigger:

env: Two nodes on kvm-qemu x86_64 VMs (2C2G with 2 iscsi luns).

```
node2=192.168.0.3

for i in {1..20}; do
    echo ==== $i `date` ====;

    mdadm -Ss &amp;&amp; ssh ${node2} "mdadm -Ss"
    wipefs -a /dev/sda /dev/sdb

    mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l 1 /dev/sda \
       /dev/sdb --assume-clean
    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
    mdadm --wait /dev/md0
    ssh ${node2} "mdadm --wait /dev/md0"

    mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
    sleep 1
done
```

Crash stack:

```
stack segment: 0000 [#1] SMP
... ...
RIP: 0010:md_check_recovery+0x1e8/0x570 [md_mod]
... ...
RSP: 0018:ffffb149807a7d68 EFLAGS: 00010207
RAX: 0000000000000000 RBX: ffff9d494c180800 RCX: ffff9d490fc01e50
RDX: fffff047c0ed8308 RSI: 0000000000000246 RDI: 0000000000000246
RBP: 6b6b6b6b6b6b6b6b R08: ffff9d490fc01e40 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
R13: ffff9d494c180818 R14: ffff9d493399ef38 R15: ffff9d4933a1d800
FS:  0000000000000000(0000) GS:ffff9d494f700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe68cab9010 CR3: 000000004c6be001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 raid1d+0x5c/0xd40 [raid1]
 ? finish_task_switch+0x75/0x2a0
 ? lock_timer_base+0x67/0x80
 ? try_to_del_timer_sync+0x4d/0x80
 ? del_timer_sync+0x41/0x50
 ? schedule_timeout+0x254/0x2d0
 ? md_start_sync+0xe0/0xe0 [md_mod]
 ? md_thread+0x127/0x160 [md_mod]
 md_thread+0x127/0x160 [md_mod]
 ? wait_woken+0x80/0x80
 kthread+0x10d/0x130
 ? kthread_park+0xa0/0xa0
 ret_from_fork+0x1f/0x40
```

Fixes: dbb64f8635f5d ("md-cluster: Fix adding of new disk with new reload code")
Fixes: 659b254fa7392 ("md-cluster: remove a disk asynchronously from cluster environment")
Cc: stable@vger.kernel.org
Reviewed-by: Gang He &lt;ghe@suse.com&gt;
Signed-off-by: Heming Zhao &lt;heming.zhao@suse.com&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md: do not return existing mddevs from mddev_find_or_alloc</title>
<updated>2021-04-15T18:06:32+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2021-04-12T08:05:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=0d809b3837a0bede8f58a67e303e339585777bf4'/>
<id>0d809b3837a0bede8f58a67e303e339585777bf4</id>
<content type='text'>
Instead of returning an existing mddev, just for it to be discarded
later directly return -EEXIST.  Rename the function to mddev_alloc now
that it doesn't find an existing mddev.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Instead of returning an existing mddev, just for it to be discarded
later directly return -EEXIST.  Rename the function to mddev_alloc now
that it doesn't find an existing mddev.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md: refactor mddev_find_or_alloc</title>
<updated>2021-04-15T18:06:32+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2021-04-12T08:05:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=d144fe6ff176d79efd411e520103a99e11874c36'/>
<id>d144fe6ff176d79efd411e520103a99e11874c36</id>
<content type='text'>
Allocate the new mddev first speculatively, which greatly simplifies
the code flow.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Allocate the new mddev first speculatively, which greatly simplifies
the code flow.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md: factor out a mddev_alloc_unit helper from mddev_find</title>
<updated>2021-04-15T18:06:32+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2021-04-12T08:05:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=85c8c3c1f8d9e31f626c93435dd91c2f85603e07'/>
<id>85c8c3c1f8d9e31f626c93435dd91c2f85603e07</id>
<content type='text'>
Split out a self contained helper to find a free minor for the md
"unit" number.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Split out a self contained helper to find a free minor for the md
"unit" number.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md: split mddev_find</title>
<updated>2021-04-08T05:41:26+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2021-04-03T16:15:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=65aa97c4d2bfd76677c211b9d03ef05a98c6d68e'/>
<id>65aa97c4d2bfd76677c211b9d03ef05a98c6d68e</id>
<content type='text'>
Split mddev_find into a simple mddev_find that just finds an existing
mddev by the unit number, and a more complicated mddev_find that deals
with find or allocating a mddev.

This turns out to fix this bug reported by Zhao Heming.

----------------------------- snip ------------------------------
commit d3374825ce57 ("md: make devices disappear when they are no longer
needed.") introduced protection between mddev creating &amp; removing. The
md_open shouldn't create mddev when all_mddevs list doesn't contain
mddev. With currently code logic, there will be very easy to trigger
soft lockup in non-preempt env.

*** env ***
kvm-qemu VM 2C1G with 2 iscsi luns
kernel should be non-preempt

*** script ***

about trigger 1 time with 10 tests

`1  node1="15sp3-mdcluster1"
2  node2="15sp3-mdcluster2"
3
4  mdadm -Ss
5  ssh ${node2} "mdadm -Ss"
6  wipefs -a /dev/sda /dev/sdb
7  mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
   /dev/sdb --assume-clean
8
9  for i in {1..100}; do
10    echo ==== $i ====;
11
12    echo "test  ...."
13    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
14    sleep 1
15
16    echo "clean  ....."
17    ssh ${node2} "mdadm -Ss"
18 done
`
I use mdcluster env to trigger soft lockup, but it isn't mdcluster
speical bug. To stop md array in mdcluster env will do more jobs than
non-cluster array, which will leave enough time/gap to allow kernel to
run md_open.

*** stack ***

`ID: 2831   TASK: ffff8dd7223b5040  CPU: 0   COMMAND: "mdadm"
 #0 [ffffa15d00a13b90] __schedule at ffffffffb8f1935f
 #1 [ffffa15d00a13ba8] exact_lock at ffffffffb8a4a66d
 #2 [ffffa15d00a13bb0] kobj_lookup at ffffffffb8c62fe3
 #3 [ffffa15d00a13c28] __blkdev_get at ffffffffb89273b9
 #4 [ffffa15d00a13c98] blkdev_get at ffffffffb8927964
 #5 [ffffa15d00a13cb0] do_dentry_open at ffffffffb88dc4b4
 #6 [ffffa15d00a13ce0] path_openat at ffffffffb88f0ccc
 #7 [ffffa15d00a13db8] do_filp_open at ffffffffb88f32bb
 #8 [ffffa15d00a13ee0] do_sys_open at ffffffffb88ddc7d
 #9 [ffffa15d00a13f38] do_syscall_64 at ffffffffb86053cb ffffffffb900008c

or:
[  884.226509]  mddev_put+0x1c/0xe0 [md_mod]
[  884.226515]  md_open+0x3c/0xe0 [md_mod]
[  884.226518]  __blkdev_get+0x30d/0x710
[  884.226520]  ? bd_acquire+0xd0/0xd0
[  884.226522]  blkdev_get+0x14/0x30
[  884.226524]  do_dentry_open+0x204/0x3a0
[  884.226531]  path_openat+0x2fc/0x1520
[  884.226534]  ? seq_printf+0x4e/0x70
[  884.226536]  do_filp_open+0x9b/0x110
[  884.226542]  ? md_release+0x20/0x20 [md_mod]
[  884.226543]  ? seq_read+0x1d8/0x3e0
[  884.226545]  ? kmem_cache_alloc+0x18a/0x270
[  884.226547]  ? do_sys_open+0x1bd/0x260
[  884.226548]  do_sys_open+0x1bd/0x260
[  884.226551]  do_syscall_64+0x5b/0x1e0
[  884.226554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
`
*** rootcause ***

"mdadm -A" (or other array assemble commands) will start a daemon "mdadm
--monitor" by default. When "mdadm -Ss" is running, the stop action will
wakeup "mdadm --monitor". The "--monitor" daemon will immediately get
info from /proc/mdstat. This time mddev in kernel still exist, so
/proc/mdstat still show md device, which makes "mdadm --monitor" to open
/dev/md0.

The previously "mdadm -Ss" is removing action, the "mdadm --monitor"
open action will trigger md_open which is creating action. Racing is
happening.

`&lt;thread 1&gt;: "mdadm -Ss"
md_release
  mddev_put deletes mddev from all_mddevs
  queue_work for mddev_delayed_delete
  at this time, "/dev/md0" is still available for opening

&lt;thread 2&gt;: "mdadm --monitor ..."
md_open
 + mddev_find can't find mddev of /dev/md0, and create a new mddev and
 |    return.
 + trigger "if (mddev-&gt;gendisk != bdev-&gt;bd_disk)" and return
      -ERESTARTSYS.
`
In non-preempt kernel, &lt;thread 2&gt; is occupying on current CPU. and
mddev_delayed_delete which was created in &lt;thread 1&gt; also can't be
schedule.

In preempt kernel, it can also trigger above racing. But kernel doesn't
allow one thread running on a CPU all the time. after &lt;thread 2&gt; running
some time, the later "mdadm -A" (refer above script line 13) will call
md_alloc to alloc a new gendisk for mddev. it will break md_open
statement "if (mddev-&gt;gendisk != bdev-&gt;bd_disk)" and return 0 to caller,
the soft lockup is broken.
------------------------------ snip ------------------------------

Cc: stable@vger.kernel.org
Fixes: d3374825ce57 ("md: make devices disappear when they are no longer needed.")
Reported-by: Heming Zhao &lt;heming.zhao@suse.com&gt;
Reviewed-by: Heming Zhao &lt;heming.zhao@suse.com&gt;
Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Split mddev_find into a simple mddev_find that just finds an existing
mddev by the unit number, and a more complicated mddev_find that deals
with find or allocating a mddev.

This turns out to fix this bug reported by Zhao Heming.

----------------------------- snip ------------------------------
commit d3374825ce57 ("md: make devices disappear when they are no longer
needed.") introduced protection between mddev creating &amp; removing. The
md_open shouldn't create mddev when all_mddevs list doesn't contain
mddev. With currently code logic, there will be very easy to trigger
soft lockup in non-preempt env.

*** env ***
kvm-qemu VM 2C1G with 2 iscsi luns
kernel should be non-preempt

*** script ***

about trigger 1 time with 10 tests

`1  node1="15sp3-mdcluster1"
2  node2="15sp3-mdcluster2"
3
4  mdadm -Ss
5  ssh ${node2} "mdadm -Ss"
6  wipefs -a /dev/sda /dev/sdb
7  mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
   /dev/sdb --assume-clean
8
9  for i in {1..100}; do
10    echo ==== $i ====;
11
12    echo "test  ...."
13    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
14    sleep 1
15
16    echo "clean  ....."
17    ssh ${node2} "mdadm -Ss"
18 done
`
I use mdcluster env to trigger soft lockup, but it isn't mdcluster
speical bug. To stop md array in mdcluster env will do more jobs than
non-cluster array, which will leave enough time/gap to allow kernel to
run md_open.

*** stack ***

`ID: 2831   TASK: ffff8dd7223b5040  CPU: 0   COMMAND: "mdadm"
 #0 [ffffa15d00a13b90] __schedule at ffffffffb8f1935f
 #1 [ffffa15d00a13ba8] exact_lock at ffffffffb8a4a66d
 #2 [ffffa15d00a13bb0] kobj_lookup at ffffffffb8c62fe3
 #3 [ffffa15d00a13c28] __blkdev_get at ffffffffb89273b9
 #4 [ffffa15d00a13c98] blkdev_get at ffffffffb8927964
 #5 [ffffa15d00a13cb0] do_dentry_open at ffffffffb88dc4b4
 #6 [ffffa15d00a13ce0] path_openat at ffffffffb88f0ccc
 #7 [ffffa15d00a13db8] do_filp_open at ffffffffb88f32bb
 #8 [ffffa15d00a13ee0] do_sys_open at ffffffffb88ddc7d
 #9 [ffffa15d00a13f38] do_syscall_64 at ffffffffb86053cb ffffffffb900008c

or:
[  884.226509]  mddev_put+0x1c/0xe0 [md_mod]
[  884.226515]  md_open+0x3c/0xe0 [md_mod]
[  884.226518]  __blkdev_get+0x30d/0x710
[  884.226520]  ? bd_acquire+0xd0/0xd0
[  884.226522]  blkdev_get+0x14/0x30
[  884.226524]  do_dentry_open+0x204/0x3a0
[  884.226531]  path_openat+0x2fc/0x1520
[  884.226534]  ? seq_printf+0x4e/0x70
[  884.226536]  do_filp_open+0x9b/0x110
[  884.226542]  ? md_release+0x20/0x20 [md_mod]
[  884.226543]  ? seq_read+0x1d8/0x3e0
[  884.226545]  ? kmem_cache_alloc+0x18a/0x270
[  884.226547]  ? do_sys_open+0x1bd/0x260
[  884.226548]  do_sys_open+0x1bd/0x260
[  884.226551]  do_syscall_64+0x5b/0x1e0
[  884.226554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
`
*** rootcause ***

"mdadm -A" (or other array assemble commands) will start a daemon "mdadm
--monitor" by default. When "mdadm -Ss" is running, the stop action will
wakeup "mdadm --monitor". The "--monitor" daemon will immediately get
info from /proc/mdstat. This time mddev in kernel still exist, so
/proc/mdstat still show md device, which makes "mdadm --monitor" to open
/dev/md0.

The previously "mdadm -Ss" is removing action, the "mdadm --monitor"
open action will trigger md_open which is creating action. Racing is
happening.

`&lt;thread 1&gt;: "mdadm -Ss"
md_release
  mddev_put deletes mddev from all_mddevs
  queue_work for mddev_delayed_delete
  at this time, "/dev/md0" is still available for opening

&lt;thread 2&gt;: "mdadm --monitor ..."
md_open
 + mddev_find can't find mddev of /dev/md0, and create a new mddev and
 |    return.
 + trigger "if (mddev-&gt;gendisk != bdev-&gt;bd_disk)" and return
      -ERESTARTSYS.
`
In non-preempt kernel, &lt;thread 2&gt; is occupying on current CPU. and
mddev_delayed_delete which was created in &lt;thread 1&gt; also can't be
schedule.

In preempt kernel, it can also trigger above racing. But kernel doesn't
allow one thread running on a CPU all the time. after &lt;thread 2&gt; running
some time, the later "mdadm -A" (refer above script line 13) will call
md_alloc to alloc a new gendisk for mddev. it will break md_open
statement "if (mddev-&gt;gendisk != bdev-&gt;bd_disk)" and return 0 to caller,
the soft lockup is broken.
------------------------------ snip ------------------------------

Cc: stable@vger.kernel.org
Fixes: d3374825ce57 ("md: make devices disappear when they are no longer needed.")
Reported-by: Heming Zhao &lt;heming.zhao@suse.com&gt;
Reviewed-by: Heming Zhao &lt;heming.zhao@suse.com&gt;
Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md: factor out a mddev_find_locked helper from mddev_find</title>
<updated>2021-04-08T05:41:26+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2021-04-03T16:15:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=8b57251f9a91f5e5a599de7549915d2d226cc3af'/>
<id>8b57251f9a91f5e5a599de7549915d2d226cc3af</id>
<content type='text'>
Factor out a self-contained helper to just lookup a mddev by the dev_t
"unit".

Cc: stable@vger.kernel.org
Reviewed-by: Heming Zhao &lt;heming.zhao@suse.com&gt;
Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Factor out a self-contained helper to just lookup a mddev by the dev_t
"unit".

Cc: stable@vger.kernel.org
Reviewed-by: Heming Zhao &lt;heming.zhao@suse.com&gt;
Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md: md_open returns -EBUSY when entering racing area</title>
<updated>2021-04-08T05:41:26+00:00</updated>
<author>
<name>Zhao Heming</name>
<email>heming.zhao@suse.com</email>
</author>
<published>2021-04-03T03:01:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=6a4db2a60306eb65bfb14ccc9fde035b74a4b4e7'/>
<id>6a4db2a60306eb65bfb14ccc9fde035b74a4b4e7</id>
<content type='text'>
commit d3374825ce57 ("md: make devices disappear when they are no longer
needed.") introduced protection between mddev creating &amp; removing. The
md_open shouldn't create mddev when all_mddevs list doesn't contain
mddev. With currently code logic, there will be very easy to trigger
soft lockup in non-preempt env.

This patch changes md_open returning from -ERESTARTSYS to -EBUSY, which
will break the infinitely retry when md_open enter racing area.

This patch is partly fix soft lockup issue, full fix needs mddev_find
is split into two functions: mddev_find &amp; mddev_find_or_alloc. And
md_open should call new mddev_find (it only does searching job).

For more detail, please refer with Christoph's "split mddev_find" patch
in later commits.

*** env ***
kvm-qemu VM 2C1G with 2 iscsi luns
kernel should be non-preempt

*** script ***

about trigger every time with below script

```
1  node1="mdcluster1"
2  node2="mdcluster2"
3
4  mdadm -Ss
5  ssh ${node2} "mdadm -Ss"
6  wipefs -a /dev/sda /dev/sdb
7  mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
   /dev/sdb --assume-clean
8
9  for i in {1..10}; do
10    echo ==== $i ====;
11
12    echo "test  ...."
13    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
14    sleep 1
15
16    echo "clean  ....."
17    ssh ${node2} "mdadm -Ss"
18 done
```

I use mdcluster env to trigger soft lockup, but it isn't mdcluster
speical bug. To stop md array in mdcluster env will do more jobs than
non-cluster array, which will leave enough time/gap to allow kernel to
run md_open.

*** stack ***

```
[  884.226509]  mddev_put+0x1c/0xe0 [md_mod]
[  884.226515]  md_open+0x3c/0xe0 [md_mod]
[  884.226518]  __blkdev_get+0x30d/0x710
[  884.226520]  ? bd_acquire+0xd0/0xd0
[  884.226522]  blkdev_get+0x14/0x30
[  884.226524]  do_dentry_open+0x204/0x3a0
[  884.226531]  path_openat+0x2fc/0x1520
[  884.226534]  ? seq_printf+0x4e/0x70
[  884.226536]  do_filp_open+0x9b/0x110
[  884.226542]  ? md_release+0x20/0x20 [md_mod]
[  884.226543]  ? seq_read+0x1d8/0x3e0
[  884.226545]  ? kmem_cache_alloc+0x18a/0x270
[  884.226547]  ? do_sys_open+0x1bd/0x260
[  884.226548]  do_sys_open+0x1bd/0x260
[  884.226551]  do_syscall_64+0x5b/0x1e0
[  884.226554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
```

*** rootcause ***

"mdadm -A" (or other array assemble commands) will start a daemon "mdadm
--monitor" by default. When "mdadm -Ss" is running, the stop action will
wakeup "mdadm --monitor". The "--monitor" daemon will immediately get
info from /proc/mdstat. This time mddev in kernel still exist, so
/proc/mdstat still show md device, which makes "mdadm --monitor" to open
/dev/md0.

The previously "mdadm -Ss" is removing action, the "mdadm --monitor"
open action will trigger md_open which is creating action. Racing is
happening.

```
&lt;thread 1&gt;: "mdadm -Ss"
md_release
  mddev_put deletes mddev from all_mddevs
  queue_work for mddev_delayed_delete
  at this time, "/dev/md0" is still available for opening

&lt;thread 2&gt;: "mdadm --monitor ..."
md_open
 + mddev_find can't find mddev of /dev/md0, and create a new mddev and
 |    return.
 + trigger "if (mddev-&gt;gendisk != bdev-&gt;bd_disk)" and return
      -ERESTARTSYS.
```

In non-preempt kernel, &lt;thread 2&gt; is occupying on current CPU. and
mddev_delayed_delete which was created in &lt;thread 1&gt; also can't be
schedule.

In preempt kernel, it can also trigger above racing. But kernel doesn't
allow one thread running on a CPU all the time. after &lt;thread 2&gt; running
some time, the later "mdadm -A" (refer above script line 13) will call
md_alloc to alloc a new gendisk for mddev. it will break md_open
statement "if (mddev-&gt;gendisk != bdev-&gt;bd_disk)" and return 0 to caller,
the soft lockup is broken.

Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Zhao Heming &lt;heming.zhao@suse.com&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit d3374825ce57 ("md: make devices disappear when they are no longer
needed.") introduced protection between mddev creating &amp; removing. The
md_open shouldn't create mddev when all_mddevs list doesn't contain
mddev. With currently code logic, there will be very easy to trigger
soft lockup in non-preempt env.

This patch changes md_open returning from -ERESTARTSYS to -EBUSY, which
will break the infinitely retry when md_open enter racing area.

This patch is partly fix soft lockup issue, full fix needs mddev_find
is split into two functions: mddev_find &amp; mddev_find_or_alloc. And
md_open should call new mddev_find (it only does searching job).

For more detail, please refer with Christoph's "split mddev_find" patch
in later commits.

*** env ***
kvm-qemu VM 2C1G with 2 iscsi luns
kernel should be non-preempt

*** script ***

about trigger every time with below script

```
1  node1="mdcluster1"
2  node2="mdcluster2"
3
4  mdadm -Ss
5  ssh ${node2} "mdadm -Ss"
6  wipefs -a /dev/sda /dev/sdb
7  mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
   /dev/sdb --assume-clean
8
9  for i in {1..10}; do
10    echo ==== $i ====;
11
12    echo "test  ...."
13    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
14    sleep 1
15
16    echo "clean  ....."
17    ssh ${node2} "mdadm -Ss"
18 done
```

I use mdcluster env to trigger soft lockup, but it isn't mdcluster
speical bug. To stop md array in mdcluster env will do more jobs than
non-cluster array, which will leave enough time/gap to allow kernel to
run md_open.

*** stack ***

```
[  884.226509]  mddev_put+0x1c/0xe0 [md_mod]
[  884.226515]  md_open+0x3c/0xe0 [md_mod]
[  884.226518]  __blkdev_get+0x30d/0x710
[  884.226520]  ? bd_acquire+0xd0/0xd0
[  884.226522]  blkdev_get+0x14/0x30
[  884.226524]  do_dentry_open+0x204/0x3a0
[  884.226531]  path_openat+0x2fc/0x1520
[  884.226534]  ? seq_printf+0x4e/0x70
[  884.226536]  do_filp_open+0x9b/0x110
[  884.226542]  ? md_release+0x20/0x20 [md_mod]
[  884.226543]  ? seq_read+0x1d8/0x3e0
[  884.226545]  ? kmem_cache_alloc+0x18a/0x270
[  884.226547]  ? do_sys_open+0x1bd/0x260
[  884.226548]  do_sys_open+0x1bd/0x260
[  884.226551]  do_syscall_64+0x5b/0x1e0
[  884.226554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
```

*** rootcause ***

"mdadm -A" (or other array assemble commands) will start a daemon "mdadm
--monitor" by default. When "mdadm -Ss" is running, the stop action will
wakeup "mdadm --monitor". The "--monitor" daemon will immediately get
info from /proc/mdstat. This time mddev in kernel still exist, so
/proc/mdstat still show md device, which makes "mdadm --monitor" to open
/dev/md0.

The previously "mdadm -Ss" is removing action, the "mdadm --monitor"
open action will trigger md_open which is creating action. Racing is
happening.

```
&lt;thread 1&gt;: "mdadm -Ss"
md_release
  mddev_put deletes mddev from all_mddevs
  queue_work for mddev_delayed_delete
  at this time, "/dev/md0" is still available for opening

&lt;thread 2&gt;: "mdadm --monitor ..."
md_open
 + mddev_find can't find mddev of /dev/md0, and create a new mddev and
 |    return.
 + trigger "if (mddev-&gt;gendisk != bdev-&gt;bd_disk)" and return
      -ERESTARTSYS.
```

In non-preempt kernel, &lt;thread 2&gt; is occupying on current CPU. and
mddev_delayed_delete which was created in &lt;thread 1&gt; also can't be
schedule.

In preempt kernel, it can also trigger above racing. But kernel doesn't
allow one thread running on a CPU all the time. after &lt;thread 2&gt; running
some time, the later "mdadm -A" (refer above script line 13) will call
md_alloc to alloc a new gendisk for mddev. it will break md_open
statement "if (mddev-&gt;gendisk != bdev-&gt;bd_disk)" and return 0 to caller,
the soft lockup is broken.

Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Zhao Heming &lt;heming.zhao@suse.com&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md: Fix missing unused status line of /proc/mdstat</title>
<updated>2021-03-24T23:29:07+00:00</updated>
<author>
<name>Jan Glauber</name>
<email>jglauber@digitalocean.com</email>
</author>
<published>2021-03-17T14:04:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=7abfabaf5f805f5171d133ce6af9b65ab766e76a'/>
<id>7abfabaf5f805f5171d133ce6af9b65ab766e76a</id>
<content type='text'>
Reading /proc/mdstat with a read buffer size that would not
fit the unused status line in the first read will skip this
line from the output.

So 'dd if=/proc/mdstat bs=64 2&gt;/dev/null' will not print something
like: unused devices: &lt;none&gt;

Don't return NULL immediately in start() for v=2 but call
show() once to print the status line also for multiple reads.

Cc: stable@vger.kernel.org
Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface")
Signed-off-by: Jan Glauber &lt;jglauber@digitalocean.com&gt;
Signed-off-by: Song Liu &lt;songliubraving@fb.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Reading /proc/mdstat with a read buffer size that would not
fit the unused status line in the first read will skip this
line from the output.

So 'dd if=/proc/mdstat bs=64 2&gt;/dev/null' will not print something
like: unused devices: &lt;none&gt;

Don't return NULL immediately in start() for v=2 but call
show() once to print the status line also for multiple reads.

Cc: stable@vger.kernel.org
Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface")
Signed-off-by: Jan Glauber &lt;jglauber@digitalocean.com&gt;
Signed-off-by: Song Liu &lt;songliubraving@fb.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md: add md_submit_discard_bio() for submitting discard bio</title>
<updated>2021-03-24T23:29:06+00:00</updated>
<author>
<name>Xiao Ni</name>
<email>xni@redhat.com</email>
</author>
<published>2021-02-04T07:50:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=cf78408f937a67f59f5e90ee8e6cadeed7c128a8'/>
<id>cf78408f937a67f59f5e90ee8e6cadeed7c128a8</id>
<content type='text'>
Move these logic from raid0.c to md.c, so that we can also use it in
raid10.c.

Reviewed-by: Coly Li &lt;colyli@suse.de&gt;
Reviewed-by: Guoqing Jiang &lt;guoqing.jiang@cloud.ionos.com&gt;
Tested-by: Adrian Huang &lt;ahuang12@lenovo.com&gt;
Signed-off-by: Xiao Ni &lt;xni@redhat.com&gt;
Signed-off-by: Song Liu &lt;songliubraving@fb.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Move these logic from raid0.c to md.c, so that we can also use it in
raid10.c.

Reviewed-by: Coly Li &lt;colyli@suse.de&gt;
Reviewed-by: Guoqing Jiang &lt;guoqing.jiang@cloud.ionos.com&gt;
Tested-by: Adrian Huang &lt;ahuang12@lenovo.com&gt;
Signed-off-by: Xiao Ni &lt;xni@redhat.com&gt;
Signed-off-by: Song Liu &lt;songliubraving@fb.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md: use rdev_read_only in restart_array</title>
<updated>2021-02-01T16:35:20+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2021-02-01T13:17:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=a42e0d70c517c88c52154bf74ec39092d897aaca'/>
<id>a42e0d70c517c88c52154bf74ec39092d897aaca</id>
<content type='text'>
Make the read-only check in restart_array identical to the other two
read-only checks.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Make the read-only check in restart_array identical to the other two
read-only checks.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
</feed>
