<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/fs, branch v4.9.86</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>xfs: quota: check result of register_shrinker()</title>
<updated>2018-03-03T09:23:25+00:00</updated>
<author>
<name>Aliaksei Karaliou</name>
<email>akaraliou.dev@gmail.com</email>
</author>
<published>2017-12-21T21:18:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=c33d49420c5cdab3420a9516a04b988cb65ee9ad'/>
<id>c33d49420c5cdab3420a9516a04b988cb65ee9ad</id>
<content type='text'>
[ Upstream commit 3a3882ff26fbdbaf5f7e13f6a0bccfbf7121041d ]

xfs_qm_init_quotainfo() does not check result of register_shrinker()
which was tagged as __must_check recently, reported by sparse.

Signed-off-by: Aliaksei Karaliou &lt;akaraliou.dev@gmail.com&gt;
[darrick: move xfs_qm_destroy_quotainos nearer xfs_qm_init_quotainos]
Reviewed-by: Darrick J. Wong &lt;darrick.wong@oracle.com&gt;
Signed-off-by: Darrick J. Wong &lt;darrick.wong@oracle.com&gt;

Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 3a3882ff26fbdbaf5f7e13f6a0bccfbf7121041d ]

xfs_qm_init_quotainfo() does not check result of register_shrinker()
which was tagged as __must_check recently, reported by sparse.

Signed-off-by: Aliaksei Karaliou &lt;akaraliou.dev@gmail.com&gt;
[darrick: move xfs_qm_destroy_quotainos nearer xfs_qm_init_quotainos]
Reviewed-by: Darrick J. Wong &lt;darrick.wong@oracle.com&gt;
Signed-off-by: Darrick J. Wong &lt;darrick.wong@oracle.com&gt;

Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>xfs: quota: fix missed destroy of qi_tree_lock</title>
<updated>2018-03-03T09:23:25+00:00</updated>
<author>
<name>Aliaksei Karaliou</name>
<email>akaraliou.dev@gmail.com</email>
</author>
<published>2017-12-21T21:18:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=799948750c137513c2976f62a7345f424245234b'/>
<id>799948750c137513c2976f62a7345f424245234b</id>
<content type='text'>
[ Upstream commit 2196881566225f3c3428d1a5f847a992944daa5b ]

xfs_qm_destroy_quotainfo() does not destroy quotainfo-&gt;qi_tree_lock
while destroys quotainfo-&gt;qi_quotaofflock.

Signed-off-by: Aliaksei Karaliou &lt;akaraliou.dev@gmail.com&gt;
Reviewed-by: Darrick J. Wong &lt;darrick.wong@oracle.com&gt;
Signed-off-by: Darrick J. Wong &lt;darrick.wong@oracle.com&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 2196881566225f3c3428d1a5f847a992944daa5b ]

xfs_qm_destroy_quotainfo() does not destroy quotainfo-&gt;qi_tree_lock
while destroys quotainfo-&gt;qi_quotaofflock.

Signed-off-by: Aliaksei Karaliou &lt;akaraliou.dev@gmail.com&gt;
Reviewed-by: Darrick J. Wong &lt;darrick.wong@oracle.com&gt;
Signed-off-by: Darrick J. Wong &lt;darrick.wong@oracle.com&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sget(): handle failures of register_shrinker()</title>
<updated>2018-03-03T09:23:21+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2017-12-18T20:05:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=fd7cbb5ad8e8b222eff3401089e3244a7a8cb32e'/>
<id>fd7cbb5ad8e8b222eff3401089e3244a7a8cb32e</id>
<content type='text'>
[ Upstream commit 9ee332d99e4d5a97548943b81c54668450ce641b ]

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 9ee332d99e4d5a97548943b81c54668450ce641b ]

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>f2fs: fix a bug caused by NULL extent tree</title>
<updated>2018-03-03T09:23:20+00:00</updated>
<author>
<name>Yunlei He</name>
<email>heyunlei@huawei.com</email>
</author>
<published>2017-05-19T07:06:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=4a97b2d09d332c43612f489c99b97d691002b6d4'/>
<id>4a97b2d09d332c43612f489c99b97d691002b6d4</id>
<content type='text'>
commit dad48e73127ba10279ea33e6dbc8d3905c4d31c0 upstream.

Thread A:					Thread B:

-f2fs_remount
    -sbi-&gt;mount_opt.opt = 0;
						&lt;--- -f2fs_iget
						         -do_read_inode
							     -f2fs_init_extent_tree
							         -F2FS_I(inode)-&gt;extent_tree is NULL
        -default_options &amp;&amp; parse_options
	    -remount return
						&lt;---  -f2fs_map_blocks
						          -f2fs_lookup_extent_tree
                                                              -f2fs_bug_on(sbi, !et);

The same problem with f2fs_new_inode.

Signed-off-by: Yunlei He &lt;heyunlei@huawei.com&gt;
Signed-off-by: Jaegeuk Kim &lt;jaegeuk@kernel.org&gt;
Signed-off-by: Nikolay Borisov &lt;nborisov@suse.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit dad48e73127ba10279ea33e6dbc8d3905c4d31c0 upstream.

Thread A:					Thread B:

-f2fs_remount
    -sbi-&gt;mount_opt.opt = 0;
						&lt;--- -f2fs_iget
						         -do_read_inode
							     -f2fs_init_extent_tree
							         -F2FS_I(inode)-&gt;extent_tree is NULL
        -default_options &amp;&amp; parse_options
	    -remount return
						&lt;---  -f2fs_map_blocks
						          -f2fs_lookup_extent_tree
                                                              -f2fs_bug_on(sbi, !et);

The same problem with f2fs_new_inode.

Signed-off-by: Yunlei He &lt;heyunlei@huawei.com&gt;
Signed-off-by: Jaegeuk Kim &lt;jaegeuk@kernel.org&gt;
Signed-off-by: Nikolay Borisov &lt;nborisov@suse.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>fs/dax.c: fix inefficiency in dax_writeback_mapping_range()</title>
<updated>2018-02-28T09:18:33+00:00</updated>
<author>
<name>Jan Kara</name>
<email>jack@suse.cz</email>
</author>
<published>2018-02-23T22:05:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=f06c2c659ccad2c5da074311a6980cd888ea2779'/>
<id>f06c2c659ccad2c5da074311a6980cd888ea2779</id>
<content type='text'>
commit 1eb643d02b21412e603b42cdd96010a2ac31c05f upstream.

dax_writeback_mapping_range() fails to update iteration index when
searching radix tree for entries needing cache flushing.  Thus each
pagevec worth of entries is searched starting from the start which is
inefficient and prone to livelocks.  Update index properly.

Link: http://lkml.kernel.org/r/20170619124531.21491-1-jack@suse.cz
Fixes: 9973c98ecfda3 ("dax: add support for fsync/sync")
Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Reviewed-by: Ross Zwisler &lt;ross.zwisler@linux.intel.com&gt;
Cc: Dan Williams &lt;dan.j.williams@intel.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 1eb643d02b21412e603b42cdd96010a2ac31c05f upstream.

dax_writeback_mapping_range() fails to update iteration index when
searching radix tree for entries needing cache flushing.  Thus each
pagevec worth of entries is searched starting from the start which is
inefficient and prone to livelocks.  Update index properly.

Link: http://lkml.kernel.org/r/20170619124531.21491-1-jack@suse.cz
Fixes: 9973c98ecfda3 ("dax: add support for fsync/sync")
Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Reviewed-by: Ross Zwisler &lt;ross.zwisler@linux.intel.com&gt;
Cc: Dan Williams &lt;dan.j.williams@intel.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>binfmt_elf: compat: avoid unused function warning</title>
<updated>2018-02-25T10:05:55+00:00</updated>
<author>
<name>Arnd Bergmann</name>
<email>arnd@arndb.de</email>
</author>
<published>2018-02-19T10:13:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=890c52ab3d2a7e19714f698d36e83ef8e1a8e1b3'/>
<id>890c52ab3d2a7e19714f698d36e83ef8e1a8e1b3</id>
<content type='text'>
When CONFIG_ELF_CORE is disabled, we get a harmless warning in the compat
version of binfmt_elf:

fs/compat_binfmt_elf.c:58:13: error: 'cputime_to_compat_timeval' defined but not used [-Werror=unused-function]

This was addressed in mainline Linux as part of a larger rework with commit
cd19c364b313 ("fs/binfmt: Convert obsolete cputime type to nsecs").

For 4.9 and earlier, this just shuts up the warning by adding an #ifdef
around the function definition.

Signed-off-by: Arnd Bergmann &lt;arnd@arndb.de&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When CONFIG_ELF_CORE is disabled, we get a harmless warning in the compat
version of binfmt_elf:

fs/compat_binfmt_elf.c:58:13: error: 'cputime_to_compat_timeval' defined but not used [-Werror=unused-function]

This was addressed in mainline Linux as part of a larger rework with commit
cd19c364b313 ("fs/binfmt: Convert obsolete cputime type to nsecs").

For 4.9 and earlier, this just shuts up the warning by adding an #ifdef
around the function definition.

Signed-off-by: Arnd Bergmann &lt;arnd@arndb.de&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>reiserfs: avoid a -Wmaybe-uninitialized warning</title>
<updated>2018-02-25T10:05:53+00:00</updated>
<author>
<name>Arnd Bergmann</name>
<email>arnd@arndb.de</email>
</author>
<published>2017-03-23T15:06:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=445e8f85d87d1a56982c3f9ccdce3faa6e091654'/>
<id>445e8f85d87d1a56982c3f9ccdce3faa6e091654</id>
<content type='text'>
commit ab4949640d6674b617b314ad3c2c00353304bab9 upstream.

The latest gcc-7.0.1 snapshot warns about an unintialized variable use:

In file included from fs/reiserfs/lbalance.c:8:0:
fs/reiserfs/lbalance.c: In function 'leaf_item_bottle.isra.3':
fs/reiserfs/reiserfs.h:1279:13: error: '*((void *)&amp;n_ih+8).v' may be used uninitialized in this function [-Werror=maybe-uninitialized]
  v2-&gt;v = (v2-&gt;v &amp; cpu_to_le64(15ULL &lt;&lt; 60)) | cpu_to_le64(offset);
           ~~^~~
fs/reiserfs/reiserfs.h:1279:13: error: '*((void *)&amp;n_ih+8).v' may be used uninitialized in this function [-Werror=maybe-uninitialized]
  v2-&gt;v = (v2-&gt;v &amp; cpu_to_le64(15ULL &lt;&lt; 60)) | cpu_to_le64(offset);

This happens because the offset/type pair that is stored in
ih.key.u.k_offset_v2 is actually uninitialized when we call
set_le_ih_k_offset() and set_le_ih_k_type(). After we have called both,
all data is correct, but the first of the two reads uninitialized data
for the type field and writes it back before it gets overwritten.

This works around the warning by initializing the k_offset_v2 through
the slightly larger memcpy().

[JK: Remove now unused define and make it obvious we initialize the key]

Signed-off-by: Arnd Bergmann &lt;arnd@arndb.de&gt;
Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit ab4949640d6674b617b314ad3c2c00353304bab9 upstream.

The latest gcc-7.0.1 snapshot warns about an unintialized variable use:

In file included from fs/reiserfs/lbalance.c:8:0:
fs/reiserfs/lbalance.c: In function 'leaf_item_bottle.isra.3':
fs/reiserfs/reiserfs.h:1279:13: error: '*((void *)&amp;n_ih+8).v' may be used uninitialized in this function [-Werror=maybe-uninitialized]
  v2-&gt;v = (v2-&gt;v &amp; cpu_to_le64(15ULL &lt;&lt; 60)) | cpu_to_le64(offset);
           ~~^~~
fs/reiserfs/reiserfs.h:1279:13: error: '*((void *)&amp;n_ih+8).v' may be used uninitialized in this function [-Werror=maybe-uninitialized]
  v2-&gt;v = (v2-&gt;v &amp; cpu_to_le64(15ULL &lt;&lt; 60)) | cpu_to_le64(offset);

This happens because the offset/type pair that is stored in
ih.key.u.k_offset_v2 is actually uninitialized when we call
set_le_ih_k_offset() and set_le_ih_k_type(). After we have called both,
all data is correct, but the first of the two reads uninitialized data
for the type field and writes it back before it gets overwritten.

This works around the warning by initializing the k_offset_v2 through
the slightly larger memcpy().

[JK: Remove now unused define and make it obvious we initialize the key]

Signed-off-by: Arnd Bergmann &lt;arnd@arndb.de&gt;
Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>btrfs: Fix possible off-by-one in btrfs_search_path_in_tree</title>
<updated>2018-02-25T10:05:48+00:00</updated>
<author>
<name>Nikolay Borisov</name>
<email>nborisov@suse.com</email>
</author>
<published>2017-12-01T09:19:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=1c3aae50cec79164955e920432a7aba748c1b09a'/>
<id>1c3aae50cec79164955e920432a7aba748c1b09a</id>
<content type='text'>
[ Upstream commit c8bcbfbd239ed60a6562964b58034ac8a25f4c31 ]

The name char array passed to btrfs_search_path_in_tree is of size
BTRFS_INO_LOOKUP_PATH_MAX (4080). So the actual accessible char indexes
are in the range of [0, 4079]. Currently the code uses the define but this
represents an off-by-one.

Implications:

Size of btrfs_ioctl_ino_lookup_args is 4096, so the new byte will be
written to extra space, not some padding that could be provided by the
allocator.

btrfs-progs store the arguments on stack, but kernel does own copy of
the ioctl buffer and the off-by-one overwrite does not affect userspace,
but the ending 0 might be lost.

Kernel ioctl buffer is allocated dynamically so we're overwriting
somebody else's memory, and the ioctl is privileged if args.objectid is
not 256. Which is in most cases, but resolving a subvolume stored in
another directory will trigger that path.

Before this patch the buffer was one byte larger, but then the -1 was
not added.

Fixes: ac8e9819d71f907 ("Btrfs: add search and inode lookup ioctls")
Signed-off-by: Nikolay Borisov &lt;nborisov@suse.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
[ added implications ]
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;

Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit c8bcbfbd239ed60a6562964b58034ac8a25f4c31 ]

The name char array passed to btrfs_search_path_in_tree is of size
BTRFS_INO_LOOKUP_PATH_MAX (4080). So the actual accessible char indexes
are in the range of [0, 4079]. Currently the code uses the define but this
represents an off-by-one.

Implications:

Size of btrfs_ioctl_ino_lookup_args is 4096, so the new byte will be
written to extra space, not some padding that could be provided by the
allocator.

btrfs-progs store the arguments on stack, but kernel does own copy of
the ioctl buffer and the off-by-one overwrite does not affect userspace,
but the ending 0 might be lost.

Kernel ioctl buffer is allocated dynamically so we're overwriting
somebody else's memory, and the ioctl is privileged if args.objectid is
not 256. Which is in most cases, but resolving a subvolume stored in
another directory will trigger that path.

Before this patch the buffer was one byte larger, but then the -1 was
not added.

Fixes: ac8e9819d71f907 ("Btrfs: add search and inode lookup ioctls")
Signed-off-by: Nikolay Borisov &lt;nborisov@suse.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
[ added implications ]
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;

Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>vfs: don't do RCU lookup of empty pathnames</title>
<updated>2018-02-22T14:43:55+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2017-04-03T00:10:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=012e79b98f1a5eead31ea465518af35700a64c10'/>
<id>012e79b98f1a5eead31ea465518af35700a64c10</id>
<content type='text'>
commit c0eb027e5aef70b71e5a38ee3e264dc0b497f343 upstream.

Normal pathname lookup doesn't allow empty pathnames, but using
AT_EMPTY_PATH (with name_to_handle_at() or fstatat(), for example) you
can trigger an empty pathname lookup.

And not only is the RCU lookup in that case entirely unnecessary
(because we'll obviously immediately finalize the end result), it is
actively wrong.

Why? An empth path is a special case that will return the original
'dirfd' dentry - and that dentry may not actually be RCU-free'd,
resulting in a potential use-after-free if we were to initialize the
path lazily under the RCU read lock and depend on complete_walk()
finalizing the dentry.

Found by syzkaller and KASAN.

Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Reported-by: Vegard Nossum &lt;vegard.nossum@gmail.com&gt;
Acked-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Eric Biggers &lt;ebiggers3@gmail.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit c0eb027e5aef70b71e5a38ee3e264dc0b497f343 upstream.

Normal pathname lookup doesn't allow empty pathnames, but using
AT_EMPTY_PATH (with name_to_handle_at() or fstatat(), for example) you
can trigger an empty pathname lookup.

And not only is the RCU lookup in that case entirely unnecessary
(because we'll obviously immediately finalize the end result), it is
actively wrong.

Why? An empth path is a special case that will return the original
'dirfd' dentry - and that dentry may not actually be RCU-free'd,
resulting in a potential use-after-free if we were to initialize the
path lazily under the RCU read lock and depend on complete_walk()
finalizing the dentry.

Found by syzkaller and KASAN.

Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Reported-by: Vegard Nossum &lt;vegard.nossum@gmail.com&gt;
Acked-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Eric Biggers &lt;ebiggers3@gmail.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE</title>
<updated>2018-02-22T14:43:52+00:00</updated>
<author>
<name>Gang He</name>
<email>ghe@suse.com</email>
</author>
<published>2018-02-01T00:14:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=9cb1674008e435d2c0145b352bc66b964d7b37ff'/>
<id>9cb1674008e435d2c0145b352bc66b964d7b37ff</id>
<content type='text'>
commit ff26cc10aec128c3f86b5611fd5f59c71d49c0e3 upstream.

If we can't get inode lock immediately in the function
ocfs2_inode_lock_with_page() when reading a page, we should not return
directly here, since this will lead to a softlockup problem when the
kernel is configured with CONFIG_PREEMPT is not set.  The method is to
get a blocking lock and immediately unlock before returning, this can
avoid CPU resource waste due to lots of retries, and benefits fairness
in getting lock among multiple nodes, increase efficiency in case
modifying the same file frequently from multiple nodes.

The softlockup crash (when set /proc/sys/kernel/softlockup_panic to 1)
looks like:

  Kernel panic - not syncing: softlockup: hung tasks
  CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  Call Trace:
    &lt;IRQ&gt;
    dump_stack+0x5c/0x82
    panic+0xd5/0x21e
    watchdog_timer_fn+0x208/0x210
    __hrtimer_run_queues+0xcc/0x200
    hrtimer_interrupt+0xa6/0x1f0
    smp_apic_timer_interrupt+0x34/0x50
    apic_timer_interrupt+0x96/0xa0
    &lt;/IRQ&gt;
   RIP: 0010:unlock_page+0x17/0x30
   RSP: 0000:ffffaf154080bc88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
   RAX: dead000000000100 RBX: fffff21e009f5300 RCX: 0000000000000004
   RDX: dead0000000000ff RSI: 0000000000000202 RDI: fffff21e009f5300
   RBP: 0000000000000000 R08: 0000000000000000 R09: ffffaf154080bb00
   R10: ffffaf154080bc30 R11: 0000000000000040 R12: ffff993749a39518
   R13: 0000000000000000 R14: fffff21e009f5300 R15: fffff21e009f5300
    ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
    ocfs2_readpage+0x41/0x2d0 [ocfs2]
    filemap_fault+0x12b/0x5c0
    ocfs2_fault+0x29/0xb0 [ocfs2]
    __do_fault+0x1a/0xa0
    __handle_mm_fault+0xbe8/0x1090
    handle_mm_fault+0xaa/0x1f0
    __do_page_fault+0x235/0x4b0
    trace_do_page_fault+0x3c/0x110
    async_page_fault+0x28/0x30
   RIP: 0033:0x7fa75ded638e
   RSP: 002b:00007ffd6657db18 EFLAGS: 00010287
   RAX: 000055c7662fb700 RBX: 0000000000000001 RCX: 000055c7662fb700
   RDX: 0000000000001770 RSI: 00007fa75e909000 RDI: 000055c7662fb700
   RBP: 0000000000000003 R08: 000000000000000e R09: 0000000000000000
   R10: 0000000000000483 R11: 00007fa75ded61b0 R12: 00007fa75e90a770
   R13: 000000000000000e R14: 0000000000001770 R15: 0000000000000000

About performance improvement, we can see the testing time is reduced,
and CPU utilization decreases, the detailed data is as follows.  I ran
multi_mmap test case in ocfs2-test package in a three nodes cluster.

Before applying this patch:
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2754 ocfs2te+  20   0  170248   6980   4856 D 80.73 0.341   0:18.71 multi_mmap
   1505 root      rt   0  222236 123060  97224 S 2.658 6.015   0:01.44 corosync
      5 root      20   0       0      0      0 S 1.329 0.000   0:00.19 kworker/u8:0
     95 root      20   0       0      0      0 S 1.329 0.000   0:00.25 kworker/u8:1
   2728 root      20   0       0      0      0 S 0.997 0.000   0:00.24 jbd2/sda1-33
   2721 root      20   0       0      0      0 S 0.664 0.000   0:00.07 ocfs2dc-3C8CFD4
   2750 ocfs2te+  20   0  142976   4652   3532 S 0.664 0.227   0:00.28 mpirun

  ocfs2test@tb-node2:~&gt;multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
  Tests with "-b 4096 -C 32768"
  Thu Dec 28 14:44:52 CST 2017
  multi_mmap..................................................Passed.
  Runtime 783 seconds.

After apply this patch:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2508 ocfs2te+  20   0  170248   6804   4680 R 54.00 0.333   0:55.37 multi_mmap
    155 root      20   0       0      0      0 S 2.667 0.000   0:01.20 kworker/u8:3
     95 root      20   0       0      0      0 S 2.000 0.000   0:01.58 kworker/u8:1
   2504 ocfs2te+  20   0  142976   4604   3480 R 1.667 0.225   0:01.65 mpirun
      5 root      20   0       0      0      0 S 1.000 0.000   0:01.36 kworker/u8:0
   2482 root      20   0       0      0      0 S 1.000 0.000   0:00.86 jbd2/sda1-33
    299 root       0 -20       0      0      0 S 0.333 0.000   0:00.13 kworker/2:1H
    335 root       0 -20       0      0      0 S 0.333 0.000   0:00.17 kworker/1:1H
    535 root      20   0   12140   7268   1456 S 0.333 0.355   0:00.34 haveged
   1282 root      rt   0  222284 123108  97224 S 0.333 6.017   0:01.33 corosync

  ocfs2test@tb-node2:~&gt;multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
  Tests with "-b 4096 -C 32768"
  Thu Dec 28 15:04:12 CST 2017
  multi_mmap..................................................Passed.
  Runtime 487 seconds.

Link: http://lkml.kernel.org/r/1514447305-30814-1-git-send-email-ghe@suse.com
Fixes: 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock")
Signed-off-by: Gang He &lt;ghe@suse.com&gt;
Reviewed-by: Eric Ren &lt;zren@suse.com&gt;
Acked-by: alex chen &lt;alex.chen@huawei.com&gt;
Acked-by: piaojun &lt;piaojun@huawei.com&gt;
Cc: Mark Fasheh &lt;mfasheh@versity.com&gt;
Cc: Joel Becker &lt;jlbec@evilplan.org&gt;
Cc: Junxiao Bi &lt;junxiao.bi@oracle.com&gt;
Cc: Joseph Qi &lt;jiangqi903@gmail.com&gt;
Cc: Changwei Ge &lt;ge.changwei@h3c.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit ff26cc10aec128c3f86b5611fd5f59c71d49c0e3 upstream.

If we can't get inode lock immediately in the function
ocfs2_inode_lock_with_page() when reading a page, we should not return
directly here, since this will lead to a softlockup problem when the
kernel is configured with CONFIG_PREEMPT is not set.  The method is to
get a blocking lock and immediately unlock before returning, this can
avoid CPU resource waste due to lots of retries, and benefits fairness
in getting lock among multiple nodes, increase efficiency in case
modifying the same file frequently from multiple nodes.

The softlockup crash (when set /proc/sys/kernel/softlockup_panic to 1)
looks like:

  Kernel panic - not syncing: softlockup: hung tasks
  CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  Call Trace:
    &lt;IRQ&gt;
    dump_stack+0x5c/0x82
    panic+0xd5/0x21e
    watchdog_timer_fn+0x208/0x210
    __hrtimer_run_queues+0xcc/0x200
    hrtimer_interrupt+0xa6/0x1f0
    smp_apic_timer_interrupt+0x34/0x50
    apic_timer_interrupt+0x96/0xa0
    &lt;/IRQ&gt;
   RIP: 0010:unlock_page+0x17/0x30
   RSP: 0000:ffffaf154080bc88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
   RAX: dead000000000100 RBX: fffff21e009f5300 RCX: 0000000000000004
   RDX: dead0000000000ff RSI: 0000000000000202 RDI: fffff21e009f5300
   RBP: 0000000000000000 R08: 0000000000000000 R09: ffffaf154080bb00
   R10: ffffaf154080bc30 R11: 0000000000000040 R12: ffff993749a39518
   R13: 0000000000000000 R14: fffff21e009f5300 R15: fffff21e009f5300
    ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
    ocfs2_readpage+0x41/0x2d0 [ocfs2]
    filemap_fault+0x12b/0x5c0
    ocfs2_fault+0x29/0xb0 [ocfs2]
    __do_fault+0x1a/0xa0
    __handle_mm_fault+0xbe8/0x1090
    handle_mm_fault+0xaa/0x1f0
    __do_page_fault+0x235/0x4b0
    trace_do_page_fault+0x3c/0x110
    async_page_fault+0x28/0x30
   RIP: 0033:0x7fa75ded638e
   RSP: 002b:00007ffd6657db18 EFLAGS: 00010287
   RAX: 000055c7662fb700 RBX: 0000000000000001 RCX: 000055c7662fb700
   RDX: 0000000000001770 RSI: 00007fa75e909000 RDI: 000055c7662fb700
   RBP: 0000000000000003 R08: 000000000000000e R09: 0000000000000000
   R10: 0000000000000483 R11: 00007fa75ded61b0 R12: 00007fa75e90a770
   R13: 000000000000000e R14: 0000000000001770 R15: 0000000000000000

About performance improvement, we can see the testing time is reduced,
and CPU utilization decreases, the detailed data is as follows.  I ran
multi_mmap test case in ocfs2-test package in a three nodes cluster.

Before applying this patch:
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2754 ocfs2te+  20   0  170248   6980   4856 D 80.73 0.341   0:18.71 multi_mmap
   1505 root      rt   0  222236 123060  97224 S 2.658 6.015   0:01.44 corosync
      5 root      20   0       0      0      0 S 1.329 0.000   0:00.19 kworker/u8:0
     95 root      20   0       0      0      0 S 1.329 0.000   0:00.25 kworker/u8:1
   2728 root      20   0       0      0      0 S 0.997 0.000   0:00.24 jbd2/sda1-33
   2721 root      20   0       0      0      0 S 0.664 0.000   0:00.07 ocfs2dc-3C8CFD4
   2750 ocfs2te+  20   0  142976   4652   3532 S 0.664 0.227   0:00.28 mpirun

  ocfs2test@tb-node2:~&gt;multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
  Tests with "-b 4096 -C 32768"
  Thu Dec 28 14:44:52 CST 2017
  multi_mmap..................................................Passed.
  Runtime 783 seconds.

After apply this patch:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2508 ocfs2te+  20   0  170248   6804   4680 R 54.00 0.333   0:55.37 multi_mmap
    155 root      20   0       0      0      0 S 2.667 0.000   0:01.20 kworker/u8:3
     95 root      20   0       0      0      0 S 2.000 0.000   0:01.58 kworker/u8:1
   2504 ocfs2te+  20   0  142976   4604   3480 R 1.667 0.225   0:01.65 mpirun
      5 root      20   0       0      0      0 S 1.000 0.000   0:01.36 kworker/u8:0
   2482 root      20   0       0      0      0 S 1.000 0.000   0:00.86 jbd2/sda1-33
    299 root       0 -20       0      0      0 S 0.333 0.000   0:00.13 kworker/2:1H
    335 root       0 -20       0      0      0 S 0.333 0.000   0:00.17 kworker/1:1H
    535 root      20   0   12140   7268   1456 S 0.333 0.355   0:00.34 haveged
   1282 root      rt   0  222284 123108  97224 S 0.333 6.017   0:01.33 corosync

  ocfs2test@tb-node2:~&gt;multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
  Tests with "-b 4096 -C 32768"
  Thu Dec 28 15:04:12 CST 2017
  multi_mmap..................................................Passed.
  Runtime 487 seconds.

Link: http://lkml.kernel.org/r/1514447305-30814-1-git-send-email-ghe@suse.com
Fixes: 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock")
Signed-off-by: Gang He &lt;ghe@suse.com&gt;
Reviewed-by: Eric Ren &lt;zren@suse.com&gt;
Acked-by: alex chen &lt;alex.chen@huawei.com&gt;
Acked-by: piaojun &lt;piaojun@huawei.com&gt;
Cc: Mark Fasheh &lt;mfasheh@versity.com&gt;
Cc: Joel Becker &lt;jlbec@evilplan.org&gt;
Cc: Junxiao Bi &lt;junxiao.bi@oracle.com&gt;
Cc: Joseph Qi &lt;jiangqi903@gmail.com&gt;
Cc: Changwei Ge &lt;ge.changwei@h3c.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
</feed>
