<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/fs/fs-writeback.c, branch v4.4.166</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>bdi: Fix oops in wb_workfn()</title>
<updated>2018-05-16T08:06:51+00:00</updated>
<author>
<name>Jan Kara</name>
<email>jack@suse.cz</email>
</author>
<published>2018-05-03T16:26:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=3affd66711f6bb3a7cd7d880771afddf2eadcc2a'/>
<id>3affd66711f6bb3a7cd7d880771afddf2eadcc2a</id>
<content type='text'>
commit b8b784958eccbf8f51ebeee65282ca3fd59ea391 upstream.

Syzbot has reported that it can hit a NULL pointer dereference in
wb_workfn() due to wb-&gt;bdi-&gt;dev being NULL. This indicates that
wb_workfn() was called for an already unregistered bdi which should not
happen as wb_shutdown() called from bdi_unregister() should make sure
all pending writeback works are completed before bdi is unregistered.
Except that wb_workfn() itself can requeue the work with:

	mod_delayed_work(bdi_wq, &amp;wb-&gt;dwork, 0);

and if this happens while wb_shutdown() is waiting in:

	flush_delayed_work(&amp;wb-&gt;dwork);

the dwork can get executed after wb_shutdown() has finished and
bdi_unregister() has cleared wb-&gt;bdi-&gt;dev.

Make wb_workfn() use wakeup_wb() for requeueing the work which takes all
the necessary precautions against racing with bdi unregistration.

CC: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
CC: Tejun Heo &lt;tj@kernel.org&gt;
Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977
Reported-by: syzbot &lt;syzbot+9873874c735f2892e7e9@syzkaller.appspotmail.com&gt;
Reviewed-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit b8b784958eccbf8f51ebeee65282ca3fd59ea391 upstream.

Syzbot has reported that it can hit a NULL pointer dereference in
wb_workfn() due to wb-&gt;bdi-&gt;dev being NULL. This indicates that
wb_workfn() was called for an already unregistered bdi which should not
happen as wb_shutdown() called from bdi_unregister() should make sure
all pending writeback works are completed before bdi is unregistered.
Except that wb_workfn() itself can requeue the work with:

	mod_delayed_work(bdi_wq, &amp;wb-&gt;dwork, 0);

and if this happens while wb_shutdown() is waiting in:

	flush_delayed_work(&amp;wb-&gt;dwork);

the dwork can get executed after wb_shutdown() has finished and
bdi_unregister() has cleared wb-&gt;bdi-&gt;dev.

Make wb_workfn() use wakeup_wb() for requeueing the work which takes all
the necessary precautions against racing with bdi unregistration.

CC: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
CC: Tejun Heo &lt;tj@kernel.org&gt;
Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977
Reported-by: syzbot &lt;syzbot+9873874c735f2892e7e9@syzkaller.appspotmail.com&gt;
Reviewed-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>writeback: safer lock nesting</title>
<updated>2018-04-24T07:32:12+00:00</updated>
<author>
<name>Greg Thelen</name>
<email>gthelen@google.com</email>
</author>
<published>2018-04-20T21:55:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=6f051f8986a89d0c482ea1dfc96bc226fb12389f'/>
<id>6f051f8986a89d0c482ea1dfc96bc226fb12389f</id>
<content type='text'>
commit 2e898e4c0a3897ccd434adac5abb8330194f527b upstream.

lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.

unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
the given inode is switching writeback domains.  Switches occur when
enough writes are issued from a new domain.

This existing pattern is thus suspicious:
    lock_page_memcg(page);
    unlocked_inode_to_wb_begin(inode, &amp;locked);
    ...
    unlocked_inode_to_wb_end(inode, locked);
    unlock_page_memcg(page);

If both inode switch and process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock.  This suggests the
possibility of deadlock if an interrupt occurs before unlock_page_memcg().

    truncate
    __cancel_dirty_page
    lock_page_memcg
    unlocked_inode_to_wb_begin
    unlocked_inode_to_wb_end
    &lt;interrupts mistakenly enabled&gt;
                                    &lt;interrupt&gt;
                                    end_page_writeback
                                    test_clear_page_writeback
                                    lock_page_memcg
                                    &lt;deadlock&gt;
    unlock_page_memcg

Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).

If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:

  cd /mnt/cgroup/memory
  mkdir a b
  echo 1 &gt; a/memory.move_charge_at_immigrate
  echo 1 &gt; b/memory.move_charge_at_immigrate
  (
    echo $BASHPID &gt; a/cgroup.procs
    while true; do
      dd if=/dev/zero of=/mnt/big bs=1M count=256
    done
  ) &amp;
  while true; do
    sync
  done &amp;
  sleep 1h &amp;
  SLEEP=$!
  while true; do
    echo $SLEEP &gt; a/cgroup.procs
    echo $SLEEP &gt; b/cgroup.procs
  done

The deadlock does not seem possible, so it's debatable if there's any
reason to modify the kernel.  I suggest we should to prevent future
surprises.  And Wang Long said "this deadlock occurs three times in our
environment", so there's more reason to apply this, even to stable.
Stable 4.4 has minor conflicts applying this patch.  For a clean 4.4 patch
see "[PATCH for-4.4] writeback: safer lock nesting"
https://lkml.org/lkml/2018/4/11/146

Wang Long said "this deadlock occurs three times in our environment"

[gthelen@google.com: v4]
  Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
[akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Signed-off-by: Greg Thelen &lt;gthelen@google.com&gt;
Reported-by: Wang Long &lt;wanglong19@meituan.com&gt;
Acked-by: Wang Long &lt;wanglong19@meituan.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Reviewed-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[v4.2+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
[natechancellor: Applied to 4.4 based on Greg's backport on lkml.org]
Signed-off-by: Nathan Chancellor &lt;natechancellor@gmail.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 2e898e4c0a3897ccd434adac5abb8330194f527b upstream.

lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.

unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
the given inode is switching writeback domains.  Switches occur when
enough writes are issued from a new domain.

This existing pattern is thus suspicious:
    lock_page_memcg(page);
    unlocked_inode_to_wb_begin(inode, &amp;locked);
    ...
    unlocked_inode_to_wb_end(inode, locked);
    unlock_page_memcg(page);

If both inode switch and process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock.  This suggests the
possibility of deadlock if an interrupt occurs before unlock_page_memcg().

    truncate
    __cancel_dirty_page
    lock_page_memcg
    unlocked_inode_to_wb_begin
    unlocked_inode_to_wb_end
    &lt;interrupts mistakenly enabled&gt;
                                    &lt;interrupt&gt;
                                    end_page_writeback
                                    test_clear_page_writeback
                                    lock_page_memcg
                                    &lt;deadlock&gt;
    unlock_page_memcg

Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).

If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:

  cd /mnt/cgroup/memory
  mkdir a b
  echo 1 &gt; a/memory.move_charge_at_immigrate
  echo 1 &gt; b/memory.move_charge_at_immigrate
  (
    echo $BASHPID &gt; a/cgroup.procs
    while true; do
      dd if=/dev/zero of=/mnt/big bs=1M count=256
    done
  ) &amp;
  while true; do
    sync
  done &amp;
  sleep 1h &amp;
  SLEEP=$!
  while true; do
    echo $SLEEP &gt; a/cgroup.procs
    echo $SLEEP &gt; b/cgroup.procs
  done

The deadlock does not seem possible, so it's debatable if there's any
reason to modify the kernel.  I suggest we should to prevent future
surprises.  And Wang Long said "this deadlock occurs three times in our
environment", so there's more reason to apply this, even to stable.
Stable 4.4 has minor conflicts applying this patch.  For a clean 4.4 patch
see "[PATCH for-4.4] writeback: safer lock nesting"
https://lkml.org/lkml/2018/4/11/146

Wang Long said "this deadlock occurs three times in our environment"

[gthelen@google.com: v4]
  Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
[akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Signed-off-by: Greg Thelen &lt;gthelen@google.com&gt;
Reported-by: Wang Long &lt;wanglong19@meituan.com&gt;
Acked-by: Wang Long &lt;wanglong19@meituan.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Reviewed-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[v4.2+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
[natechancellor: Applied to 4.4 based on Greg's backport on lkml.org]
Signed-off-by: Nathan Chancellor &lt;natechancellor@gmail.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>writeback: fix memory leak in wb_queue_work()</title>
<updated>2017-12-20T09:04:54+00:00</updated>
<author>
<name>Tahsin Erdogan</name>
<email>tahsin@google.com</email>
</author>
<published>2017-03-10T20:09:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=b424289863d09900643c8e2dd6fa53a465895258'/>
<id>b424289863d09900643c8e2dd6fa53a465895258</id>
<content type='text'>
[ Upstream commit 4a3a485b1ed0e109718cc8c9d094fa0f552de9b2 ]

When WB_registered flag is not set, wb_queue_work() skips queuing the
work, but does not perform the necessary clean up. In particular, if
work-&gt;auto_free is true, it should free the memory.

The leak condition can be reprouced by following these steps:

   mount /dev/sdb /mnt/sdb
   /* In qemu console: device_del sdb */
   umount /dev/sdb

Above will result in a wb_queue_work() call on an unregistered wb and
thus leak memory.

Reported-by: John Sperbeck &lt;jsperbeck@google.com&gt;
Signed-off-by: Tahsin Erdogan &lt;tahsin@google.com&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@verizon.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 4a3a485b1ed0e109718cc8c9d094fa0f552de9b2 ]

When WB_registered flag is not set, wb_queue_work() skips queuing the
work, but does not perform the necessary clean up. In particular, if
work-&gt;auto_free is true, it should free the memory.

The leak condition can be reprouced by following these steps:

   mount /dev/sdb /mnt/sdb
   /* In qemu console: device_del sdb */
   umount /dev/sdb

Above will result in a wb_queue_work() call on an unregistered wb and
thus leak memory.

Reported-by: John Sperbeck &lt;jsperbeck@google.com&gt;
Signed-off-by: Tahsin Erdogan &lt;tahsin@google.com&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@verizon.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>writeback, cgroup: fix use of the wrong bdi_writeback which mismatches the inode</title>
<updated>2016-04-12T16:09:04+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2016-03-18T17:52:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=842ec116c7070c8bfc785b609acb19fb29a59cb0'/>
<id>842ec116c7070c8bfc785b609acb19fb29a59cb0</id>
<content type='text'>
commit aaf2559332ba272671bb870464a99b909b29a3a1 upstream.

When cgroup writeback is in use, there can be multiple wb's
(bdi_writeback's) per bdi and an inode may switch among them
dynamically.  In a couple places, the wrong wb was used leading to
performing operations on the wrong list under the wrong lock
corrupting the io lists.

* writeback_single_inode() was taking @wb parameter and used it to
  remove the inode from io lists if it becomes clean after writeback.
  The callers of this function were always passing in the root wb
  regardless of the actual wb that the inode was associated with,
  which could also change while writeback is in progress.

  Fix it by dropping the @wb parameter and using
  inode_to_wb_and_lock_list() to determine and lock the associated wb.

* After writeback_sb_inodes() writes out an inode, it re-locks @wb and
  inode to remove it from or move it to the right io list.  It assumes
  that the inode is still associated with @wb; however, the inode may
  have switched to another wb while writeback was in progress.

  Fix it by using inode_to_wb_and_lock_list() to determine and lock
  the associated wb after writeback is complete.  As the function
  requires the original @wb-&gt;list_lock locked for the next iteration,
  in the unlikely case where the inode has changed association, switch
  the locks.

Kudos to Tahsin for pinpointing these subtle breakages.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Fixes: d10c80955265 ("writeback: implement foreign cgroup inode bdi_writeback switching")
Link: http://lkml.kernel.org/g/CAAeU0aMYeM_39Y2+PaRvyB1nqAPYZSNngJ1eBRmrxn7gKAt2Mg@mail.gmail.com
Reported-and-diagnosed-by: Tahsin Erdogan &lt;tahsin@google.com&gt;
Tested-by: Tahsin Erdogan &lt;tahsin@google.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit aaf2559332ba272671bb870464a99b909b29a3a1 upstream.

When cgroup writeback is in use, there can be multiple wb's
(bdi_writeback's) per bdi and an inode may switch among them
dynamically.  In a couple places, the wrong wb was used leading to
performing operations on the wrong list under the wrong lock
corrupting the io lists.

* writeback_single_inode() was taking @wb parameter and used it to
  remove the inode from io lists if it becomes clean after writeback.
  The callers of this function were always passing in the root wb
  regardless of the actual wb that the inode was associated with,
  which could also change while writeback is in progress.

  Fix it by dropping the @wb parameter and using
  inode_to_wb_and_lock_list() to determine and lock the associated wb.

* After writeback_sb_inodes() writes out an inode, it re-locks @wb and
  inode to remove it from or move it to the right io list.  It assumes
  that the inode is still associated with @wb; however, the inode may
  have switched to another wb while writeback was in progress.

  Fix it by using inode_to_wb_and_lock_list() to determine and lock
  the associated wb after writeback is complete.  As the function
  requires the original @wb-&gt;list_lock locked for the next iteration,
  in the unlikely case where the inode has changed association, switch
  the locks.

Kudos to Tahsin for pinpointing these subtle breakages.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Fixes: d10c80955265 ("writeback: implement foreign cgroup inode bdi_writeback switching")
Link: http://lkml.kernel.org/g/CAAeU0aMYeM_39Y2+PaRvyB1nqAPYZSNngJ1eBRmrxn7gKAt2Mg@mail.gmail.com
Reported-and-diagnosed-by: Tahsin Erdogan &lt;tahsin@google.com&gt;
Tested-by: Tahsin Erdogan &lt;tahsin@google.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>writeback, cgroup: fix premature wb_put() in locked_inode_to_wb_and_lock_list()</title>
<updated>2016-04-12T16:09:04+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2016-03-18T17:50:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=d78ddcfbe7ab8c5f4ff0b8f20b2bbda710fc0e91'/>
<id>d78ddcfbe7ab8c5f4ff0b8f20b2bbda710fc0e91</id>
<content type='text'>
commit 614a4e3773148a31f58dc174bbf578ceb63510c2 upstream.

locked_inode_to_wb_and_lock_list() wb_get()'s the wb associated with
the target inode, unlocks inode, locks the wb's list_lock and verifies
that the inode is still associated with the wb.  To prevent the wb
going away between dropping inode lock and acquiring list_lock, the wb
is pinned while inode lock is held.  The wb reference is put right
after acquiring list_lock citing that the wb won't be dereferenced
anymore.

This isn't true.  If the inode is still associated with the wb, the
inode has reference and it's safe to return the wb; however, if inode
has been switched, the wb still needs to be unlocked which is a
dereference and can lead to use-after-free if it it races with wb
destruction.

Fix it by putting the reference after releasing list_lock.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Fixes: 87e1d789bf55 ("writeback: implement [locked_]inode_to_wb_and_lock_list()")
Tested-by: Tahsin Erdogan &lt;tahsin@google.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 614a4e3773148a31f58dc174bbf578ceb63510c2 upstream.

locked_inode_to_wb_and_lock_list() wb_get()'s the wb associated with
the target inode, unlocks inode, locks the wb's list_lock and verifies
that the inode is still associated with the wb.  To prevent the wb
going away between dropping inode lock and acquiring list_lock, the wb
is pinned while inode lock is held.  The wb reference is put right
after acquiring list_lock citing that the wb won't be dereferenced
anymore.

This isn't true.  If the inode is still associated with the wb, the
inode has reference and it's safe to return the wb; however, if inode
has been switched, the wb still needs to be unlocked which is a
dereference and can lead to use-after-free if it it races with wb
destruction.

Fix it by putting the reference after releasing list_lock.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Fixes: 87e1d789bf55 ("writeback: implement [locked_]inode_to_wb_and_lock_list()")
Tested-by: Tahsin Erdogan &lt;tahsin@google.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>writeback: flush inode cgroup wb switches instead of pinning super_block</title>
<updated>2016-03-09T23:34:52+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2016-02-29T23:28:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=c5cbbec54fe71c4de2d34f8c0ec8fbfdd7f17339'/>
<id>c5cbbec54fe71c4de2d34f8c0ec8fbfdd7f17339</id>
<content type='text'>
commit a1a0e23e49037c23ea84bc8cc146a03584d13577 upstream.

If cgroup writeback is in use, inodes can be scheduled for
asynchronous wb switching.  Before 5ff8eaac1636 ("writeback: keep
superblock pinned during cgroup writeback association switches"), this
could race with umount leading to super_block being destroyed while
inodes are pinned for wb switching.  5ff8eaac1636 fixed it by bumping
s_active while wb switches are in flight; however, this allowed
in-flight wb switches to make umounts asynchronous when the userland
expected synchronosity - e.g. fsck immediately following umount may
fail because the device is still busy.

This patch removes the problematic super_block pinning and instead
makes generic_shutdown_super() flush in-flight wb switches.  wb
switches are now executed on a dedicated isw_wq so that they can be
flushed and isw_nr_in_flight keeps track of the number of in-flight wb
switches so that flushing can be avoided in most cases.

v2: Move cgroup_writeback_umount() further below and add MS_ACTIVE
    check in inode_switch_wbs() as Jan an Al suggested.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-by: Tahsin Erdogan &lt;tahsin@google.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Al Viro &lt;viro@ZenIV.linux.org.uk&gt;
Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
Fixes: 5ff8eaac1636 ("writeback: keep superblock pinned during cgroup writeback association switches")
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Tested-by: Tahsin Erdogan &lt;tahsin@google.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit a1a0e23e49037c23ea84bc8cc146a03584d13577 upstream.

If cgroup writeback is in use, inodes can be scheduled for
asynchronous wb switching.  Before 5ff8eaac1636 ("writeback: keep
superblock pinned during cgroup writeback association switches"), this
could race with umount leading to super_block being destroyed while
inodes are pinned for wb switching.  5ff8eaac1636 fixed it by bumping
s_active while wb switches are in flight; however, this allowed
in-flight wb switches to make umounts asynchronous when the userland
expected synchronosity - e.g. fsck immediately following umount may
fail because the device is still busy.

This patch removes the problematic super_block pinning and instead
makes generic_shutdown_super() flush in-flight wb switches.  wb
switches are now executed on a dedicated isw_wq so that they can be
flushed and isw_nr_in_flight keeps track of the number of in-flight wb
switches so that flushing can be avoided in most cases.

v2: Move cgroup_writeback_umount() further below and add MS_ACTIVE
    check in inode_switch_wbs() as Jan an Al suggested.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-by: Tahsin Erdogan &lt;tahsin@google.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Al Viro &lt;viro@ZenIV.linux.org.uk&gt;
Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
Fixes: 5ff8eaac1636 ("writeback: keep superblock pinned during cgroup writeback association switches")
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Tested-by: Tahsin Erdogan &lt;tahsin@google.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>writeback: keep superblock pinned during cgroup writeback association switches</title>
<updated>2016-03-03T23:07:28+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2016-02-16T18:34:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=7c465723d0b6f2621f6c712035b117d744a51a8b'/>
<id>7c465723d0b6f2621f6c712035b117d744a51a8b</id>
<content type='text'>
commit 5ff8eaac1636bf6deae86491f4818c4c69d1a9ac upstream.

If cgroup writeback is in use, an inode is associated with a cgroup
for writeback.  If the inode's main dirtier changes to another cgroup,
the association gets updated asynchronously.  Nothing was pinning the
superblock while such switches are in progress and superblock could go
away while async switching is pending or in progress leading to
crashes like the following.

 kernel BUG at fs/jbd2/transaction.c:319!
 invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
 CPU: 1 PID: 29158 Comm: kworker/1:10 Not tainted 4.5.0-rc3 #51
 Hardware name: Google Google, BIOS Google 01/01/2011
 Workqueue: events inode_switch_wbs_work_fn
 task: ffff880213dbbd40 ti: ffff880209264000 task.ti: ffff880209264000
 RIP: 0010:[&lt;ffffffff803e6922&gt;]  [&lt;ffffffff803e6922&gt;] start_this_handle+0x382/0x3e0
 RSP: 0018:ffff880209267c30  EFLAGS: 00010202
 ...
 Call Trace:
  [&lt;ffffffff803e6be4&gt;] jbd2__journal_start+0xf4/0x190
  [&lt;ffffffff803cfc7e&gt;] __ext4_journal_start_sb+0x4e/0x70
  [&lt;ffffffff803b31ec&gt;] ext4_evict_inode+0x12c/0x3d0
  [&lt;ffffffff8035338b&gt;] evict+0xbb/0x190
  [&lt;ffffffff80354190&gt;] iput+0x130/0x190
  [&lt;ffffffff80360223&gt;] inode_switch_wbs_work_fn+0x343/0x4c0
  [&lt;ffffffff80279819&gt;] process_one_work+0x129/0x300
  [&lt;ffffffff80279b16&gt;] worker_thread+0x126/0x480
  [&lt;ffffffff8027ed14&gt;] kthread+0xc4/0xe0
  [&lt;ffffffff809771df&gt;] ret_from_fork+0x3f/0x70

Fix it by bumping s_active while cgroup association switching is in
flight.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-and-tested-by: Tahsin Erdogan &lt;tahsin@google.com&gt;
Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
Fixes: d10c80955265 ("writeback: implement foreign cgroup inode bdi_writeback switching")
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 5ff8eaac1636bf6deae86491f4818c4c69d1a9ac upstream.

If cgroup writeback is in use, an inode is associated with a cgroup
for writeback.  If the inode's main dirtier changes to another cgroup,
the association gets updated asynchronously.  Nothing was pinning the
superblock while such switches are in progress and superblock could go
away while async switching is pending or in progress leading to
crashes like the following.

 kernel BUG at fs/jbd2/transaction.c:319!
 invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
 CPU: 1 PID: 29158 Comm: kworker/1:10 Not tainted 4.5.0-rc3 #51
 Hardware name: Google Google, BIOS Google 01/01/2011
 Workqueue: events inode_switch_wbs_work_fn
 task: ffff880213dbbd40 ti: ffff880209264000 task.ti: ffff880209264000
 RIP: 0010:[&lt;ffffffff803e6922&gt;]  [&lt;ffffffff803e6922&gt;] start_this_handle+0x382/0x3e0
 RSP: 0018:ffff880209267c30  EFLAGS: 00010202
 ...
 Call Trace:
  [&lt;ffffffff803e6be4&gt;] jbd2__journal_start+0xf4/0x190
  [&lt;ffffffff803cfc7e&gt;] __ext4_journal_start_sb+0x4e/0x70
  [&lt;ffffffff803b31ec&gt;] ext4_evict_inode+0x12c/0x3d0
  [&lt;ffffffff8035338b&gt;] evict+0xbb/0x190
  [&lt;ffffffff80354190&gt;] iput+0x130/0x190
  [&lt;ffffffff80360223&gt;] inode_switch_wbs_work_fn+0x343/0x4c0
  [&lt;ffffffff80279819&gt;] process_one_work+0x129/0x300
  [&lt;ffffffff80279b16&gt;] worker_thread+0x126/0x480
  [&lt;ffffffff8027ed14&gt;] kthread+0xc4/0xe0
  [&lt;ffffffff809771df&gt;] ret_from_fork+0x3f/0x70

Fix it by bumping s_active while cgroup association switching is in
flight.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-and-tested-by: Tahsin Erdogan &lt;tahsin@google.com&gt;
Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
Fixes: d10c80955265 ("writeback: implement foreign cgroup inode bdi_writeback switching")
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>fs/writeback.c: fix kernel-doc warnings</title>
<updated>2015-11-09T23:11:24+00:00</updated>
<author>
<name>Randy Dunlap</name>
<email>rdunlap@infradead.org</email>
</author>
<published>2015-11-09T22:58:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=dbce03b9e3e61e122451a7aa4e6900d5f0bb5993'/>
<id>dbce03b9e3e61e122451a7aa4e6900d5f0bb5993</id>
<content type='text'>
Fix kernel-doc warnings in fs/fs-writeback.c by moving a #define macro to
after the function's opening brace.  Also #undef this macro at the end of
the function.

  ../fs/fs-writeback.c:1984: warning: Excess function parameter 'inode' description in 'I_DIRTY_INODE'
  ../fs/fs-writeback.c:1984: warning: Excess function parameter 'flags' description in 'I_DIRTY_INODE'

Signed-off-by: Randy Dunlap &lt;rdunlap@infradead.org&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Fix kernel-doc warnings in fs/fs-writeback.c by moving a #define macro to
after the function's opening brace.  Also #undef this macro at the end of
the function.

  ../fs/fs-writeback.c:1984: warning: Excess function parameter 'inode' description in 'I_DIRTY_INODE'
  ../fs/fs-writeback.c:1984: warning: Excess function parameter 'flags' description in 'I_DIRTY_INODE'

Signed-off-by: Randy Dunlap &lt;rdunlap@infradead.org&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm/filemap.c: make global sync not clear error status of individual inodes</title>
<updated>2015-11-06T03:34:48+00:00</updated>
<author>
<name>Junichi Nomura</name>
<email>j-nomura@ce.jp.nec.com</email>
</author>
<published>2015-11-06T02:47:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=aa750fd71c242dba02ee2034e15fbd7d0cdb2461'/>
<id>aa750fd71c242dba02ee2034e15fbd7d0cdb2461</id>
<content type='text'>
filemap_fdatawait() is a function to wait for on-going writeback to
complete but also consume and clear error status of the mapping set during
writeback.

The latter functionality is critical for applications to detect writeback
error with system calls like fsync(2)/fdatasync(2).

However filemap_fdatawait() is also used by sync(2) or FIFREEZE ioctl,
which don't check error status of individual mappings.

As a result, fsync() may not be able to detect writeback error if events
happen in the following order:

   Application                    System admin
   ----------------------------------------------------------
   write data on page cache
                                  Run sync command
                                  writeback completes with error
                                  filemap_fdatawait() clears error
   fsync returns success
   (but the data is not on disk)

This patch adds filemap_fdatawait_keep_errors() for call sites where
writeback error is not handled so that they don't clear error status.

Signed-off-by: Jun'ichi Nomura &lt;j-nomura@ce.jp.nec.com&gt;
Acked-by: Andi Kleen &lt;ak@linux.intel.com&gt;
Reviewed-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Fengguang Wu &lt;fengguang.wu@gmail.com&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
filemap_fdatawait() is a function to wait for on-going writeback to
complete but also consume and clear error status of the mapping set during
writeback.

The latter functionality is critical for applications to detect writeback
error with system calls like fsync(2)/fdatasync(2).

However filemap_fdatawait() is also used by sync(2) or FIFREEZE ioctl,
which don't check error status of individual mappings.

As a result, fsync() may not be able to detect writeback error if events
happen in the following order:

   Application                    System admin
   ----------------------------------------------------------
   write data on page cache
                                  Run sync command
                                  writeback completes with error
                                  filemap_fdatawait() clears error
   fsync returns success
   (but the data is not on disk)

This patch adds filemap_fdatawait_keep_errors() for call sites where
writeback error is not handled so that they don't clear error status.

Signed-off-by: Jun'ichi Nomura &lt;j-nomura@ce.jp.nec.com&gt;
Acked-by: Andi Kleen &lt;ak@linux.intel.com&gt;
Reviewed-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Fengguang Wu &lt;fengguang.wu@gmail.com&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in bdi_split_work_to_wbs()</title>
<updated>2015-10-28T12:17:30+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2015-10-27T05:19:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=b33e18f61bd18227a456016a77b1a968f5bc1d65'/>
<id>b33e18f61bd18227a456016a77b1a968f5bc1d65</id>
<content type='text'>
bdi_split_work_to_wbs() uses list_for_each_entry_rcu_continue()
to walk @bdi-&gt;wb_list.  To set up the initial iteration
condition, it uses list_entry_rcu() to calculate the entry
pointer corresponding to the list head; however, this isn't an
actual RCU dereference and using list_entry_rcu() for it ended
up breaking a proposed list_entry_rcu() change because it was
feeding an non-lvalue pointer into the macro.

Don't use the RCU variant for simple pointer offsetting.  Use
list_entry() instead.

Reported-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Darren Hart &lt;dvhart@linux.intel.com&gt;
Cc: David Howells &lt;dhowells@redhat.com&gt;
Cc: Dipankar Sarma &lt;dipankar@in.ibm.com&gt;
Cc: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: Josh Triplett &lt;josh@joshtriplett.org&gt;
Cc: Lai Jiangshan &lt;jiangshanlai@gmail.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Patrick Marlier &lt;patrick.marlier@gmail.com&gt;
Cc: Paul McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: pranith kumar &lt;bobby.prani@gmail.com&gt;
Link: http://lkml.kernel.org/r/20151027051939.GA19355@mtj.duckdns.org
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
bdi_split_work_to_wbs() uses list_for_each_entry_rcu_continue()
to walk @bdi-&gt;wb_list.  To set up the initial iteration
condition, it uses list_entry_rcu() to calculate the entry
pointer corresponding to the list head; however, this isn't an
actual RCU dereference and using list_entry_rcu() for it ended
up breaking a proposed list_entry_rcu() change because it was
feeding an non-lvalue pointer into the macro.

Don't use the RCU variant for simple pointer offsetting.  Use
list_entry() instead.

Reported-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Darren Hart &lt;dvhart@linux.intel.com&gt;
Cc: David Howells &lt;dhowells@redhat.com&gt;
Cc: Dipankar Sarma &lt;dipankar@in.ibm.com&gt;
Cc: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: Josh Triplett &lt;josh@joshtriplett.org&gt;
Cc: Lai Jiangshan &lt;jiangshanlai@gmail.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Patrick Marlier &lt;patrick.marlier@gmail.com&gt;
Cc: Paul McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: pranith kumar &lt;bobby.prani@gmail.com&gt;
Link: http://lkml.kernel.org/r/20151027051939.GA19355@mtj.duckdns.org
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
