<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/fs/pipe.c, branch v3.2.83</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>pipe: Fix buffer offset after partially failed read</title>
<updated>2016-02-27T14:28:49+00:00</updated>
<author>
<name>Ben Hutchings</name>
<email>ben@decadent.org.uk</email>
</author>
<published>2016-02-13T02:34:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=feae3ca2e5e1a8f44aa6290255d3d9709985d0b2'/>
<id>feae3ca2e5e1a8f44aa6290255d3d9709985d0b2</id>
<content type='text'>
Quoting the RHEL advisory:

&gt; It was found that the fix for CVE-2015-1805 incorrectly kept buffer
&gt; offset and buffer length in sync on a failed atomic read, potentially
&gt; resulting in a pipe buffer state corruption. A local, unprivileged user
&gt; could use this flaw to crash the system or leak kernel memory to user
&gt; space. (CVE-2016-0774, Moderate)

The same flawed fix was applied to stable branches from 2.6.32.y to
3.14.y inclusive, and I was able to reproduce the issue on 3.2.y.
We need to give pipe_iov_copy_to_user() a separate offset variable
and only update the buffer offset if it succeeds.

References: https://rhn.redhat.com/errata/RHSA-2016-0103.html
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Quoting the RHEL advisory:

&gt; It was found that the fix for CVE-2015-1805 incorrectly kept buffer
&gt; offset and buffer length in sync on a failed atomic read, potentially
&gt; resulting in a pipe buffer state corruption. A local, unprivileged user
&gt; could use this flaw to crash the system or leak kernel memory to user
&gt; space. (CVE-2016-0774, Moderate)

The same flawed fix was applied to stable branches from 2.6.32.y to
3.14.y inclusive, and I was able to reproduce the issue on 3.2.y.
We need to give pipe_iov_copy_to_user() a separate offset variable
and only update the buffer offset if it succeeds.

References: https://rhn.redhat.com/errata/RHSA-2016-0103.html
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>pipe: limit the per-user amount of pages allocated in pipes</title>
<updated>2016-02-27T14:28:49+00:00</updated>
<author>
<name>Willy Tarreau</name>
<email>w@1wt.eu</email>
</author>
<published>2016-01-18T15:36:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=92375b85b70395c8180991084c05e8d78e55d066'/>
<id>92375b85b70395c8180991084c05e8d78e55d066</id>
<content type='text'>
commit 759c01142a5d0f364a462346168a56de28a80f52 upstream.

On no-so-small systems, it is possible for a single process to cause an
OOM condition by filling large pipes with data that are never read. A
typical process filling 4000 pipes with 1 MB of data will use 4 GB of
memory. On small systems it may be tricky to set the pipe max size to
prevent this from happening.

This patch makes it possible to enforce a per-user soft limit above
which new pipes will be limited to a single page, effectively limiting
them to 4 kB each, as well as a hard limit above which no new pipes may
be created for this user. This has the effect of protecting the system
against memory abuse without hurting other users, and still allowing
pipes to work correctly though with less data at once.

The limit are controlled by two new sysctls : pipe-user-pages-soft, and
pipe-user-pages-hard. Both may be disabled by setting them to zero. The
default soft limit allows the default number of FDs per process (1024)
to create pipes of the default size (64kB), thus reaching a limit of 64MB
before starting to create only smaller pipes. With 256 processes limited
to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB =
1084 MB of memory allocated for a user. The hard limit is disabled by
default to avoid breaking existing applications that make intensive use
of pipes (eg: for splicing).

Reported-by: socketpair@gmail.com
Reported-by: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
Mitigates: CVE-2013-4312 (Linux 2.0+)
Suggested-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Willy Tarreau &lt;w@1wt.eu&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 759c01142a5d0f364a462346168a56de28a80f52 upstream.

On no-so-small systems, it is possible for a single process to cause an
OOM condition by filling large pipes with data that are never read. A
typical process filling 4000 pipes with 1 MB of data will use 4 GB of
memory. On small systems it may be tricky to set the pipe max size to
prevent this from happening.

This patch makes it possible to enforce a per-user soft limit above
which new pipes will be limited to a single page, effectively limiting
them to 4 kB each, as well as a hard limit above which no new pipes may
be created for this user. This has the effect of protecting the system
against memory abuse without hurting other users, and still allowing
pipes to work correctly though with less data at once.

The limit are controlled by two new sysctls : pipe-user-pages-soft, and
pipe-user-pages-hard. Both may be disabled by setting them to zero. The
default soft limit allows the default number of FDs per process (1024)
to create pipes of the default size (64kB), thus reaching a limit of 64MB
before starting to create only smaller pipes. With 256 processes limited
to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB =
1084 MB of memory allocated for a user. The hard limit is disabled by
default to avoid breaking existing applications that make intensive use
of pipes (eg: for splicing).

Reported-by: socketpair@gmail.com
Reported-by: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
Mitigates: CVE-2013-4312 (Linux 2.0+)
Suggested-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Willy Tarreau &lt;w@1wt.eu&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>pipe: iovec: Fix memory corruption when retrying atomic copy as non-atomic</title>
<updated>2015-08-06T23:32:15+00:00</updated>
<author>
<name>Ben Hutchings</name>
<email>ben@decadent.org.uk</email>
</author>
<published>2015-06-15T02:51:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=75cf667b7fac08a7b21694adca7dff07361be68a'/>
<id>75cf667b7fac08a7b21694adca7dff07361be68a</id>
<content type='text'>
pipe_iov_copy_{from,to}_user() may be tried twice with the same iovec,
the first time atomically and the second time not.  The second attempt
needs to continue from the iovec position, pipe buffer offset and
remaining length where the first attempt failed, but currently the
pipe buffer offset and remaining length are reset.  This will corrupt
the piped data (possibly also leading to an information leak between
processes) and may also corrupt kernel memory.

This was fixed upstream by commits f0d1bec9d58d ("new helper:
copy_page_from_iter()") and 637b58c2887e ("switch pipe_read() to
copy_page_to_iter()"), but those aren't suitable for stable.  This fix
for older kernel versions was made by Seth Jennings for RHEL and I
have extracted it from their update.

CVE-2015-1805

References: https://bugzilla.redhat.com/show_bug.cgi?id=1202855
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
pipe_iov_copy_{from,to}_user() may be tried twice with the same iovec,
the first time atomically and the second time not.  The second attempt
needs to continue from the iovec position, pipe buffer offset and
remaining length where the first attempt failed, but currently the
pipe buffer offset and remaining length are reset.  This will corrupt
the piped data (possibly also leading to an information leak between
processes) and may also corrupt kernel memory.

This was fixed upstream by commits f0d1bec9d58d ("new helper:
copy_page_from_iter()") and 637b58c2887e ("switch pipe_read() to
copy_page_to_iter()"), but those aren't suitable for stable.  This fix
for older kernel versions was made by Seth Jennings for RHEL and I
have extracted it from their update.

CVE-2015-1805

References: https://bugzilla.redhat.com/show_bug.cgi?id=1202855
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>vfs: fix pipe counter breakage</title>
<updated>2013-03-20T15:03:29+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@ZenIV.linux.org.uk</email>
</author>
<published>2013-03-12T02:59:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=0edb34d87ec58f4f1f9309e4ca9d0fa85be474c3'/>
<id>0edb34d87ec58f4f1f9309e4ca9d0fa85be474c3</id>
<content type='text'>
commit a930d8790552658140d7d0d2e316af4f0d76a512 upstream.

If you open a pipe for neither read nor write, the pipe code will not
add any usage counters to the pipe, causing the 'struct pipe_inode_info"
to be potentially released early.

That doesn't normally matter, since you cannot actually use the pipe,
but the pipe release code - particularly fasync handling - still expects
the actual pipe infrastructure to all be there.  And rather than adding
NULL pointer checks, let's just disallow this case, the same way we
already do for the named pipe ("fifo") case.

This is ancient going back to pre-2.4 days, and until trinity, nobody
naver noticed.

Reported-by: Dave Jones &lt;davej@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit a930d8790552658140d7d0d2e316af4f0d76a512 upstream.

If you open a pipe for neither read nor write, the pipe code will not
add any usage counters to the pipe, causing the 'struct pipe_inode_info"
to be potentially released early.

That doesn't normally matter, since you cannot actually use the pipe,
but the pipe release code - particularly fasync handling - still expects
the actual pipe infrastructure to all be there.  And rather than adding
NULL pointer checks, let's just disallow this case, the same way we
already do for the named pipe ("fifo") case.

This is ancient going back to pre-2.4 days, and until trinity, nobody
naver noticed.

Reported-by: Dave Jones &lt;davej@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>pipes: add a "packetized pipe" mode for writing</title>
<updated>2012-05-11T12:14:43+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2012-04-29T20:12:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=39bcf98adc9432136e3b6a54881dcf703b7fcb63'/>
<id>39bcf98adc9432136e3b6a54881dcf703b7fcb63</id>
<content type='text'>
commit 9883035ae7edef3ec62ad215611cb8e17d6a1a5d upstream.

The actual internal pipe implementation is already really about
individual packets (called "pipe buffers"), and this simply exposes that
as a special packetized mode.

When we are in the packetized mode (marked by O_DIRECT as suggested by
Alan Cox), a write() on a pipe will not merge the new data with previous
writes, so each write will get a pipe buffer of its own.  The pipe
buffer is then marked with the PIPE_BUF_FLAG_PACKET flag, which in turn
will tell the reader side to break the read at that boundary (and throw
away any partial packet contents that do not fit in the read buffer).

End result: as long as you do writes less than PIPE_BUF in size (so that
the pipe doesn't have to split them up), you can now treat the pipe as a
packet interface, where each read() system call will read one packet at
a time.  You can just use a sufficiently big read buffer (PIPE_BUF is
sufficient, since bigger than that doesn't guarantee atomicity anyway),
and the return value of the read() will naturally give you the size of
the packet.

NOTE! We do not support zero-sized packets, and zero-sized reads and
writes to a pipe continue to be no-ops.  Also note that big packets will
currently be split at write time, but that the size at which that
happens is not really specified (except that it's bigger than PIPE_BUF).
Currently that limit is the system page size, but we might want to
explicitly support bigger packets some day.

The main user for this is going to be the autofs packet interface,
allowing us to stop having to care so deeply about exact packet sizes
(which have had bugs with 32/64-bit compatibility modes).  But user
space can create packetized pipes with "pipe2(fd, O_DIRECT)", which will
fail with an EINVAL on kernels that do not support this interface.

Tested-by: Michael Tokarev &lt;mjt@tls.msk.ru&gt;
Cc: Alan Cox &lt;alan@lxorguk.ukuu.org.uk&gt;
Cc: David Miller &lt;davem@davemloft.net&gt;
Cc: Ian Kent &lt;raven@themaw.net&gt;
Cc: Thomas Meyer &lt;thomas@m3y3r.de&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 9883035ae7edef3ec62ad215611cb8e17d6a1a5d upstream.

The actual internal pipe implementation is already really about
individual packets (called "pipe buffers"), and this simply exposes that
as a special packetized mode.

When we are in the packetized mode (marked by O_DIRECT as suggested by
Alan Cox), a write() on a pipe will not merge the new data with previous
writes, so each write will get a pipe buffer of its own.  The pipe
buffer is then marked with the PIPE_BUF_FLAG_PACKET flag, which in turn
will tell the reader side to break the read at that boundary (and throw
away any partial packet contents that do not fit in the read buffer).

End result: as long as you do writes less than PIPE_BUF in size (so that
the pipe doesn't have to split them up), you can now treat the pipe as a
packet interface, where each read() system call will read one packet at
a time.  You can just use a sufficiently big read buffer (PIPE_BUF is
sufficient, since bigger than that doesn't guarantee atomicity anyway),
and the return value of the read() will naturally give you the size of
the packet.

NOTE! We do not support zero-sized packets, and zero-sized reads and
writes to a pipe continue to be no-ops.  Also note that big packets will
currently be split at write time, but that the size at which that
happens is not really specified (except that it's bigger than PIPE_BUF).
Currently that limit is the system page size, but we might want to
explicitly support bigger packets some day.

The main user for this is going to be the autofs packet interface,
allowing us to stop having to care so deeply about exact packet sizes
(which have had bugs with 32/64-bit compatibility modes).  But user
space can create packetized pipes with "pipe2(fd, O_DIRECT)", which will
fail with an EINVAL on kernels that do not support this interface.

Tested-by: Michael Tokarev &lt;mjt@tls.msk.ru&gt;
Cc: Alan Cox &lt;alan@lxorguk.ukuu.org.uk&gt;
Cc: David Miller &lt;davem@davemloft.net&gt;
Cc: Ian Kent &lt;raven@themaw.net&gt;
Cc: Thomas Meyer &lt;thomas@m3y3r.de&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>fs/pipe.c: add -&gt;statfs callback for pipefs</title>
<updated>2011-11-01T00:30:51+00:00</updated>
<author>
<name>Pavel Emelyanov</name>
<email>xemul@parallels.com</email>
</author>
<published>2011-11-01T00:10:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=d70ef97baf048412c395bb5d65791d8fe133a52b'/>
<id>d70ef97baf048412c395bb5d65791d8fe133a52b</id>
<content type='text'>
Currently a statfs on a pipe's /proc/&lt;pid&gt;/fd/ link returns -ENOSYS.  Wire
pipfs up so that the statfs succeeds.

This is required by checkpoint-restart in the userspace to make it
possible to distinguish pipes from fifos.

When we dump information about task's open files we use the /proc/pid/fd
directoy's symlinks and the fact that opening any of them gives us exactly
the same dentry-&gt;inode pair as the original process has.  Now if a task
we're dumping has opened pipe and fifo we need to detect this and act
accordingly.  Knowing that an fd with type S_ISFIFO resides on a pipefs is
the most precise way.

Signed-off-by: Pavel Emelyanov &lt;xemul@parallels.com&gt;
Reviewed-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Serge Hallyn &lt;serge.hallyn@canonical.com&gt;
Signed-off-by: Cyrill Gorcunov &lt;gorcunov@openvz.org&gt;
Cc: Al Viro &lt;viro@ZenIV.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Currently a statfs on a pipe's /proc/&lt;pid&gt;/fd/ link returns -ENOSYS.  Wire
pipfs up so that the statfs succeeds.

This is required by checkpoint-restart in the userspace to make it
possible to distinguish pipes from fifos.

When we dump information about task's open files we use the /proc/pid/fd
directoy's symlinks and the fact that opening any of them gives us exactly
the same dentry-&gt;inode pair as the original process has.  Now if a task
we're dumping has opened pipe and fifo we need to detect this and act
accordingly.  Knowing that an fd with type S_ISFIFO resides on a pipefs is
the most precise way.

Signed-off-by: Pavel Emelyanov &lt;xemul@parallels.com&gt;
Reviewed-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Serge Hallyn &lt;serge.hallyn@canonical.com&gt;
Signed-off-by: Cyrill Gorcunov &lt;gorcunov@openvz.org&gt;
Cc: Al Viro &lt;viro@ZenIV.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>vfs: dont chain pipe/anon/socket on superblock s_inodes list</title>
<updated>2011-07-26T16:57:09+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>eric.dumazet@gmail.com</email>
</author>
<published>2011-07-26T09:36:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=a209dfc7b0d94bd6fa94553c097836a2e6d0f0ba'/>
<id>a209dfc7b0d94bd6fa94553c097836a2e6d0f0ba</id>
<content type='text'>
Workloads using pipes and sockets hit inode_sb_list_lock contention.

superblock s_inodes list is needed for quota, dirty, pagecache and
fsnotify management. pipe/anon/socket fs are clearly not candidates for
these.

Signed-off-by: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Workloads using pipes and sockets hit inode_sb_list_lock contention.

superblock s_inodes list is needed for quota, dirty, pagecache and
fsnotify management. pipe/anon/socket fs are clearly not candidates for
these.

Signed-off-by: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>VFS : mount lock scalability for internal mounts</title>
<updated>2011-07-24T14:08:32+00:00</updated>
<author>
<name>Tim Chen</name>
<email>tim.c.chen@linux.intel.com</email>
</author>
<published>2011-07-19T16:32:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=423e0ab086ad8b33626e45fa94ac7613146b7ffa'/>
<id>423e0ab086ad8b33626e45fa94ac7613146b7ffa</id>
<content type='text'>
For a number of file systems that don't have a mount point (e.g. sockfs
and pipefs), they are not marked as long term. Therefore in
mntput_no_expire, all locks in vfs_mount lock are taken instead of just
local cpu's lock to aggregate reference counts when we release
reference to file objects.  In fact, only local lock need to have been
taken to update ref counts as these file systems are in no danger of
going away until we are ready to unregister them.

The attached patch marks file systems using kern_mount without
mount point as long term.  The contentions of vfs_mount lock
is now eliminated.  Before un-registering such file system,
kern_unmount should be called to remove the long term flag and
make the mount point ready to be freed.

Signed-off-by: Tim Chen &lt;tim.c.chen@linux.intel.com&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
For a number of file systems that don't have a mount point (e.g. sockfs
and pipefs), they are not marked as long term. Therefore in
mntput_no_expire, all locks in vfs_mount lock are taken instead of just
local cpu's lock to aggregate reference counts when we release
reference to file objects.  In fact, only local lock need to have been
taken to update ref counts as these file systems are in no danger of
going away until we are ready to unregister them.

The attached patch marks file systems using kern_mount without
mount point as long term.  The contentions of vfs_mount lock
is now eliminated.  Before un-registering such file system,
kern_unmount should be called to remove the long term flag and
make the mount point ready to be freed.

Signed-off-by: Tim Chen &lt;tim.c.chen@linux.intel.com&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Fix broken "pipe: use event aware wakeups" optimization</title>
<updated>2011-01-21T00:21:59+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2011-01-21T00:21:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=28e58ee8ce1f0e69c207f747b7b9054b071e328d'/>
<id>28e58ee8ce1f0e69c207f747b7b9054b071e328d</id>
<content type='text'>
Commit e462c448fdc8 ("pipe: use event aware wakeups") optimized the pipe
event wakeup calls to avoid wakeups if the events do not match the
requested set.

However, the optimization was buggy, in that it didn't actually use the
correct sets for the events: when we make room for more data to be
written, the pipe poll() routine will return both the POLLOUT _and_
POLLWRNORM bits.  Similarly for read.

And most critically, when a pipe is released, that will potentially
result in POLLHUP|POLLERR (depending on whether it was the last reader
or writer), not just the regular POLLIN|POLLOUT.

This bug showed itself as a hung gnome-screensaver-dialog process, stuck
forever (or at least until it was poked by a signal or by being traced)
in a poll() system call.

Cc: Davide Libenzi &lt;davidel@xmailserver.org&gt;
Cc: David S. Miller &lt;davem@davemloft.net&gt;
Cc: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Commit e462c448fdc8 ("pipe: use event aware wakeups") optimized the pipe
event wakeup calls to avoid wakeups if the events do not match the
requested set.

However, the optimization was buggy, in that it didn't actually use the
correct sets for the events: when we make room for more data to be
written, the pipe poll() routine will return both the POLLOUT _and_
POLLWRNORM bits.  Similarly for read.

And most critically, when a pipe is released, that will potentially
result in POLLHUP|POLLERR (depending on whether it was the last reader
or writer), not just the regular POLLIN|POLLOUT.

This bug showed itself as a hung gnome-screensaver-dialog process, stuck
forever (or at least until it was poked by a signal or by being traced)
in a poll() system call.

Cc: Davide Libenzi &lt;davidel@xmailserver.org&gt;
Cc: David S. Miller &lt;davem@davemloft.net&gt;
Cc: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sanitize vfsmount refcounting changes</title>
<updated>2011-01-16T18:47:07+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2011-01-15T03:30:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=f03c65993b98eeb909a4012ce7833c5857d74755'/>
<id>f03c65993b98eeb909a4012ce7833c5857d74755</id>
<content type='text'>
Instead of splitting refcount between (per-cpu) mnt_count
and (SMP-only) mnt_longrefs, make all references contribute
to mnt_count again and keep track of how many are longterm
ones.

Accounting rules for longterm count:
	* 1 for each fs_struct.root.mnt
	* 1 for each fs_struct.pwd.mnt
	* 1 for having non-NULL -&gt;mnt_ns
	* decrement to 0 happens only under vfsmount lock exclusive

That allows nice common case for mntput() - since we can't drop the
final reference until after mnt_longterm has reached 0 due to the rules
above, mntput() can grab vfsmount lock shared and check mnt_longterm.
If it turns out to be non-zero (which is the common case), we know
that this is not the final mntput() and can just blindly decrement
percpu mnt_count.  Otherwise we grab vfsmount lock exclusive and
do usual decrement-and-check of percpu mnt_count.

For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
namespace.c uses the latter in places where we don't already hold
vfsmount lock exclusive and opencodes a few remaining spots where
we need to manipulate mnt_longterm.

Note that we mostly revert the code outside of fs/namespace.c back
to what we used to have; in particular, normal code doesn't need
to care about two kinds of references, etc.  And we get to keep
the optimization Nick's variant had bought us...

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Instead of splitting refcount between (per-cpu) mnt_count
and (SMP-only) mnt_longrefs, make all references contribute
to mnt_count again and keep track of how many are longterm
ones.

Accounting rules for longterm count:
	* 1 for each fs_struct.root.mnt
	* 1 for each fs_struct.pwd.mnt
	* 1 for having non-NULL -&gt;mnt_ns
	* decrement to 0 happens only under vfsmount lock exclusive

That allows nice common case for mntput() - since we can't drop the
final reference until after mnt_longterm has reached 0 due to the rules
above, mntput() can grab vfsmount lock shared and check mnt_longterm.
If it turns out to be non-zero (which is the common case), we know
that this is not the final mntput() and can just blindly decrement
percpu mnt_count.  Otherwise we grab vfsmount lock exclusive and
do usual decrement-and-check of percpu mnt_count.

For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
namespace.c uses the latter in places where we don't already hold
vfsmount lock exclusive and opencodes a few remaining spots where
we need to manipulate mnt_longterm.

Note that we mostly revert the code outside of fs/namespace.c back
to what we used to have; in particular, normal code doesn't need
to care about two kinds of references, etc.  And we get to keep
the optimization Nick's variant had bought us...

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</pre>
</div>
</content>
</entry>
</feed>
