<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/fs/splice.c, branch v5.10</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>Merge tag 'io_uring-5.10-2020-10-24' of git://git.kernel.dk/linux-block</title>
<updated>2020-10-24T19:40:18+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-10-24T19:40:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=af0041875ce7f5a05362b884e90cf82c27876096'/>
<id>af0041875ce7f5a05362b884e90cf82c27876096</id>
<content type='text'>
Pull io_uring fixes from Jens Axboe:

 - fsize was missed in previous unification of work flags

 - Few fixes cleaning up the flags unification creds cases (Pavel)

 - Fix NUMA affinities for completely unplugged/replugged node for io-wq

 - Two fallout fixes from the set_fs changes. One local to io_uring, one
   for the splice entry point that io_uring uses.

 - Linked timeout fixes (Pavel)

 - Removal of -&gt;flush() -&gt;files work-around that we don't need anymore
   with referenced files (Pavel)

 - Various cleanups (Pavel)

* tag 'io_uring-5.10-2020-10-24' of git://git.kernel.dk/linux-block:
  splice: change exported internal do_splice() helper to take kernel offset
  io_uring: make loop_rw_iter() use original user supplied pointers
  io_uring: remove req cancel in -&gt;flush()
  io-wq: re-set NUMA node affinities if CPUs come online
  io_uring: don't reuse linked_timeout
  io_uring: unify fsize with def-&gt;work_flags
  io_uring: fix racy REQ_F_LINK_TIMEOUT clearing
  io_uring: do poll's hash_node init in common code
  io_uring: inline io_poll_task_handler()
  io_uring: remove extra -&gt;file check in poll prep
  io_uring: make cached_cq_overflow non atomic_t
  io_uring: inline io_fail_links()
  io_uring: kill ref get/drop in personality init
  io_uring: flags-based creds init in queue
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull io_uring fixes from Jens Axboe:

 - fsize was missed in previous unification of work flags

 - Few fixes cleaning up the flags unification creds cases (Pavel)

 - Fix NUMA affinities for completely unplugged/replugged node for io-wq

 - Two fallout fixes from the set_fs changes. One local to io_uring, one
   for the splice entry point that io_uring uses.

 - Linked timeout fixes (Pavel)

 - Removal of -&gt;flush() -&gt;files work-around that we don't need anymore
   with referenced files (Pavel)

 - Various cleanups (Pavel)

* tag 'io_uring-5.10-2020-10-24' of git://git.kernel.dk/linux-block:
  splice: change exported internal do_splice() helper to take kernel offset
  io_uring: make loop_rw_iter() use original user supplied pointers
  io_uring: remove req cancel in -&gt;flush()
  io-wq: re-set NUMA node affinities if CPUs come online
  io_uring: don't reuse linked_timeout
  io_uring: unify fsize with def-&gt;work_flags
  io_uring: fix racy REQ_F_LINK_TIMEOUT clearing
  io_uring: do poll's hash_node init in common code
  io_uring: inline io_poll_task_handler()
  io_uring: remove extra -&gt;file check in poll prep
  io_uring: make cached_cq_overflow non atomic_t
  io_uring: inline io_fail_links()
  io_uring: kill ref get/drop in personality init
  io_uring: flags-based creds init in queue
</pre>
</div>
</content>
</entry>
<entry>
<title>splice: change exported internal do_splice() helper to take kernel offset</title>
<updated>2020-10-22T20:15:51+00:00</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2020-10-22T20:15:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=ee6e00c868221f5f7d0b6eb4e8379a148e26bc20'/>
<id>ee6e00c868221f5f7d0b6eb4e8379a148e26bc20</id>
<content type='text'>
With the set_fs change, we can no longer rely on copy_{to,from}_user()
accepting a kernel pointer, and it was bad form to do so anyway. Clean
this up and change the internal helper that io_uring uses to deal with
kernel pointers instead. This puts the offset copy in/out in __do_splice()
instead, which just calls the same helper.

Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
With the set_fs change, we can no longer rely on copy_{to,from}_user()
accepting a kernel pointer, and it was bad form to do so anyway. Clean
this up and change the internal helper that io_uring uses to deal with
kernel pointers instead. This puts the offset copy in/out in __do_splice()
instead, which just calls the same helper.

Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs</title>
<updated>2020-10-22T16:59:21+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-10-22T16:59:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=f56e65dff6ad52395ef45738799b4fb70ff43376'/>
<id>f56e65dff6ad52395ef45738799b4fb70ff43376</id>
<content type='text'>
Pull initial set_fs() removal from Al Viro:
 "Christoph's set_fs base series + fixups"

* 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: Allow a NULL pos pointer to __kernel_read
  fs: Allow a NULL pos pointer to __kernel_write
  powerpc: remove address space overrides using set_fs()
  powerpc: use non-set_fs based maccess routines
  x86: remove address space overrides using set_fs()
  x86: make TASK_SIZE_MAX usable from assembly code
  x86: move PAGE_OFFSET, TASK_SIZE &amp; friends to page_{32,64}_types.h
  lkdtm: remove set_fs-based tests
  test_bitmap: remove user bitmap tests
  uaccess: add infrastructure for kernel builds with set_fs()
  fs: don't allow splice read/write without explicit ops
  fs: don't allow kernel reads and writes without iter ops
  sysctl: Convert to iter interfaces
  proc: add a read_iter method to proc proc_ops
  proc: cleanup the compat vs no compat file ops
  proc: remove a level of indentation in proc_get_inode
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull initial set_fs() removal from Al Viro:
 "Christoph's set_fs base series + fixups"

* 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: Allow a NULL pos pointer to __kernel_read
  fs: Allow a NULL pos pointer to __kernel_write
  powerpc: remove address space overrides using set_fs()
  powerpc: use non-set_fs based maccess routines
  x86: remove address space overrides using set_fs()
  x86: make TASK_SIZE_MAX usable from assembly code
  x86: move PAGE_OFFSET, TASK_SIZE &amp; friends to page_{32,64}_types.h
  lkdtm: remove set_fs-based tests
  test_bitmap: remove user bitmap tests
  uaccess: add infrastructure for kernel builds with set_fs()
  fs: don't allow splice read/write without explicit ops
  fs: don't allow kernel reads and writes without iter ops
  sysctl: Convert to iter interfaces
  proc: add a read_iter method to proc proc_ops
  proc: cleanup the compat vs no compat file ops
  proc: remove a level of indentation in proc_get_inode
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs</title>
<updated>2020-10-12T23:35:51+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-10-12T23:35:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=85ed13e78dbedf9433115a62c85429922bc5035c'/>
<id>85ed13e78dbedf9433115a62c85429922bc5035c</id>
<content type='text'>
Pull compat iovec cleanups from Al Viro:
 "Christoph's series around import_iovec() and compat variant thereof"

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  security/keys: remove compat_keyctl_instantiate_key_iov
  mm: remove compat_process_vm_{readv,writev}
  fs: remove compat_sys_vmsplice
  fs: remove the compat readv/writev syscalls
  fs: remove various compat readv/writev helpers
  iov_iter: transparently handle compat iovecs in import_iovec
  iov_iter: refactor rw_copy_check_uvector and import_iovec
  iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c
  compat.h: fix a spelling error in &lt;linux/compat.h&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull compat iovec cleanups from Al Viro:
 "Christoph's series around import_iovec() and compat variant thereof"

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  security/keys: remove compat_keyctl_instantiate_key_iov
  mm: remove compat_process_vm_{readv,writev}
  fs: remove compat_sys_vmsplice
  fs: remove the compat readv/writev syscalls
  fs: remove various compat readv/writev helpers
  iov_iter: transparently handle compat iovecs in import_iovec
  iov_iter: refactor rw_copy_check_uvector and import_iovec
  iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c
  compat.h: fix a spelling error in &lt;linux/compat.h&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>splice: teach splice pipe reading about empty pipe buffers</title>
<updated>2020-10-06T17:27:22+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-10-05T18:26:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=d1a819a2ec2d3b5e6a8f8a9f67386bda0ad315bc'/>
<id>d1a819a2ec2d3b5e6a8f8a9f67386bda0ad315bc</id>
<content type='text'>
Tetsuo Handa reports that splice() can return 0 before the real EOF, if
the data in the splice source pipe is an empty pipe buffer.  That empty
pipe buffer case doesn't happen in any normal situation, but you can
trigger it by doing a write to a pipe that fails due to a page fault.

Tetsuo has a test-case to show the behavior:

  #define _GNU_SOURCE
  #include &lt;sys/types.h&gt;
  #include &lt;sys/stat.h&gt;
  #include &lt;fcntl.h&gt;
  #include &lt;unistd.h&gt;

  int main(int argc, char *argv[])
  {
	const int fd = open("/tmp/testfile", O_WRONLY | O_CREAT, 0600);
	int pipe_fd[2] = { -1, -1 };
	pipe(pipe_fd);
	write(pipe_fd[1], NULL, 4096);
	/* This splice() should wait unless interrupted. */
	return !splice(pipe_fd[0], NULL, fd, NULL, 65536, 0);
  }

which results in

    write(5, NULL, 4096)                    = -1 EFAULT (Bad address)
    splice(4, NULL, 3, NULL, 65536, 0)      = 0

and this can confuse splice() users into believing they have hit EOF
prematurely.

The issue was introduced when the pipe write code started pre-allocating
the pipe buffers before copying data from user space.

This is modified verion of Tetsuo's original patch.

Fixes: a194dfe6e6f6 ("pipe: Rearrange sequence in pipe_write() to preallocate slot")
Link:https://lore.kernel.org/linux-fsdevel/20201005121339.4063-1-penguin-kernel@I-love.SAKURA.ne.jp/
Reported-by: Tetsuo Handa &lt;penguin-kernel@i-love.sakura.ne.jp&gt;
Acked-by: Tetsuo Handa &lt;penguin-kernel@i-love.sakura.ne.jp&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Tetsuo Handa reports that splice() can return 0 before the real EOF, if
the data in the splice source pipe is an empty pipe buffer.  That empty
pipe buffer case doesn't happen in any normal situation, but you can
trigger it by doing a write to a pipe that fails due to a page fault.

Tetsuo has a test-case to show the behavior:

  #define _GNU_SOURCE
  #include &lt;sys/types.h&gt;
  #include &lt;sys/stat.h&gt;
  #include &lt;fcntl.h&gt;
  #include &lt;unistd.h&gt;

  int main(int argc, char *argv[])
  {
	const int fd = open("/tmp/testfile", O_WRONLY | O_CREAT, 0600);
	int pipe_fd[2] = { -1, -1 };
	pipe(pipe_fd);
	write(pipe_fd[1], NULL, 4096);
	/* This splice() should wait unless interrupted. */
	return !splice(pipe_fd[0], NULL, fd, NULL, 65536, 0);
  }

which results in

    write(5, NULL, 4096)                    = -1 EFAULT (Bad address)
    splice(4, NULL, 3, NULL, 65536, 0)      = 0

and this can confuse splice() users into believing they have hit EOF
prematurely.

The issue was introduced when the pipe write code started pre-allocating
the pipe buffers before copying data from user space.

This is modified verion of Tetsuo's original patch.

Fixes: a194dfe6e6f6 ("pipe: Rearrange sequence in pipe_write() to preallocate slot")
Link:https://lore.kernel.org/linux-fsdevel/20201005121339.4063-1-penguin-kernel@I-love.SAKURA.ne.jp/
Reported-by: Tetsuo Handa &lt;penguin-kernel@i-love.sakura.ne.jp&gt;
Acked-by: Tetsuo Handa &lt;penguin-kernel@i-love.sakura.ne.jp&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>fs: remove compat_sys_vmsplice</title>
<updated>2020-10-03T04:02:15+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2020-09-25T04:51:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=598b3cec831fd6ccb3cbe4919a722e868c6364a8'/>
<id>598b3cec831fd6ccb3cbe4919a722e868c6364a8</id>
<content type='text'>
Now that import_iovec handles compat iovecs, the native vmsplice syscall
can be used for the compat case as well.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Now that import_iovec handles compat iovecs, the native vmsplice syscall
can be used for the compat case as well.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>iov_iter: transparently handle compat iovecs in import_iovec</title>
<updated>2020-10-03T04:02:13+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2020-09-25T04:51:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=89cd35c58bc2e36bfdc23dde67a429b08cf4ae03'/>
<id>89cd35c58bc2e36bfdc23dde67a429b08cf4ae03</id>
<content type='text'>
Use in compat_syscall to import either native or the compat iovecs, and
remove the now superflous compat_import_iovec.

This removes the need for special compat logic in most callers, and
the remaining ones can still be simplified by using __import_iovec
with a bool compat parameter.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Use in compat_syscall to import either native or the compat iovecs, and
remove the now superflous compat_import_iovec.

This removes the need for special compat logic in most callers, and
the remaining ones can still be simplified by using __import_iovec
with a bool compat parameter.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>pipe: remove pipe_wait() and fix wakeup race with splice</title>
<updated>2020-10-02T02:14:36+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-10-02T02:14:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=472e5b056f000a778abb41f1e443de58eb259783'/>
<id>472e5b056f000a778abb41f1e443de58eb259783</id>
<content type='text'>
The pipe splice code still used the old model of waiting for pipe IO by
using a non-specific "pipe_wait()" that waited for any pipe event to
happen, which depended on all pipe IO being entirely serialized by the
pipe lock.  So by checking the state you were waiting for, and then
adding yourself to the wait queue before dropping the lock, you were
guaranteed to see all the wakeups.

Strictly speaking, the actual wakeups were not done under the lock, but
the pipe_wait() model still worked, because since the waiter held the
lock when checking whether it should sleep, it would always see the
current state, and the wakeup was always done after updating the state.

However, commit 0ddad21d3e99 ("pipe: use exclusive waits when reading or
writing") split the single wait-queue into two, and in the process also
made the "wait for event" code wait for _two_ wait queues, and that then
showed a race with the wakers that were not serialized by the pipe lock.

It's only splice that used that "pipe_wait()" model, so the problem
wasn't obvious, but Josef Bacik reports:

 "I hit a hang with fstest btrfs/187, which does a btrfs send into
  /dev/null. This works by creating a pipe, the write side is given to
  the kernel to write into, and the read side is handed to a thread that
  splices into a file, in this case /dev/null.

  The box that was hung had the write side stuck here [pipe_write] and
  the read side stuck here [splice_from_pipe_next -&gt; pipe_wait].

  [ more details about pipe_wait() scenario ]

  The problem is we're doing the prepare_to_wait, which sets our state
  each time, however we can be woken up either with reads or writes. In
  the case above we race with the WRITER waking us up, and re-set our
  state to INTERRUPTIBLE, and thus never break out of schedule"

Josef had a patch that avoided the issue in pipe_wait() by just making
it set the state only once, but the deeper problem is that pipe_wait()
depends on a level of synchonization by the pipe mutex that it really
shouldn't.  And the whole "wait for any pipe state change" model really
isn't very good to begin with.

So rather than trying to work around things in pipe_wait(), remove that
legacy model of "wait for arbitrary pipe event" entirely, and actually
create functions that wait for the pipe actually being readable or
writable, and can do so without depending on the pipe lock serializing
everything.

Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing")
Link: https://lore.kernel.org/linux-fsdevel/bfa88b5ad6f069b2b679316b9e495a970130416c.1601567868.git.josef@toxicpanda.com/
Reported-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Reviewed-and-tested-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The pipe splice code still used the old model of waiting for pipe IO by
using a non-specific "pipe_wait()" that waited for any pipe event to
happen, which depended on all pipe IO being entirely serialized by the
pipe lock.  So by checking the state you were waiting for, and then
adding yourself to the wait queue before dropping the lock, you were
guaranteed to see all the wakeups.

Strictly speaking, the actual wakeups were not done under the lock, but
the pipe_wait() model still worked, because since the waiter held the
lock when checking whether it should sleep, it would always see the
current state, and the wakeup was always done after updating the state.

However, commit 0ddad21d3e99 ("pipe: use exclusive waits when reading or
writing") split the single wait-queue into two, and in the process also
made the "wait for event" code wait for _two_ wait queues, and that then
showed a race with the wakers that were not serialized by the pipe lock.

It's only splice that used that "pipe_wait()" model, so the problem
wasn't obvious, but Josef Bacik reports:

 "I hit a hang with fstest btrfs/187, which does a btrfs send into
  /dev/null. This works by creating a pipe, the write side is given to
  the kernel to write into, and the read side is handed to a thread that
  splices into a file, in this case /dev/null.

  The box that was hung had the write side stuck here [pipe_write] and
  the read side stuck here [splice_from_pipe_next -&gt; pipe_wait].

  [ more details about pipe_wait() scenario ]

  The problem is we're doing the prepare_to_wait, which sets our state
  each time, however we can be woken up either with reads or writes. In
  the case above we race with the WRITER waking us up, and re-set our
  state to INTERRUPTIBLE, and thus never break out of schedule"

Josef had a patch that avoided the issue in pipe_wait() by just making
it set the state only once, but the deeper problem is that pipe_wait()
depends on a level of synchonization by the pipe mutex that it really
shouldn't.  And the whole "wait for any pipe state change" model really
isn't very good to begin with.

So rather than trying to work around things in pipe_wait(), remove that
legacy model of "wait for arbitrary pipe event" entirely, and actually
create functions that wait for the pipe actually being readable or
writable, and can do so without depending on the pipe lock serializing
everything.

Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing")
Link: https://lore.kernel.org/linux-fsdevel/bfa88b5ad6f069b2b679316b9e495a970130416c.1601567868.git.josef@toxicpanda.com/
Reported-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Reviewed-and-tested-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>fs: don't allow splice read/write without explicit ops</title>
<updated>2020-09-09T02:21:32+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2020-09-03T14:22:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=36e2c7421f02a22f71c9283e55fdb672a9eb58e7'/>
<id>36e2c7421f02a22f71c9283e55fdb672a9eb58e7</id>
<content type='text'>
default_file_splice_write is the last piece of generic code that uses
set_fs to make the uaccess routines operate on kernel pointers.  It
implements a "fallback loop" for splicing from files that do not actually
provide a proper splice_read method.  The usual file systems and other
high bandwidth instances all provide a -&gt;splice_read, so this just removes
support for various device drivers and procfs/debugfs files.  If splice
support for any of those turns out to be important it can be added back
by switching them to the iter ops and using generic_file_splice_read.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Kees Cook &lt;keescook@chromium.org&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
default_file_splice_write is the last piece of generic code that uses
set_fs to make the uaccess routines operate on kernel pointers.  It
implements a "fallback loop" for splicing from files that do not actually
provide a proper splice_read method.  The usual file systems and other
high bandwidth instances all provide a -&gt;splice_read, so this just removes
support for various device drivers and procfs/debugfs files.  If splice
support for any of those turns out to be important it can be added back
by switching them to the iter ops and using generic_file_splice_read.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Kees Cook &lt;keescook@chromium.org&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs</title>
<updated>2020-06-13T16:56:21+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-06-13T16:56:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=6c3297841472b4e53e22e53826eea9e483d993e5'/>
<id>6c3297841472b4e53e22e53826eea9e483d993e5</id>
<content type='text'>
Pull notification queue from David Howells:
 "This adds a general notification queue concept and adds an event
  source for keys/keyrings, such as linking and unlinking keys and
  changing their attributes.

  Thanks to Debarshi Ray, we do have a pull request to use this to fix a
  problem with gnome-online-accounts - as mentioned last time:

     https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47

  Without this, g-o-a has to constantly poll a keyring-based kerberos
  cache to find out if kinit has changed anything.

  [ There are other notification pending: mount/sb fsinfo notifications
    for libmount that Karel Zak and Ian Kent have been working on, and
    Christian Brauner would like to use them in lxc, but let's see how
    this one works first ]

  LSM hooks are included:

   - A set of hooks are provided that allow an LSM to rule on whether or
     not a watch may be set. Each of these hooks takes a different
     "watched object" parameter, so they're not really shareable. The
     LSM should use current's credentials. [Wanted by SELinux &amp; Smack]

   - A hook is provided to allow an LSM to rule on whether or not a
     particular message may be posted to a particular queue. This is
     given the credentials from the event generator (which may be the
     system) and the watch setter. [Wanted by Smack]

  I've provided SELinux and Smack with implementations of some of these
  hooks.

  WHY
  ===

  Key/keyring notifications are desirable because if you have your
  kerberos tickets in a file/directory, your Gnome desktop will monitor
  that using something like fanotify and tell you if your credentials
  cache changes.

  However, we also have the ability to cache your kerberos tickets in
  the session, user or persistent keyring so that it isn't left around
  on disk across a reboot or logout. Keyrings, however, cannot currently
  be monitored asynchronously, so the desktop has to poll for it - not
  so good on a laptop. This facility will allow the desktop to avoid the
  need to poll.

  DESIGN DECISIONS
  ================

   - The notification queue is built on top of a standard pipe. Messages
     are effectively spliced in. The pipe is opened with a special flag:

        pipe2(fds, O_NOTIFICATION_PIPE);

     The special flag has the same value as O_EXCL (which doesn't seem
     like it will ever be applicable in this context)[?]. It is given up
     front to make it a lot easier to prohibit splice&amp;co from accessing
     the pipe.

     [?] Should this be done some other way?  I'd rather not use up a new
         O_* flag if I can avoid it - should I add a pipe3() system call
         instead?

     The pipe is then configured::

        ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
        ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &amp;filter);

     Messages are then read out of the pipe using read().

   - It should be possible to allow write() to insert data into the
     notification pipes too, but this is currently disabled as the
     kernel has to be able to insert messages into the pipe *without*
     holding pipe-&gt;mutex and the code to make this work needs careful
     auditing.

   - sendfile(), splice() and vmsplice() are disabled on notification
     pipes because of the pipe-&gt;mutex issue and also because they
     sometimes want to revert what they just did - but one or more
     notification messages might've been interleaved in the ring.

   - The kernel inserts messages with the wait queue spinlock held. This
     means that pipe_read() and pipe_write() have to take the spinlock
     to update the queue pointers.

   - Records in the buffer are binary, typed and have a length so that
     they can be of varying size.

     This allows multiple heterogeneous sources to share a common
     buffer; there are 16 million types available, of which I've used
     just a few, so there is scope for others to be used. Tags may be
     specified when a watchpoint is created to help distinguish the
     sources.

   - Records are filterable as types have up to 256 subtypes that can be
     individually filtered. Other filtration is also available.

   - Notification pipes don't interfere with each other; each may be
     bound to a different set of watches. Any particular notification
     will be copied to all the queues that are currently watching for it
     - and only those that are watching for it.

   - When recording a notification, the kernel will not sleep, but will
     rather mark a queue as having lost a message if there's
     insufficient space. read() will fabricate a loss notification
     message at an appropriate point later.

   - The notification pipe is created and then watchpoints are attached
     to it, using one of:

        keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
        watch_mount(AT_FDCWD, "/", 0, fd, 0x02);
        watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03);

     where in both cases, fd indicates the queue and the number after is
     a tag between 0 and 255.

   - Watches are removed if either the notification pipe is destroyed or
     the watched object is destroyed. In the latter case, a message will
     be generated indicating the enforced watch removal.

  Things I want to avoid:

   - Introducing features that make the core VFS dependent on the
     network stack or networking namespaces (ie. usage of netlink).

   - Dumping all this stuff into dmesg and having a daemon that sits
     there parsing the output and distributing it as this then puts the
     responsibility for security into userspace and makes handling
     namespaces tricky. Further, dmesg might not exist or might be
     inaccessible inside a container.

   - Letting users see events they shouldn't be able to see.

  TESTING AND MANPAGES
  ====================

   - The keyutils tree has a pipe-watch branch that has keyctl commands
     for making use of notifications. Proposed manual pages can also be
     found on this branch, though a couple of them really need to go to
     the main manpages repository instead.

     If the kernel supports the watching of keys, then running "make
     test" on that branch will cause the testing infrastructure to spawn
     a monitoring process on the side that monitors a notifications pipe
     for all the key/keyring changes induced by the tests and they'll
     all be checked off to make sure they happened.

        https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch

   - A test program is provided (samples/watch_queue/watch_test) that
     can be used to monitor for keyrings, mount and superblock events.
     Information on the notifications is simply logged to stdout"

* tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  smack: Implement the watch_key and post_notification hooks
  selinux: Implement the watch_key security hook
  keys: Make the KEY_NEED_* perms an enum rather than a mask
  pipe: Add notification lossage handling
  pipe: Allow buffers to be marked read-whole-or-error for notifications
  Add sample notification program
  watch_queue: Add a key/keyring notification facility
  security: Add hooks to rule on setting a watch
  pipe: Add general notification queue support
  pipe: Add O_NOTIFICATION_PIPE
  security: Add a hook for the point of notification insertion
  uapi: General notification queue definitions
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull notification queue from David Howells:
 "This adds a general notification queue concept and adds an event
  source for keys/keyrings, such as linking and unlinking keys and
  changing their attributes.

  Thanks to Debarshi Ray, we do have a pull request to use this to fix a
  problem with gnome-online-accounts - as mentioned last time:

     https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47

  Without this, g-o-a has to constantly poll a keyring-based kerberos
  cache to find out if kinit has changed anything.

  [ There are other notification pending: mount/sb fsinfo notifications
    for libmount that Karel Zak and Ian Kent have been working on, and
    Christian Brauner would like to use them in lxc, but let's see how
    this one works first ]

  LSM hooks are included:

   - A set of hooks are provided that allow an LSM to rule on whether or
     not a watch may be set. Each of these hooks takes a different
     "watched object" parameter, so they're not really shareable. The
     LSM should use current's credentials. [Wanted by SELinux &amp; Smack]

   - A hook is provided to allow an LSM to rule on whether or not a
     particular message may be posted to a particular queue. This is
     given the credentials from the event generator (which may be the
     system) and the watch setter. [Wanted by Smack]

  I've provided SELinux and Smack with implementations of some of these
  hooks.

  WHY
  ===

  Key/keyring notifications are desirable because if you have your
  kerberos tickets in a file/directory, your Gnome desktop will monitor
  that using something like fanotify and tell you if your credentials
  cache changes.

  However, we also have the ability to cache your kerberos tickets in
  the session, user or persistent keyring so that it isn't left around
  on disk across a reboot or logout. Keyrings, however, cannot currently
  be monitored asynchronously, so the desktop has to poll for it - not
  so good on a laptop. This facility will allow the desktop to avoid the
  need to poll.

  DESIGN DECISIONS
  ================

   - The notification queue is built on top of a standard pipe. Messages
     are effectively spliced in. The pipe is opened with a special flag:

        pipe2(fds, O_NOTIFICATION_PIPE);

     The special flag has the same value as O_EXCL (which doesn't seem
     like it will ever be applicable in this context)[?]. It is given up
     front to make it a lot easier to prohibit splice&amp;co from accessing
     the pipe.

     [?] Should this be done some other way?  I'd rather not use up a new
         O_* flag if I can avoid it - should I add a pipe3() system call
         instead?

     The pipe is then configured::

        ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
        ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &amp;filter);

     Messages are then read out of the pipe using read().

   - It should be possible to allow write() to insert data into the
     notification pipes too, but this is currently disabled as the
     kernel has to be able to insert messages into the pipe *without*
     holding pipe-&gt;mutex and the code to make this work needs careful
     auditing.

   - sendfile(), splice() and vmsplice() are disabled on notification
     pipes because of the pipe-&gt;mutex issue and also because they
     sometimes want to revert what they just did - but one or more
     notification messages might've been interleaved in the ring.

   - The kernel inserts messages with the wait queue spinlock held. This
     means that pipe_read() and pipe_write() have to take the spinlock
     to update the queue pointers.

   - Records in the buffer are binary, typed and have a length so that
     they can be of varying size.

     This allows multiple heterogeneous sources to share a common
     buffer; there are 16 million types available, of which I've used
     just a few, so there is scope for others to be used. Tags may be
     specified when a watchpoint is created to help distinguish the
     sources.

   - Records are filterable as types have up to 256 subtypes that can be
     individually filtered. Other filtration is also available.

   - Notification pipes don't interfere with each other; each may be
     bound to a different set of watches. Any particular notification
     will be copied to all the queues that are currently watching for it
     - and only those that are watching for it.

   - When recording a notification, the kernel will not sleep, but will
     rather mark a queue as having lost a message if there's
     insufficient space. read() will fabricate a loss notification
     message at an appropriate point later.

   - The notification pipe is created and then watchpoints are attached
     to it, using one of:

        keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
        watch_mount(AT_FDCWD, "/", 0, fd, 0x02);
        watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03);

     where in both cases, fd indicates the queue and the number after is
     a tag between 0 and 255.

   - Watches are removed if either the notification pipe is destroyed or
     the watched object is destroyed. In the latter case, a message will
     be generated indicating the enforced watch removal.

  Things I want to avoid:

   - Introducing features that make the core VFS dependent on the
     network stack or networking namespaces (ie. usage of netlink).

   - Dumping all this stuff into dmesg and having a daemon that sits
     there parsing the output and distributing it as this then puts the
     responsibility for security into userspace and makes handling
     namespaces tricky. Further, dmesg might not exist or might be
     inaccessible inside a container.

   - Letting users see events they shouldn't be able to see.

  TESTING AND MANPAGES
  ====================

   - The keyutils tree has a pipe-watch branch that has keyctl commands
     for making use of notifications. Proposed manual pages can also be
     found on this branch, though a couple of them really need to go to
     the main manpages repository instead.

     If the kernel supports the watching of keys, then running "make
     test" on that branch will cause the testing infrastructure to spawn
     a monitoring process on the side that monitors a notifications pipe
     for all the key/keyring changes induced by the tests and they'll
     all be checked off to make sure they happened.

        https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch

   - A test program is provided (samples/watch_queue/watch_test) that
     can be used to monitor for keyrings, mount and superblock events.
     Information on the notifications is simply logged to stdout"

* tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  smack: Implement the watch_key and post_notification hooks
  selinux: Implement the watch_key security hook
  keys: Make the KEY_NEED_* perms an enum rather than a mask
  pipe: Add notification lossage handling
  pipe: Allow buffers to be marked read-whole-or-error for notifications
  Add sample notification program
  watch_queue: Add a key/keyring notification facility
  security: Add hooks to rule on setting a watch
  pipe: Add general notification queue support
  pipe: Add O_NOTIFICATION_PIPE
  security: Add a hook for the point of notification insertion
  uapi: General notification queue definitions
</pre>
</div>
</content>
</entry>
</feed>
