linux-stable.git/fs/file_table.c, branch v5.19

Merge tag 'pull-18-rc1-work.fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

2022-06-05T01:52:00+00:00

Pull file descriptor updates from Al Viro.

 - Descriptor handling cleanups

* tag 'pull-18-rc1-work.fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  Unify the primitives for file descriptor closing
  fs: remove fget_many and fput_many interface
  io_uring_enter(): don't leave f.flags uninitialized

fs: remove fget_many and fput_many interface

2022-05-14T22:47:28+00:00

These two interface were added in 091141a42 commit,
but now there is no place to call them.

The only user of fput/fget_many() was removed in commit
62906e89e63b ("io_uring: remove file batch-get optimisation").

A user of get_file_rcu_many() were removed in commit
f073531070d2 ("init: add an init_dup helper").

And replace atomic_long_sub/add to atomic_long_dec/inc
can improve performance.

Here are the test results of unixbench:

Cmd: ./Run -c 64 context1

Without patch:
System Benchmarks Partial Index              BASELINE       RESULT    INDEX
Pipe-based Context Switching                   4000.0    2798407.0   6996.0
                                                                   ========
System Benchmarks Index Score (Partial Only)                         6996.0

With patch:
System Benchmarks Partial Index              BASELINE       RESULT    INDEX
Pipe-based Context Switching                   4000.0    3486268.8   8715.7
                                                                   ========
System Benchmarks Index Score (Partial Only)                         8715.7

Signed-off-by: Gou Hao 
Signed-off-by: Al Viro

SUNRPC: Ensure we flush any closed sockets before xs_xprt_free()

2022-04-07T20:19:47+00:00

We must ensure that all sockets are closed before we call xprt_free()
and release the reference to the net namespace. The problem is that
calling fput() will defer closing the socket until delayed_fput() gets
called.
Let's fix the situation by allowing rpciod and the transport teardown
code (which runs on the system wq) to call __fput_sync(), and directly
close the socket.

Reported-by: Felix Fu 
Acked-by: Al Viro 
Fixes: a73881c96d73 ("SUNRPC: Fix an Oops in udp_poll()")
Cc: stable@vger.kernel.org # 5.1.x: 3be232f11a3c: SUNRPC: Prevent immediate close+reconnect
Cc: stable@vger.kernel.org # 5.1.x: 89f42494f92f: SUNRPC: Don't call connect() more than once on a TCP socket
Cc: stable@vger.kernel.org # 5.1.x
Signed-off-by: Trond Myklebust

fs/file_table: fix adding missing kmemleak_not_leak()

2022-02-17T18:23:19+00:00

Commit b42bc9a3c511 ("Fix regression due to "fs: move binfmt_misc sysctl
to its own file") fixed a regression, however it failed to add a
kmemleak_not_leak().

Fixes: b42bc9a3c511 ("Fix regression due to "fs: move binfmt_misc sysctl to its own file")
Reported-by: Tong Zhang 
Cc: Tong Zhang 
Signed-off-by: Luis Chamberlain 
Signed-off-by: Linus Torvalds

Fix regression due to "fs: move binfmt_misc sysctl to its own file"

2022-02-09T17:50:02+00:00

Commit 3ba442d5331f ("fs: move binfmt_misc sysctl to its own file") did
not go unnoticed, binfmt-support stopped to work on my Debian system
since v5.17-rc2 (did not check with -rc1).

The existance of the /proc/sys/fs/binfmt_misc is a precondition for
attempting to mount the binfmt_misc fs, which in turn triggers the
autoload of the binfmt_misc module.  Without it, no module is loaded and
no binfmt is available at boot.

Building as built-in or manually loading the module and mounting the fs
works fine, it's therefore only a matter of interaction with user-space.
I could try to improve the Debian systemd configuration but I can't say
anything about the other distributions.

This patch restores a working system right after boot.

Fixes: 3ba442d5331f ("fs: move binfmt_misc sysctl to its own file")
Signed-off-by: Domenico Andreoli 
Cc: Andrew Morton 
Cc: Luis Chamberlain 
Reviewed-by: Tong Zhang 
Signed-off-by: Linus Torvalds

fs: move fs stat sysctls to file_table.c

2022-01-22T06:33:36+00:00

kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.

To help with this maintenance let's start by moving sysctls to places
where they actually belong.  The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.

We can create the sysctl dynamically on early init for fs stat to help
with this clutter.  This dusts off the fs stat syctls knobs and puts
them into where they are declared.

Link: https://lkml.kernel.org/r/20211129205548.605569-3-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain 
Cc: Al Viro 
Cc: Andy Shevchenko 
Cc: Antti Palosaari 
Cc: Eric Biederman 
Cc: Iurii Zaikin 
Cc: "J. Bruce Fields" 
Cc: Jeff Layton 
Cc: Kees Cook 
Cc: Lukas Middendorf 
Cc: Stephen Kitt 
Cc: Xiaoming Ni 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

epoll: take epitem list out of struct file

2020-10-26T00:02:08+00:00

Move the head of epitem list out of struct file; for epoll ones it's
moved into struct eventpoll (->refs there), for non-epoll - into
the new object (struct epitem_head).  In place of ->f_ep_links we
leave a pointer to the list head (->f_ep).

->f_ep is protected by ->f_lock and it's zeroed as soon as the list
of epitems becomes empty (that can happen only in ep_remove() by
now).

The list of files for reverse path check is *not* going through
struct file now - it's a single-linked list going through epitem_head
instances.  It's terminated by ERR_PTR(-1) (== EP_UNACTIVE_POINTER),
so the elements of list can be distinguished by head->next != NULL.

epitem_head instances are allocated at ep_insert() time (by
attach_epitem()) and freed either by ep_remove() (if it empties
the set of epitems *and* epitem_head does not belong to the
reverse path check list) or by clear_tfile_check_list() when
the list is emptied (if the set of epitems is empty by that
point).  Allocations are done from a separate slab - minimal kmalloc()
size is too large on some architectures.

As the result, we trim struct file _and_ get rid of the games with
temporary file references.

Locking and barriers are interesting (aren't they always); see unlist_file()
and ep_remove() for details.  The non-obvious part is that ep_remove() needs
to decide if it will be the one to free the damn thing *before* actually
storing NULL to head->epitems.first - that's what smp_load_acquire is for
in there.  unlist_file() lockless path is safe, since we hit it only if
we observe NULL in head->epitems.first and whoever had done that store is
guaranteed to have observed non-NULL in head->next.  IOW, their last access
had been the store of NULL into ->epitems.first and we can safely free
the sucker.  OTOH, we are under rcu_read_lock() and both epitem and
epitem->file have their freeing RCU-delayed.  So if we see non-NULL
->epitems.first, we can grab ->f_lock (all epitems in there share the
same struct file) and safely recheck the emptiness of ->epitems; again,
->next is still non-NULL, so ep_remove() couldn't have freed head yet.
->f_lock serializes us wrt ep_remove(); the rest is trivial.

Note that once head->epitems becomes NULL, nothing can get inserted into
it - the only remaining reference to head after that point is from the
reverse path check list.

Signed-off-by: Al Viro

task_work: cleanup notification modes

2020-10-17T21:05:30+00:00

A previous commit changed the notification mode from true/false to an
int, allowing notify-no, notify-yes, or signal-notify. This was
backwards compatible in the sense that any existing true/false user
would translate to either 0 (on notification sent) or 1, the latter
which mapped to TWA_RESUME. TWA_SIGNAL was assigned a value of 2.

Clean this up properly, and define a proper enum for the notification
mode. Now we have:

- TWA_NONE. This is 0, same as before the original change, meaning no
  notification requested.
- TWA_RESUME. This is 1, same as before the original change, meaning
  that we use TIF_NOTIFY_RESUME.
- TWA_SIGNAL. This uses TIF_SIGPENDING/JOBCTL_TASK_WORK for the
  notification.

Clean up all the callers, switching their 0/1/false/true to using the
appropriate TWA_* mode for notifications.

Fixes: e91b48162332 ("task_work: teach task_work_add() to do signal_wake_up()")
Reviewed-by: Thomas Gleixner 
Signed-off-by: Jens Axboe

Revert "fs: Do not check if there is a fsnotify watcher on pseudo inodes"

2020-06-29T16:40:55+00:00

This reverts commit e9c15badbb7b ("fs: Do not check if there is a
fsnotify watcher on pseudo inodes"). The commit intended to eliminate
fsnotify-related overhead for pseudo inodes but it is broken in
concept. inotify can receive events of pipe files under /proc/X/fd and
chromium relies on close and open events for sandboxing. Maxim Levitsky
reported the following

  Chromium starts as a white rectangle, shows few white rectangles that
  resemble its notifications and then crashes.

  The stdout output from chromium:

  [mlevitsk@starship ~]$chromium-freeworld
  mesa: for the   --simplifycfg-sink-common option: may only occur zero or one times!
  mesa: for the   --global-isel-abort option: may only occur zero or one times!
  [3379:3379:0628/135151.440930:ERROR:browser_switcher_service.cc(238)] XXX Init()
  ../../sandbox/linux/seccomp-bpf-helpers/sigsys_handlers.cc:**CRASHING**:seccomp-bpf failure in syscall 0072
  Received signal 11 SEGV_MAPERR 0000004a9048

Crashes are not universal but even if chromium does not crash, it certainly
does not work properly. While filtering just modify and access might be
safe, the benefit is not worth the risk hence the revert.

Reported-by: Maxim Levitsky 
Fixes: e9c15badbb7b ("fs: Do not check if there is a fsnotify watcher on pseudo inodes")
Signed-off-by: Mel Gorman 
Signed-off-by: Linus Torvalds

fs: Do not check if there is a fsnotify watcher on pseudo inodes

2020-06-16T07:40:45+00:00

The kernel uses internal mounts created by kern_mount() and populated
with files with no lookup path by alloc_file_pseudo() for a variety of
reasons. An example of such a mount is for anonymous pipes. For pipes,
every vfs_write() regardless of filesystem, calls fsnotify_modify()
to notify of any changes which incurs a small amount of overhead in
fsnotify even when there are no watchers. It can also trigger for reads
and readv and writev, it was simply vfs_write() that was noticed first.

A patch is pending that reduces, but does not eliminate, the overhead of
fsnotify but for files that cannot be looked up via a path, even that
small overhead is unnecessary. The user API for all notification
subsystems (inotify, fanotify, ...) is based on the pathname and a dirfd
and proc entries appear to be the only visible representation of the
files. Proc does not have the same pathname as the internal entry and
the proc inode is not the same as the internal inode so even if fanotify
is used on a file under /proc/XX/fd, no useful events are notified.

This patch changes alloc_file_pseudo() to always opt out of fsnotify by
setting FMODE_NONOTIFY flag so that no check is made for fsnotify
watchers on pseudo files. This should be safe as the underlying helper
for the dentry is d_alloc_pseudo() which explicitly states that no
lookups are ever performed meaning that fanotify should have nothing
useful to attach to.

The test motivating this was "perf bench sched messaging --pipe". On
a single-socket machine using threads the difference of the patch was
as follows.

                              5.7.0                  5.7.0
                            vanilla        nofsnotify-v1r1
Amean     1       1.3837 (   0.00%)      1.3547 (   2.10%)
Amean     3       3.7360 (   0.00%)      3.6543 (   2.19%)
Amean     5       5.8130 (   0.00%)      5.7233 *   1.54%*
Amean     7       8.1490 (   0.00%)      7.9730 *   2.16%*
Amean     12     14.6843 (   0.00%)     14.1820 (   3.42%)
Amean     18     21.8840 (   0.00%)     21.7460 (   0.63%)
Amean     24     28.8697 (   0.00%)     29.1680 (  -1.03%)
Amean     30     36.0787 (   0.00%)     35.2640 *   2.26%*
Amean     32     38.0527 (   0.00%)     38.1223 (  -0.18%)

The difference is small but in some cases it's outside the noise so
while marginal, there is still some small benefit to ignoring fsnotify
for files allocated via alloc_file_pseudo() in some cases.

Link: https://lore.kernel.org/r/20200615121358.GF3183@techsingularity.net
Signed-off-by: Mel Gorman 
Reviewed-by: Amir Goldstein 
Signed-off-by: Jan Kara