linux-stable.git/fs/select.c, branch v6.9

fs/select: rework stack allocation hack for clang

2024-02-20T08:23:52+00:00

A while ago, we changed the way that select() and poll() preallocate
a temporary buffer just under the size of the static warning limit of
1024 bytes, as clang was frequently going slightly above that limit.

The warnings have recently returned and I took another look. As it turns
out, clang is not actually inherently worse at reserving stack space,
it just happens to inline do_select() into core_sys_select(), while gcc
never inlines it.

Annotate do_select() to never be inlined and in turn remove the special
case for the allocation size. This should give the same behavior for
both clang and gcc all the time and once more avoids those warnings.
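
As a rough userspace illustration of the idea (not the actual fs/select.c
change), the sketch below keeps an inner worker out of line so that its
locals are not merged into the caller's stack frame. The macro is a
stand-in for the kernel's noinline_for_stack; the function names and
buffer sizes are hypothetical, and the attribute syntax assumes gcc or
clang.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-in for the kernel's noinline_for_stack annotation. */
#define noinline_for_stack __attribute__((noinline))

/* Illustrative sizes only; not taken from fs/select.c. */
#define FRONT_BUF_BYTES 256
#define SCRATCH_WORDS   64

/*
 * Inner worker with a sizeable local array. Because it is never inlined,
 * its frame is accounted separately from the caller's frame.
 */
static noinline_for_stack int inner_scan(const unsigned long *bits, size_t nwords)
{
	unsigned long scratch[SCRATCH_WORDS];
	int count = 0;
	size_t i;

	assert(nwords <= SCRATCH_WORDS);
	memcpy(scratch, bits, nwords * sizeof(scratch[0]));
	for (i = 0; i < nwords; i++) {
		unsigned long w = scratch[i];

		while (w) {		/* clear the lowest set bit each round */
			w &= w - 1;
			count++;
		}
	}
	return count;
}

/* Outer entry point: owns the fixed-size front buffer. */
int outer_call(const unsigned long *in, size_t nwords)
{
	unsigned long stack_buf[FRONT_BUF_BYTES / sizeof(unsigned long)];

	if (nwords > sizeof(stack_buf) / sizeof(stack_buf[0]))
		return -1;	/* a real caller would fall back to the heap */
	memcpy(stack_buf, in, nwords * sizeof(stack_buf[0]));
	return inner_scan(stack_buf, nwords);
}
```

With the annotation in place, a per-function frame size warning sees the
two frames separately no matter which compiler's inlining heuristics are
in effect.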

Fixes: ad312f95d41c ("fs/select: avoid clang stack usage warning")
Signed-off-by: Arnd Bergmann 
Link: https://lore.kernel.org/r/20240216202352.2492798-1-arnd@kernel.org
Reviewed-by: Kees Cook 
Reviewed-by: Andi Kleen 
Reviewed-by: Jan Kara 
Signed-off-by: Christian Brauner

select: Avoid wrap-around instrumentation in do_sys_poll()

2024-02-02T12:11:49+00:00

The mix of int, unsigned int, and unsigned long used by struct
poll_list::len, todo, len, and j meant that the signed overflow
sanitizer got worried it needed to instrument several places where
arithmetic happens between these variables. Since all of the variables
are always positive and bounded by unsigned int, use a single type in
all places. Additionally expand the zero-test into an explicit range
check before updating "todo".
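
A minimal userspace sketch of the pattern described above, assuming a
simplified chunk walk: every counter is an unsigned int, and the
remaining count is only decremented after an explicit range check, so
the subtraction can never wrap. The function and parameter names are
hypothetical and do not mirror do_sys_poll() line for line.

```c
#include <assert.h>

/*
 * Count how many chunks are needed to hold nfds entries when the first
 * chunk holds up to first_len entries and later chunks hold up to
 * per_chunk entries. All arithmetic uses one unsigned type.
 */
unsigned int count_chunks(unsigned int nfds, unsigned int first_len,
			  unsigned int per_chunk)
{
	unsigned int todo = nfds;
	unsigned int len = todo < first_len ? todo : first_len;
	unsigned int chunks = 0;

	assert(per_chunk > 0);
	for (;;) {
		if (!len)
			break;
		chunks++;
		/* Explicit range check: never subtract more than remains. */
		if (len >= todo)
			break;
		todo -= len;
		len = todo < per_chunk ? todo : per_chunk;
	}
	return chunks;
}
```

Because "len >= todo" is tested before "todo -= len", the compiler can
see that the subtraction stays in range without a runtime check.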

This keeps sanitizer instrumentation[1] out of a UACCESS path:

vmlinux.o: warning: objtool: do_sys_poll+0x285: call to __ubsan_handle_sub_overflow() with UACCESS enabled

Link: https://github.com/KSPP/linux/issues/26 [1]
Cc: Christian Brauner 
Cc: Alexander Viro 
Cc: Jan Kara 
Signed-off-by: Kees Cook 
Link: https://lore.kernel.org/r/20240129184014.work.593-kees@kernel.org
Reviewed-by: Jan Kara 
Signed-off-by: Christian Brauner

select: Fix indefinitely sleeping task in poll_schedule_timeout()

2022-01-11T17:03:05+00:00

A task can end up indefinitely sleeping in do_select() ->
poll_schedule_timeout() when the following race happens:

  TASK1 (thread1)             TASK2                   TASK1 (thread2)
  do_select()
    setup poll_wqueues table
    with 'fd'
                              write data to 'fd'
                                pollwake()
                                  table->triggered = 1
                                                      closes 'fd' thread1 is
                                                        waiting for
    poll_schedule_timeout()
      - sees table->triggered
      table->triggered = 0
      return -EINTR
    loop back in do_select()

But at this point, when TASK1 loops back, the fdget() in the setup of
poll_wqueues fails.  So now we never find that 'fd' is ready for reading
and we sleep in poll_schedule_timeout() indefinitely.

Treat an fd that got closed as an fd on which some event happened.  This
makes sure we cannot block indefinitely in do_select().

Another option would be to return -EBADF in this case, but that has the
potential of subtly breaking applications that exercise this behavior
and for which it happens to work.  So returning the fd as active seems
like a safer choice.
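
The choice above can be modelled in userspace roughly as follows. This
is a sketch under stated assumptions, not the kernel code: the file
table, the EV_* flag values, and both function names are hypothetical.
The point is only that a failed lookup yields a non-zero "invalid" mask
that the scan loop counts as an event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event flags for this model. */
#define EV_READ 0x01u
#define EV_NVAL 0x20u

struct model_file {
	bool open;		/* false once the fd has been closed */
	unsigned int ready;	/* events currently pending */
};

/* Start from "invalid" and overwrite only if the lookup succeeds. */
static unsigned int poll_one(const struct model_file *table, size_t n, size_t fd)
{
	unsigned int mask = EV_NVAL;

	if (fd < n && table[fd].open)
		mask = table[fd].ready;
	return mask;
}

/* Count watched fds that are readable or no longer valid. */
int count_active(const struct model_file *table, size_t n,
		 const size_t *watched, size_t nwatched)
{
	int active = 0;
	size_t i;

	assert(table != NULL && watched != NULL);
	for (i = 0; i < nwatched; i++) {
		unsigned int mask = poll_one(table, n, watched[i]);

		if (mask & (EV_READ | EV_NVAL))
			active++;
	}
	return active;
}
```

A caller that sleeps only when count_active() returns 0 can no longer
sleep forever on an fd that was closed underneath it.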

Suggested-by: Linus Torvalds 
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara 
Signed-off-by: Linus Torvalds

net: Don't include filter.h from net/sock.h

2021-12-29T16:48:14+00:00

sock.h is pretty heavily used (5k objects rebuilt on x86 after
it's touched). We can drop the include of filter.h from it and
add a forward declaration of struct sk_filter instead.
This decreases the number of rebuilt objects when bpf.h
is touched from ~5k to ~1k.

There were a lot of missing includes that this was masking, primarily
in networking this time.
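
For illustration, a forward declaration is enough whenever a header only
stores or passes a pointer to a type. The sketch below uses hypothetical
stand-in types rather than the real struct sock and struct sk_filter.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Forward declaration: the full definition (and the header that carries
 * it) is not needed as long as only pointers to the type are used.
 */
struct sk_filter_model;

/* Hypothetical stand-in for a socket that may have a filter attached. */
struct sock_model {
	struct sk_filter_model *filter;
	int id;
};

/* Works on the pointer alone; never looks inside the filter. */
int sock_has_filter(const struct sock_model *sk)
{
	assert(sk != NULL);
	return sk->filter != NULL;
}
```

Only the files that dereference the filter need the full header, which
is why dropping the include exposes callers that relied on it
indirectly.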

Signed-off-by: Jakub Kicinski 
Signed-off-by: Alexei Starovoitov 
Acked-by: Marc Kleine-Budde 
Acked-by: Florian Fainelli 
Acked-by: Nikolay Aleksandrov 
Acked-by: Stefano Garzarella 
Link: https://lore.kernel.org/bpf/20211229004913.513372-1-kuba@kernel.org

Revert "memcg: enable accounting for pollfd and select bits arrays"

2021-09-07T18:26:23+00:00

This reverts commit b655843444152c0a14b749308e4cb35d91cbcf0b.

Just like with the memcg lock accounting, the kernel test robot reports
a sizeable performance regression for this commit, and while it clearly
does the right thing in theory, we'll need to look at just how to avoid
or minimize the performance overhead of the memcg accounting.

People already have suggestions on how to do that, but it's "future
work".

So revert it for now.

[ Note: the first link below is for this same commit but a different
  commit ID, because the kernel test robot ended up noticing it in
  Andrew Morton's patch queue ]

Link: https://lore.kernel.org/lkml/20210905132732.GC15026@xsang-OptiPlex-9020/
Link: https://lore.kernel.org/lkml/20210907150757.GE17617@xsang-OptiPlex-9020/
Acked-by: Jens Axboe 
Acked-by: Shakeel Butt 
Acked-by: Roman Gushchin 
Cc: Tejun Heo 
Signed-off-by: Linus Torvalds

memcg: enable accounting for pollfd and select bits arrays

2021-09-03T16:58:12+00:00

A user can call the select/poll system calls with a large number of
assigned file descriptors and force the kernel to allocate up to several
pages of memory until these sleeping system calls end.  These are
long-lived, unaccounted per-task allocations.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Link: https://lkml.kernel.org/r/56e31cb5-6e1e-bdba-d7ca-be64b9842363@virtuozzo.com
Signed-off-by: Vasily Averin 
Reviewed-by: Shakeel Butt 
Cc: Alexander Viro 
Cc: Alexey Dobriyan 
Cc: Andrei Vagin 
Cc: Borislav Petkov 
Cc: Christian Brauner 
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" 
Cc: Greg Kroah-Hartman 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: "J. Bruce Fields" 
Cc: Jeff Layton 
Cc: Jens Axboe 
Cc: Jiri Slaby 
Cc: Johannes Weiner 
Cc: Kirill Tkhai 
Cc: Michal Hocko 
Cc: Oleg Nesterov 
Cc: Roman Gushchin 
Cc: Serge Hallyn 
Cc: Tejun Heo 
Cc: Thomas Gleixner 
Cc: Vladimir Davydov 
Cc: Yutian Yang 
Cc: Zefan Li 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data()

2021-03-16T21:13:10+00:00

Preparation for fixing get_nr_restart_syscall() on X86 for COMPAT.

Add a new helper which sets restart_block->fn and calls a dummy
arch_set_restart_data() helper.
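
A simplified userspace model of such a helper might look like the sketch
below. The struct is reduced to two fields, the arch hook is an empty
dummy as described above, and restart_stub() is a hypothetical restart
function; treat the details as assumptions rather than the kernel
definitions.

```c
#include <assert.h>
#include <stddef.h>

#define ERESTART_RESTARTBLOCK 516

/* Reduced model of a restart block: just the callback and arch data. */
struct restart_block {
	unsigned long arch_data;
	long (*fn)(struct restart_block *);
};

/* Dummy arch hook; an architecture could record extra state here. */
static inline void arch_set_restart_data(struct restart_block *restart)
{
	(void)restart;
}

/* Set the restart callback in one place and return the restart code. */
static inline long set_restart_fn(struct restart_block *restart,
				  long (*fn)(struct restart_block *))
{
	assert(restart != NULL);
	restart->fn = fn;
	arch_set_restart_data(restart);
	return -ERESTART_RESTARTBLOCK;
}

/* Hypothetical restart function used only for illustration. */
long restart_stub(struct restart_block *restart)
{
	(void)restart;
	return 0;
}
```

Routing every assignment of the callback through one helper gives the
architecture a single place to hook in.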

Fixes: 609c19a385c8 ("x86/ptrace: Stop setting TS_COMPAT in ptrace code")
Signed-off-by: Oleg Nesterov 
Signed-off-by: Thomas Gleixner 
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210201174641.GA17871@redhat.com

poll: fix performance regression due to out-of-line __put_user()

2021-01-08T19:06:29+00:00

The kernel test robot reported a -5.8% performance regression on the
"poll2" test of will-it-scale, and bisected it to commit d55564cfc222
("x86: Make __put_user() generate an out-of-line call").

I didn't expect an out-of-line __put_user() to matter, because no normal
core code should use that non-checking legacy version of user access any
more.  But I had overlooked the very odd poll() usage, which does a
__put_user() to update the 'revents' values of the poll array.

Now, Al Viro correctly points out that instead of updating just the
'revents' field, it would be much simpler to just copy the _whole_
pollfd entry, and then we could just use "copy_to_user()" on the whole
array of entries, the same way we use "copy_from_user()" a few lines
earlier to get the original values.

But that is not what we've traditionally done, and I worry that threaded
applications might be concurrently modifying the other fields of the
pollfd array.  So while Al's suggestion is simpler - and perhaps worth
trying in the future - this instead keeps the "just update revents"
model.

To fix the performance regression, use the modern "unsafe_put_user()"
instead of __put_user(), with the proper "user_write_access_begin()"
guarding in place. This improves code generation enormously.
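
The "just update revents" model can be sketched in userspace as below.
The model_write_access_begin()/model_write_access_end() pair is a
hypothetical stand-in for the kernel's user access window and performs
only a trivial check here; plain stores take the place of
unsafe_put_user(). What the sketch does show faithfully is that the
window is opened once for the whole array and that only the revents
field of each entry is written.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in: validate the destination range once. */
static bool model_write_access_begin(const void *ptr, size_t len)
{
	return ptr != NULL || len == 0;
}

/* Hypothetical stand-in: close the access window. */
static void model_write_access_end(void)
{
}

/*
 * Copy only the revents field of each entry back to the caller's array,
 * leaving fd and events untouched in case another thread changed them.
 */
int write_back_revents(struct pollfd *ufds, const struct pollfd *kfds,
		       size_t nfds)
{
	size_t i;

	assert(kfds != NULL || nfds == 0);
	if (!model_write_access_begin(ufds, nfds * sizeof(*ufds)))
		return -1;
	for (i = 0; i < nfds; i++)
		ufds[i].revents = kfds[i].revents;
	model_write_access_end();
	return 0;
}
```

Opening the window once outside the loop is what lets each store in the
loop compile down to a plain write.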

Link: https://lore.kernel.org/lkml/20210107134723.GA28532@xsang-OptiPlex-9020/
Reported-by: kernel test robot 
Tested-by: Oliver Sang 
Cc: Al Viro 
Cc: David Laight 
Cc: Peter Zijlstra 
Signed-off-by: Linus Torvalds

fs: Replace zero-length array with flexible-array member

2020-10-29T22:22:59+00:00

There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arrays
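
As a small illustration, the struct below ends in a C99 flexible array
member and is allocated with room for its trailing elements. The type
and function names are hypothetical and are not the structures touched
by this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A chunk with a dynamically sized tail, declared as a flexible array. */
struct chunk {
	struct chunk *next;
	unsigned int len;
	int entries[];		/* not "entries[0]" or "entries[1]" */
};

/* Allocate a zeroed chunk with room for n trailing entries. */
struct chunk *chunk_alloc(unsigned int n)
{
	struct chunk *c;

	/* sizeof(*c) excludes the flexible array; add the tail explicitly. */
	c = calloc(1, sizeof(*c) + (size_t)n * sizeof(c->entries[0]));
	if (!c)
		return NULL;
	c->len = n;
	return c;
}
```

Unlike a zero-length or one-element array, the flexible array member
must be the last field, which lets the compiler reject a misplaced
declaration.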

Signed-off-by: Gustavo A. R. Silva

pselect6() and friends: take handling the combined 6th/7th args into helper

2020-05-29T23:10:42+00:00

... and use unsafe_get_user(), while we are at it.

Signed-off-by: Al Viro