<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/fs/eventpoll.c, branch v6.3</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>eventpoll: add EPOLL_URING_WAKE poll wakeup flag</title>
<updated>2022-11-21T14:45:29+00:00</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2022-11-20T17:10:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=caf1aeaffc3b09649a56769e559333ae2c4f1802'/>
<id>caf1aeaffc3b09649a56769e559333ae2c4f1802</id>
<content type='text'>
We can have dependencies between epoll and io_uring. Consider an epoll
context, identified by the epfd file descriptor, and an io_uring file
descriptor identified by iofd. If we add iofd to the epfd context, and
arm a multishot poll request for epfd with iofd, then the multishot
poll request will repeatedly trigger and generate events until terminated
by CQ ring overflow. This isn't a desired behavior.

Add EPOLL_URING so that io_uring can pass it in as part of the poll wakeup
key, and io_uring can check for that to detect a potential recursive
invocation.

Cc: stable@vger.kernel.org # 6.0
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We can have dependencies between epoll and io_uring. Consider an epoll
context, identified by the epfd file descriptor, and an io_uring file
descriptor identified by iofd. If we add iofd to the epfd context, and
arm a multishot poll request for epfd with iofd, then the multishot
poll request will repeatedly trigger and generate events until terminated
by CQ ring overflow. This isn't a desired behavior.

Add EPOLL_URING so that io_uring can pass it in as part of the poll wakeup
key, and io_uring can check for that to detect a potential recursive
invocation.

Cc: stable@vger.kernel.org # 6.0
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>epoll: use try_cmpxchg in list_add_tail_lockless</title>
<updated>2022-09-12T04:55:07+00:00</updated>
<author>
<name>Uros Bizjak</name>
<email>ubizjak@gmail.com</email>
</author>
<published>2022-07-14T17:32:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=693fc06e98514c2d5951ead4aca40cf8b21100b1'/>
<id>693fc06e98514c2d5951ead4aca40cf8b21100b1</id>
<content type='text'>
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
list_add_tail_lockless.  x86 CMPXCHG instruction returns success in ZF
flag, so this change saves a compare after cmpxchg (and related move
instruction in front of cmpxchg).

No functional change intended.

Link: https://lkml.kernel.org/r/20220714173255.12987-1-ubizjak@gmail.com
Signed-off-by: Uros Bizjak &lt;ubizjak@gmail.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
list_add_tail_lockless.  x86 CMPXCHG instruction returns success in ZF
flag, so this change saves a compare after cmpxchg (and related move
instruction in front of cmpxchg).

No functional change intended.

Link: https://lkml.kernel.org/r/20220714173255.12987-1-ubizjak@gmail.com
Signed-off-by: Uros Bizjak &lt;ubizjak@gmail.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>epoll: autoremove wakers even more aggressively</title>
<updated>2022-07-18T00:31:40+00:00</updated>
<author>
<name>Benjamin Segall</name>
<email>bsegall@google.com</email>
</author>
<published>2022-06-15T21:24:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=a16ceb13961068f7209e34d7984f8e42d2c06159'/>
<id>a16ceb13961068f7209e34d7984f8e42d2c06159</id>
<content type='text'>
If a process is killed or otherwise exits while having active network
connections and many threads waiting on epoll_wait, the threads will all
be woken immediately, but not removed from ep-&gt;wq.  Then when network
traffic scans ep-&gt;wq in wake_up, every wakeup attempt will fail, and will
not remove the entries from the list.

This means that the cost of the wakeup attempt is far higher than usual,
does not decrease, and this also competes with the dying threads trying to
actually make progress and remove themselves from the wq.

Handle this by removing visited epoll wq entries unconditionally, rather
than only when the wakeup succeeds - the structure of ep_poll means that
the only potential loss is the timed_out-&gt;eavail heuristic, which now can
race and result in a redundant ep_send_events attempt.  (But only when
incoming data and a timeout actually race, not on every timeout)

Shakeel added:

: We are seeing this issue in production with real workloads and it has
: caused hard lockups.  Particularly network heavy workloads with a lot
: of threads in epoll_wait() can easily trigger this issue if they get
: killed (oom-killed in our case).

Link: https://lkml.kernel.org/r/xm26fsjotqda.fsf@google.com
Signed-off-by: Ben Segall &lt;bsegall@google.com&gt;
Tested-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Roman Penyaev &lt;rpenyaev@suse.de&gt;
Cc: Jason Baron &lt;jbaron@akamai.com&gt;
Cc: Khazhismel Kumykov &lt;khazhy@google.com&gt;
Cc: Heiher &lt;r@hev.cc&gt;
Cc: &lt;stable@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
If a process is killed or otherwise exits while having active network
connections and many threads waiting on epoll_wait, the threads will all
be woken immediately, but not removed from ep-&gt;wq.  Then when network
traffic scans ep-&gt;wq in wake_up, every wakeup attempt will fail, and will
not remove the entries from the list.

This means that the cost of the wakeup attempt is far higher than usual,
does not decrease, and this also competes with the dying threads trying to
actually make progress and remove themselves from the wq.

Handle this by removing visited epoll wq entries unconditionally, rather
than only when the wakeup succeeds - the structure of ep_poll means that
the only potential loss is the timed_out-&gt;eavail heuristic, which now can
race and result in a redundant ep_send_events attempt.  (But only when
incoming data and a timeout actually race, not on every timeout)

Shakeel added:

: We are seeing this issue in production with real workloads and it has
: caused hard lockups.  Particularly network heavy workloads with a lot
: of threads in epoll_wait() can easily trigger this issue if they get
: killed (oom-killed in our case).

Link: https://lkml.kernel.org/r/xm26fsjotqda.fsf@google.com
Signed-off-by: Ben Segall &lt;bsegall@google.com&gt;
Tested-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Roman Penyaev &lt;rpenyaev@suse.de&gt;
Cc: Jason Baron &lt;jbaron@akamai.com&gt;
Cc: Khazhismel Kumykov &lt;khazhy@google.com&gt;
Cc: Heiher &lt;r@hev.cc&gt;
Cc: &lt;stable@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>eventpoll: simplify sysctl declaration with register_sysctl()</title>
<updated>2022-01-22T06:33:35+00:00</updated>
<author>
<name>Xiaoming Ni</name>
<email>nixiaoming@huawei.com</email>
</author>
<published>2022-01-22T06:12:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=a8f5de894f76f1c73f4a068d04897a5e2f873825'/>
<id>a8f5de894f76f1c73f4a068d04897a5e2f873825</id>
<content type='text'>
The kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.

To help with this maintenance let's start by moving sysctls to places
where they actually belong.  The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.

So move the epoll_table sysctl to fs/eventpoll.c and use
register_sysctl().

Link: https://lkml.kernel.org/r/20211123202422.819032-9-mcgrof@kernel.org
Signed-off-by: Xiaoming Ni &lt;nixiaoming@huawei.com&gt;
Signed-off-by: Luis Chamberlain &lt;mcgrof@kernel.org&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Amir Goldstein &lt;amir73il@gmail.com&gt;
Cc: Andy Shevchenko &lt;andriy.shevchenko@linux.intel.com&gt;
Cc: Antti Palosaari &lt;crope@iki.fi&gt;
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Benjamin LaHaise &lt;bcrl@kvack.org&gt;
Cc: Clemens Ladisch &lt;clemens@ladisch.de&gt;
Cc: David Airlie &lt;airlied@linux.ie&gt;
Cc: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: Iurii Zaikin &lt;yzaikin@google.com&gt;
Cc: Jani Nikula &lt;jani.nikula@linux.intel.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Joel Becker &lt;jlbec@evilplan.org&gt;
Cc: Joonas Lahtinen &lt;joonas.lahtinen@linux.intel.com&gt;
Cc: Joseph Qi &lt;joseph.qi@linux.alibaba.com&gt;
Cc: Julia Lawall &lt;julia.lawall@inria.fr&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Lukas Middendorf &lt;kernel@tuxforce.de&gt;
Cc: Mark Fasheh &lt;mark@fasheh.com&gt;
Cc: Paul Turner &lt;pjt@google.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Petr Mladek &lt;pmladek@suse.com&gt;
Cc: Phillip Potter &lt;phil@philpotter.co.uk&gt;
Cc: Qing Wang &lt;wangqing@vivo.com&gt;
Cc: Rodrigo Vivi &lt;rodrigo.vivi@intel.com&gt;
Cc: Sebastian Reichel &lt;sre@kernel.org&gt;
Cc: Sergey Senozhatsky &lt;senozhatsky@chromium.org&gt;
Cc: Stephen Kitt &lt;steve@sk2.org&gt;
Cc: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
Cc: Douglas Gilbert &lt;dgilbert@interlog.com&gt;
Cc: James E.J. Bottomley &lt;jejb@linux.ibm.com&gt;
Cc: Jani Nikula &lt;jani.nikula@intel.com&gt;
Cc: John Ogness &lt;john.ogness@linutronix.de&gt;
Cc: Martin K. Petersen &lt;martin.petersen@oracle.com&gt;
Cc: "Rafael J. Wysocki" &lt;rafael@kernel.org&gt;
Cc: Steven Rostedt (VMware) &lt;rostedt@goodmis.org&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: "Theodore Ts'o" &lt;tytso@mit.edu&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.

To help with this maintenance let's start by moving sysctls to places
where they actually belong.  The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.

So move the epoll_table sysctl to fs/eventpoll.c and use
register_sysctl().

Link: https://lkml.kernel.org/r/20211123202422.819032-9-mcgrof@kernel.org
Signed-off-by: Xiaoming Ni &lt;nixiaoming@huawei.com&gt;
Signed-off-by: Luis Chamberlain &lt;mcgrof@kernel.org&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Amir Goldstein &lt;amir73il@gmail.com&gt;
Cc: Andy Shevchenko &lt;andriy.shevchenko@linux.intel.com&gt;
Cc: Antti Palosaari &lt;crope@iki.fi&gt;
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Benjamin LaHaise &lt;bcrl@kvack.org&gt;
Cc: Clemens Ladisch &lt;clemens@ladisch.de&gt;
Cc: David Airlie &lt;airlied@linux.ie&gt;
Cc: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: Iurii Zaikin &lt;yzaikin@google.com&gt;
Cc: Jani Nikula &lt;jani.nikula@linux.intel.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Joel Becker &lt;jlbec@evilplan.org&gt;
Cc: Joonas Lahtinen &lt;joonas.lahtinen@linux.intel.com&gt;
Cc: Joseph Qi &lt;joseph.qi@linux.alibaba.com&gt;
Cc: Julia Lawall &lt;julia.lawall@inria.fr&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Lukas Middendorf &lt;kernel@tuxforce.de&gt;
Cc: Mark Fasheh &lt;mark@fasheh.com&gt;
Cc: Paul Turner &lt;pjt@google.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Petr Mladek &lt;pmladek@suse.com&gt;
Cc: Phillip Potter &lt;phil@philpotter.co.uk&gt;
Cc: Qing Wang &lt;wangqing@vivo.com&gt;
Cc: Rodrigo Vivi &lt;rodrigo.vivi@intel.com&gt;
Cc: Sebastian Reichel &lt;sre@kernel.org&gt;
Cc: Sergey Senozhatsky &lt;senozhatsky@chromium.org&gt;
Cc: Stephen Kitt &lt;steve@sk2.org&gt;
Cc: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
Cc: Douglas Gilbert &lt;dgilbert@interlog.com&gt;
Cc: James E.J. Bottomley &lt;jejb@linux.ibm.com&gt;
Cc: Jani Nikula &lt;jani.nikula@intel.com&gt;
Cc: John Ogness &lt;john.ogness@linutronix.de&gt;
Cc: Martin K. Petersen &lt;martin.petersen@oracle.com&gt;
Cc: "Rafael J. Wysocki" &lt;rafael@kernel.org&gt;
Cc: Steven Rostedt (VMware) &lt;rostedt@goodmis.org&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: "Theodore Ts'o" &lt;tytso@mit.edu&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm</title>
<updated>2021-09-09T20:25:49+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2021-09-09T20:25:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=35776f10513c0d523c5dd2f1b415f642497779e2'/>
<id>35776f10513c0d523c5dd2f1b415f642497779e2</id>
<content type='text'>
Pull ARM development updates from Russell King:

 - Rename "mod_init" and "mod_exit" so that initcall debug output is
   actually useful (Randy Dunlap)

 - Update maintainers entries for linux-arm-kernel to indicate it is
   moderated for non-subscribers (Randy Dunlap)

 - Move install rules to arch/arm/Makefile (Masahiro Yamada)

 - Drop unnecessary ARCH_NR_GPIOS definition (Linus Walleij)

 - Don't warn about atags_to_fdt() stack size (David Heidelberg)

 - Speed up unaligned copy_{from,to}_kernel_nofault (Arnd Bergmann)

 - Get rid of set_fs() usage (Arnd Bergmann)

 - Remove checks for GCC prior to v4.6 (Geert Uytterhoeven)

* tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
  ARM: 9118/1: div64: Remove always-true __div64_const32_is_OK() duplicate
  ARM: 9117/1: asm-generic: div64: Remove always-true __div64_const32_is_OK()
  ARM: 9116/1: unified: Remove check for gcc &lt; 4
  ARM: 9110/1: oabi-compat: fix oabi epoll sparse warning
  ARM: 9113/1: uaccess: remove set_fs() implementation
  ARM: 9112/1: uaccess: add __{get,put}_kernel_nofault
  ARM: 9111/1: oabi-compat: rework fcntl64() emulation
  ARM: 9114/1: oabi-compat: rework sys_semtimedop emulation
  ARM: 9108/1: oabi-compat: rework epoll_wait/epoll_pwait emulation
  ARM: 9107/1: syscall: always store thread_info-&gt;abi_syscall
  ARM: 9109/1: oabi-compat: add epoll_pwait handler
  ARM: 9106/1: traps: use get_kernel_nofault instead of set_fs()
  ARM: 9115/1: mm/maccess: fix unaligned copy_{from,to}_kernel_nofault
  ARM: 9105/1: atags_to_fdt: don't warn about stack size
  ARM: 9103/1: Drop ARCH_NR_GPIOS definition
  ARM: 9102/1: move theinstall rules to arch/arm/Makefile
  ARM: 9100/1: MAINTAINERS: mark all linux-arm-kernel@infradead list as moderated
  ARM: 9099/1: crypto: rename 'mod_init' &amp; 'mod_exit' functions to be module-specific
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull ARM development updates from Russell King:

 - Rename "mod_init" and "mod_exit" so that initcall debug output is
   actually useful (Randy Dunlap)

 - Update maintainers entries for linux-arm-kernel to indicate it is
   moderated for non-subscribers (Randy Dunlap)

 - Move install rules to arch/arm/Makefile (Masahiro Yamada)

 - Drop unnecessary ARCH_NR_GPIOS definition (Linus Walleij)

 - Don't warn about atags_to_fdt() stack size (David Heidelberg)

 - Speed up unaligned copy_{from,to}_kernel_nofault (Arnd Bergmann)

 - Get rid of set_fs() usage (Arnd Bergmann)

 - Remove checks for GCC prior to v4.6 (Geert Uytterhoeven)

* tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
  ARM: 9118/1: div64: Remove always-true __div64_const32_is_OK() duplicate
  ARM: 9117/1: asm-generic: div64: Remove always-true __div64_const32_is_OK()
  ARM: 9116/1: unified: Remove check for gcc &lt; 4
  ARM: 9110/1: oabi-compat: fix oabi epoll sparse warning
  ARM: 9113/1: uaccess: remove set_fs() implementation
  ARM: 9112/1: uaccess: add __{get,put}_kernel_nofault
  ARM: 9111/1: oabi-compat: rework fcntl64() emulation
  ARM: 9114/1: oabi-compat: rework sys_semtimedop emulation
  ARM: 9108/1: oabi-compat: rework epoll_wait/epoll_pwait emulation
  ARM: 9107/1: syscall: always store thread_info-&gt;abi_syscall
  ARM: 9109/1: oabi-compat: add epoll_pwait handler
  ARM: 9106/1: traps: use get_kernel_nofault instead of set_fs()
  ARM: 9115/1: mm/maccess: fix unaligned copy_{from,to}_kernel_nofault
  ARM: 9105/1: atags_to_fdt: don't warn about stack size
  ARM: 9103/1: Drop ARCH_NR_GPIOS definition
  ARM: 9102/1: move theinstall rules to arch/arm/Makefile
  ARM: 9100/1: MAINTAINERS: mark all linux-arm-kernel@infradead list as moderated
  ARM: 9099/1: crypto: rename 'mod_init' &amp; 'mod_exit' functions to be module-specific
</pre>
</div>
</content>
</entry>
<entry>
<title>fs/epoll: use a per-cpu counter for user's watches count</title>
<updated>2021-09-08T18:50:27+00:00</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2021-09-08T03:00:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=1e1c15839df084f4011825fee922aa976c9159dc'/>
<id>1e1c15839df084f4011825fee922aa976c9159dc</id>
<content type='text'>
This counter tracks the number of watches a user has, to compare against
the 'max_user_watches' limit. This causes a scalability bottleneck on
SPECjbb2015 on large systems as there is only one user. Changing to a
per-cpu counter increases throughput of the benchmark by about 30% on a
16-socket, &gt; 1000 thread system.

[rdunlap@infradead.org: fix build errors in kernel/user.c when CONFIG_EPOLL=n]
[npiggin@gmail.com: move ifdefs into wrapper functions, slightly improve panic message]
  Link: https://lkml.kernel.org/r/1628051945.fens3r99ox.astroid@bobo.none
[akpm@linux-foundation.org: tweak user_epoll_alloc(), per Guenter]
  Link: https://lkml.kernel.org/r/20210804191421.GA1900577@roeck-us.net

Link: https://lkml.kernel.org/r/20210802032013.2751916-1-npiggin@gmail.com
Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Reported-by: Anton Blanchard &lt;anton@ozlabs.org&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This counter tracks the number of watches a user has, to compare against
the 'max_user_watches' limit. This causes a scalability bottleneck on
SPECjbb2015 on large systems as there is only one user. Changing to a
per-cpu counter increases throughput of the benchmark by about 30% on a
16-socket, &gt; 1000 thread system.

[rdunlap@infradead.org: fix build errors in kernel/user.c when CONFIG_EPOLL=n]
[npiggin@gmail.com: move ifdefs into wrapper functions, slightly improve panic message]
  Link: https://lkml.kernel.org/r/1628051945.fens3r99ox.astroid@bobo.none
[akpm@linux-foundation.org: tweak user_epoll_alloc(), per Guenter]
  Link: https://lkml.kernel.org/r/20210804191421.GA1900577@roeck-us.net

Link: https://lkml.kernel.org/r/20210802032013.2751916-1-npiggin@gmail.com
Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Reported-by: Anton Blanchard &lt;anton@ozlabs.org&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>ARM: 9108/1: oabi-compat: rework epoll_wait/epoll_pwait emulation</title>
<updated>2021-08-20T10:39:26+00:00</updated>
<author>
<name>Arnd Bergmann</name>
<email>arnd@arndb.de</email>
</author>
<published>2021-08-11T07:30:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=249dbe74d3c4b568a623fb55c56cddf19fdf0b89'/>
<id>249dbe74d3c4b568a623fb55c56cddf19fdf0b89</id>
<content type='text'>
The epoll_wait() system call wrapper is one of the remaining users of
the set_fs() infrasturcture for Arm. Changing it to not require set_fs()
is rather complex unfortunately.

The approach I'm taking here is to allow architectures to override
the code that copies the output to user space, and let the oabi-compat
implementation check whether it is getting called from an EABI or OABI
system call based on the thread_info-&gt;syscall value.

The in_oabi_syscall() check here mirrors the in_compat_syscall() and
in_x32_syscall() helpers for 32-bit compat implementations on other
architectures.

Overall, the amount of code goes down, at least with the newly added
sys_oabi_epoll_pwait() helper getting removed again. The downside
is added complexity in the source code for the native implementation.
There should be no difference in runtime performance except for Arm
kernels with CONFIG_OABI_COMPAT enabled that now have to go through
an external function call to check which of the two variants to use.

Acked-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Arnd Bergmann &lt;arnd@arndb.de&gt;
Signed-off-by: Russell King (Oracle) &lt;rmk+kernel@armlinux.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The epoll_wait() system call wrapper is one of the remaining users of
the set_fs() infrasturcture for Arm. Changing it to not require set_fs()
is rather complex unfortunately.

The approach I'm taking here is to allow architectures to override
the code that copies the output to user space, and let the oabi-compat
implementation check whether it is getting called from an EABI or OABI
system call based on the thread_info-&gt;syscall value.

The in_oabi_syscall() check here mirrors the in_compat_syscall() and
in_x32_syscall() helpers for 32-bit compat implementations on other
architectures.

Overall, the amount of code goes down, at least with the newly added
sys_oabi_epoll_pwait() helper getting removed again. The downside
is added complexity in the source code for the native implementation.
There should be no difference in runtime performance except for Arm
kernels with CONFIG_OABI_COMPAT enabled that now have to go through
an external function call to check which of the two variants to use.

Acked-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Arnd Bergmann &lt;arnd@arndb.de&gt;
Signed-off-by: Russell King (Oracle) &lt;rmk+kernel@armlinux.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>fs/epoll: restore waking from ep_done_scan()</title>
<updated>2021-05-07T02:24:13+00:00</updated>
<author>
<name>Davidlohr Bueso</name>
<email>dave@stgolabs.net</email>
</author>
<published>2021-05-07T01:04:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=7fab29e356309ff93a4b30ecc466129682ec190b'/>
<id>7fab29e356309ff93a4b30ecc466129682ec190b</id>
<content type='text'>
Commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested
epoll") changed the userspace visible behavior of exclusive waiters
blocked on a common epoll descriptor upon a single event becoming ready.

Previously, all tasks doing epoll_wait would awake, and now only one is
awoken, potentially causing missed wakeups on applications that rely on
this behavior, such as Apache Qpid.

While the aforementioned commit aims at having only a wakeup single path
in ep_poll_callback (with the exceptions of epoll_ctl cases), we need to
restore the wakeup in what was the old ep_scan_ready_list() such that
the next thread can be awoken, in a cascading style, after the waker's
corresponding ep_send_events().

Link: https://lkml.kernel.org/r/20210405231025.33829-3-dave@stgolabs.net
Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
Signed-off-by: Davidlohr Bueso &lt;dbueso@suse.de&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Jason Baron &lt;jbaron@akamai.com&gt;
Cc: Roman Penyaev &lt;rpenyaev@suse.de&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested
epoll") changed the userspace visible behavior of exclusive waiters
blocked on a common epoll descriptor upon a single event becoming ready.

Previously, all tasks doing epoll_wait would awake, and now only one is
awoken, potentially causing missed wakeups on applications that rely on
this behavior, such as Apache Qpid.

While the aforementioned commit aims at having only a wakeup single path
in ep_poll_callback (with the exceptions of epoll_ctl cases), we need to
restore the wakeup in what was the old ep_scan_ready_list() such that
the next thread can be awoken, in a cascading style, after the waker's
corresponding ep_send_events().

Link: https://lkml.kernel.org/r/20210405231025.33829-3-dave@stgolabs.net
Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
Signed-off-by: Davidlohr Bueso &lt;dbueso@suse.de&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Jason Baron &lt;jbaron@akamai.com&gt;
Cc: Roman Penyaev &lt;rpenyaev@suse.de&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>fs: eventpoll: fix comments &amp; kernel-doc notation</title>
<updated>2021-03-07T00:36:53+00:00</updated>
<author>
<name>Randy Dunlap</name>
<email>rdunlap@infradead.org</email>
</author>
<published>2021-03-01T22:25:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=a6c67fee9cf09552a6d37724a91e1183a06a79cb'/>
<id>a6c67fee9cf09552a6d37724a91e1183a06a79cb</id>
<content type='text'>
Use the documented kernel-doc format for function Return: descriptions.
Begin constant values in kernel-doc comments with '%'.

Remove kernel-doc "/**" from 2 functions that are not documented with
kernel-doc notation.

Fix typos, punctuation, &amp; grammar.

Also fix a few kernel-doc warnings:

../fs/eventpoll.c:1883: warning: Function parameter or member 'ep' not described in 'ep_loop_check_proc'
../fs/eventpoll.c:1883: warning: Excess function parameter 'priv' description in 'ep_loop_check_proc'
../fs/eventpoll.c:1932: warning: Function parameter or member 'ep' not described in 'ep_loop_check'
../fs/eventpoll.c:1932: warning: Excess function parameter 'from' description in 'ep_loop_check'

Signed-off-by: Randy Dunlap &lt;rdunlap@infradead.org&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Jonathan Corbet &lt;corbet@lwn.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Use the documented kernel-doc format for function Return: descriptions.
Begin constant values in kernel-doc comments with '%'.

Remove kernel-doc "/**" from 2 functions that are not documented with
kernel-doc notation.

Fix typos, punctuation, &amp; grammar.

Also fix a few kernel-doc warnings:

../fs/eventpoll.c:1883: warning: Function parameter or member 'ep' not described in 'ep_loop_check_proc'
../fs/eventpoll.c:1883: warning: Excess function parameter 'priv' description in 'ep_loop_check_proc'
../fs/eventpoll.c:1932: warning: Function parameter or member 'ep' not described in 'ep_loop_check'
../fs/eventpoll.c:1932: warning: Excess function parameter 'from' description in 'ep_loop_check'

Signed-off-by: Randy Dunlap &lt;rdunlap@infradead.org&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Jonathan Corbet &lt;corbet@lwn.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>kcmp: Support selection of SYS_kcmp without CHECKPOINT_RESTORE</title>
<updated>2021-02-16T08:59:41+00:00</updated>
<author>
<name>Chris Wilson</name>
<email>chris@chris-wilson.co.uk</email>
</author>
<published>2021-02-05T22:00:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=bfe3911a91047557eb0e620f95a370aee6a248c7'/>
<id>bfe3911a91047557eb0e620f95a370aee6a248c7</id>
<content type='text'>
Userspace has discovered the functionality offered by SYS_kcmp and has
started to depend upon it. In particular, Mesa uses SYS_kcmp for
os_same_file_description() in order to identify when two fd (e.g. device
or dmabuf) point to the same struct file. Since they depend on it for
core functionality, lift SYS_kcmp out of the non-default
CONFIG_CHECKPOINT_RESTORE into the selectable syscall category.

Rasmus Villemoes also pointed out that systemd uses SYS_kcmp to
deduplicate the per-service file descriptor store.

Note that some distributions such as Ubuntu are already enabling
CHECKPOINT_RESTORE in their configs and so, by extension, SYS_kcmp.

References: https://gitlab.freedesktop.org/drm/intel/-/issues/3046
Signed-off-by: Chris Wilson &lt;chris@chris-wilson.co.uk&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Andy Lutomirski &lt;luto@amacapital.net&gt;
Cc: Will Drewry &lt;wad@chromium.org&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Dave Airlie &lt;airlied@gmail.com&gt;
Cc: Daniel Vetter &lt;daniel@ffwll.ch&gt;
Cc: Lucas Stach &lt;l.stach@pengutronix.de&gt;
Cc: Rasmus Villemoes &lt;linux@rasmusvillemoes.dk&gt;
Cc: Cyrill Gorcunov &lt;gorcunov@gmail.com&gt;
Cc: stable@vger.kernel.org
Acked-by: Daniel Vetter &lt;daniel.vetter@ffwll.ch&gt; # DRM depends on kcmp
Acked-by: Rasmus Villemoes &lt;linux@rasmusvillemoes.dk&gt; # systemd uses kcmp
Reviewed-by: Cyrill Gorcunov &lt;gorcunov@gmail.com&gt;
Reviewed-by: Kees Cook &lt;keescook@chromium.org&gt;
Acked-by: Thomas Zimmermann &lt;tzimmermann@suse.de&gt;
Signed-off-by: Daniel Vetter &lt;daniel.vetter@ffwll.ch&gt;
Link: https://patchwork.freedesktop.org/patch/msgid/20210205220012.1983-1-chris@chris-wilson.co.uk
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Userspace has discovered the functionality offered by SYS_kcmp and has
started to depend upon it. In particular, Mesa uses SYS_kcmp for
os_same_file_description() in order to identify when two fd (e.g. device
or dmabuf) point to the same struct file. Since they depend on it for
core functionality, lift SYS_kcmp out of the non-default
CONFIG_CHECKPOINT_RESTORE into the selectable syscall category.

Rasmus Villemoes also pointed out that systemd uses SYS_kcmp to
deduplicate the per-service file descriptor store.

Note that some distributions such as Ubuntu are already enabling
CHECKPOINT_RESTORE in their configs and so, by extension, SYS_kcmp.

References: https://gitlab.freedesktop.org/drm/intel/-/issues/3046
Signed-off-by: Chris Wilson &lt;chris@chris-wilson.co.uk&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Andy Lutomirski &lt;luto@amacapital.net&gt;
Cc: Will Drewry &lt;wad@chromium.org&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Dave Airlie &lt;airlied@gmail.com&gt;
Cc: Daniel Vetter &lt;daniel@ffwll.ch&gt;
Cc: Lucas Stach &lt;l.stach@pengutronix.de&gt;
Cc: Rasmus Villemoes &lt;linux@rasmusvillemoes.dk&gt;
Cc: Cyrill Gorcunov &lt;gorcunov@gmail.com&gt;
Cc: stable@vger.kernel.org
Acked-by: Daniel Vetter &lt;daniel.vetter@ffwll.ch&gt; # DRM depends on kcmp
Acked-by: Rasmus Villemoes &lt;linux@rasmusvillemoes.dk&gt; # systemd uses kcmp
Reviewed-by: Cyrill Gorcunov &lt;gorcunov@gmail.com&gt;
Reviewed-by: Kees Cook &lt;keescook@chromium.org&gt;
Acked-by: Thomas Zimmermann &lt;tzimmermann@suse.de&gt;
Signed-off-by: Daniel Vetter &lt;daniel.vetter@ffwll.ch&gt;
Link: https://patchwork.freedesktop.org/patch/msgid/20210205220012.1983-1-chris@chris-wilson.co.uk
</pre>
</div>
</content>
</entry>
</feed>
