<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/fs/eventpoll.c, branch v4.15</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>Merge branch 'akpm' (patches from Andrew)</title>
<updated>2017-11-18T00:56:17+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2017-11-18T00:56:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=fa7f578076a8814caa5371e9f4949e408140766d'/>
<id>fa7f578076a8814caa5371e9f4949e408140766d</id>
<content type='text'>
Merge more updates from Andrew Morton:

 - a bit more MM

 - procfs updates

 - dynamic-debug fixes

 - lib/ updates

 - checkpatch

 - epoll

 - nilfs2

 - signals

 - rapidio

 - PID management cleanup and optimization

 - kcov updates

 - sysvipc updates

 - quite a few misc things all over the place

* emailed patches from Andrew Morton &lt;akpm@linux-foundation.org&gt;: (94 commits)
  EXPERT Kconfig menu: fix broken EXPERT menu
  include/asm-generic/topology.h: remove unused parent_node() macro
  arch/tile/include/asm/topology.h: remove unused parent_node() macro
  arch/sparc/include/asm/topology_64.h: remove unused parent_node() macro
  arch/sh/include/asm/topology.h: remove unused parent_node() macro
  arch/ia64/include/asm/topology.h: remove unused parent_node() macro
  drivers/pcmcia/sa1111_badge4.c: avoid unused function warning
  mm: add infrastructure for get_user_pages_fast() benchmarking
  sysvipc: make get_maxid O(1) again
  sysvipc: properly name ipc_addid() limit parameter
  sysvipc: duplicate lock comments wrt ipc_addid()
  sysvipc: unteach ids-&gt;next_id for !CHECKPOINT_RESTORE
  initramfs: use time64_t timestamps
  drivers/watchdog: make use of devm_register_reboot_notifier()
  kernel/reboot.c: add devm_register_reboot_notifier()
  kcov: update documentation
  Makefile: support flag -fsanitizer-coverage=trace-cmp
  kcov: support comparison operands collection
  kcov: remove pointless current != NULL check
  kernel/panic.c: add TAINT_AUX
  ...
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Merge more updates from Andrew Morton:

 - a bit more MM

 - procfs updates

 - dynamic-debug fixes

 - lib/ updates

 - checkpatch

 - epoll

 - nilfs2

 - signals

 - rapidio

 - PID management cleanup and optimization

 - kcov updates

 - sysvipc updates

 - quite a few misc things all over the place

* emailed patches from Andrew Morton &lt;akpm@linux-foundation.org&gt;: (94 commits)
  EXPERT Kconfig menu: fix broken EXPERT menu
  include/asm-generic/topology.h: remove unused parent_node() macro
  arch/tile/include/asm/topology.h: remove unused parent_node() macro
  arch/sparc/include/asm/topology_64.h: remove unused parent_node() macro
  arch/sh/include/asm/topology.h: remove unused parent_node() macro
  arch/ia64/include/asm/topology.h: remove unused parent_node() macro
  drivers/pcmcia/sa1111_badge4.c: avoid unused function warning
  mm: add infrastructure for get_user_pages_fast() benchmarking
  sysvipc: make get_maxid O(1) again
  sysvipc: properly name ipc_addid() limit parameter
  sysvipc: duplicate lock comments wrt ipc_addid()
  sysvipc: unteach ids-&gt;next_id for !CHECKPOINT_RESTORE
  initramfs: use time64_t timestamps
  drivers/watchdog: make use of devm_register_reboot_notifier()
  kernel/reboot.c: add devm_register_reboot_notifier()
  kcov: update documentation
  Makefile: support flag -fsanitizer-coverage=trace-cmp
  kcov: support comparison operands collection
  kcov: remove pointless current != NULL check
  kernel/panic.c: add TAINT_AUX
  ...
</pre>
</div>
</content>
</entry>
<entry>
<title>epoll: remove ep_call_nested() from ep_eventpoll_poll()</title>
<updated>2017-11-18T00:10:02+00:00</updated>
<author>
<name>Jason Baron</name>
<email>jbaron@akamai.com</email>
</author>
<published>2017-11-17T23:29:06+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=37b5e5212a448bac0fe29d2a51f088014fbaaa41'/>
<id>37b5e5212a448bac0fe29d2a51f088014fbaaa41</id>
<content type='text'>
The use of ep_call_nested() in ep_eventpoll_poll(), which is the .poll
routine for an epoll fd, is used to prevent excessively deep epoll
nesting, and to prevent circular paths.

However, we are already preventing these conditions during
EPOLL_CTL_ADD.  In terms of too deep epoll chains, we do in fact allow
deep nesting of the epoll fds themselves (deeper than EP_MAX_NESTS),
however we don't allow more than EP_MAX_NESTS when an epoll file
descriptor is actually connected to a wakeup source.  Thus, we do not
require the use of ep_call_nested(), since ep_eventpoll_poll(), which is
called via ep_scan_ready_list() only continues nesting if there are
events available.

Since ep_call_nested() is implemented using a global lock, applications
that make use of nested epoll can see large performance improvements
with this change.

Davidlohr said:

: Improvements are quite obscene actually, such as for the following
: epoll_wait() benchmark with 2 level nesting on a 80 core IvyBridge:
:
: ncpus  vanilla     dirty     delta
: 1      2447092     3028315   +23.75%
: 4      231265      2986954   +1191.57%
: 8      121631      2898796   +2283.27%
: 16     59749       2902056   +4757.07%
: 32     26837	     2326314   +8568.30%
: 64     12926       1341281   +10276.61%
:
: (http://linux-scalability.org/epoll/epoll-test.c)

Link: http://lkml.kernel.org/r/1509430214-5599-1-git-send-email-jbaron@akamai.com
Signed-off-by: Jason Baron &lt;jbaron@akamai.com&gt;
Cc: Davidlohr Bueso &lt;dave@stgolabs.net&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Salman Qazi &lt;sqazi@google.com&gt;
Cc: Hou Tao &lt;houtao1@huawei.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The use of ep_call_nested() in ep_eventpoll_poll(), which is the .poll
routine for an epoll fd, is used to prevent excessively deep epoll
nesting, and to prevent circular paths.

However, we are already preventing these conditions during
EPOLL_CTL_ADD.  In terms of too deep epoll chains, we do in fact allow
deep nesting of the epoll fds themselves (deeper than EP_MAX_NESTS),
however we don't allow more than EP_MAX_NESTS when an epoll file
descriptor is actually connected to a wakeup source.  Thus, we do not
require the use of ep_call_nested(), since ep_eventpoll_poll(), which is
called via ep_scan_ready_list() only continues nesting if there are
events available.

Since ep_call_nested() is implemented using a global lock, applications
that make use of nested epoll can see large performance improvements
with this change.

Davidlohr said:

: Improvements are quite obscene actually, such as for the following
: epoll_wait() benchmark with 2 level nesting on a 80 core IvyBridge:
:
: ncpus  vanilla     dirty     delta
: 1      2447092     3028315   +23.75%
: 4      231265      2986954   +1191.57%
: 8      121631      2898796   +2283.27%
: 16     59749       2902056   +4757.07%
: 32     26837	     2326314   +8568.30%
: 64     12926       1341281   +10276.61%
:
: (http://linux-scalability.org/epoll/epoll-test.c)

Link: http://lkml.kernel.org/r/1509430214-5599-1-git-send-email-jbaron@akamai.com
Signed-off-by: Jason Baron &lt;jbaron@akamai.com&gt;
Cc: Davidlohr Bueso &lt;dave@stgolabs.net&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Salman Qazi &lt;sqazi@google.com&gt;
Cc: Hou Tao &lt;houtao1@huawei.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>epoll: avoid calling ep_call_nested() from ep_poll_safewake()</title>
<updated>2017-11-18T00:10:02+00:00</updated>
<author>
<name>Jason Baron</name>
<email>jbaron@akamai.com</email>
</author>
<published>2017-11-17T23:29:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=57a173bdf5baab48e8e78825c7366c634acd087c'/>
<id>57a173bdf5baab48e8e78825c7366c634acd087c</id>
<content type='text'>
ep_poll_safewake() is used to wakeup potentially nested epoll file
descriptors.  The function uses ep_call_nested() to prevent entering the
same wake up queue more than once, and to prevent excessively deep
wakeup paths (deeper than EP_MAX_NESTS).  However, this is not necessary
since we are already preventing these conditions during EPOLL_CTL_ADD.
This saves extra function calls, and avoids taking a global lock during
the ep_call_nested() calls.

I have, however, left ep_call_nested() for the CONFIG_DEBUG_LOCK_ALLOC
case, since ep_call_nested() keeps track of the nesting level, and this
is required by the call to spin_lock_irqsave_nested().  It would be nice
to remove the ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC
case as well, however its not clear how to simply pass the nesting level
through multiple wake_up() levels without more surgery.  In any case, I
don't think CONFIG_DEBUG_LOCK_ALLOC is generally used for production.
This patch, also apparently fixes a workload at Google that Salman Qazi
reported by completely removing the poll_safewake_ncalls-&gt;lock from
wakeup paths.

Link: http://lkml.kernel.org/r/1507920533-8812-1-git-send-email-jbaron@akamai.com
Signed-off-by: Jason Baron &lt;jbaron@akamai.com&gt;
Acked-by: Davidlohr Bueso &lt;dbueso@suse.de&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Salman Qazi &lt;sqazi@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
ep_poll_safewake() is used to wakeup potentially nested epoll file
descriptors.  The function uses ep_call_nested() to prevent entering the
same wake up queue more than once, and to prevent excessively deep
wakeup paths (deeper than EP_MAX_NESTS).  However, this is not necessary
since we are already preventing these conditions during EPOLL_CTL_ADD.
This saves extra function calls, and avoids taking a global lock during
the ep_call_nested() calls.

I have, however, left ep_call_nested() for the CONFIG_DEBUG_LOCK_ALLOC
case, since ep_call_nested() keeps track of the nesting level, and this
is required by the call to spin_lock_irqsave_nested().  It would be nice
to remove the ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC
case as well, however its not clear how to simply pass the nesting level
through multiple wake_up() levels without more surgery.  In any case, I
don't think CONFIG_DEBUG_LOCK_ALLOC is generally used for production.
This patch, also apparently fixes a workload at Google that Salman Qazi
reported by completely removing the poll_safewake_ncalls-&gt;lock from
wakeup paths.

Link: http://lkml.kernel.org/r/1507920533-8812-1-git-send-email-jbaron@akamai.com
Signed-off-by: Jason Baron &lt;jbaron@akamai.com&gt;
Acked-by: Davidlohr Bueso &lt;dbueso@suse.de&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Salman Qazi &lt;sqazi@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>epoll: account epitem and eppoll_entry to kmemcg</title>
<updated>2017-11-18T00:10:02+00:00</updated>
<author>
<name>Shakeel Butt</name>
<email>shakeelb@google.com</email>
</author>
<published>2017-11-17T23:28:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=2ae928a9441a3b5f13952e1e8a97d03cb23ea603'/>
<id>2ae928a9441a3b5f13952e1e8a97d03cb23ea603</id>
<content type='text'>
A userspace application can directly trigger the allocations from
eventpoll_epi and eventpoll_pwq slabs.  A buggy or malicious application
can consume a significant amount of system memory by triggering such
allocations.  Indeed we have seen in production where a buggy
application was leaking the epoll references and causing a burst of
eventpoll_epi and eventpoll_pwq slab allocations.  This patch opt-in the
charging of eventpoll_epi and eventpoll_pwq slabs.

There is a per-user limit (~4% of total memory if no highmem) on these
caches.  I think it is too generous particularly in the scenario where
jobs of multiple users are running on the system and the administrator
is reducing cost by overcomitting the memory.  This is unaccounted
kernel memory and will not be considered by the oom-killer.  I think by
accounting it to kmemcg, for systems with kmem accounting enabled, we
can provide better isolation between jobs of different users.

Link: http://lkml.kernel.org/r/20171003021519.23907-1-shakeelb@google.com
Signed-off-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Vladimir Davydov &lt;vdavydov.dev@gmail.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
A userspace application can directly trigger the allocations from
eventpoll_epi and eventpoll_pwq slabs.  A buggy or malicious application
can consume a significant amount of system memory by triggering such
allocations.  Indeed we have seen in production where a buggy
application was leaking the epoll references and causing a burst of
eventpoll_epi and eventpoll_pwq slab allocations.  This patch opt-in the
charging of eventpoll_epi and eventpoll_pwq slabs.

There is a per-user limit (~4% of total memory if no highmem) on these
caches.  I think it is too generous particularly in the scenario where
jobs of multiple users are running on the system and the administrator
is reducing cost by overcomitting the memory.  This is unaccounted
kernel memory and will not be considered by the oom-killer.  I think by
accounting it to kmemcg, for systems with kmem accounting enabled, we
can provide better isolation between jobs of different users.

Link: http://lkml.kernel.org/r/20171003021519.23907-1-shakeelb@google.com
Signed-off-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Vladimir Davydov &lt;vdavydov.dev@gmail.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>get_compat_sigset()</title>
<updated>2017-09-19T21:56:01+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2017-09-04T01:45:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=3968cf623892d710e651070243fd16af312a9797'/>
<id>3968cf623892d710e651070243fd16af312a9797</id>
<content type='text'>
similar to put_compat_sigset()

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
similar to put_compat_sigset()

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>fs/epoll: use faster rb_first_cached()</title>
<updated>2017-09-09T01:26:49+00:00</updated>
<author>
<name>Davidlohr Bueso</name>
<email>dave@stgolabs.net</email>
</author>
<published>2017-09-08T23:15:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=b2ac2ea6296e7dd779168eb085b09d0fab9d1294'/>
<id>b2ac2ea6296e7dd779168eb085b09d0fab9d1294</id>
<content type='text'>
...  such that we can avoid the tree walks to get the node with the
smallest key.  Semantically the same, as the previously used rb_first(),
but O(1).  The main overhead is the extra footprint for the cached rb_node
pointer, which should not matter for epoll.

Link: http://lkml.kernel.org/r/20170719014603.19029-15-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso &lt;dbueso@suse.de&gt;
Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
...  such that we can avoid the tree walks to get the node with the
smallest key.  Semantically the same, as the previously used rb_first(),
but O(1).  The main overhead is the extra footprint for the cached rb_node
pointer, which should not matter for epoll.

Link: http://lkml.kernel.org/r/20170719014603.19029-15-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso &lt;dbueso@suse.de&gt;
Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>epoll: fix race between ep_poll_callback(POLLFREE) and ep_free()/ep_remove()</title>
<updated>2017-09-01T20:07:35+00:00</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2017-09-01T16:55:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=138e4ad67afd5c6c318b056b4d17c17f2c0ca5c0'/>
<id>138e4ad67afd5c6c318b056b4d17c17f2c0ca5c0</id>
<content type='text'>
The race was introduced by me in commit 971316f0503a ("epoll:
ep_unregister_pollwait() can use the freed pwq-&gt;whead").  I did not
realize that nothing can protect eventpoll after ep_poll_callback() sets
-&gt;whead = NULL, only whead-&gt;lock can save us from the race with
ep_free() or ep_remove().

Move -&gt;whead = NULL to the end of ep_poll_callback() and add the
necessary barriers.

TODO: cleanup the ewake/EPOLLEXCLUSIVE logic, it was confusing even
before this patch.

Hopefully this explains use-after-free reported by syzcaller:

	BUG: KASAN: use-after-free in debug_spin_lock_before
	...
	 _raw_spin_lock_irqsave+0x4a/0x60 kernel/locking/spinlock.c:159
	 ep_poll_callback+0x29f/0xff0 fs/eventpoll.c:1148

this is spin_lock(eventpoll-&gt;lock),

	...
	Freed by task 17774:
	...
	 kfree+0xe8/0x2c0 mm/slub.c:3883
	 ep_free+0x22c/0x2a0 fs/eventpoll.c:865

Fixes: 971316f0503a ("epoll: ep_unregister_pollwait() can use the freed pwq-&gt;whead")
Reported-by: 范龙飞 &lt;long7573@126.com&gt;
Cc: stable@vger.kernel.org
Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The race was introduced by me in commit 971316f0503a ("epoll:
ep_unregister_pollwait() can use the freed pwq-&gt;whead").  I did not
realize that nothing can protect eventpoll after ep_poll_callback() sets
-&gt;whead = NULL, only whead-&gt;lock can save us from the race with
ep_free() or ep_remove().

Move -&gt;whead = NULL to the end of ep_poll_callback() and add the
necessary barriers.

TODO: cleanup the ewake/EPOLLEXCLUSIVE logic, it was confusing even
before this patch.

Hopefully this explains use-after-free reported by syzcaller:

	BUG: KASAN: use-after-free in debug_spin_lock_before
	...
	 _raw_spin_lock_irqsave+0x4a/0x60 kernel/locking/spinlock.c:159
	 ep_poll_callback+0x29f/0xff0 fs/eventpoll.c:1148

this is spin_lock(eventpoll-&gt;lock),

	...
	Freed by task 17774:
	...
	 kfree+0xe8/0x2c0 mm/slub.c:3883
	 ep_free+0x22c/0x2a0 fs/eventpoll.c:865

Fixes: 971316f0503a ("epoll: ep_unregister_pollwait() can use the freed pwq-&gt;whead")
Reported-by: 范龙飞 &lt;long7573@126.com&gt;
Cc: stable@vger.kernel.org
Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>kcmp: fs/epoll: wrap kcmp code with CONFIG_CHECKPOINT_RESTORE</title>
<updated>2017-07-12T23:26:01+00:00</updated>
<author>
<name>Cyrill Gorcunov</name>
<email>gorcunov@gmail.com</email>
</author>
<published>2017-07-12T21:34:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=92ef6da3d06ff551a86de41ae37df9cc4b58d7a0'/>
<id>92ef6da3d06ff551a86de41ae37df9cc4b58d7a0</id>
<content type='text'>
kcmp syscall is build iif CONFIG_CHECKPOINT_RESTORE is selected, so wrap
appropriate helpers in epoll code with the config to build it
conditionally.

Link: http://lkml.kernel.org/r/20170513083456.GG1881@uranus.lan
Signed-off-by: Cyrill Gorcunov &lt;gorcunov@openvz.org&gt;
Reported-by: Andrew Morton &lt;akpm@linuxfoundation.org&gt;
Cc: Andrey Vagin &lt;avagin@openvz.org&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Pavel Emelyanov &lt;xemul@virtuozzo.com&gt;
Cc: Michael Kerrisk &lt;mtk.manpages@gmail.com&gt;
Cc: Jason Baron &lt;jbaron@akamai.com&gt;
Cc: Andy Lutomirski &lt;luto@amacapital.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
kcmp syscall is build iif CONFIG_CHECKPOINT_RESTORE is selected, so wrap
appropriate helpers in epoll code with the config to build it
conditionally.

Link: http://lkml.kernel.org/r/20170513083456.GG1881@uranus.lan
Signed-off-by: Cyrill Gorcunov &lt;gorcunov@openvz.org&gt;
Reported-by: Andrew Morton &lt;akpm@linuxfoundation.org&gt;
Cc: Andrey Vagin &lt;avagin@openvz.org&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Pavel Emelyanov &lt;xemul@virtuozzo.com&gt;
Cc: Michael Kerrisk &lt;mtk.manpages@gmail.com&gt;
Cc: Jason Baron &lt;jbaron@akamai.com&gt;
Cc: Andy Lutomirski &lt;luto@amacapital.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>kcmp: add KCMP_EPOLL_TFD mode to compare epoll target files</title>
<updated>2017-07-12T23:26:01+00:00</updated>
<author>
<name>Cyrill Gorcunov</name>
<email>gorcunov@gmail.com</email>
</author>
<published>2017-07-12T21:34:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=0791e3644e5ef21646fe565b9061788d05ec71d4'/>
<id>0791e3644e5ef21646fe565b9061788d05ec71d4</id>
<content type='text'>
With current epoll architecture target files are addressed with
file_struct and file descriptor number, where the last is not unique.
Moreover files can be transferred from another process via unix socket,
added into queue and closed then so we won't find this descriptor in the
task fdinfo list.

Thus to checkpoint and restore such processes CRIU needs to find out
where exactly the target file is present to add it into epoll queue.
For this sake one can use kcmp call where some particular target file
from the queue is compared with arbitrary file passed as an argument.

Because epoll target files can have same file descriptor number but
different file_struct a caller should explicitly specify the offset
within.

To test if some particular file is matching entry inside epoll one have
to

 - fill kcmp_epoll_slot structure with epoll file descriptor,
   target file number and target file offset (in case if only
   one target is present then it should be 0)

 - call kcmp as kcmp(pid1, pid2, KCMP_EPOLL_TFD, fd, &amp;kcmp_epoll_slot)
    - the kernel fetch file pointer matching file descriptor @fd of pid1
    - lookups for file struct in epoll queue of pid2 and returns traditional
      0,1,2 result for sorting purpose

Link: http://lkml.kernel.org/r/20170424154423.511592110@gmail.com
Signed-off-by: Cyrill Gorcunov &lt;gorcunov@openvz.org&gt;
Acked-by: Andrey Vagin &lt;avagin@openvz.org&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Pavel Emelyanov &lt;xemul@virtuozzo.com&gt;
Cc: Michael Kerrisk &lt;mtk.manpages@gmail.com&gt;
Cc: Jason Baron &lt;jbaron@akamai.com&gt;
Cc: Andy Lutomirski &lt;luto@amacapital.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
With current epoll architecture target files are addressed with
file_struct and file descriptor number, where the last is not unique.
Moreover files can be transferred from another process via unix socket,
added into queue and closed then so we won't find this descriptor in the
task fdinfo list.

Thus to checkpoint and restore such processes CRIU needs to find out
where exactly the target file is present to add it into epoll queue.
For this sake one can use kcmp call where some particular target file
from the queue is compared with arbitrary file passed as an argument.

Because epoll target files can have same file descriptor number but
different file_struct a caller should explicitly specify the offset
within.

To test if some particular file is matching entry inside epoll one have
to

 - fill kcmp_epoll_slot structure with epoll file descriptor,
   target file number and target file offset (in case if only
   one target is present then it should be 0)

 - call kcmp as kcmp(pid1, pid2, KCMP_EPOLL_TFD, fd, &amp;kcmp_epoll_slot)
    - the kernel fetch file pointer matching file descriptor @fd of pid1
    - lookups for file struct in epoll queue of pid2 and returns traditional
      0,1,2 result for sorting purpose

Link: http://lkml.kernel.org/r/20170424154423.511592110@gmail.com
Signed-off-by: Cyrill Gorcunov &lt;gorcunov@openvz.org&gt;
Acked-by: Andrey Vagin &lt;avagin@openvz.org&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Pavel Emelyanov &lt;xemul@virtuozzo.com&gt;
Cc: Michael Kerrisk &lt;mtk.manpages@gmail.com&gt;
Cc: Jason Baron &lt;jbaron@akamai.com&gt;
Cc: Andy Lutomirski &lt;luto@amacapital.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>procfs: fdinfo: extend information about epoll target files</title>
<updated>2017-07-12T23:26:01+00:00</updated>
<author>
<name>Cyrill Gorcunov</name>
<email>gorcunov@gmail.com</email>
</author>
<published>2017-07-12T21:34:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=77493f04b74cdff3a61fb3fb14b1f5a71d88fd5f'/>
<id>77493f04b74cdff3a61fb3fb14b1f5a71d88fd5f</id>
<content type='text'>
Since it is possbile to have same number in tfd field (say file added,
closed, then nother file dup'ed to same number and added back) it is
imposible to distinguish such target files solely by their numbers.

Strictly speaking regular applications don't need to recognize these
targets at all but for checkpoint/restore sake we need to collect
targets to be able to push them back on restore stage in a proper order.

Thus lets add file position, inode and device number where this target
lays.  This three fields can be used as a primary key for sorting, and
together with kcmp help CRIU can find out an exact file target (from the
whole set of processes being checkpointed).

Link: http://lkml.kernel.org/r/20170424154423.436491881@gmail.com
Signed-off-by: Cyrill Gorcunov &lt;gorcunov@openvz.org&gt;
Acked-by: Andrei Vagin &lt;avagin@virtuozzo.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Pavel Emelyanov &lt;xemul@virtuozzo.com&gt;
Cc: Michael Kerrisk &lt;mtk.manpages@gmail.com&gt;
Cc: Jason Baron &lt;jbaron@akamai.com&gt;
Cc: Andy Lutomirski &lt;luto@amacapital.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Since it is possbile to have same number in tfd field (say file added,
closed, then nother file dup'ed to same number and added back) it is
imposible to distinguish such target files solely by their numbers.

Strictly speaking regular applications don't need to recognize these
targets at all but for checkpoint/restore sake we need to collect
targets to be able to push them back on restore stage in a proper order.

Thus lets add file position, inode and device number where this target
lays.  This three fields can be used as a primary key for sorting, and
together with kcmp help CRIU can find out an exact file target (from the
whole set of processes being checkpointed).

Link: http://lkml.kernel.org/r/20170424154423.436491881@gmail.com
Signed-off-by: Cyrill Gorcunov &lt;gorcunov@openvz.org&gt;
Acked-by: Andrei Vagin &lt;avagin@virtuozzo.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Pavel Emelyanov &lt;xemul@virtuozzo.com&gt;
Cc: Michael Kerrisk &lt;mtk.manpages@gmail.com&gt;
Cc: Jason Baron &lt;jbaron@akamai.com&gt;
Cc: Andy Lutomirski &lt;luto@amacapital.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
