<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/fs/eventpoll.c, branch v2.6.30</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>epoll: fix size check in epoll_create()</title>
<updated>2009-05-12T21:11:35+00:00</updated>
<author>
<name>Davide Libenzi</name>
<email>davidel@xmailserver.org</email>
</author>
<published>2009-05-12T20:19:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=bfe3891a5f5d3b78146a45f40e435d14f5ae39dd'/>
<id>bfe3891a5f5d3b78146a45f40e435d14f5ae39dd</id>
<content type='text'>
Fix a size check WRT the manual pages.  This was inadvertently broken by
commit 9fe5ad9c8cef9ad5873d8ee55d1cf00d9b607df0 ("flag parameters
add-on: remove epoll_create size param").

Signed-off-by: Davide Libenzi &lt;davidel@xmailserver.org&gt;
Cc: &lt;Hiroyuki.Mach@gmail.com&gt;
Cc: rohit verma &lt;rohit.170309@gmail.com&gt;
Cc: Ulrich Drepper &lt;drepper@redhat.com&gt;
Cc: &lt;stable@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Fix a size check WRT the manual pages.  This was inadvertently broken by
commit 9fe5ad9c8cef9ad5873d8ee55d1cf00d9b607df0 ("flag parameters
add-on: remove epoll_create size param").

Signed-off-by: Davide Libenzi &lt;davidel@xmailserver.org&gt;
Cc: &lt;Hiroyuki.Mach@gmail.com&gt;
Cc: rohit verma &lt;rohit.170309@gmail.com&gt;
Cc: Ulrich Drepper &lt;drepper@redhat.com&gt;
Cc: &lt;stable@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>epoll keyed wakeups: teach epoll about hints coming with the wakeup key</title>
<updated>2009-04-01T15:59:20+00:00</updated>
<author>
<name>Davide Libenzi</name>
<email>davidel@xmailserver.org</email>
</author>
<published>2009-03-31T22:24:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=2dfa4eeab0fc7e8633974f2770945311b31eedf6'/>
<id>2dfa4eeab0fc7e8633974f2770945311b31eedf6</id>
<content type='text'>
Use the events hint now sent by some devices, to avoid unnecessary wakeups
for events that are of no interest for the caller.  This code handles both
devices that are sending keyed events, and the ones that are not (and
event the ones that sometimes send events, and sometimes don't).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Davide Libenzi &lt;davidel@xmailserver.org&gt;
Cc: Alan Cox &lt;alan@lxorguk.ukuu.org.uk&gt;
Cc: Ingo Molnar &lt;mingo@elte.hu&gt;
Cc: David Miller &lt;davem@davemloft.net&gt;
Cc: William Lee Irwin III &lt;wli@movementarian.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Use the events hint now sent by some devices, to avoid unnecessary wakeups
for events that are of no interest for the caller.  This code handles both
devices that are sending keyed events, and the ones that are not (and
event the ones that sometimes send events, and sometimes don't).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Davide Libenzi &lt;davidel@xmailserver.org&gt;
Cc: Alan Cox &lt;alan@lxorguk.ukuu.org.uk&gt;
Cc: Ingo Molnar &lt;mingo@elte.hu&gt;
Cc: David Miller &lt;davem@davemloft.net&gt;
Cc: William Lee Irwin III &lt;wli@movementarian.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>epoll: use real type instead of void *</title>
<updated>2009-04-01T15:59:20+00:00</updated>
<author>
<name>Tony Battersby</name>
<email>tonyb@cybernetics.com</email>
</author>
<published>2009-03-31T22:24:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=4f0989dbfa8d18dd17c32120aac1eb3e906a62a2'/>
<id>4f0989dbfa8d18dd17c32120aac1eb3e906a62a2</id>
<content type='text'>
eventpoll.c uses void * in one place for no obvious reason; change it to
use the real type instead.

Signed-off-by: Tony Battersby &lt;tonyb@cybernetics.com&gt;
Acked-by: Davide Libenzi &lt;davidel@xmailserver.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
eventpoll.c uses void * in one place for no obvious reason; change it to
use the real type instead.

Signed-off-by: Tony Battersby &lt;tonyb@cybernetics.com&gt;
Acked-by: Davide Libenzi &lt;davidel@xmailserver.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>epoll: clean up ep_modify</title>
<updated>2009-04-01T15:59:19+00:00</updated>
<author>
<name>Tony Battersby</name>
<email>tonyb@cybernetics.com</email>
</author>
<published>2009-03-31T22:24:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=e057e15ff66a620eda4c407486cbb8f8fbb7d878'/>
<id>e057e15ff66a620eda4c407486cbb8f8fbb7d878</id>
<content type='text'>
ep_modify() doesn't need to set event.data from within the ep-&gt;lock
spinlock as the comment suggests.  The only place event.data is used is
ep_send_events_proc(), and this is protected by ep-&gt;mtx instead of
ep-&gt;lock.  Also update the comment for mutex_lock() at the top of
ep_scan_ready_list(), which mentions epoll_ctl(EPOLL_CTL_DEL) but not
epoll_ctl(EPOLL_CTL_MOD).

ep_modify() can also use spin_lock_irq() instead of spin_lock_irqsave().

Signed-off-by: Tony Battersby &lt;tonyb@cybernetics.com&gt;
Acked-by: Davide Libenzi &lt;davidel@xmailserver.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
ep_modify() doesn't need to set event.data from within the ep-&gt;lock
spinlock as the comment suggests.  The only place event.data is used is
ep_send_events_proc(), and this is protected by ep-&gt;mtx instead of
ep-&gt;lock.  Also update the comment for mutex_lock() at the top of
ep_scan_ready_list(), which mentions epoll_ctl(EPOLL_CTL_DEL) but not
epoll_ctl(EPOLL_CTL_MOD).

ep_modify() can also use spin_lock_irq() instead of spin_lock_irqsave().

Signed-off-by: Tony Battersby &lt;tonyb@cybernetics.com&gt;
Acked-by: Davide Libenzi &lt;davidel@xmailserver.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>epoll: remove unnecessary xchg</title>
<updated>2009-04-01T15:59:19+00:00</updated>
<author>
<name>Tony Battersby</name>
<email>tonyb@cybernetics.com</email>
</author>
<published>2009-03-31T22:24:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=d1bc90dd5d037079f96b3327f943eb6ae8ef7491'/>
<id>d1bc90dd5d037079f96b3327f943eb6ae8ef7491</id>
<content type='text'>
xchg in ep_unregister_pollwait() is unnecessary because it is protected by
either epmutex or ep-&gt;mtx (the same protection as ep_remove()).

If xchg was necessary, it would be insufficient to protect against
problems: if multiple concurrent calls to ep_unregister_pollwait() were
possible then a second caller that returns without doing anything because
nwait == 0 could return before the waitqueues are removed by the first
caller, which looks like it could lead to problematic races with
ep_poll_callback().

So remove xchg and add comments about the locking.

Signed-off-by: Tony Battersby &lt;tonyb@cybernetics.com&gt;
Acked-by: Davide Libenzi &lt;davidel@xmailserver.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
xchg in ep_unregister_pollwait() is unnecessary because it is protected by
either epmutex or ep-&gt;mtx (the same protection as ep_remove()).

If xchg was necessary, it would be insufficient to protect against
problems: if multiple concurrent calls to ep_unregister_pollwait() were
possible then a second caller that returns without doing anything because
nwait == 0 could return before the waitqueues are removed by the first
caller, which looks like it could lead to problematic races with
ep_poll_callback().

So remove xchg and add comments about the locking.

Signed-off-by: Tony Battersby &lt;tonyb@cybernetics.com&gt;
Acked-by: Davide Libenzi &lt;davidel@xmailserver.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>epoll: remember the event if epoll_wait returns -EFAULT</title>
<updated>2009-04-01T15:59:19+00:00</updated>
<author>
<name>Tony Battersby</name>
<email>tonyb@cybernetics.com</email>
</author>
<published>2009-03-31T22:24:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=d0305882825784e74f68a56eee6c3a812a99f235'/>
<id>d0305882825784e74f68a56eee6c3a812a99f235</id>
<content type='text'>
If epoll_wait returns -EFAULT, the event that was being returned when the
fault was encountered will be forgotten.  This is not a big deal since
EFAULT will happen only if a buggy userspace program passes in a bad
address, in which case what happens later usually doesn't matter.
However, it is easy to remember the event for later, and this patch makes
a simple change to do that.

Signed-off-by: Tony Battersby &lt;tonyb@cybernetics.com&gt;
Acked-by: Davide Libenzi &lt;davidel@xmailserver.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
If epoll_wait returns -EFAULT, the event that was being returned when the
fault was encountered will be forgotten.  This is not a big deal since
EFAULT will happen only if a buggy userspace program passes in a bad
address, in which case what happens later usually doesn't matter.
However, it is easy to remember the event for later, and this patch makes
a simple change to do that.

Signed-off-by: Tony Battersby &lt;tonyb@cybernetics.com&gt;
Acked-by: Davide Libenzi &lt;davidel@xmailserver.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>epoll: don't use current in irq context</title>
<updated>2009-04-01T15:59:19+00:00</updated>
<author>
<name>Tony Battersby</name>
<email>tonyb@cybernetics.com</email>
</author>
<published>2009-03-31T22:24:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=abff55cee1039b5a3b96f7a5eb6e65b9f247a274'/>
<id>abff55cee1039b5a3b96f7a5eb6e65b9f247a274</id>
<content type='text'>
ep_call_nested() (formerly ep_poll_safewake()) uses "current" (without
dereferencing it) to detect callback recursion, but it may be called from
irq context where the use of current is generally discouraged.  It would
be better to use get_cpu() and put_cpu() to detect the callback recursion.

Signed-off-by: Tony Battersby &lt;tonyb@cybernetics.com&gt;
Acked-by: Davide Libenzi &lt;davidel@xmailserver.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
ep_call_nested() (formerly ep_poll_safewake()) uses "current" (without
dereferencing it) to detect callback recursion, but it may be called from
irq context where the use of current is generally discouraged.  It would
be better to use get_cpu() and put_cpu() to detect the callback recursion.

Signed-off-by: Tony Battersby &lt;tonyb@cybernetics.com&gt;
Acked-by: Davide Libenzi &lt;davidel@xmailserver.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>epoll: remove debugging code</title>
<updated>2009-04-01T15:59:19+00:00</updated>
<author>
<name>Davide Libenzi</name>
<email>davidel@xmailserver.org</email>
</author>
<published>2009-03-31T22:24:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=bb57c3edcd2fc51d95914c39448f36e43af9d6af'/>
<id>bb57c3edcd2fc51d95914c39448f36e43af9d6af</id>
<content type='text'>
Remove debugging code from epoll.  There's no need for it to be included
into mainline code.

Signed-off-by: Davide Libenzi &lt;davidel@xmailserver.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Remove debugging code from epoll.  There's no need for it to be included
into mainline code.

Signed-off-by: Davide Libenzi &lt;davidel@xmailserver.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>epoll: fix epoll's own poll (update)</title>
<updated>2009-04-01T15:59:19+00:00</updated>
<author>
<name>Davide Libenzi</name>
<email>davidel@xmailserver.org</email>
</author>
<published>2009-03-31T22:24:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=296e236e96dddef351a1809c0d414bcddfcf3800'/>
<id>296e236e96dddef351a1809c0d414bcddfcf3800</id>
<content type='text'>
Signed-off-by: Davide Libenzi &lt;davidel@xmailserver.org&gt;
Cc: Pavel Pisa &lt;pisa@cmp.felk.cvut.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Signed-off-by: Davide Libenzi &lt;davidel@xmailserver.org&gt;
Cc: Pavel Pisa &lt;pisa@cmp.felk.cvut.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>epoll: fix epoll's own poll</title>
<updated>2009-04-01T15:59:18+00:00</updated>
<author>
<name>Davide Libenzi</name>
<email>davidel@xmailserver.org</email>
</author>
<published>2009-03-31T22:24:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=5071f97ec6d74f006072de0ce89b67c8792fe5a1'/>
<id>5071f97ec6d74f006072de0ce89b67c8792fe5a1</id>
<content type='text'>
Fix a bug inside the epoll's f_op-&gt;poll() code, that returns POLLIN even
though there are no actual ready monitored fds.  The bug shows up if you
add an epoll fd inside another fd container (poll, select, epoll).

The problem is that callback-based wake ups used by epoll does not carry
(patches will follow, to fix this) any information about the events that
actually happened.  So the callback code, since it can't call the file*
-&gt;poll() inside the callback, chains the file* into a ready-list.

So, suppose you added an fd with EPOLLOUT only, and some data shows up on
the fd, the file* mapped by the fd will be added into the ready-list (via
wakeup callback).  During normal epoll_wait() use, this condition is
sorted out at the time we're actually able to call the file*'s
f_op-&gt;poll().

Inside the old epoll's f_op-&gt;poll() though, only a quick check
!list_empty(ready-list) was performed, and this could have led to
reporting POLLIN even though no ready fds would show up at a following
epoll_wait().  In order to correctly report the ready status for an epoll
fd, the ready-list must be checked to see if any really available fd+event
would be ready in a following epoll_wait().

Operation (calling f_op-&gt;poll() from inside f_op-&gt;poll()) that, like wake
ups, must be handled with care because of the fact that epoll fds can be
added to other epoll fds.

Test code:

/*
 *  epoll_test by Davide Libenzi (Simple code to test epoll internals)
 *  Copyright (C) 2008  Davide Libenzi
 *
 *  This program is free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation; either version 2 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  You should have received a copy of the GNU General Public License
 *  along with this program; if not, write to the Free Software
 *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 *
 *  Davide Libenzi &lt;davidel@xmailserver.org&gt;
 *
 */

#include &lt;sys/types.h&gt;
#include &lt;unistd.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;errno.h&gt;
#include &lt;signal.h&gt;
#include &lt;limits.h&gt;
#include &lt;poll.h&gt;
#include &lt;sys/epoll.h&gt;
#include &lt;sys/wait.h&gt;

#define EPWAIT_TIMEO	(1 * 1000)
#ifndef POLLRDHUP
#define POLLRDHUP 0x2000
#endif

#define EPOLL_MAX_CHAIN	100L

#define EPOLL_TF_LOOP (1 &lt;&lt; 0)

struct epoll_test_cfg {
	long size;
	long flags;
};

static int xepoll_create(int n) {
	int epfd;

	if ((epfd = epoll_create(n)) == -1) {
		perror("epoll_create");
		exit(2);
	}

	return epfd;
}

static void xepoll_ctl(int epfd, int cmd, int fd, struct epoll_event *evt) {
	if (epoll_ctl(epfd, cmd, fd, evt) &lt; 0) {
		perror("epoll_ctl");
		exit(3);
	}
}

static void xpipe(int *fds) {
	if (pipe(fds)) {
		perror("pipe");
		exit(4);
	}
}

static pid_t xfork(void) {
	pid_t pid;

	if ((pid = fork()) == (pid_t) -1) {
		perror("pipe");
		exit(5);
	}

	return pid;
}

static int run_forked_proc(int (*proc)(void *), void *data) {
	int status;
	pid_t pid;

	if ((pid = xfork()) == 0)
		exit((*proc)(data));
	if (waitpid(pid, &amp;status, 0) != pid) {
		perror("waitpid");
		return -1;
	}

	return WIFEXITED(status) ? WEXITSTATUS(status): -2;
}

static int check_events(int fd, int timeo) {
	struct pollfd pfd;

	fprintf(stdout, "Checking events for fd %d\n", fd);
	memset(&amp;pfd, 0, sizeof(pfd));
	pfd.fd = fd;
	pfd.events = POLLIN | POLLOUT;
	if (poll(&amp;pfd, 1, timeo) &lt; 0) {
		perror("poll()");
		return 0;
	}
	if (pfd.revents &amp; POLLIN)
		fprintf(stdout, "\tPOLLIN\n");
	if (pfd.revents &amp; POLLOUT)
		fprintf(stdout, "\tPOLLOUT\n");
	if (pfd.revents &amp; POLLERR)
		fprintf(stdout, "\tPOLLERR\n");
	if (pfd.revents &amp; POLLHUP)
		fprintf(stdout, "\tPOLLHUP\n");
	if (pfd.revents &amp; POLLRDHUP)
		fprintf(stdout, "\tPOLLRDHUP\n");

	return pfd.revents;
}

static int epoll_test_tty(void *data) {
	int epfd, ifd = fileno(stdin), res;
	struct epoll_event evt;

	if (check_events(ifd, 0) != POLLOUT) {
		fprintf(stderr, "Something is cooking on STDIN (%d)\n", ifd);
		return 1;
	}
	epfd = xepoll_create(1);
	fprintf(stdout, "Created epoll fd (%d)\n", epfd);
	memset(&amp;evt, 0, sizeof(evt));
	evt.events = EPOLLIN;
	xepoll_ctl(epfd, EPOLL_CTL_ADD, ifd, &amp;evt);
	if (check_events(epfd, 0) &amp; POLLIN) {
		res = epoll_wait(epfd, &amp;evt, 1, 0);
		if (res == 0) {
			fprintf(stderr, "Epoll fd (%d) is ready when it shouldn't!\n",
				epfd);
			return 2;
		}
	}

	return 0;
}

static int epoll_wakeup_chain(void *data) {
	struct epoll_test_cfg *tcfg = data;
	int i, res, epfd, bfd, nfd, pfds[2];
	pid_t pid;
	struct epoll_event evt;

	memset(&amp;evt, 0, sizeof(evt));
	evt.events = EPOLLIN;

	epfd = bfd = xepoll_create(1);

	for (i = 0; i &lt; tcfg-&gt;size; i++) {
		nfd = xepoll_create(1);
		xepoll_ctl(bfd, EPOLL_CTL_ADD, nfd, &amp;evt);
		bfd = nfd;
	}
	xpipe(pfds);
	if (tcfg-&gt;flags &amp; EPOLL_TF_LOOP)
	{
		xepoll_ctl(bfd, EPOLL_CTL_ADD, epfd, &amp;evt);
		/*
		 * If we're testing for loop, we want that the wakeup
		 * triggered by the write to the pipe done in the child
		 * process, triggers a fake event. So we add the pipe
		 * read size with EPOLLOUT events. This will trigger
		 * an addition to the ready-list, but no real events
		 * will be there. The the epoll kernel code will proceed
		 * in calling f_op-&gt;poll() of the epfd, triggering the
		 * loop we want to test.
		 */
		evt.events = EPOLLOUT;
	}
	xepoll_ctl(bfd, EPOLL_CTL_ADD, pfds[0], &amp;evt);

	/*
	 * The pipe write must come after the poll(2) call inside
	 * check_events(). This tests the nested wakeup code in
	 * fs/eventpoll.c:ep_poll_safewake()
	 * By having the check_events() (hence poll(2)) happens first,
	 * we have poll wait queue filled up, and the write(2) in the
	 * child will trigger the wakeup chain.
	 */
	if ((pid = xfork()) == 0) {
		sleep(1);
		write(pfds[1], "w", 1);
		exit(0);
	}

	res = check_events(epfd, 2000) &amp; POLLIN;

	if (waitpid(pid, NULL, 0) != pid) {
		perror("waitpid");
		return -1;
	}

	return res;
}

static int epoll_poll_chain(void *data) {
	struct epoll_test_cfg *tcfg = data;
	int i, res, epfd, bfd, nfd, pfds[2];
	pid_t pid;
	struct epoll_event evt;

	memset(&amp;evt, 0, sizeof(evt));
	evt.events = EPOLLIN;

	epfd = bfd = xepoll_create(1);

	for (i = 0; i &lt; tcfg-&gt;size; i++) {
		nfd = xepoll_create(1);
		xepoll_ctl(bfd, EPOLL_CTL_ADD, nfd, &amp;evt);
		bfd = nfd;
	}
	xpipe(pfds);
	if (tcfg-&gt;flags &amp; EPOLL_TF_LOOP)
	{
		xepoll_ctl(bfd, EPOLL_CTL_ADD, epfd, &amp;evt);
		/*
		 * If we're testing for loop, we want that the wakeup
		 * triggered by the write to the pipe done in the child
		 * process, triggers a fake event. So we add the pipe
		 * read size with EPOLLOUT events. This will trigger
		 * an addition to the ready-list, but no real events
		 * will be there. The the epoll kernel code will proceed
		 * in calling f_op-&gt;poll() of the epfd, triggering the
		 * loop we want to test.
		 */
		evt.events = EPOLLOUT;
	}
	xepoll_ctl(bfd, EPOLL_CTL_ADD, pfds[0], &amp;evt);

	/*
	 * The pipe write mush come before the poll(2) call inside
	 * check_events(). This tests the nested f_op-&gt;poll calls code in
	 * fs/eventpoll.c:ep_eventpoll_poll()
	 * By having the pipe write(2) happen first, we make the kernel
	 * epoll code to load the ready lists, and the following poll(2)
	 * done inside check_events() will test nested poll code in
	 * ep_eventpoll_poll().
	 */
	if ((pid = xfork()) == 0) {
		write(pfds[1], "w", 1);
		exit(0);
	}
	sleep(1);
	res = check_events(epfd, 1000) &amp; POLLIN;

	if (waitpid(pid, NULL, 0) != pid) {
		perror("waitpid");
		return -1;
	}

	return res;
}

int main(int ac, char **av) {
	int error;
	struct epoll_test_cfg tcfg;

	fprintf(stdout, "\n********** Testing TTY events\n");
	error = run_forked_proc(epoll_test_tty, NULL);
	fprintf(stdout, error == 0 ?
		"********** OK\n": "********** FAIL (%d)\n", error);

	tcfg.size = 3;
	tcfg.flags = 0;
	fprintf(stdout, "\n********** Testing short wakeup chain\n");
	error = run_forked_proc(epoll_wakeup_chain, &amp;tcfg);
	fprintf(stdout, error == POLLIN ?
		"********** OK\n": "********** FAIL (%d)\n", error);

	tcfg.size = EPOLL_MAX_CHAIN;
	tcfg.flags = 0;
	fprintf(stdout, "\n********** Testing long wakeup chain (HOLD ON)\n");
	error = run_forked_proc(epoll_wakeup_chain, &amp;tcfg);
	fprintf(stdout, error == 0 ?
		"********** OK\n": "********** FAIL (%d)\n", error);

	tcfg.size = 3;
	tcfg.flags = 0;
	fprintf(stdout, "\n********** Testing short poll chain\n");
	error = run_forked_proc(epoll_poll_chain, &amp;tcfg);
	fprintf(stdout, error == POLLIN ?
		"********** OK\n": "********** FAIL (%d)\n", error);

	tcfg.size = EPOLL_MAX_CHAIN;
	tcfg.flags = 0;
	fprintf(stdout, "\n********** Testing long poll chain (HOLD ON)\n");
	error = run_forked_proc(epoll_poll_chain, &amp;tcfg);
	fprintf(stdout, error == 0 ?
		"********** OK\n": "********** FAIL (%d)\n", error);

	tcfg.size = 3;
	tcfg.flags = EPOLL_TF_LOOP;
	fprintf(stdout, "\n********** Testing loopy wakeup chain (HOLD ON)\n");
	error = run_forked_proc(epoll_wakeup_chain, &amp;tcfg);
	fprintf(stdout, error == 0 ?
		"********** OK\n": "********** FAIL (%d)\n", error);

	tcfg.size = 3;
	tcfg.flags = EPOLL_TF_LOOP;
	fprintf(stdout, "\n********** Testing loopy poll chain (HOLD ON)\n");
	error = run_forked_proc(epoll_poll_chain, &amp;tcfg);
	fprintf(stdout, error == 0 ?
		"********** OK\n": "********** FAIL (%d)\n", error);

	return 0;
}

Signed-off-by: Davide Libenzi &lt;davidel@xmailserver.org&gt;
Cc: Pavel Pisa &lt;pisa@cmp.felk.cvut.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Fix a bug inside the epoll's f_op-&gt;poll() code, that returns POLLIN even
though there are no actual ready monitored fds.  The bug shows up if you
add an epoll fd inside another fd container (poll, select, epoll).

The problem is that callback-based wake ups used by epoll does not carry
(patches will follow, to fix this) any information about the events that
actually happened.  So the callback code, since it can't call the file*
-&gt;poll() inside the callback, chains the file* into a ready-list.

So, suppose you added an fd with EPOLLOUT only, and some data shows up on
the fd, the file* mapped by the fd will be added into the ready-list (via
wakeup callback).  During normal epoll_wait() use, this condition is
sorted out at the time we're actually able to call the file*'s
f_op-&gt;poll().

Inside the old epoll's f_op-&gt;poll() though, only a quick check
!list_empty(ready-list) was performed, and this could have led to
reporting POLLIN even though no ready fds would show up at a following
epoll_wait().  In order to correctly report the ready status for an epoll
fd, the ready-list must be checked to see if any really available fd+event
would be ready in a following epoll_wait().

Operation (calling f_op-&gt;poll() from inside f_op-&gt;poll()) that, like wake
ups, must be handled with care because of the fact that epoll fds can be
added to other epoll fds.

Test code:

/*
 *  epoll_test by Davide Libenzi (Simple code to test epoll internals)
 *  Copyright (C) 2008  Davide Libenzi
 *
 *  This program is free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation; either version 2 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  You should have received a copy of the GNU General Public License
 *  along with this program; if not, write to the Free Software
 *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 *
 *  Davide Libenzi &lt;davidel@xmailserver.org&gt;
 *
 */

#include &lt;sys/types.h&gt;
#include &lt;unistd.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;errno.h&gt;
#include &lt;signal.h&gt;
#include &lt;limits.h&gt;
#include &lt;poll.h&gt;
#include &lt;sys/epoll.h&gt;
#include &lt;sys/wait.h&gt;

#define EPWAIT_TIMEO	(1 * 1000)
#ifndef POLLRDHUP
#define POLLRDHUP 0x2000
#endif

#define EPOLL_MAX_CHAIN	100L

#define EPOLL_TF_LOOP (1 &lt;&lt; 0)

struct epoll_test_cfg {
	long size;
	long flags;
};

static int xepoll_create(int n) {
	int epfd;

	if ((epfd = epoll_create(n)) == -1) {
		perror("epoll_create");
		exit(2);
	}

	return epfd;
}

static void xepoll_ctl(int epfd, int cmd, int fd, struct epoll_event *evt) {
	if (epoll_ctl(epfd, cmd, fd, evt) &lt; 0) {
		perror("epoll_ctl");
		exit(3);
	}
}

static void xpipe(int *fds) {
	if (pipe(fds)) {
		perror("pipe");
		exit(4);
	}
}

static pid_t xfork(void) {
	pid_t pid;

	if ((pid = fork()) == (pid_t) -1) {
		perror("pipe");
		exit(5);
	}

	return pid;
}

static int run_forked_proc(int (*proc)(void *), void *data) {
	int status;
	pid_t pid;

	if ((pid = xfork()) == 0)
		exit((*proc)(data));
	if (waitpid(pid, &amp;status, 0) != pid) {
		perror("waitpid");
		return -1;
	}

	return WIFEXITED(status) ? WEXITSTATUS(status): -2;
}

static int check_events(int fd, int timeo) {
	struct pollfd pfd;

	fprintf(stdout, "Checking events for fd %d\n", fd);
	memset(&amp;pfd, 0, sizeof(pfd));
	pfd.fd = fd;
	pfd.events = POLLIN | POLLOUT;
	if (poll(&amp;pfd, 1, timeo) &lt; 0) {
		perror("poll()");
		return 0;
	}
	if (pfd.revents &amp; POLLIN)
		fprintf(stdout, "\tPOLLIN\n");
	if (pfd.revents &amp; POLLOUT)
		fprintf(stdout, "\tPOLLOUT\n");
	if (pfd.revents &amp; POLLERR)
		fprintf(stdout, "\tPOLLERR\n");
	if (pfd.revents &amp; POLLHUP)
		fprintf(stdout, "\tPOLLHUP\n");
	if (pfd.revents &amp; POLLRDHUP)
		fprintf(stdout, "\tPOLLRDHUP\n");

	return pfd.revents;
}

static int epoll_test_tty(void *data) {
	int epfd, ifd = fileno(stdin), res;
	struct epoll_event evt;

	if (check_events(ifd, 0) != POLLOUT) {
		fprintf(stderr, "Something is cooking on STDIN (%d)\n", ifd);
		return 1;
	}
	epfd = xepoll_create(1);
	fprintf(stdout, "Created epoll fd (%d)\n", epfd);
	memset(&amp;evt, 0, sizeof(evt));
	evt.events = EPOLLIN;
	xepoll_ctl(epfd, EPOLL_CTL_ADD, ifd, &amp;evt);
	if (check_events(epfd, 0) &amp; POLLIN) {
		res = epoll_wait(epfd, &amp;evt, 1, 0);
		if (res == 0) {
			fprintf(stderr, "Epoll fd (%d) is ready when it shouldn't!\n",
				epfd);
			return 2;
		}
	}

	return 0;
}

static int epoll_wakeup_chain(void *data) {
	struct epoll_test_cfg *tcfg = data;
	int i, res, epfd, bfd, nfd, pfds[2];
	pid_t pid;
	struct epoll_event evt;

	memset(&amp;evt, 0, sizeof(evt));
	evt.events = EPOLLIN;

	epfd = bfd = xepoll_create(1);

	for (i = 0; i &lt; tcfg-&gt;size; i++) {
		nfd = xepoll_create(1);
		xepoll_ctl(bfd, EPOLL_CTL_ADD, nfd, &amp;evt);
		bfd = nfd;
	}
	xpipe(pfds);
	if (tcfg-&gt;flags &amp; EPOLL_TF_LOOP)
	{
		xepoll_ctl(bfd, EPOLL_CTL_ADD, epfd, &amp;evt);
		/*
		 * If we're testing for loop, we want that the wakeup
		 * triggered by the write to the pipe done in the child
		 * process, triggers a fake event. So we add the pipe
		 * read size with EPOLLOUT events. This will trigger
		 * an addition to the ready-list, but no real events
		 * will be there. The the epoll kernel code will proceed
		 * in calling f_op-&gt;poll() of the epfd, triggering the
		 * loop we want to test.
		 */
		evt.events = EPOLLOUT;
	}
	xepoll_ctl(bfd, EPOLL_CTL_ADD, pfds[0], &amp;evt);

	/*
	 * The pipe write must come after the poll(2) call inside
	 * check_events(). This tests the nested wakeup code in
	 * fs/eventpoll.c:ep_poll_safewake()
	 * By having the check_events() (hence poll(2)) happens first,
	 * we have poll wait queue filled up, and the write(2) in the
	 * child will trigger the wakeup chain.
	 */
	if ((pid = xfork()) == 0) {
		sleep(1);
		write(pfds[1], "w", 1);
		exit(0);
	}

	res = check_events(epfd, 2000) &amp; POLLIN;

	if (waitpid(pid, NULL, 0) != pid) {
		perror("waitpid");
		return -1;
	}

	return res;
}

static int epoll_poll_chain(void *data) {
	struct epoll_test_cfg *tcfg = data;
	int i, res, epfd, bfd, nfd, pfds[2];
	pid_t pid;
	struct epoll_event evt;

	memset(&amp;evt, 0, sizeof(evt));
	evt.events = EPOLLIN;

	epfd = bfd = xepoll_create(1);

	for (i = 0; i &lt; tcfg-&gt;size; i++) {
		nfd = xepoll_create(1);
		xepoll_ctl(bfd, EPOLL_CTL_ADD, nfd, &amp;evt);
		bfd = nfd;
	}
	xpipe(pfds);
	if (tcfg-&gt;flags &amp; EPOLL_TF_LOOP)
	{
		xepoll_ctl(bfd, EPOLL_CTL_ADD, epfd, &amp;evt);
		/*
		 * If we're testing for loop, we want that the wakeup
		 * triggered by the write to the pipe done in the child
		 * process, triggers a fake event. So we add the pipe
		 * read size with EPOLLOUT events. This will trigger
		 * an addition to the ready-list, but no real events
		 * will be there. The the epoll kernel code will proceed
		 * in calling f_op-&gt;poll() of the epfd, triggering the
		 * loop we want to test.
		 */
		evt.events = EPOLLOUT;
	}
	xepoll_ctl(bfd, EPOLL_CTL_ADD, pfds[0], &amp;evt);

	/*
	 * The pipe write mush come before the poll(2) call inside
	 * check_events(). This tests the nested f_op-&gt;poll calls code in
	 * fs/eventpoll.c:ep_eventpoll_poll()
	 * By having the pipe write(2) happen first, we make the kernel
	 * epoll code to load the ready lists, and the following poll(2)
	 * done inside check_events() will test nested poll code in
	 * ep_eventpoll_poll().
	 */
	if ((pid = xfork()) == 0) {
		write(pfds[1], "w", 1);
		exit(0);
	}
	sleep(1);
	res = check_events(epfd, 1000) &amp; POLLIN;

	if (waitpid(pid, NULL, 0) != pid) {
		perror("waitpid");
		return -1;
	}

	return res;
}

int main(int ac, char **av) {
	int error;
	struct epoll_test_cfg tcfg;

	fprintf(stdout, "\n********** Testing TTY events\n");
	error = run_forked_proc(epoll_test_tty, NULL);
	fprintf(stdout, error == 0 ?
		"********** OK\n": "********** FAIL (%d)\n", error);

	tcfg.size = 3;
	tcfg.flags = 0;
	fprintf(stdout, "\n********** Testing short wakeup chain\n");
	error = run_forked_proc(epoll_wakeup_chain, &amp;tcfg);
	fprintf(stdout, error == POLLIN ?
		"********** OK\n": "********** FAIL (%d)\n", error);

	tcfg.size = EPOLL_MAX_CHAIN;
	tcfg.flags = 0;
	fprintf(stdout, "\n********** Testing long wakeup chain (HOLD ON)\n");
	error = run_forked_proc(epoll_wakeup_chain, &amp;tcfg);
	fprintf(stdout, error == 0 ?
		"********** OK\n": "********** FAIL (%d)\n", error);

	tcfg.size = 3;
	tcfg.flags = 0;
	fprintf(stdout, "\n********** Testing short poll chain\n");
	error = run_forked_proc(epoll_poll_chain, &amp;tcfg);
	fprintf(stdout, error == POLLIN ?
		"********** OK\n": "********** FAIL (%d)\n", error);

	tcfg.size = EPOLL_MAX_CHAIN;
	tcfg.flags = 0;
	fprintf(stdout, "\n********** Testing long poll chain (HOLD ON)\n");
	error = run_forked_proc(epoll_poll_chain, &amp;tcfg);
	fprintf(stdout, error == 0 ?
		"********** OK\n": "********** FAIL (%d)\n", error);

	tcfg.size = 3;
	tcfg.flags = EPOLL_TF_LOOP;
	fprintf(stdout, "\n********** Testing loopy wakeup chain (HOLD ON)\n");
	error = run_forked_proc(epoll_wakeup_chain, &amp;tcfg);
	fprintf(stdout, error == 0 ?
		"********** OK\n": "********** FAIL (%d)\n", error);

	tcfg.size = 3;
	tcfg.flags = EPOLL_TF_LOOP;
	fprintf(stdout, "\n********** Testing loopy poll chain (HOLD ON)\n");
	error = run_forked_proc(epoll_poll_chain, &amp;tcfg);
	fprintf(stdout, error == 0 ?
		"********** OK\n": "********** FAIL (%d)\n", error);

	return 0;
}

Signed-off-by: Davide Libenzi &lt;davidel@xmailserver.org&gt;
Cc: Pavel Pisa &lt;pisa@cmp.felk.cvut.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
