<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/net/core, branch v4.2-rc5</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket</title>
<updated>2015-07-30T22:59:12+00:00</updated>
<author>
<name>Sowmini Varadhan</name>
<email>sowmini.varadhan@oracle.com</email>
</author>
<published>2015-07-30T13:50:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=8a68173691f036613e3d4e6bf8dc129d4a7bf383'/>
<id>8a68173691f036613e3d4e6bf8dc129d4a7bf383</id>
<content type='text'>
The newsk returned by sk_clone_lock should hold a get_net()
reference if, and only if, the parent is not a kernel socket
(making this similar to sk_alloc()).

E.g,. for the SYN_RECV path, tcp_v4_syn_recv_sock-&gt;..inet_csk_clone_lock
sets up the syn_recv newsk from sk_clone_lock. When the parent (listen)
socket is a kernel socket (defined in sk_alloc() as having
sk_net_refcnt == 0), then the newsk should also have a 0 sk_net_refcnt
and should not hold a get_net() reference.

Fixes: 26abe14379f8 ("net: Modify sk_alloc to not reference count the
      netns of kernel sockets.")
Acked-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
Signed-off-by: Sowmini Varadhan &lt;sowmini.varadhan@oracle.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The newsk returned by sk_clone_lock should hold a get_net()
reference if, and only if, the parent is not a kernel socket
(making this similar to sk_alloc()).

E.g,. for the SYN_RECV path, tcp_v4_syn_recv_sock-&gt;..inet_csk_clone_lock
sets up the syn_recv newsk from sk_clone_lock. When the parent (listen)
socket is a kernel socket (defined in sk_alloc() as having
sk_net_refcnt == 0), then the newsk should also have a 0 sk_net_refcnt
and should not hold a get_net() reference.

Fixes: 26abe14379f8 ("net: Modify sk_alloc to not reference count the
      netns of kernel sockets.")
Acked-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
Signed-off-by: Sowmini Varadhan &lt;sowmini.varadhan@oracle.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: fix recv with flags MSG_WAITALL | MSG_PEEK</title>
<updated>2015-07-27T08:06:53+00:00</updated>
<author>
<name>Sabrina Dubroca</name>
<email>sd@queasysnail.net</email>
</author>
<published>2015-07-24T16:19:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=dfbafc995304ebb9a9b03f65083e6e9cea143b20'/>
<id>dfbafc995304ebb9a9b03f65083e6e9cea143b20</id>
<content type='text'>
Currently, tcp_recvmsg enters a busy loop in sk_wait_data if called
with flags = MSG_WAITALL | MSG_PEEK.

sk_wait_data waits for sk_receive_queue not empty, but in this case,
the receive queue is not empty, but does not contain any skb that we
can use.

Add a "last skb seen on receive queue" argument to sk_wait_data, so
that it sleeps until the receive queue has new skbs.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=99461
Link: https://sourceware.org/bugzilla/show_bug.cgi?id=18493
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1205258
Reported-by: Enrico Scholz &lt;rh-bugzilla@ensc.de&gt;
Reported-by: Dan Searle &lt;dan@censornet.com&gt;
Signed-off-by: Sabrina Dubroca &lt;sd@queasysnail.net&gt;
Acked-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Currently, tcp_recvmsg enters a busy loop in sk_wait_data if called
with flags = MSG_WAITALL | MSG_PEEK.

sk_wait_data waits for sk_receive_queue not empty, but in this case,
the receive queue is not empty, but does not contain any skb that we
can use.

Add a "last skb seen on receive queue" argument to sk_wait_data, so
that it sleeps until the receive queue has new skbs.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=99461
Link: https://sourceware.org/bugzilla/show_bug.cgi?id=18493
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1205258
Reported-by: Enrico Scholz &lt;rh-bugzilla@ensc.de&gt;
Reported-by: Dan Searle &lt;dan@censornet.com&gt;
Signed-off-by: Sabrina Dubroca &lt;sd@queasysnail.net&gt;
Acked-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cgroup: net_cls: fix false-positive "suspicious RCU usage"</title>
<updated>2015-07-25T07:13:18+00:00</updated>
<author>
<name>Konstantin Khlebnikov</name>
<email>khlebnikov@yandex-team.ru</email>
</author>
<published>2015-07-22T09:23:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=cc9f4daa638e660f7a910b8094122561470ac331'/>
<id>cc9f4daa638e660f7a910b8094122561470ac331</id>
<content type='text'>
In dev_queue_xmit() net_cls protected with rcu-bh.

[  270.730026] ===============================
[  270.730029] [ INFO: suspicious RCU usage. ]
[  270.730033] 4.2.0-rc3+ #2 Not tainted
[  270.730036] -------------------------------
[  270.730040] include/linux/cgroup.h:353 suspicious rcu_dereference_check() usage!
[  270.730041] other info that might help us debug this:
[  270.730043] rcu_scheduler_active = 1, debug_locks = 1
[  270.730045] 2 locks held by dhclient/748:
[  270.730046]  #0:  (rcu_read_lock_bh){......}, at: [&lt;ffffffff81682b70&gt;] __dev_queue_xmit+0x50/0x960
[  270.730085]  #1:  (&amp;qdisc_tx_lock){+.....}, at: [&lt;ffffffff81682d60&gt;] __dev_queue_xmit+0x240/0x960
[  270.730090] stack backtrace:
[  270.730096] CPU: 0 PID: 748 Comm: dhclient Not tainted 4.2.0-rc3+ #2
[  270.730098] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
[  270.730100]  0000000000000001 ffff8800bafeba58 ffffffff817ad487 0000000000000007
[  270.730103]  ffff880232a0a780 ffff8800bafeba88 ffffffff810ca4f2 ffff88022fb23e00
[  270.730105]  ffff880232a0a780 ffff8800bafebb68 ffff8800bafebb68 ffff8800bafebaa8
[  270.730108] Call Trace:
[  270.730121]  [&lt;ffffffff817ad487&gt;] dump_stack+0x4c/0x65
[  270.730148]  [&lt;ffffffff810ca4f2&gt;] lockdep_rcu_suspicious+0xe2/0x120
[  270.730153]  [&lt;ffffffff816a62d2&gt;] task_cls_state+0x92/0xa0
[  270.730158]  [&lt;ffffffffa00b534f&gt;] cls_cgroup_classify+0x4f/0x120 [cls_cgroup]
[  270.730164]  [&lt;ffffffff816aac74&gt;] tc_classify_compat+0x74/0xc0
[  270.730166]  [&lt;ffffffff816ab573&gt;] tc_classify+0x33/0x90
[  270.730170]  [&lt;ffffffffa00bcb0a&gt;] htb_enqueue+0xaa/0x4a0 [sch_htb]
[  270.730172]  [&lt;ffffffff81682e26&gt;] __dev_queue_xmit+0x306/0x960
[  270.730174]  [&lt;ffffffff81682b70&gt;] ? __dev_queue_xmit+0x50/0x960
[  270.730176]  [&lt;ffffffff816834a3&gt;] dev_queue_xmit_sk+0x13/0x20
[  270.730185]  [&lt;ffffffff81787770&gt;] dev_queue_xmit+0x10/0x20
[  270.730187]  [&lt;ffffffff8178b91c&gt;] packet_snd.isra.62+0x54c/0x760
[  270.730190]  [&lt;ffffffff8178be25&gt;] packet_sendmsg+0x2f5/0x3f0
[  270.730203]  [&lt;ffffffff81665245&gt;] ? sock_def_readable+0x5/0x190
[  270.730210]  [&lt;ffffffff817b64bb&gt;] ? _raw_spin_unlock+0x2b/0x40
[  270.730216]  [&lt;ffffffff8173bcbc&gt;] ? unix_dgram_sendmsg+0x5cc/0x640
[  270.730219]  [&lt;ffffffff8165f367&gt;] sock_sendmsg+0x47/0x50
[  270.730221]  [&lt;ffffffff8165f42f&gt;] sock_write_iter+0x7f/0xd0
[  270.730232]  [&lt;ffffffff811fd4c7&gt;] __vfs_write+0xa7/0xf0
[  270.730234]  [&lt;ffffffff811fe5b8&gt;] vfs_write+0xb8/0x190
[  270.730236]  [&lt;ffffffff811fe8c2&gt;] SyS_write+0x52/0xb0
[  270.730239]  [&lt;ffffffff817b6bae&gt;] entry_SYSCALL_64_fastpath+0x12/0x76

Signed-off-by: Konstantin Khlebnikov &lt;khlebnikov@yandex-team.ru&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In dev_queue_xmit() net_cls protected with rcu-bh.

[  270.730026] ===============================
[  270.730029] [ INFO: suspicious RCU usage. ]
[  270.730033] 4.2.0-rc3+ #2 Not tainted
[  270.730036] -------------------------------
[  270.730040] include/linux/cgroup.h:353 suspicious rcu_dereference_check() usage!
[  270.730041] other info that might help us debug this:
[  270.730043] rcu_scheduler_active = 1, debug_locks = 1
[  270.730045] 2 locks held by dhclient/748:
[  270.730046]  #0:  (rcu_read_lock_bh){......}, at: [&lt;ffffffff81682b70&gt;] __dev_queue_xmit+0x50/0x960
[  270.730085]  #1:  (&amp;qdisc_tx_lock){+.....}, at: [&lt;ffffffff81682d60&gt;] __dev_queue_xmit+0x240/0x960
[  270.730090] stack backtrace:
[  270.730096] CPU: 0 PID: 748 Comm: dhclient Not tainted 4.2.0-rc3+ #2
[  270.730098] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
[  270.730100]  0000000000000001 ffff8800bafeba58 ffffffff817ad487 0000000000000007
[  270.730103]  ffff880232a0a780 ffff8800bafeba88 ffffffff810ca4f2 ffff88022fb23e00
[  270.730105]  ffff880232a0a780 ffff8800bafebb68 ffff8800bafebb68 ffff8800bafebaa8
[  270.730108] Call Trace:
[  270.730121]  [&lt;ffffffff817ad487&gt;] dump_stack+0x4c/0x65
[  270.730148]  [&lt;ffffffff810ca4f2&gt;] lockdep_rcu_suspicious+0xe2/0x120
[  270.730153]  [&lt;ffffffff816a62d2&gt;] task_cls_state+0x92/0xa0
[  270.730158]  [&lt;ffffffffa00b534f&gt;] cls_cgroup_classify+0x4f/0x120 [cls_cgroup]
[  270.730164]  [&lt;ffffffff816aac74&gt;] tc_classify_compat+0x74/0xc0
[  270.730166]  [&lt;ffffffff816ab573&gt;] tc_classify+0x33/0x90
[  270.730170]  [&lt;ffffffffa00bcb0a&gt;] htb_enqueue+0xaa/0x4a0 [sch_htb]
[  270.730172]  [&lt;ffffffff81682e26&gt;] __dev_queue_xmit+0x306/0x960
[  270.730174]  [&lt;ffffffff81682b70&gt;] ? __dev_queue_xmit+0x50/0x960
[  270.730176]  [&lt;ffffffff816834a3&gt;] dev_queue_xmit_sk+0x13/0x20
[  270.730185]  [&lt;ffffffff81787770&gt;] dev_queue_xmit+0x10/0x20
[  270.730187]  [&lt;ffffffff8178b91c&gt;] packet_snd.isra.62+0x54c/0x760
[  270.730190]  [&lt;ffffffff8178be25&gt;] packet_sendmsg+0x2f5/0x3f0
[  270.730203]  [&lt;ffffffff81665245&gt;] ? sock_def_readable+0x5/0x190
[  270.730210]  [&lt;ffffffff817b64bb&gt;] ? _raw_spin_unlock+0x2b/0x40
[  270.730216]  [&lt;ffffffff8173bcbc&gt;] ? unix_dgram_sendmsg+0x5cc/0x640
[  270.730219]  [&lt;ffffffff8165f367&gt;] sock_sendmsg+0x47/0x50
[  270.730221]  [&lt;ffffffff8165f42f&gt;] sock_write_iter+0x7f/0xd0
[  270.730232]  [&lt;ffffffff811fd4c7&gt;] __vfs_write+0xa7/0xf0
[  270.730234]  [&lt;ffffffff811fe5b8&gt;] vfs_write+0xb8/0x190
[  270.730236]  [&lt;ffffffff811fe8c2&gt;] SyS_write+0x52/0xb0
[  270.730239]  [&lt;ffffffff817b6bae&gt;] entry_SYSCALL_64_fastpath+0x12/0x76

Signed-off-by: Konstantin Khlebnikov &lt;khlebnikov@yandex-team.ru&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: ratelimit warnings about dst entry refcount underflow or overflow</title>
<updated>2015-07-21T07:11:19+00:00</updated>
<author>
<name>Konstantin Khlebnikov</name>
<email>khlebnikov@yandex-team.ru</email>
</author>
<published>2015-07-17T11:01:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=8bf4ada2e21378816b28205427ee6b0e1ca4c5f1'/>
<id>8bf4ada2e21378816b28205427ee6b0e1ca4c5f1</id>
<content type='text'>
Kernel generates a lot of warnings when dst entry reference counter
overflows and becomes negative. That bug was seen several times at
machines with outdated 3.10.y kernels. Most like it's already fixed
in upstream. Anyway that flood completely kills machine and makes
further debugging impossible.

Signed-off-by: Konstantin Khlebnikov &lt;khlebnikov@yandex-team.ru&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Kernel generates a lot of warnings when dst entry reference counter
overflows and becomes negative. That bug was seen several times at
machines with outdated 3.10.y kernels. Most like it's already fixed
in upstream. Anyway that flood completely kills machine and makes
further debugging impossible.

Signed-off-by: Konstantin Khlebnikov &lt;khlebnikov@yandex-team.ru&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: Fix skb csum races when peeking</title>
<updated>2015-07-15T22:59:58+00:00</updated>
<author>
<name>Herbert Xu</name>
<email>herbert@gondor.apana.org.au</email>
</author>
<published>2015-07-13T12:01:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=89c22d8c3b278212eef6a8cc66b570bc840a6f5a'/>
<id>89c22d8c3b278212eef6a8cc66b570bc840a6f5a</id>
<content type='text'>
When we calculate the checksum on the recv path, we store the
result in the skb as an optimisation in case we need the checksum
again down the line.

This is in fact bogus for the MSG_PEEK case as this is done without
any locking.  So multiple threads can peek and then store the result
to the same skb, potentially resulting in bogus skb states.

This patch fixes this by only storing the result if the skb is not
shared.  This preserves the optimisations for the few cases where
it can be done safely due to locking or other reasons, e.g., SIOCINQ.

Signed-off-by: Herbert Xu &lt;herbert@gondor.apana.org.au&gt;
Acked-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When we calculate the checksum on the recv path, we store the
result in the skb as an optimisation in case we need the checksum
again down the line.

This is in fact bogus for the MSG_PEEK case as this is done without
any locking.  So multiple threads can peek and then store the result
to the same skb, potentially resulting in bogus skb states.

This patch fixes this by only storing the result if the skb is not
shared.  This preserves the optimisations for the few cases where
it can be done safely due to locking or other reasons, e.g., SIOCINQ.

Signed-off-by: Herbert Xu &lt;herbert@gondor.apana.org.au&gt;
Acked-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: Clone skb before setting peeked flag</title>
<updated>2015-07-15T22:59:58+00:00</updated>
<author>
<name>Herbert Xu</name>
<email>herbert@gondor.apana.org.au</email>
</author>
<published>2015-07-13T08:04:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=738ac1ebb96d02e0d23bc320302a6ea94c612dec'/>
<id>738ac1ebb96d02e0d23bc320302a6ea94c612dec</id>
<content type='text'>
Shared skbs must not be modified and this is crucial for broadcast
and/or multicast paths where we use it as an optimisation to avoid
unnecessary cloning.

The function skb_recv_datagram breaks this rule by setting peeked
without cloning the skb first.  This causes funky races which leads
to double-free.

This patch fixes this by cloning the skb and replacing the skb
in the list when setting skb-&gt;peeked.

Fixes: a59322be07c9 ("[UDP]: Only increment counter on first peek/recv")
Reported-by: Konstantin Khlebnikov &lt;khlebnikov@yandex-team.ru&gt;
Signed-off-by: Herbert Xu &lt;herbert@gondor.apana.org.au&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Shared skbs must not be modified and this is crucial for broadcast
and/or multicast paths where we use it as an optimisation to avoid
unnecessary cloning.

The function skb_recv_datagram breaks this rule by setting peeked
without cloning the skb first.  This causes funky races which leads
to double-free.

This patch fixes this by cloning the skb and replacing the skb
in the list when setting skb-&gt;peeked.

Fixes: a59322be07c9 ("[UDP]: Only increment counter on first peek/recv")
Reported-by: Konstantin Khlebnikov &lt;khlebnikov@yandex-team.ru&gt;
Signed-off-by: Herbert Xu &lt;herbert@gondor.apana.org.au&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>rtnetlink: reject non-IFLA_VF_PORT attributes inside IFLA_VF_PORTS</title>
<updated>2015-07-15T22:53:27+00:00</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2015-07-12T22:06:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=035d210f928ce083435b4fd351a26d126c02c927'/>
<id>035d210f928ce083435b4fd351a26d126c02c927</id>
<content type='text'>
Similarly as in commit 4f7d2cdfdde7 ("rtnetlink: verify IFLA_VF_INFO
attributes before passing them to driver"), we have a double nesting
of netlink attributes, i.e. IFLA_VF_PORTS only contains IFLA_VF_PORT
that is nested itself. While IFLA_VF_PORTS is a verified attribute
from ifla_policy[], we only check if the IFLA_VF_PORTS container has
IFLA_VF_PORT attributes and then pass the attribute's content itself
via nla_parse_nested(). It would be more correct to reject inner types
other than IFLA_VF_PORT instead of continuing parsing and also similarly
as in commit 4f7d2cdfdde7, to check for a minimum of NLA_HDRLEN.

Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Cc: Roopa Prabhu &lt;roopa@cumulusnetworks.com&gt;
Cc: Scott Feldman &lt;sfeldma@gmail.com&gt;
Cc: Jason Gunthorpe &lt;jgunthorpe@obsidianresearch.com&gt;
Acked-by: Roopa Prabhu &lt;roopa@cumulusnetworks.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Similarly as in commit 4f7d2cdfdde7 ("rtnetlink: verify IFLA_VF_INFO
attributes before passing them to driver"), we have a double nesting
of netlink attributes, i.e. IFLA_VF_PORTS only contains IFLA_VF_PORT
that is nested itself. While IFLA_VF_PORTS is a verified attribute
from ifla_policy[], we only check if the IFLA_VF_PORTS container has
IFLA_VF_PORT attributes and then pass the attribute's content itself
via nla_parse_nested(). It would be more correct to reject inner types
other than IFLA_VF_PORT instead of continuing parsing and also similarly
as in commit 4f7d2cdfdde7, to check for a minimum of NLA_HDRLEN.

Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Cc: Roopa Prabhu &lt;roopa@cumulusnetworks.com&gt;
Cc: Scott Feldman &lt;sfeldma@gmail.com&gt;
Cc: Jason Gunthorpe &lt;jgunthorpe@obsidianresearch.com&gt;
Acked-by: Roopa Prabhu &lt;roopa@cumulusnetworks.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: call rcu_read_lock early in process_backlog</title>
<updated>2015-07-11T01:16:36+00:00</updated>
<author>
<name>Julian Anastasov</name>
<email>ja@ssi.bg</email>
</author>
<published>2015-07-09T06:59:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=2c17d27c36dcce2b6bf689f41a46b9e909877c21'/>
<id>2c17d27c36dcce2b6bf689f41a46b9e909877c21</id>
<content type='text'>
Incoming packet should be either in backlog queue or
in RCU read-side section. Otherwise, the final sequence of
flush_backlog() and synchronize_net() may miss packets
that can run without device reference:

CPU 1                  CPU 2
                       skb-&gt;dev: no reference
                       process_backlog:__skb_dequeue
                       process_backlog:local_irq_enable

on_each_cpu for
flush_backlog =&gt;       IPI(hardirq): flush_backlog
                       - packet not found in backlog

                       CPU delayed ...
synchronize_net
- no ongoing RCU
read-side sections

netdev_run_todo,
rcu_barrier: no
ongoing callbacks
                       __netif_receive_skb_core:rcu_read_lock
                       - too late
free dev
                       process packet for freed dev

Fixes: 6e583ce5242f ("net: eliminate refcounting in backlog queue")
Cc: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
Cc: Stephen Hemminger &lt;stephen@networkplumber.org&gt;
Signed-off-by: Julian Anastasov &lt;ja@ssi.bg&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Incoming packet should be either in backlog queue or
in RCU read-side section. Otherwise, the final sequence of
flush_backlog() and synchronize_net() may miss packets
that can run without device reference:

CPU 1                  CPU 2
                       skb-&gt;dev: no reference
                       process_backlog:__skb_dequeue
                       process_backlog:local_irq_enable

on_each_cpu for
flush_backlog =&gt;       IPI(hardirq): flush_backlog
                       - packet not found in backlog

                       CPU delayed ...
synchronize_net
- no ongoing RCU
read-side sections

netdev_run_todo,
rcu_barrier: no
ongoing callbacks
                       __netif_receive_skb_core:rcu_read_lock
                       - too late
free dev
                       process packet for freed dev

Fixes: 6e583ce5242f ("net: eliminate refcounting in backlog queue")
Cc: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
Cc: Stephen Hemminger &lt;stephen@networkplumber.org&gt;
Signed-off-by: Julian Anastasov &lt;ja@ssi.bg&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: do not process device backlog during unregistration</title>
<updated>2015-07-11T01:16:36+00:00</updated>
<author>
<name>Julian Anastasov</name>
<email>ja@ssi.bg</email>
</author>
<published>2015-07-09T06:59:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=e9e4dd3267d0c5234c5c0f47440456b10875dec9'/>
<id>e9e4dd3267d0c5234c5c0f47440456b10875dec9</id>
<content type='text'>
commit 381c759d9916 ("ipv4: Avoid crashing in ip_error")
fixes a problem where processed packet comes from device
with destroyed inetdev (dev-&gt;ip_ptr). This is not expected
because inetdev_destroy is called in NETDEV_UNREGISTER
phase and packets should not be processed after
dev_close_many() and synchronize_net(). Above fix is still
required because inetdev_destroy can be called for other
reasons. But it shows the real problem: backlog can keep
packets for long time and they do not hold reference to
device. Such packets are then delivered to upper levels
at the same time when device is unregistered.
Calling flush_backlog after NETDEV_UNREGISTER_FINAL still
accounts all packets from backlog but before that some packets
continue to be delivered to upper levels long after the
synchronize_net call which is supposed to wait the last
ones. Also, as Eric pointed out, processed packets, mostly
from other devices, can continue to add new packets to backlog.

Fix the problem by moving flush_backlog early, after the
device driver is stopped and before the synchronize_net() call.
Then use netif_running check to make sure we do not add more
packets to backlog. We have to do it in enqueue_to_backlog
context when the local IRQ is disabled. As result, after the
flush_backlog and synchronize_net sequence all packets
should be accounted.

Thanks to Eric W. Biederman for the test script and his
valuable feedback!

Reported-by: Vittorio Gambaletta &lt;linuxbugs@vittgam.net&gt;
Fixes: 6e583ce5242f ("net: eliminate refcounting in backlog queue")
Cc: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
Cc: Stephen Hemminger &lt;stephen@networkplumber.org&gt;
Signed-off-by: Julian Anastasov &lt;ja@ssi.bg&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 381c759d9916 ("ipv4: Avoid crashing in ip_error")
fixes a problem where processed packet comes from device
with destroyed inetdev (dev-&gt;ip_ptr). This is not expected
because inetdev_destroy is called in NETDEV_UNREGISTER
phase and packets should not be processed after
dev_close_many() and synchronize_net(). Above fix is still
required because inetdev_destroy can be called for other
reasons. But it shows the real problem: backlog can keep
packets for long time and they do not hold reference to
device. Such packets are then delivered to upper levels
at the same time when device is unregistered.
Calling flush_backlog after NETDEV_UNREGISTER_FINAL still
accounts all packets from backlog but before that some packets
continue to be delivered to upper levels long after the
synchronize_net call which is supposed to wait the last
ones. Also, as Eric pointed out, processed packets, mostly
from other devices, can continue to add new packets to backlog.

Fix the problem by moving flush_backlog early, after the
device driver is stopped and before the synchronize_net() call.
Then use netif_running check to make sure we do not add more
packets to backlog. We have to do it in enqueue_to_backlog
context when the local IRQ is disabled. As result, after the
flush_backlog and synchronize_net sequence all packets
should be accounted.

Thanks to Eric W. Biederman for the test script and his
valuable feedback!

Reported-by: Vittorio Gambaletta &lt;linuxbugs@vittgam.net&gt;
Fixes: 6e583ce5242f ("net: eliminate refcounting in backlog queue")
Cc: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
Cc: Stephen Hemminger &lt;stephen@networkplumber.org&gt;
Signed-off-by: Julian Anastasov &lt;ja@ssi.bg&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: pktgen: kill the "Wait for kthread_stop" code in pktgen_thread_worker()</title>
<updated>2015-07-09T22:05:32+00:00</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2015-07-08T19:42:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=1fbe4b46caca5b01b070af93d513031ffbcc480c'/>
<id>1fbe4b46caca5b01b070af93d513031ffbcc480c</id>
<content type='text'>
pktgen_thread_worker() doesn't need to wait for kthread_stop(), it
can simply exit. Just pktgen_create_thread() and pg_net_exit() should
do get_task_struct()/put_task_struct(). kthread_stop(dead_thread) is
fine.

Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
pktgen_thread_worker() doesn't need to wait for kthread_stop(), it
can simply exit. Just pktgen_create_thread() and pg_net_exit() should
do get_task_struct()/put_task_struct(). kthread_stop(dead_thread) is
fine.

Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
</feed>
