<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/include/net/sock.h, branch linux-4.8.y</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>dccp: do not release listeners too soon</title>
<updated>2016-11-21T09:11:34+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2016-11-03T00:14:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=72b03e549b9582ab257219cbe4236a6e0f683ad0'/>
<id>72b03e549b9582ab257219cbe4236a6e0f683ad0</id>
<content type='text'>
[ Upstream commit c3f24cfb3e508c70c26ee8569d537c8ca67a36c6 ]

Andrey Konovalov reported following error while fuzzing with syzkaller :

IPv4: Attempt to release alive inet socket ffff880068e98940
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 1 PID: 3905 Comm: a.out Not tainted 4.9.0-rc3+ #333
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88006b9e0000 task.stack: ffff880068770000
RIP: 0010:[&lt;ffffffff819ead5f&gt;]  [&lt;ffffffff819ead5f&gt;]
selinux_socket_sock_rcv_skb+0xff/0x6a0 security/selinux/hooks.c:4639
RSP: 0018:ffff8800687771c8  EFLAGS: 00010202
RAX: ffff88006b9e0000 RBX: 1ffff1000d0eee3f RCX: 1ffff1000d1d312a
RDX: 1ffff1000d1d31a6 RSI: dffffc0000000000 RDI: 0000000000000010
RBP: ffff880068777360 R08: 0000000000000000 R09: 0000000000000002
R10: dffffc0000000000 R11: 0000000000000006 R12: ffff880068e98940
R13: 0000000000000002 R14: ffff880068777338 R15: 0000000000000000
FS:  00007f00ff760700(0000) GS:ffff88006cd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020008000 CR3: 000000006a308000 CR4: 00000000000006e0
Stack:
 ffff8800687771e0 ffffffff812508a5 ffff8800686f3168 0000000000000007
 ffff88006ac8cdfc ffff8800665ea500 0000000041b58ab3 ffffffff847b5480
 ffffffff819eac60 ffff88006b9e0860 ffff88006b9e0868 ffff88006b9e07f0
Call Trace:
 [&lt;ffffffff819c8dd5&gt;] security_sock_rcv_skb+0x75/0xb0 security/security.c:1317
 [&lt;ffffffff82c2a9e7&gt;] sk_filter_trim_cap+0x67/0x10e0 net/core/filter.c:81
 [&lt;ffffffff82b81e60&gt;] __sk_receive_skb+0x30/0xa00 net/core/sock.c:460
 [&lt;ffffffff838bbf12&gt;] dccp_v4_rcv+0xdb2/0x1910 net/dccp/ipv4.c:873
 [&lt;ffffffff83069d22&gt;] ip_local_deliver_finish+0x332/0xad0
net/ipv4/ip_input.c:216
 [&lt;     inline     &gt;] NF_HOOK_THRESH ./include/linux/netfilter.h:232
 [&lt;     inline     &gt;] NF_HOOK ./include/linux/netfilter.h:255
 [&lt;ffffffff8306abd2&gt;] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257
 [&lt;     inline     &gt;] dst_input ./include/net/dst.h:507
 [&lt;ffffffff83068500&gt;] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396
 [&lt;     inline     &gt;] NF_HOOK_THRESH ./include/linux/netfilter.h:232
 [&lt;     inline     &gt;] NF_HOOK ./include/linux/netfilter.h:255
 [&lt;ffffffff8306b82f&gt;] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487
 [&lt;ffffffff82bd9fb7&gt;] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213
 [&lt;ffffffff82bdb19a&gt;] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251
 [&lt;ffffffff82bdb493&gt;] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279
 [&lt;ffffffff82bdb6b8&gt;] netif_receive_skb+0x48/0x250 net/core/dev.c:4303
 [&lt;ffffffff8241fc75&gt;] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308
 [&lt;ffffffff82421b5a&gt;] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332
 [&lt;     inline     &gt;] new_sync_write fs/read_write.c:499
 [&lt;ffffffff8151bd44&gt;] __vfs_write+0x334/0x570 fs/read_write.c:512
 [&lt;ffffffff8151f85b&gt;] vfs_write+0x17b/0x500 fs/read_write.c:560
 [&lt;     inline     &gt;] SYSC_write fs/read_write.c:607
 [&lt;ffffffff81523184&gt;] SyS_write+0xd4/0x1a0 fs/read_write.c:599
 [&lt;ffffffff83fc02c1&gt;] entry_SYSCALL_64_fastpath+0x1f/0xc2

It turns out DCCP calls __sk_receive_skb(), and this broke when
lookups no longer took a reference on listeners.

Fix this issue by adding a @refcounted parameter to __sk_receive_skb(),
so that sock_put() is used only when needed.

Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reported-by: Andrey Konovalov &lt;andreyknvl@google.com&gt;
Tested-by: Andrey Konovalov &lt;andreyknvl@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit c3f24cfb3e508c70c26ee8569d537c8ca67a36c6 ]

Andrey Konovalov reported following error while fuzzing with syzkaller :

IPv4: Attempt to release alive inet socket ffff880068e98940
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 1 PID: 3905 Comm: a.out Not tainted 4.9.0-rc3+ #333
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88006b9e0000 task.stack: ffff880068770000
RIP: 0010:[&lt;ffffffff819ead5f&gt;]  [&lt;ffffffff819ead5f&gt;]
selinux_socket_sock_rcv_skb+0xff/0x6a0 security/selinux/hooks.c:4639
RSP: 0018:ffff8800687771c8  EFLAGS: 00010202
RAX: ffff88006b9e0000 RBX: 1ffff1000d0eee3f RCX: 1ffff1000d1d312a
RDX: 1ffff1000d1d31a6 RSI: dffffc0000000000 RDI: 0000000000000010
RBP: ffff880068777360 R08: 0000000000000000 R09: 0000000000000002
R10: dffffc0000000000 R11: 0000000000000006 R12: ffff880068e98940
R13: 0000000000000002 R14: ffff880068777338 R15: 0000000000000000
FS:  00007f00ff760700(0000) GS:ffff88006cd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020008000 CR3: 000000006a308000 CR4: 00000000000006e0
Stack:
 ffff8800687771e0 ffffffff812508a5 ffff8800686f3168 0000000000000007
 ffff88006ac8cdfc ffff8800665ea500 0000000041b58ab3 ffffffff847b5480
 ffffffff819eac60 ffff88006b9e0860 ffff88006b9e0868 ffff88006b9e07f0
Call Trace:
 [&lt;ffffffff819c8dd5&gt;] security_sock_rcv_skb+0x75/0xb0 security/security.c:1317
 [&lt;ffffffff82c2a9e7&gt;] sk_filter_trim_cap+0x67/0x10e0 net/core/filter.c:81
 [&lt;ffffffff82b81e60&gt;] __sk_receive_skb+0x30/0xa00 net/core/sock.c:460
 [&lt;ffffffff838bbf12&gt;] dccp_v4_rcv+0xdb2/0x1910 net/dccp/ipv4.c:873
 [&lt;ffffffff83069d22&gt;] ip_local_deliver_finish+0x332/0xad0
net/ipv4/ip_input.c:216
 [&lt;     inline     &gt;] NF_HOOK_THRESH ./include/linux/netfilter.h:232
 [&lt;     inline     &gt;] NF_HOOK ./include/linux/netfilter.h:255
 [&lt;ffffffff8306abd2&gt;] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257
 [&lt;     inline     &gt;] dst_input ./include/net/dst.h:507
 [&lt;ffffffff83068500&gt;] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396
 [&lt;     inline     &gt;] NF_HOOK_THRESH ./include/linux/netfilter.h:232
 [&lt;     inline     &gt;] NF_HOOK ./include/linux/netfilter.h:255
 [&lt;ffffffff8306b82f&gt;] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487
 [&lt;ffffffff82bd9fb7&gt;] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213
 [&lt;ffffffff82bdb19a&gt;] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251
 [&lt;ffffffff82bdb493&gt;] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279
 [&lt;ffffffff82bdb6b8&gt;] netif_receive_skb+0x48/0x250 net/core/dev.c:4303
 [&lt;ffffffff8241fc75&gt;] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308
 [&lt;ffffffff82421b5a&gt;] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332
 [&lt;     inline     &gt;] new_sync_write fs/read_write.c:499
 [&lt;ffffffff8151bd44&gt;] __vfs_write+0x334/0x570 fs/read_write.c:512
 [&lt;ffffffff8151f85b&gt;] vfs_write+0x17b/0x500 fs/read_write.c:560
 [&lt;     inline     &gt;] SYSC_write fs/read_write.c:607
 [&lt;ffffffff81523184&gt;] SyS_write+0xd4/0x1a0 fs/read_write.c:599
 [&lt;ffffffff83fc02c1&gt;] entry_SYSCALL_64_fastpath+0x1f/0xc2

It turns out DCCP calls __sk_receive_skb(), and this broke when
lookups no longer took a reference on listeners.

Fix this issue by adding a @refcounted parameter to __sk_receive_skb(),
so that sock_put() is used only when needed.

Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reported-by: Andrey Konovalov &lt;andreyknvl@google.com&gt;
Tested-by: Andrey Konovalov &lt;andreyknvl@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: avoid sk_forward_alloc overflows</title>
<updated>2016-09-17T13:59:31+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2016-09-15T15:48:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=20c64d5cd5a2bdcdc8982a06cb05e5e1bd851a3d'/>
<id>20c64d5cd5a2bdcdc8982a06cb05e5e1bd851a3d</id>
<content type='text'>
A malicious TCP receiver, sending SACK, can force the sender to split
skbs in write queue and increase its memory usage.

Then, when socket is closed and its write queue purged, we might
overflow sk_forward_alloc (It becomes negative)

sk_mem_reclaim() does nothing in this case, and more than 2GB
are leaked from TCP perspective (tcp_memory_allocated is not changed)

Then warnings trigger from inet_sock_destruct() and
sk_stream_kill_queues() seeing a not zero sk_forward_alloc

All TCP stack can be stuck because TCP is under memory pressure.

A simple fix is to preemptively reclaim from sk_mem_uncharge().

This makes sure a socket wont have more than 2 MB forward allocated,
after burst and idle period.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
A malicious TCP receiver, sending SACK, can force the sender to split
skbs in write queue and increase its memory usage.

Then, when socket is closed and its write queue purged, we might
overflow sk_forward_alloc (It becomes negative)

sk_mem_reclaim() does nothing in this case, and more than 2GB
are leaked from TCP perspective (tcp_memory_allocated is not changed)

Then warnings trigger from inet_sock_destruct() and
sk_stream_kill_queues() seeing a not zero sk_forward_alloc

All TCP stack can be stuck because TCP is under memory pressure.

A simple fix is to preemptively reclaim from sk_mem_uncharge().

This makes sure a socket wont have more than 2 MB forward allocated,
after burst and idle period.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>dccp: limit sk_filter trim to payload</title>
<updated>2016-07-13T18:53:41+00:00</updated>
<author>
<name>Willem de Bruijn</name>
<email>willemb@google.com</email>
</author>
<published>2016-07-12T22:18:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=4f0c40d94461cfd23893a17335b2ab78ecb333c8'/>
<id>4f0c40d94461cfd23893a17335b2ab78ecb333c8</id>
<content type='text'>
Dccp verifies packet integrity, including length, at initial rcv in
dccp_invalid_packet, later pulls headers in dccp_enqueue_skb.

A call to sk_filter in-between can cause __skb_pull to wrap skb-&gt;len.
skb_copy_datagram_msg interprets this as a negative value, so
(correctly) fails with EFAULT. The negative length is reported in
ioctl SIOCINQ or possibly in a DCCP_WARN in dccp_close.

Introduce an sk_receive_skb variant that caps how small a filter
program can trim packets, and call this in dccp with the header
length. Excessively trimmed packets are now processed normally and
queued for reception as 0B payloads.

Fixes: 7c657876b63c ("[DCCP]: Initial implementation")
Signed-off-by: Willem de Bruijn &lt;willemb@google.com&gt;
Acked-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Dccp verifies packet integrity, including length, at initial rcv in
dccp_invalid_packet, later pulls headers in dccp_enqueue_skb.

A call to sk_filter in-between can cause __skb_pull to wrap skb-&gt;len.
skb_copy_datagram_msg interprets this as a negative value, so
(correctly) fails with EFAULT. The negative length is reported in
ioctl SIOCINQ or possibly in a DCCP_WARN in dccp_close.

Introduce an sk_receive_skb variant that caps how small a filter
program can trim packets, and call this in dccp with the header
length. Excessively trimmed packets are now processed normally and
queued for reception as 0B payloads.

Fixes: 7c657876b63c ("[DCCP]: Initial implementation")
Signed-off-by: Willem de Bruijn &lt;willemb@google.com&gt;
Acked-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: sock: move -&gt;sk_shutdown out of bitfields.</title>
<updated>2016-05-20T22:05:32+00:00</updated>
<author>
<name>Andrey Ryabinin</name>
<email>aryabinin@virtuozzo.com</email>
</author>
<published>2016-05-18T16:19:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=fc64869c48494a401b1fb627c9ecc4e6c1d74b0d'/>
<id>fc64869c48494a401b1fb627c9ecc4e6c1d74b0d</id>
<content type='text'>
-&gt;sk_shutdown bits share one bitfield with some other bits in sock struct,
such as -&gt;sk_no_check_[r,t]x, -&gt;sk_userlocks ...
sock_setsockopt() may write to these bits, while holding the socket lock.

In case of AF_UNIX sockets, we change -&gt;sk_shutdown bits while holding only
unix_state_lock(). So concurrent setsockopt() and shutdown() may lead
to corrupting these bits.

Fix this by moving -&gt;sk_shutdown bits out of bitfield into a separate byte.
This will not change the 'struct sock' size since -&gt;sk_shutdown moved into
previously unused 16-bit hole.

Signed-off-by: Andrey Ryabinin &lt;aryabinin@virtuozzo.com&gt;
Suggested-by: Hannes Frederic Sowa &lt;hannes@stressinduktion.org&gt;
Acked-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
-&gt;sk_shutdown bits share one bitfield with some other bits in sock struct,
such as -&gt;sk_no_check_[r,t]x, -&gt;sk_userlocks ...
sock_setsockopt() may write to these bits, while holding the socket lock.

In case of AF_UNIX sockets, we change -&gt;sk_shutdown bits while holding only
unix_state_lock(). So concurrent setsockopt() and shutdown() may lead
to corrupting these bits.

Fix this by moving -&gt;sk_shutdown bits out of bitfield into a separate byte.
This will not change the 'struct sock' size since -&gt;sk_shutdown moved into
previously unused 16-bit hole.

Signed-off-by: Andrey Ryabinin &lt;aryabinin@virtuozzo.com&gt;
Suggested-by: Hannes Frederic Sowa &lt;hannes@stressinduktion.org&gt;
Acked-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: fix lockdep splat in tcp_snd_una_update()</title>
<updated>2016-05-04T20:55:11+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2016-05-03T23:56:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=46cc6e4976e3d9058490f20d93bc7805f7f2d81e'/>
<id>46cc6e4976e3d9058490f20d93bc7805f7f2d81e</id>
<content type='text'>
tcp_snd_una_update() and tcp_rcv_nxt_update() call
u64_stats_update_begin() either from process context or BH handler.

This triggers a lockdep splat on 32bit &amp; SMP builds.

We could add u64_stats_update_begin_bh() variant but this would
slow down 32bit builds with useless local_disable_bh() and
local_enable_bh() pairs, since we own the socket lock at this point.

I add sock_owned_by_me() helper to have proper lockdep support
even on 64bit builds, and new u64_stats_update_begin_raw()
and u64_stats_update_end_raw methods.

Fixes: c10d9310edf5 ("tcp: do not assume TCP code is non preemptible")
Reported-by: Fabio Estevam &lt;festevam@gmail.com&gt;
Diagnosed-by: Francois Romieu &lt;romieu@fr.zoreil.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Tested-by: Fabio Estevam &lt;fabio.estevam@nxp.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
tcp_snd_una_update() and tcp_rcv_nxt_update() call
u64_stats_update_begin() either from process context or BH handler.

This triggers a lockdep splat on 32bit &amp; SMP builds.

We could add u64_stats_update_begin_bh() variant but this would
slow down 32bit builds with useless local_disable_bh() and
local_enable_bh() pairs, since we own the socket lock at this point.

I add sock_owned_by_me() helper to have proper lockdep support
even on 64bit builds, and new u64_stats_update_begin_raw()
and u64_stats_update_end_raw methods.

Fixes: c10d9310edf5 ("tcp: do not assume TCP code is non preemptible")
Reported-by: Fabio Estevam &lt;festevam@gmail.com&gt;
Diagnosed-by: Francois Romieu &lt;romieu@fr.zoreil.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Tested-by: Fabio Estevam &lt;fabio.estevam@nxp.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: add __sock_wfree() helper</title>
<updated>2016-05-03T20:02:36+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2016-05-02T17:56:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=1d2077ac0165c0d173a2255e37cf4dc5033d92c7'/>
<id>1d2077ac0165c0d173a2255e37cf4dc5033d92c7</id>
<content type='text'>
Hosts sending lot of ACK packets exhibit high sock_wfree() cost
because of cache line miss to test SOCK_USE_WRITE_QUEUE

We could move this flag close to sk_wmem_alloc but it is better
to perform the atomic_sub_and_test() on a clean cache line,
as it avoid one extra bus transaction.

skb_orphan_partial() can also have a fast track for packets that either
are TCP acks, or already went through another skb_orphan_partial()

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Hosts sending lot of ACK packets exhibit high sock_wfree() cost
because of cache line miss to test SOCK_USE_WRITE_QUEUE

We could move this flag close to sk_wmem_alloc but it is better
to perform the atomic_sub_and_test() on a clean cache line,
as it avoid one extra bus transaction.

skb_orphan_partial() can also have a fast track for packets that either
are TCP acks, or already went through another skb_orphan_partial()

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: make tcp_sendmsg() aware of socket backlog</title>
<updated>2016-05-02T21:02:26+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2016-04-29T21:16:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=d41a69f1d390fa3f2546498103cdcd78b30676ff'/>
<id>d41a69f1d390fa3f2546498103cdcd78b30676ff</id>
<content type='text'>
Large sendmsg()/write() hold socket lock for the duration of the call,
unless sk-&gt;sk_sndbuf limit is hit. This is bad because incoming packets
are parked into socket backlog for a long time.
Critical decisions like fast retransmit might be delayed.
Receivers have to maintain a big out of order queue with additional cpu
overhead, and also possible stalls in TX once windows are full.

Bidirectional flows are particularly hurt since the backlog can become
quite big if the copy from user space triggers IO (page faults)

Some applications learnt to use sendmsg() (or sendmmsg()) with small
chunks to avoid this issue.

Kernel should know better, right ?

Add a generic sk_flush_backlog() helper and use it right
before a new skb is allocated. Typically we put 64KB of payload
per skb (unless MSG_EOR is requested) and checking socket backlog
every 64KB gives good results.

As a matter of fact, tests with TSO/GSO disabled give very nice
results, as we manage to keep a small write queue and smaller
perceived rtt.

Note that sk_flush_backlog() maintains socket ownership,
so is not equivalent to a {release_sock(sk); lock_sock(sk);},
to ensure implicit atomicity rules that sendmsg() was
giving to (possibly buggy) applications.

In this simple implementation, I chose to not call tcp_release_cb(),
but we might consider this later.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Alexei Starovoitov &lt;ast@fb.com&gt;
Cc: Marcelo Ricardo Leitner &lt;marcelo.leitner@gmail.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Large sendmsg()/write() hold socket lock for the duration of the call,
unless sk-&gt;sk_sndbuf limit is hit. This is bad because incoming packets
are parked into socket backlog for a long time.
Critical decisions like fast retransmit might be delayed.
Receivers have to maintain a big out of order queue with additional cpu
overhead, and also possible stalls in TX once windows are full.

Bidirectional flows are particularly hurt since the backlog can become
quite big if the copy from user space triggers IO (page faults)

Some applications learnt to use sendmsg() (or sendmmsg()) with small
chunks to avoid this issue.

Kernel should know better, right ?

Add a generic sk_flush_backlog() helper and use it right
before a new skb is allocated. Typically we put 64KB of payload
per skb (unless MSG_EOR is requested) and checking socket backlog
every 64KB gives good results.

As a matter of fact, tests with TSO/GSO disabled give very nice
results, as we manage to keep a small write queue and smaller
perceived rtt.

Note that sk_flush_backlog() maintains socket ownership,
so is not equivalent to a {release_sock(sk); lock_sock(sk);},
to ensure implicit atomicity rules that sendmsg() was
giving to (possibly buggy) applications.

In this simple implementation, I chose to not call tcp_release_cb(),
but we might consider this later.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Alexei Starovoitov &lt;ast@fb.com&gt;
Cc: Marcelo Ricardo Leitner &lt;marcelo.leitner@gmail.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: SOCKWQ_ASYNC_WAITDATA optimizations</title>
<updated>2016-04-28T03:08:40+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2016-04-25T17:39:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=4be735225f7cd040ca81c18740e7b672021bafeb'/>
<id>4be735225f7cd040ca81c18740e7b672021bafeb</id>
<content type='text'>
SOCKWQ_ASYNC_WAITDATA is set/cleared in sk_wait_data()
and equivalent functions, so that sock_wake_async() can send
a SIGIO only when necessary.

Since these atomic operations are really not needed unless
socket expressed interest in FASYNC, we can omit them in most
cases.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
SOCKWQ_ASYNC_WAITDATA is set/cleared in sk_wait_data()
and equivalent functions, so that sock_wake_async() can send
a SIGIO only when necessary.

Since these atomic operations are really not needed unless
socket expressed interest in FASYNC, we can omit them in most
cases.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: SOCKWQ_ASYNC_NOSPACE optimizations</title>
<updated>2016-04-28T03:08:40+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2016-04-25T17:39:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=9317bb69824ec8d078b0b786b6971aedb0af3d4f'/>
<id>9317bb69824ec8d078b0b786b6971aedb0af3d4f</id>
<content type='text'>
SOCKWQ_ASYNC_NOSPACE is tested in sock_wake_async()
so that a SIGIO signal is sent when needed.

tcp_sendmsg() clears the bit.
tcp_poll() sets the bit when stream is not writeable.

We can avoid two atomic operations by first checking if socket
is actually interested in the FASYNC business (most sockets in
real applications do not use AIO, but select()/poll()/epoll())

This also removes one cache line miss to access sk-&gt;sk_wq-&gt;flags
in tcp_sendmsg()

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
SOCKWQ_ASYNC_NOSPACE is tested in sock_wake_async()
so that a SIGIO signal is sent when needed.

tcp_sendmsg() clears the bit.
tcp_poll() sets the bit when stream is not writeable.

We can avoid two atomic operations by first checking if socket
is actually interested in the FASYNC business (most sockets in
real applications do not use AIO, but select()/poll()/epoll())

This also removes one cache line miss to access sk-&gt;sk_wq-&gt;flags
in tcp_sendmsg()

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>soreuseport: Resolve merge conflict for v4/v6 ordering fix</title>
<updated>2016-04-25T17:27:54+00:00</updated>
<author>
<name>Craig Gallek</name>
<email>kraig@google.com</email>
</author>
<published>2016-04-25T14:42:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=d296ba60d8e2de23a350796a567a3aa90fe1cb6e'/>
<id>d296ba60d8e2de23a350796a567a3aa90fe1cb6e</id>
<content type='text'>
d894ba18d4e4 ("soreuseport: fix ordering for mixed v4/v6 sockets")
was merged as a bug fix to the net tree.  Two conflicting changes
were committed to net-next before the above fix was merged back to
net-next:
ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")

These changes switched the datastructure used for TCP and UDP sockets
from hlist_nulls to hlist.  This patch applies the necessary parts
of the net tree fix to net-next which were not automatic as part of the
merge.

Fixes: 1602f49b58ab ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Signed-off-by: Craig Gallek &lt;kraig@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
d894ba18d4e4 ("soreuseport: fix ordering for mixed v4/v6 sockets")
was merged as a bug fix to the net tree.  Two conflicting changes
were committed to net-next before the above fix was merged back to
net-next:
ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")

These changes switched the datastructure used for TCP and UDP sockets
from hlist_nulls to hlist.  This patch applies the necessary parts
of the net tree fix to net-next which were not automatic as part of the
merge.

Fixes: 1602f49b58ab ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Signed-off-by: Craig Gallek &lt;kraig@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
</feed>
