<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/net/core/sock.c, branch linux-4.8.y</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>net: avoid signed overflows for SO_{SND|RCV}BUFFORCE</title>
<updated>2016-12-10T18:09:42+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2016-12-02T17:44:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=f818e5d86aef49c067296d29f1e277c7ee1713e8'/>
<id>f818e5d86aef49c067296d29f1e277c7ee1713e8</id>
<content type='text'>
[ Upstream commit b98b0bc8c431e3ceb4b26b0dfc8db509518fb290 ]

CAP_NET_ADMIN users should not be allowed to set negative
sk_sndbuf or sk_rcvbuf values, as it can lead to various memory
corruptions, crashes, OOM...

Note that before commit 82981930125a ("net: cleanups in
sock_setsockopt()"), the bug was even more serious, since SO_SNDBUF
and SO_RCVBUF were vulnerable.

This needs to be backported to all known linux kernels.

Again, many thanks to syzkaller team for discovering this gem.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reported-by: Andrey Konovalov &lt;andreyknvl@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit b98b0bc8c431e3ceb4b26b0dfc8db509518fb290 ]

CAP_NET_ADMIN users should not be allowed to set negative
sk_sndbuf or sk_rcvbuf values, as it can lead to various memory
corruptions, crashes, OOM...

Note that before commit 82981930125a ("net: cleanups in
sock_setsockopt()"), the bug was even more serious, since SO_SNDBUF
and SO_RCVBUF were vulnerable.

This needs to be backported to all known linux kernels.

Again, many thanks to syzkaller team for discovering this gem.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reported-by: Andrey Konovalov &lt;andreyknvl@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>dccp: do not release listeners too soon</title>
<updated>2016-11-21T09:11:34+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2016-11-03T00:14:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=72b03e549b9582ab257219cbe4236a6e0f683ad0'/>
<id>72b03e549b9582ab257219cbe4236a6e0f683ad0</id>
<content type='text'>
[ Upstream commit c3f24cfb3e508c70c26ee8569d537c8ca67a36c6 ]

Andrey Konovalov reported following error while fuzzing with syzkaller :

IPv4: Attempt to release alive inet socket ffff880068e98940
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 1 PID: 3905 Comm: a.out Not tainted 4.9.0-rc3+ #333
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88006b9e0000 task.stack: ffff880068770000
RIP: 0010:[&lt;ffffffff819ead5f&gt;]  [&lt;ffffffff819ead5f&gt;]
selinux_socket_sock_rcv_skb+0xff/0x6a0 security/selinux/hooks.c:4639
RSP: 0018:ffff8800687771c8  EFLAGS: 00010202
RAX: ffff88006b9e0000 RBX: 1ffff1000d0eee3f RCX: 1ffff1000d1d312a
RDX: 1ffff1000d1d31a6 RSI: dffffc0000000000 RDI: 0000000000000010
RBP: ffff880068777360 R08: 0000000000000000 R09: 0000000000000002
R10: dffffc0000000000 R11: 0000000000000006 R12: ffff880068e98940
R13: 0000000000000002 R14: ffff880068777338 R15: 0000000000000000
FS:  00007f00ff760700(0000) GS:ffff88006cd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020008000 CR3: 000000006a308000 CR4: 00000000000006e0
Stack:
 ffff8800687771e0 ffffffff812508a5 ffff8800686f3168 0000000000000007
 ffff88006ac8cdfc ffff8800665ea500 0000000041b58ab3 ffffffff847b5480
 ffffffff819eac60 ffff88006b9e0860 ffff88006b9e0868 ffff88006b9e07f0
Call Trace:
 [&lt;ffffffff819c8dd5&gt;] security_sock_rcv_skb+0x75/0xb0 security/security.c:1317
 [&lt;ffffffff82c2a9e7&gt;] sk_filter_trim_cap+0x67/0x10e0 net/core/filter.c:81
 [&lt;ffffffff82b81e60&gt;] __sk_receive_skb+0x30/0xa00 net/core/sock.c:460
 [&lt;ffffffff838bbf12&gt;] dccp_v4_rcv+0xdb2/0x1910 net/dccp/ipv4.c:873
 [&lt;ffffffff83069d22&gt;] ip_local_deliver_finish+0x332/0xad0
net/ipv4/ip_input.c:216
 [&lt;     inline     &gt;] NF_HOOK_THRESH ./include/linux/netfilter.h:232
 [&lt;     inline     &gt;] NF_HOOK ./include/linux/netfilter.h:255
 [&lt;ffffffff8306abd2&gt;] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257
 [&lt;     inline     &gt;] dst_input ./include/net/dst.h:507
 [&lt;ffffffff83068500&gt;] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396
 [&lt;     inline     &gt;] NF_HOOK_THRESH ./include/linux/netfilter.h:232
 [&lt;     inline     &gt;] NF_HOOK ./include/linux/netfilter.h:255
 [&lt;ffffffff8306b82f&gt;] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487
 [&lt;ffffffff82bd9fb7&gt;] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213
 [&lt;ffffffff82bdb19a&gt;] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251
 [&lt;ffffffff82bdb493&gt;] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279
 [&lt;ffffffff82bdb6b8&gt;] netif_receive_skb+0x48/0x250 net/core/dev.c:4303
 [&lt;ffffffff8241fc75&gt;] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308
 [&lt;ffffffff82421b5a&gt;] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332
 [&lt;     inline     &gt;] new_sync_write fs/read_write.c:499
 [&lt;ffffffff8151bd44&gt;] __vfs_write+0x334/0x570 fs/read_write.c:512
 [&lt;ffffffff8151f85b&gt;] vfs_write+0x17b/0x500 fs/read_write.c:560
 [&lt;     inline     &gt;] SYSC_write fs/read_write.c:607
 [&lt;ffffffff81523184&gt;] SyS_write+0xd4/0x1a0 fs/read_write.c:599
 [&lt;ffffffff83fc02c1&gt;] entry_SYSCALL_64_fastpath+0x1f/0xc2

It turns out DCCP calls __sk_receive_skb(), and this broke when
lookups no longer took a reference on listeners.

Fix this issue by adding a @refcounted parameter to __sk_receive_skb(),
so that sock_put() is used only when needed.

Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reported-by: Andrey Konovalov &lt;andreyknvl@google.com&gt;
Tested-by: Andrey Konovalov &lt;andreyknvl@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit c3f24cfb3e508c70c26ee8569d537c8ca67a36c6 ]

Andrey Konovalov reported following error while fuzzing with syzkaller :

IPv4: Attempt to release alive inet socket ffff880068e98940
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 1 PID: 3905 Comm: a.out Not tainted 4.9.0-rc3+ #333
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88006b9e0000 task.stack: ffff880068770000
RIP: 0010:[&lt;ffffffff819ead5f&gt;]  [&lt;ffffffff819ead5f&gt;]
selinux_socket_sock_rcv_skb+0xff/0x6a0 security/selinux/hooks.c:4639
RSP: 0018:ffff8800687771c8  EFLAGS: 00010202
RAX: ffff88006b9e0000 RBX: 1ffff1000d0eee3f RCX: 1ffff1000d1d312a
RDX: 1ffff1000d1d31a6 RSI: dffffc0000000000 RDI: 0000000000000010
RBP: ffff880068777360 R08: 0000000000000000 R09: 0000000000000002
R10: dffffc0000000000 R11: 0000000000000006 R12: ffff880068e98940
R13: 0000000000000002 R14: ffff880068777338 R15: 0000000000000000
FS:  00007f00ff760700(0000) GS:ffff88006cd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020008000 CR3: 000000006a308000 CR4: 00000000000006e0
Stack:
 ffff8800687771e0 ffffffff812508a5 ffff8800686f3168 0000000000000007
 ffff88006ac8cdfc ffff8800665ea500 0000000041b58ab3 ffffffff847b5480
 ffffffff819eac60 ffff88006b9e0860 ffff88006b9e0868 ffff88006b9e07f0
Call Trace:
 [&lt;ffffffff819c8dd5&gt;] security_sock_rcv_skb+0x75/0xb0 security/security.c:1317
 [&lt;ffffffff82c2a9e7&gt;] sk_filter_trim_cap+0x67/0x10e0 net/core/filter.c:81
 [&lt;ffffffff82b81e60&gt;] __sk_receive_skb+0x30/0xa00 net/core/sock.c:460
 [&lt;ffffffff838bbf12&gt;] dccp_v4_rcv+0xdb2/0x1910 net/dccp/ipv4.c:873
 [&lt;ffffffff83069d22&gt;] ip_local_deliver_finish+0x332/0xad0
net/ipv4/ip_input.c:216
 [&lt;     inline     &gt;] NF_HOOK_THRESH ./include/linux/netfilter.h:232
 [&lt;     inline     &gt;] NF_HOOK ./include/linux/netfilter.h:255
 [&lt;ffffffff8306abd2&gt;] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257
 [&lt;     inline     &gt;] dst_input ./include/net/dst.h:507
 [&lt;ffffffff83068500&gt;] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396
 [&lt;     inline     &gt;] NF_HOOK_THRESH ./include/linux/netfilter.h:232
 [&lt;     inline     &gt;] NF_HOOK ./include/linux/netfilter.h:255
 [&lt;ffffffff8306b82f&gt;] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487
 [&lt;ffffffff82bd9fb7&gt;] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213
 [&lt;ffffffff82bdb19a&gt;] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251
 [&lt;ffffffff82bdb493&gt;] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279
 [&lt;ffffffff82bdb6b8&gt;] netif_receive_skb+0x48/0x250 net/core/dev.c:4303
 [&lt;ffffffff8241fc75&gt;] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308
 [&lt;ffffffff82421b5a&gt;] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332
 [&lt;     inline     &gt;] new_sync_write fs/read_write.c:499
 [&lt;ffffffff8151bd44&gt;] __vfs_write+0x334/0x570 fs/read_write.c:512
 [&lt;ffffffff8151f85b&gt;] vfs_write+0x17b/0x500 fs/read_write.c:560
 [&lt;     inline     &gt;] SYSC_write fs/read_write.c:607
 [&lt;ffffffff81523184&gt;] SyS_write+0xd4/0x1a0 fs/read_write.c:599
 [&lt;ffffffff83fc02c1&gt;] entry_SYSCALL_64_fastpath+0x1f/0xc2

It turns out DCCP calls __sk_receive_skb(), and this broke when
lookups no longer took a reference on listeners.

Fix this issue by adding a @refcounted parameter to __sk_receive_skb(),
so that sock_put() is used only when needed.

Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reported-by: Andrey Konovalov &lt;andreyknvl@google.com&gt;
Tested-by: Andrey Konovalov &lt;andreyknvl@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: clear sk_err_soft in sk_clone_lock()</title>
<updated>2016-11-21T09:11:34+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2016-10-28T20:40:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=ac22a3ba07964d39e8c2cf2abda7c610d861eefe'/>
<id>ac22a3ba07964d39e8c2cf2abda7c610d861eefe</id>
<content type='text'>
[ Upstream commit e551c32d57c88923f99f8f010e89ca7ed0735e83 ]

At accept() time, it is possible the parent has a non zero
sk_err_soft, leftover from a prior error.

Make sure we do not leave this value in the child, as it
makes future getsockopt(SO_ERROR) calls quite unreliable.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit e551c32d57c88923f99f8f010e89ca7ed0735e83 ]

At accept() time, it is possible the parent has a non zero
sk_err_soft, leftover from a prior error.

Make sure we do not leave this value in the child, as it
makes future getsockopt(SO_ERROR) calls quite unreliable.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cgroup: duplicate cgroup reference when cloning sockets</title>
<updated>2016-09-19T22:36:17+00:00</updated>
<author>
<name>Johannes Weiner</name>
<email>jweiner@fb.com</email>
</author>
<published>2016-09-19T21:44:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=d979a39d7242e0601bf9b60e89628fb8ac577179'/>
<id>d979a39d7242e0601bf9b60e89628fb8ac577179</id>
<content type='text'>
When a socket is cloned, the associated sock_cgroup_data is duplicated
but not its reference on the cgroup.  As a result, the cgroup reference
count will underflow when both sockets are destroyed later on.

Fixes: bd1060a1d671 ("sock, cgroup: add sock-&gt;sk_cgroup")
Link: http://lkml.kernel.org/r/20160914194846.11153-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Vladimir Davydov &lt;vdavydov@virtuozzo.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[4.5+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When a socket is cloned, the associated sock_cgroup_data is duplicated
but not its reference on the cgroup.  As a result, the cgroup reference
count will underflow when both sockets are destroyed later on.

Fixes: bd1060a1d671 ("sock, cgroup: add sock-&gt;sk_cgroup")
Link: http://lkml.kernel.org/r/20160914194846.11153-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Vladimir Davydov &lt;vdavydov@virtuozzo.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[4.5+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>dccp: limit sk_filter trim to payload</title>
<updated>2016-07-13T18:53:41+00:00</updated>
<author>
<name>Willem de Bruijn</name>
<email>willemb@google.com</email>
</author>
<published>2016-07-12T22:18:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=4f0c40d94461cfd23893a17335b2ab78ecb333c8'/>
<id>4f0c40d94461cfd23893a17335b2ab78ecb333c8</id>
<content type='text'>
Dccp verifies packet integrity, including length, at initial rcv in
dccp_invalid_packet, later pulls headers in dccp_enqueue_skb.

A call to sk_filter in-between can cause __skb_pull to wrap skb-&gt;len.
skb_copy_datagram_msg interprets this as a negative value, so
(correctly) fails with EFAULT. The negative length is reported in
ioctl SIOCINQ or possibly in a DCCP_WARN in dccp_close.

Introduce an sk_receive_skb variant that caps how small a filter
program can trim packets, and call this in dccp with the header
length. Excessively trimmed packets are now processed normally and
queued for reception as 0B payloads.

Fixes: 7c657876b63c ("[DCCP]: Initial implementation")
Signed-off-by: Willem de Bruijn &lt;willemb@google.com&gt;
Acked-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Dccp verifies packet integrity, including length, at initial rcv in
dccp_invalid_packet, later pulls headers in dccp_enqueue_skb.

A call to sk_filter in-between can cause __skb_pull to wrap skb-&gt;len.
skb_copy_datagram_msg interprets this as a negative value, so
(correctly) fails with EFAULT. The negative length is reported in
ioctl SIOCINQ or possibly in a DCCP_WARN in dccp_close.

Introduce an sk_receive_skb variant that caps how small a filter
program can trim packets, and call this in dccp with the header
length. Excessively trimmed packets are now processed normally and
queued for reception as 0B payloads.

Fixes: 7c657876b63c ("[DCCP]: Initial implementation")
Signed-off-by: Willem de Bruijn &lt;willemb@google.com&gt;
Acked-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sock: ignore SCM_RIGHTS and SCM_CREDENTIALS in __sock_cmsg_send</title>
<updated>2016-07-11T21:32:44+00:00</updated>
<author>
<name>Soheil Hassas Yeganeh</name>
<email>soheil@google.com</email>
</author>
<published>2016-07-11T20:51:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=779f1edec664a7b32b71f7b4702e085a08d60592'/>
<id>779f1edec664a7b32b71f7b4702e085a08d60592</id>
<content type='text'>
Sergei Trofimovich reported that pulse audio sends SCM_CREDENTIALS
as a control message to TCP. Since __sock_cmsg_send does not
support SCM_RIGHTS and SCM_CREDENTIALS, it returns an error and
hence breaks pulse audio over TCP.

SCM_RIGHTS and SCM_CREDENTIALS are sent on the SOL_SOCKET layer
but they semantically belong to SOL_UNIX. Since all
cmsg-processing functions including sock_cmsg_send ignore control
messages of other layers, it is best to ignore SCM_RIGHTS
and SCM_CREDENTIALS for consistency (and also for fixing pulse
audio over TCP).

Fixes: c14ac9451c34 ("sock: enable timestamping using control messages")
Signed-off-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Reported-by: Sergei Trofimovich &lt;slyfox@gentoo.org&gt;
Tested-by: Sergei Trofimovich &lt;slyfox@gentoo.org&gt;
Cc: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Willem de Bruijn &lt;willemb@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Sergei Trofimovich reported that pulse audio sends SCM_CREDENTIALS
as a control message to TCP. Since __sock_cmsg_send does not
support SCM_RIGHTS and SCM_CREDENTIALS, it returns an error and
hence breaks pulse audio over TCP.

SCM_RIGHTS and SCM_CREDENTIALS are sent on the SOL_SOCKET layer
but they semantically belong to SOL_UNIX. Since all
cmsg-processing functions including sock_cmsg_send ignore control
messages of other layers, it is best to ignore SCM_RIGHTS
and SCM_CREDENTIALS for consistency (and also for fixing pulse
audio over TCP).

Fixes: c14ac9451c34 ("sock: enable timestamping using control messages")
Signed-off-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Reported-by: Sergei Trofimovich &lt;slyfox@gentoo.org&gt;
Tested-by: Sergei Trofimovich &lt;slyfox@gentoo.org&gt;
Cc: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Willem de Bruijn &lt;willemb@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: add __sock_wfree() helper</title>
<updated>2016-05-03T20:02:36+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2016-05-02T17:56:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=1d2077ac0165c0d173a2255e37cf4dc5033d92c7'/>
<id>1d2077ac0165c0d173a2255e37cf4dc5033d92c7</id>
<content type='text'>
Hosts sending lot of ACK packets exhibit high sock_wfree() cost
because of cache line miss to test SOCK_USE_WRITE_QUEUE

We could move this flag close to sk_wmem_alloc but it is better
to perform the atomic_sub_and_test() on a clean cache line,
as it avoid one extra bus transaction.

skb_orphan_partial() can also have a fast track for packets that either
are TCP acks, or already went through another skb_orphan_partial()

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Hosts sending lot of ACK packets exhibit high sock_wfree() cost
because of cache line miss to test SOCK_USE_WRITE_QUEUE

We could move this flag close to sk_wmem_alloc but it is better
to perform the atomic_sub_and_test() on a clean cache line,
as it avoid one extra bus transaction.

skb_orphan_partial() can also have a fast track for packets that either
are TCP acks, or already went through another skb_orphan_partial()

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: make tcp_sendmsg() aware of socket backlog</title>
<updated>2016-05-02T21:02:26+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2016-04-29T21:16:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=d41a69f1d390fa3f2546498103cdcd78b30676ff'/>
<id>d41a69f1d390fa3f2546498103cdcd78b30676ff</id>
<content type='text'>
Large sendmsg()/write() hold socket lock for the duration of the call,
unless sk-&gt;sk_sndbuf limit is hit. This is bad because incoming packets
are parked into socket backlog for a long time.
Critical decisions like fast retransmit might be delayed.
Receivers have to maintain a big out of order queue with additional cpu
overhead, and also possible stalls in TX once windows are full.

Bidirectional flows are particularly hurt since the backlog can become
quite big if the copy from user space triggers IO (page faults)

Some applications learnt to use sendmsg() (or sendmmsg()) with small
chunks to avoid this issue.

Kernel should know better, right ?

Add a generic sk_flush_backlog() helper and use it right
before a new skb is allocated. Typically we put 64KB of payload
per skb (unless MSG_EOR is requested) and checking socket backlog
every 64KB gives good results.

As a matter of fact, tests with TSO/GSO disabled give very nice
results, as we manage to keep a small write queue and smaller
perceived rtt.

Note that sk_flush_backlog() maintains socket ownership,
so is not equivalent to a {release_sock(sk); lock_sock(sk);},
to ensure implicit atomicity rules that sendmsg() was
giving to (possibly buggy) applications.

In this simple implementation, I chose to not call tcp_release_cb(),
but we might consider this later.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Alexei Starovoitov &lt;ast@fb.com&gt;
Cc: Marcelo Ricardo Leitner &lt;marcelo.leitner@gmail.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Large sendmsg()/write() hold socket lock for the duration of the call,
unless sk-&gt;sk_sndbuf limit is hit. This is bad because incoming packets
are parked into socket backlog for a long time.
Critical decisions like fast retransmit might be delayed.
Receivers have to maintain a big out of order queue with additional cpu
overhead, and also possible stalls in TX once windows are full.

Bidirectional flows are particularly hurt since the backlog can become
quite big if the copy from user space triggers IO (page faults)

Some applications learnt to use sendmsg() (or sendmmsg()) with small
chunks to avoid this issue.

Kernel should know better, right ?

Add a generic sk_flush_backlog() helper and use it right
before a new skb is allocated. Typically we put 64KB of payload
per skb (unless MSG_EOR is requested) and checking socket backlog
every 64KB gives good results.

As a matter of fact, tests with TSO/GSO disabled give very nice
results, as we manage to keep a small write queue and smaller
perceived rtt.

Note that sk_flush_backlog() maintains socket ownership,
so is not equivalent to a {release_sock(sk); lock_sock(sk);},
to ensure implicit atomicity rules that sendmsg() was
giving to (possibly buggy) applications.

In this simple implementation, I chose to not call tcp_release_cb(),
but we might consider this later.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Alexei Starovoitov &lt;ast@fb.com&gt;
Cc: Marcelo Ricardo Leitner &lt;marcelo.leitner@gmail.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: do not block BH while processing socket backlog</title>
<updated>2016-05-02T21:02:26+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2016-04-29T21:16:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=5413d1babe8f10de13d72496c12b862eef8ba613'/>
<id>5413d1babe8f10de13d72496c12b862eef8ba613</id>
<content type='text'>
Socket backlog processing is a major latency source.

With current TCP socket sk_rcvbuf limits, I have sampled __release_sock()
holding cpu for more than 5 ms, and packets being dropped by the NIC
once ring buffer is filled.

All users are now ready to be called from process context,
we can unblock BH and let interrupts be serviced faster.

cond_resched_softirq() could be removed, as it has no more user.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Socket backlog processing is a major latency source.

With current TCP socket sk_rcvbuf limits, I have sampled __release_sock()
holding cpu for more than 5 ms, and packets being dropped by the NIC
once ring buffer is filled.

All users are now ready to be called from process context,
we can unblock BH and let interrupts be serviced faster.

cond_resched_softirq() could be removed, as it has no more user.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net</title>
<updated>2016-04-09T21:41:41+00:00</updated>
<author>
<name>David S. Miller</name>
<email>davem@davemloft.net</email>
</author>
<published>2016-04-09T21:41:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=ae95d7126104591348d37aaf78c8325967e02386'/>
<id>ae95d7126104591348d37aaf78c8325967e02386</id>
<content type='text'>
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
</pre>
</div>
</content>
</entry>
</feed>
