<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/net/ipv4/tcp_timer.c, branch v4.15</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>net: tcp: close sock if net namespace is exiting</title>
<updated>2018-01-25T15:56:45+00:00</updated>
<author>
<name>Dan Streetman</name>
<email>ddstreet@ieee.org</email>
</author>
<published>2018-01-18T21:14:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=4ee806d51176ba7b8ff1efd81f271d7252e03a1d'/>
<id>4ee806d51176ba7b8ff1efd81f271d7252e03a1d</id>
<content type='text'>
When a tcp socket is closed, if it detects that its net namespace is
exiting, close immediately and do not wait for FIN sequence.

For normal sockets, a reference is taken to their net namespace, so it will
never exit while the socket is open.  However, kernel sockets do not take a
reference to their net namespace, so it may begin exiting while the kernel
socket is still open.  In this case if the kernel socket is a tcp socket,
it will stay open trying to complete its close sequence.  The sock's dst(s)
hold a reference to their interface, which are all transferred to the
namespace's loopback interface when the real interfaces are taken down.
When the namespace tries to take down its loopback interface, it hangs
waiting for all references to the loopback interface to release, which
results in messages like:

unregister_netdevice: waiting for lo to become free. Usage count = 1

These messages continue until the socket finally times out and closes.
Since the net namespace cleanup holds the net_mutex while calling its
registered pernet callbacks, any new net namespace initialization is
blocked until the current net namespace finishes exiting.

After this change, the tcp socket notices the exiting net namespace, and
closes immediately, releasing its dst(s) and their reference to the
loopback interface, which lets the net namespace continue exiting.

Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811
Signed-off-by: Dan Streetman &lt;ddstreet@canonical.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When a tcp socket is closed, if it detects that its net namespace is
exiting, close immediately and do not wait for FIN sequence.

For normal sockets, a reference is taken to their net namespace, so it will
never exit while the socket is open.  However, kernel sockets do not take a
reference to their net namespace, so it may begin exiting while the kernel
socket is still open.  In this case if the kernel socket is a tcp socket,
it will stay open trying to complete its close sequence.  The sock's dst(s)
hold a reference to their interface, which are all transferred to the
namespace's loopback interface when the real interfaces are taken down.
When the namespace tries to take down its loopback interface, it hangs
waiting for all references to the loopback interface to release, which
results in messages like:

unregister_netdevice: waiting for lo to become free. Usage count = 1

These messages continue until the socket finally times out and closes.
Since the net namespace cleanup holds the net_mutex while calling its
registered pernet callbacks, any new net namespace initialization is
blocked until the current net namespace finishes exiting.

After this change, the tcp socket notices the exiting net namespace, and
closes immediately, releasing its dst(s) and their reference to the
loopback interface, which lets the net namespace continue exiting.

Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811
Signed-off-by: Dan Streetman &lt;ddstreet@canonical.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: refresh tcp_mstamp from timers callbacks</title>
<updated>2017-12-13T21:04:04+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2017-12-13T02:22:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=4688eb7cf3ae2c2721d1dacff5c1384cba47d176'/>
<id>4688eb7cf3ae2c2721d1dacff5c1384cba47d176</id>
<content type='text'>
Only the retransmit timer currently refreshes tcp_mstamp

We should do the same for delayed acks and keepalives.

Even if RFC 7323 does not request it, this is consistent to what linux
did in the past, when TS values were based on jiffies.

Fixes: 385e20706fac ("tcp: use tp-&gt;tcp_mstamp in output path")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Cc: Mike Maloney &lt;maloney@google.com&gt;
Cc: Neal Cardwell &lt;ncardwell@google.com&gt;
Acked-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Acked-by:  Mike Maloney &lt;maloney@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Only the retransmit timer currently refreshes tcp_mstamp

We should do the same for delayed acks and keepalives.

Even if RFC 7323 does not request it, this is consistent to what linux
did in the past, when TS values were based on jiffies.

Fixes: 385e20706fac ("tcp: use tp-&gt;tcp_mstamp in output path")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Cc: Mike Maloney &lt;maloney@google.com&gt;
Cc: Neal Cardwell &lt;ncardwell@google.com&gt;
Acked-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Acked-by:  Mike Maloney &lt;maloney@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: tcp_mtu_probing() cleanup</title>
<updated>2017-11-05T13:14:23+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2017-11-03T13:09:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=d0f36847016276920d860d5c089934ff3fea7e30'/>
<id>d0f36847016276920d860d5c089934ff3fea7e30</id>
<content type='text'>
Reduce one indentation level to make code more readable.
tcp_sync_mss() can be factorized.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Reduce one indentation level to make code more readable.
tcp_sync_mss() can be factorized.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: Namespace-ify sysctl_tcp_thin_linear_timeouts</title>
<updated>2017-10-27T07:35:42+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2017-10-27T04:54:58+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=2c04ac8ae0b61e0780a30b7069a11bb202b21f26'/>
<id>2c04ac8ae0b61e0780a30b7069a11bb202b21f26</id>
<content type='text'>
Note that sysctl_tcp_thin_dupack was not used, I deleted it.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Note that sysctl_tcp_thin_dupack was not used, I deleted it.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>inet/connection_sock: Convert timers to use timer_setup()</title>
<updated>2017-10-18T11:39:55+00:00</updated>
<author>
<name>Kees Cook</name>
<email>keescook@chromium.org</email>
</author>
<published>2017-10-17T00:29:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=59f379f9046a9e0532ffd19b44e3c32fe79ec51b'/>
<id>59f379f9046a9e0532ffd19b44e3c32fe79ec51b</id>
<content type='text'>
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: "David S. Miller" &lt;davem@davemloft.net&gt;
Cc: Gerrit Renker &lt;gerrit@erg.abdn.ac.uk&gt;
Cc: Alexey Kuznetsov &lt;kuznet@ms2.inr.ac.ru&gt;
Cc: Hideaki YOSHIFUJI &lt;yoshfuji@linux-ipv6.org&gt;
Cc: netdev@vger.kernel.org
Cc: dccp@vger.kernel.org
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: "David S. Miller" &lt;davem@davemloft.net&gt;
Cc: Gerrit Renker &lt;gerrit@erg.abdn.ac.uk&gt;
Cc: Alexey Kuznetsov &lt;kuznet@ms2.inr.ac.ru&gt;
Cc: Hideaki YOSHIFUJI &lt;yoshfuji@linux-ipv6.org&gt;
Cc: netdev@vger.kernel.org
Cc: dccp@vger.kernel.org
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: implement rb-tree based retransmit queue</title>
<updated>2017-10-06T23:28:54+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2017-10-06T05:21:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=75c119afe14f74b4dd967d75ed9f57ab6c0ef045'/>
<id>75c119afe14f74b4dd967d75ed9f57ab6c0ef045</id>
<content type='text'>
Using a linear list to store all skbs in write queue has been okay
for quite a while : O(N) is not too bad when N &lt; 500.

Things get messy when N is the order of 100,000 : Modern TCP stacks
want 10Gbit+ of throughput even with 200 ms RTT flows.

40 ns per cache line miss means a full scan can use 4 ms,
blowing away CPU caches.

SACK processing often can use various hints to avoid parsing
whole retransmit queue. But with high packet losses and/or high
reordering, hints no longer work.

Sender has to process thousands of unfriendly SACK, accumulating
a huge socket backlog, burning a cpu and massively dropping packets.

Using an rb-tree for retransmit queue has been avoided for years
because it added complexity and overhead, but now is the time
to be more resistant and say no to quadratic behavior.

1) RTX queue is no longer part of the write queue : already sent skbs
are stored in one rb-tree.

2) Since reaching the head of write queue no longer needs
sk-&gt;sk_send_head, we added an union of sk_send_head and tcp_rtx_queue

Tested:

 On receiver :
 netem on ingress : delay 150ms 200us loss 1
 GRO disabled to force stress and SACK storms.

for f in `seq 1 10`
do
 ./netperf -H lpaa6 -l30 -- -K bbr -o THROUGHPUT|tail -1
done | awk '{print $0} {sum += $0} END {printf "%7u\n",sum}'

Before patch :

323.87
351.48
339.59
338.62
306.72
204.07
304.93
291.88
202.47
176.88
   2840

After patch:

1700.83
2207.98
2070.17
1544.26
2114.76
2124.89
1693.14
1080.91
2216.82
1299.94
  18053

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Using a linear list to store all skbs in write queue has been okay
for quite a while : O(N) is not too bad when N &lt; 500.

Things get messy when N is the order of 100,000 : Modern TCP stacks
want 10Gbit+ of throughput even with 200 ms RTT flows.

40 ns per cache line miss means a full scan can use 4 ms,
blowing away CPU caches.

SACK processing often can use various hints to avoid parsing
whole retransmit queue. But with high packet losses and/or high
reordering, hints no longer work.

Sender has to process thousands of unfriendly SACK, accumulating
a huge socket backlog, burning a cpu and massively dropping packets.

Using an rb-tree for retransmit queue has been avoided for years
because it added complexity and overhead, but now is the time
to be more resistant and say no to quadratic behavior.

1) RTX queue is no longer part of the write queue : already sent skbs
are stored in one rb-tree.

2) Since reaching the head of write queue no longer needs
sk-&gt;sk_send_head, we added an union of sk_send_head and tcp_rtx_queue

Tested:

 On receiver :
 netem on ingress : delay 150ms 200us loss 1
 GRO disabled to force stress and SACK storms.

for f in `seq 1 10`
do
 ./netperf -H lpaa6 -l30 -- -K bbr -o THROUGHPUT|tail -1
done | awk '{print $0} {sum += $0} END {printf "%7u\n",sum}'

Before patch :

323.87
351.48
339.59
338.62
306.72
204.07
304.93
291.88
202.47
176.88
   2840

After patch:

1700.83
2207.98
2070.17
1544.26
2114.76
2124.89
1693.14
1080.91
2216.82
1299.94
  18053

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net</title>
<updated>2017-08-09T23:28:45+00:00</updated>
<author>
<name>David S. Miller</name>
<email>davem@davemloft.net</email>
</author>
<published>2017-08-09T23:28:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=3118e6e19da7b8d76b2456b880c74a9aa3a2268b'/>
<id>3118e6e19da7b8d76b2456b880c74a9aa3a2268b</id>
<content type='text'>
The UDP offload conflict is dealt with by simply taking what is
in net-next where we have removed all of the UFO handling code
entirely.

The TCP conflict was a case of local variables in a function
being removed from both net and net-next.

In netvsc we had an assignment right next to where a missing
set of u64 stats sync object inits were added.

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The UDP offload conflict is dealt with by simply taking what is
in net-next where we have removed all of the UFO handling code
entirely.

The TCP conflict was a case of local variables in a function
being removed from both net and net-next.

In netvsc we had an assignment right next to where a missing
set of u64 stats sync object inits were added.

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: fix keepalive code vs TCP_FASTOPEN_CONNECT</title>
<updated>2017-08-03T16:34:51+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2017-08-03T06:10:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=2dda640040876cd8ae646408b69eea40c24f9ae9'/>
<id>2dda640040876cd8ae646408b69eea40c24f9ae9</id>
<content type='text'>
syzkaller was able to trigger a divide by 0 in TCP stack [1]

Issue here is that keepalive timer needs to be updated to not attempt
to send a probe if the connection setup was deferred using
TCP_FASTOPEN_CONNECT socket option added in linux-4.11

[1]
 divide error: 0000 [#1] SMP
 CPU: 18 PID: 0 Comm: swapper/18 Not tainted
 task: ffff986f62f4b040 ti: ffff986f62fa2000 task.ti: ffff986f62fa2000
 RIP: 0010:[&lt;ffffffff8409cc0d&gt;]  [&lt;ffffffff8409cc0d&gt;] __tcp_select_window+0x8d/0x160
 Call Trace:
  &lt;IRQ&gt;
  [&lt;ffffffff8409d951&gt;] tcp_transmit_skb+0x11/0x20
  [&lt;ffffffff8409da21&gt;] tcp_xmit_probe_skb+0xc1/0xe0
  [&lt;ffffffff840a0ee8&gt;] tcp_write_wakeup+0x68/0x160
  [&lt;ffffffff840a151b&gt;] tcp_keepalive_timer+0x17b/0x230
  [&lt;ffffffff83b3f799&gt;] call_timer_fn+0x39/0xf0
  [&lt;ffffffff83b40797&gt;] run_timer_softirq+0x1d7/0x280
  [&lt;ffffffff83a04ddb&gt;] __do_softirq+0xcb/0x257
  [&lt;ffffffff83ae03ac&gt;] irq_exit+0x9c/0xb0
  [&lt;ffffffff83a04c1a&gt;] smp_apic_timer_interrupt+0x6a/0x80
  [&lt;ffffffff83a03eaf&gt;] apic_timer_interrupt+0x7f/0x90
  &lt;EOI&gt;
  [&lt;ffffffff83fed2ea&gt;] ? cpuidle_enter_state+0x13a/0x3b0
  [&lt;ffffffff83fed2cd&gt;] ? cpuidle_enter_state+0x11d/0x3b0

Tested:

Following packetdrill no longer crashes the kernel

`echo 0 &gt;/proc/sys/net/ipv4/tcp_timestamps`

// Cache warmup: send a Fast Open cookie request
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
   +0 setsockopt(3, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
   +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation is now in progress)
   +0 &gt; S 0:0(0) &lt;mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop&gt;
 +.01 &lt; S. 123:123(0) ack 1 win 14600 &lt;mss 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop&gt;
   +0 &gt; . 1:1(0) ack 1
   +0 close(3) = 0
   +0 &gt; F. 1:1(0) ack 1
   +0 &lt; F. 1:1(0) ack 2 win 92
   +0 &gt; .  2:2(0) ack 2

   +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
   +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
   +0 setsockopt(4, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
   +0 setsockopt(4, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
 +.01 connect(4, ..., ...) = 0
   +0 setsockopt(4, SOL_TCP, TCP_KEEPIDLE, [5], 4) = 0
   +10 close(4) = 0

`echo 1 &gt;/proc/sys/net/ipv4/tcp_timestamps`

Fixes: 19f6d3f3c842 ("net/tcp-fastopen: Add new API support")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Cc: Wei Wang &lt;weiwan@google.com&gt;
Cc: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
syzkaller was able to trigger a divide by 0 in TCP stack [1]

Issue here is that keepalive timer needs to be updated to not attempt
to send a probe if the connection setup was deferred using
TCP_FASTOPEN_CONNECT socket option added in linux-4.11

[1]
 divide error: 0000 [#1] SMP
 CPU: 18 PID: 0 Comm: swapper/18 Not tainted
 task: ffff986f62f4b040 ti: ffff986f62fa2000 task.ti: ffff986f62fa2000
 RIP: 0010:[&lt;ffffffff8409cc0d&gt;]  [&lt;ffffffff8409cc0d&gt;] __tcp_select_window+0x8d/0x160
 Call Trace:
  &lt;IRQ&gt;
  [&lt;ffffffff8409d951&gt;] tcp_transmit_skb+0x11/0x20
  [&lt;ffffffff8409da21&gt;] tcp_xmit_probe_skb+0xc1/0xe0
  [&lt;ffffffff840a0ee8&gt;] tcp_write_wakeup+0x68/0x160
  [&lt;ffffffff840a151b&gt;] tcp_keepalive_timer+0x17b/0x230
  [&lt;ffffffff83b3f799&gt;] call_timer_fn+0x39/0xf0
  [&lt;ffffffff83b40797&gt;] run_timer_softirq+0x1d7/0x280
  [&lt;ffffffff83a04ddb&gt;] __do_softirq+0xcb/0x257
  [&lt;ffffffff83ae03ac&gt;] irq_exit+0x9c/0xb0
  [&lt;ffffffff83a04c1a&gt;] smp_apic_timer_interrupt+0x6a/0x80
  [&lt;ffffffff83a03eaf&gt;] apic_timer_interrupt+0x7f/0x90
  &lt;EOI&gt;
  [&lt;ffffffff83fed2ea&gt;] ? cpuidle_enter_state+0x13a/0x3b0
  [&lt;ffffffff83fed2cd&gt;] ? cpuidle_enter_state+0x11d/0x3b0

Tested:

Following packetdrill no longer crashes the kernel

`echo 0 &gt;/proc/sys/net/ipv4/tcp_timestamps`

// Cache warmup: send a Fast Open cookie request
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
   +0 setsockopt(3, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
   +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation is now in progress)
   +0 &gt; S 0:0(0) &lt;mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop&gt;
 +.01 &lt; S. 123:123(0) ack 1 win 14600 &lt;mss 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop&gt;
   +0 &gt; . 1:1(0) ack 1
   +0 close(3) = 0
   +0 &gt; F. 1:1(0) ack 1
   +0 &lt; F. 1:1(0) ack 2 win 92
   +0 &gt; .  2:2(0) ack 2

   +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
   +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
   +0 setsockopt(4, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
   +0 setsockopt(4, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
 +.01 connect(4, ..., ...) = 0
   +0 setsockopt(4, SOL_TCP, TCP_KEEPIDLE, [5], 4) = 0
   +10 close(4) = 0

`echo 1 &gt;/proc/sys/net/ipv4/tcp_timestamps`

Fixes: 19f6d3f3c842 ("net/tcp-fastopen: Add new API support")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Cc: Wei Wang &lt;weiwan@google.com&gt;
Cc: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: remove prequeue support</title>
<updated>2017-07-31T21:37:49+00:00</updated>
<author>
<name>Florian Westphal</name>
<email>fw@strlen.de</email>
</author>
<published>2017-07-30T01:57:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=e7942d0633c47c791ece6afa038be9cf977226de'/>
<id>e7942d0633c47c791ece6afa038be9cf977226de</id>
<content type='text'>
prequeue is a tcp receive optimization that moves part of rx processing
from bh to process context.

This only works if the socket being processed belongs to a process that
is blocked in recv on that socket.

In practice, this doesn't happen anymore that often because nowadays
servers tend to use an event driven (epoll) model.

Even normal client applications (web browsers) commonly use many tcp
connections in parallel.

This has measureable impact only in netperf (which uses plain recv and
thus allows prequeue use) from host to locally running vm (~4%), however,
there were no changes when using netperf between two physical hosts with
ixgbe interfaces.

Signed-off-by: Florian Westphal &lt;fw@strlen.de&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
prequeue is a tcp receive optimization that moves part of rx processing
from bh to process context.

This only works if the socket being processed belongs to a process that
is blocked in recv on that socket.

In practice, this doesn't happen anymore that often because nowadays
servers tend to use an event driven (epoll) model.

Even normal client applications (web browsers) commonly use many tcp
connections in parallel.

This has measureable impact only in netperf (which uses plain recv and
thus allows prequeue use) from host to locally running vm (~4%), however,
there were no changes when using netperf between two physical hosts with
ixgbe interfaces.

Signed-off-by: Florian Westphal &lt;fw@strlen.de&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: fix TCP_SYNCNT flakes</title>
<updated>2017-05-24T20:29:57+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2017-05-23T19:38:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=ce682ef6e3e019f98cafbdc7058668e0ea8f4a13'/>
<id>ce682ef6e3e019f98cafbdc7058668e0ea8f4a13</id>
<content type='text'>
After the mentioned commit, some of our packetdrill tests became flaky.

TCP_SYNCNT socket option can limit the number of SYN retransmits.

retransmits_timed_out() has to compare times computations based on
local_clock() while timers are based on jiffies. With NTP adjustments
and roundings we can observe 999 ms delay for 1000 ms timers.
We end up sending one extra SYN packet.

Gimmick added in commit 6fa12c850314 ("Revert Backoff [v3]: Calculate
TCP's connection close threshold as a time value") makes no
real sense for TCP_SYN_SENT sockets where no RTO backoff can happen at
all.

Lets use a simpler logic for TCP_SYN_SENT sockets and remove @syn_set
parameter from retransmits_timed_out()

Fixes: 9a568de4818d ("tcp: switch TCP TS option (RFC 7323) to 1ms clock")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
After the mentioned commit, some of our packetdrill tests became flaky.

TCP_SYNCNT socket option can limit the number of SYN retransmits.

retransmits_timed_out() has to compare times computations based on
local_clock() while timers are based on jiffies. With NTP adjustments
and roundings we can observe 999 ms delay for 1000 ms timers.
We end up sending one extra SYN packet.

Gimmick added in commit 6fa12c850314 ("Revert Backoff [v3]: Calculate
TCP's connection close threshold as a time value") makes no
real sense for TCP_SYN_SENT sockets where no RTO backoff can happen at
all.

Lets use a simpler logic for TCP_SYN_SENT sockets and remove @syn_set
parameter from retransmits_timed_out()

Fixes: 9a568de4818d ("tcp: switch TCP TS option (RFC 7323) to 1ms clock")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
</feed>
