linux.git/net/ipv4/tcp_timer.c, branch v6.5

net: fix the RTO timer retransmitting skb every 1ms if linear option is enabled

2023-08-15T19:24:04+00:00

In the real workload, I encountered an issue which could cause the RTO
timer to retransmit the skb per 1ms with linear option enabled. The amount
of lost-retransmitted skbs can go up to 1000+ instantly.

The root cause is that if the icsk_rto happens to be zero in the 6th round
(which is the TCP_THIN_LINEAR_RETRIES value), then it will always be zero
due to the changed calculation method in tcp_retransmit_timer() as follows:

icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);

Above line could be converted to
icsk->icsk_rto = min(0 << 1, TCP_RTO_MAX) = 0

Therefore, the timer expires so quickly without any doubt.

I read through the RFC 6298 and found that the RTO value can be rounded
up to a certain value, in Linux, say TCP_RTO_MIN as default, which is
regarded as the lower bound in this patch as suggested by Eric.

Fixes: 36e31b0af587 ("net: TCP thin linear timeouts")
Suggested-by: Eric Dumazet 
Signed-off-by: Jason Xing 
Reviewed-by: Eric Dumazet 
Signed-off-by: David S. Miller

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

2023-06-01T22:38:26+00:00

Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

drivers/net/ethernet/sfc/tc.c
  622ab656344a ("sfc: fix error unwinds in TC offload")
  b6583d5e9e94 ("sfc: support TC decap rules matching on enc_src_port")

net/mptcp/protocol.c
  5b825727d087 ("mptcp: add annotations around msk->subflow accesses")
  e76c8ef5cc5b ("mptcp: refactor mptcp_stream_accept()")

Signed-off-by: Jakub Kicinski

tcp: fix mishandling when the sack compression is deferred.

2023-06-01T11:15:12+00:00

In this patch, we mainly try to handle sending a compressed ack
correctly if it's deferred.

Here are more details in the old logic:
When sack compression is triggered in the tcp_compressed_ack_kick(),
if the sock is owned by user, it will set TCP_DELACK_TIMER_DEFERRED
and then defer to the release cb phrase. Later once user releases
the sock, tcp_delack_timer_handler() should send a ack as expected,
which, however, cannot happen due to lack of ICSK_ACK_TIMER flag.
Therefore, the receiver would not sent an ack until the sender's
retransmission timeout. It definitely increases unnecessary latency.

Fixes: 5d9f4262b7ea ("tcp: add SACK compression")
Suggested-by: Eric Dumazet 
Signed-off-by: fuyuanli 
Signed-off-by: Jason Xing 
Link: https://lore.kernel.org/netdev/20230529113804.GA20300@didi-ThinkCentre-M920t-N000/
Reviewed-by: Eric Dumazet 
Link: https://lore.kernel.org/r/20230531080150.GA20424@didi-ThinkCentre-M920t-N000
Signed-off-by: Paolo Abeni

tcp: make the first N SYN RTO backoffs linear

2023-05-11T08:31:16+00:00

Currently the SYN RTO schedule follows an exponential backoff
scheme, which can be unnecessarily conservative in cases where
there are link failures. In such cases, it's better to
aggressively try to retransmit packets, so it takes routers
less time to find a repath with a working link.

We chose a default value for this sysctl of 4, to follow
the macOS and IOS backoff scheme of 1,1,1,1,1,2,4,8, ...
MacOS and IOS have used this backoff schedule for over
a decade, since before this 2009 IETF presentation
discussed the behavior:
https://www.ietf.org/proceedings/75/slides/tcpm-1.pdf

This commit makes the SYN RTO schedule start with a number of
linear backoffs given by the following sysctl:
* tcp_syn_linear_timeouts

This changes the SYN RTO scheme to be: init_rto_val for
tcp_syn_linear_timeouts, exp backoff starting at init_rto_val

For example if init_rto_val = 1 and tcp_syn_linear_timeouts = 2, our
backoff scheme would be: 1, 1, 1, 2, 4, 8, 16, ...

Signed-off-by: David Morley 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
Tested-by: David Morley 
Reviewed-by: Eric Dumazet 
Link: https://lore.kernel.org/r/20230509180558.2541885-1-morleyd.kernel@gmail.com
Signed-off-by: Paolo Abeni

tcp: annotate lockless access to sk->sk_err

2023-03-17T08:25:05+00:00

tcp_poll() reads sk->sk_err without socket lock held/owned.

We should used READ_ONCE() here, and update writers
to use WRITE_ONCE().

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller

tcp: annotate lockless accesses to sk->sk_err_soft

2023-03-17T08:25:05+00:00

This field can be read/written without lock synchronization.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller

tcp: Make SYN ACK RTO tunable by BPF programs with TFO

2022-08-17T09:19:22+00:00

Instead of the hardcoded TCP_TIMEOUT_INIT, this diff calls tcp_timeout_init
to initiate req->timeout like the non TFO SYN ACK case.

Tested using the following packetdrill script, on a host with a BPF
program that sets the initial connect timeout to 10ms.

`../../common/defaults.sh`

// Initialize connection
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_TCP, TCP_FASTOPEN, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 32792 
   +0 > S. 0:0(0) ack 1 
   +.01 > S. 0:0(0) ack 1 
   +.02 > S. 0:0(0) ack 1 
   +.04 > S. 0:0(0) ack 1 
   +.01 < . 1:1(0) ack 1 win 32792

   +0 accept(3, ..., ...) = 4

Signed-off-by: Jie Meng 
Acked-by: Neal Cardwell 
Acked-by: Yuchung Cheng 
Acked-by: Martin KaFai Lau 
Signed-off-by: David S. Miller

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

2022-07-21T20:03:39+00:00

No conflicts.

Signed-off-by: Jakub Kicinski

tcp: Fix a data-race around sysctl_tcp_thin_linear_timeouts.

2022-07-20T09:14:50+00:00

While reading sysctl_tcp_thin_linear_timeouts, it can be changed
concurrently.  Thus, we need to add READ_ONCE() to its reader.

Fixes: 36e31b0af587 ("net: TCP thin linear timeouts")
Signed-off-by: Kuniyuki Iwashima 
Signed-off-by: David S. Miller

tcp: Fix data-races around some timeout sysctl knobs.

2022-07-18T11:21:54+00:00

While reading these sysctl knobs, they can be changed concurrently.
Thus, we need to add READ_ONCE() to their readers.

  - tcp_retries1
  - tcp_retries2
  - tcp_orphan_retries
  - tcp_fin_timeout

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima 
Signed-off-by: David S. Miller