linux.git/Documentation/networking/ip-sysctl.rst, branch v6.16

tcp: increase tcp_rmem[2] to 32 MB

2025-05-15T18:30:09+00:00

Last change to tcp_rmem[2] happened in 2012, in commit b49960a05e32
("tcp: change tcp_adv_win_scale and tcp_rmem[2]")

TCP performance on WAN is mostly limited by tcp_rmem[2] for receivers.

After this series improvements, it is time to increase the default.

Signed-off-by: Eric Dumazet 
Link: https://patch.msgid.link/20250513193919.1089692-12-edumazet@google.com
Signed-off-by: Jakub Kicinski

tcp: increase tcp_limit_output_bytes default value to 4MB

2025-05-15T18:30:09+00:00

Last change happened in 2018 with commit c73e5807e4f6
("tcp: tsq: no longer use limit_output_bytes for paced flows")

Modern NIC speeds got a 4x increase since then.

Signed-off-by: Eric Dumazet 
Link: https://patch.msgid.link/20250513193919.1089692-10-edumazet@google.com
Signed-off-by: Jakub Kicinski

net: Retire DCCP socket.

2025-04-12T01:58:10+00:00

DCCP was orphaned in 2021 by commit 054c4610bd05 ("MAINTAINERS: dccp:
move Gerrit Renker to CREDITS"), which noted that the last maintainer
had been inactive for five years.

In recent years, it has become a playground for syzbot, and most changes
to DCCP have been odd bug fixes triggered by syzbot. Apart from that,
the only changes have been driven by treewide or networking API updates
or adjustments related to TCP.

Thus, in 2023, we announced we would remove DCCP in 2025 via commit
b144fcaf46d4 ("dccp: Print deprecation notice.").

Since then, only one individual has contacted the netdev mailing list. [0]

There is ongoing research for Multipath DCCP. The repository is hosted
on GitHub [1], and development is not taking place through the upstream
community. While the repository is published under the GPLv2 license,
the scheduling part remains proprietary, with a LICENSE file [2] stating:

"This is not Open Source software."

The researcher mentioned a plan to address the licensing issue, upstream
the patches, and step up as a maintainer, but there has been no further
communication since then.

Maintaining DCCP for a decade without any real users has become a burden.

Therefore, it's time to remove it.

Removing DCCP will also provide significant benefits to TCP. It allows
us to freely reorganize the layout of struct inet_connection_sock, which
is currently shared with DCCP, and optimize it to reduce the number of
cachelines accessed in the TCP fast path.

Note that we keep DCCP netfilter modules as requested. [3]

Link: https://lore.kernel.org/netdev/20230710182253.81446-1-kuniyu@amazon.com/T/#u #[0]
Link: https://github.com/telekom/mp-dccp #[1]
Link: https://github.com/telekom/mp-dccp/blob/mpdccp_v03_k5.10/net/dccp/non_gpl_scheduler/LICENSE #[2]
Link: https://lore.kernel.org/netdev/Z_VQ0KlCRkqYWXa-@calendula/ #[3]
Signed-off-by: Kuniyuki Iwashima
Acked-by: Paul Moore (LSM and SELinux)
Acked-by: Casey Schaufler
Link: https://patch.msgid.link/20250410023921.11307-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski

tcp: support TCP_RTO_MIN_US for set/getsockopt use

2025-03-25T11:27:19+00:00

Support adjusting/reading RTO MIN for socket level by using set/getsockopt().

This new option has the same effect as TCP_BPF_RTO_MIN, which means it
doesn't affect RTAX_RTO_MIN usage (by using ip route...). Considering that
bpf option was implemented before this patch, so we need to use a standalone
new option for pure tcp set/getsockopt() use.

When the socket is created, its icsk_rto_min is set to the default
value that is controlled by sysctl_tcp_rto_min_us. Then if application
calls setsockopt() with TCP_RTO_MIN_US flag to pass a valid value, then
icsk_rto_min will be overridden in jiffies unit.

This patch adds WRITE_ONCE/READ_ONCE to avoid data-race around
icsk_rto_min.

Signed-off-by: Jason Xing 
Reviewed-by: Eric Dumazet 
Link: https://patch.msgid.link/20250317120314.41404-2-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski

tcp: add tcp_rto_max_ms sysctl

2025-02-11T12:08:00+00:00

Previous patch added a TCP_RTO_MAX_MS socket option
to tune a TCP socket max RTO value.

Many setups prefer to change a per netns sysctl.

This patch adds /proc/sys/net/ipv4/tcp_rto_max_ms

Its initial value is 120000 (120 seconds).

Keep in mind that a decrease of tcp_rto_max_ms
means shorter overall timeouts, unless tcp_retries2
sysctl is increased.

Signed-off-by: Eric Dumazet 
Reviewed-by: Jason Xing 
Reviewed-by: Neal Cardwell 
Reviewed-by: Kuniyuki Iwashima 
Signed-off-by: Paolo Abeni

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

2024-12-12T22:19:05+00:00

Cross-merge networking fixes after downstream PR (net-6.13-rc3).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski

tcp: Add sysctl to configure TIME-WAIT reuse delay

2024-12-12T04:17:33+00:00

Today we have a hardcoded delay of 1 sec before a TIME-WAIT socket can be
reused by reopening a connection. This is a safe choice based on an
assumption that the other TCP timestamp clock frequency, which is unknown
to us, may be as low as 1 Hz (RFC 7323, section 5.4).

However, this means that in the presence of short lived connections with an
RTT of couple of milliseconds, the time during which a 4-tuple is blocked
from reuse can be orders of magnitude longer that the connection lifetime.
Combined with a reduced pool of ephemeral ports, when using
IP_LOCAL_PORT_RANGE to share an egress IP address between hosts [1], the
long TIME-WAIT reuse delay can lead to port exhaustion, where all available
4-tuples are tied up in TIME-WAIT state.

Turn the reuse delay into a per-netns setting so that sysadmins can make
more aggressive assumptions about remote TCP timestamp clock frequency and
shorten the delay in order to allow connections to reincarnate faster.

Note that applications can completely bypass the TIME-WAIT delay protection
already today by locking the local port with bind() before connecting. Such
immediate connection reuse may result in PAWS failing to detect old
duplicate segments, leaving us with just the sequence number check as a
safety net.

This new configurable offers a trade off where the sysadmin can balance
between the risk of PAWS detection failing to act versus exhausting ports
by having sockets tied up in TIME-WAIT state for too long.

[1] https://lpc.events/event/16/contributions/1349/

Signed-off-by: Jakub Sitnicki 
Reviewed-by: Eric Dumazet 
Reviewed-by: Jason Xing 
Link: https://patch.msgid.link/20241209-jakub-krn-909-poc-msec-tw-tstamp-v2-2-66aca0eed03e@cloudflare.com
Signed-off-by: Jakub Kicinski

Documentation: networking: Add a caveat to nexthop_compat_mode sysctl

2024-12-11T02:26:24+00:00

net.ipv4.nexthop_compat_mode was added when nexthop objects were added to
provide the view of nexthop objects through the usual lens of the route
UAPI. As nexthop objects evolved, the information provided through this
lens became incomplete. For example, details of resilient nexthop groups
are obviously omitted.

Now that 16-bit nexthop group weights are a thing, the 8-bit UAPI cannot
convey the >8-bit weight accurately. Instead of inventing workarounds for
an obsolete interface, just document the expectations of inaccuracy.

Fixes: b72a6a7ab957 ("net: nexthop: Increase weight to u16")
Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Reviewed-by: David Ahern 
Link: https://patch.msgid.link/b575e32399ccacd09079b2a218255164535123bd.1733740749.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski

Add support for PIO p flag

2024-07-31T12:49:48+00:00

draft-ietf-6man-pio-pflag is adding a new flag to the Prefix Information
Option to signal that the network can allocate a unique IPv6 prefix per
client via DHCPv6-PD (see draft-ietf-v6ops-dhcp-pd-per-device).

When ra_honor_pio_pflag is enabled, the presence of a P-flag causes
SLAAC autoconfiguration to be disabled for that particular PIO.

An automated test has been added in Android (r.android.com/3195335) to
go along with this change.

Cc: Maciej Żenczykowski 
Cc: Lorenzo Colitti 
Cc: David Lamparter 
Cc: Simon Horman 
Signed-off-by: Patrick Rohr 
Reviewed-by: Maciej Żenczykowski 
Signed-off-by: David S. Miller

net: ipv4: Add a sysctl to set multipath hash seed

2024-06-12T23:42:11+00:00

When calculating hashes for the purpose of multipath forwarding, both IPv4
and IPv6 code currently fall back on flow_hash_from_keys(). That uses a
randomly-generated seed. That's a fine choice by default, but unfortunately
some deployments may need a tighter control over the seed used.

In this patch, make the seed configurable by adding a new sysctl key,
net.ipv4.fib_multipath_hash_seed to control the seed. This seed is used
specifically for multipath forwarding and not for the other concerns that
flow_hash_from_keys() is used for, such as queue selection. Expose the knob
as sysctl because other such settings, such as headers to hash, are also
handled that way. Like those, the multipath hash seed is a per-netns
variable.

Despite being placed in the net.ipv4 namespace, the multipath seed sysctl
is used for both IPv4 and IPv6, similarly to e.g. a number of TCP
variables.

The seed used by flow_hash_from_keys() is a 128-bit quantity. However it
seems that usually the seed is a much more modest value. 32 bits seem
typical (Cisco, Cumulus), some systems go even lower. For that reason, and
to decouple the user interface from implementation details, go with a
32-bit quantity, which is then quadruplicated to form the siphash key.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Reviewed-by: Nikolay Aleksandrov 
Reviewed-by: David Ahern 
Link: https://lore.kernel.org/r/20240607151357.421181-3-petrm@nvidia.com
Signed-off-by: Jakub Kicinski