linux.git/Documentation/networking/ip-sysctl.rst, branch v6.10

net: ipv6/addrconf: clamp preferred_lft to the minimum required

2024-02-15T14:34:40+00:00

If the preferred lifetime was less than the minimum required lifetime,
ipv6_create_tempaddr would error out without creating any new address.
On my machine and network, this error happened immediately with the
preferred lifetime set to 5 seconds or less, after a few minutes with
the preferred lifetime set to 6 seconds, and not at all with the
preferred lifetime set to 7 seconds. During my investigation, I found a
Stack Exchange post from another person who seems to have had the same
problem: They stopped getting new addresses if they lowered the
preferred lifetime below 3 seconds, and they didn't really know why.

The preferred lifetime is a preference, not a hard requirement. The
kernel does not strictly forbid new connections on a deprecated address,
nor does it guarantee that the address will be disposed of the instant
its total valid lifetime expires. So rather than disable IPv6 privacy
extensions altogether if the minimum required lifetime swells above the
preferred lifetime, it is more in keeping with the user's intent to
increase the temporary address's lifetime to the minimum necessary for
the current network conditions.

With these fixes, setting the preferred lifetime to 5 or 6 seconds "just
works" because the extra fraction of a second is practically
unnoticeable. It's even possible to reduce the time before deprecation
to 1 or 2 seconds by setting /proc/sys/net/ipv6/conf/*/regen_min_advance
and /proc/sys/net/ipv6/conf/*/dad_transmits to 0. I realize that that is
a pretty niche use case, but I know at least one person who would gladly
sacrifice performance and convenience to be sure that they are getting
the maximum possible level of privacy.

Link: https://serverfault.com/a/1031168/310447
Signed-off-by: Alex Henrie 
Reviewed-by: David Ahern 
Signed-off-by: Paolo Abeni

net: ipv6/addrconf: introduce a regen_min_advance sysctl

2024-02-15T14:34:40+00:00

In RFC 8981, REGEN_ADVANCE cannot be less than 2 seconds, and the RFC
does not permit the creation of temporary addresses with lifetimes
shorter than that:

> When processing a Router Advertisement with a
> Prefix Information option carrying a prefix for the purposes of
> address autoconfiguration (i.e., the A bit is set), the host MUST
> perform the following steps:

> 5.  A temporary address is created only if this calculated preferred
>     lifetime is greater than REGEN_ADVANCE time units.

However, some users want to change their IPv6 address as frequently as
possible regardless of the RFC's arbitrary minimum lifetime. For the
benefit of those users, add a regen_min_advance sysctl parameter that
can be set to below or above 2 seconds.

Link: https://datatracker.ietf.org/doc/html/rfc8981
Signed-off-by: Alex Henrie 
Reviewed-by: David Ahern 
Signed-off-by: Paolo Abeni

net: ipv6/addrconf: ensure that regen_advance is at least 2 seconds

2024-02-15T14:34:40+00:00

RFC 8981 defines REGEN_ADVANCE as follows:

REGEN_ADVANCE = 2 + (TEMP_IDGEN_RETRIES * DupAddrDetectTransmits * RetransTimer / 1000)

Thus, allowing it to be less than 2 seconds is technically a protocol
violation.

Link: https://datatracker.ietf.org/doc/html/rfc8981#name-defined-protocol-parameters
Signed-off-by: Alex Henrie 
Reviewed-by: David Ahern 
Signed-off-by: Paolo Abeni

Revert "net: ipv6/addrconf: clamp preferred_lft to the minimum required"

2024-01-02T22:58:46+00:00

The commit had a bug and might not have been the right approach anyway.

Fixes: 629df6701c8a ("net: ipv6/addrconf: clamp preferred_lft to the minimum required")
Fixes: ec575f885e3e ("Documentation: networking: explain what happens if temp_prefered_lft is too small or too large")
Reported-by: Dan Moulding 
Closes: https://lore.kernel.org/netdev/20231221231115.12402-1-dan@danm.net/
Link: https://lore.kernel.org/netdev/CAMMLpeTdYhd=7hhPi2Y7pwdPCgnnW5JYh-bu3hSc7im39uxnEA@mail.gmail.com/
Signed-off-by: Alex Henrie 
Reviewed-by: David Ahern 
Link: https://lore.kernel.org/r/20231230043252.10530-1-alexhenrie24@gmail.com
Signed-off-by: Jakub Kicinski

Documentation: networking: explain what happens if temp_prefered_lft is too small or too large

2023-10-26T01:23:07+00:00

Signed-off-by: Alex Henrie 
Reviewed-by: David Ahern 
Link: https://lore.kernel.org/r/20231024212312.299370-5-alexhenrie24@gmail.com
Signed-off-by: Jakub Kicinski

Documentation: networking: explain what happens if temp_valid_lft is too small

2023-10-26T01:23:07+00:00

Signed-off-by: Alex Henrie 
Reviewed-by: David Ahern 
Link: https://lore.kernel.org/r/20231024212312.299370-4-alexhenrie24@gmail.com
Signed-off-by: Jakub Kicinski

tcp: Set pingpong threshold via sysctl

2023-10-16T21:55:32+00:00

TCP pingpong threshold is 1 by default. But some applications, like SQL DB
may prefer a higher pingpong threshold to activate delayed acks in quick
ack mode for better performance.

The pingpong threshold and related code were changed to 3 in the year
2019 in:
  commit 4a41f453bedf ("tcp: change pingpong threshold to 3")
And reverted to 1 in the year 2022 in:
  commit 4d8f24eeedc5 ("Revert "tcp: change pingpong threshold to 3"")

There is no single value that fits all applications.
Add net.ipv4.tcp_pingpong_thresh sysctl tunable, so it can be tuned for
optimal performance based on the application needs.

Signed-off-by: Haiyang Zhang 
Reviewed-by: Simon Horman 
Reviewed-by: Eric Dumazet 
Acked-by: Neal Cardwell 
Reviewed-by: Kuniyuki Iwashima 
Link: https://lore.kernel.org/r/1697056244-21888-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Jakub Kicinski

net: add sysctl to disable rfc4862 5.5.3e lifetime handling

2023-10-03T22:51:04+00:00

This change adds a sysctl to opt-out of RFC4862 section 5.5.3e's valid
lifetime derivation mechanism.

RFC4862 section 5.5.3e prescribes that the valid lifetime in a Router
Advertisement PIO shall be ignored if it less than 2 hours and to reset
the lifetime of the corresponding address to 2 hours. An in-progress
6man draft (see draft-ietf-6man-slaac-renum-07 section 4.2) is currently
looking to remove this mechanism. While this draft has not been moving
particularly quickly for other reasons, there is widespread consensus on
section 4.2 which updates RFC4862 section 5.5.3e.

Cc: Maciej Żenczykowski 
Cc: Lorenzo Colitti 
Cc: Jen Linkova 
Signed-off-by: Patrick Rohr 
Reviewed-by: Jiri Pirko 
Reviewed-by: David Ahern 
Link: https://lore.kernel.org/r/20230925214711.959704-1-prohr@google.com
Signed-off-by: Jakub Kicinski

tcp: defer regular ACK while processing socket backlog

2023-09-12T17:10:01+00:00

This idea came after a particular workload requested
the quickack attribute set on routes, and a performance
drop was noticed for large bulk transfers.

For high throughput flows, it is best to use one cpu
running the user thread issuing socket system calls,
and a separate cpu to process incoming packets from BH context.
(With TSO/GRO, bottleneck is usually the 'user' cpu)

Problem is the user thread can spend a lot of time while holding
the socket lock, forcing BH handler to queue most of incoming
packets in the socket backlog.

Whenever the user thread releases the socket lock, it must first
process all accumulated packets in the backlog, potentially
adding latency spikes. Due to flood mitigation, having too many
packets in the backlog increases chance of unexpected drops.

Backlog processing unfortunately shifts a fair amount of cpu cycles
from the BH cpu to the 'user' cpu, thus reducing max throughput.

This patch takes advantage of the backlog processing,
and the fact that ACK are mostly cumulative.

The idea is to detect we are in the backlog processing
and defer all eligible ACK into a single one,
sent from tcp_release_cb().

This saves cpu cycles on both sides, and network resources.

Performance of a single TCP flow on a 200Gbit NIC:

- Throughput is increased by 20% (100Gbit -> 120Gbit).
- Number of generated ACK per second shrinks from 240,000 to 40,000.
- Number of backlog drops per second shrinks from 230 to 0.

Benchmark context:
 - Regular netperf TCP_STREAM (no zerocopy)
 - Intel(R) Xeon(R) Platinum 8481C (Saphire Rapids)
 - MAX_SKB_FRAGS = 17 (~60KB per GRO packet)

This feature is guarded by a new sysctl, and enabled by default:
 /proc/sys/net/ipv4/tcp_backlog_ack_defer

Signed-off-by: Eric Dumazet 
Acked-by: Yuchung Cheng 
Acked-by: Neal Cardwell 
Acked-by: Soheil Hassas Yeganeh 
Acked-by: Dave Taht 
Signed-off-by: Paolo Abeni

net: change accept_ra_min_rtr_lft to affect all RA lifetimes

2023-07-28T20:30:51+00:00

accept_ra_min_rtr_lft only considered the lifetime of the default route
and discarded entire RAs accordingly.

This change renames accept_ra_min_rtr_lft to accept_ra_min_lft, and
applies the value to individual RA sections; in particular, router
lifetime, PIO preferred lifetime, and RIO lifetime. If any of those
lifetimes are lower than the configured value, the specific RA section
is ignored.

In order for the sysctl to be useful to Android, it should really apply
to all lifetimes in the RA, since that is what determines the minimum
frequency at which RAs must be processed by the kernel. Android uses
hardware offloads to drop RAs for a fraction of the minimum of all
lifetimes present in the RA (some networks have very frequent RAs (5s)
with high lifetimes (2h)). Despite this, we have encountered networks
that set the router lifetime to 30s which results in very frequent CPU
wakeups. Instead of disabling IPv6 (and dropping IPv6 ethertype in the
WiFi firmware) entirely on such networks, it seems better to ignore the
misconfigured routers while still processing RAs from other IPv6 routers
on the same network (i.e. to support IoT applications).

The previous implementation dropped the entire RA based on router
lifetime. This turned out to be hard to expand to the other lifetimes
present in the RA in a consistent manner; dropping the entire RA based
on RIO/PIO lifetimes would essentially require parsing the whole thing
twice.

Fixes: 1671bcfd76fd ("net: add sysctl accept_ra_min_rtr_lft")
Cc: Lorenzo Colitti 
Signed-off-by: Patrick Rohr 
Reviewed-by: Maciej Żenczykowski 
Reviewed-by: David Ahern 
Link: https://lore.kernel.org/r/20230726230701.919212-1-prohr@google.com
Signed-off-by: Jakub Kicinski