path: root/include/net/sock.h
2025-12-05  Merge tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  (Linus Torvalds)
Pull MM updates from Andrew Morton:

 - "__vmalloc()/kvmalloc() and no-block support" (Uladzislau Rezki)
   Rework the vmalloc() code to support non-blocking allocations (GFP_ATOMIC, GFP_NOWAIT)

 - "ksm: fix exec/fork inheritance" (xu xin)
   Fix a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not inherited across fork/exec

 - "mm/zswap: misc cleanup of code and documentations" (SeongJae Park)
   Some light maintenance work on the zswap code

 - "mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" (Mauricio Faria de Oliveira)
   Enhance the /sys/kernel/debug/page_owner debug feature by adding unique identifiers to differentiate the various stack traces so that userspace monitoring tools can better match stack traces over time

 - "mm/page_alloc: pcp->batch cleanups" (Joshua Hahn)
   Minor alterations to the page allocator's per-cpu-pages feature

 - "Improve UFFDIO_MOVE scalability by removing anon_vma lock" (Lokesh Gidra)
   Address a scalability issue in userfaultfd's UFFDIO_MOVE operation

 - "kasan: cleanups for kasan_enabled() checks" (Sabyrzhan Tasbolatov)

 - "drivers/base/node: fold node register and unregister functions" (Donet Tom)
   Clean up the NUMA node handling code a little

 - "mm: some optimizations for prot numa" (Kefeng Wang)
   Cleanups and small optimizations to the NUMA allocation hinting code

 - "mm/page_alloc: Batch callers of free_pcppages_bulk" (Joshua Hahn)
   Address long lock hold times at boot on large machines. These were causing (harmless) softlockup warnings

 - "optimize the logic for handling dirty file folios during reclaim" (Baolin Wang)
   Remove some now-unnecessary work from page reclaim

 - "mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" (SeongJae Park)
   Enhance the DAMOS auto-tuning feature

 - "mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" (Quanmin Yan)
   Fix DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace configuration

 - "expand mmap_prepare functionality, port more users" (Lorenzo Stoakes)
   Enhance the new(ish) file_operations.mmap_prepare() method and port additional callsites from the old ->mmap() over to ->mmap_prepare()

 - "Fix stale IOTLB entries for kernel address space" (Lu Baolu)
   Fix a bug (and possible security issue on non-x86) in the IOMMU code. In some situations the IOMMU could be left hanging onto a stale kernel pagetable entry

 - "mm/huge_memory: cleanup __split_unmapped_folio()" (Wei Yang)
   Clean up and optimize the folio splitting code

 - "mm, swap: misc cleanup and bugfix" (Kairui Song)
   Some cleanups and a minor fix in the swap discard code

 - "mm/damon: misc documentation fixups" (SeongJae Park)

 - "mm/damon: support pin-point targets removal" (SeongJae Park)
   Permit userspace to remove a specific monitoring target in the middle of the current targets list

 - "mm: MISC follow-up patches for linux/pgalloc.h" (Harry Yoo)
   A couple of cleanups related to mm header file inclusion

 - "mm/swapfile.c: select swap devices of default priority round robin" (Baoquan He)
   Improve the selection of swap devices for NUMA machines

 - "mm: Convert memory block states (MEM_*) macros to enums" (Israel Batista)
   Change the memory block labels from macros to enums so they will appear in kernel debug info

 - "ksm: perform a range-walk to jump over holes in break_ksm" (Pedro Demarchi Gomes)
   Address an inefficiency when KSM unmerges an address range

 - "mm/damon/tests: fix memory bugs in kunit tests" (SeongJae Park)
   Fix leaks and unhandled malloc() failures in DAMON userspace unit tests

 - "some cleanups for pageout()" (Baolin Wang)
   Clean up a couple of minor things in the page scanner's writeback-for-eviction code

 - "mm/hugetlb: refactor sysfs/sysctl interfaces" (Hui Zhu)
   Move hugetlb's sysfs/sysctl handling code into a new file

 - "introduce VM_MAYBE_GUARD and make it sticky" (Lorenzo Stoakes)
   Make the VMA guard regions available in /proc/pid/smaps and improve the mergeability of guarded VMAs

 - "mm: perform guard region install/remove under VMA lock" (Lorenzo Stoakes)
   Reduce mmap lock contention for callers performing VMA guard region operations

 - "vma_start_write_killable" (Matthew Wilcox)
   Start work on permitting applications to be killed when they are waiting on a read_lock on the VMA lock

 - "mm/damon/tests: add more tests for online parameters commit" (SeongJae Park)
   Add additional userspace testing of DAMON's "commit" feature

 - "mm/damon: misc cleanups" (SeongJae Park)

 - "make VM_SOFTDIRTY a sticky VMA flag" (Lorenzo Stoakes)
   Address the possible loss of a VMA's VM_SOFTDIRTY flag when that VMA is merged with another

 - "mm: support device-private THP" (Balbir Singh)
   Introduce support for Transparent Huge Page (THP) migration in zone device-private memory

 - "Optimize folio split in memory failure" (Zi Yan)

 - "mm/huge_memory: Define split_type and consolidate split support checks" (Wei Yang)
   Some more cleanups in the folio splitting code

 - "mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" (Lorenzo Stoakes)
   Clean up our handling of pagetable leaf entries by introducing the concept of 'software leaf entries', of type softleaf_t

 - "reparent the THP split queue" (Muchun Song)
   Reparent the THP split queue to its parent memcg. This is in preparation for addressing the long-standing "dying memcg" problem, wherein dead memcgs linger for too long, consuming memory resources

 - "unify PMD scan results and remove redundant cleanup" (Wei Yang)
   A little cleanup in the hugepage collapse code

 - "zram: introduce writeback bio batching" (Sergey Senozhatsky)
   Improve zram writeback efficiency by introducing batched bio writeback support

 - "memcg: cleanup the memcg stats interfaces" (Shakeel Butt)
   Clean up our handling of the interrupt safety of some memcg stats

 - "make vmalloc gfp flags usage more apparent" (Vishal Moola)
   Clean up vmalloc's handling of incoming GFP flags

 - "mm: Add soft-dirty and uffd-wp support for RISC-V" (Chunyan Zhang)
   Teach soft dirty and userfaultfd write protect tracking to use RISC-V's Svrsw60t59b extension

 - "mm: swap: small fixes and comment cleanups" (Youngjun Park)
   Fix a small bug and clean up some of the swap code

 - "initial work on making VMA flags a bitmap" (Lorenzo Stoakes)
   Start work on converting the vma struct's flags to a bitmap, so we stop running out of them, especially on 32-bit

 - "mm/swapfile: fix and cleanup swap list iterations" (Youngjun Park)
   Address a possible bug in the swap discard code and clean things up a little

[ This merge also reverts commit ebb9aeb980e5 ("vfio/nvgrace-gpu: register device memory for poison handling") because it looks broken to me, I've asked for clarification - Linus ]

* tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
  mm: fix vma_start_write_killable() signal handling
  mm/swapfile: use plist_for_each_entry in __folio_throttle_swaprate
  mm/swapfile: fix list iteration when next node is removed during discard
  fs/proc/task_mmu.c: fix make_uffd_wp_huge_pte() huge pte handling
  mm/kfence: add reboot notifier to disable KFENCE on shutdown
  memcg: remove inc/dec_lruvec_kmem_state helpers
  selftests/mm/uffd: initialize char variable to Null
  mm: fix DEBUG_RODATA_TEST indentation in Kconfig
  mm: introduce VMA flags bitmap type
  tools/testing/vma: eliminate dependency on vma->__vm_flags
  mm: simplify and rename mm flags function for clarity
  mm: declare VMA flags by bit
  zram: fix a spelling mistake
  mm/page_alloc: optimize lowmem_reserve max lookup using its semantic monotonicity
  mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
  pagemap: update BUDDY flag documentation
  mm: swap: remove scan_swap_map_slots() references from comments
  mm: swap: change swap_alloc_slow() to void
  mm, swap: remove redundant comment for read_swap_cache_async
  mm, swap: use SWP_SOLIDSTATE to determine if swap is rotational
  ...
2025-11-25  tcp: remove icsk->icsk_retransmit_timer  (Eric Dumazet)
Now that sk->sk_timer is no longer used by TCP keepalive, we can use its storage for TCP and MPTCP retransmit timers for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25  net: move sk_dst_pending_confirm and sk_pacing_status to sock_read_tx group  (Eric Dumazet)
These two fields are mostly read in the TCP tx path, so move them to a more appropriate group for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24  net: factor-out _sk_charge() helper  (Paolo Abeni)
Move the code charging a newly accepted socket to the memcg out of __inet_accept(). MPTCP will soon use it on a per-subflow basis, in different contexts. No functional changes intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Geliang Tang <geliang@kernel.org> Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-1-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-16  memcg: net: track network throttling due to memcg memory pressure  (Shakeel Butt)
The kernel can throttle network sockets if the memory cgroup associated with the corresponding socket is under memory pressure. The throttling actions include clamping the transmit window, failing to expand receive or send buffers, aggressively pruning the out-of-order receive queue, deferring FIN to a retransmitted packet, and more. Let's add a memcg metric to track such throttling actions. At the moment memcg memory pressure is defined through vmpressure, and in the future it may be defined using PSI, or we may add a more flexible way for users to define memory pressure, maybe through ebpf. However, the potential throttling actions will remain the same, so this newly introduced metric will continue to track throttling actions irrespective of how memcg memory pressure is defined. Link: https://lkml.kernel.org/r/20251016161035.86161-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Daniel Sedlak <daniel.sedlak@cdn77.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-04  net: Convert proto callbacks from sockaddr to sockaddr_unsized  (Kees Cook)
Convert struct proto pre_connect(), connect(), bind(), and bind_add() callback function prototypes from struct sockaddr to struct sockaddr_unsized. This does not change per-implementation use of sockaddr for passing around an arbitrarily sized sockaddr struct. Those will be addressed in future patches. Additionally removes the no longer referenced struct sockaddr from include/net/inet_common.h. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-5-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04  net: Convert proto_ops connect() callbacks to use sockaddr_unsized  (Kees Cook)
Update all struct proto_ops connect() callback function prototypes from "struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the compiler about object sizes. Calls into struct proto handlers gain casts that will be removed in the struct proto conversion patch. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-3-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04  net: Convert proto_ops bind() callbacks to use sockaddr_unsized  (Kees Cook)
Update all struct proto_ops bind() callback function prototypes from "struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the compiler about object sizes. Calls into struct proto handlers gain casts that will be removed in the struct proto conversion patch. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-2-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-27  net: Add sk_clone().  (Kuniyuki Iwashima)
sctp_accept() will use sk_clone_lock(), but it will be called with the parent socket locked, and sctp_migrate() acquires the child lock later. Let's add a no-lock version of sk_clone_lock(). Note that lockdep complains if we simply use bh_lock_sock_nested(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20251023231751.4168390-5-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-21  net: avoid extra access to sk->sk_wmem_alloc in sock_wfree()  (Eric Dumazet)
The destructor for UDP TX packets is sock_wfree(). It suffers from cache line bouncing in sock_def_write_space_wfree(). Instead of reading sk->sk_wmem_alloc after we just did an atomic RMW on it, use __refcount_sub_and_test() to get the old value for free, and pass the new value to sock_def_write_space_wfree(). Add a __sock_writeable() helper. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251017133712.2842665-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
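The pattern used here, taking the pre-decrement value out of the same atomic RMW instead of issuing a second read, can be sketched in plain C11 atomics. This is an illustrative userspace sketch, not the kernel helpers themselves:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Subtract 'n' and report both "reached zero" and the old value in one
     * RMW, mirroring the __refcount_sub_and_test(..., &old) idea: no extra
     * load of the counter afterwards. */
    static bool sub_and_test_old(atomic_int *cnt, int n, int *old)
    {
        *old = atomic_fetch_sub(cnt, n);   /* value before the subtraction */
        return *old == n;                  /* true when the counter just hit 0 */
    }

    int main(void)
    {
        atomic_int wmem_alloc = 1500;      /* stands in for sk_wmem_alloc */
        int old;

        if (!sub_and_test_old(&wmem_alloc, 700, &old))
            printf("old=%d new=%d\n", old, old - 700);  /* both values, one RMW */
        return 0;
    }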
2025-10-17  Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  (Jakub Kicinski)
Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-10-16

We've added 6 non-merge commits during the last 1 day(s) which contain a total of 18 files changed, 577 insertions(+), 38 deletions(-).

The main changes are:

1) Bypass the global per-protocol memory accounting either by setting a netns sysctl or using bpf_setsockopt in a bpf program, from Kuniyuki Iwashima.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  selftests/bpf: Add test for sk->sk_bypass_prot_mem.
  bpf: Introduce SK_BPF_BYPASS_PROT_MEM.
  bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_CREATE.
  net: Introduce net.core.bypass_prot_mem sysctl.
  net: Allow opt-out from global protocol memory accounting.
  tcp: Save lock_sock() for memcg in inet_csk_accept().
====================

Link: https://patch.msgid.link/20251016204539.773707-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-17  inet: Avoid ehash lookup race in inet_ehash_insert()  (Xuanqiang Luo)
Since ehash lookups are lockless, if one CPU performs a lookup while another concurrently deletes and inserts (removing reqsk and inserting sk), the lookup may fail to find the socket and an RST may be sent. The call trace map is drawn as follows:

  CPU 0                          CPU 1
  -----                          -----
                                 inet_ehash_insert()
                                 spin_lock()
                                 sk_nulls_del_node_init_rcu(osk)
  __inet_lookup_established()
  (lookup failed)
                                 __sk_nulls_add_node_rcu(sk, list)
                                 spin_unlock()

As both deletion and insertion operate on the same ehash chain, this patch introduces a new sk_nulls_replace_node_init_rcu() helper function to implement atomic replacement. Fixes: 5e0724d027f0 ("tcp/dccp: fix hashdance race for passive sessions") Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251015020236.431822-3-xuanqiang.luo@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
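The key observation for lockless readers is that replacing a node must be a single pointer publication rather than an unlink followed by a relink. A simplified userspace sketch of the difference on a singly linked list (illustrative only, not the kernel's hlist_nulls code; freeing the old node would additionally need RCU-style deferral):

    #include <stdatomic.h>

    struct node {
        int key;
        struct node *_Atomic next;
    };

    /* Racy variant: between the first and last store, a concurrent lockless
     * reader walking the list sees neither 'old' nor 'new' and misses the
     * entry, which is exactly the window that triggers the spurious RST. */
    void delete_then_insert(struct node *prev, struct node *old, struct node *new)
    {
        atomic_store(&prev->next, atomic_load(&old->next));  /* unlink old */
        atomic_store(&new->next, atomic_load(&prev->next));
        atomic_store(&prev->next, new);                      /* relink new */
    }

    /* Replace variant: publish 'new' with a single pointer update, so a
     * reader always observes either the old or the new node, never a gap. */
    void replace_node(struct node *prev, struct node *old, struct node *new)
    {
        atomic_store(&new->next, atomic_load(&old->next));   /* inherit successor */
        atomic_store(&prev->next, new);                      /* single visible switch */
    }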
2025-10-16  net: Allow opt-out from global protocol memory accounting.  (Kuniyuki Iwashima)
Some protocols (e.g., TCP, UDP) implement memory accounting for socket buffers and charge memory to per-protocol global counters pointed to by sk->sk_prot->memory_allocated. Sometimes, system processes do not want that limitation. For a similar purpose, there is SO_RESERVE_MEM for sockets under memcg. Also, by opting out of the per-protocol accounting, sockets under memcg can avoid paying costs for two orthogonal memory accounting mechanisms. A microbenchmark result is in the subsequent bpf patch. Let's allow opt-out from the per-protocol memory accounting if sk->sk_bypass_prot_mem is true. sk->sk_bypass_prot_mem and sk->sk_prot are placed in the same cache line, and sk_has_account() always fetches sk->sk_prot before accessing sk->sk_bypass_prot_mem, so there is no extra cache miss for this patch. The following patches will set sk->sk_bypass_prot_mem to true, and then the per-protocol memory accounting will be skipped. Note that this does NOT disable memcg, but rather the per-protocol one. Another option, which would not use the hole in struct sock_common, is to create sk_prot variants like tcp_prot_bypass, but this would complicate SOCKMAP logic, tcp_bpf_prots etc. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Link: https://patch.msgid.link/20251014235604.3057003-3-kuniyu@google.com
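Conceptually the opt-out is an extra per-socket boolean consulted before touching the shared per-protocol counter. A hedged sketch of that control flow in userspace C (field and function names are illustrative approximations, not the exact kernel code; memcg charging is out of scope here):

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Global, per-protocol counter shared by every socket of the protocol. */
    static atomic_long memory_allocated;

    struct fake_sock {
        bool bypass_prot_mem;   /* set e.g. via a netns sysctl or bpf_setsockopt() */
    };

    /* Charge 'pages' to the protocol-wide counter unless this socket opted out. */
    static void charge_prot_mem(struct fake_sock *sk, long pages)
    {
        if (sk->bypass_prot_mem)
            return;                               /* skip the contended global counter */
        atomic_fetch_add(&memory_allocated, pages);
    }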
2025-10-15  net: remove obsolete WARN_ON(refcount_read(&sk->sk_refcnt) == 1)  (Eric Dumazet)
sk->sk_refcnt was converted to refcount_t in 2017. __sock_put(sk) being refcount_dec(&sk->sk_refcnt), it will complain loudly, in a non-racy way, if the current refcnt is 1 (or less). We can remove four WARN_ON() in favor of the generic refcount_dec() check. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn> Link: https://patch.msgid.link/20251014140605.2982703-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-15  net: allow busy connected flows to switch tx queues  (Eric Dumazet)
This is a followup of commit 726e9e8b94b9 ("tcp: refine skb->ooo_okay setting") and of a prior commit in this series ("net: control skb->ooo_okay from skb_set_owner_w()"). skb->ooo_okay might never be set for bulk flows that always have at least one skb in a qdisc queue or NIC queue, especially if TX completion is delayed because of a stressed cpu. These so-called "strange attractors" have caused many performance issues (see for instance 9b462d02d6dd ("tcp: TCP Small Queues and strange attractors")), we need to do better. We have tried very hard to avoid reorders because TCP was not dealing with them nicely a decade ago. Use the new net.core.txq_reselection_ms sysctl to let flows follow XPS and select a more efficient queue. After this patch, we no longer have to make sure threads are pinned to cpus; they can now be migrated without adding too much spinlock/qdisc/TX completion pressure. The TX completion part was problematic, because it added false sharing on various socket fields, but also added false sharing and spinlock contention in mm layers. Calling skb_orphan() from ndo_start_xmit() is not an option unfortunately.

Note for later:
1) move sk->sk_tx_queue_mapping closer to sk_tx_queue_mapping_jiffies for better cache locality.
2) Study if 9b462d02d6dd ("tcp: TCP Small Queues and strange attractors") could be revised.

Tested: Used a host with 32 TX queues, shared by groups of 8 cores.

XPS setup:
  echo ff >/sys/class/net/eth1/queue/tx-0/xps_cpus
  echo ff00 >/sys/class/net/eth1/queue/tx-1/xps_cpus
  echo ff0000 >/sys/class/net/eth1/queue/tx-2/xps_cpus
  echo ff000000 >/sys/class/net/eth1/queue/tx-3/xps_cpus
  echo ff,00000000 >/sys/class/net/eth1/queue/tx-4/xps_cpus
  echo ff00,00000000 >/sys/class/net/eth1/queue/tx-5/xps_cpus
  echo ff0000,00000000 >/sys/class/net/eth1/queue/tx-6/xps_cpus
  echo ff000000,00000000 >/sys/class/net/eth1/queue/tx-7/xps_cpus
  ...

Launched a tcp_stream with 15 threads and 1000 flows, initially affined to cores 0-15:
  taskset -c 0-15 tcp_stream -T15 -F1000 -l1000 -c -H target_host

Checked that only queues 0 and 1 are used as instructed by XPS:
  tc -s qdisc show dev eth1|grep backlog|grep -v "backlog 0b 0p"
   backlog 123489410b 1890p
   backlog 69809026b 1064p
   backlog 52401054b 805p

Then force each thread to run on cpu 1,9,17,25,33,41,49,57,65,73,81,89,97,105,113,121:
  C=1;PID=`pidof tcp_stream`;for P in `ls /proc/$PID/task`; do taskset -pc $C $P; C=$(($C + 8));done

Set txq_reselection_ms to 1000:
  echo 1000 > /proc/sys/net/core/txq_reselection_ms

Check that the flows have migrated nicely:
  tc -s qdisc show dev eth1|grep backlog|grep -v "backlog 0b 0p"
   backlog 130508314b 1916p
   backlog 8584380b 126p
   backlog 8584380b 126p
   backlog 8379990b 123p
   backlog 8584380b 126p
   backlog 8487484b 125p
   backlog 8584380b 126p
   backlog 8448120b 124p
   backlog 8584380b 126p
   backlog 8720640b 128p
   backlog 8856900b 130p
   backlog 8584380b 126p
   backlog 8652510b 127p
   backlog 8448120b 124p
   backlog 8516250b 125p
   backlog 7834950b 115p

Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251013152234.842065-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-15  net: add SK_WMEM_ALLOC_BIAS constant  (Eric Dumazet)
sk->sk_wmem_alloc is initialized to 1, and sk_wmem_alloc_get() takes care of this initial value. Add SK_WMEM_ALLOC_BIAS define to not spread this magic value. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251013152234.842065-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
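Because sk_wmem_alloc is biased by one initial reference, every reader has to subtract that value; naming the constant keeps the magic number in one place. A small illustrative userspace sketch (names approximate the kernel ones, they are not the kernel definitions):

    #include <stdatomic.h>
    #include <stdio.h>

    #define SK_WMEM_ALLOC_BIAS 1   /* the "+1" that keeps the counter alive */

    static atomic_int wmem_alloc = SK_WMEM_ALLOC_BIAS;  /* initialized to the bias */

    /* Bytes of queued transmit data, with the bias removed. */
    static int wmem_alloc_get(void)
    {
        return atomic_load(&wmem_alloc) - SK_WMEM_ALLOC_BIAS;
    }

    int main(void)
    {
        atomic_fetch_add(&wmem_alloc, 4096);              /* queue one 4 KiB skb */
        printf("queued: %d bytes\n", wmem_alloc_get());   /* prints 4096 */
        return 0;
    }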
2025-09-29  Revert "net: group sk_backlog and sk_receive_queue"  (Eric Dumazet)
This reverts commit 4effb335b5dab08cb6e2c38d038910f8b527cfc9. This was a benefit for the UDP flood case, which was later greatly improved with commits 6471658dc66c ("udp: use skb_attempt_defer_free()") and b650bf0977d3 ("udp: remove busylock and add per NUMA queues"). Apparently the blamed commit added a regression for RAW sockets, possibly because they do not use the dual RX queue strategy that UDP has. sock_queue_rcv_skb_reason() and RAW recvmsg() compete for sk_receive_queue and sk_rmem_alloc changes, and having them in the same cache line reduces performance. Fixes: 4effb335b5da ("net: group sk_backlog and sk_receive_queue") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202509281326.f605b4eb-lkp@intel.com Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: David Ahern <dsahern@kernel.org> Cc: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250929182112.824154-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-23  tcp: Update bind bucket state on port release  (Jakub Sitnicki)
Today, once an inet_bind_bucket enters a state where fastreuse >= 0 or fastreuseport >= 0 after a socket is explicitly bound to a port, it remains in that state until all sockets are removed and the bucket is destroyed. In this state, the bucket is skipped during ephemeral port selection in connect(). For applications using a reduced ephemeral port range (IP_LOCAL_PORT_RANGE socket option), this can cause faster port exhaustion since blocked buckets are excluded from reuse. The reason the bucket state isn't updated on port release is unclear. Possibly a performance trade-off to avoid scanning bucket owners, or just an oversight. Fix it by recalculating the bucket state when a socket releases a port. To limit overhead, each inet_bind2_bucket stores its own (fastreuse, fastreuseport) state. On port release, only the relevant port-addr bucket is scanned, and the overall state is derived from these. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250917-update-bind-bucket-state-on-unhash-v5-1-57168b661b47@cloudflare.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-22  net: move sk->sk_err_soft and sk->sk_sndbuf  (Eric Dumazet)
sk->sk_sndbuf is read-mostly in the tx path, so move it from the sock_write_tx group to the more appropriate sock_read_tx. sk->sk_err_soft was not identified previously, but is used from tcp_ack(). Move it to the sock_write_tx group for better cache locality. Also change tcp_ack() to clear sk->sk_err_soft only if needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250919204856.2977245-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-22  net: move sk_uid and sk_protocol to sock_read_tx  (Eric Dumazet)
sk_uid and sk_protocol are read from inet6_csk_route_socket() for each TCP transmit. Also read from udpv6_sendmsg(), udp_sendmsg() and others. Move them to sock_read_tx for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250919204856.2977245-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-18  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)
Cross-merge networking fixes after downstream PR (net-6.17-rc7). No conflicts.

Adjacent changes:

drivers/net/ethernet/mellanox/mlx5/core/en/fs.h
  9536fbe10c9d ("net/mlx5e: Add PSP steering in local NIC RX")
  7601a0a46216 ("net/mlx5e: Add a miss level for ipsec crypto offload")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-18  net: clear sk->sk_ino in sk_set_socket(sk, NULL)  (Eric Dumazet)
Andrei Vagin reported that blamed commit broke CRIU. Indeed, while we want to keep sk_uid unchanged when a socket is cloned, we want to clear sk->sk_ino. Otherwise, sock_diag might report multiple sockets sharing the same inode number. Move the clearing part from sock_orphan() to sk_set_socket(sk, NULL), called both from sock_orphan() and sk_clone_lock(). Fixes: 5d6b58c932ec ("net: lockless sock_i_ino()") Closes: https://lore.kernel.org/netdev/aMhX-VnXkYDpKd9V@google.com/ Closes: https://github.com/checkpoint-restore/criu/issues/2744 Reported-by: Andrei Vagin <avagin@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Andrei Vagin <avagin@google.com> Link: https://patch.msgid.link/20250917135337.1736101-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-18  Merge branch 'add-basic-psp-encryption-for-tcp-connections'  (Paolo Abeni)
Daniel Zahka says:

==================
add basic PSP encryption for TCP connections

This is v13 of the PSP RFC [1] posted by Jakub Kicinski one year ago. General developments since v1 include a fork of packetdrill [2] with support for PSP added, as well as some test cases, and an implementation of PSP key exchange and connection upgrade [3] integrated into the fbthrift RPC library. Both [2] and [3] have been tested on server platforms with PSP-capable CX7 NICs.

Below is the cover letter from the original RFC:

Add support for PSP encryption of TCP connections. PSP is a protocol out of Google: https://github.com/google/psp/blob/main/doc/PSP_Arch_Spec.pdf which shares some similarities with IPsec. I added some more info in the first patch so I'll keep it short here. The protocol can work in multiple modes including tunneling. But I'm mostly interested in using it as TLS replacement because of its superior offload characteristics.

So this patch does three things:

 - it adds "core" PSP code
   PSP is offload-centric, and requires some additional care and feeding, so first chunk of the code exposes device info. This part can be reused by PSP implementations in xfrm, tunneling etc.

 - TCP integration TLS style
   Reuse some of the existing concepts from TLS offload, such as attaching crypto state to a socket, marking skbs as "decrypted", egress validation. PSP does not prescribe key exchange protocols. To use PSP as a more efficient TLS offload we intend to perform a TLS handshake ("inline" in the same TCP connection) and negotiate switching to PSP based on capabilities of both endpoints. This is also why I'm not including a software implementation. Nobody would use it in production, software TLS is faster, it has larger crypto records.

 - mlx5 implementation
   That's mostly other people's work, not 100% sure those folks consider it ready hence the RFC in the title. But it works :)

Not posted, queued in a branch [4] are follow up pieces:
 - standard stats
 - netdevsim implementation and tests

[1] https://lore.kernel.org/netdev/20240510030435.120935-1-kuba@kernel.org/
[2] https://github.com/danieldzahka/packetdrill
[3] https://github.com/danieldzahka/fbthrift/tree/dzahka/psp
[4] https://github.com/kuba-moo/linux/tree/psp

Comments we intend to defer to future series:
 - we prefer to keep the version field in the tx-assoc netlink request, because it makes parsing keys require less state early on, but we are willing to change in the next version of this series.
 - using a static branch to wrap psp_enqueue_set_decrypted() and other functions called from tcp.
 - using INDIRECT_CALL for tls/psp in sk_validate_xmit_skb(). We prefer to address this in a dedicated patch series, so that this series does not need to modify the way tls_validate_xmit_skb() is declared and stubbed out.

v12: https://lore.kernel.org/netdev/20250916000559.1320151-1-kuba@kernel.org/
v11: https://lore.kernel.org/20250911014735.118695-1-daniel.zahka@gmail.com
v10: https://lore.kernel.org/netdev/20250828162953.2707727-1-daniel.zahka@gmail.com/
v9: https://lore.kernel.org/netdev/20250827155340.2738246-1-daniel.zahka@gmail.com/
v8: https://lore.kernel.org/netdev/20250825200112.1750547-1-daniel.zahka@gmail.com/
v7: https://lore.kernel.org/netdev/20250820113120.992829-1-daniel.zahka@gmail.com/
v6: https://lore.kernel.org/netdev/20250812003009.2455540-1-daniel.zahka@gmail.com/
v5: https://lore.kernel.org/netdev/20250723203454.519540-1-daniel.zahka@gmail.com/
v4: https://lore.kernel.org/netdev/20250716144551.3646755-1-daniel.zahka@gmail.com/
v3: https://lore.kernel.org/netdev/20250702171326.3265825-1-daniel.zahka@gmail.com/
v2: https://lore.kernel.org/netdev/20250625135210.2975231-1-daniel.zahka@gmail.com/
v1: https://lore.kernel.org/netdev/20240510030435.120935-1-kuba@kernel.org/
==================

Links: https://patch.msgid.link/20250917000954.859376-1-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>

---

* add-basic-psp-encryption-for-tcp-connections:
  net/mlx5e: Implement PSP key_rotate operation
  net/mlx5e: Add Rx data path offload
  psp: provide decapsulation and receive helper for drivers
  net/mlx5e: Configure PSP Rx flow steering rules
  net/mlx5e: Add PSP steering in local NIC RX
  net/mlx5e: Implement PSP Tx data path
  psp: provide encapsulation helper for drivers
  net/mlx5e: Implement PSP operations .assoc_add and .assoc_del
  net/mlx5e: Support PSP offload functionality
  psp: track generations of device key
  net: psp: update the TCP MSS to reflect PSP packet overhead
  net: psp: add socket security association code
  net: tcp: allow tcp_timewait_sock to validate skbs before handing to device
  net: move sk_validate_xmit_skb() to net/core/dev.c
  psp: add op for rotation of device key
  tcp: add datapath logic for PSP with inline key exchange
  net: modify core data structures for PSP datapath support
  psp: base PSP device support
  psp: add documentation
2025-09-18  net: move sk_validate_xmit_skb() to net/core/dev.c  (Daniel Zahka)
Move definition of sk_validate_xmit_skb() from net/core/sock.c to net/core/dev.c. This change is in preparation of the next patch, where sk_validate_xmit_skb() will need to cast sk to a tcp_timewait_sock *, and access member fields. Including linux/tcp.h from linux/sock.h creates a circular dependency, and dev.c is the only current call site of this function. Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250917000954.859376-7-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18  net: modify core data structures for PSP datapath support  (Jakub Kicinski)
Add pointers to psp data structures to core networking structs, and an SKB extension to carry the PSP information from the drivers to the socket layer. Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Co-developed-by: Daniel Zahka <daniel.zahka@gmail.com> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250917000954.859376-4-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18  udp: add udp_drops_inc() helper  (Eric Dumazet)
Generic sk_drops_inc() reads sk->sk_drop_counters. We know the precise location for UDP sockets. Move sk_drop_counters out of sock_read_rxtx so that sock_write_rxtx starts at a cache line boundary. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250916160951.541279-9-edumazet@google.com Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18  net: group sk_backlog and sk_receive_queue  (Eric Dumazet)
UDP receivers suffer from sk_rmem_alloc updates, currently sharing a cache line with fields that need to be read-mostly (sock_read_rx group):

1) RFS enabled hosts read sk_napi_id from __udpv6_queue_rcv_skb().
2) sk->sk_rcvbuf is read from __udp_enqueue_schedule_skb()

  /* --- cacheline 3 boundary (192 bytes) --- */
  struct {
          atomic_t        rmem_alloc;              /* 0xc0  0x4 */ // Oops
          int             len;                     /* 0xc4  0x4 */
          struct sk_buff *head;                    /* 0xc8  0x8 */
          struct sk_buff *tail;                    /* 0xd0  0x8 */
  } sk_backlog;                                    /* 0xc0  0x18 */
  __u8 __cacheline_group_end__sock_write_rx[0];    /* 0xd8  0 */
  __u8 __cacheline_group_begin__sock_read_rx[0];   /* 0xd8  0 */
  struct dst_entry *sk_rx_dst;                     /* 0xd8  0x8 */
  int               sk_rx_dst_ifindex;             /* 0xe0  0x4 */
  u32               sk_rx_dst_cookie;              /* 0xe4  0x4 */
  unsigned int      sk_ll_usec;                    /* 0xe8  0x4 */
  unsigned int      sk_napi_id;                    /* 0xec  0x4 */
  u16               sk_busy_poll_budget;           /* 0xf0  0x2 */
  u8                sk_prefer_busy_poll;           /* 0xf2  0x1 */
  u8                sk_userlocks;                  /* 0xf3  0x1 */
  int               sk_rcvbuf;                     /* 0xf4  0x4 */
  struct sk_filter *sk_filter;                     /* 0xf8  0x8 */

Move sk_error (which is less often dirtied) there. Alternative would be to cache align sock_read_rx but this has more implications/risks. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250916160951.541279-8-edumazet@google.com Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
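The underlying problem is classic false sharing: a frequently written counter sitting on the same cache line as read-mostly fields makes every writer invalidate the readers' line. A tiny userspace illustration of keeping the write-hot field on its own line (this is a generic sketch, not the kernel's __cacheline_group machinery):

    #include <stdalign.h>
    #include <stdatomic.h>

    #define CACHELINE 64

    struct rx_state {
        /* read-mostly on the receive fast path */
        alignas(CACHELINE) int rcvbuf;
        unsigned int napi_id;

        /* write-hot: every queued packet updates this; giving it its own
         * cache line stops it from invalidating the read-mostly line above */
        alignas(CACHELINE) atomic_int rmem_alloc;
    };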
2025-09-14  net: use NUMA drop counters for softnet_data.dropped  (Eric Dumazet)
Hosts under DOS attack can suffer from false sharing in enqueue_to_backlog() : atomic_inc(&sd->dropped). This is because sd->dropped can be touched from many cpus, possibly residing on different NUMA nodes. Generalize the sk_drop_counters infrastructure added in commit c51613fa276f ("net: add sk->sk_drop_counters") and use it to replace softnet_data.dropped with NUMA friendly softnet_data.drop_counters. This adds 64 bytes per cpu, maybe more in the future if we increase the number of counters (currently 2) per 'struct numa_drop_counters'. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250909121942.1202585-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
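The per-NUMA-node split amounts to keeping one padded counter per node and summing on read, so droppers on different nodes stop bouncing a single cache line. A rough userspace sketch of that layout (node count and node lookup are simplified assumptions, not the kernel structure):

    #include <stdalign.h>
    #include <stdatomic.h>

    #define MAX_NUMA_NODES 2   /* illustrative; the real count is a system property */

    struct numa_drop_counters {
        /* one padded counter per node so increments stay node-local */
        struct { alignas(64) atomic_long drops; } node[MAX_NUMA_NODES];
    };

    static void drops_inc(struct numa_drop_counters *c, int this_node)
    {
        atomic_fetch_add_explicit(&c->node[this_node].drops, 1,
                                  memory_order_relaxed);
    }

    static long drops_read(struct numa_drop_counters *c)
    {
        long sum = 0;
        for (int n = 0; n < MAX_NUMA_NODES; n++)       /* slow path: sum all nodes */
            sum += atomic_load_explicit(&c->node[n].drops, memory_order_relaxed);
        return sum;
    }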
2025-09-04  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)
Cross-merge networking fixes after downstream PR (net-6.17-rc5). No conflicts.

Adjacent changes:

include/net/sock.h
  c51613fa276f ("net: add sk->sk_drop_counters")
  5d6b58c932ec ("net: lockless sock_i_ino()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-03  net: lockless sock_i_ino()  (Eric Dumazet)
Followup of commit c51da3f7a161 ("net: remove sock_i_uid()") A recent syzbot report was the trigger for this change. Over the years, we had many problems caused by the read_lock[_bh](&sk->sk_callback_lock) in sock_i_uid(). We could fix smc_diag_dump_proto() or make a more radical move: Instead of waiting for new syzbot reports, cache the socket inode number in sk->sk_ino, so that we no longer need to acquire sk->sk_callback_lock in sock_i_ino(). This makes socket dumps faster (one less cache line miss, and two atomic ops avoided). Prior art: commit 25a9c8a4431c ("netlink: Add __sock_i_ino() for __netlink_diag_dump().") commit 4f9bf2a2f5aa ("tcp: Don't acquire inet_listen_hashbucket::lock with disabled BH.") commit efc3dbc37412 ("rds: Make rds_sock_lock BH rather than IRQ safe.") Fixes: d2d6422f8bd1 ("x86: Allow to enable PREEMPT_RT.") Reported-by: syzbot+50603c05bbdf4dfdaffa@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/68b73804.050a0220.3db4df.01d8.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20250902183603.740428-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-01  tcp: Remove sk->sk_prot->orphan_count.  (Kuniyuki Iwashima)
TCP tracks the number of orphaned (SOCK_DEAD but not yet destructed) sockets in tcp_orphan_count. In some code that was shared with DCCP, tcp_orphan_count is referenced via sk->sk_prot->orphan_count. Let's reference tcp_orphan_count directly. inet_csk_prepare_for_destroy_sock() is moved to inet_connection_sock.c due to header dependency. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250829215641.711664-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-28  net: add sk->sk_drop_counters  (Eric Dumazet)
Some sockets suffer from heavy false sharing on sk->sk_drops, and fields in the same cache line. Add sk->sk_drop_counters to:

 - move the drop counter(s) to dedicated cache lines.
 - Add basic NUMA awareness to these drop counter(s).

Following patches will use this infrastructure for UDP and RAW sockets. sk_clone_lock() is not yet ready, it would need to properly set newsk->sk_drop_counters if we plan to use this for TCP sockets. v2: used Paolo suggestion from https://lore.kernel.org/netdev/8f09830a-d83d-43c9-b36b-88ba0a23e9b2@redhat.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250826125031.1578842-4-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-28  net: add sk_drops_skbadd() helper  (Eric Dumazet)
Existing sk_drops_add() helper is renamed to sk_drops_skbadd(). Add sk_drops_add() and convert sk_drops_inc() to use it. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250826125031.1578842-3-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-28  net: add sk_drops_read(), sk_drops_inc() and sk_drops_reset() helpers  (Eric Dumazet)
We want to split sk->sk_drops in the future to reduce potential contention on this field. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250826125031.1578842-2-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-20  net: set net.core.rmem_max and net.core.wmem_max to 4 MB  (Eric Dumazet)
SO_RCVBUF and SO_SNDBUF have limited range today, unless distros or system admins change rmem_max and wmem_max. Even iproute2 uses 1 MB SO_RCVBUF which is capped by the kernel. Decouple [rw]mem_max and [rw]mem_default and increase [rw]mem_max to 4 MB.

Before:
  $ sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max
  net.core.rmem_default = 212992
  net.core.rmem_max = 212992
  net.core.wmem_default = 212992
  net.core.wmem_max = 212992

After:
  $ sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max
  net.core.rmem_default = 212992
  net.core.rmem_max = 4194304
  net.core.wmem_default = 212992
  net.core.wmem_max = 4194304

Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20250819174030.1986278-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
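For an unprivileged application the practical effect is the ceiling on SO_RCVBUF/SO_SNDBUF. One way to observe the cap from userspace, using only the standard sockets API (the kernel clamps the request to net.core.rmem_max and doubles it internally for bookkeeping, so the value read back reflects the sysctl ceiling):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        int req = 8 * 1024 * 1024, got;
        socklen_t len = sizeof(got);

        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &req, sizeof(req));
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len);
        printf("requested %d, effective %d\n", req, got);
        close(fd);
        return 0;
    }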
2025-08-19  net: Define sk_memcg under CONFIG_MEMCG.  (Kuniyuki Iwashima)
Except for sk_clone_lock(), all accesses to sk->sk_memcg are done under CONFIG_MEMCG. As a bonus, let's define sk->sk_memcg under CONFIG_MEMCG. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Link: https://patch.msgid.link/20250815201712.1745332-11-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-19  net-memcg: Pass struct sock to mem_cgroup_sk_under_memory_pressure().  (Kuniyuki Iwashima)
We will store a flag in the lowest bit of sk->sk_memcg. Then, we cannot pass the raw pointer to mem_cgroup_under_socket_pressure(). Let's pass struct sock to it and rename the function to match other functions starting with mem_cgroup_sk_. Note that the helper is moved to sock.h to use mem_cgroup_from_sk(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Link: https://patch.msgid.link/20250815201712.1745332-10-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-19  net-memcg: Introduce mem_cgroup_sk_enabled().  (Kuniyuki Iwashima)
The socket memcg feature is enabled by a static key and only works for non-root cgroup. We check both conditions in many places. Let's factorise it as a helper function. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Link: https://patch.msgid.link/20250815201712.1745332-8-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-19  net-memcg: Introduce mem_cgroup_from_sk().  (Kuniyuki Iwashima)
We will store a flag in the lowest bit of sk->sk_memcg. Then, directly dereferencing sk->sk_memcg will be illegal, and we do not want to allow touching the raw sk->sk_memcg in many places. Let's introduce mem_cgroup_from_sk(). Other places accessing the raw sk->sk_memcg will be converted later. Note that we cannot define the helper as an inline function in memcontrol.h as we cannot access any fields of struct sock there due to circular dependency, so it is placed in sock.h. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Link: https://patch.msgid.link/20250815201712.1745332-7-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
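The plan described here is the classic trick of stashing a flag in the low bit of an aligned pointer, which is why raw dereferences of sk->sk_memcg have to go through an accessor. A generic userspace sketch of the technique (not the actual mem_cgroup_from_sk() implementation):

    #include <stdint.h>
    #include <stdio.h>

    #define PTR_FLAG 0x1UL   /* usable because the object is at least 2-byte aligned */

    static void *pack(void *ptr, int flag)
    {
        return (void *)((uintptr_t)ptr | (flag ? PTR_FLAG : 0));
    }

    static void *unpack(void *packed)        /* what every accessor must do */
    {
        return (void *)((uintptr_t)packed & ~PTR_FLAG);
    }

    static int flag_of(void *packed)
    {
        return (int)((uintptr_t)packed & PTR_FLAG);
    }

    int main(void)
    {
        static long memcg_stub;              /* stands in for a memcg object */
        void *p = pack(&memcg_stub, 1);
        printf("flag=%d same_ptr=%d\n", flag_of(p), unpack(p) == (void *)&memcg_stub);
        return 0;
    }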
2025-07-14  tcp: add const to tcp_try_rmem_schedule() and sk_rmem_schedule() skb  (Eric Dumazet)
These functions do not modify the skb, so add a const qualifier. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250711114006.480026-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23  net: make sk->sk_rcvtimeo lockless  (Eric Dumazet)
Followup of commit 285975dd6742 ("net: annotate data-races around sk->sk_{rcv|snd}timeo"). Remove lock_sock()/release_sock() from ksmbd_tcp_rcv_timeout() and add READ_ONCE()/WRITE_ONCE() where it is needed. Also SO_RCVTIMEO_OLD and SO_RCVTIMEO_NEW can call sock_set_timeout() without holding the socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250620155536.335520-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23  net: make sk->sk_sndtimeo lockless  (Eric Dumazet)
Followup of commit 285975dd6742 ("net: annotate data-races around sk->sk_{rcv|snd}timeo"). Remove lock_sock()/release_sock() from sock_set_sndtimeo(), and add READ_ONCE()/WRITE_ONCE() where it is needed. Also SO_SNDTIMEO_OLD and SO_SNDTIMEO_NEW can call sock_set_timeout() without holding the socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250620155536.335520-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
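Both timeout conversions follow the same pattern: drop the socket lock and instead mark the racy loads and stores so the compiler cannot tear, fuse, or re-read them. A simplified userspace rendering of the READ_ONCE()/WRITE_ONCE() idea (uses the gcc/clang typeof extension; the real kernel macros are more elaborate):

    #include <pthread.h>
    #include <stdio.h>

    /* Minimal stand-ins for the kernel macros: force a single, untorn access. */
    #define WRITE_ONCE(x, val) (*(volatile typeof(x) *)&(x) = (val))
    #define READ_ONCE(x)       (*(volatile typeof(x) *)&(x))

    static long sndtimeo = 5000;   /* racy field, no lock protects it any more */

    static void *setter(void *arg)
    {
        (void)arg;
        WRITE_ONCE(sndtimeo, 100);             /* e.g. the setsockopt(SO_SNDTIMEO) path */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, setter, NULL);
        long timeo = READ_ONCE(sndtimeo);      /* e.g. the sendmsg() path */
        pthread_join(t, NULL);
        printf("timeo=%ld\n", timeo);
        return 0;
    }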
2025-06-23  net: remove sock_i_uid()  (Eric Dumazet)
Difference between sock_i_uid() and sk_uid() is that after sock_orphan(), sock_i_uid() returns GLOBAL_ROOT_UID while sk_uid() returns the last cached sk->sk_uid value. None of sock_i_uid() callers care about this. Use sk_uid() which is much faster and inlined. Note that diag/dump users are calling sock_i_ino() and can not see the full benefit yet. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Reviewed-by: Maciej Żenczykowski <maze@google.com> Link: https://patch.msgid.link/20250620133001.4090592-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23  net: annotate races around sk->sk_uid  (Eric Dumazet)
sk->sk_uid can be read while another thread changes its value in sockfs_setattr(). Add sk_uid(const struct sock *sk) helper to factorize the needed READ_ONCE() annotations, and add corresponding WRITE_ONCE() where needed. Fixes: 86741ec25462 ("net: core: Add a UID field to struct sock.") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Reviewed-by: Maciej Żenczykowski <maze@google.com> Link: https://patch.msgid.link/20250620133001.4090592-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17  Merge branch 'io_uring-cmd-for-tx-timestamps'  (Jakub Kicinski)
Pavel Begunkov says: ==================== io_uring cmd for tx timestamps (part) Apply the networking helpers for the io_uring timestamp API. ==================== Link: https://patch.msgid.link/cover.1750065793.git.asml.silence@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17  net: timestamp: add helper returning skb's tx tstamp  (Pavel Begunkov)
Add a helper function skb_get_tx_timestamp() that returns a tx timestamp associated with an error queue skb. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/702357dd8936ef4c0d3864441e853bfe3224a677.1750065793.git.asml.silence@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)
Cross-merge networking fixes after downstream PR (net-6.16-rc2). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-10  net: Fix TOCTOU issue in sk_is_readable()  (Michal Luczaj)
sk->sk_prot->sock_is_readable is a valid function pointer when sk resides in a sockmap. After the last sk_psock_put() (which usually happens when socket is removed from sockmap), sk->sk_prot gets restored and sk->sk_prot->sock_is_readable becomes NULL. This makes sk_is_readable() racy, if the value of sk->sk_prot is reloaded after the initial check. Which in turn may lead to a null pointer dereference. Ensure the function pointer does not turn NULL after the check. Fixes: 8934ce2fd081 ("bpf: sockmap redirect ingress support") Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250609-skisreadable-toctou-v1-1-d0dfb2d62c37@rbox.co Signed-off-by: Jakub Kicinski <kuba@kernel.org>
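The generic shape of this fix is to load the function pointer once into a local and have both the NULL test and the call use that local, so the check and the use cannot observe different values. A hedged userspace sketch of the before/after (not the exact sk_is_readable() code):

    #include <stdatomic.h>
    #include <stdbool.h>

    struct ops { bool (*_Atomic sock_is_readable)(void); };

    /* Racy: the pointer may become NULL between the check and the call. */
    bool is_readable_toctou(struct ops *o)
    {
        if (atomic_load(&o->sock_is_readable))
            return atomic_load(&o->sock_is_readable)();   /* second, independent load */
        return false;
    }

    /* Fixed: one load; the NULL test and the call use the same value. */
    bool is_readable_fixed(struct ops *o)
    {
        bool (*fn)(void) = atomic_load(&o->sock_is_readable);
        return fn ? fn() : false;
    }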
2025-06-10  net: remove unused sock_enable_timestamps  (Willem de Bruijn)
This function was introduced in commit 783da70e8396 ("net: add sock_enable_timestamps"), with one caller in rxrpc. That only caller was removed in commit 7903d4438b3f ("rxrpc: Don't use received skbuff timestamps"). Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20250609153254.3504909-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-23  af_unix: Introduce SO_PASSRIGHTS.  (Kuniyuki Iwashima)
As long as recvmsg() or recvmmsg() is used with cmsg, it is not possible to avoid receiving file descriptors via SCM_RIGHTS. This behaviour has occasionally been flagged as problematic, as it can be (ab)used to trigger DoS during close(), for example, by passing a FUSE-controlled fd or a hung NFS fd. For instance, as noted on the uAPI Group page [0], an untrusted peer could send a file descriptor pointing to a hung NFS mount and then close it. Once the receiver calls recvmsg() with msg_control, the descriptor is automatically installed, and then the responsibility for the final close() now falls on the receiver, which may result in blocking the process for a long time. Regarding this, systemd calls cmsg_close_all() [1] after each recvmsg() to close() unwanted file descriptors sent via SCM_RIGHTS. However, this cannot work around the issue at all, because the final fput() may still occur on the receiver's side once sendmsg() with SCM_RIGHTS succeeds. Also, even filtering by LSM at recvmsg() does not work for the same reason. Thus, we need a better way to refuse SCM_RIGHTS at sendmsg(). Let's introduce SO_PASSRIGHTS to disable SCM_RIGHTS. Note that this option is enabled by default for backward compatibility. Link: https://uapi-group.org/kernel-features/#disabling-reception-of-scm_rights-for-af_unix-sockets #[0] Link: https://github.com/systemd/systemd/blob/v257.5/src/basic/fd-util.c#L612-L628 #[1] Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
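Usage would presumably mirror other boolean SOL_SOCKET options such as SO_PASSCRED: the receiver clears it on its own socket, after which a peer's sendmsg() carrying SCM_RIGHTS toward that socket is refused. A hedged sketch, assuming the SO_PASSRIGHTS constant is available from a recent uapi header (the numeric value below is a placeholder, take the real one from <asm/socket.h>):

    #include <sys/socket.h>
    #include <stdio.h>

    #ifndef SO_PASSRIGHTS
    #define SO_PASSRIGHTS 83   /* placeholder value for illustration only */
    #endif

    int main(void)
    {
        int sv[2];
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv);

        int off = 0;           /* the option is enabled by default for compatibility */
        if (setsockopt(sv[1], SOL_SOCKET, SO_PASSRIGHTS, &off, sizeof(off)) < 0)
            perror("setsockopt(SO_PASSRIGHTS)");
        /* From here on, SCM_RIGHTS sent by the peer toward sv[1] should be rejected
         * at sendmsg() time instead of installing fds on the receiver. */
        return 0;
    }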