summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2026-01-14vdso: Remove struct getcpu_cacheThomas Weißschuh
The cache parameter of getcpu() is useless nowadays for various reasons. * It is never passed by userspace for either the vDSO or syscalls. * It is never used by the kernel. * It could not be made to work on the current vDSO architecture. * The structure definition is not part of the UAPI headers. * vdso_getcpu() is superseded by restartable sequences in any case. Remove the struct and its header. As a side-effect this gets rid of an unwanted inclusion of the linux/ header namespace from vDSO code. [ tglx: Adapt to s390 upstream changes */ Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390 Link: https://patch.msgid.link/20251230-getcpu_cache-v3-1-fb9c5f880ebe@linutronix.de
2026-01-13drm/xe: vram addr range is expanded to bit[17:8]Fei Yang
The bit field used to be [14:8] with [17:15] marked as SPARE and defaulted to 0. So, simply expand the read to bit[17:8] assuming the platforms using only bit[14:8] have zeros in the expanded bits. BSpec: 54991 Signed-off-by: Fei Yang <fei.yang@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260112220330.2267122-2-fei.yang@intel.com
2026-01-13drm/xe: Replace use of system_wq with tlb_inval->timeout_wqMarco Crivellari
This patch continues the effort to refactor workqueue APIs, which has begun with the changes introducing new workqueues and a new alloc_workqueue flag: commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq") commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag") The point of the refactoring is to eventually alter the default behavior of workqueues to become unbound by default so that their workload placement is optimized by the scheduler. Before that to happen, workqueue users must be converted to the better named new workqueues with no intended behaviour changes: system_wq -> system_percpu_wq system_unbound_wq -> system_dfl_wq This way the old obsolete workqueues (system_wq, system_unbound_wq) can be removed in the future. After a carefully evaluation, because this is the fence signaling path, we changed the code in order to use one of the Xe's workqueue. So, a new workqueue named 'timeout_wq' has been added to 'struct xe_tlb_inval' and has been initialized with 'gt->ordered_wq' changing the system_wq uses with tlb_inval->timeout_wq. Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/ Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260112094406.82641-1-marco.crivellari@suse.com
2026-01-14PCI: dw-rockchip: Disable BAR 0 and BAR 1 for Root PortShawn Lin
Some Rockchip PCIe Root Ports report bogus size of 1GiB for the BAR memories and they cause below resource allocation issue during probe. pci 0000:00:00.0: [1d87:3588] type 01 class 0x060400 PCIe Root Port pci 0000:00:00.0: BAR 0 [mem 0x00000000-0x3fffffff] pci 0000:00:00.0: BAR 1 [mem 0x00000000-0x3fffffff] pci 0000:00:00.0: ROM [mem 0x00000000-0x0000ffff pref] ... pci 0000:00:00.0: BAR 0 [mem 0x900000000-0x93fffffff]: assigned pci 0000:00:00.0: BAR 1 [mem size 0x40000000]: can't assign; no space pci 0000:00:00.0: BAR 1 [mem size 0x40000000]: failed to assign pci 0000:00:00.0: ROM [mem 0xf0200000-0xf020ffff pref]: assigned pci 0000:00:00.0: BAR 0 [mem 0x900000000-0x93fffffff]: releasing pci 0000:00:00.0: ROM [mem 0xf0200000-0xf020ffff pref]: releasing pci 0000:00:00.0: BAR 0 [mem 0x900000000-0x93fffffff]: assigned pci 0000:00:00.0: BAR 1 [mem size 0x40000000]: can't assign; no space pci 0000:00:00.0: BAR 1 [mem size 0x40000000]: failed to assign Since there is no use of the Root Port BAR memories, disable both of them. Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com> [mani: reworded the description and comment] Signed-off-by: Manivannan Sadhasivam <mani@kernel.org> Link: https://patch.msgid.link/1766570461-138256-1-git-send-email-shawn.lin@rock-chips.com
2026-01-13Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds
Pull bpf fixes from Alexei Starovoitov: - Fix incorrect usage of BPF_TRAMP_F_ORIG_STACK in riscv JIT (Menglong Dong) - Fix reference count leak in bpf_prog_test_run_xdp() (Tetsuo Handa) - Fix metadata size check in bpf_test_run() (Toke Høiland-Jørgensen) - Check that BPF insn array is not allowed as a map for const strings (Deepanshu Kartikey) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Fix reference count leak in bpf_prog_test_run_xdp() bpf: Reject BPF_MAP_TYPE_INSN_ARRAY in check_reg_const_str() selftests/bpf: Update xdp_context_test_run test to check maximum metadata size bpf, test_run: Subtract size of xdp_frame from allowed metadata size riscv, bpf: Fix incorrect usage of BPF_TRAMP_F_ORIG_STACK
2026-01-13Merge branch 'properly-load-insn-array-values-with-offsets'Alexei Starovoitov
Anton Protopopov says: ==================== properly load insn array values with offsets As was reported by the BPF CI bot in [1] the direct address of an instruction array returned by map_direct_value_addr() is incorrect if the offset is non-zero. Fix this bug and add selftests. Also (commit 2), return EACCES instead of EINVAL when offsets aren't correct. [1] https://lore.kernel.org/bpf/0447c47ac58306546a5dbdbad2601f3e77fa8eb24f3a4254dda3a39f6133e68f@mail.kernel.org/ ==================== Link: https://patch.msgid.link/20260111153047.8388-1-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-13net/sched: sch_qfq: do not free existing class in qfq_change_class()Eric Dumazet
Fixes qfq_change_class() error case. cl->qdisc and cl should only be freed if a new class and qdisc were allocated, or we risk various UAF. Fixes: 462dbc9101ac ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost") Reported-by: syzbot+07f3f38f723c335f106d@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6965351d.050a0220.eaf7.00c5.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260112175656.17605-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13selftests/bpf: Add tests for loading insn array values with offsetsAnton Protopopov
The ldimm64 instruction for map value supports an offset. For insn array maps it wasn't tested before, as normally such instructions aren't generated. However, this is still possible to pass such instructions, so add a few tests to check that correct offsets work properly and incorrect offsets are rejected. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20260111153047.8388-4-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-13bpf: Return EACCES for incorrect access to insn arrayAnton Protopopov
The insn_array_map_direct_value_addr() function currently returns -EINVAL when the offset within the map is invalid. Change this to return -EACCES, so that it is consistent with similar boundary access checks in the verifier. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260111153047.8388-3-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-13bpf: Return proper address for non-zero offsets in insn arrayAnton Protopopov
The map_direct_value_addr() function of the instruction array map incorrectly adds offset to the resulting address. This is a bug, because later the resolve_pseudo_ldimm64() function adds the offset. Fix it. Corresponding selftests are added in a consequent commit. Fixes: 493d9e0d6083 ("bpf, x86: add support for indirect jumps") Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260111153047.8388-2-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-13Merge branch '20260105-kvmrprocv10-v10-0-022e96815380@oss.qualcomm.com' into ↵Bjorn Andersson
drivers-for-6.20 Merge the support for loading and managing the TrustZone-based remote processors found in the Glymur platform through a topic branch, as it's a mix of qcom-soc and remoteproc patches.
2026-01-13selftests/bpf: assert BPF kfunc default trusted pointer semanticsMatt Bobrowski
The BPF verifier was recently updated to treat pointers to struct types returned from BPF kfuncs as implicitly trusted by default. Add a new test case to exercise this new implicit trust semantic. The KF_ACQUIRE flag was dropped from the bpf_get_root_mem_cgroup() kfunc because it returns a global pointer to root_mem_cgroup without performing any explicit reference counting. This makes it an ideal candidate to verify the new implicit trusted pointer semantics. Signed-off-by: Matt Bobrowski <mattbobrowski@google.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260113083949.2502978-3-mattbobrowski@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-13bpf: drop KF_ACQUIRE flag on BPF kfunc bpf_get_root_mem_cgroup()Matt Bobrowski
With the BPF verifier now treating pointers to struct types returned from BPF kfuncs as implicitly trusted by default, there is no need for bpf_get_root_mem_cgroup() to be annotated with the KF_ACQUIRE flag. bpf_get_root_mem_cgroup() does not acquire any references, but rather simply returns a NULL pointer or a pointer to a struct mem_cgroup object that is valid for the entire lifetime of the kernel. This simplifies BPF programs using this kfunc by removing the requirement to pair the call with bpf_put_mem_cgroup(). Signed-off-by: Matt Bobrowski <mattbobrowski@google.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260113083949.2502978-2-mattbobrowski@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-13bpf: return PTR_TO_BTF_ID | PTR_TRUSTED from BPF kfuncs by defaultMatt Bobrowski
Teach the BPF verifier to treat pointers to struct types returned from BPF kfuncs as implicitly trusted (PTR_TO_BTF_ID | PTR_TRUSTED) by default. Returning untrusted pointers to struct types from BPF kfuncs should be considered an exception only, and certainly not the norm. Update existing selftests to reflect the change in register type printing (e.g. `ptr_` becoming `trusted_ptr_` in verifier error messages). Link: https://lore.kernel.org/bpf/aV4nbCaMfIoM0awM@google.com/ Signed-off-by: Matt Bobrowski <mattbobrowski@google.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260113083949.2502978-1-mattbobrowski@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-13rust: bitops: fix missing _find_* functions on 32-bit ARMAlice Ryhl
On 32-bit ARM, you may encounter linker errors such as this one: ld.lld: error: undefined symbol: _find_next_zero_bit >>> referenced by rust_binder_main.43196037ba7bcee1-cgu.0 >>> drivers/android/binder/rust_binder_main.o:(<rust_binder_main::process::Process>::insert_or_update_handle) in archive vmlinux.a >>> referenced by rust_binder_main.43196037ba7bcee1-cgu.0 >>> drivers/android/binder/rust_binder_main.o:(<rust_binder_main::process::Process>::insert_or_update_handle) in archive vmlinux.a This error occurs because even though the functions are declared by include/linux/find.h, the definition is #ifdef'd out on 32-bit ARM. This is because arch/arm/include/asm/bitops.h contains: #define find_first_zero_bit(p,sz) _find_first_zero_bit_le(p,sz) #define find_next_zero_bit(p,sz,off) _find_next_zero_bit_le(p,sz,off) #define find_first_bit(p,sz) _find_first_bit_le(p,sz) #define find_next_bit(p,sz,off) _find_next_bit_le(p,sz,off) And the underscore-prefixed function is conditional on #ifndef of the non-underscore-prefixed name, but the declaration in find.h is *not* conditional on that #ifndef. To fix the linker error, we ensure that the symbols in question exist when compiling Rust code. We do this by defining them in rust/helpers/ whenever the normal definition is #ifndef'd out. Note that these helpers are somewhat unusual in that they do not have the rust_helper_ prefix that most helpers have. Adding the rust_helper_ prefix does not compile, as 'bindings::_find_next_zero_bit()' will result in a call to a symbol called _find_next_zero_bit as defined by include/linux/find.h rather than a symbol with the rust_helper_ prefix. This is because when a symbol is present in both include/ and rust/helpers/, the one from include/ wins under the assumption that the current configuration is one where that helper is unnecessary. This heuristic fails for _find_next_zero_bit() because the header file always declares it even if the symbol does not exist. The functions still use the __rust_helper annotation. This lets the wrapper function be inlined into Rust code even if full kernel LTO is not used once the patch series for that feature lands. Yury: arches are free to implement they own find_bit() functions. Most rely on generic implementation, but arm32 and m86k - not; so they require custom handling. Alice confirmed it fixes the build for both. Cc: stable@vger.kernel.org Fixes: 6cf93a9ed39e ("rust: add bindings for bitops.h") Reported-by: Andreas Hindborg <a.hindborg@kernel.org> Closes: https://rust-for-linux.zulipchat.com/#narrow/channel/x/topic/x/near/561677301 Tested-by: Andreas Hindborg <a.hindborg@kernel.org> Reviewed-by: Dirk Behme <dirk.behme@de.bosch.com> Signed-off-by: Alice Ryhl <aliceryhl@google.com> Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
2026-01-13net: mana: Implement ndo_tx_timeout and serialize queue resets per port.Dipayaan Roy
Implement .ndo_tx_timeout for MANA so any stalled TX queue can be detected and a device-controlled port reset for all queues can be scheduled to a ordered workqueue. The reset for all queues on stall detection is recomended by hardware team. Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com> Link: https://patch.msgid.link/20260112130552.GA11785@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13Merge branch 'selftests-couple-of-fixes-in-toeplitz-rps-cases'Jakub Kicinski
Gal Pressman says: ==================== selftests: Couple of fixes in Toeplitz RPS cases Fix a couple of bugs in the RPS cases of the Toeplitz selftest. ==================== Link: https://patch.msgid.link/20260112173715.384843-1-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13selftests: drv-net: fix RPS mask handling for high CPU numbersGal Pressman
The RPS bitmask bounds check uses ~(RPS_MAX_CPUS - 1) which equals ~15 = 0xfff0, only allowing CPUs 0-3. Change the mask to ~((1UL << RPS_MAX_CPUS) - 1) = ~0xffff to allow CPUs 0-15. Fixes: 5ebfb4cc3048 ("selftests/net: toeplitz test") Reviewed-by: Nimrod Oren <noren@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260112173715.384843-3-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13selftests: drv-net: fix RPS mask handling in toeplitz testGal Pressman
The toeplitz.py test passed the hex mask without "0x" prefix (e.g., "300" for CPUs 8,9). The toeplitz.c strtoul() call wrongly parsed this as decimal 300 (0x12c) instead of hex 0x300. Pass the prefixed mask to toeplitz.c, and the unprefixed one to sysfs. Fixes: 9cf9aa77a1f6 ("selftests: drv-net: hw: convert the Toeplitz test to Python") Reviewed-by: Nimrod Oren <noren@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260112173715.384843-2-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13ipv6: Fix use-after-free in inet6_addr_del().Kuniyuki Iwashima
syzbot reported use-after-free of inet6_ifaddr in inet6_addr_del(). [0] The cited commit accidentally moved ipv6_del_addr() for mngtmpaddr before reading its ifp->flags for temporary addresses in inet6_addr_del(). Let's move ipv6_del_addr() down to fix the UAF. [0]: BUG: KASAN: slab-use-after-free in inet6_addr_del.constprop.0+0x67a/0x6b0 net/ipv6/addrconf.c:3117 Read of size 4 at addr ffff88807b89c86c by task syz.3.1618/9593 CPU: 0 UID: 0 PID: 9593 Comm: syz.3.1618 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xcd/0x630 mm/kasan/report.c:482 kasan_report+0xe0/0x110 mm/kasan/report.c:595 inet6_addr_del.constprop.0+0x67a/0x6b0 net/ipv6/addrconf.c:3117 addrconf_del_ifaddr+0x11e/0x190 net/ipv6/addrconf.c:3181 inet6_ioctl+0x1e5/0x2b0 net/ipv6/af_inet6.c:582 sock_do_ioctl+0x118/0x280 net/socket.c:1254 sock_ioctl+0x227/0x6b0 net/socket.c:1375 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:597 [inline] __se_sys_ioctl fs/ioctl.c:583 [inline] __x64_sys_ioctl+0x18e/0x210 fs/ioctl.c:583 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xcd/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f164cf8f749 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f164de64038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007f164d1e5fa0 RCX: 00007f164cf8f749 RDX: 0000200000000000 RSI: 0000000000008936 RDI: 0000000000000003 RBP: 00007f164d013f91 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007f164d1e6038 R14: 00007f164d1e5fa0 R15: 00007ffde15c8288 </TASK> Allocated by task 9593: kasan_save_stack+0x33/0x60 mm/kasan/common.c:56 kasan_save_track+0x14/0x30 mm/kasan/common.c:77 poison_kmalloc_redzone mm/kasan/common.c:397 [inline] __kasan_kmalloc+0xaa/0xb0 mm/kasan/common.c:414 kmalloc_noprof include/linux/slab.h:957 [inline] kzalloc_noprof include/linux/slab.h:1094 [inline] ipv6_add_addr+0x4e3/0x2010 net/ipv6/addrconf.c:1120 inet6_addr_add+0x256/0x9b0 net/ipv6/addrconf.c:3050 addrconf_add_ifaddr+0x1fc/0x450 net/ipv6/addrconf.c:3160 inet6_ioctl+0x103/0x2b0 net/ipv6/af_inet6.c:580 sock_do_ioctl+0x118/0x280 net/socket.c:1254 sock_ioctl+0x227/0x6b0 net/socket.c:1375 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:597 [inline] __se_sys_ioctl fs/ioctl.c:583 [inline] __x64_sys_ioctl+0x18e/0x210 fs/ioctl.c:583 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xcd/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 6099: kasan_save_stack+0x33/0x60 mm/kasan/common.c:56 kasan_save_track+0x14/0x30 mm/kasan/common.c:77 kasan_save_free_info+0x3b/0x60 mm/kasan/generic.c:584 poison_slab_object mm/kasan/common.c:252 [inline] __kasan_slab_free+0x5f/0x80 mm/kasan/common.c:284 kasan_slab_free include/linux/kasan.h:234 [inline] slab_free_hook mm/slub.c:2540 [inline] slab_free_freelist_hook mm/slub.c:2569 [inline] slab_free_bulk mm/slub.c:6696 [inline] kmem_cache_free_bulk mm/slub.c:7383 [inline] kmem_cache_free_bulk+0x2bf/0x680 mm/slub.c:7362 kfree_bulk include/linux/slab.h:830 [inline] kvfree_rcu_bulk+0x1b7/0x1e0 mm/slab_common.c:1523 kvfree_rcu_drain_ready mm/slab_common.c:1728 [inline] kfree_rcu_monitor+0x1d0/0x2f0 mm/slab_common.c:1801 process_one_work+0x9ba/0x1b20 kernel/workqueue.c:3257 process_scheduled_works kernel/workqueue.c:3340 [inline] worker_thread+0x6c8/0xf10 kernel/workqueue.c:3421 kthread+0x3c5/0x780 kernel/kthread.c:463 ret_from_fork+0x983/0xb10 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246 Fixes: 00b5b7aab9e42 ("net/ipv6: delete temporary address if mngtmpaddr is removed or unmanaged") Reported-by: syzbot+72e610f4f1a930ca9d8a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/696598e9.050a0220.3be5c5.0009.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260113010538.2019411-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13dst: fix races in rt6_uncached_list_del() and rt_del_uncached_list()Eric Dumazet
syzbot was able to crash the kernel in rt6_uncached_list_flush_dev() in an interesting way [1] Crash happens in list_del_init()/INIT_LIST_HEAD() while writing list->prev, while the prior write on list->next went well. static inline void INIT_LIST_HEAD(struct list_head *list) { WRITE_ONCE(list->next, list); // This went well WRITE_ONCE(list->prev, list); // Crash, @list has been freed. } Issue here is that rt6_uncached_list_del() did not attempt to lock ul->lock, as list_empty(&rt->dst.rt_uncached) returned true because the WRITE_ONCE(list->next, list) happened on the other CPU. We might use list_del_init_careful() and list_empty_careful(), or make sure rt6_uncached_list_del() always grabs the spinlock whenever rt->dst.rt_uncached_list has been set. A similar fix is neeed for IPv4. [1] BUG: KASAN: slab-use-after-free in INIT_LIST_HEAD include/linux/list.h:46 [inline] BUG: KASAN: slab-use-after-free in list_del_init include/linux/list.h:296 [inline] BUG: KASAN: slab-use-after-free in rt6_uncached_list_flush_dev net/ipv6/route.c:191 [inline] BUG: KASAN: slab-use-after-free in rt6_disable_ip+0x633/0x730 net/ipv6/route.c:5020 Write of size 8 at addr ffff8880294cfa78 by task kworker/u8:14/3450 CPU: 0 UID: 0 PID: 3450 Comm: kworker/u8:14 Tainted: G L syzkaller #0 PREEMPT_{RT,(full)} Tainted: [L]=SOFTLOCKUP Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025 Workqueue: netns cleanup_net Call Trace: <TASK> dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xca/0x240 mm/kasan/report.c:482 kasan_report+0x118/0x150 mm/kasan/report.c:595 INIT_LIST_HEAD include/linux/list.h:46 [inline] list_del_init include/linux/list.h:296 [inline] rt6_uncached_list_flush_dev net/ipv6/route.c:191 [inline] rt6_disable_ip+0x633/0x730 net/ipv6/route.c:5020 addrconf_ifdown+0x143/0x18a0 net/ipv6/addrconf.c:3853 addrconf_notify+0x1bc/0x1050 net/ipv6/addrconf.c:-1 notifier_call_chain+0x19d/0x3a0 kernel/notifier.c:85 call_netdevice_notifiers_extack net/core/dev.c:2268 [inline] call_netdevice_notifiers net/core/dev.c:2282 [inline] netif_close_many+0x29c/0x410 net/core/dev.c:1785 unregister_netdevice_many_notify+0xb50/0x2330 net/core/dev.c:12353 ops_exit_rtnl_list net/core/net_namespace.c:187 [inline] ops_undo_list+0x3dc/0x990 net/core/net_namespace.c:248 cleanup_net+0x4de/0x7b0 net/core/net_namespace.c:696 process_one_work kernel/workqueue.c:3257 [inline] process_scheduled_works+0xad1/0x1770 kernel/workqueue.c:3340 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3421 kthread+0x711/0x8a0 kernel/kthread.c:463 ret_from_fork+0x510/0xa50 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246 </TASK> Allocated by task 803: kasan_save_stack mm/kasan/common.c:57 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:78 unpoison_slab_object mm/kasan/common.c:340 [inline] __kasan_slab_alloc+0x6c/0x80 mm/kasan/common.c:366 kasan_slab_alloc include/linux/kasan.h:253 [inline] slab_post_alloc_hook mm/slub.c:4953 [inline] slab_alloc_node mm/slub.c:5263 [inline] kmem_cache_alloc_noprof+0x18d/0x6c0 mm/slub.c:5270 dst_alloc+0x105/0x170 net/core/dst.c:89 ip6_dst_alloc net/ipv6/route.c:342 [inline] icmp6_dst_alloc+0x75/0x460 net/ipv6/route.c:3333 mld_sendpack+0x683/0xe60 net/ipv6/mcast.c:1844 mld_send_cr net/ipv6/mcast.c:2154 [inline] mld_ifc_work+0x83e/0xd60 net/ipv6/mcast.c:2693 process_one_work kernel/workqueue.c:3257 [inline] process_scheduled_works+0xad1/0x1770 kernel/workqueue.c:3340 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3421 kthread+0x711/0x8a0 kernel/kthread.c:463 ret_from_fork+0x510/0xa50 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246 Freed by task 20: kasan_save_stack mm/kasan/common.c:57 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:78 kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:584 poison_slab_object mm/kasan/common.c:253 [inline] __kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285 kasan_slab_free include/linux/kasan.h:235 [inline] slab_free_hook mm/slub.c:2540 [inline] slab_free mm/slub.c:6670 [inline] kmem_cache_free+0x18f/0x8d0 mm/slub.c:6781 dst_destroy+0x235/0x350 net/core/dst.c:121 rcu_do_batch kernel/rcu/tree.c:2605 [inline] rcu_core kernel/rcu/tree.c:2857 [inline] rcu_cpu_kthread+0xba5/0x1af0 kernel/rcu/tree.c:2945 smpboot_thread_fn+0x542/0xa60 kernel/smpboot.c:160 kthread+0x711/0x8a0 kernel/kthread.c:463 ret_from_fork+0x510/0xa50 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246 Last potentially related work creation: kasan_save_stack+0x3e/0x60 mm/kasan/common.c:57 kasan_record_aux_stack+0xbd/0xd0 mm/kasan/generic.c:556 __call_rcu_common kernel/rcu/tree.c:3119 [inline] call_rcu+0xee/0x890 kernel/rcu/tree.c:3239 refdst_drop include/net/dst.h:266 [inline] skb_dst_drop include/net/dst.h:278 [inline] skb_release_head_state+0x71/0x360 net/core/skbuff.c:1156 skb_release_all net/core/skbuff.c:1180 [inline] __kfree_skb net/core/skbuff.c:1196 [inline] sk_skb_reason_drop+0xe9/0x170 net/core/skbuff.c:1234 kfree_skb_reason include/linux/skbuff.h:1322 [inline] tcf_kfree_skb_list include/net/sch_generic.h:1127 [inline] __dev_xmit_skb net/core/dev.c:4260 [inline] __dev_queue_xmit+0x26aa/0x3210 net/core/dev.c:4785 NF_HOOK_COND include/linux/netfilter.h:307 [inline] ip6_output+0x340/0x550 net/ipv6/ip6_output.c:247 NF_HOOK+0x9e/0x380 include/linux/netfilter.h:318 mld_sendpack+0x8d4/0xe60 net/ipv6/mcast.c:1855 mld_send_cr net/ipv6/mcast.c:2154 [inline] mld_ifc_work+0x83e/0xd60 net/ipv6/mcast.c:2693 process_one_work kernel/workqueue.c:3257 [inline] process_scheduled_works+0xad1/0x1770 kernel/workqueue.c:3340 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3421 kthread+0x711/0x8a0 kernel/kthread.c:463 ret_from_fork+0x510/0xa50 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246 The buggy address belongs to the object at ffff8880294cfa00 which belongs to the cache ip6_dst_cache of size 232 The buggy address is located 120 bytes inside of freed 232-byte region [ffff8880294cfa00, ffff8880294cfae8) The buggy address belongs to the physical page: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x294cf memcg:ffff88803536b781 flags: 0x80000000000000(node=0|zone=1) page_type: f5(slab) raw: 0080000000000000 ffff88802ff1c8c0 ffffea0000bf2bc0 dead000000000006 raw: 0000000000000000 00000000800c000c 00000000f5000000 ffff88803536b781 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 0, migratetype Unmovable, gfp_mask 0x52820(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP), pid 9, tgid 9 (kworker/0:0), ts 91119585830, free_ts 91088628818 set_page_owner include/linux/page_owner.h:32 [inline] post_alloc_hook+0x234/0x290 mm/page_alloc.c:1857 prep_new_page mm/page_alloc.c:1865 [inline] get_page_from_freelist+0x28c0/0x2960 mm/page_alloc.c:3915 __alloc_frozen_pages_noprof+0x181/0x370 mm/page_alloc.c:5210 alloc_pages_mpol+0xd1/0x380 mm/mempolicy.c:2486 alloc_slab_page mm/slub.c:3075 [inline] allocate_slab+0x86/0x3b0 mm/slub.c:3248 new_slab mm/slub.c:3302 [inline] ___slab_alloc+0xb10/0x13e0 mm/slub.c:4656 __slab_alloc+0xc6/0x1f0 mm/slub.c:4779 __slab_alloc_node mm/slub.c:4855 [inline] slab_alloc_node mm/slub.c:5251 [inline] kmem_cache_alloc_noprof+0x101/0x6c0 mm/slub.c:5270 dst_alloc+0x105/0x170 net/core/dst.c:89 ip6_dst_alloc net/ipv6/route.c:342 [inline] icmp6_dst_alloc+0x75/0x460 net/ipv6/route.c:3333 mld_sendpack+0x683/0xe60 net/ipv6/mcast.c:1844 mld_send_cr net/ipv6/mcast.c:2154 [inline] mld_ifc_work+0x83e/0xd60 net/ipv6/mcast.c:2693 process_one_work kernel/workqueue.c:3257 [inline] process_scheduled_works+0xad1/0x1770 kernel/workqueue.c:3340 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3421 kthread+0x711/0x8a0 kernel/kthread.c:463 ret_from_fork+0x510/0xa50 arch/x86/kernel/process.c:158 page last free pid 5859 tgid 5859 stack trace: reset_page_owner include/linux/page_owner.h:25 [inline] free_pages_prepare mm/page_alloc.c:1406 [inline] __free_frozen_pages+0xfe1/0x1170 mm/page_alloc.c:2943 discard_slab mm/slub.c:3346 [inline] __put_partials+0x149/0x170 mm/slub.c:3886 __slab_free+0x2af/0x330 mm/slub.c:5952 qlink_free mm/kasan/quarantine.c:163 [inline] qlist_free_all+0x97/0x100 mm/kasan/quarantine.c:179 kasan_quarantine_reduce+0x148/0x160 mm/kasan/quarantine.c:286 __kasan_slab_alloc+0x22/0x80 mm/kasan/common.c:350 kasan_slab_alloc include/linux/kasan.h:253 [inline] slab_post_alloc_hook mm/slub.c:4953 [inline] slab_alloc_node mm/slub.c:5263 [inline] kmem_cache_alloc_noprof+0x18d/0x6c0 mm/slub.c:5270 getname_flags+0xb8/0x540 fs/namei.c:146 getname include/linux/fs.h:2498 [inline] do_sys_openat2+0xbc/0x200 fs/open.c:1426 do_sys_open fs/open.c:1436 [inline] __do_sys_openat fs/open.c:1452 [inline] __se_sys_openat fs/open.c:1447 [inline] __x64_sys_openat+0x138/0x170 fs/open.c:1447 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xec/0xf80 arch/x86/entry/syscall_64.c:94 Fixes: 8d0b94afdca8 ("ipv6: Keep track of DST_NOCACHE routes in case of iface down/unregister") Fixes: 78df76a065ae ("ipv4: take rt_uncached_lock only if needed") Reported-by: syzbot+179fc225724092b8b2b2@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6964cdf2.050a0220.eaf7.009d.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260112103825.3810713-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: hv_netvsc: reject RSS hash key programming without RX indirection tableAditya Garg
RSS configuration requires a valid RX indirection table. When the device reports a single receive queue, rndis_filter_device_add() does not allocate an indirection table, accepting RSS hash key updates in this state leads to a hang. Fix this by gating netvsc_set_rxfh() on ndc->rx_table_sz and return -EOPNOTSUPP when the table is absent. This aligns set_rxfh with the device capabilities and prevents incorrect behavior. Fixes: 962f3fee83a4 ("netvsc: add ethtool ops to get/set RSS key") Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com> Reviewed-by: Dipayaan Roy <dipayanroy@linux.microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Link: https://patch.msgid.link/1768212093-1594-1-git-send-email-gargaditya@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: sxgbe: fix typo in commentJinseok Kim
Fix a misspelling in the sxgbe_mtl_init() function comment. "Algorith" should be spelled as "Algorithm". Signed-off-by: Jinseok Kim <always.starving0@gmail.com> Link: https://patch.msgid.link/20260112044147.2844-1-always.starving0@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13Merge branch 'net-phy-fixed_phy-replace-list-of-fixed-phys-with-static-array'Jakub Kicinski
Heiner Kallweit says: ==================== net: phy: fixed_phy: replace list of fixed PHYs with static array Due to max 32 PHY addresses being available per mii bus, using a list can't support more fixed PHY's. And there's no known use case for as much as 32 fixed PHY's on a system. 8 should be plenty of fixed PHY's, so use an array of that size instead of a list. This allows to significantly reduce the code size and complexity. In addition replace heavy-weight IDA with a simple bitmap. ==================== Link: https://patch.msgid.link/110f676d-727c-4575-abe4-e383f98fc38f@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: phy: fixed_phy: replace IDA with a bitmapHeiner Kallweit
Size of array fmb_fixed_phys is small, so we can use a simple bitmap instead of an IDA to manage dynamic allocation of fixed PHY's. find_first_zero_bit() isn't atomic, so we need the loop to rule out double allocation of a PHY address. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/d4614463-d532-41fc-92e9-ef97107aceb5@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: phy: fixed_phy: replace list of fixed PHYs with static arrayHeiner Kallweit
Due to max 32 PHY addresses being available per mii bus, using a list can't support more fixed PHY's. And there's no known use case for as much as 32 fixed PHY's on a system. 8 should be plenty of fixed PHY's, so use an array of that size instead of a list. This allows to significantly reduce the code size and complexity. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/8610d30c-eac7-4100-9008-d3b6cee6a5cd@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13Merge branch 'ipv6-allow-for-nexthop-device-mismatch-with-onlink'Jakub Kicinski
Ido Schimmel says: ==================== ipv6: Allow for nexthop device mismatch with "onlink" This patchset aligns IPv6 with IPv4 with respect to the "onlink" keyword and allows IPv6 routes to be configured with a gateway address that is resolved out of a different interface than the one specified. Patches #1-#3 are small preparations in the existing "onlink" selftest. Patch #4 is the actual change. See the commit message for detailed description and motivation. Patch #5 adds test cases for both address families, to make sure that this use case does not regress. ==================== Link: https://patch.msgid.link/20260111120813.159799-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13selftests: fib-onlink: Add test cases for nexthop device mismatchIdo Schimmel
Add test cases that verify that when the "onlink" keyword is specified, both address families (with and without VRF) accept routes with a gateway address that is reachable via a different interface than the one specified. Output without "ipv6: Allow for nexthop device mismatch with "onlink"": # ./fib-onlink-tests.sh | grep mismatch TEST: nexthop device mismatch [ OK ] TEST: nexthop device mismatch [ OK ] TEST: nexthop device mismatch [FAIL] TEST: nexthop device mismatch [FAIL] Output with "ipv6: Allow for nexthop device mismatch with "onlink"": # ./fib-onlink-tests.sh | grep mismatch TEST: nexthop device mismatch [ OK ] TEST: nexthop device mismatch [ OK ] TEST: nexthop device mismatch [ OK ] TEST: nexthop device mismatch [ OK ] That is, the IPv4 tests were always passing, but the IPv6 ones only pass after the specified patch. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260111120813.159799-6-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13ipv6: Allow for nexthop device mismatch with "onlink"Ido Schimmel
IPv4 allows for a nexthop device mismatch when the "onlink" keyword is specified: # ip link add name dummy1 up type dummy # ip address add 192.0.2.1/24 dev dummy1 # ip link add name dummy2 up type dummy # ip route add 198.51.100.0/24 nexthop via 192.0.2.2 dev dummy2 Error: Nexthop has invalid gateway. # ip route add 198.51.100.0/24 nexthop via 192.0.2.2 dev dummy2 onlink # echo $? 0 This seems to be consistent with the description of "onlink" in the ip-route man page: "Pretend that the nexthop is directly attached to this link, even if it does not match any interface prefix". On the other hand, IPv6 rejects a nexthop device mismatch, even when "onlink" is specified: # ip link add name dummy1 up type dummy # ip address add 2001:db8:1::1/64 dev dummy1 # ip link add name dummy2 up type dummy # ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2 RTNETLINK answers: No route to host # ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2 onlink Error: Nexthop has invalid gateway or device mismatch. This is intentional according to commit fc1e64e1092f ("net/ipv6: Add support for onlink flag") which added IPv6 "onlink" support and states that "any unicast gateway is allowed as long as the gateway is not a local address and if it resolves it must match the given device". The condition was later relaxed in commit 4ed591c8ab44 ("net/ipv6: Allow onlink routes to have a device mismatch if it is the default route") to allow for a nexthop device mismatch if the gateway address is resolved via the default route: # ip link add name dummy1 up type dummy # ip route add ::/0 dev dummy1 # ip link add name dummy2 up type dummy # ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2 RTNETLINK answers: No route to host # ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2 onlink # echo $? 0 While the decision to forbid a nexthop device mismatch in IPv6 seems to be intentional, it is unclear why it was made. Especially when it differs from IPv4 and seems to go against the intended behavior of "onlink". Therefore, relax the condition further and allow for a nexthop device mismatch when "onlink" is specified: # ip link add name dummy1 up type dummy # ip address add 2001:db8:1::1/64 dev dummy1 # ip link add name dummy2 up type dummy # ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2 onlink # echo $? 0 The motivating use case is the fact that FRR would like to be able to configure overlay routes of the following form: # ip route add <host-Z> vrf <VRF> encap ip id <ID> src <VTEP-A> dst <VTEP-Z> via <VTEP-Z> dev vxlan0 onlink Where vxlan0 is in the default VRF in which "VTEP-Z" is reachable via one of the underlay routes (e.g., via swpX). Without this patch, the above only works with IPv4, but not with IPv6. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260111120813.159799-5-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13selftests: fib-onlink: Add a test case for IPv4 multicast gatewayIdo Schimmel
A multicast gateway address should be rejected when "onlink" is specified, but it is only tested as part of the IPv6 tests. Add an equivalent IPv4 test. # ./fib-onlink-tests.sh -v [...] COMMAND: ip ro add table 254 169.254.101.12/32 via 233.252.0.1 dev veth1 onlink Error: Nexthop has invalid gateway. TEST: Invalid gw - multicast address [ OK ] [...] COMMAND: ip ro add table 1101 169.254.102.12/32 via 233.252.0.1 dev veth5 onlink Error: Nexthop has invalid gateway. TEST: Invalid gw - multicast address, VRF [ OK ] [...] Tests passed: 37 Tests failed: 0 Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260111120813.159799-4-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13selftests: fib-onlink: Remove "wrong nexthop device" IPv6 testsIdo Schimmel
The command in the test fails as expected because IPv6 forbids a nexthop device mismatch: # ./fib-onlink-tests.sh -v [...] COMMAND: ip -6 ro add table 1101 2001:db8:102::103/128 via 2001:db8:701::64 dev veth5 onlink Error: Nexthop has invalid gateway or device mismatch. TEST: Gateway resolves to wrong nexthop device - VRF [ OK ] [...] Where: # ip route get 2001:db8:701::64 vrf lisa 2001:db8:701::64 dev veth7 table 1101 proto kernel src 2001:db8:701::1 metric 256 pref medium This is in contrast to IPv4 where a nexthop device mismatch is allowed when "onlink" is specified: # ip route get 169.254.7.2 vrf lisa 169.254.7.2 dev veth7 table 1101 src 169.254.7.1 uid 0 # ip ro add table 1101 169.254.102.103/32 via 169.254.7.2 dev veth5 onlink # echo $? 0 Remove these tests in preparation for aligning IPv6 with IPv4 and allowing nexthop device mismatch when "onlink" is specified. A subsequent patch will add tests that verify that both address families allow a nexthop device mismatch with "onlink". Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260111120813.159799-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13selftests: fib-onlink: Remove "wrong nexthop device" IPv4 testsIdo Schimmel
According to the test description, these tests fail because of a wrong nexthop device: # ./fib-onlink-tests.sh -v [...] COMMAND: ip ro add table 254 169.254.101.102/32 via 169.254.3.1 dev veth1 onlink Error: Nexthop has invalid gateway. TEST: Gateway resolves to wrong nexthop device [ OK ] COMMAND: ip ro add table 1101 169.254.102.103/32 via 169.254.7.1 dev veth5 onlink Error: Nexthop has invalid gateway. TEST: Gateway resolves to wrong nexthop device - VRF [ OK ] [...] But this is incorrect. They fail because the gateway addresses are local addresses: # ip -4 address show [...] 28: veth3@if27: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 link-netns peer_ns-Urqh3o inet 169.254.3.1/24 scope global veth3 [...] 32: veth7@if31: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master lisa state UP group default qlen 1000 link-netns peer_ns-Urqh3o inet 169.254.7.1/24 scope global veth7 Therefore, using a local address that matches the nexthop device fails as well: # ip ro add table 254 169.254.101.102/32 via 169.254.3.1 dev veth3 onlink Error: Nexthop has invalid gateway. Using a gateway address with a "wrong" nexthop device is actually valid and allowed: # ip route get 169.254.1.2 169.254.1.2 dev veth1 src 169.254.1.1 uid 0 # ip ro add table 254 169.254.101.102/32 via 169.254.1.2 dev veth3 onlink # echo $? 0 Remove these tests given that their output is confusing and that the scenario that they are testing is already covered by other tests. A subsequent patch will add tests for the nexthop device mismatch scenario. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260111120813.159799-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13Merge branch 'net-phy-introduce-phy-ports-representation'Jakub Kicinski
Maxime Chevallier says: ==================== net: phy: Introduce PHY ports representation A few important notes: - This is only a first phase. It instantiates the port, and leverage that to make the MAC <-> PHY <-> SFP usecase simpler. - Next phase will deal with controlling the port state, as well as the netlink uAPI for that. - The end-goal is to enable support for complex port MUX. This preliminary work focuses on PHY-driven ports, but this will be extended to support muxing at the MII level (Multi-phy, or compo PHY + SFP as found on Turris Omnia for example). - The naming is definitely not set in stone. I named that "phy_port", but this may convey the false sense that this is phylib-specific. Even the word "port" is not that great, as it already has several different meanings in the net world (switch port, devlink port, etc.). I used the term "connector" in the binding. A bit of history on that work : The end goal that I personnaly want to achieve is : + PHY - RJ45 | MAC - MUX -+ PHY - RJ45 After many discussions here on netdev@, but also at netdevconf[1] and LPC[2], there appears to be several analoguous designs that exist out there. [1] : https://netdevconf.info/0x17/sessions/talk/improving-multi-phy-and-multi-port-interfaces.html [2] : https://lpc.events/event/18/contributions/1964/ (video isn't the right one) Take the MAchiatobin, it has 2 interfaces that looks like this : MAC - PHY -+ RJ45 | + SFP - Whatever the module does Now, looking at the Turris Omnia, we have : MAC - MUX -+ PHY - RJ45 | + SFP - Whatever the module does We can find more example of this kind of designs, the common part is that we expose multiple front-facing media ports. This is what this current work aims at supporting. As of right now, it does'nt add any support for muxing, but this will come later on. This first phase focuses on phy-driven ports only, but there are already quite some challenges already. For one, we can't really autodetect how many ports are sitting behind a PHY. That's why this series introduces a new binding. Describing ports in DT should however be a last-resort thing when we need to clear some ambiguity about the PHY media-side. The only use-cases that we have today for multi-port PHYs are combo PHYs that drive both a Copper port and an SFP (the Macchiatobin case). This in itself is challenging and this series only addresses part of this support, by registering a phy_port for the PHY <-> SFP connection. The SFP module should in the end be considered as a port as well, but that's not yet the case. However, because now PHYs can register phy_ports for every media-side interface they have, they can register the capabilities of their ports, which allows making the PHY-driver SFP case much more generic. ==================== Link: https://patch.msgid.link/20260108080041.553250-1-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13Documentation: networking: Document the phy_port infrastructureMaxime Chevallier
This documentation aims at describing the main goal of the phy_port infrastructure. Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-15-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: phy: Only rely on phy_port for PHY-driven SFPMaxime Chevallier
Now that all PHY drivers that support downstream SFP have been converted to phy_port serdes handling, we can make the generic PHY SFP handling mandatory, thus making all phylib sfp helpers static. Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-14-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: phy: qca807x: Support SFP through phy_port interfaceMaxime Chevallier
QCA8072/8075 may be used as combo-port PHYs, with Serdes (100/1000BaseX) and Copper interfaces. The PHY has the ability to read the configuration it's in. If the configuration indicates the PHY is in combo mode, allow registering up to 2 ports. Register a dedicated set of port ops to handle the serdes port, and rely on generic phylib SFP support for the SFP handling. Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-13-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: phy: at803x: Support SFP through phy_port interfaceMaxime Chevallier
Convert the at803x driver to use the generic phylib SFP handling, via a dedicated .attach_port() callback, populating the supported interfaces. As these devices are limited to 1000BaseX, a workaround is used to also support, in a very limited way, copper modules. This is done by supporting SGMII but limiting it to 1G full duplex (in which case it's somewhat compatible with 1000BaseX). Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-12-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: phy: marvell10g: Support SFP through phy_portMaxime Chevallier
Convert the Marvell10G driver to use the generic SFP handling, through a dedicated .attach_port() handler to populate the port's supported interfaces. As the 88x3310 supports multiple MDI, the .attach_port() logic handles both SFP attach with 10GBaseR support, and support for the "regular" port that usually is a BaseT port. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-11-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: phy: marvell: Support SFP through phy_port interfaceMaxime Chevallier
Convert the Marvell driver (especially the 88e1512 driver) to use the phy_port interface to handle SFPs. This means registering a .attach_port() handler to detect when a serdes line interface is used (most likely, and SFP module). Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-10-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: phy: marvell-88x2222: Support SFP through phy_port interfaceMaxime Chevallier
The 88x2222 PHY from Marvell only supports serialised modes as its line-facing interfaces. Convert that driver to the generic phylib SFP handling. Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-9-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: phy: Introduce generic SFP handling for PHY driversMaxime Chevallier
There are currently 4 PHY drivers that can drive downstream SFPs: marvell.c, marvell10g.c, at803x.c and marvell-88x2222.c. Most of the logic is boilerplate, either calling into generic phylib helpers (for SFP PHY attach, bus attach, etc.) or performing the same tasks with a bit of validation : - Getting the module's expected interface mode - Making sure the PHY supports it - Optionaly perform some configuration to make sure the PHY outputs the right mode This can be made more generic by leveraging the phy_port, and its configure_mii() callback which allows setting a port's interfaces when the port is a serdes. Introduce a generic PHY SFP support. If a driver doesn't probe the SFP bus itself, but an SFP phandle is found in devicetree/firmware, then the generic PHY SFP support will be used, relying on port ops. PHY driver need to : - Register a .attach_port() callback - When a serdes port is registered to the PHY, drivers must set port->interfaces to the set of PHY_INTERFACE_MODE the port can output - If the port has limitations regarding speed, duplex and aneg, the port can also fine-tune the final linkmodes that can be supported - The port may register a set of ops, including .configure_mii(), that will be called at module_insert time to adjust the interface based on the module detected. Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-8-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: phy: Create a phy_port for PHY-driven SFPsMaxime Chevallier
Some PHY devices may be used as media-converters to drive SFP ports (for example, to allow using SFP when the SoC can only output RGMII). This is already supported to some extend by allowing PHY drivers to registers themselves as being SFP upstream. However, the logic to drive the SFP can actually be split to a per-port control logic, allowing support for multi-port PHYs, or PHYs that can either drive SFPs or Copper. To that extent, create a phy_port when registering an SFP bus onto a PHY. This port is considered a "serdes" port, in that it can feed data to another entity on the link. The PHY driver needs to specify the various PHY_INTERFACE_MODE_XXX that this port supports. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-7-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13dt-bindings: net: dp83822: Deprecate ti,fiber-modeMaxime Chevallier
The newly added ethernet-connector binding allows describing an Ethernet connector with greater precision, and in a more generic manner, than ti,fiber-mode. Deprecate this property. Reviewed-by: Rob Herring (Arm) <robh@kernel.org> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-6-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: phy: dp83822: Add support for phy_port representationMaxime Chevallier
With the phy_port representation introduced, we can use .attach_port to populate the port information based on either the straps or the ti,fiber-mode property. This allows simplifying the probe function and allow users to override the strapping configuration. Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-5-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: phy: Introduce PHY ports representationMaxime Chevallier
Ethernet provides a wide variety of layer 1 protocols and standards for data transmission. The front-facing ports of an interface have their own complexity and configurability. Introduce a representation of these front-facing ports. The current code is minimalistic and only support ports controlled by PHY devices, but the plan is to extend that to SFP as well as raw Ethernet MACs that don't use PHY devices. This minimal port representation allows describing the media and number of pairs of a BaseT port. From that information, we can derive the linkmodes usable on the port, which can be used to limit the capabilities of an interface. For now, the port pairs and medium is derived from devicetree, defined by the PHY driver, or populated with default values (as we assume that all PHYs expose at least one port). The typical example is 100M ethernet. 100BaseTX works using only 2 pairs on a Cat 5 cables. However, in the situation where a 10/100/1000 capable PHY is wired to its RJ45 port through 2 pairs only, we have no way of detecting that. The "max-speed" DT property can be used, but a more accurate representation can be used : mdi { connector-0 { media = "BaseT"; pairs = <2>; }; }; From that information, we can derive the max speed reachable on the port. Another benefit of having that is to avoid vendor-specific DT properties (micrel,fiber-mode or ti,fiber-mode). This basic representation is meant to be expanded, by the introduction of port ops, userspace listing of ports, and support for multi-port devices. Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20260108080041.553250-4-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: ethtool: Introduce ETHTOOL_LINK_MEDIUM_* valuesMaxime Chevallier
In an effort to have a better representation of Ethernet ports, introduce enumeration values representing the various ethernet Mediums. This is part of the 802.3 naming convention, for example : 1000 Base T 4 | | | | | | | \_ pairs (4) | | \___ Medium (T == Twisted Copper Pairs) | \_______ Baseband transmission \____________ Speed Other example : 10000 Base K X 4 | | \_ lanes (4) | \___ encoding (BaseX is 8b/10b while BaseR is 66b/64b) \_____ Medium (K is backplane ethernet) In the case of representing a physical port, only the medium and number of pairs should be relevant. One exception would be 1000BaseX, which is currently also used as a medium in what appears to be any of 1000BaseSX, 1000BaseCX, 1000BaseLX, 1000BaseEX, 1000BaseBX10 and some other. This was reflected in the mediums associated with the 1000BaseX linkmode. These mediums are set in the net/ethtool/common.c lookup table that maintains a list of all linkmodes with their number of pairs, medium, encoding, speed and duplex. One notable exception to this is 100BaseT Ethernet. It emcompasses 100BaseTX, which is a 2-pairs protocol but also 100BaseT4, that will also work on 4-pairs cables. As we don't make a disctinction between these, the lookup table contains 2 sets of pair numbers, indicating the min number of pairs for a protocol to work and the "nominal" number of pairs as well. Another set of exceptions are linkmodes such 100000baseLR4_ER4, where the same link mode seems to represent 100GBaseLR4 and 100GBaseER4. The macro __DEFINE_LINK_MODE_PARAMS_MEDIUMS is here used to populate the .mediums bitfield with all appropriate mediums. Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20260108080041.553250-3-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13dt-bindings: net: Introduce the ethernet-connector descriptionMaxime Chevallier
The ability to describe the physical ports of Ethernet devices is useful to describe multi-port devices, as well as to remove any ambiguity with regard to the nature of the port. Moreover, describing ports allows for a better description of features that are tied to connectors, such as PoE through the PSE-PD devices. Introduce a binding to allow describing the ports, for now with 2 attributes : - The number of pairs, which is a quite generic property that allows differentating between multiple similar technologies such as BaseT1 and "regular" BaseT (which usually means BaseT4). - The media that can be used on that port, such as BaseT for Twisted Copper, BaseC for coax copper, BaseS/L for Fiber, BaseK for backplane ethernet, etc. This allows defining the nature of the port, and therefore avoids the need for vendor-specific properties such as "micrel,fiber-mode" or "ti,fiber-mode". The port description lives in its own file, as it is intended in the future to allow describing the ports for phy-less devices. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Rob Herring (Arm) <robh@kernel.org> Link: https://patch.msgid.link/20260108080041.553250-2-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-14io_uring/zcrx: document area chunking parameterPavel Begunkov
struct io_uring_zcrx_ifq_reg::rx_buf_len is used as a hint specifying the kernel what buffer size it should use. Document the API and limitations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2026-01-14selftests: iou-zcrx: test large chunk sizesPavel Begunkov
Add a test using large chunks for zcrx memory area. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2026-01-14eth: bnxt: support qcfg provided rx page sizePavel Begunkov
Implement support for qcfg provided rx page sizes. For that, implement the ndo_default_qcfg callback and validate the config on restart. Also, use the current config's value in bnxt_init_ring_struct to retain the correct size across resets. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>