summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2026-01-13selftests: drv-net: fix RPS mask handling for high CPU numbersGal Pressman
The RPS bitmask bounds check uses ~(RPS_MAX_CPUS - 1) which equals ~15 = 0xfff0, only allowing CPUs 0-3. Change the mask to ~((1UL << RPS_MAX_CPUS) - 1) = ~0xffff to allow CPUs 0-15. Fixes: 5ebfb4cc3048 ("selftests/net: toeplitz test") Reviewed-by: Nimrod Oren <noren@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260112173715.384843-3-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13selftests: drv-net: fix RPS mask handling in toeplitz testGal Pressman
The toeplitz.py test passed the hex mask without "0x" prefix (e.g., "300" for CPUs 8,9). The toeplitz.c strtoul() call wrongly parsed this as decimal 300 (0x12c) instead of hex 0x300. Pass the prefixed mask to toeplitz.c, and the unprefixed one to sysfs. Fixes: 9cf9aa77a1f6 ("selftests: drv-net: hw: convert the Toeplitz test to Python") Reviewed-by: Nimrod Oren <noren@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260112173715.384843-2-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13ipv6: Fix use-after-free in inet6_addr_del().Kuniyuki Iwashima
syzbot reported use-after-free of inet6_ifaddr in inet6_addr_del(). [0] The cited commit accidentally moved ipv6_del_addr() for mngtmpaddr before reading its ifp->flags for temporary addresses in inet6_addr_del(). Let's move ipv6_del_addr() down to fix the UAF. [0]: BUG: KASAN: slab-use-after-free in inet6_addr_del.constprop.0+0x67a/0x6b0 net/ipv6/addrconf.c:3117 Read of size 4 at addr ffff88807b89c86c by task syz.3.1618/9593 CPU: 0 UID: 0 PID: 9593 Comm: syz.3.1618 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xcd/0x630 mm/kasan/report.c:482 kasan_report+0xe0/0x110 mm/kasan/report.c:595 inet6_addr_del.constprop.0+0x67a/0x6b0 net/ipv6/addrconf.c:3117 addrconf_del_ifaddr+0x11e/0x190 net/ipv6/addrconf.c:3181 inet6_ioctl+0x1e5/0x2b0 net/ipv6/af_inet6.c:582 sock_do_ioctl+0x118/0x280 net/socket.c:1254 sock_ioctl+0x227/0x6b0 net/socket.c:1375 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:597 [inline] __se_sys_ioctl fs/ioctl.c:583 [inline] __x64_sys_ioctl+0x18e/0x210 fs/ioctl.c:583 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xcd/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f164cf8f749 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f164de64038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007f164d1e5fa0 RCX: 00007f164cf8f749 RDX: 0000200000000000 RSI: 0000000000008936 RDI: 0000000000000003 RBP: 00007f164d013f91 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007f164d1e6038 R14: 00007f164d1e5fa0 R15: 00007ffde15c8288 </TASK> Allocated by task 9593: kasan_save_stack+0x33/0x60 mm/kasan/common.c:56 kasan_save_track+0x14/0x30 mm/kasan/common.c:77 poison_kmalloc_redzone mm/kasan/common.c:397 [inline] __kasan_kmalloc+0xaa/0xb0 mm/kasan/common.c:414 kmalloc_noprof include/linux/slab.h:957 [inline] kzalloc_noprof include/linux/slab.h:1094 [inline] ipv6_add_addr+0x4e3/0x2010 net/ipv6/addrconf.c:1120 inet6_addr_add+0x256/0x9b0 net/ipv6/addrconf.c:3050 addrconf_add_ifaddr+0x1fc/0x450 net/ipv6/addrconf.c:3160 inet6_ioctl+0x103/0x2b0 net/ipv6/af_inet6.c:580 sock_do_ioctl+0x118/0x280 net/socket.c:1254 sock_ioctl+0x227/0x6b0 net/socket.c:1375 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:597 [inline] __se_sys_ioctl fs/ioctl.c:583 [inline] __x64_sys_ioctl+0x18e/0x210 fs/ioctl.c:583 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xcd/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 6099: kasan_save_stack+0x33/0x60 mm/kasan/common.c:56 kasan_save_track+0x14/0x30 mm/kasan/common.c:77 kasan_save_free_info+0x3b/0x60 mm/kasan/generic.c:584 poison_slab_object mm/kasan/common.c:252 [inline] __kasan_slab_free+0x5f/0x80 mm/kasan/common.c:284 kasan_slab_free include/linux/kasan.h:234 [inline] slab_free_hook mm/slub.c:2540 [inline] slab_free_freelist_hook mm/slub.c:2569 [inline] slab_free_bulk mm/slub.c:6696 [inline] kmem_cache_free_bulk mm/slub.c:7383 [inline] kmem_cache_free_bulk+0x2bf/0x680 mm/slub.c:7362 kfree_bulk include/linux/slab.h:830 [inline] kvfree_rcu_bulk+0x1b7/0x1e0 mm/slab_common.c:1523 kvfree_rcu_drain_ready mm/slab_common.c:1728 [inline] kfree_rcu_monitor+0x1d0/0x2f0 mm/slab_common.c:1801 process_one_work+0x9ba/0x1b20 kernel/workqueue.c:3257 process_scheduled_works kernel/workqueue.c:3340 [inline] worker_thread+0x6c8/0xf10 kernel/workqueue.c:3421 kthread+0x3c5/0x780 kernel/kthread.c:463 ret_from_fork+0x983/0xb10 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246 Fixes: 00b5b7aab9e42 ("net/ipv6: delete temporary address if mngtmpaddr is removed or unmanaged") Reported-by: syzbot+72e610f4f1a930ca9d8a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/696598e9.050a0220.3be5c5.0009.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260113010538.2019411-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13dst: fix races in rt6_uncached_list_del() and rt_del_uncached_list()Eric Dumazet
syzbot was able to crash the kernel in rt6_uncached_list_flush_dev() in an interesting way [1] Crash happens in list_del_init()/INIT_LIST_HEAD() while writing list->prev, while the prior write on list->next went well. static inline void INIT_LIST_HEAD(struct list_head *list) { WRITE_ONCE(list->next, list); // This went well WRITE_ONCE(list->prev, list); // Crash, @list has been freed. } Issue here is that rt6_uncached_list_del() did not attempt to lock ul->lock, as list_empty(&rt->dst.rt_uncached) returned true because the WRITE_ONCE(list->next, list) happened on the other CPU. We might use list_del_init_careful() and list_empty_careful(), or make sure rt6_uncached_list_del() always grabs the spinlock whenever rt->dst.rt_uncached_list has been set. A similar fix is neeed for IPv4. [1] BUG: KASAN: slab-use-after-free in INIT_LIST_HEAD include/linux/list.h:46 [inline] BUG: KASAN: slab-use-after-free in list_del_init include/linux/list.h:296 [inline] BUG: KASAN: slab-use-after-free in rt6_uncached_list_flush_dev net/ipv6/route.c:191 [inline] BUG: KASAN: slab-use-after-free in rt6_disable_ip+0x633/0x730 net/ipv6/route.c:5020 Write of size 8 at addr ffff8880294cfa78 by task kworker/u8:14/3450 CPU: 0 UID: 0 PID: 3450 Comm: kworker/u8:14 Tainted: G L syzkaller #0 PREEMPT_{RT,(full)} Tainted: [L]=SOFTLOCKUP Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025 Workqueue: netns cleanup_net Call Trace: <TASK> dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xca/0x240 mm/kasan/report.c:482 kasan_report+0x118/0x150 mm/kasan/report.c:595 INIT_LIST_HEAD include/linux/list.h:46 [inline] list_del_init include/linux/list.h:296 [inline] rt6_uncached_list_flush_dev net/ipv6/route.c:191 [inline] rt6_disable_ip+0x633/0x730 net/ipv6/route.c:5020 addrconf_ifdown+0x143/0x18a0 net/ipv6/addrconf.c:3853 addrconf_notify+0x1bc/0x1050 net/ipv6/addrconf.c:-1 notifier_call_chain+0x19d/0x3a0 kernel/notifier.c:85 call_netdevice_notifiers_extack net/core/dev.c:2268 [inline] call_netdevice_notifiers net/core/dev.c:2282 [inline] netif_close_many+0x29c/0x410 net/core/dev.c:1785 unregister_netdevice_many_notify+0xb50/0x2330 net/core/dev.c:12353 ops_exit_rtnl_list net/core/net_namespace.c:187 [inline] ops_undo_list+0x3dc/0x990 net/core/net_namespace.c:248 cleanup_net+0x4de/0x7b0 net/core/net_namespace.c:696 process_one_work kernel/workqueue.c:3257 [inline] process_scheduled_works+0xad1/0x1770 kernel/workqueue.c:3340 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3421 kthread+0x711/0x8a0 kernel/kthread.c:463 ret_from_fork+0x510/0xa50 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246 </TASK> Allocated by task 803: kasan_save_stack mm/kasan/common.c:57 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:78 unpoison_slab_object mm/kasan/common.c:340 [inline] __kasan_slab_alloc+0x6c/0x80 mm/kasan/common.c:366 kasan_slab_alloc include/linux/kasan.h:253 [inline] slab_post_alloc_hook mm/slub.c:4953 [inline] slab_alloc_node mm/slub.c:5263 [inline] kmem_cache_alloc_noprof+0x18d/0x6c0 mm/slub.c:5270 dst_alloc+0x105/0x170 net/core/dst.c:89 ip6_dst_alloc net/ipv6/route.c:342 [inline] icmp6_dst_alloc+0x75/0x460 net/ipv6/route.c:3333 mld_sendpack+0x683/0xe60 net/ipv6/mcast.c:1844 mld_send_cr net/ipv6/mcast.c:2154 [inline] mld_ifc_work+0x83e/0xd60 net/ipv6/mcast.c:2693 process_one_work kernel/workqueue.c:3257 [inline] process_scheduled_works+0xad1/0x1770 kernel/workqueue.c:3340 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3421 kthread+0x711/0x8a0 kernel/kthread.c:463 ret_from_fork+0x510/0xa50 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246 Freed by task 20: kasan_save_stack mm/kasan/common.c:57 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:78 kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:584 poison_slab_object mm/kasan/common.c:253 [inline] __kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285 kasan_slab_free include/linux/kasan.h:235 [inline] slab_free_hook mm/slub.c:2540 [inline] slab_free mm/slub.c:6670 [inline] kmem_cache_free+0x18f/0x8d0 mm/slub.c:6781 dst_destroy+0x235/0x350 net/core/dst.c:121 rcu_do_batch kernel/rcu/tree.c:2605 [inline] rcu_core kernel/rcu/tree.c:2857 [inline] rcu_cpu_kthread+0xba5/0x1af0 kernel/rcu/tree.c:2945 smpboot_thread_fn+0x542/0xa60 kernel/smpboot.c:160 kthread+0x711/0x8a0 kernel/kthread.c:463 ret_from_fork+0x510/0xa50 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246 Last potentially related work creation: kasan_save_stack+0x3e/0x60 mm/kasan/common.c:57 kasan_record_aux_stack+0xbd/0xd0 mm/kasan/generic.c:556 __call_rcu_common kernel/rcu/tree.c:3119 [inline] call_rcu+0xee/0x890 kernel/rcu/tree.c:3239 refdst_drop include/net/dst.h:266 [inline] skb_dst_drop include/net/dst.h:278 [inline] skb_release_head_state+0x71/0x360 net/core/skbuff.c:1156 skb_release_all net/core/skbuff.c:1180 [inline] __kfree_skb net/core/skbuff.c:1196 [inline] sk_skb_reason_drop+0xe9/0x170 net/core/skbuff.c:1234 kfree_skb_reason include/linux/skbuff.h:1322 [inline] tcf_kfree_skb_list include/net/sch_generic.h:1127 [inline] __dev_xmit_skb net/core/dev.c:4260 [inline] __dev_queue_xmit+0x26aa/0x3210 net/core/dev.c:4785 NF_HOOK_COND include/linux/netfilter.h:307 [inline] ip6_output+0x340/0x550 net/ipv6/ip6_output.c:247 NF_HOOK+0x9e/0x380 include/linux/netfilter.h:318 mld_sendpack+0x8d4/0xe60 net/ipv6/mcast.c:1855 mld_send_cr net/ipv6/mcast.c:2154 [inline] mld_ifc_work+0x83e/0xd60 net/ipv6/mcast.c:2693 process_one_work kernel/workqueue.c:3257 [inline] process_scheduled_works+0xad1/0x1770 kernel/workqueue.c:3340 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3421 kthread+0x711/0x8a0 kernel/kthread.c:463 ret_from_fork+0x510/0xa50 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246 The buggy address belongs to the object at ffff8880294cfa00 which belongs to the cache ip6_dst_cache of size 232 The buggy address is located 120 bytes inside of freed 232-byte region [ffff8880294cfa00, ffff8880294cfae8) The buggy address belongs to the physical page: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x294cf memcg:ffff88803536b781 flags: 0x80000000000000(node=0|zone=1) page_type: f5(slab) raw: 0080000000000000 ffff88802ff1c8c0 ffffea0000bf2bc0 dead000000000006 raw: 0000000000000000 00000000800c000c 00000000f5000000 ffff88803536b781 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 0, migratetype Unmovable, gfp_mask 0x52820(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP), pid 9, tgid 9 (kworker/0:0), ts 91119585830, free_ts 91088628818 set_page_owner include/linux/page_owner.h:32 [inline] post_alloc_hook+0x234/0x290 mm/page_alloc.c:1857 prep_new_page mm/page_alloc.c:1865 [inline] get_page_from_freelist+0x28c0/0x2960 mm/page_alloc.c:3915 __alloc_frozen_pages_noprof+0x181/0x370 mm/page_alloc.c:5210 alloc_pages_mpol+0xd1/0x380 mm/mempolicy.c:2486 alloc_slab_page mm/slub.c:3075 [inline] allocate_slab+0x86/0x3b0 mm/slub.c:3248 new_slab mm/slub.c:3302 [inline] ___slab_alloc+0xb10/0x13e0 mm/slub.c:4656 __slab_alloc+0xc6/0x1f0 mm/slub.c:4779 __slab_alloc_node mm/slub.c:4855 [inline] slab_alloc_node mm/slub.c:5251 [inline] kmem_cache_alloc_noprof+0x101/0x6c0 mm/slub.c:5270 dst_alloc+0x105/0x170 net/core/dst.c:89 ip6_dst_alloc net/ipv6/route.c:342 [inline] icmp6_dst_alloc+0x75/0x460 net/ipv6/route.c:3333 mld_sendpack+0x683/0xe60 net/ipv6/mcast.c:1844 mld_send_cr net/ipv6/mcast.c:2154 [inline] mld_ifc_work+0x83e/0xd60 net/ipv6/mcast.c:2693 process_one_work kernel/workqueue.c:3257 [inline] process_scheduled_works+0xad1/0x1770 kernel/workqueue.c:3340 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3421 kthread+0x711/0x8a0 kernel/kthread.c:463 ret_from_fork+0x510/0xa50 arch/x86/kernel/process.c:158 page last free pid 5859 tgid 5859 stack trace: reset_page_owner include/linux/page_owner.h:25 [inline] free_pages_prepare mm/page_alloc.c:1406 [inline] __free_frozen_pages+0xfe1/0x1170 mm/page_alloc.c:2943 discard_slab mm/slub.c:3346 [inline] __put_partials+0x149/0x170 mm/slub.c:3886 __slab_free+0x2af/0x330 mm/slub.c:5952 qlink_free mm/kasan/quarantine.c:163 [inline] qlist_free_all+0x97/0x100 mm/kasan/quarantine.c:179 kasan_quarantine_reduce+0x148/0x160 mm/kasan/quarantine.c:286 __kasan_slab_alloc+0x22/0x80 mm/kasan/common.c:350 kasan_slab_alloc include/linux/kasan.h:253 [inline] slab_post_alloc_hook mm/slub.c:4953 [inline] slab_alloc_node mm/slub.c:5263 [inline] kmem_cache_alloc_noprof+0x18d/0x6c0 mm/slub.c:5270 getname_flags+0xb8/0x540 fs/namei.c:146 getname include/linux/fs.h:2498 [inline] do_sys_openat2+0xbc/0x200 fs/open.c:1426 do_sys_open fs/open.c:1436 [inline] __do_sys_openat fs/open.c:1452 [inline] __se_sys_openat fs/open.c:1447 [inline] __x64_sys_openat+0x138/0x170 fs/open.c:1447 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xec/0xf80 arch/x86/entry/syscall_64.c:94 Fixes: 8d0b94afdca8 ("ipv6: Keep track of DST_NOCACHE routes in case of iface down/unregister") Fixes: 78df76a065ae ("ipv4: take rt_uncached_lock only if needed") Reported-by: syzbot+179fc225724092b8b2b2@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6964cdf2.050a0220.eaf7.009d.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260112103825.3810713-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: hv_netvsc: reject RSS hash key programming without RX indirection tableAditya Garg
RSS configuration requires a valid RX indirection table. When the device reports a single receive queue, rndis_filter_device_add() does not allocate an indirection table, accepting RSS hash key updates in this state leads to a hang. Fix this by gating netvsc_set_rxfh() on ndc->rx_table_sz and return -EOPNOTSUPP when the table is absent. This aligns set_rxfh with the device capabilities and prevents incorrect behavior. Fixes: 962f3fee83a4 ("netvsc: add ethtool ops to get/set RSS key") Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com> Reviewed-by: Dipayaan Roy <dipayanroy@linux.microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Link: https://patch.msgid.link/1768212093-1594-1-git-send-email-gargaditya@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: sxgbe: fix typo in commentJinseok Kim
Fix a misspelling in the sxgbe_mtl_init() function comment. "Algorith" should be spelled as "Algorithm". Signed-off-by: Jinseok Kim <always.starving0@gmail.com> Link: https://patch.msgid.link/20260112044147.2844-1-always.starving0@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13Merge branch 'net-phy-fixed_phy-replace-list-of-fixed-phys-with-static-array'Jakub Kicinski
Heiner Kallweit says: ==================== net: phy: fixed_phy: replace list of fixed PHYs with static array Due to max 32 PHY addresses being available per mii bus, using a list can't support more fixed PHY's. And there's no known use case for as much as 32 fixed PHY's on a system. 8 should be plenty of fixed PHY's, so use an array of that size instead of a list. This allows to significantly reduce the code size and complexity. In addition replace heavy-weight IDA with a simple bitmap. ==================== Link: https://patch.msgid.link/110f676d-727c-4575-abe4-e383f98fc38f@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: phy: fixed_phy: replace IDA with a bitmapHeiner Kallweit
Size of array fmb_fixed_phys is small, so we can use a simple bitmap instead of an IDA to manage dynamic allocation of fixed PHY's. find_first_zero_bit() isn't atomic, so we need the loop to rule out double allocation of a PHY address. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/d4614463-d532-41fc-92e9-ef97107aceb5@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: phy: fixed_phy: replace list of fixed PHYs with static arrayHeiner Kallweit
Due to max 32 PHY addresses being available per mii bus, using a list can't support more fixed PHY's. And there's no known use case for as much as 32 fixed PHY's on a system. 8 should be plenty of fixed PHY's, so use an array of that size instead of a list. This allows to significantly reduce the code size and complexity. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/8610d30c-eac7-4100-9008-d3b6cee6a5cd@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13Merge branch 'ipv6-allow-for-nexthop-device-mismatch-with-onlink'Jakub Kicinski
Ido Schimmel says: ==================== ipv6: Allow for nexthop device mismatch with "onlink" This patchset aligns IPv6 with IPv4 with respect to the "onlink" keyword and allows IPv6 routes to be configured with a gateway address that is resolved out of a different interface than the one specified. Patches #1-#3 are small preparations in the existing "onlink" selftest. Patch #4 is the actual change. See the commit message for detailed description and motivation. Patch #5 adds test cases for both address families, to make sure that this use case does not regress. ==================== Link: https://patch.msgid.link/20260111120813.159799-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13selftests: fib-onlink: Add test cases for nexthop device mismatchIdo Schimmel
Add test cases that verify that when the "onlink" keyword is specified, both address families (with and without VRF) accept routes with a gateway address that is reachable via a different interface than the one specified. Output without "ipv6: Allow for nexthop device mismatch with "onlink"": # ./fib-onlink-tests.sh | grep mismatch TEST: nexthop device mismatch [ OK ] TEST: nexthop device mismatch [ OK ] TEST: nexthop device mismatch [FAIL] TEST: nexthop device mismatch [FAIL] Output with "ipv6: Allow for nexthop device mismatch with "onlink"": # ./fib-onlink-tests.sh | grep mismatch TEST: nexthop device mismatch [ OK ] TEST: nexthop device mismatch [ OK ] TEST: nexthop device mismatch [ OK ] TEST: nexthop device mismatch [ OK ] That is, the IPv4 tests were always passing, but the IPv6 ones only pass after the specified patch. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260111120813.159799-6-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13ipv6: Allow for nexthop device mismatch with "onlink"Ido Schimmel
IPv4 allows for a nexthop device mismatch when the "onlink" keyword is specified: # ip link add name dummy1 up type dummy # ip address add 192.0.2.1/24 dev dummy1 # ip link add name dummy2 up type dummy # ip route add 198.51.100.0/24 nexthop via 192.0.2.2 dev dummy2 Error: Nexthop has invalid gateway. # ip route add 198.51.100.0/24 nexthop via 192.0.2.2 dev dummy2 onlink # echo $? 0 This seems to be consistent with the description of "onlink" in the ip-route man page: "Pretend that the nexthop is directly attached to this link, even if it does not match any interface prefix". On the other hand, IPv6 rejects a nexthop device mismatch, even when "onlink" is specified: # ip link add name dummy1 up type dummy # ip address add 2001:db8:1::1/64 dev dummy1 # ip link add name dummy2 up type dummy # ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2 RTNETLINK answers: No route to host # ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2 onlink Error: Nexthop has invalid gateway or device mismatch. This is intentional according to commit fc1e64e1092f ("net/ipv6: Add support for onlink flag") which added IPv6 "onlink" support and states that "any unicast gateway is allowed as long as the gateway is not a local address and if it resolves it must match the given device". The condition was later relaxed in commit 4ed591c8ab44 ("net/ipv6: Allow onlink routes to have a device mismatch if it is the default route") to allow for a nexthop device mismatch if the gateway address is resolved via the default route: # ip link add name dummy1 up type dummy # ip route add ::/0 dev dummy1 # ip link add name dummy2 up type dummy # ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2 RTNETLINK answers: No route to host # ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2 onlink # echo $? 0 While the decision to forbid a nexthop device mismatch in IPv6 seems to be intentional, it is unclear why it was made. Especially when it differs from IPv4 and seems to go against the intended behavior of "onlink". Therefore, relax the condition further and allow for a nexthop device mismatch when "onlink" is specified: # ip link add name dummy1 up type dummy # ip address add 2001:db8:1::1/64 dev dummy1 # ip link add name dummy2 up type dummy # ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2 onlink # echo $? 0 The motivating use case is the fact that FRR would like to be able to configure overlay routes of the following form: # ip route add <host-Z> vrf <VRF> encap ip id <ID> src <VTEP-A> dst <VTEP-Z> via <VTEP-Z> dev vxlan0 onlink Where vxlan0 is in the default VRF in which "VTEP-Z" is reachable via one of the underlay routes (e.g., via swpX). Without this patch, the above only works with IPv4, but not with IPv6. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260111120813.159799-5-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13selftests: fib-onlink: Add a test case for IPv4 multicast gatewayIdo Schimmel
A multicast gateway address should be rejected when "onlink" is specified, but it is only tested as part of the IPv6 tests. Add an equivalent IPv4 test. # ./fib-onlink-tests.sh -v [...] COMMAND: ip ro add table 254 169.254.101.12/32 via 233.252.0.1 dev veth1 onlink Error: Nexthop has invalid gateway. TEST: Invalid gw - multicast address [ OK ] [...] COMMAND: ip ro add table 1101 169.254.102.12/32 via 233.252.0.1 dev veth5 onlink Error: Nexthop has invalid gateway. TEST: Invalid gw - multicast address, VRF [ OK ] [...] Tests passed: 37 Tests failed: 0 Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260111120813.159799-4-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13selftests: fib-onlink: Remove "wrong nexthop device" IPv6 testsIdo Schimmel
The command in the test fails as expected because IPv6 forbids a nexthop device mismatch: # ./fib-onlink-tests.sh -v [...] COMMAND: ip -6 ro add table 1101 2001:db8:102::103/128 via 2001:db8:701::64 dev veth5 onlink Error: Nexthop has invalid gateway or device mismatch. TEST: Gateway resolves to wrong nexthop device - VRF [ OK ] [...] Where: # ip route get 2001:db8:701::64 vrf lisa 2001:db8:701::64 dev veth7 table 1101 proto kernel src 2001:db8:701::1 metric 256 pref medium This is in contrast to IPv4 where a nexthop device mismatch is allowed when "onlink" is specified: # ip route get 169.254.7.2 vrf lisa 169.254.7.2 dev veth7 table 1101 src 169.254.7.1 uid 0 # ip ro add table 1101 169.254.102.103/32 via 169.254.7.2 dev veth5 onlink # echo $? 0 Remove these tests in preparation for aligning IPv6 with IPv4 and allowing nexthop device mismatch when "onlink" is specified. A subsequent patch will add tests that verify that both address families allow a nexthop device mismatch with "onlink". Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260111120813.159799-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13selftests: fib-onlink: Remove "wrong nexthop device" IPv4 testsIdo Schimmel
According to the test description, these tests fail because of a wrong nexthop device: # ./fib-onlink-tests.sh -v [...] COMMAND: ip ro add table 254 169.254.101.102/32 via 169.254.3.1 dev veth1 onlink Error: Nexthop has invalid gateway. TEST: Gateway resolves to wrong nexthop device [ OK ] COMMAND: ip ro add table 1101 169.254.102.103/32 via 169.254.7.1 dev veth5 onlink Error: Nexthop has invalid gateway. TEST: Gateway resolves to wrong nexthop device - VRF [ OK ] [...] But this is incorrect. They fail because the gateway addresses are local addresses: # ip -4 address show [...] 28: veth3@if27: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 link-netns peer_ns-Urqh3o inet 169.254.3.1/24 scope global veth3 [...] 32: veth7@if31: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master lisa state UP group default qlen 1000 link-netns peer_ns-Urqh3o inet 169.254.7.1/24 scope global veth7 Therefore, using a local address that matches the nexthop device fails as well: # ip ro add table 254 169.254.101.102/32 via 169.254.3.1 dev veth3 onlink Error: Nexthop has invalid gateway. Using a gateway address with a "wrong" nexthop device is actually valid and allowed: # ip route get 169.254.1.2 169.254.1.2 dev veth1 src 169.254.1.1 uid 0 # ip ro add table 254 169.254.101.102/32 via 169.254.1.2 dev veth3 onlink # echo $? 0 Remove these tests given that their output is confusing and that the scenario that they are testing is already covered by other tests. A subsequent patch will add tests for the nexthop device mismatch scenario. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260111120813.159799-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13Merge branch 'net-phy-introduce-phy-ports-representation'Jakub Kicinski
Maxime Chevallier says: ==================== net: phy: Introduce PHY ports representation A few important notes: - This is only a first phase. It instantiates the port, and leverage that to make the MAC <-> PHY <-> SFP usecase simpler. - Next phase will deal with controlling the port state, as well as the netlink uAPI for that. - The end-goal is to enable support for complex port MUX. This preliminary work focuses on PHY-driven ports, but this will be extended to support muxing at the MII level (Multi-phy, or compo PHY + SFP as found on Turris Omnia for example). - The naming is definitely not set in stone. I named that "phy_port", but this may convey the false sense that this is phylib-specific. Even the word "port" is not that great, as it already has several different meanings in the net world (switch port, devlink port, etc.). I used the term "connector" in the binding. A bit of history on that work : The end goal that I personnaly want to achieve is : + PHY - RJ45 | MAC - MUX -+ PHY - RJ45 After many discussions here on netdev@, but also at netdevconf[1] and LPC[2], there appears to be several analoguous designs that exist out there. [1] : https://netdevconf.info/0x17/sessions/talk/improving-multi-phy-and-multi-port-interfaces.html [2] : https://lpc.events/event/18/contributions/1964/ (video isn't the right one) Take the MAchiatobin, it has 2 interfaces that looks like this : MAC - PHY -+ RJ45 | + SFP - Whatever the module does Now, looking at the Turris Omnia, we have : MAC - MUX -+ PHY - RJ45 | + SFP - Whatever the module does We can find more example of this kind of designs, the common part is that we expose multiple front-facing media ports. This is what this current work aims at supporting. As of right now, it does'nt add any support for muxing, but this will come later on. This first phase focuses on phy-driven ports only, but there are already quite some challenges already. For one, we can't really autodetect how many ports are sitting behind a PHY. That's why this series introduces a new binding. Describing ports in DT should however be a last-resort thing when we need to clear some ambiguity about the PHY media-side. The only use-cases that we have today for multi-port PHYs are combo PHYs that drive both a Copper port and an SFP (the Macchiatobin case). This in itself is challenging and this series only addresses part of this support, by registering a phy_port for the PHY <-> SFP connection. The SFP module should in the end be considered as a port as well, but that's not yet the case. However, because now PHYs can register phy_ports for every media-side interface they have, they can register the capabilities of their ports, which allows making the PHY-driver SFP case much more generic. ==================== Link: https://patch.msgid.link/20260108080041.553250-1-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13Documentation: networking: Document the phy_port infrastructureMaxime Chevallier
This documentation aims at describing the main goal of the phy_port infrastructure. Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-15-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: phy: Only rely on phy_port for PHY-driven SFPMaxime Chevallier
Now that all PHY drivers that support downstream SFP have been converted to phy_port serdes handling, we can make the generic PHY SFP handling mandatory, thus making all phylib sfp helpers static. Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-14-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: phy: qca807x: Support SFP through phy_port interfaceMaxime Chevallier
QCA8072/8075 may be used as combo-port PHYs, with Serdes (100/1000BaseX) and Copper interfaces. The PHY has the ability to read the configuration it's in. If the configuration indicates the PHY is in combo mode, allow registering up to 2 ports. Register a dedicated set of port ops to handle the serdes port, and rely on generic phylib SFP support for the SFP handling. Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-13-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: phy: at803x: Support SFP through phy_port interfaceMaxime Chevallier
Convert the at803x driver to use the generic phylib SFP handling, via a dedicated .attach_port() callback, populating the supported interfaces. As these devices are limited to 1000BaseX, a workaround is used to also support, in a very limited way, copper modules. This is done by supporting SGMII but limiting it to 1G full duplex (in which case it's somewhat compatible with 1000BaseX). Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-12-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: phy: marvell10g: Support SFP through phy_portMaxime Chevallier
Convert the Marvell10G driver to use the generic SFP handling, through a dedicated .attach_port() handler to populate the port's supported interfaces. As the 88x3310 supports multiple MDI, the .attach_port() logic handles both SFP attach with 10GBaseR support, and support for the "regular" port that usually is a BaseT port. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-11-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: phy: marvell: Support SFP through phy_port interfaceMaxime Chevallier
Convert the Marvell driver (especially the 88e1512 driver) to use the phy_port interface to handle SFPs. This means registering a .attach_port() handler to detect when a serdes line interface is used (most likely, and SFP module). Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-10-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: phy: marvell-88x2222: Support SFP through phy_port interfaceMaxime Chevallier
The 88x2222 PHY from Marvell only supports serialised modes as its line-facing interfaces. Convert that driver to the generic phylib SFP handling. Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-9-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: phy: Introduce generic SFP handling for PHY driversMaxime Chevallier
There are currently 4 PHY drivers that can drive downstream SFPs: marvell.c, marvell10g.c, at803x.c and marvell-88x2222.c. Most of the logic is boilerplate, either calling into generic phylib helpers (for SFP PHY attach, bus attach, etc.) or performing the same tasks with a bit of validation : - Getting the module's expected interface mode - Making sure the PHY supports it - Optionaly perform some configuration to make sure the PHY outputs the right mode This can be made more generic by leveraging the phy_port, and its configure_mii() callback which allows setting a port's interfaces when the port is a serdes. Introduce a generic PHY SFP support. If a driver doesn't probe the SFP bus itself, but an SFP phandle is found in devicetree/firmware, then the generic PHY SFP support will be used, relying on port ops. PHY driver need to : - Register a .attach_port() callback - When a serdes port is registered to the PHY, drivers must set port->interfaces to the set of PHY_INTERFACE_MODE the port can output - If the port has limitations regarding speed, duplex and aneg, the port can also fine-tune the final linkmodes that can be supported - The port may register a set of ops, including .configure_mii(), that will be called at module_insert time to adjust the interface based on the module detected. Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-8-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: phy: Create a phy_port for PHY-driven SFPsMaxime Chevallier
Some PHY devices may be used as media-converters to drive SFP ports (for example, to allow using SFP when the SoC can only output RGMII). This is already supported to some extend by allowing PHY drivers to registers themselves as being SFP upstream. However, the logic to drive the SFP can actually be split to a per-port control logic, allowing support for multi-port PHYs, or PHYs that can either drive SFPs or Copper. To that extent, create a phy_port when registering an SFP bus onto a PHY. This port is considered a "serdes" port, in that it can feed data to another entity on the link. The PHY driver needs to specify the various PHY_INTERFACE_MODE_XXX that this port supports. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-7-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13dt-bindings: net: dp83822: Deprecate ti,fiber-modeMaxime Chevallier
The newly added ethernet-connector binding allows describing an Ethernet connector with greater precision, and in a more generic manner, than ti,fiber-mode. Deprecate this property. Reviewed-by: Rob Herring (Arm) <robh@kernel.org> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-6-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: phy: dp83822: Add support for phy_port representationMaxime Chevallier
With the phy_port representation introduced, we can use .attach_port to populate the port information based on either the straps or the ti,fiber-mode property. This allows simplifying the probe function and allow users to override the strapping configuration. Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-5-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: phy: Introduce PHY ports representationMaxime Chevallier
Ethernet provides a wide variety of layer 1 protocols and standards for data transmission. The front-facing ports of an interface have their own complexity and configurability. Introduce a representation of these front-facing ports. The current code is minimalistic and only support ports controlled by PHY devices, but the plan is to extend that to SFP as well as raw Ethernet MACs that don't use PHY devices. This minimal port representation allows describing the media and number of pairs of a BaseT port. From that information, we can derive the linkmodes usable on the port, which can be used to limit the capabilities of an interface. For now, the port pairs and medium is derived from devicetree, defined by the PHY driver, or populated with default values (as we assume that all PHYs expose at least one port). The typical example is 100M ethernet. 100BaseTX works using only 2 pairs on a Cat 5 cables. However, in the situation where a 10/100/1000 capable PHY is wired to its RJ45 port through 2 pairs only, we have no way of detecting that. The "max-speed" DT property can be used, but a more accurate representation can be used : mdi { connector-0 { media = "BaseT"; pairs = <2>; }; }; From that information, we can derive the max speed reachable on the port. Another benefit of having that is to avoid vendor-specific DT properties (micrel,fiber-mode or ti,fiber-mode). This basic representation is meant to be expanded, by the introduction of port ops, userspace listing of ports, and support for multi-port devices. Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20260108080041.553250-4-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net: ethtool: Introduce ETHTOOL_LINK_MEDIUM_* valuesMaxime Chevallier
In an effort to have a better representation of Ethernet ports, introduce enumeration values representing the various ethernet Mediums. This is part of the 802.3 naming convention, for example : 1000 Base T 4 | | | | | | | \_ pairs (4) | | \___ Medium (T == Twisted Copper Pairs) | \_______ Baseband transmission \____________ Speed Other example : 10000 Base K X 4 | | \_ lanes (4) | \___ encoding (BaseX is 8b/10b while BaseR is 66b/64b) \_____ Medium (K is backplane ethernet) In the case of representing a physical port, only the medium and number of pairs should be relevant. One exception would be 1000BaseX, which is currently also used as a medium in what appears to be any of 1000BaseSX, 1000BaseCX, 1000BaseLX, 1000BaseEX, 1000BaseBX10 and some other. This was reflected in the mediums associated with the 1000BaseX linkmode. These mediums are set in the net/ethtool/common.c lookup table that maintains a list of all linkmodes with their number of pairs, medium, encoding, speed and duplex. One notable exception to this is 100BaseT Ethernet. It emcompasses 100BaseTX, which is a 2-pairs protocol but also 100BaseT4, that will also work on 4-pairs cables. As we don't make a disctinction between these, the lookup table contains 2 sets of pair numbers, indicating the min number of pairs for a protocol to work and the "nominal" number of pairs as well. Another set of exceptions are linkmodes such 100000baseLR4_ER4, where the same link mode seems to represent 100GBaseLR4 and 100GBaseER4. The macro __DEFINE_LINK_MODE_PARAMS_MEDIUMS is here used to populate the .mediums bitfield with all appropriate mediums. Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20260108080041.553250-3-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13dt-bindings: net: Introduce the ethernet-connector descriptionMaxime Chevallier
The ability to describe the physical ports of Ethernet devices is useful to describe multi-port devices, as well as to remove any ambiguity with regard to the nature of the port. Moreover, describing ports allows for a better description of features that are tied to connectors, such as PoE through the PSE-PD devices. Introduce a binding to allow describing the ports, for now with 2 attributes : - The number of pairs, which is a quite generic property that allows differentating between multiple similar technologies such as BaseT1 and "regular" BaseT (which usually means BaseT4). - The media that can be used on that port, such as BaseT for Twisted Copper, BaseC for coax copper, BaseS/L for Fiber, BaseK for backplane ethernet, etc. This allows defining the nature of the port, and therefore avoids the need for vendor-specific properties such as "micrel,fiber-mode" or "ti,fiber-mode". The port description lives in its own file, as it is intended in the future to allow describing the ports for phy-less devices. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Rob Herring (Arm) <robh@kernel.org> Link: https://patch.msgid.link/20260108080041.553250-2-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-14io_uring/zcrx: document area chunking parameterPavel Begunkov
struct io_uring_zcrx_ifq_reg::rx_buf_len is used as a hint specifying the kernel what buffer size it should use. Document the API and limitations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2026-01-14selftests: iou-zcrx: test large chunk sizesPavel Begunkov
Add a test using large chunks for zcrx memory area. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2026-01-14eth: bnxt: support qcfg provided rx page sizePavel Begunkov
Implement support for qcfg provided rx page sizes. For that, implement the ndo_default_qcfg callback and validate the config on restart. Also, use the current config's value in bnxt_init_ring_struct to retain the correct size across resets. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2026-01-14eth: bnxt: adjust the fill level of agg queues with larger buffersJakub Kicinski
The driver tries to provision more agg buffers than header buffers since multiple agg segments can reuse the same header. The calculation / heuristic tries to provide enough pages for 65k of data for each header (or 4 frags per header if the result is too big). This calculation is currently global to the adapter. If we increase the buffer sizes 8x we don't want 8x the amount of memory sitting on the rings. Luckily we don't have to fill the rings completely, adjust the fill level dynamically in case particular queue has buffers larger than the global size. Signed-off-by: Jakub Kicinski <kuba@kernel.org> [pavel: rebase on top of agg_size_fac] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2026-01-14eth: bnxt: store rx buffer size per queuePavel Begunkov
Instead of using a constant buffer length, allow configuring the size for each queue separately. There is no way to change the length yet, and it'll be passed from memory providers in a later patch. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2026-01-14net: pass queue rx page size from memory providerPavel Begunkov
Allow memory providers to configure rx queues with a custom receive page size. It's passed in struct pp_memory_provider_params, which is copied into the queue, so it's preserved across queue restarts. Then, it's propagated to the driver in a new queue config parameter. Drivers should explicitly opt into using it by setting QCFG_RX_PAGE_SIZE, in which case they should implement ndo_default_qcfg, validate the size on queue restart and honour the current config in case of a reset. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2026-01-14net: add bare bone queue configsPavel Begunkov
We'll need to pass extra parameters when allocating a queue for memory providers. Define a new structure for queue configurations, and pass it to qapi callbacks. It's empty for now, actual parameters will be added in following patches. Configurations should persist across resets, and for that they're default-initialised on device registration and stored in struct netdev_rx_queue. We also add a new qapi callback for defaulting a given config. It must be implemented if a driver wants to use queue configs and is optional otherwise. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2026-01-14net: reduce indent of struct netdev_queue_mgmt_ops membersJakub Kicinski
Trivial change, reduce the indent. I think the original is copied from real NDOs. It's unnecessarily deep, makes passing struct args problematic. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2026-01-14net: memzero mp params when closing a queuePavel Begunkov
Instead of resetting memory provider parameters one by one in __net_mp_{open,close}_rxq, memzero the entire structure. It'll be used to extend the structure. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2026-01-13Merge branch 'selftests-drv-net-gro-enable-hw-gro-and-lro-testing'Jakub Kicinski
Jakub Kicinski says: ==================== selftests: drv-net: gro: enable HW GRO and LRO testing Add support for running our existing GRO test against HW GRO and LRO implementation. The first 3 patches are just ksft lib nice-to-haves, and patch 4 cleans up the existing gro Python. Patches 5 and 6 are of most practical interest. The support reconfiguring the NIC to disable SW GRO and enable HW GRO and LRO. Additionally last patch breaks up the existing GRO cases to track HW compliance at finer granularity. ==================== Link: https://patch.msgid.link/20260113000740.255360-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13selftests: drv-net: gro: break out all individual test casesJakub Kicinski
GRO test groups the cases into categories, e.g. "tcp" case checks coalescing in presence of: - packets with bad csum, - sequence number mismatch, - timestamp option value mismatch, - different TCP options. Since we now have TAP support grouping the cases like that lowers our reporting granularity. This matters even more for NICs performing HW GRO and LRO since it appears that most implementation have _some_ bugs. Flagging the whole group of tests as failed prevents us from catching regressions in the things that work today. Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260113000740.255360-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13selftests: drv-net: gro: run the test against HW GRO and LROJakub Kicinski
Run the test against HW GRO and LRO. NICs I have pass the base cases. Interestingly all are happy to build GROs larger than 64k. Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260113000740.255360-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13selftests: drv-net: gro: improve feature configJakub Kicinski
We'll need to do a lot more feature handling to test HW-GRO and LRO. Clean up the feature handling for SW GRO a bit to let the next commit focus on the new test cases, only. Make sure HW GRO-like features are not enabled for the SW tests. Be more careful about changing features as "nothing changed" situations may result in non-zero error code from ethtool. Don't disable TSO on the local interface (receiver) when running over netdevsim, we just want GSO to break up the segments on the sender. Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260113000740.255360-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13selftests: drv-net: gro: use cmd printJakub Kicinski
Now that cmd() can be printed directly remove the old formatting. Before: # fragmented ip6 doesn't coalesce: # Expected {200 100 100 }, Total 3 packets # Received {200 100 }, Total 2 packets. # /root/ksft-net-drv/drivers/net/gro: incorrect number of packets Now: # CMD: drivers/net/gro --ipv6 --dmac 9e:[...] # EXIT: 1 # STDOUT: fragmented ip6 doesn't coalesce: # STDERR: Expected {200 100 100 }, Total 3 packets # Received {200 100 }, Total 2 packets. # /root/ksft-net-drv/drivers/net/gro: incorrect number of packets Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20260113000740.255360-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13selftests: net: py: teach cmd() how to print itselfJakub Kicinski
Teach cmd() how to print itself, to make debug prints easier. Example output (leading # due to ksft_pr()): # CMD: /root/ksft-net-drv/drivers/net/gro # EXIT: 1 # STDOUT: ipv6 with ext header does coalesce: # STDERR: Expected {200 }, Total 1 packets # Received {100 [!=200]100 [!=0]}, Total 2 packets. Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20260113000740.255360-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13selftests: net: py: teach ksft_pr() multi-line safetyJakub Kicinski
Make printing multi-line logs easier by automatically prefixing each line in ksft_pr(). Make use of this when formatting exceptions. Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20260113000740.255360-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13tools/net/ynl: suppress jobserver warning in ynltool version detectionBobby Eshleman
When building ynltool with parallel make (-jN), a warning is emitted: make[1]: warning: jobserver unavailable: using -j1. Add '+' to parent make rule. The warning trips up local runs of NIPA's ingest_mdir.py, which correctly fails on make warnings. This occurs because SRC_VERSION uses $(shell make ...) to make kernelversion. The $(shell) function inherits make's MAKEFLAGS env var which specifies "--jobserver-auth=R,W" pointing to file descriptors that the invoked make sub-shell does not have access to. Observed with: $ make --version | head -1 GNU Make 4.3 Instead of suppressing MAKEFLAGS and foregoing all future MAKEFLAGS (some of which may be desirable, such as variable overrides) or introducing a new make target, we instead just ignore the warning by piping stderr to /dev/null. If 'make kernelversion' fails, the ' || echo "unknown"' phrase will catch the failure. Before: NIPA ingest_mdir.py: ynl Full series FAIL (1) Generated files up to date; build has 1 warnings/errors; no diff in generated; After: NIPA ingest_mdir.py: Series level tests: ynl OKAY Validated output: $ ./ynltool/ynltool --version ynltool 6.19.0-rc4 Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260112-ynl-make-fix-v1-1-c399e76925ad@meta.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13Merge branch 'mlx5-next' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Tariq Toukan says: ==================== mlx5-next updates 2026-01-13 * 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: net/mlx5: Add IFC bits for extended ETS rate limit bandwidth value net/mlx5: Add support for querying bond speed net/mlx5: Handle port and vport speed change events in MPESW net/mlx5: Propagate LAG effective max_tx_speed to vports net/mlx5: Add max_tx_speed and its CAP bit to IFC ==================== Link: https://patch.msgid.link/1768299471-1603093-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13KVM: SVM: Assert that Hyper-V's HV_SVM_EXITCODE_ENL == SVM_EXIT_SWSean Christopherson
Add a build-time assertiont that Hyper-V's "enlightened" exit code is that, same as the AMD-defined "Reserved for Host" exit code, mostly to help readers connect the dots and understand why synthesizing a software-defined exit code is safe/ok. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://patch.msgid.link/20251230211347.4099600-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-13KVM: SVM: Harden exit_code against being used in Spectre-like attacksSean Christopherson
Explicitly clamp the exit code used to index KVM's exit handlers to guard against Spectre-like attacks, mainly to provide consistency between VMX and SVM (VMX was given the same treatment by commit c926f2f7230b ("KVM: x86: Protect exit_reason from being used in Spectre-v1/L1TF attacks"). For normal VMs, it's _extremely_ unlikely the exit code could be used to exploit a speculation vulnerability, as the exit code is set by hardware and unexpected/unknown exit codes should be quite well bounded (as is/was the case with VMX). But with SEV-ES+, the exit code is guest-controlled as it comes from the GHCB, not from hardware, i.e. an attack from the guest is at least somewhat plausible. Irrespective of SEV-ES+, hardening KVM is easy and inexpensive, and such an attack is theoretically possible. Link: https://patch.msgid.link/20251230211347.4099600-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>