| Age | Commit message (Collapse) | Author |
|
The RPS bitmask bounds check uses ~(RPS_MAX_CPUS - 1) which equals ~15 =
0xfff0, only allowing CPUs 0-3.
Change the mask to ~((1UL << RPS_MAX_CPUS) - 1) = ~0xffff to allow CPUs
0-15.
Fixes: 5ebfb4cc3048 ("selftests/net: toeplitz test")
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260112173715.384843-3-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The toeplitz.py test passed the hex mask without "0x" prefix (e.g.,
"300" for CPUs 8,9). The toeplitz.c strtoul() call wrongly parsed this
as decimal 300 (0x12c) instead of hex 0x300.
Pass the prefixed mask to toeplitz.c, and the unprefixed one to sysfs.
Fixes: 9cf9aa77a1f6 ("selftests: drv-net: hw: convert the Toeplitz test to Python")
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260112173715.384843-2-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
syzbot reported use-after-free of inet6_ifaddr in
inet6_addr_del(). [0]
The cited commit accidentally moved ipv6_del_addr() for
mngtmpaddr before reading its ifp->flags for temporary
addresses in inet6_addr_del().
Let's move ipv6_del_addr() down to fix the UAF.
[0]:
BUG: KASAN: slab-use-after-free in inet6_addr_del.constprop.0+0x67a/0x6b0 net/ipv6/addrconf.c:3117
Read of size 4 at addr ffff88807b89c86c by task syz.3.1618/9593
CPU: 0 UID: 0 PID: 9593 Comm: syz.3.1618 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0xcd/0x630 mm/kasan/report.c:482
kasan_report+0xe0/0x110 mm/kasan/report.c:595
inet6_addr_del.constprop.0+0x67a/0x6b0 net/ipv6/addrconf.c:3117
addrconf_del_ifaddr+0x11e/0x190 net/ipv6/addrconf.c:3181
inet6_ioctl+0x1e5/0x2b0 net/ipv6/af_inet6.c:582
sock_do_ioctl+0x118/0x280 net/socket.c:1254
sock_ioctl+0x227/0x6b0 net/socket.c:1375
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:597 [inline]
__se_sys_ioctl fs/ioctl.c:583 [inline]
__x64_sys_ioctl+0x18e/0x210 fs/ioctl.c:583
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xcd/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f164cf8f749
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f164de64038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f164d1e5fa0 RCX: 00007f164cf8f749
RDX: 0000200000000000 RSI: 0000000000008936 RDI: 0000000000000003
RBP: 00007f164d013f91 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f164d1e6038 R14: 00007f164d1e5fa0 R15: 00007ffde15c8288
</TASK>
Allocated by task 9593:
kasan_save_stack+0x33/0x60 mm/kasan/common.c:56
kasan_save_track+0x14/0x30 mm/kasan/common.c:77
poison_kmalloc_redzone mm/kasan/common.c:397 [inline]
__kasan_kmalloc+0xaa/0xb0 mm/kasan/common.c:414
kmalloc_noprof include/linux/slab.h:957 [inline]
kzalloc_noprof include/linux/slab.h:1094 [inline]
ipv6_add_addr+0x4e3/0x2010 net/ipv6/addrconf.c:1120
inet6_addr_add+0x256/0x9b0 net/ipv6/addrconf.c:3050
addrconf_add_ifaddr+0x1fc/0x450 net/ipv6/addrconf.c:3160
inet6_ioctl+0x103/0x2b0 net/ipv6/af_inet6.c:580
sock_do_ioctl+0x118/0x280 net/socket.c:1254
sock_ioctl+0x227/0x6b0 net/socket.c:1375
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:597 [inline]
__se_sys_ioctl fs/ioctl.c:583 [inline]
__x64_sys_ioctl+0x18e/0x210 fs/ioctl.c:583
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xcd/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task 6099:
kasan_save_stack+0x33/0x60 mm/kasan/common.c:56
kasan_save_track+0x14/0x30 mm/kasan/common.c:77
kasan_save_free_info+0x3b/0x60 mm/kasan/generic.c:584
poison_slab_object mm/kasan/common.c:252 [inline]
__kasan_slab_free+0x5f/0x80 mm/kasan/common.c:284
kasan_slab_free include/linux/kasan.h:234 [inline]
slab_free_hook mm/slub.c:2540 [inline]
slab_free_freelist_hook mm/slub.c:2569 [inline]
slab_free_bulk mm/slub.c:6696 [inline]
kmem_cache_free_bulk mm/slub.c:7383 [inline]
kmem_cache_free_bulk+0x2bf/0x680 mm/slub.c:7362
kfree_bulk include/linux/slab.h:830 [inline]
kvfree_rcu_bulk+0x1b7/0x1e0 mm/slab_common.c:1523
kvfree_rcu_drain_ready mm/slab_common.c:1728 [inline]
kfree_rcu_monitor+0x1d0/0x2f0 mm/slab_common.c:1801
process_one_work+0x9ba/0x1b20 kernel/workqueue.c:3257
process_scheduled_works kernel/workqueue.c:3340 [inline]
worker_thread+0x6c8/0xf10 kernel/workqueue.c:3421
kthread+0x3c5/0x780 kernel/kthread.c:463
ret_from_fork+0x983/0xb10 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246
Fixes: 00b5b7aab9e42 ("net/ipv6: delete temporary address if mngtmpaddr is removed or unmanaged")
Reported-by: syzbot+72e610f4f1a930ca9d8a@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/696598e9.050a0220.3be5c5.0009.GAE@google.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260113010538.2019411-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
syzbot was able to crash the kernel in rt6_uncached_list_flush_dev()
in an interesting way [1]
Crash happens in list_del_init()/INIT_LIST_HEAD() while writing
list->prev, while the prior write on list->next went well.
static inline void INIT_LIST_HEAD(struct list_head *list)
{
WRITE_ONCE(list->next, list); // This went well
WRITE_ONCE(list->prev, list); // Crash, @list has been freed.
}
Issue here is that rt6_uncached_list_del() did not attempt to lock
ul->lock, as list_empty(&rt->dst.rt_uncached) returned
true because the WRITE_ONCE(list->next, list) happened on the other CPU.
We might use list_del_init_careful() and list_empty_careful(),
or make sure rt6_uncached_list_del() always grabs the spinlock
whenever rt->dst.rt_uncached_list has been set.
A similar fix is neeed for IPv4.
[1]
BUG: KASAN: slab-use-after-free in INIT_LIST_HEAD include/linux/list.h:46 [inline]
BUG: KASAN: slab-use-after-free in list_del_init include/linux/list.h:296 [inline]
BUG: KASAN: slab-use-after-free in rt6_uncached_list_flush_dev net/ipv6/route.c:191 [inline]
BUG: KASAN: slab-use-after-free in rt6_disable_ip+0x633/0x730 net/ipv6/route.c:5020
Write of size 8 at addr ffff8880294cfa78 by task kworker/u8:14/3450
CPU: 0 UID: 0 PID: 3450 Comm: kworker/u8:14 Tainted: G L syzkaller #0 PREEMPT_{RT,(full)}
Tainted: [L]=SOFTLOCKUP
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025
Workqueue: netns cleanup_net
Call Trace:
<TASK>
dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0xca/0x240 mm/kasan/report.c:482
kasan_report+0x118/0x150 mm/kasan/report.c:595
INIT_LIST_HEAD include/linux/list.h:46 [inline]
list_del_init include/linux/list.h:296 [inline]
rt6_uncached_list_flush_dev net/ipv6/route.c:191 [inline]
rt6_disable_ip+0x633/0x730 net/ipv6/route.c:5020
addrconf_ifdown+0x143/0x18a0 net/ipv6/addrconf.c:3853
addrconf_notify+0x1bc/0x1050 net/ipv6/addrconf.c:-1
notifier_call_chain+0x19d/0x3a0 kernel/notifier.c:85
call_netdevice_notifiers_extack net/core/dev.c:2268 [inline]
call_netdevice_notifiers net/core/dev.c:2282 [inline]
netif_close_many+0x29c/0x410 net/core/dev.c:1785
unregister_netdevice_many_notify+0xb50/0x2330 net/core/dev.c:12353
ops_exit_rtnl_list net/core/net_namespace.c:187 [inline]
ops_undo_list+0x3dc/0x990 net/core/net_namespace.c:248
cleanup_net+0x4de/0x7b0 net/core/net_namespace.c:696
process_one_work kernel/workqueue.c:3257 [inline]
process_scheduled_works+0xad1/0x1770 kernel/workqueue.c:3340
worker_thread+0x8a0/0xda0 kernel/workqueue.c:3421
kthread+0x711/0x8a0 kernel/kthread.c:463
ret_from_fork+0x510/0xa50 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246
</TASK>
Allocated by task 803:
kasan_save_stack mm/kasan/common.c:57 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
unpoison_slab_object mm/kasan/common.c:340 [inline]
__kasan_slab_alloc+0x6c/0x80 mm/kasan/common.c:366
kasan_slab_alloc include/linux/kasan.h:253 [inline]
slab_post_alloc_hook mm/slub.c:4953 [inline]
slab_alloc_node mm/slub.c:5263 [inline]
kmem_cache_alloc_noprof+0x18d/0x6c0 mm/slub.c:5270
dst_alloc+0x105/0x170 net/core/dst.c:89
ip6_dst_alloc net/ipv6/route.c:342 [inline]
icmp6_dst_alloc+0x75/0x460 net/ipv6/route.c:3333
mld_sendpack+0x683/0xe60 net/ipv6/mcast.c:1844
mld_send_cr net/ipv6/mcast.c:2154 [inline]
mld_ifc_work+0x83e/0xd60 net/ipv6/mcast.c:2693
process_one_work kernel/workqueue.c:3257 [inline]
process_scheduled_works+0xad1/0x1770 kernel/workqueue.c:3340
worker_thread+0x8a0/0xda0 kernel/workqueue.c:3421
kthread+0x711/0x8a0 kernel/kthread.c:463
ret_from_fork+0x510/0xa50 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246
Freed by task 20:
kasan_save_stack mm/kasan/common.c:57 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:584
poison_slab_object mm/kasan/common.c:253 [inline]
__kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285
kasan_slab_free include/linux/kasan.h:235 [inline]
slab_free_hook mm/slub.c:2540 [inline]
slab_free mm/slub.c:6670 [inline]
kmem_cache_free+0x18f/0x8d0 mm/slub.c:6781
dst_destroy+0x235/0x350 net/core/dst.c:121
rcu_do_batch kernel/rcu/tree.c:2605 [inline]
rcu_core kernel/rcu/tree.c:2857 [inline]
rcu_cpu_kthread+0xba5/0x1af0 kernel/rcu/tree.c:2945
smpboot_thread_fn+0x542/0xa60 kernel/smpboot.c:160
kthread+0x711/0x8a0 kernel/kthread.c:463
ret_from_fork+0x510/0xa50 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246
Last potentially related work creation:
kasan_save_stack+0x3e/0x60 mm/kasan/common.c:57
kasan_record_aux_stack+0xbd/0xd0 mm/kasan/generic.c:556
__call_rcu_common kernel/rcu/tree.c:3119 [inline]
call_rcu+0xee/0x890 kernel/rcu/tree.c:3239
refdst_drop include/net/dst.h:266 [inline]
skb_dst_drop include/net/dst.h:278 [inline]
skb_release_head_state+0x71/0x360 net/core/skbuff.c:1156
skb_release_all net/core/skbuff.c:1180 [inline]
__kfree_skb net/core/skbuff.c:1196 [inline]
sk_skb_reason_drop+0xe9/0x170 net/core/skbuff.c:1234
kfree_skb_reason include/linux/skbuff.h:1322 [inline]
tcf_kfree_skb_list include/net/sch_generic.h:1127 [inline]
__dev_xmit_skb net/core/dev.c:4260 [inline]
__dev_queue_xmit+0x26aa/0x3210 net/core/dev.c:4785
NF_HOOK_COND include/linux/netfilter.h:307 [inline]
ip6_output+0x340/0x550 net/ipv6/ip6_output.c:247
NF_HOOK+0x9e/0x380 include/linux/netfilter.h:318
mld_sendpack+0x8d4/0xe60 net/ipv6/mcast.c:1855
mld_send_cr net/ipv6/mcast.c:2154 [inline]
mld_ifc_work+0x83e/0xd60 net/ipv6/mcast.c:2693
process_one_work kernel/workqueue.c:3257 [inline]
process_scheduled_works+0xad1/0x1770 kernel/workqueue.c:3340
worker_thread+0x8a0/0xda0 kernel/workqueue.c:3421
kthread+0x711/0x8a0 kernel/kthread.c:463
ret_from_fork+0x510/0xa50 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246
The buggy address belongs to the object at ffff8880294cfa00
which belongs to the cache ip6_dst_cache of size 232
The buggy address is located 120 bytes inside of
freed 232-byte region [ffff8880294cfa00, ffff8880294cfae8)
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x294cf
memcg:ffff88803536b781
flags: 0x80000000000000(node=0|zone=1)
page_type: f5(slab)
raw: 0080000000000000 ffff88802ff1c8c0 ffffea0000bf2bc0 dead000000000006
raw: 0000000000000000 00000000800c000c 00000000f5000000 ffff88803536b781
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x52820(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP), pid 9, tgid 9 (kworker/0:0), ts 91119585830, free_ts 91088628818
set_page_owner include/linux/page_owner.h:32 [inline]
post_alloc_hook+0x234/0x290 mm/page_alloc.c:1857
prep_new_page mm/page_alloc.c:1865 [inline]
get_page_from_freelist+0x28c0/0x2960 mm/page_alloc.c:3915
__alloc_frozen_pages_noprof+0x181/0x370 mm/page_alloc.c:5210
alloc_pages_mpol+0xd1/0x380 mm/mempolicy.c:2486
alloc_slab_page mm/slub.c:3075 [inline]
allocate_slab+0x86/0x3b0 mm/slub.c:3248
new_slab mm/slub.c:3302 [inline]
___slab_alloc+0xb10/0x13e0 mm/slub.c:4656
__slab_alloc+0xc6/0x1f0 mm/slub.c:4779
__slab_alloc_node mm/slub.c:4855 [inline]
slab_alloc_node mm/slub.c:5251 [inline]
kmem_cache_alloc_noprof+0x101/0x6c0 mm/slub.c:5270
dst_alloc+0x105/0x170 net/core/dst.c:89
ip6_dst_alloc net/ipv6/route.c:342 [inline]
icmp6_dst_alloc+0x75/0x460 net/ipv6/route.c:3333
mld_sendpack+0x683/0xe60 net/ipv6/mcast.c:1844
mld_send_cr net/ipv6/mcast.c:2154 [inline]
mld_ifc_work+0x83e/0xd60 net/ipv6/mcast.c:2693
process_one_work kernel/workqueue.c:3257 [inline]
process_scheduled_works+0xad1/0x1770 kernel/workqueue.c:3340
worker_thread+0x8a0/0xda0 kernel/workqueue.c:3421
kthread+0x711/0x8a0 kernel/kthread.c:463
ret_from_fork+0x510/0xa50 arch/x86/kernel/process.c:158
page last free pid 5859 tgid 5859 stack trace:
reset_page_owner include/linux/page_owner.h:25 [inline]
free_pages_prepare mm/page_alloc.c:1406 [inline]
__free_frozen_pages+0xfe1/0x1170 mm/page_alloc.c:2943
discard_slab mm/slub.c:3346 [inline]
__put_partials+0x149/0x170 mm/slub.c:3886
__slab_free+0x2af/0x330 mm/slub.c:5952
qlink_free mm/kasan/quarantine.c:163 [inline]
qlist_free_all+0x97/0x100 mm/kasan/quarantine.c:179
kasan_quarantine_reduce+0x148/0x160 mm/kasan/quarantine.c:286
__kasan_slab_alloc+0x22/0x80 mm/kasan/common.c:350
kasan_slab_alloc include/linux/kasan.h:253 [inline]
slab_post_alloc_hook mm/slub.c:4953 [inline]
slab_alloc_node mm/slub.c:5263 [inline]
kmem_cache_alloc_noprof+0x18d/0x6c0 mm/slub.c:5270
getname_flags+0xb8/0x540 fs/namei.c:146
getname include/linux/fs.h:2498 [inline]
do_sys_openat2+0xbc/0x200 fs/open.c:1426
do_sys_open fs/open.c:1436 [inline]
__do_sys_openat fs/open.c:1452 [inline]
__se_sys_openat fs/open.c:1447 [inline]
__x64_sys_openat+0x138/0x170 fs/open.c:1447
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xec/0xf80 arch/x86/entry/syscall_64.c:94
Fixes: 8d0b94afdca8 ("ipv6: Keep track of DST_NOCACHE routes in case of iface down/unregister")
Fixes: 78df76a065ae ("ipv4: take rt_uncached_lock only if needed")
Reported-by: syzbot+179fc225724092b8b2b2@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6964cdf2.050a0220.eaf7.009d.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260112103825.3810713-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
RSS configuration requires a valid RX indirection table. When the device
reports a single receive queue, rndis_filter_device_add() does not
allocate an indirection table, accepting RSS hash key updates in this
state leads to a hang.
Fix this by gating netvsc_set_rxfh() on ndc->rx_table_sz and return
-EOPNOTSUPP when the table is absent. This aligns set_rxfh with the device
capabilities and prevents incorrect behavior.
Fixes: 962f3fee83a4 ("netvsc: add ethtool ops to get/set RSS key")
Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1768212093-1594-1-git-send-email-gargaditya@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix a misspelling in the sxgbe_mtl_init() function comment.
"Algorith" should be spelled as "Algorithm".
Signed-off-by: Jinseok Kim <always.starving0@gmail.com>
Link: https://patch.msgid.link/20260112044147.2844-1-always.starving0@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Heiner Kallweit says:
====================
net: phy: fixed_phy: replace list of fixed PHYs with static array
Due to max 32 PHY addresses being available per mii bus, using a list
can't support more fixed PHY's. And there's no known use case for as
much as 32 fixed PHY's on a system. 8 should be plenty of fixed PHY's,
so use an array of that size instead of a list. This allows to
significantly reduce the code size and complexity.
In addition replace heavy-weight IDA with a simple bitmap.
====================
Link: https://patch.msgid.link/110f676d-727c-4575-abe4-e383f98fc38f@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Size of array fmb_fixed_phys is small, so we can use a simple bitmap
instead of an IDA to manage dynamic allocation of fixed PHY's.
find_first_zero_bit() isn't atomic, so we need the loop to rule out
double allocation of a PHY address.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/d4614463-d532-41fc-92e9-ef97107aceb5@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Due to max 32 PHY addresses being available per mii bus, using a list
can't support more fixed PHY's. And there's no known use case for as
much as 32 fixed PHY's on a system. 8 should be plenty of fixed PHY's,
so use an array of that size instead of a list. This allows to
significantly reduce the code size and complexity.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/8610d30c-eac7-4100-9008-d3b6cee6a5cd@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Ido Schimmel says:
====================
ipv6: Allow for nexthop device mismatch with "onlink"
This patchset aligns IPv6 with IPv4 with respect to the "onlink" keyword
and allows IPv6 routes to be configured with a gateway address that is
resolved out of a different interface than the one specified.
Patches #1-#3 are small preparations in the existing "onlink" selftest.
Patch #4 is the actual change. See the commit message for detailed
description and motivation.
Patch #5 adds test cases for both address families, to make sure that
this use case does not regress.
====================
Link: https://patch.msgid.link/20260111120813.159799-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add test cases that verify that when the "onlink" keyword is specified,
both address families (with and without VRF) accept routes with a
gateway address that is reachable via a different interface than the one
specified.
Output without "ipv6: Allow for nexthop device mismatch with "onlink"":
# ./fib-onlink-tests.sh | grep mismatch
TEST: nexthop device mismatch [ OK ]
TEST: nexthop device mismatch [ OK ]
TEST: nexthop device mismatch [FAIL]
TEST: nexthop device mismatch [FAIL]
Output with "ipv6: Allow for nexthop device mismatch with "onlink"":
# ./fib-onlink-tests.sh | grep mismatch
TEST: nexthop device mismatch [ OK ]
TEST: nexthop device mismatch [ OK ]
TEST: nexthop device mismatch [ OK ]
TEST: nexthop device mismatch [ OK ]
That is, the IPv4 tests were always passing, but the IPv6 ones only pass
after the specified patch.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260111120813.159799-6-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
IPv4 allows for a nexthop device mismatch when the "onlink" keyword is
specified:
# ip link add name dummy1 up type dummy
# ip address add 192.0.2.1/24 dev dummy1
# ip link add name dummy2 up type dummy
# ip route add 198.51.100.0/24 nexthop via 192.0.2.2 dev dummy2
Error: Nexthop has invalid gateway.
# ip route add 198.51.100.0/24 nexthop via 192.0.2.2 dev dummy2 onlink
# echo $?
0
This seems to be consistent with the description of "onlink" in the
ip-route man page: "Pretend that the nexthop is directly attached to
this link, even if it does not match any interface prefix".
On the other hand, IPv6 rejects a nexthop device mismatch, even when
"onlink" is specified:
# ip link add name dummy1 up type dummy
# ip address add 2001:db8:1::1/64 dev dummy1
# ip link add name dummy2 up type dummy
# ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2
RTNETLINK answers: No route to host
# ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2 onlink
Error: Nexthop has invalid gateway or device mismatch.
This is intentional according to commit fc1e64e1092f ("net/ipv6: Add
support for onlink flag") which added IPv6 "onlink" support and states
that "any unicast gateway is allowed as long as the gateway is not a
local address and if it resolves it must match the given device".
The condition was later relaxed in commit 4ed591c8ab44 ("net/ipv6: Allow
onlink routes to have a device mismatch if it is the default route") to
allow for a nexthop device mismatch if the gateway address is resolved
via the default route:
# ip link add name dummy1 up type dummy
# ip route add ::/0 dev dummy1
# ip link add name dummy2 up type dummy
# ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2
RTNETLINK answers: No route to host
# ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2 onlink
# echo $?
0
While the decision to forbid a nexthop device mismatch in IPv6 seems to
be intentional, it is unclear why it was made. Especially when it
differs from IPv4 and seems to go against the intended behavior of
"onlink".
Therefore, relax the condition further and allow for a nexthop device
mismatch when "onlink" is specified:
# ip link add name dummy1 up type dummy
# ip address add 2001:db8:1::1/64 dev dummy1
# ip link add name dummy2 up type dummy
# ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2 onlink
# echo $?
0
The motivating use case is the fact that FRR would like to be able to
configure overlay routes of the following form:
# ip route add <host-Z> vrf <VRF> encap ip id <ID> src <VTEP-A> dst <VTEP-Z> via <VTEP-Z> dev vxlan0 onlink
Where vxlan0 is in the default VRF in which "VTEP-Z" is reachable via
one of the underlay routes (e.g., via swpX). Without this patch, the
above only works with IPv4, but not with IPv6.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260111120813.159799-5-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
A multicast gateway address should be rejected when "onlink" is
specified, but it is only tested as part of the IPv6 tests. Add an
equivalent IPv4 test.
# ./fib-onlink-tests.sh -v
[...]
COMMAND: ip ro add table 254 169.254.101.12/32 via 233.252.0.1 dev veth1 onlink
Error: Nexthop has invalid gateway.
TEST: Invalid gw - multicast address [ OK ]
[...]
COMMAND: ip ro add table 1101 169.254.102.12/32 via 233.252.0.1 dev veth5 onlink
Error: Nexthop has invalid gateway.
TEST: Invalid gw - multicast address, VRF [ OK ]
[...]
Tests passed: 37
Tests failed: 0
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260111120813.159799-4-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The command in the test fails as expected because IPv6 forbids a nexthop
device mismatch:
# ./fib-onlink-tests.sh -v
[...]
COMMAND: ip -6 ro add table 1101 2001:db8:102::103/128 via 2001:db8:701::64 dev veth5 onlink
Error: Nexthop has invalid gateway or device mismatch.
TEST: Gateway resolves to wrong nexthop device - VRF [ OK ]
[...]
Where:
# ip route get 2001:db8:701::64 vrf lisa
2001:db8:701::64 dev veth7 table 1101 proto kernel src 2001:db8:701::1 metric 256 pref medium
This is in contrast to IPv4 where a nexthop device mismatch is allowed
when "onlink" is specified:
# ip route get 169.254.7.2 vrf lisa
169.254.7.2 dev veth7 table 1101 src 169.254.7.1 uid 0
# ip ro add table 1101 169.254.102.103/32 via 169.254.7.2 dev veth5 onlink
# echo $?
0
Remove these tests in preparation for aligning IPv6 with IPv4 and
allowing nexthop device mismatch when "onlink" is specified.
A subsequent patch will add tests that verify that both address families
allow a nexthop device mismatch with "onlink".
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260111120813.159799-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
According to the test description, these tests fail because of a wrong
nexthop device:
# ./fib-onlink-tests.sh -v
[...]
COMMAND: ip ro add table 254 169.254.101.102/32 via 169.254.3.1 dev veth1 onlink
Error: Nexthop has invalid gateway.
TEST: Gateway resolves to wrong nexthop device [ OK ]
COMMAND: ip ro add table 1101 169.254.102.103/32 via 169.254.7.1 dev veth5 onlink
Error: Nexthop has invalid gateway.
TEST: Gateway resolves to wrong nexthop device - VRF [ OK ]
[...]
But this is incorrect. They fail because the gateway addresses are local
addresses:
# ip -4 address show
[...]
28: veth3@if27: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 link-netns peer_ns-Urqh3o
inet 169.254.3.1/24 scope global veth3
[...]
32: veth7@if31: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master lisa state UP group default qlen 1000 link-netns peer_ns-Urqh3o
inet 169.254.7.1/24 scope global veth7
Therefore, using a local address that matches the nexthop device fails
as well:
# ip ro add table 254 169.254.101.102/32 via 169.254.3.1 dev veth3 onlink
Error: Nexthop has invalid gateway.
Using a gateway address with a "wrong" nexthop device is actually valid
and allowed:
# ip route get 169.254.1.2
169.254.1.2 dev veth1 src 169.254.1.1 uid 0
# ip ro add table 254 169.254.101.102/32 via 169.254.1.2 dev veth3 onlink
# echo $?
0
Remove these tests given that their output is confusing and that the
scenario that they are testing is already covered by other tests.
A subsequent patch will add tests for the nexthop device mismatch
scenario.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260111120813.159799-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Maxime Chevallier says:
====================
net: phy: Introduce PHY ports representation
A few important notes:
- This is only a first phase. It instantiates the port, and leverage
that to make the MAC <-> PHY <-> SFP usecase simpler.
- Next phase will deal with controlling the port state, as well as the
netlink uAPI for that.
- The end-goal is to enable support for complex port MUX. This
preliminary work focuses on PHY-driven ports, but this will be
extended to support muxing at the MII level (Multi-phy, or compo PHY
+ SFP as found on Turris Omnia for example).
- The naming is definitely not set in stone. I named that "phy_port",
but this may convey the false sense that this is phylib-specific.
Even the word "port" is not that great, as it already has several
different meanings in the net world (switch port, devlink port,
etc.). I used the term "connector" in the binding.
A bit of history on that work :
The end goal that I personnaly want to achieve is :
+ PHY - RJ45
|
MAC - MUX -+ PHY - RJ45
After many discussions here on netdev@, but also at netdevconf[1] and
LPC[2], there appears to be several analoguous designs that exist out
there.
[1] : https://netdevconf.info/0x17/sessions/talk/improving-multi-phy-and-multi-port-interfaces.html
[2] : https://lpc.events/event/18/contributions/1964/ (video isn't the
right one)
Take the MAchiatobin, it has 2 interfaces that looks like this :
MAC - PHY -+ RJ45
|
+ SFP - Whatever the module does
Now, looking at the Turris Omnia, we have :
MAC - MUX -+ PHY - RJ45
|
+ SFP - Whatever the module does
We can find more example of this kind of designs, the common part is
that we expose multiple front-facing media ports. This is what this
current work aims at supporting. As of right now, it does'nt add any
support for muxing, but this will come later on.
This first phase focuses on phy-driven ports only, but there are already
quite some challenges already. For one, we can't really autodetect how
many ports are sitting behind a PHY. That's why this series introduces a
new binding. Describing ports in DT should however be a last-resort
thing when we need to clear some ambiguity about the PHY media-side.
The only use-cases that we have today for multi-port PHYs are combo PHYs
that drive both a Copper port and an SFP (the Macchiatobin case). This
in itself is challenging and this series only addresses part of this
support, by registering a phy_port for the PHY <-> SFP connection. The
SFP module should in the end be considered as a port as well, but that's
not yet the case.
However, because now PHYs can register phy_ports for every media-side
interface they have, they can register the capabilities of their ports,
which allows making the PHY-driver SFP case much more generic.
====================
Link: https://patch.msgid.link/20260108080041.553250-1-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This documentation aims at describing the main goal of the phy_port
infrastructure.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260108080041.553250-15-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Now that all PHY drivers that support downstream SFP have been converted
to phy_port serdes handling, we can make the generic PHY SFP handling
mandatory, thus making all phylib sfp helpers static.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260108080041.553250-14-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
QCA8072/8075 may be used as combo-port PHYs, with Serdes (100/1000BaseX)
and Copper interfaces. The PHY has the ability to read the configuration
it's in. If the configuration indicates the PHY is in combo mode, allow
registering up to 2 ports.
Register a dedicated set of port ops to handle the serdes port, and rely
on generic phylib SFP support for the SFP handling.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260108080041.553250-13-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Convert the at803x driver to use the generic phylib SFP handling, via a
dedicated .attach_port() callback, populating the supported interfaces.
As these devices are limited to 1000BaseX, a workaround is used to also
support, in a very limited way, copper modules. This is done by
supporting SGMII but limiting it to 1G full duplex (in which case it's
somewhat compatible with 1000BaseX).
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260108080041.553250-12-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Convert the Marvell10G driver to use the generic SFP handling, through a
dedicated .attach_port() handler to populate the port's supported
interfaces.
As the 88x3310 supports multiple MDI, the .attach_port() logic handles
both SFP attach with 10GBaseR support, and support for the "regular"
port that usually is a BaseT port.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260108080041.553250-11-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Convert the Marvell driver (especially the 88e1512 driver) to use the
phy_port interface to handle SFPs. This means registering a
.attach_port() handler to detect when a serdes line interface is used
(most likely, and SFP module).
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260108080041.553250-10-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The 88x2222 PHY from Marvell only supports serialised modes as its
line-facing interfaces. Convert that driver to the generic phylib SFP
handling.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260108080041.553250-9-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There are currently 4 PHY drivers that can drive downstream SFPs:
marvell.c, marvell10g.c, at803x.c and marvell-88x2222.c. Most of the
logic is boilerplate, either calling into generic phylib helpers (for
SFP PHY attach, bus attach, etc.) or performing the same tasks with a
bit of validation :
- Getting the module's expected interface mode
- Making sure the PHY supports it
- Optionaly perform some configuration to make sure the PHY outputs
the right mode
This can be made more generic by leveraging the phy_port, and its
configure_mii() callback which allows setting a port's interfaces when
the port is a serdes.
Introduce a generic PHY SFP support. If a driver doesn't probe the SFP
bus itself, but an SFP phandle is found in devicetree/firmware, then the
generic PHY SFP support will be used, relying on port ops.
PHY driver need to :
- Register a .attach_port() callback
- When a serdes port is registered to the PHY, drivers must set
port->interfaces to the set of PHY_INTERFACE_MODE the port can output
- If the port has limitations regarding speed, duplex and aneg, the
port can also fine-tune the final linkmodes that can be supported
- The port may register a set of ops, including .configure_mii(), that
will be called at module_insert time to adjust the interface based on
the module detected.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260108080041.553250-8-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Some PHY devices may be used as media-converters to drive SFP ports (for
example, to allow using SFP when the SoC can only output RGMII). This is
already supported to some extend by allowing PHY drivers to registers
themselves as being SFP upstream.
However, the logic to drive the SFP can actually be split to a per-port
control logic, allowing support for multi-port PHYs, or PHYs that can
either drive SFPs or Copper.
To that extent, create a phy_port when registering an SFP bus onto a
PHY. This port is considered a "serdes" port, in that it can feed data
to another entity on the link. The PHY driver needs to specify the
various PHY_INTERFACE_MODE_XXX that this port supports.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260108080041.553250-7-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The newly added ethernet-connector binding allows describing an Ethernet
connector with greater precision, and in a more generic manner, than
ti,fiber-mode. Deprecate this property.
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260108080041.553250-6-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
With the phy_port representation introduced, we can use .attach_port to
populate the port information based on either the straps or the
ti,fiber-mode property. This allows simplifying the probe function and
allow users to override the strapping configuration.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260108080041.553250-5-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Ethernet provides a wide variety of layer 1 protocols and standards for
data transmission. The front-facing ports of an interface have their own
complexity and configurability.
Introduce a representation of these front-facing ports. The current code
is minimalistic and only support ports controlled by PHY devices, but
the plan is to extend that to SFP as well as raw Ethernet MACs that
don't use PHY devices.
This minimal port representation allows describing the media and number
of pairs of a BaseT port. From that information, we can derive the
linkmodes usable on the port, which can be used to limit the
capabilities of an interface.
For now, the port pairs and medium is derived from devicetree, defined
by the PHY driver, or populated with default values (as we assume that
all PHYs expose at least one port).
The typical example is 100M ethernet. 100BaseTX works using only 2
pairs on a Cat 5 cables. However, in the situation where a 10/100/1000
capable PHY is wired to its RJ45 port through 2 pairs only, we have no
way of detecting that. The "max-speed" DT property can be used, but a
more accurate representation can be used :
mdi {
connector-0 {
media = "BaseT";
pairs = <2>;
};
};
From that information, we can derive the max speed reachable on the
port.
Another benefit of having that is to avoid vendor-specific DT properties
(micrel,fiber-mode or ti,fiber-mode).
This basic representation is meant to be expanded, by the introduction
of port ops, userspace listing of ports, and support for multi-port
devices.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260108080041.553250-4-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In an effort to have a better representation of Ethernet ports,
introduce enumeration values representing the various ethernet Mediums.
This is part of the 802.3 naming convention, for example :
1000 Base T 4
| | | |
| | | \_ pairs (4)
| | \___ Medium (T == Twisted Copper Pairs)
| \_______ Baseband transmission
\____________ Speed
Other example :
10000 Base K X 4
| | \_ lanes (4)
| \___ encoding (BaseX is 8b/10b while BaseR is 66b/64b)
\_____ Medium (K is backplane ethernet)
In the case of representing a physical port, only the medium and number
of pairs should be relevant. One exception would be 1000BaseX, which is
currently also used as a medium in what appears to be any of 1000BaseSX,
1000BaseCX, 1000BaseLX, 1000BaseEX, 1000BaseBX10 and some other.
This was reflected in the mediums associated with the 1000BaseX linkmode.
These mediums are set in the net/ethtool/common.c lookup table that
maintains a list of all linkmodes with their number of pairs, medium,
encoding, speed and duplex.
One notable exception to this is 100BaseT Ethernet. It emcompasses 100BaseTX,
which is a 2-pairs protocol but also 100BaseT4, that will also work on 4-pairs
cables. As we don't make a disctinction between these, the lookup table
contains 2 sets of pair numbers, indicating the min number of pairs for a
protocol to work and the "nominal" number of pairs as well.
Another set of exceptions are linkmodes such 100000baseLR4_ER4, where
the same link mode seems to represent 100GBaseLR4 and 100GBaseER4. The
macro __DEFINE_LINK_MODE_PARAMS_MEDIUMS is here used to populate the
.mediums bitfield with all appropriate mediums.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260108080041.553250-3-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The ability to describe the physical ports of Ethernet devices is useful
to describe multi-port devices, as well as to remove any ambiguity with
regard to the nature of the port.
Moreover, describing ports allows for a better description of features
that are tied to connectors, such as PoE through the PSE-PD devices.
Introduce a binding to allow describing the ports, for now with 2
attributes :
- The number of pairs, which is a quite generic property that allows
differentating between multiple similar technologies such as BaseT1
and "regular" BaseT (which usually means BaseT4).
- The media that can be used on that port, such as BaseT for Twisted
Copper, BaseC for coax copper, BaseS/L for Fiber, BaseK for backplane
ethernet, etc. This allows defining the nature of the port, and
therefore avoids the need for vendor-specific properties such as
"micrel,fiber-mode" or "ti,fiber-mode".
The port description lives in its own file, as it is intended in the
future to allow describing the ports for phy-less devices.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20260108080041.553250-2-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
struct io_uring_zcrx_ifq_reg::rx_buf_len is used as a hint specifying
the kernel what buffer size it should use. Document the API and
limitations.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
|
|
Add a test using large chunks for zcrx memory area.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
|
|
Implement support for qcfg provided rx page sizes. For that, implement
the ndo_default_qcfg callback and validate the config on restart. Also,
use the current config's value in bnxt_init_ring_struct to retain the
correct size across resets.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
|
|
The driver tries to provision more agg buffers than header buffers
since multiple agg segments can reuse the same header. The calculation
/ heuristic tries to provide enough pages for 65k of data for each header
(or 4 frags per header if the result is too big). This calculation is
currently global to the adapter. If we increase the buffer sizes 8x
we don't want 8x the amount of memory sitting on the rings.
Luckily we don't have to fill the rings completely, adjust
the fill level dynamically in case particular queue has buffers
larger than the global size.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
[pavel: rebase on top of agg_size_fac]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
|
|
Instead of using a constant buffer length, allow configuring the size
for each queue separately. There is no way to change the length yet, and
it'll be passed from memory providers in a later patch.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
|
|
Allow memory providers to configure rx queues with a custom receive
page size. It's passed in struct pp_memory_provider_params, which is
copied into the queue, so it's preserved across queue restarts. Then,
it's propagated to the driver in a new queue config parameter.
Drivers should explicitly opt into using it by setting
QCFG_RX_PAGE_SIZE, in which case they should implement ndo_default_qcfg,
validate the size on queue restart and honour the current config in case
of a reset.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
|
|
We'll need to pass extra parameters when allocating a queue for memory
providers. Define a new structure for queue configurations, and pass it
to qapi callbacks. It's empty for now, actual parameters will be added
in following patches.
Configurations should persist across resets, and for that they're
default-initialised on device registration and stored in struct
netdev_rx_queue. We also add a new qapi callback for defaulting a given
config. It must be implemented if a driver wants to use queue configs
and is optional otherwise.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
|
|
Trivial change, reduce the indent. I think the original is copied
from real NDOs. It's unnecessarily deep, makes passing struct args
problematic.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
|
|
Instead of resetting memory provider parameters one by one in
__net_mp_{open,close}_rxq, memzero the entire structure. It'll be used
to extend the structure.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
|
|
Jakub Kicinski says:
====================
selftests: drv-net: gro: enable HW GRO and LRO testing
Add support for running our existing GRO test against HW GRO
and LRO implementation. The first 3 patches are just ksft lib
nice-to-haves, and patch 4 cleans up the existing gro Python.
Patches 5 and 6 are of most practical interest. The support
reconfiguring the NIC to disable SW GRO and enable HW GRO and LRO.
Additionally last patch breaks up the existing GRO cases to
track HW compliance at finer granularity.
====================
Link: https://patch.msgid.link/20260113000740.255360-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
GRO test groups the cases into categories, e.g. "tcp" case
checks coalescing in presence of:
- packets with bad csum,
- sequence number mismatch,
- timestamp option value mismatch,
- different TCP options.
Since we now have TAP support grouping the cases like that
lowers our reporting granularity. This matters even more for
NICs performing HW GRO and LRO since it appears that most
implementation have _some_ bugs. Flagging the whole group
of tests as failed prevents us from catching regressions
in the things that work today.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260113000740.255360-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Run the test against HW GRO and LRO. NICs I have pass the base cases.
Interestingly all are happy to build GROs larger than 64k.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260113000740.255360-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We'll need to do a lot more feature handling to test HW-GRO and LRO.
Clean up the feature handling for SW GRO a bit to let the next commit
focus on the new test cases, only.
Make sure HW GRO-like features are not enabled for the SW tests.
Be more careful about changing features as "nothing changed"
situations may result in non-zero error code from ethtool.
Don't disable TSO on the local interface (receiver) when running over
netdevsim, we just want GSO to break up the segments on the sender.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260113000740.255360-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Now that cmd() can be printed directly remove the old formatting.
Before:
# fragmented ip6 doesn't coalesce:
# Expected {200 100 100 }, Total 3 packets
# Received {200 100 }, Total 2 packets.
# /root/ksft-net-drv/drivers/net/gro: incorrect number of packets
Now:
# CMD: drivers/net/gro --ipv6 --dmac 9e:[...]
# EXIT: 1
# STDOUT: fragmented ip6 doesn't coalesce:
# STDERR: Expected {200 100 100 }, Total 3 packets
# Received {200 100 }, Total 2 packets.
# /root/ksft-net-drv/drivers/net/gro: incorrect number of packets
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20260113000740.255360-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Teach cmd() how to print itself, to make debug prints easier.
Example output (leading # due to ksft_pr()):
# CMD: /root/ksft-net-drv/drivers/net/gro
# EXIT: 1
# STDOUT: ipv6 with ext header does coalesce:
# STDERR: Expected {200 }, Total 1 packets
# Received {100 [!=200]100 [!=0]}, Total 2 packets.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20260113000740.255360-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Make printing multi-line logs easier by automatically prefixing
each line in ksft_pr(). Make use of this when formatting exceptions.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20260113000740.255360-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When building ynltool with parallel make (-jN), a warning is emitted:
make[1]: warning: jobserver unavailable: using -j1.
Add '+' to parent make rule.
The warning trips up local runs of NIPA's ingest_mdir.py, which
correctly fails on make warnings.
This occurs because SRC_VERSION uses $(shell make ...) to make
kernelversion. The $(shell) function inherits make's MAKEFLAGS env var
which specifies "--jobserver-auth=R,W" pointing to file descriptors that
the invoked make sub-shell does not have access to.
Observed with:
$ make --version | head -1
GNU Make 4.3
Instead of suppressing MAKEFLAGS and foregoing all future MAKEFLAGS
(some of which may be desirable, such as variable overrides) or
introducing a new make target, we instead just ignore the warning by
piping stderr to /dev/null. If 'make kernelversion' fails, the ' || echo
"unknown"' phrase will catch the failure.
Before:
NIPA ingest_mdir.py:
ynl
Full series FAIL (1)
Generated files up to date; build has 1 warnings/errors; no diff in
generated;
After:
NIPA ingest_mdir.py:
Series level tests:
ynl OKAY
Validated output:
$ ./ynltool/ynltool --version
ynltool 6.19.0-rc4
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260112-ynl-make-fix-v1-1-c399e76925ad@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Tariq Toukan says:
====================
mlx5-next updates 2026-01-13
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Add IFC bits for extended ETS rate limit bandwidth value
net/mlx5: Add support for querying bond speed
net/mlx5: Handle port and vport speed change events in MPESW
net/mlx5: Propagate LAG effective max_tx_speed to vports
net/mlx5: Add max_tx_speed and its CAP bit to IFC
====================
Link: https://patch.msgid.link/1768299471-1603093-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a build-time assertiont that Hyper-V's "enlightened" exit code is that,
same as the AMD-defined "Reserved for Host" exit code, mostly to help
readers connect the dots and understand why synthesizing a software-defined
exit code is safe/ok.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://patch.msgid.link/20251230211347.4099600-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Explicitly clamp the exit code used to index KVM's exit handlers to guard
against Spectre-like attacks, mainly to provide consistency between VMX
and SVM (VMX was given the same treatment by commit c926f2f7230b ("KVM:
x86: Protect exit_reason from being used in Spectre-v1/L1TF attacks").
For normal VMs, it's _extremely_ unlikely the exit code could be used to
exploit a speculation vulnerability, as the exit code is set by hardware
and unexpected/unknown exit codes should be quite well bounded (as is/was
the case with VMX). But with SEV-ES+, the exit code is guest-controlled
as it comes from the GHCB, not from hardware, i.e. an attack from the
guest is at least somewhat plausible.
Irrespective of SEV-ES+, hardening KVM is easy and inexpensive, and such
an attack is theoretically possible.
Link: https://patch.msgid.link/20251230211347.4099600-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|