linux.git/net/core/dev.c, branch v6.4-rc2

net: add vlan_get_protocol_and_depth() helper

2023-05-10T09:25:55+00:00

Before blamed commit, pskb_may_pull() was used instead
of skb_header_pointer() in __vlan_get_protocol() and friends.

Few callers depended on skb->head being populated with MAC header,
syzbot caught one of them (skb_mac_gso_segment())

Add vlan_get_protocol_and_depth() to make the intent clearer
and use it where sensible.

This is a more generic fix than commit e9d3f80935b6
("net/af_packet: make sure to pull mac header") which was
dealing with a similar issue.

kernel BUG at include/linux/skbuff.h:2655 !
invalid opcode: 0000 [#1] SMP KASAN
CPU: 0 PID: 1441 Comm: syz-executor199 Not tainted 6.1.24-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/14/2023
RIP: 0010:__skb_pull include/linux/skbuff.h:2655 [inline]
RIP: 0010:skb_mac_gso_segment+0x68f/0x6a0 net/core/gro.c:136
Code: fd 48 8b 5c 24 10 44 89 6b 70 48 c7 c7 c0 ae 0d 86 44 89 e6 e8 a1 91 d0 00 48 c7 c7 00 af 0d 86 48 89 de 31 d2 e8 d1 4a e9 ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
RSP: 0018:ffffc90001bd7520 EFLAGS: 00010286
RAX: ffffffff8469736a RBX: ffff88810f31dac0 RCX: ffff888115a18b00
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffc90001bd75e8 R08: ffffffff84697183 R09: fffff5200037adf9
R10: 0000000000000000 R11: dffffc0000000001 R12: 0000000000000012
R13: 000000000000fee5 R14: 0000000000005865 R15: 000000000000fed7
FS: 000055555633f300(0000) GS:ffff8881f6a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000000 CR3: 0000000116fea000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:

[] __skb_gso_segment+0x32d/0x4c0 net/core/dev.c:3419
[] skb_gso_segment include/linux/netdevice.h:4819 [inline]
[] validate_xmit_skb+0x3aa/0xee0 net/core/dev.c:3725
[] __dev_queue_xmit+0x1332/0x3300 net/core/dev.c:4313
[] dev_queue_xmit+0x17/0x20 include/linux/netdevice.h:3029
[] packet_snd net/packet/af_packet.c:3111 [inline]
[] packet_sendmsg+0x49d2/0x6470 net/packet/af_packet.c:3142
[] sock_sendmsg_nosec net/socket.c:716 [inline]
[] sock_sendmsg net/socket.c:736 [inline]
[] __sys_sendto+0x472/0x5f0 net/socket.c:2139
[] __do_sys_sendto net/socket.c:2151 [inline]
[] __se_sys_sendto net/socket.c:2147 [inline]
[] __x64_sys_sendto+0xe5/0x100 net/socket.c:2147
[] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[] do_syscall_64+0x2f/0x50 arch/x86/entry/common.c:80
[] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: 469aceddfa3e ("vlan: consolidate VLAN parsing code and limit max parsing depth")
Reported-by: syzbot 
Signed-off-by: Eric Dumazet 
Cc: Toke Høiland-Jørgensen 
Cc: Willem de Bruijn 
Reviewed-by: Simon Horman 
Signed-off-by: David S. Miller

net: optimize napi_threaded_poll() vs RPS/RFS

2023-04-23T12:35:07+00:00

We use napi_threaded_poll() in order to reduce our softirq dependency.

We can add a followup of 821eba962d95 ("net: optimize napi_schedule_rps()")
to further remove the need of firing NET_RX_SOFTIRQ whenever
RPS/RFS are used.

Signed-off-by: Eric Dumazet 
Acked-by: Paolo Abeni 
Signed-off-by: David S. Miller

net: make napi_threaded_poll() aware of sd->defer_list

2023-04-23T12:35:07+00:00

If we call skb_defer_free_flush() from napi_threaded_poll(),
we can avoid to raise IPI from skb_attempt_defer_free()
when the list becomes too big.

This allows napi_threaded_poll() to rely less on softirqs,
and lowers latency caused by a too big list.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller

net: move skb_defer_free_flush() up

2023-04-23T12:35:07+00:00

We plan using skb_defer_free_flush() from napi_threaded_poll()
in the following patch.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller

net: do not provide hard irq safety for sd->defer_lock

2023-04-23T12:35:07+00:00

kfree_skb() can be called from hard irq handlers,
but skb_attempt_defer_free() is meant to be used
from process or BH contexts, and skb_defer_free_flush()
is meant to be called from BH contexts.

Not having to mask hard irq can save some cycles.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller

net: skbuff: update and rename __kfree_skb_defer()

2023-04-21T02:25:08+00:00

__kfree_skb_defer() uses the old naming where "defer" meant
slab bulk free/alloc APIs. In the meantime we also made
__kfree_skb_defer() feed the per-NAPI skb cache, which
implies bulk APIs. So take away the 'defer' and add 'napi'.

While at it add a drop reason. This only matters on the
tx_action path, if the skb has a frag_list. But getting
rid of a SKB_DROP_REASON_NOT_SPECIFIED seems like a net
benefit so why not.

Reviewed-by: Alexander Lobakin 
Link: https://lore.kernel.org/r/20230420020005.815854-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski

net: skbuff: hide csum_not_inet when CONFIG_IP_SCTP not set

2023-04-19T12:04:30+00:00

SCTP is not universally deployed, allow hiding its bit
from the skb.

Reviewed-by: Florian Fainelli 
Reviewed-by: Eric Dumazet 
Signed-off-by: Jakub Kicinski 
Signed-off-by: David S. Miller

page_pool: allow caching from safely localized NAPI

2023-04-15T01:56:12+00:00

Recent patches to mlx5 mentioned a regression when moving from
driver local page pool to only using the generic page pool code.
Page pool has two recycling paths (1) direct one, which runs in
safe NAPI context (basically consumer context, so producing
can be lockless); and (2) via a ptr_ring, which takes a spin
lock because the freeing can happen from any CPU; producer
and consumer may run concurrently.

Since the page pool code was added, Eric introduced a revised version
of deferred skb freeing. TCP skbs are now usually returned to the CPU
which allocated them, and freed in softirq context. This places the
freeing (producing of pages back to the pool) enticingly close to
the allocation (consumer).

If we can prove that we're freeing in the same softirq context in which
the consumer NAPI will run - lockless use of the cache is perfectly fine,
no need for the lock.

Let drivers link the page pool to a NAPI instance. If the NAPI instance
is scheduled on the same CPU on which we're freeing - place the pages
in the direct cache.

With that and patched bnxt (XDP enabled to engage the page pool, sigh,
bnxt really needs page pool work :() I see a 2.6% perf boost with
a TCP stream test (app on a different physical core than softirq).

The CPU use of relevant functions decreases as expected:

  page_pool_refill_alloc_cache   1.17% -> 0%
  _raw_spin_lock                 2.41% -> 0.98%

Only consider lockless path to be safe when NAPI is scheduled
- in practice this should cover majority if not all of steady state
workloads. It's usually the NAPI kicking in that causes the skb flush.

The main case we'll miss out on is when application runs on the same
CPU as NAPI. In that case we don't use the deferred skb free path.

Reviewed-by: Tariq Toukan 
Acked-by: Jesper Dangaard Brouer 
Tested-by: Dragos Tatulea 
Signed-off-by: Jakub Kicinski

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

2023-04-13T23:04:28+00:00

Conflicts:

tools/testing/selftests/net/config
  62199e3f1658 ("selftests: net: Add VXLAN MDB test")
  3a0385be133e ("selftests: add the missing CONFIG_IP_SCTP in net config")

Signed-off-by: Jakub Kicinski

rtnetlink: Restore RTM_NEW/DELLINK notification behavior

2023-04-13T03:47:58+00:00

The commits referenced below allows userspace to use the NLM_F_ECHO flag
for RTM_NEW/DELLINK operations to receive unicast notifications for the
affected link. Prior to these changes, applications may have relied on
multicast notifications to learn the same information without specifying
the NLM_F_ECHO flag.

For such applications, the mentioned commits changed the behavior for
requests not using NLM_F_ECHO. Multicast notifications are still received,
but now use the portid of the requester and the sequence number of the
request instead of zero values used previously. For the application, this
message may be unexpected and likely handled as a response to the
NLM_F_ACKed request, especially if it uses the same socket to handle
requests and notifications.

To fix existing applications relying on the old notification behavior,
set the portid and sequence number in the notification only if the
request included the NLM_F_ECHO flag. This restores the old behavior
for applications not using it, but allows unicasted notifications for
others.

Fixes: f3a63cce1b4f ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_delete_link")
Fixes: d88e136cab37 ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_newlink_create")
Signed-off-by: Martin Willi 
Acked-by: Guillaume Nault 
Acked-by: Hangbin Liu 
Link: https://lore.kernel.org/r/20230411074319.24133-1-martin@strongswan.org
Signed-off-by: Jakub Kicinski