| Age | Commit message (Collapse) | Author |
|
There is a potential race condition between sock bind and socket write
iter. bind may free the same cmd via mgmt_pending before write iter sends
the cmd, just as syzbot reported in UAF[1].
Here we use hci_dev_lock to synchronize the two, thereby avoiding the
UAF mentioned in [1].
[1]
syzbot reported:
BUG: KASAN: slab-use-after-free in mgmt_pending_remove+0x3b/0x210 net/bluetooth/mgmt_util.c:316
Read of size 8 at addr ffff888077164818 by task syz.0.17/5989
Call Trace:
mgmt_pending_remove+0x3b/0x210 net/bluetooth/mgmt_util.c:316
set_link_security+0x5c2/0x710 net/bluetooth/mgmt.c:1918
hci_mgmt_cmd+0x9c9/0xef0 net/bluetooth/hci_sock.c:1719
hci_sock_sendmsg+0x6ca/0xef0 net/bluetooth/hci_sock.c:1839
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg+0x21c/0x270 net/socket.c:742
sock_write_iter+0x279/0x360 net/socket.c:1195
Allocated by task 5989:
mgmt_pending_add+0x35/0x140 net/bluetooth/mgmt_util.c:296
set_link_security+0x557/0x710 net/bluetooth/mgmt.c:1910
hci_mgmt_cmd+0x9c9/0xef0 net/bluetooth/hci_sock.c:1719
hci_sock_sendmsg+0x6ca/0xef0 net/bluetooth/hci_sock.c:1839
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg+0x21c/0x270 net/socket.c:742
sock_write_iter+0x279/0x360 net/socket.c:1195
Freed by task 5991:
mgmt_pending_free net/bluetooth/mgmt_util.c:311 [inline]
mgmt_pending_foreach+0x30d/0x380 net/bluetooth/mgmt_util.c:257
mgmt_index_removed+0x112/0x2f0 net/bluetooth/mgmt.c:9477
hci_sock_bind+0xbe9/0x1000 net/bluetooth/hci_sock.c:1314
Fixes: 6fe26f694c82 ("Bluetooth: MGMT: Protect mgmt_pending list with its own lock")
Reported-by: syzbot+9aa47cd4633a3cf92a80@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9aa47cd4633a3cf92a80
Tested-by: syzbot+9aa47cd4633a3cf92a80@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
HCI_OP_NOP means no command was actually sent so there is no point in
triggering cmd_timer which may cause a hdev->reset in the process since
it is assumed that the controller is stuck processing a command.
Fixes: e2d471b7806b ("Bluetooth: ISO: Fix not using SID from adv report")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
Cross-merge networking fixes after downstream PR (net-6.18-rc7).
No conflicts, adjacent changes:
tools/testing/selftests/net/af_unix/Makefile
e1bb28bf13f4 ("selftest: af_unix: Add test for SO_PEEK_OFF.")
45a1cd8346ca ("selftests: af_unix: Add tests for ECONNRESET and EOF semantics")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from IPsec and wireless.
Previous releases - regressions:
- prevent NULL deref in generic_hwtstamp_ioctl_lower(),
newer APIs don't populate all the pointers in the request
- phylink: add missing supported link modes for the fixed-link
- mptcp: fix false positive warning in mptcp_pm_nl_rm_addr
Previous releases - always broken:
- openvswitch: remove never-working support for setting NSH fields
- xfrm: number of fixes for error paths of xfrm_state creation/
modification/deletion
- xfrm: fixes for offload
- fix the determination of the protocol of the inner packet
- don't push locally generated packets directly to L2 tunnel
mode offloading, they still need processing from the standard
xfrm path
- mptcp: fix a couple of corner cases in fallback and fastclose
handling
- wifi: rtw89: hw_scan: prevent connections from getting stuck,
work around apparent bug in FW by tweaking messages we send
- af_unix: fix duplicate data if PEEK w/ peek_offset needs to wait
- veth: more robust handing of race to avoid txq getting stuck
- eth: ps3_gelic_net: handle skb allocation failures"
* tag 'net-6.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (47 commits)
vsock: Ignore signal/timeout on connect() if already established
be2net: pass wrb_params in case of OS2BMC
l2tp: reset skb control buffer on xmit
net: dsa: microchip: lan937x: Fix RGMII delay tuning
selftests: mptcp: add a check for 'add_addr_accepted'
mptcp: fix address removal logic in mptcp_pm_nl_rm_addr
selftests: mptcp: join: userspace: longer timeout
selftests: mptcp: join: endpoints: longer timeout
selftests: mptcp: join: fastclose: remove flaky marks
mptcp: fix duplicate reset on fastclose
mptcp: decouple mptcp fastclose from tcp close
mptcp: do not fallback when OoO is present
mptcp: fix premature close in case of fallback
mptcp: avoid unneeded subflow-level drops
mptcp: fix ack generation for fallback msk
wifi: rtw89: hw_scan: Don't let the operating channel be last
net: phylink: add missing supported link modes for the fixed-link
selftest: af_unix: Add test for SO_PEEK_OFF.
af_unix: Read sk_peek_offset() again after sleeping in unix_stream_read_generic().
net/mlx5: Clean up only new IRQ glue on request_irq() failure
...
|
|
During connect(), acting on a signal/timeout by disconnecting an already
established socket leads to several issues:
1. connect() invoking vsock_transport_cancel_pkt() ->
virtio_transport_purge_skbs() may race with sendmsg() invoking
virtio_transport_get_credit(). This results in a permanently elevated
`vvs->bytes_unsent`. Which, in turn, confuses the SOCK_LINGER handling.
2. connect() resetting a connected socket's state may race with socket
being placed in a sockmap. A disconnected socket remaining in a sockmap
breaks sockmap's assumptions. And gives rise to WARNs.
3. connect() transitioning SS_CONNECTED -> SS_UNCONNECTED allows for a
transport change/drop after TCP_ESTABLISHED. Which poses a problem for
any simultaneous sendmsg() or connect() and may result in a
use-after-free/null-ptr-deref.
Do not disconnect socket on signal/timeout. Keep the logic for unconnected
sockets: they don't linger, can't be placed in a sockmap, are rejected by
sendmsg().
[1]: https://lore.kernel.org/netdev/e07fd95c-9a38-4eea-9638-133e38c2ec9b@rbox.co/
[2]: https://lore.kernel.org/netdev/20250317-vsock-trans-signal-race-v4-0-fc8837f3f1d4@rbox.co/
[3]: https://lore.kernel.org/netdev/60f1b7db-3099-4f6a-875e-af9f6ef194f6@rbox.co/
Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20251119-vsock-interrupted-connect-v2-1-70734cf1233f@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Merge CMAC-128 and CMAC-256 handlers since they are almost the same.
This removes duplication.
The comment 'MIC = AES-128-CMAC(IGTK, AAD ...' is out-dated since CMAC
is also used with BIGTK, as is the comment for CMAC-256. Simply remove
the comments.
Tested-on: mac80211_hwsim
Signed-off-by: Chien Wong <m@xv97.com>
Link: https://patch.msgid.link/20251113140511.48658-6-m@xv97.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
ieee80211_aes_cmac() and ieee80211_aes_cmac_256() are almost the same.
Merge them. This removes duplication.
Signed-off-by: Chien Wong <m@xv97.com>
Link: https://patch.msgid.link/20251113140511.48658-5-m@xv97.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Make use of the added constants to reduce duplication.
Signed-off-by: Chien Wong <m@xv97.com>
Link: https://patch.msgid.link/20251113140511.48658-4-m@xv97.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
The called hash functions could fail thus we should check return values.
Fixes: 26717828b75d ("mac80211: aes-cmac: switch to shash CMAC driver")
Signed-off-by: Chien Wong <m@xv97.com>
Link: https://patch.msgid.link/20251113140511.48658-2-m@xv97.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
The L2TP stack did not reset the skb control buffer before sending the
encapsulated package.
In a setup with an ath10k radio and batman-adv over an L2TP tunnel
massive fragmentations happen sporadically if the L2TP tunnel is
established over IPv4.
L2TP might reset some of the fields in the IP control buffer, but L2TP
assumes the type of the control buffer to be of an IPv4 packet.
In case the L2TP interface is used as a batadv hardif or the packet is
an IPv6 packet, this assumption breaks.
Clear the entire control buffer to avoid such mishaps altogether.
Fixes: f77ae9390438 ("[PPPOL2TP]: Reset meta-data in xmit function")
Signed-off-by: David Bauer <mail@david-bauer.net>
Link: https://patch.msgid.link/20251118001619.242107-1-mail@david-bauer.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
For more details see the Link tag below.
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
Switch to using system_dfl_wq because system_unbound_wq is going away as part of
a workqueue restructuring.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Link: https://patch.msgid.link/20251113162032.394804-4-marco.crivellari@suse.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
After getting the current skb in napi_skb_cache_get(), the next skb in
cache is highly likely to be used soon, so prefetch would be helpful.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20251118070646.61344-5-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
- Replace NAPI_SKB_CACHE_HALF with NAPI_SKB_CACHE_FREE
- Only free 32 skbs in napi_skb_cache_put()
Since the first patch adjusting NAPI_SKB_CACHE_SIZE to 128, the number
of packets to be freed in the softirq was increased from 32 to 64.
Considering a subsequent net_rx_action() calling napi_poll() a few
times can easily consume the 64 available slots and we can afford
keeping a higher value of sk_buffs in per-cpu storage, decrease
NAPI_SKB_CACHE_FREE to 32 like before. So now the logic is 1) keeping
96 skbs, 2) freeing 32 skbs at one time.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20251118070646.61344-4-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The previous value 16 is a bit conservative, so adjust it along with
NAPI_SKB_CACHE_SIZE, which can minimize triggering memory allocation
in napi_skb_cache_get*().
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20251118070646.61344-3-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
After commit b61785852ed0 ("net: increase skb_defer_max default to 128")
changed the value sysctl_skb_defer_max to avoid many calls to
kick_defer_list_purge(), the same situation can be applied to
NAPI_SKB_CACHE_SIZE that was proposed in 2016. It's a trade-off between
using pre-allocated memory in skb_cache and saving more a bit heavy
function calls in the softirq context.
With this patch applied, we can have more skbs per-cpu to accelerate the
sending path that needs to acquire new skbs.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20251118070646.61344-2-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix inverted WARN_ON_ONCE condition that prevented normal address
removal counter updates. The current code only executes decrement
logic when the counter is already 0 (abnormal state), while
normal removals (counter > 0) are ignored.
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Fixes: 636113918508 ("mptcp: pm: remove '_nl' from mptcp_pm_nl_rm_addr_received")
Cc: stable@vger.kernel.org
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251118-net-mptcp-misc-fixes-6-18-rc6-v1-10-806d3781c95f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The CI reports sporadic failures of the fastclose self-tests. The root
cause is a duplicate reset, not carrying the relevant MPTCP option.
In the failing scenario the bad reset is received by the peer before
the fastclose one, preventing the reception of the latter.
Indeed there is window of opportunity at fastclose time for the
following race:
mptcp_do_fastclose
__mptcp_close_ssk
__tcp_close()
tcp_set_state() [1]
tcp_send_active_reset() [2]
After [1] the stack will send reset to in-flight data reaching the now
closed port. Such reset may race with [2].
Address the issue explicitly sending a single reset on fastclose before
explicitly moving the subflow to close status.
Fixes: d21f83485518 ("mptcp: use fastclose on more edge scenarios")
Cc: stable@vger.kernel.org
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/596
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Geliang Tang <geliang@kernel.org>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251118-net-mptcp-misc-fixes-6-18-rc6-v1-6-806d3781c95f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
With the current fastclose implementation, the mptcp_do_fastclose()
helper is in charge of two distinct actions: send the fastclose reset
and cleanup the subflows.
Formally decouple the two steps, ensuring that mptcp explicitly closes
all the subflows after the mentioned helper.
This will make the upcoming fix simpler, and allows dropping the 2nd
argument from mptcp_destroy_common(). The Fixes tag is then the same as
in the next commit to help with the backports.
Fixes: d21f83485518 ("mptcp: use fastclose on more edge scenarios")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Geliang Tang <geliang@kernel.org>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251118-net-mptcp-misc-fixes-6-18-rc6-v1-5-806d3781c95f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In case of DSS corruption, the MPTCP protocol tries to avoid the subflow
reset if fallback is possible. Such corruptions happen in the receive
path; to ensure fallback is possible the stack additionally needs to
check for OoO data, otherwise the fallback will break the data stream.
Fixes: e32d262c89e2 ("mptcp: handle consistently DSS corruption")
Cc: stable@vger.kernel.org
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/598
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251118-net-mptcp-misc-fixes-6-18-rc6-v1-4-806d3781c95f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
I'm observing very frequent self-tests failures in case of fallback when
running on a CONFIG_PREEMPT kernel.
The root cause is that subflow_sched_work_if_closed() closes any subflow
as soon as it is half-closed and has no incoming data pending.
That works well for regular subflows - MPTCP needs bi-directional
connectivity to operate on a given subflow - but for fallback socket is
race prone.
When TCP peer closes the connection before the MPTCP one,
subflow_sched_work_if_closed() will schedule the MPTCP worker to
gracefully close the subflow, and shortly after will do another schedule
to inject and process a dummy incoming DATA_FIN.
On CONFIG_PREEMPT kernel, the MPTCP worker can kick-in and close the
fallback subflow before subflow_sched_work_if_closed() is able to create
the dummy DATA_FIN, unexpectedly interrupting the transfer.
Address the issue explicitly avoiding closing fallback subflows on when
the peer is only half-closed.
Note that, when the subflow is able to create the DATA_FIN before the
worker invocation, the worker will change the msk state before trying to
close the subflow and will skip the latter operation as the msk will not
match anymore the precondition in __mptcp_close_subflow().
Fixes: f09b0ad55a11 ("mptcp: close subflow when receiving TCP+FIN")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251118-net-mptcp-misc-fixes-6-18-rc6-v1-3-806d3781c95f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The rcv window is shared among all the subflows. Currently, MPTCP sync
the TCP-level rcv window with the MPTCP one at tcp_transmit_skb() time.
The above means that incoming data may sporadically observe outdated
TCP-level rcv window and being wrongly dropped by TCP.
Address the issue checking for the edge condition before queuing the
data at TCP level, and eventually syncing the rcv window as needed.
Note that the issue is actually present from the very first MPTCP
implementation, but backports older than the blamed commit below will
range from impossible to useless.
Before:
$ nstat -n; sleep 1; nstat -z TcpExtBeyondWindow
TcpExtBeyondWindow 14 0.0
After:
$ nstat -n; sleep 1; nstat -z TcpExtBeyondWindow
TcpExtBeyondWindow 0 0.0
Fixes: fa3fe2b15031 ("mptcp: track window announced to peer")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251118-net-mptcp-misc-fixes-6-18-rc6-v1-2-806d3781c95f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
mptcp_cleanup_rbuf() needs to know the last most recent, mptcp-level
rcv_wnd sent, and such information is tracked into the msk->old_wspace
field, updated at ack transmission time by mptcp_write_options().
Fallback socket do not add any mptcp options, such helper is never
invoked, and msk->old_wspace value remain stale. That in turn makes
ack generation at recvmsg() time quite random.
Address the issue ensuring mptcp_write_options() is invoked even for
fallback sockets, and just update the needed info in such a case.
The issue went unnoticed for a long time, as mptcp currently overshots
the fallback socket receive buffer autotune significantly. It is going
to change in the near future.
Fixes: e3859603ba13 ("mptcp: better msk receive window updates")
Cc: stable@vger.kernel.org
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/594
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Geliang Tang <geliang@kernel.org>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251118-net-mptcp-misc-fixes-6-18-rc6-v1-1-806d3781c95f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use %ptSp instead of open coded variants to print content of
struct timespec64 in human readable format.
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20251113150217.3030010-4-andriy.shevchenko@linux.intel.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
When an IPv6 Router Advertisement (RA) is received for a prefix, the
kernel creates the corresponding on-link route with flags RTF_ADDRCONF
and RTF_PREFIX_RT configured and RTF_EXPIRES if lifetime is set.
If later a user configures a static IPv6 address on the same prefix the
kernel clears the RTF_EXPIRES flag but it doesn't clear the RTF_ADDRCONF
and RTF_PREFIX_RT. When the next RA for that prefix is received, the
kernel sees the route as RA-learned and wrongly configures back the
lifetime. This is problematic because if the route expires, the static
address won't have the corresponding on-link route.
This fix clears the RTF_ADDRCONF and RTF_PREFIX_RT flags preventing that
the lifetime is configured when the next RA arrives. If the static
address is deleted, the route becomes RA-learned again.
Fixes: 14ef37b6d00e ("ipv6: fix route lookup in addrconf_prefix_rcv()")
Reported-by: Garri Djavadyan <g.djavadyan@gmail.com>
Closes: https://lore.kernel.org/netdev/ba807d39aca5b4dcf395cc11dca61a130a52cfd3.camel@gmail.com/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20251115095939.6967-1-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
unix_schedule_gc() and wait_for_unix_gc() share some code.
Let's consolidate the two.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251115020935.2643121-8-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
unix_tot_inflight is no longer used.
Let's remove it.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251115020935.2643121-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
unix_tot_inflight is a poor metric, only telling the number of
inflight AF_UNXI sockets, and we should use unix_graph_state instead.
Also, if the receiver is catching up with the passed fds, the
sender does not need to schedule GC.
GC only helps unreferenced cyclic SCM_RIGHTS references, and in
such a situation, the malicious sendmsg() will continue to call
wait_for_unix_gc() and hit the UNIX_INFLIGHT_SANE_USER condition.
Let's make only malicious users schedule GC and wait for it to
finish if a cyclic reference exists during the previous GC run.
Then, sane users will pay almost no cost for wait_for_unix_gc().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251115020935.2643121-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We have been calling wait_for_unix_gc() on every sendmsg() in case
there are too many inflight AF_UNIX sockets.
This is also because the old GC implementation had poor knowledge
of the inflight sockets and had to suspect every sendmsg().
This was improved by commit d9f21b361333 ("af_unix: Try to run GC
async."), but we do not even need to call wait_for_unix_gc() if the
process is not sending AF_UNIX sockets.
The wait_for_unix_gc() call only helps when a malicious process
continues to create cyclic references, and we can detect that
in a better place and slow it down.
Let's move wait_for_unix_gc() to unix_prepare_fpl() that is called
only when AF_UNIX socket fd is passed via SCM_RIGHTS.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251115020935.2643121-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We have been triggering GC on every close() if there is even one
inflight AF_UNIX socket.
This is because the old GC implementation had no idea of the graph
shape formed by SCM_RIGHTS references.
The new GC knows whether there could be a cyclic reference or not,
and we can do better.
Let's not trigger GC from close() if there is no cyclic reference
or GC is already in progress.
While at it, unix_gc() is renamed to unix_schedule_gc() as it does
not actually perform GC since commit 8b90a9f819dc ("af_unix: Run
GC on only one CPU.").
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251115020935.2643121-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
GC manages its state by two variables, unix_graph_maybe_cyclic
and unix_graph_grouped, both of which are set to false in the
initial state.
When an AF_UNIX socket is passed to an in-flight AF_UNIX socket,
unix_update_graph() sets unix_graph_maybe_cyclic to true and
unix_graph_grouped to false, making the next GC invocation call
unix_walk_scc() to group SCCs.
Once unix_walk_scc() finishes, sockets in the same SCC are linked
via vertex->scc_entry. Then, unix_graph_grouped is set to true
so that the following GC invocations can skip Tarjan's algorithm
and simply iterate through the list in unix_walk_scc_fast().
In addition, if we know there is at least one cyclic reference,
we set unix_graph_maybe_cyclic to true so that we do not skip GC.
So the state transitions as follows:
(unix_graph_maybe_cyclic, unix_graph_grouped)
=
(false, false) -> (true, false) -> (true, true) or (false, true)
^.______________/________________/
There is no transition to the initial state where both variables
are false.
If we consider the initial state as grouped, we can see that the
GC actually has a tristate.
Let's consolidate two variables into one enum.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251115020935.2643121-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
__unix_walk_scc() and unix_walk_scc_fast() call unix_scc_cyclic()
for each SCC to check if it forms a cyclic reference, so that we
can skip GC at the following invocations in case all SCCs do not
have any cycles.
If we count the number of cyclic SCCs in __unix_walk_scc(), we can
simplify unix_walk_scc_fast() because the number of cyclic SCCs
only changes when it garbage-collects a SCC.
So, let's count cyclic SCC in __unix_walk_scc() and decrement it
in unix_walk_scc_fast() when performing garbage collection.
Note that we will use this counter in a later patch to check if a
cycle existed in the previous GC run.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251115020935.2643121-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
unix_stream_read_generic().
Miao Wang reported a bug of SO_PEEK_OFF on AF_UNIX SOCK_STREAM
socket.
The unexpected behaviour is triggered when the peek offset is
larger than the recv queue and the thread is unblocked by new
data.
Let's assume a socket which has "aaaa" in the recv queue and
the peek offset is 4.
First, unix_stream_read_generic() reads the offset 4 and skips
the skb(s) of "aaaa" with the code below:
skip = max(sk_peek_offset(sk, flags), 0); /* @skip is 4. */
do {
...
while (skip >= unix_skb_len(skb)) {
skip -= unix_skb_len(skb);
...
skb = skb_peek_next(skb, &sk->sk_receive_queue);
if (!skb)
goto again; /* @skip is 0. */
}
The thread jumps to the 'again' label and goes to sleep since
new data has not arrived yet.
Later, new data "bbbb" unblocks the thread, and the thread jumps
to the 'redo:' label to restart the entire process from the first
skb in the recv queue.
do {
...
redo:
...
last = skb = skb_peek(&sk->sk_receive_queue);
...
again:
if (skb == NULL) {
...
timeo = unix_stream_data_wait(sk, timeo, last,
last_len, freezable);
...
goto redo; /* @skip is 0 !! */
However, the peek offset is not reset in the path.
If the buffer size is 8, recv() will return "aaaabbbb" without
skipping any data, and the final offset will be 12 (the original
offset 4 + peeked skbs' length 8).
After sleeping in unix_stream_read_generic(), we have to fetch the
peek offset again.
Let's move the redo label before mutex_lock(&u->iolock).
Fixes: 9f389e35674f ("af_unix: return data from multiple SKBs on recv() with MSG_PEEK flag")
Reported-by: Miao Wang <shankerwangmiao@gmail.com>
Closes: https://lore.kernel.org/netdev/3B969F90-F51F-4B9D-AB1A-994D9A54D460@gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251117174740.3684604-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
mptcp_pm_del_add_timer() can call sk_stop_timer_sync(sk, &entry->add_timer)
while another might have free entry already, as reported by syzbot.
Add RCU protection to fix this issue.
Also change confusing add_timer variable with stop_timer boolean.
syzbot report:
BUG: KASAN: slab-use-after-free in __timer_delete_sync+0x372/0x3f0 kernel/time/timer.c:1616
Read of size 4 at addr ffff8880311e4150 by task kworker/1:1/44
CPU: 1 UID: 0 PID: 44 Comm: kworker/1:1 Not tainted syzkaller #0 PREEMPT_{RT,(full)}
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/02/2025
Workqueue: events mptcp_worker
Call Trace:
<TASK>
dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0xca/0x240 mm/kasan/report.c:482
kasan_report+0x118/0x150 mm/kasan/report.c:595
__timer_delete_sync+0x372/0x3f0 kernel/time/timer.c:1616
sk_stop_timer_sync+0x1b/0x90 net/core/sock.c:3631
mptcp_pm_del_add_timer+0x283/0x310 net/mptcp/pm.c:362
mptcp_incoming_options+0x1357/0x1f60 net/mptcp/options.c:1174
tcp_data_queue+0xca/0x6450 net/ipv4/tcp_input.c:5361
tcp_rcv_established+0x1335/0x2670 net/ipv4/tcp_input.c:6441
tcp_v4_do_rcv+0x98b/0xbf0 net/ipv4/tcp_ipv4.c:1931
tcp_v4_rcv+0x252a/0x2dc0 net/ipv4/tcp_ipv4.c:2374
ip_protocol_deliver_rcu+0x221/0x440 net/ipv4/ip_input.c:205
ip_local_deliver_finish+0x3bb/0x6f0 net/ipv4/ip_input.c:239
NF_HOOK+0x30c/0x3a0 include/linux/netfilter.h:318
NF_HOOK+0x30c/0x3a0 include/linux/netfilter.h:318
__netif_receive_skb_one_core net/core/dev.c:6079 [inline]
__netif_receive_skb+0x143/0x380 net/core/dev.c:6192
process_backlog+0x31e/0x900 net/core/dev.c:6544
__napi_poll+0xb6/0x540 net/core/dev.c:7594
napi_poll net/core/dev.c:7657 [inline]
net_rx_action+0x5f7/0xda0 net/core/dev.c:7784
handle_softirqs+0x22f/0x710 kernel/softirq.c:622
__do_softirq kernel/softirq.c:656 [inline]
__local_bh_enable_ip+0x1a0/0x2e0 kernel/softirq.c:302
mptcp_pm_send_ack net/mptcp/pm.c:210 [inline]
mptcp_pm_addr_send_ack+0x41f/0x500 net/mptcp/pm.c:-1
mptcp_pm_worker+0x174/0x320 net/mptcp/pm.c:1002
mptcp_worker+0xd5/0x1170 net/mptcp/protocol.c:2762
process_one_work kernel/workqueue.c:3263 [inline]
process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3346
worker_thread+0x8a0/0xda0 kernel/workqueue.c:3427
kthread+0x711/0x8a0 kernel/kthread.c:463
ret_from_fork+0x4bc/0x870 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
</TASK>
Allocated by task 44:
kasan_save_stack mm/kasan/common.c:56 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:77
poison_kmalloc_redzone mm/kasan/common.c:400 [inline]
__kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:417
kasan_kmalloc include/linux/kasan.h:262 [inline]
__kmalloc_cache_noprof+0x1ef/0x6c0 mm/slub.c:5748
kmalloc_noprof include/linux/slab.h:957 [inline]
mptcp_pm_alloc_anno_list+0x104/0x460 net/mptcp/pm.c:385
mptcp_pm_create_subflow_or_signal_addr+0xf9d/0x1360 net/mptcp/pm_kernel.c:355
mptcp_pm_nl_fully_established net/mptcp/pm_kernel.c:409 [inline]
__mptcp_pm_kernel_worker+0x417/0x1ef0 net/mptcp/pm_kernel.c:1529
mptcp_pm_worker+0x1ee/0x320 net/mptcp/pm.c:1008
mptcp_worker+0xd5/0x1170 net/mptcp/protocol.c:2762
process_one_work kernel/workqueue.c:3263 [inline]
process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3346
worker_thread+0x8a0/0xda0 kernel/workqueue.c:3427
kthread+0x711/0x8a0 kernel/kthread.c:463
ret_from_fork+0x4bc/0x870 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
Freed by task 6630:
kasan_save_stack mm/kasan/common.c:56 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:77
__kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:587
kasan_save_free_info mm/kasan/kasan.h:406 [inline]
poison_slab_object mm/kasan/common.c:252 [inline]
__kasan_slab_free+0x5c/0x80 mm/kasan/common.c:284
kasan_slab_free include/linux/kasan.h:234 [inline]
slab_free_hook mm/slub.c:2523 [inline]
slab_free mm/slub.c:6611 [inline]
kfree+0x197/0x950 mm/slub.c:6818
mptcp_remove_anno_list_by_saddr+0x2d/0x40 net/mptcp/pm.c:158
mptcp_pm_flush_addrs_and_subflows net/mptcp/pm_kernel.c:1209 [inline]
mptcp_nl_flush_addrs_list net/mptcp/pm_kernel.c:1240 [inline]
mptcp_pm_nl_flush_addrs_doit+0x593/0xbb0 net/mptcp/pm_kernel.c:1281
genl_family_rcv_msg_doit+0x215/0x300 net/netlink/genetlink.c:1115
genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline]
genl_rcv_msg+0x60e/0x790 net/netlink/genetlink.c:1210
netlink_rcv_skb+0x208/0x470 net/netlink/af_netlink.c:2552
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219
netlink_unicast_kernel net/netlink/af_netlink.c:1320 [inline]
netlink_unicast+0x846/0xa10 net/netlink/af_netlink.c:1346
netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1896
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg+0x21c/0x270 net/socket.c:742
____sys_sendmsg+0x508/0x820 net/socket.c:2630
___sys_sendmsg+0x21f/0x2a0 net/socket.c:2684
__sys_sendmsg net/socket.c:2716 [inline]
__do_sys_sendmsg net/socket.c:2721 [inline]
__se_sys_sendmsg net/socket.c:2719 [inline]
__x64_sys_sendmsg+0x1a1/0x260 net/socket.c:2719
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Cc: stable@vger.kernel.org
Fixes: 00cfd77b9063 ("mptcp: retransmit ADD_ADDR when timeout")
Reported-by: syzbot+2a6fbf0f0530375968df@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/691ad3c3.a70a0220.f6df1.0004.GAE@google.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Geliang Tang <geliang@kernel.org>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251117100745.1913963-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This is a followup of commit e20dfbad8aab ("net: fix napi_consume_skb()
with alien skbs").
Now the per-cpu napi_skb_cache is populated from TX completion path,
we can make use of this cache, especially for cpus not used
from a driver NAPI poll (primary user of napi_cache).
We can use the napi_skb_cache only if current context is not from hard irq.
With this patch, I consistently reach 130 Mpps on my UDP tx stress test
and reduce SLUB spinlock contention to smaller values.
Note there is still some SLUB contention for skb->head allocations.
I had to tune /sys/kernel/slab/skbuff_small_head/cpu_partial
and /sys/kernel/slab/skbuff_small_head/min_partial depending
on the platform taxonomy.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251116202717.1542829-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This patch refactors __alloc_skb() to prepare the following one,
and does not change functionality.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251116202717.1542829-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We want to be able in the series last patch to get an skb from
napi_skb_cache from process context, if there is one available.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251116202717.1542829-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
s/connectons/connections/ and s/message based/message-based/
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20251116135616.106079-2-thorsten.blum@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
tw->tw_transparent is initialised twice in inet_twsk_alloc()
and tcp_time_wait().
Let's remove the latter.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20251118000445.4091280-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2025-11-18
1) Misc fixes for xfrm_state creation/modification/deletion.
Patchset from Sabrina Dubroca.
2) Fix inner packet family determination for xfrm offloads.
From Jianbo Liu.
3) Don't push locally generated packets directly to L2 tunnel
mode offloading, they still need processing from the standard
xfrm path. From Jianbo Liu.
4) Fix memory leaks in xfrm_add_acquire for policy offloads and policy
security contexts. From Zilin Guan.
* tag 'ipsec-2025-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: fix memory leak in xfrm_add_acquire()
xfrm: Prevent locally generated packets from direct output in tunnel mode
xfrm: Determine inner GSO type from packet inner protocol
xfrm: Check inner packet family directly from skb_dst
xfrm: check all hash buckets for leftover states during netns deletion
xfrm: set err and extack on failure to create pcpu SA
xfrm: call xfrm_dev_state_delete when xfrm_state_migrate fails to add the state
xfrm: make state as DEAD before final put when migrate fails
xfrm: also call xfrm_state_delete_tunnel at destroy time for states that were never added
xfrm: drop SA reference in xfrm_state_update if dir doesn't match
====================
Link: https://patch.msgid.link/20251118085344.2199815-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2025-11-18
1) Relax a lock contention bottleneck to improve IPsec crypto
offload performance. From Jianbo Liu.
2) Deprecate pfkey, the interface will be removed in 2027.
3) Update xfrm documentation and move it to ipsec maintainance.
From Bagas Sanjaya.
* tag 'ipsec-next-2025-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
MAINTAINERS: Add entry for XFRM documentation
net: Move XFRM documentation into its own subdirectory
Documentation: xfrm_sync: Number the fifth section
Documentation: xfrm_sysctl: Trim trailing colon in section heading
Documentation: xfrm_sync: Trim excess section heading characters
Documentation: xfrm_sync: Properly reindent list text
Documentation: xfrm_device: Separate hardware offload sublists
Documentation: xfrm_device: Use numbered list for offloading steps
Documentation: xfrm_device: Wrap iproute2 snippets in literal code block
pfkey: Deprecate pfkey
xfrm: Skip redundant replay recheck for the hardware offload path
xfrm: Refactor xfrm_input lock to reduce contention with RSS
====================
Link: https://patch.msgid.link/20251118092610.2223552-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The function devl_rate_nodes_destroy is documented to "Unset parent for
all rate objects". However, it was only calling the driver-specific
`rate_leaf_parent_set` or `rate_node_parent_set` ops and decrementing
the parent's refcount, without actually setting the
`devlink_rate->parent` pointer to NULL.
This leaves a dangling pointer in the `devlink_rate` struct, which cause
refcount error in netdevsim[1] and mlx5[2]. In addition, this is
inconsistent with the behavior of `devlink_nl_rate_parent_node_set`,
where the parent pointer is correctly cleared.
This patch fixes the issue by explicitly setting `devlink_rate->parent`
to NULL after notifying the driver, thus fulfilling the function's
documented behavior for all rate objects.
[1]
repro steps:
echo 1 > /sys/bus/netdevsim/new_device
devlink dev eswitch set netdevsim/netdevsim1 mode switchdev
echo 1 > /sys/bus/netdevsim/devices/netdevsim1/sriov_numvfs
devlink port function rate add netdevsim/netdevsim1/test_node
devlink port function rate set netdevsim/netdevsim1/128 parent test_node
echo 1 > /sys/bus/netdevsim/del_device
dmesg:
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 8 PID: 1530 at lib/refcount.c:31 refcount_warn_saturate+0x42/0xe0
CPU: 8 UID: 0 PID: 1530 Comm: bash Not tainted 6.18.0-rc4+ #1 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:refcount_warn_saturate+0x42/0xe0
Call Trace:
<TASK>
devl_rate_leaf_destroy+0x8d/0x90
__nsim_dev_port_del+0x6c/0x70 [netdevsim]
nsim_dev_reload_destroy+0x11c/0x140 [netdevsim]
nsim_drv_remove+0x2b/0xb0 [netdevsim]
device_release_driver_internal+0x194/0x1f0
bus_remove_device+0xc6/0x130
device_del+0x159/0x3c0
device_unregister+0x1a/0x60
del_device_store+0x111/0x170 [netdevsim]
kernfs_fop_write_iter+0x12e/0x1e0
vfs_write+0x215/0x3d0
ksys_write+0x5f/0xd0
do_syscall_64+0x55/0x10f0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
[2]
devlink dev eswitch set pci/0000:08:00.0 mode switchdev
devlink port add pci/0000:08:00.0 flavour pcisf pfnum 0 sfnum 1000
devlink port function rate add pci/0000:08:00.0/group1
devlink port function rate set pci/0000:08:00.0/32768 parent group1
modprobe -r mlx5_ib mlx5_fwctl mlx5_core
dmesg:
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 7 PID: 16151 at lib/refcount.c:31 refcount_warn_saturate+0x42/0xe0
CPU: 7 UID: 0 PID: 16151 Comm: bash Not tainted 6.17.0-rc7_for_upstream_min_debug_2025_10_02_12_44 #1 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
RIP: 0010:refcount_warn_saturate+0x42/0xe0
Call Trace:
<TASK>
devl_rate_leaf_destroy+0x8d/0x90
mlx5_esw_offloads_devlink_port_unregister+0x33/0x60 [mlx5_core]
mlx5_esw_offloads_unload_rep+0x3f/0x50 [mlx5_core]
mlx5_eswitch_unload_sf_vport+0x40/0x90 [mlx5_core]
mlx5_sf_esw_event+0xc4/0x120 [mlx5_core]
notifier_call_chain+0x33/0xa0
blocking_notifier_call_chain+0x3b/0x50
mlx5_eswitch_disable_locked+0x50/0x110 [mlx5_core]
mlx5_eswitch_disable+0x63/0x90 [mlx5_core]
mlx5_unload+0x1d/0x170 [mlx5_core]
mlx5_uninit_one+0xa2/0x130 [mlx5_core]
remove_one+0x78/0xd0 [mlx5_core]
pci_device_remove+0x39/0xa0
device_release_driver_internal+0x194/0x1f0
unbind_store+0x99/0xa0
kernfs_fop_write_iter+0x12e/0x1e0
vfs_write+0x215/0x3d0
ksys_write+0x5f/0xd0
do_syscall_64+0x53/0x1f0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Fixes: d75559845078 ("devlink: Allow setting parent node of rate objects")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1763381149-1234377-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Since selem already saves a pointer to smap, use it instead of an
additional argument in bpf_selem_free(). This requires moving the
SDATA(selem)->smap assignment from bpf_selem_link_map() to
bpf_selem_alloc() since bpf_selem_free() may be called without the
selem being linked to smap in bpf_local_storage_update().
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20251114201329.3275875-3-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Since commit a96a44aba556 ("bpf: bpf_sk_storage: Fix invalid wait
context lockdep report"), {charge,uncharge}_mem are always true when
allocating a bpf_local_storage_elem or unlinking a bpf_local_storage_elem
from local storage, so drop these arguments. No functional change.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20251114201329.3275875-2-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Replace the open coded implementation with the scoped user access guard.
That also corrects the imbalance between masked_user_access_begin() and
user_write_access_end(), which would affect PowerPC when it gains masked
user access support.
No functional change intended.
[ tglx: Amend change log ]
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/793219313f641eda09a892d06768d2837246bf9f.1763396724.git.christophe.leroy@csgroup.eu
|
|
Just use d_make_persistent() + dput() (and fold the latter into
simple_finish_creating()) and that's it...
NOTE: pipe->dentry is a borrowed reference - it does not contribute
to dentry refcount.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
net.ipv4.tcp_comp_sack_slack_ns current default value is too high.
When a flow has many drops (1 % or more), and small RTT, adding 100 usec
before sending SACK stalls the sender relying on getting SACK
fast enough to keep the pipe busy.
Decrease the default to 10 usec.
This is orthogonal to Congestion Control heuristics to determine
if drops are caused by congestion or not.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20251114135141.3810964-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
svc_tcp_sendmsg() calls xdr_buf_to_bvec() with the second slot of
rq_bvec as the start, but doesn't reduce the array length by one, which
could lead to an array overrun. Also, rq_bvec is always rq_maxpages in
length, which can be too short in some cases, since the TCP record
marker consumes a slot.
Fix both problems by adding a separate bvec array to the svc_sock that
is specifically for sending. For TCP, make this array one slot longer
than rq_maxpages, to account for the record marker. For UDP, only
allocate as large an array as we need since it's limited to 64k of
payload.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Including the client IP address that generated the overrun traffic
seems like it would be helpful. The message now reads:
kernel: svc: nfsd oversized RPC fragment (1064958 octets) from 100.64.0.11:45866
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
NFSD has always supported added network listeners. The new netlink
protocol now enables the removal of listeners.
Olga noticed that if an RDMA listener is removed and immediately
re-added, the deferred __svc_rdma_free() function might not have
run yet, so some or all of the old listener's RDMA resources
linger, which prevents a new listener on the same address from
being created.
Also, svc_xprt_free() does a module_put() just after calling
->xpo_free(). That means if there is deferred work going on, the
module could be unloaded before that work is even started,
resulting in a UAF.
Neil asks:
> What particular part of __svc_rdma_free() needs to run in order for a
> subsequent registration to succeed?
> Can that bit be run directory from svc_rdma_free() rather than be
> delayed?
> (I know almost nothing about rdma so forgive me if the answers to these
> questions seems obvious)
The reasons I can recall are:
- Some of the transport tear-down work can sleep
- Releasing a cm_id is tricky and can deadlock
We might be able to mitigate the second issue with judicious
application of transport reference counting.
Reported-by: Olga Kornievskaia <okorniev@redhat.com>
Closes: https://lore.kernel.org/linux-nfs/20250821204328.89218-1-okorniev@redhat.com/
Suggested-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The bpf_skb_check_mtu helper needs to use skb->transport_header when
the BPF_MTU_CHK_SEGS flag is used:
bpf_skb_check_mtu(skb, ifindex, &mtu_len, 0, BPF_MTU_CHK_SEGS)
The transport_header is not always set. There is a WARN_ON_ONCE
report when CONFIG_DEBUG_NET is enabled + skb->gso_size is set +
bpf_prog_test_run is used:
WARNING: CPU: 1 PID: 2216 at ./include/linux/skbuff.h:3071
skb_gso_validate_network_len
bpf_skb_check_mtu
bpf_prog_3920e25740a41171_tc_chk_segs_flag # A test in the next patch
bpf_test_run
bpf_prog_test_run_skb
For a normal ingress skb (not test_run), skb_reset_transport_header
is performed but there is plan to avoid setting it as described in
commit 2170a1f09148 ("net: no longer reset transport_header in __netif_receive_skb_core()").
This patch fixes the bpf helper by checking
skb_transport_header_was_set(). The check is done just before
skb->transport_header is used, to avoid breaking the existing bpf prog.
The WARN_ON_ONCE is limited to bpf_prog_test_run, so targeting bpf-next.
Fixes: 34b2021cc616 ("bpf: Add BPF-helper for MTU checking")
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20251112232331.1566074-1-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|