| Age | Commit message (Collapse) | Author |
|
syzbot found that ndisc_router_discovery() could read and write
in6_dev->ra_mtu without holding a lock [1]
This looks fine, IFLA_INET6_RA_MTU is best effort.
Add READ_ONCE()/WRITE_ONCE() to document the race.
Note that we might also reject illegal MTU values
(mtu < IPV6_MIN_MTU || mtu > skb->dev->mtu) in a future patch.
[1]
BUG: KCSAN: data-race in ndisc_router_discovery / ndisc_router_discovery
read to 0xffff888119809c20 of 4 bytes by task 25817 on cpu 1:
ndisc_router_discovery+0x151d/0x1c90 net/ipv6/ndisc.c:1558
ndisc_rcv+0x2ad/0x3d0 net/ipv6/ndisc.c:1841
icmpv6_rcv+0xe5a/0x12f0 net/ipv6/icmp.c:989
ip6_protocol_deliver_rcu+0xb2a/0x10d0 net/ipv6/ip6_input.c:438
ip6_input_finish+0xf0/0x1d0 net/ipv6/ip6_input.c:489
NF_HOOK include/linux/netfilter.h:318 [inline]
ip6_input+0x5e/0x140 net/ipv6/ip6_input.c:500
ip6_mc_input+0x27c/0x470 net/ipv6/ip6_input.c:590
dst_input include/net/dst.h:474 [inline]
ip6_rcv_finish+0x336/0x340 net/ipv6/ip6_input.c:79
...
write to 0xffff888119809c20 of 4 bytes by task 25816 on cpu 0:
ndisc_router_discovery+0x155a/0x1c90 net/ipv6/ndisc.c:1559
ndisc_rcv+0x2ad/0x3d0 net/ipv6/ndisc.c:1841
icmpv6_rcv+0xe5a/0x12f0 net/ipv6/icmp.c:989
ip6_protocol_deliver_rcu+0xb2a/0x10d0 net/ipv6/ip6_input.c:438
ip6_input_finish+0xf0/0x1d0 net/ipv6/ip6_input.c:489
NF_HOOK include/linux/netfilter.h:318 [inline]
ip6_input+0x5e/0x140 net/ipv6/ip6_input.c:500
ip6_mc_input+0x27c/0x470 net/ipv6/ip6_input.c:590
dst_input include/net/dst.h:474 [inline]
ip6_rcv_finish+0x336/0x340 net/ipv6/ip6_input.c:79
...
value changed: 0x00000000 -> 0xe5400659
Fixes: 49b99da2c9ce ("ipv6: add IFLA_INET6_RA_MTU to expose mtu value")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Rocco Yue <rocco.yue@mediatek.com>
Link: https://patch.msgid.link/20260118152941.2563857-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
dev->work can re read locklessly in mISDN_read()
and mISDN_poll(). Add READ_ONCE()/WRITE_ONCE() annotations.
BUG: KCSAN: data-race in mISDN_ioctl / mISDN_read
write to 0xffff88812d848280 of 4 bytes by task 10864 on cpu 1:
misdn_add_timer drivers/isdn/mISDN/timerdev.c:175 [inline]
mISDN_ioctl+0x2fb/0x550 drivers/isdn/mISDN/timerdev.c:233
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:597 [inline]
__se_sys_ioctl+0xce/0x140 fs/ioctl.c:583
__x64_sys_ioctl+0x43/0x50 fs/ioctl.c:583
x64_sys_call+0x14b0/0x3000 arch/x86/include/generated/asm/syscalls_64.h:17
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xd8/0x2c0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
read to 0xffff88812d848280 of 4 bytes by task 10857 on cpu 0:
mISDN_read+0x1f2/0x470 drivers/isdn/mISDN/timerdev.c:112
do_loop_readv_writev fs/read_write.c:847 [inline]
vfs_readv+0x3fb/0x690 fs/read_write.c:1020
do_readv+0xe7/0x210 fs/read_write.c:1080
__do_sys_readv fs/read_write.c:1165 [inline]
__se_sys_readv fs/read_write.c:1162 [inline]
__x64_sys_readv+0x45/0x50 fs/read_write.c:1162
x64_sys_call+0x2831/0x3000 arch/x86/include/generated/asm/syscalls_64.h:20
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xd8/0x2c0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
value changed: 0x00000000 -> 0x00000001
Fixes: 1b2b03f8e514 ("Add mISDN core files")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260118132528.2349573-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The industry standard jumbo frame MTU is 9216 bytes. When using the DSA
subsystem, a 4-byte tag is added to each Ethernet frame.
Increase AIROHA_MAX_MTU to 9220 bytes (9216 + 4) so that users can set a
standard 9216-byte MTU on DSA ports.
The underlying hardware supports significantly larger frame sizes
(approximately 16K). However, the maximum MTU is limited to 9220 bytes
for now, as this is sufficient to support standard jumbo frames and does
not incur additional memory allocation overhead.
Signed-off-by: Sayantan Nandy <sayantann11@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260119073658.6216-1-sayantann11@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
For these two firmware mailbox commands, in txgbe_test_hostif() and
txgbe_set_phy_link_hostif(), there is no need to read data from the
buffer.
Under the current setting, OEM firmware will cause the driver to fail to
probe. Because OEM firmware returns more link information, with a larger
OEM structure txgbe_hic_ephy_getlink. However, the current driver does
not support the OEM function. So just fix it in the way that does not
involve reading the returned data.
Fixes: d84a3ff9aae8 ("net: txgbe: Restrict the use of mismatched FW versions")
Cc: stable@vger.kernel.org
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Link: https://patch.msgid.link/2914AB0BC6158DDA+20260119065935.6015-1-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Jijie Shao says:
====================
fix some bugs in the flow director of HNS3 driver
This patchset fixes two bugs in the flow director:
1. Incorrect definition of HCLGE_FD_AD_COUNTER_NUM_M
2. Incorrect assignment of HCLGE_FD_AD_NXT_KEY
====================
Link: https://patch.msgid.link/20260119132840.410513-1-shaojijie@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use next_input_key instead of counter_id to set HCLGE_FD_AD_NXT_KEY.
Fixes: 117328680288 ("net: hns3: Add input key and action config support for flow director")
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Link: https://patch.msgid.link/20260119132840.410513-3-shaojijie@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
HCLGE_FD_AD_COUNTER_NUM_M should be at GENMASK(19, 13),
rather than at GENMASK(20, 13), because bit 20 is
HCLGE_FD_AD_NXT_STEP_B.
This patch corrects the wrong definition.
Fixes: 117328680288 ("net: hns3: Add input key and action config support for flow director")
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Link: https://patch.msgid.link/20260119132840.410513-2-shaojijie@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Tao Wang reports that sometimes, after resume, stmmac can watchdog:
NETDEV WATCHDOG: CPU: x: transmit queue x timed out xx ms
When this occurs, the DMA transmit descriptors contain:
eth0: 221 [0x0000000876d10dd0]: 0x73660cbe 0x8 0x42 0xb04416a0
eth0: 222 [0x0000000876d10de0]: 0x77731d40 0x8 0x16a0 0x90000000
where descriptor 221 is the TSO header and 222 is the TSO payload.
tdes3 for descriptor 221 (0xb04416a0) has both bit 29 (first
descriptor) and bit 28 (last descriptor) set, which is incorrect.
The following packet also has bit 28 set, but isn't marked as a
first descriptor, and this causes the transmit DMA to stall.
This occurs because stmmac_tso_allocator() populates the first
descriptor, but does not set .last_segment correctly. There are two
places where this matters: one is later in stmmac_tso_xmit() where
we use it to update the TSO header descriptor. The other is in the
ring/chain mode clean_desc3() which is a performance optimisation.
Rather than using tx_q->tx_skbuff_dma[].last_segment to determine
whether the first descriptor entry is the only segment, calculate the
number of descriptor entries used. If there is only one descriptor,
then the first is also the last, so mark it as such.
Further work will be necessary to either eliminate .last_segment
entirely or set it correctly. Code analysis also indicates that a
similar issue exists with .is_jumbo. These will be the subject of
a future patch.
Reported-by: Tao Wang <tao03.wang@horizon.auto>
Fixes: c2837423cb54 ("net: stmmac: Rework TX Coalesce logic")
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vhq8O-00000005N5s-0Ke5@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In be_get_new_eqd(), statistics of pkts, protected by u64_stats_sync, are
read and accumulated in ignorance of possible u64_stats_fetch_retry()
events. Before the commit in question, these statistics were retrieved
one by one directly from queues. Fix this by reading them into temporary
variables first.
Fixes: 209477704187 ("be2net: set interrupt moderation for Skyhawk-R using EQ-DB")
Signed-off-by: David Yang <mmyangfl@gmail.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20260119153440.1440578-1-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In idpf_net_dim(), some statistics protected by u64_stats_sync, are read
and accumulated in ignorance of possible u64_stats_fetch_retry() events.
The correct way to copy statistics is already illustrated by
idpf_add_queue_stats(). Fix this by reading them into temporary variables
first.
Fixes: c2d548cad150 ("idpf: add TX splitq napi poll support")
Fixes: 3a8845af66ed ("idpf: add RX splitq napi poll support")
Signed-off-by: David Yang <mmyangfl@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260119162720.1463859-1-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In hns3_fetch_stats(), ring statistics, protected by u64_stats_sync, are
read and accumulated in ignorance of possible u64_stats_fetch_retry()
events. These statistics are already accumulated by
hns3_ring_stats_update(). Fix this by reading them into a temporary
buffer first.
Fixes: b20d7fe51e0d ("net: hns3: add some statitics info to tx process")
Signed-off-by: David Yang <mmyangfl@gmail.com>
Link: https://patch.msgid.link/20260119160759.1455950-1-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
active_fec is a 2-bit unsigned field, and thus can only have the values
0-3. So checking that it is less than 4 is unnecessary.
Simplify the code by dropping this check.
As it no longer fits well where it is, move FEC_MAX_INDEX to towards the
top of the file. And add the prefix OXT2. I believe this is more
idiomatic.
Flagged by Smatch as:
...//otx2_ethtool.c:1024 otx2_get_fecparam() warn: always true condition '(pfvf->linfo.fec < 4) => (0-3 < 4)'
No functional change intended.
Compile tested only.
Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Hariprasad Kelam <hkelam@marvell.com>
Link: https://patch.msgid.link/20260119-oob-v1-1-a4147e75e770@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Looks like AI reviewers miss that napi_consume_skb() must have
a real budget passed to it. Let's see if adding a real kdoc will
help them figure this out.
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20260119224140.1362729-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When the TX queue length reaches the threshold, the netdev watchdog
immediately detects a TX queue timeout.
This patch updates the trans_start timestamp of the transmit queue
on every asynchronous USB URB submission along the transmit path,
ensuring that the network watchdog accurately reflects ongoing
transmission activity.
Signed-off-by: Mingj Ye <insyelu@gmail.com>
Reviewed-by: Hayes Wang <hayeswang@realtek.com>
Link: https://patch.msgid.link/20260120015949.84996-1-insyelu@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit ca9d74eb5f6a ("uapi: add INT_MAX and INT_MIN constants")
recently removed some includes of limits.h in uAPI headers.
ncdevmem.c was depending on them:
ncdevmem.c: In function ‘ethtool_add_flow’:
ncdevmem.c:369:60: error: ‘INT_MAX’ undeclared (first use in this function)
369 | if (endptr == id_start || flow_id < 0 || flow_id > INT_MAX)
| ^~~~~~~
ncdevmem.c:77:1: note: ‘INT_MAX’ is defined in header ‘<limits.h>’; did you forget to ‘#include <limits.h>’?
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20260120180319.1673271-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
After commit d228ece36345 ("clk: divider: remove round_rate() in favor
of determine_rate()") determining GFX3D clock rate crashes, because the
passed parent map doesn't provide the expected best_parent_hw clock
(with the roundd_rate path before the offending commit the
best_parent_hw was ignored).
Set the field in parent_req in addition to setting it in the req,
fixing the crash.
clk_hw_round_rate (drivers/clk/clk.c:1764) (P)
clk_divider_bestdiv (drivers/clk/clk-divider.c:336)
divider_determine_rate (drivers/clk/clk-divider.c:358)
clk_alpha_pll_postdiv_determine_rate (drivers/clk/qcom/clk-alpha-pll.c:1275)
clk_core_determine_round_nolock (drivers/clk/clk.c:1606)
clk_core_round_rate_nolock (drivers/clk/clk.c:1701)
__clk_determine_rate (drivers/clk/clk.c:1741)
clk_gfx3d_determine_rate (drivers/clk/qcom/clk-rcg2.c:1268)
clk_core_determine_round_nolock (drivers/clk/clk.c:1606)
clk_core_round_rate_nolock (drivers/clk/clk.c:1701)
clk_core_round_rate_nolock (drivers/clk/clk.c:1710)
clk_round_rate (drivers/clk/clk.c:1804)
dev_pm_opp_set_rate (drivers/opp/core.c:1440 (discriminator 1))
msm_devfreq_target (drivers/gpu/drm/msm/msm_gpu_devfreq.c:51)
devfreq_set_target (drivers/devfreq/devfreq.c:360)
devfreq_update_target (drivers/devfreq/devfreq.c:426)
devfreq_monitor (drivers/devfreq/devfreq.c:458)
process_one_work (arch/arm64/include/asm/jump_label.h:36 include/trace/events/workqueue.h:110 kernel/workqueue.c:3284)
worker_thread (kernel/workqueue.c:3356 (discriminator 2) kernel/workqueue.c:3443 (discriminator 2))
kthread (kernel/kthread.c:467)
ret_from_fork (arch/arm64/kernel/entry.S:861)
Fixes: 55213e1acec9 ("clk: qcom: Add gfx3d ping-pong PLL frequency switching")
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Reviewed-by: Abel Vesa <abel.vesa@oss.qualcomm.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Reviewed-by: Brian Masney <bmasney@redhat.com>
Link: https://lore.kernel.org/r/20260117-db820-fix-gfx3d-v1-1-0f8894d71d63@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
|
|
After skb allocation, initial skb->fclone value is 0 (SKB_FCLONE_UNAVAILABLE)
We can replace one RMW sequence with a single OR instruction.
movzbl 0x7e(%r13),%eax // skb->fclone = SKB_FCLONE_ORIG;
and $0xf3,%al
or $0x4,%al
mov %al,0x7e(%r13)
->
or $0x4,0x7e(%r13) // skb->fclone |= SKB_FCLONE_ORIG;
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260116164402.1872649-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Stefan Eichenberger says:
====================
Convert the Micrel bindings to DT schema
Convert the device tree bindings for the Micrel PHYs and switches to DT
schema.
====================
Link: https://patch.msgid.link/20260116130948.79558-1-eichest@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Convert the micrel-ksz90x1.txt to DT schema. Create a separate YAML file
for this PHY series. The old naming of ksz90x1 would be misleading in
this case, so rename it to gigabit, as it contains ksz9xx1 and lan8xxx
gigabit PHYs.
Signed-off-by: Stefan Eichenberger <stefan.eichenberger@toradex.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260116130948.79558-3-eichest@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Convert the devicetree bindings for the Micrel PHYs and switches to DT
schema.
Signed-off-by: Stefan Eichenberger <stefan.eichenberger@toradex.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20260116130948.79558-2-eichest@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
LAN969x has 2 dedicated RGMII ports, so regular SERDES lanes are not used
for RGMII.
So, lets not require phys to be defined when any of the rgmii phy-modes are
set.
Signed-off-by: Robert Marko <robert.marko@sartura.hr>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20260115114021.111324-11-robert.marko@sartura.hr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Extend HW timestamp tests to check that ioctl interface is not broken
and configuration setups and requests are equal to netlink interface.
Some linter warnings are disabled because of ctypes classes.
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20260116062121.1230184-2-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
With all drivers converted to use ndo_hwstamp callbacks the legacy way
can be removed, marking ioctl interface as deprecated.
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20260116062121.1230184-1-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
kmalloc_reserve() is too big to be inlined.
Put the slow path in a new out-of-line function : kmalloc_pfmemalloc()
Then let kmalloc_reserve() set skb->pfmemalloc only when/if
the slow path is taken.
This makes __alloc_skb() faster :
- kmalloc_reserve() is now automatically inlined by both gcc and clang.
- No more expensive RMW (skb->pfmemalloc = pfmemalloc).
- No more expensive stack canary (for CONFIG_STACKPROTECTOR_STRONG=y).
- Removal of two prefetches that were coming too late for modern cpus.
Text size increase is quite small compared to the cpu savings (~0.7 %)
$ size net/core/skbuff.clang.before.o net/core/skbuff.clang.after.o
text data bss dec hex filename
72507 5897 0 78404 13244 net/core/skbuff.clang.before.o
72681 5897 0 78578 132f2 net/core/skbuff.clang.after.o
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260116041359.181104-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Mohsin Bashir says:
====================
eth: fbnic: Update IPC mailbox support
Update IPC mailbox support for fbnic to cater for several changes.
====================
Link: https://patch.msgid.link/20260115003353.4150771-1-mohsin.bashr@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
While waiting for completions on read requests, driver is using
different timeout values for different messages. Make use of a single
timeout value.
Introduce a wrapper function to handle the wait, which also simplify
maintaining the 80 char line limit.
Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20260115003353.4150771-6-mohsin.bashr@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The driver retries sensor read requests from firmware, but this is
unnecessary. A functioning firmware should respond to each request
within the timeout period. Remove the retry logic and set the timeout
to the sum of all retry timeouts.
Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20260115003353.4150771-5-mohsin.bashr@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, the RX mailbox frees and reallocates a page for each received
message. Since FW Rx messages are processed synchronously, and nothing
hold these pages (unlike skbs which we hand over to the stack), reuse
the pages and put them back on the Rx ring. Now that we ensure the ring
is always fully populated we don't have to worry about filling it up
after partial population during init, either. Update
fbnic_mbx_process_rx_msgs() to recycle pages after message processing.
Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20260115003353.4150771-4-mohsin.bashr@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Now that memory is allocated with GFP_KERNEL, allocation failures
should be extremely rare. Ensure the FW communication ring is
always fully populated with free pages, and hard fail initialization
otherwise. This enables simplifications in next patches.
Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20260115003353.4150771-3-mohsin.bashr@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Replace GFP_ATOMIC with GFP_KERNEL for mailbox RX page allocation. Since
interrupt handler is threaded GFP_KERNEL is a safe option to reduce
allocation failures.
Also remove __GFP_NOWARN so the kernel reports a warning on allocation
failure to aid debugging.
Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20260115003353.4150771-2-mohsin.bashr@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Remove lock from the bpf_timer_cancel() helper. The lock does not
protect from concurrent modification of the bpf_async_cb data fields as
those are modified in the callback without locking.
Use guard(rcu)() instead of pair of explicit lock()/unlock().
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260120-timer_nolock-v6-4-670ffdd787b4@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Introduce bpf_async_update_prog_callback(): lock-free update of cb->prog
and cb->callback_fn. This function allows updating prog and callback_fn
fields of the struct bpf_async_cb without holding lock.
For now use it under the lock from __bpf_async_set_callback(), in the
next patches that lock will be removed.
Lock-free algorithm:
* Acquire a guard reference on prog to prevent it from being freed
during the retry loop.
* Retry loop:
1. Each iteration acquires a new prog reference and stores it
in cb->prog via xchg. The previous prog is released.
2. The loop condition checks if both cb->prog and cb->callback_fn
match what we just wrote. If either differs, a concurrent writer
overwrote our value, and we must retry.
3. When we retry, our previously-stored prog was already released by
the concurrent writer or will be released by us after
overwriting.
* Release guard reference.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260120-timer_nolock-v6-3-670ffdd787b4@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Remove unused arguments from __bpf_async_set_callback().
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260120-timer_nolock-v6-2-670ffdd787b4@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Move the timer deletion logic into a dedicated bpf_timer_delete()
helper so it can be reused by later patches.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260120-timer_nolock-v6-1-670ffdd787b4@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Pavel Begunkov says:
====================
Add support for providers with large rx buffer
Many modern NICs support configurable receive buffer lengths, and zcrx and
memory providers can use buffers larger than 4K to improve performance.
When paired with hw-gro larger rx buffer sizes can drastically reduce
the number of buffers traversing the stack and save a lot of processing
time. It also allows to give to users larger contiguous chunks of data.
Single stream benchmarks showed up to ~30% CPU util improvement.
E.g. comparison for 4K vs 32K buffers using a 200Gbit NIC:
packets=23987040 (MB=2745098), rps=199559 (MB/s=22837)
CPU %usr %nice %sys %iowait %irq %soft %idle
0 1.53 0.00 27.78 2.72 1.31 66.45 0.22
packets=24078368 (MB=2755550), rps=200319 (MB/s=22924)
CPU %usr %nice %sys %iowait %irq %soft %idle
0 0.69 0.00 8.26 31.65 1.83 57.00 0.57
This series adds net infrastructure for memory providers configuring
the size and implements it for bnxt. It's an opt-in feature for drivers,
they should advertise support for the parameter in the qops and must check
if the hardware supports the given size. It's limited to memory providers
as it drastically simplifies implementation. It doesn't affect the fast
path zcrx uAPI, and the user exposed parameter is defined in zcrx terms,
which allows it to be flexible and adjusted in the future.
A liburing example can be found at [2]
full branch:
[1] https://github.com/isilence/linux.git zcrx/large-buffers-v8
Liburing example:
[2] https://github.com/isilence/liburing.git zcrx/rx-buf-len
* tag 'net-queue-rx-buf-len-v9' of https://github.com/isilence/linux:
io_uring/zcrx: document area chunking parameter
selftests: iou-zcrx: test large chunk sizes
eth: bnxt: support qcfg provided rx page size
eth: bnxt: adjust the fill level of agg queues with larger buffers
eth: bnxt: store rx buffer size per queue
net: pass queue rx page size from memory provider
net: add bare bone queue configs
net: reduce indent of struct netdev_queue_mgmt_ops members
net: memzero mp params when closing a queue
====================
Link: https://patch.msgid.link/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a new privileged ioctl so that xfs_scrub can ask the kernel to
verify the media of the devices backing an xfs filesystem, and have any
resulting media errors reported to fsnotify and xfs_healer.
To accomplish this, the kernel allocates a folio between the base page
size and 1MB, and issues read IOs to a gradually incrementing range of
one of the storage devices underlying an xfs filesystem. If any error
occurs, that raw error is reported to the calling process. If the error
happens to be one of the ones that the kernel considers indicative of
data loss, then it will also be reported to xfs_healthmon and fsnotify.
Driving the verification from the kernel enables xfs (and by extension
xfs_scrub) to have precise control over the size and error handling of
IOs that are issued to the underlying block device, and to emit
notifications about problems to other relevant kernel subsystems
immediately.
Note that the caller is also allowed to reduce the size of the IO and
to ask for a relaxation period after each IO.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Create a new ioctl for the healthmon file that checks that a given fd
points to the same filesystem that the healthmon file is monitoring.
This allows xfs_healer to check that when it reopens a mountpoint to
perform repairs, the file that it gets matches the filesystem that
generated the corruption report.
(Note that xfs_healer doesn't maintain an open fd to a filesystem that
it's monitoring so that it doesn't pin the mount.)
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Make it so that we can reconfigure the health monitoring device by
calling the XFS_IOC_HEALTH_MONITOR ioctl on it. As of right now we can
only toggle the verbose flag, but this is less annoying than having to
closing the monitor fd and reopen it.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Connect the fserror reporting to the health monitor so that xfs can send
events about file I/O errors to the xfs_healer daemon. These events are
entirely informational because xfs cannot regenerate user data, so
hopefully the fsnotify I/O error event gets noticed by the relevant
management systems.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Connect the fsdax media failure notification code to the health monitor
so that xfs can send events about that to the xfs_healer daemon.
Later on we'll add the ability for the xfs_scrub media scan (phase 6) to
report the errors that it finds to the kernel so that those are also
logged by xfs_healer.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Connect the filesystem shutdown code to the health monitor so that xfs
can send events about that to the xfs_healer daemon.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Connect the filesystem metadata health event collection system to the
health monitor so that xfs can send events to xfs_healer as it collects
information.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
In xfs_healthmon_unmount, send events to xfs_healer so that it knows
that nothing further can be done for the filesystem.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Create the basic infrastructure that we need to report health events to
userspace. We need a compact form for recording critical information
about an event and queueing them; a means to notice that we've lost some
events; and a means to format the events into something that userspace
can handle. Make the kernel export C structures via read().
In a previous iteration of this new subsystem, I wanted to explore data
exchange formats that are more flexible and easier for humans to read
than C structures. The thought being that when we want to rev (or
worse, enlarge) the event format, it ought to be trivially easy to do
that in a way that doesn't break old userspace.
I looked at formats such as protobufs and capnproto. These look really
nice in that extending the wire format is fairly easy, you can give it a
data schema and it generates the serialization code for you, handles
endianness problems, etc. The huge downside is that neither support C
all that well.
Too hard, and didn't want to port either of those huge sprawling
libraries first to the kernel and then again to xfsprogs. Then I
thought, how about JSON? Javascript objects are human readable, the
kernel can emit json without much fuss (it's all just strings!) and
there are plenty of interpreters for python/rust/c/etc.
There's a proposed schema format for json, which means that xfs can
publish a description of the events that kernel will emit. Userspace
consumers (e.g. xfsprogs/xfs_healer) can embed the same schema document
and use it to validate the incoming events from the kernel, which means
it can discard events that it doesn't understand, or garbage being
emitted due to bugs.
However, json has a huge crutch -- javascript is well known for its
vague definitions of what are numbers. This makes expressing a large
number rather fraught, because the runtime is free to represent a number
in nearly any way it wants. Stupider ones will truncate values to word
size, others will roll out doubles for uint52_t (yes, fifty-two) with
the resulting loss of precision. Not good when you're dealing with
discrete units.
It just so happens that python's json library is smart enough to see a
sequence of digits and put them in a u64 (at least on x86_64/aarch64)
but an actual javascript interpreter (pasting into Firefox) isn't
necessarily so clever.
It turns out that none of the proposed json schemas were ever ratified
even in an open-consensus way, so json blobs are still just loosely
structured blobs. The parsing in userspace was also noticeably slow and
memory-consumptive.
Hence only the C interface survives.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Start creating helper functions and infrastructure to pass filesystem
health events to a health monitoring file. Since this is an
administrative interface, we only support a single health monitor
process per filesystem, so we don't need to use anything fancy such as
notifier chains (== tons of indirect calls).
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
This reverts commit 77b9c4a438fc66e2ab004c411056b3fb71a54f2c, reversing
changes made to 4515ec4ad58a37e70a9e1256c0b993958c9b7497:
931420a2fc36 ("selftests/net: Add netkit container tests")
ab771c938d9a ("selftests/net: Make NetDrvContEnv support queue leasing")
6be87fbb2776 ("selftests/net: Add env for container based tests")
61d99ce3dfc2 ("selftests/net: Add bpf skb forwarding program")
920da3634194 ("netkit: Add xsk support for af_xdp applications")
eef51113f8af ("netkit: Add netkit notifier to check for unregistering devices")
b5ef109d22d4 ("netkit: Implement rtnl_link_ops->alloc and ndo_queue_create")
b5c3fa4a0b16 ("netkit: Add single device mode for netkit")
0073d2fd679d ("xsk: Proxy pool management for leased queues")
1ecea95dd3b5 ("xsk: Extend xsk_rcv_check validation")
804bf334d08a ("net: Proxy netdev_queue_get_dma_dev for leased queues")
0caa9a8ddec3 ("net: Proxy net_mp_{open,close}_rxq for leased queues")
ff8889ff9107 ("net, ethtool: Disallow leased real rxqs to be resized")
9e2103f36110 ("net: Add lease info to queue-get response")
31127deddef4 ("net: Implement netdev_nl_queue_create_doit")
a5546e18f77c ("net: Add queue-create operation")
The series will conflict with io_uring work, and the code needs more
polish.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Replace the verifier test for default trusted pointer semantics, which
previously relied on BPF kfunc bpf_get_root_mem_cgroup(), with a new
test utilizing dedicated BPF kfuncs defined within the bpf_testmod.
bpf_get_root_mem_cgroup() was modified such that it again relies on
KF_ACQUIRE semantics, therefore no longer making it a suitable
candidate to test BPF verifier default trusted pointer semantics
against.
Link: https://lore.kernel.org/bpf/20260113083949.2502978-2-mattbobrowski@google.com
Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Link: https://lore.kernel.org/r/20260120091630.3420452-1-mattbobrowski@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This tool is built by default, but was not being installed by default
when running `make install`. Fix this by calling ynltool's install
target.
Signed-off-by: Michel Lind <michel@michel-slm.name>
Link: https://patch.msgid.link/aWqr9gUT4hWZwwcI@mbp-m3-fedora.vm
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Zesen Liu says:
====================
bpf: Fix memory access flags in helper prototypes
This series adds missing memory access flags (MEM_RDONLY or MEM_WRITE) to
several bpf helper function prototypes that use ARG_PTR_TO_MEM but lack the
correct flag. It also adds a new check in verifier to ensure the flag is
specified.
Missing memory access flags in helper prototypes can lead to critical
correctness issues when the verifier tries to perform code optimization.
After commit 37cce22dbd51 ("bpf: verifier: Refactor helper access type
tracking"), the verifier relies on the memory access flags, rather than
treating all arguments in helper functions as potentially modifying the
pointed-to memory.
Using ARG_PTR_TO_MEM alone without flags does not make sense because:
- If the helper does not change the argument, missing MEM_RDONLY causes the
verifier to incorrectly reject a read-only buffer.
- If the helper does change the argument, missing MEM_WRITE causes the
verifier to incorrectly assume the memory is unchanged, leading to
errors in code optimization.
We have already seen several reports regarding this:
- commit ac44dcc788b9 ("bpf: Fix verifier assumptions of bpf_d_path's
output buffer") adds MEM_WRITE to bpf_d_path;
- commit 2eb7648558a7 ("bpf: Specify access type of bpf_sysctl_get_name
args") adds MEM_WRITE to bpf_sysctl_get_name.
This series looks through all prototypes in the kernel and completes the
flags. It also adds check_mem_arg_rw_flag_ok() and wires it into
check_func_proto() to statically restrict ARG_PTR_TO_MEM from appearing
without memory access flags.
Changelog
=========
v3:
- Rebased to bpf-next to address check_func_proto() signature changes, as
suggested by Eduard Zingerman.
v2:
- Add missing MEM_RDONLY flags to protos with ARG_PTR_TO_FIXED_SIZE_MEM.
====================
Link: https://patch.msgid.link/20260120-helper_proto-v3-0-27b0180b4e77@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add check to ensure that ARG_PTR_TO_MEM is used with either MEM_WRITE or
MEM_RDONLY.
Using ARG_PTR_TO_MEM alone without flags does not make sense because:
- If the helper does not change the argument, missing MEM_RDONLY causes the
verifier to incorrectly reject a read-only buffer.
- If the helper does change the argument, missing MEM_WRITE causes the
verifier to incorrectly assume the memory is unchanged, leading to errors
in code optimization.
Co-developed-by: Shuran Liu <electronlsr@gmail.com>
Signed-off-by: Shuran Liu <electronlsr@gmail.com>
Co-developed-by: Peili Gao <gplhust955@gmail.com>
Signed-off-by: Peili Gao <gplhust955@gmail.com>
Co-developed-by: Haoran Ni <haoran.ni.cs@gmail.com>
Signed-off-by: Haoran Ni <haoran.ni.cs@gmail.com>
Signed-off-by: Zesen Liu <ftyghome@gmail.com>
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260120-helper_proto-v3-2-27b0180b4e77@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|