summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2025-11-12net/mlx5: Expose definition for 1600Gbps link modeTariq Toukan
This patch exposes new link mode for 1600Gbps, utilizing 8 lanes at 200Gbps per lane. Co-developed-by: Yael Chemla <ychemla@nvidia.com> Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1762863888-1092798-1-git-send-email-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-11Merge tag 'for-net-2025-11-11' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - hci_conn: Fix not cleaning up PA_LINK connections - hci_event: Fix not handling PA Sync Lost event - MGMT: cancel mesh send timer when hdev removed - 6lowpan: reset link-local header on ipv6 recv path - 6lowpan: fix BDADDR_LE vs ADDR_LE_DEV address type confusion - L2CAP: export l2cap_chan_hold for modules - 6lowpan: Don't hold spin lock over sleeping functions - 6lowpan: add missing l2cap_chan_lock() - btusb: reorder cleanup in btusb_disconnect to avoid UAF - btrtl: Avoid loading the config file on security chips * tag 'for-net-2025-11-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: btrtl: Avoid loading the config file on security chips Bluetooth: hci_event: Fix not handling PA Sync Lost event Bluetooth: hci_conn: Fix not cleaning up PA_LINK connections Bluetooth: 6lowpan: add missing l2cap_chan_lock() Bluetooth: 6lowpan: Don't hold spin lock over sleeping functions Bluetooth: L2CAP: export l2cap_chan_hold for modules Bluetooth: 6lowpan: fix BDADDR_LE vs ADDR_LE_DEV address type confusion Bluetooth: 6lowpan: reset link-local header on ipv6 recv path Bluetooth: btusb: reorder cleanup in btusb_disconnect to avoid UAF Bluetooth: MGMT: cancel mesh send timer when hdev removed ==================== Link: https://patch.msgid.link/20251111141357.1983153-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-11ethtool: fix incorrect kernel-doc style comment in ethtool.hKriish Sharma
Building documentation produced the following warning: WARNING: ./include/linux/ethtool.h:495 This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * IEEE 802.3ck/df defines 16 bins for FEC histogram plus one more for This comment was not intended to be parsed as kernel-doc, so replace the '/**' with '/*' to silence the warning and align with normal comment style in header files. No functional changes. Signed-off-by: Kriish Sharma <kriish.sharma2006@gmail.com> Link: https://patch.msgid.link/20251110182545.2112596-1-kriish.sharma2006@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-11coresight: Change device mode to atomic typeLeo Yan
The device mode is defined as local type. This type cannot promise SMP-safe access. Change to atomic type and impose relax ordering, which ensures the SMP-safe synchronisation and the ordering between the mode setting and relevant operations. Fixes: 22fd532eaa0c ("coresight: etm3x: adding operation mode for etm_enable()") Reviewed-by: Mike Leach <mike.leach@linaro.org> Tested-by: James Clark <james.clark@linaro.org> Signed-off-by: Leo Yan <leo.yan@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20251111-arm_coresight_power_management_fix-v6-1-f55553b6c8b3@arm.com
2025-11-11lib/crypto: x86/polyval: Migrate optimized code into libraryEric Biggers
Migrate the x86_64 implementation of POLYVAL into lib/crypto/, wiring it up to the POLYVAL library interface. This makes the POLYVAL library be properly optimized on x86_64. This drops the x86_64 optimizations of polyval in the crypto_shash API. That's fine, since polyval will be removed from crypto_shash entirely since it is unneeded there. But even if it comes back, the crypto_shash API could just be implemented on top of the library API, as usual. Adjust the names and prototypes of the assembly functions to align more closely with the rest of the library code. Also replace a movaps instruction with movups to remove the assumption that the key struct is 16-byte aligned. Users can still align the key if they want (and at least in this case, movups is just as fast as movaps), but it's inconvenient to require it. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20251109234726.638437-6-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-11-11lib/crypto: arm64/polyval: Migrate optimized code into libraryEric Biggers
Migrate the arm64 implementation of POLYVAL into lib/crypto/, wiring it up to the POLYVAL library interface. This makes the POLYVAL library be properly optimized on arm64. This drops the arm64 optimizations of polyval in the crypto_shash API. That's fine, since polyval will be removed from crypto_shash entirely since it is unneeded there. But even if it comes back, the crypto_shash API could just be implemented on top of the library API, as usual. Adjust the names and prototypes of the assembly functions to align more closely with the rest of the library code. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20251109234726.638437-5-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-11-11lib/crypto: polyval: Add POLYVAL libraryEric Biggers
Add support for POLYVAL to lib/crypto/. This will replace the polyval crypto_shash algorithm and its use in the hctr2 template, simplifying the code and reducing overhead. Specifically, this commit introduces the POLYVAL library API and a generic implementation of it. Later commits will migrate the existing architecture-optimized implementations of POLYVAL into lib/crypto/ and add a KUnit test suite. I've also rewritten the generic implementation completely, using a more modern approach instead of the traditional table-based approach. It's now constant-time, requires no precomputation or dynamic memory allocations, decreases the per-key memory usage from 4096 bytes to 16 bytes, and is faster than the old polyval-generic even on bulk data reusing the same key (at least on x86_64, where I measured 15% faster). We should do this for GHASH too, but for now just do it for POLYVAL. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20251109234726.638437-3-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-11-11efi/runtime-wrappers: Keep track of the efi_runtime_lock ownerArd Biesheuvel
The EFI runtime wrappers use a file local semaphore to serialize access to the EFI runtime services. This means that any calls to the arch wrappers around the runtime services will also be serialized, removing the need for redundant locking. For robustness, add a facility that allows those arch wrappers to assert that the semaphore was taken by the current task. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-11efi/memattr: Convert efi_memattr_init() return type to voidBreno Leitao
The efi_memattr_init() function's return values (0 and -ENOMEM) are never checked by callers. Convert the function to return void since the return status is unused. Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2025-11-11reset: mpfs: add non-auxiliary bus probingConor Dooley
While the auxiliary bus was a nice bandaid, and meant that re-writing the representation of the clock regions in devicetree was not required, it has run its course. The "mss_top_sysreg" region that contains the clock and reset regions, also contains pinctrl and an interrupt controller, so the time has come rewrite the devicetree and probe the reset controller from an mfd devicetree node, rather than implement those drivers using the auxiliary bus. Wanting to avoid propagating this naive/incorrect description of the hardware to the new pic64gx SoC is a major motivating factor here. Reviewed-by: Philipp Zabel <p.zabel@pengutronix.de> Acked-by: Philipp Zabel <p.zabel@pengutronix.de> Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
2025-11-11dt-bindings: clock: document 8ULP's SIM LPAVLaurentiu Mihalcea
Add documentation for i.MX8ULP's SIM LPAV module. Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Daniel Baluta <daniel.baluta@nxp.com> Signed-off-by: Laurentiu Mihalcea <laurentiu.mihalcea@nxp.com> Link: https://lore.kernel.org/r/20251104120301.913-3-laurentiumihalcea111@gmail.com Signed-off-by: Abel Vesa <abel.vesa@linaro.org>
2025-11-11net: export netdev_get_by_index_lock()David Wei
Need to call netdev_get_by_index_lock() from io_uring/zcrx.c, but it is currently private to net. Export the function in linux/netdevice.h. Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-11mlx5: Fix default values in create CQAkiva Goldberger
Currently, CQs without a completion function are assigned the mlx5_add_cq_to_tasklet function by default. This is problematic since only user CQs created through the mlx5_ib driver are intended to use this function. Additionally, all CQs that will use doorbells instead of polling for completions must call mlx5_cq_arm. However, the default CQ creation flow leaves a valid value in the CQ's arm_db field, allowing FW to send interrupts to polling-only CQs in certain corner cases. These two factors would allow a polling-only kernel CQ to be triggered by an EQ interrupt and call a completion function intended only for user CQs, causing a null pointer exception. Some areas in the driver have prevented this issue with one-off fixes but did not address the root cause. This patch fixes the described issue by adding defaults to the create CQ flow. It adds a default dummy completion function to protect against null pointer exceptions, and it sets an invalid command sequence number by default in kernel CQs to prevent the FW from sending an interrupt to the CQ until it is armed. User CQs are responsible for their own initialization values. Callers of mlx5_core_create_cq are responsible for changing the completion function and arming the CQ per their needs. Fixes: cdd04f4d4d71 ("net/mlx5: Add support to create SQ and CQ for ASO") Signed-off-by: Akiva Goldberger <agoldberger@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Leon Romanovsky <leon@kernel.org> Link: https://patch.msgid.link/1762681743-1084694-1-git-send-email-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-11Bluetooth: hci_event: Fix not handling PA Sync Lost eventLuiz Augusto von Dentz
This handles PA Sync Lost event which previously was assumed to be handled with BIG Sync Lost but their lifetime are not the same thus why there are 2 different events to inform when each sync is lost. Fixes: b2a5f2e1c127 ("Bluetooth: hci_event: Add support for handling LE BIG Sync Lost event") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-11-11ASoC: cs35l56: Allow restoring factory calibration through ALSA controlRichard Fitzgerald
Add an ALSA control (CAL_DATA) that can be used to restore amp calibration, instead of using debugfs. A readback control (CAL_DATA_RB) is also added for factory testing. On ChromeOS the process that restores amp calibration from NVRAM has limited permissions and cannot access debugfs. It requires an ALSA control that it can write the calibration blob into. ChromeOS also restricts access to ALSA controls, which avoids the risk of accidental or malicious overwriting of good calibration data with bad data. As this control is not needed for normal Linux-based distros it is a Kconfig option. A separate control, CAL_DATA_RB, provides a readback of the current calibration data, which could be either from a write to CAL_DATA or the result of factory production-line calibration. The write and read are intentionally separate controls to defeat "dumb" save-and-restore tools like alsa-restore that assume it is safe to save all control values and write them back in any order at some undefined future time. Such behavior carries the risk of restoring stale or bad data over the top of good data. Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com> Link: https://patch.msgid.link/20251111130850.513969-3-rf@opensource.cirrus.com Signed-off-by: Mark Brown <broonie@kernel.org>
2025-11-11ASoC: cs35l56: Add control to read CAL_SET_STATUSRichard Fitzgerald
Create an ALSA control to read the value of the firmware CAL_SET_STATUS control. This reports whether the firmware is using a calibration blob or the default calibration from the .bin file. The firmware only reports a valid value in this register while audio is actually playing and the internal PLL is locked to the audio clock. Otherwise it returns a status of "unknown". Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com> Link: https://patch.msgid.link/20251111130850.513969-2-rf@opensource.cirrus.com Signed-off-by: Mark Brown <broonie@kernel.org>
2025-11-11rv: Make rv_reacting_on() staticThomas Weißschuh
There are no external users left. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Link: https://lore.kernel.org/r/20251014-rv-lockdep-v1-2-0b9e51919ea8@linutronix.de Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-11-11rv: Pass va_list to reactorsThomas Weißschuh
The only thing the reactors can do with the passed in varargs is to convert it into a va_list. Do that in a central helper instead. It simplifies the reactors, removes some hairy macro-generated code and introduces a convenient hook point to modify reactor behavior. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Link: https://lore.kernel.org/r/20251014-rv-lockdep-v1-1-0b9e51919ea8@linutronix.de Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-11-11net/mlx5: E-Switch, support eswitch inactive modeSaeed Mahameed
Add support for eswitch switchdev inactive mode Inactive mode: Drop all traffic going to FDB, Remove mpfs l2 rules and disconnect adjacent vports. Active mode: Traffic flows through FDB, mpfs table populated, and adjacent vports are connected. Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Adithya Jayachandran <ajayachandra@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20251108070404.1551708-4-saeed@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-11devlink: Introduce switchdev_inactive eswitch modeSaeed Mahameed
Adds DEVLINK_ESWITCH_MODE_SWITCHDEV_INACTIVE attribute to UAPI and documentation. Before having traffic flow through an eswitch, a user may want to have the ability to block traffic towards the FDB until FDB is fully programmed and the user is ready to send traffic to it. For example: when two eswitches are present for vports in a multi-PF setup, one eswitch may take over the traffic from the other when the user chooses. Before this take over, a user may want to first program the inactive eswitch and then once ready redirect traffic to this new eswitch. switchdev modes transition semantics: legacy->switchdev_inactive: Create switchdev mode normally, traffic not allowed to flow yet. switchdev_inactive->switchdev: Enable traffic to flow. switchdev->switchdev_inactive: Block traffic on the FDB, FDB and representros state and content is preserved. When eswitch is configured to this mode, traffic is ignored/dropped on this eswitch FDB, while current configuration is kept, e.g FDB rules and netdev representros are kept available, FDB programming is allowed. Example: # start inactive switchdev devlink dev eswitch set pci/0000:08:00.1 mode switchdev_inactive # setup TC rules, representors etc .. # activate devlink dev eswitch set pci/0000:08:00.1 mode switchdev Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20251108070404.1551708-2-saeed@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-11Merge tag 'v6.18-rc5' into media-nextMauro Carvalho Chehab
Linux 6.18-rc5 * tag 'v6.18-rc5': (1016 commits) Linux 6.18-rc5 kbuild: Let kernel-doc.py use PYTHON3 override rtc: rx8025: fix incorrect register reference Revert "drm/nouveau: set DMA mask before creating the flush page" io_uring: fix regbuf vector size truncation compiler_types: Move unused static inline functions warning to W=2 smb: client: validate change notify buffer before copy tracing/tools: Fix incorrcet short option in usage text for --threads drm/xe: Enforce correct user fence signaling order using x86/microcode/AMD: Add more known models to entry sign checking drm/xe: Do clean shutdown also when using flr drm/xe: Move declarations under conditional branch drm/xe/guc: Synchronize Dead CT worker with unbind tracing: Fix memory leaks in create_field_var() ring-buffer: Do not warn in ring_buffer_map_get_reader() when reader catches up tracing: tprobe-events: Fix to put tracepoint_user when disable the tprobe tracing: tprobe-events: Fix to register tracepoint correctly gpio: tb10x: Drop unused tb10x_set_bits() function drm/amd/display: Enable mst when it's detected but yet to be initialized drm/amdgpu: Fix wait after reset sequence in S3 ...
2025-11-11sched/deadline: Fix dl_server stop conditionPeter Zijlstra
Gabriel reported that the dl_server doesn't stop as expected. The problem was found to be the fact that idle time and fair runtime are treated equally. Both will count towards dl_server runtime and push the activation forwards when it is in the zero-laxity wait state. Notably: dl_server_update_idle() update_curr_dl_se() if (dl_defer && dl_throttled && dl_runtime_exceeded()) hrtimer_try_to_cancel(); // stop timer replenish_dl_new_period() deadline = now + dl_deadline; // fwd period runtime = dl_runtime; start_dl_timer(); // restart timer And while we do want idle time accounted towards the *current* activation of the dl_server -- after all, a fair task could've ran if we had any -- we don't necessarily want idle time to cause or push forward an activation. Introduce dl_defer_idle to make this distinction. It will be set once idle time pushed the activation forward, once set idle time will only be allowed to consume any runtime but not push the activation. This will then cause dl_server_timer() to fire, which will stop the dl_server. Any non-idle time accounting during this phase will clear dl_defer_idle, so only a full period of idle will cause the dl_server to stop. Reported-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251101000057.GA2184199@noisy.programming.kicks-ass.net
2025-11-11wifi: cfg80211/mac80211: Add fallback mechanism for INDOOR_SP connectionPagadala Yesu Anjaneyulu
Implement fallback to LPI mode when SP mode is not permitted by regulatory constraints for INDOOR_SP connections. Limit fallback mechanism to client mode. Signed-off-by: Pagadala Yesu Anjaneyulu <pagadala.yesu.anjaneyulu@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20251110140806.8b43201a34ae.I37fc7bb5892eb9d044d619802e8f2095fde6b296@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-11-11wifi: cfg80211/mac80211: clean up duplicate ap_power handlingPagadala Yesu Anjaneyulu
Move duplicated ap_power type handling code to an inline function in cfg80211. Signed-off-by: Pagadala Yesu Anjaneyulu <pagadala.yesu.anjaneyulu@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20251110140806.959948da1cb5.I893b5168329fb3232f249c182a35c99804112da6@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-11-11fs: move inode fields used during fast path lookup closer togetherMateusz Guzik
This should avoid *some* cache misses. Successful path lookup is guaranteed to load at least ->i_mode, ->i_opflags and ->i_acl. At the same time the common case will avoid looking at more fields. struct inode is not guaranteed to have any particular alignment, notably ext4 has it only aligned to 8 bytes meaning nearby fields might happen to be on the same or only adjacent cache lines depending on luck (or no luck). According to pahole: umode_t i_mode; /* 0 2 */ short unsigned int i_opflags; /* 2 2 */ kuid_t i_uid; /* 4 4 */ kgid_t i_gid; /* 8 4 */ unsigned int i_flags; /* 12 4 */ struct posix_acl * i_acl; /* 16 8 */ struct posix_acl * i_default_acl; /* 24 8 */ ->i_acl is unnecessarily separated by 8 bytes from the other fields. With struct inode being offset 48 bytes into the cacheline this means an avoidable miss. Note it will still be there for the 56 byte case. New layout: umode_t i_mode; /* 0 2 */ short unsigned int i_opflags; /* 2 2 */ unsigned int i_flags; /* 4 4 */ struct posix_acl * i_acl; /* 8 8 */ struct posix_acl * i_default_acl; /* 16 8 */ kuid_t i_uid; /* 24 4 */ kgid_t i_gid; /* 28 4 */ I verified with pahole there are no size or hole changes. This is stopgap until someone(tm) sanitizes the layout in the first place, allocation methods aside. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://patch.msgid.link/20251109121931.1285366-1-mjguzik@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11xsk: add indirect call for xsk_destruct_skbJason Xing
Since Eric proposed an idea about adding indirect call wrappers for UDP and managed to see a huge improvement[1], the same situation can also be applied in xsk scenario. This patch adds an indirect call for xsk and helps current copy mode improve the performance by around 1% stably which was observed with IXGBE at 10Gb/sec loaded. If the throughput grows, the positive effect will be magnified. I applied this patch on top of batch xmit series[2], and was able to see <5% improvement from our internal application which is a little bit unstable though. Use INDIRECT wrappers to keep xsk_destruct_skb static as it used to be when the mitigation config is off. Be aware of the freeing path that can be very hot since the frequency can reach around 2,000,000 times per second with the xdpsock test. [1]: https://lore.kernel.org/netdev/20251006193103.2684156-2-edumazet@google.com/ [2]: https://lore.kernel.org/all/20251021131209.41491-1-kerneljasonxing@gmail.com/ Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20251031103328.95468-1-kerneljasonxing@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-11ns: drop custom reference count initialization for initial namespacesChristian Brauner
Initial namespaces don't modify their reference count anymore. They remain fixed at one so drop the custom refcount initializations. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-16-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11pid: rely on common reference count behaviorChristian Brauner
Now that we changed the generic reference counting mechanism for all namespaces to never manipulate reference counts of initial namespaces we can drop the special handling for pid namespaces. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-15-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11ns: add asserts for initial namespace active reference countsChristian Brauner
They always remain fixed at one. Notice when that assumptions is broken. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-14-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11ns: add asserts for initial namespace reference countsChristian Brauner
They always remain fixed at one. Notice when that assumptions is broken. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-13-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11ns: make all reference counts on initial namespace a nopChristian Brauner
They are always active so no need to needlessly cacheline ping-pong. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-12-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11ns: rename is_initial_namespace()Christian Brauner
Rename is_initial_namespace() to ns_init_inum() and make it symmetrical with the ns id variant. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-9-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11ns: make is_initial_namespace() argument constChristian Brauner
We don't modify the data structure at all so pass it as const. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-8-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11nstree: switch to new structuresChristian Brauner
Switch the nstree management to the new combined structures. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-5-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11nstree: add helper to operate on struct ns_tree_{node,root}Christian Brauner
Add helpers that work on the combined rbtree and rculist combined. This will make the code a lot more managable and legible. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-4-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11nstree: move nstree types into separate headerChristian Brauner
Introduce two new fundamental data structures for namespace tree management in a separate header file. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-3-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11nstree: decouple from ns_common headerChristian Brauner
Foward declare struct ns_common and remove the include of ns_common.h. We want ns_common.h to possibly include nstree structures but not the other way around. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-2-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11ns: move namespace types into separate headerChristian Brauner
Add a dedicated header for namespace types. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-1-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11Merge branch 'kbuild-6.19.fms.extension'Christian Brauner
Bring in the shared branch with the kbuild tree to enable '-fms-extensions' for 6.19. Further namespace cleanup work requires this extension. Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11md: allow configuring logical block sizeLi Nan
Previously, raid array used the maximum logical block size (LBS) of all member disks. Adding a larger LBS disk at runtime could unexpectedly increase RAID's LBS, risking corruption of existing partitions. This can be reproduced by: ``` # LBS of sd[de] is 512 bytes, sdf is 4096 bytes. mdadm -CRq /dev/md0 -l1 -n3 /dev/sd[de] missing --assume-clean # LBS is 512 cat /sys/block/md0/queue/logical_block_size # create partition md0p1 parted -s /dev/md0 mklabel gpt mkpart primary 1MiB 100% lsblk | grep md0p1 # LBS becomes 4096 after adding sdf mdadm --add -q /dev/md0 /dev/sdf cat /sys/block/md0/queue/logical_block_size # partition lost partprobe /dev/md0 lsblk | grep md0p1 ``` Simply restricting larger-LBS disks is inflexible. In some scenarios, only disks with 512 bytes LBS are available currently, but later, disks with 4KB LBS may be added to the array. Making LBS configurable is the best way to solve this scenario. After this patch, the raid will: - store LBS in disk metadata - add a read-write sysfs 'mdX/logical_block_size' Future mdadm should support setting LBS via metadata field during RAID creation and the new sysfs. Though the kernel allows runtime LBS changes, users should avoid modifying it after creating partitions or filesystems to prevent compatibility issues. Only 1.x metadata supports configurable LBS. 0.90 metadata inits all fields to default values at auto-detect. Supporting 0.90 would require more extensive changes and no such use case has been observed. Note that many RAID paths rely on PAGE_SIZE alignment, including for metadata I/O. A larger LBS than PAGE_SIZE will result in metadata read/write failures. So this config should be prevented. Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-6-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-10usbnet: Add support for Byte Queue Limits (BQL)Simon Schippers
In the current implementation, usbnet uses a fixed tx_qlen of: USB2: 60 * 1518 bytes = 91.08 KB USB3: 60 * 5 * 1518 bytes = 454.80 KB Such large transmit queues can be problematic, especially for cellular modems. For example, with a typical celluar link speed of 10 Mbit/s, a fully occupied USB3 transmit queue results in: 454.80 KB / (10 Mbit/s / 8 bit/byte) = 363.84 ms of additional latency. This patch adds support for Byte Queue Limits (BQL) [1] to dynamically manage the transmit queue size and reduce latency without sacrificing throughput. Testing was performed on various devices using the usbnet driver for packet transmission: - DELOCK 66045: USB3 to 2.5 GbE adapter (ax88179_178a) - DELOCK 61969: USB2 to 1 GbE adapter (asix) - Quectel RM520: 5G modem (qmi_wwan) - USB2 Android tethering (cdc_ncm) No performance degradation was observed for iperf3 TCP or UDP traffic, while latency for a prioritized ping application was significantly reduced. For example, using the USB3 to 2.5 GbE adapter, which was fully utilized by iperf3 UDP traffic, the prioritized ping was improved from 1.6 ms to 0.6 ms. With the same setup but with a 100 Mbit/s Ethernet connection, the prioritized ping was improved from 35 ms to 5 ms. [1] https://lwn.net/Articles/469652/ Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251106175615.26948-1-simon.schippers@tu-dortmund.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-10Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2025-11-10 We've added 19 non-merge commits during the last 3 day(s) which contain a total of 22 files changed, 1345 insertions(+), 197 deletions(-). The main changes are: 1) Preserve skb metadata after a TC BPF program has changed the skb, from Jakub Sitnicki. This allows a TC program at the end of a TC filter chain to still see the skb metadata, even if another TC program at the front of the chain has changed the skb using BPF helpers. 2) Initial af_smc bpf_struct_ops support to control the smc specific syn/synack options, from D. Wythe. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: bpf/selftests: Add selftest for bpf_smc_hs_ctrl net/smc: bpf: Introduce generic hook for handshake flow bpf: Export necessary symbols for modules with struct_ops selftests/bpf: Cover skb metadata access after bpf_skb_change_proto selftests/bpf: Cover skb metadata access after change_head/tail helper selftests/bpf: Cover skb metadata access after bpf_skb_adjust_room selftests/bpf: Cover skb metadata access after vlan push/pop helper selftests/bpf: Expect unclone to preserve skb metadata selftests/bpf: Dump skb metadata on verification failure selftests/bpf: Verify skb metadata in BPF instead of userspace bpf: Make bpf_skb_change_head helper metadata-safe bpf: Make bpf_skb_change_proto helper metadata-safe bpf: Make bpf_skb_adjust_room metadata-safe bpf: Make bpf_skb_vlan_push helper metadata-safe bpf: Make bpf_skb_vlan_pop helper metadata-safe vlan: Make vlan_remove_tag return nothing bpf: Unclone skb head on bpf_dynptr_write to skb metadata net: Preserve metadata on pskb_expand_head net: Helper to move packet data and metadata after skb_push/pull ==================== Link: https://patch.msgid.link/20251110232427.3929291-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-10sctp: Don't inherit do_auto_asconf in sctp_clone_sock().Kuniyuki Iwashima
syzbot reported list_del(&sp->auto_asconf_list) corruption in sctp_destroy_sock(). The repro calls setsockopt(SCTP_AUTO_ASCONF, 1) to a SCTP listener, calls accept(), and close()s the child socket. setsockopt(SCTP_AUTO_ASCONF, 1) sets sp->do_auto_asconf to 1 and links sp->auto_asconf_list to a per-netns list. Both fields are placed after sp->pd_lobby in struct sctp_sock, and sctp_copy_descendant() did not copy the fields before the cited commit. Also, sctp_clone_sock() did not set them explicitly. In addition, sctp_auto_asconf_init() is called from sctp_sock_migrate(), but it initialises the fields only conditionally. The two fields relied on __GFP_ZERO added in sk_alloc(), but sk_clone() does not use it. Let's clear newsp->do_auto_asconf in sctp_clone_sock(). [0]: list_del corruption. prev->next should be ffff8880799e9148, but was ffff8880799e8808. (prev=ffff88803347d9f8) kernel BUG at lib/list_debug.c:64! Oops: invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 0 UID: 0 PID: 6008 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/02/2025 RIP: 0010:__list_del_entry_valid_or_report+0x15a/0x190 lib/list_debug.c:62 Code: e8 7b 26 71 fd 43 80 3c 2c 00 74 08 4c 89 ff e8 7c ee 92 fd 49 8b 17 48 c7 c7 80 0a bf 8b 48 89 de 4c 89 f9 e8 07 c6 94 fc 90 <0f> 0b 4c 89 f7 e8 4c 26 71 fd 43 80 3c 2c 00 74 08 4c 89 ff e8 4d RSP: 0018:ffffc90003067ad8 EFLAGS: 00010246 RAX: 000000000000006d RBX: ffff8880799e9148 RCX: b056988859ee6e00 RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000 RBP: dffffc0000000000 R08: ffffc90003067807 R09: 1ffff9200060cf00 R10: dffffc0000000000 R11: fffff5200060cf01 R12: 1ffff1100668fb3f R13: dffffc0000000000 R14: ffff88803347d9f8 R15: ffff88803347d9f8 FS: 00005555823e5500(0000) GS:ffff88812613e000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000200000000480 CR3: 00000000741ce000 CR4: 00000000003526f0 Call Trace: <TASK> __list_del_entry_valid include/linux/list.h:132 [inline] __list_del_entry include/linux/list.h:223 [inline] list_del include/linux/list.h:237 [inline] sctp_destroy_sock+0xb4/0x370 net/sctp/socket.c:5163 sk_common_release+0x75/0x310 net/core/sock.c:3961 sctp_close+0x77e/0x900 net/sctp/socket.c:1550 inet_release+0x144/0x190 net/ipv4/af_inet.c:437 __sock_release net/socket.c:662 [inline] sock_close+0xc3/0x240 net/socket.c:1455 __fput+0x44c/0xa70 fs/file_table.c:468 task_work_run+0x1d4/0x260 kernel/task_work.c:227 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] exit_to_user_mode_loop+0xe9/0x130 kernel/entry/common.c:43 exit_to_user_mode_prepare include/linux/irq-entry-common.h:225 [inline] syscall_exit_to_user_mode_work include/linux/entry-common.h:175 [inline] syscall_exit_to_user_mode include/linux/entry-common.h:210 [inline] do_syscall_64+0x2bd/0xfa0 arch/x86/entry/syscall_64.c:100 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: 16942cf4d3e3 ("sctp: Use sk_clone() in sctp_accept().") Reported-by: syzbot+ba535cb417f106327741@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/690d2185.a70a0220.22f260.000e.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20251106223418.1455510-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-10io_uring/query: return number of available queriesPavel Begunkov
It's useful to know which query opcodes are available. Extend the structure and return that. It's a trivial change, and even though it can be painlessly extended later, it'd still require adding a v2 of the structure. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-10net/smc: bpf: Introduce generic hook for handshake flowD. Wythe
The introduction of IPPROTO_SMC enables eBPF programs to determine whether to use SMC based on the context of socket creation, such as network namespaces, PID and comm name, etc. As a subsequent enhancement, to introduce a new generic hook that allows decisions on whether to use SMC or not at runtime, including but not limited to local/remote IP address or ports. User can write their own implememtion via bpf_struct_ops now to choose whether to use SMC or not before TCP 3rd handshake to be comleted. Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Link: https://patch.msgid.link/20251107035632.115950-3-alibuda@linux.alibaba.com
2025-11-10bpf: Make bpf_skb_vlan_push helper metadata-safeJakub Sitnicki
Use the metadata-aware helper to move packet bytes after skb_push(), ensuring metadata remains valid after calling the BPF helper. Also, take care to reserve sufficient headroom for metadata to fit. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-6-5ceb08a9b37b@cloudflare.com
2025-11-10bpf: Make bpf_skb_vlan_pop helper metadata-safeJakub Sitnicki
Use the metadata-aware helper to move packet bytes after skb_pull(), ensuring metadata remains valid after calling the BPF helper. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-5-5ceb08a9b37b@cloudflare.com
2025-11-10vlan: Make vlan_remove_tag return nothingJakub Sitnicki
All callers ignore the return value. Prepare to reorder memmove() after skb_pull() which is a common pattern. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-4-5ceb08a9b37b@cloudflare.com
2025-11-10bpf: Unclone skb head on bpf_dynptr_write to skb metadataJakub Sitnicki
Currently bpf_dynptr_from_skb_meta() marks the dynptr as read-only when the skb is cloned, preventing writes to metadata. Remove this restriction and unclone the skb head on bpf_dynptr_write() to metadata, now that the metadata is preserved during uncloning. This makes metadata dynptr consistent with skb dynptr, allowing writes regardless of whether the skb is cloned. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-3-5ceb08a9b37b@cloudflare.com
2025-11-10net: Helper to move packet data and metadata after skb_push/pullJakub Sitnicki
Lay groundwork for fixing BPF helpers available to TC(X) programs. When skb_push() or skb_pull() is called in a TC(X) ingress BPF program, the skb metadata must be kept in front of the MAC header. Otherwise, BPF programs using the __sk_buff->data_meta pseudo-pointer lose access to it. Introduce a helper that moves both metadata and a specified number of packet data bytes together, suitable as a drop-in replacement for memmove(). Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-1-5ceb08a9b37b@cloudflare.com