path: root/include
Age | Commit message | Author
2025-07-09 | drm/framebuffer: Acquire internal references on GEM handles | Thomas Zimmermann
Acquire GEM handles in drm_framebuffer_init() and release them in the corresponding drm_framebuffer_cleanup(). Ties the handle's lifetime to the framebuffer. Not all GEM buffer objects have GEM handles. If not set, no refcounting takes place. This is the case for some fbdev emulation. This is not a problem as these GEM objects do not use dma-bufs and drivers will not release them while fbdev emulation is running. Framebuffer flags keep a bit per color plane of which the framebuffer holds a GEM handle reference. As all drivers use drm_framebuffer_init(), they will now all hold dma-buf references as fixed in commit 5307dce878d4 ("drm/gem: Acquire references on GEM handles for framebuffers"). In the GEM framebuffer helpers, restore the original ref counting on buffer objects. As the helpers for handle refcounting are now no longer called from outside the DRM core, unexport the symbols. v3: - don't mix internal flags with mode flags (Christian) v2: - track framebuffer handle refs by flag - drop gma500 cleanup (Christian) Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Fixes: 5307dce878d4 ("drm/gem: Acquire references on GEM handles for framebuffers") Reported-by: Bert Karwatzki <spasswolf@web.de> Closes: https://lore.kernel.org/dri-devel/20250703115915.3096-1-spasswolf@web.de/ Tested-by: Bert Karwatzki <spasswolf@web.de> Tested-by: Mario Limonciello <superm1@kernel.org> Tested-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Anusha Srivatsa <asrivats@redhat.com> Cc: Christian König <christian.koenig@amd.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: "Christian König" <christian.koenig@amd.com> Cc: linux-media@vger.kernel.org Cc: dri-devel@lists.freedesktop.org Cc: linaro-mm-sig@lists.linaro.org Cc: <stable@vger.kernel.org> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://lore.kernel.org/r/20250707131224.249496-1-tzimmermann@suse.de
2025-07-09 | sched/fair: Fix NO_RUN_TO_PARITY case | Vincent Guittot
EEVDF expects the scheduler to allocate a time quantum to the selected entity and then pick a new entity for next quantum. Although this notion of time quantum is not strictly doable in our case, we can ensure a minimum runtime for each task most of the time and pick a new entity after a minimum time has elapsed. Reuse the slice protection of run to parity to ensure such runtime quantum. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250708165630.1948751-3-vincent.guittot@linaro.org
2025-07-09 | sched/deadline: Less agressive dl_server handling | Peter Zijlstra
Chris reported that commit 5f6bd380c7bd ("sched/rt: Remove default bandwidth control") caused a significant dip in his favourite benchmark of the day. Simply disabling dl_server cured things. His workload hammers the 0->1, 1->0 transitions, and the dl_server_{start,stop}() overhead kills it -- fairly obviously a bad idea in hind sight and all that. Change things around to only disable the dl_server when there has not been a fair task around for a whole period. Since the default period is 1 second, this ensures the benchmark never trips this, overhead gone. Fixes: 557a6bfc662c ("sched/fair: Add trivial fair server") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lkml.kernel.org/r/20250702121158.465086194@infradead.org
2025-07-09 | sched/psi: Optimize psi_group_change() cpu_clock() usage | Peter Zijlstra
Dietmar reported that commit 3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race") caused a regression for him on a high context switch rate benchmark (schbench) due to the now repeating cpu_clock() calls. In particular the problem is that get_recent_times() will extrapolate the current state to 'now'. But if an update uses a timestamp from before the start of the update, it is possible to get two reads with inconsistent results. It is effectively back-dating an update. (note that this all hard-relies on the clock being synchronized across CPUs -- if this is not the case, all bets are off). Combine this problem with the fact that there are per-group-per-cpu seqcounts, the commit in question pushed the clock read into the group iteration, causing tree-depth cpu_clock() calls. On architectures where cpu_clock() has appreciable overhead, this hurts. Instead move to a per-cpu seqcount, which allows us to have a single clock read for all group updates, increasing internal consistency and lowering update overhead. This comes at the cost of a longer update side (proportional to the tree depth) which can cause the read side to retry more often. Fixes: 3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race") Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>, Link: https://lkml.kernel.org/20250522084844.GC31726@noisy.programming.kicks-ass.net
2025-07-09 | debugfs_get_aux(): allow storing non-const void * | Al Viro
typechecking is up to users, anyway... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20250702212616.GI3406663@ZenIV Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-09 | debugfs: split short and full proxy wrappers, kill debugfs_real_fops() | Al Viro
All users outside of fs/debugfs/file.c are gone, in there we can just fully split the wrappers for full and short cases and be done with that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20250702212419.GG3406663@ZenIV Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-09 | pmdomain: core: Leave powered-on genpds on until late_initcall_sync | Ulf Hansson
Powering-off a genpd that was on during boot, before all of its consumer devices have been probed, is certainly prone to problems. As a step to improve this situation, let's prevent these genpds from being powered-off until genpd_power_off_unused() gets called, which is a late_initcall_sync(). Note that this still doesn't guarantee that all the consumer devices have been probed before we allow the genpds to be powered off. Yet, this should be a step in the right direction. Suggested-by: Saravana Kannan <saravanak@google.com> Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106 Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20250701114733.636510-22-ulf.hansson@linaro.org
2025-07-09 | driver core: Add dev_set_drv_sync_state() | Saravana Kannan
This can be used by frameworks to set the sync_state() helper functions for drivers that don't already have them set. Signed-off-by: Saravana Kannan <saravanak@google.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Abel Vesa <abel.vesa@linaro.org> Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106 Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20250701114733.636510-20-ulf.hansson@linaro.org
2025-07-09 | pmdomain: core: Add common ->sync_state() support for genpd providers | Ulf Hansson
If the genpd provider's fwnode doesn't have an associated struct device with it, we can make use of the generic genpd->dev and its corresponding driver internally in genpd to manage ->sync_state(). More precisely, while adding a genpd OF provider, let's check if the fwnode has a device and, if not, make the preparations to handle ->sync_state() internally through the genpd_provider_driver and the genpd_provider_bus. Note that genpd providers may opt out from this behaviour by setting the GENPD_FLAG_NO_SYNC_STATE config option for the genpds in question. Suggested-by: Saravana Kannan <saravanak@google.com> Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106 Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20250701114733.636510-19-ulf.hansson@linaro.org
2025-07-09 | driver core: Export get_dev_from_fwnode() | Ulf Hansson
It has turned out get_dev_from_fwnode() is useful at a few other places outside of the driver core, as in gpiolib.c for example. Therefore let's make it available as a common helper function. Suggested-by: Saravana Kannan <saravanak@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106 Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20250701114733.636510-18-ulf.hansson@linaro.org
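As a rough usage sketch, a consumer outside the driver core might use the newly exported helper as below. The put_device() balancing is an assumption based on the helper returning a referenced struct device, as get_device()-based lookups do; the function name is illustrative.

	#include <linux/device.h>
	#include <linux/fwnode.h>

	static int example_use_fwnode_dev(struct fwnode_handle *fwnode)
	{
		/* Resolve the fwnode to the device currently associated with it. */
		struct device *dev = get_dev_from_fwnode(fwnode);

		if (!dev)
			return -ENODEV;	/* no device for this fwnode (yet) */

		dev_info(dev, "resolved device for fwnode\n");

		/* Assumed: balance the reference taken by the helper. */
		put_device(dev);
		return 0;
	}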
2025-07-09 | firmware: xilinx: Don't share zynqmp_pm_init_finalize() | Ulf Hansson
As there are no longer any users of zynqmp_pm_init_finalize() outside the zynqmp firmware driver, let's turn it into a local static function. Cc: Michal Simek <michal.simek@amd.com> Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106 Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20250701114733.636510-16-ulf.hansson@linaro.org
2025-07-09 | pmdomain: core: Prepare to add the common ->sync_state() support | Ulf Hansson
Before we can implement the common ->sync_state() support in genpd, we need to allow a few specific genpd providers to opt out from the new behaviour. Let's introduce GENPD_FLAG_NO_SYNC_STATE as a new genpd config option, to allow providers to opt out. Suggested-by: Saravana Kannan <saravanak@google.com> Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106 Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20250701114733.636510-9-ulf.hansson@linaro.org
2025-07-09 | pmdomain: core: Export a common ->sync_state() helper for genpd providers | Ulf Hansson
In some cases, the typical platform driver that acts as a genpd provider may need its own ->sync_state() callback to manage various things. In this regard, the provider most likely wants to allow its corresponding genpds to be powered off. For this reason, let's introduce a new genpd helper function, of_genpd_sync_state(), that helps genpd provider drivers achieve this. Suggested-by: Saravana Kannan <saravanak@google.com> Reviewed-by: Abel Vesa <abel.vesa@linaro.org> Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106 Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20250701114733.636510-8-ulf.hansson@linaro.org
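A minimal sketch of how a genpd provider driver might combine its own ->sync_state() callback with the new helper. The of_genpd_sync_state() signature (taking the provider's device_node) and the "example-pd" driver are assumptions for illustration, not taken from the patch.

	#include <linux/platform_device.h>
	#include <linux/pm_domain.h>

	static void example_pd_sync_state(struct device *dev)
	{
		/* Provider-specific work first, e.g. dropping boot-time votes. */

		/* Then let genpd power off domains that are no longer needed
		 * (assumed signature: takes the provider's device_node). */
		of_genpd_sync_state(dev->of_node);
	}

	static struct platform_driver example_pd_driver = {
		/* .probe/.remove omitted in this sketch */
		.driver = {
			.name = "example-pd",
			.sync_state = example_pd_sync_state,
		},
	};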
2025-07-09 | efi: add ovmf debug log driver | Gerd Hoffmann
Recent OVMF versions (edk2-stable202508 + newer) can write their debug log to a memory buffer. This driver exposes the log content via sysfs (/sys/firmware/efi/ovmf_debug_log). Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2025-07-09 | usb: core: add dma-noncoherent buffer alloc and free API | Xu Yang
This adds usb_alloc_noncoherent() and usb_free_noncoherent() functions to support allocating and freeing buffers in a dma-noncoherent way. To explicitly manage memory ownership between the kernel and the device, this also adds usb_dma_noncoherent_sync_for_cpu/device() functions, which must be called at the proper time. The management requires the user to save the sg_table returned by usb_alloc_noncoherent() to urb->sgt. Signed-off-by: Xu Yang <xu.yang_2@nxp.com> Reviewed-by: Alan Stern <stern@rowland.harvard.edu> Link: https://lore.kernel.org/r/20250704095751.73765-2-xu.yang_2@nxp.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-09 | wifi: mac80211: don't complete management TX on SAE commit | Johannes Berg
When an SAE commit is sent and a commit is received in response, there's no ordering for the SAE confirm messages. As such, don't call drivers to stop listening on the channel while the confirm message is still expected. This fixes an issue where the local confirm is transmitted later than the AP's confirm: for iwlwifi (and possibly mt76) the AP's confirm would then get lost since the device isn't on the channel at the time the AP transmits the confirm. For iwlwifi at least, this also improves the overall timing of the authentication handshake (by about 15ms according to the report), likely since the session protection won't be aborted and rescheduled. Note that even before this, mgd_complete_tx() wasn't always called for each call to mgd_prepare_tx() (e.g. in the case of WEP key shared authentication), and the current drivers that have the complete callback don't seem to mind. Document this as well though. Reported-by: Jan Hendrik Farr <kernel@jfarr.cc> Closes: https://lore.kernel.org/all/aB30Ea2kRG24LINR@archlinux/ Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20250609213232.12691580e140.I3f1d3127acabcd58348a110ab11044213cf147d3@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-09 | wifi: cfg80211/mac80211: implement dot11ExtendedRegInfoSupport | Somashekhar Puttagangaiah
Implement dot11ExtendedRegInfoSupport to advertise non-AP station regulatory power capability as part of regulatory connectivity element in (Re)Association request frames so that AP can achieve maximum client connectivity. Control field which was interpreted using value of 3-bits B5 to B3, now uses value of 4-bits B6 to B3 to interpret the type of AP. Hence update IEEE80211_HE_6GHZ_OPER_CTRL_REG_INFO to parse 4-bits control field. If older AP still updates only 3-bits value of control field, station can still interpret the value as per section E.2.7 of IEEE 802.11 REVme D7.0 and support the appropriate AP type. Also update IEEE80211_6GHZ_CTRL_REG_INDOOR_SP_AP as the value of standard power AP is changed to 8 instead of 4 so that AP can support both LPI AP and SP AP to maximize the connectivity with stations. For backward compatibility, keeping value 4 as old AP by limiting it to SP AP only. Signed-off-by: Somashekhar Puttagangaiah <somashekhar.puttagangaiah@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20250609213232.90cdef116aad.I85da390fbee59355e3855691933e6a5e55c47ac4@changeid [fix kernel-doc] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-09 | wifi: cfg80211: add a flag for the first part of a scan | Benjamin Berg
When there are no non-6 GHz channels, then the 6 GHz scan is the first part of a split scan. Add a boolean denoting whether the scan is the first part of a scan as it might be useful to drivers for internal bookkeeping. This flag is also set if the scan is not split. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20250609213231.07e5a8a452ec.Ibf18f513e507422078fb31b28947e582a20df87a@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-09 | wifi: mac80211: remove DISALLOW_PUNCTURING_5GHZ code | Johannes Berg
Since iwlwifi was the only driver using this and no longer does, we can remove all this code. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20250609213231.4dff5fb8890f.Ie531f912b252a0042c18c0734db50c3afe1adfb5@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-09 | wifi: cfg80211: only verify part of Extended MLD Capabilities | Benjamin Berg
We verify that the Extended MLD Capabilities are matching between links. However, some bits are reserved and in particular the Recommended Max Links subfield may not necessarily match. So only verify the known subfields that can reliably be expected to be the same. More information can be found in Table 9-417o, in IEEE P802.11be/D7.0. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20250609213231.a2fad48dd3e6.Iae1740cd2ac833bc4a64fd2af718e1485158fd42@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-09 | wifi: cfg80211: hide scan internals | Johannes Berg
Hide the internal scan fields from mac80211 and drivers, the 'notified' variable is for internal tracking, and the 'info' is output that's passed to cfg80211_scan_done() and stored only for delayed userspace notification. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Benjamin Berg <benjamin.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20250609213231.6a62e41858e2.I004f66e9c087cc6e6ae4a24951cf470961ee9466@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-09 | vdso/helpers: Add helpers for seqlocks of single vdso_clock | Thomas Weißschuh
Auxiliary clocks will have their vDSO data in a dedicated 'struct vdso_clock', which needs to be synchronized independently. Add a helper to synchronize a single vDSO clock. [ tglx: Move the SMP memory barriers to the call sites and get rid of the confusing first/last arguments and conditional barriers ] Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250701-vdso-auxclock-v1-4-df7d9f87b9b8@linutronix.de
2025-07-09 | vdso/vsyscall: Split up __arch_update_vsyscall() into __arch_update_vdso_clock() | Thomas Weißschuh
The upcoming auxiliary clocks need this hook, too. To separate the architecture hooks from the timekeeper internals, refactor the hook to only operate on a single vDSO clock. While at it, use a more robust #define for the hook override. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250701-vdso-auxclock-v1-3-df7d9f87b9b8@linutronix.de
2025-07-09 | Merge v6.16-rc2 into timers/ptp | Thomas Gleixner
to pick up the __GENMASK() fix, otherwise the AUX clock VDSO patches fail to compile for compat. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2025-07-09 | wifi: mac80211: avoid weird state in error path | Miri Korenblit
If we get to the error path of ieee80211_prep_connection, for example because of a FW issue, then ieee80211_vif_set_links is called with 0. But the call to drv_change_vif_links from ieee80211_vif_update_links will probably fail as well, for the same reason. In this case, the valid_links and active_links bitmaps will be reverted to the value of the failing connection. Then, in the next connection, due to the logic of ieee80211_set_vif_links_bitmaps, valid_links will be set to the ID of the new connection assoc link, but the active_links will remain with the ID of the old connection's assoc link. If those IDs are different, we get into a weird state of valid_links and active_links being different. One of the consequences of this state is to call drv_change_vif_links with new_links as 0, since the & operation between the bitmaps will be 0. Since a removal of a link should always succeed, ignore the return value of drv_change_vif_links if it was called to only remove links, which is the case for the ieee80211_prep_connection's error path. That way, the bitmaps will not be reverted to have the value from the failing connection and will have 0, so the next connection will have a good state. Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Link: https://patch.msgid.link/20250609213231.ba2011fb435f.Id87ff6dab5e1cf757b54094ac2d714c656165059@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-09 | Merge tag 'pm-runtime-6.17-rc1' of https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm | Uwe Kleine-König
Runtime PM updates related to autosuspend for 6.17: Make several autosuspend functions mark last busy stamp and update the documentation accordingly (Sakari Ailus).
2025-07-09 | RDMA/uverbs: Add empty rdma_uattrs_has_raw_cap() declaration | Leon Romanovsky
The call to rdma_uattrs_has_raw_cap() is placed in the mlx5 fs.c file, which is compiled without relation to CONFIG_INFINIBAND_USER_ACCESS. Although the check is used only in flows with CONFIG_INFINIBAND_USER_ACCESS=y|m, the compiler generates the following error for CONFIG_INFINIBAND_USER_ACCESS=n builds. >> ERROR: modpost: "rdma_uattrs_has_raw_cap" [drivers/infiniband/hw/mlx5/mlx5_ib.ko] undefined! Fixes: f458ccd2aa2c ("RDMA/uverbs: Check CAP_NET_RAW in user namespace for flow create") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202507080725.bh7xrhpg-lkp@intel.com/ Link: https://patch.msgid.link/72dee6b379bd709255a5d8e8010b576d50e47170.1751967071.git.leon@kernel.org Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
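The fix follows the usual kernel header pattern of pairing the real declaration with a static inline stub for configs where the symbol isn't built, so callers such as mlx5's fs.c link either way. The parameter type below is an assumption for illustration.

	#if IS_ENABLED(CONFIG_INFINIBAND_USER_ACCESS)
	bool rdma_uattrs_has_raw_cap(const struct uverbs_attr_bundle *attrs);
	#else
	/* Stub for CONFIG_INFINIBAND_USER_ACCESS=n builds: the capability can
	 * never be present, so report false without referencing the symbol. */
	static inline bool rdma_uattrs_has_raw_cap(const struct uverbs_attr_bundle *attrs)
	{
		return false;
	}
	#endif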
2025-07-08 | ipv6: anycast: Don't hold RTNL for IPV6_JOIN_ANYCAST. | Kuniyuki Iwashima
inet6_sk(sk)->ipv6_ac_list is protected by lock_sock(). In ipv6_sock_ac_join(), only __dev_get_by_index(), __dev_get_by_flags(), and __in6_dev_get() require RTNL. __dev_get_by_flags() is only used by ipv6_sock_ac_join() and can be converted to an RCU version. Let's switch to the RCU version helpers and drop RTNL from IPV6_JOIN_ANYCAST. setsockopt_needs_rtnl() will be removed in the next patch. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-15-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08 | af_unix: Introduce SO_INQ. | Kuniyuki Iwashima
We have an application that uses almost the same code for TCP and AF_UNIX (SOCK_STREAM). TCP can use TCP_INQ, but AF_UNIX doesn't have it and requires an extra syscall, ioctl(SIOCINQ) or getsockopt(SO_MEMINFO) as an alternative. Let's introduce the generic version of TCP_INQ. If SO_INQ is enabled, recvmsg() will put a cmsg of SCM_INQ that contains the exact value of ioctl(SIOCINQ). The cmsg is also included when msg->msg_get_inq is non-zero to make sockets io_uring-friendly. Note that SOCK_CUSTOM_SOCKOPT is flagged only for SOCK_STREAM to override setsockopt() for SOL_SOCKET. By having the flag in struct unix_sock, instead of struct sock, we can later add SO_INQ support for TCP and reuse tcp_sk(sk)->recvmsg_inq. Note also that supporting custom getsockopt() for SOL_SOCKET will need preparation for other SOCK_CUSTOM_SOCKOPT users (UDP, vsock, MPTCP). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250702223606.1054680-7-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
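A hedged userspace sketch of how the new option is meant to be consumed, mirroring TCP_INQ usage. SO_INQ and SCM_INQ are introduced by this series, so the constants must come from headers that already carry the new uAPI definitions; the helper name is illustrative.

	#include <string.h>
	#include <sys/socket.h>
	#include <sys/uio.h>

	static ssize_t recv_with_inq(int fd, void *buf, size_t len, int *inq)
	{
		int one = 1;
		char cbuf[CMSG_SPACE(sizeof(int))];
		struct iovec iov = { .iov_base = buf, .iov_len = len };
		struct msghdr msg = {
			.msg_iov = &iov,
			.msg_iovlen = 1,
			.msg_control = cbuf,
			.msg_controllen = sizeof(cbuf),
		};
		struct cmsghdr *cmsg;
		ssize_t n;

		/* Ask the kernel to report the receive-queue length via cmsg
		 * (normally done once per socket; repeated here only to keep
		 * the sketch self-contained). */
		if (setsockopt(fd, SOL_SOCKET, SO_INQ, &one, sizeof(one)) < 0)
			return -1;

		n = recvmsg(fd, &msg, 0);
		if (n < 0)
			return n;

		/* The kernel attaches the same value ioctl(SIOCINQ) would return. */
		for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
			if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_INQ)
				memcpy(inq, CMSG_DATA(cmsg), sizeof(*inq));
		}
		return n;
	}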
2025-07-08 | af_unix: Use cached value for SOCK_STREAM in unix_inq_len(). | Kuniyuki Iwashima
Compared to TCP, ioctl(SIOCINQ) for AF_UNIX SOCK_STREAM socket is more expensive, as unix_inq_len() requires iterating through the receive queue and accumulating skb->len. Let's cache the value for SOCK_STREAM to a new field during sendmsg() and recvmsg(). The field is protected by the receive queue lock. Note that ioctl(SIOCINQ) for SOCK_DGRAM returns the length of the first skb in the queue. SOCK_SEQPACKET still requires iterating through the queue because we do not touch functions shared with unix_dgram_ops. But, if really needed, we can support it by switching __skb_try_recv_datagram() to a custom version. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250702223606.1054680-5-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08 | Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux | Jakub Kicinski
Tariq Toukan says: ==================== mlx5-next updates 2025-07-08 The following pull-request contains common mlx5 updates for your *net-next* tree. v2: https://lore.kernel.org/1751574385-24672-1-git-send-email-tariqt@nvidia.com * 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: net/mlx5: Check device memory pointer before usage net/mlx5: fs, fix RDMA TRANSPORT init cleanup flow net/mlx5: Add IFC bits for PCIe Congestion Event object net/mlx5: Small refactor for general object capabilities net/mlx5: fs, add multiple prios to RDMA TRANSPORT steering domain ==================== Link: https://patch.msgid.link/1752002102-11316-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08 | cpumask: introduce cpumask_random() | Yury Norov [NVIDIA]
Propagate find_random_bit() to cpumask API. CC: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Yury Norov [NVIDIA]" <yury.norov@gmail.com>
2025-07-08 | bitmap: generalize node_random() | Yury Norov [NVIDIA]
Generalize node_random() and make it available to general bitmaps and cpumasks users. Notice, find_first_bit() is generally faster than find_nth_bit(), and we employ it when there's a single set bit in the bitmap. See commit 3e061d924fe9c7b4 ("lib/nodemask: optimize node_random for nodemask with single NUMA node"). CC: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Yury Norov [NVIDIA]" <yury.norov@gmail.com>
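The described approach (the cheap find_first_bit() path when only one bit is set, find_nth_bit() otherwise) roughly amounts to the sketch below; it illustrates the technique and is not the actual lib/ implementation.

	#include <linux/bitmap.h>
	#include <linux/find.h>
	#include <linux/random.h>

	static unsigned long example_bitmap_random(const unsigned long *map,
						   unsigned long nbits)
	{
		unsigned long w = bitmap_weight(map, nbits);

		if (w == 0)
			return nbits;				/* no bits set */
		if (w == 1)
			return find_first_bit(map, nbits);	/* fast path */

		/* Pick a uniformly random index among the w set bits. */
		return find_nth_bit(map, nbits, get_random_u32_below(w));
	}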
2025-07-08 | KVM: arm64: gic-v5: Support GICv3 compat | Sascha Bischoff
Add support for GICv3 compat mode (FEAT_GCIE_LEGACY) which allows a GICv5 host to run GICv3-based VMs. This change enables the VHE/nVHE/hVHE/protected modes, but does not support nested virtualization. A lazy-disable approach is taken for compat mode; it is enabled on the vgic_v3_load path but not disabled on the vgic_v3_put path. A non-GICv3 VM, i.e., one based on GICv5, is responsible for disabling compat mode on the corresponding vgic_v5_load path. Currently, GICv5 is not supported, and hence compat mode is not disabled again once it is enabled, and this function is intentionally omitted from the code. Co-authored-by: Timothy Hayes <timothy.hayes@arm.com> Signed-off-by: Timothy Hayes <timothy.hayes@arm.com> Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Link: https://lore.kernel.org/r/20250627100847.1022515-5-sascha.bischoff@arm.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-07-08 | irqchip/gic-v5: Populate struct gic_kvm_info | Sascha Bischoff
Populate the gic_kvm_info struct based on support for FEAT_GCIE_LEGACY. The struct is used by KVM to probe for a compatible GIC. Co-authored-by: Timothy Hayes <timothy.hayes@arm.com> Signed-off-by: Timothy Hayes <timothy.hayes@arm.com> Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Reviewed-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Link: https://lore.kernel.org/r/20250627100847.1022515-3-sascha.bischoff@arm.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-07-08 | Merge tag 'bitmap-for-6.16-rc6' of https://github.com/norov/linux | Linus Torvalds
Pull bitops UAPI fix from Yury Norov: "Fix BITS_PER_LONG merge error Tomas' fix for __BITS_PER_LONG was effectively reverted by a wrong merge. Fix it and add the related files to MAINTAINERS" * tag 'bitmap-for-6.16-rc6' of https://github.com/norov/linux: MAINTAINERS: bitmap: add UAPI headers uapi: bitops: use UAPI-safe variant of BITS_PER_LONG again (2)
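The underlying rule the fix restores, shown as a sketch (the macro name is illustrative, not the actual uapi definition): UAPI headers cannot rely on the kernel-only BITS_PER_LONG, so they must build on __BITS_PER_LONG from <asm/bitsperlong.h>.

	#include <asm/bitsperlong.h>	/* UAPI-safe: defines __BITS_PER_LONG */

	/* Illustrative UAPI-style mask macro built on __BITS_PER_LONG rather
	 * than the kernel-internal BITS_PER_LONG. Sets bits l..h inclusive. */
	#define __EXAMPLE_GENMASK(h, l) \
		(((~0UL) - (1UL << (l)) + 1) & (~0UL >> (__BITS_PER_LONG - 1 - (h))))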
2025-07-08 | net: ethtool: remove the compat code for _rxfh_context ops | Jakub Kicinski
All drivers are now converted to dedicated _rxfh_context ops. Remove the use of >set_rxfh() to manage additional contexts. Reviewed-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://patch.msgid.link/20250707184115.2285277-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08 | irqchip/gic-v5: Add GICv5 IWB support | Lorenzo Pieralisi
The GICv5 architecture implements the Interrupt Wire Bridge (IWB) in order to support wired interrupts that cannot be connected directly to an IRS and instead uses the ITS to translate a wire event into an IRQ signal. Add the wired-to-MSI IWB driver to manage IWB wired interrupts. An IWB is connected to an ITS and it has its own deviceID for all interrupt wires that it manages; the IWB input wire number must be exposed to the ITS as an eventID with a 1:1 mapping. This eventID is not programmable and therefore requires a new msi_alloc_info_t flag to make sure the ITS driver does not allocate an eventid for the wire but rather it uses the msi_alloc_info_t.hwirq number to gather the ITS eventID. Co-developed-by: Sascha Bischoff <sascha.bischoff@arm.com> Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Co-developed-by: Timothy Hayes <timothy.hayes@arm.com> Signed-off-by: Timothy Hayes <timothy.hayes@arm.com> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250703-gicv5-host-v7-29-12e71f1b3528@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-07-08 | irqchip/gic-v5: Add GICv5 ITS support | Lorenzo Pieralisi
The GICv5 architecture implements Interrupt Translation Service (ITS) components in order to translate events coming from peripherals into interrupt events delivered to the connected IRSes. Events (ie MSI memory writes to ITS translate frame), are translated by the ITS using tables kept in memory. ITS translation tables for peripherals is kept in memory storage (device table [DT] and Interrupt Translation Table [ITT]) that is allocated by the driver on boot. Both tables can be 1- or 2-level; the structure is chosen by the driver after probing the ITS HW parameters and checking the allowed table splits and supported {device/event}_IDbits. DT table entries are allocated on demand (ie when a device is probed); the DT table is sized using the number of supported deviceID bits in that that's a system design decision (ie the number of deviceID bits implemented should reflect the number of devices expected in a system) therefore it makes sense to allocate a DT table that can cater for the maximum number of devices. DT and ITT tables are allocated using the kmalloc interface; the allocation size may be smaller than a page or larger, and must provide contiguous memory pages. LPIs INTIDs backing the device events are allocated one-by-one and only upon Linux IRQ allocation; this to avoid preallocating a large number of LPIs to cover the HW device MSI vector size whereas few MSI entries are actually enabled by a device. ITS cacheability/shareability attributes are programmed according to the provided firmware ITS description. The GICv5 partially reuses the GICv3 ITS MSI parent infrastructure and adds functions required to retrieve the ITS translate frame addresses out of msi-map and msi-parent properties to implement the GICv5 ITS MSI parent callbacks. Co-developed-by: Sascha Bischoff <sascha.bischoff@arm.com> Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Co-developed-by: Timothy Hayes <timothy.hayes@arm.com> Signed-off-by: Timothy Hayes <timothy.hayes@arm.com> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250703-gicv5-host-v7-28-12e71f1b3528@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-07-08 | irqchip/msi-lib: Add IRQ_DOMAIN_FLAG_FWNODE_PARENT handling | Lorenzo Pieralisi
In some irqchip implementations the fwnode representing the IRQdomain and the MSI controller fwnode do not match; in particular the IRQdomain fwnode is the MSI controller fwnode parent. To support selecting such IRQ domains, add a flag in core IRQ domain code that explicitly tells the MSI lib to use the parent fwnode while carrying out IRQ domain selection. Update the msi-lib select callback with the resulting logic. Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250703-gicv5-host-v7-27-12e71f1b3528@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-07-08 | PCI/MSI: Add pci_msi_map_rid_ctlr_node() helper function | Lorenzo Pieralisi
IRQchip drivers need a PCI/MSI function to map a RID to a MSI controller deviceID namespace and at the same time retrieve the struct device_node pointer of the MSI controller the RID is mapped to. Add pci_msi_map_rid_ctlr_node() to achieve this purpose. Cc Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250703-gicv5-host-v7-25-12e71f1b3528@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-07-08 | of/irq: Add of_msi_xlate() helper function | Lorenzo Pieralisi
Add an of_msi_xlate() helper that maps a device ID and returns the device node of the MSI controller the device ID is mapped to. Required by core functions that need an MSI controller device node pointer at the same time as a mapped device ID, of_msi_map_id() is not sufficient for that purpose. Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Reviewed-by: Rob Herring (Arm) <robh@kernel.org> Link: https://lore.kernel.org/r/20250703-gicv5-host-v7-24-12e71f1b3528@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-07-08 | irqchip/gic-v5: Add GICv5 LPI/IPI support | Lorenzo Pieralisi
An IRS supports Logical Peripheral Interrupts (LPIs) and implement Linux IPIs on top of it. LPIs are used for interrupt signals that are translated by a GICv5 ITS (Interrupt Translation Service) but also for software generated IRQs - namely interrupts that are not driven by a HW signal, ie IPIs. LPIs rely on memory storage for interrupt routing and state. LPIs state and routing information is kept in the Interrupt State Table (IST). IRSes provide support for 1- or 2-level IST tables configured to support a maximum number of interrupts that depend on the OS configuration and the HW capabilities. On systems that provide 2-level IST support, always allow the maximum number of LPIs; On systems with only 1-level support, limit the number of LPIs to 2^12 to prevent wasting memory (presumably a system that supports a 1-level only IST is not expecting a large number of interrupts). On a 2-level IST system, L2 entries are allocated on demand. The IST table memory is allocated using the kmalloc() interface; the allocation required may be smaller than a page and must be made up of contiguous physical pages if larger than a page. On systems where the IRS is not cache-coherent with the CPUs, cache mainteinance operations are executed to clean and invalidate the allocated memory to the point of coherency making it visible to the IRS components. On GICv5 systems, IPIs are implemented using LPIs. Add an LPI IRQ domain and implement an IPI-specific IRQ domain created as a child/subdomain of the LPI domain to allocate the required number of LPIs needed to implement the IPIs. IPIs are backed by LPIs, add LPIs allocation/de-allocation functions. The LPI INTID namespace is managed using an IDA to alloc/free LPI INTIDs. Associate an IPI irqchip with IPI IRQ descriptors to provide core code with the irqchip.ipi_send_single() method required to raise an IPI. Co-developed-by: Sascha Bischoff <sascha.bischoff@arm.com> Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Co-developed-by: Timothy Hayes <timothy.hayes@arm.com> Signed-off-by: Timothy Hayes <timothy.hayes@arm.com> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/20250703-gicv5-host-v7-22-12e71f1b3528@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-07-08 | irqchip/gic-v5: Add GICv5 IRS/SPI support | Lorenzo Pieralisi
The GICv5 Interrupt Routing Service (IRS) component implements interrupt management and routing in the GICv5 architecture. A GICv5 system comprises one or more IRSes, that together handle the interrupt routing and state for the system. An IRS supports Shared Peripheral Interrupts (SPIs), that are interrupt sources directly connected to the IRS; they do not rely on memory for storage. The number of supported SPIs is fixed for a given implementation and can be probed through IRS IDR registers. SPI interrupt state and routing are managed through GICv5 instructions. Each core (PE in GICv5 terms) in a GICv5 system is identified with an Interrupt AFFinity ID (IAFFID). An IRS manages a set of cores that are connected to it. Firmware provides a topology description that the driver uses to detect to which IRS a CPU (ie an IAFFID) is associated with. Use probeable information and firmware description to initialize the IRSes and implement GICv5 IRS SPIs support through an SPI-specific IRQ domain. The GICv5 IRS driver: - Probes IRSes in the system to detect SPI ranges - Associates an IRS with a set of cores connected to it - Adds an IRQchip structure for SPI handling SPIs priority is set to a value corresponding to the lowest permissible priority in the system (taking into account the implemented priority bits of the IRS and CPU interface). Since all IRQs are set to the same priority value, the value itself does not matter as long as it is a valid one. Co-developed-by: Sascha Bischoff <sascha.bischoff@arm.com> Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Co-developed-by: Timothy Hayes <timothy.hayes@arm.com> Signed-off-by: Timothy Hayes <timothy.hayes@arm.com> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/20250703-gicv5-host-v7-21-12e71f1b3528@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-07-08 | irqchip/gic-v5: Add GICv5 PPI support | Lorenzo Pieralisi
The GICv5 CPU interface implements support for PE-Private Peripheral Interrupts (PPI), that are handled (enabled/prioritized/delivered) entirely within the CPU interface hardware. To enable PPI interrupts, implement the baseline GICv5 host kernel driver infrastructure required to handle interrupts on a GICv5 system. Add the exception handling code path and definitions for GICv5 instructions. Add GICv5 PPI handling code as a specific IRQ domain to: - Set-up PPI priority - Manage PPI configuration and state - Manage IRQ flow handler - IRQs allocation/free - Hook-up a PPI specific IRQchip to provide the relevant methods PPI IRQ priority is chosen as the minimum allowed priority by the system design (after probing the number of priority bits implemented by the CPU interface). Co-developed-by: Sascha Bischoff <sascha.bischoff@arm.com> Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Co-developed-by: Timothy Hayes <timothy.hayes@arm.com> Signed-off-by: Timothy Hayes <timothy.hayes@arm.com> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/20250703-gicv5-host-v7-20-12e71f1b3528@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-07-08 | io_uring/msg_ring: ensure io_kiocb freeing is deferred for RCU | Jens Axboe
syzbot reports that defer/local task_work adding via msg_ring can hit a request that has been freed: CPU: 1 UID: 0 PID: 19356 Comm: iou-wrk-19354 Not tainted 6.16.0-rc4-syzkaller-00108-g17bbde2e1716 #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 Call Trace: <TASK> dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:408 [inline] print_report+0xd2/0x2b0 mm/kasan/report.c:521 kasan_report+0x118/0x150 mm/kasan/report.c:634 io_req_local_work_add io_uring/io_uring.c:1184 [inline] __io_req_task_work_add+0x589/0x950 io_uring/io_uring.c:1252 io_msg_remote_post io_uring/msg_ring.c:103 [inline] io_msg_data_remote io_uring/msg_ring.c:133 [inline] __io_msg_ring_data+0x820/0xaa0 io_uring/msg_ring.c:151 io_msg_ring_data io_uring/msg_ring.c:173 [inline] io_msg_ring+0x134/0xa00 io_uring/msg_ring.c:314 __io_issue_sqe+0x17e/0x4b0 io_uring/io_uring.c:1739 io_issue_sqe+0x165/0xfd0 io_uring/io_uring.c:1762 io_wq_submit_work+0x6e9/0xb90 io_uring/io_uring.c:1874 io_worker_handle_work+0x7cd/0x1180 io_uring/io-wq.c:642 io_wq_worker+0x42f/0xeb0 io_uring/io-wq.c:696 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> which is supposed to be safe with how requests are allocated. But msg ring requests alloc and free on their own, and hence must defer freeing to a sane time. Add an rcu_head and use kfree_rcu() in both spots where requests are freed. Only the one in io_msg_tw_complete() is strictly required as it has been visible on the other ring, but use it consistently in the other spot as well. This should not cause any other issues outside of KASAN rightfully complaining about it. Link: https://lore.kernel.org/io-uring/686cd2ea.a00a0220.338033.0007.GAE@google.com/ Reported-by: syzbot+54cbbfb4db9145d26fc2@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Fixes: 0617bb500bfa ("io_uring/msg_ring: improve handling of target CQE posting") Signed-off-by: Jens Axboe <axboe@kernel.dk>
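The deferral pattern applied here, in generic form (the struct below is illustrative, not the io_uring io_kiocb itself): embed an rcu_head in the object and free it with kfree_rcu(), so the actual kfree() happens only after an RCU grace period.

	#include <linux/rcupdate.h>
	#include <linux/slab.h>

	struct example_req {
		int		data;
		struct rcu_head	rcu;	/* storage for the deferred free */
	};

	static void example_req_free(struct example_req *req)
	{
		/* Instead of kfree(req): concurrent RCU-protected readers may
		 * still dereference req, so defer the free past the grace period. */
		kfree_rcu(req, rcu);
	}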
2025-07-08 | perf: arm_pmuv3: Add support for the Branch Record Buffer Extension (BRBE) | Rob Herring (Arm)
The ARMv9.2 architecture introduces the optional Branch Record Buffer Extension (BRBE), which records information about branches as they are executed into set of branch record registers. BRBE is similar to x86's Last Branch Record (LBR) and PowerPC's Branch History Rolling Buffer (BHRB). BRBE supports filtering by exception level and can filter just the source or target address if excluded to avoid leaking privileged addresses. The h/w filter would be sufficient except when there are multiple events with disjoint filtering requirements. In this case, BRBE is configured with a union of all the events' desired branches, and then the recorded branches are filtered based on each event's filter. For example, with one event capturing kernel events and another event capturing user events, BRBE will be configured to capture both kernel and user branches. When handling event overflow, the branch records have to be filtered by software to only include kernel or user branch addresses for that event. In contrast, x86 simply configures LBR using the last installed event which seems broken. It is possible on x86 to configure branch filter such that no branches are ever recorded (e.g. -j save_type). For BRBE, events with a configuration that will result in no samples are rejected. Recording branches in KVM guests is not supported like x86. However, perf on x86 allows requesting branch recording in guests. The guest events are recorded, but the resulting branches are all from the host. For BRBE, events with branch recording and "exclude_host" set are rejected. Requiring "exclude_guest" to be set did not work. The default for the perf tool does set "exclude_guest" if no exception level options are specified. However, specifying kernel or user events defaults to including both host and guest. In this case, only host branches are recorded. BRBE can support some additional exception branch types compared to x86. On x86, all exceptions other than syscalls are recorded as IRQ. With BRBE, it is possible to better categorize these exceptions. One limitation relative to x86 is we cannot distinguish a syscall return from other exception returns. So all exception returns are recorded as ERET type. The FIQ branch type is omitted as the only FIQ user is Apple platforms which don't support BRBE. The debug branch types are omitted as there is no clear need for them. BRBE records are invalidated whenever events are reconfigured, a new task is scheduled in, or after recording is paused (and the records have been recorded for the event). The architecture allows branch records to be invalidated by the PE under implementation defined conditions. It is expected that these conditions are rare. Cc: Catalin Marinas <catalin.marinas@arm.com> Co-developed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Co-developed-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Tested-by: James Clark <james.clark@linaro.org> Signed-off-by: Rob Herring (Arm) <robh@kernel.org> tested-by: Adam Young <admiyo@os.amperecomputing.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20250611-arm-brbe-v19-v23-4-e7775563036e@kernel.org [will: Fix sparse warnings about mixed declarations and code. Fix C99 comment syntax.] Signed-off-by: Will Deacon <will@kernel.org>
2025-07-08 | tun: enable gso over UDP tunnel support. | Paolo Abeni
Add new tun features to represent the newly introduced virtio GSO over UDP tunnel offload. Allows detection and selection of such features via the existing TUNSETOFFLOAD ioctl and compute the expected virtio header size and tunnel header offset using the current netdev features, so that we can plug almost seamless the newly introduced virtio helpers to serialize the extended virtio header. Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> --- v6 -> v7: - rebased v4 -> v5: - encapsulate the guest feature guessing in a tun helper - dropped irrelevant check on xdp buff headroom - do not remove unrelated black line - avoid line len > 80 char v3 -> v4: - virtio tnl-related fields are at fixed offset, cleanup the code accordingly. - use netdev features instead of flags bit to check for the configured offload - drop packet in case of enabled features/configured hdr size mismatch v2 -> v3: - cleaned-up uAPI comments - use explicit struct layout instead of raw buf.
2025-07-08 | net: implement virtio helpers to handle UDP GSO tunneling. | Paolo Abeni
The virtio specification are introducing support for GSO over UDP tunnel. This patch brings in the needed defines and the additional virtio hdr parsing/building helpers. The UDP tunnel support uses additional fields in the virtio hdr, and such fields location can change depending on other negotiated features - specifically VIRTIO_NET_F_HASH_REPORT. Try to be as conservative as possible with the new field validation. Existing implementation for plain GSO offloads allow for invalid/ self-contradictory values of such fields. With GSO over UDP tunnel we can be more strict, with no need to deal with legacy implementation. Since the checksum-related field validation is asymmetric in the driver and in the device, introduce a separate helper to implement the new checks (to be used only on the driver side). Note that while the feature space exceeds the 64-bit boundaries, the guest offload space is fixed by the specification of the VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET command to a 64-bit size. Prior to the UDP tunnel GSO support, each guest offload bit corresponded to the feature bit with the same value and vice versa. Due to the limited 'guest offload' space, relevant features in the high 64 bits are 'mapped' to free bits in the lower range. That is simpler than defining a new command (and associated features) to exchange an extended guest offloads set. As a consequence, the uAPIs also specify the mapped guest offload value corresponding to the UDP tunnel GSO features. Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> -- v4 -> v5: - avoid lines above 80 chars v3 -> v4: - fixed offset for UDP GSO tunnel, update accordingly the helpers - tried to clarified vlan_hlen semantic - virtio_net_chk_data_valid() -> virtio_net_handle_csum_offload() v2 -> v3: - add definitions for possible vnet hdr layouts with tunnel support v1 -> v2: - 'relay' -> 'rely' typo - less unclear comment WRT enforced inner GSO checks - inner header fields are allowed only with 'modern' virtio, thus are always le - clarified in the commit message the need for 'mapped features' defines - assume little_endian is true when UDP GSO is enabled. - fix inner proto type value
2025-07-08 | vhost-net: allow configuring extended features | Paolo Abeni
Use the extended feature type for 'acked_features' and implement two new ioctl operations allowing user space to set/query an unbounded number of features. The actual number of processed features is limited by VIRTIO_FEATURES_MAX, and attempts to set features above that limit fail with EOPNOTSUPP. Note that the legacy ioctls implicitly truncate the negotiated features to the lower 64-bit range, and the 'acked_backend_features' field doesn't need conversion, as the only negotiated feature there is in the low 64-bit range. Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>