summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2026-06-05iomap: introduce IOMAP_F_FSVERITY and teach writeback to handle fsverityAndrey Albershteyn
This flag indicates that I/O is for fsverity metadata. In the write path skip i_size check and i_size updates as metadata is past EOF. In writeback don't update i_size and continue writeback if even folio is beyond EOF. In read path don't zero fsverity folios, again they are past EOF. The iomap_block_needs_zeroing() is also called from write path. For folios of larger order we don't want to zero out pages in the folio as these could contain other merkle tree blocks. For fsverity, filesystem will request to read PAGE_SIZE memory regions. For data folios, iomap will zero the rest of the folio for anything which is beyond EOF. We don't want this for fsverity folios. Christian Brauner <brauner@kernel.org> says: Changed IOMAP_F_FSVERITY from (1U << 10) to (1U << 11) to avoid colliding with IOMAP_F_ZERO_TAIL, which already uses (1U << 10). Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Link: https://patch.msgid.link/20260520123722.405752-8-aalbersh@kernel.org Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-06-05fsverity: generate and store zero-block hashAndrey Albershteyn
Compute the hash of one filesystem block's worth of zeros. A filesystem implementation can decide to elide merkle tree blocks containing only this hash and synthesize the contents at read time. Let's pretend that there's a file containing 131 data block and whose merkle tree looks roughly like this: root +--leaf0 | +--data0 | +--data1 | +--... | `--data128 `--leaf1 +--data129 +--data130 `--data131 If data[0-128] are sparse holes, then leaf0 will contain a repeating sequence of @zero_digest. Therefore, leaf0 need not be written to disk because its contents can be synthesized. A subsequent xfs patch will use this to reduce the size of the merkle tree when dealing with sparse gold master disk images and the like. Note that this works only on the first-level (data holes). fsverity doesn't store/generate zero_digest for any higher levels. Add a helper to pre-fill folio with hashes of empty blocks. This will be used by iomap to synthesize blocks full of zero hashes on the fly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Link: https://patch.msgid.link/20260520123722.405752-5-aalbersh@kernel.org Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-06-05netfilter: nf_conntrack_helper: dynamically allocate struct nf_conntrack_helperPablo Neira Ayuso
Adapt all existing helpers to use a modified version of nf_ct_helper_init(), to dynamically allocate struct nf_conntrack_helper. Allocate expect_policy[] built-in into the helper to ensure this area is reachable after helper removal since a follow up patch adds refcount to track use of the nf_conntrack_helper structure from packet path so it remains around until last reference from ct helper extension is dropped. Export __nf_conntrack_helper_register() which allows to register nfnetlink_cthelper dynamically allocated helper. Adapt nfnetlink_cthelper to use the built-in expect_policy[]. This is a preparation patch to add packet path refcounting to helpers. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-06-05netfilter: cttimeout: detach dataplane timeout policy and repurpose refcountPablo Neira Ayuso
Add a refcount for struct nf_ct_timeout which is used by ct extension to set the custom ct timeout policy, this tells us that the ct timeout is being used by a conntrack entry. When the last conntrack entry drops the refcount on the ct timeout, the ct timeout is released. Remove the refcount for control plane which controls if the ruleset refers to the timeout policy. After this update, it is possible to remove the ct timeout policy from nfnetlink_cttimeout immediately. This is for simplicity not to handle two refcounts on a single object. Remove nf_queue_nf_hook_drop(): a packet sitting in nfqueue will just hold a reference to the nf_ct_timeout object until packet is reinjected, since this is part of the ct extension, this will be released by the time the conntrack is freed. nf_ct_untimeout() is still called to clean up in a best effort basis: the ct timeout on existing entries gets removed when the ct timeout goes away, but as long as the iptables ruleset still refers to the ct timeout through a template, new conntracks may keep attaching it and extend its lifetime until the rule is removed. nf_ct_untimeout() is not called anymore from module removal path, this is unlikely to find timeouts give module refcount is bumped, and the new refcount already tracks the ct timeout policy use so it is released when unused. Fixes: 50978462300f ("netfilter: add cttimeout infrastructure for fine timeout tuning") Fixes: 7e0b2b57f01d ("netfilter: nft_ct: add ct timeout support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-06-05ipvs: add conn_max sysctl to limit connectionsJulian Anastasov
Currently, we are using atomic_t to track the number of connections. On 64-bit setups with large memory there is a risk this counter to overflow. Also, setups with many containers may need to tune the limit for connections. Add sysctl control to limit the number of connections to 1,073,741,824 (64-bit) and 16,777,216 (32-bit). Depending on the admin's privilege, the value is used to change a soft or hard limit allowing unprivileged admins to change the soft limit in range determined by privileged admins. Link: https://sashiko.dev/#/patchset/20260523172715.94795-1-ja%40ssi.bg Link: https://sashiko.dev/#/patchset/20260430074420.26697-7-ja%40ssi.bg Link: https://sashiko.dev/#/patchset/20260522105546.13732-1-ja%40ssi.bg Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-06-05kasan: Move generic KASAN page tables out of BSS tooArd Biesheuvel
Make sure that all KASAN page tables are emitted into the .pgtbl section (provided that the arch has one - otherwise, fall back to page aligned BSS) This is needed because BSS itself is no longer accessible via the linear map on arm64. Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: kasan-dev@googlegroups.com Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
2026-06-05iomap: introduce IOMAP_F_ZERO_TAIL flagNamjae Jeon
In filesystems that maintain a separate Valid Data Length, such as exFAT and NTFS, a partial write may start at or beyond the current valid_size and extend it. In this case, the region after the previous valid_size but within the same filesystem block is considered unwritten. This patch introduces IOMAP_F_ZERO_TAIL. When this flag is set in iomap, __iomap_write_begin() will zero only the tail portion while preserving any valid data before it in the same block. Without this tail zeroing, stale data in the unwritten portion of the block can remain in the page cache. Subsequent reads can then return incorrect contents from that region. Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Link: https://patch.msgid.link/20260518114705.9601-2-linkinjeon@kernel.org Acked-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-06-05wifi: nl80211: Increase ie_len size to prevent truncated IEs in new peer ↵Thiyagarajan Pandiyan
notifications Currently, ie_len in cfg80211_notify_new_peer_candidate is defined as 1-byte field, capping the maximum IE list size at 255 bytes. When a large beacon is received, the IE list is truncated, passing incomplete data to wpa_supplicant. This causes supplicant to fail parsing the IEs. Increasing the size of ie_len to allow the full length of the IE list to be forwarded properly. Signed-off-by: Thiyagarajan Pandiyan <thiyagarajan@aerlync.com> Link: https://patch.msgid.link/20260605054307.427874-1-thiyagarajan@aerlync.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-06-05media: v4l2-fwnode: Fix subdev owner overwritten in ↵Mirela Rabulea
v4l2_async_register_subdev_sensor() The v4l2 helper v4l2_async_register_subdev_sensor() calls v4l2_async_register_subdev(), which is a macro that expands to __v4l2_async_register_subdev(sd,THIS_MODULE). Since the macro is expanded inside v4l2-fwnode.c, THIS_MODULE resolves to the v4l2-fwnode module rather than the sensor driver module that originally set sd->owner. When v4l2-fwnode is built-in, THIS_MODULE evaluates to NULL, which then overwrites the sensor driver's owner with NULL. This causes the problem that the sensor module's reference count is never incremented during async registration, so the module can be removed while the subdevice is still in use by a notifier (e.g., a CSI-2 receiver bridge driver). Fix this by renaming v4l2_async_register_subdev_sensor() to __v4l2_async_register_subdev_sensor() with an added explicit module argument and introducing a wrapper macro: #define v4l2_async_register_subdev_sensor(sd) \ __v4l2_async_register_subdev_sensor(sd, THIS_MODULE) This ensures the sensor driver module is properly referenced even when the sensor driver does not init the owner field before calling v4l2_async_register_subdev_sensor() and prevents premature module removal. Fixes: aef69d54755d ("media: v4l: fwnode: Add a convenience function for registering sensors") Cc: stable@vger.kernel.org Suggested-by: Frank Li <Frank.Li@nxp.com> Link: https://lore.kernel.org/linux-media/20240315073125.275501-2-sakari.ailus@linux.intel.com/ Signed-off-by: Mirela Rabulea <mirela.rabulea@nxp.com> Reviewed-by: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
2026-06-05batman-adv: uapi: keep kernel-doc in struct member orderSven Eckelmann
The order of the members of struct batadv_coded_packet and struct batadv_unicast_tvlv_packet didn't match the kernel doc. This is the case for all other structures and should also be done the same way for these two. Signed-off-by: Sven Eckelmann <sven@narfation.org>
2026-06-05wind ->s_roots via ->d_sib instead of ->d_hashAl Viro
shrink_dcache_for_umount() is supposed to handle the possibility of some of the dentries to be evicted being in other threads shrink lists; it either kills them, leaving an empty husk to be freed by the owner of shrink list whenever it gets around to that, or it waits for the eviction in progress to get completed. That relies upon dentry remaining attached to the tree until the eviction reaches dentry_unlist() and its ->d_sib gets removed from the list. Unfortunately, the secondary roots are linked via ->d_hash, rather than ->d_sib and they become removed from that list before their inode references are dropped. If shrink_dentry_list() from another thread ends up evicting one of the secondary roots and gets to that point in dentry_kill() when shrink_dcache_for_umount() is looking for secondary roots, the latter will *not* notice anything, possibly leading to warnings about busy inodes at umount time and all kinds of breakage after that. Moreover, shrink_dcache_for_umount() walks the list of secondary roots with no protection whatsoever, so it might end up calling dget() on a dentry that already passed through lockref_mark_dead(&dentry->d_lockref); ending up with corrupted refcount and possible UAF. AFAICS, the most straightforward way to deal with that would be to have secondary roots linked via ->d_sib rather than ->d_hash; then they would remain on the list until killed, and we could use d_add_waiter() machinery to wait for eviction in progress. Changes: * secondary roots look the same as ->s_root from d_unhashed() and d_unlinked() POV now. * secondary roots are represented as "no parent, but on ->d_sib" instead of "no parent, but on ->d_hash". * since ->d_sib is a plain hlist, we protect it with per-superblock spinlock (sb->s_roots_lock) instead of the LSB of the head pointer (for non-root dentries it would be protected by ->d_lock of parent). * __d_obtain_alias() uses ->d_sib for linkage when allocating a secondary root. * d_splice_alias_ops() detects splicing of a secondary root and removes it from the list before calling __d_move(). * dentry_unlist() detects eviction of a secondary root and removes it from the list; no need to play the games for d_walk() sake, since the latter is not going to look for the next sibling of those anyway. * ___d_drop() doesn't care about ->s_roots anymore. * shrink_dcache_for_umount() uses proper locking for access to the list of secondary roots and if it runs into one that is in the middle of eviction waits for that to finish. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-06-05kill d_dispose_if_unused()Al Viro
Rename to_shrink_list() into __move_to_shrink_list(), document and export it. Switch d_dispose_if_unused() users to that and kill d_dispose_if_unused() itself. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-06-05fix a race between d_find_any_alias() and final dput() of NORCU dentriesAl Viro
Refcount of a NORCU dentry must not be incremented after having dropped to zero. Otherwise we might end up with the following race: CPU1: in fast_dput(d), rcu_read_lock(); CPU1: decrements refcount of d to 0 CPU1: notice that it's unhashed CPU2: grab a reference to d CPU2: dput(d), freeing d CPU1: ... looks like we need to evict d, let's grab ->d_lock, recheck the refcount, etc. and that spin_lock(&d->d_lock) ends up a UAF, despite still being in an RCU read-side critical area started back when the refcount had been positive. If not for DCACHE_NORCU in d->d_flags freeing would've been RCU-delayed, so we'd have grabbed ->d_lock, noticed the negative value stored into refcount by __dentry_kill(), dropped the locks and that would be it. For NORCU dentries freeing is _not_ delayed, though. Most of the non-counting references are excluded for NORCU dentries - they are not allowed to be hashed, they never get placed on LRU, they never get placed into anyone's list of children and while dput_to_list() might put them into a shrink list, nobody bumps refcount of something that had been reached that way. However, inode's list of aliases can be a problem - it does not contribute to dentry refcount (for obvious reasons) and we *do* have places that grab references to something found on that list - that's precisely what d_find_alias() is. In case of d_find_alias() we are safe - it skips unhashed aliases, so all NORCU ones are ignored there. d_find_any_alias() is *not* limited to hashed ones, though, and while it's usually called for directories (which never get NORCU dentries), there are callers that use it to get something for non-directories with no hashed aliases. Having d_find_any_alias() hit a NORCU dentry is not impossible - it can be easily arranged if you have CAP_DAC_READ_SEARCH (memfd_create() + mmap() + name_to_handle_at() for /proc/self/map_files/<...> + munmap() + open_by_handle_at() will do that, and adding a second memfd_create() for mount_fd makes it possible to do that without having memfd pinned). The race window is narrow, and it's probably not feasible on bare hardware, but... It's not hard to fix, fortunately: * separate __d_find_dir_alias() (== current __d_find_any_alias()) to be used for directory inodes. * provide dget_alias_ilocked() that would return false for NORCU dentries with zero refcount and return true incrementing refcount otherwise * make __d_find_any_alias() go over the list of aliases, using dget_alias_ilocked() and returning the alias it succeeds on (normally the first one). Any NORCU alias with zero refcount is going to be evicted by the thread that had dropped the final reference; this makes __d_find_any_alias() pretend it had lost the race with eviction. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-06-05VFS: use wait_var_event for waiting in d_alloc_parallel()NeilBrown
Parallel lookup starts with a call of d_alloc_parallel(). That primitive either returns a matching hashed dentry or allocates a new one in the in-lookup state and returns it to the caller. Once the caller is done with lookup, it indicates so either by call of d_{splice_alias,add}() or by call of d_done_lookup(); at that point dentry leaves the in-lookup state. If d_alloc_parallel() finds a matching in-lookup dentry, it must wait for that dentry to leave the in-lookup state, one way or another. Currently by supplying wait_queue_head to d_alloc_parallel(). If d_alloc_parallel() creates a new in-lookup dentry, the address of that wait_queue_head is stored in ->d_wait of new dentry and stays there while it's in the in-lookup; subsequent d_alloc_parallel() will wait on the queue found in the matching in-lookup dentry. Transition out of in-lookup state wakes waiters on that queue (if any). That works, but the calling conventions are inconvenient - the caller must supply wait_queue_head and make sure that it survives at least until the new in-lookup dentry leaves the in-lookup state. That amounts to boilerplate in the d_alloc_parallel() callers that are followed by a call of d_lookup_done() in the same function; in cases like nfs asynchronous unlink it gets worse than that. This patch changes d_alloc_parallel() to use wake_up_var_locked() to wake up waiters, and wait_var_event_spinlock() to wait. dentry->d_lock is used for synchronisation as it is already held and the relevant times. That eliminates the need of caller-supplied wait_queue_head, simplifying the calling conventions. Better yet, we only need one bit of information stored in dentry itself: whether there are any waiters to be woken up, and that can be easily stored in ->d_flags; ->d_wait goes away. The reason we need that bit (DCACHE_LOOKUP_WAITERS) is that with wait_var machinery the queues are shared with all kinds of stuff and there's no way tell if any of the waiters have anything to do with our dentry; most of the time none of them will be relevant, so we need to avoid the pointless wakeups. Another benefit of the new scheme comes from the fact that wakeups have to be done outside of write-side critical areas of ->i_dir_seq; with the old scheme we need to carry the value picked from ->d_wait from __d_lookup_unhash() to the place where we actually wake the waiters up. Now we can just leave DCACHE_LOOKUP_WAITERS in ->d_flags until we get to doing wakeups - that's done within the same ->d_lock scope, so we are fine; new bit is accessed only under ->d_lock and it's seen only on dentries with DCACHE_PAR_LOOKUP in ->d_flags. __d_lookup_unhash() no longer needs to re-init ->d_lru. That was previously shared (in a union) with ->d_wait but ->d_wait is now gone so it no longer corrupts ->d_lru. Co-developed-by: Al Viro <viro@zeniv.linux.org.uk> # saner handling of flags Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-06-05Merge tag 'drm-rust-next-2026-06-04' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/rust/kernel into drm-next DRM Rust changes for v7.2-rc1 - Driver Core (shared via signed tag dd-lifetimes-7.2-rc1): - Introduce Higher-Ranked Lifetime Types (HRT) for Rust device drivers, allowing driver structs to hold device resources like pci::Bar and IoMem directly with a lifetime tied to the binding scope, removing the need for Devres indirection and ARef<Device>. - Replace drvdata() with scoped registration data on the auxiliary bus, using the new ForLt trait to thread lifetimes through registrations. Remove drvdata() and driver_type. - DRM: - Add GPUVM immediate mode abstraction for Rust GPU drivers: - In immediate mode, GPU virtual address space state is updated during job execution (in the DMA fence signalling critical path), keeping the GPUVM and the GPU's address space always in sync. - Provide GpuVm, GpuVa, and GpuVmBo types for managing address spaces, virtual mappings, and GEM object backing respectively. - Provide split-merge map/unmap operations that handle partial overlaps with existing mappings. - drm_exec integration for dma_resv locking and GEM object validation based on the external/evicted object lists are not yet covered and planned as follow-up work. - Introduce DeviceContext type state for drm::Device, allowing drivers to restrict operations to contexts where the device is guaranteed to be registered (or not yet registered) with userspace. - Add FEAT_RENDER flag to the Driver trait for render node support. - Nova: - Hopper/Blackwell enablement: - Add GPU identification and architecture-based HAL selection for Hopper (GH100) and Blackwell (GB100, GB202). - Implement the FSP (Foundation Security Processor) boot path used by Hopper and Blackwell, including FSP falcon engine support, EMEM operations, MCTP/NVDM message infrastructure, and FSP Chain of Trust boot with GSP lockdown release. - Add support for 32-bit firmware images and auto-detection of firmware image format. - Add architecture-specific framebuffer, sysmem flush, PCI config mirror, DMA mask, and WPR/non-WPR heap sizing. - GSP boot and unload: - Refactor the GSP boot process into a chipset-specific HAL, keeping the SEC2 and FSP boot paths separated cleanly. - Implement proper driver unload: send UNLOADING_GUEST_DRIVER command, run Booter Unloader and FWSEC-SB upon unbinding, and run the unload bundle on Gsp::boot() failure. This removes the need for a manual GPU reset between driver unbind and re-probe. - GA100 support: - Add support for the GA100 GPU, including IFR header detection and skipping, correct fwsignature selection, conditional FRTS boot, and documentation of the IFR header layout. - VBIOS hardening and refactoring: - Harden VBIOS parsing with checked arithmetic, bounds-checked accesses, and FromBytes-based structure reads throughout the FWSEC and Falcon data paths. Simplify the overall VBIOS module structure. - HRT adoption: - Use lifetime-parameterized pci::Bar directly, replacing the Arc<Devres<Bar0>> indirection. Replace ARef<Device> with &'bound Device in SysmemFlush and the GSP sequencer. Separate the driver type from driver data. - Misc: - Rename module names to kebab-case (nova-drm, nova-core). - Require little-endian in Kconfig, making the existing assumption explicit. - Tyr: - Define comprehensive typed register blocks for GPU_CONTROL, JOB_CONTROL, MMU_CONTROL (including per-address-space registers), and DOORBELL_BLOCK using the kernel register!() macro. This replaces manual bit manipulation with typed register and field accessors. - Add shmem-backed GEM objects and set DMA mask based on GPU physical address width. - Adopt HRT: separate driver type from driver data, and use IoMem directly instead of Devres for register access during probe. - Move clock cleanup into a Drop implementation. Signed-off-by: Dave Airlie <airlied@redhat.com> From: "Danilo Krummrich" <dakr@kernel.org> Link: https://patch.msgid.link/DJ0IF39U9ETK.PCCUO7ZEQ4S0@kernel.org
2026-06-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-7.1-rc7). Silent conflicts: net/wireless/nl80211.c cb9959ab5f99 ("wifi: cfg80211: enforce HE/EHT cap/oper consistency") a384ae969902 ("wifi: cfg80211: move AP HT/VHT/... operation to beacon info") https://lore.kernel.org/aiGJDaHV4UlCexIQ@sirena.org.uk Conflicts: drivers/net/wireless/intel/iwlwifi/mld/ap.c a342c99cb70d ("wifi: iwlwifi: mld: honor BSS_CHANGED_BEACON_ENABLED") 9bf1b409afc7 ("wifi: iwlwifi: mld: send tx power constraints before link activation") https://lore.kernel.org/ah2bfedhV45ZxMO8@sirena.org.uk drivers/net/wireless/intel/iwlwifi/pcie/drv.c 093305d801fa ("wifi: iwlwifi: pcie: simplify the resume flow if fast resume is not used") e2323929a68a ("wifi: iwlwifi: pcie: add debug print for resume flow if powered off") https://lore.kernel.org/ah2bfedhV45ZxMO8@sirena.org.uk Adjacent changes: drivers/net/ethernet/airoha/airoha_eth.c b38cae85d1c4 ("net: airoha: Fix use-after-free in metadata dst teardown") ec6c391bcca7 ("net: airoha: Introduce airoha_gdm_dev struct") drivers/net/ethernet/microchip/lan743x_main.c 8173d22b211f ("net: lan743x: permit VLAN-tagged packets up to configured MTU") e3c6508a46f5 ("net: lan743x: avoid netdev-based logging before netdev registration") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-04err.h: use __always_inline on all error pointer helpersArnd Bergmann
While testing randconfig builds on s390, I came across a link failure with CONFIG_DMA_SHARED_BUFFER disabled: ERROR: modpost: "dma_buf_put" [drivers/iommu/iommufd/iommufd.ko] undefined! The problem here is that IS_ERR() is not inlined and dead code elimination fails as a consequence. The err.h helpers all turn into a trivial assignment of a bit mask and should never result in a function call, so force them to always be inline. This should generally result in better object code aside from avoiding the link failure above. Link: https://lore.kernel.org/20260526101851.2495110-1-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Tamir Duberstein <tamird@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Ansuel Smith <ansuelsmth@gmail.com> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm: document the folio refcount a little betterMatthew Wilcox (Oracle)
Expand the documentation of folio_ref_count() to talk about expected, temporary and spurious refcounts as well as the concept of freezing. Link: https://lore.kernel.org/20260526200032.353868-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04userfaultfd: make functions that are not used outside uffd staticMike Rapoport (Microsoft)
After merging fs/userfaultfd.c into mm/userfaultfd.c, several functions that were previously shared between the two files are now only used within mm/userfaultfd.c. Make them static and remove their declarations from include/linux/userfaultfd_k.h. Link: https://lore.kernel.org/20260523173759.3964908-3-rppt@kernel.org Assisted-by: Copilot:claude-opus-4-6 Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/mglru: use folio_mark_accessed to replace folio_set_activeBarry Song (Xiaomi)
MGLRU gives high priority to folios mapped in page tables. As a result, folio_set_active() is invoked for all folios read during page faults. In practice, however, readahead can bring in many folios that are never accessed via page tables. A previous attempt by Lei Liu proposed introducing a separate LRU for readahead[1] to make readahead pages easier to reclaim, but that approach is likely over-engineered. Before commit 4d5d14a01e2c ("mm/mglru: rework workingset protection"), folios with PG_active were always placed in the youngest generation, leading to over-protection and increased refaults. After that commit, PG_active folios are placed in the second youngest generation, which is still too optimistic given the presence of readahead. In contrast, the classic active/inactive scheme is more conservative. This patch switches to using folio_mark_accessed() and begins prefaulted file folios from the second oldest generation instead of active generations. We should also adjust the following accordingly: - WORKINGSET_ACTIVATE: aligned with setting active for refaulted workingset folios; - lru_gen_folio_seq(): place (pre)faulted file folios into the second oldest generation; - promote second-scanned folios to workingset in folio_check_references(): we now have to depend on folio_lru_refs() > 1, since we previously relied on PG_referenced being set during the first scan, but PG_referenced is now set earlier. On x86, running a kernel build inside a memcg with a 1GB memory limit using 20 threads. w/o patch: real 1m50.764s user 25m32.305s sys 4m0.012s pswpin: 1333245 pswpout: 4366443 pgpgin: 6962592 pgpgout: 17780712 swpout_zero: 1019603 swpin_zero: 14764 refault_file: 287794 refault_anon: 1347963 w/ patch: real 1m48.879s user 25m29.224s sys 3m37.421s pswpin: 568480 pswpout: 2322657 pgpgin: 4073416 pgpgout: 9613408 swpout_zero: 593275 swpin_zero: 9118 refault_file: 262505 refault_anon: 577550 active/inactive LRU: real 1m49.928s user 25m28.196s sys 3m40.740s pswpin: 463452 pswpout: 2309119 pgpgin: 4438856 pgpgout: 9568628 swpout_zero: 743704 swpin_zero: 7244 refault_file: 562555 refault_anon: 470694 Lance and Xueyuan made a huge contribution to this patch through testing. Link: https://lore.kernel.org/20260526130938.66253-1-baohua@kernel.org Link: https://lore.kernel.org/linux-mm/20250916072226.220426-1-liulei.rjpt@vivo.com/ [1] Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org> Tested-by: Lance Yang <lance.yang@linux.dev> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Kairui Song <kasong@tencent.com> Cc: Qi Zheng <qi.zheng@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: wangzicheng <wangzicheng@honor.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Lei Liu <liulei.rjpt@vivo.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Cc: Kalesh Singh <kaleshsingh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/damon: fix missing parens in macro argumentsMaksym Shcherba
Patch series "mm/damon: fix macro arguments and clarify quota goals doc", v2. This patch (of 2): The DAMON iterator macros do not wrap their pointer arguments with parentheses. This can cause build failures when the argument is a complex expression due to operator precedence issues. Add missing parentheses around the arguments in the following macros to prevent potential build failures: - damon_for_each_region() - damon_for_each_region_from() - damon_for_each_region_safe() - damos_for_each_quota_goal() Link: https://lore.kernel.org/20260521202020.126500-1-maksym.shcherba@lnu.edu.ua Link: https://lore.kernel.org/20260521202020.126500-2-maksym.shcherba@lnu.edu.ua Signed-off-by: Maksym Shcherba <maksym.shcherba@lnu.edu.ua> Reviewed-by: SeongJae Park <sj@kernel.org> Assisted-by: Antigravity:Gemini-3.1-Pro Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/damon/core: hide damon_destroy_region()SeongJae Park
damon_destroy_region() is being used by only DAMON core, but exposed to DAMON API callers. Exposing something that is not really being used by others will only increase the maintenance cost. Hide it. Link: https://lore.kernel.org/20260522154026.80546-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/damon/core: hide damon_insert_region()SeongJae Park
damon_insert_region() is being used by only DAMON core, but exposed to DAMON API callers. Exposing something that is not really being used by others will only increase the maintenance cost. Hide it. Link: https://lore.kernel.org/20260522154026.80546-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/damon/core: hide damon_add_region()SeongJae Park
damon_add_region() is being used by only DAMON core, but exposed to DAMON API callers. Exposing something that is not really being used by others will only increase the maintenance cost. Hide it. Link: https://lore.kernel.org/20260522154026.80546-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/vma: eliminate mmap_action->error_hook, introduce error_overrideLorenzo Stoakes
Rather than providing a hook, simplify things by providing the ability to override mmap action errors. This allows us to more carefully validate the value provided and thus ensure only a valid error code is specified, and simplifies the interface. This way, we eliminate all hooks but mmap_prepare and allow only mmap actions to be specified (which core mm controls). This significantly improves robustness and eliminates any unnecessary code duplication in driver mmap hooks. We also update the /dev/mem logic (the only user) to use mmap_action->error_override instead. Link: https://lore.kernel.org/55d13f7d016b827c459946d46a56105635be111c.1780397980.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04mm/vma: remove mmap_action->success_hookLorenzo Stoakes
This hook was introduced to work around code that seemed to absolutely require access to a VMA pointer upon mmap(). However, providing this hook leaves a backdoor to drivers getting access to the very thing mmap_prepare eliminates - a pointer to the VMA. Let's solve this contradiction by removing it. The key intended user was hugetlb, however it seems that the best course now is to avoid allowing all drivers the ability to work around mmap_prepare, and find a different solution there. Link: https://lore.kernel.org/f79434e6d30af6d92999be6b76e197f1847105fa.1780397980.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04drivers/char/mem: eliminate unnecessary use of success_hookLorenzo Stoakes
Patch series "remove mmap_action success, error hooks", v3. The mmap_action->success_hook was a strange beast added to enable code which appeared to absolutely require access to a VMA pointer to work correctly. Primarily this was for hugetlb, however a different approach will be taken there, as clearly more work is required to figure out a sensible way of converting hugetlb to use mmap_prepare. The other user was the memory char driver, specifically /dev/zero which has the unusual property of explicitly setting file-backed VMAs anonymous. Providing the success hook was always foolish, as it allowed drivers a way to workaround the restriction that they should not access a pointer to a not-yet-correctly-initialised VMA - which defeats the purpose of the mmap_prepare work. We can achieve the same thing in memory char driver without needing the success hook, so this series removes that, then removes the success hook altogether. The error hook is also unnecessary - the motivation for this was for functions which need to override the error code when performing an mmap action in order to avoid breaking userspace. We can achieve this by just providing a field for the error code. Doing this means we don't have to worry about the hook doing anything odd. We also add a check to ensure the error code is in fact valid. Again the memory char driver is the only current user of this, so this series updates it to use that. After this change mmap_action has no custom hooks at all, which seems rather more cromulent than before. This patch (of 3): /dev/zero, uniquely, marks memory mapped there as anonymous. This is currently achieved using the mmap_action->success_hook. However this hook circumvents the abstraction of VMA initialisation so it's preferable to do things a different way. To achieve this, this patch firstly defaults the VMA descriptor's vm_ops field to the dummy VMA operations, which is what file-backed VMAs default this field to. That way, we can detect whether a driver sets this field to NULL in order to mark it anonymous. We then introduce vma_desc_set_anonymous() to do this explicitly, and invoke it in mmap_zero_prepare(). This way, any driver which does not explicitly set desc->vm_ops, retains the dummy vm_ops as they would previously. We also update set_vma_user_defined_fields() to make clear that we are either setting vma->vm_ops to what is provided by the driver (or defaulting to dummy_vm_ops if not set), or setting the VMA anonymous. This lays the groundwork for removing the success hook. Link: https://lore.kernel.org/cover.1780397980.git.ljs@kernel.org Link: https://lore.kernel.org/010579cca6787cf7bb057ab1f7228978b10601c8.1780397980.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-04Merge tag 'net-7.1-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from Netfilter, wireless and Bluetooth. Current release - fix to a fix: - Bluetooth: MGMT: fix backward compatibility with bluetoothd which adds stray bytes to MGMT_OP_ADD_EXT_ADV_DATA Previous releases - regressions: - af_unix: fix inq_len update inaccuracy on partial read - eth: fec: fix pinctrl default state restore order on resume - wifi: iwlwifi: - mvm: don't support the reset handshake for old firmwares - pcie: simplify the resume flow if fast resume is not used, work around NIC access failures Previous releases - always broken: - Bluetooth: L2CAP: reject BR/EDR signaling packets over MTUsig - sctp: fix a couple of bugs in COOKIE_ECHO processing - sched: fix pedit partial COW leading to page cache corruption - wifi: nl80211: reject oversized EMA RNR lists - netfilter: - conntrack_irc: fix possible out-of-bounds read - bridge: make ebt_snat ARP rewrite writable - appletalk: zero-initialize aarp_entry to prevent heap info leak - ipv4: restrict IPOPT_SSRR and IPOPT_LSRR options - mptcp: fix number of bugs reported by AI scans and discovered during NVMe over MPTCP testing" * tag 'net-7.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (85 commits) Reapply "bnxt_en: bring back rtnl_lock() in the bnxt_open() path" udp: clear skb->dev before running a sockmap verdict sctp: purge outqueue on stale COOKIE-ECHO handling bonding: annotate data-races arcound churn variables net/802/mrp: fix vector attribute parsing in mrp_pdu_parse_vecattr rtase: Avoid sleeping in get_stats64() ieee802154: 6lowpan: only accept IPv6 packets in lowpan_xmit() ipv6: mcast: Fix use-after-free when processing MLD queries selftests: net: add vxlan vnifilter notification test vxlan: vnifilter: fix spurious notification on VNI update vxlan: vnifilter: send notification on VNI add rtase: Reset TX subqueue when clearing TX ring octeontx2-af: npc: Fix CPT channel mask in npc_install_flow dt-bindings: ethernet: eswin: fix hsp-sp-csr backward compatibility sctp: validate cached peer INIT chunk length in COOKIE_ECHO processing net/sched: fix pedit partial COW leading to page cache corruption vsock/vmci: fix sk_ack_backlog leak on failed handshake net: bonding: fix NULL pointer dereference in bond_do_ioctl() geneve: fix length used in GRO hint UDP checksum adjustment net: ethernet: mtk_eth_soc: Fix use-after-free in metadata dst teardown ...
2026-06-04net: ethtool: add netif_get_link_ksettings() for correct ops-locked useJakub Kicinski
__ethtool_get_link_ksettings() is exported and called from sysfs and many drivers. It invokes ethtool_ops->get_link_ksettings so by our own docs it should be holding netdev lock for ops locked devices. Looks like commit 2bcf4772e45a ("net: ethtool: try to protect all callback with netdev instance lock") missed adding the ops lock here. There's a number of callers we need to fix up so let's add the netif_get_link_ksettings() helper first, without any actual locking changes (this commit is a nop). Not treating this as a fix because I don't think any driver cares at this point, but if we want to remove the rtnl_lock protection this will become critical. Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260603012840.2254293-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-04net: ethtool: cmis_cdb: hold instance lock for ops locked devicesJakub Kicinski
FW module flashing was written so that the flashing happens without holding rtnl_lock. This allows flashing multiple modules at once. Current drivers can handle that well, but we should let drivers depend on the netdev instance lock. Instance lock is per netdev, and so is the module so we won't break parallel updates. Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260603012840.2254293-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-04net: rename netdev_ops_assert_locked()Jakub Kicinski
Jakub suggests renaming the existing assert to match the netdev_lock_ops_compat() semantics. We want netdev_assert_locked_ops() to mean - if the driver is ops locked - check that it's holding the device lock. The existing helper check for either ops lock or rtnl_lock, which is the locking behavior of netdev_lock_ops_compat(). The reason for naming divergence is likely that netdev_ops_assert_locked() predated the _compat() helpers. Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260603012840.2254293-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-04Merge tag 'trace-v7.1-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fix from Steven Rostedt: - Fix CFI violation in probestub function The probestub is a function to allow tprobes to hook to a tracepoint to gain access to its parameters. The function itself is only referenced by the tracepoint structure which lives in the __tracepoint section. objtool explicitly ignores that section and when processing functions in the kernel, if it detects one that has no references it will seal it to have its ENDBR stripped on boot up. This means the probstub function will have its ENDBR stripped and if a tprobe is attached to it with IBT enabled, it will go *boom*. * tag 'trace-v7.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix CFI violation in probestub being called by tprobes
2026-06-04vdso/datastore: Always provide symbol declarationsThomas Weißschuh
Allow callers to easily reference these symbols in code that is built even when the generic datastore is disabled. As there are no good default no-op variants of these symbols, do not provide stubs but require users to have their own fallback handling using IS_ENABLED(CONFIG_HAVE_GENERIC_VDSO). Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260521-vdso-mips-kconfig-v1-2-2f79dcd6c78f@linutronix.de
2026-06-04net/sched: fix pedit partial COW leading to page cache corruptionRajat Gupta
tcf_pedit_act() computes the COW range for skb_ensure_writable() once before the key loop using tcfp_off_max_hint, but the hint does not account for the runtime header offset added by typed keys. This can leave part of the write region un-COW'd. Fix by moving skb_ensure_writable() inside the per-key loop where the actual write offset is known, and add overflow checking on the offset arithmetic. For negative offsets (e.g. Ethernet header edits at ingress), use skb_cow() to COW the headroom instead. Guard offset_valid() against INT_MIN, where negation is undefined. Fixes: 8b796475fd78 ("net/sched: act_pedit: really ensure the skb is writable") Reported-by: Yiming Qian <yimingqian591@gmail.com> Reported-by: Keenan Dong <keenanat2000@gmail.com> Reported-by: Han Guidong <2045gemini@gmail.com> Reported-by: Zhang Cen <rollkingzzc@gmail.com> Reviewed-by: Han Guidong <2045gemini@gmail.com> Tested-by: Han Guidong <2045gemini@gmail.com> Reviewed-by: Davide Caratti <dcaratti@redhat.com> Tested-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Tested-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Rajat Gupta <rajat.gupta@oss.qualcomm.com> Link: https://patch.msgid.link/20260531123221.48732-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-04ASoC: soc-core: remove card->dmi_longnameKuninori Morimoto
Current Card has dmi_longname[80] (when CONFIG_DMI), but no need to have it in Card, we can alloc it. Tidyup it. This is prepare for Card capsuling Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Link: https://patch.msgid.link/87a4tk2weh.wl-kuninori.morimoto.gx@renesas.com Signed-off-by: Mark Brown <broonie@kernel.org>
2026-06-04xfrm: add XFRM_MSG_MIGRATE_STATE for single SA migrationAntony Antony
Add a new netlink method to migrate a single xfrm_state. Unlike the existing migration mechanism (SA + policy), this supports migrating only the SA and allows changing the reqid. The SA is looked up via xfrm_usersa_id, which uniquely identifies it, so old_saddr is not needed. old_daddr is carried in xfrm_usersa_id.daddr. The reqid is invariant in the old migration. Signed-off-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2026-06-04xfrm: make xfrm_dev_state_add xuo parameter constAntony Antony
The xuo pointer is not modified by xfrm_dev_state_add(); make it const. Signed-off-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2026-06-04xfrm: move encap and xuo into struct xfrm_migrateAntony Antony
In preparation for an upcoming patch, move the xfrm_encap_tmpl and xfrm_user_offload pointers from separate parameters into struct xfrm_migrate, reducing the parameter count of xfrm_state_migrate_create(), xfrm_state_migrate_install() and xfrm_state_migrate() The fields are placed after the four xfrm_address_t members where the struct is naturally 8-byte aligned, avoiding padding. No functional change. Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2026-06-04xfrm: add state synchronization after migrationAntony Antony
Add xfrm_migrate_sync() to copy curlft and replay state from the old SA to the new one before installation. The function allocates no memory, so it can be called under a spinlock. In preparation for a subsequent patch in this series. A subsequent patch calls this under x->lock, atomically capturing the latest lifetime counters and replay state from the original SA and deleting it in the same critical section to prevent SN/IV reuse for XFRM_MSG_MIGRATE_STATE method. No functional change. Signed-off-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2026-06-04xfrm: split xfrm_state_migrate into create and install functionsAntony Antony
To prepare for subsequent patches, split xfrm_state_migrate() into two functions: - xfrm_state_migrate_create(): creates the migrated state - xfrm_state_migrate_install(): installs it into the state table splitting will help to avoid SN/IV reuse when migrating AEAD SA. And add const whenever possible. No functional change. Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2026-06-04xfrm: rename reqid in xfrm_migrateAntony Antony
In preparation for a later patch in this series s/reqid/old_reqid/. No functional change. Signed-off-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2026-06-04xfrm: add extack to xfrm_init_stateAntony Antony
Add a struct extack parameter to xfrm_init_state() and pass it through to __xfrm_init_state(). This allows validation errors detected during state initialization to propagate meaningful error messages back to userspace. xfrm_state_migrate() now passes extack so that errors from the XFRM_MSG_MIGRATE_STATE path are properly reported. Callers without an extack context (af_key, ipcomp4, ipcomp6) pass NULL, preserving their existing behaviour. Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2026-06-04timekeeping: Add clocksource read_snapshot() method and hw_cycles to snapshotDavid Woodhouse
Add a read_snapshot() callback to struct clocksource which returns the derived clocksource value while also providing the underlying hardware counter reading and the related clocksource ID. This allows ktime_get_snapshot_id() to populate new hw_cycles and hw_csid fields in struct system_time_snapshot. For clocksources that are derived from an underlying counter (e.g., Hyper-V TSC page scales TSC to 10MHz, kvmclock scales TSC to 1GHz), this provides atomic access to both the derived value needed for timekeeping calculations, and the raw hardware counter needed by consumers like KVM's master clock and the vmclock PTP driver. [ tglx: Reworked it slightly ] Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Assisted-by: Kiro:claude-opus-4.6-1m Link: https://patch.msgid.link/20260526230635.136914-1-dwmw2@infradead.org Link: https://patch.msgid.link/20260529195558.202568489@kernel.org
2026-06-04ptp: Switch to ktime_get_snapshot_id() for pre/post timestampsThomas Gleixner
To prepare for a new PTP IOCTL, which exposes the raw counter value along with the requested system time snapshot, switch the pre/post time stamp sampling over to use ktime_get_snapshot_id() and fix up all usage sites. No functional change intended. The ptp_vmclock conversion was simplified by David Woodhouse. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20260529195558.149589566@kernel.org
2026-06-04timekeeping: Remove system_device_crosststamp::sys_realtimeThomas Gleixner
All users are converted to sys_systime. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195558.046694580@kernel.org
2026-06-04timekeeping: Prepare for cross timestamps on arbitrary clock IDsThomas Gleixner
PTP device system crosstime stamps support only CLOCK_REALTIME, which is meaningless for AUX clocks. The PTP core hands in the clock ID already, so prepare the core code to honor it. - Add a new sys_systime field to struct system_device_crosststamp which aliases the sys_realtime field. Once all users are converted sys_realtime can be removed. - Prepare get_device_system_crosststamp() and the related code for it by switching to sys_systime and providing the initial changes to utilize different time keepers. No functional change intended. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195557.846634842@kernel.org
2026-06-04timekeeping: Remove ktime_get_snapshot()Thomas Gleixner
All users have been converted to ktime_get_snapshot_id(). Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195557.795510496@kernel.org
2026-06-04timekeeping: Add CLOCK ID to system_device_crosststampThomas Gleixner
The normal capture for system/device cross timestamps is CLOCK_REALTIME, but that's meaningless for AUX clocks. Add a clock_id field to struct system_device_crosststamp and initialize it with CLOCK_REALTIME at the two places which prepare for cross timestamps. After the related code has been cleaned up, the core code will honor the clock_id field when calculating the system time from the system counter snapshot. No functional change. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195557.482153523@kernel.org
2026-06-04timekeeping: Add system_counterval_t to struct system_device_crosststampThomas Gleixner
An upcoming extension to the PTP IOCTL requires to return the system counter value and the clocksource ID to user space. get_device_system_crosststamp() has this information already. Extend struct system_device_crosststamp with a system_counterval_t member and fill in the data. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195557.429406675@kernel.org
2026-06-04timekeeping: Remove system_time_snapshot::real/boot/rawThomas Gleixner
All users are converted over to ktime_get_snapshot_id() and system_time_snapshot::systime and ::monoraw. Remove the leftovers. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195557.330029635@kernel.org