linux.git - Linux kernel source tree

Age	Commit message (Collapse)	Author
2026-04-21	Merge tag 'clk-for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux Pull clk updates from Stephen Boyd: "We've finally gotten rid of the struct clk_ops::round_rate() code after months of effort from Brian Masney. Now the only option is to use determine_rate(), which is good because that takes a struct argument instead of just a couple unsigned longs, allowing us to easily modify the way we determine and set rates in the clk tree. Beyond that core framework change we've got the typical pile of new SoC clk driver additions, fixes for clk data and/or adding missing clks because the consumer driver using those clks wasn't ready, etc. The usual suspects are all here: Qualcomm, Samsung, Mediatek, and Rockchip along with some newcomers making RISC-V SoCs like ESWIN's eic700 and Tenstorrent's Atlantis. The clk driver side of this looks pretty normal. Core: - Remove the round_rate() clk op (yay!) New Drivers: - ESWIN eic700 SoC clk support - Econet EN751221 SoC clock/reset support - Global TCSR, RPMh, and display clock controller support for the Qualcomm Eliza platform - TCSR, the multiple global, and the RPMh clock controller support for the Qualcomm Nord platform - GPU clock controller support for Qualcomm SM8750 - Video and GPU clock controller support for Qualcomm Glymur - Global clock controller support for Qualcomm IPQ5210 - Axis ARTPEC-9: Add new PLL clocks and new drivers for eight clock controllers on the SoC - ExynosAutov920: Add G3D (GPU) clock controller - Clock driver for the Rockchip RV1103B SoC - Initial support for the Renesas RZ/G3L (R9A08G046) SoC - Clock and reset controllers (e.g. PRCM) in the Tenstorrent Atlantis SoC" * tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: (132 commits) clk: visconti: pll: initialize clk_init_data to zero clk: fsl-sai: Add MCLK generation support clk: fsl-sai: Extract clock setup into fsl_sai_clk_register() dt-bindings: clock: fsl-sai: Document clock-cells = <1> support clk: fsl-sai: Add i.MX8M support with 8 byte register offset clk: fsl-sai: Sort the headers dt-bindings: clock: fsl-sai: Document i.MX8M support clk: qcom: gcc: Add multiple global clock controller driver for Nord SoC clk: qcom: rpmh: Add support for Nord rpmh clocks clk: qcom: Add TCSR clock driver for Nord SoC dt-bindings: clock: qcom: Add Nord Global Clock Controller dt-bindings: clock: qcom-rpmhcc: Add support for Nord SoCs dt-bindings: clock: qcom: Document the Nord SoC TCSR Clock Controller clk: qcom: gcc-x1e80100: Keep GCC USB QTB clock always ON clk: qcom: Constify list of critical CBCR registers clk: qcom: Constify qcom_cc_driver_data clk: qcom: videocc-glymur: Constify qcom_cc_desc clk: qcom: Add a driver for SM8750 GPU clocks dt-bindings: clock: qcom: Add SM8750 GPU clocks clk: qcom: ipq-cmn-pll: Add IPQ8074 SoC support ...
2026-04-21	spi: Fix the error description in the `ptp_sts_word_post` comment	Dewei Meng
	Based on the comment information, the description within the `ptp_sts_word_post` section should be changed to "See @ptp_sts_word_pre". Signed-off-by: Dewei Meng <mengdewei@cqsoftware.com.cn> Link: https://patch.msgid.link/20260421025808.6572-1-mengdewei@cqsoftware.com.cn Signed-off-by: Mark Brown <broonie@kernel.org>
2026-04-21	Merge tag 'pull-dcache-busy-wait' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull dcache busy loop updates from Al Viro: "Fix livelocks in shrink_dcache_tree() If shrink_dcache_tree() finds a dentry in the middle of being killed by another thread, it has to wait until the victim finishes dying, gets detached from the tree and ceases to pin its parent. The way we used to deal with that amounted to busy-wait; unfortunately, it's not just inefficient but can lead to reliably reproducible hard livelocks. Solved by having shrink_dentry_tree() attach a completion to such dentry, with dentry_unlist() calling complete() on all objects attached to it. With a bit of care it can be done without growing struct dentry or adding overhead in normal case" * tag 'pull-dcache-busy-wait' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: get rid of busy-waiting in shrink_dcache_tree() dcache.c: more idiomatic "positives are not allowed" sanity checks struct dentry: make ->d_u anonymous for_each_alias(): helper macro for iterating through dentries of given inode
2026-04-21	net: cache snapshot entries for ndo_set_rx_mode_async	Stanislav Fomichev
	Add a per-device netdev_hw_addr_list cache (rx_mode_addr_cache) that allows __hw_addr_list_snapshot() and __hw_addr_list_reconcile() to reuse previously allocated entries instead of hitting GFP_ATOMIC on every snapshot cycle. snapshot pops entries from the cache when available, falling back to __hw_addr_create(). reconcile splices both snapshot lists back into the cache via __hw_addr_splice(). The cache is flushed in free_netdev(). Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-4-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-04-21	net: introduce ndo_set_rx_mode_async and netdev_rx_mode_work	Stanislav Fomichev
	Add ndo_set_rx_mode_async callback that drivers can implement instead of the legacy ndo_set_rx_mode. The legacy callback runs under the netif_addr_lock spinlock with BHs disabled, preventing drivers from sleeping. The async variant runs from a work queue with rtnl_lock and netdev_lock_ops held, in fully sleepable context. When __dev_set_rx_mode() sees ndo_set_rx_mode_async, it schedules netdev_rx_mode_work instead of calling the driver inline. The work function takes two snapshots of each address list (uc/mc) under the addr_lock, then drops the lock and calls the driver with the work copies. After the driver returns, it reconciles the snapshots back to the real lists under the lock. Add netif_rx_mode_sync() to opportunistically execute the pending workqueue update inline, so that rx mode changes are committed before returning to userspace: - dev_change_flags (SIOCSIFFLAGS / RTM_NEWLINK) - dev_set_promiscuity - dev_set_allmulti - dev_ifsioc SIOCADDMULTI / SIOCDELMULTI - do_setlink (RTM_SETLINK) Note that some deep hierarchies still do skip the lower updates via: - dev_uc_sync - dev_mc_sync If we do end up hitting user-visible issues, we can add more calls to netif_rx_mode_sync in specific places. But hopefully we should not, the actual user-visible lists are still synced, it's that just HW state that might be lagging. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-3-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-04-21	net: add address list snapshot and reconciliation infrastructure	Stanislav Fomichev
	Introduce __hw_addr_list_snapshot() and __hw_addr_list_reconcile() for use by the upcoming ndo_set_rx_mode_async callback. The async rx_mode path needs to snapshot the device's unicast and multicast address lists under the addr_lock, hand those snapshots to the driver (which may sleep), and then propagate any sync_cnt changes back to the real lists. Two identical snapshots are taken: a work copy for the driver to pass to __hw_addr_sync_dev() and a reference copy to compute deltas against. __hw_addr_list_reconcile() walks the reference snapshot comparing each entry against the work snapshot to determine what the driver synced or unsynced. It then applies those deltas to the real list, handling concurrent modifications: - If the real entry was concurrently removed but the driver synced it to hardware (delta > 0), re-insert a stale entry so the next work run properly unsyncs it from hardware. - If the entry still exists, apply the delta normally. An entry whose refcount drops to zero is removed. # dev_addr_test_snapshot_benchmark: 1024 addrs x 1000 snapshots: 89872802 ns total, 89872 ns/iter # dev_addr_test_snapshot_benchmark.speed: slow Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-2-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-04-21	rculist: add list_splice_rcu() for private lists	Pablo Neira Ayuso
	This patch adds a helper function, list_splice_rcu(), to safely splice a private (non-RCU-protected) list into an RCU-protected list. The function ensures that only the pointer visible to RCU readers (prev->next) is updated using rcu_assign_pointer(), while the rest of the list manipulations are performed with regular assignments, as the source list is private and not visible to concurrent RCU readers. This is useful for moving elements from a private list into a global RCU-protected list, ensuring safe publication for RCU readers. Subsystems with some sort of batching mechanism from userspace can benefit from this new function. The function __list_splice_rcu() has been added for clarity and to follow the same pattern as in the existing list_splice*() interfaces, where there is a check to ensure that the list to splice is not empty. Note that __list_splice_rcu() has no documentation for this reason. Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-04-20	rhashtable: Bounce deferred worker kick through irq_work	Tejun Heo
	Inserts past 75% load call schedule_work(&ht->run_work) to kick an async resize. If a caller holds a raw spinlock (e.g. an insecure_elasticity user), schedule_work() under that lock records caller_lock -> pool->lock -> pi_lock -> rq->__lock A cycle forms if any of these locks is acquired in the reverse direction elsewhere. sched_ext, the only current insecure_elasticity user, hits this: it holds scx_sched_lock across rhashtable inserts of sub-schedulers, while scx_bypass() takes rq->__lock -> scx_sched_lock. Exercising the resize path produces: Chain exists of: &pool->lock --> &rq->__lock --> scx_sched_lock Bounce the kick from the insert paths through irq_work so schedule_work() runs from hard IRQ context with the caller's lock no longer held. rht_deferred_worker()'s self-rearm on error stays on schedule_work(&ht->run_work) - the worker runs in process context with no caller lock held, and keeping the self-requeue on @run_work lets cancel_work_sync() in rhashtable_free_and_destroy() drain it. v3: Keep rht_deferred_worker()'s self-rearm on schedule_work(&run_work). Routing it through irq_work in v2 broke cancel_work_sync()'s self-requeue handling - an irq_work queued after irq_work_sync() returned but while cancel_work_sync() was still waiting could fire post-teardown. v2: Bounce unconditionally instead of gating on insecure_elasticity, as suggested by Herbert. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
2026-04-20	bpf: Extend BTF UAPI vlen, kinds to use unused bits	Alan Maguire
	BTF maximum vlen is encoded using 16 bits with a maximum vlen of 65535. This has sufficed for structs, function parameters and enumerated type values. However, with upcoming BTF location information - in particular information about inline sites - this limit is surpassed. Use bits 16-23 - currently unused in BTF info - to extend to 24 bits, giving a max vlen of (2^24 - 1), or 16 million. Also extend BTF kind encoding from 5 to 7 bits, giving a maximum available number of kinds of 128. Since with the BTF location work we use another 3 kinds, we are fast approaching the current limit of 32. Convert BTF_MAX_* values to enums to allow them to be encoded in kernel BTF; this will allow us to detect if the running kernel supports a 24-bit vlen or not. Add one for max _possible_ (not used) kind. Fix up a few places in the kernel where a 16-bit vlen is assumed; remove BTF_INFO_MASK as now all bits are used. The vlen expansion was suggested by Andrii in [1]; the kind expansion is tackled here too as it may be needed also to support new kinds in BTF. [1] https://lore.kernel.org/bpf/CAEf4BzZx=X6vGqcA8SPU6D+v6k+TR=ZewebXMuXtpmML058piw@mail.gmail.com/ Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260417143023.1551481-2-alan.maguire@oracle.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-20	Merge tag 'printk-for-7.1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk updates from Petr Mladek: - Fix printk ring buffer initialization and sanity checks - Workaround printf kunit test compilation with gcc < 12.1 - Add IPv6 address printf format tests - Misc code and documentation cleanup * tag 'printk-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: printf: Compile the kunit test with DISABLE_BRANCH_PROFILING DISABLE_BRANCH_PROFILING lib/vsprintf: use bool for local decode variable lib/hexdump: print_hex_dump_bytes() calls print_hex_dump_debug() printk: ringbuffer: fix errors in comments printk_ringbuffer: Add sanity check for 0-size data printk_ringbuffer: Fix get_data() size sanity check printf: add IPv6 address format tests printk: Fix _DESCS_COUNT type for 64-bit systems
2026-04-20	Merge tag 'core-urgent-2026-04-20' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull entry cleanup from Ingo Molnar: "Remove the unused ARCH_SYSCALL_WORK_{ENTER,EXIT} flags" * tag 'core-urgent-2026-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: entry: Kill ARCH_SYSCALL_WORK_{ENTER,EXIT}
2026-04-20	Merge tag 'platform-drivers-x86-v7.1-1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86 Pull x86 platform driver updates from Ilpo Järvinen: "asus-wmi: - Retain battery charge threshold during boot which avoids unsolicited change to 100%. Return -ENODATA when the limit is not yet known - Improve screenpad power/brightness handling consistency - Fix screenpad brightness range barco-p50-gpio: - Normalize gpio_get return values bitland-mifs-wmi: - Add driver for Bitland laptops (supports platform profile, hwmon, kbd backlight, gpu mode, hotkeys, and fan boost) dell_rbu: - Fix using uninitialized value in sysfs write function dell-wmi-sysman: - Respect destination length when constructing enum strings hp-wmi: - Propagate fan setting apply failures and log an error - Fix sysfs write vs work handler cancel_delayed_work_sync() deadlock - Correct keepalive schedule_delayed_work() to mod_delayed_work() - Fix u8 underflows in GPU delta calculation - Use mutex to protect fan pwm/mode - Ignore kbd backlight and FnLock key events that are handled by FW - Fix fan table parsing (use correct field) - Add support for Omen 14-fb0xxx, 16-n0xxx, 16-wf1xxx, and Omen MAX 16-ak0xxxx input: trackpoint & thinkpad_acpi: - Enable doubletap by default and add sysfs enable/disable int3472: - Add support for GPIO type 0x02 (IR flood LED) intel-speed-select: (updated to v1.26) - Avoid using current base frequency as maximum - Fix CPU extended family ID decoding - Fix exit code - Improve error reporting intel/vsec: - Refactor to support ACPI-enumerated PMT endpoints. pcengines-apuv2: - Attach software node to the gpiochip uniwill: - Refactor hwmon to smaller parts to accomodate HW diversity - Support USB-C power/performance priority switch through sysfs - Add another XMG Fusion 15 (L19) DMI vendor - Enable fine-grained features to device lineup mapping wmi: - Perform output size check within WMI core to allow simpler WMI drivers misc: - acpi_driver -> platform driver conversions (a large number of changes from Rafael J. Wysocki) - cleanups / refactoring / improvements" * tag 'platform-drivers-x86-v7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (106 commits) platform/x86: hp-wmi: Add support for Omen 16-wf1xxx (8C77) platform/x86: hp-wmi: Add support for Omen 16-n0xxx (8A44) platform/x86: hp-wmi: Add support for OMEN MAX 16-ak0xxx (8D87) platform/x86: hp-wmi: fix fan table parsing platform/x86: hp-wmi: add Omen 14-fb0xxx (board 8C58) support platform/wmi: Replace .no_notify_data with .min_event_size platform/wmi: Extend wmidev_query_block() to reject undersized data platform/wmi: Extend wmidev_invoke_method() to reject undersized data platform/wmi: Prepare to reject undersized unmarshalling results platform/wmi: Convert drivers to use wmidev_invoke_procedure() platform/wmi: Add wmidev_invoke_procedure() platform/x86: int3472: Add support for GPIO type 0x02 (IR flood LED) platform/x86: int3472: Parameterize LED con_id in registration platform/x86: int3472: Rename pled to led in LED registration code platform/x86: int3472: Use local variable for LED struct access platform/x86: thinkpad_acpi: remove obsolete TODO comment platform/x86: dell-wmi-sysman: bound enumeration string aggregation platform/x86: hp-wmi: Ignore backlight and FnLock events platform/x86: uniwill-laptop: Fix signedness bug platform/x86: dell_rbu: avoid uninit value usage in packet_size_write() ...
2026-04-20	pppoe: drop PFC frames	Qingfang Deng
	RFC 2516 Section 7 states that Protocol Field Compression (PFC) is NOT RECOMMENDED for PPPoE. In practice, pppd does not support negotiating PFC for PPPoE sessions, and the current PPPoE driver assumes an uncompressed (2-byte) protocol field. However, the generic PPP layer function ppp_input() is not aware of the negotiation result, and still accepts PFC frames. If a peer with a broken implementation or an attacker sends a frame with a compressed (1-byte) protocol field, the subsequent PPP payload is shifted by one byte. This causes the network header to be 4-byte misaligned, which may trigger unaligned access exceptions on some architectures. To reduce the attack surface, drop PPPoE PFC frames. Introduce ppp_skb_is_compressed_proto() helper function to be used in both ppp_generic.c and pppoe.c to avoid open-coding. Fixes: 7fb1b8ca8fa1 ("ppp: Move PFC decompression to PPP generic layer") Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260415022456.141758-2-qingfang.deng@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-20	Merge tag 'mfd-next-7.1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd Pull MFD updates from Lee Jones: "Core: - Add a resource-managed version of alloc_workqueue() (`devm_alloc_workqueue()`) - Preserve the Open Firmware (OF) node when an ACPI handle is present Apple SMC: - Wire up the Apple SMC power driver by adding a new MFD cell Atmel HLCDC: - Fetch the LVDS PLL clock as a fallback if the generic sys_clk is unavailable Broadcom BCM2835 PM: - Add support for the BCM2712 power management device - Introduce a hardware type identifier to distinguish SoC variants Congatec CGBC, KEMPLD, RSMU, Si476x: - Fix various kernel-doc warnings and correct struct member names DLN2: - Drop redundant USB device references and switch to managed resource allocations - Update bare 'unsigned' types to 'unsigned int' ENE KB3930: - Use the of_device_is_system_power_controller() wrapper EZX PCAP: - Avoid rescheduling after destroying the workqueue by switching to a device-managed workqueue - Drop redundant memory allocation error messages - Return directly instead of using empty goto statements Freescale i.MX25 TSADC: - Convert devicetree bindings from TXT to YAML format Freescale MC13xxx: - Fix a memory leak in subdevice platform data allocation by using devm_kmemdup() Intel LPC ICH: - Expose a software node for the GPIO controller cell to fix GPIO lookups Intel LPSS: - Add PCI IDs for the Intel Nova Lake-H platform Maxim MAX77620: - Convert devicetree bindings from TXT to YAML format - Document an optional I2C address for the MAX77663 RTC device Maxim MAX77705: - Make the max77705_pm_ops variable static to resolve a sparse warning MediaTek MT6397: - Correct the hardware CIDs for the MT6328, MT6331, and MT6332 PMICs to allow proper driver binding ROHM BD71828: - Enable system wakeup via the power button ROHM BD72720: - Add a new compatible string for the ROHM BD73900 PMIC SpacemiT P1: - Drop the deprecated "vin-supply" property from the devicetree bindings - Add individual regulator supply properties to match actual hardware topology STMicroelectronics STPMIC1: - Attempt system shutdown a second time to handle transient I2C communication failures Viperboard: - Drop redundant USB device references" * tag 'mfd-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (28 commits) mfd: core: Preserve OF node when ACPI handle is present mfd: ene-kb3930: Use of_device_is_system_power_controller() wrapper mfd: intel-lpss: Add Intel Nova Lake-H PCI IDs dt-bindings: mfd: max77620: Document optional RTC address for MAX77663 dt-bindings: mfd: max77620: Convert to DT schema mfd: ezx-pcap: Avoid rescheduling after destroying workqueue mfd: ezx-pcap: Return directly instead of empty gotos mfd: ezx-pcap: Drop memory allocation error message mfd: bcm2835-pm: Add BCM2712 PM device support mfd: bcm2835-pm: Introduce SoC-specific type identifier dt-bindings: mfd: bd72720: Add ROHM BD73900 mfd: si476x: Fix kernel-doc warnings mfd: rsmu: Remove a empty kernel-doc line mfd: kempld: Fix kernel-doc struct member names mfd: congatec: Fix kernel-doc struct member names dt-bindings: mfd: Convert fsl-imx25-tsadc.txt to yaml format mfd: viperboard: Drop redundant device reference mfd: dln2: Switch to managed resources and fix bare unsigned types mfd: macsmc: Wire up Apple SMC power driver mfd: mt6397: Properly fix CID of MT6328, MT6331 and MT6332 ...
2026-04-20	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma	Linus Torvalds
	Pull rdma updates from Jason Gunthorpe: "The usual collection of driver changes, more core infrastructure updates that typical this cycle: - Minor cleanups and kernel-doc fixes in bnxt_re, hns, rdmavt, efa, ocrdma, erdma, rtrs, hfi1, ionic, and pvrdma - New udata validation framework and driver updates - Modernize CQ creation interface in mlx4 and mlx5, manage CQ umem in core - Promote UMEM to a core component, split out DMA block iterator logic - Introduce FRMR pools with aging, statistics, pinned handles, and netlink control and use it in mlx5 - Add PCIe TLP emulation support in mlx5 - Extend umem to work with revocable pinned dmabuf's and use it in irdma - More net namespace improvements for rxe - GEN4 hardware support in irdma - First steps to MW and UC support in mana_ib - Support for CQ umem and doorbells in bnxt_re - Drop opa_vnic driver from hfi1 Fixes: - IB/core zero dmac neighbor resolution race - GID table memory free - rxe pad/ICRC validation and r_key async errors - mlx4 external umem for CQ - umem DMA attributes on unmap - mana_ib RX steering on RSS QP destroy" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (116 commits) RDMA/core: Fix user CQ creation for drivers without create_cq RDMA/ionic: bound node_desc sysfs read with %.64s IB/core: Fix zero dmac race in neighbor resolution RDMA/mana_ib: Support memory windows RDMA/rxe: Validate pad and ICRC before payload_size() in rxe_rcv RDMA/core: Prefer NLA_NUL_STRING RDMA/core: Fix memory free for GID table RDMA/hns: Remove the duplicate calls to ib_copy_validate_udata_in() RDMA: Remove redundant = {} for udata req structs RDMA/irdma: Add missing comp_mask check in alloc_ucontext RDMA/hns: Add missing comp_mask check in create_qp RDMA/mlx5: Pull comp_mask validation into ib_copy_validate_udata_in_cm() RDMA: Use ib_copy_validate_udata_in_cm() for zero comp_mask RDMA/hns: Use ib_copy_validate_udata_in() RDMA/mlx4: Use ib_copy_validate_udata_in() for QP RDMA/mlx4: Use ib_copy_validate_udata_in() RDMA/mlx5: Use ib_copy_validate_udata_in() for MW RDMA/mlx5: Use ib_copy_validate_udata_in() for SRQ RDMA/pvrdma: Use ib_copy_validate_udata_in() for srq RDMA: Use ib_copy_validate_udata_in() for implicit full structs ...
2026-04-20	Merge tag 'nfsd-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux	Linus Torvalds
	Pull nfsd updates from Chuck Lever: - filehandle signing to defend against filehandle-guessing attacks (Benjamin Coddington) The server now appends a SipHash-2-4 MAC to each filehandle when the new "sign_fh" export option is enabled. NFSD then verifies filehandles received from clients against the expected MAC; mismatches return NFS error STALE - convert the entire NLMv4 server-side XDR layer from hand-written C to xdrgen-generated code, spanning roughly thirty patches (Chuck Lever) XDR functions are generally boilerplate code and are easy to get wrong. The goals of this conversion are improved memory safety, lower maintenance burden, and groundwork for eventual Rust code generation for these functions. - improve pNFS block/SCSI layout robustness with two related changes (Dai Ngo) SCSI persistent reservation fencing is now tracked per client and per device via an xarray, to avoid both redundant preempt operations on devices already fenced and a potential NFSD deadlock when all nfsd threads are waiting for a layout return. - scalability and infrastructure improvements Sincere thanks to all contributors, reviewers, testers, and bug reporters who participated in the v7.1 NFSD development cycle. * tag 'nfsd-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (83 commits) NFSD: Docs: clean up pnfs server timeout docs nfsd: fix comment typo in nfsxdr nfsd: fix comment typo in nfs3xdr NFSD: convert callback RPC program to per-net namespace NFSD: use per-operation statidx for callback procedures svcrdma: Use contiguous pages for RDMA Read sink buffers SUNRPC: Add svc_rqst_page_release() helper SUNRPC: xdr.h: fix all kernel-doc warnings svcrdma: Factor out WR chain linking into helper svcrdma: Add Write chunk WRs to the RPC's Send WR chain svcrdma: Clean up use of rdma->sc_pd->device svcrdma: Clean up use of rdma->sc_pd->device in Receive paths svcrdma: Add fair queuing for Send Queue access SUNRPC: Optimize rq_respages allocation in svc_alloc_arg SUNRPC: Track consumed rq_pages entries svcrdma: preserve rq_next_page in svc_rdma_save_io_pages SUNRPC: Handle NULL entries in svc_rqst_release_pages SUNRPC: Allocate a separate Reply page array SUNRPC: Tighten bounds checking in svc_rqst_replace_page NFSD: Sign filehandles ...
2026-04-20	sched_ext: add p->scx.tid and SCX_OPS_TID_TO_TASK lookup	Tejun Heo
	BPF schedulers that can't hold task_struct pointers (arena-backed ones in particular) key tasks by pid. During exit, pid is released before the task finishes passing through scheduler callbacks, so a dying task becomes invisible to the BPF side mid-schedule. scx_qmap hits this: an exiting task's dispatch callback can't recover its queue entry, stalling dispatch until SCX_EXIT_ERROR_STALL. Add a unique non-zero u64 p->scx.tid assigned at fork that survives the full task lifetime including exit. scx_bpf_tid_to_task() looks up the task; unlike bpf_task_from_pid(), it handles exiting tasks. The lookup costs an rhashtable insert/remove under scx_tasks_lock, so root schedulers opt in via SCX_OPS_TID_TO_TASK. Sub-schedulers that set the flag to declare a dependency are rejected at attach if root didn't opt in. scx_qmap converted: keys tasks by tid and enables SCX_OPS_ENQ_EXITING. Pre-patch it stalls within seconds under a non-leader-exec workload; with the patch it runs cleanly. v3: Warn on rhashtable_lookup_insert_fast() failure via new scx_tid_hash_insert() helper (Cheng-Yang Chou). v2: Guard scx_root deref in scx_bpf_tid_to_task() error path. The kfunc is registered via scx_kfunc_set_any and reachable from tracing and syscall programs when no scheduler is attached (Cheng-Yang Chou). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-20	Merge branch 'for-7.1-printf-kunit-build' into for-linus	Petr Mladek

2026-04-20	spi: fix explicit controller deregistration	Mark Brown
	Johan Hovold <johan@kernel.org> says: Turns out we have a few drivers that get the tear down ordering wrong also when not using device managed registration (cf. [1] and [2]). Fix this to avoid issues like system errors due to unclocked accesses, NULL-pointer dereferences, hangs or failed I/O during during deregistration (e.g. when powering down devices). Johan [1] https://lore.kernel.org/lkml/20260409120419.388546-2-johan@kernel.org/ [2] https://lore.kernel.org/lkml/20260410081757.503099-1-johan@kernel.org/
2026-04-19	Merge tag 'driver-core-7.1-rc1-2' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core Pull driver core fixes from Danilo Krummrich: - Prevent a device from being probed before device_add() has finished initializing it; gate probe with a "ready_to_probe" device flag to avoid races with concurrent driver_register() calls - Fix a kernel-doc warning for DEV_FLAG_COUNT introduced by the above - Return -ENOTCONN from software_node_get_reference_args() when a referenced software node is known but not yet registered, allowing callers to defer probe - In sysfs_group_attrs_change_owner(), also check is_visible_const(); missed when the const variant was introduced * tag 'driver-core-7.1-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: driver core: Add kernel-doc for DEV_FLAG_COUNT enum value sysfs: attribute_group: Respect is_visible_const() when changing owner software node: return -ENOTCONN when referenced swnode is not registered yet driver core: Don't let a device probe until it's ready
2026-04-19	Merge tag 'usb-7.1-rc1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB / Thunderbolt updates from Greg KH: "Here is the big set of USB and Thunderbolt changes for 7.1-rc1. Lots of little things in here, nothing major, just constant improvements, updates, and new features. Highlights are: - new USB power supply driver support. These changes did touch outside of drivers/usb/ but got acks from the relevant mantainers for them. - dts file updates and conversions - string function conversions into "safer" ones - new device quirks - xhci driver updates - usb gadget driver minor fixes - typec driver additions and updates - small number of thunderbolt driver changes - dwc3 driver updates and additions of new hardware support - other minor driver updates All of these have been in the linux-next tree for a while with no reported issues" * tag 'usb-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (176 commits) usb: dwc3: starfive: Add JHB100 USB 2.0 DRD controller dt-bindings: usb: dwc3: add support for StarFive JHB100 dt-bindings: usb: atmel,at91sam9rl-udc: convert to DT schema dt-bindings: usb: atmel,at91rm9200-udc: convert to DT schema dt-bindings: usb: generic-ehci: fix schema structure and add at91sam9g45 constraints dt-bindings: usb: generic-ohci: add AT91RM9200 OHCI binding support arm: dts: at91: remove unused #address-cells/#size-cells from sam9x60 udc node drivers/usb/host: Fix spelling error 'seperate' -> 'separate' usbip: tools: add hint when no exported devices are found USB: serial: iuu_phoenix: fix iuutool author name usb: gadget: f_ncm: validate minimum block_len in ncm_unwrap_ntb() usb: gadget: f_phonet: fix skb frags[] overflow in pn_rx_complete() usb: gadget: f_hid: Add missing error code usb: typec: cros_ec_ucsi: Load driver from OF and ACPI definitions dt-bindings: chrome: Add cros-ec-ucsi compatibility to typec binding USB: of: Simplify with scoped for each OF child loop usbip: validate number_of_packets in usbip_pack_ret_submit() usb: gadget: renesas_usb3: validate endpoint index in standard request handlers usb: core: config: reverse the size check of the SSP isoc endpoint descriptor usb: typec: ucsi: Set usb mode on partner change ...
2026-04-19	rhashtable: Restore insecure_elasticity toggle	Herbert Xu
	Some users of rhashtable cannot handle insertion failures, and are happy to accept the consequences of a hash table that having very long chains. Restore the insecure_elasticity toggle for these users. In addition to disabling the chain length checks, this also removes the emergency resize that would otherwise occur when the hash table occupancy hits 100% (an async resize is still scheduled at 75%). Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-19	Merge tag 'tty-7.1-rc1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial updates from Greg KH: "Here is the set of tty and serial driver changes for 7.1-rc1. Not much here this cycle, biggest thing is the removal of an old driver that never got any actual hardware support (esp32), and the second try to moving the tty ports to their own workqueues (first try was in 7.0-rc1 but was reverted due to problems) Otherwise it's just a small set of driver updates and some vt modifier key enhancements. All have been in linux-next for a while with no reported issues" * tag 'tty-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (35 commits) tty: serial: ip22zilog: Fix section mispatch warning hvc/xen: Check console connection flag serial: sh-sci: Add support for RZ/G3L RSCI dt-bindings: serial: renesas,rsci: Document RZ/G3L SoC tty: atmel_serial: update outdated reference to atmel_tasklet_func() serial: xilinx_uartps: Drop unused include serial: qcom-geni: drop stray newline format specifier serial: 8250: loongson: Enable building on MIPS Loongson64 dt-bindings: serial: 8250: Add Loongson 3A4000 uart compatible serial: 8250_fintek: Add support for F81214E tty: tty_port: add workqueue to flip TTY buffer vt: support ITU-T T.416 color subparameters serial: qcom-geni: Fix RTS behavior with flow control tty: serial: imx: keep dma request disabled before dma transfer setup tty: serial: 8250: Add SystemBase Multi I/O cards serial: pic32_uart: allow driver to be compiled on all architectures with COMPILE_TEST serial: tegra: remove Kconfig dependency on APB DMA controller dt-bindings: serial: amlogic,meson-uart: Add compatible string for A9 dt-bindings: serial: atmel,at91-usart: add microchip,lan9691-usart serial: auart: check clk_enable() return in console write ...
2026-04-19	Merge tag 'mm-stable-2026-04-18-02-14' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - "Eliminate Dying Memory Cgroup" (Qi Zheng and Muchun Song) Address the longstanding "dying memcg problem". A situation wherein a no-longer-used memory control group will hang around for an extended period pointlessly consuming memory - "fix unexpected type conversions and potential overflows" (Qi Zheng) Fix a couple of potential 32-bit/64-bit issues which were identified during review of the "Eliminate Dying Memory Cgroup" series - "kho: history: track previous kernel version and kexec boot count" (Breno Leitao) Use Kexec Handover (KHO) to pass the previous kernel's version string and the number of kexec reboots since the last cold boot to the next kernel, and print it at boot time - "liveupdate: prevent double preservation" (Pasha Tatashin) Teach LUO to avoid managing the same file across different active sessions - "liveupdate: Fix module unloading and unregister API" (Pasha Tatashin) Address an issue with how LUO handles module reference counting and unregistration during module unloading - "zswap pool per-CPU acomp_ctx simplifications" (Kanchana Sridhar) Simplify and clean up the zswap crypto compression handling and improve the lifecycle management of zswap pool's per-CPU acomp_ctx resources - "mm/damon/core: fix damon_call()/damos_walk() vs kdmond exit race" (SeongJae Park) Address unlikely but possible leaks and deadlocks in damon_call() and damon_walk() - "mm/damon/core: validate damos_quota_goal->nid" (SeongJae Park) Fix a couple of root-only wild pointer dereferences - "Docs/admin-guide/mm/damon: warn commit_inputs vs other params race" (SeongJae Park) Update the DAMON documentation to warn operators about potential races which can occur if the commit_inputs parameter is altered at the wrong time - "Minor hmm_test fixes and cleanups" (Alistair Popple) Bugfixes and a cleanup for the HMM kernel selftests - "Modify memfd_luo code" (Chenghao Duan) Cleanups, simplifications and speedups to the memfd_lou code - "mm, kvm: allow uffd support in guest_memfd" (Mike Rapoport) Support for userfaultfd in guest_memfd - "selftests/mm: skip several tests when thp is not available" (Chunyu Hu) Fix several issues in the selftests code which were causing breakage when the tests were run on CONFIG_THP=n kernels - "mm/mprotect: micro-optimization work" (Pedro Falcato) A couple of nice speedups for mprotect() - "MAINTAINERS: update KHO and LIVE UPDATE entries" (Pratyush Yadav) Document upcoming changes in the maintenance of KHO, LUO, memfd_luo, kexec, crash, kdump and probably other kexec-based things - they are being moved out of mm.git and into a new git tree * tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (121 commits) MAINTAINERS: add page cache reviewer mm/vmscan: avoid false-positive -Wuninitialized warning MAINTAINERS: update Dave's kdump reviewer email address MAINTAINERS: drop include/linux/liveupdate from LIVE UPDATE MAINTAINERS: drop include/linux/kho/abi/ from KHO MAINTAINERS: update KHO and LIVE UPDATE maintainers MAINTAINERS: update kexec/kdump maintainers entries mm/migrate_device: remove dead migration entry check in migrate_vma_collect_huge_pmd() selftests: mm: skip charge_reserved_hugetlb without killall userfaultfd: allow registration of ranges below mmap_min_addr mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update mm/hugetlb: fix early boot crash on parameters without '=' separator zram: reject unrecognized type= values in recompress_store() docs: proc: document ProtectionKey in smaps mm/mprotect: special-case small folios when applying permissions mm/mprotect: move softleaf code out of the main function mm: remove '!root_reclaim' checking in should_abort_scan() mm/sparse: fix comment for section map alignment mm/page_io: use sio->len for PSWPIN accounting in sio_read_complete() selftests/mm: transhuge_stress: skip the test when thp not available ...
2026-04-19	bpf: Refactor dynptr mutability tracking	Amery Hung
	Redefine dynptr mutability and fix inconsistency in the verifier and kfunc signatures. Dynptr mutability is at two levels. The first is the bpf_dynptr structure and the second is the memory the dynptr points to. The verifer currently tracks the mutability of the bpf_dynptr struct through helper and kfunc prototypes, where "const struct bpf_dynptr *" means the structure itself is immutable. The second level is tracked in upper bit of bpf_dynptr->size in runtime and is not changed in this patch. There are two type of inconsistency in the verfier regarding the mutability of the bpf_dynptr struct. First, there are many existing kfuncs whose prototypes are wrong. For example, bpf_dynptr_adjust() mutates a dynptr's start and offset but marks the argument as a const pointer. At the same time many other kfuncs that does not mutate the dynptr but mark themselves as mutable. Second, the verifier currently does not honor the const qualifier in kfunc prototypes as it determines whether tagging the arg_type with MEM_RDONLY or not based on the register state. Since all the verifier care is to prevent CONST_PTR_TO_DYNPTR from being destroyed in callback and global subprogram, redefine the mutability at the bpf_dynptr level to just bpf_dynptr_kern->data. Then, explicitly prohibit passing CONST_PTR_TO_DYNPTR to an argument tagged with MEM_UNINIT or OBJ_RELEASE. The mutability of a dynptr's view is not really interesting so drop MEM_RDONLY annotation for dynptr from the helpers and kfuncs. Plus, if the mutability of the entire bpf_dynptr were to be done correctly, it would kill the bpf_dynptr_adjust() usage in callback and global subporgram. Implementation wise - First, make sure all kfunc arg are correctly tagged: Tag the dynptr argument of bpf_dynptr_file_discard() with OBJ_RELEASE. - Then, in process_dynptr_func(), make sure CONST_PTR_TO_DYNPTR cannot be passed to argument tagged with MEM_UNINIT or OBJ_RELEASE. For MEM_UNINIT, it is already checked by is_dynptr_reg_valid_uninit(). For OBJ_RELEASE, check against OBJ_RELEASE instead of MEM_RDONLY and drop a now identical check in unmark_stack_slots_dynptr(). - Remove the mutual exclusive check between MEM_UNINIT and MEM_RDONLY, but don't add a MEM_UNINIT and OBJ_RELEASE version as it is obviously wrong. Note that while this patch stops following the C semantic for the mutability of bpf_dynptr, the prototype of kfuncs are still fixed to maintain the correct C semantics in the implementation. Adding or removing the const qualifier does not break backward compatibility. In addition, fix kfuncs dropping the const qualifier when casting the opaque bpf_dynptr to bpf_dynptr_kern. In test_kfunc_dynptr_param.c, initialize dynptr to 0 to avoid -Wuninitialized-const-pointer warning. Signed-off-by: Amery Hung <ameryhung@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/bpf/20260414191014.1218567-1-ameryhung@gmail.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
2026-04-18	Merge tag 'pinctrl-v7.1-1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl Pull pin control updates from Linus Walleij: "Core changes: - Perform basic checks on pin config properties so as not to allow directly contradictory settings such as setting a pin to more than one bias or drive mode - Handle input-threshold-voltage-microvolt property - Introduce pinctrl_gpio_get_config() handling in the core for SCMI GPIO using pin control New drivers: - GPIO-by-pin control driver (also appearing in the GPIO pull request) fulfilling a promise on a comment from Grant Likely many years ago: "can't GPIO just be a front-end for pin control?" it turns out it can, if and only if you design something new from scratch, such as SCMI - Broadcom BCM7038 as a pinctrl-single delegate - Mobileye EyeQ6Lplus OLB pin controller - Qualcomm Eliza and Hawi families TLMM pin controllers - Qualcomm SDM670 and Milos family LPASS LPI pin controllers - Qualcomm IPQ5210 pin controller - Realtek RTD1625 pin controller support - Rockchip RV1103B pin controller support - Texas Instruments AM62L as a pinctrl-single delegate Improvements: - Set config implementation for the Spacemit K1 pin controller" * tag 'pinctrl-v7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (84 commits) pinctrl: qcom: Add Hawi pinctrl driver dt-bindings: pinctrl: qcom: Describe Hawi TLMM block dt-bindings: pinctrl: pinctrl-max77620: convert to DT schema pinctrl: single: Add bcm7038-padconf compatible matching dt-bindings: pinctrl: pinctrl-single: Add brcm,bcm7038-padconf dt-bindings: pinctrl: apple,pinctrl: Add t8122 compatible pinctrl: qcom: sdm670-lpass-lpi: label variables as static pinctrl: sophgo: pinctrl-sg2044: Fix wrong module description pinctrl: sophgo: pinctrl-sg2042: Fix wrong module description pinctrl: qcom: add sdm670 lpi tlmm dt-bindings: pinctrl: qcom: Add SDM670 LPASS LPI pinctrl dt-bindings: qcom: lpass-lpi-common: add reserved GPIOs property pinctrl: qcom: Introduce IPQ5210 TLMM driver dt-bindings: pinctrl: qcom: add IPQ5210 pinctrl pinctrl: qcom: Drop redundant intr_target_reg on modern SoCs pinctrl: qcom: eliza: Fix interrupt target bit pinctrl: core: Don't use "proxy" headers pinctrl: amd: Support new ACPI ID AMDI0033 pinctrl: renesas: rzg2l: Drop superfluous blank line pinctrl: renesas: rzg2l: Fix save/restore of {IOLH,IEN,PUPD,SMT} registers ...
2026-04-18	Merge tag 'memblock-v7.1-rc1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull memblock updates from Mike Rapoport: - improve debuggability of reserve_mem kernel parameter handling with print outs in case of a failure and debugfs info showing what was actually reserved - Make memblock_free_late() and free_reserved_area() use the same core logic for freeing the memory to buddy and ensure it takes care of updating memblock arrays when ARCH_KEEP_MEMBLOCK is enabled. * tag 'memblock-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: x86/alternative: delay freeing of smp_locks section memblock: warn when freeing reserved memory before memory map is initialized memblock, treewide: make memblock_free() handle late freeing memblock: make free_reserved_area() update memblock if ARCH_KEEP_MEMBLOCK=y memblock: extract page freeing from free_reserved_area() into a helper memblock: make free_reserved_area() more robust mm: move free_reserved_area() to mm/memblock.c powerpc: opal-core: pair alloc_pages_exact() with free_pages_exact() powerpc: fadump: pair alloc_pages_exact() with free_pages_exact() memblock: reserve_mem: fix end caclulation in reserve_mem_release_by_name() memblock: move reserve_bootmem_range() to memblock.c and make it static memblock: Add reserve_mem debugfs info memblock: Print out errors on reserve_mem parser
2026-04-18	mailbox: update kdoc for struct mbox_controller	Wolfram Sang
	Add field for missing lock around the hrtimer. Add 'Required' where the core checks for valid entries. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>
2026-04-18	mm/sparse: fix comment for section map alignment	Muchun Song
	The comment in mmzone.h currently details exhaustive per-architecture bit-width lists and explains alignment using min(PAGE_SHIFT, PFN_SECTION_SHIFT). Such details risk falling out of date over time and may inadvertently be left un-updated. We always expect a single section to cover full pages. Therefore, we can safely assume that PFN_SECTION_SHIFT is large enough to accommodate SECTION_MAP_LAST_BIT. We use BUILD_BUG_ON() to ensure this. Update the comment to accurately reflect this consensus, making it clear that we rely on a single section covering full pages. Link: https://lore.kernel.org/20260402102320.3617578-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Petr Tesarik <ptesarik@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18	shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops	Mike Rapoport (Microsoft)
	Add filemap_add() and filemap_remove() methods to vm_uffd_ops and use them in __mfill_atomic_pte() to add shmem folios to page cache and remove them in case of error. Implement these methods in shmem along with vm_uffd_ops->alloc_folio() and drop shmem_mfill_atomic_pte(). Since userfaultfd now does not reference any functions from shmem, drop include if linux/shmem_fs.h from mm/userfaultfd.c mfill_atomic_install_pte() is not used anywhere outside of mm/userfaultfd, make it static. Link: https://lore.kernel.org/20260402041156.1377214-11-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: James Houghton <jthoughton@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18	userfaultfd: introduce vm_uffd_ops->alloc_folio()	Mike Rapoport (Microsoft)
	and use it to refactor mfill_atomic_pte_zeroed_folio() and mfill_atomic_pte_copy(). mfill_atomic_pte_zeroed_folio() and mfill_atomic_pte_copy() perform almost identical actions: * allocate a folio * update folio contents (either copy from userspace of fill with zeros) * update page tables with the new folio Split a __mfill_atomic_pte() helper that handles both cases and uses newly introduced vm_uffd_ops->alloc_folio() to allocate the folio. Pass the ops structure from the callers to __mfill_atomic_pte() to later allow using anon_uffd_ops for MAP_PRIVATE mappings of file-backed VMAs. Note, that the new ops method is called alloc_folio() rather than folio_alloc() to avoid clash with alloc_tag macro folio_alloc(). Link: https://lore.kernel.org/20260402041156.1377214-10-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: James Houghton <jthoughton@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18	shmem, userfaultfd: use a VMA callback to handle UFFDIO_CONTINUE	Mike Rapoport (Microsoft)
	When userspace resolves a page fault in a shmem VMA with UFFDIO_CONTINUE it needs to get a folio that already exists in the pagecache backing that VMA. Instead of using shmem_get_folio() for that, add a get_folio_noalloc() method to 'struct vm_uffd_ops' that will return a folio if it exists in the VMA's pagecache at given pgoff. Implement get_folio_noalloc() method for shmem and slightly refactor userfaultfd's mfill_get_vma() and mfill_atomic_pte_continue() to support this new API. Link: https://lore.kernel.org/20260402041156.1377214-9-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: James Houghton <jthoughton@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18	userfaultfd: introduce vm_uffd_ops	Mike Rapoport (Microsoft)
	Current userfaultfd implementation works only with memory managed by core MM: anonymous, shmem and hugetlb. First, there is no fundamental reason to limit userfaultfd support only to the core memory types and userfaults can be handled similarly to regular page faults provided a VMA owner implements appropriate callbacks. Second, historically various code paths were conditioned on vma_is_anonymous(), vma_is_shmem() and is_vm_hugetlb_page() and some of these conditions can be expressed as operations implemented by a particular memory type. Introduce vm_uffd_ops extension to vm_operations_struct that will delegate memory type specific operations to a VMA owner. Operations for anonymous memory are handled internally in userfaultfd using anon_uffd_ops that implicitly assigned to anonymous VMAs. Start with a single operation, ->can_userfault() that will verify that a VMA meets requirements for userfaultfd support at registration time. Implement that method for anonymous, shmem and hugetlb and move relevant parts of vma_can_userfault() into the new callbacks. [rppt@kernel.org: relocate VM_DROPPABLE test, per Tal] Link: https://lore.kernel.org/adffgfM5ANxtPIEF@kernel.org Link: https://lore.kernel.org/20260402041156.1377214-8-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Cc: Tal Zussman <tz2294@columbia.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18	userfaultfd: move vma_can_userfault out of line	Mike Rapoport (Microsoft)
	vma_can_userfault() has grown pretty big and it's not called on performance critical path. Move it out of line. No functional changes. Link: https://lore.kernel.org/20260402041156.1377214-7-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18	mm/damon/core: fix damos_walk() vs kdamond_fn() exit race	SeongJae Park
	When kdamond_fn() main loop is finished, the function cancels remaining damos_walk() request and unset the damon_ctx->kdamond so that API callers and API functions themselves can show the context is terminated. damos_walk() adds the caller's request to the queue first. After that, it shows if the kdamond of the damon_ctx is still running (damon_ctx->kdamond is set). Only if the kdamond is running, damos_walk() starts waiting for the kdamond's handling of the newly added request. The damos_walk() requests registration and damon_ctx->kdamond unset are protected by different mutexes, though. Hence, damos_walk() could race with damon_ctx->kdamond unset, and result in deadlocks. For example, let's suppose kdamond successfully finished the damow_walk() request cancelling. Right after that, damos_walk() is called for the context. It registers the new request, and shows the context is still running, because damon_ctx->kdamond unset is not yet done. Hence the damos_walk() caller starts waiting for the handling of the request. However, the kdamond is already on the termination steps, so it never handles the new request. As a result, the damos_walk() caller thread infinitely waits. Fix this by introducing another damon_ctx field, namely walk_control_obsolete. It is protected by the damon_ctx->walk_control_lock, which protects damos_walk() request registration. Initialize (unset) it in kdamond_fn() before letting damon_start() returns and set it just before the cancelling of the remaining damos_walk() request is executed. damos_walk() reads the obsolete field under the lock and avoids adding a new request. After this change, only requests that are guaranteed to be handled or cancelled are registered. Hence the after-registration DAMON context termination check is no longer needed. Remove it together. The issue is found by sashiko [1]. Link: https://lore.kernel.org/20260327233319.3528-3-sj@kernel.org Link: https://lore.kernel.org/20260325141956.87144-1-sj@kernel.org [1] Fixes: bf0eaba0ff9c ("mm/damon/core: implement damos_walk()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.14.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18	mm/damon/core: fix damon_call() vs kdamond_fn() exit race	SeongJae Park
	Patch series "mm/damon/core: fix damon_call()/damos_walk() vs kdmond exit race". damon_call() and damos_walk() can leak memory and/or deadlock when they race with kdamond terminations. Fix those. This patch (of 2); When kdamond_fn() main loop is finished, the function cancels all remaining damon_call() requests and unset the damon_ctx->kdamond so that API callers and API functions themselves can know the context is terminated. damon_call() adds the caller's request to the queue first. After that, it shows if the kdamond of the damon_ctx is still running (damon_ctx->kdamond is set). Only if the kdamond is running, damon_call() starts waiting for the kdamond's handling of the newly added request. The damon_call() requests registration and damon_ctx->kdamond unset are protected by different mutexes, though. Hence, damon_call() could race with damon_ctx->kdamond unset, and result in deadlocks. For example, let's suppose kdamond successfully finished the damon_call() requests cancelling. Right after that, damon_call() is called for the context. It registers the new request, and shows the context is still running, because damon_ctx->kdamond unset is not yet done. Hence the damon_call() caller starts waiting for the handling of the request. However, the kdamond is already on the termination steps, so it never handles the new request. As a result, the damon_call() caller threads infinitely waits. Fix this by introducing another damon_ctx field, namely call_controls_obsolete. It is protected by the damon_ctx->call_controls_lock, which protects damon_call() requests registration. Initialize (unset) it in kdamond_fn() before letting damon_start() returns and set it just before the cancelling of remaining damon_call() requests is executed. damon_call() reads the obsolete field under the lock and avoids adding a new request. After this change, only requests that are guaranteed to be handled or cancelled are registered. Hence the after-registration DAMON context termination check is no longer needed. Remove it together. Note that the deadlock will not happen when damon_call() is called for repeat mode request. In tis case, damon_call() returns instead of waiting for the handling when the request registration succeeds and it shows the kdamond is running. However, if the request also has dealloc_on_cancel, the request memory would be leaked. The issue is found by sashiko [1]. Link: https://lore.kernel.org/20260327233319.3528-1-sj@kernel.org Link: https://lore.kernel.org/20260327233319.3528-2-sj@kernel.org Link: https://lore.kernel.org/20260325141956.87144-1-sj@kernel.org [1] Fixes: 42b7491af14c ("mm/damon/core: introduce damon_call()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.14.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18	mm/alloc_tag: clear codetag for pages allocated before page_ext initialization	Hao Ge
	Due to initialization ordering, page_ext is allocated and initialized relatively late during boot. Some pages have already been allocated and freed before page_ext becomes available, leaving their codetag uninitialized. A clear example is in init_section_page_ext(): alloc_page_ext() calls kmemleak_alloc(). If the slab cache has no free objects, it falls back to the buddy allocator to allocate memory. However, at this point page_ext is not yet fully initialized, so these newly allocated pages have no codetag set. These pages may later be reclaimed by KASAN, which causes the warning to trigger when they are freed because their codetag ref is still empty. Use a global array to track pages allocated before page_ext is fully initialized. The array size is fixed at 8192 entries, and will emit a warning if this limit is exceeded. When page_ext initialization completes, set their codetag to empty to avoid warnings when they are freed later. This warning is only observed with CONFIG_MEM_ALLOC_PROFILING_DEBUG=Y and mem_profiling_compressed disabled: [ 9.582133] ------------[ cut here ]------------ [ 9.582137] alloc_tag was not set [ 9.582139] WARNING: ./include/linux/alloc_tag.h:164 at __pgalloc_tag_sub+0x40f/0x550, CPU#5: systemd/1 [ 9.582190] CPU: 5 UID: 0 PID: 1 Comm: systemd Not tainted 7.0.0-rc4 #1 PREEMPT(lazy) [ 9.582192] Hardware name: Red Hat KVM, BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 [ 9.582194] RIP: 0010:__pgalloc_tag_sub+0x40f/0x550 [ 9.582196] Code: 00 00 4c 29 e5 48 8b 05 1f 88 56 05 48 8d 4c ad 00 48 8d 2c c8 e9 87 fd ff ff 0f 0b 0f 0b e9 f3 fe ff ff 48 8d 3d 61 2f ed 03 <67> 48 0f b9 3a e9 b3 fd ff ff 0f 0b eb e4 e8 5e cd 14 02 4c 89 c7 [ 9.582197] RSP: 0018:ffffc9000001f940 EFLAGS: 00010246 [ 9.582200] RAX: dffffc0000000000 RBX: 1ffff92000003f2b RCX: 1ffff110200d806c [ 9.582201] RDX: ffff8881006c0360 RSI: 0000000000000004 RDI: ffffffff9bc7b460 [ 9.582202] RBP: 0000000000000000 R08: 0000000000000000 R09: fffffbfff3a62324 [ 9.582203] R10: ffffffff9d311923 R11: 0000000000000000 R12: ffffea0004001b00 [ 9.582204] R13: 0000000000002000 R14: ffffea0000000000 R15: ffff8881006c0360 [ 9.582206] FS: 00007ffbbcf2d940(0000) GS:ffff888450479000(0000) knlGS:0000000000000000 [ 9.582208] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 9.582210] CR2: 000055ee3aa260d0 CR3: 0000000148b67005 CR4: 0000000000770ef0 [ 9.582211] PKRU: 55555554 [ 9.582212] Call Trace: [ 9.582213] <TASK> [ 9.582214] ? __pfx___pgalloc_tag_sub+0x10/0x10 [ 9.582216] ? check_bytes_and_report+0x68/0x140 [ 9.582219] __free_frozen_pages+0x2e4/0x1150 [ 9.582221] ? __free_slab+0xc2/0x2b0 [ 9.582224] qlist_free_all+0x4c/0xf0 [ 9.582227] kasan_quarantine_reduce+0x15d/0x180 [ 9.582229] __kasan_slab_alloc+0x69/0x90 [ 9.582232] kmem_cache_alloc_noprof+0x14a/0x500 [ 9.582234] do_getname+0x96/0x310 [ 9.582237] do_readlinkat+0x91/0x2f0 [ 9.582239] ? __pfx_do_readlinkat+0x10/0x10 [ 9.582240] ? get_random_bytes_user+0x1df/0x2c0 [ 9.582244] __x64_sys_readlinkat+0x96/0x100 [ 9.582246] do_syscall_64+0xce/0x650 [ 9.582250] ? __x64_sys_getrandom+0x13a/0x1e0 [ 9.582252] ? __pfx___x64_sys_getrandom+0x10/0x10 [ 9.582254] ? do_syscall_64+0x114/0x650 [ 9.582255] ? ksys_read+0xfc/0x1d0 [ 9.582258] ? __pfx_ksys_read+0x10/0x10 [ 9.582260] ? do_syscall_64+0x114/0x650 [ 9.582262] ? do_syscall_64+0x114/0x650 [ 9.582264] ? __pfx_fput_close_sync+0x10/0x10 [ 9.582266] ? file_close_fd_locked+0x178/0x2a0 [ 9.582268] ? __x64_sys_faccessat2+0x96/0x100 [ 9.582269] ? __x64_sys_close+0x7d/0xd0 [ 9.582271] ? do_syscall_64+0x114/0x650 [ 9.582273] ? do_syscall_64+0x114/0x650 [ 9.582275] ? clear_bhb_loop+0x50/0xa0 [ 9.582277] ? clear_bhb_loop+0x50/0xa0 [ 9.582279] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 9.582280] RIP: 0033:0x7ffbbda345ee [ 9.582282] Code: 0f 1f 40 00 48 8b 15 29 38 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 0f 1f 40 00 f3 0f 1e fa 49 89 ca b8 0b 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fa 37 0d 00 f7 d8 64 89 01 48 [ 9.582284] RSP: 002b:00007ffe2ad8de58 EFLAGS: 00000202 ORIG_RAX: 000000000000010b [ 9.582286] RAX: ffffffffffffffda RBX: 000055ee3aa25570 RCX: 00007ffbbda345ee [ 9.582287] RDX: 000055ee3aa25570 RSI: 00007ffe2ad8dee0 RDI: 00000000ffffff9c [ 9.582288] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000001001 [ 9.582289] R10: 0000000000001000 R11: 0000000000000202 R12: 0000000000000033 [ 9.582290] R13: 00007ffe2ad8dee0 R14: 00000000ffffff9c R15: 00007ffe2ad8deb0 [ 9.582292] </TASK> [ 9.582293] ---[ end trace 0000000000000000 ]--- Link: https://lore.kernel.org/20260331081312.123719-1-hao.ge@linux.dev Fixes: dcfe378c81f72 ("lib: introduce support for page allocation tagging") Signed-off-by: Hao Ge <hao.ge@linux.dev> Suggested-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18	liveupdate: make unregister functions return void	Pasha Tatashin
	Change liveupdate_unregister_file_handler and liveupdate_unregister_flb to return void instead of an error code. This follows the design principle that unregistration during module unload should not fail, as the unload cannot be stopped at that point. Link: https://lore.kernel.org/20260327033335.696621-10-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18	liveupdate: protect FLB lists with luo_register_rwlock	Pasha Tatashin
	Because liveupdate FLB objects will soon drop their persistent module references when registered, list traversals must be protected against concurrent module unloading. To provide this protection, utilize the global luo_register_rwlock. It protects the global registry of FLBs and the handler's specific list of FLB dependencies. Read locks are used during concurrent list traversals (e.g., during preservation and serialization). Write locks are taken during registration and unregistration. Link: https://lore.kernel.org/20260327033335.696621-5-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18	liveupdate: prevent double management of files	Pasha Tatashin
	Patch series "liveupdate: prevent double preservation", v4. Currently, LUO does not prevent the same file from being managed twice across different active sessions. Because LUO preserves files of absolutely different types: memfd, and upcoming vfiofd [1], iommufd [2], guestmefd (and possible kvmfd/cpufd). There is no common private data or guarantee on how to prevent that the same file is not preserved twice beside using inode or some slower and expensive method like hashtables. This patch (of 4) Currently, LUO does not prevent the same file from being managed twice across different active sessions. Use a global xarray luo_preserved_files to keep track of file identifiers being preserved by LUO. Update luo_preserve_file() to check and insert the file identifier into this xarray when it is preserved, and erase it in luo_file_unpreserve_files() when it is released. To allow handlers to define what constitutes a "unique" file (e.g., different struct file objects pointing to the same hardware resource), add a get_id() callback to struct liveupdate_file_ops. If not provided, the default identifier is the struct file pointer itself. This ensures that the same file (or resource) cannot be managed by multiple sessions. If another session attempts to preserve an already managed file, it will now fail with -EBUSY. Link: https://lore.kernel.org/20260326163943.574070-1-pasha.tatashin@soleen.com Link: https://lore.kernel.org/20260326163943.574070-2-pasha.tatashin@soleen.com Link: https://lore.kernel.org/all/20260129212510.967611-1-dmatlack@google.com [1] Link: https://lore.kernel.org/all/20260203220948.2176157-1-skhawaja@google.com [2] Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Pratyush Yadav <pratyush@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18	kho: kexec-metadata: track previous kernel chain	Breno Leitao
	Use Kexec Handover (KHO) to pass the previous kernel's version string and the number of kexec reboots since the last cold boot to the next kernel, and print it at boot time. Example output: [ 0.000000] KHO: exec from: 6.19.0-rc4-next-20260107 (count 1) Motivation ========== Bugs that only reproduce when kexecing from specific kernel versions are difficult to diagnose. These issues occur when a buggy kernel kexecs into a new kernel, with the bug manifesting only in the second kernel. Recent examples include the following commits: * commit eb2266312507 ("x86/boot: Fix page table access in 5-level to 4-level paging transition") * commit 77d48d39e991 ("efistub/tpm: Use ACPI reclaim memory for event log to avoid corruption") * commit 64b45dd46e15 ("x86/efi: skip memattr table on kexec boot") As kexec-based reboots become more common, these version-dependent bugs are appearing more frequently. At scale, correlating crashes to the previous kernel version is challenging, especially when issues only occur in specific transition scenarios. Implementation ============== The kexec metadata is stored as a plain C struct (struct kho_kexec_metadata) rather than FDT format, for simplicity and direct field access. It is registered via kho_add_subtree() as a separate subtree, keeping it independent from the core KHO ABI. This design choice: - Keeps the core KHO ABI minimal and stable - Allows the metadata format to evolve independently - Avoids requiring version bumps for all KHO consumers (LUO, etc.) when the metadata format changes The struct kho_kexec_metadata contains two fields: - previous_release: The kernel version that initiated the kexec - kexec_count: Number of kexec boots since last cold boot On cold boot, kexec_count starts at 0 and increments with each kexec. The count helps identify issues that only manifest after multiple consecutive kexec reboots. [leitao@debian.org: call kho_kexec_metadata_init() for both boot paths] Link: https://lore.kernel.org/all/20260309-kho-v8-5-c3abcf4ac750@debian.org/ [1] Link: https://lore.kernel.org/20260409-kho_fix_merge_issue-v1-1-710c84ceaa85@debian.org Link: https://lore.kernel.org/20260316-kho-v9-5-ed6dcd951988@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18	kho: persist blob size in KHO FDT	Breno Leitao
	kho_add_subtree() accepts a size parameter but only forwards it to debugfs. The size is not persisted in the KHO FDT, so it is lost across kexec. This makes it impossible for the incoming kernel to determine the blob size without understanding the blob format. Store the blob size as a "blob-size" property in the KHO FDT alongside the "preserved-data" physical address. This allows the receiving kernel to recover the size for any blob regardless of format. Also extend kho_retrieve_subtree() with an optional size output parameter so callers can learn the blob size without needing to understand the blob format. Update all callers to pass NULL for the new parameter. Link: https://lore.kernel.org/20260316-kho-v9-3-ed6dcd951988@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18	kho: rename fdt parameter to blob in kho_add/remove_subtree()	Breno Leitao
	Since kho_add_subtree() now accepts arbitrary data blobs (not just FDTs), rename the parameter from 'fdt' to 'blob' to better reflect its purpose. Apply the same rename to kho_remove_subtree() for consistency. Also rename kho_debugfs_fdt_add() and kho_debugfs_fdt_remove() to kho_debugfs_blob_add() and kho_debugfs_blob_remove() respectively, with the same parameter rename from 'fdt' to 'blob'. Link: https://lore.kernel.org/20260316-kho-v9-2-ed6dcd951988@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18	kho: add size parameter to kho_add_subtree()	Breno Leitao
	Patch series "kho: history: track previous kernel version and kexec boot count", v9. Use Kexec Handover (KHO) to pass the previous kernel's version string and the number of kexec reboots since the last cold boot to the next kernel, and print it at boot time. Example ======= [ 0.000000] Linux version 6.19.0-rc3-upstream-00047-ge5d992347849 ... [ 0.000000] KHO: exec from: 6.19.0-rc4-next-20260107upstream-00004-g3071b0dc4498 (count 1) Motivation ========== Bugs that only reproduce when kexecing from specific kernel versions are difficult to diagnose. These issues occur when a buggy kernel kexecs into a new kernel, with the bug manifesting only in the second kernel. Recent examples include: * eb2266312507 ("x86/boot: Fix page table access in 5-level to 4-level paging transition") * 77d48d39e991 ("efistub/tpm: Use ACPI reclaim memory for event log to avoid corruption") * 64b45dd46e15 ("x86/efi: skip memattr table on kexec boot") As kexec-based reboots become more common, these version-dependent bugs are appearing more frequently. At scale, correlating crashes to the previous kernel version is challenging, especially when issues only occur in specific transition scenarios. Some bugs manifest only after multiple consecutive kexec reboots. Tracking the kexec count helps identify these cases (this metric is already used by live update sub-system). KHO provides a reliable mechanism to pass information between kernels. By carrying the previous kernel's release string and kexec count forward, we can print this context at boot time to aid debugging. The goal of this feature is to have this information being printed in early boot, so, users can trace back kernel releases in kexec. Systemd is not helpful because we cannot assume that the previous kernel has systemd or even write access to the disk (common when using Linux as bootloaders) This patch (of 6): kho_add_subtree() assumes the fdt argument is always an FDT and calls fdt_totalsize() on it in the debugfs code path. This assumption will break if a caller passes arbitrary data instead of an FDT. When CONFIG_KEXEC_HANDOVER_DEBUGFS is enabled, kho_debugfs_fdt_add() calls __kho_debugfs_fdt_add(), which executes: f->wrapper.size = fdt_totalsize(fdt); Fix this by adding an explicit size parameter to kho_add_subtree() so callers specify the blob size. This allows subtrees to contain arbitrary data formats, not just FDTs. Update all callers: - memblock.c: use fdt_totalsize(fdt) - luo_core.c: use fdt_totalsize(fdt_out) - test_kho.c: use fdt_totalsize() - kexec_handover.c (root fdt): use fdt_totalsize(kho_out.fdt) Also update __kho_debugfs_fdt_add() to receive the size explicitly instead of computing it internally via fdt_totalsize(). In kho_in_debugfs_init(), pass fdt_totalsize() for the root FDT and sub-blobs since all current users are FDTs. A subsequent patch will persist the size in the KHO FDT so the incoming side can handle non-FDT blobs correctly. Link: https://lore.kernel.org/20260323110747.193569-1-duanchenghao@kylinos.cn Link: https://lore.kernel.org/20260316-kho-v9-1-ed6dcd951988@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Suggested-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18	mm: memcontrol: correct the nr_pages parameter type of ↵	Qi Zheng
	mem_cgroup_update_lru_size() The nr_pages parameter of mem_cgroup_update_lru_size() represents a page count. During the reparenting of LRU folios, the value passed to it can potentially exceed the maximum value of a 32-bit integer. It should be declared as long instead of int to match the types used in lruvec size accounting and to prevent possible overflow. Update the parameter type to long to ensure correctness. Link: https://lore.kernel.org/fd4140de44fa0a3978e4e2426731187fe8625f0b.1774604356.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Cc: Allen Pais <apais@linux.microsoft.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baoquan He <bhe@redhat.com> Cc: David Hildenbrand <david@kernel.org> Cc: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Cc: Hugh Dickins <hughd@google.com> Cc: Imran Khan <imran.f.khan@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kamalesh Babulal <kamalesh.babulal@oracle.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18	mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers	Muchun Song
	We must ensure the folio is deleted from or added to the correct lruvec list. So, add VM_WARN_ON_ONCE_FOLIO() to catch invalid users. The VM_BUG_ON_PAGE() in move_pages_to_lru() can be removed as add_page_to_lru_list() will perform the necessary check. Link: https://lore.kernel.org/2c90fc006d9d730331a3caeef96f7e5dabe2036d.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Allen Pais <apais@linux.microsoft.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baoquan He <bhe@redhat.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chen Ridong <chenridong@huawei.com> Cc: David Hildenbrand <david@kernel.org> Cc: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Imran Khan <imran.f.khan@oracle.com> Cc: Kamalesh Babulal <kamalesh.babulal@oracle.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yosry Ahmed <yosry@kernel.org> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18	mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios	Muchun Song
	Now that everything is set up, switch folio->memcg_data pointers to objcgs, update the accessors, and execute reparenting on cgroup death. Finally, folio->memcg_data of LRU folios and kmem folios will always point to an object cgroup pointer. The folio->memcg_data of slab folios will point to an vector of object cgroups. Link: https://lore.kernel.org/80cb7af198dc6f2173fe616d1207a4c315ece141.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Allen Pais <apais@linux.microsoft.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baoquan He <bhe@redhat.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chen Ridong <chenridong@huawei.com> Cc: David Hildenbrand <david@kernel.org> Cc: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Imran Khan <imran.f.khan@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kamalesh Babulal <kamalesh.babulal@oracle.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yosry Ahmed <yosry@kernel.org> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18	mm: memcontrol: convert objcg to be per-memcg per-node type	Qi Zheng
	Convert objcg to be per-memcg per-node type, so that when reparent LRU folios later, we can hold the lru lock at the node level, thus avoiding holding too many lru locks at once. [zhengqi.arch@bytedance.com: reset pn->orig_objcg to NULL] Link: https://lore.kernel.org/20260309112939.31937-1-qi.zheng@linux.dev [akpm@linux-foundation.org: fix comment typo, per Usama. Reflow comment to 80 cols] [devnexen@gmail.com: fix obj_cgroup leak in mem_cgroup_css_online() error path] Link: https://lore.kernel.org/20260322193631.45457-1-devnexen@gmail.com [devnexen@gmail.com: add newline, per Qi Zheng] Link: https://lore.kernel.org/20260323063007.7783-1-devnexen@gmail.com Link: https://lore.kernel.org/56c04b1c5d54f75ccdc12896df6c1ca35403ecc3.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Signed-off-by: David Carlier <devnexen@gmail.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Allen Pais <apais@linux.microsoft.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baoquan He <bhe@redhat.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chen Ridong <chenridong@huawei.com> Cc: David Hildenbrand <david@kernel.org> Cc: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Imran Khan <imran.f.khan@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kamalesh Babulal <kamalesh.babulal@oracle.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yosry Ahmed <yosry@kernel.org> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Usama Arif <usama.arif@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18	mm: workingset: use lruvec_lru_size() to get the number of lru pages	Qi Zheng
	For cgroup v2, count_shadow_nodes() is the only place to read non-hierarchical stats (lruvec_stats->state_local). To avoid the need to consider cgroup v2 during subsequent non-hierarchical stats reparenting, use lruvec_lru_size() instead of lruvec_page_state_local() to get the number of lru pages. For NR_SLAB_RECLAIMABLE_B and NR_SLAB_UNRECLAIMABLE_B cases, it appears that the statistics here have already been problematic for a while since slab pages have been reparented. So just ignore it for now. Link: https://lore.kernel.org/b1d448c667a8fb377c3390d9aba43bdb7e4d5739.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: Allen Pais <apais@linux.microsoft.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baoquan He <bhe@redhat.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chen Ridong <chenridong@huawei.com> Cc: David Hildenbrand <david@kernel.org> Cc: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Imran Khan <imran.f.khan@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kamalesh Babulal <kamalesh.babulal@oracle.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yosry Ahmed <yosry@kernel.org> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18	mm: vmscan: prepare for reparenting MGLRU folios	Qi Zheng
	Similar to traditional LRU folios, in order to solve the dying memcg problem, we also need to reparenting MGLRU folios to the parent memcg when memcg offline. However, there are the following challenges: 1. Each lruvec has between MIN_NR_GENS and MAX_NR_GENS generations, the number of generations of the parent and child memcg may be different, so we cannot simply transfer MGLRU folios in the child memcg to the parent memcg as we did for traditional LRU folios. 2. The generation information is stored in folio->flags, but we cannot traverse these folios while holding the lru lock, otherwise it may cause softlockup. 3. In walk_update_folio(), the gen of folio and corresponding lru size may be updated, but the folio is not immediately moved to the corresponding lru list. Therefore, there may be folios of different generations on an LRU list. 4. In lru_gen_del_folio(), the generation to which the folio belongs is found based on the generation information in folio->flags, and the corresponding LRU size will be updated. Therefore, we need to update the lru size correctly during reparenting, otherwise the lru size may be updated incorrectly in lru_gen_del_folio(). Finally, this patch chose a compromise method, which is to splice the lru list in the child memcg to the lru list of the same generation in the parent memcg during reparenting. And in order to ensure that the parent memcg has the same generation, we need to increase the generations in the parent memcg to the MAX_NR_GENS before reparenting. Of course, the same generation has different meanings in the parent and child memcg, this will cause confusion in the hot and cold information of folios. But other than that, this method is simple enough, the lru size is correct, and there is no need to consider some concurrency issues (such as lru_gen_del_folio()). To prepare for the above work, this commit implements the specific functions, which will be used during reparenting. [zhengqi.arch@bytedance.com: use list_splice_tail_init() to reparent child folios] Link: https://lore.kernel.org/20260324114937.28569-1-qi.zheng@linux.dev Link: https://lore.kernel.org/e75050354cdbc42221a04f7cf133292b61105548.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Suggested-by: Harry Yoo <harry.yoo@oracle.com> Suggested-by: Imran Khan <imran.f.khan@oracle.com> Acked-by: Harry Yoo <harry.yoo@oracle.com> Cc: Allen Pais <apais@linux.microsoft.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baoquan He <bhe@redhat.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chen Ridong <chenridong@huawei.com> Cc: David Hildenbrand <david@kernel.org> Cc: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kamalesh Babulal <kamalesh.babulal@oracle.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yosry Ahmed <yosry@kernel.org> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>