summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2026-05-29workqueue: Add warnings and fallback if system_{unbound}_wq is usedMarco Crivellari
Currently many users transitioned already to the new introduced workqueue (system_percpu_wq, system_dfl_wq), but there are new users who still use the older system_wq and system_unbound_wq. This change try to push this transition forward, by warning whether the old workqueues are used. Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/ Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-29Merge tag 'iommu-fixes-v7.1-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux Pull iommu fixes from Joerg Roedel: - Fix compile warning with gcc-16.1 - Intel VT-d: Simplify calculate_psi_aligned_address() - MAINTAINERS updates * tag 'iommu-fixes-v7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: MAINTAINERS: Add my employer to my entries MAINTAINERS: Add Vasant Hegde to reviewers of AMD IOMMU iommu, debugobjects: avoid gcc-16.1 section mismatch warnings iommu/vt-d: Simplify calculate_psi_aligned_address()
2026-05-29Merge tag 'hid-for-linus-2026052801' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid Pull HID fixes from Benjamin Tissoires: - buffer overflow fix for lenovo (Kean) and wacom (Lee Jones) drivers - segfaults prevention in lenovo-go driver when used with an emulated device (Louis Clinckx) - cleanup of resources in u2fzero (Myeonghun Pak) - a quirk for a USB mouse and a cleanup in hid.h (hlleng and Liu Kai) * tag 'hid-for-linus-2026052801' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid: HID: wacom: Fix OOB write in wacom_hid_set_device_mode() HID: lenovo-go: drop dead NULL check on to_usb_interface() HID: lenovo-go: reject non-USB transports in probe HID: lenovo: Fix buffer over-read and unaligned access in X12 Tab raw_event handler HID: quirks: Add ALWAYS_POLL quirk for SIGMACHIP USB mouse HID: remove duplicate hid_warn_ratelimited definition HID: u2fzero: free allocated URB on probe errors
2026-05-29EDAC/igen6: Fix memory topology parsing for Panther Lake-H SoCsQiuxu Zhuo
Panther Lake-H SoC memory controller registers for memory topology have been updated, but the current igen6_edac driver still uses old generation ones to incorrectly parse memory topology. Fix the issue by adding memory topology parsing function pointers to the 'struct res_config' and creating a new configuration structure for Panther Lake-H SoCs to enable igen6_edac to parse memory correctly. Fixes: 0be9f1af3902 ("EDAC/igen6: Add Intel Panther Lake-H SoCs support") Fixes: 4c36e6106997 ("EDAC/igen6: Add more Intel Panther Lake-H SoCs support") Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://patch.msgid.link/20260403054029.3950383-3-qiuxu.zhuo@intel.com
2026-05-29drm/msm: Add PERFCNTR_CONFIG ioctlRob Clark
Add new UABI and implementation of PERFCNTR_CONFIG ioctl. A bit more work is required to configure the pwrup_reglist for the GMU to restore SELect regs on exit of IFPC, before we can stop disabling IFPC while global counter collection. This will follow in a later commit, but will be transparent to userspace. Signed-off-by: Rob Clark <robin.clark@oss.qualcomm.com> Reviewed-by: Anna Maniscalco <anna.maniscalco2000@gmail.com> Reviewed-by: Akhil P Oommen <akhilpo@oss.qualcomm.com> Patchwork: https://patchwork.freedesktop.org/patch/728217/ Message-ID: <20260526145137.160554-14-robin.clark@oss.qualcomm.com>
2026-05-29firmware: samsung: acpm: remove compile-testing stubsArnd Bergmann
Sashiko reported an inconsistent use of NULL vs ERR_PTR() returns in the stub helpers in xynos-acpm-protocol.h. Since this only happens on dead code for COMPILE_TEST=y, this is not really a bug though. Having stub functions that return NULL is a common way to define optional interfaces, where callers still work when the feature is disabled, though this clearly does not work for acpm because some callers have a NULL pointer dereference when compile testing. Since CONFIG_EXYNOS_ACPM_PROTOCOL already supports compile-testing itself, and all (both) drivers using it clearly require the support, so this just simplifies the option space without losing any build coverage. Remove the stub functions entirely and adjust the one Kconfig dependency to require EXYNOS_ACPM_PROTOCOL unconditionally. Fixes: 6837c006d4e7 ("firmware: exynos-acpm: add empty method to allow compile test") Closes: https://sashiko.dev/#/patchset/20260420-acpm-tmu-v3-0-3dc8e93f0b26%40linaro.org Link: https://lore.kernel.org/all/a7994860-24a3-4f87-84bf-109ed653dda4@linaro.org/ Reviewed-by: Tudor Ambarus <tudor.ambarus@linaro.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://patch.msgid.link/20260529134454.2147446-1-arnd@kernel.org [krzk: Rebase on difference in devm_acpm_get_by_node()] Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
2026-05-29firmware: samsung: acpm: Add devm_acpm_get_by_phandle helperTudor Ambarus
Introduce devm_acpm_get_by_phandle() to standardize how consumer drivers acquire a handle to the ACPM IPC interface. Enforce the use of the "samsung,acpm-ipc" property name across the SoC and simplify the boilerplate code in client drivers. The first consumer of this helper is the Exynos ACPM Thermal Management Unit (TMU) driver. The TMU utilizes a hybrid management approach: direct register access from the Application Processor (AP) is restricted to the interrupt pending (INTPEND) registers for event identification. High-level functional tasks, such as sensor initialization, threshold programming, and temperature reads, are delegated to the ACPM firmware via this IPC interface. Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org> Reviewed-by: Peter Griffin <peter.griffin@linaro.org> Link: https://patch.msgid.link/20260515-acpm-tmu-helpers-v2-6-8ca011d5a965@linaro.org Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
2026-05-29firmware: samsung: acpm: Add TMU protocol supportTudor Ambarus
The Thermal Management Unit (TMU) on the Google GS101 SoC is managed through a hybrid model shared between the kernel and the Alive Clock and Power Manager (ACPM) firmware. Add the protocol helpers required to communicate with the ACPM for thermal operations, including initialization, threshold configuration, temperature reading, and system suspend/resume handshakes. Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com> Reviewed-by: Peter Griffin <peter.griffin@linaro.org> Link: https://patch.msgid.link/20260515-acpm-tmu-helpers-v2-5-8ca011d5a965@linaro.org Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
2026-05-29firmware: samsung: acpm: Make acpm_ops const and access via pointerTudor Ambarus
Replace the embedded `struct acpm_ops` inside `struct acpm_handle` with a pointer to a `const struct acpm_ops`. Previously, the operations structure was embedded directly within the handle and populated dynamically at runtime via `acpm_setup_ops()`. This resulted in mutable function pointers and unnecessary per-instance memory overhead. By defining `exynos_acpm_driver_ops` statically as a `const` structure, the function pointers are now safely housed in the read-only `.rodata` section. This improves security by preventing function pointer overwrites, saves memory, and slightly reduces initialization overhead in `acpm_probe()`. Consequently, update all consumer drivers (clk, mfd) to access the operations via the new pointer indirection (`->ops->`). Finally, fix the previously empty kernel-doc description for the ops member to reflect its new pointer nature. Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org> Reviewed-by: Peter Griffin <peter.griffin@linaro.org> Link: https://patch.msgid.link/20260515-acpm-tmu-helpers-v2-4-8ca011d5a965@linaro.org Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
2026-05-29firmware: samsung: acpm: Drop redundant _ops suffix in acpm_ops membersTudor Ambarus
Rename the `dvfs_ops` and `pmic_ops` members of `struct acpm_ops` to `dvfs` and `pmic` respectively. Since these members are housed within the `acpm_ops` structure and utilize the `acpm_*_ops` types, the `_ops` suffix on the variable names creates unnecessary redundancy (e.g., `handle.ops.dvfs_ops`). This cleanup removes the stuttering, leading to cleaner consumer code. Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org> Reviewed-by: Peter Griffin <peter.griffin@linaro.org> Acked-by: Lee Jones <lee@kernel.org> Link: https://lore.kernel.org/linux-samsung-soc/CADrjBPqzKpcd9vuCmNUptCUPyPpPbHcc19-7kN-1c0RpW1e5DQ@mail.gmail.com/T/#mcce154a7e0c6cd1ca6cd5a1e37541ed7a85a84d4 [1] Link: https://patch.msgid.link/20260515-acpm-tmu-helpers-v2-3-8ca011d5a965@linaro.org Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
2026-05-29Merge branch 'acpi-driver-devm'Rafael J. Wysocki
Merge updates that introduce devm_acpi_install_notify_handler() and convert some drivers for core ACPI devices previously using acpi_dev_install_notify_handler() to devres-based resource management. * acpi-driver-devm: ACPI: video: Switch over to devres-based resource management ACPI: video: Use devm for video->entry and backlight cleanup ACPI: video: Use devm action for freeing video devices ACPI: video: Use devm action for video bus object cleanup ACPI: video: Rearrange probe and remove code ACPI: video: Reduce the number of auxiliary device dereferences ACPI: PAD: Switch over to devres-based resource management ACPI: PAD: Fix teardown ordering in acpi_pad_remove() ACPI: PAD: Pass struct device pointer to acpi_pad_notify() ACPI: PAD: Rearrange acpi_pad_notify() ACPI: thermal: Switch over to devres-based resource management ACPI: HED: Switch over to devres-based resource management ACPI: HED: Refine guarding against adding a second instance ACPI: battery: Switch over to devres-based resource management ACPI: AC: Switch over to devres-based resource management ACPI: NFIT: core: Use devm_acpi_install_notify_handler() ACPI: bus: Introduce devm_acpi_install_notify_handler()
2026-05-29Merge back earlier cpufreq material for 7.2Rafael J. Wysocki
2026-05-29block: Include bvec.h kernel-doc in the htmldocsMatthew Wilcox (Oracle)
People have gone to the trouble of writing this kernel-doc; the least we can do is publish it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@linux.dev> Link: https://patch.msgid.link/20260528175905.1102280-3-willy@infradead.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-29block: Add bvec_folio()Matthew Wilcox (Oracle)
This is a simple helper which replaces page_folio(bvec->bv_page). Minor improvement in readability, but the real motivation is to reduce the number of references to bvec->bv_page so that it can be changed with less work. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Leon Romanovsky <leon@kernel.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@linux.dev> Link: https://patch.msgid.link/20260528175905.1102280-2-willy@infradead.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-29sched_ext: Auto-register/unregister dl_server reservationsAndrea Righi
Commit cd959a3562050d ("sched_ext: Add a DL server for sched_ext tasks") introduced an ext_server deadline server to protect sched_ext tasks from fair/RT starvation, mirroring the existing fair_server. Currently, both servers reserve their 50ms/1000ms bandwidth at boot, regardless of whether a BPF scheduler is loaded. Unused bandwidth is still reclaimed at runtime by other classes, but the static reservation prevents the RT class from implicitly using that headroom when one of the two classes is guaranteed to be empty. A sysadmin can work around this by writing /sys/kernel/debug/sched/{fair,ext}_server/cpu*/runtime, but that requires manual action and not all systems expose debugfs. A better approach is to make server bandwidth reservations dynamic: only the scheduling policy that is currently active should register its reservation, while the inactive one should not artificially hold capacity (keeping both reservations only when the BPF scheduler is running in partial mode): +---------------------------------------------+-------------+------------+ | BPF scheduler state | fair server | ext server | +---------------------------------------------+-------------+------------+ | not loaded (default boot) | reserved | none | | loaded full mode (!SCX_OPS_SWITCH_PARTIAL) | none | reserved | | loaded partial mode (SCX_OPS_SWITCH_PARTIAL)| reserved | reserved | +---------------------------------------------+-------------+------------+ To achieve this, introduce an "attached/detached" state for each deadline server, so the kernel can decide whether a server's bandwidth should be accounted in global bandwidth tracking. At boot, the system starts with only the fair server contributing to bandwidth accounting. When a BPF scheduler is enabled, the ext server is attached and may replace or complement the fair server depending on whether full or partial mode is used. When sched_ext is disabled, the system restores the previous deadline bandwidth values and behavior. The transition logic ensures that switching between scheduling modes is consistent and reversible, without losing runtime configuration or requiring manual intervention. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260526164420.638711-2-arighi@nvidia.com
2026-05-29vfs: make LAST_XXX private to fs/namei.cJori Koolstra
The only user of LAST_XXX outside of fs/namei.c is fs/smb/server/vfs.c; ksmbd_vfs_path_lookup() calls vfs_path_parent_lookup() and expects a LAST_NORM last type (or it will be ENOENT). ksmbd_vfs_rename() also calls vfs_path_parent_lookup() but forgets the LAST_NORM check. It does not really make sense to have vfs_path_parent_lookup() expose the last_type because it is only needed to ensure it is LAST_NORM. So let's do this check in vfs_path_parent_lookup() instead and keep the LAST_XXX internal to fs/namei.c. This changes the ENOENT errno in ksmbd_vfs_path_lookup() to EINVAL, which matches better with how this is handled by callers of filename_parentat(). Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl> Link: https://patch.msgid.link/20260528175854.57626-1-jkoolstra@xs4all.nl Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: NeilBrown <neil@brown.name> Reviewed-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-05-29crypto: af_alg - Drop support for off-CPU cryptographyDemi Marie Obenour
AF_ALG is deprecated and exposed to unprivileged userspace. Only use the least buggy algorithm implementations: the pure software ones. This removes one of the main advantages of AF_ALG, which is the ability to use it with off-CPU accelerators. However, using off-CPU accelerators has huge overheads, both in performance and attack surface. I have yet to see real-world, performance-critical workloads where using an accelerator via AF_ALG is actually a win over doing cryptography in userspace. If using an off-CPU accelerator really does turn out to be a win, a new API should be developed that is actually a good fit for it. Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2026-05-29net: Remove support for AIO on socketsDemi Marie Obenour
The only user of msg->msg_iocb was AF_ALG, but that's deprecated. It can be removed entirely at the cost of only supporting synchronous operations. This doesn't break userspace, which will silently block (for a bounded amount of time) in io_submit instead of operating asynchronously. This also makes struct msghdr smaller, helping every other caller of sendmsg(). Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2026-05-29crypto: hisilicon/qm - support doorbell enable controlZongyu Wu
The driver notifies the hardware to handle task through doorbell. Currently, doorbell is enabled by default. To prevent the process from sending doorbells during hardware reset scenarios, which could cause the hardware to process doorbells and trigger new errors: For example, when the physical machine is resetting the device, doorbells are still being sent from the virtual machine. Therefore, the driver disables doorbell during hardware unavailability. After hardware initialization is completed, doorbell is enabled, and any task sent during the unavailability period will return errors. The hardware supports the PF to disable doorbells for all functions, while the VF can only disable its own doorbell function. When the PF is reset, it will disable doorbells for all functions. When VF is reset, it only disables its own doorbell and does not affect tasks on other functions. Signed-off-by: Zongyu Wu <wuzongyu1@huawei.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2026-05-29crypto: hisilicon/qm - support function-level error resetZhushuai Yin
When executing operations on crypto devices, hardware errors are inevitable. For certain errors, a full device reset is required to recover. However, in certain cases, only a specific function may fail, while other functions can still operate normally. A system-wide RAS reset in such cases would unnecessarily impact functioning components. This patch introduces function-level granularity handling, enabling targeted resets of only the error-reporting functions without affecting other operational functions. Signed-off-by: Zhushuai Yin <yinzhushuai@huawei.com> Signed-off-by: Zongyu Wu <wuzongyu1@huawei.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2026-05-29crypto: hisilicon/qm - place the interrupt status interface after the PM ↵Zhushuai Yin
usage counter To avoid accessing memory of a suspended device, and since the counter interface used by PM involves sleep operations, the counter interface cannot be placed in the interrupt top half. Therefore, the interface for acquiring the interrupt status in the RAS reset flow that resides in the interrupt context needs to be moved to the bottom half for processing. Signed-off-by: Zhushuai Yin <yinzhushuai@huawei.com> Signed-off-by: Zongyu Wu <wuzongyu1@huawei.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2026-05-29crypto: hisilicon/qm - allow VF devices to query hardware isolation statusZhushuai Yin
The problem that the VF device cannot obtain the isolation status and isolation threshold of the device is resolved. The accelerator driver can query the device isolation status and threshold via the VF device using the fault query sysfs interface under uacce. Note that only the PF device supports isolation policy configuration, while the VF device is limited to read-only query operations. Signed-off-by: Zhushuai Yin <yinzhushuai@huawei.com> Signed-off-by: Zongyu Wu <wuzongyu1@huawei.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2026-05-28include: remove unused cnt32_to_63.hCosta Shulyupin
All users have been removed over time as ARM and other architectures switched to generic sched_clock. The last user was microblaze, removed in commit 839396ab88e4 ("microblaze: timer: Use generic sched_clock implementation"). Assisted-by: Claude:claude-opus-4-6 Link: https://lore.kernel.org/20260515183429.1503740-1-costa.shul@redhat.com Signed-off-by: Costa Shulyupin <costa.shul@redhat.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <npitre@baylibre.com> Cc: Nicolas Pitre <nico@fluxnic.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28raid6: hide internalsChristoph Hellwig
Split out two new headers from the public pq.h: - lib/raid/raid6/algos.h contains the algorithm lists private to lib/raid/raid6 - include/linux/raid/pq_tables.h contains the tables also used by async_tx providers. The public include/linux/pq.h is now limited to the public interface for the consumers of the RAID6 PQ API. [hch@lst.de: remove duplicate ccflags-y line] Link: https://lore.kernel.org/20260527074539.2292913-2-hch@lst.de Link: https://lore.kernel.org/20260518051804.462141-10-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: Ard Biesheuvel <ardb@kernel.org> # kunit only on arm64 Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Mason <clm@fb.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Sterba <dsterba@suse.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Li Nan <linan122@huawei.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Song Liu <song@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28raid6: warn when using less than four devicesChristoph Hellwig
Quoting H. Peter Anvin who came up with the RAID6 P/Q algorithm, and who wrote the initial implementation, then still part of the md driver: The RAID-6 code has *never* supported only 3 units, and if it ever worked for *any* of the implementations it was purely by accident. Speaking as the original author I should know; this was deliberate as in some cases the degenerate case (3) would have required extra trays in the code to no user benefit. While md never allowed less than 4 devices, btrfs does. This new warning will trigger for such file systems, but given how it already causes havoc that is a good thing. If btrfs wants to fix third, it should switch to transparently use three-way mirroring underneath, which will work as P and Q are copies of the single data device by the definition of the Linux RAID 6 P/Q algorithm. Link: https://lore.kernel.org/20260518051804.462141-9-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: Ard Biesheuvel <ardb@kernel.org> # kunit only on arm64 Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Mason <clm@fb.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Sterba <dsterba@suse.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Li Nan <linan122@huawei.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Song Liu <song@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28raid6: improve the public interfaceChristoph Hellwig
Stop directly calling into function pointers from users of the RAID6 PQ API, and provide exported functions with proper documentation and API guarantees asserts where applicable instead. Link: https://lore.kernel.org/20260518051804.462141-8-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: Ard Biesheuvel <ardb@kernel.org> # kunit only on arm64 Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Mason <clm@fb.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Sterba <dsterba@suse.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Li Nan <linan122@huawei.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Song Liu <song@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28raid6: remove raid6_get_zero_pageChristoph Hellwig
Just open code it as in other places in the kernel. Link: https://lore.kernel.org/20260518051804.462141-6-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: Ard Biesheuvel <ardb@kernel.org> # kunit only on arm64 Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Mason <clm@fb.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Sterba <dsterba@suse.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Li Nan <linan122@huawei.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Song Liu <song@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28raid6: remove unused defines in pq.hChristoph Hellwig
These are not used anywhere in the kernel. Link: https://lore.kernel.org/20260518051804.462141-5-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: Ard Biesheuvel <ardb@kernel.org> # kunit only on arm64 Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Mason <clm@fb.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Sterba <dsterba@suse.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Li Nan <linan122@huawei.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Song Liu <song@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28raid6: remove __KERNEL__ ifdefsChristoph Hellwig
With the test code ported to kernel space, none of this is required. Link: https://lore.kernel.org/20260518051804.462141-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: Ard Biesheuvel <ardb@kernel.org> # kunit only on arm64 Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Mason <clm@fb.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Sterba <dsterba@suse.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Li Nan <linan122@huawei.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Song Liu <song@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28raid6: turn the userspace test harness into a kunit testChristoph Hellwig
Patch series "cleanup the RAID6 P/Q library", v3. This series cleans up the RAID6 P/Q library to match the recent updates to the RAID 5 XOR library and other CRC/crypto libraries. This includes providing properly documented external interfaces, hiding the internals, using static_call instead of indirect calls and turning the user space test suite into an in-kernel kunit test which is also extended to improve coverage. Note that this changes registration so that non-priority algorithms are not registered, which greatly helps with the benchmark time at boot time. I'd like to encourage all architecture maintainers to see if they can further optimized this by registering as few as possible algorithms when there is a clear benefit in optimized or more unrolled implementations. This patch (of 18): Currently the raid6 code can be compiled as userspace code to run the test suite. Convert that to be a kunit case with minimal changes to avoid mutating global state so that we can drop this requirement. Note that this is not a good kunit test case yet and will need a lot more work, but that is deferred until the raid6 code is moved to it's new place, which is easier if the userspace makefile doesn't need adjustments for the new location first. Link: https://lore.kernel.org/20260518051804.462141-1-hch@lst.de Link: https://lore.kernel.org/20260518051804.462141-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: Ard Biesheuvel <ardb@kernel.org> # kunit only on arm64 Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Mason <clm@fb.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Sterba <dsterba@suse.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Li Nan <linan122@huawei.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Song Liu <song@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28kcov: allow simultaneous KCOV_ENABLE/KCOV_REMOTE_ENABLEJann Horn
Allow the same userspace thread to simultaneously collect normal coverage in syscall context (KCOV_ENABLE) and remote coverage of asynchronous work created by the thread (KCOV_REMOTE_ENABLE). With this, remote KCOV coverage becomes useful for generic fuzzing and not just fuzzing of specific data injection interfaces. This requires that the task_struct::kcov_* fields are separated into ones that are used by the task that generates coverage, and ones that are used by the task that requested remote coverage. To split this up: - Split task_struct::kcov into kcov and kcov_remote. kcov_task_exit() now has to clean up both separately. - Only use task_struct::kcov_mode on the task that generates coverage. - Only reset task_struct::kcov_handle on the task that requested remote coverage. After this change, fields used by the task that generates coverage are: - kcov_mode - kcov_size - kcov_area - kcov - kcov_sequence - kcov_softirq Fields used by the task that requested remote coverage are: - kcov_remote - kcov_handle [jannh@google.com: remove unused constant KCOV_MODE_REMOTE, per Dmitry] Link: https://lore.kernel.org/20260515-kcov-simultaneous-remote-v2-1-56fde1cfa509@google.com [jannh@google.com: update documentation on remote coverage collection] Link: https://lore.kernel.org/20260519-kcov-docs-v1-1-5bb22f4cb20c@google.com [jannh@google.com: move and reword sentence on simultaneous normal/remote collection Link: https://lore.kernel.org/20260520-kcov-docs-v2-1-819f78778763@google.com Link: https://lore.kernel.org/20260505-kcov-simultaneous-remote-v1-1-a670ba7cefd2@google.com Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28llist: make locking comments consistentPhilipp Stanner
llist's locking requirement table has a legend which claims that all operations not needing a lock a marked with '-', whereas in truth for some table entries just a whitespace is used. Add the '-' to all appropriate places. Link: https://lore.kernel.org/20260507094918.23910-2-phasta@kernel.org Signed-off-by: Philipp Stanner <phasta@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28kcov: refactor common handle ID into kcov_common_handle_idJann Horn
Store common handle IDs in "struct kcov_common_handle_id", which consumes no space in non-KCOV builds. This cleanup removes #ifdef boilerplate code from subsystems that integrate with KCOV (in particular in usbip_common.h and skbuff.h, see the diffstat). This should also make it easier to add KCOV remote coverage to more subsystems in the future. Link: https://lore.kernel.org/20260430-kcov-refactor-common-handle-v1-1-23a0c7a0ba38@google.com Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Hongren (Zenithal) Zheng <i@zenithal.me> Cc: Jann Horn <jannh@google.com> Cc: Jason Wang <jasowang@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Valentina Manea <valentina.manea.m@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28uaccess: minimize INLINE_COPY_USER-related ifdeferyYury Norov
Now that we've got the same config selecting inline vs outline copy_to_user() and copy_from_user(), we can simplify the corresponding logic in the uaccess.h. Link: https://lore.kernel.org/20260425020857.356850-4-ynorov@nvidia.com Fixes: 1f9a8286bc0c ("uaccess: always export _copy_[from|to]_user with CONFIG_RUST") Signed-off-by: Yury Norov <ynorov@nvidia.com> Tested-by: Alice Ryhl <aliceryhl@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Viktor Malik <vmalik@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28uaccess: unify inline vs outline copy_{from,to}_user() selectionYury Norov
The kernel allows arches to select between inline and outline implementations of the copy_{from,to}_user() by defining individual INLINE_COPY_FROM_USER and INLINE_COPY_TO_USER, correspondingly. However, all arches enable or disable them always together. Without the real use-case for one helper being inlined while the other outlined, having independent controls is excessive and error prone. Switch the codebase to the single unified INLINE_COPY_USER control. Link: https://lore.kernel.org/20260425020857.356850-3-ynorov@nvidia.com Signed-off-by: Yury Norov <ynorov@nvidia.com> Tested-by: Alice Ryhl <aliceryhl@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Viktor Malik <vmalik@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28init.h: discard exitcall symbols earlyArnd Bergmann
Any __exitcall() and built-in module_exit() handler is marked as __used, which leads to the code being included in the object file and later discarded at link time. As far as I can tell, this was originally added at the same time as initcalls were marked the same way, to prevent them from getting dropped with gcc-3.4, but it was never actaully necessary to keep exit functions around. Mark them as __maybe_unused instead, which lets the compiler treat the exitcalls as entirely unused, and make better decisions about dropping specializing static functions called from these. Link: https://lore.kernel.org/all/acruxMNdnUlyRHiy@google.com/ Link: https://lore.kernel.org/20260331142846.3187706-1-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Nicolas Schier <nsc@kernel.org> Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Kees Cook <kees@kernel.org> Cc: Marco Elver <elver@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28highmem-internal.h: fix typo in the comment for kunmap_atomic()Zhouyi Zhou
Replace `PREEMP_RT` with `PREEMPT_RT` in the header comment to match the correct kernel configuration name. Link: https://lore.kernel.org/20260505021125.1941691-1-zhouzhouyi@gmail.com Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28mm/damon/core: remove damon_set_region_biggest_system_ram_default()SeongJae Park
Now nobody is using damon_set_region_biggest_system_ram_default(). Remove it. Link: https://lore.kernel.org/20260429041232.90257-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28mm/damon: introduce damon_set_region_system_rams_default()SeongJae Park
Patch series "mm/damon/reclaim,lru_sort: monitor all system rams by default". DAMON_RECLAIM and DAMON_LRU_SORT set the biggest 'System RAM' resource of the system as the default monitoring target address range. The main intention behind the design is to minimize the overhead coming from monitoring of non-System RAM areas. This could result in an odd setup when there are multiple discrete System RAMs of considerable sizes. For example, there are System RAMs each having 500 GiB size. In this case, only the first 500 GiB will be set as the monitoring region by default. This is particularly common on NUMA systems. Hence the modules allow users to set the monitoring target address range using the module parameters if the default setup doesn't work for them. In other words, the current design trades ease of setup for lower overhead. However, because DAMON utilizes the sampling based access check and the adaptive regions adjustment mechanisms, the overhead from the monitoring of non-System RAM areas should be negligible in most setups. Meanwhile, the setup complexity is causing real headaches for users who need to run those modules on various types of systems. That is, the current tradeoff is not a good deal. Set the physical address range that can cover all System RAM areas of the system as the default monitoring regions for DAMON_RECLAIM and DAMON_LRU_SORT. Technically speaking, this is changing documented behavior. However, it makes no sense to believe there is a real use case that really depends on the old weird default behavior. If the old default behavior was working for them in the reasonable way, this change will only add a negligible amount of monitoring overhead. If it didn't work, the users may already be using manual monitoring regions setup, and they will not be affected by this change. Patches Sequence ================ Patch 1 introduces a new core function that will be used for the new default monitoring target region setup. Patch 2 and 3 update DAMON_RECLAIM and DAMON_LRU_SORT to use the new function instead of the old one, respectively. Patch 4 removes the old core function that was replaced by the new one, as there is no more user of it. Patch 5 updates DAMON_STAT to use the new one instead of its in-house nearly-duplicate self implementation of the functionality. Finally patches 6 and 7 update the DAMON_RECLAIM and DAMON_LRU_SORT user documentation for the new behaviors, respectively. This patch (of 7): damon_set_region_biggest_system_ram_default() sets the monitoring target region as the caller requested. If the caller didn't specify the region, it finds the biggest System RAM of the system and sets it as the target region. When there are more than one considerable size of System RAM resources in the system, the default target setup makes no sense. Introduce a variant, namely damon_set_region_system_rams_default(). It sets a physical address range that covers all System RAM resources as the default target region. Link: https://lore.kernel.org/20260429041232.90257-1-sj@kernel.org Link: https://lore.kernel.org/20260429041232.90257-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28mm: skip KASAN tagging for page-allocated page tablesMuhammad Usama Anjum
Page tables are always accessed via the linear mapping with a match-all tag, so HW-tag KASAN never checks them. For page-allocated tables (PTEs and PGDs etc), avoid the tag setup and poisoning overhead by using __GFP_SKIP_KASAN. SLUB-backed page tables are unchanged for now. (They aren't widely used and require more SLUB related skip logic. Leave it later.) Link: https://lore.kernel.org/20260429102704.680174-4-dev.jain@arm.com Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com> Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ben Segall <bsegall@google.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28kasan: skip HW tagging for all kernel thread stacksMuhammad Usama Anjum
HW-tag KASAN never checks kernel stacks because stack pointers carry the match-all tag, so setting/poisoning tags is pure overhead. - Add __GFP_SKIP_KASAN to THREADINFO_GFP so every stack allocator that uses it skips tagging (fork path plus arch users) - Add __GFP_SKIP_KASAN to GFP_VMAP_STACK for the fork-specific vmap stacks. - When reusing cached vmap stacks, skip kasan_unpoison_range() if HW tags are enabled. Software KASAN is unchanged; this only affects tag-based KASAN. Link: https://lore.kernel.org/20260429102704.680174-3-dev.jain@arm.com Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com> Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ben Segall <bsegall@google.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28vmalloc: add __GFP_SKIP_KASAN supportMuhammad Usama Anjum
Patch series "kasan: hw_tags: Disable tagging for stack and page-tables", v4. Stacks and page tables are always accessed with the match-all tag, so assigning a new random tag every time at allocation and setting invalid tag at deallocation time, just adds overhead without improving the detection. With __GFP_SKIP_KASAN the page keeps its poison tag and KASAN_TAG_KERNEL (match-all tag) is stored in the page flags while keeping the poison tag in the hardware. The benefit of it is that 256 tag setting instruction per 4 kB page aren't needed at allocation and deallocation time. Thus match-all pointers still work, while non-match tags (other than poison tag) still fault. __GFP_SKIP_KASAN only skips for KASAN_HW_TAGS mode, so coverage is unchanged. Benchmark: The benchmark has two modes. In thread mode, the child process forks and creates N threads. In pgtable mode, the parent maps and faults a specified memory size and then forks repeatedly with children exiting immediately. Thread benchmark: 2000 iterations, 2000 threads: 2.575 s → 2.229 s (~13.4% faster) The pgtable samples: - 2048 MB, 2000 iters 19.08 s → 17.62 s (~7.6% faster) This patch (of 3): For allocations that will be accessed only with match-all pointers (e.g., kernel stacks), setting tags is wasted work. If the caller already set __GFP_SKIP_KASAN, skip tag setting of vmalloc pages. Before this patch, __GFP_SKIP_KASAN wasn't being used with vmalloc APIs. So it wasn't being checked. Now its being checked and acted upon. Other KASAN modes are unchanged because __GFP_SKIP_KASAN is ignored for them in the page allocator, and in vmalloc too we ignore this flag for them. This is a preparatory patch for optimizing kernel stack allocations. Link: https://lore.kernel.org/20260429102704.680174-1-dev.jain@arm.com Link: https://lore.kernel.org/20260429102704.680174-2-dev.jain@arm.com Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com> Co-developed-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Co-developed-by: Dev Jain <dev.jain@arm.com> Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ben Segall <bsegall@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28mm/damon/core: introduce damon_ctx->pausedSeongJae Park
Patch series "mm/damon: let DAMON be paused and resumed", v2. DAMON utilizes a few mechanisms that enhance itself over time. Adaptive regions adjustment, goal-based DAMOS quota auto-tuning and monitoring intervals auto-tuning like self-training mechanisms are such examples. It also adds access frequency stability information (age) to the monitoring results, which makes it enhanced over time. Sometimes users have to stop DAMON. In this case, DAMON internal state that enhanced over the time of the last execution simply goes away. Restarted DAMON have to train itself and enhance its output from the scratch. This makes DAMON less useful in such cases. Introducing three such use cases below. Investigation of DAMON. It is best to do the investigation online, especially when it is a production environment. DAMON therefore provides features for such online investigations, including DAMOS stats, monitoring result snapshot exposure, and multiple tracepoints. When those are insufficient, and there are additional clues that could be interfered by DAMON, users have to temporarily stop DAMON to collect the additional clues. It is not very useful since many of DAMON internal clues are gone when DAMON is stopped. The loss of the monitoring results that improved over time is also problematic, especially in production environments. Monitoring of workloads that have different user-known phases. For example, in Android, applications are known to have very different access patterns and behaviors when they are running on the foreground and the background. It can therefore be useful to separate monitoring of apps based on whether they are running on the foreground and on the background. Having two DAMON threads per application that paused and resumed for the apps foreground/background switches can be useful for the purpose. But such pause/resume of the execution is not supported. Tests of DAMON. A few DAMON selftests are using drgn to dump the internal DAMON status. The tests show if the dumped status is the same as what the test code expected. Because DAMON keeps running and modifying its internal status, there are chances of data races that can cause false test results. Stopping DAMON can avoid the race. But, since the internal state of DAMON is dropped, the test coverage will be limited. Let DAMON execution be paused and resumed without loss of the internal state, to overhaul the limitations. For this, introduce a new DAMON context parameter, namely 'pause'. API callers can update it while the context is running, using the online parameters update functions (damon_commit_ctx() and damon_call()). Once it is set, kdamond_fn() main loop will do only limited works excluding the monitoring and DAMOS works, while sleeping sampling intervals per the work. The limited works include handling of the online parameters update. Hence users can unset the 'pause' parameter again. Once it is unset, kdamond_fn() main loop will do all the work again (resumed). Under the paused state, it also does stop condition checks and handling of it, so that paused DAMON can also be stopped if needed. Expose the feature to the user space via DAMON sysfs interface. Also, update existing drgn-based tests to test and use the feature. Tests ===== I confirmed the feature functionality using real time tracing ('perf trace' or 'trace-cmd stream') of damon:damon_aggregated DAMON tracepoint. By pausing and resuming the DAMON execution, I was able to see the trace stops and continued as expected. Note that the pause feature support is added to DAMON user-space tool (damo) after v3.1.9. Users can use '--pause_ctx' command line option of damo for that, and I actually used it for my test. The extended drgn-based selftests are also testing a part of the functionality. Patches Sequence ================ Patch 1 introduces the new core API for the pause feature. Patch 2 extend DAMON sysfs interface for the new parameter. Patches 3-5 update design, usage and ABI documents for the new sysfs file, respectively. The following five patches are for tests. Patch 6 implements a new kunit test for the pause parameter online commitment. Patches 7 and 8 extend DAMON selftest helpers to support the new feature. Patch 9 extends selftest to test the commitment of the feature. Finally, patch 10 updates existing selftest to be safe from the race condition using the pause/resume feature. This patch (of 10): DAMON supports only start and stop of the execution. When it is stopped, its internal data that it self-trained goes away. It will be useful if the execution can be paused and resumed with the previous self-trained data. Introduce per-context API parameter, 'paused', for the purpose. The parameter can be set and unset while DAMON is running and paused, using the online parameters commit helper functions (damon_commit_ctx() and damon_call()). Once 'paused' is set, the kdamond_fn() main loop does only limited works with sampling interval sleep during the works. The limited works include the handling of the online parameters update, so that users can unset the 'pause' and resume the execution when they want. It also keep checking DAMON stop conditions and handling of it, so that DAMON can be stopped while paused if needed. Link: https://lore.kernel.org/20260427151231.113429-1-sj@kernel.org Link: https://lore.kernel.org/20260427151231.113429-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28mm: limit filemap_fault readahead to VMA boundariesFrederick Mayle
When a file mapping covers a strict subset of a file, an access to the mapping can trigger readahead of file pages outside the mapped region. Readahead is meant to prefetch pages likely to be accessed soon, but these pages aren't accessible via the same means, so it fair to say we don't have a good indicator they'll be accessed soon. Take an ELF file for example: an access to the end of a program's read-only segment isn't a sign that nearby file contents will be accessed next (they are likely to be mapped discontiguously, or not at all). The pressure from loading these pages into the cache can evict more useful pages. To improve the behavior, make three changes: * Introduce a new readahead_control field, max_index, as a hard limit on the readahead. The existing file_ra_state->size can't be used as a limit, it is more of a hint and can be increased by various heuristics. * Set readahead_control->max_index to the end of the VMA in all of the readahead paths that can be triggered from a fault on a file mapping (both "sync" and "async" readahead). * Limit the read-around range start to the VMA's start. Note that these changes only affect readahead triggered in the context of a fault, they do not affect readahead triggered by read syscalls. If a user mixes the two types of accesses, the behavior is expected to be the following: if a fault causes readahead and places a PG_readahead marker and then a read(2) syscall hits the PG_readahead marker, the resulting async readahead *will not* be limited to the VMA end. Conversely, if a read(2) syscall places a PG_readahead marker and then a fault hits the marker, the async readahead *will* be limited to the VMA end. There is an edge case that the above motivation glosses over: A single file mapping might be backed by multiple VMAs. For example, a whole file could be mapped RW, then part of the mapping made RO using mprotect. This patch would hurt performance of a sequential faulted read of such a mapping, the degree depending on how fragmented the VMAs are. A usage pattern like that is likely rare and already suffering from sub-optimal performance because, e.g., the fragmented VMAs limit the fault-around, so each VMA boundary in a sequential faulted read would cause a minor fault. Still, this patch would make it worse. See a previous discussion of this topic at [1]. Tested by mapping and reading a small subset of a large file, then using the cachestat syscall to verify the number of cached pages didn't exceed the mapping size. In practical scenarios, the effect depends on the specific file and usage. Sometimes there is no effect at all, but, for some ELF files in Android, we see ~20% fewer pages pulled into the cache. A comprehensive performance evaluation hasn't been done, but, in addition to the anecdontal memory savings mentioned above, a benchmark was run with fio 3.38, showing neutral looking results: /data/local/tmp/fio --version fio --name=mmap_test --ioengine=mmap --rw=read --bs=4k \ --offset=1G --size=1G --filesize=3G --numjobs=1 \ --filename=testfile.bin Before: 4366.6 MiB/s (avg of 3459, 4592, 4613, 4697, 4472) After: 4444.0 MiB/s (avg of 4633, 4655, 4511, 4571, 3850) +1.7% Same, with --ioengine=mmap --rw=randread Before: 445.6 MiB/s (avg of 446, 447, 442, 452, 441) After: 447.0 MiB/s (avg of 447, 446, 446, 451, 445) +0.3% Same, with --ioengine=psync --rw=read Before: 3086.6 MiB/s (avg of 3122, 3094, 3066, 3094, 3057) After: 3084.6 MiB/s (avg of 3039, 3103, 3103, 3084, 3094) -0.06% Same, with --ioengine=psync --rw=randread Before: 2226.4 MiB/s (avg of 2256, 2183, 2207, 2265, 2221) After: 2231.4 MiB/s (avg of 2236, 2241, 2236, 2193, 2251) +0.2% Link: https://lore.kernel.org/20260427030148.653228-1-fmayle@google.com Link: https://lore.kernel.org/all/ivnv2crd3et76p2nx7oszuqhzzah756oecn5yuykzqfkqzoygw@yvnlkhjjssoz/ [1] Signed-off-by: Frederick Mayle <fmayle@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Kalesh Singh <kaleshsingh@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28mm: remove page_mapped()David Hildenbrand (Arm)
Let's replace the last user of page_mapped() by folio_mapped() so we can get rid of page_mapped(). Replace the remaining occurrences of page_mapped() in rmap documentation by folio_mapped(). Link: https://lore.kernel.org/20260427-page_mapped-v1-3-e89c3592c74c@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Eduard Zingerman <eddyz87@gmail.com> Cc: Harry Yoo <harry@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rich Felker <dalias@libc.org> Cc: Rik van Riel <riel@surriel.com> Cc: Song Liu <song@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Yonghong Song <yonghong.song@linux.dev> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28mm/damon: support MADV_COLLAPSE via DAMOS_COLLAPSE scheme actionAsier Gutierrez
This patch set introces a new action: DAMOS_COLLAPSE. For DAMOS_HUGEPAGE and DAMOS_NOHUGEPAGE to work, khugepaged should be working, since it relies on hugepage_madvise to add a new slot. This slot should be picked up by khugepaged and eventually collapse (or not, if we are using DAMOS_NOHUGEPAGE) the pages. If THP is not enabled, khugepaged will not be working, and therefore no collapse will happen. DAMOS_COLLAPSE eventually calls madvise_collapse, which will collapse the address range synchronously. In cases where there is a large VMA (databases, for example), DAMOS_COLLAPSE allows us to collapse only the hot region, and not the entire VMA. This new action may be required to support autotuning with hugepage as a goal[1]. ========= Benchmarks: ========= MySQL ===== Tests were performed in an ARM physical server with MariaDB 10.5 and sysbench. Read only benchmark was perform with gaussian row hitting, which follows a normal distribution. T n, D h: THP set to never, DAMON action set to hugepage T m, D h: THP set to madvise, DAMON action set to hugepage T n, D c: THP set to never, DAMON action set to collapse Memory consumption. Lower is better. +------------------+----------+----------+----------+ | | T n, D h | T m, D h | T n, D c | +------------------+----------+----------+----------+ | Total memory use | 2.13 | 2.20 | 2.20 | | Huge pages | 0 | 1.3 | 1.27 | +------------------+----------+----------+----------+ Performance in TPS (Transactions Per Second). Higher is better. T n, D h: 18225.58 T m, D h 18252.93 T n, D c: 18270.21 Performance counter I got the number of L1 D/I TLB accesses and the number a D/I TLB accesses that triggered a page walk. I divided the second by the first to get the percentage of page walkes per TLB access. The lower the better. +---------------+--------------+--------------+--------------+ | | T n, D h | T m, D h | T n, D c | +---------------+--------------+--------------+--------------+ | L1 DTLB | 127248242753 | 125431020479 | 125327001821 | | L1 ITLB | 80332558619 | 79346759071 | 79298139590 | | DTLB walk | 75011087 | 52800418 | 55895794 | | ITLB walk | 71577076 | 71505137 | 67262140 | | DTLB % misses | 0.058948623 | 0.042095183 | 0.044599961 | | ITLB % misses | 0.089100954 | 0.090117275 | 0.084821839 | +---------------+--------------+--------------+--------------+ Masim ===== I used masim with the "demo" configuration, but changing the times to 100 seconds for the initial phase and 50 seconds for the rest of the phases. Memory consumption: +------------------+----------+----------+----------+ | | T n, D h | T m, D h | T n, D c | +------------------+----------+----------+----------+ | Total memory use | 2.38 GB | 2.36 GB | 2.37 GB | | Huge pages | 0 | 190 MB | 188 MB | +------------------+----------+----------+----------+ Performance: THP never, DAMOS_HUGEPAGE initial phase: 40,491 accesses/msec, 100001 msecs run low phase 0: 39,658 accesses/msec, 50002 msecs run high phase 0: 41,678 accesses/msec, 50000 msecs run low phase 1: 39,625 accesses/msec, 50003 msecs run high phase 1: 41,658 accesses/msec, 50002 msecs run low phase 2: 39,642 accesses/msec, 50002 msecs run high phase 2: 41,640 accesses/msec, 50001 msecs run THP madvise, DAMOS_HUGEPAGE initial phase: 51,977 accesses/msec, 100000 msecs run low phase 0: 86,953 accesses/msec, 50000 msecs run high phase 0: 94,812 accesses/msec, 50000 msecs run low phase 1: 101,017 accesses/msec, 50000 msecs run high phase 1: 94,841 accesses/msec, 50000 msecs run low phase 2: 100,993 accesses/msec, 50000 msecs run high phase 2: 94,791 accesses/msec, 50001 msecs run THP never, DAMOS_COLLAPSE initial phase: 93,678 accesses/msec, 100001 msecs run low phase 0: 101,475 accesses/msec, 50000 msecs run high phase 0: 98,589 accesses/msec, 50000 msecs run low phase 1: 101,531 accesses/msec, 50001 msecs run high phase 1: 98,506 accesses/msec, 50001 msecs run low phase 2: 101,458 accesses/msec, 50001 msecs run high phase 2: 98,555 accesses/msec, 50000 msecs run Memory consumption dynamic (how quickly collapses occur): It shows in seconds how many huge pages are allocated. +----+----------+----------+ | | T m, D h | T n, D c | +----+----------+----------+ | 5 | 32 | 188 | | 10 | 48 | 188 | | 15 | 64 | 188 | | 20 | 96 | 188 | | 30 | 112 | 188 | | 35 | 144 | 188 | | 40 | 160 | 188 | | 45 | 190 | 188 | | 50 | 190 | 188 | | 55 | 190 | 188 | | 60 | 190 | 188 | +----+----------+----------+ ========= - We can see that DAMOS "hugepage" action works only when THP is set to madvise. "collapse" action works even when THP is set to never. - Performance for "collapse" action is slightly lower than "hugepage" action and THP madvise. This is due to the fact that collapases occur synchronously. With "hugepage" they may occur during page faults. - Memory consumption is slighly lower for "collapse" than "hugepage" with THP madvise. This is due to the khugepage collapses all VMAs, while "collapse" action only collapses the VMAs in the hot region. - There is an improvement in TLB utilization when collapse through "hugepage" or "collapse" actions are triggered. The amount of TLB misses is lower. - "collapse" action is performance synchronously, which means that page collapses happen earlier and more rapidly. This can be useful or not, depending on the scenario. - "hugepage" action may trigger a VMA split in some scenarios, since it needs to change the flag of the VMA to THP enabled. This may lead to additional overhead. Collapse action just adds a new option to chose the correct system balance. Link: https://lore.kernel.org/20260426231619.107231-5-sj@kernel.org Link: https://lore.kernel.org/damon/20260313000816.79933-1-sj@kernel.org/ [1] Signed-off-by: Asier Gutierrez <gutierrez.asier@huawei-partners.com> Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Cheng-Han Wu <hank20010209@gmail.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Liew Rui Yan <aethernet65535@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28mm/sparse-vmemmap: pass @pgmap argument to memory deactivation pathsMuchun Song
Currently, the memory hot-remove call chain -- arch_remove_memory(), __remove_pages(), sparse_remove_section() and section_deactivate() -- does not carry the struct dev_pagemap pointer. This prevents the lower levels from knowing whether the section was originally populated with vmemmap optimizations (e.g., DAX with vmemmap optimization enabled). Without this information, we cannot call vmemmap_can_optimize() to determine if the vmemmap pages were optimized. As a result, the vmemmap page accounting during teardown will mistakenly assume a non-optimized allocation, leading to incorrect memmap statistics. To lay the groundwork for fixing the vmemmap page accounting, we need to pass the @pgmap pointer down to the deactivation location. Plumb the @pgmap argument through the APIs of arch_remove_memory(), __remove_pages() and sparse_remove_section(), mirroring the corresponding *_activate() paths. Link: https://lore.kernel.org/20260428081855.1249045-4-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Liam R. Howlett <liam@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28mm/vmpressure: skip socket pressure for costly order reclaimJP Kobryn (Meta)
When reclaim is triggered by high order allocations on a fragmented system, vmpressure() can report poor reclaim efficiency even though the system has plenty of free memory. This is because many pages are scanned, but few are found to actually reclaim - the pages are actively in use and don't need to be freed. The resulting scan:reclaim ratio causes vmpressure() to assert socket pressure, throttling TCP throughput unnecessarily. Costly order allocations (above PAGE_ALLOC_COSTLY_ORDER) rely heavily on compaction to succeed, so poor reclaim efficiency at these orders does not necessarily indicate memory pressure. The kernel already treats this order as the boundary where reclaim is no longer expected to succeed and compaction may take over. Make vmpressure() order-aware through an additional parameter sourced from scan_control at existing call sites. Socket pressure is now only asserted when order <= PAGE_ALLOC_COSTLY_ORDER. Memcg reclaim is unaffected since try_to_free_mem_cgroup_pages() always uses order 0, which passes the filter unconditionally. Similarly, vmpressure_prio() now passes order 0 internally when calling vmpressure(), ensuring critical pressure from low reclaim priority is not suppressed by the order filter. The patch was motivated by a case of impacted net throughput in production. On one affected host, the memory state at the time showed ~15GB available, zero cgroup pressure, and the following buddyinfo state: Order FreePages 0: 133,970 1: 29,230 2: 17,351 3: 18,984 7+: 0 Using bpf, it was found that 94% of vmpressure calls on this host were from order-7 kswapd reclaim. TCP minimum recv window is rcv_ssthresh:19712. Before patch: 723 out of 3,843 (19%) TCP connections stuck at minimum recv window After live-patching and ~30min elapsed: 0 out of 3,470 TCP connections stuck at minimum recv window Link: https://lore.kernel.org/20260406195014.112521-1-jp.kobryn@linux.dev Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Qi Zheng <qi.zheng@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28mm/sparse: remove sparse buffer pre-allocation mechanismMuchun Song
Commit 9bdac9142407 ("sparsemem: Put mem map for one node together.") introduced a mechanism to pre-allocate a large memory block to hold all memmaps for a NUMA node upfront. However, the original commit message did not clearly state the actual benefits or the necessity of explicitly pre-allocating a single chunk for all memmap areas of a given node. One of the concerns about removing this pre-allocation is that the subsequent per-section memmap allocations could become scattered around, and might turn too many memory blocks/sections into an "un-offlinable" state. However, tests show that even without the explicit node-wide pre-allocation, memblock still allocates memory closely and back-to-back. When tracing vmemmap_set_pmd allocations, the physical chunks allocated by memblock are strictly adjacent to each other in a single contiguous physical range (mapped top-down). Because they are packed tightly together naturally, they will at most consume or pollute the exact same number of memory blocks as the explicit pre-allocation did. Another concern is the boot performance impact of calling memmap_alloc() multiple times compared to one large node-wide allocation. Tests on a 256GB VM showed that memmap allocation time increased from 199,555 ns to 741,292 ns. Even though it is 3.7x slower, on a 1TB machine, the entire memory allocation time would only take a few milliseconds. This boot performance difference is completely negligible. Since no negative impact on memory offlining behavior or noticeable boot performance regression was found, this patch proposes removing the explicit node-wide memmap pre-allocation mechanism to reduce the maintenance burden. Link: https://lore.kernel.org/20260410092419.2446420-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28mm/damon/core: introduce failed region quota charge ratioSeongJae Park
DAMOS quota is charged to all DAMOS action application attempted memory, regardless of how much of the memory the action was successful and failed. This makes understanding quota behavior without DAMOS stat but only with end level metrics (e.g., increased amount of free memory for DAMOS_PAGEOUT action) difficult. Also, charging action-failed memory same as action-successful memory is somewhat unfair, as successful action application will induce more overhead in most cases. Introduce DAMON core API for setting the charge ratio for such action-failed memory. It allows API callers to specify the ratio in a flexible way, by setting the numerator and the denominator. Link: https://lore.kernel.org/20260428013402.115171-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>