Age | Commit message | Author
2025-12-24 | riscv: dts: sophgo: cv180x: fix USB dwc2 FIFO sizes | Anton D. Stavinskii
I've tested the current dwc2 FIFO configuration and found that USB device mode breaks in ECM mode when transmitting frames larger than 128 bytes. For example, large ICMP packets or iperf3 traffic cause the USB link to hang and eventually disconnect without any messages in dmesg. After switching to more conservative FIFO sizes, ECM becomes stable and no longer drops the connection. iperf3 now shows ~130 Mbit/s RX and ~100 Mbit/s TX on SG2002 (MilkV Duo 256M). Fix the FIFO sizes accordingly. Signed-off-by: Anton D. Stavinskii <stavinsky@gmail.com> Reviewed-by: Inochi Amaoto <inochiama@gmail.com> Fixes: e307248a3c2d ("riscv: dts: sophgo: Add USB support for cv18xx") Link: https://lore.kernel.org/r/20251126172115.1894190-2-stavinsky@gmail.com Signed-off-by: Inochi Amaoto <inochiama@gmail.com> Signed-off-by: Chen Wang <unicorn_wang@outlook.com> Signed-off-by: Chen Wang <wangchen20@iscas.ac.cn>
2025-12-24 | mhi: host: Add support for loading dual ELF image format | Qiang Yu
Currently, the FBC image contains a single ELF header followed by segments for both SBL and WLAN FW. However, TME-L (Trust Management Engine Lite) supported devices (e.g., QCC2072) require separate ELF headers for SBL and WLAN FW segments due to TME-L image authentication requirements.

Current image format contains two sections in a single binary:
- First 512KB: ELF header + SBL segments
- Remaining: WLAN FW segments (raw data)

The TME-L supported image format contains two complete ELF files in a single binary:
- First 512KB: Complete SBL ELF file (ELF header + SBL segments)
- Remaining: Complete WLAN FW ELF file (ELF header + WLAN FW segments)

Download behavior:
- Legacy:
  1. First 512KB via BHI (ELF header + SBL)
  2. Full image via BHIe
- TME-L:
  1. First 512KB via BHI (SBL ELF file)
  2. Remaining via BHIe (WLAN FW ELF file only)

Add runtime detection to automatically identify the image format by checking for the presence of a second ELF header at the 512KB boundary. When detected, MHI skips the first 512KB during WLAN FW download over BHIe as it is loaded in BHI phase.

Signed-off-by: Qiang Yu <qiang.yu@oss.qualcomm.com>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
Link: https://patch.msgid.link/20251223-wlan_image_load_skip_512k-v5-1-8d4459d720b5@oss.qualcomm.com
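The format detection described in this commit can be sketched in userspace C as follows. This is only an illustration of the idea, not the MHI driver code: the helper names `has_second_elf_header()` and `bhie_start_offset()` are hypothetical, and the only facts taken from the commit message are the 512KB boundary and the check for a second ELF header there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Boundary named in the commit message: the SBL part occupies the first 512KB. */
#define SBL_REGION_SIZE (512u * 1024u)

/* Hypothetical helper: true if the ELF magic (0x7f 'E' 'L' 'F') sits at the 512KB boundary. */
static bool has_second_elf_header(const uint8_t *img, size_t len)
{
	static const uint8_t elf_magic[4] = { 0x7f, 'E', 'L', 'F' };

	if (len < SBL_REGION_SIZE + sizeof(elf_magic))
		return false;
	return memcmp(img + SBL_REGION_SIZE, elf_magic, sizeof(elf_magic)) == 0;
}

/*
 * Hypothetical helper: offset at which the BHIe download would start.
 * Legacy images are sent in full (offset 0); dual-ELF images skip the
 * first 512KB because that part was already loaded over BHI.
 */
static size_t bhie_start_offset(const uint8_t *img, size_t len)
{
	return has_second_elf_header(img, len) ? SBL_REGION_SIZE : 0;
}
```

A caller would pass the firmware buffer and its length, then download `len - bhie_start_offset(img, len)` bytes starting at the returned offset.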
2025-12-24 | drm/i915/cx0: Use the consolidated HDMI tables | Suraj Kandpal
Use the consolidated HDMI tables before trying to compute the values algorithmically. The table entries are the ideal values, whereas the values calculated by the HDMI algorithm are correct but not always ideal. This is done for C20 and already exists for C10. Signed-off-by: Suraj Kandpal <suraj.kandpal@intel.com> Reviewed-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com> Link: https://patch.msgid.link/20251223063422.1444968-1-suraj.kandpal@intel.com
2025-12-24 | riscv: dts: spacemit: PCIe and PHY-related updates | Alex Elder
Define PCIe and PHY-related Device Tree nodes for the SpacemiT K1 SoC. Enable the combo PHY and the two PCIe-only PHYs on the Banana Pi BPI-F3 board. The combo PHY is used for USB on this board, and that will be enabled when USB 3 support is accepted. The combo PHY must perform a calibration step to determine configuration values used by the PCIe-only PHYs. As a result, it must be enabled if either of the other two PHYs is enabled. Signed-off-by: Alex Elder <elder@riscstar.com> Reviewed-by: Yixun Lan <dlan@gentoo.org> Tested-by: Yixun Lan <dlan@gentoo.org> Link: https://lore.kernel.org/r/20251218151235.454997-6-elder@riscstar.com Signed-off-by: Yixun Lan <dlan@gentoo.org>
2025-12-24 | riscv: dts: spacemit: Add a PCIe regulator | Alex Elder
Define a 3.3v fixed voltage regulator to be used by PCIe on the Banana Pi BPI-F3. On this platform, this regulator is always on. Signed-off-by: Alex Elder <elder@riscstar.com> Reviewed-by: Yixun Lan <dlan@gentoo.org> Tested-by: Yixun Lan <dlan@gentoo.org> Link: https://lore.kernel.org/r/20251218151235.454997-5-elder@riscstar.com Signed-off-by: Yixun Lan <dlan@gentoo.org>
2025-12-23 | Merge remote-tracking branch 'torvalds/master' into perf-tools-next | Arnaldo Carvalho de Melo
To pick up fixes from perf-tools. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-12-23 | PCI: trace: Add RAS tracepoint to monitor link speed changes | Shuai Xue
PCIe link speed degradation directly impacts system performance and often indicates hardware issues such as faulty devices, physical layer problems, or configuration errors. To this end, add a RAS tracepoint to monitor link speed changes, enabling proactive health checks and diagnostic analysis. The following output is generated when a device is hotplugged: $ echo 1 > /sys/kernel/debug/tracing/events/pci/pcie_link_event/enable $ cat /sys/kernel/debug/tracing/trace_pipe irq/51-pciehp-88 [001] ..... 381.545386: pcie_link_event: 0000:00:02.0 type:4, reason:4, cur_bus_speed:20, max_bus_speed:23, width:1, flit_mode:0, status:DLLLA Suggested-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Suggested-by: Matthew W Carlis <mattc@purestorage.com> Suggested-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://patch.msgid.link/20251210132907.58799-3-xueshuai@linux.alibaba.com
2025-12-23 | PCI: trace: Add generic RAS tracepoint for hotplug event | Shuai Xue
Hotplug events are critical indicators for analyzing hardware health, and surprise link downs can significantly impact system performance and reliability. Define a new TRACING_SYSTEM named "pci", add a generic RAS tracepoint for hotplug event to help health checks. Add enum pci_hotplug_event in include/uapi/linux/pci.h so applications like rasdaemon can register tracepoint event handlers for it. The following output is generated when a device is hotplugged: $ echo 1 > /sys/kernel/debug/tracing/events/pci/pci_hp_event/enable $ cat /sys/kernel/debug/tracing/trace_pipe irq/51-pciehp-88 [001] ..... 1311.177459: pci_hp_event: 0000:00:02.0 slot:10, event:CARD_PRESENT irq/51-pciehp-88 [001] ..... 1311.177566: pci_hp_event: 0000:00:02.0 slot:10, event:LINK_UP Suggested-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Lukas Wunner <lukas@wunner.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for trace event Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://patch.msgid.link/20251210132907.58799-2-xueshuai@linux.alibaba.com
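A userspace consumer could pick these events out of trace_pipe along the following lines. This is a sketch that assumes the text layout shown in the sample output above (device, `slot:`, `event:`); `struct hp_event` and `parse_hp_event()` are hypothetical names, and a real tool such as rasdaemon would more likely register a handler for the tracepoint than parse text.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the three fields printed by pci_hp_event. */
struct hp_event {
	char dev[16];	/* e.g. "0000:00:02.0" */
	int slot;
	char event[32];	/* e.g. "CARD_PRESENT" */
};

/* Returns 0 on success, -1 if the line does not carry a complete pci_hp_event record. */
static int parse_hp_event(const char *line, struct hp_event *ev)
{
	const char *p = strstr(line, "pci_hp_event:");

	if (!p)
		return -1;
	/* Field widths are bounded to match the buffer sizes above. */
	if (sscanf(p, "pci_hp_event: %15s slot:%d, event:%31s",
		   ev->dev, &ev->slot, ev->event) != 3)
		return -1;
	return 0;
}
```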
2025-12-23 | PCI: Use resource_set_range() that correctly sets ->end | Ilpo Järvinen
__pci_read_base() sets resource start and end addresses when the resource is larger than 4G but pci_bus_addr_t or resource_size_t is not capable of representing 64-bit PCI addresses. This creates a problematic resource that has non-zero flags but whose start and end addresses yield a resource size of 1 rather than 0. Replace the custom resource address setup with resource_set_range(), which correctly sets the end address to -1 so that resource_size() returns 0. For consistency, also use resource_set_range() in the other branch that does size-based resource setup. Fixes: 23b13bc76f35 ("PCI: Fail safely if we can't handle BARs larger than 4GB") Link: https://lore.kernel.org/all/20251207215359.28895-1-ansuelsmth@gmail.com/T/#m990492684913c5a158ff0e5fc90697d8ad95351b Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com> Cc: stable@vger.kernel.org Cc: Christian Marangi <ansuelsmth@gmail.com> Link: https://patch.msgid.link/20251208145654.5294-1-ilpo.jarvinen@linux.intel.com
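The size arithmetic behind this fix can be shown with a small userspace model. `struct res_model` and its helpers are simplified stand-ins written for this illustration, not the kernel's `struct resource` API; the only assumption is that a resource is an inclusive [start, end] range, so its size is end - start + 1.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for struct resource: an inclusive [start, end] range. */
struct res_model {
	uint64_t start;
	uint64_t end;
};

/* Inclusive range, so size = end - start + 1 (wraps to 0 when end == start - 1). */
static uint64_t res_model_size(const struct res_model *r)
{
	return r->end - r->start + 1;
}

/* Derive end from start and size, as a set-range style helper would. */
static void res_model_set_range(struct res_model *r, uint64_t start, uint64_t size)
{
	r->start = start;
	r->end = start + size - 1;	/* size 0 wraps end to start - 1, i.e. -1 for start 0 */
}
```

Setting start and end both to 0 by hand describes a one-byte resource, while `res_model_set_range(&r, 0, 0)` leaves end at all-ones and the computed size at 0.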
2025-12-23 | PCI: endpoint: Avoid creating sub-groups asynchronously | Liu Song
The asynchronous creation of sub-groups by a delayed work could lead to a NULL pointer dereference when the driver directory is removed before the work completes. The crash can be easily reproduced with the following commands:

  # cd /sys/kernel/config/pci_ep/functions/pci_epf_test
  # for i in {1..20}; do mkdir test && rmdir test; done

  BUG: kernel NULL pointer dereference, address: 0000000000000088
  ...
  Call Trace:
   configfs_register_group+0x3d/0x190
   pci_epf_cfs_work+0x41/0x110
   process_one_work+0x18f/0x350
   worker_thread+0x25a/0x3a0

Fix this issue by using the configfs_add_default_group() API, which does not have the deadlock problem of configfs_register_group() and does not require the delayed work handler.

Fixes: e85a2d783762 ("PCI: endpoint: Add support in configfs to associate two EPCs with EPF")
Signed-off-by: Liu Song <liu.song13@zte.com.cn>
[mani: slightly reworded the description and added stable list]
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@kernel.org
Link: https://patch.msgid.link/20250710143845409gLM6JdlwPhlHG9iX3F6jK@zte.com.cn
2025-12-23 | Documentation: PCI: endpoint: Fix ntb/vntb copy & paste errors | Baruch Siach
Fix copy & paste errors by changing the references from 'ntb' to 'vntb'. Fixes: 4ac8c8e52cd9 ("Documentation: PCI: Add specification for the PCI vNTB function device") Signed-off-by: Baruch Siach <baruch@tkos.co.il> [mani: squashed the patches and fixed more errors] Signed-off-by: Manivannan Sadhasivam <mani@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://patch.msgid.link/b51c2a69ffdbfa2c359f5cf33f3ad2acc3db87e4.1762154911.git.baruch@tkos.co.il
2025-12-23 | drm/xe/guc: READ/WRITE_ONCE ct->state | Jonathan Cavitt
Use READ_ONCE and WRITE_ONCE when operating on ct->state to prevent the compiler from ignoring important modifications to its value. Suggested-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20251222201957.63245-6-jonathan.cavitt@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-12-23 | drm/xe/guc: READ/WRITE_ONCE g2h_fence->done | Jonathan Cavitt
Use READ_ONCE and WRITE_ONCE when operating on g2h_fence->done to prevent the compiler from ignoring important modifications to its value. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Suggested-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20251222201957.63245-5-jonathan.cavitt@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
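The effect of these annotations can be approximated in userspace with volatile accesses. The macros below are simplified models written for this illustration (they rely on the GNU C `__typeof__` extension and are not the kernel's definitions), and `struct fence_model` is a hypothetical stand-in for a structure with a `done` flag. They force the compiler to emit exactly one load or store per use; they say nothing about memory ordering between CPUs.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified models: one volatile load or store per use, never cached or elided. */
#define MODEL_READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))
#define MODEL_WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical stand-in for a fence with a completion flag. */
struct fence_model {
	bool done;
};

/* Writer side: the store cannot be dropped or merged by the compiler. */
static void fence_model_signal(struct fence_model *f)
{
	MODEL_WRITE_ONCE(f->done, true);
}

/* Reader side: every call re-reads the flag from memory. */
static bool fence_model_is_done(struct fence_model *f)
{
	return MODEL_READ_ONCE(f->done);
}
```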
2025-12-23 | ecryptfs: Drop redundant NUL terminations after calling ecryptfs_to_hex | Thorsten Blum
ecryptfs_to_hex() already NUL-terminates the destination buffers. Drop the manual NUL terminations. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Tyler Hicks <code@tyhicks.com>
2025-12-23 | ecryptfs: Replace memcpy + NUL termination in ecryptfs_new_file_context | Thorsten Blum
Use strscpy() to copy the NUL-terminated '->global_default_cipher_name' to the destination buffer instead of using memcpy() followed by a manual NUL termination. Remove the now-unused local variable 'cipher_name_len'. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Tyler Hicks <code@tyhicks.com>
2025-12-23 | ecryptfs: Replace strcpy with strscpy in ecryptfs_validate_options | Thorsten Blum
strcpy() has been deprecated [1] because it performs no bounds checking on the destination buffer, which can lead to buffer overflows. Replace it with the safer strscpy(). Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy [1] Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Tyler Hicks <code@tyhicks.com>
2025-12-23 | ecryptfs: Replace strcpy with strscpy in ecryptfs_cipher_code_to_string | Thorsten Blum
strcpy() has been deprecated [1] because it performs no bounds checking on the destination buffer, which can lead to buffer overflows. Since the parameter 'char *str' is just a pointer with no size information, extend the function with a 'size' parameter to pass the destination buffer's size as an additional argument. Adjust the call sites accordingly. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy [1] Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Tyler Hicks <code@tyhicks.com>
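Why a size argument is needed can be seen from a userspace model of a bounded copy. `model_strscpy()` below is an illustrative reimplementation written for this note, assuming the documented strscpy() contract (always NUL-terminate a non-empty destination, return the copied length, or -E2BIG on truncation); it is not the kernel's implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Copy at most size - 1 bytes of src into dst and always NUL-terminate.
 * Returns the number of bytes copied (excluding the NUL), or -E2BIG if
 * src did not fit or size is 0.
 */
static long model_strscpy(char *dst, const char *src, size_t size)
{
	size_t i;

	if (size == 0)
		return -E2BIG;
	for (i = 0; i < size - 1 && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';
	return src[i] == '\0' ? (long)i : -E2BIG;
}
```

With only a `char *str` parameter the callee cannot know `size`, which is why the commit threads the destination size through the function signature.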
2025-12-23 | ecryptfs: Replace strcpy with strscpy in ecryptfs_set_default_crypt_stat_vals | Thorsten Blum
strcpy() has been deprecated [1] because it performs no bounds checking on the destination buffer, which can lead to buffer overflows. Replace it with the safer strscpy(). Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy [1] Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Tyler Hicks <code@tyhicks.com>
2025-12-23 | ecryptfs: simplify list initialization in ecryptfs_parse_packet_set() | Baolin Liu
In ecryptfs_parse_packet_set(), use LIST_HEAD() to declare and initialize the 'auth_tok_list' list in one step instead of using INIT_LIST_HEAD() separately. No functional change. Signed-off-by: Baolin Liu <liubaolin@kylinos.cn> Signed-off-by: Tyler Hicks <code@tyhicks.com>
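The two initialization styles are equivalent, as a minimal userspace model of a circular doubly linked list head shows. The `MODEL_`/`model_` names are stand-ins written for this sketch, not the kernel's list.h definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular doubly linked list node; an empty head points at itself. */
struct list_node {
	struct list_node *next, *prev;
};

/* One-step form: declare and initialize an empty head (the LIST_HEAD() idea). */
#define MODEL_LIST_HEAD(name) struct list_node name = { &(name), &(name) }

/* Two-step form: initialize an already declared head (the INIT_LIST_HEAD() idea). */
static void model_init_list_head(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

static bool model_list_empty(const struct list_node *head)
{
	return head->next == head;
}
```

Both forms leave the head pointing at itself; the one-step form simply removes the window in which the variable exists but is not yet initialized.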
2025-12-23 | ecryptfs: Remove unused declartion ecryptfs_fill_zeros() | Zhang Zekun
The definition of ecryptfs_fill_zeros() was removed by commit b6c1d8fcbade ("eCryptfs: remove unused functions and kmem_cache"), so remove the leftover declaration from the header files. Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com> Signed-off-by: Tyler Hicks <code@tyhicks.com>
2025-12-23 | ecryptfs: Fix packet format comment in parse_tag_67_packet() | Thorsten Blum
s/TAG 65/TAG 67/ Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Tyler Hicks <code@tyhicks.com>
2025-12-23 | ecryptfs: comment typo fix | Zipeng Zhang
Comment typo fix "vitual" -> "virtual". Signed-off-by: Zipeng Zhang <zhangzipeng0@foxmail.com> Signed-off-by: Tyler Hicks <code@tyhicks.com>
2025-12-23 | ecryptfs: keystore: Fix typo 'the the' in comment | Slark Xiao
Replace 'the the' with 'the' in the comment. Signed-off-by: Slark Xiao <slark_xiao@163.com> Signed-off-by: Tyler Hicks <code@tyhicks.com>
2025-12-23 | vfio: selftests: Drop <uapi/linux/types.h> includes | David Matlack
Drop the <uapi/linux/types.h> includes now that <linux/types.h> (tools/include/linux/types.h) has a definition for __aligned_le64, which is needed by <linux/iommufd.h>. Including <uapi/linux/types.h> is harmless but causes benign typedef redefinitions. This is not a problem for VFIO selftests but becomes an issue when the VFIO selftests library is built into KVM selftests, since they are built with -std=gnu99, which does not allow typedef redefinitions. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20251219233818.1965306-3-dmatlack@google.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-12-23 | tools include: Add definitions for __aligned_{l,b}e64 | David Matlack
Add definitions for the missing __aligned_le64 and __aligned_be64 to tools/include/linux/types.h. The former is needed by <linux/iommufd.h> for builds where tools/include/ is on the include path ahead of usr/include/. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20251219233818.1965306-2-dmatlack@google.com Signed-off-by: Alex Williamson <alex@shazbot.org>
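What such an aligned 64-bit type buys a UAPI structure can be sketched as below. This assumes the added definitions follow the usual pattern of pairing a 64-bit type with an 8-byte alignment attribute; `MODEL_ALIGNED_LE64` and `struct model_uapi_args` are hypothetical names for this illustration, and endianness is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in: a 64-bit field forced to 8-byte alignment (GNU C attribute). */
#define MODEL_ALIGNED_LE64 uint64_t __attribute__((aligned(8)))

/* Hypothetical ioctl-style argument block shared between userspace and kernel. */
struct model_uapi_args {
	uint32_t size;
	MODEL_ALIGNED_LE64 data_uptr;	/* lands at offset 8 on every ABI */
};
```

Without the attribute, an ABI that aligns 64-bit integers to 4 bytes would place `data_uptr` at offset 4, so 32-bit and 64-bit builds would disagree on the layout.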
2025-12-23 | vfio/xe: Add default handler for .get_region_info_caps | Michal Wajdeczko
A new requirement for vfio drivers was added by commit f97859503859 ("vfio: Require drivers to implement get_region_info"), followed by commit 1b0ecb5baf4a ("vfio/pci: Convert all PCI drivers to get_region_info_caps"), and was missed by the new vfio/xe driver. Add a handler for .get_region_info_caps to avoid -EINVAL errors. Fixes: 2e38c50ae492 ("vfio/xe: Add device specific vfio_pci driver variant for Intel graphics") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Marcin Bernatowicz <marcin.bernatowicz@linux.intel.com> Tested-by: Marcin Bernatowicz <marcin.bernatowicz@linux.intel.com> Link: https://lore.kernel.org/r/20251218205106.4578-1-michal.wajdeczko@intel.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-12-23 | vfio/pci: Disable qword access to the VGA region | Kevin Tian
There seems to be no reason to allow qword access to the old VGA resource. Restrict it to dword access as before. Suggested-by: Alex Williamson <alex@shazbot.org> Signed-off-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20251218081650.555015-3-kevin.tian@intel.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-12-23 | vfio/pci: Disable qword access to the PCI ROM bar | Kevin Tian
Commit 2b938e3db335 ("vfio/pci: Enable iowrite64 and ioread64 for vfio pci") enables qword access to the PCI BAR resources. However, certain devices (e.g. Intel X710) have been observed to misbehave on qword accesses to the ROM BAR, e.g. triggering PCI AER errors. This is triggered by QEMU, which caches the ROM content by simply doing a pread() of the remaining size until it gets the full contents. The other BARs would only see operations at the same access width as their guest drivers use. Instead of trying to identify all broken devices, universally disable qword access to the ROM BAR, i.e. go back to the old way, which worked reliably for years. Reported-by: Farrah Chen <farrah.chen@intel.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220740 Fixes: 2b938e3db335 ("vfio/pci: Enable iowrite64 and ioread64 for vfio pci") Cc: stable@vger.kernel.org Signed-off-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Link: https://lore.kernel.org/r/20251218081650.555015-2-kevin.tian@intel.com Signed-off-by: Alex Williamson <alex@shazbot.org>
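How a large read decomposes into bus accesses, and what disabling qword access changes, can be sketched as follows. This is an illustrative model under the assumption that each step uses the widest naturally aligned access that fits; `pick_access_width()` and `count_qword_accesses()` are hypothetical helpers, not the vfio-pci code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Widest naturally aligned access (in bytes) usable at offset `off`. */
static size_t pick_access_width(uint64_t off, size_t remaining, bool allow_qword)
{
	if (allow_qword && remaining >= 8 && (off % 8) == 0)
		return 8;
	if (remaining >= 4 && (off % 4) == 0)
		return 4;
	if (remaining >= 2 && (off % 2) == 0)
		return 2;
	return remaining ? 1 : 0;
}

/* Number of 8-byte accesses issued when reading `count` bytes starting at `off`. */
static size_t count_qword_accesses(uint64_t off, size_t count, bool allow_qword)
{
	size_t qwords = 0;

	while (count) {
		size_t w = pick_access_width(off, count, allow_qword);

		if (w == 8)
			qwords++;
		off += w;
		count -= w;
	}
	return qwords;
}
```

A single large pread() at an aligned offset is split almost entirely into qword accesses when they are allowed, and into dword accesses when they are not.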
2025-12-23 | drm/xe/soc_remapper: Add system controller config for SoC remapper | Umesh Nerlige Ramappa
Define system controller config bits and helpers for SoC remapper. Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://patch.msgid.link/20251223183943.3175941-8-umesh.nerlige.ramappa@intel.com
2025-12-23 | drm/xe/soc_remapper: Use SoC remapper helper from VSEC code | Umesh Nerlige Ramappa
Since different drivers can use SoC remapper, modify VSEC code to access SoC remapper via a helper that would synchronize such accesses. Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://patch.msgid.link/20251223183943.3175941-7-umesh.nerlige.ramappa@intel.com
2025-12-23 | drm/xe/soc_remapper: Initialize SoC remapper during Xe probe | Umesh Nerlige Ramappa
SoC remapper is used to map different HW functions in the SoC to their respective drivers. Initialize SoC remapper during driver load. Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://patch.msgid.link/20251223183943.3175941-6-umesh.nerlige.ramappa@intel.com
2025-12-23 | Merge branch 'remove-kf_sleepable-from-arena-kfuncs' | Alexei Starovoitov
Puranjay Mohan says: ==================== Remove KF_SLEEPABLE from arena kfuncs V7: https://lore.kernel.org/all/20251222190815.4112944-1-puranjay@kernel.org/ Changes in V7->v8: - Use clear_lo32(arena->user_vm_start) in place of user_vm_start in patch 3 V6: https://lore.kernel.org/all/20251217184438.3557859-1-puranjay@kernel.org/ Changes in v6->v7: - Fix a deadlock in patch 1, that was being fixed in patch 2. Move the fix to patch 1. - Call flush_cache_vmap() after setting up the mappings as it is required by some architectures. V5: https://lore.kernel.org/all/20251212044516.37513-1-puranjay@kernel.org/ Changes in v5->v6: Patch 1: - Add a missing ; to make sure this patch builds individually. (AI) V4: https://lore.kernel.org/all/20251212004350.6520-1-puranjay@kernel.org/ Changes in v4->v5: Patch 1: - Fix a memory leak in arena_alloc_pages(), it was being fixed in Patch 3 but, every patch should be complete in itself. (AI) Patch 3: - Don't do useless addition in arena_alloc_pages() (Alexei) - Add a comment about kmalloc_nolock() failure and expectations. v3: https://lore.kernel.org/all/20251117160150.62183-1-puranjay@kernel.org/ Changes in v3->v4: - Coding style changes related to comments in Patch 2/3 (Alexei) v2: https://lore.kernel.org/all/20251114111700.43292-1-puranjay@kernel.org/ Changes in v2->v3: Patch 1: - Call range_tree_destroy() in error path of populate_pgtable_except_pte() in arena_map_alloc() (AI) Patch 2: - Fix double mutex_unlock() in the error path of arena_alloc_pages() (AI) - Fix coding style issues (Alexei) Patch 3: - Unlock spinlock before returning from arena_vm_fault() in case BPF_F_SEGV_ON_FAULT is set by user. (AI) - Use __llist_del_all() in place of llist_del_all for on-stack llist (free_pages) (Alexei) - Fix build issues on 32-bit systems where arena.c is not compiled. (kernel test robot) - Make bpf_arena_alloc_pages() polymorphic so it knows if it has been called in sleepable or non-sleepable context. 
This information is passed to arena_free_pages() in the error path. Patch 4: - Add a better comment for the big_alloc3() test that triggers kmalloc_nolock()'s limit and if bpf_arena_alloc_pages() works correctly above this limit. v1: https://lore.kernel.org/all/20251111163424.16471-1-puranjay@kernel.org/ Changes in v1->v2: Patch 1: - Import tlbflush.h to fix build issue in loongarch. (kernel test robot) - Fix unused variable error in apply_range_clear_cb() (kernel test robot) - Call bpf_map_area_free() on error path of populate_pgtable_except_pte() (AI) - Use PAGE_SIZE in apply_to_existing_page_range() (AI) Patch 2: - Cap allocation made by kmalloc_nolock() for pages array to KMALLOC_MAX_CACHE_SIZE and reuse the array in an explicit loop to overcome this limit. (AI) Patch 3: - Do page_ref_add(page, 1); under the spinlock to mitigate a race (AI) Patch 4: - Add a new testcase big_alloc3() in verifier_arena_large.c that tries to allocate a large number of pages at once, this is to trigger the kmalloc_nolock() limit in Patch 2 and see if the loop logic works correctly. This set allows arena kfuncs to be called from non-sleepable contexts. It is achieved by the following changes: The range_tree is now protected with a rqspinlock and not a mutex, this change is enough to make bpf_arena_reserve_pages() any context safe. bpf_arena_alloc_pages() had four points where it could sleep: 1. Mutex to protect range_tree: now replaced with rqspinlock 2. kvcalloc() for allocations: now replaced with kmalloc_nolock() 3. Allocating pages with bpf_map_alloc_pages(): this already calls alloc_pages_nolock() in non-sleepable contexts and therefore is safe. 4. Setting up kernel page tables with vm_area_map_pages(): vm_area_map_pages() may allocate memory while inserting pages into bpf arena's vm_area. 
Now, at arena creation time populate all page table levels except the last level and when new pages need to be inserted call apply_to_page_range() again which will only do set_pte_at() for those pages and will not allocate memory. The above four changes make bpf_arena_alloc_pages() any context safe. bpf_arena_free_pages() has to do the following steps: 1. Update the range_tree 2. vm_area_unmap_pages(): to unmap pages from kernel vm_area 3. flush the tlb: done in step 2, already. 4. zap_pages(): to unmap pages from user page tables 5. free pages. The third patch in this set makes bpf_arena_free_pages() polymorphic using the specialize_kfunc() mechanism. When called from a sleepable context, arena_free_pages() remains mostly unchanged except the following: 1. rqspinlock is taken now instead of the mutex for the range tree 2. Instead of using vm_area_unmap_pages() that can free intermediate page table levels, apply_to_existing_page_range() with a callback is used that only does pte_clear() on the last level and leaves the intermediate page table levels intact. This is needed to make sure that bpf_arena_alloc_pages() can safely do set_pte_at() without allocating intermediate page tables. When arena_free_pages() is called from a non-sleepable context or it fails to acquire the rqspinlock in the sleepable case, a lock-less list of struct arena_free_span is used to queue the uaddr and page cnt. kmalloc_nolock() is used to allocate this arena_free_span, this can fail but we need to make this trade-off for frees done from non-sleepable contexts. arena_free_pages() then raises an irq_work whose handler in turn schedules work that iterate this list and clears ptes, flushes tlbs, zap pages, and frees pages for the queued uaddr and page cnts. apply_range_clear_cb() with apply_to_existing_page_range() is used to clear PTEs and collect pages to be freed, struct llist_node pcp_llist; in the struct page is used to do this. 
==================== Link: https://patch.msgid.link/20251222195022.431211-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-23 | selftests: bpf: test non-sleepable arena allocations | Puranjay Mohan
As arena kfuncs can now be called from non-sleepable contexts, test this by adding non-sleepable copies of the tests in verifier_arena, using a socket program instead of a syscall program. Add a new test case in verifier_arena_large to check that bpf_arena_alloc_pages() works for more than 1024 pages. 1024 * sizeof(struct page *) is the upper limit of kmalloc_nolock(), but bpf_arena_alloc_pages() should still succeed because it reuses this array in a loop. Augment the arena_list selftest to also run in non-sleepable context by taking rcu_read_lock. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251222195022.431211-5-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-23 | bpf: arena: make arena kfuncs any context safe | Puranjay Mohan
Make arena related kfuncs any context safe by the following changes: bpf_arena_alloc_pages() and bpf_arena_reserve_pages(): Replace the usage of the mutex with a rqspinlock for range tree and use kmalloc_nolock() wherever needed. Use free_pages_nolock() to free pages from any context. apply_range_set/clear_cb() with apply_to_page_range() has already made populating the vm_area in bpf_arena_alloc_pages() any context safe. bpf_arena_free_pages(): defer the main logic to a workqueue if it is called from a non-sleepable context. specialize_kfunc() is used to replace the sleepable arena_free_pages() with bpf_arena_free_pages_non_sleepable() when the verifier detects the call is from a non-sleepable context. In the non-sleepable case, arena_free_pages() queues the address and the page count to be freed to a lock-less list of struct arena_free_span entries and raises an irq_work. The irq_work handler calls schedule_work(), as it is safe to call from irq context. arena_free_worker() (the work queue handler) iterates these spans and clears ptes, flushes tlb, zaps pages, and calls __free_page(). Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251222195022.431211-4-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
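The queue-then-drain pattern described for the non-sleepable free path can be sketched in userspace with C11 atomics. This is an illustration of a lock-less push list, not the arena code: `struct free_span`, `queue_free_span()` and `drain_free_spans()` are hypothetical names, malloc() stands in for kmalloc_nolock(), and the irq_work and workqueue hand-off is reduced to calling the drain function directly.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* One deferred free request: a user address and a page count. */
struct free_span {
	struct free_span *next;
	unsigned long uaddr;
	unsigned long page_cnt;
};

struct span_list {
	_Atomic(struct free_span *) first;
};

/* Producer: lock-free push. May fail if the allocation fails. */
static int queue_free_span(struct span_list *l, unsigned long uaddr,
			   unsigned long page_cnt)
{
	struct free_span *s = malloc(sizeof(*s));
	struct free_span *old;

	if (!s)
		return -1;
	s->uaddr = uaddr;
	s->page_cnt = page_cnt;
	old = atomic_load(&l->first);
	do {
		s->next = old;	/* retry with the refreshed head on CAS failure */
	} while (!atomic_compare_exchange_weak(&l->first, &old, s));
	return 0;
}

/* Consumer (worker): detach the whole list at once, then walk it privately. */
static unsigned long drain_free_spans(struct span_list *l)
{
	struct free_span *s = atomic_exchange(&l->first, (struct free_span *)NULL);
	unsigned long pages = 0;

	while (s) {
		struct free_span *next = s->next;

		pages += s->page_cnt;	/* real code would clear PTEs and free pages */
		free(s);
		s = next;
	}
	return pages;
}
```

Detaching the entire list with one atomic exchange means the worker never contends with producers while it processes the spans.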
2025-12-23 | bpf: arena: use kmalloc_nolock() in place of kvcalloc() | Puranjay Mohan
To make arena_alloc_pages() safe to be called from any context, replace kvcalloc() with kmalloc_nolock() so that it doesn't sleep or take any locks. kmalloc_nolock() returns NULL for allocations larger than KMALLOC_MAX_CACHE_SIZE, which is (PAGE_SIZE * 2) = 8KB on systems with 4KB pages. So, cap the allocation done by kmalloc_nolock() at 1024 * 8 bytes and reuse the array in a loop. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251222195022.431211-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
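The capped-array loop can be sketched in userspace as follows. This is a minimal model, assuming only what the commit message states (an 8KB allocation cap and reuse of the array across iterations); `alloc_in_batches()` is a hypothetical name, malloc() stands in for kmalloc_nolock(), and the per-page work is reduced to a placeholder store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* 8KB cap quoted in the commit message for systems with 4KB pages. */
#define MODEL_MAX_ALLOC_BYTES (8u * 1024u)

/*
 * Handle page_cnt pages using a scratch pointer array that never exceeds
 * the cap. Returns the number of batches used (0 for an empty request or
 * on allocation failure) and reports the pages handled via *pages_done.
 */
static size_t alloc_in_batches(size_t page_cnt, size_t *pages_done)
{
	size_t max_entries = MODEL_MAX_ALLOC_BYTES / sizeof(void *);
	size_t entries = page_cnt < max_entries ? page_cnt : max_entries;
	size_t batches = 0;
	void **scratch;

	*pages_done = 0;
	if (page_cnt == 0)
		return 0;
	scratch = malloc(entries * sizeof(*scratch));
	if (!scratch)
		return 0;

	while (*pages_done < page_cnt) {
		size_t left = page_cnt - *pages_done;
		size_t n = left < entries ? left : entries;
		size_t i;

		for (i = 0; i < n; i++)
			scratch[i] = NULL;	/* placeholder for one page pointer */
		*pages_done += n;		/* the same array is reused next round */
		batches++;
	}
	free(scratch);
	return batches;
}
```

With 8-byte pointers the cap works out to 1024 entries, so a request for 1025 pages takes two passes over the same array.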
2025-12-23 | bpf: arena: populate vm_area without allocating memory | Puranjay Mohan
vm_area_map_pages() may allocate memory while inserting pages into bpf arena's vm_area. In order to make bpf_arena_alloc_pages() kfunc non-sleepable change bpf arena to populate pages without allocating memory: - at arena creation time populate all page table levels except the last level - when new pages need to be inserted call apply_to_page_range() again with apply_range_set_cb() which will only set_pte_at() those pages and will not allocate memory. - when freeing pages call apply_to_existing_page_range with apply_range_clear_cb() to clear the pte for the page to be removed. This doesn't free intermediate page table levels. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251222195022.431211-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-23 | mm/ksm: fix pte_unmap_unlock of wrong address in break_ksm_pmd_entry | Sasha Levin
On ARM32 with HIGHMEM/HIGHPTE, break_ksm_pmd_entry() triggers a BUG during KSM unmerging because pte_unmap_unlock() is passed a pointer that may be beyond the mapped PTE page. The issue occurs when the PTE iteration loop completes without finding a KSM page. After the loop, 'ptep' has been incremented past the last PTE entry. On ARM32 LPAE with 512 PTEs per page (512 * 8 = 4096 bytes), this means ptep points to the next page, outside the kmap'd region. When pte_unmap_unlock(ptep, ptl) calls kunmap_local(ptep), it unmaps the wrong page address, leaving the original kmap slot still mapped. The next kmap_local then finds this slot unexpectedly occupied: WARNING: mm/highmem.c:622 kunmap_local_indexed (address mismatch) kernel BUG at mm/highmem.c:564 __kmap_local_pfn_prot (slot not empty) Fix this by passing start_ptep to pte_unmap_unlock(), which always points within the originally mapped PTE page. Reproducer: Run LTP ksm03 test on ARM32 with HIGHMEM enabled. The test triggers KSM merging followed by unmerging (writing 0 then 2 to /sys/kernel/mm/ksm/run), which exercises break_ksm_pmd_entry(). Link: https://lkml.kernel.org/r/20251220202926.318366-1-sashal@kernel.org Fixes: 5d4939fc2258 ("ksm: perform a range-walk in break_ksm") Signed-off-by: Sasha Levin <sashal@kernel.org> Assisted-by: claude-opus-4-5-20251101 Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com> Cc: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
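The pointer arithmetic behind this bug can be reproduced with a plain array. The sketch below is a userspace model, not the KSM code: `scan_table()` and the `MODEL_`/`model_` names are hypothetical, and the only figures taken from the commit message are 512 entries of 8 bytes per 4096-byte page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PTRS_PER_PTE 512	/* ARM32 LPAE figure quoted in the commit message */
#define MODEL_PAGE_SIZE 4096

typedef uint64_t model_pte_t;	/* 8-byte entries: 512 * 8 = 4096 bytes */

/*
 * Scan a full table for a non-zero entry. *end_ptep receives the iterator
 * as the loop left it. The return value is the pointer that is safe to
 * hand to an unmap call: the start of the table, which always lies inside
 * the mapped page.
 */
static model_pte_t *scan_table(model_pte_t *start_ptep, model_pte_t **end_ptep,
			       int *found)
{
	model_pte_t *ptep = start_ptep;
	int i;

	*found = 0;
	for (i = 0; i < MODEL_PTRS_PER_PTE; i++, ptep++) {
		if (*ptep != 0) {
			*found = 1;
			break;
		}
	}
	*end_ptep = ptep;	/* one past the last entry if nothing was found */
	return start_ptep;
}
```

When nothing is found, the iterator ends exactly one page past the start, which is why unmapping through it targets the wrong page while the saved start pointer does not.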
2025-12-23 | mm/page_owner: fix memory leak in page_owner_stack_fops->release() | Ran Xiaokai
The page_owner_stack_fops->open() callback invokes seq_open_private(), therefore its corresponding ->release() callback must call seq_release_private(). Otherwise it will cause a memory leak of struct stack_print_ctx. Link: https://lkml.kernel.org/r/20251219074232.136482-1-ranxiaokai627@163.com Fixes: 765973a09803 ("mm,page_owner: display all stacks and their count") Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Marco Elver <elver@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-23 | mm/memremap: fix spurious large folio warning for FS-DAX | John Groves
This patch addresses a warning that I discovered while working on famfs, which is an fs-dax file system that virtually always does PMD faults (next famfs patch series coming after the holidays). However, XFS also does PMD faults in fs-dax mode, and it also triggers the warning. It takes some effort to get XFS to do a PMD fault, but instructions to reproduce it are below. The VM_WARN_ON_ONCE(folio_test_large(folio)) check in free_zone_device_folio() incorrectly triggers for MEMORY_DEVICE_FS_DAX when PMD (2MB) mappings are used. FS-DAX legitimately creates large file-backed folios when handling PMD faults. This is a core feature of FS-DAX that provides significant performance benefits by mapping 2MB regions directly to persistent memory. When these mappings are unmapped, the large folios are freed through free_zone_device_folio(), which triggers the spurious warning. The warning was introduced by the commit that added support for large zone device private folios. However, that commit did not account for FS-DAX file-backed folios, which have always supported large (PMD-sized) mappings. The check distinguishes between anonymous folios (which clear AnonExclusive flags for each sub-page) and file-backed folios. For file-backed folios, it assumes large folios are unexpected - but this assumption is incorrect for FS-DAX. The fix is to exempt MEMORY_DEVICE_FS_DAX from the large folio warning, allowing FS-DAX to continue using PMD mappings without triggering false warnings. Link: https://lkml.kernel.org/r/20251219123717.39330-1-john@groves.net Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") Signed-off-by: John Groves <john@groves.net> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Gregory Price <gourry@gourry.net> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
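The exemption described in this message can be modelled in user space. The following is only a sketch of the predicate, not the kernel code: the enum, its members, and both function names are hypothetical stand-ins, and the real check lives inside free_zone_device_folio().

```c
#include <stdbool.h>

/* Hypothetical, reduced model of the device memory types involved. */
enum model_memory_type {
	MODEL_DEVICE_PRIVATE,
	MODEL_DEVICE_FS_DAX,
};

/* Model of the check before the fix: any large file-backed folio warns. */
static bool warns_before_fix(bool is_anon, bool is_large,
			     enum model_memory_type type)
{
	(void)type; /* the old check did not look at the memory type */
	return !is_anon && is_large;
}

/* Model of the check after the fix: FS-DAX is exempt from the warning. */
static bool warns_after_fix(bool is_anon, bool is_large,
			    enum model_memory_type type)
{
	return !is_anon && is_large && type != MODEL_DEVICE_FS_DAX;
}
```

In this model a large file-backed FS-DAX folio warns before the fix and no longer warns after it, while other memory types keep the warning.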
2025-12-23MAINTAINERS: notify the "Device Memory" community of memory hotplug changesDan Williams
The recent episode of a warning regression in memremap_pages() [1] highlights that relevant updates are being missed by folks that care about core ZONE_DEVICE changes. Yes, CXL folks should pay more attention to linux-mm@, but it also would not hurt to copy linux-cxl@, where most Device Memory folks hang out, on memory hotplug changes by default. Link: http://lore.kernel.org/20251219123717.39330-1-john@groves.net [1] Link: https://lkml.kernel.org/r/20251220000327.3502994-1-dan.j.williams@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Jonathan Cameron <jonathan.cameron@huawei.com> Acked-by: John Groves <John@Groves.net> Cc: David Hildenbrand <david@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-23sparse: update MAINTAINERS infoRandy Dunlap
Chris Li is back as sparse maintainer. See https://git.kernel.org/pub/scm/devel/sparse/sparse.git/commit/?id=67f0a03cee4637e495151c48a02be642a158cbbb Link: https://lkml.kernel.org/r/20251218060921.995516-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Christopher Li <sparse@chrisli.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-23mm/page_alloc: report 1 as zone_batchsize for !CONFIG_MMUJoshua Hahn
Commit 2783088ef24e ("mm/page_alloc: prevent reporting pcp->batch = 0") moved the error handling (0-handling) of zone_batchsize from its callers to inside the function. However, the commit left out the error handling for the NOMMU case, leading to deadlocks on NOMMU systems. For NOMMU systems, return 1 instead of 0 for zone_batchsize, which restores the previous deadlock-free behavior. There is no functional difference expected with this patch before commit 2783088ef24e, other than the pr_debug in zone_pcp_init now printing out 1 instead of 0 for zones in NOMMU systems. Not only is this a pr_debug, the difference is purely semantic anyways. Link: https://lkml.kernel.org/r/20251218083200.2435789-1-joshua.hahnjy@gmail.com Fixes: 2783088ef24e ("mm/page_alloc: prevent reporting pcp->batch = 0") Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> Reported-by: Daniel Palmer <daniel@thingy.jp> Closes: https://lore.kernel.org/linux-mm/CAFr9PX=_HaM3_xPtTiBn5Gw5-0xcRpawpJ02NStfdr0khF2k7g@mail.gmail.com/ Reported-by: Guenter Roeck <linux@roeck-us.net> Closes: https://lore.kernel.org/all/42143500-c380-41fe-815c-696c17241506@roeck-us.net/ Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Daniel Palmer <daniel@thingy.jp> Tested-by: Guenter Roeck <linux@roeck-us.net> Acked-by: SeongJae Park <sj@kernel.org> Tested-by: Hajime Tazaki <thehajime@gmail.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
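To illustrate why a batch size of 0 is hazardous, here is a minimal user-space sketch. It is an assumption made for illustration, not the kernel path that actually deadlocked: drain_passes() and nommu_batchsize() are hypothetical names, and the loop merely stands in for any consumer that processes `batch` items per pass.

```c
/*
 * Hypothetical model: drain `count` queued pages, `batch` per pass.
 * Returns the number of passes, or -1 when no progress is possible.
 */
static int drain_passes(unsigned int count, unsigned int batch)
{
	int passes = 0;

	while (count > 0) {
		unsigned int n = batch < count ? batch : count;

		if (n == 0)
			return -1; /* batch == 0: an unguarded loop would spin forever */
		count -= n;
		passes++;
	}
	return passes;
}

/* Hypothetical stand-in for the NOMMU branch after the fix. */
static unsigned int nommu_batchsize(void)
{
	return 1;
}
```

With a batch of 1 the model always terminates, which mirrors the message's point that returning 1 on NOMMU restores forward progress.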
2025-12-23mm: consider non-anon swap cache folios in folio_expected_ref_count()Bijan Tabatabai
Currently, folio_expected_ref_count() only adds references for the swap cache if the folio is anonymous. However, according to the comment above the definition of PG_swapcache in enum pageflags, shmem folios can also have PG_swapcache set. This patch makes sure references for the swap cache are added if folio_test_swapcache(folio) is true. This issue was found when trying to hot-unplug memory in a QEMU/KVM virtual machine. When initiating hot-unplug when most of the guest memory is allocated, hot-unplug hangs partway through removal due to migration failures. The following message would be printed several times, and would be printed again about every five seconds: [ 49.641309] migrating pfn b12f25 failed ret:7 [ 49.641310] page: refcount:2 mapcount:0 mapping:0000000033bd8fe2 index:0x7f404d925 pfn:0xb12f25 [ 49.641311] aops:swap_aops [ 49.641313] flags: 0x300000000030508(uptodate|active|owner_priv_1|reclaim|swapbacked|node=0|zone=3) [ 49.641314] raw: 0300000000030508 ffffed312c4bc908 ffffed312c4bc9c8 0000000000000000 [ 49.641315] raw: 00000007f404d925 00000000000c823b 00000002ffffffff 0000000000000000 [ 49.641315] page dumped because: migration failure When debugging this, I found that these migration failures were due to __migrate_folio() returning -EAGAIN for a small set of folios because the expected reference count it calculates via folio_expected_ref_count() is one less than the actual reference count of the folios. Furthermore, all of the affected folios were not anonymous, but had the PG_swapcache flag set, inspiring this patch. After applying this patch, the memory hot-unplug behaves as expected. I tested this on a machine running Ubuntu 24.04 with kernel version 6.8.0-90-generic and 64GB of memory. The guest VM is managed by libvirt and runs Ubuntu 24.04 with kernel version 6.18 (though the head of the mm-unstable branch as of Dec 16, 2025 was also tested and behaves the same) and 48GB of memory. The libvirt XML definition for the VM can be found at [1]. 
CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE is set in the guest kernel so the hot-pluggable memory is automatically onlined. Below are the steps to reproduce this behavior: 1) Define and start a virtual machine host$ virsh -c qemu:///system define ./test_vm.xml # test_vm.xml from [1] host$ virsh -c qemu:///system start test_vm 2) Set up swap in the guest guest$ sudo fallocate -l 32G /swapfile guest$ sudo chmod 0600 /swapfile guest$ sudo mkswap /swapfile guest$ sudo swapon /swapfile 3) Use alloc_data [2] to allocate most of the remaining guest memory guest$ ./alloc_data 45 4) In a separate guest terminal, monitor the amount of used memory guest$ watch -n1 free -h 5) When alloc_data has finished allocating, initiate the memory hot-unplug using the provided xml file [3] host$ virsh -c qemu:///system detach-device test_vm ./remove.xml --live After initiating the memory hot-unplug, you should see the amount of available memory in the guest decrease, and the amount of used swap data increase. If everything works as expected, when all of the memory is unplugged, there should be around 8.5-9GB of data in swap. If the unplugging is unsuccessful, the amount of used swap data will settle below that. If that happens, you should be able to see log messages in dmesg similar to the one posted above. 
Link: https://lkml.kernel.org/r/20251216200727.2360228-1-bijan311@gmail.com Link: https://github.com/BijanT/linux_patch_files/blob/main/test_vm.xml [1] Link: https://github.com/BijanT/linux_patch_files/blob/main/alloc_data.c [2] Link: https://github.com/BijanT/linux_patch_files/blob/main/remove.xml [3] Fixes: 86ebd50224c0 ("mm: add folio_expected_ref_count() for reference count calculation") Signed-off-by: Bijan Tabatabai <bijan311@gmail.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shivank Garg <shivankg@amd.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Kairui Song <ryncsn@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-23rust: maple_tree: rcu_read_lock() in destructor to silence lockdepAlice Ryhl
When running the Rust maple tree kunit tests with lockdep, you may trigger a warning that looks like this: lib/maple_tree.c:780 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 no locks held by kunit_try_catch/344. stack backtrace: CPU: 3 UID: 0 PID: 344 Comm: kunit_try_catch Tainted: G N 6.19.0-rc1+ #2 NONE Tainted: [N]=TEST Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x71/0x90 lockdep_rcu_suspicious+0x150/0x190 mas_start+0x104/0x150 mas_find+0x179/0x240 _RINvNtCs5QSdWC790r4_4core3ptr13drop_in_placeINtNtCs1cdwasc6FUb_6kernel10maple_tree9MapleTreeINtNtNtBL_5alloc4kbox3BoxlNtNtB1x_9allocator7KmallocEEECsgxAQYCfdR72_25doctests_kernel_generated+0xaf/0x130 rust_doctest_kernel_maple_tree_rs_0+0x600/0x6b0 ? lock_release+0xeb/0x2a0 ? kunit_try_catch_run+0x210/0x210 kunit_try_run_case+0x74/0x160 ? kunit_try_catch_run+0x210/0x210 kunit_generic_run_threadfn_adapter+0x12/0x30 kthread+0x21c/0x230 ? __do_trace_sched_kthread_stop_ret+0x40/0x40 ret_from_fork+0x16c/0x270 ? __do_trace_sched_kthread_stop_ret+0x40/0x40 ret_from_fork_asm+0x11/0x20 </TASK> This is because the destructor of maple tree calls mas_find() without taking rcu_read_lock() or the spinlock. Doing that is actually ok in this case since the destructor has exclusive access to the entire maple tree, but it triggers a lockdep warning. To fix that, take the rcu read lock. In the future, it's possible that memory reclaim could gain a feature where it reallocates entries in maple trees even if no user-code is touching it. If that feature is added, then this use of rcu read lock would become load-bearing, so I did not make it conditional on lockdep. We have to repeatedly take and release rcu because the destructor of T might perform operations that sleep. 
Link: https://lkml.kernel.org/r/20251217-maple-drop-rcu-v1-1-702af063573f@google.com Fixes: da939ef4c494 ("rust: maple_tree: add MapleTree") Signed-off-by: Alice Ryhl <aliceryhl@google.com> Reported-by: Andreas Hindborg <a.hindborg@kernel.org> Closes: https://rust-for-linux.zulipchat.com/#narrow/channel/x/topic/x/near/564215108 Reviewed-by: Gary Guo <gary@garyguo.net> Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com> Cc: Andrew Ballance <andrewjballance@gmail.com> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Trevor Gross <tmgross@umich.edu> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-23mm: memcg: fix unit conversion for K() macro in OOM logShakeel Butt
The commit bc8e51c05ad5 ("mm: memcg: dump memcg protection info on oom or alloc failures") added functionality to dump memcg protections on OOM or allocation failures. It uses the K() macro to dump the information and passes bytes to the macro. However, the macro takes a number of pages instead of bytes. It is defined as: #define K(x) ((x) << (PAGE_SHIFT-10)) Let's fix this. Link: https://lkml.kernel.org/r/20251216212054.484079-1-shakeel.butt@linux.dev Fixes: bc8e51c05ad5 ("mm: memcg: dump memcg protection info on oom or alloc failures") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reported-by: Chris Mason <clm@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
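The unit mismatch can be seen with a small worked example. This is a sketch assuming 4 KiB pages (PAGE_SHIFT of 12); the two helper functions are hypothetical and only contrast the two call patterns. They are not necessarily how the patch itself corrects the call sites.

```c
#include <stdint.h>

/* Assumption for illustration: 4 KiB pages, so PAGE_SHIFT is 12. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* As quoted in the commit message: converts a page count to KiB. */
#define K(x) ((x) << (PAGE_SHIFT - 10))

/* Hypothetical helper: convert bytes to pages first, then to KiB. */
static uint64_t bytes_to_kib_via_pages(uint64_t bytes)
{
	return K(bytes / PAGE_SIZE);
}

/* Hypothetical helper: the buggy pattern, passing bytes straight to K(). */
static uint64_t bytes_to_kib_buggy(uint64_t bytes)
{
	return K(bytes);
}
```

Under this assumption, 8192 bytes is two pages, so the correct result is 8 KiB, whereas passing the byte count directly yields 32768, overstating the value by a factor of PAGE_SIZE.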
2025-12-23mm: fixup pfnmap memory failure handling to use pgoffAnkit Agrawal
The memory failure handling implementation for the PFNMAP memory with no struct pages is faulty. The VA of the mapping is determined based on the PFN. It should instead be based on the file mapping offset. At the occurrence of poison, the memory_failure_pfn is triggered on the poisoned PFN. Introduce a callback function that allows mm to translate the PFN to the corresponding file page offset. The kernel module using the registration API must implement the callback function and provide the translation. The translated value is then used to determine the VA information and to send the SIGBUS to the usermode process mapped to the poisoned PFN. The callback is also useful for the driver to be notified of the poisoned PFN, which may then track it. Link: https://lkml.kernel.org/r/20251211070603.338701-2-ankita@nvidia.com Fixes: 2ec41967189c ("mm: handle poisoning of pfn without struct pages") Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Matthew R. Ochs <mochs@nvidia.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Neo Jia <cjia@nvidia.com> Cc: Vikram Sethi <vsethi@nvidia.com> Cc: Yishai Hadas <yishaih@nvidia.com> Cc: Zhi Wang <zhiw@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
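The translation flow described here (PFN to file page offset via a driver callback, then page offset to user VA) can be sketched in user space. Everything named below is hypothetical: struct vma_model, the pfn_to_pgoff_fn callback type, and the demo driver are illustrative stand-ins, not the registration API added by the patch. 4 KiB pages are assumed.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical, simplified stand-in for a VMA. */
struct vma_model {
	uint64_t vm_start; /* first user virtual address */
	uint64_t vm_end;   /* one past the last mapped address */
	uint64_t vm_pgoff; /* file offset of vm_start, in pages */
};

/* Hypothetical callback: the driver maps a PFN to a file page offset. */
typedef bool (*pfn_to_pgoff_fn)(uint64_t pfn, uint64_t *pgoff);

/* Derive the user VA of a poisoned PFN from the file page offset. */
static bool poisoned_va(const struct vma_model *vma, pfn_to_pgoff_fn cb,
			uint64_t pfn, uint64_t *va)
{
	uint64_t pgoff, addr;

	if (!cb(pfn, &pgoff))
		return false; /* PFN is not owned by this driver */
	if (pgoff < vma->vm_pgoff)
		return false; /* offset lies before this mapping */
	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
	if (addr >= vma->vm_end)
		return false; /* offset lies past this mapping */
	*va = addr;
	return true;
}

/* Demo driver: PFNs 0x1000..0x10ff back file page offsets 0..0xff. */
static bool demo_pfn_to_pgoff(uint64_t pfn, uint64_t *pgoff)
{
	if (pfn < 0x1000 || pfn >= 0x1100)
		return false;
	*pgoff = pfn - 0x1000;
	return true;
}
```

The point of the design is that the address depends on where the page sits in the file, so a mapping that starts partway into the file (non-zero vm_pgoff) still resolves to the right VA.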
2025-12-23tools/mm/page_owner_sort: fix timestamp comparison for stable sortingKaushlendra Kumar
The ternary operator in compare_ts() returns 1 when timestamps are equal, causing unstable sorting behavior. Replace with explicit three-way comparison that returns 0 for equal timestamps, ensuring stable qsort ordering and consistent output. Link: https://lkml.kernel.org/r/20251209044552.3396468-1-kaushlendra.kumar@intel.com Fixes: 8f9c447e2e2b ("tools/vm/page_owner_sort.c: support sorting pid and time") Signed-off-by: Kaushlendra Kumar <kaushlendra.kumar@intel.com> Cc: Chongxi Zhao <zhaochongxi2019@email.szu.edu.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
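The difference between the two comparators can be sketched as follows. The record type and the "buggy" variant are assumptions reconstructed from the description above, not copied from page_owner_sort.c.

```c
#include <stdlib.h>

/* Hypothetical record type; the real tool uses its own structure. */
struct record {
	unsigned long long ts_nsec;
	int id;
};

/* Pattern described as buggy: equal timestamps compare as "greater". */
static int compare_ts_buggy(const void *p1, const void *p2)
{
	const struct record *l1 = p1, *l2 = p2;

	return l1->ts_nsec < l2->ts_nsec ? -1 : 1;
}

/* Explicit three-way comparison: returns 0 for equal timestamps. */
static int compare_ts(const void *p1, const void *p2)
{
	const struct record *l1 = p1, *l2 = p2;

	if (l1->ts_nsec < l2->ts_nsec)
		return -1;
	if (l1->ts_nsec > l2->ts_nsec)
		return 1;
	return 0;
}
```

Note that the buggy form reports a > b and b > a for the same pair of equal timestamps, which breaks the consistency that qsort() requires of its comparator. The C standard does not itself promise a stable qsort(), so the three-way form is best read as restoring a well-defined comparator.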
2025-12-23selftests/mm: fix thread state check in uffd-unit-testsWake Liu
In the thread_state_get() function, the logic to find the thread's state character was using `sizeof(header) - 1` to calculate the offset from the "State:\t" string. The `header` variable is a `const char *` pointer. `sizeof()` on a pointer returns the size of the pointer itself, not the length of the string literal it points to. This makes the code's behavior dependent on the architecture's pointer size. This bug was identified on a 32-bit ARM build (`gsi_tv_arm`) for Android, running on an ARMv8-based device, compiled with Clang 19.0.1. On this 32-bit architecture, `sizeof(char *)` is 4. The expression `sizeof(header) - 1` resulted in an incorrect offset of 3, causing the test to read the wrong character from `/proc/[tid]/status` and fail. On 64-bit architectures, `sizeof(char *)` is 8, so the expression coincidentally evaluates to 7, which matches the length of "State:\t". This is why the bug likely remained hidden on 64-bit builds. To fix this and make the code portable and correct across all architectures, this patch replaces `sizeof(header) - 1` with `strlen(header)`. The `strlen()` function correctly calculates the string's length, ensuring the correct offset is always used. Link: https://lkml.kernel.org/r/20251210091408.3781445-1-wakel@google.com Fixes: f60b6634cd88 ("mm/selftests: add a test to verify mmap_changing race with -EAGAIN") Signed-off-by: Wake Liu <wakel@google.com> Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Bill Wendling <morbo@google.com> Cc: Justin Stitt <justinstitt@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
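A minimal sketch of the pitfall, assuming a status buffer already read into memory. The helper names are hypothetical and this is not the selftest's code; it only contrasts the pointer-size offset with the string-length offset.

```c
#include <stddef.h>
#include <string.h>

/* Offset computed the buggy way: depends on the pointer size. */
static size_t buggy_offset(void)
{
	const char *header = "State:\t";

	return sizeof(header) - 1; /* 3 on 32-bit, 7 on 64-bit targets */
}

/* Offset computed the portable way: length of the header string. */
static size_t fixed_offset(void)
{
	const char *header = "State:\t";

	return strlen(header); /* always 7 */
}

/* Hypothetical helper: state character after "State:\t", or '\0'. */
static char state_char(const char *status)
{
	const char *header = "State:\t";
	const char *p = strstr(status, header);

	if (!p)
		return '\0';
	return p[strlen(header)];
}
```

An alternative that keeps sizeof() would be to declare the header as an array (`static const char header[] = "State:\t";`), since sizeof on an array counts the bytes of the string including the terminator.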
2025-12-23kernel/kexec: fix IMA when allocation happens in CMA areaPingfan Liu
*** Bug description *** When I tested kexec with the latest kernel, I ran into the following warning: [ 40.712410] ------------[ cut here ]------------ [ 40.712576] WARNING: CPU: 2 PID: 1562 at kernel/kexec_core.c:1001 kimage_map_segment+0x144/0x198 [...] [ 40.816047] Call trace: [ 40.818498] kimage_map_segment+0x144/0x198 (P) [ 40.823221] ima_kexec_post_load+0x58/0xc0 [ 40.827246] __do_sys_kexec_file_load+0x29c/0x368 [...] [ 40.855423] ---[ end trace 0000000000000000 ]--- *** How to reproduce *** This bug is only triggered when the kexec target address is allocated in the CMA area. If no CMA area is reserved in the kernel, use the "cma=" option in the kernel command line to reserve one. *** Root cause *** The commit 07d24902977e ("kexec: enable CMA based contiguous allocation") allocates the kexec target address directly on the CMA area to avoid copying during the jump. In this case, there is no IND_SOURCE for the kexec segment. But the current implementation of kimage_map_segment() assumes that IND_SOURCE pages exist and maps them into a contiguous virtual address range by vmap(). *** Solution *** If the IMA segment is allocated in the CMA area, use its page_address() directly. Link: https://lkml.kernel.org/r/20251216014852.8737-2-piliu@redhat.com Fixes: 07d24902977e ("kexec: enable CMA based contiguous allocation") Signed-off-by: Pingfan Liu <piliu@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Alexander Graf <graf@amazon.com> Cc: Steven Chen <chenste@linux.microsoft.com> Cc: Mimi Zohar <zohar@linux.ibm.com> Cc: Roberto Sassu <roberto.sassu@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-23kernel/kexec: change the prototype of kimage_map_segment()Pingfan Liu
The kexec segment index will be required to extract the corresponding information for that segment in kimage_map_segment(). Additionally, kexec_segment already holds the kexec relocation destination address and size. Therefore, the prototype of kimage_map_segment() can be changed. Link: https://lkml.kernel.org/r/20251216014852.8737-1-piliu@redhat.com Fixes: 07d24902977e ("kexec: enable CMA based contiguous allocation") Signed-off-by: Pingfan Liu <piliu@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Mimi Zohar <zohar@linux.ibm.com> Cc: Roberto Sassu <roberto.sassu@huawei.com> Cc: Alexander Graf <graf@amazon.com> Cc: Steven Chen <chenste@linux.microsoft.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>