summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2025-09-29Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "There's good stuff across the board, including some nice mm improvements for CPUs with the 'noabort' BBML2 feature and a clever patch to allow ptdump to play nicely with block mappings in the vmalloc area. Confidential computing: - Add support for accepting secrets from firmware (e.g. ACPI CCEL) and mapping them with appropriate attributes. CPU features: - Advertise atomic floating-point instructions to userspace - Extend Spectre workarounds to cover additional Arm CPU variants - Extend list of CPUs that support break-before-make level 2 and guarantee not to generate TLB conflict aborts for changes of mapping granularity (BBML2_NOABORT) - Add GCS support to our uprobes implementation. Documentation: - Remove bogus SME documentation concerning register state when entering/exiting streaming mode. Entry code: - Switch over to the generic IRQ entry code (GENERIC_IRQ_ENTRY) - Micro-optimise syscall entry path with a compiler branch hint. Memory management: - Enable huge mappings in vmalloc space even when kernel page-table dumping is enabled - Tidy up the types used in our early MMU setup code - Rework rodata= for closer parity with the behaviour on x86 - For CPUs implementing BBML2_NOABORT, utilise block mappings in the linear map even when rodata= applies to virtual aliases - Don't re-allocate the virtual region between '_text' and '_stext', as doing so confused tools parsing /proc/vmcore. Miscellaneous: - Clean-up Kconfig menuconfig text for architecture features - Avoid redundant bitmap_empty() during determination of supported SME vector lengths - Re-enable warnings when building the 32-bit vDSO object - Avoid breaking our eggs at the wrong end. Perf and PMUs: - Support for v3 of the Hisilicon L3C PMU - Support for Hisilicon's MN and NoC PMUs - Support for Fujitsu's Uncore PMU - Support for SPE's extended event filtering feature - Preparatory work to enable data source filtering in SPE - Support for multiple lanes in the DWC PCIe PMU - Support for i.MX94 in the IMX DDR PMU driver - MAINTAINERS update (Thank you, Yicong) - Minor driver fixes (PERF_IDX2OFF() overflow, CMN register offsets). Selftests: - Add basic LSFE check to the existing hwcaps test - Support nolibc in GCS tests - Extend SVE ptrace test to pass unsupported regsets and invalid vector lengths - Minor cleanups (typos, cosmetic changes). System registers: - Fix ID_PFR1_EL1 definition - Fix incorrect signedness of some fields in ID_AA64MMFR4_EL1 - Sync TCR_EL1 definition with the latest Arm ARM (L.b) - Be stricter about the input fed into our AWK sysreg generator script - Typo fixes and removal of redundant definitions. ACPI, EFI and PSCI: - Decouple Arm's "Software Delegated Exception Interface" (SDEI) support from the ACPI GHES code so that it can be used by platforms booted with device-tree - Remove unnecessary per-CPU tracking of the FPSIMD state across EFI runtime calls - Fix a node refcount imbalance in the PSCI device-tree code. CPU Features: - Ensure register sanitisation is applied to fields in ID_AA64MMFR4 - Expose AIDR_EL1 to userspace via sysfs, primarily so that KVM guests can reliably query the underlying CPU types from the VMM - Re-enabling of SME support (CONFIG_ARM64_SME) as a result of fixes to our context-switching, signal handling and ptrace code" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (93 commits) arm64: cpufeature: Remove duplicate asm/mmu.h header arm64: Kconfig: Make CPU_BIG_ENDIAN depend on BROKEN perf/dwc_pcie: Fix use of uninitialized variable arm/syscalls: mark syscall invocation as likely in invoke_syscall Documentation: hisi-pmu: Add introduction to HiSilicon V3 PMU Documentation: hisi-pmu: Fix of minor format error drivers/perf: hisi: Add support for L3C PMU v3 drivers/perf: hisi: Refactor the event configuration of L3C PMU drivers/perf: hisi: Extend the field of tt_core drivers/perf: hisi: Extract the event filter check of L3C PMU drivers/perf: hisi: Simplify the probe process of each L3C PMU version drivers/perf: hisi: Export hisi_uncore_pmu_isr() drivers/perf: hisi: Relax the event ID check in the framework perf: Fujitsu: Add the Uncore PMU driver arm64: map [_text, _stext) virtual address range non-executable+read-only arm64/sysreg: Update TCR_EL1 register arm64: Enable vmalloc-huge with ptdump arm64: cpufeature: add Neoverse-V3AE to BBML2 allow list arm64: errata: Apply workarounds for Neoverse-V3AE arm64: cputype: Add Neoverse-V3AE definitions ...
2025-09-29Revert "net: group sk_backlog and sk_receive_queue"Eric Dumazet
This reverts commit 4effb335b5dab08cb6e2c38d038910f8b527cfc9. This was a benefit for UDP flood case, which was later greatly improved with commits 6471658dc66c ("udp: use skb_attempt_defer_free()") and b650bf0977d3 ("udp: remove busylock and add per NUMA queues"). Apparently blamed commit added a regression for RAW sockets, possibly because they do not use the dual RX queue strategy that UDP has. sock_queue_rcv_skb_reason() and RAW recvmsg() compete for sk_receive_buf and sk_rmem_alloc changes, and them being in the same cache line reduce performance. Fixes: 4effb335b5da ("net: group sk_backlog and sk_receive_queue") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202509281326.f605b4eb-lkp@intel.com Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: David Ahern <dsahern@kernel.org> Cc: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250929182112.824154-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29tcp: make tcp_rcvbuf_grow() accessible to mptcp codePaolo Abeni
To leverage the auto-tuning improvements brought by commit 2da35e4b4df9 ("Merge branch 'tcp-receive-side-improvements'"), the MPTCP stack need to access the mentioned helper. Acked-by: Geliang Tang <geliang@kernel.org> Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250927-net-next-mptcp-rcv-path-imp-v1-2-5da266aa9c1a@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29Merge tag 'for-net-next-2025-09-27' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next: core: - MAINTAINERS: add a sub-entry for the Qualcomm bluetooth driver - Avoid a couple dozen -Wflex-array-member-not-at-end warnings - bcsp: receive data only if registered - HCI: Fix using LE/ACL buffers for ISO packets - hci_core: Detect if an ISO link has stalled - ISO: Don't initiate CIS connections if there are no buffers - ISO: Use sk_sndtimeo as conn_timeout drivers: - btusb: Check for unexpected bytes when defragmenting HCI frames - btusb: Add new VID/PID 13d3/3627 for MT7925 - btusb: Add new VID/PID 13d3/3633 for MT7922 - btusb: Add USB ID 2001:332a for D-Link AX9U rev. A1 - btintel: Add support for BlazarIW core - btintel_pcie: Add support for _suspend() / _resume() - btintel_pcie: Define hdev->wakeup() callback - btintel_pcie: Add Bluetooth core/platform as comments - btintel_pcie: Add id of Scorpious, Panther Lake-H484 - btintel_pcie: Refactor Device Coredump * tag 'for-net-next-2025-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (30 commits) Bluetooth: Avoid a couple dozen -Wflex-array-member-not-at-end warnings Bluetooth: hci_sync: Fix using random address for BIG/PA advertisements Bluetooth: ISO: don't leak skb in ISO_CONT RX Bluetooth: ISO: free rx_skb if not consumed Bluetooth: ISO: Fix possible UAF on iso_conn_free Bluetooth: SCO: Fix UAF on sco_conn_free Bluetooth: bcsp: receive data only if registered Bluetooth: btusb: Add new VID/PID 13d3/3633 for MT7922 Bluetooth: btusb: Add new VID/PID 13d3/3627 for MT7925 Bluetooth: remove duplicate h4_recv_buf() in header Bluetooth: btusb: Check for unexpected bytes when defragmenting HCI frames Bluetooth: hci_core: Print information of hcon on hci_low_sent Bluetooth: hci_core: Print number of packets in conn->data_q Bluetooth: Add function and line information to bt_dbg Bluetooth: MGMT: Fix not exposing debug UUID on MGMT_OP_READ_EXP_FEATURES_INFO Bluetooth: hci_core: Detect if an ISO link has stalled Bluetooth: ISO: Use sk_sndtimeo as conn_timeout Bluetooth: HCI: Fix using LE/ACL buffers for ISO packets Bluetooth: ISO: Don't initiate CIS connections if there are no buffers MAINTAINERS: add a sub-entry for the Qualcomm bluetooth driver ... ==================== Link: https://patch.msgid.link/20250927154616.1032839-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29ptr_ring: __ptr_ring_zero_tail micro optimizationMichael S. Tsirkin
__ptr_ring_zero_tail currently does the - 1 operation twice: - during initialization of head - at each loop iteration Let's just do it in one place, all we need to do is adjust the loop condition. this is better: - a slightly clearer logic with less duplication - uses prefix -- we don't need to save the old value - one less - 1 operation - for example, when ring is empty we now don't do - 1 at all, existing code does it once Text size shrinks from 15081 to 15050 bytes. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/bcd630c7edc628e20d4f8e037341f26c90ab4365.1758976026.git.mst@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29Merge tag 'hardening-v6.18-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: "One notable addition is the creation of the 'transitional' keyword for kconfig so CONFIG renaming can go more smoothly. This has been a long-standing deficiency, and with the renaming of CONFIG_CFI_CLANG to CONFIG_CFI (since GCC will soon have KCFI support), this came up again. The breadth of the diffstat is mainly this renaming. - Clean up usage of TRAILING_OVERLAP() (Gustavo A. R. Silva) - lkdtm: fortify: Fix potential NULL dereference on kmalloc failure (Junjie Cao) - Add str_assert_deassert() helper (Lad Prabhakar) - gcc-plugins: Remove TODO_verify_il for GCC >= 16 - kconfig: Fix BrokenPipeError warnings in selftests - kconfig: Add transitional symbol attribute for migration support - kcfi: Rename CONFIG_CFI_CLANG to CONFIG_CFI" * tag 'hardening-v6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: lib/string_choices: Add str_assert_deassert() helper kcfi: Rename CONFIG_CFI_CLANG to CONFIG_CFI kconfig: Add transitional symbol attribute for migration support kconfig: Fix BrokenPipeError warnings in selftests gcc-plugins: Remove TODO_verify_il for GCC >= 16 stddef: Introduce __TRAILING_OVERLAP() stddef: Remove token-pasting in TRAILING_OVERLAP() lkdtm: fortify: Fix potential NULL dereference on kmalloc failure
2025-09-29Merge tag 'execve-v6.18-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve updates from Kees Cook: - binfmt_elf: preserve original ELF e_flags for core dumps (Svetlana Parfenova) - exec: Fix incorrect type for ret (Xichao Zhao) - binfmt_elf: Replace offsetof() with struct_size() in fill_note_info() (Xichao Zhao) * tag 'execve-v6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: binfmt_elf: preserve original ELF e_flags for core dumps binfmt_elf: Replace offsetof() with struct_size() in fill_note_info() exec: Fix incorrect type for ret
2025-09-29Merge tag 'ffs-const-v6.18-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull ffs const-attribute cleanups from Kees Cook: "While working on various hardening refactoring a while back we encountered inconsistencies in the application of __attribute_const__ on the ffs() family of functions. This series fixes this across all archs and adds KUnit tests. Notably, this found a theoretical underflow in PCI (also fixed here) and uncovered an inefficiency in ARC (fixed in the ARC arch PR). I kept the series separate from the general hardening PR since it is a stand-alone "topic". - PCI: Fix theoretical underflow in use of ffs(). - Universally apply __attribute_const__ to all architecture's ffs()-family of functions. - Add KUnit tests for ffs() behavior and const-ness" * tag 'ffs-const-v6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: KUnit: ffs: Validate all the __attribute_const__ annotations sparc: Add __attribute_const__ to ffs()-family implementations xtensa: Add __attribute_const__ to ffs()-family implementations s390: Add __attribute_const__ to ffs()-family implementations parisc: Add __attribute_const__ to ffs()-family implementations mips: Add __attribute_const__ to ffs()-family implementations m68k: Add __attribute_const__ to ffs()-family implementations openrisc: Add __attribute_const__ to ffs()-family implementations riscv: Add __attribute_const__ to ffs()-family implementations hexagon: Add __attribute_const__ to ffs()-family implementations alpha: Add __attribute_const__ to ffs()-family implementations sh: Add __attribute_const__ to ffs()-family implementations powerpc: Add __attribute_const__ to ffs()-family implementations x86: Add __attribute_const__ to ffs()-family implementations csky: Add __attribute_const__ to ffs()-family implementations bitops: Add __attribute_const__ to generic ffs()-family implementations KUnit: Introduce ffs()-family tests PCI: Test for bit underflow in pcie_set_readrq()
2025-09-29Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linuxLinus Torvalds
Pull interleaved SHA-256 hashing support from Eric Biggers: "Optimize fsverity with 2-way interleaved hashing Add support for 2-way interleaved SHA-256 hashing to lib/crypto/, and make fsverity use it for faster file data verification. This improves fsverity performance on many x86_64 and arm64 processors. Later, I plan to make dm-verity use this too" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux: fsverity: Use 2-way interleaved SHA-256 hashing when supported fsverity: Remove inode parameter from fsverity_hash_block() lib/crypto: tests: Add tests and benchmark for sha256_finup_2x() lib/crypto: x86/sha256: Add support for 2-way interleaved hashing lib/crypto: arm64/sha256: Add support for 2-way interleaved hashing lib/crypto: sha256: Add support for 2-way interleaved hashing
2025-09-29Merge tag 'libcrypto-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull crypto library updates from Eric Biggers: - Add a RISC-V optimized implementation of Poly1305. This code was written by Andy Polyakov and contributed by Zhihang Shao. - Migrate the MD5 code into lib/crypto/, and add KUnit tests for MD5. Yes, it's still the 90s, and several kernel subsystems are still using MD5 for legacy use cases. As long as that remains the case, it's helpful to clean it up in the same way as I've been doing for other algorithms. Later, I plan to convert most of these users of MD5 to use the new MD5 library API instead of the generic crypto API. - Simplify the organization of the ChaCha, Poly1305, BLAKE2s, and Curve25519 code. Consolidate these into one module per algorithm, and centralize the configuration and build process. This is the same reorganization that has already been successful for SHA-1 and SHA-2. - Remove the unused crypto_kpp API for Curve25519. - Migrate the BLAKE2s and Curve25519 self-tests to KUnit. - Always enable the architecture-optimized BLAKE2s code. * tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (38 commits) crypto: md5 - Implement export_core() and import_core() wireguard: kconfig: simplify crypto kconfig selections lib/crypto: tests: Enable Curve25519 test when CRYPTO_SELFTESTS lib/crypto: curve25519: Consolidate into single module lib/crypto: curve25519: Move a couple functions out-of-line lib/crypto: tests: Add Curve25519 benchmark lib/crypto: tests: Migrate Curve25519 self-test to KUnit crypto: curve25519 - Remove unused kpp support crypto: testmgr - Remove curve25519 kpp tests crypto: x86/curve25519 - Remove unused kpp support crypto: powerpc/curve25519 - Remove unused kpp support crypto: arm/curve25519 - Remove unused kpp support crypto: hisilicon/hpre - Remove unused curve25519 kpp support lib/crypto: tests: Add KUnit tests for BLAKE2s lib/crypto: blake2s: Consolidate into single C translation unit lib/crypto: blake2s: Move generic code into blake2s.c lib/crypto: blake2s: Always enable arch-optimized BLAKE2s code lib/crypto: blake2s: Remove obsolete self-test lib/crypto: x86/blake2s: Reduce size of BLAKE2S_SIGMA2 lib/crypto: chacha: Consolidate into single module ...
2025-09-29Merge tag 'crc-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull CRC updates from Eric Biggers: "Update crc_kunit to test the CRC functions in softirq and hardirq contexts, similar to what the lib/crypto/ KUnit tests do. Move the helper function needed to do this into a common header. This is useful mainly to test fallback code paths for when FPU/SIMD/vector registers are unusable" * tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: Documentation/staging: Fix typo and incorrect citation in crc32.rst lib/crc: Drop inline from all *_mod_init_arch() functions lib/crc: Use underlying functions instead of crypto_simd_usable() lib/crc: crc_kunit: Test CRC computation in interrupt contexts kunit, lib/crypto: Move run_irq_test() to common header
2025-09-29Merge tag 'dlm-6.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: "This adds a dlm_release_lockspace() flag to request that node-failure recovery be performed for the node leaving the lockspace. The implementation of this flag requires coordination with userland clustering components. It's been requested for use by GFS2" * tag 'dlm-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: check for undefined release_option values dlm: handle release_option as unsigned dlm: move to rinfo for all middle conversion cases dlm: handle invalid lockspace member remove dlm: add new flag DLM_RELEASE_RECOVER for dlm_lockspace_release dlm: add new configfs entry release_recover for lockspace members dlm: add new RELEASE_RECOVER uevent attribute for release_lockspace dlm: use defines for force values in dlm_release_lockspace dlm: check for defined force value in dlm_lockspace_release
2025-09-29Merge tag 'hfs-v6.18-tag1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vdubeyko/hfs Pull hfs updates from Viacheslav Dubeyko: "This contains several fixes of syzbot reported issues, HFS/HFS+ fixes of xfstests failures, and rework of HFS/HFS+ debug output subsystem. - Kang Chen fixed a slab-out-of-bounds issue in hfsplus_uni2asc() when hfsplus_uni2asc() is called from hfsplus_listxattr(). - Yang Chenzhi fixed a crash in hfsplus_bmap_alloc() if record offset or length is larger than node_size. - Yangtao Li corrected the error code from hfsplus_fill_super() if Catalog File contains corrupted record for the case of hidden directory's type. - KMSAN uninit-value fixes: hfs_find_set_zero_bits() and __hfsplus_ext_cache_extent() use kzalloc() instead of kmalloc(), and in hfsplus_delete_cat() by proper initialization of struct hfsplus_inode_info in the hfsplus_iget() logic. - A slab-out-of-bounds issue could happen in hfsplus_strcasecmp() if the length field of struct hfsplus_unistr is bigger than HFSPLUS_MAX_STRLEN. Fixed by checking the length of comparing strings, and if the strings' length is bigger than HFSPLUS_MAX_STRLEN, then the length is corrected to this value. - The generic/736 xfstest failed for HFS because the HFS volume becomes corrupted after the test run. The main reason was the absence of logic that corrects mdb->drNxtCNID/HFS_SB(sb)->next_id (next unused CNID) after deleting a record in Catalog File. That was fixed by implementing the necessary logic in hfs_correct_next_unused_CNID()" * tag 'hfs-v6.18-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/vdubeyko/hfs: hfs/hfsplus: rework debug output subsystem hfsplus: fix slab-out-of-bounds read in hfsplus_strcasecmp() hfsplus: fix slab-out-of-bounds read in hfsplus_uni2asc() hfs: clear offset and space out of valid records in b-tree node hfs: add logic of correcting a next unused CNID hfsplus: fix KMSAN uninit-value issue in hfsplus_delete_cat() hfs: fix KMSAN uninit-value issue in hfs_find_set_zero_bits() hfs: make proper initalization of struct hfs_find_data hfsplus: fix KMSAN uninit-value issue in __hfsplus_ext_cache_extent() hfs: validate record offset in hfsplus_bmap_alloc hfsplus: return EIO when type of hidden directory mismatch in hfsplus_fill_super() MAINTAINERS: update location of hfs&hfsplus trees
2025-09-29Merge tag 'xfs-merge-6.18' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Carlos Maiolino: "For this merge window, there are really no new features, but there are a few things worth to emphasize: - Deprecated for years already, the (no)attr2 and (no)ikeep mount options have been removed for good - Several cleanups (specially from typedefs) and bug fixes - Improvements made in the online repair reap calculations - online fsck is now enabled by default" * tag 'xfs-merge-6.18' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (53 commits) xfs: rework datasync tracking and execution xfs: rearrange code in xfs_inode_item_precommit xfs: scrub: use kstrdup_const() for metapath scan setups xfs: use bt_nr_sectors in xfs_dax_translate_range xfs: track the number of blocks in each buftarg xfs: constify xfs_errortag_random_default xfs: improve default maximum number of open zones xfs: improve zone statistics message xfs: centralize error tag definitions xfs: remove pointless externs in xfs_error.h xfs: remove the expr argument to XFS_TEST_ERROR xfs: remove xfs_errortag_set xfs: remove xfs_errortag_get xfs: move the XLOG_REG_ constants out of xfs_log_format.h xfs: adjust the hint based zone allocation policy xfs: refactor hint based zone allocation fs: add an enum for number of life time hints xfs: fix log CRC mismatches between i386 and other architectures xfs: rename the old_crc variable in xlog_recover_process xfs: remove the unused xfs_log_iovec_t typedef ...
2025-09-29Merge tag 'vfs-6.18-rc1.async' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs async directory updates from Christian Brauner: "This contains further preparatory changes for the asynchronous directory locking scheme: - Add lookup_one_positive_killable() which allows overlayfs to perform lookup that won't block on a fatal signal - Unify the mount idmap handling in struct renamedata as a rename can only happen within a single mount - Introduce kern_path_parent() for audit which sets the path to the parent and returns a dentry for the target without holding any locks on return - Rename kern_path_locked() as it is only used to prepare for the removal of an object from the filesystem: kern_path_locked() => start_removing_path() kern_path_create() => start_creating_path() user_path_create() => start_creating_user_path() user_path_locked_at() => start_removing_user_path_at() done_path_create() => end_creating_path() NA => end_removing_path()" * tag 'vfs-6.18-rc1.async' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: debugfs: rename start_creating() to debugfs_start_creating() VFS: rename kern_path_locked() and related functions. VFS/audit: introduce kern_path_parent() for audit VFS: unify old_mnt_idmap and new_mnt_idmap in renamedata VFS: discard err2 in filename_create() VFS/ovl: add lookup_one_positive_killable()
2025-09-29Merge tag 'vfs-6.18-rc1.writeback' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs writeback updates from Christian Brauner: "This contains work adressing lockups reported by users when a systemd unit reading lots of files from a filesystem mounted with the lazytime mount option exits. With the lazytime mount option enabled we can be switching many dirty inodes on cgroup exit to the parent cgroup. The numbers observed in practice when systemd slice of a large cron job exits can easily reach hundreds of thousands or millions. The logic in inode_do_switch_wbs() which sorts the inode into appropriate place in b_dirty list of the target wb however has linear complexity in the number of dirty inodes thus overall time complexity of switching all the inodes is quadratic leading to workers being pegged for hours consuming 100% of the CPU and switching inodes to the parent wb. Simple reproducer of the issue: FILES=10000 # Filesystem mounted with lazytime mount option MNT=/mnt/ echo "Creating files and switching timestamps" for (( j = 0; j < 50; j ++ )); do mkdir $MNT/dir$j for (( i = 0; i < $FILES; i++ )); do echo "foo" >$MNT/dir$j/file$i done touch -a -t 202501010000 $MNT/dir$j/file* done wait echo "Syncing and flushing" sync echo 3 >/proc/sys/vm/drop_caches echo "Reading all files from a cgroup" mkdir /sys/fs/cgroup/unified/mycg1 || exit echo $$ >/sys/fs/cgroup/unified/mycg1/cgroup.procs || exit for (( j = 0; j < 50; j ++ )); do cat /mnt/dir$j/file* >/dev/null & done wait echo "Switching wbs" # Now rmdir the cgroup after the script exits This can be solved by: - Avoiding contention on the wb->list_lock when switching inodes by running a single work item per wb and managing a queue of items switching to the wb - Allowing rescheduling when switching inodes over to a different cgroup to avoid softlockups - Maintaining the b_dirty list ordering instead of sorting it" * tag 'vfs-6.18-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: writeback: Add tracepoint to track pending inode switches writeback: Avoid excessively long inode switching times writeback: Avoid softlockup when switching many inodes writeback: Avoid contention on wb->list_lock when switching inodes
2025-09-29Merge tag 'namespace-6.18-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull namespace updates from Christian Brauner: "This contains a larger set of changes around the generic namespace infrastructure of the kernel. Each specific namespace type (net, cgroup, mnt, ...) embedds a struct ns_common which carries the reference count of the namespace and so on. We open-coded and cargo-culted so many quirks for each namespace type that it just wasn't scalable anymore. So given there's a bunch of new changes coming in that area I've started cleaning all of this up. The core change is to make it possible to correctly initialize every namespace uniformly and derive the correct initialization settings from the type of the namespace such as namespace operations, namespace type and so on. This leaves the new ns_common_init() function with a single parameter which is the specific namespace type which derives the correct parameters statically. This also means the compiler will yell as soon as someone does something remotely fishy. The ns_common_init() addition also allows us to remove ns_alloc_inum() and drops any special-casing of the initial network namespace in the network namespace initialization code that Linus complained about. Another part is reworking the reference counting. The reference counting was open-coded and copy-pasted for each namespace type even though they all followed the same rules. This also removes all open accesses to the reference count and makes it private and only uses a very small set of dedicated helpers to manipulate them just like we do for e.g., files. In addition this generalizes the mount namespace iteration infrastructure introduced a few cycles ago. As reminder, the vfs makes it possible to iterate sequentially and bidirectionally through all mount namespaces on the system or all mount namespaces that the caller holds privilege over. This allow userspace to iterate over all mounts in all mount namespaces using the listmount() and statmount() system call. Each mount namespace has a unique identifier for the lifetime of the systems that is exposed to userspace. The network namespace also has a unique identifier working exactly the same way. This extends the concept to all other namespace types. The new nstree type makes it possible to lookup namespaces purely by their identifier and to walk the namespace list sequentially and bidirectionally for all namespace types, allowing userspace to iterate through all namespaces. Looking up namespaces in the namespace tree works completely locklessly. This also means we can move the mount namespace onto the generic infrastructure and remove a bunch of code and members from struct mnt_namespace itself. There's a bunch of stuff coming on top of this in the future but for now this uses the generic namespace tree to extend a concept introduced first for pidfs a few cycles ago. For a while now we have supported pidfs file handles for pidfds. This has proven to be very useful. This extends the concept to cover namespaces as well. It is possible to encode and decode namespace file handles using the common name_to_handle_at() and open_by_handle_at() apis. As with pidfs file handles, namespace file handles are exhaustive, meaning it is not required to actually hold a reference to nsfs in able to decode aka open_by_handle_at() a namespace file handle. Instead the FD_NSFS_ROOT constant can be passed which will let the kernel grab a reference to the root of nsfs internally and thus decode the file handle. Namespaces file descriptors can already be derived from pidfds which means they aren't subject to overmount protection bugs. IOW, it's irrelevant if the caller would not have access to an appropriate /proc/<pid>/ns/ directory as they could always just derive the namespace based on a pidfd already. It has the same advantage as pidfds. It's possible to reliably and for the lifetime of the system refer to a namespace without pinning any resources and to compare them trivially. Permission checking is kept simple. If the caller is located in the namespace the file handle refers to they are able to open it otherwise they must hold privilege over the owning namespace of the relevant namespace. The namespace file handle layout is exposed as uapi and has a stable and extensible format. For now it simply contains the namespace identifier, the namespace type, and the inode number. The stable format means that userspace may construct its own namespace file handles without going through name_to_handle_at() as they are already allowed for pidfs and cgroup file handles" * tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (65 commits) ns: drop assert ns: move ns type into struct ns_common nstree: make struct ns_tree private ns: add ns_debug() ns: simplify ns_common_init() further cgroup: add missing ns_common include ns: use inode initializer for initial namespaces selftests/namespaces: verify initial namespace inode numbers ns: rename to __ns_ref nsfs: port to ns_ref_*() helpers net: port to ns_ref_*() helpers uts: port to ns_ref_*() helpers ipv4: use check_net() net: use check_net() net-sysfs: use check_net() user: port to ns_ref_*() helpers time: port to ns_ref_*() helpers pid: port to ns_ref_*() helpers ipc: port to ns_ref_*() helpers cgroup: port to ns_ref_*() helpers ...
2025-09-29Merge tag 'vfs-6.18-rc1.afs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull afs updates from Christian Brauner: "This contains the change to enable afs to support RENAME_NOREPLACE and RENAME_EXCHANGE if the server supports it" * tag 'vfs-6.18-rc1.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: afs: Add support for RENAME_NOREPLACE and RENAME_EXCHANGE
2025-09-29Merge tag 'kernel-6.18-rc1.clone3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull copy_process updates from Christian Brauner: "This contains the changes to enable support for clone3() on nios2 which apparently is still a thing. The more exciting part of this is that it cleans up the inconsistency in how the 64-bit flag argument is passed from copy_process() into the various other copy_*() helpers" [ Fixed up rv ltl_monitor 32-bit support as per Sasha Levin in the merge ] * tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: nios2: implement architecture-specific portion of sys_clone3 arch: copy_thread: pass clone_flags as u64 copy_process: pass clone_flags as u64 across calltree copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)
2025-09-29Merge tag 'vfs-6.18-rc1.inode' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs inode updates from Christian Brauner: "This contains a series I originally wrote and that Eric brought over the finish line. It moves out the i_crypt_info and i_verity_info pointers out of 'struct inode' and into the fs-specific part of the inode. So now the few filesytems that actually make use of this pay the price in their own private inode storage instead of forcing it upon every user of struct inode. The pointer for the crypt and verity info is simply found by storing an offset to its address in struct fsverity_operations and struct fscrypt_operations. This shrinks struct inode by 16 bytes. I hope to move a lot more out of it in the future so that struct inode becomes really just about very core stuff that we need, much like struct dentry and struct file, instead of the dumping ground it has become over the years. On top of this are a various changes associated with the ongoing inode lifetime handling rework that multiple people are pushing forward: - Stop accessing inode->i_count directly in f2fs and gfs2. They simply should use the __iget() and iput() helpers - Make the i_state flags an enum - Rework the iput() logic Currently, if we are the last iput, and we have the I_DIRTY_TIME bit set, we will grab a reference on the inode again and then mark it dirty and then redo the put. This is to make sure we delay the time update for as long as possible We can rework this logic to simply dec i_count if it is not 1, and if it is do the time update while still holding the i_count reference Then we can replace the atomic_dec_and_lock with locking the ->i_lock and doing atomic_dec_and_test, since we did the atomic_add_unless above - Add an icount_read() helper and convert everyone that accesses inode->i_count directly for this purpose to use the helper - Expand dump_inode() to dump more information about an inode helping in debugging - Add some might_sleep() annotations to iput() and associated helpers" * tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: add might_sleep() annotation to iput() and more fs: expand dump_inode() inode: fix whitespace issues fs: add an icount_read helper fs: rework iput logic fs: make the i_state flags an enum fs: stop accessing ->i_count directly in f2fs and gfs2 fsverity: check IS_VERITY() in fsverity_cleanup_inode() fs: remove inode::i_verity_info btrfs: move verity info pointer to fs-specific part of inode f2fs: move verity info pointer to fs-specific part of inode ext4: move verity info pointer to fs-specific part of inode fsverity: add support for info in fs-specific part of inode fs: remove inode::i_crypt_info ceph: move crypt info pointer to fs-specific part of inode ubifs: move crypt info pointer to fs-specific part of inode f2fs: move crypt info pointer to fs-specific part of inode ext4: move crypt info pointer to fs-specific part of inode fscrypt: add support for info in fs-specific part of inode fscrypt: replace raw loads of info pointer with helper function
2025-09-29Merge tag 'vfs-6.18-rc1.mount' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount updates from Christian Brauner: "This contains some work around mount api handling: - Output the warning message for mnt_too_revealing() triggered during fsmount() to the fscontext log. This makes it possible for the mount tool to output appropriate warnings on the command line. For example, with the newest fsopen()-based mount(8) from util-linux, the error messages now look like: # mount -t proc proc /tmp mount: /tmp: fsmount() failed: VFS: Mount too revealing. dmesg(1) may have more information after failed mount system call. - Do not consume fscontext log entries when returning -EMSGSIZE Userspace generally expects APIs that return -EMSGSIZE to allow for them to adjust their buffer size and retry the operation. However, the fscontext log would previously clear the message even in the -EMSGSIZE case. Given that it is very cheap for us to check whether the buffer is too small before we remove the message from the ring buffer, let's just do that instead. - Drop an unused argument from do_remount()" * tag 'vfs-6.18-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: vfs: fs/namespace.c: remove ms_flags argument from do_remount selftests/filesystems: add basic fscontext log tests fscontext: do not consume log entries when returning -EMSGSIZE vfs: output mount_too_revealing() errors to fscontext docs/vfs: Remove mentions to the old mount API helpers fscontext: add custom-prefix log helpers fs: Remove mount_bdev fs: Remove mount_nodev
2025-09-29Merge tag 'vfs-6.18-rc1.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Add "initramfs_options" parameter to set initramfs mount options. This allows to add specific mount options to the rootfs to e.g., limit the memory size - Add RWF_NOSIGNAL flag for pwritev2() Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE signal from being raised when writing on disconnected pipes or sockets. The flag is handled directly by the pipe filesystem and converted to the existing MSG_NOSIGNAL flag for sockets - Allow to pass pid namespace as procfs mount option Ever since the introduction of pid namespaces, procfs has had very implicit behaviour surrounding them (the pidns used by a procfs mount is auto-selected based on the mounting process's active pidns, and the pidns itself is basically hidden once the mount has been constructed) This implicit behaviour has historically meant that userspace was required to do some special dances in order to configure the pidns of a procfs mount as desired. Examples include: * In order to bypass the mnt_too_revealing() check, Kubernetes creates a procfs mount from an empty pidns so that user namespaced containers can be nested (without this, the nested containers would fail to mount procfs) But this requires forking off a helper process because you cannot just one-shot this using mount(2) * Container runtimes in general need to fork into a container before configuring its mounts, which can lead to security issues in the case of shared-pidns containers (a privileged process in the pidns can interact with your container runtime process) While SUID_DUMP_DISABLE and user namespaces make this less of an issue, the strict need for this due to a minor uAPI wart is kind of unfortunate Things would be much easier if there was a way for userspace to just specify the pidns they want. So this pull request contains changes to implement a new "pidns" argument which can be set using fsconfig(2): fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd); fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0); or classic mount(2) / mount(8): // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid"); Cleanups: - Remove the last references to EXPORT_OP_ASYNC_LOCK - Make file_remove_privs_flags() static - Remove redundant __GFP_NOWARN when GFP_NOWAIT is used - Use try_cmpxchg() in start_dir_add() - Use try_cmpxchg() in sb_init_done_wq() - Replace offsetof() with struct_size() in ioctl_file_dedupe_range() - Remove vfs_ioctl() export - Replace rwlock() with spinlock in epoll code as rwlock causes priority inversion on preempt rt kernels - Make ns_entries in fs/proc/namespaces const - Use a switch() statement() in init_special_inode() just like we do in may_open() - Use struct_size() in dir_add() in the initramfs code - Use str_plural() in rd_load_image() - Replace strcpy() with strscpy() in find_link() - Rename generic_delete_inode() to inode_just_drop() and generic_drop_inode() to inode_generic_drop() - Remove unused arguments from fcntl_{g,s}et_rw_hint() Fixes: - Document @name parameter for name_contains_dotdot() helper - Fix spelling mistake - Always return zero from replace_fd() instead of the file descriptor number - Limit the size for copy_file_range() in compat mode to prevent a signed overflow - Fix debugfs mount options not being applied - Verify the inode mode when loading it from disk in minixfs - Verify the inode mode when loading it from disk in cramfs - Don't trigger automounts with RESOLVE_NO_XDEV If openat2() was called with RESOLVE_NO_XDEV it didn't traverse through automounts, but could still trigger them - Add FL_RECLAIM flag to show_fl_flags() macro so it appears in tracepoints - Fix unused variable warning in rd_load_image() on s390 - Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD - Use ns_capable_noaudit() when determining net sysctl permissions - Don't call path_put() under namespace semaphore in listmount() and statmount()" * tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits) fcntl: trim arguments listmount: don't call path_put() under namespace semaphore statmount: don't call path_put() under namespace semaphore pid: use ns_capable_noaudit() when determining net sysctl permissions fs: rename generic_delete_inode() and generic_drop_inode() init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD initramfs: Replace strcpy() with strscpy() in find_link() initrd: Use str_plural() in rd_load_image() initramfs: Use struct_size() helper to improve dir_add() initrd: Fix unused variable warning in rd_load_image() on s390 fs: use the switch statement in init_special_inode() fs/proc/namespaces: make ns_entries const filelock: add FL_RECLAIM to show_fl_flags() macro eventpoll: Replace rwlock with spinlock selftests/proc: add tests for new pidns APIs procfs: add "pidns" mount option pidns: move is-ancestor logic to helper openat2: don't trigger automounts with RESOLVE_NO_XDEV namei: move cross-device check to __traverse_mounts namei: remove LOOKUP_NO_XDEV check from handle_mounts ...
2025-09-29PM: runtime: Drop DEFINE_FREE() for pm_runtime_put()Rafael J. Wysocki
The DEFINE_FREE() for pm_runtime_put has been superseded by recently introduced runtime PM auto-cleanup macros and its only user has been converted to using one of the new macros, so drop it. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Dhruva Gole <d-gole@ti.com> Reviewed-by: Takashi Iwai <tiwai@suse.de> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
2025-09-29PM: runtime: Add auto-cleanup macros for "resume and get" operationsRafael J. Wysocki
It is generally useful to be able to automatically drop a device's runtime PM usage counter incremented by runtime PM operations that resume a device and bump up its usage counter [1]. To that end, add guard definition macros allowing pm_runtime_put() and pm_runtime_put_autosuspend() to be used for the auto-cleanup in those cases. Simply put, a piece of code like below: pm_runtime_get_sync(dev); ..... pm_runtime_put(dev); return 0; can be transformed with guard() like: guard(pm_runtime_active)(dev); ..... return 0; (see the pm_runtime_put() call is gone). However, it is better to do proper error handling in the majority of cases, so doing something like this instead of the above is recommended: ACQUIRE(pm_runtime_active_try, pm)(dev); if (ACQUIRE_ERR(pm_runtime_active_try, &pm)) return -ENXIO; ..... return 0; In all of the cases in which runtime PM is known to be enabled for the given device or the device can be regarded as operational (and so it can be accessed) with runtime PM disabled, a piece of code like: ret = pm_runtime_resume_and_get(dev); if (ret < 0) return ret; ..... pm_runtime_put(dev); return 0; can be changed as follows: ACQUIRE(pm_runtime_active_try, pm)(dev); ret = ACQUIRE_ERR(pm_runtime_active_try, &pm); if (ret < 0) return ret; ..... return 0; (again, see the pm_runtime_put() call is gone). Still, if the device cannot be accessed unless runtime PM has been enabled for it, the pm_runtime_active_try_enabled guard variant needs to be used, that is (in the context of the example above): ACQUIRE(pm_runtime_active_try_enabled, pm)(dev); ret = ACQUIRE_ERR(pm_runtime_active_try_enabled, &pm); if (ret < 0) return ret; ..... return 0; When the original code calls pm_runtime_put_autosuspend(), use one of the "auto" guard variants, pm_runtime_active_auto/_try/_enabled, so for example, a piece of code like: ret = pm_runtime_resume_and_get(dev); if (ret < 0) return ret; ..... pm_runtime_put_autosuspend(dev); return 0; will become: ACQUIRE(pm_runtime_active_auto_try_enabled, pm)(dev); ret = ACQUIRE_ERR(pm_runtime_active_auto_try_enabled, &pm); if (ret < 0) return ret; ..... return 0; Note that the cases in which the return value of pm_runtime_get_sync() is checked can also be handled with the help of the new guard macros. For example, a piece of code like: ret = pm_runtime_get_sync(dev); if (ret < 0) { pm_runtime_put(dev); return ret; } ..... pm_runtime_put(dev); return 0; can be rewritten as: ACQUIRE(pm_runtime_active_auto_try_enabled, pm)(dev); ret = ACQUIRE_ERR(pm_runtime_active_auto_try_enabled, &pm); if (ret < 0) return ret; ..... return 0; or pm_runtime_get_active_try can be used if transparent handling of disabled runtime PM is desirable. Link: https://lore.kernel.org/linux-pm/878qimv24u.wl-tiwai@suse.de/ [1] Link: https://lore.kernel.org/linux-pm/20250926150613.000073a4@huawei.com/ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Takashi Iwai <tiwai@suse.de> Link: https://patch.msgid.link/2238241.irdbgypaU6@rafael.j.wysocki [ rjw: Fixed leftovers from the previous version in the changelog ] Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Dhruva Gole <d-gole@ti.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-09-29Merge branches 'acpi-scan', 'acpi-processor' and 'acpi-sysfs'Rafael J. Wysocki
Merge an ACPI device enumeration update, ACPI processor driver updates, and an ACPI sysfs-related code update for 6.18-rc1: - Add Intel CVS ACPI HIDs to acpi_ignore_dep_ids[] so it is not regarded as real dependency (Hans de Goede) - Use ACPI_FREE() for freeing an ACPI object in description_show() in the ACPI sysfs-related code (Kaushlendra Kumar) - Fix memory leak in the ACPI processor idle driver registration error code path and optimize ACPI idle driver registration (Huisong Li, Rafael Wysocki) - Add module import namespace to the ACPI processor idle driver (Rafael Wysocki) - Eliminate static variable flat_state_cnt from the ACPI processor idle driver (Rafael Wysocki) - Release cpufreq policy references using __free() in the ACPI processor thremal driver (Zihuan Zhang) - Remove unused empty stubs of some functions and rearrange function declarations in a header file in the ACPI processor driver (Huisong Li) - Redefine two functions as void in the ACPI processor driver (Rafael Wysocki) - Do not expose global variable acpi_idle_driver in the ACPI processor driver (Huisong Li) * acpi-scan: ACPI: scan: Add Intel CVS ACPI HIDs to acpi_ignore_dep_ids[] * acpi-processor: ACPI: processor: Do not expose global variable acpi_idle_driver ACPI: processor: idle: Redefine two functions as void ACPI: processor: Update cpuidle driver check in __acpi_processor_start() ACPI: processor: idle: Rearrange declarations in header file ACPI: processor: Remove unused empty stubs of some functions ACPI: processor: thermal: Release policy references using __free() ACPI: processor: idle: Fix function defined but not used warning ACPI: processor: idle: Eliminate static variable flat_state_cnt ACPI: processor: idle: Add module import namespace ACPI: processor: idle: Optimize ACPI idle driver registration ACPI: processor: idle: Fix memory leak when register cpuidle device failed * acpi-sysfs: ACPI: sysfs: Use ACPI_FREE() for freeing an ACPI object
2025-09-29Merge branch 'acpica'Rafael J. Wysocki
Merge ACPICA updates (20250807 release material with a few fixes on top) for 6.18-rc1: - Add SoundWire File Table (SWFT) signature to ACPICA (Maciej Strozek) - Rearrange local variable definition involving #ifdef in ACPICA to avoid using uninitialized variables (Zhe Qiao) - Allow ACPICA to skip Global Lock initialization (Huacai Chen) - Apply ACPI_NONSTRING in more places in ACPICA and fix two regressions related to incorrect ACPI_NONSTRING usage (Ahmed Salem) - Fix printing CDAT table header when dissasebling CDAT AML (Ahmed Salem) - Use acpi_ds_clear_operands() in acpi_ds_call_control_method() in ACPICA (Hans de Goede) - Update dsmethod.c in ACPICA to address unused variable warning (Saket Dumbre) - Print error messages in ACPICA for too few or too many control method arguments (Saket Dumbre) - Update ACPICA version to 20250807 (Saket Dumbre) - Fix largest possible resource descriptor index in ACPICA (Dmitry Antipov) - Add Back-Invalidate restriction to CXL Window for CEDT in ACPICA (Davidlohr Bueso). - Add the package type to acceptable Arg3 types for _DSM in ACPICA because ACPI_TYPE_ANY does not cover it (Saket Dumbre) - Fix return values in ap_is_valid_checksum() in the acpidump utility in ACPICA (Kaushlendra Kumar) * acpica: ACPICA: acpidump: fix return values in ap_is_valid_checksum() ACPICA: ACPI_TYPE_ANY does not include the package type ACPICA: CEDT: Add Back-Invalidate restriction to CXL Window ACPICA: Fix largest possible resource descriptor index ACPICA: Update version to 20250807 ACPICA: Print error messages for too few or too many arguments ACPICA: Update dsmethod.c to get rid of unused variable warning ACPICA: dispatcher: Use acpi_ds_clear_operands() in acpi_ds_call_control_method() ACPICA: Debugger: drop ACPI_NONSTRING attribute from name_seg ACPICA: acpidump: drop ACPI_NONSTRING attribute from file_name ACPICA: iASL: Fix printing CDAT table header ACPICA: Apply ACPI_NONSTRING ACPICA: Allow to skip Global Lock initialization ACPICA: Change the compilation conditions ACPICA: Remove redundant "#ifdef" definitions ACPICA: Modify variable definition position ACPICA: Add SoundWire File Table (SWFT) signature
2025-09-29drm/dumb-buffers: Provide helper to set pitch and sizeThomas Zimmermann
Add drm_modes_size_dumb(), a helper to calculate the dumb-buffer scanline pitch and allocation size. Implementations of struct drm_driver.dumb_create can call the new helper for their size computations. There is currently quite a bit of code duplication among DRM's memory managers. Each calculates scanline pitch and buffer size from the given arguments, but the implementations are inconsistent in how they treat alignment and format support. Later patches will unify this code on top of drm_mode_size_dumb() as much as possible. drm_mode_size_dumb() uses existing 4CC format helpers to interpret the given color mode. This makes the dumb-buffer interface behave similar the kernel's video= parameter. Current per-driver implementations again likely have subtle differences or bugs in how they support color modes. The dumb-buffer UAPI is only specified for known color modes. These values describe linear, single-plane RGB color formats or legacy index formats. Other values should not be specified. But some user space still does. So for unknown color modes, there are a number of known exceptions for which drm_mode_size_dumb() calculates the pitch from the bpp value, as before. All other values work the same but print an error. v6: - document additional use cases for DUMB_CREATE2 in TODO list (Tomi) - fix typos in documentation (Tomi) v5: - check for overflows with check_mul_overflow() (Tomi) v4: - use %u conversion specifier (Geert) - list DRM_FORMAT_Dn in UAPI docs (Geert) - avoid dmesg spamming with drm_warn_once() (Sima) - add more information about bpp special case (Sima) - clarify parameters for hardware alignment - add a TODO item for DUMB_CREATE2 v3: - document the UAPI semantics - compute scanline pitch from for unknown color modes (Andy, Tomi) Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Reviewed-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> Reviewed-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> Link: https://lore.kernel.org/r/20250821081918.79786-3-tzimmermann@suse.de
2025-09-29Merge branches 'pm-core', 'pm-runtime' and 'pm-sleep'Rafael J. Wysocki
Merge changes related to system sleep and runtime PM framework for 6.18-rc1: - Annotate loops walking device links in the power management core code as _srcu and add macros for walking device links to reduce the likelihood of coding mistakes related to them (Rafael Wysocki) - Document time units for *_time functions in the runtime PM API (Brian Norris) - Clear power.must_resume in noirq suspend error path to avoid resuming a dependant device under a suspended parent or supplier (Rafael Wysocki) - Fix GFP mask handling during hybrid suspend and make the amdgpu driver handle hybrid suspend correctly (Mario Limonciello, Rafael Wysocki) - Fix GFP mask handling after aborted hibernation in platform mode and combine exit paths in power_down() to avoid code duplication (Rafael Wysocki) - Use vmalloc_array() and vcalloc() in the hibernation core to avoid open-coded size computations (Qianfeng Rong) - Fix typo in hibernation core code comment (Li Jun) - Call pm_wakeup_clear() in the same place where other functions that do bookkeeping prior to suspend_prepare() are called (Samuel Wu) * pm-core: PM: core: Add two macros for walking device links PM: core: Annotate loops walking device links as _srcu * pm-runtime: PM: runtime: Documentation: ABI: Document time units for *_time * pm-sleep: PM: hibernate: Combine return paths in power_down() PM: hibernate: Restrict GFP mask in power_down() PM: hibernate: Fix pm_hibernation_mode_is_suspend() build breakage drm/amd: Fix hybrid sleep PM: hibernate: Add pm_hibernation_mode_is_suspend() PM: hibernate: Fix hybrid-sleep PM: sleep: core: Clear power.must_resume in noirq suspend error path PM: sleep: Make pm_wakeup_clear() call more clear PM: hibernate: Fix typo in memory bitmaps description comment PM: hibernate: Use vmalloc_array() and vcalloc() to improve code
2025-09-29Merge branches 'pm-em', 'pm-opp' and 'pm-devfreq'Rafael J. Wysocki
Merge energy model management, OPP (operating performance points) and devfreq updates for 6.18-rc1: - Prevent CPU capacity updates after registering a perf domain from failing on a first CPU that is not present (Christian Loehle) - Add support for the cases in which frequency alone is not sufficient to uniquely identify an OPP (Krishna Chaitanya Chundru) - Use to_result() for OPP error handling in Rust (Onur Özkan) - Add support for LPDDR5 on Rockhip RK3588 SoC to rockchip-dfi devfreq driver (Nicolas Frattaroli) - Fix an issue where DDR cycle counts on RK3588/RK3528 with LPDDR4(X) are reported as half by adding a cycle multiplier to the DFI driver in rockchip-dfi devfreq-event driver (Nicolas Frattaroli) - Fix missing error pointer dereference check of regulator instance in the mtk-cci devfreq driver probe and remove a redundant condition from an if () statement in that driver (Dan Carpenter, Liao Yuanhong) * pm-em: PM: EM: Fix late boot with holes in CPU topology * pm-opp: OPP: Add support to find OPP for a set of keys rust: opp: use to_result for error handling * pm-devfreq: PM / devfreq: rockchip-dfi: add support for LPDDR5 PM / devfreq: rockchip-dfi: double count on RK3588 PM / devfreq: mtk-cci: avoid redundant conditions PM / devfreq: mtk-cci: Fix potential error pointer dereference in probe()
2025-09-29drm/{i915,xe}: driver agnostic drm to display pointer chaseJani Nikula
The display driver needs to get from the struct drm_device pointer to the struct intel_display pointer. Currently, this depends on knowledge of the struct drm_i915_private and struct xe_device definitions, but we'd like to hide those definitions from display. Require the struct drm_device and struct intel_display * members within struct drm_i915_private and struct xe_device to be placed next to each other, to be able to figure out the display pointer without knowledge of the structures. Use a generic dummy device structure to define the relative offsets of the drm and display members, and add static assertions to ensure this holds for both i915 and xe. Use the dummy structure to do the pointer chase from struct drm_device * to struct intel_display *. This requires moving the display member in struct xe_device after the drm member. Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Ville Syrjala <ville.syrjala@linux.intel.com> Suggested-by: Simona Vetter <simona.vetter@ffwll.ch> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250926111032.1188876-1-jani.nikula@intel.com Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2025-09-29Merge drm/drm-next into drm-intel-nextJani Nikula
Backmerge to sync with drm/xe changes. Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2025-09-29Merge series "slab: Re-entrant kmalloc_nolock()"Vlastimil Babka
From the cover letter [1]: This patch set introduces kmalloc_nolock() which is the next logical step towards any context allocation necessary to remove bpf_mem_alloc and get rid of preallocation requirement in BPF infrastructure. In production BPF maps grew to gigabytes in size. Preallocation wastes memory. Alloc from any context addresses this issue for BPF and other subsystems that are forced to preallocate too. This long task started with introduction of alloc_pages_nolock(), then memcg and objcg were converted to operate from any context including NMI, this set completes the task with kmalloc_nolock() that builds on top of alloc_pages_nolock() and memcg changes. After that BPF subsystem will gradually adopt it everywhere. Link: https://lore.kernel.org/all/20250909010007.1660-1-alexei.starovoitov@gmail.com/ [1]
2025-09-29slab: Introduce kmalloc_nolock() and kfree_nolock().Alexei Starovoitov
kmalloc_nolock() relies on ability of local_trylock_t to detect the situation when per-cpu kmem_cache is locked. In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags) disables IRQs and marks s->cpu_slab->lock as acquired. local_lock_is_locked(&s->cpu_slab->lock) returns true when slab is in the middle of manipulating per-cpu cache of that specific kmem_cache. kmalloc_nolock() can be called from any context and can re-enter into ___slab_alloc(): kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf -> kmalloc_nolock() -> ___slab_alloc(cache_B) or kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock() -> ___slab_alloc(cache_B) Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock can be acquired without a deadlock before invoking the function. If that specific per-cpu kmem_cache is busy the kmalloc_nolock() retries in a different kmalloc bucket. The second attempt will likely succeed, since this cpu locked different kmem_cache. Similarly, in PREEMPT_RT local_lock_is_locked() returns true when per-cpu rt_spin_lock is locked by current _task_. In this case re-entrance into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries a different bucket that is most likely is not locked by the current task. Though it may be locked by a different task it's safe to rt_spin_lock() and sleep on it. Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL immediately if called from hard irq or NMI in PREEMPT_RT. kfree_nolock() defers freeing to irq_work when local_lock_is_locked() and (in_nmi() or in PREEMPT_RT). SLUB_TINY config doesn't use local_lock_is_locked() and relies on spin_trylock_irqsave(&n->list_lock) to allocate, while kfree_nolock() always defers to irq_work. Note, kfree_nolock() must be called _only_ for objects allocated with kmalloc_nolock(). Debug checks (like kmemleak and kfence) were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj); will miss kmemleak/kfence book keeping and will cause false positives. large_kmalloc is not supported by either kmalloc_nolock() or kfree_nolock(). Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-29slab: Reuse first bit for OBJEXTS_ALLOC_FAILAlexei Starovoitov
Since the combination of valid upper bits in slab->obj_exts with OBJEXTS_ALLOC_FAIL bit can never happen, use OBJEXTS_ALLOC_FAIL == (1ull << 0) as a magic sentinel instead of (1ull << 2) to free up bit 2. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-29mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock().Alexei Starovoitov
Change alloc_pages_nolock() to default to __GFP_COMP when allocating pages, since upcoming reentrant alloc_slab_page() needs __GFP_COMP. Also allow __GFP_ACCOUNT flag to be specified, since most of BPF infra needs __GFP_ACCOUNT except BPF streams. Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-29locking/local_lock: Introduce local_lock_is_locked().Alexei Starovoitov
Introduce local_lock_is_locked() that returns true when given local_lock is locked by current cpu (in !PREEMPT_RT) or by current task (in PREEMPT_RT). The goal is to detect a deadlock by the caller. Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-29maple_tree: Add single node allocation support to maple stateLiam R. Howlett
The fast path through a write will require replacing a single node in the tree. Using a sheaf (32 nodes) is too heavy for the fast path, so special case the node store operation by just allocating one node in the maple state. Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-29maple_tree: Prefilled sheaf conversion and testingLiam R. Howlett
Use prefilled sheaves instead of bulk allocations. This should speed up the allocations and the return path of unused allocations. Remove the push and pop of nodes from the maple state as this is now handled by the slab layer with sheaves. Testing has been removed as necessary since the features of the tree have been reduced. Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-29slab: sheaf prefilling for guaranteed allocationsVlastimil Babka
Add functions for efficient guaranteed allocations e.g. in a critical section that cannot sleep, when the exact number of allocations is not known beforehand, but an upper limit can be calculated. kmem_cache_prefill_sheaf() returns a sheaf containing at least given number of objects. kmem_cache_alloc_from_sheaf() will allocate an object from the sheaf and is guaranteed not to fail until depleted. kmem_cache_return_sheaf() is for giving the sheaf back to the slab allocator after the critical section. This will also attempt to refill it to cache's sheaf capacity for better efficiency of sheaves handling, but it's not stricly necessary to succeed. kmem_cache_refill_sheaf() can be used to refill a previously obtained sheaf to requested size. If the current size is sufficient, it does nothing. If the requested size exceeds cache's sheaf_capacity and the sheaf's current capacity, the sheaf will be replaced with a new one, hence the indirect pointer parameter. kmem_cache_sheaf_size() can be used to query the current size. The implementation supports requesting sizes that exceed cache's sheaf_capacity, but it is not efficient - such "oversize" sheaves are allocated fresh in kmem_cache_prefill_sheaf() and flushed and freed immediately by kmem_cache_return_sheaf(). kmem_cache_refill_sheaf() might be especially ineffective when replacing a sheaf with a new one of a larger capacity. It is therefore better to size cache's sheaf_capacity accordingly to make oversize sheaves exceptional. CONFIG_SLUB_STATS counters are added for sheaf prefill and return operations. A prefill or return is considered _fast when it is able to grab or return a percpu spare sheaf (even if the sheaf needs a refill to satisfy the request, as those should amortize over time), and _slow otherwise (when the barn or even sheaf allocation/freeing has to be involved). sheaf_prefill_oversize is provided to determine how many prefills were oversize (counter for oversize returns is not necessary as all oversize refills result in oversize returns). When slub_debug is enabled for a cache with sheaves, no percpu sheaves exist for it, but the prefill functionality is still provided simply by all prefilled sheaves becoming oversize. If percpu sheaves are not created for a cache due to not passing the sheaf_capacity argument on cache creation, the prefills also work through oversize sheaves, but there's a WARN_ON_ONCE() to indicate the omission. Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-28lib/string_choices: Add str_assert_deassert() helperLad Prabhakar
Add str_assert_deassert() helper to return "assert" or "deassert" string literal depending on the boolean argument. Also add the inversed variant str_deassert_assert(). Suggested-by: Philipp Zabel <p.zabel@pengutronix.de> Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Reviewed-by: Andy Shevchenko <andy@kernel.org> Link: https://lore.kernel.org/r/20250923095229.2149740-1-prabhakar.mahadev-lad.rj@bp.renesas.com Signed-off-by: Kees Cook <kees@kernel.org>
2025-09-29drm/bridge: imx: add driver for HDMI TX Parallel Audio InterfaceShengjiu Wang
The HDMI TX Parallel Audio Interface (HTX_PAI) is a digital module that acts as the bridge between the Audio Subsystem to the HDMI TX Controller. This IP block is found in the HDMI subsystem of the i.MX8MP SoC. Data received from the audio subsystem can have an arbitrary component ordering. The HTX_PAI block has integrated muxing options to select which sections of the 32-bit input data word will be mapped to each IEC60958 field. The HTX_PAI_FIELD_CTRL register contains mux selects to individually select P,C,U,V,Data, and Preamble. Use component helper so that imx8mp-hdmi-tx will be aggregate driver, imx8mp-hdmi-pai will be component driver, then imx8mp-hdmi-pai can use bind() ops to get the plat_data from imx8mp-hdmi-tx device. Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com> Reviewed-by: Liu Ying <victor.liu@nxp.com> Tested-by: Alexander Stein <alexander.stein@ew.tq-group.com> Signed-off-by: Liu Ying <victor.liu@nxp.com> Link: https://lore.kernel.org/r/20250923053001.2678596-6-shengjiu.wang@nxp.com
2025-09-29drm/bridge: dw-hdmi: Add API dw_hdmi_set_sample_iec958() for iec958 formatShengjiu Wang
Add API dw_hdmi_set_sample_iec958() for IEC958 format because audio device driver needs IEC958 information to configure this specific setting. Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com> Acked-by: Liu Ying <victor.liu@nxp.com> Tested-by: Alexander Stein <alexander.stein@ew.tq-group.com> Signed-off-by: Liu Ying <victor.liu@nxp.com> Link: https://lore.kernel.org/r/20250923053001.2678596-5-shengjiu.wang@nxp.com
2025-09-29drm/bridge: dw-hdmi: Add API dw_hdmi_to_plat_data() to get plat_dataShengjiu Wang
Add API dw_hdmi_to_plat_data() to fetch plat_data because audio device driver needs it to enable(disable)_audio(). Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com> Acked-by: Liu Ying <victor.liu@nxp.com> Tested-by: Alexander Stein <alexander.stein@ew.tq-group.com> Signed-off-by: Liu Ying <victor.liu@nxp.com> Link: https://lore.kernel.org/r/20250923053001.2678596-4-shengjiu.wang@nxp.com
2025-09-29ALSA: Add definitions for the bits in IEC958 subframeShengjiu Wang
The IEC958 subframe format SNDRV_PCM_FMTBIT_IEC958_SUBFRAME_LE are used in HDMI and DisplayPort to describe the audio stream, but hardware device may need to reorder the IEC958 bits for internal transmission, so need these standard bits definitions for IEC958 subframe format. Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com> Reviewed-by: Takashi Iwai <tiwai@suse.de> Tested-by: Alexander Stein <alexander.stein@ew.tq-group.com> Signed-off-by: Liu Ying <victor.liu@nxp.com> Link: https://lore.kernel.org/r/20250923053001.2678596-3-shengjiu.wang@nxp.com
2025-09-28mm: convert folio_page() back to a macroDavid Hildenbrand
In commit 73b3294b1152 ("mm: simplify folio_page() and folio_page_idx()") we converted folio_page() into a static inline function. However briefly afterwards in commit a847b17009ec ("mm: constify highmem related functions for improved const-correctness") we had to add some nasty const-away casting to make the compiler happy when checking const correctness. So let's just convert it back to a simple macro so the compiler can check const correctness properly. There is the alternative of using a _Generic() similar to page_folio(), but there is not a lot of benefit compared to just using a simple macro. Link: https://lkml.kernel.org/r/20250923140058.2020023-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-28mm: silence data-race in update_hiwater_rssLance Yang
KCSAN reports a data race on mm_cluster.hiwater_rss, which can be accessed concurrently from various paths like page migration and memory unmapping without synchronization. Since hiwater_rss is a statistical field for accounting purposes, this data race is benign. Annotate both the read and write accesses with data_race() to make KCSAN happy. Link: https://lkml.kernel.org/r/20250926092426.43312-1-lance.yang@linux.dev Signed-off-by: Lance Yang <lance.yang@linux.dev> Reported-by: syzbot+60192c8877d0bc92a92b@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-mm/68d6364e.050a0220.3390a8.000d.GAE@google.com Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Marco Elver <elver@google.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-28mm/ksm: fix incorrect KSM counter handling in mm_struct during forkDonet Tom
Patch series "mm/ksm: Fix incorrect accounting of KSM counters during fork", v3. The first patch in this series fixes the incorrect accounting of KSM counters such as ksm_merging_pages, ksm_rmap_items, and the global ksm_zero_pages during fork. The following patch add a selftest to verify the ksm_merging_pages counter was updated correctly during fork. Test Results ============ Without the first patch ----------------------- # [RUN] test_fork_ksm_merging_page_count not ok 10 ksm_merging_page in child: 32 With the first patch -------------------- # [RUN] test_fork_ksm_merging_page_count ok 10 ksm_merging_pages is not inherited after fork This patch (of 2): Currently, the KSM-related counters in `mm_struct`, such as `ksm_merging_pages`, `ksm_rmap_items`, and `ksm_zero_pages`, are inherited by the child process during fork. This results in inconsistent accounting. When a process uses KSM, identical pages are merged and an rmap item is created for each merged page. The `ksm_merging_pages` and `ksm_rmap_items` counters are updated accordingly. However, after a fork, these counters are copied to the child while the corresponding rmap items are not. As a result, when the child later triggers an unmerge, there are no rmap items present in the child, so the counters remain stale, leading to incorrect accounting. A similar issue exists with `ksm_zero_pages`, which maintains both a global counter and a per-process counter. During fork, the per-process counter is inherited by the child, but the global counter is not incremented. Since the child also references zero pages, the global counter should be updated as well. Otherwise, during zero-page unmerge, both the global and per-process counters are decremented, causing the global counter to become inconsistent. To fix this, ksm_merging_pages and ksm_rmap_items are reset to 0 during fork, and the global ksm_zero_pages counter is updated with the per-process ksm_zero_pages value inherited by the child. This ensures that KSM statistics remain accurate and reflect the activity of each process correctly. Link: https://lkml.kernel.org/r/cover.1758648700.git.donettom@linux.ibm.com Link: https://lkml.kernel.org/r/7b9870eb67ccc0d79593940d9dbd4a0b39b5d396.1758648700.git.donettom@linux.ibm.com Fixes: 7609385337a4 ("ksm: count ksm merging pages for each process") Fixes: cb4df4cae4f2 ("ksm: count allocated ksm rmap_items for each process") Fixes: e2942062e01d ("ksm: count all zero pages placed by KSM") Signed-off-by: Donet Tom <donettom@linux.ibm.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Aboorva Devarajan <aboorvad@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: xu xin <xu.xin16@zte.com.cn> Cc: <stable@vger.kernel.org> [6.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-28mm/page_vma_mapped: track if the page is mapped across page table boundaryKiryl Shutsemau
Patch series "mm: Improve mlock tracking for large folios", v3. The patchset includes several fixes and improvements related to mlock tracking of large folios. The main objective is to reduce the undercount of Mlocked memory in /proc/meminfo and improve the accuracy of the statistics. Patches 1-2: These patches address a minor race condition in folio_referenced_one() related to mlock_vma_folio(). Currently, mlock_vma_folio() is called on large folio without the page table lock, which can result in a race condition with unmap (i.e. MADV_DONTNEED). This can lead to partially mapped folios on the unevictable LRU list. While not a significant issue, I do not believe backporting is necessary. Patch 3: This patch adds mlocking logic similar to folio_referenced_one() to try_to_unmap_one(), allowing for mlocking of large folios where possible. Patch 4-5: These patches modifies finish_fault() and faultaround to map in the entire folio when possible, enabling efficient mlocking upon addition to the rmap. Patch 6: This patch makes rmap mlock large folios if they are fully mapped, addressing the primary source of mlock undercount for large folios. This patch (of 6): Add a PVMW_PGTABLE_CROSSSED flag that page_vma_mapped_walk() will set if the page is mapped across page table boundary. Unlike other PVMW_* flags, this one is result of page_vma_mapped_walk() and not set by the caller. folio_referenced_one() will use it to detect if it safe to mlock the folio. [akpm@linux-foundation.org: s/CROSSSED/CROSSED/] Link: https://lkml.kernel.org/r/20250923110711.690639-1-kirill@shutemov.name Link: https://lkml.kernel.org/r/20250923110711.690639-2-kirill@shutemov.name Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-28Merge tag 'mm-hotfixes-stable-2025-09-27-22-35' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "7 hotfixes. 4 are cc:stable and the remainder address post-6.16 issues or aren't considered necessary for -stable kernels. 6 of these fixes are for MM. All singletons, please see the changelogs for details" * tag 'mm-hotfixes-stable-2025-09-27-22-35' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: include/linux/pgtable.h: convert arch_enter_lazy_mmu_mode() and friends to static inlines mm/damon/sysfs: do not ignore callback's return value in damon_sysfs_damon_call() mailmap: add entry for Bence Csókás fs/proc/task_mmu: check p->vec_buf for NULL kmsan: fix out-of-bounds access to shadow memory mm/hugetlb: fix copy_hugetlb_page_range() to use ->pt_share_count mm/hugetlb: fix folio is still mapped when deleted
2025-09-28Merge tag 'asoc-v6.18-2' of ↵Takashi Iwai
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-next ASoC: Updates for v6.18 round 2 Some more updates for v6.18, mostly fixes for the earlier pull request with some cleanups and more minor fixes for older code. We do have one new driver, the TI TAS2783A, and some quirks for new platforms.