| Age | Commit message | Author |
|
Many users have already transitioned to the newly introduced workqueues
(system_percpu_wq, system_dfl_wq), but there are new users who still use the
older system_wq and system_unbound_wq.
This change tries to push the transition forward by warning when the old
workqueues are used.
Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu fixes from Joerg Roedel:
- Fix compile warning with gcc-16.1
- Intel VT-d: Simplify calculate_psi_aligned_address()
- MAINTAINERS updates
* tag 'iommu-fixes-v7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux:
MAINTAINERS: Add my employer to my entries
MAINTAINERS: Add Vasant Hegde to reviewers of AMD IOMMU
iommu, debugobjects: avoid gcc-16.1 section mismatch warnings
iommu/vt-d: Simplify calculate_psi_aligned_address()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid
Pull HID fixes from Benjamin Tissoires:
- buffer overflow fixes for the lenovo (Kean) and wacom (Lee Jones) drivers
- segfault prevention in the lenovo-go driver when used with an emulated
device (Louis Clinckx)
- cleanup of resources in u2fzero (Myeonghun Pak)
- a quirk for a USB mouse and a cleanup in hid.h (hlleng and Liu Kai)
* tag 'hid-for-linus-2026052801' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
HID: wacom: Fix OOB write in wacom_hid_set_device_mode()
HID: lenovo-go: drop dead NULL check on to_usb_interface()
HID: lenovo-go: reject non-USB transports in probe
HID: lenovo: Fix buffer over-read and unaligned access in X12 Tab raw_event handler
HID: quirks: Add ALWAYS_POLL quirk for SIGMACHIP USB mouse
HID: remove duplicate hid_warn_ratelimited definition
HID: u2fzero: free allocated URB on probe errors
|
|
Panther Lake-H SoC memory controller registers for memory topology have
been updated, but the current igen6_edac driver still uses the old-generation
ones and therefore parses the memory topology incorrectly.
Fix the issue by adding memory topology parsing function pointers to the
'struct res_config' and creating a new configuration structure for Panther
Lake-H SoCs to enable igen6_edac to parse memory correctly.
Fixes: 0be9f1af3902 ("EDAC/igen6: Add Intel Panther Lake-H SoCs support")
Fixes: 4c36e6106997 ("EDAC/igen6: Add more Intel Panther Lake-H SoCs support")
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20260403054029.3950383-3-qiuxu.zhuo@intel.com
|
|
Add new UABI and implementation of PERFCNTR_CONFIG ioctl.
A bit more work is required to configure the pwrup_reglist for the GMU
to restore SELect regs on exit of IFPC, before we can stop disabling
IFPC during global counter collection. This will follow in a later
commit, but will be transparent to userspace.
Signed-off-by: Rob Clark <robin.clark@oss.qualcomm.com>
Reviewed-by: Anna Maniscalco <anna.maniscalco2000@gmail.com>
Reviewed-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
Patchwork: https://patchwork.freedesktop.org/patch/728217/
Message-ID: <20260526145137.160554-14-robin.clark@oss.qualcomm.com>
|
|
Sashiko reported an inconsistent use of NULL vs ERR_PTR()
returns in the stub helpers in exynos-acpm-protocol.h.
Since this only happens on dead code for COMPILE_TEST=y, this is not
really a bug though. Having stub functions that return NULL is a common
way to define optional interfaces, where callers still work when the
feature is disabled, though this clearly does not work for acpm because
some callers have a NULL pointer dereference when compile testing.
CONFIG_EXYNOS_ACPM_PROTOCOL already supports compile-testing itself,
and all (both) drivers using it clearly require the support, so this
just simplifies the option space without losing any build coverage.
Remove the stub functions entirely and adjust the one Kconfig
dependency to require EXYNOS_ACPM_PROTOCOL unconditionally.
Fixes: 6837c006d4e7 ("firmware: exynos-acpm: add empty method to allow compile test")
Closes: https://sashiko.dev/#/patchset/20260420-acpm-tmu-v3-0-3dc8e93f0b26%40linaro.org
Link: https://lore.kernel.org/all/a7994860-24a3-4f87-84bf-109ed653dda4@linaro.org/
Reviewed-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20260529134454.2147446-1-arnd@kernel.org
[krzk: Rebase on difference in devm_acpm_get_by_node()]
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
|
|
Introduce devm_acpm_get_by_phandle() to standardize how consumer
drivers acquire a handle to the ACPM IPC interface. Enforce the
use of the "samsung,acpm-ipc" property name across the SoC and
simplify the boilerplate code in client drivers.
The first consumer of this helper is the Exynos ACPM Thermal Management
Unit (TMU) driver. The TMU utilizes a hybrid management approach: direct
register access from the Application Processor (AP) is restricted to the
interrupt pending (INTPEND) registers for event identification.
High-level functional tasks, such as sensor initialization, threshold
programming, and temperature reads, are delegated to the ACPM firmware
via this IPC interface.
Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Reviewed-by: Peter Griffin <peter.griffin@linaro.org>
Link: https://patch.msgid.link/20260515-acpm-tmu-helpers-v2-6-8ca011d5a965@linaro.org
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
|
|
The Thermal Management Unit (TMU) on the Google GS101 SoC is managed
through a hybrid model shared between the kernel and the Alive Clock
and Power Manager (ACPM) firmware.
Add the protocol helpers required to communicate with the ACPM for
thermal operations, including initialization, threshold configuration,
temperature reading, and system suspend/resume handshakes.
Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Reviewed-by: Peter Griffin <peter.griffin@linaro.org>
Link: https://patch.msgid.link/20260515-acpm-tmu-helpers-v2-5-8ca011d5a965@linaro.org
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
|
|
Replace the embedded `struct acpm_ops` inside `struct acpm_handle` with
a pointer to a `const struct acpm_ops`.
Previously, the operations structure was embedded directly within the
handle and populated dynamically at runtime via `acpm_setup_ops()`.
This resulted in mutable function pointers and unnecessary per-instance
memory overhead.
By defining `exynos_acpm_driver_ops` statically as a `const` structure,
the function pointers are now safely housed in the read-only `.rodata`
section. This improves security by preventing function pointer
overwrites, saves memory, and slightly reduces initialization overhead
in `acpm_probe()`.
Consequently, update all consumer drivers (clk, mfd) to access the
operations via the new pointer indirection (`->ops->`). Finally, fix
the previously empty kernel-doc description for the ops member to
reflect its new pointer nature.
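The difference between an embedded, runtime-populated ops structure and a
pointer to a shared `const` table can be illustrated with a small userspace
sketch. All `demo_*` names below are hypothetical and only echo the shape
described above; this is not the driver code.

```c
/* Hypothetical operation tables; the names only echo the commit message. */
struct demo_dvfs_ops {
	int (*set_rate)(unsigned int clk_id, unsigned long rate);
};

struct demo_ops {
	struct demo_dvfs_ops dvfs;
};

struct demo_handle {
	/* Pointer to a shared, read-only table instead of an embedded copy. */
	const struct demo_ops *ops;
};

static int demo_set_rate(unsigned int clk_id, unsigned long rate)
{
	(void)clk_id;
	return rate ? 0 : -1;	/* reject a zero rate, accept anything else */
}

/*
 * One instance shared by all handles. A const object with static storage
 * duration is typically placed in .rodata by the toolchain.
 */
static const struct demo_ops demo_driver_ops = {
	.dvfs = { .set_rate = demo_set_rate },
};

static void demo_handle_init(struct demo_handle *h)
{
	h->ops = &demo_driver_ops;	/* no per-instance function pointer setup */
}

/* Consumer side: one extra pointer indirection (->ops->). */
static int demo_consumer(const struct demo_handle *h)
{
	return h->ops->dvfs.set_rate(0, 1000000UL);
}
```

In this sketch every handle carries a single pointer, and two handles
initialised by `demo_handle_init()` share the same table.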
Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Reviewed-by: Peter Griffin <peter.griffin@linaro.org>
Link: https://patch.msgid.link/20260515-acpm-tmu-helpers-v2-4-8ca011d5a965@linaro.org
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
|
|
Rename the `dvfs_ops` and `pmic_ops` members of `struct acpm_ops` to
`dvfs` and `pmic` respectively.
Since these members are housed within the `acpm_ops` structure and
utilize the `acpm_*_ops` types, the `_ops` suffix on the variable names
creates unnecessary redundancy (e.g., `handle.ops.dvfs_ops`).
This cleanup removes the stuttering, leading to cleaner consumer code.
Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Reviewed-by: Peter Griffin <peter.griffin@linaro.org>
Acked-by: Lee Jones <lee@kernel.org>
Link: https://lore.kernel.org/linux-samsung-soc/CADrjBPqzKpcd9vuCmNUptCUPyPpPbHcc19-7kN-1c0RpW1e5DQ@mail.gmail.com/T/#mcce154a7e0c6cd1ca6cd5a1e37541ed7a85a84d4 [1]
Link: https://patch.msgid.link/20260515-acpm-tmu-helpers-v2-3-8ca011d5a965@linaro.org
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
|
|
Merge updates that introduce devm_acpi_install_notify_handler()
and convert some drivers for core ACPI devices previously using
acpi_dev_install_notify_handler() to devres-based resource
management.
* acpi-driver-devm:
ACPI: video: Switch over to devres-based resource management
ACPI: video: Use devm for video->entry and backlight cleanup
ACPI: video: Use devm action for freeing video devices
ACPI: video: Use devm action for video bus object cleanup
ACPI: video: Rearrange probe and remove code
ACPI: video: Reduce the number of auxiliary device dereferences
ACPI: PAD: Switch over to devres-based resource management
ACPI: PAD: Fix teardown ordering in acpi_pad_remove()
ACPI: PAD: Pass struct device pointer to acpi_pad_notify()
ACPI: PAD: Rearrange acpi_pad_notify()
ACPI: thermal: Switch over to devres-based resource management
ACPI: HED: Switch over to devres-based resource management
ACPI: HED: Refine guarding against adding a second instance
ACPI: battery: Switch over to devres-based resource management
ACPI: AC: Switch over to devres-based resource management
ACPI: NFIT: core: Use devm_acpi_install_notify_handler()
ACPI: bus: Introduce devm_acpi_install_notify_handler()
|
|
|
|
People have gone to the trouble of writing this kernel-doc; the
least we can do is publish it.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@linux.dev>
Link: https://patch.msgid.link/20260528175905.1102280-3-willy@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This is a simple helper which replaces page_folio(bvec->bv_page).
Minor improvement in readability, but the real motivation is to reduce
the number of references to bvec->bv_page so that it can be changed
with less work.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@linux.dev>
Link: https://patch.msgid.link/20260528175905.1102280-2-willy@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit cd959a3562050d ("sched_ext: Add a DL server for sched_ext tasks")
introduced an ext_server deadline server to protect sched_ext tasks from
fair/RT starvation, mirroring the existing fair_server.
Currently, both servers reserve their 50ms/1000ms bandwidth at boot,
regardless of whether a BPF scheduler is loaded. Unused bandwidth is
still reclaimed at runtime by other classes, but the static reservation
prevents the RT class from implicitly using that headroom when one of
the two classes is guaranteed to be empty.
A sysadmin can work around this by writing
/sys/kernel/debug/sched/{fair,ext}_server/cpu*/runtime, but that
requires manual action and not all systems expose debugfs.
A better approach is to make server bandwidth reservations dynamic: only
the scheduling policy that is currently active should register its
reservation, while the inactive one should not artificially hold
capacity (keeping both reservations only when the BPF scheduler is
running in partial mode):
+----------------------------------------------+-------------+------------+
| BPF scheduler state                          | fair server | ext server |
+----------------------------------------------+-------------+------------+
| not loaded (default boot)                    | reserved    | none       |
| loaded full mode (!SCX_OPS_SWITCH_PARTIAL)   | none        | reserved   |
| loaded partial mode (SCX_OPS_SWITCH_PARTIAL) | reserved    | reserved   |
+----------------------------------------------+-------------+------------+
To achieve this, introduce an "attached/detached" state for each
deadline server, so the kernel can decide whether a server's bandwidth
should be accounted in global bandwidth tracking.
At boot, the system starts with only the fair server contributing to
bandwidth accounting. When a BPF scheduler is enabled, the ext server is
attached and may replace or complement the fair server depending on
whether full or partial mode is used. When sched_ext is disabled, the
system restores the previous deadline bandwidth values and behavior.
The transition logic ensures that switching between scheduling modes is
consistent and reversible, without losing runtime configuration or
requiring manual intervention.
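The attach/detach policy from the table can be modelled in a few lines of
userspace C. This is only a sketch: the `demo_*` names are hypothetical, and
the 50ms figure is the default per-server runtime quoted above, which a
sysadmin may have changed.

```c
#include <stdbool.h>

/* Hypothetical names; they only model the table in the commit message. */
enum demo_scx_state {
	DEMO_SCX_NOT_LOADED,	/* default boot, no BPF scheduler */
	DEMO_SCX_FULL,		/* loaded, !SCX_OPS_SWITCH_PARTIAL */
	DEMO_SCX_PARTIAL,	/* loaded, SCX_OPS_SWITCH_PARTIAL */
};

struct demo_dl_servers {
	bool fair_attached;
	bool ext_attached;
};

/* Assumed default: each server reserves 50ms out of every 1000ms. */
#define DEMO_SERVER_RUNTIME_MS 50

/* Decide which servers contribute to global bandwidth accounting. */
static struct demo_dl_servers demo_servers_for(enum demo_scx_state state)
{
	struct demo_dl_servers s = { .fair_attached = true, .ext_attached = false };

	switch (state) {
	case DEMO_SCX_NOT_LOADED:
		break;			/* only the fair server is attached */
	case DEMO_SCX_FULL:
		s.fair_attached = false;	/* ext replaces fair */
		s.ext_attached = true;
		break;
	case DEMO_SCX_PARTIAL:
		s.ext_attached = true;	/* ext complements fair */
		break;
	}
	return s;
}

/* Runtime reserved per 1000ms period by the attached servers. */
static unsigned int demo_reserved_ms(struct demo_dl_servers s)
{
	return (s.fair_attached + s.ext_attached) * DEMO_SERVER_RUNTIME_MS;
}
```

Under these assumptions the reservation stays at 50ms per period unless the
BPF scheduler runs in partial mode, where both servers are attached.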
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/20260526164420.638711-2-arighi@nvidia.com
|
|
The only user of LAST_XXX outside of fs/namei.c is fs/smb/server/vfs.c;
ksmbd_vfs_path_lookup() calls vfs_path_parent_lookup() and expects a
LAST_NORM last type (or it will be ENOENT). ksmbd_vfs_rename() also calls
vfs_path_parent_lookup() but forgets the LAST_NORM check.
It does not really make sense to have vfs_path_parent_lookup() expose
the last_type because it is only needed to ensure it is LAST_NORM. So
let's do this check in vfs_path_parent_lookup() instead and keep the
LAST_XXX internal to fs/namei.c. This changes the ENOENT errno in
ksmbd_vfs_path_lookup() to EINVAL, which matches better with how this is
handled by callers of filename_parentat().
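The idea of keeping the last-component type private to the lookup helper can
be sketched as a toy userspace model. The `demo_*` names and the string-based
classification are assumptions made for illustration; the real lookup in
fs/namei.c works on resolved paths, not on plain strings.

```c
#include <errno.h>
#include <string.h>

/* Private to the "lookup" side; callers never see these values. */
enum demo_last_type {
	DEMO_LAST_NORM,
	DEMO_LAST_ROOT,		/* empty last component in this toy model */
	DEMO_LAST_DOT,
	DEMO_LAST_DOTDOT,
};

static enum demo_last_type demo_classify(const char *last)
{
	if (last[0] == '\0')
		return DEMO_LAST_ROOT;
	if (!strcmp(last, "."))
		return DEMO_LAST_DOT;
	if (!strcmp(last, ".."))
		return DEMO_LAST_DOTDOT;
	return DEMO_LAST_NORM;
}

/*
 * Toy parent lookup: finds the last component of @path and rejects
 * anything that is not a normal name with -EINVAL, so that no caller
 * can forget the check.
 */
static int demo_parent_lookup(const char *path, const char **last)
{
	const char *slash = strrchr(path, '/');
	const char *name = slash ? slash + 1 : path;

	if (demo_classify(name) != DEMO_LAST_NORM)
		return -EINVAL;

	*last = name;
	return 0;
}
```

With the check inside the helper, both callers get the same behaviour and
the type enumeration does not need to be exported.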
Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
Link: https://patch.msgid.link/20260528175854.57626-1-jkoolstra@xs4all.nl
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
AF_ALG is deprecated and exposed to unprivileged userspace. Only
use the least buggy algorithm implementations: the pure software ones.
This removes one of the main advantages of AF_ALG, which is the
ability to use it with off-CPU accelerators. However, using off-CPU
accelerators has huge overheads, both in performance and attack surface.
I have yet to see real-world, performance-critical workloads where using
an accelerator via AF_ALG is actually a win over doing cryptography in
userspace.
If using an off-CPU accelerator really does turn out to be a win, a new
API should be developed that is actually a good fit for it.
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
The only user of msg->msg_iocb was AF_ALG, but that's deprecated.
It can be removed entirely at the cost of only supporting synchronous
operations. This doesn't break userspace, which will silently block
(for a bounded amount of time) in io_submit instead of operating
asynchronously.
This also makes struct msghdr smaller, helping every other caller of
sendmsg().
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
The driver notifies the hardware to handle tasks through the
doorbell. Currently, the doorbell is enabled by default. If a
process sends doorbells during a hardware reset, the hardware
may process them and trigger new errors. For example, while the
physical machine is resetting the device, doorbells may still be
sent from the virtual machine.
Therefore, the driver disables the doorbell while the hardware is
unavailable. After hardware initialization is completed, the
doorbell is enabled again, and any task sent during the unavailability
period will return an error.
The hardware supports the PF disabling doorbells for all functions,
while a VF can only disable its own doorbell. When the PF
is reset, it disables doorbells for all functions. When a VF is
reset, it only disables its own doorbell and does not affect tasks
on other functions.
Signed-off-by: Zongyu Wu <wuzongyu1@huawei.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
When executing operations on crypto devices, hardware errors
are inevitable. For certain errors, a full device reset is
required to recover. However, in certain cases, only a
specific function may fail, while other functions can still
operate normally. A system-wide RAS reset in such cases would
unnecessarily impact functioning components.
This patch introduces function-level granularity handling,
enabling targeted resets of only the error-reporting
functions without affecting other operational functions.
Signed-off-by: Zhushuai Yin <yinzhushuai@huawei.com>
Signed-off-by: Zongyu Wu <wuzongyu1@huawei.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
usage counter
Accessing the memory of a suspended device must be avoided, and the usage
counter interface used by PM may sleep, so it cannot be called from the
interrupt top half. Therefore, acquiring the interrupt status in the RAS
reset flow, which currently runs in interrupt context, needs to be moved
to the bottom half.
Signed-off-by: Zhushuai Yin <yinzhushuai@huawei.com>
Signed-off-by: Zongyu Wu <wuzongyu1@huawei.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Fix the problem that the VF device cannot obtain the isolation
status and isolation threshold of the device.
The accelerator driver can query the device isolation status
and threshold via the VF device using the fault query sysfs
interface under uacce. Note that only the PF device supports
isolation policy configuration, while the VF device is
limited to read-only query operations.
Signed-off-by: Zhushuai Yin <yinzhushuai@huawei.com>
Signed-off-by: Zongyu Wu <wuzongyu1@huawei.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
All users have been removed over time as ARM and other architectures
switched to generic sched_clock. The last user was microblaze, removed in
commit 839396ab88e4 ("microblaze: timer: Use generic sched_clock
implementation").
Assisted-by: Claude:claude-opus-4-6
Link: https://lore.kernel.org/20260515183429.1503740-1-costa.shul@redhat.com
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicolas Pitre <npitre@baylibre.com>
Cc: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Split out two new headers from the public pq.h:
- lib/raid/raid6/algos.h contains the algorithm lists private to
lib/raid/raid6
- include/linux/raid/pq_tables.h contains the tables also used by
async_tx providers.
The public include/linux/pq.h is now limited to the public interface for
the consumers of the RAID6 PQ API.
[hch@lst.de: remove duplicate ccflags-y line]
Link: https://lore.kernel.org/20260527074539.2292913-2-hch@lst.de
Link: https://lore.kernel.org/20260518051804.462141-10-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Ard Biesheuvel <ardb@kernel.org> # kunit only on arm64
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Li Nan <linan122@huawei.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Song Liu <song@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Quoting H. Peter Anvin who came up with the RAID6 P/Q algorithm, and who
wrote the initial implementation, then still part of the md driver:
The RAID-6 code has *never* supported only 3 units, and if it ever
worked for *any* of the implementations it was purely by accident.
Speaking as the original author I should know; this was deliberate as
in some cases the degenerate case (3) would have required extra trays
in the code to no user benefit.
While md never allowed fewer than 4 devices, btrfs does. This new warning
will trigger for such file systems, but given how it already causes havoc
that is a good thing. If btrfs wants to fix this, it should switch to
transparently use three-way mirroring underneath, which will work as P and
Q are copies of the single data device by the definition of the Linux RAID
6 P/Q algorithm.
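The claim that P and Q degenerate into copies of a single data device can be
checked with a byte-wise sketch of the syndrome definition: P is the XOR of
all data blocks, and Q is the sum of g^i * D_i in GF(2^8) with generator
g = {02} and polynomial 0x11d. The `demo_*` names are hypothetical and this
is not the optimized kernel code.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply by the generator {02} in GF(2^8), reducing modulo 0x11d. */
static uint8_t demo_gf_mul2(uint8_t v)
{
	return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0));
}

/*
 * P = D_0 ^ D_1 ^ ... ^ D_{n-1}
 * Q = D_0 ^ g*D_1 ^ g^2*D_2 ^ ...   (Horner's rule, highest disk first)
 */
static void demo_gen_syndrome(size_t ndata, size_t len,
			      const uint8_t *const *data,
			      uint8_t *p, uint8_t *q)
{
	for (size_t b = 0; b < len; b++) {
		uint8_t wp = 0, wq = 0;

		for (size_t d = ndata; d-- > 0; ) {
			wq = demo_gf_mul2(wq) ^ data[d][b];
			wp ^= data[d][b];
		}
		p[b] = wp;
		q[b] = wq;
	}
}
```

With `ndata == 1` the loop body runs once per byte, so both outputs equal
the data byte (g^0 = 1), which is the three-way mirror described above.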
Link: https://lore.kernel.org/20260518051804.462141-9-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Ard Biesheuvel <ardb@kernel.org> # kunit only on arm64
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Li Nan <linan122@huawei.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Song Liu <song@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Stop directly calling into function pointers from users of the RAID6 PQ
API, and instead provide exported functions with proper documentation and
API guarantee asserts where applicable.
Link: https://lore.kernel.org/20260518051804.462141-8-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Ard Biesheuvel <ardb@kernel.org> # kunit only on arm64
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Li Nan <linan122@huawei.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Song Liu <song@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Just open code it as in other places in the kernel.
Link: https://lore.kernel.org/20260518051804.462141-6-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Ard Biesheuvel <ardb@kernel.org> # kunit only on arm64
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Li Nan <linan122@huawei.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Song Liu <song@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
These are not used anywhere in the kernel.
Link: https://lore.kernel.org/20260518051804.462141-5-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Ard Biesheuvel <ardb@kernel.org> # kunit only on arm64
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Li Nan <linan122@huawei.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Song Liu <song@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
With the test code ported to kernel space, none of this is required.
Link: https://lore.kernel.org/20260518051804.462141-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Ard Biesheuvel <ardb@kernel.org> # kunit only on arm64
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Li Nan <linan122@huawei.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Song Liu <song@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "cleanup the RAID6 P/Q library", v3.
This series cleans up the RAID6 P/Q library to match the recent updates to
the RAID 5 XOR library and other CRC/crypto libraries. This includes
providing properly documented external interfaces, hiding the internals,
using static_call instead of indirect calls and turning the user space
test suite into an in-kernel kunit test which is also extended to improve
coverage.
Note that this changes registration so that non-priority algorithms are
not registered, which greatly helps with the benchmark time at boot time.
I'd like to encourage all architecture maintainers to see if they can
further optimize this by registering as few algorithms as possible when
there is a clear benefit in optimized or more unrolled implementations.
This patch (of 18):
Currently the raid6 code can be compiled as userspace code to run the test
suite. Convert that to be a kunit case with minimal changes to avoid
mutating global state so that we can drop this requirement.
Note that this is not a good kunit test case yet and will need a lot more
work, but that is deferred until the raid6 code is moved to its new
place, which is easier if the userspace makefile doesn't need adjustments
for the new location first.
Link: https://lore.kernel.org/20260518051804.462141-1-hch@lst.de
Link: https://lore.kernel.org/20260518051804.462141-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Ard Biesheuvel <ardb@kernel.org> # kunit only on arm64
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Li Nan <linan122@huawei.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Song Liu <song@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Allow the same userspace thread to simultaneously collect normal coverage
in syscall context (KCOV_ENABLE) and remote coverage of asynchronous work
created by the thread (KCOV_REMOTE_ENABLE). With this, remote KCOV
coverage becomes useful for generic fuzzing and not just fuzzing of
specific data injection interfaces.
This requires that the task_struct::kcov_* fields are separated into ones
that are used by the task that generates coverage, and ones that are used
by the task that requested remote coverage. To split this up:
- Split task_struct::kcov into kcov and kcov_remote. kcov_task_exit() now
has to clean up both separately.
- Only use task_struct::kcov_mode on the task that generates coverage.
- Only reset task_struct::kcov_handle on the task that requested remote
coverage.
After this change, fields used by the task that generates coverage are:
- kcov_mode
- kcov_size
- kcov_area
- kcov
- kcov_sequence
- kcov_softirq
Fields used by the task that requested remote coverage are:
- kcov_remote
- kcov_handle
[jannh@google.com: remove unused constant KCOV_MODE_REMOTE, per Dmitry]
Link: https://lore.kernel.org/20260515-kcov-simultaneous-remote-v2-1-56fde1cfa509@google.com
[jannh@google.com: update documentation on remote coverage collection]
Link: https://lore.kernel.org/20260519-kcov-docs-v1-1-5bb22f4cb20c@google.com
[jannh@google.com: move and reword sentence on simultaneous normal/remote collection]
Link: https://lore.kernel.org/20260520-kcov-docs-v2-1-819f78778763@google.com
Link: https://lore.kernel.org/20260505-kcov-simultaneous-remote-v1-1-a670ba7cefd2@google.com
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
llist's locking requirement table has a legend which claims that all
operations not needing a lock are marked with '-', whereas in truth for some
table entries just a whitespace is used.
Add the '-' to all appropriate places.
Link: https://lore.kernel.org/20260507094918.23910-2-phasta@kernel.org
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Store common handle IDs in "struct kcov_common_handle_id", which consumes
no space in non-KCOV builds.
This cleanup removes #ifdef boilerplate code from subsystems that
integrate with KCOV (in particular in usbip_common.h and skbuff.h, see the
diffstat).
This should also make it easier to add KCOV remote coverage to more
subsystems in the future.
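As an illustration only (this is not the code from the patch), the idea of a
wrapper type that occupies no space when coverage is compiled out can be
sketched in userspace C. All names below (DEMO_KCOV, demo_common_handle_id,
demo_handle_set(), demo_handle_get(), demo_packet) are hypothetical, and the
sketch relies on the GNU C extension that gives an empty struct a size of
zero.

```c
#include <stdint.h>

/* Define DEMO_KCOV before this block to model a coverage-enabled build. */
struct demo_common_handle_id {
#ifdef DEMO_KCOV
	uint64_t val;
#endif
};

/* Store a handle; compiles to nothing when DEMO_KCOV is not defined. */
static inline void demo_handle_set(struct demo_common_handle_id *id,
				   uint64_t handle)
{
#ifdef DEMO_KCOV
	id->val = handle;
#else
	(void)id;
	(void)handle;
#endif
}

/* Read the handle back; always 0 when DEMO_KCOV is not defined. */
static inline uint64_t demo_handle_get(const struct demo_common_handle_id *id)
{
#ifdef DEMO_KCOV
	return id->val;
#else
	(void)id;
	return 0;
#endif
}

/* A user of the wrapper needs no #ifdef of its own. */
struct demo_packet {
	uint32_t len;
	struct demo_common_handle_id kcov_id;
};
```

With DEMO_KCOV undefined, embedding the wrapper does not grow demo_packet,
and the accessors remain callable, so the embedding code stays free of
conditional compilation.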
Link: https://lore.kernel.org/20260430-kcov-refactor-common-handle-v1-1-23a0c7a0ba38@google.com
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Hongren (Zenithal) Zheng <i@zenithal.me>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Valentina Manea <valentina.manea.m@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Now that we've got the same config selecting inline vs outline
copy_to_user() and copy_from_user(), we can simplify the corresponding
logic in uaccess.h.
Link: https://lore.kernel.org/20260425020857.356850-4-ynorov@nvidia.com
Fixes: 1f9a8286bc0c ("uaccess: always export _copy_[from|to]_user with CONFIG_RUST")
Signed-off-by: Yury Norov <ynorov@nvidia.com>
Tested-by: Alice Ryhl <aliceryhl@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Viktor Malik <vmalik@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The kernel allows arches to select between inline and outline
implementations of the copy_{from,to}_user() by defining individual
INLINE_COPY_FROM_USER and INLINE_COPY_TO_USER, correspondingly. However,
all arches enable or disable them always together.
Without the real use-case for one helper being inlined while the other
outlined, having independent controls is excessive and error prone.
Switch the codebase to the single unified INLINE_COPY_USER control.
Link: https://lore.kernel.org/20260425020857.356850-3-ynorov@nvidia.com
Signed-off-by: Yury Norov <ynorov@nvidia.com>
Tested-by: Alice Ryhl <aliceryhl@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Viktor Malik <vmalik@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Any __exitcall() and built-in module_exit() handler is marked as __used,
which leads to the code being included in the object file and later
discarded at link time.
As far as I can tell, this was originally added at the same time as
initcalls were marked the same way, to prevent them from getting dropped
with gcc-3.4, but it was never actually necessary to keep exit functions
around.
Mark them as __maybe_unused instead, which lets the compiler treat the
exitcalls as entirely unused, and make better decisions about dropping or
specializing static functions called from these.
Link: https://lore.kernel.org/all/acruxMNdnUlyRHiy@google.com/
Link: https://lore.kernel.org/20260331142846.3187706-1-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Nicolas Schier <nsc@kernel.org>
Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Marco Elver <elver@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Replace `PREEMP_RT` with `PREEMPT_RT` in the header comment to match the
correct kernel configuration name.
Link: https://lore.kernel.org/20260505021125.1941691-1-zhouzhouyi@gmail.com
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Now nobody is using damon_set_region_biggest_system_ram_default(). Remove
it.
Link: https://lore.kernel.org/20260429041232.90257-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon/reclaim,lru_sort: monitor all system rams by
default".
DAMON_RECLAIM and DAMON_LRU_SORT set the biggest 'System RAM' resource of
the system as the default monitoring target address range. The main
intention behind the design is to minimize the overhead coming from
monitoring of non-System RAM areas.
This could result in an odd setup when there are multiple discrete System
RAMs of considerable sizes. For example, there are System RAMs each
having 500 GiB size. In this case, only the first 500 GiB will be set as
the monitoring region by default. This is particularly common on NUMA
systems. Hence the modules allow users to set the monitoring target
address range using the module parameters if the default setup doesn't
work for them. In other words, the current design trades ease of setup
for lower overhead.
However, because DAMON utilizes the sampling based access check and the
adaptive regions adjustment mechanisms, the overhead from the monitoring
of non-System RAM areas should be negligible in most setups. Meanwhile,
the setup complexity is causing real headaches for users who need to run
those modules on various types of systems. That is, the current tradeoff
is not a good deal.
Set the physical address range that can cover all System RAM areas of the
system as the default monitoring regions for DAMON_RECLAIM and
DAMON_LRU_SORT.
Technically speaking, this is changing documented behavior. However, it
makes no sense to believe there is a real use case that really depends on
the old weird default behavior. If the old default behavior was working
for them in the reasonable way, this change will only add a negligible
amount of monitoring overhead. If it didn't work, the users may already
be using manual monitoring regions setup, and they will not be affected by
this change.
Patches Sequence
================
Patch 1 introduces a new core function that will be used for the new
default monitoring target region setup. Patch 2 and 3 update
DAMON_RECLAIM and DAMON_LRU_SORT to use the new function instead of the
old one, respectively. Patch 4 removes the old core function that was
replaced by the new one, as there is no more user of it. Patch 5 updates
DAMON_STAT to use the new one instead of its in-house nearly-duplicate
self implementation of the functionality. Finally patches 6 and 7 update
the DAMON_RECLAIM and DAMON_LRU_SORT user documentation for the new
behaviors, respectively.
This patch (of 7):
damon_set_region_biggest_system_ram_default() sets the monitoring target
region as the caller requested. If the caller didn't specify the region,
it finds the biggest System RAM of the system and sets it as the target
region. When the system has more than one System RAM resource of
considerable size, the default target setup makes no sense.
Introduce a variant, namely damon_set_region_system_rams_default(). It
sets a physical address range that covers all System RAM resources as the
default target region.
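For illustration only, the difference between the two defaults can be
sketched in userspace C. The struct and function names below are
hypothetical stand-ins, not the kernel implementation, and the sketch
assumes each System RAM resource is an inclusive [start, end] address pair.

```c
#include <stddef.h>

/* Hypothetical stand-in for one 'System RAM' resource (inclusive range). */
struct demo_ram {
	unsigned long start;
	unsigned long end;
};

/* Old default: select only the single biggest resource (first one wins ties). */
static int demo_biggest_ram(const struct demo_ram *rams, size_t nr,
			    unsigned long *start, unsigned long *end)
{
	size_t i, best = 0;

	if (!nr)
		return -1;
	for (i = 1; i < nr; i++) {
		if (rams[i].end - rams[i].start >
		    rams[best].end - rams[best].start)
			best = i;
	}
	*start = rams[best].start;
	*end = rams[best].end;
	return 0;
}

/* New default: one range from the lowest start to the highest end. */
static int demo_all_rams(const struct demo_ram *rams, size_t nr,
			 unsigned long *start, unsigned long *end)
{
	size_t i;

	if (!nr)
		return -1;
	*start = rams[0].start;
	*end = rams[0].end;
	for (i = 1; i < nr; i++) {
		if (rams[i].start < *start)
			*start = rams[i].start;
		if (rams[i].end > *end)
			*end = rams[i].end;
	}
	return 0;
}
```

Note that the covering range also includes any holes between the
resources, which is where the extra monitoring of non-System RAM areas
comes from.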
Link: https://lore.kernel.org/20260429041232.90257-1-sj@kernel.org
Link: https://lore.kernel.org/20260429041232.90257-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Page tables are always accessed via the linear mapping with a match-all
tag, so HW-tag KASAN never checks them. For page-allocated tables (PTEs
and PGDs etc), avoid the tag setup and poisoning overhead by using
__GFP_SKIP_KASAN. SLUB-backed page tables are unchanged for now. (They
aren't widely used and require more SLUB related skip logic. Leave it for
later.)
Link: https://lore.kernel.org/20260429102704.680174-4-dev.jain@arm.com
Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
HW-tag KASAN never checks kernel stacks because stack pointers carry the
match-all tag, so setting/poisoning tags is pure overhead.
- Add __GFP_SKIP_KASAN to THREADINFO_GFP so every stack allocator that
uses it skips tagging (fork path plus arch users)
- Add __GFP_SKIP_KASAN to GFP_VMAP_STACK for the fork-specific vmap
stacks.
- When reusing cached vmap stacks, skip kasan_unpoison_range() if HW tags
are enabled.
Software KASAN is unchanged; this only affects tag-based KASAN.
Link: https://lore.kernel.org/20260429102704.680174-3-dev.jain@arm.com
Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ben Segall <bsegall@google.com>
Cc: David Hildenbrand (Arm) <david@kernel.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "kasan: hw_tags: Disable tagging for stack and page-tables",
v4.
Stacks and page tables are always accessed with the match-all tag, so
assigning a new random tag every time at allocation and setting invalid
tag at deallocation time, just adds overhead without improving the
detection.
With __GFP_SKIP_KASAN the page keeps its poison tag and KASAN_TAG_KERNEL
(match-all tag) is stored in the page flags while keeping the poison tag
in the hardware. The benefit of it is that 256 tag-setting instructions
per 4 kB page aren't needed at allocation and deallocation time.
Thus match-all pointers still work, while non-match tags (other than
poison tag) still fault.
__GFP_SKIP_KASAN only skips for KASAN_HW_TAGS mode, so coverage is
unchanged.
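As a quick check of the "256 per 4 kB page" figure, here is a minimal
sketch. It assumes one tag store per 16-byte granule, which is the arm64
MTE granule size; the macro and function names are hypothetical.

```c
/* Assumption: one memory tag covers a 16-byte granule, as on arm64 MTE. */
#define DEMO_TAG_GRANULE_SIZE 16UL

/* Number of granules (and thus per-granule tag stores) in nr_bytes. */
static unsigned long demo_tag_granules(unsigned long nr_bytes)
{
	return nr_bytes / DEMO_TAG_GRANULE_SIZE;
}
```

For a 4096-byte page this gives 4096 / 16 = 256, matching the number
quoted above.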
Benchmark:
The benchmark has two modes. In thread mode, the child process forks
and creates N threads. In pgtable mode, the parent maps and faults a
specified memory size and then forks repeatedly with children exiting
immediately.
Thread benchmark:
2000 iterations, 2000 threads: 2.575 s → 2.229 s (~13.4% faster)
The pgtable samples:
- 2048 MB, 2000 iters 19.08 s → 17.62 s (~7.6% faster)
This patch (of 3):
For allocations that will be accessed only with match-all pointers (e.g.,
kernel stacks), setting tags is wasted work. If the caller already set
__GFP_SKIP_KASAN, skip tag setting of vmalloc pages.
Before this patch, __GFP_SKIP_KASAN wasn't being used with vmalloc APIs.
So it wasn't being checked. Now it's being checked and acted upon. Other
KASAN modes are unchanged because __GFP_SKIP_KASAN is ignored for them in
the page allocator, and in vmalloc too we ignore this flag for them.
This is a preparatory patch for optimizing kernel stack allocations.
Link: https://lore.kernel.org/20260429102704.680174-1-dev.jain@arm.com
Link: https://lore.kernel.org/20260429102704.680174-2-dev.jain@arm.com
Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
Co-developed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ben Segall <bsegall@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon: let DAMON be paused and resumed", v2.
DAMON utilizes a few mechanisms that enhance itself over time. Adaptive
regions adjustment, goal-based DAMOS quota auto-tuning and monitoring
intervals auto-tuning like self-training mechanisms are such examples. It
also adds access frequency stability information (age) to the monitoring
results, which makes it enhanced over time.
Sometimes users have to stop DAMON. In this case, DAMON internal state
that enhanced over the time of the last execution simply goes away.
Restarted DAMON has to train itself and enhance its output from
scratch. This makes DAMON less useful in such cases. Three such use
cases are introduced below.
Investigation of DAMON. It is best to do the investigation online,
especially when it is a production environment. DAMON therefore provides
features for such online investigations, including DAMOS stats, monitoring
result snapshot exposure, and multiple tracepoints. When those are
insufficient, and there are additional clues that could be interfered with
by DAMON, users have to temporarily stop DAMON to collect the additional
clues. It is not very useful since many of DAMON internal clues are gone
when DAMON is stopped. The loss of the monitoring results that improved
over time is also problematic, especially in production environments.
Monitoring of workloads that have different user-known phases. For
example, in Android, applications are known to have very different access
patterns and behaviors when they are running on the foreground and the
background. It can therefore be useful to separate monitoring of apps
based on whether they are running on the foreground and on the background.
Having two DAMON threads per application that are paused and resumed on the
app's foreground/background switches can be useful for the purpose. But
such pause/resume of the execution is not supported.
Tests of DAMON. A few DAMON selftests are using drgn to dump the internal
DAMON status. The tests show if the dumped status is the same as what the
test code expected. Because DAMON keeps running and modifying its
internal status, there are chances of data races that can cause false test
results. Stopping DAMON can avoid the race. But, since the internal
state of DAMON is dropped, the test coverage will be limited.
Let DAMON execution be paused and resumed without loss of the internal
state, to overhaul the limitations. For this, introduce a new DAMON
context parameter, namely 'pause'. API callers can update it while the
context is running, using the online parameters update functions
(damon_commit_ctx() and damon_call()). Once it is set, kdamond_fn() main
loop will do only limited works excluding the monitoring and DAMOS works,
while sleeping sampling intervals per the work. The limited works include
handling of the online parameters update. Hence users can unset the
'pause' parameter again. Once it is unset, kdamond_fn() main loop will do
all the work again (resumed). Under the paused state, it also does stop
condition checks and handling of it, so that paused DAMON can also be
stopped if needed. Expose the feature to the user space via DAMON sysfs
interface. Also, update existing drgn-based tests to test and use the
feature.
Tests
=====
I confirmed the feature functionality using real time tracing ('perf
trace' or 'trace-cmd stream') of damon:damon_aggregated DAMON tracepoint.
By pausing and resuming the DAMON execution, I was able to see the trace
stop and continue as expected. Note that the pause feature support is
added to DAMON user-space tool (damo) after v3.1.9. Users can use
'--pause_ctx' command line option of damo for that, and I actually used it
for my test. The extended drgn-based selftests are also testing a part of
the functionality.
Patches Sequence
================
Patch 1 introduces the new core API for the pause feature. Patch 2 extends
DAMON sysfs interface for the new parameter. Patches 3-5 update design,
usage and ABI documents for the new sysfs file, respectively. The
following five patches are for tests. Patch 6 implements a new kunit test
for the pause parameter online commitment. Patches 7 and 8 extend DAMON
selftest helpers to support the new feature. Patch 9 extends selftest to
test the commitment of the feature. Finally, patch 10 updates existing
selftest to be safe from the race condition using the pause/resume
feature.
This patch (of 10):
DAMON supports only start and stop of the execution. When it is stopped,
its internal data that it self-trained goes away. It will be useful if
the execution can be paused and resumed with the previous self-trained
data.
Introduce per-context API parameter, 'paused', for the purpose. The
parameter can be set and unset while DAMON is running and paused, using
the online parameters commit helper functions (damon_commit_ctx() and
damon_call()). Once 'paused' is set, the kdamond_fn() main loop does only
limited works with sampling interval sleep during the works. The limited
works include the handling of the online parameters update, so that users
can unset the 'pause' and resume the execution when they want. It also
keeps checking DAMON stop conditions and handling them, so that DAMON can
be stopped while paused if needed.
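The described control flow can be sketched roughly as below. This is an
illustrative userspace model only, not the kdamond_fn() code: the struct,
its fields and demo_loop_step() are hypothetical names, and a counter
stands in for the sampling interval sleep.

```c
#include <stdbool.h>

/* Hypothetical model of a monitoring context. */
struct demo_ctx {
	bool pause;			/* set or unset by the user at runtime */
	bool stop;			/* a stop condition has been met */
	unsigned long nr_sleeps;	/* stands in for sampling interval sleeps */
	unsigned long nr_monitor_passes; /* monitoring and scheme work done */
};

/* One main loop iteration; returns false when the loop should exit. */
static bool demo_loop_step(struct demo_ctx *ctx)
{
	/* Stop conditions are checked in both running and paused states. */
	if (ctx->stop)
		return false;

	/* Online parameter updates (including 'pause') would be handled here. */

	/* Sleep one sampling interval regardless of the state. */
	ctx->nr_sleeps++;

	/* While paused, skip the monitoring and scheme work but keep state. */
	if (ctx->pause)
		return true;

	ctx->nr_monitor_passes++;
	return true;
}
```

Because the paused branch returns without touching the accumulated
state, unsetting 'pause' lets the next iteration continue from where the
monitoring left off.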
Link: https://lore.kernel.org/20260427151231.113429-1-sj@kernel.org
Link: https://lore.kernel.org/20260427151231.113429-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When a file mapping covers a strict subset of a file, an access to the
mapping can trigger readahead of file pages outside the mapped region.
Readahead is meant to prefetch pages likely to be accessed soon, but these
pages aren't accessible via the same means, so it is fair to say we don't
have a good indicator they'll be accessed soon. Take an ELF file for
example: an access to the end of a program's read-only segment isn't a
sign that nearby file contents will be accessed next (they are likely to
be mapped discontiguously, or not at all). The pressure from loading
these pages into the cache can evict more useful pages.
To improve the behavior, make three changes:
* Introduce a new readahead_control field, max_index, as a hard limit on
the readahead. The existing file_ra_state->size can't be used as a
limit, it is more of a hint and can be increased by various
heuristics.
* Set readahead_control->max_index to the end of the VMA in all of the
readahead paths that can be triggered from a fault on a file mapping
(both "sync" and "async" readahead).
* Limit the read-around range start to the VMA's start.
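The effect of the last two changes can be sketched as a simple window
clamp. This is a userspace illustration under stated assumptions, not the
patch itself: the names are hypothetical, page indices are plain unsigned
longs, and the VMA limit is treated as an inclusive last page index
(whether the real max_index is inclusive is not stated here).

```c
/* Hypothetical readahead window in units of page indices. */
struct demo_window {
	unsigned long start;	/* first page index to read */
	unsigned long nr;	/* number of pages to read, 0 if nothing */
};

/*
 * Clamp a requested window [want_start, want_start + want_nr) so that it
 * stays within the inclusive page index range [vma_first, vma_last].
 */
static struct demo_window demo_clamp_window(unsigned long want_start,
					    unsigned long want_nr,
					    unsigned long vma_first,
					    unsigned long vma_last)
{
	struct demo_window w = { want_start, 0 };
	unsigned long last;

	if (!want_nr)
		return w;

	last = want_start + want_nr - 1;
	if (w.start < vma_first)	/* read-around must not start early */
		w.start = vma_first;
	if (last > vma_last)		/* hard limit at the end of the VMA */
		last = vma_last;
	if (w.start <= last)
		w.nr = last - w.start + 1;
	return w;
}
```

For example, a request for 32 pages starting at index 90 against a
mapping that covers indices 100 to 110 is reduced to the 11 pages from
100 to 110.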
Note that these changes only affect readahead triggered in the context of
a fault, they do not affect readahead triggered by read syscalls. If a
user mixes the two types of accesses, the behavior is expected to be the
following: if a fault causes readahead and places a PG_readahead marker
and then a read(2) syscall hits the PG_readahead marker, the resulting
async readahead *will not* be limited to the VMA end. Conversely, if a
read(2) syscall places a PG_readahead marker and then a fault hits the
marker, the async readahead *will* be limited to the VMA end.
There is an edge case that the above motivation glosses over: A single
file mapping might be backed by multiple VMAs. For example, a whole file
could be mapped RW, then part of the mapping made RO using mprotect. This
patch would hurt performance of a sequential faulted read of such a
mapping, the degree depending on how fragmented the VMAs are. A usage
pattern like that is likely rare and already suffering from sub-optimal
performance because, e.g., the fragmented VMAs limit the fault-around, so
each VMA boundary in a sequential faulted read would cause a minor fault.
Still, this patch would make it worse. See a previous discussion of this
topic at [1].
Tested by mapping and reading a small subset of a large file, then using
the cachestat syscall to verify the number of cached pages didn't exceed
the mapping size.
In practical scenarios, the effect depends on the specific file and usage.
Sometimes there is no effect at all, but, for some ELF files in Android,
we see ~20% fewer pages pulled into the cache.
A comprehensive performance evaluation hasn't been done, but, in addition
to the anecdotal memory savings mentioned above, a benchmark was run with
fio 3.38, showing neutral looking results:
/data/local/tmp/fio --version
fio --name=mmap_test --ioengine=mmap --rw=read --bs=4k \
--offset=1G --size=1G --filesize=3G --numjobs=1 \
--filename=testfile.bin
Before: 4366.6 MiB/s (avg of 3459, 4592, 4613, 4697, 4472)
After: 4444.0 MiB/s (avg of 4633, 4655, 4511, 4571, 3850)
+1.7%
Same, with --ioengine=mmap --rw=randread
Before: 445.6 MiB/s (avg of 446, 447, 442, 452, 441)
After: 447.0 MiB/s (avg of 447, 446, 446, 451, 445)
+0.3%
Same, with --ioengine=psync --rw=read
Before: 3086.6 MiB/s (avg of 3122, 3094, 3066, 3094, 3057)
After: 3084.6 MiB/s (avg of 3039, 3103, 3103, 3084, 3094)
-0.06%
Same, with --ioengine=psync --rw=randread
Before: 2226.4 MiB/s (avg of 2256, 2183, 2207, 2265, 2221)
After: 2231.4 MiB/s (avg of 2236, 2241, 2236, 2193, 2251)
+0.2%
Link: https://lore.kernel.org/20260427030148.653228-1-fmayle@google.com
Link: https://lore.kernel.org/all/ivnv2crd3et76p2nx7oszuqhzzah756oecn5yuykzqfkqzoygw@yvnlkhjjssoz/ [1]
Signed-off-by: Frederick Mayle <fmayle@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Let's replace the last user of page_mapped() by folio_mapped() so we can
get rid of page_mapped().
Replace the remaining occurrences of page_mapped() in rmap documentation
by folio_mapped().
Link: https://lore.kernel.org/20260427-page_mapped-v1-3-e89c3592c74c@kernel.org
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Harry Yoo <harry@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Song Liu <song@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This patch set introduces a new action: DAMOS_COLLAPSE.
For DAMOS_HUGEPAGE and DAMOS_NOHUGEPAGE to work, khugepaged should be
working, since it relies on hugepage_madvise to add a new slot. This slot
should be picked up by khugepaged and eventually collapse (or not, if we
are using DAMOS_NOHUGEPAGE) the pages. If THP is not enabled, khugepaged
will not be working, and therefore no collapse will happen.
DAMOS_COLLAPSE eventually calls madvise_collapse, which will collapse the
address range synchronously. In cases where there is a large VMA
(databases, for example), DAMOS_COLLAPSE allows us to collapse only the
hot region, and not the entire VMA.
This new action may be required to support autotuning with hugepage
as a goal[1].
=========
Benchmarks:
=========
MySQL
=====
Tests were performed in an ARM physical server with MariaDB 10.5 and
sysbench. Read-only benchmark was performed with gaussian row hitting,
which follows a normal distribution.
T n, D h: THP set to never, DAMON action set to hugepage
T m, D h: THP set to madvise, DAMON action set to hugepage
T n, D c: THP set to never, DAMON action set to collapse
Memory consumption. Lower is better.
+------------------+----------+----------+----------+
| | T n, D h | T m, D h | T n, D c |
+------------------+----------+----------+----------+
| Total memory use | 2.13 | 2.20 | 2.20 |
| Huge pages | 0 | 1.3 | 1.27 |
+------------------+----------+----------+----------+
Performance in TPS (Transactions Per Second). Higher is better.
T n, D h: 18225.58
T m, D h: 18252.93
T n, D c: 18270.21
Performance counter
I got the number of L1 D/I TLB accesses and the number of D/I TLB
accesses that triggered a page walk. I divided the second by the
first to get the percentage of page walks per TLB access. The
lower the better.
+---------------+--------------+--------------+--------------+
| | T n, D h | T m, D h | T n, D c |
+---------------+--------------+--------------+--------------+
| L1 DTLB | 127248242753 | 125431020479 | 125327001821 |
| L1 ITLB | 80332558619 | 79346759071 | 79298139590 |
| DTLB walk | 75011087 | 52800418 | 55895794 |
| ITLB walk | 71577076 | 71505137 | 67262140 |
| DTLB % misses | 0.058948623 | 0.042095183 | 0.044599961 |
| ITLB % misses | 0.089100954 | 0.090117275 | 0.084821839 |
+---------------+--------------+--------------+--------------+
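For reference, the "% misses" rows follow from the counter rows as
100 * walks / accesses. A minimal sketch of that calculation, with a
hypothetical helper name:

```c
/* Percentage of TLB accesses that triggered a page walk. */
static double demo_walk_pct(unsigned long long walks,
			    unsigned long long accesses)
{
	if (!accesses)
		return 0.0;
	return 100.0 * (double)walks / (double)accesses;
}
```

Feeding in the "T n, D h" data side, demo_walk_pct(75011087ULL,
127248242753ULL) gives about 0.0589, in line with the DTLB row of the
table.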
Masim
=====
I used masim with the "demo" configuration, but changing the times
to 100 seconds for the initial phase and 50 seconds for the rest of
the phases.
Memory consumption:
+------------------+----------+----------+----------+
| | T n, D h | T m, D h | T n, D c |
+------------------+----------+----------+----------+
| Total memory use | 2.38 GB | 2.36 GB | 2.37 GB |
| Huge pages | 0 | 190 MB | 188 MB |
+------------------+----------+----------+----------+
Performance:
THP never, DAMOS_HUGEPAGE
initial phase: 40,491 accesses/msec, 100001 msecs run
low phase 0: 39,658 accesses/msec, 50002 msecs run
high phase 0: 41,678 accesses/msec, 50000 msecs run
low phase 1: 39,625 accesses/msec, 50003 msecs run
high phase 1: 41,658 accesses/msec, 50002 msecs run
low phase 2: 39,642 accesses/msec, 50002 msecs run
high phase 2: 41,640 accesses/msec, 50001 msecs run
THP madvise, DAMOS_HUGEPAGE
initial phase: 51,977 accesses/msec, 100000 msecs run
low phase 0: 86,953 accesses/msec, 50000 msecs run
high phase 0: 94,812 accesses/msec, 50000 msecs run
low phase 1: 101,017 accesses/msec, 50000 msecs run
high phase 1: 94,841 accesses/msec, 50000 msecs run
low phase 2: 100,993 accesses/msec, 50000 msecs run
high phase 2: 94,791 accesses/msec, 50001 msecs run
THP never, DAMOS_COLLAPSE
initial phase: 93,678 accesses/msec, 100001 msecs run
low phase 0: 101,475 accesses/msec, 50000 msecs run
high phase 0: 98,589 accesses/msec, 50000 msecs run
low phase 1: 101,531 accesses/msec, 50001 msecs run
high phase 1: 98,506 accesses/msec, 50001 msecs run
low phase 2: 101,458 accesses/msec, 50001 msecs run
high phase 2: 98,555 accesses/msec, 50000 msecs run
Memory consumption dynamic (how quickly collapses occur):
It shows in seconds how many huge pages are allocated.
+----+----------+----------+
| | T m, D h | T n, D c |
+----+----------+----------+
| 5 | 32 | 188 |
| 10 | 48 | 188 |
| 15 | 64 | 188 |
| 20 | 96 | 188 |
| 30 | 112 | 188 |
| 35 | 144 | 188 |
| 40 | 160 | 188 |
| 45 | 190 | 188 |
| 50 | 190 | 188 |
| 55 | 190 | 188 |
| 60 | 190 | 188 |
+----+----------+----------+
=========
- We can see that DAMOS "hugepage" action works only when THP is set
to madvise. "collapse" action works even when THP is set to never.
- Performance for "collapse" action is slightly lower than "hugepage"
action and THP madvise. This is due to the fact that collapses
occur synchronously. With "hugepage" they may occur during page
faults.
- Memory consumption is slightly lower for "collapse" than "hugepage"
with THP madvise. This is because khugepaged collapses all VMAs,
while "collapse" action only collapses the VMAs in the hot region.
- There is an improvement in TLB utilization when collapse through
"hugepage" or "collapse" actions are triggered. The amount of
TLB misses is lower.
- "collapse" action is performed synchronously, which means that
page collapses happen earlier and more rapidly. This can be
useful or not, depending on the scenario.
- "hugepage" action may trigger a VMA split in some scenarios, since
it needs to change the flag of the VMA to THP enabled. This may
lead to additional overhead.
Collapse action just adds a new option to choose the correct system
balance.
Link: https://lore.kernel.org/20260426231619.107231-5-sj@kernel.org
Link: https://lore.kernel.org/damon/20260313000816.79933-1-sj@kernel.org/ [1]
Signed-off-by: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Cheng-Han Wu <hank20010209@gmail.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Liew Rui Yan <aethernet65535@gmail.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently, the memory hot-remove call chain -- arch_remove_memory(),
__remove_pages(), sparse_remove_section() and section_deactivate() -- does
not carry the struct dev_pagemap pointer. This prevents the lower levels
from knowing whether the section was originally populated with vmemmap
optimizations (e.g., DAX with vmemmap optimization enabled).
Without this information, we cannot call vmemmap_can_optimize() to
determine if the vmemmap pages were optimized. As a result, the vmemmap
page accounting during teardown will mistakenly assume a non-optimized
allocation, leading to incorrect memmap statistics.
To lay the groundwork for fixing the vmemmap page accounting, we need to
pass the @pgmap pointer down to the deactivation location. Plumb the
@pgmap argument through the APIs of arch_remove_memory(), __remove_pages()
and sparse_remove_section(), mirroring the corresponding *_activate()
paths.
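The plumbing can be pictured with a small user-space sketch. It is not
the kernel code: struct pgmap_sketch, its vmemmap_optimized flag and both
function names are hypothetical stand-ins, and the flag replaces the real
vmemmap_can_optimize() decision. It only shows why the lowest level
needs the pointer that the upper levels forward.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct dev_pagemap; the flag is assumed. */
struct pgmap_sketch {
	bool vmemmap_optimized;
};

/* Lowest level: the only place that needs to inspect pgmap. */
unsigned long section_deactivate_sketch(unsigned long full_pages,
					unsigned long optimized_pages,
					const struct pgmap_sketch *pgmap)
{
	/* With no pgmap, a non-optimized allocation has to be assumed. */
	if (pgmap && pgmap->vmemmap_optimized)
		return optimized_pages;
	return full_pages;
}

/* Upper level: forwards pgmap unchanged, as the plumbing would. */
unsigned long remove_pages_sketch(unsigned long full_pages,
				  unsigned long optimized_pages,
				  const struct pgmap_sketch *pgmap)
{
	return section_deactivate_sketch(full_pages, optimized_pages, pgmap);
}
```

In this model, a caller that passes NULL gets the non-optimized page
count back, which is the mistaken assumption described above.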
Link: https://lore.kernel.org/20260428081855.1249045-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Liam R. Howlett <liam@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When reclaim is triggered by high order allocations on a fragmented
system, vmpressure() can report poor reclaim efficiency even though the
system has plenty of free memory. This is because many pages are scanned,
but few can actually be reclaimed - the pages are actively in use and
don't need to be freed. The resulting scan:reclaim ratio causes
vmpressure() to assert socket pressure, throttling TCP throughput
unnecessarily.
Costly order allocations (above PAGE_ALLOC_COSTLY_ORDER) rely heavily on
compaction to succeed, so poor reclaim efficiency at these orders does not
necessarily indicate memory pressure. The kernel already treats this
order as the boundary where reclaim is no longer expected to succeed and
compaction may take over.
Make vmpressure() order-aware through an additional parameter sourced from
scan_control at existing call sites. Socket pressure is now only asserted
when order <= PAGE_ALLOC_COSTLY_ORDER.
Memcg reclaim is unaffected since try_to_free_mem_cgroup_pages() always
uses order 0, which passes the filter unconditionally. Similarly,
vmpressure_prio() now passes order 0 internally when calling vmpressure(),
ensuring critical pressure from low reclaim priority is not suppressed by
the order filter.
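The order filter can be sketched in user space as below. The helper name
is hypothetical and the constant only mirrors the kernel's
PAGE_ALLOC_COSTLY_ORDER (3); the real change lives inside vmpressure()
and is not reproduced here.

```c
#include <stdbool.h>

/* Mirrors the kernel's PAGE_ALLOC_COSTLY_ORDER, which is 3. */
#define COSTLY_ORDER_SKETCH 3

/*
 * Hypothetical helper: may a reclaim pass of this order assert socket
 * pressure? Orders above the costly boundary are filtered out.
 */
bool may_assert_socket_pressure_sketch(unsigned int order)
{
	return order <= COSTLY_ORDER_SKETCH;
}
```

Under this sketch, order-0 callers such as memcg reclaim always pass the
filter, while an order-7 kswapd pass does not.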
The patch was motivated by a case of impacted net throughput in
production. On one affected host, the memory state at the time showed
~15GB available, zero cgroup pressure, and the following buddyinfo state:
Order FreePages
0: 133,970
1: 29,230
2: 17,351
3: 18,984
7+: 0
Using bpf, it was found that 94% of vmpressure calls on this host were
from order-7 kswapd reclaim.
TCP minimum recv window is rcv_ssthresh:19712.
Before patch:
723 out of 3,843 (19%) TCP connections stuck at minimum recv window
After live-patching and ~30min elapsed:
0 out of 3,470 TCP connections stuck at minimum recv window
Link: https://lore.kernel.org/20260406195014.112521-1-jp.kobryn@linux.dev
Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Qi Zheng <qi.zheng@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 9bdac9142407 ("sparsemem: Put mem map for one node together.")
introduced a mechanism to pre-allocate a large memory block to hold all
memmaps for a NUMA node upfront.
However, the original commit message did not clearly state the actual
benefits or the necessity of explicitly pre-allocating a single chunk for
all memmap areas of a given node.
One of the concerns about removing this pre-allocation is that the
subsequent per-section memmap allocations could become scattered around,
and might turn too many memory blocks/sections into an "un-offlinable"
state. However, tests show that even without the explicit node-wide
pre-allocation, memblock still allocates memory closely and back-to-back.
When tracing vmemmap_set_pmd allocations, the physical chunks allocated by
memblock are strictly adjacent to each other in a single contiguous
physical range (mapped top-down). Because they are packed tightly
together naturally, they will at most consume or pollute the exact same
number of memory blocks as the explicit pre-allocation did.
Another concern is the boot performance impact of calling memmap_alloc()
multiple times compared to one large node-wide allocation. Tests on a
256GB VM showed that memmap allocation time increased from 199,555 ns to
741,292 ns. Even though it is 3.7x slower, on a 1TB machine, the entire
memory allocation time would only take a few milliseconds. This boot
performance difference is completely negligible.
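As a rough check of the 1TB estimate, the 256GB measurement can be
extrapolated. The sketch assumes the allocation time grows linearly with
memory size, which the tests above do not establish; the function name is
hypothetical.

```c
#include <stdint.h>

/*
 * Extrapolate a measured allocation time to a larger machine,
 * assuming the time scales linearly with memory size.
 */
uint64_t scale_alloc_time_ns_sketch(uint64_t measured_ns,
				    uint64_t measured_gb,
				    uint64_t target_gb)
{
	return measured_ns * target_gb / measured_gb;
}
```

With the measured 741,292 ns at 256GB, this gives 2,965,168 ns at 1024GB,
i.e. about 3 ms.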
Since no negative impact on memory offlining behavior or noticeable boot
performance regression was found, this patch proposes removing the
explicit node-wide memmap pre-allocation mechanism to reduce the
maintenance burden.
Link: https://lore.kernel.org/20260410092419.2446420-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
DAMOS quota is charged for all memory to which a DAMOS action is applied,
regardless of whether the action succeeded or failed on that memory.
This makes quota behavior difficult to understand from end level metrics
alone (e.g., increased amount of free memory for DAMOS_PAGEOUT action),
without DAMOS stat. Also, charging action-failed memory the same as
action-successful memory is somewhat unfair, as successful action
application will induce more overhead in most cases.
Introduce DAMON core API for setting the charge ratio for such
action-failed memory. It allows API callers to specify the ratio in a
flexible way, by setting the numerator and the denominator.
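One way to picture the ratio is the sketch below. It is an illustration
only: the function and parameter names are hypothetical and are not the
DAMON core API, and treating a zero denominator as "charge failures in
full" is an assumption made here.

```c
#include <stdint.h>

/*
 * Hypothetical charge computation: succeeded memory is charged in full,
 * failed memory is charged at fail_num / fail_denom of its size.
 */
uint64_t quota_charge_sketch(uint64_t succeeded_bytes, uint64_t failed_bytes,
			     uint64_t fail_num, uint64_t fail_denom)
{
	/* Assumed fallback: a zero denominator charges failures in full. */
	if (fail_denom == 0)
		return succeeded_bytes + failed_bytes;
	/* Multiply first to keep integer precision; may overflow for huge inputs. */
	return succeeded_bytes + failed_bytes * fail_num / fail_denom;
}
```

For example, with 100 bytes succeeded, 50 bytes failed and a ratio of
1/2, the sketch charges 125 bytes; a ratio of 0/1 charges only the 100
succeeded bytes.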
Link: https://lore.kernel.org/20260428013402.115171-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|