| Age | Commit message (Collapse) | Author |
|
Per-task monitors use task_mon_slot to determine which slot in the array
to use for the monitor. During destruction, this slot is returned but
this is done before resetting the monitor. As a result, the monitor's
reset is in fact resetting a slot that is outside of the array
(RV_PER_TASK_MONITOR_INIT).
Release the slot only after the reset to avoid out-of-bound memory
access.
Fixes: f5587d1b6ec93 ("rv: Add Hybrid Automata monitor type")
Cc: stable@vger.kernel.org
Suggested-by: Wen Yang <wen.yang@linux.dev>
Reviewed-by: Wen Yang <wen.yang@linux.dev>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20260601153840.124372-3-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
|
|
Merge the immutable branch dt into next, to allow the updated DT bindings
to be tested together with the pmdomain changes that are targeted for the
next release.
Signed-off-by: Ulf Hansson <ulfh@kernel.org>
|
|
When the FUTEX_ROBUST_UNLOCK mechanism is used for unlocking (PI-)futexes,
then the unlock sequence in user space looks like this:
1) robust_list_set_op_pending(mutex);
2) robust_list_remove(mutex);
lval = gettid();
3) if (atomic_try_cmpxchg(&mutex->lock, lval, 0))
4) robust_list_clear_op_pending();
else
5) sys_futex(OP | FUTEX_ROBUST_UNLOCK, ....);
That still leaves a minimal race window between #3 and #4 where the mutex
could be acquired by some other task, which observes that it is the last
user and:
1) unmaps the mutex memory
2) maps a different file, which ends up covering the same address
When then the original task exits before reaching #5 then the kernel robust
list handling observes the pending op entry and tries to fix up user space.
In case that the newly mapped data contains the TID of the exiting thread
at the address of the mutex/futex the kernel will set the owner died bit in
that memory and therefore corrupt unrelated data.
On X86 this boils down to this simplified assembly sequence:
mov %esi,%eax // Load TID into EAX
xor %ecx,%ecx // Set ECX to 0
#3 lock cmpxchg %ecx,(%rdi) // Try the TID -> 0 transition
.Lstart:
jnz .Lend
#4 movq %rcx,(%rdx) // Clear list_op_pending
.Lend:
If the cmpxchg() succeeds and the task is interrupted before it can clear
list_op_pending in the robust list head (#4) and the task crashes in a
signal handler or gets killed then it ends up in do_exit() and subsequently
in the robust list handling, which then might run into the unmap/map issue
described above.
This is only relevant when user space was interrupted and a signal is
pending. The fix-up has to be done before signal delivery is attempted
because:
1) The signal might be fatal so get_signal() ends up in do_exit()
2) The signal handler might crash or the task is killed before returning
from the handler. At that point the instruction pointer in pt_regs is
not longer the instruction pointer of the initially interrupted unlock
sequence.
The right place to handle this is in __exit_to_user_mode_loop() before
invoking arch_do_signal_or_restart() as this covers obviously both
scenarios.
As this is only relevant when the task was interrupted in user space, this
is tied to RSEQ and the generic entry code as RSEQ keeps track of user
space interrupts unconditionally even if the task does not have a RSEQ
region installed. That makes the decision very lightweight:
if (current->rseq.user_irq && within(regs, csr->unlock_ip_range))
futex_fixup_robust_unlock(regs, csr);
futex_fixup_robust_unlock() then invokes a architecture specific function
to return the pending op pointer or NULL. The function evaluates the
register content to decide whether the pending ops pointer in the robust
list head needs to be cleared.
Assuming the above unlock sequence, then on x86 this decision is the
trivial evaluation of the zero flag:
return regs->eflags & X86_EFLAGS_ZF ? regs->dx : NULL;
Other architectures might need to do more complex evaluations due to LLSC,
but the approach is valid in general. The size of the pointer is determined
from the matching range struct, which covers both 32-bit and 64-bit builds
including COMPAT.
The unlock sequence is going to be placed in the VDSO so that the kernel
can keep everything synchronized, especially the register usage. The
resulting code sequence for user space is:
if (__vdso_futex_robust_list$SZ_try_unlock(lock, tid, &pending_op) != tid)
err = sys_futex($OP | FUTEX_ROBUST_UNLOCK,....);
Both the VDSO unlock and the kernel side unlock ensure that the pending_op
pointer is always cleared when the lock becomes unlocked.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
Link: https://patch.msgid.link/20260602090535.773669210@kernel.org
|
|
There will be a VDSO function to unlock robust futexes in user space. The
unlock sequence is racy vs. clearing the list_pending_op pointer in the
tasks robust list head. To plug this race the kernel needs to know the
instruction window. As the VDSO is per MM the addresses are stored in
mm_struct::futex.
Architectures which implement support for this have to update these
addresses when the VDSO is (re)mapped and indicate the pending op pointer
size which is matching the IP.
Arguably this could be resolved by chasing mm->context->vdso->image, but
that's architecture specific and requires to touch quite some cache
lines. Having it in mm::futex reduces the cache line impact and avoids
having yet another set of architecture specific functionality.
To support multi size robust list applications (gaming) this provides two
ranges when COMPAT is enabled.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
Link: https://patch.msgid.link/20260602090535.718926819@kernel.org
|
|
Unlocking robust non-PI futexes happens in user space with the following
sequence:
1) robust_list_set_op_pending(mutex);
2) robust_list_remove(mutex);
lval = 0;
3) lval = atomic_xchg(lock, lval);
4) if (lval & WAITERS)
5) sys_futex(WAKE,....);
6) robust_list_clear_op_pending();
That opens a window between #3 and #6 where the mutex could be acquired by
some other task which observes that it is the last user and:
A) unmaps the mutex memory
B) maps a different file, which ends up covering the same address
When the original task exits before reaching #6 then the kernel robust list
handling observes the pending op entry and tries to fix up user space.
In case that the newly mapped data contains the TID of the exiting thread
at the address of the mutex/futex the kernel will set the owner died bit in
that memory and therefore corrupting unrelated data.
PI futexes have a similar problem both for the non-contented user space
unlock and the in kernel unlock:
1) robust_list_set_op_pending(mutex);
2) robust_list_remove(mutex);
lval = gettid();
3) if (!atomic_try_cmpxchg(lock, lval, 0))
4) sys_futex(UNLOCK_PI,....);
5) robust_list_clear_op_pending();
Address the first part of the problem where the futexes have waiters and
need to enter the kernel anyway. Add a new FUTEX_ROBUST_UNLOCK flag, which
is valid for the sys_futex() FUTEX_UNLOCK_PI, FUTEX_WAKE, FUTEX_WAKE_BITSET
operations.
This deliberately omits FUTEX_WAKE_OP from this treatment as it's unclear
whether this is needed and there is no usage of it in glibc either to
investigate.
For the futex2 syscall family this needs to be implemented with a new
syscall.
The sys_futex() case [ab]uses the @uaddr2 argument to hand the pointer to
robust_list_head::list_pending_op into the kernel. This argument is only
evaluated when the FUTEX_ROBUST_UNLOCK bit is set and is therefore backward
compatible.
This is an explicit argument to avoid the lookup of the robust list pointer
and retrieving the pending op pointer from there. User space has the
pointer already available so it can just put it into the @uaddr2
argument. Aside of that this allows the usage of multiple robust lists in
the future without any changes to the internal functions as they just operate
on the provided pointer.
This requires a second flag FUTEX_ROBUST_LIST32 which indicates that the
robust list pointer points to an u32 and not to an u64. This is required
for two reasons:
1) sys_futex() has no compat variant
2) The gaming emulators use both both 64-bit and compat 32-bit robust
lists in the same 64-bit application
As a consequence 32-bit applications have to set this flag unconditionally
so they can run on a 64-bit kernel in compat mode unmodified. 32-bit
kernels return an error code when the flag is not set. 64-bit kernels will
happily clear the full 64 bits if user space fails to set it.
In case of FUTEX_UNLOCK_PI this clears the robust list pending op when the
unlock succeeded. In case of errors, the user space value is still locked
by the caller and therefore the above cannot happen.
In case of FUTEX_WAKE* this does the unlock of the futex in the kernel and
clears the robust list pending op when the unlock was successful. If not,
the user space value is still locked and user space has to deal with the
returned error. That means that the unlocking of non-PI robust futexes has
to use the same try_cmpxchg() unlock scheme as PI futexes.
If the clearing of the pending list op fails (fault) then the kernel clears
the registered robust list pointer if it matches to prevent that exit()
will try to handle invalid data. That's a valid paranoid decision because
the robust list head sits usually in the TLS and if the TLS is not longer
accessible then the chance for fixing up the resulting mess is very close
to zero.
The problem of non-contended unlocks still exists and will be addressed
separately.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
Link: https://patch.msgid.link/20260602090535.670514505@kernel.org
|
|
Make the operand defines tabular for readability sake.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
Link: https://patch.msgid.link/20260602090535.615600933@kernel.org
|
|
The upcoming support for unlocking robust futexes in the kernel requires
store release semantics. Syscalls do not imply memory ordering on all
architectures so the unlock operation requires a barrier.
This barrier can be avoided when stores imply release like on x86.
Provide a generic version with a smp_mb() before the unsafe_put_user(),
which can be overridden by architectures.
Provide also a ARCH_MEMORY_ORDER_TSO Kconfig option, which can be selected
by architectures with Total Store Order (TSO), where store implies release,
so that the smp_mb() in the generic implementation can be avoided.
If that is set a barrier() is used instead of smp_mb(), which is not
required for the use case at hand, but makes it future proof for other
usage to prevent the compiler from reordering.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
Link: https://patch.msgid.link/20260602090535.513181528@kernel.org
|
|
The marker for PI futexes in the robust list is a hardcoded 0x1 which lacks
any sensible form of documentation.
Provide proper defines for the bit and the mask and fix up the usage
sites. Thereby convert the boolean pi argument into a modifier argument,
which allows new modifier bits to be trivially added and conveyed.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
Link: https://patch.msgid.link/20260602090535.458758556@kernel.org
|
|
Having all these members in mm_struct along with the required #ifdeffery is
annoying, does not allow efficient initializing of the data with
memset() and makes extending it tedious.
Move it into a data structure and fix up all usage sites.
The extra struct for the private hash is intentional to make integration of
other conditional mechanisms easier in terms of initialization and separation.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260602090535.407756793@kernel.org
|
|
Nothing fails there. Mop up the leftovers of the early version of this,
which did an allocation.
While at it clean up the stubs and the #ifdef comments to make the header
file readable.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260602090535.356789395@kernel.org
|
|
Having all these members in task_struct along with the required #ifdeffery
is annoying, does not allow efficient initializing of the data with
memset() and makes extending it tedious.
Move it into a data structure and fix up all usage sites.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
Link: https://patch.msgid.link/20260602090535.308220888@kernel.org
|
|
Slapping __percpu_qual into the next available header is sloppy at best.
It's required by __percpu which is defined in compiler_types.h and that is
meant to be included without requiring a boatload of other headers so that
a struct or function declaration can contain a __percpu qualifier w/o
further prerequisites.
This implicit dependency on linux/percpu.h makes that impossible and causes
a major problem when trying to separate headers.
Create asm/percpu_types.h and move it there. Include that from
compiler_types.h and the whole recursion problem goes away.
Fix up UM so it uses the generic header and includes it in the UM_HOST
build, which pulls in compiler_types.h. The USER_CFLAGS fix was suggested
by Richard.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260602090535.254874125@kernel.org
|
|
The unconditional guard destructors check whether the lock pointer is
NULL before unlocking. This check is dead code because unconditional
guards guarantee a non-NULL lock pointer at destructor time.
DEFINE_GUARD() runs the lock operation unconditionally in the
constructor. If the pointer were NULL, the lock operation (e.g.
mutex_lock(NULL)) would crash before the constructor returns. The
destructor never runs with a NULL pointer. All DEFINE_GUARD() users
dereference the pointer in their lock. Verified by auditing every
instance found by: git grep -n -A 1 'DEFINE_GUARD('. The only exception
is xe_pm_runtime_release_only, whose constructor is a noop, but it has
no callers.
__DEFINE_UNLOCK_GUARD() has only a few usages outside of
include/linux/cleanup.h: tty_port_tty (NULL-checks in its tty_kref_put()
call), irqdesc_lock (fixed earlier) and two guards in
kernel/sched/sched.h (dereference the pointer unconditionally in their
lock constructors).
DEFINE_LOCK_GUARD_1() sets .lock from its argument and runs the lock
operation in the constructor. Same reasoning applies. All
DEFINE_LOCK_GUARD_1() users dereference the pointer in their lock. Also,
verified by auditing every match of: git grep -n 'DEFINE_LOCK_GUARD_1('.
DEFINE_LOCK_GUARD_0() hardcodes .lock = (void *)1 in the constructor,
so it is never NULL by construction.
Conditional (_try) variants: DEFINE_GUARD_COND() and
DEFINE_LOCK_GUARD_1_COND() use EXTEND_CLASS_COND(), whose wrapper
destructor returns early when the lock was not acquired, before reaching
the base destructor since commit 2deccd5c862a ("cleanup: Optimize
guards"):
if (_cond) return; class_##_name##_destructor(_T);
As compiled by GCC-11 with defconfig on top of the locking/core:
Total: Before=23889980, After=23834334, chg -0.23%
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/0503a089389b2270c478a873e095cf0a4ff26d24.1780064327.git.d@ilvokhin.com
|
|
Add __nonnull_args() to unconditional guard constructors so the compiler
warns when NULL is statically known to be passed:
- DEFINE_GUARD(): re-declare the constructor with __nonnull_args().
- __DEFINE_LOCK_GUARD_1(): annotate the constructor directly.
DEFINE_LOCK_GUARD_0() needs no annotation: its constructor takes no
pointer arguments (.lock is hardcoded to (void *)1).
Define the __nonnull_args() macro in compiler_attributes.h, following
the existing convention for attribute wrappers. Deliberately not named
'__nonnull', to avoid clashing with glibc's __nonnull() when kernel and
userspace headers are combined (User Mode Linux for example).
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/85fee12eec20abfcf711443518e8f0caec982a86.1780064327.git.d@ilvokhin.com
|
|
For CONFIG_NR_CPUS < 16K, _Q_PENDING_BITS is 8 and the pending
field occupies bits 8-15 of the lock word. The current comment
documents bit 8 as pending and bits 9-15 as unused, which describes
the pending flag value rather than the field layout.
Describe bits 8-15 as the pending byte so the layout description
is consistent with the lock byte.
Signed-off-by: WEI-HONG, YE <1234567weewee457@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://patch.msgid.link/20260525130450.723937-1-1234567weewee457@gmail.com
|
|
Document the RPMh power domain for Nord SoC, and add definitions for
the new power domains present on Nord SoC.
- RPMHPD_NSP3: power domain for the 4th NSP subsystem
- RPMHPD_GFX1: power domain for the 2nd GFX subsystem
Signed-off-by: Kamal Wadhwa <kamal.wadhwa@oss.qualcomm.com>
Signed-off-by: Shawn Guo <shengchao.guo@oss.qualcomm.com>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Signed-off-by: Ulf Hansson <ulfh@kernel.org>
|
|
Some RPMHPD_* defines in the Generic RPMH Power Domain Indexes section
were using spaces instead of tabs for alignment. Fix them to be
consistent with the rest of the file.
Signed-off-by: Shawn Guo <shengchao.guo@oss.qualcomm.com>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Ulf Hansson <ulfh@kernel.org>
|
|
Even it is very unlikely the thermal framework is disabled, the newly
added devm_thermal_cooling_device_register() function has not the stub
when the thermal framework is optout in the kernel.
Add it.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202605301554.S9n45bfQ-lkp@intel.com/
Signed-off-by: Daniel Lezcano <daniel.lezcano@kernel.org>
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Link: https://patch.msgid.link/20260601090152.1243983-2-daniel.lezcano@kernel.org
|
|
Extend the OF cooling device registration to support an explicit
cooling device identifier (cdev_id), preparing for upcoming DT
bindings where cooling devices are identified by a tuple (device node,
id) instead of relying on child nodes.
Introduce a new helper:
devm_thermal_of_cooling_device_register()
which registers a cooling device using the device's of_node and an
explicit cdev_id. This complements the existing
devm_thermal_of_child_cooling_device_register() helper, which
remains dedicated to the legacy child-node based bindings.
Internally, factorize the devm registration logic into a common
helper to avoid code duplication.
Existing users are unaffected, as the child-based helper continues
to pass a default cdev_id of 0, preserving current behavior.
This change is a preparatory step for supporting indexed cooling
devices in thermal OF bindings.
Signed-off-by: Daniel Lezcano <daniel.lezcano@oss.qualcomm.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@kernel.org>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Link: https://patch.msgid.link/20260526140802.1059293-20-daniel.lezcano@oss.qualcomm.com
|
|
Introduce an identifier (cdev_id) for cooling devices registered from
device tree.
This prepares support for a new DT binding where cooling devices are
identified by a tuple (device node, ID), instead of relying on child
nodes.
Existing users are updated to pass a default ID of 0, preserving the
current behavior.
Future changes will extend the cooling map parsing to match cooling
devices based on both the device node and the ID.
No functional change intended.
Signed-off-by: Daniel Lezcano <daniel.lezcano@oss.qualcomm.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@kernel.org>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Link: https://patch.msgid.link/20260526140802.1059293-19-daniel.lezcano@oss.qualcomm.com
|
|
To clarify that the function operates on child nodes, rename:
devm_thermal_of_cooling_device_register()
|
v
devm_thermal_of_child_cooling_device_register()
Used the command:
find . -type f -name '*.[ch]' -exec \
sed -i 's/devm_thermal_of_cooling_device_register/\
devm_thermal_of_child_cooling_device_register/g' {} \;
Did not used clang-format-diff because it does not indent correctly
and checkpatch complained. Manually reindented to make checkpatch
happy
This prepares for upcoming support of cooling devices identified by
an ID rather than device tree child nodes.
No functional change.
Signed-off-by: Daniel Lezcano <daniel.lezcano@oss.qualcomm.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@kernel.org>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Link: https://patch.msgid.link/20260526140802.1059293-18-daniel.lezcano@oss.qualcomm.com
|
|
The device node pointer stored in struct thermal_cooling_device is
only used by the OF-specific thermal code to associate cooling devices
with thermal zones defined in device tree.
Now that OF and non-OF registration paths are separated and non-OF
users no longer rely on devm_thermal_of_cooling_device_register() with
a NULL device node, the np field is no longer required for non-OF
configurations.
Make this field conditional on CONFIG_THERMAL_OF to reduce memory
footprint and better reflect its usage.
Signed-off-by: Daniel Lezcano <daniel.lezcano@oss.qualcomm.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@kernel.org>
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://patch.msgid.link/20260526140802.1059293-16-daniel.lezcano@oss.qualcomm.com
|
|
The functions:
- thermal_of_cooling_device_register()
- devm_thermal_of_cooling_device_register()
are specific to device tree usage but are currently implemented in
thermal_core.c.
Move them to thermal_of.c to better reflect the separation between
generic thermal core code and OF-specific logic.
This change is enabled by the recent split of the cooling device
registration into allocation and addition phases, allowing OF-specific
handling (such as device node assignment) to be isolated from the core.
No functional change intended.
Signed-off-by: Daniel Lezcano <daniel.lezcano@oss.qualcomm.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@kernel.org>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Link: https://patch.msgid.link/20260526140802.1059293-17-daniel.lezcano@oss.qualcomm.com
|
|
Introduce a device-managed variant of the non-OF cooling device
registration API.
This complements devm_thermal_of_cooling_device_register() and allows
non-device-tree users to register cooling devices with automatic
cleanup tied to the device lifecycle.
The helper relies on devm_add_action_or_reset() to release the cooling
device via thermal_cooling_device_release() on driver detach or probe
failure.
This keeps the API consistent across OF and non-OF users and avoids
manual cleanup in error paths.
Signed-off-by: Daniel Lezcano <daniel.lezcano@oss.qualcomm.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@kernel.org>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Link: https://patch.msgid.link/20260526140802.1059293-14-daniel.lezcano@oss.qualcomm.com
|
|
sget() and sget_fc() have lived side by side as near-duplicate
find-or-create-and-publish helpers for the legacy and fs_context mount
APIs. The three remaining in-tree callers (CIFS plus the ext4 extents
and mballoc KUnit tests) have all been moved to sget_fc(). Nothing
calls sget() anymore.
Delete sget() from fs/super.c and the prototype in <linux/fs.h>.
Update the two comments that referred to "sget()" or "sget{_fc}()" to
just say "sget_fc()".
This removes ~60 lines of code that only existed to be kept in
lockstep with sget_fc() on every superblock publish-path change.
Link: https://patch.msgid.link/20260529-work-sget-v2-4-57bbe08604e4@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
Find and set the memcg_id for damon_filter from the user-passed memory
cgroup path when updating the DAMON input parameters.
Link: https://lore.kernel.org/20260518234119.97569-27-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Belonging memory cgoup is another data attribute that can be useful to
monitor. Introduce a new DAMON filter type, namely
DAMON_FILTER_TYPE_MEMCG, for monitoring of this attribute.
Link: https://lore.kernel.org/20260518234119.97569-23-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Introduce a new tracepoint for exposing the per-region per-probe positive
sample count via tracefs.
Link: https://lore.kernel.org/20260518234119.97569-19-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Extend damon_operations struct with a new callback, namely apply_probes.
The callback will be invoked for data attributes monitoring. More
specifically, the callback will apply damon_probe objects to each region
and update the per-region per-probe counters for the number of encountered
probe-positive samples.
Link: https://lore.kernel.org/20260518234119.97569-7-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add an array for the per-region per-probe positive samples count. For
simple and efficient implementation, add a limit to the number of data
probes and set the array to support only the limited number of counters.
Link: https://lore.kernel.org/20260518234119.97569-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Define a data structure for constructing damon_probe's attributes check,
namely damon_filter. It is very similar to damos_filter but works only
for monitoring purposes. Also embed that into damon_probe, implement
essential handling of the link, with fundamental helpers.
Link: https://lore.kernel.org/20260518234119.97569-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Let damon_probe objects be able to be installed on a given damon_ctx, by
adding a linked list header for storing the objects. Add initialization
and cleanup of the new field with helper functions, too.
Link: https://lore.kernel.org/20260518234119.97569-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon: introduce data attributes monitoring".
TL; DR
======
Extend DAMON for monitoring general data attributes other than accesses.
The short term motivation is lightweight page type (e.g., belonging
cgroup) aware monitoring. In long term, this will help extending DAMON
for multiple access events capture primitives (e.g., page faults and PMU)
and eventually pivotting DAMON to a "Data Attributes Monitoring and
Operations eNgine" in long term.
Background: High Cost of Page Level Properties Monitoring
=========================================================
DAMON is initially introduced as a Data Access MONitor. It has been
extended for not only access monitoring but also data access-aware system
operations (DAMOS). But still the monitoring part is only for data
accesses.
Data access patterns is good information, but some users need more
holistic views. Particularly, users want to show the access pattern
information together with the types of the memory. For example, users who
work for making huge pages efficiently want to know how much of
DAMON-found hot/cold regions are backed by huge pages. Users who run
multiple workloads with different cgroups want to know how much of
DAMON-found hot/cold regions belong to specific cgroups.
For the user demand, we developed a DAMOS extension for page level
properties based monitoring [1], which has landed on 6.14. Using the
feature, users can inform the page level data properties that they are
interested in, in a flexible format that uses DAMOS filters. Then, DAMON
applies the filters to each folio of the entire DAMON region and lets
users know how many bytes of memory in each DAMON region passed the given
filters.
This gives page level detailed and deterministic information to users.
But, because the operation is done at page level, the overhead is
proportional to the memory size. It was useful for test or debugging
purposes on a small number of machines. But it was obviously too heavy to
be enabled always on all machines running the real user workloads. For
real world workloads, it was recommended to use the feature with
user-space controlled sampling approaches. For example, users could do
the page level monitoring only once per hour, on randomly selected one
percent of machines of their fleet. If the runtime and the size of the
fleet is long and big enough, it should provide statistically meaningful
data.
But users are too busy to implement such controls on their own.
Data Attributes Monitoring
==========================
Extend DAMON to monitor not only data accesses, but also general data
attributes. Do the extension while keeping the main promise of DAMON, the
bounded and best-effort minimum overhead.
Allow users to specify what data attributes in addition to the data access
they want to monitor. Users can install one 'data probe' per data
attribute of their interest for this purpose. The 'data probe' should be
able to be applied to any memory, and determine if the given memory has
the appropriate data attribute. E.g., if memory of physical address 42
belongs to cgroup A. Each 'data probe' is configured with filters that
are very similar to the DAMOS filters.
When DAMON checks if each sampling address memory of each region is
accessed since the last check, it applies data probes if registered. Same
to the number of access check-positive samples accounting (nr_accesses),
it accounts the number of each data probe-positive samples in another
per-region counters array, namely 'probe_hits'. When DAMON resets
nr_accesses every aggregation interval, it resets 'probe_hits' together.
Users can read 'probe_hits' just before the values are reset. In this
way, users can know how many hot/cold memory regions have data attributes
of their interest. E.g., 30 percent of this system's hot memory is
belonging to cgroup A, and 80 percent of the cgroup A-belonging hot memory
is backed by huge pages.
Patches Sequence
================
First eight patches implement the core feature, interface and the working
support. Patch 1 introduces data probe data structure, namely
damon_probe. Patch 2 extends damon_ctx for installing data probes. Patch
3 introduces another data structure for filters of each data probe, namely
damon_filter. Patch 4 updates damon_ctx commit function to handle the
probes. Patch 5 extends damon_region for the per-region per-probe
positive samples counter, namely probe_hits. Patch 6 extends
damon_operations for applying probes on the underlying DAMON operations
implementation. Patch 7 updates kdamond_fn() to invoke the probes
applying callback. Patch 8 finally implements the probes support on paddr
ops.
Ten changes for user interface (patches 9-18) come next. Patches 9-13
implements sysfs directories and files for setting data probes, namely
probes directory, probe directory, filters directory, filter directory and
filter directory internal files, respectively. Patch 14 connects the user
inputs that are made via the sysfs files to DAMON core. Following three
patches (patches 15-17) implement sysfs directories and files for showing
the probe_hits to users, namely probes directory, probe directory and hits
files, respectively. Patch 18 introduces a new tracepoint for showing the
probe_hits via tracefs.
Patch 19 adds a selftest for the sysfs files.
Patches 20 and 21 documents the design and usage of the new feature,
respectively.
Seven additional patches (patches 22-28) for monitoring belonging memory
cgroup follow. Depending on the feedback, this part might be separated to
another series in future. Patch 22 defines the DAMON filter type for the
new attribute, namely DAMON_FILTER_TYPE_MEMCG. Patch 23 add the support
on paddr ops. Patch 24 updates the sysfs interface for setup of the
target memcg. Patch 25 move code for easy reuse of the filter target
memcg setup. Patch 26 connects the user input to the core layer.
Finally, patches 27 and 28 update the design and usage documents for the
memcg attribute monitoring support.
Discussion
==========
This allows the page properties monitoring with overhead that is low
enough to be enabled always on real world workloads. Because the sampling
time for access check is reused for data attributes check, the
upper-bounded and best-effort minimum overhead of DAMON is kept. Because
the sampling memory for access check is reused for data attributes check,
additional overhead is minimum.
Still DAMOS-based page level properties monitoring should be useful,
because it provides a deterministic page level information. When in doubt
of the sampling based information, running DAMOS-based one together and
comparing the results would be useful, for debugging and tuning.
Future Works: Mid Term
========================
This version of implementation is limiting the maximum number of data
probes to four. I will try to find a way to remove the limit in future.
I personally think it should be enough for common use cases, though, and
therefore not giving high priority at the moment.
Future Works: Long Term
=======================
There are user requests for extending DAMON with detailed access
information, for example, per-CPUs/threads/read/writes monitoring. For
that, I was working [2] on extending DAMON to use page fault events as
another access check primitives, and making the infrastructure flexible
for future use of yet another access check primitive. Actually there is
another ongoing work [3] for extending DAMON with PMU events. The
motivation of the work is reducing the overhead, though.
In my work [2], I was introducing a new interface for access sampling
primitives control. Now I think this data probe interface can be used for
that, too. That is, data access becomes just one type of data attribute.
Also, pg_idle-confirmed access, page fault-confirmed access, and PMU
event-confirmed access will be different types of data attributes.
The regions adjustment mechanism is currently working based on the access
information. That's because DAMON is designed for data access monitoring.
That is, data access information is the primary interest, and therefore
DAMON adjusts regions in a way that can best-present the information.
Once data access becomes just one of data attributes, there is no reason
to think data access that special. There might be some users not
interested in access at all but want to know the location of memory of
specific type. Data probes interface will allow doing that. Further, we
could extend the interface to let users set any data attribute as the
'primary' attribute. Then, DAMON will split and merge regions in a way
that can best-present the 'primary' attributes.
DAMOS will also be extended, to specify targets based on not only the data
access pattern, but all user-registered data attributes. From this stage,
we may be able to call DAMON as a "Data Attributes Monitoring and
Operations eNgine".
This patch (of 28):
Introduce a data structure for data attribute probe. It is just a linked
list header at this step. It will be extended in a way that it can
determine if a given memory has a specific data attribute.
Link: https://lore.kernel.org/20260518234119.97569-1-sj@kernel.org
Link: https://lore.kernel.org/20260518234119.97569-2-sj@kernel.org
Link: https://lore.kernel.org/20250106193401.109161-1-sj@kernel.org [1]
Link: https://lore.kernel.org/20251208062943.68824-1-sj@kernel.org/ [2]
Link: https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@gmail.com [3]
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
By allocating one additional bit in the swap table entry's flags field
alongside the count, we can store the zeromap inline
For 64 bit systems, zeromap will store in the swap table, avoiding zeromap
allocation. It reduces the allocated memory. That is the happy path.
For certain 32-bit archs, there might not be enough bits in the swap table
to contain both PFN and flags. Therefore, conditionally let each cluster
have a zeromap field at build time, and use that instead. If the swapfile
cluster is not fully used, it will still save memory for zeromap. The
empty cluster does not allocate a zeromap. In the worst case, all cluster
are fully populated. We will use memory similar to the previous zeromap
implementation.
A few macros were moved to different headers for build time struct
definition.
[akpm@linux-foundation.org: swap_cluster_alloc_table(): remove unused local `ret]
[akpm@linux-foundation.org: fix unused label `err_free']
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-12-88ae43e064c7@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Youngjun Park <youngjun.park@lge.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Now all swap cgroup records are stored in the swap cluster directly, the
static array is no longer needed.
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-11-88ae43e064c7@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Youngjun Park <youngjun.park@lge.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Drop the usage of the swap_cgroup_ctrl, and use the dynamic cluster table
instead.
The per-cluster memcg table is 1024 / 512 bytes on most archs, and does
not need RCU protection: the cgroup data is only read and written under
the cluster lock. That keeps things simple, lets the allocation use plain
kmalloc with immediate kfree (no deferred free), and keeps fragmentation
acceptable.
[akpm@linux-foundation.org: memcgv1: don't compile swap functions when CONFIG_SWAP=n]
Link: https://lore.kernel.org/202605281711.bSeZlErK-lkp@intel.com
[akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-10-88ae43e064c7@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Youngjun Park <youngjun.park@lge.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Instead of checking the cgroup private ID during page table walk in
swap_pte_batch(), move the memcg lookup into __swap_cache_add_check()
under the cluster lock.
The first pre-alloc check is speculative and skips the memcg check since
the post-alloc stable check ensures all slots covered by the folio belong
to the same memcg. It is very rare for contiguous and aligned entries
across a contiguous region of a page table of the same process or shmem
mapping to belong to different memcgs.
This also prepares for recording the memcg info in the cluster's table.
Also make the order check and fallback more compact.
There should be no user-observable behavior change.
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-8-88ae43e064c7@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Youngjun Park <youngjun.park@lge.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The cgroup v1 swap helpers always operate on swap cache folios whose swap
entry is stable: the folio is locked and in the swap cache. There is no
need to pass the swap entry or page count as separate parameters when they
can be derived from the folio itself.
Simplify the redundant parameters and add sanity checks to document the
required preconditions.
Also rename memcg1_swapout to __memcg1_swapout to indicate it requires
special calling context: the folio must be isolated and dying, and the
call must be made with interrupts disabled.
No functional change.
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-6-88ae43e064c7@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Youngjun Park <youngjun.park@lge.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Shmem has some special requirements for THP GFP and has to limit it in
certain zones or provide a more lenient fallback.
We'll use this helper for generic swap THP allocation, which needs to
support shmem. For a typical GFP_HIGHUSER_MOVABLE swap-in, this helper is
basically a no-op. But it's necessary for certain shmem users, mostly
drivers.
No feature change.
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-3-88ae43e064c7@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Youngjun Park <youngjun.park@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
- Add a PAGEBLOCK_ prefix to the names to avoid polluting the "global
namespace" too much.
- This new prefix makes MIGRATETYPE_AND_ISO_MASK look pretty long. Well,
that global mask only exists for quite a specific purpose, and is
quite a weird thing to have a name for anyway. So drop it and take
advantage of the newly-defined PAGEBLOCK_ISO_MASK.
Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-3-dacdf5402be8@google.com
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm: misc cleanups from __GFP_UNMAPPED series".
In v2 of the __GFP_UNMAPPED series [0], we realised that some of the
patches could potentially be merged as independent cleanups.
These are all independent of one another, if you think some are useful
cleanups and others are pointless churn, it should be fine to just pick
whatever subset you prefer.
No functional change intended.
This patch (of 4):
There are a couple of places that iterate over the freelists with
awareness of the data structures' layout.
It seems ideally, code outside of mm should not be aware of the page
allocator's freelists at all. But, this patch just doesn't hide them
completely, it's just a meek incremental step in that direction: provide a
macro to iterate over it without needing to be aware of the actual struct
fields.
Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-0-dacdf5402be8@google.com
Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-1-dacdf5402be8@google.com
Link: https://lore.kernel.org/all/20260320-page_alloc-unmapped-v2-0-28bf1bd54f41@google.com/ [0]
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Fix a goof in the block comment for invalidate_range_{start,end}() where
start() is incorrectly referred to as begin().
No functional change intended.
[seanjc@google.com: split to separate patch, write changelog]
Link: https://lore.kernel.org/20260513163546.1176742-1-seanjc@google.com
Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Rename the memory block lookup helper to make the acquired reference
explicit, add memory_block_put() to wrap put_device(), remove
find_memory_block(), and use memory_block_get() as the single block-id
based lookup interface.
This makes it clearer to callers that a successful lookup holds a
reference that must be dropped, reducing the chance of forgetting the
matching put and leaking the memory block device reference.
Link: https://lore.kernel.org/linux-mm/7887915D-E598-42B3-9AFE-BFFBACE8DE2D@linux.dev/#t
Link: https://lore.kernel.org/20260512072635.3969576-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Donet Tom <donettom@linux.ibm.com>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Tested-by: Sumanth Korikkar <sumanthk@linux.ibm.com> #s390
Cc: Richard Cheng <icheng@nvidia.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Doug Anderson <dianders@chromium.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The call to kmemleak_free_part_phys() was added in 2022 in
commit dd0ff4d12dd2 ("bootmem: remove the vmemmap pages from kmemleak in
put_page_bootmem").
In 2025, commit b2aad24b5333 ("mm/memmap: prevent double scanning of memmap
by kmemleak") started to use MEMBLOCK_ALLOC_NOLEAKTRACE when allocating
the memmap to skip the kmemleak_alloc_phys() in the buddy.
So remove the call to kmemleak_free_part_phys(). If this would still
be required for other purposes, either free_reserved_page() should take
care of it, or selected users.
Link: https://lore.kernel.org/20260511-bootmem_info_prep-v1-4-3fb0be6fc688@kernel.org
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: Lance Yang <lance.yang@linux.dev>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_rand() on the sampling_addr hot path called get_random_u32_below(),
which takes a local_lock_irqsave() around a per-CPU batched entropy pool
and periodically refills it with ChaCha20. At elevated nr_regions counts
(20k+), the lock_acquire / local_lock pair plus __get_random_u32_below()
dominate kdamond perf profiles.
Replace the helper with a lockless lfsr113 generator (struct rnd_state)
held per damon_ctx and seeded from get_random_u64() in damon_new_ctx().
kdamond is the single consumer of a given ctx, so no synchronization is
required. Range mapping uses traditional reciprocal multiplication,
similar as get_random_u32_below(); for spans larger than U32_MAX (only
reachable on 64-bit) the slow path combines two u32 outputs and uses
mul_u64_u64_shr() at 64-bit width. On 32-bit the slow path is dead code
and gets eliminated by the compiler.
The new helper takes a ctx parameter; damon_split_regions_of() and the
kunit tests that call it directly are updated accordingly.
lfsr113 is a linear PRNG and MUST NOT be used for anything
security-sensitive. DAMON's sampling_addr is not exposed to userspace and
is only consumed as a probe point for PTE accessed-bit sampling, so a
non-cryptographic PRNG is appropriate here.
Tested with paddr monitoring and max_nr_regions=20000: kdamond CPU usage
reduced from ~72% to ~50% of one core.
Link: https://lore.kernel.org/20260505145212.108644-1-jiayuan.chen@linux.dev
Link: https://lore.kernel.org/damon/20260426173346.86238-1-sj@kernel.org/T/#m4f1fd74112728f83a41511e394e8c3fef703039c
Link: https://lore.kernel.org/20260509011816.85145-1-sj@kernel.org
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Shu Anzai <shu17az@gmail.com>
Cc: Quanmin Yan <yanquanmin1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
"userfaultfd: verify VMA state across UFFDIO_COPY retry", which is a
prerequisite for mm-unnstable's series "userfaultfd: merge
fs/userfaultfd.c into mm/userfaultfd.c".
|
|
Both the compilation of kernel/time/vsyscall.c, which contains the real
definition of update_vsyscall() and the other vDSO definitions in
timekeeper_internal.h use CONFIG_GENERIC_GETTIMEOFDAY and not
CONFIG_GENERIC_TIME_VSYSCALL.
Align the code to use a single Kconfig symbol.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260519-vdso-generic_time_vsyscal-v1-2-5c2a5905d5f5@linutronix.de
|
|
The dyntick-idle steal time is currently accounted when the tick restarts
but the stolen idle time is not subtracted from the idle time that was
already accounted. This is to avoid observing the idle time going backward
as the dyntick-idle cputime accessors can't reliably know in advance the
stolen idle time.
In order to maintain a forward progressing idle cputime while subtracting
idle steal time from it, keep track of the previously accounted idle stolen
time and substract it from _later_ idle cputime accounting.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/20260508131647.43868-16-frederic@kernel.org
|
|
The last reason why get_cpu_idle/iowait_time_us() may return -1 now is if
the config doesn't support nohz.
The ad-hoc replacement solution by cpufreq is to compute jiffies minus the
whole busy cputime. Although the intention should provide a coherent low
resolution estimation of the idle and iowait time, the implementation is
buggy because jiffies don't start at 0.
Just provide instead a real get_cpu_[idle|iowait]_time_us() offcase.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/20260508131647.43868-14-frederic@kernel.org
|
|
Fetching the idle cputime is available through a variety of accessors all
over the place depending on the different accounting flavours and needs:
- idle vtime generic accounting can be accessed by kcpustat_field(),
kcpustat_cpu_fetch(), get_idle/iowait_time() and
get_cpu_idle/iowait_time_us()
- dynticks-idle accounting can only be accessed by get_idle/iowait_time()
or get_cpu_idle/iowait_time_us()
- CONFIG_NO_HZ_COMMON=n idle accounting can be accessed by kcpustat_field()
kcpustat_cpu_fetch(), or get_idle/iowait_time() but not by
get_cpu_idle/iowait_time_us()
Moreover get_idle/iowait_time() relies on get_cpu_idle/iowait_time_us()
with a non-sensical conversion to microseconds and back to nanoseconds on
the way.
Start consolidating the APIs with removing get_idle/iowait_time() and make
kcpustat_field() and kcpustat_cpu_fetch() work for all cases.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/20260508131647.43868-13-frederic@kernel.org
|