| Age | Commit message (Collapse) | Author |
|
rt_spin_unlock() releases the RCU protection before unlocking the
lock. That opens the door for the following UAF scenario:
T1 T2
spin_lock(&p->lock); rcu_read_lock();
invalidate(p); p = rcu_dereference(ptr);
rcu_assign_pointer(ptr, NULL); if (!p) return;
spin_unlock(&p->lock); spin_lock(&p->lock)
lock(&lock->lock);
rcu_read_lock();
kfree_rcu(p); rcu_read_unlock();
....
spin_unlock(&p->lock)
rcu_read_unlock(); // Ends grace period
rcu_do_batch()
kfree(p);
UAF -> rt_mutex_cmpxchg_release(&lock->lock...)
Regular spinlocks keep preemption disabled accross the unlock operation,
which provides full RCU protection, but the RT substitution fails to
resemble that. Same applies for the rwlock substitution.
Move the rcu_read_unlock() invocation past the unlock operations to match
the non-RT semantics. This makes it asymmetric vs. rt_xxx_lock(), but
that's harmless as the caller needs to hold RCU read lock across the lock
operation. The migrate_enable() call stays before the unlock operation
because there is no per CPU operation in the unlock path which would
require migration to be kept disabled.
Fixes: 0f383b6dc96e ("locking/spinlock: Provide RT variant")
Reported-by: syzbot+000c800a02097aaa10ed@syzkaller.appspotmail.com
Decoded-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/87jyrud75z.ffs@fw13
|
|
gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"SMP load-balancing updates:
- A large series to introduce infrastructure for cache-aware load
balancing, with the goal of co-locating tasks that share data
within the same Last Level Cache (LLC) domain. By improving cache
locality, the scheduler can reduce cache bouncing and cache misses,
ultimately improving data access efficiency.
Implemented by Chen Yu and Tim Chen, based on early prototype work
by Peter Zijlstra, with fixes by Jianyong Wu, Peter Zijlstra and
Shrikanth Hegde.
- A series to simplify CONFIG_SCHED_SMT ifdef usage (Shrikanth Hegde)
Fair scheduler updates:
- A series to improve SD_ASYM_CPUCAPACITY scheduling by introducing
SMT awareness (Andrea Righi, K Prateek Nayak)
- A series to optimize cfs_rq and sched_entity allocation for better
data locality (Zecheng Li)
- A preparatory series to change fair/cgroup scheduling to a single
runqueue, without the final change (Peter Zijlstra)
- Auto-manage ext/fair dl_server bandwidth (Andrea Righi)
- Fix cpu_util runnable_avg arithmetic (Hongyan Xia)
- Optimize update_tg_load_avg()'s rate-limiting code (Rik van Riel)
- Allow account_cfs_rq_runtime() to throttle current hierarchy
(K Prateek Nayak)
- Update util_est after updating util_avg during dequeue, to fix the
util signal update logic, which reduces signal noise (Vincent
Guittot)
Scheduler topology updates:
- Allow multiple domains to claim sched_domain_shared (K Prateek
Nayak)
- Add parameter to split LLC (Peter Zijlstra)
Core scheduler updates:
- Use trace_call__<tp>() to save a static branch (Gabriele Monaco)
Scheduler statistics updates:
- Drop now-stale mul_u64_u64_div_u64() cputime over-approximation
guard (Nicolas Pitre)
Deadline scheduler updates:
- Reject debugfs dl_server writes for offline CPUs (Andrea Righi)
- Fix replenishment logic for non-deferred servers (Yuri Andriaccio)
RT scheduling updates:
- Turn RT_PUSH_IPI default off for non PREEMPT_RT (Steven Rostedt)
- Update default bandwidth for real-time tasks to 1.0 (Yuri
Andriaccio)
Proxy scheduling updates:
- A series to implement Optimized Donor Migration for Proxy Execution
(John Stultz, Peter Zijlstra)
- Various proxy scheduling cleanups and fixes (Peter Zijlstra,
K Prateek Nayak)
Misc fixes, improvements and cleanups by Aaron Lu, Andrea Righi,
Zenghui Yu, Chen Yu, Guanyou.Chen, John Stultz, Shrikanth Hegde,
Peter Zijlstra, Liang Luo and Yiyang Chen"
* tag 'sched-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (91 commits)
sched/fair: Fix newidle vs core-sched
sched/deadline: Use task_on_rq_migrating() helper
sched/core: Combine separate 'else' and 'if' statements
sched/fair: Fix cpu_util runnable_avg arithmetic
sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime()
sched/fair: Move the throttled tasks to a local list in tg_unthrottle_up()
sched/fair: Call update_curr() before unthrottling the hierarchy
sched/fair: Use throttled_csd_list for local unthrottle
sched/fair: Convert cfs bandwidth throttling to use guards
sched/fair: Allocate cfs_tg_state with percpu allocator
sched/fair: Remove task_group->se pointer array
sched/fair: Co-locate cfs_rq and sched_entity in cfs_tg_state
sched: restore timer_slack_ns when resetting RT policy on fork
MAINTAINERS: Fix spelling mistake in Peter's name
sched: Simplify ttwu_runnable()
sched/proxy: Remove superfluous clear_task_blocked_in()
sched/proxy: Remove PROXY_WAKING
sched/proxy: Switch proxy to use p->is_blocked
sched/proxy: Only return migrate when needed
sched: Be more strict about p->is_blocked
...
|
|
gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"Futex updates:
- Optimize futex hash bucket access patterns (Peter Zijlstra)
- Large series to address the robust futex unlock race for real, by
Thomas Gleixner:
"The robust futex unlock mechanism is racy in respect to the
clearing of the robust_list_head::list_op_pending pointer because
unlock and clearing the pointer are not atomic.
The race window is between the unlock and clearing the pending op
pointer. If the task is forced to exit in this window, exit will
access a potentially invalid pending op pointer when cleaning up
the robust list.
That happens if another task manages to unmap the object
containing the lock before the cleanup, which results in an UAF.
In the worst case this UAF can lead to memory corruption when
unrelated content has been mapped to the same address by the time
the access happens.
User space can't solve this problem without help from the kernel.
This series provides the kernel side infrastructure to help it
along:
1) Combined unlock, pointer clearing, wake-up for the
contended case
2) VDSO based unlock and pointer clearing helpers with a
fix-up function in the kernel when user space was interrupted
within the critical section.
... with help by André Almeida:
- Add a note about robust list race condition (André Almeida)
- Add self-tests for robust release operations (André Almeida)
Context analysis updates:
- Implement context analysis for 'struct rt_mutex'. (Bart Van Assche)
- Bump required Clang version to 23 (Marco Elver)
Guard infrastructure updates:
- Series to remove NULL check from unconditional guards (Dmitry
Ilvokhin)
Lockdep updates:
- Restore self-test migrate_disable() and sched_rt_mutex state on
PREEMPT_RT (Karl Mehltretter)
Membarriers updates:
- Use per-CPU mutexes for targeted commands (Aniket Gattani)
- Modernize membarrier_global_expedited with cleanup guards (Aniket
Gattani)
- Add rseq stress test for CFS throttle interactions (Aniket Gattani)
percpu-rwsems updates:
- Extract __percpu_up_read() to optimize inlining overhead (Dmitry
Ilvokhin)
Seqlocks updates:
- Allow UBSAN_ALIGNMENT to fail optimizing (Heiko Carstens)
Lock tracing:
- Add contended_release tracepoint to sleepable locks such as
mutexes, percpu-rwsems, rtmutexes, rwsems and semaphores (Dmitry
Ilvokhin)
MAINTAINERS updates:
- MAINTAINERS: Add RUST [SYNC] entry (Boqun Feng)
Misc updates and fixes by Randy Dunlap, YE WEI-HONG, Fabricio Parra,
Dmitry Ilvokhin and Peter Zijlstra"
* tag 'locking-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (36 commits)
locking: Add contended_release tracepoint to sleepable locks
locking/percpu-rwsem: Extract __percpu_up_read()
tracing/lock: Remove unnecessary linux/sched.h include
futex: Optimize futex hash bucket access patterns
rust: sync: completion: Mark inline complete_all and wait_for_completion
MAINTAINERS: Add RUST [SYNC] entry
cleanup: Specify nonnull argument index
selftests: futex: Add tests for robust release operations
Documentation: futex: Add a note about robust list race condition
x86/vdso: Implement __vdso_futex_robust_try_unlock()
x86/vdso: Prepare for robust futex unlock support
futex: Provide infrastructure to plug the non contended robust futex unlock race
futex: Add robust futex unlock IP range
futex: Add support for unlocking robust futexes
futex: Cleanup UAPI defines
x86: Select ARCH_MEMORY_ORDER_TSO
uaccess: Provide unsafe_atomic_store_release_user()
futex: Provide UABI defines for robust list entry modifiers
futex: Move futex related mm_struct data into a struct
futex: Make futex_mm_init() void
...
|
|
Add the contended_release trace event. This tracepoint fires on the
holder side when a contended lock is released, complementing the
existing contention_begin/contention_end tracepoints which fire on the
waiter side.
This enables correlating lock hold time under contention with waiter
events by lock address.
Add trace_contended_release()/trace_call__contended_release() calls to
the slowpath unlock paths of sleepable locks: mutex, rtmutex, semaphore,
rwsem, percpu-rwsem, and RT-specific rwbase locks.
Where possible, trace_contended_release() fires before the lock is
released and before the waiter is woken. For some lock types, the
tracepoint fires after the release but before the wake. Making the
placement consistent across all lock types is not worth the added
complexity.
For reader/writer locks, the tracepoint fires for every reader releasing
while a writer is waiting, not only for the last reader.
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Usama Arif <usama.arif@linux.dev>
Link: https://patch.msgid.link/02f4f6c5ce6761e7f6587cf0ff2289d962ecddd4.1780506267.git.d@ilvokhin.com
|
|
Move the percpu_up_read() slowpath out of the inline function into a new
__percpu_up_read() to avoid binary size increase from adding a
tracepoint to an inlined function.
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Usama Arif <usama.arif@linux.dev>
Link: https://patch.msgid.link/3dd2a1b9ab4f469e1892766cb63f41d6b0f53d29.1780506267.git.d@ilvokhin.com
|
|
syzbot triggered the following splat in remove_waiter() via
FUTEX_CMP_REQUEUE_PI:
KASAN: null-ptr-deref in range [0x0000000000000a88-0x0000000000000a8f]
class_raw_spinlock_constructor
remove_waiter+0x159/0x1200 kernel/locking/rtmutex.c:1561
rt_mutex_start_proxy_lock+0x103/0x120
futex_requeue+0x10e4/0x20d0
__x64_sys_futex+0x34f/0x4d0
task_blocks_on_rt_mutex() does not arm the waiter upon deadlock detection,
leaving waiter->task nil, where 3bfdc63936dd ("rtmutex: Use waiter::task instead
of current in remove_waiter()") made this fatal.
Furthermore, rt_mutex_start_proxy_lock() should not be calling into remove_waiter()
upon a successfully grabbing the rtmutex. 1a1fb985f2e2 ("futex: Handle early deadlock
return correctly"), moved the remove_waiter() out of __rt_mutex_start_proxy_lock()
(where 'ret' was only ever 0 or < 0) into the wrapper. Tighten this check to
account for try_to_take_rt_mutex().
Fixes: 3bfdc63936dd ("rtmutex: Use waiter::task instead of current in remove_waiter()")
Reported-by: syzbot+78147abe6c524f183ee9@syzkaller.appspotmail.com
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: stable@vger.kernel.org
Closes: https://lore.kernel.org/all/69f114ac.050a0220.ac8b.0003.GAE@google.com/
Link: https://patch.msgid.link/20260507112913.1019537-1-dave@stgolabs.net
|
|
Now that the proxy path uses ->is_blocked, use the '->is_blocked &&
!->blocked_on' state instead of PROXY_WAKING. Notably, this is where a
blocked_on relation is broken but the donor task might still need a return
migration.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260526113322.596522894%40infradead.org
|
|
Add link to the task this task is proxying for, and use it so
the mutex owner can do an intelligent hand-off of the mutex to
the task that the owner is running on behalf.
[jstultz: This patch was split out from larger proxy patch]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260512025635.2840817-8-jstultz@google.com
|
|
TASK_RUNNING
Vineeth found came up with a test driver that could trip up
workqueue stalls. After fixing one issue this test found,
Vineeth reported the test was still failing.
Greatly simplified, a task that tries to take a mutex already
owned by another task that is sleeping, can hit a edge case in
the mutex_lock_common() case.
If the task fails to get the lock, calls into schedule, but gets
a spurious wakeup, it will find that it is first waiter, and
go into the mutex_optimistic_spin() logic. Though before calling
mutex_optimistic_spin(), we clear task blocked_on state, since
mutex_optimistic_spin() may call schedule() if need_resched() is
set.
After mutex_optimistic_spin() fails, we set blocked_on again,
restart the main mutex loop, try to take the lock and call into
schedule_preempt_disabled().
From there, with proxy-execution, we'll see the task is
blocked_on, follow the chain, see the owner is sleeping and
dequeue the waiting task from the runqueue.
This all sounds fine and reasonable. But what I had missed is
that in mutex_optimistic_spin(), not only do we call schedule()
but we set TASK_RUNNABLE right before doing so.
This is ok for that invocation of schedule(). But when we come
back we re-set the blocked_on we had just cleared, but we do not
re-set the task state to TASK_INTERRUPTIBLE/UNINTERRUPTIBLE.
This means we have a task that is blocked_on & TASK_RUNNABLE,
so when the proxy execution code dequeues the task, we are
in trouble since future wakeups will be shortcut by the
ttwu_state_match() check.
Thus, to avoid this, after mutex_optimistic_spin(), set the task
state back when we set blocked_on.
Many many thanks again to Vineeth for his very useful testing
driver that uncovered this long hidden bug, that I hadn't
tripped in all my testing! Very impressed with the problems he's
uncovered!
Reported-by: Vineeth Pillai <vineethrp@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Vineeth Pillai <vineethrp@google.com>
Link: https://patch.msgid.link/20260430215103.2978955-3-jstultz@google.com
|
|
Enable context analysis for struct rt_mutex and annotate all functions that
accept a struct rt_mutex pointer. In the __rt_mutex_lock_common() callers,
instead of adding the __no_context_analysis annotation, emit a runtime
warning if the __rt_mutex_lock_common() return value is not zero and add an
__acquire() statement.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260508174520.1416285-1-bvanassche@acm.org
|
|
Chaitanya, John and Mikhail reported commit 25500ba7e77c ("locking/mutex:
Remove the list_head from struct mutex") wrecked ww_mutex.
Specifically there were 2 issues:
- __ww_waiter_prev() had the termination condition wrong; it would terminate
when the previous entry was the first, which results in a truncated
iteration: W3, W2, (no W1).
- __mutex_add_waiter(@pos != NULL), as used by __ww_waiter_add() /
__ww_mutex_add_waiter(); this inserts @waiter before @pos (which is what
list_add_tail() does). But this should then also update lock->first_waiter.
Much thanks to Prateek for spotting the __mutex_add_waiter() issue!
Fixes: 25500ba7e77c ("locking/mutex: Remove the list_head from struct mutex")
Reported-by: "Borah, Chaitanya Kumar" <chaitanya.kumar.borah@intel.com>
Closes: https://lore.kernel.org/r/af005996-05e9-4336-8450-d14ca652ba5d%40intel.com
Reported-by: John Stultz <jstultz@google.com>
Closes: https://lore.kernel.org/r/CANDhNCq%3Doizzud3hH3oqGzTrcjB8OwGeineJ3mwZuGdDWG8fRQ%40mail.gmail.com
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Closes: https://lore.kernel.org/r/CABXGCsO5fKq2nD9nO8yO1z50ZzgCPWqueNXHANjntaswoOh2Dg@mail.gmail.com
Debugged-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Link: https://patch.msgid.link/20260422092335.GH3102924%40noisy.programming.kicks-ass.net
|
|
remove_waiter() is used by the slowlock paths, but it is also used for
proxy-lock rollback in rt_mutex_start_proxy_lock() when invoked from
futex_requeue().
In the latter case waiter::task is not current, but remove_waiter()
operates on current for the dequeue operation. That results in several
problems:
1) the rbtree dequeue happens without waiter::task::pi_lock being held
2) the waiter task's pi_blocked_on state is not cleared, which leaves a
dangling pointer primed for UAF around.
3) rt_mutex_adjust_prio_chain() operates on the wrong top priority waiter
task
Use waiter::task instead of current in all related operations in
remove_waiter() to cure those problems.
[ tglx: Fixup rt_mutex_adjust_prio_chain(), add a comment and amend the
changelog ]
Fixes: 8161239a8bcc ("rtmutex: Simplify PI algorithm and make highest prio task get lock")
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: stable@vger.kernel.org
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Fair scheduling updates:
- Skip SCHED_IDLE rq for SCHED_IDLE tasks (Christian Loehle)
- Remove superfluous rcu_read_lock() in the wakeup path (K Prateek Nayak)
- Simplify the entry condition for update_idle_cpu_scan() (K Prateek Nayak)
- Simplify SIS_UTIL handling in select_idle_cpu() (K Prateek Nayak)
- Avoid overflow in enqueue_entity() (K Prateek Nayak)
- Update overutilized detection (Vincent Guittot)
- Prevent negative lag increase during delayed dequeue (Vincent Guittot)
- Clear buddies for preempt_short (Vincent Guittot)
- Implement more complex proportional newidle balance (Peter Zijlstra)
- Increase weight bits for avg_vruntime (Peter Zijlstra)
- Use full weight to __calc_delta() (Peter Zijlstra)
RT and DL scheduling updates:
- Fix incorrect schedstats for rt and dl thread (Dengjun Su)
- Skip group schedulable check with rt_group_sched=0 (Michal Koutný)
- Move group schedulability check to sched_rt_global_validate()
(Michal Koutný)
- Add reporting of runtime left & abs deadline to sched_getattr()
for DEADLINE tasks (Tommaso Cucinotta)
Scheduling topology updates by K Prateek Nayak:
- Compute sd_weight considering cpuset partitions
- Extract "imb_numa_nr" calculation into a separate helper
- Allocate per-CPU sched_domain_shared in s_data
- Switch to assigning "sd->shared" from s_data
- Remove sched_domain_shared allocation with sd_data
Energy-aware scheduling updates:
- Filter false overloaded_group case for EAS (Vincent Guittot)
- PM: EM: Switch to rcu_dereference_all() in wakeup path
(Dietmar Eggemann)
Infrastructure updates:
- Replace use of system_unbound_wq with system_dfl_wq (Marco Crivellari)
Proxy scheduling updates by John Stultz:
- Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
- Minimise repeated sched_proxy_exec() checking
- Fix potentially missing balancing with Proxy Exec
- Fix and improve task::blocked_on et al handling
- Add assert_balance_callbacks_empty() helper
- Add logic to zap balancing callbacks if we pick again
- Move attach_one_task() and attach_task() helpers to sched.h
- Handle blocked-waiter migration (and return migration)
- Add K Prateek Nayak to scheduler reviewers for proxy execution
Misc cleanups and fixes by John Stultz, Joseph Salisbury, Peter
Zijlstra, K Prateek Nayak, Michal Koutný, Randy Dunlap, Shrikanth
Hegde, Vincent Guittot, Zhan Xusheng, Xie Yuanbin and Vincent Guittot"
* tag 'sched-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
sched/eevdf: Clear buddies for preempt_short
sched/rt: Cleanup global RT bandwidth functions
sched/rt: Move group schedulability check to sched_rt_global_validate()
sched/rt: Skip group schedulable check with rt_group_sched=0
sched/fair: Avoid overflow in enqueue_entity()
sched: Use u64 for bandwidth ratio calculations
sched/fair: Prevent negative lag increase during delayed dequeue
sched/fair: Use sched_energy_enabled()
sched: Handle blocked-waiter migration (and return migration)
sched: Move attach_one_task and attach_task helpers to sched.h
sched: Add logic to zap balance callbacks if we pick again
sched: Add assert_balance_callbacks_empty helper
sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration
sched: Fix modifying donor->blocked on without proper locking
locking: Add task::blocked_lock to serialize blocked_on state
sched: Fix potentially missing balancing with Proxy Exec
sched: Minimise repeated sched_proxy_exec() checking
sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
MAINTAINERS: Add K Prateek Nayak to scheduler reviewers
sched/core: Get this cpu once in ttwu_queue_cond()
...
|
|
return-migration
As we add functionality to proxy execution, we may migrate a
donor task to a runqueue where it can't run due to cpu affinity.
Thus, we must be careful to ensure we return-migrate the task
back to a cpu in its cpumask when it becomes unblocked.
Peter helpfully provided the following example with pictures:
"Suppose we have a ww_mutex cycle:
,-+-* Mutex-1 <-.
Task-A ---' | | ,-- Task-B
`-> Mutex-2 *-+-'
Where Task-A holds Mutex-1 and tries to acquire Mutex-2, and
where Task-B holds Mutex-2 and tries to acquire Mutex-1.
Then the blocked_on->owner chain will go in circles.
Task-A -> Mutex-2
^ |
| v
Mutex-1 <- Task-B
We need two things:
- find_proxy_task() to stop iterating the circle;
- the woken task to 'unblock' and run, such that it can
back-off and re-try the transaction.
Now, the current code [without this patch] does:
__clear_task_blocked_on();
wake_q_add();
And surely clearing ->blocked_on is sufficient to break the
cycle.
Suppose it is Task-B that is made to back-off, then we have:
Task-A -> Mutex-2 -> Task-B (no further blocked_on)
and it would attempt to run Task-B. Or worse, it could directly
pick Task-B and run it, without ever getting into
find_proxy_task().
Now, here is a problem because Task-B might not be runnable on
the CPU it is currently on; and because !task_is_blocked() we
don't get into the proxy paths, so nobody is going to fix this
up.
Ideally we would have dequeued Task-B alongside of clearing
->blocked_on, but alas, [the lock ordering prevents us from
getting the task_rq_lock() and] spoils things."
Thus we need more than just a binary concept of the task being
blocked on a mutex or not.
So allow setting blocked_on to PROXY_WAKING as a special value
which specifies the task is no longer blocked, but needs to
be evaluated for return migration *before* it can be run.
This will then be used in a later patch to handle proxy
return-migration.
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260324191337.1841376-7-jstultz@google.com
|
|
So far, we have been able to utilize the mutex::wait_lock
for serializing the blocked_on state, but when we move to
proxying across runqueues, we will need to add more state
and a way to serialize changes to this state in contexts
where we don't hold the mutex::wait_lock.
So introduce the task::blocked_lock, which nests under the
mutex::wait_lock in the locking order, and rework the locking
to use it.
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260324191337.1841376-5-jstultz@google.com
|
|
Make the spinlock implementation compatible with lock context analysis
(CONTEXT_ANALYSIS := 1) by adding lock context annotations to the
_raw_##op##_...() macros.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260313171510.230998-4-bvanassche@acm.org
|
|
Commit 1ea4b473504b ("locking/rwsem: Remove the list_head from struct
rw_semaphore") introduced a logic error in rwsem_del_waiter().
The root cause of this issue is an inconsistency in the return values of
__rwsem_del_waiter() and rwsem_del_waiter(). Specifically,
__rwsem_del_waiter() returns true when the wait list becomes empty,
whereas rwsem_del_waiter() is supposed to return true if the wait list
is NOT empty.
This caused a null pointer dereference in rwsem_mark_wake() because it
was being called when sem->first_waiter was NULL.
Fixes: 1ea4b473504b ("locking/rwsem: Remove the list_head from struct rw_semaphore")
Reported-by: syzbot+3d2ff92c67127d337463@syzkaller.appspotmail.com
Signed-off-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: syzbot+3d2ff92c67127d337463@syzkaller.appspotmail.com
Link: https://patch.msgid.link/20260314182607.3343346-1-avagin@google.com
|
|
Add compiler context analysis annotations.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260306101417.GT1282955@noisy.programming.kicks-ass.net
|
|
Add compiler context analysis annotations.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260121111213.851599178@infradead.org
|
|
Add compiler context analysis annotations.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260121111213.745353747@infradead.org
|
|
Instead of embedding a list_head in struct mutex, store a pointer to
the first waiter. The list of waiters remains a doubly linked list so
we can efficiently add to the tail of the list, remove from the front
(or middle) of the list.
Some of the list manipulation becomes more complicated, but it's a
reasonable tradeoff on the slow paths to shrink data structures which
embed a mutex like struct file.
Some of the debug checks have to be deleted because there's no equivalent
to checking them in the new scheme (eg an empty waiter->list now means
that it is the only waiter, not that the waiter is no longer on the list).
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260305195545.3707590-4-willy@infradead.org
|
|
Instead of embedding a list_head in struct semaphore, store a pointer to
the first waiter. The list of waiters remains a doubly linked list so
we can efficiently add to the tail of the list and remove from the front
(or middle) of the list.
Some of the list manipulation becomes more complicated, but it's a
reasonable tradeoff on the slow paths to shrink data structures
which embed a semaphore.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260305195545.3707590-3-willy@infradead.org
|
|
Instead of embedding a list_head in struct rw_semaphore, store a pointer
to the first waiter. The list of waiters remains a doubly linked list
so we can efficiently add to the tail of the list, remove from the front
(or middle) of the list.
Some of the list manipulation becomes more complicated, but it's a
reasonable tradeoff on the slow paths to shrink some core data structures
like struct inode.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260305195545.3707590-2-willy@infradead.org
|
|
Typo, this wants to be _lockdep().
Fixes: 51d7a054521d ("locking/mutex: Redo __mutex_init() to reduce generated code size")
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260217191512.1180151-2-dave@stgolabs.net
|
|
Conversion performed via this Coccinelle script:
// SPDX-License-Identifier: GPL-2.0-only
// Options: --include-headers-for-types --all-includes --include-headers --keep-comments
virtual patch
@gfp depends on patch && !(file in "tools") && !(file in "samples")@
identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex,
kzalloc_obj,kzalloc_objs,kzalloc_flex,
kvmalloc_obj,kvmalloc_objs,kvmalloc_flex,
kvzalloc_obj,kvzalloc_objs,kvzalloc_flex};
@@
ALLOC(...
- , GFP_KERNEL
)
$ make coccicheck MODE=patch COCCI=gfp.cocci
Build and boot tested x86_64 with Fedora 42's GCC and Clang:
Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This converts some of the visually simpler cases that have been split
over multiple lines. I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.
Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script. I probably had made it a bit _too_ trivial.
So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.
The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
In cases where the ww_mutex test was occasionally tripping on
hard to find issues, leaving qemu in a reboot loop was my best
way to reproduce problems. These reboots however wasted time
when I just wanted to run the test-ww_mutex logic.
So tweak the test-ww_mutex test so that it can be re-triggered
via a sysfs file, so the test can be run repeatedly without
doing module loads or restarting.
This has been particularly valuable to stressing and finding
issues with the proxy-exec series.
To use, run as root:
echo 1 > /sys/kernel/test_ww_mutex/run_tests
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251205013515.759030-4-jstultz@google.com
|
|
The test-ww_mutex test already allocates its own workqueue
so be sure to use it for the mtx.work and abba.work rather
then the default system workqueue.
This resolves numerous messages of the sort:
"workqueue: test_abba_work hogged CPU... consider switching to WQ_UNBOUND"
"workqueue: test_mutex_work hogged CPU... consider switching to WQ_UNBOUND"
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251205013515.759030-3-jstultz@google.com
|
|
Currently the test-ww_mutex tool only utilizes the wait-die
class of ww_mutexes, and thus isn't very helpful in exercising
the wait-wound class of ww_mutexes.
So extend the test to exercise both classes of ww_mutexes for
all of the subtests.
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251205013515.759030-2-jstultz@google.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux
Pull RCU updates from Frederic Weisbecker:
"SRCU:
- Properly handle SRCU readers within IRQ disabled sections in tiny
SRCU
- Preparation to reimplement RCU Tasks Trace on top of SRCU fast:
- Introduce API to expedite a grace period and test it through
rcutorture
- Split srcu-fast in two flavours: SRCU-fast and SRCU-fast-updown.
Both are still targeted toward faster readers (without full
barriers on LOCK and UNLOCK) at the expense of heavier write
side (using full RCU grace period ordering instead of simply
full ordering) as compared to "traditional" non-fast SRCU. But
those srcu-fast flavours are going to be optimized in two
different ways:
- SRCU-fast will become the reimplementation basis for
RCU-TASK-TRACE for consolidation. Since RCU-TASK-TRACE must
be NMI safe, SRCU-fast must be as well.
- SRCU-fast-updown will be needed for uretprobes code in order
to get rid of the read-side memory barriers while still
allowing entering the reader at task level while exiting it
in a timer handler. It is considered semaphore-like in that
it can have different owners between LOCK and UNLOCK.
However it is not NMI-safe.
The actual optimizations are work in progress for the next
cycle. Only the new interfaces are added for now, along with
related torture and scalability test code.
- Create/document/debug/torture new proper initializers for RCU fast:
DEFINE_SRCU_FAST() and init_srcu_struct_fast()
This allows for using right away the proper ordering on the write
side (either full ordering or full RCU grace period ordering)
without waiting for the read side to tell which to use.
This also optimizes the read side altogether with moving flavour
debug checks under debug config and with removing a costly RmW
operation on their first call.
- Make some diagnostic functions tracing safe
Refscale:
- Add performance testing for common context synchronizations
(Preemption, IRQ, Softirq) and per-cpu increments. Those are
relevant comparisons against SRCU-fast read side APIs, especially
as they are planned to synchronize further tracing fast-path code
Miscellanous:
- In order to prepare the layout for nohz_full work deferral to user
exit, the context tracking state must shrink the counter of
transitions to/from RCU not watching. The only possible hazard is
to trigger wrap-around more easily, delaying a bit grace periods
when that happens. This should be a rare event though. Yet add
debugging and torture code to test that assumption
- Fix memory leak on locktorture module
- Annotate accesses in rculist_nulls.h to prevent from KCSAN
warnings. On recent discussions, we also concluded that all those
WRITE_ONCE() and READ_ONCE() on list APIs deserve appropriate
comments. Something to be expected for the next cycle
- Provide a script to apply several configs to several commits with
torture
- Allow torture to reuse a build directory in order to save needless
rebuild time
- Various cleanups"
* tag 'rcu.release.v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (29 commits)
refscale: Add SRCU-fast-updown readers
refscale: Exercise DEFINE_STATIC_SRCU_FAST() and init_srcu_struct_fast()
rcutorture: Make srcu{,d}_torture_init() announce the SRCU type
srcu: Create an SRCU-fast-updown API
refscale: Do not disable interrupts for tests involving local_bh_enable()
refscale: Add non-atomic per-CPU increment readers
refscale: Add this_cpu_inc() readers
refscale: Add preempt_disable() readers
refscale: Add local_bh_disable() readers
refscale: Add local_irq_disable() and local_irq_save() readers
torture: Permit negative kvm.sh --kconfig numberic arguments
srcu: Add SRCU_READ_FLAVOR_FAST_UPDOWN CPP macro
rcu: Mark diagnostic functions as notrace
rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
rcutorture: Remove redundant rcutorture_one_extend() from rcu_torture_one_read()
rcutorture: Permit kvm-again.sh to re-use the build directory
torture: Add kvm-series.sh to test commit/scenario combination
rcu: use WRITE_ONCE() for ->next and ->pprev of hlist_nulls
locktorture: Fix memory leak in param_set_cpumask()
doc: Update for SRCU-fast definitions and initialization
...
|
|
mutex_init() invokes __mutex_init() providing the name of the lock and
a pointer to a the lock class. With LOCKDEP enabled this information is
useful but without LOCKDEP it not used at all. Passing the pointer
information of the lock class might be considered negligible but the
name of the lock is passed as well and the string is stored. This
information is wasting storage.
Split __mutex_init() into a _genereic() variant doing the initialisation
of the lock and a _lockdep() version which does _genereic() plus the
lockdep bits. Restrict the lockdep version to lockdep enabled builds
allowing the compiler to remove the unused parameter.
This results in the following size reduction:
text data bss dec filename
| 30237599 8161430 1176624 39575653 vmlinux.defconfig
| 30233269 8149142 1176560 39558971 vmlinux.defconfig.patched
-4.2KiB -12KiB
| 32455099 8471098 12934684 53860881 vmlinux.defconfig.lockdep
| 32455100 8471098 12934684 53860882 vmlinux.defconfig.patched.lockdep
| 27152407 7191822 2068040 36412269 vmlinux.defconfig.preempt_rt
| 27145937 7183630 2067976 36397543 vmlinux.defconfig.patched.preempt_rt
-6.3KiB -8KiB
| 29382020 7505742 13784608 50672370 vmlinux.defconfig.preempt_rt.lockdep
| 29376229 7505742 13784544 50666515 vmlinux.defconfig.patched.preempt_rt.lockdep
-5.6KiB
[peterz: folded fix from boqun]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Link: https://lkml.kernel.org/r/20251125145425.68319-1-boqun.feng@gmail.com
Link: https://patch.msgid.link/20251105142350.Tfeevs2N@linutronix.de
|
|
With CONFIG_CPUMASK_OFFSTACK=y, the 'bind_writers' buffer is allocated via
alloc_cpumask_var() in param_set_cpumask(). But it is not freed, when
setting the module parameter multiple times by sysfs interface or removing
module.
Below kmemleak trace is seen for this issue:
unreferenced object 0xffff888100aabff8 (size 8):
comm "bash", pid 323, jiffies 4295059233
hex dump (first 8 bytes):
07 00 00 00 00 00 00 00 ........
backtrace (crc ac50919):
__kmalloc_node_noprof+0x2e5/0x420
alloc_cpumask_var_node+0x1f/0x30
param_set_cpumask+0x26/0xb0 [locktorture]
param_attr_store+0x93/0x100
module_attr_store+0x1b/0x30
kernfs_fop_write_iter+0x114/0x1b0
vfs_write+0x300/0x410
ksys_write+0x60/0xd0
do_syscall_64+0xa4/0x260
entry_SYSCALL_64_after_hwframe+0x77/0x7f
This issue can be reproduced by:
insmod locktorture.ko bind_writers=1
rmmod locktorture
or:
insmod locktorture.ko bind_writers=1
echo 2 > /sys/module/locktorture/parameters/bind_writers
Considering that setting the module parameter 'bind_writers' or
'bind_readers' by sysfs interface has no real effect, set the parameter
permissions to 0444. To fix the memory leak when removing module, free
'bind_writers' and 'bind_readers' memory in lock_torture_cleanup().
Fixes: 73e341242483 ("locktorture: Add readers_bind and writers_bind module parameters")
Suggested-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
|
|
KCSAN reports:
BUG: KCSAN: data-race in do_raw_write_lock / do_raw_write_lock
write (marked) to 0xffff800009cf504c of 4 bytes by task 1102 on cpu 1:
do_raw_write_lock+0x120/0x204
_raw_write_lock_irq
do_exit
call_usermodehelper_exec_async
ret_from_fork
read to 0xffff800009cf504c of 4 bytes by task 1103 on cpu 0:
do_raw_write_lock+0x88/0x204
_raw_write_lock_irq
do_exit
call_usermodehelper_exec_async
ret_from_fork
value changed: 0xffffffff -> 0x00000001
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 1103 Comm: kworker/u4:1 6.1.111
Commit 1a365e822372 ("locking/spinlock/debug: Fix various data races") has
adressed most of these races, but seems to be not consistent/not complete.
>From do_raw_write_lock() only debug_write_lock_after() part has been
converted to WRITE_ONCE(), but not debug_write_lock_before() part.
Do it now.
Fixes: 1a365e822372 ("locking/spinlock/debug: Fix various data races")
Reported-by: Adrian Freihofer <adrian.freihofer@siemens.com>
Signed-off-by: Alexander Sverdlin <alexander.sverdlin@siemens.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: stable@vger.kernel.org
|
|
Introduce local_lock_is_locked() that returns true when
given local_lock is locked by current cpu (in !PREEMPT_RT) or
by current task (in PREEMPT_RT).
The goal is to detect a deadlock by the caller.
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Borislav Petkov:
- Make sure sanity checks down in the mutex lock path happen on the
correct type of task so that they don't trigger falsely
- Use the write unsafe user access pairs when writing a futex value to
prevent an error on PowerPC which does user read and write accesses
differently
* tag 'locking_urgent_for_v6.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking: Fix __clear_task_blocked_on() warning from __ww_mutex_wound() path
futex: Use user_write_access_begin/_end() in futex_put_value()
|
|
The __clear_task_blocked_on() helper added a number of sanity
checks ensuring we hold the mutex wait lock and that the task
we are clearing blocked_on pointer (if set) matches the mutex.
However, there is an edge case in the _ww_mutex_wound() logic
where we need to clear the blocked_on pointer for the task that
owns the mutex, not the task that is waiting on the mutex.
For this case the sanity checks aren't valid, so handle this
by allowing a NULL lock to skip the additional checks.
K Prateek Nayak and Maarten Lankhorst also pointed out that in
this case where we don't hold the owner's mutex wait_lock, we
need to be a bit more careful using READ_ONCE/WRITE_ONCE in both
the __clear_task_blocked_on() and __set_task_blocked_on()
implementations to avoid accidentally tripping WARN_ONs if two
instances race. So do that here as well.
This issue was easier to miss, I realized, as the test-ww_mutex
driver only exercises the wait-die class of ww_mutexes. I've
sent a patch[1] to address this so the logic will be easier to
test.
[1]: https://lore.kernel.org/lkml/20250801023358.562525-2-jstultz@google.com/
Fixes: a4f0b6fef4b0 ("locking/mutex: Add p->blocked_on wrappers for correctness checks")
Closes: https://lore.kernel.org/lkml/68894443.a00a0220.26d0e1.0015.GAE@google.com/
Reported-by: syzbot+602c4720aed62576cd79@syzkaller.appspotmail.com
Reported-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20250805001026.2247040-1-jstultz@google.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
"Significant patch series in this pull request:
- "squashfs: Remove page->mapping references" (Matthew Wilcox) gets
us closer to being able to remove page->mapping
- "relayfs: misc changes" (Jason Xing) does some maintenance and
minor feature addition work in relayfs
- "kdump: crashkernel reservation from CMA" (Jiri Bohac) switches
us from static preallocation of the kdump crashkernel's working
memory over to dynamic allocation. So the difficulty of a-priori
estimation of the second kernel's needs is removed and the first
kernel obtains extra memory
- "generalize panic_print's dump function to be used by other
kernel parts" (Feng Tang) implements some consolidation and
rationalization of the various ways in which a failing kernel
splats information at the operator
* tag 'mm-nonmm-stable-2025-08-03-12-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (80 commits)
tools/getdelays: add backward compatibility for taskstats version
kho: add test for kexec handover
delaytop: enhance error logging and add PSI feature description
samples: Kconfig: fix spelling mistake "instancess" -> "instances"
fat: fix too many log in fat_chain_add()
scripts/spelling.txt: add notifer||notifier to spelling.txt
xen/xenbus: fix typo "notifer"
net: mvneta: fix typo "notifer"
drm/xe: fix typo "notifer"
cxl: mce: fix typo "notifer"
KVM: x86: fix typo "notifer"
MAINTAINERS: add maintainers for delaytop
ucount: use atomic_long_try_cmpxchg() in atomic_long_inc_below()
ucount: fix atomic_long_inc_below() argument type
kexec: enable CMA based contiguous allocation
stackdepot: make max number of pools boot-time configurable
lib/xxhash: remove unused functions
init/Kconfig: restore CONFIG_BROKEN help text
lib/raid6: update recov_rvv.c zero page usage
docs: update docs after introducing delaytop
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl updates from Joel Granados:
- Move sysctls out of the kern_table array
This is the final move of ctl_tables into their respective
subsystems. Only 5 (out of the original 50) will remain in
kernel/sysctl.c file; these handle either sysctl or common arch
variables.
By decentralizing sysctl registrations, subsystem maintainers regain
control over their sysctl interfaces, improving maintainability and
reducing the likelihood of merge conflicts.
- docs: Remove false positives from check-sysctl-docs
Stopped falsely identifying sysctls as undocumented or unimplemented
in the check-sysctl-docs script. This script can now be used to
automatically identify if documentation is missing.
* tag 'sysctl-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: (23 commits)
docs: Downgrade arm64 & riscv from titles to comment
docs: Replace spaces with tabs in check-sysctl-docs
docs: Remove colon from ctltable title in vm.rst
docs: Add awk section for ucount sysctl entries
docs: Use skiplist when checking sysctl admin-guide
docs: nixify check-sysctl-docs
sysctl: rename kern_table -> sysctl_subsys_table
kernel/sys.c: Move overflow{uid,gid} sysctl into kernel/sys.c
uevent: mv uevent_helper into kobject_uevent.c
sysctl: Removed unused variable
sysctl: Nixify sysctl.sh
sysctl: Remove superfluous includes from kernel/sysctl.c
sysctl: Remove (very) old file changelog
sysctl: Move sysctl_panic_on_stackoverflow to kernel/panic.c
sysctl: move cad_pid into kernel/pid.c
sysctl: Move tainted ctl_table into kernel/panic.c
Input: sysrq: mv sysrq into drivers/tty/sysrq.c
fork: mv threads-max into kernel/fork.c
parisc/power: Move soft-power into power.c
mm: move randomize_va_space into memory.c
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"Locking primitives:
- Mark devm_mutex_init() as __must_check and fix drivers that didn't
check the return code (Thomas Weißschuh)
- Reorganize <linux/local_lock.h> to better expose the internal APIs
to local variables (Sebastian Andrzej Siewior)
- Remove OWNER_SPINNABLE in rwsem (Jinliang Zheng)
- Remove redundant #ifdefs in the mutex code (Ran Xiaokai)
Lockdep:
- Avoid returning struct in lock_stats() (Arnd Bergmann)
- Change `static const` into enum for LOCKF_*_IRQ_* (Arnd Bergmann)
- Temporarily use synchronize_rcu_expedited() in
lockdep_unregister_key() to speed things up. (Breno Leitao)
Rust runtime:
- Add #[must_use] to Lock::try_lock() (Jason Devers)"
* tag 'locking-core-2025-07-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
locking/mutex: Remove redundant #ifdefs
locking/lockdep: Change 'static const' variables to enum values
locking/lockdep: Avoid struct return in lock_stats()
locking/rwsem: Use OWNER_NONSPINNABLE directly instead of OWNER_SPINNABLE
rust: sync: Add #[must_use] to Lock::try_lock()
locking/mutex: Mark devm_mutex_init() as __must_check
leds: lp8860: Check return value of devm_mutex_init()
spi: spi-nxp-fspi: Check return value of devm_mutex_init()
local_lock: Move this_cpu_ptr() notation from internal to main header
|
|
Move the max_lock_depth sysctl table element into rtmutex_api.c. Removed
the rtmutex.h include from sysctl.c. Chose to move into rtmutex_api.c
to avoid multiple registrations every time rtmutex.c is included in other
files.
This is part of a greater effort to move ctl tables into their
respective subsystems which will reduce the merge conflicts in
kernel/sysctl.c.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Inspired by mutex blocker tracking[1], and having already extended it to
semaphores, let's now add support for reader-writer semaphores (rwsems).
The approach is simple: when a task enters TASK_UNINTERRUPTIBLE while
waiting for an rwsem, we just call hung_task_set_blocker(). The hung task
detector can then query the rwsem's owner to identify the lock holder.
Tracking works reliably for writers, as there can only be a single writer
holding the lock, and its task struct is stored in the owner field.
The main challenge lies with readers. The owner field points to only one
of many concurrent readers, so we might lose track of the blocker if that
specific reader unlocks, even while others remain. This is not a
significant issue, however. In practice, long-lasting lock contention is
almost always caused by a writer. Therefore, reliably tracking the writer
is the primary goal of this patch series ;)
With this change, the hung task detector can now show blocker task's info
like below:
[Fri Jun 27 15:21:34 2025] INFO: task cat:28631 blocked for more than 122 seconds.
[Fri Jun 27 15:21:34 2025] Tainted: G S 6.16.0-rc3 #8
[Fri Jun 27 15:21:34 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jun 27 15:21:34 2025] task:cat state:D stack:0 pid:28631 tgid:28631 ppid:28501 task_flags:0x400000 flags:0x00004000
[Fri Jun 27 15:21:34 2025] Call Trace:
[Fri Jun 27 15:21:34 2025] <TASK>
[Fri Jun 27 15:21:34 2025] __schedule+0x7c7/0x1930
[Fri Jun 27 15:21:34 2025] ? __pfx___schedule+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? policy_nodemask+0x215/0x340
[Fri Jun 27 15:21:34 2025] ? _raw_spin_lock_irq+0x8a/0xe0
[Fri Jun 27 15:21:34 2025] ? __pfx__raw_spin_lock_irq+0x10/0x10
[Fri Jun 27 15:21:34 2025] schedule+0x6a/0x180
[Fri Jun 27 15:21:34 2025] schedule_preempt_disabled+0x15/0x30
[Fri Jun 27 15:21:34 2025] rwsem_down_read_slowpath+0x55e/0xe10
[Fri Jun 27 15:21:34 2025] ? __pfx_rwsem_down_read_slowpath+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __pfx___might_resched+0x10/0x10
[Fri Jun 27 15:21:34 2025] down_read+0xc9/0x230
[Fri Jun 27 15:21:34 2025] ? __pfx_down_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __debugfs_file_get+0x14d/0x700
[Fri Jun 27 15:21:34 2025] ? __pfx___debugfs_file_get+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? handle_pte_fault+0x52a/0x710
[Fri Jun 27 15:21:34 2025] ? selinux_file_permission+0x3a9/0x590
[Fri Jun 27 15:21:34 2025] read_dummy_rwsem_read+0x4a/0x90
[Fri Jun 27 15:21:34 2025] full_proxy_read+0xff/0x1c0
[Fri Jun 27 15:21:34 2025] ? rw_verify_area+0x6d/0x410
[Fri Jun 27 15:21:34 2025] vfs_read+0x177/0xa50
[Fri Jun 27 15:21:34 2025] ? __pfx_vfs_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? fdget_pos+0x1cf/0x4c0
[Fri Jun 27 15:21:34 2025] ksys_read+0xfc/0x1d0
[Fri Jun 27 15:21:34 2025] ? __pfx_ksys_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] do_syscall_64+0x66/0x2d0
[Fri Jun 27 15:21:34 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[Fri Jun 27 15:21:34 2025] RIP: 0033:0x7f3f8faefb40
[Fri Jun 27 15:21:34 2025] RSP: 002b:00007ffdeda5ab98 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[Fri Jun 27 15:21:34 2025] RAX: ffffffffffffffda RBX: 0000000000010000 RCX: 00007f3f8faefb40
[Fri Jun 27 15:21:34 2025] RDX: 0000000000010000 RSI: 00000000010fa000 RDI: 0000000000000003
[Fri Jun 27 15:21:34 2025] RBP: 00000000010fa000 R08: 0000000000000000 R09: 0000000000010fff
[Fri Jun 27 15:21:34 2025] R10: 00007ffdeda59fe0 R11: 0000000000000246 R12: 00000000010fa000
[Fri Jun 27 15:21:34 2025] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff
[Fri Jun 27 15:21:34 2025] </TASK>
[Fri Jun 27 15:21:34 2025] INFO: task cat:28631 <reader> blocked on an rw-semaphore likely owned by task cat:28630 <writer>
[Fri Jun 27 15:21:34 2025] task:cat state:S stack:0 pid:28630 tgid:28630 ppid:28501 task_flags:0x400000 flags:0x00004000
[Fri Jun 27 15:21:34 2025] Call Trace:
[Fri Jun 27 15:21:34 2025] <TASK>
[Fri Jun 27 15:21:34 2025] __schedule+0x7c7/0x1930
[Fri Jun 27 15:21:34 2025] ? __pfx___schedule+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __mod_timer+0x304/0xa80
[Fri Jun 27 15:21:34 2025] schedule+0x6a/0x180
[Fri Jun 27 15:21:34 2025] schedule_timeout+0xfb/0x230
[Fri Jun 27 15:21:34 2025] ? __pfx_schedule_timeout+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __pfx_process_timeout+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? down_write+0xc4/0x140
[Fri Jun 27 15:21:34 2025] msleep_interruptible+0xbe/0x150
[Fri Jun 27 15:21:34 2025] read_dummy_rwsem_write+0x54/0x90
[Fri Jun 27 15:21:34 2025] full_proxy_read+0xff/0x1c0
[Fri Jun 27 15:21:34 2025] ? rw_verify_area+0x6d/0x410
[Fri Jun 27 15:21:34 2025] vfs_read+0x177/0xa50
[Fri Jun 27 15:21:34 2025] ? __pfx_vfs_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? fdget_pos+0x1cf/0x4c0
[Fri Jun 27 15:21:34 2025] ksys_read+0xfc/0x1d0
[Fri Jun 27 15:21:34 2025] ? __pfx_ksys_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] do_syscall_64+0x66/0x2d0
[Fri Jun 27 15:21:34 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[Fri Jun 27 15:21:34 2025] RIP: 0033:0x7f8f288efb40
[Fri Jun 27 15:21:34 2025] RSP: 002b:00007ffffb631038 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[Fri Jun 27 15:21:34 2025] RAX: ffffffffffffffda RBX: 0000000000010000 RCX: 00007f8f288efb40
[Fri Jun 27 15:21:34 2025] RDX: 0000000000010000 RSI: 000000002a4b5000 RDI: 0000000000000003
[Fri Jun 27 15:21:34 2025] RBP: 000000002a4b5000 R08: 0000000000000000 R09: 0000000000010fff
[Fri Jun 27 15:21:34 2025] R10: 00007ffffb630460 R11: 0000000000000246 R12: 000000002a4b5000
[Fri Jun 27 15:21:34 2025] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff
[Fri Jun 27 15:21:34 2025] </TASK>
[1] https://lore.kernel.org/all/174046694331.2194069.15472952050240807469.stgit@mhiramat.tok.corp.google.com/
Link: https://lkml.kernel.org/r/20250627072924.36567-3-lance.yang@linux.dev
Signed-off-by: Lance Yang <lance.yang@linux.dev>
Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: John Stultz <jstultz@google.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Mingzhe Yang <mingzhe.yang@ly.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tomasz Figa <tfiga@chromium.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yongliang Gao <leonylgao@tencent.com>
Cc: Zi Li <zi.li@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "extend hung task blocker tracking to rwsems".
Inspired by mutex blocker tracking[1], and having already extended it to
semaphores, let's now add support for reader-writer semaphores (rwsems).
The approach is simple: when a task enters TASK_UNINTERRUPTIBLE while
waiting for an rwsem, we just call hung_task_set_blocker(). The hung task
detector can then query the rwsem's owner to identify the lock holder.
Tracking works reliably for writers, as there can only be a single writer
holding the lock, and its task struct is stored in the owner field.
The main challenge lies with readers. The owner field points to only one
of many concurrent readers, so we might lose track of the blocker if that
specific reader unlocks, even while others remain. This is not a
significant issue, however. In practice, long-lasting lock contention is
almost always caused by a writer. Therefore, reliably tracking the writer
is the primary goal of this patch series ;)
With this change, the hung task detector can now show blocker task's info
like below:
[Fri Jun 27 15:21:34 2025] INFO: task cat:28631 blocked for more than 122 seconds.
[Fri Jun 27 15:21:34 2025] Tainted: G S 6.16.0-rc3 #8
[Fri Jun 27 15:21:34 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jun 27 15:21:34 2025] task:cat state:D stack:0 pid:28631 tgid:28631 ppid:28501 task_flags:0x400000 flags:0x00004000
[Fri Jun 27 15:21:34 2025] Call Trace:
[Fri Jun 27 15:21:34 2025] <TASK>
[Fri Jun 27 15:21:34 2025] __schedule+0x7c7/0x1930
[Fri Jun 27 15:21:34 2025] ? __pfx___schedule+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? policy_nodemask+0x215/0x340
[Fri Jun 27 15:21:34 2025] ? _raw_spin_lock_irq+0x8a/0xe0
[Fri Jun 27 15:21:34 2025] ? __pfx__raw_spin_lock_irq+0x10/0x10
[Fri Jun 27 15:21:34 2025] schedule+0x6a/0x180
[Fri Jun 27 15:21:34 2025] schedule_preempt_disabled+0x15/0x30
[Fri Jun 27 15:21:34 2025] rwsem_down_read_slowpath+0x55e/0xe10
[Fri Jun 27 15:21:34 2025] ? __pfx_rwsem_down_read_slowpath+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __pfx___might_resched+0x10/0x10
[Fri Jun 27 15:21:34 2025] down_read+0xc9/0x230
[Fri Jun 27 15:21:34 2025] ? __pfx_down_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __debugfs_file_get+0x14d/0x700
[Fri Jun 27 15:21:34 2025] ? __pfx___debugfs_file_get+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? handle_pte_fault+0x52a/0x710
[Fri Jun 27 15:21:34 2025] ? selinux_file_permission+0x3a9/0x590
[Fri Jun 27 15:21:34 2025] read_dummy_rwsem_read+0x4a/0x90
[Fri Jun 27 15:21:34 2025] full_proxy_read+0xff/0x1c0
[Fri Jun 27 15:21:34 2025] ? rw_verify_area+0x6d/0x410
[Fri Jun 27 15:21:34 2025] vfs_read+0x177/0xa50
[Fri Jun 27 15:21:34 2025] ? __pfx_vfs_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? fdget_pos+0x1cf/0x4c0
[Fri Jun 27 15:21:34 2025] ksys_read+0xfc/0x1d0
[Fri Jun 27 15:21:34 2025] ? __pfx_ksys_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] do_syscall_64+0x66/0x2d0
[Fri Jun 27 15:21:34 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[Fri Jun 27 15:21:34 2025] RIP: 0033:0x7f3f8faefb40
[Fri Jun 27 15:21:34 2025] RSP: 002b:00007ffdeda5ab98 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[Fri Jun 27 15:21:34 2025] RAX: ffffffffffffffda RBX: 0000000000010000 RCX: 00007f3f8faefb40
[Fri Jun 27 15:21:34 2025] RDX: 0000000000010000 RSI: 00000000010fa000 RDI: 0000000000000003
[Fri Jun 27 15:21:34 2025] RBP: 00000000010fa000 R08: 0000000000000000 R09: 0000000000010fff
[Fri Jun 27 15:21:34 2025] R10: 00007ffdeda59fe0 R11: 0000000000000246 R12: 00000000010fa000
[Fri Jun 27 15:21:34 2025] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff
[Fri Jun 27 15:21:34 2025] </TASK>
[Fri Jun 27 15:21:34 2025] INFO: task cat:28631 <reader> blocked on an rw-semaphore likely owned by task cat:28630 <writer>
[Fri Jun 27 15:21:34 2025] task:cat state:S stack:0 pid:28630 tgid:28630 ppid:28501 task_flags:0x400000 flags:0x00004000
[Fri Jun 27 15:21:34 2025] Call Trace:
[Fri Jun 27 15:21:34 2025] <TASK>
[Fri Jun 27 15:21:34 2025] __schedule+0x7c7/0x1930
[Fri Jun 27 15:21:34 2025] ? __pfx___schedule+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __mod_timer+0x304/0xa80
[Fri Jun 27 15:21:34 2025] schedule+0x6a/0x180
[Fri Jun 27 15:21:34 2025] schedule_timeout+0xfb/0x230
[Fri Jun 27 15:21:34 2025] ? __pfx_schedule_timeout+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __pfx_process_timeout+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? down_write+0xc4/0x140
[Fri Jun 27 15:21:34 2025] msleep_interruptible+0xbe/0x150
[Fri Jun 27 15:21:34 2025] read_dummy_rwsem_write+0x54/0x90
[Fri Jun 27 15:21:34 2025] full_proxy_read+0xff/0x1c0
[Fri Jun 27 15:21:34 2025] ? rw_verify_area+0x6d/0x410
[Fri Jun 27 15:21:34 2025] vfs_read+0x177/0xa50
[Fri Jun 27 15:21:34 2025] ? __pfx_vfs_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? fdget_pos+0x1cf/0x4c0
[Fri Jun 27 15:21:34 2025] ksys_read+0xfc/0x1d0
[Fri Jun 27 15:21:34 2025] ? __pfx_ksys_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] do_syscall_64+0x66/0x2d0
[Fri Jun 27 15:21:34 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[Fri Jun 27 15:21:34 2025] RIP: 0033:0x7f8f288efb40
[Fri Jun 27 15:21:34 2025] RSP: 002b:00007ffffb631038 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[Fri Jun 27 15:21:34 2025] RAX: ffffffffffffffda RBX: 0000000000010000 RCX: 00007f8f288efb40
[Fri Jun 27 15:21:34 2025] RDX: 0000000000010000 RSI: 000000002a4b5000 RDI: 0000000000000003
[Fri Jun 27 15:21:34 2025] RBP: 000000002a4b5000 R08: 0000000000000000 R09: 0000000000010fff
[Fri Jun 27 15:21:34 2025] R10: 00007ffffb630460 R11: 0000000000000246 R12: 000000002a4b5000
[Fri Jun 27 15:21:34 2025] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff
[Fri Jun 27 15:21:34 2025] </TASK>
This patch (of 3):
In preparation for extending blocker tracking to support rwsems, make the
rwsem_owner() and is_rwsem_reader_owned() helpers globally available for
determining if the blocker is a writer or one of the readers.
Additionally, a stale owner pointer in a reader-owned rwsem can lead to
false positives in blocker tracking when CONFIG_DETECT_HUNG_TASK_BLOCKER
is enabled. To mitigate this, clear the owner field on the reader unlock
path, similar to what CONFIG_DEBUG_RWSEMS does. A NULL owner is better
than a stale one for diagnostics.
Link: https://lkml.kernel.org/r/20250627072924.36567-1-lance.yang@linux.dev
Link: https://lkml.kernel.org/r/20250627072924.36567-2-lance.yang@linux.dev
Link: https://lore.kernel.org/all/174046694331.2194069.15472952050240807469.stgit@mhiramat.tok.corp.google.com/ [1]
Signed-off-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: John Stultz <jstultz@google.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Mingzhe Yang <mingzhe.yang@ly.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tomasz Figa <tfiga@chromium.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yongliang Gao <leonylgao@tencent.com>
Cc: Zi Li <zi.li@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
lockdep_unregister_key() is called from critical code paths, including
sections where rtnl_lock() is held. For example, when replacing a qdisc
in a network device, network egress traffic is disabled while
__qdisc_destroy() is called for every network queue.
If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(),
which gets blocked waiting for synchronize_rcu() to complete.
For example, a simple tc command to replace a qdisc could take 13
seconds:
# time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
real 0m13.195s
user 0m0.001s
sys 0m2.746s
During this time, network egress is completely frozen while waiting for
RCU synchronization.
Use synchronize_rcu_expedited() instead to minimize the impact on
critical operations like network connectivity changes.
This improves 10x the function call to tc, when replacing the qdisc for
a network card.
# time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
real 0m1.789s
user 0m0.000s
sys 0m1.613s
[boqun: Fixed the comment and add more information for the temporary
workaround, and add TODO information for hazptr]
Reported-by: Erik Lundgren <elundgren@meta.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Link: https://lore.kernel.org/r/20250321-lockdep-v1-1-78b732d195fb@debian.org
|
|
hung_task_{set,clear}_blocker() is already guarded by
CONFIG_DETECT_HUNG_TASK_BLOCKER in hung_task.h, So remove
the redudant check of #ifdef.
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Link: https://lore.kernel.org/r/20250704015218.359754-1-ranxiaokai627@163.com
|
|
gcc warns about 'static const' variables even in headers when building
with -Wunused-const-variables enabled:
In file included from kernel/locking/lockdep_proc.c:25:
kernel/locking/lockdep_internals.h:69:28: error: 'LOCKF_USED_IN_IRQ_READ' defined but not used [-Werror=unused-const-variable=]
69 | static const unsigned long LOCKF_USED_IN_IRQ_READ =
| ^~~~~~~~~~~~~~~~~~~~~~
kernel/locking/lockdep_internals.h:63:28: error: 'LOCKF_ENABLED_IRQ_READ' defined but not used [-Werror=unused-const-variable=]
63 | static const unsigned long LOCKF_ENABLED_IRQ_READ =
| ^~~~~~~~~~~~~~~~~~~~~~
kernel/locking/lockdep_internals.h:57:28: error: 'LOCKF_USED_IN_IRQ' defined but not used [-Werror=unused-const-variable=]
57 | static const unsigned long LOCKF_USED_IN_IRQ =
| ^~~~~~~~~~~~~~~~~
kernel/locking/lockdep_internals.h:51:28: error: 'LOCKF_ENABLED_IRQ' defined but not used [-Werror=unused-const-variable=]
51 | static const unsigned long LOCKF_ENABLED_IRQ =
| ^~~~~~~~~~~~~~~~~
This one is easy to avoid by changing the generated constant definition
into an equivalent enum.
Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Link: https://lore.kernel.org/r/20250409122314.2848028-6-arnd@kernel.org
|
|
Returning a large structure from the lock_stats() function causes clang
to have multiple copies of it on the stack and copy between them, which
can end up exceeding the frame size warning limit:
kernel/locking/lockdep.c:300:25: error: stack frame size (1464) exceeds limit (1280) in 'lock_stats' [-Werror,-Wframe-larger-than]
300 | struct lock_class_stats lock_stats(struct lock_class *class)
Change the calling conventions to directly operate on the caller's copy,
which apparently is what gcc does already.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Link: https://lore.kernel.org/r/20250610092941.2642847-1-arnd@kernel.org
|
|
Start to flesh out the real find_proxy_task() implementation,
but avoid the migration cases for now, in those cases just
deactivate the donor task and pick again.
To ensure the donor task or other blocked tasks in the chain
aren't migrated away while we're running the proxy, also tweak
the fair class logic to avoid migrating donor or mutex blocked
tasks.
[jstultz: This change was split out from the larger proxy patch]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20250712033407.2383110-9-jstultz@google.com
|
|
This lets us assert mutex::wait_lock is held whenever we access
p->blocked_on, as well as warn us for unexpected state changes.
[fix conflicts, call in more places]
[jstultz: tweaked commit subject, reworked a good bit]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20250712033407.2383110-4-jstultz@google.com
|