linux.git/kernel/futex, branch master

Merge tag 'locking-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip

2026-06-15T08:51:14+00:00

Pull locking updates from Ingo Molnar:
 "Futex updates:

   - Optimize futex hash bucket access patterns (Peter Zijlstra)

   - Large series to address the robust futex unlock race for real, by
     Thomas Gleixner:

      "The robust futex unlock mechanism is racy in respect to the
       clearing of the robust_list_head::list_op_pending pointer because
       unlock and clearing the pointer are not atomic.

       The race window is between the unlock and clearing the pending op
       pointer. If the task is forced to exit in this window, exit will
       access a potentially invalid pending op pointer when cleaning up
       the robust list.

       That happens if another task manages to unmap the object
       containing the lock before the cleanup, which results in an UAF.

       In the worst case this UAF can lead to memory corruption when
       unrelated content has been mapped to the same address by the time
       the access happens.

       User space can't solve this problem without help from the kernel.
       This series provides the kernel side infrastructure to help it
       along:

        1) Combined unlock, pointer clearing, wake-up for the
           contended case

        2) VDSO based unlock and pointer clearing helpers with a
           fix-up function in the kernel when user space was interrupted
           within the critical section.

      ... with help by André Almeida:

        - Add a note about robust list race condition (André Almeida)
        - Add self-tests for robust release operations (André Almeida)

  Context analysis updates:

   - Implement context analysis for 'struct rt_mutex'. (Bart Van Assche)
   - Bump required Clang version to 23 (Marco Elver)

  Guard infrastructure updates:

   - Series to remove NULL check from unconditional guards (Dmitry
     Ilvokhin)

  Lockdep updates:

   - Restore self-test migrate_disable() and sched_rt_mutex state on
     PREEMPT_RT (Karl Mehltretter)

  Membarriers updates:

   - Use per-CPU mutexes for targeted commands (Aniket Gattani)
   - Modernize membarrier_global_expedited with cleanup guards (Aniket
     Gattani)
   - Add rseq stress test for CFS throttle interactions (Aniket Gattani)

  percpu-rwsems updates:

   - Extract __percpu_up_read() to optimize inlining overhead (Dmitry
     Ilvokhin)

  Seqlocks updates:

   - Allow UBSAN_ALIGNMENT to fail optimizing (Heiko Carstens)

  Lock tracing:

   - Add contended_release tracepoint to sleepable locks such as
     mutexes, percpu-rwsems, rtmutexes, rwsems and semaphores (Dmitry
     Ilvokhin)

  MAINTAINERS updates:

   - MAINTAINERS: Add RUST [SYNC] entry (Boqun Feng)

  Misc updates and fixes by Randy Dunlap, YE WEI-HONG, Fabricio Parra,
  Dmitry Ilvokhin and Peter Zijlstra"

* tag 'locking-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (36 commits)
  locking: Add contended_release tracepoint to sleepable locks
  locking/percpu-rwsem: Extract __percpu_up_read()
  tracing/lock: Remove unnecessary linux/sched.h include
  futex: Optimize futex hash bucket access patterns
  rust: sync: completion: Mark inline complete_all and wait_for_completion
  MAINTAINERS: Add RUST [SYNC] entry
  cleanup: Specify nonnull argument index
  selftests: futex: Add tests for robust release operations
  Documentation: futex: Add a note about robust list race condition
  x86/vdso: Implement __vdso_futex_robust_try_unlock()
  x86/vdso: Prepare for robust futex unlock support
  futex: Provide infrastructure to plug the non contended robust futex unlock race
  futex: Add robust futex unlock IP range
  futex: Add support for unlocking robust futexes
  futex: Cleanup UAPI defines
  x86: Select ARCH_MEMORY_ORDER_TSO
  uaccess: Provide unsafe_atomic_store_release_user()
  futex: Provide UABI defines for robust list entry modifiers
  futex: Move futex related mm_struct data into a struct
  futex: Make futex_mm_init() void
  ...

futex: Optimize futex hash bucket access patterns

2026-06-11T11:41:24+00:00

Breno reported significant c2c HITM in a futex hash heavy workload.

It turns out that the hash bucket to private hash table reverse pointer
(futex_hash_bucket::priv) was to blame. Notably when the hash buckets are
heavily contended, the: 'fph = bh->priv;' load in futex_hash() will typically
miss and consequently become quite expensive.

Since this load in particular is quite superfluous, removing it is fairly
straight forward. However, removing it does not in fact achieve anything much.
The pain moves to the next user, notably: futex_hash_put().

Therefore rework the whole private hash refcounting to avoid needing this back
pointer (and removing it). Instead of passing around 'struct futex_hash_bucket
*hb', pass around a new structure that contains it and the related 'struct
futex_private_hash *fph' pointer in tandem.

Funnily this turns out to remove more code than it adds and significantly
improves futex hash performance (as measured by 'perf bench futex hash'):

SKL dual socket 112 threads:

		  Baseline        Patched
  shared (16k)    1571857         1641435         + 4.4%
  autosize (512)   646390          903371         +39.7%
  -b 256           464395          587014         +26.4%
  -b 512           715687          995943         +39.2%
  -b 1024          995085         1396328         +40.3%
  -b 2048         1293114         1668395         +29.0%
  -b 4096         2124438         2240228         + 5.5%

Zen3 dual socket 256 threads:

                  Baseline        Patched
  shared  (16k)   1275840         1381279         + 8.2%
  autosize (512)  1252745         1482179         +18.3%
  -b 256           856274          955455         +11.5%
  -b 512          1267490         1544010         +21.8%
  -b 1024         1424013         1625424         +14.1%
  -b 2048         1505181         1669342         +10.9%
  -b 4096         1465993         1688932         +15.2%

AMD EPYC 9D64 (Zen4, single socket) 176 threads:

                       Baseline       Patched      Delta
  shared (16k)         1,230,599      1,368,655    +11.2%
  autosize (1024)      1,285,440      1,556,946    +21.1%
  -b 256               1,341,471      1,520,303    +13.3%
  -b 512               1,438,330      1,599,319    +11.2%
  -b 1024              1,443,772      1,622,493    +12.4%
  -b 2048              1,472,108      1,643,975    +11.7%
  -b 4096              1,333,098      1,570,897    +17.8%

Reported-by: Breno Leitao 
Signed-off-by: Peter Zijlstra (Intel) 
Tested-by: Breno Leitao 
Tested-by: Thomas Gleixner 
Link: https://patch.msgid.link/20260610135510.GB1430057@noisy.programming.kicks-ass.net

futex: Provide infrastructure to plug the non contended robust futex unlock race

2026-06-03T09:38:52+00:00

When the FUTEX_ROBUST_UNLOCK mechanism is used for unlocking (PI-)futexes,
then the unlock sequence in user space looks like this:

  1)	robust_list_set_op_pending(mutex);
  2)	robust_list_remove(mutex);

  	lval = gettid();
  3)	if (atomic_try_cmpxchg(&mutex->lock, lval, 0))
  4)		robust_list_clear_op_pending();
  	else
  5)		sys_futex(OP | FUTEX_ROBUST_UNLOCK, ....);

That still leaves a minimal race window between #3 and #4 where the mutex
could be acquired by some other task, which observes that it is the last
user and:

  1) unmaps the mutex memory
  2) maps a different file, which ends up covering the same address

When then the original task exits before reaching #5 then the kernel robust
list handling observes the pending op entry and tries to fix up user space.

In case that the newly mapped data contains the TID of the exiting thread
at the address of the mutex/futex the kernel will set the owner died bit in
that memory and therefore corrupt unrelated data.

On X86 this boils down to this simplified assembly sequence:

		mov		%esi,%eax	// Load TID into EAX
        	xor		%ecx,%ecx	// Set ECX to 0
   #3		lock cmpxchg	%ecx,(%rdi)	// Try the TID -> 0 transition
	.Lstart:
		jnz    		.Lend
   #4 		movq		%rcx,(%rdx)	// Clear list_op_pending
	.Lend:

If the cmpxchg() succeeds and the task is interrupted before it can clear
list_op_pending in the robust list head (#4) and the task crashes in a
signal handler or gets killed then it ends up in do_exit() and subsequently
in the robust list handling, which then might run into the unmap/map issue
described above.

This is only relevant when user space was interrupted and a signal is
pending. The fix-up has to be done before signal delivery is attempted
because:

   1) The signal might be fatal so get_signal() ends up in do_exit()

   2) The signal handler might crash or the task is killed before returning
      from the handler. At that point the instruction pointer in pt_regs is
      not longer the instruction pointer of the initially interrupted unlock
      sequence.

The right place to handle this is in __exit_to_user_mode_loop() before
invoking arch_do_signal_or_restart() as this covers obviously both
scenarios.

As this is only relevant when the task was interrupted in user space, this
is tied to RSEQ and the generic entry code as RSEQ keeps track of user
space interrupts unconditionally even if the task does not have a RSEQ
region installed. That makes the decision very lightweight:

       if (current->rseq.user_irq && within(regs, csr->unlock_ip_range))
       		futex_fixup_robust_unlock(regs, csr);

futex_fixup_robust_unlock() then invokes a architecture specific function
to return the pending op pointer or NULL. The function evaluates the
register content to decide whether the pending ops pointer in the robust
list head needs to be cleared.

Assuming the above unlock sequence, then on x86 this decision is the
trivial evaluation of the zero flag:

	return regs->eflags & X86_EFLAGS_ZF ? regs->dx : NULL;

Other architectures might need to do more complex evaluations due to LLSC,
but the approach is valid in general. The size of the pointer is determined
from the matching range struct, which covers both 32-bit and 64-bit builds
including COMPAT.

The unlock sequence is going to be placed in the VDSO so that the kernel
can keep everything synchronized, especially the register usage. The
resulting code sequence for user space is:

   if (__vdso_futex_robust_list$SZ_try_unlock(lock, tid, &pending_op) != tid)
 	err = sys_futex($OP | FUTEX_ROBUST_UNLOCK,....);

Both the VDSO unlock and the kernel side unlock ensure that the pending_op
pointer is always cleared when the lock becomes unlocked.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: André Almeida 
Link: https://patch.msgid.link/20260602090535.773669210@kernel.org

futex: Add robust futex unlock IP range

2026-06-03T09:38:51+00:00

There will be a VDSO function to unlock robust futexes in user space. The
unlock sequence is racy vs. clearing the list_pending_op pointer in the
tasks robust list head. To plug this race the kernel needs to know the
instruction window. As the VDSO is per MM the addresses are stored in
mm_struct::futex.

Architectures which implement support for this have to update these
addresses when the VDSO is (re)mapped and indicate the pending op pointer
size which is matching the IP.

Arguably this could be resolved by chasing mm->context->vdso->image, but
that's architecture specific and requires to touch quite some cache
lines. Having it in mm::futex reduces the cache line impact and avoids
having yet another set of architecture specific functionality.

To support multi size robust list applications (gaming) this provides two
ranges when COMPAT is enabled.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: André Almeida 
Link: https://patch.msgid.link/20260602090535.718926819@kernel.org

futex: Add support for unlocking robust futexes

2026-06-03T09:38:51+00:00

Unlocking robust non-PI futexes happens in user space with the following
sequence:

  1)	robust_list_set_op_pending(mutex);
  2)	robust_list_remove(mutex);

  	lval = 0;
  3)	lval = atomic_xchg(lock, lval);
  4)	if (lval & WAITERS)
  5)		sys_futex(WAKE,....);
  6)	robust_list_clear_op_pending();

That opens a window between #3 and #6 where the mutex could be acquired by
some other task which observes that it is the last user and:

  A) unmaps the mutex memory
  B) maps a different file, which ends up covering the same address

When the original task exits before reaching #6 then the kernel robust list
handling observes the pending op entry and tries to fix up user space.

In case that the newly mapped data contains the TID of the exiting thread
at the address of the mutex/futex the kernel will set the owner died bit in
that memory and therefore corrupting unrelated data.

PI futexes have a similar problem both for the non-contented user space
unlock and the in kernel unlock:

  1)	robust_list_set_op_pending(mutex);
  2)	robust_list_remove(mutex);

  	lval = gettid();
  3)	if (!atomic_try_cmpxchg(lock, lval, 0))
  4)		sys_futex(UNLOCK_PI,....);
  5)	robust_list_clear_op_pending();

Address the first part of the problem where the futexes have waiters and
need to enter the kernel anyway. Add a new FUTEX_ROBUST_UNLOCK flag, which
is valid for the sys_futex() FUTEX_UNLOCK_PI, FUTEX_WAKE, FUTEX_WAKE_BITSET
operations.

This deliberately omits FUTEX_WAKE_OP from this treatment as it's unclear
whether this is needed and there is no usage of it in glibc either to
investigate.

For the futex2 syscall family this needs to be implemented with a new
syscall.

The sys_futex() case [ab]uses the @uaddr2 argument to hand the pointer to
robust_list_head::list_pending_op into the kernel. This argument is only
evaluated when the FUTEX_ROBUST_UNLOCK bit is set and is therefore backward
compatible.

This is an explicit argument to avoid the lookup of the robust list pointer
and retrieving the pending op pointer from there. User space has the
pointer already available so it can just put it into the @uaddr2
argument. Aside of that this allows the usage of multiple robust lists in
the future without any changes to the internal functions as they just operate
on the provided pointer.

This requires a second flag FUTEX_ROBUST_LIST32 which indicates that the
robust list pointer points to an u32 and not to an u64. This is required
for two reasons:

    1) sys_futex() has no compat variant

    2) The gaming emulators use both both 64-bit and compat 32-bit robust
       lists in the same 64-bit application

As a consequence 32-bit applications have to set this flag unconditionally
so they can run on a 64-bit kernel in compat mode unmodified. 32-bit
kernels return an error code when the flag is not set. 64-bit kernels will
happily clear the full 64 bits if user space fails to set it.

In case of FUTEX_UNLOCK_PI this clears the robust list pending op when the
unlock succeeded. In case of errors, the user space value is still locked
by the caller and therefore the above cannot happen.

In case of FUTEX_WAKE* this does the unlock of the futex in the kernel and
clears the robust list pending op when the unlock was successful. If not,
the user space value is still locked and user space has to deal with the
returned error. That means that the unlocking of non-PI robust futexes has
to use the same try_cmpxchg() unlock scheme as PI futexes.

If the clearing of the pending list op fails (fault) then the kernel clears
the registered robust list pointer if it matches to prevent that exit()
will try to handle invalid data. That's a valid paranoid decision because
the robust list head sits usually in the TLS and if the TLS is not longer
accessible then the chance for fixing up the resulting mess is very close
to zero.

The problem of non-contended unlocks still exists and will be addressed
separately.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: André Almeida 
Link: https://patch.msgid.link/20260602090535.670514505@kernel.org

futex: Provide UABI defines for robust list entry modifiers

2026-06-03T09:38:50+00:00

The marker for PI futexes in the robust list is a hardcoded 0x1 which lacks
any sensible form of documentation.

Provide proper defines for the bit and the mask and fix up the usage
sites. Thereby convert the boolean pi argument into a modifier argument,
which allows new modifier bits to be trivially added and conveyed.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Mathieu Desnoyers 
Reviewed-by: André Almeida 
Link: https://patch.msgid.link/20260602090535.458758556@kernel.org

futex: Move futex related mm_struct data into a struct

2026-06-03T09:38:49+00:00

Having all these members in mm_struct along with the required #ifdeffery is
annoying, does not allow efficient initializing of the data with
memset() and makes extending it tedious.

Move it into a data structure and fix up all usage sites.

The extra struct for the private hash is intentional to make integration of
other conditional mechanisms easier in terms of initialization and separation.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://patch.msgid.link/20260602090535.407756793@kernel.org

futex: Make futex_mm_init() void

2026-06-03T09:38:49+00:00

Nothing fails there. Mop up the leftovers of the early version of this,
which did an allocation.

While at it clean up the stubs and the #ifdef comments to make the header
file readable.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://patch.msgid.link/20260602090535.356789395@kernel.org

futex: Move futex task related data into a struct

2026-06-03T09:38:49+00:00

Having all these members in task_struct along with the required #ifdeffery
is annoying, does not allow efficient initializing of the data with
memset() and makes extending it tedious.

Move it into a data structure and fix up all usage sites.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Mathieu Desnoyers 
Reviewed-by: André Almeida 
Link: https://patch.msgid.link/20260602090535.308220888@kernel.org

futex/requeue: Prevent NULL pointer dereference in remove_waiter() on self-deadlock

2026-06-02T20:27:04+00:00

When FUTEX_CMP_REQUEUE_PI requeues a non-top waiter that already owns the
target PI futex, task_blocks_on_rt_mutex() returns -EDEADLK before setting
waiter->task.

The subsequent remove_waiter() in rt_mutex_start_proxy_lock() dereferences
the NULL waiter->task, causing a kernel crash.

Add a self-deadlock check for non-top waiters before calling
rt_mutex_start_proxy_lock(), analogous to the top-waiter check in
futex_lock_pi_atomic().

Fixes: 3bfdc63936dd4773109b7b8c280c0f3b5ae7d349 ("rtmutex: Use waiter::task instead of current in remove_waiter()")
Signed-off-by: Ji'an Zhou 
Signed-off-by: Thomas Gleixner 
Cc: stable@vger.kernel.org