linux.git/kernel/futex/core.c, branch v7.2-rc1

futex: Optimize futex hash bucket access patterns

2026-06-11T11:41:24+00:00

Breno reported significant c2c HITM in a futex hash heavy workload.

It turns out that the hash bucket to private hash table reverse pointer
(futex_hash_bucket::priv) was to blame. Notably when the hash buckets are
heavily contended, the: 'fph = bh->priv;' load in futex_hash() will typically
miss and consequently become quite expensive.

Since this load in particular is quite superfluous, removing it is fairly
straight forward. However, removing it does not in fact achieve anything much.
The pain moves to the next user, notably: futex_hash_put().

Therefore rework the whole private hash refcounting to avoid needing this back
pointer (and removing it). Instead of passing around 'struct futex_hash_bucket
*hb', pass around a new structure that contains it and the related 'struct
futex_private_hash *fph' pointer in tandem.

Funnily this turns out to remove more code than it adds and significantly
improves futex hash performance (as measured by 'perf bench futex hash'):

SKL dual socket 112 threads:

		  Baseline        Patched
  shared (16k)    1571857         1641435         + 4.4%
  autosize (512)   646390          903371         +39.7%
  -b 256           464395          587014         +26.4%
  -b 512           715687          995943         +39.2%
  -b 1024          995085         1396328         +40.3%
  -b 2048         1293114         1668395         +29.0%
  -b 4096         2124438         2240228         + 5.5%

Zen3 dual socket 256 threads:

                  Baseline        Patched
  shared  (16k)   1275840         1381279         + 8.2%
  autosize (512)  1252745         1482179         +18.3%
  -b 256           856274          955455         +11.5%
  -b 512          1267490         1544010         +21.8%
  -b 1024         1424013         1625424         +14.1%
  -b 2048         1505181         1669342         +10.9%
  -b 4096         1465993         1688932         +15.2%

AMD EPYC 9D64 (Zen4, single socket) 176 threads:

                       Baseline       Patched      Delta
  shared (16k)         1,230,599      1,368,655    +11.2%
  autosize (1024)      1,285,440      1,556,946    +21.1%
  -b 256               1,341,471      1,520,303    +13.3%
  -b 512               1,438,330      1,599,319    +11.2%
  -b 1024              1,443,772      1,622,493    +12.4%
  -b 2048              1,472,108      1,643,975    +11.7%
  -b 4096              1,333,098      1,570,897    +17.8%

Reported-by: Breno Leitao 
Signed-off-by: Peter Zijlstra (Intel) 
Tested-by: Breno Leitao 
Tested-by: Thomas Gleixner 
Link: https://patch.msgid.link/20260610135510.GB1430057@noisy.programming.kicks-ass.net

futex: Provide infrastructure to plug the non contended robust futex unlock race

2026-06-03T09:38:52+00:00

When the FUTEX_ROBUST_UNLOCK mechanism is used for unlocking (PI-)futexes,
then the unlock sequence in user space looks like this:

  1)	robust_list_set_op_pending(mutex);
  2)	robust_list_remove(mutex);

  	lval = gettid();
  3)	if (atomic_try_cmpxchg(&mutex->lock, lval, 0))
  4)		robust_list_clear_op_pending();
  	else
  5)		sys_futex(OP | FUTEX_ROBUST_UNLOCK, ....);

That still leaves a minimal race window between #3 and #4 where the mutex
could be acquired by some other task, which observes that it is the last
user and:

  1) unmaps the mutex memory
  2) maps a different file, which ends up covering the same address

When then the original task exits before reaching #5 then the kernel robust
list handling observes the pending op entry and tries to fix up user space.

In case that the newly mapped data contains the TID of the exiting thread
at the address of the mutex/futex the kernel will set the owner died bit in
that memory and therefore corrupt unrelated data.

On X86 this boils down to this simplified assembly sequence:

		mov		%esi,%eax	// Load TID into EAX
        	xor		%ecx,%ecx	// Set ECX to 0
   #3		lock cmpxchg	%ecx,(%rdi)	// Try the TID -> 0 transition
	.Lstart:
		jnz    		.Lend
   #4 		movq		%rcx,(%rdx)	// Clear list_op_pending
	.Lend:

If the cmpxchg() succeeds and the task is interrupted before it can clear
list_op_pending in the robust list head (#4) and the task crashes in a
signal handler or gets killed then it ends up in do_exit() and subsequently
in the robust list handling, which then might run into the unmap/map issue
described above.

This is only relevant when user space was interrupted and a signal is
pending. The fix-up has to be done before signal delivery is attempted
because:

   1) The signal might be fatal so get_signal() ends up in do_exit()

   2) The signal handler might crash or the task is killed before returning
      from the handler. At that point the instruction pointer in pt_regs is
      not longer the instruction pointer of the initially interrupted unlock
      sequence.

The right place to handle this is in __exit_to_user_mode_loop() before
invoking arch_do_signal_or_restart() as this covers obviously both
scenarios.

As this is only relevant when the task was interrupted in user space, this
is tied to RSEQ and the generic entry code as RSEQ keeps track of user
space interrupts unconditionally even if the task does not have a RSEQ
region installed. That makes the decision very lightweight:

       if (current->rseq.user_irq && within(regs, csr->unlock_ip_range))
       		futex_fixup_robust_unlock(regs, csr);

futex_fixup_robust_unlock() then invokes a architecture specific function
to return the pending op pointer or NULL. The function evaluates the
register content to decide whether the pending ops pointer in the robust
list head needs to be cleared.

Assuming the above unlock sequence, then on x86 this decision is the
trivial evaluation of the zero flag:

	return regs->eflags & X86_EFLAGS_ZF ? regs->dx : NULL;

Other architectures might need to do more complex evaluations due to LLSC,
but the approach is valid in general. The size of the pointer is determined
from the matching range struct, which covers both 32-bit and 64-bit builds
including COMPAT.

The unlock sequence is going to be placed in the VDSO so that the kernel
can keep everything synchronized, especially the register usage. The
resulting code sequence for user space is:

   if (__vdso_futex_robust_list$SZ_try_unlock(lock, tid, &pending_op) != tid)
 	err = sys_futex($OP | FUTEX_ROBUST_UNLOCK,....);

Both the VDSO unlock and the kernel side unlock ensure that the pending_op
pointer is always cleared when the lock becomes unlocked.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: André Almeida 
Link: https://patch.msgid.link/20260602090535.773669210@kernel.org

futex: Add robust futex unlock IP range

2026-06-03T09:38:51+00:00

There will be a VDSO function to unlock robust futexes in user space. The
unlock sequence is racy vs. clearing the list_pending_op pointer in the
tasks robust list head. To plug this race the kernel needs to know the
instruction window. As the VDSO is per MM the addresses are stored in
mm_struct::futex.

Architectures which implement support for this have to update these
addresses when the VDSO is (re)mapped and indicate the pending op pointer
size which is matching the IP.

Arguably this could be resolved by chasing mm->context->vdso->image, but
that's architecture specific and requires to touch quite some cache
lines. Having it in mm::futex reduces the cache line impact and avoids
having yet another set of architecture specific functionality.

To support multi size robust list applications (gaming) this provides two
ranges when COMPAT is enabled.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: André Almeida 
Link: https://patch.msgid.link/20260602090535.718926819@kernel.org

futex: Add support for unlocking robust futexes

2026-06-03T09:38:51+00:00

Unlocking robust non-PI futexes happens in user space with the following
sequence:

  1)	robust_list_set_op_pending(mutex);
  2)	robust_list_remove(mutex);

  	lval = 0;
  3)	lval = atomic_xchg(lock, lval);
  4)	if (lval & WAITERS)
  5)		sys_futex(WAKE,....);
  6)	robust_list_clear_op_pending();

That opens a window between #3 and #6 where the mutex could be acquired by
some other task which observes that it is the last user and:

  A) unmaps the mutex memory
  B) maps a different file, which ends up covering the same address

When the original task exits before reaching #6 then the kernel robust list
handling observes the pending op entry and tries to fix up user space.

In case that the newly mapped data contains the TID of the exiting thread
at the address of the mutex/futex the kernel will set the owner died bit in
that memory and therefore corrupting unrelated data.

PI futexes have a similar problem both for the non-contented user space
unlock and the in kernel unlock:

  1)	robust_list_set_op_pending(mutex);
  2)	robust_list_remove(mutex);

  	lval = gettid();
  3)	if (!atomic_try_cmpxchg(lock, lval, 0))
  4)		sys_futex(UNLOCK_PI,....);
  5)	robust_list_clear_op_pending();

Address the first part of the problem where the futexes have waiters and
need to enter the kernel anyway. Add a new FUTEX_ROBUST_UNLOCK flag, which
is valid for the sys_futex() FUTEX_UNLOCK_PI, FUTEX_WAKE, FUTEX_WAKE_BITSET
operations.

This deliberately omits FUTEX_WAKE_OP from this treatment as it's unclear
whether this is needed and there is no usage of it in glibc either to
investigate.

For the futex2 syscall family this needs to be implemented with a new
syscall.

The sys_futex() case [ab]uses the @uaddr2 argument to hand the pointer to
robust_list_head::list_pending_op into the kernel. This argument is only
evaluated when the FUTEX_ROBUST_UNLOCK bit is set and is therefore backward
compatible.

This is an explicit argument to avoid the lookup of the robust list pointer
and retrieving the pending op pointer from there. User space has the
pointer already available so it can just put it into the @uaddr2
argument. Aside of that this allows the usage of multiple robust lists in
the future without any changes to the internal functions as they just operate
on the provided pointer.

This requires a second flag FUTEX_ROBUST_LIST32 which indicates that the
robust list pointer points to an u32 and not to an u64. This is required
for two reasons:

    1) sys_futex() has no compat variant

    2) The gaming emulators use both both 64-bit and compat 32-bit robust
       lists in the same 64-bit application

As a consequence 32-bit applications have to set this flag unconditionally
so they can run on a 64-bit kernel in compat mode unmodified. 32-bit
kernels return an error code when the flag is not set. 64-bit kernels will
happily clear the full 64 bits if user space fails to set it.

In case of FUTEX_UNLOCK_PI this clears the robust list pending op when the
unlock succeeded. In case of errors, the user space value is still locked
by the caller and therefore the above cannot happen.

In case of FUTEX_WAKE* this does the unlock of the futex in the kernel and
clears the robust list pending op when the unlock was successful. If not,
the user space value is still locked and user space has to deal with the
returned error. That means that the unlocking of non-PI robust futexes has
to use the same try_cmpxchg() unlock scheme as PI futexes.

If the clearing of the pending list op fails (fault) then the kernel clears
the registered robust list pointer if it matches to prevent that exit()
will try to handle invalid data. That's a valid paranoid decision because
the robust list head sits usually in the TLS and if the TLS is not longer
accessible then the chance for fixing up the resulting mess is very close
to zero.

The problem of non-contended unlocks still exists and will be addressed
separately.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: André Almeida 
Link: https://patch.msgid.link/20260602090535.670514505@kernel.org

futex: Provide UABI defines for robust list entry modifiers

2026-06-03T09:38:50+00:00

The marker for PI futexes in the robust list is a hardcoded 0x1 which lacks
any sensible form of documentation.

Provide proper defines for the bit and the mask and fix up the usage
sites. Thereby convert the boolean pi argument into a modifier argument,
which allows new modifier bits to be trivially added and conveyed.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Mathieu Desnoyers 
Reviewed-by: André Almeida 
Link: https://patch.msgid.link/20260602090535.458758556@kernel.org

futex: Move futex related mm_struct data into a struct

2026-06-03T09:38:49+00:00

Having all these members in mm_struct along with the required #ifdeffery is
annoying, does not allow efficient initializing of the data with
memset() and makes extending it tedious.

Move it into a data structure and fix up all usage sites.

The extra struct for the private hash is intentional to make integration of
other conditional mechanisms easier in terms of initialization and separation.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://patch.msgid.link/20260602090535.407756793@kernel.org

futex: Make futex_mm_init() void

2026-06-03T09:38:49+00:00

Nothing fails there. Mop up the leftovers of the early version of this,
which did an allocation.

While at it clean up the stubs and the #ifdef comments to make the header
file readable.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://patch.msgid.link/20260602090535.356789395@kernel.org

futex: Move futex task related data into a struct

2026-06-03T09:38:49+00:00

Having all these members in task_struct along with the required #ifdeffery
is annoying, does not allow efficient initializing of the data with
memset() and makes extending it tedious.

Move it into a data structure and fix up all usage sites.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Mathieu Desnoyers 
Reviewed-by: André Almeida 
Link: https://patch.msgid.link/20260602090535.308220888@kernel.org

Merge tag 'locking-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2026-04-14T19:36:25+00:00

Pull locking updates from Ingo Molnar:
 "Mutexes:

   - Add killable flavor to guard definitions (Davidlohr Bueso)

   - Remove the list_head from struct mutex (Matthew Wilcox)

   - Rename mutex_init_lockep() (Davidlohr Bueso)

  rwsems:

   - Remove the list_head from struct rw_semaphore and
     replace it with a single pointer (Matthew Wilcox)

   - Fix logic error in rwsem_del_waiter() (Andrei Vagin)

  Semaphores:

   - Remove the list_head from struct semaphore (Matthew Wilcox)

  Jump labels:

   - Use ATOMIC_INIT() for initialization of .enabled (Thomas Weißschuh)

   - Remove workaround for old compilers in initializations
     (Thomas Weißschuh)

  Lock context analysis changes and improvements:

   - Add context analysis for rwsems (Peter Zijlstra)

   - Fix rwlock and spinlock lock context annotations (Bart Van Assche)

   - Fix rwlock support in  (Bart Van Assche)

   - Add lock context annotations in the spinlock implementation
     (Bart Van Assche)

   - signal: Fix the lock_task_sighand() annotation (Bart Van Assche)

   - ww-mutex: Fix the ww_acquire_ctx function annotations
     (Bart Van Assche)

   - Add lock context support in do_raw_{read,write}_trylock()
     (Bart Van Assche)

   - arm64, compiler-context-analysis: Permit alias analysis through
     __READ_ONCE() with CONFIG_LTO=y (Marco Elver)

   - Add __cond_releases() (Peter Zijlstra)

   - Add context analysis for mutexes (Peter Zijlstra)

   - Add context analysis for rtmutexes (Peter Zijlstra)

   - Convert futexes to compiler context analysis (Peter Zijlstra)

  Rust integration updates:

   - Add atomic fetch_sub() implementation (Andreas Hindborg)

   - Refactor various rust_helper_ methods for expansion (Boqun Feng)

   - Add Atomic<*{mut,const} T> support (Boqun Feng)

   - Add atomic operation helpers over raw pointers (Boqun Feng)

   - Add performance-optimal Flag type for atomic booleans, to avoid
     slow byte-sized RMWs on architectures that don't support them.
     (FUJITA Tomonori)

   - Misc cleanups and fixes (Andreas Hindborg, Boqun Feng, FUJITA
     Tomonori)

  LTO support updates:

   - arm64: Optimize __READ_ONCE() with CONFIG_LTO=y (Marco Elver)

   - compiler: Simplify generic RELOC_HIDE() (Marco Elver)

  Miscellaneous fixes and cleanups by Peter Zijlstra, Randy Dunlap,
  Thomas Weißschuh, Davidlohr Bueso and Mikhail Gavrilov"

* tag 'locking-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
  compiler: Simplify generic RELOC_HIDE()
  locking: Add lock context annotations in the spinlock implementation
  locking: Add lock context support in do_raw_{read,write}_trylock()
  locking: Fix rwlock support in 
  lockdep: Raise default stack trace limits when KASAN is enabled
  cleanup: Optimize guards
  jump_label: remove workaround for old compilers in initializations
  jump_label: use ATOMIC_INIT() for initialization of .enabled
  futex: Convert to compiler context analysis
  locking/rwsem: Fix logic error in rwsem_del_waiter()
  locking/rwsem: Add context analysis
  locking/rtmutex: Add context analysis
  locking/mutex: Add context analysis
  compiler-context-analysys: Add __cond_releases()
  locking/mutex: Remove the list_head from struct mutex
  locking/semaphore: Remove the list_head from struct semaphore
  locking/rwsem: Remove the list_head from struct rw_semaphore
  rust: atomic: Update a safety comment in impl of `fetch_add()`
  rust: sync: atomic: Update documentation for `fetch_add()`
  rust: sync: atomic: Add fetch_sub()
  ...

futex: Fix UaF between futex_key_to_node_opt() and vma_replace_policy()

2026-03-26T15:13:48+00:00

During futex_key_to_node_opt() execution, vma->vm_policy is read under
speculative mmap lock and RCU. Concurrently, mbind() may call
vma_replace_policy() which frees the old mempolicy immediately via
kmem_cache_free().

This creates a race where __futex_key_to_node() dereferences a freed
mempolicy pointer, causing a use-after-free read of mpol->mode.

[  151.412631] BUG: KASAN: slab-use-after-free in __futex_key_to_node (kernel/futex/core.c:349)
[  151.414046] Read of size 2 at addr ffff888001c49634 by task e/87

[  151.415969] Call Trace:

[  151.416732]  __asan_load2 (mm/kasan/generic.c:271)
[  151.416777]  __futex_key_to_node (kernel/futex/core.c:349)
[  151.416822]  get_futex_key (kernel/futex/core.c:374 kernel/futex/core.c:386 kernel/futex/core.c:593)

Fix by adding rcu to __mpol_put().

Fixes: c042c505210d ("futex: Implement FUTEX2_MPOL")
Reported-by: Hao-Yu Yang 
Suggested-by: Eric Dumazet 
Signed-off-by: Hao-Yu Yang 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Eric Dumazet 
Acked-by: David Hildenbrand (Arm) 
Link: https://patch.msgid.link/20260324174418.GB1850007@noisy.programming.kicks-ass.net