linux.git/include/linux/futex.h, branch v7.2-rc1

futex: Provide infrastructure to plug the non contended robust futex unlock race

2026-06-03T09:38:52+00:00

When the FUTEX_ROBUST_UNLOCK mechanism is used for unlocking (PI-)futexes,
then the unlock sequence in user space looks like this:

  1)	robust_list_set_op_pending(mutex);
  2)	robust_list_remove(mutex);

  	lval = gettid();
  3)	if (atomic_try_cmpxchg(&mutex->lock, lval, 0))
  4)		robust_list_clear_op_pending();
  	else
  5)		sys_futex(OP | FUTEX_ROBUST_UNLOCK, ....);

That still leaves a minimal race window between #3 and #4 where the mutex
could be acquired by some other task, which observes that it is the last
user and:

  1) unmaps the mutex memory
  2) maps a different file, which ends up covering the same address

If the original task then exits before reaching #4, the kernel robust
list handling observes the pending op entry and tries to fix up user space.

If the newly mapped data contains the TID of the exiting thread at the
address of the mutex/futex, the kernel sets the owner died bit in that
memory and thereby corrupts unrelated data.
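
As an illustration only, the user space sequence above can be modelled with
C11 atomics. All names in this sketch (demo_mutex, demo_robust_head,
demo_try_unlock) are hypothetical stand-ins and not the kernel or libc
definitions; step 2 (list removal) is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for a robust mutex and the per-thread list head. */
struct demo_mutex {
	_Atomic uint32_t lock;		/* 0 = unlocked, otherwise owner TID */
};

struct demo_robust_head {
	struct demo_mutex *list_op_pending;
};

/*
 * Steps 1, 3 and 4 of the sequence. Returns true when the uncontended
 * TID -> 0 transition succeeded. On false the caller has to take the
 * syscall path (step 5) and list_op_pending stays set.
 */
static bool demo_try_unlock(struct demo_robust_head *head,
			    struct demo_mutex *m, uint32_t tid)
{
	uint32_t expected = tid;

	head->list_op_pending = m;				/* step 1 */
	if (atomic_compare_exchange_strong(&m->lock, &expected, 0)) {
		/*
		 * Race window: the lock word is already 0, but the
		 * pending op pointer still references the mutex.
		 */
		head->list_op_pending = NULL;			/* step 4 */
		return true;
	}
	return false;
}
```

A task that dies inside the commented window leaves a pending op pointer
behind, which is exactly the state the exit-time robust list handling acts on.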

On X86 this boils down to this simplified assembly sequence:

		mov		%esi,%eax	// Load TID into EAX
		xor		%ecx,%ecx	// Set ECX to 0
   #3		lock cmpxchg	%ecx,(%rdi)	// Try the TID -> 0 transition
	.Lstart:
		jnz		.Lend
   #4		movq		%rcx,(%rdx)	// Clear list_op_pending
	.Lend:

If the cmpxchg() succeeds and the task is interrupted before it can clear
list_op_pending in the robust list head (#4) and the task crashes in a
signal handler or gets killed then it ends up in do_exit() and subsequently
in the robust list handling, which then might run into the unmap/map issue
described above.

This is only relevant when user space was interrupted and a signal is
pending. The fix-up has to be done before signal delivery is attempted
because:

   1) The signal might be fatal so get_signal() ends up in do_exit()

   2) The signal handler might crash or the task is killed before returning
      from the handler. At that point the instruction pointer in pt_regs is
      no longer the instruction pointer of the initially interrupted unlock
      sequence.

The right place to handle this is in __exit_to_user_mode_loop() before
invoking arch_do_signal_or_restart() as this obviously covers both
scenarios.

As this is only relevant when the task was interrupted in user space, this
is tied to RSEQ and the generic entry code as RSEQ keeps track of user
space interrupts unconditionally even if the task does not have a RSEQ
region installed. That makes the decision very lightweight:

       if (current->rseq.user_irq && within(regs, csr->unlock_ip_range))
       		futex_fixup_robust_unlock(regs, csr);
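
A minimal sketch of that gate, assuming the range is stored as a half-open
[start, end) pair of instruction addresses. The struct layout and the demo_*
names are assumptions made for illustration; the text above does not show the
real definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout of the unlock IP range (half-open interval). */
struct demo_unlock_range {
	uintptr_t start_ip;	/* first instruction after the cmpxchg */
	uintptr_t end_ip;	/* first instruction past the critical window */
};

static bool demo_within(uintptr_t ip, const struct demo_unlock_range *r)
{
	return ip >= r->start_ip && ip < r->end_ip;
}

/*
 * The fix-up is only considered when user space was interrupted and the
 * interrupted instruction pointer lies inside the unlock window.
 */
static bool demo_needs_fixup(bool user_irq, uintptr_t ip,
			     const struct demo_unlock_range *r)
{
	return user_irq && demo_within(ip, r);
}
```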

futex_fixup_robust_unlock() then invokes an architecture-specific function
to return the pending op pointer or NULL. The function evaluates the
register content to decide whether the pending ops pointer in the robust
list head needs to be cleared.

Assuming the above unlock sequence, on x86 this decision is a trivial
evaluation of the zero flag:

	return regs->eflags & X86_EFLAGS_ZF ? regs->dx : NULL;
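
The same evaluation as a stand-alone sketch. demo_regs is a hypothetical
reduced register frame, not the kernel's pt_regs; the only hard fact used is
that the zero flag is bit 6 (0x40) of EFLAGS.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_EFLAGS_ZF	0x40UL	/* bit 6 of EFLAGS: zero flag */

/* Hypothetical subset of the saved user register state. */
struct demo_regs {
	uintptr_t flags;
	uintptr_t dx;	/* address of list_op_pending in the unlock sequence */
};

/*
 * ZF set after the cmpxchg means the TID -> 0 transition succeeded, so
 * the pending op pointer (address in DX) still has to be cleared.
 * ZF clear means the lock was not released and nothing must be touched.
 */
static void *demo_pending_op_ptr(const struct demo_regs *regs)
{
	return (regs->flags & DEMO_EFLAGS_ZF) ? (void *)regs->dx : NULL;
}
```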

Other architectures might need to do more complex evaluations due to LLSC,
but the approach is valid in general. The size of the pointer is determined
from the matching range struct, which covers both 32-bit and 64-bit builds
including COMPAT.

The unlock sequence is going to be placed in the VDSO so that the kernel
can keep everything synchronized, especially the register usage. The
resulting code sequence for user space is:

   if (__vdso_futex_robust_list$SZ_try_unlock(lock, tid, &pending_op) != tid)
 	err = sys_futex($OP | FUTEX_ROBUST_UNLOCK,....);

Both the VDSO unlock and the kernel side unlock ensure that the pending_op
pointer is always cleared when the lock becomes unlocked.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: André Almeida 
Link: https://patch.msgid.link/20260602090535.773669210@kernel.org

futex: Add robust futex unlock IP range

2026-06-03T09:38:51+00:00

There will be a VDSO function to unlock robust futexes in user space. The
unlock sequence is racy vs. clearing the list_op_pending pointer in the
task's robust list head. To plug this race the kernel needs to know the
instruction window. As the VDSO is per MM the addresses are stored in
mm_struct::futex.

Architectures which implement support for this have to update these
addresses when the VDSO is (re)mapped and indicate the pending op pointer
size that matches the IP range.

Arguably this could be resolved by chasing mm->context->vdso->image, but
that's architecture specific and requires touching quite a few cache
lines. Having it in mm::futex reduces the cache line impact and avoids
having yet another set of architecture specific functionality.

To support multi size robust list applications (gaming) this provides two
ranges when COMPAT is enabled.
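
One way to picture the two ranges is a small per-MM table that maps an
interrupted instruction pointer to the size of the pending op pointer. The
layout, the field names and the sizes in bytes are assumptions for
illustration, not the actual mm_struct::futex definition.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_NR_RANGES	2	/* native plus compat, as an assumption */

/* Hypothetical per-range record. */
struct demo_ip_range {
	uintptr_t start_ip;
	uintptr_t end_ip;	/* exclusive */
	unsigned int pop_size;	/* pending op pointer size in bytes */
};

/* Hypothetical per-MM container for the ranges. */
struct demo_mm_futex {
	struct demo_ip_range unlock[DEMO_NR_RANGES];
};

/* Returns the pointer size for the matching range, or 0 if none matches. */
static unsigned int demo_pending_op_size(const struct demo_mm_futex *f,
					 uintptr_t ip)
{
	for (size_t i = 0; i < DEMO_NR_RANGES; i++) {
		const struct demo_ip_range *r = &f->unlock[i];

		if (ip >= r->start_ip && ip < r->end_ip)
			return r->pop_size;
	}
	return 0;
}
```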

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: André Almeida 
Link: https://patch.msgid.link/20260602090535.718926819@kernel.org

futex: Make futex_mm_init() void

2026-06-03T09:38:49+00:00

Nothing fails there. Mop up the leftovers of the early version of this,
which did an allocation.

While at it clean up the stubs and the #ifdef comments to make the header
file readable.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://patch.msgid.link/20260602090535.356789395@kernel.org

futex: Move futex task related data into a struct

2026-06-03T09:38:49+00:00

Having all these members in task_struct along with the required #ifdeffery
is annoying, does not allow efficient initializing of the data with
memset() and makes extending it tedious.

Move it into a data structure and fix up all usage sites.
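
The benefit can be sketched as follows. The member names are hypothetical
placeholders; the point is only that grouping the futex state lets one
memset() initialize it without touching the rest of the task.

```c
#include <string.h>

/* Hypothetical grouping of the per-task futex state. */
struct demo_futex_task_data {
	void *robust_list;
	void *pi_state_cache;
	unsigned int state;
};

/* Hypothetical, heavily reduced task structure. */
struct demo_task {
	int pid;
	struct demo_futex_task_data futex;
};

/* One memset() replaces per-member initialization under #ifdef. */
static void demo_futex_init_task(struct demo_task *t)
{
	memset(&t->futex, 0, sizeof(t->futex));
}
```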

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Mathieu Desnoyers 
Reviewed-by: André Almeida 
Link: https://patch.msgid.link/20260602090535.308220888@kernel.org

futex: Use RCU-based per-CPU reference counting instead of rcuref_t

2025-07-11T14:02:00+00:00

The use of rcuref_t for reference counting introduces a performance bottleneck
when accessed concurrently by multiple threads during futex operations.

Replace rcuref_t with specially crafted per-CPU reference counters. The
lifetime logic remains the same.

The newly allocated private hash starts in FR_PERCPU state. In this state, each
futex operation that requires the private hash uses a per-CPU counter (an
unsigned int) for incrementing or decrementing the reference count.

When the private hash is about to be replaced, the per-CPU counters are
migrated to an atomic_t counter, mm_struct::futex_atomic.
The migration process:
- Wait for one RCU grace period to ensure all users observe the
  current private hash. This can be skipped if a grace period elapsed
  since the private hash was assigned.

- futex_private_hash::state is set to FR_ATOMIC, forcing all users to
  use mm_struct::futex_atomic for reference counting.

- After an RCU grace period, all users are guaranteed to be using the
  atomic counter. The per-CPU counters can now be summed up and added to
  the atomic_t counter. If the resulting count is zero, the hash can be
  safely replaced. Otherwise, active users still hold a valid reference.

- Once the atomic reference count drops to zero, the next futex
  operation will switch to the new private hash.
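
The migration steps above can be modelled in a single-threaded sketch. This
is a simplification under stated assumptions: RCU grace periods are reduced
to a comment, the atomic_t is a plain long, and all demo_* names are
hypothetical. It only illustrates why per-CPU counters may individually wrap
while their sum still yields the correct number of references.

```c
#include <stdbool.h>

#define DEMO_NR_CPUS	4

enum demo_state { DEMO_FR_PERCPU, DEMO_FR_ATOMIC };

struct demo_phash {
	enum demo_state state;
	unsigned int percpu[DEMO_NR_CPUS];	/* may wrap, only the sum counts */
	long atomic;				/* stands in for the atomic_t */
};

static void demo_get(struct demo_phash *p, int cpu)
{
	if (p->state == DEMO_FR_PERCPU)
		p->percpu[cpu]++;
	else
		p->atomic++;
}

/* Returns true when the last reference was dropped in atomic mode. */
static bool demo_put(struct demo_phash *p, int cpu)
{
	if (p->state == DEMO_FR_PERCPU) {
		p->percpu[cpu]--;
		return false;
	}
	return --p->atomic == 0;
}

/* Returns true if no references remain and the hash can be replaced. */
static bool demo_switch_to_atomic(struct demo_phash *p)
{
	unsigned int sum = 0;

	p->state = DEMO_FR_ATOMIC;
	/* Real code waits for an RCU grace period before summing. */
	for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
		sum += p->percpu[cpu];
		p->percpu[cpu] = 0;
	}
	p->atomic += (long)sum;
	return p->atomic == 0;
}
```

A get on one CPU followed by a put on another leaves one counter at 1 and the
other wrapped to UINT_MAX; the unsigned sum is still 0.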

call_rcu_hurry() is used to speed up the transition, which otherwise might
be delayed with RCU_LAZY. There is nothing wrong with using call_rcu(). The
side effects would be that on auto scaling the new hash is used later
and the SET_SLOTS prctl() will block longer.

[bigeasy: commit description + mm get/ put_async]

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Sebastian Andrzej Siewior 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lore.kernel.org/r/20250710110011.384614-3-bigeasy@linutronix.de

futex: Initialize futex_phash_new during fork().

2025-06-23T12:50:37+00:00

During a hash resize operation the new private hash is stored in
mm_struct::futex_phash_new if the current hash can not be immediately
replaced.

The new hash must not be copied during fork() into the new task. Doing
so will lead to a double-free of the memory by the two tasks.

Initialize the mm_struct::futex_phash_new during fork().
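
A reduced model of the problem and the fix, assuming fork() starts from a
byte copy of the parent's MM. The demo_* types are hypothetical and carry
only the two pointers discussed here.

```c
#include <stddef.h>

struct demo_hash {
	unsigned int slots;
};

/* Hypothetical, reduced MM with the two private hash pointers. */
struct demo_mm {
	struct demo_hash *futex_phash;
	struct demo_hash *futex_phash_new;
	int other_state;
};

/*
 * Models the copy done at fork(). Without the explicit reset the child
 * would inherit the parent's pending hash pointer and both MMs would
 * free the same object on teardown.
 */
static void demo_dup_mm(struct demo_mm *child, const struct demo_mm *parent)
{
	*child = *parent;
	child->futex_phash = NULL;
	child->futex_phash_new = NULL;
}
```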

Closes: https://lore.kernel.org/all/aFBQ8CBKmRzEqIfS@mozart.vkv.me/
Fixes: bd54df5ea7cad ("futex: Allow to resize the private local hash")
Reported-by: Calvin Owens 
Signed-off-by: Sebastian Andrzej Siewior 
Signed-off-by: Peter Zijlstra (Intel) 
Tested-by: Calvin Owens 
Link: https://lkml.kernel.org/r/20250623083408.jTiJiC6_@linutronix.de

futex: Use RCU_INIT_POINTER() in futex_mm_init().

2025-05-21T11:57:41+00:00

There is no need for an explicit NULL pointer initialisation plus a
comment why it is okay. RCU_INIT_POINTER() can be used for NULL
initialisations and it is documented.

This has been build tested with gcc version 9.3.0 (Debian 9.3.0-22) on an
x86-64 defconfig.

Fixes: 094ac8cff7858 ("futex: Relax the rcu_assign_pointer() assignment of mm->futex_phash in futex_mm_init()")
Signed-off-by: Sebastian Andrzej Siewior 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lore.kernel.org/r/20250517151455.1065363-4-bigeasy@linutronix.de

futex: Relax the rcu_assign_pointer() assignment of mm->futex_phash in futex_mm_init()

2025-05-11T08:02:12+00:00

The following commit added an rcu_assign_pointer() assignment to
futex_mm_init() in include/linux/futex.h:

  bd54df5ea7ca ("futex: Allow to resize the private local hash")

Which breaks the build on older compilers (gcc-9, x86-64 defconfig):

   CC      io_uring/futex.o
   In file included from ./arch/x86/include/generated/asm/rwonce.h:1,
                    from ./include/linux/compiler.h:390,
                    from ./include/linux/array_size.h:5,
                    from ./include/linux/kernel.h:16,
                    from io_uring/futex.c:2:
   ./include/linux/futex.h: In function 'futex_mm_init':
   ./include/linux/rcupdate.h:555:36: error: dereferencing pointer to incomplete type 'struct futex_private_hash'

The problem is that this variant of rcu_assign_pointer() wants to
know the full type of 'struct futex_private_hash', which type
is local to futex.c:

   kernel/futex/core.c:struct futex_private_hash {

There are a couple of mechanical solutions for this bug:

  - we can uninline futex_mm_init() and move it into futex/core.c

  - or we can share the structure definition with kernel/fork.c.

But both of these solutions have disadvantages: the first one adds
runtime overhead, while the second one dis-encapsulates private
futex types.

A third solution, implemented by this patch, is to just initialize
mm->futex_phash with NULL directly. This new MM's ->futex_phash cannot
be observed externally until the task is inserted into the task list,
which guarantees full store ordering.

The relaxation of this initialization might also give a tiny speedup
on certain platforms.
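
Why the third option compiles where the original did not can be shown with a
small stand-alone sketch: storing NULL into a pointer never requires the
pointed-to type to be complete. The demo_* names are hypothetical and the
kernel's RCU annotations are left out.

```c
#include <stddef.h>

/* Incomplete type: the definition lives in another translation unit. */
struct demo_private_hash;

struct demo_mm {
	struct demo_private_hash *futex_phash;
};

/*
 * A plain NULL store compiles with only the forward declaration above.
 * Nothing else can observe the new MM yet, so no ordering is needed.
 */
static inline void demo_futex_mm_init(struct demo_mm *mm)
{
	mm->futex_phash = NULL;
}
```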

Fixes: bd54df5ea7ca ("futex: Allow to resize the private local hash")
Signed-off-by: Ingo Molnar 
Cc: André Almeida 
Cc: Darren Hart 
Cc: Davidlohr Bueso 
Cc: Juri Lelli 
Cc: Peter Zijlstra 
Cc: Sebastian Andrzej Siewior 
Cc: Valentin Schneider 
Cc: Waiman Long 
Link: https://lore.kernel.org/r/aB8SI00EHBri23lB@gmail.com

futex: Implement FUTEX2_NUMA

2025-05-03T10:02:09+00:00

Extend the futex2 interface to be NUMA aware.

When FUTEX2_NUMA is specified for a futex, the user value is extended
to two words (of the same size). The first is the user value we all
know, the second one will be the node to place this futex on.

  struct futex_numa_32 {
	u32 val;
	u32 node;
  };

When node is set to ~0, WAIT will set it to the current node_id such
that WAKE knows where to find it. If userspace corrupts the node value
between WAIT and WAKE, the futex will not be found and no wakeup will
happen.
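
The node handling described above can be sketched in user space terms. The
struct mirrors the two-word layout from the text with fixed-width types;
demo_wait_pick_node() and DEMO_NO_NODE are hypothetical names, and the real
kernel code updates the user word with proper user-access primitives.

```c
#include <stdint.h>

#define DEMO_NO_NODE	(~(uint32_t)0)	/* "pick a node for me" marker */

/* Two-word futex as described: value followed by the node. */
struct demo_futex_numa_32 {
	uint32_t val;
	uint32_t node;
};

/*
 * Models what WAIT does with the node word: an unset node is replaced
 * by the current node so that a later WAKE hashes to the same place.
 */
static uint32_t demo_wait_pick_node(struct demo_futex_numa_32 *f,
				    uint32_t current_node)
{
	if (f->node == DEMO_NO_NODE)
		f->node = current_node;
	return f->node;
}
```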

When FUTEX2_NUMA is not set, the node is simply an extension of the
hash, such that traditional futexes are still interleaved over the
nodes.

This is done to avoid having to have a separate !numa hash-table.

[bigeasy: ensure to have at least hashsize of 4 in futex_init(), add
pr_info() for size and allocation information. Cast the naddr math to
void*]

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Sebastian Andrzej Siewior 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lore.kernel.org/r/20250416162921.513656-17-bigeasy@linutronix.de

futex: Allow to resize the private local hash

2025-05-03T10:02:08+00:00

The mm_struct::futex_hash_lock guards the futex_hash_bucket assignment/
replacement. The futex_hash_allocate()/ PR_FUTEX_HASH_SET_SLOTS
operation can now be invoked at runtime and resize an already existing
internal private futex_hash_bucket to another size.

The reallocation is based on an idea by Thomas Gleixner: The initial
allocation of struct futex_private_hash sets the reference count
to one. Every user acquires a reference on the local hash before using
it and drops it after it enqueued itself on the hash bucket. There is no
reference held while the task is scheduled out while waiting for the
wake up.
The resize process allocates a new struct futex_private_hash and drops
the initial reference. Synchronized with mm_struct::futex_hash_lock it
is checked if the reference counter for the currently used
mm_struct::futex_phash is marked as DEAD. If so, then all users enqueued
on the current private hash are requeued on the new private hash and the
new private hash is set to mm_struct::futex_phash. Otherwise the newly
allocated private hash is saved as mm_struct::futex_phash_new and the
rehashing and reassigning is delayed to the futex_hash() caller once the
reference counter is marked DEAD.
The replacement is not performed at rcuref_put() time because certain
callers, such as futex_wait_queue(), drop their reference after changing
the task state. That task state change would be lost once the
futex_hash_lock is acquired.

The user can change the number of slots with PR_FUTEX_HASH_SET_SLOTS
multiple times. Both increases and decreases are allowed, and the request
blocks until the assignment is done.

The private hash allocated at thread creation is changed from 16 to
  16 <= 4 * number_of_threads <= global_hash_size
where number_of_threads can not exceed the number of online CPUs. Should
the user invoke PR_FUTEX_HASH_SET_SLOTS, auto scaling is disabled.
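
The auto scaling bound can be written out as a small helper. This is a
sketch of the inequality only; demo_auto_slots() is a hypothetical name and
any additional rounding the kernel may apply is not modelled.

```c
/*
 * 16 <= 4 * number_of_threads <= global_hash_size, with the thread
 * count capped at the number of online CPUs.
 */
static unsigned int demo_auto_slots(unsigned int threads,
				    unsigned int online_cpus,
				    unsigned int global_hash_size)
{
	unsigned int slots;

	if (threads > online_cpus)
		threads = online_cpus;

	slots = 4 * threads;
	if (slots < 16)
		slots = 16;
	if (slots > global_hash_size)
		slots = global_hash_size;
	return slots;
}
```

With 8 online CPUs and a global hash of 256 slots, a single thread gets 16
slots, 6 threads get 24, and 100 threads get 32 because of the CPU cap.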

[peterz: reorganize the code to avoid state tracking and simplify new
object handling, block the user until changes are in effect, allow
increase and decrease of the hash].

Signed-off-by: Sebastian Andrzej Siewior 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lore.kernel.org/r/20250416162921.513656-15-bigeasy@linutronix.de