linux.git - Linux kernel source tree

diff options

author	Thomas Gleixner <tglx@kernel.org>	2026-06-02 11:09:55 +0200
committer	Peter Zijlstra <peterz@infradead.org>	2026-06-03 11:38:51 +0200
commit	3ca9595d9fb6cce6633a5b03d98c2aecb5499838 (patch)
tree	c8a64ca452ea34b832460e4f0474322375c78694 /rust/kernel/alloc
parent	1fd053d26f0333485cdbaa9d6e7b8cb53f54de95 (diff)

futex: Add support for unlocking robust futexes

Unlocking robust non-PI futexes happens in user space with the following sequence: 1) robust_list_set_op_pending(mutex); 2) robust_list_remove(mutex); lval = 0; 3) lval = atomic_xchg(lock, lval); 4) if (lval & WAITERS) 5) sys_futex(WAKE,....); 6) robust_list_clear_op_pending(); That opens a window between #3 and #6 where the mutex could be acquired by some other task which observes that it is the last user and: A) unmaps the mutex memory B) maps a different file, which ends up covering the same address When the original task exits before reaching #6 then the kernel robust list handling observes the pending op entry and tries to fix up user space. In case that the newly mapped data contains the TID of the exiting thread at the address of the mutex/futex the kernel will set the owner died bit in that memory and therefore corrupting unrelated data. PI futexes have a similar problem both for the non-contented user space unlock and the in kernel unlock: 1) robust_list_set_op_pending(mutex); 2) robust_list_remove(mutex); lval = gettid(); 3) if (!atomic_try_cmpxchg(lock, lval, 0)) 4) sys_futex(UNLOCK_PI,....); 5) robust_list_clear_op_pending(); Address the first part of the problem where the futexes have waiters and need to enter the kernel anyway. Add a new FUTEX_ROBUST_UNLOCK flag, which is valid for the sys_futex() FUTEX_UNLOCK_PI, FUTEX_WAKE, FUTEX_WAKE_BITSET operations. This deliberately omits FUTEX_WAKE_OP from this treatment as it's unclear whether this is needed and there is no usage of it in glibc either to investigate. For the futex2 syscall family this needs to be implemented with a new syscall. The sys_futex() case [ab]uses the @uaddr2 argument to hand the pointer to robust_list_head::list_pending_op into the kernel. This argument is only evaluated when the FUTEX_ROBUST_UNLOCK bit is set and is therefore backward compatible. This is an explicit argument to avoid the lookup of the robust list pointer and retrieving the pending op pointer from there. User space has the pointer already available so it can just put it into the @uaddr2 argument. Aside of that this allows the usage of multiple robust lists in the future without any changes to the internal functions as they just operate on the provided pointer. This requires a second flag FUTEX_ROBUST_LIST32 which indicates that the robust list pointer points to an u32 and not to an u64. This is required for two reasons: 1) sys_futex() has no compat variant 2) The gaming emulators use both both 64-bit and compat 32-bit robust lists in the same 64-bit application As a consequence 32-bit applications have to set this flag unconditionally so they can run on a 64-bit kernel in compat mode unmodified. 32-bit kernels return an error code when the flag is not set. 64-bit kernels will happily clear the full 64 bits if user space fails to set it. In case of FUTEX_UNLOCK_PI this clears the robust list pending op when the unlock succeeded. In case of errors, the user space value is still locked by the caller and therefore the above cannot happen. In case of FUTEX_WAKE* this does the unlock of the futex in the kernel and clears the robust list pending op when the unlock was successful. If not, the user space value is still locked and user space has to deal with the returned error. That means that the unlocking of non-PI robust futexes has to use the same try_cmpxchg() unlock scheme as PI futexes. If the clearing of the pending list op fails (fault) then the kernel clears the registered robust list pointer if it matches to prevent that exit() will try to handle invalid data. That's a valid paranoid decision because the robust list head sits usually in the TLS and if the TLS is not longer accessible then the chance for fixing up the resulting mess is very close to zero. The problem of non-contended unlocks still exists and will be addressed separately. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: André Almeida <andrealmeid@igalia.com> Link: https://patch.msgid.link/20260602090535.670514505@kernel.org

Diffstat (limited to 'rust/kernel/alloc')

0 files changed, 0 insertions, 0 deletions


context:
space:
mode: