linux.git/include/linux/rhashtable-types.h, branch v7.2-rc1

Merge tag 'kernel-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2026-06-14T21:35:50+00:00

Pull misc kernel updates from Christian Brauner:
 "Fixes

   - rhashtable: give each instance its own lockdep class

     syzbot reported a circular locking dependency between ht->mutex and
     fs_reclaim via the simple_xattrs rhashtable being torn down during
     inode eviction.

     The predicted deadlock cannot occur: rhashtable_free_and_destroy()
     cancels the deferred worker before taking ht->mutex and
     acquisitions on distinct rhashtables are on distinct mutexes.

     Lockdep flags a cycle anyway because every ht->mutex in the kernel
     shared the single static lockdep class from
     rhashtable_init_noprof().

     The lockdep key is lifted to a per-call-site static key so every
     rhashtable instance gets its own class.

   - selftests/clone3: fix misuse of the libcap library interface in the
     cap_checkpoint_restore test and remove unused variables

   - selftests/pid_namespace: compute the pid_max test limits
     dynamically instead of hardcoding values below the kernel-enforced
     minimum of PIDS_PER_CPU_MIN * num_possible_cpus() which made the
     tests fail on machines with many possible CPUs

   - selftests: fix the Makefile TARGETS entry for nsfs which wasn't
     adjusted when the tests moved under filesystems/

  Cleanups

   - ipc/sem.c: use unsigned int for nsops to match the declaration in
     syscalls.h"

* tag 'kernel-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  selftests/clone3: remove unused variables
  selftests/clone3: fix libcap interface usage
  ipc/sem.c: use unsigned int for nsops
  selftests: Fix Makefile target for nsfs
  rhashtable: give each instance its own lockdep class
  selftests/pid_namespace: compute pid_max test limits dynamically

rhashtable: give each instance its own lockdep class

2026-05-12T07:15:01+00:00

syzbot reported a possible circular locking dependency between
&ht->mutex and fs_reclaim:

  CPU0 (kswapd0)                    CPU1 (kworker)
  --------------                    --------------
  fs_reclaim                        ht->mutex
    shmem_evict_inode                 rhashtable_rehash_alloc
      simple_xattrs_free                bucket_table_alloc(GFP_KERNEL)
        rhashtable_free_and_destroy       __kvmalloc_node
          mutex_lock(&ht->mutex)            might_alloc -> fs_reclaim

The two halves of the splat refer to two different events on
&ht->mutex.

The kswapd0 path is unambiguous: shmem_evict_inode at mm/shmem.c:1429
calls simple_xattrs_free(), which calls rhashtable_free_and_destroy()
on the per-inode simple_xattrs rhashtable being torn down with the
inode.

The previously-recorded ht->mutex -> fs_reclaim edge comes from
rht_deferred_worker -> rhashtable_rehash_alloc ->
bucket_table_alloc(GFP_KERNEL) -> __kvmalloc_node ->
might_alloc -> fs_reclaim. That stack stops at generic library code:
there is no subsystem-specific frame above rht_deferred_worker, so
the splat does not identify which rhashtable's worker recorded the
edge -- only that some rhashtable in the system did.

Whether or not that recording happened on the same simple_xattrs ht
that is now being destroyed, the predicted deadlock cannot occur:
rhashtable_free_and_destroy() does cancel_work_sync(&ht->run_work)
before taking ht->mutex, so the deferred worker cannot be running on
the instance being torn down. If the recording was on a different
rhashtable instance, the two ht->mutex acquisitions are on distinct
mutex objects and cannot deadlock either.

Lockdep flags a cycle regardless because mutex_init(&ht->mutex) lives
on a single source line in rhashtable_init_noprof(), so every
ht->mutex in the kernel shares one static lockdep class. Lockdep
matches by class, not by instance, and collapses all of these into
one node.

Lift the lockdep key out of rhashtable_init_noprof() and into the
caller. The user-visible rhashtable_init_noprof() /
rhltable_init_noprof() identifiers become macros that declare a
per-call-site static lock_class_key.
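
The per-call-site key pattern described above can be modelled in plain
userspace C. This is only a sketch, not the actual patch: struct
toy_table, toy_table_init_key() and toy_table_init() are hypothetical
names, and the lock_class_key here is a stand-in whose address alone
plays the role of the lockdep class.

```c
#include <stddef.h>

/* Stand-in for the kernel's struct lock_class_key: only its address matters. */
struct lock_class_key {
	char dummy;
};

/* Hypothetical miniature table; mutex_class models the class of ht->mutex. */
struct toy_table {
	const struct lock_class_key *mutex_class;
};

/* The init function no longer owns a key; it takes one from its caller. */
static int toy_table_init_key(struct toy_table *t,
			      const struct lock_class_key *key)
{
	t->mutex_class = key;
	return 0;
}

/* Every expansion of this macro declares its own static key. */
#define toy_table_init(t)						\
	do {								\
		static struct lock_class_key toy_key_;			\
		toy_table_init_key((t), &toy_key_);			\
	} while (0)

/* Two separate call sites, hence two separate keys. */
static void init_from_site_a(struct toy_table *t)
{
	toy_table_init(t);
}

static void init_from_site_b(struct toy_table *t)
{
	toy_table_init(t);
}
```

In this model, tables initialized through init_from_site_a() and
init_from_site_b() end up with different keys, while two tables that
both go through init_from_site_a() still share one, since the key is
static to the call site rather than allocated per object.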

Link: https://patch.msgid.link/20260427-work-rhashtable-lockdep-v1-1-f69e8bd91cb2@kernel.org
Fixes: c6307674ed82 ("mm: kvmalloc: add non-blocking support for vmalloc")
Acked-by: Michal Hocko 
Reported-by: syzbot+5af806780f38a5fe691f@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/69e798fe.050a0220.24bfd3.0032.GAE@google.com
Signed-off-by: Christian Brauner

rhashtable: Bounce deferred worker kick through irq_work

2026-04-21T06:10:50+00:00

Inserts past 75% load call schedule_work(&ht->run_work) to kick an
async resize. If a caller holds a raw spinlock (e.g. an
insecure_elasticity user), schedule_work() under that lock records

  caller_lock -> pool->lock -> pi_lock -> rq->__lock

A cycle forms if any of these locks is acquired in the reverse
direction elsewhere. sched_ext, the only current insecure_elasticity
user, hits this: it holds scx_sched_lock across rhashtable inserts of
sub-schedulers, while scx_bypass() takes rq->__lock -> scx_sched_lock.
Exercising the resize path produces:

  Chain exists of:
    &pool->lock --> &rq->__lock --> scx_sched_lock

Bounce the kick from the insert paths through irq_work so
schedule_work() runs from hard IRQ context with the caller's lock no
longer held. rht_deferred_worker()'s self-rearm on error stays on
schedule_work(&ht->run_work) - the worker runs in process context with
no caller lock held, and keeping the self-requeue on @run_work lets
cancel_work_sync() in rhashtable_free_and_destroy() drain it.

v3: Keep rht_deferred_worker()'s self-rearm on schedule_work(&run_work).
    Routing it through irq_work in v2 broke cancel_work_sync()'s
    self-requeue handling - an irq_work queued after irq_work_sync()
    returned but while cancel_work_sync() was still waiting could fire
    post-teardown.

v2: Bounce unconditionally instead of gating on insecure_elasticity,
    as suggested by Herbert.

Signed-off-by: Tejun Heo 
Acked-by: Herbert Xu

rhashtable: Restore insecure_elasticity toggle

2026-04-19T15:47:21+00:00

Some users of rhashtable cannot handle insertion failures, and
are happy to accept the consequences of a hash table having
very long chains.

Restore the insecure_elasticity toggle for these users.  In
addition to disabling the chain length checks, this also removes
the emergency resize that would otherwise occur when the hash
table occupancy hits 100% (an async resize is still scheduled
at 75%).
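
As a rough illustration of the two thresholds mentioned above, the
following userspace sketch decides which resize, if any, an insert
would trigger. The function and enum names are hypothetical, and the
exact comparisons (strictly greater than 3/4 of the size, strictly
greater than the size) are assumptions made for the example, not taken
from this commit.

```c
#include <stdbool.h>
#include <stddef.h>

enum toy_resize {
	TOY_RESIZE_NONE,	/* below 75% load, nothing to do */
	TOY_RESIZE_ASYNC,	/* past 75%: kick the deferred worker */
	TOY_RESIZE_EMERGENCY,	/* past 100%: resize from the insert path */
};

/*
 * nelems: number of entries after the insert
 * size:   number of buckets in the current table
 */
static enum toy_resize toy_resize_action(size_t nelems, size_t size,
					 bool insecure_elasticity)
{
	/* The toggle removes only the emergency resize. */
	if (!insecure_elasticity && nelems > size)
		return TOY_RESIZE_EMERGENCY;

	/* The async resize is scheduled regardless of the toggle. */
	if (nelems > size / 4 * 3)
		return TOY_RESIZE_ASYNC;

	return TOY_RESIZE_NONE;
}
```

With the toggle set, a table past 100% occupancy falls through to the
async case in this model, so it keeps accepting inserts while the
deferred resize catches up.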

Signed-off-by: Herbert Xu 
Signed-off-by: Tejun Heo

rhashtable: plumb through alloc tag

2024-04-26T03:55:57+00:00

This gives better memory allocation profiling results; rhashtable
allocations will be accounted to the code that initialized the rhashtable.

[surenb@google.com: undo _noprof additions in the documentation]
  Link: https://lkml.kernel.org/r/20240326231453.1206227-1-surenb@google.com
Link: https://lkml.kernel.org/r/20240321163705.3067592-32-surenb@google.com
Signed-off-by: Kent Overstreet 
Signed-off-by: Suren Baghdasaryan 
Tested-by: Kees Cook 
Cc: Alexander Viro 
Cc: Alex Gaynor 
Cc: Alice Ryhl 
Cc: Andreas Hindborg 
Cc: Benno Lossin 
Cc: "Björn Roy Baron" 
Cc: Boqun Feng 
Cc: Christoph Lameter 
Cc: Dennis Zhou 
Cc: Gary Guo 
Cc: Miguel Ojeda 
Cc: Pasha Tatashin 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
Cc: Vlastimil Babka 
Cc: Wedson Almeida Filho 
Signed-off-by: Andrew Morton

workqueue: Split out workqueue_types.h

2023-12-21T00:26:31+00:00

More sched.h dependency culling - this lets us kill a rhashtable-types.h
dependency on workqueue.h.

Signed-off-by: Kent Overstreet

rhashtable: use bit_spin_locks to protect hash bucket.

2019-04-08T02:12:12+00:00

This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the
bucket pointer to lock the hash chain for that bucket.

The benefits of a bit spin_lock are:
 - no need to allocate a separate array of locks.
 - no need to have a configuration option to guide the
   choice of the size of this array
 - locking cost is often a single test-and-set in a cache line
   that will have to be loaded anyway.  When inserting at, or removing
   from, the head of the chain, the unlock is free - writing the new
   address in the bucket head implicitly clears the lock bit.
   For __rhashtable_insert_fast() we ensure this always happens
   when adding a new key.
 - even when locking costs 2 updates (lock and unlock), they are
   in a cacheline that needs to be read anyway.

The cost of using a bit spin_lock is a little bit of code complexity,
which I think is quite manageable.

Bit spin_locks are sometimes inappropriate because they are not fair -
if multiple CPUs repeatedly contend for the same lock, one CPU can
easily be starved.  This is not a credible situation with rhashtable.
Multiple CPUs may want to repeatedly add or remove objects, but they
will typically do so at different buckets, so they will attempt to
acquire different locks.

As we have more bit-locks than we previously had spinlocks (by at
least a factor of two) we can expect slightly less contention to
go with the slightly better cache behavior and reduced memory
consumption.

To enhance type checking, a new struct is introduced to represent the
pointer plus lock-bit that is stored in the bucket-table.  This is
"struct rhash_lock_head" and is empty.  A pointer to this needs to be
cast to either an unsigned long, or a "struct rhash_head *" to be useful.
Variables of this type are most often called "bkt".

Previously "pprev" would sometimes point to a bucket, and sometimes a
->next pointer in an rhash_head.  As these are now different types,
pprev is NULL when it would have pointed to the bucket. In that case,
'bkt' is used, together with correct locking protocol.
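
The idea of keeping the lock in a spare bit of the bucket pointer can
be sketched in userspace with C11 atomics. This is a simplified model
and not the kernel implementation: the toy_* names are hypothetical,
it uses plain atomic operations instead of the kernel's bit_spin_lock
helpers, and it ignores RCU and interrupt or preemption handling.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The commit message puts the lock in BIT(1) of the bucket pointer. */
#define TOY_LOCK_BIT ((uintptr_t)1 << 1)

struct toy_node {
	struct toy_node *next;
	int key;
};

/* Bit 1 of a node address is only free if nodes are at least 4-byte aligned. */
_Static_assert(_Alignof(struct toy_node) >= 4, "lock bit must be spare");

/* One word per bucket: chain head pointer plus lock bit. */
typedef _Atomic uintptr_t toy_bucket;

static void toy_bucket_lock(toy_bucket *bkt)
{
	/* Test-and-set: spin until this caller flips the bit from 0 to 1. */
	while (atomic_fetch_or_explicit(bkt, TOY_LOCK_BIT,
					memory_order_acquire) & TOY_LOCK_BIT)
		;
}

static void toy_bucket_unlock(toy_bucket *bkt)
{
	/* Explicit unlock, for when the head pointer did not change. */
	atomic_fetch_and_explicit(bkt, ~TOY_LOCK_BIT, memory_order_release);
}

static bool toy_bucket_is_locked(toy_bucket *bkt)
{
	return atomic_load_explicit(bkt, memory_order_relaxed) & TOY_LOCK_BIT;
}

static struct toy_node *toy_bucket_head(toy_bucket *bkt)
{
	uintptr_t v = atomic_load_explicit(bkt, memory_order_acquire);

	/* Mask the lock bit off before treating the word as a pointer. */
	return (struct toy_node *)(v & ~TOY_LOCK_BIT);
}

/* Caller holds the lock. Storing the clean pointer doubles as the unlock. */
static void toy_bucket_push_and_unlock(toy_bucket *bkt, struct toy_node *n)
{
	n->next = toy_bucket_head(bkt);
	atomic_store_explicit(bkt, (uintptr_t)n, memory_order_release);
}
```

toy_bucket_push_and_unlock() shows the "free unlock" case from the
list above: writing the new head address, which has the lock bit
clear, releases the lock in the same store.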

Signed-off-by: NeilBrown 
Signed-off-by: David S. Miller

rhashtable: remove nulls_base and related code.

2018-06-22T04:43:27+00:00

This "feature" is unused, undocumented, and untested and so doesn't
really belong.  A patch is under development to properly implement
support for detecting when a search gets diverted down a different
chain, which is the common purpose of nulls markers.

This patch actually fixes a bug too.  The table resizing allows a
table to grow to 2^31 buckets, but the hash is truncated to 27 bits -
any growth beyond 2^27 is wasteful and ineffective.
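
The arithmetic behind that bug can be sketched as follows, assuming
the bucket index is derived by masking the hash with (size - 1), which
this commit message does not spell out. The helper name is
hypothetical.

```c
#include <stdint.h>

/*
 * Number of buckets that can ever be selected when a table of
 * 2^table_bits buckets is indexed by a hash carrying only hash_bits
 * significant bits.
 */
static uint64_t toy_reachable_buckets(unsigned int table_bits,
				      unsigned int hash_bits)
{
	unsigned int usable = table_bits < hash_bits ? table_bits : hash_bits;

	return (uint64_t)1 << usable;
}
```

Under that assumption, a table grown to 2^31 buckets with a 27-bit
hash can only ever reach 2^27 of them, so 15 out of every 16 buckets
would stay empty.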

This patch results in NULLS_MARKER(0) being used for all chains,
and leaves the use of rht_is_a_null() to test for it.

Acked-by: Herbert Xu 
Signed-off-by: NeilBrown 
Signed-off-by: David S. Miller

rhashtable: split rhashtable.h

2018-06-22T04:43:27+00:00

Due to the use of rhashtables in net namespaces,
rhashtable.h is included in lots of the kernel,
so a small change can require a large recompilation.
This makes development painful.

This patch splits out rhashtable-types.h which just includes
the major type declarations, and does not include (non-trivial)
inline code.  rhashtable.h is no longer included by anything
in the include/ directory.
Common include files only include rhashtable-types.h so a large
recompilation is only triggered when that changes.

Acked-by: Herbert Xu 
Signed-off-by: NeilBrown 
Signed-off-by: David S. Miller