linux.git/lib/rhashtable.c, branch v7.2-rc1

Merge tag 'bpf-next-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

2026-06-17T08:18:14+00:00

Pull bpf updates from Alexei Starovoitov:
 "Major changes:

   - Recover from BPF arena page faults using a scratch page and add
     ptep_try_set() for lockless empty-slot installs on x86 and arm64.

     This allows BPF kfuncs to access arena pointers directly.

     The 'arena_direct_access' stable branch was created for this work
     and was pulled into sched-ext and bpf-next trees (Tejun Heo, Kumar
     Kartikeya Dwivedi)

   - Lift old restriction and support 6+ arguments in BPF programs and
     kfuncs on x86 and arm64 (Yonghong Song, Puranjay Mohan)

  Other features and fixes:

   - Add 24-bit BTF vlen and reclaim unused bits in the BTF UAPI to ease
     addition of new BTF kinds (Alan Maguire)

   - Raise the maximum BPF call chain depth from 8 to 16 frames (Alexei
     Starovoitov)

   - Refactor object relationship tracking in the verifier and fix a
     dynptr use-after-free bug (Amery Hung)

   - Harden the signed program loader and reject exclusive maps as inner
     maps (Daniel Borkmann)

   - Replace the verifier min/max bounds fields with a circular number
     (cnum) representation and improve 32->64 bit range refinements
     (Eduard Zingerman)

   - Introduce the arena library and runtime (libarena) with a buddy
     allocator, rbtree and SPMC queue data structures, ASAN support and
     a parallel test harness. Allow subprograms to return arena pointers
     and switch to a BTF type-tag based __arena annotation (Emil
     Tsalapatis)

   - Cache build IDs in the sleepable stackmap path and avoid faultable
     build ID reads under mm locks (Ihor Solodrai)

   - Introduce the tracing_multi link to attach a single BPF program to
     many kernel functions at once. Allow specifying the uprobe_multi
     target via FD (Jiri Olsa)

   - Extend the bpf_list family of kfuncs with bpf_list_add/del(), and
     bpf_list_is_first/is_last/empty() (Kaitao Cheng)

   - Extend the BPF syscall with common attributes support for
     prog_load, btf_load and map_create (Leon Hwang)

   - Wrap rhashtable as BPF map (Mykyta Yatsenko, Herbert Xu)

   - Add sleepable support for tracepoint programs and fix deadlocks in
     LRU map due to NMI reentry (Mykyta Yatsenko)

   - Fix OOB access in bpf_flow_keys, fix nullness analysis of inner
     arrays, enforce write checks for global subprograms (Nuoqi Gui)

   - Report the maximum combined stack depth and print a breakdown of
     instructions processed per subprogram (Paul Chaignon)

   - Add an XDP load-balancer benchmark and arm64 JIT support for stack
     arguments (Puranjay Mohan)

   - Add kfuncs to traverse over wakeup_sources (Samuel Wu)

   - Allow sleepable BPF programs to use LPM trie maps directly (Vlad
     Poenaru)

   - Many more fixes and cleanups across the verifier, BTF, sockmap,
     devmap, bpffs, security hooks, s390/riscv/loongarch JITs,
     rqspinlock, libbpf, bpftool, selftests"

* tag 'bpf-next-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (336 commits)
  selftests/bpf: Work around llvm stack overflow in crypto progs
  selftests/bpf: add test for bpf_msg_pop_data() overflow
  bpf, sockmap: fix integer overflow in bpf_msg_pop_data() bounds check
  sockmap: Fix use-after-free in udp_bpf_recvmsg()
  bpf, sockmap: keep sk_msg copy state in sync
  bpf, sockmap: Fix wrong rsge offset in bpf_msg_push_data()
  bpf, sockmap: reject overflowing copy + len in bpf_msg_push_data()
  selftsets/bpf: Retry map update on helper_fill_hashmap()
  selftests/bpf: Add test for sleepable lsm_cgroup rejection
  selftests/bpf: Add test to verify the fix for bpf_setsockopt() helper
  bpf: Fix bpf_get/setsockopt to tos for ipv4-mapped ipv6 socket
  selftests/bpf: Avoid static LLVM linking for cross builds
  selftests/bpf: Use common CFLAGS for urandom_read
  selftests/bpf: Initialize operation name before use
  tools/bpf: build: Append extra cflags
  libbpf: Initialize CFLAGS before including Makefile.include
  bpftool: Append extra host flags
  bpftool: Avoid adding EXTRA_CFLAGS to HOST_CFLAGS
  bpftool: Pass host flags to bootstrap libbpf
  selftests/bpf: correct CONFIG_PPC64 macro name in comment
  ...

Merge tag 'kernel-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2026-06-14T21:35:50+00:00

Pull misc kernel updates from Christian Brauner:
 "Fixes

   - rhashtable: give each instance its own lockdep class

     syzbot reported a circular locking dependency between ht->mutex and
     fs_reclaim via the simple_xattrs rhashtable being torn down during
     inode eviction.

     The predicted deadlock cannot occur: rhashtable_free_and_destroy()
     cancels the deferred worker before taking ht->mutex and
     acquisitions on distinct rhashtables are on distinct mutexes.

     Lockdep flags a cycle anyway because every ht->mutex in the kernel
     shared the single static lockdep class from
     rhashtable_init_noprof().

     The lockdep key is lifted to a per-call-site static key so every
     rhashtable instance gets its own class.

   - selftests/clone3: fix misuse of the libcap library interface in the
     cap_checkpoint_restore test and remove unused variables

   - selftests/pid_namespace: compute the pid_max test limits
     dynamically instead of hardcoding values below the kernel-enforced
     minimum of PIDS_PER_CPU_MIN * num_possible_cpus() which made the
     tests fail on machines with many possible CPUs

   - selftests: fix the Makefile TARGETS entry for nsfs which wasn't
     adjusted when the tests moved under filesystems/

  Cleanups

   - ipc/sem.c: use unsigned int for nsops to match the declaration in
     syscalls.h"

* tag 'kernel-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  selftests/clone3: remove unused variables
  selftests/clone3: fix libcap interface usage
  ipc/sem.c: use unsigned int for nsops
  selftests: Fix Makefile target for nsfs
  rhashtable: give each instance its own lockdep class
  selftests/pid_namespace: compute pid_max test limits dynamically

rhashtable: Fix rhashtable_next_key() build warnings

2026-06-07T19:36:13+00:00

rhashtable.o builds with warnings as rhashtable_next_key() kdoc
from lib/rhashtable.c does not have the arguments descriptions.

Move rhashtable_next_key() kdoc from header to c file, matching
other functions.

Move rhashtable_next_key() next to the other forward declarations
in the header file.

Reported-by: kernel test robot 
Closes: https://lore.kernel.org/oe-kbuild-all/202606061925.WI4bYI8k-lkp@intel.com/
Fixes: 8f4fa9f89b72 ("rhashtable: Add rhashtable_next_key() API")
Signed-off-by: Mykyta Yatsenko 
Link: https://lore.kernel.org/r/20260606-rhash_fixes_1-v1-1-932ab036e6bc@meta.com
Signed-off-by: Alexei Starovoitov

rhashtable: Add rhashtable_next_key() API

2026-06-05T15:00:07+00:00

Introduce a simpler iteration mechanism for rhashtable that lets
the caller continue from an arbitrary position by supplying the
previous key, without the per-iterator state of the
rhashtable_walk_* API.

  void *rhashtable_next_key(struct rhashtable *ht,
                            const void *prev_key);

Caller holds RCU; passes NULL prev_key for the first element or
the previously returned key to advance. Walks tbl->future_tbl
chain so in-flight rehashes are observed.

Best-effort: in case of concurrent resize, provides no guarantees:
 - may produce duplicate elements
 - may skip any amount of elements
 - termination of the loop is not guaranteed in case of
 sustained rehash. Callers are advised to bound loop externally
 or avoid inserting new elements during such loop.

Returns ERR_PTR(-ENOENT) if prev_key is not found.
Behavior on tables with duplicate keys is undefined.
rhltable is not supported — returns ERR_PTR(-EOPNOTSUPP).

Signed-off-by: Mykyta Yatsenko 
Acked-by: Herbert Xu 
Link: https://lore.kernel.org/r/20260605-rhash-v7-1-5b8e05f8630d@meta.com
Signed-off-by: Alexei Starovoitov

rhashtable: give each instance its own lockdep class

2026-05-12T07:15:01+00:00

syzbot reported a possible circular locking dependency between
&ht->mutex and fs_reclaim:

  CPU0 (kswapd0)                    CPU1 (kworker)
  --------------                    --------------
  fs_reclaim                        ht->mutex
    shmem_evict_inode                 rhashtable_rehash_alloc
      simple_xattrs_free                bucket_table_alloc(GFP_KERNEL)
        rhashtable_free_and_destroy       __kvmalloc_node
          mutex_lock(&ht->mutex)            might_alloc -> fs_reclaim

The two halves of the splat refer to two different events on
&ht->mutex.

The kswapd0 path is unambiguous: shmem_evict_inode at mm/shmem.c:1429
calls simple_xattrs_free(), which calls rhashtable_free_and_destroy()
on the per-inode simple_xattrs rhashtable being torn down with the
inode.

The previously-recorded ht->mutex -> fs_reclaim edge comes from
rht_deferred_worker -> rhashtable_rehash_alloc ->
bucket_table_alloc(GFP_KERNEL) -> __kvmalloc_node ->
might_alloc -> fs_reclaim. That stack stops at generic library code:
there is no subsystem-specific frame above rht_deferred_worker, so
the splat does not identify which rhashtable's worker recorded the
edge -- only that some rhashtable in the system did.

Whether or not that recording happened on the same simple_xattrs ht
that is now being destroyed, the predicted deadlock cannot occur:
rhashtable_free_and_destroy() does cancel_work_sync(&ht->run_work)
before taking ht->mutex, so the deferred worker cannot be running on
the instance being torn down. If the recording was on a different
rhashtable instance, the two ht->mutex acquisitions are on distinct
mutex objects and cannot deadlock either.

Lockdep flags a cycle regardless because mutex_init(&ht->mutex) lives
on a single source line in rhashtable_init_noprof(), so every
ht->mutex in the kernel shares one static lockdep class. Lockdep
matches by class, not by instance, and collapses all of these into
one node.

Lift the lockdep key out of rhashtable_init_noprof() and into the
caller. The user-visible rhashtable_init_noprof() /
rhltable_init_noprof() identifiers become macros that declare a
per-call-site static lock_class_key.

Link: https://patch.msgid.link/20260427-work-rhashtable-lockdep-v1-1-f69e8bd91cb2@kernel.org
Fixes: c6307674ed82 ("mm: kvmalloc: add non-blocking support for vmalloc")
Acked-by: Michal Hocko 
Reported-by: syzbot+5af806780f38a5fe691f@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/69e798fe.050a0220.24bfd3.0032.GAE@google.com
Signed-off-by: Christian Brauner

rhashtable: Add bucket_table_free_atomic() helper

2026-05-05T08:12:07+00:00

rhashtable_insert_rehash() allocates a new bucket table
with GFP_ATOMIC, as it is called from an RCU read-side
critical section.

If rhashtable_rehash_attach() then fails, the new table
is freed via kvfree(). This is unsafe, since kvfree() may
fall back to vfree() for vmalloc-backed allocations, which
can sleep and trigger:

  BUG: sleeping function called from invalid context

Add bucket_table_free_atomic(), which uses kvfree_atomic()
so the table can be freed safely from non-sleeping context.

Signed-off-by: Uladzislau Rezki (Sony) 
Signed-off-by: Herbert Xu

rhashtable: drop ht->mutex in rhashtable_free_and_destroy()

2026-05-05T08:12:06+00:00

rhashtable_free_and_destroy() is a single-shot teardown routine:
cancel_work_sync() has already quiesced the deferred rehash worker, and
the function's documented contract requires the caller to guarantee no
other concurrent access to the rhashtable. Under those conditions
ht->mutex is not protecting anything -- taking it is a leftover from
the original teardown path.

That leftover is actively harmful: it closes a circular lock-class
dependency with fs_reclaim. The deferred rehash worker takes ht->mutex
and then allocates GFP_KERNEL memory in bucket_table_alloc(),
establishing

    &ht->mutex  ->  fs_reclaim

After commit b32c4a213698 ("xattr: add rhashtable-based simple_xattr
infrastructure") introduced simple_xattr_ht_free(), which calls
rhashtable_free_and_destroy(), the simple_xattrs teardown became
reachable from evict() under the dcache shrinker. The subsequent
per-subsystem adaptations made the reverse edge concrete in three
independent code paths:

  * commit 52b364fed6e1 ("shmem: adapt to rhashtable-based simple_xattrs with lazy allocation")
  * commit 5bd97f5c5f24 ("kernfs: adapt to rhashtable-based simple_xattrs with lazy allocation")
  * commit 50704c391fbf ("pidfs: adapt to rhashtable-based simple_xattrs")

Any of the three closes the cycle

    fs_reclaim  ->  &ht->mutex

which lockdep reports as follows. This particular splat was observed
organically on a workstation kernel built from vfs-7.1-rc1.xattr at
~35h uptime under normal mixed workload, with CONFIG_PROVE_LOCKING=y.
The path happens to go through kernfs:

  WARNING: possible circular locking dependency detected
  7.0.0-faeab166167f-with-fixes-v1+ #191 Tainted: G     U
  kswapd0/243 is trying to acquire lock:
  ffff8882e475c0f8 (&ht->mutex){+.+.}-{4:4},
    at: rhashtable_free_and_destroy+0x36/0x740
  but task is already holding lock:
  ffffffffa8ad1d00 (fs_reclaim){+.+.}-{0:0},
    at: balance_pgdat+0x995/0x1600

  the existing dependency chain (in reverse order) is:

  -> #1 (fs_reclaim){+.+.}-{0:0}:
         __lock_acquire+0x506/0xbf0
         lock_acquire.part.0+0xc7/0x280
         fs_reclaim_acquire+0xd9/0x130
         __kvmalloc_node_noprof+0xcd/0xb40
         bucket_table_alloc.isra.0+0x5a/0x440
         rhashtable_rehash_alloc+0x4e/0xd0
         rht_deferred_worker+0x14b/0x440
         process_one_work+0x8fd/0x16a0
         worker_thread+0x601/0xff0
         kthread+0x36b/0x470
         ret_from_fork+0x5bf/0x910
         ret_from_fork_asm+0x1a/0x30

  -> #0 (&ht->mutex){+.+.}-{4:4}:
         check_prev_add+0xdb/0xce0
         validate_chain+0x554/0x780
         __lock_acquire+0x506/0xbf0
         lock_acquire.part.0+0xc7/0x280
         __mutex_lock+0x1b2/0x2550
         rhashtable_free_and_destroy+0x36/0x740
         kernfs_put.part.0+0x119/0x570
         evict+0x3b6/0x9c0
         __dentry_kill+0x181/0x540
         shrink_dentry_list+0x135/0x440
         prune_dcache_sb+0xdb/0x150
         super_cache_scan+0x2ff/0x520
         do_shrink_slab+0x35a/0xee0
         shrink_slab_memcg+0x457/0x950
         shrink_slab+0x43b/0x550
         shrink_one+0x31a/0x6f0
         shrink_many+0x31e/0xc80
         shrink_node+0xeb3/0x14a0
         balance_pgdat+0x8ed/0x1600
         kswapd+0x2f3/0x530
         kthread+0x36b/0x470
         ret_from_fork+0x5bf/0x910
         ret_from_fork_asm+0x1a/0x30

   Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    lock(fs_reclaim);
                                 lock(&ht->mutex);
                                 lock(fs_reclaim);
    lock(&ht->mutex);

Note that lockdep tracks lock classes, not instances: the two
&ht->mutex sites are on different rhashtable objects (the deferred
worker was triggered by some unrelated rhashtable growth), but because
rhashtable_init() uses a single static lockdep key for all rhashtables,
this is a real class-level cycle. Once reported, lockdep disables
itself for the remainder of the boot, masking any subsequent locking
bugs.

Drop the mutex. After cancel_work_sync() the rehash worker is quiesced
and, per this function's contract, no other concurrent access is
possible; the tables are therefore owned exclusively by this function
and can be walked without any lock held.

Switch the table walks from rht_dereference() (which requires
ht->mutex to be held under CONFIG_PROVE_RCU) to rcu_dereference_raw(),
which has no lockdep annotation. rht_ptr_exclusive() already uses
rcu_dereference_protected(p, 1) and needs no change.

This is the only place in lib/rhashtable.c where &ht->mutex is
acquired from a path reachable under fs_reclaim; the deferred worker
is the only other site and it is the forward edge. Removing the
acquisition here therefore eliminates the class cycle for all three
subsystems that use simple_xattrs, not just the one in the splat
above. No locking-semantics change is introduced for correct users;
incorrect users would already be racing with rehash worker completion
regardless of the mutex.

Synthetic reproduction of the splat within a few-minute window was
unsuccessful across several attempts (tmpfs and kernfs zombies via
cgroupfs with open-fd-through-rmdir, with and without swap, up to
~60k reclaim-path executions of simple_xattr_ht_free() in a single
run), consistent with the rare coincidence-of-edges profile of the
bug: the forward edge is already registered in /proc/lockdep on any
idle system via rht_deferred_worker, but the reverse edge requires
evict() to complete kernfs_put()'s final release inside the fs_reclaim
critical section, which in my attempts was ordered against rather than
interleaved with the worker.

Fixes: b32c4a213698 ("xattr: add rhashtable-based simple_xattr infrastructure")
Signed-off-by: Mikhail Gavrilov 
Signed-off-by: Herbert Xu

rhashtable: Bounce deferred worker kick through irq_work

2026-04-21T06:10:50+00:00

Inserts past 75% load call schedule_work(&ht->run_work) to kick an
async resize. If a caller holds a raw spinlock (e.g. an
insecure_elasticity user), schedule_work() under that lock records

  caller_lock -> pool->lock -> pi_lock -> rq->__lock

A cycle forms if any of these locks is acquired in the reverse
direction elsewhere. sched_ext, the only current insecure_elasticity
user, hits this: it holds scx_sched_lock across rhashtable inserts of
sub-schedulers, while scx_bypass() takes rq->__lock -> scx_sched_lock.
Exercising the resize path produces:

  Chain exists of:
    &pool->lock --> &rq->__lock --> scx_sched_lock

Bounce the kick from the insert paths through irq_work so
schedule_work() runs from hard IRQ context with the caller's lock no
longer held. rht_deferred_worker()'s self-rearm on error stays on
schedule_work(&ht->run_work) - the worker runs in process context with
no caller lock held, and keeping the self-requeue on @run_work lets
cancel_work_sync() in rhashtable_free_and_destroy() drain it.

v3: Keep rht_deferred_worker()'s self-rearm on schedule_work(&run_work).
    Routing it through irq_work in v2 broke cancel_work_sync()'s
    self-requeue handling - an irq_work queued after irq_work_sync()
    returned but while cancel_work_sync() was still waiting could fire
    post-teardown.

v2: Bounce unconditionally instead of gating on insecure_elasticity,
    as suggested by Herbert.

Signed-off-by: Tejun Heo 
Acked-by: Herbert Xu

rhashtable: Restore insecure_elasticity toggle

2026-04-19T15:47:21+00:00

Some users of rhashtable cannot handle insertion failures, and
are happy to accept the consequences of a hash table that having
very long chains.

Restore the insecure_elasticity toggle for these users.  In
addition to disabling the chain length checks, this also removes
the emergency resize that would otherwise occur when the hash
table occupancy hits 100% (an async resize is still scheduled
at 75%).

Signed-off-by: Herbert Xu 
Signed-off-by: Tejun Heo

rhashtable: Enable context analysis

2026-01-05T15:43:35+00:00

Enable context analysis for rhashtable, which was used as an initial
test as it contains a combination of RCU, mutex, and bit_spinlock usage.

Users of rhashtable now also benefit from annotations on the API, which
will now warn if the RCU read lock is not held where required.

Signed-off-by: Marco Elver 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://patch.msgid.link/20251219154418.3592607-33-elver@google.com