linux-stable.git/kernel/bpf/hashtab.c, branch v6.1.78

bpf: Add map and need_defer parameters to .map_fd_put_ptr()

2024-01-25T23:27:26+00:00

[ Upstream commit 20c20bd11a0702ce4dc9300c3da58acf551d9725 ]

map is the pointer of outer map, and need_defer needs some explanation.
need_defer tells the implementation to defer the reference release of
the passed element and ensure that the element is still alive before
the bpf program, which may manipulate it, exits.

The following three cases will invoke map_fd_put_ptr() and different
need_defer values will be passed to these callers:

1) release the reference of the old element in the map during map update
   or map deletion. The release must be deferred, otherwise the bpf
   program may incur use-after-free problem, so need_defer needs to be
   true.
2) release the reference of the to-be-added element in the error path of
   map update. The to-be-added element is not visible to any bpf
   program, so it is OK to pass false for need_defer parameter.
3) release the references of all elements in the map during map release.
   Any bpf program which has access to the map must have been exited and
   released, so need_defer=false will be OK.

These two parameters will be used by the following patches to fix the
potential use-after-free problem for map-in-map.

Signed-off-by: Hou Tao 
Link: https://lore.kernel.org/r/20231204140425.1480317-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov 
Stable-dep-of: 876673364161 ("bpf: Defer the free of inner map when necessary")
Signed-off-by: Sasha Levin

bpf: Fix unnecessary -EBUSY from htab_lock_bucket

2023-11-20T10:51:55+00:00

[ Upstream commit d35381aa73f7e1e8b25f3ed5283287a64d9ddff5 ]

htab_lock_bucket uses the following logic to avoid recursion:

1. preempt_disable();
2. check percpu counter htab->map_locked[hash] for recursion;
   2.1. if map_lock[hash] is already taken, return -BUSY;
3. raw_spin_lock_irqsave();

However, if an IRQ hits between 2 and 3, BPF programs attached to the IRQ
logic will not able to access the same hash of the hashtab and get -EBUSY.

This -EBUSY is not really necessary. Fix it by disabling IRQ before
checking map_locked:

1. preempt_disable();
2. local_irq_save();
3. check percpu counter htab->map_locked[hash] for recursion;
   3.1. if map_lock[hash] is already taken, return -BUSY;
4. raw_spin_lock().

Similarly, use raw_spin_unlock() and local_irq_restore() in
htab_unlock_bucket().

Fixes: 20b6cc34ea74 ("bpf: Avoid hashtab deadlock with map_locked")
Suggested-by: Tejun Heo 
Signed-off-by: Song Liu 
Signed-off-by: Andrii Nakryiko 
Signed-off-by: Daniel Borkmann 
Link: https://lore.kernel.org/bpf/7a9576222aa40b1c84ad3a9ba3e64011d1a04d41.camel@linux.ibm.com
Link: https://lore.kernel.org/bpf/20231012055741.3375999-1-song@kernel.org
Signed-off-by: Sasha Levin

bpf: fix a memory leak in the LRU and LRU_PERCPU hash maps

2023-05-30T13:03:21+00:00

commit b34ffb0c6d23583830f9327864b9c1f486003305 upstream.

The LRU and LRU_PERCPU maps allocate a new element on update before locking the
target hash table bucket. Right after that the maps try to lock the bucket.
If this fails, then maps return -EBUSY to the caller without releasing the
allocated element. This makes the element untracked: it doesn't belong to
either of free lists, and it doesn't belong to the hash table, so can't be
re-used; this eventually leads to the permanent -ENOMEM on LRU map updates,
which is unexpected. Fix this by returning the element to the local free list
if bucket locking fails.

Fixes: 20b6cc34ea74 ("bpf: Avoid hashtab deadlock with map_locked")
Signed-off-by: Anton Protopopov 
Link: https://lore.kernel.org/r/20230522154558.2166815-1-aspsk@isovalent.com
Signed-off-by: Martin KaFai Lau 
Signed-off-by: Greg Kroah-Hartman

bpf: Zeroing allocated object from slab in bpf memory allocator

2023-03-10T08:33:06+00:00

[ Upstream commit 997849c4b969034e225153f41026657def66d286 ]

Currently the freed element in bpf memory allocator may be immediately
reused, for htab map the reuse will reinitialize special fields in map
value (e.g., bpf_spin_lock), but lookup procedure may still access
these special fields, and it may lead to hard-lockup as shown below:

 NMI backtrace for cpu 16
 CPU: 16 PID: 2574 Comm: htab.bin Tainted: G             L     6.1.0+ #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
 RIP: 0010:queued_spin_lock_slowpath+0x283/0x2c0
 ......
 Call Trace:
  
  copy_map_value_locked+0xb7/0x170
  bpf_map_copy_value+0x113/0x3c0
  __sys_bpf+0x1c67/0x2780
  __x64_sys_bpf+0x1c/0x20
  do_syscall_64+0x30/0x60
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
 ......
  

For htab map, just like the preallocated case, these is no need to
initialize these special fields in map value again once these fields
have been initialized. For preallocated htab map, these fields are
initialized through __GFP_ZERO in bpf_map_area_alloc(), so do the
similar thing for non-preallocated htab in bpf memory allocator. And
there is no need to use __GFP_ZERO for per-cpu bpf memory allocator,
because __alloc_percpu_gfp() does it implicitly.

Fixes: 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
Signed-off-by: Hou Tao 
Link: https://lore.kernel.org/r/20230215082132.3856544-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov 
Signed-off-by: Sasha Levin

bpf: hash map, avoid deadlock with suitable hash mask

2023-02-01T07:34:09+00:00

[ Upstream commit 9f907439dc80e4a2fcfb949927b36c036468dbb3 ]

The deadlock still may occur while accessed in NMI and non-NMI
context. Because in NMI, we still may access the same bucket but with
different map_locked index.

For example, on the same CPU, .max_entries = 2, we update the hash map,
with key = 4, while running bpf prog in NMI nmi_handle(), to update
hash map with key = 20, so it will have the same bucket index but have
different map_locked index.

To fix this issue, using min mask to hash again.

Fixes: 20b6cc34ea74 ("bpf: Avoid hashtab deadlock with map_locked")
Signed-off-by: Tonghao Zhang 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Andrii Nakryiko 
Cc: Martin KaFai Lau 
Cc: Song Liu 
Cc: Yonghong Song 
Cc: John Fastabend 
Cc: KP Singh 
Cc: Stanislav Fomichev 
Cc: Hao Luo 
Cc: Jiri Olsa 
Cc: Hou Tao 
Acked-by: Yonghong Song 
Acked-by: Hou Tao 
Link: https://lore.kernel.org/r/20230111092903.92389-1-tong@infragraf.org
Signed-off-by: Martin KaFai Lau 
Signed-off-by: Sasha Levin

treewide: use get_random_u32() when possible

2022-10-11T23:42:58+00:00

The prandom_u32() function has been a deprecated inline wrapper around
get_random_u32() for several releases now, and compiles down to the
exact same code. Replace the deprecated wrapper with a direct call to
the real function. The same also applies to get_random_int(), which is
just a wrapper around get_random_u32(). This was done as a basic find
and replace.

Reviewed-by: Greg Kroah-Hartman 
Reviewed-by: Kees Cook 
Reviewed-by: Yury Norov 
Reviewed-by: Jan Kara  # for ext4
Acked-by: Toke Høiland-Jørgensen  # for sch_cake
Acked-by: Chuck Lever  # for nfsd
Acked-by: Jakub Kicinski 
Acked-by: Mika Westerberg  # for thunderbolt
Acked-by: Darrick J. Wong  # for xfs
Acked-by: Helge Deller  # for parisc
Acked-by: Heiko Carstens  # for s390
Signed-off-by: Jason A. Donenfeld

bpf: Always use raw spinlock for hash bucket lock

2022-09-22T01:08:54+00:00

For a non-preallocated hash map on RT kernel, regular spinlock instead
of raw spinlock is used for bucket lock. The reason is that on RT kernel
memory allocation is forbidden under atomic context and regular spinlock
is sleepable under RT.

Now hash map has been fully converted to use bpf_map_alloc, and there
will be no synchronous memory allocation for non-preallocated hash map,
so it is safe to always use raw spinlock for bucket lock on RT. So
removing the usage of htab_use_raw_lock() and updating the comments
accordingly.

Signed-off-by: Hou Tao 
Link: https://lore.kernel.org/r/20220921073826.2365800-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov

bpf: add missing percpu_counter_destroy() in htab_map_alloc()

2022-09-10T23:04:07+00:00

syzbot is reporting ODEBUG bug in htab_map_alloc() [1], for
commit 86fe28f7692d96d2 ("bpf: Optimize element count in non-preallocated
hash map.") added percpu_counter_init() to htab_map_alloc() but forgot to
add percpu_counter_destroy() to the error path.

Link: https://syzkaller.appspot.com/bug?extid=5d1da78b375c3b5e6c2b [1]
Reported-by: syzbot 
Signed-off-by: Tetsuo Handa 
Fixes: 86fe28f7692d96d2 ("bpf: Optimize element count in non-preallocated hash map.")
Reviewed-by: Stanislav Fomichev 
Link: https://lore.kernel.org/r/e2e4cc0e-9d36-4ca1-9bfa-ce23e6f8310b@I-love.SAKURA.ne.jp
Signed-off-by: Alexei Starovoitov

bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.

2022-09-05T13:33:07+00:00

User space might be creating and destroying a lot of hash maps. Synchronous
rcu_barrier-s in a destruction path of hash map delay freeing of hash buckets
and other map memory and may cause artificial OOM situation under stress.
Optimize rcu_barrier usage between bpf hash map and bpf_mem_alloc:
- remove rcu_barrier from hash map, since htab doesn't use call_rcu
  directly and there are no callback to wait for.
- bpf_mem_alloc has call_rcu_in_progress flag that indicates pending callbacks.
  Use it to avoid barriers in fast path.
- When barriers are needed copy bpf_mem_alloc into temp structure
  and wait for rcu barrier-s in the worker to let the rest of
  hash map freeing to proceed.

Signed-off-by: Alexei Starovoitov 
Signed-off-by: Daniel Borkmann 
Link: https://lore.kernel.org/bpf/20220902211058.60789-17-alexei.starovoitov@gmail.com

bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.

2022-09-05T13:33:06+00:00

Convert dynamic allocations in percpu hash map from alloc_percpu() to
bpf_mem_cache_alloc() from per-cpu bpf_mem_alloc. Since bpf_mem_alloc frees
objects after RCU gp the call_rcu() is removed. pcpu_init_value() now needs to
zero-fill per-cpu allocations, since dynamically allocated map elements are now
similar to full prealloc, since alloc_percpu() is not called inline and the
elements are reused in the freelist.

Signed-off-by: Alexei Starovoitov 
Signed-off-by: Daniel Borkmann 
Acked-by: Kumar Kartikeya Dwivedi 
Acked-by: Andrii Nakryiko 
Link: https://lore.kernel.org/bpf/20220902211058.60789-12-alexei.starovoitov@gmail.com