summaryrefslogtreecommitdiff
path: root/kernel/bpf
AgeCommit message (Collapse)Author
2023-05-11bpf: Don't EFAULT for getsockopt with optval=NULLStanislav Fomichev
[ Upstream commit 00e74ae0863827d944e36e56a4ce1e77e50edb91 ] Some socket options do getsockopt with optval=NULL to estimate the size of the final buffer (which is returned via optlen). This breaks BPF getsockopt assumptions about permitted optval buffer size. Let's enforce these assumptions only when non-NULL optval is provided. Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") Reported-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/ZD7Js4fj5YyI2oLd@google.com/T/#mb68daf700f87a9244a15d01d00c3f0e5b08f49f7 Link: https://lore.kernel.org/bpf/20230418225343.553806-2-sdf@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf: Fix race between btf_put and btf_idr walk.Alexei Starovoitov
[ Upstream commit acf1c3d68e9a31f10d92bc67ad4673cdae5e8d92 ] Florian and Eduard reported hard dead lock: [ 58.433327] _raw_spin_lock_irqsave+0x40/0x50 [ 58.433334] btf_put+0x43/0x90 [ 58.433338] bpf_find_btf_id+0x157/0x240 [ 58.433353] btf_parse_fields+0x921/0x11c0 This happens since btf->refcount can be 1 at the time of btf_put() and btf_put() will call btf_free_id() which will try to grab btf_idr_lock and will dead lock. Avoid the issue by doing btf_put() without locking. Fixes: 3d78417b60fb ("bpf: Add bpf_btf_find_by_name_kind() helper.") Fixes: 1e89106da253 ("bpf: Add bpf_core_add_cands() and wire it into bpf_core_apply_relo_insn().") Reported-by: Florian Westphal <fw@strlen.de> Reported-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20230421014901.70908-1-alexei.starovoitov@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf/btf: Fix is_int_ptr()Feng Zhou
[ Upstream commit 91f2dc6838c19342f7f2993627c622835cc24890 ] When tracing a kernel function with arg type is u32*, btf_ctx_access() would report error: arg2 type INT is not a struct. The commit bb6728d75611 ("bpf: Allow access to int pointer arguments in tracing programs") added support for int pointer, but did not skip modifiers before checking it's type. This patch fixes it. Fixes: bb6728d75611 ("bpf: Allow access to int pointer arguments in tracing programs") Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20230410085908.98493-2-zhoufeng.zf@bytedance.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf: Fix struct_meta lookup for bpf_obj_free_fields kfunc callDave Marchevsky
[ Upstream commit f6a6a5a976288e4d0d94eb1c6c9e983e8e5cdb31 ] bpf_obj_drop_impl has a void return type. In check_kfunc_call, the "else if" which sets insn_aux->kptr_struct_meta for bpf_obj_drop_impl is surrounded by a larger if statement which checks btf_type_is_ptr. As a result: * The bpf_obj_drop_impl-specific code will never execute * The btf_struct_meta input to bpf_obj_drop is always NULL * __bpf_obj_drop_impl will always see a NULL btf_record when called from BPF program, and won't call bpf_obj_free_fields * program-allocated kptrs which have fields that should be cleaned up by bpf_obj_free_fields may instead leak resources This patch adds a btf_type_is_void branch to the larger if and moves special handling for bpf_obj_drop_impl there, fixing the issue. Fixes: ac9f06050a35 ("bpf: Introduce bpf_obj_drop") Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230403200027.2271029-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf: factor out fetching basic kfunc metadataAndrii Nakryiko
[ Upstream commit 07236eab7a3139da97aef9f5f21f403be82a82ea ] Factor out logic to fetch basic kfunc metadata based on struct bpf_insn. This is not exactly short or trivial code to just copy/paste and this information is sometimes necessary in other parts of the verifier logic. Subsequent patches will rely on this to determine if an instruction is a kfunc call to iterator next method. No functional changes intended, including that verbose() warning behavior when kfunc is not allowed for a particular program type. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230308184121.1165081-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Stable-dep-of: f6a6a5a97628 ("bpf: Fix struct_meta lookup for bpf_obj_free_fields kfunc call") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf: Add callback validation to kfunc verifier logicDave Marchevsky
[ Upstream commit 5d92ddc3de1b44a82108af68ed71f638ca20509a ] Some BPF helpers take a callback function which the helper calls. For each helper that takes such a callback, there's a special call to __check_func_call with a callback-state-setting callback that sets up verifier bpf_func_state for the callback's frame. kfuncs don't have any of this infrastructure yet, so let's add it in this patch, following existing helper pattern as much as possible. To validate functionality of this added plumbing, this patch adds callback handling for the bpf_rbtree_add kfunc and hopes to lay groundwork for future graph datastructure callbacks. In the "general plumbing" category we have: * check_kfunc_call doing callback verification right before clearing CALLER_SAVED_REGS, exactly like check_helper_call * recognition of func_ptr BTF types in kfunc args as KF_ARG_PTR_TO_CALLBACK + propagation of subprogno for this arg type In the "rbtree_add / graph datastructure-specific plumbing" category: * Since bpf_rbtree_add must be called while the spin_lock associated with the tree is held, don't complain when callback's func_state doesn't unlock it by frame exit * Mark rbtree_add callback's args with ref_set_non_owning to prevent rbtree api functions from being called in the callback. Semantically this makes sense, as less() takes no ownership of its args when determining which comes first. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230214004017.2534011-5-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Stable-dep-of: f6a6a5a97628 ("bpf: Fix struct_meta lookup for bpf_obj_free_fields kfunc call") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf: Add support for bpf_rb_root and bpf_rb_node in kfunc argsDave Marchevsky
[ Upstream commit cd6791b4b6f66f6b7925c840efe5c8fa0ce1ac87 ] Now that we find bpf_rb_root and bpf_rb_node in structs, let's give args that contain those types special classification and properly handle these types when checking kfunc args. "Properly handling" these types largely requires generalizing similar handling for bpf_list_{head,node}, with little new logic added in this patch. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230214004017.2534011-4-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Stable-dep-of: f6a6a5a97628 ("bpf: Fix struct_meta lookup for bpf_obj_free_fields kfunc call") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf: Add bpf_rbtree_{add,remove,first} kfuncsDave Marchevsky
[ Upstream commit bd1279ae8a691d7ec75852c6d0a22139afb034a4 ] This patch adds implementations of bpf_rbtree_{add,remove,first} and teaches verifier about their BTF_IDs as well as those of bpf_rb_{root,node}. All three kfuncs have some nonstandard component to their verification that needs to be addressed in future patches before programs can properly use them: * bpf_rbtree_add: Takes 'less' callback, need to verify it * bpf_rbtree_first: Returns ptr_to_node_type(off=rb_node_off) instead of ptr_to_rb_node(off=0). Return value ref is non-owning. * bpf_rbtree_remove: Returns ptr_to_node_type(off=rb_node_off) instead of ptr_to_rb_node(off=0). 2nd arg (node) is a non-owning reference. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230214004017.2534011-3-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Stable-dep-of: f6a6a5a97628 ("bpf: Fix struct_meta lookup for bpf_obj_free_fields kfunc call") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf: Add basic bpf_rb_{root,node} supportDave Marchevsky
[ Upstream commit 9c395c1b99bd23f74bc628fa000480c49593d17f ] This patch adds special BPF_RB_{ROOT,NODE} btf_field_types similar to BPF_LIST_{HEAD,NODE}, adds the necessary plumbing to detect the new types, and adds bpf_rb_root_free function for freeing bpf_rb_root in map_values. structs bpf_rb_root and bpf_rb_node are opaque types meant to obscure structs rb_root_cached rb_node, respectively. btf_struct_access will prevent BPF programs from touching these special fields automatically now that they're recognized. btf_check_and_fixup_fields now groups list_head and rb_root together as "graph root" fields and {list,rb}_node as "graph node", and does same ownership cycle checking as before. Note that this function does _not_ prevent ownership type mixups (e.g. rb_root owning list_node) - that's handled by btf_parse_graph_root. After this patch, a bpf program can have a struct bpf_rb_root in a map_value, but not add anything to nor do anything useful with it. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230214004017.2534011-2-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Stable-dep-of: f6a6a5a97628 ("bpf: Fix struct_meta lookup for bpf_obj_free_fields kfunc call") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf: Migrate release_on_unlock logic to non-owning ref semanticsDave Marchevsky
[ Upstream commit 6a3cd3318ff65622415e34e8ee39d76331e7c869 ] This patch introduces non-owning reference semantics to the verifier, specifically linked_list API kfunc handling. release_on_unlock logic for refs is refactored - with small functional changes - to implement these semantics, and bpf_list_push_{front,back} are migrated to use them. When a list node is pushed to a list, the program still has a pointer to the node: n = bpf_obj_new(typeof(*n)); bpf_spin_lock(&l); bpf_list_push_back(&l, n); /* n still points to the just-added node */ bpf_spin_unlock(&l); What the verifier considers n to be after the push, and thus what can be done with n, are changed by this patch. Common properties both before/after this patch: * After push, n is only a valid reference to the node until end of critical section * After push, n cannot be pushed to any list * After push, the program can read the node's fields using n Before: * After push, n retains the ref_obj_id which it received on bpf_obj_new, but the associated bpf_reference_state's release_on_unlock field is set to true * release_on_unlock field and associated logic is used to implement "n is only a valid ref until end of critical section" * After push, n cannot be written to, the node must be removed from the list before writing to its fields * After push, n is marked PTR_UNTRUSTED After: * After push, n's ref is released and ref_obj_id set to 0. NON_OWN_REF type flag is added to reg's type, indicating that it's a non-owning reference. * NON_OWN_REF flag and logic is used to implement "n is only a valid ref until end of critical section" * n can be written to (except for special fields e.g. bpf_list_node, timer, ...) Summary of specific implementation changes to achieve the above: * release_on_unlock field, ref_set_release_on_unlock helper, and logic to "release on unlock" based on that field are removed * The anonymous active_lock struct used by bpf_verifier_state is pulled out into a named struct bpf_active_lock. * NON_OWN_REF type flag is introduced along with verifier logic changes to handle non-owning refs * Helpers are added to use NON_OWN_REF flag to implement non-owning ref semantics as described above * invalidate_non_owning_refs - helper to clobber all non-owning refs matching a particular bpf_active_lock identity. Replaces release_on_unlock logic in process_spin_lock. * ref_set_non_owning - set NON_OWN_REF type flag after doing some sanity checking * ref_convert_owning_non_owning - convert owning reference w/ specified ref_obj_id to non-owning references. Set NON_OWN_REF flag for each reg with that ref_obj_id and 0-out its ref_obj_id * Update linked_list selftests to account for minor semantic differences introduced by this patch * Writes to a release_on_unlock node ref are not allowed, while writes to non-owning reference pointees are. As a result the linked_list "write after push" failure tests are no longer scenarios that should fail. * The test##missing_lock##op and test##incorrect_lock##op macro-generated failure tests need to have a valid node argument in order to have the same error output as before. Otherwise verification will fail early and the expected error output won't be seen. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230212092715.1422619-2-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Stable-dep-of: f6a6a5a97628 ("bpf: Fix struct_meta lookup for bpf_obj_free_fields kfunc call") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf: rename list_head -> graph_root in field info typesDave Marchevsky
[ Upstream commit 30465003ad776a922c32b2dac58db14f120f037e ] Many of the structs recently added to track field info for linked-list head are useful as-is for rbtree root. So let's do a mechanical renaming of list_head-related types and fields: include/linux/bpf.h: struct btf_field_list_head -> struct btf_field_graph_root list_head -> graph_root in struct btf_field union kernel/bpf/btf.c: list_head -> graph_root in struct btf_field_info This is a nonfunctional change, functionality to actually use these fields for rbtree will be added in further patches. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20221217082506.1570898-5-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Stable-dep-of: f6a6a5a97628 ("bpf: Fix struct_meta lookup for bpf_obj_free_fields kfunc call") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf: Fix __reg_bound_offset 64->32 var_off subreg propagationDaniel Borkmann
[ Upstream commit 7be14c1c9030f73cc18b4ff23b78a0a081f16188 ] Xu reports that after commit 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking"), the following BPF program is rejected by the verifier: 0: (61) r2 = *(u32 *)(r1 +0) ; R2_w=pkt(off=0,r=0,imm=0) 1: (61) r3 = *(u32 *)(r1 +4) ; R3_w=pkt_end(off=0,imm=0) 2: (bf) r1 = r2 3: (07) r1 += 1 4: (2d) if r1 > r3 goto pc+8 5: (71) r1 = *(u8 *)(r2 +0) ; R1_w=scalar(umax=255,var_off=(0x0; 0xff)) 6: (18) r0 = 0x7fffffffffffff10 8: (0f) r1 += r0 ; R1_w=scalar(umin=0x7fffffffffffff10,umax=0x800000000000000f) 9: (18) r0 = 0x8000000000000000 11: (07) r0 += 1 12: (ad) if r0 < r1 goto pc-2 13: (b7) r0 = 0 14: (95) exit And the verifier log says: func#0 @0 0: R1=ctx(off=0,imm=0) R10=fp0 0: (61) r2 = *(u32 *)(r1 +0) ; R1=ctx(off=0,imm=0) R2_w=pkt(off=0,r=0,imm=0) 1: (61) r3 = *(u32 *)(r1 +4) ; R1=ctx(off=0,imm=0) R3_w=pkt_end(off=0,imm=0) 2: (bf) r1 = r2 ; R1_w=pkt(off=0,r=0,imm=0) R2_w=pkt(off=0,r=0,imm=0) 3: (07) r1 += 1 ; R1_w=pkt(off=1,r=0,imm=0) 4: (2d) if r1 > r3 goto pc+8 ; R1_w=pkt(off=1,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) 5: (71) r1 = *(u8 *)(r2 +0) ; R1_w=scalar(umax=255,var_off=(0x0; 0xff)) R2_w=pkt(off=0,r=1,imm=0) 6: (18) r0 = 0x7fffffffffffff10 ; R0_w=9223372036854775568 8: (0f) r1 += r0 ; R0_w=9223372036854775568 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775823,s32_min=-240,s32_max=15) 9: (18) r0 = 0x8000000000000000 ; R0_w=-9223372036854775808 11: (07) r0 += 1 ; R0_w=-9223372036854775807 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775809) 13: (b7) r0 = 0 ; R0_w=0 14: (95) exit from 12 to 11: R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775810,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775806 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775806 R1_w=scalar(umin=9223372036854775810,umax=9223372036854775810,var_off=(0x8000000000000000; 0xffffffff)) 13: safe [...] from 12 to 11: R0_w=-9223372036854775795 R1=scalar(umin=9223372036854775822,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775794 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775794 R1=scalar(umin=9223372036854775822,umax=9223372036854775822,var_off=(0x8000000000000000; 0xffffffff)) 13: safe from 12 to 11: R0_w=-9223372036854775794 R1=scalar(umin=9223372036854775823,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775793 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775793 R1=scalar(umin=9223372036854775823,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) 13: safe from 12 to 11: R0_w=-9223372036854775793 R1=scalar(umin=9223372036854775824,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775792 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775792 R1=scalar(umin=9223372036854775824,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) 13: safe [...] The 64bit umin=9223372036854775810 bound continuously bumps by +1 while umax=9223372036854775823 stays as-is until the verifier complexity limit is reached and the program gets finally rejected. During this simulation, the umin also eventually surpasses umax. Looking at the first 'from 12 to 11' output line from the loop, R1 has the following state: R1_w=scalar(umin=0x8000000000000002 (9223372036854775810), umax=0x800000000000000f (9223372036854775823), var_off=(0x8000000000000000; 0xffffffff)) The var_off has technically not an inconsistent state but it's very imprecise and far off surpassing 64bit umax bounds whereas the expected output with refined known bits in var_off should have been like: R1_w=scalar(umin=0x8000000000000002 (9223372036854775810), umax=0x800000000000000f (9223372036854775823), var_off=(0x8000000000000000; 0xf)) In the above log, var_off stays as var_off=(0x8000000000000000; 0xffffffff) and does not converge into a narrower mask where more bits become known, eventually transforming R1 into a constant upon umin=9223372036854775823, umax=9223372036854775823 case where the verifier would have terminated and let the program pass. The __reg_combine_64_into_32() marks the subregister unknown and propagates 64bit {s,u}min/{s,u}max bounds to their 32bit equivalents iff they are within the 32bit universe. The question came up whether __reg_combine_64_into_32() should special case the situation that when 64bit {s,u}min bounds have the same value as 64bit {s,u}max bounds to then assign the latter as well to the 32bit reg->{s,u}32_{min,max}_value. As can be seen from the above example however, that is just /one/ special case and not a /generic/ solution given above example would still not be addressed this way and remain at an imprecise var_off=(0x8000000000000000; 0xffffffff). The improvement is needed in __reg_bound_offset() to refine var32_off with the updated var64_off instead of the prior reg->var_off. The reg_bounds_sync() code first refines information about the register's min/max bounds via __update_reg_bounds() from the current var_off, then in __reg_deduce_bounds() from sign bit and with the potentially learned bits from bounds it'll update the var_off tnum in __reg_bound_offset(). For example, intersecting with the old var_off might have improved bounds slightly, e.g. if umax was 0x7f...f and var_off was (0; 0xf...fc), then new var_off will then result in (0; 0x7f...fc). The intersected var64_off holds then the universe which is a superset of var32_off. The point for the latter is not to broaden, but to further refine known bits based on the intersection of var_off with 32 bit bounds, so that we later construct the final var_off from upper and lower 32 bits. The final __update_reg_bounds() can then potentially still slightly refine bounds if more bits became known from the new var_off. After the improvement, we can see R1 converging successively: func#0 @0 0: R1=ctx(off=0,imm=0) R10=fp0 0: (61) r2 = *(u32 *)(r1 +0) ; R1=ctx(off=0,imm=0) R2_w=pkt(off=0,r=0,imm=0) 1: (61) r3 = *(u32 *)(r1 +4) ; R1=ctx(off=0,imm=0) R3_w=pkt_end(off=0,imm=0) 2: (bf) r1 = r2 ; R1_w=pkt(off=0,r=0,imm=0) R2_w=pkt(off=0,r=0,imm=0) 3: (07) r1 += 1 ; R1_w=pkt(off=1,r=0,imm=0) 4: (2d) if r1 > r3 goto pc+8 ; R1_w=pkt(off=1,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) 5: (71) r1 = *(u8 *)(r2 +0) ; R1_w=scalar(umax=255,var_off=(0x0; 0xff)) R2_w=pkt(off=0,r=1,imm=0) 6: (18) r0 = 0x7fffffffffffff10 ; R0_w=9223372036854775568 8: (0f) r1 += r0 ; R0_w=9223372036854775568 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775823,s32_min=-240,s32_max=15) 9: (18) r0 = 0x8000000000000000 ; R0_w=-9223372036854775808 11: (07) r0 += 1 ; R0_w=-9223372036854775807 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775809) 13: (b7) r0 = 0 ; R0_w=0 14: (95) exit from 12 to 11: R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775810,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775806 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775806 R1_w=-9223372036854775806 13: safe from 12 to 11: R0_w=-9223372036854775806 R1_w=scalar(umin=9223372036854775811,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775805 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775805 R1_w=-9223372036854775805 13: safe [...] from 12 to 11: R0_w=-9223372036854775798 R1=scalar(umin=9223372036854775819,umax=9223372036854775823,var_off=(0x8000000000000008; 0x7),s32_min=8,s32_max=15,u32_min=8,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775797 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775797 R1=-9223372036854775797 13: safe from 12 to 11: R0_w=-9223372036854775797 R1=scalar(umin=9223372036854775820,umax=9223372036854775823,var_off=(0x800000000000000c; 0x3),s32_min=12,s32_max=15,u32_min=12,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775796 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775796 R1=-9223372036854775796 13: safe from 12 to 11: R0_w=-9223372036854775796 R1=scalar(umin=9223372036854775821,umax=9223372036854775823,var_off=(0x800000000000000c; 0x3),s32_min=12,s32_max=15,u32_min=12,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775795 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775795 R1=-9223372036854775795 13: safe from 12 to 11: R0_w=-9223372036854775795 R1=scalar(umin=9223372036854775822,umax=9223372036854775823,var_off=(0x800000000000000e; 0x1),s32_min=14,s32_max=15,u32_min=14,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775794 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775794 R1=-9223372036854775794 13: safe from 12 to 11: R0_w=-9223372036854775794 R1=-9223372036854775793 R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775793 12: (ad) if r0 < r1 goto pc-2 last_idx 12 first_idx 12 parent didn't have regs=1 stack=0 marks: R0_rw=P-9223372036854775801 R1_r=scalar(umin=9223372036854775815,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 last_idx 11 first_idx 11 regs=1 stack=0 before 11: (07) r0 += 1 parent didn't have regs=1 stack=0 marks: R0_rw=P-9223372036854775805 R1_rw=scalar(umin=9223372036854775812,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 last_idx 12 first_idx 0 regs=1 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=1 stack=0 before 11: (07) r0 += 1 regs=1 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=1 stack=0 before 11: (07) r0 += 1 regs=1 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=1 stack=0 before 11: (07) r0 += 1 regs=1 stack=0 before 9: (18) r0 = 0x8000000000000000 last_idx 12 first_idx 12 parent didn't have regs=2 stack=0 marks: R0_rw=P-9223372036854775801 R1_r=Pscalar(umin=9223372036854775815,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 last_idx 11 first_idx 11 regs=2 stack=0 before 11: (07) r0 += 1 parent didn't have regs=2 stack=0 marks: R0_rw=P-9223372036854775805 R1_rw=Pscalar(umin=9223372036854775812,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 last_idx 12 first_idx 0 regs=2 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=2 stack=0 before 11: (07) r0 += 1 regs=2 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=2 stack=0 before 11: (07) r0 += 1 regs=2 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=2 stack=0 before 11: (07) r0 += 1 regs=2 stack=0 before 9: (18) r0 = 0x8000000000000000 regs=2 stack=0 before 8: (0f) r1 += r0 regs=3 stack=0 before 6: (18) r0 = 0x7fffffffffffff10 regs=2 stack=0 before 5: (71) r1 = *(u8 *)(r2 +0) 13: safe from 4 to 13: safe verification time 322 usec stack depth 0 processed 56 insns (limit 1000000) max_states_per_insn 1 total_states 3 peak_states 3 mark_read 1 This also fixes up a test case along with this improvement where we match on the verifier log. The updated log now has a refined var_off, too. Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking") Reported-by: Xu Kuohai <xukuohai@huaweicloud.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20230314203424.4015351-2-xukuohai@huaweicloud.com Link: https://lore.kernel.org/bpf/20230322213056.2470-1-daniel@iogearbox.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf: return long from bpf_map_ops funcsJP Kobryn
[ Upstream commit d7ba4cc900bf1eea2d8c807c6b1fc6bd61f41237 ] This patch changes the return types of bpf_map_ops functions to long, where previously int was returned. Using long allows for bpf programs to maintain the sign bit in the absence of sign extension during situations where inlined bpf helper funcs make calls to the bpf_map_ops funcs and a negative error is returned. The definitions of the helper funcs are generated from comments in the bpf uapi header at `include/uapi/linux/bpf.h`. The return type of these helpers was previously changed from int to long in commit bdb7b79b4ce8. For any case where one of the map helpers call the bpf_map_ops funcs that are still returning 32-bit int, a compiler might not include sign extension instructions to properly convert the 32-bit negative value a 64-bit negative value. For example: bpf assembly excerpt of an inlined helper calling a kernel function and checking for a specific error: ; err = bpf_map_update_elem(&mymap, &key, &val, BPF_NOEXIST); ... 46: call 0xffffffffe103291c ; htab_map_update_elem ; if (err && err != -EEXIST) { 4b: cmp $0xffffffffffffffef,%rax ; cmp -EEXIST,%rax kernel function assembly excerpt of return value from `htab_map_update_elem` returning 32-bit int: movl $0xffffffef, %r9d ... movl %r9d, %eax ...results in the comparison: cmp $0xffffffffffffffef, $0x00000000ffffffef Fixes: bdb7b79b4ce8 ("bpf: Switch most helper return values from 32-bit int to 64-bit long") Tested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Link: https://lore.kernel.org/r/20230322194754.185781-3-inwardvessel@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf: Remove misleading spec_v1 check on var-offset stack readLuis Gerhorst
[ Upstream commit 082cdc69a4651dd2a77539d69416a359ed1214f5 ] For every BPF_ADD/SUB involving a pointer, adjust_ptr_min_max_vals() ensures that the resulting pointer has a constant offset if bypass_spec_v1 is false. This is ensured by calling sanitize_check_bounds() which in turn calls check_stack_access_for_ptr_arithmetic(). There, -EACCESS is returned if the register's offset is not constant, thereby rejecting the program. In summary, an unprivileged user must never be able to create stack pointers with a variable offset. That is also the case, because a respective check in check_stack_write() is missing. If they were able to create a variable-offset pointer, users could still use it in a stack-write operation to trigger unsafe speculative behavior [1]. Because unprivileged users must already be prevented from creating variable-offset stack pointers, viable options are to either remove this check (replacing it with a clarifying comment), or to turn it into a "verifier BUG"-message, also adding a similar check in check_stack_write() (for consistency, as a second-level defense). This patch implements the first option to reduce verifier bloat. This check was introduced by commit 01f810ace9ed ("bpf: Allow variable-offset stack access") which correctly notes that "variable-offset reads and writes are disallowed (they were already disallowed for the indirect access case) because the speculative execution checking code doesn't support them". However, it does not further discuss why the check in check_stack_read() is necessary. The code which made this check obsolete was also introduced in this commit. I have compiled ~650 programs from the Linux selftests, Linux samples, Cilium, and libbpf/examples projects and confirmed that none of these trigger the check in check_stack_read() [2]. Instead, all of these programs are, as expected, already rejected when constructing the variable-offset pointers. Note that the check in check_stack_access_for_ptr_arithmetic() also prints "off=%d" while the code removed by this patch does not (the error removed does not appear in the "verification_error" values). For reproducibility, the repository linked includes the raw data and scripts used to create the plot. [1] https://arxiv.org/pdf/1807.03757.pdf [2] https://gitlab.cs.fau.de/un65esoq/bpf-spectre/-/raw/53dc19fcf459c186613b1156a81504b39c8d49db/data/plots/23-02-26_23-56_bpftool/bpftool/0004-errors.pdf?inline=false Fixes: 01f810ace9ed ("bpf: Allow variable-offset stack access") Signed-off-by: Luis Gerhorst <gerhorst@cs.fau.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230315165358.23701-1-gerhorst@cs.fau.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf: fix precision propagation verbose loggingAndrii Nakryiko
[ Upstream commit 34f0677e7afd3a292bc1aadda7ce8e35faedb204 ] Fix wrong order of frame index vs register/slot index in precision propagation verbose (level 2) output. It's wrong and very confusing as is. Fixes: 529409ea92d5 ("bpf: propagate precision across all frames, not just the last one") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230313184017.4083374-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf: take into account liveness when propagating precisionAndrii Nakryiko
[ Upstream commit 52c2b005a3c18c565fc70cfd0ca49375f301e952 ] When doing state comparison, if old state has register that is not marked as REG_LIVE_READ, then we just skip comparison, regardless what's the state of corresponing register in current state. This is because not REG_LIVE_READ register is irrelevant for further program execution and correctness. All good here. But when we get to precision propagation, after two states were declared equivalent, we don't take into account old register's liveness, and thus attempt to propagate precision for register in current state even if that register in old state was not REG_LIVE_READ anymore. This is bad, because register in current state could be anything at all and this could cause -EFAULT due to internal logic bugs. Fix by taking into account REG_LIVE_READ liveness mark to keep the logic in state comparison in sync with precision propagation. Fixes: a3ce685dd01a ("bpf: fix precision tracking") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230309224131.57449-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-26bpf: Fix incorrect verifier pruning due to missing register precision taintsDaniel Borkmann
[ Upstream commit 71b547f561247897a0a14f3082730156c0533fed ] Juan Jose et al reported an issue found via fuzzing where the verifier's pruning logic prematurely marks a program path as safe. Consider the following program: 0: (b7) r6 = 1024 1: (b7) r7 = 0 2: (b7) r8 = 0 3: (b7) r9 = -2147483648 4: (97) r6 %= 1025 5: (05) goto pc+0 6: (bd) if r6 <= r9 goto pc+2 7: (97) r6 %= 1 8: (b7) r9 = 0 9: (bd) if r6 <= r9 goto pc+1 10: (b7) r6 = 0 11: (b7) r0 = 0 12: (63) *(u32 *)(r10 -4) = r0 13: (18) r4 = 0xffff888103693400 // map_ptr(ks=4,vs=48) 15: (bf) r1 = r4 16: (bf) r2 = r10 17: (07) r2 += -4 18: (85) call bpf_map_lookup_elem#1 19: (55) if r0 != 0x0 goto pc+1 20: (95) exit 21: (77) r6 >>= 10 22: (27) r6 *= 8192 23: (bf) r1 = r0 24: (0f) r0 += r6 25: (79) r3 = *(u64 *)(r0 +0) 26: (7b) *(u64 *)(r1 +0) = r3 27: (95) exit The verifier treats this as safe, leading to oob read/write access due to an incorrect verifier conclusion: func#0 @0 0: R1=ctx(off=0,imm=0) R10=fp0 0: (b7) r6 = 1024 ; R6_w=1024 1: (b7) r7 = 0 ; R7_w=0 2: (b7) r8 = 0 ; R8_w=0 3: (b7) r9 = -2147483648 ; R9_w=-2147483648 4: (97) r6 %= 1025 ; R6_w=scalar() 5: (05) goto pc+0 6: (bd) if r6 <= r9 goto pc+2 ; R6_w=scalar(umin=18446744071562067969,var_off=(0xffffffff00000000; 0xffffffff)) R9_w=-2147483648 7: (97) r6 %= 1 ; R6_w=scalar() 8: (b7) r9 = 0 ; R9=0 9: (bd) if r6 <= r9 goto pc+1 ; R6=scalar(umin=1) R9=0 10: (b7) r6 = 0 ; R6_w=0 11: (b7) r0 = 0 ; R0_w=0 12: (63) *(u32 *)(r10 -4) = r0 last_idx 12 first_idx 9 regs=1 stack=0 before 11: (b7) r0 = 0 13: R0_w=0 R10=fp0 fp-8=0000???? 13: (18) r4 = 0xffff8ad3886c2a00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0) 15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0) 16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0 17: (07) r2 += -4 ; R2_w=fp-4 18: (85) call bpf_map_lookup_elem#1 ; R0=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0) 19: (55) if r0 != 0x0 goto pc+1 ; R0=0 20: (95) exit from 19 to 21: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm???? 21: (77) r6 >>= 10 ; R6_w=0 22: (27) r6 *= 8192 ; R6_w=0 23: (bf) r1 = r0 ; R0=map_value(off=0,ks=4,vs=48,imm=0) R1_w=map_value(off=0,ks=4,vs=48,imm=0) 24: (0f) r0 += r6 last_idx 24 first_idx 19 regs=40 stack=0 before 23: (bf) r1 = r0 regs=40 stack=0 before 22: (27) r6 *= 8192 regs=40 stack=0 before 21: (77) r6 >>= 10 regs=40 stack=0 before 19: (55) if r0 != 0x0 goto pc+1 parent didn't have regs=40 stack=0 marks: R0_rw=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0) R6_rw=P0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm???? last_idx 18 first_idx 9 regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1 regs=40 stack=0 before 17: (07) r2 += -4 regs=40 stack=0 before 16: (bf) r2 = r10 regs=40 stack=0 before 15: (bf) r1 = r4 regs=40 stack=0 before 13: (18) r4 = 0xffff8ad3886c2a00 regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0 regs=40 stack=0 before 11: (b7) r0 = 0 regs=40 stack=0 before 10: (b7) r6 = 0 25: (79) r3 = *(u64 *)(r0 +0) ; R0_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar() 26: (7b) *(u64 *)(r1 +0) = r3 ; R1_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar() 27: (95) exit from 9 to 11: R1=ctx(off=0,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0 11: (b7) r0 = 0 ; R0_w=0 12: (63) *(u32 *)(r10 -4) = r0 last_idx 12 first_idx 11 regs=1 stack=0 before 11: (b7) r0 = 0 13: R0_w=0 R10=fp0 fp-8=0000???? 13: (18) r4 = 0xffff8ad3886c2a00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0) 15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0) 16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0 17: (07) r2 += -4 ; R2_w=fp-4 18: (85) call bpf_map_lookup_elem#1 frame 0: propagating r6 last_idx 19 first_idx 11 regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1 regs=40 stack=0 before 17: (07) r2 += -4 regs=40 stack=0 before 16: (bf) r2 = r10 regs=40 stack=0 before 15: (bf) r1 = r4 regs=40 stack=0 before 13: (18) r4 = 0xffff8ad3886c2a00 regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0 regs=40 stack=0 before 11: (b7) r0 = 0 parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_r=P0 R7=0 R8=0 R9=0 R10=fp0 last_idx 9 first_idx 9 regs=40 stack=0 before 9: (bd) if r6 <= r9 goto pc+1 parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_rw=Pscalar() R7_w=0 R8_w=0 R9_rw=0 R10=fp0 last_idx 8 first_idx 0 regs=40 stack=0 before 8: (b7) r9 = 0 regs=40 stack=0 before 7: (97) r6 %= 1 regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2 regs=40 stack=0 before 5: (05) goto pc+0 regs=40 stack=0 before 4: (97) r6 %= 1025 regs=40 stack=0 before 3: (b7) r9 = -2147483648 regs=40 stack=0 before 2: (b7) r8 = 0 regs=40 stack=0 before 1: (b7) r7 = 0 regs=40 stack=0 before 0: (b7) r6 = 1024 19: safe frame 0: propagating r6 last_idx 9 first_idx 0 regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2 regs=40 stack=0 before 5: (05) goto pc+0 regs=40 stack=0 before 4: (97) r6 %= 1025 regs=40 stack=0 before 3: (b7) r9 = -2147483648 regs=40 stack=0 before 2: (b7) r8 = 0 regs=40 stack=0 before 1: (b7) r7 = 0 regs=40 stack=0 before 0: (b7) r6 = 1024 from 6 to 9: safe verification time 110 usec stack depth 4 processed 36 insns (limit 1000000) max_states_per_insn 0 total_states 3 peak_states 3 mark_read 2 The verifier considers this program as safe by mistakenly pruning unsafe code paths. In the above func#0, code lines 0-10 are of interest. In line 0-3 registers r6 to r9 are initialized with known scalar values. In line 4 the register r6 is reset to an unknown scalar given the verifier does not track modulo operations. Due to this, the verifier can also not determine precisely which branches in line 6 and 9 are taken, therefore it needs to explore them both. As can be seen, the verifier starts with exploring the false/fall-through paths first. The 'from 19 to 21' path has both r6=0 and r9=0 and the pointer arithmetic on r0 += r6 is therefore considered safe. Given the arithmetic, r6 is correctly marked for precision tracking where backtracking kicks in where it walks back the current path all the way where r6 was set to 0 in the fall-through branch. Next, the pruning logics pops the path 'from 9 to 11' from the stack. Also here, the state of the registers is the same, that is, r6=0 and r9=0, so that at line 19 the path can be pruned as it is considered safe. It is interesting to note that the conditional in line 9 turned r6 into a more precise state, that is, in the fall-through path at the beginning of line 10, it is R6=scalar(umin=1), and in the branch-taken path (which is analyzed here) at the beginning of line 11, r6 turned into a known const r6=0 as r9=0 prior to that and therefore (unsigned) r6 <= 0 concludes that r6 must be 0 (**): [...] ; R6_w=scalar() 9: (bd) if r6 <= r9 goto pc+1 ; R6=scalar(umin=1) R9=0 [...] from 9 to 11: R1=ctx(off=0,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0 [...] The next path is 'from 6 to 9'. The verifier considers the old and current state equivalent, and therefore prunes the search incorrectly. Looking into the two states which are being compared by the pruning logic at line 9, the old state consists of R6_rwD=Pscalar() R9_rwD=0 R10=fp0 and the new state consists of R1=ctx(off=0,imm=0) R6_w=scalar(umax=18446744071562067968) R7_w=0 R8_w=0 R9_w=-2147483648 R10=fp0. While r6 had the reg->precise flag correctly set in the old state, r9 did not. Both r6'es are considered as equivalent given the old one is a superset of the current, more precise one, however, r9's actual values (0 vs 0x80000000) mismatch. Given the old r9 did not have reg->precise flag set, the verifier does not consider the register as contributing to the precision state of r6, and therefore it considered both r9 states as equivalent. However, for this specific pruned path (which is also the actual path taken at runtime), register r6 will be 0x400 and r9 0x80000000 when reaching line 21, thus oob-accessing the map. The purpose of precision tracking is to initially mark registers (including spilled ones) as imprecise to help verifier's pruning logic finding equivalent states it can then prune if they don't contribute to the program's safety aspects. For example, if registers are used for pointer arithmetic or to pass constant length to a helper, then the verifier sets reg->precise flag and backtracks the BPF program instruction sequence and chain of verifier states to ensure that the given register or stack slot including their dependencies are marked as precisely tracked scalar. This also includes any other registers and slots that contribute to a tracked state of given registers/stack slot. This backtracking relies on recorded jmp_history and is able to traverse entire chain of parent states. This process ends only when all the necessary registers/slots and their transitive dependencies are marked as precise. The backtrack_insn() is called from the current instruction up to the first instruction, and its purpose is to compute a bitmask of registers and stack slots that need precision tracking in the parent's verifier state. For example, if a current instruction is r6 = r7, then r6 needs precision after this instruction and r7 needs precision before this instruction, that is, in the parent state. Hence for the latter r7 is marked and r6 unmarked. For the class of jmp/jmp32 instructions, backtrack_insn() today only looks at call and exit instructions and for all other conditionals the masks remain as-is. However, in the given situation register r6 has a dependency on r9 (as described above in **), so also that one needs to be marked for precision tracking. In other words, if an imprecise register influences a precise one, then the imprecise register should also be marked precise. Meaning, in the parent state both dest and src register need to be tracked for precision and therefore the marking must be more conservative by setting reg->precise flag for both. The precision propagation needs to cover both for the conditional: if the src reg was marked but not the dst reg and vice versa. After the fix the program is correctly rejected: func#0 @0 0: R1=ctx(off=0,imm=0) R10=fp0 0: (b7) r6 = 1024 ; R6_w=1024 1: (b7) r7 = 0 ; R7_w=0 2: (b7) r8 = 0 ; R8_w=0 3: (b7) r9 = -2147483648 ; R9_w=-2147483648 4: (97) r6 %= 1025 ; R6_w=scalar() 5: (05) goto pc+0 6: (bd) if r6 <= r9 goto pc+2 ; R6_w=scalar(umin=18446744071562067969,var_off=(0xffffffff80000000; 0x7fffffff),u32_min=-2147483648) R9_w=-2147483648 7: (97) r6 %= 1 ; R6_w=scalar() 8: (b7) r9 = 0 ; R9=0 9: (bd) if r6 <= r9 goto pc+1 ; R6=scalar(umin=1) R9=0 10: (b7) r6 = 0 ; R6_w=0 11: (b7) r0 = 0 ; R0_w=0 12: (63) *(u32 *)(r10 -4) = r0 last_idx 12 first_idx 9 regs=1 stack=0 before 11: (b7) r0 = 0 13: R0_w=0 R10=fp0 fp-8=0000???? 13: (18) r4 = 0xffff9290dc5bfe00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0) 15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0) 16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0 17: (07) r2 += -4 ; R2_w=fp-4 18: (85) call bpf_map_lookup_elem#1 ; R0=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0) 19: (55) if r0 != 0x0 goto pc+1 ; R0=0 20: (95) exit from 19 to 21: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm???? 21: (77) r6 >>= 10 ; R6_w=0 22: (27) r6 *= 8192 ; R6_w=0 23: (bf) r1 = r0 ; R0=map_value(off=0,ks=4,vs=48,imm=0) R1_w=map_value(off=0,ks=4,vs=48,imm=0) 24: (0f) r0 += r6 last_idx 24 first_idx 19 regs=40 stack=0 before 23: (bf) r1 = r0 regs=40 stack=0 before 22: (27) r6 *= 8192 regs=40 stack=0 before 21: (77) r6 >>= 10 regs=40 stack=0 before 19: (55) if r0 != 0x0 goto pc+1 parent didn't have regs=40 stack=0 marks: R0_rw=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0) R6_rw=P0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm???? last_idx 18 first_idx 9 regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1 regs=40 stack=0 before 17: (07) r2 += -4 regs=40 stack=0 before 16: (bf) r2 = r10 regs=40 stack=0 before 15: (bf) r1 = r4 regs=40 stack=0 before 13: (18) r4 = 0xffff9290dc5bfe00 regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0 regs=40 stack=0 before 11: (b7) r0 = 0 regs=40 stack=0 before 10: (b7) r6 = 0 25: (79) r3 = *(u64 *)(r0 +0) ; R0_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar() 26: (7b) *(u64 *)(r1 +0) = r3 ; R1_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar() 27: (95) exit from 9 to 11: R1=ctx(off=0,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0 11: (b7) r0 = 0 ; R0_w=0 12: (63) *(u32 *)(r10 -4) = r0 last_idx 12 first_idx 11 regs=1 stack=0 before 11: (b7) r0 = 0 13: R0_w=0 R10=fp0 fp-8=0000???? 13: (18) r4 = 0xffff9290dc5bfe00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0) 15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0) 16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0 17: (07) r2 += -4 ; R2_w=fp-4 18: (85) call bpf_map_lookup_elem#1 frame 0: propagating r6 last_idx 19 first_idx 11 regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1 regs=40 stack=0 before 17: (07) r2 += -4 regs=40 stack=0 before 16: (bf) r2 = r10 regs=40 stack=0 before 15: (bf) r1 = r4 regs=40 stack=0 before 13: (18) r4 = 0xffff9290dc5bfe00 regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0 regs=40 stack=0 before 11: (b7) r0 = 0 parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_r=P0 R7=0 R8=0 R9=0 R10=fp0 last_idx 9 first_idx 9 regs=40 stack=0 before 9: (bd) if r6 <= r9 goto pc+1 parent didn't have regs=240 stack=0 marks: R1=ctx(off=0,imm=0) R6_rw=Pscalar() R7_w=0 R8_w=0 R9_rw=P0 R10=fp0 last_idx 8 first_idx 0 regs=240 stack=0 before 8: (b7) r9 = 0 regs=40 stack=0 before 7: (97) r6 %= 1 regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2 regs=240 stack=0 before 5: (05) goto pc+0 regs=240 stack=0 before 4: (97) r6 %= 1025 regs=240 stack=0 before 3: (b7) r9 = -2147483648 regs=40 stack=0 before 2: (b7) r8 = 0 regs=40 stack=0 before 1: (b7) r7 = 0 regs=40 stack=0 before 0: (b7) r6 = 1024 19: safe from 6 to 9: R1=ctx(off=0,imm=0) R6_w=scalar(umax=18446744071562067968) R7_w=0 R8_w=0 R9_w=-2147483648 R10=fp0 9: (bd) if r6 <= r9 goto pc+1 last_idx 9 first_idx 0 regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2 regs=240 stack=0 before 5: (05) goto pc+0 regs=240 stack=0 before 4: (97) r6 %= 1025 regs=240 stack=0 before 3: (b7) r9 = -2147483648 regs=40 stack=0 before 2: (b7) r8 = 0 regs=40 stack=0 before 1: (b7) r7 = 0 regs=40 stack=0 before 0: (b7) r6 = 1024 last_idx 9 first_idx 0 regs=200 stack=0 before 6: (bd) if r6 <= r9 goto pc+2 regs=240 stack=0 before 5: (05) goto pc+0 regs=240 stack=0 before 4: (97) r6 %= 1025 regs=240 stack=0 before 3: (b7) r9 = -2147483648 regs=40 stack=0 before 2: (b7) r8 = 0 regs=40 stack=0 before 1: (b7) r7 = 0 regs=40 stack=0 before 0: (b7) r6 = 1024 11: R6=scalar(umax=18446744071562067968) R9=-2147483648 11: (b7) r0 = 0 ; R0_w=0 12: (63) *(u32 *)(r10 -4) = r0 last_idx 12 first_idx 11 regs=1 stack=0 before 11: (b7) r0 = 0 13: R0_w=0 R10=fp0 fp-8=0000???? 13: (18) r4 = 0xffff9290dc5bfe00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0) 15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0) 16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0 17: (07) r2 += -4 ; R2_w=fp-4 18: (85) call bpf_map_lookup_elem#1 ; R0_w=map_value_or_null(id=3,off=0,ks=4,vs=48,imm=0) 19: (55) if r0 != 0x0 goto pc+1 ; R0_w=0 20: (95) exit from 19 to 21: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=scalar(umax=18446744071562067968) R7=0 R8=0 R9=-2147483648 R10=fp0 fp-8=mmmm???? 21: (77) r6 >>= 10 ; R6_w=scalar(umax=18014398507384832,var_off=(0x0; 0x3fffffffffffff)) 22: (27) r6 *= 8192 ; R6_w=scalar(smax=9223372036854767616,umax=18446744073709543424,var_off=(0x0; 0xffffffffffffe000),s32_max=2147475456,u32_max=-8192) 23: (bf) r1 = r0 ; R0=map_value(off=0,ks=4,vs=48,imm=0) R1_w=map_value(off=0,ks=4,vs=48,imm=0) 24: (0f) r0 += r6 last_idx 24 first_idx 21 regs=40 stack=0 before 23: (bf) r1 = r0 regs=40 stack=0 before 22: (27) r6 *= 8192 regs=40 stack=0 before 21: (77) r6 >>= 10 parent didn't have regs=40 stack=0 marks: R0_rw=map_value(off=0,ks=4,vs=48,imm=0) R6_r=Pscalar(umax=18446744071562067968) R7=0 R8=0 R9=-2147483648 R10=fp0 fp-8=mmmm???? last_idx 19 first_idx 11 regs=40 stack=0 before 19: (55) if r0 != 0x0 goto pc+1 regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1 regs=40 stack=0 before 17: (07) r2 += -4 regs=40 stack=0 before 16: (bf) r2 = r10 regs=40 stack=0 before 15: (bf) r1 = r4 regs=40 stack=0 before 13: (18) r4 = 0xffff9290dc5bfe00 regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0 regs=40 stack=0 before 11: (b7) r0 = 0 parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_rw=Pscalar(umax=18446744071562067968) R7_w=0 R8_w=0 R9_w=-2147483648 R10=fp0 last_idx 9 first_idx 0 regs=40 stack=0 before 9: (bd) if r6 <= r9 goto pc+1 regs=240 stack=0 before 6: (bd) if r6 <= r9 goto pc+2 regs=240 stack=0 before 5: (05) goto pc+0 regs=240 stack=0 before 4: (97) r6 %= 1025 regs=240 stack=0 before 3: (b7) r9 = -2147483648 regs=40 stack=0 before 2: (b7) r8 = 0 regs=40 stack=0 before 1: (b7) r7 = 0 regs=40 stack=0 before 0: (b7) r6 = 1024 math between map_value pointer and register with unbounded min value is not allowed verification time 886 usec stack depth 4 processed 49 insns (limit 1000000) max_states_per_insn 1 total_states 5 peak_states 5 mark_read 2 Fixes: b5dc0163d8fd ("bpf: precise scalar_value tracking") Reported-by: Juan Jose Lopez Jaimez <jjlopezjaimez@google.com> Reported-by: Meador Inge <meadori@google.com> Reported-by: Simon Scannell <simonscannell@google.com> Reported-by: Nenad Stojanovski <thenenadx@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Reviewed-by: Juan Jose Lopez Jaimez <jjlopezjaimez@google.com> Reviewed-by: Meador Inge <meadori@google.com> Reviewed-by: Simon Scannell <simonscannell@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-30bpf: Adjust insufficient default bpf_jit_limitDaniel Borkmann
[ Upstream commit 10ec8ca8ec1a2f04c4ed90897225231c58c124a7 ] We've seen recent AWS EKS (Kubernetes) user reports like the following: After upgrading EKS nodes from v20230203 to v20230217 on our 1.24 EKS clusters after a few days a number of the nodes have containers stuck in ContainerCreating state or liveness/readiness probes reporting the following error: Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "4a11039f730203ffc003b7[...]": OCI runtime exec failed: exec failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524: unknown However, we had not been seeing this issue on previous AMIs and it only started to occur on v20230217 (following the upgrade from kernel 5.4 to 5.10) with no other changes to the underlying cluster or workloads. We tried the suggestions from that issue (sysctl net.core.bpf_jit_limit=452534528) which helped to immediately allow containers to be created and probes to execute but after approximately a day the issue returned and the value returned by cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}' was steadily increasing. I tested bpf tree to observe bpf_jit_charge_modmem, bpf_jit_uncharge_modmem their sizes passed in as well as bpf_jit_current under tcpdump BPF filter, seccomp BPF and native (e)BPF programs, and the behavior all looks sane and expected, that is nothing "leaking" from an upstream perspective. The bpf_jit_limit knob was originally added in order to avoid a situation where unprivileged applications loading BPF programs (e.g. seccomp BPF policies) consuming all the module memory space via BPF JIT such that loading of kernel modules would be prevented. The default limit was defined back in 2018 and while good enough back then, we are generally seeing far more BPF consumers today. Adjust the limit for the BPF JIT pool from originally 1/4 to now 1/2 of the module memory space to better reflect today's needs and avoid more users running into potentially hard to debug issues. Fixes: fdadd04931c2 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K") Reported-by: Stephen Haynes <sh@synk.net> Reported-by: Lefteris Alexakis <lefteris.alexakis@kpn.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://github.com/awslabs/amazon-eks-ami/issues/1179 Link: https://github.com/awslabs/amazon-eks-ami/issues/1219 Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230320143725.8394-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-17btf: fix resolving BTF_KIND_VAR after ARRAY, STRUCT, UNION, PTRLorenz Bauer
[ Upstream commit 9b459804ff9973e173fabafba2a1319f771e85fa ] btf_datasec_resolve contains a bug that causes the following BTF to fail loading: [1] DATASEC a size=2 vlen=2 type_id=4 offset=0 size=1 type_id=7 offset=1 size=1 [2] INT (anon) size=1 bits_offset=0 nr_bits=8 encoding=(none) [3] PTR (anon) type_id=2 [4] VAR a type_id=3 linkage=0 [5] INT (anon) size=1 bits_offset=0 nr_bits=8 encoding=(none) [6] TYPEDEF td type_id=5 [7] VAR b type_id=6 linkage=0 This error message is printed during btf_check_all_types: [1] DATASEC a size=2 vlen=2 type_id=7 offset=1 size=1 Invalid type By tracing btf_*_resolve we can pinpoint the problem: btf_datasec_resolve(depth: 1, type_id: 1, mode: RESOLVE_TBD) = 0 btf_var_resolve(depth: 2, type_id: 4, mode: RESOLVE_TBD) = 0 btf_ptr_resolve(depth: 3, type_id: 3, mode: RESOLVE_PTR) = 0 btf_var_resolve(depth: 2, type_id: 4, mode: RESOLVE_PTR) = 0 btf_datasec_resolve(depth: 1, type_id: 1, mode: RESOLVE_PTR) = -22 The last invocation of btf_datasec_resolve should invoke btf_var_resolve by means of env_stack_push, instead it returns EINVAL. The reason is that env_stack_push is never executed for the second VAR. if (!env_type_is_resolve_sink(env, var_type) && !env_type_is_resolved(env, var_type_id)) { env_stack_set_next_member(env, i + 1); return env_stack_push(env, var_type, var_type_id); } env_type_is_resolve_sink() changes its behaviour based on resolve_mode. For RESOLVE_PTR, we can simplify the if condition to the following: (btf_type_is_modifier() || btf_type_is_ptr) && !env_type_is_resolved() Since we're dealing with a VAR the clause evaluates to false. This is not sufficient to trigger the bug however. The log output and EINVAL are only generated if btf_type_id_size() fails. if (!btf_type_id_size(btf, &type_id, &type_size)) { btf_verifier_log_vsi(env, v->t, vsi, "Invalid type"); return -EINVAL; } Most types are sized, so for example a VAR referring to an INT is not a problem. The bug is only triggered if a VAR points at a modifier. Since we skipped btf_var_resolve that modifier was also never resolved, which means that btf_resolved_type_id returns 0 aka VOID for the modifier. This in turn causes btf_type_id_size to return NULL, triggering EINVAL. To summarise, the following conditions are necessary: - VAR pointing at PTR, STRUCT, UNION or ARRAY - Followed by a VAR pointing at TYPEDEF, VOLATILE, CONST, RESTRICT or TYPE_TAG The fix is to reset resolve_mode to RESOLVE_TBD before attempting to resolve a VAR from a DATASEC. Fixes: 1dc92851849c ("bpf: kernel side support for BTF Var and DataSec") Signed-off-by: Lorenz Bauer <lmb@isovalent.com> Link: https://lore.kernel.org/r/20230306112138.155352-2-lmb@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10bpf: Fix global subprog context argument resolution logicAndrii Nakryiko
[ Upstream commit d384dce281ed1b504fae2e279507827638d56fa3 ] KPROBE program's user-facing context type is defined as typedef bpf_user_pt_regs_t. This leads to a problem when trying to passing kprobe/uprobe/usdt context argument into global subprog, as kernel always strip away mods and typedefs of user-supplied type, but takes expected type from bpf_ctx_convert as is, which causes mismatch. Current way to work around this is to define a fake struct with the same name as expected typedef: struct bpf_user_pt_regs_t {}; __noinline my_global_subprog(struct bpf_user_pt_regs_t *ctx) { ... } This patch fixes the issue by resolving expected type, if it's not a struct. It still leaves the above work-around working for backwards compatibility. Fixes: 91cc1a99740e ("bpf: Annotate context types") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20230216045954.3002473-2-andrii@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10bpf: Zeroing allocated object from slab in bpf memory allocatorHou Tao
[ Upstream commit 997849c4b969034e225153f41026657def66d286 ] Currently the freed element in bpf memory allocator may be immediately reused, for htab map the reuse will reinitialize special fields in map value (e.g., bpf_spin_lock), but lookup procedure may still access these special fields, and it may lead to hard-lockup as shown below: NMI backtrace for cpu 16 CPU: 16 PID: 2574 Comm: htab.bin Tainted: G L 6.1.0+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), RIP: 0010:queued_spin_lock_slowpath+0x283/0x2c0 ...... Call Trace: <TASK> copy_map_value_locked+0xb7/0x170 bpf_map_copy_value+0x113/0x3c0 __sys_bpf+0x1c67/0x2780 __x64_sys_bpf+0x1c/0x20 do_syscall_64+0x30/0x60 entry_SYSCALL_64_after_hwframe+0x46/0xb0 ...... </TASK> For htab map, just like the preallocated case, these is no need to initialize these special fields in map value again once these fields have been initialized. For preallocated htab map, these fields are initialized through __GFP_ZERO in bpf_map_area_alloc(), so do the similar thing for non-preallocated htab in bpf memory allocator. And there is no need to use __GFP_ZERO for per-cpu bpf memory allocator, because __alloc_percpu_gfp() does it implicitly. Fixes: 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230215082132.3856544-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10bpf: Fix partial dynptr stack slot reads/writesKumar Kartikeya Dwivedi
[ Upstream commit ef8fc7a07c0e161841779d6fe3f6acd5a05c547c ] Currently, while reads are disallowed for dynptr stack slots, writes are not. Reads don't work from both direct access and helpers, while writes do work in both cases, but have the effect of overwriting the slot_type. While this is fine, handling for a few edge cases is missing. Firstly, a user can overwrite the stack slots of dynptr partially. Consider the following layout: spi: [d][d][?] 2 1 0 First slot is at spi 2, second at spi 1. Now, do a write of 1 to 8 bytes for spi 1. This will essentially either write STACK_MISC for all slot_types or STACK_MISC and STACK_ZERO (in case of size < BPF_REG_SIZE partial write of zeroes). The end result is that slot is scrubbed. Now, the layout is: spi: [d][m][?] 2 1 0 Suppose if user initializes spi = 1 as dynptr. We get: spi: [d][d][d] 2 1 0 But this time, both spi 2 and spi 1 have first_slot = true. Now, when passing spi 2 to dynptr helper, it will consider it as initialized as it does not check whether second slot has first_slot == false. And spi 1 should already work as normal. This effectively replaced size + offset of first dynptr, hence allowing invalid OOB reads and writes. Make a few changes to protect against this: When writing to PTR_TO_STACK using BPF insns, when we touch spi of a STACK_DYNPTR type, mark both first and second slot (regardless of which slot we touch) as STACK_INVALID. Reads are already prevented. Second, prevent writing to stack memory from helpers if the range may contain any STACK_DYNPTR slots. Reads are already prevented. For helpers, we cannot allow it to destroy dynptrs from the writes as depending on arguments, helper may take uninit_mem and dynptr both at the same time. This would mean that helper may write to uninit_mem before it reads the dynptr, which would be bad. PTR_TO_MEM: [?????dd] Depending on the code inside the helper, it may end up overwriting the dynptr contents first and then read those as the dynptr argument. Verifier would only simulate destruction when it does byte by byte access simulation in check_helper_call for meta.access_size, and fail to catch this case, as it happens after argument checks. The same would need to be done for any other non-trivial objects created on the stack in the future, such as bpf_list_head on stack, or bpf_rb_root on stack. A common misunderstanding in the current code is that MEM_UNINIT means writes, but note that writes may also be performed even without MEM_UNINIT in case of helpers, in that case the code after handling meta && meta->raw_mode will complain when it sees STACK_DYNPTR. So that invalid read case also covers writes to potential STACK_DYNPTR slots. The only loophole was in case of meta->raw_mode which simulated writes through instructions which could overwrite them. A future series sequenced after this will focus on the clean up of helper access checks and bugs around that. Fixes: 97e03f521050 ("bpf: Add verifier support for dynptrs") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230121002241.2113993-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10bpf: Fix missing var_off check for ARG_PTR_TO_DYNPTRKumar Kartikeya Dwivedi
[ Upstream commit 79168a669d8125453c8a271115f1ffd4294e61f6 ] Currently, the dynptr function is not checking the variable offset part of PTR_TO_STACK that it needs to check. The fixed offset is considered when computing the stack pointer index, but if the variable offset was not a constant (such that it could not be accumulated in reg->off), we will end up a discrepency where runtime pointer does not point to the actual stack slot we mark as STACK_DYNPTR. It is impossible to precisely track dynptr state when variable offset is not constant, hence, just like bpf_timer, kptr, bpf_spin_lock, etc. simply reject the case where reg->var_off is not constant. Then, consider both reg->off and reg->var_off.value when computing the stack pointer index. A new helper dynptr_get_spi is introduced to hide over these details since the dynptr needs to be located in multiple places outside the process_dynptr_func checks, hence once we know it's a PTR_TO_STACK, we need to enforce these checks in all places. Note that it is disallowed for unprivileged users to have a non-constant var_off, so this problem should only be possible to trigger from programs having CAP_PERFMON. However, its effects can vary. Without the fix, it is possible to replace the contents of the dynptr arbitrarily by making verifier mark different stack slots than actual location and then doing writes to the actual stack address of dynptr at runtime. Fixes: 97e03f521050 ("bpf: Add verifier support for dynptrs") Acked-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230121002241.2113993-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10bpf: Fix state pruning for STACK_DYNPTR stack slotsKumar Kartikeya Dwivedi
[ Upstream commit d6fefa1105dacc8a742cdcf2f4bfb501c9e61349 ] The root of the problem is missing liveness marking for STACK_DYNPTR slots. This leads to all kinds of problems inside stacksafe. The verifier by default inside stacksafe ignores spilled_ptr in stack slots which do not have REG_LIVE_READ marks. Since this is being checked in the 'old' explored state, it must have already done clean_live_states for this old bpf_func_state. Hence, it won't be receiving any more liveness marks from to be explored insns (it has received REG_LIVE_DONE marking from liveness point of view). What this means is that verifier considers that it's safe to not compare the stack slot if was never read by children states. While liveness marks are usually propagated correctly following the parentage chain for spilled registers (SCALAR_VALUE and PTR_* types), the same is not the case for STACK_DYNPTR. clean_live_states hence simply rewrites these stack slots to the type STACK_INVALID since it sees no REG_LIVE_READ marks. The end result is that we will never see STACK_DYNPTR slots in explored state. Even if verifier was conservatively matching !REG_LIVE_READ slots, very next check continuing the stacksafe loop on seeing STACK_INVALID would again prevent further checks. Now as long as verifier stores an explored state which we can compare to when reaching a pruning point, we can abuse this bug to make verifier prune search for obviously unsafe paths using STACK_DYNPTR slots thinking they are never used hence safe. Doing this in unprivileged mode is a bit challenging. add_new_state is only set when seeing BPF_F_TEST_STATE_FREQ (which requires privileges) or when jmps_processed difference is >= 2 and insn_processed difference is >= 8. So coming up with the unprivileged case requires a little more work, but it is still totally possible. The test case being discussed below triggers the heuristic even in unprivileged mode. However, it no longer works since commit 8addbfc7b308 ("bpf: Gate dynptr API behind CAP_BPF"). Let's try to study the test step by step. Consider the following program (C style BPF ASM): 0 r0 = 0; 1 r6 = &ringbuf_map; 3 r1 = r6; 4 r2 = 8; 5 r3 = 0; 6 r4 = r10; 7 r4 -= -16; 8 call bpf_ringbuf_reserve_dynptr; 9 if r0 == 0 goto pc+1; 10 goto pc+1; 11 *(r10 - 16) = 0xeB9F; 12 r1 = r10; 13 r1 -= -16; 14 r2 = 0; 15 call bpf_ringbuf_discard_dynptr; 16 r0 = 0; 17 exit; We know that insn 12 will be a pruning point, hence if we force add_new_state for it, it will first verify the following path as safe in straight line exploration: 0 1 3 4 5 6 7 8 9 -> 10 -> (12) 13 14 15 16 17 Then, when we arrive at insn 12 from the following path: 0 1 3 4 5 6 7 8 9 -> 11 (12) We will find a state that has been verified as safe already at insn 12. Since register state is same at this point, regsafe will pass. Next, in stacksafe, for spi = 0 and spi = 1 (location of our dynptr) is skipped seeing !REG_LIVE_READ. The rest matches, so stacksafe returns true. Next, refsafe is also true as reference state is unchanged in both states. The states are considered equivalent and search is pruned. Hence, we are able to construct a dynptr with arbitrary contents and use the dynptr API to operate on this arbitrary pointer and arbitrary size + offset. To fix this, first define a mark_dynptr_read function that propagates liveness marks whenever a valid initialized dynptr is accessed by dynptr helpers. REG_LIVE_WRITTEN is marked whenever we initialize an uninitialized dynptr. This is done in mark_stack_slots_dynptr. It allows screening off mark_reg_read and not propagating marks upwards from that point. This ensures that we either set REG_LIVE_READ64 on both dynptr slots, or none, so clean_live_states either sets both slots to STACK_INVALID or none of them. This is the invariant the checks inside stacksafe rely on. Next, do a complete comparison of both stack slots whenever they have STACK_DYNPTR. Compare the dynptr type stored in the spilled_ptr, and also whether both form the same first_slot. Only then is the later path safe. Fixes: 97e03f521050 ("bpf: Add verifier support for dynptrs") Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230121002241.2113993-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-25bpf: add missing header file includeLinus Torvalds
commit f3dd0c53370e70c0f9b7e931bbec12916f3bb8cc upstream. Commit 74e19ef0ff80 ("uaccess: Add speculation barrier to copy_from_user()") built fine on x86-64 and arm64, and that's the extent of my local build testing. It turns out those got the <linux/nospec.h> include incidentally through other header files (<linux/kvm_host.h> in particular), but that was not true of other architectures, resulting in build errors kernel/bpf/core.c: In function ‘___bpf_prog_run’: kernel/bpf/core.c:1913:3: error: implicit declaration of function ‘barrier_nospec’ so just make sure to explicitly include the proper <linux/nospec.h> header file to make everybody see it. Fixes: 74e19ef0ff80 ("uaccess: Add speculation barrier to copy_from_user()") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Viresh Kumar <viresh.kumar@linaro.org> Reported-by: Huacai Chen <chenhuacai@loongson.cn> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-25uaccess: Add speculation barrier to copy_from_user()Dave Hansen
commit 74e19ef0ff8061ef55957c3abd71614ef0f42f47 upstream. The results of "access_ok()" can be mis-speculated. The result is that you can end speculatively: if (access_ok(from, size)) // Right here even for bad from/size combinations. On first glance, it would be ideal to just add a speculation barrier to "access_ok()" so that its results can never be mis-speculated. But there are lots of system calls just doing access_ok() via "copy_to_user()" and friends (example: fstat() and friends). Those are generally not problematic because they do not _consume_ data from userspace other than the pointer. They are also very quick and common system calls that should not be needlessly slowed down. "copy_from_user()" on the other hand uses a user-controller pointer and is frequently followed up with code that might affect caches. Take something like this: if (!copy_from_user(&kernelvar, uptr, size)) do_something_with(kernelvar); If userspace passes in an evil 'uptr' that *actually* points to a kernel addresses, and then do_something_with() has cache (or other) side-effects, it could allow userspace to infer kernel data values. Add a barrier to the common copy_from_user() code to prevent mis-speculated values which happen after the copy. Also add a stub for architectures that do not define barrier_nospec(). This makes the macro usable in generic code. Since the barrier is now usable in generic code, the x86 #ifdef in the BPF code can also go away. Reported-by: Jordy Zomer <jordyzomer@google.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Daniel Borkmann <daniel@iogearbox.net> # BPF bits Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-26bpf: Fix the kernel crash caused by bpf_setsockopt().Kui-Feng Lee
The kernel crash was caused by a BPF program attached to the "lsm_cgroup/socket_sock_rcv_skb" hook, which performed a call to `bpf_setsockopt()` in order to set the TCP_NODELAY flag as an example. Flags like TCP_NODELAY can prompt the kernel to flush a socket's outgoing queue, and this hook "lsm_cgroup/socket_sock_rcv_skb" is frequently triggered by softirqs. The issue was that in certain circumstances, when `tcp_write_xmit()` was called to flush the queue, it would also allow BH (bottom-half) to run. This could lead to our program attempting to flush the same socket recursively, which caused a `skbuff` to be unlinked twice. `security_sock_rcv_skb()` is triggered by `tcp_filter()`. This occurs before the sock ownership is checked in `tcp_v4_rcv()`. Consequently, if a bpf program runs on `security_sock_rcv_skb()` while under softirq conditions, it may not possess the lock needed for `bpf_setsockopt()`, thus presenting an issue. The patch fixes this issue by ensuring that a BPF program attached to the "lsm_cgroup/socket_sock_rcv_skb" hook is not allowed to call `bpf_setsockopt()`. The differences from v1 are - changing commit log to explain holding the lock of the sock, - emphasizing that TCP_NODELAY is not the only flag, and - adding the fixes tag. v1: https://lore.kernel.org/bpf/20230125000244.1109228-1-kuifeng@meta.com/ Signed-off-by: Kui-Feng Lee <kuifeng@meta.com> Fixes: 9113d7e48e91 ("bpf: expose bpf_{g,s}etsockopt to lsm cgroup") Link: https://lore.kernel.org/r/20230127001732.4162630-1-kuifeng@meta.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-20bpf: Add missing btf_put to register_btf_id_dtor_kfuncsJiri Olsa
We take the BTF reference before we register dtors and we need to put it back when it's done. We probably won't se a problem with kernel BTF, but module BTF would stay loaded (because of the extra ref) even when its module is removed. Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Fixes: 5ce937d613a4 ("bpf: Populate pairs of btf_id and destructor kfunc in btf") Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20230120122148.1522359-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-19bpf: Fix to preserve reg parent/live fields when copying range infoEduard Zingerman
Register range information is copied in several places. The intent is to transfer range/id information from one register/stack spill to another. Currently this is done using direct register assignment, e.g.: static void find_equal_scalars(..., struct bpf_reg_state *known_reg) { ... struct bpf_reg_state *reg; ... *reg = *known_reg; ... } However, such assignments also copy the following bpf_reg_state fields: struct bpf_reg_state { ... struct bpf_reg_state *parent; ... enum bpf_reg_liveness live; ... }; Copying of these fields is accidental and incorrect, as could be demonstrated by the following example: 0: call ktime_get_ns() 1: r6 = r0 2: call ktime_get_ns() 3: r7 = r0 4: if r0 > r6 goto +1 ; r0 & r6 are unbound thus generated ; branch states are identical 5: *(u64 *)(r10 - 8) = 0xdeadbeef ; 64-bit write to fp[-8] --- checkpoint --- 6: r1 = 42 ; r1 marked as written 7: *(u8 *)(r10 - 8) = r1 ; 8-bit write, fp[-8] parent & live ; overwritten 8: r2 = *(u64 *)(r10 - 8) 9: r0 = 0 10: exit This example is unsafe because 64-bit write to fp[-8] at (5) is conditional, thus not all bytes of fp[-8] are guaranteed to be set when it is read at (8). However, currently the example passes verification. First, the execution path 1-10 is examined by verifier. Suppose that a new checkpoint is created by is_state_visited() at (6). After checkpoint creation: - r1.parent points to checkpoint.r1, - fp[-8].parent points to checkpoint.fp[-8]. At (6) the r1.live is set to REG_LIVE_WRITTEN. At (7) the fp[-8].parent is set to r1.parent and fp[-8].live is set to REG_LIVE_WRITTEN, because of the following code called in check_stack_write_fixed_off(): static void save_register_state(struct bpf_func_state *state, int spi, struct bpf_reg_state *reg, int size) { ... state->stack[spi].spilled_ptr = *reg; // <--- parent & live copied if (size == BPF_REG_SIZE) state->stack[spi].spilled_ptr.live |= REG_LIVE_WRITTEN; ... } Note the intent to mark stack spill as written only if 8 bytes are spilled to a slot, however this intent is spoiled by a 'live' field copy. At (8) the checkpoint.fp[-8] should be marked as REG_LIVE_READ but this does not happen: - fp[-8] in a current state is already marked as REG_LIVE_WRITTEN; - fp[-8].parent points to checkpoint.r1, parentage chain is used by mark_reg_read() to mark checkpoint states. At (10) the verification is finished for path 1-10 and jump 4-6 is examined. The checkpoint.fp[-8] never gets REG_LIVE_READ mark and this spill is pruned from the cached states by clean_live_states(). Hence verifier state obtained via path 1-4,6 is deemed identical to one obtained via path 1-6 and program marked as safe. Note: the example should be executed with BPF_F_TEST_STATE_FREQ flag set to force creation of intermediate verifier states. This commit revisits the locations where bpf_reg_state instances are copied and replaces the direct copies with a call to a function copy_register_state(dst, src) that preserves 'parent' and 'live' fields of the 'dst'. Fixes: 679c782de14b ("bpf/verifier: per-register parent pointers") Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20230106142214.1040390-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-18bpf: Fix off-by-one error in bpf_mem_cache_idx()Hou Tao
According to the definition of sizes[NUM_CACHES], when the size passed to bpf_mem_cache_size() is 256, it should return 6 instead 7. Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory allocator.") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20230118084630.3750680-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-13bpf: Fix pointer-leak due to insufficient speculative store bypass mitigationLuis Gerhorst
To mitigate Spectre v4, 2039f26f3aca ("bpf: Fix leakage due to insufficient speculative store bypass mitigation") inserts lfence instructions after 1) initializing a stack slot and 2) spilling a pointer to the stack. However, this does not cover cases where a stack slot is first initialized with a pointer (subject to sanitization) but then overwritten with a scalar (not subject to sanitization because the slot was already initialized). In this case, the second write may be subject to speculative store bypass (SSB) creating a speculative pointer-as-scalar type confusion. This allows the program to subsequently leak the numerical pointer value using, for example, a branch-based cache side channel. To fix this, also sanitize scalars if they write a stack slot that previously contained a pointer. Assuming that pointer-spills are only generated by LLVM on register-pressure, the performance impact on most real-world BPF programs should be small. The following unprivileged BPF bytecode drafts a minimal exploit and the mitigation: [...] // r6 = 0 or 1 (skalar, unknown user input) // r7 = accessible ptr for side channel // r10 = frame pointer (fp), to be leaked // r9 = r10 # fp alias to encourage ssb *(u64 *)(r9 - 8) = r10 // fp[-8] = ptr, to be leaked // lfence added here because of pointer spill to stack. // // Ommitted: Dummy bpf_ringbuf_output() here to train alias predictor // for no r9-r10 dependency. // *(u64 *)(r10 - 8) = r6 // fp[-8] = scalar, overwrites ptr // 2039f26f3aca: no lfence added because stack slot was not STACK_INVALID, // store may be subject to SSB // // fix: also add an lfence when the slot contained a ptr // r8 = *(u64 *)(r9 - 8) // r8 = architecturally a scalar, speculatively a ptr // // leak ptr using branch-based cache side channel: r8 &= 1 // choose bit to leak if r8 == 0 goto SLOW // no mispredict // architecturally dead code if input r6 is 0, // only executes speculatively iff ptr bit is 1 r8 = *(u64 *)(r7 + 0) # encode bit in cache (0: slow, 1: fast) SLOW: [...] After running this, the program can time the access to *(r7 + 0) to determine whether the chosen pointer bit was 0 or 1. Repeat this 64 times to recover the whole address on amd64. In summary, sanitization can only be skipped if one scalar is overwritten with another scalar. Scalar-confusion due to speculative store bypass can not lead to invalid accesses because the pointer bounds deducted during verification are enforced using branchless logic. See 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic") for details. Do not make the mitigation depend on !env->allow_{uninit_stack,ptr_leaks} because speculative leaks are likely unexpected if these were enabled. For example, leaking the address to a protected log file may be acceptable while disabling the mitigation might unintentionally leak the address into the cached-state of a map that is accessible to unprivileged processes. Fixes: 2039f26f3aca ("bpf: Fix leakage due to insufficient speculative store bypass mitigation") Signed-off-by: Luis Gerhorst <gerhorst@cs.fau.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Henriette Hofmeier <henriette.hofmeier@rub.de> Link: https://lore.kernel.org/bpf/edc95bad-aada-9cfc-ffe2-fa9bb206583c@cs.fau.de Link: https://lore.kernel.org/bpf/20230109150544.41465-1-gerhorst@cs.fau.de
2023-01-12bpf: hash map, avoid deadlock with suitable hash maskTonghao Zhang
The deadlock still may occur while accessed in NMI and non-NMI context. Because in NMI, we still may access the same bucket but with different map_locked index. For example, on the same CPU, .max_entries = 2, we update the hash map, with key = 4, while running bpf prog in NMI nmi_handle(), to update hash map with key = 20, so it will have the same bucket index but have different map_locked index. To fix this issue, using min mask to hash again. Fixes: 20b6cc34ea74 ("bpf: Avoid hashtab deadlock with map_locked") Signed-off-by: Tonghao Zhang <tong@infragraf.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Song Liu <song@kernel.org> Cc: Yonghong Song <yhs@fb.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Stanislav Fomichev <sdf@google.com> Cc: Hao Luo <haoluo@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Hou Tao <houtao1@huawei.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230111092903.92389-1-tong@infragraf.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-09bpf: remove the do_idr_lock parameter from bpf_prog_free_id()Paul Moore
It was determined that the do_idr_lock parameter to bpf_prog_free_id() was not necessary as it should always be true. Suggested-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230106154400.74211-2-paul@paul-moore.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-09bpf: restore the ebpf program ID for BPF_AUDIT_UNLOAD and ↵Paul Moore
PERF_BPF_EVENT_PROG_UNLOAD When changing the ebpf program put() routines to support being called from within IRQ context the program ID was reset to zero prior to calling the perf event and audit UNLOAD record generators, which resulted in problems as the ebpf program ID was bogus (always zero). This patch addresses this problem by removing an unnecessary call to bpf_prog_free_id() in __bpf_prog_offload_destroy() and adjusting __bpf_prog_put() to only call bpf_prog_free_id() after audit and perf have finished their bpf program unload tasks in bpf_prog_put_deferred(). For the record, no one can determine, or remember, why it was necessary to free the program ID, and remove it from the IDR, prior to executing bpf_prog_put_deferred(); regardless, both Stanislav and Alexei agree that the approach in this patch should be safe. It is worth noting that when moving the bpf_prog_free_id() call, the do_idr_lock parameter was forced to true as the ebpf devs determined this was the correct as the do_idr_lock should always be true. The do_idr_lock parameter will be removed in a follow-up patch, but it was kept here to keep the patch small in an effort to ease any stable backports. I also modified the bpf_audit_prog() logic used to associate the AUDIT_BPF record with other associated records, e.g. @ctx != NULL. Instead of keying off the operation, it now keys off the execution context, e.g. '!in_irg && !irqs_disabled()', which is much more appropriate and should help better connect the UNLOAD operations with the associated audit state (other audit records). Cc: stable@vger.kernel.org Fixes: d809e134be7a ("bpf: Prepare bpf_prog_put() to be called from irq context.") Reported-by: Burn Alting <burn.alting@iinet.net.au> Reported-by: Jiri Olsa <olsajiri@gmail.com> Suggested-by: Stanislav Fomichev <sdf@google.com> Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230106154400.74211-1-paul@paul-moore.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-06bpf: Skip invalid kfunc call in backtrack_insnHao Sun
The verifier skips invalid kfunc call in check_kfunc_call(), which would be captured in fixup_kfunc_call() if such insn is not eliminated by dead code elimination. However, this can lead to the following warning in backtrack_insn(), also see [1]: ------------[ cut here ]------------ verifier backtracking bug WARNING: CPU: 6 PID: 8646 at kernel/bpf/verifier.c:2756 backtrack_insn kernel/bpf/verifier.c:2756 __mark_chain_precision kernel/bpf/verifier.c:3065 mark_chain_precision kernel/bpf/verifier.c:3165 adjust_reg_min_max_vals kernel/bpf/verifier.c:10715 check_alu_op kernel/bpf/verifier.c:10928 do_check kernel/bpf/verifier.c:13821 [inline] do_check_common kernel/bpf/verifier.c:16289 [...] So make backtracking conservative with this by returning ENOTSUPP. [1] https://lore.kernel.org/bpf/CACkBjsaXNceR8ZjkLG=dT3P=4A8SBsg0Z5h5PWLryF5=ghKq=g@mail.gmail.com/ Reported-by: syzbot+4da3ff23081bafe74fc2@syzkaller.appspotmail.com Signed-off-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20230104014709.9375-1-sunhao.th@gmail.com
2022-12-28bpf: Always use maximal size for copy_array()Kees Cook
Instead of counting on prior allocations to have sized allocations to the next kmalloc bucket size, always perform a krealloc that is at least ksize(dst) in size (which is a no-op), so the size can be correctly tracked by all the various allocation size trackers (KASAN, __alloc_size, etc). Reported-by: Hyunwoo Kim <v4bel@theori.io> Link: https://lore.kernel.org/bpf/20221223094551.GA1439509@ubuntu Fixes: ceb35b666d42 ("bpf/verifier: Use kmalloc_size_roundup() to match ksize() usage") Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Song Liu <song@kernel.org> Cc: Yonghong Song <yhs@fb.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Stanislav Fomichev <sdf@google.com> Cc: Hao Luo <haoluo@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: bpf@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221223182836.never.866-kees@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-28bpf: keep a reference to the mm, in case the task is dead.Kui-Feng Lee
Fix the system crash that happens when a task iterator travel through vma of tasks. In task iterators, we used to access mm by following the pointer on the task_struct; however, the death of a task will clear the pointer, even though we still hold the task_struct. That can cause an unexpected crash for a null pointer when an iterator is visiting a task that dies during the visit. Keeping a reference of mm on the iterator ensures we always have a valid pointer to mm. Co-developed-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com> Reported-by: Nathan Slingerland <slinger@meta.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221216221855.4122288-2-kuifeng@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-28bpf: Fix panic due to wrong pageattr of im->imageChuang Wang
In the scenario where livepatch and kretfunc coexist, the pageattr of im->image is rox after arch_prepare_bpf_trampoline in bpf_trampoline_update, and then modify_fentry or register_fentry returns -EAGAIN from bpf_tramp_ftrace_ops_func, the BPF_TRAMP_F_ORIG_STACK flag will be configured, and arch_prepare_bpf_trampoline will be re-executed. At this time, because the pageattr of im->image is rox, arch_prepare_bpf_trampoline will read and write im->image, which causes a fault. as follows: insmod livepatch-sample.ko # samples/livepatch/livepatch-sample.c bpftrace -e 'kretfunc:cmdline_proc_show {}' BUG: unable to handle page fault for address: ffffffffa0206000 PGD 322d067 P4D 322d067 PUD 322e063 PMD 1297e067 PTE d428061 Oops: 0003 [#1] PREEMPT SMP PTI CPU: 2 PID: 270 Comm: bpftrace Tainted: G E K 6.1.0 #5 RIP: 0010:arch_prepare_bpf_trampoline+0xed/0x8c0 RSP: 0018:ffffc90001083ad8 EFLAGS: 00010202 RAX: ffffffffa0206000 RBX: 0000000000000020 RCX: 0000000000000000 RDX: ffffffffa0206001 RSI: ffffffffa0206000 RDI: 0000000000000030 RBP: ffffc90001083b70 R08: 0000000000000066 R09: ffff88800f51b400 R10: 000000002e72c6e5 R11: 00000000d0a15080 R12: ffff8880110a68c8 R13: 0000000000000000 R14: ffff88800f51b400 R15: ffffffff814fec10 FS: 00007f87bc0dc780(0000) GS:ffff88803e600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffa0206000 CR3: 0000000010b70000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> bpf_trampoline_update+0x25a/0x6b0 __bpf_trampoline_link_prog+0x101/0x240 bpf_trampoline_link_prog+0x2d/0x50 bpf_tracing_prog_attach+0x24c/0x530 bpf_raw_tp_link_attach+0x73/0x1d0 __sys_bpf+0x100e/0x2570 __x64_sys_bpf+0x1c/0x30 do_syscall_64+0x5b/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd With this patch, when modify_fentry or register_fentry returns -EAGAIN from bpf_tramp_ftrace_ops_func, the pageattr of im->image will be reset to nx+rw. Cc: stable@vger.kernel.org Fixes: 00963a2e75a8 ("bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)") Signed-off-by: Chuang Wang <nashuiliang@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20221224133146.780578-1-nashuiliang@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-24Merge tag 'for-netdev' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== The following pull-request contains BPF updates for your *net* tree. We've added 7 non-merge commits during the last 5 day(s) which contain a total of 11 files changed, 231 insertions(+), 3 deletions(-). The main changes are: 1) Fix a splat in bpf_skb_generic_pop() under CHECKSUM_PARTIAL due to misuse of skb_postpull_rcsum(), from Jakub Kicinski with test case from Martin Lau. 2) Fix BPF verifier's nullness propagation when registers are of type PTR_TO_BTF_ID, from Hao Sun. 3) Fix bpftool build for JIT disassembler under statically built libllvm, from Anton Protopopov. 4) Fix warnings reported by resolve_btfids when building vmlinux with CONFIG_SECURITY_NETWORK disabled, from Hou Tao. 5) Minor fix up for BPF selftest gitignore, from Stanislav Fomichev. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-22bpf: fix nullness propagation for reg to reg comparisonsHao Sun
After befae75856ab, the verifier would propagate null information after JEQ/JNE, e.g., if two pointers, one is maybe_null and the other is not, the former would be marked as non-null in eq path. However, as comment "PTR_TO_BTF_ID points to a kernel struct that does not need to be null checked by the BPF program ... The verifier must keep this in mind and can make no assumptions about null or non-null when doing branch ...". If one pointer is maybe_null and the other is PTR_TO_BTF, the former is incorrectly marked non-null. The following BPF prog can trigger a null-ptr-deref, also see this report for more details[1]: 0: (18) r1 = map_fd ; R1_w=map_ptr(ks=4, vs=4) 2: (79) r6 = *(u64 *)(r1 +8) ; R6_w=bpf_map->inner_map_data ; R6 is PTR_TO_BTF_ID ; equals to null at runtime 3: (bf) r2 = r10 4: (07) r2 += -4 5: (62) *(u32 *)(r2 +0) = 0 6: (85) call bpf_map_lookup_elem#1 ; R0_w=map_value_or_null 7: (1d) if r6 == r0 goto pc+1 8: (95) exit ; from 7 to 9: R0=map_value R6=ptr_bpf_map 9: (61) r0 = *(u32 *)(r0 +0) ; null-ptr-deref 10: (95) exit So, make the verifier propagate nullness information for reg to reg comparisons only if neither reg is PTR_TO_BTF_ID. [1] https://lore.kernel.org/bpf/CACkBjsaFJwjC5oiw-1KXvcazywodwXo4zGYsRHwbr2gSG9WcSw@mail.gmail.com/T/#u Fixes: befae75856ab ("bpf: propagate nullness information for reg to reg comparisons") Signed-off-by: Hao Sun <sunhao.th@gmail.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221222024414.29539-1-sunhao.th@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-12-21Merge tag 'net-6.2-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf, netfilter and can. Current release - regressions: - bpf: synchronize dispatcher update with bpf_dispatcher_xdp_func - rxrpc: - fix security setting propagation - fix null-deref in rxrpc_unuse_local() - fix switched parameters in peer tracing Current release - new code bugs: - rxrpc: - fix I/O thread startup getting skipped - fix locking issues in rxrpc_put_peer_locked() - fix I/O thread stop - fix uninitialised variable in rxperf server - fix the return value of rxrpc_new_incoming_call() - microchip: vcap: fix initialization of value and mask - nfp: fix unaligned io read of capabilities word Previous releases - regressions: - stop in-kernel socket users from corrupting socket's task_frag - stream: purge sk_error_queue in sk_stream_kill_queues() - openvswitch: fix flow lookup to use unmasked key - dsa: mv88e6xxx: avoid reg_lock deadlock in mv88e6xxx_setup_port() - devlink: - hold region lock when flushing snapshots - protect devlink dump by the instance lock Previous releases - always broken: - bpf: - prevent leak of lsm program after failed attach - resolve fext program type when checking map compatibility - skbuff: account for tail adjustment during pull operations - macsec: fix net device access prior to holding a lock - bonding: switch back when high prio link up - netfilter: flowtable: really fix NAT IPv6 offload - enetc: avoid buffer leaks on xdp_do_redirect() failure - unix: fix race in SOCK_SEQPACKET's unix_dgram_sendmsg() - dsa: microchip: remove IRQF_TRIGGER_FALLING in request_threaded_irq" * tag 'net-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (64 commits) net: fec: check the return value of build_skb() net: simplify sk_page_frag Treewide: Stop corrupting socket's task_frag net: Introduce sk_use_task_frag in struct sock. mctp: Remove device type check at unregister net: dsa: microchip: remove IRQF_TRIGGER_FALLING in request_threaded_irq can: kvaser_usb: hydra: help gcc-13 to figure out cmd_len can: flexcan: avoid unbalanced pm_runtime_enable warning Documentation: devlink: add missing toc entry for etas_es58x devlink doc mctp: serial: Fix starting value for frame check sequence nfp: fix unaligned io read of capabilities word net: stream: purge sk_error_queue in sk_stream_kill_queues() myri10ge: Fix an error handling path in myri10ge_probe() net: microchip: vcap: Fix initialization of value and mask rxrpc: Fix the return value of rxrpc_new_incoming_call() rxrpc: rxperf: Fix uninitialised variable rxrpc: Fix I/O thread stop rxrpc: Fix switched parameters in peer tracing rxrpc: Fix locking issues in rxrpc_put_peer_locked() rxrpc: Fix I/O thread startup getting skipped ...
2022-12-19bpf: Define sock security related BTF IDs under CONFIG_SECURITY_NETWORKHou Tao
There are warnings reported from resolve_btfids when building vmlinux with CONFIG_SECURITY_NETWORK disabled: WARN: resolve_btfids: unresolved symbol bpf_lsm_sk_free_security WARN: resolve_btfids: unresolved symbol bpf_lsm_sk_alloc_security So only define BTF IDs for these LSM hooks when CONFIG_SECURITY_NETWORK is enabled. Fixes: c0c852dd1876 ("bpf: Do not mark certain LSM hook arguments as trusted") Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20221217062144.2507222-1-houtao@huaweicloud.com
2022-12-17Merge tag 'x86_mm_for_6.2_v2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Dave Hansen: "New Feature: - Randomize the per-cpu entry areas Cleanups: - Have CR3_ADDR_MASK use PHYSICAL_PAGE_MASK instead of open coding it - Move to "native" set_memory_rox() helper - Clean up pmd_get_atomic() and i386-PAE - Remove some unused page table size macros" * tag 'x86_mm_for_6.2_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits) x86/mm: Ensure forced page table splitting x86/kasan: Populate shadow for shared chunk of the CPU entry area x86/kasan: Add helpers to align shadow addresses up and down x86/kasan: Rename local CPU_ENTRY_AREA variables to shorten names x86/mm: Populate KASAN shadow for entire per-CPU range of CPU entry area x86/mm: Recompute physical address for every page of per-CPU CEA mapping x86/mm: Rename __change_page_attr_set_clr(.checkalias) x86/mm: Inhibit _PAGE_NX changes from cpa_process_alias() x86/mm: Untangle __change_page_attr_set_clr(.checkalias) x86/mm: Add a few comments x86/mm: Fix CR3_ADDR_MASK x86/mm: Remove P*D_PAGE_MASK and P*D_PAGE_SIZE macros mm: Convert __HAVE_ARCH_P..P_GET to the new style mm: Remove pointless barrier() after pmdp_get_lockless() x86/mm/pae: Get rid of set_64bit() x86_64: Remove pointless set_64bit() usage x86/mm/pae: Be consistent with pXXp_get_and_clear() x86/mm/pae: Use WRITE_ONCE() x86/mm/pae: Don't (ab)use atomic64 mm/gup: Fix the lockless PMD access ...
2022-12-15mm: Introduce set_memory_rox()Peter Zijlstra
Because endlessly repeating: set_memory_ro() set_memory_x() is getting tedious. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/Y1jek64pXOsougmz@hirez.programming.kicks-ass.net
2022-12-14bpf: Resolve fext program type when checking map compatibilityToke Høiland-Jørgensen
The bpf_prog_map_compatible() check makes sure that BPF program types are not mixed inside BPF map types that can contain programs (tail call maps, cpumaps and devmaps). It does this by setting the fields of the map->owner struct to the values of the first program being checked against, and rejecting any subsequent programs if the values don't match. One of the values being set in the map owner struct is the program type, and since the code did not resolve the prog type for fext programs, the map owner type would be set to PROG_TYPE_EXT and subsequent loading of programs of the target type into the map would fail. This bug is seen in particular for XDP programs that are loaded as PROG_TYPE_EXT using libxdp; these cannot insert programs into devmaps and cpumaps because the check fails as described above. Fix the bug by resolving the fext program type to its target program type as elsewhere in the verifier. v3: - Add Yonghong's ACK Fixes: f45d5b6ce2e8 ("bpf: generalise tail call map compatibility check") Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20221214230254.790066-1-toke@redhat.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-12-14bpf: Synchronize dispatcher update with bpf_dispatcher_xdp_funcJiri Olsa
Hao Sun reported crash in dispatcher image [1]. Currently we don't have any sync between bpf_dispatcher_update and bpf_dispatcher_xdp_func, so following race is possible: cpu 0: cpu 1: bpf_prog_run_xdp ... bpf_dispatcher_xdp_func in image at offset 0x0 bpf_dispatcher_update update image at offset 0x800 bpf_dispatcher_update update image at offset 0x0 in image at offset 0x0 -> crash Fixing this by synchronizing dispatcher image update (which is done in bpf_dispatcher_update function) with bpf_dispatcher_xdp_func that reads and execute the dispatcher image. Calling synchronize_rcu after updating and installing new image ensures that readers leave old image before it's changed in the next dispatcher update. The update itself is locked with dispatcher's mutex. The bpf_prog_run_xdp is called under local_bh_disable and synchronize_rcu will wait for it to leave [2]. [1] https://lore.kernel.org/bpf/Y5SFho7ZYXr9ifRn@krava/T/#m00c29ece654bc9f332a17df493bbca33e702896c [2] https://lore.kernel.org/bpf/0B62D35A-E695-4B7A-A0D4-774767544C1A@gmail.com/T/#mff43e2c003ae99f4a38f353c7969be4c7162e877 Reported-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20221214123542.1389719-1-jolsa@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-12-14bpf: prevent leak of lsm program after failed attachMilan Landaverde
In [0], we added the ability to bpf_prog_attach LSM programs to cgroups, but in our validation to make sure the prog is meant to be attached to BPF_LSM_CGROUP, we return too early if the check fails. This results in lack of decrementing prog's refcnt (through bpf_prog_put) leaving the LSM program alive past the point of the expected lifecycle. This fix allows for the decrement to take place. [0] https://lore.kernel.org/all/20220628174314.1216643-4-sdf@google.com/ Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor") Signed-off-by: Milan Landaverde <milan@mdaverde.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20221213175714.31963-1-milan@mdaverde.com
2022-12-13Merge tag 'net-next-6.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core: - Allow live renaming when an interface is up - Add retpoline wrappers for tc, improving considerably the performances of complex queue discipline configurations - Add inet drop monitor support - A few GRO performance improvements - Add infrastructure for atomic dev stats, addressing long standing data races - De-duplicate common code between OVS and conntrack offloading infrastructure - A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements - Netfilter: introduce packet parser for tunneled packets - Replace IPVS timer-based estimators with kthreads to scale up the workload with the number of available CPUs - Add the helper support for connection-tracking OVS offload BPF: - Support for user defined BPF objects: the use case is to allocate own objects, build own object hierarchies and use the building blocks to build own data structures flexibly, for example, linked lists in BPF - Make cgroup local storage available to non-cgroup attached BPF programs - Avoid unnecessary deadlock detection and failures wrt BPF task storage helpers - A relevant bunch of BPF verifier fixes and improvements - Veristat tool improvements to support custom filtering, sorting, and replay of results - Add LLVM disassembler as default library for dumping JITed code - Lots of new BPF documentation for various BPF maps - Add bpf_rcu_read_{,un}lock() support for sleepable programs - Add RCU grace period chaining to BPF to wait for the completion of access from both sleepable and non-sleepable BPF programs - Add support storing struct task_struct objects as kptrs in maps - Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer values - Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions Protocols: - TCP: implement Protective Load Balancing across switch links - TCP: allow dynamically disabling TCP-MD5 static key, reverting back to fast[er]-path - UDP: Introduce optional per-netns hash lookup table - IPv6: simplify and cleanup sockets disposal - Netlink: support different type policies for each generic netlink operation - MPTCP: add MSG_FASTOPEN and FastOpen listener side support - MPTCP: add netlink notification support for listener sockets events - SCTP: add VRF support, allowing sctp sockets binding to VRF devices - Add bridging MAC Authentication Bypass (MAB) support - Extensions for Ethernet VPN bridging implementation to better support multicast scenarios - More work for Wi-Fi 7 support, comprising conversion of all the existing drivers to internal TX queue usage - IPSec: introduce a new offload type (packet offload) allowing complete header processing and crypto offloading - IPSec: extended ack support for more descriptive XFRM error reporting - RXRPC: increase SACK table size and move processing into a per-local endpoint kernel thread, reducing considerably the required locking - IEEE 802154: synchronous send frame and extended filtering support, initial support for scanning available 15.4 networks - Tun: bump the link speed from 10Mbps to 10Gbps - Tun/VirtioNet: implement UDP segmentation offload support Driver API: - PHY/SFP: improve power level switching between standard level 1 and the higher power levels - New API for netdev <-> devlink_port linkage - PTP: convert existing drivers to new frequency adjustment implementation - DSA: add support for rx offloading - Autoload DSA tagging driver when dynamically changing protocol - Add new PCP and APPTRUST attributes to Data Center Bridging - Add configuration support for 800Gbps link speed - Add devlink port function attribute to enable/disable RoCE and migratable - Extend devlink-rate to support strict prioriry and weighted fair queuing - Add devlink support to directly reading from region memory - New device tree helper to fetch MAC address from nvmem - New big TCP helper to simplify temporary header stripping New hardware / drivers: - Ethernet: - Marvel Octeon CNF95N and CN10KB Ethernet Switches - Marvel Prestera AC5X Ethernet Switch - WangXun 10 Gigabit NIC - Motorcomm yt8521 Gigabit Ethernet - Microchip ksz9563 Gigabit Ethernet Switch - Microsoft Azure Network Adapter - Linux Automation 10Base-T1L adapter - PHY: - Aquantia AQR112 and AQR412 - Motorcomm YT8531S - PTP: - Orolia ART-CARD - WiFi: - MediaTek Wi-Fi 7 (802.11be) devices - RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB devices - Bluetooth: - Broadcom BCM4377/4378/4387 Bluetooth chipsets - Realtek RTL8852BE and RTL8723DS - Cypress.CYW4373A0 WiFi + Bluetooth combo device Drivers: - CAN: - gs_usb: bus error reporting support - kvaser_usb: listen only and bus error reporting support - Ethernet NICs: - Intel (100G): - extend action skbedit to RX queue mapping - implement devlink-rate support - support direct read from memory - nVidia/Mellanox (mlx5): - SW steering improvements, increasing rules update rate - Support for enhanced events compression - extend H/W offload packet manipulation capabilities - implement IPSec packet offload mode - nVidia/Mellanox (mlx4): - better big TCP support - Netronome Ethernet NICs (nfp): - IPsec offload support - add support for multicast filter - Broadcom: - RSS and PTP support improvements - AMD/SolarFlare: - netlink extened ack improvements - add basic flower matches to offload, and related stats - Virtual NICs: - ibmvnic: introduce affinity hint support - small / embedded: - FreeScale fec: add initial XDP support - Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood - TI am65-cpsw: add suspend/resume support - Mediatek MT7986: add RX wireless wthernet dispatch support - Realtek 8169: enable GRO software interrupt coalescing per default - Ethernet high-speed switches: - Microchip (sparx5): - add support for Sparx5 TC/flower H/W offload via VCAP - Mellanox mlxsw: - add 802.1X and MAC Authentication Bypass offload support - add ip6gre support - Embedded Ethernet switches: - Mediatek (mtk_eth_soc): - improve PCS implementation, add DSA untag support - enable flow offload support - Renesas: - add rswitch R-Car Gen4 gPTP support - Microchip (lan966x): - add full XDP support - add TC H/W offload via VCAP - enable PTP on bridge interfaces - Microchip (ksz8): - add MTU support for KSZ8 series - Qualcomm 802.11ax WiFi (ath11k): - support configuring channel dwell time during scan - MediaTek WiFi (mt76): - enable Wireless Ethernet Dispatch (WED) offload support - add ack signal support - enable coredump support - remain_on_channel support - Intel WiFi (iwlwifi): - enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities - 320 MHz channels support - RealTek WiFi (rtw89): - new dynamic header firmware format support - wake-over-WLAN support" * tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2002 commits) ipvs: fix type warning in do_div() on 32 bit net: lan966x: Remove a useless test in lan966x_ptp_add_trap() net: ipa: add IPA v4.7 support dt-bindings: net: qcom,ipa: Add SM6350 compatible bnxt: Use generic HBH removal helper in tx path IPv6/GRO: generic helper to remove temporary HBH/jumbo header in driver selftests: forwarding: Add bridge MDB test selftests: forwarding: Rename bridge_mdb test bridge: mcast: Support replacement of MDB port group entries bridge: mcast: Allow user space to specify MDB entry routing protocol bridge: mcast: Allow user space to add (*, G) with a source list and filter mode bridge: mcast: Add support for (*, G) with a source list and filter mode bridge: mcast: Avoid arming group timer when (S, G) corresponds to a source bridge: mcast: Add a flag for user installed source entries bridge: mcast: Expose __br_multicast_del_group_src() bridge: mcast: Expose br_multicast_new_group_src() bridge: mcast: Add a centralized error path bridge: mcast: Place netlink policy before validation functions bridge: mcast: Split (*, G) and (S, G) addition into different functions bridge: mcast: Do not derive entry type from its filter mode ...
2022-12-12Merge tag 'random-6.2-rc1-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull random number generator updates from Jason Donenfeld: - Replace prandom_u32_max() and various open-coded variants of it, there is now a new family of functions that uses fast rejection sampling to choose properly uniformly random numbers within an interval: get_random_u32_below(ceil) - [0, ceil) get_random_u32_above(floor) - (floor, U32_MAX] get_random_u32_inclusive(floor, ceil) - [floor, ceil] Coccinelle was used to convert all current users of prandom_u32_max(), as well as many open-coded patterns, resulting in improvements throughout the tree. I'll have a "late" 6.1-rc1 pull for you that removes the now unused prandom_u32_max() function, just in case any other trees add a new use case of it that needs to converted. According to linux-next, there may be two trivial cases of prandom_u32_max() reintroductions that are fixable with a 's/.../.../'. So I'll have for you a final conversion patch doing that alongside the removal patch during the second week. This is a treewide change that touches many files throughout. - More consistent use of get_random_canary(). - Updates to comments, documentation, tests, headers, and simplification in configuration. - The arch_get_random*_early() abstraction was only used by arm64 and wasn't entirely useful, so this has been replaced by code that works in all relevant contexts. - The kernel will use and manage random seeds in non-volatile EFI variables, refreshing a variable with a fresh seed when the RNG is initialized. The RNG GUID namespace is then hidden from efivarfs to prevent accidental leakage. These changes are split into random.c infrastructure code used in the EFI subsystem, in this pull request, and related support inside of EFISTUB, in Ard's EFI tree. These are co-dependent for full functionality, but the order of merging doesn't matter. - Part of the infrastructure added for the EFI support is also used for an improvement to the way vsprintf initializes its siphash key, replacing an sleep loop wart. - The hardware RNG framework now always calls its correct random.c input function, add_hwgenerator_randomness(), rather than sometimes going through helpers better suited for other cases. - The add_latent_entropy() function has long been called from the fork handler, but is a no-op when the latent entropy gcc plugin isn't used, which is fine for the purposes of latent entropy. But it was missing out on the cycle counter that was also being mixed in beside the latent entropy variable. So now, if the latent entropy gcc plugin isn't enabled, add_latent_entropy() will expand to a call to add_device_randomness(NULL, 0), which adds a cycle counter, without the absent latent entropy variable. - The RNG is now reseeded from a delayed worker, rather than on demand when used. Always running from a worker allows it to make use of the CPU RNG on platforms like S390x, whose instructions are too slow to do so from interrupts. It also has the effect of adding in new inputs more frequently with more regularity, amounting to a long term transcript of random values. Plus, it helps a bit with the upcoming vDSO implementation (which isn't yet ready for 6.2). - The jitter entropy algorithm now tries to execute on many different CPUs, round-robining, in hopes of hitting even more memory latencies and other unpredictable effects. It also will mix in a cycle counter when the entropy timer fires, in addition to being mixed in from the main loop, to account more explicitly for fluctuations in that timer firing. And the state it touches is now kept within the same cache line, so that it's assured that the different execution contexts will cause latencies. * tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (23 commits) random: include <linux/once.h> in the right header random: align entropy_timer_state to cache line random: mix in cycle counter when jitter timer fires random: spread out jitter callback to different CPUs random: remove extraneous period and add a missing one in comments efi: random: refresh non-volatile random seed when RNG is initialized vsprintf: initialize siphash key using notifier random: add back async readiness notifier random: reseed in delayed work rather than on-demand random: always mix cycle counter in add_latent_entropy() hw_random: use add_hwgenerator_randomness() for early entropy random: modernize documentation comment on get_random_bytes() random: adjust comment to account for removed function random: remove early archrandom abstraction random: use random.trust_{bootloader,cpu} command line option only stackprotector: actually use get_random_canary() stackprotector: move get_random_canary() into stackprotector.h treewide: use get_random_u32_inclusive() when possible treewide: use get_random_u32_{above,below}() instead of manual loop treewide: use get_random_u32_below() instead of deprecated function ...
2022-12-10bpf: use check_ids() for active_lock comparisonEduard Zingerman
An update for verifier.c:states_equal()/regsafe() to use check_ids() for active spin lock comparisons. This fixes the issue reported by Kumar Kartikeya Dwivedi in [1] using technique suggested by Edward Cree. W/o this commit the verifier might be tricked to accept the following program working with a map containing spin locks: 0: r9 = map_lookup_elem(...) ; Returns PTR_TO_MAP_VALUE_OR_NULL id=1. 1: r8 = map_lookup_elem(...) ; Returns PTR_TO_MAP_VALUE_OR_NULL id=2. 2: if r9 == 0 goto exit ; r9 -> PTR_TO_MAP_VALUE. 3: if r8 == 0 goto exit ; r8 -> PTR_TO_MAP_VALUE. 4: r7 = ktime_get_ns() ; Unbound SCALAR_VALUE. 5: r6 = ktime_get_ns() ; Unbound SCALAR_VALUE. 6: bpf_spin_lock(r8) ; active_lock.id == 2. 7: if r6 > r7 goto +1 ; No new information about the state ; is derived from this check, thus ; produced verifier states differ only ; in 'insn_idx'. 8: r9 = r8 ; Optionally make r9.id == r8.id. --- checkpoint --- ; Assume is_state_visisted() creates a ; checkpoint here. 9: bpf_spin_unlock(r9) ; (a,b) active_lock.id == 2. ; (a) r9.id == 2, (b) r9.id == 1. 10: exit(0) Consider two verification paths: (a) 0-10 (b) 0-7,9-10 The path (a) is verified first. If checkpoint is created at (8) the (b) would assume that (8) is safe because regsafe() does not compare register ids for registers of type PTR_TO_MAP_VALUE. [1] https://lore.kernel.org/bpf/20221111202719.982118-1-memxor@gmail.com/ Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Suggested-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20221209135733.28851-6-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>