linux-stable.git/include/linux/bpf-cgroup.h, branch v6.6.2

bpf, net: Check skb ownership against full socket.

2023-06-30T14:04:05+00:00

Check skb ownership of an skb against full sockets instead of request_sock.

The filters were called only if an skb is owned by the sock that the skb is
sent out through. In another words, skb->sk should point to the sock that
it is sending through its egress. However, the filters would miss SYN/ACK
skbs that they are owned by a request_sock but sent through the listener
sock, that is the socket listening incoming connections.

However, the listener socket is also the full socket of the request socket.
We should use the full socket as the owner socket of an skb instead.

What is the ownership check for?
================================

BPF_CGROUP_RUN_PROG_INET_EGRESS() checked sk == skb->sk to ensure the
ownership of an skb. Alexei referred to a mailing list conversation [0]
that took place a few years ago. In that conversation, Daniel Borkmann
stated that:

    Wouldn't that mean however, when you go through stacked devices that
    you'd run the same eBPF cgroup program for skb->sk multiple times?

According to what Daniel said, the ownership check mentioned earlier
presumably prevents multiple calls of egress filters caused by an skb.

A test that reproduce this scenario shows that the BPF cgroup egress
programs can be called multiple times for one skb if this ownership
check is not there. So, we can not just remove this check.

Test Stacked Devices
====================

We use L2TP to build an environment of stacked devices. L2TP (Layer 2
Tunneling Protocol) is a tunneling protocol used to support virtual private
networks (VPNs). It relays encapsulated packets; for example in UDP, to its
peer by using a socket.

Using L2TP, packets are first sent through the IP stack and should then
arrive at an L2TP device. The device will expand its skb header to
encapsulate the packet. The skb will be sent back to the IP stack using
the socket that was made for the L2TP session. After that, the routing
process will occur once more, but this time for a new destination.

We changed tools/testing/selftests/net/l2tp.sh to set up a test environment
using L2TP. The run_ping() function in l2tp.sh is where the main change
occurred.

    run_ping()
    {
        local desc="$1"

        sleep 10
        run_cmd host-1 ${ping6} -s 227 -c 4 -i 10 -I fc00:101::1
        fc00:101::2
        log_test $? 0 "IPv6 route through L2TP tunnel ${desc}"
        sleep 10
    }

The test will use L2TP devices to send PING messages. These messages will
have a message size of 227 bytes as a special label to distinguish them.
This is not an ideal solution, but works.

During the execution of the test script, bpftrace was attached to
ip6_finish_output() and l2tp_xmit_skb():

    bpftrace -e '
      kfunc:ip6_finish_output {
        time("%H:%M:%S: ");
        printf("ip6_finish_output skb=%p skb->len=%d cgroup=%p sk=%p
                skb->sk=%p\n", args->skb, args->skb->len,
               args->sk->sk_cgrp_data.cgroup, args->sk, args->skb->sk); }
      kfunc:l2tp_xmit_skb {
        time("%H:%M:%S: ");
        printf("l2tp_xmit_skb skb=%p sk=%p\n", args->skb,
	       args->session->tunnel->sock); }'

The following is part of the output messages printed by bpftrace:

    16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=275
              cgroup=0xffff88810741f800 sk=0xffff888105f3b900
              skb->sk=0xffff888105f3b900

    16:35:20: l2tp_xmit_skb skb=0xffff888103d8e600 sk=0xffff888103dd6300

    16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=337
              cgroup=0xffff88810741f800 sk=0xffff888103dd6300
              skb->sk=0xffff888105f3b900

    16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=337
              cgroup=(nil) sk=(nil) skb->sk=(nil)

    16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=275
              cgroup=0xffffffff837741d0 sk=0xffff888101fe0000
              skb->sk=0xffff888101fe0000

    16:35:20: l2tp_xmit_skb skb=0xffff888103d8e000 sk=0xffff888103483180

    16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=337
              cgroup=0xffff88810741f800 sk=0xffff888103483180
              skb->sk=0xffff888101fe0000

    16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=337
              cgroup=(nil) sk=(nil) skb->sk=(nil)

The first four entries describe a PING message that was sent using the ping
command, whereas the following four entries describe the response received.
Multiple sockets are used to send one skb, including the socket used by the
L2TP session. This can be observed.

Based on this information, it seems that the ownership check is designed to
avoid multiple calls of egress filters caused by a single skb.

  [0] https://lore.kernel.org/all/58193E9D.7040201@iogearbox.net/

Signed-off-by: Kui-Feng Lee 
Signed-off-by: Daniel Borkmann 
Link: https://lore.kernel.org/bpf/20230624014600.576756-2-kuifeng@meta.com

bpf: Introduce cgroup_{common,current}_func_proto

2022-08-23T23:08:21+00:00

Split cgroup_base_func_proto into the following:

* cgroup_common_func_proto - common helpers for all cgroup hooks
* cgroup_current_func_proto - common helpers for all cgroup hooks
  running in the process context (== have meaningful 'current').

Move bpf_{g,s}et_retval and other cgroup-related helpers into
kernel/bpf/cgroup.c so they closer to where they are being used.

Signed-off-by: Stanislav Fomichev 
Acked-by: Martin KaFai Lau 
Link: https://lore.kernel.org/r/20220823222555.523590-2-sdf@google.com
Signed-off-by: Alexei Starovoitov

bpf: per-cgroup lsm flavor

2022-06-29T20:21:51+00:00

Allow attaching to lsm hooks in the cgroup context.

Attaching to per-cgroup LSM works exactly like attaching
to other per-cgroup hooks. New BPF_LSM_CGROUP is added
to trigger new mode; the actual lsm hook we attach to is
signaled via existing attach_btf_id.

For the hooks that have 'struct socket' or 'struct sock' as its first
argument, we use the cgroup associated with that socket. For the rest,
we use 'current' cgroup (this is all on default hierarchy == v2 only).
Note that for some hooks that work on 'struct sock' we still
take the cgroup from 'current' because some of them work on the socket
that hasn't been properly initialized yet.

Behind the scenes, we allocate a shim program that is attached
to the trampoline and runs cgroup effective BPF programs array.
This shim has some rudimentary ref counting and can be shared
between several programs attaching to the same lsm hook from
different cgroups.

Note that this patch bloats cgroup size because we add 211
cgroup_bpf_attach_type(s) for simplicity sake. This will be
addressed in the subsequent patch.

Also note that we only add non-sleepable flavor for now. To enable
sleepable use-cases, bpf_prog_run_array_cg has to grab trace rcu,
shim programs have to be freed via trace rcu, cgroup_bpf.effective
should be also trace-rcu-managed + maybe some other changes that
I'm not aware of.

Reviewed-by: Martin KaFai Lau 
Signed-off-by: Stanislav Fomichev 
Link: https://lore.kernel.org/r/20220628174314.1216643-4-sdf@google.com
Signed-off-by: Alexei Starovoitov

bpf: convert cgroup_bpf.progs to hlist

2022-06-29T20:21:51+00:00

This lets us reclaim some space to be used by new cgroup lsm slots.

Before:
struct cgroup_bpf {
	struct bpf_prog_array *    effective[23];        /*     0   184 */
	/* --- cacheline 2 boundary (128 bytes) was 56 bytes ago --- */
	struct list_head           progs[23];            /*   184   368 */
	/* --- cacheline 8 boundary (512 bytes) was 40 bytes ago --- */
	u32                        flags[23];            /*   552    92 */

	/* XXX 4 bytes hole, try to pack */

	/* --- cacheline 10 boundary (640 bytes) was 8 bytes ago --- */
	struct list_head           storages;             /*   648    16 */
	struct bpf_prog_array *    inactive;             /*   664     8 */
	struct percpu_ref          refcnt;               /*   672    16 */
	struct work_struct         release_work;         /*   688    32 */

	/* size: 720, cachelines: 12, members: 7 */
	/* sum members: 716, holes: 1, sum holes: 4 */
	/* last cacheline: 16 bytes */
};

After:
struct cgroup_bpf {
	struct bpf_prog_array *    effective[23];        /*     0   184 */
	/* --- cacheline 2 boundary (128 bytes) was 56 bytes ago --- */
	struct hlist_head          progs[23];            /*   184   184 */
	/* --- cacheline 5 boundary (320 bytes) was 48 bytes ago --- */
	u8                         flags[23];            /*   368    23 */

	/* XXX 1 byte hole, try to pack */

	/* --- cacheline 6 boundary (384 bytes) was 8 bytes ago --- */
	struct list_head           storages;             /*   392    16 */
	struct bpf_prog_array *    inactive;             /*   408     8 */
	struct percpu_ref          refcnt;               /*   416    16 */
	struct work_struct         release_work;         /*   432    72 */

	/* size: 504, cachelines: 8, members: 7 */
	/* sum members: 503, holes: 1, sum holes: 1 */
	/* last cacheline: 56 bytes */
};

Suggested-by: Jakub Sitnicki 
Reviewed-by: Jakub Sitnicki 
Reviewed-by: Martin KaFai Lau 
Signed-off-by: Stanislav Fomichev 
Link: https://lore.kernel.org/r/20220628174314.1216643-3-sdf@google.com
Signed-off-by: Alexei Starovoitov

bpf: Use bpf_prog_run_array_cg_flags everywhere

2022-04-26T00:03:57+00:00

Rename bpf_prog_run_array_cg_flags to bpf_prog_run_array_cg and
use it everywhere. check_return_code already enforces sane
return ranges for all cgroup types. (only egress and bind hooks have
uncanonical return ranges, the rest is using [0, 1])

No functional changes.

v2:
- 'func_ret & 1' under explicit test (Andrii & Martin)

Suggested-by: Alexei Starovoitov 
Signed-off-by: Stanislav Fomichev 
Signed-off-by: Alexei Starovoitov 
Link: https://lore.kernel.org/bpf/20220425220448.3669032-1-sdf@google.com

cgroup/bpf: fast path skb BPF filtering

2022-01-27T18:15:00+00:00

Even though there is a static key protecting from overhead from
cgroup-bpf skb filtering when there is nothing attached, in many cases
it's not enough as registering a filter for one type will ruin the fast
path for all others. It's observed in production servers I've looked
at but also in laptops, where registration is done during init by
systemd or something else.

Add a per-socket fast path check guarding from such overhead. This
affects both receive and transmit paths of TCP, UDP and other
protocols. It showed ~1% tx/s improvement in small payload UDP
send benchmarks using a real NIC and in a server environment and the
number jumps to 2-3% for preemtible kernels.

Reviewed-by: Stanislav Fomichev 
Signed-off-by: Pavel Begunkov 
Acked-by: Martin KaFai Lau 
Link: https://lore.kernel.org/r/d8c58857113185a764927a46f4b5a058d36d3ec3.1643292455.git.asml.silence@gmail.com
Signed-off-by: Alexei Starovoitov

bpf: Remove the cgroup -> bpf header dependecy

2021-12-16T22:57:10+00:00

Remove the dependency from cgroup-defs.h to bpf-cgroup.h and bpf.h.
This reduces the incremental build size of x86 allmodconfig after
bpf.h was touched from ~17k objects rebuilt to ~5k objects.
bpf.h is 2.2kLoC and is modified relatively often.

We need a new header with just the definition of struct cgroup_bpf
and enum cgroup_bpf_attach_type, this is akin to cgroup-defs.h.

Signed-off-by: Jakub Kicinski 
Signed-off-by: Alexei Starovoitov 
Acked-by: Tejun Heo 
Link: https://lore.kernel.org/bpf/20211216025538.1649516-4-kuba@kernel.org

Merge branch 'for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

2021-11-02T22:37:27+00:00

Pull cgroup updates from Tejun Heo:

 - The misc controller now reports allocation rejections through
   misc.events instead of printking

 - cgroup_mutex usage is reduced to improve scalability of some
   operations

 - vhost helper threads are now assigned to the right cgroup on cgroup2

 - Bug fixes

* 'for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: bpf: Move wrapper for __cgroup_bpf_*() to kernel/bpf/cgroup.c
  cgroup: Fix rootcg cpu.stat guest double counting
  cgroup: no need for cgroup_mutex for /proc/cgroups
  cgroup: remove cgroup_mutex from cgroupstats_build
  cgroup: reduce dependency on cgroup_mutex
  cgroup: cgroup-v1: do not exclude cgrp_dfl_root
  cgroup: Make rebind_subsystems() disable v2 controllers all at once
  docs/cgroup: add entry for misc.events
  misc_cgroup: remove error log to avoid log flood
  misc_cgroup: introduce misc.events to count failures

cgroup: bpf: Move wrapper for __cgroup_bpf_*() to kernel/bpf/cgroup.c

2021-11-01T16:49:00+00:00

In commit 324bda9e6c5a("bpf: multi program support for cgroup+bpf")
cgroup_bpf_*() called from kernel/bpf/syscall.c, but now they are only
used in kernel/bpf/cgroup.c, so move these function to
kernel/bpf/cgroup.c, like cgroup_bpf_replace().

Signed-off-by: He Fengqing 
Signed-off-by: Tejun Heo

inet: remove races in inet{6}_getname()

2021-10-28T01:20:21+00:00

syzbot reported data-races in inet_getname() multiple times,
it is time we fix this instead of pretending applications
should not trigger them.

getsockname() and getpeername() are not really considered fast path.

v2: added the missing BPF_CGROUP_RUN_SA_PROG() declaration
    needed when CONFIG_CGROUP_BPF=n, as reported by
    kernel test robot 

syzbot typical report:

BUG: KCSAN: data-race in __inet_hash_connect / inet_getname

write to 0xffff888136d66cf8 of 2 bytes by task 14374 on cpu 1:
 __inet_hash_connect+0x7ec/0x950 net/ipv4/inet_hashtables.c:831
 inet_hash_connect+0x85/0x90 net/ipv4/inet_hashtables.c:853
 tcp_v4_connect+0x782/0xbb0 net/ipv4/tcp_ipv4.c:275
 __inet_stream_connect+0x156/0x6e0 net/ipv4/af_inet.c:664
 inet_stream_connect+0x44/0x70 net/ipv4/af_inet.c:728
 __sys_connect_file net/socket.c:1896 [inline]
 __sys_connect+0x254/0x290 net/socket.c:1913
 __do_sys_connect net/socket.c:1923 [inline]
 __se_sys_connect net/socket.c:1920 [inline]
 __x64_sys_connect+0x3d/0x50 net/socket.c:1920
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x44/0xa0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

read to 0xffff888136d66cf8 of 2 bytes by task 14408 on cpu 0:
 inet_getname+0x11f/0x170 net/ipv4/af_inet.c:790
 __sys_getsockname+0x11d/0x1b0 net/socket.c:1946
 __do_sys_getsockname net/socket.c:1961 [inline]
 __se_sys_getsockname net/socket.c:1958 [inline]
 __x64_sys_getsockname+0x3e/0x50 net/socket.c:1958
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x44/0xa0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

value changed: 0x0000 -> 0xdee0

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 14408 Comm: syz-executor.3 Not tainted 5.15.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
Link: https://lore.kernel.org/r/20211026213014.3026708-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski