<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/kernel/bpf/cgroup.c, branch v6.12</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction</title>
<updated>2024-10-08T18:43:22+00:00</updated>
<author>
<name>Chen Ridong</name>
<email>chenridong@huawei.com</email>
</author>
<published>2024-10-08T11:24:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=117932eea99b729ee5d12783601a4f7f5fd58a23'/>
<id>117932eea99b729ee5d12783601a4f7f5fd58a23</id>
<content type='text'>
A hung_task problem shown below was found:

INFO: task kworker/0:0:8 blocked for more than 327 seconds.
"echo 0 &gt; /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Workqueue: events cgroup_bpf_release
Call Trace:
 &lt;TASK&gt;
 __schedule+0x5a2/0x2050
 ? find_held_lock+0x33/0x100
 ? wq_worker_sleeping+0x9e/0xe0
 schedule+0x9f/0x180
 schedule_preempt_disabled+0x25/0x50
 __mutex_lock+0x512/0x740
 ? cgroup_bpf_release+0x1e/0x4d0
 ? cgroup_bpf_release+0xcf/0x4d0
 ? process_scheduled_works+0x161/0x8a0
 ? cgroup_bpf_release+0x1e/0x4d0
 ? mutex_lock_nested+0x2b/0x40
 ? __pfx_delay_tsc+0x10/0x10
 mutex_lock_nested+0x2b/0x40
 cgroup_bpf_release+0xcf/0x4d0
 ? process_scheduled_works+0x161/0x8a0
 ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0
 ? process_scheduled_works+0x161/0x8a0
 process_scheduled_works+0x23a/0x8a0
 worker_thread+0x231/0x5b0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x14d/0x1c0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x59/0x70
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 &lt;/TASK&gt;

This issue can be reproduced by the following pressuse test:
1. A large number of cpuset cgroups are deleted.
2. Set cpu on and off repeatly.
3. Set watchdog_thresh repeatly.
The scripts can be obtained at LINK mentioned above the signature.

The reason for this issue is cgroup_mutex and cpu_hotplug_lock are
acquired in different tasks, which may lead to deadlock.
It can lead to a deadlock through the following steps:
1. A large number of cpusets are deleted asynchronously, which puts a
   large number of cgroup_bpf_release works into system_wq. The max_active
   of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are
   cgroup_bpf_release works, and many cgroup_bpf_release works will be put
   into inactive queue. As illustrated in the diagram, there are 256 (in
   the acvtive queue) + n (in the inactive queue) works.
2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put
   smp_call_on_cpu work into system_wq. However step 1 has already filled
   system_wq, 'sscs.work' is put into inactive queue. 'sscs.work' has
   to wait until the works that were put into the inacvtive queue earlier
   have executed (n cgroup_bpf_release), so it will be blocked for a while.
3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2.
4. Cpusets that were deleted at step 1 put cgroup_release works into
   cgroup_destroy_wq. They are competing to get cgroup_mutex all the time.
   When cgroup_metux is acqured by work at css_killed_work_fn, it will
   call cpuset_css_offline, which needs to acqure cpu_hotplug_lock.read.
   However, cpuset_css_offline will be blocked for step 3.
5. At this moment, there are 256 works in active queue that are
   cgroup_bpf_release, they are attempting to acquire cgroup_mutex, and as
   a result, all of them are blocked. Consequently, sscs.work can not be
   executed. Ultimately, this situation leads to four processes being
   blocked, forming a deadlock.

system_wq(step1)		WatchDog(step2)			cpu offline(step3)	cgroup_destroy_wq(step4)
...
2000+ cgroups deleted asyn
256 actives + n inactives
				__lockup_detector_reconfigure
				P(cpu_hotplug_lock.read)
				put sscs.work into system_wq
256 + n + 1(sscs.work)
sscs.work wait to be executed
				warting sscs.work finish
								percpu_down_write
								P(cpu_hotplug_lock.write)
								...blocking...
											css_killed_work_fn
											P(cgroup_mutex)
											cpuset_css_offline
											P(cpu_hotplug_lock.read)
											...blocking...
256 cgroup_bpf_release
mutex_lock(&amp;cgroup_mutex);
..blocking...

To fix the problem, place cgroup_bpf_release works on a dedicated
workqueue which can break the loop and solve the problem. System wqs are
for misc things which shouldn't create a large number of concurrent work
items. If something is going to generate &gt;WQ_DFL_ACTIVE(256) concurrent
work items, it should use its own dedicated workqueue.

Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Cc: stable@vger.kernel.org # v5.3+
Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t
Tested-by: Vishal Chourasia &lt;vishalc@linux.ibm.com&gt;
Signed-off-by: Chen Ridong &lt;chenridong@huawei.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
A hung_task problem shown below was found:

INFO: task kworker/0:0:8 blocked for more than 327 seconds.
"echo 0 &gt; /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Workqueue: events cgroup_bpf_release
Call Trace:
 &lt;TASK&gt;
 __schedule+0x5a2/0x2050
 ? find_held_lock+0x33/0x100
 ? wq_worker_sleeping+0x9e/0xe0
 schedule+0x9f/0x180
 schedule_preempt_disabled+0x25/0x50
 __mutex_lock+0x512/0x740
 ? cgroup_bpf_release+0x1e/0x4d0
 ? cgroup_bpf_release+0xcf/0x4d0
 ? process_scheduled_works+0x161/0x8a0
 ? cgroup_bpf_release+0x1e/0x4d0
 ? mutex_lock_nested+0x2b/0x40
 ? __pfx_delay_tsc+0x10/0x10
 mutex_lock_nested+0x2b/0x40
 cgroup_bpf_release+0xcf/0x4d0
 ? process_scheduled_works+0x161/0x8a0
 ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0
 ? process_scheduled_works+0x161/0x8a0
 process_scheduled_works+0x23a/0x8a0
 worker_thread+0x231/0x5b0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x14d/0x1c0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x59/0x70
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 &lt;/TASK&gt;

This issue can be reproduced by the following pressuse test:
1. A large number of cpuset cgroups are deleted.
2. Set cpu on and off repeatly.
3. Set watchdog_thresh repeatly.
The scripts can be obtained at LINK mentioned above the signature.

The reason for this issue is cgroup_mutex and cpu_hotplug_lock are
acquired in different tasks, which may lead to deadlock.
It can lead to a deadlock through the following steps:
1. A large number of cpusets are deleted asynchronously, which puts a
   large number of cgroup_bpf_release works into system_wq. The max_active
   of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are
   cgroup_bpf_release works, and many cgroup_bpf_release works will be put
   into inactive queue. As illustrated in the diagram, there are 256 (in
   the acvtive queue) + n (in the inactive queue) works.
2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put
   smp_call_on_cpu work into system_wq. However step 1 has already filled
   system_wq, 'sscs.work' is put into inactive queue. 'sscs.work' has
   to wait until the works that were put into the inacvtive queue earlier
   have executed (n cgroup_bpf_release), so it will be blocked for a while.
3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2.
4. Cpusets that were deleted at step 1 put cgroup_release works into
   cgroup_destroy_wq. They are competing to get cgroup_mutex all the time.
   When cgroup_metux is acqured by work at css_killed_work_fn, it will
   call cpuset_css_offline, which needs to acqure cpu_hotplug_lock.read.
   However, cpuset_css_offline will be blocked for step 3.
5. At this moment, there are 256 works in active queue that are
   cgroup_bpf_release, they are attempting to acquire cgroup_mutex, and as
   a result, all of them are blocked. Consequently, sscs.work can not be
   executed. Ultimately, this situation leads to four processes being
   blocked, forming a deadlock.

system_wq(step1)		WatchDog(step2)			cpu offline(step3)	cgroup_destroy_wq(step4)
...
2000+ cgroups deleted asyn
256 actives + n inactives
				__lockup_detector_reconfigure
				P(cpu_hotplug_lock.read)
				put sscs.work into system_wq
256 + n + 1(sscs.work)
sscs.work wait to be executed
				warting sscs.work finish
								percpu_down_write
								P(cpu_hotplug_lock.write)
								...blocking...
											css_killed_work_fn
											P(cgroup_mutex)
											cpuset_css_offline
											P(cpu_hotplug_lock.read)
											...blocking...
256 cgroup_bpf_release
mutex_lock(&amp;cgroup_mutex);
..blocking...

To fix the problem, place cgroup_bpf_release works on a dedicated
workqueue which can break the loop and solve the problem. System wqs are
for misc things which shouldn't create a large number of concurrent work
items. If something is going to generate &gt;WQ_DFL_ACTIVE(256) concurrent
work items, it should use its own dedicated workqueue.

Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Cc: stable@vger.kernel.org # v5.3+
Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t
Tested-by: Vishal Chourasia &lt;vishalc@linux.ibm.com&gt;
Signed-off-by: Chen Ridong &lt;chenridong@huawei.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>bpf: Allow bpf_current_task_under_cgroup() with BPF_CGROUP_*</title>
<updated>2024-08-19T22:25:30+00:00</updated>
<author>
<name>Matteo Croce</name>
<email>teknoraver@meta.com</email>
</author>
<published>2024-08-19T16:28:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=7f6287417baf57754f47687c6ea1a749a0686ab0'/>
<id>7f6287417baf57754f47687c6ea1a749a0686ab0</id>
<content type='text'>
The helper bpf_current_task_under_cgroup() currently is only allowed for
tracing programs, allow its usage also in the BPF_CGROUP_* program types.

Move the code from kernel/trace/bpf_trace.c to kernel/bpf/helpers.c,
so it compiles also without CONFIG_BPF_EVENTS.

This will be used in systemd-networkd to monitor the sysctl writes,
and filter it's own writes from others:
https://github.com/systemd/systemd/pull/32212

Signed-off-by: Matteo Croce &lt;teknoraver@meta.com&gt;
Signed-off-by: Andrii Nakryiko &lt;andrii@kernel.org&gt;
Link: https://lore.kernel.org/bpf/20240819162805.78235-3-technoboy85@gmail.com
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The helper bpf_current_task_under_cgroup() currently is only allowed for
tracing programs, allow its usage also in the BPF_CGROUP_* program types.

Move the code from kernel/trace/bpf_trace.c to kernel/bpf/helpers.c,
so it compiles also without CONFIG_BPF_EVENTS.

This will be used in systemd-networkd to monitor the sysctl writes,
and filter it's own writes from others:
https://github.com/systemd/systemd/pull/32212

Signed-off-by: Matteo Croce &lt;teknoraver@meta.com&gt;
Signed-off-by: Andrii Nakryiko &lt;andrii@kernel.org&gt;
Link: https://lore.kernel.org/bpf/20240819162805.78235-3-technoboy85@gmail.com
</pre>
</div>
</content>
</entry>
<entry>
<title>bpf: Allow helper bpf_get_[ns_]current_pid_tgid() for all prog types</title>
<updated>2024-03-19T21:24:07+00:00</updated>
<author>
<name>Yonghong Song</name>
<email>yonghong.song@linux.dev</email>
</author>
<published>2024-03-15T18:48:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=eb166e522c77699fc19bfa705652327a1e51a117'/>
<id>eb166e522c77699fc19bfa705652327a1e51a117</id>
<content type='text'>
Currently bpf_get_current_pid_tgid() is allowed in tracing, cgroup
and sk_msg progs while bpf_get_ns_current_pid_tgid() is only allowed
in tracing progs.

We have an internal use case where for an application running
in a container (with pid namespace), user wants to get
the pid associated with the pid namespace in a cgroup bpf
program. Currently, cgroup bpf progs already allow
bpf_get_current_pid_tgid(). Let us allow bpf_get_ns_current_pid_tgid()
as well.

With auditing the code, bpf_get_current_pid_tgid() is also used
by sk_msg prog. But there are no side effect to expose these two
helpers to all prog types since they do not reveal any kernel specific
data. The detailed discussion is in [1].

So with this patch, both bpf_get_current_pid_tgid() and bpf_get_ns_current_pid_tgid()
are put in bpf_base_func_proto(), making them available to all
program types.

  [1] https://lore.kernel.org/bpf/20240307232659.1115872-1-yonghong.song@linux.dev/

Signed-off-by: Yonghong Song &lt;yonghong.song@linux.dev&gt;
Signed-off-by: Andrii Nakryiko &lt;andrii@kernel.org&gt;
Acked-by: Jiri Olsa &lt;jolsa@kernel.org&gt;
Link: https://lore.kernel.org/bpf/20240315184854.2975190-1-yonghong.song@linux.dev
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Currently bpf_get_current_pid_tgid() is allowed in tracing, cgroup
and sk_msg progs while bpf_get_ns_current_pid_tgid() is only allowed
in tracing progs.

We have an internal use case where for an application running
in a container (with pid namespace), user wants to get
the pid associated with the pid namespace in a cgroup bpf
program. Currently, cgroup bpf progs already allow
bpf_get_current_pid_tgid(). Let us allow bpf_get_ns_current_pid_tgid()
as well.

With auditing the code, bpf_get_current_pid_tgid() is also used
by sk_msg prog. But there are no side effect to expose these two
helpers to all prog types since they do not reveal any kernel specific
data. The detailed discussion is in [1].

So with this patch, both bpf_get_current_pid_tgid() and bpf_get_ns_current_pid_tgid()
are put in bpf_base_func_proto(), making them available to all
program types.

  [1] https://lore.kernel.org/bpf/20240307232659.1115872-1-yonghong.song@linux.dev/

Signed-off-by: Yonghong Song &lt;yonghong.song@linux.dev&gt;
Signed-off-by: Andrii Nakryiko &lt;andrii@kernel.org&gt;
Acked-by: Jiri Olsa &lt;jolsa@kernel.org&gt;
Link: https://lore.kernel.org/bpf/20240315184854.2975190-1-yonghong.song@linux.dev
</pre>
</div>
</content>
</entry>
<entry>
<title>net: adopt skb_network_offset() and similar helpers</title>
<updated>2024-03-04T08:47:06+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2024-02-29T09:39:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=80bfab79b8351c8d858e6928a091b57c103dce29'/>
<id>80bfab79b8351c8d858e6928a091b57c103dce29</id>
<content type='text'>
This is a cleanup patch, making code a bit more concise.

1) Use skb_network_offset(skb) in place of
       (skb_network_header(skb) - skb-&gt;data)

2) Use -skb_network_offset(skb) in place of
       (skb-&gt;data - skb_network_header(skb))

3) Use skb_transport_offset(skb) in place of
       (skb_transport_header(skb) - skb-&gt;data)

4) Use skb_inner_transport_offset(skb) in place of
       (skb_inner_transport_header(skb) - skb-&gt;data)

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Edward Cree &lt;ecree.xilinx@gmail.com&gt; # for sfc
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This is a cleanup patch, making code a bit more concise.

1) Use skb_network_offset(skb) in place of
       (skb_network_header(skb) - skb-&gt;data)

2) Use -skb_network_offset(skb) in place of
       (skb-&gt;data - skb_network_header(skb))

3) Use skb_transport_offset(skb) in place of
       (skb_transport_header(skb) - skb-&gt;data)

4) Use skb_inner_transport_offset(skb) in place of
       (skb_inner_transport_header(skb) - skb-&gt;data)

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Edward Cree &lt;ecree.xilinx@gmail.com&gt; # for sfc
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>bpf: remove check in __cgroup_bpf_run_filter_skb</title>
<updated>2024-02-13T23:41:17+00:00</updated>
<author>
<name>Oliver Crumrine</name>
<email>ozlinuxc@gmail.com</email>
</author>
<published>2024-02-09T19:41:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=32e18e7688c6847b0c9db073aafb00639ecf576c'/>
<id>32e18e7688c6847b0c9db073aafb00639ecf576c</id>
<content type='text'>
Originally, this patch removed a redundant check in
BPF_CGROUP_RUN_PROG_INET_EGRESS, as the check was already being done in
the function it called, __cgroup_bpf_run_filter_skb. For v2, it was
reccomended that I remove the check from __cgroup_bpf_run_filter_skb,
and add the checks to the other macro that calls that function,
BPF_CGROUP_RUN_PROG_INET_INGRESS.

To sum it up, checking that the socket exists and that it is a full
socket is now part of both macros BPF_CGROUP_RUN_PROG_INET_EGRESS and
BPF_CGROUP_RUN_PROG_INET_INGRESS, and it is no longer part of the
function they call, __cgroup_bpf_run_filter_skb.

v3-&gt;v4: Fixed weird merge conflict.
v2-&gt;v3: Sent to bpf-next instead of generic patch
v1-&gt;v2: Addressed feedback about where check should be removed.

Signed-off-by: Oliver Crumrine &lt;ozlinuxc@gmail.com&gt;
Acked-by: Stanislav Fomichev &lt;sdf@google.com&gt;
Link: https://lore.kernel.org/r/7lv62yiyvmj5a7eozv2iznglpkydkdfancgmbhiptrgvgan5sy@3fl3onchgdz3
Signed-off-by: Martin KaFai Lau &lt;martin.lau@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Originally, this patch removed a redundant check in
BPF_CGROUP_RUN_PROG_INET_EGRESS, as the check was already being done in
the function it called, __cgroup_bpf_run_filter_skb. For v2, it was
reccomended that I remove the check from __cgroup_bpf_run_filter_skb,
and add the checks to the other macro that calls that function,
BPF_CGROUP_RUN_PROG_INET_INGRESS.

To sum it up, checking that the socket exists and that it is a full
socket is now part of both macros BPF_CGROUP_RUN_PROG_INET_EGRESS and
BPF_CGROUP_RUN_PROG_INET_INGRESS, and it is no longer part of the
function they call, __cgroup_bpf_run_filter_skb.

v3-&gt;v4: Fixed weird merge conflict.
v2-&gt;v3: Sent to bpf-next instead of generic patch
v1-&gt;v2: Addressed feedback about where check should be removed.

Signed-off-by: Oliver Crumrine &lt;ozlinuxc@gmail.com&gt;
Acked-by: Stanislav Fomichev &lt;sdf@google.com&gt;
Link: https://lore.kernel.org/r/7lv62yiyvmj5a7eozv2iznglpkydkdfancgmbhiptrgvgan5sy@3fl3onchgdz3
Signed-off-by: Martin KaFai Lau &lt;martin.lau@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>bpf: Take into account BPF token when fetching helper protos</title>
<updated>2024-01-25T00:21:01+00:00</updated>
<author>
<name>Andrii Nakryiko</name>
<email>andrii@kernel.org</email>
</author>
<published>2024-01-24T02:21:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=bbc1d24724e110b86a1a7c3c1724ce0d62cc1e2e'/>
<id>bbc1d24724e110b86a1a7c3c1724ce0d62cc1e2e</id>
<content type='text'>
Instead of performing unconditional system-wide bpf_capable() and
perfmon_capable() calls inside bpf_base_func_proto() function (and other
similar ones) to determine eligibility of a given BPF helper for a given
program, use previously recorded BPF token during BPF_PROG_LOAD command
handling to inform the decision.

Signed-off-by: Andrii Nakryiko &lt;andrii@kernel.org&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Link: https://lore.kernel.org/bpf/20240124022127.2379740-8-andrii@kernel.org
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Instead of performing unconditional system-wide bpf_capable() and
perfmon_capable() calls inside bpf_base_func_proto() function (and other
similar ones) to determine eligibility of a given BPF helper for a given
program, use previously recorded BPF token during BPF_PROG_LOAD command
handling to inform the decision.

Signed-off-by: Andrii Nakryiko &lt;andrii@kernel.org&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Link: https://lore.kernel.org/bpf/20240124022127.2379740-8-andrii@kernel.org
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge tag 'for-6.7/io_uring-sockopt-2023-10-30' of git://git.kernel.dk/linux</title>
<updated>2023-11-01T21:16:34+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2023-11-01T21:16:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=f5277ad1e9768dbd05b1ae8dcdba690215d8c5b7'/>
<id>f5277ad1e9768dbd05b1ae8dcdba690215d8c5b7</id>
<content type='text'>
Pull io_uring {get,set}sockopt support from Jens Axboe:
 "This adds support for using getsockopt and setsockopt via io_uring.

  The main use cases for this is to enable use of direct descriptors,
  rather than first instantiating a normal file descriptor, doing the
  option tweaking needed, then turning it into a direct descriptor. With
  this support, we can avoid needing a regular file descriptor
  completely.

  The net and bpf bits have been signed off on their side"

* tag 'for-6.7/io_uring-sockopt-2023-10-30' of git://git.kernel.dk/linux:
  selftests/bpf/sockopt: Add io_uring support
  io_uring/cmd: Introduce SOCKET_URING_OP_SETSOCKOPT
  io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPT
  io_uring/cmd: return -EOPNOTSUPP if net is disabled
  selftests/net: Extract uring helpers to be reusable
  tools headers: Grab copy of io_uring.h
  io_uring/cmd: Pass compat mode in issue_flags
  net/socket: Break down __sys_getsockopt
  net/socket: Break down __sys_setsockopt
  bpf: Add sockptr support for setsockopt
  bpf: Add sockptr support for getsockopt
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull io_uring {get,set}sockopt support from Jens Axboe:
 "This adds support for using getsockopt and setsockopt via io_uring.

  The main use cases for this is to enable use of direct descriptors,
  rather than first instantiating a normal file descriptor, doing the
  option tweaking needed, then turning it into a direct descriptor. With
  this support, we can avoid needing a regular file descriptor
  completely.

  The net and bpf bits have been signed off on their side"

* tag 'for-6.7/io_uring-sockopt-2023-10-30' of git://git.kernel.dk/linux:
  selftests/bpf/sockopt: Add io_uring support
  io_uring/cmd: Introduce SOCKET_URING_OP_SETSOCKOPT
  io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPT
  io_uring/cmd: return -EOPNOTSUPP if net is disabled
  selftests/net: Extract uring helpers to be reusable
  tools headers: Grab copy of io_uring.h
  io_uring/cmd: Pass compat mode in issue_flags
  net/socket: Break down __sys_getsockopt
  net/socket: Break down __sys_setsockopt
  bpf: Add sockptr support for setsockopt
  bpf: Add sockptr support for getsockopt
</pre>
</div>
</content>
</entry>
<entry>
<title>bpf: Add sockptr support for setsockopt</title>
<updated>2023-10-19T20:05:37+00:00</updated>
<author>
<name>Breno Leitao</name>
<email>leitao@debian.org</email>
</author>
<published>2023-10-16T13:47:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=3f31e0d14d44ad491a81b7c1f83f32fbc300a867'/>
<id>3f31e0d14d44ad491a81b7c1f83f32fbc300a867</id>
<content type='text'>
The whole network stack uses sockptr, and while it doesn't move to
something more modern, let's use sockptr in setsockptr BPF hooks, so, it
could be used by other callers.

The main motivation for this change is to use it in the io_uring
{g,s}etsockopt(), which will use a userspace pointer for *optval, but, a
kernel value for optlen.

Link: https://lore.kernel.org/all/ZSArfLaaGcfd8LH8@gmail.com/

Signed-off-by: Breno Leitao &lt;leitao@debian.org&gt;
Acked-by: Martin KaFai Lau &lt;martin.lau@kernel.org&gt;
Link: https://lore.kernel.org/r/20231016134750.1381153-3-leitao@debian.org
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The whole network stack uses sockptr, and while it doesn't move to
something more modern, let's use sockptr in setsockptr BPF hooks, so, it
could be used by other callers.

The main motivation for this change is to use it in the io_uring
{g,s}etsockopt(), which will use a userspace pointer for *optval, but, a
kernel value for optlen.

Link: https://lore.kernel.org/all/ZSArfLaaGcfd8LH8@gmail.com/

Signed-off-by: Breno Leitao &lt;leitao@debian.org&gt;
Acked-by: Martin KaFai Lau &lt;martin.lau@kernel.org&gt;
Link: https://lore.kernel.org/r/20231016134750.1381153-3-leitao@debian.org
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>bpf: Add sockptr support for getsockopt</title>
<updated>2023-10-19T20:05:28+00:00</updated>
<author>
<name>Breno Leitao</name>
<email>leitao@debian.org</email>
</author>
<published>2023-10-16T13:47:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=a615f67e1a426f35366b8398c11f31c148e7df48'/>
<id>a615f67e1a426f35366b8398c11f31c148e7df48</id>
<content type='text'>
The whole network stack uses sockptr, and while it doesn't move to
something more modern, let's use sockptr in getsockptr BPF hooks, so, it
could be used by other callers.

The main motivation for this change is to use it in the io_uring
{g,s}etsockopt(), which will use a userspace pointer for *optval, but, a
kernel value for optlen.

Link: https://lore.kernel.org/all/ZSArfLaaGcfd8LH8@gmail.com/

Signed-off-by: Breno Leitao &lt;leitao@debian.org&gt;
Acked-by: Martin KaFai Lau &lt;martin.lau@kernel.org&gt;
Link: https://lore.kernel.org/r/20231016134750.1381153-2-leitao@debian.org
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The whole network stack uses sockptr, and while it doesn't move to
something more modern, let's use sockptr in getsockptr BPF hooks, so, it
could be used by other callers.

The main motivation for this change is to use it in the io_uring
{g,s}etsockopt(), which will use a userspace pointer for *optval, but, a
kernel value for optlen.

Link: https://lore.kernel.org/all/ZSArfLaaGcfd8LH8@gmail.com/

Signed-off-by: Breno Leitao &lt;leitao@debian.org&gt;
Acked-by: Martin KaFai Lau &lt;martin.lau@kernel.org&gt;
Link: https://lore.kernel.org/r/20231016134750.1381153-2-leitao@debian.org
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>bpf: Implement cgroup sockaddr hooks for unix sockets</title>
<updated>2023-10-12T00:27:47+00:00</updated>
<author>
<name>Daan De Meyer</name>
<email>daan.j.demeyer@gmail.com</email>
</author>
<published>2023-10-11T18:51:06+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=859051dd165ec6cc915f0f2114699021144fd249'/>
<id>859051dd165ec6cc915f0f2114699021144fd249</id>
<content type='text'>
These hooks allows intercepting connect(), getsockname(),
getpeername(), sendmsg() and recvmsg() for unix sockets. The unix
socket hooks get write access to the address length because the
address length is not fixed when dealing with unix sockets and
needs to be modified when a unix socket address is modified by
the hook. Because abstract socket unix addresses start with a
NUL byte, we cannot recalculate the socket address in kernelspace
after running the hook by calculating the length of the unix socket
path using strlen().

These hooks can be used when users want to multiplex syscall to a
single unix socket to multiple different processes behind the scenes
by redirecting the connect() and other syscalls to process specific
sockets.

We do not implement support for intercepting bind() because when
using bind() with unix sockets with a pathname address, this creates
an inode in the filesystem which must be cleaned up. If we rewrite
the address, the user might try to clean up the wrong file, leaking
the socket in the filesystem where it is never cleaned up. Until we
figure out a solution for this (and a use case for intercepting bind()),
we opt to not allow rewriting the sockaddr in bind() calls.

We also implement recvmsg() support for connected streams so that
after a connect() that is modified by a sockaddr hook, any corresponding
recmvsg() on the connected socket can also be modified to make the
connected program think it is connected to the "intended" remote.

Reviewed-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Signed-off-by: Daan De Meyer &lt;daan.j.demeyer@gmail.com&gt;
Link: https://lore.kernel.org/r/20231011185113.140426-5-daan.j.demeyer@gmail.com
Signed-off-by: Martin KaFai Lau &lt;martin.lau@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
These hooks allows intercepting connect(), getsockname(),
getpeername(), sendmsg() and recvmsg() for unix sockets. The unix
socket hooks get write access to the address length because the
address length is not fixed when dealing with unix sockets and
needs to be modified when a unix socket address is modified by
the hook. Because abstract socket unix addresses start with a
NUL byte, we cannot recalculate the socket address in kernelspace
after running the hook by calculating the length of the unix socket
path using strlen().

These hooks can be used when users want to multiplex syscall to a
single unix socket to multiple different processes behind the scenes
by redirecting the connect() and other syscalls to process specific
sockets.

We do not implement support for intercepting bind() because when
using bind() with unix sockets with a pathname address, this creates
an inode in the filesystem which must be cleaned up. If we rewrite
the address, the user might try to clean up the wrong file, leaking
the socket in the filesystem where it is never cleaned up. Until we
figure out a solution for this (and a use case for intercepting bind()),
we opt to not allow rewriting the sockaddr in bind() calls.

We also implement recvmsg() support for connected streams so that
after a connect() that is modified by a sockaddr hook, any corresponding
recmvsg() on the connected socket can also be modified to make the
connected program think it is connected to the "intended" remote.

Reviewed-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Signed-off-by: Daan De Meyer &lt;daan.j.demeyer@gmail.com&gt;
Link: https://lore.kernel.org/r/20231011185113.140426-5-daan.j.demeyer@gmail.com
Signed-off-by: Martin KaFai Lau &lt;martin.lau@kernel.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
