<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/include/linux/skmsg.h, branch v7.2-rc1</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net</title>
<updated>2026-06-16T21:59:58+00:00</updated>
<author>
<name>Jakub Kicinski</name>
<email>kuba@kernel.org</email>
</author>
<published>2026-06-16T21:57:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=d755d45bc08a57a3b845b850f8760de922a499bf'/>
<id>d755d45bc08a57a3b845b850f8760de922a499bf</id>
<content type='text'>
Merge in late fixes in preparation for the net-next PR.

Conflicts:

net/tls/tls_sw.c
  406e8a651a7b ("net: skmsg: preserve sg.copy across SG transforms")
  79511603a65b ("tls: remove dead sockmap (psock) handling from the SW path")

drivers/net/ethernet/microsoft/mana/mana_en.c
  f8fd56977eeea ("net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check")
  d07efe5a6e641 ("net: mana: Use per-queue allocation for tx_qp to reduce allocation size")
https://lore.kernel.org/ajAPXu-C_PuTgV-a@sirena.org.uk

No adjacent changes.

Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Merge in late fixes in preparation for the net-next PR.

Conflicts:

net/tls/tls_sw.c
  406e8a651a7b ("net: skmsg: preserve sg.copy across SG transforms")
  79511603a65b ("tls: remove dead sockmap (psock) handling from the SW path")

drivers/net/ethernet/microsoft/mana/mana_en.c
  f8fd56977eeea ("net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check")
  d07efe5a6e641 ("net: mana: Use per-queue allocation for tx_qp to reduce allocation size")
https://lore.kernel.org/ajAPXu-C_PuTgV-a@sirena.org.uk

No adjacent changes.

Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: skmsg: preserve sg.copy across SG transforms</title>
<updated>2026-06-16T21:38:46+00:00</updated>
<author>
<name>Yiming Qian</name>
<email>yimingqian591@gmail.com</email>
</author>
<published>2026-06-10T06:21:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=406e8a651a7b854c41fecd5117bb282b3a6c2c6b'/>
<id>406e8a651a7b854c41fecd5117bb282b3a6c2c6b</id>
<content type='text'>
The sk_msg sg.copy bitmap is part of the scatterlist entry ownership
state. A set bit tells sk_msg_compute_data_pointers() not to expose the
entry through writable BPF ctx-&gt;data. This protects entries backed by
pages that are not private to the sk_msg, such as splice-backed file
page-cache pages.

Several sk_msg transform paths move, copy, split, or compact
msg-&gt;sg.data[] entries without moving the matching sg.copy bit. This can
make an externally backed entry arrive at a new slot with a clear copy
bit. A later SK_MSG verdict can then expose sg_virt(sge) as writable
ctx-&gt;data and BPF stores can modify the original page cache.

Keep sg.copy synchronized with sg.data[] whenever entries are
transferred, shifted, split, or copied into a new sk_msg. Clear the bit
when an entry is replaced by a newly allocated private page or freed.
This covers the BPF pull/push/pop helpers, sk_msg_shift_left/right(),
sk_msg_xfer(), and tls_split_open_record(), including the partial tail
entry created during TLS open-record splitting.

Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
Cc: stable@vger.kernel.org
Reported-by: Yiming Qian &lt;yimingqian591@gmail.com&gt;
Reported-by: Keenan Dong &lt;keenanat2000@gmail.com&gt;
Signed-off-by: Yiming Qian &lt;yimingqian591@gmail.com&gt;
Link: https://patch.msgid.link/20260610062137.49075-1-yimingqian591@gmail.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The sk_msg sg.copy bitmap is part of the scatterlist entry ownership
state. A set bit tells sk_msg_compute_data_pointers() not to expose the
entry through writable BPF ctx-&gt;data. This protects entries backed by
pages that are not private to the sk_msg, such as splice-backed file
page-cache pages.

Several sk_msg transform paths move, copy, split, or compact
msg-&gt;sg.data[] entries without moving the matching sg.copy bit. This can
make an externally backed entry arrive at a new slot with a clear copy
bit. A later SK_MSG verdict can then expose sg_virt(sge) as writable
ctx-&gt;data and BPF stores can modify the original page cache.

Keep sg.copy synchronized with sg.data[] whenever entries are
transferred, shifted, split, or copied into a new sk_msg. Clear the bit
when an entry is replaced by a newly allocated private page or freed.
This covers the BPF pull/push/pop helpers, sk_msg_shift_left/right(),
sk_msg_xfer(), and tls_split_open_record(), including the partial tail
entry created during TLS open-record splitting.

Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
Cc: stable@vger.kernel.org
Reported-by: Yiming Qian &lt;yimingqian591@gmail.com&gt;
Reported-by: Keenan Dong &lt;keenanat2000@gmail.com&gt;
Signed-off-by: Yiming Qian &lt;yimingqian591@gmail.com&gt;
Link: https://patch.msgid.link/20260610062137.49075-1-yimingqian591@gmail.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tls: remove dead sockmap (psock) handling from the SW path</title>
<updated>2026-06-16T15:55:41+00:00</updated>
<author>
<name>Jakub Kicinski</name>
<email>kuba@kernel.org</email>
</author>
<published>2026-06-14T01:40:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=79511603a65b990bed675eb4bcfd85305d3ff42a'/>
<id>79511603a65b990bed675eb4bcfd85305d3ff42a</id>
<content type='text'>
TLS and sockmap are now mutually exclusive. Try to delete the code
from sendmsg and recvmsg path which is now obviously dead.

The main goal is to delete enough code for AI security scanners
to no longer bother us with sockmap related bugs. At the same
time retain the code in case someone has the cycles to fix
all of this and make the integration work, again.

If the integration does not get restored we can wipe the rest
of the skmsg code from TLS in two or three releases.

The changes on the Tx side are deeper since that's where most
of the bugs are, Rx side simply takes the data from sockmap
and gives it to the user. On Tx split record handling and
rolling back the iterator were the two problem areas.

Reviewed-by: Jakub Sitnicki &lt;jakub@cloudflare.com&gt;
Reviewed-by: Sabrina Dubroca &lt;sd@queasysnail.net&gt;
Link: https://patch.msgid.link/20260614014102.461064-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
TLS and sockmap are now mutually exclusive. Try to delete the code
from sendmsg and recvmsg path which is now obviously dead.

The main goal is to delete enough code for AI security scanners
to no longer bother us with sockmap related bugs. At the same
time retain the code in case someone has the cycles to fix
all of this and make the integration work, again.

If the integration does not get restored we can wipe the rest
of the skmsg code from TLS in two or three releases.

The changes on the Tx side are deeper since that's where most
of the bugs are, Rx side simply takes the data from sockmap
and gives it to the user. On Tx split record handling and
rolling back the iterator were the two problem areas.

Reviewed-by: Jakub Sitnicki &lt;jakub@cloudflare.com&gt;
Reviewed-by: Sabrina Dubroca &lt;sd@queasysnail.net&gt;
Link: https://patch.msgid.link/20260614014102.461064-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>kmalloc_obj: Clean up after treewide replacements</title>
<updated>2026-02-21T09:02:52+00:00</updated>
<author>
<name>Kees Cook</name>
<email>kees@kernel.org</email>
</author>
<published>2026-02-21T08:12:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=7a70c15bd1449f1eb30991772edce37b41e496fb'/>
<id>7a70c15bd1449f1eb30991772edce37b41e496fb</id>
<content type='text'>
Coccinelle doesn't handle re-indenting line escapes. Fix the 2 places
where these got misaligned.

Remove 2 now-redundant type casts, found with:
$ git grep -P 'struct (\S+).*\)\s*k\S+alloc_(objs?|flex)\(struct \1'

Signed-off-by: Kees Cook &lt;kees@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Coccinelle doesn't handle re-indenting line escapes. Fix the 2 places
where these got misaligned.

Remove 2 now-redundant type casts, found with:
$ git grep -P 'struct (\S+).*\)\s*k\S+alloc_(objs?|flex)\(struct \1'

Signed-off-by: Kees Cook &lt;kees@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>treewide: Replace kmalloc with kmalloc_obj for non-scalar types</title>
<updated>2026-02-21T09:02:28+00:00</updated>
<author>
<name>Kees Cook</name>
<email>kees@kernel.org</email>
</author>
<published>2026-02-21T07:49:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=69050f8d6d075dc01af7a5f2f550a8067510366f'/>
<id>69050f8d6d075dc01af7a5f2f550a8067510366f</id>
<content type='text'>
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook &lt;kees@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook &lt;kees@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>bpf, sockmap: Fix FIONREAD for sockmap</title>
<updated>2026-01-27T17:11:30+00:00</updated>
<author>
<name>Jiayuan Chen</name>
<email>jiayuan.chen@linux.dev</email>
</author>
<published>2026-01-24T11:32:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=929e30f9312514902133c45e51c79088421ab084'/>
<id>929e30f9312514902133c45e51c79088421ab084</id>
<content type='text'>
A socket using sockmap has its own independent receive queue: ingress_msg.
This queue may contain data from its own protocol stack or from other
sockets.

Therefore, for sockmap, relying solely on copied_seq and rcv_nxt to
calculate FIONREAD is not enough.

This patch adds a new msg_tot_len field in the psock structure to record
the data length in ingress_msg. Additionally, we implement new ioctl
interfaces for TCP and UDP to intercept FIONREAD operations.

Note that we intentionally do not include sk_receive_queue data in the
FIONREAD result. Data in sk_receive_queue has not yet been processed by
the BPF verdict program, and may be redirected to other sockets or
dropped. Including it would create semantic ambiguity since this data
may never be readable by the user.

Unix and VSOCK sockets have similar issues, but fixing them is outside
the scope of this patch as it would require more intrusive changes.

Previous work by John Fastabend made some efforts towards FIONREAD support:
commit e5c6de5fa025 ("bpf, sockmap: Incorrectly handling copied_seq")
Although the current patch is based on the previous work by John Fastabend,
it is acceptable for our Fixes tag to point to the same commit.

                                                      FD1:read()
                                                      --  FD1-&gt;copied_seq++
                                                          |  [read data]
                                                          |
                                   [enqueue data]         v
                  [sockmap]     -&gt; ingress to self -&gt;  ingress_msg queue
FD1 native stack  ------&gt;                                 ^
-- FD1-&gt;rcv_nxt++               -&gt; redirect to other      | [enqueue data]
                                       |                  |
                                       |             ingress to FD1
                                       v                  ^
                                      ...                 |  [sockmap]
                                                     FD2 native stack

Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Signed-off-by: Jiayuan Chen &lt;jiayuan.chen@linux.dev&gt;
Reviewed-by: Jakub Sitnicki &lt;jakub@cloudflare.com&gt;
Link: https://lore.kernel.org/r/20260124113314.113584-3-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
A socket using sockmap has its own independent receive queue: ingress_msg.
This queue may contain data from its own protocol stack or from other
sockets.

Therefore, for sockmap, relying solely on copied_seq and rcv_nxt to
calculate FIONREAD is not enough.

This patch adds a new msg_tot_len field in the psock structure to record
the data length in ingress_msg. Additionally, we implement new ioctl
interfaces for TCP and UDP to intercept FIONREAD operations.

Note that we intentionally do not include sk_receive_queue data in the
FIONREAD result. Data in sk_receive_queue has not yet been processed by
the BPF verdict program, and may be redirected to other sockets or
dropped. Including it would create semantic ambiguity since this data
may never be readable by the user.

Unix and VSOCK sockets have similar issues, but fixing them is outside
the scope of this patch as it would require more intrusive changes.

Previous work by John Fastabend made some efforts towards FIONREAD support:
commit e5c6de5fa025 ("bpf, sockmap: Incorrectly handling copied_seq")
Although the current patch is based on the previous work by John Fastabend,
it is acceptable for our Fixes tag to point to the same commit.

                                                      FD1:read()
                                                      --  FD1-&gt;copied_seq++
                                                          |  [read data]
                                                          |
                                   [enqueue data]         v
                  [sockmap]     -&gt; ingress to self -&gt;  ingress_msg queue
FD1 native stack  ------&gt;                                 ^
-- FD1-&gt;rcv_nxt++               -&gt; redirect to other      | [enqueue data]
                                       |                  |
                                       |             ingress to FD1
                                       v                  ^
                                      ...                 |  [sockmap]
                                                     FD2 native stack

Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Signed-off-by: Jiayuan Chen &lt;jiayuan.chen@linux.dev&gt;
Reviewed-by: Jakub Sitnicki &lt;jakub@cloudflare.com&gt;
Link: https://lore.kernel.org/r/20260124113314.113584-3-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>bpf, sockmap: Fix incorrect copied_seq calculation</title>
<updated>2026-01-27T17:11:30+00:00</updated>
<author>
<name>Jiayuan Chen</name>
<email>jiayuan.chen@linux.dev</email>
</author>
<published>2026-01-24T11:32:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=b40cc5adaa80e1471095a62d78233b611d7a558c'/>
<id>b40cc5adaa80e1471095a62d78233b611d7a558c</id>
<content type='text'>
A socket using sockmap has its own independent receive queue: ingress_msg.
This queue may contain data from its own protocol stack or from other
sockets.

The issue is that when reading from ingress_msg, we update tp-&gt;copied_seq
by default. However, if the data is not from its own protocol stack,
tcp-&gt;rcv_nxt is not increased. Later, if we convert this socket to a
native socket, reading from this socket may fail because copied_seq might
be significantly larger than rcv_nxt.

This fix also addresses the syzkaller-reported bug referenced in the
Closes tag.

This patch marks the skmsg objects in ingress_msg. When reading, we update
copied_seq only if the data is from its own protocol stack.

                                                     FD1:read()
                                                     --  FD1-&gt;copied_seq++
                                                         |  [read data]
                                                         |
                                [enqueue data]           v
                  [sockmap]     -&gt; ingress to self -&gt;  ingress_msg queue
FD1 native stack  ------&gt;                                 ^
-- FD1-&gt;rcv_nxt++               -&gt; redirect to other      | [enqueue data]
                                       |                  |
                                       |             ingress to FD1
                                       v                  ^
                                      ...                 |  [sockmap]
                                                     FD2 native stack

Closes: https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983
Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Reviewed-by: Jakub Sitnicki &lt;jakub@cloudflare.com&gt;
Reviewed-by: John Fastabend &lt;john.fastabend@gmail.com&gt;
Signed-off-by: Jiayuan Chen &lt;jiayuan.chen@linux.dev&gt;
Link: https://lore.kernel.org/r/20260124113314.113584-2-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
A socket using sockmap has its own independent receive queue: ingress_msg.
This queue may contain data from its own protocol stack or from other
sockets.

The issue is that when reading from ingress_msg, we update tp-&gt;copied_seq
by default. However, if the data is not from its own protocol stack,
tcp-&gt;rcv_nxt is not increased. Later, if we convert this socket to a
native socket, reading from this socket may fail because copied_seq might
be significantly larger than rcv_nxt.

This fix also addresses the syzkaller-reported bug referenced in the
Closes tag.

This patch marks the skmsg objects in ingress_msg. When reading, we update
copied_seq only if the data is from its own protocol stack.

                                                     FD1:read()
                                                     --  FD1-&gt;copied_seq++
                                                         |  [read data]
                                                         |
                                [enqueue data]           v
                  [sockmap]     -&gt; ingress to self -&gt;  ingress_msg queue
FD1 native stack  ------&gt;                                 ^
-- FD1-&gt;rcv_nxt++               -&gt; redirect to other      | [enqueue data]
                                       |                  |
                                       |             ingress to FD1
                                       v                  ^
                                      ...                 |  [sockmap]
                                                     FD2 native stack

Closes: https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983
Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Reviewed-by: Jakub Sitnicki &lt;jakub@cloudflare.com&gt;
Reviewed-by: John Fastabend &lt;john.fastabend@gmail.com&gt;
Signed-off-by: Jiayuan Chen &lt;jiayuan.chen@linux.dev&gt;
Link: https://lore.kernel.org/r/20260124113314.113584-2-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: add sk_drops_skbadd() helper</title>
<updated>2025-08-28T11:14:50+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2025-08-26T12:50:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=cb4d5a6eb600a43c2e3ec7f54e06d07aa33d8062'/>
<id>cb4d5a6eb600a43c2e3ec7f54e06d07aa33d8062</id>
<content type='text'>
Existing sk_drops_add() helper is renamed to sk_drops_skbadd().

Add sk_drops_add() and convert sk_drops_inc() to use it.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reviewed-by: Kuniyuki Iwashima &lt;kuniyu@google.com&gt;
Link: https://patch.msgid.link/20250826125031.1578842-3-edumazet@google.com
Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Existing sk_drops_add() helper is renamed to sk_drops_skbadd().

Add sk_drops_add() and convert sk_drops_inc() to use it.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reviewed-by: Kuniyuki Iwashima &lt;kuniyu@google.com&gt;
Link: https://patch.msgid.link/20250826125031.1578842-3-edumazet@google.com
Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>bpf: Fix wrong copied_seq calculation</title>
<updated>2025-01-29T21:32:23+00:00</updated>
<author>
<name>Jiayuan Chen</name>
<email>mrpre@163.com</email>
</author>
<published>2025-01-22T10:09:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=36b62df5683c315ba58c950f1a9c771c796c30ec'/>
<id>36b62df5683c315ba58c950f1a9c771c796c30ec</id>
<content type='text'>
'sk-&gt;copied_seq' was updated in the tcp_eat_skb() function when the action
of a BPF program was SK_REDIRECT. For other actions, like SK_PASS, the
update logic for 'sk-&gt;copied_seq' was moved to tcp_bpf_recvmsg_parser()
to ensure the accuracy of the 'fionread' feature.

It works for a single stream_verdict scenario, as it also modified
sk_data_ready-&gt;sk_psock_verdict_data_ready-&gt;tcp_read_skb
to remove updating 'sk-&gt;copied_seq'.

However, for programs where both stream_parser and stream_verdict are
active (strparser purpose), tcp_read_sock() was used instead of
tcp_read_skb() (sk_data_ready-&gt;strp_data_ready-&gt;tcp_read_sock).
tcp_read_sock() now still updates 'sk-&gt;copied_seq', leading to duplicate
updates.

In summary, for strparser + SK_PASS, copied_seq is redundantly calculated
in both tcp_read_sock() and tcp_bpf_recvmsg_parser().

The issue causes incorrect copied_seq calculations, which prevent
correct data reads from the recv() interface in user-land.

We do not want to add new proto_ops to implement a new version of
tcp_read_sock, as this would introduce code complexity [1].

We could have added noack and copied_seq to desc, and then called
ops-&gt;read_sock. However, unfortunately, other modules didn’t fully
initialize desc to zero. So, for now, we are directly calling
tcp_read_sock_noack() in tcp_bpf.c.

[1]: https://lore.kernel.org/bpf/20241218053408.437295-1-mrpre@163.com

Fixes: e5c6de5fa025 ("bpf, sockmap: Incorrectly handling copied_seq")
Suggested-by: Jakub Sitnicki &lt;jakub@cloudflare.com&gt;
Signed-off-by: Jiayuan Chen &lt;mrpre@163.com&gt;
Signed-off-by: Martin KaFai Lau &lt;martin.lau@kernel.org&gt;
Reviewed-by: Jakub Sitnicki &lt;jakub@cloudflare.com&gt;
Acked-by: John Fastabend &lt;john.fastabend@gmail.com&gt;
Link: https://patch.msgid.link/20250122100917.49845-3-mrpre@163.com
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
'sk-&gt;copied_seq' was updated in the tcp_eat_skb() function when the action
of a BPF program was SK_REDIRECT. For other actions, like SK_PASS, the
update logic for 'sk-&gt;copied_seq' was moved to tcp_bpf_recvmsg_parser()
to ensure the accuracy of the 'fionread' feature.

It works for a single stream_verdict scenario, as it also modified
sk_data_ready-&gt;sk_psock_verdict_data_ready-&gt;tcp_read_skb
to remove updating 'sk-&gt;copied_seq'.

However, for programs where both stream_parser and stream_verdict are
active (strparser purpose), tcp_read_sock() was used instead of
tcp_read_skb() (sk_data_ready-&gt;strp_data_ready-&gt;tcp_read_sock).
tcp_read_sock() now still updates 'sk-&gt;copied_seq', leading to duplicate
updates.

In summary, for strparser + SK_PASS, copied_seq is redundantly calculated
in both tcp_read_sock() and tcp_bpf_recvmsg_parser().

The issue causes incorrect copied_seq calculations, which prevent
correct data reads from the recv() interface in user-land.

We do not want to add new proto_ops to implement a new version of
tcp_read_sock, as this would introduce code complexity [1].

We could have added noack and copied_seq to desc, and then called
ops-&gt;read_sock. However, unfortunately, other modules didn’t fully
initialize desc to zero. So, for now, we are directly calling
tcp_read_sock_noack() in tcp_bpf.c.

[1]: https://lore.kernel.org/bpf/20241218053408.437295-1-mrpre@163.com

Fixes: e5c6de5fa025 ("bpf, sockmap: Incorrectly handling copied_seq")
Suggested-by: Jakub Sitnicki &lt;jakub@cloudflare.com&gt;
Signed-off-by: Jiayuan Chen &lt;mrpre@163.com&gt;
Signed-off-by: Martin KaFai Lau &lt;martin.lau@kernel.org&gt;
Reviewed-by: Jakub Sitnicki &lt;jakub@cloudflare.com&gt;
Acked-by: John Fastabend &lt;john.fastabend@gmail.com&gt;
Link: https://patch.msgid.link/20250122100917.49845-3-mrpre@163.com
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp_bpf: Add sk_rmem_alloc related logic for tcp_bpf ingress redirection</title>
<updated>2024-12-20T16:59:47+00:00</updated>
<author>
<name>Zijian Zhang</name>
<email>zijianzhang@bytedance.com</email>
</author>
<published>2024-12-10T01:20:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=d888b7af7c149c115dd6ac772cc11c375da3e17c'/>
<id>d888b7af7c149c115dd6ac772cc11c375da3e17c</id>
<content type='text'>
When we do sk_psock_verdict_apply-&gt;sk_psock_skb_ingress, an sk_msg will
be created out of the skb, and the rmem accounting of the sk_msg will be
handled by the skb.

For skmsgs in __SK_REDIRECT case of tcp_bpf_send_verdict, when redirecting
to the ingress of a socket, although we sk_rmem_schedule and add sk_msg to
the ingress_msg of sk_redir, we do not update sk_rmem_alloc. As a result,
except for the global memory limit, the rmem of sk_redir is nearly
unlimited. Thus, add sk_rmem_alloc related logic to limit the recv buffer.

Since the function sk_msg_recvmsg and __sk_psock_purge_ingress_msg are
used in these two paths. We use "msg-&gt;skb" to test whether the sk_msg is
skb backed up. If it's not, we shall do the memory accounting explicitly.

Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Zijian Zhang &lt;zijianzhang@bytedance.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Reviewed-by: John Fastabend &lt;john.fastabend@gmail.com&gt;
Link: https://lore.kernel.org/bpf/20241210012039.1669389-3-zijianzhang@bytedance.com
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When we do sk_psock_verdict_apply-&gt;sk_psock_skb_ingress, an sk_msg will
be created out of the skb, and the rmem accounting of the sk_msg will be
handled by the skb.

For skmsgs in __SK_REDIRECT case of tcp_bpf_send_verdict, when redirecting
to the ingress of a socket, although we sk_rmem_schedule and add sk_msg to
the ingress_msg of sk_redir, we do not update sk_rmem_alloc. As a result,
except for the global memory limit, the rmem of sk_redir is nearly
unlimited. Thus, add sk_rmem_alloc related logic to limit the recv buffer.

Since the function sk_msg_recvmsg and __sk_psock_purge_ingress_msg are
used in these two paths. We use "msg-&gt;skb" to test whether the sk_msg is
skb backed up. If it's not, we shall do the memory accounting explicitly.

Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Zijian Zhang &lt;zijianzhang@bytedance.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Reviewed-by: John Fastabend &lt;john.fastabend@gmail.com&gt;
Link: https://lore.kernel.org/bpf/20241210012039.1669389-3-zijianzhang@bytedance.com
</pre>
</div>
</content>
</entry>
</feed>
