<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/include/net/tls.h, branch v6.1</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>net/tls: Describe ciphers sizes by const structs</title>
<updated>2022-09-23T00:27:41+00:00</updated>
<author>
<name>Tariq Toukan</name>
<email>tariqt@nvidia.com</email>
</author>
<published>2022-09-20T13:01:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=2d2c5ea24243eb3ed12f232b2aef43981fa15360'/>
<id>2d2c5ea24243eb3ed12f232b2aef43981fa15360</id>
<content type='text'>
Introduce a cipher sizes descriptor. It helps reduce the amount of
code duplication and the repeated switch/case statements that assign
the proper sizes according to the cipher type.
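
A minimal sketch of the idea, using AES-GCM-128 as the example (the
exact struct and field names in the merged code may differ):

    struct tls_cipher_size_desc {
        unsigned int iv;
        unsigned int key;
        unsigned int salt;
        unsigned int tag;
        unsigned int rec_seq;
    };

    static const struct tls_cipher_size_desc aes_gcm_128_sizes = {
        .iv      = TLS_CIPHER_AES_GCM_128_IV_SIZE,
        .key     = TLS_CIPHER_AES_GCM_128_KEY_SIZE,
        .salt    = TLS_CIPHER_AES_GCM_128_SALT_SIZE,
        .tag     = TLS_CIPHER_AES_GCM_128_TAG_SIZE,
        .rec_seq = TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE,
    };

Callers then read the sizes from the descriptor instead of switching
on the cipher type at every call site.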

Signed-off-by: Tariq Toukan &lt;tariqt@nvidia.com&gt;
Signed-off-by: Gal Pressman &lt;gal@nvidia.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>net/tls: Use RCU API to access tls_ctx-&gt;netdev</title>
<updated>2022-08-11T05:58:43+00:00</updated>
<author>
<name>Maxim Mikityanskiy</name>
<email>maximmi@nvidia.com</email>
</author>
<published>2022-08-10T08:16:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=94ce3b64c62d4b628cf85cd0d9a370aca8f7e43a'/>
<id>94ce3b64c62d4b628cf85cd0d9a370aca8f7e43a</id>
<content type='text'>
Currently, tls_device_down synchronizes with tls_device_resync_rx
using RCU; however, the pointer to netdev is stored using WRITE_ONCE
and loaded using READ_ONCE.

Although such an approach is technically correct (rcu_dereference is
essentially a READ_ONCE, and rcu_assign_pointer uses WRITE_ONCE to
store NULL), using the dedicated RCU pointer helpers is more robust,
as they include additional checks and can change the implementation
transparently to the callers.

Mark the netdev pointer as __rcu and use the correct RCU helpers to
access it. For non-concurrent access, pass the conditions that
guarantee safe access (locks held, refcount value). Also use the
correct helper in mlx5e, where even READ_ONCE was missing.
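
A hedged sketch of the resulting access pattern (lock names and the
exact lockdep conditions vary per call site):

    struct net_device __rcu *netdev;    /* field in struct tls_context */
    struct net_device *dev;

    /* Concurrent (reader) side: */
    rcu_read_lock();
    dev = rcu_dereference(tls_ctx-&gt;netdev);
    if (dev)
        netdev_dbg(dev, "resync under the RCU read lock\n");
    rcu_read_unlock();

    /* Non-concurrent access, with the guarantee spelled out: */
    dev = rcu_dereference_protected(tls_ctx-&gt;netdev,
                                    lockdep_is_held(&amp;device_offload_lock));

    /* Writer side: */
    rcu_assign_pointer(tls_ctx-&gt;netdev, NULL);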

The transition to RCU exposes existing issues, fixed by this commit:

1. bond_tls_device_xmit could read netdev twice, and it could become
NULL the second time, after the NULL check passed.

2. Drivers shouldn't stop processing the last packet if tls_device_down
just set netdev to NULL, before tls_dev_del was called. This prevents a
possible packet drop when transitioning to the fallback software mode.

Fixes: 89df6a810470 ("net/bonding: Implement TLS TX device offload")
Fixes: c55dcdd435aa ("net/tls: Fix use-after-free after the TLS device goes down and up")
Signed-off-by: Maxim Mikityanskiy &lt;maximmi@nvidia.com&gt;
Link: https://lore.kernel.org/r/20220810081602.1435800-1-maximmi@nvidia.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>net/tls: Multi-threaded calls to TX tls_dev_del</title>
<updated>2022-07-29T04:50:54+00:00</updated>
<author>
<name>Tariq Toukan</name>
<email>tariqt@nvidia.com</email>
</author>
<published>2022-07-27T09:43:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=7adc91e0c93901a0eeeea10665d0feb48ffde2d4'/>
<id>7adc91e0c93901a0eeeea10665d0feb48ffde2d4</id>
<content type='text'>
Multiple TLS device-offloaded contexts can be added in parallel via
concurrent calls to .tls_dev_add, while calls to .tls_dev_del are
sequential in tls_device_gc_task.

This behavior is not sustainable, as it creates a rate gap between
the add and del operations (the addition rate outperforms the
deletion rate). When running for long enough, TLS device resources
get exhausted, and offloading new connections fails.

Replace the single-threaded garbage collector work with a per-context
alternative, so that deletions can be handled on several cores in
parallel. Use a new dedicated destruct workqueue for this.
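
A hedged sketch of the mechanism (names illustrative):

    static struct workqueue_struct *destruct_wq;

    /* A dedicated workqueue, created once at module init: */
    destruct_wq = alloc_workqueue("tls_device_destruct", 0, 0);

    /* One work item per context, so destruction of independent
     * contexts can proceed on several CPUs in parallel:
     */
    INIT_WORK(&amp;ctx-&gt;destruct_work, tls_device_tx_del_task);
    queue_work(destruct_wq, &amp;ctx-&gt;destruct_work);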

Tested with mlx5 device:
Before: 22141 add/sec,   103 del/sec
After:  11684 add/sec, 11684 del/sec

Signed-off-by: Tariq Toukan &lt;tariqt@nvidia.com&gt;
Reviewed-by: Maxim Mikityanskiy &lt;maximmi@nvidia.com&gt;
Signed-off-by: Saeed Mahameed &lt;saeedm@nvidia.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>tls: rx: do not use the standard strparser</title>
<updated>2022-07-26T21:38:51+00:00</updated>
<author>
<name>Jakub Kicinski</name>
<email>kuba@kernel.org</email>
</author>
<published>2022-07-22T23:50:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=84c61fe1a75b4255df1e1e7c054c9e6d048da417'/>
<id>84c61fe1a75b4255df1e1e7c054c9e6d048da417</id>
<content type='text'>
TLS is a relatively poor fit for strparser. We pause the input
every time a message is received, wait for a read which will
decrypt the message, start the parser, repeat. strparser is
built to delineate the messages, wrap them in individual skbs
and let them float off into the stack or a different socket.
TLS wants the data pages and nothing else. There's no need
for TLS to keep cloning (and occasionally skb_unclone()'ing)
the TCP rx queue.

This patch uses a pre-allocated skb and attaches the skbs
from the TCP rx queue to it as frags. TLS is careful never
to modify the input skb without CoW'ing / detaching it first.
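
A hedged sketch of the frag-attachment idea (length and truesize
accounting omitted; names illustrative):

    struct sk_buff *anchor = alloc_skb(0, GFP_KERNEL);
    int i;

    /* Borrow the payload pages of an skb sitting in the TCP rx
     * queue instead of cloning it:
     */
    for (i = 0; i &lt; skb_shinfo(src)-&gt;nr_frags; i++) {
        skb_frag_t *frag = &amp;skb_shinfo(src)-&gt;frags[i];

        __skb_frag_ref(frag);
        skb_fill_page_desc(anchor, skb_shinfo(anchor)-&gt;nr_frags,
                           skb_frag_page(frag), skb_frag_off(frag),
                           skb_frag_size(frag));
    }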

Since we call the TCP rx queue cleanup directly, we also get back the
benefit of skb deferred free.

Overall this results in a 6% gain in my benchmarks.

Acked-by: Paolo Abeni &lt;pabeni@redhat.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>tls: rx: async: hold onto the input skb</title>
<updated>2022-07-18T10:24:11+00:00</updated>
<author>
<name>Jakub Kicinski</name>
<email>kuba@kernel.org</email>
</author>
<published>2022-07-15T05:22:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=c618db2afe7c31d13ca8cf05b60f17165fbdc282'/>
<id>c618db2afe7c31d13ca8cf05b60f17165fbdc282</id>
<content type='text'>
Async crypto currently benefits from the fact that we decrypt in
place. When we allow input and output to be different skbs, we will
have to hang onto the input while we move to the next record. Clone
the inputs and keep them on a list.
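
A hedged sketch (the list field name is illustrative):

    struct sk_buff *clone;

    /* The clone holds a reference on the shared data, keeping the
     * input alive until the async completion fires:
     */
    clone = skb_clone(skb, sk-&gt;sk_allocation);
    if (!clone)
        return -ENOMEM;
    __skb_queue_tail(&amp;ctx-&gt;async_hold, clone);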

Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tls: rx: allow only one reader at a time</title>
<updated>2022-07-18T10:24:10+00:00</updated>
<author>
<name>Jakub Kicinski</name>
<email>kuba@kernel.org</email>
</author>
<published>2022-07-15T05:22:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=4cbc325ed6b4dce4910be06d9d6940a8b919c59b'/>
<id>4cbc325ed6b4dce4910be06d9d6940a8b919c59b</id>
<content type='text'>
recvmsg() in TLS gets data from the skb list (rx_list) or fresh
skbs we read from TCP via strparser. The former holds skbs which were
already decrypted for peek or decrypted and partially consumed.

tls_wait_data() only notices the appearance of fresh skbs coming out
of TCP (or psock). If there are concurrent calls to peek() and
recv(), it is possible that peek() will move the data from the input
to rx_list without recv() noticing. recv() will then read data out of
order or never wake up.

This is not a practical use case/concern, but it makes the selftests
less reliable. This patch solves the problem by allowing only one
reader in.

Because having multiple processes calling read()/peek() is not the
normal case, avoid adding a lock; instead, fast-path the single
reader case, as sketched below.
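
An illustrative single-reader gate (not necessarily the merged
implementation; field names are hypothetical):

    /* First reader enters lock-free; contended readers sleep. */
    if (atomic_cmpxchg(&amp;ctx-&gt;reader_present, 0, 1) != 0)
        wait_event(ctx-&gt;reader_wq,
                   atomic_cmpxchg(&amp;ctx-&gt;reader_present, 0, 1) == 0);

    /* ... read from rx_list / fresh skbs ... */

    atomic_set(&amp;ctx-&gt;reader_present, 0);
    wake_up(&amp;ctx-&gt;reader_wq);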

Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tls: create an internal header</title>
<updated>2022-07-09T01:38:45+00:00</updated>
<author>
<name>Jakub Kicinski</name>
<email>kuba@kernel.org</email>
</author>
<published>2022-07-08T01:03:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=5879031423089b2e19b769f30fc618af742264c3'/>
<id>5879031423089b2e19b769f30fc618af742264c3</id>
<content type='text'>
include/net/tls.h is getting a little long, and is probably hard for
driver authors to navigate. Split out the internals into a header
which will live under net/tls/. While at it, move some static inlines
with a single user into the source files, add a few tls_ prefixes,
and fix the spelling of 'proccess'.

Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;

</content>
</entry>
<entry>
<title>tls: rx: always allocate max possible aad size for decrypt</title>
<updated>2022-07-09T01:38:45+00:00</updated>
<author>
<name>Jakub Kicinski</name>
<email>kuba@kernel.org</email>
</author>
<published>2022-07-08T01:03:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=50a07aa5316181e08fb80914fcf70229a827a2e0'/>
<id>50a07aa5316181e08fb80914fcf70229a827a2e0</id>
<content type='text'>
The AAD size is either 5 or 13 bytes. There is really no point
complicating the code for the 8 bytes of difference. This will also
let us turn the chunked-up buffer into a sane struct.
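
A minimal sketch of the sizing (TLS 1.3 AAD is the 5-byte record
header; TLS 1.2 prepends the 8-byte sequence number):

    /* TLS_AAD_SPACE_SIZE is 13, the TLS 1.2 worst case. */
    u8 aad[TLS_AAD_SPACE_SIZE];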

Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;

</content>
</entry>
<entry>
<title>tls: rx: add sockopt for enabling optimistic decrypt with TLS 1.3</title>
<updated>2022-07-06T11:56:35+00:00</updated>
<author>
<name>Jakub Kicinski</name>
<email>kuba@kernel.org</email>
</author>
<published>2022-07-05T23:59:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=88527790c079fb1ea41cbcfa4450ee37906a2fb0'/>
<id>88527790c079fb1ea41cbcfa4450ee37906a2fb0</id>
<content type='text'>
Since optimistic decrypt may add extra load in case of retries,
require the socket owner to explicitly opt in.
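
A userspace usage sketch (the opt-in added here is the
TLS_RX_EXPECT_NO_PAD option, set on an established kTLS socket; fd
is a placeholder):

    int val = 1;

    if (setsockopt(fd, SOL_TLS, TLS_RX_EXPECT_NO_PAD,
                   &amp;val, sizeof(val)))
        perror("setsockopt");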

Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tls: Add opt-in zerocopy mode of sendfile()</title>
<updated>2022-05-19T10:14:11+00:00</updated>
<author>
<name>Boris Pismenny</name>
<email>borisp@nvidia.com</email>
</author>
<published>2022-05-18T09:27:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=c1318b39c7d36bd5139a9c71044ff2b2d3c6f9d8'/>
<id>c1318b39c7d36bd5139a9c71044ff2b2d3c6f9d8</id>
<content type='text'>
TLS device offload copies sendfile data to a bounce buffer before
transmitting. This allows a valid MAC to be maintained on TLS records
when the file contents change and a part of a TLS record has to be
retransmitted at the TCP level.

In many common use cases (like serving static files over HTTPS) the
file contents are not changed on the fly. In such use cases, breaking
the connection is totally acceptable if the file is changed during
transmission, because it would be received corrupted in any case.

This commit makes it possible to optimize performance for such use
cases by providing a new optional mode of TLS sendfile(), in which
the extra copy is skipped. Removing this copy improves performance
significantly, as TLS and TCP sendfile then perform the same
operations, and the only overhead is the TLS header/trailer
insertion.

The new mode can only be enabled with the new socket option named
TLS_TX_ZEROCOPY_SENDFILE, on a per-socket basis (see the usage sketch
below). It preserves backwards compatibility with existing
applications that rely on the copying behavior.
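
A userspace usage sketch (fd is a connected kTLS socket with TX
device offload; file_fd and file_len are placeholders; error
handling abbreviated):

    int one = 1;

    if (setsockopt(fd, SOL_TLS, TLS_TX_ZEROCOPY_SENDFILE,
                   &amp;one, sizeof(one)))
        perror("setsockopt");    /* copying mode stays in effect */

    sendfile(fd, file_fd, NULL, file_len);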

The new mode is safe, meaning that unsolicited modifications of the
file being sent can't break the integrity of the kernel. The worst
thing that can happen is sending a corrupted TLS record, which is in
any case not forbidden when using regular TCP sockets.

Sockets that do not use TLS device offload are not affected by the
new socket option. The actual status of zerocopy sendfile can be
queried with sock_diag.

Performance numbers in a single-core test with 24 HTTPS streams on
nginx, under 100% CPU load:

* non-zerocopy: 33.6 Gbit/s
* zerocopy: 79.92 Gbit/s

CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz

Signed-off-by: Boris Pismenny &lt;borisp@nvidia.com&gt;
Signed-off-by: Tariq Toukan &lt;tariqt@nvidia.com&gt;
Signed-off-by: Maxim Mikityanskiy &lt;maximmi@nvidia.com&gt;
Reviewed-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Link: https://lore.kernel.org/r/20220518092731.1243494-1-maximmi@nvidia.com
Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;
</content>
</entry>
</feed>
