linux-stable.git/drivers/vhost/net.c, branch linux-6.16.y

vhost-net: flush batched before enabling notifications

2025-10-02T11:48:37+00:00

commit e430451613c7a27beeadd00d707bcf7ceec6328e upstream.

Commit 8c2e6b26ffe2 ("vhost/net: Defer TX queue re-enable until after
sendmsg") tries to defer the notification enabling by moving the logic
out of the loop after the vhost_tx_batch() when nothing new is spotted.
This caused unexpected side effects as the new logic is reused for
several other error conditions.

A previous patch reverted 8c2e6b26ffe2. Now, bring the performance
back up by flushing batched buffers before enabling notifications.

Reported-by: Jon Kohler 
Cc: stable@vger.kernel.org
Fixes: 8c2e6b26ffe2 ("vhost/net: Defer TX queue re-enable until after sendmsg")
Signed-off-by: Jason Wang 
Signed-off-by: Michael S. Tsirkin 
Message-Id: <20250917063045.2042-3-jasowang@redhat.com>
Signed-off-by: Greg Kroah-Hartman

Revert "vhost/net: Defer TX queue re-enable until after sendmsg"

2025-10-02T11:48:37+00:00

commit 4174152771bf0d014d58f7d7e148bb0c8830fe53 upstream.

This reverts commit 8c2e6b26ffe243be1e78f5a4bfb1a857d6e6f6d6. It tries
to defer the notification enabling by moving the logic out of the loop
after the vhost_tx_batch() when nothing new is spotted. This will
bring side effects as the new logic would be reused for several other
error conditions.

One example is the IOTLB: when there's an IOTLB miss, get_tx_bufs()
might return -EAGAIN and exit the loop and see there's still available
buffers, so it will queue the tx work again until userspace feed the
IOTLB entry correctly. This will slowdown the tx processing and
trigger the TX watchdog in the guest as reported in
https://lkml.org/lkml/2025/9/10/1596.

To fix, revert the change. A follow up patch will bring the performance
back in a safe way.

Reported-by: Jon Kohler 
Cc: stable@vger.kernel.org
Fixes: 8c2e6b26ffe2 ("vhost/net: Defer TX queue re-enable until after sendmsg")
Signed-off-by: Jason Wang 
Signed-off-by: Michael S. Tsirkin 
Message-Id: <20250917063045.2042-2-jasowang@redhat.com>
Signed-off-by: Greg Kroah-Hartman

vhost/net: Protect ubufs with rcu read lock in vhost_net_ubuf_put()

2025-09-04T14:55:32+00:00

commit dd54bcf86c91a4455b1f95cbc8e9ac91205f3193 upstream.

When operating on struct vhost_net_ubuf_ref, the following execution
sequence is theoretically possible:
CPU0 is finalizing DMA operation                   CPU1 is doing VHOST_NET_SET_BACKEND
                             // ubufs->refcount == 2
vhost_net_ubuf_put()                               vhost_net_ubuf_put_wait_and_free(oldubufs)
                                                     vhost_net_ubuf_put_and_wait()
                                                       vhost_net_ubuf_put()
                                                         int r = atomic_sub_return(1, &ubufs->refcount);
                                                         // r = 1
int r = atomic_sub_return(1, &ubufs->refcount);
// r = 0
                                                      wait_event(ubufs->wait, !atomic_read(&ubufs->refcount));
                                                      // no wait occurs here because condition is already true
                                                    kfree(ubufs);
if (unlikely(!r))
  wake_up(&ubufs->wait);  // use-after-free

This leads to use-after-free on ubufs access. This happens because CPU1
skips waiting for wake_up() when refcount is already zero.

To prevent that use a read-side RCU critical section in vhost_net_ubuf_put(),
as suggested by Hillf Danton. For this lock to take effect, free ubufs with
kfree_rcu().

Cc: stable@vger.kernel.org
Fixes: 0ad8b480d6ee9 ("vhost: fix ref cnt checking deadlock")
Reported-by: Andrey Ryabinin 
Suggested-by: Hillf Danton 
Signed-off-by: Nikolay Kuratov 
Message-Id: <20250805130917.727332-1-kniv@yandex-team.ru>
Signed-off-by: Michael S. Tsirkin 
Signed-off-by: Greg Kroah-Hartman

vhost/net: Defer TX queue re-enable until after sendmsg

2025-05-06T01:18:41+00:00

In handle_tx_copy, TX batching processes packets below ~PAGE_SIZE and
batches up to 64 messages before calling sock->sendmsg.

Currently, when there are no more messages on the ring to dequeue,
handle_tx_copy re-enables kicks on the ring *before* firing off the
batch sendmsg. However, sock->sendmsg incurs a non-zero delay,
especially if it needs to wake up a thread (e.g., another vhost worker).

If the guest submits additional messages immediately after the last ring
check and disablement, it triggers an EPT_MISCONFIG vmexit to attempt to
kick the vhost worker. This may happen while the worker is still
processing the sendmsg, leading to wasteful exit(s).

This is particularly problematic for single-threaded guest submission
threads, as they must exit, wait for the exit to be processed
(potentially involving a TTWU), and then resume.

In scenarios like a constant stream of UDP messages, this results in a
sawtooth pattern where the submitter frequently vmexits, and the
vhost-net worker alternates between sleeping and waking.

A common solution is to configure vhost-net busy polling via userspace
(e.g., qemu poll-us). However, treating the sendmsg as the "busy"
period by keeping kicks disabled during the final sendmsg and
performing one additional ring check afterward provides a significant
performance improvement without any excess busy poll cycles.

If messages are found in the ring after the final sendmsg, requeue the
TX handler. This ensures fairness for the RX handler and allows
vhost_run_work_list to cond_resched() as needed.

Test Case
    TX VM: taskset -c 2 iperf3  -c rx-ip-here -t 60 -p 5200 -b 0 -u -i 5
    RX VM: taskset -c 2 iperf3 -s -p 5200 -D
    6.12.0, each worker backed by tun interface with IFF_NAPI setup.
    Note: TCP side is largely unchanged as that was copy bound

6.12.0 unpatched
    EPT_MISCONFIG/second: 5411
    Datagrams/second: ~382k
    Interval         Transfer     Bitrate         Lost/Total Datagrams
    0.00-30.00  sec  15.5 GBytes  4.43 Gbits/sec  0/11481630 (0%)  sender

6.12.0 patched
    EPT_MISCONFIG/second: 58 (~93x reduction)
    Datagrams/second: ~650k  (~1.7x increase)
    Interval         Transfer     Bitrate         Lost/Total Datagrams
    0.00-30.00  sec  26.4 GBytes  7.55 Gbits/sec  0/19554720 (0%)  sender

Acked-by: Jason Wang 
Signed-off-by: Jon Kohler 
Acked-by: Michael S. Tsirkin 
Link: https://patch.msgid.link/20250501020428.1889162-1-jon@nutanix.com
Signed-off-by: Jakub Kicinski

vhost/net: Set num_buffers for virtio 1.0

2025-01-27T14:39:25+00:00

The specification says the device MUST set num_buffers to 1 if
VIRTIO_NET_F_MRG_RXBUF has not been negotiated.

Fixes: 41e3e42108bc ("vhost/net: enable virtio 1.0")
Signed-off-by: Akihiko Odaki 
Message-Id: <20240915-v1-v1-1-f10d2cb5e759@daynix.com>
Signed-off-by: Michael S. Tsirkin

mm: page_frag: avoid caller accessing 'page_frag_cache' directly

2024-11-11T18:56:27+00:00

Use appropriate frag_page API instead of caller accessing
'page_frag_cache' directly.

CC: Andrew Morton 
CC: Linux-MM 
Signed-off-by: Yunsheng Lin 
Reviewed-by: Alexander Duyck 
Acked-by: Chuck Lever 
Link: https://patch.msgid.link/20241028115343.3405838-5-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski

net: extend ubuf_info callback to ops structure

2024-04-22T23:21:35+00:00

We'll need to associate additional callbacks with ubuf_info, introduce
a structure holding ubuf_info callbacks. Apart from a more smarter
io_uring notification management introduced in next patches, it can be
used to generalise msg_zerocopy_put_abort() and also store
->sg_from_iter, which is currently passed in struct msghdr.

Reviewed-by: Jens Axboe 
Reviewed-by: David Ahern 
Signed-off-by: Pavel Begunkov 
Reviewed-by: Willem de Bruijn 
Link: https://lore.kernel.org/all/a62015541de49c0e2a8a0377a1d5d0a5aeb07016.1713369317.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

2024-03-19T15:57:39+00:00

Pull virtio updates from Michael Tsirkin:

 - Per vq sizes in vdpa

 - Info query for block devices support in vdpa

 - DMA sync callbacks in vduse

 - Fixes, cleanups

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (35 commits)
  virtio_net: rename free_old_xmit_skbs to free_old_xmit
  virtio_net: unify the code for recycling the xmit ptr
  virtio-net: add cond_resched() to the command waiting loop
  virtio-net: convert rx mode setting to use workqueue
  virtio: packed: fix unmap leak for indirect desc table
  vDPA: report virtio-blk flush info to user space
  vDPA: report virtio-block read-only info to user space
  vDPA: report virtio-block write zeroes configuration to user space
  vDPA: report virtio-block discarding configuration to user space
  vDPA: report virtio-block topology info to user space
  vDPA: report virtio-block MQ info to user space
  vDPA: report virtio-block max segments in a request to user space
  vDPA: report virtio-block block-size to user space
  vDPA: report virtio-block max segment size to user space
  vDPA: report virtio-block capacity to user space
  virtio: make virtio_bus const
  vdpa: make vdpa_bus const
  vDPA/ifcvf: implement vdpa_config_ops.get_vq_num_min
  vDPA/ifcvf: get_max_vq_size to return max size
  virtio_vdpa: create vqs with the actual size
  ...

vhost: Added pad cleanup if vnet_hdr is not present.

2024-03-19T06:45:49+00:00

When the Qemu launched with vhost but without tap vnet_hdr,
vhost tries to copy vnet_hdr from socket iter with size 0
to the page that may contain some trash.
That trash can be interpreted as unpredictable values for
vnet_hdr.
That leads to dropping some packets and in some cases to
stalling vhost routine when the vhost_net tries to process
packets and fails in a loop.

Qemu options:
  -netdev tap,vhost=on,vnet_hdr=off,...

Signed-off-by: Andrew Melnychenko 
Message-Id: <20240115194840.1183077-1-andrew@daynix.com>
Signed-off-by: Michael S. Tsirkin

vhost/net: remove vhost_net_page_frag_refill()

2024-03-05T10:38:14+00:00

The page frag in vhost_net_page_frag_refill() uses the
'struct page_frag' from skb_page_frag_refill(), but it's
implementation is similar to page_frag_alloc_align() now.

This patch removes vhost_net_page_frag_refill() by using
'struct page_frag_cache' instead of 'struct page_frag',
and allocating frag using page_frag_alloc_align().

The added benefit is that not only unifying the page frag
implementation a little, but also having about 0.5% performance
boost testing by using the vhost_net_test introduced in the
last patch.

Signed-off-by: Yunsheng Lin 
Acked-by: Jason Wang 
Acked-by: Michael S. Tsirkin 
Signed-off-by: Paolo Abeni