| Age | Commit message (Collapse) | Author |
|
With multi-device we are much more likely to have multiple
drm-gpusvm ranges pointing to the same struct mm range.
To avoid calling into drm_pagemap_populate_mm(), which is always
very costly, introduce a much less costly drm_gpusvm function,
drm_gpusvm_scan_mm() to scan the current migration state.
The device fault-handler and prefetcher can use this function to
determine whether migration is really necessary.
There are a couple of performance improvements that can be done
for this function if it turns out to be too costly. Those are
documented in the code.
v3:
- New patch.
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> # For merging through drm-xe.
Link: https://patch.msgid.link/20251219113320.183860-21-thomas.hellstrom@linux.intel.com
|
|
Use the dev_pagemap->owner field wherever possible, simplifying
the code slightly.
v3: New patch
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> # For merging through drm-xe.
Link: https://patch.msgid.link/20251219113320.183860-20-thomas.hellstrom@linux.intel.com
|
|
As an aid to understanding the lifetime of the drm_pagemaps used
by the xe driver, document how the xe driver keeps the
drm_pagemap references.
v3:
- Fix formatting (Matt Brost)
Suggested-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20251219113320.183860-19-thomas.hellstrom@linux.intel.com
|
|
Add debug printouts that are valueable for pagemap prefetch,
migration and page collection.
v2:
- Add additional debug prinouts around migration and page collection.
- Require CONFIG_DRM_XE_DEBUG_VM.
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com> #v1
Link: https://patch.msgid.link/20251219113320.183860-18-thomas.hellstrom@linux.intel.com
|
|
Mimic the dma-buf method using dma_[map|unmap]_resource to map
for pcie-p2p dma.
There's an ongoing area of work upstream to sort out how this best
should be done. One method proposed is to add an additional
pci_p2p_dma_pagemap aliasing the device_private pagemap and use
the corresponding pci_p2p_dma_pagemap page as input for
dma_map_page(). However, that would incur double the amount of
memory and latency to set up the drm_pagemap and given the huge
amount of memory present on modern GPUs, that would really not work.
Hence the simple approach used in this patch.
v2:
- Simplify xe_page_to_pcie(). (Matt Brost)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20251219113320.183860-17-thomas.hellstrom@linux.intel.com
|
|
placement for svm
Use device file descriptors and regions to represent pagemaps on
foreign or local devices.
The underlying files are type-checked at madvise time, and
references are kept on the drm_pagemap as long as there is are
madvises pointing to it.
Extend the madvise preferred_location UAPI to support the region
instance to identify the foreign placement.
v2:
- Improve UAPI documentation. (Matt Brost)
- Sanitize preferred_mem_loc.region_instance madvise. (Matt Brost)
- Clarify madvise drm_pagemap vs xe_pagemap refcounting. (Matt Brost)
- Don't allow a foreign drm_pagemap madvise without a fast
interconnect.
v3:
- Add a comment about reference-counting in xe_devmem_open() and
remove the reference-count get-and-put. (Matt Brost)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20251219113320.183860-16-thomas.hellstrom@linux.intel.com
|
|
Simplify madvise_preferred_mem_loc by removing repetitive patterns
in favour of local variables.
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20251219113320.183860-15-thomas.hellstrom@linux.intel.com
|
|
Honor the drm_pagemap vma attribute when migrating SVM pages.
Ensure that when the desired placement is validated as device
memory, that we also check that the requested drm_pagemap is
consistent with the current.
v2:
- Initialize a struct drm_pagemap pointer to NULL that could
otherwise be dereferenced uninitialized. (CI)
- Remove a redundant assignment (Matt Brost)
- Slightly improved commit message (Matt Brost)
- Extended drm_pagemap validation.
v3:
- Fix a compilation error if CONFIG_DRM_GPUSVM is not enabled.
(kernel test robot <lkp@intel.com>)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patch.msgid.link/20251219113320.183860-14-thomas.hellstrom@linux.intel.com
|
|
As a consequence, struct xe_vma_mem_attr() can't simply be assigned
or freed without taking the reference count of individual members
into account. Also add helpers to do that.
v2:
- Move some calls to xe_vma_mem_attr_fini() to xe_vma_free(). (Matt Brost)
v3:
- Rebase.
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com> #v2
Link: https://patch.msgid.link/20251219113320.183860-13-thomas.hellstrom@linux.intel.com
|
|
Register a driver-wide owner list, provide a callback to identify
fast interconnects and use the drm_pagemap_util helper to allocate
or reuse a suitable owner struct. For now we consider pagemaps on
different tiles on the same device as having fast interconnect and
thus the same owner.
v2:
- Fix up the error onion unwind in xe_pagemap_create(). (Matt Brost)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20251219113320.183860-12-thomas.hellstrom@linux.intel.com
|
|
interconnected gpus
The hmm_range_fault() and the migration helpers currently need a common
"owner" to identify pagemaps and clients with fast interconnect.
Add a drm_pagemap utility to setup such owners by registering
drm_pagemaps, in a registry, and for each new drm_pagemap,
query which existing drm_pagemaps have fast interconnects with the new
drm_pagemap.
The "owner" scheme is limited in that it is static at drm_pagemap creation.
Ideally one would want the owner to be adjusted at run-time, but that
requires changes to hmm. If the proposed scheme becomes too limited,
we need to revisit.
v2:
- Improve documentation of DRM_PAGEMAP_OWNER_LIST_DEFINE(). (Matt Brost)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> # For merging through drm-xe.
Link: https://patch.msgid.link/20251219113320.183860-11-thomas.hellstrom@linux.intel.com
|
|
With the drm_pagemap_init() interface, drm_pagemap_create() is not
used anymore.
v2:
- Slightly more verbose commit message. (Matt Brost)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> # For merging through drm-xe.
Link: https://patch.msgid.link/20251219113320.183860-10-thomas.hellstrom@linux.intel.com
|
|
Define a struct xe_pagemap that embeds all pagemap-related
data used by xekmd, and use the drm_pagemap cache- and
shrinker to manage lifetime.
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20251219113320.183860-9-thomas.hellstrom@linux.intel.com
|
|
Pagemaps are costly to set up and tear down, and they consume a lot
of system memory for the struct pages. Ideally they should be
created only when needed.
Add a caching mechanism to allow doing just that: Create the drm_pagemaps
when needed for migration. Keep them around to avoid destruction and
re-creation latencies and destroy inactive/unused drm_pagemaps on memory
pressure using a shrinker.
Only add the helper functions. They will be hooked up to the xe driver
in the upcoming patch.
v2:
- Add lockdep checking for drm_pagemap_put(). (Matt Brost)
- Add a copyright notice. (Matt Brost)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> # For merging through drm-xe.
Link: https://patch.msgid.link/20251219113320.183860-8-thomas.hellstrom@linux.intel.com
|
|
If a device holds a reference on a foregin device's drm_pagemap,
and a device unbind is executed on the foreign device,
Typically that foreign device would evict its device-private
pages and then continue its device-managed cleanup eventually
releasing its drm device and possibly allow for module unload.
However, since we're still holding a reference on a drm_pagemap,
when that reference is released and the provider module is
unloaded we'd execute out of undefined memory.
Therefore keep a reference on the provider device and module until
the last drm_pagemap reference is gone.
Note that in theory, the drm_gpusvm_helper module may be unloaded
as soon as the final module_put() of the provider driver module is
executed, so we need to add a module_exit() function that waits
for the work item executing the module_put() has completed.
v2:
- Better commit message (Matt Brost)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> # For merging through drm-xe.
Link: https://patch.msgid.link/20251219113320.183860-7-thomas.hellstrom@linux.intel.com
|
|
To be able to keep track of drm_pagemap usage, add a refcounted
backpointer to struct drm_pagemap_zdd. This will keep the drm_pagemap
reference count from dropping to zero as long as there are drm_pagemap
pages present in a CPU address space.
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> # For merging through drm-xe.
Link: https://patch.msgid.link/20251219113320.183860-6-thomas.hellstrom@linux.intel.com
|
|
With the end goal of being able to free unused pagemaps
and allocate them on demand, add a refcount to struct drm_pagemap,
remove the xe embedded drm_pagemap, allocating and freeing it
explicitly.
v2:
- Make the drm_pagemap pointer in drm_gpusvm_pages reference-counted.
v3:
- Call drm_pagemap_get() before drm_pagemap_put() in drm_gpusvm_pages
(Himal Prasad Ghimiray)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com> #v1
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> # For merging through drm-xe.
Link: https://patch.msgid.link/20251219113320.183860-5-thomas.hellstrom@linux.intel.com
|
|
In situations where no system memory is migrated to devmem, and in
upcoming patches where another GPU is performing the migration to
the newly allocated devmem buffer, there is nothing to ensure any
ongoing clear to the devmem allocation or async eviction from the
devmem allocation is complete.
Address that by passing a struct dma_fence down to the copy
functions, and ensure it is waited for before migration is marked
complete.
v3:
- New patch.
v4:
- Update the logic used for determining when to wait for the
pre_migrate_fence.
- Update the logic used for determining when to warn for the
pre_migrate_fence since the scheduler fences apparently
can signal out-of-order.
v5:
- Fix a UAF (CI)
- Remove references to source P2P migration (Himal)
- Put the pre_migrate_fence after migration.
v6:
- Pipeline the pre_migrate_fence dependency (Matt Brost)
Fixes: c5b3eb5a906c ("drm/xe: Add GPUSVM device memory copy vfunc functions")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org> # v6.15+
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> # For merging through drm-xe.
Link: https://patch.msgid.link/20251219113320.183860-4-thomas.hellstrom@linux.intel.com
|
|
The page pointer can't be NULL.
v5:
- New patch. (Matt Brost)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> # For merging through drm-xe.
Link: https://patch.msgid.link/20251219113320.183860-3-thomas.hellstrom@linux.intel.com
|
|
Since airoha_probe() is not executed under rtnl lock, there is small race
where a given device is configured by user-space while the remaining ones
are not completely loaded from the dts yet. This condition will allow a
hw device misconfiguration since there are some conditions (e.g. GDM2 check
in airoha_dev_init()) that require all device are properly loaded from the
device tree. Fix the issue moving net_devices registration at the end of
the airoha_probe routine.
Fixes: 9cd451d414f6e ("net: airoha: Add loopback support for GDM2")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251214-airoha-fix-dev-registration-v1-1-860e027ad4c6@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Add clock and reset entries for the RSCI IPs.
Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/20251203094147.6429-3-prabhakar.mahadev-lad.rj@bp.renesas.com
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
|
|
Add clock and reset entries for the RSCI IPs.
Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/20251203094147.6429-2-prabhakar.mahadev-lad.rj@bp.renesas.com
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
|
|
The Renesas RZ/N2H (R9A09G087) SoC has the same Interrupt Controller
(ICU) as the Renesas RZ/T2H (R9A09G077) SoC.
Enable support for it by selecting the RENESAS_RZT2H_ICU config.
Signed-off-by: Cosmin Tanislav <cosmin-gabriel.tanislav.xa@renesas.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/20251216123421.124401-1-cosmin-gabriel.tanislav.xa@renesas.com
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
|
|
The struct ip_tunnel_info has a flexible array member named
options that is protected by a counted_by(options_len)
attribute.
The compiler will use this information to enforce runtime bounds
checking deployed by FORTIFY_SOURCE string helpers.
As laid out in the GCC documentation, the counter must be
initialized before the first reference to the flexible array
member.
After scanning through the files that use struct ip_tunnel_info
and also refer to options or options_len, it appears the normal
case is to use the ip_tunnel_info_opts_set() helper.
Said helper would initialize options_len properly before copying
data into options, however in the GRE ERSPAN code a partial
update is done, preventing the use of the helper function.
Before this change the handling of ERSPAN traffic in GRE tunnels
would cause a kernel panic when the kernel is compiled with
GCC 15+ and having FORTIFY_SOURCE configured:
memcpy: detected buffer overflow: 4 byte write of buffer size 0
Call Trace:
<IRQ>
__fortify_panic+0xd/0xf
erspan_rcv.cold+0x68/0x83
? ip_route_input_slow+0x816/0x9d0
gre_rcv+0x1b2/0x1c0
gre_rcv+0x8e/0x100
? raw_v4_input+0x2a0/0x2b0
ip_protocol_deliver_rcu+0x1ea/0x210
ip_local_deliver_finish+0x86/0x110
ip_local_deliver+0x65/0x110
? ip_rcv_finish_core+0xd6/0x360
ip_rcv+0x186/0x1a0
Cc: stable@vger.kernel.org
Link: https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html#index-counted_005fby-variable-attribute
Reported-at: https://launchpad.net/bugs/2129580
Fixes: bb5e62f2d547 ("net: Add options as a flexible array to struct ip_tunnel_info")
Signed-off-by: Frode Nordahl <fnordahl@ubuntu.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251213101338.4693-1-fnordahl@ubuntu.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Matthieu Baerts says:
====================
mptcp: fix warn on bad status
Two somewhat related fixes addressing different issues found by
syzkaller, and producing the exact same splat: a WARNING in
subflow_data_ready().
- Patch 1: fallback earlier on simultaneous connections to avoid a
warning. A fix for v5.19.
- Patch 2: ensure context reset on disconnect, also to avoid a similar
warning. A fix for v6.2.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
====================
Link: https://patch.msgid.link/20251212-net-mptcp-subflow_data_ready-warn-v1-0-d1f9fd1c36c8@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
After the blamed commit below, if the MPC subflow is already in TCP_CLOSE
status or has fallback to TCP at mptcp_disconnect() time,
mptcp_do_fastclose() skips setting the `send_fastclose flag` and the later
__mptcp_close_ssk() does not reset anymore the related subflow context.
Any later connection will be created with both the `request_mptcp` flag
and the msk-level fallback status off (it is unconditionally cleared at
MPTCP disconnect time), leading to a warning in subflow_data_ready():
WARNING: CPU: 26 PID: 8996 at net/mptcp/subflow.c:1519 subflow_data_ready (net/mptcp/subflow.c:1519 (discriminator 13))
Modules linked in:
CPU: 26 UID: 0 PID: 8996 Comm: syz.22.39 Not tainted 6.18.0-rc7-05427-g11fc074f6c36 #1 PREEMPT(voluntary)
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
RIP: 0010:subflow_data_ready (net/mptcp/subflow.c:1519 (discriminator 13))
Code: 90 0f 0b 90 90 e9 04 fe ff ff e8 b7 1e f5 fe 89 ee bf 07 00 00 00 e8 db 19 f5 fe 83 fd 07 0f 84 35 ff ff ff e8 9d 1e f5 fe 90 <0f> 0b 90 e9 27 ff ff ff e8 8f 1e f5 fe 4c 89 e7 48 89 de e8 14 09
RSP: 0018:ffffc9002646fb30 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff88813b218000 RCX: ffffffff825c8435
RDX: ffff8881300b3580 RSI: ffffffff825c8443 RDI: 0000000000000005
RBP: 000000000000000b R08: ffffffff825c8435 R09: 000000000000000b
R10: 0000000000000005 R11: 0000000000000007 R12: ffff888131ac0000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 00007f88330af6c0(0000) GS:ffff888a93dd2000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f88330aefe8 CR3: 000000010ff59000 CR4: 0000000000350ef0
Call Trace:
<TASK>
tcp_data_ready (net/ipv4/tcp_input.c:5356)
tcp_data_queue (net/ipv4/tcp_input.c:5445)
tcp_rcv_state_process (net/ipv4/tcp_input.c:7165)
tcp_v4_do_rcv (net/ipv4/tcp_ipv4.c:1955)
__release_sock (include/net/sock.h:1158 (discriminator 6) net/core/sock.c:3180 (discriminator 6))
release_sock (net/core/sock.c:3737)
mptcp_sendmsg (net/mptcp/protocol.c:1763 net/mptcp/protocol.c:1857)
inet_sendmsg (net/ipv4/af_inet.c:853 (discriminator 7))
__sys_sendto (net/socket.c:727 (discriminator 15) net/socket.c:742 (discriminator 15) net/socket.c:2244 (discriminator 15))
__x64_sys_sendto (net/socket.c:2247)
do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
RIP: 0033:0x7f883326702d
Address the issue setting an explicit `fastclosing` flag at fastclose
time, and checking such flag after mptcp_do_fastclose().
Fixes: ae155060247b ("mptcp: fix duplicate reset on fastclose")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251212-net-mptcp-subflow_data_ready-warn-v1-2-d1f9fd1c36c8@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Syzkaller reports a simult-connect race leading to inconsistent fallback
status:
WARNING: CPU: 3 PID: 33 at net/mptcp/subflow.c:1515 subflow_data_ready+0x40b/0x7c0 net/mptcp/subflow.c:1515
Modules linked in:
CPU: 3 UID: 0 PID: 33 Comm: ksoftirqd/3 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:subflow_data_ready+0x40b/0x7c0 net/mptcp/subflow.c:1515
Code: 89 ee e8 78 61 3c f6 40 84 ed 75 21 e8 8e 66 3c f6 44 89 fe bf 07 00 00 00 e8 c1 61 3c f6 41 83 ff 07 74 09 e8 76 66 3c f6 90 <0f> 0b 90 e8 6d 66 3c f6 48 89 df e8 e5 ad ff ff 31 ff 89 c5 89 c6
RSP: 0018:ffffc900006cf338 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff888031acd100 RCX: ffffffff8b7f2abf
RDX: ffff88801e6ea440 RSI: ffffffff8b7f2aca RDI: 0000000000000005
RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000007
R10: 0000000000000004 R11: 0000000000002c10 R12: ffff88802ba69900
R13: 1ffff920000d9e67 R14: ffff888046f81800 R15: 0000000000000004
FS: 0000000000000000(0000) GS:ffff8880d69bc000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000560fc0ca1670 CR3: 0000000032c3a000 CR4: 0000000000352ef0
Call Trace:
<TASK>
tcp_data_queue+0x13b0/0x4f90 net/ipv4/tcp_input.c:5197
tcp_rcv_state_process+0xfdf/0x4ec0 net/ipv4/tcp_input.c:6922
tcp_v6_do_rcv+0x492/0x1740 net/ipv6/tcp_ipv6.c:1672
tcp_v6_rcv+0x2976/0x41e0 net/ipv6/tcp_ipv6.c:1918
ip6_protocol_deliver_rcu+0x188/0x1520 net/ipv6/ip6_input.c:438
ip6_input_finish+0x1e4/0x4b0 net/ipv6/ip6_input.c:489
NF_HOOK include/linux/netfilter.h:318 [inline]
NF_HOOK include/linux/netfilter.h:312 [inline]
ip6_input+0x105/0x2f0 net/ipv6/ip6_input.c:500
dst_input include/net/dst.h:471 [inline]
ip6_rcv_finish net/ipv6/ip6_input.c:79 [inline]
NF_HOOK include/linux/netfilter.h:318 [inline]
NF_HOOK include/linux/netfilter.h:312 [inline]
ipv6_rcv+0x264/0x650 net/ipv6/ip6_input.c:311
__netif_receive_skb_one_core+0x12d/0x1e0 net/core/dev.c:5979
__netif_receive_skb+0x1d/0x160 net/core/dev.c:6092
process_backlog+0x442/0x15e0 net/core/dev.c:6444
__napi_poll.constprop.0+0xba/0x550 net/core/dev.c:7494
napi_poll net/core/dev.c:7557 [inline]
net_rx_action+0xa9f/0xfe0 net/core/dev.c:7684
handle_softirqs+0x216/0x8e0 kernel/softirq.c:579
run_ksoftirqd kernel/softirq.c:968 [inline]
run_ksoftirqd+0x3a/0x60 kernel/softirq.c:960
smpboot_thread_fn+0x3f7/0xae0 kernel/smpboot.c:160
kthread+0x3c2/0x780 kernel/kthread.c:463
ret_from_fork+0x5d7/0x6f0 arch/x86/kernel/process.c:148
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
</TASK>
The TCP subflow can process the simult-connect syn-ack packet after
transitioning to TCP_FIN1 state, bypassing the MPTCP fallback check,
as the sk_state_change() callback is not invoked for * -> FIN_WAIT1
transitions.
That will move the msk socket to an inconsistent status and the next
incoming data will hit the reported splat.
Close the race moving the simult-fallback check at the earliest possible
stage - that is at syn-ack generation time.
About the fixes tags: [2] was supposed to also fix this issue introduced
by [3]. [1] is required as a dependence: it was not explicitly marked as
a fix, but it is one and it has already been backported before [3]. In
other words, this commit should be backported up to [3], including [2]
and [1] if that's not already there.
Fixes: 23e89e8ee7be ("tcp: Don't drop SYN+ACK for simultaneous connect().") [1]
Fixes: 4fd19a307016 ("mptcp: fix inconsistent state on fastopen race") [2]
Fixes: 1e777f39b4d7 ("mptcp: add MSG_FASTOPEN sendmsg flag support") [3]
Cc: stable@vger.kernel.org
Reported-by: syzbot+0ff6b771b4f7a5bce83b@syzkaller.appspotmail.com
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/586
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251212-net-mptcp-subflow_data_ready-warn-v1-1-d1f9fd1c36c8@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Avoid spamming the log with drm_info(). Use drm_dbg() instead.
Fixes: cc795e041034 ("drm/xe/svm: Make xe_svm_range_needs_migrate_to_vram() public")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: <stable@vger.kernel.org> # v6.17+
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patch.msgid.link/20251219113320.183860-2-thomas.hellstrom@linux.intel.com
|
|
There has been a syzkaller bug reported recently with the following
trace:
list_del corruption, ffff888058bea080->prev is LIST_POISON2 (dead000000000122)
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:59!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
CPU: 3 UID: 0 PID: 21246 Comm: syz.0.2928 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:__list_del_entry_valid_or_report+0x13e/0x200 lib/list_debug.c:59
Code: 48 c7 c7 e0 71 f0 8b e8 30 08 ef fc 90 0f 0b 48 89 ef e8 a5 02 55 fd 48 89 ea 48 89 de 48 c7 c7 40 72 f0 8b e8 13 08 ef fc 90 <0f> 0b 48 89 ef e8 88 02 55 fd 48 89 ea 48 b8 00 00 00 00 00 fc ff
RSP: 0018:ffffc9000d49f370 EFLAGS: 00010286
RAX: 000000000000004e RBX: ffff888058bea080 RCX: ffffc9002817d000
RDX: 0000000000000000 RSI: ffffffff819becc6 RDI: 0000000000000005
RBP: dead000000000122 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000080000000 R11: 0000000000000001 R12: ffff888039e9c230
R13: ffff888058bea088 R14: ffff888058bea080 R15: ffff888055461480
FS: 00007fbbcfe6f6c0(0000) GS:ffff8880d6d0a000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000110c3afcb0 CR3: 00000000382c7000 CR4: 0000000000352ef0
Call Trace:
<TASK>
__list_del_entry_valid include/linux/list.h:132 [inline]
__list_del_entry include/linux/list.h:223 [inline]
list_del_rcu include/linux/rculist.h:178 [inline]
__team_queue_override_port_del drivers/net/team/team_core.c:826 [inline]
__team_queue_override_port_del drivers/net/team/team_core.c:821 [inline]
team_queue_override_port_prio_changed drivers/net/team/team_core.c:883 [inline]
team_priority_option_set+0x171/0x2f0 drivers/net/team/team_core.c:1534
team_option_set drivers/net/team/team_core.c:376 [inline]
team_nl_options_set_doit+0x8ae/0xe60 drivers/net/team/team_core.c:2653
genl_family_rcv_msg_doit+0x209/0x2f0 net/netlink/genetlink.c:1115
genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline]
genl_rcv_msg+0x55c/0x800 net/netlink/genetlink.c:1210
netlink_rcv_skb+0x158/0x420 net/netlink/af_netlink.c:2552
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219
netlink_unicast_kernel net/netlink/af_netlink.c:1320 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1346
netlink_sendmsg+0x8c8/0xdd0 net/netlink/af_netlink.c:1896
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0xa98/0xc70 net/socket.c:2630
___sys_sendmsg+0x134/0x1d0 net/socket.c:2684
__sys_sendmsg+0x16d/0x220 net/socket.c:2716
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xcd/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The problem is in this flow:
1) Port is enabled, queue_id != 0, in qom_list
2) Port gets disabled
-> team_port_disable()
-> team_queue_override_port_del()
-> del (removed from list)
3) Port is disabled, queue_id != 0, not in any list
4) Priority changes
-> team_queue_override_port_prio_changed()
-> checks: port disabled && queue_id != 0
-> calls del - hits the BUG as it is removed already
To fix this, change the check in team_queue_override_port_prio_changed()
so it returns early if port is not enabled.
Reported-by: syzbot+422806e5f4cce722a71f@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=422806e5f4cce722a71f
Fixes: 6c31ff366c11 ("team: remove synchronize_rcu() called during queue override change")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251212102953.167287-1-jiri@resnulli.us
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The driver already has debug messages for memcpy and linear transfers,
but is missing them for cyclic transfers.
Cyclic transfers are one of the main uses of the DMA controller, used
for audio data transfers. And since these are likely the first DMA
peripherals to be enabled, it helps to have these debug messages.
Acked-by: Jernej Skrabec <jernej.skrabec@gmail.com>
Signed-off-by: Chen-Yu Tsai <wens@kernel.org>
Link: https://patch.msgid.link/20251221064754.1783369-1-wens@kernel.org
Signed-off-by: Vinod Koul <vkoul@kernel.org>
|
|
maxburst, as provided by the client, specifies the largest amount of
data that is allowed to be transferred in one burst. This limit is
normally provided to avoid a data burst overflowing the target FIFO.
It does not mean that the DMA engine can only do bursts in that size.
Let the driver pick the largest supported burst length within the
given limit. This lets the driver work correctly with some clients that
give a large maxburst value. In particular, the 8250_dw driver will give
a quarter of the UART's FIFO size as maxburst. On some systems the FIFO
size is 256 bytes, giving a maxburst of 64 bytes, while the hardware
only supports bursts of up to 16 bytes.
Signed-off-by: Chen-Yu Tsai <wens@kernel.org>
Reviewed-by: Jernej Skrabec <jernej.skrabec@gmail.com>
Link: https://patch.msgid.link/20251221080450.1813479-1-wens@kernel.org
Signed-off-by: Vinod Koul <vkoul@kernel.org>
|
|
Using libc types and headers from the UAPI headers is problematic as it
introduces a dependency on a full C toolchain.
Use the fixed-width integer types provided by the UAPI headers instead.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20251222-uapi-idxd-v1-1-baa183adb20d@linutronix.de
Signed-off-by: Vinod Koul <vkoul@kernel.org>
|
|
Clobbering an error value to be returned from shdma_tx_submit() with
a pm_runtime_put() return value is not particularly useful, especially
if the latter is 0, so stop doing that.
This will facilitate a planned change of the pm_runtime_put() return
type to void in the future.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/9626129.rMLUfLXkoz@rafael.j.wysocki
Signed-off-by: Vinod Koul <vkoul@kernel.org>
|
|
Add qcom,soundwire-v2.2.0 to the list of supported Qualcomm
SoundWire controller versions. This version falls back to
qcom,soundwire-v2.0.0 if not explicitly handled by the driver.
Signed-off-by: Jingyi Wang <jingyi.wang@oss.qualcomm.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Prasad Kumpatla <prasad.kumpatla@oss.qualcomm.com>
Reviewed-by: Srinivas Kandagatla <srinivas.kandagatla@oss.qualcomm.com>
Link: https://patch.msgid.link/20251105-knp-audio-v2-v4-1-ae0953f02b44@oss.qualcomm.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
|
|
These are nearly identical to the respective driver callbacks. The only
differences are that .remove() returns void instead of int and .shutdown()
has to cope for unbound devices.
The objective is to get rid of users of struct device_driver callbacks
.probe(), .remove() and .shutdown() to eventually remove these.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Link: https://patch.msgid.link/20251215174925.1327021-6-u.kleine-koenig@baylibre.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
|
|
All remove functions return zero and the driver core ignores any other
returned value (just emits a warning about it being ignored). So make all
remove callbacks return void instead of an ignored int. This is in line
with most other subsystems.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Link: https://patch.msgid.link/20251215174925.1327021-5-u.kleine-koenig@baylibre.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
|
|
The -EEXIST error code is reserved by the module loading infrastructure
to indicate that a module is already loaded. When a module's init
function returns -EEXIST, userspace tools like kmod interpret this as
"module already loaded" and treat the operation as successful, returning
0 to the user even though the module initialization actually failed.
This follows the precedent set by commit 54416fd76770 ("netfilter:
conntrack: helper: Replace -EEXIST by -EBUSY") which fixed the same
issue in nf_conntrack_helper_register().
This affects bpf_crypto_skcipher module. While the configuration
required to build it as a module is unlikely in practice, it is
technically possible, so fix it for correctness.
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://lore.kernel.org/r/20251220-dev-module-init-eexists-bpf-v1-1-7f186663dbe7@samsung.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Puranjay Mohan says:
====================
Allow calling kfuncs from raw_tp programs
V1: https://lore.kernel.org/all/20251218145514.339819-1-puranjay@kernel.org/
Changes in V1->V2:
- Update selftests to allow success for raw_tp programs calling kfuncs.
This set enables calling kfuncs from raw_tp programs.
====================
Link: https://patch.msgid.link/20251222133250.1890587-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
As the previous commit allowed raw_tp programs to call kfuncs, so of the
selftests that were expected to fail will now succeed.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20251222133250.1890587-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Associate raw tracepoint program type with the kfunc tracing hook. This
allows calling kfuncs from raw_tp programs.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20251222133250.1890587-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Roman Gushchin says:
====================
mm: bpf kfuncs to access memcg data
Introduce kfuncs to simplify the access to the memcg data.
These kfuncs can be used to accelerate monitoring use cases and
for implementing custom OOM policies once BPF OOM is landed.
This patchset was separated out from the BPF OOM patchset to simplify
the logistics and accelerate the landing of the part which is useful
by itself. No functional changes since BPF OOM v2.
v4:
- refactored memcg vm event and stat item idx checks (by Alexei)
v3:
- dropped redundant kfuncs flags (by Alexei)
- fixed kdocs warnings (by Alexei)
- merged memcg stats access patches into one (by Alexei)
- restored root memcg usage reporting, added a comment
- added checks for enum boundaries
- added Shakeel and JP as co-maintainers (by Shakeel)
v2:
- added mem_cgroup_disabled() checks (by Shakeel B.)
- added special handling of the root memcg in bpf_mem_cgroup_usage()
(by Shakeel B.)
- minor fixes in the kselftest (by Shakeel B.)
- added a MAINTAINERS entry (by Shakeel B.)
v1:
https://lore.kernel.org/bpf/87ike29s5r.fsf@linux.dev/T/#t
====================
Link: https://patch.msgid.link/20251223044156.208250-1-roman.gushchin@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Let's create a separate entry for MM BPF extensions: these patches
often require an attention from both bpf and mm communities.
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20251223044156.208250-7-roman.gushchin@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add test coverage for the kfuncs that fetch memcg stats. Using some common
stats, test scenarios ensuring that the given stat increases by some
arbitrary amount. The stats selected cover the three categories represented
by the enums: node_stat_item, memcg_stat_item, vm_event_item.
Since only a subset of all stats are queried, use a static struct made up
of fields for each stat. Write to the struct with the fetched values when
the bpf program is invoked and read the fields in the user mode program for
verification.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20251223044156.208250-6-roman.gushchin@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Introduce BPF kfuncs to conveniently access memcg data:
- bpf_mem_cgroup_vm_events(),
- bpf_mem_cgroup_memory_events(),
- bpf_mem_cgroup_usage(),
- bpf_mem_cgroup_page_state(),
- bpf_mem_cgroup_flush_stats().
These functions are useful for implementing BPF OOM policies, but
also can be used to accelerate access to the memcg data. Reading
it through cgroupfs is much more expensive, roughly 5x, mostly
because of the need to convert the data into the text and back.
JP Kobryn:
An experiment was setup to compare the performance of a program that
uses the traditional method of reading memory.stat vs a program using
the new kfuncs. The control program opens up the root memory.stat file
and for 1M iterations reads, converts the string values to numeric data,
then seeks back to the beginning. The experimental program sets up the
requisite libbpf objects and for 1M iterations invokes a bpf program
which uses the kfuncs to fetch all available stats for node_stat_item,
memcg_stat_item, and vm_event_item types.
The results showed a significant perf benefit on the experimental side,
outperforming the control side by a margin of 93%. In kernel mode,
elapsed time was reduced by 80%, while in user mode, over 99% of time
was saved.
control: elapsed time
real 0m38.318s
user 0m25.131s
sys 0m13.070s
experiment: elapsed time
real 0m2.789s
user 0m0.187s
sys 0m2.512s
control: perf data
33.43% a.out libc.so.6 [.] __vfscanf_internal
6.88% a.out [kernel.kallsyms] [k] vsnprintf
6.33% a.out libc.so.6 [.] _IO_fgets
5.51% a.out [kernel.kallsyms] [k] format_decode
4.31% a.out libc.so.6 [.] __GI_____strtoull_l_internal
3.78% a.out [kernel.kallsyms] [k] string
3.53% a.out [kernel.kallsyms] [k] number
2.71% a.out libc.so.6 [.] _IO_sputbackc
2.41% a.out [kernel.kallsyms] [k] strlen
1.98% a.out a.out [.] main
1.70% a.out libc.so.6 [.] _IO_getline_info
1.51% a.out libc.so.6 [.] __isoc99_sscanf
1.47% a.out [kernel.kallsyms] [k] memory_stat_format
1.47% a.out [kernel.kallsyms] [k] memcpy_orig
1.41% a.out [kernel.kallsyms] [k] seq_buf_printf
experiment: perf data
10.55% memcgstat bpf_prog_..._query [k] bpf_prog_16aab2f19fa982a7_query
6.90% memcgstat [kernel.kallsyms] [k] memcg_page_state_output
3.55% memcgstat [kernel.kallsyms] [k] _raw_spin_lock
3.12% memcgstat [kernel.kallsyms] [k] memcg_events
2.87% memcgstat [kernel.kallsyms] [k] __memcg_slab_post_alloc_hook
2.73% memcgstat [kernel.kallsyms] [k] kmem_cache_free
2.70% memcgstat [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack
2.25% memcgstat [kernel.kallsyms] [k] __memcg_slab_free_hook
2.06% memcgstat [kernel.kallsyms] [k] get_page_from_freelist
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Co-developed-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Link: https://lore.kernel.org/r/20251223044156.208250-5-roman.gushchin@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Introduce a BPF kfunc to get a trusted pointer to the root memory
cgroup. It's very handy to traverse the full memcg tree, e.g.
for handling a system-wide OOM.
It's possible to obtain this pointer by traversing the memcg tree
up from any known memcg, but it's sub-optimal and makes BPF programs
more complex and less efficient.
bpf_get_root_mem_cgroup() has a KF_ACQUIRE | KF_RET_NULL semantics,
however in reality it's not necessary to bump the corresponding
reference counter - root memory cgroup is immortal, reference counting
is skipped, see css_get(). Once set, root_mem_cgroup is always a valid
memcg pointer. It's safe to call bpf_put_mem_cgroup() for the pointer
obtained with bpf_get_root_mem_cgroup(), it's effectively a no-op.
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20251223044156.208250-4-roman.gushchin@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
To effectively operate with memory cgroups in BPF there is a need
to convert css pointers to memcg pointers. A simple container_of
cast which is used in the kernel code can't be used in BPF because
from the verifier's point of view that's a out-of-bounds memory access.
Introduce helper get/put kfuncs which can be used to get
a refcounted memcg pointer from the css pointer:
- bpf_get_mem_cgroup,
- bpf_put_mem_cgroup.
bpf_get_mem_cgroup() can take both memcg's css and the corresponding
cgroup's "self" css. It allows it to be used with the existing cgroup
iterator which iterates over cgroup tree, not memcg tree.
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20251223044156.208250-3-roman.gushchin@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
To use memcg_page_state_output() in bpf_memcontrol.c move the
declaration from v1-specific memcontrol-v1.h to memcontrol.h.
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://lore.kernel.org/r/20251223044156.208250-2-roman.gushchin@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Fix typo received -> received by review.
Signed-off-by: Rohit Chourasia <chourasiarohit27@gmail.com>
Acked-by: Ping-Ke Shih <pkshih@realtek.com>
Signed-off-by: Ping-Ke Shih <pkshih@realtek.com>
Link: https://patch.msgid.link/20251211161911.30611-1-chourasiarohit27@gmail.com
|
|
llist_add() returns true only when adding to an empty list, which indicates
that no IRQ work is currently queued or running. Therefore, we only need to
call irq_work_queue() when llist_add() returns true, to avoid unnecessarily
re-queueing IRQ work that is already pending or executing.
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
bypass_lb_node()
For the PREEMPT_RT kernels, the scx_bypass_lb_timerfn() running in the
preemptible per-CPU ktimer kthread context, this means that the following
scenarios will occur(for x86 platform):
cpu1 cpu2
ktimer kthread:
->scx_bypass_lb_timerfn
->bypass_lb_node
->for_each_cpu(cpu, resched_mask)
migration/1: by preempt by migration/2:
multi_cpu_stop() multi_cpu_stop()
->take_cpu_down()
->__cpu_disable()
->set cpu1 offline
->rq1 = cpu_rq(cpu1)
->resched_curr(rq1)
->smp_send_reschedule(cpu1)
->native_smp_send_reschedule(cpu1)
->if(unlikely(cpu_is_offline(cpu))) {
WARN(1, "sched: Unexpected
reschedule of offline CPU#%d!\n", cpu);
return;
}
This commit therefore use the resched_cpu() to replace resched_curr()
in the bypass_lb_node() to avoid send-ipi to offline CPUs.
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|