linux.git/include/linux, branch v7.1-rc3

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

2026-05-10T01:42:54+00:00

Pull bpf fixes from Alexei Starovoitov:

 - Fix sk_local_storage diag dump via netlink (Amery Hung)

 - Fix off-by-one in arena direct-value access (Junyoung Jang)

 - Reject TCP_NODELAY in bpf-tcp congestion control (KaFai Wan)

 - Fix type confusion in bpf_*_sock() (Kuniyuki Iwashima)

 - Reject TX-only AF_XDP sockets (Linpu Yu)

 - Don't run arg-tracking analysis twice on main subprog (Paul Chaignon)

 - Fix NULL pointer dereference in bpf_sk_storage_clone and fib lookup
   (Weiming Shi)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Fix off-by-one boundary validation in arena direct-value access
  xskmap: reject TX-only AF_XDP sockets
  bpf: Don't run arg-tracking analysis twice on main subprog
  bpf: Free reuseport cBPF prog after RCU grace period.
  bpf: tcp: Fix type confusion in sol_tcp_sockopt().
  bpf: tcp: Fix type confusion in bpf_skc_to_tcp6_sock().
  bpf: tcp: Fix type confusion in bpf_skc_to_tcp_sock().
  mptcp: bpf: Fix type confusion in bpf_mptcp_sock_from_subflow()
  selftest: bpf: Add test for bpf_tcp_sock() and RAW socket.
  bpf: tcp: Fix type confusion in bpf_tcp_sock().
  tools/headers: Regenerate stddef.h to fix BPF selftests
  bpf: Fix sk_local_storage diag dumping uninitialized special fields
  bpf: Fix NULL pointer dereference in bpf_skb_fib_lookup()
  sockmap: Fix sk_psock_drop() race vs sock_map_{unhash,close,destroy}().
  bpf: Fix NULL pointer dereference in bpf_sk_storage_clone and diag paths
  selftests/bpf: Verify bpf-tcp-cc rejects TCP_NODELAY
  selftests/bpf: Test TCP_NODELAY in TCP hdr opt callbacks
  bpf: Reject TCP_NODELAY in bpf-tcp-cc
  bpf: Reject TCP_NODELAY in TCP header option callbacks

Merge tag 'sched-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2026-05-09T02:42:10+00:00

Pull scheduler fixes from Ingo Molnar:

 - Fix spurious failures in rseq self-tests (Mark Brown)

 - Fix rseq rseq::cpu_id_start ABI regression due to TCMalloc's creative
   use of the supposedly read-only field

   The fix is to introduce a new ABI variant based on a new (larger)
   rseq area registration size, to keep the TCMalloc use of rseq
   backwards compatible on new kernels (Thomas Gleixner)

 - Fix wakeup_preempt_fair() for not waking up task (Vincent Guittot)

 - Fix s64 mult overflow in vruntime_eligible() (Zhan Xusheng)

* tag 'sched-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix wakeup_preempt_fair() for not waking up task
  sched/fair: Fix overflow in vruntime_eligible()
  selftests/rseq: Expand for optimized RSEQ ABI v2
  rseq: Reenable performance optimizations conditionally
  rseq: Implement read only ABI enforcement for optimized RSEQ V2 mode
  selftests/rseq: Validate legacy behavior
  selftests/rseq: Make registration flexible for legacy and optimized mode
  selftests/rseq: Skip tests if time slice extensions are not available
  rseq: Revert to historical performance killing behaviour
  rseq: Don't advertise time slice extensions if disabled
  rseq: Protect rseq_reset() against interrupts
  rseq: Set rseq::cpu_id_start to 0 on unregistration
  selftests/rseq: Don't run tests with runner scripts outside of the scripts

Merge tag 'net-7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

2026-05-07T17:32:03+00:00

Pull networking fixes from Jakub Kicinski:
 "Including fixes from Netfilter, IPsec, Bluetooth and WiFi.

  Current release - fix to a fix:

   - ipmr: add __rcu to netns_ipv4.mrt, make sure we hold the RCU lock
     in all relevant places

  Current release - new code bugs:

   - fixes for the recently added resizable hash tables

   - ipv6: make sure we default IPv6 tunnel drivers to =m now that IPv6
     itself is built in

   - drv: octeontx2-af: fixes for parser/CAM fixes

  Previous releases - regressions:

   - phy: micrel: fix LAN8814 QSGMII soft reset

   - wifi:
       - cw1200: revert "Fix locking in error paths"
       - ath12k: fix crash on WCN7850, due to adding the same queue
         buffer to a list multiple times

  Previous releases - always broken:

   - number of info leak fixes

   - ipv6: implement limits on extension header parsing

   - wifi: number of fixes for missing bound checks in the drivers

   - Bluetooth: fixes for races and locking issues

   - af_unix:
       - fix an issue between garbage collection and PEEK
       - fix yet another issue with OOB data

   - xfrm: esp: avoid in-place decrypt on shared skb frags

   - netfilter: replace skb_try_make_writable() by skb_ensure_writable()

   - openvswitch: vport: fix race between tunnel creation and linking
     leading to invalid memory accesses (type confusion)

   - drv: amd-xgbe: fix PTP addend overflow causing frozen clock

  Misc:

   - sched/isolation: make HK_TYPE_KTHREAD an alias of HK_TYPE_DOMAIN
     (for relevant IPVS change)"

* tag 'net-7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (190 commits)
  net: sparx5: configure serdes for 1000BASE-X in sparx5_port_init()
  net: sparx5: fix wrong chip ids for TSN SKUs
  net: stmmac: dwmac-nuvoton: fix NULL pointer dereference in nvt_set_phy_intf_sel()
  tcp: Fix dst leak in tcp_v6_connect().
  ipmr: Call ipmr_fib_lookup() under RCU.
  net: phy: broadcom: Save PHY counters during suspend
  net/smc: fix missing sk_err when TCP handshake fails
  af_unix: Reject SIOCATMARK on non-stream sockets
  veth: fix OOB txq access in veth_poll() with asymmetric queue counts
  eth: fbnic: fix double-free of PCS on phylink creation failure
  net: ethernet: cortina: Drop half-assembled SKB
  selftests: mptcp: pm: restrict 'unknown' check to pm_nl_ctl
  selftests: mptcp: check output: catch cmd errors
  mptcp: pm: prio: skip closed subflows
  mptcp: pm: ADD_ADDR rtx: return early if no retrans
  mptcp: pm: ADD_ADDR rtx: skip inactive subflows
  mptcp: pm: ADD_ADDR rtx: resched blocked ADD_ADDR quicker
  mptcp: pm: ADD_ADDR rtx: free sk if last
  mptcp: pm: ADD_ADDR rtx: always decrease sk refcount
  mptcp: pm: ADD_ADDR rtx: fix potential data-race
  ...

Merge tag 'v7.1-rc3-ksmbd-server-fixes' of git://git.samba.org/ksmbd

2026-05-07T05:02:28+00:00

Pull smb server fixes from Steve French:

 - Fix memory leak in connection free

 - Fix inherited ACL ACE validation

 - Minor cleanup

 - Fix for share config

 - Fix durable handle cleanup race

 - Fix close_file_table_ids in session teardown

 - smbdirect fixes:
    - Fix memory region registration
    - Two fixes for out-of-tree builds

* tag 'v7.1-rc3-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: validate inherited ACE SID length
  ksmbd: fix kernel-doc warnings from ksmbd_conn_get/put()
  ksmbd: fail share config requests when path allocation fails
  ksmbd: close durable scavenger races against m_fp_list lookups
  ksmbd: harden file lifetime during session teardown
  ksmbd: centralize ksmbd_conn final release to plug transport leak
  smb: smbdirect: fix MR registration for coalesced SG lists
  smb: smbdirect: introduce and use include/linux/smbdirect.h
  smb: smbdirect: make use of DEFAULT_SYMBOL_NAMESPACE and EXPORT_SYMBOL_GPL

rseq: Implement read only ABI enforcement for optimized RSEQ V2 mode

2026-05-06T15:40:15+00:00

The optimized RSEQ V2 mode requires that user space adheres to the ABI
specification and does not modify the read-only fields cpu_id_start,
cpu_id, node_id and mm_cid behind the kernel's back.

While the kernel does not rely on these fields, the adherence to this is a
fundamental prerequisite to allow multiple entities, e.g. libraries, in an
application to utilize the full potential of RSEQ without stepping on each
other toes.

Validate this adherence on every update of these fields. If the kernel
detects that user space modified the fields, the application is force
terminated.

Fixes: d6200245c75e ("rseq: Allow registering RSEQ with slice extension")
Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Dmitry Vyukov 
Tested-by: Dmitry Vyukov 
Link: https://patch.msgid.link/20260428224427.845230956%40kernel.org
Cc: stable@vger.kernel.org

Merge tag 'wq-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

2026-05-05T23:09:31+00:00

Pull workqueue fixes from Tejun Heo:

 - Fix devm_alloc_workqueue() passing a va_list as a positional arg to
   the variadic alloc_workqueue() macro, which garbled wq->name and
   skipped lockdep init on the devm path. Fold both noprof entry points
   onto a va_list helper.

   Also, annotate it using __printf(1, 0)

* tag 'wq-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Annotate alloc_workqueue_va() with __printf(1, 0)
  workqueue: fix devm_alloc_workqueue() va_list misuse

Merge tag 'cgroup-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

2026-05-05T22:43:32+00:00

Pull cgroup fixes from Tejun Heo:

- During v6.19, cgroup task unlink was moved from do_exit() to after the
final task switch to satisfy a controller invariant. That left the kernel
seeing tasks past exit_signals() longer than userspace expected, and
several v7.0 follow-ups tried to bridge the gap by making rmdir wait for
the kernel side. None held up.

The latest is an A-A deadlock when rmdir is invoked by the reaper of
zombies whose pidns teardown the rmdir itself is waiting on, which
points at the synchronizing approach being fundamentally wrong.

Take a different approach: drop the wait, leave rmdir's user-visible
side returning as soon as cgroup.procs is empty, and defer the css
percpu_ref kill that drives ->css_offline() until the cgroup is fully
depopulated.

Tagged for stable. Somewhat invasive but contained. The hope is that
fixing forward sticks. If not, the fallback is to revert the entire
chain and rework on the development branch.

Note that this doesn't plug a pre-existing analogous race in
cgroup_apply_control_disable() (controller disable via
subtree_control). Not a regression. The development branch will do
the more invasive restructuring needed for that.

- Documentation update for cgroup-v1 charge-commit section that still
referenced functions removed when the memcg hugetlb try-commit-cancel
protocol was retired.

* tag 'cgroup-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
docs: cgroup-v1: Update charge-commit section
cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated

Merge tag 'sched_ext-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

2026-05-05T22:22:04+00:00

Pull sched_ext fixes from Tejun Heo:

 - Fix idle CPU selection returning prev_cpu outside the task's cpus_ptr
   when the BPF caller's allowed mask was wider. Stable backport.

 - Two opposite-direction gaps in scx_task_iter's cgroup-scoped mode
   versus the global mode:

    - Tasks past exit_signals() are filtered by the cgroup walk but kept
      by global. Sub-scheduler enable abort leaked __scx_init_task()
      state. Add a CSS_TASK_ITER_WITH_DEAD flag to cgroup's task
      iterator (scx_task_iter is its only user) and use it.

    - Tasks past sched_ext_dead() are still returned, tripping
      WARN_ON_ONCE() in callers or making them touch torn-down state.
      Mark and skip under the per-task rq lock.

* tag 'sched_ext-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: idle: Recheck prev_cpu after narrowing allowed mask
  sched_ext: Skip past-sched_ext_dead() tasks in scx_task_iter_next_locked()
  cgroup, sched_ext: Include exiting tasks in cgroup iter

rseq: Revert to historical performance killing behaviour

2026-05-05T14:02:57+00:00

The recent RSEQ optimization work broke the TCMalloc abuse of the RSEQ ABI
as it not longer unconditionally updates the CPU, node, mm_cid fields,
which are documented as read only for user space. Due to the observed
behavior of the kernel it was possible for TCMalloc to overwrite the
cpu_id_start field for their own purposes and rely on the kernel to update
it unconditionally after each context switch and before signal delivery.

The RSEQ ABI only guarantees that these fields are updated when the data
changes, i.e. the task is migrated or the MMCID of the task changes due to
switching from or to per CPU ownership mode.

The optimization work eliminated the unconditional updates and reduced them
to the documented ABI guarantees, which results in a massive performance
win for syscall, scheduling heavy work loads, which in turn breaks the
TCMalloc expectations.

There have been several options discussed to restore the TCMalloc
functionality while preserving the optimization benefits. They all end up
in a series of hard to maintain workarounds, which in the worst case
introduce overhead for everyone, e.g. in the scheduler.

The requirements of TCMalloc and the optimization work are diametral and
the required work arounds are a maintainence burden. They end up as fragile
constructs, which are blocking further optimization work and are pretty
much guaranteed to cause more subtle issues down the road.

The optimization work heavily depends on the generic entry code, which is
not used by all architectures yet. So the rework preserved the original
mechanism moslty unmodified to keep the support for architectures, which
handle rseq in their own exit to user space loop. That code is currently
optimized out by the compiler on architectures which use the generic entry
code.

This allows to revert back to the original behaviour by replacing the
compile time constant conditions with a runtime condition where required,
which disables the optimization and the dependend time slice extension
feature until the run-time condition can be enabled in the RSEQ
registration code on a per task basis again.

The following changes are required to restore the original behavior, which
makes TCMalloc work again:

  1) Replace the compile time constant conditionals with runtime
     conditionals where appropriate to prevent the compiler from optimizing
     the legacy mode out

  2) Enforce unconditional update of IDs on context switch for the
     non-optimized v1 mode

  3) Enforce update of IDs in the pre signal delivery path for the
     non-optimized v1 mode

  4) Enforce update of IDs in the membarrier(RSEQ) IPI for the
     non-optimized v1 mode

  5) Make time slice and future extensions depend on optimized v2 mode

This brings back the full performance problems, but preserves the v2
optimization code and for generic entry code using architectures also the
TIF_RSEQ optimization which avoids a full evaluation of the exit to user
mode loop in many cases.

Fixes: 566d8015f7ee ("rseq: Avoid CPU/MM CID updates when no event pending")
Reported-by: Mathias Stearn 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Dmitry Vyukov 
Tested-by: Dmitry Vyukov 
Closes: https://lore.kernel.org/CAHnCjA25b+nO2n5CeifknSKHssJpPrjnf+dtr7UgzRw4Zgu=oA@mail.gmail.com
Link: https://patch.msgid.link/20260428224427.517051752%40kernel.org
Cc: stable@vger.kernel.org

sched/isolation: Make HK_TYPE_KTHREAD an alias of HK_TYPE_DOMAIN

2026-05-04T23:52:55+00:00

Since commit 041ee6f3727a ("kthread: Rely on HK_TYPE_DOMAIN for preferred
affinity management"), kthreads default to use the HK_TYPE_DOMAIN
cpumask. IOW, it is no longer affected by the setting of the nohz_full
boot kernel parameter.

That means HK_TYPE_KTHREAD should now be an alias of HK_TYPE_DOMAIN
instead of HK_TYPE_KERNEL_NOISE to correctly reflect the current kthread
behavior. Make the change as HK_TYPE_KTHREAD is still being used in
some networking code.

Fixes: 041ee6f3727a ("kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management")
Signed-off-by: Waiman Long 
Signed-off-by: Julian Anastasov 
Signed-off-by: Pablo Neira Ayuso