<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/kernel, branch v7.1-rc2</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>Merge tag 'locking-urgent-2026-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2026-05-03T15:17:09+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-05-03T15:17:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=cffcf520fd69fe5a93b519cf0c7ccc5e467fcb1c'/>
<id>cffcf520fd69fe5a93b519cf0c7ccc5e467fcb1c</id>
<content type='text'>
Pull locking fix from Ingo Molnar:
 "Fix lockup in requeue-PI during signal/timeout wakeups, by Sebastian
  Andrzej Siewior"

* tag 'locking-urgent-2026-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Prevent lockup in requeue-PI during signal/timeout wakeup
</content>
</entry>
<entry>
<title>Merge tag 'sched-urgent-2026-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2026-05-03T15:05:23+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-05-03T15:05:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=c3cba36b394ba48b7920dcb99c46cee8ee59a116'/>
<id>c3cba36b394ba48b7920dcb99c46cee8ee59a116</id>
<content type='text'>
Pull scheduler fixes from Ingo Molnar:

 - Fix the delayed dequeue negative lag increase fix in the
   fair scheduler (Peter Zijlstra)

 - Fix wakeup_preempt_fair() to do proper delayed dequeue
   (Vincent Guittot)

 - Clear sched_entity::rel_deadline when initializing
   forked entities; a stale value can make all tasks
   EEVDF-ineligible, causing a NULL pointer dereference
   crash in pick_next_entity() (Zicheng Qu) - see the
   sketch after this list
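
A sketch of the third fix's shape, assuming the clearing sits in the
fork path's entity initialization (exact location illustrative):

    /*
     * A forked entity must not inherit a stale rel_deadline from its
     * parent, or every task can end up EEVDF-ineligible, leading to a
     * NULL pointer dereference in pick_next_entity().
     */
    p-&gt;se.rel_deadline = 0;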

* tag 'sched-urgent-2026-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Clear rel_deadline when initializing forked entities
  sched/fair: Fix wakeup_preempt_fair() vs delayed dequeue
  sched/fair: Fix the negative lag increase fix
</content>
</entry>
<entry>
<title>futex: Drop CLONE_THREAD requirement for private default hash alloc</title>
<updated>2026-05-01T20:12:34+00:00</updated>
<author>
<name>Davidlohr Bueso</name>
<email>dave@stgolabs.net</email>
</author>
<published>2026-05-01T19:41:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=ee9dce44362b2d8132c32964656ab6dff7dfbc6a'/>
<id>ee9dce44362b2d8132c32964656ab6dff7dfbc6a</id>
<content type='text'>
Currently need_futex_hash_allocate_default() depends on strict pthread
semantics, abusing CLONE_THREAD.  This breaks the non-concurrency
assumptions when doing the mm-&gt;futex_ref pcpu allocations, leading to
bugs[0] when sharing the mm in other ways, e.g.:

    BUG: KASAN: slab-use-after-free in futex_hash_put

... where the +1 bias can end up on a percpu counter that mm-&gt;futex_ref
no longer points at.

Loosen the check to cover any CLONE_VM clone, except vfork().  Excluding
vfork keeps the existing paths untouched (no overhead), and we can't
race in the first place: either the parent is suspended and the child
runs alone, or mm-&gt;futex_ref is already allocated from an earlier
CLONE_VM.
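
A minimal sketch of the loosened check, keyed off the clone flags as
described above (the helper name follows this patch; the in-tree
version takes more context and may differ):

    static bool need_futex_hash_allocate_default(u64 clone_flags)
    {
            /* New mm: nothing is shared, nothing to pre-allocate. */
            if (!(clone_flags &amp; CLONE_VM))
                    return false;
            /*
             * vfork(): the parent is suspended until the child execs
             * or exits, so nothing can race on mm-&gt;futex_ref.
             */
            if (clone_flags &amp; CLONE_VFORK)
                    return false;
            return true;
    }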

Link: https://lore.kernel.org/all/CAL_bE8LsmCQ-FAtYDuwbJhOkt9p2wwYQwAbMh=PifC=VsiBM6A@mail.gmail.com/ [0]
Fixes: d9b05321e21e ("futex: Move futex_hash_free() back to __mmput()")
Reported-by: Yiming Qian &lt;yimingqian591@gmail.com&gt;
Signed-off-by: Davidlohr Bueso &lt;dave@stgolabs.net&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'mm-hotfixes-stable-2026-04-30-15-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm</title>
<updated>2026-05-01T15:45:23+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-05-01T15:45:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=2b4d0215bec0f09bfb872e9a8aa1e1943a42d5e7'/>
<id>2b4d0215bec0f09bfb872e9a8aa1e1943a42d5e7</id>
<content type='text'>
Pull MM fixes from Andrew Morton:
 "20 hotfixes. All are for MM (and for MMish maintainers). 9 are
  cc:stable and the remainder are for post-7.0 issues or aren't deemed
  suitable for backporting.

  There are two DAMON series from SeongJae Park which address races
  which could lead to use-after-free errors, and avoid the possibility
  of presenting stale parameter values to users"

* tag 'mm-hotfixes-stable-2026-04-30-15-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm: memcontrol: fix rcu unbalance in get_non_dying_memcg_end()
  mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()
  MAINTAINERS: remove stale kdump project URL
  mm/damon/stat: detect and use fresh enabled value
  mm/damon/lru_sort: detect and use fresh enabled and kdamond_pid values
  mm/damon/reclaim: detect and use fresh enabled and kdamond_pid values
  selftests/mm: specify requirement for PROC_MEM_ALWAYS_FORCE=y
  mm/damon/sysfs-schemes: protect path kfree() with damon_sysfs_lock
  mm/damon/sysfs-schemes: protect memcg_path kfree() with damon_sysfs_lock
  MAINTAINERS: update Li Wang's email address
  MAINTAINERS, mailmap: update email address for Qi Zheng
  MAINTAINERS: update Liam's email address
  mm/hugetlb_cma: round up per_node before logging it
  MAINTAINERS: fix regex pattern in CORE MM category
  mm/vma: do not try to unmap a VMA if mmap_prepare() invoked from mmap()
  mm: start background writeback based on per-wb threshold for strictlimit BDIs
  kho: fix error handling in kho_add_subtree()
  liveupdate: fix return value on session allocation failure
  mailmap: update entry for Dan Carpenter
  vmalloc: fix buffer overflow in vrealloc_node_align()
</content>
</entry>
<entry>
<title>Merge tag 'trace-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace</title>
<updated>2026-04-30T05:21:44+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-04-30T05:21:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=e75a43c7cec459a07d91ed17de4de13ede2b7758'/>
<id>e75a43c7cec459a07d91ed17de4de13ede2b7758</id>
<content type='text'>
Pull tracing fixes from Steven Rostedt:

 - Fix inverted check of registering the stats for branch tracing

   register_stat_tracer() returns zero on success and negative on
   error, but the callers treated a return of zero as an error and
   printed a warning message. Because this was just a normal printk()
   message and not a WARN(), it wasn't caught in any testing.

   Fix the check to print the warning message when an error actually
   happens.

 - Fix a typo in a comment in tracepoint.h

 - Limit the size of event probes to 3K

   It is possible to create, via the tracefs file system, a dynamic
   event probe that is greater than the max size of an event that the
   ring buffer can hold. This effectively renders the event useless.

   Limit the size of an event probe to 3K, as that should be large
   enough to handle any dynamic events being created while fitting
   within the PAGE_SIZE sub-buffers of the ring buffer.

* tag 'trace-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/probes: Limit size of event probe to 3K
  tracepoint: Fix typo in tracepoint.h comment
  tracing: branch: Fix inverted check on stat tracer registration
</content>
</entry>
<entry>
<title>tracing/probes: Limit size of event probe to 3K</title>
<updated>2026-04-29T20:07:38+00:00</updated>
<author>
<name>Steven Rostedt</name>
<email>rostedt@goodmis.org</email>
</author>
<published>2026-04-28T16:23:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=b2aa3b4d64e460ac606f386c24e7d8a873ce6f1a'/>
<id>b2aa3b4d64e460ac606f386c24e7d8a873ce6f1a</id>
<content type='text'>
There currently isn't a limit on how large an event probe can be. One
could make an event greater than PAGE_SIZE, which renders the event
useless: if it is bigger than the largest event that can be recorded
into the ring buffer, it will never be recorded.

An event probe should never need to be greater than 3K, so make that
the max size. As long as the max is less than the largest event that
can be recorded onto the ring buffer, it is fine.
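
The shape of the change, as a sketch (the macro name and error code
here are illustrative, not necessarily what the patch uses):

    /* 3K leaves room for any reasonable dynamic event while staying
     * within the ring buffer's PAGE_SIZE sub-buffers. */
    #define EPROBE_MAX_SIZE 3072

    if (size &gt; EPROBE_MAX_SIZE)
            return -E2BIG;  /* would never fit into the ring buffer */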

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Acked-by: Masami Hiramatsu (Google) &lt;mhiramat@kernel.org&gt;
Fixes: 93ccae7a22274 ("tracing/kprobes: Support basic types on dynamic events")
Link: https://patch.msgid.link/20260428122302.706610ba@gandalf.local.home
Signed-off-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
</content>
</entry>
<entry>
<title>futex: Prevent lockup in requeue-PI during signal/timeout wakeup</title>
<updated>2026-04-29T06:56:40+00:00</updated>
<author>
<name>Sebastian Andrzej Siewior</name>
<email>bigeasy@linutronix.de</email>
</author>
<published>2026-04-28T10:34:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=bc7304f3ae20972d11db6e0b1b541c63feda5f05'/>
<id>bc7304f3ae20972d11db6e0b1b541c63feda5f05</id>
<content type='text'>
During wait-requeue-PI (task A) and requeue-PI (task B), the following
race can happen:

     Task A                             Task B
  futex_wait_requeue_pi()
    futex_setup_timer()
    futex_do_wait()
                                   futex_requeue()
                                        CLASS(hb, hb1)(&amp;key1);
                                        CLASS(hb, hb2)(&amp;key2);
        *timeout*
    futex_requeue_pi_wakeup_sync()
        requeue_state = Q_REQUEUE_PI_IGNORE

    *blocks on hb-&gt;lock*

                                        futex_proxy_trylock_atomic()
                                          futex_requeue_pi_prepare()
                                            Q_REQUEUE_PI_IGNORE =&gt; -EAGAIN
                                        double_unlock_hb(hb1, hb2)
                                         *retry*

Task B acquires both hb locks and attempts to acquire the PI-lock of
the topmost waiter (task A). Task A is leaving early due to a signal
or timeout and has started removing itself from the queue. It updates
its requeue_state but cannot remove itself from the list, because that
requires the hb lock, which is held by task B.

Usually task A is able to grab the lock after task B unlocks it.
However, if task B has a higher priority, task A may not wake up in
time to acquire the lock before task B takes it again. This is
especially true on a UP system, where A is never scheduled.

As a result, task A blocks on the lock while task B busy-loops trying
to make progress, livelocking the system instead. Tragic.

This can be fixed by removing the topmost waiter from the list in this
case. That allows task B to grab the next topmost waiter (if any) in
the next iteration and make progress.

Remove the topmost waiter if futex_requeue_pi_prepare() fails. Let the
waiter conditionally remove itself from the list in
handle_early_requeue_pi_wakeup().
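
Roughly, the shape of the requeue-side change (a sketch based on this
description; __futex_unqueue() is an assumed helper name and the real
futex_proxy_trylock_atomic() plumbing differs):

    if (!futex_requeue_pi_prepare(top_waiter, NULL)) {
            /*
             * The top waiter already signalled Q_REQUEUE_PI_IGNORE due
             * to a signal or timeout. Drop it from the hash-bucket
             * list so the next retry operates on the next waiter and
             * makes progress instead of livelocking.
             */
            __futex_unqueue(top_waiter);  /* assumed helper */
            return -EAGAIN;
    }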

Fixes: 07d91ef510fb1 ("futex: Prevent requeue_pi() lock nesting issue on RT")
Reported-by: Moritz Klammler &lt;Moritz.Klammler@ferchau.com&gt;
Signed-off-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@kernel.org&gt;
Link: https://patch.msgid.link/20260428103425.dywXyPd3@linutronix.de
Closes: https://lore.kernel.org/all/VE1PR06MB6894BE61C173D802365BE19DFF4CA@VE1PR06MB6894.eurprd06.prod.outlook.com
</content>
</entry>
<entry>
<title>Merge tag 'sched_ext-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext</title>
<updated>2026-04-28T23:26:11+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-04-28T23:26:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=664f0f6be37ce4ef80992cf2ed74761cd5bbe207'/>
<id>664f0f6be37ce4ef80992cf2ed74761cd5bbe207</id>
<content type='text'>
Pull sched_ext fixes from Tejun Heo:
 "The merge window pulled in the cgroup sub-scheduler infrastructure,
  and new AI reviews are accelerating bug reporting and fixing - hence
  the larger than usual fixes batch:

   - Use-after-frees during scheduler load/unload:
       - The disable path could free the BPF scheduler while deferred
         irq_work / kthread work was still in flight
       - cgroup setter callbacks read the active scheduler outside the
         rwsem that synchronizes against teardown
     Fix both, and reuse the disable drain in the enable error paths so
     the BPF JIT page can't be freed under live callbacks.

   - Several BPF op invocations didn't tell the framework which runqueue
     was already locked, so helper kfuncs that re-acquire the runqueue
     by CPU could deadlock on the held lock.

     Fix the affected callsites, including recursive parent-into-child
     dispatch.

   - The hardlockup notifier ran from NMI but eventually took a
     non-NMI-safe lock. Bounce it through irq_work.

   - A handful of bugs in the new sub-scheduler hierarchy:
       - helper kfuncs hard-coded the root instead of resolving the
         caller's scheduler
       - the enable error path tried to disable per-task state that had
         never been initialized, and leaked cpus_read_lock on the way
         out
       - a sysfs object was leaked on every load/unload
       - the dispatch fast-path used the root scheduler instead of the
         task's
       - a couple of CONFIG #ifdef guards were misclassified

   - Verifier-time hardening: BPF programs of unrelated struct_ops types
     (e.g. tcp_congestion_ops) could call sched_ext kfuncs - a semantic
     bug and, once sub-sched was enabled, a KASAN out-of-bounds read.
     Now rejected at load. Plus a few NULL and cross-task argument
     checks on sched_ext kfuncs, and a selftest covering the new deny.

   - rhashtable (Herbert): restore the insecure_elasticity toggle and
     bounce the deferred-resize kick through irq_work to break a
     lock-order cycle observable from raw-spinlock callers. sched_ext's
     scheduler-instance hash is the first user of both.

   - The bypass-mode load balancer used file-scope cpumasks; with
     multiple scheduler instances now possible, those raced. Move to
     per-instance cpumasks, plus a follow-up to skip tasks whose
     recorded CPU is stale relative to the new owning runqueue.

   - Smaller fixes:
       - a dispatch queue's first-task tracking misbehaved when a parked
         iterator cursor sat in the list
       - the runqueue's next-class wasn't promoted on local-queue
         enqueue, leaving an SCX task behind RT in edge cases
       - the reference qmap scheduler stopped erroring on legitimate
         cross-scheduler task-storage misses"

* tag 'sched_ext-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (26 commits)
  sched_ext: Fix scx_flush_disable_work() UAF race
  sched_ext: Call wakeup_preempt() in local_dsq_post_enq()
  sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable
  sched_ext: Reject NULL-sch callers in scx_bpf_task_set_slice/dsq_vtime
  sched_ext: Refuse cross-task select_cpu_from_kfunc calls
  sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED
  sched_ext: Make bypass LB cpumasks per-scheduler
  sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_before
  sched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_task
  sched_ext: Save and restore scx_locked_rq across SCX_CALL_OP
  sched_ext: Use dsq-&gt;first_task instead of list_empty() in dispatch_enqueue() FIFO-tail
  sched_ext: Resolve caller's scheduler in scx_bpf_destroy_dsq() / scx_bpf_dsq_nr_queued()
  sched_ext: Read scx_root under scx_cgroup_ops_rwsem in cgroup setters
  sched_ext: Don't disable tasks in scx_sub_enable_workfn() abort path
  sched_ext: Skip tasks with stale task_rq in bypass_lb_cpu()
  sched_ext: Guard scx_dsq_move() against NULL kit-&gt;dsq after failed iter_new
  sched_ext: Unregister sub_kset on scheduler disable
  sched_ext: Defer scx_hardlockup() out of NMI
  sched_ext: sync disable_irq_work in bpf_scx_unreg()
  sched_ext: Fix local_dsq_post_enq() to use task's scheduler in sub-sched
  ...
</content>
</entry>
<entry>
<title>tracing: branch: Fix inverted check on stat tracer registration</title>
<updated>2026-04-28T18:28:29+00:00</updated>
<author>
<name>Breno Leitao</name>
<email>leitao@debian.org</email>
</author>
<published>2026-04-20T13:25:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=3b75dd76e64a04771861bb5647951c264919e563'/>
<id>3b75dd76e64a04771861bb5647951c264919e563</id>
<content type='text'>
init_annotated_branch_stats() and all_annotated_branch_stats() check the
return value of register_stat_tracer() with "if (!ret)", but
register_stat_tracer() returns 0 on success and a negative errno on
failure. The inverted check causes the warning to be printed on every
successful registration, e.g.:

  Warning: could not register annotated branches stats

while leaving real failures silent. The initcall also returned a
hard-coded 1 instead of the actual error.

Invert the check and propagate ret so that the warning fires on real
errors and the initcall reports the correct status.
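
In code form, the before/after of the check described above (sketched
from this description, abbreviated):

    ret = register_stat_tracer(&amp;annotated_branch_stats);

    /* Before: warns when ret == 0 (success), is silent on real
     * failures, and the initcall returns a hard-coded 1. */
    if (!ret) {
            printk(KERN_WARNING
                   "Warning: could not register annotated branches stats\n");
            return 1;
    }

    /* After: warn only on a real error and propagate it. */
    if (ret) {
            printk(KERN_WARNING
                   "Warning: could not register annotated branches stats\n");
            return ret;
    }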

Cc: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Cc: Ingo Molnar &lt;mingo@elte.hu&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Link: https://patch.msgid.link/20260420-tracing-v1-1-d8f4cd0d6af1@debian.org
Fixes: 002bb86d8d42 ("tracing/ftrace: separate events tracing and stats tracing engine")
Signed-off-by: Breno Leitao &lt;leitao@debian.org&gt;
Acked-by: Masami Hiramatsu (Google) &lt;mhiramat@kernel.org&gt;
Signed-off-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: Fix scx_flush_disable_work() UAF race</title>
<updated>2026-04-28T17:40:03+00:00</updated>
<author>
<name>Cheng-Yang Chou</name>
<email>yphbchou0911@gmail.com</email>
</author>
<published>2026-04-28T17:36:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=d99f7a32f09dccbe396187370ec1a74a31b73d7e'/>
<id>d99f7a32f09dccbe396187370ec1a74a31b73d7e</id>
<content type='text'>
scx_flush_disable_work() calls irq_work_sync() followed by
kthread_flush_work() to ensure that the disable kthread work has
fully completed before bpf_scx_unreg() frees the SCX scheduler.

However, a concurrent scx_vexit() (e.g., triggered by a watchdog stall)
creates a race window between scx_claim_exit() and irq_work_queue():

  CPU A (scx_vexit (watchdog))        CPU B (bpf_scx_unreg)
  ----                                ----
  scx_claim_exit()
    atomic_try_cmpxchg(NONE-&gt;kind)
  stack_trace_save()
  vscnprintf()
                                      scx_disable()
                                        scx_claim_exit() -&gt; FAIL
                                      scx_flush_disable_work()
                                        irq_work_sync()      // no-op: not queued yet
                                        kthread_flush_work() // no-op: not queued yet
                                      kobject_put(&amp;sch-&gt;kobj) -&gt; free %sch
  irq_work_queue() -&gt; UAF on %sch
  scx_disable_irq_workfn()
    kthread_queue_work() -&gt; UAF

The root cause is that CPU B's scx_flush_disable_work() returns after
syncing an irq_work that has not yet been queued, while CPU A is still
executing the code between scx_claim_exit() and irq_work_queue().

Loop until exit_kind reaches SCX_EXIT_DONE or SCX_EXIT_NONE, draining
disable_irq_work and disable_work in each pass. This ensures that any
work queued after the previous check is caught, while also correctly
handling cases where no disable was triggered (e.g., the
scx_sub_enable_workfn() abort path).
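
A sketch of that loop (field names follow the message; the in-tree
scx_flush_disable_work() may differ in detail):

    for (;;) {
            int kind = atomic_read(&amp;sch-&gt;exit_kind);

            if (kind == SCX_EXIT_NONE || kind == SCX_EXIT_DONE)
                    break;
            /*
             * An exit may be claimed while its work is not yet queued;
             * drain both stages and re-check so that work queued after
             * the previous pass is not missed.
             */
            irq_work_sync(&amp;sch-&gt;disable_irq_work);
            kthread_flush_work(&amp;sch-&gt;disable_work);
    }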

Fixes: 510a27055446 ("sched_ext: sync disable_irq_work in bpf_scx_unreg()")
Reported-by: https://sashiko.dev/#/patchset/20260424100221.32407-1-icheng%40nvidia.com
Suggested-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Cheng-Yang Chou &lt;yphbchou0911@gmail.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
</feed>
