linux.git/kernel/fork.c, branch v7.2-rc1

Merge tag 'block-7.2-20260625' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

2026-06-25T16:56:47+00:00

Pull block fixes from Jens Axboe:

 - blk-cgroup locking rework and fixes:
      - fix a use-after-free in __blkcg_rstat_flush()
      - defer freeing policy data until after an RCU grace period
      - defer the blkcg css_put until the blkg is unlinked from
        the queue
      - unwind the queue_lock nesting under RCU / blkcg->lock
        across the lookup, create, associate and destroy paths

 - NVMe fixes via Keith:
      - Fix a crash and memory leak during invalid cdev teardown,
        and related cdev cleanups (Maurizio, John)
      - nvmet fixes: handle TCP_CLOSING in the tcp state_change
        handler, reject short AUTH_RECEIVE buffers, handle inline
        data with a nonzero offset in rdma, fix an sq refcount leak,
        and allocate ana_state with the port (Maurizio, Michael,
        Bryam, Wentao, Rosen)
      - nvme-fc fix to not cancel requests on an IO target before it
        is initialized (Mohamed)
      - nvme-apple fix to prevent shared tags across queues on Apple
        A11 (Nick)
      - Various smaller fixes and cleanups (John)

 - MD fixes via Yu Kuai:
      - raid1/raid10 fixes for writes_pending and barrier reference
        leaks on write and discard failures, plus REQ_NOWAIT handling
        fixes (Abd-Alrhman)
      - raid5 discard accounting and validation, and a batch of fixes
        for stripe batch races (Yu Kuai, Chen)
      - Protect raid1 head_position during read balancing (Chen)

 - block bio-integrity fixes: correct an error injection static key
   decrement, fix GFP flag confusion in bio_integrity_alloc_buf(), and
   handle REQ_OP_ZONE_APPEND in __bio_integrity_action() (Christoph)

 - Fixes for bio_iov_iter_bounce_write(): revert the iov_iter after a
   short copy, and respect the iov_iter nofault flag (Qu)

 - Invalidate the cached plug timestamp after a task switch, and clear
   PF_BLOCK_TS in copy_process() (Usama)

 - Fix the IORING_URING_CMD_REISSUE flags check in blkdev_uring_cmd()
   (Yitang)

 - Remove a redundant plug in __submit_bio() (Wen)

 - Don't warn when reclassifying a busy socket lock in nbd (Deepanshu)

* tag 'block-7.2-20260625' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (45 commits)
  block: handle REQ_OP_ZONE_APPEND in __bio_integrity_action
  block: fix GFP_ flags confusion in bio_integrity_alloc_buf
  block, bfq: don't grab queue_lock to initialize bfq
  mm/page_io: don't nest queue_lock under rcu in bio_associate_blkg_from_page()
  blk-cgroup: don't nest queue_lock under blkcg->lock in blkcg_destroy_blkgs()
  blk-cgroup: don't nest queue_lock under rcu in bio_associate_blkg()
  blk-cgroup: don't nest queue_lock under rcu in blkg_lookup_create()
  blk-cgroup: don't nest queue_lock under rcu in blkcg_print_blkgs()
  blk-cgroup: delay freeing policy data after rcu grace period
  blk-cgroup: protect iterating blkgs with blkcg->lock in blkcg_print_stat()
  md/raid5: avoid R5_Overlap races while breaking stripe batches
  md/raid5: use stripe state snapshot in break_stripe_batch_list()
  blk-cgroup: defer blkcg css_put until blkg is unlinked from queue
  blk-cgroup: fix UAF in __blkcg_rstat_flush()
  block, bfq: protect async queue reset with blkcg locks
  nbd: don't warn when reclassifying a busy socket lock
  block: fix incorrect error injection static key decrement
  md/raid5: let stripe batch bm_seq comparison wrap-safe
  md/raid1: protect head_position for read balance
  md/raid1: free r1_bio when REQ_NOWAIT is set and read would block on retry
  ...

Merge tag 'mm-stable-2026-06-18-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

2026-06-19T17:14:34+00:00

Pull MM updates from Andrew Morton:

 - "selftests/mm: clean up build output and verbosity" (Li Wang)

   Remove some noise from the MM selftests build

 - "mm: Free contiguous order-0 pages efficiently" (Ryan Roberts)

   Speed up the freeing of a batch of 0-order pages by first scanning
   them for coalescing opportunities. This is applicable to vfree() and
   to the releasing of frozen pages

 - "mm/damon: introduce DAMOS failed region quota charge ratio"
   (SeongJae Park)

   Address a DAMOS usability issue: The DAMOS quota often exhausts
   prematurely because it charges for all memory attempted, causing slow
   and inconsistent performance when actions fail on unreclaimable
   memory.

   To fix this, a new feature lets users set a smaller, flexible quota
   charge ratio (via a numerator and denominator) for failed regions.
   Since failed actions cause less overhead, reducing their quota cost
   ensures more predictable and efficient DAMOS processing

 - "selftests/cgroup: improve zswap tests robustness and support large
   page sizes" (Li Wang)

   Fix various spurious failures and improves the overall robustness of
   the cgroup zswap selftests

 - "fix MAP_DROPPABLE not supported errno" (Anthony Yznaga)

   Fix an issue in the mlock selftests on arm32

 - "mm: huge_memory: clean up defrag sysfs with shared" (Breno Leitao)

   Some maintenance work in the huge_memory code

 - "treewide: fixup gfp_t printks" (Brendan Jackman)

   Use the special vprintf() gfp_t conversion in various places

 - "mm: Fix vmemmap optimization accounting and initialization" (Muchun
   Song)

   Fix several bugs in the vmemmap optimization, mainly around incorrect
   page accounting and memmap initialization in the DAX and memory
   hotplug paths. It also fixes pageblock migratetype initialization and
   struct page initialization for ZONE_DEVICE compound pages

 - "mm/damon: repost non-hotfix reviewed patches in damon/next tree"

   A sprinkle of unrelated minor bugfixes for DAMON

 - "mm: remove page_mapped()" (David Hildenbrand)

   Remove this function from the tree, replacing it with folio_mapped()

 - "mm/damon: let DAMON be paused and resumed" (SeongJae Park)

   Allow DAMON to be paused and resumed without losing its current state

 - "kasan: hw_tags: Disable tagging for stack and page-tables" (Muhammad
   Usama Anjum)

   Simplify and speed up kasan by removing its ineffective tagging of
   stacks and page tables

 - "mm/damon/reclaim,lru_sort: monitor all system rams by default"
   (SeongJae Park)

   Simplify deployment on diverse hardware like NUMA systems by updating
   DAMON_RECLAIM and DAMON_LRU_SORT to automatically monitor the
   physical address range covering all System RAM areas by default,
   replacing the overly restrictive behavior that only targeted the
   single largest memory block to save on negligible overhead

 - "mm/damon/sysfs: document filters/ directory as deprecated" (SeongJae
   Park)

   Update some DAMON docs

 - "mm: use spinlock guards for zone lock" (Dmitry Ilvokhin)

   Switch zone->lock handling over to using the guard() mechanisms

 - "mm/filemap: tighten mmap_miss hit accounting" (fujunjie)

   Fix a flaw where the mmap_miss counter over-credited page cache hits
   during fault-arounds and page-fault retries. This results in
   significant reduction of redundant synchronous mmap readahead I/O,
   drastically cutting down execution time and gigabytes read for sparse
   random or strided memory access workloads

 - "selftests/cgroup: Fix false positive failures in test_percpu_basic"
   (Li Wang)

   Fix a couple of false-positives in the cgroup kmem selftests

 - "mm/damon/reclaim: support monitoring intervals auto-tuning"
   (SeongJae Park)

   Add a new parameter to DAMON permitting DAMON_RECLAIM to
   automatically tune DAMON's sampling and aggregation intervals

 - "mm/damon/stat: add kdamond_pid parameter" (SeongJae Park)

   Change DAMON_STAT to provide the pid of its kdamond

 - "mm/kmemleak: dedupe verbose scan output" (Breno Leitao)

   Remove large amounts of duplicated backtraces from the verbose-mode
   kmemleak output

 - "mm: remove CONFIG_HAVE_BOOTMEM_INFO_NODE (Part 1)" (David
   Hildenbrand)

   Reduce our use of CONFIG_HAVE_BOOTMEM_INFO_NODE, with a view to
   removing it entirely in a later series

 - "mm/damon: validate min_region_size to be power of 2" (Liew Rui Yan)

   Prevent users from passing a non-power-of-2 value of `addr_unit', as
   this later results in undesirable behavior

 - "mm: document read_pages and simplify usage" (Frederick Mayle)

 - "tools/mm/page-types: Fix misc bugs" (Ye Liu)

   Fix three issues in tools/mm/page-types.c

 - "mm: misc cleanups from __GFP_UNMAPPED series" (Brendan Jackman)

   Implement several cleanups in the page allocator and related code

 - "mm, swap: swap table phase IV: unify allocation" (Kairui Song)

   Unify the allocation and charging of anon and shmem swap in folios,
   provides better synchronization, consolidates the metadata
   management, hence dropping the static array and map, and improves
   performance

 - "mm/damon: introduce data attributes monitoring" (SeongJae Park(

   Extend DAMON to monitor general data attributes other than accesses

 - "mm/vmalloc: free unused pages on vrealloc() shrink" (Shivam Kalra)

   Implement the TODO in vrealloc() to unmap and free unused pages when
   shrinking across a page boundary

 - "mm/damon: documentation and comment fixes" (niecheng)

 - "remove mmap_action success, error hooks" (Lorenzo Stoakes)

   Eliminate custom hooks from mmap_action by removing the problematic
   success_hook which allowed drivers to improperly access uninitialized
   VMAs. It replaces the error_hook with a simple error-code field and
   updates the memory char driver accordingly

 - "mm/damon: minor improvements for code readability and tests"
   (SeongJae Park)

 - "mm/damon: fix macro arguments and clarify quota goals doc" (Maksym
   Shcherba)

 - "userfaultfd: merge fs/userfaultfd.c into mm/userfaultfd.c" (Mike
   Rapoport)

 - "mm/mglru: improve reclaim loop and dirty folio" (Kairui Song and
   others)

   Clean up and slightly improves MGLRU's reclaim loop and dirty
   writeback handling. Large performance improvements are measured

 - "use vma locks for proc/pid/{smaps|numa_maps} reads" (Suren
   Baghdasaryan)

   Use per-vma locks when reading /proc/pid/smaps and numa_maps similar
   to reduce contention on central mmap_lock

 - "refactors thpsize_shmem_enabled_store() and thpsize_shmem_enabled_show()"
   (Ran Xiaokai)

   Some cleanup work in the THP code

 - "selftests/memfd: fix compilation warnings" (Konstantin Khorenko)

   Fix a few build glitches in the memfd selftest code.

 - "memcg: shrink obj_stock_pcp and cache multiple objcgs" (Shakeel
   Butt)

   Resolve a 68% performance regression caused by NUMA-node cache
   thrashing around struct obj_stock_pcp by shrinking its existing
   fields and expanding it into a multi-slot array that caches up to
   five obj_cgroup pointers per CPU, allowing per-node variants of the
   same memcg to coexist within a single 64-byte cache line.

 - "zram: writeback fixes" (Sergey Senozhatsky)

   address a couple of unrelated zram writeback issues

 - "mm: switch THP shrinker to list_lru" (Johannes Weiner)

   Resolve NUMA-awareness issues and streamlines callsite interaction by
   refactoring and extending the list_lru API to completely replace the
   complex, open-coded deferred split queue for Transparent Huge Pages

 - "mm: improve large folio readahead for exec memory" (Usama Arif)

   Improve large-folio readahead on systems like 64K-page arm64 by
   preventing the mmap_miss check from permanently disabling
   target-oriented VM_EXEC readahead, and by generalizing the
   force_thp_readahead gate to support mappings with any usefully large
   maximum folio order under the cache cap.

 - "userfaultfd/pagemap: pre-existing fixes" (Kiryl Shutsemau)

   Fix a bunch of minor issues in the userfaultfd/pagemap, all of which
   were flagged by Sashiko review of proposed new material

 - "mm/sparse-vmemmap: Provide generic vmemmap_set_pmd() and
   vmemmap_check_pmd()" (Muchun Song)

   Provide generic versions of these two functions so the four
   arch-specific implementations can be removed.

 - "mm/swap, PM: hibernate: fix swapoff race in uswsusp by pinning swap
   device" (Youngjun Park)

   Address a uswsusp-vs-swapoff race and reduces the swap device
   reference taking/releasing frequency.

 - "mm/hmm: A fix and a selftest" (Dev Jain)

* tag 'mm-stable-2026-06-18-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
  selftests/mm/hmm-tests: test pagemap reads of PMD device-private entries
  fs/proc/task_mmu: do not warn on seeing non-migration pmd entry
  lib/test_hmm: check alloc_page_vma() return value and handle OOM
  mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX
  mm/swap: remove redundant swap device reference in alloc/free
  mm/swap, PM: hibernate: fix swapoff race in uswsusp by pinning swap device
  mm/filemap: use folio_next_index() for start
  vmalloc: fix NULL pointer dereference in is_vm_area_hugepages()
  sparc/mm: drop vmemmap_check_pmd helper and use generic code
  loongarch/mm: drop vmemmap_check_pmd helper and use generic code
  riscv/mm: drop vmemmap_pmd helpers and use generic code
  arm64/mm: drop vmemmap_pmd helpers and use generic code
  mm/sparse-vmemmap: provide generic vmemmap_set_pmd() and vmemmap_check_pmd()
  rust: page: mark Page::nid as inline
  userfaultfd: build __VMA_UFFD_FLAGS from config-gated masks
  userfaultfd: gate must_wait writability check on pte_present()
  mm/huge_memory: preserve pmd_swp_uffd_wp on device-private PMD downgrade
  fs/proc/task_mmu: fix hugetlb self-deadlock in pagemap_scan_pte_hole()
  fs/proc/task_mmu: use huge_page_size() in pagemap_scan_hugetlb_entry()
  fs/proc/task_mmu: fix make_uffd_wp_huge_pte() prot-update race
  ...

kernel/fork: clear PF_BLOCK_TS in copy_process()

2026-06-16T16:07:36+00:00

PF_BLOCK_TS is only set in blk_time_get_ns() when current->plug is
non-NULL, and blk_finish_plug() clears it via __blk_flush_plug()
before NULLing the plug pointer.  copy_process() breaks the
invariant by inheriting PF_BLOCK_TS from the parent while resetting
the child's plug to NULL.

Clear PF_BLOCK_TS alongside that assignment so callers can rely on
"PF_BLOCK_TS set implies current->plug != NULL" and dereference
current->plug unguarded.

Fixes: 06b23f92af87 ("block: update cached timestamp post schedule/preemption")
Cc: stable@vger.kernel.org
Signed-off-by: Usama Arif 
Link: https://patch.msgid.link/20260616141604.328820-2-usama.arif@linux.dev
Signed-off-by: Jens Axboe

Merge tag 'sched-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip

2026-06-15T09:20:18+00:00

Pull scheduler updates from Ingo Molnar:
 "SMP load-balancing updates:

   - A large series to introduce infrastructure for cache-aware load
     balancing, with the goal of co-locating tasks that share data
     within the same Last Level Cache (LLC) domain. By improving cache
     locality, the scheduler can reduce cache bouncing and cache misses,
     ultimately improving data access efficiency.

     Implemented by Chen Yu and Tim Chen, based on early prototype work
     by Peter Zijlstra, with fixes by Jianyong Wu, Peter Zijlstra and
     Shrikanth Hegde.

   - A series to simplify CONFIG_SCHED_SMT ifdef usage (Shrikanth Hegde)

  Fair scheduler updates:

   - A series to improve SD_ASYM_CPUCAPACITY scheduling by introducing
     SMT awareness (Andrea Righi, K Prateek Nayak)

   - A series to optimize cfs_rq and sched_entity allocation for better
     data locality (Zecheng Li)

   - A preparatory series to change fair/cgroup scheduling to a single
     runqueue, without the final change (Peter Zijlstra)

   - Auto-manage ext/fair dl_server bandwidth (Andrea Righi)

   - Fix cpu_util runnable_avg arithmetic (Hongyan Xia)

   - Optimize update_tg_load_avg()'s rate-limiting code (Rik van Riel)

   - Allow account_cfs_rq_runtime() to throttle current hierarchy
     (K Prateek Nayak)

   - Update util_est after updating util_avg during dequeue, to fix the
     util signal update logic, which reduces signal noise (Vincent
     Guittot)

  Scheduler topology updates:

   - Allow multiple domains to claim sched_domain_shared (K Prateek
     Nayak)

   - Add parameter to split LLC (Peter Zijlstra)

  Core scheduler updates:

   - Use trace_call__() to save a static branch (Gabriele Monaco)

  Scheduler statistics updates:

   - Drop now-stale mul_u64_u64_div_u64() cputime over-approximation
     guard (Nicolas Pitre)

  Deadline scheduler updates:

   - Reject debugfs dl_server writes for offline CPUs (Andrea Righi)

   - Fix replenishment logic for non-deferred servers (Yuri Andriaccio)

  RT scheduling updates:

   - Turn RT_PUSH_IPI default off for non PREEMPT_RT (Steven Rostedt)

   - Update default bandwidth for real-time tasks to 1.0 (Yuri
     Andriaccio)

  Proxy scheduling updates:

   - A series to implement Optimized Donor Migration for Proxy Execution
     (John Stultz, Peter Zijlstra)

   - Various proxy scheduling cleanups and fixes (Peter Zijlstra,
     K Prateek Nayak)

  Misc fixes, improvements and cleanups by Aaron Lu, Andrea Righi,
  Zenghui Yu, Chen Yu, Guanyou.Chen, John Stultz, Shrikanth Hegde,
  Peter Zijlstra, Liang Luo and Yiyang Chen"

* tag 'sched-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (91 commits)
  sched/fair: Fix newidle vs core-sched
  sched/deadline: Use task_on_rq_migrating() helper
  sched/core: Combine separate 'else' and 'if' statements
  sched/fair: Fix cpu_util runnable_avg arithmetic
  sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime()
  sched/fair: Move the throttled tasks to a local list in tg_unthrottle_up()
  sched/fair: Call update_curr() before unthrottling the hierarchy
  sched/fair: Use throttled_csd_list for local unthrottle
  sched/fair: Convert cfs bandwidth throttling to use guards
  sched/fair: Allocate cfs_tg_state with percpu allocator
  sched/fair: Remove task_group->se pointer array
  sched/fair: Co-locate cfs_rq and sched_entity in cfs_tg_state
  sched: restore timer_slack_ns when resetting RT policy on fork
  MAINTAINERS: Fix spelling mistake in Peter's name
  sched: Simplify ttwu_runnable()
  sched/proxy: Remove superfluous clear_task_blocked_in()
  sched/proxy: Remove PROXY_WAKING
  sched/proxy: Switch proxy to use p->is_blocked
  sched/proxy: Only return migrate when needed
  sched: Be more strict about p->is_blocked
  ...

Merge tag 'locking-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip

2026-06-15T08:51:14+00:00

Pull locking updates from Ingo Molnar:
 "Futex updates:

   - Optimize futex hash bucket access patterns (Peter Zijlstra)

   - Large series to address the robust futex unlock race for real, by
     Thomas Gleixner:

      "The robust futex unlock mechanism is racy in respect to the
       clearing of the robust_list_head::list_op_pending pointer because
       unlock and clearing the pointer are not atomic.

       The race window is between the unlock and clearing the pending op
       pointer. If the task is forced to exit in this window, exit will
       access a potentially invalid pending op pointer when cleaning up
       the robust list.

       That happens if another task manages to unmap the object
       containing the lock before the cleanup, which results in an UAF.

       In the worst case this UAF can lead to memory corruption when
       unrelated content has been mapped to the same address by the time
       the access happens.

       User space can't solve this problem without help from the kernel.
       This series provides the kernel side infrastructure to help it
       along:

        1) Combined unlock, pointer clearing, wake-up for the
           contended case

        2) VDSO based unlock and pointer clearing helpers with a
           fix-up function in the kernel when user space was interrupted
           within the critical section.

      ... with help by André Almeida:

        - Add a note about robust list race condition (André Almeida)
        - Add self-tests for robust release operations (André Almeida)

  Context analysis updates:

   - Implement context analysis for 'struct rt_mutex'. (Bart Van Assche)
   - Bump required Clang version to 23 (Marco Elver)

  Guard infrastructure updates:

   - Series to remove NULL check from unconditional guards (Dmitry
     Ilvokhin)

  Lockdep updates:

   - Restore self-test migrate_disable() and sched_rt_mutex state on
     PREEMPT_RT (Karl Mehltretter)

  Membarriers updates:

   - Use per-CPU mutexes for targeted commands (Aniket Gattani)
   - Modernize membarrier_global_expedited with cleanup guards (Aniket
     Gattani)
   - Add rseq stress test for CFS throttle interactions (Aniket Gattani)

  percpu-rwsems updates:

   - Extract __percpu_up_read() to optimize inlining overhead (Dmitry
     Ilvokhin)

  Seqlocks updates:

   - Allow UBSAN_ALIGNMENT to fail optimizing (Heiko Carstens)

  Lock tracing:

   - Add contended_release tracepoint to sleepable locks such as
     mutexes, percpu-rwsems, rtmutexes, rwsems and semaphores (Dmitry
     Ilvokhin)

  MAINTAINERS updates:

   - MAINTAINERS: Add RUST [SYNC] entry (Boqun Feng)

  Misc updates and fixes by Randy Dunlap, YE WEI-HONG, Fabricio Parra,
  Dmitry Ilvokhin and Peter Zijlstra"

* tag 'locking-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (36 commits)
  locking: Add contended_release tracepoint to sleepable locks
  locking/percpu-rwsem: Extract __percpu_up_read()
  tracing/lock: Remove unnecessary linux/sched.h include
  futex: Optimize futex hash bucket access patterns
  rust: sync: completion: Mark inline complete_all and wait_for_completion
  MAINTAINERS: Add RUST [SYNC] entry
  cleanup: Specify nonnull argument index
  selftests: futex: Add tests for robust release operations
  Documentation: futex: Add a note about robust list race condition
  x86/vdso: Implement __vdso_futex_robust_try_unlock()
  x86/vdso: Prepare for robust futex unlock support
  futex: Provide infrastructure to plug the non contended robust futex unlock race
  futex: Add robust futex unlock IP range
  futex: Add support for unlocking robust futexes
  futex: Cleanup UAPI defines
  x86: Select ARCH_MEMORY_ORDER_TSO
  uaccess: Provide unsafe_atomic_store_release_user()
  futex: Provide UABI defines for robust list entry modifiers
  futex: Move futex related mm_struct data into a struct
  futex: Make futex_mm_init() void
  ...

Merge tag 'kernel-7.2-rc1.task_exec_state' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2026-06-14T21:30:58+00:00

Pull task_exec_state updates from Christian Brauner:
 "This introduces a new per-task task_exec_state structure and relocates
  the dumpable mode and the user namespace captured at execve() from
  mm_struct onto it. It stays attached to the task for its full
  lifetime.

  __ptrace_may_access() and several /proc owner and visibility checks
  need to consult two pieces of state for any observable task, including
  zombies that have already gone through exit_mm(): the dumpable mode
  and the user namespace captured at execve(). Both live on mm_struct
  today, which exit_mm() clears from the task long before the task is
  reaped. A reader that races with do_exit() observes task->mm == NULL
  and either fails the check or falls back to init_user_ns - which
  denies legitimate access to non-dumpable zombies that were running in
  a nested user namespace.

  mm_struct loses ->user_ns and the dumpability bits in ->flags.
  MMF_DUMPABLE_BITS is reserved so the MMF_DUMP_FILTER_* layout exposed
  via /proc//coredump_filter stays stable. task->user_dumpable and
  its exit_mm() snapshot are removed.

  task_exec_state is the privilege domain established by an execve().
  Within a thread group it is shared via refcount; across thread groups
  each task has its own:

   - CLONE_VM siblings (thread-group members, io_uring workers)
     refcount-share the parent's exec_state.

   - Non-CLONE_VM clones (fork(), vfork() without CLONE_VM) allocate a
     fresh exec_state inheriting the parent's dumpable mode and user_ns.

   - execve() in the child allocates a fresh instance and installs it
     under task_lock + exec_update_lock via task_exec_state_replace().

   - Credential changes (setresuid, capset, ...) and
     prctl(PR_SET_DUMPABLE) update dumpability on the current task's
     exec_state, i.e., on the thread group's shared instance.

  On top of this exec_mmap() no longer tears down the old mm while
  holding exec_update_lock for writing and cred_guard_mutex. Neither
  lock is needed for that: exec_update_lock only exists to make the mm
  swap atomic with the later commit_creds() and all its readers operate
  on the new mm; none looks at the detached old mm.

  The cost was real: __mmput() runs exit_mmap() over the entire old
  address space and can block in exit_aio() waiting for in-flight AIO,
  so execve() of a large process blocked ptrace_attach() and every
  exec_update_lock reader for the duration of the teardown.

  The old mm is now stashed in bprm->old_mm and released from
  setup_new_exec() after both locks are dropped, with a backstop in
  free_bprm() for the error paths"

* tag 'kernel-7.2-rc1.task_exec_state' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  exec: free the old mm outside the exec locks
  exec_state: relocate dumpable information
  ptrace: add ptracer_access_allowed()
  exec: introduce struct task_exec_state
  sched/coredump: introduce enum task_dumpable

futex: Make futex_mm_init() void

2026-06-03T09:38:49+00:00

Nothing fails there. Mop up the leftovers of the early version of this,
which did an allocation.

While at it clean up the stubs and the #ifdef comments to make the header
file readable.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://patch.msgid.link/20260602090535.356789395@kernel.org

sched: Add blocked_donor link to task for smarter mutex handoffs

2026-06-02T10:26:07+00:00

Add link to the task this task is proxying for, and use it so
the mutex owner can do an intelligent hand-off of the mutex to
the task that the owner is running on behalf.

[jstultz: This patch was split out from larger proxy patch]
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Juri Lelli 
Signed-off-by: Valentin Schneider 
Signed-off-by: Connor O'Brien 
Signed-off-by: John Stultz 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://patch.msgid.link/20260512025635.2840817-8-jstultz@google.com

kasan: skip HW tagging for all kernel thread stacks

2026-05-29T04:05:01+00:00

HW-tag KASAN never checks kernel stacks because stack pointers carry the
match-all tag, so setting/poisoning tags is pure overhead.

- Add __GFP_SKIP_KASAN to THREADINFO_GFP so every stack allocator that
  uses it skips tagging (fork path plus arch users)
- Add __GFP_SKIP_KASAN to GFP_VMAP_STACK for the fork-specific vmap
  stacks.
- When reusing cached vmap stacks, skip kasan_unpoison_range() if HW tags
  are enabled.

Software KASAN is unchanged; this only affects tag-based KASAN.

Link: https://lore.kernel.org/20260429102704.680174-3-dev.jain@arm.com
Signed-off-by: Muhammad Usama Anjum 
Signed-off-by: Dev Jain 
Reviewed-by: Catalin Marinas 
Cc: Arnd Bergmann 
Cc: Ben Segall 
Cc: David Hildenbrand (Arm) 
Cc: Dietmar Eggemann 
Cc: Ingo Molnar 
Cc: Juri Lelli 
Cc: Kees Cook 
Cc: K Prateek Nayak 
Cc: Liam Howlett 
Cc: Lorenzo Stoakes 
Cc: Mathieu Desnoyers 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Mike Rapoport 
Cc: Peter Zijlstra 
Cc: Ryan Roberts 
Cc: Steven Rostedt 
Cc: Suren Baghdasaryan 
Cc: "Uladzislau Rezki (Sony)" 
Cc: Valentin Schneider 
Cc: Vincent Guittot 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton

exec_state: relocate dumpable information

2026-05-26T09:02:01+00:00

The dumpable flag captured at execve() is consulted by
__ptrace_may_access() and several /proc owner / visibility checks.
It lives on mm_struct today, which exit_mm() clears from the task
long before the task itself is reaped.

exec_state is anchored to the execve() that established the current
privilege domain.  CLONE_VM siblings refcount-share the parent's
exec_state via copy_exec_state(); non-CLONE_VM clones allocate a
fresh exec_state inheriting the parent's dumpable mode and user_ns
reference via task_exec_state_copy().  execve() allocates a fresh
instance (via alloc_task_exec_state() in begin_new_exec()) and
installs it under task_lock + exec_update_lock with
task_exec_state_replace().  init_task uses a static instance.

The dumpable mode now lives on task->exec_state->dumpable.
task->mm->flags no longer carries dumpability; MMF_DUMPABLE_MASK is
removed, but MMF_DUMPABLE_BITS is reserved so MMF_DUMP_FILTER_* bit
positions remain stable for the /proc//coredump_filter ABI. The
task->user_dumpable cache bit and its assignment in exit_mm() are
removed; readers go through get_dumpable(task) directly.

coredump_params gains a snapshot field cprm.dumpable, populated from
get_dumpable(current) at vfs_coredump() entry, replacing the previous
__get_dumpable(cprm->mm_flags) consumers in fs/coredump.c and
fs/pidfs.c.

The user namespace recorded at execve() is consulted by
__ptrace_may_access() and by /proc/PID/* owner derivation. Move the
captured user_ns onto task_exec_state, which stays attached to the task
past exit_mm() and across exit_files().

bprm grows a user_ns field staged in bprm_mm_init() with the caller's
user_ns, narrowed by would_dump() to the closest privileged ancestor,
and consumed by exec_mmap() via alloc_task_exec_state(bprm->user_ns).
free_bprm() releases the staging reference.

mm_struct loses ->user_ns entirely.  Initializers in init-mm, efi_mm,
and the implicit one in mm_init()/dup_mm()/mm_alloc() are removed;
__mmdrop() drops the matching put_user_ns(). The kthread_use_mm()
WARN_ON_ONCE(!mm->user_ns) is no longer meaningful and goes too.

Reviewed-by: Jann Horn 
Link: https://patch.msgid.link/20260520-work-task_exec_state-v3-4-69f895bc1385@kernel.org
Signed-off-by: Christian Brauner (Amutable)