path: root/kernel
Age  Commit message  Author
2026-03-29Merge tag 'locking-urgent-2026-03-29' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull futex fixes from Ingo Molnar:

 - Tighten up the sys_futex_requeue() ABI a bit, to disallow dissimilar futex flags and potential UaF access (Peter Zijlstra)

 - Fix UaF between futex_key_to_node_opt() and vma_replace_policy() (Hao-Yu Yang)

 - Clear stale exiting pointer in futex_lock_pi() retry path, which triggered a warning (and potential misbehavior) in stress-testing (Davidlohr Bueso)

* tag 'locking-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Clear stale exiting pointer in futex_lock_pi() retry path
  futex: Fix UaF between futex_key_to_node_opt() and vma_replace_policy()
  futex: Require sys_futex_requeue() to have identical flags
2026-03-29bpf: Support struct btf_struct_meta via KF_IMPLICIT_ARGSIhor Solodrai
The following kfuncs currently accept a void *meta__ign argument:

  * bpf_obj_new_impl
  * bpf_obj_drop_impl
  * bpf_percpu_obj_new_impl
  * bpf_percpu_obj_drop_impl
  * bpf_refcount_acquire_impl
  * bpf_list_push_back_impl
  * bpf_list_push_front_impl
  * bpf_rbtree_add_impl

The __ign suffix is an indicator for the verifier to skip the argument in check_kfunc_args(). Then, in fixup_kfunc_call() the verifier may set the value of this argument to struct btf_struct_meta * kptr_struct_meta from insn_aux_data. BPF programs must pass a dummy NULL value when calling these kfuncs. Additionally, the list and rbtree _impl kfuncs also accept an implicit u64 argument, which doesn't require the __ign suffix because it's a scalar, and BPF programs explicitly pass 0.

Add new kfuncs with KF_IMPLICIT_ARGS [1] that correspond to each _impl kfunc accepting meta__ign. The existing _impl kfuncs remain unchanged for backwards compatibility. To support this, add "btf_struct_meta" to the list of recognized implicit argument types in resolve_btfids. Implement is_kfunc_arg_implicit() in the verifier, which determines implicit args by inspecting the non-_impl BTF prototype of the kfunc. Update the special_kfunc_list in the verifier and relevant checks to support both the old _impl and the new KF_IMPLICIT_ARGS variants of btf_struct_meta users.

[1] https://lore.kernel.org/bpf/20260120222638.3976562-1-ihor.solodrai@linux.dev/

Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260327203241.3365046-1-ihor.solodrai@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-28tracing: Remove spurious default precision from show_event_trigger/filter ↵David Laight
formats Change 2d8b7f9bf8e6e ("tracing: Have show_event_trigger/filter format a bit more in columns") added space padding to align the output. However it used ("%*.s", len, "") which requests the default precision. It doesn't matter here whether the userspace default (0) or kernel default (no precision) is used, but the format should be "%*s". Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20260326201824.3919-1-david.laight.linux@gmail.com Signed-off-by: David Laight <david.laight.linux@gmail.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
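The difference between the two formats is easy to demonstrate in userspace, where a bare '.' with no digits means precision 0. A minimal C sketch (helper names are illustrative, not from the kernel patch):

```c
#include <stdio.h>
#include <string.h>

/* "%*s": pad the string to 'width' columns. */
static void fmt_width(char *out, size_t n, int width, const char *s)
{
	snprintf(out, n, "%*s", width, s);
}

/* "%*.s": width plus an empty precision, which userspace printf treats
 * as precision 0 -- the string argument is truncated to nothing and the
 * result is pure padding. This is why "%*s" is the right spelling. */
static void fmt_width_dotted(char *out, size_t n, int width, const char *s)
{
	snprintf(out, n, "%*.s", width, s);
}
```

With the empty-string argument used by show_event_trigger/filter both spellings happen to produce identical padding, which is why the commit says it does not matter there; for any non-empty string they diverge.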
2026-03-28Merge tag 'trace-v7.0-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix potential deadlock in osnoise and hotplug

   The interface_lock can be taken by an osnoise thread, and the CPU shutdown logic of osnoise can wait for this thread to finish. But cpus_read_lock() can also be taken while holding the interface_lock. This produces a circular lock dependency and can cause a deadlock. Swap the ordering of cpus_read_lock() and the interface_lock so that interface_lock is taken within the cpus_read_lock() context, preventing this circular dependency.

 - Fix freeing of event triggers in early boot up

   If the same trigger is added on the kernel command line, the second one will fail to be applied and the trigger created will be freed. This calls into the deferred logic and creates a kernel thread to do the freeing. But the command-line logic runs before kernel threads can be created, and this leads to a NULL pointer dereference. Delay freeing event triggers until late init.

* tag 'trace-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Drain deferred trigger frees if kthread creation fails
  tracing: Fix potential deadlock in cpu hotplug with osnoise
2026-03-28tracing: Remove tracing_alloc_snapshot() when snapshot isn't definedSteven Rostedt
The function tracing_alloc_snapshot() is only used between trace.c and trace_snapshot.c. When snapshot isn't configured, it's not used at all. The stub function was defined as a global with no users and no prototype, causing build issues. Remove the function when snapshot isn't configured as nothing is calling it. Also remove the EXPORT_SYMBOL_GPL() that was associated with it as it's not used outside of the tracing subsystem, which also includes any modules. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260328101946.2c4ef4a5@robin Reported-by: Mark Brown <broonie@kernel.org> Closes: https://lore.kernel.org/all/acb-IuZ4vDkwwQLW@sirena.co.uk/ Fixes: bade44fe546212 ("tracing: Move snapshot code out of trace.c and into trace_snapshot.c") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-28posix-timers: Fix stale function name in commentZhan Xusheng
The comment in exit_itimers() still refers to itimer_delete(), which was replaced by posix_timer_delete(). Update the comment accordingly. Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260326142210.98632-1-zhanxusheng@xiaomi.com
2026-03-28futex: Clear stale exiting pointer in futex_lock_pi() retry pathDavidlohr Bueso
Fuzzing/stressing futexes triggered:

  WARNING: kernel/futex/core.c:825 at wait_for_owner_exiting+0x7a/0x80, CPU#11: futex_lock_pi_s/524

When futex_lock_pi_atomic() sees the owner is exiting, it returns -EBUSY and stores a refcounted task pointer in 'exiting'. After wait_for_owner_exiting() consumes that reference, the local pointer is never reset to nil. Upon a retry, if futex_lock_pi_atomic() returns a different error, the bogus pointer is passed to wait_for_owner_exiting():

  CPU0: futex_lock_pi(uaddr)  // acquires the PI futex
  CPU1: exit(); futex_cleanup_begin(); futex_state = EXITING;
  CPU2: futex_lock_pi(uaddr); futex_lock_pi_atomic(); attach_to_pi_owner()
        // observes EXITING
        *exiting = owner;  // takes ref
        return -EBUSY
        wait_for_owner_exiting(-EBUSY, owner); put_task_struct();  // drops ref
        // exiting still points to owner
        goto retry;
        futex_lock_pi_atomic(); lock_pi_update_atomic(); cmpxchg(uaddr)
        (another task) *uaddr ^= WAITERS  // whatever value changed
        return -EAGAIN;
        wait_for_owner_exiting(-EAGAIN, exiting)  // stale
        WARN_ON_ONCE(exiting)

Fix this by resetting upon retry, essentially aligning it with requeue_pi.

Fixes: 3ef240eaff36 ("futex: Prevent exit livelock") Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260326001759.4129680-1-dave@stgolabs.net
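The fixed pattern is simply "clear the local before every retry". A self-contained userspace sketch (names mirror the futex code, but the helpers, error handling, and refcounting here are simplified stand-ins, not the real implementation):

```c
#include <stddef.h>

struct task_struct { int refcount; };

static int warned;	/* stands in for WARN_ON_ONCE() firing */

static void wait_for_owner_exiting(int ret, struct task_struct *exiting)
{
	if (ret != -16 /* -EBUSY */) {
		if (exiting)		/* a stale pointer here is the bug */
			warned = 1;
		return;
	}
	if (exiting)
		exiting->refcount--;	/* put_task_struct() */
}

/* First attempt: owner observed exiting -> -EBUSY with a ref taken.
 * Second attempt: the futex value changed -> -EAGAIN. */
static int lock_attempt(int attempt, struct task_struct *owner,
			struct task_struct **exiting)
{
	if (attempt == 0) {
		owner->refcount++;
		*exiting = owner;
		return -16;		/* -EBUSY */
	}
	return -11;			/* -EAGAIN */
}

static int do_lock(struct task_struct *owner)
{
	struct task_struct *exiting = NULL;
	int ret, attempt = 0;

	for (;;) {
		exiting = NULL;		/* the fix: reset before each retry */
		ret = lock_attempt(attempt++, owner, &exiting);
		wait_for_owner_exiting(ret, exiting);
		if (ret != -16)
			return ret;
	}
}
```

Without the `exiting = NULL` line, the second iteration would hand the already-consumed pointer to wait_for_owner_exiting() and trip the warning.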
2026-03-28tracing: Drain deferred trigger frees if kthread creation failsWesley Atwell
Boot-time trigger registration can fail before the trigger-data cleanup kthread exists. Deferring those frees until late init is fine, but the post-boot fallback must still drain the deferred list if kthread creation never succeeds. Otherwise, boot-deferred nodes can accumulate on trigger_data_free_list, later frees fall back to synchronously freeing only the current object, and the older queued entries are leaked forever. To trigger this, add the following to the kernel command line: trace_event=sched_switch trace_trigger=sched_switch.traceon,sched_switch.traceon The second traceon trigger will fail and be freed. This triggers a NULL pointer dereference and crashes the kernel. Keep the deferred boot-time behavior, but when kthread creation fails, drain the whole queued list synchronously. Do the same in the late-init drain path so queued entries are not stranded there either. Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260324221326.1395799-3-atwellwea@gmail.com Fixes: 61d445af0a7c ("tracing: Add bulk garbage collection of freeing event_trigger_data") Signed-off-by: Wesley Atwell <atwellwea@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
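The fallback described above amounts to walking the whole deferred list instead of freeing only the newest object. A minimal userspace sketch, with illustrative names (not the actual tracing code):

```c
#include <stdlib.h>

/* Deferred-free list of trigger data, as a simple LIFO. */
struct trigger_data {
	struct trigger_data *next;
};

static struct trigger_data *free_list;
static int freed_count;

static void defer_free(struct trigger_data *d)
{
	d->next = free_list;
	free_list = d;
}

/* Fallback when the cleanup kthread could not be created: drain every
 * queued entry synchronously, so boot-deferred nodes are not leaked. */
static void drain_free_list(void)
{
	while (free_list) {
		struct trigger_data *d = free_list;

		free_list = d->next;
		free(d);
		freed_count++;
	}
}
```

Freeing only the head node in the fallback would leave the older queued entries stranded, which is exactly the leak the commit closes.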
2026-03-27fork: zero vmap stack using clear_pages() instead of memset()Linus Walleij
After the introduction of clear_pages() we exploit the fact that the process vm_area is allocated in contiguous pages to just clear them all in one swift operation. Link: https://lkml.kernel.org/r/20260224-mm-fork-clear-pages-v1-1-184c65a72d49@kernel.org Signed-off-by: Linus Walleij <linusw@kernel.org> Suggested-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/linux-mm/dpnwsp7dl4535rd7qmszanw6u5an2p74uxfex4dh53frpb7pu3@2bnjjavjrepe/ Suggested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Link: https://lore.kernel.org/20240311164638.2015063-7-pasha.tatashin@soleen.com Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Ankur Arora <ankur.a.arora@oracle.com> Cc: Ben Segall <bsegall@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27do_notify_parent: sanitize the valid_signal() checksOleg Nesterov
Now that kernel_clone() checks valid_signal(args->exit_signal), the "sig" argument of do_notify_parent() must always be valid or we have a bug. However, do_notify_parent() only checks that sig != -1 at the start, then it does another valid_signal() check before __send_signal_locked(). This is confusing. Change do_notify_parent() to WARN and return early if valid_signal(sig) is false. Link: https://lkml.kernel.org/r/abld-ilvMEZ7VgMw@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Deepanshu Kartikey <Kartikey406@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
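The shape of the change is a single guard at the top of the function: warn and bail out for any invalid signal, instead of a partial `sig != -1` check plus a second valid_signal() check later. A userspace sketch under assumed names (the 64-signal limit and the helper below are illustrative):

```c
#define NSIG_SKETCH 64

static int warn_count;	/* stands in for WARN_ON_ONCE() */

static int valid_signal(int sig)
{
	return sig > 0 && sig <= NSIG_SKETCH;
}

static int do_notify_parent_sketch(int sig)
{
	if (!valid_signal(sig)) {	/* WARN and return early */
		warn_count++;
		return -1;
	}
	return 0;			/* proceed to deliver the signal */
}
```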
2026-03-27watchdog/hardlockup: improve buddy system detection timelinessMayank Rungta
Currently, the buddy system only performs checks every 3rd sample with a 4-second interval. If a check window is missed, the next check occurs 12 seconds later, potentially delaying hard lockup detection for up to 24 seconds. Modify the buddy system to perform checks at every interval (4s). Introduce a missed-interrupt threshold to maintain the existing grace period while reducing the detection window to 8-12 seconds.

Best and worst case detection scenarios:

Before (12s check window):
 - Best case: Lockup occurs after first check but just before heartbeat interval. Detected in ~8s (8s till next check).
 - Worst case: Lockup occurs just after a check. Detected in ~24s (missed check + 12s till next check + 12s logic).

After (4s check window with threshold of 3):
 - Best case: Lockup occurs just before a check. Detected in ~8s (0s till 1st check + 4s till 2nd + 4s till 3rd).
 - Worst case: Lockup occurs just after a check. Detected in ~12s (4s till 1st check + 4s till 2nd + 4s till 3rd).

Link: https://lkml.kernel.org/r/20260312-hardlockup-watchdog-fixes-v2-4-45bd8a0cc7ed@google.com Signed-off-by: Mayank Rungta <mrungta@google.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Ian Rogers <irogers@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Li Huafei <lihuafei1@huawei.com> Cc: Max Kellermann <max.kellermann@ionos.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Wang Jinchao <wangjinchao600@gmail.com> Cc: Yunhui Cui <cuiyunhui@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27watchdog: update saved interrupts during checkMayank Rungta
Currently, arch_touch_nmi_watchdog() causes an early return that skips updating hrtimer_interrupts_saved. This leads to stale comparisons and delayed lockup detection. I found this issue because in our system the serial console is fairly chatty. For example, the 8250 console driver frequently calls touch_nmi_watchdog() via console_write(). If a CPU locks up after a timer interrupt but before the next watchdog check, we see the following sequence:

 * watchdog_hardlockup_check() saves counter (e.g., 1000)
 * Timer runs and updates the counter (1001)
 * touch_nmi_watchdog() is called
 * CPU locks up
 * 10s pass: check() notices touch, returns early, skips update
 * 10s pass: check() saves counter (1001)
 * 10s pass: check() finally detects lockup

This delays detection to 30 seconds. With this fix, we detect the lockup in 20 seconds. Link: https://lkml.kernel.org/r/20260312-hardlockup-watchdog-fixes-v2-2-45bd8a0cc7ed@google.com Signed-off-by: Mayank Rungta <mrungta@google.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Ian Rogers <irogers@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Li Huafei <lihuafei1@huawei.com> Cc: Max Kellermann <max.kellermann@ionos.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Wang Jinchao <wangjinchao600@gmail.com> Cc: Yunhui Cui <cuiyunhui@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
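The fix boils down to saving the interrupt count before honoring the "touched" flag, so the next comparison is against fresh data. A simplified userspace sketch (variable names mirror the kernel, but this is a stand-in, not the real per-CPU code):

```c
static unsigned long hrtimer_interrupts;
static unsigned long hrtimer_interrupts_saved;
static int watchdog_touched;

/* Returns 1 if a hard lockup is suspected, 0 otherwise. */
static int watchdog_hardlockup_check_sketch(void)
{
	unsigned long cur = hrtimer_interrupts;
	int progress = (cur != hrtimer_interrupts_saved);

	hrtimer_interrupts_saved = cur;	/* update unconditionally */

	if (watchdog_touched) {
		watchdog_touched = 0;	/* consume the touch */
		return 0;		/* no lockup reported this round */
	}
	return !progress;
}
```

Because the saved value is refreshed even on the touched path, a locked-up CPU is flagged on the very next check rather than one full period later.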
2026-03-27watchdog: return early in watchdog_hardlockup_check()Mayank Rungta
Patch series "watchdog/hardlockup: Improvements to hardlockup", v2. This series addresses limitations in the hardlockup detector implementations and updates the documentation to reflect actual behavior and recent changes. The changes are structured as follows:

Refactoring (Patch 1)
=====================
Patch 1 refactors watchdog_hardlockup_check() to return early if no lockup is detected. This reduces the indentation level of the main logic block, serving as a clean base for the subsequent changes.

Hardlockup Detection Improvements (Patches 2 & 4)
=================================================
The hardlockup detector logic relies on updating saved interrupt counts to determine if the CPU is making progress. Patch 2 ensures that the saved interrupt count is updated unconditionally before checking the "touched" flag. This prevents stale comparisons which can delay detection. This is a logic fix that ensures the detector remains accurate even when the watchdog is frequently touched. Patch 4 improves the Buddy detector's timeliness. The current checking interval (every 3rd sample) causes high variability in detection time (up to 24s). This patch changes the Buddy detector to check at every hrtimer interval (4s) with a missed-interrupt threshold of 3, narrowing the detection window to a consistent 8-12 second range.

Documentation Updates (Patches 3 & 5)
=====================================
The current documentation does not fully capture the variable nature of detection latency or the details of the Buddy system. Patch 3 removes the strict "10 seconds" definition of a hardlockup, which was misleading given the periodic nature of the detector. It adds a "Detection Overhead" section to the admin guide, using "Best Case" and "Worst Case" scenarios to illustrate that detection time can vary significantly (e.g., ~6s to ~20s). Patch 5 adds a dedicated section for the Buddy detector, which was previously undocumented. 
It details the mechanism, the new timing logic, and known limitations.

This patch (of 5): Invert the `is_hardlockup(cpu)` check in `watchdog_hardlockup_check()` to return early when a hardlockup is not detected. This flattens the main logic block, reducing the indentation level and making the code easier to read and maintain. This refactoring serves as a preparation patch for future hardlockup changes. Link: https://lkml.kernel.org/r/20260312-hardlockup-watchdog-fixes-v2-0-45bd8a0cc7ed@google.com Link: https://lkml.kernel.org/r/20260312-hardlockup-watchdog-fixes-v2-1-45bd8a0cc7ed@google.com Signed-off-by: Mayank Rungta <mrungta@google.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Ian Rogers <irogers@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Li Huafei <lihuafei1@huawei.com> Cc: Max Kellermann <max.kellermann@ionos.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Wang Jinchao <wangjinchao600@gmail.com> Cc: Yunhui Cui <cuiyunhui@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
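The guard-clause refactor described in this patch can be illustrated on a stand-in function; the real watchdog code gates considerably more reporting logic behind this check:

```c
/* Hypothetical stand-in for is_hardlockup(); CPU 3 is "locked up". */
static int is_hardlockup_stub(int cpu) { return cpu == 3; }

static int handled;

/* Before: the main logic sits one indentation level deep. */
static void check_nested(int cpu)
{
	if (is_hardlockup_stub(cpu)) {
		/* many lines of reporting logic, all nested ... */
		handled++;
	}
}

/* After: the common no-lockup case returns early, and the reporting
 * logic sits at the top indentation level. */
static void check_early_return(int cpu)
{
	if (!is_hardlockup_stub(cpu))
		return;
	/* reporting logic, now un-indented */
	handled++;
}
```

Both variants behave identically; only the shape of the code changes, which is why the commit calls it strictly preparatory.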
2026-03-27kernel/kexec: remove inclusion of crypto/hash.hEric Biggers
kexec_core.c does not do any cryptographic hashing, so the header crypto/hash.h is not needed at all. Link: https://lkml.kernel.org/r/20260314204144.44884-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27kernel/crash: remove inclusion of crypto/sha1.hEric Biggers
Several files related to kernel crash dumps include crypto/sha1.h but never use any of its functionality. Remove these includes so that these files don't unnecessarily come up in searches for which kernel code is still using the obsolete SHA-1 algorithm. Link: https://lkml.kernel.org/r/20260314204243.45001-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27hung_task: explicitly report I/O wait state in log outputAaron Tomlin
Currently, the hung task reporting mechanism indiscriminately labels all TASK_UNINTERRUPTIBLE (D) tasks as "blocked", irrespective of whether they are awaiting I/O completion or kernel locking primitives. This ambiguity compels system administrators to manually inspect stack traces to discern whether the delay stems from an I/O wait (typically indicative of hardware or filesystem anomalies) or software contention. Such detailed analysis is not always immediately accessible to system administrators or support engineers. To address this, this patch utilises the existing in_iowait field within struct task_struct to augment the failure report. If the task is blocked due to I/O (e.g., via io_schedule_prepare()), the log message is updated to explicitly state "blocked in I/O wait". Examples: - Standard Block: "INFO: task bash:123 blocked for more than 120 seconds". - I/O Block: "INFO: task dd:456 blocked in I/O wait for more than 120 seconds". Theoretically, concurrent executions of io_schedule_finish() could result in a race condition where the read flag does not precisely correlate with the subsequently printed backtrace. However, this limitation is deemed acceptable in practice. The entire reporting mechanism is inherently racy by design; nevertheless, it remains highly reliable in the vast majority of cases, particularly because it primarily captures protracted stalls. Consequently, introducing additional synchronisation to mitigate this minor inaccuracy would be entirely disproportionate to the situation. Link: https://lkml.kernel.org/r/20260303221324.4106917-1-atomlin@atomlin.com Signed-off-by: Aaron Tomlin <atomlin@atomlin.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Lance Yang <lance.yang@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
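The message selection reduces to keying off the task's in_iowait flag when formatting the report. A userspace sketch following the commit's example strings (the struct and field names below are simplified stand-ins for struct task_struct):

```c
#include <stdio.h>

struct task { const char *comm; int pid; int in_iowait; };

/* Emit either "blocked" or "blocked in I/O wait" per the flag. */
static void format_hung_msg(char *buf, size_t n, const struct task *t,
			    long timeout)
{
	snprintf(buf, n,
		 "INFO: task %s:%d blocked%s for more than %ld seconds.",
		 t->comm, t->pid,
		 t->in_iowait ? " in I/O wait" : "", timeout);
}
```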
2026-03-27hung_task: increment the global counter immediatelyPetr Mladek
A recent change made it possible to reset the global counter of hung tasks via the sysctl interface. A potential race with the regular check has been solved by updating the global counter only once at the end of the check. However, the hung task check can take a significant amount of time, particularly when task information is being dumped to slow serial consoles. Some users monitor this global counter to trigger immediate migration of critical containers. Delaying the increment until the full check completes postpones these high-priority rescue operations. Update the global counter as soon as a hung task is detected. Since the value is read asynchronously, a relaxed atomic operation is sufficient. Link: https://lkml.kernel.org/r/20260303203031.4097316-4-atomlin@atomlin.com Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Aaron Tomlin <atomlin@atomlin.com> Reported-by: Lance Yang <lance.yang@linux.dev> Closes: https://lore.kernel.org/r/f239e00f-4282-408d-b172-0f9885f4b01b@linux.dev Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Joel Granados <joel.granados@kernel.org> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
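In C11 terms, the "relaxed is sufficient" reasoning looks like the sketch below: the counter is only ever read asynchronously by a monitor, so the increment needs atomicity but no ordering against the surrounding dump. (Names are illustrative analogues of the kernel's atomic_long helpers, not the hung_task code itself.)

```c
#include <stdatomic.h>

static atomic_long detect_count;

/* Increment immediately on detection; relaxed ordering is enough
 * because nothing is published together with the counter value. */
static void note_hung_task(void)
{
	atomic_fetch_add_explicit(&detect_count, 1, memory_order_relaxed);
}

static long read_detect_count(void)
{
	return atomic_load_explicit(&detect_count, memory_order_relaxed);
}
```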
2026-03-27hung_task: enable runtime reset of hung_task_detect_countAaron Tomlin
Currently, the hung_task_detect_count sysctl provides a cumulative count of hung tasks since boot. In long-running, high-availability environments, this counter may lose its utility if it cannot be reset once an incident has been resolved. Furthermore, the previous implementation relied upon implicit ordering, which could not strictly guarantee that diagnostic metadata published by one CPU was visible to the panic logic on another. This patch introduces the capability to reset the detection count by writing "0" to the hung_task_detect_count sysctl. The proc_handler logic has been updated to validate this input and atomically reset the counter. The synchronisation of sysctl_hung_task_detect_count relies upon a transactional model to ensure the integrity of the detection counter against concurrent resets from userspace. The application of atomic_long_read_acquire() and atomic_long_cmpxchg_release() is correct and provides the following guarantees:

1. Prevention of Load-Store Reordering via Acquire Semantics

   By utilising atomic_long_read_acquire() to snapshot the counter before initiating the task traversal, we establish a strict memory barrier. This prevents the compiler or hardware from reordering the initial load to a point later in the scan. Without this "acquire" barrier, a delayed load could potentially read a "0" value resulting from a userspace reset that occurred mid-scan. This would lead to the subsequent cmpxchg succeeding erroneously, thereby overwriting the user's reset with stale increment data.

2. Atomicity of the "Commit" Phase via Release Semantics

   The atomic_long_cmpxchg_release() serves as the transaction's commit point. The "release" barrier ensures that all diagnostic recordings and task-state observations made during the scan are globally visible before the counter is incremented.

3. Race Condition Resolution

   This pairing effectively detects any "out-of-band" reset of the counter. 
If sysctl_hung_task_detect_count is modified via the procfs interface during the scan, the final cmpxchg will detect the discrepancy between the current value and the "acquire" snapshot. Consequently, the update will fail, ensuring that a reset command from the administrator is prioritised over a scan that may have been invalidated by that very reset. Link: https://lkml.kernel.org/r/20260303203031.4097316-3-atomlin@atomlin.com Signed-off-by: Aaron Tomlin <atomlin@atomlin.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Joel Granados <joel.granados@kernel.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Lance Yang <lance.yang@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
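The transactional model above can be sketched with C11 atomics: snapshot with acquire semantics, scan, then attempt a release cmpxchg; if userspace reset the counter mid-scan, the commit fails and the reset wins. This is a simplified single-threaded illustration (the `simulate_reset` flag stands in for a concurrent procfs write), not the kernel implementation:

```c
#include <stdatomic.h>

static atomic_long hung_count;

/* Userspace writing "0" to the sysctl. */
static void reset_from_sysctl(void)
{
	atomic_store_explicit(&hung_count, 0, memory_order_release);
}

/* Returns 1 if the scan's count was committed, 0 if a reset intervened. */
static int commit_scan(long this_round, int simulate_reset)
{
	long snap = atomic_load_explicit(&hung_count, memory_order_acquire);

	/* ... the task-list scan happens here; a concurrent reset may land ... */
	if (simulate_reset)
		reset_from_sysctl();

	long want = snap + this_round;

	/* Commit point: fails if the counter no longer matches the snapshot. */
	return atomic_compare_exchange_strong_explicit(&hung_count, &snap, want,
						       memory_order_release,
						       memory_order_relaxed);
}
```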
2026-03-27hung_task: refactor detection logic and atomicise detection countAaron Tomlin
Patch series "hung_task: Provide runtime reset interface for hung task detector", v9. This series introduces the ability to reset /proc/sys/kernel/hung_task_detect_count. Writing a "0" value to this file atomically resets the counter of detected hung tasks. This functionality provides system administrators with the means to clear the cumulative diagnostic history following incident resolution, thereby simplifying subsequent monitoring without necessitating a system restart. This patch (of 3): The check_hung_task() function currently conflates two distinct responsibilities: validating whether a task is hung and handling the subsequent reporting (printing warnings, triggering panics, or tracepoints). This patch refactors the logic by introducing hung_task_info(), a function dedicated solely to reporting. The actual detection check, task_is_hung(), is hoisted into the primary loop within check_hung_uninterruptible_tasks(). This separation clearly decouples the mechanism of detection from the policy of reporting. Furthermore, to facilitate future support for concurrent hung task detection, the global sysctl_hung_task_detect_count variable is converted from unsigned long to atomic_long_t. Consequently, the counting logic is updated to accumulate the number of hung tasks locally (this_round_count) during the iteration. The global counter is then updated atomically via atomic_long_cmpxchg_relaxed() once the loop concludes, rather than incrementally during the scan. These changes are strictly preparatory and introduce no functional change to the system's runtime behaviour. 
Link: https://lkml.kernel.org/r/20260303203031.4097316-1-atomlin@atomlin.com Link: https://lkml.kernel.org/r/20260303203031.4097316-2-atomlin@atomlin.com Signed-off-by: Aaron Tomlin <atomlin@atomlin.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Joel Granados <joel.granados@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27crash_dump: use sysfs_emit in sysfs show functionsThorsten Blum
Replace sprintf() with sysfs_emit() in sysfs show functions. sysfs_emit() is preferred for formatting sysfs output because it provides safer bounds checking. No functional changes. Link: https://lkml.kernel.org/r/20260301125106.911980-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Acked-by: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
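Conceptually, sysfs_emit() is a bounded formatter over the one-page sysfs buffer. A userspace approximation of the idea (the wrapper name and page-size constant are illustrative; the real kernel helper also sanity-checks the buffer alignment):

```c
#include <stdarg.h>
#include <stdio.h>

#define PAGE_SIZE_SKETCH 4096

/* Format into 'buf' with an explicit one-page bound and return the
 * number of characters written, instead of trusting sprintf() to fit. */
static int sysfs_emit_sketch(char *buf, const char *fmt, ...)
{
	va_list ap;
	int len;

	va_start(ap, fmt);
	len = vsnprintf(buf, PAGE_SIZE_SKETCH, fmt, ap);
	va_end(ap);

	return len < PAGE_SIZE_SKETCH ? len : PAGE_SIZE_SKETCH - 1;
}
```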
2026-03-27pid: document the PIDNS_ADDING checks in alloc_pid() and copy_process()Oleg Nesterov
Both copy_process() and alloc_pid() do the same PIDNS_ADDING check. The reasons for these checks, and the fact that both are necessary, are not immediately obvious. Add the comments. Link: https://lkml.kernel.org/r/aaGIRElc78U4Er42@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Adrian Reber <areber@redhat.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27pid: make sub-init creation retryableOleg Nesterov
Patch series "pid: make sub-init creation retryable". This patch (of 2): Currently we allow only one attempt to create init in a new namespace. If the first fork() fails after alloc_pid() succeeds, free_pid() clears PIDNS_ADDING and thus disables further PID allocations. Nowadays this looks like an unnecessary limitation. The original reason to handle "case PIDNS_ADDING" in free_pid() is gone, most probably after commit 69879c01a0c3 ("proc: Remove the now unnecessary internal mount of proc"). Change free_pid() to keep ns->pid_allocated == PIDNS_ADDING, and change alloc_pid() to reset the cursor early, right after taking pidmap_lock.

Test-case:

  #define _GNU_SOURCE
  #include <linux/sched.h>
  #include <sys/syscall.h>
  #include <sys/wait.h>
  #include <assert.h>
  #include <sched.h>
  #include <errno.h>

  int main(void)
  {
  	struct clone_args args = {
  		.exit_signal = SIGCHLD,
  		.flags = CLONE_PIDFD,
  		.pidfd = 0,
  	};
  	unsigned long pidfd;
  	int pid;

  	assert(unshare(CLONE_NEWPID) == 0);

  	pid = syscall(__NR_clone3, &args, sizeof(args));
  	assert(pid == -1 && errno == EFAULT);

  	args.pidfd = (unsigned long)&pidfd;
  	pid = syscall(__NR_clone3, &args, sizeof(args));
  	if (pid)
  		assert(pid > 0 && wait(NULL) == pid);
  	else
  		assert(getpid() == 1);

  	return 0;
  }

Link: https://lkml.kernel.org/r/aaGHu3ixbw9Y7kFj@redhat.com Link: https://lkml.kernel.org/r/aaGIHa7vGdwhEc_D@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Andrei Vagin <avagin@gmail.com> Cc: Adrian Reber <areber@redhat.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> 
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27crash_dump: fix typo in function name read_key_from_user_keyingThorsten Blum
The function read_key_from_user_keying() is missing an 'r' in its name. Fix the typo by renaming it to read_key_from_user_keyring(). Link: https://lkml.kernel.org/r/20260227230422.859423-1-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27crash_dump: remove redundant less-than-zero checkThorsten Blum
'key_count' is an 'unsigned int' and cannot be less than zero. Remove the redundant condition. Link: https://lkml.kernel.org/r/20260228085136.861971-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27fork: replace simple_strtoul with kstrtoul in coredump_filter_setupThorsten Blum
Replace simple_strtoul() with the recommended kstrtoul() for parsing the 'coredump_filter=' boot parameter. Check the return value of kstrtoul() and reject invalid values. This adds error handling while preserving behavior for existing values, and removes use of the deprecated simple_strtoul() helper. The current code silently sets 'default_dump_filter = 0' if parsing fails, instead of leaving the default value (MMF_DUMP_FILTER_DEFAULT) unchanged. Rename the static variable 'default_dump_filter' to 'coredump_filter' since it does not necessarily contain the default value and the current name can be misleading. Link: https://lkml.kernel.org/r/20251215142152.4082-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Ben Segall <bsegall@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
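The stricter parsing semantics carry over to userspace directly: base 0, the whole string must be consumed, and on any error the previous value is left untouched. A sketch under assumed names (the 0x23 default and `setup_filter` are illustrative stand-ins, not MMF_DUMP_FILTER_DEFAULT or the real boot-parameter handler):

```c
#include <errno.h>
#include <stdlib.h>

static unsigned long coredump_filter_sketch = 0x23;	/* stand-in default */

/* kstrtoul()-style parse: reject trailing junk and range errors,
 * keeping the existing value on failure. */
static int setup_filter(const char *s)
{
	char *end;
	unsigned long val;

	errno = 0;
	val = strtoul(s, &end, 0);
	if (errno || end == s || *end != '\0')
		return -1;		/* invalid: default stays in place */

	coredump_filter_sketch = val;
	return 0;
}
```

This mirrors the fix: a bad `coredump_filter=` string no longer silently zeroes the filter.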
2026-03-27complete_signal: kill always-true "core_state || !SIGNAL_GROUP_EXIT" checkOleg Nesterov

The "(signal->core_state || !(signal->flags & SIGNAL_GROUP_EXIT))" check in complete_signal() is not obvious at all, and in fact it only adds unnecessary confusion: this condition is always true. prepare_signal() does:

	if (signal->flags & SIGNAL_GROUP_EXIT) {
		if (signal->core_state)
			return sig == SIGKILL;
		/*
		 * The process is in the middle of dying, drop the signal.
		 */
		return false;
	}

This means that "!signal->core_state && (signal->flags & SIGNAL_GROUP_EXIT)" in complete_signal() is never possible. If SIGNAL_GROUP_EXIT is set, prepare_signal() can only return true if signal->core_state is not NULL. Link: https://lkml.kernel.org/r/aZsfkDhnqJ4s1oTs@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Kees Cook <kees@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27exit: kill unnecessary thread_group_leader() checks in exit_notify() and ↵Oleg Nesterov
do_notify_parent() thread_group_empty(tsk) is only possible if tsk is a group leader, and thread_group_empty() already does the thread_group_leader() check. So it makes no sense to check "thread_group_leader() && thread_group_empty()"; thread_group_empty() alone is enough. Link: https://lkml.kernel.org/r/aZsfeegKZPZZszJh@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27kernel/panic: mark init_taint_buf as __initdata and panic instead of warning ↵Rio
in alloc_taint_buf() However there's a convention of assuming that __init-time allocations cannot fail. Because if a kmalloc() were to fail at this time, the kernel is hopelessly messed up anyway. So simply panic() if that kmalloc failed, then make that 350-byte buffer __initdata. Link: https://lkml.kernel.org/r/20260223035914.4033-1-rioo.tsukatsukii@gmail.com Signed-off-by: Rio <rioo.tsukatsukii@gmail.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Wang Jinchao <wangjinchao600@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27kernel/panic: allocate taint string buffer dynamicallyRio
The buffer used to hold the taint string is statically allocated, which requires updating whenever a new taint flag is added. Instead, allocate the exact required length at boot once the allocator is available in an init function. The allocation sums the string lengths in taint_flags[], along with space for separators and formatting. print_tainted() is switched to use this dynamically allocated buffer. If allocation fails, print_tainted() warns about the failure and continues to use the original static buffer as a fallback. Link: https://lkml.kernel.org/r/20260222140804.22225-1-rioo.tsukatsukii@gmail.com Signed-off-by: Rio <rioo.tsukatsukii@gmail.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Wang Jinchao <wangjinchao600@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27kernel/panic: increase buffer size for verbose taint loggingRio
The verbose 'Tainted: ...' string in print_tainted_seq can total 327 characters while the buffer defined in _print_tainted is 320 bytes. Increase its size to 350 characters to hold all flags, along with some headroom. [akpm@linux-foundation.org: fix spello, add comment] Link: https://lkml.kernel.org/r/20260220151500.13585-1-rioo.tsukatsukii@gmail.com Signed-off-by: Rio <rioo.tsukatsukii@gmail.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Wang Jinchao <wangjinchao600@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27unshare: fix nsproxy leak in ksys_unshare() on set_cred_ucounts() failureMichal Grzedzicki
When set_cred_ucounts() fails in ksys_unshare() new_nsproxy is leaked. Let's call put_nsproxy() if that happens. Link: https://lkml.kernel.org/r/20260213193959.2556730-1-mge@meta.com Fixes: 905ae01c4ae2 ("Add a reference to ucounts for each cred") Signed-off-by: Michal Grzedzicki <mge@meta.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexey Gladkov (Intel) <legion@kernel.org> Cc: Ben Segall <bsegall@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27Merge tag 'sysctl-7.00-fixes-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl

Pull sysctl fix from Joel Granados:
 "Fix uninitialized variable error when writing to a sysctl bitmap

  Removed the possibility of returning an unjustified -EINVAL when writing to a sysctl bitmap"

* tag 'sysctl-7.00-fixes-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
  sysctl: fix uninitialized variable in proc_do_large_bitmap
2026-03-27tracing: Fix potential deadlock in cpu hotplug with osnoiseLuo Haiyang
The following sequence may lead to a deadlock in cpu hotplug:

	task1                       task2                         task3
	-----                       -----                         -----
	                            mutex_lock(&interface_lock)
	[CPU GOING OFFLINE]
	cpus_write_lock();
	osnoise_cpu_die();
	  kthread_stop(task3);
	    wait_for_completion();
	                                                          osnoise_sleep();
	                                                            mutex_lock(&interface_lock);
	                            cpus_read_lock();
	[DEAD LOCK]

Fix by swapping the order of cpus_read_lock() and mutex_lock(&interface_lock). Cc: stable@vger.kernel.org Cc: <mathieu.desnoyers@efficios.com> Cc: <zhang.run@zte.com.cn> Cc: <yang.tao172@zte.com.cn> Cc: <ran.xiaokai@zte.com.cn> Fixes: bce29ac9ce0bb ("trace: Add osnoise tracer") Link: https://patch.msgid.link/20260326141953414bVSj33dAYktqp9Oiyizq8@zte.com.cn Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Luo Haiyang <luo.haiyang@zte.com.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-27sched_ext: Document why built-in DSQs are unsupported sources in ↵Cheng-Yang Chou
scx_bpf_dsq_move_to_local() Add a comment explaining the design intent behind rejecting built-in DSQs (%SCX_DSQ_GLOBAL and %SCX_DSQ_LOCAL*) as sources. Local DSQs support reenqueueing but the BPF scheduler cannot directly iterate or move tasks from them. %SCX_DSQ_GLOBAL is similar but also doesn't support reenqueueing because it maps to multiple per-node DSQs, making the scope difficult to define. Also annotate @dsq_id to make clear it must be a user-created DSQ. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-27printk_ringbuffer: Add sanity check for 0-size dataJohn Ogness
get_data() has a sanity check for regular data blocks to ensure at least space for the ID exists. But a regular block should also have at least 1 byte of data (otherwise it would be data-less instead of regular). Expand the get_data() block size sanity check to additionally expect at least 1 byte of data. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Petr Mladek <pmladek@suse.com> Link: https://patch.msgid.link/20260326133809.8045-2-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2026-03-27printk_ringbuffer: Fix get_data() size sanity checkJohn Ogness
Commit cc3bad11de6e ("printk_ringbuffer: Fix check of valid data size when blk_lpos overflows") added sanity checking to get_data() to avoid returning data of illegal sizes (too large or too small). It uses the helper function data_check_size() for the check. However, data_check_size() expects the size of the data, not the size of the data block. get_data() is providing the size of the data block. This means that if the data size (text_buf_size) is at or near the maximum legal size: sizeof(prb_data_block) + text_buf_size == DATA_SIZE(data_ring) / 2 data_check_size() will report failure because it adds sizeof(prb_data_block) to the provided size. The sanity check in get_data() is counting the data block header twice. The result is that the reader fails to read the legal record. Since get_data() subtracts the data block header size before returning, move the sanity check to after the subtraction. Luckily printk() is not vulnerable to this problem because truncate_msg() limits printk-messages to 1/4 of the ringbuffer. Indeed, by adjusting the printk_ringbuffer KUnit test, which does not use printk() and its truncate_msg() check, it is easy to see that the reader fails and the WARN_ON is triggered. Fixes: cc3bad11de6e ("printk_ringbuffer: Fix check of valid data size when blk_lpos overflows") Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Petr Mladek <pmladek@suse.com> Link: https://patch.msgid.link/20260326133809.8045-1-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2026-03-27bpf: classify block device hooks appropriatelyChristian Brauner
A bunch of new hooks for managing block devices were added a while ago but they weren't actually appropriately classified.

* bpf_lsm_bdev_alloc() is called when the inode for the block device is allocated. This happens from a sleepable context so mark the function as sleepable. When this function is called the memory for the block device storage embedded into the inode is zeroed. That block device cannot be meaningfully referenced or interacted with at this point. So mark it as untrusted for now.

* bpf_lsm_bdev_free() is called when the inode for the block device is freed. A bunch of memory associated with the block device has already been freed and there are dangling pointers in there. So mark it as untrusted. It cannot be meaningfully referenced or interacted with anymore. It is also called from sb->s_op->free_inode(), which means it runs in rcu context (most of the time). So leave it as non-sleepable.

* bpf_lsm_bdev_setintegrity() is called when a dm-verity device is instantiated (glossing over details for simplicity of the commit message). The block device is very much alive so it remains a trusted hook. It's also called with device mapper's suspend lock held and so the hook is able to sleep, so mark it sleepable.

Signed-off-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20260326-work-bpf-bdev-v2-1-5e3c58963987@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-27PCI: Align head space betterIlpo Järvinen
When a bridge window contains big and small resource(s), the small resource(s) may not amount to half of the size of the big resource, which would allow calculate_head_align() to shrink the head alignment. This results in always placing the small resource(s) after the big resource. In general, it would be good to be able to place the small resource(s) before the big resource to achieve better utilization of the address space. In the cases where the large resource can only fit at the end of the window, it is even required. However, carrying the information over from pbus_size_mem() and calculate_head_align() to __pci_assign_resource() and pcibios_align_resource() is not easy with the current data structures. A somewhat hacky way to move the non-aligning tail part to the head is possible within pcibios_align_resource(). The free space between the start of the free space span and the aligned start address can be compared with the non-aligning remainder of the size. If the free space is larger than the remainder, placing the remainder before the start address is possible. This relocation should generally work, because PCI resources consist only of power-of-2-sized atoms. Various arch requirements may still need to override the relocation, so the relocation is only applied selectively in such cases. Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221205 Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Xifer <xiferdev@gmail.com> Link: https://patch.msgid.link/20260324165633.4583-10-ilpo.jarvinen@linux.intel.com
2026-03-27resource: Rename 'tmp' variable to 'full_avail'Ilpo Järvinen
__find_resource_space() has variable called 'tmp'. Rename it to 'full_avail' to better indicate its purpose. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Xifer <xiferdev@gmail.com> Link: https://patch.msgid.link/20260324165633.4583-4-ilpo.jarvinen@linux.intel.com
2026-03-27resource: Pass full extent of empty space to resource_alignf callbackIlpo Järvinen
__find_resource_space() calculates the full extent of empty space but only passes the aligned space to the resource_alignf callback. In some situations, the callback may choose to take advantage of the free space before the requested alignment. Pass the full extent of the calculated empty space to the resource_alignf callback as an additional parameter. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Xifer <xiferdev@gmail.com> Link: https://patch.msgid.link/20260324165633.4583-3-ilpo.jarvinen@linux.intel.com
2026-03-27Merge back earlier material related to system sleep for 7.1Rafael J. Wysocki
2026-03-27Merge branch 'dt-reserved-mem-cleanups' into dma-mapping-for-nextMarek Szyprowski
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
2026-03-26btf: Support kernel parsing of BTF with layout infoAlan Maguire
Validate layout if present, but because the kernel must be strict in what it accepts, reject BTF with unsupported kinds, even if they are in the layout information. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260326145444.2076244-8-alan.maguire@oracle.com
2026-03-26resource: Add __resource_contains_unbound() for internal contains checksIlpo Järvinen
__find_resource_space() currently uses resource_contains(), but for tentative resources that have not yet been inserted into the resource tree. As resource_contains() checks that IORESOURCE_UNSET is not set for either of the resources, the caller has to hack around this problem by clearing the IORESOURCE_UNSET flag (essentially lying to resource_contains()). Instead of the hack, introduce __resource_contains_unbound() for cases like this. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Xifer <xiferdev@gmail.com> Link: https://patch.msgid.link/20260324165633.4583-2-ilpo.jarvinen@linux.intel.com
2026-03-26Merge tag 'pm-7.0-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix two cpufreq issues, one in the core and one in the conservative governor, and two issues related to system sleep:

 - Restore the cpufreq core behavior changed inadvertently during the 6.19 development cycle to call cpufreq_frequency_table_cpuinfo() for cpufreq policies getting re-initialized, which ensures that policy->max and policy->cpuinfo_max_freq will be valid going forward (Viresh Kumar)

 - Adjust the cached requested frequency in the conservative cpufreq governor on policy limits changes to prevent it from becoming stale in some cases (Viresh Kumar)

 - Prevent pm_restore_gfp_mask() from triggering a WARN_ON() in some code paths in which it is legitimately called without invoking pm_restrict_gfp_mask() previously (Youngjun Park)

 - Update snapshot_write_finalize() to take trailing zero pages into account properly, which prevents user space restore from failing subsequently in some cases (Alberto Garcia)"

* tag 'pm-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: sleep: Drop spurious WARN_ON() from pm_restore_gfp_mask()
  PM: hibernate: Drain trailing zero pages on userspace restore
  cpufreq: conservative: Reset requested_freq on limits change
  cpufreq: Don't skip cpufreq_frequency_table_cpuinfo()
2026-03-26of: reserved_mem: replace CMA quirks by generic methodsMarek Szyprowski
Add optional reserved memory callbacks to perform region verification and early fixup, then move all CMA related code in of_reserved_mem.c to them. Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://patch.msgid.link/20260325090023.3175348-5-m.szyprowski@samsung.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
2026-03-26of: reserved_mem: switch to ops based OF_DECLARE()Marek Szyprowski
Move the init function from the OF_DECLARE() argument to the given reserved memory region ops structure and then pass that structure to the OF_DECLARE() initializer. This node_init callback is mandatory for the reserved mem driver. Such a change makes it possible in the future to add more functions called by the generic code before a given memory region is initialized and its rmem object is created. Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://patch.msgid.link/20260325090023.3175348-4-m.szyprowski@samsung.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
2026-03-26of: reserved_mem: use -ENODEV instead of -ENOENTMarek Szyprowski
When a given reserved memory region doesn't really support a given node, return -ENODEV instead of -ENOENT. Then fix the __reserved_mem_init_node() function to properly propagate error codes different from -ENODEV instead of silently ignoring them. Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://patch.msgid.link/20260325090023.3175348-3-m.szyprowski@samsung.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
2026-03-26of: reserved_mem: remove fdt node from the structureMarek Szyprowski
The FDT node is not needed for anything besides the initialization, so it can simply be passed as an argument to the reserved memory region init function. Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://patch.msgid.link/20260325090023.3175348-2-m.szyprowski@samsung.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
2026-03-26smp: Use system_percpu_wq instead of system_wqMarco Crivellari
When a caller enqueues a work item using schedule_delayed_work(), the used wq is "system_wq" (per-cpu wq), while queue_delayed_work() uses WORK_CPU_UNBOUND (used when no target CPU is specified). The same applies to schedule_work(), which uses system_wq, while queue_work() again makes use of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. Continue the effort to refactor workqueue APIs, which began with the introduction of new workqueues and a new alloc_workqueue() flag in: commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq") commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag") and switch smp_call_on_cpu() to use system_percpu_wq because system_wq is going away once the ongoing workqueue restructuring is done. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://patch.msgid.link/20251110170332.319314-1-marco.crivellari@suse.com