summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2026-02-23rseq: Clarify rseq registration rseq_size bound check commentMathieu Desnoyers
The rseq registration validates that the rseq_size argument is greater or equal to 32 (the original rseq size), but the comment associated with this check does not clearly state this. Clarify the comment to that effect. Fixes: ee3e3ac05c26 ("rseq: Introduce extensible rseq ABI") Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260220200642.1317826-2-mathieu.desnoyers@efficios.com
2026-02-23sched/core: Fix wakeup_preempt's next_class trackingPeter Zijlstra
Kernel test robot reported that tools/testing/selftests/kvm/hardware_disable_test was failing due to commit 704069649b5b ("sched/core: Rework sched_class::wakeup_preempt() and rq_modified_*()") It turns out there were two related problems that could lead to a missed preemption: - when hitting newidle balance from the idle thread, it would elevate rb->next_class from &idle_sched_class to &fair_sched_class, causing later wakeup_preempt() calls to not hit the sched_class_above() case, and not issue resched_curr(). Notably, this modification pattern should only lower the next_class, and never raise it. Create two new helper functions to wrap this. - when doing schedule_idle(), it was possible to miss (re)setting rq->next_class to &idle_sched_class, leading to the very same problem. Cc: Sean Christopherson <seanjc@google.com> Fixes: 704069649b5b ("sched/core: Rework sched_class::wakeup_preempt() and rq_modified_*()") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202602122157.4e861298-lkp@intel.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260218163329.GQ1395416@noisy.programming.kicks-ass.net
2026-02-23sched/fair: Fix lag clampPeter Zijlstra
Vincent reported that he was seeing undue lag clamping in a mixed slice workload. Implement the max_slice tracking as per the todo comment. Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy") Reported-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com> Link: https://patch.msgid.link/20250422101628.GA33555@noisy.programming.kicks-ass.net
2026-02-23sched/eevdf: Update se->vprot in reweight_entity()Wang Tao
In the EEVDF framework with Run-to-Parity protection, `se->vprot` is an independent variable defining the virtual protection timestamp. When `reweight_entity()` is called (e.g., via nice/renice), it performs the following actions to preserve Lag consistency: 1. Scales `se->vlag` based on the new weight. 2. Calls `place_entity()`, which recalculates `se->vruntime` based on the new weight and scaled lag. However, the current implementation fails to update `se->vprot`, leading to mismatches between the task's actual runtime and its expected duration. Fixes: 63304558ba5d ("sched/eevdf: Curb wakeup-preemption") Suggested-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Wang Tao <wangtao554@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com> Link: https://patch.msgid.link/20260120123113.3518950-1-wangtao554@huawei.com
2026-02-23sched/fair: Only set slice protection at pick timePeter Zijlstra
We should not (re)set slice protection in the sched_change pattern which calls put_prev_task() / set_next_task(). Fixes: 63304558ba5d ("sched/eevdf: Curb wakeup-preemption") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com> Link: https://patch.msgid.link/20260219080624.561421378%40infradead.org
2026-02-23sched/fair: Fix zero_vruntime trackingPeter Zijlstra
It turns out that zero_vruntime tracking is broken when there is but a single task running. Current update paths are through __{en,de}queue_entity(), and when there is but a single task, pick_next_task() will always return that one task, and put_prev_set_next_task() will end up in neither function. This can cause entity_key() to grow indefinitely large and cause overflows, leading to much pain and suffering. Furtermore, doing update_zero_vruntime() from __{de,en}queue_entity(), which are called from {set_next,put_prev}_entity() has problems because: - set_next_entity() calls __dequeue_entity() before it does cfs_rq->curr = se. This means the avg_vruntime() will see the removal but not current, missing the entity for accounting. - put_prev_entity() calls __enqueue_entity() before it does cfs_rq->curr = NULL. This means the avg_vruntime() will see the addition *and* current, leading to double accounting. Both cases are incorrect/inconsistent. Noting that avg_vruntime is already called on each {en,de}queue, remove the explicit avg_vruntime() calls (which removes an extra 64bit division for each {en,de}queue) and have avg_vruntime() update zero_vruntime itself. Additionally, have the tick call avg_vruntime() -- discarding the result, but for the side-effect of updating zero_vruntime. While there, optimize avg_vruntime() by noting that the average of one value is rather trivial to compute. Test case: # taskset -c -p 1 $$ # taskset -c 2 bash -c 'while :; do :; done&' # cat /sys/kernel/debug/sched/debug | awk '/^cpu#/ {P=0} /^cpu#2,/ {P=1} {if (P) print $0}' | grep -e zero_vruntime -e "^>" PRE: .zero_vruntime : 31316.407903 >R bash 487 50787.345112 E 50789.145972 2.800000 50780.298364 16 120 0.000000 0.000000 0.000000 / .zero_vruntime : 382548.253179 >R bash 487 427275.204288 E 427276.003584 2.800000 427268.157540 23 120 0.000000 0.000000 0.000000 / POST: .zero_vruntime : 17259.709467 >R bash 526 17259.709467 E 17262.509467 2.800000 16915.031624 9 120 0.000000 0.000000 0.000000 / .zero_vruntime : 18702.723356 >R bash 526 18702.723356 E 18705.523356 2.800000 18358.045513 9 120 0.000000 0.000000 0.000000 / Fixes: 79f3f9bedd14 ("sched/eevdf: Fix min_vruntime vs avg_vruntime") Reported-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com> Link: https://patch.msgid.link/20260219080624.438854780%40infradead.org
2026-02-23locking/mutex: Rename mutex_init_lockep()Davidlohr Bueso
Typo, this wants to be _lockdep(). Fixes: 51d7a054521d ("locking/mutex: Redo __mutex_init() to reduce generated code size") Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260217191512.1180151-2-dave@stgolabs.net
2026-02-23dma-mapping: avoid random addr value print out on error pathJiri Pirko
dma_addr is unitialized in dma_direct_map_phys() when swiotlb is forced and DMA_ATTR_MMIO is set which leads to random value print out in warning. Fix that by just returning DMA_MAPPING_ERROR. Fixes: e53d29f957b3 ("dma-mapping: convert dma_direct_*map_page to be phys_addr_t based") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260209153809.250835-2-jiri@resnulli.us
2026-02-22Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL usesKees Cook
Conversion performed via this Coccinelle script: // SPDX-License-Identifier: GPL-2.0-only // Options: --include-headers-for-types --all-includes --include-headers --keep-comments virtual patch @gfp depends on patch && !(file in "tools") && !(file in "samples")@ identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex, kzalloc_obj,kzalloc_objs,kzalloc_flex, kvmalloc_obj,kvmalloc_objs,kvmalloc_flex, kvzalloc_obj,kvzalloc_objs,kvzalloc_flex}; @@ ALLOC(... - , GFP_KERNEL ) $ make coccicheck MODE=patch COCCI=gfp.cocci Build and boot tested x86_64 with Fedora 42's GCC and Clang: Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert more 'alloc_obj' cases to default GFP_KERNEL argumentsLinus Torvalds
This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that far a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert 'alloc_flex' family to use the new default GFP_KERNEL argumentLinus Torvalds
This is the exact same thing as the 'alloc_obj()' version, only much smaller because there are a lot fewer users of the *alloc_flex() interface. As with alloc_obj() version, this was done entirely with mindless brute force, using the same script, except using 'flex' in the pattern rather than 'objs*'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-20Merge tag 'trace-v7.0-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix possible dereference of uninitialized pointer When validating the persistent ring buffer on boot up, if the first validation fails, a reference to "head_page" is performed in the error path, but it skips over the initialization of that variable. Move the initialization before the first validation check. - Fix use of event length in validation of persistent ring buffer On boot up, the persistent ring buffer is checked to see if it is valid by several methods. One being to walk all the events in the memory location to make sure they are all valid. The length of the event is used to move to the next event. This length is determined by the data in the buffer. If that length is corrupted, it could possibly make the next event to check located at a bad memory location. Validate the length field of the event when doing the event walk. - Fix function graph on archs that do not support use of ftrace_ops When an architecture defines HAVE_DYNAMIC_FTRACE_WITH_ARGS, it means that its function graph tracer uses the ftrace_ops of the function tracer to call its callbacks. This allows a single registered callback to be called directly instead of checking the callback's meta data's hash entries against the function being traced. For architectures that do not support this feature, it must always call the loop function that tests each registered callback (even if there's only one). The loop function tests each callback's meta data against its hash of functions and will call its callback if the function being traced is in its hash map. The issue was that there was no check against this and the direct function was being called even if the architecture didn't support it. This meant that if function tracing was enabled at the same time as a callback was registered with the function graph tracer, its callback would be called for every function that the function tracer also traced, even if the callback's meta data only wanted to be called back for a small subset of functions. Prevent the direct calling for those architectures that do not support it. - Fix references to trace_event_file for hist files The hist files used event_file_data() to get a reference to the associated trace_event_file the histogram was attached to. This would return a pointer even if the trace_event_file is about to be freed (via RCU). Instead it should use the event_file_file() helper that returns NULL if the trace_event_file is marked to be freed so that no new references are added to it. - Wake up hist poll readers when an event is being freed When polling on a hist file, the task is only awoken when a hist trigger is triggered. This means that if an event is being freed while there's a task waiting on its hist file, it will need to wait until the hist trigger occurs to wake it up and allow the freeing to happen. Note, the event will not be completely freed until all references are removed, and a hist poller keeps a reference. But it should still be woken when the event is being freed. * tag 'trace-v7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Wake up poll waiters for hist files when removing an event tracing: Fix checking of freed trace_event_file for hist files fgraph: Do not call handlers direct when not using ftrace_ops tracing: ring-buffer: Fix to check event length before using ring-buffer: Fix possible dereference of uninitialized pointer
2026-02-19tracing: Wake up poll waiters for hist files when removing an eventPetr Pavlu
The event_hist_poll() function attempts to verify whether an event file is being removed, but this check may not occur or could be unnecessarily delayed. This happens because hist_poll_wakeup() is currently invoked only from event_hist_trigger() when a hist command is triggered. If the event file is being removed, no associated hist command will be triggered and a waiter will be woken up only after an unrelated hist command is triggered. Fix the issue by adding a call to hist_poll_wakeup() in remove_event_file_dir() after setting the EVENT_FILE_FL_FREED flag. This ensures that a task polling on a hist file is woken up and receives EPOLLERR. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://patch.msgid.link/20260219162737.314231-3-petr.pavlu@suse.com Fixes: 1bd13edbbed6 ("tracing/hist: Add poll(POLLIN) support on hist file") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19tracing: Fix checking of freed trace_event_file for hist filesPetr Pavlu
The event_hist_open() and event_hist_poll() functions currently retrieve a trace_event_file pointer from a file struct by invoking event_file_data(), which simply returns file->f_inode->i_private. The functions then check if the pointer is NULL to determine whether the event is still valid. This approach is flawed because i_private is assigned when an eventfs inode is allocated and remains set throughout its lifetime. Instead, the code should call event_file_file(), which checks for EVENT_FILE_FL_FREED. Using the incorrect access function may result in the code potentially opening a hist file for an event that is being removed or becoming stuck while polling on this file. Correct the access method to event_file_file() in both functions. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Link: https://patch.msgid.link/20260219162737.314231-2-petr.pavlu@suse.com Fixes: 1bd13edbbed6 ("tracing/hist: Add poll(POLLIN) support on hist file") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19fgraph: Do not call handlers direct when not using ftrace_opsSteven Rostedt
The function graph tracer was modified to us the ftrace_ops of the function tracer. This simplified the code as well as allowed more features of the function graph tracer. Not all architectures were converted over as it required the implementation of HAVE_DYNAMIC_FTRACE_WITH_ARGS to implement. For those architectures, it still did it the old way where the function graph tracer handle was called by the function tracer trampoline. The handler then had to check the hash to see if the registered handlers wanted to be called by that function or not. In order to speed up the function graph tracer that used ftrace_ops, if only one callback was registered with function graph, it would call its function directly via a static call. Now, if the architecture does not support the use of using ftrace_ops and still has the ftrace function trampoline calling the function graph handler, then by doing a direct call it removes the check against the handler's hash (list of functions it wants callbacks to), and it may call that handler for functions that the handler did not request calls for. On 32bit x86, which does not support the ftrace_ops use with function graph tracer, it shows the issue: ~# trace-cmd start -p function -l schedule ~# trace-cmd show # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 2) * 11898.94 us | schedule(); 3) # 1783.041 us | schedule(); 1) | schedule() { ------------------------------------------ 1) bash-8369 => kworker-7669 ------------------------------------------ 1) | schedule() { ------------------------------------------ 1) kworker-7669 => bash-8369 ------------------------------------------ 1) + 97.004 us | } 1) | schedule() { [..] Now by starting the function tracer is another instance: ~# trace-cmd start -B foo -p function This causes the function graph tracer to trace all functions (because the function trace calls the function graph tracer for each on, and the function graph trace is doing a direct call): ~# trace-cmd show # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 1) 1.669 us | } /* preempt_count_sub */ 1) + 10.443 us | } /* _raw_spin_unlock_irqrestore */ 1) | tick_program_event() { 1) | clockevents_program_event() { 1) 1.044 us | ktime_get(); 1) 6.481 us | lapic_next_event(); 1) + 10.114 us | } 1) + 11.790 us | } 1) ! 181.223 us | } /* hrtimer_interrupt */ 1) ! 184.624 us | } /* __sysvec_apic_timer_interrupt */ 1) | irq_exit_rcu() { 1) 0.678 us | preempt_count_sub(); When it should still only be tracing the schedule() function. To fix this, add a macro FGRAPH_NO_DIRECT to be set to 0 when the architecture does not support function graph use of ftrace_ops, and set to 1 otherwise. Then use this macro to know to allow function graph tracer to call the handlers directly or not. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://patch.msgid.link/20260218104244.5f14dade@gandalf.local.home Fixes: cc60ee813b503 ("function_graph: Use static_call and branch to optimize entry function") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19tracing: ring-buffer: Fix to check event length before usingMasami Hiramatsu (Google)
Check the event length before adding it for accessing next index in rb_read_data_buffer(). Since this function is used for validating possibly broken ring buffers, the length of the event could be broken. In that case, the new event (e + len) can point a wrong address. To avoid invalid memory access at boot, check whether the length of each event is in the possible range before using it. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 5f3b6e839f3c ("ring-buffer: Validate boot range memory events") Link: https://patch.msgid.link/177123421541.142205.9414352170164678966.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19ring-buffer: Fix possible dereference of uninitialized pointerDaniil Dulov
There is a pointer head_page in rb_meta_validate_events() which is not initialized at the beginning of a function. This pointer can be dereferenced if there is a failure during reader page validation. In this case the control is passed to "invalid" label where the pointer is dereferenced in a loop. To fix the issue initialize orig_head and head_page before calling rb_validate_buffer. Found by Linux Verification Center (linuxtesting.org) with SVACE. Cc: stable@vger.kernel.org Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://patch.msgid.link/20260213100130.2013839-1-d.dulov@aladdin.ru Closes: https://lore.kernel.org/r/202406130130.JtTGRf7W-lkp@intel.com/ Fixes: 5f3b6e839f3c ("ring-buffer: Validate boot range memory events") Signed-off-by: Daniil Dulov <d.dulov@aladdin.ru> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf before 7.0-rc1Alexei Starovoitov
Cross-merge BPF and other fixes after downstream PR. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-19Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds
Pull bpf fixes from Alexei Starovoitov: - Fix invalid write loop logic in libbpf's bpf_linker__add_buf() (Amery Hung) - Fix a potential use-after-free of BTF object (Anton Protopopov) - Add feature detection to libbpf and avoid moving arena global variables on older kernels (Emil Tsalapatis) - Remove extern declaration of bpf_stream_vprintk() from libbpf headers (Ihor Solodrai) - Fix truncated netlink dumps in bpftool (Jakub Kicinski) - Fix map_kptr grace period wait in bpf selftests (Kumar Kartikeya Dwivedi) - Remove hexdump dependency while building bpf selftests (Matthieu Baerts) - Complete fsession support in BPF trampolines on riscv (Menglong Dong) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Remove hexdump dependency libbpf: Remove extern declaration of bpf_stream_vprintk() selftests/bpf: Use vmlinux.h in test_xdp_meta bpftool: Fix truncated netlink dumps libbpf: Delay feature gate check until object prepare time libbpf: Do not use PROG_TYPE_TRACEPOINT program for feature gating bpf: Add a map/btf from a fd array more consistently selftests/bpf: Fix map_kptr grace period wait selftests/bpf: enable fsession_test on riscv64 selftests/bpf: Adjust selftest due to function rename bpf, riscv: add fsession support for trampolines bpf: Fix a potential use-after-free of BTF object bpf, riscv: introduce emit_store_stack_imm64() for trampoline libbpf: Fix invalid write loop logic in bpf_linker__add_buf() libbpf: Add gating for arena globals relocation feature
2026-02-18Merge tag 'mm-nonmm-stable-2026-02-18-19-56' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more non-MM updates from Andrew Morton: - "two fixes in kho_populate()" fixes a couple of not-major issues in the kexec handover code (Ran Xiaokai) - misc singletons * tag 'mm-nonmm-stable-2026-02-18-19-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: lib/group_cpus: handle const qualifier from clusters allocation type kho: remove unnecessary WARN_ON(err) in kho_populate() kho: fix missing early_memunmap() call in kho_populate() scripts/gdb: implement x86_page_ops in mm.py objpool: fix the overestimation of object pooling metadata size selftests/memfd: use IPC semaphore instead of SIGSTOP/SIGCONT delayacct: fix build regression on accounting tool
2026-02-18Merge tag 'mm-stable-2026-02-18-19-48' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - "mm/vmscan: fix demotion targets checks in reclaim/demotion" fixes a couple of issues in the demotion code - pages were failed demotion and were finding themselves demoted into disallowed nodes (Bing Jiao) - "Remove XA_ZERO from error recovery of dup_mmap()" fixes a rare mapledtree race and performs a number of cleanups (Liam Howlett) - "mm: add bitmap VMA flag helpers and convert all mmap_prepare to use them" implements a lot of cleanups following on from the conversion of the VMA flags into a bitmap (Lorenzo Stoakes) - "support batch checking of references and unmapping for large folios" implements batching to greatly improve the performance of reclaiming clean file-backed large folios (Baolin Wang) - "selftests/mm: add memory failure selftests" does as claimed (Miaohe Lin) * tag 'mm-stable-2026-02-18-19-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (36 commits) mm/page_alloc: clear page->private in free_pages_prepare() selftests/mm: add memory failure dirty pagecache test selftests/mm: add memory failure clean pagecache test selftests/mm: add memory failure anonymous page test mm: rmap: support batched unmapping for file large folios arm64: mm: implement the architecture-specific clear_flush_young_ptes() arm64: mm: support batch clearing of the young flag for large folios arm64: mm: factor out the address and ptep alignment into a new helper mm: rmap: support batched checks of the references for large folios tools/testing/vma: add VMA userland tests for VMA flag functions tools/testing/vma: separate out vma_internal.h into logical headers tools/testing/vma: separate VMA userland tests into separate files mm: make vm_area_desc utilise vma_flags_t only mm: update all remaining mmap_prepare users to use vma_flags_t mm: update shmem_[kernel]_file_*() functions to use vma_flags_t mm: update secretmem to use VMA flags on mmap_prepare mm: update hugetlbfs to use VMA flags on mmap_prepare mm: add basic VMA flag operation helper functions tools: bitmap: add missing bitmap_[subset(), andnot()] mm: add mk_vma_flags() bitmap flag macro helper ...
2026-02-18Merge tag 'sysctl-7.00-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl Pull sysctl updates from Joel Granados: - Remove macros from proc handler converters Replace the proc converter macros with "regular" functions. Though it is more verbose than the macro version, it helps when debugging and better aligns with coding-style.rst. - General cleanup Remove superfluous ctl_table forward declarations. Const qualify the memory_allocation_profiling_sysctl and loadpin_sysctl_table arrays. Add missing kernel doc to proc_dointvec_conv. - Testing This series was run through sysctl selftests/kunit test suite in x86_64. And went into linux-next after rc4, giving it a good 3 weeks of testing * tag 'sysctl-7.00-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: sysctl: replace SYSCTL_INT_CONV_CUSTOM macro with functions sysctl: Replace unidirectional INT converter macros with functions sysctl: Add kernel doc to proc_douintvec_conv sysctl: Replace UINT converter macros with functions sysctl: Add CONFIG_PROC_SYSCTL guards for converter macros sysctl: clarify proc_douintvec_minmax doc sysctl: Return -ENOSYS from proc_douintvec_conv when CONFIG_PROC_SYSCTL=n sysctl: Remove unused ctl_table forward declarations loadpin: Implement custom proc_handler for enforce alloc_tag: move memory_allocation_profiling_sysctls into .rodata sysctl: Add missing kernel-doc for proc_dointvec_conv
2026-02-18unshare: fix unshare_fs() handlingAl Viro
There's an unpleasant corner case in unshare(2), when we have a CLONE_NEWNS in flags and current->fs hadn't been shared at all; in that case copy_mnt_ns() gets passed current->fs instead of a private copy, which causes interesting warts in proof of correctness] > I guess if private means fs->users == 1, the condition could still be true. Unfortunately, it's worse than just a convoluted proof of correctness. Consider the case when we have CLONE_NEWCGROUP in addition to CLONE_NEWNS (and current->fs->users == 1). We pass current->fs to copy_mnt_ns(), all right. Suppose it succeeds and flips current->fs->{pwd,root} to corresponding locations in the new namespace. Now we proceed to copy_cgroup_ns(), which fails (e.g. with -ENOMEM). We call put_mnt_ns() on the namespace created by copy_mnt_ns(), it's destroyed and its mount tree is dissolved, but... current->fs->root and current->fs->pwd are both left pointing to now detached mounts. They are pinning those, so it's not a UAF, but it leaves the calling process with unshare(2) failing with -ENOMEM _and_ leaving it with pwd and root on detached isolated mounts. The last part is clearly a bug. There is other fun related to that mess (races with pivot_root(), including the one between pivot_root() and fork(), of all things), but this one is easy to isolate and fix - treat CLONE_NEWNS as "allocate a new fs_struct even if it hadn't been shared in the first place". Sure, we could go for something like "if both CLONE_NEWNS *and* one of the things that might end up failing after copy_mnt_ns() call in create_new_namespaces() are set, force allocation of new fs_struct", but let's keep it simple - the cost of copy_fs_struct() is trivial. Another benefit is that copy_mnt_ns() with CLONE_NEWNS *always* gets a freshly allocated fs_struct, yet to be attached to anything. That seriously simplifies the analysis... FWIW, that bug had been there since the introduction of unshare(2) ;-/ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://patch.msgid.link/20260207082524.GE3183987@ZenIV Tested-by: Waiman Long <longman@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-02-17Merge tag 'spdx-7.0-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx Pull SPDX updates from Greg KH: "Here are two small changes that add some missing SPDX license lines to some core kernel files. These are: - adding SPDX license lines to kdb files - adding SPDX license lines to the remaining kernel/ files Both of these have been in linux-next for a while with no reported issues" * tag 'spdx-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx: kernel: debug: Add SPDX license ids to kdb files kernel: add SPDX-License-Identifier lines
2026-02-17Merge tag 'block-7.0-20260216' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull more block updates from Jens Axboe: - Fix partial IOVA mapping cleanup in error handling - Minor prep series ignoring discard return value, as the inline value is always known - Ensure BLK_FEAT_STABLE_WRITES is set for drbd - Fix leak of folio in bio_iov_iter_bounce_read() - Allow IOC_PR_READ_* for read-only open - Another debugfs deadlock fix - A few doc updates * tag 'block-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: blk-mq: use NOIO context to prevent deadlock during debugfs creation blk-stat: convert struct blk_stat_callback to kernel-doc block: fix enum descriptions kernel-doc block: update docs for bio and bvec_iter block: change return type to void nvmet: ignore discard return value md: ignore discard return value block: fix partial IOVA mapping cleanup in blk_rq_dma_map_iova block: fix folio leak in bio_iov_iter_bounce_read() block: allow IOC_PR_READ_* ioctls with BLK_OPEN_READ drbd: always set BLK_FEAT_STABLE_WRITES
2026-02-16Merge tag 'kernel-7.0-rc1.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull pidfs updates from Christian Brauner: - pid: introduce task_ppid_vnr() helper - pidfs: convert rb-tree to rhashtable Mateusz reported performance penalties during task creation because pidfs uses pidmap_lock to add elements into the rbtree. Switch to an rhashtable to have separate fine-grained locking and to decouple from pidmap_lock moving all heavy manipulations outside of it Also move inode allocation outside of pidmap_lock. With this there's nothing happening for pidfs under pidmap_lock - pid: reorder fields in pid_namespace to reduce false sharing - Revert "pid: make __task_pid_nr_ns(ns => NULL) safe for zombie callers" - ipc: Add SPDX license id to mqueue.c * tag 'kernel-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: pid: introduce task_ppid_vnr() helper pidfs: implement ino allocation without the pidmap lock Revert "pid: make __task_pid_nr_ns(ns => NULL) safe for zombie callers" pid: reorder fields in pid_namespace to reduce false sharing pidfs: convert rb-tree to rhashtable ipc: Add SPDX license id to mqueue.c
2026-02-16blk-mq: use NOIO context to prevent deadlock during debugfs creationYu Kuai
Creating debugfs entries can trigger fs reclaim, which can enter back into the block layer request_queue. This can cause deadlock if the queue is frozen. Previously, a WARN_ON_ONCE check was used in debugfs_create_files() to detect this condition, but it was racy since the queue can be frozen from another context at any time. Introduce blk_debugfs_lock()/blk_debugfs_unlock() helpers that combine the debugfs_mutex with memalloc_noio_save()/restore() to prevent fs reclaim from triggering block I/O. Also add blk_debugfs_lock_nomemsave() and blk_debugfs_unlock_nomemrestore() variants for callers that don't need NOIO protection (e.g., debugfs removal or read-only operations). Replace all raw debugfs_mutex lock/unlock pairs with these helpers, using the _nomemsave/_nomemrestore variants where appropriate. Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/all/CAHj4cs9gNKEYAPagD9JADfO5UH+OiCr4P7OO2wjpfOYeM-RV=A@mail.gmail.com/ Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Closes: https://lore.kernel.org/all/aYWQR7CtYdk3K39g@shinmob/ Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-16Merge tag 'probes-v7.0' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull kprobes updates from Masami Hiramatsu: - Use a dedicated kernel thread to optimize the kprobes instead of using workqueue thread. Since the kprobe optimizer waits a long time for synchronize_rcu_task(), it can block other workers in the same queue if it uses a workqueue. - kprobe-events: return immediately if no new probe events are specified on the kernel command line at boot time. This shortens the kernel boot time. - When a kprobe is fully removed from the kernel code, retry optimizing another kprobe which is blocked by that kprobe. * tag 'probes-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: kprobes: Use dedicated kthread for kprobe optimizer tracing: kprobe-event: Return directly when trace kprobes is empty kprobes: retry blocked optprobe in do_free_cleaned_kprobes
2026-02-14printk, vt, fbcon: Remove console_conditional_schedule()Sebastian Andrzej Siewior
do_con_write(), fbcon_redraw.*() invoke console_conditional_schedule() which is a conditional scheduling point based on printk's internal variables console_may_schedule. It may only be used if the console lock is acquired for instance via console_lock() or console_trylock(). Prinkt sets the internal variable to 1 (and allows to schedule) if the console lock has been acquired via console_lock(). The trylock does not allow it. The console_conditional_schedule() invocation in do_con_write() is invoked shortly before console_unlock(). The console_conditional_schedule() invocation in fbcon_redraw.*() original from fbcon_scroll() / vt's con_scroll() which originate from a line feed. In console_unlock() the variable is set to 0 (forbids to schedule) and it tries to schedule while making progress printing. This is brand new compared to when console_conditional_schedule() was added in v2.4.9.11. In v2.6.38-rc3, console_unlock() (started its existence) iterated over all consoles and flushed them with disabled interrupts. A scheduling attempt here was not possible, it relied that a long print scheduled before console_unlock(). Since commit 8d91f8b15361d ("printk: do cond_resched() between lines while outputting to consoles"), which appeared in v4.5-rc1, console_unlock() attempts to schedule if it was allowed to schedule while during console_lock(). Each record is idealy one line so after every line feed. This console_conditional_schedule() is also only relevant on PREEMPT_NONE and PREEMPT_VOLUNTARY builds. In other configurations cond_resched() becomes a nop and has no impact. I'm bringing this all up just proof that it is not required anymore. It becomes a problem on a PREEMPT_RT build with debug code enabled because that might_sleep() in cond_resched() remains and triggers a warnings. This is due to legacy_kthread_func-> console_flush_one_record -> vt_console_print-> lf -> con_scroll -> fbcon_scroll and vt_console_print() acquires a spinlock_t which does not allow a voluntary schedule. There is no need to fb_scroll() to schedule since console_flush_one_record() attempts to schedule after each line. !PREEMPT_RT is not affected because the legacy printing thread is only enabled on PREEMPT_RT builds. Therefore I suggest to remove console_conditional_schedule(). Cc: Simona Vetter <simona@ffwll.ch> Cc: Helge Deller <deller@gmx.de> Cc: linux-fbdev@vger.kernel.org Cc: dri-devel@lists.freedesktop.org Fixes: 5f53ca3ff83b4 ("printk: Implement legacy printer kthread for PREEMPT_RT") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Petr Mladek <pmladek@suse.com> # from printk() POV Signed-off-by: Helge Deller <deller@gmx.de>
2026-02-13Merge tag 'trace-v7.0' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: "User visible changes: - Add an entry into MAINTAINERS file for RUST versions of code There's now RUST code for tracing and static branches. To differentiate that code from the C code, add entries in for the RUST version (with "[RUST]" around it) so that the right maintainers get notified on changes. - New bitmask-list option added to tracefs When this is set, bitmasks in trace event are not displayed as hex numbers, but instead as lists: e.g. 0-5,7,9 instead of 0000015f - New show_event_filters file in tracefs Instead of having to search all events/*/*/filter for any active filters enabled in the trace instance, the file show_event_filters will list them so that there's only one file that needs to be examined to see if any filters are active. - New show_event_triggers file in tracefs Instead of having to search all events/*/*/trigger for any active triggers enabled in the trace instance, the file show_event_triggers will list them so that there's only one file that needs to be examined to see if any triggers are active. - Have traceoff_on_warning disable trace pintk buffer too Recently recording of trace_printk() could go to other trace instances instead of the top level instance. But if traceoff_on_warning triggers, it doesn't stop the buffer with trace_printk() and that data can easily be lost by being overwritten. Have traceoff_on_warning also disable the instance that has trace_printk() being written to it. - Update the hist_debug file to show what function the field uses When CONFIG_HIST_TRIGGERS_DEBUG is enabled, a hist_debug file exists for every event. This displays the internal data of any histogram enabled for that event. But it is lacking the function that is called to process one of its fields. This is very useful information that was missing when debugging histograms. - Up the histogram stack size from 16 to 31 Stack traces can be used as keys for event histograms. Currently the size of the stack that is stored is limited to just 16 entries. But the storage space in the histogram is 256 bytes, meaning that it can store up to 31 entries (plus one for the count of entries). Instead of letting that space go to waste, up the limit from 16 to 31. This makes the keys much more useful. - Fix permissions of per CPU file buffer_size_kb The per CPU file of buffer_size_kb was incorrectly set to read only in a previous cleanup. It should be writable. - Reset "last_boot_info" if the persistent buffer is cleared The last_boot_info shows address information of a persistent ring buffer if it contains data from a previous boot. It is cleared when recording starts again, but it is not cleared when the buffer is reset. The data is useless after a reset so clear it on reset too. Internal changes: - A change was made to allow tracepoint callbacks to have preemption enabled, and instead be protected by SRCU. This required some updates to the callbacks for perf and BPF. perf needed to disable preemption directly in its callback because it expects preemption disabled in the later code. BPF needed to disable migration, as its code expects to run completely on the same CPU. - Have irq_work wake up other CPU if current CPU is "isolated" When there's a waiter waiting on ring buffer data and a new event happens, an irq work is triggered to wake up that waiter. This is noisy on isolated CPUs (running NO_HZ_FULL). Trigger an IPI to a house keeping CPU instead. - Use proper free of trigger_data instead of open coding it in. - Remove redundant call of event_trigger_reset_filter() It was called immediately in a function that was called right after it. - Workqueue cleanups - Report errors if tracing_update_buffers() were to fail. - Make the enum update workqueue generic for other parts of tracing On boot up, a work queue is created to convert enum names into their numbers in the trace event format files. This work queue can also be used for other aspects of tracing that takes some time and shouldn't be called by the init call code. The blk_trace initialization takes a bit of time. Have the initialization code moved to the new tracing generic work queue function. - Skip kprobe boot event creation call if there's no kprobes defined on cmdline The kprobe initialization to set up kprobes if they are defined on the cmdline requires taking the event_mutex lock. This can be held by other tracing code doing initialization for a long time. Since kprobes added to the kernel command line need to be setup immediately, as they may be tracing early initialization code, they cannot be postponed in a work queue and must be setup in the initcall code. If there's no kprobe on the kernel cmdline, there's no reason to take the mutex and slow down the boot up code waiting to get the lock only to find out there's nothing to do. Simply exit out early if there's no kprobes on the kernel cmdline. If there are kprobes on the cmdline, then someone cares more about tracing over the speed of boot up. - Clean up the trigger code a bit - Move code out of trace.c and into their own files trace.c is now over 11,000 lines of code and has become more difficult to maintain. Start splitting it up so that related code is in their own files. Move all the trace_printk() related code into trace_printk.c. Move the __always_inline stack functions into trace.h. Move the pid filtering code into a new trace_pid.c file. - Better define the max latency and snapshot code The latency tracers have a "max latency" buffer that is a copy of the main buffer and gets swapped with it when a new high latency is detected. This keeps the trace up to the highest latency around where this max_latency buffer is never written to. It is only used to save the last max latency trace. A while ago a snapshot feature was added to tracefs to allow user space to perform the same logic. It could also enable events to trigger a "snapshot" if one of their fields hit a new high. This was built on top of the latency max_latency buffer logic. Because snapshots came later, they were dependent on the latency tracers to be enabled. In reality, the latency tracers depend on the snapshot code and not the other way around. It was just that they came first. Restructure the code and the kconfigs to have the latency tracers depend on snapshot code instead. This actually simplifies the logic a bit and allows to disable more when the latency tracers are not defined and the snapshot code is. - Fix a "false sharing" in the hwlat tracer code The loop to search for latency in hardware was using a variable that could be changed by user space for each sample. If the user change this variable, it could cause a bus contention, and reading that variable can show up as a large latency in the trace causing a false positive. Read this variable at the start of the sample with a READ_ONCE() into a local variable and keep the code from sharing cache lines with readers. - Fix function graph tracer static branch optimization code When only one tracer is defined for function graph tracing, it uses a static branch to call that tracer directly. When another tracer is added, it goes into loop logic to call all the registered callbacks. The code was incorrect when going back to one tracer and never re-enabled the static branch again to do the optimization code. - And other small fixes and cleanups" * tag 'trace-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (46 commits) function_graph: Restore direct mode when callbacks drop to one tracing: Fix indentation of return statement in print_trace_fmt() tracing: Reset last_boot_info if ring buffer is reset tracing: Fix to set write permission to per-cpu buffer_size_kb tracing: Fix false sharing in hwlat get_sample() tracing: Move d_max_latency out of CONFIG_FSNOTIFY protection tracing: Better separate SNAPSHOT and MAX_TRACE options tracing: Add tracer_uses_snapshot() helper to remove #ifdefs tracing: Rename trace_array field max_buffer to snapshot_buffer tracing: Move pid filtering into trace_pid.c tracing: Move trace_printk functions out of trace.c and into trace_printk.c tracing: Use system_state in trace_printk_init_buffers() tracing: Have trace_printk functions use flags instead of using global_trace tracing: Make tracing_update_buffers() take NULL for global_trace tracing: Make printk_trace global for tracing system tracing: Move ftrace_trace_stack() out of trace.c and into trace.h tracing: Move __trace_buffer_{un}lock_*() functions to trace.h tracing: Make tracing_selftest_running global to the tracing subsystem tracing: Make tracing_disabled global for tracing system tracing: Clean up use of trace_create_maxlat_file() ...
2026-02-13Merge tag 'dma-mapping-7.0-2026-02-13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping update from Marek Szyprowski: "A small code cleanup for the DMA-mapping subsystem: removal of unused hooks (Robin Murphy)" * tag 'dma-mapping-7.0-2026-02-13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-mapping: Remove dma_mark_clean (again)
2026-02-13bpf: rename bpf_reg_state->off to bpf_reg_state->deltaEduard Zingerman
This field is now used only for linked scalar registers tracking. Rename it to 'delta' to better describe it's purpose: constant delta between "linked" scalars with the same ID. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260212-ptrs-off-migration-v2-4-00820e4d3438@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-13bpf: use reg->var_off instead of reg->off for pointersEduard Zingerman
This commit consolidates static and varying pointer offset tracking logic. All offsets are now represented solely using `.var_off` and min/max fields. The reasons are twofold: - This simplifies pointer tracking code, as each relevant function needs to check the `.var_off` field anyway. - It makes it easier to widen pointer registers for the purpose of loop convergence checks, by forgoing the `regsafe()` logic demanding `.off` fields to be identical. The changes are spread across many functions and are hard to group into smaller patches. Some of the logical changes include: - Checks in __check_ptr_off_reg() are reordered so that the tnum_is_const() check is done before operating on reg->var_off.value. - check_packet_access() now uses check_mem_region_access() to handle possible 'off' overflow cases. - In check_helper_mem_access() utility functions like check_packet_access() are now called with 'off=0', as these utility functions now account for the complete register offset range. - In check_reg_type() a call to __check_ptr_off_reg() is added before a call to btf_struct_ids_match(). This prevents btf_struct_ids_match() from potentially working on non-constant reg->var_off.value. - regsafe() is relaxed to avoid comparing '.off' field for pointers. As a precaution, the changes are verified in [1] by adding a pass checking that no pointer has non-zero '.off' field on each do_check_insn() iteration. [1] https://github.com/eddyz87/bpf/tree/ptrs-off-migration Notable selftests changes: - `.var_off` value changed because it now combines static and varying offsets. Affected tests: - linked_list/incorrect_node_var_off - linked_list/incorrect_head_var_off2 - verifier_align/packet_variable_offset - Overflowing `smax_value` bound leads to a pointer with big negative or positive offset to be rejected immediately (previously overflowing `rX += const` instruction updated `.off` field avoiding the overflow). Affected tests: - verifier_align/dubious_pointer_arithmetic - verifier_bounds/var_off_insn_off_test1 - Invalid access to packet now reports full offset inside a packet. Affected tests: - verifier_direct_packet_access/test23_x_pkt_ptr_4 - A change in check_mem_region_access() behavior: when register `.smin_value` is negative, it reports "rX min value is negative..." before calling into __check_mem_access() which reports "invalid access to ...". In the tests below, the `.off` field was negative, while `.smin_value` remained positive. This is no longer the case after the changes in this commit. Affected tests: - verifier_gotox/jump_table_invalid_mem_acceess_neg - verifier_helper_packet_access/test15_cls_helper_fail_sub - verifier_helper_value_access/imm_out_of_bound_2 - verifier_helper_value_access/reg_out_of_bound_2 - verifier_meta_access/meta_access_test2 - verifier_value_ptr_arith/known_scalar_from_different_maps - lower_oob_arith_test_1 - value_ptr_known_scalar_3 - access_value_ptr_known_scalar - Usage of check_mem_region_access() instead of __check_mem_access() in check_packet_access() changes the reported message from "rX offset is outside ..." to "rX min/max value is outside ...". Affected tests: - verifier_xdp_direct_packet_access/* - In check_func_arg_reg_off() the check for zero offset now operates on `.var_off` field instead of `.off` field. For tests where the pattern looks like `kfunc(reg_with_var_off, ...)`, this changes the reported error: - previously the error "variable ... access ... disallowed" was reported by __check_ptr_off_reg(); - now "R1 must have zero offset ..." is reported by check_func_arg_reg_off() itself. Affected tests: - verifier/calls.c "calls: invalid kfunc call: PTR_TO_BTF_ID with variable offset" Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260212-ptrs-off-migration-v2-2-00820e4d3438@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-13bpf: split check_reg_sane_offset() in two partsEduard Zingerman
check_reg_sane_offset() is used when verifying operations like: dst_reg += src_reg ^ ^ | '-------- scalar '------------------- pointer To verify range for both dst_reg and src_reg. Split it in two parts: - one to check a pointer offset - another to check scalar offset This would be useful for further refactoring. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260212-ptrs-off-migration-v2-1-00820e4d3438@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-13bpf: Add a map/btf from a fd array more consistentlyAnton Protopopov
The add_fd_from_fd_array() function takes a file descriptor as a parameter and tries to add either map or btf to the corresponding list of used objects. As was reported by Dan Carpenter, since the commit c81e4322acf0 ("bpf: Fix a potential use-after-free of BTF object"), the fdget() is called twice on the file descriptor, and thus userspace, potentially, can replace the file pointed to by the file descriptor in between the two calls. On practice, this shouldn't break anything on the kernel side, but for consistency fix the code such that only one fdget() is executed. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/aY689z7gHNv8rgVO@stanley.mountain/ Fixes: ccd2d799ed44 ("bpf: Fix a potential use-after-free of BTF object") Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20260213212949.759321-1-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-13bpf: Fix a potential use-after-free of BTF objectAnton Protopopov
Refcounting in the check_pseudo_btf_id() function is incorrect: the __check_pseudo_btf_id() function might get called with a zero refcounted btf. Fix this, and patch related code accordingly. v3: rephrase a comment (AI) v2: fix a refcount leak introduced in v1 (AI) Reported-by: syzbot+5a0f1995634f7c1dadbf@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=5a0f1995634f7c1dadbf Fixes: 76145f725532 ("bpf: Refactor check_pseudo_btf_id") Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20260209132904.63908-1-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-13Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio updates from Michael Tsirkin: - in-order support in virtio core - multiple address space support in vduse - fixes, cleanups all over the place, notably dma alignment fixes for non-cache-coherent systems * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (59 commits) vduse: avoid adding implicit padding vhost: fix caching attributes of MMIO regions by setting them explicitly vdpa/mlx5: update MAC address handling in mlx5_vdpa_set_attr() vdpa/mlx5: reuse common function for MAC address updates vdpa/mlx5: update mlx_features with driver state check crypto: virtio: Replace package id with numa node id crypto: virtio: Remove duplicated virtqueue_kick in virtio_crypto_skcipher_crypt_req crypto: virtio: Add spinlock protection with virtqueue notification Documentation: Add documentation for VDUSE Address Space IDs vduse: bump version number vduse: add vq group asid support vduse: merge tree search logic of IOTLB_GET_FD and IOTLB_GET_INFO ioctls vduse: take out allocations from vduse_dev_alloc_coherent vduse: remove unused vaddr parameter of vduse_domain_free_coherent vduse: refactor vdpa_dev_add for goto err handling vhost: forbid change vq groups ASID if DRIVER_OK is set vdpa: document set_group_asid thread safety vduse: return internal vq group struct as map token vduse: add vq group support vduse: add v1 API definition ...
2026-02-13function_graph: Restore direct mode when callbacks drop to oneShengming Hu
When registering a second fgraph callback, direct path is disabled and array loop is used instead. When ftrace_graph_active falls back to one, we try to re-enable direct mode via ftrace_graph_enable_direct(true, ...). But ftrace_graph_enable_direct() incorrectly disables the static key rather than enabling it. This leaves fgraph_do_direct permanently off after first multi-callback transition, so direct fast mode is never restored. Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260213142932519cuWSpEXeS4-UnCvNXnK2P@zte.com.cn Fixes: cc60ee813b503 ("function_graph: Use static_call and branch to optimize entry function") Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-12Merge tag 'riscv-for-linus-7.0-mw1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V updates from Paul Walmsley: - Add support for control flow integrity for userspace processes. This is based on the standard RISC-V ISA extensions Zicfiss and Zicfilp - Improve ptrace behavior regarding vector registers, and add some selftests - Optimize our strlen() assembly - Enable the ISO-8859-1 code page as built-in, similar to ARM64, for EFI volume mounting - Clean up some code slightly, including defining copy_user_page() as copy_page() rather than memcpy(), aligning us with other architectures; and using max3() to slightly simplify an expression in riscv_iommu_init_check() * tag 'riscv-for-linus-7.0-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (42 commits) riscv: lib: optimize strlen loop efficiency selftests: riscv: vstate_exec_nolibc: Use the regular prctl() function selftests: riscv: verify ptrace accepts valid vector csr values selftests: riscv: verify ptrace rejects invalid vector csr inputs selftests: riscv: verify syscalls discard vector context selftests: riscv: verify initial vector state with ptrace selftests: riscv: test ptrace vector interface riscv: ptrace: validate input vector csr registers riscv: csr: define vtype register elements riscv: vector: init vector context with proper vlenb riscv: ptrace: return ENODATA for inactive vector extension kselftest/riscv: add kselftest for user mode CFI riscv: add documentation for shadow stack riscv: add documentation for landing pad / indirect branch tracking riscv: create a Kconfig fragment for shadow stack and landing pad support arch/riscv: add dual vdso creation logic and select vdso based on hw arch/riscv: compile vdso with landing pad and shadow stack note riscv: enable kernel access to shadow stack memory via the FWFT SBI call riscv: add kernel command line option to opt out of user CFI riscv/hwprobe: add zicfilp / zicfiss enumeration in hwprobe ...
2026-02-12Merge tag 'cxl-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxlLinus Torvalds
Pull CXL updates from Dave Jiang: - Introduce cxl_memdev_attach and pave way for soft reserved handling, type2 accelerator enabling, and LSA 2.0 enabling. All these series require the endpoint driver to settle before continuing the memdev driver probe. - Address CXL port error protocol handling and reporting. The large patch series was split into three parts. The first two parts are included here with the final part coming later. The first part consists of a series of code refactoring to PCI AER sub-system that addresses CXL and also CXL RAS code to prepare for port error handling. The second part refactors the CXL code to move management of component registers to cxl_port objects to allow all CXL AER errors to be handled through the cxl_port hierarchy. - Provide AMD Zen5 platform address translation for CXL using ACPI PRMT. This includes a conventions document to explain why this is needed and how it's implemented. - Misc CXL patches of fixes, cleanups, and updates. Including CXL address translation for unaligned MOD3 regions. [ TLA service: CXL is "Compute Express Link" ] * tag 'cxl-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (59 commits) cxl: Disable HPA/SPA translation handlers for Normalized Addressing cxl/region: Factor out code into cxl_region_setup_poison() cxl/atl: Lock decoders that need address translation cxl: Enable AMD Zen5 address translation using ACPI PRMT cxl/acpi: Prepare use of EFI runtime services cxl: Introduce callback for HPA address ranges translation cxl/region: Use region data to get the root decoder cxl/region: Add @hpa_range argument to function cxl_calc_interleave_pos() cxl/region: Separate region parameter setup and region construction cxl: Simplify cxl_root_ops allocation and handling cxl/region: Store HPA range in struct cxl_region cxl/region: Store root decoder in struct cxl_region cxl/region: Rename misleading variable name @hpa to @hpa_range Documentation/driver-api/cxl: ACPI PRM Address Translation Support and AMD Zen5 enablement cxl, doc: Moving conventions in separate files cxl, doc: Remove isonum.txt inclusion cxl/port: Unify endpoint and switch port lookup cxl/port: Move endpoint component register management to cxl_port cxl/port: Map Port RAS registers cxl/port: Move dport RAS setup to dport add time ...
2026-02-12kho: remove unnecessary WARN_ON(err) in kho_populate()Ran Xiaokai
The following pr_warn() provides detailed error and location information, WARN_ON(err) adds no additional debugging value, so remove the redundant WARN_ON() call. Link: https://lkml.kernel.org/r/20260212111146.210086-3-ranxiaokai627@163.com Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Alexander Graf <graf@amazon.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-12kho: fix missing early_memunmap() call in kho_populate()Ran Xiaokai
Patch series "two fixes in kho_populate()", v3. This patch (of 2): kho_populate() returns without calling early_memunmap() on success path, this will cause early ioremap virtual address space leak. Link: https://lkml.kernel.org/r/20260212111146.210086-1-ranxiaokai627@163.com Link: https://lkml.kernel.org/r/20260212111146.210086-2-ranxiaokai627@163.com Fixes: b50634c5e84a ("kho: cleanup error handling in kho_populate()") Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-12mm: update all remaining mmap_prepare users to use vma_flags_tLorenzo Stoakes
We will be shortly removing the vm_flags_t field from vm_area_desc so we need to update all mmap_prepare users to only use the dessc->vma_flags field. This patch achieves that and makes all ancillary changes required to make this possible. This lays the groundwork for future work to eliminate the use of vm_flags_t in vm_area_desc altogether and more broadly throughout the kernel. While we're here, we take the opportunity to replace VM_REMAP_FLAGS with VMA_REMAP_FLAGS, the vma_flags_t equivalent. No functional changes intended. Link: https://lkml.kernel.org/r/fb1f55323799f09fe6a36865b31550c9ec67c225.1769097829.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Damien Le Moal <dlemoal@kernel.org> [zonefs] Acked-by: "Darrick J. Wong" <djwong@kernel.org> Acked-by: Pedro Falcato <pfalcato@suse.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Yury Norov <ynorov@nvidia.com> Cc: Chris Mason <clm@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-12mm/vmscan: fix demotion targets checks in reclaim/demotionBing Jiao
Patch series "mm/vmscan: fix demotion targets checks in reclaim/demotion", v9. This patch series addresses two issues in demote_folio_list(), can_demote(), and next_demotion_node() in reclaim/demotion. 1. demote_folio_list() and can_demote() do not correctly check demotion target against cpuset.mems_effective, which will cause (a) pages to be demoted to not-allowed nodes and (b) pages fail demotion even if the system still has allowed demotion nodes. Patch 1 fixes this bug by updating cpuset_node_allowed() and mem_cgroup_node_allowed() to return effective_mems, allowing directly logic-and operation against demotion targets. 2. next_demotion_node() returns a preferred demotion target, but it does not check the node against allowed nodes. Patch 2 ensures that next_demotion_node() filters against the allowed node mask and selects the closest demotion target to the source node. This patch (of 2): Fix two bugs in demote_folio_list() and can_demote() due to incorrect demotion target checks against cpuset.mems_effective in reclaim/demotion. Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim") introduces the cpuset.mems_effective check and applies it to can_demote(). However: 1. It does not apply this check in demote_folio_list(), which leads to situations where pages are demoted to nodes that are explicitly excluded from the task's cpuset.mems. 2. It checks only the nodes in the immediate next demotion hierarchy and does not check all allowed demotion targets in can_demote(). This can cause pages to never be demoted if the nodes in the next demotion hierarchy are not set in mems_effective. These bugs break resource isolation provided by cpuset.mems. This is visible from userspace because pages can either fail to be demoted entirely or are demoted to nodes that are not allowed in multi-tier memory systems. To address these bugs, update cpuset_node_allowed() and mem_cgroup_node_allowed() to return effective_mems, allowing directly logic-and operation against demotion targets. Also update can_demote() and demote_folio_list() accordingly. Bug 1 reproduction: Assume a system with 4 nodes, where nodes 0-1 are top-tier and nodes 2-3 are far-tier memory. All nodes have equal capacity. Test script: echo 1 > /sys/kernel/mm/numa/demotion_enabled mkdir /sys/fs/cgroup/test echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control echo "0-2" > /sys/fs/cgroup/test/cpuset.mems echo $$ > /sys/fs/cgroup/test/cgroup.procs swapoff -a # Expectation: Should respect node 0-2 limit. # Observation: Node 3 shows significant allocation (MemFree drops) stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1 Bug 2 reproduction: Assume a system with 6 nodes, where nodes 0-2 are top-tier, node 3 is a far-tier node, and nodes 4-5 are the farthest-tier nodes. All nodes have equal capacity. Test script: echo 1 > /sys/kernel/mm/numa/demotion_enabled mkdir /sys/fs/cgroup/test echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems echo $$ > /sys/fs/cgroup/test/cgroup.procs swapoff -a # Expectation: Pages are demoted to Nodes 4-5 # Observation: No pages are demoted before oom. stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2 Link: https://lkml.kernel.org/r/20260114205305.2869796-1-bingjiao@google.com Link: https://lkml.kernel.org/r/20260114205305.2869796-2-bingjiao@google.com Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim") Signed-off-by: Bing Jiao <bingjiao@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Gregory Price <gourry@gourry.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-12Merge tag 'trace-rv-v7.0' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull runtime verifier updates from Steven Rostedt: - Refactor da_monitor to minimize macros Complete refactor of da_monitor.h to reduce reliance on macros generating functions. Use generic static functions and uses the preprocessor only when strictly necessary (e.g. for tracepoint handlers). The change essentially relies on functions with generic names (e.g. da_handle) instead of monitor-specific as well adding the need to define constant (e.g. MONITOR_NAME, MONITOR_TYPE) before including the header rather than calling macros that would define functions. Also adapt monitors and documentation accordingly. - Cleanup DA code generation scripts Clean up functions in dot2c removing reimplementations of trivial library functions (__buff_to_string) and removing some other unused intermediate steps. - Annotate functions with types in the rvgen python scripts - Remove superfluous assignments and cleanup generated code The rvgen scripts generate a superfluous assignment to 0 for enum variables and don't add commas to the last elements, which is against the kernel coding standards. Change the generation process for a better compliance and slightly simpler logic. - Remove superfluous declarations from generated code The monitor container source files contained a declaration and a definition for the rv_monitor variable. The former is superfluous and was removed. - Fix reference to outdated documentation s/da_monitor_synthesis.rst/monitor_synthesis.rst in comment in da_monitor.h * tag 'trace-rv-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: rv: Fix documentation reference in da_monitor.h verification/rvgen: Remove unused variable declaration from containers verification/dot2c: Remove superfluous enum assignment and add last comma verification/dot2c: Remove __buff_to_string() and cleanup verification/rvgen: Annotate DA functions with types verification/rvgen: Adapt dot2k and templates after refactoring da_monitor.h Documentation/rv: Adapt documentation after da_monitor refactoring rv: Cleanup da_monitor after refactor rv: Refactor da_monitor to minimise macros
2026-02-12Merge tag 'mm-nonmm-stable-2026-02-12-10-48' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves disk space by teaching ocfs2 to reclaim suballocator block group space (Heming Zhao) - "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the ARRAY_END() macro and uses it in various places (Alejandro Colomar) - "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the page size (Pnina Feder) - "kallsyms: Prevent invalid access when showing module buildid" cleans up kallsyms code related to module buildid and fixes an invalid access crash when printing backtraces (Petr Mladek) - "Address page fault in ima_restore_measurement_list()" fixes a kexec-related crash that can occur when booting the second-stage kernel on x86 (Harshit Mogalapalli) - "kho: ABI headers and Documentation updates" updates the kexec handover ABI documentation (Mike Rapoport) - "Align atomic storage" adds the __aligned attribute to atomic_t and atomic64_t definitions to get natural alignment of both types on csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain) - "kho: clean up page initialization logic" simplifies the page initialization logic in kho_restore_page() (Pratyush Yadav) - "Unload linux/kernel.h" moves several things out of kernel.h and into more appropriate places (Yury Norov) - "don't abuse task_struct.group_leader" removes the usage of ->group_leader when it is "obviously unnecessary" (Oleg Nesterov) - "list private v2 & luo flb" adds some infrastructure improvements to the live update orchestrator (Pasha Tatashin) * tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits) watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency procfs: fix missing RCU protection when reading real_parent in do_task_stat() watchdog/softlockup: fix sample ring index wrap in need_counting_irqs() kcsan, compiler_types: avoid duplicate type issues in BPF Type Format kho: fix doc for kho_restore_pages() tests/liveupdate: add in-kernel liveupdate test liveupdate: luo_flb: introduce File-Lifecycle-Bound global state liveupdate: luo_file: Use private list list: add kunit test for private list primitives list: add primitives for private list manipulations delayacct: fix uapi timespec64 definition panic: add panic_force_cpu= parameter to redirect panic to a specific CPU netclassid: use thread_group_leader(p) in update_classid_task() RDMA/umem: don't abuse current->group_leader drm/pan*: don't abuse current->group_leader drm/amd: kill the outdated "Only the pthreads threading model is supported" checks drm/amdgpu: don't abuse current->group_leader android/binder: use same_thread_group(proc->tsk, current) in binder_mmap() android/binder: don't abuse current->group_leader kho: skip memoryless NUMA nodes when reserving scratch areas ...
2026-02-12Merge tag 'mm-stable-2026-02-11-19-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "powerpc/64s: do not re-activate batched TLB flush" makes arch_{enter|leave}_lazy_mmu_mode() nest properly (Alexander Gordeev) It adds a generic enter/leave layer and switches architectures to use it. Various hacks were removed in the process. - "zram: introduce compressed data writeback" implements data compression for zram writeback (Richard Chang and Sergey Senozhatsky) - "mm: folio_zero_user: clear page ranges" adds clearing of contiguous page ranges for hugepages. Large improvements during demand faulting are demonstrated (David Hildenbrand) - "memcg cleanups" tidies up some memcg code (Chen Ridong) - "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos stats" improves DAMOS stat's provided information, deterministic control, and readability (SeongJae Park) - "selftests/mm: hugetlb cgroup charging: robustness fixes" fixes a few issues in the hugetlb cgroup charging selftests (Li Wang) - "Fix va_high_addr_switch.sh test failure - again" addresses several issues in the va_high_addr_switch test (Chunyu Hu) - "mm/damon/tests/core-kunit: extend existing test scenarios" improves the KUnit test coverage for DAMON (Shu Anzai) - "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" fixes a glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to transiently return -EAGAIN (Shivank Garg) - "arch, mm: consolidate hugetlb early reservation" reworks and consolidates a pile of straggly code related to reservation of hugetlb memory from bootmem and creation of CMA areas for hugetlb (Mike Rapoport) - "mm: clean up anon_vma implementation" cleans up the anon_vma implementation in various ways (Lorenzo Stoakes) - "tweaks for __alloc_pages_slowpath()" does a little streamlining of the page allocator's slowpath code (Vlastimil Babka) - "memcg: separate private and public ID namespaces" cleans up the memcg ID code and prevents the internal-only private IDs from being exposed to userspace (Shakeel Butt) - "mm: hugetlb: allocate frozen gigantic folio" cleans up the allocation of frozen folios and avoids some atomic refcount operations (Kefeng Wang) - "mm/damon: advance DAMOS-based LRU sorting" improves DAMOS's movement of memory betewwn the active and inactive LRUs and adds auto-tuning of the ratio-based quotas and of monitoring intervals (SeongJae Park) - "Support page table check on PowerPC" makes CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc (Andrew Donnellan) - "nodemask: align nodes_and{,not} with underlying bitmap ops" makes nodes_and() and nodes_andnot() propagate the return values from the underlying bit operations, enabling some cleanup in calling code (Yury Norov) - "mm/damon: hide kdamond and kdamond_lock from API callers" cleans up some DAMON internal interfaces (SeongJae Park) - "mm/khugepaged: cleanups and scan limit fix" does some cleanup work in khupaged and fixes a scan limit accounting issue (Shivank Garg) - "mm: balloon infrastructure cleanups" goes to town on the balloon infrastructure and its page migration function. Mainly cleanups, also some locking simplification (David Hildenbrand) - "mm/vmscan: add tracepoint and reason for kswapd_failures reset" adds additional tracepoints to the page reclaim code (Jiayuan Chen) - "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" is part of Marco's kernel-wide migration from the legacy workqueue APIs over to the preferred unbound workqueues (Marco Crivellari) - "Various mm kselftests improvements/fixes" provides various unrelated improvements/fixes for the mm kselftests (Kevin Brodsky) - "mm: accelerate gigantic folio allocation" greatly speeds up gigantic folio allocation, mainly by avoiding unnecessary work in pfn_range_valid_contig() (Kefeng Wang) - "selftests/damon: improve leak detection and wss estimation reliability" improves the reliability of two of the DAMON selftests (SeongJae Park) - "mm/damon: cleanup kdamond, damon_call(), damos filter and DAMON_MIN_REGION" does some cleanup work in the core DAMON code (SeongJae Park) - "Docs/mm/damon: update intro, modules, maintainer profile, and misc" performs maintenance work on the DAMON documentation (SeongJae Park) - "mm: add and use vma_assert_stabilised() helper" refactors and cleans up the core VMA code. The main aim here is to be able to use the mmap write lock's lockdep state to perform various assertions regarding the locking which the VMA code requires (Lorenzo Stoakes) - "mm, swap: swap table phase II: unify swapin use" removes some old swap code (swap cache bypassing and swap synchronization) which wasn't working very well. Various other cleanups and simplifications were made. The end result is a 20% speedup in one benchmark (Kairui Song) - "enable PT_RECLAIM on more 64-bit architectures" makes PT_RECLAIM available on 64-bit alpha, loongarch, mips, parisc, and um. Various cleanups were performed along the way (Qi Zheng) * tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (325 commits) mm/memory: handle non-split locks correctly in zap_empty_pte_table() mm: move pte table reclaim code to memory.c mm: make PT_RECLAIM depends on MMU_GATHER_RCU_TABLE_FREE mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config um: mm: enable MMU_GATHER_RCU_TABLE_FREE parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE mips: mm: enable MMU_GATHER_RCU_TABLE_FREE LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h mm/damon/stat: remove __read_mostly from memory_idle_ms_percentiles zsmalloc: make common caches global mm: add SPDX id lines to some mm source files mm/zswap: use %pe to print error pointers mm/vmscan: use %pe to print error pointers mm/readahead: fix typo in comment mm: khugepaged: fix NR_FILE_PAGES and NR_SHMEM in collapse_file() mm: refactor vma_map_pages to use vm_insert_pages mm/damon: unify address range representation with damon_addr_range mm/cma: replace snprintf with strscpy in cma_new_area ...
2026-02-12cgroup: fix race between task migration and iterationQingye Zhao
When a task is migrated out of a css_set, cgroup_migrate_add_task() first moves it from cset->tasks to cset->mg_tasks via: list_move_tail(&task->cg_list, &cset->mg_tasks); If a css_task_iter currently has it->task_pos pointing to this task, css_set_move_task() calls css_task_iter_skip() to keep the iterator valid. However, since the task has already been moved to ->mg_tasks, the iterator is advanced relative to the mg_tasks list instead of the original tasks list. As a result, remaining tasks on cset->tasks, as well as tasks queued on cset->mg_tasks, can be skipped by iteration. Fix this by calling css_set_skip_task_iters() before unlinking task->cg_list from cset->tasks. This advances all active iterators to the next task on cset->tasks, so iteration continues correctly even when a task is concurrently being migrated. This race is hard to hit in practice without instrumentation, but it can be reproduced by artificially slowing down cgroup_procs_show(). For example, on an Android device a temporary /sys/kernel/cgroup/cgroup_test knob can be added to inject a delay into cgroup_procs_show(), and then: 1) Spawn three long-running tasks (PIDs 101, 102, 103). 2) Create a test cgroup and move the tasks into it. 3) Enable a large delay via /sys/kernel/cgroup/cgroup_test. 4) In one shell, read cgroup.procs from the test cgroup. 5) Within the delay window, in another shell migrate PID 102 by writing it to a different cgroup.procs file. Under this setup, cgroup.procs can intermittently show only PID 101 while skipping PID 103. Once the migration completes, reading the file again shows all tasks as expected. Note that this change does not allow removing the existing css_set_skip_task_iters() call in css_set_move_task(). The new call in cgroup_migrate_add_task() only handles iterators that are racing with migration while the task is still on cset->tasks. Iterators may also start after the task has been moved to cset->mg_tasks. If we dropped css_set_skip_task_iters() from css_set_move_task(), such iterators could keep task_pos pointing to a migrating task, causing css_task_iter_advance() to malfunction on the destination css_set, up to and including crashes or infinite loops. The race window between migration and iteration is very small, and css_task_iter is not on a hot path. In the worst case, when an iterator is positioned on the first thread of the migrating process, cgroup_migrate_add_task() may have to skip multiple tasks via css_set_skip_task_iters(). However, this only happens when migration and iteration actually race, so the performance impact is negligible compared to the correctness fix provided here. Fixes: b636fd38dc40 ("cgroup: Implement css_task_iter_skip()") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Qingye Zhao <zhaoqingye@honor.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>