path: root/kernel
2024-06-18  sched_ext: Implement runnable task stall watchdog  [David Vernet]
The most common and critical way that a BPF scheduler can misbehave is by failing to run runnable tasks for too long. This patch implements a watchdog. * All tasks record when they become runnable. * A watchdog work periodically scans all runnable tasks. If any task has stayed runnable for too long, the BPF scheduler is aborted. * scheduler_tick() monitors whether the watchdog itself is stuck. If so, the BPF scheduler is aborted. Because the watchdog only scans the tasks which are currently runnable and usually very infrequently, the overhead should be negligible. scx_qmap is updated so that it can be told to stall user and/or kernel tasks. A detected task stall looks like the following: sched_ext: BPF scheduler "qmap" errored, disabling sched_ext: runnable task stall (dbus-daemon[953] failed to run for 6.478s) scx_check_timeout_workfn+0x10e/0x1b0 process_one_work+0x287/0x560 worker_thread+0x234/0x420 kthread+0xe9/0x100 ret_from_fork+0x1f/0x30 A detected watchdog stall: sched_ext: BPF scheduler "qmap" errored, disabling sched_ext: runnable task stall (watchdog failed to check in for 5.001s) scheduler_tick+0x2eb/0x340 update_process_times+0x7a/0x90 tick_sched_timer+0xd8/0x130 __hrtimer_run_queues+0x178/0x3b0 hrtimer_interrupt+0xfc/0x390 __sysvec_apic_timer_interrupt+0xb7/0x2b0 sysvec_apic_timer_interrupt+0x90/0xb0 asm_sysvec_apic_timer_interrupt+0x1b/0x20 default_idle+0x14/0x20 arch_cpu_idle+0xf/0x20 default_idle_call+0x50/0x90 do_idle+0xe8/0x240 cpu_startup_entry+0x1d/0x20 kernel_init+0x0/0x190 start_kernel+0x0/0x392 start_kernel+0x324/0x392 x86_64_start_reservations+0x2a/0x2c x86_64_start_kernel+0x104/0x109 secondary_startup_64_no_verify+0xce/0xdb Note that this patch exposes scx_ops_error[_type]() in kernel/sched/ext.h to inline scx_notify_sched_tick(). v4: - While disabling, cancel_delayed_work_sync(&scx_watchdog_work) was being called before forward progress was guaranteed and thus could lead to system lockup. Relocated. - While enabling, it was comparing msecs against jiffies without conversion leading to spurious load failures on lower HZ kernels. Fixed. - runnable list management is now used by core bypass logic and moved to the patch implementing sched_ext core. v3: - bpf_scx_init_member() was incorrectly comparing ops->timeout_ms against SCX_WATCHDOG_MAX_TIMEOUT which is in jiffies without conversion leading to spurious load failures in lower HZ kernels. Fixed. v2: - Julia Lawall noticed that the watchdog code was mixing msecs and jiffies. Fix by using jiffies for everything. Signed-off-by: David Vernet <dvernet@meta.com> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Cc: Julia Lawall <julia.lawall@inria.fr>
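The mechanism is easy to picture as a periodic scan of the per-rq runnable lists. A minimal sketch of the idea follows; the scx-specific identifiers (scx_watchdog_timeout, scx_watchdog_work, scx_watchdog_timestamp, p->scx.runnable_at, rq->scx.runnable_list) are illustrative assumptions, not the exact upstream names:

  /* Hedged sketch of the watchdog scan; identifiers are assumptions. */
  static void scx_watchdog_workfn(struct work_struct *work)
  {
      unsigned long timeout = scx_watchdog_timeout;  /* ops->timeout_ms converted to jiffies */
      int cpu;

      for_each_online_cpu(cpu) {
          struct rq *rq = cpu_rq(cpu);
          struct task_struct *p;
          struct rq_flags rf;

          rq_lock_irqsave(rq, &rf);
          list_for_each_entry(p, &rq->scx.runnable_list, scx.runnable_node) {
              /* p->scx.runnable_at was recorded when @p became runnable */
              if (time_after(jiffies, p->scx.runnable_at + timeout)) {
                  scx_ops_error("runnable task stall (%s[%d])", p->comm, p->pid);
                  break;
              }
          }
          rq_unlock_irqrestore(rq, &rf);
      }

      /* let scheduler_tick() verify that the watchdog itself keeps checking in */
      WRITE_ONCE(scx_watchdog_timestamp, jiffies);
      queue_delayed_work(system_unbound_wq, &scx_watchdog_work, timeout / 2);
  }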
2024-06-18  sched_ext: Add sysrq-S which disables the BPF scheduler  [Tejun Heo]
This enables the admin to abort the BPF scheduler and revert to CFS anytime. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
2024-06-18  sched_ext: Implement BPF extensible scheduler class  [Tejun Heo]
Implement a new scheduler class sched_ext (SCX), which allows scheduling policies to be implemented as BPF programs to achieve the following: 1. Ease of experimentation and exploration: Enabling rapid iteration of new scheduling policies. 2. Customization: Building application-specific schedulers which implement policies that are not applicable to general-purpose schedulers. 3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling policies in production environments. sched_ext leverages BPF’s struct_ops feature to define a structure which exports function callbacks and flags to BPF programs that wish to implement scheduling policies. The struct_ops structure exported by sched_ext is struct sched_ext_ops, and is conceptually similar to struct sched_class. The role of sched_ext is to map the complex sched_class callbacks to the more simple and ergonomic struct sched_ext_ops callbacks. For more detailed discussion on the motivations and overview, please refer to the cover letter. Later patches will also add several example schedulers and documentation. This patch implements the minimum core framework to enable implementation of BPF schedulers. Subsequent patches will gradually add functionalities including safety guarantee mechanisms, nohz and cgroup support. include/linux/sched/ext.h defines struct sched_ext_ops. With the comment on top, each operation should be self-explanatory. The followings are worth noting: - Both "sched_ext" and its shorthand "scx" are used. If the identifier already has "sched" in it, "ext" is used; otherwise, "scx". - In sched_ext_ops, only .name is mandatory. Every operation is optional and if omitted a simple but functional default behavior is provided. - A new policy constant SCHED_EXT is added and a task can select sched_ext by invoking sched_setscheduler(2) with the new policy constant. However, if the BPF scheduler is not loaded, SCHED_EXT is the same as SCHED_NORMAL and the task is scheduled by CFS. When the BPF scheduler is loaded, all tasks which have the SCHED_EXT policy are switched to sched_ext. - To bridge the workflow imbalance between the scheduler core and sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch queues (dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and one local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for convenience and need not be used by a scheduler that doesn't require it. SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when putting the next task on the CPU. The BPF scheduler can manage an arbitrary number of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq(). - sched_ext guarantees system integrity no matter what the BPF scheduler does. To enable this, each task's ownership is tracked through p->scx.ops_state and all tasks are put on scx_tasks list. The disable path can always recover and revert all tasks back to CFS. See p->scx.ops_state and scx_tasks. - A task is not tied to its rq while enqueued. This decouples CPU selection from queueing and allows sharing a scheduling queue across an arbitrary subset of CPUs. This adds some complexities as a task may need to be bounced between rq's right before it starts executing. See dispatch_to_local_dsq() and move_task_to_local_dsq(). - One complication that arises from the above weak association between task and rq is that synchronizing with dequeue() gets complicated as dequeue() may happen anytime while the task is enqueued and the dispatch path might need to release the rq lock to transfer the task. 
Solving this requires a bit of complexity. See the logic around p->scx.sticky_cpu and p->scx.ops_qseq. - Both enable and disable paths are a bit complicated. The enable path switches all tasks without blocking to avoid issues which can arise from partially switched states (e.g. the switching task itself being starved). The disable path can't trust the BPF scheduler at all, so it also has to guarantee forward progress without blocking. See scx_ops_enable() and scx_ops_disable_workfn(). - When sched_ext is disabled, static_branches are used to shut down the entry points from hot paths. v7: - scx_ops_bypass() was incorrectly and unnecessarily trying to grab scx_ops_enable_mutex which can lead to deadlocks in the disable path. Fixed. - Fixed TASK_DEAD handling bug in scx_ops_enable() path which could lead to use-after-free. - Consolidated per-cpu variable usages and other cleanups. v6: - SCX_NR_ONLINE_OPS replaced with SCX_OPI_*_BEGIN/END so that multiple groups can be expressed. Later CPU hotplug operations are put into their own group. - SCX_OPS_DISABLING state is replaced with the new bypass mechanism which allows temporarily putting the system into simple FIFO scheduling mode bypassing the BPF scheduler. In addition to the shut down path, this will also be used to isolate the BPF scheduler across PM events. Enabling and disabling the bypass mode requires iterating all runnable tasks. rq->scx.runnable_list addition is moved from the later watchdog patch. - ops.prep_enable() is replaced with ops.init_task() and ops.enable/disable() are now called whenever the task enters and leaves sched_ext instead of when the task becomes schedulable on sched_ext and stops being so. A new operation - ops.exit_task() - is called when the task stops being schedulable on sched_ext. - scx_bpf_dispatch() can now be called from ops.select_cpu() too. This removes the need for communicating local dispatch decision made by ops.select_cpu() to ops.enqueue() via per-task storage. SCX_KF_SELECT_CPU is added to support the change. - SCX_TASK_ENQ_LOCAL which told the BPF scheudler that scx_select_cpu_dfl() wants the task to be dispatched to the local DSQ was removed. Instead, scx_bpf_select_cpu_dfl() now dispatches directly if it finds a suitable idle CPU. If such behavior is not desired, users can use scx_bpf_select_cpu_dfl() which returns the verdict in a bool out param. - scx_select_cpu_dfl() was mishandling WAKE_SYNC and could end up queueing many tasks on a local DSQ which makes tasks to execute in order while other CPUs stay idle which made some hackbench numbers really bad. Fixed. - The current state of sched_ext can now be monitored through files under /sys/sched_ext instead of /sys/kernel/debug/sched/ext. This is to enable monitoring on kernels which don't enable debugfs. - sched_ext wasn't telling BPF that ops.dispatch()'s @prev argument may be NULL and a BPF scheduler which derefs the pointer without checking could crash the kernel. Tell BPF. This is currently a bit ugly. A better way to annotate this is expected in the future. - scx_exit_info updated to carry pointers to message buffers instead of embedding them directly. This decouples buffer sizes from API so that they can be changed without breaking compatibility. - exit_code added to scx_exit_info. This is used to indicate different exit conditions on non-error exits and will be used to handle e.g. CPU hotplugs. 
- The patch "sched_ext: Allow BPF schedulers to switch all eligible tasks into sched_ext" is folded in and the interface is changed so that partial switching is indicated with a new ops flag %SCX_OPS_SWITCH_PARTIAL. This makes scx_bpf_switch_all() unnecessasry and in turn SCX_KF_INIT. ops.init() is now called with SCX_KF_SLEEPABLE. - Code reorganized so that only the parts necessary to integrate with the rest of the kernel are in the header files. - Changes to reflect the BPF and other kernel changes including the addition of bpf_sched_ext_ops.cfi_stubs. v5: - To accommodate 32bit configs, p->scx.ops_state is now atomic_long_t instead of atomic64_t and scx_dsp_buf_ent.qseq which uses load_acquire/store_release is now unsigned long instead of u64. - Fix the bug where bpf_scx_btf_struct_access() was allowing write access to arbitrary fields. - Distinguish kfuncs which can be called from any sched_ext ops and from anywhere. e.g. scx_bpf_pick_idle_cpu() can now be called only from sched_ext ops. - Rename "type" to "kind" in scx_exit_info to make it easier to use on languages in which "type" is a reserved keyword. - Since cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup"), PF_IDLE is not set on idle tasks which haven't been online yet which made scx_task_iter_next_filtered() include those idle tasks in iterations leading to oopses. Update scx_task_iter_next_filtered() to directly test p->sched_class against idle_sched_class instead of using is_idle_task() which tests PF_IDLE. - Other updates to match upstream changes such as adding const to set_cpumask() param and renaming check_preempt_curr() to wakeup_preempt(). v4: - SCHED_CHANGE_BLOCK replaced with the previous sched_deq_and_put_task()/sched_enq_and_set_tsak() pair. This is because upstream is adaopting a different generic cleanup mechanism. Once that lands, the code will be adapted accordingly. - task_on_scx() used to test whether a task should be switched into SCX, which is confusing. Renamed to task_should_scx(). task_on_scx() now tests whether a task is currently on SCX. - scx_has_idle_cpus is barely used anymore and replaced with direct check on the idle cpumask. - SCX_PICK_IDLE_CORE added and scx_pick_idle_cpu() improved to prefer fully idle cores. - ops.enable() now sees up-to-date p->scx.weight value. - ttwu_queue path is disabled for tasks on SCX to avoid confusing BPF schedulers expecting ->select_cpu() call. - Use cpu_smt_mask() instead of topology_sibling_cpumask() like the rest of the scheduler. v3: - ops.set_weight() added to allow BPF schedulers to track weight changes without polling p->scx.weight. - move_task_to_local_dsq() was losing SCX-specific enq_flags when enqueueing the task on the target dsq because it goes through activate_task() which loses the upper 32bit of the flags. Carry the flags through rq->scx.extra_enq_flags. - scx_bpf_dispatch(), scx_bpf_pick_idle_cpu(), scx_bpf_task_running() and scx_bpf_task_cpu() now use the new KF_RCU instead of KF_TRUSTED_ARGS to make it easier for BPF schedulers to call them. - The kfunc helper access control mechanism implemented through sched_ext_entity.kf_mask is improved. Now SCX_CALL_OP*() is always used when invoking scx_ops operations. v2: - balance_scx_on_up() is dropped. Instead, on UP, balance_scx() is called from put_prev_taks_scx() and pick_next_task_scx() as necessary. To determine whether balance_scx() should be called from put_prev_task_scx(), SCX_TASK_DEQD_FOR_SLEEP flag is added. See the comment in put_prev_task_scx() for details. 
- sched_deq_and_put_task() / sched_enq_and_set_task() sequences replaced with SCHED_CHANGE_BLOCK(). - Unused all_dsqs list removed. This was a left-over from previous iterations. - p->scx.kf_mask is added to track and enforce which kfunc helpers are allowed. Also, init/exit sequences are updated to make some kfuncs always safe to call regardless of the current BPF scheduler state. Combined, this should make all the kfuncs safe. - BPF now supports sleepable struct_ops operations. Hacky workaround removed and operations and kfunc helpers are tagged appropriately. - BPF now supports bitmask / cpumask helpers. scx_bpf_get_idle_cpumask() and friends are added so that BPF schedulers can use the idle masks with the generic helpers. This replaces the hacky kfunc helpers added by a separate patch in V1. - CONFIG_SCHED_CLASS_EXT can no longer be enabled if SCHED_CORE is enabled. This restriction will be removed by a later patch which adds core-sched support. - Add MAINTAINERS entries and other misc changes. Signed-off-by: Tejun Heo <tj@kernel.org> Co-authored-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Cc: Andrea Righi <andrea.righi@canonical.com>
2024-06-18  sched_ext: Add boilerplate for extensible scheduler class  [Tejun Heo]
This adds dummy implementations of sched_ext interfaces which interact with the scheduler core and hook them in the correct places. As they're all dummies, this doesn't cause any behavior changes. This is split out to help reviewing. v2: balance_scx_on_up() dropped. This will be handled in sched_ext proper. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
2024-06-18  sched: Add normal_policy()  [Tejun Heo]
A new BPF extensible sched_class will need to dynamically change how a task picks its sched_class. For example, if the loaded BPF scheduler progs fail, the tasks will be forced back on CFS even if the task's policy is set to the new sched_class. To support such mapping, add normal_policy() which wraps testing for %SCHED_NORMAL. This doesn't cause any behavior changes. v2: Update the description with more details on the expected use. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
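As a rough sketch (assuming the helper sits next to the other policy predicates in kernel/sched/sched.h), the wrapper is trivial; its value is that later sched_ext patches get a single place to extend:

  /* Minimal sketch; the exact placement/guards in upstream may differ. */
  static inline int normal_policy(int policy)
  {
      return policy == SCHED_NORMAL;
  }

  static inline int fair_policy(int policy)
  {
      /* SCHED_EXT can later be mapped back to CFS here when no BPF
       * scheduler is loaded, without touching the callers. */
      return normal_policy(policy) || policy == SCHED_BATCH;
  }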
2024-06-18  sched: Factor out update_other_load_avgs() from __update_blocked_others()  [Tejun Heo]
RT, DL, thermal and irq load and utilization metrics need to be decayed and updated periodically and before consumption to keep the numbers reasonable. This is currently done from __update_blocked_others() as a part of the fair class load balance path. Let's factor it out to update_other_load_avgs(). Pure refactor. No functional changes. This will be used by the new BPF extensible scheduling class to ensure that the above metrics are properly maintained. v2: Refreshed on top of tip:sched/core. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com>
2024-06-18  sched: Factor out cgroup weight conversion functions  [Tejun Heo]
Factor out sched_weight_from/to_cgroup() which convert between scheduler shares and cgroup weight. No functional change. The factored out functions will be used by a new BPF extensible sched_class so that the weights can be exposed to the BPF programs in a way which is consistent with cgroup weights and easier to interpret. The weight conversions will be used regardless of cgroup usage. It's just borrowing the cgroup weight range as it's more intuitive. CGROUP_WEIGHT_MIN/DFL/MAX constants are moved outside CONFIG_CGROUPS so that the conversion helpers can always be defined. v2: The helpers are now defined regardless of CONFIG_CGROUPS. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
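A sketch of the conversion helpers, assuming the usual anchoring of CGROUP_WEIGHT_DFL (100) at the default scheduler share of 1024 (the exact rounding and clamping in upstream may differ):

  /* Sketch only: map cgroup weight [1..10000] to scheduler shares and back. */
  static inline unsigned long sched_weight_from_cgroup(unsigned long cgrp_weight)
  {
      return DIV_ROUND_CLOSEST_ULL(cgrp_weight * 1024, CGROUP_WEIGHT_DFL);
  }

  static inline unsigned long sched_weight_to_cgroup(unsigned long weight)
  {
      return clamp_t(unsigned long,
                     DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024),
                     CGROUP_WEIGHT_MIN, CGROUP_WEIGHT_MAX);
  }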
2024-06-18  sched: Add sched_class->switching_to() and expose check_class_changing/changed()  [Tejun Heo]
When a task switches to a new sched_class, the prev and new classes are notified through ->switched_from() and ->switched_to(), respectively, after the switching is done. A new BPF extensible sched_class will have callbacks that allow the BPF scheduler to keep track of relevant task states (like priority and cpumask). Those callbacks aren't called while a task is on a different sched_class. When a task comes back, we wanna tell the BPF progs the up-to-date state before the task gets enqueued, so we need a hook which is called before the switching is committed. This patch adds ->switching_to() which is called during sched_class switch through check_class_changing() before the task is restored. Also, this patch exposes check_class_changing/changed() in kernel/sched/sched.h. They will be used by the new BPF extensible sched_class to implement implicit sched_class switching which is used e.g. when falling back to CFS when the BPF scheduler fails or unloads. This is a prep patch and doesn't cause any behavior changes. The new operation and exposed functions aren't used yet. v3: Refreshed on top of tip:sched/core. v2: Improve patch description w/ details on planned use. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
2024-06-18  sched: Add sched_class->reweight_task()  [Tejun Heo]
Currently, during a task weight change, sched core directly calls reweight_task() defined in fair.c if @p is on CFS. Let's make it a proper sched_class operation instead. CFS's reweight_task() is renamed to reweight_task_fair() and now called through sched_class. While it turns a direct call into an indirect one, set_load_weight() isn't called from a hot path and this change shouldn't cause any noticeable difference. This will be used to implement reweight_task for a new BPF extensible sched_class so that it can keep its cached task weight up-to-date. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
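A hedged sketch of the indirection (the callback prototype shown here is an assumption; upstream may pass the priority differently or a struct load_weight):

  /* Sketch: sched_class gains a reweight_task() callback and set_load_weight()
   * calls it instead of reaching into fair.c directly. */
  struct sched_class {
      /* ... */
      void (*reweight_task)(struct rq *rq, struct task_struct *p, int newprio);
  };

  static void set_load_weight(struct task_struct *p)
  {
      int prio = p->static_prio - MAX_RT_PRIO;

      /* idle-task special case elided */
      if (p->sched_class->reweight_task)
          p->sched_class->reweight_task(task_rq(p), p, prio);
  }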
2024-06-18  sched: Allow sched_cgroup_fork() to fail and introduce sched_cancel_fork()  [Tejun Heo]
A new BPF extensible sched_class will need more control over the forking process. It wants to be able to fail from sched_cgroup_fork() after the new task's sched_task_group is initialized, so that the loaded BPF program can prepare the task once its cgroup association is established and reject the fork if e.g. allocation fails. Allow sched_cgroup_fork() to fail by making it return int instead of void and adding sched_cancel_fork() to undo sched_fork() in the error path. sched_cgroup_fork() doesn't fail yet and this patch shouldn't cause any behavior changes. v2: Patch description updated to detail the expected use. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
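A hedged sketch of how copy_process() would consume the new return value (the unwind label is illustrative, not the upstream one):

  /* Sketch of the fork-path plumbing enabled by this change. */
  retval = sched_cgroup_fork(p, args);      /* may now fail */
  if (retval)
      goto bad_fork_cancel_sched;           /* illustrative label */
  /* ... */

  bad_fork_cancel_sched:
      sched_cancel_fork(p);                 /* undo sched_fork() */
      /* continue with the pre-existing error unwinding */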
2024-06-18  sched: Restructure sched_class order sanity checks in sched_init()  [Tejun Heo]
Currently, sched_init() checks that the sched_class'es are in the expected order by testing each adjacency which is a bit brittle and makes it cumbersome to add optional sched_class'es. Instead, let's verify whether they're in the expected order using sched_class_above() which is what matters. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: David Vernet <dvernet@meta.com>
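A sketch of what the reworked check looks like inside sched_init() (the concrete class list depends on the config, e.g. stop_sched_class only exists on SMP):

  /* Assert relative ordering with sched_class_above() instead of testing
   * each adjacency. */
  #ifdef CONFIG_SMP
      BUG_ON(!sched_class_above(&stop_sched_class, &dl_sched_class));
  #endif
      BUG_ON(!sched_class_above(&dl_sched_class, &rt_sched_class));
      BUG_ON(!sched_class_above(&rt_sched_class, &fair_sched_class));
      BUG_ON(!sched_class_above(&fair_sched_class, &idle_sched_class));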
2024-06-18  Merge branch 'bpf/for-next' into sched_ext-base  [Tejun Heo]
2024-06-18  srcu: Fill out polled grace-period APIs  [Paul E. McKenney]
This commit adds the get_completed_synchronize_srcu() and the same_state_synchronize_srcu() functions. The first returns a cookie that is always interpreted as corresponding to an expired grace period. The second does an equality comparison of a pair of cookies. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Kent Overstreet <kent.overstreet@linux.dev>
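A brief usage sketch combining the two new helpers with the pre-existing polled-GP API ('my_srcu' is just an illustrative srcu_struct):

  DEFINE_SRCU(my_srcu);

  static void polled_srcu_example(void)
  {
      /* Cookie that is always treated as an already-expired grace period. */
      unsigned long done = get_completed_synchronize_srcu();
      unsigned long gp = start_poll_synchronize_srcu(&my_srcu);

      /* Equality comparison of a pair of cookies. */
      if (same_state_synchronize_srcu(done, gp))
          pr_info("both cookies refer to the same grace-period state\n");

      /* The "completed" cookie always polls as done. */
      WARN_ON(!poll_state_synchronize_srcu(&my_srcu, done));
  }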
2024-06-18  srcu: Update cleanup_srcu_struct() comment  [Paul E. McKenney]
Now that we have polled SRCU grace periods, a grace period can be started by start_poll_synchronize_srcu() as well as call_srcu(), synchronize_srcu(), and synchronize_srcu_expedited(). This commit therefore calls out this new start_poll_synchronize_srcu() possibility in the comment on the WARN_ON(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-06-18  srcu: Disable interrupts directly in srcu_gp_end()  [Paul E. McKenney]
Interrupts are enabled in srcu_gp_end(), so this commit switches from spin_lock_irqsave_rcu_node() and spin_unlock_irqrestore_rcu_node() to spin_lock_irq_rcu_node() and spin_unlock_irq_rcu_node(). Link: https://lore.kernel.org/all/febb13ab-a4bb-48b4-8e97-7e9f7749e6da@moroto.mountain/ Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-06-18  rcu: Disable interrupts directly in rcu_gp_init()  [Paul E. McKenney]
Interrupts are enabled in rcu_gp_init(), so this commit switches from local_irq_save() and local_irq_restore() to local_irq_disable() and local_irq_enable(). Link: https://lore.kernel.org/all/febb13ab-a4bb-48b4-8e97-7e9f7749e6da@moroto.mountain/ Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-06-18  rcu/tree: Reduce wake up for synchronize_rcu() common case  [Joel Fernandes (Google)]
In the synchronize_rcu() common case, we will have less than SR_MAX_USERS_WAKE_FROM_GP number of users per GP. Waking up the kworker is pointless just to free the last injected wait head since at that point, all the users have already been awakened. Introduce a new counter to track this and prevent the wakeup in the common case. [ paulmck: Remove atomic_dec_return_release in cannot-happen state. ] Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-06-18  bpf: Fix remap of arena.  [Alexei Starovoitov]
The bpf arena logic didn't account for mremap operation. Add a refcnt for multiple mmap events to prevent use-after-free in arena_vm_close. Fixes: 317460317a02 ("bpf: Introduce bpf_arena.") Reported-by: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Barret Rhoden <brho@google.com> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Closes: https://lore.kernel.org/bpf/Zmuw29IhgyPNKnIM@xpf.sh.intel.com Link: https://lore.kernel.org/bpf/20240617171812.76634-1-alexei.starovoitov@gmail.com
2024-06-17  Merge tag 'lsm-pr-20240617' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm  [Linus Torvalds]
Pull lsm fix from Paul Moore: "A single LSM/IMA patch to fix a problem caused by sleeping while in a RCU critical section" * tag 'lsm-pr-20240617' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: ima: Avoid blocking in RCU read-side critical section
2024-06-17  Merge tag 'mm-hotfixes-stable-2024-06-17-11-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  [Linus Torvalds]
Pull misc fixes from Andrew Morton: "Mainly MM singleton fixes. And a couple of ocfs2 regression fixes" * tag 'mm-hotfixes-stable-2024-06-17-11-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: kcov: don't lose track of remote references during softirqs mm: shmem: fix getting incorrect lruvec when replacing a shmem folio mm/debug_vm_pgtable: drop RANDOM_ORVALUE trick mm: fix possible OOB in numa_rebuild_large_mapping() mm/migrate: fix kernel BUG at mm/compaction.c:2761! selftests: mm: make map_fixed_noreplace test names stable mm/memfd: add documentation for MFD_NOEXEC_SEAL MFD_EXEC mm: mmap: allow for the maximum number of bits for randomizing mmap_base by default gcov: add support for GCC 14 zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING mm: huge_memory: fix misused mapping_large_folio_support() for anon folios lib/alloc_tag: fix RCU imbalance in pgalloc_tag_get() lib/alloc_tag: do not register sysctl interface when CONFIG_SYSCTL=n MAINTAINERS: remove Lorenzo as vmalloc reviewer Revert "mm: init_mlocked_on_free_v3" mm/page_table_check: fix crash on ZONE_DEVICE gcc: disable '-Warray-bounds' for gcc-9 ocfs2: fix NULL pointer dereference in ocfs2_abort_trigger() ocfs2: fix NULL pointer dereference in ocfs2_journal_dirty()
2024-06-17  bpf: Add missed var_off setting in coerce_subreg_to_size_sx()  [Yonghong Song]
In coerce_subreg_to_size_sx(), for the case where upper sign extension bits are the same for smax32 and smin32 values, we missed to setup properly. This is especially problematic if both smax32 and smin32's sign extension bits are 1. The following is a simple example illustrating the inconsistent verifier states due to missed var_off: 0: (85) call bpf_get_prandom_u32#7 ; R0_w=scalar() 1: (bf) r3 = r0 ; R0_w=scalar(id=1) R3_w=scalar(id=1) 2: (57) r3 &= 15 ; R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf)) 3: (47) r3 |= 128 ; R3_w=scalar(smin=umin=smin32=umin32=128,smax=umax=smax32=umax32=143,var_off=(0x80; 0xf)) 4: (bc) w7 = (s8)w3 REG INVARIANTS VIOLATION (alu): range bounds violation u64=[0xffffff80, 0x8f] s64=[0xffffff80, 0x8f] u32=[0xffffff80, 0x8f] s32=[0x80, 0xffffff8f] var_off=(0x80, 0xf) The var_off=(0x80, 0xf) is not correct, and the correct one should be var_off=(0xffffff80; 0xf) since from insn 3, we know that at insn 4, the sign extension bits will be 1. This patch fixed this issue by setting var_off properly. Fixes: 8100928c8814 ("bpf: Support new sign-extension mov insns") Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240615174632.3995278-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
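The corrected var_off can be checked with a quick userspace illustration (this is not verifier code, just plain C showing the effect of an 8-bit sign extension on values in [128, 143]):

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      for (uint32_t w3 = 128; w3 <= 143; w3++) {
          uint32_t w7 = (uint32_t)(int8_t)w3;  /* w7 = (s8)w3 */

          /* every result lies in 0xffffff80..0xffffff8f, i.e. the upper
           * 24 bits are all ones -> var_off=(0xffffff80; 0xf) */
          printf("w3=%#05x -> w7=%#010x\n", (unsigned)w3, (unsigned)w7);
      }
      return 0;
  }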
2024-06-17  bpf: Add missed var_off setting in set_sext32_default_val()  [Yonghong Song]
Zac reported a verification failure and Alexei reproduced the issue with a simple reproducer ([1]). The verification failure is due to missed setting for var_off. The following is the reproducer in [1]: 0: R1=ctx() R10=fp0 0: (71) r3 = *(u8 *)(r10 -387) ; R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) R10=fp0 1: (bc) w7 = (s8)w3 ; R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) R7_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=127,var_off=(0x0; 0x7f)) 2: (36) if w7 >= 0x2533823b goto pc-3 mark_precise: frame0: last_idx 2 first_idx 0 subseq_idx -1 mark_precise: frame0: regs=r7 stack= before 1: (bc) w7 = (s8)w3 mark_precise: frame0: regs=r3 stack= before 0: (71) r3 = *(u8 *)(r10 -387) 2: R7_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=127,var_off=(0x0; 0x7f)) 3: (b4) w0 = 0 ; R0_w=0 4: (95) exit Note that after insn 1, the var_off for R7 is (0x0; 0x7f). This is not correct since upper 24 bits of w7 could be 0 or 1. So correct var_off should be (0x0; 0xffffffff). Missing var_off setting in set_sext32_default_val() caused later incorrect analysis in zext_32_to_64(dst_reg) and reg_bounds_sync(dst_reg). To fix the issue, set var_off correctly in set_sext32_default_val(). The correct reg state after insn 1 becomes: 1: (bc) w7 = (s8)w3 ; R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) R7_w=scalar(smin=0,smax=umax=0xffffffff,smin32=-128,smax32=127,var_off=(0x0; 0xffffffff)) and at insn 2, the verifier correctly determines either branch is possible. [1] https://lore.kernel.org/bpf/CAADnVQLPU0Shz7dWV4bn2BgtGdxN3uFHPeobGBA72tpg5Xoykw@mail.gmail.com/ Fixes: 8100928c8814 ("bpf: Support new sign-extension mov insns") Reported-by: Zac Ecob <zacecob@protonmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240615174626.3994813-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-17  cpu/hotplug, x86/acpi: Disable CPU offlining for ACPI MADT wakeup  [Kirill A. Shutemov]
ACPI MADT doesn't allow to offline a CPU after it has been woken up. Currently, CPU hotplug is prevented based on the confidential computing attribute which is set for Intel TDX. But TDX is not the only possible user of the wake up method. Any platform that uses ACPI MADT wakeup method cannot offline CPU. Disable CPU offlining on ACPI MADT wakeup enumeration. This has no visible effects for users: currently, TDX guest is the only platform that uses the ACPI MADT wakeup method. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Tao Liu <ltao@redhat.com> Link: https://lore.kernel.org/r/20240614095904.1345461-5-kirill.shutemov@linux.intel.com
2024-06-17  cpu/hotplug: Add support for declaring CPU offlining not supported  [Kirill A. Shutemov]
The ACPI MADT mailbox wakeup method doesn't allow to offline a CPU after it has been woken up. Currently, offlining is prevented based on the confidential computing attribute which is set for Intel TDX. But TDX is not the only possible user of the wake up method. The MADT wakeup can be implemented outside of a confidential computing environment. Offline support is a property of the wakeup method, not the CoCo implementation. Introduce cpu_hotplug_disable_offlining() that can be called to indicate that CPU offlining should be disabled. This function is going to replace CC_ATTR_HOTPLUG_DISABLED for ACPI MADT wakeup method. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Tao Liu <ltao@redhat.com> Link: https://lore.kernel.org/r/20240614095904.1345461-4-kirill.shutemov@linux.intel.com
2024-06-17  irqdomain: Remove __irq_domain_add()  [Herve Codina]
__irq_domain_add() has been replaced by irq_domain_instantiate() and so it is no longer used. Simply remove it. Signed-off-by: Herve Codina <herve.codina@bootlin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240614173232.1184015-21-herve.codina@bootlin.com
2024-06-17  irqdomain: Convert domain creation functions to irq_domain_instantiate()  [Herve Codina]
Domain creation functions use __irq_domain_add(). With the introduction of irq_domain_instantiate(), __irq_domain_add() becomes obsolete. In order to fully remove __irq_domain_add(), convert the domain creation functions to irq_domain_instantiate(). Signed-off-by: Herve Codina <herve.codina@bootlin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240614173232.1184015-19-herve.codina@bootlin.com
2024-06-17  irqdomain: Add a resource managed version of irq_domain_instantiate()  [Herve Codina]
Add a devres version of irq_domain_instantiate(). Signed-off-by: Herve Codina <herve.codina@bootlin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240614173232.1184015-17-herve.codina@bootlin.com
2024-06-17  irqdomain: Add support for generic irq chips creation before publishing a domain  [Herve Codina]
The current API functions create an irq_domain and also publish this newly created domain. Once an irq_domain is published, consumers can request IRQs in order to use them. Some interrupt controller drivers have to perform additional operations on the created irq_domain in order to have it ready to be used. For instance: - Allocate generic irq chips with irq_alloc_domain_generic_chips() - Retrieve the generic irq chips with irq_get_domain_generic_chip() - Initialize retrieved chips: set register base address and offsets, set several hooks such as irq_mask, irq_unmask, ... With the newly introduced irq_domain_alloc_generic_chips(), an interrupt controller driver can use the irq_domain_chip_generic_info structure and set the init() hook to perform its generic chips initialization. In order to avoid a window where the domain is published but not yet ready to be used, handle the generic chip creation (i.e. the irq_domain_alloc_generic_chips() call) before the domain is published. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Herve Codina <herve.codina@bootlin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240614173232.1184015-16-herve.codina@bootlin.com
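A hedged sketch of the resulting driver-side pattern; the structure and field names follow this series but should be checked against the final headers:

  static int my_gc_init(struct irq_chip_generic *gc)
  {
      /* set register base/offsets and the irq_mask/irq_unmask hooks here,
       * before the domain becomes visible to consumers */
      return 0;
  }

  static int my_intc_init_domain(struct fwnode_handle *fwnode)
  {
      struct irq_domain_chip_generic_info dgc_info = {
          .name          = "my-intc",
          .handler       = handle_level_irq,
          .irqs_per_chip = 32,
          .num_ct        = 1,
          .init          = my_gc_init,   /* runs before the domain is published */
      };
      struct irq_domain_info info = {
          .fwnode    = fwnode,
          .size      = 32,
          .hwirq_max = 32,
          .ops       = &irq_generic_chip_ops,
          .dgc_info  = &dgc_info,
      };
      struct irq_domain *d = irq_domain_instantiate(&info);

      return IS_ERR(d) ? PTR_ERR(d) : 0;
  }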
2024-06-17  genirq/generic_chip: Introduce init() and exit() hooks  [Herve Codina]
Most generic chip drivers need to perform additional initialization on the allocated generic chips before they are fully ready. These additional initializations need to be performed before the IRQ domain is published to avoid a race condition between IRQ consumers and suppliers. Introduce the init() hook to perform these initializations at the right place just after the generic chip creation. Also introduce the exit() hook to allow reverting operations done by the init() hook just before the generic chip is destroyed. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Herve Codina <herve.codina@bootlin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240614173232.1184015-15-herve.codina@bootlin.com
2024-06-17  genirq/generic_chip: Introduce irq_domain_{alloc,remove}_generic_chips()  [Herve Codina]
The existing __irq_alloc_domain_generic_chips() uses a bunch of parameters to describe the generic chips that need to be allocated. Adding more parameters and wrappers to hide new parameters in the existing code leads to more and more code without any relevant values and without any flexibility. Introduce irq_domain_alloc_generic_chips() where the generic chips description is done using the irq_domain_chip_generic_info structure instead of the bunch of parameters to allow flexibility and easy evolution. Also introduce irq_domain_remove_generic_chips() to revert the operations done by irq_domain_alloc_generic_chips(). Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Herve Codina <herve.codina@bootlin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240614173232.1184015-14-herve.codina@bootlin.com
2024-06-17  irqdomain: Introduce init() and exit() hooks  [Herve Codina]
The current API does not allow additional initialization before the domain is published. This can lead to a race condition between consumers and suppliers as a domain can be available to consumers before being fully ready. Introduce the init() hook to allow additional initialization before publishing the domain. Also introduce the exit() hook to revert operations done in init() on domain removal. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Herve Codina <herve.codina@bootlin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240614173232.1184015-13-herve.codina@bootlin.com
2024-06-17  irqdomain: Handle domain bus token in irq_domain_create()  [Herve Codina]
irq_domain_update_bus_token() is the only way to set the domain bus token. This is sub-optimal as irq_domain_update_bus_token() can be called only once the domain is created and needs to revert some operations, change the domain name and redo the operations. In order to avoid this revert/change/redo sequence, take the domain bus token into account during domain creation. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Herve Codina <herve.codina@bootlin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240614173232.1184015-12-herve.codina@bootlin.com
2024-06-17  irqdomain: Make __irq_domain_create() return an error code  [Herve Codina]
__irq_domain_create() can fail for several reasons. When it fails it returns a NULL pointer and so loses the exact failure reason. The only user of __irq_domain_create() is irq_domain_instantiate(), which can return a PTR_ERR value. On __irq_domain_create() failure, it uses an arbitrary error code. Rather than using this arbitrary error value, make __irq_domain_create() return its own error code and use that one. [ tglx: Remove the pointless ERR_CAST. domain is a valid return pointer ] Signed-off-by: Herve Codina <herve.codina@bootlin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240614173232.1184015-11-herve.codina@bootlin.com
2024-06-17  irqdomain: Use irq_domain_instantiate() for hierarchy domain creation  [Herve Codina]
irq_domain_instantiate() now handles everything irq_domain_create_hierarchy() needs. Avoid code duplication and use irq_domain_instantiate() directly for hierarchy domain creation. Signed-off-by: Herve Codina <herve.codina@bootlin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240614173232.1184015-10-herve.codina@bootlin.com
2024-06-17  irqdomain: Handle domain hierarchy parent in irq_domain_instantiate()  [Herve Codina]
To use irq_domain_instantiate() from irq_domain_create_hierarchy(), irq_domain_instantiate() needs to handle the domain hierarchy parent. Add the required functionality. Signed-off-by: Herve Codina <herve.codina@bootlin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240614173232.1184015-9-herve.codina@bootlin.com
2024-06-17  irqdomain: Handle additional domain flags in irq_domain_instantiate()  [Herve Codina]
In order to use irq_domain_instantiate() from several places such as irq_domain_create_hierarchy(), irq_domain_instantiate() needs to handle additional domain flags. Add the required infrastructure. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Herve Codina <herve.codina@bootlin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240614173232.1184015-8-herve.codina@bootlin.com
2024-06-17  irqdomain: Convert __irq_domain_create() to use struct irq_domain_info  [Herve Codina]
The existing __irq_domain_create() uses a bunch of parameters to create an irq domain. With the introduction of the irq_domain_info structure, these parameters are available in the information structure itself. Using this information structure directly allows future flexibility to add other parameters in a simple way without the need to change the __irq_domain_create() prototype. Convert __irq_domain_create() to use the information structure. [ tglx: Fixup struct initializer ] Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Herve Codina <herve.codina@bootlin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240614173232.1184015-7-herve.codina@bootlin.com
2024-06-17  irqdomain: Use a dedicated function to set the domain name  [Herve Codina]
The interrupt domain name computation and setting is directly done in __irq_domain_create(). This leads to a quite long __irq_domain_create() function. In order to simplify __irq_domain_create() and isolate the domain name computation and setting, move the related operations to a dedicated function. Signed-off-by: Herve Codina <herve.codina@bootlin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240614173232.1184015-6-herve.codina@bootlin.com
2024-06-17  irqdomain: Introduce irq_domain_instantiate()  [Herve Codina]
The existing irq_domain_add_*() functions used to instantiate an IRQ domain are wrappers built on top of __irq_domain_add() and describe the domain properties using a bunch of parameters. Adding more parameters and wrappers to hide new parameters in the existing code leads to more and more code without any relevant value and without any flexibility. Introduce irq_domain_instantiate() where the interrupt domain properties are given using an irq_domain_info structure instead of the bunch of parameters to allow flexibility and easy evolution. irq_domain_instantiate() performs the same operation as the one done by __irq_domain_add(). For compatibility reasons with existing code, keep __irq_domain_add() but convert it to irq_domain_instantiate(). [ tglx: Fixed up struct initializer coding style ] Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Herve Codina <herve.codina@bootlin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240614173232.1184015-3-herve.codina@bootlin.com
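Conceptually, the compatibility shim then boils down to filling an irq_domain_info from the old parameter list; a hedged sketch (field names assumed from this series):

  struct irq_domain *__irq_domain_add(struct fwnode_handle *fwnode, unsigned int size,
                                      irq_hw_number_t hwirq_max, int direct_max,
                                      const struct irq_domain_ops *ops, void *host_data)
  {
      struct irq_domain_info info = {
          .fwnode     = fwnode,
          .size       = size,
          .hwirq_max  = hwirq_max,
          .direct_max = direct_max,
          .ops        = ops,
          .host_data  = host_data,
      };
      struct irq_domain *d = irq_domain_instantiate(&info);

      /* preserve the old NULL-on-error convention for existing callers */
      return IS_ERR(d) ? NULL : d;
  }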
2024-06-17  irqdomain: Introduce irq_domain_free()  [Herve Codina]
In preparation of the introduction of the irq domain instantiation, introduce irq_domain_free() to avoid code duplication on later modifications. This new function is an extraction of the current operations performed to free the irq domain. No functional change intended. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Herve Codina <herve.codina@bootlin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240614173232.1184015-2-herve.codina@bootlin.com
2024-06-17  irqdomain: Fixed unbalanced fwnode get and put  [Herve Codina]
fwnode_handle_get(fwnode) is called when a domain is created with fwnode passed as a function parameter. fwnode_handle_put(domain->fwnode) is called when the domain is destroyed but during the creation a path exists that does not set domain->fwnode. If this path is taken, the fwnode get will never be put. To avoid the unbalanced get and put, set domain->fwnode unconditionally. Fixes: d59f6617eef0 ("genirq: Allow fwnode to carry name information only") Signed-off-by: Herve Codina <herve.codina@bootlin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240614173232.1184015-4-herve.codina@bootlin.com
2024-06-17  cpu/hotplug: Reverse order of iteration in freeze_secondary_cpus()  [Stanislav Spassov]
Whenever CPU hotplug state callbacks are registered, the startup callback is invoked on CPUs that have already reached the provided state in order of ascending CPU IDs. In freeze_secondary_cpus() the teardown of CPUs happens in that same ascending order. This is known to make a difference in the current implementation of these callbacks in arch/x86/events/intel/uncore.c: - uncore_event_cpu_online() designates the first CPU it is invoked for on each package as the uncore event collector for that package - uncore_event_cpu_offline(), if the CPU being offlined is the event collector for its package, transfers that responsibility over to the next (by ascending CPU id) one in the same package With the current order of CPU teardowns in freeze_secondary_cpus(), the latter ends up doing the ownership transfer work on every single CPU. That work involves a synchronize_rcu() call, ultimately unnecessarily degrading the performance of CPU offlining. To address this make freeze_secondary_cpus() iterate through the CPUs in reverse order, so that the teardown happens in order of descending CPU IDs. [ tglx: Massage change log ] Signed-off-by: Stanislav Spassov <stanspas@amazon.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240524160449.48594-1-stanspas@amazon.de
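A hedged sketch of the reversed teardown loop (upstream may structure the iteration differently, e.g. via a dedicated reverse iterator):

  /* Tear secondary CPUs down from the highest id to the lowest so the
   * per-package uncore collector role is handed over at most once. */
  int cpu, error = 0;

  for (cpu = (int)nr_cpu_ids - 1; cpu >= 0; cpu--) {
      if (!cpu_online(cpu) || cpu == primary)
          continue;

      error = _cpu_down(cpu, 1, CPUHP_OFFLINE);
      if (error)
          break;
      cpumask_set_cpu(cpu, frozen_cpus);
  }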
2024-06-17  smp: Use str_plural() to fix Coccinelle warnings  [Thorsten Blum]
Fixes the following two Coccinelle/coccicheck warnings reported by string_choices.cocci: opportunity for str_plural(num_cpus) opportunity for str_plural(num_nodes) Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20240508154225.309703-2-thorsten.blum@toblux.com
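The replacement pattern is a one-liner; str_plural() (include/linux/string_choices.h) returns "s" unless the count is exactly 1. The message text below is illustrative, not the exact smp.c wording:

  pr_info("Brought up %u CPU%s on %u node%s\n",
          num_cpus, str_plural(num_cpus),
          num_nodes, str_plural(num_nodes));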
2024-06-17  cpu/hotplug: Fix dynstate assignment in __cpuhp_setup_state_cpuslocked()  [Yuntao Wang]
Commit 4205e4786d0b ("cpu/hotplug: Provide dynamic range for prepare stage") added a dynamic range for the prepare states, but did not handle the assignment of the dynstate variable in __cpuhp_setup_state_cpuslocked(). This causes the corresponding startup callback not to be invoked when calling __cpuhp_setup_state_cpuslocked() with the CPUHP_BP_PREPARE_DYN parameter, even though it should be. Currently, the users of __cpuhp_setup_state_cpuslocked(), for one reason or another, have not triggered this bug. Fixes: 4205e4786d0b ("cpu/hotplug: Provide dynamic range for prepare stage") Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240515134554.427071-1-ytcoode@gmail.com
2024-06-15  kcov: don't lose track of remote references during softirqs  [Aleksandr Nogikh]
In kcov_remote_start()/kcov_remote_stop(), we swap the previous KCOV metadata of the current task into a per-CPU variable. However, the kcov_mode_enabled(mode) check is not sufficient in the case of remote KCOV coverage: current->kcov_mode always remains KCOV_MODE_DISABLED for remote KCOV objects. If the original task that has invoked the KCOV_REMOTE_ENABLE ioctl happens to get interrupted and kcov_remote_start() is called, it ultimately leads to kcov_remote_stop() NOT restoring the original KCOV reference. So when the task exits, all registered remote KCOV handles remain active forever. The most uncomfortable effect (at least for syzkaller) is that the bug prevents the reuse of the same /sys/kernel/debug/kcov descriptor. If we obtain it in the parent process and then e.g. drop some capabilities and continuously fork to execute individual programs, at some point current->kcov of the forked process is lost, kcov_task_exit() takes no action, and all KCOV_REMOTE_ENABLE ioctls calls from subsequent forks fail. And, yes, the efficiency is also affected if we keep on losing remote kcov objects. a) kcov_remote_map keeps on growing forever. b) (If I'm not mistaken), we're also not freeing the memory referenced by kcov->area. Fix it by introducing a special kcov_mode that is assigned to the task that owns a KCOV remote object. It makes kcov_mode_enabled() return true and yet does not trigger coverage collection in __sanitizer_cov_trace_pc() and write_comp_data(). [nogikh@google.com: replace WRITE_ONCE() with an ordinary assignment] Link: https://lkml.kernel.org/r/20240614171221.2837584-1-nogikh@google.com Link: https://lkml.kernel.org/r/20240611133229.527822-1-nogikh@google.com Fixes: 5ff3b30ab57d ("kcov: collect coverage from interrupts") Signed-off-by: Aleksandr Nogikh <nogikh@google.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Tested-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Marco Elver <elver@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-15  gcov: add support for GCC 14  [Peter Oberparleiter]
Using gcov on kernels compiled with GCC 14 results in truncated 16-byte long .gcda files with no usable data. To fix this, update GCOV_COUNTERS to match the value defined by GCC 14. Tested with GCC versions 14.1.0 and 13.2.0. Link: https://lkml.kernel.org/r/20240610092743.1609845-1-oberpar@linux.ibm.com Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com> Reported-by: Allison Henderson <allison.henderson@oracle.com> Reported-by: Chuck Lever III <chuck.lever@oracle.com> Tested-by: Chuck Lever <chuck.lever@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-15  zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING  [Oleg Nesterov]
kernel_wait4() doesn't sleep and returns -EINTR if there is no eligible child and signal_pending() is true. That is why zap_pid_ns_processes() clears TIF_SIGPENDING but this is not enough, it should also clear TIF_NOTIFY_SIGNAL to make signal_pending() return false and avoid a busy-wait loop. Link: https://lkml.kernel.org/r/20240608120616.GB7947@redhat.com Fixes: 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL") Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Rachel Menge <rachelmenge@linux.microsoft.com> Closes: https://lore.kernel.org/all/1386cd49-36d0-4a5c-85e9-bc42056a5a38@linux.microsoft.com/ Reviewed-by: Boqun Feng <boqun.feng@gmail.com> Tested-by: Wei Fu <fuweid89@gmail.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Cc: Allen Pais <apais@linux.microsoft.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Joel Granados <j.granados@samsung.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mike Christie <michael.christie@oracle.com> Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Zqiang <qiang.zhang1211@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
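A hedged sketch of the resulting wait loop in zap_pid_ns_processes():

  /* Clear both signal_pending() sources so kernel_wait4() can actually
   * sleep instead of spinning on -EINTR. */
  int rc;

  do {
      clear_thread_flag(TIF_SIGPENDING);
      clear_thread_flag(TIF_NOTIFY_SIGNAL);   /* the previously missing piece */
      rc = kernel_wait4(-1, NULL, __WALL, NULL);
  } while (rc != -ECHILD);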
2024-06-14  kunit: test: Add vm_mmap() allocation resource manager  [Kees Cook]
For tests that need to allocate using vm_mmap() (e.g. usercopy and execve), provide the interface to have the allocation tracked by KUnit itself. This requires bringing up a placeholder userspace mm. This combines my earlier attempt at this with Mark Rutland's version[1]. Normally alloc_mm() and arch_pick_mmap_layout() aren't exported for modules, so export these only for KUnit testing. Link: https://lore.kernel.org/lkml/20230321122514.1743889-2-mark.rutland@arm.com/ [1] Co-developed-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-06-14  Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  [Jakub Kicinski]
Daniel Borkmann says: ==================== pull-request: bpf 2024-06-14 We've added 8 non-merge commits during the last 2 day(s) which contain a total of 9 files changed, 92 insertions(+), 11 deletions(-). The main changes are: 1) Silence a syzkaller splat under CONFIG_DEBUG_NET=y in pskb_pull_reason() triggered via __bpf_try_make_writable(), from Florian Westphal. 2) Fix removal of kfuncs during linking phase which then throws a kernel build warning via resolve_btfids about unresolved symbols, from Tony Ambardar. 3) Fix a UML x86_64 compilation failure from BPF as pcpu_hot symbol is not available on User Mode Linux, from Maciej Żenczykowski. 4) Fix a register corruption in reg_set_min_max triggering an invariant violation in BPF verifier, from Daniel Borkmann. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Harden __bpf_kfunc tag against linker kfunc removal compiler_types.h: Define __retain for __attribute__((__retain__)) bpf: Avoid splat in pskb_pull_reason bpf: fix UML x86_64 compile failure selftests/bpf: Add test coverage for reg_set_min_max handling bpf: Reduce stack consumption in check_stack_write_fixed_off bpf: Fix reg_set_min_max corruption of fake_reg MAINTAINERS: mailmap: Update Stanislav's email address ==================== Link: https://lore.kernel.org/r/20240614203223.26500-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-14  bpf: Track delta between "linked" registers.  [Alexei Starovoitov]
Compilers can generate the code
  r1 = r2
  r1 += 0x1
  if r2 < 1000 goto ...
  use knowledge of r2 range in subsequent r1 operations
So remember the constant delta between r2 and r1 and update r1 after the 'if' condition. Unfortunately LLVM still uses this pattern for loops with the 'can_loop' construct:
  for (i = 0; i < 1000 && can_loop; i++)
The "undo" pass was introduced in LLVM (https://reviews.llvm.org/D121937) to prevent this optimization, but it cannot cover all cases. Instead of fighting the middle-end optimizer in the BPF backend, teach the verifier about this pattern. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240613013815.953-3-alexei.starovoitov@gmail.com