summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2024-02-28cgroup/cpuset: Fix a memory leak in update_exclusive_cpumask()Waiman Long
Fix a possible memory leak in update_exclusive_cpumask() by moving the alloc_cpumasks() down after the validate_change() check which can fail and still before the temporary cpumasks are needed. Fixes: e2ffe502ba45 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2") Reported-and-tested-by: Mirsad Todorovac <mirsad.todorovac@alu.hr> Closes: https://lore.kernel.org/lkml/14915689-27a3-4cd8-80d2-9c30d0c768b6@alu.unizg.hr Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org # v6.7+
2024-02-28pidfd: move struct pidfd_fopsChristian Brauner
Move the pidfd file operations over to their own file in preparation of implementing pidfs and to isolate them from other mostly unrelated functionality in other files. Link: https://lore.kernel.org/r/20240213-vfs-pidfd_fs-v1-1-f863f58cfce1@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-28sched/topology: Rename SD_SHARE_PKG_RESOURCES to SD_SHARE_LLCAlex Shi
SD_SHARE_PKG_RESOURCES is a bit of a misnomer: its naming suggests that it's sharing all 'package resources' - while in reality it's specifically for sharing the LLC only. Rename it to SD_SHARE_LLC to reduce confusion. [ mingo: Rewrote the confusing changelog as well. ] Suggested-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Alex Shi <alexs@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Reviewed-by: Barry Song <baohua@kernel.org> Link: https://lore.kernel.org/r/20240210113924.1130448-5-alexs@kernel.org
2024-02-28sched/fair: Check the SD_ASYM_PACKING flag in sched_use_asym_prio()Alex Shi
sched_use_asym_prio() checks whether CPU priorities should be used. It makes sense to check for the SD_ASYM_PACKING() inside the function. Since both sched_asym() and sched_group_asym() use sched_use_asym_prio(), remove the now superfluous checks for the flag in various places. Signed-off-by: Alex Shi <alexs@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240210113924.1130448-4-alexs@kernel.org
2024-02-28sched/fair: Rework sched_use_asym_prio() and sched_asym_prefer()Alex Shi
sched_use_asym_prio() and sched_asym_prefer() are used together in various places. Consolidate them into a single function sched_asym(). The existing sched_asym() function is only used when collecting statistics of a scheduling group. Rename it as sched_group_asym(), and remove the obsolete function description. This makes the code easier to read. No functional changes. Signed-off-by: Alex Shi <alexs@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240210113924.1130448-3-alexs@kernel.org
2024-02-28sched/fair: Remove unused parameter from sched_asym()Alex Shi
The 'sds' argument is not used in the sched_asym() function anymore, remove it. Fixes: c9ca07886aaa ("sched/fair: Do not even the number of busy CPUs via asym_packing") Signed-off-by: Alex Shi <alexs@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240210113924.1130448-2-alexs@kernel.org
2024-02-28sched/topology: Remove duplicate descriptions from TOPOLOGY_SD_FLAGSAlex Shi
These flags are already documented in include/linux/sched/sd_flags.h. Also, add missing SD_CLUSTER and keep the comment on SD_ASYM_PACKING as it is a special case. Suggested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Alex Shi <alexs@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240210113924.1130448-1-alexs@kernel.org
2024-02-28sched/fair: Simplify the update_sd_pick_busiest() logicDavid Vernet
When comparing the current struct sched_group with the yet-busiest domain in update_sd_pick_busiest(), if the two groups have the same group type, we're currently doing a bit of unnecessary work for any group >= group_misfit_task. We're comparing the two groups, and then returning only if false (the group in question is not the busiest). Otherwise, we break out, do an extra unnecessary conditional check that's vacuously false for any group type > group_fully_busy, and then always return true. Let's just return directly in the switch statement instead. This doesn't change the size of vmlinux with llvm 17 (not surprising given that all of this is inlined in load_balance()), but it does shrink load_balance() by 88 bytes on x86. Given that it also improves readability, this seems worth doing. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20240206043921.850302-4-void@manifault.com
2024-02-28sched/fair: Do strict inequality check for busiest misfit task groupDavid Vernet
In update_sd_pick_busiest(), when comparing two sched groups that are both of type group_misfit_task, we currently consider the new group as busier than the current busiest group even if the new group has the same misfit task load as the current busiest group. We can avoid some unnecessary writes if we instead only consider the newest group to be the busiest if it has a higher load than the current busiest. This matches the behavior of other group types where we compare load, such as two groups that are both overloaded. Let's update the group_misfit_task type comparison to also only update the busiest group in the event of strict inequality. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20240206043921.850302-3-void@manifault.com
2024-02-28sched/fair: Remove unnecessary goto in update_sd_lb_stats()David Vernet
In update_sd_lb_stats(), when we're iterating over the sched groups that comprise a sched domain, we're skipping the call to update_sd_pick_busiest() for the sched group that contains the local / destination CPU. We use a goto to skip the call, but we could just as easily check !local_group, as there's no other logic that we need to skip with the goto. Let's remove the goto, and check for !local_group in the if statement instead. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20240206043921.850302-2-void@manifault.com
2024-02-28sched/fair: Take the scheduling domain into account in select_idle_core()Keisuke Nishimura
When picking a CPU on task wakeup, select_idle_core() has to take into account the scheduling domain where the function looks for the CPU. This is because the "isolcpus" kernel command line option can remove CPUs from the domain to isolate them from other SMT siblings. This change replaces the set of CPUs allowed to run the task from p->cpus_ptr by the intersection of p->cpus_ptr and sched_domain_span(sd) which is stored in the 'cpus' argument provided by select_idle_cpu(). Fixes: 9fe1f127b913 ("sched/fair: Merge select_idle_core/cpu()") Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr> Signed-off-by: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240110131707.437301-2-keisuke.nishimura@inria.fr
2024-02-28sched/fair: Take the scheduling domain into account in select_idle_smt()Keisuke Nishimura
When picking a CPU on task wakeup, select_idle_smt() has to take into account the scheduling domain of @target. This is because the "isolcpus" kernel command line option can remove CPUs from the domain to isolate them from other SMT siblings. This fix checks if the candidate CPU is in the target scheduling domain. Commit: df3cb4ea1fb6 ("sched/fair: Fix wrong cpu selecting from isolated domain") ... originally introduced this fix by adding the check of the scheduling domain in the loop. However, commit: 3e6efe87cd5cc ("sched/fair: Remove redundant check in select_idle_smt()") ... accidentally removed the check. Bring it back. Fixes: 3e6efe87cd5c ("sched/fair: Remove redundant check in select_idle_smt()") Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr> Signed-off-by: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240110131707.437301-1-keisuke.nishimura@inria.fr
2024-02-28sched/fair: Add READ_ONCE() and use existing helper function to access ->avg_irqShrikanth Hegde
Use existing helper function cpu_util_irq() instead of open-coding access to ->avg_irq. During review it was noted that ->avg_irq could be updated by a different CPU than the one which is trying to access it. ->avg_irq is updated with WRITE_ONCE(), use READ_ONCE to access it in order to avoid any compiler optimizations. Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240101154624.100981-3-sshegde@linux.vnet.ibm.com
2024-02-28sched/fair: Use existing helper functions to access ->avg_rt and ->avg_dlShrikanth Hegde
There are helper functions called cpu_util_dl() and cpu_util_rt() which give the average utilization of DL and RT respectively. But there are a few places in code where access to these variables is open-coded. Instead use the helper function so that code becomes simpler and easier to maintain later on. No functional changes intended. Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240101154624.100981-2-sshegde@linux.vnet.ibm.com
2024-02-28dma-direct: Leak pages on dma_set_decrypted() failureRick Edgecombe
On TDX it is possible for the untrusted host to cause set_memory_encrypted() or set_memory_decrypted() to fail such that an error is returned and the resulting memory is shared. Callers need to take care to handle these errors to avoid returning decrypted (shared) memory to the page allocator, which could lead to functional or security issues. DMA could free decrypted/shared pages if dma_set_decrypted() fails. This should be a rare case. Just leak the pages in this case instead of freeing them. Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-02-28swiotlb: add debugfs to track swiotlb transient pool usageZhangPeng
Introduce a new debugfs interface io_tlb_transient_nslabs. The device driver can create a new swiotlb transient memory pool once default memory pool is full. To export the swiotlb transient memory pool usage via debugfs would help the user estimate the size of transient swiotlb memory pool or analyze device driver memory leak issue. Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-02-28locking/percpu-rwsem: Trigger contention tracepoints only if contendedNamhyung Kim
We mistakenly always fire lock contention tracepoints in the writer path, while it should be conditional on the trylock result. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20231108215322.2845536-1-namhyung@kernel.org
2024-02-28locking/rwsem: Clarify that RWSEM_READER_OWNED is just a hintWaiman Long
Clarify in the comments that the RWSEM_READER_OWNED bit in the owner field is just a hint, not an authoritative state of the rwsem. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Boqun Feng <boqun.feng@gmail.com> Link: https://lore.kernel.org/r/20240222150540.79981-4-longman@redhat.com
2024-02-28locking/qspinlock: Fix 'wait_early' set but not used warningWaiman Long
When CONFIG_LOCK_EVENT_COUNTS is off, the wait_early variable will be set but not used. This is expected. Recent compilers will not generate wait_early code in this case. Add the __maybe_unused attribute to wait_early for suppressing this W=1 warning. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Boqun Feng <boqun.feng@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240222150540.79981-2-longman@redhat.com Closes: https://lore.kernel.org/oe-kbuild-all/202312260422.f4pK3f9m-lkp@intel.com/
2024-02-27time: test: Fix incorrect format specifierDavid Gow
'days' is a s64 (from div_s64), and so should use a %lld specifier. This was found by extending KUnit's assertion macros to use gcc's __printf attribute. Fixes: 276010551664 ("time: Improve performance of time64_to_tm()") Signed-off-by: David Gow <davidgow@google.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-02-27Merge branch 'x86/urgent' into x86/apic, to resolve conflictsIngo Molnar
Conflicts: arch/x86/kernel/cpu/common.c arch/x86/kernel/cpu/intel.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-02-27smp: Avoid 'setup_max_cpus' namespace collision/shadowingIngo Molnar
bringup_nonboot_cpus() gets passed the 'setup_max_cpus' variable in init/main.c - which is also the name of the parameter, shadowing the name. To reduce confusion and to allow the 'setup_max_cpus' value to be #defined in the <linux/smp.h> header, use the 'max_cpus' name for the function parameter name. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org
2024-02-26Merge branches 'rcu-doc.2024.02.14a', 'rcu-nocb.2024.02.14a', ↵Boqun Feng
'rcu-exp.2024.02.14a', 'rcu-tasks.2024.02.26a' and 'rcu-misc.2024.02.14a' into rcu.2024.02.26a
2024-02-26timers: Assert no next dyntick timer look-up while CPU is offlineFrederic Weisbecker
The next timer (re-)evaluation, with the purpose of entering/updating the dyntick mode, can happen from 3 sites and none of them are relevant while the CPU is offline: 1) The idle loop: a) From the quick check helping the cpuidle governor to heuristically predict the best C-state. b) While stopping the tick. But if the CPU is offline, the tick has been cancelled and there is consequently no need to further stop the tick. 2) Remote expiry: when a CPU remotely expires global timers on behalf of another CPU, the latter target's next timer is re-evaluated afterwards. However remote expîry doesn't happen on offline CPUs. 3) IRQ exit: on nohz_full mode, the tick is (re-)evaluated on IRQ exit. But full dynticks is disabled on offline CPUs. Therefore it is safe to assume that no next dyntick timer lookup can be performed on offline CPUs. Assert this expectation to report any surprise. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-17-frederic@kernel.org
2024-02-26tick: Assume timekeeping is correctly handed over upon last offline idle callFrederic Weisbecker
The timekeeping duty is handed over from the outgoing CPU on stop machine, then the oneshot tick is stopped right after. Therefore it's guaranteed that the current CPU isn't the timekeeper upon its last call to idle. Besides, calling tick_nohz_idle_stop_tick() while the dying CPU goes into idle suggests that the tick is going to be stopped while it is actually stopped already from the appropriate CPU hotplug state. Remove the confusing call and the obsolete case handling and convert it to a sanity check that verifies the above assumption. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-16-frederic@kernel.org
2024-02-26tick: Shut down low-res tick from dying CPUFrederic Weisbecker
The timekeeping duty is handed over from the outgoing CPU within stop machine. This works well if CONFIG_NO_HZ_COMMON=n or the tick is in high-res mode. However in low-res dynticks mode, the tick isn't cancelled until the clockevent is shut down, which can happen later. The tick may therefore fire again once IRQs are re-enabled on stop machine and until IRQs are disabled for good upon the last call to idle. That's so many opportunities for a timekeeper to go idle and the outgoing CPU to take over that duty. This is why tick_nohz_idle_stop_tick() is called one last time on idle if the CPU is seen offline: so that the timekeeping duty is handed over again in case the CPU has re-taken the duty. This means there are two timekeeping handovers on CPU down hotplug with different undocumented constraints and purposes: 1) A handover on stop machine for !dynticks || highres. All online CPUs are guaranteed to be non-idle and the timekeeping duty can be safely handed-over. The hrtimer tick is cancelled so it is guaranteed that in dynticks mode the outgoing CPU won't take again the duty. 2) A handover on last idle call for dynticks && lowres. Setting the duty to TICK_DO_TIMER_NONE makes sure that a CPU will take over the timekeeping. Prepare for consolidating the handover to a single place (the first one) with shutting down the low-res tick as well from tick_cancel_sched_timer() as well. This will simplify the handover and unify the tick cancellation between high-res and low-res. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-15-frederic@kernel.org
2024-02-26tick: Split nohz and highres features from nohz_modeFrederic Weisbecker
The nohz mode field tells about low resolution nohz mode or high resolution nohz mode but it doesn't tell about high resolution non-nohz mode. In order to retrieve the latter state, tick_cancel_sched_timer() must fiddle with struct hrtimer's internals to guess if the tick has been initialized in high resolution. Move instead the nohz mode field information into the tick flags and provide two new bits: one to know if the tick is in nohz mode and another one to know if the tick is in high resolution. The combination of those two flags provides all the needed informations to determine which of the three tick modes is running. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-14-frederic@kernel.org
2024-02-26tick: Move individual bit features to debuggable mask accessesFrederic Weisbecker
The individual bitfields of struct tick_sched must be modified from IRQs disabled places, otherwise local modifications can race due to them sharing the same memory storage. The recent move of the "got_idle_tick" bitfield to its own storage shows that the use of these bitfields, as pretty as they look, can be as much error prone. In order to avoid future issues of the like and make sure that those bitfields are safely accessed, move those flags to an explicit mask along with a mutator function performing the basic IRQs disabled sanity check. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-13-frederic@kernel.org
2024-02-26tick: Move got_idle_tick away from common flagsFrederic Weisbecker
tick_nohz_idle_got_tick() is called by cpuidle_reflect() within the idle loop with interrupts enabled. This function modifies the struct tick_sched's bitfield "got_idle_tick". However this bitfield is stored within the same mask as other bitfields that can be modified from interrupts. Fortunately so far it looks like the only race that can happen is while writing ->got_idle_tick to 0, an interrupt fires and writes the ->idle_active field to 0. It's then possible that the interrupted write to ->got_idle_tick writes back the old value of ->idle_active back to 1. However if that happens, the worst possible outcome is that the time spent between that interrupt and the upcoming call to tick_nohz_idle_exit() is accounted as idle, which is negligible quantity. Still all the bitfield writes within this struct tick_sched's shadow mask should be IRQ-safe. Therefore move this bitfield out to its own storage to avoid further suprises. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-12-frederic@kernel.org
2024-02-26tick: Assume the tick can't be stopped in NOHZ_MODE_INACTIVE modeFrederic Weisbecker
The full-nohz update function checks if the nohz mode is active before proceeding. It considers one exception though: if the tick is already stopped even though the nohz mode is inactive, it still moves on in order to update/restart the tick if needed. However in order for the tick to be stopped, the nohz_mode has to be either NOHZ_MODE_LOWRES or NOHZ_MODE_HIGHRES. Therefore it doesn't make sense to test if the tick is stopped before verifying NOHZ_MODE_INACTIVE mode. Remove the needless related condition. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-11-frederic@kernel.org
2024-02-26tick: Move broadcast cancellation up to CPUHP_AP_TICK_DYINGFrederic Weisbecker
The broadcast shutdown code is executed through a random explicit call within stop machine from the outgoing CPU. However the tick broadcast is a midware between the tick callback and the clocksource, therefore it makes more sense to shut it down after the tick callback and before the clocksource drivers. Move it instead to the common tick shutdown CPU hotplug state where related operations can be ordered from highest to lowest level. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-10-frederic@kernel.org
2024-02-26tick: Move tick cancellation up to CPUHP_AP_TICK_DYINGFrederic Weisbecker
The tick hrtimer is cancelled right before hrtimers are migrated. This is done from the hrtimer subsystem even though it shouldn't know about its actual users. Move instead the tick hrtimer cancellation to the relevant CPU hotplug state that aims at centralizing high level tick shutdown operations so that the related flow is easy to follow. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-9-frederic@kernel.org
2024-02-26tick: Start centralizing tick related CPU hotplug operationsFrederic Weisbecker
During the CPU offlining process, the various timer tick features are shut down from scattered places, sometimes from teardown callbacks on stop machine, sometimes through explicit calls, sometimes from the control CPU after the CPU died. The reason why these shutdown operations are spread around is not always clear and it makes the tick lifecycle hard to follow. The tick should be shut down in order from highest to lowest level: On stop machine from the dying CPU (high-level): 1) Hand-over the timekeeping duty (tick_handover_do_timer()) 2) Cancel the tick implementation called by the clockevent callback (tick_cancel_sched_timer()) 3) Shutdown broadcasting (tick_offline_cpu() / tick_broadcast_offline()) On stop machine from the dying CPU (low-level): 4) Shutdown clockevents drivers (CPUHP_AP_*_TIMER_STARTING states) From the control CPU after the CPU died (low-level): 5) Shutdown/unregister/cleanup clockevents for the dead CPU (tick_cleanup_dead_cpu()) Instead the current order is 2, 4 (both from CPU hotplug states), then 1 and 3 through direct calls. This layout and order don't make much sense. The operations 1, 2, 3 should be gathered together and in order. Sort this situation with creating a new TICK shut-down CPU hotplug state and start with introducing the timekeeping duty hand-over there. The state must precede hrtimers migration because the tick hrtimer will be stopped from it in a further patch. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-8-frederic@kernel.org
2024-02-26tick/sched: Don't clear ts::next_tick again in can_stop_idle_tick()Frederic Weisbecker
The tick sched structure is already cleared from tick_cancel_sched_timer(), so there is no need to clear that field again. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-7-frederic@kernel.org
2024-02-26tick/sched: Rename tick_nohz_stop_sched_tick() to tick_nohz_full_stop_tick()Frederic Weisbecker
tick_nohz_stop_sched_tick() is only about NOHZ_full and not about dynticks-idle. Reflect that in the function name to avoid confusion. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-6-frederic@kernel.org
2024-02-26tick: Use IS_ENABLED() whenever possibleFrederic Weisbecker
Avoid ifdeferry if it can be converted to IS_ENABLED() whenever possible Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-5-frederic@kernel.org
2024-02-26tick/sched: Remove useless oneshot ifdefferyFrederic Weisbecker
tick-sched.c is only built when CONFIG_TICK_ONESHOT=y, which is selected only if CONFIG_NO_HZ_COMMON=y or CONFIG_HIGH_RES_TIMERS=y. Therefore the related ifdeferry in this file is needless and can be removed. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-4-frederic@kernel.org
2024-02-26tick/nohz: Remove duplicate between lowres and highres handlersPeng Liu
tick_nohz_lowres_handler() does the same work as tick_nohz_highres_handler() plus the clockevent device reprogramming, so make the former reuse the latter and rename it accordingly. Signed-off-by: Peng Liu <liupeng17@lenovo.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-3-frederic@kernel.org
2024-02-26tick/nohz: Remove duplicate between tick_nohz_switch_to_nohz() and ↵Peng Liu
tick_setup_sched_timer() The ts->sched_timer initialization work of tick_nohz_switch_to_nohz() is almost the same as that of tick_setup_sched_timer(), so adjust the latter to get it reused by tick_nohz_switch_to_nohz(). This also makes the low resolution mode sched_timer benefit from the tick skew boot option. Signed-off-by: Peng Liu <liupeng17@lenovo.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-2-frederic@kernel.org
2024-02-25rcu-tasks: Maintain real-time response in rcu_tasks_postscan()Paul E. McKenney
The current code will scan the entirety of each per-CPU list of exiting tasks in ->rtp_exit_list with interrupts disabled. This is normally just fine, because each CPU typically won't have very many tasks in this state. However, if a large number of tasks block late in do_exit(), these lists could be arbitrarily long. Low probability, perhaps, but it really could happen. This commit therefore occasionally re-enables interrupts while traversing these lists, inserting a dummy element to hold the current place in the list. In kernels built with CONFIG_PREEMPT_RT=y, this re-enabling happens after each list element is processed, otherwise every one-to-two jiffies. [ paulmck: Apply Frederic Weisbecker feedback. ] Link: https://lore.kernel.org/all/ZdeI_-RfdLR8jlsm@localhost.localdomain/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Sebastian Siewior <bigeasy@linutronix.de> Cc: Anna-Maria Behnsen <anna-maria@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-25rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasksPaul E. McKenney
Holding a mutex across synchronize_rcu_tasks() and acquiring that same mutex in code called from do_exit() after its call to exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop() results in deadlock. This is by design, because tasks that are far enough into do_exit() are no longer present on the tasks list, making it a bit difficult for RCU Tasks to find them, let alone wait on them to do a voluntary context switch. However, such deadlocks are becoming more frequent. In addition, lockdep currently does not detect such deadlocks and they can be difficult to reproduce. In addition, if a task voluntarily context switches during that time (for example, if it blocks acquiring a mutex), then this task is in an RCU Tasks quiescent state. And with some adjustments, RCU Tasks could just as well take advantage of that fact. This commit therefore eliminates these deadlock by replacing the SRCU-based wait for do_exit() completion with per-CPU lists of tasks currently exiting. A given task will be on one of these per-CPU lists for the same period of time that this task would previously have been in the previous SRCU read-side critical section. These lists enable RCU Tasks to find the tasks that have already been removed from the tasks list, but that must nevertheless be waited upon. The RCU Tasks grace period gathers any of these do_exit() tasks that it must wait on, and adds them to the list of holdouts. Per-CPU locking and get_task_struct() are used to synchronize addition to and removal from these lists. Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/ Reported-by: Chen Zhongjin <chenzhongjin@huawei.com> Reported-by: Yang Jihong <yangjihong1@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Yang Jihong <yangjihong1@huawei.com> Tested-by: Chen Zhongjin <chenzhongjin@huawei.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-25rcu-tasks: Maintain lists to eliminate RCU-tasks/do_exit() deadlocksPaul E. McKenney
This commit continues the elimination of deadlocks involving do_exit() and RCU tasks by causing exit_tasks_rcu_start() to add the current task to a per-CPU list and causing exit_tasks_rcu_stop() to remove the current task from whatever list it is on. These lists will be used to track tasks that are exiting, while still accounting for any RCU-tasks quiescent states that these tasks pass though. [ paulmck: Apply Frederic Weisbecker feedback. ] Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/ Reported-by: Chen Zhongjin <chenzhongjin@huawei.com> Reported-by: Yang Jihong <yangjihong1@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Yang Jihong <yangjihong1@huawei.com> Tested-by: Chen Zhongjin <chenzhongjin@huawei.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-25rcu-tasks: Initialize data to eliminate RCU-tasks/do_exit() deadlocksPaul E. McKenney
Holding a mutex across synchronize_rcu_tasks() and acquiring that same mutex in code called from do_exit() after its call to exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop() results in deadlock. This is by design, because tasks that are far enough into do_exit() are no longer present on the tasks list, making it a bit difficult for RCU Tasks to find them, let alone wait on them to do a voluntary context switch. However, such deadlocks are becoming more frequent. In addition, lockdep currently does not detect such deadlocks and they can be difficult to reproduce. In addition, if a task voluntarily context switches during that time (for example, if it blocks acquiring a mutex), then this task is in an RCU Tasks quiescent state. And with some adjustments, RCU Tasks could just as well take advantage of that fact. This commit therefore initializes the data structures that will be needed to rely on these quiescent states and to eliminate these deadlocks. Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/ Reported-by: Chen Zhongjin <chenzhongjin@huawei.com> Reported-by: Yang Jihong <yangjihong1@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Yang Jihong <yangjihong1@huawei.com> Tested-by: Chen Zhongjin <chenzhongjin@huawei.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-25rcu-tasks: Initialize callback lists at rcu_init() timePaul E. McKenney
In order for RCU Tasks to reliably maintain per-CPU lists of exiting tasks, those lists must be initialized before it is possible for tasks to exit, especially given that the boot CPU is not necessarily CPU 0 (an example being, powerpc kexec() kernels). And at the time that rcu_init_tasks_generic() is called, a task could potentially exit, unconventional though that sort of thing might be. This commit therefore moves the calls to cblist_init_generic() from functions called from rcu_init_tasks_generic() to a new function named tasks_cblist_init_generic() that is invoked from rcu_init(). This constituted a bug in a commit that never went to mainline, so there is no need for any backporting to -stable. Reported-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-25rcu-tasks: Add data to eliminate RCU-tasks/do_exit() deadlocksPaul E. McKenney
Holding a mutex across synchronize_rcu_tasks() and acquiring that same mutex in code called from do_exit() after its call to exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop() results in deadlock. This is by design, because tasks that are far enough into do_exit() are no longer present on the tasks list, making it a bit difficult for RCU Tasks to find them, let alone wait on them to do a voluntary context switch. However, such deadlocks are becoming more frequent. In addition, lockdep currently does not detect such deadlocks and they can be difficult to reproduce. In addition, if a task voluntarily context switches during that time (for example, if it blocks acquiring a mutex), then this task is in an RCU Tasks quiescent state. And with some adjustments, RCU Tasks could just as well take advantage of that fact. This commit therefore adds the data structures that will be needed to rely on these quiescent states and to eliminate these deadlocks. Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/ Reported-by: Chen Zhongjin <chenzhongjin@huawei.com> Reported-by: Yang Jihong <yangjihong1@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Yang Jihong <yangjihong1@huawei.com> Tested-by: Chen Zhongjin <chenzhongjin@huawei.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-25power: port block device access to fileChristian Brauner
Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-6-adbd023e19cc@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-24sched: Add a new function to compare if two cpus have the same capacityQais Yousef
The new helper function is needed to help blk-mq check if it needs to dispatch the softirq on another CPU to match the performance level the IO requester is running at. This is important on HMP systems where not all CPUs have the same compute capacity. Signed-off-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240223155749.2958009-2-qyousef@layalina.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-23arm64, crash: wrap crash dumping code into crash related ifdefsBaoquan He
Now crash codes under kernel/ folder has been split out from kexec code, crash dumping can be separated from kexec reboot in config items on arm64 with some adjustments. Here wrap up crash dumping codes with CONFIG_CRASH_DUMP ifdeffery. [bhe@redhat.com: fix building error in generic codes] Link: https://lkml.kernel.org/r/20240129135033.157195-2-bhe@redhat.com Link: https://lkml.kernel.org/r/20240124051254.67105-8-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23crash: clean up kdump related config itemsBaoquan He
By splitting CRASH_RESERVE and VMCORE_INFO out from CRASH_CORE, cleaning up the dependency of FA_DMUMP on CRASH_DUMP, and moving crash codes from kexec_core.c to crash_core.c, now we can rearrange CRASH_DUMP to depend on KEXEC_CORE, and make CRASH_DUMP select CRASH_RESERVE and VMCORE_INFO. KEXEC_CORE won't select CRASH_RESERVE and VMCORE_INFO any more because KEXEC_CORE enables codes which allocate control pages, copy kexec/kdump segments, and prepare for switching. These codes are shared by both kexec reboot and crash dumping. Doing this makes codes and the corresponding config items more logical (the right item depends on or is selected by the left item). PROC_KCORE -----------> VMCORE_INFO |----------> VMCORE_INFO FA_DUMP----| |----------> CRASH_RESERVE ---->VMCORE_INFO / |---->CRASH_RESERVE KEXEC --| /| |--> KEXEC_CORE--> CRASH_DUMP-->/-|---->PROC_VMCORE KEXEC_FILE --| \ | \---->CRASH_HOTPLUG KEXEC --| |--> KEXEC_CORE--> kexec reboot KEXEC_FILE --| Link: https://lkml.kernel.org/r/20240124051254.67105-6-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23crash: split crash dumping code out from kexec_core.cBaoquan He
Currently, KEXEC_CORE select CRASH_CORE automatically because crash codes need be built in to avoid compiling error when building kexec code even though the crash dumping functionality is not enabled. E.g -------------------- CONFIG_CRASH_CORE=y CONFIG_KEXEC_CORE=y CONFIG_KEXEC=y CONFIG_KEXEC_FILE=y --------------------- After splitting out crashkernel reservation code and vmcoreinfo exporting code, there's only crash related code left in kernel/crash_core.c. Now move crash related codes from kexec_core.c to crash_core.c and only build it in when CONFIG_CRASH_DUMP=y. And also wrap up crash codes inside CONFIG_CRASH_DUMP ifdeffery scope, or replace inappropriate CONFIG_KEXEC_CORE ifdef with CONFIG_CRASH_DUMP ifdef in generic kernel files. With these changes, crash_core codes are abstracted from kexec codes and can be disabled at all if only kexec reboot feature is wanted. Link: https://lkml.kernel.org/r/20240124051254.67105-5-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>