linux-stable.git/kernel/cpu.c, branch v6.5.2

cpu/hotplug: Fix off by one in cpuhp_bringup_mask()

2023-05-23T16:06:40+00:00

cpuhp_bringup_mask() iterates over a cpumask and starts all present CPUs up
to a caller provided upper limit.

The limit variable is decremented and checked for 0 before invoking
cpu_up(), which is obviously off by one and prevents the bringup of the
last CPU when the limit is equal to the number of present CPUs.

Move the decrement and check after the cpu_up() invocation.

Fixes: 18415f33e2ac ("cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATE")
Reported-by: Mark Brown 
Signed-off-by: Thomas Gleixner 
Tested-by: Mark Brown 
Link: https://lore.kernel.org/r/87wn10ufj9.ffs@tglx

cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATE

2023-05-15T11:45:02+00:00

There is often significant latency in the early stages of CPU bringup, and
time is wasted by waking each CPU (e.g. with SIPI/INIT/INIT on x86) and
then waiting for it to respond before moving on to the next.

Allow a platform to enable parallel setup which brings all to be onlined
CPUs up to the CPUHP_BP_KICK_AP state. While this state advancement on the
control CPU (BP) is single-threaded the important part is the last state
CPUHP_BP_KICK_AP which wakes the to be onlined CPUs up.

This allows the CPUs to run up to the first sychronization point
cpuhp_ap_sync_alive() where they wait for the control CPU to release them
one by one for the full onlining procedure.

This parallelism depends on the CPU hotplug core sync mechanism which
ensures that the parallel brought up CPUs wait for release before touching
any state which would make the CPU visible to anything outside the hotplug
control mechanism.

To handle the SMT constraints of X86 correctly the bringup happens in two
iterations when CONFIG_HOTPLUG_SMT is enabled. The control CPU brings up
the primary SMT threads of each core first, which can load the microcode
without the need to rendevouz with the thread siblings. Once that's
completed it brings up the secondary SMT threads.

Co-developed-by: David Woodhouse 
Signed-off-by: David Woodhouse 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Tested-by: Michael Kelley 
Tested-by: Oleksandr Natalenko 
Tested-by: Helge Deller  # parisc
Tested-by: Guilherme G. Piccoli  # Steam Deck
Link: https://lore.kernel.org/r/20230512205257.240231377@linutronix.de

cpu/hotplug: Provide a split up CPUHP_BRINGUP mechanism

2023-05-15T11:45:01+00:00

The bring up logic of a to be onlined CPU consists of several parts, which
are considered to be a single hotplug state:

  1) Control CPU issues the wake-up

  2) To be onlined CPU starts up, does the minimal initialization,
     reports to be alive and waits for release into the complete bring-up.

  3) Control CPU waits for the alive report and releases the upcoming CPU
     for the complete bring-up.

Allow to split this into two states:

  1) Control CPU issues the wake-up

     After that the to be onlined CPU starts up, does the minimal
     initialization, reports to be alive and waits for release into the
     full bring-up. As this can run after the control CPU dropped the
     hotplug locks the code which is executed on the AP before it reports
     alive has to be carefully audited to not violate any of the hotplug
     constraints, especially not modifying any of the various cpumasks.

     This is really only meant to avoid waiting for the AP to react on the
     wake-up. Of course an architecture can move strict CPU related setup
     functionality, e.g. microcode loading, with care before the
     synchronization point to save further pointless waiting time.

  2) Control CPU waits for the alive report and releases the upcoming CPU
     for the complete bring-up.

This allows that the two states can be split up to run all to be onlined
CPUs up to state #1 on the control CPU and then at a later point run state
#2. This spares some of the latencies of the full serialized per CPU
bringup by avoiding the per CPU wakeup/wait serialization. The assumption
is that the first AP already waits when the last AP has been woken up. This
obvioulsy depends on the hardware latencies and depending on the timings
this might still not completely eliminate all wait scenarios.

This split is just a preparatory step for enabling the parallel bringup
later. The boot time bringup is still fully serialized. It has a separate
config switch so that architectures which want to support parallel bringup
can test the split of the CPUHP_BRINGUG step separately.

To enable this the architecture must support the CPU hotplug core sync
mechanism and has to be audited that there are no implicit hotplug state
dependencies which require a fully serialized bringup.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Tested-by: Michael Kelley 
Tested-by: Oleksandr Natalenko 
Tested-by: Helge Deller  # parisc
Tested-by: Guilherme G. Piccoli  # Steam Deck
Link: https://lore.kernel.org/r/20230512205257.080801387@linutronix.de

cpu/hotplug: Reset task stack state in _cpu_up()

2023-05-15T11:45:00+00:00

Commit dce1ca0525bf ("sched/scs: Reset task stack state in bringup_cpu()")
ensured that the shadow call stack and KASAN poisoning were removed from
a CPU's stack each time that CPU is brought up, not just once.

This is not incorrect. However, with parallel bringup the idle thread setup
will happen at a different step. As a consequence the cleanup in
bringup_cpu() would be too late.

Move the SCS/KASAN cleanup to the generic _cpu_up() function instead,
which already ensures that the new CPU's stack is available, purely to
allow for early failure. This occurs when the CPU to be brought up is
in the CPUHP_OFFLINE state, which should correctly do the cleanup any
time the CPU has been taken down to the point where such is needed.

Signed-off-by: David Woodhouse 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Mark Rutland 
Tested-by: Mark Rutland 
Tested-by: Michael Kelley 
Tested-by: Oleksandr Natalenko 
Tested-by: Helge Deller  # parisc
Tested-by: Guilherme G. Piccoli  # Steam Deck
Link: https://lore.kernel.org/r/20230512205257.027075560@linutronix.de

cpu/hotplug: Add CPU state tracking and synchronization

2023-05-15T11:44:55+00:00

The CPU state tracking and synchronization mechanism in smpboot.c is
completely independent of the hotplug code and all logic around it is
implemented in architecture specific code.

Except for the state reporting of the AP there is absolutely nothing
architecture specific and the sychronization and decision functions can be
moved into the generic hotplug core code.

Provide an integrated variant and add the core synchronization and decision
points. This comes in two flavours:

  1) DEAD state synchronization

     Updated by the architecture code once the AP reaches the point where
     it is ready to be torn down by the control CPU, e.g. by removing power
     or clocks or tear down via the hypervisor.

     The control CPU waits for this state to be reached with a timeout. If
     the state is reached an architecture specific cleanup function is
     invoked.

  2) Full state synchronization

     This extends #1 with AP alive synchronization. This is new
     functionality, which allows to replace architecture specific wait
     mechanims, e.g. cpumasks, completely.

     It also prevents that an AP which is in a limbo state can be brought
     up again. This can happen when an AP failed to report dead state
     during a previous off-line operation.

The dead synchronization is what most architectures use. Only x86 makes a
bringup decision based on that state at the moment.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Tested-by: Michael Kelley 
Tested-by: Oleksandr Natalenko 
Tested-by: Helge Deller  # parisc
Tested-by: Guilherme G. Piccoli  # Steam Deck
Link: https://lore.kernel.org/r/20230512205256.476305035@linutronix.de

cpu/hotplug: Rework sparse_irq locking in bringup_cpu()

2023-05-15T11:44:54+00:00

There is no harm to hold sparse_irq lock until the upcoming CPU completes
in cpuhp_online_idle(). This allows to remove cpu_online() synchronization
from architecture code.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Tested-by: Michael Kelley 
Tested-by: Oleksandr Natalenko 
Tested-by: Helge Deller  # parisc
Tested-by: Guilherme G. Piccoli  # Steam Deck
Link: https://lore.kernel.org/r/20230512205256.263722880@linutronix.de

cpu/hotplug: Mark arch_disable_smp_support() and bringup_nonboot_cpus() __init

2023-05-15T11:44:47+00:00

No point in keeping them around.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Tested-by: Michael Kelley 
Tested-by: Oleksandr Natalenko 
Tested-by: Helge Deller  # parisc
Tested-by: Guilherme G. Piccoli  # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.551974164@linutronix.de

Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

2023-04-28T02:42:02+00:00

Pull MM updates from Andrew Morton:

- Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.

- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
Raghav.

- zsmalloc performance improvements from Sergey Senozhatsky.

- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.

- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page()
- make __filemap_get_folio()'s return value more useful

- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.

- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.

- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).

- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.

- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
rather than exclusive, and has fixed a bunch of errors which were
caused by its unintuitive meaning.

- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.

- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.

- Cleanups to do_fault_around() by Lorenzo Stoakes.

- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.

- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.

- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.

- Lorenzo has also contributed further cleanups of vma_merge().

- Chaitanya Prakash provides some fixes to the mmap selftesting code.

- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.

- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.

- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.

- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.

- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.

- Yosry Ahmed has improved the performance of memcg statistics
flushing.

- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.

- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.

- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.

- Pankaj Raghav has made page_endio() unneeded and has removed it.

- Peter Xu contributed some rationalizations of the userfaultfd
selftests.

- Yosry Ahmed has fixed an issue around memcg's page recalim
accounting.

- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.

- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.

- Peter Xu fixes a few issues with uffd-wp at fork() time.

- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.

* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
shmem: restrict noswap option to initial user namespace
mm/khugepaged: fix conflicting mods to collapse_file()
sparse: remove unnecessary 0 values from rc
mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
maple_tree: fix allocation in mas_sparse_area()
mm: do not increment pgfault stats when page fault handler retries
zsmalloc: allow only one active pool compaction context
selftests/mm: add new selftests for KSM
mm: add new KSM process and sysfs knobs
mm: add new api to enable ksm per process
mm: shrinkers: fix debugfs file permissions
mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
migrate_pages_batch: fix statistics for longterm pin retry
userfaultfd: use helper function range_in_vma()
lib/show_mem.c: use for_each_populated_zone() simplify code
mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
fs/buffer: convert create_page_buffers to folio_create_buffers
fs/buffer: add folio_create_empty_buffers helper
...

lazy tlb: introduce lazy tlb mm refcount helper functions

2023-03-28T23:20:08+00:00

Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting. 
This makes the lazy tlb mm references more obvious, and allows the
refcounting scheme to be modified in later changes.  There is no
functional change with this patch.

Link: https://lkml.kernel.org/r/20230203071837.1136453-3-npiggin@gmail.com
Signed-off-by: Nicholas Piggin 
Acked-by: Linus Torvalds 
Cc: Andy Lutomirski 
Cc: Catalin Marinas 
Cc: Christophe Leroy 
Cc: Dave Hansen 
Cc: Michael Ellerman 
Cc: Nadav Amit 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Will Deacon 
Signed-off-by: Andrew Morton

cpu/hotplug: move to use bus_get_dev_root()

2023-03-17T14:29:19+00:00

Direct access to the struct bus_type dev_root pointer is going away soon
so replace that with a call to bus_get_dev_root() instead, which is what
it is there for.

Cc: Thomas Gleixner 
Cc: Valentin Schneider 
Cc: Phil Auld 
Cc: Steven Price 
Cc: Juri Lelli 
Cc: Greg Kroah-Hartman 
Cc: Vincent Donnefort 
Cc: Kuppuswamy Sathyanarayanan 
Cc: "Jason A. Donenfeld" 
Link: https://lore.kernel.org/r/20230313182918.1312597-7-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman