linux.git/arch/powerpc, branch v6.6

Merge tag 'powerpc-6.6-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

2023-10-27T15:40:42+00:00

Pull powerpc fixes from Michael Ellerman:

 - Fix boot crash with FLATMEM since set_ptes() introduction

 - Avoid calling arch_enter/leave_lazy_mmu() in set_ptes()

Thanks to Aneesh Kumar K.V and Erhard Furtner.

* tag 'powerpc-6.6-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes
  powerpc/mm: Fix boot crash with FLATMEM

powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes

2023-10-25T05:08:46+00:00

With commit 9fee28baa601 ("powerpc: implement the new page table range
API") we added set_ptes to powerpc architecture. The implementation
included calling arch_enter/leave_lazy_mmu() calls.

The patch removes the usage of arch_enter/leave_lazy_mmu() because
set_pte is not supposed to be used when updating a pte entry. Powerpc
architecture uses this rule to skip the expensive tlb invalidate which
is not needed when you are setting up the pte for the first time. See
commit 56eecdb912b5 ("mm: Use ptep/pmdp_set_numa() for updating
_PAGE_NUMA bit") for more details

The patch also makes sure we are not using the interface to update a
valid/present pte entry by adding VM_WARN_ON check all the ptes we
are setting up. Furthermore, we add a comment to set_pte_filter to
clarify it can only update folio-related flags and cannot filter
pfn specific details in pte filtering.

Removal of arch_enter/leave_lazy_mmu() also will avoid nesting of
these functions that are not supported. For ex:

remap_pte_range()
  -> arch_enter_lazy_mmu()
  -> set_ptes()
      -> arch_enter_lazy_mmu()
      -> arch_leave_lazy_mmu()
  -> arch_leave_lazy_mmu()

Fixes: 9fee28baa601 ("powerpc: implement the new page table range API")
Signed-off-by: "Aneesh Kumar K.V" 
Signed-off-by: Michael Ellerman 
Link: https://msgid.link/20231024143604.16749-1-aneesh.kumar@linux.ibm.com

powerpc/mm: Fix boot crash with FLATMEM

2023-10-23T11:50:15+00:00

Erhard reported that his G5 was crashing with v6.6-rc kernels:

  mpic: Setting up HT PICs workarounds for U3/U4
  BUG: Unable to handle kernel data access at 0xfeffbb62ffec65fe
  Faulting instruction address: 0xc00000000005dc40
  Oops: Kernel access of bad area, sig: 11 [#1]
  BE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2 PowerMac
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper/0 Tainted: G                T  6.6.0-rc3-PMacGS #1
  Hardware name: PowerMac11,2 PPC970MP 0x440101 PowerMac
  NIP:  c00000000005dc40 LR: c000000000066660 CTR: c000000000007730
  REGS: c0000000022bf510 TRAP: 0380   Tainted: G                T (6.6.0-rc3-PMacGS)
  MSR:  9000000000001032   CR: 44004242  XER: 00000000
  IRQMASK: 3
  GPR00: 0000000000000000 c0000000022bf7b0 c0000000010c0b00 00000000000001ac
  GPR04: 0000000003c80000 0000000000000300 c0000000f20001ae 0000000000000300
  GPR08: 0000000000000006 feffbb62ffec65ff 0000000000000001 0000000000000000
  GPR12: 9000000000001032 c000000002362000 c000000000f76b80 000000000349ecd8
  GPR16: 0000000002367ba8 0000000002367f08 0000000000000006 0000000000000000
  GPR20: 00000000000001ac c000000000f6f920 c0000000022cd985 000000000000000c
  GPR24: 0000000000000300 00000003b0a3691d c0003e008030000e 0000000000000000
  GPR28: c00000000000000c c0000000f20001ee feffbb62ffec65fe 00000000000001ac
  NIP hash_page_do_lazy_icache+0x50/0x100
  LR  __hash_page_4K+0x420/0x590
  Call Trace:
    hash_page_mm+0x364/0x6f0
    do_hash_fault+0x114/0x2b0
    data_access_common_virt+0x198/0x1f0
  --- interrupt: 300 at mpic_init+0x4bc/0x10c4
  NIP:  c000000002020a5c LR: c000000002020a04 CTR: 0000000000000000
  REGS: c0000000022bf9f0 TRAP: 0300   Tainted: G                T (6.6.0-rc3-PMacGS)
  MSR:  9000000000001032   CR: 24004248  XER: 00000000
  DAR: c0003e008030000e DSISR: 40000000 IRQMASK: 1
  ...
  NIP mpic_init+0x4bc/0x10c4
  LR  mpic_init+0x464/0x10c4
  --- interrupt: 300
    pmac_setup_one_mpic+0x258/0x2dc
    pmac_pic_init+0x28c/0x3d8
    init_IRQ+0x90/0x140
    start_kernel+0x57c/0x78c
    start_here_common+0x1c/0x20

A bisect pointed to the breakage beginning with commit 9fee28baa601 ("powerpc:
implement the new page table range API").

Analysis of the oops pointed to a struct page with a corrupted
compound_head being loaded via page_folio() -> _compound_head() in
hash_page_do_lazy_icache().

The access by the mpic code is to an MMIO address, so the expectation
is that the struct page for that address would be initialised by
init_unavailable_range(), as pointed out by Aneesh.

Instrumentation showed that was not the case, which eventually lead to
the realisation that pfn_valid() was returning false for that address,
causing the struct page to not be initialised.

Because the system is using FLATMEM, the version of pfn_valid() in
memory_model.h is used:

static inline int pfn_valid(unsigned long pfn)
{
	...
	return pfn >= pfn_offset && (pfn - pfn_offset) < max_mapnr;
}

Which relies on max_mapnr being initialised. Early in boot max_mapnr is
zero meaning no PFNs are valid.

max_mapnr is initialised in mem_init() called via:

  start_kernel()
    mm_core_init()  # init/main.c:928
      mem_init()

But that is too late for the usage in init_unavailable_range() called via:

  start_kernel()
    setup_arch()    # init/main.c:893
      paging_init()
        free_area_init()
          init_unavailable_range()

Although max_mapnr is currently set in mem_init(), the value is actually
already available much earlier, as soon as mem_topology_setup() has
completed, which is also before paging_init() is called. So move the
initialisation there, which causes paging_init() to correctly initialise
the struct page and fixes the bug.

This bug seems to have been lurking for years, but went unnoticed
because the pre-folio code was inspecting the uninitialised page->flags
but not dereferencing it.

Thanks to Erhard and Aneesh for help debugging.

Reported-by: Erhard Furtner 
Closes: https://lore.kernel.org/all/20230929132750.3cd98452@yea/
Signed-off-by: Michael Ellerman 
Link: https://msgid.link/20231023112500.1550208-1-mpe@ellerman.id.au

Merge tag 'powerpc-6.6-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

2023-10-22T01:46:47+00:00

Pull powerpc fixes from Michael Ellerman:

 - Fix stale propagated yield_cpu in qspinlocks leading to lockups

 - Fix broken hugepages on some configs due to ARCH_FORCE_MAX_ORDER

 - Fix a spurious warning when copros are in use at exit time

Thanks to Nicholas Piggin, Christophe Leroy, Nysal Jan K.A Sachin Sant,
and Shrikanth Hegde.

* tag 'powerpc-6.6-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/qspinlock: Fix stale propagated yield_cpu
  powerpc/64s/radix: Don't warn on copros in radix__tlb_flush()
  powerpc/mm: Allow ARCH_FORCE_MAX_ORDER up to 12

powerpc/qspinlock: Fix stale propagated yield_cpu

2023-10-18T10:07:21+00:00

yield_cpu is a sample of a preempted lock holder that gets propagated
back through the queue. Queued waiters use this to yield to the
preempted lock holder without continually sampling the lock word (which
would defeat the purpose of MCS queueing by bouncing the cache line).

The problem is that yield_cpu can become stale. It can take some time to
be passed down the chain, and if any queued waiter gets preempted then
it will cease to propagate the yield_cpu to later waiters.

This can result in yielding to a CPU that no longer holds the lock,
which is bad, but particularly if it is currently in H_CEDE (idle),
then it appears to be preempted and some hypervisors (PowerVM) can
cause very long H_CONFER latencies waiting for H_CEDE wakeup. This
results in latency spikes and hard lockups on oversubscribed
partitions with lock contention.

This is a minimal fix. Before yielding to yield_cpu, sample the lock
word to confirm yield_cpu is still the owner, and bail out of it is not.

Thanks to a bunch of people who reported this and tracked down the
exact problem using tracepoints and dispatch trace logs.

Fixes: 28db61e207ea ("powerpc/qspinlock: allow propagation of yield CPU down the queue")
Cc: stable@vger.kernel.org # v6.2+
Reported-by: Srikar Dronamraju 
Reported-by: Laurent Dufour 
Reported-by: Shrikanth Hegde 
Debugged-by: "Nysal Jan K.A" 
Signed-off-by: Nicholas Piggin 
Tested-by: Shrikanth Hegde 
Signed-off-by: Michael Ellerman 
Link: https://msgid.link/20231016124305.139923-2-npiggin@gmail.com

powerpc/64s/radix: Don't warn on copros in radix__tlb_flush()

2023-10-17T22:45:52+00:00

Sachin reported a warning when running the inject-ra-err selftest:

  # selftests: powerpc/mce: inject-ra-err
  Disabling lock debugging due to kernel taint
  MCE: CPU19: machine check (Severe)  Real address Load/Store (foreign/control memory) [Not recovered]
  MCE: CPU19: PID: 5254 Comm: inject-ra-err NIP: [0000000010000e48]
  MCE: CPU19: Initiator CPU
  MCE: CPU19: Unknown
  ------------[ cut here ]------------
  WARNING: CPU: 19 PID: 5254 at arch/powerpc/mm/book3s64/radix_tlb.c:1221 radix__tlb_flush+0x160/0x180
  CPU: 19 PID: 5254 Comm: inject-ra-err Kdump: loaded Tainted: G   M        E      6.6.0-rc3-00055-g9ed22ae6be81 #4
  Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1030.20 (NH1030_058) hv:phyp pSeries
  ...
  NIP radix__tlb_flush+0x160/0x180
  LR  radix__tlb_flush+0x104/0x180
  Call Trace:
    radix__tlb_flush+0xf4/0x180 (unreliable)
    tlb_finish_mmu+0x15c/0x1e0
    exit_mmap+0x1a0/0x510
    __mmput+0x60/0x1e0
    exit_mm+0xdc/0x170
    do_exit+0x2bc/0x5a0
    do_group_exit+0x4c/0xc0
    sys_exit_group+0x28/0x30
    system_call_exception+0x138/0x330
    system_call_vectored_common+0x15c/0x2ec

And bisected it to commit e43c0a0c3c28 ("powerpc/64s/radix: combine
final TLB flush and lazy tlb mm shootdown IPIs"), which added a warning
in radix__tlb_flush() if mm->context.copros is still elevated.

However it's possible for the copros count to be elevated if a process
exits without first closing file descriptors that are associated with a
copro, eg. VAS.

If the process exits with a VAS file still open, the release callback
is queued up for exit_task_work() via:
  exit_files()
    put_files_struct()
      close_files()
        filp_close()
          fput()

And called via:
  exit_task_work()
    ____fput()
      __fput()
        file->f_op->release(inode, file)
          coproc_release()
            vas_user_win_ops->close_win()
              vas_deallocate_window()
                mm_context_remove_vas_window()
                  mm_context_remove_copro()

But that is after exit_mm() has been called from do_exit() and triggered
the warning.

Fix it by dropping the warning, and always calling __flush_all_mm().

In the normal case of no copros, that will result in a call to
_tlbiel_pid(mm->context.id, RIC_FLUSH_ALL) just as the current code
does.

If the copros count is elevated then it will cause a global flush, which
should flush translations from any copros. Note that the process table
entry was cleared in arch_exit_mmap(), so copros should not be able to
fetch any new translations.

Fixes: e43c0a0c3c28 ("powerpc/64s/radix: combine final TLB flush and lazy tlb mm shootdown IPIs")
Reported-by: Sachin Sant 
Closes: https://lore.kernel.org/all/A8E52547-4BF1-47CE-8AEA-BC5A9D7E3567@linux.ibm.com/
Signed-off-by: Nicholas Piggin 
Signed-off-by: Michael Ellerman 
Tested-by: Sachin Sant 
Link: https://msgid.link/20231017121527.1574104-1-mpe@ellerman.id.au

Merge tag 'powerpc-6.6-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

2023-10-15T15:48:53+00:00

Pull powerpc fixes from Michael Ellerman:

 - Fix softlockup/crash when using hcall tracing

 - Fix pte_access_permitted() for PAGE_NONE on 8xx

 - Fix inverted pte_young() test in __ptep_test_and_clear_young()
   on 64-bit BookE

 - Fix unhandled math emulation exception on 85xx

 - Fix kernel crash on syscall return on 476

Thanks to Athira Rajeev, Christophe Leroy, Eddie James, and Naveen N
Rao.

* tag 'powerpc-6.6-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/47x: Fix 47x syscall return crash
  powerpc/85xx: Fix math emulation exception
  powerpc/64e: Fix wrong test in __ptep_test_and_clear_young()
  powerpc/8xx: Fix pte_access_permitted() for PAGE_NONE
  powerpc/pseries: Remove unused r0 in the hcall tracing code
  powerpc/pseries: Fix STK_PARAM access in the hcall tracing code

powerpc/mm: Allow ARCH_FORCE_MAX_ORDER up to 12

2023-10-15T09:55:03+00:00

Christophe reported that the change to ARCH_FORCE_MAX_ORDER to limit the
range to 10 had broken his ability to configure hugepages:

  # echo 1 > /sys/kernel/mm/hugepages/hugepages-8192kB/nr_hugepages
  sh: write error: Invalid argument

Several of the powerpc defconfigs previously set the
ARCH_FORCE_MAX_ORDER value to 12, via the definition in
arch/powerpc/configs/fsl-emb-nonhw.config, used by:

  mpc85xx_defconfig
  mpc85xx_smp_defconfig
  corenet32_smp_defconfig
  corenet64_smp_defconfig
  mpc86xx_defconfig
  mpc86xx_smp_defconfig

Fix it by increasing the allowed range to 12 to restore the previous
behaviour.

Fixes: 358e526a1648 ("powerpc/mm: Reinstate ARCH_FORCE_MAX_ORDER ranges")
Reported-by: Christophe Leroy 
Closes: https://lore.kernel.org/all/8011d806-5b30-bf26-2bfe-a08c39d57e20@csgroup.eu/
Tested-by: Christophe Leroy 
Signed-off-by: Michael Ellerman 
Link: https://msgid.link/20230824122849.942072-1-mpe@ellerman.id.au

powerpc/47x: Fix 47x syscall return crash

2023-10-10T22:31:26+00:00

Eddie reported that newer kernels were crashing during boot on his 476
FSP2 system:

  kernel tried to execute user page (b7ee2000) - exploit attempt? (uid: 0)
  BUG: Unable to handle kernel instruction fetch
  Faulting instruction address: 0xb7ee2000
  Oops: Kernel access of bad area, sig: 11 [#1]
  BE PAGE_SIZE=4K FSP-2
  Modules linked in:
  CPU: 0 PID: 61 Comm: mount Not tainted 6.1.55-d23900f.ppcnf-fsp2 #1
  Hardware name: ibm,fsp2 476fpe 0x7ff520c0 FSP-2
  NIP:  b7ee2000 LR: 8c008000 CTR: 00000000
  REGS: bffebd83 TRAP: 0400   Not tainted (6.1.55-d23900f.ppcnf-fs p2)
  MSR:  00000030   CR: 00001000  XER: 20000000
  GPR00: c00110ac bffebe63 bffebe7e bffebe88 8c008000 00001000 00000d12 b7ee2000
  GPR08: 00000033 00000000 00000000 c139df10 48224824 1016c314 10160000 00000000
  GPR16: 10160000 10160000 00000008 00000000 10160000 00000000 10160000 1017f5b0
  GPR24: 1017fa50 1017f4f0 1017fa50 1017f740 1017f630 00000000 00000000 1017f4f0
  NIP [b7ee2000] 0xb7ee2000
  LR [8c008000] 0x8c008000
  Call Trace:
  Instruction dump:
  XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
  XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
  ---[ end trace 0000000000000000 ]---

The problem is in ret_from_syscall where the check for
icache_44x_need_flush is done. When the flush is needed the code jumps
out-of-line to do the flush, and then intends to jump back to continue
the syscall return.

However the branch back to label 1b doesn't return to the correct
location, instead branching back just prior to the return to userspace,
causing bogus register values to be used by the rfi.

The breakage was introduced by commit 6f76a01173cc
("powerpc/syscall: implement system call entry/exit logic in C for PPC32") which
inadvertently removed the "1" label and reused it elsewhere.

Fix it by adding named local labels in the correct locations. Note that
the return label needs to be outside the ifdef so that CONFIG_PPC_47x=n
compiles.

Fixes: 6f76a01173cc ("powerpc/syscall: implement system call entry/exit logic in C for PPC32")
Cc: stable@vger.kernel.org # v5.12+
Reported-by: Eddie James 
Tested-by: Eddie James 
Link: https://lore.kernel.org/linuxppc-dev/fdaadc46-7476-9237-e104-1d2168526e72@linux.ibm.com/
Signed-off-by: Michael Ellerman 
Reviewed-by: Christophe Leroy 
Link: https://msgid.link/20231010114750.847794-1-mpe@ellerman.id.au

powerpc/85xx: Fix math emulation exception

2023-10-10T10:32:40+00:00

Booting mpc85xx_defconfig kernel on QEMU leads to:

Bad trap at PC: fe9bab0, SR: 2d000, vector=800
awk[82]: unhandled trap (5) at 0 nip fe9bab0 lr fe9e01c code 5 in libc-2.27.so[fe5a000+17a000]
awk[82]: code: 3aa00000 3a800010 4bffe03c 9421fff0 7ca62b78 38a00000 93c10008 83c10008
awk[82]: code: 38210010 4bffdec8 9421ffc0 7c0802a6  d8010008 4815190d 93810030
Trace/breakpoint trap
WARNING: no useful console

This is because allthough CONFIG_MATH_EMULATION is selected,
Exception 800 calls unknown_exception().

Call emulation_assist_interrupt() instead.

Signed-off-by: Christophe Leroy 
Signed-off-by: Michael Ellerman 
Link: https://msgid.link/066caa6d9480365da9b8ed83692d7101e10ac5f8.1695657339.git.christophe.leroy@csgroup.eu