linux-stable.git/Documentation/kernel-parameters.txt, branch linux-3.2.y

x86/paravirt: Remove 'noreplace-paravirt' cmdline option

2018-03-19T18:58:38+00:00

commit 12c69f1e94c89d40696e83804dd2f0965b5250cd upstream.

The 'noreplace-paravirt' option disables paravirt patching, leaving the
original pv indirect calls in place.

That's highly incompatible with retpolines, unless we want to uglify
paravirt even further and convert the paravirt calls to retpolines.

As far as I can tell, the option doesn't seem to be useful for much
other than introducing surprising corner cases and making the kernel
vulnerable to Spectre v2.  It was probably a debug option from the early
paravirt days.  So just remove it.

Signed-off-by: Josh Poimboeuf 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Juergen Gross 
Cc: Andrea Arcangeli 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Ashok Raj 
Cc: Greg KH 
Cc: Jun Nakajima 
Cc: Tim Chen 
Cc: Rusty Russell 
Cc: Dave Hansen 
Cc: Asit Mallick 
Cc: Andy Lutomirski 
Cc: Linus Torvalds 
Cc: Jason Baron 
Cc: Paolo Bonzini 
Cc: Alok Kataria 
Cc: Arjan Van De Ven 
Cc: David Woodhouse 
Cc: Dan Williams 
Link: https://lkml.kernel.org/r/20180131041333.2x6blhxirc2kclrq@treble
[bwh: Backported to 3.2: adjust filename]
Signed-off-by: Ben Hutchings

x86/spectre: Add boot time option to select Spectre v2 mitigation

2018-03-19T18:58:31+00:00

commit da285121560e769cc31797bba6422eea71d473e0 upstream.

Add a spectre_v2= option to select the mitigation used for the indirect
branch speculation vulnerability.

Currently, the only option available is retpoline, in its various forms.
This will be expanded to cover the new IBRS/IBPB microcode features.

The RETPOLINE_AMD feature relies on a serializing LFENCE for speculation
control. For AMD hardware, only set RETPOLINE_AMD if LFENCE is a
serializing instruction, which is indicated by the LFENCE_RDTSC feature.

[ tglx: Folded back the LFENCE/AMD fixes and reworked it so IBRS
  	integration becomes simple ]

Signed-off-by: David Woodhouse 
Signed-off-by: Thomas Gleixner 
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: Rik van Riel 
Cc: Andi Kleen 
Cc: Josh Poimboeuf 
Cc: thomas.lendacky@amd.com
Cc: Peter Zijlstra 
Cc: Linus Torvalds 
Cc: Jiri Kosina 
Cc: Andy Lutomirski 
Cc: Dave Hansen 
Cc: Kees Cook 
Cc: Tim Chen 
Cc: Greg Kroah-Hartman 
Cc: Paul Turner 
Link: https://lkml.kernel.org/r/1515707194-20531-5-git-send-email-dwmw@amazon.co.uk
[bwh: Backported to 3.2: adjust filename, context]
Signed-off-by: Ben Hutchings

x86/Documentation: Add PTI description

2018-03-19T18:58:27+00:00

commit 01c9b17bf673b05bb401b76ec763e9730ccf1376 upstream.

Add some details about how PTI works, what some of the downsides
are, and how to debug it when things go wrong.

Also document the kernel parameter: 'pti/nopti'.

Signed-off-by: Dave Hansen 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Randy Dunlap 
Reviewed-by: Kees Cook 
Cc: Moritz Lipp 
Cc: Daniel Gruss 
Cc: Michael Schwarz 
Cc: Richard Fellner 
Cc: Andy Lutomirski 
Cc: Linus Torvalds 
Cc: Hugh Dickins 
Cc: Andi Lutomirsky 
Link: https://lkml.kernel.org/r/20180105174436.1BC6FA2B@viggo.jf.intel.com
Signed-off-by: Greg Kroah-Hartman 
Signed-off-by: Ben Hutchings

x86/smp: Don't ever patch back to UP if we unplug cpus

2018-02-13T18:32:14+00:00

commit 816afe4ff98ee10b1d30fd66361be132a0a5cee6 upstream.

We still patch SMP instructions to UP variants if we boot with a
single CPU, but not at any other time.  In particular, not if we
unplug CPUs to return to a single cpu.

Paul McKenney points out:

 mean offline overhead is 6251/48=130.2 milliseconds.

 If I remove the alternatives_smp_switch() from the offline
 path [...] the mean offline overhead is 550/42=13.1 milliseconds

Basically, we're never going to get those 120ms back, and the
code is pretty messy.

We get rid of:

 1) The "smp-alt-once" boot option. It's actually "smp-alt-boot", the
    documentation is wrong. It's now the default.

 2) The skip_smp_alternatives flag used by suspend.

 3) arch_disable_nonboot_cpus_begin() and arch_disable_nonboot_cpus_end()
    which were only used to set this one flag.

Signed-off-by: Rusty Russell 
Cc: Paul McKenney 
Cc: Suresh Siddha 
Cc: Linus Torvalds 
Cc: Andrew Morton 
Cc: Peter Zijlstra 
Link: http://lkml.kernel.org/r/87vcgwwive.fsf@rustcorp.com.au
Signed-off-by: Ingo Molnar 
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings

x86/kaiser: Check boottime cmdline params

2018-01-07T01:46:52+00:00

AMD (and possibly other vendors) are not affected by the leak
KAISER is protecting against.

Keep the "nopti" for traditional reasons and add pti=
like upstream.

Signed-off-by: Borislav Petkov 
Signed-off-by: Greg Kroah-Hartman 
Signed-off-by: Ben Hutchings

x86/kaiser: Rename and simplify X86_FEATURE_KAISER handling

2018-01-07T01:46:52+00:00

Concentrate it in arch/x86/mm/kaiser.c and use the upstream string "nopti".

Signed-off-by: Borislav Petkov 
Signed-off-by: Greg Kroah-Hartman 
Signed-off-by: Ben Hutchings

kaiser: add "nokaiser" boot option, using ALTERNATIVE

2018-01-07T01:46:51+00:00

Added "nokaiser" boot option: an early param like "noinvpcid".
Most places now check int kaiser_enabled (#defined 0 when not
CONFIG_KAISER) instead of #ifdef CONFIG_KAISER; but entry_64.S
and entry_64_compat.S are using the ALTERNATIVE technique, which
patches in the preferred instructions at runtime.  That technique
is tied to x86 cpu features, so X86_FEATURE_KAISER fabricated
("" in its comment so "kaiser" not magicked into /proc/cpuinfo).

Prior to "nokaiser", Kaiser #defined _PAGE_GLOBAL 0: revert that,
but be careful with both _PAGE_GLOBAL and CR4.PGE: setting them when
nokaiser like when !CONFIG_KAISER, but not setting either when kaiser -
neither matters on its own, but it's hard to be sure that _PAGE_GLOBAL
won't get set in some obscure corner, or something add PGE into CR4.
By omitting _PAGE_GLOBAL from __supported_pte_mask when kaiser_enabled,
all page table setup which uses pte_pfn() masks it out of the ptes.

It's slightly shameful that the same declaration versus definition of
kaiser_enabled appears in not one, not two, but in three header files
(asm/kaiser.h, asm/pgtable.h, asm/tlbflush.h).  I felt safer that way,
than with #including any of those in any of the others; and did not
feel it worth an asm/kaiser_enabled.h - kernel/cpu/common.c includes
them all, so we shall hear about it if they get out of synch.

Cleanups while in the area: removed the silly #ifdef CONFIG_KAISER
from kaiser.c; removed the unused native_get_normal_pgd(); removed
the spurious reg clutter from SWITCH_*_CR3 macro stubs; corrected some
comments.  But more interestingly, set CR4.PSE in secondary_startup_64:
the manual is clear that it does not matter whether it's 0 or 1 when
4-level-pts are enabled, but I was distracted to find cr4 different on
BSP and auxiliaries - BSP alone was adding PSE, in init_memory_mapping().

(cherry picked from Change-Id: I8e5bec716944444359cbd19f6729311eff943e9a)

Signed-off-by: Hugh Dickins 
Signed-off-by: Ben Hutchings

x86/mm: Add the 'nopcid' boot option to turn off PCID

2018-01-07T01:46:48+00:00

commit 0790c9aad84901ca1bdc14746175549c8b5da215 upstream.

The parameter is only present on x86_64 systems to save a few bytes,
as PCID is always disabled on x86_32.

Signed-off-by: Andy Lutomirski 
Reviewed-by: Nadav Amit 
Reviewed-by: Borislav Petkov 
Reviewed-by: Thomas Gleixner 
Cc: Andrew Morton 
Cc: Arjan van de Ven 
Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: Linus Torvalds 
Cc: Mel Gorman 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/8bbb2e65bcd249a5f18bfb8128b4689f08ac2b60.1498751203.git.luto@kernel.org
Signed-off-by: Ingo Molnar 
[Hugh Dickins: Backported to 3.2:
 - Documentation/admin-guide/kernel-parameters.txt (not in this tree)
 - Documentation/kernel-parameters.txt (patched instead of that)]
Signed-off-by: Hugh Dickins 
Signed-off-by: Ben Hutchings

x86/mm: Add a 'noinvpcid' boot option to turn off INVPCID

2018-01-07T01:46:47+00:00

commit d12a72b844a49d4162f24cefdab30bed3f86730e upstream.

This adds a chicken bit to turn off INVPCID in case something goes
wrong.  It's an early_param() because we do TLB flushes before we
parse __setup() parameters.

Signed-off-by: Andy Lutomirski 
Reviewed-by: Borislav Petkov 
Cc: Andrew Morton 
Cc: Andrey Ryabinin 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Luis R. Rodriguez 
Cc: Oleg Nesterov 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Toshi Kani 
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/f586317ed1bc2b87aee652267e515b90051af385.1454096309.git.luto@kernel.org
Signed-off-by: Ingo Molnar 
Cc: Hugh Dickins 
Signed-off-by: Ben Hutchings

mm: larger stack guard gap, between vmas

2017-07-02T16:12:47+00:00

commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream.

Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.

This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.

Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.

One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications.  For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).

Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.

Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.

Original-patch-by: Oleg Nesterov 
Original-patch-by: Michal Hocko 
Signed-off-by: Hugh Dickins 
Acked-by: Michal Hocko 
Tested-by: Helge Deller  # parisc
Signed-off-by: Linus Torvalds 
[Hugh Dickins: Backported to 3.2]
[bwh: Fix more instances of vma->vm_start in sparc64 impl. of
 arch_get_unmapped_area_topdown() and generic impl. of
 hugetlb_get_unmapped_area()]
Signed-off-by: Ben Hutchings