linux-stable.git/Documentation, branch v3.16.63

fs/proc: Stop trying to report thread stacks

2018-11-20T18:05:58+00:00

commit b18cb64ead400c01bf1580eeba330ace51f8087d upstream.

This reverts more of:

  b76437579d13 ("procfs: mark thread stack correctly in proc//maps")

... which was partially reverted by:

  65376df58217 ("proc: revert /proc//maps [stack:TID] annotation")

Originally, /proc/PID/task/TID/maps was the same as /proc/TID/maps.

In current kernels, /proc/PID/maps (or /proc/TID/maps even for
threads) shows "[stack]" for VMAs in the mm's stack address range.

In contrast, /proc/PID/task/TID/maps uses KSTK_ESP to guess the
target thread's stack's VMA.  This is racy, probably returns garbage
and, on arches with CONFIG_TASK_INFO_IN_THREAD=y, is also crash-prone:
KSTK_ESP is not safe to use on tasks that aren't known to be running
ordinary process-context kernel code.

This patch removes the difference and just shows "[stack]" for VMAs
in the mm's stack range.  This is IMO much more sensible -- the
actual "stack" address really is treated specially by the VM code,
and the current thread stack isn't even well-defined for programs
that frequently switch stacks on their own.

Reported-by: Jann Horn 
Signed-off-by: Andy Lutomirski 
Acked-by: Thomas Gleixner 
Cc: Al Viro 
Cc: Andrew Morton 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Johannes Weiner 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Linux API 
Cc: Peter Zijlstra 
Cc: Tycho Andersen 
Link: http://lkml.kernel.org/r/3e678474ec14e0a0ec34c611016753eea2e1b8ba.1475257877.git.luto@kernel.org
Signed-off-by: Ingo Molnar 
[bwh: Backported to 3.16: Squash in the earlier commits 58cb65487e92
 "proc/maps: make vm_is_stack() logic namespace-friendly" and 
 65376df58217 "proc: revert /proc//maps [stack:TID] annotation",
 which would introduce build failures if applied separately.]
Signed-off-by: Ben Hutchings

ppp: remove the PPPIOCDETACH ioctl

2018-10-21T07:46:19+00:00

commit af8d3c7c001ae7df1ed2b2715f058113efc86187 upstream.

The PPPIOCDETACH ioctl effectively tries to "close" the given ppp file
before f_count has reached 0, which is fundamentally a bad idea.  It
does check 'f_count < 2', which excludes concurrent operations on the
file since they would only be possible with a shared fd table, in which
case each fdget() would take a file reference.  However, it fails to
account for the fact that even with 'f_count == 1' the file can still be
linked into epoll instances.  As reported by syzbot, this can trivially
be used to cause a use-after-free.

Yet, the only known user of PPPIOCDETACH is pppd versions older than
ppp-2.4.2, which was released almost 15 years ago (November 2003).
Also, PPPIOCDETACH apparently stopped working reliably at around the
same time, when the f_count check was added to the kernel, e.g. see
https://lkml.org/lkml/2002/12/31/83.  Also, the current 'f_count < 2'
check makes PPPIOCDETACH only work in single-threaded applications; it
always fails if called from a multithreaded application.

All pppd versions released in the last 15 years just close() the file
descriptor instead.

Therefore, instead of hacking around this bug by exporting epoll
internals to modules, and probably missing other related bugs, just
remove the PPPIOCDETACH ioctl and see if anyone actually notices.  Leave
a stub in place that prints a one-time warning and returns EINVAL.

Reported-by: syzbot+16363c99d4134717c05b@syzkaller.appspotmail.com
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Biggers 
Acked-by: Paul Mackerras 
Reviewed-by: Guillaume Nault 
Tested-by: Guillaume Nault 
Signed-off-by: David S. Miller 
Signed-off-by: Ben Hutchings

rmap: drop support of non-linear mappings

2018-10-03T03:09:51+00:00

commit 27ba0644ea9dfe6e7693abc85837b60e40583b96 upstream.

We don't create non-linear mappings anymore.  Let's drop code which
handles them in rmap.

Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
[bwh: Backported to 3.16:
 - Deleted code is slightly different
 - Adjust context]
Signed-off-by: Ben Hutchings

mm: replace remap_file_pages() syscall with emulation

2018-10-03T03:09:50+00:00

commit c8d78c1823f46519473949d33f0d1d33fe21ea16 upstream.

remap_file_pages(2) was invented to be able efficiently map parts of
huge file into limited 32-bit virtual address space such as in database
workloads.

Nonlinear mappings are pain to support and it seems there's no
legitimate use-cases nowadays since 64-bit systems are widely available.

Let's drop it and get rid of all these special-cased code.

The patch replaces the syscall with emulation which creates new VMA on
each remap_file_pages(), unless they it can be merged with an adjacent
one.

I didn't find *any* real code that uses remap_file_pages(2) to test
emulation impact on.  I've checked Debian code search and source of all
packages in ALT Linux.  No real users: libc wrappers, mentions in
strace, gdb, valgrind and this kind of stuff.

There are few basic tests in LTP for the syscall.  They work just fine
with emulation.

To test performance impact, I've written small test case which
demonstrate pretty much worst case scenario: map 4G shmfs file, write to
begin of every page pgoff of the page, remap pages in reverse order,
read every page.

The test creates 1 million of VMAs if emulation is in use, so I had to
set vm.max_map_count to 1100000 to avoid -ENOMEM.

Before:		23.3 ( +-  4.31% ) seconds
After:		43.9 ( +-  0.85% ) seconds
Slowdown:	1.88x

I believe we can live with that.

Test case:

        #define _GNU_SOURCE
        #include 
        #include 
        #include 
        #include 

        #define MB	(1024UL * 1024)
        #define SIZE	(4096 * MB)

        int main(int argc, char **argv)
        {
                unsigned long *p;
                long i, pass;

                for (pass = 0; pass < 10; pass++) {
                        p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
                        if (p == MAP_FAILED) {
                                perror("mmap");
                                return -1;
                        }

                        for (i = 0; i < SIZE / 4096; i++)
                                p[i * 4096 / sizeof(*p)] = i;

                        for (i = 0; i < SIZE / 4096; i++) {
                                if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
                                                0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
                                        perror("remap_file_pages");
                                        return -1;
                                }
                        }

                        for (i = SIZE / 4096 - 1; i >= 0; i--)
                                assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);

                        munmap(p, SIZE);
                }

                return 0;
        }

[akpm@linux-foundation.org: fix spello]
[sasha.levin@oracle.com: initialize populate before usage]
[sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
Signed-off-by: "Kirill A. Shutemov" 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Dave Jones 
Cc: Linus Torvalds 
Cc: Armin Rigo 
Signed-off-by: Sasha Levin 
Cc: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
[bwh: Backported to 3.16:
 - Deleted code is slightly different
 - Adjust context]
Signed-off-by: Ben Hutchings

Documentation/spec_ctrl: Do some minor cleanups

2018-10-03T03:09:44+00:00

commit dd0792699c4058e63c0715d9a7c2d40226fcdddc upstream.

Fix some typos, improve formulations, end sentences with a fullstop.

Signed-off-by: Borislav Petkov 
Signed-off-by: Thomas Gleixner 
[bwh: Backported to 3.16: adjust filename]
Signed-off-by: Ben Hutchings

x86/speculation: Make "seccomp" the default mode for Speculative Store Bypass

2018-10-03T03:09:43+00:00

commit f21b53b20c754021935ea43364dbf53778eeba32 upstream.

Unless explicitly opted out of, anything running under seccomp will have
SSB mitigations enabled. Choosing the "prctl" mode will disable this.

[ tglx: Adjusted it to the new arch_seccomp_spec_mitigate() mechanism ]

Signed-off-by: Kees Cook 
Signed-off-by: Thomas Gleixner 
[bwh: Backported to 3.16: adjust filename]
Signed-off-by: Ben Hutchings

prctl: Add force disable speculation

2018-10-03T03:09:42+00:00

commit 356e4bfff2c5489e016fdb925adbf12a1e3950ee upstream.

For certain use cases it is desired to enforce mitigations so they cannot
be undone afterwards. That's important for loader stubs which want to
prevent a child from disabling the mitigation again. Will also be used for
seccomp(). The extra state preserving of the prctl state for SSB is a
preparatory step for EBPF dymanic speculation control.

Signed-off-by: Thomas Gleixner 
[bwh: Backported to 3.16: adjust filename, context]
Signed-off-by: Ben Hutchings

x86/speculation: Add prctl for Speculative Store Bypass mitigation

2018-10-03T03:09:41+00:00

commit a73ec77ee17ec556fe7f165d00314cb7c047b1ac upstream.

Add prctl based control for Speculative Store Bypass mitigation and make it
the default mitigation for Intel and AMD.

Andi Kleen provided the following rationale (slightly redacted):

 There are multiple levels of impact of Speculative Store Bypass:

 1) JITed sandbox.
    It cannot invoke system calls, but can do PRIME+PROBE and may have call
    interfaces to other code

 2) Native code process.
    No protection inside the process at this level.

 3) Kernel.

 4) Between processes.

 The prctl tries to protect against case (1) doing attacks.

 If the untrusted code can do random system calls then control is already
 lost in a much worse way. So there needs to be system call protection in
 some way (using a JIT not allowing them or seccomp). Or rather if the
 process can subvert its environment somehow to do the prctl it can already
 execute arbitrary code, which is much worse than SSB.

 To put it differently, the point of the prctl is to not allow JITed code
 to read data it shouldn't read from its JITed sandbox. If it already has
 escaped its sandbox then it can already read everything it wants in its
 address space, and do much worse.

 The ability to control Speculative Store Bypass allows to enable the
 protection selectively without affecting overall system performance.

Based on an initial patch from Tim Chen. Completely rewritten.

Signed-off-by: Thomas Gleixner 
Reviewed-by: Konrad Rzeszutek Wilk 
[bwh: Backported to 3.16: adjust filename]
Signed-off-by: Ben Hutchings

prctl: Add speculation control prctls

2018-10-03T03:09:41+00:00

commit b617cfc858161140d69cc0b5cc211996b557a1c7 upstream.

Add two new prctls to control aspects of speculation related vulnerabilites
and their mitigations to provide finer grained control over performance
impacting mitigations.

PR_GET_SPECULATION_CTRL returns the state of the speculation misfeature
which is selected with arg2 of prctl(2). The return value uses bit 0-2 with
the following meaning:

Bit  Define           Description
0    PR_SPEC_PRCTL    Mitigation can be controlled per task by
                      PR_SET_SPECULATION_CTRL
1    PR_SPEC_ENABLE   The speculation feature is enabled, mitigation is
                      disabled
2    PR_SPEC_DISABLE  The speculation feature is disabled, mitigation is
                      enabled

If all bits are 0 the CPU is not affected by the speculation misfeature.

If PR_SPEC_PRCTL is set, then the per task control of the mitigation is
available. If not set, prctl(PR_SET_SPECULATION_CTRL) for the speculation
misfeature will fail.

PR_SET_SPECULATION_CTRL allows to control the speculation misfeature, which
is selected by arg2 of prctl(2) per task. arg3 is used to hand in the
control value, i.e. either PR_SPEC_ENABLE or PR_SPEC_DISABLE.

The common return values are:

EINVAL  prctl is not implemented by the architecture or the unused prctl()
        arguments are not 0
ENODEV  arg2 is selecting a not supported speculation misfeature

PR_SET_SPECULATION_CTRL has these additional return values:

ERANGE  arg3 is incorrect, i.e. it's not either PR_SPEC_ENABLE or PR_SPEC_DISABLE
ENXIO   prctl control of the selected speculation misfeature is disabled

The first supported controlable speculation misfeature is
PR_SPEC_STORE_BYPASS. Add the define so this can be shared between
architectures.

Based on an initial patch from Tim Chen and mostly rewritten.

Signed-off-by: Thomas Gleixner 
Reviewed-by: Ingo Molnar 
Reviewed-by: Konrad Rzeszutek Wilk 
[bwh: Backported to 3.16:
 - Add the documentation directly under Documentation/ since there is
   no userspace-api subdirectory or reST index
 - Adjust context]
Signed-off-by: Ben Hutchings

x86/bugs: Provide boot parameters for the spec_store_bypass_disable mitigation

2018-10-03T03:09:39+00:00

commit 24f7fc83b9204d20f878c57cb77d261ae825e033 upstream.

Contemporary high performance processors use a common industry-wide
optimization known as "Speculative Store Bypass" in which loads from
addresses to which a recent store has occurred may (speculatively) see an
older value. Intel refers to this feature as "Memory Disambiguation" which
is part of their "Smart Memory Access" capability.

Memory Disambiguation can expose a cache side-channel attack against such
speculatively read values. An attacker can create exploit code that allows
them to read memory outside of a sandbox environment (for example,
malicious JavaScript in a web page), or to perform more complex attacks
against code running within the same privilege level, e.g. via the stack.

As a first step to mitigate against such attacks, provide two boot command
line control knobs:

 nospec_store_bypass_disable
 spec_store_bypass_disable=[off,auto,on]

By default affected x86 processors will power on with Speculative
Store Bypass enabled. Hence the provided kernel parameters are written
from the point of view of whether to enable a mitigation or not.
The parameters are as follows:

 - auto - Kernel detects whether your CPU model contains an implementation
	  of Speculative Store Bypass and picks the most appropriate
	  mitigation.

 - on   - disable Speculative Store Bypass
 - off  - enable Speculative Store Bypass

[ tglx: Reordered the checks so that the whole evaluation is not done
  	when the CPU does not support RDS ]

Signed-off-by: Konrad Rzeszutek Wilk 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Borislav Petkov 
Reviewed-by: Ingo Molnar 
[bwh: Backported to 3.16:
 - Renumber the feature bit
 - Adjust filenames, context]
Signed-off-by: Ben Hutchings