linux-stable.git/arch/tile, branch linux-3.10.y

mm: larger stack guard gap, between vmas

2017-06-21T13:42:43+00:00

commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream.

Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.

This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.

Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.

One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications.  For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).

Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.

Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.

Original-patch-by: Oleg Nesterov 
Original-patch-by: Michal Hocko 
Signed-off-by: Hugh Dickins 
[wt: backport to 4.11: adjust context]
[wt: backport to 4.9: adjust context ; kernel doc was not in admin-guide]
[wt: backport to 4.4: adjust context ; drop ppc hugetlb_radix changes]
[wt: backport to 3.18: adjust context ; no FOLL_POPULATE ;
     s390 uses generic arch_get_unmapped_area()]
[wt: backport to 3.16: adjust context]
[wt: backport to 3.10: adjust context ; code logic in PARISC's
     arch_get_unmapped_area() wasn't found ; code inserted into
     expand_upwards() and expand_downwards() runs under anon_vma lock;
     changes for gup.c:faultin_page go to memory.c:__get_user_pages();
     included Hugh Dickins' fixes]
Signed-off-by: Willy Tarreau

tile/ptrace: Preserve previous registers for short regset write

2017-06-20T12:04:15+00:00

commit fd7c99142d77dc4a851879a66715abf12a3193fb upstream.

Ensure that if userspace supplies insufficient data to
PTRACE_SETREGSET to fill all the registers, the thread's old
registers are preserved.

Signed-off-by: Dave Martin 
Signed-off-by: Chris Metcalf 
Signed-off-by: Willy Tarreau

tile: avoid using clocksource_cyc2ns with absolute cycle count

2017-02-10T10:04:05+00:00

commit e658a6f14d7c0243205f035979d0ecf6c12a036f upstream.

For large values of "mult" and long uptimes, the intermediate
result of "cycles * mult" can overflow 64 bits.  For example,
the tile platform calls clocksource_cyc2ns with a 1.2 GHz clock;
we have mult = 853, and after 208.5 days, we overflow 64 bits.

Since clocksource_cyc2ns() is intended to be used for relative
cycle counts, not absolute cycle counts, performance is more
importance than accepting a wider range of cycle values.  So,
just use mult_frac() directly in tile's sched_clock().

Commit 4cecf6d401a0 ("sched, x86: Avoid unnecessary overflow
in sched_clock") by Salman Qazi results in essentially the same
generated code for x86 as this change does for tile.  In fact,
a follow-on change by Salman introduced mult_frac() and switched
to using it, so the C code was largely identical at that point too.

Peter Zijlstra then added mul_u64_u32_shr() and switched x86
to use it.  This is, in principle, better; by optimizing the
64x64->64 multiplies to be 32x32->64 multiplies we can potentially
save some time.  However, the compiler piplines the 64x64->64
multiplies pretty well, and the conditional branch in the generic
mul_u64_u32_shr() causes some bubbles in execution, with the
result that it's pretty much a wash.  If tilegx provided its own
implementation of mul_u64_u32_shr() without the conditional branch,
we could potentially save 3 cycles, but that seems like small gain
for a fair amount of additional build scaffolding; no other platform
currently provides a mul_u64_u32_shr() override, and tile doesn't
currently have an  header to put the override in.

Additionally, gcc currently has an optimization bug that prevents
it from recognizing the opportunity to use a 32x32->64 multiply,
and so the result would be no better than the existing mult_frac()
until such time as the compiler is fixed.

For now, just using mult_frac() seems like the right answer.

Signed-off-by: Chris Metcalf 
Signed-off-by: Willy Tarreau

tile: use free_bootmem_late() for initrd

2015-08-10T19:20:30+00:00

commit 3f81d2447b37ac697b3c600039f2c6b628c06e21 upstream.

We were previously using free_bootmem() and just getting lucky
that nothing too bad happened.

Signed-off-by: Chris Metcalf 
Signed-off-by: Greg Kroah-Hartman

vm: add VM_FAULT_SIGSEGV handling support

2015-04-29T08:34:00+00:00

commit 33692f27597fcab536d7cbbcc8f52905133e4aa7 upstream.

The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
"you should SIGSEGV" error, because the SIGSEGV case was generally
handled by the caller - usually the architecture fault handler.

That results in lots of duplication - all the architecture fault
handlers end up doing very similar "look up vma, check permissions, do
retries etc" - but it generally works.  However, there are cases where
the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.

In particular, when accessing the stack guard page, libsigsegv expects a
SIGSEGV.  And it usually got one, because the stack growth is handled by
that duplicated architecture fault handler.

However, when the generic VM layer started propagating the error return
from the stack expansion in commit fee7e49d4514 ("mm: propagate error
from stack expansion even for guard page"), that now exposed the
existing VM_FAULT_SIGBUS result to user space.  And user space really
expected SIGSEGV, not SIGBUS.

To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
duplicate architecture fault handlers about it.  They all already have
the code to handle SIGSEGV, so it's about just tying that new return
value to the existing code, but it's all a bit annoying.

This is the mindless minimal patch to do this.  A more extensive patch
would be to try to gather up the mostly shared fault handling logic into
one generic helper routine, and long-term we really should do that
cleanup.

Just from this patch, you can generally see that most architectures just
copied (directly or indirectly) the old x86 way of doing things, but in
the meantime that original x86 model has been improved to hold the VM
semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
"newer" things, so it would be a good idea to bring all those
improvements to the generic case and teach other architectures about
them too.

Reported-and-tested-by: Takashi Iwai 
Tested-by: Jan Engelhardt 
Acked-by: Heiko Carstens  # "s390 still compiles and boots"
Cc: linux-arch@vger.kernel.org
Signed-off-by: Linus Torvalds 
[shengyong: Backport to 3.10
 - adjust context
 - ignore modification for arch nios2, because 3.10 does not support it
 - ignore modification for driver lustre, because 3.10 does not support it
 - ignore VM_FAULT_FALLBACK in VM_FAULT_ERROR, becase 3.10 does not support
   this flag
 - add SIGSEGV handling to powerpc/cell spu_fault.c, because 3.10 does not
   separate it to copro_fault.c
 - add SIGSEGV handling in mm/memory.c, because 3.10 does not separate it
   to gup.c
]
Signed-off-by: Sheng Yong 
Signed-off-by: Greg Kroah-Hartman

arch: mm: pass userspace fault flag to generic fault handler

2014-11-21T17:22:56+00:00

commit 759496ba6407c6994d6a5ce3a5e74937d7816208 upstream.

Unlike global OOM handling, memory cgroup code will invoke the OOM killer
in any OOM situation because it has no way of telling faults occuring in
kernel context - which could be handled more gracefully - from
user-triggered faults.

Pass a flag that identifies faults originating in user space from the
architecture-specific fault handlers to generic code so that memcg OOM
handling can be improved.

Signed-off-by: Johannes Weiner 
Reviewed-by: Michal Hocko 
Cc: David Rientjes 
Cc: KAMEZAWA Hiroyuki 
Cc: azurIt 
Cc: KOSAKI Motohiro 
Signed-off-by: Andrew Morton 
Signed-off-by: Cong Wang 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

arch: mm: remove obsolete init OOM protection

2014-11-21T17:22:55+00:00

commit 94bce453c78996cc4373d5da6cfabe07fcc6d9f9 upstream.

The memcg code can trap tasks in the context of the failing allocation
until an OOM situation is resolved.  They can hold all kinds of locks
(fs, mm) at this point, which makes it prone to deadlocking.

This series converts memcg OOM handling into a two step process that is
started in the charge context, but any waiting is done after the fault
stack is fully unwound.

Patches 1-4 prepare architecture handlers to support the new memcg
requirements, but in doing so they also remove old cruft and unify
out-of-memory behavior across architectures.

Patch 5 disables the memcg OOM handling for syscalls, readahead, kernel
faults, because they can gracefully unwind the stack with -ENOMEM.  OOM
handling is restricted to user triggered faults that have no other
option.

Patch 6 reworks memcg's hierarchical OOM locking to make it a little
more obvious wth is going on in there: reduce locked regions, rename
locking functions, reorder and document.

Patch 7 implements the two-part OOM handling such that tasks are never
trapped with the full charge stack in an OOM situation.

This patch:

Back before smart OOM killing, when faulting tasks were killed directly on
allocation failures, the arch-specific fault handlers needed special
protection for the init process.

Now that all fault handlers call into the generic OOM killer (see commit
609838cfed97: "mm: invoke oom-killer from remaining unconverted page
fault handlers"), which already provides init protection, the
arch-specific leftovers can be removed.

Signed-off-by: Johannes Weiner 
Reviewed-by: Michal Hocko 
Acked-by: KOSAKI Motohiro 
Cc: David Rientjes 
Cc: KAMEZAWA Hiroyuki 
Cc: azurIt 
Acked-by: Vineet Gupta 	[arch/arc bits]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Cong Wang 
Signed-off-by: Greg Kroah-Hartman

mm: invoke oom-killer from remaining unconverted page fault handlers

2014-11-21T17:22:55+00:00

commit 609838cfed972d49a65aac7923a9ff5cbe482e30 upstream.

A few remaining architectures directly kill the page faulting task in an
out of memory situation.  This is usually not a good idea since that
task might not even use a significant amount of memory and so may not be
the optimal victim to resolve the situation.

Since 2.6.29's 1c0fe6e ("mm: invoke oom-killer from page fault") there
is a hook that architecture page fault handlers are supposed to call to
invoke the OOM killer and let it pick the right task to kill.  Convert
the remaining architectures over to this hook.

To have the previous behavior of simply taking out the faulting task the
vm.oom_kill_allocating_task sysctl can be set to 1.

Signed-off-by: Johannes Weiner 
Reviewed-by: Michal Hocko 
Cc: KAMEZAWA Hiroyuki 
Acked-by: David Rientjes 
Acked-by: Vineet Gupta    [arch/arc bits]
Cc: James Hogan 
Cc: David Howells 
Cc: Jonas Bonn 
Cc: Chen Liqin 
Cc: Lennox Wu 
Cc: Chris Metcalf 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Cong Wang 
Signed-off-by: Greg Kroah-Hartman

tile: remove compat_sys_lookup_dcookie declaration to fix compile error

2014-02-13T21:48:00+00:00

commit 5a5e75f4714a592f31e57f248b8f5c866f278b8d upstream.

With commit d8d14bd09cdd ("fs/compat: fix lookup_dcookie() parameter
handling") I changed the type of the len parameter of the
lookup_dcookie() syscall.

However I missed that there was still a stale declaration in
arch/tile/..  which now causes a compile error on tile:

  In file included from fs/dcookies.c:28:0:
  include/linux/compat.h:425:17: error: conflicting types for 'compat_sys_lookup_dcookie'
  fs/dcookies.c:207:1: error: conflicting types for 'compat_sys_lookup_dcookie'

Simply remove the declaration in the tile architecture, which is only a
leftover from before the different compat lookup_dcookie() versions have
been merged.  The correct declaration is now in include/linux/compat.h

The build error was reported by Fenguang's build bot.

Signed-off-by: Heiko Carstens 
Acked-by: Chris Metcalf 
Signed-off-by: Linus Torvalds 
Cc: Guenter Roeck 
Signed-off-by: Greg Kroah-Hartman

tile: use a more conservative __my_cpu_offset in CONFIG_PREEMPT

2013-10-13T23:08:34+00:00

commit f862eefec0b68e099a9fa58d3761ffb10bad97e1 upstream.

It turns out the kernel relies on barrier() to force a reload of the
percpu offset value.  Since we can't easily modify the definition of
barrier() to include "tp" as an output register, we instead provide a
definition of __my_cpu_offset as extended assembly that includes a fake
stack read to hazard against barrier(), forcing gcc to know that it
must reread "tp" and recompute anything based on "tp" after a barrier.

This fixes observed hangs in the slub allocator when we are looping
on a percpu cmpxchg_double.

A similar fix for ARMv7 was made in June in change 509eb76ebf97.

Signed-off-by: Chris Metcalf 
Signed-off-by: Greg Kroah-Hartman