linux-stable.git/fs/proc/task_mmu.c, branch linux-4.1.y

mm: larger stack guard gap, between vmas

2017-06-28T22:57:15+00:00

[ Upstream commit 1be7107fbe18eed3e319a6c3e83c78254b693acb ]

Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs like gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.

This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunately.

Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.
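
As a rough illustration of the sizing described above, the sketch below
computes the default gap as a fixed number of pages scaled by the page
size. The count of 256 pages follows from the 1MB-on-4k figure in the text;
the helper name default_stack_guard_gap() and passing the page shift as a
parameter are assumptions made so the example builds in userspace.

```c
/* Hypothetical userspace helper, not kernel code.
 * 256 pages gives 1MB with 4k pages (page shift 12), as stated above. */
#define DEFAULT_GAP_PAGES 256UL

/* Default guard gap in bytes for a given page shift. */
unsigned long default_stack_guard_gap(unsigned int page_shift)
{
	return DEFAULT_GAP_PAGES << page_shift;
}
```

With a page shift of 16 (64k pages) the same page count yields a 16MB gap,
which is why the gap is expressed in pages rather than bytes.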

One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications.  For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
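
A minimal sketch of how a value given in page units could be turned into
a byte count. This is a userspace model, not the kernel's parser: the
function name parse_stack_guard_gap(), its return convention and the
explicit page_shift argument are all assumptions for illustration.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse a decimal page count and convert it to bytes.
 * Returns 0 and stores the result in *gap_bytes on success; returns -1 and
 * leaves *gap_bytes untouched when the input is malformed or too large. */
int parse_stack_guard_gap(const char *arg, unsigned int page_shift,
			  unsigned long *gap_bytes)
{
	char *end;
	unsigned long pages;

	errno = 0;
	pages = strtoul(arg, &end, 10);

	/* Reject empty input, trailing junk and out-of-range values. */
	if (end == arg || *end != '\0' || errno == ERANGE)
		return -1;

	/* Reject page counts that would overflow when shifted to bytes. */
	if (pages > (~0UL >> page_shift))
		return -1;

	*gap_bytes = pages << page_shift;
	return 0;
}
```

For example, "256" with a page shift of 12 yields 1048576 bytes, the
default described above.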

Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.

Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and the vma tree's subtree_gap support for that.
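
The idea of a gap kept between vmas can be sketched in userspace as
follows. The struct, the MODEL_* constants and the model_* function names
are hypothetical stand-ins for the kernel's vm_area_struct, VM_GROWSDOWN,
VM_GROWSUP, vm_start_gap() and vm_end_gap(); only the arithmetic is meant
to reflect the description above, under the assumption of 4k pages and a
256-page gap.

```c
#define MODEL_VM_GROWSDOWN 0x1UL
#define MODEL_VM_GROWSUP   0x2UL
#define MODEL_PAGE_SIZE    4096UL

/* Simplified stand-in for a vma: just the range and the flags. */
struct model_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_flags;
};

/* Assumed gap: 256 pages of 4k, i.e. 1MB. */
static unsigned long model_stack_guard_gap = 256UL * MODEL_PAGE_SIZE;

/* Start of the vma, lowered by the gap if the vma grows down. */
unsigned long model_vm_start_gap(const struct model_vma *vma)
{
	unsigned long start = vma->vm_start;

	if (vma->vm_flags & MODEL_VM_GROWSDOWN) {
		start -= model_stack_guard_gap;
		if (start > vma->vm_start)	/* wrapped below zero */
			start = 0;
	}
	return start;
}

/* End of the vma, raised by the gap if the vma grows up. */
unsigned long model_vm_end_gap(const struct model_vma *vma)
{
	unsigned long end = vma->vm_end;

	if (vma->vm_flags & MODEL_VM_GROWSUP) {
		end += model_stack_guard_gap;
		if (end < vma->vm_end)		/* wrapped past the top */
			end = 0UL - MODEL_PAGE_SIZE;
	}
	return end;
}
```

Because the gap lives outside the vma, the vma's own size (and therefore
anything accounted against RLIMIT_AS) is unchanged; only code that looks
for free address space needs to call the *_gap() variants.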

Original-patch-by: Oleg Nesterov 
Original-patch-by: Michal Hocko 
Signed-off-by: Hugh Dickins 
Acked-by: Michal Hocko 
Tested-by: Helge Deller  # parisc
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin

fs/proc/task_mmu.c: fix mm_access() mode parameter in pagemap_read()

2016-08-12T17:27:29+00:00

Backport of caaee6234d05a58c5b4d05e7bf766131b810a657 ("ptrace: use fsuid,
fsgid, effective creds for fs access checks") to v4.1 failed to update the
mode parameter in the mm_access() call in pagemap_read() to have one of the
new PTRACE_MODE_*CREDS flags.

Attempting to read any other process' pagemap results in a WARN():

WARNING: CPU: 0 PID: 883 at kernel/ptrace.c:229 __ptrace_may_access+0x14a/0x160()
denying ptrace access check without PTRACE_MODE_*CREDS
Modules linked in: loop sg e1000 i2c_piix4 ppdev virtio_balloon virtio_pci parport_pc i2c_core virtio_ring ata_generic serio_raw pata_acpi virtio parport pcspkr floppy acpi_cpufreq ip_tables ext3 mbcache jbd sd_mod ata_piix crc32c_intel libata
CPU: 0 PID: 883 Comm: cat Tainted: G        W       4.1.12-51.el7uek.x86_64 #2
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  0000000000000286 00000000619f225a ffff88003b6fbc18 ffffffff81717021
  ffff88003b6fbc70 ffffffff819be870 ffff88003b6fbc58 ffffffff8108477a
  000000003b6fbc58 0000000000000001 ffff88003d287000 0000000000000001
Call Trace:
  [] dump_stack+0x63/0x81
  [] warn_slowpath_common+0x8a/0xc0
  [] warn_slowpath_fmt+0x55/0x70
  [] __ptrace_may_access+0x14a/0x160
  [] ptrace_may_access+0x32/0x50
  [] mm_access+0x6d/0xb0
  [] pagemap_read+0xe1/0x360
  [] ? lru_cache_add_active_or_unevictable+0x2b/0xa0
  [] __vfs_read+0x37/0x100
  [] ? security_file_permission+0x84/0xa0
  [] ? rw_verify_area+0x56/0xe0
  [] vfs_read+0x86/0x140
  [] SyS_read+0x55/0xd0
  [] system_call_fastpath+0x12/0x71
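
The check behind the warning above can be modelled roughly as follows.
This is a userspace sketch: the MODEL_* flag values are assumed to mirror
the PTRACE_MODE_* definitions and should be treated as illustrative, and
model_creds_flag_check() is a hypothetical name. It only captures the rule
that a mode must select exactly one set of credentials.

```c
#include <errno.h>

/* Illustrative flag values, assumed to match the kernel's PTRACE_MODE_*. */
#define MODEL_PTRACE_MODE_READ       0x01
#define MODEL_PTRACE_MODE_FSCREDS    0x08
#define MODEL_PTRACE_MODE_REALCREDS  0x10
#define MODEL_PTRACE_MODE_READ_FSCREDS \
	(MODEL_PTRACE_MODE_READ | MODEL_PTRACE_MODE_FSCREDS)

/* Returns 0 if the mode names exactly one credential set, -EPERM otherwise. */
int model_creds_flag_check(unsigned int mode)
{
	int fs = !!(mode & MODEL_PTRACE_MODE_FSCREDS);
	int real = !!(mode & MODEL_PTRACE_MODE_REALCREDS);

	/* Neither flag or both flags set: the request is denied. */
	if (fs == real)
		return -EPERM;
	return 0;
}
```

A bare MODEL_PTRACE_MODE_READ fails this check, which corresponds to the
un-updated mm_access() call; MODEL_PTRACE_MODE_READ_FSCREDS passes.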

Fixes: ab88ce5feca4 (ptrace: use fsuid, fsgid, effective creds for fs access checks)
Signed-off-by: Kenny Keslar 
Cc: Roland McGrath 
Cc: Oleg Nesterov 
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin

pagemap: do not leak physical addresses to non-privileged userspace

2015-03-17T16:31:30+00:00

As pointed out by a recent post[1] on exploiting DRAM physical imperfection,
/proc/PID/pagemap exposes sensitive information which can be used to mount
attacks.

This disallows anybody without CAP_SYS_ADMIN to read the pagemap.
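
From userspace the effect can be probed with a plain open(). The helper
below is a hypothetical sketch (try_open_readonly() is not an existing
API); on a kernel carrying this change, calling it on /proc/self/pagemap
without CAP_SYS_ADMIN would be expected to return -EPERM, though other
kernel versions may behave differently.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical probe: returns 0 if path can be opened read-only,
 * otherwise the negative errno value reported by open(). */
int try_open_readonly(const char *path)
{
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -errno;
	close(fd);
	return 0;
}
```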

[1] http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html

[ Eventually we might want to do something more fine-grained, but for now
  this is the simple model.   - Linus ]

Signed-off-by: Kirill A. Shutemov 
Acked-by: Konstantin Khlebnikov 
Acked-by: Andy Lutomirski 
Cc: Pavel Emelyanov 
Cc: Andrew Morton 
Cc: Mark Seaborn 
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds

fs: proc: task_mmu: show page size in /proc/$pid/numa_maps

2015-02-13T02:54:12+00:00

The output of /proc/$pid/numa_maps is in terms of number of pages like
anon=22 or dirty=54.  Here's some output:

  7f4680000000 default file=/hugetlb/bigfile anon=50 dirty=50 N0=50
  7f7659600000 default file=/anon_hugepage\040(deleted) anon=50 dirty=50 N0=50
  7fff8d425000 default stack anon=50 dirty=50 N0=50

Looks like we have a stack and a couple of anonymous hugetlbfs
areas which all use the same amount of memory.  They don't.

The 'bigfile' uses 1GB pages and takes up ~50GB of space.  The
anon_hugepage uses 2MB pages and takes up ~100MB of space while the stack
uses normal 4k pages.  You can go over to smaps to figure out what the
page size _really_ is with KernelPageSize or MMUPageSize.  But, I think
this is a pretty nasty and counterintuitive interface as it stands.

This patch introduces a 'kernelpagesize_kB' line element to the
/proc/$pid/numa_maps report file in order to help identify the size of
pages that are backing memory areas mapped by a given task.  This is
especially useful to help differentiate between HUGE and GIGANTIC page
backed VMAs.
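
A sketch of how a consumer might combine the page counts with the new
element to get sizes in kB. The helper names are hypothetical, and the
sample lines in the note below are the ones quoted earlier with an assumed
kernelpagesize_kB value appended; they are not captured output.

```c
#include <stdlib.h>
#include <string.h>

/* Return the number in a "key=<number>" token of line, or -1 if absent. */
static long long numa_field_value(const char *line, const char *key)
{
	size_t klen = strlen(key);
	const char *p = line;

	while ((p = strstr(p, key)) != NULL) {
		/* Accept only a whole token: at line start or after a space,
		 * and immediately followed by '='. */
		if ((p == line || p[-1] == ' ') && p[klen] == '=')
			return strtoll(p + klen + 1, NULL, 10);
		p += klen;
	}
	return -1;
}

/* Anonymous memory of one numa_maps line in kB, or -1 if a field is missing. */
long long numa_anon_kb(const char *line)
{
	long long pages = numa_field_value(line, "anon");
	long long page_kb = numa_field_value(line, "kernelpagesize_kB");

	if (pages < 0 || page_kb < 0)
		return -1;
	return pages * page_kb;
}
```

With kernelpagesize_kB=1048576 appended to the 'bigfile' line, anon=50
works out to 52428800 kB (50GB); with kernelpagesize_kB=4 on the stack
line, the same anon=50 is only 200 kB.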

This patch is based on Dave Hansen's proposal and reviewers' follow-ups
taken from the following discussion threads:
 * https://lkml.org/lkml/2011/9/21/454
 * https://lkml.org/lkml/2014/12/20/66

Signed-off-by: Rafael Aquini 
Cc: Johannes Weiner 
Cc: Dave Hansen 
Acked-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

fs/proc/task_mmu.c: add user-space support for resetting mm->hiwater_rss (peak RSS)

2015-02-13T02:54:12+00:00

Peak resident size of a process can be reset back to the process's
current rss value by writing "5" to /proc/pid/clear_refs.  The driving
use-case for this would be getting the peak RSS value, which can be
retrieved from the VmHWM field in /proc/pid/status, per benchmark
iteration or test scenario.
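
A userspace sketch of the per-iteration workflow: reset the peak, run the
workload, then read VmHWM. Both helper names are hypothetical; the paths
are passed in by the caller (for example /proc/self/clear_refs), and the
status text is assumed to have already been read into a buffer.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write "5" to a clear_refs file. Returns 0 or a negative errno value. */
int reset_peak_rss(const char *clear_refs_path)
{
	int fd = open(clear_refs_path, O_WRONLY);
	int err = 0;
	ssize_t n;

	if (fd < 0)
		return -errno;
	n = write(fd, "5", 1);
	if (n < 0)
		err = -errno;
	else if (n != 1)
		err = -EIO;
	close(fd);
	return err;
}

/* Extract the VmHWM value (in kB) from the text of a status file.
 * Returns -1 if no VmHWM line is found or it cannot be parsed. */
long parse_vmhwm_kb(const char *status_text)
{
	const char *p = status_text;
	long kb;

	while (p && *p) {
		if (strncmp(p, "VmHWM:", 6) == 0)
			return sscanf(p + 6, "%ld", &kb) == 1 ? kb : -1;
		p = strchr(p, '\n');	/* advance to the next line */
		if (p)
			p++;
	}
	return -1;
}
```

On a kernel without this change, the write would be expected to fail
rather than reset the value, so callers should check the return code.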

[akpm@linux-foundation.org: clarify behaviour in documentation]
Signed-off-by: Petr Cermak 
Cc: Bjorn Helgaas 
Cc: Primiano Tucci 
Cc: Petr Cermak 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: /proc/pid/clear_refs: avoid split_huge_page()

2015-02-12T01:06:06+00:00

Currently pagewalker splits all THP pages on any clear_refs request.  It's
not necessary.  We can handle this on PMD level.

One side effect is that soft dirty will potentially see more dirty memory,
since we will mark the whole THP dirty at once.

Sanity checked with CRIU test suite. More testing is required.

Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Naoya Horiguchi 
Reviewed-by: Cyrill Gorcunov 
Cc: Pavel Emelyanov 
Cc: Andrea Arcangeli 
Cc: Dave Hansen 
Cc: "Kirill A. Shutemov" 
Cc: Benjamin Herrenschmidt 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)

2015-02-12T01:06:06+00:00

walk_page_range() silently skips any vma that has VM_PFNMAP set, which leads
to undesirable behaviour in the caller of walk_page_range().  For
example for pagemap_read(), when no callbacks are called against VM_PFNMAP
vma, pagemap_read() may prepare pagemap data for next virtual address
range at wrong index.  That could confuse and/or break userspace
applications.

This patch avoids this misbehavior caused by vma(VM_PFNMAP) as follows:
- for pagemap_read() which has its own ->pte_hole(), call the ->pte_hole()
  over vma(VM_PFNMAP),
- for clear_refs and queue_pages which have their own ->test_walk,
  just return 1 and skip vma(VM_PFNMAP). This is no problem because
  these are not interested in hole regions,
- for other callers, just skip the vma(VM_PFNMAP) as a default behavior.
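
The three cases above can be modelled in userspace roughly as follows.
The struct and every model_* name are hypothetical simplifications of the
page walker (no address ranges, one flag word per vma); the sketch only
shows the order in which the callbacks are consulted.

```c
#include <stddef.h>

#define MODEL_VM_PFNMAP 0x1UL

/* Simplified stand-in for the walker state. */
struct model_walk {
	unsigned long vma_flags;                    /* flags of the vma being visited */
	int (*test_walk)(struct model_walk *walk);  /* caller's own filter, optional */
	int (*pte_hole)(struct model_walk *walk);   /* hole callback, optional */
	int holes_reported;                         /* bumped by model_count_hole() */
};

/* Returns 0 to walk the vma, 1 to skip it, negative on error. */
int model_walk_page_test(struct model_walk *walk)
{
	/* A caller-supplied filter takes precedence over the default. */
	if (walk->test_walk)
		return walk->test_walk(walk);

	if (walk->vma_flags & MODEL_VM_PFNMAP) {
		int err = 1;

		/* Report the range as a hole if the caller wants holes. */
		if (walk->pte_hole)
			err = walk->pte_hole(walk);
		return err ? err : 1;	/* skip the vma either way */
	}
	return 0;
}

/* Sample hole callback in the style of a pagemap reader. */
int model_count_hole(struct model_walk *walk)
{
	walk->holes_reported++;
	return 0;
}

/* Sample filter in the style of clear_refs: skip VM_PFNMAP vmas. */
int model_skip_pfnmap(struct model_walk *walk)
{
	return (walk->vma_flags & MODEL_VM_PFNMAP) ? 1 : 0;
}
```

In this model a pagemap-style caller still sees one hole per VM_PFNMAP
vma, so its output index stays aligned with the virtual address range.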

Signed-off-by: Naoya Horiguchi 
Signed-off-by: Shiraz Hashim 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

numa_maps: remove numa_maps->vma

2015-02-12T01:06:06+00:00

pagewalk.c can handle vma in itself, so we don't have to pass vma via
walk->private.  And show_numa_map() walks pages on vma basis, so using
walk_page_vma() is preferable.

Signed-off-by: Naoya Horiguchi 
Acked-by: Kirill A. Shutemov 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Cyrill Gorcunov 
Cc: Dave Hansen 
Cc: Pavel Emelyanov 
Cc: Benjamin Herrenschmidt 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

numa_maps: fix typo in gather_hugetbl_stats

2015-02-12T01:06:06+00:00

Just doing s/gather_hugetbl_stats/gather_hugetlb_stats/g, this makes code
grep-friendly.

Signed-off-by: Naoya Horiguchi 
Acked-by: Kirill A. Shutemov 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Cyrill Gorcunov 
Cc: Dave Hansen 
Cc: Pavel Emelyanov 
Cc: Benjamin Herrenschmidt 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

pagemap: use walk->vma instead of calling find_vma()

2015-02-12T01:06:05+00:00

Page table walker has the information of the current vma in mm_walk, so we
don't have to call find_vma() in each pagemap_(pte|hugetlb)_range() call
any longer.  Currently pagemap_pte_range() does vma loop itself, so this
patch reduces many lines of code.

NULL-vma check is omitted because we assume that we never run these
callbacks on any address outside vma.  And even if it were broken, NULL
pointer dereference would be detected, so we can get enough information
for debugging.

Signed-off-by: Naoya Horiguchi 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Cyrill Gorcunov 
Cc: Dave Hansen 
Cc: Kirill A. Shutemov 
Cc: Pavel Emelyanov 
Cc: Benjamin Herrenschmidt 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds