linux-stable.git/fs/proc/task_mmu.c, branch linux-3.18.y

fs/proc: Stop trying to report thread stacks

2018-04-08T09:49:48+00:00

commit b18cb64ead400c01bf1580eeba330ace51f8087d upstream.

This reverts more of:

  b76437579d13 ("procfs: mark thread stack correctly in proc//maps")

... which was partially reverted by:

  65376df58217 ("proc: revert /proc//maps [stack:TID] annotation")

Originally, /proc/PID/task/TID/maps was the same as /proc/TID/maps.

In current kernels, /proc/PID/maps (or /proc/TID/maps even for
threads) shows "[stack]" for VMAs in the mm's stack address range.

In contrast, /proc/PID/task/TID/maps uses KSTK_ESP to guess the
target thread's stack's VMA.  This is racy, probably returns garbage
and, on arches with CONFIG_TASK_INFO_IN_THREAD=y, is also crash-prone:
KSTK_ESP is not safe to use on tasks that aren't known to be running
ordinary process-context kernel code.

This patch removes the difference and just shows "[stack]" for VMAs
in the mm's stack range.  This is IMO much more sensible -- the
actual "stack" address really is treated specially by the VM code,
and the current thread stack isn't even well-defined for programs
that frequently switch stacks on their own.

Reported-by: Jann Horn 
Signed-off-by: Andy Lutomirski 
Acked-by: Thomas Gleixner 
Cc: Al Viro 
Cc: Andrew Morton 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Johannes Weiner 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Linux API 
Cc: Peter Zijlstra 
Cc: Tycho Andersen 
Link: http://lkml.kernel.org/r/3e678474ec14e0a0ec34c611016753eea2e1b8ba.1475257877.git.luto@kernel.org
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

proc: revert /proc//maps [stack:TID] annotation

2018-04-08T09:49:48+00:00

commit 65376df582174ffcec9e6471bf5b0dd79ba05e4a upstream.

Commit b76437579d13 ("procfs: mark thread stack correctly in
proc//maps") added [stack:TID] annotation to /proc//maps.

Finding the task of a stack VMA requires walking the entire thread list,
turning this into quadratic behavior: a thousand threads means a
thousand stacks, so the rendering of /proc//maps needs to look at a
million combinations.

The cost is not in proportion to the usefulness as described in the
patch.

Drop the [stack:TID] annotation to make /proc//maps (and
/proc//numa_maps) usable again for higher thread counts.

The [stack] annotation inside /proc//task//maps is retained, as
identifying the stack VMA there is an O(1) operation.

Siddesh said:
 "The end users needed a way to identify thread stacks programmatically and
  there wasn't a way to do that.  I'm afraid I no longer remember (or have
  access to the resources that would aid my memory since I changed
  employers) the details of their requirement.  However, I did do this on my
  own time because I thought it was an interesting project for me and nobody
  really gave any feedback then as to its utility, so as far as I am
  concerned you could roll back the main thread maps information since the
  information is available in the thread-specific files"

Signed-off-by: Johannes Weiner 
Cc: "Kirill A. Shutemov" 
Cc: Siddhesh Poyarekar 
Cc: Shaohua Li 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Cc: Ben Hutchings 
Signed-off-by: Greg Kroah-Hartman

mm: larger stack guard gap, between vmas

2017-06-26T05:10:54+00:00

commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream.

Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.

This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.

Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.

One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications.  For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).

Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.

Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.

Original-patch-by: Oleg Nesterov 
Original-patch-by: Michal Hocko 
Signed-off-by: Hugh Dickins 
Acked-by: Michal Hocko 
Tested-by: Helge Deller  # parisc
Signed-off-by: Linus Torvalds 
[wt: backport to 4.11: adjust context]
[wt: backport to 4.9: adjust context ; kernel doc was not in admin-guide]
[wt: backport to 4.4: adjust context ; drop ppc hugetlb_radix changes]
[wt: backport to 3.18: adjust context ; no FOLL_POPULATE ;
     s390 uses generic arch_get_unmapped_area()]
Signed-off-by: Willy Tarreau 
[gkh: minor build fixes for 3.18]
Signed-off-by: Greg Kroah-Hartman

mm: fix huge zero page accounting in smaps report

2016-06-03T15:30:36+00:00

[ Upstream commit c164e038eee805147e95789dddb88ae3b3aca11c ]

As a small zero page, huge zero page should not be accounted in smaps
report as normal page.

For small pages we rely on vm_normal_page() to filter out zero page, but
vm_normal_page() is not designed to handle pmds.  We only get here due
hackish cast pmd to pte in smaps_pte_range() -- pte and pmd format is not
necessary compatible on each and every architecture.

Let's add separate codepath to handle pmds.  follow_trans_huge_pmd() will
detect huge zero page for us.

We would need pmd_dirty() helper to do this properly.  The patch adds it
to THP-enabled architectures which don't yet have one.

[akpm@linux-foundation.org: use do_div to fix 32-bit build]
Signed-off-by: "Kirill A. Shutemov" 
Reported-by: Fengguang Wu 
Tested-by: Fengwei Yin 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

Signed-off-by: Sasha Levin

pagemap: do not leak physical addresses to non-privileged userspace

2015-03-28T14:03:28+00:00

[ Upstream commit ab676b7d6fbf4b294bf198fb27ade5b0e865c7ce ]

As pointed by recent post[1] on exploiting DRAM physical imperfection,
/proc/PID/pagemap exposes sensitive information which can be used to do
attacks.

This disallows anybody without CAP_SYS_ADMIN to read the pagemap.

[1] http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html

[ Eventually we might want to do anything more finegrained, but for now
  this is the simple model.   - Linus ]

Signed-off-by: Kirill A. Shutemov 
Acked-by: Konstantin Khlebnikov 
Acked-by: Andy Lutomirski 
Cc: Pavel Emelyanov 
Cc: Andrew Morton 
Cc: Mark Seaborn 
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin

proc/pagemap: walk page tables under pte lock

2015-03-06T22:52:53+00:00

commit 05fbf357d94152171bc50f8a369390f1f16efd89 upstream.

Lockless access to pte in pagemap_pte_range() might race with page
migration and trigger BUG_ON(!PageLocked()) in migration_entry_to_page():

CPU A (pagemap)                           CPU B (migration)
                                          lock_page()
                                          try_to_unmap(page, TTU_MIGRATION...)
                                               make_migration_entry()
                                               set_pte_at()

pte_to_pagemap_entry()
                                          remove_migration_ptes()
                                          unlock_page()
    if(is_migration_entry())
        migration_entry_to_page()
            BUG_ON(!PageLocked(page))

Also lockless read might be non-atomic if pte is larger than wordsize.
Other pte walkers (smaps, numa_maps, clear_refs) already lock ptes.

Fixes: 052fb0d635df ("proc: report file/anon bit in /proc/pid/pagemap")
Signed-off-by: Konstantin Khlebnikov 
Reported-by: Andrey Ryabinin 
Reviewed-by: Cyrill Gorcunov 
Acked-by: Naoya Horiguchi 
Acked-by: Kirill A. Shutemov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared

2014-10-14T00:18:28+00:00

For VMAs that don't want write notifications, PTEs created for read faults
have their write bit set.  If the read fault happens after VM_SOFTDIRTY is
cleared, then the PTE's softdirty bit will remain clear after subsequent
writes.

Here's a simple code snippet to demonstrate the bug:

  char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                 MAP_ANONYMOUS | MAP_SHARED, -1, 0);
  system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
  assert(*m == '\0');     /* new PTE allows write access */
  assert(!soft_dirty(x));
  *m = 'x';               /* should dirty the page */
  assert(soft_dirty(x));  /* fails */

With this patch, write notifications are enabled when VM_SOFTDIRTY is
cleared.  Furthermore, to avoid unnecessary faults, write notifications
are disabled when VM_SOFTDIRTY is set.

As a side effect of enabling and disabling write notifications with
care, this patch fixes a bug in mprotect where vm_page_prot bits set by
drivers were zapped on mprotect.  An analogous bug was fixed in mmap by
commit c9d0bf241451 ("mm: uncached vma support with writenotify").

Signed-off-by: Peter Feiner 
Reported-by: Peter Feiner 
Suggested-by: Kirill A. Shutemov 
Cc: Cyrill Gorcunov 
Cc: Pavel Emelyanov 
Cc: Jamie Liu 
Cc: Hugh Dickins 
Cc: Naoya Horiguchi 
Cc: Bjorn Helgaas 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: softdirty: unmapped addresses between VMAs are clean

2014-10-10T02:25:58+00:00

If a /proc/pid/pagemap read spans a [VMA, an unmapped region, then a
VM_SOFTDIRTY VMA], the virtual pages in the unmapped region are reported
as softdirty.  Here's a program to demonstrate the bug:

int main() {
	const uint64_t PAGEMAP_SOFTDIRTY = 1ul << 55;
	uint64_t pme[3];
	int fd = open("/proc/self/pagemap", O_RDONLY);;
	char *m = mmap(NULL, 3 * getpagesize(), PROT_READ,
	               MAP_ANONYMOUS | MAP_SHARED, -1, 0);
	munmap(m + getpagesize(), getpagesize());
	pread(fd, pme, 24, (unsigned long) m / getpagesize() * 8);
	assert(pme[0] & PAGEMAP_SOFTDIRTY);    /* passes */
	assert(!(pme[1] & PAGEMAP_SOFTDIRTY)); /* fails */
	assert(pme[2] & PAGEMAP_SOFTDIRTY);    /* passes */
	return 0;
}

(Note that all pages in new VMAs are softdirty until cleared).

Tested:
	Used the program given above. I'm going to include this code in
	a selftest in the future.

[n-horiguchi@ah.jp.nec.com: prevent pagemap_pte_range() from overrunning]
Signed-off-by: Peter Feiner 
Cc: "Kirill A. Shutemov" 
Cc: Cyrill Gorcunov 
Cc: Pavel Emelyanov 
Cc: Jamie Liu 
Cc: Hugh Dickins 
Cc: Naoya Horiguchi 
Signed-off-by: Naoya Horiguchi 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mempolicy: fix show_numa_map() vs exec() + do_set_mempolicy() race

2014-10-10T02:25:56+00:00

9e7814404b77 "hold task->mempolicy while numa_maps scans." fixed the
race with the exiting task but this is not enough.

The current code assumes that get_vma_policy(task) should either see
task->mempolicy == NULL or it should be equal to ->task_mempolicy saved
by hold_task_mempolicy(), so we can never race with __mpol_put(). But
this can only work if we can't race with do_set_mempolicy(), and thus
we can't race with another do_set_mempolicy() or do_exit() after that.

However, do_set_mempolicy()->down_write(mmap_sem) can not prevent this
race. This task can exec, change it's ->mm, and call do_set_mempolicy()
after that; in this case they take 2 different locks.

Change hold_task_mempolicy() to use get_task_policy(), it never returns
NULL, and change show_numa_map() to use __get_vma_policy() or fall back
to proc_priv->task_mempolicy.

Note: this is the minimal fix, we will cleanup this code later. I think
hold_task_mempolicy() and release_task_mempolicy() should die, we can
move this logic into show_numa_map(). Or we can move get_task_policy()
outside of ->mmap_sem and !CONFIG_NUMA code at least.

Signed-off-by: Oleg Nesterov 
Cc: KAMEZAWA Hiroyuki 
Cc: David Rientjes 
Cc: KOSAKI Motohiro 
Cc: Alexander Viro 
Cc: Cyrill Gorcunov 
Cc: "Eric W. Biederman" 
Cc: "Kirill A. Shutemov" 
Cc: Peter Zijlstra 
Cc: Hugh Dickins 
Cc: Andi Kleen 
Cc: Naoya Horiguchi 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

proc/maps: make vm_is_stack() logic namespace-friendly

2014-10-10T02:25:50+00:00

- Rename vm_is_stack() to task_of_stack() and change it to return
  "struct task_struct *" rather than the global (and thus wrong in
  general) pid_t.

- Add the new pid_of_stack() helper which calls task_of_stack() and
  uses the right namespace to report the correct pid_t.

  Unfortunately we need to define this helper twice, in task_mmu.c
  and in task_nommu.c. perhaps it makes sense to add fs/proc/util.c
  and move at least pid_of_stack/task_of_stack there to avoid the
  code duplication.

- Change show_map_vma() and show_numa_map() to use the new helper.

Signed-off-by: Oleg Nesterov 
Cc: Alexander Viro 
Cc: Cyrill Gorcunov 
Cc: "Eric W. Biederman" 
Cc: Greg Ungerer 
Cc: "Kirill A. Shutemov" 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds