linux-stable.git/fs/proc/task_mmu.c, branch linux-3.16.y

coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping

2019-05-02T20:42:04+00:00

commit 04f5866e41fb70690e28397487d8bd8eea7d712a upstream.

The core dumping code has always run without holding the mmap_sem for
writing, despite that is the only way to ensure that the entire vma
layout will not change from under it.  Only using some signal
serialization on the processes belonging to the mm is not nearly enough.
This was pointed out earlier.  For example in Hugh's post from Jul 2017:

  https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils

  "Not strictly relevant here, but a related note: I was very surprised
   to discover, only quite recently, how handle_mm_fault() may be called
   without down_read(mmap_sem) - when core dumping. That seems a
   misguided optimization to me, which would also be nice to correct"

In particular because the growsdown and growsup can move the
vm_start/vm_end the various loops the core dump does around the vma will
not be consistent if page faults can happen concurrently.

Pretty much all users calling mmget_not_zero()/get_task_mm() and then
taking the mmap_sem had the potential to introduce unexpected side
effects in the core dumping code.

Adding mmap_sem for writing around the ->core_dump invocation is a
viable long term fix, but it requires removing all copy user and page
faults and to replace them with get_dump_page() for all binary formats
which is not suitable as a short term fix.

For the time being this solution manually covers the places that can
confuse the core dump either by altering the vma layout or the vma flags
while it runs.  Once ->core_dump runs under mmap_sem for writing the
function mmget_still_valid() can be dropped.

Allowing mmap_sem protected sections to run in parallel with the
coredump provides some minor parallelism advantage to the swapoff code
(which seems to be safe enough by never mangling any vma field and can
keep doing swapins in parallel to the core dumping) and to some other
corner case.

In order to facilitate the backporting I added "Fixes: 86039bd3b4e6"
however the side effect of this same race condition in /proc/pid/mem
should be reproducible since before 2.6.12-rc2 so I couldn't add any
other "Fixes:" because there's no hash beyond the git genesis commit.

Because find_extend_vma() is the only location outside of the process
context that could modify the "mm" structures under mmap_sem for
reading, by adding the mmget_still_valid() check to it, all other cases
that take the mmap_sem for reading don't need the new check after
mmget_not_zero()/get_task_mm().  The expand_stack() in page fault
context also doesn't need the new check, because all tasks under core
dumping are frozen.

Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Andrea Arcangeli 
Reported-by: Jann Horn 
Suggested-by: Oleg Nesterov 
Acked-by: Peter Xu 
Reviewed-by: Mike Rapoport 
Reviewed-by: Oleg Nesterov 
Reviewed-by: Jann Horn 
Acked-by: Jason Gunthorpe 
Acked-by: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
[bwh: Backported to 3.16:
 - Drop changes in Infiniband and userfaultfd
 - In clear_refs_write(), use up_read() as we never upgrade to a write lock
 - Adjust filename, context]
Signed-off-by: Ben Hutchings

fs/proc: Stop trying to report thread stacks

2018-11-20T18:05:58+00:00

commit b18cb64ead400c01bf1580eeba330ace51f8087d upstream.

This reverts more of:

  b76437579d13 ("procfs: mark thread stack correctly in proc//maps")

... which was partially reverted by:

  65376df58217 ("proc: revert /proc//maps [stack:TID] annotation")

Originally, /proc/PID/task/TID/maps was the same as /proc/TID/maps.

In current kernels, /proc/PID/maps (or /proc/TID/maps even for
threads) shows "[stack]" for VMAs in the mm's stack address range.

In contrast, /proc/PID/task/TID/maps uses KSTK_ESP to guess the
target thread's stack's VMA.  This is racy, probably returns garbage
and, on arches with CONFIG_TASK_INFO_IN_THREAD=y, is also crash-prone:
KSTK_ESP is not safe to use on tasks that aren't known to be running
ordinary process-context kernel code.

This patch removes the difference and just shows "[stack]" for VMAs
in the mm's stack range.  This is IMO much more sensible -- the
actual "stack" address really is treated specially by the VM code,
and the current thread stack isn't even well-defined for programs
that frequently switch stacks on their own.

Reported-by: Jann Horn 
Signed-off-by: Andy Lutomirski 
Acked-by: Thomas Gleixner 
Cc: Al Viro 
Cc: Andrew Morton 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Johannes Weiner 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Linux API 
Cc: Peter Zijlstra 
Cc: Tycho Andersen 
Link: http://lkml.kernel.org/r/3e678474ec14e0a0ec34c611016753eea2e1b8ba.1475257877.git.luto@kernel.org
Signed-off-by: Ingo Molnar 
[bwh: Backported to 3.16: Squash in the earlier commits 58cb65487e92
 "proc/maps: make vm_is_stack() logic namespace-friendly" and 
 65376df58217 "proc: revert /proc//maps [stack:TID] annotation",
 which would introduce build failures if applied separately.]
Signed-off-by: Ben Hutchings

mm: /proc/pid/pagemap: hide swap entries from unprivileged users

2018-11-20T18:05:11+00:00

commit ab6ecf247a9321e3180e021a6a60164dee53ab2e upstream.

In commit ab676b7d6fbf ("pagemap: do not leak physical addresses to
non-privileged userspace"), the /proc/PID/pagemap is restricted to be
readable only by CAP_SYS_ADMIN to address some security issue.

In commit 1c90308e7a77 ("pagemap: hide physical addresses from
non-privileged users"), the restriction is relieved to make
/proc/PID/pagemap readable, but hide the physical addresses for
non-privileged users.

But the swap entries are readable for non-privileged users too.  This
has some security issues.  For example, for page under migrating, the
swap entry has physical address information.  So, in this patch, the
swap entries are hided for non-privileged users too.

Link: http://lkml.kernel.org/r/20180508012745.7238-1-ying.huang@intel.com
Fixes: 1c90308e7a77 ("pagemap: hide physical addresses from non-privileged users")
Signed-off-by: "Huang, Ying" 
Suggested-by: Kirill A. Shutemov 
Reviewed-by: Naoya Horiguchi 
Reviewed-by: Konstantin Khlebnikov 
Acked-by: Michal Hocko 
Cc: Konstantin Khlebnikov 
Cc: Andrei Vagin 
Cc: Jerome Glisse 
Cc: Daniel Colascione 
Cc: Zi Yan 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
[bwh: Backported to 3.16:
 - Only PTEs can be swap entries
 - Adjust context]
Signed-off-by: Ben Hutchings

pagemap: hide physical addresses from non-privileged users

2018-11-20T18:05:11+00:00

commit 1c90308e7a77af6742a97d1021cca923b23b7f0d upstream.

This patch makes pagemap readable for normal users and hides physical
addresses from them.  For some use-cases PFN isn't required at all.

See http://lkml.kernel.org/r/1425935472-17949-1-git-send-email-kirill@shutemov.name

Fixes: ab676b7d6fbf ("pagemap: do not leak physical addresses to non-privileged userspace")
Signed-off-by: Konstantin Khlebnikov 
Cc: Naoya Horiguchi 
Reviewed-by: Mark Williamson 
Tested-by:  Mark Williamson 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
[bwh: Backported to 3.16:
 - Add the same check in the places where we look up a PFN
 - Adjust context]
Signed-off-by: Ben Hutchings

proc: drop handling non-linear mappings

2018-10-03T03:09:51+00:00

commit 1da4b35b001481df99a6dcab12d5d39a876f7056 upstream.

We have to handle non-linear mappings for /proc/PID/{smaps,clear_refs}
which is unused now.  Let's drop it.

Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
[bwh: Backported to 3.16:
 - Deleted code is slightly different
 - Adjust context]
Signed-off-by: Ben Hutchings

ptrace: use fsuid, fsgid, effective creds for fs access checks

2017-09-15T17:30:17+00:00

commit caaee6234d05a58c5b4d05e7bf766131b810a657 upstream.

By checking the effective credentials instead of the real UID / permitted
capabilities, ensure that the calling process actually intended to use its
credentials.

To ensure that all ptrace checks use the correct caller credentials (e.g.
in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
flag), use two new flags and require one of them to be set.

The problem was that when a privileged task had temporarily dropped its
privileges, e.g.  by calling setreuid(0, user_uid), with the intent to
perform following syscalls with the credentials of a user, it still passed
ptrace access checks that the user would not be able to pass.

While an attacker should not be able to convince the privileged task to
perform a ptrace() syscall, this is a problem because the ptrace access
check is reused for things in procfs.

In particular, the following somewhat interesting procfs entries only rely
on ptrace access checks:

 /proc/$pid/stat - uses the check for determining whether pointers
     should be visible, useful for bypassing ASLR
 /proc/$pid/maps - also useful for bypassing ASLR
 /proc/$pid/cwd - useful for gaining access to restricted
     directories that contain files with lax permissions, e.g. in
     this scenario:
     lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
     drwx------ root root /root
     drwxr-xr-x root root /root/foobar
     -rw-r--r-- root root /root/foobar/secret

Therefore, on a system where a root-owned mode 6755 binary changes its
effective credentials as described and then dumps a user-specified file,
this could be used by an attacker to reveal the memory layout of root's
processes or reveal the contents of files he is not allowed to access
(through /proc/$pid/cwd).

[akpm@linux-foundation.org: fix warning]
Signed-off-by: Jann Horn 
Acked-by: Kees Cook 
Cc: Casey Schaufler 
Cc: Oleg Nesterov 
Cc: Ingo Molnar 
Cc: James Morris 
Cc: "Serge E. Hallyn" 
Cc: Andy Shevchenko 
Cc: Andy Lutomirski 
Cc: Al Viro 
Cc: "Eric W. Biederman" 
Cc: Willy Tarreau 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
[bwh: Backported to 3.16:
 - Update mm_access() calls in fs/proc/task_{,no}mmu.c too
 - Adjust context]
Signed-off-by: Ben Hutchings

mm: larger stack guard gap, between vmas

2017-07-02T16:13:02+00:00

commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream.

Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.

This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.

Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.

One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications.  For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).

Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.

Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.

Original-patch-by: Oleg Nesterov 
Original-patch-by: Michal Hocko 
Signed-off-by: Hugh Dickins 
Acked-by: Michal Hocko 
Tested-by: Helge Deller  # parisc
Signed-off-by: Linus Torvalds 
[Hugh Dickins: Backported to 3.16]
Signed-off-by: Ben Hutchings

pagemap: do not leak physical addresses to non-privileged userspace

2015-03-30T10:11:38+00:00

commit ab676b7d6fbf4b294bf198fb27ade5b0e865c7ce upstream.

As pointed by recent post[1] on exploiting DRAM physical imperfection,
/proc/PID/pagemap exposes sensitive information which can be used to do
attacks.

This disallows anybody without CAP_SYS_ADMIN to read the pagemap.

[1] http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html

[ Eventually we might want to do anything more finegrained, but for now
  this is the simple model.   - Linus ]

Signed-off-by: Kirill A. Shutemov 
Acked-by: Konstantin Khlebnikov 
Acked-by: Andy Lutomirski 
Cc: Pavel Emelyanov 
Cc: Andrew Morton 
Cc: Mark Seaborn 
Signed-off-by: Linus Torvalds 
Signed-off-by: Luis Henriques

proc/pagemap: walk page tables under pte lock

2015-03-03T14:54:07+00:00

commit 05fbf357d94152171bc50f8a369390f1f16efd89 upstream.

Lockless access to pte in pagemap_pte_range() might race with page
migration and trigger BUG_ON(!PageLocked()) in migration_entry_to_page():

CPU A (pagemap)                           CPU B (migration)
                                          lock_page()
                                          try_to_unmap(page, TTU_MIGRATION...)
                                               make_migration_entry()
                                               set_pte_at()

pte_to_pagemap_entry()
                                          remove_migration_ptes()
                                          unlock_page()
    if(is_migration_entry())
        migration_entry_to_page()
            BUG_ON(!PageLocked(page))

Also lockless read might be non-atomic if pte is larger than wordsize.
Other pte walkers (smaps, numa_maps, clear_refs) already lock ptes.

Fixes: 052fb0d635df ("proc: report file/anon bit in /proc/pid/pagemap")
Signed-off-by: Konstantin Khlebnikov 
Reported-by: Andrey Ryabinin 
Reviewed-by: Cyrill Gorcunov 
Acked-by: Naoya Horiguchi 
Acked-by: Kirill A. Shutemov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Luis Henriques

mm: softdirty: unmapped addresses between VMAs are clean

2015-03-03T14:52:26+00:00

commit 81d0fa623c5b8dbd5279d9713094b0f9b0a00fb4 upstream.

If a /proc/pid/pagemap read spans a [VMA, an unmapped region, then a
VM_SOFTDIRTY VMA], the virtual pages in the unmapped region are reported
as softdirty.  Here's a program to demonstrate the bug:

int main() {
	const uint64_t PAGEMAP_SOFTDIRTY = 1ul << 55;
	uint64_t pme[3];
	int fd = open("/proc/self/pagemap", O_RDONLY);;
	char *m = mmap(NULL, 3 * getpagesize(), PROT_READ,
	               MAP_ANONYMOUS | MAP_SHARED, -1, 0);
	munmap(m + getpagesize(), getpagesize());
	pread(fd, pme, 24, (unsigned long) m / getpagesize() * 8);
	assert(pme[0] & PAGEMAP_SOFTDIRTY);    /* passes */
	assert(!(pme[1] & PAGEMAP_SOFTDIRTY)); /* fails */
	assert(pme[2] & PAGEMAP_SOFTDIRTY);    /* passes */
	return 0;
}

(Note that all pages in new VMAs are softdirty until cleared).

Tested:
	Used the program given above. I'm going to include this code in
	a selftest in the future.

[n-horiguchi@ah.jp.nec.com: prevent pagemap_pte_range() from overrunning]
Signed-off-by: Peter Feiner 
Cc: "Kirill A. Shutemov" 
Cc: Cyrill Gorcunov 
Cc: Pavel Emelyanov 
Cc: Jamie Liu 
Cc: Hugh Dickins 
Cc: Naoya Horiguchi 
Signed-off-by: Naoya Horiguchi 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
[ luis: 3.16-stable prereq for:
  05fbf357d941 "proc/pagemap: walk page tables under pte lock" ]
Signed-off-by: Luis Henriques