linux-stable.git/fs/proc/task_mmu.c, branch linux-3.14.y

proc: Fix ptrace-based permission checks for accessing task maps

2016-03-03T23:06:45+00:00

Modify mm_access() calls in fs/proc/task_mmu.c and fs/proc/task_nommu.c to
have the mode include PTRACE_MODE_FSCREDS so accessing /proc/pid/maps and
/proc/pid/pagemap is not denied to all users.

In backporting upstream commit caaee623 to pre-3.18 kernel versions it was
overlooked that mm_access() is used in fs/proc/task_*mmu.c as those calls
were removed in 3.18 (by upstream commit 29a40ace) and did not exist at the
time of the original commit.

Signed-off-by: Corey Wright 
Acked-by: Jann Horn 
Signed-off-by: Greg Kroah-Hartman

proc/pagemap: walk page tables under pte lock

2015-04-29T08:31:56+00:00

commit 05fbf357d94152171bc50f8a369390f1f16efd89 upstream.

Lockless access to pte in pagemap_pte_range() might race with page
migration and trigger BUG_ON(!PageLocked()) in migration_entry_to_page():

CPU A (pagemap)                           CPU B (migration)
                                          lock_page()
                                          try_to_unmap(page, TTU_MIGRATION...)
                                               make_migration_entry()
                                               set_pte_at()

pte_to_pagemap_entry()
                                          remove_migration_ptes()
                                          unlock_page()
    if(is_migration_entry())
        migration_entry_to_page()
            BUG_ON(!PageLocked(page))

Also lockless read might be non-atomic if pte is larger than wordsize.
Other pte walkers (smaps, numa_maps, clear_refs) already lock ptes.

Fixes: 052fb0d635df ("proc: report file/anon bit in /proc/pid/pagemap")
Signed-off-by: Konstantin Khlebnikov 
Reported-by: Andrey Ryabinin 
Reviewed-by: Cyrill Gorcunov 
Acked-by: Naoya Horiguchi 
Acked-by: Kirill A. Shutemov 
Cc: 	[3.5+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: softdirty: unmapped addresses between VMAs are clean

2015-04-29T08:31:56+00:00

commit 81d0fa623c5b8dbd5279d9713094b0f9b0a00fb4 upstream.

If a /proc/pid/pagemap read spans a [VMA, an unmapped region, then a
VM_SOFTDIRTY VMA], the virtual pages in the unmapped region are reported
as softdirty.  Here's a program to demonstrate the bug:

int main() {
	const uint64_t PAGEMAP_SOFTDIRTY = 1ul << 55;
	uint64_t pme[3];
	int fd = open("/proc/self/pagemap", O_RDONLY);;
	char *m = mmap(NULL, 3 * getpagesize(), PROT_READ,
	               MAP_ANONYMOUS | MAP_SHARED, -1, 0);
	munmap(m + getpagesize(), getpagesize());
	pread(fd, pme, 24, (unsigned long) m / getpagesize() * 8);
	assert(pme[0] & PAGEMAP_SOFTDIRTY);    /* passes */
	assert(!(pme[1] & PAGEMAP_SOFTDIRTY)); /* fails */
	assert(pme[2] & PAGEMAP_SOFTDIRTY);    /* passes */
	return 0;
}

(Note that all pages in new VMAs are softdirty until cleared).

Tested:
	Used the program given above. I'm going to include this code in
	a selftest in the future.

[n-horiguchi@ah.jp.nec.com: prevent pagemap_pte_range() from overrunning]
Signed-off-by: Peter Feiner 
Cc: "Kirill A. Shutemov" 
Cc: Cyrill Gorcunov 
Cc: Pavel Emelyanov 
Cc: Jamie Liu 
Cc: Hugh Dickins 
Cc: Naoya Horiguchi 
Signed-off-by: Naoya Horiguchi 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

pagemap: do not leak physical addresses to non-privileged userspace

2015-03-26T14:06:57+00:00

commit ab676b7d6fbf4b294bf198fb27ade5b0e865c7ce upstream.

As pointed by recent post[1] on exploiting DRAM physical imperfection,
/proc/PID/pagemap exposes sensitive information which can be used to do
attacks.

This disallows anybody without CAP_SYS_ADMIN to read the pagemap.

[1] http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html

[ Eventually we might want to do anything more finegrained, but for now
  this is the simple model.   - Linus ]

Signed-off-by: Kirill A. Shutemov 
Acked-by: Konstantin Khlebnikov 
Acked-by: Andy Lutomirski 
Cc: Pavel Emelyanov 
Cc: Andrew Morton 
Cc: Mark Seaborn 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: per-thread vma caching

2014-10-09T19:21:29+00:00

commit 615d6e8756c87149f2d4c1b93d471bca002bd849 upstream.

This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random,
thus further comparison with other approaches were needed.  There are
two things to consider when dealing with this, the cache hit rate and
the latency of find_vma().  Improving the hit-rate does not necessarily
translate in finding the vma any faster, as the overhead of any fancy
caching schemes can be too high to consider.

We currently cache the last used vma for the whole address space, which
provides a nice optimization, reducing the total cycles in find_vma() by
up to 250%, for workloads with good locality.  On the other hand, this
simple scheme is pretty much useless for workloads with poor locality.
Analyzing ebizzy runs shows that, no matter how many threads are
running, the mmap_cache hit rate is less than 2%, and in many situations
below 1%.

The proposed approach is to replace this scheme with a small per-thread
cache, maximizing hit rates at a very low maintenance cost.
Invalidations are performed by simply bumping up a 32-bit sequence
number.  The only expensive operation is in the rare case of a seq
number overflow, where all caches that share the same address space are
flushed.  Upon a miss, the proposed replacement policy is based on the
page number that contains the virtual address in question.  Concretely,
the following results are seen on an 80 core, 8 socket x86-64 box:

1) System bootup: Most programs are single threaded, so the per-thread
   scheme does improve ~50% hit rate by just adding a few more slots to
   the cache.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 50.61%   | 19.90            |
| patched        | 73.45%   | 13.58            |
+----------------+----------+------------------+

2) Kernel build: This one is already pretty good with the current
   approach as we're dealing with good locality.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 75.28%   | 11.03            |
| patched        | 88.09%   | 9.31             |
+----------------+----------+------------------+

3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 70.66%   | 17.14            |
| patched        | 91.15%   | 12.57            |
+----------------+----------+------------------+

4) Ebizzy: There's a fair amount of variation from run to run, but this
   approach always shows nearly perfect hit rates, while baseline is just
   about non-existent.  The amounts of cycles can fluctuate between
   anywhere from ~60 to ~116 for the baseline scheme, but this approach
   reduces it considerably.  For instance, with 80 threads:

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 1.06%    | 91.54            |
| patched        | 99.97%   | 14.18            |
+----------------+----------+------------------+

[akpm@linux-foundation.org: fix nommu build, per Davidlohr]
[akpm@linux-foundation.org: document vmacache_valid() logic]
[akpm@linux-foundation.org: attempt to untangle header files]
[akpm@linux-foundation.org: add vmacache_find() BUG_ON]
[hughd@google.com: add vmacache_valid_mm() (from Oleg)]
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: adjust and enhance comments]
Signed-off-by: Davidlohr Bueso 
Reviewed-by: Rik van Riel 
Acked-by: Linus Torvalds 
Reviewed-by: Michel Lespinasse 
Cc: Oleg Nesterov 
Tested-by: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Mel Gorman 
Signed-off-by: Greg Kroah-Hartman

mm: add !pte_present() check on existing hugetlb_entry callbacks

2014-06-11T18:54:13+00:00

commit d4c54919ed86302094c0ca7d48a8cbd4ee753e92 upstream.

The age table walker doesn't check non-present hugetlb entry in common
path, so hugetlb_entry() callbacks must check it.  The reason for this
behavior is that some callers want to handle it in its own way.

[ I think that reason is bogus, btw - it should just do what the regular
  code does, which is to call the "pte_hole()" function for such hugetlb
  entries  - Linus]

However, some callers don't check it now, which causes unpredictable
result, for example when we have a race between migrating hugepage and
reading /proc/pid/numa_maps.  This patch fixes it by adding !pte_present
checks on buggy callbacks.

This bug exists for years and got visible by introducing hugepage
migration.

ChangeLog v2:
- fix if condition (check !pte_present() instead of pte_present())

Reported-by: Sasha Levin 
Signed-off-by: Naoya Horiguchi 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
[ Backported to 3.15.  Signed-off-by: Josh Boyer  ]
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

seq_file: remove "%n" usage from seq_file users

2013-11-15T00:32:20+00:00

All seq_printf() users are using "%n" for calculating padding size,
convert them to use seq_setwidth() / seq_pad() pair.

Signed-off-by: Tetsuo Handa 
Signed-off-by: Kees Cook 
Cc: Joe Perches 
Cc: David Miller 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, thp: change pmd_trans_huge_lock() to return taken lock

2013-11-15T00:32:14+00:00

With split ptlock it's important to know which lock
pmd_trans_huge_lock() took.  This patch adds one more parameter to the
function to return the lock.

In most places migration to new api is trivial.  Exception is
move_huge_pmd(): we need to take two locks if pmd tables are different.

Signed-off-by: Naoya Horiguchi 
Signed-off-by: Kirill A. Shutemov 
Tested-by: Alex Thorlton 
Cc: Ingo Molnar 
Cc: "Eric W . Biederman" 
Cc: "Paul E . McKenney" 
Cc: Al Viro 
Cc: Andi Kleen 
Cc: Andrea Arcangeli 
Cc: Dave Hansen 
Cc: Dave Jones 
Cc: David Howells 
Cc: Frederic Weisbecker 
Cc: Johannes Weiner 
Cc: Kees Cook 
Cc: Mel Gorman 
Cc: Michael Kerrisk 
Cc: Oleg Nesterov 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Robin Holt 
Cc: Sedat Dilek 
Cc: Srikar Dronamraju 
Cc: Thomas Gleixner 
Cc: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: convert mm->nr_ptes to atomic_long_t

2013-11-15T00:32:14+00:00

With split page table lock for PMD level we can't hold mm->page_table_lock
while updating nr_ptes.

Let's convert it to atomic_long_t to avoid races.

Signed-off-by: Kirill A. Shutemov 
Tested-by: Alex Thorlton 
Cc: Ingo Molnar 
Cc: Naoya Horiguchi 
Cc: "Eric W . Biederman" 
Cc: "Paul E . McKenney" 
Cc: Al Viro 
Cc: Andi Kleen 
Cc: Andrea Arcangeli 
Cc: Dave Hansen 
Cc: Dave Jones 
Cc: David Howells 
Cc: Frederic Weisbecker 
Cc: Johannes Weiner 
Cc: Kees Cook 
Cc: Mel Gorman 
Cc: Michael Kerrisk 
Cc: Oleg Nesterov 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Robin Holt 
Cc: Sedat Dilek 
Cc: Srikar Dronamraju 
Cc: Thomas Gleixner 
Cc: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

/proc/pid/smaps: show VM_SOFTDIRTY flag in VmFlags line

2013-11-13T03:09:07+00:00

This flag shows that the VMA is "newly created" and thus represents
"dirty" in the task's VM.

You can clear it by "echo 4 > /proc/pid/clear_refs."

Signed-off-by: Naoya Horiguchi 
Cc: Wu Fengguang 
Cc: Pavel Emelyanov 
Acked-by: Cyrill Gorcunov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds