linux-stable.git/fs/proc/task_mmu.c, branch linux-3.8.y

mempolicy: remove arg from mpol_parse_str, mpol_to_str

2013-01-02T17:27:10+00:00

Remove the unused argument (formerly no_context) from mpol_parse_str()
and from mpol_to_str().

Signed-off-by: Hugh Dickins 
Signed-off-by: Linus Torvalds

procfs: add VmFlags field in smaps output

2012-12-18T01:15:22+00:00

During c/r sessions we've found that there is no way at the moment to
fetch some VMA associated flags, such as mlock() and madvise().

This leads us to a problem -- we don't know if we should call for mlock()
and/or madvise() after restore on the vma area we're bringing back to
life.

This patch intorduces a new field into "smaps" output called VmFlags,
where all set flags associated with the particular VMA is shown as two
letter mnemonics.

[ Strictly speaking for c/r we only need mlock/madvise bits but it has been
  said that providing just a few flags looks somehow inconsistent.  So all
  flags are here now. ]

This feature is made available on CONFIG_CHECKPOINT_RESTORE=n kernels, as
other applications may start to use these fields.

The data is encoded in a somewhat awkward two letters mnemonic form, to
encourage userspace to be prepared for fields being added or removed in
the future.

[a.p.zijlstra@chello.nl: props to use for_each_set_bit]
[sfr@canb.auug.org.au: props to use array instead of struct]
[akpm@linux-foundation.org: overall redesign and simplification]
[akpm@linux-foundation.org: remove unneeded braces per sfr, avoid using bloaty for_each_set_bit()]
Signed-off-by: Cyrill Gorcunov 
Cc: Pavel Emelyanov 
Cc: Peter Zijlstra 
Cc: Stephen Rothwell 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

procfs: use N_MEMORY instead N_HIGH_MEMORY

2012-12-13T01:38:32+00:00

N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.

The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
Acked-by: Hillf Danton 
Signed-off-by: Wen Congyang 
Cc: Christoph Lameter 
Cc: Lin Feng 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

thp: change split_huge_page_pmd() interface

2012-12-13T01:38:31+00:00

Pass vma instead of mm and add address parameter.

In most cases we already have vma on the stack. We provides
split_huge_page_pmd_mm() for few cases when we have mm, but not vma.

This change is preparation to huge zero pmd splitting implementation.

Signed-off-by: Kirill A. Shutemov 
Cc: Andrea Arcangeli 
Cc: Andi Kleen 
Cc: "H. Peter Anvin" 
Cc: Mel Gorman 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

hold task->mempolicy while numa_maps scans.

2012-10-19T21:32:10+00:00

  /proc//numa_maps scans vma and show mempolicy under
  mmap_sem. It sometimes accesses task->mempolicy which can
  be freed without mmap_sem and numa_maps can show some
  garbage while scanning.

This patch tries to take reference count of task->mempolicy at reading
numa_maps before calling get_vma_policy(). By this, task->mempolicy
will not be freed until numa_maps reaches its end.

V2->v3
  -  updated comments to be more verbose.
  -  removed task_lock() in numa_maps code.
V1->V2
  -  access task->mempolicy only once and remember it.  Becase kernel/exit.c
     can overwrite it.

Signed-off-by: KAMEZAWA Hiroyuki 
Acked-by: David Rientjes 
Acked-by: KOSAKI Motohiro 
Signed-off-by: Linus Torvalds

mm, mempolicy: fix printing stack contents in numa_maps

2012-10-17T01:00:50+00:00

When reading /proc/pid/numa_maps, it's possible to return the contents of
the stack where the mempolicy string should be printed if the policy gets
freed from beneath us.

This happens because mpol_to_str() may return an error the
stack-allocated buffer is then printed without ever being stored.

There are two possible error conditions in mpol_to_str():

 - if the buffer allocated is insufficient for the string to be stored,
   and

 - if the mempolicy has an invalid mode.

The first error condition is not triggered in any of the callers to
mpol_to_str(): at least 50 bytes is always allocated on the stack and this
is sufficient for the string to be written.  A future patch should convert
this into BUILD_BUG_ON() since we know the maximum strlen possible, but
that's not -rc material.

The second error condition is possible if a race occurs in dropping a
reference to a task's mempolicy causing it to be freed during the read().
The slab poison value is then used for the mode and mpol_to_str() returns
-EINVAL.

This race is only possible because get_vma_policy() believes that
mm->mmap_sem protects task->mempolicy, which isn't true.  The exit path
does not hold mm->mmap_sem when dropping the reference or setting
task->mempolicy to NULL: it uses task_lock(task) instead.

Thus, it's required for the caller of a task mempolicy to hold
task_lock(task) while grabbing the mempolicy and reading it.  Callers with
a vma policy store their mempolicy earlier and can simply increment the
reference count so it's guaranteed not to be freed.

Reported-by: Dave Jones 
Signed-off-by: David Rientjes 
Signed-off-by: Linus Torvalds

mm: kill vma flag VM_RESERVED and mm->reserved_vm counter

2012-10-09T07:22:19+00:00

A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
currently it lost original meaning but still has some effects:

 | effect                 | alternative flags
-+------------------------+---------------------------------------------
1| account as reserved_vm | VM_IO
2| skip in core dump      | VM_IO, VM_DONTDUMP
3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

This patch removes reserved_vm counter from mm_struct.  Seems like nobody
cares about it, it does not exported into userspace directly, it only
reduces total_vm showed in proc.

Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.

remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.

[akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
Signed-off-by: Konstantin Khlebnikov 
Cc: Alexander Viro 
Cc: Carsten Otte 
Cc: Chris Metcalf 
Cc: Cyrill Gorcunov 
Cc: Eric Paris 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Ingo Molnar 
Cc: James Morris 
Cc: Jason Baron 
Cc: Kentaro Takeda 
Cc: Matt Helsley 
Cc: Nick Piggin 
Cc: Oleg Nesterov 
Cc: Peter Zijlstra 
Cc: Robert Richter 
Cc: Suresh Siddha 
Cc: Tetsuo Handa 
Cc: Venkatesh Pallipadi 
Acked-by: Linus Torvalds 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

proc/smaps: show amount of nonlinear ptes in vma

2012-06-01T00:49:29+00:00

Currently, nonlinear mappings can not be distinguished from ordinary
mappings.  This patch adds into /proc/pid/smaps line "Nonlinear: 
kB", where size is amount of nonlinear ptes in vma, this line appears only
if VM_NONLINEAR is set.  This information may be useful not only for
checkpoint/restore project.

Requested by Pavel Emelyanov.

Signed-off-by: Konstantin Khlebnikov 
Cc: Andi Kleen 
Cc: Pavel Emelyanov 
Cc: Alexey Dobriyan 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

proc/smaps: carefully handle migration entries

2012-06-01T00:49:29+00:00

Currently smaps reports migration entries as "swap", as result "swap" can
appears in shared mapping.

This patch converts migration entries into pages and handles them as usual.

Signed-off-by: Konstantin Khlebnikov 
Cc: Andi Kleen 
Cc: Pavel Emelyanov 
Cc: Alexey Dobriyan 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

proc: report file/anon bit in /proc/pid/pagemap

2012-06-01T00:49:29+00:00

This is an implementation of Andrew's proposal to extend the pagemap file
bits to report what is missing about tasks' working set.

The problem with the working set detection is multilateral.  In the criu
(checkpoint/restore) project we dump the tasks' memory into image files
and to do it properly we need to detect which pages inside mappings are
really in use.  The mincore syscall I though could help with this did not.
 First, it doesn't report swapped pages, thus we cannot find out which
parts of anonymous mappings to dump.  Next, it does report pages from page
cache as present even if they are not mapped, and it doesn't make that has
not been cow-ed.

Note, that issue with swap pages is critical -- we must dump swap pages to
image file.  But the issues with file pages are optimization -- we can
take all file pages to image, this would be correct, but if we know that a
page is not mapped or not cow-ed, we can remove them from dump file.  The
dump would still be self-consistent, though significantly smaller in size
(up to 10 times smaller on real apps).

Andrew noticed, that the proc pagemap file solved 2 of 3 above issues --
it reports whether a page is present or swapped and it doesn't report not
mapped page cache pages.  But, it doesn't distinguish cow-ed file pages
from not cow-ed.

I would like to make the last unused bit in this file to report whether the
page mapped into respective pte is PageAnon or not.

[comment stolen from Pavel Emelyanov's v1 patch]

Signed-off-by: Konstantin Khlebnikov 
Cc: Pavel Emelyanov 
Cc: Matt Mackall 
Cc: Hugh Dickins 
Cc: Rik van Riel 
Acked-by: KOSAKI Motohiro 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds