linux-stable.git/mm/memory.c, branch linux-2.6.25.y

mlock() fix return values

2008-08-20T18:15:25+00:00

commit a477097d9c37c1cf289c7f0257dffcfa42d50197 upstream

Halesh says:

Please find the below testcase provide to test mlock.

Test Case :
===========================

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

int main(void)
{
  int fd,ret, i = 0;
  char *addr, *addr1 = NULL;
  unsigned int page_size;
  struct rlimit rlim;

  if (0 != geteuid())
  {
   printf("Execute this pgm as root\n");
   exit(1);
  }

  /* create a file */
  if ((fd = open("mmap_test.c",O_RDWR|O_CREAT,0755)) == -1)
  {
   printf("cant create test file\n");
   exit(1);
  }

  page_size = sysconf(_SC_PAGE_SIZE);

  /* set the MEMLOCK limit */
  rlim.rlim_cur = 2000;
  rlim.rlim_max = 2000;

  if ((ret = setrlimit(RLIMIT_MEMLOCK,&rlim)) != 0)
  {
   printf("Cant change limit values\n");
   exit(1);
  }

  addr = 0;
  while (1)
  {
  /* map a page into memory each time*/
  if ((addr = (char *) mmap(addr,page_size, PROT_READ |
PROT_WRITE,MAP_SHARED,fd,0)) == MAP_FAILED)
  {
   printf("cant do mmap on file\n");
   exit(1);
  }

  if (0 == i)
    addr1 = addr;
  i++;
  errno = 0;
  /* lock the mapped memory pagewise*/
  if ((ret = mlock((char *)addr, 1500)) == -1)
  {
   printf("errno value is %d\n", errno);
   printf("cant lock maped region\n");
   exit(1);
  }
  addr = addr + page_size;
 }
}
======================================================

This testcase results in an mlock() failure with errno 14 that is EFAULT,
but it has nowhere been specified that mlock() will return EFAULT.  When I
tested the same on older kernels like 2.6.18, I got the correct result i.e
errno 12 (ENOMEM).

I think in source code mlock(2), setting errno ENOMEM has been missed in
do_mlock() , on mlock_fixup() failure.

SUSv3 requires the following behavior frmo mlock(2).

[ENOMEM]
    Some or all of the address range specified by the addr and
    len arguments does not correspond to valid mapped pages
    in the address space of the process.

[EAGAIN]
    Some or all of the memory identified by the operation could not
    be locked when the call was made.

This rule isn't so nice and slighly strange.  but many people think
POSIX/SUS compliance is important.

Reported-by: Halesh Sadashiv 
Tested-by: Halesh Sadashiv 
Signed-off-by: KOSAKI Motohiro 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

Fix ZERO_PAGE breakage with vmware

2008-06-24T21:08:31+00:00

commit 672ca28e300c17bf8d792a2a7a8631193e580c74 upstream

Commit 89f5b7da2a6bad2e84670422ab8192382a5aeb9f ("Reinstate ZERO_PAGE
optimization in 'get_user_pages()' and fix XIP") broke vmware, as
reported by Jeff Chua:

  "This broke vmware 6.0.4.
   Jun 22 14:53:03.845: vmx| NOT_IMPLEMENTED
   /build/mts/release/bora-93057/bora/vmx/main/vmmonPosix.c:774"

and the reason seems to be that there's an old bug in how we handle do
FOLL_ANON on VM_SHARED areas in get_user_pages(), but since it only
triggered if the whole page table was missing, nobody had apparently hit
it before.

The recent changes to 'follow_page()' made the FOLL_ANON logic trigger
not just for whole missing page tables, but for individual pages as
well, and exposed this problem.

This fixes it by making the test for when FOLL_ANON is used more
careful, and also makes the code easier to read and understand by moving
the logic to a separate inline function.

Reported-and-tested-by: Jeff Chua 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

Reinstate ZERO_PAGE optimization in 'get_user_pages()' and fix XIP

2008-06-24T21:08:29+00:00

commit 89f5b7da2a6bad2e84670422ab8192382a5aeb9f upstream

KAMEZAWA Hiroyuki and Oleg Nesterov point out that since the commit
557ed1fa2620dc119adb86b34c614e152a629a80 ("remove ZERO_PAGE") removed
the ZERO_PAGE from the VM mappings, any users of get_user_pages() will
generally now populate the VM with real empty pages needlessly.

We used to get the ZERO_PAGE when we did the "handle_mm_fault()", but
since fault handling no longer uses ZERO_PAGE for new anonymous pages,
we now need to handle that special case in follow_page() instead.

In particular, the removal of ZERO_PAGE effectively removed the core
file writing optimization where we would skip writing pages that had not
been populated at all, and increased memory pressure a lot by allocating
all those useless newly zeroed pages.

This reinstates the optimization by making the unmapped PTE case the
same as for a non-existent page table, which already did this correctly.

While at it, this also fixes the XIP case for follow_page(), where the
caller could not differentiate between the case of a page that simply
could not be used (because it had no "struct page" associated with it)
and a page that just wasn't mapped.

We do that by simply returning an error pointer for pages that could not
be turned into a "struct page *".  The error is arbitrarily picked to be
EFAULT, since that was what get_user_pages() already used for the
equivalent IO-mapped page case.

[ Also removed an impossible test for pte_offset_map_lock() failing:
  that's not how that function works ]

Acked-by: Oleg Nesterov 
Acked-by: Nick Piggin 
Cc: KAMEZAWA Hiroyuki 
Cc: Hugh Dickins 
Cc: Andrew Morton 
Cc: Ingo Molnar 
Cc: Roland McGrath 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

memcg: when do_swap's do_wp_page fails

2008-03-05T00:35:14+00:00

Don't uncharge when do_swap_page's call to do_wp_page fails: the page which
was charged for is there in the pagetable, and will be correctly uncharged
when that area is unmapped - it was only its COWing which failed.

And while we're here, remove earlier XXX comment: yes, OR in do_wp_page's
return value (maybe VM_FAULT_WRITE) with do_swap_page's there; but if it
fails, mask out success bits, which might confuse some arches e.g.  sparc.

Signed-off-by: Hugh Dickins 
Cc: David Rientjes 
Acked-by: Balbir Singh 
Acked-by: KAMEZAWA Hiroyuki 
Cc: Hirokazu Takahashi 
Cc: YAMAMOTO Takashi 
Cc: Paul Menage 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg: page_cache_release not __free_page

2008-03-05T00:35:14+00:00

There's nothing wrong with mem_cgroup_charge failure in do_wp_page and
do_anonymous page using __free_page, but it does look odd when nearby code
uses page_cache_release: use that instead (while turning a blind eye to
ancient inconsistencies of page_cache_release versus put_page).

Signed-off-by: Hugh Dickins 
Cc: David Rientjes 
Acked-by: Balbir Singh 
Acked-by: KAMEZAWA Hiroyuki 
Cc: Hirokazu Takahashi 
Cc: YAMAMOTO Takashi 
Cc: Paul Menage 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Merge git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86

2008-02-15T05:23:19+00:00

* git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86:
  x86: cpa, fix out of date comment
  KVM is not seen under X86 config with latest git (32 bit compile)
  x86: cpa: ensure page alignment
  x86: include proper prototypes for rodata_test
  x86: fix gart_iommu_init()
  x86: EFI set_memory_x()/set_memory_uc() fixes
  x86: make dump_pagetable() static
  x86: fix "BUG: sleeping function called from invalid context" in print_vma_addr()

d_path: Make d_path() use a struct path

2008-02-15T05:17:09+00:00

d_path() is used on a  pair.  Lets use a struct path to
reflect this.

[akpm@linux-foundation.org: fix build in mm/memory.c]
Signed-off-by: Jan Blunck 
Acked-by: Bryan Wu 
Acked-by: Christoph Hellwig 
Cc: Al Viro 
Cc: "J. Bruce Fields" 
Cc: Neil Brown 
Cc: Michael Halcrow 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

x86: fix "BUG: sleeping function called from invalid context" in print_vma_addr()

2008-02-14T22:30:19+00:00

Jiri Kosina reported the following deadlock scenario with
show_unhandled_signals enabled:

 [   68.379022] gnome-settings-[2941] trap int3 ip:3d2c840f34
 sp:7fff36f5d100 error:0<3>BUG: sleeping function called from invalid
 context at kernel/rwsem.c:21
 [   68.379039] in_atomic():1, irqs_disabled():0
 [   68.379044] no locks held by gnome-settings-/2941.
 [   68.379050] Pid: 2941, comm: gnome-settings- Not tainted 2.6.25-rc1 #30
 [   68.379054]
 [   68.379056] Call Trace:
 [   68.379061]  <#DB>  [] ? __debug_show_held_locks+0x13/0x30
 [   68.379109]  [] __might_sleep+0xe5/0x110
 [   68.379123]  [] down_read+0x20/0x70
 [   68.379137]  [] print_vma_addr+0x3a/0x110
 [   68.379152]  [] do_trap+0xf5/0x170
 [   68.379168]  [] do_int3+0x7b/0xe0
 [   68.379180]  [] int3+0x9f/0xd0
 [   68.379203]  <>
 [   68.379229]  in libglib-2.0.so.0.1505.0[3d2c800000+dc000]

and tracked it down to:

  commit 03252919b79891063cf99145612360efbdf9500b
  Author: Andi Kleen 
  Date:   Wed Jan 30 13:33:18 2008 +0100

      x86: print which shared library/executable faulted in segfault etc. messages

the problem is that we call down_read() from an atomic context.

Solve this by returning from print_vma_addr() if the preempt count is
elevated. Update preempt_conditional_sti / preempt_conditional_cli to
unconditionally lift the preempt count even on !CONFIG_PREEMPT.

Reported-by: Jiri Kosina 
Signed-off-by: Ingo Molnar

Be more robust about bad arguments in get_user_pages()

2008-02-12T04:44:44+00:00

So I spent a while pounding my head against my monitor trying to figure
out the vmsplice() vulnerability - how could a failure to check for
*read* access turn into a root exploit? It turns out that it's a buffer
overflow problem which is made easy by the way get_user_pages() is
coded.

In particular, "len" is a signed int, and it is only checked at the
*end* of a do {} while() loop.  So, if it is passed in as zero, the loop
will execute once and decrement len to -1.  At that point, the loop will
proceed until the next invalid address is found; in the process, it will
likely overflow the pages array passed in to get_user_pages().

I think that, if get_user_pages() has been asked to grab zero pages,
that's what it should do.  Thus this patch; it is, among other things,
enough to block the (already fixed) root exploit and any others which
might be lurking in similar code.  I also think that the number of pages
should be unsigned, but changing the prototype of this function probably
requires some more careful review.

Signed-off-by: Jonathan Corbet 
Signed-off-by: Linus Torvalds

CONFIG_HIGHPTE vs. sub-page page tables.

2008-02-08T17:22:42+00:00

Background: I've implemented 1K/2K page tables for s390.  These sub-page
page tables are required to properly support the s390 virtualization
instruction with KVM.  The SIE instruction requires that the page tables
have 256 page table entries (pte) followed by 256 page status table entries
(pgste).  The pgstes are only required if the process is using the SIE
instruction.  The pgstes are updated by the hardware and by the hypervisor
for a number of reasons, one of them is dirty and reference bit tracking.
To avoid wasting memory the standard pte table allocation should return
1K/2K (31/64 bit) and 2K/4K if the process is using SIE.

Problem: Page size on s390 is 4K, page table size is 1K or 2K.  That means
the s390 version for pte_alloc_one cannot return a pointer to a struct
page.  Trouble is that with the CONFIG_HIGHPTE feature on x86 pte_alloc_one
cannot return a pointer to a pte either, since that would require more than
32 bit for the return value of pte_alloc_one (and the pte * would not be
accessible since its not kmapped).

Solution: The only solution I found to this dilemma is a new typedef: a
pgtable_t.  For s390 pgtable_t will be a (pte *) - to be introduced with a
later patch.  For everybody else it will be a (struct page *).  The
additional problem with the initialization of the ptl lock and the
NR_PAGETABLE accounting is solved with a constructor pgtable_page_ctor and
a destructor pgtable_page_dtor.  The page table allocation and free
functions need to call these two whenever a page table page is allocated or
freed.  pmd_populate will get a pgtable_t instead of a struct page pointer.
 To get the pgtable_t back from a pmd entry that has been installed with
pmd_populate a new function pmd_pgtable is added.  It replaces the pmd_page
call in free_pte_range and apply_to_pte_range.

Signed-off-by: Martin Schwidefsky 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds