linux-stable.git/mm/memory.c, branch linux-2.6.13.y

Fix up recent get_user_pages() handling

2005-08-03T17:07:09+00:00

The VM_FAULT_WRITE thing is an extra bit, not a valid return value, and
has to be treated as such by get_user_pages().

Signed-off-by: Linus Torvalds

[PATCH] fix get_user_pages bug

2005-08-03T16:12:05+00:00

Checking pte_dirty instead of pte_write in __follow_page is problematic
for s390, and for copy_one_pte which leaves dirty when clearing write.

So revert __follow_page to check pte_write as before, and make
do_wp_page pass back a special extra VM_FAULT_WRITE bit to say it has
done its full job: once get_user_pages receives this value, it no longer
requires pte_write in __follow_page.

But most callers of handle_mm_fault, in the various architectures, have
switch statements which do not expect this new case.  To avoid changing
them all in a hurry, make an inline wrapper function (using the old
name) that masks off the new bit, and use the extended interface with
double underscores.

Yes, we do have a call to do_wp_page from do_swap_page, but no need to
change that: in rare case it's needed, another do_wp_page will follow.

Signed-off-by: Hugh Dickins 
[ Cleanups by Nick Piggin ]
Signed-off-by: Linus Torvalds

[PATCH] x86_64: access of some bad address

2005-08-02T04:38:00+00:00

x86_64 has a large sparse gate area between VSYSCALL_START and
VSYSCALL_END, not all of it presently backed by pmds.  Alexander Nyberg has
found that in some circumstances gdb may try to ptrace here, and hit
get_user_pages BUG_ON.  It seems odd that gdb should be accessing here, but
it certainly shouldn't crash in this way: relax BUG_ON to -EFAULT.  Fixes
kernel bugzilla #4801.

Signed-off-by: Hugh Dickins 
Cc: Andi Kleen 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Fix get_user_pages() race for write access

2005-08-01T18:14:49+00:00

There's no real guarantee that handle_mm_fault() will always be able to
break a COW situation - if an update from another thread ends up
modifying the page table some way, handle_mm_fault() may end up
requiring us to re-try the operation.

That's normally fine, but get_user_pages() ended up re-trying it as a
read, and thus a write access could in theory end up losing the dirty
bit or be done on a page that had not been properly COW'ed.

This makes get_user_pages() always retry write accesses as write
accesses by making "follow_page()" require that a writable follow has
the dirty bit set.  That simplifies the code and solves the race: if the
COW break fails for some reason, we'll just loop around and try again.

Signed-off-by: Linus Torvalds

[PATCH] check_user_page_readable() deadlock fix

2005-07-27T23:25:53+00:00

Fix bug identifued by Richard Purdie .

oprofile calls check_user_page_readable() from interrupt context, so we
deadlock over various VFS locks.

But check_user_page_readable() doesn't imply either a read or a write of the
page's contents.  Change __follow_page() so that check_user_page_readable()
can tell __follow_page() that we're not accessing the page's contents, and use
that info to avoid the troublesome lock-takings.

Also, make follow_page() inline for the single callsite in memory.c to save a
bit of stack space.

Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

[PATCH] mm: fix remap_pte_range BUG

2005-06-25T23:24:26+00:00

Out-of-tree user of remap_pfn_range hit kernel BUG at mm/memory.c:1112!  It
passes an unrounded size to remap_pfn_range, which was okay before 2.6.12,
but misses remap_pte_range's new end condition.  An audit of all the other
ptwalks confirms that this is the only one so exposed.

Signed-off-by: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

[PATCH] DocBook: update comments

2005-06-24T07:06:40+00:00

This patch updates some comments to match code changes.

Signed-off-by: Martin Waitz 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

[PATCH] sparsemem memory model

2005-06-23T16:45:04+00:00

Sparsemem abstracts the use of discontiguous mem_maps[].  This kind of
mem_map[] is needed by discontiguous memory machines (like in the old
CONFIG_DISCONTIGMEM case) as well as memory hotplug systems.  Sparsemem
replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
become a complete replacement.

A significant advantage over DISCONTIGMEM is that it's completely separated
from CONFIG_NUMA.  When producing this patch, it became apparent in that NUMA
and DISCONTIG are often confused.

Another advantage is that sparse doesn't require each NUMA node's ranges to be
contiguous.  It can handle overlapping ranges between nodes with no problems,
where DISCONTIGMEM currently throws away that memory.

Sparsemem uses an array to provide different pfn_to_page() translations for
each SECTION_SIZE area of physical memory.  This is what allows the mem_map[]
to be chopped up.

In order to do quick pfn_to_page() operations, the section number of the page
is encoded in page->flags.  Part of the sparsemem infrastructure enables
sharing of these bits more dynamically (at compile-time) between the
page_zone() and sparsemem operations.  However, on 32-bit architectures, the
number of bits is quite limited, and may require growing the size of the
page->flags type in certain conditions.  Several things might force this to
occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
memory), an increase in the physical address space, or an increase in the
number of used page->flags.

One thing to note is that, once sparsemem is present, the NUMA node
information no longer needs to be stored in the page->flags.  It might provide
speed increases on certain platforms and will be stored there if there is
room.  But, if out of room, an alternate (theoretically slower) mechanism is
used.

This patch introduces CONFIG_FLATMEM.  It is used in almost all cases where
there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
often have to compile out the same areas of code.

Signed-off-by: Andy Whitcroft 
Signed-off-by: Dave Hansen 
Signed-off-by: Martin Bligh 
Signed-off-by: Adrian Bunk 
Signed-off-by: Yasunori Goto 
Signed-off-by: Bob Picco 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

[PATCH] can_share_swap_page: use page_mapcount

2005-06-22T01:46:21+00:00

Remember that ironic get_user_pages race?  when the raised page_count on a
page swapped out led do_wp_page to decide that it had to copy on write, so
substituted a different page into userspace.  2.6.7 onwards have Andrea's
solution, where try_to_unmap_one backs out if it finds page_count raised.

Which works, but is unsatisfying (rmap.c has no other page_count heuristics),
and was found a few months ago to hang an intensive page migration test.  A
year ago I was hesitant to engage page_mapcount, now it seems the right fix.

So remove the page_count hack from try_to_unmap_one; and use activate_page in
unuse_mm when dropping lock, to replace its secondary effect of helping
swapoff to make progress in that case.

Simplify can_share_swap_page (now called only on anonymous pages) to check
page_mapcount + page_swapcount == 1: still needs the page lock to stabilize
their (pessimistic) sum, but does not need swapper_space.tree_lock for that.

In do_swap_page, move swap_free and unlock_page below page_add_anon_rmap, to
keep sum on the high side, and correct when can_share_swap_page called.

Signed-off-by: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

[PATCH] do_wp_page: cannot share file page

2005-06-22T01:46:21+00:00

A small optimization to do_wp_page's check for whether to avoid copy by
reusing the page already mapped.  It can never share a cached file page,
nor can it share a reserved page (often the empty zero page), so it's a
waste of time to lock and unlock in those cases.  Which nowadays can both
be neatly excluded by a preliminary PageAnon test.

Christoph has reported that a preliminary page_count test proved valuable
for scalability here, but PageAnon covers more common cases all at once.

Signed-off-by: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds