linux-stable.git/mm, branch linux-2.6.18.y

Fix for shmem_truncate_range() BUG_ON()

2007-02-23T23:49:53+00:00

Ran into BUG() while doing madvise(REMOVE) testing.  If we are punching a
hole into shared memory segment using madvise(REMOVE) and the entire hole
is below the indirect blocks, we hit following assert.

	        BUG_ON(limit <= SHMEM_NR_DIRECT);

Signed-off-by: Badari Pulavarty 
Cc: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Chris Wright 
Signed-off-by: Greg Kroah-Hartman

fix msync error on unmapped area

2007-02-23T23:49:52+00:00

Fix the 2.6.18 sys_msync to report -ENOMEM correctly when an unmapped area
falls within its range, and not to overshoot: to satisfy LSB 3.1 tests and
to fix Debian Bug#394392.  Took the 2.6.19 sys_msync as starting point
(including its cleanup of repeated "current->mm"s), reintroducing the
msync_interval and balance_dirty_pages_ratelimited_nr needed in 2.6.18.

The misbehaviour fixed here may not seem very serious; but it was enough
to mislead Debian into backporting 2.6.19's dirty page tracking patches,
with attendant mayhem when those resulted in unsuspected file corruption.

Signed-off-by: Hugh Dickins 
Signed-off-by: Chris Wright 
Signed-off-by: Greg Kroah-Hartman

read_zero_pagealigned() locking fix

2007-02-23T23:49:52+00:00

Ramiro Voicu hits the BUG_ON(!pte_none(*pte)) in zeromap_pte_range: kernel
bugzilla 7645.  Right: read_zero_pagealigned uses down_read of mmap_sem,
but another thread's racing read of /dev/zero, or a normal fault, can
easily set that pte again, in between zap_page_range and zeromap_page_range
getting there.  It's been wrong ever since 2.4.3.

The simple fix is to use down_write instead, but that would serialize reads
of /dev/zero more than at present: perhaps some app would be badly
affected.  So instead let zeromap_page_range return the error instead of
BUG_ON, and read_zero_pagealigned break to the slower clear_user loop in
that case - there's no need to optimize for it.

Use -EEXIST for when a pte is found: BUG_ON in mmap_zero (the other user of
zeromap_page_range), though it really isn't interesting there.  And since
mmap_zero wants -EAGAIN for out-of-memory, the zeromaps better return that
than -ENOMEM.

Signed-off-by: Hugh Dickins 
Cc: Ramiro Voicu: 
Signed-off-by: Andrew Morton 
Signed-off-by: Chris Wright 
Signed-off-by: Greg Kroah-Hartman

Fix incorrect user space access locking in mincore() (CVE-2006-4814)

2007-02-23T23:49:52+00:00

Doug Chapman noticed that mincore() will doa "copy_to_user()" of the
result while holding the mmap semaphore for reading, which is a big
no-no.  While a recursive read-lock on a semaphore in the case of a page
fault happens to work, we don't actually allow them due to deadlock
schenarios with writers due to fairness issues.

Doug and Marcel sent in a patch to fix it, but I decided to just rewrite
the mess instead - not just fixing the locking problem, but making the
code smaller and (imho) much easier to understand.

Cc: Doug Chapman 
Cc: Marcel Holtmann 
Cc: Hugh Dickins 
Cc: Andrew Morton 
[chrisw: fold in subsequent fix: 4fb23e439ce0]
Acked-by: Hugh Dickins 
[chrisw: fold in subsequent fix: 825020c3866e]
Signed-off-by: Oleg Nesterov 
Signed-off-by: Linus Torvalds 
Signed-off-by: Chris Wright 
Signed-off-by: Greg Kroah-Hartman

[PATCH] init_reap_node() initialization fix

2006-11-19T03:28:01+00:00

It looks like there is a bug in init_reap_node() in slab.c that can cause
multiple oops's on certain ES7000 configurations.  The variable reap_node
is defined per cpu, but only initialized on a single CPU.  This causes an
oops in next_reap_node() when __get_cpu_var(reap_node) returns the wrong
value.  Fix is below.

Signed-off-by: Dan Yeisley 
Cc: Andi Kleen 
Acked-by: Christoph Lameter 
Cc: Pekka Enberg 
Cc: Manfred Spraul 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Chris Wright

[PATCH] Fix sys_move_pages when a NULL node list is passed.

2006-11-19T03:27:58+00:00

sys_move_pages() uses vmalloc() to allocate an array of structures
that is fills with information passed from user mode and then passes to
do_stat_pages() (in the case the node list is NULL).  do_stat_pages()
depends on a marker in the node field of the structure to decide how large
the array is and this marker is correctly inserted into the last element
of the array.  However, vmalloc() doesn't zero the memory it allocates
and if the user passes NULL for the node list, then the node fields are
not filled in (except for the end marker).  If the memory the vmalloc()
returned happend to have a word with the marker value in it in just the
right place, do_pages_stat will fail to fill the status field of part
of the array and we will return (random) kernel data to user mode.

Signed-off-by: Stephen Rothwell 
Acked-by: Christoph Lameter 
Signed-off-by: Chris Wright

[PATCH] Use min of two prio settings in calculating distress for reclaim

2006-11-04T01:33:50+00:00

If try_to_free_pages / balance_pgdat are called with a gfp_mask specifying
GFP_IO and/or GFP_FS, they will reclaim the requisite number of pages, and the
reset prev_priority to DEF_PRIORITY (or to some other high (ie: unurgent)
value).

However, another reclaimer without those gfp_mask flags set (say, GFP_NOIO)
may still be struggling to reclaim pages.  The concurrent overwrite of
zone->prev_priority will cause this GFP_NOIO thread to unexpectedly cease
deactivating mapped pages, thus causing reclaim difficulties.

Fix this is to key the distress calculation not off zone->prev_priority, but
also take into account the local caller's priority by using
min(zone->prev_priority, sc->priority)

Signed-off-by: Martin J. Bligh 
Cc: Nick Piggin 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Chris Wright

[PATCH] vmscan: Fix temp_priority race

2006-11-04T01:33:50+00:00

The temp_priority field in zone is racy, as we can walk through a reclaim
path, and just before we copy it into prev_priority, it can be overwritten
(say with DEF_PRIORITY) by another reclaimer.

The same bug is contained in both try_to_free_pages and balance_pgdat, but
it is fixed slightly differently.  In balance_pgdat, we keep a separate
priority record per zone in a local array.  In try_to_free_pages there is
no need to do this, as the priority level is the same for all zones that we
reclaim from.

Impact of this bug is that temp_priority is copied into prev_priority, and
setting this artificially high causes reclaimers to set distress
artificially low.  They then fail to reclaim mapped pages, when they are,
in fact, under severe memory pressure (their priority may be as low as 0).
This causes the OOM killer to fire incorrectly.

From: Andrew Morton 

__zone_reclaim() isn't modifying zone->prev_priority.  But zone->prev_priority
is used in the decision whether or not to bring mapped pages onto the inactive
list.  Hence there's a risk here that __zone_reclaim() will fail because
zone->prev_priority ir large (ie: low urgency) and lots of mapped pages end up
stuck on the active list.

Fix that up by decreasing (ie making more urgent) zone->prev_priority as
__zone_reclaim() scans the zone's pages.

This bug perhaps explains why ZONE_RECLAIM_PRIORITY was created.  It should be
possible to remove that now, and to just start out at DEF_PRIORITY?

Cc: Nick Piggin 
Cc: Christoph Lameter 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Chris Wright 
[chrisw: minor wiggle to fit -stable]

[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc

2006-11-04T01:33:49+00:00

Revert "[PATCH] Remove SPAN_OTHER_NODES config definition"
    This reverts commit f62859bb6871c5e4a8e591c60befc8caaf54db8c.
Revert "[PATCH] mm: remove arch independent NODES_SPAN_OTHER_NODES"
    This reverts commit a94b3ab7eab4edcc9b2cb474b188f774c331adf7.

Also update the comments to indicate that this is still required
and where its used.

Signed-off-by: Andy Whitcroft 
Cc: Paul Mackerras 
Cc: Mike Kravetz 
Cc: Benjamin Herrenschmidt 
Acked-by: Mel Gorman 
Acked-by: Will Schmidt 
Cc: Christoph Lameter 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Chris Wright

[PATCH] mm: fix a race condition under SMC + COW

2006-11-04T01:33:44+00:00

Failing context is a multi threaded process context and the failing
sequence is as follows.

One thread T0 doing self modifying code on page X on processor P0 and
another thread T1 doing COW (breaking the COW setup as part of just
happened fork() in another thread T2) on the same page X on processor P1.
T0 doing SMC can endup modifying the new page Y (allocated by the T1 doing
COW on P1) but because of different I/D TLB's, P0 ITLB will not see the new
mapping till the flush TLB IPI from P1 is received.  During this interval,
if T0 executes the code created by SMC it can result in an app error (as
ITLB still points to old page X and endup executing the content in page X
rather than using the content in page Y).

Fix this issue by first clearing the PTE and flushing it, before updating
it with new entry.

Hugh sayeth:

  I was a bit sceptical, in the habit of thinking that Self Modifying Code
  must look such issues itself: but I guess there's nothing it can do to avoid
  this one.

  Fair enough, what you're changing it to is pretty much what powerpc and
  s390 were already doing, and is a more robust way of proceeding, consistent
  with how ptes are set everywhere else.

  The ptep_clear_flush is a bit heavy-handed (it's anxious to return the pte
  that was atomically cleared), but we'd have to wander through lots of arches
  to get the right minimal behaviour.  It'd also be nice to eliminate
  ptep_establish completely, now only used to define other macros/inlines: it
  always seemed obfuscation to me, what you've got there now is clearer.
  Let's put those cleanups on a TODO list.

Signed-off-by: Suresh Siddha 
Acked-by: "David S. Miller" 
Acked-by: Hugh Dickins 
Cc: Nick Piggin 
Cc: Peter Zijlstra 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman 
Signed-off-by: Chris Wright