linux-stable.git/mm, branch linux-2.6.24.y

alloc_percpu() fails to allocate percpu data

2008-04-19T01:53:21+00:00

upstream commit: be852795e1c8d3829ddf3cb1ce806113611fa555

Some oprofile results obtained while using tbench on a 2x2 cpu machine were
very surprising.

For example, loopback_xmit() function was using high number of cpu cycles
to perform the statistic updates, supposed to be real cheap since they use
percpu data

        pcpu_lstats = netdev_priv(dev);
        lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
        lb_stats->packets++;  /* HERE : serious contention */
        lb_stats->bytes += skb->len;

struct pcpu_lstats is a small structure containing two longs.  It appears
that on my 32bits platform, alloc_percpu(8) allocates a single cache line,
instead of giving to each cpu a separate cache line.

Using the following patch gave me impressive boost in various benchmarks
( 6 % in tbench)
(all percpu_counters hit this bug too)

Long term fix (ie >= 2.6.26) would be to let each CPU allocate their own
block of memory, so that we dont need to roudup sizes to L1_CACHE_BYTES, or
merging the SGI stuff of course...

Note : SLUB vs SLAB is important here to *show* the improvement, since they
dont have the same minimum allocation sizes (8 bytes vs 32 bytes).  This
could very well explain regressions some guys reported when they switched
to SLUB.

Signed-off-by: Eric Dumazet 
Acked-by: Peter Zijlstra 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Chris Wright

PERCPU : __percpu_alloc_mask() can dynamically size percpu_data storage

2008-04-19T01:53:21+00:00

upstream commit: b3242151906372f30f57feaa43b4cac96a23edb1

Instead of allocating a fix sized array of NR_CPUS pointers for percpu_data,
we can use nr_cpu_ids, which is generally < NR_CPUS.

Signed-off-by: Eric Dumazet 
Cc: Christoph Lameter 
Cc: "David S. Miller" 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Chris Wright

slab: fix cache_cache bootstrap in kmem_cache_init()

2008-04-19T01:53:20+00:00

upstream commit: ec1f5eeeb5a79a0d48036de649a3498da42db565

Commit 556a169dab38b5100df6f4a45b655dddd3db94c1 ("slab: fix bootstrap on
memoryless node") introduced bootstrap-time cache_cache list3s for all nodes
but forgot that initkmem_list3 needs to be accessed by [somevalue + node]. This
patch fixes list_add() corruption in mm/slab.c seen on the ES7000.

Cc: Mel Gorman 
Cc: Olaf Hering 
Signed-off-by: Dan Yeisley 
Signed-off-by: Pekka Enberg 
Signed-off-by: Christoph Lameter 
Signed-off-by: Chris Wright

slab: NUMA slab allocator migration bugfix

2008-03-24T18:48:36+00:00

NUMA slab allocator cpu migration bugfix

The NUMA slab allocator (specifically, cache_alloc_refill)
is not refreshing its local copies of what cpu and what
numa node it is on, when it drops and reacquires the irq
block that it inherited from its caller.  As a result
those values become invalid if an attempt to migrate the
process to another numa node occured while the irq block
had been dropped.

The solution is to make cache_alloc_refill reload these
variables whenever it drops and reacquires the irq block.

The error is very difficult to hit.  When it does occur,
one gets the following oops + stack traceback bits in
check_spinlock_acquired:

	kernel BUG at mm/slab.c:2417
	cache_alloc_refill+0xe6
	kmem_cache_alloc+0xd0
	...

This patch was developed against 2.6.23, ported to and
compiled-tested only against 2.6.25-rc4.

Signed-off-by: Joe Korty 
Signed-off-by: Christoph Lameter 
Signed-off-by: Chris Wright

hugetlb: ensure we do not reference a surplus page after handing it to buddy

2008-03-24T18:47:19+00:00

commit: e5df70ab194543522397fa3da8c8f80564a0f7d3

When we free a page via free_huge_page and we detect that we are in surplus
the page will be returned to the buddy.  After this we no longer own the page.

However at the end free_huge_page we clear out our mapping pointer from
page private.  Even where the page is not a surplus we free the page to
the hugepage pool, drop the pool locks and then clear page private.  In
either case the page may have been reallocated.  BAD.

Make sure we clear out page private before we free the page.

Signed-off-by: Andy Whitcroft 
Acked-by: Adam Litke 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Chris Wright 
Signed-off-by: Greg Kroah-Hartman

iov_iter_advance() fix

2008-03-24T18:47:10+00:00

commit: f7009264c519603b8ec67c881bd368a56703cfc9

iov_iter_advance() skips over zero-length iovecs, however it does not properly
terminate at the end of the iovec array.  Fix this by checking against
i->count before we skip a zero-length iov.

The bug was reproduced with a test program that continually randomly creates
iovs to writev.  The fix was also verified with the same program and also it
could verify that the correct data was contained in the file after each
writev.

Signed-off-by: Nick Piggin 
Tested-by: "Kevin Coffman" 
Cc: "Alexey Dobriyan" 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Chris Wright 
Signed-off-by: Greg Kroah-Hartman

SLUB: Deal with annoying gcc warning on kfree()

2008-02-26T00:18:57+00:00

patch 5bb983b0cce9b7b281af15730f7019116dd42568 in mainline.

gcc 4.2 spits out an annoying warning if one casts a const void *
pointer to a void * pointer. No warning is generated if the
conversion is done through an assignment.

Signed-off-by: Christoph Lameter 
Signed-off-by: Greg Kroah-Hartman

Be more robust about bad arguments in get_user_pages()

2008-02-26T00:18:44+00:00

patch 900cf086fd2fbad07f72f4575449e0d0958f860f in mainline.

So I spent a while pounding my head against my monitor trying to figure
out the vmsplice() vulnerability - how could a failure to check for
*read* access turn into a root exploit? It turns out that it's a buffer
overflow problem which is made easy by the way get_user_pages() is
coded.

In particular, "len" is a signed int, and it is only checked at the
*end* of a do {} while() loop.  So, if it is passed in as zero, the loop
will execute once and decrement len to -1.  At that point, the loop will
proceed until the next invalid address is found; in the process, it will
likely overflow the pages array passed in to get_user_pages().

I think that, if get_user_pages() has been asked to grab zero pages,
that's what it should do.  Thus this patch; it is, among other things,
enough to block the (already fixed) root exploit and any others which
might be lurking in similar code.  I also think that the number of pages
should be unsigned, but changing the prototype of this function probably
requires some more careful review.

Signed-off-by: Jonathan Corbet 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

hugetlb: add locking for overcommit sysctl

2008-02-26T00:18:32+00:00

patch a3d0c6aa1bb342b9b2c7b123b52ac2f48a4d4d0a in mainline.

When I replaced hugetlb_dynamic_pool with nr_overcommit_hugepages I used
proc_doulongvec_minmax() directly.  However, hugetlb.c's locking rules
require that all counter modifications occur under the hugetlb_lock.  Add a
callback into the hugetlb code similar to the one for nr_hugepages.  Grab
the lock around the manipulation of nr_overcommit_hugepages in
proc_doulongvec_minmax().

Signed-off-by: Nishanth Aravamudan 
Acked-by: Adam Litke 
Cc: David Gibson 
Cc: William Lee Irwin III 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

fix writev regression: pan hanging unkillable and un-straceable

2008-02-08T19:46:29+00:00

patch 124d3b7041f9a0ca7c43a6293e1cae4576c32fd5 in mainline.

Frederik Himpe reported an unkillable and un-straceable pan process.

Zero length iovecs can go into an infinite loop in writev, because the 
iovec iterator does not always advance over them.

The sequence required to trigger this is not trivial. I think it 
requires that a zero-length iovec be followed by a non-zero-length iovec 
which causes a pagefault in the atomic usercopy. This causes the writev 
code to drop back into single-segment copy mode, which then tries to 
copy the 0 bytes of the zero-length iovec; a zero length copy looks like 
a failure though, so it loops.

Put a test into iov_iter_advance to catch zero-length iovecs. We could 
just put the test in the fallback path, but I feel it is more robust to 
skip over zero-length iovecs throughout the code (iovec iterator may be 
used in filesystems too, so it should be robust).

Signed-off-by: Nick Piggin 
Signed-off-by: Ingo Molnar 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman