linux-stable.git/include/linux/mm_types.h, branch v3.1.7

mm: thp: tail page refcounting fix

2011-11-11T17:43:41+00:00

commit 70b50f94f1644e2aa7cb374819cfd93f3c28d725 upstream.

Michel while working on the working set estimation code, noticed that
calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
wasn't safe, if the pfn ended up being a tail page of a transparent
hugepage under splitting by __split_huge_page_refcount().

He then found the problem could also theoretically materialize with
page_cache_get_speculative() during the speculative radix tree lookups
that uses get_page_unless_zero() in SMP if the radix tree page is freed
and reallocated and get_user_pages is called on it before
page_cache_get_speculative has a chance to call get_page_unless_zero().

So the best way to fix the problem is to keep page_tail->_count zero at
all times.  This will guarantee that get_page_unless_zero() can never
succeed on any tail page.  page_tail->_mapcount is guaranteed zero and
is unused for all tail pages of a compound page, so we can simply
account the tail page references there and transfer them to
tail_page->_count in __split_huge_page_refcount() (in addition to the
head_page->_mapcount).

While debugging this s/_count/_mapcount/ change I also noticed get_page is
called by direct-io.c on pages returned by get_user_pages.  That wasn't
entirely safe because the two atomic_inc in get_page weren't atomic.  As
opposed to other get_user_page users like secondary-MMU page fault to
establish the shadow pagetables would never call any superflous get_page
after get_user_page returns.  It's safer to make get_page universally safe
for tail pages and to use get_page_foll() within follow_page (inside
get_user_pages()).  get_page_foll() is safe to do the refcounting for tail
pages without taking any locks because it is run within PT lock protected
critical sections (PT lock for pte and page_table_lock for
pmd_trans_huge).

The standard get_page() as invoked by direct-io instead will now take
the compound_lock but still only for tail pages.  The direct-io paths
are usually I/O bound and the compound_lock is per THP so very
finegrined, so there's no risk of scalability issues with it.  A simple
direct-io benchmarks with all lockdep prove locking and spinlock
debugging infrastructure enabled shows identical performance and no
overhead.  So it's worth it.  Ideally direct-io should stop calling
get_page() on pages returned by get_user_pages().  The spinlock in
get_page() is already optimized away for no-THP builds but doing
get_page() on tail pages returned by GUP is generally a rare operation
and usually only run in I/O paths.

This new refcounting on page_tail->_mapcount in addition to avoiding new
RCU critical sections will also allow the working set estimation code to
work without any further complexity associated to the tail page
refcounting with THP.

Signed-off-by: Andrea Arcangeli 
Reported-by: Michel Lespinasse 
Reviewed-by: Michel Lespinasse 
Reviewed-by: Minchan Kim 
Cc: Peter Zijlstra 
Cc: Hugh Dickins 
Cc: Johannes Weiner 
Cc: Rik van Riel 
Cc: Mel Gorman 
Cc: KOSAKI Motohiro 
Cc: Benjamin Herrenschmidt 
Cc: David Gibson 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

Avoid duplicate _count variables in page_struct

2011-07-18T12:17:01+00:00

Restructure the union / struct cascade in struct page so that
we only have one definition of _count.

Tested-by: Hugh Dickins 
Signed-off-by: Christoph Lameter 
Signed-off-by: Pekka Enberg

Revert "SLUB: Fix build breakage in linux/mm_types.h"

2011-07-18T12:16:55+00:00

This reverts commit ea6bd8ee1a2ccdffc38b2b1fcfe941addfafaade.

SLUB: Fix build breakage in linux/mm_types.h

2011-07-07T19:24:30+00:00

On Wed, 6 Jul 2011, Jonathan Cameron wrote:

> Getting:
>
>   CHK     include/linux/version.h
>   CHK     include/generated/utsrelease.h
> make[1]: `include/generated/mach-types.h' is up to date.
>   CC      arch/arm/kernel/asm-offsets.s
> In file included from include/linux/sched.h:64:0,
>                  from arch/arm/kernel/asm-offsets.c:13:
> include/linux/mm_types.h:74:15: error: duplicate member '_count'
> make[1]: *** [arch/arm/kernel/asm-offsets.s] Error 1
> make: *** [prepare0] Error 2
>
> Issue looks to have been introduced by
>
>     mm: Rearrange struct page
>
> fc9bb8c768abe7ae10861c3510e01a95f98d5933
>
> Guessing it's a known issue, but just thought I'd flag it up in case
> it's something very specific about my build.
>
> gcc-2.6 armv7a
>
> Reverting that patch works, but given I don't know the history, I'm
> not proposing doing that in general!

Well _count exists in two unionized structs but always has the same offset
within the larger struct. Maybe ARM creates different offsets there for
some reason?

The following is a patch to restructure the union / structs combo in such
a way that only a single definition of _count

Reported-by: Jonathan Cameron 
Tested-by: Piotr Hosowicz 
Signed-off-by: Christoph Lameter 
Signed-off-by: Pekka Enberg

mm: Rearrange struct page

2011-07-02T10:26:53+00:00

We need to be able to use cmpxchg_double on the freelist and object count
field in struct page. Rearrange the fields in struct page according to
doubleword entities so that the freelist pointer comes before the counters.
Do the rearranging with a future in mind where we use more doubleword
atomics to avoid locking of updates to flags/mapping or lru pointers.

Create another union to allow access to counters in struct page as a
single unsigned long value.

The doublewords must be properly aligned for cmpxchg_double to work.
Sadly this increases the size of page struct by one word on some architectures.
But as a resultpage structs are now cacheline aligned on x86_64.

Signed-off-by: Christoph Lameter 
Signed-off-by: Pekka Enberg

slub: Do not use frozen page flag but a bit in the page counters

2011-07-02T10:26:52+00:00

Do not use a page flag for the frozen bit. It needs to be part
of the state that is handled with cmpxchg_double(). So use a bit
in the counter struct in the page struct for that purpose.

Signed-off-by: Christoph Lameter 
Signed-off-by: Pekka Enberg

mm: Fix boot crash in mm_alloc()

2011-05-29T18:32:28+00:00

Thomas Gleixner reports that we now have a boot crash triggered by
CONFIG_CPUMASK_OFFSTACK=y:

    BUG: unable to handle kernel NULL pointer dereference at   (null)
    IP: [] find_next_bit+0x55/0xb0
    Call Trace:
     [] cpumask_any_but+0x2a/0x70
     [] flush_tlb_mm+0x2b/0x80
     [] pud_populate+0x35/0x50
     [] pgd_alloc+0x9a/0xf0
     [] mm_init+0xec/0x120
     [] mm_alloc+0x53/0xd0

which was introduced by commit de03c72cfce5 ("mm: convert
mm->cpu_vm_cpumask into cpumask_var_t"), and is due to wrong ordering of
mm_init() vs mm_init_cpumask

Thomas wrote a patch to just fix the ordering of initialization, but I
hate the new double allocation in the fork path, so I ended up instead
doing some more radical surgery to clean it all up.

Reported-by: Thomas Gleixner 
Reported-by: Ingo Molnar 
Cc: KOSAKI Motohiro 
Cc: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: extract exe_file handling from procfs

2011-05-27T00:12:36+00:00

Setup and cleanup of mm_struct->exe_file is currently done in fs/proc/.
This was because exe_file was needed only for /proc//exe.  Since we
will need the exe_file functionality also for core dumps (so core name can
contain full binary path), built this functionality always into the
kernel.

To achieve that move that out of proc FS to the kernel/ where in fact it
should belong.  By doing that we can make dup_mm_exe_file static.  Also we
can drop linux/proc_fs.h inclusion in fs/exec.c and kernel/fork.c.

Signed-off-by: Jiri Slaby 
Cc: Alexander Viro 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: don't access vm_flags as 'int'

2011-05-26T16:20:31+00:00

The type of vma->vm_flags is 'unsigned long'. Neither 'int' nor
'unsigned int'. This patch fixes such misuse.

Signed-off-by: KOSAKI Motohiro 
[ Changed to use a typedef - we'll extend it to cover more cases
  later, since there has been discussion about making it a 64-bit
  type..                      - Linus ]
Signed-off-by: Linus Torvalds

mm: delete non-atomic mm counter implementation

2011-05-25T15:39:30+00:00

The problem with having two different types of counters is that developers
adding new code need to keep in mind whether it's safe to use both the
atomic and non-atomic implementations.  For example, when adding new
callers of the *_mm_counter() functions a developer needs to ensure that
those paths are always executed with page_table_lock held, in case we're
using the non-atomic implementation of mm counters.

Hugh Dickins introduced the atomic mm counters in commit f412ac08c986
("[PATCH] mm: fix rss and mmlist locking").  When asked why he left the
non-atomic counters around he said,

  | The only reason was to avoid adding costly atomic operations into a
  | configuration that had no need for them there: the page_table_lock
  | sufficed.
  |
  | Certainly it would be simpler just to delete the non-atomic variant.
  |
  | And I think it's fair to say that any configuration on which we're
  | measuring performance to that degree (rather than "does it boot fast?"
  | type measurements), would already be going the split ptlocks route.

Removing the non-atomic counters eases the maintenance burden because
developers no longer have to mindful of the two implementations when using
*_mm_counter().

Note that all architectures provide a means of atomically updating
atomic_long_t variables, even if they have to revert to the generic
spinlock implementation because they don't support 64-bit atomic
instructions (see lib/atomic64.c).

Signed-off-by: Matt Fleming 
Acked-by: Dave Hansen 
Acked-by: KAMEZAWA Hiroyuki 
Cc: Hugh Dickins 
Cc: Christoph Lameter 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds