linux-stable.git/include/linux/mmzone.h, branch linux-2.6.35.y

mm: page allocator: adjust the per-cpu counter threshold when memory is low

2011-04-28T15:20:49+00:00

Upstream commit 88f5acf88ae6a9778f6d25d0d5d7ec2d57764a97

Commit aa45484 ("calculate a better estimate of NR_FREE_PAGES when memory
is low") noted that watermarks were based on the vmstat NR_FREE_PAGES.  To
avoid synchronization overhead, these counters are maintained on a per-cpu
basis and drained both periodically and when a threshold is above a
threshold.  On large CPU systems, the difference between the estimate and
real value of NR_FREE_PAGES can be very high.  The system can get into a
case where pages are allocated far below the min watermark potentially
causing livelock issues.  The commit solved the problem by taking a better
reading of NR_FREE_PAGES when memory was low.

Unfortately, as reported by Shaohua Li this accurate reading can consume a
large amount of CPU time on systems with many sockets due to cache line
bouncing.  This patch takes a different approach.  For large machines
where counter drift might be unsafe and while kswapd is awake, the per-cpu
thresholds for the target pgdat are reduced to limit the level of drift to
what should be a safe level.  This incurs a performance penalty in heavy
memory pressure by a factor that depends on the workload and the machine
but the machine should function correctly without accidentally exhausting
all memory on a node.  There is an additional cost when kswapd wakes and
sleeps but the event is not expected to be frequent - in Shaohua's test
case, there was one recorded sleep and wake event at least.

To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
introduced that takes a more accurate reading of NR_FREE_PAGES when called
from wakeup_kswapd, when deciding whether it is really safe to go back to
sleep in sleeping_prematurely() and when deciding if a zone is really
balanced or not in balance_pgdat().  We are still using an expensive
function but limiting how often it is called.

When the test case is reproduced, the time spent in the watermark
functions is reduced.  The following report is on the percentage of time
spent cumulatively spent in the functions zone_nr_free_pages(),
zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
zone_page_state_snapshot(), zone_page_state().

vanilla                      11.6615%
disable-threshold            0.2584%

David said:

: We had to pull aa454840 "mm: page allocator: calculate a better estimate
: of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
: internally because tests showed that it would cause the machine to stall
: as the result of heavy kswapd activity.  I merged it back with this fix as
: it is pending in the -mm tree and it solves the issue we were seeing, so I
: definitely think this should be pushed to -stable (and I would seriously
: consider it for 2.6.37 inclusion even at this late date).

Signed-off-by: Mel Gorman 
Signed-off-by: Andi Kleen 
Reported-by: Shaohua Li 
Reviewed-by: Christoph Lameter 
Tested-by: Nicolas Bareil 
Cc: David Rientjes 
Cc: Kyle McMartin 
Cc: 		[2.6.37.1, 2.6.36.x]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

backported from 88f5acf88ae6a9778f6d25d0d5d7ec2d57764a97
BugLink: http://bugs.launchpad.net/bugs/719446
Signed-off-by: Tim Gardner 
Signed-off-by: Andi Kleen

mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake

2010-09-27T00:18:40+00:00

commit aa45484031ddee09b06350ab8528bfe5b2c76d1c upstream.

Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
cheaper than scanning a number of lists.  To avoid synchronization
overhead, counter deltas are maintained on a per-cpu basis and drained
both periodically and when the delta is above a threshold.  On large CPU
systems, the difference between the estimated and real value of
NR_FREE_PAGES can be very high.  If NR_FREE_PAGES is much higher than
number of real free page in buddy, the VM can allocate pages below min
watermark, at worst reducing the real number of pages to zero.  Even if
the OOM killer kills some victim for freeing memory, it may not free
memory if the exit path requires a new page resulting in livelock.

This patch introduces a zone_page_state_snapshot() function (courtesy of
Christoph) that takes a slightly more accurate view of an arbitrary vmstat
counter.  It is used to read NR_FREE_PAGES while kswapd is awake to avoid
the watermark being accidentally broken.  The estimate is not perfect and
may result in cache line bounces but is expected to be lighter than the
IPI calls necessary to continually drain the per-cpu counters while kswapd
is awake.

Signed-off-by: Christoph Lameter 
Signed-off-by: Mel Gorman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

numa: introduce numa_mem_id()- effective local memory node id

2010-05-27T16:12:57+00:00

Introduce numa_mem_id(), based on generic percpu variable infrastructure
to track "nearest node with memory" for archs that support memoryless
nodes.

Define API in  when CONFIG_HAVE_MEMORYLESS_NODES
defined, else stubs.  Architectures will define HAVE_MEMORYLESS_NODES
if/when they support them.

Archs can override definitions of:

numa_mem_id() - returns node number of "local memory" node
set_numa_mem() - initialize [this cpus'] per cpu variable 'numa_mem'
cpu_to_mem()  - return numa_mem for specified cpu; may be used as lvalue

Generic initialization of 'numa_mem' occurs in __build_all_zonelists().
This will initialize the boot cpu at boot time, and all cpus on change of
numa_zonelist_order, or when node or memory hot-plug requires zonelist
rebuild.  Archs that support memoryless nodes will need to initialize
'numa_mem' for secondary cpus as they're brought on-line.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Lee Schermerhorn 
Signed-off-by: Christoph Lameter 
Cc: Tejun Heo 
Cc: Mel Gorman 
Cc: Christoph Lameter 
Cc: Nick Piggin 
Cc: David Rientjes 
Cc: Eric Whitney 
Cc: KAMEZAWA Hiroyuki 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: "H. Peter Anvin" 
Cc: "Luck, Tony" 
Cc: Pekka Enberg 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mem-hotplug: fix potential race while building zonelist for new populated zone

2010-05-25T15:07:02+00:00

Add global mutex zonelists_mutex to fix the possible race:

     CPU0                                  CPU1                    CPU2
(1) zone->present_pages += online_pages;
(2)                                       build_all_zonelists();
(3)                                                               alloc_page();
(4)                                                               free_page();
(5) build_all_zonelists();
(6)   __build_all_zonelists();
(7)     zone->pageset = alloc_percpu();

In step (3,4), zone->pageset still points to boot_pageset, so bad
things may happen if 2+ nodes are in this state. Even if only 1 node
is accessing the boot_pageset, (3) may still consume too much memory
to fail the memory allocations in step (7).

Besides, atomic operation ensures alloc_percpu() in step (7) will never fail
since there is a new fresh memory block added in step(6).

[haicheng.li@linux.intel.com: hold zonelists_mutex when build_all_zonelists]
Signed-off-by: Haicheng Li 
Signed-off-by: Wu Fengguang 
Reviewed-by: Andi Kleen 
Cc: Christoph Lameter 
Cc: Mel Gorman 
Cc: Tejun Heo 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mem-hotplug: avoid multiple zones sharing same boot strapping boot_pageset

2010-05-25T15:07:01+00:00

For each new populated zone of hotadded node, need to update its pagesets
with dynamically allocated per_cpu_pageset struct for all possible CPUs:

    1) Detach zone->pageset from the shared boot_pageset
       at end of __build_all_zonelists().

    2) Use mutex to protect zone->pageset when it's still
       shared in onlined_pages()

Otherwises, multiple zones of different nodes would share same boot strapping
boot_pageset for same CPU, which will finally cause below kernel panic:

  ------------[ cut here ]------------
  kernel BUG at mm/page_alloc.c:1239!
  invalid opcode: 0000 [#1] SMP
  ...
  Call Trace:
   [] __alloc_pages_nodemask+0x131/0x7b0
   [] alloc_pages_current+0x87/0xd0
   [] __page_cache_alloc+0x67/0x70
   [] __do_page_cache_readahead+0x120/0x260
   [] ra_submit+0x21/0x30
   [] ondemand_readahead+0x166/0x2c0
   [] page_cache_async_readahead+0x80/0xa0
   [] generic_file_aio_read+0x364/0x670
   [] nfs_file_read+0xca/0x130
   [] do_sync_read+0xfa/0x140
   [] vfs_read+0xb5/0x1a0
   [] sys_read+0x51/0x80
   [] system_call_fastpath+0x16/0x1b
  RIP  [] get_page_from_freelist+0x883/0x900
   RSP 
  ---[ end trace 4bda28328b9990db ]

[akpm@linux-foundation.org: merge fix]
Signed-off-by: Haicheng Li 
Signed-off-by: Wu Fengguang 
Reviewed-by: Andi Kleen 
Reviewed-by: Christoph Lameter 
Cc: Mel Gorman 
Cc: Tejun Heo 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: fix NR_SECTION_ROOTS == 0 when using using sparsemem extreme.

2010-05-25T15:07:01+00:00

Got this while compiling for ARM/SA1100:

mm/sparse.c: In function '__section_nr':
mm/sparse.c:135: warning: 'root' is used uninitialized in this function

This patch follows Russell King's suggestion for a new calculation for
NR_SECTION_ROOTS.  Thanks also to Sergei Shtylyov for pointing out the
existence of the macro DIV_ROUND_UP.

Atsushi Nemoto observed:
: This fix doesn't just silence the warning - it fixes a real problem.
:
: Without this fix, mem_section[] might have 0 size so mem_section[0]
: will share other variable area.  For example, I got:
:
: c030c700 b __warned.16478
: c030c700 B mem_section
: c030c701 b __warned.16483
:
: This might cause very strange behavior.  Your patch actually fixes it.

Signed-off-by: Marcelo Roberto Jimenez 
Cc: Atsushi Nemoto 
Cc: KOSAKI Motohiro 
Cc: Christoph Lameter 
Cc: Mel Gorman 
Cc: Minchan Kim 
Cc: Yinghai Lu 
Cc: Sergei Shtylyov 
Cc: Russell King 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: compaction: defer compaction using an exponential backoff when compaction fails

2010-05-25T15:07:00+00:00

The fragmentation index may indicate that a failure is due to external
fragmentation but after a compaction run completes, it is still possible
for an allocation to fail.  There are two obvious reasons as to why

  o Page migration cannot move all pages so fragmentation remains
  o A suitable page may exist but watermarks are not met

In the event of compaction followed by an allocation failure, this patch
defers further compaction in the zone (1 << compact_defer_shift) times.
If the next compaction attempt also fails, compact_defer_shift is
increased up to a maximum of 6.  If compaction succeeds, the defer
counters are reset again.

The zone that is deferred is the first zone in the zonelist - i.e.  the
preferred zone.  To defer compaction in the other zones, the information
would need to be stored in the zonelist or implemented similar to the
zonelist_cache.  This would impact the fast-paths and is not justified at
this time.

Signed-off-by: Mel Gorman 
Cc: Rik van Riel 
Cc: Minchan Kim 
Cc: KOSAKI Motohiro 
Cc: Christoph Lameter 
Cc: KAMEZAWA Hiroyuki 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Merge branch 'for-next' into for-linus

2010-03-08T15:55:37+00:00

Conflicts:
	Documentation/filesystems/proc.txt
	arch/arm/mach-u300/include/mach/debug-macro.S
	drivers/net/qlge/qlge_ethtool.c
	drivers/net/qlge/qlge_main.c
	drivers/net/typhoon.c

mm: restore zone->all_unreclaimable to independence word

2010-03-06T19:26:25+00:00

commit e815af95 ("change all_unreclaimable zone member to flags") changed
all_unreclaimable member to bit flag.  But it had an undesireble side
effect.  free_one_page() is one of most hot path in linux kernel and
increasing atomic ops in it can reduce kernel performance a bit.

Thus, this patch revert such commit partially. at least
all_unreclaimable shouldn't share memory word with other zone flags.

[akpm@linux-foundation.org: fix patch interaction]
Signed-off-by: KOSAKI Motohiro 
Cc: David Rientjes 
Cc: Wu Fengguang 
Cc: KAMEZAWA Hiroyuki 
Cc: Minchan Kim 
Cc: Huang Shijie 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Merge branch 'x86-bootmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

2010-03-03T16:15:05+00:00

* 'x86-bootmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
  early_res: Need to save the allocation name in drop_range_partial()
  sparsemem: Fix compilation on PowerPC
  early_res: Add free_early_partial()
  x86: Fix non-bootmem compilation on PowerPC
  core: Move early_res from arch/x86 to kernel/
  x86: Add find_fw_memmap_area
  Move round_up/down to kernel.h
  x86: Make 32bit support NO_BOOTMEM
  early_res: Enhance check_and_double_early_res
  x86: Move back find_e820_area to e820.c
  x86: Add find_early_area_size
  x86: Separate early_res related code from e820.c
  x86: Move bios page reserve early to head32/64.c
  sparsemem: Put mem map for one node together.
  sparsemem: Put usemap for one node together
  x86: Make 64 bit use early_res instead of bootmem before slab
  x86: Only call dma32_reserve_bootmem 64bit !CONFIG_NUMA
  x86: Make early_node_mem get mem > 4 GB if possible
  x86: Dynamically increase early_res array size
  x86: Introduce max_early_res and early_res_count
  ...