<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/mm/memory_hotplug.c, branch v3.14.78</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>mm/memory_hotplug.c: check for missing sections in test_pages_in_a_zone()</title>
<updated>2016-02-25T19:58:55+00:00</updated>
<author>
<name>Andrew Banman</name>
<email>abanman@sgi.com</email>
</author>
<published>2015-12-29T22:54:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=bda630e4c3217f8c4c0905e544ac2cae14d81785'/>
<id>bda630e4c3217f8c4c0905e544ac2cae14d81785</id>
<content type='text'>
commit 5f0f2887f4de9508dcf438deab28f1de8070c271 upstream.

test_pages_in_a_zone() does not account for the possibility of missing
sections in the given pfn range.  pfn_valid_within always returns 1 when
CONFIG_HOLES_IN_ZONE is not set, allowing invalid pfns from missing
sections to pass the test, leading to a kernel oops.

Wrap an additional pfn loop with PAGES_PER_SECTION granularity to check
for missing sections before proceeding into the zone-check code.

This also prevents a crash from offlining memory devices with missing
sections.  Despite this, it may be a good idea to keep the related patch
'[PATCH 3/3] drivers: memory: prohibit offlining of memory blocks with
missing sections' because missing sections in a memory block may lead to
other problems not covered by the scope of this fix.

Signed-off-by: Andrew Banman &lt;abanman@sgi.com&gt;
Acked-by: Alex Thorlton &lt;athorlton@sgi.com&gt;
Cc: Russ Anderson &lt;rja@sgi.com&gt;
Cc: Alex Thorlton &lt;athorlton@sgi.com&gt;
Cc: Yinghai Lu &lt;yinghai@kernel.org&gt;
Cc: Greg KH &lt;greg@kroah.com&gt;
Cc: Seth Jennings &lt;sjennings@variantweb.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 5f0f2887f4de9508dcf438deab28f1de8070c271 upstream.

test_pages_in_a_zone() does not account for the possibility of missing
sections in the given pfn range.  pfn_valid_within always returns 1 when
CONFIG_HOLES_IN_ZONE is not set, allowing invalid pfns from missing
sections to pass the test, leading to a kernel oops.

Wrap an additional pfn loop with PAGES_PER_SECTION granularity to check
for missing sections before proceeding into the zone-check code.

This also prevents a crash from offlining memory devices with missing
sections.  Despite this, it may be a good idea to keep the related patch
'[PATCH 3/3] drivers: memory: prohibit offlining of memory blocks with
missing sections' because missing sections in a memory block may lead to
other problems not covered by the scope of this fix.

Signed-off-by: Andrew Banman &lt;abanman@sgi.com&gt;
Acked-by: Alex Thorlton &lt;athorlton@sgi.com&gt;
Cc: Russ Anderson &lt;rja@sgi.com&gt;
Cc: Alex Thorlton &lt;athorlton@sgi.com&gt;
Cc: Yinghai Lu &lt;yinghai@kernel.org&gt;
Cc: Greg KH &lt;greg@kroah.com&gt;
Cc: Seth Jennings &lt;sjennings@variantweb.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>mm/memory_hotplug.c: set zone-&gt;wait_table to null after freeing it</title>
<updated>2015-06-23T00:01:23+00:00</updated>
<author>
<name>Gu Zheng</name>
<email>guz.fnst@cn.fujitsu.com</email>
</author>
<published>2015-06-10T18:14:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=20272b17474284e93cc8af418b39a30f94357f28'/>
<id>20272b17474284e93cc8af418b39a30f94357f28</id>
<content type='text'>
commit 85bd839983778fcd0c1c043327b14a046e979b39 upstream.

Izumi found the following oops when hot re-adding a node:

    BUG: unable to handle kernel paging request at ffffc90008963690
    IP: __wake_up_bit+0x20/0x70
    Oops: 0000 [#1] SMP
    CPU: 68 PID: 1237 Comm: rs:main Q:Reg Not tainted 4.1.0-rc5 #80
    Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 1.87 04/28/2015
    task: ffff880838df8000 ti: ffff880017b94000 task.ti: ffff880017b94000
    RIP: 0010:[&lt;ffffffff810dff80&gt;]  [&lt;ffffffff810dff80&gt;] __wake_up_bit+0x20/0x70
    RSP: 0018:ffff880017b97be8  EFLAGS: 00010246
    RAX: ffffc90008963690 RBX: 00000000003c0000 RCX: 000000000000a4c9
    RDX: 0000000000000000 RSI: ffffea101bffd500 RDI: ffffc90008963648
    RBP: ffff880017b97c08 R08: 0000000002000020 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a0797c73800
    R13: ffffea101bffd500 R14: 0000000000000001 R15: 00000000003c0000
    FS:  00007fcc7ffff700(0000) GS:ffff880874800000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffc90008963690 CR3: 0000000836761000 CR4: 00000000001407e0
    Call Trace:
      unlock_page+0x6d/0x70
      generic_write_end+0x53/0xb0
      xfs_vm_write_end+0x29/0x80 [xfs]
      generic_perform_write+0x10a/0x1e0
      xfs_file_buffered_aio_write+0x14d/0x3e0 [xfs]
      xfs_file_write_iter+0x79/0x120 [xfs]
      __vfs_write+0xd4/0x110
      vfs_write+0xac/0x1c0
      SyS_write+0x58/0xd0
      system_call_fastpath+0x12/0x76
    Code: 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 45 f8 31 c0 48 8d 47 48 &lt;48&gt; 39 47 48 48 c7 45 e8 00 00 00 00 48 c7 45 f0 00 00 00 00 48
    RIP  [&lt;ffffffff810dff80&gt;] __wake_up_bit+0x20/0x70
     RSP &lt;ffff880017b97be8&gt;
    CR2: ffffc90008963690

Reproduce method (re-add a node)::
  Hot-add nodeA --&gt; remove nodeA --&gt; hot-add nodeA (panic)

This seems an use-after-free problem, and the root cause is
zone-&gt;wait_table was not set to *NULL* after free it in
try_offline_node.

When hot re-add a node, we will reuse the pgdat of it, so does the zone
struct, and when add pages to the target zone, it will init the zone
first (including the wait_table) if the zone is not initialized.  The
judgement of zone initialized is based on zone-&gt;wait_table:

	static inline bool zone_is_initialized(struct zone *zone)
	{
		return !!zone-&gt;wait_table;
	}

so if we do not set the zone-&gt;wait_table to *NULL* after free it, the
memory hotplug routine will skip the init of new zone when hot re-add
the node, and the wait_table still points to the freed memory, then we
will access the invalid address when trying to wake up the waiting
people after the i/o operation with the page is done, such as mentioned
above.

Signed-off-by: Gu Zheng &lt;guz.fnst@cn.fujitsu.com&gt;
Reported-by: Taku Izumi &lt;izumi.taku@jp.fujitsu.com&gt;
Reviewed by: Yasuaki Ishimatsu &lt;isimatu.yasuaki@jp.fujitsu.com&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Tang Chen &lt;tangchen@cn.fujitsu.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 85bd839983778fcd0c1c043327b14a046e979b39 upstream.

Izumi found the following oops when hot re-adding a node:

    BUG: unable to handle kernel paging request at ffffc90008963690
    IP: __wake_up_bit+0x20/0x70
    Oops: 0000 [#1] SMP
    CPU: 68 PID: 1237 Comm: rs:main Q:Reg Not tainted 4.1.0-rc5 #80
    Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 1.87 04/28/2015
    task: ffff880838df8000 ti: ffff880017b94000 task.ti: ffff880017b94000
    RIP: 0010:[&lt;ffffffff810dff80&gt;]  [&lt;ffffffff810dff80&gt;] __wake_up_bit+0x20/0x70
    RSP: 0018:ffff880017b97be8  EFLAGS: 00010246
    RAX: ffffc90008963690 RBX: 00000000003c0000 RCX: 000000000000a4c9
    RDX: 0000000000000000 RSI: ffffea101bffd500 RDI: ffffc90008963648
    RBP: ffff880017b97c08 R08: 0000000002000020 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a0797c73800
    R13: ffffea101bffd500 R14: 0000000000000001 R15: 00000000003c0000
    FS:  00007fcc7ffff700(0000) GS:ffff880874800000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffc90008963690 CR3: 0000000836761000 CR4: 00000000001407e0
    Call Trace:
      unlock_page+0x6d/0x70
      generic_write_end+0x53/0xb0
      xfs_vm_write_end+0x29/0x80 [xfs]
      generic_perform_write+0x10a/0x1e0
      xfs_file_buffered_aio_write+0x14d/0x3e0 [xfs]
      xfs_file_write_iter+0x79/0x120 [xfs]
      __vfs_write+0xd4/0x110
      vfs_write+0xac/0x1c0
      SyS_write+0x58/0xd0
      system_call_fastpath+0x12/0x76
    Code: 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 45 f8 31 c0 48 8d 47 48 &lt;48&gt; 39 47 48 48 c7 45 e8 00 00 00 00 48 c7 45 f0 00 00 00 00 48
    RIP  [&lt;ffffffff810dff80&gt;] __wake_up_bit+0x20/0x70
     RSP &lt;ffff880017b97be8&gt;
    CR2: ffffc90008963690

Reproduce method (re-add a node)::
  Hot-add nodeA --&gt; remove nodeA --&gt; hot-add nodeA (panic)

This seems an use-after-free problem, and the root cause is
zone-&gt;wait_table was not set to *NULL* after free it in
try_offline_node.

When hot re-add a node, we will reuse the pgdat of it, so does the zone
struct, and when add pages to the target zone, it will init the zone
first (including the wait_table) if the zone is not initialized.  The
judgement of zone initialized is based on zone-&gt;wait_table:

	static inline bool zone_is_initialized(struct zone *zone)
	{
		return !!zone-&gt;wait_table;
	}

so if we do not set the zone-&gt;wait_table to *NULL* after free it, the
memory hotplug routine will skip the init of new zone when hot re-add
the node, and the wait_table still points to the freed memory, then we
will access the invalid address when trying to wake up the waiting
people after the i/o operation with the page is done, such as mentioned
above.

Signed-off-by: Gu Zheng &lt;guz.fnst@cn.fujitsu.com&gt;
Reported-by: Taku Izumi &lt;izumi.taku@jp.fujitsu.com&gt;
Reviewed by: Yasuaki Ishimatsu &lt;isimatu.yasuaki@jp.fujitsu.com&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Tang Chen &lt;tangchen@cn.fujitsu.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>mm/memory hotplug: postpone the reset of obsolete pgdat</title>
<updated>2015-04-19T08:11:06+00:00</updated>
<author>
<name>Gu Zheng</name>
<email>guz.fnst@cn.fujitsu.com</email>
</author>
<published>2015-03-25T22:55:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=4c605b1d706cff4c4c43e1d0712d19478ef880ab'/>
<id>4c605b1d706cff4c4c43e1d0712d19478ef880ab</id>
<content type='text'>
commit b0dc3a342af36f95a68fe229b8f0f73552c5ca08 upstream.

Qiu Xishi reported the following BUG when testing hot-add/hot-remove node under
stress condition:

  BUG: unable to handle kernel paging request at 0000000000025f60
  IP: next_online_pgdat+0x1/0x50
  PGD 0
  Oops: 0000 [#1] SMP
  ACPI: Device does not support D3cold
  Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod [last unloaded: rasf]
  CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G           O 3.10.15-5885-euler0302 #1
  Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015
  Workqueue: events vmstat_update
  task: ffffa800d32c0000 ti: ffffa800d32ae000 task.ti: ffffa800d32ae000
  RIP: 0010: next_online_pgdat+0x1/0x50
  RSP: 0018:ffffa800d32afce8  EFLAGS: 00010286
  RAX: 0000000000001440 RBX: ffffffff81da53b8 RCX: 0000000000000082
  RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000000
  RBP: ffffa800d32afd28 R08: ffffffff81c93bfc R09: ffffffff81cbdc96
  R10: 00000000000040ec R11: 00000000000000a0 R12: ffffa800fffb3440
  R13: ffffa800d32afd38 R14: 0000000000000017 R15: ffffa800e6616800
  FS:  0000000000000000(0000) GS:ffffa800e6600000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000025f60 CR3: 0000000001a0b000 CR4: 00000000001407e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
    refresh_cpu_vm_stats+0xd0/0x140
    vmstat_update+0x11/0x50
    process_one_work+0x194/0x3d0
    worker_thread+0x12b/0x410
    kthread+0xc6/0xd0
    ret_from_fork+0x7c/0xb0

The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of
try_offline_node, which will reset all the content of pgdat to 0, as the
pgdat is accessed lock-free, so that the users still using the pgdat
will panic, such as the vmstat_update routine.

process A:				offline node XX:

vmstat_updat()
   refresh_cpu_vm_stats()
     for_each_populated_zone()
       find online node XX
     cond_resched()
					offline cpu and memory, then try_offline_node()
					node_set_offline(nid), and memset(pgdat, 0, sizeof(*pgdat))
       zone = next_zone(zone)
         pg_data_t *pgdat = zone-&gt;zone_pgdat;  // here pgdat is NULL now
           next_online_pgdat(pgdat)
             next_online_node(pgdat-&gt;node_id);  // NULL pointer access

So the solution here is postponing the reset of obsolete pgdat from
try_offline_node() to hotadd_new_pgdat(), and just resetting
pgdat-&gt;nr_zones and pgdat-&gt;classzone_idx to be 0 rather than the memset
0 to avoid breaking pointer information in pgdat.

Signed-off-by: Gu Zheng &lt;guz.fnst@cn.fujitsu.com&gt;
Reported-by: Xishi Qiu &lt;qiuxishi@huawei.com&gt;
Suggested-by: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Yasuaki Ishimatsu &lt;isimatu.yasuaki@jp.fujitsu.com&gt;
Cc: Taku Izumi &lt;izumi.taku@jp.fujitsu.com&gt;
Cc: Tang Chen &lt;tangchen@cn.fujitsu.com&gt;
Cc: Xie XiuQi &lt;xiexiuqi@huawei.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit b0dc3a342af36f95a68fe229b8f0f73552c5ca08 upstream.

Qiu Xishi reported the following BUG when testing hot-add/hot-remove node under
stress condition:

  BUG: unable to handle kernel paging request at 0000000000025f60
  IP: next_online_pgdat+0x1/0x50
  PGD 0
  Oops: 0000 [#1] SMP
  ACPI: Device does not support D3cold
  Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod [last unloaded: rasf]
  CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G           O 3.10.15-5885-euler0302 #1
  Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015
  Workqueue: events vmstat_update
  task: ffffa800d32c0000 ti: ffffa800d32ae000 task.ti: ffffa800d32ae000
  RIP: 0010: next_online_pgdat+0x1/0x50
  RSP: 0018:ffffa800d32afce8  EFLAGS: 00010286
  RAX: 0000000000001440 RBX: ffffffff81da53b8 RCX: 0000000000000082
  RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000000
  RBP: ffffa800d32afd28 R08: ffffffff81c93bfc R09: ffffffff81cbdc96
  R10: 00000000000040ec R11: 00000000000000a0 R12: ffffa800fffb3440
  R13: ffffa800d32afd38 R14: 0000000000000017 R15: ffffa800e6616800
  FS:  0000000000000000(0000) GS:ffffa800e6600000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000025f60 CR3: 0000000001a0b000 CR4: 00000000001407e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
    refresh_cpu_vm_stats+0xd0/0x140
    vmstat_update+0x11/0x50
    process_one_work+0x194/0x3d0
    worker_thread+0x12b/0x410
    kthread+0xc6/0xd0
    ret_from_fork+0x7c/0xb0

The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of
try_offline_node, which will reset all the content of pgdat to 0, as the
pgdat is accessed lock-free, so that the users still using the pgdat
will panic, such as the vmstat_update routine.

process A:				offline node XX:

vmstat_updat()
   refresh_cpu_vm_stats()
     for_each_populated_zone()
       find online node XX
     cond_resched()
					offline cpu and memory, then try_offline_node()
					node_set_offline(nid), and memset(pgdat, 0, sizeof(*pgdat))
       zone = next_zone(zone)
         pg_data_t *pgdat = zone-&gt;zone_pgdat;  // here pgdat is NULL now
           next_online_pgdat(pgdat)
             next_online_node(pgdat-&gt;node_id);  // NULL pointer access

So the solution here is postponing the reset of obsolete pgdat from
try_offline_node() to hotadd_new_pgdat(), and just resetting
pgdat-&gt;nr_zones and pgdat-&gt;classzone_idx to be 0 rather than the memset
0 to avoid breaking pointer information in pgdat.

Signed-off-by: Gu Zheng &lt;guz.fnst@cn.fujitsu.com&gt;
Reported-by: Xishi Qiu &lt;qiuxishi@huawei.com&gt;
Suggested-by: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Yasuaki Ishimatsu &lt;isimatu.yasuaki@jp.fujitsu.com&gt;
Cc: Taku Izumi &lt;izumi.taku@jp.fujitsu.com&gt;
Cc: Tang Chen &lt;tangchen@cn.fujitsu.com&gt;
Cc: Xie XiuQi &lt;xiexiuqi@huawei.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>mm, migration: add destination page freeing callback</title>
<updated>2014-11-21T17:23:06+00:00</updated>
<author>
<name>David Rientjes</name>
<email>rientjes@google.com</email>
</author>
<published>2014-06-04T23:08:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=a527e8d4f7ae0e1d6a3c267f31e13b1d2a198508'/>
<id>a527e8d4f7ae0e1d6a3c267f31e13b1d2a198508</id>
<content type='text'>
commit 68711a746345c44ae00c64d8dbac6a9ce13ac54a upstream.

Memory migration uses a callback defined by the caller to determine how to
allocate destination pages.  When migration fails for a source page,
however, it frees the destination page back to the system.

This patch adds a memory migration callback defined by the caller to
determine how to free destination pages.  If a caller, such as memory
compaction, builds its own freelist for migration targets, this can reuse
already freed memory instead of scanning additional memory.

If the caller provides a function to handle freeing of destination pages,
it is called when page migration fails.  If the caller passes NULL then
freeing back to the system will be handled as usual.  This patch
introduces no functional change.

Signed-off-by: David Rientjes &lt;rientjes@google.com&gt;
Reviewed-by: Naoya Horiguchi &lt;n-horiguchi@ah.jp.nec.com&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;


</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 68711a746345c44ae00c64d8dbac6a9ce13ac54a upstream.

Memory migration uses a callback defined by the caller to determine how to
allocate destination pages.  When migration fails for a source page,
however, it frees the destination page back to the system.

This patch adds a memory migration callback defined by the caller to
determine how to free destination pages.  If a caller, such as memory
compaction, builds its own freelist for migration targets, this can reuse
already freed memory instead of scanning additional memory.

If the caller provides a function to handle freeing of destination pages,
it is called when page migration fails.  If the caller passes NULL then
freeing back to the system will be handled as usual.  This patch
introduces no functional change.

Signed-off-by: David Rientjes &lt;rientjes@google.com&gt;
Reviewed-by: Naoya Horiguchi &lt;n-horiguchi@ah.jp.nec.com&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;


</pre>
</div>
</content>
</entry>
<entry>
<title>mm/memory_hotplug.c: move register_memory_resource out of the lock_memory_hotplug</title>
<updated>2014-01-24T00:36:52+00:00</updated>
<author>
<name>Nathan Zimmer</name>
<email>nzimmer@sgi.com</email>
</author>
<published>2014-01-23T23:53:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=ac13c4622bda2a9ff8f57bbbfeff48b2a42d0963'/>
<id>ac13c4622bda2a9ff8f57bbbfeff48b2a42d0963</id>
<content type='text'>
We don't need to do register_memory_resource() under
lock_memory_hotplug() since it has its own lock and doesn't make any
callbacks.

Also register_memory_resource return NULL on failure so we don't have
anything to cleanup at this point.

The reason for this rfc is I was doing some experiments with hotplugging
of memory on some of our larger systems.  While it seems to work, it can
be quite slow.  With some preliminary digging I found that
lock_memory_hotplug is clearly ripe for breakup.

It could be broken up per nid or something but it also covers the
online_page_callback.  The online_page_callback shouldn't be very hard
to break out.

Also there is the issue of various structures(wmarks come to mind) that
are only updated under the lock_memory_hotplug that would need to be
dealt with.

Cc: Tang Chen &lt;tangchen@cn.fujitsu.com&gt;
Cc: Wen Congyang &lt;wency@cn.fujitsu.com&gt;
Cc: Kamezawa Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Reviewed-by: Yasuaki Ishimatsu &lt;isimatu.yasuaki@jp.fujitsu.com&gt;
Cc: "Rafael J. Wysocki" &lt;rafael.j.wysocki@intel.com&gt;
Cc: Hedi &lt;hedi@sgi.com&gt;
Cc: Mike Travis &lt;travis@sgi.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We don't need to do register_memory_resource() under
lock_memory_hotplug() since it has its own lock and doesn't make any
callbacks.

Also register_memory_resource return NULL on failure so we don't have
anything to cleanup at this point.

The reason for this rfc is I was doing some experiments with hotplugging
of memory on some of our larger systems.  While it seems to work, it can
be quite slow.  With some preliminary digging I found that
lock_memory_hotplug is clearly ripe for breakup.

It could be broken up per nid or something but it also covers the
online_page_callback.  The online_page_callback shouldn't be very hard
to break out.

Also there is the issue of various structures(wmarks come to mind) that
are only updated under the lock_memory_hotplug that would need to be
dealt with.

Cc: Tang Chen &lt;tangchen@cn.fujitsu.com&gt;
Cc: Wen Congyang &lt;wency@cn.fujitsu.com&gt;
Cc: Kamezawa Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Reviewed-by: Yasuaki Ishimatsu &lt;isimatu.yasuaki@jp.fujitsu.com&gt;
Cc: "Rafael J. Wysocki" &lt;rafael.j.wysocki@intel.com&gt;
Cc: Hedi &lt;hedi@sgi.com&gt;
Cc: Mike Travis &lt;travis@sgi.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: print more details for bad_page()</title>
<updated>2014-01-24T00:36:50+00:00</updated>
<author>
<name>Dave Hansen</name>
<email>dave@sr71.net</email>
</author>
<published>2014-01-23T23:52:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=f0b791a34cb3cffd2bbc3ca4365c9b719fa2c9f3'/>
<id>f0b791a34cb3cffd2bbc3ca4365c9b719fa2c9f3</id>
<content type='text'>
bad_page() is cool in that it prints out a bunch of data about the page.
But, I can never remember which page flags are good and which are bad,
or whether -&gt;index or -&gt;mapping is required to be NULL.

This patch allows bad/dump_page() callers to specify a string about why
they are dumping the page and adds explanation strings to a number of
places.  It also adds a 'bad_flags' argument to bad_page(), which it
then dumps out separately from the flags which are actually set.

This way, the messages will show specifically why the page was bad,
*specifically* which flags it is complaining about, if it was a page
flag combination which was the problem.

[akpm@linux-foundation.org: switch to pr_alert]
Signed-off-by: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Reviewed-by: Christoph Lameter &lt;cl@linux.com&gt;
Cc: Andi Kleen &lt;andi@firstfloor.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
bad_page() is cool in that it prints out a bunch of data about the page.
But, I can never remember which page flags are good and which are bad,
or whether -&gt;index or -&gt;mapping is required to be NULL.

This patch allows bad/dump_page() callers to specify a string about why
they are dumping the page and adds explanation strings to a number of
places.  It also adds a 'bad_flags' argument to bad_page(), which it
then dumps out separately from the flags which are actually set.

This way, the messages will show specifically why the page was bad,
*specifically* which flags it is complaining about, if it was a page
flag combination which was the problem.

[akpm@linux-foundation.org: switch to pr_alert]
Signed-off-by: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Reviewed-by: Christoph Lameter &lt;cl@linux.com&gt;
Cc: Andi Kleen &lt;andi@firstfloor.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm/memory_hotplug.c: use memblock apis for early memory allocations</title>
<updated>2014-01-22T00:19:47+00:00</updated>
<author>
<name>Santosh Shilimkar</name>
<email>santosh.shilimkar@ti.com</email>
</author>
<published>2014-01-21T23:50:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=9e43aa2b8d1cb3137bd7e60d5fead83d0569de2b'/>
<id>9e43aa2b8d1cb3137bd7e60d5fead83d0569de2b</id>
<content type='text'>
Correct ensure_zone_is_initialized() function description according to
the introduced memblock APIs for early memory allocations.

Signed-off-by: Grygorii Strashko &lt;grygorii.strashko@ti.com&gt;
Signed-off-by: Santosh Shilimkar &lt;santosh.shilimkar@ti.com&gt;
Cc: "Rafael J. Wysocki" &lt;rjw@sisk.pl&gt;
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: H. Peter Anvin &lt;hpa@zytor.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Konrad Rzeszutek Wilk &lt;konrad.wilk@oracle.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Paul Walmsley &lt;paul@pwsan.com&gt;
Cc: Pavel Machek &lt;pavel@ucw.cz&gt;
Cc: Russell King &lt;linux@arm.linux.org.uk&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Tony Lindgren &lt;tony@atomide.com&gt;
Cc: Yinghai Lu &lt;yinghai@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Correct ensure_zone_is_initialized() function description according to
the introduced memblock APIs for early memory allocations.

Signed-off-by: Grygorii Strashko &lt;grygorii.strashko@ti.com&gt;
Signed-off-by: Santosh Shilimkar &lt;santosh.shilimkar@ti.com&gt;
Cc: "Rafael J. Wysocki" &lt;rjw@sisk.pl&gt;
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: H. Peter Anvin &lt;hpa@zytor.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Konrad Rzeszutek Wilk &lt;konrad.wilk@oracle.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Paul Walmsley &lt;paul@pwsan.com&gt;
Cc: Pavel Machek &lt;pavel@ucw.cz&gt;
Cc: Russell King &lt;linux@arm.linux.org.uk&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Tony Lindgren &lt;tony@atomide.com&gt;
Cc: Yinghai Lu &lt;yinghai@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm/memblock: remove unnecessary inclusions of bootmem.h</title>
<updated>2014-01-22T00:19:46+00:00</updated>
<author>
<name>Grygorii Strashko</name>
<email>grygorii.strashko@ti.com</email>
</author>
<published>2014-01-21T23:50:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=869a84e1ca163b737236dae997db4a6a1e230b9b'/>
<id>869a84e1ca163b737236dae997db4a6a1e230b9b</id>
<content type='text'>
Clean-up to remove depedency with bootmem headers.

Signed-off-by: Grygorii Strashko &lt;grygorii.strashko@ti.com&gt;
Signed-off-by: Santosh Shilimkar &lt;santosh.shilimkar@ti.com&gt;
Reviewed-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Yinghai Lu &lt;yinghai@kernel.org&gt;
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: "Rafael J. Wysocki" &lt;rjw@sisk.pl&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: H. Peter Anvin &lt;hpa@zytor.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Konrad Rzeszutek Wilk &lt;konrad.wilk@oracle.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Paul Walmsley &lt;paul@pwsan.com&gt;
Cc: Pavel Machek &lt;pavel@ucw.cz&gt;
Cc: Russell King &lt;linux@arm.linux.org.uk&gt;
Cc: Tony Lindgren &lt;tony@atomide.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Clean-up to remove depedency with bootmem headers.

Signed-off-by: Grygorii Strashko &lt;grygorii.strashko@ti.com&gt;
Signed-off-by: Santosh Shilimkar &lt;santosh.shilimkar@ti.com&gt;
Reviewed-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Yinghai Lu &lt;yinghai@kernel.org&gt;
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: "Rafael J. Wysocki" &lt;rjw@sisk.pl&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: H. Peter Anvin &lt;hpa@zytor.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Konrad Rzeszutek Wilk &lt;konrad.wilk@oracle.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Paul Walmsley &lt;paul@pwsan.com&gt;
Cc: Pavel Machek &lt;pavel@ucw.cz&gt;
Cc: Russell King &lt;linux@arm.linux.org.uk&gt;
Cc: Tony Lindgren &lt;tony@atomide.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>memblock, mem_hotplug: make memblock skip hotpluggable regions if needed</title>
<updated>2014-01-22T00:19:45+00:00</updated>
<author>
<name>Tang Chen</name>
<email>tangchen@cn.fujitsu.com</email>
</author>
<published>2014-01-21T23:49:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=55ac590c2fadad785d60dd70c12d62823bc2cd39'/>
<id>55ac590c2fadad785d60dd70c12d62823bc2cd39</id>
<content type='text'>
Linux kernel cannot migrate pages used by the kernel.  As a result,
hotpluggable memory used by the kernel won't be able to be hot-removed.
To solve this problem, the basic idea is to prevent memblock from
allocating hotpluggable memory for the kernel at early time, and arrange
all hotpluggable memory in ACPI SRAT(System Resource Affinity Table) as
ZONE_MOVABLE when initializing zones.

In the previous patches, we have marked hotpluggable memory regions with
MEMBLOCK_HOTPLUG flag in memblock.memory.

In this patch, we make memblock skip these hotpluggable memory regions
in the default top-down allocation function if movable_node boot option
is specified.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Tang Chen &lt;tangchen@cn.fujitsu.com&gt;
Signed-off-by: Zhang Yanfei &lt;zhangyanfei@cn.fujitsu.com&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: "Rafael J . Wysocki" &lt;rjw@sisk.pl&gt;
Cc: Chen Tang &lt;imtangchen@gmail.com&gt;
Cc: Gong Chen &lt;gong.chen@linux.intel.com&gt;
Cc: Ingo Molnar &lt;mingo@elte.hu&gt;
Cc: Jiang Liu &lt;jiang.liu@huawei.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Lai Jiangshan &lt;laijs@cn.fujitsu.com&gt;
Cc: Larry Woodman &lt;lwoodman@redhat.com&gt;
Cc: Len Brown &lt;lenb@kernel.org&gt;
Cc: Liu Jiang &lt;jiang.liu@huawei.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Michal Nazarewicz &lt;mina86@mina86.com&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Prarit Bhargava &lt;prarit@redhat.com&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Taku Izumi &lt;izumi.taku@jp.fujitsu.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Thomas Renninger &lt;trenn@suse.de&gt;
Cc: Toshi Kani &lt;toshi.kani@hp.com&gt;
Cc: Vasilis Liaskovitis &lt;vasilis.liaskovitis@profitbricks.com&gt;
Cc: Wanpeng Li &lt;liwanp@linux.vnet.ibm.com&gt;
Cc: Wen Congyang &lt;wency@cn.fujitsu.com&gt;
Cc: Yasuaki Ishimatsu &lt;isimatu.yasuaki@jp.fujitsu.com&gt;
Cc: Yinghai Lu &lt;yinghai@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Linux kernel cannot migrate pages used by the kernel.  As a result,
hotpluggable memory used by the kernel won't be able to be hot-removed.
To solve this problem, the basic idea is to prevent memblock from
allocating hotpluggable memory for the kernel at early time, and arrange
all hotpluggable memory in ACPI SRAT(System Resource Affinity Table) as
ZONE_MOVABLE when initializing zones.

In the previous patches, we have marked hotpluggable memory regions with
MEMBLOCK_HOTPLUG flag in memblock.memory.

In this patch, we make memblock skip these hotpluggable memory regions
in the default top-down allocation function if movable_node boot option
is specified.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Tang Chen &lt;tangchen@cn.fujitsu.com&gt;
Signed-off-by: Zhang Yanfei &lt;zhangyanfei@cn.fujitsu.com&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: "Rafael J . Wysocki" &lt;rjw@sisk.pl&gt;
Cc: Chen Tang &lt;imtangchen@gmail.com&gt;
Cc: Gong Chen &lt;gong.chen@linux.intel.com&gt;
Cc: Ingo Molnar &lt;mingo@elte.hu&gt;
Cc: Jiang Liu &lt;jiang.liu@huawei.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Lai Jiangshan &lt;laijs@cn.fujitsu.com&gt;
Cc: Larry Woodman &lt;lwoodman@redhat.com&gt;
Cc: Len Brown &lt;lenb@kernel.org&gt;
Cc: Liu Jiang &lt;jiang.liu@huawei.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Michal Nazarewicz &lt;mina86@mina86.com&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Prarit Bhargava &lt;prarit@redhat.com&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Taku Izumi &lt;izumi.taku@jp.fujitsu.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Thomas Renninger &lt;trenn@suse.de&gt;
Cc: Toshi Kani &lt;toshi.kani@hp.com&gt;
Cc: Vasilis Liaskovitis &lt;vasilis.liaskovitis@profitbricks.com&gt;
Cc: Wanpeng Li &lt;liwanp@linux.vnet.ibm.com&gt;
Cc: Wen Congyang &lt;wency@cn.fujitsu.com&gt;
Cc: Yasuaki Ishimatsu &lt;isimatu.yasuaki@jp.fujitsu.com&gt;
Cc: Yinghai Lu &lt;yinghai@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mem-hotplug: introduce movable_node boot option</title>
<updated>2013-11-13T03:09:09+00:00</updated>
<author>
<name>Tang Chen</name>
<email>tangchen@cn.fujitsu.com</email>
</author>
<published>2013-11-12T23:08:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=c5320926e370b4cfb8f10c2169e26f960079cf67'/>
<id>c5320926e370b4cfb8f10c2169e26f960079cf67</id>
<content type='text'>
The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel, it
cannot be hot-removed.  So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.

Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.

But the kernel cannot use memory in ZONE_MOVABLE.  By doing this, the
kernel cannot use memory in movable nodes.  This will cause NUMA
performance down.  And other users may be unhappy.

So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce movable_node boot option to allow users to
choose to not to consume hotpluggable memory at early boot time and later
we can set it as ZONE_MOVABLE.

To achieve this, the movable_node boot option will control the memblock
allocation direction.  That said, after memblock is ready, before SRAT is
parsed, we should allocate memory near the kernel image as we explained in
the previous patches.  So if movable_node boot option is set, the kernel
does the following:

1. After memblock is ready, make memblock allocate memory bottom up.
2. After SRAT is parsed, make memblock behave as default, allocate memory
   top down.

Users can specify "movable_node" in kernel commandline to enable this
functionality.  For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything.  The kernel
will work as before.

Signed-off-by: Tang Chen &lt;tangchen@cn.fujitsu.com&gt;
Signed-off-by: Zhang Yanfei &lt;zhangyanfei@cn.fujitsu.com&gt;
Suggested-by: Kamezawa Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Suggested-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Toshi Kani &lt;toshi.kani@hp.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: Wanpeng Li &lt;liwanp@linux.vnet.ibm.com&gt;
Cc: Thomas Renninger &lt;trenn@suse.de&gt;
Cc: Yinghai Lu &lt;yinghai@kernel.org&gt;
Cc: Jiang Liu &lt;jiang.liu@huawei.com&gt;
Cc: Wen Congyang &lt;wency@cn.fujitsu.com&gt;
Cc: Lai Jiangshan &lt;laijs@cn.fujitsu.com&gt;
Cc: Yasuaki Ishimatsu &lt;isimatu.yasuaki@jp.fujitsu.com&gt;
Cc: Taku Izumi &lt;izumi.taku@jp.fujitsu.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Michal Nazarewicz &lt;mina86@mina86.com&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel, it
cannot be hot-removed.  So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.

Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.

But the kernel cannot use memory in ZONE_MOVABLE.  By doing this, the
kernel cannot use memory in movable nodes.  This will cause NUMA
performance down.  And other users may be unhappy.

So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce movable_node boot option to allow users to
choose to not to consume hotpluggable memory at early boot time and later
we can set it as ZONE_MOVABLE.

To achieve this, the movable_node boot option will control the memblock
allocation direction.  That said, after memblock is ready, before SRAT is
parsed, we should allocate memory near the kernel image as we explained in
the previous patches.  So if movable_node boot option is set, the kernel
does the following:

1. After memblock is ready, make memblock allocate memory bottom up.
2. After SRAT is parsed, make memblock behave as default, allocate memory
   top down.

Users can specify "movable_node" in kernel commandline to enable this
functionality.  For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything.  The kernel
will work as before.

Signed-off-by: Tang Chen &lt;tangchen@cn.fujitsu.com&gt;
Signed-off-by: Zhang Yanfei &lt;zhangyanfei@cn.fujitsu.com&gt;
Suggested-by: Kamezawa Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Suggested-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Toshi Kani &lt;toshi.kani@hp.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: Wanpeng Li &lt;liwanp@linux.vnet.ibm.com&gt;
Cc: Thomas Renninger &lt;trenn@suse.de&gt;
Cc: Yinghai Lu &lt;yinghai@kernel.org&gt;
Cc: Jiang Liu &lt;jiang.liu@huawei.com&gt;
Cc: Wen Congyang &lt;wency@cn.fujitsu.com&gt;
Cc: Lai Jiangshan &lt;laijs@cn.fujitsu.com&gt;
Cc: Yasuaki Ishimatsu &lt;isimatu.yasuaki@jp.fujitsu.com&gt;
Cc: Taku Izumi &lt;izumi.taku@jp.fujitsu.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Michal Nazarewicz &lt;mina86@mina86.com&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
