linux.git/mm/migrate.c, branch v5.5

mm: move_pages: return valid node id in status if the page is already on the target node

2020-01-04T21:55:09+00:00

Felix Abecassis reports that move_pages() returns random status values
if the pages are already on the target node, as shown by the test
program below:

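  #define _GNU_SOURCE	/* MAP_POPULATE, MAP_ANONYMOUS */
  #include <numaif.h>	/* set_mempolicy(), move_pages(); link with -lnuma */
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)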
  {
	const long node_id = 1;
	const long page_size = sysconf(_SC_PAGESIZE);
	const int64_t num_pages = 8;

	unsigned long nodemask =  1 << node_id;
	long ret = set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask));
	if (ret < 0)
		return (EXIT_FAILURE);

	void **pages = malloc(sizeof(void*) * num_pages);
	for (int i = 0; i < num_pages; ++i) {
		pages[i] = mmap(NULL, page_size, PROT_WRITE | PROT_READ,
				MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS,
				-1, 0);
		if (pages[i] == MAP_FAILED)
			return (EXIT_FAILURE);
	}

	ret = set_mempolicy(MPOL_DEFAULT, NULL, 0);
	if (ret < 0)
		return (EXIT_FAILURE);

	int *nodes = malloc(sizeof(int) * num_pages);
	int *status = malloc(sizeof(int) * num_pages);
	for (int i = 0; i < num_pages; ++i) {
		nodes[i] = node_id;
		status[i] = 0xd0; /* simulate garbage values */
	}

	ret = move_pages(0, num_pages, pages, nodes, status, MPOL_MF_MOVE);
	printf("move_pages: %ld\n", ret);
	for (int i = 0; i < num_pages; ++i)
		printf("status[%d] = %d\n", i, status[i]);
  }

Running the program then prints nonsense status values:

  $ ./move_pages_bug
  move_pages: 0
  status[0] = 208
  status[1] = 208
  status[2] = 208
  status[3] = 208
  status[4] = 208
  status[5] = 208
  status[6] = 208
  status[7] = 208

This is because the status is not set if the page is already on the
target node, but move_pages() should return valid status as long as it
succeeds.  The valid status may be errno or node id.

We can't simply initialize the status array to zero, since the pages may
not be on node 0.  Fix it by updating the status with the id of the node
which the page is already on.

Link: http://lkml.kernel.org/r/1575584353-125392-1-git-send-email-yang.shi@linux.alibaba.com
Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move")
Signed-off-by: Yang Shi 
Reported-by: Felix Abecassis 
Tested-by: Felix Abecassis 
Suggested-by: Michal Hocko 
Reviewed-by: John Hubbard 
Acked-by: Christoph Lameter 
Acked-by: Michal Hocko 
Reviewed-by: Vlastimil Babka 
Cc: Mel Gorman 
Cc: 	[4.17+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

autonuma: fix watermark checking in migrate_balanced_pgdat()

2019-12-01T20:59:09+00:00

When zone_watermark_ok() is called in migrate_balanced_pgdat() to check
migration target node, the parameter classzone_idx (for requested zone)
is specified as 0 (ZONE_DMA).  But when allocating memory for autonuma
in alloc_misplaced_dst_page(), the requested zone from GFP flags is
ZONE_MOVABLE.  That is, the two requested zones differ, and so does the
size of lowmem_reserve that applies to each.  This may cause some
issues.

For example, in the zoneinfo of a test machine as below,

Node 0, zone    DMA32
  pages free     61592
        min      29
        low      454
        high     879
        spanned  1044480
        present  442306
        managed  425921
        protection: (0, 0, 62457, 62457, 62457)

The free page number of ZONE_DMA32 is greater than "high watermark +
lowmem_reserve[ZONE_DMA]", but less than "high watermark +
lowmem_reserve[ZONE_MOVABLE]".  And because __alloc_pages_node() in
alloc_misplaced_dst_page() requests ZONE_MOVABLE, the
zone_watermark_ok() on ZONE_DMA32 in migrate_balanced_pgdat() may always
return true.  So, autonuma may not stop even when memory pressure in
node 0 is heavy.

To fix the issue, pass ZONE_MOVABLE as the classzone_idx parameter of
zone_watermark_ok() in migrate_balanced_pgdat().  This makes it the same
as the requested zone in alloc_misplaced_dst_page(), so that
migrate_balanced_pgdat() returns false when memory pressure is heavy.

Link: http://lkml.kernel.org/r/20191101075727.26683-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" 
Acked-by: Mel Gorman 
Cc: Michal Hocko 
Cc: Rik van Riel 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Dave Hansen 
Cc: Dan Williams 
Cc: Fengguang Wu 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/migrate.c: handle freed page at the first place

2019-12-01T20:59:09+00:00

When doing migration, if a freed page is met, we just return without
migrating it, since it is pointless to migrate a freed page.  But the
current code allocates the target page unconditionally before handling
the freed page; if the page is freed, the newly allocated page will just
be freed again.  This doesn't make much sense and is a waste of time,
although migrating a freed page is rare.

So, handle the freed page before the allocation to avoid an unnecessary
page allocation and free.

Link: http://lkml.kernel.org/r/1573755869-106954-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi 
Acked-by: Michal Hocko 
Reviewed-by: Andrew Morton 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: untag user pointers passed to memory syscalls

2019-09-26T00:51:41+00:00

This patch is part of a series that extends the kernel ABI to allow
tagged user pointers (with the top byte set to something other than
0x00) to be passed as syscall arguments.

This patch allows tagged pointers to be passed to the following memory
syscalls: get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mprotect,
mremap, msync, munlock, move_pages.

The mmap and mremap syscalls do not currently accept tagged addresses.
Architectures may interpret the tag as a background colour for the
corresponding vma.

Link: http://lkml.kernel.org/r/aaf0c0969d46b2feb9017f3e1b3ef3970b633d91.1563904656.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov 
Reviewed-by: Khalid Aziz 
Reviewed-by: Vincenzo Frascino 
Reviewed-by: Catalin Marinas 
Reviewed-by: Kees Cook 
Cc: Al Viro 
Cc: Dave Hansen 
Cc: Eric Auger 
Cc: Felix Kuehling 
Cc: Jens Wiklander 
Cc: Mauro Carvalho Chehab 
Cc: Mike Rapoport 
Cc: Will Deacon 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/migrate.c: clean up useless code in migrate_vma_collect_pmd()

2019-09-24T22:54:10+00:00

Remove unused 'pfn' variable.

Link: http://lkml.kernel.org/r/1565167272-21453-1-git-send-email-kernelfans@gmail.com
Signed-off-by: Pingfan Liu 
Reviewed-by: Andrew Morton 
Reviewed-by: Ralph Campbell 
Cc: "Jérôme Glisse" 
Cc: Mel Gorman 
Cc: Jan Kara 
Cc: "Kirill A. Shutemov" 
Cc: Michal Hocko 
Cc: Mike Kravetz 
Cc: Andrea Arcangeli 
Cc: Matthew Wilcox 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: page cache: store only head pages in i_pages

2019-09-24T22:54:08+00:00

Transparent Huge Pages are currently stored in i_pages as pointers to
consecutive subpages.  This patch changes that to storing consecutive
pointers to the head page in preparation for storing huge pages more
efficiently in i_pages.

Large parts of this are "inspired" by Kirill's patch
https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/

Kirill and Huang Ying contributed several fixes.

[willy@infradead.org: use compound_nr, squish uninit-var warning]
Link: http://lkml.kernel.org/r/20190731210400.7419-1-willy@infradead.org
Signed-off-by: Matthew Wilcox 
Acked-by: Jan Kara 
Reviewed-by: Kirill Shutemov 
Reviewed-by: Song Liu 
Tested-by: Song Liu 
Tested-by: William Kucharski 
Reviewed-by: William Kucharski 
Tested-by: Qian Cai 
Tested-by: Mikhail Gavrilov 
Cc: Hugh Dickins 
Cc: Chris Wilson 
Cc: Song Liu 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: introduce compound_nr()

2019-09-24T22:54:08+00:00

Replace 1 << compound_order(page) with compound_nr(page).  Minor
improvements in readability.

Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Andrew Morton 
Reviewed-by: Ira Weiny 
Acked-by: Kirill A. Shutemov 
Cc: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

pagewalk: separate function pointers from iterator data

2019-09-07T07:28:04+00:00

The mm_walk structure currently mixes data and code.  Split out the
operations vectors into a new mm_walk_ops structure, and while we are
changing the API also declare the mm_walk structure inside the
walk_page_range and walk_page_vma functions.

Based on patch from Linus Torvalds.

Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de
Signed-off-by: Christoph Hellwig 
Reviewed-by: Thomas Hellstrom 
Reviewed-by: Steven Price 
Reviewed-by: Jason Gunthorpe 
Signed-off-by: Jason Gunthorpe

mm: split out a new pagewalk.h header from mm.h

2019-09-07T07:28:04+00:00

Add a new header for the two handful of users of the walk_page_range /
walk_page_vma interface instead of polluting all users of mm.h with it.

Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de
Signed-off-by: Christoph Hellwig 
Reviewed-by: Thomas Hellstrom 
Reviewed-by: Steven Price 
Reviewed-by: Jason Gunthorpe 
Signed-off-by: Jason Gunthorpe

Merge branch 'odp_fixes' into hmm.git

2019-08-21T23:58:18+00:00

From rdma.git

Jason Gunthorpe says:

====================
This is a collection of general cleanups for ODP to clarify some of the
flows around umem creation and use of the interval tree.
====================

The branch is based on v5.3-rc5 due to dependencies, and is being taken
into hmm.git due to dependencies in the next patches.

* odp_fixes:
  RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
  RDMA/mlx5: Use ib_umem_start instead of umem.address
  RDMA/core: Make invalidate_range a device operation
  RDMA/odp: Use kvcalloc for the dma_list and page_list
  RDMA/odp: Check for overflow when computing the umem_odp end
  RDMA/odp: Provide ib_umem_odp_release() to undo the allocs
  RDMA/odp: Split creating a umem_odp from ib_umem_get
  RDMA/odp: Make the three ways to create a umem_odp clear
  RMDA/odp: Consolidate umem_odp initialization
  RDMA/odp: Make it clearer when a umem is an implicit ODP umem
  RDMA/odp: Iterate over the whole rbtree directly
  RDMA/odp: Use the common interval tree library instead of generic
  RDMA/mlx5: Fix MR npages calculation for IB_ACCESS_HUGETLB

Signed-off-by: Jason Gunthorpe