linux-stable.git/include/linux/hugetlb.h, branch v4.4.166

mm: migration: fix migration of huge PMD shared pages

2018-11-21T08:27:44+00:00

commit 017b1660df89f5fb4bfe66c34e35f7d2031100c7 upstream.

The page migration code employs try_to_unmap() to try and unmap the source
page.  This is accomplished by using rmap_walk to find all vmas where the
page is mapped.  This search stops when page mapcount is zero.  For shared
PMD huge pages, the page map count is always 1 no matter the number of
mappings.  Shared mappings are tracked via the reference count of the PMD
page.  Therefore, try_to_unmap stops prematurely and does not completely
unmap all mappings of the source page.

This problem can result in data corruption as writes to the original
source page can happen after contents of the page are copied to the target
page.  Hence, data is lost.

This problem was originally seen as DB corruption of shared global areas
after a huge page was soft offlined due to ECC memory errors.  DB
developers noticed they could reproduce the issue by (hotplug) offlining
memory used to back huge pages.  A simple testcase can reproduce the
problem by creating a shared PMD mapping (note that this must be at least
PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
migrate_pages() to migrate process pages between nodes while continually
writing to the huge pages being migrated.
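
As a rough illustration of the size and alignment precondition mentioned
above, the following userspace sketch checks whether an address range
contains at least one PUD_SIZE-aligned, PUD_SIZE-sized region.  The helper
name and the constant are hypothetical; the 1GB value is the x86 figure
quoted in this message, not something read from the running kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the x86 value quoted in the commit message (1GB). */
#define PUD_SIZE_BYTES (UINT64_C(1) << 30)

/*
 * Hypothetical helper: true if [addr, addr + len) contains at least one
 * PUD_SIZE-aligned, PUD_SIZE-sized region, i.e. the mapping is large
 * enough and suitably placed for PMD sharing to be possible at all.
 */
static bool covers_aligned_pud(uint64_t addr, uint64_t len)
{
	/* First PUD boundary at or above addr. */
	uint64_t first = (addr + PUD_SIZE_BYTES - 1) & ~(PUD_SIZE_BYTES - 1);

	/* first < addr means the round-up wrapped around. */
	if (first < addr || first - addr > len)
		return false;

	return len - (first - addr) >= PUD_SIZE_BYTES;
}
```

A test program would typically pick the mapping address and length so
that this check holds before starting the migrate/write loop.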

To fix, have the try_to_unmap_one routine check for huge PMD sharing by
calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a shared
mapping it will be 'unshared' which removes the page table entry and drops
the reference on the PMD page.  After this, flush caches and TLB.

mmu notifiers are called before locking page tables, but we can not be
sure of PMD sharing until page tables are locked.  Therefore, check for
the possibility of PMD sharing before locking so that notifiers can
prepare for the worst possible case.
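
The "worst possible case" can be pictured as widening the notified range
to the enclosing PUD boundaries whenever that PUD-sized span fits inside
the vma.  The sketch below is a simplified userspace model of that idea,
not the kernel code: the struct, the function name and the may_share flag
are hypothetical, and it assumes the incoming range lies within a single
PUD-sized region.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the x86 value quoted in the commit message (1GB). */
#define PUD_SIZE_BYTES (UINT64_C(1) << 30)
#define PUD_MASK_BYTES (~(PUD_SIZE_BYTES - 1))

struct addr_range {
	uint64_t start;
	uint64_t end;	/* exclusive */
};

/*
 * Hypothetical model: if the PUD-aligned span around r->start lies
 * entirely within the vma [vm_start, vm_end) and the mapping may be
 * shared, widen *r to cover that whole span.  Otherwise leave it alone.
 */
static void widen_for_possible_sharing(uint64_t vm_start, uint64_t vm_end,
				       bool may_share, struct addr_range *r)
{
	uint64_t a_start = r->start & PUD_MASK_BYTES;
	uint64_t a_end = a_start + PUD_SIZE_BYTES;

	if (!may_share)
		return;
	if (a_start < vm_start || a_end > vm_end)
		return;		/* sharing is not possible here */

	if (a_start < r->start)
		r->start = a_start;
	if (a_end > r->end)
		r->end = a_end;
}
```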

Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com
[mike.kravetz@oracle.com: make _range_in_vma() a static inline]
  Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com
Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz 
Acked-by: Kirill A. Shutemov 
Reviewed-by: Naoya Horiguchi 
Acked-by: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Davidlohr Bueso 
Cc: Jerome Glisse 
Cc: Mike Kravetz 
Signed-off-by: Andrew Morton 
Signed-off-by: Mike Kravetz 
Acked-by: Michal Hocko 
Reviewed-by: Jérôme Glisse 
Signed-off-by: Greg Kroah-Hartman

mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

2015-11-06T03:34:48+00:00

Currently there's no easy way to get per-process usage of hugetlb pages,
which is inconvenient because userspace applications which use hugetlb
typically want to control their processes on the basis of how much memory
(including hugetlb) they use.  So this patch simply provides easy access
to the info via /proc/PID/status.
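
A userspace consumer might read the new field along these lines.  This is
only a sketch: the helper names are hypothetical, and it assumes the line
has the form "HugetlbPages:" followed by a value in kB, like the other
size fields in /proc/PID/status.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan the text of a /proc/PID/status file for the
 * HugetlbPages line and return its value (assumed to be in kB), or -1 if
 * the field is missing or malformed.
 */
static long parse_hugetlb_kb(const char *status_text)
{
	const char *key = "HugetlbPages:";
	const char *line = status_text;

	while (line && *line) {
		if (strncmp(line, key, strlen(key)) == 0) {
			long kb;

			if (sscanf(line + strlen(key), "%ld", &kb) == 1)
				return kb;
			return -1;
		}
		line = strchr(line, '\n');
		if (line)
			line++;		/* step past the newline */
	}
	return -1;
}

/* Read /proc/self/status and return the calling process's value. */
static long read_self_hugetlb_kb(void)
{
	char buf[8192];
	size_t n;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return parse_hugetlb_kb(buf);
}
```

On kernels without this patch the field is absent and both helpers
return -1.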

Signed-off-by: Naoya Horiguchi 
Acked-by: Joern Engel 
Acked-by: David Rientjes 
Acked-by: Michal Hocko 
Cc: Mike Kravetz 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

hugetlbfs: add hugetlbfs_fallocate()

2015-09-08T22:35:28+00:00

This is based on the shmem version, but it has diverged quite a bit.  We
have no swap to worry about, nor the new file sealing.  Add
synchronization via the fault mutex table to coordinate page faults,
fallocate allocation and fallocate hole punch.

What this allows us to do is move physical memory in and out of a
hugetlbfs file without having it mapped.  This also gives us the ability
to support MADV_REMOVE since it is currently implemented using
fallocate().  MADV_REMOVE lets madvise() remove pages from the middle of
a hugetlbfs file, which wasn't possible before.

hugetlbfs fallocate only operates on whole huge pages.
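
One way to picture the whole-huge-page restriction for hole punch is to
shrink the requested byte range inward to huge page boundaries.  The
sketch below assumes that rounding policy (start rounded up, end rounded
down); the struct and function names are hypothetical and the huge page
size is passed in by the caller.

```c
#include <stdint.h>

struct hole_range {
	uint64_t start;	/* first byte to punch */
	uint64_t end;	/* exclusive; equals start when nothing is punched */
};

/*
 * Hypothetical helper: reduce [offset, offset + len) to the largest
 * sub-range made only of whole huge pages.  Assumption: partial pages at
 * either edge of the request are left untouched.
 */
static struct hole_range whole_page_hole(uint64_t offset, uint64_t len,
					 uint64_t hpage_size)
{
	struct hole_range h;

	h.start = (offset + hpage_size - 1) / hpage_size * hpage_size;
	h.end = (offset + len) / hpage_size * hpage_size;
	if (h.end < h.start)
		h.end = h.start;	/* request covers no whole page */
	return h;
}
```

With 2MB huge pages, a request covering bytes 1MB to 5MB would reduce to
the single whole page at 2MB to 4MB under this policy.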

Based on code by Dave Hansen.

Signed-off-by: Mike Kravetz 
Reviewed-by: Naoya Horiguchi 
Acked-by: Hillf Danton 
Cc: Dave Hansen 
Cc: David Rientjes 
Cc: Hugh Dickins 
Cc: Davidlohr Bueso 
Cc: Aneesh Kumar 
Cc: Christoph Hellwig 
Cc: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

hugetlbfs: New huge_add_to_page_cache helper routine

2015-09-08T22:35:28+00:00

Currently, there is only a single place where hugetlbfs pages are added
to the page cache.  The new fallocate code will be adding a second one, so
break the functionality out into its own helper.

Signed-off-by: Dave Hansen 
Signed-off-by: Mike Kravetz 
Reviewed-by: Naoya Horiguchi 
Acked-by: Hillf Danton 
Cc: David Rientjes 
Cc: Hugh Dickins 
Cc: Davidlohr Bueso 
Cc: Aneesh Kumar 
Cc: Christoph Hellwig 
Cc: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

hugetlbfs: truncate_hugepages() takes a range of pages

2015-09-08T22:35:28+00:00

Modify truncate_hugepages() to take a range of pages (start, end)
instead of simply start.  If an end value of LLONG_MAX is passed, the
current "truncate" functionality is maintained.  Existing callers are
modified to pass LLONG_MAX as end of range.  By keying off end ==
LLONG_MAX, the routine behaves differently for truncate and hole punch.
Page removal is now synchronized with page allocation via faults by
using the fault mutex table.  The hole punch case can experience the
rare region_del error and must handle accordingly.

Add the routine hugetlb_fix_reserve_counts to fix up reserve counts in
the case where region_del returns an error.

Since the routine handles more than just the truncate case, it is
renamed to remove_inode_hugepages().  To be consistent, the routine
truncate_huge_page() is renamed remove_huge_page().

Downstream of remove_inode_hugepages(), the routine
hugetlb_unreserve_pages() is also modified to take a range of pages.
hugetlb_unreserve_pages is modified to detect an error from region_del and
pass it back to the caller.
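
The end == LLONG_MAX convention can be modelled in a few lines.  The
function names below are hypothetical and the model only counts pages; it
is meant to show how a single range-based routine can serve both the
truncate and the hole punch callers, not how the kernel routine is
written.

```c
#include <limits.h>
#include <stdbool.h>

/* The sentinel end value selects the truncate behaviour. */
static bool is_truncate_op(long long lend)
{
	return lend == LLONG_MAX;
}

/*
 * Hypothetical model: number of huge pages covered by the byte range
 * [lstart, lend) in a file holding file_pages huge pages.  Truncate
 * (lend == LLONG_MAX) runs to the end of the file; hole punch stops at
 * lend.
 */
static long pages_in_range(long long lstart, long long lend,
			   long long file_pages, int hpage_shift)
{
	long long first = lstart >> hpage_shift;
	long long last = is_truncate_op(lend) ? file_pages
					      : (lend >> hpage_shift);

	if (last > file_pages)
		last = file_pages;
	return last > first ? (long)(last - first) : 0;
}
```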

Signed-off-by: Mike Kravetz 
Reviewed-by: Naoya Horiguchi 
Acked-by: Hillf Danton 
Cc: Dave Hansen 
Cc: David Rientjes 
Cc: Hugh Dickins 
Cc: Davidlohr Bueso 
Cc: Aneesh Kumar 
Cc: Christoph Hellwig 
Cc: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/hugetlb: expose hugetlb fault mutex for use by fallocate

2015-09-08T22:35:28+00:00

hugetlb page faults are currently synchronized by the table of mutexes
(htlb_fault_mutex_table).  fallocate code will need to synchronize with
the page fault code when it allocates or deletes pages.  Expose
interfaces so that fallocate operations can be synchronized with page
faults.  Minor name changes to be more consistent with other global
hugetlb symbols.
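
The general technique, a fixed table of mutexes indexed by a hash of the
(file, page index) pair, can be sketched in userspace as follows.
Everything here is an assumption made for illustration: the table size,
the hash function and all names are hypothetical, and pthread mutexes
stand in for kernel mutexes.

```c
#include <pthread.h>
#include <stdint.h>

#define NUM_FAULT_MUTEXES 64	/* hypothetical; must be a power of two */

static pthread_mutex_t fault_mutex_table[NUM_FAULT_MUTEXES];

static void fault_mutex_table_init(void)
{
	for (int i = 0; i < NUM_FAULT_MUTEXES; i++)
		pthread_mutex_init(&fault_mutex_table[i], NULL);
}

/* Hypothetical hash: mix both keys, then mask down to a table index. */
static uint32_t fault_mutex_hash(uint64_t file_id, uint64_t page_idx)
{
	uint64_t h = (file_id * UINT64_C(0x9E3779B97F4A7C15)) ^
		     (page_idx * UINT64_C(0xC2B2AE3D27D4EB4F));

	h ^= h >> 32;
	return (uint32_t)h & (NUM_FAULT_MUTEXES - 1);
}

/*
 * Fault, allocation and hole punch paths would all take the same slot
 * for a given page, so they serialize against each other.
 */
static uint32_t lock_page_slot(uint64_t file_id, uint64_t page_idx)
{
	uint32_t slot = fault_mutex_hash(file_id, page_idx);

	pthread_mutex_lock(&fault_mutex_table[slot]);
	return slot;
}

static void unlock_page_slot(uint32_t slot)
{
	pthread_mutex_unlock(&fault_mutex_table[slot]);
}
```

Unrelated pages may hash to the same slot; that only costs some extra
serialization and does not affect correctness.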

Signed-off-by: Mike Kravetz 
Reviewed-by: Naoya Horiguchi 
Acked-by: Hillf Danton 
Cc: Dave Hansen 
Cc: David Rientjes 
Cc: Hugh Dickins 
Cc: Davidlohr Bueso 
Cc: Aneesh Kumar 
Cc: Christoph Hellwig 
Cc: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/hugetlb: add cache of descriptors to resv_map for region_add

2015-09-08T22:35:28+00:00

hugetlbfs is used today by applications that want a high degree of
control over huge page usage.  Often, large hugetlbfs files are used to
map a large number of huge pages into the application processes.  The
applications know when page ranges within these large files will no
longer be used, and ideally would like to release them back to the
subpool or global pools for other uses.  The fallocate() system call
provides an interface for preallocation and hole punching within files.
This patch set adds fallocate functionality to hugetlbfs.

fallocate hole punch will want to remove a specific range of pages.
When pages are removed, their associated entries in the region/reserve
map will also be removed.  This will break an assumption in the
region_chg/region_add calling sequence.  If a new region descriptor must
be allocated, it is done as part of the region_chg processing.  In this
way, region_add can not fail because it does not need to attempt an
allocation.

To prepare for fallocate hole punch, create a "cache" of descriptors
that can be used by region_add if necessary.  region_chg will ensure
there are sufficient entries in the cache.  It will be necessary to
track the number of in progress add operations to know a sufficient
number of descriptors reside in the cache.  A new routine region_abort
is added to adjust this in progress count when add operations are
aborted.  vma_abort_reservation is also added for callers creating
reservations with vma_needs_reservation/vma_commit_reservation.
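
The bookkeeping described above can be modelled with two counters.  This
is a deliberately reduced sketch with hypothetical names: it tracks only
how many descriptors are cached and how many add operations are in
progress, and leaves out the region list itself.

```c
#include <stdbool.h>

/* Hypothetical, reduced model of the reserve map bookkeeping. */
struct resv_model {
	long adds_in_progress;	/* chg calls not yet added or aborted */
	long cache_count;	/* descriptors sitting in the cache */
};

/*
 * "chg" phase: the only place that may allocate, and so the only place
 * that may fail.  It tops the cache up so that every in-progress add has
 * a descriptor available.  alloc_ok simulates allocation success.
 */
static bool model_region_chg(struct resv_model *r, bool alloc_ok)
{
	if (r->adds_in_progress + 1 > r->cache_count) {
		if (!alloc_ok)
			return false;
		r->cache_count++;
	}
	r->adds_in_progress++;
	return true;
}

/* "add" phase: never allocates; takes a cached descriptor if needed. */
static void model_region_add(struct resv_model *r, bool needs_descriptor)
{
	if (needs_descriptor)
		r->cache_count--;
	r->adds_in_progress--;
}

/* "abort": the add will not happen, so drop the in-progress count. */
static void model_region_abort(struct resv_model *r)
{
	r->adds_in_progress--;
}
```

In this model cache_count never drops below adds_in_progress between
calls, which is the property that lets the add phase proceed without
allocating.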

[akpm@linux-foundation.org: fix typo in comment, use more cols]
Signed-off-by: Mike Kravetz 
Reviewed-by: Naoya Horiguchi 
Acked-by: Hillf Danton 
Cc: Dave Hansen 
Cc: David Rientjes 
Cc: Hugh Dickins 
Cc: Davidlohr Bueso 
Cc: Aneesh Kumar 
Cc: Christoph Hellwig 
Cc: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: hugetlb: allow hugepages_supported to be architecture specific

2015-07-17T23:39:52+00:00

s390 has a constant hugepage size, by setting HPAGE_SHIFT we also change
e.g. the pageblock_order, which should be independent of hugepage
support.

With this patch every architecture is free to define how to check
for hugepage support.
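
The usual way to allow such a per-architecture override is an #ifndef
guard around a generic default.  The sketch below shows the pattern in a
single file; the "arch" part, including the machine_has_hpage flag and
the HPAGE_SHIFT value, is a hypothetical stand-in rather than real
architecture code.

```c
#include <stdbool.h>

/* ---- hypothetical architecture header ---- */
static bool machine_has_hpage;	/* stand-in for a runtime facility check */
#define HPAGE_SHIFT 20		/* constant, so it cannot signal support */
#define hugepages_supported() (machine_has_hpage)

/* ---- generic header: default used only if the arch gave none ---- */
#ifndef hugepages_supported
#define hugepages_supported() (HPAGE_SHIFT != 0)
#endif

/* Generic code calls the macro without knowing which version is active. */
static bool hugetlb_available(void)
{
	return hugepages_supported();
}
```

An architecture that does not define hugepages_supported() falls through
to the generic HPAGE_SHIFT test.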

Signed-off-by: Dominik Dingel 
Acked-by: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Christian Borntraeger 
Cc: Michael Holzheu 
Cc: Gerald Schaefer 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: hugetlb: cleanup using page_huge_active()

2015-04-15T23:35:19+00:00

Now we have easy access to hugepages' activeness, so existing helpers to
get the information can be cleaned up.

[akpm@linux-foundation.org: s/PageHugeActive/page_huge_active/]
Signed-off-by: Naoya Horiguchi 
Cc: Hugh Dickins 
Reviewed-by: Michal Hocko 
Cc: Mel Gorman 
Cc: Johannes Weiner 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

hugetlbfs: accept subpool min_size mount option and setup accordingly

2015-04-15T23:35:18+00:00

Make 'min_size=' be an option when mounting a hugetlbfs.  This
option takes the same value as the 'size' option.  min_size can be
specified without specifying size.  If both are specified, min_size must
be less than or equal to size else the mount will fail.  If min_size is
specified, then at mount time an attempt is made to reserve min_size
pages.  If the reservation fails, the mount fails.  At umount time, the
reserved pages are released.
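
The relationship between the two options can be expressed as a small
validity check.  This is a model of the rule stated above, with
hypothetical names and a hypothetical "unset" sentinel; it does not
reproduce the option parsing code.

```c
#include <stdbool.h>

#define OPT_UNSET (-1LL)	/* hypothetical marker: option not given */

/*
 * Hypothetical check: either option may be given on its own, but when
 * both are present min_size must not exceed size.
 */
static bool hugetlbfs_sizes_valid(long long size, long long min_size)
{
	if (size != OPT_UNSET && min_size != OPT_UNSET && min_size > size)
		return false;
	return true;
}
```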

Signed-off-by: Mike Kravetz 
Cc: Davidlohr Bueso 
Cc: Aneesh Kumar 
Cc: Joonsoo Kim 
Cc: Andi Kleen 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds