author		Linus Torvalds <torvalds@linux-foundation.org>	2026-04-15 12:59:16 -0700
committer	Linus Torvalds <torvalds@linux-foundation.org>	2026-04-15 12:59:16 -0700
commit		334fbe734e687404f346eba7d5d96ed2b44d35ab (patch)
tree		65d5c8f4de18335209b2529146e6b06960a48b43 /include/linux
parent		5bdb4078e1efba9650c03753616866192d680718 (diff)
parent		3bac01168982ec3e3bf87efdc1807c7933590a85 (diff)
Merge tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:

 - "maple_tree: Replace big node with maple copy" (Liam Howlett)
   Mainly preparatory work for ongoing development, but it does reduce
   stack usage and is an improvement.

 - "mm, swap: swap table phase III: remove swap_map" (Kairui Song)
   Offers memory savings by removing the static swap_map. It also
   yields some CPU savings and implements several cleanups.

 - "mm: memfd_luo: preserve file seals" (Pratyush Yadav)
   Add file seal preservation to LUO's memfd code.

 - "mm: zswap: add per-memcg stat for incompressible pages" (Jiayuan Chen)
   Additional userspace stats reporting for zswap.

 - "arch, mm: consolidate empty_zero_page" (Mike Rapoport)
   Some cleanups for our handling of ZERO_PAGE() and zero_pfn.

 - "mm/kmemleak: Improve scan_should_stop() implementation" (Zhongqiu Han)
   A robustness improvement and some cleanups in the kmemleak code.

 - "Improve khugepaged scan logic" (Vernon Yang)
   Improve khugepaged scan logic and reduce CPU consumption by
   prioritizing scanning tasks that access memory frequently.

 - "Make KHO Stateless" (Jason Miu)
   Simplify Kexec Handover by transitioning KHO from an xarray-based
   metadata tracking system with serialization to a radix tree data
   structure that can be passed directly to the next kernel.

 - "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" (Thomas
   Ballasi and Steven Rostedt)
   Enhance vmscan's tracepointing.

 - "mm: arch/shstk: Common shadow stack mapping helper and
   VM_NOHUGEPAGE" (Catalin Marinas)
   Cleanup for the shadow stack code: remove per-arch code in favour of
   a generic implementation.

 - "Fix KASAN support for KHO restored vmalloc regions" (Pasha Tatashin)
   Fix a WARN() which can be emitted when KHO restores a vmalloc area.

 - "mm: Remove stray references to pagevec" (Tal Zussman)
   Several cleanups, mainly updating references to "struct pagevec",
   which became folio_batch three years ago.

 - "mm: Eliminate fake head pages from vmemmap optimization" (Kiryl
   Shutsemau)
   Simplify the HugeTLB vmemmap optimization (HVO) by changing how tail
   pages encode their relationship to the head page.

 - "mm/damon/core: improve DAMOS quota efficiency for core layer
   filters" (SeongJae Park)
   Improve two problematic behaviors of DAMOS that make it less
   efficient when core layer filters are used.

 - "mm/damon: strictly respect min_nr_regions" (SeongJae Park)
   Improve DAMON usability by extending the treatment of the
   min_nr_regions user-settable parameter.

 - "mm/page_alloc: pcp locking cleanup" (Vlastimil Babka)
   The proper fix for a previously hotfixed SMP=n issue. Code
   simplifications and cleanups ensued.

 - "mm: cleanups around unmapping / zapping" (David Hildenbrand)
   A bunch of cleanups around unmapping and zapping. Mostly
   simplifications, code movements, documentation and renaming of
   zapping functions.

 - "support batched checking of the young flag for MGLRU" (Baolin Wang)
   Batched checking of the young flag for MGLRU. It's partly cleanups;
   one benchmark shows large performance benefits for arm64.

 - "memcg: obj stock and slab stat caching cleanups" (Johannes Weiner)
   memcg cleanup and robustness improvements.

 - "Allow order zero pages in page reporting" (Yuvraj Sakshith)
   Enhance free page reporting - it presently and undesirably excludes
   order-0 pages when reporting free memory.

 - "mm: vma flag tweaks" (Lorenzo Stoakes)
   Cleanup work following from the recent conversion of the VMA flags
   to a bitmap.

 - "mm/damon: add optional debugging-purpose sanity checks" (SeongJae Park)
   Add some more developer-facing debug checks into DAMON core.

 - "mm/damon: test and document power-of-2 min_region_sz requirement"
   (SeongJae Park)
   An additional DAMON kunit test, plus some adjustments to the
   addr_unit parameter handling.

 - "mm/damon/core: make passed_sample_intervals comparisons
   overflow-safe" (SeongJae Park)
   Fix a hard-to-hit time overflow issue in DAMON core.

 - "mm/damon: improve/fixup/update ratio calculation, test and
   documentation" (SeongJae Park)
   A batch of misc/minor improvements and fixups for DAMON.

 - "mm: move vma_(kernel|mmu)_pagesize() out of hugetlb.c" (David
   Hildenbrand)
   Fix a possible issue with dax-device when CONFIG_HUGETLB=n. Some
   code movement was required.

 - "zram: recompression cleanups and tweaks" (Sergey Senozhatsky)
   A somewhat random mix of fixups, recompression cleanups and
   improvements in the zram code.

 - "mm/damon: support multiple goal-based quota tuning algorithms"
   (SeongJae Park)
   Extend DAMOS quotas goal auto-tuning to support multiple tuning
   algorithms that users can select.

 - "mm: thp: reduce unnecessary start_stop_khugepaged()" (Breno Leitao)
   Fix the khugepaged sysfs handling so we no longer spam the logs with
   reams of junk when starting/stopping khugepaged.

 - "mm: improve map count checks" (Lorenzo Stoakes)
   Provide some cleanups and slight fixes in the mremap, mmap and vma
   code.

 - "mm/damon: support addr_unit on default monitoring targets for
   modules" (SeongJae Park)
   Extend the use of DAMON core's addr_unit tunable.

 - "mm: khugepaged cleanups and mTHP prerequisites" (Nico Pache)
   Cleanups to khugepaged, and a base for Nico's planned khugepaged
   mTHP support.

 - "mm: memory hot(un)plug and SPARSEMEM cleanups" (David Hildenbrand)
   Code movement and cleanups in the memhotplug and sparsemem code.

 - "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup
   CONFIG_MIGRATION" (David Hildenbrand)
   Rationalize some memhotplug Kconfig support.

 - "change young flag check functions to return bool" (Baolin Wang)
   Cleanups to change all young flag check functions to return bool.

 - "mm/damon/sysfs: fix memory leak and NULL dereference issues" (Josh
   Law and SeongJae Park)
   Fix a few potential DAMON bugs.

 - "mm/vma: convert vm_flags_t to vma_flags_t in vma code" (Lorenzo
   Stoakes)
   Convert a lot of the existing use of the legacy vm_flags_t data type
   to the new vma_flags_t type which replaces it. Mainly in the vma
   code.

 - "mm: expand mmap_prepare functionality and usage" (Lorenzo Stoakes)
   Expand the mmap_prepare functionality, which is intended to replace
   the deprecated f_op->mmap hook, which has been the source of bugs
   and security issues for some time. Cleanups, documentation, and
   extension of mmap_prepare into filesystem drivers.

 - "mm/huge_memory: refactor zap_huge_pmd()" (Lorenzo Stoakes)
   Simplify and clean up zap_huge_pmd(). Additional cleanups around
   vm_normal_folio_pmd() and the softleaf functionality are performed.
* tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
  mm: fix deferred split queue races during migration
  mm/khugepaged: fix issue with tracking lock
  mm/huge_memory: add and use has_deposited_pgtable()
  mm/huge_memory: add and use normal_or_softleaf_folio_pmd()
  mm: add softleaf_is_valid_pmd_entry(), pmd_to_softleaf_folio()
  mm/huge_memory: separate out the folio part of zap_huge_pmd()
  mm/huge_memory: use mm instead of tlb->mm
  mm/huge_memory: remove unnecessary sanity checks
  mm/huge_memory: deduplicate zap deposited table call
  mm/huge_memory: remove unnecessary VM_BUG_ON_PAGE()
  mm/huge_memory: add a common exit path to zap_huge_pmd()
  mm/huge_memory: handle buggy PMD entry in zap_huge_pmd()
  mm/huge_memory: have zap_huge_pmd return a boolean, add kdoc
  mm/huge: avoid big else branch in zap_huge_pmd()
  mm/huge_memory: simplify vma_is_specal_huge()
  mm: on remap assert that input range within the proposed VMA
  mm: add mmap_action_map_kernel_pages[_full]()
  uio: replace deprecated mmap hook with mmap_prepare in uio_info
  drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare
  mm: allow handling of stacked mmap_prepare hooks in more drivers
  ...
Diffstat (limited to 'include/linux')
-rw-r--r--	include/linux/damon.h | 18
-rw-r--r--	include/linux/dax.h | 4
-rw-r--r--	include/linux/folio_batch.h (renamed from include/linux/pagevec.h) | 16
-rw-r--r--	include/linux/folio_queue.h | 8
-rw-r--r--	include/linux/fs.h | 14
-rw-r--r--	include/linux/huge_mm.h | 13
-rw-r--r--	include/linux/hugetlb.h | 31
-rw-r--r--	include/linux/hugetlb_inline.h | 4
-rw-r--r--	include/linux/hyperv.h | 4
-rw-r--r--	include/linux/iomap.h | 2
-rw-r--r--	include/linux/kasan.h | 8
-rw-r--r--	include/linux/kho/abi/kexec_handover.h | 144
-rw-r--r--	include/linux/kho/abi/memfd.h | 18
-rw-r--r--	include/linux/kho_radix_tree.h | 70
-rw-r--r--	include/linux/ksm.h | 10
-rw-r--r--	include/linux/leafops.h | 39
-rw-r--r--	include/linux/maple_tree.h | 42
-rw-r--r--	include/linux/memcontrol.h | 2
-rw-r--r--	include/linux/memfd.h | 12
-rw-r--r--	include/linux/memory-tiers.h | 2
-rw-r--r--	include/linux/memory.h | 3
-rw-r--r--	include/linux/memory_hotplug.h | 18
-rw-r--r--	include/linux/mm.h | 716
-rw-r--r--	include/linux/mm_inline.h | 16
-rw-r--r--	include/linux/mm_types.h | 91
-rw-r--r--	include/linux/mman.h | 49
-rw-r--r--	include/linux/mmu_notifier.h | 130
-rw-r--r--	include/linux/mmzone.h | 82
-rw-r--r--	include/linux/page-flags.h | 163
-rw-r--r--	include/linux/page_ref.h | 18
-rw-r--r--	include/linux/page_reporting.h | 1
-rw-r--r--	include/linux/pagewalk.h | 8
-rw-r--r--	include/linux/pgtable.h | 139
-rw-r--r--	include/linux/sunrpc/svc.h | 2
-rw-r--r--	include/linux/swap.h | 30
-rw-r--r--	include/linux/types.h | 2
-rw-r--r--	include/linux/uio_driver.h | 4
-rw-r--r--	include/linux/userfaultfd_k.h | 3
-rw-r--r--	include/linux/vm_event_item.h | 13
-rw-r--r--	include/linux/vmalloc.h | 3
-rw-r--r--	include/linux/writeback.h | 2
41 files changed, 1311 insertions, 643 deletions
diff --git a/include/linux/damon.h b/include/linux/damon.h
index be3d198043ff..d9a3babbafc1 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -55,6 +55,8 @@ struct damon_size_range {
* @list: List head for siblings.
* @age: Age of this region.
*
+ * For any use case, @ar should be non-zero positive size.
+ *
* @nr_accesses is reset to zero for every &damon_attrs->aggr_interval and be
* increased for every &damon_attrs->sample_interval if an access to the region
* during the last sampling interval is found. The update of this field should
@@ -214,11 +216,22 @@ struct damos_quota_goal {
};
/**
+ * enum damos_quota_goal_tuner - Goal-based quota tuning logic.
+ * @DAMOS_QUOTA_GOAL_TUNER_CONSIST: Aim long term consistent quota.
+ * @DAMOS_QUOTA_GOAL_TUNER_TEMPORAL: Aim zero quota asap.
+ */
+enum damos_quota_goal_tuner {
+ DAMOS_QUOTA_GOAL_TUNER_CONSIST,
+ DAMOS_QUOTA_GOAL_TUNER_TEMPORAL,
+};
+
+/**
* struct damos_quota - Controls the aggressiveness of the given scheme.
* @reset_interval: Charge reset interval in milliseconds.
* @ms: Maximum milliseconds that the scheme can use.
* @sz: Maximum bytes of memory that the action can be applied.
* @goals: Head of quota tuning goals (&damos_quota_goal) list.
+ * @goal_tuner: Goal-based @esz tuning algorithm to use.
* @esz: Effective size quota in bytes.
*
* @weight_sz: Weight of the region's size for prioritization.
@@ -260,6 +273,7 @@ struct damos_quota {
unsigned long ms;
unsigned long sz;
struct list_head goals;
+ enum damos_quota_goal_tuner goal_tuner;
unsigned long esz;
unsigned int weight_sz;
@@ -647,8 +661,7 @@ struct damon_operations {
void (*prepare_access_checks)(struct damon_ctx *context);
unsigned int (*check_accesses)(struct damon_ctx *context);
int (*get_scheme_score)(struct damon_ctx *context,
- struct damon_target *t, struct damon_region *r,
- struct damos *scheme);
+ struct damon_region *r, struct damos *scheme);
unsigned long (*apply_scheme)(struct damon_ctx *context,
struct damon_target *t, struct damon_region *r,
struct damos *scheme, unsigned long *sz_filter_passed);
@@ -981,6 +994,7 @@ int damos_walk(struct damon_ctx *ctx, struct damos_walk_control *control);
int damon_set_region_biggest_system_ram_default(struct damon_target *t,
unsigned long *start, unsigned long *end,
+ unsigned long addr_unit,
unsigned long min_region_sz);
#endif /* CONFIG_DAMON */
diff --git a/include/linux/dax.h b/include/linux/dax.h
index bf103f317cac..10a7cc79aea5 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -69,7 +69,7 @@ static inline bool daxdev_mapping_supported(const struct vm_area_desc *desc,
const struct inode *inode,
struct dax_device *dax_dev)
{
- if (!vma_desc_test_flags(desc, VMA_SYNC_BIT))
+ if (!vma_desc_test(desc, VMA_SYNC_BIT))
return true;
if (!IS_DAX(inode))
return false;
@@ -115,7 +115,7 @@ static inline bool daxdev_mapping_supported(const struct vm_area_desc *desc,
const struct inode *inode,
struct dax_device *dax_dev)
{
- return !vma_desc_test_flags(desc, VMA_SYNC_BIT);
+ return !vma_desc_test(desc, VMA_SYNC_BIT);
}
static inline size_t dax_recovery_write(struct dax_device *dax_dev,
pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i)
diff --git a/include/linux/pagevec.h b/include/linux/folio_batch.h
index 63be5a451627..b45946adc50b 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/folio_batch.h
@@ -1,18 +1,18 @@
/* SPDX-License-Identifier: GPL-2.0 */
/*
- * include/linux/pagevec.h
+ * include/linux/folio_batch.h
*
* In many places it is efficient to batch an operation up against multiple
* folios. A folio_batch is a container which is used for that.
*/
-#ifndef _LINUX_PAGEVEC_H
-#define _LINUX_PAGEVEC_H
+#ifndef _LINUX_FOLIO_BATCH_H
+#define _LINUX_FOLIO_BATCH_H
#include <linux/types.h>
/* 31 pointers + header align the folio_batch structure to a power of two */
-#define PAGEVEC_SIZE 31
+#define FOLIO_BATCH_SIZE 31
struct folio;
@@ -29,7 +29,7 @@ struct folio_batch {
unsigned char nr;
unsigned char i;
bool percpu_pvec_drained;
- struct folio *folios[PAGEVEC_SIZE];
+ struct folio *folios[FOLIO_BATCH_SIZE];
};
/**
@@ -58,7 +58,7 @@ static inline unsigned int folio_batch_count(const struct folio_batch *fbatch)
static inline unsigned int folio_batch_space(const struct folio_batch *fbatch)
{
- return PAGEVEC_SIZE - fbatch->nr;
+ return FOLIO_BATCH_SIZE - fbatch->nr;
}
/**
@@ -93,7 +93,7 @@ static inline struct folio *folio_batch_next(struct folio_batch *fbatch)
return fbatch->folios[fbatch->i++];
}
-void __folio_batch_release(struct folio_batch *pvec);
+void __folio_batch_release(struct folio_batch *fbatch);
static inline void folio_batch_release(struct folio_batch *fbatch)
{
@@ -102,4 +102,4 @@ static inline void folio_batch_release(struct folio_batch *fbatch)
}
void folio_batch_remove_exceptionals(struct folio_batch *fbatch);
-#endif /* _LINUX_PAGEVEC_H */
+#endif /* _LINUX_FOLIO_BATCH_H */
diff --git a/include/linux/folio_queue.h b/include/linux/folio_queue.h
index adab609c972e..f6d5f1f127c9 100644
--- a/include/linux/folio_queue.h
+++ b/include/linux/folio_queue.h
@@ -14,7 +14,7 @@
#ifndef _LINUX_FOLIO_QUEUE_H
#define _LINUX_FOLIO_QUEUE_H
-#include <linux/pagevec.h>
+#include <linux/folio_batch.h>
#include <linux/mm.h>
/*
@@ -29,12 +29,12 @@
*/
struct folio_queue {
struct folio_batch vec; /* Folios in the queue segment */
- u8 orders[PAGEVEC_SIZE]; /* Order of each folio */
+ u8 orders[FOLIO_BATCH_SIZE]; /* Order of each folio */
struct folio_queue *next; /* Next queue segment or NULL */
struct folio_queue *prev; /* Previous queue segment of NULL */
unsigned long marks; /* 1-bit mark per folio */
unsigned long marks2; /* Second 1-bit mark per folio */
-#if PAGEVEC_SIZE > BITS_PER_LONG
+#if FOLIO_BATCH_SIZE > BITS_PER_LONG
#error marks is not big enough
#endif
unsigned int rreq_id;
@@ -70,7 +70,7 @@ static inline void folioq_init(struct folio_queue *folioq, unsigned int rreq_id)
*/
static inline unsigned int folioq_nr_slots(const struct folio_queue *folioq)
{
- return PAGEVEC_SIZE;
+ return FOLIO_BATCH_SIZE;
}
/**
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b5b01bb22d12..e1d257e6da68 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2058,16 +2058,24 @@ static inline bool can_mmap_file(struct file *file)
return true;
}
-int __compat_vma_mmap(const struct file_operations *f_op,
- struct file *file, struct vm_area_struct *vma);
+void compat_set_desc_from_vma(struct vm_area_desc *desc, const struct file *file,
+ const struct vm_area_struct *vma);
+int __compat_vma_mmap(struct vm_area_desc *desc, struct vm_area_struct *vma);
int compat_vma_mmap(struct file *file, struct vm_area_struct *vma);
+int __vma_check_mmap_hook(struct vm_area_struct *vma);
static inline int vfs_mmap(struct file *file, struct vm_area_struct *vma)
{
+ int err;
+
if (file->f_op->mmap_prepare)
return compat_vma_mmap(file, vma);
- return file->f_op->mmap(file, vma);
+ err = file->f_op->mmap(file, vma);
+ if (err)
+ return err;
+
+ return __vma_check_mmap_hook(vma);
}
static inline int vfs_mmap_prepare(struct file *file, struct vm_area_desc *desc)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a4d9f964dfde..2949e5acff35 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -27,8 +27,8 @@ static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf);
bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr, unsigned long next);
-int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long addr);
+bool zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd,
+ unsigned long addr);
int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, pud_t *pud,
unsigned long addr);
bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
@@ -83,7 +83,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
* file is never split and the MAX_PAGECACHE_ORDER limit does not apply to
* it. Same to PFNMAPs where there's neither page* nor pagecache.
*/
-#define THP_ORDERS_ALL_SPECIAL \
+#define THP_ORDERS_ALL_SPECIAL_DAX \
(BIT(PMD_ORDER) | BIT(PUD_ORDER))
#define THP_ORDERS_ALL_FILE_DEFAULT \
((BIT(MAX_PAGECACHE_ORDER + 1) - 1) & ~BIT(0))
@@ -92,7 +92,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
* Mask of all large folio orders supported for THP.
*/
#define THP_ORDERS_ALL \
- (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_SPECIAL | THP_ORDERS_ALL_FILE_DEFAULT)
+ (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_SPECIAL_DAX | THP_ORDERS_ALL_FILE_DEFAULT)
enum tva_type {
TVA_SMAPS, /* Exposing "THPeligible:" in smaps. */
@@ -771,6 +771,11 @@ static inline bool pmd_is_huge(pmd_t pmd)
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+static inline bool is_pmd_order(unsigned int order)
+{
+ return order == HPAGE_PMD_ORDER;
+}
+
static inline int split_folio_to_list_to_order(struct folio *folio,
struct list_head *list, int new_order)
{
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index fc5462fe943f..93418625d3c5 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -778,10 +778,6 @@ static inline unsigned long huge_page_size(const struct hstate *h)
return (unsigned long)PAGE_SIZE << h->order;
}
-extern unsigned long vma_kernel_pagesize(struct vm_area_struct *vma);
-
-extern unsigned long vma_mmu_pagesize(struct vm_area_struct *vma);
-
static inline unsigned long huge_page_mask(struct hstate *h)
{
return h->mask;
@@ -797,6 +793,23 @@ static inline unsigned huge_page_shift(struct hstate *h)
return h->order + PAGE_SHIFT;
}
+/**
+ * hugetlb_linear_page_index() - linear_page_index() but in hugetlb
+ * page size granularity.
+ * @vma: the hugetlb VMA
+ * @address: the virtual address within the VMA
+ *
+ * Return: the page offset within the mapping in huge page units.
+ */
+static inline pgoff_t hugetlb_linear_page_index(struct vm_area_struct *vma,
+ unsigned long address)
+{
+ struct hstate *h = hstate_vma(vma);
+
+ return ((address - vma->vm_start) >> huge_page_shift(h)) +
+ (vma->vm_pgoff >> huge_page_order(h));
+}
+
static inline bool order_is_gigantic(unsigned int order)
{
return order > MAX_PAGE_ORDER;
@@ -1178,16 +1191,6 @@ static inline unsigned long huge_page_mask(struct hstate *h)
return PAGE_MASK;
}
-static inline unsigned long vma_kernel_pagesize(struct vm_area_struct *vma)
-{
- return PAGE_SIZE;
-}
-
-static inline unsigned long vma_mmu_pagesize(struct vm_area_struct *vma)
-{
- return PAGE_SIZE;
-}
-
static inline unsigned int huge_page_order(struct hstate *h)
{
return 0;
diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
index 593f5d4e108b..565b473fd135 100644
--- a/include/linux/hugetlb_inline.h
+++ b/include/linux/hugetlb_inline.h
@@ -13,7 +13,7 @@ static inline bool is_vm_hugetlb_flags(vm_flags_t vm_flags)
static inline bool is_vma_hugetlb_flags(const vma_flags_t *flags)
{
- return vma_flags_test(flags, VMA_HUGETLB_BIT);
+ return vma_flags_test_any(flags, VMA_HUGETLB_BIT);
}
#else
@@ -30,7 +30,7 @@ static inline bool is_vma_hugetlb_flags(const vma_flags_t *flags)
#endif
-static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
+static inline bool is_vm_hugetlb_page(const struct vm_area_struct *vma)
{
return is_vm_hugetlb_flags(vma->vm_flags);
}
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index dfc516c1c719..a26fb8e7cedf 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1015,8 +1015,8 @@ struct vmbus_channel {
/* The max size of a packet on this channel */
u32 max_pkt_size;
- /* function to mmap ring buffer memory to the channel's sysfs ring attribute */
- int (*mmap_ring_buffer)(struct vmbus_channel *channel, struct vm_area_struct *vma);
+ /* function to mmap ring buffer memory to the channel's sysfs ring attribute */
+ int (*mmap_prepare_ring_buffer)(struct vmbus_channel *channel, struct vm_area_desc *desc);
/* boolean to control visibility of sysfs for ring buffer */
bool ring_sysfs_visible;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 531f9ebdeeae..2c5685adf3a9 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -9,7 +9,7 @@
#include <linux/types.h>
#include <linux/mm_types.h>
#include <linux/blkdev.h>
-#include <linux/pagevec.h>
+#include <linux/folio_batch.h>
struct address_space;
struct fiemap_extent_info;
diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 338a1921a50a..bf233bde68c7 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -352,8 +352,8 @@ bool __kasan_mempool_poison_object(void *ptr, unsigned long ip);
* kasan_mempool_unpoison_object().
*
* This function operates on all slab allocations including large kmalloc
- * allocations (the ones returned by kmalloc_large() or by kmalloc() with the
- * size > KMALLOC_MAX_SIZE).
+ * allocations (i.e. the ones backed directly by the buddy allocator rather
+ * than kmalloc slab caches).
*
* Return: true if the allocation can be safely reused; false otherwise.
*/
@@ -381,8 +381,8 @@ void __kasan_mempool_unpoison_object(void *ptr, size_t size, unsigned long ip);
* original tags based on the pointer value.
*
* This function operates on all slab allocations including large kmalloc
- * allocations (the ones returned by kmalloc_large() or by kmalloc() with the
- * size > KMALLOC_MAX_SIZE).
+ * allocations (i.e. the ones backed directly by the buddy allocator rather
+ * than kmalloc slab caches).
*/
static __always_inline void kasan_mempool_unpoison_object(void *ptr,
size_t size)
diff --git a/include/linux/kho/abi/kexec_handover.h b/include/linux/kho/abi/kexec_handover.h
index 2201a0d2c159..6b7d8ef550f9 100644
--- a/include/linux/kho/abi/kexec_handover.h
+++ b/include/linux/kho/abi/kexec_handover.h
@@ -10,8 +10,13 @@
#ifndef _LINUX_KHO_ABI_KEXEC_HANDOVER_H
#define _LINUX_KHO_ABI_KEXEC_HANDOVER_H
+#include <linux/bits.h>
+#include <linux/log2.h>
+#include <linux/math.h>
#include <linux/types.h>
+#include <asm/page.h>
+
/**
* DOC: Kexec Handover ABI
*
@@ -29,32 +34,32 @@
* compatibility is only guaranteed for kernels supporting the same ABI version.
*
* FDT Structure Overview:
- * The FDT serves as a central registry for physical
- * addresses of preserved data structures and sub-FDTs. The first kernel
- * populates this FDT with references to memory regions and other FDTs that
- * need to persist across the kexec transition. The subsequent kernel then
- * parses this FDT to locate and restore the preserved data.::
+ * The FDT serves as a central registry for physical addresses of preserved
+ * data structures. The first kernel populates this FDT with references to
+ * memory regions and other metadata that need to persist across the kexec
+ * transition. The subsequent kernel then parses this FDT to locate and
+ * restore the preserved data.::
*
* / {
- * compatible = "kho-v1";
+ * compatible = "kho-v2";
*
* preserved-memory-map = <0x...>;
*
* <subnode-name-1> {
- * fdt = <0x...>;
+ * preserved-data = <0x...>;
* };
*
* <subnode-name-2> {
- * fdt = <0x...>;
+ * preserved-data = <0x...>;
* };
* ... ...
* <subnode-name-N> {
- * fdt = <0x...>;
+ * preserved-data = <0x...>;
* };
* };
*
* Root KHO Node (/):
- * - compatible: "kho-v1"
+ * - compatible: "kho-v2"
*
* Indentifies the overall KHO ABI version.
*
@@ -69,20 +74,20 @@
* is provided by the subsystem that uses KHO for preserving its
* data.
*
- * - fdt: u64
+ * - preserved-data: u64
*
- * Physical address pointing to a subnode FDT blob that is also
+ * Physical address pointing to a subnode data blob that is also
* being preserved.
*/
/* The compatible string for the KHO FDT root node. */
-#define KHO_FDT_COMPATIBLE "kho-v1"
+#define KHO_FDT_COMPATIBLE "kho-v2"
/* The FDT property for the preserved memory map. */
#define KHO_FDT_MEMORY_MAP_PROP_NAME "preserved-memory-map"
-/* The FDT property for sub-FDTs. */
-#define KHO_FDT_SUB_TREE_PROP_NAME "fdt"
+/* The FDT property for preserved data blobs. */
+#define KHO_FDT_SUB_TREE_PROP_NAME "preserved-data"
/**
* DOC: Kexec Handover ABI for vmalloc Preservation
@@ -160,4 +165,113 @@ struct kho_vmalloc {
unsigned short order;
};
+/**
+ * DOC: KHO persistent memory tracker
+ *
+ * KHO tracks preserved memory using a radix tree data structure. Each node of
+ * the tree is exactly a single page. The leaf nodes are bitmaps where each set
+ * bit is a preserved page of any order. The intermediate nodes are tables of
+ * physical addresses that point to a lower level node.
+ *
+ * The tree hierarchy is shown below::
+ *
+ * root
+ * +-------------------+
+ * | Level 5 | (struct kho_radix_node)
+ * +-------------------+
+ * |
+ * v
+ * +-------------------+
+ * | Level 4 | (struct kho_radix_node)
+ * +-------------------+
+ * |
+ * | ... (intermediate levels)
+ * |
+ * v
+ * +-------------------+
+ * | Level 0 | (struct kho_radix_leaf)
+ * +-------------------+
+ *
+ * The tree is traversed using a key that encodes the page's physical address
+ * (pa) and its order into a single unsigned long value. The encoded key value
+ * is composed of two parts: the 'order bit' in the upper part and the
+ * 'shifted physical address' in the lower part.::
+ *
+ * +------------+-----------------------------+--------------------------+
+ * | Page Order | Order Bit | Shifted Physical Address |
+ * +------------+-----------------------------+--------------------------+
+ * | 0 | ...000100 ... (at bit 52) | pa >> (PAGE_SHIFT + 0) |
+ * | 1 | ...000010 ... (at bit 51) | pa >> (PAGE_SHIFT + 1) |
+ * | 2 | ...000001 ... (at bit 50) | pa >> (PAGE_SHIFT + 2) |
+ * | ... | ... | ... |
+ * +------------+-----------------------------+--------------------------+
+ *
+ * Shifted Physical Address:
+ * The 'shifted physical address' is the physical address normalized for its
+ * order. It effectively represents the PFN shifted right by the order.
+ *
+ * Order Bit:
+ * The 'order bit' encodes the page order by setting a single bit at a
+ * specific position. The position of this bit itself represents the order.
+ *
+ * For instance, on a 64-bit system with 4KB pages (PAGE_SHIFT = 12), the
+ * maximum range for the shifted physical address (for order 0) is 52 bits
+ * (64 - 12). This address occupies bits [0-51]. For order 0, the order bit is
+ * set at position 52.
+ *
+ * The following diagram illustrates how the encoded key value is split into
+ * indices for the tree levels, with PAGE_SIZE of 4KB::
+ *
+ * 63:60 59:51 50:42 41:33 32:24 23:15 14:0
+ * +---------+--------+--------+--------+--------+--------+-----------------+
+ * | 0 | Lv 5 | Lv 4 | Lv 3 | Lv 2 | Lv 1 | Lv 0 (bitmap) |
+ * +---------+--------+--------+--------+--------+--------+-----------------+
+ *
+ * The radix tree stores pages of all orders in a single 6-level hierarchy. It
+ * efficiently shares higher tree levels, especially due to common zero top
+ * address bits, allowing a single, efficient algorithm to manage all
+ * pages. This bitmap approach also offers memory efficiency; for example, a
+ * 512KB bitmap can cover a 16GB memory range for 0-order pages with PAGE_SIZE =
+ * 4KB.
+ *
+ * The data structures defined here are part of the KHO ABI. Any modification
+ * to these structures that breaks backward compatibility must be accompanied by
+ * an update to the "compatible" string. This ensures that a newer kernel can
+ * correctly interpret the data passed by an older kernel.
+ */
+
+/*
+ * Defines constants for the KHO radix tree structure, used to track preserved
+ * memory. These constants govern the indexing, sizing, and depth of the tree.
+ */
+enum kho_radix_consts {
+ /*
+ * The bit position of the order bit (and also the length of the
+ * shifted physical address) for an order-0 page.
+ */
+ KHO_ORDER_0_LOG2 = 64 - PAGE_SHIFT,
+
+ /* Size of the table in kho_radix_node, in log2 */
+ KHO_TABLE_SIZE_LOG2 = const_ilog2(PAGE_SIZE / sizeof(phys_addr_t)),
+
+ /* Number of bits in the kho_radix_leaf bitmap, in log2 */
+ KHO_BITMAP_SIZE_LOG2 = PAGE_SHIFT + const_ilog2(BITS_PER_BYTE),
+
+ /*
+ * The total tree depth is the number of intermediate levels
+ * and 1 bitmap level.
+ */
+ KHO_TREE_MAX_DEPTH =
+ DIV_ROUND_UP(KHO_ORDER_0_LOG2 - KHO_BITMAP_SIZE_LOG2,
+ KHO_TABLE_SIZE_LOG2) + 1,
+};
+
+struct kho_radix_node {
+ u64 table[1 << KHO_TABLE_SIZE_LOG2];
+};
+
+struct kho_radix_leaf {
+ DECLARE_BITMAP(bitmap, 1 << KHO_BITMAP_SIZE_LOG2);
+};
+
#endif /* _LINUX_KHO_ABI_KEXEC_HANDOVER_H */
diff --git a/include/linux/kho/abi/memfd.h b/include/linux/kho/abi/memfd.h
index 68cb6303b846..08b10fea2afc 100644
--- a/include/linux/kho/abi/memfd.h
+++ b/include/linux/kho/abi/memfd.h
@@ -56,10 +56,24 @@ struct memfd_luo_folio_ser {
u64 index;
} __packed;
+/*
+ * The set of seals this version supports preserving. If support for any new
+ * seals is needed, add it here and bump version.
+ */
+#define MEMFD_LUO_ALL_SEALS (F_SEAL_SEAL | \
+ F_SEAL_SHRINK | \
+ F_SEAL_GROW | \
+ F_SEAL_WRITE | \
+ F_SEAL_FUTURE_WRITE | \
+ F_SEAL_EXEC)
+
/**
* struct memfd_luo_ser - Main serialization structure for a memfd.
* @pos: The file's current position (f_pos).
* @size: The total size of the file in bytes (i_size).
+ * @seals: The seals present on the memfd. The seals are uABI so it is safe
+ * to directly use them in the ABI.
+ * @flags: Flags for the file. Unused flag bits must be set to 0.
* @nr_folios: Number of folios in the folios array.
* @folios: KHO vmalloc descriptor pointing to the array of
* struct memfd_luo_folio_ser.
@@ -67,11 +81,13 @@ struct memfd_luo_folio_ser {
struct memfd_luo_ser {
u64 pos;
u64 size;
+ u32 seals;
+ u32 flags;
u64 nr_folios;
struct kho_vmalloc folios;
} __packed;
/* The compatibility string for memfd file handler */
-#define MEMFD_LUO_FH_COMPATIBLE "memfd-v1"
+#define MEMFD_LUO_FH_COMPATIBLE "memfd-v2"
#endif /* _LINUX_KHO_ABI_MEMFD_H */
diff --git a/include/linux/kho_radix_tree.h b/include/linux/kho_radix_tree.h
new file mode 100644
index 000000000000..84e918b96e53
--- /dev/null
+++ b/include/linux/kho_radix_tree.h
@@ -0,0 +1,70 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_KHO_RADIX_TREE_H
+#define _LINUX_KHO_RADIX_TREE_H
+
+#include <linux/err.h>
+#include <linux/errno.h>
+#include <linux/mutex_types.h>
+#include <linux/types.h>
+
+/**
+ * DOC: Kexec Handover Radix Tree
+ *
+ * This is a radix tree implementation for tracking physical memory pages
+ * across kexec transitions. It was developed for the KHO mechanism but is
+ * designed for broader use by any subsystem that needs to preserve pages.
+ *
+ * The radix tree is a multi-level tree where leaf nodes are bitmaps
+ * representing individual pages. To allow pages of different sizes (orders)
+ * to be stored efficiently in a single tree, it uses a unique key encoding
+ * scheme. Each key is an unsigned long that combines a page's physical
+ * address and its order.
+ *
+ * Client code is responsible for allocating the root node of the tree,
+ * initializing the mutex lock, and managing its lifecycle. It must use the
+ * tree data structures defined in the KHO ABI,
+ * `include/linux/kho/abi/kexec_handover.h`.
+ */
+
+struct kho_radix_node;
+
+struct kho_radix_tree {
+ struct kho_radix_node *root;
+ struct mutex lock; /* protects the tree's structure and root pointer */
+};
+
+typedef int (*kho_radix_tree_walk_callback_t)(phys_addr_t phys,
+ unsigned int order);
+
+#ifdef CONFIG_KEXEC_HANDOVER
+
+int kho_radix_add_page(struct kho_radix_tree *tree, unsigned long pfn,
+ unsigned int order);
+
+void kho_radix_del_page(struct kho_radix_tree *tree, unsigned long pfn,
+ unsigned int order);
+
+int kho_radix_walk_tree(struct kho_radix_tree *tree,
+ kho_radix_tree_walk_callback_t cb);
+
+#else /* #ifdef CONFIG_KEXEC_HANDOVER */
+
+static inline int kho_radix_add_page(struct kho_radix_tree *tree,
+ unsigned long pfn, unsigned int order)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline void kho_radix_del_page(struct kho_radix_tree *tree,
+ unsigned long pfn, unsigned int order) { }
+
+static inline int kho_radix_walk_tree(struct kho_radix_tree *tree,
+ kho_radix_tree_walk_callback_t cb)
+{
+ return -EOPNOTSUPP;
+}
+
+#endif /* #ifdef CONFIG_KEXEC_HANDOVER */
+
+#endif /* _LINUX_KHO_RADIX_TREE_H */
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index c982694c987b..d39d0d5483a2 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -17,8 +17,8 @@
#ifdef CONFIG_KSM
int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
unsigned long end, int advice, vm_flags_t *vm_flags);
-vm_flags_t ksm_vma_flags(struct mm_struct *mm, const struct file *file,
- vm_flags_t vm_flags);
+vma_flags_t ksm_vma_flags(struct mm_struct *mm, const struct file *file,
+ vma_flags_t vma_flags);
int ksm_enable_merge_any(struct mm_struct *mm);
int ksm_disable_merge_any(struct mm_struct *mm);
int ksm_disable(struct mm_struct *mm);
@@ -103,10 +103,10 @@ bool ksm_process_mergeable(struct mm_struct *mm);
#else /* !CONFIG_KSM */
-static inline vm_flags_t ksm_vma_flags(struct mm_struct *mm,
- const struct file *file, vm_flags_t vm_flags)
+static inline vma_flags_t ksm_vma_flags(struct mm_struct *mm,
+ const struct file *file, vma_flags_t vma_flags)
{
- return vm_flags;
+ return vma_flags;
}
static inline int ksm_disable(struct mm_struct *mm)
diff --git a/include/linux/leafops.h b/include/linux/leafops.h
index 05673d3529e7..992cd8bd8ed0 100644
--- a/include/linux/leafops.h
+++ b/include/linux/leafops.h
@@ -607,7 +607,20 @@ static inline bool pmd_is_migration_entry(pmd_t pmd)
}
/**
- * pmd_is_valid_softleaf() - Is this PMD entry a valid leaf entry?
+ * softleaf_is_valid_pmd_entry() - Is the specified softleaf entry, obtained
+ * from a PMD, one that we support at PMD level?
+ * @entry: Entry to check.
+ * Returns: true if the softleaf entry is valid at PMD level, otherwise false.
+ */
+static inline bool softleaf_is_valid_pmd_entry(softleaf_t entry)
+{
+ /* Only device private and migration entries are valid for PMD. */
+ return softleaf_is_device_private(entry) ||
+ softleaf_is_migration(entry);
+}
+
+/**
+ * pmd_is_valid_softleaf() - Is this PMD entry a valid softleaf entry?
* @pmd: PMD entry.
*
* PMD leaf entries are valid only if they are device private or migration
@@ -620,9 +633,27 @@ static inline bool pmd_is_valid_softleaf(pmd_t pmd)
{
const softleaf_t entry = softleaf_from_pmd(pmd);
- /* Only device private, migration entries valid for PMD. */
- return softleaf_is_device_private(entry) ||
- softleaf_is_migration(entry);
+ return softleaf_is_valid_pmd_entry(entry);
+}
+
+/**
+ * pmd_to_softleaf_folio() - Convert the PMD entry to a folio.
+ * @pmd: PMD entry.
+ *
+ * The PMD entry is expected to be a valid PMD softleaf entry.
+ *
+ * Returns: the folio the softleaf entry references if this is a valid softleaf
+ * entry, otherwise NULL.
+ */
+static inline struct folio *pmd_to_softleaf_folio(pmd_t pmd)
+{
+ const softleaf_t entry = softleaf_from_pmd(pmd);
+
+ if (!softleaf_is_valid_pmd_entry(entry)) {
+ VM_WARN_ON_ONCE(true);
+ return NULL;
+ }
+ return softleaf_to_folio(entry);
}
#endif /* CONFIG_MMU */
diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
index 7b8aad47121e..0c464eade1d6 100644
--- a/include/linux/maple_tree.h
+++ b/include/linux/maple_tree.h
@@ -139,6 +139,7 @@ enum maple_type {
maple_leaf_64,
maple_range_64,
maple_arange_64,
+ maple_copy,
};
enum store_type {
@@ -154,6 +155,46 @@ enum store_type {
wr_slot_store,
};
+struct maple_copy {
+ /*
+ * min, max, and pivots are values
+ * start, end, split are indexes into arrays
+ * data is a size
+ */
+
+ struct {
+ struct maple_node *node;
+ unsigned long max;
+ enum maple_type mt;
+ } dst[3];
+ struct {
+ struct maple_node *node;
+ unsigned long max;
+ unsigned char start;
+ unsigned char end;
+ enum maple_type mt;
+ } src[4];
+ /* Simulated node */
+ void __rcu *slot[3];
+ unsigned long gap[3];
+ unsigned long min;
+ union {
+ unsigned long pivot[3];
+ struct {
+ void *_pad[2];
+ unsigned long max;
+ };
+ };
+ unsigned char end;
+
+ /* Avoid passing these around. */
+ unsigned char s_count;
+ unsigned char d_count;
+ unsigned char split;
+ unsigned char data;
+ unsigned char height;
+};
+
/**
* DOC: Maple tree flags
*
@@ -299,6 +340,7 @@ struct maple_node {
};
struct maple_range_64 mr64;
struct maple_arange_64 ma64;
+ struct maple_copy cp;
};
};
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 70b685a85bf4..5173a9f16721 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,10 +35,10 @@ enum memcg_stat_item {
MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
MEMCG_SOCK,
MEMCG_PERCPU_B,
- MEMCG_VMALLOC,
MEMCG_KMEM,
MEMCG_ZSWAP_B,
MEMCG_ZSWAPPED,
+ MEMCG_ZSWAP_INCOMP,
MEMCG_NR_STAT,
};
diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index c328a7b356d0..b4fda09dab9f 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -18,6 +18,8 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx);
*/
int memfd_check_seals_mmap(struct file *file, vm_flags_t *vm_flags_ptr);
struct file *memfd_alloc_file(const char *name, unsigned int flags);
+int memfd_get_seals(struct file *file);
+int memfd_add_seals(struct file *file, unsigned int seals);
#else
static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned int a)
{
@@ -37,6 +39,16 @@ static inline struct file *memfd_alloc_file(const char *name, unsigned int flags
{
return ERR_PTR(-EINVAL);
}
+
+static inline int memfd_get_seals(struct file *file)
+{
+ return -EINVAL;
+}
+
+static inline int memfd_add_seals(struct file *file, unsigned int seals)
+{
+ return -EINVAL;
+}
#endif
#endif /* __LINUX_MEMFD_H */
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 96987d9d95a8..7999c58629ee 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -52,7 +52,7 @@ int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
struct memory_dev_type *mt_find_alloc_memory_type(int adist,
struct list_head *memory_types);
void mt_put_memory_types(struct list_head *memory_types);
-#ifdef CONFIG_MIGRATION
+#ifdef CONFIG_NUMA_MIGRATION
int next_demotion_node(int node, const nodemask_t *allowed_mask);
void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
bool node_is_toptier(int node);
diff --git a/include/linux/memory.h b/include/linux/memory.h
index faeaa921e55b..5bb5599c6b2b 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -19,6 +19,7 @@
#include <linux/node.h>
#include <linux/compiler.h>
#include <linux/mutex.h>
+#include <linux/memory_hotplug.h>
#define MIN_MEMORY_BLOCK_SIZE (1UL << SECTION_SIZE_BITS)
@@ -77,7 +78,7 @@ enum memory_block_state {
struct memory_block {
unsigned long start_section_nr;
enum memory_block_state state; /* serialized by the dev->lock */
- int online_type; /* for passing data to online routine */
+ enum mmop online_type; /* for passing data to online routine */
int nid; /* NID for this memory block */
/*
* The single zone of this memory block if all PFNs of this memory block
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index f2f16cdd73ee..815e908c4135 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -16,11 +16,8 @@ struct resource;
struct vmem_altmap;
struct dev_pagemap;
-#ifdef CONFIG_MEMORY_HOTPLUG
-struct page *pfn_to_online_page(unsigned long pfn);
-
/* Types for control the zone type of onlined and offlined memory */
-enum {
+enum mmop {
/* Offline the memory. */
MMOP_OFFLINE = 0,
/* Online the memory. Zone depends, see default_zone_for_pfn(). */
@@ -31,6 +28,9 @@ enum {
MMOP_ONLINE_MOVABLE,
};
+#ifdef CONFIG_MEMORY_HOTPLUG
+struct page *pfn_to_online_page(unsigned long pfn);
+
/* Flags for add_memory() and friends to specify memory hotplug details. */
typedef int __bitwise mhp_t;
@@ -286,8 +286,8 @@ static inline void __remove_memory(u64 start, u64 size) {}
#ifdef CONFIG_MEMORY_HOTPLUG
/* Default online_type (MMOP_*) when new memory blocks are added. */
-extern int mhp_get_default_online_type(void);
-extern void mhp_set_default_online_type(int online_type);
+extern enum mmop mhp_get_default_online_type(void);
+extern void mhp_set_default_online_type(enum mmop online_type);
extern void __ref free_area_init_core_hotplug(struct pglist_data *pgdat);
extern int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
extern int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
@@ -308,10 +308,8 @@ extern int sparse_add_section(int nid, unsigned long pfn,
struct dev_pagemap *pgmap);
extern void sparse_remove_section(unsigned long pfn, unsigned long nr_pages,
struct vmem_altmap *altmap);
-extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
- unsigned long pnum);
-extern struct zone *zone_for_pfn_range(int online_type, int nid,
- struct memory_group *group, unsigned long start_pfn,
+extern struct zone *zone_for_pfn_range(enum mmop online_type,
+ int nid, struct memory_group *group, unsigned long start_pfn,
unsigned long nr_pages);
extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
struct mhp_params *params);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index abb4963c1f06..8260e28205e9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -27,7 +27,6 @@
#include <linux/page-flags.h>
#include <linux/page_ref.h>
#include <linux/overflow.h>
-#include <linux/sizes.h>
#include <linux/sched.h>
#include <linux/pgtable.h>
#include <linux/kasan.h>
@@ -208,8 +207,6 @@ static inline void __mm_zero_struct_page(struct page *page)
#define MAPCOUNT_ELF_CORE_MARGIN (5)
#define DEFAULT_MAX_MAP_COUNT (USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)
-extern int sysctl_max_map_count;
-
extern unsigned long sysctl_user_reserve_kbytes;
extern unsigned long sysctl_admin_reserve_kbytes;
@@ -349,9 +346,9 @@ enum {
* if KVM does not lock down the memory type.
*/
DECLARE_VMA_BIT(ALLOW_ANY_UNCACHED, 39),
-#ifdef CONFIG_PPC32
+#if defined(CONFIG_PPC32)
DECLARE_VMA_BIT_ALIAS(DROPPABLE, ARCH_1),
-#else
+#elif defined(CONFIG_64BIT)
DECLARE_VMA_BIT(DROPPABLE, 40),
#endif
DECLARE_VMA_BIT(UFFD_MINOR, 41),
@@ -466,8 +463,10 @@ enum {
#if defined(CONFIG_X86_USER_SHADOW_STACK) || defined(CONFIG_ARM64_GCS) || \
defined(CONFIG_RISCV_USER_CFI)
#define VM_SHADOW_STACK INIT_VM_FLAG(SHADOW_STACK)
+#define VMA_STARTGAP_FLAGS mk_vma_flags(VMA_GROWSDOWN_BIT, VMA_SHADOW_STACK_BIT)
#else
#define VM_SHADOW_STACK VM_NONE
+#define VMA_STARTGAP_FLAGS mk_vma_flags(VMA_GROWSDOWN_BIT)
#endif
#if defined(CONFIG_PPC64)
#define VM_SAO INIT_VM_FLAG(SAO)
@@ -506,32 +505,41 @@ enum {
#endif
#if defined(CONFIG_64BIT) || defined(CONFIG_PPC32)
#define VM_DROPPABLE INIT_VM_FLAG(DROPPABLE)
+#define VMA_DROPPABLE mk_vma_flags(VMA_DROPPABLE_BIT)
#else
#define VM_DROPPABLE VM_NONE
+#define VMA_DROPPABLE EMPTY_VMA_FLAGS
#endif
/* Bits set in the VMA until the stack is in its final location */
#define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ | VM_STACK_EARLY)
-#define TASK_EXEC ((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0)
+#define TASK_EXEC_BIT ((current->personality & READ_IMPLIES_EXEC) ? \
+ VMA_EXEC_BIT : VMA_READ_BIT)
/* Common data flag combinations */
-#define VM_DATA_FLAGS_TSK_EXEC (VM_READ | VM_WRITE | TASK_EXEC | \
- VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
-#define VM_DATA_FLAGS_NON_EXEC (VM_READ | VM_WRITE | VM_MAYREAD | \
- VM_MAYWRITE | VM_MAYEXEC)
-#define VM_DATA_FLAGS_EXEC (VM_READ | VM_WRITE | VM_EXEC | \
- VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
-
-#ifndef VM_DATA_DEFAULT_FLAGS /* arch can override this */
-#define VM_DATA_DEFAULT_FLAGS VM_DATA_FLAGS_EXEC
+#define VMA_DATA_FLAGS_TSK_EXEC mk_vma_flags(VMA_READ_BIT, VMA_WRITE_BIT, \
+ TASK_EXEC_BIT, VMA_MAYREAD_BIT, VMA_MAYWRITE_BIT, \
+ VMA_MAYEXEC_BIT)
+#define VMA_DATA_FLAGS_NON_EXEC mk_vma_flags(VMA_READ_BIT, VMA_WRITE_BIT, \
+ VMA_MAYREAD_BIT, VMA_MAYWRITE_BIT, VMA_MAYEXEC_BIT)
+#define VMA_DATA_FLAGS_EXEC mk_vma_flags(VMA_READ_BIT, VMA_WRITE_BIT, \
+ VMA_EXEC_BIT, VMA_MAYREAD_BIT, VMA_MAYWRITE_BIT, \
+ VMA_MAYEXEC_BIT)
+
+#ifndef VMA_DATA_DEFAULT_FLAGS /* arch can override this */
+#define VMA_DATA_DEFAULT_FLAGS VMA_DATA_FLAGS_EXEC
#endif
-#ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
-#define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
+#ifndef VMA_STACK_DEFAULT_FLAGS /* arch can override this */
+#define VMA_STACK_DEFAULT_FLAGS VMA_DATA_DEFAULT_FLAGS
#endif
-#define VM_STARTGAP_FLAGS (VM_GROWSDOWN | VM_SHADOW_STACK)
+#define VMA_STACK_FLAGS append_vma_flags(VMA_STACK_DEFAULT_FLAGS, \
+ VMA_STACK_BIT, VMA_ACCOUNT_BIT)
+
+/* Temporary until VMA flags conversion complete. */
+#define VM_STACK_FLAGS vma_flags_to_legacy(VMA_STACK_FLAGS)
#ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS
#define VM_SEALED_SYSMAP VM_SEALED
@@ -539,15 +547,17 @@ enum {
#define VM_SEALED_SYSMAP VM_NONE
#endif
-#define VM_STACK_FLAGS (VM_STACK | VM_STACK_DEFAULT_FLAGS | VM_ACCOUNT)
-
/* VMA basic access permission flags */
#define VM_ACCESS_FLAGS (VM_READ | VM_WRITE | VM_EXEC)
+#define VMA_ACCESS_FLAGS mk_vma_flags(VMA_READ_BIT, VMA_WRITE_BIT, VMA_EXEC_BIT)
/*
* Special vmas that are non-mergable, non-mlock()able.
*/
-#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP)
+
+#define VMA_SPECIAL_FLAGS mk_vma_flags(VMA_IO_BIT, VMA_DONTEXPAND_BIT, \
+ VMA_PFNMAP_BIT, VMA_MIXEDMAP_BIT)
+#define VM_SPECIAL vma_flags_to_legacy(VMA_SPECIAL_FLAGS)
/*
* Physically remapped pages are special. Tell the
@@ -574,6 +584,8 @@ enum {
/* This mask represents all the VMA flag bits used by mlock */
#define VM_LOCKED_MASK (VM_LOCKED | VM_LOCKONFAULT)
+#define VMA_LOCKED_MASK mk_vma_flags(VMA_LOCKED_BIT, VMA_LOCKONFAULT_BIT)
+
/* These flags can be updated atomically via VMA/mmap read lock. */
#define VM_ATOMIC_SET_ALLOWED VM_MAYBE_GUARD
@@ -588,27 +600,32 @@ enum {
* possesses it but the other does not, the merged VMA should nonetheless have
* applied to it:
*
- * VM_SOFTDIRTY - if a VMA is marked soft-dirty, that is has not had its
- * references cleared via /proc/$pid/clear_refs, any merged VMA
- * should be considered soft-dirty also as it operates at a VMA
- * granularity.
+ * VMA_SOFTDIRTY_BIT - if a VMA is marked soft-dirty, that is, has not had its
+ * references cleared via /proc/$pid/clear_refs, any
+ * merged VMA should be considered soft-dirty also as it
+ * operates at a VMA granularity.
*
- * VM_MAYBE_GUARD - If a VMA may have guard regions in place it implies that
- * mapped page tables may contain metadata not described by the
- * VMA and thus any merged VMA may also contain this metadata,
- * and thus we must make this flag sticky.
+ * VMA_MAYBE_GUARD_BIT - If a VMA may have guard regions in place it implies
+ * that mapped page tables may contain metadata not
+ * described by the VMA and thus any merged VMA may also
+ * contain this metadata, and thus we must make this flag
+ * sticky.
*/
-#define VM_STICKY (VM_SOFTDIRTY | VM_MAYBE_GUARD)
+#ifdef CONFIG_MEM_SOFT_DIRTY
+#define VMA_STICKY_FLAGS mk_vma_flags(VMA_SOFTDIRTY_BIT, VMA_MAYBE_GUARD_BIT)
+#else
+#define VMA_STICKY_FLAGS mk_vma_flags(VMA_MAYBE_GUARD_BIT)
+#endif
/*
* VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one
* of these flags and the other not does not preclude a merge.
*
- * VM_STICKY - When merging VMAs, VMA flags must match, unless they are
- * 'sticky'. If any sticky flags exist in either VMA, we simply
- * set all of them on the merged VMA.
+ * VMA_STICKY_FLAGS - When merging VMAs, VMA flags must match, unless they
+ * are 'sticky'. If any sticky flags exist in either VMA,
+ * we simply set all of them on the merged VMA.
*/
-#define VM_IGNORE_MERGE VM_STICKY
+#define VMA_IGNORE_MERGE_FLAGS VMA_STICKY_FLAGS
/*
* Flags which should result in page tables being copied on fork. These are
@@ -747,15 +764,37 @@ struct vm_fault {
* to the functions called when a no-page or a wp-page exception occurs.
*/
struct vm_operations_struct {
- void (*open)(struct vm_area_struct * area);
+ /**
+ * @open: Called when a VMA is remapped, split or forked. Not called
+ * upon first mapping a VMA.
+ * Context: User context. May sleep. Caller holds mmap_lock.
+ */
+ void (*open)(struct vm_area_struct *vma);
/**
* @close: Called when the VMA is being removed from the MM.
* Context: User context. May sleep. Caller holds mmap_lock.
*/
- void (*close)(struct vm_area_struct * area);
+ void (*close)(struct vm_area_struct *vma);
+ /**
+ * @mapped: Called when the VMA is first mapped in the MM. Not called if
+ * the new VMA is merged with an adjacent VMA.
+ *
+ * The @vm_private_data field is an output field allowing the user to
+ * modify vma->vm_private_data as necessary.
+ *
+ * ONLY valid if set from f_op->mmap_prepare. Will result in an error if
+ * set from f_op->mmap.
+ *
+ * Returns %0 on success, or an error otherwise. On error, the VMA will
+ * be unmapped.
+ *
+ * Context: User context. May sleep. Caller holds mmap_lock.
+ */
+ int (*mapped)(unsigned long start, unsigned long end, pgoff_t pgoff,
+ const struct file *file, void **vm_private_data);
/* Called any time before splitting to check if it's allowed */
- int (*may_split)(struct vm_area_struct *area, unsigned long addr);
- int (*mremap)(struct vm_area_struct *area);
+ int (*may_split)(struct vm_area_struct *vma, unsigned long addr);
+ int (*mremap)(struct vm_area_struct *vma);
/*
* Called by mprotect() to make driver-specific permission
* checks before mprotect() is finalised. The VMA must not
@@ -767,7 +806,7 @@ struct vm_operations_struct {
vm_fault_t (*huge_fault)(struct vm_fault *vmf, unsigned int order);
vm_fault_t (*map_pages)(struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff);
- unsigned long (*pagesize)(struct vm_area_struct * area);
+ unsigned long (*pagesize)(struct vm_area_struct *vma);
/* notification that a previously read-only page is about to become
* writable, if an error is returned it will cause a SIGBUS */
@@ -937,22 +976,20 @@ static inline void vm_flags_reset(struct vm_area_struct *vma,
vm_flags_init(vma, flags);
}
-static inline void vm_flags_reset_once(struct vm_area_struct *vma,
- vm_flags_t flags)
+static inline void vma_flags_reset_once(struct vm_area_struct *vma,
+ vma_flags_t *flags)
{
- vma_assert_write_locked(vma);
- /*
- * If VMA flags exist beyond the first system word, also clear these. It
- * is assumed the write once behaviour is required only for the first
- * system word.
- */
+ const unsigned long word = flags->__vma_flags[0];
+
+ /* It is assumed only the first system word must be written once. */
+ vma_flags_overwrite_word_once(&vma->flags, word);
+ /* The remainder can be copied normally. */
if (NUM_VMA_FLAG_BITS > BITS_PER_LONG) {
- unsigned long *bitmap = vma->flags.__vma_flags;
+ unsigned long *dst = &vma->flags.__vma_flags[1];
+ const unsigned long *src = &flags->__vma_flags[1];
- bitmap_zero(&bitmap[1], NUM_VMA_FLAG_BITS - BITS_PER_LONG);
+ bitmap_copy(dst, src, NUM_VMA_FLAG_BITS - BITS_PER_LONG);
}
-
- vma_flags_overwrite_word_once(&vma->flags, flags);
}
static inline void vm_flags_set(struct vm_area_struct *vma,
@@ -991,7 +1028,8 @@ static inline void vm_flags_mod(struct vm_area_struct *vma,
__vm_flags_mod(vma, set, clear);
}
-static inline bool __vma_atomic_valid_flag(struct vm_area_struct *vma, vma_flag_t bit)
+static __always_inline bool __vma_atomic_valid_flag(struct vm_area_struct *vma,
+ vma_flag_t bit)
{
const vm_flags_t mask = BIT((__force int)bit);
@@ -1006,7 +1044,8 @@ static inline bool __vma_atomic_valid_flag(struct vm_area_struct *vma, vma_flag_
* Set VMA flag atomically. Requires only VMA/mmap read lock. Only specific
* valid flags are allowed to do this.
*/
-static inline void vma_set_atomic_flag(struct vm_area_struct *vma, vma_flag_t bit)
+static __always_inline void vma_set_atomic_flag(struct vm_area_struct *vma,
+ vma_flag_t bit)
{
unsigned long *bitmap = vma->flags.__vma_flags;
@@ -1022,7 +1061,8 @@ static inline void vma_set_atomic_flag(struct vm_area_struct *vma, vma_flag_t bi
* This is necessarily racey, so callers must ensure that serialisation is
* achieved through some other means, or that races are permissible.
*/
-static inline bool vma_test_atomic_flag(struct vm_area_struct *vma, vma_flag_t bit)
+static __always_inline bool vma_test_atomic_flag(struct vm_area_struct *vma,
+ vma_flag_t bit)
{
if (__vma_atomic_valid_flag(vma, bit))
return test_bit((__force int)bit, &vma->vm_flags);
@@ -1031,21 +1071,21 @@ static inline bool vma_test_atomic_flag(struct vm_area_struct *vma, vma_flag_t b
}
/* Set an individual VMA flag in flags, non-atomically. */
-static inline void vma_flag_set(vma_flags_t *flags, vma_flag_t bit)
+static __always_inline void vma_flags_set_flag(vma_flags_t *flags,
+ vma_flag_t bit)
{
unsigned long *bitmap = flags->__vma_flags;
__set_bit((__force int)bit, bitmap);
}
-static inline vma_flags_t __mk_vma_flags(size_t count, const vma_flag_t *bits)
+static __always_inline vma_flags_t __mk_vma_flags(vma_flags_t flags,
+ size_t count, const vma_flag_t *bits)
{
- vma_flags_t flags;
int i;
- vma_flags_clear_all(&flags);
for (i = 0; i < count; i++)
- vma_flag_set(&flags, bits[i]);
+ vma_flags_set_flag(&flags, bits[i]);
return flags;
}
@@ -1054,16 +1094,73 @@ static inline vma_flags_t __mk_vma_flags(size_t count, const vma_flag_t *bits)
* vma_flags_t bitmap value. E.g.:
*
* vma_flags_t flags = mk_vma_flags(VMA_IO_BIT, VMA_PFNMAP_BIT,
- * VMA_DONTEXPAND_BIT, VMA_DONTDUMP_BIT);
+ * VMA_DONTEXPAND_BIT, VMA_DONTDUMP_BIT);
*
* The compiler cleverly optimises away all of the work and this ends up being
* equivalent to aggregating the values manually.
*/
-#define mk_vma_flags(...) __mk_vma_flags(COUNT_ARGS(__VA_ARGS__), \
- (const vma_flag_t []){__VA_ARGS__})
+#define mk_vma_flags(...) __mk_vma_flags(EMPTY_VMA_FLAGS, \
+ COUNT_ARGS(__VA_ARGS__), (const vma_flag_t []){__VA_ARGS__})
+
+/*
+ * Helper macro which acts like mk_vma_flags, only appending to a copy of the
+ * specified flags rather than establishing new flags. E.g.:
+ *
+ * vma_flags_t flags = append_vma_flags(VMA_STACK_DEFAULT_FLAGS, VMA_STACK_BIT,
+ * VMA_ACCOUNT_BIT);
+ */
+#define append_vma_flags(flags, ...) __mk_vma_flags(flags, \
+ COUNT_ARGS(__VA_ARGS__), (const vma_flag_t []){__VA_ARGS__})
+
+/* Calculates the number of set bits in the specified VMA flags. */
+static __always_inline int vma_flags_count(const vma_flags_t *flags)
+{
+ const unsigned long *bitmap = flags->__vma_flags;
+
+ return bitmap_weight(bitmap, NUM_VMA_FLAG_BITS);
+}
+
+/*
+ * Test whether a specific VMA flag is set, e.g.:
+ *
+ * if (vma_flags_test(flags, VMA_READ_BIT)) { ... }
+ */
+static __always_inline bool vma_flags_test(const vma_flags_t *flags,
+ vma_flag_t bit)
+{
+ const unsigned long *bitmap = flags->__vma_flags;
+
+ return test_bit((__force int)bit, bitmap);
+}
+
+/*
+ * Obtain a set of VMA flags containing the flags common to both flags and
+ * to_and.
+ */
+static __always_inline vma_flags_t vma_flags_and_mask(const vma_flags_t *flags,
+ vma_flags_t to_and)
+{
+ vma_flags_t dst;
+ unsigned long *bitmap_dst = dst.__vma_flags;
+ const unsigned long *bitmap = flags->__vma_flags;
+ const unsigned long *bitmap_to_and = to_and.__vma_flags;
+
+ bitmap_and(bitmap_dst, bitmap, bitmap_to_and, NUM_VMA_FLAG_BITS);
+ return dst;
+}
+
+/*
+ * Obtain a set of VMA flags which contains the specified overlapping flags,
+ * e.g.:
+ *
+ * vma_flags_t read_flags = vma_flags_and(&flags, VMA_READ_BIT,
+ * VMA_MAYREAD_BIT);
+ */
+#define vma_flags_and(flags, ...) \
+ vma_flags_and_mask(flags, mk_vma_flags(__VA_ARGS__))
/* Test each of to_test flags in flags, non-atomically. */
-static __always_inline bool vma_flags_test_mask(const vma_flags_t *flags,
+static __always_inline bool vma_flags_test_any_mask(const vma_flags_t *flags,
vma_flags_t to_test)
{
const unsigned long *bitmap = flags->__vma_flags;
@@ -1075,10 +1172,10 @@ static __always_inline bool vma_flags_test_mask(const vma_flags_t *flags,
/*
* Test whether any specified VMA flag is set, e.g.:
*
- * if (vma_flags_test(flags, VMA_READ_BIT, VMA_MAYREAD_BIT)) { ... }
+ * if (vma_flags_test_any(flags, VMA_READ_BIT, VMA_MAYREAD_BIT)) { ... }
*/
-#define vma_flags_test(flags, ...) \
- vma_flags_test_mask(flags, mk_vma_flags(__VA_ARGS__))
+#define vma_flags_test_any(flags, ...) \
+ vma_flags_test_any_mask(flags, mk_vma_flags(__VA_ARGS__))
/* Test that ALL of the to_test flags are set, non-atomically. */
static __always_inline bool vma_flags_test_all_mask(const vma_flags_t *flags,
@@ -1098,8 +1195,29 @@ static __always_inline bool vma_flags_test_all_mask(const vma_flags_t *flags,
#define vma_flags_test_all(flags, ...) \
vma_flags_test_all_mask(flags, mk_vma_flags(__VA_ARGS__))
+/*
+ * Helper to test that a flag mask of type vma_flags_t has a SINGLE flag set
+ * (returning false if flagmask has no flags set).
+ *
+ * This is defined to make the semantics clearer when testing an optionally
+ * defined VMA flags mask, e.g.:
+ *
+ * if (vma_flags_test_single_mask(&flags, VMA_DROPPABLE)) { ... }
+ *
+ * where VMA_DROPPABLE is defined when the feature is available, or set to
+ * EMPTY_VMA_FLAGS otherwise.
+ */
+static __always_inline bool vma_flags_test_single_mask(const vma_flags_t *flags,
+ vma_flags_t flagmask)
+{
+ VM_WARN_ON_ONCE(vma_flags_count(&flagmask) > 1);
+
+ return vma_flags_test_any_mask(flags, flagmask);
+}
+
/* Set each of the to_set flags in flags, non-atomically. */
-static __always_inline void vma_flags_set_mask(vma_flags_t *flags, vma_flags_t to_set)
+static __always_inline void vma_flags_set_mask(vma_flags_t *flags,
+ vma_flags_t to_set)
{
unsigned long *bitmap = flags->__vma_flags;
const unsigned long *bitmap_to_set = to_set.__vma_flags;
@@ -1116,7 +1234,8 @@ static __always_inline void vma_flags_set_mask(vma_flags_t *flags, vma_flags_t t
vma_flags_set_mask(flags, mk_vma_flags(__VA_ARGS__))
/* Clear all of the to-clear flags in flags, non-atomically. */
-static __always_inline void vma_flags_clear_mask(vma_flags_t *flags, vma_flags_t to_clear)
+static __always_inline void vma_flags_clear_mask(vma_flags_t *flags,
+ vma_flags_t to_clear)
{
unsigned long *bitmap = flags->__vma_flags;
const unsigned long *bitmap_to_clear = to_clear.__vma_flags;
@@ -1133,13 +1252,85 @@ static __always_inline void vma_flags_clear_mask(vma_flags_t *flags, vma_flags_t
vma_flags_clear_mask(flags, mk_vma_flags(__VA_ARGS__))
/*
+ * Obtain a VMA flags value containing those flags that are present in flags or
+ * flags_other but not in both.
+ */
+static __always_inline vma_flags_t vma_flags_diff_pair(const vma_flags_t *flags,
+ const vma_flags_t *flags_other)
+{
+ vma_flags_t dst;
+ const unsigned long *bitmap_other = flags_other->__vma_flags;
+ const unsigned long *bitmap = flags->__vma_flags;
+ unsigned long *bitmap_dst = dst.__vma_flags;
+
+ bitmap_xor(bitmap_dst, bitmap, bitmap_other, NUM_VMA_FLAG_BITS);
+ return dst;
+}
+
+/* Determine if flags and flags_other have precisely the same flags set. */
+static __always_inline bool vma_flags_same_pair(const vma_flags_t *flags,
+ const vma_flags_t *flags_other)
+{
+ const unsigned long *bitmap = flags->__vma_flags;
+ const unsigned long *bitmap_other = flags_other->__vma_flags;
+
+ return bitmap_equal(bitmap, bitmap_other, NUM_VMA_FLAG_BITS);
+}
+
+/* Determine if flags and flags_other have precisely the same flags set. */
+static __always_inline bool vma_flags_same_mask(const vma_flags_t *flags,
+ vma_flags_t flags_other)
+{
+ const unsigned long *bitmap = flags->__vma_flags;
+ const unsigned long *bitmap_other = flags_other.__vma_flags;
+
+ return bitmap_equal(bitmap, bitmap_other, NUM_VMA_FLAG_BITS);
+}
+
+/*
+ * Helper macro to determine if only the specified flags are set, e.g.:
+ *
+ * if (vma_flags_same(&flags, VMA_WRITE_BIT)) { ... }
+ */
+#define vma_flags_same(flags, ...) \
+ vma_flags_same_mask(flags, mk_vma_flags(__VA_ARGS__))
+
+/*
+ * Test whether a specific flag in the VMA is set, e.g.:
+ *
+ * if (vma_test(vma, VMA_READ_BIT)) { ... }
+ */
+static __always_inline bool vma_test(const struct vm_area_struct *vma,
+ vma_flag_t bit)
+{
+ return vma_flags_test(&vma->flags, bit);
+}
+
+/* Helper to test any VMA flags in a VMA. */
+static __always_inline bool vma_test_any_mask(const struct vm_area_struct *vma,
+ vma_flags_t flags)
+{
+ return vma_flags_test_any_mask(&vma->flags, flags);
+}
+
+/*
+ * Helper macro for testing whether any VMA flags are set in a VMA,
+ * e.g.:
+ *
+ * if (vma_test_any(vma, VMA_IO_BIT, VMA_PFNMAP_BIT,
+ * VMA_DONTEXPAND_BIT, VMA_DONTDUMP_BIT)) { ... }
+ */
+#define vma_test_any(vma, ...) \
+ vma_test_any_mask(vma, mk_vma_flags(__VA_ARGS__))
+
+/*
* Helper to test that ALL specified flags are set in a VMA.
*
* Note: appropriate locks must be held, this function does not acquire them for
* you.
*/
-static inline bool vma_test_all_flags_mask(const struct vm_area_struct *vma,
- vma_flags_t flags)
+static __always_inline bool vma_test_all_mask(const struct vm_area_struct *vma,
+ vma_flags_t flags)
{
return vma_flags_test_all_mask(&vma->flags, flags);
}
@@ -1147,10 +1338,28 @@ static inline bool vma_test_all_flags_mask(const struct vm_area_struct *vma,
/*
* Helper macro for checking that ALL specified flags are set in a VMA, e.g.:
*
- * if (vma_test_all_flags(vma, VMA_READ_BIT, VMA_MAYREAD_BIT) { ... }
+ * if (vma_test_all(vma, VMA_READ_BIT, VMA_MAYREAD_BIT)) { ... }
+ */
+#define vma_test_all(vma, ...) \
+ vma_test_all_mask(vma, mk_vma_flags(__VA_ARGS__))
+
+/*
+ * Helper to test that a flag mask of type vma_flags_t has a SINGLE flag set
+ * (returning false if flagmask has no flags set).
+ *
+ * This is useful when a flag needs to be either defined or not depending upon
+ * kernel configuration, e.g.:
+ *
+ * if (vma_test_single_mask(vma, VMA_DROPPABLE)) { ... }
+ *
+ * where VMA_DROPPABLE is defined when the feature is available, or set to
+ * EMPTY_VMA_FLAGS otherwise.
*/
-#define vma_test_all_flags(vma, ...) \
- vma_test_all_flags_mask(vma, mk_vma_flags(__VA_ARGS__))
+static __always_inline bool
+vma_test_single_mask(const struct vm_area_struct *vma, vma_flags_t flagmask)
+{
+ return vma_flags_test_single_mask(&vma->flags, flagmask);
+}
/*
* Helper to set all VMA flags in a VMA.
@@ -1158,8 +1367,8 @@ static inline bool vma_test_all_flags_mask(const struct vm_area_struct *vma,
* Note: appropriate locks must be held, this function does not acquire them for
* you.
*/
-static inline void vma_set_flags_mask(struct vm_area_struct *vma,
- vma_flags_t flags)
+static __always_inline void vma_set_flags_mask(struct vm_area_struct *vma,
+ vma_flags_t flags)
{
vma_flags_set_mask(&vma->flags, flags);
}
@@ -1176,26 +1385,69 @@ static inline void vma_set_flags_mask(struct vm_area_struct *vma,
#define vma_set_flags(vma, ...) \
vma_set_flags_mask(vma, mk_vma_flags(__VA_ARGS__))
-/* Helper to test all VMA flags in a VMA descriptor. */
-static inline bool vma_desc_test_flags_mask(const struct vm_area_desc *desc,
- vma_flags_t flags)
+/* Helper to clear all VMA flags in a VMA. */
+static __always_inline void vma_clear_flags_mask(struct vm_area_struct *vma,
+ vma_flags_t flags)
{
- return vma_flags_test_mask(&desc->vma_flags, flags);
+ vma_flags_clear_mask(&vma->flags, flags);
}
/*
- * Helper macro for testing VMA flags for an input pointer to a struct
- * vm_area_desc object describing a proposed VMA, e.g.:
+ * Helper macro for clearing VMA flags, e.g.:
*
- * if (vma_desc_test_flags(desc, VMA_IO_BIT, VMA_PFNMAP_BIT,
+ * vma_clear_flags(vma, VMA_IO_BIT, VMA_PFNMAP_BIT, VMA_DONTEXPAND_BIT,
+ * VMA_DONTDUMP_BIT);
+ */
+#define vma_clear_flags(vma, ...) \
+ vma_clear_flags_mask(vma, mk_vma_flags(__VA_ARGS__))
+
+/*
+ * Test whether a specific VMA flag is set in a VMA descriptor, e.g.:
+ *
+ * if (vma_desc_test(desc, VMA_READ_BIT)) { ... }
+ */
+static __always_inline bool vma_desc_test(const struct vm_area_desc *desc,
+ vma_flag_t bit)
+{
+ return vma_flags_test(&desc->vma_flags, bit);
+}
+
+/* Helper to test any VMA flags in a VMA descriptor. */
+static __always_inline bool vma_desc_test_any_mask(const struct vm_area_desc *desc,
+ vma_flags_t flags)
+{
+ return vma_flags_test_any_mask(&desc->vma_flags, flags);
+}
+
+/*
+ * Helper macro for testing whether any VMA flags are set in a VMA descriptor,
+ * e.g.:
+ *
+ * if (vma_desc_test_any(desc, VMA_IO_BIT, VMA_PFNMAP_BIT,
* VMA_DONTEXPAND_BIT, VMA_DONTDUMP_BIT)) { ... }
*/
-#define vma_desc_test_flags(desc, ...) \
- vma_desc_test_flags_mask(desc, mk_vma_flags(__VA_ARGS__))
+#define vma_desc_test_any(desc, ...) \
+ vma_desc_test_any_mask(desc, mk_vma_flags(__VA_ARGS__))
+
+/* Helper to test all VMA flags in a VMA descriptor. */
+static __always_inline bool vma_desc_test_all_mask(const struct vm_area_desc *desc,
+ vma_flags_t flags)
+{
+ return vma_flags_test_all_mask(&desc->vma_flags, flags);
+}
+
+/*
+ * Helper macro for testing whether ALL VMA flags are set in a VMA descriptor,
+ * e.g.:
+ *
+ * if (vma_desc_test_all(desc, VMA_READ_BIT, VMA_MAYREAD_BIT)) { ... }
+ */
+#define vma_desc_test_all(desc, ...) \
+ vma_desc_test_all_mask(desc, mk_vma_flags(__VA_ARGS__))
/* Helper to set all VMA flags in a VMA descriptor. */
-static inline void vma_desc_set_flags_mask(struct vm_area_desc *desc,
- vma_flags_t flags)
+static __always_inline void vma_desc_set_flags_mask(struct vm_area_desc *desc,
+ vma_flags_t flags)
{
vma_flags_set_mask(&desc->vma_flags, flags);
}
@@ -1211,8 +1463,8 @@ static inline void vma_desc_set_flags_mask(struct vm_area_desc *desc,
vma_desc_set_flags_mask(desc, mk_vma_flags(__VA_ARGS__))
/* Helper to clear all VMA flags in a VMA descriptor. */
-static inline void vma_desc_clear_flags_mask(struct vm_area_desc *desc,
- vma_flags_t flags)
+static __always_inline void vma_desc_clear_flags_mask(struct vm_area_desc *desc,
+ vma_flags_t flags)
{
vma_flags_clear_mask(&desc->vma_flags, flags);
}
@@ -1292,12 +1544,6 @@ static inline bool vma_is_accessible(const struct vm_area_struct *vma)
return vma->vm_flags & VM_ACCESS_FLAGS;
}
-static inline bool is_shared_maywrite_vm_flags(vm_flags_t vm_flags)
-{
- return (vm_flags & (VM_SHARED | VM_MAYWRITE)) ==
- (VM_SHARED | VM_MAYWRITE);
-}
-
static inline bool is_shared_maywrite(const vma_flags_t *flags)
{
return vma_flags_test_all(flags, VMA_SHARED_BIT, VMA_MAYWRITE_BIT);
@@ -1308,6 +1554,28 @@ static inline bool vma_is_shared_maywrite(const struct vm_area_struct *vma)
return is_shared_maywrite(&vma->flags);
}
+/**
+ * vma_kernel_pagesize - Default page size granularity for this VMA.
+ * @vma: The user mapping.
+ *
+ * The kernel page size specifies the granularity at which VMA modifications
+ * can be performed. Folios in this VMA will be aligned to, and at least as
+ * large as, the number of bytes returned by this function.
+ *
+ * The default kernel page size is not affected by Transparent Huge Pages
+ * being in effect.
+ *
+ * Return: The default page size granularity for this VMA.
+ */
+static inline unsigned long vma_kernel_pagesize(struct vm_area_struct *vma)
+{
+ if (unlikely(vma->vm_ops && vma->vm_ops->pagesize))
+ return vma->vm_ops->pagesize(vma);
+ return PAGE_SIZE;
+}
+
+unsigned long vma_mmu_pagesize(struct vm_area_struct *vma);
+
static inline
struct vm_area_struct *vma_find(struct vma_iterator *vmi, unsigned long max)
{
@@ -1507,7 +1775,7 @@ static inline int folio_put_testzero(struct folio *folio)
*/
static inline bool get_page_unless_zero(struct page *page)
{
- return page_ref_add_unless(page, 1, 0);
+ return page_ref_add_unless_zero(page, 1);
}
static inline struct folio *folio_get_nontail_page(struct page *page)
@@ -1957,7 +2225,7 @@ static inline bool is_nommu_shared_mapping(vm_flags_t flags)
static inline bool is_nommu_shared_vma_flags(const vma_flags_t *flags)
{
- return vma_flags_test(flags, VMA_MAYSHARE_BIT, VMA_MAYOVERLAY_BIT);
+ return vma_flags_test_any(flags, VMA_MAYSHARE_BIT, VMA_MAYOVERLAY_BIT);
}
#endif
@@ -2479,36 +2747,6 @@ static inline unsigned long folio_nr_pages(const struct folio *folio)
return folio_large_nr_pages(folio);
}
-#if !defined(CONFIG_HAVE_GIGANTIC_FOLIOS)
-/*
- * We don't expect any folios that exceed buddy sizes (and consequently
- * memory sections).
- */
-#define MAX_FOLIO_ORDER MAX_PAGE_ORDER
-#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
-/*
- * Only pages within a single memory section are guaranteed to be
- * contiguous. By limiting folios to a single memory section, all folio
- * pages are guaranteed to be contiguous.
- */
-#define MAX_FOLIO_ORDER PFN_SECTION_SHIFT
-#elif defined(CONFIG_HUGETLB_PAGE)
-/*
- * There is no real limit on the folio size. We limit them to the maximum we
- * currently expect (see CONFIG_HAVE_GIGANTIC_FOLIOS): with hugetlb, we expect
- * no folios larger than 16 GiB on 64bit and 1 GiB on 32bit.
- */
-#define MAX_FOLIO_ORDER get_order(IS_ENABLED(CONFIG_64BIT) ? SZ_16G : SZ_1G)
-#else
-/*
- * Without hugetlb, gigantic folios that are bigger than a single PUD are
- * currently impossible.
- */
-#define MAX_FOLIO_ORDER PUD_ORDER
-#endif
-
-#define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER)
-
/*
* compound_nr() returns the number of pages in this potentially compound
* page. compound_nr() can be called on a tail page, and is defined to
@@ -2667,7 +2905,7 @@ static inline bool folio_maybe_mapped_shared(struct folio *folio)
* The caller must add any reference (e.g., from folio_try_get()) it might be
* holding itself to the result.
*
- * Returns the expected folio refcount.
+ * Returns: the expected folio refcount.
*/
static inline int folio_expected_ref_count(const struct folio *folio)
{
@@ -2798,8 +3036,9 @@ extern void pagefault_out_of_memory(void);
*/
struct zap_details {
struct folio *single_folio; /* Locked folio to be unmapped */
- bool even_cows; /* Zap COWed private pages too? */
+ bool skip_cows; /* Do not zap COWed private pages */
bool reclaim_pt; /* Need reclaim page tables? */
+ bool reaping; /* Reaping, do not block. */
zap_flags_t zap_flags; /* Extra flags for zapping */
};
@@ -2832,14 +3071,17 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
struct page *vm_normal_page_pud(struct vm_area_struct *vma, unsigned long addr,
pud_t pud);
-void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
+void zap_special_vma_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size);
-void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
- unsigned long size, struct zap_details *details);
-static inline void zap_vma_pages(struct vm_area_struct *vma)
+void zap_vma_range(struct vm_area_struct *vma, unsigned long address,
+ unsigned long size);
+/**
+ * zap_vma - zap all page table entries in a vma
+ * @vma: The vma to zap.
+ */
+static inline void zap_vma(struct vm_area_struct *vma)
{
- zap_page_range_single(vma, vma->vm_start,
- vma->vm_end - vma->vm_start, NULL);
+ zap_vma_range(vma, vma->vm_start, vma->vm_end - vma->vm_start);
}
struct mmu_notifier_range;
@@ -3847,7 +4089,6 @@ extern int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file);
extern struct file *get_mm_exe_file(struct mm_struct *mm);
extern struct file *get_task_exe_file(struct task_struct *task);
-extern bool may_expand_vm(struct mm_struct *, vm_flags_t, unsigned long npages);
extern void vm_stat_account(struct mm_struct *, vm_flags_t, long npages);
extern bool vma_is_special_mapping(const struct vm_area_struct *vma,
@@ -3898,11 +4139,13 @@ static inline void mm_populate(unsigned long addr, unsigned long len) {}
#endif
/* This takes the mm semaphore itself */
-extern int __must_check vm_brk_flags(unsigned long, unsigned long, unsigned long);
-extern int vm_munmap(unsigned long, size_t);
-extern unsigned long __must_check vm_mmap(struct file *, unsigned long,
- unsigned long, unsigned long,
- unsigned long, unsigned long);
+int __must_check vm_brk_flags(unsigned long addr, unsigned long request, bool is_exec);
+int vm_munmap(unsigned long start, size_t len);
+unsigned long __must_check vm_mmap(struct file *file, unsigned long addr,
+ unsigned long len, unsigned long prot,
+ unsigned long flag, unsigned long offset);
+unsigned long __must_check vm_mmap_shadow_stack(unsigned long addr,
+ unsigned long len, unsigned long flags);
struct vm_unmapped_area_info {
#define VM_UNMAPPED_AREA_TOPDOWN 1
@@ -3999,6 +4242,11 @@ static inline unsigned long vma_pages(const struct vm_area_struct *vma)
return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
}
+static inline unsigned long vma_last_pgoff(struct vm_area_struct *vma)
+{
+ return vma->vm_pgoff + vma_pages(vma) - 1;
+}
+
static inline unsigned long vma_desc_size(const struct vm_area_desc *desc)
{
return desc->end - desc->start;
@@ -4073,15 +4321,75 @@ static inline void mmap_action_ioremap(struct vm_area_desc *desc,
* @start_pfn: The first PFN in the range to remap.
*/
static inline void mmap_action_ioremap_full(struct vm_area_desc *desc,
- unsigned long start_pfn)
+ unsigned long start_pfn)
{
mmap_action_ioremap(desc, desc->start, start_pfn, vma_desc_size(desc));
}
-void mmap_action_prepare(struct mmap_action *action,
- struct vm_area_desc *desc);
-int mmap_action_complete(struct mmap_action *action,
- struct vm_area_struct *vma);
+/**
+ * mmap_action_simple_ioremap - helper for mmap_prepare hook to specify that the
+ * physical range in [start_phys_addr, start_phys_addr + size) should be I/O
+ * remapped.
+ * @desc: The VMA descriptor for the VMA requiring remap.
+ * @start_phys_addr: Start of the physical memory to be mapped.
+ * @size: Size of the area to map.
+ *
+ * NOTE: Some drivers might want to tweak desc->page_prot for purposes of
+ * write-combine or similar.
+ */
+static inline void mmap_action_simple_ioremap(struct vm_area_desc *desc,
+ phys_addr_t start_phys_addr,
+ unsigned long size)
+{
+ struct mmap_action *action = &desc->action;
+
+ action->simple_ioremap.start_phys_addr = start_phys_addr;
+ action->simple_ioremap.size = size;
+ action->type = MMAP_SIMPLE_IO_REMAP;
+}
+
+/**
+ * mmap_action_map_kernel_pages - helper for mmap_prepare hook to specify that
+ * @nr_pages kernel pages contained in the @pages array should be mapped to
+ * userland starting at virtual address @start.
+ * @desc: The VMA descriptor for the VMA requiring kernel pages to be mapped.
+ * @start: The virtual address from which to map them.
+ * @pages: An array of struct page pointers describing the memory to map.
+ * @nr_pages: The number of entries in the @pages array.
+ */
+static inline void mmap_action_map_kernel_pages(struct vm_area_desc *desc,
+ unsigned long start, struct page **pages,
+ unsigned long nr_pages)
+{
+ struct mmap_action *action = &desc->action;
+
+ action->type = MMAP_MAP_KERNEL_PAGES;
+ action->map_kernel.start = start;
+ action->map_kernel.pages = pages;
+ action->map_kernel.nr_pages = nr_pages;
+ action->map_kernel.pgoff = desc->pgoff;
+}
+
+/**
+ * mmap_action_map_kernel_pages_full - helper for mmap_prepare hook to specify that
+ * kernel pages contained in the @pages array should be mapped to userland
+ * from @desc->start to @desc->end.
+ * @desc: The VMA descriptor for the VMA requiring kernel pages to be mapped.
+ * @pages: An array of struct page pointers describing the memory to map.
+ *
+ * The caller must ensure that @pages contains sufficient entries to cover the
+ * entire range described by @desc.
+ */
+static inline void mmap_action_map_kernel_pages_full(struct vm_area_desc *desc,
+ struct page **pages)
+{
+ mmap_action_map_kernel_pages(desc, desc->start, pages,
+ vma_desc_pages(desc));
+}
+
+int mmap_action_prepare(struct vm_area_desc *desc);
+int mmap_action_complete(struct vm_area_struct *vma,
+ struct mmap_action *action);
/* Look up the first VMA which exactly match the interval vm_start ... vm_end */
static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
@@ -4095,20 +4403,81 @@ static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
return vma;
}
+/**
+ * range_is_subset - Is the specified inner range a subset of the outer range?
+ * @outer_start: The start of the outer range.
+ * @outer_end: The exclusive end of the outer range.
+ * @inner_start: The start of the inner range.
+ * @inner_end: The exclusive end of the inner range.
+ *
+ * Returns: %true if [inner_start, inner_end) is a subset of [outer_start,
+ * outer_end), otherwise %false.
+ */
+static inline bool range_is_subset(unsigned long outer_start,
+ unsigned long outer_end,
+ unsigned long inner_start,
+ unsigned long inner_end)
+{
+ return outer_start <= inner_start && inner_end <= outer_end;
+}
+
+/**
+ * range_in_vma - is the specified [@start, @end) range a subset of the VMA?
+ * @vma: The VMA against which we want to check [@start, @end).
+ * @start: The start of the range we wish to check.
+ * @end: The exclusive end of the range we wish to check.
+ *
+ * Returns: %true if [@start, @end) is a subset of [@vma->vm_start,
+ * @vma->vm_end), %false otherwise.
+ */
static inline bool range_in_vma(const struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
- return (vma && vma->vm_start <= start && end <= vma->vm_end);
+ if (!vma)
+ return false;
+
+ return range_is_subset(vma->vm_start, vma->vm_end, start, end);
+}
+
+/**
+ * range_in_vma_desc - is the specified [@start, @end) range a subset of the VMA
+ * described by @desc, a VMA descriptor?
+ * @desc: The VMA descriptor against which we want to check [@start, @end).
+ * @start: The start of the range we wish to check.
+ * @end: The exclusive end of the range we wish to check.
+ *
+ * Returns: %true if [@start, @end) is a subset of [@desc->start, @desc->end),
+ * %false otherwise.
+ */
+static inline bool range_in_vma_desc(const struct vm_area_desc *desc,
+ unsigned long start, unsigned long end)
+{
+ if (!desc)
+ return false;
+
+ return range_is_subset(desc->start, desc->end, start, end);
}
#ifdef CONFIG_MMU
pgprot_t vm_get_page_prot(vm_flags_t vm_flags);
+
+static inline pgprot_t vma_get_page_prot(vma_flags_t vma_flags)
+{
+ const vm_flags_t vm_flags = vma_flags_to_legacy(vma_flags);
+
+ return vm_get_page_prot(vm_flags);
+}
+
void vma_set_page_prot(struct vm_area_struct *vma);
#else
static inline pgprot_t vm_get_page_prot(vm_flags_t vm_flags)
{
return __pgprot(0);
}
+static inline pgprot_t vma_get_page_prot(vma_flags_t vma_flags)
+{
+ return __pgprot(0);
+}
static inline void vma_set_page_prot(struct vm_area_struct *vma)
{
vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
@@ -4130,6 +4499,9 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
struct page **pages, unsigned long *num);
+int map_kernel_pages_prepare(struct vm_area_desc *desc);
+int map_kernel_pages_complete(struct vm_area_struct *vma,
+ struct mmap_action *action);
int vm_map_pages(struct vm_area_struct *vma, struct page **pages,
unsigned long num);
int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
@@ -4508,10 +4880,9 @@ int vmemmap_populate_hugepages(unsigned long start, unsigned long end,
int node, struct vmem_altmap *altmap);
int vmemmap_populate(unsigned long start, unsigned long end, int node,
struct vmem_altmap *altmap);
-int vmemmap_populate_hvo(unsigned long start, unsigned long end, int node,
+int vmemmap_populate_hvo(unsigned long start, unsigned long end,
+ unsigned int order, struct zone *zone,
unsigned long headsize);
-int vmemmap_undo_hvo(unsigned long start, unsigned long end, int node,
- unsigned long headsize);
void vmemmap_wrprotect_hvo(unsigned long start, unsigned long end, int node,
unsigned long headsize);
void vmemmap_populate_print_last(void);
@@ -4697,22 +5068,6 @@ long copy_folio_from_user(struct folio *dst_folio,
const void __user *usr_src,
bool allow_pagefault);
-/**
- * vma_is_special_huge - Are transhuge page-table entries considered special?
- * @vma: Pointer to the struct vm_area_struct to consider
- *
- * Whether transhuge page-table entries are considered "special" following
- * the definition in vm_normal_page().
- *
- * Return: true if transhuge page-table entries should be considered special,
- * false otherwise.
- */
-static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
-{
- return vma_is_dax(vma) || (vma->vm_file &&
- (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)));
-}
-
#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
#if MAX_NUMNODES > 1
@@ -4817,10 +5172,9 @@ int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
* DMA mapping IDs for page_pool
*
* When DMA-mapping a page, page_pool allocates an ID (from an xarray) and
- * stashes it in the upper bits of page->pp_magic. We always want to be able to
- * unambiguously identify page pool pages (using page_pool_page_is_pp()). Non-PP
- * pages can have arbitrary kernel pointers stored in the same field as pp_magic
- * (since it overlaps with page->lru.next), so we must ensure that we cannot
+ * stashes it in the upper bits of page->pp_magic. Non-PP pages can have
+ * arbitrary kernel pointers stored in the same field as pp_magic (since
+ * it overlaps with page->lru.next), so we must ensure that we cannot
* mistake a valid kernel pointer with any of the values we write into this
* field.
*
@@ -4855,26 +5209,6 @@ int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
#define PP_DMA_INDEX_MASK GENMASK(PP_DMA_INDEX_BITS + PP_DMA_INDEX_SHIFT - 1, \
PP_DMA_INDEX_SHIFT)
-/* Mask used for checking in page_pool_page_is_pp() below. page->pp_magic is
- * OR'ed with PP_SIGNATURE after the allocation in order to preserve bit 0 for
- * the head page of compound page and bit 1 for pfmemalloc page, as well as the
- * bits used for the DMA index. page_is_pfmemalloc() is checked in
- * __page_pool_put_page() to avoid recycling the pfmemalloc page.
- */
-#define PP_MAGIC_MASK ~(PP_DMA_INDEX_MASK | 0x3UL)
-
-#ifdef CONFIG_PAGE_POOL
-static inline bool page_pool_page_is_pp(const struct page *page)
-{
- return (page->pp_magic & PP_MAGIC_MASK) == PP_SIGNATURE;
-}
-#else
-static inline bool page_pool_page_is_pp(const struct page *page)
-{
- return false;
-}
-#endif
-
#define PAGE_SNAPSHOT_FAITHFUL (1 << 0)
#define PAGE_SNAPSHOT_PG_BUDDY (1 << 1)
#define PAGE_SNAPSHOT_PG_IDLE (1 << 2)
@@ -4894,4 +5228,8 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps)
void snapshot_page(struct page_snapshot *ps, const struct page *page);
+void map_anon_folio_pte_nopf(struct folio *folio, pte_t *pte,
+ struct vm_area_struct *vma, unsigned long addr,
+ bool uffd_wp);
+
#endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index fa2d6ba811b5..7fc2ced00f8f 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -30,11 +30,6 @@ static inline int folio_is_file_lru(const struct folio *folio)
return !folio_test_swapbacked(folio);
}
-static inline int page_is_file_lru(struct page *page)
-{
- return folio_is_file_lru(page_folio(page));
-}
-
static __always_inline void __update_lru_size(struct lruvec *lruvec,
enum lru_list lru, enum zone_type zid,
long nr_pages)
@@ -102,6 +97,12 @@ static __always_inline enum lru_list folio_lru_list(const struct folio *folio)
#ifdef CONFIG_LRU_GEN
+static inline bool lru_gen_switching(void)
+{
+ DECLARE_STATIC_KEY_FALSE(lru_switch);
+
+ return static_branch_unlikely(&lru_switch);
+}
#ifdef CONFIG_LRU_GEN_ENABLED
static inline bool lru_gen_enabled(void)
{
@@ -316,6 +317,11 @@ static inline bool lru_gen_enabled(void)
return false;
}
+static inline bool lru_gen_switching(void)
+{
+ return false;
+}
+
static inline bool lru_gen_in_fault(void)
{
return false;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3cc8ae722886..a308e2c23b82 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -126,14 +126,14 @@ struct page {
atomic_long_t pp_ref_count;
};
struct { /* Tail pages of compound page */
- unsigned long compound_head; /* Bit zero is set */
+ unsigned long compound_info; /* Bit zero is set */
};
struct { /* ZONE_DEVICE pages */
/*
- * The first word is used for compound_head or folio
+ * The first word is used for compound_info or folio
* pgmap
*/
- void *_unused_pgmap_compound_head;
+ void *_unused_pgmap_compound_info;
void *zone_device_data;
/*
* ZONE_DEVICE private pages are counted as being
@@ -409,7 +409,7 @@ struct folio {
/* private: avoid cluttering the output */
/* For the Unevictable "LRU list" slot */
struct {
- /* Avoid compound_head */
+ /* Avoid compound_info */
void *__filler;
/* public: */
unsigned int mlock_count;
@@ -510,7 +510,7 @@ struct folio {
FOLIO_MATCH(flags, flags);
FOLIO_MATCH(lru, lru);
FOLIO_MATCH(mapping, mapping);
-FOLIO_MATCH(compound_head, lru);
+FOLIO_MATCH(compound_info, lru);
FOLIO_MATCH(__folio_index, index);
FOLIO_MATCH(private, private);
FOLIO_MATCH(_mapcount, _mapcount);
@@ -529,7 +529,7 @@ FOLIO_MATCH(_last_cpupid, _last_cpupid);
static_assert(offsetof(struct folio, fl) == \
offsetof(struct page, pg) + sizeof(struct page))
FOLIO_MATCH(flags, _flags_1);
-FOLIO_MATCH(compound_head, _head_1);
+FOLIO_MATCH(compound_info, _head_1);
FOLIO_MATCH(_mapcount, _mapcount_1);
FOLIO_MATCH(_refcount, _refcount_1);
#undef FOLIO_MATCH
@@ -537,13 +537,13 @@ FOLIO_MATCH(_refcount, _refcount_1);
static_assert(offsetof(struct folio, fl) == \
offsetof(struct page, pg) + 2 * sizeof(struct page))
FOLIO_MATCH(flags, _flags_2);
-FOLIO_MATCH(compound_head, _head_2);
+FOLIO_MATCH(compound_info, _head_2);
#undef FOLIO_MATCH
#define FOLIO_MATCH(pg, fl) \
static_assert(offsetof(struct folio, fl) == \
offsetof(struct page, pg) + 3 * sizeof(struct page))
FOLIO_MATCH(flags, _flags_3);
-FOLIO_MATCH(compound_head, _head_3);
+FOLIO_MATCH(compound_info, _head_3);
#undef FOLIO_MATCH
/**
@@ -609,8 +609,8 @@ struct ptdesc {
#define TABLE_MATCH(pg, pt) \
static_assert(offsetof(struct page, pg) == offsetof(struct ptdesc, pt))
TABLE_MATCH(flags, pt_flags);
-TABLE_MATCH(compound_head, pt_list);
-TABLE_MATCH(compound_head, _pt_pad_1);
+TABLE_MATCH(compound_info, pt_list);
+TABLE_MATCH(compound_info, _pt_pad_1);
TABLE_MATCH(mapping, __page_mapping);
TABLE_MATCH(__folio_index, pt_index);
TABLE_MATCH(rcu_head, pt_rcu_head);
@@ -814,6 +814,8 @@ enum mmap_action_type {
MMAP_NOTHING, /* Mapping is complete, no further action. */
MMAP_REMAP_PFN, /* Remap PFN range. */
MMAP_IO_REMAP_PFN, /* I/O remap PFN range. */
+ MMAP_SIMPLE_IO_REMAP, /* I/O remap with guardrails. */
+ MMAP_MAP_KERNEL_PAGES, /* Map kernel page range from array. */
};
/*
@@ -822,13 +824,22 @@ enum mmap_action_type {
*/
struct mmap_action {
union {
- /* Remap range. */
struct {
unsigned long start;
unsigned long start_pfn;
unsigned long size;
pgprot_t pgprot;
} remap;
+ struct {
+ phys_addr_t start_phys_addr;
+ unsigned long size;
+ } simple_ioremap;
+ struct {
+ unsigned long start;
+ struct page **pages;
+ unsigned long nr_pages;
+ pgoff_t pgoff;
+ } map_kernel;
};
enum mmap_action_type type;
@@ -870,6 +881,14 @@ typedef struct {
#define EMPTY_VMA_FLAGS ((vma_flags_t){ })
+/* Are no flags set in the specified VMA flags? */
+static __always_inline bool vma_flags_empty(const vma_flags_t *flags)
+{
+ const unsigned long *bitmap = flags->__vma_flags;
+
+ return bitmap_empty(bitmap, NUM_VMA_FLAG_BITS);
+}
+
/*
* Describes a VMA that is about to be mmap()'ed. Drivers may choose to
* manipulate mutable fields which will cause those fields to be updated in the
@@ -879,8 +898,8 @@ typedef struct {
*/
struct vm_area_desc {
/* Immutable state. */
- const struct mm_struct *const mm;
- struct file *const file; /* May vary from vm_file in stacked callers. */
+ struct mm_struct *mm;
+ struct file *file; /* May vary from vm_file in stacked callers. */
unsigned long start;
unsigned long end;
@@ -1056,18 +1075,31 @@ struct vm_area_struct {
} __randomize_layout;
/* Clears all bits in the VMA flags bitmap, non-atomically. */
-static inline void vma_flags_clear_all(vma_flags_t *flags)
+static __always_inline void vma_flags_clear_all(vma_flags_t *flags)
{
bitmap_zero(flags->__vma_flags, NUM_VMA_FLAG_BITS);
}
/*
+ * Helper function which converts a vma_flags_t value to a legacy vm_flags_t
+ * value. This is only valid if the input flags value can be expressed in a
+ * system word.
+ *
+ * Will be removed once the conversion to VMA flags is complete.
+ */
+static __always_inline vm_flags_t vma_flags_to_legacy(vma_flags_t flags)
+{
+ return (vm_flags_t)flags.__vma_flags[0];
+}
+
+/*
* Copy value to the first system word of VMA flags, non-atomically.
*
* IMPORTANT: This does not overwrite bytes past the first system word. The
* caller must account for this.
*/
-static inline void vma_flags_overwrite_word(vma_flags_t *flags, unsigned long value)
+static __always_inline void vma_flags_overwrite_word(vma_flags_t *flags,
+ unsigned long value)
{
unsigned long *bitmap = flags->__vma_flags;
@@ -1075,12 +1107,27 @@ static inline void vma_flags_overwrite_word(vma_flags_t *flags, unsigned long va
}
/*
+ * Helper function which converts a legacy vm_flags_t value to a vma_flags_t
+ * value.
+ *
+ * Will be removed once the conversion to VMA flags is complete.
+ */
+static __always_inline vma_flags_t legacy_to_vma_flags(vm_flags_t flags)
+{
+ vma_flags_t ret = EMPTY_VMA_FLAGS;
+
+ vma_flags_overwrite_word(&ret, flags);
+ return ret;
+}
+
+/*
* Copy value to the first system word of VMA flags ONCE, non-atomically.
*
* IMPORTANT: This does not overwrite bytes past the first system word. The
* caller must account for this.
*/
-static inline void vma_flags_overwrite_word_once(vma_flags_t *flags, unsigned long value)
+static __always_inline void vma_flags_overwrite_word_once(vma_flags_t *flags,
+ unsigned long value)
{
unsigned long *bitmap = flags->__vma_flags;
@@ -1088,7 +1135,8 @@ static inline void vma_flags_overwrite_word_once(vma_flags_t *flags, unsigned lo
}
/* Update the first system word of VMA flags setting bits, non-atomically. */
-static inline void vma_flags_set_word(vma_flags_t *flags, unsigned long value)
+static __always_inline void vma_flags_set_word(vma_flags_t *flags,
+ unsigned long value)
{
unsigned long *bitmap = flags->__vma_flags;
@@ -1096,7 +1144,8 @@ static inline void vma_flags_set_word(vma_flags_t *flags, unsigned long value)
}
/* Update the first system word of VMA flags clearing bits, non-atomically. */
-static inline void vma_flags_clear_word(vma_flags_t *flags, unsigned long value)
+static __always_inline void vma_flags_clear_word(vma_flags_t *flags,
+ unsigned long value)
{
unsigned long *bitmap = flags->__vma_flags;
@@ -1241,7 +1290,11 @@ struct mm_struct {
unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
unsigned long stack_vm; /* VM_STACK */
- vm_flags_t def_flags;
+ union {
+ /* Temporary while VMA flags are being converted. */
+ vm_flags_t def_flags;
+ vma_flags_t def_vma_flags;
+ };
/**
* @write_protect_seq: Locked when any thread is write
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 0ba8a7e8b90a..389521594c69 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -170,53 +170,4 @@ static inline bool arch_memory_deny_write_exec_supported(void)
}
#define arch_memory_deny_write_exec_supported arch_memory_deny_write_exec_supported
#endif
-
-/*
- * Denies creating a writable executable mapping or gaining executable permissions.
- *
- * This denies the following:
- *
- * a) mmap(PROT_WRITE | PROT_EXEC)
- *
- * b) mmap(PROT_WRITE)
- * mprotect(PROT_EXEC)
- *
- * c) mmap(PROT_WRITE)
- * mprotect(PROT_READ)
- * mprotect(PROT_EXEC)
- *
- * But allows the following:
- *
- * d) mmap(PROT_READ | PROT_EXEC)
- * mmap(PROT_READ | PROT_EXEC | PROT_BTI)
- *
- * This is only applicable if the user has set the Memory-Deny-Write-Execute
- * (MDWE) protection mask for the current process.
- *
- * @old specifies the VMA flags the VMA originally possessed, and @new the ones
- * we propose to set.
- *
- * Return: false if proposed change is OK, true if not ok and should be denied.
- */
-static inline bool map_deny_write_exec(unsigned long old, unsigned long new)
-{
- /* If MDWE is disabled, we have nothing to deny. */
- if (!mm_flags_test(MMF_HAS_MDWE, current->mm))
- return false;
-
- /* If the new VMA is not executable, we have nothing to deny. */
- if (!(new & VM_EXEC))
- return false;
-
- /* Under MDWE we do not accept newly writably executable VMAs... */
- if (new & VM_WRITE)
- return true;
-
- /* ...nor previously non-executable VMAs becoming executable. */
- if (!(old & VM_EXEC))
- return true;
-
- return false;
-}
-
#endif /* _LINUX_MMAN_H */
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 0da15adb4aac..69c304b467df 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -97,20 +97,20 @@ struct mmu_notifier_ops {
* Start-end is necessary in case the secondary MMU is mapping the page
* at a smaller granularity than the primary MMU.
*/
- int (*clear_flush_young)(struct mmu_notifier *subscription,
- struct mm_struct *mm,
- unsigned long start,
- unsigned long end);
+ bool (*clear_flush_young)(struct mmu_notifier *subscription,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end);
/*
* clear_young is a lightweight version of clear_flush_young. Like the
* latter, it is supposed to test-and-clear the young/accessed bitflag
* in the secondary pte, but it may omit flushing the secondary tlb.
*/
- int (*clear_young)(struct mmu_notifier *subscription,
- struct mm_struct *mm,
- unsigned long start,
- unsigned long end);
+ bool (*clear_young)(struct mmu_notifier *subscription,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end);
/*
* test_young is called to check the young/accessed bitflag in
@@ -118,9 +118,9 @@ struct mmu_notifier_ops {
* frequently used without actually clearing the flag or tearing
* down the secondary mapping on the page.
*/
- int (*test_young)(struct mmu_notifier *subscription,
- struct mm_struct *mm,
- unsigned long address);
+ bool (*test_young)(struct mmu_notifier *subscription,
+ struct mm_struct *mm,
+ unsigned long address);
/*
* invalidate_range_start() and invalidate_range_end() must be
@@ -418,14 +418,12 @@ mmu_interval_check_retry(struct mmu_interval_notifier *interval_sub,
extern void __mmu_notifier_subscriptions_destroy(struct mm_struct *mm);
extern void __mmu_notifier_release(struct mm_struct *mm);
-extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
- unsigned long start,
- unsigned long end);
-extern int __mmu_notifier_clear_young(struct mm_struct *mm,
- unsigned long start,
- unsigned long end);
-extern int __mmu_notifier_test_young(struct mm_struct *mm,
- unsigned long address);
+bool __mmu_notifier_clear_flush_young(struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+bool __mmu_notifier_clear_young(struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+bool __mmu_notifier_test_young(struct mm_struct *mm,
+ unsigned long address);
extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r);
extern void __mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm,
@@ -445,30 +443,28 @@ static inline void mmu_notifier_release(struct mm_struct *mm)
__mmu_notifier_release(mm);
}
-static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
- unsigned long start,
- unsigned long end)
+static inline bool mmu_notifier_clear_flush_young(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
{
if (mm_has_notifiers(mm))
return __mmu_notifier_clear_flush_young(mm, start, end);
- return 0;
+ return false;
}
-static inline int mmu_notifier_clear_young(struct mm_struct *mm,
- unsigned long start,
- unsigned long end)
+static inline bool mmu_notifier_clear_young(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
{
if (mm_has_notifiers(mm))
return __mmu_notifier_clear_young(mm, start, end);
- return 0;
+ return false;
}
-static inline int mmu_notifier_test_young(struct mm_struct *mm,
- unsigned long address)
+static inline bool mmu_notifier_test_young(struct mm_struct *mm,
+ unsigned long address)
{
if (mm_has_notifiers(mm))
return __mmu_notifier_test_young(mm, address);
- return 0;
+ return false;
}
static inline void
@@ -558,55 +554,6 @@ static inline void mmu_notifier_range_init_owner(
range->owner = owner;
}
-#define clear_flush_young_ptes_notify(__vma, __address, __ptep, __nr) \
-({ \
- int __young; \
- struct vm_area_struct *___vma = __vma; \
- unsigned long ___address = __address; \
- unsigned int ___nr = __nr; \
- __young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr); \
- __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
- ___address, \
- ___address + \
- ___nr * PAGE_SIZE); \
- __young; \
-})
-
-#define pmdp_clear_flush_young_notify(__vma, __address, __pmdp) \
-({ \
- int __young; \
- struct vm_area_struct *___vma = __vma; \
- unsigned long ___address = __address; \
- __young = pmdp_clear_flush_young(___vma, ___address, __pmdp); \
- __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
- ___address, \
- ___address + \
- PMD_SIZE); \
- __young; \
-})
-
-#define ptep_clear_young_notify(__vma, __address, __ptep) \
-({ \
- int __young; \
- struct vm_area_struct *___vma = __vma; \
- unsigned long ___address = __address; \
- __young = ptep_test_and_clear_young(___vma, ___address, __ptep);\
- __young |= mmu_notifier_clear_young(___vma->vm_mm, ___address, \
- ___address + PAGE_SIZE); \
- __young; \
-})
-
-#define pmdp_clear_young_notify(__vma, __address, __pmdp) \
-({ \
- int __young; \
- struct vm_area_struct *___vma = __vma; \
- unsigned long ___address = __address; \
- __young = pmdp_test_and_clear_young(___vma, ___address, __pmdp);\
- __young |= mmu_notifier_clear_young(___vma->vm_mm, ___address, \
- ___address + PMD_SIZE); \
- __young; \
-})
-
#else /* CONFIG_MMU_NOTIFIER */
struct mmu_notifier_range {
@@ -643,24 +590,22 @@ static inline void mmu_notifier_release(struct mm_struct *mm)
{
}
-static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
- unsigned long start,
- unsigned long end)
+static inline bool mmu_notifier_clear_flush_young(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
{
- return 0;
+ return false;
}
-static inline int mmu_notifier_clear_young(struct mm_struct *mm,
- unsigned long start,
- unsigned long end)
+static inline bool mmu_notifier_clear_young(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
{
- return 0;
+ return false;
}
-static inline int mmu_notifier_test_young(struct mm_struct *mm,
- unsigned long address)
+static inline bool mmu_notifier_test_young(struct mm_struct *mm,
+ unsigned long address)
{
- return 0;
+ return false;
}
static inline void
@@ -694,11 +639,6 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
#define mmu_notifier_range_update_to_read_only(r) false
-#define clear_flush_young_ptes_notify clear_flush_young_ptes
-#define pmdp_clear_flush_young_notify pmdp_clear_flush_young
-#define ptep_clear_young_notify ptep_test_and_clear_young
-#define pmdp_clear_young_notify pmdp_test_and_clear_young
-
static inline void mmu_notifier_synchronize(void)
{
}
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 841b40031833..3bcdda226a91 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -23,6 +23,7 @@
#include <linux/page-flags.h>
#include <linux/local_lock.h>
#include <linux/zswap.h>
+#include <linux/sizes.h>
#include <asm/page.h>
/* Free memory management - zoned buddy allocator. */
@@ -61,6 +62,59 @@
*/
#define PAGE_ALLOC_COSTLY_ORDER 3
+#if !defined(CONFIG_HAVE_GIGANTIC_FOLIOS)
+/*
+ * We don't expect any folios that exceed buddy sizes (and consequently
+ * memory sections).
+ */
+#define MAX_FOLIO_ORDER MAX_PAGE_ORDER
+#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
+/*
+ * Only pages within a single memory section are guaranteed to be
+ * contiguous. By limiting folios to a single memory section, all folio
+ * pages are guaranteed to be contiguous.
+ */
+#define MAX_FOLIO_ORDER PFN_SECTION_SHIFT
+#elif defined(CONFIG_HUGETLB_PAGE)
+/*
+ * There is no real limit on the folio size. We limit them to the maximum we
+ * currently expect (see CONFIG_HAVE_GIGANTIC_FOLIOS): with hugetlb, we expect
+ * no folios larger than 16 GiB on 64bit and 1 GiB on 32bit.
+ */
+#ifdef CONFIG_64BIT
+#define MAX_FOLIO_ORDER (ilog2(SZ_16G) - PAGE_SHIFT)
+#else
+#define MAX_FOLIO_ORDER (ilog2(SZ_1G) - PAGE_SHIFT)
+#endif
+#else
+/*
+ * Without hugetlb, gigantic folios that are bigger than a single PUD are
+ * currently impossible.
+ */
+#define MAX_FOLIO_ORDER (PUD_SHIFT - PAGE_SHIFT)
+#endif
+
+#define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER)
+
+/*
+ * HugeTLB Vmemmap Optimization (HVO) requires struct pages of the head page to
+ * be naturally aligned with regard to the folio size.
+ *
+ * HVO is only active if the size of struct page is a power of 2.
+ */
+#define MAX_FOLIO_VMEMMAP_ALIGN \
+ (IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) && \
+ is_power_of_2(sizeof(struct page)) ? \
+ MAX_FOLIO_NR_PAGES * sizeof(struct page) : 0)
+
+/*
+ * vmemmap optimization (like HVO) is only possible for page orders that fill
+ * two or more pages with struct pages.
+ */
+#define VMEMMAP_TAIL_MIN_ORDER (ilog2(2 * PAGE_SIZE / sizeof(struct page)))
+#define __NR_VMEMMAP_TAILS (MAX_FOLIO_ORDER - VMEMMAP_TAIL_MIN_ORDER + 1)
+#define NR_VMEMMAP_TAILS (__NR_VMEMMAP_TAILS > 0 ? __NR_VMEMMAP_TAILS : 0)
+
enum migratetype {
MIGRATE_UNMOVABLE,
MIGRATE_MOVABLE,
@@ -220,6 +274,7 @@ enum node_stat_item {
NR_KERNEL_MISC_RECLAIMABLE, /* reclaimable non-slab kernel pages */
NR_FOLL_PIN_ACQUIRED, /* via: pin_user_page(), gup flag: FOLL_PIN */
NR_FOLL_PIN_RELEASED, /* pages returned via unpin_user_page() */
+ NR_VMALLOC,
NR_KERNEL_STACK_KB, /* measured in KiB */
#if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
NR_KERNEL_SCS_KB, /* measured in KiB */
@@ -255,6 +310,19 @@ enum node_stat_item {
PGDEMOTE_DIRECT,
PGDEMOTE_KHUGEPAGED,
PGDEMOTE_PROACTIVE,
+ PGSTEAL_KSWAPD,
+ PGSTEAL_DIRECT,
+ PGSTEAL_KHUGEPAGED,
+ PGSTEAL_PROACTIVE,
+ PGSTEAL_ANON,
+ PGSTEAL_FILE,
+ PGSCAN_KSWAPD,
+ PGSCAN_DIRECT,
+ PGSCAN_KHUGEPAGED,
+ PGSCAN_PROACTIVE,
+ PGSCAN_ANON,
+ PGSCAN_FILE,
+ PGREFILL,
#ifdef CONFIG_HUGETLB_PAGE
NR_HUGETLB,
#endif
@@ -618,7 +686,7 @@ struct lru_gen_memcg {
void lru_gen_init_pgdat(struct pglist_data *pgdat);
void lru_gen_init_lruvec(struct lruvec *lruvec);
-bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw, unsigned int nr);
void lru_gen_init_memcg(struct mem_cgroup *memcg);
void lru_gen_exit_memcg(struct mem_cgroup *memcg);
@@ -637,7 +705,8 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
{
}
-static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw,
+ unsigned int nr)
{
return false;
}
@@ -1059,6 +1128,9 @@ struct zone {
/* Zone statistics */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
+#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+ struct page *vmemmap_tails[NR_VMEMMAP_TAILS];
+#endif
} ____cacheline_internodealigned_in_smp;
enum pgdat_flags {
@@ -1912,15 +1984,13 @@ struct mem_section_usage {
unsigned long pageblock_flags[0];
};
-void subsection_map_init(unsigned long pfn, unsigned long nr_pages);
-
struct page;
struct page_ext;
struct mem_section {
/*
* This is, logically, a pointer to an array of struct
* pages. However, it is stored with some other magic.
- * (see sparse.c::sparse_init_one_section())
+ * (see sparse_init_one_section())
*
* Additionally during early boot we encode node id of
* the location of the section here to guide allocation.
@@ -2302,11 +2372,9 @@ static inline unsigned long next_present_section_nr(unsigned long section_nr)
#endif
#else
-#define sparse_index_init(_sec, _nid) do {} while (0)
#define sparse_vmemmap_init_nid_early(_nid) do {} while (0)
#define sparse_vmemmap_init_nid_late(_nid) do {} while (0)
#define pfn_in_present_section pfn_valid
-#define subsection_map_init(_pfn, _nr_pages) do {} while (0)
#endif /* CONFIG_SPARSEMEM */
/*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f7a0e4af0c73..0e03d816e8b9 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -198,97 +198,91 @@ enum pageflags {
#ifndef __GENERATING_BOUNDS_H
-#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
-DECLARE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
-
/*
- * Return the real head page struct iff the @page is a fake head page, otherwise
- * return the @page itself. See Documentation/mm/vmemmap_dedup.rst.
+ * For tail pages, if the size of struct page is a power of 2, ->compound_info
+ * encodes the mask that converts the address of a tail page to the address
+ * of the head page.
+ *
+ * Otherwise, ->compound_info holds a direct pointer to the head page.
*/
-static __always_inline const struct page *page_fixed_fake_head(const struct page *page)
+static __always_inline bool compound_info_has_mask(void)
{
- if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
- return page;
-
/*
- * Only addresses aligned with PAGE_SIZE of struct page may be fake head
- * struct page. The alignment check aims to avoid access the fields (
- * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly)
- * cold cacheline in some cases.
+ * Limit mask usage to HugeTLB vmemmap optimization (HVO) where it
+ * makes a difference.
+ *
+ * The mask-based approach would work under a wider set of conditions,
+ * but it requires validating that struct pages are naturally aligned
+ * for all orders up to MAX_FOLIO_ORDER, which can be tricky.
*/
- if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
- test_bit(PG_head, &page->flags.f)) {
- /*
- * We can safely access the field of the @page[1] with PG_head
- * because the @page is a compound page composed with at least
- * two contiguous pages.
- */
- unsigned long head = READ_ONCE(page[1].compound_head);
-
- if (likely(head & 1))
- return (const struct page *)(head - 1);
- }
- return page;
+ if (!IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP))
+ return false;
+
+ return is_power_of_2(sizeof(struct page));
}
-static __always_inline bool page_count_writable(const struct page *page, int u)
+static __always_inline unsigned long _compound_head(const struct page *page)
{
- if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
- return true;
+ unsigned long info = READ_ONCE(page->compound_info);
+ unsigned long mask;
+
+ if (!compound_info_has_mask()) {
+ /* Bit 0 encodes PageTail() */
+ if (info & 1)
+ return info - 1;
+
+ return (unsigned long)page;
+ }
/*
- * The refcount check is ordered before the fake-head check to prevent
- * the following race:
- * CPU 1 (HVO) CPU 2 (speculative PFN walker)
- *
- * page_ref_freeze()
- * synchronize_rcu()
- * rcu_read_lock()
- * page_is_fake_head() is false
- * vmemmap_remap_pte()
- * XXX: struct page[] becomes r/o
+ * If compound_info_has_mask() is true, the rest of the info encodes
+ * the mask that converts the address of the tail page to the head page.
*
- * page_ref_unfreeze()
- * page_ref_count() is not zero
+ * No need to clear bit 0 in the mask as 'page' always has it clear.
*
- * atomic_add_unless(&page->_refcount)
- * XXX: try to modify r/o struct page[]
- *
- * The refcount check also prevents modification attempts to other (r/o)
- * tail pages that are not fake heads.
+ * Let's do it in a branchless manner.
*/
- if (atomic_read_acquire(&page->_refcount) == u)
- return false;
- return page_fixed_fake_head(page) == page;
-}
-#else
-static inline const struct page *page_fixed_fake_head(const struct page *page)
-{
- return page;
-}
+ /* Non-tail: -1UL, Tail: 0 */
+ mask = (info & 1) - 1;
-static inline bool page_count_writable(const struct page *page, int u)
-{
- return true;
-}
-#endif
+ /* Non-tail: -1UL, Tail: info */
+ mask |= info;
-static __always_inline int page_is_fake_head(const struct page *page)
-{
- return page_fixed_fake_head(page) != page;
+ return (unsigned long)page & mask;
}
-static __always_inline unsigned long _compound_head(const struct page *page)
+#define compound_head(page) ((typeof(page))_compound_head(page))
+
+static __always_inline void set_compound_head(struct page *tail,
+ const struct page *head, unsigned int order)
{
- unsigned long head = READ_ONCE(page->compound_head);
+ unsigned int shift;
+ unsigned long mask;
+
+ if (!compound_info_has_mask()) {
+ WRITE_ONCE(tail->compound_info, (unsigned long)head | 1);
+ return;
+ }
+
+ /*
+ * If the size of struct page is a power of 2, bits [shift-1:0] of the
+ * virtual address of compound head are zero.
+ *
+ * Calculate mask that can be applied to the virtual address of
+ * the tail page to get address of the head page.
+ */
+ shift = order + order_base_2(sizeof(struct page));
+ mask = GENMASK(BITS_PER_LONG - 1, shift);
- if (unlikely(head & 1))
- return head - 1;
- return (unsigned long)page_fixed_fake_head(page);
+ /* Bit 0 encodes PageTail() */
+ WRITE_ONCE(tail->compound_info, mask | 1);
}
-#define compound_head(page) ((typeof(page))_compound_head(page))
+static __always_inline void clear_compound_head(struct page *page)
+{
+ WRITE_ONCE(page->compound_info, 0);
+}
/**
* page_folio - Converts from page to folio.
@@ -320,13 +314,13 @@ static __always_inline unsigned long _compound_head(const struct page *page)
static __always_inline int PageTail(const struct page *page)
{
- return READ_ONCE(page->compound_head) & 1 || page_is_fake_head(page);
+ return READ_ONCE(page->compound_info) & 1;
}
static __always_inline int PageCompound(const struct page *page)
{
return test_bit(PG_head, &page->flags.f) ||
- READ_ONCE(page->compound_head) & 1;
+ READ_ONCE(page->compound_info) & 1;
}
#define PAGE_POISON_PATTERN -1l
@@ -348,7 +342,7 @@ static const unsigned long *const_folio_flags(const struct folio *folio,
{
const struct page *page = &folio->page;
- VM_BUG_ON_PGFLAGS(page->compound_head & 1, page);
+ VM_BUG_ON_PGFLAGS(page->compound_info & 1, page);
VM_BUG_ON_PGFLAGS(n > 0 && !test_bit(PG_head, &page->flags.f), page);
return &page[n].flags.f;
}
@@ -357,7 +351,7 @@ static unsigned long *folio_flags(struct folio *folio, unsigned n)
{
struct page *page = &folio->page;
- VM_BUG_ON_PGFLAGS(page->compound_head & 1, page);
+ VM_BUG_ON_PGFLAGS(page->compound_info & 1, page);
VM_BUG_ON_PGFLAGS(n > 0 && !test_bit(PG_head, &page->flags.f), page);
return &page[n].flags.f;
}
@@ -724,6 +718,11 @@ static __always_inline bool folio_test_anon(const struct folio *folio)
return ((unsigned long)folio->mapping & FOLIO_MAPPING_ANON) != 0;
}
+static __always_inline bool folio_test_lazyfree(const struct folio *folio)
+{
+ return folio_test_anon(folio) && !folio_test_swapbacked(folio);
+}
+
static __always_inline bool PageAnonNotKsm(const struct page *page)
{
unsigned long flags = (unsigned long)page_folio(page)->mapping;
@@ -847,7 +846,7 @@ static __always_inline bool folio_test_head(const struct folio *folio)
static __always_inline int PageHead(const struct page *page)
{
PF_POISONED_CHECK(page);
- return test_bit(PG_head, &page->flags.f) && !page_is_fake_head(page);
+ return test_bit(PG_head, &page->flags.f);
}
__SETPAGEFLAG(Head, head, PF_ANY)
@@ -865,16 +864,6 @@ static inline bool folio_test_large(const struct folio *folio)
return folio_test_head(folio);
}
-static __always_inline void set_compound_head(struct page *page, struct page *head)
-{
- WRITE_ONCE(page->compound_head, (unsigned long)head + 1);
-}
-
-static __always_inline void clear_compound_head(struct page *page)
-{
- WRITE_ONCE(page->compound_head, 0);
-}
-
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static inline void ClearPageCompound(struct page *page)
{
@@ -934,6 +923,7 @@ enum pagetype {
PGTY_zsmalloc = 0xf6,
PGTY_unaccepted = 0xf7,
PGTY_large_kmalloc = 0xf8,
+ PGTY_netpp = 0xf9,
PGTY_mapcount_underflow = 0xff
};
@@ -1066,6 +1056,11 @@ PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
PAGE_TYPE_OPS(Unaccepted, unaccepted, unaccepted)
PAGE_TYPE_OPS(LargeKmalloc, large_kmalloc, large_kmalloc)
+/*
+ * Marks page_pool allocated pages.
+ */
+PAGE_TYPE_OPS(Netpp, netpp, netpp)
+
/**
* PageHuge - Determine if the page belongs to hugetlbfs
* @page: The page to test.
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 544150d1d5fd..94d3f0e71c06 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -228,24 +228,18 @@ static inline int folio_ref_dec_return(struct folio *folio)
return page_ref_dec_return(&folio->page);
}
-static inline bool page_ref_add_unless(struct page *page, int nr, int u)
+static inline bool page_ref_add_unless_zero(struct page *page, int nr)
{
- bool ret = false;
-
- rcu_read_lock();
- /* avoid writing to the vmemmap area being remapped */
- if (page_count_writable(page, u))
- ret = atomic_add_unless(&page->_refcount, nr, u);
- rcu_read_unlock();
+ bool ret = atomic_add_unless(&page->_refcount, nr, 0);
if (page_ref_tracepoint_active(page_ref_mod_unless))
__page_ref_mod_unless(page, nr, ret);
return ret;
}
-static inline bool folio_ref_add_unless(struct folio *folio, int nr, int u)
+static inline bool folio_ref_add_unless_zero(struct folio *folio, int nr)
{
- return page_ref_add_unless(&folio->page, nr, u);
+ return page_ref_add_unless_zero(&folio->page, nr);
}
/**
@@ -261,12 +255,12 @@ static inline bool folio_ref_add_unless(struct folio *folio, int nr, int u)
*/
static inline bool folio_try_get(struct folio *folio)
{
- return folio_ref_add_unless(folio, 1, 0);
+ return folio_ref_add_unless_zero(folio, 1);
}
static inline bool folio_ref_try_add(struct folio *folio, int count)
{
- return folio_ref_add_unless(folio, count, 0);
+ return folio_ref_add_unless_zero(folio, count);
}
static inline int page_ref_freeze(struct page *page, int count)
diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
index fe648dfa3a7c..9d4ca5c218a0 100644
--- a/include/linux/page_reporting.h
+++ b/include/linux/page_reporting.h
@@ -7,6 +7,7 @@
/* This value should always be a power of 2, see page_reporting_cycle() */
#define PAGE_REPORTING_CAPACITY 32
+#define PAGE_REPORTING_ORDER_UNSPECIFIED -1
struct page_reporting_dev_info {
/* function that alters pages to make them "reported" */
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 88e18615dd72..b41d7265c01b 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -148,14 +148,8 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
typedef int __bitwise folio_walk_flags_t;
-/*
- * Walk migration entries as well. Careful: a large folio might get split
- * concurrently.
- */
-#define FW_MIGRATION ((__force folio_walk_flags_t)BIT(0))
-
/* Walk shared zeropages (small + huge) as well. */
-#define FW_ZEROPAGE ((__force folio_walk_flags_t)BIT(1))
+#define FW_ZEROPAGE ((__force folio_walk_flags_t)BIT(0))
enum folio_walk_level {
FW_LEVEL_PTE,
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a50df42a893f..cdd68ed3ae1a 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -491,64 +491,63 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
#endif
#ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
-static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
- unsigned long address,
- pte_t *ptep)
+static inline bool ptep_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long address, pte_t *ptep)
{
pte_t pte = ptep_get(ptep);
- int r = 1;
+ bool young = true;
+
if (!pte_young(pte))
- r = 0;
+ young = false;
else
set_pte_at(vma->vm_mm, address, ptep, pte_mkold(pte));
- return r;
+ return young;
}
#endif
#ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
-static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
- unsigned long address,
- pmd_t *pmdp)
+static inline bool pmdp_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp)
{
pmd_t pmd = *pmdp;
- int r = 1;
+ bool young = true;
+
if (!pmd_young(pmd))
- r = 0;
+ young = false;
else
set_pmd_at(vma->vm_mm, address, pmdp, pmd_mkold(pmd));
- return r;
+ return young;
}
#else
-static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
- unsigned long address,
- pmd_t *pmdp)
+static inline bool pmdp_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp)
{
BUILD_BUG();
- return 0;
+ return false;
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */
#endif
#ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
-int ptep_clear_flush_young(struct vm_area_struct *vma,
- unsigned long address, pte_t *ptep);
+bool ptep_clear_flush_young(struct vm_area_struct *vma,
+ unsigned long address, pte_t *ptep);
#endif
#ifndef __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmdp);
+bool pmdp_clear_flush_young(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp);
#else
/*
* Despite being relevant only to THP, this API is called from generic rmap
* code under PageTransHuge(), hence needs a dummy implementation for !THP
*/
-static inline int pmdp_clear_flush_young(struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmdp)
+static inline bool pmdp_clear_flush_young(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp)
{
BUILD_BUG();
- return 0;
+ return false;
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif
@@ -1086,10 +1085,10 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
* Context: The caller holds the page table lock. The PTEs map consecutive
* pages that belong to the same folio. The PTEs are all in the same PMD.
*/
-static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
+static inline bool clear_flush_young_ptes(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep, unsigned int nr)
{
- int young = 0;
+ bool young = false;
for (;;) {
young |= ptep_clear_flush_young(vma, addr, ptep);
@@ -1103,6 +1102,43 @@ static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
}
#endif
+#ifndef test_and_clear_young_ptes
+/**
+ * test_and_clear_young_ptes - Mark PTEs that map consecutive pages of the same
+ * folio as old
+ * @vma: The virtual memory area the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries for which to clear the access bit.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_test_and_clear_young().
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock. The PTEs map consecutive
+ * pages that belong to the same folio. The PTEs are all in the same PMD.
+ *
+ * Returns: whether any PTE was young.
+ */
+static inline bool test_and_clear_young_ptes(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep, unsigned int nr)
+{
+ bool young = false;
+
+ for (;;) {
+ young |= ptep_test_and_clear_young(vma, addr, ptep);
+ if (--nr == 0)
+ break;
+ ptep++;
+ addr += PAGE_SIZE;
+ }
+
+ return young;
+}
+#endif
+
/*
* On some architectures hardware does not set page access bit when accessing
* memory page, it is responsibility of software setting this bit. It brings
@@ -1917,41 +1953,56 @@ static inline void pfnmap_setup_cachemode_pfn(unsigned long pfn, pgprot_t *prot)
pfnmap_setup_cachemode(pfn, PAGE_SIZE, prot);
}
-#ifdef CONFIG_MMU
+/*
+ * ZERO_PAGE() is a global shared page (or pages) that is always zero. It is
+ * used for zero-mapped memory areas, CoW, etc.
+ *
+ * On architectures that define __HAVE_COLOR_ZERO_PAGE there are several such
+ * pages for different ranges in the virtual address space.
+ *
+ * zero_page_pfn identifies the first (or the only) pfn for these pages.
+ *
+ * On architectures that don't define __HAVE_COLOR_ZERO_PAGE, the zero page
+ * lives in empty_zero_page in BSS.
+ */
+void arch_setup_zero_pages(void);
+
#ifdef __HAVE_COLOR_ZERO_PAGE
static inline int is_zero_pfn(unsigned long pfn)
{
- extern unsigned long zero_pfn;
- unsigned long offset_from_zero_pfn = pfn - zero_pfn;
+ extern unsigned long zero_page_pfn;
+ unsigned long offset_from_zero_pfn = pfn - zero_page_pfn;
+
return offset_from_zero_pfn <= (zero_page_mask >> PAGE_SHIFT);
}
-#define my_zero_pfn(addr) page_to_pfn(ZERO_PAGE(addr))
+#define zero_pfn(addr) page_to_pfn(ZERO_PAGE(addr))
#else
static inline int is_zero_pfn(unsigned long pfn)
{
- extern unsigned long zero_pfn;
- return pfn == zero_pfn;
-}
+ extern unsigned long zero_page_pfn;
-static inline unsigned long my_zero_pfn(unsigned long addr)
-{
- extern unsigned long zero_pfn;
- return zero_pfn;
+ return pfn == zero_page_pfn;
}
-#endif
-#else
-static inline int is_zero_pfn(unsigned long pfn)
+
+static inline unsigned long zero_pfn(unsigned long addr)
{
- return 0;
+ extern unsigned long zero_page_pfn;
+
+ return zero_page_pfn;
}
-static inline unsigned long my_zero_pfn(unsigned long addr)
+extern uint8_t empty_zero_page[PAGE_SIZE];
+extern struct page *__zero_page;
+
+static inline struct page *_zero_page(unsigned long addr)
{
- return 0;
+ return __zero_page;
}
-#endif /* CONFIG_MMU */
+#define ZERO_PAGE(vaddr) _zero_page(vaddr)
+
+#endif /* __HAVE_COLOR_ZERO_PAGE */
#ifdef CONFIG_MMU
@@ -1989,7 +2040,7 @@ static inline int pud_trans_unstable(pud_t *pud)
{
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
- pud_t pudval = READ_ONCE(*pud);
+ pud_t pudval = pudp_get(pud);
if (pud_none(pudval) || pud_trans_huge(pudval))
return 1;
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 4dc14c7a711b..a11acf5cd63b 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -20,7 +20,7 @@
#include <linux/lwq.h>
#include <linux/wait.h>
#include <linux/mm.h>
-#include <linux/pagevec.h>
+#include <linux/folio_batch.h>
#include <linux/kthread.h>
/*
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 62fc7499b408..4b1f13b5bbad 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -20,8 +20,6 @@ struct notifier_block;
struct bio;
-struct pagevec;
-
#define SWAP_FLAG_PREFER 0x8000 /* set if swap priority specified */
#define SWAP_FLAG_PRIO_MASK 0x7fff
#define SWAP_FLAG_DISCARD 0x10000 /* enable discard for swap */
@@ -208,7 +206,6 @@ enum {
SWP_DISCARDABLE = (1 << 2), /* blkdev support discard */
SWP_DISCARDING = (1 << 3), /* now discarding a free cluster */
SWP_SOLIDSTATE = (1 << 4), /* blkdev seeks are cheap */
- SWP_CONTINUED = (1 << 5), /* swap_map has count continuation */
SWP_BLKDEV = (1 << 6), /* its a block device */
SWP_ACTIVATED = (1 << 7), /* set after swap_activate success */
SWP_FS_OPS = (1 << 8), /* swapfile operations go through fs */
@@ -223,16 +220,6 @@ enum {
#define SWAP_CLUSTER_MAX_SKIPPED (SWAP_CLUSTER_MAX << 10)
#define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
-/* Bit flag in swap_map */
-#define COUNT_CONTINUED 0x80 /* Flag swap_map continuation for full count */
-
-/* Special value in first swap_map */
-#define SWAP_MAP_MAX 0x3e /* Max count */
-#define SWAP_MAP_BAD 0x3f /* Note page is bad */
-
-/* Special value in each swap_map continuation */
-#define SWAP_CONT_MAX 0x7f /* Max count */
-
/*
* The first page in the swap file is the swap header, which is always marked
* bad to prevent it from being allocated as an entry. This also prevents the
@@ -264,8 +251,7 @@ struct swap_info_struct {
signed short prio; /* swap priority of this type */
struct plist_node list; /* entry in swap_active_head */
signed char type; /* strange name for an index */
- unsigned int max; /* extent of the swap_map */
- unsigned char *swap_map; /* vmalloc'ed array of usage counts */
+ unsigned int max; /* size of this swap device */
unsigned long *zeromap; /* kvmalloc'ed bitmap to track zero pages */
struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
struct list_head free_clusters; /* free clusters list */
@@ -284,18 +270,14 @@ struct swap_info_struct {
struct completion comp; /* seldom referenced */
spinlock_t lock; /*
* protect map scan related fields like
- * swap_map, inuse_pages and all cluster
- * lists. other fields are only changed
+ * inuse_pages and all cluster lists.
+ * Other fields are only changed
* at swapon/swapoff, so are protected
* by swap_lock. Changing flags needs to
* hold this lock and swap_lock. If
* both locks are needed, hold swap_lock
* first.
*/
- spinlock_t cont_lock; /*
- * protect swap count continuation page
- * list.
- */
struct work_struct discard_work; /* discard worker */
struct work_struct reclaim_work; /* reclaim worker */
struct list_head discard_clusters; /* discard clusters list */
@@ -451,7 +433,6 @@ static inline long get_nr_swap_pages(void)
}
extern void si_swapinfo(struct sysinfo *);
-extern int add_swap_count_continuation(swp_entry_t, gfp_t);
int swap_type_of(dev_t device, sector_t offset);
int find_first_swap(dev_t *device);
extern unsigned int count_swap_pages(int, int);
@@ -517,11 +498,6 @@ static inline void free_swap_cache(struct folio *folio)
{
}
-static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
-{
- return 0;
-}
-
static inline int swap_dup_entry_direct(swp_entry_t ent)
{
return 0;
diff --git a/include/linux/types.h b/include/linux/types.h
index 7e71d260763c..608050dbca6a 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -239,7 +239,7 @@ struct ustat {
*
* This guarantee is important for few reasons:
* - future call_rcu_lazy() will make use of lower bits in the pointer;
- * - the structure shares storage space in struct page with @compound_head,
+ * - the structure shares storage space in struct page with @compound_info,
* which encodes PageTail() in bit 0. The guarantee is needed to avoid
* false-positive PageTail().
*/
diff --git a/include/linux/uio_driver.h b/include/linux/uio_driver.h
index 334641e20fb1..02eaac47ac44 100644
--- a/include/linux/uio_driver.h
+++ b/include/linux/uio_driver.h
@@ -97,7 +97,7 @@ struct uio_device {
* @irq_flags: flags for request_irq()
* @priv: optional private data
* @handler: the device's irq handler
- * @mmap: mmap operation for this uio device
+ * @mmap_prepare: mmap_prepare operation for this uio device
* @open: open operation for this uio device
* @release: release operation for this uio device
* @irqcontrol: disable/enable irqs when 0/1 is written to /dev/uioX
@@ -112,7 +112,7 @@ struct uio_info {
unsigned long irq_flags;
void *priv;
irqreturn_t (*handler)(int irq, struct uio_info *dev_info);
- int (*mmap)(struct uio_info *info, struct vm_area_struct *vma);
+ int (*mmap_prepare)(struct uio_info *info, struct vm_area_desc *desc);
int (*open)(struct uio_info *info, struct inode *inode);
int (*release)(struct uio_info *info, struct inode *inode);
int (*irqcontrol)(struct uio_info *info, s32 irq_on);
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index fd5f42765497..d83e349900a3 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -23,6 +23,9 @@
/* The set of all possible UFFD-related VM flags. */
#define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR)
+#define __VMA_UFFD_FLAGS mk_vma_flags(VMA_UFFD_MISSING_BIT, VMA_UFFD_WP_BIT, \
+ VMA_UFFD_MINOR_BIT)
+
/*
* CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
* new flags, since they might collide with O_* ones. We want
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 22a139f82d75..03fe95f5a020 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -38,21 +38,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
PGFREE, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE,
PGFAULT, PGMAJFAULT,
PGLAZYFREED,
- PGREFILL,
PGREUSE,
- PGSTEAL_KSWAPD,
- PGSTEAL_DIRECT,
- PGSTEAL_KHUGEPAGED,
- PGSTEAL_PROACTIVE,
- PGSCAN_KSWAPD,
- PGSCAN_DIRECT,
- PGSCAN_KHUGEPAGED,
- PGSCAN_PROACTIVE,
PGSCAN_DIRECT_THROTTLE,
- PGSCAN_ANON,
- PGSCAN_FILE,
- PGSTEAL_ANON,
- PGSTEAL_FILE,
#ifdef CONFIG_NUMA
PGSCAN_ZONE_RECLAIM_SUCCESS,
PGSCAN_ZONE_RECLAIM_FAILED,
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index e8e94f90d686..3b02c0c6b371 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -286,8 +286,6 @@ int unregister_vmap_purge_notifier(struct notifier_block *nb);
#ifdef CONFIG_MMU
#define VMALLOC_TOTAL (VMALLOC_END - VMALLOC_START)
-unsigned long vmalloc_nr_pages(void);
-
int vm_area_map_pages(struct vm_struct *area, unsigned long start,
unsigned long end, struct page **pages);
void vm_area_unmap_pages(struct vm_struct *area, unsigned long start,
@@ -304,7 +302,6 @@ static inline void set_vm_flush_reset_perms(void *addr)
#else /* !CONFIG_MMU */
#define VMALLOC_TOTAL 0UL
-static inline unsigned long vmalloc_nr_pages(void) { return 0; }
static inline void set_vm_flush_reset_perms(void *addr) {}
#endif /* CONFIG_MMU */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index e530112c4b3a..62552a2ce5b9 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -11,7 +11,7 @@
#include <linux/flex_proportions.h>
#include <linux/backing-dev-defs.h>
#include <linux/blk_types.h>
-#include <linux/pagevec.h>
+#include <linux/folio_batch.h>
struct bio;