linux-stable.git/drivers/dax/device.c, branch linux-6.3.y

Merge tag 'cxl-for-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl

2023-02-25T17:19:23+00:00

Pull Compute Express Link (CXL) updates from Dan Williams:
 "To date Linux has been dependent on platform-firmware to map CXL RAM
  regions and handle events / errors from devices. With this update we
  can now parse / update the CXL memory layout, and report events /
  errors from devices. This is a precursor for the CXL subsystem to
  handle the end-to-end "RAS" flow for CXL memory. i.e. the flow that
  for DDR-attached-DRAM is handled by the EDAC driver where it maps
  system physical address events to a field-replaceable-unit (FRU /
  endpoint device). In general, CXL has the potential to standardize
  what has historically been a pile of memory-controller-specific error
  handling logic.

  Another change of note is the default policy for handling RAM-backed
  device-dax instances. Previously the default access mode was "device",
  where users mmap(2) a device special file to access memory. The new
  default is "kmem", where the address range is assigned to the core-mm via
  add_memory_driver_managed(). This saves typical users from wondering
  why their platform memory is not visible via free(1) and stuck behind
  a device-file. At the same time it allows expert users to deploy
  policy to, for example, get dedicated access to high performance
  memory, or hide low performance memory from general purpose kernel
  allocations. This affects not only CXL, but also systems with
  high-bandwidth-memory that platform-firmware tags with the
  EFI_MEMORY_SP (special purpose) designation.

  Summary:

   - CXL RAM region enumeration: instantiate 'struct cxl_region' objects
     for platform firmware created memory regions

   - CXL RAM region provisioning: complement the existing PMEM region
     creation support with RAM region support

   - "Soft Reservation" policy change: Online (memory hot-add)
     soft-reserved memory (EFI_MEMORY_SP) by default, but still allow
     for setting aside such memory for dedicated access via device-dax.

   - CXL Events and Interrupts: Take over CXL event handling from
     platform-firmware (ACPI calls this CXL Memory Error Reporting) and
     export CXL Events via Linux Trace Events.

   - Convey CXL _OSC results to drivers: Similar to PCI, let the CXL
     subsystem interrogate the result of CXL _OSC negotiation.

   - Emulate CXL DVSEC Range Registers as "decoders": Allow for
     first-generation devices that pre-date the definition of the CXL
     HDM Decoder Capability to translate the CXL DVSEC Range Registers
     into 'struct cxl_decoder' objects.

   - Set timestamp: Per spec, set the device timestamp in case of
     hotplug, or if platform-firmware failed to set it.

   - General fixups: linux-next build issues, non-urgent fixes for
     pre-production hardware, unit test fixes, spelling and debug
     message improvements"

* tag 'cxl-for-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (66 commits)
  dax/kmem: Fix leak of memory-hotplug resources
  cxl/mem: Add kdoc param for event log driver state
  cxl/trace: Add serial number to trace points
  cxl/trace: Add host output to trace points
  cxl/trace: Standardize device information output
  cxl/pci: Remove locked check for dvsec_range_allowed()
  cxl/hdm: Add emulation when HDM decoders are not committed
  cxl/hdm: Create emulated cxl_hdm for devices that do not have HDM decoders
  cxl/hdm: Emulate HDM decoder from DVSEC range registers
  cxl/pci: Refactor cxl_hdm_decode_init()
  cxl/port: Export cxl_dvsec_rr_decode() to cxl_port
  cxl/pci: Break out range register decoding from cxl_hdm_decode_init()
  cxl: add RAS status unmasking for CXL
  cxl: remove unnecessary calling of pci_enable_pcie_error_reporting()
  dax/hmem: build hmem device support as module if possible
  dax: cxl: add CXL_REGION dependency
  cxl: avoid returning uninitialized error code
  cxl/pmem: Fix nvdimm registration races
  cxl/mem: Fix UAPI command comment
  cxl/uapi: Tag commands from cxl_query_cmd()
  ...

dax: Assign RAM regions to memory-hotplug by default

2023-02-11T01:33:40+00:00

The default mode for device-dax instances is backwards for RAM-regions,
as evidenced by the fact that it tends to catch end users by surprise
("Where is my memory?"). Recall that platforms are increasingly shipping
with performance-differentiated memory pools beyond typical DRAM and
NUMA effects. This includes HBM (high-bandwidth-memory) and CXL (dynamic
interleave, varied media types, and future fabric attached
possibilities).

For this reason the EFI_MEMORY_SP (EFI Special Purpose Memory => Linux
'Soft Reserved') attribute is expected to be applied to all memory-pools
that are not the general purpose pool. This designation gives an
Operating System a chance to defer usage of a memory pool until later in
the boot process where its performance properties can be interrogated
and administrator policy can be applied.

'Soft Reserved' memory can be anything from too limited and precious to
be part of the general purpose pool (HBM), too slow to host hot kernel
data structures (some PMEM media), or anything in between. However, in
the absence of an explicit policy, the memory should at least be made
usable by default. The current device-dax default hides all
non-general-purpose memory behind a device interface.

The expectation is that the distribution of users that want the memory
online by default vs device-dedicated-access by default follows the
Pareto principle. A small number of enlightened users may want to do
userspace memory management through a device, but general users just
want the kernel to make the memory available with an option to get more
advanced later.

Arrange for all device-dax instances not backed by PMEM to default to
attaching to the dax_kmem driver. From there the baseline memory hotplug
policy (CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE / memhp_default_state=)
gates whether the memory comes online or stays offline. If it stays
offline, it can be reliably converted back to device-mode, where it can
be partitioned or fronted by a userspace allocator.

So, if someone wants device-dax instances for their 'Soft Reserved'
memory:

1/ Build a kernel with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n or boot
   with memhp_default_state=offline, or roll the dice and hope that the
   kernel has not pinned a page in that memory before step 2.

2/ Write a udev rule to convert the target dax device(s) from
   'system-ram' mode to 'devdax' mode:

   daxctl reconfigure-device $dax -m devdax -f
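
The default selection described above can be modeled in a few lines of
userspace C. This is only a sketch of the policy, not the driver code:
every enum and function name below is hypothetical, and the fallback to
device mode when dax_kmem is unavailable is an assumption that the text
above does not state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; not the kernel's definitions. */
enum region_kind { REGION_PMEM, REGION_RAM };
enum dax_mode { MODE_DEVDAX, MODE_SYSTEM_RAM };

/*
 * Pick the default mode for a new device-dax instance.
 * kmem_available models whether the dax_kmem driver exists at all
 * (assumption: without it, the instance stays in device mode).
 */
static enum dax_mode default_dax_mode(enum region_kind kind,
				      bool kmem_available)
{
	assert(kind == REGION_PMEM || kind == REGION_RAM);

	if (kind == REGION_RAM && kmem_available)
		return MODE_SYSTEM_RAM;	/* handed to memory hotplug */
	return MODE_DEVDAX;		/* stays behind a device file */
}

/*
 * Whether the memory is usable by general allocations right after
 * attach. default_online models the baseline hotplug policy
 * (CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE / memhp_default_state=).
 */
static bool usable_without_admin_action(enum dax_mode mode,
					bool default_online)
{
	return mode == MODE_SYSTEM_RAM && default_online;
}
```

In this model a RAM region with the hotplug default set to offline ends
up in 'system-ram' mode but unused, which is the state from which step 2
can convert it back to 'devdax' without racing against pinned pages.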

Cc: Michal Hocko 
Cc: David Hildenbrand 
Cc: Dave Hansen 
Reviewed-by: Gregory Price 
Tested-by: Fan Ni 
Reviewed-by: Dave Jiang 
Link: https://lore.kernel.org/r/167602003336.1924368.6809503401422267885.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams

mm: replace vma->vm_flags direct modifications with modifier calls

2023-02-10T00:51:39+00:00

Replace direct modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.
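
As a rough illustration of why a modifier call helps, the userspace
sketch below funnels every flag change through one function that can
check locking and record the change. The toy_ types, the TOY_VM_HUGEPAGE
value, and the write_locked / nr_changes fields are invented for this
example; they loosely mirror the shape of helpers such as
vm_flags_set() and vm_flags_clear() and are not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long toy_vm_flags_t;

#define TOY_VM_HUGEPAGE 0x1UL	/* illustrative value only */

/* Toy stand-in for a VMA; field names are assumptions. */
struct toy_vma {
	toy_vm_flags_t flags;
	bool write_locked;	 /* caller holds the lock needed to modify */
	unsigned int nr_changes; /* what a tracking hook could record */
};

/* Single choke point: verify locking, then note that flags changed. */
static void toy_flags_prepare(struct toy_vma *vma)
{
	assert(vma->write_locked);
	vma->nr_changes++;
}

static void toy_vm_flags_set(struct toy_vma *vma, toy_vm_flags_t flags)
{
	toy_flags_prepare(vma);
	vma->flags |= flags;
}

static void toy_vm_flags_clear(struct toy_vma *vma, toy_vm_flags_t flags)
{
	toy_flags_prepare(vma);
	vma->flags &= ~flags;
}
```

A direct `vma->flags |= TOY_VM_HUGEPAGE` would bypass both the lock
check and the counter, which is the kind of untracked change the
conversion is meant to rule out.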

[akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
Signed-off-by: Suren Baghdasaryan 
Acked-by: Michal Hocko 
Acked-by: Mel Gorman 
Acked-by: Mike Rapoport (IBM) 
Acked-by: Sebastian Reichel 
Reviewed-by: Liam R. Howlett 
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Andy Lutomirski 
Cc: Arjun Roy 
Cc: Axel Rasmussen 
Cc: David Hildenbrand 
Cc: David Howells 
Cc: Davidlohr Bueso 
Cc: David Rientjes 
Cc: Eric Dumazet 
Cc: Greg Thelen 
Cc: Hugh Dickins 
Cc: Ingo Molnar 
Cc: Jann Horn 
Cc: Joel Fernandes 
Cc: Johannes Weiner 
Cc: Kent Overstreet 
Cc: Laurent Dufour 
Cc: Lorenzo Stoakes 
Cc: Matthew Wilcox 
Cc: Minchan Kim 
Cc: Paul E. McKenney 
Cc: Peter Oskolkov 
Cc: Peter Xu 
Cc: Peter Zijlstra 
Cc: Punit Agrawal 
Cc: Sebastian Andrzej Siewior 
Cc: Shakeel Butt 
Cc: Soheil Hassas Yeganeh 
Cc: Song Liu 
Cc: Vlastimil Babka 
Cc: Will Deacon 
Signed-off-by: Andrew Morton

fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio

2022-03-16T17:37:05+00:00

This is a mechanical change.

Signed-off-by: Matthew Wilcox (Oracle) 
Tested-by: Damien Le Moal 
Acked-by: Damien Le Moal 
Tested-by: Mike Marshall  # orangefs
Tested-by: David Howells  # afs

fs: Remove noop_invalidatepage()

2022-03-15T12:23:29+00:00

We used to have to use noop_invalidatepage() to prevent
block_invalidatepage() from being called, but that behaviour is now gone.

Signed-off-by: Matthew Wilcox (Oracle) 
Tested-by: Damien Le Moal 
Acked-by: Damien Le Moal 
Tested-by: Mike Marshall  # orangefs
Tested-by: David Howells  # afs

Merge branch 'akpm' (patches from Andrew)

2022-01-15T18:37:06+00:00

Merge misc updates from Andrew Morton:
 "146 patches.

  Subsystems affected by this patch series: kthread, ia64, scripts,
  ntfs, squashfs, ocfs2, vfs, and mm (slab-generic, slab, kmemleak,
  dax, kasan, debug, pagecache, gup, shmem, frontswap, memremap,
  memcg, selftests, pagemap, dma, vmalloc, memory-failure, hugetlb,
  userfaultfd, vmscan, mempolicy, oom-kill, hugetlbfs, migration, thp,
  ksm, page-poison, percpu, rmap, zswap, zram, cleanups, hmm, and
  damon)"

* emailed patches from Andrew Morton : (146 commits)
  mm/damon: hide kernel pointer from tracepoint event
  mm/damon/vaddr: hide kernel pointer from damon_va_three_regions() failure log
  mm/damon/vaddr: use pr_debug() for damon_va_three_regions() failure logging
  mm/damon/dbgfs: remove an unnecessary variable
  mm/damon: move the implementation of damon_insert_region to damon.h
  mm/damon: add access checking for hugetlb pages
  Docs/admin-guide/mm/damon/usage: update for schemes statistics
  mm/damon/dbgfs: support all DAMOS stats
  Docs/admin-guide/mm/damon/reclaim: document statistics parameters
  mm/damon/reclaim: provide reclamation statistics
  mm/damon/schemes: account how many times quota limit has exceeded
  mm/damon/schemes: account scheme actions that successfully applied
  mm/damon: remove a mistakenly added comment for a future feature
  Docs/admin-guide/mm/damon/usage: update for kdamond_pid and (mk|rm)_contexts
  Docs/admin-guide/mm/damon/usage: mention tracepoint at the beginning
  Docs/admin-guide/mm/damon/usage: remove redundant information
  Docs/admin-guide/mm/damon/usage: update for scheme quotas and watermarks
  mm/damon: convert macro functions to static inline functions
  mm/damon: modify damon_rand() macro to static inline function
  mm/damon: move damon_rand() definition into damon.h
  ...

device-dax: compound devmap support

2022-01-15T14:30:26+00:00

Use the newly added compound devmap facility which maps the assigned dax
ranges as compound pages at a page size of @align.

dax devices are created with a fixed @align (huge page size), which is
also enforced at mmap() of the device.  Consequently, faults also happen
at the @align specified at creation, and neither changes throughout the
dax device lifetime.  MCEs unmap a whole dax huge page, and splits
likewise occur at the configured page size.
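
The arithmetic behind an @align-sized compound page can be sketched in
userspace C as follows. The helper names are hypothetical and a 4 KiB
base page size is assumed; the sketch only illustrates the alignment
rules described above and is not the driver's code.

```c
#include <assert.h>
#include <stdbool.h>

#define BASE_PAGE_SIZE 4096UL	/* assumption: 4 KiB base pages */

static bool is_power_of_two(unsigned long x)
{
	return x && !(x & (x - 1));
}

/* Number of base pages covered by one compound page of size align. */
static unsigned long pages_per_compound(unsigned long align)
{
	assert(is_power_of_two(align) && align >= BASE_PAGE_SIZE);
	return align / BASE_PAGE_SIZE;
}

/* log2 of pages_per_compound(): the order of the compound page. */
static unsigned int compound_order(unsigned long align)
{
	unsigned long n = pages_per_compound(align);
	unsigned int order = 0;

	while (n > 1) {
		n >>= 1;
		order++;
	}
	return order;
}

/* A mapping is acceptable only if both ends are multiples of align. */
static bool range_is_aligned(unsigned long start, unsigned long end,
			     unsigned long align)
{
	unsigned long mask = align - 1;

	return !(start & mask) && !(end & mask);
}

/* Start of the align-sized block that a faulting address falls in. */
static unsigned long fault_block_start(unsigned long addr,
				       unsigned long align)
{
	return addr & ~(align - 1);
}
```

With a 2 MiB @align, one compound page spans 512 base pages (order 9),
so operations that act once per compound page touch far fewer struct
pages than operations that act once per base page.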

Performance measured by gup_test improves considerably for
unpin_user_pages() and altmap with NVDIMMs:

  $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S -a -n 512 -w
  (pin_user_pages_fast 2M pages) put:~71 ms -> put:~22 ms
  [altmap]
  (pin_user_pages_fast 2M pages) get:~524 ms put:~525 ms -> get:~127 ms put:~71 ms

  $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S -a -n 512 -w
  (pin_user_pages_fast 2M pages) put:~513 ms -> put:~188 ms
  [altmap with -m 127004]
  (pin_user_pages_fast 2M pages) get:~4.1 secs put:~4.12 secs -> get:~1 sec put:~563 ms

.. as well as unpin_user_page_range_dirty_lock() being just as effective
as THP/hugetlb[0] pages.

[0] https://lore.kernel.org/linux-mm/20210212130843.13865-5-joao.m.martins@oracle.com/

Link: https://lkml.kernel.org/r/20211202204422.26777-12-joao.m.martins@oracle.com
Signed-off-by: Joao Martins 
Reviewed-by: Dan Williams 
Cc: Christoph Hellwig 
Cc: Dave Jiang 
Cc: Jane Chu 
Cc: Jason Gunthorpe 
Cc: John Hubbard 
Cc: Jonathan Corbet 
Cc: Matthew Wilcox (Oracle) 
Cc: Mike Kravetz 
Cc: Muchun Song 
Cc: Naoya Horiguchi 
Cc: Vishal Verma 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

device-dax: remove pfn from __dev_dax_{pte,pmd,pud}_fault()

2022-01-15T14:30:26+00:00

After moving the page mapping to be set prior to pte insertion, the pfn
in dev_dax_huge_fault() is no longer necessary.  Remove it, as well as
the @pfn argument passed to the internal fault handler helpers.

[akpm@linux-foundation.org: fix CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD=n build]

Link: https://lkml.kernel.org/r/20211202204422.26777-11-joao.m.martins@oracle.com
Signed-off-by: Joao Martins 
Suggested-by: Christoph Hellwig 
Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Jane Chu 
Cc: Jason Gunthorpe 
Cc: John Hubbard 
Cc: Jonathan Corbet 
Cc: Matthew Wilcox (Oracle) 
Cc: Mike Kravetz 
Cc: Muchun Song 
Cc: Naoya Horiguchi 
Cc: Vishal Verma 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

device-dax: set mapping prior to vmf_insert_pfn{,_pmd,pud}()

2022-01-15T14:30:26+00:00

Normally, the @page mapping is set prior to inserting the page into a
page table entry.  Make device-dax adhere to the same ordering, rather
than setting mapping after the PTE is inserted.

The address_space never changes and it is always associated with the
same inode and underlying pages.  So, the page mapping is set once but
cleared when the struct pages are removed/freed (i.e.  after
{devm_}memunmap_pages()).
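
The ordering can be illustrated with a small userspace model. All toy_
names and types are invented for this sketch and only stand in for the
page, address_space and page table entry; it is not the device-dax
fault handler.

```c
#include <assert.h>
#include <stddef.h>

struct toy_mapping { int id; };	/* stands in for the address_space */

struct toy_page {
	struct toy_mapping *mapping;
	unsigned long index;
};

struct toy_pte { struct toy_page *page; }; /* stands in for a PTE */

/* Set the mapping once; later faults on the same page leave it alone. */
static void toy_set_mapping(struct toy_page *page, struct toy_mapping *m,
			    unsigned long index)
{
	if (page->mapping)
		return;
	page->mapping = m;
	page->index = index;
}

/* Publish the page; by this point it must already carry its mapping. */
static void toy_insert(struct toy_pte *pte, struct toy_page *page)
{
	assert(page->mapping != NULL);
	pte->page = page;
}

static void toy_fault(struct toy_pte *pte, struct toy_page *page,
		      struct toy_mapping *m, unsigned long index)
{
	toy_set_mapping(page, m, index);	/* first: set the mapping */
	toy_insert(pte, page);			/* then: make it visible */
}
```

Because the mapping is in place before the entry becomes visible,
nothing that finds the page through the page table can observe it
without a mapping, and a repeated fault does not overwrite the existing
mapping or index.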

Link: https://lkml.kernel.org/r/20211202204422.26777-10-joao.m.martins@oracle.com
Suggested-by: Jason Gunthorpe 
Signed-off-by: Joao Martins 
Cc: Christoph Hellwig 
Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Jane Chu 
Cc: Jason Gunthorpe 
Cc: John Hubbard 
Cc: Jonathan Corbet 
Cc: Matthew Wilcox (Oracle) 
Cc: Mike Kravetz 
Cc: Muchun Song 
Cc: Naoya Horiguchi 
Cc: Vishal Verma 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

device-dax: factor out page mapping initialization

2022-01-15T14:30:25+00:00

Move initialization of page->mapping into a separate helper.

This is in preparation for setting the mapping prior to inserting the
page table entry, and also for tidying up compound page handling into
one helper.

Link: https://lkml.kernel.org/r/20211202204422.26777-9-joao.m.martins@oracle.com
Signed-off-by: Joao Martins 
Cc: Christoph Hellwig 
Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Jane Chu 
Cc: Jason Gunthorpe 
Cc: John Hubbard 
Cc: Jonathan Corbet 
Cc: Matthew Wilcox (Oracle) 
Cc: Mike Kravetz 
Cc: Muchun Song 
Cc: Naoya Horiguchi 
Cc: Vishal Verma 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds