| author | Mike Marshall <hubcap@omnibond.com> | 2026-04-13 11:18:23 -0400 |
|---|---|---|
| committer | Mike Marshall <hubcap@omnibond.com> | 2026-04-13 12:14:17 -0400 |
| commit | e61bc5e4d87433c8759e7dc92bb640ef71a8970c | |
| tree | c4e0e50b5691128f37b6413e444f3ae5f466f11e | |
| parent | 092e0d0e964279feb9f43f81e8d1c52ef080d085 | |
bufmap: manage as folios, V2.
Thanks for the feedback from Dan Carpenter and Arnd Bergmann.
Dan suggested making the rollback loop in orangefs_bufmap_map
more robust.
Arnd caught a %ld format specifier used for a size_t in
orangefs_bufmap_copy_to_iovec. He suggested %zd; I used %zu,
which I think is also fine since size_t is unsigned.
Orangefs userspace allocates 40 megabytes at a page-aligned
address.
With this folio modification the allocation is aligned on a multiple of
2 megabytes:
posix_memalign(&ptr, 2097152, 41943040);
Then userspace tries to enable Huge Pages for the range:
madvise(ptr, 41943040, MADV_HUGEPAGE);
Userspace provides the address of the 40 megabyte allocation to
the Orangefs kernel module with an ioctl.
The kernel module initializes the memory as a "bufmap" with ten
4 megabyte "slots".
Traditionally, the slots are manipulated a page at a time.
This folio/bufmap modification manages the slots as folios, with
two 2 megabyte folios per slot, so data can be read into
and out of each slot a folio at a time.
This modification also works when orangefs userspace lacks
the THP-focused posix_memalign and madvise settings listed above;
in that case each slot can end up being made of page-sized folios.
It also works if some, but fewer than 20, hugepages are available.
A message describing the folio/page ratio is printed to the kernel
ring buffer (dmesg) when userspace starts. As an example, I started
orangefs and saw "Grouped 2575 folios from 10240 pages" in the ring
buffer.
To get the optimum ratio, 20/10240, I use these settings before
starting the orangefs userspace:
echo always > /sys/kernel/mm/transparent_hugepage/enabled
echo always > /sys/kernel/mm/transparent_hugepage/defrag
echo 30 > /proc/sys/vm/nr_hugepages
https://docs.kernel.org/admin-guide/mm/hugetlbpage.html discusses
hugepages and manipulating the /proc/sys/vm settings.
Comparing the performance between the page/bufmap and the folio/bufmap
is a mixed bag.
- The folio/bufmap version is about 8% faster at running through the
xfstest suite on my VMs.
- It is easy to construct an fio test that brings the page/bufmap
version to its knees on my dinky VM test system, with all bufmap
slots used and I/O timeouts cascading.
- Some smaller tests I did with fio that didn't overwhelm the
page/bufmap version showed no performance gain with the
folio/bufmap version on my VM.
I suspect this change will improve performance only in some use-cases.
I think it will be a gain when there are many concurrent IOs that
mostly fill the bufmap. I'm working up a gcloud test for that.
Reported-by: Dan Carpenter <error27@gmail.com>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
