linux-stable.git/fs/fuse/dev.c, branch v6.1

mm: multi-gen LRU: groundwork

2022-09-27T02:46:09+00:00

Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in folio->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.
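
The truncation and the gen counter encoding described above can be sketched in userspace C. This is an illustration, not the kernel code: the sk_ prefixed helpers are hypothetical, the values of MAX_NR_GENS and MIN_NR_GENS are assumed, and truncation is assumed to be a simple modulo. The counter needs MAX_NR_GENS+1 distinct values (0 plus [1, MAX_NR_GENS]), hence the +1 inside order_base_2().

```c
#include <stdbool.h>

/* Assumed values for illustration only. */
#define MAX_NR_GENS 4UL
#define MIN_NR_GENS 2UL

/* Number of bits needed to hold n distinct values, i.e. ceil(log2(n)). */
static unsigned int sk_order_base_2(unsigned long n)
{
	unsigned int bits = 0;

	while ((1UL << bits) < n)
		bits++;
	return bits;
}

/* Truncate a monotonically increasing seq into an index to lists[]. */
static unsigned long sk_gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;
}

/* Gen counter value: [1, MAX_NR_GENS] while on a list, otherwise 0. */
static unsigned long sk_gen_counter(bool on_list, unsigned long seq)
{
	return on_list ? sk_gen_from_seq(seq) + 1 : 0;
}

/* The sliding window must span between MIN_NR_GENS and MAX_NR_GENS. */
static bool sk_window_valid(unsigned long min_seq, unsigned long max_seq)
{
	unsigned long nr_gens;

	if (max_seq < min_seq)
		return false;
	nr_gens = max_seq - min_seq + 1;
	return nr_gens >= MIN_NR_GENS && nr_gens <= MAX_NR_GENS;
}
```

With these assumed constants, sk_order_base_2(MAX_NR_GENS + 1) yields 3 bits, and a page in the generation with seq 7 would carry the counter value 4.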

There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations.  They form a closed-loop system, i.e., "the page reclaim". 
Both procedures can be invoked from userspace for the purposes of working
set estimation and proactive reclaim.  These techniques are commonly used
to optimize job scheduling (bin packing) in data centers [1][2].

To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive" will
be applied to the active/inactive LRU, as usual.

The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
   channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
   flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
   applications usually do not prepare themselves for major page
   faults like they do for blocked I/O. E.g., GUI applications
   commonly use dedicated I/O threads to avoid blocking rendering
   threads.

There are also two access patterns: one with temporal locality and the
other without.  For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
present; the latter channel is assumed to follow the latter pattern unless
outlying refaults have been observed [3][4].

The next patch will address the "outlying refaults".  Three macros, i.e.,
LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in
this patch to make the entire patchset less diffy.

A page is added to the youngest generation on faulting.  The aging needs
to check the accessed bit at least twice before handing this page over to
the eviction.  The first check takes care of the accessed bit set on the
initial fault; the second check makes sure this page has not been used
since then.  This protocol, AKA second chance, requires a minimum of two
generations, hence MIN_NR_GENS.
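
The second chance protocol can be modelled with a toy userspace sketch. All names here are hypothetical and the model is heavily simplified: it assumes that each aging pass moves accessed pages to the current youngest generation, clears their accessed bit and then creates a new generation, and that the eviction only consumes generations at least MIN_NR_GENS behind the youngest one.

```c
#include <stdbool.h>
#include <stddef.h>

#define MIN_NR_GENS 2UL	/* assumed value */

struct sk_lruvec {
	unsigned long max_seq;	/* youngest generation */
};

struct sk_page {
	bool accessed;		/* models the accessed bit */
	unsigned long seq;	/* generation the page belongs to */
};

/* A fault sets the accessed bit and adds the page to the youngest gen. */
static void sk_fault_in(struct sk_page *page, const struct sk_lruvec *lruvec)
{
	page->accessed = true;
	page->seq = lruvec->max_seq;
}

/* One aging pass: test-and-clear each accessed bit, then add a new gen. */
static void sk_age(struct sk_lruvec *lruvec, struct sk_page *pages, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (pages[i].accessed) {
			pages[i].accessed = false;
			pages[i].seq = lruvec->max_seq;
		}
	}
	lruvec->max_seq++;
}

/* The eviction leaves the youngest MIN_NR_GENS generations alone. */
static bool sk_evictable(const struct sk_page *page,
			 const struct sk_lruvec *lruvec)
{
	return page->seq + MIN_NR_GENS <= lruvec->max_seq;
}
```

In this model a freshly faulted page is not evictable after one sk_age() pass, because that pass only consumes the accessed bit set by the fault. It becomes evictable after a second pass, unless it was accessed in between.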

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/

Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
Signed-off-by: Yu Zhao 
Acked-by: Brian Geffon 
Acked-by: Jan Alexander Steffens (heftig) 
Acked-by: Oleksandr Natalenko 
Acked-by: Steven Barrett 
Acked-by: Suleiman Souhlal 
Tested-by: Daniel Byrne 
Tested-by: Donald Carr 
Tested-by: Holger Hoffstätte 
Tested-by: Konstantin Kharlamov 
Tested-by: Shuang Zhai 
Tested-by: Sofia Trinh 
Tested-by: Vaibhav Jain 
Cc: Andi Kleen 
Cc: Aneesh Kumar K.V 
Cc: Barry Song 
Cc: Catalin Marinas 
Cc: Dave Hansen 
Cc: Hillf Danton 
Cc: Jens Axboe 
Cc: Johannes Weiner 
Cc: Jonathan Corbet 
Cc: Linus Torvalds 
Cc: Matthew Wilcox 
Cc: Mel Gorman 
Cc: Miaohe Lin 
Cc: Michael Larabel 
Cc: Michal Hocko 
Cc: Mike Rapoport 
Cc: Peter Zijlstra 
Cc: Qi Zheng 
Cc: Tejun Heo 
Cc: Vlastimil Babka 
Cc: Will Deacon 
Signed-off-by: Andrew Morton

iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()

2022-08-09T02:37:22+00:00

Most of the users immediately follow successful iov_iter_get_pages()
with advancing by the amount it had returned.

Provide inline wrappers doing that, convert trivial open-coded
uses of those.

BTW, iov_iter_get_pages() never returns more than it had been asked
to; such checks in cifs ought to be removed someday...
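
The wrapper pattern can be sketched on a toy iterator. None of the sk_ names below exist in the kernel; they stand in for a non-advancing "get" primitive and an inline wrapper that advances by whatever the primitive returned on success.

```c
#include <stddef.h>

/* Toy iterator over a flat range; hypothetical, not struct iov_iter. */
struct sk_iter {
	size_t offset;	/* current position */
	size_t count;	/* bytes remaining */
};

/*
 * Non-advancing primitive: reports how many bytes were grabbed, never
 * more than maxsize, or a negative value when nothing is left.
 */
static long sk_iter_get(const struct sk_iter *it, size_t maxsize,
			size_t *start)
{
	size_t n = it->count < maxsize ? it->count : maxsize;

	if (n == 0)
		return -1;	/* stand-in for a real error code */
	*start = it->offset;
	return (long)n;
}

static void sk_iter_advance(struct sk_iter *it, size_t n)
{
	it->offset += n;
	it->count -= n;
}

/* Advancing variant: get, then advance by the amount returned. */
static long sk_iter_get_advance(struct sk_iter *it, size_t maxsize,
				size_t *start)
{
	long res = sk_iter_get(it, maxsize, start);

	if (res > 0)
		sk_iter_advance(it, (size_t)res);
	return res;
}
```

Callers that used to open-code the get-then-advance pair can call the wrapper instead; callers that need to inspect the result before advancing keep using the primitive.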

Reviewed-by: Jeff Layton 
Signed-off-by: Al Viro

new iov_iter flavour - ITER_UBUF

2022-08-09T02:37:15+00:00

Equivalent of single-segment iovec.  Initialized by iov_iter_ubuf(),
checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC
ones.

We are going to expose things like ->write_iter() et al. to those
in subsequent commits.

New predicate (user_backed_iter()) that is true for ITER_IOVEC and
ITER_UBUF; places like direct-IO handling should use that for
checking that pages we modify after getting them from iov_iter_get_pages()
would need to be dirtied.

DO NOT assume that replacing iter_is_iovec() with user_backed_iter()
will solve all problems - there's code that uses iter_is_iovec() to
decide how to poke around in iov_iter guts and for that the predicate
replacement obviously won't suffice.
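
The difference between the two kinds of checks can be sketched as follows. The types and helpers are hypothetical stand-ins with an sk_ prefix, not the kernel definitions: one predicate answers "is this user memory?", while code that looks inside the iterator still has to distinguish the flavours.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy subset of iterator flavours; hypothetical names. */
enum sk_iter_type { SK_ITER_IOVEC, SK_ITER_KVEC, SK_ITER_UBUF };

struct sk_iter {
	enum sk_iter_type type;
	size_t nr_segs;		/* unused for SK_ITER_UBUF */
};

static bool sk_iter_is_iovec(const struct sk_iter *it)
{
	return it->type == SK_ITER_IOVEC;
}

static bool sk_iter_is_ubuf(const struct sk_iter *it)
{
	return it->type == SK_ITER_UBUF;
}

/* True for both user-backed flavours. */
static bool sk_user_backed_iter(const struct sk_iter *it)
{
	return sk_iter_is_iovec(it) || sk_iter_is_ubuf(it);
}

/* Direct-IO style decision: the flavour itself does not matter. */
static bool sk_should_dirty_pages(const struct sk_iter *it)
{
	return sk_user_backed_iter(it);
}

/* Poking at the guts: the flavours must be handled separately. */
static size_t sk_user_segments(const struct sk_iter *it)
{
	if (sk_iter_is_iovec(it))
		return it->nr_segs;
	if (sk_iter_is_ubuf(it))
		return 1;	/* single segment by definition */
	return 0;		/* not user memory */
}
```

sk_should_dirty_pages() is the kind of call site where swapping the predicate is enough; sk_user_segments() is the kind where it is not.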

Signed-off-by: Al Viro

fuse: remove reliance on bdi congestion

2022-03-22T22:57:00+00:00

The bdi congestion tracking is not widely used and will be removed.

Fuse is one of a small number of filesystems that uses it, setting both
the sync (read) and async (write) congestion flags at what it determines
are appropriate times.

The only remaining effect of the sync flag is to cause read-ahead to be
skipped.  The only remaining effect of the async flag is to cause (some)
WB_SYNC_NONE writes to be skipped.

So instead of setting the flags, change:

 - .readahead to stop when it has submitted all non-async pages for
   read.

 - .writepages to do nothing if WB_SYNC_NONE and the flag would be set.

 - .writepage to return AOP_WRITEPAGE_ACTIVATE if WB_SYNC_NONE and the
   flag would be set.
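
The three changes can be sketched as pure decision helpers. Everything below is a hypothetical userspace model: it assumes that "the flag would be set" means the number of background requests has reached a congestion threshold, and that "all non-async pages submitted" means the pages remaining in the readahead window are no more than its async part.

```c
#include <stdbool.h>

enum sk_sync_mode { SK_WB_SYNC_NONE, SK_WB_SYNC_ALL };
enum sk_writepage_result { SK_WRITEPAGE_SUBMIT, SK_WRITEPAGE_ACTIVATE };

/* Toy connection state; field names are assumptions. */
struct sk_conn {
	unsigned int num_background;
	unsigned int congestion_threshold;
};

/* Would the old code have set the congestion flag at this point? */
static bool sk_would_be_congested(const struct sk_conn *fc)
{
	return fc->num_background >= fc->congestion_threshold;
}

/* .readahead: stop once only the async part of the window is left. */
static bool sk_readahead_should_stop(const struct sk_conn *fc,
				     unsigned int async_pages,
				     unsigned int remaining_pages)
{
	return sk_would_be_congested(fc) && async_pages >= remaining_pages;
}

/* .writepages: do nothing for WB_SYNC_NONE under congestion. */
static bool sk_writepages_should_skip(const struct sk_conn *fc,
				      enum sk_sync_mode mode)
{
	return mode == SK_WB_SYNC_NONE && sk_would_be_congested(fc);
}

/* .writepage: ask the caller to reactivate the page instead. */
static enum sk_writepage_result sk_writepage(const struct sk_conn *fc,
					     enum sk_sync_mode mode)
{
	if (mode == SK_WB_SYNC_NONE && sk_would_be_congested(fc))
		return SK_WRITEPAGE_ACTIVATE;
	return SK_WRITEPAGE_SUBMIT;
}
```

WB_SYNC_ALL writes are never skipped in this model, which matches the description above that only (some) WB_SYNC_NONE writes were affected by the async flag.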

The .writepage change causes a behavioural change in that pageout() can
now return PAGE_ACTIVATE instead of PAGE_KEEP, so SetPageActive() will be
called on the page which (I think) will further delay the next attempt at
writeout.  This might be a good thing.

Link: https://lkml.kernel.org/r/164549983737.9187.2627117501000365074.stgit@noble.brown
Signed-off-by: NeilBrown 
Cc: Anna Schumaker 
Cc: Chao Yu 
Cc: Darrick J. Wong 
Cc: Ilya Dryomov 
Cc: Jaegeuk Kim 
Cc: Jan Kara 
Cc: Jeff Layton 
Cc: Jens Axboe 
Cc: Lars Ellenberg 
Cc: Miklos Szeredi 
Cc: Paolo Valente 
Cc: Philipp Reisner 
Cc: Ryusuke Konishi 
Cc: Trond Myklebust 
Cc: Wu Fengguang 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

fuse: fix pipe buffer lifetime for direct_io

2022-03-07T15:30:44+00:00

In FOPEN_DIRECT_IO mode, fuse_file_write_iter() calls
fuse_direct_write_iter(), which normally calls fuse_direct_io(), which then
imports the write buffer with fuse_get_user_pages(), which uses
iov_iter_get_pages() to grab references to userspace pages instead of
actually copying memory.

On the filesystem device side, these pages can then either be read to
userspace (via fuse_dev_read()), or splice()d over into a pipe using
fuse_dev_splice_read() as pipe buffers with &nosteal_pipe_buf_ops.

This is wrong because after fuse_dev_do_read() unlocks the FUSE request,
the userspace filesystem can mark the request as completed, causing write()
to return. At that point, the userspace filesystem should no longer have
access to the pipe buffer.

Fix by copying pages coming from the user address space to new pipe
buffers.

Reported-by: Jann Horn 
Fixes: c3021629a0d8 ("fuse: support splice() reading from fuse device")
Cc: 
Signed-off-by: Miklos Szeredi

fuse: release pipe buf after last use

2021-11-25T13:05:18+00:00

Checking buf->flags should be done before the pipe_buf_release() is called
on the pipe buffer, since releasing the buffer might modify the flags.

page_cache_pipe_buf_release() does exactly this, which results in the
same VM_BUG_ON_PAGE(PageLRU(page)) that the original patch was trying
to fix.
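
The ordering requirement can be shown with a toy buffer whose release op clears the flags. The structure, the flag value and the helper names are hypothetical; only the order of operations is the point.

```c
#include <stdbool.h>

#define SK_BUF_FLAG_LRU 0x01u	/* assumed flag value */

struct sk_pipe_buf {
	unsigned int flags;
	bool released;
};

/* Stand-in for a release op that is free to modify the flags. */
static void sk_buf_release(struct sk_pipe_buf *buf)
{
	buf->flags = 0;
	buf->released = true;
}

/*
 * Read the flag first, release afterwards.  Returns true when the
 * page still needs to be added to the LRU by the caller.
 */
static bool sk_finish_buf(struct sk_pipe_buf *buf)
{
	bool need_lru_add = !(buf->flags & SK_BUF_FLAG_LRU);

	sk_buf_release(buf);
	return need_lru_add;
}
```

If the flag were read after sk_buf_release(), a buffer that had the flag set would look as if it did not, and the page would be treated as not yet on the LRU.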

Reported-by: Justin Forbes 
Fixes: 712a951025c0 ("fuse: fix page stealing")
Cc:  # v2.6.35
Signed-off-by: Miklos Szeredi

fuse: fix page stealing

2021-11-02T10:10:37+00:00

It is possible to trigger a crash by splicing anon pipe bufs to the fuse
device.

The reason for this is that anon_pipe_buf_release() will reuse buf->page if
the refcount is 1, but that page might have already been stolen and its
flags modified (e.g. PG_lru added).

This happens in the unlikely case of fuse_dev_splice_write() getting around
to calling pipe_buf_release() after a page has been stolen, added to the
page cache and removed from the page cache.

Fix by calling pipe_buf_release() right after the page was inserted into
the page cache.  In this case the page has an elevated refcount so any
release function will know that the page isn't reusable.
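
The effect of the elevated refcount can be modelled with a toy page. This is a hypothetical sketch: the release helper merely imitates a release op that recycles the page when it holds the only reference, and the page cache insertion is reduced to taking one extra reference.

```c
#include <stdbool.h>

struct sk_page {
	int refcount;
};

/*
 * Imitates a release op that keeps the page for reuse when the pipe
 * buffer holds the last reference.  Returns true if the page was
 * recycled, false if the reference was simply dropped.
 */
static bool sk_anon_buf_release(struct sk_page *page)
{
	if (page->refcount == 1)
		return true;	/* reference kept, page reused later */
	page->refcount--;
	return false;
}

/* Release right after the page cache has taken its own reference. */
static bool sk_insert_then_release(struct sk_page *page)
{
	page->refcount++;	/* models insertion into the page cache */
	return sk_anon_buf_release(page);
}
```

In this model sk_insert_then_release() never recycles the page, because the refcount seen by the release op is at least 2. Releasing only after the page has left the page cache again would see a refcount of 1 and recycle a page whose flags may have been modified.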

Reported-by: Frank Dinoff 
Link: https://lore.kernel.org/r/CAAmZXrsGg2xsP1CK+cbuEMumtrqdvD-NKnWzhNcvn71RV3c1yw@mail.gmail.com/
Fixes: dd3bb14f44a6 ("fuse: support splice() writing to fuse device")
Cc:  # v2.6.35
Signed-off-by: Miklos Szeredi

fuse: always invalidate attributes after writes

2021-10-28T07:45:32+00:00

Extend the fuse_write_update_attr() helper to invalidate cached attributes
after a write.

This has already been done in all cases except in fuse_notify_store(), so
this is mostly a cleanup.

fuse_direct_write_iter() calls fuse_direct_IO() which already calls
fuse_write_update_attr(), so don't repeat that again in the former.

Signed-off-by: Miklos Szeredi

fuse: rename fuse_write_update_size()

2021-10-28T07:45:32+00:00

This function already updates the attr_version in fuse_inode, regardless of
whether the size was changed or not.

Rename the helper to fuse_write_update_attr() to reflect the more generic
nature.

Signed-off-by: Miklos Szeredi

fuse: use kmap_local_page()

2021-10-22T15:03:01+00:00

Due to the introduction of kmap_local_*, the storage of slots used for
short-term mapping has changed from per-CPU to per-thread.  kmap_atomic()
disables preemption, while kmap_local_*() only disables migration.

There is no need to disable preemption in several of the kmap_atomic()
call sites in fuse.

Link: https://lwn.net/Articles/836144/
Signed-off-by: Peng Hao 
Signed-off-by: Miklos Szeredi