<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/fs/bcachefs/btree_write_buffer.c, branch v6.7</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>bcachefs: Heap allocate btree_trans</title>
<updated>2023-10-22T21:10:13+00:00</updated>
<author>
<name>Kent Overstreet</name>
<email>kent.overstreet@linux.dev</email>
</author>
<published>2023-09-12T21:16:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=6bd68ec266ad71827ef940151067b67b62fb8fed'/>
<id>6bd68ec266ad71827ef940151067b67b62fb8fed</id>
<content type='text'>
We're using more stack than we'd like in a number of functions, and
btree_trans is the biggest object that we stack allocate.

But we have to do a heap allocatation to initialize it anyways, so
there's no real downside to heap allocating the entire thing.

Signed-off-by: Kent Overstreet &lt;kent.overstreet@linux.dev&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We're using more stack than we'd like in a number of functions, and
btree_trans is the biggest object that we stack allocate.

But we have to do a heap allocatation to initialize it anyways, so
there's no real downside to heap allocating the entire thing.

Signed-off-by: Kent Overstreet &lt;kent.overstreet@linux.dev&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>bcachefs: Fix btree write buffer with snapshots btrees</title>
<updated>2023-10-22T21:10:11+00:00</updated>
<author>
<name>Kent Overstreet</name>
<email>kent.overstreet@linux.dev</email>
</author>
<published>2023-08-21T23:57:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=da525760802b9f18cd9eb9ecdb23952f41723de2'/>
<id>da525760802b9f18cd9eb9ecdb23952f41723de2</id>
<content type='text'>
Signed-off-by: Kent Overstreet &lt;kent.overstreet@linux.dev&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Signed-off-by: Kent Overstreet &lt;kent.overstreet@linux.dev&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>bcachefs: use prejournaled key updates for write buffer flushes</title>
<updated>2023-10-22T21:10:08+00:00</updated>
<author>
<name>Brian Foster</name>
<email>bfoster@redhat.com</email>
</author>
<published>2023-07-19T12:53:06+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=60a5b898007d766d6180ca101634bf7cad98d82f'/>
<id>60a5b898007d766d6180ca101634bf7cad98d82f</id>
<content type='text'>
The write buffer mechanism journals keys twice in certain
situations. A key is always journaled on write buffer insertion, and
is potentially journaled again if a write buffer flush falls into
either of the slow btree insert paths. This has shown to cause
journal recovery ordering problems in the event of an untimely
crash.

For example, consider if a key is inserted into index 0 of a write
buffer, the active write buffer switches to index 1, the key is
deleted in index 1, and then index 0 is flushed. If the original key
is rejournaled in the btree update from the index 0 flush, the (now
deleted) key is journaled in a seq buffer ahead of the latest
version of key (which was journaled when the key was deleted in
index 1). If the fs crashes while this is still observable in the
log, recovery sees the key from the btree update after the delete
key from the write buffer insert, which is the incorrect order. This
problem is occasionally reproduced by generic/388 and generally
manifests as one or more backpointer entry inconsistencies.

To avoid this problem, never rejournal write buffered key updates to
the associated btree. Instead, use prejournaled key updates to pass
the journal seq of the write buffer insert down to the btree insert,
which updates the btree leaf pin to reflect the seq of the key.

Note that tracking the seq is required instead of just using
NOJOURNAL here because otherwise we lose protection of the write
buffer pin when the buffer is flushed, which means the key can fall
off the tail of the on-disk journal before the btree leaf is flushed
and lead to similar recovery inconsistencies.

Signed-off-by: Brian Foster &lt;bfoster@redhat.com&gt;
Signed-off-by: Kent Overstreet &lt;kent.overstreet@linux.dev&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The write buffer mechanism journals keys twice in certain
situations. A key is always journaled on write buffer insertion, and
is potentially journaled again if a write buffer flush falls into
either of the slow btree insert paths. This has shown to cause
journal recovery ordering problems in the event of an untimely
crash.

For example, consider if a key is inserted into index 0 of a write
buffer, the active write buffer switches to index 1, the key is
deleted in index 1, and then index 0 is flushed. If the original key
is rejournaled in the btree update from the index 0 flush, the (now
deleted) key is journaled in a seq buffer ahead of the latest
version of key (which was journaled when the key was deleted in
index 1). If the fs crashes while this is still observable in the
log, recovery sees the key from the btree update after the delete
key from the write buffer insert, which is the incorrect order. This
problem is occasionally reproduced by generic/388 and generally
manifests as one or more backpointer entry inconsistencies.

To avoid this problem, never rejournal write buffered key updates to
the associated btree. Instead, use prejournaled key updates to pass
the journal seq of the write buffer insert down to the btree insert,
which updates the btree leaf pin to reflect the seq of the key.

Note that tracking the seq is required instead of just using
NOJOURNAL here because otherwise we lose protection of the write
buffer pin when the buffer is flushed, which means the key can fall
off the tail of the on-disk journal before the btree leaf is flushed
and lead to similar recovery inconsistencies.

Signed-off-by: Brian Foster &lt;bfoster@redhat.com&gt;
Signed-off-by: Kent Overstreet &lt;kent.overstreet@linux.dev&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>bcachefs: Add a race_fault() for write buffer slowpath</title>
<updated>2023-10-22T21:10:07+00:00</updated>
<author>
<name>Kent Overstreet</name>
<email>kent.overstreet@linux.dev</email>
</author>
<published>2023-07-12T15:43:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=d82978ca1593890a1b41eab6d06fe6e5950e4722'/>
<id>d82978ca1593890a1b41eab6d06fe6e5950e4722</id>
<content type='text'>
We haven't hooked up dynamic fault injection quite yet, but we will soon

Signed-off-by: Kent Overstreet &lt;kent.overstreet@linux.dev&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We haven't hooked up dynamic fault injection quite yet, but we will soon

Signed-off-by: Kent Overstreet &lt;kent.overstreet@linux.dev&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>bcachefs: Kill BTREE_INSERT_USE_RESERVE</title>
<updated>2023-10-22T21:10:05+00:00</updated>
<author>
<name>Kent Overstreet</name>
<email>kent.overstreet@linux.dev</email>
</author>
<published>2023-06-27T21:32:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=f33c58fc46a9c5bd6cbf90edb6ce17fa3fd912d5'/>
<id>f33c58fc46a9c5bd6cbf90edb6ce17fa3fd912d5</id>
<content type='text'>
Now that we have journal watermarks and alloc watermarks unified,
BTREE_INSERT_USE_RESERVE is redundant and can be deleted.

Signed-off-by: Kent Overstreet &lt;kent.overstreet@linux.dev&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Now that we have journal watermarks and alloc watermarks unified,
BTREE_INSERT_USE_RESERVE is redundant and can be deleted.

Signed-off-by: Kent Overstreet &lt;kent.overstreet@linux.dev&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>bcachefs: Kill JOURNAL_WATERMARK</title>
<updated>2023-10-22T21:10:05+00:00</updated>
<author>
<name>Kent Overstreet</name>
<email>kent.overstreet@linux.dev</email>
</author>
<published>2023-06-27T21:32:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=ec14fc6010fdcc40e54e289afc657a676ce93e72'/>
<id>ec14fc6010fdcc40e54e289afc657a676ce93e72</id>
<content type='text'>
This unifies JOURNAL_WATERMARK with BCH_WATERMARK; we're working towards
specifying watermarks once in the transaction commit path.

Signed-off-by: Kent Overstreet &lt;kent.overstreet@linux.dev&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This unifies JOURNAL_WATERMARK with BCH_WATERMARK; we're working towards
specifying watermarks once in the transaction commit path.

Signed-off-by: Kent Overstreet &lt;kent.overstreet@linux.dev&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>bcachefs: Write buffer flush needs BTREE_INSERT_NOCHECK_RW</title>
<updated>2023-10-22T21:10:04+00:00</updated>
<author>
<name>Kent Overstreet</name>
<email>kent.overstreet@linux.dev</email>
</author>
<published>2023-06-11T23:45:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=8e5b1115f1dd88125cbb06c344ba1f4214265042'/>
<id>8e5b1115f1dd88125cbb06c344ba1f4214265042</id>
<content type='text'>
btree write buffer flush is only invoked from contexts that already hold
a write ref, and checking if we're still RW could cause us to fail to
completely flush the write buffer when shutting down.

Signed-off-by: Kent Overstreet &lt;kent.overstreet@linux.dev&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
btree write buffer flush is only invoked from contexts that already hold
a write ref, and checking if we're still RW could cause us to fail to
completely flush the write buffer when shutting down.

Signed-off-by: Kent Overstreet &lt;kent.overstreet@linux.dev&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>bcachefs: more aggressive fast path write buffer key flushing</title>
<updated>2023-10-22T21:09:58+00:00</updated>
<author>
<name>Brian Foster</name>
<email>bfoster@redhat.com</email>
</author>
<published>2023-03-17T12:54:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=873555f04d81b49a96ea03b37dcd499c13e67742'/>
<id>873555f04d81b49a96ea03b37dcd499c13e67742</id>
<content type='text'>
The btree write buffer flush code is prone to causing journal
deadlock due to inefficient use and release of reservation space.
Reservation is not pre-reserved for write buffered keys (as is done
for key cache keys, for example), because the write buffer flush
side uses a fast path that attempts insertion without need for any
reservation at all.

The write buffer flush attempts to deal with this by inserting keys
using the BTREE_INSERT_JOURNAL_RECLAIM flag to return an error on
journal reservations that require blocking. Upon first error, it
falls back to a slow path that inserts in journal order and supports
moving the associated journal pin forward.

The problem is that under pathological conditions (i.e. smaller log,
larger write buffer and journal reservation pressure), we've seen
instances where the fast path fails fairly quickly without having
completed many insertions, and then the slow path is unable to push
the journal pin forward enough to free up the space it needs to
completely flush the buffer. This problem is occasionally reproduced
by fstest generic/333.

To avoid this problem, update the fast path algorithm to skip key
inserts that fail due to inability to acquire needed journal
reservation without immediately breaking out of the loop. Instead,
insert as many keys as possible, zap the sequence numbers to mark
them as processed, and then fall back to the slow path to process
the remaining set in journal order. This reduces the amount of
journal reservation that might be required to flush the entire
buffer and increases the odds that the slow path is able to move the
journal pin forward and free up space as keys are processed.

Signed-off-by: Brian Foster &lt;bfoster@redhat.com&gt;
Signed-off-by: Kent Overstreet &lt;kent.overstreet@linux.dev&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The btree write buffer flush code is prone to causing journal
deadlock due to inefficient use and release of reservation space.
Reservation is not pre-reserved for write buffered keys (as is done
for key cache keys, for example), because the write buffer flush
side uses a fast path that attempts insertion without need for any
reservation at all.

The write buffer flush attempts to deal with this by inserting keys
using the BTREE_INSERT_JOURNAL_RECLAIM flag to return an error on
journal reservations that require blocking. Upon first error, it
falls back to a slow path that inserts in journal order and supports
moving the associated journal pin forward.

The problem is that under pathological conditions (i.e. smaller log,
larger write buffer and journal reservation pressure), we've seen
instances where the fast path fails fairly quickly without having
completed many insertions, and then the slow path is unable to push
the journal pin forward enough to free up the space it needs to
completely flush the buffer. This problem is occasionally reproduced
by fstest generic/333.

To avoid this problem, update the fast path algorithm to skip key
inserts that fail due to inability to acquire needed journal
reservation without immediately breaking out of the loop. Instead,
insert as many keys as possible, zap the sequence numbers to mark
them as processed, and then fall back to the slow path to process
the remaining set in journal order. This reduces the amount of
journal reservation that might be required to flush the entire
buffer and increases the odds that the slow path is able to move the
journal pin forward and free up space as keys are processed.

Signed-off-by: Brian Foster &lt;bfoster@redhat.com&gt;
Signed-off-by: Kent Overstreet &lt;kent.overstreet@linux.dev&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>bcachefs: Private error codes: ENOMEM</title>
<updated>2023-10-22T21:09:57+00:00</updated>
<author>
<name>Kent Overstreet</name>
<email>kent.overstreet@linux.dev</email>
</author>
<published>2023-03-14T19:35:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=65d48e35250fe46a560dffa13876830336b152c9'/>
<id>65d48e35250fe46a560dffa13876830336b152c9</id>
<content type='text'>
This adds private error codes for most (but not all) of our ENOMEM uses,
which makes it easier to track down assorted allocation failures.

Signed-off-by: Kent Overstreet &lt;kent.overstreet@linux.dev&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This adds private error codes for most (but not all) of our ENOMEM uses,
which makes it easier to track down assorted allocation failures.

Signed-off-by: Kent Overstreet &lt;kent.overstreet@linux.dev&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>bcachefs: Fix for shared paths in write buffer flush</title>
<updated>2023-10-22T21:09:54+00:00</updated>
<author>
<name>Kent Overstreet</name>
<email>kent.overstreet@linux.dev</email>
</author>
<published>2023-02-26T20:48:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=747ded6ddfe88eb9644ee0512c061e46fe2fb09d'/>
<id>747ded6ddfe88eb9644ee0512c061e46fe2fb09d</id>
<content type='text'>
It's possible for bch2_write_buffer_flush_one() to end up with a shared
path, if called from a context that already has a btree iterator
pointing to a key being flushed. We have to be careful when that
happens, since we can't clone a path that holds write locks.

Signed-off-by: Kent Overstreet &lt;kent.overstreet@linux.dev&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
It's possible for bch2_write_buffer_flush_one() to end up with a shared
path, if called from a context that already has a btree iterator
pointing to a key being flushed. We have to be careful when that
happens, since we can't clone a path that holds write locks.

Signed-off-by: Kent Overstreet &lt;kent.overstreet@linux.dev&gt;
</pre>
</div>
</content>
</entry>
</feed>
