<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/fs/mbcache.c, branch v4.14</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>ext4: xattr inode deduplication</title>
<updated>2017-06-22T15:44:55+00:00</updated>
<author>
<name>Tahsin Erdogan</name>
<email>tahsin@google.com</email>
</author>
<published>2017-06-22T15:44:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=dec214d00e0d78a08b947d7dccdfdb84407a9f4d'/>
<id>dec214d00e0d78a08b947d7dccdfdb84407a9f4d</id>
<content type='text'>
Ext4 now supports xattr values that are up to 64k in size (vfs limit).
Large xattr values are stored in external inodes each one holding a
single value. Once written the data blocks of these inodes are immutable.

The real world use cases are expected to have a lot of value duplication
such as inherited acls etc. To reduce data duplication on disk, this patch
implements a deduplicator that allows sharing of xattr inodes.

The deduplication is based on an in-memory hash lookup that is a best
effort sharing scheme. When a xattr inode is read from disk (i.e.
getxattr() call), its crc32c hash is added to a hash table. Before
creating a new xattr inode for a value being set, the hash table is
checked to see if an existing inode holds an identical value. If such an
inode is found, the ref count on that inode is incremented. On value
removal the ref count is decremented and if it reaches zero the inode is
deleted.

The quota charging for such inodes is manually managed. Every reference
holder is charged the full size as if there was no sharing happening.
This is consistent with how xattr blocks are also charged.

[ Fixed up journal credits calculation to handle inline data and the
  rare case where an shared xattr block can get freed when two thread
  race on breaking the xattr block sharing. --tytso ]

Signed-off-by: Tahsin Erdogan &lt;tahsin@google.com&gt;
Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Ext4 now supports xattr values that are up to 64k in size (vfs limit).
Large xattr values are stored in external inodes each one holding a
single value. Once written the data blocks of these inodes are immutable.

The real world use cases are expected to have a lot of value duplication
such as inherited acls etc. To reduce data duplication on disk, this patch
implements a deduplicator that allows sharing of xattr inodes.

The deduplication is based on an in-memory hash lookup that is a best
effort sharing scheme. When a xattr inode is read from disk (i.e.
getxattr() call), its crc32c hash is added to a hash table. Before
creating a new xattr inode for a value being set, the hash table is
checked to see if an existing inode holds an identical value. If such an
inode is found, the ref count on that inode is incremented. On value
removal the ref count is decremented and if it reaches zero the inode is
deleted.

The quota charging for such inodes is manually managed. Every reference
holder is charged the full size as if there was no sharing happening.
This is consistent with how xattr blocks are also charged.

[ Fixed up journal credits calculation to handle inline data and the
  rare case where an shared xattr block can get freed when two thread
  race on breaking the xattr block sharing. --tytso ]

Signed-off-by: Tahsin Erdogan &lt;tahsin@google.com&gt;
Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mbcache: make mbcache naming more generic</title>
<updated>2017-06-22T14:29:53+00:00</updated>
<author>
<name>Tahsin Erdogan</name>
<email>tahsin@google.com</email>
</author>
<published>2017-06-22T14:29:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=c07dfcb45877fbc6798fa042bab3c4b85378efd4'/>
<id>c07dfcb45877fbc6798fa042bab3c4b85378efd4</id>
<content type='text'>
Make names more generic so that mbcache usage is not limited to
block sharing. In a subsequent patch in the series
("ext4: xattr inode deduplication"), we start using the mbcache code
for sharing xattr inodes. With that patch, old mb_cache_entry.e_block
field could be holding either a block number or an inode number.

Signed-off-by: Tahsin Erdogan &lt;tahsin@google.com&gt;
Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Make names more generic so that mbcache usage is not limited to
block sharing. In a subsequent patch in the series
("ext4: xattr inode deduplication"), we start using the mbcache code
for sharing xattr inodes. With that patch, old mb_cache_entry.e_block
field could be holding either a block number or an inode number.

Signed-off-by: Tahsin Erdogan &lt;tahsin@google.com&gt;
Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mbcache: document that "find" functions only return reusable entries</title>
<updated>2016-12-03T20:55:01+00:00</updated>
<author>
<name>Eric Biggers</name>
<email>ebiggers@google.com</email>
</author>
<published>2016-12-03T20:55:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=b649668c0bb3f54523cd932490a3f637859fdeb0'/>
<id>b649668c0bb3f54523cd932490a3f637859fdeb0</id>
<content type='text'>
mb_cache_entry_find_first() and mb_cache_entry_find_next() only return
cache entries with the 'e_reusable' bit set.  This should be documented.

Signed-off-by: Eric Biggers &lt;ebiggers@google.com&gt;
Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
mb_cache_entry_find_first() and mb_cache_entry_find_next() only return
cache entries with the 'e_reusable' bit set.  This should be documented.

Signed-off-by: Eric Biggers &lt;ebiggers@google.com&gt;
Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mbcache: use consistent type for entry count</title>
<updated>2016-12-03T20:43:48+00:00</updated>
<author>
<name>Eric Biggers</name>
<email>ebiggers@google.com</email>
</author>
<published>2016-12-03T20:43:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=132d4e2d55dfd9edae26cb35f088add7c19990e6'/>
<id>132d4e2d55dfd9edae26cb35f088add7c19990e6</id>
<content type='text'>
mbcache used several different types to represent the number of entries
in the cache.  For consistency within mbcache and with the shrinker API,
always use unsigned long.

This does not change behavior for current mbcache users (ext2 and ext4)
since they limit the entry count to a value which easily fits in an int.

Signed-off-by: Eric Biggers &lt;ebiggers@google.com&gt;
Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
mbcache used several different types to represent the number of entries
in the cache.  For consistency within mbcache and with the shrinker API,
always use unsigned long.

This does not change behavior for current mbcache users (ext2 and ext4)
since they limit the entry count to a value which easily fits in an int.

Signed-off-by: Eric Biggers &lt;ebiggers@google.com&gt;
Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mbcache: remove unnecessary module_get/module_put</title>
<updated>2016-12-03T20:38:29+00:00</updated>
<author>
<name>Eric Biggers</name>
<email>ebiggers@google.com</email>
</author>
<published>2016-12-03T20:38:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=97c7b18a5d15516a4358a6d2d5ddfb22d62d9fac'/>
<id>97c7b18a5d15516a4358a6d2d5ddfb22d62d9fac</id>
<content type='text'>
When mbcache is built as a module, any modules that use it (ext2 and/or
ext4) will depend on its symbols directly, incrementing its reference
count.  Therefore, there is no need to do module_get/module_put.

Also note that since the module_get/module_put were in the mbcache
module itself, executing those lines of code was already dependent on
another reference to the mbcache module being held.

Signed-off-by: Eric Biggers &lt;ebiggers@google.com&gt;
Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When mbcache is built as a module, any modules that use it (ext2 and/or
ext4) will depend on its symbols directly, incrementing its reference
count.  Therefore, there is no need to do module_get/module_put.

Also note that since the module_get/module_put were in the mbcache
module itself, executing those lines of code was already dependent on
another reference to the mbcache module being held.

Signed-off-by: Eric Biggers &lt;ebiggers@google.com&gt;
Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mbcache: don't BUG() if entry cache cannot be allocated</title>
<updated>2016-12-03T20:28:53+00:00</updated>
<author>
<name>Eric Biggers</name>
<email>ebiggers@google.com</email>
</author>
<published>2016-12-03T20:28:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=21d0f4fa8e7b026cdae988f99f4c3aab154d5b7c'/>
<id>21d0f4fa8e7b026cdae988f99f4c3aab154d5b7c</id>
<content type='text'>
mbcache can be a module that is loaded long after startup, when someone
asks to mount an ext2 or ext4 filesystem.  Therefore it should not BUG()
if kmem_cache_create() fails, but rather just fail the module load.

Signed-off-by: Eric Biggers &lt;ebiggers@google.com&gt;
Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
mbcache can be a module that is loaded long after startup, when someone
asks to mount an ext2 or ext4 filesystem.  Therefore it should not BUG()
if kmem_cache_create() fails, but rather just fail the module load.

Signed-off-by: Eric Biggers &lt;ebiggers@google.com&gt;
Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mbcache: correctly handle 'e_referenced' bit</title>
<updated>2016-12-03T20:13:15+00:00</updated>
<author>
<name>Eric Biggers</name>
<email>ebiggers@google.com</email>
</author>
<published>2016-12-03T20:13:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=918b7306edacbcc8a9ca318a5a34d73954e1705d'/>
<id>918b7306edacbcc8a9ca318a5a34d73954e1705d</id>
<content type='text'>
mbcache entries have an 'e_referenced' bit which users can set with
mb_cache_entry_touch() to indicate that an entry should be given another
pass through the LRU list before the shrinker can delete it.  However,
mb_cache_shrink() actually would, when seeing an e_referenced entry at
the front of the list (the least-recently used end), place it right at
the front of the list again.  The next iteration would then remove the
entry from the list and delete it.  Consequently, e_referenced had
essentially no effect, so ext2/ext4 xattr blocks would sometimes not be
reused as often as expected.

Fix this by making the shrinker move e_referenced entries to the back of
the list rather than the front.

Signed-off-by: Eric Biggers &lt;ebiggers@google.com&gt;
Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
mbcache entries have an 'e_referenced' bit which users can set with
mb_cache_entry_touch() to indicate that an entry should be given another
pass through the LRU list before the shrinker can delete it.  However,
mb_cache_shrink() actually would, when seeing an e_referenced entry at
the front of the list (the least-recently used end), place it right at
the front of the list again.  The next iteration would then remove the
entry from the list and delete it.  Consequently, e_referenced had
essentially no effect, so ext2/ext4 xattr blocks would sometimes not be
reused as often as expected.

Fix this by making the shrinker move e_referenced entries to the back of
the list rather than the front.

Signed-off-by: Eric Biggers &lt;ebiggers@google.com&gt;
Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mbcache: fix to detect failure of register_shrinker</title>
<updated>2016-08-31T15:44:36+00:00</updated>
<author>
<name>Chao Yu</name>
<email>yuchao0@huawei.com</email>
</author>
<published>2016-08-31T15:44:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=8913f343cdb568222c95afe3cad78aee1a4df56b'/>
<id>8913f343cdb568222c95afe3cad78aee1a4df56b</id>
<content type='text'>
register_shrinker in mb_cache_create may fail due to no memory. This
patch fixes to do the check of return value of register_shrinker and
handle the error case, otherwise mb_cache_create may return with no
error, but losing the inner shrinker.

Signed-off-by: Chao Yu &lt;yuchao0@huawei.com&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
register_shrinker in mb_cache_create may fail due to no memory. This
patch fixes to do the check of return value of register_shrinker and
handle the error case, otherwise mb_cache_create may return with no
error, but losing the inner shrinker.

Signed-off-by: Chao Yu &lt;yuchao0@huawei.com&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mbcache: add reusable flag to cache entries</title>
<updated>2016-02-23T03:44:04+00:00</updated>
<author>
<name>Andreas Gruenbacher</name>
<email>agruenba@redhat.com</email>
</author>
<published>2016-02-23T03:44:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=6048c64b26097a0ffbd966866b599f990e674e9b'/>
<id>6048c64b26097a0ffbd966866b599f990e674e9b</id>
<content type='text'>
To reduce amount of damage caused by single bad block, we limit number
of inodes sharing an xattr block to 1024. Thus there can be more xattr
blocks with the same contents when there are lots of files with the same
extended attributes. These xattr blocks naturally result in hash
collisions and can form long hash chains and we unnecessarily check each
such block only to find out we cannot use it because it is already
shared by too many inodes.

Add a reusable flag to cache entries which is cleared when a cache entry
has reached its maximum refcount.  Cache entries which are not marked
reusable are skipped by mb_cache_entry_find_{first,next}. This
significantly speeds up mbcache when there are many same xattr blocks.
For example for xattr-bench with 5 values and each process handling
20000 files, the run for 64 processes is 25x faster with this patch.
Even for 8 processes the speedup is almost 3x. We have also verified
that for situations where there is only one xattr block of each kind,
the patch doesn't have a measurable cost.

[JK: Remove handling of setting the same value since it is not needed
anymore, check for races in e_reusable setting, improve changelog,
add measurements]

Signed-off-by: Andreas Gruenbacher &lt;agruenba@redhat.com&gt;
Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
To reduce amount of damage caused by single bad block, we limit number
of inodes sharing an xattr block to 1024. Thus there can be more xattr
blocks with the same contents when there are lots of files with the same
extended attributes. These xattr blocks naturally result in hash
collisions and can form long hash chains and we unnecessarily check each
such block only to find out we cannot use it because it is already
shared by too many inodes.

Add a reusable flag to cache entries which is cleared when a cache entry
has reached its maximum refcount.  Cache entries which are not marked
reusable are skipped by mb_cache_entry_find_{first,next}. This
significantly speeds up mbcache when there are many same xattr blocks.
For example for xattr-bench with 5 values and each process handling
20000 files, the run for 64 processes is 25x faster with this patch.
Even for 8 processes the speedup is almost 3x. We have also verified
that for situations where there is only one xattr block of each kind,
the patch doesn't have a measurable cost.

[JK: Remove handling of setting the same value since it is not needed
anymore, check for races in e_reusable setting, improve changelog,
add measurements]

Signed-off-by: Andreas Gruenbacher &lt;agruenba@redhat.com&gt;
Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mbcache: get rid of _e_hash_list_head</title>
<updated>2016-02-23T03:42:05+00:00</updated>
<author>
<name>Andreas Gruenbacher</name>
<email>agruenba@redhat.com</email>
</author>
<published>2016-02-23T03:42:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=dc8d5e565f00c9442fa1cbf9acc115475628527c'/>
<id>dc8d5e565f00c9442fa1cbf9acc115475628527c</id>
<content type='text'>
Get rid of field _e_hash_list_head in cache entries and add bit field
e_referenced instead.

Signed-off-by: Andreas Gruenbacher &lt;agruenba@redhat.com&gt;
Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Get rid of field _e_hash_list_head in cache entries and add bit field
e_referenced instead.

Signed-off-by: Andreas Gruenbacher &lt;agruenba@redhat.com&gt;
Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
</pre>
</div>
</content>
</entry>
</feed>
