<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/fs/xfs, branch v2.6.35</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>xfs: track AGs with reclaimable inodes in per-ag radix tree</title>
<updated>2010-07-19T23:43:39+00:00</updated>
<author>
<name>Dave Chinner</name>
<email>dchinner@redhat.com</email>
</author>
<published>2010-07-19T23:43:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=16fd5367370099b59d96e30bb7d9de8d419659f2'/>
<id>16fd5367370099b59d96e30bb7d9de8d419659f2</id>
<content type='text'>
https://bugzilla.kernel.org/show_bug.cgi?id=16348

When the filesystem grows to a large number of allocation groups,
the summing of recalimable inodes gets expensive. In many cases,
most AGs won't have any reclaimable inodes and so we are wasting CPU
time aggregating over these AGs. This is particularly important for
the inode shrinker that gets called frequently under memory
pressure.

To avoid the overhead, track AGs with reclaimable inodes in the
per-ag radix tree so that we can find all the AGs with reclaimable
inodes via a simple gang tag lookup. This involves setting the tag
when the first reclaimable inode is tracked in the AG, and removing
the tag when the last reclaimable inode is removed from the tree.
Then the summation process becomes a loop walking the radix tree
summing AGs with the reclaim tag set.

This significantly reduces the overhead of scanning - a 6400 AG
filesystea now only uses about 25% of a cpu in kswapd while slab
reclaim progresses instead of being permanently stuck at 100% CPU
and making little progress. Clean filesystems filesystems will see
no overhead and the overhead only increases linearly with the number
of dirty AGs.

Signed-off-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
https://bugzilla.kernel.org/show_bug.cgi?id=16348

When the filesystem grows to a large number of allocation groups,
the summing of recalimable inodes gets expensive. In many cases,
most AGs won't have any reclaimable inodes and so we are wasting CPU
time aggregating over these AGs. This is particularly important for
the inode shrinker that gets called frequently under memory
pressure.

To avoid the overhead, track AGs with reclaimable inodes in the
per-ag radix tree so that we can find all the AGs with reclaimable
inodes via a simple gang tag lookup. This involves setting the tag
when the first reclaimable inode is tracked in the AG, and removing
the tag when the last reclaimable inode is removed from the tree.
Then the summation process becomes a loop walking the radix tree
summing AGs with the reclaim tag set.

This significantly reduces the overhead of scanning - a 6400 AG
filesystea now only uses about 25% of a cpu in kswapd while slab
reclaim progresses instead of being permanently stuck at 100% CPU
and making little progress. Clean filesystems filesystems will see
no overhead and the overhead only increases linearly with the number
of dirty AGs.

Signed-off-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>xfs: convert inode shrinker to per-filesystem contexts</title>
<updated>2010-07-19T22:07:02+00:00</updated>
<author>
<name>Dave Chinner</name>
<email>dchinner@redhat.com</email>
</author>
<published>2010-07-19T22:07:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=70e60ce71516c3a9e882edb70a09f696a05961db'/>
<id>70e60ce71516c3a9e882edb70a09f696a05961db</id>
<content type='text'>
Now the shrinker passes us a context, wire up a shrinker context per
filesystem. This allows us to remove the global mount list and the
locking problems that introduced. It also means that a shrinker call
does not need to traverse clean filesystems before finding a
filesystem with reclaimable inodes.  This significantly reduces
scanning overhead when lots of filesystems are present.

Signed-off-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Now the shrinker passes us a context, wire up a shrinker context per
filesystem. This allows us to remove the global mount list and the
locking problems that introduced. It also means that a shrinker call
does not need to traverse clean filesystems before finding a
filesystem with reclaimable inodes.  This significantly reduces
scanning overhead when lots of filesystems are present.

Signed-off-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: add context argument to shrinker callback</title>
<updated>2010-07-19T04:56:17+00:00</updated>
<author>
<name>Dave Chinner</name>
<email>dchinner@redhat.com</email>
</author>
<published>2010-07-19T04:56:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=7f8275d0d660c146de6ee3017e1e2e594c49e820'/>
<id>7f8275d0d660c146de6ee3017e1e2e594c49e820</id>
<content type='text'>
The current shrinker implementation requires the registered callback
to have global state to work from. This makes it difficult to shrink
caches that are not global (e.g. per-filesystem caches). Pass the shrinker
structure to the callback so that users can embed the shrinker structure
in the context the shrinker needs to operate on and get back to it in the
callback via container_of().

Signed-off-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The current shrinker implementation requires the registered callback
to have global state to work from. This makes it difficult to shrink
caches that are not global (e.g. per-filesystem caches). Pass the shrinker
structure to the callback so that users can embed the shrinker structure
in the context the shrinker needs to operate on and get back to it in the
callback via container_of().

Signed-off-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>xfs: remove block number from inode lookup code</title>
<updated>2010-06-24T01:35:17+00:00</updated>
<author>
<name>Dave Chinner</name>
<email>dchinner@redhat.com</email>
</author>
<published>2010-06-24T01:35:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=7b6259e7a83647948fa33a736cc832310c8d85aa'/>
<id>7b6259e7a83647948fa33a736cc832310c8d85aa</id>
<content type='text'>
The block number comes from bulkstat based inode lookups to shortcut
the mapping calculations. We ar enot able to trust anything from
bulkstat, so drop the block number as well so that the correct
lookups and mappings are always done.

Signed-off-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The block number comes from bulkstat based inode lookups to shortcut
the mapping calculations. We ar enot able to trust anything from
bulkstat, so drop the block number as well so that the correct
lookups and mappings are always done.

Signed-off-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>xfs: rename XFS_IGET_BULKSTAT to XFS_IGET_UNTRUSTED</title>
<updated>2010-06-24T01:15:47+00:00</updated>
<author>
<name>Dave Chinner</name>
<email>dchinner@redhat.com</email>
</author>
<published>2010-06-24T01:15:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=1920779e67cbf5ea8afef317777c5bf2b8096188'/>
<id>1920779e67cbf5ea8afef317777c5bf2b8096188</id>
<content type='text'>
Inode numbers may come from somewhere external to the filesystem
(e.g. file handles, bulkstat information) and so are inherently
untrusted. Rename the flag we use for these lookups to make it
obvious we are doing a lookup of an untrusted inode number and need
to verify it completely before trying to read it from disk.

Signed-off-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Inode numbers may come from somewhere external to the filesystem
(e.g. file handles, bulkstat information) and so are inherently
untrusted. Rename the flag we use for these lookups to make it
obvious we are doing a lookup of an untrusted inode number and need
to verify it completely before trying to read it from disk.

Signed-off-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>xfs: validate untrusted inode numbers during lookup</title>
<updated>2010-06-24T01:15:33+00:00</updated>
<author>
<name>Dave Chinner</name>
<email>dchinner@redhat.com</email>
</author>
<published>2010-06-24T01:15:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=7124fe0a5b619d65b739477b3b55a20bf805b06d'/>
<id>7124fe0a5b619d65b739477b3b55a20bf805b06d</id>
<content type='text'>
When we decode a handle or do a bulkstat lookup, we are using an
inode number we cannot trust to be valid. If we are deleting inode
chunks from disk (default noikeep mode), then we cannot trust the on
disk inode buffer for any given inode number to correctly reflect
whether the inode has been unlinked as the di_mode nor the
generation number may have been updated on disk.

This is due to the fact that when we delete an inode chunk, we do
not write the clusters back to disk when they are removed - instead
we mark them stale to avoid them being written back potentially over
the top of something that has been subsequently allocated at that
location. The result is that we can have locations of disk that look
like they contain valid inodes but in reality do not. Hence we
cannot simply convert the inode number to a block number and read
the location from disk to determine if the inode is valid or not.

As a result, and XFS_IGET_BULKSTAT lookup needs to actually look the
inode up in the inode allocation btree to determine if the inode
number is valid or not.

It should be noted even on ikeep filesystems, there is the
possibility that blocks on disk may look like valid inode clusters.
e.g. if there are filesystem images hosted on the filesystem. Hence
even for ikeep filesystems we really need to validate that the inode
number is valid before issuing the inode buffer read.

Signed-off-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When we decode a handle or do a bulkstat lookup, we are using an
inode number we cannot trust to be valid. If we are deleting inode
chunks from disk (default noikeep mode), then we cannot trust the on
disk inode buffer for any given inode number to correctly reflect
whether the inode has been unlinked as the di_mode nor the
generation number may have been updated on disk.

This is due to the fact that when we delete an inode chunk, we do
not write the clusters back to disk when they are removed - instead
we mark them stale to avoid them being written back potentially over
the top of something that has been subsequently allocated at that
location. The result is that we can have locations of disk that look
like they contain valid inodes but in reality do not. Hence we
cannot simply convert the inode number to a block number and read
the location from disk to determine if the inode is valid or not.

As a result, and XFS_IGET_BULKSTAT lookup needs to actually look the
inode up in the inode allocation btree to determine if the inode
number is valid or not.

It should be noted even on ikeep filesystems, there is the
possibility that blocks on disk may look like valid inode clusters.
e.g. if there are filesystem images hosted on the filesystem. Hence
even for ikeep filesystems we really need to validate that the inode
number is valid before issuing the inode buffer read.

Signed-off-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>xfs: always use iget in bulkstat</title>
<updated>2010-06-23T08:11:11+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2010-06-23T08:11:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=7dce11dbac54fce777eea0f5fb25b2694ccd7900'/>
<id>7dce11dbac54fce777eea0f5fb25b2694ccd7900</id>
<content type='text'>
The non-coherent bulkstat versionsthat look directly at the inode
buffers causes various problems with performance optimizations that
make increased use of just logging inodes.  This patch makes bulkstat
always use iget, which should be fast enough for normal use with the
radix-tree based inode cache introduced a while ago.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Dave Chinner &lt;dchinner@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The non-coherent bulkstat versionsthat look directly at the inode
buffers causes various problems with performance optimizations that
make increased use of just logging inodes.  This patch makes bulkstat
always use iget, which should be fast enough for normal use with the
radix-tree based inode cache introduced a while ago.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Dave Chinner &lt;dchinner@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>xfs: prevent swapext from operating on write-only files</title>
<updated>2010-06-24T02:07:47+00:00</updated>
<author>
<name>Dan Rosenberg</name>
<email>dan.j.rosenberg@gmail.com</email>
</author>
<published>2010-06-24T02:07:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=1817176a86352f65210139d4c794ad2d19fc6b63'/>
<id>1817176a86352f65210139d4c794ad2d19fc6b63</id>
<content type='text'>
This patch prevents user "foo" from using the SWAPEXT ioctl to swap
a write-only file owned by user "bar" into a file owned by "foo" and
subsequently reading it.  It does so by checking that the file
descriptors passed to the ioctl are also opened for reading.

Signed-off-by: Dan Rosenberg &lt;dan.j.rosenberg@gmail.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This patch prevents user "foo" from using the SWAPEXT ioctl to swap
a write-only file owned by user "bar" into a file owned by "foo" and
subsequently reading it.  It does so by checking that the file
descriptors passed to the ioctl are also opened for reading.

Signed-off-by: Dan Rosenberg &lt;dan.j.rosenberg@gmail.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>xfs: remove nr_to_write writeback windup.</title>
<updated>2010-06-09T01:12:44+00:00</updated>
<author>
<name>Dave Chinner</name>
<email>dchinner@redhat.com</email>
</author>
<published>2010-06-09T00:37:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=254c8c2dbf0e06a560a5814eb90cb628adb2de66'/>
<id>254c8c2dbf0e06a560a5814eb90cb628adb2de66</id>
<content type='text'>
Now that the background flush code has been fixed, we shouldn't need to
silently multiply the wbc-&gt;nr_to_write to get good writeback. Remove
that code.

Signed-off-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Now that the background flush code has been fixed, we shouldn't need to
silently multiply the wbc-&gt;nr_to_write to get good writeback. Remove
that code.

Signed-off-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'master' into for-linus</title>
<updated>2010-06-04T18:22:30+00:00</updated>
<author>
<name>Alex Elder</name>
<email>aelder@sgi.com</email>
</author>
<published>2010-06-04T18:22:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=1bf7dbfde8fe7ddaa8e2e1b4e0fc41a9fc6aa7a5'/>
<id>1bf7dbfde8fe7ddaa8e2e1b4e0fc41a9fc6aa7a5</id>
<content type='text'>
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
</pre>
</div>
</content>
</entry>
</feed>
