<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/fs, branch v3.18.26</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>ceph: fix message length computation</title>
<updated>2016-01-21T16:23:29+00:00</updated>
<author>
<name>Arnd Bergmann</name>
<email>arnd@arndb.de</email>
</author>
<published>2015-09-30T13:04:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=27327566b15132cfecc1ef056c26ea8ccceabd6e'/>
<id>27327566b15132cfecc1ef056c26ea8ccceabd6e</id>
<content type='text'>
[ Upstream commit 777d738a5e58ba3b6f3932ab1543ce93703f4873 ]

create_request_message() computes the maximum length of a message,
but uses the wrong type for the time stamp: sizeof(struct timespec)
may be 8 or 16 depending on the architecture, while sizeof(struct
ceph_timespec) is always 8, and that is what gets put into the
message.

Found while auditing the uses of timespec for y2038 problems.

Fixes: b8e69066d8af ("ceph: include time stamp in every MDS request")
Signed-off-by: Arnd Bergmann &lt;arnd@arndb.de&gt;
Signed-off-by: Yan, Zheng &lt;zyan@redhat.com&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 777d738a5e58ba3b6f3932ab1543ce93703f4873 ]

create_request_message() computes the maximum length of a message,
but uses the wrong type for the time stamp: sizeof(struct timespec)
may be 8 or 16 depending on the architecture, while sizeof(struct
ceph_timespec) is always 8, and that is what gets put into the
message.

Found while auditing the uses of timespec for y2038 problems.

Fixes: b8e69066d8af ("ceph: include time stamp in every MDS request")
Signed-off-by: Arnd Bergmann &lt;arnd@arndb.de&gt;
Signed-off-by: Yan, Zheng &lt;zyan@redhat.com&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>ocfs2: fix umask ignored issue</title>
<updated>2016-01-21T16:23:29+00:00</updated>
<author>
<name>Junxiao Bi</name>
<email>junxiao.bi@oracle.com</email>
</author>
<published>2015-11-20T23:57:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=33fde5c22c1a352ec56a7500ab94cdfc7574faed'/>
<id>33fde5c22c1a352ec56a7500ab94cdfc7574faed</id>
<content type='text'>
[ Upstream commit 8f1eb48758aacf6c1ffce18179295adbf3bd7640 ]

New created file's mode is not masked with umask, and this makes umask not
work for ocfs2 volume.

Fixes: 702e5bc ("ocfs2: use generic posix ACL infrastructure")
Signed-off-by: Junxiao Bi &lt;junxiao.bi@oracle.com&gt;
Cc: Gang He &lt;ghe@suse.com&gt;
Cc: Mark Fasheh &lt;mfasheh@suse.de&gt;
Cc: Joel Becker &lt;jlbec@evilplan.org&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 8f1eb48758aacf6c1ffce18179295adbf3bd7640 ]

New created file's mode is not masked with umask, and this makes umask not
work for ocfs2 volume.

Fixes: 702e5bc ("ocfs2: use generic posix ACL infrastructure")
Signed-off-by: Junxiao Bi &lt;junxiao.bi@oracle.com&gt;
Cc: Gang He &lt;ghe@suse.com&gt;
Cc: Mark Fasheh &lt;mfasheh@suse.de&gt;
Cc: Joel Becker &lt;jlbec@evilplan.org&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>nfs: if we have no valid attrs, then don't declare the attribute cache valid</title>
<updated>2016-01-21T16:23:29+00:00</updated>
<author>
<name>Jeff Layton</name>
<email>jlayton@poochiereds.net</email>
</author>
<published>2015-11-25T18:50:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=2de7d46289702bcc5727cb70857f8f6860c85ce8'/>
<id>2de7d46289702bcc5727cb70857f8f6860c85ce8</id>
<content type='text'>
[ Upstream commit c812012f9ca7cf89c9e1a1cd512e6c3b5be04b85 ]

If we pass in an empty nfs_fattr struct to nfs_update_inode, it will
(correctly) not update any of the attributes, but it then clears the
NFS_INO_INVALID_ATTR flag, which indicates that the attributes are
up to date. Don't clear the flag if the fattr struct has no valid
attrs to apply.

Reviewed-by: Steve French &lt;steve.french@primarydata.com&gt;
Signed-off-by: Jeff Layton &lt;jeff.layton@primarydata.com&gt;
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust &lt;trond.myklebust@primarydata.com&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit c812012f9ca7cf89c9e1a1cd512e6c3b5be04b85 ]

If we pass in an empty nfs_fattr struct to nfs_update_inode, it will
(correctly) not update any of the attributes, but it then clears the
NFS_INO_INVALID_ATTR flag, which indicates that the attributes are
up to date. Don't clear the flag if the fattr struct has no valid
attrs to apply.

Reviewed-by: Steve French &lt;steve.french@primarydata.com&gt;
Signed-off-by: Jeff Layton &lt;jeff.layton@primarydata.com&gt;
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust &lt;trond.myklebust@primarydata.com&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>nfs4: start callback_ident at idr 1</title>
<updated>2016-01-21T16:23:29+00:00</updated>
<author>
<name>Benjamin Coddington</name>
<email>bcodding@redhat.com</email>
</author>
<published>2015-11-20T14:56:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=d48e82da8b064ed3ce8bfa2d8b17d8566e811f4f'/>
<id>d48e82da8b064ed3ce8bfa2d8b17d8566e811f4f</id>
<content type='text'>
[ Upstream commit c68a027c05709330fe5b2f50c50d5fa02124b5d8 ]

If clp-&gt;cl_cb_ident is zero, then nfs_cb_idr_remove_locked() skips removing
it when the nfs_client is freed.  A decoding or server bug can then find
and try to put that first nfs_client which would lead to a crash.

Signed-off-by: Benjamin Coddington &lt;bcodding@redhat.com&gt;
Fixes: d6870312659d ("nfs4client: convert to idr_alloc()")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust &lt;trond.myklebust@primarydata.com&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit c68a027c05709330fe5b2f50c50d5fa02124b5d8 ]

If clp-&gt;cl_cb_ident is zero, then nfs_cb_idr_remove_locked() skips removing
it when the nfs_client is freed.  A decoding or server bug can then find
and try to put that first nfs_client which would lead to a crash.

Signed-off-by: Benjamin Coddington &lt;bcodding@redhat.com&gt;
Fixes: d6870312659d ("nfs4client: convert to idr_alloc()")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust &lt;trond.myklebust@primarydata.com&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>nfsd: serialize state seqid morphing operations</title>
<updated>2016-01-21T16:23:28+00:00</updated>
<author>
<name>Jeff Layton</name>
<email>jlayton@poochiereds.net</email>
</author>
<published>2015-09-17T11:47:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=e46b3f459f2efa11329013b3df96723ec4d2f314'/>
<id>e46b3f459f2efa11329013b3df96723ec4d2f314</id>
<content type='text'>
[ Upstream commit 35a92fe8770ce54c5eb275cd76128645bea2d200 ]

Andrew was seeing a race occur when an OPEN and OPEN_DOWNGRADE were
running in parallel. The server would receive the OPEN_DOWNGRADE first
and check its seqid, but then an OPEN would race in and bump it. The
OPEN_DOWNGRADE would then complete and bump the seqid again.  The result
was that the OPEN_DOWNGRADE would be applied after the OPEN, even though
it should have been rejected since the seqid changed.

The only recourse we have here I think is to serialize operations that
bump the seqid in a stateid, particularly when we're given a seqid in
the call. To address this, we add a new rw_semaphore to the
nfs4_ol_stateid struct. We do a down_write prior to checking the seqid
after looking up the stateid to ensure that nothing else is going to
bump it while we're operating on it.

In the case of OPEN, we do a down_read, as the call doesn't contain a
seqid. Those can run in parallel -- we just need to serialize them when
there is a concurrent OPEN_DOWNGRADE or CLOSE.

LOCK and LOCKU however always take the write lock as there is no
opportunity for parallelizing those.

Reported-and-Tested-by: Andrew W Elble &lt;aweits@rit.edu&gt;
Signed-off-by: Jeff Layton &lt;jeff.layton@primarydata.com&gt;
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields &lt;bfields@redhat.com&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 35a92fe8770ce54c5eb275cd76128645bea2d200 ]

Andrew was seeing a race occur when an OPEN and OPEN_DOWNGRADE were
running in parallel. The server would receive the OPEN_DOWNGRADE first
and check its seqid, but then an OPEN would race in and bump it. The
OPEN_DOWNGRADE would then complete and bump the seqid again.  The result
was that the OPEN_DOWNGRADE would be applied after the OPEN, even though
it should have been rejected since the seqid changed.

The only recourse we have here I think is to serialize operations that
bump the seqid in a stateid, particularly when we're given a seqid in
the call. To address this, we add a new rw_semaphore to the
nfs4_ol_stateid struct. We do a down_write prior to checking the seqid
after looking up the stateid to ensure that nothing else is going to
bump it while we're operating on it.

In the case of OPEN, we do a down_read, as the call doesn't contain a
seqid. Those can run in parallel -- we just need to serialize them when
there is a concurrent OPEN_DOWNGRADE or CLOSE.

LOCK and LOCKU however always take the write lock as there is no
opportunity for parallelizing those.

Reported-and-Tested-by: Andrew W Elble &lt;aweits@rit.edu&gt;
Signed-off-by: Jeff Layton &lt;jeff.layton@primarydata.com&gt;
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields &lt;bfields@redhat.com&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>ext4, jbd2: ensure entering into panic after recording an error in superblock</title>
<updated>2016-01-21T16:23:28+00:00</updated>
<author>
<name>Daeho Jeong</name>
<email>daeho.jeong@samsung.com</email>
</author>
<published>2015-10-18T21:02:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=8fecc1e2c4b4a71abbbead8afe3098fce2863569'/>
<id>8fecc1e2c4b4a71abbbead8afe3098fce2863569</id>
<content type='text'>
[ Upstream commit 4327ba52afd03fc4b5afa0ee1d774c9c5b0e85c5 ]

If a EXT4 filesystem utilizes JBD2 journaling and an error occurs, the
journaling will be aborted first and the error number will be recorded
into JBD2 superblock and, finally, the system will enter into the
panic state in "errors=panic" option.  But, in the rare case, this
sequence is little twisted like the below figure and it will happen
that the system enters into panic state, which means the system reset
in mobile environment, before completion of recording an error in the
journal superblock. In this case, e2fsck cannot recognize that the
filesystem failure occurred in the previous run and the corruption
wouldn't be fixed.

Task A                        Task B
ext4_handle_error()
-&gt; jbd2_journal_abort()
  -&gt; __journal_abort_soft()
    -&gt; __jbd2_journal_abort_hard()
    | -&gt; journal-&gt;j_flags |= JBD2_ABORT;
    |
    |                         __ext4_abort()
    |                         -&gt; jbd2_journal_abort()
    |                         | -&gt; __journal_abort_soft()
    |                         |   -&gt; if (journal-&gt;j_flags &amp; JBD2_ABORT)
    |                         |           return;
    |                         -&gt; panic()
    |
    -&gt; jbd2_journal_update_sb_errno()

Tested-by: Hobin Woo &lt;hobin.woo@samsung.com&gt;
Signed-off-by: Daeho Jeong &lt;daeho.jeong@samsung.com&gt;
Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 4327ba52afd03fc4b5afa0ee1d774c9c5b0e85c5 ]

If a EXT4 filesystem utilizes JBD2 journaling and an error occurs, the
journaling will be aborted first and the error number will be recorded
into JBD2 superblock and, finally, the system will enter into the
panic state in "errors=panic" option.  But, in the rare case, this
sequence is little twisted like the below figure and it will happen
that the system enters into panic state, which means the system reset
in mobile environment, before completion of recording an error in the
journal superblock. In this case, e2fsck cannot recognize that the
filesystem failure occurred in the previous run and the corruption
wouldn't be fixed.

Task A                        Task B
ext4_handle_error()
-&gt; jbd2_journal_abort()
  -&gt; __journal_abort_soft()
    -&gt; __jbd2_journal_abort_hard()
    | -&gt; journal-&gt;j_flags |= JBD2_ABORT;
    |
    |                         __ext4_abort()
    |                         -&gt; jbd2_journal_abort()
    |                         | -&gt; __journal_abort_soft()
    |                         |   -&gt; if (journal-&gt;j_flags &amp; JBD2_ABORT)
    |                         |           return;
    |                         -&gt; panic()
    |
    -&gt; jbd2_journal_update_sb_errno()

Tested-by: Hobin Woo &lt;hobin.woo@samsung.com&gt;
Signed-off-by: Daeho Jeong &lt;daeho.jeong@samsung.com&gt;
Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Btrfs: fix race when listing an inode's xattrs</title>
<updated>2016-01-21T16:23:27+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2015-11-09T18:06:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=c0cdea628d04f529189ae4fa84ce7abfefe938e3'/>
<id>c0cdea628d04f529189ae4fa84ce7abfefe938e3</id>
<content type='text'>
[ Upstream commit f1cd1f0b7d1b5d4aaa5711e8f4e4898b0045cb6d ]

When listing a inode's xattrs we have a time window where we race against
a concurrent operation for adding a new hard link for our inode that makes
us not return any xattr to user space. In order for this to happen, the
first xattr of our inode needs to be at slot 0 of a leaf and the previous
leaf must still have room for an inode ref (or extref) item, and this can
happen because an inode's listxattrs callback does not lock the inode's
i_mutex (nor does the VFS does it for us), but adding a hard link to an
inode makes the VFS lock the inode's i_mutex before calling the inode's
link callback.

If we have the following leafs:

               Leaf X (has N items)                    Leaf Y

 [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 XATTR_ITEM 12345), ... ]
           slot N - 2         slot N - 1              slot 0

The race illustrated by the following sequence diagram is possible:

       CPU 1                                               CPU 2

  btrfs_listxattr()

    searches for key (257 XATTR_ITEM 0)

    gets path with path-&gt;nodes[0] == leaf X
    and path-&gt;slots[0] == N

    because path-&gt;slots[0] is &gt;=
    btrfs_header_nritems(leaf X), it calls
    btrfs_next_leaf()

    btrfs_next_leaf()
      releases the path

                                                   adds key (257 INODE_REF 666)
                                                   to the end of leaf X (slot N),
                                                   and leaf X now has N + 1 items

      searches for the key (257 INODE_REF 256),
      with path-&gt;keep_locks == 1, because that
      is the last key it saw in leaf X before
      releasing the path

      ends up at leaf X again and it verifies
      that the key (257 INODE_REF 256) is no
      longer the last key in leaf X, so it
      returns with path-&gt;nodes[0] == leaf X
      and path-&gt;slots[0] == N, pointing to
      the new item with key (257 INODE_REF 666)

    btrfs_listxattr's loop iteration sees that
    the type of the key pointed by the path is
    different from the type BTRFS_XATTR_ITEM_KEY
    and so it breaks the loop and stops looking
    for more xattr items
      --&gt; the application doesn't get any xattr
          listed for our inode

So fix this by breaking the loop only if the key's type is greater than
BTRFS_XATTR_ITEM_KEY and skip the current key if its type is smaller.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit f1cd1f0b7d1b5d4aaa5711e8f4e4898b0045cb6d ]

When listing a inode's xattrs we have a time window where we race against
a concurrent operation for adding a new hard link for our inode that makes
us not return any xattr to user space. In order for this to happen, the
first xattr of our inode needs to be at slot 0 of a leaf and the previous
leaf must still have room for an inode ref (or extref) item, and this can
happen because an inode's listxattrs callback does not lock the inode's
i_mutex (nor does the VFS does it for us), but adding a hard link to an
inode makes the VFS lock the inode's i_mutex before calling the inode's
link callback.

If we have the following leafs:

               Leaf X (has N items)                    Leaf Y

 [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 XATTR_ITEM 12345), ... ]
           slot N - 2         slot N - 1              slot 0

The race illustrated by the following sequence diagram is possible:

       CPU 1                                               CPU 2

  btrfs_listxattr()

    searches for key (257 XATTR_ITEM 0)

    gets path with path-&gt;nodes[0] == leaf X
    and path-&gt;slots[0] == N

    because path-&gt;slots[0] is &gt;=
    btrfs_header_nritems(leaf X), it calls
    btrfs_next_leaf()

    btrfs_next_leaf()
      releases the path

                                                   adds key (257 INODE_REF 666)
                                                   to the end of leaf X (slot N),
                                                   and leaf X now has N + 1 items

      searches for the key (257 INODE_REF 256),
      with path-&gt;keep_locks == 1, because that
      is the last key it saw in leaf X before
      releasing the path

      ends up at leaf X again and it verifies
      that the key (257 INODE_REF 256) is no
      longer the last key in leaf X, so it
      returns with path-&gt;nodes[0] == leaf X
      and path-&gt;slots[0] == N, pointing to
      the new item with key (257 INODE_REF 666)

    btrfs_listxattr's loop iteration sees that
    the type of the key pointed by the path is
    different from the type BTRFS_XATTR_ITEM_KEY
    and so it breaks the loop and stops looking
    for more xattr items
      --&gt; the application doesn't get any xattr
          listed for our inode

So fix this by breaking the loop only if the key's type is greater than
BTRFS_XATTR_ITEM_KEY and skip the current key if its type is smaller.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow</title>
<updated>2016-01-21T16:23:27+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2015-11-09T00:33:58+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=938165ad1c0617047358cb0f0a2b27e7f92e8eca'/>
<id>938165ad1c0617047358cb0f0a2b27e7f92e8eca</id>
<content type='text'>
[ Upstream commit 1d512cb77bdbda80f0dd0620a3b260d697fd581d ]

If we are using the NO_HOLES feature, we have a tiny time window when
running delalloc for a nodatacow inode where we can race with a concurrent
link or xattr add operation leading to a BUG_ON.

This happens because at run_delalloc_nocow() we end up casting a leaf item
of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a
file extent item (struct btrfs_file_extent_item) and then analyse its
extent type field, which won't match any of the expected extent types
(values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an
explicit BUG_ON(1).

The following sequence diagram shows how the race happens when running a
no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following
neighbour leafs:

             Leaf X (has N items)                    Leaf Y

 [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 EXTENT_DATA 8192), ... ]
              slot N - 2         slot N - 1              slot 0

 (Note the implicit hole for inode 257 regarding the [0, 8K[ range)

       CPU 1                                         CPU 2

 run_dealloc_nocow()
   btrfs_lookup_file_extent()
     --&gt; searches for a key with value
         (257 EXTENT_DATA 4096) in the
         fs/subvol tree
     --&gt; returns us a path with
         path-&gt;nodes[0] == leaf X and
         path-&gt;slots[0] == N

   because path-&gt;slots[0] is &gt;=
   btrfs_header_nritems(leaf X), it
   calls btrfs_next_leaf()

   btrfs_next_leaf()
     --&gt; releases the path

                                              hard link added to our inode,
                                              with key (257 INODE_REF 500)
                                              added to the end of leaf X,
                                              so leaf X now has N + 1 keys

     --&gt; searches for the key
         (257 INODE_REF 256), because
         it was the last key in leaf X
         before it released the path,
         with path-&gt;keep_locks set to 1

     --&gt; ends up at leaf X again and
         it verifies that the key
         (257 INODE_REF 256) is no longer
         the last key in the leaf, so it
         returns with path-&gt;nodes[0] ==
         leaf X and path-&gt;slots[0] == N,
         pointing to the new item with
         key (257 INODE_REF 500)

   the loop iteration of run_dealloc_nocow()
   does not break out the loop and continues
   because the key referenced in the path
   at path-&gt;nodes[0] and path-&gt;slots[0] is
   for inode 257, its type is &lt; BTRFS_EXTENT_DATA_KEY
   and its offset (500) is less then our delalloc
   range's end (8192)

   the item pointed by the path, an inode reference item,
   is (incorrectly) interpreted as a file extent item and
   we get an invalid extent type, leading to the BUG_ON(1):

   if (extent_type == BTRFS_FILE_EXTENT_REG ||
      extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
       (...)
   } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
       (...)
   } else {
       BUG_ON(1)
   }

The same can happen if a xattr is added concurrently and ends up having
a key with an offset smaller then the delalloc's range end.

So fix this by skipping keys with a type smaller than
BTRFS_EXTENT_DATA_KEY.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 1d512cb77bdbda80f0dd0620a3b260d697fd581d ]

If we are using the NO_HOLES feature, we have a tiny time window when
running delalloc for a nodatacow inode where we can race with a concurrent
link or xattr add operation leading to a BUG_ON.

This happens because at run_delalloc_nocow() we end up casting a leaf item
of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a
file extent item (struct btrfs_file_extent_item) and then analyse its
extent type field, which won't match any of the expected extent types
(values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an
explicit BUG_ON(1).

The following sequence diagram shows how the race happens when running a
no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following
neighbour leafs:

             Leaf X (has N items)                    Leaf Y

 [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 EXTENT_DATA 8192), ... ]
              slot N - 2         slot N - 1              slot 0

 (Note the implicit hole for inode 257 regarding the [0, 8K[ range)

       CPU 1                                         CPU 2

 run_dealloc_nocow()
   btrfs_lookup_file_extent()
     --&gt; searches for a key with value
         (257 EXTENT_DATA 4096) in the
         fs/subvol tree
     --&gt; returns us a path with
         path-&gt;nodes[0] == leaf X and
         path-&gt;slots[0] == N

   because path-&gt;slots[0] is &gt;=
   btrfs_header_nritems(leaf X), it
   calls btrfs_next_leaf()

   btrfs_next_leaf()
     --&gt; releases the path

                                              hard link added to our inode,
                                              with key (257 INODE_REF 500)
                                              added to the end of leaf X,
                                              so leaf X now has N + 1 keys

     --&gt; searches for the key
         (257 INODE_REF 256), because
         it was the last key in leaf X
         before it released the path,
         with path-&gt;keep_locks set to 1

     --&gt; ends up at leaf X again and
         it verifies that the key
         (257 INODE_REF 256) is no longer
         the last key in the leaf, so it
         returns with path-&gt;nodes[0] ==
         leaf X and path-&gt;slots[0] == N,
         pointing to the new item with
         key (257 INODE_REF 500)

   the loop iteration of run_dealloc_nocow()
   does not break out the loop and continues
   because the key referenced in the path
   at path-&gt;nodes[0] and path-&gt;slots[0] is
   for inode 257, its type is &lt; BTRFS_EXTENT_DATA_KEY
   and its offset (500) is less then our delalloc
   range's end (8192)

   the item pointed by the path, an inode reference item,
   is (incorrectly) interpreted as a file extent item and
   we get an invalid extent type, leading to the BUG_ON(1):

   if (extent_type == BTRFS_FILE_EXTENT_REG ||
      extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
       (...)
   } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
       (...)
   } else {
       BUG_ON(1)
   }

The same can happen if a xattr is added concurrently and ends up having
a key with an offset smaller then the delalloc's range end.

So fix this by skipping keys with a type smaller than
BTRFS_EXTENT_DATA_KEY.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Btrfs: fix race leading to incorrect item deletion when dropping extents</title>
<updated>2016-01-21T16:23:27+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2015-11-06T13:33:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=c03844aec669f455b26f06a75e7d16d36bed32e0'/>
<id>c03844aec669f455b26f06a75e7d16d36bed32e0</id>
<content type='text'>
[ Upstream commit aeafbf8486c9e2bd53f5cc3c10c0b7fd7149d69c ]

While running a stress test I got the following warning triggered:

  [191627.672810] ------------[ cut here ]------------
  [191627.673949] WARNING: CPU: 8 PID: 8447 at fs/btrfs/file.c:779 __btrfs_drop_extents+0x391/0xa50 [btrfs]()
  (...)
  [191627.701485] Call Trace:
  [191627.702037]  [&lt;ffffffff8145f077&gt;] dump_stack+0x4f/0x7b
  [191627.702992]  [&lt;ffffffff81095de5&gt;] ? console_unlock+0x356/0x3a2
  [191627.704091]  [&lt;ffffffff8104b3b0&gt;] warn_slowpath_common+0xa1/0xbb
  [191627.705380]  [&lt;ffffffffa0664499&gt;] ? __btrfs_drop_extents+0x391/0xa50 [btrfs]
  [191627.706637]  [&lt;ffffffff8104b46d&gt;] warn_slowpath_null+0x1a/0x1c
  [191627.707789]  [&lt;ffffffffa0664499&gt;] __btrfs_drop_extents+0x391/0xa50 [btrfs]
  [191627.709155]  [&lt;ffffffff8115663c&gt;] ? cache_alloc_debugcheck_after.isra.32+0x171/0x1d0
  [191627.712444]  [&lt;ffffffff81155007&gt;] ? kmemleak_alloc_recursive.constprop.40+0x16/0x18
  [191627.714162]  [&lt;ffffffffa06570c9&gt;] insert_reserved_file_extent.constprop.40+0x83/0x24e [btrfs]
  [191627.715887]  [&lt;ffffffffa065422b&gt;] ? start_transaction+0x3bb/0x610 [btrfs]
  [191627.717287]  [&lt;ffffffffa065b604&gt;] btrfs_finish_ordered_io+0x273/0x4e2 [btrfs]
  [191627.728865]  [&lt;ffffffffa065b888&gt;] finish_ordered_fn+0x15/0x17 [btrfs]
  [191627.730045]  [&lt;ffffffffa067d688&gt;] normal_work_helper+0x14c/0x32c [btrfs]
  [191627.731256]  [&lt;ffffffffa067d96a&gt;] btrfs_endio_write_helper+0x12/0x14 [btrfs]
  [191627.732661]  [&lt;ffffffff81061119&gt;] process_one_work+0x24c/0x4ae
  [191627.733822]  [&lt;ffffffff810615b0&gt;] worker_thread+0x206/0x2c2
  [191627.734857]  [&lt;ffffffff810613aa&gt;] ? process_scheduled_works+0x2f/0x2f
  [191627.736052]  [&lt;ffffffff810613aa&gt;] ? process_scheduled_works+0x2f/0x2f
  [191627.737349]  [&lt;ffffffff810669a6&gt;] kthread+0xef/0xf7
  [191627.738267]  [&lt;ffffffff810f3b3a&gt;] ? time_hardirqs_on+0x15/0x28
  [191627.739330]  [&lt;ffffffff810668b7&gt;] ? __kthread_parkme+0xad/0xad
  [191627.741976]  [&lt;ffffffff81465592&gt;] ret_from_fork+0x42/0x70
  [191627.743080]  [&lt;ffffffff810668b7&gt;] ? __kthread_parkme+0xad/0xad
  [191627.744206] ---[ end trace bbfddacb7aaada8d ]---

  $ cat -n fs/btrfs/file.c
  691  int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
  (...)
  758                  btrfs_item_key_to_cpu(leaf, &amp;key, path-&gt;slots[0]);
  759                  if (key.objectid &gt; ino ||
  760                      key.type &gt; BTRFS_EXTENT_DATA_KEY || key.offset &gt;= end)
  761                          break;
  762
  763                  fi = btrfs_item_ptr(leaf, path-&gt;slots[0],
  764                                      struct btrfs_file_extent_item);
  765                  extent_type = btrfs_file_extent_type(leaf, fi);
  766
  767                  if (extent_type == BTRFS_FILE_EXTENT_REG ||
  768                      extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
  (...)
  774                  } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
  (...)
  778                  } else {
  779                          WARN_ON(1);
  780                          extent_end = search_start;
  781                  }
  (...)

This happened because the item we were processing did not match a file
extent item (its key type != BTRFS_EXTENT_DATA_KEY), and even on this
case we cast the item to a struct btrfs_file_extent_item pointer and
then find a type field value that does not match any of the expected
values (BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]). This scenario happens
due to a tiny time window where a race can happen as exemplified below.
For example, consider the following scenario where we're using the
NO_HOLES feature and we have the following two neighbour leafs:

               Leaf X (has N items)                    Leaf Y

[ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 EXTENT_DATA 8192), ... ]
          slot N - 2         slot N - 1              slot 0

Our inode 257 has an implicit hole in the range [0, 8K[ (implicit rather
than explicit because NO_HOLES is enabled). Now if our inode has an
ordered extent for the range [4K, 8K[ that is finishing, the following
can happen:

          CPU 1                                       CPU 2

  btrfs_finish_ordered_io()
    insert_reserved_file_extent()
      __btrfs_drop_extents()
         Searches for the key
          (257 EXTENT_DATA 4096) through
          btrfs_lookup_file_extent()

         Key not found and we get a path where
         path-&gt;nodes[0] == leaf X and
         path-&gt;slots[0] == N

         Because path-&gt;slots[0] is &gt;=
         btrfs_header_nritems(leaf X), we call
         btrfs_next_leaf()

         btrfs_next_leaf() releases the path

                                                  inserts key
                                                  (257 INODE_REF 4096)
                                                  at the end of leaf X,
                                                  leaf X now has N + 1 keys,
                                                  and the new key is at
                                                  slot N

         btrfs_next_leaf() searches for
         key (257 INODE_REF 256), with
         path-&gt;keep_locks set to 1,
         because it was the last key it
         saw in leaf X

           finds it in leaf X again and
           notices it's no longer the last
           key of the leaf, so it returns 0
           with path-&gt;nodes[0] == leaf X and
           path-&gt;slots[0] == N (which is now
           &lt; btrfs_header_nritems(leaf X)),
           pointing to the new key
           (257 INODE_REF 4096)

         __btrfs_drop_extents() casts the
         item at path-&gt;nodes[0], slot
         path-&gt;slots[0], to a struct
         btrfs_file_extent_item - it does
         not skip keys for the target
         inode with a type less than
         BTRFS_EXTENT_DATA_KEY
         (BTRFS_INODE_REF_KEY &lt; BTRFS_EXTENT_DATA_KEY)

         sees a bogus value for the type
         field triggering the WARN_ON in
         the trace shown above, and sets
         extent_end = search_start (4096)

         does the if-then-else logic to
         fixup 0 length extent items created
         by a past bug from hole punching:

           if (extent_end == key.offset &amp;&amp;
               extent_end &gt;= search_start)
               goto delete_extent_item;

         that evaluates to true and it ends
         up deleting the key pointed to by
         path-&gt;slots[0], (257 INODE_REF 4096),
         from leaf X

The same could happen for example for a xattr that ends up having a key
with an offset value that matches search_start (very unlikely but not
impossible).

So fix this by ensuring that keys smaller than BTRFS_EXTENT_DATA_KEY are
skipped, never casted to struct btrfs_file_extent_item and never deleted
by accident. Also protect against the unexpected case of getting a key
for a lower inode number by skipping that key and issuing a warning.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit aeafbf8486c9e2bd53f5cc3c10c0b7fd7149d69c ]

While running a stress test I got the following warning triggered:

  [191627.672810] ------------[ cut here ]------------
  [191627.673949] WARNING: CPU: 8 PID: 8447 at fs/btrfs/file.c:779 __btrfs_drop_extents+0x391/0xa50 [btrfs]()
  (...)
  [191627.701485] Call Trace:
  [191627.702037]  [&lt;ffffffff8145f077&gt;] dump_stack+0x4f/0x7b
  [191627.702992]  [&lt;ffffffff81095de5&gt;] ? console_unlock+0x356/0x3a2
  [191627.704091]  [&lt;ffffffff8104b3b0&gt;] warn_slowpath_common+0xa1/0xbb
  [191627.705380]  [&lt;ffffffffa0664499&gt;] ? __btrfs_drop_extents+0x391/0xa50 [btrfs]
  [191627.706637]  [&lt;ffffffff8104b46d&gt;] warn_slowpath_null+0x1a/0x1c
  [191627.707789]  [&lt;ffffffffa0664499&gt;] __btrfs_drop_extents+0x391/0xa50 [btrfs]
  [191627.709155]  [&lt;ffffffff8115663c&gt;] ? cache_alloc_debugcheck_after.isra.32+0x171/0x1d0
  [191627.712444]  [&lt;ffffffff81155007&gt;] ? kmemleak_alloc_recursive.constprop.40+0x16/0x18
  [191627.714162]  [&lt;ffffffffa06570c9&gt;] insert_reserved_file_extent.constprop.40+0x83/0x24e [btrfs]
  [191627.715887]  [&lt;ffffffffa065422b&gt;] ? start_transaction+0x3bb/0x610 [btrfs]
  [191627.717287]  [&lt;ffffffffa065b604&gt;] btrfs_finish_ordered_io+0x273/0x4e2 [btrfs]
  [191627.728865]  [&lt;ffffffffa065b888&gt;] finish_ordered_fn+0x15/0x17 [btrfs]
  [191627.730045]  [&lt;ffffffffa067d688&gt;] normal_work_helper+0x14c/0x32c [btrfs]
  [191627.731256]  [&lt;ffffffffa067d96a&gt;] btrfs_endio_write_helper+0x12/0x14 [btrfs]
  [191627.732661]  [&lt;ffffffff81061119&gt;] process_one_work+0x24c/0x4ae
  [191627.733822]  [&lt;ffffffff810615b0&gt;] worker_thread+0x206/0x2c2
  [191627.734857]  [&lt;ffffffff810613aa&gt;] ? process_scheduled_works+0x2f/0x2f
  [191627.736052]  [&lt;ffffffff810613aa&gt;] ? process_scheduled_works+0x2f/0x2f
  [191627.737349]  [&lt;ffffffff810669a6&gt;] kthread+0xef/0xf7
  [191627.738267]  [&lt;ffffffff810f3b3a&gt;] ? time_hardirqs_on+0x15/0x28
  [191627.739330]  [&lt;ffffffff810668b7&gt;] ? __kthread_parkme+0xad/0xad
  [191627.741976]  [&lt;ffffffff81465592&gt;] ret_from_fork+0x42/0x70
  [191627.743080]  [&lt;ffffffff810668b7&gt;] ? __kthread_parkme+0xad/0xad
  [191627.744206] ---[ end trace bbfddacb7aaada8d ]---

  $ cat -n fs/btrfs/file.c
  691  int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
  (...)
  758                  btrfs_item_key_to_cpu(leaf, &amp;key, path-&gt;slots[0]);
  759                  if (key.objectid &gt; ino ||
  760                      key.type &gt; BTRFS_EXTENT_DATA_KEY || key.offset &gt;= end)
  761                          break;
  762
  763                  fi = btrfs_item_ptr(leaf, path-&gt;slots[0],
  764                                      struct btrfs_file_extent_item);
  765                  extent_type = btrfs_file_extent_type(leaf, fi);
  766
  767                  if (extent_type == BTRFS_FILE_EXTENT_REG ||
  768                      extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
  (...)
  774                  } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
  (...)
  778                  } else {
  779                          WARN_ON(1);
  780                          extent_end = search_start;
  781                  }
  (...)

This happened because the item we were processing did not match a file
extent item (its key type != BTRFS_EXTENT_DATA_KEY), and even on this
case we cast the item to a struct btrfs_file_extent_item pointer and
then find a type field value that does not match any of the expected
values (BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]). This scenario happens
due to a tiny time window where a race can happen as exemplified below.
For example, consider the following scenario where we're using the
NO_HOLES feature and we have the following two neighbour leafs:

               Leaf X (has N items)                    Leaf Y

[ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 EXTENT_DATA 8192), ... ]
          slot N - 2         slot N - 1              slot 0

Our inode 257 has an implicit hole in the range [0, 8K[ (implicit rather
than explicit because NO_HOLES is enabled). Now if our inode has an
ordered extent for the range [4K, 8K[ that is finishing, the following
can happen:

          CPU 1                                       CPU 2

  btrfs_finish_ordered_io()
    insert_reserved_file_extent()
      __btrfs_drop_extents()
         Searches for the key
          (257 EXTENT_DATA 4096) through
          btrfs_lookup_file_extent()

         Key not found and we get a path where
         path-&gt;nodes[0] == leaf X and
         path-&gt;slots[0] == N

         Because path-&gt;slots[0] is &gt;=
         btrfs_header_nritems(leaf X), we call
         btrfs_next_leaf()

         btrfs_next_leaf() releases the path

                                                  inserts key
                                                  (257 INODE_REF 4096)
                                                  at the end of leaf X,
                                                  leaf X now has N + 1 keys,
                                                  and the new key is at
                                                  slot N

         btrfs_next_leaf() searches for
         key (257 INODE_REF 256), with
         path-&gt;keep_locks set to 1,
         because it was the last key it
         saw in leaf X

           finds it in leaf X again and
           notices it's no longer the last
           key of the leaf, so it returns 0
           with path-&gt;nodes[0] == leaf X and
           path-&gt;slots[0] == N (which is now
           &lt; btrfs_header_nritems(leaf X)),
           pointing to the new key
           (257 INODE_REF 4096)

         __btrfs_drop_extents() casts the
         item at path-&gt;nodes[0], slot
         path-&gt;slots[0], to a struct
         btrfs_file_extent_item - it does
         not skip keys for the target
         inode with a type less than
         BTRFS_EXTENT_DATA_KEY
         (BTRFS_INODE_REF_KEY &lt; BTRFS_EXTENT_DATA_KEY)

         sees a bogus value for the type
         field triggering the WARN_ON in
         the trace shown above, and sets
         extent_end = search_start (4096)

         does the if-then-else logic to
         fixup 0 length extent items created
         by a past bug from hole punching:

           if (extent_end == key.offset &amp;&amp;
               extent_end &gt;= search_start)
               goto delete_extent_item;

         that evaluates to true and it ends
         up deleting the key pointed to by
         path-&gt;slots[0], (257 INODE_REF 4096),
         from leaf X

The same could happen for example for a xattr that ends up having a key
with an offset value that matches search_start (very unlikely but not
impossible).

So fix this by ensuring that keys smaller than BTRFS_EXTENT_DATA_KEY are
skipped, never casted to struct btrfs_file_extent_item and never deleted
by accident. Also protect against the unexpected case of getting a key
for a lower inode number by skipping that key and issuing a warning.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Btrfs: fix file corruption and data loss after cloning inline extents</title>
<updated>2016-01-21T16:23:27+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2015-10-13T14:15:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=d7ca88a480683b1589479bd15eeef69402e9934b'/>
<id>d7ca88a480683b1589479bd15eeef69402e9934b</id>
<content type='text'>
[ Upstream commit 8039d87d9e473aeb740d4fdbd59b9d2f89b2ced9 ]

Currently the clone ioctl allows to clone an inline extent from one file
to another that already has other (non-inlined) extents. This is a problem
because btrfs is not designed to deal with files having inline and regular
extents, if a file has an inline extent then it must be the only extent
in the file and must start at file offset 0. Having a file with an inline
extent followed by regular extents results in EIO errors when doing reads
or writes against the first 4K of the file.

Also, the clone ioctl allows one to lose data if the source file consists
of a single inline extent, with a size of N bytes, and the destination
file consists of a single inline extent with a size of M bytes, where we
have M &gt; N. In this case the clone operation removes the inline extent
from the destination file and then copies the inline extent from the
source file into the destination file - we lose the M - N bytes from the
destination file, a read operation will get the value 0x00 for any bytes
in the the range [N, M] (the destination inode's i_size remained as M,
that's why we can read past N bytes).

So fix this by not allowing such destructive operations to happen and
return errno EOPNOTSUPP to user space.

Currently the fstest btrfs/035 tests the data loss case but it totally
ignores this - i.e. expects the operation to succeed and does not check
the we got data loss.

The following test case for fstests exercises all these cases that result
in file corruption and data loss:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_cloner
  _require_btrfs_fs_feature "no_holes"
  _require_btrfs_mkfs_feature "no-holes"

  rm -f $seqres.full

  test_cloning_inline_extents()
  {
      local mkfs_opts=$1
      local mount_opts=$2

      _scratch_mkfs $mkfs_opts &gt;&gt;$seqres.full 2&gt;&amp;1
      _scratch_mount $mount_opts

      # File bar, the source for all the following clone operations, consists
      # of a single inline extent (50 bytes).
      $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 50" $SCRATCH_MNT/bar \
          | _filter_xfs_io

      # Test cloning into a file with an extent (non-inlined) where the
      # destination offset overlaps that extent. It should not be possible to
      # clone the inline extent from file bar into this file.
      $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 16K" $SCRATCH_MNT/foo \
          | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo

      # Doing IO against any range in the first 4K of the file should work.
      # Due to a past clone ioctl bug which allowed cloning the inline extent,
      # these operations resulted in EIO errors.
      echo "File foo data after clone operation:"
      # All bytes should have the value 0xaa (clone operation failed and did
      # not modify our file).
      od -t x1 $SCRATCH_MNT/foo
      $XFS_IO_PROG -c "pwrite -S 0xcc 0 100" $SCRATCH_MNT/foo | _filter_xfs_io

      # Test cloning the inline extent against a file which has a hole in its
      # first 4K followed by a non-inlined extent. It should not be possible
      # as well to clone the inline extent from file bar into this file.
      $XFS_IO_PROG -f -c "pwrite -S 0xdd 4K 12K" $SCRATCH_MNT/foo2 \
          | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo2

      # Doing IO against any range in the first 4K of the file should work.
      # Due to a past clone ioctl bug which allowed cloning the inline extent,
      # these operations resulted in EIO errors.
      echo "File foo2 data after clone operation:"
      # All bytes should have the value 0x00 (clone operation failed and did
      # not modify our file).
      od -t x1 $SCRATCH_MNT/foo2
      $XFS_IO_PROG -c "pwrite -S 0xee 0 90" $SCRATCH_MNT/foo2 | _filter_xfs_io

      # Test cloning the inline extent against a file which has a size of zero
      # but has a prealloc extent. It should not be possible as well to clone
      # the inline extent from file bar into this file.
      $XFS_IO_PROG -f -c "falloc -k 0 1M" $SCRATCH_MNT/foo3 | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo3

      # Doing IO against any range in the first 4K of the file should work.
      # Due to a past clone ioctl bug which allowed cloning the inline extent,
      # these operations resulted in EIO errors.
      echo "First 50 bytes of foo3 after clone operation:"
      # Should not be able to read any bytes, file has 0 bytes i_size (the
      # clone operation failed and did not modify our file).
      od -t x1 $SCRATCH_MNT/foo3
      $XFS_IO_PROG -c "pwrite -S 0xff 0 90" $SCRATCH_MNT/foo3 | _filter_xfs_io

      # Test cloning the inline extent against a file which consists of a
      # single inline extent that has a size not greater than the size of
      # bar's inline extent (40 &lt; 50).
      # It should be possible to do the extent cloning from bar to this file.
      $XFS_IO_PROG -f -c "pwrite -S 0x01 0 40" $SCRATCH_MNT/foo4 \
          | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo4

      # Doing IO against any range in the first 4K of the file should work.
      echo "File foo4 data after clone operation:"
      # Must match file bar's content.
      od -t x1 $SCRATCH_MNT/foo4
      $XFS_IO_PROG -c "pwrite -S 0x02 0 90" $SCRATCH_MNT/foo4 | _filter_xfs_io

      # Test cloning the inline extent against a file which consists of a
      # single inline extent that has a size greater than the size of bar's
      # inline extent (60 &gt; 50).
      # It should not be possible to clone the inline extent from file bar
      # into this file.
      $XFS_IO_PROG -f -c "pwrite -S 0x03 0 60" $SCRATCH_MNT/foo5 \
          | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo5

      # Reading the file should not fail.
      echo "File foo5 data after clone operation:"
      # Must have a size of 60 bytes, with all bytes having a value of 0x03
      # (the clone operation failed and did not modify our file).
      od -t x1 $SCRATCH_MNT/foo5

      # Test cloning the inline extent against a file which has no extents but
      # has a size greater than bar's inline extent (16K &gt; 50).
      # It should not be possible to clone the inline extent from file bar
      # into this file.
      $XFS_IO_PROG -f -c "truncate 16K" $SCRATCH_MNT/foo6 | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo6

      # Reading the file should not fail.
      echo "File foo6 data after clone operation:"
      # Must have a size of 16K, with all bytes having a value of 0x00 (the
      # clone operation failed and did not modify our file).
      od -t x1 $SCRATCH_MNT/foo6

      # Test cloning the inline extent against a file which has no extents but
      # has a size not greater than bar's inline extent (30 &lt; 50).
      # It should be possible to clone the inline extent from file bar into
      # this file.
      $XFS_IO_PROG -f -c "truncate 30" $SCRATCH_MNT/foo7 | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo7

      # Reading the file should not fail.
      echo "File foo7 data after clone operation:"
      # Must have a size of 50 bytes, with all bytes having a value of 0xbb.
      od -t x1 $SCRATCH_MNT/foo7

      # Test cloning the inline extent against a file which has a size not
      # greater than the size of bar's inline extent (20 &lt; 50) but has
      # a prealloc extent that goes beyond the file's size. It should not be
      # possible to clone the inline extent from bar into this file.
      $XFS_IO_PROG -f -c "falloc -k 0 1M" \
                      -c "pwrite -S 0x88 0 20" \
                      $SCRATCH_MNT/foo8 | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo8

      echo "File foo8 data after clone operation:"
      # Must have a size of 20 bytes, with all bytes having a value of 0x88
      # (the clone operation did not modify our file).
      od -t x1 $SCRATCH_MNT/foo8

      _scratch_unmount
  }

  echo -e "\nTesting without compression and without the no-holes feature...\n"
  test_cloning_inline_extents

  echo -e "\nTesting with compression and without the no-holes feature...\n"
  test_cloning_inline_extents "" "-o compress"

  echo -e "\nTesting without compression and with the no-holes feature...\n"
  test_cloning_inline_extents "-O no-holes" ""

  echo -e "\nTesting with compression and with the no-holes feature...\n"
  test_cloning_inline_extents "-O no-holes" "-o compress"

  status=0
  exit

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 8039d87d9e473aeb740d4fdbd59b9d2f89b2ced9 ]

Currently the clone ioctl allows to clone an inline extent from one file
to another that already has other (non-inlined) extents. This is a problem
because btrfs is not designed to deal with files having inline and regular
extents, if a file has an inline extent then it must be the only extent
in the file and must start at file offset 0. Having a file with an inline
extent followed by regular extents results in EIO errors when doing reads
or writes against the first 4K of the file.

Also, the clone ioctl allows one to lose data if the source file consists
of a single inline extent, with a size of N bytes, and the destination
file consists of a single inline extent with a size of M bytes, where we
have M &gt; N. In this case the clone operation removes the inline extent
from the destination file and then copies the inline extent from the
source file into the destination file - we lose the M - N bytes from the
destination file, a read operation will get the value 0x00 for any bytes
in the the range [N, M] (the destination inode's i_size remained as M,
that's why we can read past N bytes).

So fix this by not allowing such destructive operations to happen and
return errno EOPNOTSUPP to user space.

Currently the fstest btrfs/035 tests the data loss case but it totally
ignores this - i.e. expects the operation to succeed and does not check
the we got data loss.

The following test case for fstests exercises all these cases that result
in file corruption and data loss:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_cloner
  _require_btrfs_fs_feature "no_holes"
  _require_btrfs_mkfs_feature "no-holes"

  rm -f $seqres.full

  test_cloning_inline_extents()
  {
      local mkfs_opts=$1
      local mount_opts=$2

      _scratch_mkfs $mkfs_opts &gt;&gt;$seqres.full 2&gt;&amp;1
      _scratch_mount $mount_opts

      # File bar, the source for all the following clone operations, consists
      # of a single inline extent (50 bytes).
      $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 50" $SCRATCH_MNT/bar \
          | _filter_xfs_io

      # Test cloning into a file with an extent (non-inlined) where the
      # destination offset overlaps that extent. It should not be possible to
      # clone the inline extent from file bar into this file.
      $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 16K" $SCRATCH_MNT/foo \
          | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo

      # Doing IO against any range in the first 4K of the file should work.
      # Due to a past clone ioctl bug which allowed cloning the inline extent,
      # these operations resulted in EIO errors.
      echo "File foo data after clone operation:"
      # All bytes should have the value 0xaa (clone operation failed and did
      # not modify our file).
      od -t x1 $SCRATCH_MNT/foo
      $XFS_IO_PROG -c "pwrite -S 0xcc 0 100" $SCRATCH_MNT/foo | _filter_xfs_io

      # Test cloning the inline extent against a file which has a hole in its
      # first 4K followed by a non-inlined extent. It should not be possible
      # as well to clone the inline extent from file bar into this file.
      $XFS_IO_PROG -f -c "pwrite -S 0xdd 4K 12K" $SCRATCH_MNT/foo2 \
          | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo2

      # Doing IO against any range in the first 4K of the file should work.
      # Due to a past clone ioctl bug which allowed cloning the inline extent,
      # these operations resulted in EIO errors.
      echo "File foo2 data after clone operation:"
      # All bytes should have the value 0x00 (clone operation failed and did
      # not modify our file).
      od -t x1 $SCRATCH_MNT/foo2
      $XFS_IO_PROG -c "pwrite -S 0xee 0 90" $SCRATCH_MNT/foo2 | _filter_xfs_io

      # Test cloning the inline extent against a file which has a size of zero
      # but has a prealloc extent. It should not be possible as well to clone
      # the inline extent from file bar into this file.
      $XFS_IO_PROG -f -c "falloc -k 0 1M" $SCRATCH_MNT/foo3 | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo3

      # Doing IO against any range in the first 4K of the file should work.
      # Due to a past clone ioctl bug which allowed cloning the inline extent,
      # these operations resulted in EIO errors.
      echo "First 50 bytes of foo3 after clone operation:"
      # Should not be able to read any bytes, file has 0 bytes i_size (the
      # clone operation failed and did not modify our file).
      od -t x1 $SCRATCH_MNT/foo3
      $XFS_IO_PROG -c "pwrite -S 0xff 0 90" $SCRATCH_MNT/foo3 | _filter_xfs_io

      # Test cloning the inline extent against a file which consists of a
      # single inline extent that has a size not greater than the size of
      # bar's inline extent (40 &lt; 50).
      # It should be possible to do the extent cloning from bar to this file.
      $XFS_IO_PROG -f -c "pwrite -S 0x01 0 40" $SCRATCH_MNT/foo4 \
          | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo4

      # Doing IO against any range in the first 4K of the file should work.
      echo "File foo4 data after clone operation:"
      # Must match file bar's content.
      od -t x1 $SCRATCH_MNT/foo4
      $XFS_IO_PROG -c "pwrite -S 0x02 0 90" $SCRATCH_MNT/foo4 | _filter_xfs_io

      # Test cloning the inline extent against a file which consists of a
      # single inline extent that has a size greater than the size of bar's
      # inline extent (60 &gt; 50).
      # It should not be possible to clone the inline extent from file bar
      # into this file.
      $XFS_IO_PROG -f -c "pwrite -S 0x03 0 60" $SCRATCH_MNT/foo5 \
          | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo5

      # Reading the file should not fail.
      echo "File foo5 data after clone operation:"
      # Must have a size of 60 bytes, with all bytes having a value of 0x03
      # (the clone operation failed and did not modify our file).
      od -t x1 $SCRATCH_MNT/foo5

      # Test cloning the inline extent against a file which has no extents but
      # has a size greater than bar's inline extent (16K &gt; 50).
      # It should not be possible to clone the inline extent from file bar
      # into this file.
      $XFS_IO_PROG -f -c "truncate 16K" $SCRATCH_MNT/foo6 | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo6

      # Reading the file should not fail.
      echo "File foo6 data after clone operation:"
      # Must have a size of 16K, with all bytes having a value of 0x00 (the
      # clone operation failed and did not modify our file).
      od -t x1 $SCRATCH_MNT/foo6

      # Test cloning the inline extent against a file which has no extents but
      # has a size not greater than bar's inline extent (30 &lt; 50).
      # It should be possible to clone the inline extent from file bar into
      # this file.
      $XFS_IO_PROG -f -c "truncate 30" $SCRATCH_MNT/foo7 | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo7

      # Reading the file should not fail.
      echo "File foo7 data after clone operation:"
      # Must have a size of 50 bytes, with all bytes having a value of 0xbb.
      od -t x1 $SCRATCH_MNT/foo7

      # Test cloning the inline extent against a file which has a size not
      # greater than the size of bar's inline extent (20 &lt; 50) but has
      # a prealloc extent that goes beyond the file's size. It should not be
      # possible to clone the inline extent from bar into this file.
      $XFS_IO_PROG -f -c "falloc -k 0 1M" \
                      -c "pwrite -S 0x88 0 20" \
                      $SCRATCH_MNT/foo8 | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo8

      echo "File foo8 data after clone operation:"
      # Must have a size of 20 bytes, with all bytes having a value of 0x88
      # (the clone operation did not modify our file).
      od -t x1 $SCRATCH_MNT/foo8

      _scratch_unmount
  }

  echo -e "\nTesting without compression and without the no-holes feature...\n"
  test_cloning_inline_extents

  echo -e "\nTesting with compression and without the no-holes feature...\n"
  test_cloning_inline_extents "" "-o compress"

  echo -e "\nTesting without compression and with the no-holes feature...\n"
  test_cloning_inline_extents "-O no-holes" ""

  echo -e "\nTesting with compression and with the no-holes feature...\n"
  test_cloning_inline_extents "-O no-holes" "-o compress"

  status=0
  exit

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</pre>
</div>
</content>
</entry>
</feed>
