linux.git/fs, branch v4.3-rc5

namei: results of d_is_negative() should be checked after dentry revalidation

2015-10-10T17:17:27+00:00

Leandro Awa writes:
 "After switching to version 4.1.6, our parallelized and distributed
  workflows now fail consistently with errors of the form:

  T34: ./regex.c:39:22: error: config.h: No such file or directory

  From our 'git bisect' testing, the following commit appears to be the
  possible cause of the behavior we've been seeing: commit 766c4cbfacd8"

Al Viro says:
 "What happens is that 766c4cbfacd8 got the things subtly wrong.

  We used to treat d_is_negative() after lookup_fast() as "fail with
  ENOENT".  That was wrong - checking ->d_flags outside of ->d_seq
  protection is unreliable and failing with hard error on what should've
  fallen back to non-RCU pathname resolution is a bug.

  Unfortunately, we'd pulled the test too far up and ran afoul of
  another kind of staleness.  The dentry might have been absolutely
  stable from the RCU point of view (and we might be on UP, etc), but
  stale from the remote fs point of view.  If ->d_revalidate() returns
  "it's actually stale", dentry gets thrown away and the original code
  wouldn't even have looked at its ->d_flags.

  What we need is to check ->d_flags where 766c4cbfacd8 does (prior to
  ->d_seq validation) but only use the result in cases where we do not
  discard this dentry outright"
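
The ordering described above can be modelled in a few lines of
user-space C. This is only a sketch: struct fake_dentry, its fields and
lookup_fast_model() are hypothetical stand-ins and not the code in
fs/namei.c. The negative flag is sampled early, but it is acted upon
only once neither the sequence check nor revalidation has rejected the
dentry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry as seen by the RCU fast path. */
struct fake_dentry {
	bool negative;        /* models d_is_negative() */
	bool seq_changed;     /* models a failed ->d_seq validation */
	bool stale_on_server; /* models ->d_revalidate() reporting "stale" */
};

enum lookup_result {
	LOOKUP_FOUND,    /* positive dentry, fast path succeeded */
	LOOKUP_ENOENT,   /* stable, valid, negative dentry */
	LOOKUP_FALLBACK, /* retry with non-RCU pathname resolution */
};

enum lookup_result lookup_fast_model(const struct fake_dentry *d)
{
	bool negative;

	assert(d != NULL);

	/* Sample the flag before the sequence count is validated... */
	negative = d->negative;

	if (d->seq_changed)
		return LOOKUP_FALLBACK;

	/* ...but a dentry rejected by revalidation is discarded unseen. */
	if (d->stale_on_server)
		return LOOKUP_FALLBACK;

	/* Only now is the sampled value trusted. */
	return negative ? LOOKUP_ENOENT : LOOKUP_FOUND;
}
```

In this model a negative dentry that the server considers stale yields
LOOKUP_FALLBACK rather than LOOKUP_ENOENT, which is the case the
reporter's workload was hitting.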

Reported-by: Leandro Awa 
Link: https://bugzilla.kernel.org/show_bug.cgi?id=104911
Fixes: 766c4cbfacd8 ("namei: d_is_negative() should be checked...")
Tested-by: Leandro Awa 
Cc: stable@vger.kernel.org # v4.1+
Signed-off-by: Trond Myklebust 
Acked-by: Al Viro 
Signed-off-by: Linus Torvalds

Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs

2015-10-09T23:39:35+00:00

Pull btrfs fixes from Chris Mason:
 "These are small and assorted.  Neil's is the oldest, I dropped the
  ball thinking he was going to send it in"

* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: support NFSv2 export
  Btrfs: open_ctree: Fix possible memory leak
  Btrfs: fix deadlock when finalizing block group creation
  Btrfs: update fix for read corruption of compressed and shared extents
  Btrfs: send, fix corner case for reference overwrite detection

Merge tag 'nfs-for-4.3-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

2015-10-07T07:54:22+00:00

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Bugfixes:
   - Fix a use-after-free bug in the RPC/RDMA client
   - Fix a write performance regression
   - Fix up page writeback accounting
   - Don't try to reclaim unused state owners
   - Fix a NFSv4 nograce recovery hang
   - reset states to use open_stateid when returning delegation
     voluntarily
   - Fix a tracepoint NULL-pointer dereference"

* tag 'nfs-for-4.3-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS: Fix a tracepoint NULL-pointer dereference
  nfs4: reset states to use open_stateid when returning delegation voluntarily
  NFSv4: Fix a nograce recovery hang
  NFSv4.1: nfs4_opendata_check_deleg needs to handle NFS4_OPEN_CLAIM_DELEG_CUR_FH
  NFSv4: Don't try to reclaim unused state owners
  NFS: Fix a write performance regression
  NFS: Fix up page writeback accounting
  xprtrdma: disconnect and flush cqs before freeing buffers

NFS: Fix a tracepoint NULL-pointer dereference

2015-10-06T22:56:25+00:00

Running xfstest generic/013 with the tracepoint nfs:nfs4_open_file
enabled produces a NULL-pointer dereference when calculating fileid and
filehandle of the opened file.  Fix this by checking if state is NULL
before trying to use the inode pointer.
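
As a rough illustration of the guard, assuming simplified types
(struct fake_state, struct fake_inode and trace_fileid() are made up for
this sketch and are not the NFS tracepoint code), the value recorded in
the trace falls back to 0 when there is no state to look at:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the real structures. */
struct fake_inode {
	uint64_t fileid;
};

struct fake_state {
	const struct fake_inode *inode;
};

/* Return the fileid to record in the trace event, or 0 without a state. */
uint64_t trace_fileid(const struct fake_state *state)
{
	/* Check the state first; only then is ->inode safe to read. */
	if (state == NULL)
		return 0;

	/* In this model a state always carries an inode. */
	assert(state->inode != NULL);
	return state->inode->fileid;
}
```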

Reported-by: Olga Kornievskaia 
Signed-off-by: Anna Schumaker 
Signed-off-by: Trond Myklebust

BTRFS: support NFSv2 export

2015-10-06T13:55:23+00:00

The "fh_len" passed to ->fh_to_* is not guaranteed to be the same as
that returned by encode_fh - it may be larger.

With NFSv2, the filehandle is fixed length, so it may appear longer
than expected and be zero-padded.

So we must test that fh_len is at least some value, not exactly equal
to it.
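
A minimal sketch of the difference between the two tests, with the
length counted in 32-bit words. MODEL_FH_WORDS is a hypothetical minimum
chosen for illustration, not a value taken from the btrfs export code;
NFS2_FH_WORDS reflects the fixed 32-byte NFSv2 filehandle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_FH_WORDS 5 /* hypothetical minimum handle length, in words */
#define NFS2_FH_WORDS  8 /* NFSv2 handles are always 32 bytes long */

static_assert(MODEL_FH_WORDS <= NFS2_FH_WORDS,
	      "the encoded handle must fit into an NFSv2 filehandle");

/* Accept any handle that is long enough, including zero-padded ones. */
bool fh_len_acceptable(size_t fh_len)
{
	/* An exact "fh_len == MODEL_FH_WORDS" test would reject NFSv2. */
	return fh_len >= MODEL_FH_WORDS;
}
```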

Signed-off-by: NeilBrown 
Acked-by: David Sterba

Btrfs: open_ctree: Fix possible memory leak

2015-10-06T13:55:22+00:00

After reading the root node of either the chunk tree or the tree root
tree from disk, if that root node does not have the
EXTENT_BUFFER_UPTODATE flag set, we fail to release the memory used by
the root node. Fix this.
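
The shape of the error path can be sketched in user-space C. All names
here (struct fake_buffer, read_node(), release_node(), load_root_node()
and the live_buffers counter) are invented for the example and only
model the idea of dropping the buffer before bailing out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an extent buffer. */
struct fake_buffer {
	bool uptodate;
};

int live_buffers; /* number of buffers currently allocated */

struct fake_buffer *read_node(bool uptodate)
{
	struct fake_buffer *eb = malloc(sizeof(*eb));

	if (eb) {
		eb->uptodate = uptodate;
		live_buffers++;
	}
	return eb;
}

void release_node(struct fake_buffer *eb)
{
	free(eb);
	live_buffers--;
}

/* Return 0 and store the node in *out on success, -1 on failure. */
int load_root_node(bool uptodate, struct fake_buffer **out)
{
	struct fake_buffer *eb;

	assert(out != NULL);
	*out = NULL;

	eb = read_node(uptodate);
	if (!eb)
		return -1;

	if (!eb->uptodate) {
		/* The step that must not be skipped on this error path. */
		release_node(eb);
		return -1;
	}

	*out = eb;
	return 0;
}
```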

Signed-off-by: Chandan Rajendra

Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6

2015-10-06T13:30:21+00:00

Pull CIFS fixes from Steve French:
 "Two fixes for problems pointed out by automated tools.

  Thanks PaX/grsecurity team and Dan Carpenter (and the Smatch tool)"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  [CIFS] Update cifs version number
  [SMB3] Do not fall back to SMBWriteX in set_file_size error cases
  [SMB3] Missing null tcon check

Btrfs: fix deadlock when finalizing block group creation

2015-10-05T23:56:38+00:00

Josef ran into a deadlock while a transaction handle was finalizing the
creation of its block groups, which produced the following trace:

  [260445.593112] fio             D ffff88022a9df468     0  8924   4518 0x00000084
  [260445.593119]  ffff88022a9df468 ffffffff81c134c0 ffff880429693c00 ffff88022a9df488
  [260445.593126]  ffff88022a9e0000 ffff8803490d7b00 ffff8803490d7b18 ffff88022a9df4b0
  [260445.593132]  ffff8803490d7af8 ffff88022a9df488 ffffffff8175a437 ffff8803490d7b00
  [260445.593137] Call Trace:
  [260445.593145]  [] schedule+0x37/0x80
  [260445.593189]  [] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
  [260445.593197]  [] ? prepare_to_wait_event+0xf0/0xf0
  [260445.593225]  [] btrfs_lock_root_node+0x34/0x50 [btrfs]
  [260445.593253]  [] btrfs_search_slot+0x88b/0xa00 [btrfs]
  [260445.593295]  [] ? free_extent_buffer+0x4f/0x90 [btrfs]
  [260445.593324]  [] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
  [260445.593351]  [] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
  [260445.593394]  [] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
  [260445.593427]  [] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
  [260445.593459]  [] do_chunk_alloc+0x2a4/0x2e0 [btrfs]
  [260445.593491]  [] find_free_extent+0xa55/0xd90 [btrfs]
  [260445.593524]  [] btrfs_reserve_extent+0xd2/0x220 [btrfs]
  [260445.593532]  [] ? account_page_dirtied+0xdd/0x170
  [260445.593564]  [] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
  [260445.593597]  [] ? btree_set_page_dirty+0xe/0x10 [btrfs]
  [260445.593626]  [] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
  [260445.593654]  [] btrfs_cow_block+0x11f/0x1c0 [btrfs]
  [260445.593682]  [] btrfs_search_slot+0x1e7/0xa00 [btrfs]
  [260445.593724]  [] ? free_extent_buffer+0x4f/0x90 [btrfs]
  [260445.593752]  [] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
  [260445.593830]  [] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
  [260445.593905]  [] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
  [260445.593946]  [] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
  [260445.593990]  [] btrfs_commit_transaction+0xa8/0xb40 [btrfs]
  [260445.594042]  [] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
  [260445.594089]  [] btrfs_sync_file+0x294/0x350 [btrfs]
  [260445.594115]  [] vfs_fsync_range+0x3b/0xa0
  [260445.594133]  [] ? syscall_trace_enter_phase1+0x131/0x180
  [260445.594149]  [] do_fsync+0x3d/0x70
  [260445.594169]  [] ? syscall_trace_leave+0xb8/0x110
  [260445.594187]  [] SyS_fsync+0x10/0x20
  [260445.594204]  [] entry_SYSCALL_64_fastpath+0x12/0x71

This happened because the same transaction handle created a large number
of block groups and while finalizing their creation (inserting new items
and updating existing items in the chunk and device trees) a new metadata
extent had to be allocated and no free space was found in the current
metadata block groups, which made find_free_extent() attempt to allocate
a new block group via do_chunk_alloc(). However, in do_chunk_alloc() we
ended up allocating a new system chunk too and exceeded the threshold
of 2MB of reserved chunk bytes, which made do_chunk_alloc() enter the
final part of block group creation again (at
btrfs_create_pending_block_groups()) and attempt to lock the root of
the chunk tree again when it was already write locked by the same task.

Similarly, we can deadlock on extent tree nodes/leaves if, while we are
running delayed references, we end up creating a new metadata block
group in order to allocate a new node/leaf for the extent tree (as part
of a CoW operation or growing the tree), as
btrfs_create_pending_block_groups() inserts items into the extent tree
as well. In this case we get the following trace:

  [14242.773581] fio             D ffff880428ca3418     0  3615   3100 0x00000084
  [14242.773588]  ffff880428ca3418 ffff88042d66b000 ffff88042a03c800 ffff880428ca3438
  [14242.773594]  ffff880428ca4000 ffff8803e4b20190 ffff8803e4b201a8 ffff880428ca3460
  [14242.773600]  ffff8803e4b20188 ffff880428ca3438 ffffffff8175a437 ffff8803e4b20190
  [14242.773606] Call Trace:
  [14242.773613]  [] schedule+0x37/0x80
  [14242.773656]  [] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
  [14242.773664]  [] ? prepare_to_wait_event+0xf0/0xf0
  [14242.773692]  [] btrfs_lock_root_node+0x34/0x50 [btrfs]
  [14242.773720]  [] btrfs_search_slot+0x88b/0xa00 [btrfs]
  [14242.773750]  [] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
  [14242.773758]  [] ? kmem_cache_alloc+0x1d2/0x200
  [14242.773786]  [] btrfs_insert_item+0x71/0xf0 [btrfs]
  [14242.773818]  [] btrfs_create_pending_block_groups+0x102/0x200 [btrfs]
  [14242.773850]  [] do_chunk_alloc+0x2ae/0x2f0 [btrfs]
  [14242.773934]  [] find_free_extent+0xa55/0xd90 [btrfs]
  [14242.773998]  [] btrfs_reserve_extent+0xc2/0x1d0 [btrfs]
  [14242.774041]  [] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
  [14242.774078]  [] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
  [14242.774118]  [] btrfs_cow_block+0x11f/0x1c0 [btrfs]
  [14242.774155]  [] btrfs_search_slot+0x1e7/0xa00 [btrfs]
  [14242.774194]  [] ? __btrfs_free_extent.isra.70+0x2e1/0xcb0 [btrfs]
  [14242.774235]  [] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
  [14242.774274]  [] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
  [14242.774318]  [] __btrfs_run_delayed_refs+0xbb3/0x1020 [btrfs]
  [14242.774358]  [] btrfs_run_delayed_refs.part.78+0x74/0x280 [btrfs]
  [14242.774391]  [] btrfs_run_delayed_refs+0x17/0x20 [btrfs]
  [14242.774432]  [] commit_cowonly_roots+0x8d/0x2bd [btrfs]
  [14242.774474]  [] ? __btrfs_run_delayed_items+0x1cf/0x210 [btrfs]
  [14242.774516]  [] ? btrfs_qgroup_account_extents+0x83/0x130 [btrfs]
  [14242.774558]  [] btrfs_commit_transaction+0x590/0xb40 [btrfs]
  [14242.774599]  [] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
  [14242.774642]  [] btrfs_sync_file+0x294/0x350 [btrfs]
  [14242.774650]  [] vfs_fsync_range+0x3b/0xa0
  [14242.774657]  [] ? syscall_trace_enter_phase1+0x131/0x180
  [14242.774663]  [] do_fsync+0x3d/0x70
  [14242.774669]  [] ? syscall_trace_leave+0xb8/0x110
  [14242.774675]  [] SyS_fsync+0x10/0x20
  [14242.774681]  [] entry_SYSCALL_64_fastpath+0x12/0x71

Fix this by never recursing into the finalization phase of block group
creation and making sure we never trigger the finalization of block group
creation while running delayed references.
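
One way to picture "never recursing into the finalization phase" is a
per-transaction flag that is cleared while finalization runs, so that
nested chunk allocations only queue work for the outer loop. The sketch
below is a user-space model under that assumption; struct fake_trans,
its fields, alloc_chunk() and finalize_pending() are hypothetical names,
not the btrfs implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a transaction handle. */
struct fake_trans {
	int pending_groups; /* block groups awaiting finalization */
	int extra_allocs;   /* nested allocations still to be simulated */
	bool can_finalize;  /* cleared while finalization is in progress */
	int depth;          /* current nesting of finalize_pending() */
	int max_depth;      /* deepest nesting seen so far */
};

void finalize_pending(struct fake_trans *t);

/* Allocate a chunk; finalize right away only if that is allowed. */
void alloc_chunk(struct fake_trans *t)
{
	t->pending_groups++;
	if (t->can_finalize)
		finalize_pending(t);
}

void finalize_pending(struct fake_trans *t)
{
	bool saved;

	assert(t != NULL);
	saved = t->can_finalize;

	/* Nested allocations must queue their work, not recurse. */
	t->can_finalize = false;
	if (++t->depth > t->max_depth)
		t->max_depth = t->depth;

	while (t->pending_groups > 0) {
		t->pending_groups--;
		/* Inserting items may itself require a new chunk. */
		if (t->extra_allocs > 0) {
			t->extra_allocs--;
			alloc_chunk(t);
		}
	}

	t->depth--;
	t->can_finalize = saved;
}
```

The same flag could be cleared around the running of delayed references
in such a model, so that block groups created there are finalized later
by the caller instead of from inside the extent tree update.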

Reported-by: Josef Bacik 
Fixes: 00d80e342c0f ("Btrfs: fix quick exhaustion of the system array in the superblock")
Signed-off-by: Filipe Manana

Btrfs: update fix for read corruption of compressed and shared extents

2015-10-05T23:56:27+00:00

My previous fix in commit 005efedf2c7d ("Btrfs: fix read corruption of
compressed and shared extents") was effective only if the compressed
extents cover a file range with a length that is not a multiple of 16
pages. That's because the detection of when we reached a different range
of the file that shares the same compressed extent as the previously
processed range was done at extent_io.c:__do_contiguous_readpages(),
which covers subranges with a length up to 16 pages, because
extent_readpages() groups the pages in clusters no larger than 16 pages.
So fix this by tracking the start of the previously processed file
range's extent map at extent_readpages().
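
The effect of where the "previous extent map start" is remembered can be
shown with a small user-space model. It is a sketch only:
count_range_switches() and its parameters are hypothetical, pages are
simply numbered from 0, and each file range is assumed to be range_pages
long. With per-cluster state a switch between ranges goes unnoticed
exactly when it coincides with a 16 page cluster boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CLUSTER_PAGES 16u
#define NO_PREV UINT64_MAX

/*
 * Count how often a switch from one file range to the next is noticed
 * while reading npages pages, each range being range_pages long.
 */
unsigned int count_range_switches(unsigned int range_pages,
				  unsigned int npages,
				  bool track_across_clusters)
{
	uint64_t prev_em_start = NO_PREV;
	unsigned int switches = 0;

	assert(range_pages > 0);

	for (unsigned int page = 0; page < npages; page++) {
		/* Start of the file range (extent map) covering this page. */
		uint64_t em_start = (uint64_t)(page / range_pages) * range_pages;

		/* Old behaviour: the state is forgotten for every cluster. */
		if (!track_across_clusters && page % CLUSTER_PAGES == 0)
			prev_em_start = NO_PREV;

		if (prev_em_start != NO_PREV && prev_em_start != em_start)
			switches++;

		prev_em_start = em_start;
	}
	return switches;
}
```

For two 16 page ranges the per-cluster variant notices no switch at all,
while the variant that keeps its state across clusters notices the one
switch; for two 12 page ranges both variants notice it.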

The following test case for fstests reproduces the issue:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_cloner

  rm -f $seqres.full

  test_clone_and_read_compressed_extent()
  {
      local mount_opts=$1

      _scratch_mkfs >>$seqres.full 2>&1
      _scratch_mount $mount_opts

      # Create our test file with a single extent of 64Kb that is going to
      # be compressed no matter which compression algo is used (zlib/lzo).
      $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 64K" \
          $SCRATCH_MNT/foo | _filter_xfs_io

      # Now clone the compressed extent into an adjacent file offset.
      $CLONER_PROG -s 0 -d $((64 * 1024)) -l $((64 * 1024)) \
          $SCRATCH_MNT/foo $SCRATCH_MNT/foo

      echo "File digest before unmount:"
      md5sum $SCRATCH_MNT/foo | _filter_scratch

      # Remount the fs or clear the page cache to trigger the bug in
      # btrfs. Because the extent has an uncompressed length that is a
      # multiple of 16 pages, all the pages belonging to the second range
      # of the file (64K to 128K), which points to the same extent as the
      # first range (0K to 64K), had their contents full of zeroes instead
      # of the byte 0xaa. This was a bug exclusively in the read path of
      # compressed extents, the correct data was stored on disk, btrfs
      # just failed to fill in the pages correctly.
      _scratch_remount

      echo "File digest after remount:"
      # Must match the digest we got before.
      md5sum $SCRATCH_MNT/foo | _filter_scratch
  }

  echo -e "\nTesting with zlib compression..."
  test_clone_and_read_compressed_extent "-o compress=zlib"

  _scratch_unmount

  echo -e "\nTesting with lzo compression..."
  test_clone_and_read_compressed_extent "-o compress=lzo"

  status=0
  exit

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana 
Tested-by: Timofey Titovets

Btrfs: send, fix corner case for reference overwrite detection

2015-10-05T23:56:27+00:00

When the inode given to did_overwrite_ref() matches the current progress
and has a reference that collides with the reference of another inode
that has the same number as the current progress, we were always telling
our caller that the inode's reference was overwritten. That is incorrect,
because the other inode might be a new inode (different generation
number), in which case we must return false from did_overwrite_ref() so
that its callers don't use an orphanized path for the inode (it will
never be orphanized; instead it will be unlinked and the new inode
created later).
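
The distinction the fix relies on is that an inode number alone does not
identify an inode; the generation number has to match as well. A minimal
sketch with hypothetical names (struct inode_id and same_inode() are not
taken from fs/btrfs/send.c):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical identity of an inode within one subvolume. */
struct inode_id {
	uint64_t ino; /* inode number, may be reused after deletion */
	uint64_t gen; /* generation in which the inode was created */
};

/* Two identities name the same inode only if both fields agree. */
bool same_inode(struct inode_id a, struct inode_id b)
{
	return a.ino == b.ino && a.gen == b.gen;
}
```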

The following test case for fstests reproduces the issue:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -fr $send_files_dir
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _need_to_be_root

  send_files_dir=$TEST_DIR/btrfs-test-$seq

  rm -f $seqres.full
  rm -fr $send_files_dir
  mkdir $send_files_dir

  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  # Create our test file with a single extent of 64K.
  mkdir -p $SCRATCH_MNT/foo
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K" $SCRATCH_MNT/foo/bar \
      | _filter_xfs_io

  _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
      $SCRATCH_MNT/mysnap1
  _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT \
      $SCRATCH_MNT/mysnap2

  echo "File digest before being replaced:"
  md5sum $SCRATCH_MNT/mysnap1/foo/bar | _filter_scratch

  # Remove the file and then create a new one in the same location with
  # the same name but with different content. This new file ends up
  # getting the same inode number as the previous one, because that inode
  # number was the highest inode number used by the snapshot's root and
  # therefore when attempting to find a new inode number for the new
  # file, we end up reusing the same inode number. This happens because
  # currently btrfs uses the highest inode number plus 1 for the
  # first inode created once a snapshot's root is loaded (done at
  # fs/btrfs/inode-map.c:btrfs_find_free_objectid in the linux kernel
  # tree).
  # Having these two different files in the snapshots with the same inode
  # number (but different generation numbers) caused the btrfs send code
  # to emit an incorrect path for the file when issuing an unlink
  # operation because it failed to realize they were different files.
  rm -f $SCRATCH_MNT/mysnap2/foo/bar
  $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 96K" \
      $SCRATCH_MNT/mysnap2/foo/bar | _filter_xfs_io

  _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT/mysnap2 \
      $SCRATCH_MNT/mysnap2_ro

  _run_btrfs_util_prog send $SCRATCH_MNT/mysnap1 -f $send_files_dir/1.snap
  _run_btrfs_util_prog send -p $SCRATCH_MNT/mysnap1 \
      $SCRATCH_MNT/mysnap2_ro -f $send_files_dir/2.snap

  echo "File digest in the original filesystem after being replaced:"
  md5sum $SCRATCH_MNT/mysnap2_ro/foo/bar | _filter_scratch

  # Now recreate the filesystem by receiving both send streams and verify
  # we get the same file contents that the original filesystem had.
  _scratch_unmount
  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  _run_btrfs_util_prog receive -vv $SCRATCH_MNT -f $send_files_dir/1.snap
  _run_btrfs_util_prog receive -vv $SCRATCH_MNT -f $send_files_dir/2.snap

  echo "File digest in the new filesystem:"
  # Must match the digest from the new file.
  md5sum $SCRATCH_MNT/mysnap2_ro/foo/bar | _filter_scratch

  status=0
  exit

Reported-by: Martin Raiber 
Fixes: 8b191a684968 ("Btrfs: incremental send, check if orphanized dir inode needs delayed rename")
Signed-off-by: Filipe Manana