linux-stable.git/fs, branch v3.0.9

ext4: remove i_mutex lock in ext4_evict_inode to fix lockdep complaining

2011-11-11T17:37:16+00:00

commit 8c0bec2151a47906bf779c6715a10ce04453ab77 upstream.

The i_mutex lock and flush_completed_IO() added by commit 2581fdc810
in ext4_evict_inode() causes lockdep complaining about potential
deadlock in several places.  In most/all of these LOCKDEP complaints
it looks like it's a false positive, since many of the potential
circular locking cases can't take place by the time the
ext4_evict_inode() is called; but since at the very least it may mask
real problems, we need to address this.

This change removes the flush_completed_IO() and i_mutex lock in
ext4_evict_inode().  Instead, we take a different approach to resolve
the software lockup that commit 2581fdc810 intends to fix.  Rather
than having ext4-dio-unwritten thread wait for grabing the i_mutex
lock of an inode, we use mutex_trylock() instead, and simply requeue
the work item if we fail to grab the inode's i_mutex lock.

This should speed up work queue processing in general and also
prevents the following deadlock scenario: During page fault,
shrink_icache_memory is called that in turn evicts another inode B.
Inode B has some pending io_end work so it calls ext4_ioend_wait()
that waits for inode B's i_ioend_count to become zero.  However, inode
B's ioend work was queued behind some of inode A's ioend work on the
same cpu's ext4-dio-unwritten workqueue.  As the ext4-dio-unwritten
thread on that cpu is processing inode A's ioend work, it tries to
grab inode A's i_mutex lock.  Since the i_mutex lock of inode A is
still hold before the page fault happened, we enter a deadlock.

Signed-off-by: Jiaying Zhang 
Signed-off-by: "Theodore Ts'o" 
Cc: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman

VFS: we need to set LOOKUP_JUMPED on mountpoint crossing

2011-11-11T17:37:08+00:00

commit a3fbbde70a0cec017f2431e8f8de208708c76acc upstream.

Mountpoint crossing is similar to following procfs symlinks - we do
not get ->d_revalidate() called for dentry we have arrived at, with
unpleasant consequences for NFS4.

Simple way to reproduce the problem in mainline:

    cat >/tmp/a.c <<'EOF'
    #include 
    #include 
    #include 
    main()
    {
            struct flock fl = {.l_type = F_RDLCK, .l_whence = SEEK_SET, .l_len = 1};
            if (fcntl(0, F_SETLK, &fl))
                    perror("setlk");
    }
    EOF
    cc /tmp/a.c -o /tmp/test

then on nfs4:

    mount --bind file1 file2
    /tmp/test < file1		# ok
    /tmp/test < file2		# spews "setlk: No locks available"...

What happens is the missing call of ->d_revalidate() after mountpoint
crossing and that's where NFS4 would issue OPEN request to server.

The fix is simple - treat mountpoint crossing the same way we deal with
following procfs-style symlinks.  I.e.  set LOOKUP_JUMPED...

Signed-off-by: Al Viro 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

VFS: fix statfs() automounter semantics regression

2011-11-11T17:37:08+00:00

commit 5c8a0fbba543d9428a486f0d1282bbcf3cf1d95a upstream.

No one in their right mind would expect statfs() to not work on a
automounter managed mount point. Fix it.

[ I'm not sure about the "no one in their right mind" part.  It's not
  mounted, and you didn't ask for it to be mounted.  But nobody will
  really care, and this probably makes it match previous semantics, so..
      - Linus ]

This mirrors the fix made to the quota code in 815d405ceff0d69646.

Signed-off-by: Dan McGee 
Cc: Trond Myklebust 
Cc: Alexander Viro 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

block: make gendisk hold a reference to its queue

2011-11-11T17:37:07+00:00

commit f992ae801a7dec34a4ed99a6598bbbbfb82af4fb upstream.

The following command sequence triggers an oops.

# mount /dev/sdb1 /mnt
# echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete
# umount /mnt

 general protection fault: 0000 [#1] PREEMPT SMP
 CPU 2
 Modules linked in:

 Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs
 RIP: 0010:[]  [] __lock_acquire+0x389/0x1d60
...
 Call Trace:
  [] lock_acquire+0x95/0x140
  [] _raw_spin_lock+0x3b/0x50
  [] bdi_lock_two+0x5c/0x70
  [] bdev_inode_switch_bdi+0x4c/0xf0
  [] __blkdev_put+0x11b/0x1d0
  [] __blkdev_put+0x160/0x1d0
  [] blkdev_put+0x5f/0x190
  [] kill_block_super+0x4d/0x80
  [] deactivate_locked_super+0x45/0x70
  [] deactivate_super+0x4a/0x70
  [] mntput_no_expire+0xed/0x130
  [] sys_umount+0x7e/0x3a0
  [] system_call_fastpath+0x16/0x1b

This is because bdev holds on to disk but disk doesn't pin the
associated queue.  If a SCSI device is removed while the device is
still open, the sdev puts the base reference to the queue on release.
When the bdev is finally released, the associated queue is already
gone along with the bdi and bdev_inode_switch_bdi() ends up
dereferencing already freed bdi.

Even if it were not for this bug, disk not holding onto the associated
queue is very unusual and error-prone.

Fix it by making add_disk() take an extra reference to its queue and
put it on disk_release() and ensuring that disk and its fops owner are
put in that order after all accesses to the disk and queue are
complete.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

ext4: fix race in xattr block allocation path

2011-11-11T17:36:34+00:00

commit 6d6a435190bdf2e04c9465cde5bdc3ac68cf11a4 upstream.

Ceph users reported that when using Ceph on ext4, the filesystem
would often become corrupted, containing inodes with incorrect
i_blocks counters.

I managed to reproduce this with a very hacked-up "streamtest"
binary from the Ceph tree.

Ceph is doing a lot of xattr writes, to out-of-inode blocks.
There is also another thread which does sync_file_range and close,
of the same files.  The problem appears to happen due to this race:

sync/flush thread               xattr-set thread
-----------------               ----------------

do_writepages                   ext4_xattr_set
ext4_da_writepages              ext4_xattr_set_handle
mpage_da_map_blocks             ext4_xattr_block_set
        set DELALLOC_RESERVE
                                ext4_new_meta_blocks
                                        ext4_mb_new_blocks
                                                if (!i_delalloc_reserved_flag)
                                                        vfs_dq_alloc_block
ext4_get_blocks
	down_write(i_data_sem)
        set i_delalloc_reserved_flag
	...
	up_write(i_data_sem)
                                        if (i_delalloc_reserved_flag)
                                                vfs_dq_alloc_block_nofail


In other words, the sync/flush thread pops in and sets
i_delalloc_reserved_flag on the inode, which makes the xattr thread
think that it's in a delalloc path in ext4_new_meta_blocks(),
and add the block for a second time, after already having added
it once in the !i_delalloc_reserved_flag case in ext4_mb_new_blocks

The real problem is that we shouldn't be using the DELALLOC_RESERVED
state flag, and instead we should be passing
EXT4_GET_BLOCKS_DELALLOC_RESERVE down to ext4_map_blocks() instead of
using an inode state flag.  We'll fix this for now with using
i_data_sem to prevent this race, but this is really not the right way
to fix things.

Signed-off-by: Eric Sandeen 
Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Greg Kroah-Hartman

ext4: call ext4_handle_dirty_metadata with correct inode in ext4_dx_add_entry

2011-11-11T17:36:34+00:00

commit 5930ea643805feb50a2f8383ae12eb6f10935e49 upstream.

ext4_dx_add_entry manipulates bh2 and frames[0].bh, which are two buffer_heads
that point to directory blocks assigned to the directory inode.  However, the
function calls ext4_handle_dirty_metadata with the inode of the file that's
being added to the directory, not the directory inode itself.  Therefore,
correct the code to dirty the directory buffers with the directory inode, not
the file inode.

Signed-off-by: Darrick J. Wong 
Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Greg Kroah-Hartman

ext4: ext4_mkdir should dirty dir_block with newly created directory inode

2011-11-11T17:36:34+00:00

commit f9287c1f2d329f4d78a3bbc9cf0db0ebae6f146a upstream.

ext4_mkdir calls ext4_handle_dirty_metadata with dir_block and the inode "dir".
Unfortunately, dir_block belongs to the newly created directory (which is
"inode"), not the parent directory (which is "dir").  Fix the incorrect
association.

Signed-off-by: Darrick J. Wong 
Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Greg Kroah-Hartman

ext4: ext4_rename should dirty dir_bh with the correct directory

2011-11-11T17:36:33+00:00

commit bcaa992975041e40449be8c010c26192b8c8b409 upstream.

When ext4_rename performs a directory rename (move), dir_bh is a
buffer that is modified to update the '..' link in the directory being
moved (old_inode).  However, ext4_handle_dirty_metadata is called with
the old parent directory inode (old_dir) and dir_bh, which is
incorrect because dir_bh does not belong to the parent inode.  Fix
this error.

Signed-off-by: Darrick J. Wong 
Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Greg Kroah-Hartman

ext2,ext3,ext4: don't inherit APPEND_FL or IMMUTABLE_FL for new inodes

2011-11-11T17:36:32+00:00

commit 1cd9f0976aa4606db8d6e3dc3edd0aca8019372a upstream.

This doesn't make much sense, and it exposes a bug in the kernel where
attempts to create a new file in an append-only directory using
O_CREAT will fail (but still leave a zero-length file).  This was
discovered when xfstests #79 was generalized so it could run on all
file systems.

Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Greg Kroah-Hartman

vfs: show O_CLOEXE bit properly in /proc//fdinfo/ files

2011-11-11T17:36:30+00:00

commit 1117f72ea0217ba0cc19f05adbbd8b9a397f5ab7 upstream.

The CLOEXE bit is magical, and for performance (and semantic) reasons we
don't actually maintain it in the file descriptor itself, but in a
separate bit array.  Which means that when we show f_flags, the CLOEXE
status is shown incorrectly: we show the status not as it is now, but as
it was when the file was opened.

Fix that by looking up the bit properly in the 'fdt->close_on_exec' bit
array.

Uli needs this in order to re-implement the pfiles program:

  "For normal file descriptors (not sockets) this was the last piece of
   information which wasn't available.  This is all part of my 'give
   Solaris users no reason to not switch' effort.  I intend to offer the
   code to the util-linux-ng maintainers."

Requested-by: Ulrich Drepper 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman