linux-stable.git/fs/ceph, branch v5.4.26

ceph: do not execute direct write in parallel if O_APPEND is specified

2020-03-05T15:43:38+00:00

[ Upstream commit 8e4473bb50a1796c9c32b244e5dbc5ee24ead937 ]

In O_APPEND & O_DIRECT mode, the data from different writers will
be possibly overlapping each other since they take the shared lock.

For example, both Writer1 and Writer2 are in O_APPEND and O_DIRECT
mode:

          Writer1                         Writer2

     shared_lock()                   shared_lock()
     getattr(CAP_SIZE)               getattr(CAP_SIZE)
     iocb->ki_pos = EOF              iocb->ki_pos = EOF
     write(data1)
                                     write(data2)
     shared_unlock()                 shared_unlock()

The data2 will overlap the data1 from the same file offset, the
old EOF.

Switch to exclusive lock instead when O_APPEND is specified.

Signed-off-by: Xiubo Li 
Reviewed-by: Jeff Layton 
Signed-off-by: Ilya Dryomov 
Signed-off-by: Sasha Levin

ceph: check availability of mds cluster on mount after wait timeout

2020-02-24T07:36:59+00:00

[ Upstream commit 97820058fb2831a4b203981fa2566ceaaa396103 ]

If all the MDS daemons are down for some reason, then the first mount
attempt will fail with EIO after the mount request times out.  A mount
attempt will also fail with EIO if all of the MDS's are laggy.

This patch changes the code to return -EHOSTUNREACH in these situations
and adds a pr_info error message to help the admin determine the cause.

URL: https://tracker.ceph.com/issues/4386
Signed-off-by: Xiubo Li 
Reviewed-by: Jeff Layton 
Signed-off-by: Ilya Dryomov 
Signed-off-by: Sasha Levin

ceph: hold extra reference to r_parent over life of request

2020-01-29T15:45:24+00:00

commit 9c1c2b35f1d94de8325344c2777d7ee67492db3b upstream.

Currently, we just assume that it will stick around by virtue of the
submitter's reference, but later patches will allow the syscall to
return early and we can't rely on that reference at that point.

While I'm not aware of any reports of it, Xiubo pointed out that this
may fix a use-after-free.  If the wait for a reply times out or is
canceled via signal, and then the reply comes in after the syscall
returns, the client can end up trying to access r_parent without a
reference.

Take an extra reference to the inode when setting r_parent and release
it when releasing the request.

Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton 
Reviewed-by: "Yan, Zheng" 
Signed-off-by: Ilya Dryomov 
Signed-off-by: Greg Kroah-Hartman

ceph: fix compat_ioctl for ceph_dir_operations

2019-12-17T18:55:31+00:00

commit 18bd6caaef4021803dd0d031dc37c2d001d18a5b upstream.

The ceph_ioctl function is used both for files and directories, but only
the files support doing that in 32-bit compat mode.

On the s390 architecture, there is also a problem with invalid 31-bit
pointers that need to be passed through compat_ptr().

Use the new compat_ptr_ioctl() to address both issues.

Note: When backporting this patch to stable kernels, "compat_ioctl:
add compat_ptr_ioctl()" is needed as well.

Reviewed-by: "Yan, Zheng" 
Cc: stable@vger.kernel.org
Signed-off-by: Arnd Bergmann 
Signed-off-by: Greg Kroah-Hartman

ceph: increment/decrement dio counter on async requests

2019-11-14T17:44:51+00:00

Ceph can in some cases issue an async DIO request, in which case we can
end up calling ceph_end_io_direct before the I/O is actually complete.
That may allow buffered operations to proceed while DIO requests are
still in flight.

Fix this by incrementing the i_dio_count when issuing an async DIO
request, and decrement it when tearing down the aio_req.

Fixes: 321fe13c9398 ("ceph: add buffered/direct exclusionary locking for reads and writes")
Signed-off-by: Jeff Layton 
Signed-off-by: Ilya Dryomov

ceph: take the inode lock before acquiring cap refs

2019-11-14T17:44:51+00:00

Most of the time, we (or the vfs layer) takes the inode_lock and then
acquires caps, but ceph_read_iter does the opposite, and that can lead
to a deadlock.

When there are multiple clients treading over the same data, we can end
up in a situation where a reader takes caps and then tries to acquire
the inode_lock. Another task holds the inode_lock and issues a request
to the MDS which needs to revoke the caps, but that can't happen until
the inode_lock is unwedged.

Fix this by having ceph_read_iter take the inode_lock earlier, before
attempting to acquire caps.

Fixes: 321fe13c9398 ("ceph: add buffered/direct exclusionary locking for reads and writes")
Link: https://tracker.ceph.com/issues/36348
Signed-off-by: Jeff Layton 
Signed-off-by: Ilya Dryomov

ceph: return -EINVAL if given fsc mount option on kernel w/o support

2019-11-07T17:03:23+00:00

If someone requests fscache on the mount, and the kernel doesn't
support it, it should fail the mount.

[ Drop ceph prefix -- it's provided by pr_err. ]

Signed-off-by: Jeff Layton 
Reviewed-by: Ilya Dryomov 
Signed-off-by: Ilya Dryomov

ceph: don't allow copy_file_range when stripe_count != 1

2019-11-05T14:42:58+00:00

copy_file_range tries to use the OSD 'copy-from' operation, which simply
performs a full object copy.  Unfortunately, the implementation of this
system call assumes that stripe_count is always set to 1 and doesn't take
into account that the data may be striped across an object set.  If the
file layout has stripe_count different from 1, then the destination file
data will be corrupted.

For example:

Consider a 8 MiB file with 4 MiB object size, stripe_count of 2 and
stripe_size of 2 MiB; the first half of the file will be filled with 'A's
and the second half will be filled with 'B's:

               0      4M     8M       Obj1     Obj2
               +------+------+       +----+   +----+
        file:  | AAAA | BBBB |       | AA |   | AA |
               +------+------+       |----|   |----|
                                     | BB |   | BB |
                                     +----+   +----+

If we copy_file_range this file into a new file (which needs to have the
same file layout!), then it will start by copying the object starting at
file offset 0 (Obj1).  And then it will copy the object starting at file
offset 4M -- which is Obj1 again.

Unfortunately, the solution for this is to not allow remote object copies
to be performed when the file layout stripe_count is not 1 and simply
fallback to the default (VFS) copy_file_range implementation.

Cc: stable@vger.kernel.org
Signed-off-by: Luis Henriques 
Reviewed-by: Jeff Layton 
Signed-off-by: Ilya Dryomov

ceph: don't try to handle hashed dentries in non-O_CREAT atomic_open

2019-11-05T14:42:44+00:00

If ceph_atomic_open is handed a !d_in_lookup dentry, then that means
that it already passed d_revalidate so we *know* that it's negative (or
at least was very recently). Just return -ENOENT in that case.

This also addresses a subtle bug in dentry handling. Non-O_CREAT opens
call atomic_open with the parent's i_rwsem shared, but calling
d_splice_alias on a hashed dentry requires the exclusive lock.

If ceph_atomic_open receives a hashed, negative dentry on a non-O_CREAT
open, and another client were to race in and create the file before we
issue our OPEN, ceph_fill_trace could end up calling d_splice_alias on
the dentry with the new inode with insufficient locks.

Cc: stable@vger.kernel.org
Reported-by: Al Viro 
Signed-off-by: Jeff Layton 
Signed-off-by: Ilya Dryomov

ceph: add missing check in d_revalidate snapdir handling

2019-10-29T21:29:55+00:00

We should not play with dcache without parent locked...

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro 
Signed-off-by: Jeff Layton 
Signed-off-by: Ilya Dryomov