linux.git/fs/fuse/file.c, branch v4.15

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

2017-09-13T17:10:19+00:00

Pull fuse updates from Miklos Szeredi:
 "This fixes a regression (spotted by the Sandstorm.io folks) in the pid
  namespace handling introduced in 4.12.

  There's also a fix for honoring sync/dsync flags for pwritev2()"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: getattr cleanup
  fuse: honor iocb sync flags on write
  fuse: allow server to run in different pid_ns

fuse: getattr cleanup

2017-09-12T14:57:54+00:00

The refreshed argument isn't used by any caller; get rid of it.

Use a helper for just updating the inode (no need to fill in a kstat).

Signed-off-by: Miklos Szeredi

fuse: honor iocb sync flags on write

2017-09-12T14:57:53+00:00

If the IOCB_DSYNC flag is set, fuse_file_write_iter() does not perform a
sync.

Honor IOCB_DSYNC/IOCB_SYNC by setting O_DSYNC/O_SYNC respectively in the
flags field of the write request.

We don't need to sync data or metadata, since fuse_perform_write() does
write-through and the filesystem is responsible for updating file times.
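
As a rough userspace illustration of the flag mapping described above (not
the kernel code itself), the sketch below copies sync bits from a kiocb-style
flag word into the flags of a write request.  The TOY_IOCB_* values and the
toy_* function names are assumptions made for this example only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>

/* Illustrative stand-ins for the kernel-internal iocb flags.  The numeric
 * values are assumptions; only the mapping below matters. */
#define TOY_IOCB_DSYNC (1u << 4)
#define TOY_IOCB_SYNC  (1u << 5)

/* Return the flags to place in the write request, given the iocb flags
 * and whatever open flags the request already carries. */
unsigned int toy_write_req_flags(unsigned int ki_flags, unsigned int req_flags)
{
    if (ki_flags & TOY_IOCB_DSYNC)
        req_flags |= O_DSYNC;
    if (ki_flags & TOY_IOCB_SYNC)
        req_flags |= O_SYNC;
    return req_flags;
}

/* Small usage example. */
void toy_sync_flags_demo(void)
{
    assert(toy_write_req_flags(0, 0) == 0);
    assert((toy_write_req_flags(TOY_IOCB_DSYNC, 0) & O_DSYNC) == O_DSYNC);
    assert((toy_write_req_flags(TOY_IOCB_SYNC, 0) & O_SYNC) == O_SYNC);
}
```

The request flags are only a hint to the userspace server here; as the
paragraph above notes, no extra sync call is modelled.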

Original patch by Vitaly Zolotusky.

Reported-by: Nate Clark 
Cc: Vitaly Zolotusky
Signed-off-by: Miklos Szeredi

fuse: allow server to run in different pid_ns

2017-09-12T14:57:53+00:00

Commit 0b6e9ea041e6 ("fuse: Add support for pid namespaces") broke
Sandstorm.io development tools, which have been sending FUSE file
descriptors across PID namespace boundaries since early 2014.

The above patch added a check that prevented I/O on the fuse device file
descriptor if the pid namespace of the reader/writer was different from the
pid namespace of the mounter.  With this change passing the device file
descriptor to a different pid namespace simply doesn't work.  The check was
added because pids are transferred to/from the fuse userspace server in the
namespace registered at mount time.

To fix this regression, remove the checks and do the following:

1) The pid in the request header (the pid of the task that initiated the
filesystem operation) is translated to the reader's pid namespace.  If a
mapping doesn't exist for this pid, then a zero pid is used.  Note: even if
a mapping would exist between the initiator task's pid namespace and the
reader's pid namespace, the pid will be zero if either the mapping from the
initiator's to the mounter's namespace or the mapping from the mounter's to
the reader's namespace doesn't exist.

2) The lk.pid value in setlk/setlkw requests and getlk reply is left alone.
Userspace should not interpret this value anyway.  Also allow the
setlk/setlkw operations if the pid of the task cannot be represented in the
mounter's namespace (pid being zero in that case).
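
The two-step translation in point 1 can be modelled in userspace roughly as
follows.  This is a toy sketch under stated assumptions: struct toy_pid_ns,
struct toy_pid and the toy_* functions are hypothetical names invented for
the example, and a task is assumed to be visible only in its own namespace
and that namespace's ancestors.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_LEVEL 4

/* Toy pid namespace: only its nesting depth matters here (0 = root). */
struct toy_pid_ns {
    int level;
};

/* Toy pid: one number per namespace, from the root down to the task's own. */
struct toy_pid {
    int level;                      /* depth of the task's own namespace */
    struct {
        int nr;
        const struct toy_pid_ns *ns;
    } numbers[TOY_MAX_LEVEL];
};

/* Number of `pid` as seen from `ns`, or 0 if it is not visible there. */
int toy_pid_nr_ns(const struct toy_pid *pid, const struct toy_pid_ns *ns)
{
    if (pid && ns && ns->level >= 0 && ns->level < TOY_MAX_LEVEL &&
        ns->level <= pid->level && pid->numbers[ns->level].ns == ns)
        return pid->numbers[ns->level].nr;
    return 0;
}

/* Pid for the request header: initiator -> mounter, then mounter -> reader.
 * If the first step fails, the result is 0 regardless of the second. */
int toy_header_pid(const struct toy_pid *initiator,
                   const struct toy_pid_ns *mounter,
                   const struct toy_pid_ns *reader)
{
    if (toy_pid_nr_ns(initiator, mounter) == 0)
        return 0;
    return toy_pid_nr_ns(initiator, reader);
}

/* Small usage example with a root namespace and two sibling children. */
void toy_pid_ns_demo(void)
{
    struct toy_pid_ns root = { 0 }, a = { 1 }, b = { 1 };
    /* Task living in `a`: pid 100 seen from root, pid 1 inside `a`. */
    struct toy_pid task = {
        .level = 1,
        .numbers = { { 100, &root }, { 1, &a } },
    };

    assert(toy_header_pid(&task, &root, &root) == 100);
    assert(toy_header_pid(&task, &root, &a) == 1);
    assert(toy_header_pid(&task, &a, &b) == 0);
    /* A direct initiator -> reader mapping exists (100), but the task is
     * not visible in the mounter's namespace, so the header pid is 0. */
    assert(toy_pid_nr_ns(&task, &root) == 100);
    assert(toy_header_pid(&task, &b, &root) == 0);
}
```

The last case in the demo corresponds to the note in point 1: the result is
zero even though the initiator would be representable in the reader's
namespace.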

Reported-by: Kenton Varda 
Signed-off-by: Miklos Szeredi 
Fixes: 0b6e9ea041e6 ("fuse: Add support for pid namespaces")
Cc:  # v4.12+
Cc: Eric W. Biederman 
Cc: Seth Forshee

Merge tag 'wberr-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

2017-09-06T21:11:03+00:00

Pull writeback error handling updates from Jeff Layton:
 "This pile continues the work from last cycle on better tracking
  writeback errors. In v4.13 we added some basic errseq_t infrastructure
  and converted a few filesystems to use it.

  This set continues refining that infrastructure, adds documentation,
  and converts most of the other filesystems to use it. The main
  exception at this point is the NFS client"

* tag 'wberr-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  ecryptfs: convert to file_write_and_wait in ->fsync
  mm: remove optimizations based on i_size in mapping writeback waits
  fs: convert a pile of fsync routines to errseq_t based reporting
  gfs2: convert to errseq_t based writeback error reporting for fsync
  fs: convert sync_file_range to use errseq_t based error-tracking
  mm: add file_fdatawait_range and file_write_and_wait
  fuse: convert to errseq_t based error tracking for fsync
  mm: consolidate dax / non-dax checks for writeback
  Documentation: add some docs for errseq_t
  errseq: rename __errseq_set to errseq_set

Merge tag 'locks-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

2017-09-06T20:43:26+00:00

Pull file locking updates from Jeff Layton:
 "This pile just has a few file locking fixes from Ben Coddington. There
  are a couple of cleanup patches + an attempt to bring sanity to the
  l_pid value that is reported back to userland on an F_GETLK request.

  After a few gyrations, he came up with a way for filesystems to
  communicate to the VFS layer code whether the pid should be translated
  according to the namespace or presented as-is to userland"

* tag 'locks-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  locks: restore a warn for leaked locks on close
  fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks
  fs/locks: Use allocation rather than the stack in fcntl_getlk()

fuse: set mapping error in writepage_locked when it fails

2017-08-11T09:38:26+00:00

This ensures that we see errors on fsync when writeback fails.

Signed-off-by: Jeff Layton 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Signed-off-by: Miklos Szeredi

fuse: Don't call set_page_dirty_lock() for ITER_BVEC pages for async_dio

2017-08-03T15:55:58+00:00

Commit 8fba54aebbdf ("fuse: direct-io: don't dirty ITER_BVEC pages") fixes
the ITER_BVEC page deadlock for direct io in fuse by checking in
fuse_direct_io(), whether the page is a bvec page or not, before locking
it.  However, this check is missed when the "async_dio" mount option is
enabled.  In this case, set_page_dirty_lock() is called from the req->end
callback in request_end(), when the fuse thread is returning from userspace
to respond to the read request.  This will cause the same deadlock because
the bvec condition is not checked in this path.

Here is the stack of the deadlocked thread, while returning from userspace:

[13706.656686] INFO: task glusterfs:3006 blocked for more than 120 seconds.
[13706.657808] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13706.658788] glusterfs       D ffffffff816c80f0     0  3006      1 0x00000080
[13706.658797]  ffff8800d6713a58 0000000000000086 ffff8800d9ad7000 ffff8800d9ad5400
[13706.658799]  ffff88011ffd5cc0 ffff8800d6710008 ffff88011fd176c0 7fffffffffffffff
[13706.658801]  0000000000000002 ffffffff816c80f0 ffff8800d6713a78 ffffffff816c790e
[13706.658803] Call Trace:
[13706.658809]  [] ? bit_wait_io_timeout+0x80/0x80
[13706.658811]  [] schedule+0x3e/0x90
[13706.658813]  [] schedule_timeout+0x1b5/0x210
[13706.658816]  [] ? gup_pud_range+0x1db/0x1f0
[13706.658817]  [] ? kvm_clock_read+0x1e/0x20
[13706.658819]  [] ? kvm_clock_get_cycles+0x9/0x10
[13706.658822]  [] ? ktime_get+0x52/0xc0
[13706.658824]  [] io_schedule_timeout+0xa4/0x110
[13706.658826]  [] bit_wait_io+0x36/0x50
[13706.658828]  [] __wait_on_bit_lock+0x76/0xb0
[13706.658831]  [] ? lock_request+0x46/0x70 [fuse]
[13706.658834]  [] __lock_page+0xaa/0xb0
[13706.658836]  [] ? wake_atomic_t_function+0x40/0x40
[13706.658838]  [] set_page_dirty_lock+0x58/0x60
[13706.658841]  [] fuse_release_user_pages+0x58/0x70 [fuse]
[13706.658844]  [] ? fuse_aio_complete+0x190/0x190 [fuse]
[13706.658847]  [] fuse_aio_complete_req+0x29/0x90 [fuse]
[13706.658849]  [] request_end+0xd9/0x190 [fuse]
[13706.658852]  [] fuse_dev_do_write+0x336/0x490 [fuse]
[13706.658854]  [] fuse_dev_write+0x6e/0xa0 [fuse]
[13706.658857]  [] ? security_file_permission+0x23/0x90
[13706.658859]  [] do_iter_readv_writev+0x60/0x90
[13706.658862]  [] ? fuse_dev_splice_write+0x350/0x350 [fuse]
[13706.658863]  [] do_readv_writev+0x171/0x1f0
[13706.658866]  [] ? try_to_wake_up+0x210/0x210
[13706.658868]  [] vfs_writev+0x41/0x50
[13706.658870]  [] SyS_writev+0x56/0xf0
[13706.658872]  [] ? syscall_trace_leave+0xf1/0x160
[13706.658874]  [] system_call_fastpath+0x12/0x71

Fix this by making should_dirty a fuse_io_priv parameter that can be
checked in fuse_aio_complete_req().
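
The shape of this fix can be sketched in userspace as below.  This is a toy
model, not the fuse code: the toy_* types and functions are hypothetical, and
the rule used to set should_dirty (only reads into non-bvec pages) is an
assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
    bool dirty;
};

/* Per-I/O state; should_dirty is decided once, at submission time. */
struct toy_io_priv {
    bool should_dirty;
};

/* Assumption: only reads into user-backed (non-bvec) pages are dirtied. */
void toy_io_init(struct toy_io_priv *io, bool is_write, bool is_bvec)
{
    io->should_dirty = !is_write && !is_bvec;
}

void toy_release_user_pages(struct toy_page *pages, size_t n, bool should_dirty)
{
    for (size_t i = 0; i < n; i++) {
        if (should_dirty)
            pages[i].dirty = true;  /* stands in for set_page_dirty_lock() */
    }
}

/* Completion path: consult the stored flag instead of always dirtying. */
void toy_aio_complete_req(const struct toy_io_priv *io,
                          struct toy_page *pages, size_t n)
{
    toy_release_user_pages(pages, n, io->should_dirty);
}

/* Small usage example: a bvec read must leave its pages untouched. */
void toy_should_dirty_demo(void)
{
    struct toy_page bvec_pages[2] = { { false }, { false } };
    struct toy_page user_pages[2] = { { false }, { false } };
    struct toy_io_priv io;

    toy_io_init(&io, false, true);      /* read into bvec pages */
    toy_aio_complete_req(&io, bvec_pages, 2);
    assert(!bvec_pages[0].dirty && !bvec_pages[1].dirty);

    toy_io_init(&io, false, false);     /* read into user pages */
    toy_aio_complete_req(&io, user_pages, 2);
    assert(user_pages[0].dirty && user_pages[1].dirty);
}
```

Carrying the decision in the per-I/O structure means the synchronous and
asynchronous completion paths see the same answer.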

Reported-by: Tiger Yang 
Signed-off-by: Ashish Samant 
Signed-off-by: Miklos Szeredi

fuse: convert to errseq_t based error tracking for fsync

2017-07-31T23:12:25+00:00

Change the fsync path to use file_write_and_wait_range() and
file_check_and_advance_wb_err().
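
The idea behind errseq_t style reporting can be illustrated with a much
simplified userspace model: the mapping records the latest writeback error
plus a counter, and each open file keeps a cursor so that it reports a given
error exactly once.  All toy_* names are hypothetical, and the real errseq_t
encoding is more compact than this sketch.

```c
#include <assert.h>
#include <errno.h>

/* Toy mapping-wide error state: last error and a counter bumped per error. */
struct toy_mapping {
    int err;
    unsigned int seq;
};

/* Toy open file: remembers the last counter value it has reported. */
struct toy_file {
    struct toy_mapping *mapping;
    unsigned int seen_seq;
};

void toy_file_open(struct toy_file *f, struct toy_mapping *m)
{
    f->mapping = m;
    f->seen_seq = m->seq;       /* sample the current state at open */
}

/* Called when writeback fails (err is a negative errno value). */
void toy_mapping_set_error(struct toy_mapping *m, int err)
{
    m->err = err;
    m->seq++;
}

/* Report an error only if one was recorded since this file last checked. */
int toy_check_and_advance(struct toy_file *f)
{
    if (f->seen_seq == f->mapping->seq)
        return 0;
    f->seen_seq = f->mapping->seq;
    return f->mapping->err;
}

/* fsync in this model: writing and waiting are omitted, only the check. */
int toy_fsync(struct toy_file *f)
{
    return toy_check_and_advance(f);
}

/* Small usage example: every open file sees the error exactly once. */
void toy_errseq_demo(void)
{
    struct toy_mapping m = { 0, 0 };
    struct toy_file f1, f2;

    toy_file_open(&f1, &m);
    toy_file_open(&f2, &m);
    assert(toy_fsync(&f1) == 0);

    toy_mapping_set_error(&m, -EIO);
    assert(toy_fsync(&f1) == -EIO);
    assert(toy_fsync(&f1) == 0);        /* already reported to f1 */
    assert(toy_fsync(&f2) == -EIO);     /* f2 still gets its own report */
}
```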

Signed-off-by: Jeff Layton

fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks

2017-07-16T14:28:22+00:00

Since commit c69899a17ca4 "NFSv4: Update of VFS byte range lock must be
atomic with the stateid update", NFSv4 has been inserting locks in rpciod
worker context.  The result is that the file_lock's fl_nspid is the
kworker's pid instead of the original userspace pid.

The fl_nspid is only used to represent the namespaced virtual pid number
when displaying locks or returning from F_GETLK.  There's no reason to set
it for every inserted lock, since we can usually just look it up from
fl_pid.  So, instead of looking up and holding struct pid for every lock,
let's just look up the virtual pid number from fl_pid when it is needed.
That means we can remove fl_nspid entirely.

The translation and presentation of fl_pid should handle the following four
cases:

1 - F_GETLK on a remote file with a remote lock:
    In this case, the filesystem should determine the l_pid to return here.
    Filesystems should indicate that the fl_pid represents a non-local pid
    value that should not be translated by returning an fl_pid <= 0.

2 - F_GETLK on a local file with a remote lock:
    This should be the l_pid of the lock manager process, and translated.

3 - F_GETLK on a remote file with a local lock, and
4 - F_GETLK on a local file with a local lock:
    These should be the translated l_pid of the local locking process.

Fuse was already doing the correct thing by translating the pid into the
caller's namespace.  With this change we must update fuse to translate
to init's pid namespace, so that the locks API can then translate from
init's pid namespace into the pid namespace of the caller.

With this change, the locks API will expect that if a filesystem returns
a remote pid as opposed to a local pid for F_GETLK, that remote pid will
be <= 0.  This signifies that the pid is remote, and the locks API will
forgo translating that pid into the pid namespace of the local calling
process.

Finally, we convert remote filesystems to present remote pids using
negative numbers. Have lustre, 9p, ceph, cifs, and dlm negate the remote
pid returned for F_GETLK lock requests.
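
The convention described above can be sketched in userspace as follows.  The
toy_* names and the offset-based translation are assumptions made up for the
example; only the rule that a value <= 0 is passed through untranslated comes
from the text.

```c
#include <assert.h>

/* Translation from init's pid namespace into the caller's namespace. */
typedef int (*toy_translate_fn)(int init_ns_pid);

/* A remote filesystem reports the lock owner as a negated remote pid. */
int toy_remote_fl_pid(int remote_pid)
{
    return -remote_pid;
}

/* l_pid reported for F_GETLK in this model. */
int toy_getlk_l_pid(int fl_pid, toy_translate_fn to_caller_ns)
{
    if (fl_pid <= 0)
        return fl_pid;              /* remote: present as-is */
    return to_caller_ns(fl_pid);    /* local: translate for the caller */
}

/* Hypothetical translation: the caller's namespace sees pids 1000 lower,
 * and anything it cannot see is reported as 0. */
int toy_offset_translate(int init_ns_pid)
{
    return init_ns_pid > 1000 ? init_ns_pid - 1000 : 0;
}

/* Small usage example. */
void toy_getlk_demo(void)
{
    /* Local lock held by pid 1042 in init's namespace. */
    assert(toy_getlk_l_pid(1042, toy_offset_translate) == 42);
    /* Remote lock held by remote pid 77: reported negated, untranslated. */
    assert(toy_getlk_l_pid(toy_remote_fl_pid(77), toy_offset_translate) == -77);
}
```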

Since local pids will never be larger than PID_MAX_LIMIT (which is
currently defined as <= 4 million), and pid_t is a signed int, we
should have plenty of room to represent remote pids with negative
numbers if we assume that remote pid numbers are similarly limited.

If this is not the case, then we run the risk of having a remote pid
returned for which there is also a corresponding local pid.  This is a
problem we have now, but this patch should reduce the chances of that
occurring, while also returning those remote pid numbers, for whatever
that may be worth.

Signed-off-by: Benjamin Coddington 
Signed-off-by: Jeff Layton