linux-stable.git/fs, branch v4.4.80

Btrfs: adjust outstanding_extents counter properly when dio write is split

2017-08-07T02:19:46+00:00

[ Upstream commit c2931667c83ded6504b3857e99cc45b21fa496fb ]

The way btrfs dio currently handles split dio writes is not good
enough: if a dio write is split into several segments due to the
lack of contiguous space, a large dio write like 'dd bs=1G count=1'
can end up with an incorrect outstanding_extents counter, and endio
would complain loudly with an assertion.

This fixes the problem by compensating the outstanding_extents
counter in the inode if a large dio write gets split.
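
As a rough illustration of the compensation described above, the
following user-space sketch models the accounting with a toy counter.
It is not btrfs code: toy_inode, toy_dio and the toy_*() helpers are
hypothetical names, and the up-front reservation is an assumption made
only for the example.

```c
#include <assert.h>

/* Toy model, not btrfs code: every name here is hypothetical. */
struct toy_inode {
	unsigned int outstanding_extents;	/* extents accounted to the inode */
};

struct toy_dio {
	unsigned int reserved;	/* extents reserved up front, not yet used */
};

/* Reserve 'n' extents before the write starts. */
static void toy_reserve(struct toy_inode *inode, struct toy_dio *dio,
			unsigned int n)
{
	inode->outstanding_extents += n;
	dio->reserved = n;
}

/*
 * Account one allocated segment. While reserved extents remain, consume
 * one; once they run out, bump the inode counter so that every segment
 * is still backed by exactly one outstanding extent.
 */
static void toy_account_segment(struct toy_inode *inode, struct toy_dio *dio)
{
	if (dio->reserved > 0)
		dio->reserved--;
	else
		inode->outstanding_extents++;
}

/* Completion of one segment drops one outstanding extent. */
static void toy_endio_segment(struct toy_inode *inode)
{
	assert(inode->outstanding_extents > 0);
	inode->outstanding_extents--;
}

/* Run a write split into 'segments' pieces; return the final counter. */
static unsigned int toy_split_write(unsigned int reserved,
				    unsigned int segments)
{
	struct toy_inode inode = { 0 };
	struct toy_dio dio = { 0 };
	unsigned int i;

	toy_reserve(&inode, &dio, reserved);
	for (i = 0; i < segments; i++)
		toy_account_segment(&inode, &dio);
	/* Reserved extents that were never used are released here. */
	inode.outstanding_extents -= dio.reserved;
	for (i = 0; i < segments; i++)
		toy_endio_segment(&inode);
	return inode.outstanding_extents;
}
```

In this model the counter returns to zero whether the write is split
into more or fewer segments than were reserved; without the increment
in toy_account_segment() the assertion in toy_endio_segment() would
fire for a write with more segments than reservations.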

Reported-by: Anand Jain 
Tested-by: Anand Jain 
Signed-off-by: Liu Bo 
Signed-off-by: David Sterba 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

pstore: Use dynamic spinlock initializer

2017-08-07T02:19:43+00:00

commit e9a330c4289f2ba1ca4bf98c2b430ab165a8931b upstream.

The per-prz spinlock should be using the dynamic initializer so that
lockdep can correctly track it. Without this, under lockdep, we get a
warning at boot that the lock is in non-static memory.

Fixes: 109704492ef6 ("pstore: Make spinlock per zone instead of global")
Fixes: 76d5692a5803 ("pstore: Correctly initialize spinlock and flags")
Signed-off-by: Kees Cook 
Signed-off-by: Greg Kroah-Hartman

pstore: Correctly initialize spinlock and flags

2017-08-07T02:19:43+00:00

commit 76d5692a58031696e282384cbd893832bc92bd76 upstream.

The ram backend wasn't always initializing its spinlock correctly. Since
it was coming from kzalloc memory, though, it was harmless on
architectures that initialize unlocked spinlocks to 0 (at least x86 and
ARM). This also fixes a possibly ignored flag setting.

When running under CONFIG_DEBUG_SPINLOCK, the following Oops was visible:

[    0.760836] persistent_ram: found existing buffer, size 29988, start 29988
[    0.765112] persistent_ram: found existing buffer, size 30105, start 30105
[    0.769435] persistent_ram: found existing buffer, size 118542, start 118542
[    0.785960] persistent_ram: found existing buffer, size 0, start 0
[    0.786098] persistent_ram: found existing buffer, size 0, start 0
[    0.786131] pstore: using zlib compression
[    0.790716] BUG: spinlock bad magic on CPU#0, swapper/0/1
[    0.790729]  lock: 0xffffffc0d1ca9bb0, .magic: 00000000, .owner: /-1, .owner_cpu: 0
[    0.790742] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.10.0-rc2+ #913
[    0.790747] Hardware name: Google Kevin (DT)
[    0.790750] Call trace:
[    0.790768] [] dump_backtrace+0x0/0x2bc
[    0.790780] [] show_stack+0x20/0x28
[    0.790794] [] dump_stack+0xa4/0xcc
[    0.790809] [] spin_dump+0xe0/0xf0
[    0.790821] [] spin_bug+0x30/0x3c
[    0.790834] [] do_raw_spin_lock+0x50/0x1b8
[    0.790846] [] _raw_spin_lock_irqsave+0x54/0x6c
[    0.790862] [] buffer_size_add+0x48/0xcc
[    0.790875] [] persistent_ram_write+0x60/0x11c
[    0.790888] [] ramoops_pstore_write_buf+0xd4/0x2a4
[    0.790900] [] pstore_console_write+0xf0/0x134
[    0.790912] [] console_unlock+0x48c/0x5e8
[    0.790923] [] register_console+0x3b0/0x4d4
[    0.790935] [] pstore_register+0x1a8/0x234
[    0.790947] [] ramoops_probe+0x6b8/0x7d4
[    0.790961] [] platform_drv_probe+0x7c/0xd0
[    0.790972] [] driver_probe_device+0x1b4/0x3bc
[    0.790982] [] __device_attach_driver+0xc8/0xf4
[    0.790996] [] bus_for_each_drv+0xb4/0xe4
[    0.791006] [] __device_attach+0xd0/0x158
[    0.791016] [] device_initial_probe+0x24/0x30
[    0.791026] [] bus_probe_device+0x50/0xe4
[    0.791038] [] device_add+0x3a4/0x76c
[    0.791051] [] of_device_add+0x74/0x84
[    0.791062] [] of_platform_device_create_pdata+0xc0/0x100
[    0.791073] [] of_platform_device_create+0x34/0x40
[    0.791086] [] of_platform_default_populate_init+0x58/0x78
[    0.791097] [] do_one_initcall+0x88/0x160
[    0.791109] [] kernel_init_freeable+0x264/0x31c
[    0.791123] [] kernel_init+0x18/0x11c
[    0.791133] [] ret_from_fork+0x10/0x50
[    0.793717] console [pstore-1] enabled
[    0.797845] pstore: Registered ramoops as persistent store backend
[    0.804647] ramoops: attached 0x100000@0xf7edc000, ecc: 0/0
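
The ".magic: 00000000" in the report above is what a zero-filled lock
looks like to a debug check. A minimal user-space sketch of that
situation follows; toy_lock, toy_zone and the magic constant are
illustrative stand-ins, not the kernel's spinlock implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a debug spinlock; not the kernel's raw_spinlock_t. */
#define TOY_LOCK_MAGIC 0xdead4eadu	/* illustrative constant */

struct toy_lock {
	unsigned int magic;	/* set only by toy_lock_init() */
	int locked;
};

struct toy_zone {
	struct toy_lock lock;
	unsigned int flags;
};

/* Explicit initializer: the only place the magic value is written. */
static void toy_lock_init(struct toy_lock *l)
{
	l->magic = TOY_LOCK_MAGIC;
	l->locked = 0;
}

/* Returns 0 on success, -1 if the lock was never initialized. */
static int toy_lock_acquire(struct toy_lock *l)
{
	if (l->magic != TOY_LOCK_MAGIC)
		return -1;	/* a debug build would report "bad magic" */
	l->locked = 1;
	return 0;
}

/* Allocate a zeroed zone; optionally run the explicit initializer. */
static struct toy_zone *toy_zone_new(int init_lock)
{
	struct toy_zone *z = calloc(1, sizeof(*z));

	if (z && init_lock)
		toy_lock_init(&z->lock);
	return z;
}
```

A zone created with toy_zone_new(0) has an all-zero lock, which looks
"unlocked" to a non-debug build but fails the magic check; a zone
created with toy_zone_new(1) passes it.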

Fixes: 663deb47880f ("pstore: Allow prz to control need for locking")
Fixes: 109704492ef6 ("pstore: Make spinlock per zone instead of global")
Reported-by: Brian Norris 
Signed-off-by: Kees Cook 
Signed-off-by: Greg Kroah-Hartman

pstore: Allow prz to control need for locking

2017-08-07T02:19:43+00:00

commit 663deb47880f2283809669563c5a52ac7c6aef1a upstream.

In preparation for not locking at all for certain buffers, depending on
whether there's contention, make locking optional depending on the
initialization of the prz.
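
A minimal sketch of what "optional locking decided at initialization"
can look like, written as a user-space toy. The flag name
TOY_FLAG_NO_LOCK, the toy_prz structure and the counters standing in
for a real spinlock are all assumptions made for the example.

```c
#include <assert.h>

#define TOY_FLAG_NO_LOCK (1u << 0)	/* hypothetical flag name */

struct toy_prz {
	unsigned int flags;		/* fixed when the zone is set up */
	int lock_depth;			/* stands in for a real spinlock */
	unsigned long lock_count;	/* how many times the lock was taken */
	unsigned long size;
};

static void toy_prz_init(struct toy_prz *prz, unsigned int flags)
{
	prz->flags = flags;
	prz->lock_depth = 0;
	prz->lock_count = 0;
	prz->size = 0;
}

/*
 * Grow the buffer size by 'n' and return the old size. The lock is
 * taken only if the zone was not initialized with TOY_FLAG_NO_LOCK.
 */
static unsigned long toy_size_add(struct toy_prz *prz, unsigned long n)
{
	int need_lock = !(prz->flags & TOY_FLAG_NO_LOCK);
	unsigned long old;

	if (need_lock) {
		prz->lock_depth++;
		prz->lock_count++;
	}

	old = prz->size;
	prz->size += n;

	if (need_lock)
		prz->lock_depth--;

	return old;
}
```

Keeping the decision in the zone itself means callers do not need an
extra argument, which matches the note about moving the flag into the
prz.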

Signed-off-by: Joel Fernandes 
[kees: moved locking flag into prz instead of via caller arguments]
Signed-off-by: Kees Cook 
Signed-off-by: Greg Kroah-Hartman

Make file credentials available to the seqfile interfaces

2017-08-07T02:19:42+00:00

commit 34dbbcdbf63360661ff7bda6c5f52f99ac515f92 upstream.

A lot of seqfile users seem to be using things like %pK that use the
credentials of the current process, but that is actually completely
wrong for filesystem interfaces.

The unix semantics for permission checking files is to check permissions
at _open_ time, not at read or write time, and that is not just a small
detail: passing off stdin/stdout/stderr to a suid application and making
the actual IO happen in privileged context is a classic exploit
technique.

So if we want to be able to look at permissions at read time, we need to
use the file open credentials, not the current ones.  Normal file
accesses can just use "f_cred" (or any of the helper functions that do
that, like file_ns_capable()), but the seqfile interfaces do not have
any such options.

It turns out that seq_file _does_ save away the user_ns information of
the file, though.  Since user_ns is just part of the full credential
information, replace that special case with saving off the cred pointer
instead, and suddenly seq_file has all the permission information it
needs.
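
To make the open-time versus read-time distinction concrete, here is a
small user-space model. It is only a sketch: toy_cred, toy_file and the
toy_*() helpers are hypothetical and do not mirror the kernel
structures beyond the idea of saving the opener's credentials in the
file.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_cred {
	bool privileged;
};

struct toy_file {
	const struct toy_cred *f_cred;	/* opener's creds, saved at open */
};

/* Credentials of whichever "process" is running right now. */
static const struct toy_cred *toy_current;

/* Opening the file records the credentials of the opener. */
static void toy_open(struct toy_file *f)
{
	f->f_cred = toy_current;
}

/* Wrong: decides based on whoever happens to perform the read. */
static bool toy_may_show_current(const struct toy_file *f)
{
	(void)f;
	return toy_current->privileged;
}

/* Right: decides based on who opened the file. */
static bool toy_may_show_opener(const struct toy_file *f)
{
	return f->f_cred->privileged;
}
```

If an unprivileged process opens the file and then hands the
descriptor to a privileged one, only the opener-based check keeps
refusing to reveal the protected data.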

[sumits: this is used in Ubuntu as a fix for CVE-2015-8944]

Signed-off-by: Linus Torvalds 
Signed-off-by: Sumit Semwal 
Signed-off-by: Greg Kroah-Hartman

dentry name snapshots

2017-08-07T02:19:42+00:00

commit 49d31c2f389acfe83417083e1208422b4091cd9e upstream.

take_dentry_name_snapshot() takes a safe snapshot of dentry name;
if the name is a short one, it gets copied into caller-supplied
structure, otherwise an extra reference to external name is grabbed
(those are never modified).  In either case the pointer to stable
string is stored into the same structure.

dentry must be held by the caller of take_dentry_name_snapshot(),
but may be freely dropped afterwards - the snapshot will stay
until destroyed by release_dentry_name_snapshot().

Intended use:
	struct name_snapshot s;

	take_dentry_name_snapshot(&s, dentry);
	...
	access s.name
	...
	release_dentry_name_snapshot(&s);

Replaces fsnotify_oldname_...(), gets used in fsnotify to obtain the name
to pass down with event.
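
The copy-or-grab-a-reference behaviour can be modelled in user space
roughly as follows. This is a simplified sketch, not the dcache code:
the toy_* names, the inline length of 16 and the plain int refcount
(no locking, no RCU) are assumptions for illustration only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_INLINE_LEN 16	/* hypothetical inline-name capacity */

struct toy_ext_name {
	int refcount;
	char name[];		/* never modified once created */
};

struct toy_dentry {
	struct toy_ext_name *ext;	/* non-NULL for long names */
	char iname[TOY_INLINE_LEN];	/* used for short names */
};

struct toy_name_snapshot {
	const char *name;		/* stable string for the caller */
	struct toy_ext_name *ext;	/* reference held, or NULL */
	char inline_name[TOY_INLINE_LEN];
};

static void toy_ext_put(struct toy_ext_name *ext)
{
	if (ext && --ext->refcount == 0)
		free(ext);
}

/* Set the name of a dentry that holds no external name. */
static int toy_dentry_set_name(struct toy_dentry *d, const char *name)
{
	size_t len = strlen(name);

	d->ext = NULL;
	if (len < TOY_INLINE_LEN) {
		memcpy(d->iname, name, len + 1);
		return 0;
	}
	d->ext = malloc(sizeof(*d->ext) + len + 1);
	if (!d->ext)
		return -1;
	d->ext->refcount = 1;
	memcpy(d->ext->name, name, len + 1);
	return 0;
}

/* Rename: install the new name, then drop the dentry's old reference. */
static int toy_dentry_rename(struct toy_dentry *d, const char *name)
{
	struct toy_ext_name *old = d->ext;
	int ret = toy_dentry_set_name(d, name);

	if (ret == 0)
		toy_ext_put(old);
	else
		d->ext = old;
	return ret;
}

/* Short names are copied; long names get an extra reference. */
static void toy_take_snapshot(struct toy_name_snapshot *s,
			      const struct toy_dentry *d)
{
	s->ext = d->ext;
	if (s->ext) {
		s->ext->refcount++;
		s->name = s->ext->name;
	} else {
		strcpy(s->inline_name, d->iname);
		s->name = s->inline_name;
	}
}

static void toy_release_snapshot(struct toy_name_snapshot *s)
{
	toy_ext_put(s->ext);
	s->ext = NULL;
	s->name = NULL;
}
```

In both cases s.name stays valid across a later rename of the dentry,
which is the property the snapshot is meant to provide.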

Signed-off-by: Al Viro 
Signed-off-by: Greg Kroah-Hartman

xfs: don't BUG() on mixed direct and mapped I/O

2017-08-07T02:19:40+00:00

commit 04197b341f23b908193308b8d63d17ff23232598 upstream.

We've had reports of generic/095 causing XFS to BUG() in
__xfs_get_blocks() due to the existence of delalloc blocks on a
direct I/O read. generic/095 issues a mix of various types of I/O,
including direct and memory mapped I/O to a single file. This is
clearly not supported behavior and is known to lead to such
problems. E.g., the lack of exclusion between the direct I/O and
write fault paths means that a write fault can allocate delalloc
blocks in a region of a file that was previously a hole after the
direct read has attempted to flush/inval the file range, but before
it actually reads the block mapping. In turn, the direct read
discovers a delalloc extent and cannot proceed.

While the appropriate solution here is to not mix direct and memory
mapped I/O to the same regions of the same file, the current
BUG_ON() behavior is probably overkill as it can crash the entire
system.  Instead, localize the failure to the I/O in question by
returning an error for a direct I/O that cannot be handled safely
due to delalloc blocks. Be careful to allow the case of a direct
write to post-eof delalloc blocks. This can occur due to speculative
preallocation and is safe as post-eof blocks are not accompanied by
dirty pages in pagecache (conversely, preallocation within eof must
have been zeroed, and thus dirtied, before the inode size could have
been increased beyond said blocks).
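
The decision described in this paragraph can be summarized by a small
sketch. It is a user-space toy under stated assumptions: toy_mapping
and toy_dio_check() are hypothetical names, and -EIO is assumed here
as the error returned to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy description of a mapping lookup; field names are hypothetical. */
struct toy_mapping {
	bool delalloc;		/* mapping is a delayed-allocation extent */
	long long offset;	/* file offset of the I/O */
};

/*
 * Decide whether a direct I/O may proceed. Returns 0 if it may, or
 * -EIO if it hit delalloc blocks that cannot be handled safely. Only
 * a write at or beyond EOF is allowed to see delalloc blocks.
 */
static int toy_dio_check(const struct toy_mapping *map, bool is_write,
			 long long i_size)
{
	if (!map->delalloc)
		return 0;
	if (is_write && map->offset >= i_size)
		return 0;	/* post-eof speculative preallocation */
	return -EIO;		/* fail this I/O, do not crash the system */
}
```

The failure stays local to the offending I/O: the caller sees an
error, while unrelated I/O on the system continues.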

Finally, provide an additional warning if a direct I/O write occurs
while the file is memory mapped. This may not catch all problematic
scenarios, but provides a hint that some known-to-be-problematic I/O
methods are in use.

Signed-off-by: Brian Foster 
Reviewed-by: Dave Chinner 
Signed-off-by: Dave Chinner 
Signed-off-by: Nikolay Borisov 
Acked-by: Darrick J. Wong 
Signed-off-by: Greg Kroah-Hartman

pstore: Make spinlock per zone instead of global

2017-08-07T02:19:38+00:00

commit 109704492ef637956265ec2eb72ae7b3b39eb6f4 upstream.

Currently pstore has a global spinlock for all zones. Since the zones
are independent and modify different areas of memory, there's no need
to have a global lock, so we should use a per-zone lock as introduced
here. Also, once ramoops's ftrace use-case gains the FTRACE_PER_CPU flag
(introduced later), which splits the ftrace memory area into one zone
per CPU, the need for locking there is eliminated. In preparation for
this, make the locking optional.

Signed-off-by: Joel Fernandes 
[kees: updated commit message]
Signed-off-by: Kees Cook 
Cc: Leo Yan 
Signed-off-by: Greg Kroah-Hartman

ceph: fix race in concurrent readdir

2017-07-27T22:06:09+00:00

commit 84583cfb973c4313955c6231cc9cb3772d280b15 upstream.

For a large directory, a program needs to issue multiple readdir
syscalls to get all dentries. When multiple programs read the
directory concurrently, the following sequence of events can
happen.

 - program calls readdir with pos = 2. ceph sends readdir request
   to mds. The reply contains N1 entries. ceph adds these N1 entries
   to readdir cache.
 - program calls readdir with pos = N1+2. The readdir is satisfied
   by the readdir cache, N2 entries are returned. (Another program
   calls readdir in the meantime, which fills the cache.)
 - program calls readdir with pos = N1+N2+2. ceph sends readdir
   request to mds. The reply contains N3 entries and it reaches
   directory end. ceph adds these N3 entries to the readdir cache
   and marks directory complete.

The second readdir call does not update fi->readdir_cache_idx, so
ceph adds the last N3 entries at the wrong places.
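
The sequence above can be replayed with a toy model of a shared cache
and a per-reader fill index. This is a heavily simplified sketch, not
the ceph code: the toy_* names and the flat array are assumptions, and
the 'fixed' switch only illustrates advancing the index when a readdir
is served from the cache.

```c
#include <assert.h>

#define TOY_CACHE_MAX 64

struct toy_dir_cache {
	int slots[TOY_CACHE_MAX];	/* cached entry ids, 0 = empty */
	int filled;			/* number of slots in use */
};

struct toy_file_info {
	int cache_idx;	/* next cache slot this reader will fill */
};

/* Reply from the "mds": store 'n' entries starting with id 'first'. */
static void toy_fill_from_mds(struct toy_dir_cache *c,
			      struct toy_file_info *fi, int first, int n)
{
	for (int i = 0; i < n; i++)
		c->slots[fi->cache_idx++] = first + i;
	if (fi->cache_idx > c->filled)
		c->filled = fi->cache_idx;
}

/* Serve 'n' entries from the cache; 'fixed' also advances the index. */
static void toy_read_from_cache(struct toy_file_info *fi, int n, int fixed)
{
	if (fixed)
		fi->cache_idx += n;
}

/* Replay the sequence; return 1 if the cache ends up in order. */
static int toy_replay(int n1, int n2, int n3, int fixed)
{
	struct toy_dir_cache c = { { 0 }, 0 };
	struct toy_file_info a = { 0 };
	struct toy_file_info b = { 0 };
	int total = n1 + n2 + n3;

	assert(total <= TOY_CACHE_MAX);

	toy_fill_from_mds(&c, &a, 1, n1);	/* A: first readdir */
	b.cache_idx = n1;			/* B already read N1 entries */
	toy_fill_from_mds(&c, &b, n1 + 1, n2);	/* B fills the cache */
	toy_read_from_cache(&a, n2, fixed);	/* A: second readdir */
	toy_fill_from_mds(&c, &a, n1 + n2 + 1, n3);	/* A: third readdir */

	if (c.filled != total)
		return 0;
	for (int i = 0; i < total; i++)
		if (c.slots[i] != i + 1)
			return 0;
	return 1;
}
```

With fixed == 0 the third call overwrites the N2 entries added by the
other reader; with fixed == 1 the last N3 entries land after them.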

Signed-off-by: "Yan, Zheng" 
Signed-off-by: Ilya Dryomov 
Signed-off-by: Greg Kroah-Hartman

udf: Fix deadlock between writeback and udf_setsize()

2017-07-27T22:06:09+00:00

commit f2e95355891153f66d4156bf3a142c6489cd78c6 upstream.

udf_setsize() called truncate_setsize() with i_data_sem held. Thus
truncate_pagecache() called from truncate_setsize() could lock a page
under i_data_sem which can deadlock as page lock ranks below
i_data_sem - e.g. writeback can hold page lock and try to acquire
i_data_sem to map a block.

Fix the problem by moving truncate_setsize() calls out from under
i_data_sem. It is safe for us to change i_size without holding
i_data_sem as all the places that depend on i_size being stable already
hold inode_lock.

Fixes: 7e49b6f2480cb9a9e7322a91592e56a5c85361f5
Signed-off-by: Jan Kara 
Signed-off-by: Greg Kroah-Hartman