linux-stable.git/fs, branch v5.10.78

Revert "io_uring: reinforce cancel on flush during exit"

2021-11-06T13:10:08+00:00

This reverts commit 88dbd085a51ec78c83dde79ad63bca8aa4272a9d.

Causes the following Syzkaller reported issue:

BUG: kernel NULL pointer dereference, address: 0000000000000010
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 546 Comm: syz-executor631 Tainted: G    B             5.10.76-syzkaller-01178-g4944ec82ebb9 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:arch_atomic_try_cmpxchg syzkaller/managers/android-5-10/kernel/./arch/x86/include/asm/atomic.h:202 [inline]
RIP: 0010:atomic_try_cmpxchg_acquire syzkaller/managers/android-5-10/kernel/./include/asm-generic/atomic-instrumented.h:707 [inline]
RIP: 0010:queued_spin_lock syzkaller/managers/android-5-10/kernel/./include/asm-generic/qspinlock.h:82 [inline]
RIP: 0010:do_raw_spin_lock_flags syzkaller/managers/android-5-10/kernel/./include/linux/spinlock.h:195 [inline]
RIP: 0010:__raw_spin_lock_irqsave syzkaller/managers/android-5-10/kernel/./include/linux/spinlock_api_smp.h:119 [inline]
RIP: 0010:_raw_spin_lock_irqsave+0x10d/0x210 syzkaller/managers/android-5-10/kernel/kernel/locking/spinlock.c:159
Code: 00 00 00 e8 d5 29 09 fd 4c 89 e7 be 04 00 00 00 e8 c8 29 09 fd 42 8a 04 3b 84 c0 0f 85 be 00 00 00 8b 44 24 40 b9 01 00 00 00  41 0f b1 4d 00 75 45 48 c7 44 24 20 0e 36 e0 45 4b c7 04 37 00
RSP: 0018:ffffc90000f174e0 EFLAGS: 00010097
RAX: 0000000000000000 RBX: 1ffff920001e2ea4 RCX: 0000000000000001
RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffc90000f17520
RBP: ffffc90000f175b0 R08: dffffc0000000000 R09: 0000000000000003
R10: fffff520001e2ea5 R11: 0000000000000004 R12: ffffc90000f17520
R13: 0000000000000010 R14: 1ffff920001e2ea0 R15: dffffc0000000000
FS:  0000000000000000(0000) GS:ffff8881f7100000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000010 CR3: 000000000640f000 CR4: 00000000003506a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 prepare_to_wait+0x9c/0x290 syzkaller/managers/android-5-10/kernel/kernel/sched/wait.c:248
 io_uring_cancel_files syzkaller/managers/android-5-10/kernel/fs/io_uring.c:8690 [inline]
 io_uring_cancel_task_requests+0x16a9/0x1ed0 syzkaller/managers/android-5-10/kernel/fs/io_uring.c:8760
 io_uring_flush+0x170/0x6d0 syzkaller/managers/android-5-10/kernel/fs/io_uring.c:8923
 filp_close+0xb0/0x150 syzkaller/managers/android-5-10/kernel/fs/open.c:1319
 close_files syzkaller/managers/android-5-10/kernel/fs/file.c:401 [inline]
 put_files_struct+0x1d4/0x350 syzkaller/managers/android-5-10/kernel/fs/file.c:429
 exit_files+0x80/0xa0 syzkaller/managers/android-5-10/kernel/fs/file.c:458
 do_exit+0x6d9/0x23a0 syzkaller/managers/android-5-10/kernel/kernel/exit.c:808
 do_group_exit+0x16a/0x2d0 syzkaller/managers/android-5-10/kernel/kernel/exit.c:910
 get_signal+0x133e/0x1f80 syzkaller/managers/android-5-10/kernel/kernel/signal.c:2790
 arch_do_signal+0x8d/0x620 syzkaller/managers/android-5-10/kernel/arch/x86/kernel/signal.c:805
 exit_to_user_mode_loop syzkaller/managers/android-5-10/kernel/kernel/entry/common.c:161 [inline]
 exit_to_user_mode_prepare+0xaa/0xe0 syzkaller/managers/android-5-10/kernel/kernel/entry/common.c:191
 syscall_exit_to_user_mode+0x24/0x40 syzkaller/managers/android-5-10/kernel/kernel/entry/common.c:266
 do_syscall_64+0x3d/0x70 syzkaller/managers/android-5-10/kernel/arch/x86/entry/common.c:56
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fc6d1589a89
Code: Unable to access opcode bytes at RIP 0x7fc6d1589a5f.
RSP: 002b:00007ffd2b5da728 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffdfc RBX: 0000000000005193 RCX: 00007fc6d1589a89
RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007fc6d161142c
RBP: 0000000000000032 R08: 00007ffd2b5eb0b8 R09: 0000000000000000
R10: 00007ffd2b5da750 R11: 0000000000000246 R12: 00007fc6d161142c
R13: 00007ffd2b5da750 R14: 00007ffd2b5da770 R15: 0000000000000000
Modules linked in:
CR2: 0000000000000010
---[ end trace fe8044f7dc4d8d65 ]---
RIP: 0010:arch_atomic_try_cmpxchg syzkaller/managers/android-5-10/kernel/./arch/x86/include/asm/atomic.h:202 [inline]
RIP: 0010:atomic_try_cmpxchg_acquire syzkaller/managers/android-5-10/kernel/./include/asm-generic/atomic-instrumented.h:707 [inline]
RIP: 0010:queued_spin_lock syzkaller/managers/android-5-10/kernel/./include/asm-generic/qspinlock.h:82 [inline]
RIP: 0010:do_raw_spin_lock_flags syzkaller/managers/android-5-10/kernel/./include/linux/spinlock.h:195 [inline]
RIP: 0010:__raw_spin_lock_irqsave syzkaller/managers/android-5-10/kernel/./include/linux/spinlock_api_smp.h:119 [inline]
RIP: 0010:_raw_spin_lock_irqsave+0x10d/0x210 syzkaller/managers/android-5-10/kernel/kernel/locking/spinlock.c:159
Code: 00 00 00 e8 d5 29 09 fd 4c 89 e7 be 04 00 00 00 e8 c8 29 09 fd 42 8a 04 3b 84 c0 0f 85 be 00 00 00 8b 44 24 40 b9 01 00 00 00  41 0f b1 4d 00 75 45 48 c7 44 24 20 0e 36 e0 45 4b c7 04 37 00
RSP: 0018:ffffc90000f174e0 EFLAGS: 00010097
RAX: 0000000000000000 RBX: 1ffff920001e2ea4 RCX: 0000000000000001
RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffc90000f17520
RBP: ffffc90000f175b0 R08: dffffc0000000000 R09: 0000000000000003
R10: fffff520001e2ea5 R11: 0000000000000004 R12: ffffc90000f17520
R13: 0000000000000010 R14: 1ffff920001e2ea0 R15: dffffc0000000000
FS:  0000000000000000(0000) GS:ffff8881f7100000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000010 CR3: 000000000640f000 CR4: 00000000003506a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
----------------
Code disassembly (best guess), 1 bytes skipped:
   0:	00 00                	add    %al,(%rax)
   2:	e8 d5 29 09 fd       	callq  0xfd0929dc
   7:	4c 89 e7             	mov    %r12,%rdi
   a:	be 04 00 00 00       	mov    $0x4,%esi
   f:	e8 c8 29 09 fd       	callq  0xfd0929dc
  14:	42 8a 04 3b          	mov    (%rbx,%r15,1),%al
  18:	84 c0                	test   %al,%al
  1a:	0f 85 be 00 00 00    	jne    0xde
  20:	8b 44 24 40          	mov    0x40(%rsp),%eax
  24:	b9 01 00 00 00       	mov    $0x1,%ecx
* 29:	f0 41 0f b1 4d 00    	lock cmpxchg %ecx,0x0(%r13) <-- trapping instruction
  2f:	75 45                	jne    0x76
  31:	48 c7 44 24 20 0e 36 	movq   $0x45e0360e,0x20(%rsp)
  38:	e0 45
  3a:	4b                   	rex.WXB
  3b:	c7                   	.byte 0xc7
  3c:	04 37                	add    $0x37,%al

Link: https://syzkaller.appspot.com/bug?extid=b0003676644cf0d6acc4
Reported-by: syzbot+b0003676644cf0d6acc4@syzkaller.appspotmail.com
Signed-off-by: Lee Jones 
Signed-off-by: Greg Kroah-Hartman

ocfs2: fix race between searching chunks and release journal_head from buffer_head

2021-11-02T18:48:20+00:00

commit 6f1b228529ae49b0f85ab89bcdb6c365df401558 upstream.

Encountered a race between ocfs2_test_bg_bit_allocatable() and
jbd2_journal_put_journal_head() resulting in the below vmcore.

  PID: 106879  TASK: ffff880244ba9c00  CPU: 2   COMMAND: "loop3"
  Call trace:
    panic
    oops_end
    no_context
    __bad_area_nosemaphore
    bad_area_nosemaphore
    __do_page_fault
    do_page_fault
    page_fault
      [exception RIP: ocfs2_block_group_find_clear_bits+316]
    ocfs2_block_group_find_clear_bits [ocfs2]
    ocfs2_cluster_group_search [ocfs2]
    ocfs2_search_chain [ocfs2]
    ocfs2_claim_suballoc_bits [ocfs2]
    __ocfs2_claim_clusters [ocfs2]
    ocfs2_claim_clusters [ocfs2]
    ocfs2_local_alloc_slide_window [ocfs2]
    ocfs2_reserve_local_alloc_bits [ocfs2]
    ocfs2_reserve_clusters_with_limit [ocfs2]
    ocfs2_reserve_clusters [ocfs2]
    ocfs2_lock_refcount_allocators [ocfs2]
    ocfs2_make_clusters_writable [ocfs2]
    ocfs2_replace_cow [ocfs2]
    ocfs2_refcount_cow [ocfs2]
    ocfs2_file_write_iter [ocfs2]
    lo_rw_aio
    loop_queue_work
    kthread_worker_fn
    kthread
    ret_from_fork

When ocfs2_test_bg_bit_allocatable() called bh2jh(bg_bh), the
bg_bh->b_private NULL as jbd2_journal_put_journal_head() raced and
released the jounal head from the buffer head.  Needed to take bit lock
for the bit 'BH_JournalHead' to fix this race.

Link: https://lkml.kernel.org/r/1634820718-6043-1-git-send-email-gautham.ananthakrishna@oracle.com
Signed-off-by: Gautham Ananthakrishna 
Reviewed-by: Joseph Qi 
Cc: 
Cc: Mark Fasheh 
Cc: Joel Becker 
Cc: Junxiao Bi 
Cc: Changwei Ge 
Cc: Gang He 
Cc: Jun Piao 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

ext4: fix possible UAF when remounting r/o a mmp-protected file system

2021-11-02T18:48:18+00:00

commit 61bb4a1c417e5b95d9edb4f887f131de32e419cb upstream.

After commit 618f003199c6 ("ext4: fix memory leak in
ext4_fill_super"), after the file system is remounted read-only, there
is a race where the kmmpd thread can exit, causing sbi->s_mmp_tsk to
point at freed memory, which the call to ext4_stop_mmpd() can trip
over.

Fix this by only allowing kmmpd() to exit when it is stopped via
ext4_stop_mmpd().

Link: https://lore.kernel.org/r/20210707002433.3719773-1-tytso@mit.edu
Reported-by: Ye Bin 
Bug-Report-Link: <20210629143603.2166962-1-yebin10@huawei.com>
Signed-off-by: Theodore Ts'o 
Reviewed-by: Jan Kara 
Cc: Tadeusz Struk 
Signed-off-by: Greg Kroah-Hartman

io_uring: don't take uring_lock during iowq cancel

2021-11-02T18:48:18+00:00

commit 792bb6eb862333658bf1bd2260133f0507e2da8d upstream.

[   97.866748] a.out/2890 is trying to acquire lock:
[   97.867829] ffff8881046763e8 (&ctx->uring_lock){+.+.}-{3:3}, at:
io_wq_submit_work+0x155/0x240
[   97.869735]
[   97.869735] but task is already holding lock:
[   97.871033] ffff88810dfe0be8 (&ctx->uring_lock){+.+.}-{3:3}, at:
__x64_sys_io_uring_enter+0x3f0/0x5b0
[   97.873074]
[   97.873074] other info that might help us debug this:
[   97.874520]  Possible unsafe locking scenario:
[   97.874520]
[   97.875845]        CPU0
[   97.876440]        ----
[   97.877048]   lock(&ctx->uring_lock);
[   97.877961]   lock(&ctx->uring_lock);
[   97.878881]
[   97.878881]  *** DEADLOCK ***
[   97.878881]
[   97.880341]  May be due to missing lock nesting notation
[   97.880341]
[   97.881952] 1 lock held by a.out/2890:
[   97.882873]  #0: ffff88810dfe0be8 (&ctx->uring_lock){+.+.}-{3:3}, at:
__x64_sys_io_uring_enter+0x3f0/0x5b0
[   97.885108]
[   97.885108] stack backtrace:
[   97.890457] Call Trace:
[   97.891121]  dump_stack+0xac/0xe3
[   97.891972]  __lock_acquire+0xab6/0x13a0
[   97.892940]  lock_acquire+0x2c3/0x390
[   97.894894]  __mutex_lock+0xae/0x9f0
[   97.901101]  io_wq_submit_work+0x155/0x240
[   97.902112]  io_wq_cancel_cb+0x162/0x490
[   97.904126]  io_async_find_and_cancel+0x3b/0x140
[   97.905247]  io_issue_sqe+0x86d/0x13e0
[   97.909122]  __io_queue_sqe+0x10b/0x550
[   97.913971]  io_queue_sqe+0x235/0x470
[   97.914894]  io_submit_sqes+0xcce/0xf10
[   97.917872]  __x64_sys_io_uring_enter+0x3fb/0x5b0
[   97.921424]  do_syscall_64+0x2d/0x40
[   97.922329]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

While holding uring_lock, e.g. from inline execution, async cancel
request may attempt cancellations through io_wq_submit_work, which may
try to grab a lock. Delay it to task_work, so we do it from a clean
context and don't have to worry about locking.

Cc:  # 5.5+
Fixes: c07e6719511e ("io_uring: hold uring_lock while completing failed polled io in io_wq_submit_work()")
Reported-by: Abaci 
Reported-by: Hao Xu 
Signed-off-by: Pavel Begunkov 
Signed-off-by: Jens Axboe 
[Lee: The first hunk solves a different (double free) issue in v5.10.
      Only the first hunk of the original patch is relevant to v5.10 AND
      the first hunk of the original patch is only relevant to v5.10]
Reported-by: syzbot+59d8a1f4e60c20c066cf@syzkaller.appspotmail.com
Signed-off-by: Lee Jones 
Signed-off-by: Greg Kroah-Hartman

btrfs: deal with errors when checking if a dir entry exists during log replay

2021-10-27T07:56:54+00:00

[ Upstream commit 77a5b9e3d14cbce49ceed2766b2003c034c066dc ]

Currently inode_in_dir() ignores errors returned from
btrfs_lookup_dir_index_item() and from btrfs_lookup_dir_item(), treating
any errors as if the directory entry does not exists in the fs/subvolume
tree, which is obviously not correct, as we can get errors such as -EIO
when reading extent buffers while searching the fs/subvolume's tree.

Fix that by making inode_in_dir() return the errors and making its only
caller, add_inode_ref(), deal with returned errors as well.

Signed-off-by: Filipe Manana 
Reviewed-by: David Sterba 
Signed-off-by: David Sterba 
Signed-off-by: Sasha Levin

vfs: check fd has read access in kernel_read_file_from_fd()

2021-10-27T07:56:51+00:00

commit 032146cda85566abcd1c4884d9d23e4e30a07e9a upstream.

If we open a file without read access and then pass the fd to a syscall
whose implementation calls kernel_read_file_from_fd(), we get a warning
from __kernel_read():

        if (WARN_ON_ONCE(!(file->f_mode & FMODE_READ)))

This currently affects both finit_module() and kexec_file_load(), but it
could affect other syscalls in the future.

Link: https://lkml.kernel.org/r/20211007220110.600005-1-willy@infradead.org
Fixes: b844f0ecbc56 ("vfs: define kernel_copy_file_from_fd()")
Signed-off-by: Matthew Wilcox (Oracle) 
Reported-by: Hao Sun 
Reviewed-by: Kees Cook 
Acked-by: Christian Brauner 
Cc: Al Viro 
Cc: Mimi Zohar 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

userfaultfd: fix a race between writeprotect and exit_mmap()

2021-10-27T07:56:51+00:00

commit cb185d5f1ebf900f4ae3bf84cee212e6dd035aca upstream.

A race is possible when a process exits, its VMAs are removed by
exit_mmap() and at the same time userfaultfd_writeprotect() is called.

The race was detected by KASAN on a development kernel, but it appears
to be possible on vanilla kernels as well.

Use mmget_not_zero() to prevent the race as done in other userfaultfd
operations.

Link: https://lkml.kernel.org/r/20210921200247.25749-1-namit@vmware.com
Fixes: 63b2d4174c4ad ("userfaultfd: wp: add the writeprotect API to userfaultfd ioctl")
Signed-off-by: Nadav Amit 
Tested-by: Li  Wang 
Reviewed-by: Peter Xu 
Cc: Andrea Arcangeli 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

ocfs2: mount fails with buffer overflow in strlen

2021-10-27T07:56:51+00:00

commit b15fa9224e6e1239414525d8d556d824701849fc upstream.

Starting with kernel 5.11 built with CONFIG_FORTIFY_SOURCE mouting an
ocfs2 filesystem with either o2cb or pcmk cluster stack fails with the
trace below.  Problem seems to be that strings for cluster stack and
cluster name are not guaranteed to be null terminated in the disk
representation, while strlcpy assumes that the source string is always
null terminated.  This causes a read outside of the source string
triggering the buffer overflow detection.

  detected buffer overflow in strlen
  ------------[ cut here ]------------
  kernel BUG at lib/string.c:1149!
  invalid opcode: 0000 [#1] SMP PTI
  CPU: 1 PID: 910 Comm: mount.ocfs2 Not tainted 5.14.0-1-amd64 #1
    Debian 5.14.6-2
  RIP: 0010:fortify_panic+0xf/0x11
  ...
  Call Trace:
   ocfs2_initialize_super.isra.0.cold+0xc/0x18 [ocfs2]
   ocfs2_fill_super+0x359/0x19b0 [ocfs2]
   mount_bdev+0x185/0x1b0
   legacy_get_tree+0x27/0x40
   vfs_get_tree+0x25/0xb0
   path_mount+0x454/0xa20
   __x64_sys_mount+0x103/0x140
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Link: https://lkml.kernel.org/r/20210929180654.32460-1-vvidic@valentin-vidic.from.hr
Signed-off-by: Valentin Vidic 
Reviewed-by: Joseph Qi 
Cc: Mark Fasheh 
Cc: Joel Becker 
Cc: Junxiao Bi 
Cc: Changwei Ge 
Cc: Gang He 
Cc: Jun Piao 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

ocfs2: fix data corruption after conversion from inline format

2021-10-27T07:56:51+00:00

commit 5314454ea3ff6fc746eaf71b9a7ceebed52888fa upstream.

Commit 6dbf7bb55598 ("fs: Don't invalidate page buffers in
block_write_full_page()") uncovered a latent bug in ocfs2 conversion
from inline inode format to a normal inode format.

The code in ocfs2_convert_inline_data_to_extents() attempts to zero out
the whole cluster allocated for file data by grabbing, zeroing, and
dirtying all pages covering this cluster.  However these pages are
beyond i_size, thus writeback code generally ignores these dirty pages
and no blocks were ever actually zeroed on the disk.

This oversight was fixed by commit 693c241a5f6a ("ocfs2: No need to zero
pages past i_size.") for standard ocfs2 write path, inline conversion
path was apparently forgotten; the commit log also has a reasoning why
the zeroing actually is not needed.

After commit 6dbf7bb55598, things became worse as writeback code stopped
invalidating buffers on pages beyond i_size and thus these pages end up
with clean PageDirty bit but with buffers attached to these pages being
still dirty.  So when a file is converted from inline format, then
writeback triggers, and then the file is grown so that these pages
become valid, the invalid dirtiness state is preserved,
mark_buffer_dirty() does nothing on these pages (buffers are already
dirty) but page is never written back because it is clean.  So data
written to these pages is lost once pages are reclaimed.

Simple reproducer for the problem is:

  xfs_io -f -c "pwrite 0 2000" -c "pwrite 2000 2000" -c "fsync" \
    -c "pwrite 4000 2000" ocfs2_file

After unmounting and mounting the fs again, you can observe that end of
'ocfs2_file' has lost its contents.

Fix the problem by not doing the pointless zeroing during conversion
from inline format similarly as in the standard write path.

[akpm@linux-foundation.org: fix whitespace, per Joseph]

Link: https://lkml.kernel.org/r/20210930095405.21433-1-jack@suse.cz
Fixes: 6dbf7bb55598 ("fs: Don't invalidate page buffers in block_write_full_page()")
Signed-off-by: Jan Kara 
Reviewed-by: Joseph Qi 
Tested-by: Joseph Qi 
Acked-by: Gang He 
Cc: Mark Fasheh 
Cc: Joel Becker 
Cc: Junxiao Bi 
Cc: Changwei Ge 
Cc: Jun Piao 
Cc: "Markov, Andrey" 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

ceph: fix handling of "meta" errors

2021-10-27T07:56:51+00:00

commit 1bd85aa65d0e7b5e4d09240f492f37c569fdd431 upstream.

Currently, we check the wb_err too early for directories, before all of
the unsafe child requests have been waited on. In order to fix that we
need to check the mapping->wb_err later nearer to the end of ceph_fsync.

We also have an overly-complex method for tracking errors after
blocklisting. The errors recorded in cleanup_session_requests go to a
completely separate field in the inode, but we end up reporting them the
same way we would for any other error (in fsync).

There's no real benefit to tracking these errors in two different
places, since the only reporting mechanism for them is in fsync, and
we'd need to advance them both every time.

Given that, we can just remove i_meta_err, and convert the places that
used it to instead just use mapping->wb_err instead. That also fixes
the original problem by ensuring that we do a check_and_advance of the
wb_err at the end of the fsync op.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/52864
Reported-by: Patrick Donnelly 
Signed-off-by: Jeff Layton 
Reviewed-by: Xiubo Li 
Signed-off-by: Ilya Dryomov 
Signed-off-by: Greg Kroah-Hartman