linux.git/fs/super.c, branch v6.8

fs/super.c: don't drop ->s_user_ns until we free struct super_block itself

2024-02-25T07:10:31+00:00

Avoids fun races in RCU pathwalk...  Same goes for freeing LSM shite
hanging off super_block's arse.

Reviewed-by: Christian Brauner 
Signed-off-by: Al Viro

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux

2024-01-10T18:24:49+00:00

Pull fscrypt updates from Eric Biggers:
 "Adjust the timing of the fscrypt keyring destruction, to prepare for
  btrfs's fscrypt support.

  Also document that CephFS supports fscrypt now"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
  fs: move fscrypt keyring destruction to after ->put_super
  f2fs: move release of block devices to after kill_block_super()
  fscrypt: document that CephFS supports fscrypt now
  fscrypt: update comment for do_remove_key()
  fscrypt.rst: update definition of struct fscrypt_context_v2

Merge tag 'vfs-6.8.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2024-01-08T18:43:51+00:00

Pull vfs super updates from Christian Brauner:
"This contains the super work for this cycle including the long-awaited
series by Jan to make it possible to prevent writing to mounted block
devices:

- Writing to mounted devices is dangerous and can lead to filesystem
corruption as well as crashes. Furthermore syzbot comes with more
and more involved examples how to corrupt block device under a
mounted filesystem leading to kernel crashes and reports we can do
nothing about. Add tracking of writers to each block device and a
kernel cmdline argument which controls whether other writeable
opens to block devices open with BLK_OPEN_RESTRICT_WRITES flag are
allowed.

Note that this effectively only prevents modification of the
particular block device's page cache by other writers. The actual
device content can still be modified by other means - e.g. by
issuing direct scsi commands, by doing writes through devices lower
in the storage stack (e.g. in case loop devices, DM, or MD are
involved) etc. But blocking direct modifications of the block
device page cache is enough to give filesystems a chance to perform
data validation when loading data from the underlying storage and
thus prevent kernel crashes.

Syzbot can use this cmdline argument option to avoid uninteresting
crashes. Also users whose userspace setup does not need writing to
mounted block devices can set this option for hardening. We expect
that this will be interesting to quite a few workloads.

Btrfs is currently opted out of this because they still haven't
merged patches we require for this to work from three kernel
releases ago.

- Reimplement block device freezing and thawing as holder operations
on the block device.

This allows us to extend block device freezing to all devices
associated with a superblock and not just the main device. It also
allows us to remove get_active_super() and thus another function
that scans the global list of superblocks.

Freezing via additional block devices only works if the filesystem
chooses to use @fs_holder_ops for these additional devices as well.
That currently only includes ext4 and xfs.

Earlier releases switched get_tree_bdev() and mount_bdev() to use
@fs_holder_ops. The remaining nilfs2 open-coded version of
mount_bdev() has been converted to rely on @fs_holder_ops as well.
So block device freezing for the main block device will continue to
work as before.

There should be no regressions in functionality. The only special
case is btrfs where block device freezing for the main block device
never worked because sb->s_bdev isn't set. Block device freezing
for btrfs can be fixed once they can switch to @fs_holder_ops but
that can happen whenever they're ready"

* tag 'vfs-6.8.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits)
block: Fix a memory leak in bdev_open_by_dev()
super: don't bother with WARN_ON_ONCE()
super: massage wait event mechanism
ext4: Block writes to journal device
xfs: Block writes to log device
fs: Block writes to mounted block devices
btrfs: Do not restrict writes to btrfs devices
block: Add config option to not allow writing to mounted devices
block: Remove blkdev_get_by_*() functions
bcachefs: Convert to bdev_open_by_path()
fs: handle freezing from multiple devices
fs: remove dead check
nilfs2: simplify device handling
fs: streamline thaw_super_locked
ext4: simplify device handling
xfs: simplify device handling
fs: simplify setup_bdev_super() calls
blkdev: comment fs_holder_ops
porting: document block device freeze and thaw changes
fs: remove unused helper
...

fs: move fscrypt keyring destruction to after ->put_super

2023-12-28T03:56:01+00:00

btrfs has a variety of asynchronous things we do with inodes that can
potentially last until ->put_super, when we shut everything down and
clean up all of our async work.  Due to this we need to move
fscrypt_destroy_keyring() to after ->put_super, otherwise we get
warnings about still having active references on the master key.

Signed-off-by: Josef Bacik 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Neal Gompa 
Link: https://lore.kernel.org/r/20231227171429.9223-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers

fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation

2023-12-12T13:24:54+00:00

There is no reason to use a GFP_USER flag for struct super_block allocation
in the alloc_super(). Instead, let's use GFP_KERNEL for that.

>From the memory management perspective, the only difference between
GFP_USER and GFP_KERNEL is that GFP_USER allocations are tied to a cpuset,
while GFP_KERNEL ones are not.

There is no real issue and this is not a candidate to go to the stable,
but let's fix it for a consistency sake.

Cc: Jan Kara 
Cc: Alexander Viro 
Cc: Christian Brauner 
Cc: 
Cc: 
Signed-off-by: Alexander Mikhalitsyn 
Link: https://lore.kernel.org/r/20231208151022.156273-1-aleksandr.mikhalitsyn@canonical.com
Reviewed-by: Jan Kara 
Signed-off-by: Christian Brauner

super: don't bother with WARN_ON_ONCE()

2023-11-28T18:15:37+00:00

We hold our own active reference and we've checked it above.

Link: https://lore.kernel.org/r/20231127-vfs-super-massage-wait-v1-2-9ab277bfd01a@kernel.org
Signed-off-by: Christian Brauner

super: massage wait event mechanism

2023-11-28T18:15:34+00:00

We're currently using two separate helpers wait_born() and wait_dead()
when we can just all do it in a single helper super_load_flags(). We're
also acquiring the lock before we check whether this superblock is even
a viable candidate. If it's already dying we don't even need to bother
with the lock.

Link: https://lore.kernel.org/r/20231127-vfs-super-massage-wait-v1-1-9ab277bfd01a@kernel.org
Signed-off-by: Christian Brauner

fs: handle freezing from multiple devices

2023-11-18T13:59:25+00:00

Before [1] freezing a filesystems through the block layer only worked
for the main block device as the owning superblock of additional block
devices could not be found. Any filesystem that made use of multiple
block devices would only be freezable via it's main block device.

For example, consider xfs over device mapper with /dev/dm-0 as main
block device and /dev/dm-1 as external log device. Two freeze requests
before [1]:

(1) dmsetup suspend /dev/dm-0 on the main block device

    bdev_freeze(dm-0)
    -> dm-0->bd_fsfreeze_count++
    -> freeze_super(xfs-sb)

    The owning superblock is found and the filesystem gets frozen.
    Returns 0.

(2) dmsetup suspend /dev/dm-1 on the log device

    bdev_freeze(dm-1)
    -> dm-1->bd_fsfreeze_count++

    The owning superblock isn't found and only the block device freeze
    count is incremented. Returns 0.

Two freeze requests after [1]:

(1') dmsetup suspend /dev/dm-0 on the main block device

    bdev_freeze(dm-0)
    -> dm-0->bd_fsfreeze_count++
    -> freeze_super(xfs-sb)

    The owning superblock is found and the filesystem gets frozen.
    Returns 0.

(2') dmsetup suspend /dev/dm-1 on the log device

    bdev_freeze(dm-0)
    -> dm-0->bd_fsfreeze_count++
    -> freeze_super(xfs-sb)

    The owning superblock is found and the filesystem gets frozen.
    Returns -EBUSY.

When (2') is called we initiate a freeze from another block device of
the same superblock. So we increment the bd_fsfreeze_count for that
additional block device. But we now also find the owning superblock for
additional block devices and call freeze_super() again which reports
-EBUSY.

This can be reproduced through xfstests via:

    mkfs.xfs -f -m crc=1,reflink=1,rmapbt=1, -i sparse=1 -lsize=1g,logdev=/dev/nvme1n1p4 /dev/nvme1n1p3
    mkfs.xfs -f -m crc=1,reflink=1,rmapbt=1, -i sparse=1 -lsize=1g,logdev=/dev/nvme1n1p6 /dev/nvme1n1p5

    FSTYP=xfs
    export TEST_DEV=/dev/nvme1n1p3
    export TEST_DIR=/mnt/test
    export TEST_LOGDEV=/dev/nvme1n1p4
    export SCRATCH_DEV=/dev/nvme1n1p5
    export SCRATCH_MNT=/mnt/scratch
    export SCRATCH_LOGDEV=/dev/nvme1n1p6
    export USE_EXTERNAL=yes

    sudo ./check generic/311

Current semantics allow two concurrent freezers: one initiated from
userspace via FREEZE_HOLDER_USERSPACE and one initiated from the kernel
via FREEZE_HOLDER_KERNEL. If there are multiple concurrent freeze
requests from either FREEZE_HOLDER_USERSPACE or FREEZE_HOLDER_KERNEL
-EBUSY is returned.

We need to preserve these semantics because as they are uapi via
FIFREEZE and FITHAW ioctl()s. IOW, freezes don't nest for FIFREEZE and
FITHAW. Other kernels consumers rely on non-nesting freezes as well.

With freezes initiated from the block layer freezes need to nest if the
same superblock is frozen via multiple devices. So we need to start
counting the number of freeze requests.

If FREEZE_MAY_NEST is passed alongside FREEZE_HOLDER_KERNEL or
FREEZE_HOLDER_USERSPACE we allow the caller to nest freeze calls.

To accommodate the old semantics we split the freeze counter into two
counting kernel initiated and userspace initiated freezes separately. We
can then also stop recording FREEZE_HOLDER_* in struct sb_writers.

We also simplify freezing by making all concurrent freezers share a
single active superblock reference count instead of having separate
references for kernel and userspace. I don't see why we would need two
active reference counts. Neither FREEZE_HOLDER_KERNEL nor
FREEZE_HOLDER_USERSPACE can put the active reference as long as they are
concurrent freezers anwyay. That was already true before we allowed
nesting freezes.

Survives various fstests runs with different options including the
reproducer, online scrub, and online repair, fsfreze, and so on. Also
survives blktests.

Link: https://lore.kernel.org/linux-block/87bkccnwxc.fsf@debian-BULLSEYE-live-builder-AMD64
Link: https://lore.kernel.org/r/20231104-vfs-multi-device-freeze-v2-2-5b5b69626eac@kernel.org
Fixes: 288d8706abfc ("bdev: implement freeze and thaw holder operations") [1] # no backport needed
Tested-by: Chandan Babu R 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Reported-by: Chandan Babu R 
Signed-off-by: Christian Brauner

fs: remove dead check

2023-11-18T13:59:24+00:00

Above we call super_lock_excl() which waits until the superblock is
SB_BORN and since SB_BORN is never unset once set this check can never
fire. Plus, we also hold an active reference at this point already so
this superblock can't even be shutdown.

Link: https://lore.kernel.org/r/20231104-vfs-multi-device-freeze-v2-1-5b5b69626eac@kernel.org
Tested-by: Chandan Babu R 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Signed-off-by: Christian Brauner

fs: streamline thaw_super_locked

2023-11-18T13:59:24+00:00

Add a new out_unlock label to share code that just releases s_umount
and returns an error, and rename and reuse the out label that deactivates
the sb for one more case.

Signed-off-by: Christoph Hellwig 
Link: https://lore.kernel.org/r/20231027064001.GA9469@lst.de
Reviewed-by: Jan Kara 
Signed-off-by: Christian Brauner