linux-stable.git/fs/ceph, branch master

Merge tag 'ceph-for-7.2-rc1' of https://github.com/ceph/ceph-client

2026-06-26T23:15:53+00:00

Pull ceph updates from Ilya Dryomov:
 "This adds support for manual client session reset in CephFS, allowing
  operators to get out of tricky livelock situations involving caps and
  file locks without evicting the problematic client instance on the MDS
  side or rebooting the client node, both of which can be disruptive"

* tag 'ceph-for-7.2-rc1' of https://github.com/ceph/ceph-client:
  ceph: add manual reset debugfs control and tracepoints
  ceph: add client reset state machine and session teardown
  ceph: add diagnostic timeout loop to wait_caps_flush()
  ceph: harden send_mds_reconnect and handle active-MDS peer reset
  ceph: use proper endian conversion for flock_len in reconnect
  ceph: convert inode flags to named bit positions and atomic bitops
  rbd: switch to dynamic root device

ceph: add manual reset debugfs control and tracepoints

2026-06-22T20:45:05+00:00

Add the debugfs and trace plumbing used to trigger and observe
manual client reset.

The reset interface exposes a trigger file for operator-initiated
reset and a status file for tracking the most recent run.  The
tracepoints record scheduling, completion, and blocked caller
behavior so reset progress can be diagnosed from the client side.

debugfs layout under /sys/kernel/debug/ceph//reset/:
  trigger - write to initiate a manual reset
  status  - read to see the most recent reset result

The reset directory is cleaned up via debugfs_remove_recursive()
on the parent, so individual file dentries are not stored.

Tracepoints:
  ceph_client_reset_schedule  - reset queued
  ceph_client_reset_complete  - reset finished (success or failure)
  ceph_client_reset_blocked   - caller blocked waiting for reset
  ceph_client_reset_unblocked - caller unblocked after reset

All tracepoints use a null-safe access for monc.auth->global_id
to guard against early-init or late-teardown edge cases.

Signed-off-by: Alex Markuze 
Reviewed-by: Viacheslav Dubeyko 
Signed-off-by: Viacheslav Dubeyko 
Signed-off-by: Ilya Dryomov

ceph: add client reset state machine and session teardown

2026-06-22T20:45:00+00:00

Add the client-side reset state machine, request gating, and manual
session teardown implementation.

Manual reset is an operator-triggered escape hatch for client/MDS
stalemates in which caps, locks, or unsafe metadata state stop making
forward progress.  The reset blocks new metadata work, attempts a
bounded best-effort drain of dirty client state while sessions are
still alive, and finally asks the MDS to close sessions before tearing
local session state down directly.

The reset state machine tracks four phases: IDLE -> QUIESCING ->
DRAINING -> TEARDOWN -> IDLE.  QUIESCING is set synchronously by
schedule_reset() before the workqueue item is dispatched, so that new
metadata requests and file-lock acquisitions are gated immediately --
even before the work function begins running.  All non-IDLE phases
block callers on blocked_wq, preventing races with session teardown.
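
A minimal userspace sketch of the phase ordering and gating rule described
above. The enum and function names are hypothetical and are not taken from
the patch; only the phase names and their order come from the text.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; the phases mirror the ones listed in the commit text. */
enum reset_phase {
	RESET_IDLE,
	RESET_QUIESCING,
	RESET_DRAINING,
	RESET_TEARDOWN,
};

/* Every non-IDLE phase gates new metadata requests and file-lock calls. */
bool reset_blocks_callers(enum reset_phase phase)
{
	return phase != RESET_IDLE;
}

/* Advance IDLE -> QUIESCING -> DRAINING -> TEARDOWN -> IDLE. */
enum reset_phase reset_next_phase(enum reset_phase phase)
{
	switch (phase) {
	case RESET_IDLE:      return RESET_QUIESCING;
	case RESET_QUIESCING: return RESET_DRAINING;
	case RESET_DRAINING:  return RESET_TEARDOWN;
	case RESET_TEARDOWN:  return RESET_IDLE;
	}
	assert(!"unknown reset phase");
	return RESET_IDLE;
}
```

Because QUIESCING already counts as non-IDLE, setting it synchronously is
enough to gate callers before the queued work item runs.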

The drain phase flushes mdlog state, dirty caps, and pending cap
releases for a bounded interval.  State that still cannot make progress
within that interval is discarded during teardown, which is the point
of the reset: break the stalemate and allow fresh sessions to rebuild
clean state.

The session teardown follows the established check_new_map()
forced-close pattern: unregister sessions under mdsc->mutex, then clean
up caps and requests under s->s_mutex.  Reconnect is not attempted
because the MDS only accepts reconnects during its own RECONNECT phase
after restart, not from an active client.

Blocked callers are released when reset completes and observe the final
result via -EAGAIN (reset failed) or 0 (success).  Internal work-function
errors such as -ENOMEM are not propagated to unrelated callers like
open() or flock(); the detailed error remains in debugfs and
tracepoints.
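
The result mapping for blocked callers can be sketched as follows. The
function name is hypothetical; the sketch only illustrates that any internal
failure collapses to -EAGAIN from the caller's point of view.

```c
#include <errno.h>

/*
 * reset_err is the work function's internal result (0 or a negative errno).
 * Blocked callers never see the detailed code, only success or -EAGAIN.
 */
int blocked_caller_result(int reset_err)
{
	return reset_err ? -EAGAIN : 0;
}
```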

The work function checks st->shutdown before each phase transition
(DRAINING, TEARDOWN) so that state set by a concurrent
ceph_mdsc_destroy() is not overwritten.  If destroy already took
ownership, the work function releases session references and returns
without touching the state.

The timeout calculation for blocked-request waiters uses max_t() to
prevent jiffies underflow when the deadline has already passed.
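
A userspace sketch of the clamping idea, assuming an absolute deadline in
ticks. The helper name and the minimum of one tick are assumptions made for
illustration; the patch itself uses max_t() on jiffies values.

```c
/*
 * Hypothetical helper: remaining wait in ticks for a waiter with an
 * absolute deadline.  A plain unsigned subtraction would wrap to a huge
 * value once the deadline has passed, so the difference is interpreted
 * as signed and clamped to a small positive minimum.
 */
long remaining_ticks(unsigned long deadline, unsigned long now)
{
	long left = (long)(deadline - now);	/* negative if already expired */

	return left > 0 ? left : 1;		/* like max_t(long, left, 1) */
}
```

Taking the difference first and interpreting it as signed also keeps the
result correct when the tick counter itself wraps around.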

The close-grace sleep before teardown is a best-effort nudge to let
queued REQUEST_CLOSE messages egress; it is not a correctness
requirement since the MDS still has session_autoclose as a fallback.

The destroy path marks reset as failed and wakes blocked waiters before
cancel_work_sync() so unmount does not stall.

Signed-off-by: Alex Markuze 
Reviewed-by: Viacheslav Dubeyko 
Signed-off-by: Viacheslav Dubeyko 
Signed-off-by: Ilya Dryomov

ceph: add diagnostic timeout loop to wait_caps_flush()

2026-06-22T20:44:56+00:00

Convert wait_caps_flush() from a silent indefinite wait into a diagnostic
wait loop that periodically dumps pending cap flush state.

The underlying wait semantics remain intact: callers still wait until the
requested cap flushes complete. The difference is that long stalls now
produce actionable diagnostics instead of looking like a silent hang.

CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES limits the number of entries
emitted per diagnostic dump, and CEPH_CAP_FLUSH_MAX_DUMP_ITERS
limits the number of timed diagnostic dumps before the wait
continues silently.  When more entries exist than the per-dump
limit, a truncation count is reported.  When the dump iteration
limit is reached, a final suppression message is emitted so the
transition to silence is explicit.

The diagnostic dump collects flush entry data under cap_dirty_lock into
a bounded on-stack array, then prints after releasing the lock.  This
avoids holding the spinlock across printk calls.
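
The snapshot-then-print pattern can be sketched in userspace as below. The
lock functions are stand-ins for taking cap_dirty_lock, and the limit value,
struct layout, and function names are illustrative rather than the ones in
the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_DUMP_ENTRIES 4	/* illustrative per-dump limit */

struct flush_entry {
	unsigned long long tid;
	struct flush_entry *next;
};

/* Stand-ins for spin_lock()/spin_unlock() on the flush list lock. */
int lock_held;
void dirty_lock(void)   { lock_held = 1; }
void dirty_unlock(void) { lock_held = 0; }

/*
 * Copy up to MAX_DUMP_ENTRIES tids into an on-stack array while locked,
 * then print after unlocking.  Returns the number of entries left out.
 */
size_t dump_pending_flushes(const struct flush_entry *head)
{
	unsigned long long snap[MAX_DUMP_ENTRIES];
	size_t n = 0, truncated = 0;

	dirty_lock();
	for (const struct flush_entry *e = head; e; e = e->next) {
		if (n < MAX_DUMP_ENTRIES)
			snap[n++] = e->tid;
		else
			truncated++;
	}
	dirty_unlock();

	/* Printing happens only after the lock has been dropped. */
	assert(!lock_held);
	for (size_t i = 0; i < n; i++)
		printf("pending flush tid %llu\n", snap[i]);
	if (truncated)
		printf("... %zu more entries not shown\n", truncated);
	return truncated;
}
```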

A null cf->ci on the global flush list indicates a bug since all
cap_flush entries are initialized with a valid ci before being added.
Signal this with WARN_ON_ONCE while still printing enough context for
debugging.

READ_ONCE is used for the i_last_cap_flush_ack field, which is read
outside the inode lock domain. Flush tids are monotonically increasing
and acks are processed in order under i_ceph_lock, so the latest ack
tid is always the most recently written value.

Add a ci pointer to struct ceph_cap_flush so that the diagnostic
dump can identify which inode each pending flush belongs to.  The
new i_last_cap_flush_ack field tracks the latest acknowledged flush
tid per inode for diagnostic correlation.

This improves reset-drain observability and is also useful for
existing sync and writeback troubleshooting paths.

Signed-off-by: Alex Markuze 
Reviewed-by: Viacheslav Dubeyko 
Signed-off-by: Viacheslav Dubeyko 
Signed-off-by: Ilya Dryomov

ceph: harden send_mds_reconnect and handle active-MDS peer reset

2026-06-22T20:44:50+00:00

Change send_mds_reconnect() to return an error code so callers can detect
and report reconnect failures instead of silently ignoring them. Add early
bailout checks for sessions that are already closed, rejected, or
unregistered, which avoids sending reconnect messages for sessions that
can no longer be recovered.

The early -ESTALE and -ENOENT bailouts use a separate fail_return label
that skips the pr_err_client diagnostic, since these codes indicate
expected concurrent-teardown races rather than genuine reconnect build
failures.

Move the "reconnect start" log after the early-bailout checks so it
only appears for sessions that actually proceed with reconnect.

Save the prior session state before transitioning to RECONNECTING,
and restore it in the failure path.  Without this, a transient
build or encoding failure (-ENOMEM, -ENOSPC) strands the session
in RECONNECTING indefinitely because check_new_map() only retries
sessions in RESTARTING state.
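
A reduced sketch of the save-and-restore step. The state enum, struct, and
the build_err parameter are hypothetical simplifications and do not mirror
the real send_mds_reconnect() signature.

```c
#include <errno.h>

/* Hypothetical, reduced set of session states for illustration. */
enum session_state {
	SESSION_OPEN,
	SESSION_RESTARTING,
	SESSION_RECONNECTING,
};

struct session {
	enum session_state state;
};

/*
 * build_err stands in for the result of building the reconnect message
 * (0 on success, a negative errno such as -ENOMEM on failure).
 */
int send_reconnect(struct session *s, int build_err)
{
	enum session_state old_state = s->state;	/* remember prior state */

	s->state = SESSION_RECONNECTING;
	if (build_err) {
		/* Restore, e.g. back to RESTARTING, so a later retry can run. */
		s->state = old_state;
		return build_err;
	}
	return 0;
}
```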

Rewrite mds_peer_reset() to handle the case where the MDS is past its
RECONNECT phase (i.e. active). An active MDS rejects CLIENT_RECONNECT
messages because it only accepts them during its own RECONNECT window
after restart. Previously, the client would send a doomed reconnect
that the MDS would reject or ignore. Now, the client tears the session
down locally and lets new requests re-open a fresh session, which is
the correct recovery for this scenario. The RECONNECTING state is
handled on the same teardown path, since the MDS will reject reconnect
attempts from an active client regardless of the session's local state.

Add explicit cases for CLOSED and REJECTED session states in
mds_peer_reset() since these are terminal states where a connection
drop is expected behavior.

The session teardown path in mds_peer_reset() follows the established
drop-and-reacquire locking pattern from check_new_map(): take
mdsc->mutex for session unregistration, release it, then take s->s_mutex
separately for cleanup. This avoids introducing a new simultaneous lock
nesting pattern.

Log reconnect failures from check_new_map() and mds_peer_reset() at
pr_warn level rather than pr_err, since return codes like -ESTALE
(closed/rejected session) and -ENOENT (unregistered session) are
expected during concurrent teardown. Log dropped messages for
unregistered sessions via doutc() (dynamic debug) rather than
pr_info, as post-reset message arrival is routine and does not
warrant unconditional logging.

Signed-off-by: Alex Markuze 
Reviewed-by: Viacheslav Dubeyko 
Signed-off-by: Viacheslav Dubeyko 
Signed-off-by: Ilya Dryomov

ceph: use proper endian conversion for flock_len in reconnect

2026-06-22T20:44:47+00:00

Replace the __force __le32 cast with cpu_to_le32() for the flock_len field
in reconnect_caps_cb(). The old code used a type-system bypass to silence
sparse; the new form uses the proper endian conversion macro.
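
For illustration, a userspace analogue of what a conversion helper does that
a cast does not: a cast only changes the type the checker sees and performs
no byte swap, while the helper fixes the in-memory byte order.  The function
name is hypothetical; the kernel macro is cpu_to_le32().

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical userspace analogue of cpu_to_le32(): return a value whose
 * in-memory bytes are little-endian regardless of host byte order.
 */
uint32_t to_le32(uint32_t v)
{
	const uint8_t bytes[4] = {
		(uint8_t)(v & 0xff),
		(uint8_t)((v >> 8) & 0xff),
		(uint8_t)((v >> 16) & 0xff),
		(uint8_t)((v >> 24) & 0xff),
	};
	uint32_t out;

	memcpy(&out, bytes, sizeof(out));	/* byte order is now fixed */
	return out;
}
```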

Also switch from a raw bitmask test against i_ceph_flags to test_bit() on
the named CEPH_I_ERROR_FILELOCK_BIT, which is the correct accessor for the
unsigned long flags field after the bit-position conversion.

Remove the now-unused CEPH_I_ERROR_FILELOCK mask define since all callers
use the _BIT form with test_bit/set_bit/clear_bit.

Signed-off-by: Alex Markuze 
Reviewed-by: Viacheslav Dubeyko 
Signed-off-by: Viacheslav Dubeyko 
Signed-off-by: Ilya Dryomov

ceph: convert inode flags to named bit positions and atomic bitops

2026-06-22T20:44:42+00:00

Define named bit-position constants for all CEPH_I_* inode flags and
derive the bitmask values from them.  This gives every flag a named
_BIT constant usable with the test_bit/set_bit/clear_bit family.
The intentionally unused bit position 1 is documented inline.

Convert all flag modifications to use atomic bitops (set_bit,
clear_bit, test_and_clear_bit).  The previous code mixed lockless
atomic ops on some flags (ERROR_WRITE, ODIRECT) with non-atomic
read-modify-write (|= / &= ~) on other flags sharing the same
unsigned long.  A concurrent non-atomic RMW can clobber an
adjacent lockless atomic update -- for example, a lockless
clear_bit(ERROR_WRITE) could be silently resurrected by a
concurrent ci->i_ceph_flags |= CEPH_I_FLUSH under the spinlock.
Using atomic bitops for all modifications eliminates this class
of race entirely.
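
A userspace sketch of the same idea using C11 atomics in place of the kernel
bitops.  The flag names and bit numbers are illustrative and do not reflect
the actual CEPH_I_* numbering.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative bit positions; the real CEPH_I_* numbering may differ. */
enum {
	FLAG_FLUSH_BIT       = 0,
	FLAG_ERROR_WRITE_BIT = 2,
};

/* Masks are derived from the named bit positions. */
#define FLAG_MASK(bit) (1UL << (bit))

/* Analogue of set_bit(): an atomic read-modify-write on the whole word. */
void flag_set(atomic_ulong *flags, int bit)
{
	atomic_fetch_or(flags, FLAG_MASK(bit));
}

/* Analogue of clear_bit(). */
void flag_clear(atomic_ulong *flags, int bit)
{
	atomic_fetch_and(flags, ~FLAG_MASK(bit));
}

/* Analogue of test_bit(). */
bool flag_test(atomic_ulong *flags, int bit)
{
	return atomic_load(flags) & FLAG_MASK(bit);
}

/* Analogue of test_and_clear_bit(): returns the old value of the bit. */
bool flag_test_and_clear(atomic_ulong *flags, int bit)
{
	return atomic_fetch_and(flags, ~FLAG_MASK(bit)) & FLAG_MASK(bit);
}
```

Since every writer goes through an atomic read-modify-write, an update to one
bit cannot overwrite a concurrent update to a neighbouring bit in the same
word.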

Flags whose only users are now the _BIT form (ERROR_WRITE,
ASYNC_CHECK_CAPS) have their old mask defines removed to document
that callers must use the _BIT constant with the set_bit/test_bit
family.  ERROR_FILELOCK and SHUTDOWN retain their mask defines
because they are still used via bitmask tests in lockless readers
(ceph_inode_is_shutdown, reconnect_caps_cb).

The direct assignment in ceph_finish_async_create() is converted
from i_ceph_flags = CEPH_I_ASYNC_CREATE to set_bit().  This
inode is I_NEW at this point -- still invisible to other threads
and guaranteed to have zero flags from alloc_inode -- so either
form is safe, but set_bit() keeps the conversion uniform.

Signed-off-by: Alex Markuze 
Reviewed-by: Viacheslav Dubeyko 
Signed-off-by: Viacheslav Dubeyko 
Signed-off-by: Ilya Dryomov

Merge tag 'vfs-7.2-rc1.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2026-06-14T21:41:05+00:00

Pull openat2 updates from Christian Brauner:
 "Features:

   - Add O_EMPTYPATH to openat(2)/openat2(2). To get an operable file
     descriptor from an O_PATH file descriptor it is possible to use
     openat(fd, ".", O_DIRECTORY) for directories, but other file types
     require going through open("/proc//fd/") and thus depend
     on a functioning procfs.

     With O_EMPTYPATH an empty path string is accepted and LOOKUP_EMPTY
     is set at path resolution time, which allows the file behind the
     file descriptor to be reopened directly. Selftests are included.

   - Add an OPENAT2_REGULAR flag for openat2(2) which refuses to open
     anything but regular files, failing with the new EFTYPE error code.

     This implements the "ability to only open regular files" feature
     requested by userspace via uapi-group.org and protects services
     from being redirected to fifos, device nodes, and friends.

     All atomic_open implementations were audited for OPENAT2_REGULAR
     handling. Explicit checks were added to ceph, gfs2, nfs (v4), and
     cifs/smb - these are the filesystems whose atomic_open can
     encounter an existing non-regular file and would otherwise call
     finish_open() on it or return a misleading error code.

     The remaining implementations (9p, fuse, vboxsf, nfs v2/v3) only
     call finish_open() on freshly created files and use
     finish_no_open() for lookup hits, letting the VFS catch non-regular
     files via the do_open() safety net.

  Cleanups:

   - Migrate the openat2 selftests to the kselftest harness and move
     them under selftests/filesystems/. The tests were written in the
     early days of selftests' TAP support and the modern kselftest
     harness is much easier to follow and maintain. The contents of the
     tests are unchanged and the new emptypath tests are ported on top.

   - Make the LAST_XXX last-type constants private to fs/namei.c. The
     only user outside of fs/namei.c was ksmbd which only needs to know
     whether the last component is a regular one, so
     vfs_path_parent_lookup() now performs the LAST_NORM check
     internally. The ints are replaced with a dedicated enum last_type"

* tag 'vfs-7.2-rc1.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  vfs: replace ints with enum last_type for LAST_XXX
  vfs: make LAST_XXX private to fs/namei.c
  selftests: openat2: port emptypath_test to kselftest harness
  kselftest/openat2: test for OPENAT2_REGULAR flag
  openat2: new OPENAT2_REGULAR flag support
  openat2: introduce EFTYPE error code
  selftest: add tests for O_EMPTYPATH
  vfs: add O_EMPTYPATH to openat(2)/openat2(2)
  selftests: openat2: migrate to kselftest harness
  selftests: openat2: switch from custom ARRAY_LEN to ARRAY_SIZE
  selftests: openat2: move helpers to header
  selftests: move openat2 tests to selftests/filesystems/

Merge tag 'vfs-7.2-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2026-06-14T21:14:23+00:00

Pull vfs inode updates from Christian Brauner:
 "This extends the lockless ->i_count handling.

  iput() could already decrement any value greater than one locklessly
  but acquiring a reference always required taking inode->i_lock. Now
  acquiring a reference is lockless as long as the count was already at
  least 1, i.e., only the 0->1 and 1->0 transitions take the lock.

  This avoids the lock for the common cases of nfs calling into the
  inode hash and btrfs using igrab(). Cleanup-wise icount_read_once() is
  added to line up with inode_state_read_once() and the open-coded
  ->i_count loads across the tree are converted, and ihold() is
  relocated and tidied up.

  On top of that some stale lock ordering annotations are retired from
  the inode hash code: iunique() no longer takes the hash lock since the
  inode hash became RCU-searchable and s_inode_list_lock is no longer
  taken under the hash lock either"

* tag 'vfs-7.2-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: retire stale lock ordering annotations from inode hash
  fs: allow lockless ->i_count bumps as long as it does not transition 0->1
  fs: relocate and tidy up ihold()
  fs: add icount_read_once() and stop open-coding ->i_count loads
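
The lockless reference-count rule from the pull message above (only the 0->1
and 1->0 transitions take the lock) can be sketched in userspace with C11
atomics.  The function names are hypothetical and this is not the kernel
implementation; it only shows the compare-and-swap loop such a fast path
typically relies on.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Take a reference without the lock, but only if the count is already
 * at least 1.  Returns false when the count is 0, in which case the
 * caller would fall back to the locked path for the 0->1 transition.
 */
bool get_ref_lockless(atomic_int *count)
{
	int old = atomic_load(count);

	while (old >= 1) {
		/* On failure, old is updated to the current value. */
		if (atomic_compare_exchange_weak(count, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Drop a reference without the lock as long as the result stays at
 * least 1.  Returns false when the count is 1, leaving the 1->0
 * transition to the locked path.
 */
bool put_ref_lockless(atomic_int *count)
{
	int old = atomic_load(count);

	while (old > 1) {
		if (atomic_compare_exchange_weak(count, &old, old - 1))
			return true;
	}
	return false;
}
```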

openat2: new OPENAT2_REGULAR flag support

2026-05-21T13:33:47+00:00

This flag indicates that the path should be opened only if it is a
regular file. This is useful for writing secure programs that want to
avoid being tricked into opening device nodes with special semantics
while thinking they operate on regular files. The feature was requested
by the uapi-group[1].

The previously introduced EFTYPE error code is returned when the path
doesn't refer to a regular file. For example, if openat2 is called on
path /dev/null with OPENAT2_REGULAR in the flag param, it will return
-EFTYPE.

When used in combination with O_CREAT, the regular file is created if
the path does not exist. If the path already exists, it is opened only
if it is a regular file; otherwise, -EFTYPE is returned.

When OPENAT2_REGULAR is combined with O_DIRECTORY, -EINVAL is returned
as it doesn't make sense to open a path that is both a directory and a
regular file.

The UAPI bit lives in the upper 32 bits of open_how::flags
(((__u64)1 << 32)) so that open(2) and openat(2) -- whose @flags
argument is a C int -- cannot physically express it. This is a
structural guarantee, not a runtime mask: the bit is unrepresentable in
32 bits.

Because the rest of the VFS open path narrows to 32 bits in several
places (op->open_flag, f->f_flags, the unsigned open_flag argument of
i_op->atomic_open()), build_open_flags() translates OPENAT2_REGULAR
into a kernel-internal lower-32-bit carrier __O_REGULAR (bit 4, unused
as an O_* on every architecture) before the assignment to op->open_flag.
__O_REGULAR then rides through the existing channels exactly like
__FMODE_EXEC. do_dentry_open() strips it so it cannot leak back to
userspace via fcntl(F_GETFL).

Four BUILD_BUG_ON_MSG() invariants in build_open_flags() prevent any
future bit collision or accidental low-32 redefinition:

  - VALID_OPEN_FLAGS fits in 32 bits.
  - OPENAT2_REGULAR lives in the upper 32 bits.
  - OPENAT2_REGULAR does not alias any open()/openat() flag.
  - __O_REGULAR does not alias any user-visible flag.

[1]: https://uapi-group.org/kernel-features/#ability-to-only-open-regular-files

Christian Brauner says:

Move OPENAT2_REGULAR to the upper 32 bits of open_how::flags with a
kernel-internal __O_REGULAR carrier so that open(2)/openat(2) cannot
encode the flag; add BUILD_BUG_ON_MSG() invariants and register
__O_REGULAR in the fcntl_init() allocation-uniqueness BUILD_BUG_ON()
(bit count 21 -> 22).

Signed-off-by: Dorjoy Chowdhury 
Link: https://patch.msgid.link/20260328172314.45807-2-dorjoychy111@gmail.com
Reviewed-by: Jeff Layton 
Reviewed-by: Aleksa Sarai 
Signed-off-by: Christian Brauner (Amutable)