| Age | Commit message | Author |
|
conn->preauth_info is shared connection state (struct
preauth_integrity_info, kmalloc-96) that is allocated and freed by the
SMB2 NEGOTIATE handler and read by the response send path.
smb2_handle_negotiate() allocates conn->preauth_info, and on a
deassemble_neg_contexts() failure kfrees it and sets it to NULL. Both the
allocation and the free/NULL happen under ksmbd_conn_lock(conn) (the
connection srv_mutex), which is held across the whole handler body.
The response send path smb3_preauth_hash_rsp(), called from the send:
block of __handle_ksmbd_work(), reads conn->preauth_info and dereferences
conn->preauth_info->Preauth_HashValue (via
ksmbd_gen_preauth_integrity_hash()) without taking conn_lock. When a
client drives two SMB2 NEGOTIATE requests on the same connection, one
worker can free conn->preauth_info on the failing-negotiate path while a
concurrent send-path worker is reading it, producing a slab
use-after-free read (KASAN-confirmed).
The send path does test conn->preauth_info for NULL, but the free can
occur between the NULL check and the dereference, so the NULL guard
alone does not close the window.
Serialize the NEGOTIATE-branch read in smb3_preauth_hash_rsp() under
ksmbd_conn_lock(conn) and re-check conn->preauth_info inside the lock.
Because the negotiate handler holds conn_lock across its kfree + NULL
assignment, a reader that also takes conn_lock either runs fully before
the allocation or fully after the NULL store, and can never observe the
freed-but-not-yet-NULLed pointer. ksmbd_gen_preauth_integrity_hash()
takes no locks itself (it only computes a SHA-512 over the buffer), so
no lock-ordering inversion is introduced, and conn_lock is a sleepable
mutex which is safe on this send path (it already performs network I/O).
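
The locking rule above can be illustrated outside the kernel. This is a
minimal userspace sketch, not the ksmbd code: struct conn, conn_init(),
conn_drop_preauth() and conn_read_preauth() are hypothetical names, and a
pthread mutex stands in for ksmbd_conn_lock(conn).

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define HASH_LEN 64	/* size of a SHA-512 digest */

/* Hypothetical stand-ins for the connection and its preauth state. */
struct preauth_info {
	unsigned char hash[HASH_LEN];
};

struct conn {
	pthread_mutex_t lock;			/* plays the role of conn_lock */
	struct preauth_info *preauth_info;	/* NULL when not allocated */
};

void conn_init(struct conn *c)
{
	pthread_mutex_init(&c->lock, NULL);
	c->preauth_info = NULL;
}

/* Writer: the free and the NULL store happen inside one critical section. */
void conn_drop_preauth(struct conn *c)
{
	pthread_mutex_lock(&c->lock);
	free(c->preauth_info);
	c->preauth_info = NULL;
	pthread_mutex_unlock(&c->lock);
}

/*
 * Reader: take the same lock and check the pointer inside it. Returns 1
 * and copies the hash if the state exists, 0 otherwise.
 */
int conn_read_preauth(struct conn *c, unsigned char out[HASH_LEN])
{
	int ok = 0;

	pthread_mutex_lock(&c->lock);
	if (c->preauth_info) {
		memcpy(out, c->preauth_info->hash, HASH_LEN);
		ok = 1;
	}
	pthread_mutex_unlock(&c->lock);
	return ok;
}
```

Because both sides use the same mutex, the reader sees either a valid
pointer or NULL, never a pointer that has been freed but not yet cleared.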
Fixes: aa7253c2393f ("ksmbd: fix memory leak in smb2_handle_negotiate")
Signed-off-by: Gil Portnoy <dddhkts1@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Most SMB responses need no more than four kvec entries, but every work
item currently allocates a separate four-entry array and frees it after
the response is sent.
Embed the common array in struct ksmbd_work and allocate a larger array
only when a response exceeds the inline capacity. This removes one
allocation and one free from the common request path while preserving
support for larger compound and read responses.
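
The inline-array-with-fallback idea can be sketched in plain C. The names
below (struct work, struct vec, work_add_vec() and friends) are
hypothetical and only illustrate the pattern; they are not the fields or
helpers of struct ksmbd_work.

```c
#include <stdlib.h>
#include <string.h>

#define INLINE_VECS 4	/* common case: at most four entries */

struct vec {
	void *base;
	size_t len;
};

struct work {
	struct vec inline_vecs[INLINE_VECS];	/* embedded, no allocation */
	struct vec *vecs;			/* inline_vecs or heap array */
	size_t cap;
	size_t cnt;
};

void work_init(struct work *w)
{
	w->vecs = w->inline_vecs;
	w->cap = INLINE_VECS;
	w->cnt = 0;
}

/* Append one entry, moving to a heap array only when the inline one is full. */
int work_add_vec(struct work *w, void *base, size_t len)
{
	if (w->cnt == w->cap) {
		size_t ncap = w->cap * 2;
		struct vec *nv;

		if (w->vecs == w->inline_vecs) {
			nv = malloc(ncap * sizeof(*nv));
			if (!nv)
				return -1;
			memcpy(nv, w->inline_vecs, w->cnt * sizeof(*nv));
		} else {
			nv = realloc(w->vecs, ncap * sizeof(*nv));
			if (!nv)
				return -1;
		}
		w->vecs = nv;
		w->cap = ncap;
	}
	w->vecs[w->cnt].base = base;
	w->vecs[w->cnt].len = len;
	w->cnt++;
	return 0;
}

/* Free only if the array was actually moved to the heap. */
void work_release(struct work *w)
{
	if (w->vecs != w->inline_vecs)
		free(w->vecs);
	work_init(w);
}
```

In this sketch a response with four or fewer entries never touches the
allocator; only the fifth entry triggers an allocation.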
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
v2 leases are scoped by ClientGuid. When the same client uses multiple
connections, smbtorture expects lease break notifications to be sent on
the connection associated with the client lease table, not necessarily
on the connection that owns the individual open being broken.
Keep a referenced connection in the lease table and use it for v2 lease
break notifications while it is still active. Fall back to the open's
connection if the table connection is being released.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The delete paths only marked the opened file as delete pending or
delete-on-close. When another client still held a read/handle lease, no
lease break was sent before the delete state changed.
smb2.lease.unlink uses a create request with FILE_DELETE_ON_CLOSE and
expects the second client's unlink to break the first client's RH lease to
R with ACK_REQUIRED set. SetInfo(FileDispositionInformation) has the same
lease-breaking requirement.
Break level-II/read-handle leases before setting delete pending or
delete-on-close so clients are notified before the file is removed.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
v2 lease responses should continue from the client supplied epoch.
Initialize a new v2 lease from the requested epoch plus one so create
responses match the epoch returned by Windows and expected by smbtorture.
For a single chained break sequence, increment the epoch only for the first
break notification. Follow-up breaks such as RH->R and R->NONE in
smb2.lease.v2_breaking3 reuse the same epoch.
Record when a waiter slept behind pending_break and let the later
truncate/open overwrite break consume that marker to reuse the current
epoch instead of assigning a new one.
Do not increment the epoch when a same-client, same-key create asks for
the already granted RH state. The epoch changes only when the granted lease
state changes.
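
The epoch rules above can be summarized in a small model. This is a
sketch under assumptions: struct lease and the three helpers are
hypothetical names, not ksmbd functions, and the pending_break marker
handling is reduced to a first_in_chain argument.

```c
#include <stdint.h>

/* Lease state bits with the values MS-SMB2 assigns to them. */
#define LEASE_READ	0x01u
#define LEASE_HANDLE	0x02u
#define LEASE_WRITE	0x04u

struct lease {
	uint32_t state;
	uint16_t epoch;
};

/* New v2 lease: continue from the client supplied epoch, plus one. */
void lease_init_v2(struct lease *l, uint32_t granted, uint16_t req_epoch)
{
	l->state = granted;
	l->epoch = (uint16_t)(req_epoch + 1);
}

/* Same-client, same-key create: bump only if the granted state changes. */
void lease_regrant(struct lease *l, uint32_t granted)
{
	if (granted != l->state) {
		l->state = granted;
		l->epoch++;
	}
}

/*
 * Chained break sequence: only the first notification increments the
 * epoch; follow-up breaks reuse it. Returns the epoch to send.
 */
uint16_t lease_break_epoch(struct lease *l, uint32_t new_state,
			   int first_in_chain)
{
	if (first_in_chain)
		l->epoch++;
	l->state = new_state;
	return l->epoch;
}
```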
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
smb2.lease.breaking4 expects an overwrite against an RH lease to send an
RH->NONE lease break notification but complete the triggering create
without waiting for the break ack.
Keep the lease in break-in-progress state until the client eventually
acknowledges the downgrade, but do not hold the overwrite request behind
that ack.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
A pending open can require more than one lease break before the existing
lease becomes compatible with the operation that triggered the break.
smb2.lease.breaking3 expects the server to hold the pending normal open
through RWH->RH and RH->R, while a later overwrite waiter must not
collapse that second break directly to RH->NONE.
Keep pending_break held for lease breaks until the current triggering
operation is compatible with the lease state. Snapshot the truncate request
per oplock_break() call so another waiter cannot overwrite the state of
the active break.
Use the requested oplock level when deciding whether to chain another
break. A second lease open only needs RWH->RH, while a normal none-oplock
open can continue down to R and then NONE.
For non-truncating metadata operations, break leases only down to read
caching. Operations such as delete-on-close need to drop handle caching,
but should not send a second R->NONE break after the client acknowledges
RH->R.
Also send STATUS_PENDING for levelII/read-lease break waiters. An async
SMB2 create becomes cancelable only after the server sends
an NT_STATUS_PENDING interim response. A waiter that blocks behind an
already active lease break must receive the interim response before
sleeping on pending_break, otherwise the client can process a later lease
break while the create request is still not marked pending.
Avoid duplicate interim responses when an overwrite first breaks a write
oplock and then scans levelII/read leases.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
SMB2_LEASE_FLAG_BREAK_IN_PROGRESS is a transient create response flag,
not persistent lease state.
Do not store the flag in lease->flags when a same-key open is granted
during a pending break. Instead, derive it from lease opens that are still
waiting for a break ACK while building the lease create response, and keep
lease->flags for persistent lease flags such as the parent lease key.
This clears the flag naturally after the break ACK completes and fixes
reopen responses that report BREAK_IN_PROGRESS after the lease is no
longer breaking.
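
Deriving the flag at response time, rather than storing it, might look
like the sketch below. The structures and lease_response_flags() are
hypothetical; only the two flag values are taken from MS-SMB2.

```c
#include <stddef.h>
#include <stdint.h>

#define LEASE_FLAG_BREAK_IN_PROGRESS	0x02u	/* transient, response only */
#define LEASE_FLAG_PARENT_KEY_SET	0x04u	/* persistent */

struct lease_open {
	int waiting_for_break_ack;
};

struct lease {
	uint32_t flags;			/* persistent flags only */
	const struct lease_open *opens;
	size_t nr_opens;
};

/* Build the flags for a create response from the current lease opens. */
uint32_t lease_response_flags(const struct lease *l)
{
	uint32_t flags = l->flags & LEASE_FLAG_PARENT_KEY_SET;
	size_t i;

	for (i = 0; i < l->nr_opens; i++) {
		if (l->opens[i].waiting_for_break_ack) {
			flags |= LEASE_FLAG_BREAK_IN_PROGRESS;
			break;
		}
	}
	return flags;
}
```

Once no open is waiting for an ACK, the next response no longer carries
BREAK_IN_PROGRESS and nothing has to be cleared explicitly.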
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The SMB path suffix :: names the unnamed data stream of the base
file, not an alternate data stream backed by a DosStream xattr.
Canonicalize an empty stream name with an explicit type to a NULL
stream name after parsing. This keeps the base filename produced by
strsep() and lets open continue through the normal base-file path instead
of looking for a non-existent empty stream xattr.
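
A minimal sketch of the canonicalization follows. It is not the ksmbd
parser: parse_stream_name() and struct parsed_name are hypothetical, and
strchr() is used here instead of strsep() to keep the example portable.

```c
#include <stddef.h>
#include <string.h>

struct parsed_name {
	char *base;	/* base filename */
	char *stream;	/* stream name, NULL for the unnamed data stream */
	char *type;	/* stream type such as "$DATA", NULL if absent */
};

/* Split "base:stream:type" in place at the ':' separators. */
void parse_stream_name(char *path, struct parsed_name *out)
{
	char *colon = strchr(path, ':');

	out->base = path;
	out->stream = NULL;
	out->type = NULL;
	if (!colon)
		return;

	*colon = '\0';
	out->stream = colon + 1;

	colon = strchr(out->stream, ':');
	if (colon) {
		*colon = '\0';
		out->type = colon + 1;
	}

	/* "name::$DATA": an empty name with an explicit type is the base file. */
	if (out->stream[0] == '\0' && out->type)
		out->stream = NULL;
}
```

With a NULL stream name the caller can take the ordinary base-file open
path while still using the base filename left in place by the split.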
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Handle SMB2 oplock break acknowledgments according to the server-side
validation rules in MS-SMB2.
Return STATUS_INVALID_DEVICE_STATE when an ACK arrives while the open is
not breaking, reject SMB2_OPLOCK_LEVEL_LEASE with
STATUS_INVALID_PARAMETER, allow BATCH acknowledgments to EXCLUSIVE, and
make invalid ACK levels fail with STATUS_INVALID_OPLOCK_PROTOCOL after
lowering the oplock to NONE.
Update the successful response from the final granted oplock level instead
of relying on the oplock transition helpers, which could turn invalid ACKs
into successful responses.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Model SMB2 leases as per-client/per-key objects instead of keeping a
separate lease copy in every oplock_info. The lease table now stores
lease objects and each lease tracks the opens that reference it.
This makes same ClientGuid/LeaseKey opens observe a single lease state,
so lease upgrades, breaks, ACKs, and close teardown do not diverge across
per-open copies. Keep one reference for the lease table entry and one
reference for each open, and remove the table entry when the last open is
detached.
Update lease break ACK handling to refresh all open oplock levels from
the shared lease state.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Do not echo reserved v1 lease flags back to clients. For lease v2
responses, only return BREAK_IN_PROGRESS and PARENT_LEASE_KEY_SET when
they are meaningful, and preserve the parent lease key in the response.
Allow directory leases whenever the request is a valid lease v2 request,
and initialize v2 lease epochs from the first server-granted state change.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Do not skip valid lease states containing WRITE_CACHING when breaking
level-II/read leases for writes and truncates.
Handle lease break acknowledgments according to the SMB2 rule that the
acknowledged state must be a subset of the server's break target. Apply
the acknowledged state directly and keep the break pending on failed ACKs.
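
The subset rule reduces to a single mask test. The sketch below is
illustrative only; struct lease and lease_handle_ack() are hypothetical
names and the error value is a placeholder rather than an SMB2 status.

```c
#include <stdint.h>

#define LEASE_READ	0x01u
#define LEASE_HANDLE	0x02u
#define LEASE_WRITE	0x04u

struct lease {
	uint32_t state;		/* currently granted state */
	uint32_t break_to;	/* state the server asked the client to go to */
	int break_pending;
};

/*
 * Accept the ACK only if it contains no bit outside the break target.
 * On success the acknowledged state is applied directly; on failure the
 * break stays pending and the lease is left untouched.
 */
int lease_handle_ack(struct lease *l, uint32_t ack_state)
{
	if (ack_state & ~l->break_to)
		return -1;

	l->state = ack_state;
	l->break_pending = 0;
	return 0;
}
```

For a break from RWH to RH, an ACK of RH or R is accepted, while an ACK
that still contains W is rejected.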
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
MS-SMB2 defines the lease table lookup key as Connection.ClientGuid.
Use the connection ClientGUID consistently when checking for same-client
leases and duplicate lease keys.
Also preserve directory and parent lease metadata when copying an existing
lease state to a new open.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Validate SMB2 lease context lengths, requested lease state bits, and v2
flags before using the context. Return errors via ERR_PTR so CREATE can
distinguish a missing lease context from a malformed one.
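
The three-way result (absent, malformed, valid) can be sketched in
userspace with local copies of the kernel's ERR_PTR helpers. The parser,
struct lease_ctx, and the 32-byte minimum length are assumptions made
for illustration, not the ksmbd implementation.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace copies of the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long err) { return (void *)err; }
static inline long PTR_ERR(const void *p) { return (long)p; }
static inline int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

#define LEASE_STATE_MASK	0x07u	/* READ | HANDLE | WRITE caching */
#define LEASE_CTX_MIN_LEN	32	/* assumed v1 context size */

struct lease_ctx {
	uint32_t state;
};

/*
 * Returns NULL when no context was supplied, ERR_PTR(-EINVAL) when the
 * context is malformed, and out when it parsed correctly.
 */
struct lease_ctx *parse_lease_ctx(const uint8_t *buf, size_t len,
				  struct lease_ctx *out)
{
	uint32_t state;

	if (!buf)
		return NULL;
	if (len < LEASE_CTX_MIN_LEN)
		return ERR_PTR(-EINVAL);

	/* Little-endian 32-bit state assumed to follow a 16-byte key. */
	state = (uint32_t)buf[16] | ((uint32_t)buf[17] << 8) |
		((uint32_t)buf[18] << 16) | ((uint32_t)buf[19] << 24);
	if (state & ~LEASE_STATE_MASK)
		return ERR_PTR(-EINVAL);

	out->state = state;
	return out;
}
```

A caller checks IS_ERR() first to fail the request, then treats NULL as
"no lease requested".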
Also ignore lease v2 contexts for SMB 2.1, where they are not valid.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Convert to CPU byte order to avoid an incorrect debug log
on big-endian architectures.
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
SMB2_LOCK adds each granted byte-range lock to both the file lock list
and the lock list of the connection which handled the request. The
final close and durable handle paths, however, remove the connection
list entry while holding fp->conn->llist_lock.
With SMB3 multichannel, the connection handling the LOCK request can be
different from the connection which opened the file. The entry can
therefore be removed under a different spinlock from the one protecting
the list it belongs to. A concurrent traversal can then access freed
struct ksmbd_lock and struct file_lock objects.
Record the connection owning each lock's clist entry and hold a
reference to it while the entry is linked. Use that connection and its
llist_lock for unlock, rollback, close, and durable preserve. Durable
reconnect assigns the new connection as the owner when publishing the
locks again.
Fixes: f5a544e3bab7 ("ksmbd: add support for SMB3 multichannel")
Cc: stable@vger.kernel.org
Reported-by: Musaab Khan <musaab.khan@protonmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
icssg_ndo_get_stats64() unconditionally calls emac_get_stat_by_name()
with FW PA stat names regardless of whether the PA stats block is
present on the hardware. emac_get_stat_by_name() already guards the
PA stats lookup with `if (emac->prueth->pa_stats)`; when that pointer
is NULL the lookup falls through to netdev_err() and returns -EINVAL.
Because ndo_get_stats64 is polled regularly by the networking stack
this produces thousands of log entries of the form:
icssg-prueth icssg1-eth end0: Invalid stats FW_RX_ERROR
A secondary consequence is that the int(-EINVAL) return value is
implicitly widened to a near-ULLONG_MAX unsigned value when accumulated
into the __u64 fields of rtnl_link_stats64, silently corrupting the
rx_errors, rx_dropped and tx_dropped counters reported by `ip -s link`.
Every other PA-aware code path in the driver is already guarded with
the same `if (emac->prueth->pa_stats)` check. Apply the same guard
here.
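
The counter corruption follows from C's integer conversions and can be
shown in isolation. The sketch below is not driver code: struct
fake_emac and fake_get_stat() are hypothetical stand-ins for the emac
state and emac_get_stat_by_name().

```c
#include <errno.h>
#include <stdint.h>

struct fake_emac {
	int have_pa_stats;
};

/* Returns a counter value, or -EINVAL when the stats block is absent. */
int fake_get_stat(const struct fake_emac *e)
{
	return e->have_pa_stats ? 7 : -EINVAL;
}

/* Unguarded: a negative int is converted to a huge unsigned 64-bit value. */
uint64_t rx_errors_unguarded(const struct fake_emac *e)
{
	uint64_t v = 0;

	v += fake_get_stat(e);
	return v;
}

/* Guarded: skip the lookup entirely when the stats block is absent. */
uint64_t rx_errors_guarded(const struct fake_emac *e)
{
	uint64_t v = 0;

	if (e->have_pa_stats)
		v += fake_get_stat(e);
	return v;
}
```

Without the guard the accumulated value equals (uint64_t)-EINVAL, which
is only EINVAL - 1 below the maximum 64-bit value.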
Fixes: 0d15a26b247d ("net: ti: icssg-prueth: Add ICSSG FW Stats")
Signed-off-by: Philippe Schenker <philippe.schenker@impulsing.ch>
Reviewed-by: Simon Horman <horms@kernel.org>
Cc: danishanwar@ti.com
Cc: rogerq@kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260618093037.3448858-1-dev@pschenker.ch
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Tristan Madani says:
====================
Fix stale register bounds on LSM retval context load
From: Tristan Madani <tristan@talencesecurity.com>
check_mem_access() calls __mark_reg_s32_range() to narrow a register to
the LSM hook retval range, but the intersection preserves stale bounds
from prior instructions. Add mark_reg_unknown() before narrowing (same
pattern as the else branch) and a selftest that catches the mismatch.
Changes in v3:
- Add selftest demonstrating the issue (Eduard Zingerman)
- No code change in patch 1 from v2
====================
Link: https://patch.msgid.link/20260622230123.3695446-1-tristmd@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add a verifier test that catches the stale-bounds issue fixed in the
previous patch. The test sets r6 = 0 to create known bounds, then loads
the LSM hook return value into r6 from the context. Without the fix,
the verifier intersects the retval range with the stale bounds and
incorrectly narrows r6 to a single value, pruning the fall-through
branch as dead code and missing the div-by-zero.
Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Tristan Madani <tristan@talencesecurity.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260622230123.3695446-3-tristmd@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
When the BPF verifier processes a context load of an LSM hook return
value, it calls __mark_reg_s32_range() to narrow the register to the
hook's valid range. However, __mark_reg_s32_range() intersects the new
range with the register's existing bounds using max_t()/min_t() rather
than replacing them.
If the destination register carries stale bounds from a prior instruction
(e.g. BPF_MOV64_IMM), the intersection can produce a range narrower than
reality. The verifier then believes it knows the register's exact value,
while at runtime the actual hook return value is loaded, creating a
verifier/runtime mismatch that can be used to bypass BPF memory safety
checks.
The else branch already calls mark_reg_unknown() to reset register state
before any narrowing. Apply the same reset in the is_retval path so
stale bounds are cleared before __mark_reg_s32_range() intersects.
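
The effect of intersecting with stale bounds can be modelled with a
plain signed range. This is a simplified sketch, not verifier code: the
struct and helpers are hypothetical, and the retval range used in the
usage note is only an illustrative value.

```c
#include <stdint.h>

struct s32_range {
	int32_t min;
	int32_t max;
};

/* Fully unknown register: every s32 value is possible. */
struct s32_range range_unknown(void)
{
	struct s32_range r = { INT32_MIN, INT32_MAX };

	return r;
}

/* Intersect, in the spirit of the max_t()/min_t() narrowing described. */
struct s32_range range_intersect(struct s32_range r, int32_t min, int32_t max)
{
	if (min > r.min)
		r.min = min;
	if (max < r.max)
		r.max = max;
	return r;
}

/* Before the fix: narrow on top of whatever bounds the register had. */
struct s32_range load_retval_buggy(struct s32_range stale,
				   int32_t min, int32_t max)
{
	return range_intersect(stale, min, max);
}

/* After the fix: reset to unknown first, then narrow. */
struct s32_range load_retval_fixed(struct s32_range stale,
				   int32_t min, int32_t max)
{
	(void)stale;	/* old bounds say nothing about the loaded value */
	return range_intersect(range_unknown(), min, max);
}
```

With stale bounds [0, 0] left by an immediate move and an assumed hook
range of [-4095, 0], the buggy path yields [0, 0] while the fixed path
yields [-4095, 0].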
Fixes: 5d99e198be27 ("bpf, lsm: Add check for BPF LSM return value")
Cc: stable@vger.kernel.org
Signed-off-by: Tristan Madani <tristan@talencesecurity.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260622230123.3695446-2-tristmd@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In ofdpa_port_fdb(), hash_del() only unlinks the node from the
hash table, but does not free it.
Fix this by adding kfree(found) after the !found == removing check,
where the pointer value is no longer needed.
Found by Coccinelle kfree script.
Cc: <stable+noautosel@kernel.org> # rocker is a test harness, it's never loaded on production systems
Signed-off-by: Ziran Zhang <zhangcoder@yeah.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260616013245.7098-1-zhangcoder@yeah.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Yiyang Chen says:
====================
bpf: Guard conntrack opts error writes
The conntrack lookup/allocation kfuncs expose an opts/opts__sz pair.
The verifier checks the caller-provided opts__sz range, but the wrappers
currently write opts->error after internal errors even when opts__sz is too
small to include that field.
Patch 1 writes opts->error only when opts__sz includes it, and uses a
single helper to fold ERR_PTR returns into the kfunc ABI result while keeping
the local nfct result variable in each wrapper.
Patch 2 adds a bpf_nf regression check that keeps a guard in opts->error
while passing opts__sz covering only netns_id.
The regression check follows the existing bpf_nf test shape. Before the
fix, the guard is overwritten with -EINVAL even though opts__sz covers only
the first four bytes of the options object. After the fix, the kfunc still
returns NULL for the invalid size, but the guard remains intact.
Validation, rebased and tested on bpf-next master e771677c937d
("Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd"):
git diff --check origin/master..HEAD: OK
scripts/checkpatch.pl --strict on 1/2 and 2/2: OK
make O=/root/ebpf-verifier-bug-detection/kernel-build/bpf-next \
net/netfilter/nf_conntrack_bpf.o: OK
Focused QEMU direct-runner against XDP and TC lookup/alloc paths:
unpatched bpf-next e771677c937d: guard overwritten with -EINVAL
patched v2 007dfd0341cd: guard preserved as 0x12345678
QEMU upstream bpf_nf selftest with CONFIG_NF_CONNTRACK_MARK,
CONFIG_NF_CONNTRACK_ZONES, and legacy iptables enabled:
./test_progs -t bpf_nf -vv: OK
git am of exported 1/2 and 2/2 on a fresh worktree at base: OK
range-diff between branch commits and git-am result: equivalent
Changes in v2:
- Rebased onto current bpf-next master.
- Reworked patch 1 to use bpf_ct_opts_result() for the ERR_PTR-to-NULL
conversion and guarded opts->error write, as suggested by Alexei.
- Kept the local nfct result variable in each wrapper before returning
through bpf_ct_opts_result().
- Added matching Fixes tags to the selftest patch so the regression test
can be backported with the fix.
v1: https://lore.kernel.org/bpf/cover.1781586477.git.chenyy23@mails.tsinghua.edu.cn/
====================
Link: https://patch.msgid.link/cover.1781765747.git.chenyy23@mails.tsinghua.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add a conntrack kfunc regression check for opts__sz values that do not
cover opts->error. The BPF program initializes opts->error with a guard
value, calls the lookup and allocation kfuncs with opts__sz set to
sizeof(opts->netns_id), and verifies that the guard is still intact
after the kfunc returns NULL.
Without the conntrack wrapper guard, the kfunc error path overwrites
that guard with -EINVAL even though the verifier checked only the first
four bytes of the options object.
Fixes: b4c2b9593a1c ("net/netfilter: Add unstable CT lookup helpers for XDP and TC-BPF")
Fixes: d7e79c97c00c ("net: netfilter: Add kfuncs to allocate and insert CT")
Signed-off-by: Yiyang Chen <chenyy23@mails.tsinghua.edu.cn>
Link: https://lore.kernel.org/r/007dfd0341cd84560e4795a2a951cc56d4adff1d.1781765747.git.chenyy23@mails.tsinghua.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The conntrack lookup and allocation kfuncs take an opts pointer
together with an opts__sz argument. The verifier checks only the memory
range described by opts__sz, but the wrappers unconditionally write
opts->error whenever the internal lookup or allocation helper returns an
error.
For an invalid size smaller than the end of opts->error, that write can
land outside the verifier-checked range. Keep returning NULL for invalid
arguments, but only report the error through opts->error when the
supplied size includes the field.
This preserves error reporting for the supported 12-byte and 16-byte
layouts, and for other invalid sizes that still include opts->error.
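
The size check can be sketched as follows. struct ct_opts is a cut-down,
hypothetical layout with only the first two fields modelled, ct_fail()
is an illustrative helper rather than the kfunc wrapper code, and
offsetofend() is defined locally in the style of the kernel macro.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-byte options object; only two fields matter here. */
struct ct_opts {
	int32_t netns_id;
	int32_t error;
	uint8_t rest[8];
};

/* Offset of the first byte after a member. */
#define offsetofend(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/*
 * Error path: always return NULL, but write opts->error only when the
 * caller-supplied size actually covers that field.
 */
void *ct_fail(struct ct_opts *opts, uint32_t opts_sz, int err)
{
	if (opts_sz >= offsetofend(struct ct_opts, error))
		opts->error = err;
	return NULL;
}
```

A caller passing sizeof(opts->netns_id) keeps whatever it stored in
opts->error, while a caller passing 12 or 16 still receives the error
code.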
Fixes: b4c2b9593a1c ("net/netfilter: Add unstable CT lookup helpers for XDP and TC-BPF")
Fixes: d7e79c97c00c ("net: netfilter: Add kfuncs to allocate and insert CT")
Signed-off-by: Yiyang Chen <chenyy23@mails.tsinghua.edu.cn>
Link: https://lore.kernel.org/r/9535e781fe14449b1d4e9bbc3baa7566a93bf512.1781765747.git.chenyy23@mails.tsinghua.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add PM suspend/resume callbacks to enable/disable IRQ wake for the
RTC alarm interrupt. This allows the RTC alarm to wake the system
from STR (e.g. via rtcwake -m mem -s N).
Without this, the RTC IRQ is masked during suspend by the MPIC's
IRQCHIP_MASK_ON_SUSPEND behavior, preventing alarm-based wakeup.
Signed-off-by: Xue Lei <Xue.Lei@windriver.com>
Link: https://patch.msgid.link/20260611023350.1370881-1-Xue.Lei@windriver.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
Add support for matching the RTC controller on ASPEED AST2700 SoCs.
The AST2700 RTC controller is compatible with the existing ASPEED
RTC driver implementation.
Signed-off-by: Tommy Huang <tommy_huang@aspeedtech.com>
Link: https://patch.msgid.link/20260601-ast2700-rtc-v1-2-15d4ca46500a@aspeedtech.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
Document the compatible string for the RTC controller found on
ASPEED AST2700 SoCs.
Signed-off-by: Tommy Huang <tommy_huang@aspeedtech.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://patch.msgid.link/20260601-ast2700-rtc-v1-1-15d4ca46500a@aspeedtech.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
Fix spelling of 'occurence' to 'occurrence' and 'of' to 'or' in the
kernel-doc comment for rtc_handle_legacy_irq().
Signed-off-by: Yahya Saqban <yahyasaqban@gmail.com>
Link: https://patch.msgid.link/20260512210235.343070-1-yahyasaqban@gmail.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
msc313_rtc_probe() calls devm_request_irq() with IRQF_SHARED and
&pdev->dev as the cookie, but platform_set_drvdata() is only called
later after the clock setup. With a shared IRQ line, another device
on the same line can trigger the handler in that window. The
handler does dev_get_drvdata() on the cookie, gets NULL, and
dereferences priv->rtc_base in interrupt context.
Pass priv as the cookie directly so the handler reads it from
dev_id without the lookup, removing the dependency on probe order.
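
The difference between the two cookie choices can be modelled without
any kernel API. Everything in this sketch is hypothetical: struct
device and struct rtc_priv are reduced stand-ins, and the first handler
returns -1 where the real handler would dereference NULL.

```c
#include <stddef.h>

struct device {
	void *drvdata;		/* set late in probe */
};

struct rtc_priv {
	int irq_count;
};

/* Cookie is the device: the handler depends on drvdata being set. */
int handler_via_drvdata(void *dev_id)
{
	struct device *dev = dev_id;
	struct rtc_priv *priv = dev->drvdata;

	if (!priv)
		return -1;	/* models the NULL dereference window */
	priv->irq_count++;
	return 0;
}

/* Cookie is priv itself: no lookup, so probe order does not matter. */
int handler_via_priv(void *dev_id)
{
	struct rtc_priv *priv = dev_id;

	priv->irq_count++;
	return 0;
}
```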
Fixes: be7d9c9161b9 ("rtc: Add support for the MSTAR MSC313 RTC")
Signed-off-by: Stepan Ionichev <sozdayvek@gmail.com>
Link: https://patch.msgid.link/20260511032703.48262-1-sozdayvek@gmail.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
The platform was removed a few years ago, and the mfd driver
is also gone now, so it is impossible to build or use it.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20260527193927.3523952-1-arnd@kernel.org
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
Add a new function rtc_read_next_alarm() that reads the next expiring
alarm from the RTC timerqueue. This is different from rtc_read_alarm(),
which only reads the aie_timer.
The wakealarm sysfs file programs the rtc->aie_timer, whereas the
alarmtimer suspend routine programs its own timer into the RTC timerqueue.
Both timers end up in the RTC's timerqueue, and the first expiring timer
is what gets armed in the hardware.
This new function allows code to query which alarm will actually fire
next, regardless of which subsystem programmed it. This is needed by
platform code that needs to program secondary timers based on the
actual next wakeup time.
Link: https://lore.kernel.org/all/87ed50z0le.ffs@tglx
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://patch.msgid.link/20260521043714.1022930-2-mario.limonciello@amd.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
[BUG]
Our fuzz testing triggered a blkcg use-after-free issue:
BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0
Call Trace:
...
blkcg_deactivate_policy+0x244/0x4d0
ioc_rqos_exit+0x44/0xe0
rq_qos_exit+0xba/0x120
__del_gendisk+0x50b/0x800
del_gendisk+0xff/0x190
...
[CAUSE]
process1                                process2
cgroup_rmdir
...
css_killed_work_fn
offline_css
...
blkcg_destroy_blkgs
...
__blkg_release
css_put(&blkg->blkcg->css)
blkg_free
INIT_WORK(xxx, blkg_free_workfn)
schedule_work
css_put
...
blkcg_css_free
kfree(blkcg)--------blkcg has been freed!!!
====================================schedule_work
blkg_free_workfn
                                        __del_gendisk
                                        rq_qos_exit
                                        ioc_rqos_exit
                                        blkcg_deactivate_policy
                                        mutex_lock(&q->blkcg_mutex)
                                        spin_lock_irq(&q->queue_lock)
                                        list_for_each_entry(blkg, xxx)
                                        blkcg = blkg->blkcg
                                        spin_lock(&blkcg->lock)-------UAF!!!
mutex_lock(&q->blkcg_mutex)
spin_lock_irq(&q->queue_lock)
/* Only then is the blkg removed from the list */
list_del_init(&blkg->q_node)
As a result, a blkg can still be reachable through q->blkg_list while
its ->blkcg has already been freed.
[Fix]
Fix this by deferring the blkcg css_put() until after the blkg has been
unlinked from q->blkg_list in blkg_free_workfn(). This ensures that the
blkcg outlives every blkg still reachable through q->blkg_list, so any
iterator holding q->queue_lock is guaranteed to observe a valid
blkg->blkcg.
While at it, move css_tryget_online() from blkg_create() into blkg_alloc()
so that the css reference is owned by the alloc/free pair rather than
straddling layers:
blkg_alloc() <-> blkg_free()
blkg_create() <-> blkg_destroy()
Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Suggested-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai@fygo.io>
Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>
Link: https://patch.msgid.link/20260616011746.2451461-1-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When multiple blkgs in the same blkcg are released concurrently,
a use-after-free can occur. The race happens when one blkg's
__blkcg_rstat_flush() removes another blkg's iostat entries via
llist_del_all(). The second blkg sees an empty list and proceeds
to free itself while the first is still iterating over its entries.
Move the flush from __blkg_release() (RCU callback) to blkg_release()
(before call_rcu). This ensures the RCU grace period waits for any
concurrent flush's rcu_read_lock() section to complete before freeing.
Cc: stable@vger.kernel.org
Cc: Jay Shin <jaeshin@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Fixes: 20cb1c2fb756 ("blk-cgroup: Flush stats before releasing blkcg_gq")
Reported-by: coregee2000@gmail.com
Closes: https://lore.kernel.org/linux-block/CAHPqNmwT9oRpem3J3erS_W0uSQND47LGGSBsNxP8E6uSUish1w@mail.gmail.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Jose Fernandez (Anthropic) <jose.fernandez@linux.dev>
Link: https://patch.msgid.link/20260205155425.342084-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Writing 0 to BFQ's low_latency attribute ends weight raising for active,
idle and async queues. The async cgroup path walks q->blkg_list, converts
each blkg to BFQ policy data and then reads bfqg->async_bfqq and
bfqg->async_idle_bfqq.
That walk was protected only by bfqd->lock. blkcg release work is
serialized by q->blkcg_mutex and q->queue_lock instead, and
blkg_free_workfn() can call BFQ's pd_free_fn before it removes
blkg->q_node from q->blkg_list. A low_latency reset can therefore still
find the blkg on the queue list after the BFQ policy data has been freed.
The buggy scenario involves two paths, with each column showing the order
within that path:
BFQ low_latency reset:            blkcg blkg release work:
1. bfq_low_latency_store()        1. blkg_free_workfn() takes
   calls bfq_end_wr().               q->blkcg_mutex.
2. bfq_end_wr_async() walks       2. BFQ pd_free_fn drops the
   q->blkg_list.                     final bfq_group reference.
3. blkg_to_bfqg() returns         3. blkg->q_node remains on
   the stale policy data.            q->blkg_list until list_del_init().
4. bfq_end_wr_async_queues()
   reads async queue fields.
Fix this by taking q->blkcg_mutex and q->queue_lock around the
q->blkg_list walk, then taking bfqd->lock before touching BFQ async
queues. The mutex serializes against policy-data free and queue_lock
stabilizes the list. Move the async reset out of bfq_end_wr()'s existing
bfqd->lock critical section so the lock order matches blkcg policy
callbacks.
Validation reproduced this kernel report:
BUG: KASAN: slab-use-after-free in bfq_end_wr_async_queues+0x246/0x340
Call Trace:
<TASK>
dump_stack_lvl+0x66/0xa0
print_report+0xce/0x630
? bfq_end_wr_async_queues+0x246/0x340
? srso_alias_return_thunk+0x5/0xfbef5
? __virt_addr_valid+0x20d/0x410
? bfq_end_wr_async_queues+0x246/0x340
kasan_report+0xe0/0x110
? bfq_end_wr_async_queues+0x246/0x340
bfq_end_wr_async_queues+0x246/0x340
bfq_end_wr_async+0xba/0x180
bfq_low_latency_store+0x4e5/0x690
? 0xffffffffc02150da
? __pfx_bfq_low_latency_store+0x10/0x10
? __pfx_bfq_low_latency_store+0x10/0x10
elv_attr_store+0xc4/0x110
kernfs_fop_write_iter+0x2f5/0x4a0
vfs_write+0x604/0x11f0
? __pfx_locks_remove_posix+0x10/0x10
? __pfx_vfs_write+0x10/0x10
ksys_write+0xf9/0x1d0
? __pfx_ksys_write+0x10/0x10
do_syscall_64+0x115/0x6a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Allocated by task 544:
kasan_save_stack+0x33/0x60
kasan_save_track+0x14/0x30
__kasan_kmalloc+0xaa/0xb0
bfq_pd_alloc+0xc0/0x1b0
blkg_alloc+0x346/0x960
blkg_create+0x8c2/0x10d0
bio_associate_blkg_from_css+0x9f3/0xfa0
bio_associate_blkg+0xd9/0x200
bio_init+0x303/0x640
__blkdev_direct_IO_simple+0x56b/0x8a0
blkdev_direct_IO+0x8e7/0x2580
blkdev_read_iter+0x205/0x400
vfs_read+0x7b0/0xda0
ksys_read+0xf9/0x1d0
do_syscall_64+0x115/0x6a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task 465:
kasan_save_stack+0x33/0x60
kasan_save_track+0x14/0x30
kasan_save_free_info+0x3b/0x60
__kasan_slab_free+0x5f/0x80
kfree+0x307/0x580
blkg_free_workfn+0xef/0x460
process_one_work+0x8d0/0x1870
worker_thread+0x575/0xf80
kthread+0x2e7/0x3c0
ret_from_fork+0x576/0x810
ret_from_fork_asm+0x1a/0x30
Fixes: 44e44a1b329e ("block, bfq: improve responsiveness")
Assisted-by: Codex:gpt-5.5
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
Reviewed-by: Tao Cui <cuitao@kylinos.cn>
Link: https://patch.msgid.link/20260621135930.2657810-1-zzzccc427@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
nbd_reclassify_socket() warns via WARN_ON_ONCE() if the socket lock is
held at the point of reclassification. That assertion was copied from
nvme-tcp, where the socket is created internally by the kernel
(sock_create_kern()) and is never visible to user space, so the lock
is guaranteed to be free.
NBD is different: the socket is looked up from a user-supplied fd in
nbd_get_socket(), and user space retains that fd. A concurrent syscall
on the same socket (or softirq processing taking bh_lock_sock() on a
connected TCP socket) can legitimately hold the lock at the instant
NBD reclassifies it. sock_allow_reclassification() then returns false
and the WARN_ON_ONCE() fires, which turns into a crash under
panic_on_warn. This is reachable by simply racing NBD_CMD_CONNECT
against socket activity on the same fd, as reported by syzbot.
Hitting a held lock here is expected for an externally owned socket and
is not a kernel bug, so skip reclassification silently instead of
warning. Reclassification is a lockdep-only annotation, so skipping it
in the rare racing case is harmless.
Reported-by: syzbot+6b85d1e39a5b8ed9a954@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=6b85d1e39a5b8ed9a954
Fixes: d532cddb6c60 ("nbd: Reclassify sockets to avoid lockdep circular dependency")
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260621235255.66015-1-kartikey406@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Only decrement the static key when we had items and thus it was
incremented before.
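The balancing rule can be sketched in plain C. This is only an illustration: `struct demo_inject`, `demo_add_item()` and `demo_remove_all()` are hypothetical names, an integer stands in for the static key, and the sketch assumes the key is incremented once when the first item is added, which the message above does not spell out.

```c
/* Hypothetical stand-ins: key_count models the static key's enable count
 * and nr_items the number of configured error injection items. */
struct demo_inject {
	int key_count;
	int nr_items;
};

/* Assumption: the key is incremented once, when the first item is added. */
static void demo_add_item(struct demo_inject *d)
{
	if (d->nr_items++ == 0)
		d->key_count++;
}

/* Drop the key only if there were items, i.e. only if it was taken before. */
static void demo_remove_all(struct demo_inject *d)
{
	if (d->nr_items) {
		d->nr_items = 0;
		d->key_count--;
	}
}
```

Calling `demo_remove_all()` on an empty set leaves `key_count` at zero instead of driving it negative.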
Fixes: e8dcf2d142bd ("block: add configurable error injection")
Reported-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260622160752.1552516-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 3c7bf5af21960 ("e1000e: Introduce private flag to disable K1")
disabled K1 by default on Meteor Lake and newer systems due to packet
loss observed on various platforms. However, disabling K1 caused an
increase in power consumption.
To mitigate this, reconfigure the PLL clock gate value so that K1 can
remain enabled without incurring the additional power consumption.
Re-enable K1 by default, but keep the private flag to support disabling
it via ethtool. Additionally, introduce a DMI quirk table, so that K1 may
be disabled by default on known problematic systems. Currently, this
includes the Dell Pro 16 Plus, where the issue has been reported to persist
despite the changes to the PLL lock timeout.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=220954
Link: https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20250623/048860.html
Link: https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20260330/054059.html
Signed-off-by: Dima Ruinskiy <dima.ruinskiy@intel.com>
Co-developed-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Signed-off-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Fixes: 3c7bf5af21960 ("e1000e: Introduce private flag to disable K1")
Tested-by: Moriya Kadosh <moriyax.kadosh@intel.com>
Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
The i40e_debug() macro takes struct i40e_hw *h as its first argument, but
the macro body uses hw instead of h. This has worked so far only because
hw happens to be the name of the variable in every context where the
macro is expanded. Fix the macro to use the passed argument.
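The hazard is easy to reproduce outside the driver. The sketch below uses a hypothetical `struct demo_hw` and two hypothetical macros, not the real i40e definitions, to show how a macro body that names `hw` silently captures the caller's local variable instead of its argument.

```c
/* Hypothetical stand-in for struct i40e_hw; only one field is modelled. */
struct demo_hw {
	unsigned int debug_mask;
};

/* Buggy form: the parameter is h, but the body refers to hw, so the macro
 * captures whatever variable named hw is in scope at the expansion site. */
#define DEMO_DEBUG_ENABLED_BUGGY(h, m) (((m) & (hw)->debug_mask) != 0)

/* Fixed form: the body uses the argument that was actually passed. */
#define DEMO_DEBUG_ENABLED_FIXED(h, m) (((m) & (h)->debug_mask) != 0)

/* Passes `other` to the buggy macro while a local named hw is in scope. */
static int buggy_result(struct demo_hw *hw, struct demo_hw *other)
{
	(void)other; /* the buggy macro never evaluates its first argument */
	return DEMO_DEBUG_ENABLED_BUGGY(other, 0x1);
}

/* Same call with the fixed macro: `other` is the structure consulted. */
static int fixed_result(struct demo_hw *hw, struct demo_hw *other)
{
	(void)hw;
	return DEMO_DEBUG_ENABLED_FIXED(other, 0x1);
}
```

With `hw->debug_mask` set and `other->debug_mask` clear, the buggy form reports the mask as enabled even though `other` was passed; the fixed form does not.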
Fixes: 5dfd37c37a44 ("i40e: Split i40e_osdep.h")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Tested-by: Alexander Nowlin <alexander.nowlin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Several error return paths in ice_dpll_init_info() directly return
without freeing previously allocated resources, causing memory leaks:
- When de->input_prio allocation fails, d->inputs is leaked
- When dp->input_prio allocation fails, d->inputs and de->input_prio
are leaked
- When ice_get_cgu_rclk_pin_info() fails, all previously allocated
inputs/outputs/input_prio are leaked
- When ice_dpll_init_pins_info(RCLK_INPUT) fails, same resources
are leaked
Fix this by jumping to the deinit_info label which properly calls
ice_dpll_deinit_info() to free all allocated resources.
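The pattern of routing every late failure through a single cleanup label can be sketched in userspace C. All names here (`struct demo_info`, `demo_init_info()`, `demo_deinit_info()`, the `fail_step` parameter) are hypothetical, and `calloc()`/`free()` stand in for the kernel allocators; this is not the driver's code.

```c
#include <errno.h>
#include <stdlib.h>

struct demo_info {
	int *inputs;
	int *input_prio;
};

/* Frees everything and clears the pointers, so it is safe to call twice. */
static void demo_deinit_info(struct demo_info *d)
{
	free(d->inputs);
	d->inputs = NULL;
	free(d->input_prio);
	d->input_prio = NULL;
}

/* fail_step simulates a failure: 1 = second allocation, 2 = a later step. */
static int demo_init_info(struct demo_info *d, int fail_step)
{
	d->inputs = calloc(4, sizeof(*d->inputs));
	if (!d->inputs)
		return -ENOMEM; /* nothing allocated yet, direct return is fine */

	d->input_prio = (fail_step == 1) ? NULL :
			calloc(4, sizeof(*d->input_prio));
	if (!d->input_prio)
		goto deinit_info; /* inputs must not be leaked */

	if (fail_step == 2)
		goto deinit_info; /* both arrays must not be leaked */

	return 0;

deinit_info:
	demo_deinit_info(d);
	return -ENOMEM;
}
```

Every failure after the first allocation leaves the structure with no live allocations, which is the property the commit restores for the real function.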
Fixes: d7999f5ea64b ("ice: implement dpll interface to control cgu")
Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
ice_dpll_deinit_info() calls kfree() on several pf->dplls fields
(inputs, outputs, eec.input_prio, pps.input_prio) but does not set
the pointers to NULL afterward. This leaves dangling pointers in the
pf->dplls structure.
While not currently exploitable through existing code paths, this is
unsafe because:
1. If ice_dpll_init_info() is called again after a deinit (e.g. during
driver recovery), and a subsequent allocation within init fails, the
error path will jump to deinit_info and call ice_dpll_deinit_info()
again. Since some pointers still hold the old freed addresses, this
would result in a double-free.
2. Any future code that checks these pointers before use or after free
would be unprotected against use-after-free.
Follow the common kernel convention of setting pointers to NULL after
kfree() so that:
- kfree(NULL) is a safe no-op, preventing double-free
- NULL checks on these pointers become meaningful
This is a preparatory fix for a subsequent patch that routes additional
error paths in ice_dpll_init_info() to the deinit_info label.
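A userspace sketch of the convention follows, with `free()` standing in for kfree() (both treat a NULL argument as a no-op). The structure and function names are hypothetical and only model two of the fields mentioned above.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the pf->dplls fields that deinit frees. */
struct demo_dplls {
	int *inputs;
	int *outputs;
};

static int demo_dplls_init(struct demo_dplls *d)
{
	d->inputs = calloc(2, sizeof(*d->inputs));
	d->outputs = calloc(2, sizeof(*d->outputs));
	return (d->inputs && d->outputs) ? 0 : -1;
}

/* Free and clear: a second call sees NULL pointers and frees nothing. */
static void demo_dplls_deinit(struct demo_dplls *d)
{
	free(d->inputs);
	d->inputs = NULL;
	free(d->outputs);
	d->outputs = NULL;
}
```

Without the NULL stores, a second `demo_dplls_deinit()` call would pass the stale addresses to `free()` again, which is the double-free scenario described in point 1.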
Fixes: d7999f5ea64b ("ice: implement dpll interface to control cgu")
Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
isl1208_setup_irq() calls enable_irq_wake() after a successful
IRQ request, but the driver has no remove path that balances it.
The driver is devm-only, so on unbind devm releases the IRQ -
but enable_irq_wake() is not undone by IRQ release, so the wake
count for that IRQ stays incremented.
Each rebind therefore leaks one wake reference; the leak doubles
for the chip variant that has a separate evdet IRQ, since
isl1208_setup_irq() is then called twice during probe.
Register a devm action that calls disable_irq_wake() per IRQ.
While at it, check enable_irq_wake()'s return value:
on failure, propagate the error rather than silently registering
a disable action for an IRQ whose wake state was never enabled.
Fixes: 9ece7cd833a3 ("rtc: isl1208: Add "evdet" interrupt source for isl1219")
Signed-off-by: John Madieu <john.madieu.xa@bp.renesas.com>
Link: https://patch.msgid.link/20260425154959.2796261-3-john.madieu.xa@bp.renesas.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
io_pin_pages() checks that nr_pages does not exceed INT_MAX, then
allocates a struct page * array of nr_pages entries. kvmalloc() limits
allocations to INT_MAX bytes, but the check counts pages, not bytes.
On 64-bit each entry is 8 bytes, so the array hits the INT_MAX byte
limit at INT_MAX / sizeof(struct page *) pages, well before the page
count check fires.
Since commit b4e41050b212 ("io_uring/rsrc: raise registered buffer 1GB
limit") raised the per-buffer cap to 1TB, a buffer near that cap maps
~2^28 pages, making the array allocation exceed INT_MAX bytes. This
passes the page count check, reaches kvmalloc(), and triggers the
WARN_ON_ONCE() for oversized allocations in __kvmalloc_node_noprof().
Check nr_pages against INT_MAX / sizeof(struct page *) so the buffer is
rejected with -EOVERFLOW before the allocation is attempted.
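The arithmetic can be checked with a small standalone program. The sketch assumes 4 KiB pages and 8-byte pointers, and the helper names are hypothetical; the exact comparison used in the patch is not reproduced here.

```c
#include <limits.h>
#include <stdint.h>

/* Assumptions: 4 KiB pages and 8-byte pointers, as on a common 64-bit build. */
#define DEMO_PAGE_SIZE 4096ULL
#define DEMO_PTR_SIZE 8ULL

/* Old-style check: bounds only the number of pages. */
static int old_check_ok(uint64_t nr_pages)
{
	return nr_pages <= (uint64_t)INT_MAX;
}

/* New-style check: bounds the byte size of the page pointer array. */
static int new_check_ok(uint64_t nr_pages)
{
	return nr_pages <= (uint64_t)INT_MAX / DEMO_PTR_SIZE;
}

/* Size in bytes of an array holding one pointer per page. */
static uint64_t array_bytes(uint64_t nr_pages)
{
	return nr_pages * DEMO_PTR_SIZE;
}
```

For a 1 TB buffer, `nr_pages` is 2^40 / 2^12 = 2^28. That passes the old check, but the array needs 2^28 * 8 = 2^31 bytes, one more than INT_MAX, so only the new check rejects it.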
Reported-by: syzbot+f99b00a963915b6b52c6@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f99b00a963915b6b52c6
Fixes: b4e41050b212 ("io_uring/rsrc: raise registered buffer 1GB limit")
Tested-by: syzbot+f99b00a963915b6b52c6@syzkaller.appspotmail.com
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://patch.msgid.link/20260621012933.50571-1-kartikey406@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
netif_keep_dst() only needs to be called once for the uplink VSI, not
once for each port representor. Move it from ice_eswitch_setup_repr()
to ice_eswitch_enable_switchdev().
Fixes: defd52455aee ("ice: do Tx through PF netdev in slow-path")
Signed-off-by: Marcin Szycik <marcin.szycik@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Tested-by: Patryk Holda <patryk.holda@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
ice_init_link() can return an error status from ice_update_link_info()
or ice_init_phy_user_cfg(), causing probe to fail.
An incorrect NVM update procedure can result in link/PHY errors, and
the recommended resolution is to update the NVM using the correct
procedure. If the driver fails probe due to link errors, the user
cannot update the NVM to recover. The link/PHY errors logged are
non-fatal: they are already annotated as 'not a fatal error if this
fails'.
Since none of the errors inside ice_init_link() should prevent probe
from completing, convert it to void and remove the error check in the
caller. All failures are already logged; callers have no meaningful
recovery path for link init errors.
Fixes: 5b246e533d01 ("ice: split probe into smaller functions")
Cc: stable@vger.kernel.org
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Alexander Nowlin <alexander.nowlin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Fix unreachable code: the conditionals in ice_set_pauseparam() used
the bitwise-AND operator, suggesting that aq_failures is a bitmap, but
it is actually an enum, which makes the third condition unreachable.
Replace the if-else ladder with a switch statement. Also move the
aq_failures initialization to the variable declaration and remove the
redundant zeroing from ice_set_fc().
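Why the third branch cannot be reached is easiest to see with concrete values. The enum below is a hypothetical stand-in with sequential values 0 to 3; the driver's actual names and values are not shown in this message, so treat them as an assumption.

```c
/* Hypothetical enum with sequential values, not the driver's definition. */
enum demo_fail {
	DEMO_FAIL_NONE = 0,
	DEMO_FAIL_GET = 1,
	DEMO_FAIL_SET = 2,
	DEMO_FAIL_UPDATE = 3,
};

/* Bitwise ladder: 3 & 1 is nonzero, so DEMO_FAIL_UPDATE is taken for
 * DEMO_FAIL_GET and the third branch is never entered. */
static enum demo_fail classify_bitwise(enum demo_fail f)
{
	if (f & DEMO_FAIL_GET)
		return DEMO_FAIL_GET;
	else if (f & DEMO_FAIL_SET)
		return DEMO_FAIL_SET;
	else if (f & DEMO_FAIL_UPDATE)
		return DEMO_FAIL_UPDATE;
	return DEMO_FAIL_NONE;
}

/* Switch form: each enumerator is matched by value. */
static enum demo_fail classify_switch(enum demo_fail f)
{
	switch (f) {
	case DEMO_FAIL_GET:
		return DEMO_FAIL_GET;
	case DEMO_FAIL_SET:
		return DEMO_FAIL_SET;
	case DEMO_FAIL_UPDATE:
		return DEMO_FAIL_UPDATE;
	default:
		return DEMO_FAIL_NONE;
	}
}
```

Any value that would satisfy the third test has bit 0 or bit 1 set and is therefore consumed by one of the first two branches.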
Fixes: fcea6f3da546 ("ice: Add stats and ethtool support")
Signed-off-by: Lukasz Czapnik <lukasz.czapnik@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Resetting all VFs causes a resource leak on VFs with FDIR filters
enabled, because CTRL VSIs are only invalidated and not freed. Fix this
by using ice_vf_ctrl_vsi_release() instead of ice_vf_ctrl_invalidate_vsi(),
which aligns the behavior with the ice_reset_vf() function.
Reproduction:
echo 1 > /sys/class/net/$pf/device/sriov_numvfs
ethtool -N $vf flow-type ether proto 0x9000 action 0
echo 1 > /sys/class/net/$pf/device/reset
Fixes: da62c5ff9dcd ("ice: Add support for per VF ctrl VSI enabling")
Signed-off-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Add the debugfs and trace plumbing used to trigger and observe
manual client reset.
The reset interface exposes a trigger file for operator-initiated
reset and a status file for tracking the most recent run. The
tracepoints record scheduling, completion, and blocked caller
behavior so reset progress can be diagnosed from the client side.
debugfs layout under /sys/kernel/debug/ceph/<client>/reset/:
trigger - write to initiate a manual reset
status - read to see the most recent reset result
The reset directory is cleaned up via debugfs_remove_recursive()
on the parent, so individual file dentries are not stored.
Tracepoints:
ceph_client_reset_schedule - reset queued
ceph_client_reset_complete - reset finished (success or failure)
ceph_client_reset_blocked - caller blocked waiting for reset
ceph_client_reset_unblocked - caller unblocked after reset
All tracepoints use a null-safe access for monc.auth->global_id
to guard against early-init or late-teardown edge cases.
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Add the client-side reset state machine, request gating, and manual
session teardown implementation.
Manual reset is an operator-triggered escape hatch for client/MDS
stalemates in which caps, locks, or unsafe metadata state stop making
forward progress. The reset blocks new metadata work, attempts a
bounded best-effort drain of dirty client state while sessions are
still alive, and finally asks the MDS to close sessions before tearing
local session state down directly.
The reset state machine tracks four phases: IDLE -> QUIESCING ->
DRAINING -> TEARDOWN -> IDLE. QUIESCING is set synchronously by
schedule_reset() before the workqueue item is dispatched, so that new
metadata requests and file-lock acquisitions are gated immediately --
even before the work function begins running. All non-IDLE phases
block callers on blocked_wq, preventing races with session teardown.
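The phase cycle and the gating rule described above can be modelled in a few lines. The enum and helper names are hypothetical and the sketch ignores locking, wait queues and the shutdown path; it only encodes the transition order and the "non-IDLE blocks callers" rule.

```c
/* Hypothetical names; the order follows the cycle described above. */
enum demo_phase {
	DEMO_IDLE,
	DEMO_QUIESCING,
	DEMO_DRAINING,
	DEMO_TEARDOWN,
};

/* Returns the phase that follows p in the reset cycle. */
static enum demo_phase demo_next_phase(enum demo_phase p)
{
	switch (p) {
	case DEMO_IDLE:
		return DEMO_QUIESCING;
	case DEMO_QUIESCING:
		return DEMO_DRAINING;
	case DEMO_DRAINING:
		return DEMO_TEARDOWN;
	case DEMO_TEARDOWN:
		return DEMO_IDLE;
	}
	return DEMO_IDLE;
}

/* New metadata work is gated in every phase except IDLE. */
static int demo_blocks_callers(enum demo_phase p)
{
	return p != DEMO_IDLE;
}
```

Because QUIESCING is already a blocking phase, setting it at schedule time gates new callers before the work function has started.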
The drain phase flushes mdlog state, dirty caps, and pending cap
releases for a bounded interval. State that still cannot make progress
within that interval is discarded during teardown, which is the point
of the reset: break the stalemate and allow fresh sessions to rebuild
clean state.
The session teardown follows the established check_new_map()
forced-close pattern: unregister sessions under mdsc->mutex, then clean
up caps and requests under s->s_mutex. Reconnect is not attempted
because the MDS only accepts reconnects during its own RECONNECT phase
after restart, not from an active client.
Blocked callers are released when reset completes and observe the final
result via -EAGAIN (reset failed) or 0 (success). Internal work-function
errors such as -ENOMEM are not propagated to unrelated callers like
open() or flock(); the detailed error remains in debugfs and
tracepoints.
The work function checks st->shutdown before each phase transition
(DRAINING, TEARDOWN) so that a concurrent ceph_mdsc_destroy() is not
overwritten. If destroy already took ownership, the work function
releases session references and returns without touching the state.
The timeout calculation for blocked-request waiters uses max_t() to
prevent jiffies underflow when the deadline has already passed.
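The underflow is a property of unsigned tick counters in general. The sketch below uses plain `unsigned long` values in place of jiffies and hypothetical helper names; the exact max_t() expression in the patch is not shown here, so the clamp is only an equivalent illustration.

```c
/* Naive remaining time: wraps to a huge value once now has passed deadline. */
static unsigned long remaining_naive(unsigned long deadline, unsigned long now)
{
	return deadline - now;
}

/* Clamped remaining time: interpret the difference as signed and take the
 * larger of it and zero. The unsigned-to-signed conversion of an
 * out-of-range value is implementation-defined in ISO C; mainstream
 * compilers, and the kernel, rely on two's complement behavior. */
static unsigned long remaining_clamped(unsigned long deadline, unsigned long now)
{
	long diff = (long)(deadline - now);

	return diff > 0 ? (unsigned long)diff : 0UL;
}
```

With a deadline of 100 and a current time of 150, the naive form yields a value close to ULONG_MAX, which a waiter would treat as a near-infinite timeout; the clamped form yields 0.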
The close-grace sleep before teardown is a best-effort nudge to let
queued REQUEST_CLOSE messages egress; it is not a correctness
requirement since the MDS still has session_autoclose as a fallback.
The destroy path marks reset as failed and wakes blocked waiters before
cancel_work_sync() so unmount does not stall.
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Convert wait_caps_flush() from a silent indefinite wait into a diagnostic
wait loop that periodically dumps pending cap flush state.
The underlying wait semantics remain intact: callers still wait until the
requested cap flushes complete. The difference is that long stalls now
produce actionable diagnostics instead of looking like a silent hang.
CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES limits the number of entries
emitted per diagnostic dump, and CEPH_CAP_FLUSH_MAX_DUMP_ITERS
limits the number of timed diagnostic dumps before the wait
continues silently. When more entries exist than the per-dump
limit, a truncation count is reported. When the dump iteration
limit is reached, a final suppression message is emitted so the
transition to silence is explicit.
The diagnostic dump collects flush entry data under cap_dirty_lock into
a bounded on-stack array, then prints after releasing the lock. This
avoids holding the spinlock across printk calls.
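The snapshot-then-print approach can be sketched in userspace with a pthread mutex in place of the spinlock and `printf()` in place of printk. `DEMO_MAX_DUMP_ENTRIES`, `struct demo_flush` and both functions are hypothetical and only model the flush tid.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

#define DEMO_MAX_DUMP_ENTRIES 4

struct demo_flush {
	unsigned long tid;
};

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

/* Copies at most DEMO_MAX_DUMP_ENTRIES tids while holding the lock and
 * returns how many entries did not fit (the truncation count). */
static size_t demo_snapshot(const struct demo_flush *list, size_t n,
			    unsigned long *out, size_t *copied)
{
	size_t i, truncated;

	pthread_mutex_lock(&demo_lock);
	*copied = n < DEMO_MAX_DUMP_ENTRIES ? n : DEMO_MAX_DUMP_ENTRIES;
	for (i = 0; i < *copied; i++)
		out[i] = list[i].tid;
	truncated = n - *copied;
	pthread_mutex_unlock(&demo_lock);

	return truncated;
}

/* Prints from the on-stack copy only after the lock has been released. */
static void demo_dump(const struct demo_flush *list, size_t n)
{
	unsigned long snap[DEMO_MAX_DUMP_ENTRIES];
	size_t copied, truncated, i;

	truncated = demo_snapshot(list, n, snap, &copied);
	for (i = 0; i < copied; i++)
		printf("pending flush tid %lu\n", snap[i]);
	if (truncated)
		printf("%zu more entries not shown\n", truncated);
}
```

The fixed array bound keeps the stack usage and the lock hold time independent of how many flushes are pending.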
A null cf->ci on the global flush list indicates a bug since all
cap_flush entries are initialized with a valid ci before being added.
Signal this with WARN_ON_ONCE while still printing enough context for
debugging.
READ_ONCE is used for the i_last_cap_flush_ack field, which is read
outside the inode lock domain. Flush tids are monotonically increasing
and acks are processed in order under i_ceph_lock, so the latest ack
tid is always the most recently written value.
Add a ci pointer to struct ceph_cap_flush so that the diagnostic
dump can identify which inode each pending flush belongs to. The
new i_last_cap_flush_ack field tracks the latest acknowledged flush
tid per inode for diagnostic correlation.
This improves reset-drain observability and is also useful for
existing sync and writeback troubleshooting paths.
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|