linux.git - Linux kernel source tree

Age	Commit message	Author
12 days	ksmbd: fix use-after-free of conn->preauth_info in concurrent SMB2 NEGOTIATE	Gil Portnoy
	conn->preauth_info is shared connection state (struct preauth_integrity_info, kmalloc-96) that is allocated and freed by the SMB2 NEGOTIATE handler and read by the response send path. smb2_handle_negotiate() allocates conn->preauth_info, and on a deassemble_neg_contexts() failure kfrees it and sets it to NULL. Both the allocation and the free/NULL happen under ksmbd_conn_lock(conn) (the connection srv_mutex), which is held across the whole handler body. The response send path smb3_preauth_hash_rsp(), called from the send: block of __handle_ksmbd_work(), reads conn->preauth_info and dereferences conn->preauth_info->Preauth_HashValue (via ksmbd_gen_preauth_integrity_hash()) without taking conn_lock. When a client drives two SMB2 NEGOTIATE requests on the same connection, one worker can free conn->preauth_info on the failing-negotiate path while a concurrent send-path worker is reading it, producing a slab use-after-free read (KASAN-confirmed). The send-path read tested conn->preauth_info for NULL but raced with the free that occurs between the NULL check and the dereference, so the NULL guard alone does not close the window. Serialize the NEGOTIATE-branch read in smb3_preauth_hash_rsp() under ksmbd_conn_lock(conn) and re-check conn->preauth_info inside the lock. Because the negotiate handler holds conn_lock across its kfree + NULL assignment, a reader that also takes conn_lock either runs fully before the allocation or fully after the NULL store, and can never observe the freed-but-not-yet-NULLed pointer. ksmbd_gen_preauth_integrity_hash() takes no locks itself (it only computes a SHA-512 over the buffer), so no lock-ordering inversion is introduced, and conn_lock is a sleepable mutex which is safe on this send path (it already performs network I/O). Fixes: aa7253c2393f ("ksmbd: fix memory leak in smb2_handle_negotiate") Signed-off-by: Gil Portnoy <dddhkts1@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
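The lock-then-re-check pattern described in this commit can be sketched in userspace C. This is a minimal illustration, not the ksmbd code: `struct conn`, `negotiate_fail()` and `send_path_read()` are hypothetical names, and a pthread mutex stands in for the connection mutex.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the connection; not the ksmbd structure. */
struct conn {
	pthread_mutex_t lock;	/* plays the role of the connection mutex */
	unsigned char *preauth;	/* plays the role of conn->preauth_info */
};

/* Failing-negotiate path: free and clear the pointer under the lock. */
static void negotiate_fail(struct conn *c)
{
	pthread_mutex_lock(&c->lock);
	free(c->preauth);
	c->preauth = NULL;
	pthread_mutex_unlock(&c->lock);
}

/*
 * Send path: take the same lock, then check the pointer inside it.
 * Returns 1 if the buffer was read, 0 if it was already gone.
 */
static int send_path_read(struct conn *c, unsigned char *out, size_t len)
{
	int ok = 0;

	pthread_mutex_lock(&c->lock);
	if (c->preauth) {
		memcpy(out, c->preauth, len);
		ok = 1;
	}
	pthread_mutex_unlock(&c->lock);
	return ok;
}
```

Because both paths hold the same mutex, the reader sees either the live buffer or NULL, never a pointer that has been freed but not yet cleared.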
12 days	ksmbd: keep common response iovecs in the work item	Namjae Jeon
	Most SMB responses need no more than four kvec entries, but every work item currently allocates a separate four-entry array and frees it after the response is sent. Embed the common array in struct ksmbd_work and allocate a larger array only when a response exceeds the inline capacity. This removes one allocation and one free from the common request path while preserving support for larger compound and read responses. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
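A rough userspace sketch of the "inline array with heap fallback" idea follows. The names `struct vec`, `struct work`, `work_add_vec()` and the doubling growth policy are assumptions made for illustration; only the inline capacity of four comes from the commit text.

```c
#include <stdlib.h>
#include <string.h>

#define INLINE_VECS 4	/* inline capacity; four is the figure in the commit text */

/* Hypothetical stand-ins; these are not struct kvec or struct ksmbd_work. */
struct vec {
	void *base;
	size_t len;
};

struct work {
	struct vec inline_vecs[INLINE_VECS];	/* embedded array, no allocation */
	struct vec *vecs;			/* inline_vecs or a heap array */
	size_t cap;				/* entries available in vecs */
	size_t used;				/* entries filled so far */
};

static void work_init(struct work *w)
{
	w->vecs = w->inline_vecs;
	w->cap = INLINE_VECS;
	w->used = 0;
}

/* Append one entry, moving to a heap array only when the current one is full. */
static int work_add_vec(struct work *w, void *base, size_t len)
{
	if (w->used == w->cap) {
		size_t new_cap = w->cap * 2;	/* growth policy is an assumption */
		struct vec *bigger = malloc(new_cap * sizeof(*bigger));

		if (!bigger)
			return -1;
		memcpy(bigger, w->vecs, w->used * sizeof(*bigger));
		if (w->vecs != w->inline_vecs)
			free(w->vecs);
		w->vecs = bigger;
		w->cap = new_cap;
	}
	w->vecs[w->used].base = base;
	w->vecs[w->used].len = len;
	w->used++;
	return 0;
}

/* Free the heap array if one was needed, then return to the inline state. */
static void work_release(struct work *w)
{
	if (w->vecs != w->inline_vecs)
		free(w->vecs);
	work_init(w);
}
```

In this sketch a response with up to four entries never touches the allocator, which is the saving the commit describes for the common path.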
12 days	ksmbd: route v2 lease breaks on the client lease channel	Namjae Jeon
	v2 leases are scoped by ClientGuid. When the same client uses multiple connections, smbtorture expects lease break notifications to be sent on the connection associated with the client lease table, not necessarily on the connection that owns the individual open being broken. Keep a referenced connection in the lease table and use it for v2 lease break notifications while it is still active. Fall back to the open's connection if the table connection is being released. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
12 days	ksmbd: break RH leases before delete-on-close	Namjae Jeon
	The delete paths only marked the opened file delete pending or delete-on-close. When another client still held a read/handle lease, no lease break was sent before the delete state changed. smb2.lease.unlink uses a create request with FILE_DELETE_ON_CLOSE and expects the second client's unlink to break the first client's RH lease to R with ACK_REQUIRED set. SetInfo(FileDispositionInformation) has the same lease-breaking requirement. Break level-II/read-handle leases before setting delete pending or delete-on-close so clients are notified before the file is removed. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
12 days	ksmbd: honor SMB2 v2 lease epochs	Namjae Jeon
	v2 lease responses should continue from the client supplied epoch. Initialize a new v2 lease from the requested epoch plus one so create responses match the epoch returned by Windows and expected by smbtorture. For a single chained break sequence, increment the epoch only for the first break notification. Follow-up breaks such as RH->R and R->NONE in smb2.lease.v2_breaking3 reuse the same epoch. Record when a waiter slept behind pending_break and let the later truncate/open overwrite break consume that marker to reuse the current epoch instead of assigning a new one. Do not increment the epoch when a same-client, same-key create asks for the already granted RH state. The epoch changes only when the granted lease state changes. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
12 days	ksmbd: do not wait for RH lease break ack on overwrite	Namjae Jeon
	smb2.lease.breaking4 expects an overwrite against an RH lease to send RH->NONE lease break notification but complete the triggering create without waiting for the break ack. Keep the lease in break-in-progress state until the client eventually acknowledges the downgrade, but do not hold the overwrite request behind that ack. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
12 days	ksmbd: chain pending lease breaks before waking waiters	Namjae Jeon
	A pending open can require more than one lease break before the existing lease becomes compatible with the operation that triggered the break. smb2.lease.breaking3 expects the server to hold the pending normal open through RWH->RH and RH->R, while a later overwrite waiter must not collapse that second break directly to RH->NONE. Keep pending_break held for lease breaks until the current triggering operation is compatible with the lease state. Snapshot the truncate request per oplock_break() call so another waiter cannot overwrite the state of the active break. Use the requested oplock level when deciding whether to chain another break. A second lease open only needs RWH->RH, while a normal none-oplock open can continue down to R and then NONE. For non-truncating metadata operations, break leases only down to read caching. Operations such as delete-on-close need to drop handle caching, but should not send a second R->NONE break after the client acknowledges RH->R. Also send STATUS_PENDING for levelII/read-lease break waiters. An async SMB2 create becomes cancelable only after the server sends an NT_STATUS_PENDING interim response. A waiter that blocks behind an already active lease break must receive the interim response before sleeping on pending_break, otherwise the client can process a later lease break while the create request is still not marked pending. Avoid duplicate interim responses when an overwrite first breaks a write oplock and then scans levelII/read leases. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
12 days	ksmbd: compute lease break-in-progress flag on response	Namjae Jeon
	SMB2_LEASE_FLAG_BREAK_IN_PROGRESS is a transient create response flag, not persistent lease state. Do not store the flag in lease->flags when a same-key open is granted during a pending break. Instead, derive it from lease opens that are still waiting for a break ACK while building the lease create response, and keep lease->flags for persistent lease flags such as the parent lease key. This clears the flag naturally after the break ACK completes and fixes reopen responses that report BREAK_IN_PROGRESS after the lease is no longer breaking. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
12 days	ksmbd: treat unnamed DATA stream as base file	Namjae Jeon
	The SMB path suffix :: names the unnamed data stream of the base file, not an alternate data stream backed by a DosStream xattr. Canonicalize an empty stream name with an explicit type to a NULL stream name after parsing. This keeps the base filename produced by strsep() and lets open continue through the normal base-file path instead of looking for a non-existent empty stream xattr. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
12 days	ksmbd: align SMB2 oplock break ack handling	Namjae Jeon
	Handle SMB2 oplock break acknowledgments according to the server-side validation rules in MS-SMB2. Return STATUS_INVALID_DEVICE_STATE when an ACK arrives while the open is not breaking, reject SMB2_OPLOCK_LEVEL_LEASE with STATUS_INVALID_PARAMETER, allow BATCH acknowledgments to EXCLUSIVE, and make invalid ACK levels fail with STATUS_INVALID_OPLOCK_PROTOCOL after lowering the oplock to NONE. Update the successful response from the final granted oplock level instead of relying on the oplock transition helpers, which could turn invalid ACKs into successful responses. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
12 days	ksmbd: share SMB2 lease state across opens	Namjae Jeon
	Model SMB2 leases as per-client/per-key objects instead of keeping a separate lease copy in every oplock_info. The lease table now stores lease objects and each lease tracks the opens that reference it. This makes same ClientGuid/LeaseKey opens observe a single lease state, so lease upgrades, breaks, ACKs, and close teardown do not diverge across per-open copies. Keep one reference for the lease table entry and one reference for each open, and remove the table entry when the last open is detached. Update lease break ACK handling to refresh all open oplock levels from the shared lease state. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
12 days	ksmbd: clean up lease response flags and directory leases	Namjae Jeon
	Do not echo reserved v1 lease flags back to clients. For lease v2 responses, only return BREAK_IN_PROGRESS and PARENT_LEASE_KEY_SET when they are meaningful, and preserve the parent lease key in the response. Allow directory leases whenever the request is a valid lease v2 request, and initialize v2 lease epochs from the first server-granted state change. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
12 days	ksmbd: fix lease break and ack state handling	Namjae Jeon
	Do not skip valid lease states containing WRITE_CACHING when breaking level-II/read leases for writes and truncates. Handle lease break acknowledgments according to the SMB2 rule that the acknowledged state must be a subset of the server's break target. Apply the acknowledged state directly and keep the break pending on failed ACKs. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
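The subset rule for lease break acknowledgments reduces to a single bitmask test. The sketch below uses the lease state bit values from MS-SMB2; the function name `ack_is_subset()` is hypothetical and the surrounding ksmbd state handling is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

/* Lease state bits as defined in MS-SMB2. */
#define LEASE_READ_CACHING	0x01u
#define LEASE_HANDLE_CACHING	0x02u
#define LEASE_WRITE_CACHING	0x04u

/*
 * An acknowledgment is acceptable only if it claims no caching bit
 * beyond the state the server asked the client to break to.
 */
static bool ack_is_subset(uint32_t acked, uint32_t break_target)
{
	return (acked & ~break_target) == 0;
}
```

For example, with a break target of read plus handle caching, an acknowledgment of read caching alone passes the check, while one that still claims write caching does not.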
12 days	ksmbd: use connection ClientGUID for lease lookup	Namjae Jeon
	MS-SMB2 defines the lease table lookup key as Connection.ClientGuid. Use the connection ClientGUID consistently when checking for same-client leases and duplicate lease keys. Also preserve directory and parent lease metadata when copying an existing lease state to a new open. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
12 days	ksmbd: validate SMB2 lease create contexts	Namjae Jeon
	Validate SMB2 lease context lengths, requested lease state bits, and v2 flags before using the context. Return errors via ERR_PTR so CREATE can distinguish a missing lease context from a malformed one. Also ignore lease v2 contexts for SMB 2.1, where they are not valid. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
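The three-way result (no context, malformed context, valid context) can be illustrated with userspace imitations of the kernel's ERR_PTR(), PTR_ERR() and IS_ERR() helpers. Everything else here is an assumption: `struct lease_req`, `parse_lease()` and the length check are hypothetical and far simpler than the real validation.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ERRNO 4095

/* Userspace imitations of the kernel's ERR_PTR(), PTR_ERR() and IS_ERR(). */
static void *err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static long ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

static int is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical parsed lease request. */
struct lease_req {
	uint32_t state;
};

/*
 * Returns NULL when no lease context was supplied, an error pointer
 * when the context is too short, and 'out' when parsing succeeded.
 */
static struct lease_req *parse_lease(const unsigned char *ctx, size_t len,
				     struct lease_req *out)
{
	if (!ctx)
		return NULL;
	if (len < sizeof(out->state))
		return err_ptr(-EINVAL);
	memcpy(&out->state, ctx, sizeof(out->state));
	return out;
}
```

A caller tests is_err() first and fails the request with ptr_err(), then treats NULL as "open without a lease".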
12 days	smb/server: fix debug log endianness in smb2_cancel()	ChenXiaoSong
	Convert to CPU byte order to avoid incorrect debug log on big-endian architectures. Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
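SMB2 fields are little-endian on the wire, so printing one without conversion shows a byte-swapped value on big-endian hosts. The kernel uses helpers such as le64_to_cpu() for this; the sketch below is a portable userspace equivalent with a hypothetical name, and it does not claim which field the debug log prints.

```c
#include <stdint.h>

/*
 * Decode a little-endian 64-bit wire value. Building the result with
 * shifts gives the same answer on little- and big-endian hosts.
 */
static uint64_t le64_decode(const unsigned char b[8])
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | b[i];
	return v;
}
```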
12 days	ksmbd: track the connection owning a byte-range lock	Namjae Jeon
	SMB2_LOCK adds each granted byte-range lock to both the file lock list and the lock list of the connection which handled the request. The final close and durable handle paths, however, remove the connection list entry while holding fp->conn->llist_lock. With SMB3 multichannel, the connection handling the LOCK request can be different from the connection which opened the file. The entry can therefore be removed under a different spinlock from the one protecting the list it belongs to. A concurrent traversal can then access freed struct ksmbd_lock and struct file_lock objects. Record the connection owning each lock's clist entry and hold a reference to it while the entry is linked. Use that connection and its llist_lock for unlock, rollback, close, and durable preserve. Durable reconnect assigns the new connection as the owner when publishing the locks again. Fixes: f5a544e3bab7 ("ksmbd: add support for SMB3 multichannel") Cc: stable@vger.kernel.org Reported-by: Musaab Khan <musaab.khan@protonmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
12 days	net: ethernet: ti: icssg: guard PA stat lookups	Philippe Schenker
	icssg_ndo_get_stats64() unconditionally calls emac_get_stat_by_name() with FW PA stat names regardless of whether the PA stats block is present on the hardware. emac_get_stat_by_name() already guards the PA stats lookup with `if (emac->prueth->pa_stats)`; when that pointer is NULL the lookup falls through to netdev_err() and returns -EINVAL. Because ndo_get_stats64 is polled regularly by the networking stack this produces thousands of log entries of the form: icssg-prueth icssg1-eth end0: Invalid stats FW_RX_ERROR A secondary consequence is that the int(-EINVAL) return value is implicitly widened to a near-ULLONG_MAX unsigned value when accumulated into the __u64 fields of rtnl_link_stats64, silently corrupting the rx_errors, rx_dropped and tx_dropped counters reported by `ip -s link`. Every other PA-aware code path in the driver is already guarded with the same `if (emac->prueth->pa_stats)` check. Apply the same guard here. Fixes: 0d15a26b247d ("net: ti: icssg-prueth: Add ICSSG FW Stats") Signed-off-by: Philippe Schenker <philippe.schenker@impulsing.ch> Reviewed-by: Simon Horman <horms@kernel.org> Cc: danishanwar@ti.com Cc: rogerq@kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260618093037.3448858-1-dev@pschenker.ch Signed-off-by: Jakub Kicinski <kuba@kernel.org>
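The counter corruption described above follows from C's integer conversion rules: adding a negative int to an unsigned 64-bit accumulator converts the int to a huge unsigned value first. A minimal sketch, with hypothetical function names and a made-up counter value:

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical lookup: a counter value, or -EINVAL when the stat is absent. */
static int get_stat(int stats_present)
{
	return stats_present ? 5 : -EINVAL;
}

/* Unguarded: the negative error code is converted to uint64_t and added. */
static uint64_t accumulate_unguarded(uint64_t total, int stats_present)
{
	total += get_stat(stats_present);
	return total;
}

/* Guarded: only query the stat when the block that provides it exists. */
static uint64_t accumulate_guarded(uint64_t total, int stats_present)
{
	if (stats_present)
		total += get_stat(stats_present);
	return total;
}
```

Starting from zero with the stat absent, the unguarded version yields 2^64 minus EINVAL, which is the near-ULLONG_MAX value the commit mentions, while the guarded version leaves the total at zero.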
12 days	Merge branch 'fix-stale-register-bounds-on-lsm-retval-context-load'	Alexei Starovoitov
	Tristan Madani says: ==================== Fix stale register bounds on LSM retval context load From: Tristan Madani <tristan@talencesecurity.com> check_mem_access() calls __mark_reg_s32_range() to narrow a register to the LSM hook retval range, but the intersection preserves stale bounds from prior instructions. Add mark_reg_unknown() before narrowing (same pattern as the else branch) and a selftest that catches the mismatch. Changes in v3: - Add selftest demonstrating the issue (Eduard Zingerman) - No code change in patch 1 from v2 ==================== Link: https://patch.msgid.link/20260622230123.3695446-1-tristmd@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 days	selftests/bpf: Add test for stale bounds on LSM retval context load	Tristan Madani
	Add a verifier test that catches the stale-bounds issue fixed in the previous patch. The test sets r6 = 0 to create known bounds, then loads the LSM hook return value into r6 from the context. Without the fix, the verifier intersects the retval range with the stale bounds and incorrectly narrows r6 to a single value, pruning the fall-through branch as dead code and missing the div-by-zero. Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Tristan Madani <tristan@talencesecurity.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260622230123.3695446-3-tristmd@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 days	bpf: Reset register bounds before narrowing retval range in check_mem_access()	Tristan Madani
	When the BPF verifier processes a context load of an LSM hook return value, it calls __mark_reg_s32_range() to narrow the register to the hook's valid range. However, __mark_reg_s32_range() intersects the new range with the register's existing bounds using max_t()/min_t() rather than replacing them. If the destination register carries stale bounds from a prior instruction (e.g. BPF_MOV64_IMM), the intersection can produce a range narrower than reality. The verifier then believes it knows the register's exact value, while at runtime the actual hook return value is loaded, creating a verifier/runtime mismatch that can be used to bypass BPF memory safety checks. The else branch already calls mark_reg_unknown() to reset register state before any narrowing. Apply the same reset in the is_retval path so stale bounds are cleared before __mark_reg_s32_range() intersects. Fixes: 5d99e198be27 ("bpf, lsm: Add check for BPF LSM return value") Cc: stable@vger.kernel.org Signed-off-by: Tristan Madani <tristan@talencesecurity.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260622230123.3695446-2-tristmd@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
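The difference between intersecting and replacing bounds can be shown with a toy range type. This is not verifier code: `struct s32_range`, `range_intersect()` and `range_reset()` are hypothetical, and the retval range of [-4095, 0] used in the usage note is only an illustrative choice.

```c
#include <stdint.h>

/* Toy model of a register's signed 32-bit bounds. */
struct s32_range {
	int32_t min;
	int32_t max;
};

/* Narrow by intersection: existing bounds survive if they are tighter. */
static void range_intersect(struct s32_range *r, int32_t lo, int32_t hi)
{
	if (lo > r->min)
		r->min = lo;
	if (hi < r->max)
		r->max = hi;
}

/* Forget everything known about the register. */
static void range_reset(struct s32_range *r)
{
	r->min = INT32_MIN;
	r->max = INT32_MAX;
}
```

If the register previously held the constant 0, its bounds are [0, 0]; intersecting with [-4095, 0] keeps [0, 0], so the model believes the loaded value is exactly 0. Calling range_reset() before range_intersect() gives [-4095, 0], which matches what a load can actually return.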
12 days	rocker: Fix memory leak in ofdpa_port_fdb()	Ziran Zhang
	In ofdpa_port_fdb(), the hash_del() only unlinks the node from hash table, but does not free it. Fix this by adding kfree(found) after the !found == removing check, where the pointer value is no longer needed. Found by Coccinelle kfree script. Cc: <stable+noautosel@kernel.org> # rocker is a test harness, it's never loaded on production systems Signed-off-by: Ziran Zhang <zhangcoder@yeah.net> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260616013245.7098-1-zhangcoder@yeah.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
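The leak pattern is general: unlinking a node from a container does not release its memory. A small userspace sketch with a singly linked list in place of the kernel hash table; all names are hypothetical.

```c
#include <stdlib.h>

struct entry {
	struct entry *next;
	int key;
};

/* Push a new entry at the head of the list. */
static int entry_add(struct entry **head, int key)
{
	struct entry *e = malloc(sizeof(*e));

	if (!e)
		return -1;
	e->key = key;
	e->next = *head;
	*head = e;
	return 0;
}

/* Unlink only: the caller still owns the returned memory. */
static struct entry *entry_unlink(struct entry **head, int key)
{
	for (struct entry **pp = head; *pp; pp = &(*pp)->next) {
		if ((*pp)->key == key) {
			struct entry *found = *pp;

			*pp = found->next;
			found->next = NULL;
			return found;
		}
	}
	return NULL;
}

/* Remove: unlink, then free once the pointer is no longer needed. */
static int entry_remove(struct entry **head, int key)
{
	struct entry *found = entry_unlink(head, key);

	if (!found)
		return -1;
	free(found);
	return 0;
}
```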
12 days	Merge branch 'bpf-guard-conntrack-opts-error-writes'	Alexei Starovoitov
	Yiyang Chen says:

	====================
	bpf: Guard conntrack opts error writes

	The conntrack lookup/allocation kfuncs expose an opts/opts__sz pair. The verifier checks the caller-provided opts__sz range, but the wrappers currently write opts->error after internal errors even when opts__sz is too small to include that field.

	Patch 1 writes opts->error only when opts__sz includes it, and uses a single helper to fold ERR_PTR returns into the kfunc ABI result while keeping the local nfct result variable in each wrapper.

	Patch 2 adds a bpf_nf regression check that keeps a guard in opts->error while passing opts__sz covering only netns_id.

	The regression check follows the existing bpf_nf test shape. Before the fix, the guard is overwritten with -EINVAL even though opts__sz covers only the first four bytes of the options object. After the fix, the kfunc still returns NULL for the invalid size, but the guard remains intact.

	Validation, rebased and tested on bpf-next master e771677c937d ("Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd"):

	  git diff --check origin/master..HEAD: OK
	  scripts/checkpatch.pl --strict on 1/2 and 2/2: OK
	  make O=/root/ebpf-verifier-bug-detection/kernel-build/bpf-next \
	      net/netfilter/nf_conntrack_bpf.o: OK

	  Focused QEMU direct-runner against XDP and TC lookup/alloc paths:
	    unpatched bpf-next e771677c937d: guard overwritten with -EINVAL
	    patched v2 007dfd0341cd: guard preserved as 0x12345678

	  QEMU upstream bpf_nf selftest with CONFIG_NF_CONNTRACK_MARK, CONFIG_NF_CONNTRACK_ZONES, and legacy iptables enabled:
	    ./test_progs -t bpf_nf -vv: OK

	  git am of exported 1/2 and 2/2 on a fresh worktree at base: OK
	  range-diff between branch commits and git-am result: equivalent

	Changes in v2:
	- Rebased onto current bpf-next master.
	- Reworked patch 1 to use bpf_ct_opts_result() for the ERR_PTR-to-NULL conversion and guarded opts->error write, as suggested by Alexei.
	- Kept the local nfct result variable in each wrapper before returning through bpf_ct_opts_result().
	- Added matching Fixes tags to the selftest patch so the regression test can be backported with the fix.

	v1: https://lore.kernel.org/bpf/cover.1781586477.git.chenyy23@mails.tsinghua.edu.cn/
	====================

	Link: https://patch.msgid.link/cover.1781765747.git.chenyy23@mails.tsinghua.edu.cn
	Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 days	selftests/bpf: Cover small conntrack opts error writes	Yiyang Chen
	Add a conntrack kfunc regression check for opts__sz values that do not cover opts->error. The BPF program initializes opts->error with a guard value, calls the lookup and allocation kfuncs with opts__sz set to sizeof(opts->netns_id), and verifies that the guard is still intact after the kfunc returns NULL. Without the conntrack wrapper guard, the kfunc error path overwrites that guard with -EINVAL even though the verifier checked only the first four bytes of the options object. Fixes: b4c2b9593a1c ("net/netfilter: Add unstable CT lookup helpers for XDP and TC-BPF") Fixes: d7e79c97c00c ("net: netfilter: Add kfuncs to allocate and insert CT") Signed-off-by: Yiyang Chen <chenyy23@mails.tsinghua.edu.cn> Link: https://lore.kernel.org/r/007dfd0341cd84560e4795a2a951cc56d4adff1d.1781765747.git.chenyy23@mails.tsinghua.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 days	bpf: Guard conntrack opts error writes	Yiyang Chen
	The conntrack lookup and allocation kfuncs take an opts pointer together with an opts__sz argument. The verifier checks only the memory range described by opts__sz, but the wrappers unconditionally write opts->error whenever the internal lookup or allocation helper returns an error. For an invalid size smaller than the end of opts->error, that write can land outside the verifier-checked range. Keep returning NULL for invalid arguments, but only report the error through opts->error when the supplied size includes the field. This preserves error reporting for the supported 12-byte and 16-byte layouts, and for other invalid sizes that still include opts->error. Fixes: b4c2b9593a1c ("net/netfilter: Add unstable CT lookup helpers for XDP and TC-BPF") Fixes: d7e79c97c00c ("net: netfilter: Add kfuncs to allocate and insert CT") Signed-off-by: Yiyang Chen <chenyy23@mails.tsinghua.edu.cn> Link: https://lore.kernel.org/r/9535e781fe14449b1d4e9bbc3baa7566a93bf512.1781765747.git.chenyy23@mails.tsinghua.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
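The size guard amounts to comparing the caller-supplied size against the end offset of the field before writing it. In the sketch below, `struct opts` is a simplified hypothetical layout (only the netns_id-then-error order is taken from the commit text), `report_error()` is a made-up name, and `offsetofend` is defined locally to mirror the kernel macro of the same name.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical options layout: netns_id first, then error. */
struct opts {
	int32_t netns_id;
	int32_t error;
};

/* Offset of the first byte after a member, like the kernel's offsetofend(). */
#define offsetofend(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/* Store the error only if the caller-declared size covers the field. */
static void report_error(struct opts *o, uint32_t opts_sz, int err)
{
	if (opts_sz >= offsetofend(struct opts, error))
		o->error = err;
}
```

With opts_sz equal to sizeof(o->netns_id), a guard value placed in o->error survives the call; with the full structure size, the error code is written as before.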
12 days	rtc: mv: add suspend/resume support for wakeup	Xue Lei
	Add PM suspend/resume callbacks to enable/disable IRQ wake for the RTC alarm interrupt. This allows the RTC alarm to wake the system from STR (e.g. via rtcwake -m mem -s N). Without this, the RTC IRQ is masked during suspend by the MPIC's IRQCHIP_MASK_ON_SUSPEND behavior, preventing alarm-based wakeup. Signed-off-by: Xue Lei <Xue.Lei@windriver.com> Link: https://patch.msgid.link/20260611023350.1370881-1-Xue.Lei@windriver.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
12 days	rtc: aspeed: add AST2700 compatible	Tommy Huang
	Add support for matching the RTC controller on ASPEED AST2700 SoCs. The AST2700 RTC controller is compatible with the existing ASPEED RTC driver implementation. Signed-off-by: Tommy Huang <tommy_huang@aspeedtech.com> Link: https://patch.msgid.link/20260601-ast2700-rtc-v1-2-15d4ca46500a@aspeedtech.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
12 days	dt-bindings: rtc: add ASPEED AST2700 compatible	Tommy Huang
	Document the compatible string for the RTC controller found on ASPEED AST2700 SoCs. Signed-off-by: Tommy Huang <tommy_huang@aspeedtech.com> Acked-by: Conor Dooley <conor.dooley@microchip.com> Link: https://patch.msgid.link/20260601-ast2700-rtc-v1-1-15d4ca46500a@aspeedtech.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
12 days	rtc: interface: fix typos in rtc_handle_legacy_irq() documentation	Yahya Saqban
	Fix spelling of 'occurence' to 'occurrence' and 'of' to 'or' in the kernel-doc comment for rtc_handle_legacy_irq(). Signed-off-by: Yahya Saqban <yahyasaqban@gmail.com> Link: https://patch.msgid.link/20260512210235.343070-1-yahyasaqban@gmail.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
12 days	rtc: msc313: fix NULL deref in shared IRQ handler at probe	Stepan Ionichev
	msc313_rtc_probe() calls devm_request_irq() with IRQF_SHARED and &pdev->dev as the cookie, but platform_set_drvdata() is only called later after the clock setup. With a shared IRQ line, another device on the same line can trigger the handler in that window. The handler does dev_get_drvdata() on the cookie, gets NULL, and dereferences priv->rtc_base in interrupt context. Pass priv as the cookie directly so the handler reads it from dev_id without the lookup, removing the dependency on probe order. Fixes: be7d9c9161b9 ("rtc: Add support for the MSTAR MSC313 RTC") Signed-off-by: Stepan Ionichev <sozdayvek@gmail.com> Link: https://patch.msgid.link/20260511032703.48262-1-sozdayvek@gmail.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
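The two handler styles can be compared in a userspace sketch. All types and names are hypothetical stand-ins, and the lookup-style handler here returns an error on NULL so the example stays safe to run; the kernel handler described above had no such check.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the device and the driver's private data. */
struct device {
	void *drvdata;		/* set late in probe */
};

struct rtc_priv {
	unsigned int status;
};

/* Lookup style: depends on drvdata already being set when the IRQ fires. */
static int handler_via_drvdata(int irq, void *dev_id)
{
	struct device *dev = dev_id;
	struct rtc_priv *priv = dev->drvdata;	/* NULL until probe stores it */

	(void)irq;
	if (!priv)
		return -1;	/* added for the sketch only */
	return (int)priv->status;
}

/* Direct style: the cookie is the private data itself. */
static int handler_via_cookie(int irq, void *dev_id)
{
	struct rtc_priv *priv = dev_id;

	(void)irq;
	return (int)priv->status;
}
```

Passing the private structure as the cookie means the handler is valid from the moment it is registered, regardless of when the driver data pointer is stored.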
12 days	rtc: remove unused pcap driver	Arnd Bergmann
	The platform was removed a few years ago, and the mfd driver is also gone now, so it is impossible to build or use it. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://patch.msgid.link/20260527193927.3523952-1-arnd@kernel.org Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
12 days	rtc: interface: Add rtc_read_next_alarm() to read next expiring timer	Mario Limonciello
	Add a new function rtc_read_next_alarm() that reads the next expiring alarm from the RTC timerqueue. This is different from rtc_read_alarm(), which only reads the aie_timer. The wakealarm sysfs file programs the rtc->aie_timer, whereas the alarmtimer suspend routine programs its own timer into the RTC timerqueue. Both timers end up in the RTC's timerqueue, and the first expiring timer is what gets armed in the hardware. This new function allows code to query which alarm will actually fire next, regardless of which subsystem programmed it. This is needed by platform code that needs to program secondary timers based on the actual next wakeup time. Link: https://lore.kernel.org/all/87ed50z0le.ffs@tglx Suggested-by: Thomas Gleixner <tglx@linutronix.de> Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://patch.msgid.link/20260521043714.1022930-2-mario.limonciello@amd.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
12 days	blk-cgroup: defer blkcg css_put until blkg is unlinked from queue	Zizhi Wo
	[BUG]
	Our fuzz testing triggered a blkcg use-after-free issue:

	BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0
	Call Trace:
	 ...
	 blkcg_deactivate_policy+0x244/0x4d0
	 ioc_rqos_exit+0x44/0xe0
	 rq_qos_exit+0xba/0x120
	 __del_gendisk+0x50b/0x800
	 del_gendisk+0xff/0x190
	 ...

	[CAUSE]
	process1                                process2
	cgroup_rmdir
	...
	css_killed_work_fn
	 offline_css
	 ...
	  blkcg_destroy_blkgs
	  ...
	   __blkg_release
	    css_put(&blkg->blkcg->css)
	    blkg_free
	     INIT_WORK(xxx, blkg_free_workfn)
	     schedule_work
	css_put
	...
	blkcg_css_free
	 kfree(blkcg)--------blkcg has been freed!!!
	====================================schedule_work
	blkg_free_workfn
	                                        __del_gendisk
	                                         rq_qos_exit
	                                          ioc_rqos_exit
	                                           blkcg_deactivate_policy
	                                            mutex_lock(&q->blkcg_mutex)
	                                            spin_lock_irq(&q->queue_lock)
	                                            list_for_each_entry(blkg, xxx)
	                                             blkcg = blkg->blkcg
	                                             spin_lock(&blkcg->lock)-------UAF!!!
	 mutex_lock(&q->blkcg_mutex)
	 spin_lock_irq(&q->queue_lock)
	 /* Only then is the blkg removed from the list */
	 list_del_init(&blkg->q_node)

	As a result, a blkg can still be reachable through q->blkg_list while its ->blkcg has already been freed.

	[Fix]
	Fix this by deferring the blkcg css_put() until after the blkg has been unlinked from q->blkg_list in blkg_free_workfn(). This ensures that the blkcg outlives every blkg still reachable through q->blkg_list, so any iterator holding q->queue_lock is guaranteed to observe a valid blkg->blkcg.

	While at it, move css_tryget_online() from blkg_create() into blkg_alloc() so that the css reference is owned by the alloc/free pair rather than straddling layers:

	  blkg_alloc()  <-> blkg_free()
	  blkg_create() <-> blkg_destroy()

	Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
	Suggested-by: Hou Tao <houtao1@huawei.com>
	Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
	Reviewed-by: Yu Kuai <yukuai@fygo.io>
	Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>
	Link: https://patch.msgid.link/20260616011746.2451461-1-wozizhi@huaweicloud.com
	Signed-off-by: Jens Axboe <axboe@kernel.dk>
12 days	blk-cgroup: fix UAF in __blkcg_rstat_flush()	Michal Koutný
	When multiple blkgs in the same blkcg are released concurrently, a use-after-free can occur. The race happens when one blkg's __blkcg_rstat_flush() removes another blkg's iostat entries via llist_del_all(). The second blkg sees an empty list and proceeds to free itself while the first is still iterating over its entries. Move the flush from __blkg_release() (RCU callback) to blkg_release() (before call_rcu). This ensures the RCU grace period waits for any concurrent flush's rcu_read_lock() section to complete before freeing. Cc: stable@vger.kernel.org Cc: Jay Shin <jaeshin@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Fixes: 20cb1c2fb756 ("blk-cgroup: Flush stats before releasing blkcg_gq") Reported-by: coregee2000@gmail.com Closes: https://lore.kernel.org/linux-block/CAHPqNmwT9oRpem3J3erS_W0uSQND47LGGSBsNxP8E6uSUish1w@mail.gmail.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Jose Fernandez (Anthropic) <jose.fernandez@linux.dev> Link: https://patch.msgid.link/20260205155425.342084-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
12 days	block, bfq: protect async queue reset with blkcg locks	Cen Zhang
	Writing 0 to BFQ's low_latency attribute ends weight raising for active, idle and async queues. The async cgroup path walks q->blkg_list, converts each blkg to BFQ policy data and then reads bfqg->async_bfqq and bfqg->async_idle_bfqq.

	That walk was protected only by bfqd->lock. blkcg release work is serialized by q->blkcg_mutex and q->queue_lock instead, and blkg_free_workfn() can call BFQ's pd_free_fn before it removes blkg->q_node from q->blkg_list. A low_latency reset can therefore still find the blkg on the queue list after the BFQ policy data has been freed.

	The buggy scenario involves two paths, with each column showing the order within that path:

	  BFQ low_latency reset:            blkcg blkg release work:
	  1. bfq_low_latency_store()        1. blkg_free_workfn() takes
	     calls bfq_end_wr().               q->blkcg_mutex.
	  2. bfq_end_wr_async() walks       2. BFQ pd_free_fn drops the
	     q->blkg_list.                     final bfq_group reference.
	  3. blkg_to_bfqg() returns         3. blkg->q_node remains on the
	     stale policy data.                q->blkg_list until
	                                       list_del_init().
	  4. bfq_end_wr_async_queues()
	     reads async queue fields.

	Fix this by taking q->blkcg_mutex and q->queue_lock around the q->blkg_list walk, then taking bfqd->lock before touching BFQ async queues. The mutex serializes against policy-data free and queue_lock stabilizes the list. Move the async reset out of bfq_end_wr()'s existing bfqd->lock critical section so the lock order matches blkcg policy callbacks.

	Validation reproduced this kernel report:

	BUG: KASAN: slab-use-after-free in bfq_end_wr_async_queues+0x246/0x340
	Call Trace:
	 <TASK>
	 dump_stack_lvl+0x66/0xa0
	 print_report+0xce/0x630
	 ? bfq_end_wr_async_queues+0x246/0x340
	 ? srso_alias_return_thunk+0x5/0xfbef5
	 ? __virt_addr_valid+0x20d/0x410
	 ? bfq_end_wr_async_queues+0x246/0x340
	 kasan_report+0xe0/0x110
	 ? bfq_end_wr_async_queues+0x246/0x340
	 bfq_end_wr_async_queues+0x246/0x340
	 bfq_end_wr_async+0xba/0x180
	 bfq_low_latency_store+0x4e5/0x690
	 ? 0xffffffffc02150da
	 ? __pfx_bfq_low_latency_store+0x10/0x10
	 ? __pfx_bfq_low_latency_store+0x10/0x10
	 elv_attr_store+0xc4/0x110
	 kernfs_fop_write_iter+0x2f5/0x4a0
	 vfs_write+0x604/0x11f0
	 ? __pfx_locks_remove_posix+0x10/0x10
	 ? __pfx_vfs_write+0x10/0x10
	 ksys_write+0xf9/0x1d0
	 ? __pfx_ksys_write+0x10/0x10
	 do_syscall_64+0x115/0x6a0
	 entry_SYSCALL_64_after_hwframe+0x77/0x7f

	Allocated by task 544:
	 kasan_save_stack+0x33/0x60
	 kasan_save_track+0x14/0x30
	 __kasan_kmalloc+0xaa/0xb0
	 bfq_pd_alloc+0xc0/0x1b0
	 blkg_alloc+0x346/0x960
	 blkg_create+0x8c2/0x10d0
	 bio_associate_blkg_from_css+0x9f3/0xfa0
	 bio_associate_blkg+0xd9/0x200
	 bio_init+0x303/0x640
	 __blkdev_direct_IO_simple+0x56b/0x8a0
	 blkdev_direct_IO+0x8e7/0x2580
	 blkdev_read_iter+0x205/0x400
	 vfs_read+0x7b0/0xda0
	 ksys_read+0xf9/0x1d0
	 do_syscall_64+0x115/0x6a0
	 entry_SYSCALL_64_after_hwframe+0x77/0x7f

	Freed by task 465:
	 kasan_save_stack+0x33/0x60
	 kasan_save_track+0x14/0x30
	 kasan_save_free_info+0x3b/0x60
	 __kasan_slab_free+0x5f/0x80
	 kfree+0x307/0x580
	 blkg_free_workfn+0xef/0x460
	 process_one_work+0x8d0/0x1870
	 worker_thread+0x575/0xf80
	 kthread+0x2e7/0x3c0
	 ret_from_fork+0x576/0x810
	 ret_from_fork_asm+0x1a/0x30

	Fixes: 44e44a1b329e ("block, bfq: improve responsiveness")
	Assisted-by: Codex:gpt-5.5
	Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
	Reviewed-by: Tao Cui <cuitao@kylinos.cn>
	Link: https://patch.msgid.link/20260621135930.2657810-1-zzzccc427@gmail.com
	Signed-off-by: Jens Axboe <axboe@kernel.dk>
12 days	nbd: don't warn when reclassifying a busy socket lock	Deepanshu Kartikey
	nbd_reclassify_socket() warns via WARN_ON_ONCE() if the socket lock is held at the point of reclassification. That assertion was copied from nvme-tcp, where the socket is created internally by the kernel (sock_create_kern()) and is never visible to user space, so the lock is guaranteed to be free. NBD is different: the socket is looked up from a user-supplied fd in nbd_get_socket(), and user space retains that fd. A concurrent syscall on the same socket (or softirq processing taking bh_lock_sock() on a connected TCP socket) can legitimately hold the lock at the instant NBD reclassifies it. sock_allow_reclassification() then returns false and the WARN_ON_ONCE() fires, which turns into a crash under panic_on_warn. This is reachable by simply racing NBD_CMD_CONNECT against socket activity on the same fd, as reported by syzbot. Hitting a held lock here is expected for an externally owned socket and is not a kernel bug, so skip reclassification silently instead of warning. Reclassification is a lockdep-only annotation, so skipping it in the rare racing case is harmless. Reported-by: syzbot+6b85d1e39a5b8ed9a954@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=6b85d1e39a5b8ed9a954 Fixes: d532cddb6c60 ("nbd: Reclassify sockets to avoid lockdep circular dependency") Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260621235255.66015-1-kartikey406@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
12 days	block: fix incorrect error injection static key decrement	Christoph Hellwig
	Only decrement the static key when we had items and thus it was incremented before. Fixes: e8dcf2d142bd ("block: add configurable error injection") Reported-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260622160752.1552516-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
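The rule in this message, undo the key increment only if it actually happened, can be sketched in user space. This is a minimal illustration, not the block layer code: `demo_key_count`, `struct demo_inject_list`, `demo_add_item()` and `demo_clear_items()` are hypothetical names, and the assumption that the key is bumped once when the first item is added is mine, not the commit's.

```c
#include <stddef.h>

/* Hypothetical stand-in for a static key's enable count. */
static int demo_key_count;

/* Hypothetical container for configured error-injection items. */
struct demo_inject_list {
	size_t nr_items;
};

static void demo_add_item(struct demo_inject_list *l)
{
	/* Assumption: the first item is what switches the key on. */
	if (l->nr_items++ == 0)
		demo_key_count++;
}

static void demo_clear_items(struct demo_inject_list *l)
{
	/* Only undo the increment if there were items, so it was taken. */
	if (l->nr_items) {
		l->nr_items = 0;
		demo_key_count--;
	}
}
```

Without the `if (l->nr_items)` guard, clearing an empty list would drive the count negative, which is the kind of imbalance the fix avoids.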
12 days	e1000e: Reconfigure PLL clock gate timeout and re-enable K1 on Meteor Lake	Dima Ruinskiy
	Commit 3c7bf5af21960 ("e1000e: Introduce private flag to disable K1") disabled K1 by default on Meteor Lake and newer systems due to packet loss observed on various platforms. However, disabling K1 caused an increase in power consumption. To mitigate this, reconfigure the PLL clock gate value so that K1 can remain enabled without incurring the additional power consumption. Re-enable K1 by default, but keep the private flag to support disabling it via ethtool. Additionally, introduce a DMI quirk table, so that K1 may be disabled by default on known problematic systems. Currently, this includes the Dell Pro 16 Plus, where the issue has been reported to persist despite the changes to the PLL lock timeout. Link: https://bugzilla.kernel.org/show_bug.cgi?id=220954 Link: https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20250623/048860.html Link: https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20260330/054059.html Signed-off-by: Dima Ruinskiy <dima.ruinskiy@intel.com> Co-developed-by: Vitaly Lifshits <vitaly.lifshits@intel.com> Signed-off-by: Vitaly Lifshits <vitaly.lifshits@intel.com> Fixes: 3c7bf5af21960 ("e1000e: Introduce private flag to disable K1") Tested-by: Moriya Kadosh <moriyax.kadosh@intel.com> Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
12 days	i40e: Fix i40e_debug() to use struct i40e_hw argument	Mohamed Khalfella
	i40e_debug() macro takes struct i40e_hw *h as first argument. But the macro body uses hw instead of h. This has been working so far because hw happens to be the name of the variable in the context where the macro is expanded. Fix the macro to use the passed argument. Fixes: 5dfd37c37a44 ("i40e: Split i40e_osdep.h") Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Alexander Nowlin <alexander.nowlin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
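The macro bug described here is easy to reproduce outside the driver. The sketch below uses hypothetical names (`struct demo_hw`, `DEMO_DEBUG_BUGGY`, `DEMO_DEBUG_FIXED`) and a hit counter in place of real logging; it only shows why a macro body that names `hw` instead of its parameter `h` happens to work until a caller passes something else.

```c
/* Hypothetical stand-in for struct i40e_hw. */
struct demo_hw {
	unsigned int debug_mask;
	int hits;	/* counts how often a debug message would be emitted */
};

/* Buggy: ignores parameter h and captures whatever "hw" is in scope. */
#define DEMO_DEBUG_BUGGY(h, m)				\
	do {						\
		if ((m) & (hw)->debug_mask)		\
			(hw)->hits++;			\
	} while (0)

/* Fixed: uses the argument that was actually passed. */
#define DEMO_DEBUG_FIXED(h, m)				\
	do {						\
		if ((m) & (h)->debug_mask)		\
			(h)->hits++;			\
	} while (0)

static void demo_log_both(struct demo_hw *hw, struct demo_hw *other)
{
	DEMO_DEBUG_BUGGY(other, 1);	/* silently updates hw, not other */
	DEMO_DEBUG_FIXED(other, 1);	/* updates other as requested */
}
```

The buggy variant would not even compile in a function that has no variable named `hw`, which is why the problem stayed hidden while every caller used that name.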
12 days	ice: dpll: fix memory leak in ice_dpll_init_info error paths	ZhaoJinming
	Several error return paths in ice_dpll_init_info() directly return without freeing previously allocated resources, causing memory leaks:
	- When de->input_prio allocation fails, d->inputs is leaked
	- When dp->input_prio allocation fails, d->inputs and de->input_prio are leaked
	- When ice_get_cgu_rclk_pin_info() fails, all previously allocated inputs/outputs/input_prio are leaked
	- When ice_dpll_init_pins_info(RCLK_INPUT) fails, same resources are leaked

	Fix this by jumping to the deinit_info label which properly calls ice_dpll_deinit_info() to free all allocated resources.

	Fixes: d7999f5ea64b ("ice: implement dpll interface to control cgu")
	Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com>
	Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
	Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
	Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
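The pattern of routing every late failure to one cleanup label can be sketched in user space as follows. All names (`demo_alloc()`, `struct demo_info`, `demo_init_info()`, the `fail_second` switch) are hypothetical, and `live_allocs` exists only so the sketch can show that nothing is left allocated after a failed init.

```c
#include <stdlib.h>

static int live_allocs;	/* outstanding allocations, for demonstration only */

static void *demo_alloc(size_t n, int force_fail)
{
	void *p = force_fail ? NULL : malloc(n);

	if (p)
		live_allocs++;
	return p;
}

static void demo_free(void *p)
{
	if (p) {
		live_allocs--;
		free(p);
	}
}

struct demo_info {
	int *inputs;
	int *input_prio;
};

static void demo_deinit_info(struct demo_info *d)
{
	demo_free(d->inputs);
	d->inputs = NULL;
	demo_free(d->input_prio);
	d->input_prio = NULL;
}

static int demo_init_info(struct demo_info *d, int fail_second)
{
	d->inputs = demo_alloc(16 * sizeof(int), 0);
	if (!d->inputs)
		return -1;		/* nothing to undo yet */

	d->input_prio = demo_alloc(16 * sizeof(int), fail_second);
	if (!d->input_prio)
		goto deinit_info;	/* a bare return here would leak inputs */

	return 0;

deinit_info:
	demo_deinit_info(d);
	return -1;
}
```

Keeping a single exit label means a newly added allocation only needs one more line in the deinit helper rather than a matching free on every error path.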
12 days	ice: dpll: set pointers to NULL after kfree in ice_dpll_deinit_info	ZhaoJinming
	ice_dpll_deinit_info() calls kfree() on several pf->dplls fields (inputs, outputs, eec.input_prio, pps.input_prio) but does not set the pointers to NULL afterward. This leaves dangling pointers in the pf->dplls structure.

	While not currently exploitable through existing code paths, this is unsafe because:
	1. If ice_dpll_init_info() is called again after a deinit (e.g. during driver recovery), and a subsequent allocation within init fails, the error path will jump to deinit_info and call ice_dpll_deinit_info() again. Since some pointers still hold the old freed addresses, this would result in a double-free.
	2. Any future code that checks these pointers before use or after free would be unprotected against use-after-free.

	Follow the common kernel convention of setting pointers to NULL after kfree() so that:
	- kfree(NULL) is a safe no-op, preventing double-free
	- NULL checks on these pointers become meaningful

	This is a preparatory fix for a subsequent patch that routes additional error paths in ice_dpll_init_info() to the deinit_info label.

	Fixes: d7999f5ea64b ("ice: implement dpll interface to control cgu")
	Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com>
	Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
	Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
	Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
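A user-space analogue of the convention, using `free()` in place of `kfree()`, might look like the sketch below. `struct demo_dplls` and both helpers are hypothetical; the point is only that clearing each pointer makes a second deinit call a harmless no-op, because `free(NULL)` does nothing.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the pf->dplls fields named in the commit. */
struct demo_dplls {
	int *inputs;
	int *outputs;
};

static int demo_dplls_alloc(struct demo_dplls *d)
{
	d->inputs = calloc(4, sizeof(*d->inputs));
	d->outputs = calloc(4, sizeof(*d->outputs));
	return (d->inputs && d->outputs) ? 0 : -1;
}

static void demo_dplls_deinit(struct demo_dplls *d)
{
	free(d->inputs);
	d->inputs = NULL;	/* a repeated call now frees NULL, a no-op */
	free(d->outputs);
	d->outputs = NULL;
}
```

Calling `demo_dplls_deinit()` twice in a row is safe here; without the two NULL stores the second call would free the same addresses again.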
12 days	rtc: isl1208: Balance enable_irq_wake() with disable_irq_wake() on cleanup	John Madieu
	isl1208_setup_irq() calls enable_irq_wake() after a successful IRQ request, but the driver has no remove path that balances it. The driver is devm-only, so on unbind devm releases the IRQ - but enable_irq_wake() is not undone by IRQ release, so the wake count for that IRQ stays incremented. Each rebind therefore leaks one wake reference; the leak doubles for the chip variant that has a separate evdet IRQ, since isl1208_setup_irq() is then called twice during probe. Register a devm action that calls disable_irq_wake() per IRQ. While at it, check enable_irq_wake()'s return value: on failure, propagate the error rather than silently registering a disable action for an IRQ whose wake state was never enabled. Fixes: 9ece7cd833a3 ("rtc: isl1208: Add "evdet" interrupt source for isl1219") Signed-off-by: John Madieu <john.madieu.xa@bp.renesas.com> Link: https://patch.msgid.link/20260425154959.2796261-3-john.madieu.xa@bp.renesas.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
12 days	io_uring/memmap: bound io_pin_pages() by page array byte size	Deepanshu Kartikey
	io_pin_pages() checks that nr_pages does not exceed INT_MAX, then allocates a struct page * array of nr_pages entries. kvmalloc() limits allocations to INT_MAX bytes, but the check counts pages, not bytes. On 64-bit each entry is 8 bytes, so the array hits the INT_MAX byte limit at INT_MAX / sizeof(struct page *) pages, well before the page count check fires. Since commit b4e41050b212 ("io_uring/rsrc: raise registered buffer 1GB limit") raised the per-buffer cap to 1TB, a buffer near that cap maps ~2^28 pages, making the array allocation exceed INT_MAX bytes. This passes the page count check, reaches kvmalloc(), and triggers the WARN_ON_ONCE() for oversized allocations in __kvmalloc_node_noprof(). Check nr_pages against INT_MAX / sizeof(struct page *) so the buffer is rejected with -EOVERFLOW before the allocation is attempted. Reported-by: syzbot+f99b00a963915b6b52c6@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=f99b00a963915b6b52c6 Fixes: b4e41050b212 ("io_uring/rsrc: raise registered buffer 1GB limit") Tested-by: syzbot+f99b00a963915b6b52c6@syzkaller.appspotmail.com Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://patch.msgid.link/20260621012933.50571-1-kartikey406@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
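The arithmetic behind the new bound can be checked with a small user-space sketch. Assuming 4 KiB pages (an assumption, the message only says ~2^28 pages), a 1 TB buffer spans 2^40 / 2^12 = 2^28 pages; at 8 bytes per pointer the array needs 2^31 bytes, one more than INT_MAX = 2^31 - 1. The helper name `demo_check_nr_pages()` is hypothetical and `struct page` is only forward-declared, since just the pointer size matters.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>

struct page;	/* opaque here; only sizeof(struct page *) is needed */

/*
 * Reject page counts whose pointer array would exceed INT_MAX bytes,
 * mirroring the check described in the commit message.
 */
static int demo_check_nr_pages(unsigned long nr_pages)
{
	if (nr_pages > INT_MAX / sizeof(struct page *))
		return -EOVERFLOW;
	return 0;
}
```

Comparing against `INT_MAX / sizeof(struct page *)` rather than multiplying `nr_pages` by the pointer size keeps the check itself free of overflow.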
12 days	ice: call netif_keep_dst() once when entering switchdev mode	Marcin Szycik
	netif_keep_dst() only needs to be called once for the uplink VSI, not once for each port representor. Move it from ice_eswitch_setup_repr() to ice_eswitch_enable_switchdev(). Fixes: defd52455aee ("ice: do Tx through PF netdev in slow-path") Signed-off-by: Marcin Szycik <marcin.szycik@intel.com> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Patryk Holda <patryk.holda@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
12 days	ice: fix ice_init_link() error return preventing probe	Paul Greenwalt
	ice_init_link() can return an error status from ice_update_link_info() or ice_init_phy_user_cfg(), causing probe to fail. An incorrect NVM update procedure can result in link/PHY errors, and the recommended resolution is to update the NVM using the correct procedure. If the driver fails probe due to link errors, the user cannot update the NVM to recover. The link/PHY errors logged are non-fatal: they are already annotated as 'not a fatal error if this fails'. Since none of the errors inside ice_init_link() should prevent probe from completing, convert it to void and remove the error check in the caller. All failures are already logged; callers have no meaningful recovery path for link init errors. Fixes: 5b246e533d01 ("ice: split probe into smaller functions") Cc: stable@vger.kernel.org Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Alexander Nowlin <alexander.nowlin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
12 days	ice: fix AQ error code comparison in ice_set_pauseparam()	Lukasz Czapnik
	Fix unreachable code: the conditionals in ice_set_pauseparam() used the bitwise-AND operator suggesting aq_failures is a bitmap, but it is actually an enum, making the third condition logically unreachable. Replace the if-else ladder with a switch statement. Also move the aq_failures initialization to the variable declaration and remove the redundant zeroing from ice_set_fc(). Fixes: fcea6f3da546 ("ice: Add stats and ethtool support") Signed-off-by: Lukasz Czapnik <lukasz.czapnik@intel.com> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
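Why the third condition could never run is easiest to see with concrete values. The sketch below assumes a plain enum numbered 0 to 3 (the enumerator names and values are hypothetical, not copied from the driver) and returns which branch was taken.

```c
/* Hypothetical enum with sequential values, not bit flags. */
enum demo_fc_fail {
	DEMO_FAIL_NONE = 0,
	DEMO_FAIL_GET,		/* 1 */
	DEMO_FAIL_SET,		/* 2 */
	DEMO_FAIL_UPDATE,	/* 3 == 1 | 2 */
};

/* Old style: treats the enum as a bitmap. */
static int demo_branch_with_and(enum demo_fc_fail f)
{
	if (f & DEMO_FAIL_GET)
		return 1;
	else if (f & DEMO_FAIL_SET)
		return 2;
	else if (f & DEMO_FAIL_UPDATE)
		return 3;	/* unreachable: any value with bit 0 or 1 set matched above */
	return 0;
}

/* New style: compares whole values. */
static int demo_branch_with_switch(enum demo_fc_fail f)
{
	switch (f) {
	case DEMO_FAIL_GET:
		return 1;
	case DEMO_FAIL_SET:
		return 2;
	case DEMO_FAIL_UPDATE:
		return 3;
	default:
		return 0;
	}
}
```

With these values, `DEMO_FAIL_UPDATE` is misreported as the first case by the AND ladder because 3 & 1 is non-zero, while the switch selects the third case.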
12 days	ice: fix FDIR CTRL VSI resource leak in ice_reset_all_vfs()	Dawid Osuchowski
	Resetting all VFs causes a resource leak on VFs with FDIR filters enabled, as CTRL VSIs are only invalidated and not freed. Fix by using ice_vf_ctrl_vsi_release() instead of ice_vf_ctrl_invalidate_vsi(), which aligns behavior with the ice_reset_vf() function.

	Reproduction:
	  echo 1 > /sys/class/net/$pf/device/sriov_numvfs
	  ethtool -N $vf flow-type ether proto 0x9000 action 0
	  echo 1 > /sys/class/net/$pf/device/reset

	Fixes: da62c5ff9dcd ("ice: Add support for per VF ctrl VSI enabling")
	Signed-off-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
	Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
	Reviewed-by: Simon Horman <horms@kernel.org>
	Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
	Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
12 days	ceph: add manual reset debugfs control and tracepoints	Alex Markuze
	Add the debugfs and trace plumbing used to trigger and observe manual client reset. The reset interface exposes a trigger file for operator-initiated reset and a status file for tracking the most recent run. The tracepoints record scheduling, completion, and blocked caller behavior so reset progress can be diagnosed from the client side.

	debugfs layout under /sys/kernel/debug/ceph/<client>/reset/:
	  trigger - write to initiate a manual reset
	  status - read to see the most recent reset result

	The reset directory is cleaned up via debugfs_remove_recursive() on the parent, so individual file dentries are not stored.

	Tracepoints:
	  ceph_client_reset_schedule - reset queued
	  ceph_client_reset_complete - reset finished (success or failure)
	  ceph_client_reset_blocked - caller blocked waiting for reset
	  ceph_client_reset_unblocked - caller unblocked after reset

	All tracepoints use a null-safe access for monc.auth->global_id to guard against early-init or late-teardown edge cases.

	Signed-off-by: Alex Markuze <amarkuze@redhat.com>
	Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
	Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
	Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
12 days	ceph: add client reset state machine and session teardown	Alex Markuze
	Add the client-side reset state machine, request gating, and manual session teardown implementation. Manual reset is an operator-triggered escape hatch for client/MDS stalemates in which caps, locks, or unsafe metadata state stop making forward progress. The reset blocks new metadata work, attempts a bounded best-effort drain of dirty client state while sessions are still alive, and finally asks the MDS to close sessions before tearing local session state down directly. The reset state machine tracks four phases: IDLE -> QUIESCING -> DRAINING -> TEARDOWN -> IDLE. QUIESCING is set synchronously by schedule_reset() before the workqueue item is dispatched, so that new metadata requests and file-lock acquisitions are gated immediately -- even before the work function begins running. All non-IDLE phases block callers on blocked_wq, preventing races with session teardown. The drain phase flushes mdlog state, dirty caps, and pending cap releases for a bounded interval. State that still cannot make progress within that interval is discarded during teardown, which is the point of the reset: break the stalemate and allow fresh sessions to rebuild clean state. The session teardown follows the established check_new_map() forced-close pattern: unregister sessions under mdsc->mutex, then clean up caps and requests under s->s_mutex. Reconnect is not attempted because the MDS only accepts reconnects during its own RECONNECT phase after restart, not from an active client. Blocked callers are released when reset completes and observe the final result via -EAGAIN (reset failed) or 0 (success). Internal work-function errors such as -ENOMEM are not propagated to unrelated callers like open() or flock(); the detailed error remains in debugfs and tracepoints. The work function checks st->shutdown before each phase transition (DRAINING, TEARDOWN) so that a concurrent ceph_mdsc_destroy() is not overwritten. 
If destroy already took ownership, the work function releases session references and returns without touching the state. The timeout calculation for blocked-request waiters uses max_t() to prevent jiffies underflow when the deadline has already passed. The close-grace sleep before teardown is a best-effort nudge to let queued REQUEST_CLOSE messages egress; it is not a correctness requirement since the MDS still has session_autoclose as a fallback. The destroy path marks reset as failed and wakes blocked waiters before cancel_work_sync() so unmount does not stall. Signed-off-by: Alex Markuze <amarkuze@redhat.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
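The phase progression and gating described above can be sketched as a tiny state machine. This is an illustration under stated assumptions, not the ceph implementation: the enum, `struct demo_reset_state` and the three helpers are hypothetical names, locking and the wait queue are omitted, and the shutdown check is applied before every transition for simplicity, whereas the message names it for the DRAINING and TEARDOWN transitions.

```c
#include <stdbool.h>

enum demo_reset_phase {
	DEMO_RESET_IDLE,
	DEMO_RESET_QUIESCING,
	DEMO_RESET_DRAINING,
	DEMO_RESET_TEARDOWN,
};

struct demo_reset_state {
	enum demo_reset_phase phase;
	bool shutdown;	/* set once the destroy path has taken ownership */
};

/* Runs synchronously in the requester, before any worker is dispatched. */
static bool demo_schedule_reset(struct demo_reset_state *st)
{
	if (st->shutdown || st->phase != DEMO_RESET_IDLE)
		return false;
	st->phase = DEMO_RESET_QUIESCING;	/* callers are gated from here on */
	return true;
}

/* New metadata work must wait in every non-idle phase. */
static bool demo_must_block(const struct demo_reset_state *st)
{
	return st->phase != DEMO_RESET_IDLE;
}

/* Worker step: move to the next phase unless destroy owns the state. */
static bool demo_advance(struct demo_reset_state *st)
{
	if (st->shutdown)
		return false;	/* leave the state untouched */

	switch (st->phase) {
	case DEMO_RESET_QUIESCING:
		st->phase = DEMO_RESET_DRAINING;
		return true;
	case DEMO_RESET_DRAINING:
		st->phase = DEMO_RESET_TEARDOWN;
		return true;
	case DEMO_RESET_TEARDOWN:
		st->phase = DEMO_RESET_IDLE;
		return true;
	default:
		return false;
	}
}
```

Setting the QUIESCING phase in the scheduling call rather than in the worker is what closes the window in which new requests could slip in between the request and the start of the work function.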
12 days	ceph: add diagnostic timeout loop to wait_caps_flush()	Alex Markuze
	Convert wait_caps_flush() from a silent indefinite wait into a diagnostic wait loop that periodically dumps pending cap flush state. The underlying wait semantics remain intact: callers still wait until the requested cap flushes complete. The difference is that long stalls now produce actionable diagnostics instead of looking like a silent hang. CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES limits the number of entries emitted per diagnostic dump, and CEPH_CAP_FLUSH_MAX_DUMP_ITERS limits the number of timed diagnostic dumps before the wait continues silently. When more entries exist than the per-dump limit, a truncation count is reported. When the dump iteration limit is reached, a final suppression message is emitted so the transition to silence is explicit. The diagnostic dump collects flush entry data under cap_dirty_lock into a bounded on-stack array, then prints after releasing the lock. This avoids holding the spinlock across printk calls. A null cf->ci on the global flush list indicates a bug since all cap_flush entries are initialized with a valid ci before being added. Signal this with WARN_ON_ONCE while still printing enough context for debugging. READ_ONCE is used for the i_last_cap_flush_ack field, which is read outside the inode lock domain. Flush tids are monotonically increasing and acks are processed in order under i_ceph_lock, so the latest ack tid is always the most recently written value. Add a ci pointer to struct ceph_cap_flush so that the diagnostic dump can identify which inode each pending flush belongs to. The new i_last_cap_flush_ack field tracks the latest acknowledged flush tid per inode for diagnostic correlation. This improves reset-drain observability and is also useful for existing sync and writeback troubleshooting paths. Signed-off-by: Alex Markuze <amarkuze@redhat.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
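The snapshot-then-print approach can be sketched without any kernel dependencies. In the sketch below the limit `DEMO_MAX_DUMP_ENTRIES`, `struct demo_flush_entry` and `demo_snapshot_flushes()` are hypothetical stand-ins, a plain array replaces the global flush list, and the comments mark where the spinlock would be taken and dropped.

```c
#include <stddef.h>

/* Hypothetical per-dump limit, standing in for CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES. */
#define DEMO_MAX_DUMP_ENTRIES 4

struct demo_flush_entry {
	unsigned long long tid;	/* flush tid of a pending entry */
};

/*
 * Copy at most DEMO_MAX_DUMP_ENTRIES pending entries into a caller-provided
 * (for example on-stack) array and report how many were left out.
 */
static size_t demo_snapshot_flushes(const struct demo_flush_entry *pending,
				    size_t nr_pending,
				    struct demo_flush_entry out[DEMO_MAX_DUMP_ENTRIES],
				    size_t *truncated)
{
	size_t i, copied = 0;

	/* The lock protecting the list would be taken here. */
	for (i = 0; i < nr_pending && copied < DEMO_MAX_DUMP_ENTRIES; i++)
		out[copied++] = pending[i];
	*truncated = nr_pending - copied;
	/* The lock would be released here; printing happens afterwards. */

	return copied;
}
```

The caller would then print the `copied` entries and, if `truncated` is non-zero, a single line reporting how many more were pending, all without holding the lock.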