linux.git - Linux kernel source tree

Age	Commit message (Collapse)	Author
13 days	ksmbd: validate SMB2 lease create contexts	Namjae Jeon
	Validate SMB2 lease context lengths, requested lease state bits, and v2 flags before using the context. Return errors via ERR_PTR so CREATE can distinguish a missing lease context from a malformed one. Also ignore lease v2 contexts for SMB 2.1, where they are not valid. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
13 days	smb/server: fix debug log endianness in smb2_cancel()	ChenXiaoSong
	Convert to CPU byte order to avoid incorrect debug log on big-endian architectures. Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
13 days	ksmbd: track the connection owning a byte-range lock	Namjae Jeon
	SMB2_LOCK adds each granted byte-range lock to both the file lock list and the lock list of the connection which handled the request. The final close and durable handle paths, however, remove the connection list entry while holding fp->conn->llist_lock. With SMB3 multichannel, the connection handling the LOCK request can be different from the connection which opened the file. The entry can therefore be removed under a different spinlock from the one protecting the list it belongs to. A concurrent traversal can then access freed struct ksmbd_lock and struct file_lock objects. Record the connection owning each lock's clist entry and hold a reference to it while the entry is linked. Use that connection and its llist_lock for unlock, rollback, close, and durable preserve. Durable reconnect assigns the new connection as the owner when publishing the locks again. Fixes: f5a544e3bab7 ("ksmbd: add support for SMB3 multichannel") Cc: stable@vger.kernel.org Reported-by: Musaab Khan <musaab.khan@protonmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
13 days	net: ethernet: ti: icssg: guard PA stat lookups	Philippe Schenker
	icssg_ndo_get_stats64() unconditionally calls emac_get_stat_by_name() with FW PA stat names regardless of whether the PA stats block is present on the hardware. emac_get_stat_by_name() already guards the PA stats lookup with `if (emac->prueth->pa_stats)`; when that pointer is NULL the lookup falls through to netdev_err() and returns -EINVAL. Because ndo_get_stats64 is polled regularly by the networking stack this produces thousands of log entries of the form: icssg-prueth icssg1-eth end0: Invalid stats FW_RX_ERROR A secondary consequence is that the int(-EINVAL) return value is implicitly widened to a near-ULLONG_MAX unsigned value when accumulated into the __u64 fields of rtnl_link_stats64, silently corrupting the rx_errors, rx_dropped and tx_dropped counters reported by `ip -s link`. Every other PA-aware code path in the driver is already guarded with the same `if (emac->prueth->pa_stats)` check. Apply the same guard here. Fixes: 0d15a26b247d ("net: ti: icssg-prueth: Add ICSSG FW Stats") Signed-off-by: Philippe Schenker <philippe.schenker@impulsing.ch> Reviewed-by: Simon Horman <horms@kernel.org> Cc: danishanwar@ti.com Cc: rogerq@kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260618093037.3448858-1-dev@pschenker.ch Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 days	Merge branch 'fix-stale-register-bounds-on-lsm-retval-context-load'	Alexei Starovoitov
	Tristan Madani says: ==================== Fix stale register bounds on LSM retval context load From: Tristan Madani <tristan@talencesecurity.com> check_mem_access() calls __mark_reg_s32_range() to narrow a register to the LSM hook retval range, but the intersection preserves stale bounds from prior instructions. Add mark_reg_unknown() before narrowing (same pattern as the else branch) and a selftest that catches the mismatch. Changes in v3: - Add selftest demonstrating the issue (Eduard Zingerman) - No code change in patch 1 from v2 ==================== Link: https://patch.msgid.link/20260622230123.3695446-1-tristmd@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 days	selftests/bpf: Add test for stale bounds on LSM retval context load	Tristan Madani
	Add a verifier test that catches the stale-bounds issue fixed in the previous patch. The test sets r6 = 0 to create known bounds, then loads the LSM hook return value into r6 from the context. Without the fix, the verifier intersects the retval range with the stale bounds and incorrectly narrows r6 to a single value, pruning the fall-through branch as dead code and missing the div-by-zero. Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Tristan Madani <tristan@talencesecurity.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260622230123.3695446-3-tristmd@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 days	bpf: Reset register bounds before narrowing retval range in check_mem_access()	Tristan Madani
	When the BPF verifier processes a context load of an LSM hook return value, it calls __mark_reg_s32_range() to narrow the register to the hook's valid range. However, __mark_reg_s32_range() intersects the new range with the register's existing bounds using max_t()/min_t() rather than replacing them. If the destination register carries stale bounds from a prior instruction (e.g. BPF_MOV64_IMM), the intersection can produce a range narrower than reality. The verifier then believes it knows the register's exact value, while at runtime the actual hook return value is loaded, creating a verifier/runtime mismatch that can be used to bypass BPF memory safety checks. The else branch already calls mark_reg_unknown() to reset register state before any narrowing. Apply the same reset in the is_retval path so stale bounds are cleared before __mark_reg_s32_range() intersects. Fixes: 5d99e198be27 ("bpf, lsm: Add check for BPF LSM return value") Cc: stable@vger.kernel.org Signed-off-by: Tristan Madani <tristan@talencesecurity.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260622230123.3695446-2-tristmd@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 days	rocker: Fix memory leak in ofdpa_port_fdb()	Ziran Zhang
	In ofdpa_port_fdb(), the hash_del() only unlinks the node from hash table, but does not free it. Fix this by adding kfree(found) after the !found == removing check, where the pointer value is no longer needed. Found by Coccinelle kfree script. Cc: <stable+noautosel@kernel.org> # rocker is a test harness, it's never loaded on production systems Signed-off-by: Ziran Zhang <zhangcoder@yeah.net> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260616013245.7098-1-zhangcoder@yeah.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 days	Merge branch 'bpf-guard-conntrack-opts-error-writes'	Alexei Starovoitov
	Yiyang Chen says: ==================== bpf: Guard conntrack opts error writes The conntrack lookup/allocation kfuncs expose an opts/opts__sz pair. The verifier checks the caller-provided opts__sz range, but the wrappers currently write opts->error after internal errors even when opts__sz is too small to include that field. Patch 1 writes opts->error only when opts__sz includes it, and uses a single helper to fold ERR_PTR returns into the kfunc ABI result while keeping the local nfct result variable in each wrapper. Patch 2 adds a bpf_nf regression check that keeps a guard in opts->error while passing opts__sz covering only netns_id. The regression check follows the existing bpf_nf test shape. Before the fix, the guard is overwritten with -EINVAL even though opts__sz covers only the first four bytes of the options object. After the fix, the kfunc still returns NULL for the invalid size, but the guard remains intact. Validation, rebased and tested on bpf-next master e771677c937d ("Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd"): git diff --check origin/master..HEAD: OK scripts/checkpatch.pl --strict on 1/2 and 2/2: OK make O=/root/ebpf-verifier-bug-detection/kernel-build/bpf-next \ net/netfilter/nf_conntrack_bpf.o: OK Focused QEMU direct-runner against XDP and TC lookup/alloc paths: unpatched bpf-next e771677c937d: guard overwritten with -EINVAL patched v2 007dfd0341cd: guard preserved as 0x12345678 QEMU upstream bpf_nf selftest with CONFIG_NF_CONNTRACK_MARK, CONFIG_NF_CONNTRACK_ZONES, and legacy iptables enabled: ./test_progs -t bpf_nf -vv: OK git am of exported 1/2 and 2/2 on a fresh worktree at base: OK range-diff between branch commits and git-am result: equivalent Changes in v2: - Rebased onto current bpf-next master. - Reworked patch 1 to use bpf_ct_opts_result() for the ERR_PTR-to-NULL conversion and guarded opts->error write, as suggested by Alexei. - Kept the local nfct result variable in each wrapper before returning through bpf_ct_opts_result(). - Added matching Fixes tags to the selftest patch so the regression test can be backported with the fix. v1: https://lore.kernel.org/bpf/cover.1781586477.git.chenyy23@mails.tsinghua.edu.cn/ ==================== Link: https://patch.msgid.link/cover.1781765747.git.chenyy23@mails.tsinghua.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 days	selftests/bpf: Cover small conntrack opts error writes	Yiyang Chen
	Add a conntrack kfunc regression check for opts__sz values that do not cover opts->error. The BPF program initializes opts->error with a guard value, calls the lookup and allocation kfuncs with opts__sz set to sizeof(opts->netns_id), and verifies that the guard is still intact after the kfunc returns NULL. Without the conntrack wrapper guard, the kfunc error path overwrites that guard with -EINVAL even though the verifier checked only the first four bytes of the options object. Fixes: b4c2b9593a1c ("net/netfilter: Add unstable CT lookup helpers for XDP and TC-BPF") Fixes: d7e79c97c00c ("net: netfilter: Add kfuncs to allocate and insert CT") Signed-off-by: Yiyang Chen <chenyy23@mails.tsinghua.edu.cn> Link: https://lore.kernel.org/r/007dfd0341cd84560e4795a2a951cc56d4adff1d.1781765747.git.chenyy23@mails.tsinghua.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 days	bpf: Guard conntrack opts error writes	Yiyang Chen
	The conntrack lookup and allocation kfuncs take an opts pointer together with an opts__sz argument. The verifier checks only the memory range described by opts__sz, but the wrappers unconditionally write opts->error whenever the internal lookup or allocation helper returns an error. For an invalid size smaller than the end of opts->error, that write can land outside the verifier-checked range. Keep returning NULL for invalid arguments, but only report the error through opts->error when the supplied size includes the field. This preserves error reporting for the supported 12-byte and 16-byte layouts, and for other invalid sizes that still include opts->error. Fixes: b4c2b9593a1c ("net/netfilter: Add unstable CT lookup helpers for XDP and TC-BPF") Fixes: d7e79c97c00c ("net: netfilter: Add kfuncs to allocate and insert CT") Signed-off-by: Yiyang Chen <chenyy23@mails.tsinghua.edu.cn> Link: https://lore.kernel.org/r/9535e781fe14449b1d4e9bbc3baa7566a93bf512.1781765747.git.chenyy23@mails.tsinghua.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 days	rtc: mv: add suspend/resume support for wakeup	Xue Lei
	Add PM suspend/resume callbacks to enable/disable IRQ wake for the RTC alarm interrupt. This allows the RTC alarm to wake the system from STR (e.g. via rtcwake -m mem -s N). Without this, the RTC IRQ is masked during suspend by the MPIC's IRQCHIP_MASK_ON_SUSPEND behavior, preventing alarm-based wakeup. Signed-off-by: Xue Lei <Xue.Lei@windriver.com> Link: https://patch.msgid.link/20260611023350.1370881-1-Xue.Lei@windriver.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
13 days	rtc: aspeed: add AST2700 compatible	Tommy Huang
	Add support for matching the RTC controller on ASPEED AST2700 SoCs. The AST2700 RTC controller is compatible with the existing ASPEED RTC driver implementation. Signed-off-by: Tommy Huang <tommy_huang@aspeedtech.com> Link: https://patch.msgid.link/20260601-ast2700-rtc-v1-2-15d4ca46500a@aspeedtech.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
13 days	dt-bindings: rtc: add ASPEED AST2700 compatible	Tommy Huang
	Document the compatible string for the RTC controller found on ASPEED AST2700 SoCs. Signed-off-by: Tommy Huang <tommy_huang@aspeedtech.com> Acked-by: Conor Dooley <conor.dooley@microchip.com> Link: https://patch.msgid.link/20260601-ast2700-rtc-v1-1-15d4ca46500a@aspeedtech.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
13 days	rtc: interface: fix typos in rtc_handle_legacy_irq() documentation	Yahya Saqban
	Fix spelling of 'occurence' to 'occurrence' and 'of' to 'or' in the kernel-doc comment for rtc_handle_legacy_irq(). Signed-off-by: Yahya Saqban <yahyasaqban@gmail.com> Link: https://patch.msgid.link/20260512210235.343070-1-yahyasaqban@gmail.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
13 days	rtc: msc313: fix NULL deref in shared IRQ handler at probe	Stepan Ionichev
	msc313_rtc_probe() calls devm_request_irq() with IRQF_SHARED and &pdev->dev as the cookie, but platform_set_drvdata() is only called later after the clock setup. With a shared IRQ line, another device on the same line can trigger the handler in that window. The handler does dev_get_drvdata() on the cookie, gets NULL, and dereferences priv->rtc_base in interrupt context. Pass priv as the cookie directly so the handler reads it from dev_id without the lookup, removing the dependency on probe order. Fixes: be7d9c9161b9 ("rtc: Add support for the MSTAR MSC313 RTC") Signed-off-by: Stepan Ionichev <sozdayvek@gmail.com> Link: https://patch.msgid.link/20260511032703.48262-1-sozdayvek@gmail.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
13 days	rtc: remove unused pcap driver	Arnd Bergmann
	The platform was removed a few years ago, and the mfd driver is also gone now, so it is impossible to build or use it. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://patch.msgid.link/20260527193927.3523952-1-arnd@kernel.org Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
13 days	rtc: interface: Add rtc_read_next_alarm() to read next expiring timer	Mario Limonciello
	Add a new function rtc_read_next_alarm() that reads the next expiring alarm from the RTC timerqueue. This is different from rtc_read_alarm(), which only reads the aie_timer. The wakealarm sysfs file programs the rtc->aie_timer, whereas the alarmtimer suspend routine programs its own timer into the RTC timerqueue. Both timers end up in the RTC's timerqueue, and the first expiring timer is what gets armed in the hardware. This new function allows code to query which alarm will actually fire next, regardless of which subsystem programmed it. This is needed by platform code that needs to program secondary timers based on the actual next wakeup time. Link: https://lore.kernel.org/all/87ed50z0le.ffs@tglx Suggested-by: Thomas Gleixner <tglx@linutronix.de> Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://patch.msgid.link/20260521043714.1022930-2-mario.limonciello@amd.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
13 days	blk-cgroup: defer blkcg css_put until blkg is unlinked from queue	Zizhi Wo
	[BUG] Our fuzz testing triggered a blkcg use-after-free issue: BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0 Call Trace: ... blkcg_deactivate_policy+0x244/0x4d0 ioc_rqos_exit+0x44/0xe0 rq_qos_exit+0xba/0x120 __del_gendisk+0x50b/0x800 del_gendisk+0xff/0x190 ... [CAUSE] process1 process2 cgroup_rmdir ... css_killed_work_fn offline_css ... blkcg_destroy_blkgs ... __blkg_release css_put(&blkg->blkcg->css) blkg_free INIT_WORK(xxx, blkg_free_workfn) schedule_work css_put ... blkcg_css_free kfree(blkcg)--------blkcg has been freed!!! ====================================schedule_work blkg_free_workfn __del_gendisk rq_qos_exit ioc_rqos_exit blkcg_deactivate_policy mutex_lock(&q->blkcg_mutex) spin_lock_irq(&q->queue_lock) list_for_each_entry(blkg, xxx) blkcg = blkg->blkcg spin_lock(&blkcg->lock)-------UAF!!! mutex_lock(&q->blkcg_mutex) spin_lock_irq(&q->queue_lock) /* Only then is the blkg removed from the list */ list_del_init(&blkg->q_node) As a result, a blkg can still be reachable through q->blkg_list while its ->blkcg has already been freed. [Fix] Fix this by deferring the blkcg css_put() until after the blkg has been unlinked from q->blkg_list in blkg_free_workfn(). This ensures that the blkcg outlives every blkg still reachable through q->blkg_list, so any iterator holding q->queue_lock is guaranteed to observe a valid blkg->blkcg. While at it, move css_tryget_online() from blkg_create() into blkg_alloc() so that the css reference is owned by the alloc/free pair rather than straddling layers: blkg_alloc() <-> blkg_free() blkg_create() <-> blkg_destroy() Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()") Suggested-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Yu Kuai <yukuai@fygo.io> Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com> Link: https://patch.msgid.link/20260616011746.2451461-1-wozizhi@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
13 days	blk-cgroup: fix UAF in __blkcg_rstat_flush()	Michal Koutný
	When multiple blkgs in the same blkcg are released concurrently, a use-after-free can occur. The race happens when one blkg's __blkcg_rstat_flush() removes another blkg's iostat entries via llist_del_all(). The second blkg sees an empty list and proceeds to free itself while the first is still iterating over its entries. Move the flush from __blkg_release() (RCU callback) to blkg_release() (before call_rcu). This ensures the RCU grace period waits for any concurrent flush's rcu_read_lock() section to complete before freeing. Cc: stable@vger.kernel.org Cc: Jay Shin <jaeshin@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Fixes: 20cb1c2fb756 ("blk-cgroup: Flush stats before releasing blkcg_gq") Reported-by: coregee2000@gmail.com Closes: https://lore.kernel.org/linux-block/CAHPqNmwT9oRpem3J3erS_W0uSQND47LGGSBsNxP8E6uSUish1w@mail.gmail.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Jose Fernandez (Anthropic) <jose.fernandez@linux.dev> Link: https://patch.msgid.link/20260205155425.342084-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
13 days	block, bfq: protect async queue reset with blkcg locks	Cen Zhang
	Writing 0 to BFQ's low_latency attribute ends weight raising for active, idle and async queues. The async cgroup path walks q->blkg_list, converts each blkg to BFQ policy data and then reads bfqg->async_bfqq and bfqg->async_idle_bfqq. That walk was protected only by bfqd->lock. blkcg release work is serialized by q->blkcg_mutex and q->queue_lock instead, and blkg_free_workfn() can call BFQ's pd_free_fn before it removes blkg->q_node from q->blkg_list. A low_latency reset can therefore still find the blkg on the queue list after the BFQ policy data has been freed. The buggy scenario involves two paths, with each column showing the order within that path: BFQ low_latency reset: blkcg blkg release work: 1. bfq_low_latency_store() 1. blkg_free_workfn() takes calls bfq_end_wr(). q->blkcg_mutex. 2. bfq_end_wr_async() walks 2. BFQ pd_free_fn drops the q->blkg_list. final bfq_group reference. 3. blkg_to_bfqg() returns 3. blkg->q_node remains on the stale policy data. q->blkg_list until list_del_init(). 4. bfq_end_wr_async_queues() reads async queue fields. Fix this by taking q->blkcg_mutex and q->queue_lock around the q->blkg_list walk, then taking bfqd->lock before touching BFQ async queues. The mutex serializes against policy-data free and queue_lock stabilizes the list. Move the async reset out of bfq_end_wr()'s existing bfqd->lock critical section so the lock order matches blkcg policy callbacks. Validation reproduced this kernel report: BUG: KASAN: slab-use-after-free in bfq_end_wr_async_queues+0x246/0x340 Call Trace: <TASK> dump_stack_lvl+0x66/0xa0 print_report+0xce/0x630 ? bfq_end_wr_async_queues+0x246/0x340 ? srso_alias_return_thunk+0x5/0xfbef5 ? __virt_addr_valid+0x20d/0x410 ? bfq_end_wr_async_queues+0x246/0x340 kasan_report+0xe0/0x110 ? bfq_end_wr_async_queues+0x246/0x340 bfq_end_wr_async_queues+0x246/0x340 bfq_end_wr_async+0xba/0x180 bfq_low_latency_store+0x4e5/0x690 ? 0xffffffffc02150da ? __pfx_bfq_low_latency_store+0x10/0x10 ? __pfx_bfq_low_latency_store+0x10/0x10 elv_attr_store+0xc4/0x110 kernfs_fop_write_iter+0x2f5/0x4a0 vfs_write+0x604/0x11f0 ? __pfx_locks_remove_posix+0x10/0x10 ? __pfx_vfs_write+0x10/0x10 ksys_write+0xf9/0x1d0 ? __pfx_ksys_write+0x10/0x10 do_syscall_64+0x115/0x6a0 entry_SYSCALL_64_after_hwframe+0x77/0x7f Allocated by task 544: kasan_save_stack+0x33/0x60 kasan_save_track+0x14/0x30 __kasan_kmalloc+0xaa/0xb0 bfq_pd_alloc+0xc0/0x1b0 blkg_alloc+0x346/0x960 blkg_create+0x8c2/0x10d0 bio_associate_blkg_from_css+0x9f3/0xfa0 bio_associate_blkg+0xd9/0x200 bio_init+0x303/0x640 __blkdev_direct_IO_simple+0x56b/0x8a0 blkdev_direct_IO+0x8e7/0x2580 blkdev_read_iter+0x205/0x400 vfs_read+0x7b0/0xda0 ksys_read+0xf9/0x1d0 do_syscall_64+0x115/0x6a0 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 465: kasan_save_stack+0x33/0x60 kasan_save_track+0x14/0x30 kasan_save_free_info+0x3b/0x60 __kasan_slab_free+0x5f/0x80 kfree+0x307/0x580 blkg_free_workfn+0xef/0x460 process_one_work+0x8d0/0x1870 worker_thread+0x575/0xf80 kthread+0x2e7/0x3c0 ret_from_fork+0x576/0x810 ret_from_fork_asm+0x1a/0x30 Fixes: 44e44a1b329e ("block, bfq: improve responsiveness") Assisted-by: Codex:gpt-5.5 Signed-off-by: Cen Zhang <zzzccc427@gmail.com> Reviewed-by: Tao Cui <cuitao@kylinos.cn> Link: https://patch.msgid.link/20260621135930.2657810-1-zzzccc427@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
13 days	nbd: don't warn when reclassifying a busy socket lock	Deepanshu Kartikey
	nbd_reclassify_socket() warns via WARN_ON_ONCE() if the socket lock is held at the point of reclassification. That assertion was copied from nvme-tcp, where the socket is created internally by the kernel (sock_create_kern()) and is never visible to user space, so the lock is guaranteed to be free. NBD is different: the socket is looked up from a user-supplied fd in nbd_get_socket(), and user space retains that fd. A concurrent syscall on the same socket (or softirq processing taking bh_lock_sock() on a connected TCP socket) can legitimately hold the lock at the instant NBD reclassifies it. sock_allow_reclassification() then returns false and the WARN_ON_ONCE() fires, which turns into a crash under panic_on_warn. This is reachable by simply racing NBD_CMD_CONNECT against socket activity on the same fd, as reported by syzbot. Hitting a held lock here is expected for an externally owned socket and is not a kernel bug, so skip reclassification silently instead of warning. Reclassification is a lockdep-only annotation, so skipping it in the rare racing case is harmless. Reported-by: syzbot+6b85d1e39a5b8ed9a954@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=6b85d1e39a5b8ed9a954 Fixes: d532cddb6c60 ("nbd: Reclassify sockets to avoid lockdep circular dependency") Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260621235255.66015-1-kartikey406@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
13 days	block: fix incorrect error injection static key decrement	Christoph Hellwig
	Only decrement the static key when we had items and thus it was incremented before. Fixes: e8dcf2d142bd ("block: add configurable error injection") Reported-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260622160752.1552516-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
13 days	e1000e: Reconfigure PLL clock gate timeout and re-enable K1 on Meteor Lake	Dima Ruinskiy
	Commit 3c7bf5af21960 ("e1000e: Introduce private flag to disable K1") disabled K1 by default on Meteor Lake and newer systems due to packet loss observed on various platforms. However, disabling K1 caused an increase in power consumption. To mitigate this, reconfigure the PLL clock gate value so that K1 can remain enabled without incurring the additional power consumption. Re-enable K1 by default, but keep the private flag to support disabling it via ethtool. Additionally, introduce a DMI quirk table, so that K1 may be disabled by default on known problematic systems. Currently, this includes the Dell Pro 16 Plus, where the issue has been reported to persist despite the changes to the PLL lock timeout. Link: https://bugzilla.kernel.org/show_bug.cgi?id=220954 Link: https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20250623/048860.html Link: https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20260330/054059.html Signed-off-by: Dima Ruinskiy <dima.ruinskiy@intel.com> Co-developed-by: Vitaly Lifshits <vitaly.lifshits@intel.com> Signed-off-by: Vitaly Lifshits <vitaly.lifshits@intel.com> Fixes: 3c7bf5af21960 ("e1000e: Introduce private flag to disable K1") Tested-by: Moriya Kadosh <moriyax.kadosh@intel.com> Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
13 days	i40e: Fix i40e_debug() to use struct i40e_hw argument	Mohamed Khalfella
	i40e_debug() macro takes struct i40e_hw *h as first argument. But the macro body uses hw instead of h. This has been working so far because hw happens to be the name of the variable in the context where the macro is expanded. Fix the macro to use the passed argument. Fixes: 5dfd37c37a44 ("i40e: Split i40e_osdep.h") Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Alexander Nowlin <alexander.nowlin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
13 days	ice: dpll: fix memory leak in ice_dpll_init_info error paths	ZhaoJinming
	Several error return paths in ice_dpll_init_info() directly return without freeing previously allocated resources, causing memory leaks: - When de->input_prio allocation fails, d->inputs is leaked - When dp->input_prio allocation fails, d->inputs and de->input_prio are leaked - When ice_get_cgu_rclk_pin_info() fails, all previously allocated inputs/outputs/input_prio are leaked - When ice_dpll_init_pins_info(RCLK_INPUT) fails, same resources are leaked Fix this by jumping to the deinit_info label which properly calls ice_dpll_deinit_info() to free all allocated resources. Fixes: d7999f5ea64b ("ice: implement dpll interface to control cgu") Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
13 days	ice: dpll: set pointers to NULL after kfree in ice_dpll_deinit_info	ZhaoJinming
	ice_dpll_deinit_info() calls kfree() on several pf->dplls fields (inputs, outputs, eec.input_prio, pps.input_prio) but does not set the pointers to NULL afterward. This leaves dangling pointers in the pf->dplls structure. While not currently exploitable through existing code paths, this is unsafe because: 1. If ice_dpll_init_info() is called again after a deinit (e.g. during driver recovery), and a subsequent allocation within init fails, the error path will jump to deinit_info and call ice_dpll_deinit_info() again. Since some pointers still hold the old freed addresses, this would result in a double-free. 2. Any future code that checks these pointers before use or after free would be unprotected against use-after-free. Follow the common kernel convention of setting pointers to NULL after kfree() so that: - kfree(NULL) is a safe no-op, preventing double-free - NULL checks on these pointers become meaningful This is a preparatory fix for a subsequent patch that routes additional error paths in ice_dpll_init_info() to the deinit_info label. Fixes: d7999f5ea64b ("ice: implement dpll interface to control cgu") Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
13 days	rtc: isl1208: Balance enable_irq_wake() with disable_irq_wake() on cleanup	John Madieu
	isl1208_setup_irq() calls enable_irq_wake() after a successful IRQ request, but the driver has no remove path that balances it. The driver is devm-only, so on unbind devm releases the IRQ - but enable_irq_wake() is not undone by IRQ release, so the wake count for that IRQ stays incremented. Each rebind therefore leaks one wake reference; the leak doubles for the chip variant that has a separate evdet IRQ, since isl1208_setup_irq() is then called twice during probe. Register a devm action that calls disable_irq_wake() per IRQ. While at it, check enable_irq_wake()'s return value: on failure, propagate the error rather than silently registering a disable action for an IRQ whose wake state was never enabled. Fixes: 9ece7cd833a3 ("rtc: isl1208: Add "evdet" interrupt source for isl1219") Signed-off-by: John Madieu <john.madieu.xa@bp.renesas.com> Link: https://patch.msgid.link/20260425154959.2796261-3-john.madieu.xa@bp.renesas.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
13 days	io_uring/memmap: bound io_pin_pages() by page array byte size	Deepanshu Kartikey
	io_pin_pages() checks that nr_pages does not exceed INT_MAX, then allocates a struct page * array of nr_pages entries. kvmalloc() limits allocations to INT_MAX bytes, but the check counts pages, not bytes. On 64-bit each entry is 8 bytes, so the array hits the INT_MAX byte limit at INT_MAX / sizeof(struct page ) pages, well before the page count check fires. Since commit b4e41050b212 ("io_uring/rsrc: raise registered buffer 1GB limit") raised the per-buffer cap to 1TB, a buffer near that cap maps ~2^28 pages, making the array allocation exceed INT_MAX bytes. This passes the page count check, reaches kvmalloc(), and triggers the WARN_ON_ONCE() for oversized allocations in __kvmalloc_node_noprof(). Check nr_pages against INT_MAX / sizeof(struct page ) so the buffer is rejected with -EOVERFLOW before the allocation is attempted. Reported-by: syzbot+f99b00a963915b6b52c6@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=f99b00a963915b6b52c6 Fixes: b4e41050b212 ("io_uring/rsrc: raise registered buffer 1GB limit") Tested-by: syzbot+f99b00a963915b6b52c6@syzkaller.appspotmail.com Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://patch.msgid.link/20260621012933.50571-1-kartikey406@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
13 days	ice: call netif_keep_dst() once when entering switchdev mode	Marcin Szycik
	netif_keep_dst() only needs to be called once for the uplink VSI, not once for each port representor. Move it from ice_eswitch_setup_repr() to ice_eswitch_enable_switchdev(). Fixes: defd52455aee ("ice: do Tx through PF netdev in slow-path") Signed-off-by: Marcin Szycik <marcin.szycik@intel.com> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Patryk Holda <patryk.holda@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
13 days	ice: fix ice_init_link() error return preventing probe	Paul Greenwalt
	ice_init_link() can return an error status from ice_update_link_info() or ice_init_phy_user_cfg(), causing probe to fail. An incorrect NVM update procedure can result in link/PHY errors, and the recommended resolution is to update the NVM using the correct procedure. If the driver fails probe due to link errors, the user cannot update the NVM to recover. The link/PHY errors logged are non-fatal: they are already annotated as 'not a fatal error if this fails'. Since none of the errors inside ice_init_link() should prevent probe from completing, convert it to void and remove the error check in the caller. All failures are already logged; callers have no meaningful recovery path for link init errors. Fixes: 5b246e533d01 ("ice: split probe into smaller functions") Cc: stable@vger.kernel.org Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Alexander Nowlin <alexander.nowlin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
13 days	ice: fix AQ error code comparison in ice_set_pauseparam()	Lukasz Czapnik
	Fix unreachable code: the conditionals in ice_set_pauseparam() used the bitwise-AND operator suggesting aq_failures is a bitmap, but it is actually an enum, making the third condition logically unreachable. Replace the if-else ladder with a switch statement. Also move the aq_failures initialization to the variable declaration and remove the redundant zeroing from ice_set_fc(). Fixes: fcea6f3da546 ("ice: Add stats and ethtool support") Signed-off-by: Lukasz Czapnik <lukasz.czapnik@intel.com> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
13 days	ice: fix FDIR CTRL VSI resource leak in ice_reset_all_vfs()	Dawid Osuchowski
	Resetting all VFs causes resource leak on VFs with FDIR filters enabled as CTRL VSIs are only invalidated and not freed. Fix by using ice_vf_ctrl_vsi_release() instead of ice_vf_ctrl_invalidate_vsi() which aligns behavior with the ice_reset_vf() function. Reproduction: echo 1 > /sys/class/net/$pf/device/sriov_numvfs ethtool -N $vf flow-type ether proto 0x9000 action 0 echo 1 > /sys/class/net/$pf/device/reset Fixes: da62c5ff9dcd ("ice: Add support for per VF ctrl VSI enabling") Signed-off-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
13 days	ceph: add manual reset debugfs control and tracepoints	Alex Markuze
	Add the debugfs and trace plumbing used to trigger and observe manual client reset. The reset interface exposes a trigger file for operator-initiated reset and a status file for tracking the most recent run. The tracepoints record scheduling, completion, and blocked caller behavior so reset progress can be diagnosed from the client side. debugfs layout under /sys/kernel/debug/ceph/<client>/reset/: trigger - write to initiate a manual reset status - read to see the most recent reset result The reset directory is cleaned up via debugfs_remove_recursive() on the parent, so individual file dentries are not stored. Tracepoints: ceph_client_reset_schedule - reset queued ceph_client_reset_complete - reset finished (success or failure) ceph_client_reset_blocked - caller blocked waiting for reset ceph_client_reset_unblocked - caller unblocked after reset All tracepoints use a null-safe access for monc.auth->global_id to guard against early-init or late-teardown edge cases. Signed-off-by: Alex Markuze <amarkuze@redhat.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
13 days	ceph: add client reset state machine and session teardown	Alex Markuze
	Add the client-side reset state machine, request gating, and manual session teardown implementation. Manual reset is an operator-triggered escape hatch for client/MDS stalemates in which caps, locks, or unsafe metadata state stop making forward progress. The reset blocks new metadata work, attempts a bounded best-effort drain of dirty client state while sessions are still alive, and finally asks the MDS to close sessions before tearing local session state down directly. The reset state machine tracks four phases: IDLE -> QUIESCING -> DRAINING -> TEARDOWN -> IDLE. QUIESCING is set synchronously by schedule_reset() before the workqueue item is dispatched, so that new metadata requests and file-lock acquisitions are gated immediately -- even before the work function begins running. All non-IDLE phases block callers on blocked_wq, preventing races with session teardown. The drain phase flushes mdlog state, dirty caps, and pending cap releases for a bounded interval. State that still cannot make progress within that interval is discarded during teardown, which is the point of the reset: break the stalemate and allow fresh sessions to rebuild clean state. The session teardown follows the established check_new_map() forced-close pattern: unregister sessions under mdsc->mutex, then clean up caps and requests under s->s_mutex. Reconnect is not attempted because the MDS only accepts reconnects during its own RECONNECT phase after restart, not from an active client. Blocked callers are released when reset completes and observe the final result via -EAGAIN (reset failed) or 0 (success). Internal work-function errors such as -ENOMEM are not propagated to unrelated callers like open() or flock(); the detailed error remains in debugfs and tracepoints. The work function checks st->shutdown before each phase transition (DRAINING, TEARDOWN) so that a concurrent ceph_mdsc_destroy() is not overwritten. If destroy already took ownership, the work function releases session references and returns without touching the state. The timeout calculation for blocked-request waiters uses max_t() to prevent jiffies underflow when the deadline has already passed. The close-grace sleep before teardown is a best-effort nudge to let queued REQUEST_CLOSE messages egress; it is not a correctness requirement since the MDS still has session_autoclose as a fallback. The destroy path marks reset as failed and wakes blocked waiters before cancel_work_sync() so unmount does not stall. Signed-off-by: Alex Markuze <amarkuze@redhat.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
13 days	ceph: add diagnostic timeout loop to wait_caps_flush()	Alex Markuze
	Convert wait_caps_flush() from a silent indefinite wait into a diagnostic wait loop that periodically dumps pending cap flush state. The underlying wait semantics remain intact: callers still wait until the requested cap flushes complete. The difference is that long stalls now produce actionable diagnostics instead of looking like a silent hang. CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES limits the number of entries emitted per diagnostic dump, and CEPH_CAP_FLUSH_MAX_DUMP_ITERS limits the number of timed diagnostic dumps before the wait continues silently. When more entries exist than the per-dump limit, a truncation count is reported. When the dump iteration limit is reached, a final suppression message is emitted so the transition to silence is explicit. The diagnostic dump collects flush entry data under cap_dirty_lock into a bounded on-stack array, then prints after releasing the lock. This avoids holding the spinlock across printk calls. A null cf->ci on the global flush list indicates a bug since all cap_flush entries are initialized with a valid ci before being added. Signal this with WARN_ON_ONCE while still printing enough context for debugging. READ_ONCE is used for the i_last_cap_flush_ack field, which is read outside the inode lock domain. Flush tids are monotonically increasing and acks are processed in order under i_ceph_lock, so the latest ack tid is always the most recently written value. Add a ci pointer to struct ceph_cap_flush so that the diagnostic dump can identify which inode each pending flush belongs to. The new i_last_cap_flush_ack field tracks the latest acknowledged flush tid per inode for diagnostic correlation. This improves reset-drain observability and is also useful for existing sync and writeback troubleshooting paths. Signed-off-by: Alex Markuze <amarkuze@redhat.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
13 days	ceph: harden send_mds_reconnect and handle active-MDS peer reset	Alex Markuze
	Change send_mds_reconnect() to return an error code so callers can detect and report reconnect failures instead of silently ignoring them. Add early bailout checks for sessions that are already closed, rejected, or unregistered, which avoids sending reconnect messages for sessions that can no longer be recovered. The early -ESTALE and -ENOENT bailouts use a separate fail_return label that skips the pr_err_client diagnostic, since these codes indicate expected concurrent-teardown races rather than genuine reconnect build failures. Move the "reconnect start" log after the early-bailout checks so it only appears for sessions that actually proceed with reconnect. Save the prior session state before transitioning to RECONNECTING, and restore it in the failure path. Without this, a transient build or encoding failure (-ENOMEM, -ENOSPC) strands the session in RECONNECTING indefinitely because check_new_map() only retries sessions in RESTARTING state. Rewrite mds_peer_reset() to handle the case where the MDS is past its RECONNECT phase (i.e. active). An active MDS rejects CLIENT_RECONNECT messages because it only accepts them during its own RECONNECT window after restart. Previously, the client would send a doomed reconnect that the MDS would reject or ignore. Now, the client tears the session down locally and lets new requests re-open a fresh session, which is the correct recovery for this scenario. The RECONNECTING state is handled on the same teardown path, since the MDS will reject reconnect attempts from an active client regardless of the session's local state. Add explicit cases for CLOSED and REJECTED session states in mds_peer_reset() since these are terminal states where a connection drop is expected behavior. The session teardown path in mds_peer_reset() follows the established drop-and-reacquire locking pattern from check_new_map(): take mdsc->mutex for session unregistration, release it, then take s->s_mutex separately for cleanup. This avoids introducing a new simultaneous lock nesting pattern. Log reconnect failures from check_new_map() and mds_peer_reset() at pr_warn level rather than pr_err, since return codes like -ESTALE (closed/rejected session) and -ENOENT (unregistered session) are expected during concurrent teardown. Log dropped messages for unregistered sessions via doutc() (dynamic debug) rather than pr_info, as post-reset message arrival is routine and does not warrant unconditional logging. Signed-off-by: Alex Markuze <amarkuze@redhat.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
13 days	ceph: use proper endian conversion for flock_len in reconnect	Alex Markuze
	Replace the __force __le32 cast with cpu_to_le32() for the flock_len field in reconnect_caps_cb(). The old code used a type-system bypass to silence sparse; the new form uses the proper endian conversion macro. Also switch from a raw bitmask test against i_ceph_flags to test_bit() on the named CEPH_I_ERROR_FILELOCK_BIT, which is the correct accessor for the unsigned long flags field after the bit-position conversion. Remove the now-unused CEPH_I_ERROR_FILELOCK mask define since all callers use the _BIT form with test_bit/set_bit/clear_bit. Signed-off-by: Alex Markuze <amarkuze@redhat.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
13 days	ceph: convert inode flags to named bit positions and atomic bitops	Alex Markuze
	Define named bit-position constants for all CEPH_I_* inode flags and derive the bitmask values from them. This gives every flag a named _BIT constant usable with the test_bit/set_bit/clear_bit family. The intentionally unused bit position 1 is documented inline. Convert all flag modifications to use atomic bitops (set_bit, clear_bit, test_and_clear_bit). The previous code mixed lockless atomic ops on some flags (ERROR_WRITE, ODIRECT) with non-atomic read-modify-write (\|= / &= ~) on other flags sharing the same unsigned long. A concurrent non-atomic RMW can clobber an adjacent lockless atomic update -- for example, a lockless clear_bit(ERROR_WRITE) could be silently resurrected by a concurrent ci->i_ceph_flags \|= CEPH_I_FLUSH under the spinlock. Using atomic bitops for all modifications eliminates this class of race entirely. Flags whose only users are now the _BIT form (ERROR_WRITE, ASYNC_CHECK_CAPS) have their old mask defines removed to document that callers must use the _BIT constant with the set_bit/test_bit family. ERROR_FILELOCK and SHUTDOWN retain their mask defines because they are still used via bitmask tests in lockless readers (ceph_inode_is_shutdown, reconnect_caps_cb). The direct assignment in ceph_finish_async_create() is converted from i_ceph_flags = CEPH_I_ASYNC_CREATE to set_bit(). This inode is I_NEW at this point -- still invisible to other threads and guaranteed to have zero flags from alloc_inode -- so either form is safe, but set_bit() keeps the conversion uniform. Signed-off-by: Alex Markuze <amarkuze@redhat.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
13 days	sched_ext: Move shared helpers from ext.c into internal.h and cid.h	Tejun Heo
	idle.c and cid.c are included into build_policy.c together with ext.c and use helpers that ext.c defines. Because the helpers live in ext.c, the two files can not parse as standalone units and clangd reports errors in them. Move the helpers to the headers they belong to. The op-dispatch macros and helpers plus scx_parent() to internal.h, and scx_cpu_arg()/scx_cpu_ret() to cid.h. No functional change. idle.c and cid.c now parse clean standalone. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
13 days	sched_ext: Make kernel/sched/ext/ sources self-contained for clangd	Tejun Heo
	The sources under kernel/sched/ext/ build as a single translation unit: build_policy.c includes the source files and headers. An LSP/clangd editor parses each as a standalone unit, sees no types, and reports a flood of errors. Give each header its dependencies and include guard, and have each source include the headers it uses. ext.c, arena.c and the ext headers now parse clean standalone. idle.c and cid.c still reference a few macros and helpers defined in ext.c. The next patch moves those to shared headers. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
13 days	selftests/bpf: Cover half-slot cleanup of pointer spills	Nuoqi Gui
	Add a verifier regression test for a pointer spill whose high half is cleaned dead while the low half remains live. Force checkpoint creation with BPF_F_TEST_STATE_FREQ and assert the verifier log reaches the checkpoint and the subsequent 32-bit fill before rejecting the partial fill from a non-scalar spill. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Nuoqi Gui <gnq25@mails.tsinghua.edu.cn> Link: https://lore.kernel.org/r/20260617-f01-06-half-slot-pointer-spill-v2-2-42b9cdc3cf64@mails.tsinghua.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 days	bpf: Preserve pointer spill metadata during half-slot cleanup	Nuoqi Gui
	__clean_func_state() cleans dead stack slots in 4-byte halves. When the high half of a STACK_SPILL slot is dead and the low half remains live, cleanup converts the live low half to STACK_MISC or STACK_ZERO and clears the saved spilled_ptr metadata. That conversion is safe only for scalar spills. For a pointer spill, this metadata clear lets a later 32-bit fill from the still-live half avoid the normal non-scalar register-fill check and be treated as an ordinary scalar stack read. Leave non-scalar spill slots intact in this half-live shape. This is conservative for pruning and preserves the existing check_stack_read_fixed_off() rejection path for partial fills from pointer spills. Fixes: be23266b4a08 ("bpf: 4-byte precise clean_verifier_state") Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Nuoqi Gui <gnq25@mails.tsinghua.edu.cn> Link: https://lore.kernel.org/r/20260617-f01-06-half-slot-pointer-spill-v2-1-42b9cdc3cf64@mails.tsinghua.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 days	genirq/msi: Correct CONFIG_PCI_MSI_ARCH_FALLBACKS macro name in comment	Ethan Nelson-Moore
	A comment in kernel/irq/msi.c incorrectly refers to CONFIG_PCI_MSI_ARCH_FALLBACK instead of CONFIG_PCI_MSI_ARCH_FALLBACKS. Correct it. Discovered while searching for CONFIG_* symbols referenced in code but not defined in any Kconfig file. Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260613213544.90613-1-enelsonmoore@gmail.com
13 days	PCI: endpoint: pci-epf-vntb: Guard configfs writes after EPC attach	Koichiro Den
	db_count controls how many doorbell slots are allocated and exposed. It is also used by the doorbell mask helpers. After an EPC has been attached, changing it from configfs can leave runtime paths using a different count than the one used to set up the doorbell resources. Reject db_count writes after EPC attach, and reject values outside MIN_DB_COUNT..MAX_DB_COUNT before attach. Now that MIN_DB_COUNT documents the usable doorbell floor, use it in the store path too. While at it, apply the same after-attach guard to the other vNTB configfs knobs. BAR choices, spad_count, memory-window counts and sizes, and the virtual PCI IDs are also consumed during bind, so changing them later at runtime is meaningless and unsafe. Return -EOPNOTSUPP for after-attach writes. The value itself may be valid, but changing it in that state is not supported. Signed-off-by: Koichiro Den <den@valinux.co.jp> Signed-off-by: Manivannan Sadhasivam <mani@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://patch.msgid.link/20260513024923.451765-6-den@valinux.co.jp
13 days	PCI: endpoint: pci-epf-vntb: Reject unusable doorbell counts	Koichiro Den
	pci-epf-vntb reserves slot 0 for link events and keeps slot 1 unused for legacy layout compatibility. A db_count smaller than MIN_DB_COUNT leaves no usable doorbell slot after those reservations. Reject such configurations when configuring interrupts. While at it, move MAX_DB_COUNT next to MIN_DB_COUNT. They are used as a pair in the range check, and keeping them together makes the valid doorbell range easier to read. Signed-off-by: Koichiro Den <den@valinux.co.jp> Signed-off-by: Manivannan Sadhasivam <mani@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://patch.msgid.link/20260513024923.451765-5-den@valinux.co.jp
13 days	PCI: endpoint: pci-epf-vntb: Report 0-based doorbell vector via ntb_db_event()	Koichiro Den
	ntb_db_event() expects the vector number to be relative to the first doorbell vector starting at 0. pci-epf-vntb reserves vector 0 for link events and uses higher vector indices for doorbells. By passing the raw slot index to ntb_db_event(), it effectively assumes that doorbell 0 maps to vector 1. However, because the host uses a legacy slot layout and writes doorbell 0 into the third slot, doorbell 0 ultimately appears as vector 2 from the NTB core perspective. Adjust pci-epf-vntb to: - skip the unused second slot, and - report doorbells as 0-based vectors (DB#0 -> vector 0). This change does not introduce a behavioral difference until .db_vector_count()/.db_vector_mask() are implemented, because without those callbacks NTB clients effectively ignore the vector number. Fixes: e35f56bb0330 ("PCI: endpoint: Support NTB transfer between RC and EP") Signed-off-by: Koichiro Den <den@valinux.co.jp> Signed-off-by: Manivannan Sadhasivam <mani@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://patch.msgid.link/20260513024923.451765-4-den@valinux.co.jp
13 days	PCI: endpoint: pci-epf-vntb: Defer pci_epc_raise_irq() out of atomic context	Koichiro Den
	The NTB .peer_db_set() callback may be invoked from atomic context. pci-epf-vntb currently calls pci_epc_raise_irq() directly, but pci_epc_raise_irq() may sleep (it takes epc->lock). Avoid sleeping in atomic context by coalescing doorbell bits into an atomic64 pending mask and raising MSIs from a work item. Limit the amount of work per run to avoid monopolizing the workqueue under a doorbell storm. Clear stale pending bits before enabling the work item and after disabling it during cleanup. Also mask requested doorbells against the currently valid doorbell mask before queueing work, and iterate the pending u64 with __ffs64() so high doorbell bits are handled correctly. Fixes: e35f56bb0330 ("PCI: endpoint: Support NTB transfer between RC and EP") Signed-off-by: Koichiro Den <den@valinux.co.jp> Signed-off-by: Manivannan Sadhasivam <mani@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://patch.msgid.link/20260513024923.451765-3-den@valinux.co.jp
13 days	PCI: endpoint: pci-epf-vntb: Document legacy MSI doorbell offset	Koichiro Den
	vntb_epf_peer_db_set() raises an MSI interrupt to notify the RC side of a doorbell event. pci_epc_raise_irq(..., PCI_IRQ_MSI, interrupt_num) takes a 1-based MSI interrupt number. The ntb_hw_epf driver reserves MSI #1 for link events, so doorbells would naturally start at MSI #2 (doorbell bit 0 -> MSI #2). However, pci-epf-vntb has historically applied an extra offset and mapped doorbell bit 0 to MSI #3. This matches the legacy behavior of ntb_hw_epf and has been preserved since commit e35f56bb0330 ("PCI: endpoint: Support NTB transfer between RC and EP"). This offset has not surfaced as a functional issue because: - ntb_hw_epf typically allocates enough MSI vectors, so the off-by-one still hits a valid MSI vector, and - ntb_hw_epf does not implement .db_vector_count()/.db_vector_mask(), so client drivers such as ntb_transport effectively ignore the vector number and schedule all QPs. Correcting the MSI number would break interoperability with peers running older kernels. Document the legacy offset to avoid confusion when enabling per-db-vector handling. Signed-off-by: Koichiro Den <den@valinux.co.jp> Signed-off-by: Manivannan Sadhasivam <mani@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://patch.msgid.link/20260513024923.451765-2-den@valinux.co.jp
13 days	PCI: endpoint: pci-epf-ntb: Add check to detect 'db_count' value of 0	Manivannan Sadhasivam
	epf_ntb->db_count value should be within 1 to MAX_DB_COUNT. Current code only checks for the upper bound, while the lower bound is unchecked. This can cause a lot of issues in the driver if the user passes 'db_count' as 0. Add a check for 0 also. While at it, remove the redundant 'db_count' variable from epf_ntb_configure_interrupt(). Fixes: 8b821cf76150 ("PCI: endpoint: Add EP function driver to provide NTB functionality") Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com> Signed-off-by: Manivannan Sadhasivam <mani@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://patch.msgid.link/20260407124421.282766-3-mani@kernel.org