| Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux
Pull Kbuild / Kconfig updates from Nathan Chancellor:
"Kbuild:
- Remove broken module linking exclusion for BTF
- Add documentation around how offset header files work
- Include unstripped vDSO libraries in pacman packages
- Bump minimum version of LLVM for building the kernel to 17.0.1 and
clean up unnecessary workarounds
- Use a context manager in run-clang-tools
- Add dist macro value if present to release tag for RPM packages
- Detect and report truncated buf_printf() output in modpost
- Add __llvm_covfun and __llvm_covmap to section whitelist in modpost
- Support Clang's distributed ThinLTO mode
- Remove architecture specific configurations for AutoFDO and
Propeller to ease individual architecture maintenance
Kconfig:
- Add kconfig-sym-check target to look for dangling Kconfig symbol
references and invalid tristate literal values
- Harden against potential NULL pointer dereference
- Fix typo in Kconfig test comment"
* tag 'kbuild-7.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux: (31 commits)
kconfig: tests: fix typo in comment
kconfig: Remove the architecture specific config for Propeller
kconfig: Remove the architecture specific config for AutoFDO
modpost: Add __llvm_covfun and __llvm_covmap to section_white_list
kconfig: add kconfig-sym-check static checker
kbuild: Remove unnecessary 'T' modifier in cmd_ar_builtin_fixup
kbuild: distributed build support for Clang ThinLTO
kbuild: move vmlinux.a build rule to scripts/Makefile.vmlinux_a
scripts: modpost: detect and report truncated buf_printf() output
kbuild: rpm-pkg: append %{?dist} macro to Release tag
run-clang-tools: run multiprocessing.Pool as context manager
compiler-clang.h: Drop explicit version number from "all" diagnostic macro
compiler-clang.h: Remove __cleanup -Wunused-variable workaround
kbuild: Remove check for broken scoping with clang < 17 in CC_HAS_ASM_GOTO_OUTPUT
x86/entry/vdso32: Remove conditional omission of '.cfi_offset eflags'
x86/module: Revert "Deal with GOT based stack cookie load on Clang < 17"
x86/build: Drop unnecessary '-ffreestanding' addition to KBUILD_CFLAGS
scripts/Makefile.warn: Drop -Wformat handling for clang < 16
riscv: Drop tautological condition from TOOLCHAIN_NEEDS_OLD_ISA_SPEC
riscv: Remove tautological condition from selection of ARCH_SUPPORTS_CFI
...
|
|
Allow the PMT class to read discovery headers from either PCI MMIO or
ACPI-provided entries, depending on the discovery source. The new
source-aware fetch helper caches the canonical discovery header for both
paths, capping PCI MMIO reads to the mapped resource size, while keeping
the mapped PCI discovery table available for users such as crashlog.
Split intel_pmt_populate_entry() into source-specific resolvers:
- pmt_resolve_access_pci(): handles both ACCESS_LOCAL and ACCESS_BARID
for PCI-backed devices and sets entry->pcidev. Same existing
functionality.
- pmt_resolve_access_acpi(): handles only ACCESS_BARID for ACPI-backed
devices, rejecting ACCESS_LOCAL which has no valid semantics without
a physical discovery resource.
This maintains existing PCI behavior and makes no functional changes
for PCI devices.
Assisted-by: GitHub-Copilot:claude-opus-4.7
Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Link: https://patch.msgid.link/4b33b04ffaf0943b67d330f48b5d1dfcb6d1be5d.1781294741.git.david.e.box@linux.intel.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull dcache updates from Al Viro:
- d_alloc_parallel() API change (Neil's with my changes)
- NORCU fixes
- Reorganization and simplification of dentry eviction logic
- Simplifying rcu_read_lock() scopes in fs/dcache.c
- Secondary roots work - getting rid of NFS fake root dentries and
dealing with remaining shrink_dcache_for_umount() and
shrink_dentry_list() races
- making cursors NORCU (surprisingly easy)
* tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (22 commits)
make cursors NORCU
nfs: get rid of fake root dentries
wind ->s_roots via ->d_sib instead of ->d_hash
shrink_dentry_tree(): unify the calls of shrink_dentry_list()
shrinking rcu_read_lock() scope in d_alloc_parallel()
d_walk(): shrink rcu_read_lock() scope
document dentry_kill()
adjust calling conventions of lock_for_kill(), fold __dentry_kill() into dentry_kill()
Document rcu_read_lock() use in select_collect2()
Shift rcu_read_{,un}lock() inside fast_dput()
simplify safety for lock_for_kill() slowpath
fold lock_for_kill() and __dentry_kill() into common helper
fold lock_for_kill() into shrink_kill()
shrink_dentry_list(): start with removing from shrink list
d_prune_aliases(): make sure to skip NORCU aliases
kill d_dispose_if_unused()
make to_shrink_list() return whether it has moved dentry to list
select_collect(): ignore dentries on shrink lists if they have positive refcounts
find_acceptable_alias(): skip NORCU aliases with zero refcount
fix a race between d_find_any_alias() and final dput() of NORCU dentries
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull procfs updates from Christian Brauner:
- Revamp fs/filesystems.c
The file was a mess with a hand-rolled linked list in desperate need
of a cleanup. The filesystems list is now RCU-ified, /proc files can
be marked permanent from outside fs/proc/, and the string emitted
when reading /proc/filesystems is pre-generated and cached instead of
pointer-chasing and printfing entry by entry on every read.
The file is read frequently because libselinux reads it and is linked
into numerous frequently used programs (even ones you would not
suspect, like sed!). Scalability also improves since reference
maintenance on open/close is bypassed.
open+read+close cycle single-threaded (ops/s):
before: 442732
after: 1063462 (+140%)
open+read+close cycle with 20 processes (ops/s):
before: 606177
after: 3300576 (+444%)
A follow-up patch adds missing unlocks in some corner cases and
tidies things up.
- Relax the mount visibility check for subset=pid mounts
When procfs is mounted with subset=pid, all static files become
unavailable and only the dynamic pid information is accessible. In
that case there is no point in imposing the full mount visibility
restrictions on the mounter - everything that can be hidden in procfs
is already inaccessible. These restrictions prevented procfs from
being mounted inside rootless containers since almost all container
implementations overmount parts of procfs to hide certain
directories.
As part of this /proc/self/net is only shown in subset=pid mounts for
CAP_NET_ADMIN, reconfiguring subset=pid is rejected, the
SB_I_USERNS_VISIBLE superblock flag is replaced with an
FS_USERNS_MOUNT_RESTRICTED filesystem flag, fully visible mounts are
recorded in a list, and the mount restrictions are finally
documented.
- Protect ptrace_may_access() with exec_update_lock in procfs
Most uses of ptrace_may_access() in procfs should hold
exec_update_lock to avoid TOCTOU issues with concurrent privileged
execve() (like setuid binary execution).
This fixes the easy cases - the owner and visibility checks and the
FD link permission checks - with the gnarlier ones to follow later.
* tag 'vfs-7.2-rc1.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: fix ups and tidy ups to /proc/filesystems caching
proc: protect ptrace_may_access() with exec_update_lock (FD links)
proc: protect ptrace_may_access() with exec_update_lock (part 1)
docs: proc: add documentation about mount restrictions
proc: handle subset=pid separately in userns visibility checks
proc: prevent reconfiguring subset=pid
proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
fs: cache the string generated by reading /proc/filesystems
sysfs: remove trivial sysfs_get_tree() wrapper
fs: RCU-ify filesystems list
fs: move SB_I_USERNS_VISIBLE to FS_USERNS_MOUNT_RESTRICTED
proc: allow to mark /proc files permanent outside of fs/proc/
namespace: record fully visible mounts in list
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- Reduce pipe->mutex contention by pre-allocating pages outside the
lock in anon_pipe_write().
anon_pipe_write() called alloc_page() once per page while holding
pipe->mutex. The allocation can sleep doing direct reclaim and runs
memcg charging, which extends the critical section and stalls any
concurrent reader on the same mutex. Now up to 8 pages are
pre-allocated before the mutex is taken, leftovers are recycled
into the per-pipe tmp_page[] cache before unlock, and any remainder
is released after unlock, keeping the allocator out of the critical
section on both sides. On a writers x readers sweep with 64KB
writes against a 1 MB pipe throughput improves 6-28% and average
write latency drops 5-22%; under memory pressure - when the cost of
holding the mutex across reclaim is highest - throughput improves
21-48% and latency drops 17-33%. The microbenchmark is added to
selftests.
- uaccess/sockptr: fix the ignored_trailing logic in
copy_struct_to_user() to behave as documented and the usize check
in copy_struct_from_sockptr() for user pointers, and add
copy_struct_{from,to}_bounce_buffer() and copy_struct_to_sockptr()
helpers for upcoming users (IPPROTO_SMBDIRECT, IPPROTO_QUIC).
- bpf: add a sleepable bpf_real_inode() kfunc that resolves the real
inode backing a dentry via d_real_inode(). On overlayfs the inode
attached to the dentry doesn't carry the underlying device
information; this is used by the filesystem restriction BPF program
that was merged into systemd.
- docs: add guidelines for submitting new filesystems, motivated by
the maintenance burden abandoned and untestable filesystems impose
on VFS developers, blocking infrastructure work like folio
conversions and iomap migration.
Fixes:
- libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
and drop the now-redundant assignments in callers. This began as a
one-line dma-buf fix for a path_noexec() warning; a pseudo
filesystem has no reason not to set SB_I_NOEXEC. All init_pseudo()
callers were audited: the only visible effect is on dma-buf where
SB_I_NOEXEC silences the warning.
- Handle set_blocksize() failures in legacy filesystems (bfs, hpfs,
qnx4, jfs, befs, affs, isofs, minix, ntfs3, omfs). Mounting a
device with a sector size > PAGE_SIZE crashed roughly half of them;
the rest had the same missing error handling pattern. Plus a
follow-up releasing the superblock buffer_head when setting the
minix v3 block size fails.
- mount: honour SB_NOUSER in the new mount API.
- fs/fcntl: fix a SOFTIRQ-unsafe lock order in fasync signaling by
switching the process-group paths of send_sigio() and send_sigurg()
from read_lock(&tasklist_lock) to RCU, matching the single-PID
path.
- vfs: add an FS_USERNS_DELEGATABLE flag and set it for NFS, fixing
delegated NFS mounts (fsopen() in a container with the mount
performed by a privileged daemon) that broke when non-init
s_user_ns was tied to FS_USERNS_MOUNT.
- selftests/namespaces: fix a hang in nsid_test where an unreaped
grandchild kept the TAP pipe write-end open, a waitpid(-1) race in
listns_efault_test, and a false FAIL on kernels without listns()
where the tests should SKIP.
- filelock: fix the break_lease() stub signature for
CONFIG_FILE_LOCKING=n.
- init/initramfs_test: wait for the async initramfs unpacking before
running; the test and do_populate_rootfs() share the parser state.
- fs/coredump: reduce redundant log noise in
validate_coredump_safety().
- iomap: pass the correct length to fserror_report_io() in
__iomap_write_begin().
- backing-file: fix the backing_file_open() kerneldoc.
Cleanups:
- initramfs: refactor the cpio hex header parsing to use hex2bin()
instead of the hand-rolled simple_strntoul() which is reverted, and
extend the initramfs KUnit tests to cover header fields with 0x
prefixes.
- Replace __get_free_pages() and friends with kmalloc()/kzalloc()
across quota, proc, ocfs2/dlm, nilfs2, nfs, nfsd, libfs, jfs, jbd2,
isofs, fuse, select, namespace, configfs, binfmt_misc, bfs, and the
do_mounts init code - part of the larger work of replacing page
allocator calls with kmalloc().
- Use clear_and_wake_up_bit() in unlock_buffer() and
journal_end_buffer_io_sync() instead of open-coding the sequence.
- Drop unused VFS exports: unexport drop_super_exclusive(), remove
start_removing_user_path_at(), and fold __start_removing_path()
into start_removing_path().
- fs/read_write: narrow the __kernel_write() export with
EXPORT_SYMBOL_FOR_MODULES().
- vfs: uapi: retire octal and hex constants in favor of (1 << n) for
the O_ flags. Finding a free bit for a new flag across the
architectures was needlessly hard with the mixed bases.
- dcache: add extra sanity checks of dead dentries in dentry_free()
via a new DENTRY_WARN_ONCE() that also prints d_flags.
- iov_iter: use kmemdup_array() in dup_iter() to harden the
allocation against multiplication overflow.
- fs/pipe: write to ->poll_usage only once.
- vfs: remove an always-taken if-branch in find_next_fd().
- dcache: use kmalloc_flex() for struct external_name in __d_alloc().
- namei: use QSTR() instead of QSTR_INIT() in path_pts().
- sync_file_range: delete dead S_ISLNK code.
- Comment fixes: retire a stale comment in fget_task_next() and fix
assorted spelling mistakes"
* tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (73 commits)
backing-file: fix backing_file_open() kerneldoc parameter
iomap: pass the correct len to fserror_report_io in __iomap_write_begin
vfs: add FS_USERNS_DELEGATABLE flag and set it for NFS
filelock: fix break_lease() stub signature for CONFIG_FILE_LOCKING=n
vfs: uapi: retire octal and hex numbers in favor of (1 << n) for O_ flags
bpf: add bpf_real_inode() kfunc
fs/read_write: Do not export __kernel_write() to the entire world
libfs: drop redundant SB_I_NOEXEC/SB_I_NODEV in init_pseudo() callers
libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
mount: honour SB_NOUSER in the new mount API
fs/fcntl: fix SOFTIRQ-unsafe lock order in fasync signaling
selftests/pipe: add pipe_bench microbenchmark
fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
fs: retire stale comment in fget_task_next()
fs: fix spelling mistakes in comment
bfs: replace get_zeroed_page() with kzalloc()
binfmt_misc: replace __get_free_page() with kmalloc()
configfs: replace __get_free_pages() with kzalloc()
fs/namespace: use __getname() to allocate mntpath buffer
fs/select: replace __get_free_page() with kmalloc()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull simple_xattr updates from Christian Brauner:
"This reworks the simple xattr api to make it more efficient and easier
to use for all consumers.
The simple_xattr hash table moves from the inode into a per-superblock
cache, removing the per-inode overhead for the common case of few or
no xattrs. The interface now passes struct simple_xattrs ** so lazy
allocation is handled internally instead of by every caller, kernfs
xattr operations on kernfs nodes shared between multiple superblocks
are properly serialized, and tmpfs constructs "security.foo" xattr
names with kasprintf() instead of kmalloc() plus two memcpy()s.
A follow-up fix links kernfs nodes to their parent before the LSM init
hook runs: with the per-sb cache kernfs_xattr_set() computes the cache
via kernfs_root(kn), which faulted on a freshly allocated node when
selinux_kernfs_init_security() called into it - reproducible as a NULL
pointer dereference on the first cgroup mkdir on SELinux-enabled
systems.
On top of this bpffs gains support for trusted.* and security.* xattrs
so that user space and BPF LSM programs can attach metadata - for
example a content hash or a security label - to pinned objects and
directories and inspect it uniformly like on other filesystems. The
store is in-memory and non-persistent, living only for the lifetime of
the mount like everything else in bpffs"
* tag 'vfs-7.2-rc1.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
bpf: Add simple xattr support to bpffs
kernfs: link kn to its parent before the LSM init hook
simpe_xattr: use per-sb cache
simple_xattr: change interface to pass struct simple_xattrs **
tmpfs: simplify constructing "security.foo" xattr names
kernfs: fix xattr race condition with multiple superblocks
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull iomap updates from Christian Brauner:
- Add the vfs infrastructure required to implement fs-verity support
for XFS with a post-EOF merkle tree: fsverity generates and stores a
zero-block hash, and iomap learns to verify data on buffered reads,
to handle fsverity during writeback via the new IOMAP_F_FSVERITY
flag, and to write fsverity metadata through iomap_fsverity_write().
- Skip the memset of the iomap in iomap_iter() once the iteration is
done. In high-IOPS scenarios (4k randread NVMe polling via io_uring)
the pointless memset wasted memory write bandwidth; this improves
IOPS by about 5% on ext4 and xfs.
- Add balance_dirty_pages_ratelimited() to iomap_zero_iter(), aligning
it with iomap_write_iter(). This prepares for the exFAT iomap
conversion where zeroing beyond valid_size can trigger large-scale
zeroing operations that caused memory pressure without throttling.
- Remove the over-strict inline data boundary check. If a filesystem
provides a valid inline_data pointer and length there is no reason to
require that inline data must not cross a page boundary.
- Don't make REQ_POLLED imply REQ_NOWAIT, matching the earlier
equivalent block layer fix: there are valid cases to poll for I/O
completion without REQ_NOWAIT, and REQ_NOWAIT for file system writes
is currently not supported as writes aren't idempotent.
- Introduce IOMAP_F_ZERO_TAIL for filesystems that maintain a separate
valid data length (exFAT, NTFS). For a write starting at or beyond
valid_size, __iomap_write_begin() now zeroes only the tail portion of
the block while preserving valid data before it, instead of leaving
stale data in the page cache. The flag is also added to the iomap
trace event strings.
* tag 'vfs-7.2-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
iomap: Add IOMAP_F_ZERO_TAIL flag to trace event strings
iomap: introduce iomap_fsverity_write() for writing fsverity metadata
iomap: teach iomap to read files with fsverity
iomap: introduce IOMAP_F_FSVERITY and teach writeback to handle fsverity
fsverity: generate and store zero-block hash
iomap: introduce IOMAP_F_ZERO_TAIL flag
iomap: don't make REQ_POLLED imply REQ_NOWAIT
iomap: remove over-strict inline data boundary check
iomap: add dirty page control to iomap_zero_iter
iomap: avoid memset iomap when iter is done
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull eventpoll updates from Christian Brauner:
- eventpoll clarity refactor
The recent eventpoll UAF fixes (a6dc643c6931 and follow-ups) depended
on invariants in fs/eventpoll.c that were nowhere documented and had
to be reverse-engineered from the code: the lifetime relationships
between struct eventpoll, struct epitem, and struct file, the three
removal paths coordinating via epi_fget() pins and ep->mtx, the
ovflist sentinel-encoded scan state machine, the POLLFREE
release/acquire handshake, and the loop / path check globals
serialized by epnested_mutex. The fixes were correct but the next
person to touch this code would hit the same learning curve.
This series codifies those invariants in source and tightens the
surrounding structure. No functional changes intended:
- Documentation: a top-of-file overview with field-protection
tables for struct eventpoll and struct epitem, a section
gathering the loop-check / path-check globals next to their
declarations, labelled comments on the two sides of the POLLFREE
handshake, refreshed comments on epi_fget() and ep_remove_file(),
and a docblock on ep_clear_and_put() that names its two-pass
structure as load-bearing.
- Mechanical renames: ep_refcount_dec_and_test() -> ep_put() to
pair with ep_get(), attach_epitem() -> ep_attach_file() for
ep_remove_file() symmetry, the unused depth argument dropped from
epoll_mutex_lock(), and the CONFIG_KCMP block relocated next to
CONFIG_COMPAT so the hot-path code is contiguous.
- Helper extraction: ep_insert() splits into ep_alloc_epitem() and
ep_register_epitem(), ep_clear_and_put()'s two passes become
ep_drain_pollwaits() and ep_drain_tree() so the ordering
invariant is enforced by the call sequence rather than
convention, the per-event delivery loop body becomes
ep_deliver_event(), and the ep->mtx + epnested_mutex acquisition
dance lifts out of do_epoll_ctl() into ep_ctl_lock() /
ep_ctl_unlock().
- Sentinel and predicate cleanup: the EP_UNACTIVE_PTR overload is
hidden behind named helpers (ep_is_scanning, epi_on_ovflist,
...), epi->next is renamed to epi->ovflist_next, and the boolean
predicates return bool.
- The per-CTL_ADD scratch state (tfile_check_list, path_count[],
inserting_into) moves from file-scope globals into a
stack-allocated struct ep_ctl_ctx plumbed through the loop / path
check chain.
Two follow-up fixes are included: missing kernel-doc for the new @ctx
parameters, and restoring the EP_UNACTIVE_PTR sentinel for
ctx->tfile_check_list - replacing it with NULL termination broke
ep_remove_file()'s "never listed" check for the list tail, causing a
syzbot-reported use-after-free.
- io_uring related epoll cleanups
One of the nastier things about epoll is how it allows nesting
contexts inside each other, leading to the necessity of loop
detection and the issues that have come with that. There is no reason
to support nesting on the io_uring side, so contain the damage and
disallow nested contexts from there: eventpoll gains a file based
control interface and struct epoll_filefd is renamed to epoll_key.
The io_uring side proper goes on top of this through the block tree.
- Fix epoll_wait() reporting false negatives
ep_events_available() checks ep->rdllist and ep_is_scanning() without
a lock and can race with a concurrent scan such that neither check
sees the events, causing epoll_wait() with a zero timeout to wrongly
report no events even though events are available. A sequence lock
closes the race and a reproducer is added to the eventpoll selftests.
* tag 'vfs-7.2-rc1.eventpoll' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (25 commits)
eventpoll: restore EP_UNACTIVE_PTR sentinel for ctx->tfile_check_list
eventpoll: Fix epoll_wait() report false negative
selftests/eventpoll: Add test for multiple waiters
eventpoll: add missing kernel-doc for @ctx function parameters
eventpoll: rename struct epoll_filefd to epoll_key
eventpoll: add file based control interface
eventpoll: export is_file_epoll()
eventpoll: pass struct epoll_filefd through ep_find() and ep_insert()
eventpoll: hoist CTL_ADD scratch state into struct ep_ctl_ctx
eventpoll: use bool for predicate helpers
eventpoll: rename epi->next and txlist for clarity
eventpoll: wrap EP_UNACTIVE_PTR in typed sentinel helpers
eventpoll: extract lock dance from do_epoll_ctl() into ep_ctl_lock()
eventpoll: extract ep_deliver_event() from ep_send_events()
eventpoll: split ep_clear_and_put() into drain helpers
eventpoll: split ep_insert() into alloc + register stages
eventpoll: relocate KCMP helpers near compat syscalls
eventpoll: rename attach_epitem() to ep_attach_file()
eventpoll: drop unused depth argument from epoll_mutex_lock()
eventpoll: rename ep_refcount_dec_and_test() to ep_put()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull buffer_head updates from Christian Brauner:
"This removes b_end_io from struct buffer_head.
Instead of setting bio->bi_end_io to end_bio_bh_io_sync() which then
calls bh->b_end_io(), the new bh_submit() and __bh_submit() interfaces
set bio->bi_end_io to the appropriate completion handler directly,
replacing two indirect function calls in the completion path with one.
It is also one fewer function pointer in the middle of a writable data
structure that can be corrupted, it shrinks struct buffer_head from
104 to 96 bytes allowing roughly 7% more buffer_heads to be cached in
the same amount of memory, and it removes some atomic operations as
the buffer refcount is no longer incremented before calling the end_io
handler.
All in-tree users (fs/buffer.c itself, ext4, jbd2, ocfs2, gfs2,
nilfs2, and md-bitmap) are converted, and submit_bh(),
mark_buffer_async_write(), and end_buffer_write_sync() are removed"
* tag 'vfs-7.2-rc1.bh' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (34 commits)
buffer: Remove end_buffer_write_sync()
buffer: Change calling convention for end_buffer_read_sync()
buffer: Remove b_end_io
buffer: Remove submit_bh()
md-bitmap: Convert read_file_page and write_file_page to bh_submit()
nilfs2: Convert nilfs_mdt_submit_block to bh_submit()
nilfs2: Convert nilfs_gccache_submit_read_data to bh_submit()
nilfs2: Convert nilfs_btnode_submit_block to bh_submit()
buffer: Remove mark_buffer_async_write()
gfs2: Convert gfs2_aspace_write_folio to bh_submit()
gfs2: Remove use of b_end_io in gfs2_meta_read_endio()
gfs2: Convert gfs2_dir_readahead to bh_submit()
gfs2: Convert gfs2_metapath_ra to bh_submit()
ocfs2: Convert ocfs2_write_super_or_backup to bh_submit()
ocfs2: Convert ocfs2_read_blocks to bh_submit()
ocfs2: Convert ocfs2_read_block to bh_submit()
ocfs2: Convert ocfs2_write_block to bh_submit()
jbd2: Convert jbd2_write_superblock() to bh_submit()
jbd2: Convert journal commit to bh_submit()
ext4: Convert ext4_commit_super() to bh_submit()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs writeback updates from Christian Brauner:
- Fix a race between cgroup_writeback_umount() and inode_switch_wbs()
When a container exits, a race between cgroup_writeback_umount() and
inode_switch_wbs()/cleanup_offline_cgwb() can trigger "VFS: Busy
inodes after unmount" followed by a use-after-free on percpu
counters.
There is a window between inode_prepare_wbs_switch() returning true
(having passed the SB_ACTIVE check and grabbed the inode) and the
subsequent wb_queue_isw() call: if cgroup_writeback_umount() observes
the global isw_nr_in_flight counter as non-zero but flush_workqueue()
finds nothing queued yet, it returns early - leaving a held inode
reference that blocks evict_inodes() and a later iput() that hits
freed percpu counters.
The race is closed by covering the window from
inode_prepare_wbs_switch() through wb_queue_isw() with an RCU
read-side critical section and synchronizing in the umount path.
On top of that the now-dead rcu_barrier() left over from the
queue_rcu_work() era is removed, and the global
synchronize_rcu()/flush_workqueue() pair is replaced with a per-sb
in-flight counter plus pin/unpin/drain helpers so umount no longer
serializes against switch activity on unrelated superblocks.
Under cgroup writeback churn on a 16 vCPU guest this takes umount
latency from ~92-138ms p50 down to ~5-8ms p50 and the cumulative cost
of cgroup_writeback_umount() from ~62ms to ~4us per call.
The initial race fix is kept separate and minimal so it backports
cleanly to stable trees that still queue switches via
queue_rcu_work().
- Improve write performance with RWF_DONTCACHE
Dirty DONTCACHE pages are now tracked per bdi_writeback so that the
writeback flusher can be kicked in a targeted fashion for
IOCB_DONTCACHE writes instead of relying on global writeback, and the
PG_dropbehind flag is preserved when a folio is split.
* tag 'vfs-7.2-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
mm: track DONTCACHE dirty pages per bdi_writeback
mm: preserve PG_dropbehind flag during folio split
writeback: use a per-sb counter to drain inode wb switches at umount
writeback: drop now-unnecessary rcu_barrier() in cgroup_writeback_umount()
writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs superblock updates from Christian Brauner:
"This retires sget().
CIFS plus the two ext4 KUnit tests (extents-test, mballoc-test) were
the last in-tree callers, and all three convert cleanly to sget_fc().
That lets sget() and its prototype come out, taking ~60 lines that
only existed to be kept in lockstep with sget_fc() on every
publish-path change"
* tag 'vfs-7.2-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: retire sget()
smb: client: convert cifs_smb3_do_mount() to sget_fc()
ext4: convert mballoc KUnit test to sget_fc()
ext4: convert extents KUnit test to sget_fc()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull openat2 updates from Christian Brauner:
"Features:
- Add O_EMPTYPATH to openat(2)/openat2(2). To get an operable file
descriptor from an O_PATH file descriptor it is possible to use
openat(fd, ".", O_DIRECTORY) for directories, but other file types
require going through open("/proc/<pid>/fd/<nr>") and thus depend
on a functioning procfs.
With O_EMPTYPATH an empty path string is accepted and LOOKUP_EMPTY
is set at path resolution time, allowing to reopen the file behind
the file descriptor directly. Selftests are included.
- Add an OPENAT2_REGULAR flag for openat2(2) which refuses to open
anything but regular files with the new EFTYPE error code.
This implements the "ability to only open regular files" feature
requested by userspace via uapi-group.org and protects services
from being redirected to fifos, device nodes, and friends.
All atomic_open implementations were audited for OPENAT2_REGULAR
handling. Explicit checks were added to ceph, gfs2, nfs (v4), and
cifs/smb - these are the filesystems whose atomic_open can
encounter an existing non-regular file and would otherwise call
finish_open() on it or return a misleading error code.
The remaining implementations (9p, fuse, vboxsf, nfs v2/v3) only
call finish_open() on freshly created files and use
finish_no_open() for lookup hits, letting the VFS catch non-regular
files via the do_open() safety net.
Cleanups:
- Migrate the openat2 selftests to the kselftest harness and move
them under selftests/filesystems/. The tests were written in the
early days of selftests' TAP support and the modern kselftest
harness is much easier to follow and maintain. The contents of the
tests are unchanged and the new emptypath tests are ported on top.
- Make the LAST_XXX last-type constants private to fs/namei.c. The
only user outside of fs/namei.c was ksmbd which only needs to know
whether the last component is a regular one, so
vfs_path_parent_lookup() now performs the LAST_NORM check
internally. The ints are replaced with a dedicated enum last_type"
* tag 'vfs-7.2-rc1.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
vfs: replace ints with enum last_type for LAST_XXX
vfs: make LAST_XXX private to fs/namei.c
selftests: openat2: port emptypath_test to kselftest harness
kselftest/openat2: test for OPENAT2_REGULAR flag
openat2: new OPENAT2_REGULAR flag support
openat2: introduce EFTYPE error code
selftest: add tests for O_EMPTYPATH
vfs: add O_EMPTYPATH to openat(2)/openat2(2)
selftests: openat2: migrate to kselftest harness
selftests: openat2: switch from custom ARRAY_LEN to ARRAY_SIZE
selftests: openat2: move helpers to header
selftests: move openat2 tests to selftests/filesystems/
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc kernel updates from Christian Brauner:
"Fixes
- rhashtable: give each instance its own lockdep class
syzbot reported a circular locking dependency between ht->mutex and
fs_reclaim via the simple_xattrs rhashtable being torn down during
inode eviction.
The predicted deadlock cannot occur: rhashtable_free_and_destroy()
cancels the deferred worker before taking ht->mutex and
acquisitions on distinct rhashtables are on distinct mutexes.
Lockdep flags a cycle anyway because every ht->mutex in the kernel
shared the single static lockdep class from
rhashtable_init_noprof().
The lockdep key is lifted to a per-call-site static key so every
rhashtable instance gets its own class.
- selftests/clone3: fix misuse of the libcap library interface in the
cap_checkpoint_restore test and remove unused variables
- selftests/pid_namespace: compute the pid_max test limits
dynamically instead of hardcoding values below the kernel-enforced
minimum of PIDS_PER_CPU_MIN * num_possible_cpus() which made the
tests fail on machines with many possible CPUs
- selftests: fix the Makefile TARGETS entry for nsfs which wasn't
adjusted when the tests moved under filesystems/
Cleanups
- ipc/sem.c: use unsigned int for nsops to match the declaration in
syscalls.h"
* tag 'kernel-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
selftests/clone3: remove unused variables
selftests/clone3: fix libcap interface usage
ipc/sem.c: use unsigned int for nsops
selftests: Fix Makefile target for nsfs
rhashtable: give each instance its own lockdep class
selftests/pid_namespace: compute pid_max test limits dynamically
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull task_exec_state updates from Christian Brauner:
"This introduces a new per-task task_exec_state structure and relocates
the dumpable mode and the user namespace captured at execve() from
mm_struct onto it. It stays attached to the task for its full
lifetime.
__ptrace_may_access() and several /proc owner and visibility checks
need to consult two pieces of state for any observable task, including
zombies that have already gone through exit_mm(): the dumpable mode
and the user namespace captured at execve(). Both live on mm_struct
today, which exit_mm() clears from the task long before the task is
reaped. A reader that races with do_exit() observes task->mm == NULL
and either fails the check or falls back to init_user_ns - which
denies legitimate access to non-dumpable zombies that were running in
a nested user namespace.
mm_struct loses ->user_ns and the dumpability bits in ->flags.
MMF_DUMPABLE_BITS is reserved so the MMF_DUMP_FILTER_* layout exposed
via /proc/<pid>/coredump_filter stays stable. task->user_dumpable and
its exit_mm() snapshot are removed.
task_exec_state is the privilege domain established by an execve().
Within a thread group it is shared via refcount; across thread groups
each task has its own:
- CLONE_VM siblings (thread-group members, io_uring workers)
refcount-share the parent's exec_state.
- Non-CLONE_VM clones (fork(), vfork() without CLONE_VM) allocate a
fresh exec_state inheriting the parent's dumpable mode and user_ns.
- execve() in the child allocates a fresh instance and installs it
under task_lock + exec_update_lock via task_exec_state_replace().
- Credential changes (setresuid, capset, ...) and
prctl(PR_SET_DUMPABLE) update dumpability on the current task's
exec_state, i.e., on the thread group's shared instance.
On top of this exec_mmap() no longer tears down the old mm while
holding exec_update_lock for writing and cred_guard_mutex. Neither
lock is needed for that: exec_update_lock only exists to make the mm
swap atomic with the later commit_creds() and all its readers operate
on the new mm; none looks at the detached old mm.
The cost was real: __mmput() runs exit_mmap() over the entire old
address space and can block in exit_aio() waiting for in-flight AIO,
so execve() of a large process blocked ptrace_attach() and every
exec_update_lock reader for the duration of the teardown.
The old mm is now stashed in bprm->old_mm and released from
setup_new_exec() after both locks are dropped, with a backstop in
free_bprm() for the error paths"
* tag 'kernel-7.2-rc1.task_exec_state' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
exec: free the old mm outside the exec locks
exec_state: relocate dumpable information
ptrace: add ptracer_access_allowed()
exec: introduce struct task_exec_state
sched/coredump: introduce enum task_dumpable
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs casefolding updates from Christian Brauner:
"This exposes the case folding behavior of local filesystems so that
file servers - nfsd, ksmbd, and user space file servers - can report
the actual behavior to clients instead of guessing.
Filesystems report case-insensitive and case-nonpreserving behavior
via new file_kattr flags in their fileattr_get implementations. fat,
exfat, ntfs3, hfs, hfsplus, xfs, cifs, nfs, vboxsf, and isofs are
wired up. Local filesystems that are not explicitly handled default to
the usual POSIX behavior of case-sensitive and case-preserving.
nfsd uses this to report case folding via NFSv3 PATHCONF and to
implement the NFSv4 FATTR4_CASE_INSENSITIVE and FATTR4_CASE_PRESERVING
attributes - both have been part of the NFS protocols for decades to
support clients on non-POSIX systems - and ksmbd reports it via
FS_ATTRIBUTE_INFORMATION. Exposing the information through the
fileattr uapi covers user space file servers.
The immediate motivation is interoperability: Windows NFS clients
hard-require servers to report case-insensitivity for Win32
applications to work correctly, and a client that knows the server is
case-insensitive can avoid issuing multiple LOOKUP/READDIR requests
searching for case variants.
The Linux NFS client already grew support for case-insensitive shares
years ago in support of the Hammerspace NFS server - negative dentry
caching must be disabled (a lookup for "FILE.TXT" failing must not
cache a negative entry when "file.txt" exists) and directory change
invalidation must drop cached case-folded name variants. Such servers
often operate in multi-protocol environments where a single file
service instance caters to both NFS and SMB clients, and nfsd needs to
report case folding properly to participate as a first-class citizen
there.
A follow-up series brings fixes for the initial work: the nfsd
case-info probe now uses kernel credentials, maps -ESTALE to
NFS3ERR_STALE, and has its cost capped across READDIR entries; the nfs
client avoids transiently zeroed case capability bits during the probe
and skips the pathconf probe when neither field is consumed; the
FS_CASEFOLD_FL semantics are clarified in the UAPI header; and the
tools UAPI headers are synced"
* tag 'vfs-7.2-rc1.casefold' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
nfsd: Cap case-folding probe cost across READDIR entries
nfsd: Map -ESTALE from case probe to NFS3ERR_STALE
nfsd: Use kernel credentials for case-info probe
fs: Clarify FS_CASEFOLD_FL semantics in UAPI header
nfs: Skip pathconf probe when neither field is consumed
nfs: Avoid transient zeroed case capability bits during probe
tools headers UAPI: Sync case-sensitivity flags from linux/fs.h
ksmbd: Report filesystem case sensitivity via FS_ATTRIBUTE_INFORMATION
nfsd: Implement NFSv4 FATTR4_CASE_INSENSITIVE and FATTR4_CASE_PRESERVING
nfsd: Report export case-folding via NFSv3 PATHCONF
isofs: Implement fileattr_get for case sensitivity
vboxsf: Implement fileattr_get for case sensitivity
nfs: Implement fileattr_get for case sensitivity
cifs: Implement fileattr_get for case sensitivity
xfs: Report case sensitivity in fileattr_get
hfsplus: Report case sensitivity in fileattr_get
hfs: Implement fileattr_get for case sensitivity
ntfs3: Implement fileattr_get for case sensitivity
exfat: Implement fileattr_get for case sensitivity
fat: Implement fileattr_get for case sensitivity
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs directory delegations from Christian Brauner:
"This contains the VFS prerequisites for supporting directory
delegations in nfsd via CB_NOTIFY callbacks.
The filelock core gains support for ignoring delegation breaks for
directory change events together with an inode_lease_ignore_mask()
helper, and fsnotify gains fsnotify_modify_mark_mask() and a
FSNOTIFY_EVENT_RENAME data type.
With this in place nfsd can request delegations on directories and set
up inotify watches to trigger sending CB_NOTIFY events to clients
instead of having every directory change break the delegation.
New tracepoints are added to fsnotify() and to the start of
break_lease(), and trace_break_lease_block() is passed the currently
blocking lease instead of the new one.
A follow-up fix moves the LEASE_BREAK_* flags out of
#ifdef CONFIG_FILE_LOCKING to fix the build for CONFIG_FILE_LOCKING=n
configurations"
* tag 'vfs-7.2-rc1.directory.delegations' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
filelock: move LEASE_BREAK_* flags out of #ifdef CONFIG_FILE_LOCKING
fsnotify: add FSNOTIFY_EVENT_RENAME data type
fsnotify: add fsnotify_modify_mark_mask()
fsnotify: new tracepoint in fsnotify()
filelock: add an inode_lease_ignore_mask helper
filelock: add a tracepoint to start of break_lease()
filelock: add support for ignoring deleg breaks for dir change events
filelock: pass current blocking lease to trace_break_lease_block() rather than "new_fl"
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode updates from Christian Brauner:
"This extends the lockless ->i_count handling.
iput() could already decrement any value greater than one locklessly
but acquiring a reference always required taking inode->i_lock. Now
acquiring a reference is lockless as long as the count was already at
least 1, i.e., only the 0->1 and 1->0 transitions take the lock.
This avoids the lock for the common cases of nfs calling into the
inode hash and btrfs using igrab(). Cleanup-wise icount_read_once() is
added to line up with inode_state_read_once() and the open-coded
->i_count loads across the tree are converted, and ihold() is
relocated and tidied up.
On top of that some stale lock ordering annotations are retired from
the inode hash code: iunique() no longer takes the hash lock since the
inode hash became RCU-searchable and s_inode_list_lock is no longer
taken under the hash lock either"
* tag 'vfs-7.2-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: retire stale lock ordering annotations from inode hash
fs: allow lockless ->i_count bumps as long as it does not transition 0->1
fs: relocate and tidy up ihold()
fs: add icount_read_once() and stop open-coding ->i_count loads
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull exportfs updates from Christian Brauner:
"This cleans up the exportfs support for block-style layouts that
provide direct block device access: the operations for layout-based
block device access are split out of struct export_operations into a
separate header, ->commit_blocks() no longer takes a struct iattr
argument, and the way support for layout-based block device access is
detected is reworked.
nfsd's blocklayout code also stops honoring loca_time_modify. This is
preparation for supporting export of more than a single device per
file system"
* tag 'vfs-7.2-rc1.exportfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
exportfs,nfsd: rework checking for layout-based block device access support
exportfs: don't pass struct iattr to ->commit_blocks
exportfs: split out the ops for layout-based block device access
nfsd/blocklayout: always ignore loca_time_modify
|
|
Bump MAX_CALL_FRAMES from 8 to 16 to allow deeper call chains
that Rust-BPF requires and update selftests.
Link: https://lore.kernel.org/r/20260613180755.29671-1-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The return value of i3c_master_add_i3c_dev_locked() is not used by any
caller, and callers are not in a position to recover from failures in
this path.
Change the function to return void. Amend the kernel-doc accordingly,
fix some grammar and remove a stale paragraph.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260612080107.11606-6-adrian.hunter@intel.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
The existing i3c_master_enec_locked() wrapper always treats a NACKed
ENEC CCC as a failure (M2 error). However, broadcasting ENEC to enable
Hot-Join is legitimately useful even when no I3C devices are currently
present on the bus, in which case the broadcast will be NACKed and
should not be reported as an error.
The underlying helper i3c_master_enec_disec_locked() already accepts a
suppress_m2 flag that lets callers ignore such NACKs. Expose it so that
a subsequent patch enabling Hot-Join events can issue ENEC with M2
suppression.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260608054312.10604-8-adrian.hunter@intel.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
Master drivers may invoke i3c_master_do_daa_ext() during resume to
re-run Dynamic Address Assignment. As well as assigning addresses to
any newly arrived devices, this restores the dynamic address of devices
that lost it across system suspend, so it has to run as part of the
controller's resume path.
A side effect of i3c_master_do_daa_ext() today is that it also
registers any newly discovered I3C devices with the driver model
inline, via i3c_master_register_new_i3c_devs(). Doing that from the
resume path is problematic: a hot-join-capable device may join the bus
during this same DAA, and registering it immediately would push driver
model work (probing, sysfs, etc.) into the controller's resume context,
where the rest of the system is not yet fully resumed and the
controller driver is still partway through its own resume sequence.
Decouple discovery from registration: add a reg_work work item to
struct i3c_master_controller and have i3c_master_do_daa_ext() queue it
on master->wq (the freezable workqueue) instead of calling
i3c_master_register_new_i3c_devs() directly. The worker performs the
registration only when the controller is not shutting_down, and is
cancelled alongside hj_work in i3c_master_shutdown(). Because wq is
freezable, any newly observed devices end up being registered after
the system has finished resuming.
i3c_master_register() also routes its initial post-bus-init registration
through reg_work, using flush_work() to keep probe-time behavior
synchronous. This keeps a single registration code path and ensures the
worker is the only writer of desc->dev.
Fixes: 3a379bbcea0af ("i3c: Add core I3C infrastructure")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260608054312.10604-7-adrian.hunter@intel.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
System shutdown invokes each device's bus shutdown callback to quiesce
hardware, but the I3C bus type does not currently implement one. As a
result, on shutdown the controller's Hot-Join work and any in-flight
i3c_master_do_daa() can keep running (or be newly triggered) while the
rest of the system is being torn down.
A similar window exists at i3c_master_unregister() time: cancel_work_sync()
on hj_work prevents queued work from completing, but does not stop a
fresh Hot-Join IBI from re-queueing the worker, nor a concurrent sysfs
writer from toggling Hot-Join via i3c_set_hotjoin().
Introduce a single "shutting down" gate in the I3C core, set under the
bus maintenance lock so it is observed by any in-progress DAA path
before pending work is cancelled. Install an i3c_bus_type shutdown
callback that engages this gate for master devices during system
shutdown, and use the same gate in i3c_master_unregister() so both
paths get identical guarantees.
Once the gate is engaged, the Hot-Join worker, i3c_master_do_daa_ext()
and i3c_set_hotjoin() all bail out cleanly, so Hot-Join IBIs that race
with shutdown become no-ops, direct DAA callers see -ENODEV, and sysfs
writers can no longer re-enable Hot-Join through ops->enable_hotjoin()
while the controller is going away.
No functional change for the steady-state runtime path; the new checks
only take effect once the controller has been marked as shutting down.
Note, this patch depends on patch "i3c: master: Consolidate Hot-Join DAA
work in the core".
Fixes: 3a379bbcea0af ("i3c: Add core I3C infrastructure")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260608054312.10604-5-adrian.hunter@intel.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
Three master drivers (dw-i3c-master, i3c-master-cdns, svc-i3c-master)
each carry an essentially identical Hot-Join handler: a struct
work_struct embedded in their private state, a work function that just
calls i3c_master_do_daa() on the embedded i3c_master_controller, plus
matching INIT_WORK()/cancel_work_sync() boilerplate in probe/remove (and
shutdown for dw-i3c). The IBI/ISR paths then queue that work onto
master->wq, which already lives in the core.
Move this pattern into the I3C core:
- Add struct work_struct hj_work to struct i3c_master_controller and
initialise it in i3c_master_register() with a core-provided handler
i3c_master_hj_work_fn() that performs i3c_master_do_daa().
- Cancel the work in i3c_master_unregister() so all controllers get
correct teardown ordering against the workqueue for free.
- Export i3c_master_queue_hotjoin() as the single entry point drivers
call from their Hot-Join IBI handler.
Convert the three existing users to the new API: drop their private
hj_work fields, work functions, INIT_WORK() and cancel_work_sync()
calls, and replace the queue_work(master->wq, &drv->hj_work) call sites
with i3c_master_queue_hotjoin(&drv->base). The dw-i3c shutdown path
still needs to flush pending Hot-Join work before tearing down the
hardware, so it is updated to cancel master->base.hj_work directly.
No functional change intended: the work is still queued on the same
master->wq, runs the same i3c_master_do_daa(), and is cancelled at
controller teardown. Future Hot-Join improvements now only need to
be made in one place.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260608054312.10604-4-adrian.hunter@intel.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
The I3C master workqueue (master->wq) is used to defer work that needs
thread context and the bus maintenance lock, most notably Hot Join
processing (which calls i3c_master_do_daa() to assign dynamic addresses
to newly joined devices).
Currently the workqueue keeps running across system suspend, which can
race with the suspend path:
- do_daa() may execute after the controller has been suspended,
issuing bus transactions on a powered-down or otherwise unusable
controller.
- New I3C devices can be enumerated and added to the bus mid-suspend,
registering driver model objects at a point where the I3C subsystem
and its consumers are not prepared to handle them.
Mark the workqueue WQ_FREEZABLE so its workers are frozen for the
duration of system suspend/hibernate and resumed afterwards. This
naturally defers any pending or newly queued Hot Join work until the
system (and the controller) is fully resumed, closing both races
without adding explicit suspend/resume synchronization in the master
drivers.
Update the kerneldoc for struct i3c_master_controller::wq to reflect
that the workqueue is freezable.
Fixes: 3a379bbcea0af ("i3c: Add core I3C infrastructure")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260608054312.10604-2-adrian.hunter@intel.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
Extend the DPLL pin notification API to include a source identifier
indicating where the notification originates. This allows notifier
consumers to distinguish between notifications coming from
an associated DPLL instance, a parent pin, or the pin itself.
A new field, src_clock_id, is added to struct dpll_pin_notifier_info
and is passed through all pin-related notification paths. Callers of
dpll_pin_notify() are updated to provide a meaningful source identifier
based on their context:
- pin registration/unregistration uses the DPLL's clock_id,
- pin-on-pin operations use the parent pin's clock_id,
- pin changes use the pin's own clock_id.
As introduced in the commit ("dpll: allow registering FW-identified pin
with a different DPLL"), it is possible to share the same physical pin
via firmware description (fwnode) with DPLL objects from different
kernel modules. This means that a given pin can be registered multiple
times.
Driver such as ICE (E825 devices) rely on this mechanism when listening
for the event where a shared-fwnode pin appears, while avoiding reacting
to events triggered by their own registration logic.
This change only extends the notification metadata and does not alter
existing semantics for drivers that do not use the new field.
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Link: https://patch.msgid.link/20260607183045.1213735-9-grzegorz.nitka@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
With the tctx fallback running its entries directly, the per-ctx
fallback work has a single user left: moving local (DEFER_TASKRUN)
task_work entries out of a ring that is going away. Both of its call
sites are process context and don't hold ->uring_lock, the same
conditions the deferred fallback work itself ran under - so run the
entries in cancel mode right there instead, and rename the helper to
io_cancel_local_task_work() to match what it now does.
With that, ->fallback_llist, ->fallback_work, io_fallback_req_func()
and __io_fallback_tw() can all go away, along with the fallback work
flushing in the ring exit and cancel paths. Requests that get
orphaned by an exiting task now run via the tctx fallback work, which
the ring exit side implicitly waits on through the ctx refs those
requests hold.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Like the local task_work list, the normal (tctx) task_work list is an
llist, and hence needs the O(n) llist_reverse_order() pass before
running entries in queue order. On top of that, capped runs - sqpoll
processing IORING_TW_CAP_ENTRIES_VALUE entries at a time - need the
claimed-but-unprocessed leftovers carried in a separate retry_list,
as they can't be pushed back to the shared list.
Switch tctx->task_list to a mpscq, like what was done for the
DEFER_TASKRUN paths as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The local (DEFER_TASKRUN) task_work list is an llist, which is LIFO
ordered, and hence __io_run_local_work() has to restore the right
running order with an O(n) llist_reverse_order() pass first. On top of
that, a batch that gets capped by max_events needs the leftover entries
parked on a separate ->retry_llist, as they can't be pushed back to the
shared list.
Switch it to the FIFO mpscq. Adds are wait-free instead of a cmpxchg
retry loop, entries are popped in queue order with no reversal pass,
capping a run simply leaves the remainder on the queue, and
->retry_llist goes away entirely. The consumer cursor, ->work_head,
lives with the rest of the ->uring_lock protected state rather than
next to the queue, so that popping entries doesn't dirty the producer
side cacheline.
For low amounts of task_work, this ends up being a bit more efficient
than the existing scheme. As an example of that, doing multishot
receives for 8 clients has the following task_work overhead:
1.02% sock-test [kernel.kallsyms] [k] io_req_local_work_add
0.88% sock-test [kernel.kallsyms] [k] __io_run_local_work_loop
0.60% sock-test [kernel.kallsyms] [k] llist_reverse_order
0.14% sock-test [kernel.kallsyms] [k] __io_run_local_work
2.64% at ~46Gb/sec
and after this change:
1.08% sock-test [kernel.kallsyms] [k] io_req_local_work_add
1.03% sock-test [kernel.kallsyms] [k] __io_run_local_work
2.11% at ~53Gb/sec
which has less overhead even though that test run was faster. For a case
of having 1024 clients on a single ring:
2.22% sock-test [kernel.kallsyms] [k] llist_reverse_order
0.84% sock-test [kernel.kallsyms] [k] __io_run_local_work_loop
0.42% sock-test [kernel.kallsyms] [k] io_req_local_work_add
0.02% sock-test [kernel.kallsyms] [k] __io_run_local_work
3.50% at ~24Gb/sec
we start to see the llist reversing taking a considerable amount of
time, and the total add+run task_work overhead is around 3.5%. After
the change:
0.90% sock-test [kernel.kallsyms] [k] __io_run_local_work
0.42% sock-test [kernel.kallsyms] [k] io_req_local_work_add
1.32% at ~26Gb/sec
most of that overhead is gone, and performance is better as well.
Caleb Sander Mateos <csander@purestorage.com> reports that it improves
the performance of a ublk 4kb workload by 4% [1], while testing v1 of
this patchset.
[1] https://lore.kernel.org/io-uring/CADUfDZr-MMYBaP-e+y9+xuRhuiunO2sBTUCmwZyd7AgT8sVtiQ@mail.gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Local task_work is currently using llists for managing the work,
but that's a LIFO type of list. This means that running this task_work
needs to reverse the list first, to ensure fairness in running the
queued items.
Add a lockless FIFO queued, based on Dmitry Vyukov's intrusive MPSC
node-based queue algorithm, modified with an externally held consumer
cursor and conditional stub reinsertion. See comments in the header.
Producers are wait-free: a push is a single xchg() on the queue tail,
which serializes concurrent producers and defines the FIFO order, plus
a store linking the node to its predecessor. There are no cmpxchg retry
loops, and pushing is safe from any context, including hardirq.
The cost of linked list FIFO ordering is that a push publishes the node
in two steps - the xchg() makes it visible as the new tail before the
subsequent store links it into the chain that is reachable from the
head. A consumer hitting that window gets a NULL from mpscq_pop() while
mpscq_empty() reports false, and must retry later rather than treat the
queue as empty. The window is two instructions wide, but a producer can
get preempted inside it, so the consumer must not busy wait on it.
The consumer side supports a single consumer at a time, with callers
providing their own serialization. A stub node, which also defines the
empty state (tail == stub), allows the consumer to detach the final
node without racing against producer link stores: that node is only
handed out once the stub has been cmpxchg'ed back in as the tail. This
also guarantees that the previous tail returned by mpscq_push() cannot
get freed before that push has linked it, making it always valid for
comparisons.
The consumer cursor is deliberately not part of the queue struct - the
caller owns it and passes it to mpscq_pop(). This is done to separate
the consumer and producers cacheline. The cursor is written for every
popped entry, and keeping it on the same cacheline as ->tail would have
the consumer invalidating the line that producers need for every push.
Keeping it external lets the caller place it with its own consumer side
data instead.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When a cgroup BPF program exits with 0, bpf_prog_run_array_cg() sets
the hook return value to -EPERM if it is not a valid errno. This is
correct for errno-based hooks, which return 0 on success and negative
errno on failure, but wrong for boolean and void LSM hooks. Boolean
LSM hooks should only return true or false, and void LSM hooks have
no return value at all.
Fix it by skipping setting -EPERM for hooks not returning errno.
Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20260610201724.733943-2-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Blamed commit converted the untracked dev_hold()/dev_put() calls
in the watchdog code to use the tracked dev_hold_track()/dev_put_track()
(which were later renamed/interfaced to netdev_hold() and netdev_put()).
By introducing dev->watchdog_dev_tracker to store the
reference tracking information without adding synchronization
between netdev_watchdog_up() and dev_watchdog(), it enabled the
race condition where this pointer could be overwritten or freed
concurrently, leading to the list corruption crash syzbot reported:
list_del corruption, ffff888114a18c00->next is NULL
kernel BUG at lib/list_debug.c:52 !
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 1 UID: 0 PID: 91 Comm: kworker/u8:5 Not tainted syzkaller #0 PREEMPT(lazy)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
Workqueue: events_unbound linkwatch_event
RIP: 0010:__list_del_entry_valid_or_report.cold+0x22/0x2a lib/list_debug.c:52
Call Trace:
<TASK>
__list_del_entry_valid include/linux/list.h:132 [inline]
__list_del_entry include/linux/list.h:246 [inline]
list_move_tail include/linux/list.h:341 [inline]
ref_tracker_free+0x1a7/0x6c0 lib/ref_tracker.c:329
netdev_tracker_free include/linux/netdevice.h:4491 [inline]
netdev_put include/linux/netdevice.h:4508 [inline]
netdev_put include/linux/netdevice.h:4504 [inline]
netdev_watchdog_down net/sched/sch_generic.c:600 [inline]
dev_deactivate_many+0x28c/0xfe0 net/sched/sch_generic.c:1363
dev_deactivate+0x109/0x1d0 net/sched/sch_generic.c:1397
linkwatch_do_dev net/core/link_watch.c:184 [inline]
linkwatch_do_dev+0xd3/0x120 net/core/link_watch.c:166
__linkwatch_run_queue+0x3a5/0x810 net/core/link_watch.c:240
linkwatch_event+0x8f/0xc0 net/core/link_watch.c:314
process_one_work+0xa0e/0x1980 kernel/workqueue.c:3314
process_scheduled_works kernel/workqueue.c:3397 [inline]
worker_thread+0x5ef/0xe50 kernel/workqueue.c:3478
kthread+0x370/0x450 kernel/kthread.c:436
ret_from_fork+0x69a/0xc80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
This patch has three coordinated parts:
1) Add dev->watchdog_lock and dev->watchdog_ref_held to serialize watchdog operations.
2) Remove netdev_watchdog_up() call from netif_carrier_on():
This ensures netdev_watchdog_up() is only called from process/BH context
(via linkwatch workqueue dev_activate()), allowing us to use
spin_lock_bh() for synchronization.
3) Synchronize watchdog up and watchdog timer:
Protect netdev_watchdog_up() with tx_global_lock and watchdog_lock.
Only allocate a new tracker in netdev_watchdog_up() if one is
not already present.
In dev_watchdog(), ensure we don't release the tracker if the
timer was rescheduled either by dev_watchdog() itself or concurrently
by netdev_watchdog_up().
Fixes: f12bf6f3f942 ("net: watchdog: add net device refcount tracker")
Reported-by: syzbot+381d82bbf0253710b35d@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6a26b751.c25708ab.1b19ef.0013.GAE@google.com/T/#u
Tested-by: syzbot+3479efbc2821cb2a79f2@syzkaller.appspotmail.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260611152737.2580480-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The tls_toe feature and its single user (chelsio chtls) have been
unmaintained for multiple years. It also hooks into the core of the
TCP implementation, and bypasses most of the networking stack.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/1f30e73275c07bf879f547589872d0916025a52e.1781165969.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a new block error injection interface that allows to inject specific
status code for specific ranges.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev>
Link: https://patch.msgid.link/20260611140703.2401204-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
So far our parsing of {iommu,msi}-map properties has always blindly
assumed that the output specifiers will always have exactly 1 cell.
This typically does happen to be the case, but is not actually enforced
(and the PCI msi-map binding even explicitly states support for 0 or 1
cells) - as a result we've now ended up with dodgy DTs out in the field
which depend on this behaviour to map a 1-cell specifier for a 2-cell
provider, despite that being bogus per the bindings themselves.
Since there is some potential use in being able to map at least single
input IDs to multi-cell output specifiers (and properly support 0-cell
outputs as well), add support for properly parsing and using the target
nodes' #cells values, albeit with the unfortunate complication of still
having to work around expectations of the old behaviour too.
Since there are multi-cell output specifiers, the callers of of_map_id()
may need to get the exact cell output value for further processing.
Update of_map_id() to set args_count in the output to reflect the actual
number of output specifier cells.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Charan Teja Kalla <charan.kalla@oss.qualcomm.com>
Signed-off-by: Vijayanand Jitta <vijayanand.jitta@oss.qualcomm.com>
Link: https://patch.msgid.link/20260603-parse_iommu_cells-v16-3-dc509dacb19a@oss.qualcomm.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
|
|
Change of_map_id() to take a pointer to struct of_phandle_args
instead of passing target device node and translated IDs separately.
Update all callers accordingly.
Add an explicit filter_np parameter to of_map_id() and of_map_msi_id()
to separate the filter input from the output. Previously, the target
parameter served dual purpose: as an input filter (if non-NULL, only
match entries targeting that node) and as an output (receiving the
matched node with a reference held). Now filter_np is the explicit
input filter and arg->np is the pure output.
Previously, of_map_id() would call of_node_put() on the matched node
when a filter was provided, making reference ownership inconsistent.
Remove this internal of_node_put() call so that of_map_id() now always
transfers ownership of the matched node reference to the caller via
arg->np. Callers are now consistently responsible for releasing this
reference with of_node_put(arg->np) when done.
Acked-by: Frank Li <Frank.Li@nxp.com>
Suggested-by: Rob Herring (Arm) <robh@kernel.org>
Suggested-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Signed-off-by: Charan Teja Kalla <charan.kalla@oss.qualcomm.com>
Signed-off-by: Vijayanand Jitta <vijayanand.jitta@oss.qualcomm.com>
Link: https://patch.msgid.link/20260603-parse_iommu_cells-v16-2-dc509dacb19a@oss.qualcomm.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
|
|
Since we now have quite a few users parsing "iommu-map" and "msi-map"
properties, give them some wrappers to conveniently encapsulate the
appropriate sets of property names. This will also make it easier to
then change of_map_id() to correctly account for specifier cells.
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Vijayanand Jitta <vijayanand.jitta@oss.qualcomm.com>
Link: https://patch.msgid.link/20260603-parse_iommu_cells-v16-1-dc509dacb19a@oss.qualcomm.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
|
|
iommu_copy_struct_from_full_user_array() copies a whole user array into a
kernel buffer. In the common case, where user entry_len equals destination
entry size, it takes a fast path and copies the whole array with a single
copy_from_user().
That fast path does not return, so it falls through into the item-by-item
copy_struct_from_user() loop and copies every entry a second time. For an
equal entry_len that loop is just a copy_from_user() of the same bytes, so
the whole array is copied twice for no benefit.
Return right after the bulk copy. The per-item loop then runs only on the
slow path, where entry_len differs and each entry needs size adaption.
Fixes: 4f2e59ccb698 ("iommu: Add iommu_copy_struct_from_full_user_array helper")
Link: https://patch.msgid.link/r/6c9eca4ff584cb977661e97799ac6fe934e7f51c.1780521606.git.nicolinc@nvidia.com
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
The secondary_op_tmpl kdoc line has been removed accidentally, add it
back.
Reported-by: Michael Walle <mwalle@kernel.org>
Closes: https://lore.kernel.org/linux-mtd/DJ56CDMRVFQ6.FOZRIQTF3VDW@kernel.org/T/#u
Fixes: 38fbe4b3f66e ("spi: spi-mem: Add a no_cs_assertion capability")
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Link: https://patch.msgid.link/20260612-perso-fix-no-cs-assertion-kdoc-v1-1-626b2d6d0d9b@bootlin.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Merge tag 'mtd/spi-mem-cont-read-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux into spi-7.2
Miquel Raynal <miquel.raynal@bootlin.com> says:
Aside from preparation changes in the SPI NAND core, the changes carried
here focus on the shared spi-mem layer which is enhanced in order to
bring two new features:
- The possibility to fill a primary and a secondary operation template
in the direct mapping structure in order to support continuous reads
in SPI NAND, which may require two different read operations.
- SPI controllers may indicate possible CS instabilities over long
transfers by setting a boolean. This capability is related to the
previous one, the need for it has arised while testing SPI NAND
continuous reads with the Cadence QSPI controller which cannot, under
certain conditions, keep the CS asserted for the length of
an eraseblock-large transfer.
|
|
'rockchip', 'verisilicon', 'riscv', 'intel/vt-d', 'amd/amd-vi' and 'core' into next
|
|
Merge series "slab: support for compiler-assisted type-based slab cache
partitioning" from Marco Elver. From the cover letter [6]:
Rework the general infrastructure around RANDOM_KMALLOC_CACHES into more
flexible KMALLOC_PARTITION_CACHES, with the former being a partitioning
mode of the latter.
Introduce a new mode, KMALLOC_PARTITION_TYPED, which leverages a feature
available in Clang 22 and later, called "allocation tokens" via
__builtin_infer_alloc_token() [1]. Unlike KMALLOC_PARTITION_RANDOM
(formerly RANDOM_KMALLOC_CACHES), this mode deterministically assigns a
slab cache to an allocation of type T, regardless of allocation site.
The builtin __builtin_infer_alloc_token(<malloc-args>, ...) instructs
the compiler to infer an allocation type from arguments commonly passed
to memory-allocating functions and returns a type-derived token ID. The
implementation passes kmalloc-args to the builtin: the compiler performs
best-effort type inference, and then recognizes common patterns such as
`kmalloc(sizeof(T), ...)`, `kmalloc(sizeof(T) * n, ...)`, but also
`(T *)kmalloc(...)`. Where the compiler fails to infer a type the
fallback token (default: 0) is chosen.
Note: kmalloc_obj(..) APIs fix the pattern how size and result type are
expressed, and therefore ensures there's not much drift in which
patterns the compiler needs to recognize. Specifically, kmalloc_obj()
and friends expand to `(TYPE *)KMALLOC(__obj_size, GFP)`, which the
compiler recognizes via the cast to TYPE*.
Clang's default token ID calculation is described as [1]:
typehashpointersplit: This mode assigns a token ID based on the hash
of the allocated type's name, where the top half ID-space is reserved
for types that contain pointers and the bottom half for types that do
not contain pointers.
Separating pointer-containing objects from pointerless objects and data
allocations can help mitigate certain classes of memory corruption
exploits [2]: attackers who gains a buffer overflow on a primitive
buffer cannot use it to directly corrupt pointers or other critical
metadata in an object residing in a different, isolated heap region.
It is important to note that heap isolation strategies offer a
best-effort approach, and do not provide a 100% security guarantee,
albeit achievable at relatively low performance cost. Note that this
also does not prevent cross-cache attacks: while waiting for future
features like SLAB_VIRTUAL [3] to provide physical page isolation, this
feature should be deployed alongside SHUFFLE_PAGE_ALLOCATOR and
init_on_free=1 to mitigate cross-cache attacks and page-reuse attacks as
much as possible today.
With all that, my kernel (x86 defconfig) shows me a histogram of slab
cache object distribution per /proc/slabinfo (after boot):
<slab cache> <objs> <hist>
kmalloc-part-15 1465 ++++++++++++++
kmalloc-part-14 2988 +++++++++++++++++++++++++++++
kmalloc-part-13 1656 ++++++++++++++++
kmalloc-part-12 1045 ++++++++++
kmalloc-part-11 1697 ++++++++++++++++
kmalloc-part-10 1489 ++++++++++++++
kmalloc-part-09 965 +++++++++
kmalloc-part-08 710 +++++++
kmalloc-part-07 100 +
kmalloc-part-06 217 ++
kmalloc-part-05 105 +
kmalloc-part-04 4047 ++++++++++++++++++++++++++++++++++++++++
kmalloc-part-03 183 +
kmalloc-part-02 283 ++
kmalloc-part-01 316 +++
kmalloc 1422 ++++++++++++++
The above /proc/slabinfo snapshot shows me there are 6673 allocated
objects (slabs 00 - 07) that the compiler claims contain no pointers or
it was unable to infer the type of, and 12015 objects that contain
pointers (slabs 08 - 15). On a whole, this looks relatively sane.
Additionally, when I compile my kernel with -Rpass=alloc-token, which
provides diagnostics where (after dead-code elimination) type inference
failed, I see 186 allocation sites where the compiler failed to identify
a type (down from 966 when I sent the RFC [4]). Some initial review
confirms these are mostly variable sized buffers, but also include
structs with trailing flexible length arrays.
Link: https://clang.llvm.org/docs/AllocToken.html [1]
Link: https://blog.dfsec.com/ios/2025/05/30/blasting-past-ios-18/ [2]
Link: https://lwn.net/Articles/944647/ [3]
Link: https://lore.kernel.org/all/20250825154505.1558444-1-elver@google.com/ [4]
Link: https://discourse.llvm.org/t/rfc-a-framework-for-allocator-partitioning-hints/87434 [5]
Link: https://lore.kernel.org/all/20260511200136.3201646-1-elver@google.com/ [6]
|
|
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/westeri/thunderbolt into usb-next
Mika writes:
thunderbolt: Changes for v7.2 merge window
This includes following USB4/Thunderbolt changes for the v7.2 merge
window:
- Make the driver more compliant with the connection manager guide.
- Improvements over Thunderbolt XDomain service handling.
- USB4STREAM driver.
- Split out PCIe bits into pci.c to allow the driver to work on
non-PCIe hosts as well.
- Various fixes and improvements.
All these have been in linux-next with no reported issues.
* tag 'thunderbolt-for-v7.2-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/westeri/thunderbolt: (41 commits)
thunderbolt: debugfs: Fix sideband write size check
thunderbolt: debugfs: Fix margining error counter buffer leak
thunderbolt: test: Release third DP tunnel
thunderbolt: Prevent XDomain delayed work use-after-free on disconnect
thunderbolt: test: Add KUnit tests for property parser bounds checks
thunderbolt: Add some more descriptive probe error messages
thunderbolt: Require nhi->ops be valid
thunderbolt: Separate out common NHI bits
thunderbolt: Move pci_device out of tb_nhi
thunderbolt: Increase Notification Timeout to 255 ms for USB4 routers
thunderbolt: Increase timeout for Configuration Ready bit
thunderbolt: Verify Router Ready bit is set after router enumeration
thunderbolt: Verify PCIe adapter in detect state before tunnel setup
thunderbolt: Activate path hops from source to destination
thunderbolt: Fix lane bonding log when bonding not possible
thunderbolt: Don't access path config space on Lane 1 adapters in tb_switch_reset_host()
thunderbolt: Improve multi-display DisplayPort tunnel allocation
docs: admin-guide: thunderbolt: Add instructions how to use USB4STREAM
thunderbolt: Add support for USB4STREAM
thunderbolt: Add support for ConfigFS
...
|
|
KVM SEV changes for 7.2
- Don't advertise support for unusuable VM types, and account for VM types
that are disabled by firmware, e.g. to mitigate security vulnerabilities.
- Rewrite the SEV {en,de}crypt debug ioctls as they were riddle with bugs and
unnecessarily complicated, and add comprehensive tests.
- Clean up and deduplicate the SEV page pinning code.
- Fix minor goofs related to writing back CPUID information after firmware
rejects a CPUID page for an SNP vCPU.
|
|
KVM generic changes for 7.2
- Rename invalidate_begin() to invalidate_start() throughout KVM to follow
the kernel's nomenclature, e.g. for mmu_notifiers.
- Minor cleanups.
|
|
mr_table.cache_resolve_queue_len is always updated under
spin_lock_bh(&mfc_unres_lock).
Let's convert it to u32.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260609222013.1550355-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR (net-7.1-rc8).
Conflicts:
drivers/net/ethernet/wangxun/txgbe/txgbe_aml.c
f67aead16e85 ("net: txgbe: rework service event handling")
57d39faed4c9 ("net: txgbe: improve functions of AML 40G devices")
net/rds/info.c
512db8267b73 ("rds: mark snapshot pages dirty in rds_info_getsockopt()")
6e94eeb2a2a6 ("rds: convert to getsockopt_iter")
Adjacent changes:
include/net/sock.h
1ee90b77b727 ("net: guard timestamp cmsgs to real error queue skbs")
f0de88303d5e ("net: make is_skb_wmem() available to modules")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Merge cpufreq updates for 7.2:
- Fix a race between cpufreq suspend and CPU hotplug during system
shutdown (Tianxiang Chen)
- Avoid redundant target() calls for unchanged limits and fix a typo
in a comment in the cpufreq core (Viresh Kumar)
- Fix concurrency issues related to sysfs attributes access that affect
cpufreq governors using the common governor code (Zhongqiu Han)
- Simplify frequency limit handling in the conservative cpufreq
governor (Lifeng Zheng)
- Fix descriptions of the conservative governor freq_step tunable and
the ondemand governor sampling_down_factor tunable in the cpufreq
documentation (Pengjie Zhang)
- Fix use-after-free and double free during _OSC evaluation in the PCC
cpufreq driver (Yuho Choi)
- Rework the handling of policy min and max frequency values in the
cpufreq core to allow drivers to specify special initial values for
the scaling_min_freq and scaling_max_freq sysfs attributes (Pierre
Gondois)
- Add cpufreq scaling support for Qualcomm Shikra SoC (Taniya Das,
Imran Shaik).
- Improve the warning message on HWP-disabled hybrid processors printed
by the intel_pstate driver and sync policy->cur during CPU offline in
it (Yohei Kojima, Fushuai Wang)
- Drop cpufreq support for AMD Elan SC4* (Sean Young)
- Minor fixes for cpufreq drivers (Krzysztof Kozlowski, Akashdeep Kaur,
Hans Zhang, Guangshuo Li, Xueqin Luo)
- Clean up dead dependencies on X86 in the cpufreq Kconfig (Julian
Braha)
* pm-cpufreq: (25 commits)
cpufreq: Use policy->min/max init as QoS request
cpufreq: Remove driver default policy->min/max init
cpufreq: Set default policy->min/max values for all drivers
cpufreq: Extract cpufreq_policy_init_qos() function
cpufreq: Documentation: fix conservative governor freq_step description
cpufreq: ti: Add EPROBE_DEFER for K3 SoCs
cpufreq: qcom: Add cpufreq scaling support for Qualcomm Shikra SoC
dt-bindings: cpufreq: Document Qualcomm Shikra SoC EPSS
cpufreq: governor: Fix stale prev_cpu_nice spike when enabling ignore_nice_load
cpufreq: governor: Fix data races on per-CPU idle/nice baselines
cpufreq: intel_pstate: Improve warning message on HWP-disabled hybrid CPUs
cpufreq: elanfreq: Drop support for AMD Elan SC4*
cpufreq: clean up dead dependencies on X86 in Kconfig
cpufreq: conservative: Simplify frequency limit handling
cpufreq: Avoid redundant target() calls for unchanged limits
cpufreq: Fix typo in comment
cpufreq: intel_pstate: Sync policy->cur during CPU offline
cpufreq: Documentation: fix sampling_down_factor range
cpufreq: Fix hotplug-suspend race during reboot
cpufreq: pcc: fix use-after-free and double free in _OSC evaluation
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from IPsec and netfilter.
This is relatively small, mostly because we are a bit behind our PW
queue. I'm not aware of any pending regression.
Current release - regressions:
- netfilter: nf_tables_offload: drop device refcount on error
Previous releases - regressions:
- core: add pskb_may_pull() to skb_gro_receive_list()
- xfrm: iptfs: preserve shared-frag marker in iptfs_consume_frags()
- ipv6: fix a potential NPD in cleanup_prefix_route()
- ipv4: fix use-after-free caused by the fqdir_pre_exit() flush
- eth:
- bnxt_en: fix NULL pointer dereference
- emac: fix use-after-free during device removal
- octeontx2-af: fix memory leak in rvu_setup_hw_resources()
- tun: zero the whole vnet header in tun_put_user()
- sit: reload inner IPv6 header after GSO offloads
Previous releases - always broken:
- core: fix double-free in netdev_nl_bind_rx_doit()
- netfilter: nf_log: validate MAC header was set before dumping it
- xfrm: iptfs: fix ABBA deadlock in iptfs_destroy_state()
- tcp: restrict SO_ATTACH_FILTER to priv users
- mctp: usb: fix race between urb completion and rx_retry
cancellation
- eth:
- mlx5: fix slab-out-of-bounds in mlx5_query_nic_vport_mac_list
- mvpp2: sync RX data at the hardware packet offset"
* tag 'net-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (64 commits)
octeontx2-af: fix IP fragment flag corruption on custom KPU profile load
ipv6: Fix a potential NPD in cleanup_prefix_route()
net: txgbe: initialize PHY interface to 0
net: txgbe: distinguish module types by checking identifier
net: txgbe: initialize module info buffer
net: mvpp2: build skb from XDP-adjusted data on XDP_PASS
net: mvpp2: refill RX buffers before XDP or skb use
net: mvpp2: limit XDP frame size to the RX buffer
net: mvpp2: sync RX data at the hardware packet offset
netfilter: nft_meta_bridge: fix stale stack leak via IIFHWADDR register
netfilter: nft_fib: fix stale stack leak via the OIFNAME register
netfilter: nft_exthdr: fix register tracking for F_PRESENT flag
netfilter: nf_log: validate MAC header was set before dumping it
netfilter: x_tables: avoid leaking percpu counter pointers
netfilter: nf_conntrack: destroy stale expectfn expectations on unregister
netfilter: nf_tables_offload: drop device refcount on error
netfilter: revalidate bridge ports
rds: mark snapshot pages dirty in rds_info_getsockopt()
ip6_vti: fix incorrect tunnel matching in vti6_tnl_lookup()
ptp: ocp: fix resource freeing order
...
|
|
|