| Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fix from Vlastimil Babka:
- A stable fix for performance regression in tests that perform
kmem_cache_destroy() a lot, due to unnecessarily wide scope of
kvfree_rcu_barrier() (Harry Yoo)
* tag 'slab-for-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Heiko Carstens:
- Use the MSI parent domain API instead of the legacy API for setup and
teardown of PCI MSI IRQs
- Select POSIX_CPU_TIMERS_TASK_WORK now that VIRT_XFER_TO_GUEST_WORK
has been implemented for s390
- Fix a KVM bug which can lead to guest memory corruption
- Fix KASAN shadow memory mapping for hotplugged memory
- Minor bug fixes and improvements
* tag 's390-6.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/bug: Add missing alignment
s390/bug: Add missing CONFIG_BUG ifdef again
KVM: s390: Fix gmap_helper_zap_one_page() again
s390/pci: Migrate s390 IRQ logic to IRQ domain API
genirq: Change hwirq parameter to irq_hw_number_t
s390: Select POSIX_CPU_TIMERS_TASK_WORK
s390: Unmap early KASAN shadow on memory offlining
s390/vmem: Support 2G page splitting for KASAN shadow freeing
s390/boot: Use entire page for PTEs
s390/vmur: Use scnprintf() instead of sprintf()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
- last minute fix for missing parenthesis in recently merged code (Hans
de Goede)
- removal of excessive, non-fatal warnings (Dave Kleikamp)
* tag 'dma-mapping-6.19-2025-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-mapping: Fix DMA_BIT_MASK() macro being broken
dma/pool: eliminate alloc_pages warning in atomic_pool_expand
|
|
Deeply daisy chained DP/MST displays are no longer able to light
up. This reverts commit e0dec00f3d05 ("drm/amd/display: Fix pbn
to kbps Conversion")
Cc: Jerry Zuo <jerry.zuo@amd.com>
Reported-by: nat@nullable.se
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4756
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit e1c94109c76e8a77a21531bd53f6c63356c81158)
Cc: stable@vger.kernel.org # 6.17+
|
|
Unbinding amdgpu has no problems, but binding it again leads to an
error of sysfs file already existing. This is because it wasn't
actually cleaned up on unbind. Add the missing cleanup step.
Fixes: 547aad32edac ("drm/amdgpu: add VCN4 ip block support")
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit d717e62e9b6ccff0e3cec78a58dfbd00858448b3)
Cc: stable@vger.kernel.org
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/lindholm/alpha
Pull alpha updates from Magnus Lindholm:
"Two small uapi fixes. One patch hardcodes TC* ioctl values that
previously depended on the deprecated termio struct, avoiding build
issues with newer glibc versions. The other patch switches uapi
headers to use the compiler-defined __ASSEMBLER__ macro for better
consistency between kernel and userspace.
- don't reference obsolete termio struct for TC* constants
- Replace __ASSEMBLY__ with __ASSEMBLER__ in the alpha headers"
* tag 'alpha-for-v6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/lindholm/alpha:
alpha: don't reference obsolete termio struct for TC* constants
alpha: Replace __ASSEMBLY__ with __ASSEMBLER__ in the alpha headers
|
|
Pull ARM updates from Russell King:
- disable jump label and high PTE for PREEMPT RT kernels
- fix input operand modification in load_unaligned_zeropad()
- fix hash_name() / fault path induced warnings
- fix branch predictor hardening
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux:
ARM: fix branch predictor hardening
ARM: fix hash_name() fault
ARM: allow __do_kernel_fault() to report execution of memory faults
ARM: group is_permission_fault() with is_translation_fault()
ARM: 9464/1: fix input-only operand modification in load_unaligned_zeropad()
ARM: 9461/1: Disable HIGHPTE on PREEMPT_RT kernels
ARM: 9459/1: Disable jump-label on PREEMPT_RT
|
|
Ensure the dma state is initialized when we're not using the contiguous
iova, otherwise the caller may be using a stale state from a previous
request that could use the coalesed iova allocation.
Fixes: 2f6b2565d43cdb5 ("block: accumulate memory segment gaps per bio")
Reported-by: Sebastian Ott <sebott@redhat.com>
Tested-by: Sebastian Ott <sebott@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Segment info indexing also used sizeof(struct) instead of the
4K metadata stride, so info_index could point between slots and
subsequent writes would advance incorrectly. Derive info_index
from the pointer returned by the segment meta search using
PCACHE_SEG_INFO_SIZE and advance to the next slot for future
updates.
Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Zheng Gu <cengku@gmail.com>
Cc: stable@vger.kernel.org # 6.18
|
|
The on-media cache_info index used sizeof(struct) instead of the
4K metadata stride, so gc_percent updates from dmsetup message
were written between slots and lost after reboot. Use
PCACHE_CACHE_INFO_SIZE in get_cache_info_addr() and align
info_index with the slot returned by pcache_meta_find_latest().
Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Zheng Gu <cengku@gmail.com>
Cc: stable@vger.kernel.org # 6.18
|
|
In dm-pcache, in order to ensure crash-consistency, a dual-copy scheme
is used to alternately update metadata, and there is a slot index that
records the current slot. However, in the write path the current
implementation writes directly to the current slot indexed by slot
index, and then advances the slot — which ends up overwriting the
existing slot, violating the crash-consistency guarantee.
This patch fixes that behavior, preventing metadata from being
overwritten incorrectly.
In addition, this patch add a missing pmem_wmb() after memcpy_flushcache().
Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Zheng Gu <cengku@gmail.com>
Cc: stable@vger.kernel.org # 6.18
|
|
examples
Also enhance possible takeover/reshape information and do some reformatting.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
The log_writes_kthread() calls try_to_freeze() but lacks set_freezable(),
rendering the freeze attempt ineffective since kernel threads are
non-freezable by default. This prevents proper thread suspension during
system suspend/hibernate.
Add set_freezable() to explicitly mark the thread as freezable.
Fixes: 0e9cebe72459 ("dm: add log writes target")
Signed-off-by: Haotian Zhang <vulab@iscas.ac.cn>
Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
rs->raid_type is assigned from get_raid_type_by_ll(), which may return
NULL. This NULL value could be dereferenced later in the condition
'if (!(rs_is_raid10(rs) && rt_is_raid0(rs->raid_type)))'.
Add a fail-fast check to return early with an error if raid_type is NULL,
similar to other uses of this function.
Found by Linux Verification Center (linuxtesting.org) with Svace.
Fixes: 33e53f06850f ("dm raid: introduce extended superblock and new raid types to support takeover/reshaping")
Signed-off-by: Alexey Simakov <bigalex934@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
There is reported 'scheduling while atomic' bug when using dm-snapshot on
real-time kernels. The reason for the bug is that the hlist_bl code does
preempt_disable() when taking the lock and the kernel attempts to take
other spinlocks while holding the hlist_bl lock.
Fix this by converting a hlist_bl spinlock into a regular spinlock.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Jiping Ma <jiping.ma2@windriver.com>
|
|
__blkdev_issue_discard() always returns 0, making all error checking
at call sites dead code.
For dm-thin change issue_discard() return type to void, in
passdown_double_checking_shared_status() remove the r assignment from
return value of the issue_discard(), for end_discard() hardcode value of
r to 0 that matches only value returned from __blkdev_issue_discard().
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
Ben will be working with me as a maintainer, so add him to the
MAINTAINERS file.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
|
|
There's no point to the MPATHF_RETAIN_ATTACHED_HW_HANDLER flag any more.
The way setup_scsi_dh() worked, if that flag wasn't set, it would
attempt to attach any passed in hardware handler. This would always fail
if a different hardware handler was attached, which caused
setup_scsi_dh() to rerun as if the flag was set. So the code would
already retain any attached handler, because attaching a different one
would always fail.
Also, the code had a bug. If attached_handler_name was NULL but there
was a scsi device handler attached (because either
scsi_dh_attached_handler_name failed() to allocate a name, a handler got
attached after it was called) the code would loop endlessly.
Instead, ignore MPATHF_RETAIN_ATTACHED_HW_HANDLER, and always free the
passed in handler if *attached_handler_name is set. This simplifies the
code, and avoids the endless loop bug, while keeping the functionality
the same.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
Fix kerneldoc warnings across the dm-vdo target. Also
remove some unhelpful or inaccurate doc comments, and fix
some format inconsistencies that did not produce warnings.
No functional changes.
Suggested-by: Sunday Adelodun <adelodunolaoluwa@yahoo.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
There may be devices with physical block size larger than 4k.
If dm-bufio sends I/O that is not aligned on physical block size,
performance is degraded.
The 4k minimum alignment limit is there because some SSDs report logical
and physical block size 512 despite having 4k internally - so dm-bufio
shouldn't send I/Os not aligned on 4k boundary, because they perform
badly (the SSD does read-modify-write for them).
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: stable@vger.kernel.org
|
|
Allow handling of bios with REQ_ATOMIC flag set.
Don't split these bios and fail them if they overrun the hard limit
"BIO_MAX_VECS << PAGE_SHIFT".
In order to simplify the code, this commit joins the logic that avoids
splitting emulated zone append bios with the logic that avoids
splitting atomic write bios.
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: John Garry <john.g.garry@oracle.com>
|
|
Any bio with REQ_ATOMIC flag set should never be split or partially
completed, so BUG_ON() on this scenario in dm_accept_partial_bio() (whose
intent is to allow partial completions).
Also, we must reject atomic bio to targets that don't support them,
otherwise this BUG could be triggered by stray bios that have the
REQ_ATOMIC set.
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: John Garry <john.g.garry@oracle.com>
|
|
v->fec->extra_pool has zero reserved entries, so we can remove it and use
the kernel cache directly.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
|
|
There are two problems with the recursive correction:
1. It may cause denial-of-service. In fec_read_bufs, there is a loop that
has 253 iterations. For each iteration, we may call verity_hash_for_block
recursively. There is a limit of 4 nested recursions - that means that
there may be at most 253^4 (4 billion) iterations. Red Hat QE team
actually created an image that pushes dm-verity to this limit - and this
image just makes the udev-worker process get stuck in the 'D' state.
2. It doesn't work. In fec_read_bufs we store data into the variable
"fio->bufs", but fio bufs is shared between recursive invocations, if
"verity_hash_for_block" invoked correction recursively, it would
overwrite partially filled fio->bufs.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Guangwu Zhang <guazhang@redhat.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
|
|
Commit 9a7c987fb92b ("crypto: arm64/ghash - Use API partial block
handling") made ghash_finup() pass the wrong buffer to
ghash_do_simd_update(). As a result, ghash-neon now produces incorrect
outputs when the message length isn't divisible by 16 bytes. Fix this.
(I didn't notice this earlier because this code is reached only on CPUs
that support NEON but not PMULL. I haven't yet found a way to get
qemu-system-aarch64 to emulate that configuration.)
Fixes: 9a7c987fb92b ("crypto: arm64/ghash - Use API partial block handling")
Cc: stable@vger.kernel.org
Reported-by: Diederik de Haas <diederik@cknow-tech.com>
Closes: https://lore.kernel.org/linux-crypto/DETXT7QI62KE.F3CGH2VWX1SC@cknow-tech.com/
Tested-by: Diederik de Haas <diederik@cknow-tech.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/20251209223417.112294-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
When config table entries don't match with the device to be probed,
currently we fall back to SND_INTEL_DSP_DRIVER_ANY, which means to
allow any drivers to bind with it.
This was set so with the assumption (or hope) that all controller
drivers should cover the devices generally, but in practice, this
caused a problem as reported recently. Namely, when a specific
kconfig for SOF isn't set for the modern Intel chips like Alderlake,
a wrong driver (AVS) got probed and failed. This is because we have
entries like:
#if IS_ENABLED(CONFIG_SND_SOC_SOF_ALDERLAKE)
/* Alder Lake / Raptor Lake */
{
.flags = FLAG_SOF | FLAG_SOF_ONLY_IF_DMIC_OR_SOUNDWIRE,
.device = PCI_DEVICE_ID_INTEL_HDA_ADL_S,
},
....
#endif
so this entry is effective only when CONFIG_SND_SOC_SOF_ALDERLAKE is
set. If not set, there is no matching entry, hence it returns
SND_INTEL_DSP_DRIVER_ANY as fallback. OTOH, if the kconfig is set, it
explicitly falls back to SND_INTEL_DSP_DRIVER_LEGACY when no DMIC or
SoundWire is found -- that was the working scenario. That being said,
the current setup may be broken for modern Intel chips that are
supposed to work with either SOF or legacy driver when the
corresponding kconfig were missing.
For addressing the problem above, this patch changes the fallback
driver to the legacy driver, i.e. return SND_INTEL_DSP_DRIVER_LEGACY
type as much as possible. When CONFIG_SND_HDA_INTEL is also disabled,
the fallback is set to SND_INTEL_DSP_DRIVER_ANY type, just to be sure.
Reported-by: Askar Safin <safinaskar@gmail.com>
Closes: https://lore.kernel.org/all/20251014034156.4480-1-safinaskar@gmail.com/
Tested-by: Askar Safin <safinaskar@gmail.com>
Reviewed-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/20251210131553.184404-1-tiwai@suse.de
|
|
When CONFIG_SLUB_TINY is enabled, kfree_nolock() calls kasan_slab_free()
before defer_free(). On ARM64 with MTE (Memory Tagging Extension),
kasan_slab_free() poisons the memory and changes the tag from the
original (e.g., 0xf3) to a poison tag (0xfe).
When defer_free() then tries to write to the freed object to build the
deferred free list via llist_add(), the pointer still has the old tag,
causing a tag mismatch and triggering a KASAN use-after-free report:
BUG: KASAN: slab-use-after-free in defer_free+0x3c/0xbc mm/slub.c:6537
Write at addr f3f000000854f020 by task kworker/u8:6/983
Pointer tag: [f3], memory tag: [fe]
Fix this by calling kasan_reset_tag() before accessing the freed memory.
This is safe because defer_free() is part of the allocator itself and is
expected to manipulate freed memory for bookkeeping purposes.
Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Cc: stable@vger.kernel.org
Reported-by: syzbot+7a25305a76d872abcfa1@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7a25305a76d872abcfa1
Tested-by: syzbot+7a25305a76d872abcfa1@syzkaller.appspotmail.com
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://patch.msgid.link/20251210022024.3255826-1-kartikey406@gmail.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
|
|
__do_user_fault() may be called with indeterminent interrupt enable
state, which means we may be preemptive at this point. This causes
problems when calling harden_branch_predictor(). For example, when
called from a data abort, do_alignment_fault()->do_bad_area().
Move harden_branch_predictor() out of __do_user_fault() and into the
calling contexts.
Moving it into do_kernel_address_page_fault(), we can be sure that
interrupts will be disabled here.
Converting do_translation_fault() to use do_kernel_address_page_fault()
rather than do_bad_area() means that we keep branch predictor handling
for translation faults. Interrupts will also be disabled at this call
site.
do_sect_fault() needs special handling, so detect user mode accesses
to kernel-addresses, and add an explicit call to branch predictor
hardening.
Finally, add branch predictor hardening to do_alignment() for the
faulting case (user mode accessing kernel addresses) before interrupts
are enabled.
This should cover all cases where harden_branch_predictor() is called,
ensuring that it is always has interrupts disabled, also ensuring that
it is called early in each call path.
Reviewed-by: Xie Yuanbin <xieyuanbin1@huawei.com>
Tested-by: Xie Yuanbin <xieyuanbin1@huawei.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
|
|
Zizhi Wo reports:
"During the execution of hash_name()->load_unaligned_zeropad(), a
potential memory access beyond the PAGE boundary may occur. For
example, when the filename length is near the PAGE_SIZE boundary.
This triggers a page fault, which leads to a call to
do_page_fault()->mmap_read_trylock(). If we can't acquire the lock,
we have to fall back to the mmap_read_lock() path, which calls
might_sleep(). This breaks RCU semantics because path lookup occurs
under an RCU read-side critical section."
This is seen with CONFIG_DEBUG_ATOMIC_SLEEP=y and CONFIG_KFENCE=y.
Kernel addresses (with the exception of the vectors/kuser helper
page) do not have VMAs associated with them. If the vectors/kuser
helper page faults, then there are two possibilities:
1. if the fault happened while in kernel mode, then we're basically
dead, because the CPU won't be able to vector through this page
to handle the fault.
2. if the fault happened while in user mode, that means the page was
protected from user access, and we want to fault anyway.
Thus, we can handle kernel addresses from any context entirely
separately without going anywhere near the mmap lock. This gives us
an entirely non-sleeping path for all kernel mode kernel address
faults.
As we handle the kernel address faults before interrupts are enabled,
this change has the side effect of improving the branch predictor
hardening, but does not completely solve the issue.
Reported-by: Zizhi Wo <wozizhi@huaweicloud.com>
Reported-by: Xie Yuanbin <xieyuanbin1@huawei.com>
Link: https://lore.kernel.org/r/20251126090505.3057219-1-wozizhi@huaweicloud.com
Reviewed-by: Xie Yuanbin <xieyuanbin1@huawei.com>
Tested-by: Xie Yuanbin <xieyuanbin1@huawei.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
|
|
Allow __do_kernel_fault() to detect the execution of memory, so we can
provide the same fault message as do_page_fault() would do. This is
required when we split the kernel address fault handling from the
main do_page_fault() code path.
Reviewed-by: Xie Yuanbin <xieyuanbin1@huawei.com>
Tested-by: Xie Yuanbin <xieyuanbin1@huawei.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
|
|
Jakub says: "We try to reserve SKIP for tests skipped because tool is
missing in env, something isn't built into the kernel etc."
use xfail, we can't force the race condition to appear at will
so its expected that the test 'fails' occasionally.
Fixes: 78a588363587 ("selftests: netfilter: add conntrack clash resolution test case")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/netdev/20251206175647.5c32f419@kernel.org/
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
Always set nf_flow_route tuple out ifindex even if the indev is not one
of the flowtable configured devices since otherwise the outdev lookup in
nf_flow_offload_ip_hook() or nf_flow_offload_ipv6_hook() for
FLOW_OFFLOAD_XMIT_NEIGH flowtable entries will fail.
The above issue occurs in the following configuration since IP6IP6
tunnel does not support flowtable acceleration yet:
$ip addr show
5: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 00:11:22:33:22:55 brd ff:ff:ff:ff:ff:ff link-netns ns1
inet6 2001:db8:1::2/64 scope global nodad
valid_lft forever preferred_lft forever
inet6 fe80::211:22ff:fe33:2255/64 scope link tentative proto kernel_ll
valid_lft forever preferred_lft forever
6: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 00:22:22:33:22:55 brd ff:ff:ff:ff:ff:ff link-netns ns3
inet6 2001:db8:2::1/64 scope global nodad
valid_lft forever preferred_lft forever
inet6 fe80::222:22ff:fe33:2255/64 scope link tentative proto kernel_ll
valid_lft forever preferred_lft forever
7: tun0@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1452 qdisc noqueue state UNKNOWN group default qlen 1000
link/tunnel6 2001:db8:2::1 peer 2001:db8:2::2 permaddr a85:e732:2c37::
inet6 2002:db8:1::1/64 scope global nodad
valid_lft forever preferred_lft forever
inet6 fe80::885:e7ff:fe32:2c37/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
$ip -6 route show
2001:db8:1::/64 dev eth0 proto kernel metric 256 pref medium
2001:db8:2::/64 dev eth1 proto kernel metric 256 pref medium
2002:db8:1::/64 dev tun0 proto kernel metric 256 pref medium
default via 2002:db8:1::2 dev tun0 metric 1024 pref medium
$nft list ruleset
table inet filter {
flowtable ft {
hook ingress priority filter
devices = { eth0, eth1 }
}
chain forward {
type filter hook forward priority filter; policy accept;
meta l4proto { tcp, udp } flow add @ft
}
}
Fixes: b5964aac51e0 ("netfilter: flowtable: consolidate xmit path")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
The IPv4 code path in __ip_vs_get_out_rt() calls dst_link_failure()
without ensuring skb->dev is set, leading to a NULL pointer dereference
in fib_compute_spec_dst() when ipv4_link_failure() attempts to send
ICMP destination unreachable messages.
The issue emerged after commit ed0de45a1008 ("ipv4: recompile ip options
in ipv4_link_failure") started calling __ip_options_compile() from
ipv4_link_failure(). This code path eventually calls fib_compute_spec_dst()
which dereferences skb->dev. An attempt was made to fix the NULL skb->dev
dereference in commit 0113d9c9d1cc ("ipv4: fix null-deref in
ipv4_link_failure"), but it only addressed the immediate dev_net(skb->dev)
dereference by using a fallback device. The fix was incomplete because
fib_compute_spec_dst() later in the call chain still accesses skb->dev
directly, which remains NULL when IPVS calls dst_link_failure().
The crash occurs when:
1. IPVS processes a packet in NAT mode with a misconfigured destination
2. Route lookup fails in __ip_vs_get_out_rt() before establishing a route
3. The error path calls dst_link_failure(skb) with skb->dev == NULL
4. ipv4_link_failure() → ipv4_send_dest_unreach() →
__ip_options_compile() → fib_compute_spec_dst()
5. fib_compute_spec_dst() dereferences NULL skb->dev
Apply the same fix used for IPv6 in commit 326bf17ea5d4 ("ipvs: fix
ipv6 route unreach panic"): set skb->dev from skb_dst(skb)->dev before
calling dst_link_failure().
KASAN: null-ptr-deref in range [0x0000000000000328-0x000000000000032f]
CPU: 1 PID: 12732 Comm: syz.1.3469 Not tainted 6.6.114 #2
RIP: 0010:__in_dev_get_rcu include/linux/inetdevice.h:233
RIP: 0010:fib_compute_spec_dst+0x17a/0x9f0 net/ipv4/fib_frontend.c:285
Call Trace:
<TASK>
spec_dst_fill net/ipv4/ip_options.c:232
spec_dst_fill net/ipv4/ip_options.c:229
__ip_options_compile+0x13a1/0x17d0 net/ipv4/ip_options.c:330
ipv4_send_dest_unreach net/ipv4/route.c:1252
ipv4_link_failure+0x702/0xb80 net/ipv4/route.c:1265
dst_link_failure include/net/dst.h:437
__ip_vs_get_out_rt+0x15fd/0x19e0 net/netfilter/ipvs/ip_vs_xmit.c:412
ip_vs_nat_xmit+0x1d8/0xc80 net/netfilter/ipvs/ip_vs_xmit.c:764
Fixes: ed0de45a1008 ("ipv4: recompile ip options in ipv4_link_failure")
Signed-off-by: Slavin Liu <slavin452@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
There are some situations where ct might be leaked as error paths are
skipping the refcounted check and return immediately. In order to solve
it make sure that the check is always called.
Fixes: be102eb6a0e7 ("netfilter: nf_conncount: rework API to use sk_buff directly")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
None of the RBD code directly requires CRC32, CRYPTO, or CRYPTO_AES.
These options are needed by CEPH_LIB code and they are selected there
directly.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@linux.dev>
|
|
None of the CEPH_FS code directly requires CRC32, CRYPTO, or CRYPTO_AES.
These options do get selected indirectly anyway via CEPH_LIB, which does
need them, but there is no need for CEPH_FS to select them too.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
If the osdmap is (maliciously) corrupted such that the encoded length
of ceph_pg_pool envelope is less than what is expected for a particular
encoding version, out-of-bounds reads may ensue because the only bounds
check that is there is based on that length value.
This patch adds explicit bounds checks for each field that is decoded
or skipped.
Cc: stable@vger.kernel.org
Reported-by: ziming zhang <ezrakiez@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Tested-by: ziming zhang <ezrakiez@gmail.com>
|
|
In a few cases the code compares 32-bit value to a SIZE_MAX derived
constant which is much higher than that value on 64-bit platforms,
Clang, in particular, is not happy about this
net/ceph/osdmap.c:1441:10: error: result of comparison of constant 4611686018427387891 with expression of type 'u32' (aka 'unsigned int') is always false [-Werror,-Wtautological-constant-out-of-range-compare]
1441 | if (len > (SIZE_MAX - sizeof(*pg)) / sizeof(u32))
| ~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
net/ceph/osdmap.c:1624:10: error: result of comparison of constant 2305843009213693945 with expression of type 'u32' (aka 'unsigned int') is always false [-Werror,-Wtautological-constant-out-of-range-compare]
1624 | if (len > (SIZE_MAX - sizeof(*pg)) / (2 * sizeof(u32)))
| ~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fix this by casting to size_t. Note, that possible replacement of SIZE_MAX
by U32_MAX may lead to the behaviour changes on the corner cases.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
In a few cases the code compares 32-bit value to a SIZE_MAX derived
constant which is much higher than that value on 64-bit platforms,
Clang, in particular, is not happy about this
fs/ceph/snap.c:377:10: error: result of comparison of constant 2305843009213693948 with expression of type 'u32' (aka 'unsigned int') is always false [-Werror,-Wtautological-constant-out-of-range-compare]
377 | if (num > (SIZE_MAX - sizeof(*snapc)) / sizeof(u64))
| ~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fix this by casting to size_t. Note, that possible replacement of SIZE_MAX
by U32_MAX may lead to the behaviour changes on the corner cases.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
This patch adds trace points to the Ceph filesystem MDS client:
- request submission (CEPH_MSG_CLIENT_REQUEST) and completion
(CEPH_MSG_CLIENT_REPLY)
- capabilities (CEPH_MSG_CLIENT_CAPS)
These are the central pieces that are useful for analyzing MDS
latency/performance problems from the client's perspective.
In the long run, all doutc() calls should be replaced with
tracepoints. This way, the Ceph filesystem can be traced at any time
(without spamming the kernel log). Additionally, trace points can be
used in BPF programs (which can even deference the pointer parameters
and extract more values).
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
OSD client logging has a problem in get_osd() and put_osd().
For one logging output refcount_read() is called twice. If recount
value changes between both calls logging output is not consistent.
This patch prints out only the resulting value.
[ idryomov: don't make the log messages more verbose ]
Signed-off-by: Simon Buttgereit <simon.buttgereit@tu-ilmenau.de>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
blk_hctx_poll() always checks if the task is running or not, and returns
1 if the task is running. This is a leftover from when polled IO was
purely for synchronous IO, and doesn't make sense anymore when polled IO
is purely asynchronous. Similarly, marking the task as TASK_RUNNING is
also superflous, as the very much has to be running to enter the
function in the first place.
It looks like there has been this judgment for historical reasons, and
in very early versions of this function the user would set the process
state to TASK_UNINTERRUPTIBLE.
Signed-off-by: Diangang Li <lidiangang@bytedance.com>
Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
[axboe: kill all remnants of task running, pointless now. massage message]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Shuran Liu says:
====================
bpf: fix bpf_d_path() helper prototype
Hi,
This series fixes a verifier issue with bpf_d_path() and adds a
regression test to cover its use within a hook function.
Patch 1 updates the bpf_d_path() helper prototype so that the second
argument is marked as MEM_WRITE. This makes it explicit to the verifier
that the helper writes into the provided buffer.
Patch 2 extends the existing d_path selftest to cover incorrect verifier
assumptions caused by an incorrect function prototype. The test program calls
bpf_d_path() and checks if the first character of the path can be read.
It ensures the verifier does not assume the buffer remains unwritten.
Changelog
=========
v5:
- Moved the temporary file for the fallocate test from /tmp to /dev/shm
Since bpf CI's 9P filesystem under /tmp does not support fallocate.
v4:
- Use the fallocate hook instead of an LSM hook to simplify the selftest,
as suggested by Matt and Alexei.
- Add a utility function in test_d_path.c to load the BPF program,
improving code reuse.
v3:
- Switch the pathname prefix loop to use bpf_for() instead of
#pragma unroll, as suggested by Matt.
- Remove /tmp/bpf_d_path_test in the test cleanup path.
- Add the missing Reviewed-by tags.
v2:
- Merge the new test into the existing d_path selftest rather than
creating new files.
- Add PID filtering in the LSM program to avoid nondeterministic failures
due to unrelated processes triggering bprm_check_security.
- Synchronize child execution using a pipe to ensure deterministic
updates to the PID.
Thanks for your time and reviews.
====================
Link: https://patch.msgid.link/20251206141210.3148-1-electronlsr@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add a regression test for bpf_d_path() to cover incorrect verifier
assumptions caused by an incorrect function prototype. The test
attaches to the fallocate hook, calls bpf_d_path() and verifies that
a simple prefix comparison on the returned pathname behaves correctly
after the fix in patch 1. It ensures the verifier does not assume
the buffer remains unwritten.
Co-developed-by: Zesen Liu <ftyg@live.com>
Signed-off-by: Zesen Liu <ftyg@live.com>
Co-developed-by: Peili Gao <gplhust955@gmail.com>
Signed-off-by: Peili Gao <gplhust955@gmail.com>
Co-developed-by: Haoran Ni <haoran.ni.cs@gmail.com>
Signed-off-by: Haoran Ni <haoran.ni.cs@gmail.com>
Signed-off-by: Shuran Liu <electronlsr@gmail.com>
Link: https://lore.kernel.org/r/20251206141210.3148-3-electronlsr@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Commit 37cce22dbd51 ("bpf: verifier: Refactor helper access type
tracking") started distinguishing read vs write accesses performed by
helpers.
The second argument of bpf_d_path() is a pointer to a buffer that the
helper fills with the resulting path. However, its prototype currently
uses ARG_PTR_TO_MEM without MEM_WRITE.
Before 37cce22dbd51, helper accesses were conservatively treated as
potential writes, so this mismatch did not cause issues. Since that
commit, the verifier may incorrectly assume that the buffer contents
are unchanged across the helper call and base its optimizations on this
wrong assumption. This can lead to misbehaviour in BPF programs that
read back the buffer, such as prefix comparisons on the returned path.
Fix this by marking the second argument of bpf_d_path() as
ARG_PTR_TO_MEM | MEM_WRITE so that the verifier correctly models the
write to the caller-provided buffer.
Fixes: 37cce22dbd51 ("bpf: verifier: Refactor helper access type tracking")
Co-developed-by: Zesen Liu <ftyg@live.com>
Signed-off-by: Zesen Liu <ftyg@live.com>
Co-developed-by: Peili Gao <gplhust955@gmail.com>
Signed-off-by: Peili Gao <gplhust955@gmail.com>
Co-developed-by: Haoran Ni <haoran.ni.cs@gmail.com>
Signed-off-by: Haoran Ni <haoran.ni.cs@gmail.com>
Signed-off-by: Shuran Liu <electronlsr@gmail.com>
Reviewed-by: Matt Bobrowski <mattbobrowski@google.com>
Link: https://lore.kernel.org/r/20251206141210.3148-2-electronlsr@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Jakub Kicinski says:
====================
inet: frags: flush pending skbs in fqdir_pre_exit()
Fix the issue reported by NIPA starting on Sep 18th [1], where
pernet_ops_rwsem is constantly held by a reader, preventing writers
from grabbing it (specifically driver modules from loading).
The fact that reports started around that time seems coincidental.
The issue seems to be skbs queued for defrag preventing conntrack
from exiting.
First patch fixes another theoretical issue, it's mostly a leftover
from an attempt to get rid of the inet_frag_queue refcnt, which
I gave up on (still think it's doable but a bit of a time sink).
Second patch is a minor refactor.
The real fix is in the third patch. It's the simplest fix I can
think of which is to flush the frag queues. Perhaps someone has
a better suggestion?
Last patch adds an explicit warning for conntrack getting stuck,
as this seems like something that can easily happen if bugs sneak in.
The warning will hopefully save us the first 20% of the investigation
effort.
Link: https://lore.kernel.org/20251001082036.0fc51440@kernel.org # [1]
====================
Link: https://patch.msgid.link/20251207010942.1672972-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
nf_conntrack_cleanup_net_list() calls schedule() so it does not
show up as a hung task. Add an explicit check to make debugging
leaked skbs/conntack references more obvious.
Acked-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251207010942.1672972-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We have been seeing occasional deadlocks on pernet_ops_rwsem since
September in NIPA. The stuck task was usually modprobe (often loading
a driver like ipvlan), trying to take the lock as a Writer.
lockdep does not track readers for rwsems so the read wasn't obvious
from the reports.
On closer inspection the Reader holding the lock was conntrack looping
forever in nf_conntrack_cleanup_net_list(). Based on past experience
with occasional NIPA crashes I looked thru the tests which run before
the crash and noticed that the crash follows ip_defrag.sh. An immediate
red flag. Scouring thru (de)fragmentation queues reveals skbs sitting
around, holding conntrack references.
The problem is that since conntrack depends on nf_defrag_ipv6,
nf_defrag_ipv6 will load first. Since nf_defrag_ipv6 loads first its
netns exit hooks run _after_ conntrack's netns exit hook.
Flush all fragment queue SKBs during fqdir_pre_exit() to release
conntrack references before conntrack cleanup runs. Also flush
the queues in timer expiry handlers when they discover fqdir->dead
is set, in case packet sneaks in while we're running the pre_exit
flush.
The commit under Fixes is not exactly the culprit, but I think
previously the timer firing would eventually unblock the spinning
conntrack.
Fixes: d5dd88794a13 ("inet: fix various use-after-free in defrags units")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251207010942.1672972-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Instead of exporting inet_frag_rbtree_purge() which requires that
caller takes care of memory accounting, add a new helper. We will
need to call it from a few places in the next patch.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251207010942.1672972-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|