Merge series from Stefan Binding <sbinding@opensource.cirrus.com>:
In PC systems using ACPI, the driver is able to read back an SSID from
the _SUB property. This SSID uniquely identifies the system, which
enables the driver to read the correct firmware and tuning for that
system from linux-firmware. Currently there is no way of reading this
property from device tree. Add an equivalent property in device tree
to perform the same role.
|
|
can_set_static_ctrlmode() is declared as a static inline, but it is
only called from device probe functions and so does not really benefit
from any kind of optimization.
Transform it into a "normal" function by moving it to
drivers/net/can/dev/dev.c.
Signed-off-by: Vincent Mailhol <mailhol@kernel.org>
Link: https://patch.msgid.link/20250923-can-fix-mtu-v3-2-581bde113f52@kernel.org
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Commit 79525b51acc1 ("io_uring: fix nvme's 32b cqes on mixed cq") split
out a separate io_uring_cmd_done32() helper for ->uring_cmd()
implementations that return 32-byte CQEs. The res2 value passed to
io_uring_cmd_done() is now unused because __io_uring_cmd_done() ignores
it when is_cqe32 is passed as false. So drop the parameter from
io_uring_cmd_done() to simplify the callers and clarify that it's not
possible to return an extra value beyond the 32-bit CQE result.
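For illustration, a rough sketch of the resulting helper prototypes (the exact
parameter names and types are assumptions; see include/linux/io_uring/cmd.h):
    /* regular (16-byte) CQE completions no longer take a res2 value */
    void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret,
                           unsigned int issue_flags);
    /* 32-byte CQE completions keep the extra result */
    void io_uring_cmd_done32(struct io_uring_cmd *ioucmd, ssize_t ret,
                             u64 res2, unsigned int issue_flags);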
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In commit bb666b7c2707 ("mm: add mmap_prepare() compatibility layer for
nested file systems") we introduced the ability for stacked drivers and
file systems to correctly invoke the f_op->mmap_prepare() handler from an
f_op->mmap() handler via a compatibility layer implemented in
compat_vma_mmap_prepare().
This populates vm_area_desc fields according to those found in the (not
yet fully initialised) VMA passed to f_op->mmap().
However this function implicitly assumes that the struct file which we are
operating upon is equal to vma->vm_file. This is not a safe assumption in
all cases.
The only really sane situation in which this matters would be something
like i915_gem_dmabuf_mmap(), which invokes vfs_mmap() against
obj->base.filp:
ret = vfs_mmap(obj->base.filp, vma);
if (ret)
return ret;
And then sets the VMA's file to this, should the mmap operation succeed:
vma_set_file(vma, obj->base.filp);
That is - it is the file that is intended to back the VMA mapping.
This is not an issue currently, as so far we have only implemented
f_op->mmap_prepare() handlers for some file systems and internal mm uses,
and the only stacked f_op->mmap() operations that can be performed upon
these are those in backing_file_mmap() and coda_file_mmap(), both of which
use vma->vm_file.
However, moving forward, as we convert drivers to using
f_op->mmap_prepare(), this will become a problem.
Resolve this issue by explicitly setting desc->file to the provided file
parameter and update callers accordingly.
Callers are expected to read desc->file and update desc->vm_file - the
former will be the file provided by the caller (if stacked, this may
differ from vma->vm_file).
If the caller needs to differentiate between the two they therefore now
can.
While we are here, also provide a variant of compat_vma_mmap_prepare()
that operates against a pointer to any file_operations struct and does not
assume that the file_operations struct we are interested in is file->f_op.
This function is __compat_vma_mmap_prepare() and we invoke it from
compat_vma_mmap_prepare() so that we share code between the two functions.
This is important, because some drivers provide hooks in a separate
struct, for instance struct drm_device provides an fops field for this
purpose.
Also update the VMA selftests accordingly.
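As an illustration only (a hypothetical hook, not part of this patch), a
stacked-aware f_op->mmap_prepare() handler would read desc->file and only ever
assign desc->vm_file:
    static int example_mmap_prepare(struct vm_area_desc *desc)
    {
            /* desc->file is the file vfs_mmap() was invoked with; in a
             * stacked call it may differ from the eventual vma->vm_file. */
            if (!(desc->file->f_mode & FMODE_READ))
                    return -EACCES;

            /* If the mapping should be backed by a different file, the
             * hook updates desc->vm_file (never desc->file). */
            return 0;
    }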
Link: https://lkml.kernel.org/r/dd0c72df8a33e8ffaa243eeb9b01010b670610e9.1756920635.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm: do not assume file == vma->vm_file in
compat_vma_mmap_prepare()", v2.
As part of the efforts to eliminate the problematic f_op->mmap callback, a
new callback - f_op->mmap_prepare was provided.
While we are converting these callbacks, we must deal with 'stacked'
filesystems and drivers - those which in their own f_op->mmap callback
invoke an inner f_op->mmap callback.
To accommodate this, a compatibility layer is provided that, via
vfs_mmap(), detects if f_op->mmap_prepare is provided and if so, generates
a vm_area_desc containing the VMA's metadata and invokes the call.
So far, we have provided desc->file equal to vma->vm_file. However this
is not necessarily valid, especially in the case of stacked drivers which
wish to assign a new file after the inner hook is invoked.
To account for this, we adjust vm_area_desc to have both file and vm_file
fields. The .vm_file field is strictly set to vma->vm_file (or in the
case of a new mapping, what will become vma->vm_file).
However, .file is set to whichever file vfs_mmap() is invoked with when
using the compatibility layer.
Therefore, if the VMA's file needs to be updated in .mmap_prepare,
desc->vm_file should be assigned, whilst desc->file should be read.
No current f_op->mmap_prepare users assign desc->file so this is safe to
do.
This makes the .mmap_prepare callback in the context of a stacked
filesystem or driver completely consistent with the existing .mmap
implementations.
While we're here, we do a few small cleanups, and ensure that we const-ify
things correctly in the vm_area_desc struct to avoid hooks accidentally
trying to assign fields they should not.
This patch (of 2):
Stacked filesystems and drivers may invoke mmap hooks with a struct file
pointer that differs from the overlying file. We will make this
functionality possible in a subsequent patch.
In order to prepare for this, let's update vm_area_desc to separately
provide desc->file and desc->vm_file fields.
The desc->file field is the file that the hook is expected to operate
upon, and is not assignable (though the hook may wish to, e.g., update the
file's accessed time).
The desc->vm_file defaults to what will become vma->vm_file and is what
the hook must reassign should it wish to change the VMA's vma->vm_file.
For now we keep desc->file, vm_file the same to remain consistent.
No f_op->mmap_prepare() callback sets a new vma->vm_file currently, so
this is safe to change.
While we're here, make the mm_struct that desc->mm points at immutable, as
well as the desc->mm field itself.
As part of this change, also update the single hook which this would
otherwise break - mlock_future_ok(), invoked by secretmem_mmap_prepare().
We additionally update set_vma_from_desc() to compare fields in a more
logical fashion, checking the (possibly) user-modified fields as the first
operand against the existing value as the second one.
Additionally, update VMA tests to accommodate changes.
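A rough sketch of the relevant vm_area_desc fields after this change (field
layout and const qualifiers here are assumptions based on the description
above, not the exact definition):
    struct vm_area_desc {
            /* Immutable state. */
            const struct mm_struct *const mm;  /* pointer and pointee const */
            struct file *const file;           /* file the hook operates upon */
            /* (start/end/pgoff and the remaining mutable fields unchanged) */
            struct file *vm_file;              /* what becomes vma->vm_file */
            vm_flags_t vm_flags;
    };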
Link: https://lkml.kernel.org/r/cover.1756920635.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/3fa15a861bb7419f033d22970598aa61850ea267.1756920635.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The ancient comment above task_lock() states that it can be nested outside
of read_lock(&tasklist_lock), but this is no longer true:
    CPU_0                   CPU_1                   CPU_2
    task_lock()             read_lock(tasklist)
                                                    write_lock_irq(tasklist)
    read_lock(tasklist)     task_lock()
Unless CPU_0 calls read_lock() in IRQ context, queued_read_lock_slowpath()
won't get the lock immediately; it will spin waiting for the pending
writer on CPU_2, resulting in a deadlock.
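A minimal sketch of the now-required nesting order (illustrative only, not
taken from the patch):
    read_lock(&tasklist_lock);      /* take tasklist_lock first ... */
    task_lock(p);                   /* ... then the per-task lock */
    /* ... */
    task_unlock(p);
    read_unlock(&tasklist_lock);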
Link: https://lkml.kernel.org/r/20250914110908.GA18769@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This patch extends the BPF_PROG_LOAD command by adding three new fields
to `union bpf_attr` in the user-space API:
- signature: A pointer to the signature blob.
- signature_size: The size of the signature blob.
- keyring_id: The serial number of a loaded kernel keyring (e.g.,
the user or session keyring) containing the trusted public keys.
When a BPF program is loaded with a signature, the kernel:
1. Retrieves the trusted keyring using the provided `keyring_id`.
2. Verifies the supplied signature against the BPF program's
instruction buffer.
3. If the signature is valid and was generated by a key in the trusted
keyring, the program load proceeds.
4. If no signature is provided, the load proceeds as before, allowing
for backward compatibility. LSMs can choose to restrict unsigned
programs and implement a security policy.
5. If signature verification fails for any reason,
the program is not loaded.
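A minimal user-space fragment of the flow described above (the placement of
the new fields within union bpf_attr and the raw bpf(2) usage are illustrative
assumptions; insns, sig_blob, etc. are assumed to be set up elsewhere):
    union bpf_attr attr = {};

    attr.prog_type = BPF_PROG_TYPE_KPROBE;
    attr.insns     = (__u64)(unsigned long)insns;
    attr.insn_cnt  = insn_cnt;
    attr.license   = (__u64)(unsigned long)"GPL";
    /* New fields from this patch: */
    attr.signature      = (__u64)(unsigned long)sig_blob; /* signature blob */
    attr.signature_size = sig_len;                        /* its size */
    attr.keyring_id     = keyring_serial;                 /* trusted keyring */

    prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));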
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/r/20250921160120.9711-2-kpsingh@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This removes 8 bytes of waste on 64-bit builds.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
tp->tcp_clean_acked is fetched in the tx path when snd_una is updated.
This field thus belongs to the tcp_sock_read_tx group.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fill a hole in tcp_sock_read_txrx, instead of possibly wasting
a cache line.
Note that tcp_recvmsg_locked() is also reading tp->repair,
so this removes one cache line miss in tcp recvmsg().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
tcp_ack() writes this field; it belongs to tcp_sock_write_txrx.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Tariq Toukan says:
====================
mlx5-next updates 2025-09-21
* tag 'mlx5-next-counters' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Add uar access and odp page fault counters
====================
Link: https://patch.msgid.link/1758443940-708689-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Remove the old sfp_parse_*() functions that are no longer used.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1uydVz-000000061Wj-13Yd@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Provide a function to retrieve the current sfp_module_caps structure
so that upstreams can get the entire module support in one go.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1uydVj-000000061WQ-3q47@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Pre-parse the module support on insert rather than when the upstream
requests the data. This will allow more flexible and extensible
parsing.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1uydVZ-000000061WE-2pXD@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a helper for copying PHY interface bitmasks. This will be used by
the SFP bus code, which will then be moved to phylink in the subsequent
patches.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1uydVU-000000061W8-2IDT@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Sphinx emitted a warning during make texinfodocs:
WARNING: Inline literal start-string without end-string.
This was caused by the trailing '*' in "%WQ_*" being parsed as
reStructuredText markup in the kernel-doc comment.
Escape the '*' in the comment so that Sphinx treats it as a literal
character, resolving the warning.
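For example (an illustrative kernel-doc line, not the exact hunk), a comment of
the form:
     * @flags: %WQ_* flags for this workqueue
becomes:
     * @flags: %WQ_\* flags for this workqueue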
Signed-off-by: Kriish Sharma <kriish.sharma2006@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Simply derive the ns operations from the namespace type.
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add the missing include of the ns_common header.
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Rename SPI_CS_CNT_MAX to SPI_DEVICE_CS_CNT_MAX to make it more obvious
that this is the max number of CS per device supported, not per
controller.
Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Link: https://patch.msgid.link/20250915183725.219473-8-jonas.gorski@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
The spi chipselect limit SPI_CS_CNT_MAX was raised with commit
2f8c7c3715f2 ("spi: Raise limit on number of chip selects") from 4 to 16
to accommodate spi controllers with more than 4 chip selects, and then
later to 24 with commit 96893cdd4760 ("spi: Raise limit on number of
chip selects to 24").
Now that we removed SPI_CS_CNT_MAX limiting the chip selects of
controllers, we can reduce the number of chip selects per device again
to 4, the original value.
Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Link: https://patch.msgid.link/20250915183725.219473-7-jonas.gorski@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
There are several places where we need to iterate over a device's
chip selects. To be able to do this efficiently, store the number of
chipselects in spi_device, like we do for controllers.
Since we now use a device supplied value, add a check to make sure it
isn't more than we can support.
Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Link: https://patch.msgid.link/20250915183725.219473-3-jonas.gorski@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Clean up: because svc_rpcb_cleanup() and svc_xprt_destroy_all()
are always invoked in pairs, we can deduplicate code by moving
the svc_rpcb_cleanup() call sites into svc_xprt_destroy_all().
Tested-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
When a listener is added, part of creating the transport also registers
the program/port with rpcbind. However, when the listener is removed,
while the transport goes away, rpcbind still has the entry for that
port/type.
When deleting the transport, unregister with rpcbind when appropriate.
v2: create a new xpt_flag, XPT_RPCB_UNREG, to mark TCP and UDP
transports, and send an rpcbind unregister at xprt destroy if the flag is set.
Suggested-by: Chuck Lever <chuck.lever@oracle.com>
Fixes: d093c9089260 ("nfsd: fix management of listener transports")
Cc: stable@vger.kernel.org
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
When ATTR_ATIME_SET and ATTR_MTIME_SET are set in the ia_valid mask, the
notify_change() logic takes that to mean that the request should set
those values explicitly, and not override them with "now".
With the advent of delegated timestamps, similar functionality is needed
for the ctime. Add an ATTR_CTIME_SET flag, and use that to indicate that
the ctime should be accepted as-is. Also, clean up the if statements to
eliminate the extra negatives.
In setattr_copy() and setattr_copy_mgtime() use inode_set_ctime_deleg()
when ATTR_CTIME_SET is set, instead of basing the decision on ATTR_DELEG.
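A rough sketch of the decision described above (the surrounding code in
setattr_copy()/setattr_copy_mgtime() is more involved; this is an assumption,
not the exact hunk):
    if (attr->ia_valid & ATTR_CTIME_SET)
            /* explicit/delegated ctime: accept the caller's value as-is */
            inode_set_ctime_deleg(inode, attr->ia_ctime);
    else
            inode_set_ctime_to_ts(inode, attr->ia_ctime);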
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Since the opaque is fixed in size, the caller already knows how many
bytes were decoded, on success. Thus, xdr_stream_decode_opaque_fixed()
doesn't need to return that value. And, xdr_stream_decode_u32 and _u64
both return zero on success.
This patch simplifies the caller's error checking to avoid potential
integer promotion issues.
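Illustrative caller-side effect (the call site shown here is an assumption):
    /* before: callers compared against the decoded length */
    if (xdr_stream_decode_opaque_fixed(xdr, buf, len) < 0)
            return -EIO;

    /* after: zero on success, like xdr_stream_decode_u32()/_u64() */
    if (xdr_stream_decode_opaque_fixed(xdr, buf, len))
            return -EIO;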
Suggested-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
This interface was deprecated by commit e6f7e1487ab5 ("nfs_localio:
simplify interface to nfsd for getting nfsd_file") and is now
unused. So let's remove it.
Signed-off-by: NeilBrown <neil@brown.name>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
https://gitlab.freedesktop.org/drm/xe/kernel into drm-next
UAPI Changes:
- Drop L3 bank mask reporting from the media GT on Xe3 and later. Only
do that for the primary GT. No userspace needs or uses it for media
and some platforms may report bogus values.
- Add SLPC power_profile sysfs interface with support for base and
power_saving modes (Vinay Belgaumkar, Rodrigo Vivi)
- Add configfs attributes to add post/mid context-switch commands
(Lucas De Marchi)
Cross-subsystem Changes:
- Fix hmm_pfn_to_map_order() usage in gpusvm and refactor APIs to
align with pieces previously handled by xe_hmm (Matthew Auld)
Core Changes:
- Add MEI driver for Late Binding Firmware Update/Upload
(Alexander Usyskin)
Driver Changes:
- Fix GuC CT teardown wrt TLB invalidation (Satyanarayana)
- Fix CCS save/restore on VF (Satyanarayana)
- Increase default GuC crash buffer size (Zhanjun)
- Allow to clear GT stats in debugfs to aid debugging (Matthew Brost)
- Add more SVM GT stats to debugfs (Matthew Brost)
- Fix error handling in VMA attr query (Himal)
- Move sa_info in debugfs to be per tile (Michal Wajdeczko)
- Limit number of retries upon receiving NO_RESPONSE_RETRY from GuC to
avoid endless loop (Michal Wajdeczko)
- Fix configfs handling for survivability_mode undoing user choice when
unbinding the module (Michal Wajdeczko)
- Refactor configfs attribute visibility to future-proof it and stop
exposing survivability_mode if not applicable (Michal Wajdeczko)
- Constify some functions (Harish Chegondi, Michal Wajdeczko)
- Add/extend more HW workarounds for Xe2 and Xe3
(Harish Chegondi, Tangudu Tilak Tirumalesh)
- Replace xe_hmm with gpusvm (Matthew Auld)
- Improve fake pci and WA kunit handling for testing new platforms
(Michal Wajdeczko)
- Reduce unnecessary PTE writes when migrating (Sanjay Yadav)
- Cleanup GuC interface definitions and log message (John Harrison)
- Small improvements around VF CCS (Michal Wajdeczko)
- Enable bus mastering for the I2C controller (Raag Jadav)
- Prefer devm_mutex over hand rolling it (Christophe JAILLET)
- Drop sysfs and debugfs attributes not available for VF (Michal Wajdeczko)
- GuC CT devm actions improvements (Michal Wajdeczko)
- Recommend new GuC versions for PTL and BMG (Julia Filipchuk)
- Improve driver handling for exhaustive eviction using the new
xe_validation wrapper around drm_exec (Thomas Hellström)
- Add and use printk wrappers for tile and device (Michal Wajdeczko)
- Better document workaround handling in Xe (Lucas De Marchi)
- Improvements on ARRAY_SIZE and ERR_CAST usage (Lucas De Marchi,
Fushuai Wang)
- Align CSS firmware headers with the GuC APIs (John Harrison)
- Test GuC to GuC (G2G) communication to aid debug in pre-production
firmware (John Harrison)
- Bail out driver probing if GuC fails to load (John Harrison)
- Allow error injection in xe_pxp_exec_queue_add()
(Daniele Ceraolo Spurio)
- Minor refactors in xe_svm (Shuicheng Lin)
- Fix madvise ioctl error handling (Shuicheng Lin)
- Use attribute groups to simplify sysfs registration
(Michal Wajdeczko)
- Add Late Binding Firmware implementation in Xe to work together with
the MEI component (Badal Nilawar, Daniele Ceraolo Spurio, Rodrigo
Vivi)
- Fix build with CONFIG_MODULES=n (Lucas De Marchi)
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/c2et6dnkst2apsgt46dklej4nprqdukjosb55grpaknf3pvcxy@t7gtn3hqtp6n
|
|
This was ambiguous enough for a broken patch (206cc44588f7 ("virtio:
reject shm region if length is zero")) to make it into the kernel, so
make it clearer.
Link: https://lore.kernel.org/r/20250816071600-mutt-send-email-mst@kernel.org/
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Message-Id: <20250829150944.233505-1-hi@alyssa.is>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
Patch series "mm/damon: define and use DAMON initialization check
function".
DAMON is initialized at subsystem initialization time, by damon_init().
If DAMON API functions are called before the initialization, the
system could crash. Such issues actually happened and were fixed [1]
in the past. For the fix, DAMON API callers were updated to check whether
DAMON is initialized, using their own hacks. The hacks are
unnecessarily duplicated across every DAMON API caller and are therefore
difficult to reliably maintain in the long term.
Make it reliable and easy to maintain. For this, implement a new DAMON
core layer API function that returns if DAMON is successfully
initialized. If it returns true, it means DAMON API functions are safe
to be used. After the introduction of the new API, update DAMON API
callers to use the new function instead of their own hacks.
This patch (of 7):
If DAMON is used before it is successfully initialized, the caller could
crash. The DAMON core layer does not provide a reliable way to see whether
it is successfully initialized and therefore ready to be used, though. As a
result, DAMON API callers implement their own hacks to check this. The
hacks simply assume DAMON should be ready by module init time. That is not
reliable, as DAMON initialization can indeed fail if KMEM_CACHE() fails,
and it is difficult to maintain, as those are duplicates.
Implement a core layer API function for better reliability and
maintainability to replace the hacks with followup commits.
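A sketch of how a caller would use such a check (the function name here is
hypothetical; the actual name is introduced later in this series):
    if (!damon_initialized())
            return -ENOMEM;
    /* safe to call DAMON API functions from here on */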
Link: https://lkml.kernel.org/r/20250916033511.116366-2-sj@kernel.org
Link: https://lore.kernel.org/20250909022238.2989-1-sj@kernel.org [1]
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
While rare, memory allocation profiling can contain inaccurate counters if
slab object extension vector allocation fails. That allocation might
succeed later but prior to that, slab allocations that would have used
that object extension vector will not be accounted for. To indicate
incorrect counters, "accurate:no" marker is appended to the call site line
in the /proc/allocinfo output. Bump up /proc/allocinfo version to reflect
the change in the file format and update documentation.
Example output with invalid counters:
allocinfo - version: 2.0
0 0 arch/x86/kernel/kdebugfs.c:105 func:create_setup_data_nodes
0 0 arch/x86/kernel/alternative.c:2090 func:alternatives_smp_module_add
0 0 arch/x86/kernel/alternative.c:127 func:__its_alloc accurate:no
0 0 arch/x86/kernel/fpu/regset.c:160 func:xstateregs_set
0 0 arch/x86/kernel/fpu/xstate.c:1590 func:fpstate_realloc
0 0 arch/x86/kernel/cpu/aperfmperf.c:379 func:arch_enable_hybrid_capacity_scale
0 0 arch/x86/kernel/cpu/amd_cache_disable.c:258 func:init_amd_l3_attrs
49152 48 arch/x86/kernel/cpu/mce/core.c:2709 func:mce_device_create accurate:no
32768 1 arch/x86/kernel/cpu/mce/genpool.c:132 func:mce_gen_pool_create
0 0 arch/x86/kernel/cpu/mce/amd.c:1341 func:mce_threshold_create_device
[surenb@google.com: document new "accurate:no" marker]
Fixes: 39d117e04d15 ("alloc_tag: mark inaccurate allocation counters in /proc/allocinfo output")
[akpm@linux-foundation.org: simplification per Usama, reflow text]
[akpm@linux-foundation.org: add newline to prevent docs warning, per Randy]
Link: https://lkml.kernel.org/r/20250915230224.4115531-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Usama Arif <usamaarif642@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: David Wang <00107082@163.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sourav Panda <souravpanda@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Improvements to Victim Process Thawing and OOM Reaper
Traversal Order", v10.
This patch series focuses on optimizing victim process thawing and
refining the traversal order of the OOM reaper. Since __thaw_task() is
used to thaw a single thread of the victim, thawing only one thread cannot
guarantee the exit of the OOM victim when it is frozen. Patch 1 thaw the
entire process of the OOM victim to ensure that OOM victims are able to
terminate themselves. Even if the oom_reaper is delayed, patch 2 is still
beneficial for reaping processes with a large address space footprint, and
it also greatly improves process_mrelease.
This patch (of 10):
The OOM killer is a mechanism that selects and kills processes when the system
runs out of memory to reclaim resources and keep the system stable. But
the oom victim cannot terminate on its own when it is frozen, even if the
OOM victim task is thawed through __thaw_task(). This is because
__thaw_task() can only thaw a single OOM victim thread, and cannot thaw
the entire OOM victim process.
In addition, freezing_slow_path() determines whether a task is an OOM
victim by checking the task's TIF_MEMDIE flag. When a task is identified
as an OOM victim, the freezer bypasses both PM freezing and cgroup
freezing states to thaw it.
Historically, TIF_MEMDIE was a "this is the oom victim & it has access to
memory reserves" flag. It had the thread vs. process problem, and
tsk_is_oom_victim() was introduced later to get rid of that and other
issues, as well as to guarantee that we can reliably identify the oom
victim's mm for the oom_reaper.
Therefore, thaw_process() is introduced to unfreeze all threads within the
OOM victim process, ensuring that every thread is properly thawed. The
freezer now uses tsk_is_oom_victim() to determine OOM victim status,
allowing all victim threads to be unfrozen as necessary.
With this change, the entire OOM victim process will be thawed when an OOM
event occurs, ensuring that the victim can terminate on its own.
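A minimal sketch of the per-process thawing described above (the actual
implementation in the patch may differ):
    static void thaw_process(struct task_struct *p)
    {
            struct task_struct *t;

            rcu_read_lock();
            for_each_thread(p, t)
                    __thaw_task(t);  /* thaw every thread, not just one */
            rcu_read_unlock();
    }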
Link: https://lkml.kernel.org/r/20250915162946.5515-1-zhongjinji@honor.com
Link: https://lkml.kernel.org/r/20250915162946.5515-2-zhongjinji@honor.com
Signed-off-by: zhongjinji <zhongjinji@honor.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
static inlines
For all the usual reasons, plus a new one. Calling
(void)arch_enter_lazy_mmu_mode();
deservedly blows up.
Cc: Balbir Singh <balbirs@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We already use page->private for storing the order of a page while it's in
the buddy allocator system; extend that to also storing the order while
it's in the pcp_llist.
Link: https://lkml.kernel.org/r/20250910142923.2465470-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Small cleanups".
These small cleanups can be applied now to reduce conflicts during the
next merge window. They're all from various efforts to split struct page
from other memdescs. Thanks to Vlastimil for the suggestion.
This patch (of 3):
These functions do not modify their arguments. Telling the compiler this
may improve code generation, and allows us to pass const arguments from
other functions.
Link: https://lkml.kernel.org/r/20250910142923.2465470-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20250910142923.2465470-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
As raised by Andrew [1], a folio/compound page never spans a negative
number of pages. Consequently, let's use "unsigned long" instead of
"long" consistently for folio_nr_pages(), folio_large_nr_pages() and
compound_nr().
Using "unsigned long" as return value is fine, because even
"(long)-folio_nr_pages()" will keep on working as expected. Using
"unsigned int" instead would actually break these use cases.
This patch takes the first step changing these to return unsigned long
(and making drm_gem_get_pages() use the new types instead of replacing
min()).
In the future, we might want to make more callers of these functions to
consistently use "unsigned long".
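Sketch of the changed return types (signatures abbreviated; the const-ness of
the parameters is an assumption following the earlier constification work):
    unsigned long folio_nr_pages(const struct folio *folio);
    unsigned long folio_large_nr_pages(const struct folio *folio);
    unsigned long compound_nr(struct page *page);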
Link: https://lore.kernel.org/linux-mm/20250503182858.5a02729fcffd6d4723afcfc2@linux-foundation.org/
Link: https://lkml.kernel.org/r/20250826153721.GA23292@cathedrallabs.org
Link: https://lore.kernel.org/linux-mm/20250503182858.5a02729fcffd6d4723afcfc2@linux-foundation.org/ [1]
Signed-off-by: Aristeu Rozanski <aris@ruivo.org>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This header uses THREAD_SIZE, which is provided by the thread_info.h
header but is not included in this header. Depending on the #include
ordering in other files, this can produce preprocessor errors.
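The fix amounts to making the header self-contained, roughly (exact include
path per the description above):
    #include <linux/thread_info.h>   /* for THREAD_SIZE */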
Link: https://lkml.kernel.org/r/20250909201419.827638-1-briannorris@chromium.org
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
If kswapd fails to reclaim pages from a node MAX_RECLAIM_RETRIES times in a
row, kswapd on that node gets disabled. That is, the system won't wake up
kswapd for that node until page reclamation is observed at least once.
That reclamation is mostly done by direct reclaim, which in turn enables
kswapd back.
However, on systems with CXL memory nodes, workloads with high anon page
usage can disable kswapd indefinitely, without triggering direct
reclaim. This can be reproduced with following steps:
numa node 0 (32GB memory, 48 CPUs)
numa node 2~5 (512GB CXL memory, 128GB each)
(numa node 1 is disabled)
swap space 8GB
1) Set /sys/kernel/mm/demotion_enabled to 0.
2) Set /proc/sys/kernel/numa_balancing to 0.
3) Run a process that allocates and randomly accesses 500GB of anon
pages.
4) Let the process exit normally.
During 3), free memory on node 0 drops below the low watermark, and
kswapd runs and depletes swap space. Then, kswapd fails consecutively
and gets disabled. Allocation afterwards happens on CXL memory, so node
0 never gains more memory pressure to trigger direct reclaim.
After 4), kswapd on node 0 remains disabled, and tasks running on that
node are unable to swap. If you turn on NUMA_BALANCING_MEMORY_TIERING
and demotion now, it won't work properly since kswapd is disabled.
To mitigate this problem, reset kswapd_failures to 0 on following
conditions:
a) the ZONE_BELOW_HIGH bit of a zone in a hopeless node with a fallback
memory node gets cleared.
b) demotion_enabled is changed from false to true.
Rationale for a):
ZONE_BELOW_HIGH bit being cleared might be a sign that the node may
be reclaimable afterwards. This won't help much if the memory-hungry
process keeps running without freeing anything, but at least the node
will go back to reclaimable state when the process exits.
Rationale for b):
When demotion_enabled is false, kswapd can only reclaim anon pages by
swapping them out to swap space. If demotion_enabled is turned on,
kswapd can demote anon pages to another node for reclaiming. So, the
original failure count for determining reclaimability is no longer
valid.
Since resets of kswapd_failures could be missed by a concurrent ++ operation,
it is changed from int to atomic_t.
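Illustrative shape of the counter change (the field lives in struct
pglist_data; the exact hunks are assumptions):
    /* was: int kswapd_failures; */
    atomic_t kswapd_failures;

    /* increment on failed reclaim, reset when condition a) or b) holds */
    atomic_inc(&pgdat->kswapd_failures);
    atomic_set(&pgdat->kswapd_failures, 0);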
[akpm@linux-foundation.org: tweak whitespace]
Link: https://lkml.kernel.org/r/aL6qGi69jWXfPc4D@pcw-MS-7D22
Signed-off-by: Chanwon Park <flyinrm@gmail.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This has the same effect as ptdesc_address() so convert the callers to use
that and delete the function. Add kernel-doc for ptdesc_address().
Link: https://lkml.kernel.org/r/20250908171104.2409217-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In preparation for splitting struct ptdesc from struct page and struct
folio, remove mentions of struct folio from these functions. Introduce
ptdesc_nr_pages() to avoid using lruvec_stat_add/sub_folio().
Link: https://lkml.kernel.org/r/20250908171104.2409217-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Some ptdesc cleanups".
The first two patches here are preparation for splitting struct ptdesc
from struct page and struct folio. I think their only dependency is on
the memdesc_flags_t patches from August which is in mm-new. The third
patch is just something I noticed while working on the code.
This patch (of 3):
Use the new memdesc_flags_t type to show that these are the same bits as
page/folio/slab and therefore have the zone/node/section information in
them. Remove a use of ptdesc_folio() by converting
pagetable_is_reserved() to use test_bit() directly.
Link: https://lkml.kernel.org/r/20250908171104.2409217-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20250908171104.2409217-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The ma_external_lock field isn't used anywhere when CONFIG_LOCKDEP=n, so
just get rid of it in that configuration. This also avoids generating a typedef
called lockdep_map_p that could overlap with typedefs in other header
files.
Link: https://lkml.kernel.org/r/20250902-maple-lockdep-p-v1-1-3ae5a398a379@google.com
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Danilo Krummrich <dakr@kernel.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Introduce the basic swap table infrastructure, which is for now just a
fixed-size flat array inside each swap cluster, with access wrappers.
Each cluster contains a swap table of 512 entries. Each table entry is an
opaque atomic long. It could be in 3 types: a shadow type (XA_VALUE), a
folio type (pointer), or NULL.
In this first step, it only supports storing a folio or shadow, and it is
a drop-in replacement for the current swap cache. Convert all swap cache
users to use the new sets of APIs. Chris Li has been suggesting using a
new infrastructure for swap cache for better performance, and that idea
combined well with the swap table as the new backing structure. Now the
lock contention range is reduced to 2M clusters, which is much smaller
than the 64M address_space. And we can also drop the multiple
address_space design.
All the internal work is done with swap_cache_get_* helpers. Swap cache
lookup is still lock-less like before, and the helpers' contexts are the
same as the original swap cache helpers'. They still require a pin on the swap
device to prevent the backing data from being freed.
Swap cache updates are now protected by the swap cluster lock instead of
the XArray lock. This is mostly handled internally, but new
__swap_cache_* helpers require the caller to lock the cluster. So, a few
new cluster access and locking helpers are also introduced.
A fully cluster-based unified swap table can be implemented on top of this
to take care of all count tracking and synchronization work, with dynamic
allocation. It should reduce the memory usage while making the
performance even better.
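A sketch of the three entry states described above (the field and variable
names here are illustrative, not necessarily those used in the patch):
    /* each cluster carries a table of 512 opaque atomic_long_t entries */
    unsigned long ent = atomic_long_read(&ci->table[off]);

    if (!ent) {
            /* NULL: nothing cached at this slot */
    } else if (xa_is_value((void *)ent)) {
            /* shadow entry (workingset information) */
    } else {
            /* a folio pointer */
            struct folio *folio = (struct folio *)ent;
    }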
Link: https://lkml.kernel.org/r/20250916160100.31545-12-ryncsn@gmail.com
Co-developed-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Suggested-by: Chris Li <chrisl@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: kernel test robot <oliver.sang@intel.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
swp_swap_info is the most commonly used helper for retrieving swap info.
It has an internal check that may lead to a NULL return value, but almost
none of its callers check the return value, making the internal check
pointless. In fact, most of these callers already ensured the entry is
valid and never expect a NULL value.
Tidy this up and improve the function names. If the caller can make sure
the swap entry/type is valid and the device is pinned, use the new
introduced __swap_entry_to_info/__swap_type_to_info instead. They have
more debug sanity checks and lower overhead as they are inlined.
Callers that may expect a NULL value should use
swap_entry_to_info/swap_type_to_info instead.
No feature change. The rearranged code should have no effect; otherwise it
would already have been hitting NULL-deref bugs. Only some new sanity
checks are added, so potential issues may show up in debug builds.
The new helpers will be frequently used with the swap table later when working
with swap cache folios. A locked swap cache folio ensures the entries are
valid and stable so these helpers are very helpful.
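Naming sketch of the helpers mentioned above (return and parameter types are
assumptions):
    /* caller guarantees the entry/type is valid and the device is pinned */
    struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry);
    struct swap_info_struct *__swap_type_to_info(int type);

    /* may return NULL; for callers that cannot guarantee validity */
    struct swap_info_struct *swap_entry_to_info(swp_entry_t entry);
    struct swap_info_struct *swap_type_to_info(int type);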
Link: https://lkml.kernel.org/r/20250916160100.31545-8-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: kernel test robot <oliver.sang@intel.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
No feature change, move cluster related definitions and helpers to
mm/swap.h, also tidy up and add a "swap_" prefix for cluster lock/unlock
helpers, so they can be used outside of swap files. And while at it, add
kerneldoc.
Link: https://lkml.kernel.org/r/20250916160100.31545-7-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Chris Li <chrisl@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: kernel test robot <oliver.sang@intel.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
GUP no longer uses get_dev_pagemap(). As it was the only user of the
get_dev_pagemap() pgmap caching feature it can be removed.
Link: https://lkml.kernel.org/r/20250903225926.34702-2-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
split_huge_page_to_list_[to_order](), split_huge_page() and
try_folio_split() return 0 on success and error codes on failure.
When THP is disabled, these functions return 0 indicating success even
though an error code should be returned as it is not possible to split a
folio when THP is disabled.
Make all these functions return -EINVAL to indicate failure instead of 0.
As large folios depend on CONFIG_TRANSPARENT_HUGEPAGE, issue a warning, as
these functions should not be called without a large folio.
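A sketch of the resulting !CONFIG_TRANSPARENT_HUGEPAGE stub behaviour (the
exact stub bodies are assumptions based on the description above):
    static inline int split_huge_page(struct page *page)
    {
            /* large folios cannot exist without THP, so warn ... */
            VM_WARN_ON_ONCE_PAGE(1, page);
            /* ... and report failure instead of pretending to succeed */
            return -EINVAL;
    }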
Link: https://lkml.kernel.org/r/20250905150012.93714-1-kernel@pankajraghav.com
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202509051753.riCeG7LC-lkp@intel.com/
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Add Rust abstraction for Maple Trees", v3.
This will be used in the Tyr driver [1] to allocate from the GPU's VA
space that is not owned by userspace, but by the kernel, for kernel GPU
mappings.
Danilo tells me that in nouveau, the maple tree is used for keeping track
of "VM regions" on top of GPUVM, and that he will most likely end up doing
the same in the Rust Nova driver as well.
These abstractions intentionally do not expose any way to make use of
external locking. You are required to use the internal spinlock. For
now, we do not support loads that only utilize rcu for protection.
This contains some parts taken from Andrew Ballance's RFC [2] from April.
However, it has also been reworked significantly compared to that RFC
taking the use-cases in Tyr into account.
This patch (of 3):
The maple tree will be used in the Tyr driver to allocate and keep track
of GPU allocations created internally (i.e. not by userspace). It will
likely also be used in the Nova driver eventually.
This adds the simplest methods for additional and removal that do not
require any special care with respect to concurrency.
This implementation is based on the RFC by Andrew but with significant
changes to simplify the implementation.
[ojeda@kernel.org: fix intra-doc links]
Link: https://lkml.kernel.org/r/20250910140212.997771-1-ojeda@kernel.org
Link: https://lkml.kernel.org/r/20250902-maple-tree-v3-0-fb5c8958fb1e@google.com
Link: https://lkml.kernel.org/r/20250902-maple-tree-v3-1-fb5c8958fb1e@google.com
Link: https://lore.kernel.org/r/20250627-tyr-v1-1-cb5f4c6ced46@collabora.com [1]
Link: https://lore.kernel.org/r/20250405060154.1550858-1-andrewjballance@gmail.com [2]
Co-developed-by: Andrew Ballance <andrewjballance@gmail.com>
Signed-off-by: Andrew Ballance <andrewjballance@gmail.com>
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Danilo Krummrich <dakr@kernel.org>
Cc: Andreas Hindborg <a.hindborg@kernel.org>
Cc: Björn Roy Baron <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Daniel Almeida <daniel.almeida@collabora.com>
Cc: Gary Guo <gary@garyguo.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Trevor Gross <tmgross@umich.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
All users now use folio->mlock_count so we can remove this element of
struct page. Move the useful comments over to struct folio.
Link: https://lkml.kernel.org/r/20250903191041.1630338-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
On XFS systems with pagesize=4K, blocksize=16K, and
CONFIG_TRANSPARENT_HUGEPAGE enabled, we observed the following readahead
behaviors:
# echo 3 > /proc/sys/vm/drop_caches
# dd if=test of=/dev/null bs=64k count=1
# ./tools/mm/page-types -r -L -f /mnt/xfs/test
foffset offset flags
0 136d4c __RU_l_________H______t_________________F_1
1 136d4d __RU_l__________T_____t_________________F_1
2 136d4e __RU_l__________T_____t_________________F_1
3 136d4f __RU_l__________T_____t_________________F_1
...
c 136bb8 __RU_l_________H______t_________________F_1
d 136bb9 __RU_l__________T_____t_________________F_1
e 136bba __RU_l__________T_____t_________________F_1
f 136bbb __RU_l__________T_____t_________________F_1 <-- first read
10 13c2cc ___U_l_________H______t______________I__F_1 <-- readahead flag
11 13c2cd ___U_l__________T_____t______________I__F_1
12 13c2ce ___U_l__________T_____t______________I__F_1
13 13c2cf ___U_l__________T_____t______________I__F_1
...
1c 1405d4 ___U_l_________H______t_________________F_1
1d 1405d5 ___U_l__________T_____t_________________F_1
1e 1405d6 ___U_l__________T_____t_________________F_1
1f 1405d7 ___U_l__________T_____t_________________F_1
[ra_size = 32, req_count = 16, async_size = 16]
# echo 3 > /proc/sys/vm/drop_caches
# dd if=test of=/dev/null bs=60k count=1
# ./page-types -r -L -f /mnt/xfs/test
foffset offset flags
0 136048 __RU_l_________H______t_________________F_1
...
c 110a40 __RU_l_________H______t_________________F_1
d 110a41 __RU_l__________T_____t_________________F_1
e 110a42 __RU_l__________T_____t_________________F_1 <-- first read
f 110a43 __RU_l__________T_____t_________________F_1 <-- first readahead flag
10 13e7a8 ___U_l_________H______t_________________F_1
...
20 137a00 ___U_l_________H______t_______P______I__F_1 <-- second readahead flag (20 - 2f)
21 137a01 ___U_l__________T_____t_______P______I__F_1
...
3f 10d4af ___U_l__________T_____t_______P_________F_1
[first readahead: ra_size = 32, req_count = 15, async_size = 17]
When reading 64k data (same for 61-63k range, where last_index is
page-aligned in filemap_get_pages()), 128k readahead is triggered via
page_cache_sync_ra() and the PG_readahead flag is set on the next folio
(the one containing 0x10 page).
When reading 60k data, 128k readahead is also triggered via
page_cache_sync_ra(). However, in this case the readahead flag is set on
the 0xf page. Although the requested read size (req_count) is 60k, the
actual read will be aligned to folio size (64k), which triggers the
readahead flag and initiates asynchronous readahead via
page_cache_async_ra(). This results in two readahead operations totaling
256k.
The root cause is that when the requested size is smaller than the actual
read size (due to folio alignment), it triggers asynchronous readahead.
By changing last_index alignment from page size to folio size, we ensure
the requested size matches the actual read size, preventing the case where
a single read operation triggers two readahead operations.
After applying the patch:
# echo 3 > /proc/sys/vm/drop_caches
# dd if=test of=/dev/null bs=60k count=1
# ./page-types -r -L -f /mnt/xfs/test
foffset offset flags
0 136d4c __RU_l_________H______t_________________F_1
1 136d4d __RU_l__________T_____t_________________F_1
2 136d4e __RU_l__________T_____t_________________F_1
3 136d4f __RU_l__________T_____t_________________F_1
...
c 136bb8 __RU_l_________H______t_________________F_1
d 136bb9 __RU_l__________T_____t_________________F_1
e 136bba __RU_l__________T_____t_________________F_1 <-- first read
f 136bbb __RU_l__________T_____t_________________F_1
10 13c2cc ___U_l_________H______t______________I__F_1 <-- readahead flag
11 13c2cd ___U_l__________T_____t______________I__F_1
12 13c2ce ___U_l__________T_____t______________I__F_1
13 13c2cf ___U_l__________T_____t______________I__F_1
...
1c 1405d4 ___U_l_________H______t_________________F_1
1d 1405d5 ___U_l__________T_____t_________________F_1
1e 1405d6 ___U_l__________T_____t_________________F_1
1f 1405d7 ___U_l__________T_____t_________________F_1
[ra_size = 32, req_count = 16, async_size = 16]
The same phenomenon will occur when reading from 49k to 64k. Set the
readahead flag on the next folio.
Because the minimum order of folio in address_space equals the block size
(at least in xfs and bcachefs that already support bs > ps), having
request_count aligned to block size will not cause overread.
[klarasmodin@gmail.com: fix overflow on 32-bit]
Link: https://lkml.kernel.org/r/yru7qf5gvyzccq5ohhpylvxug5lr5tf54omspbjh4sm6pcdb2r@fpjgj2pxw7va
[akpm@linux-foundation.org: update it for Max's constification efforts]
Link: https://lkml.kernel.org/r/20250711055509.91587-1-youling.tang@linux.dev
Co-developed-by: Chi Zhiling <chizhiling@kylinos.cn>
Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Klara Modin <klarasmodin@gmail.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Youling Tang <youling.tang@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Klara Modin <klarasmodin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|