| Age | Commit message (Collapse) | Author |
|
The crypto/krb5 library accepts data in scatterlist form, but
the GSS-API layer presents RPC payloads as struct xdr_buf.
Bridge that gap with a pair of helper functions:
xdr_buf_to_sg() - populate a caller-supplied scatterlist
array from a byte range
xdr_buf_to_sg_alloc() - populate a caller-supplied inline
scatterlist, chaining to a heap-
allocated overflow for large payloads
The inline array (typically stack-allocated at eight entries)
covers the common case of small RPCs with no heap allocation
on the encrypt/decrypt path. Only buffers spanning many pages
incur a kmalloc for the chained extension.
The segment-walking logic follows the same head, page array,
tail traversal as xdr_process_buf(), but populates a
scatterlist directly rather than invoking a per-segment
callback. sg_next() traversal makes the walker safe for
chained scatterlists. Once subsequent patches reroute all
per-message crypto operations through crypto/krb5,
xdr_process_buf() loses its last callers and is removed.
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
When a filesystem is exported to NFS clients, NFSv4 state
(opens, locks, delegations, layouts) holds references that
prevent the underlying filesystem from being unmounted.
NFSD_CMD_UNLOCK_FILESYSTEM addresses this at superblock
granularity, but administrators unexporting a single path on a
shared filesystem (e.g., one of several exports on the same device)
need finer control.
Add NFSD_CMD_UNLOCK_EXPORT, which revokes NFSv4 state acquired
through exports of a specific path. Matching is by path identity
(dentry + vfsmount) via the sc_export field on each nfs4_stid,
so multiple svc_export objects for the same path -- one per
auth_domain -- are handled correctly without requiring the caller
to name a specific client.
The command takes a single "path" attribute. Userspace (exportfs
-u) sends this after removing the last client for a given path,
enabling the underlying filesystem to be unmounted. When multiple
clients share an export path, individual unexports do not trigger
state revocation; only the final one does.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Add NFSD_CMD_UNLOCK_FILESYSTEM as a dedicated netlink command for
revoking NFS state under a filesystem path, providing a netlink
equivalent of /proc/fs/nfsd/unlock_fs.
The command requires a "path" string attribute containing the
filesystem path whose state should be released. The handler
resolves the path to its superblock, then cancels async copies,
releases NLM locks, and revokes NFSv4 state on that superblock.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The existing write_unlock_ip procfs interface releases NLM file
locks held by a specific client IP address, but procfs provides
no structured way to extend that operation to other scopes such as
revoking NFSv4 state.
Add NFSD_CMD_UNLOCK_IP as a dedicated netlink command for
releasing NLM locks by client address. The command accepts a
binary sockaddr_in or sockaddr_in6 in its address attribute.
The handler validates the address family and length, then calls
nlmsvc_unlock_all_by_ip() to release matching NLM locks. Because
lockd is a single global instance, that call operates across
all network namespaces regardless of which namespace the caller
inhabits.
A separate netlink command for filesystem-scoped unlock is added in
a subsequent commit.
The nfsd_ctl_unlock_ip tracepoint is updated from string-based
address logging to __sockaddr, which stores the binary sockaddr
and formats it with %pISpc. This affects both the new netlink path
and the existing procfs write_unlock_ip path, giving consistent
structured output in both cases.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Catch up various bits of documentation after the locking changes.
Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260605002912.3456868-13-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Convert the IOCTL path similarly to how we converted Netlink.
The device lookup gets a little hairy. We could take rtnl_lock
unconditionally and drop it before calling the driver (this would
avoid the reference + liveness check). But I think being able
to make progress even if rtnl is dead-locked is quite useful.
First extra concern is handling features. List all the cmds which
modify features and always take rtnl_lock. We could fold this list
into ethtool_ioctl_needs_rtnl() but seems cleaner to keep
ethtool_ioctl_needs_rtnl() driver-related. If a driver changed
features and we were not holding rtnl_lock - warn about it.
It can only happen on buggy ops locked drivers (buggy because
they should have set appropriate "I need rtnl for op X" bit).
Second wrinkle is the PHY ID hack which drops the locks while
sleeping. Convert its static "busy" variable which used to
be protected by rtnl_lock to a field in struct ethtool_netdev_state.
This feature is about identifying an adapter or a port within
a system, so being able to blink multiple LEDs at the same
time is likely not very useful in practice. But it's the simplest
fix, we can add a mutex if someone thinks a system should only
be ID'ing one port at a time.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260605002912.3456868-12-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Module firmware flashing reads SFF-8024 identifier bytes via
.get_module_eeprom_by_page(). Other than that it modifies
a bit in the netdev->ethtool struct. Both should be ops-locked
at this point.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260605002912.3456868-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Make ethtool not take rtnl_lock for SET commands when operation
is performed on an ops-locked driver. cfg/cfg_pending are now
ops-locked, since only ethtool modifies them.
Some SET driver callbacks will still need rtnl_lock, most notably
those which may end up calling netdev_update_features() or the qdisc
layer (via netif_set_real_num_tx_queues()). Let drivers selectively
opt back into the rtnl_lock with a new bitfield in ops.
We need two helpers since Netlink and ioctl cmds have different
values. Keep the helpers side by side in common.h to make sure
they get updated together, even tho they will only get called
from ioctl.c and netlink.c.
SET commands which don't use ethnl_default_set_doit() are converted
by subsequent commits.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260605002912.3456868-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
ethnl_default_doit() and ethnl_default_dump_one() are both used
exclusively for GET callbacks (former to get info for a single
device or get global strings). ops-locked devices don't need
rtnl_lock for GET callbacks, stop taking it.
Introduce an opt-out mechanism for devices which use phylink (fbnic)
since phylink currently depends on rtnl_lock protection. Subsequent
patches will add more exceptions, anyway. Practically the new helpers
for judging if command needs rtnl_lock could also call
netdev_need_ops_lock() but I find that it makes the code in the callers
slightly less obvious.
Add a helper for IOCTLs already, even tho it's unused so that
we can keep them in sync as the series progresses.
This is the first user-visible step of moving ethtool ops out
from under rtnl. Subsequent patches do the same for SET ops,
as well as the ioctl path.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260605002912.3456868-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
dev->hwprov tracks the active hwtstamp provider for the device.
Make it ops protected (instance lock if the netdev driver opts
into holding instance lock around callbacks, otherwise rtnl_lock).
hwprov is written and read in:
- drivers/net/phy/phy_device.c
phydev and ops protection don't currently mix, add a comment
- net/ethtool/
as of now holds both rtnl lock and ops lock, this one will
soon only hold one lock or the other
read in:
- net/core/dev_ioctl.c
holds both rtnl lock and ops lock
- net/core/timestamping.c
RCU reader
The new netdev_ops_lock_dereference() helper does not have
"compat" in the name. The name would be quite long and I think
in this case it should be obvious that we need _a_ lock.
netdev_lock_dereference() already exists and means dev->lock
is always expected.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260605002912.3456868-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
phydev <> netdev linking and lifecycle depends on rtnl_lock.
We want to switch to instance locks for most ethtool ops.
Let's add an assert that ops locked devices don't use phydev
today. If one does we can either opt the phy ops out of
being purely ops locked, or do deeper surgery to make phy
locking ops-compatible. I don't think there's any fundamental
challenge to make that work.
Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260605002912.3456868-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Increase the pm_runtime autosuspend delay to be longer than the
timeout of the firmware's own inactivity timer.
There is no point attempting to pm_runtime suspend any sooner than
the firmware idle timeout because it would only mean the driver has
to poll waiting for the firmware idle.
Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com>
Link: https://patch.msgid.link/20260609122946.288103-1-rf@opensource.cirrus.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Add KVM_CAP_S390_HPAGE_2G to signal to userspace that 2G hugepages may
be used to back the guest; restrictions apply similar to 1M hugepages.
Enable the (for now still ignored) GMAP_FLAG_ALLOW_HPAGE_2G flag for
the guest gmap, and propagate / disable it as necessary.
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260609150930.665370-3-imbrenda@linux.ibm.com>
|
|
Add Raspberry Pi firmware voltage domain identifiers for the mailbox
property interface.
Also add the voltage request structure used with
RPI_FIRMWARE_GET_VOLTAGE so firmware clients can share the common API
definition from the firmware header.
Signed-off-by: Shubham Chakraborty <chakrabortyshubham66@gmail.com>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/20260517080445.103962-2-chakrabortyshubham66@gmail.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
|
|
Add support for guard() and scoped_guard() for the hwmon subsystem lock
to simplify its use.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
|
|
Pull rdma fixes from Jason Gunthorpe:
"Several significant bug fixes of pre-existing issues:
- Missing validation on ucap fd types passed from userspace
- Missing validation of HW DMA space vs userpace expected sizes in
EFA queue setup
- DMA corruption when using DMA block sizes >= 4G when setting up MRs
in all drivers
- Missing validation of CPU IDs when setting up dma handles
- Missing validation of IB_MR_REREG_ACCESS when changing writability
of a MR
- Missing validation of received message/packet size in ISER and SRP"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/srp: bound SRP_RSP sense copy by the received length
IB/isert: Reject login PDUs shorter than ISER_HEADERS_LEN
RDMA: During rereg_mr ensure that REREG_ACCESS is compatible
RDMA/core: Validate cpu_id against nr_cpu_ids in DMAH alloc
RDMA/umem: Fix truncation for block sizes >= 4G
RDMA/efa: Validate SQ ring size against max LLQ size
RDMA/core: Validate the passed in fops for ib_get_ucaps()
|
|
Commit e1c5ae59c0f2 ("fs: don't allow non-init s_user_ns for filesystems
without FS_USERNS_MOUNT") prevents the mount of any filesystem inside a
container that doesn't have FS_USERNS_MOUNT set.
This broke NFS mounts in our containerized environment. We have a daemon
somewhat like systemd-mountfsd running in the init_ns. A process does a
fsopen() inside the container and passes it to the daemon via unix
socket.
The daemon then vets that the request is for an allowed NFS server and
performs the mount. This now fails because the fc->user_ns is set to the
value in the container and NFS doesn't set FS_USERNS_MOUNT. We don't
want to add FS_USERNS_MOUNT to NFS since that would allow the container
to mount any NFS server (even malicious ones).
Add a new FS_USERNS_DELEGATABLE flag, and enable it on NFS.
Fixes: e1c5ae59c0f2 ("fs: don't allow non-init s_user_ns for filesystems without FS_USERNS_MOUNT")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260129-twmount-v1-1-4874ed2a15c4@kernel.org
Acked-by: Anna Schumaker <anna.schumaker@oracle.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/ti/linux into soc/drivers
TI SoC driver updates for v7.2
TI K3 TISCI:
- ti_sci: Add BOARDCFG_MANAGED mode for support system suspend/resume cycles
- ti_sci: Add support for restoring IRQ and clock contexts during resume.
- clk: keystone: sci-clk: Add clock restoration support.
SoC Drivers:
- k3-socinfo: Add support for identifying AM62P silicon variants via NVMEM,
along with corresponding dt-bindings update for nvmem-cells support
- k3-ringacc: Fix incorrect access mode for ring pop tail IO/proxy operations
Keystone Navigator (knav) Cleanup and Fixes:
- knav_qmss: Multiple code quality improvements
- knav_qmss_queue: Implement proper resource cleanup in the remove() path
General Cleanups:
- k3-ringacc: Use str_enabled_disabled() helper for consistency
- knav_qmss: Use %pe format specifier for PTR_ERR() printing
* tag 'ti-driver-soc-for-v7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/ti/linux:
firmware: ti_sci: Add support for restoring clock context during resume
clk: keystone: sci-clk: Add restore_context() operation
firmware: ti_sci: Add support for restoring IRQs during resume
firmware: ti_sci: Add BOARDCFG_MANAGED mode support
soc: ti: k3-ringacc: Use str_enabled_disabled() helper
soc: ti: knav_dma: Use IOMEM_ERR_PTR() in pktdma_get_regs()
soc: ti: knav_dma: Remove dead check on unsigned args.args[0]
soc: ti: knav_dma: Remove unused DMA_PRIO_MASK macro
soc: ti: knav_qmss_acc: Fix kernel-doc Return: tag
soc: ti: knav_qmss: Fix __iomem annotations and __be32 type
soc: ti: knav_qmss: Use %pe to print PTR_ERR()
soc: ti: knav_qmss: Fix kernel-doc Return: tags
soc: ti: knav_qmss: Inline lockdep condition in for_each_handle_rcu
soc: ti: knav_qmss: Rename global kdev to knav_qdev to fix -Wshadow
soc: ti: knav_qmss: Remove remaining redundant ENOMEM printks
soc: ti: knav_qmss_queue: Implement resource cleanup in remove()
soc: ti: k3-ringacc: Fix access mode for k3_ringacc_ring_pop_tail_io/proxy
soc: ti: knav_dma: fix all kernel-doc warnings in knav_dma.h
soc: ti: k3-socinfo: Add support for AM62P variants via NVMEM
dt-bindings: hwinfo: ti,k3-socinfo: Add nvmem-cells support
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux-mem-ctrl into soc/drivers
Memory controller drivers for v7.2
1. Tegra MC/EMC:
- Handle system sleep, necessary to re-program registers after system
resume. A few more improvements.
- Add Tegra114 and Tegra238 Memory Controller, and Tegra114 External
MC support.
- Grow Tegra264 support.
2. Renesas XSPI: Document RZ/T2H and RZ/N2H variants, compatible with
existing devices.
* tag 'memory-controller-drv-7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux-mem-ctrl:
memory: tegra264: Add full set of MC clients
dt-bindings: memory: tegra264: Add full set of MC client IDs
memory: tegra264: Skip clients without bpmp_id or type
memory: renesas-rpc-if: Fix duplicate device name on multi-instance platforms
dt-bindings: memory: renesas,rzg3e-xspi: Add RZ/T2H and RZ/N2H support
memory: omap-gpmc: Silence W=1 kerneldoc warnings
memory: tegra114-emc: Simplify tegra114_emc_interconnect_init() error message
memory: tegra114-emc: Do not print error on icc_node_create() failure
memory: tegra: Fix possible null pointer dereference
memory: tegra: Add Tegra114 EMC driver
memory: tegra: Implement EMEM regs and ICC ops for Tegra114
dt-bindings: memory: Document Tegra114 External Memory Controller
dt-bindings: memory: Document Tegra114 Memory Controller
memory: tegra: Add Tegra238 MC support
dt-bindings: memory: tegra: Add nvidia,tegra238-mc compatible
memory: tegra: Restore MC interrupt masks on resume
memory: tegra: Wire up system sleep PM ops
memory: tegra: Make ->resume() callback return void
memory: tegra: Deduplicate rate request management code
memory: atmel-ebi: Allow deferred probing
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux into soc/drivers
Samsung SoC drivers for v7.2
Improve Samsung Exynos (and Google GS101) ACPM (Alive Clock and Power
Manager) firmware driver:
1. Few code improvements.
2. Add support for protocol used to communicate with Thermal Management
Unit (TMU). This will allow to implement the thermal driver working
for newer Samsung Exynos and Google GS101 SoCs.
* tag 'samsung-drivers-7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux:
firmware: samsung: acpm: remove compile-testing stubs
firmware: samsung: acpm: Add devm_acpm_get_by_phandle helper
firmware: samsung: acpm: Add TMU protocol support
firmware: samsung: acpm: Make acpm_ops const and access via pointer
firmware: samsung: acpm: Drop redundant _ops suffix in acpm_ops members
firmware: samsung: acpm: Annotate rx_data->cmd with __counted_by_ptr
firmware: samsung: acpm: Consolidate transfer initialization helper
firmware: samsung: acpm: Fix infinite loop on sequence number exhaustion
firmware: samsung: acpm: Fix missing LKMM barriers in sequence allocator
firmware: samsung: acpm: Fix false timeouts and Use-After-Free in polling
firmware: samsung: acpm: Fix mailbox channel leak on probe error
firmware: samsung: acpm: Fix cross-thread RX length corruption
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux into soc/drivers
soc/tegra: pmc: Changes for v7.2-rc1
The bulk of these changes converts existing users to the modern variants
of the API that take a PMC instance as argument. This completes the
transition to multi-instance support, which then makes room for cleanups
and restricting the remaining legacy APIs to 32-bit platforms.
Some changes in this set also clean up powergate debugfs and restrict
the power-off handler to be installed only where appropriate. Lastly,
support for Tegra238 is added.
* tag 'tegra-for-7.2-pmc' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux:
soc/tegra: pmc: Add Tegra238 support
soc/tegra: pmc: Restrict power-off handler to Nexus 7
soc/tegra: pmc: Populate powergate debugfs only when needed
soc/tegra: pmc: Move legacy code behind CONFIG_ARM guard
soc/tegra: pmc: Remove unused legacy functions
soc/tegra: pmc: Create PMC context dynamically
usb: xhci: tegra: Explicitly specify PMC instance to use
PCI: tegra: Explicitly specify PMC instance to use
media: vde: Explicitly specify PMC instance to use
drm/tegra: Explicitly specify PMC instance to use
drm/nouveau: tegra: Explicitly specify PMC instance to use
ata: ahci_tegra: Explicitly specify PMC instance to use
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
There are no callers of fbcon outside fbdev. Move the declarations
into the internal header.
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Helge Deller <deller@gmx.de>
|
|
Handle console remapping in fbcon in fb_switch_output(). Vga-switcheroo
invokes this functionality before switching physical outputs to a new
graphics device. Open-coding fbcon state in vga-switcheroo exposed fbdev
implementation details.
Vga-switcheroo is used for switching physical outputs among graphics
hardware. This functionality is only supported by DRM drivers. A later
update will further move fb_switch_output() into DRM's fbdev emulation;
thus fully decoupling vga-switcheroo from fbdev.
v3:
- remove Kconfig dependency related to fbcon (Geert)
v2:
- use '#if defined' (Helge)
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Helge Deller <deller@gmx.de>
|
|
Handle fbcon during display updates in fb_set_var_from_user(). Check
with fbcon if the mode change is possible, update hardware state and
finally update fbcon. Update all callers.
Only the FBIOPUT_VSCREENINFO ioctl currently does all steps. Other
mode-changes callers in sysfs and driver code are missing fbcon-related
steps.
With the new helper, ps3fb and sh_mobile_lcdcfb no longer maintain
fbcon state themselves.
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Helge Deller <deller@gmx.de>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-devel into soc/dt
Renesas DTS updates for v7.2 (take two)
- Add timer (MTU3) and xSPI FLASH support for the RZ/T2H and RZ/N2H
SoCs and their EVK boards,
- Add PCIe support for the RZ/V2N SoC and the RZ/V2N EVK board,
- Add support for the R-Car M3Le SoC and the Geist development board,
- Specify ethernet PHY reset timings on various R-Car boards,
- Add (more) serial, I2C, DMA, and sound support for the RZ/G3L SoC,
- Add PSCI, Multifunctional Interface (MFIS), and SCMI support for the
R-Car X5H SoC and Ironhide development board,
- Add serial DMA support for the RZ/G2L SoC,
- Add keyboard, I2C, Versa clock, and audio support for the RZ/G3L
SMARC SoM and EVK boards,
- Miscellaneous fixes and improvements.
* tag 'renesas-dts-for-v7.2-tag2' of https://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-devel: (56 commits)
arm64: dts: renesas: r9a08g046l48-smarc: Enable audio
arm64: dts: renesas: rzg3l-smarc-som: Enable Versa clock generator
arm64: dts: renesas: r9a08g046l48-smarc: Enable I2C{2,3} devices
arm64: dts: renesas: r9a08g046l48-smarc: Add gpio keys
arm64: dts: renesas: rzt2h-n2h-evk: Enable xSPI nodes
arm64: dts: renesas: r9a09g087: Add xSPI nodes
arm64: dts: renesas: r9a09g077: Add xSPI nodes
arm64: dts: renesas: rzg3e-smarc-som: Sort GMAC pinmux entries
arm64: dts: renesas: r8a779md: Add support for R-Car M3Le R8A779MD Geist
arm64: dts: renesas: r9a07g044: Add DMA properties to serial nodes
arm64: dts: renesas: r9a07g054: Add max-frequency to SDHI nodes
arm64: dts: renesas: r9a07g044: Add max-frequency to SDHI nodes
arm64: dts: renesas: r9a07g043: Add max-frequency to SDHI nodes
arm64: dts: renesas: r9a08g046: Add rsci{0..3} device nodes
arm64: dts: renesas: ironhide: Enable to use SCMI
arm64: dts: renesas: r8a78000: Add MFIS, MFIS-SCP, and transport nodes
arm64: dts: renesas: ironhide: Describe all reserved memory
arm64: dts: renesas: rzt2h-n2h-evk: Configure eMMC/SDHI pins
arm64: dts: renesas: r8a78000: Fix GIC-720AE View 1 Redistributor description
arm64: dts: renesas: r8a78000: Add PSCI node
...
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
|
|
Decode the 'RECLAIM_ZONES' state in tracepoints, as of now only the
numerical state is shown in the tracepoints.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Print the type of the inode (directory, regular file, symlink, etc) to
facilitate debugging.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_sync_log() is one of the main functions called during a fsync.
Add trace events for when entering and exiting that function.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_log_new_name() is an important function that affects inode logging
and is called during link and rename operations. Add trace events for when
entering and exiting that function.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_record_new_subvolume() is an important operation that affects
inode logging and is called during subvolume creation. Add a trace event
for it to help debug issues.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_record_snapshot_destroy() is an important operation that affects
inode logging and is called during subvolume/snapshot deletion as well as
during rmdir. Add a trace event for it to help debug issues.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_record_unlink_dir() is an important operation that affects inode
logging and is called during unlink and rename operations. Add a trace
event for it to help debug issues.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
log_new_delayed_dentries() is an important step called during a fsync, as
well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
log_conflicting_inodes() is an important step called during a fsync, as
well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
add_conflicting_inode() is an important step called during a fsync, as
well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
find_skb() falls back to np->skb_pool when the GFP_ATOMIC alloc_skb()
fails. The pool is refilled by refill_skbs(), which always allocates
buffers of MAX_SKB_SIZE (ethhdr + iphdr + udphdr + MAX_UDP_CHUNK ==
1502 bytes).
netconsole, however, computes the requested length dynamically as
total_len + np->dev->needed_tailroom
If the egress device declares a non-zero needed_tailroom (e.g. some
tunnel or hardware accelerator devices), the required length can exceed
MAX_SKB_SIZE. The pooled skb is then handed back to the caller, which
immediately performs skb_put(skb, len), trips the tail > end check, and
triggers skb_over_panic().
Leave the normal alloc_skb(len, GFP_ATOMIC) path untouched -- the slab
allocator can still satisfy oversized requests when memory is available,
so senders to devices with non-zero needed_tailroom keep working in the
common case. Only the pool fallback is gated: when alloc_skb() failed
and len exceeds the pool buffer size, skip the skb_dequeue() instead of
burning a pre-allocated skb on a request that would later trip
skb_over_panic(). Reserving pool entries for requests they can actually
satisfy also keeps the panic path, which depends on the pool being
primed, intact.
When that drop happens, emit a rate-limited net_warn() so the user
notices that netconsole is unable to push messages on the egress device.
The warn is skipped under in_nmi() for the same reason schedule_work()
is: printk machinery taken by net_warn_ratelimited() is not NMI-safe and
would risk recursing into the same nbcon console we are servicing.
MAX_SKB_SIZE / MAX_UDP_CHUNK were private to net/core/netpoll.c. Move
them to include/linux/netpoll.h so netconsole can reference the same
definition that refill_skbs() uses, keeping the two in sync by
construction. The header now pulls in <linux/ip.h> and <linux/udp.h>
explicitly so MAX_SKB_SIZE remains self-contained for any future user.
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260604-netcons_fix_before_move-v3-2-ab055b3a6aa5@debian.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mediatek/linux into soc/drivers
MediaTek SoC driver updates
This adds subsys ID compatibility in MediaTek CMDQ, paving
the way for adding support for the MT8196 SoC, and fixes
the Multimedia System (MMSYS) routing masks for the MT8167
SoC.
* tag 'mtk-soc-for-v7.2' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mediatek/linux:
soc: mediatek: mtk-mmsys: Restore MT8167 routing masks lost during merge
soc: mediatek: mtk-cmdq: Add cmdq_pkt_jump_rel_temp() for removing shift_pa
soc: mediatek: Use pkt_write function pointer for subsys ID compatibility
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
|
|
Commit 2ccccf5fb43f ("net_sched: update hierarchical backlog too") renamed
qdisc_tree_decrease_qlen() to qdisc_tree_reduce_backlog(), but this comment
was missed. Update it.
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260605212239.2261320-1-victor@mojatatu.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Each record release via tls_strp_msg_done() triggered
tls_strp_check_rcv(), which called tls_rx_msg_ready() and
fired saved_data_ready(). During a multi-record receive, the
first N-1 wakeups are pure overhead: the caller is already
running and will pick up subsequent records on the next loop
iteration. The recvmsg and splice_read paths share this waste.
Suppress per-record notifications and emit a single one on
reader exit. tls_rx_rec_done() releases the current record
and parses the next without announcing; tls_strp_check_rcv()
gains a bool announce parameter so callers can request the
quiet form. tls_rx_reader_release() fires the deferred
announce on exit through tls_rx_msg_maybe_announce(), an
idempotent helper that calls saved_data_ready() only when a
record is parsed and has not yet been announced.
To keep the final notification idempotent against records that
the BH or the worker has already announced, tls_strparser gains
a msg_announced bit. tls_rx_msg_maybe_announce() sets the bit
when firing saved_data_ready(); the bit is cleared whenever
the parsed record is wiped, by tls_strp_msg_consume() on
consumption or by tls_strp_msg_load() when the lower socket
loses bytes from under the parse. A second call for the same
parsed record -- as when recvmsg() satisfies the request from
ctx->rx_list without touching the strparser -- becomes a
no-op.
With no remaining callers, tls_strp_msg_done() is removed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20260604-tls-read-sock-v12-5-b114efa6e3e2@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Because LLC wasn't complicated/annoying enough, there's 2 more
"ethertypes" being used for it:
- 0x8870 is pretty "normal", it got standardized in
802.1AC-2016/Cor1-2018 for transporting LLC frames > 1500 bytes.
It simply replaces the length value (which is no longer encoded, and
must now be derived from the packet.) The actual value dates back to
2001; https://datatracker.ietf.org/doc/html/draft-ietf-isis-ext-eth-01
(it was used without "proper" standardization for a long time)
- 0x00fe is a doozy - actually "invalid" depending on how you look at
it; it's used in GRE (and possibly GENEVE) tunnels to transport the
IS-IS routing protocol. https://seclists.org/tcpdump/2002/q4/61 is
the best/oldest source I could find. It's inspired by the 0xfe SAP
value, a GRE packet with protocol 0x00fe is followed by a payload "as
if" it was Ethernet with "<length> 0xfe 0xfe 0x03". (Again the length
isn't encoded explicitly anymore.)
The 0x00fe value is quite close to other values the kernel is using
internally for various things (after all they "won't clash for 1500
types"). Except this one does clash, and if someone unknowingly starts
using it for something internal... we end up in a world of pain in
getting IS-IS running on GRE tunnels. Hence the "WARNING".
Signed-off-by: David 'equinox' Lamparter <equinox@diac24.net>
Cc: Andrew Lunn <andrew+netdev@lunn.ch>
Link: https://patch.msgid.link/20260605164144.81184-1-equinox@diac24.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
mlx5_query_nic_vport_mac_list() sizes its firmware command buffer using
the PF's log_max_current_uc/mc_list capabilities. When querying a VF
vport with a larger configured max (via devlink), the firmware response
can overflow this buffer:
BUG: KASAN: slab-out-of-bounds in mlx5_query_nic_vport_mac_list+0x453/0x4c0 [mlx5_core]
Read of size 4 at addr ff1100013ffc8a12 by task kworker/u96:2/385
CPU: 12 UID: 0 PID: 385 Comm: kworker/u96:2 Not tainted 7.0.0-rc6+ #1 PREEMPT
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009)
Workqueue: mlx5_esw_wq esw_vport_change_handler [mlx5_core]
Call Trace:
<TASK>
dump_stack_lvl+0x69/0xa0
print_report+0x176/0x4e4
kasan_report+0xc8/0x100
mlx5_query_nic_vport_mac_list+0x453/0x4c0 [mlx5_core]
esw_update_vport_addr_list+0x2e3/0xda0 [mlx5_core]
esw_vport_change_handle_locked+0xa1f/0x1060 [mlx5_core]
esw_vport_change_handler+0x6a/0x90 [mlx5_core]
process_one_work+0x87f/0x15e0
worker_thread+0x62b/0x1020
kthread+0x375/0x490
ret_from_fork+0x4dc/0x810
ret_from_fork_asm+0x11/0x20
</TASK>
Fix by querying the vport's own HCA caps to size the buffer correctly.
Refactor the function to allocate and return the MAC list internally,
removing the caller's dependency on knowing the correct max.
Fixes: e16aea2744ab ("net/mlx5: Introduce access functions to modify/query vport mac lists")
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260604135849.458060-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
compact_gap() returns 2 << order, which is used as watermark headroom in
__compaction_suitable() and as a threshold in kswapd reclaim decisions.
The computed value scales exponentially by order. For order-9 THP
allocations this evaluates to 1024 pages, but the compaction free
scanner's working set is bounded by COMPACT_CLUSTER_MAX (32 pages). The
scanner stops isolating free pages once it matches the migration batch.
The current gap over-reserves by 32x.
On fragmented production hosts, kswapd will try to reclaim up to the gap,
but it only reaches that threshold in 18% of attempts. As a result,
reclaim continues in the majority of cases despite many lower-order free
pages being available. The over-sized gap also causes 46% of order-9
compaction suitability checks to fail unnecessarily: the zone has
sufficient free pages for the scanner to operate, but not enough to clear
the inflated threshold.
Cap compact_gap() at COMPACT_CLUSTER_MAX so the watermark headroom
reflects the scanner's actual capacity. This function is used by two key
heuristics. The first is when kswapd can stop high-order reclaim and
downgrade to order-0 balancing, allowing kcompactd to be woken for the
original higher allocation order. The second is zone suitability
checking, where the smaller gap allows compaction to start sooner.
Note that orders 0-4 are unaffected since their gap is already less than
or equal to COMPACT_CLUSTER_MAX.
A/B test on v6.13-based instagram production hosts (64GB, 60s
measurement):
Unpatched (43 hosts)
pgscan_kswapd (mean/host): ~1.6M
reclaim efficiency (steal/scan): 83.8%
per-compaction success (success/stall): 2.1%
THP success (alloc/alloc+fallback): 4.9%
forced lru_add_drain (mean/host): ~107K
Patched (59 hosts)
pgscan_kswapd (mean/host): ~449K
reclaim efficiency (steal/scan): 91.0%
per-compaction success (success/stall): 28.3%
THP success (alloc/alloc+fallback): 17.2%
forced lru_add_drain (mean/host): ~64K
Additional tests were also performed using a workload of similar shape and
based on mm-new at the time of testing. Across three 60s runs, the patch
showed improvements consistent with the previous test: reduced kswapd
reclaim and fewer THP fault fallbacks.
Unpatched
kswapd_shrink_node downgrade to order-0 (mean): 0
thp_fault_fallback (mean): 1217
pgscan_kswapd (mean): 6328
pgsteal_kswapd (mean): 5657
Patched
kswapd_shrink_node downgrade to order-0 (mean): 28
thp_fault_fallback (mean): 738
pgscan_kswapd (mean): 3773
pgsteal_kswapd (mean): 3243
Link: https://lore.kernel.org/20260604061725.13800-1-jp.kobryn@linux.dev
Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/swap, PM: hibernate: fix swapoff race in uswsusp by
pinning swap device", v8.
Currently, in the uswsusp path, only the swap type value is retrieved at
lookup time without holding a reference. If swapoff races after the type
is acquired, subsequent slot allocations operate on a stale swap device.
Additionally, grabbing and releasing the swap device reference on every
slot allocation is inefficient across the entire hibernation swap path.
This patch series addresses these issues:
- Patch 1: Fixes the swapoff race in uswsusp by pinning the swap device
from the point it is looked up until the session completes.
- Patch 2: Removes the overhead of per-slot reference counting in alloc/free
paths and cleans up the redundant SWP_WRITEOK check.
This patch (of 2):
Hibernation via uswsusp (/dev/snapshot ioctls) has a race window: after
selecting the resume swap area but before user space is frozen, swapoff
may run and invalidate the selected swap device.
Fix this by pinning the swap device with SWP_HIBERNATION while it is in
use. The pin is exclusive, which is sufficient since hibernate_acquire()
already prevents concurrent hibernation sessions.
The kernel swsusp path (sysfs-based hibernate/resume) uses
find_hibernation_swap_type() which is not affected by the pin. It freezes
user space before touching swap, so swapoff cannot race.
Introduce dedicated helpers:
- pin_hibernation_swap_type(): Look up and pin the swap device.
Used by the uswsusp path.
- find_hibernation_swap_type(): Lookup without pinning.
Used by the kernel swsusp path.
- unpin_hibernation_swap_type(): Clear the hibernation pin.
While a swap device is pinned, swapoff is prevented from proceeding.
Link: https://lore.kernel.org/20260323160822.1409904-1-youngjun.park@lge.com
Link: https://lore.kernel.org/20260323160822.1409904-2-youngjun.park@lge.com
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: "Rafael J . Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
find_vm_area() can return NULL if the given address is not a valid vmalloc
area. Check the return value before dereferencing it to avoid a kernel
crash.
Link: https://lore.kernel.org/20260529014130.671291-1-hui.zhu@linux.dev
Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The VMA flags bitmap is a single word today: NUM_VMA_FLAG_BITS is
BITS_PER_LONG, so on 32-bit vma_flags_t holds only 32 bits. (The bitmap
type exists so this can grow past BITS_PER_LONG later; until it does,
anything declared above the first word is out of range on 32-bit.) The bit
enum nevertheless declares some bits unconditionally above BITS_PER_LONG
-- VMA_UFFD_MINOR_BIT is 41, with VM_UFFD_MINOR == VM_NONE on 32-bit so no
VMA actually carries the bit.
__VMA_UFFD_FLAGS feeds VMA_UFFD_MINOR_BIT to mk_vma_flags()
unconditionally. On 32-bit that becomes __set_bit(41, &one_long), a write
one word past the end of the single-word bitmap. The compiler folds the
out-of-bounds store with wraparound (1UL << (41 % 32) == bit 9) into the
first word; bit 9 is already in __VMA_UFFD_FLAGS so the mask happens to
come out right today, but it is an out-of-bounds write all the same, and
any high-numbered bit whose mod-BITS_PER_LONG position is otherwise unused
would silently OR an extra bit into the mask.
Rather than feed bit numbers that may not exist on the current build to
mk_vma_flags(), build the mask from whole per-mode masks that collapse to
EMPTY_VMA_FLAGS when their feature is unavailable. Add
mk_vma_flags_from_masks() for that, and define VMA_UFFD_MISSING / _WP /
_MINOR alongside the VM_UFFD_* flags, gating VMA_UFFD_MINOR on the same
config as VM_UFFD_MINOR (which implies 64BIT, where bit 41 fits). An
out-of-range bit is then never materialised, on any arch, and the in-range
fast path stays a compile-time constant.
Link: https://lore.kernel.org/20260529172331.356655-7-kas@kernel.org
Fixes: 9ea35a25d51b ("mm: introduce VMA flags bitmap type")
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Reported-by: Sashiko AI review <sashiko-bot@kernel.org>
Suggested-by: Lorenzo Stoakes <ljs@kernel.org>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Assisted-by: Claude:claude-opus-4-8
Cc: David Hildenbrand <david@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
These comments have been wrong since commit a211c6550efc ("mm: page_alloc:
defrag_mode kswapd/kcompactd watermarks") added NR_FREE_PAGES_BLOCKS.
Since nobody has complained about it in the last year, it seems unlikely
these comments were particularly useful anyway, so delete them.
Link: https://lore.kernel.org/20260601-zone_stat_item-comment-v1-1-f452dd91d5eb@google.com
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
should_compact_retry() handles COMPACT_SKIPPED by asking
compaction_zonelist_suitable() whether reclaim can make a later compaction
attempt worthwhile. That answer is used for the current allocation, so it
should follow the same zone eligibility rules as the allocation itself.
When cpusets are enabled, allocator slowpath decisions are marked with
ALLOC_CPUSET. The allocation path, direct compaction and reclaim retry
all skip zones rejected by __cpuset_zone_allowed().
compaction_zonelist_suitable() does not apply that filter. It only walks
ac->zonelist/ac->nodemask, so it can return true because a zone that is
not usable for the current allocation would pass __compaction_suitable().
That does not let the allocation use the disallowed zone. Later
allocation and direct compaction paths still apply cpuset filtering.
However, it can make should_compact_retry() retry based on memory that
this allocation cannot use.
Pass gfp_mask down and apply the same ALLOC_CPUSET check in
compaction_zonelist_suitable(). This keeps the retry decision aligned
with the zones that the allocation is allowed to use.
A temporary debugfs probe was also used to call the old and new
compaction_zonelist_suitable() predicates in the same two-node NUMA guest.
The task was restricted to mems=0 while ac->nodemask covered nodes 0-1.
After putting pressure on node0, node0 failed __compaction_suitable() for
order-10 and node1 passed it, but node1 was rejected by
__cpuset_zone_allowed(). In that state the old predicate returned true
and the patched predicate returned false.
Link: https://lore.kernel.org/tencent_F59F2BA2CC5779308E10DF54593C736D3E0A@qq.com
Fixes: 435b3894e742 ("mm:page_alloc: fix the NULL ac->nodemask in __alloc_pages_slowpath()")
Signed-off-by: fujunjie <fujunjie1@qq.com>
Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The deferred split queue handles cgroups in a suboptimal fashion. The
queue is per-NUMA node or per-cgroup, not the intersection. That means on
a cgrouped system, a node-restricted allocation entering reclaim can end
up splitting large pages on other nodes:
alloc/unmap
deferred_split_folio()
list_add_tail(memcg->split_queue)
set_shrinker_bit(memcg, node, deferred_shrinker_id)
for_each_zone_zonelist_nodemask(restricted_nodes)
mem_cgroup_iter()
shrink_slab(node, memcg)
shrink_slab_memcg(node, memcg)
if test_shrinker_bit(memcg, node, deferred_shrinker_id)
deferred_split_scan()
walks memcg->split_queue
The shrinker bit adds an imperfect guard rail. As soon as the cgroup has
a single large page on the node of interest, all large pages owned by that
memcg, including those on other nodes, will be split.
list_lru properly sets up per-node, per-cgroup lists. As a bonus, it
streamlines a lot of the list operations and reclaim walks. It's used
widely by other major shrinkers already. Convert the deferred split queue
as well.
The list_lru per-memcg heads are instantiated on demand when the first
object of interest is allocated for a cgroup, by calling
folio_memcg_alloc_deferred(). Add calls to where splittable pages are
created: anon faults, swapin faults, khugepaged collapse.
These calls create all possible node heads for the cgroup at once, so the
migration code (between nodes) doesn't need any special care.
[akpm@linux-foundation.org: fix build with CONFIG_TRANSPARENT_HUGEPAGE=n]
Link: https://lore.kernel.org/202605281620.lc3rtkBm-lkp@intel.com
[hannes@cmpxchg.org: fix cgroup.memory=nokmem handling]
Link: https://lore.kernel.org/ah9PGv12mqai84ES@cmpxchg.org
Link: https://lore.kernel.org/20260527204757.2544958-10-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Tested-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Usama Arif <usama.arif@linux.dev>
Reviewed-by: Kairui Song <kasong@tencent.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Hildenbrand (Arm) <david@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nico Pache <npache@redhat.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
memcg_list_lru_alloc() is called every time an object that may end up on
the list_lru is created. It needs to quickly check if the list_lru heads
for the memcg already exist, and allocate them when they don't.
Doing this with folio objects is tricky: folio_memcg() is not stable and
requires either RCU protection or pinning the cgroup. But it's desirable
to make the existence check lightweight under RCU, and only pin the memcg
when we need to allocate list_lru heads and may block.
In preparation for switching the THP shrinker to list_lru, add a helper
function for allocating list_lru heads coming from a folio.
Link: https://lore.kernel.org/20260527204757.2544958-8-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nico Pache <npache@redhat.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Usama Arif <usama.arif@linux.dev>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Locking is currently internal to the list_lru API. However, a caller
might want to keep auxiliary state synchronized with the LRU state.
For example, the THP shrinker uses the lock of its custom LRU to keep
PG_partially_mapped and vmstats consistent.
To allow the THP shrinker to switch to list_lru, provide normal and
irqsafe locking primitives as well as caller-locked variants of the
addition and deletion functions.
Link: https://lore.kernel.org/20260527204757.2544958-7-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Liam R. Howlett (Oracle) <liam@infradead.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nico Pache <npache@redhat.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Usama Arif <usama.arif@linux.dev>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|