summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2026-06-09SUNRPC: Add helpers to convert xdr_buf byte ranges to scatterlistsChuck Lever
The crypto/krb5 library accepts data in scatterlist form, but the GSS-API layer presents RPC payloads as struct xdr_buf. Bridge that gap with a pair of helper functions: xdr_buf_to_sg() - populate a caller-supplied scatterlist array from a byte range xdr_buf_to_sg_alloc() - populate a caller-supplied inline scatterlist, chaining to a heap- allocated overflow for large payloads The inline array (typically stack-allocated at eight entries) covers the common case of small RPCs with no heap allocation on the encrypt/decrypt path. Only buffers spanning many pages incur a kmalloc for the chained extension. The segment-walking logic follows the same head, page array, tail traversal as xdr_process_buf(), but populates a scatterlist directly rather than invoking a per-segment callback. sg_next() traversal makes the walker safe for chained scatterlists. Once subsequent patches reroute all per-message crypto operations through crypto/krb5, xdr_process_buf() loses its last callers and is removed. Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-09NFSD: Add NFSD_CMD_UNLOCK_EXPORT netlink commandChuck Lever
When a filesystem is exported to NFS clients, NFSv4 state (opens, locks, delegations, layouts) holds references that prevent the underlying filesystem from being unmounted. NFSD_CMD_UNLOCK_FILESYSTEM addresses this at superblock granularity, but administrators unexporting a single path on a shared filesystem (e.g., one of several exports on the same device) need finer control. Add NFSD_CMD_UNLOCK_EXPORT, which revokes NFSv4 state acquired through exports of a specific path. Matching is by path identity (dentry + vfsmount) via the sc_export field on each nfs4_stid, so multiple svc_export objects for the same path -- one per auth_domain -- are handled correctly without requiring the caller to name a specific client. The command takes a single "path" attribute. Userspace (exportfs -u) sends this after removing the last client for a given path, enabling the underlying filesystem to be unmounted. When multiple clients share an export path, individual unexports do not trigger state revocation; only the final one does. Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-09NFSD: Add NFSD_CMD_UNLOCK_FILESYSTEM netlink commandChuck Lever
Add NFSD_CMD_UNLOCK_FILESYSTEM as a dedicated netlink command for revoking NFS state under a filesystem path, providing a netlink equivalent of /proc/fs/nfsd/unlock_fs. The command requires a "path" string attribute containing the filesystem path whose state should be released. The handler resolves the path to its superblock, then cancels async copies, releases NLM locks, and revokes NFSv4 state on that superblock. Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-09NFSD: Add NFSD_CMD_UNLOCK_IP netlink commandChuck Lever
The existing write_unlock_ip procfs interface releases NLM file locks held by a specific client IP address, but procfs provides no structured way to extend that operation to other scopes such as revoking NFSv4 state. Add NFSD_CMD_UNLOCK_IP as a dedicated netlink command for releasing NLM locks by client address. The command accepts a binary sockaddr_in or sockaddr_in6 in its address attribute. The handler validates the address family and length, then calls nlmsvc_unlock_all_by_ip() to release matching NLM locks. Because lockd is a single global instance, that call operates across all network namespaces regardless of which namespace the caller inhabits. A separate netlink command for filesystem-scoped unlock is added in a subsequent commit. The nfsd_ctl_unlock_ip tracepoint is updated from string-based address logging to __sockaddr, which stores the binary sockaddr and formats it with %pISpc. This affects both the new netlink path and the existing procfs write_unlock_ip path, giving consistent structured output in both cases. Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-09docs: net: ethtool: document ops-locked drivers and op_needs_rtnlJakub Kicinski
Catch up various bits of documentation after the locking changes. Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260605002912.3456868-13-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-09net: ethtool: optionally skip rtnl_lock on IOCTL pathJakub Kicinski
Convert the IOCTL path similarly to how we converted Netlink. The device lookup gets a little hairy. We could take rtnl_lock unconditionally and drop it before calling the driver (this would avoid the reference + liveness check). But I think being able to make progress even if rtnl is dead-locked is quite useful. First extra concern is handling features. List all the cmds which modify features and always take rtnl_lock. We could fold this list into ethtool_ioctl_needs_rtnl() but seems cleaner to keep ethtool_ioctl_needs_rtnl() driver-related. If a driver changed features and we were not holding rtnl_lock - warn about it. It can only happen on buggy ops locked drivers (buggy because they should have set appropriate "I need rtnl for op X" bit). Second wrinkle is the PHY ID hack which drops the locks while sleeping. Convert its static "busy" variable which used to be protected by rtnl_lock to a field in struct ethtool_netdev_state. This feature is about identifying an adapter or a port within a system, so being able to blink multiple LEDs at the same time is likely not very useful in practice. But it's the simplest fix, we can add a mutex if someone thinks a system should only be ID'ing one port at a time. Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260605002912.3456868-12-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-09net: ethtool: optionally skip rtnl_lock in ethnl_act_module_fw_flash()Jakub Kicinski
Module firmware flashing reads SFF-8024 identifier bytes via .get_module_eeprom_by_page(). Other than that it modifies a bit in the netdev->ethtool struct. Both should be ops-locked at this point. Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260605002912.3456868-9-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-09net: ethtool: optionally skip rtnl_lock on Netlink path for SET opsJakub Kicinski
Make ethtool not take rtnl_lock for SET commands when operation is performed on an ops-locked driver. cfg/cfg_pending are now ops-locked, since only ethtool modifies them. Some SET driver callbacks will still need rtnl_lock, most notably those which may end up calling netdev_update_features() or the qdisc layer (via netif_set_real_num_tx_queues()). Let drivers selectively opt back into the rtnl_lock with a new bitfield in ops. We need two helpers since Netlink and ioctl cmds have different values. Keep the helpers side by side in common.h to make sure they get updated together, even tho they will only get called from ioctl.c and netlink.c. SET commands which don't use ethnl_default_set_doit() are converted by subsequent commits. Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260605002912.3456868-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-09net: ethtool: optionally skip rtnl_lock on Netlink path for GET opsJakub Kicinski
ethnl_default_doit() and ethnl_default_dump_one() are both used exclusively for GET callbacks (former to get info for a single device or get global strings). ops-locked devices don't need rtnl_lock for GET callbacks, stop taking it. Introduce an opt-out mechanism for devices which use phylink (fbnic) since phylink currently depends on rtnl_lock protection. Subsequent patches will add more exceptions, anyway. Practically the new helpers for judging if command needs rtnl_lock could also call netdev_need_ops_lock() but I find that it makes the code in the callers slightly less obvious. Add a helper for IOCTLs already, even tho it's unused so that we can keep them in sync as the series progresses. This is the first user-visible step of moving ethtool ops out from under rtnl. Subsequent patches do the same for SET ops, as well as the ioctl path. Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260605002912.3456868-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-09net: ethtool: make dev->hwprov ops-protectedJakub Kicinski
dev->hwprov tracks the active hwtstamp provider for the device. Make it ops protected (instance lock if the netdev driver opts into holding instance lock around callbacks, otherwise rtnl_lock). hwprov is written and read in: - drivers/net/phy/phy_device.c phydev and ops protection don't currently mix, add a comment - net/ethtool/ as of now holds both rtnl lock and ops lock, this one will soon only hold one lock or the other read in: - net/core/dev_ioctl.c holds both rtnl lock and ops lock - net/core/timestamping.c RCU reader The new netdev_ops_lock_dereference() helper does not have "compat" in the name. The name would be quite long and I think in this case it should be obvious that we need _a_ lock. netdev_lock_dereference() already exists and means dev->lock is always expected. Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260605002912.3456868-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-09net: ethtool: relax ethnl_req_get_phydev() locking assertionJakub Kicinski
phydev <> netdev linking and lifecycle depends on rtnl_lock. We want to switch to instance locks for most ethtool ops. Let's add an assert that ops locked devices don't use phydev today. If one does we can either opt the phy ops out of being purely ops locked, or do deeper surgery to make phy locking ops-compatible. I don't think there's any fundamental challenge to make that work. Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260605002912.3456868-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-09ASoC: cs35l56: Increase pm_runtime autosuspend delayRichard Fitzgerald
Increase the pm_runtime autosuspend delay to be longer than the timeout of the firmware's own inactivity timer. There is no point attempting to pm_runtime suspend any sooner than the firmware idle timeout because it would only mean the driver has to poll waiting for the firmware idle. Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com> Link: https://patch.msgid.link/20260609122946.288103-1-rf@opensource.cirrus.com Signed-off-by: Mark Brown <broonie@kernel.org>
2026-06-09KVM: s390: Add capability to support 2G hugepagesClaudio Imbrenda
Add KVM_CAP_S390_HPAGE_2G to signal to userspace that 2G hugepages may be used to back the guest; restrictions apply similar to 1M hugepages. Enable the (for now still ignored) GMAP_FLAG_ALLOW_HPAGE_2G flag for the guest gmap, and propagate / disable it as necessary. Reviewed-by: Steffen Eiden <seiden@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Message-ID: <20260609150930.665370-3-imbrenda@linux.ibm.com>
2026-06-09soc: bcm2835: raspberrypi-firmware: Add voltage domain IDsShubham Chakraborty
Add Raspberry Pi firmware voltage domain identifiers for the mailbox property interface. Also add the voltage request structure used with RPI_FIRMWARE_GET_VOLTAGE so firmware clients can share the common API definition from the firmware header. Signed-off-by: Shubham Chakraborty <chakrabortyshubham66@gmail.com> Acked-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://lore.kernel.org/r/20260517080445.103962-2-chakrabortyshubham66@gmail.com Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2026-06-09hwmon: Support guard() and scoped_guard for subsystem locksGuenter Roeck
Add support for guard() and scoped_guard() for the hwmon subsystem lock to simplify its use. Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2026-06-09Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma fixes from Jason Gunthorpe: "Several significant bug fixes of pre-existing issues: - Missing validation on ucap fd types passed from userspace - Missing validation of HW DMA space vs userpace expected sizes in EFA queue setup - DMA corruption when using DMA block sizes >= 4G when setting up MRs in all drivers - Missing validation of CPU IDs when setting up dma handles - Missing validation of IB_MR_REREG_ACCESS when changing writability of a MR - Missing validation of received message/packet size in ISER and SRP" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: RDMA/srp: bound SRP_RSP sense copy by the received length IB/isert: Reject login PDUs shorter than ISER_HEADERS_LEN RDMA: During rereg_mr ensure that REREG_ACCESS is compatible RDMA/core: Validate cpu_id against nr_cpu_ids in DMAH alloc RDMA/umem: Fix truncation for block sizes >= 4G RDMA/efa: Validate SQ ring size against max LLQ size RDMA/core: Validate the passed in fops for ib_get_ucaps()
2026-06-09vfs: add FS_USERNS_DELEGATABLE flag and set it for NFSJeff Layton
Commit e1c5ae59c0f2 ("fs: don't allow non-init s_user_ns for filesystems without FS_USERNS_MOUNT") prevents the mount of any filesystem inside a container that doesn't have FS_USERNS_MOUNT set. This broke NFS mounts in our containerized environment. We have a daemon somewhat like systemd-mountfsd running in the init_ns. A process does a fsopen() inside the container and passes it to the daemon via unix socket. The daemon then vets that the request is for an allowed NFS server and performs the mount. This now fails because the fc->user_ns is set to the value in the container and NFS doesn't set FS_USERNS_MOUNT. We don't want to add FS_USERNS_MOUNT to NFS since that would allow the container to mount any NFS server (even malicious ones). Add a new FS_USERNS_DELEGATABLE flag, and enable it on NFS. Fixes: e1c5ae59c0f2 ("fs: don't allow non-init s_user_ns for filesystems without FS_USERNS_MOUNT") Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260129-twmount-v1-1-4874ed2a15c4@kernel.org Acked-by: Anna Schumaker <anna.schumaker@oracle.com> Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-06-09Merge tag 'ti-driver-soc-for-v7.2' of ↵Arnd Bergmann
https://git.kernel.org/pub/scm/linux/kernel/git/ti/linux into soc/drivers TI SoC driver updates for v7.2 TI K3 TISCI: - ti_sci: Add BOARDCFG_MANAGED mode for support system suspend/resume cycles - ti_sci: Add support for restoring IRQ and clock contexts during resume. - clk: keystone: sci-clk: Add clock restoration support. SoC Drivers: - k3-socinfo: Add support for identifying AM62P silicon variants via NVMEM, along with corresponding dt-bindings update for nvmem-cells support - k3-ringacc: Fix incorrect access mode for ring pop tail IO/proxy operations Keystone Navigator (knav) Cleanup and Fixes: - knav_qmss: Multiple code quality improvements - knav_qmss_queue: Implement proper resource cleanup in the remove() path General Cleanups: - k3-ringacc: Use str_enabled_disabled() helper for consistency - knav_qmss: Use %pe format specifier for PTR_ERR() printing * tag 'ti-driver-soc-for-v7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/ti/linux: firmware: ti_sci: Add support for restoring clock context during resume clk: keystone: sci-clk: Add restore_context() operation firmware: ti_sci: Add support for restoring IRQs during resume firmware: ti_sci: Add BOARDCFG_MANAGED mode support soc: ti: k3-ringacc: Use str_enabled_disabled() helper soc: ti: knav_dma: Use IOMEM_ERR_PTR() in pktdma_get_regs() soc: ti: knav_dma: Remove dead check on unsigned args.args[0] soc: ti: knav_dma: Remove unused DMA_PRIO_MASK macro soc: ti: knav_qmss_acc: Fix kernel-doc Return: tag soc: ti: knav_qmss: Fix __iomem annotations and __be32 type soc: ti: knav_qmss: Use %pe to print PTR_ERR() soc: ti: knav_qmss: Fix kernel-doc Return: tags soc: ti: knav_qmss: Inline lockdep condition in for_each_handle_rcu soc: ti: knav_qmss: Rename global kdev to knav_qdev to fix -Wshadow soc: ti: knav_qmss: Remove remaining redundant ENOMEM printks soc: ti: knav_qmss_queue: Implement resource cleanup in remove() soc: ti: k3-ringacc: Fix access mode for k3_ringacc_ring_pop_tail_io/proxy soc: ti: knav_dma: fix all kernel-doc warnings in knav_dma.h soc: ti: k3-socinfo: Add support for AM62P variants via NVMEM dt-bindings: hwinfo: ti,k3-socinfo: Add nvmem-cells support Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2026-06-09Merge tag 'memory-controller-drv-7.2' of ↵Arnd Bergmann
https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux-mem-ctrl into soc/drivers Memory controller drivers for v7.2 1. Tegra MC/EMC: - Handle system sleep, necessary to re-program registers after system resume. A few more improvements. - Add Tegra114 and Tegra238 Memory Controller, and Tegra114 External MC support. - Grow Tegra264 support. 2. Renesas XSPI: Document RZ/T2H and RZ/N2H variants, compatible with existing devices. * tag 'memory-controller-drv-7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux-mem-ctrl: memory: tegra264: Add full set of MC clients dt-bindings: memory: tegra264: Add full set of MC client IDs memory: tegra264: Skip clients without bpmp_id or type memory: renesas-rpc-if: Fix duplicate device name on multi-instance platforms dt-bindings: memory: renesas,rzg3e-xspi: Add RZ/T2H and RZ/N2H support memory: omap-gpmc: Silence W=1 kerneldoc warnings memory: tegra114-emc: Simplify tegra114_emc_interconnect_init() error message memory: tegra114-emc: Do not print error on icc_node_create() failure memory: tegra: Fix possible null pointer dereference memory: tegra: Add Tegra114 EMC driver memory: tegra: Implement EMEM regs and ICC ops for Tegra114 dt-bindings: memory: Document Tegra114 External Memory Controller dt-bindings: memory: Document Tegra114 Memory Controller memory: tegra: Add Tegra238 MC support dt-bindings: memory: tegra: Add nvidia,tegra238-mc compatible memory: tegra: Restore MC interrupt masks on resume memory: tegra: Wire up system sleep PM ops memory: tegra: Make ->resume() callback return void memory: tegra: Deduplicate rate request management code memory: atmel-ebi: Allow deferred probing Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2026-06-09Merge tag 'samsung-drivers-7.2' of ↵Arnd Bergmann
https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux into soc/drivers Samsung SoC drivers for v7.2 Improve Samsung Exynos (and Google GS101) ACPM (Alive Clock and Power Manager) firmware driver: 1. Few code improvements. 2. Add support for protocol used to communicate with Thermal Management Unit (TMU). This will allow to implement the thermal driver working for newer Samsung Exynos and Google GS101 SoCs. * tag 'samsung-drivers-7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux: firmware: samsung: acpm: remove compile-testing stubs firmware: samsung: acpm: Add devm_acpm_get_by_phandle helper firmware: samsung: acpm: Add TMU protocol support firmware: samsung: acpm: Make acpm_ops const and access via pointer firmware: samsung: acpm: Drop redundant _ops suffix in acpm_ops members firmware: samsung: acpm: Annotate rx_data->cmd with __counted_by_ptr firmware: samsung: acpm: Consolidate transfer initialization helper firmware: samsung: acpm: Fix infinite loop on sequence number exhaustion firmware: samsung: acpm: Fix missing LKMM barriers in sequence allocator firmware: samsung: acpm: Fix false timeouts and Use-After-Free in polling firmware: samsung: acpm: Fix mailbox channel leak on probe error firmware: samsung: acpm: Fix cross-thread RX length corruption Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2026-06-09Merge tag 'tegra-for-7.2-pmc' of ↵Arnd Bergmann
git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux into soc/drivers soc/tegra: pmc: Changes for v7.2-rc1 The bulk of these changes converts existing users to the modern variants of the API that take a PMC instance as argument. This completes the transition to multi-instance support, which then makes room for cleanups and restricting the remaining legacy APIs to 32-bit platforms. Some changes in this set also clean up powergate debugfs and restrict the power-off handler to be installed only where appropriate. Lastly, support for Tegra238 is added. * tag 'tegra-for-7.2-pmc' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux: soc/tegra: pmc: Add Tegra238 support soc/tegra: pmc: Restrict power-off handler to Nexus 7 soc/tegra: pmc: Populate powergate debugfs only when needed soc/tegra: pmc: Move legacy code behind CONFIG_ARM guard soc/tegra: pmc: Remove unused legacy functions soc/tegra: pmc: Create PMC context dynamically usb: xhci: tegra: Explicitly specify PMC instance to use PCI: tegra: Explicitly specify PMC instance to use media: vde: Explicitly specify PMC instance to use drm/tegra: Explicitly specify PMC instance to use drm/nouveau: tegra: Explicitly specify PMC instance to use ata: ahci_tegra: Explicitly specify PMC instance to use Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2026-06-09fbdev: Do not export fbcon from fbdevThomas Zimmermann
There are no callers of fbcon outside fbdev. Move the declarations into the internal header. Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Helge Deller <deller@gmx.de>
2026-06-09fbdev: Wrap fbcon updates from vga-switcheroo in helperThomas Zimmermann
Handle console remapping in fbcon in fb_switch_output(). Vga-switcheroo invokes this functionality before switching physical outputs to a new graphics device. Open-coding fbcon state in vga-switcheroo exposed fbdev implementation details. Vga-switcheroo is used for switching physical outputs among graphics hardware. This functionality is only supported by DRM drivers. A later update will further move fb_switch_output() into DRM's fbdev emulation; thus fully decoupling vga-switcheroo from fbdev. v3: - remove Kconfig dependency related to fbcon (Geert) v2: - use '#if defined' (Helge) Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Helge Deller <deller@gmx.de>
2026-06-09fbdev: Wrap user-invoked calls to fb_set_var() in helperThomas Zimmermann
Handle fbcon during display updates in fb_set_var_from_user(). Check with fbcon if the mode change is possible, update hardware state and finally update fbcon. Update all callers. Only the FBIOPUT_VSCREENINFO ioctl currently does all steps. Other mode-changes callers in sysfs and driver code are missing fbcon-related steps. With the new helper, ps3fb and sh_mobile_lcdcfb no longer maintain fbcon state themselves. Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Helge Deller <deller@gmx.de>
2026-06-09Merge tag 'renesas-dts-for-v7.2-tag2' of ↵Krzysztof Kozlowski
https://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-devel into soc/dt Renesas DTS updates for v7.2 (take two) - Add timer (MTU3) and xSPI FLASH support for the RZ/T2H and RZ/N2H SoCs and their EVK boards, - Add PCIe support for the RZ/V2N SoC and the RZ/V2N EVK board, - Add support for the R-Car M3Le SoC and the Geist development board, - Specify ethernet PHY reset timings on various R-Car boards, - Add (more) serial, I2C, DMA, and sound support for the RZ/G3L SoC, - Add PSCI, Multifunctional Interface (MFIS), and SCMI support for the R-Car X5H SoC and Ironhide development board, - Add serial DMA support for the RZ/G2L SoC, - Add keyboard, I2C, Versa clock, and audio support for the RZ/G3L SMARC SoM and EVK boards, - Miscellaneous fixes and improvements. * tag 'renesas-dts-for-v7.2-tag2' of https://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-devel: (56 commits) arm64: dts: renesas: r9a08g046l48-smarc: Enable audio arm64: dts: renesas: rzg3l-smarc-som: Enable Versa clock generator arm64: dts: renesas: r9a08g046l48-smarc: Enable I2C{2,3} devices arm64: dts: renesas: r9a08g046l48-smarc: Add gpio keys arm64: dts: renesas: rzt2h-n2h-evk: Enable xSPI nodes arm64: dts: renesas: r9a09g087: Add xSPI nodes arm64: dts: renesas: r9a09g077: Add xSPI nodes arm64: dts: renesas: rzg3e-smarc-som: Sort GMAC pinmux entries arm64: dts: renesas: r8a779md: Add support for R-Car M3Le R8A779MD Geist arm64: dts: renesas: r9a07g044: Add DMA properties to serial nodes arm64: dts: renesas: r9a07g054: Add max-frequency to SDHI nodes arm64: dts: renesas: r9a07g044: Add max-frequency to SDHI nodes arm64: dts: renesas: r9a07g043: Add max-frequency to SDHI nodes arm64: dts: renesas: r9a08g046: Add rsci{0..3} device nodes arm64: dts: renesas: ironhide: Enable to use SCMI arm64: dts: renesas: r8a78000: Add MFIS, MFIS-SCP, and transport nodes arm64: dts: renesas: ironhide: Describe all reserved memory arm64: dts: renesas: rzt2h-n2h-evk: Configure eMMC/SDHI pins arm64: dts: renesas: r8a78000: Fix GIC-720AE View 1 Redistributor description arm64: dts: renesas: r8a78000: Add PSCI node ... Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
2026-06-09btrfs: zoned: decode 'RECLAIM_ZONES' state in tracepointsJohannes Thumshirn
Decode the 'RECLAIM_ZONES' state in tracepoints, as of now only the numerical state is shown in the tracepoints. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-06-09btrfs: tracepoints: show inode type in btrfs_sync_file_enter() eventFilipe Manana
Print the type of the inode (directory, regular file, symlink, etc) to facilitate debugging. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-06-09btrfs: tracepoints: add trace event for btrfs_sync_log()Filipe Manana
btrfs_sync_log() is one of the main functions called during a fsync. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-06-09btrfs: tracepoints: add trace event for btrfs_log_new_name()Filipe Manana
btrfs_log_new_name() is an important function that affects inode logging and is called during link and rename operations. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-06-09btrfs: tracepoints: add trace event for btrfs_record_new_subvolume()Filipe Manana
btrfs_record_new_subvolume() is an important operation that affects inode logging and is called during subvolume creation. Add a trace event for it to help debug issues. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-06-09btrfs: tracepoints: add trace event for btrfs_record_snapshot_destroy()Filipe Manana
btrfs_record_snapshot_destroy() is an important operation that affects inode logging and is called during subvolume/snapshot deletion as well as during rmdir. Add a trace event for it to help debug issues. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-06-09btrfs: tracepoints: add trace event for btrfs_record_unlink_dir()Filipe Manana
btrfs_record_unlink_dir() is an important operation that affects inode logging and is called during unlink and rename operations. Add a trace event for it to help debug issues. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-06-09btrfs: tracepoints: add trace event for log_new_delayed_dentries()Filipe Manana
log_new_delayed_dentries() is an important step called during a fsync, as well as during rename and link operations on inodes that were previously logged. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-06-09btrfs: tracepoints: add trace event for log_conflicting_inodes()Filipe Manana
log_conflicting_inodes() is an important step called during a fsync, as well as during rename and link operations on inodes that were previously logged. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-06-09btrfs: tracepoints: add trace event for add_conflicting_inode()Filipe Manana
add_conflicting_inode() is an important step called during a fsync, as well as during rename and link operations on inodes that were previously logged. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-06-09netconsole: do not dequeue pooled skbs that cannot satisfy lenBreno Leitao
find_skb() falls back to np->skb_pool when the GFP_ATOMIC alloc_skb() fails. The pool is refilled by refill_skbs(), which always allocates buffers of MAX_SKB_SIZE (ethhdr + iphdr + udphdr + MAX_UDP_CHUNK == 1502 bytes). netconsole, however, computes the requested length dynamically as total_len + np->dev->needed_tailroom If the egress device declares a non-zero needed_tailroom (e.g. some tunnel or hardware accelerator devices), the required length can exceed MAX_SKB_SIZE. The pooled skb is then handed back to the caller, which immediately performs skb_put(skb, len), trips the tail > end check, and triggers skb_over_panic(). Leave the normal alloc_skb(len, GFP_ATOMIC) path untouched -- the slab allocator can still satisfy oversized requests when memory is available, so senders to devices with non-zero needed_tailroom keep working in the common case. Only the pool fallback is gated: when alloc_skb() failed and len exceeds the pool buffer size, skip the skb_dequeue() instead of burning a pre-allocated skb on a request that would later trip skb_over_panic(). Reserving pool entries for requests they can actually satisfy also keeps the panic path, which depends on the pool being primed, intact. When that drop happens, emit a rate-limited net_warn() so the user notices that netconsole is unable to push messages on the egress device. The warn is skipped under in_nmi() for the same reason schedule_work() is: printk machinery taken by net_warn_ratelimited() is not NMI-safe and would risk recursing into the same nbcon console we are servicing. MAX_SKB_SIZE / MAX_UDP_CHUNK were private to net/core/netpoll.c. Move them to include/linux/netpoll.h so netconsole can reference the same definition that refill_skbs() uses, keeping the two in sync by construction. The header now pulls in <linux/ip.h> and <linux/udp.h> explicitly so MAX_SKB_SIZE remains self-contained for any future user. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20260604-netcons_fix_before_move-v3-2-ab055b3a6aa5@debian.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-06-09Merge tag 'mtk-soc-for-v7.2' of ↵Krzysztof Kozlowski
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mediatek/linux into soc/drivers MediaTek SoC driver updates This adds subsys ID compatibility in MediaTek CMDQ, paving the way for adding support for the MT8196 SoC, and fixes the Multimedia System (MMSYS) routing masks for the MT8167 SoC. * tag 'mtk-soc-for-v7.2' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mediatek/linux: soc: mediatek: mtk-mmsys: Restore MT8167 routing masks lost during merge soc: mediatek: mtk-cmdq: Add cmdq_pkt_jump_rel_temp() for removing shift_pa soc: mediatek: Use pkt_write function pointer for subsys ID compatibility Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
2026-06-09net/sched: Update function name in TCQ_F_NOPARENT commentVictor Nogueira
Commit 2ccccf5fb43f ("net_sched: update hierarchical backlog too") renamed qdisc_tree_decrease_qlen() to qdisc_tree_reduce_backlog(), but this comment was missed. Update it. Signed-off-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20260605212239.2261320-1-victor@mojatatu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-06-08tls: Suppress spurious saved_data_ready on all receive pathsChuck Lever
Each record release via tls_strp_msg_done() triggered tls_strp_check_rcv(), which called tls_rx_msg_ready() and fired saved_data_ready(). During a multi-record receive, the first N-1 wakeups are pure overhead: the caller is already running and will pick up subsequent records on the next loop iteration. The recvmsg and splice_read paths share this waste. Suppress per-record notifications and emit a single one on reader exit. tls_rx_rec_done() releases the current record and parses the next without announcing; tls_strp_check_rcv() gains a bool announce parameter so callers can request the quiet form. tls_rx_reader_release() fires the deferred announce on exit through tls_rx_msg_maybe_announce(), an idempotent helper that calls saved_data_ready() only when a record is parsed and has not yet been announced. To keep the final notification idempotent against records that the BH or the worker has already announced, tls_strparser gains a msg_announced bit. tls_rx_msg_maybe_announce() sets the bit when firing saved_data_ready(); the bit is cleared whenever the parsed record is wiped, by tls_strp_msg_consume() on consumption or by tls_strp_msg_load() when the lower socket loses bytes from under the parse. A second call for the same parsed record -- as when recvmsg() satisfies the request from ctx->rx_list without touching the strparser -- becomes a no-op. With no remaining callers, tls_strp_msg_done() is removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20260604-tls-read-sock-v12-5-b114efa6e3e2@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-08if_ether.h: add 802.1AC, warn about GRE 0x00FEDavid 'equinox' Lamparter
Because LLC wasn't complicated/annoying enough, there's 2 more "ethertypes" being used for it: - 0x8870 is pretty "normal", it got standardized in 802.1AC-2016/Cor1-2018 for transporting LLC frames > 1500 bytes. It simply replaces the length value (which is no longer encoded, and must now be derived from the packet.) The actual value dates back to 2001; https://datatracker.ietf.org/doc/html/draft-ietf-isis-ext-eth-01 (it was used without "proper" standardization for a long time) - 0x00fe is a doozy - actually "invalid" depending on how you look at it; it's used in GRE (and possibly GENEVE) tunnels to transport the IS-IS routing protocol. https://seclists.org/tcpdump/2002/q4/61 is the best/oldest source I could find. It's inspired by the 0xfe SAP value, a GRE packet with protocol 0x00fe is followed by a payload "as if" it was Ethernet with "<length> 0xfe 0xfe 0x03". (Again the length isn't encoded explicitly anymore.) The 0x00fe value is quite close to other values the kernel is using internally for various things (after all they "won't clash for 1500 types"). Except this one does clash, and if someone unknowingly starts using it for something internal... we end up in a world of pain in getting IS-IS running on GRE tunnels. Hence the "WARNING". Signed-off-by: David 'equinox' Lamparter <equinox@diac24.net> Cc: Andrew Lunn <andrew+netdev@lunn.ch> Link: https://patch.msgid.link/20260605164144.81184-1-equinox@diac24.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-08net/mlx5: Fix slab-out-of-bounds in mlx5_query_nic_vport_mac_listDragos Tatulea
mlx5_query_nic_vport_mac_list() sizes its firmware command buffer using the PF's log_max_current_uc/mc_list capabilities. When querying a VF vport with a larger configured max (via devlink), the firmware response can overflow this buffer: BUG: KASAN: slab-out-of-bounds in mlx5_query_nic_vport_mac_list+0x453/0x4c0 [mlx5_core] Read of size 4 at addr ff1100013ffc8a12 by task kworker/u96:2/385 CPU: 12 UID: 0 PID: 385 Comm: kworker/u96:2 Not tainted 7.0.0-rc6+ #1 PREEMPT Hardware name: QEMU Standard PC (Q35 + ICH9, 2009) Workqueue: mlx5_esw_wq esw_vport_change_handler [mlx5_core] Call Trace: <TASK> dump_stack_lvl+0x69/0xa0 print_report+0x176/0x4e4 kasan_report+0xc8/0x100 mlx5_query_nic_vport_mac_list+0x453/0x4c0 [mlx5_core] esw_update_vport_addr_list+0x2e3/0xda0 [mlx5_core] esw_vport_change_handle_locked+0xa1f/0x1060 [mlx5_core] esw_vport_change_handler+0x6a/0x90 [mlx5_core] process_one_work+0x87f/0x15e0 worker_thread+0x62b/0x1020 kthread+0x375/0x490 ret_from_fork+0x4dc/0x810 ret_from_fork_asm+0x11/0x20 </TASK> Fix by querying the vport's own HCA caps to size the buffer correctly. Refactor the function to allocate and return the MAC list internally, removing the caller's dependency on knowing the correct max. Fixes: e16aea2744ab ("net/mlx5: Introduce access functions to modify/query vport mac lists") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260604135849.458060-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-08mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAXJP Kobryn
compact_gap() returns 2 << order, which is used as watermark headroom in __compaction_suitable() and as a threshold in kswapd reclaim decisions. The computed value scales exponentially by order. For order-9 THP allocations this evaluates to 1024 pages, but the compaction free scanner's working set is bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops isolating free pages once it matches the migration batch. The current gap over-reserves by 32x. On fragmented production hosts, kswapd will try to reclaim up to the gap, but it only reaches that threshold in 18% of attempts. As a result, reclaim continues in the majority of cases despite many lower-order free pages being available. The over-sized gap also causes 46% of order-9 compaction suitability checks to fail unnecessarily: the zone has sufficient free pages for the scanner to operate, but not enough to clear the inflated threshold. Cap compact_gap() at COMPACT_CLUSTER_MAX so the watermark headroom reflects the scanner's actual capacity. This function is used by two key heuristics. The first is when kswapd can stop high-order reclaim and downgrade to order-0 balancing, allowing kcompactd to be woken for the original higher allocation order. The second is zone suitability checking, where the smaller gap allows compaction to start sooner. Note that orders 0-4 are unaffected since their gap is already less than or equal to COMPACT_CLUSTER_MAX. A/B test on v6.13-based instagram production hosts (64GB, 60s measurement): Unpatched (43 hosts) pgscan_kswapd (mean/host): ~1.6M reclaim efficiency (steal/scan): 83.8% per-compaction success (success/stall): 2.1% THP success (alloc/alloc+fallback): 4.9% forced lru_add_drain (mean/host): ~107K Patched (59 hosts) pgscan_kswapd (mean/host): ~449K reclaim efficiency (steal/scan): 91.0% per-compaction success (success/stall): 28.3% THP success (alloc/alloc+fallback): 17.2% forced lru_add_drain (mean/host): ~64K Additional tests were also performed using a workload of similar shape and based on mm-new at the time of testing. Across three 60s runs, the patch showed improvements consistent with the previous test: reduced kswapd reclaim and fewer THP fault fallbacks. Unpatched kswapd_shrink_node downgrade to order-0 (mean): 0 thp_fault_fallback (mean): 1217 pgscan_kswapd (mean): 6328 pgsteal_kswapd (mean): 5657 Patched kswapd_shrink_node downgrade to order-0 (mean): 28 thp_fault_fallback (mean): 738 pgscan_kswapd (mean): 3773 pgsteal_kswapd (mean): 3243 Link: https://lore.kernel.org/20260604061725.13800-1-jp.kobryn@linux.dev Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-08mm/swap, PM: hibernate: fix swapoff race in uswsusp by pinning swap deviceYoungjun Park
Patch series "mm/swap, PM: hibernate: fix swapoff race in uswsusp by pinning swap device", v8. Currently, in the uswsusp path, only the swap type value is retrieved at lookup time without holding a reference. If swapoff races after the type is acquired, subsequent slot allocations operate on a stale swap device. Additionally, grabbing and releasing the swap device reference on every slot allocation is inefficient across the entire hibernation swap path. This patch series addresses these issues: - Patch 1: Fixes the swapoff race in uswsusp by pinning the swap device from the point it is looked up until the session completes. - Patch 2: Removes the overhead of per-slot reference counting in alloc/free paths and cleans up the redundant SWP_WRITEOK check. This patch (of 2): Hibernation via uswsusp (/dev/snapshot ioctls) has a race window: after selecting the resume swap area but before user space is frozen, swapoff may run and invalidate the selected swap device. Fix this by pinning the swap device with SWP_HIBERNATION while it is in use. The pin is exclusive, which is sufficient since hibernate_acquire() already prevents concurrent hibernation sessions. The kernel swsusp path (sysfs-based hibernate/resume) uses find_hibernation_swap_type() which is not affected by the pin. It freezes user space before touching swap, so swapoff cannot race. Introduce dedicated helpers: - pin_hibernation_swap_type(): Look up and pin the swap device. Used by the uswsusp path. - find_hibernation_swap_type(): Lookup without pinning. Used by the kernel swsusp path. - unpin_hibernation_swap_type(): Clear the hibernation pin. While a swap device is pinned, swapoff is prevented from proceeding. Link: https://lore.kernel.org/20260323160822.1409904-1-youngjun.park@lge.com Link: https://lore.kernel.org/20260323160822.1409904-2-youngjun.park@lge.com Signed-off-by: Youngjun Park <youngjun.park@lge.com> Reviewed-by: Kairui Song <kasong@tencent.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: "Rafael J . Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-08vmalloc: fix NULL pointer dereference in is_vm_area_hugepages()Hui Zhu
find_vm_area() can return NULL if the given address is not a valid vmalloc area. Check the return value before dereferencing it to avoid a kernel crash. Link: https://lore.kernel.org/20260529014130.671291-1-hui.zhu@linux.dev Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") Signed-off-by: Hui Zhu <zhuhui@kylinos.cn> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-08userfaultfd: build __VMA_UFFD_FLAGS from config-gated masksKiryl Shutsemau (Meta)
The VMA flags bitmap is a single word today: NUM_VMA_FLAG_BITS is BITS_PER_LONG, so on 32-bit vma_flags_t holds only 32 bits. (The bitmap type exists so this can grow past BITS_PER_LONG later; until it does, anything declared above the first word is out of range on 32-bit.) The bit enum nevertheless declares some bits unconditionally above BITS_PER_LONG -- VMA_UFFD_MINOR_BIT is 41, with VM_UFFD_MINOR == VM_NONE on 32-bit so no VMA actually carries the bit. __VMA_UFFD_FLAGS feeds VMA_UFFD_MINOR_BIT to mk_vma_flags() unconditionally. On 32-bit that becomes __set_bit(41, &one_long), a write one word past the end of the single-word bitmap. The compiler folds the out-of-bounds store with wraparound (1UL << (41 % 32) == bit 9) into the first word; bit 9 is already in __VMA_UFFD_FLAGS so the mask happens to come out right today, but it is an out-of-bounds write all the same, and any high-numbered bit whose mod-BITS_PER_LONG position is otherwise unused would silently OR an extra bit into the mask. Rather than feed bit numbers that may not exist on the current build to mk_vma_flags(), build the mask from whole per-mode masks that collapse to EMPTY_VMA_FLAGS when their feature is unavailable. Add mk_vma_flags_from_masks() for that, and define VMA_UFFD_MISSING / _WP / _MINOR alongside the VM_UFFD_* flags, gating VMA_UFFD_MINOR on the same config as VM_UFFD_MINOR (which implies 64BIT, where bit 41 fits). An out-of-range bit is then never materialised, on any arch, and the in-range fast path stays a compile-time constant. Link: https://lore.kernel.org/20260529172331.356655-7-kas@kernel.org Fixes: 9ea35a25d51b ("mm: introduce VMA flags bitmap type") Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reported-by: Sashiko AI review <sashiko-bot@kernel.org> Suggested-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Assisted-by: Claude:claude-opus-4-8 Cc: David Hildenbrand <david@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Balbir Singh <balbirs@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-08mm: delete stale comment about cachelinesBrendan Jackman
These comments have been wrong since commit a211c6550efc ("mm: page_alloc: defrag_mode kswapd/kcompactd watermarks") added NR_FREE_PAGES_BLOCKS. Since nobody has complained about it in the last year, it seems unlikely these comments were particularly useful anyway, so delete them. Link: https://lore.kernel.org/20260601-zone_stat_item-comment-v1-1-f452dd91d5eb@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-08mm/compaction: respect cpusets when checking retry suitabilityfujunjie
should_compact_retry() handles COMPACT_SKIPPED by asking compaction_zonelist_suitable() whether reclaim can make a later compaction attempt worthwhile. That answer is used for the current allocation, so it should follow the same zone eligibility rules as the allocation itself. When cpusets are enabled, allocator slowpath decisions are marked with ALLOC_CPUSET. The allocation path, direct compaction and reclaim retry all skip zones rejected by __cpuset_zone_allowed(). compaction_zonelist_suitable() does not apply that filter. It only walks ac->zonelist/ac->nodemask, so it can return true because a zone that is not usable for the current allocation would pass __compaction_suitable(). That does not let the allocation use the disallowed zone. Later allocation and direct compaction paths still apply cpuset filtering. However, it can make should_compact_retry() retry based on memory that this allocation cannot use. Pass gfp_mask down and apply the same ALLOC_CPUSET check in compaction_zonelist_suitable(). This keeps the retry decision aligned with the zones that the allocation is allowed to use. A temporary debugfs probe was also used to call the old and new compaction_zonelist_suitable() predicates in the same two-node NUMA guest. The task was restricted to mems=0 while ac->nodemask covered nodes 0-1. After putting pressure on node0, node0 failed __compaction_suitable() for order-10 and node1 passed it, but node1 was rejected by __cpuset_zone_allowed(). In that state the old predicate returned true and the patched predicate returned false. Link: https://lore.kernel.org/tencent_F59F2BA2CC5779308E10DF54593C736D3E0A@qq.com Fixes: 435b3894e742 ("mm:page_alloc: fix the NULL ac->nodemask in __alloc_pages_slowpath()") Signed-off-by: fujunjie <fujunjie1@qq.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-08mm: switch deferred split shrinker to list_lruJohannes Weiner
The deferred split queue handles cgroups in a suboptimal fashion. The queue is per-NUMA node or per-cgroup, not the intersection. That means on a cgrouped system, a node-restricted allocation entering reclaim can end up splitting large pages on other nodes: alloc/unmap deferred_split_folio() list_add_tail(memcg->split_queue) set_shrinker_bit(memcg, node, deferred_shrinker_id) for_each_zone_zonelist_nodemask(restricted_nodes) mem_cgroup_iter() shrink_slab(node, memcg) shrink_slab_memcg(node, memcg) if test_shrinker_bit(memcg, node, deferred_shrinker_id) deferred_split_scan() walks memcg->split_queue The shrinker bit adds an imperfect guard rail. As soon as the cgroup has a single large page on the node of interest, all large pages owned by that memcg, including those on other nodes, will be split. list_lru properly sets up per-node, per-cgroup lists. As a bonus, it streamlines a lot of the list operations and reclaim walks. It's used widely by other major shrinkers already. Convert the deferred split queue as well. The list_lru per-memcg heads are instantiated on demand when the first object of interest is allocated for a cgroup, by calling folio_memcg_alloc_deferred(). Add calls to where splittable pages are created: anon faults, swapin faults, khugepaged collapse. These calls create all possible node heads for the cgroup at once, so the migration code (between nodes) doesn't need any special care. [akpm@linux-foundation.org: fix build with CONFIG_TRANSPARENT_HUGEPAGE=n] Link: https://lore.kernel.org/202605281620.lc3rtkBm-lkp@intel.com [hannes@cmpxchg.org: fix cgroup.memory=nokmem handling] Link: https://lore.kernel.org/ah9PGv12mqai84ES@cmpxchg.org Link: https://lore.kernel.org/20260527204757.2544958-10-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Tested-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: Usama Arif <usama.arif@linux.dev> Reviewed-by: Kairui Song <kasong@tencent.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam R. Howlett <liam@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nico Pache <npache@redhat.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-08mm: list_lru: introduce folio_memcg_list_lru_alloc()Johannes Weiner
memcg_list_lru_alloc() is called every time an object that may end up on the list_lru is created. It needs to quickly check if the list_lru heads for the memcg already exist, and allocate them when they don't. Doing this with folio objects is tricky: folio_memcg() is not stable and requires either RCU protection or pinning the cgroup. But it's desirable to make the existence check lightweight under RCU, and only pin the memcg when we need to allocate list_lru heads and may block. In preparation for switching the THP shrinker to list_lru, add a helper function for allocating list_lru heads coming from a folio. Link: https://lore.kernel.org/20260527204757.2544958-8-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam R. Howlett <liam@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mikhail Zaslonko <zaslonko@linux.ibm.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nico Pache <npache@redhat.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Usama Arif <usama.arif@linux.dev> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-08mm: list_lru: introduce caller locking for additions and deletionsJohannes Weiner
Locking is currently internal to the list_lru API. However, a caller might want to keep auxiliary state synchronized with the LRU state. For example, the THP shrinker uses the lock of its custom LRU to keep PG_partially_mapped and vmstats consistent. To allow the THP shrinker to switch to list_lru, provide normal and irqsafe locking primitives as well as caller-locked variants of the addition and deletion functions. Link: https://lore.kernel.org/20260527204757.2544958-7-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Liam R. Howlett (Oracle) <liam@infradead.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mikhail Zaslonko <zaslonko@linux.ibm.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nico Pache <npache@redhat.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Usama Arif <usama.arif@linux.dev> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>