linux-stable.git/arch/x86/kernel/cpu/resctrl/monitor.c, branch linux-rolling-stable

Merge tag 'x86_cache_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2025-12-02T19:55:58+00:00

Pull x86 resource control updates from Borislav Petkov:

 - Add support for AMD's Smart Data Cache Injection feature which allows
   for direct insertion of data from I/O devices into the L3 cache, thus
   bypassing DRAM and saving its bandwidth; the resctrl side of the
   feature allows the size of the L3 used for data injection to be
   controlled

 - Add Intel Clearwater Forest to the list of CPUs which support
   Sub-NUMA clustering

 - Other fixes and cleanups

* tag 'x86_cache_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  fs/resctrl: Update bit_usage to reflect io_alloc
  fs/resctrl: Introduce interface to modify io_alloc capacity bitmasks
  fs/resctrl: Modify struct rdt_parse_data to pass mode and CLOSID
  fs/resctrl: Introduce interface to display io_alloc CBMs
  fs/resctrl: Add user interface to enable/disable io_alloc feature
  fs/resctrl: Introduce interface to display "io_alloc" support
  x86,fs/resctrl: Implement "io_alloc" enable/disable handlers
  x86,fs/resctrl: Detect io_alloc feature
  x86/resctrl: Add SDCIAE feature in the command line options
  x86/cpufeatures: Add support for L3 Smart Data Cache Injection Allocation Enforcement
  fs/resctrl: Consider sparse masks when initializing new group's allocation
  x86/resctrl: Support Sub-NUMA Cluster (SNC) mode on Clearwater Forest

x86,fs/resctrl: Fix NULL pointer dereference with events force-disabled in mbm_event mode

2025-10-20T16:06:31+00:00

The following NULL pointer dereference is encountered on mount of resctrl fs
after booting a system that supports assignable counters with the
"rdt=!mbmtotal,!mbmlocal" kernel parameters:

BUG: kernel NULL pointer dereference, address: 0000000000000008
RIP: 0010:mbm_cntr_get
Call Trace:
rdtgroup_assign_cntr_event
rdtgroup_assign_cntrs
rdt_get_tree

Specifying the kernel parameter "rdt=!mbmtotal,!mbmlocal" effectively disables
the legacy X86_FEATURE_CQM_MBM_TOTAL and X86_FEATURE_CQM_MBM_LOCAL features
and the MBM events they represent. This results in the per-domain MBM event
related data structures to not be allocated during early initialization.

resctrl fs initialization follows by implicitly enabling both MBM total and
local events on a system that supports assignable counters (mbm_event mode),
but this enabling occurs after the per-domain data structures have been
created.

After booting, resctrl fs assumes that an enabled event can access all its
state. This results in NULL pointer dereference when resctrl attempts to
access the un-allocated structures of an enabled event.

Remove the late MBM event enabling from resctrl fs.

This leaves a problem where the X86_FEATURE_CQM_MBM_TOTAL and
X86_FEATURE_CQM_MBM_LOCAL features may be disabled while assignable counter
(mbm_event) mode is enabled without any events to support. Switching between
the "default" and "mbm_event" mode without any events is not practical.

Create a dependency between the X86_FEATURE_{CQM_MBM_TOTAL,CQM_MBM_LOCAL} and
X86_FEATURE_ABMC (assignable counter) hardware features. An x86 system that
supports assignable counters now requires support of X86_FEATURE_CQM_MBM_TOTAL
or X86_FEATURE_CQM_MBM_LOCAL.

This ensures all needed MBM related data structures are created before use and
that it is only possible to switch between "default" and "mbm_event" mode when
the same events are available in both modes. This dependency does not exist in
the hardware but this usage of these feature settings work for known systems.

[ bp: Massage commit message. ]

Fixes: 13390861b426e ("x86,fs/resctrl: Detect Assignable Bandwidth Monitoring feature details")
Co-developed-by: Reinette Chatre
Signed-off-by: Reinette Chatre
Signed-off-by: Babu Moger
Signed-off-by: Borislav Petkov (AMD)
Reviewed-by: Reinette Chatre
Link: https://patch.msgid.link/a62e6ac063d0693475615edd213d5be5e55443e6.1760560934.git.babu.moger@amd.com

x86/resctrl: Fix miscount of bandwidth event when reactivating previously unavailable RMID

2025-10-13T19:24:39+00:00

Users can create as many monitoring groups as the number of RMIDs supported
by the hardware. However, on AMD systems, only a limited number of RMIDs
are guaranteed to be actively tracked by the hardware. RMIDs that exceed
this limit are placed in an "Unavailable" state.

When a bandwidth counter is read for such an RMID, the hardware sets
MSR_IA32_QM_CTR.Unavailable (bit 62). When such an RMID starts being tracked
again the hardware counter is reset to zero. MSR_IA32_QM_CTR.Unavailable
remains set on first read after tracking re-starts and is clear on all
subsequent reads as long as the RMID is tracked.

resctrl miscounts the bandwidth events after an RMID transitions from the
"Unavailable" state back to being tracked. This happens because when the
hardware starts counting again after resetting the counter to zero, resctrl
in turn compares the new count against the counter value stored from the
previous time the RMID was tracked.

This results in resctrl computing an event value that is either undercounting
(when new counter is more than stored counter) or a mistaken overflow (when
new counter is less than stored counter).

Reset the stored value (arch_mbm_state::prev_msr) of MSR_IA32_QM_CTR to
zero whenever the RMID is in the "Unavailable" state to ensure accurate
counting after the RMID resets to zero when it starts to be tracked again.

Example scenario that results in mistaken overflow
==================================================
1. The resctrl filesystem is mounted, and a task is assigned to a
   monitoring group.

   $mount -t resctrl resctrl /sys/fs/resctrl
   $mkdir /sys/fs/resctrl/mon_groups/test1/
   $echo 1234 > /sys/fs/resctrl/mon_groups/test1/tasks

   $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
   21323            <- Total bytes on domain 0
   "Unavailable"    <- Total bytes on domain 1

   Task is running on domain 0. Counter on domain 1 is "Unavailable".

2. The task runs on domain 0 for a while and then moves to domain 1. The
   counter starts incrementing on domain 1.

   $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
   7345357          <- Total bytes on domain 0
   4545             <- Total bytes on domain 1

3. At some point, the RMID in domain 0 transitions to the "Unavailable"
   state because the task is no longer executing in that domain.

   $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
   "Unavailable"    <- Total bytes on domain 0
   434341           <- Total bytes on domain 1

4.  Since the task continues to migrate between domains, it may eventually
    return to domain 0.

    $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
    17592178699059  <- Overflow on domain 0
    3232332         <- Total bytes on domain 1

In this case, the RMID on domain 0 transitions from "Unavailable" state to
active state. The hardware sets MSR_IA32_QM_CTR.Unavailable (bit 62) when
the counter is read and begins tracking the RMID counting from 0.

Subsequent reads succeed but return a value smaller than the previously
saved MSR value (7345357). Consequently, the resctrl's overflow logic is
triggered, it compares the previous value (7345357) with the new, smaller
value and incorrectly interprets this as a counter overflow, adding a large
delta.

In reality, this is a false positive: the counter did not overflow but was
simply reset when the RMID transitioned from "Unavailable" back to active
state.

Here is the text from APM [1] available from [2].

"In PQOS Version 2.0 or higher, the MBM hardware will set the U bit on the
first QM_CTR read when it begins tracking an RMID that it was not
previously tracking. The U bit will be zero for all subsequent reads from
that RMID while it is still tracked by the hardware. Therefore, a QM_CTR
read with the U bit set when that RMID is in use by a processor can be
considered 0 when calculating the difference with a subsequent read."

[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
    Publication # 24593 Revision 3.41 section 19.3.3 Monitoring L3 Memory
    Bandwidth (MBM).

  [ bp: Split commit message into smaller paragraph chunks for better
    consumption. ]

Fixes: 4d05bf71f157d ("x86/resctrl: Introduce AMD QOS feature")
Signed-off-by: Babu Moger 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Reinette Chatre 
Tested-by: Reinette Chatre 
Cc: stable@vger.kernel.org # needs adjustments for <= v6.17
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]

x86/resctrl: Support Sub-NUMA Cluster (SNC) mode on Clearwater Forest

2025-10-13T14:59:55+00:00

Clearwater Forest supports SNC mode. Add it to the snc_cpu_ids[] table.

Signed-off-by: Chen Yu 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Reinette Chatre 
Acked-by: Tony Luck

x86/resctrl: Configure mbm_event mode if supported

2025-09-15T10:52:04+00:00

Configure mbm_event mode on AMD platforms. On AMD platforms, it is
recommended to use the mbm_event mode, if supported, to prevent the
hardware from resetting counters between reads. This can result in
misleading values or display "Unavailable" if no counter is assigned
to the event.

Enable mbm_event mode, known as ABMC (Assignable Bandwidth Monitoring
Counters) on AMD, by default if the system supports it.

Update ABMC across all logical processors within the resctrl domain to
ensure proper functionality.

Signed-off-by: Babu Moger 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Reinette Chatre 
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com

x86/resctrl: Implement resctrl_arch_reset_cntr() and resctrl_arch_cntr_read()

2025-09-15T10:30:22+00:00

System software reads resctrl event data for a particular resource by writing
the RMID and Event Identifier (EvtID) to the QM_EVTSEL register and then
reading the event data from the QM_CTR register.

In ABMC mode, the event data of a specific counter ID is read by setting the
following fields: QM_EVTSEL.ExtendedEvtID = 1, QM_EVTSEL.EvtID = L3CacheABMC
(=1) and setting QM_EVTSEL.RMID to the desired counter ID.  Reading the QM_CTR
then returns the contents of the specified counter ID.  RMID_VAL_ERROR bit is
set if the counter configuration is invalid, or if an invalid counter ID is
set in the QM_EVTSEL.RMID field.  RMID_VAL_UNAVAIL bit is set if the counter
data is unavailable.

Introduce resctrl_arch_reset_cntr() and resctrl_arch_cntr_read() to reset and
read event data for a specific counter.

Signed-off-by: Babu Moger 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Reinette Chatre 
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com

x86/resctrl: Refactor resctrl_arch_rmid_read()

2025-09-15T10:28:21+00:00

resctrl_arch_rmid_read() adjusts the value obtained from MSR_IA32_QM_CTR to
account for the overflow for MBM events and apply counter scaling for all the
events. This logic is common to both reading an RMID and reading a hardware
counter directly.

Refactor the hardware value adjustment logic into get_corrected_val() to
prepare for support of reading a hardware counter.

Signed-off-by: Babu Moger 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Reinette Chatre 
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com

x86,fs/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC

2025-09-15T10:20:29+00:00

The ABMC feature allows users to assign a hardware counter to an RMID,
event pair and monitor bandwidth usage as long as it is assigned. The
hardware continues to track the assigned counter until it is explicitly
unassigned by the user.

Implement an x86 architecture-specific handler to configure a counter. This
architecture specific handler is called by resctrl fs when a counter is
assigned or unassigned as well as when an already assigned counter's
configuration should be updated. Configure counters by writing to the
L3_QOS_ABMC_CFG MSR, specifying the counter ID, bandwidth source (RMID),
and event configuration.

The ABMC feature details are documented in APM [1] available from [2].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
    Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
    Monitoring (ABMC).

Signed-off-by: Babu Moger 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Reinette Chatre 
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]

x86/resctrl: Add support to enable/disable AMD ABMC feature

2025-09-15T10:09:30+00:00

Add the functionality to enable/disable the AMD ABMC feature.

The AMD ABMC feature is enabled by setting enabled bit(0) in the
L3_QOS_EXT_CFG MSR. When the state of ABMC is changed, the MSR needs to be
updated on all the logical processors in the QOS Domain.

Hardware counters will reset when ABMC state is changed.

  [ bp: Massage commit message. ]

Signed-off-by: Babu Moger 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Reinette Chatre 
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]

x86,fs/resctrl: Detect Assignable Bandwidth Monitoring feature details

2025-09-15T10:08:01+00:00

ABMC feature details are reported via CPUID Fn8000_0020_EBX_x5.
Bits Description
15:0 MAX_ABMC Maximum Supported Assignable Bandwidth
     Monitoring Counter ID + 1

The ABMC feature details are documented in APM [1] available from [2].

  [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
  Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
  Monitoring (ABMC).

Detect the feature and number of assignable counters supported. For backward
compatibility, upon detecting the assignable counter feature, enable the
mbm_total_bytes and mbm_local_bytes events that users are familiar with as
part of original L3 MBM support.

Signed-off-by: Babu Moger 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Reinette Chatre 
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]