linux.git/arch/x86/kernel/cpu/resctrl/ctrlmondata.c, branch v6.9

x86/resctrl: Separate arch and fs resctrl locks

2024-02-19T18:28:07+00:00

resctrl has one mutex that is taken by both the architecture-specific code and
the filesystem parts. The two interact via cpuhp, where the architecture code
updates the domain list. Filesystem handlers that walk the domains list should
not run concurrently with the cpuhp callback modifying the list.

Exposing a lock from the filesystem code means the interface is not cleanly
defined, and creates the possibility of cross-architecture lock ordering
headaches. The interaction only exists so that certain filesystem paths are
serialised against CPU hotplug. The CPU hotplug code already has a mechanism to
do this using cpus_read_lock().

MPAM's monitors have an overflow interrupt, so it needs to be possible to walk
the domains list in irq context. RCU is ideal for this, but some paths need to
be able to sleep to allocate memory.

Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part of a cpuhp
callback, cpus_read_lock() must always be taken first.
rdtgroup_schemata_write() already does this.
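
The ordering rule above can be sketched in a small user-space model. This is
not kernel code: the struct and every model_* name are hypothetical, and the
two "locks" are plain flags that only record whether the order was respected.

```c
#include <stdbool.h>

/* Toy model of the two locks; all names here are illustrative only. */
struct lock_state {
	bool cpus_read_held;		/* stands in for cpus_read_lock() */
	bool rdtgroup_mutex_held;	/* stands in for rdtgroup_mutex */
};

/* Returns 0 on success, -1 if the ordering rule would be violated. */
int model_cpus_read_lock(struct lock_state *s)
{
	if (s->rdtgroup_mutex_held)
		return -1;	/* wrong order: the mutex is already held */
	s->cpus_read_held = true;
	return 0;
}

int model_mutex_lock(struct lock_state *s)
{
	if (!s->cpus_read_held)
		return -1;	/* the hotplug lock must be taken first */
	s->rdtgroup_mutex_held = true;
	return 0;
}

void model_unlock_all(struct lock_state *s)
{
	/* Release in the reverse order of acquisition. */
	s->rdtgroup_mutex_held = false;
	s->cpus_read_held = false;
}

/* A filesystem-style handler: hotplug lock first, then the mutex. */
int model_fs_handler(struct lock_state *s)
{
	if (model_cpus_read_lock(s))
		return -1;
	if (model_mutex_lock(s)) {
		model_unlock_all(s);
		return -1;
	}
	/* ... a real handler would walk the domain list here ... */
	model_unlock_all(s);
	return 0;
}
```

In this model, taking the mutex without the hotplug lock fails, while the
handler that follows the documented order succeeds and leaves both flags clear.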

Most of the filesystem code's domain list walkers are currently protected by
the rdtgroup_mutex taken in rdtgroup_kn_lock_live().  The exceptions are
rdt_bit_usage_show() and the mon_config helpers which take the lock directly.

Make the domain list protected by RCU. An architecture-specific lock prevents
concurrent writers. rdt_bit_usage_show() could walk the domain list using RCU,
but to keep all the filesystem operations the same, this is changed to call
cpus_read_lock().  The mon_config helpers send multiple IPIs, so take
cpus_read_lock() in these cases as well.

The other filesystem list walkers need to be able to sleep.  Add
cpus_read_lock() to rdtgroup_kn_lock_live() so that the cpuhp callbacks can't
be invoked when file system operations are occurring.

Add lockdep_assert_cpus_held() in the cases where the rdtgroup_kn_lock_live()
call isn't obvious.

Resctrl's domain online/offline calls now need to take the rdtgroup_mutex
themselves.

  [ bp: Fold in a build fix: https://lore.kernel.org/r/87zfvwieli.ffs@tglx ]

Signed-off-by: James Morse 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Shaopeng Tan 
Reviewed-by: Reinette Chatre 
Reviewed-by: Babu Moger 
Tested-by: Shaopeng Tan 
Tested-by: Peter Newman 
Tested-by: Babu Moger 
Tested-by: Carl Worth  # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-25-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD)

x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but CPU

2024-02-16T18:18:33+00:00

When a CPU is taken offline resctrl may need to move the overflow or limbo
handlers to run on a different CPU.

Once the offline callbacks have been split, cqm_setup_limbo_handler() will be
called while the CPU that is going offline is still present in the CPU mask.

Pass the CPU to exclude to cqm_setup_limbo_handler() and
mbm_setup_overflow_handler(). These functions can use a variant of
cpumask_any_but() when selecting the CPU. -1 is used to indicate no CPUs need
excluding.
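
The selection described above can be sketched in user space with a 64-bit
mask standing in for a cpumask. The function and macro names are hypothetical.
Returning -1 when no CPU remains is a simplification of this sketch; the
kernel's cpumask_any_but() signals that case with a value >= nr_cpu_ids.

```c
#include <stdint.h>

#define MODEL_NO_EXCLUDE (-1)	/* hypothetical name for the "-1" convention */

/*
 * Return the lowest-numbered CPU set in @mask other than @exclude_cpu,
 * or -1 if no such CPU exists. Models at most 64 CPUs.
 */
int model_pick_cpu_any_but(uint64_t mask, int exclude_cpu)
{
	/* A negative @exclude_cpu means nothing needs excluding. */
	if (exclude_cpu >= 0 && exclude_cpu < 64)
		mask &= ~(UINT64_C(1) << exclude_cpu);

	for (int cpu = 0; cpu < 64; cpu++) {
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	}
	return -1;
}
```

With CPUs 1 and 2 in the mask, excluding CPU 1 selects CPU 2, and passing
MODEL_NO_EXCLUDE selects CPU 1.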

Signed-off-by: James Morse 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Shaopeng Tan 
Reviewed-by: Babu Moger 
Reviewed-by: Reinette Chatre 
Tested-by: Shaopeng Tan 
Tested-by: Peter Newman 
Tested-by: Babu Moger 
Tested-by: Carl Worth  # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-22-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD)

x86/resctrl: Allow arch to allocate memory needed in resctrl_arch_rmid_read()

2024-02-16T18:18:32+00:00

Depending on the number of monitors available, Arm's MPAM may need to
allocate a monitor prior to reading the counter value. Allocating a
contended resource may involve sleeping.

__check_limbo() and mon_event_count() each make multiple calls to
resctrl_arch_rmid_read(). To avoid extra work on contended systems,
the allocation should be valid for multiple invocations of
resctrl_arch_rmid_read().

The memory or hardware allocated is not specific to a domain.

Add arch hooks for this allocation, which need calling before
resctrl_arch_rmid_read(). The allocated monitor is passed to
resctrl_arch_rmid_read(), then freed again afterwards. The helper
can be called on any CPU, and can sleep.
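
The allocate, read-many, free pattern can be sketched as a user-space model.
The pool, the context type and every model_* function are hypothetical
stand-ins; the real hooks and the real resctrl_arch_rmid_read() signature are
not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_NR_MONITORS 2	/* assumed: fewer monitors than RMIDs */

struct model_monitor {
	int in_use;
	uint64_t value;
};

static struct model_monitor model_pool[MODEL_NR_MONITORS];

/* Hypothetical alloc hook; a real implementation may sleep here. */
struct model_monitor *model_mon_ctx_alloc(void)
{
	for (size_t i = 0; i < MODEL_NR_MONITORS; i++) {
		if (!model_pool[i].in_use) {
			model_pool[i].in_use = 1;
			return &model_pool[i];
		}
	}
	return NULL;
}

/* Hypothetical free hook. */
void model_mon_ctx_free(struct model_monitor *ctx)
{
	if (ctx)
		ctx->in_use = 0;
}

/* Stand-in for the arch read: it uses the caller's context. */
int model_rmid_read(struct model_monitor *ctx, uint32_t rmid, uint64_t *val)
{
	if (!ctx)
		return -1;
	*val = ctx->value + rmid;	/* fake counter value */
	return 0;
}

/* One allocation covers several reads, then the context is freed. */
int model_read_many(const uint32_t *rmids, size_t n, uint64_t *sum)
{
	struct model_monitor *ctx = model_mon_ctx_alloc();
	uint64_t val;
	int ret = 0;

	if (!ctx)
		return -1;
	*sum = 0;
	for (size_t i = 0; i < n; i++) {
		ret = model_rmid_read(ctx, rmids[i], &val);
		if (ret)
			break;
		*sum += val;
	}
	model_mon_ctx_free(ctx);
	return ret;
}
```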

Signed-off-by: James Morse 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Shaopeng Tan 
Reviewed-by: Reinette Chatre 
Reviewed-by: Babu Moger 
Tested-by: Shaopeng Tan 
Tested-by: Peter Newman 
Tested-by: Babu Moger 
Tested-by: Carl Worth  # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-16-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD)

x86/resctrl: Queue mon_event_read() instead of sending an IPI

2024-02-16T18:18:32+00:00

Intel is blessed with an abundance of monitors, one per RMID, that can be
read from any CPU in the domain. MPAM's monitors reside in the MMIO MSC,
and the number implemented is up to the manufacturer. This means when there are
fewer monitors than needed, they need to be allocated and freed.

MPAM's CSU monitors are used to back the 'llc_occupancy' monitor file. The
CSU counter is allowed to return 'not ready' for a small number of
microseconds after programming. To allow one CSU hardware monitor to be
used for multiple control or monitor groups, the CPU accessing the
monitor needs to be able to block when configuring and reading the
counter.

Worse, the domain may be broken up into slices, and the MMIO accesses
for each slice may need performing from different CPUs.

These two details mean MPAM's monitor code needs to be able to sleep, and
IPI another CPU in the domain to read from a resource that has been sliced.

mon_event_read() already invokes mon_event_count() via IPI, which means
this isn't possible. On systems using nohz-full, some CPUs need to be
interrupted to run kernel work as they otherwise stay in user-space
running realtime workloads. Interrupting these CPUs should be avoided,
and scheduling work on them may never complete.

Change mon_event_read() to pick a housekeeping CPU (one that is not using
nohz_full), schedule mon_event_count() on it, and wait. If all the CPUs
in a domain are using nohz-full, then an IPI is used as the fallback.

This function is only used in response to a user-space filesystem request
(not the timing sensitive overflow code).

This allows MPAM to hide the slice behaviour from resctrl, and to keep
the monitor-allocation in monitor.c. When the IPI fallback is used on
machines where MPAM needs to make an access on multiple CPUs, the counter
read will always fail.
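
The CPU choice described above can be sketched in user space, with 64-bit
masks standing in for cpumasks. All names are hypothetical, and picking the
lowest-numbered candidate is an assumption of the sketch, not a statement
about which CPU the kernel picks.

```c
#include <stdint.h>

enum model_read_path {
	MODEL_QUEUE_WORK,	/* schedule the read on a housekeeping CPU */
	MODEL_SEND_IPI,		/* fallback: every CPU is nohz_full */
};

struct model_choice {
	enum model_read_path path;
	int cpu;		/* -1 if the domain mask is empty */
};

/*
 * Prefer a CPU in the domain that is not nohz_full. If none exists,
 * fall back to an IPI to any CPU in the domain.
 */
struct model_choice model_pick_reader(uint64_t domain_mask,
				      uint64_t nohz_full_mask)
{
	uint64_t housekeeping = domain_mask & ~nohz_full_mask;
	uint64_t from = housekeeping ? housekeeping : domain_mask;
	struct model_choice c = {
		housekeeping ? MODEL_QUEUE_WORK : MODEL_SEND_IPI, -1
	};

	for (int cpu = 0; cpu < 64; cpu++) {
		if (from & (UINT64_C(1) << cpu)) {
			c.cpu = cpu;
			break;
		}
	}
	return c;
}
```

For a domain of CPUs 0-3 with CPUs 0-1 marked nohz_full, the sketch queues
work on CPU 2. If the whole domain is nohz_full, it reports the IPI fallback.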

Signed-off-by: James Morse 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Shaopeng Tan 
Reviewed-by: Peter Newman 
Reviewed-by: Reinette Chatre 
Reviewed-by: Babu Moger 
Tested-by: Shaopeng Tan 
Tested-by: Peter Newman 
Tested-by: Babu Moger 
Tested-by: Carl Worth  # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-14-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD)

x86/resctrl: Enable non-contiguous CBMs in Intel CAT

2023-10-11T19:48:52+00:00

The setting for non-contiguous 1s support in Intel CAT is
hardcoded to false. On these systems, writing non-contiguous
1s into the schemata file will fail before resctrl passes
the value to the hardware.

In Intel CAT, CPUID.0x10.1:ECX[3] and CPUID.0x10.2:ECX[3] stopped
being reserved and now carry information about non-contiguous 1s
value support for L3 and L2 cache respectively. The CAT
capacity bitmask (CBM) supports a non-contiguous 1s value if
the bit is set.

The exception is Haswell systems, where non-contiguous 1s value
support needs to stay disabled since they can't make use of CPUID
for cache allocation.
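
A minimal sketch of the check follows, assuming the caller has already read
ECX from the relevant CPUID sub-leaf. The function names are hypothetical, and
rejecting a zero mask outright is a simplification of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 3 of ECX reports non-contiguous 1s support, as described above. */
bool model_sparse_masks_supported(uint32_t ecx)
{
	return (ecx >> 3) & 1;
}

/* True if the set bits of @cbm form a single unbroken run. */
bool model_cbm_is_contiguous(uint32_t cbm)
{
	if (cbm == 0)
		return false;
	/* Drop trailing zeros so the run starts at bit 0. */
	while (!(cbm & 1))
		cbm >>= 1;
	/* A run of 1s starting at bit 0 has the form 2^k - 1. */
	return (cbm & (cbm + 1)) == 0;
}

/* Accept a mask if it is contiguous, or if sparse masks are allowed. */
bool model_cbm_accept(uint32_t cbm, bool sparse_ok)
{
	if (cbm == 0)
		return false;
	return sparse_ok || model_cbm_is_contiguous(cbm);
}
```

Here 0x0f0 is contiguous and 0x505 is not, so 0x505 is accepted only when the
ECX value passed in has bit 3 set.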

Originally-by: Fenghua Yu 
Signed-off-by: Maciej Wieczor-Retman 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Ilpo Järvinen 
Reviewed-by: Peter Newman 
Reviewed-by: Reinette Chatre 
Reviewed-by: Babu Moger 
Tested-by: Peter Newman 
Link: https://lore.kernel.org/r/1849b487256fe4de40b30f88450cba3d9abc9171.1696934091.git.maciej.wieczor-retman@intel.com

x86/resctrl: Rename arch_has_sparse_bitmaps

2023-10-11T17:43:43+00:00

Rename arch_has_sparse_bitmaps to arch_has_sparse_bitmasks to ensure
consistent terminology throughout resctrl.

Suggested-by: Reinette Chatre 
Signed-off-by: Maciej Wieczor-Retman 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Ilpo Järvinen 
Reviewed-by: Peter Newman 
Reviewed-by: Reinette Chatre 
Reviewed-by: Babu Moger 
Tested-by: Peter Newman 
Link: https://lore.kernel.org/r/e330fcdae873ef1a831e707025a4b70fa346666e.1696934091.git.maciej.wieczor-retman@intel.com

x86/resctrl: Clear staged_config[] before and after it is used

2023-03-15T22:19:43+00:00

As a temporary storage, staged_config[] in rdt_domain should be cleared
before and after it is used. The stale value in staged_config[] could
cause an MSR access error.

Here is a reproducer on a system with 16 usable CLOSIDs for a 15-way L3
Cache (MBA should be disabled if the number of CLOSIDs for MB is less than
16):
	mount -t resctrl resctrl -o cdp /sys/fs/resctrl
	mkdir /sys/fs/resctrl/p{1..7}
	umount /sys/fs/resctrl/
	mount -t resctrl resctrl /sys/fs/resctrl
	mkdir /sys/fs/resctrl/p{1..8}

An error occurs when creating resource group named p8:
    unchecked MSR access error: WRMSR to 0xca0 (tried to write 0x00000000000007ff) at rIP: 0xffffffff82249142 (cat_wrmsr+0x32/0x60)
    Call Trace:
     <IRQ>
     __flush_smp_call_function_queue+0x11d/0x170
     __sysvec_call_function+0x24/0xd0
     sysvec_call_function+0x89/0xc0
     </IRQ>
     <TASK>
     asm_sysvec_call_function+0x16/0x20

When creating a new resource control group, hardware will be configured
by the following process:
    rdtgroup_mkdir()
      rdtgroup_mkdir_ctrl_mon()
        rdtgroup_init_alloc()
          resctrl_arch_update_domains()

resctrl_arch_update_domains() iterates and updates all resctrl_conf_type
whose have_new_ctrl is true. Since staged_config[] holds the same values as
when CDP was enabled, it will continue to update the CDP_CODE and CDP_DATA
configurations. When group p8 is created, get_config_index() called in
resctrl_arch_update_domains() will return 16 and 17 as the CLOSIDs for
CDP_CODE and CDP_DATA, which will be translated to an invalid register -
0xca0 in this scenario.
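
The arithmetic behind the bad register can be sketched as follows. The sketch
assumes the L3 mask MSRs start at 0xc90 and that the CDP index mapping is
DATA = closid * 2 and CODE = closid * 2 + 1. Both are assumptions of this
illustration, and the model_* names are hypothetical.

```c
#include <stdint.h>

enum model_conf_type { MODEL_CDP_NONE, MODEL_CDP_CODE, MODEL_CDP_DATA };

#define MODEL_L3_CBM_BASE	0xc90u	/* assumed base of the L3 mask MSRs */
#define MODEL_NUM_CLOSID	16u	/* usable CLOSIDs in the reproducer */

/* Assumed mapping from (closid, type) to a configuration index. */
uint32_t model_config_index(uint32_t closid, enum model_conf_type type)
{
	switch (type) {
	case MODEL_CDP_CODE:
		return closid * 2 + 1;
	case MODEL_CDP_DATA:
		return closid * 2;
	default:
		return closid;
	}
}

/* MSR address that would be written for a given index. */
uint32_t model_cbm_msr(uint32_t index)
{
	return MODEL_L3_CBM_BASE + index;
}

/* Only indices below the number of CLOSIDs map to a real register. */
int model_index_valid(uint32_t index)
{
	return index < MODEL_NUM_CLOSID;
}
```

With CDP disabled, group p8 uses CLOSID 8 and index 8, which is valid. If the
stale CDP entries are applied, the indices become 16 and 17, and index 16
maps to 0xc90 + 16 = 0xca0, one past the last valid register.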

Fix it by clearing staged_config[] before and after it is used.

[reinette: re-order commit tags]

Fixes: 75408e43509e ("x86/resctrl: Allow different CODE/DATA configurations to be staged")
Suggested-by: Xin Hao 
Signed-off-by: Shawn Wang 
Signed-off-by: Reinette Chatre 
Signed-off-by: Dave Hansen 
Tested-by: Reinette Chatre 
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/2fad13f49fbe89687fc40e9a5a61f23a28d1507a.1673988935.git.reinette.chatre%40intel.com

x86/resctrl: Detect and configure Slow Memory Bandwidth Allocation

2023-01-23T16:38:44+00:00

The QoS slow memory configuration details are available via
CPUID_Fn80000020_EDX_x02. Detect the available details and
initialize the rest to defaults.

Signed-off-by: Babu Moger 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Reinette Chatre 
Link: https://lore.kernel.org/r/20230113152039.770054-7-babu.moger@amd.com

x86/resctrl: Replace smp_call_function_many() with on_each_cpu_mask()

2023-01-23T16:38:04+00:00

on_each_cpu_mask() runs the function on each CPU specified by cpumask,
which may include the local processor.

Replace smp_call_function_many() with on_each_cpu_mask() to simplify
the code.
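
The difference between the two helpers can be modelled in user space.
smp_call_function_many() does not run the function on the calling CPU, so a
caller that wants the local CPU covered has to invoke it by hand. The model_*
names and the 64-bit mask are hypothetical stand-ins.

```c
#include <stdint.h>

typedef void (*model_fn_t)(int cpu, void *info);

/* Models smp_call_function_many(): never runs on the local CPU. */
void model_call_many(uint64_t mask, int local_cpu, model_fn_t fn, void *info)
{
	for (int cpu = 0; cpu < 64; cpu++) {
		if ((mask & (UINT64_C(1) << cpu)) && cpu != local_cpu)
			fn(cpu, info);
	}
}

/* Models on_each_cpu_mask(): includes the local CPU if it is in @mask. */
void model_on_each_cpu_mask(uint64_t mask, model_fn_t fn, void *info)
{
	for (int cpu = 0; cpu < 64; cpu++) {
		if (mask & (UINT64_C(1) << cpu))
			fn(cpu, info);
	}
}

/* Example callback: record which CPUs ran the function. */
void model_mark_cpu(int cpu, void *info)
{
	*(uint64_t *)info |= UINT64_C(1) << cpu;
}
```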

Signed-off-by: Babu Moger 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Reinette Chatre 
Link: https://lore.kernel.org/r/20230113152039.770054-2-babu.moger@amd.com

x86/resctrl: Remove arch_has_empty_bitmaps

2022-10-24T08:30:29+00:00

The field arch_has_empty_bitmaps is not required anymore. The field
min_cbm_bits is enough to determine, when validating the CBM (capacity
bit mask), whether the architecture supports a zero CBM.
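
A minimal sketch of why one field is enough: if the minimum number of bits is
zero, a zero CBM passes the same check that rejects it when the minimum is one
or more. The function name is hypothetical, and counting all set bits is a
simplification of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if @cbm has at least @min_cbm_bits bits set. */
bool model_cbm_meets_minimum(uint32_t cbm, unsigned int min_cbm_bits)
{
	unsigned int bits = 0;

	/* Clear the lowest set bit on each iteration and count. */
	for (; cbm; cbm &= cbm - 1)
		bits++;
	return bits >= min_cbm_bits;
}
```

A zero CBM is accepted with min_cbm_bits = 0 and rejected with
min_cbm_bits = 1, so no separate "empty bitmaps" flag is needed in this model.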

Suggested-by: Reinette Chatre 
Signed-off-by: Babu Moger 
Signed-off-by: Borislav Petkov 
Reviewed-by: Reinette Chatre 
Reviewed-by: Fenghua Yu 
Link: https://lore.kernel.org/r/166430979654.372014.615622285687642644.stgit@bmoger-ubuntu