linux.git/arch/x86/kernel/cpu/resctrl, branch v6.10

x86/resctrl: Don't try to free nonexistent RMIDs

2024-06-19T09:39:09+00:00

Commit

  6791e0ea3071 ("x86/resctrl: Access per-rmid structures by index")

adds logic to map individual monitoring groups into a global index space used
for tracking allocated RMIDs.

Attempts to free the default RMID are ignored in free_rmid(), and this works
fine on x86.

With arm64 MPAM, there is a latent bug here however: on platforms with no
monitors exposed through resctrl, each control group still gets a different
monitoring group ID as seen by the hardware, since the CLOSID always forms part
of the monitoring group ID.

This means that when removing a control group, the code may try to free this
group's default monitoring group RMID for real.  If there are no monitors
however, the RMID tracking table rmid_ptrs[] would be a waste of memory and is
never allocated, leading to a splat when free_rmid() tries to dereference the
table.

One option would be to treat RMID 0 as special for every CLOSID, but this would
be ugly since bookkeeping still needs to be done for these monitoring group IDs
when there are monitors present in the hardware.

Instead, add a gating check of resctrl_arch_mon_capable() in free_rmid(), and
just do nothing if the hardware doesn't have monitors.

This fix mirrors the gating checks already present in
mkdir_rdt_prepare_rmid_alloc() and elsewhere.

No functional change on x86.

  [ bp: Massage commit message. ]

Fixes: 6791e0ea3071 ("x86/resctrl: Access per-rmid structures by index")
Signed-off-by: Dave Martin 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Reinette Chatre 
Tested-by: Reinette Chatre 
Link: https://lore.kernel.org/r/20240618140152.83154-1-Dave.Martin@arm.com

Merge tag 'x86_cache_for_v6.10_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2024-05-14T16:04:37+00:00

Pull x86 resource control updates from Borislav Petkov:

 - Add a tracepoint to read out LLC occupancy of resource monitor IDs
   with the goal of freeing them sooner rather than later

 - Other code improvements and cleanups

* tag 'x86_cache_for_v6.10_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/resctrl: Add tracepoint for llc_occupancy tracking
  x86/resctrl: Rename pseudo_lock_event.h to trace.h
  x86/resctrl: Simplify call convention for MSR update functions
  x86/resctrl: Pass domain to target CPU

x86/resctrl: Switch to new Intel CPU model defines

2024-04-29T08:31:28+00:00

New CPU #defines encode vendor and family as well as model.

  [ bp: Squash two resctrl patches into one. ]

Signed-off-by: Tony Luck 
Signed-off-by: Dave Hansen 
Signed-off-by: Borislav Petkov (AMD) 
Link: https://lore.kernel.org/all/20240424181514.41848-1-tony.luck%40intel.com

x86/resctrl: Add tracepoint for llc_occupancy tracking

2024-04-24T12:24:48+00:00

In our production environment, after removing monitor groups, those
unused RMIDs get stuck in the limbo list forever because their
llc_occupancy is always larger than the threshold. But the unused RMIDs
can be successfully freed by turning up the threshold.

In order to know how much the threshold should be, perf can be used to
acquire the llc_occupancy of RMIDs in each rdt domain.

Instead of using perf tool to track llc_occupancy and filter the log
manually, it is more convenient for users to use tracepoint to do this
work. So add a new tracepoint that shows the llc_occupancy of busy RMIDs
when scanning the limbo list.

Suggested-by: Reinette Chatre 
Suggested-by: James Morse 
Signed-off-by: Haifeng Xu 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: James Morse 
Reviewed-by: Reinette Chatre 
Link: https://lore.kernel.org/r/20240408092303.26413-3-haifeng.xu@shopee.com

x86/resctrl: Rename pseudo_lock_event.h to trace.h

2024-04-24T12:21:52+00:00

Now only the pseudo-locking part uses tracepoints to do event tracking,
but other parts of resctrl may need new tracepoints. It is unnecessary
to create separate header files and define CREATE_TRACE_POINTS in
different c files which fragments the resctrl tracing.

Therefore, give the resctrl tracepoint header file a generic name to
support its use for tracepoints that are not specific to pseudo-locking.

No functional change.

Suggested-by: Reinette Chatre 
Signed-off-by: Haifeng Xu 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Reinette Chatre 
Link: https://lore.kernel.org/r/20240408092303.26413-2-haifeng.xu@shopee.com

x86/resctrl: Simplify call convention for MSR update functions

2024-04-24T11:47:00+00:00

The per-resource MSR update functions cat_wrmsr(), mba_wrmsr_intel(),
and mba_wrmsr_amd() all take three arguments:

  (struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)

struct msr_param contains pointers to both struct rdt_resource and struct
rdt_domain, thus only struct msr_param is necessary.

Pass struct msr_param as a single parameter. Clean up formatting and
fix some fir tree declaration ordering.

No functional change.

Suggested-by: Reinette Chatre 
Signed-off-by: Tony Luck 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Reinette Chatre 
Reviewed-by: Babu Moger 
Tested-by: Maciej Wieczor-Retman 
Link: https://lore.kernel.org/r/20240308213846.77075-3-tony.luck@intel.com

x86/resctrl: Pass domain to target CPU

2024-04-24T11:41:41+00:00

reset_all_ctrls() and resctrl_arch_update_domains() use on_each_cpu_mask()
to call rdt_ctrl_update() on potentially one CPU from each domain.

But this means rdt_ctrl_update() needs to figure out which domain to
apply changes to. Doing so requires a search of all domains in a resource,
which can only be done safely if cpus_lock is held. Both callers do hold
this lock, but there isn't a way for a function called on another CPU
via IPI to verify this.

Commit

  c0d848fcb09d ("x86/resctrl: Remove lockdep annotation that triggers
  false positive")

removed the incorrect assertions.

Add the target domain to the msr_param structure and call
rdt_ctrl_update() for each domain separately using
smp_call_function_single(). This means that rdt_ctrl_update() doesn't
need to search for the domain and get_domain_from_cpu() can safely
assert that the cpus_lock is held since the remaining callers do not use
IPI.

Signed-off-by: Tony Luck 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Reinette Chatre 
Reviewed-by: James Morse 
Reviewed-by: Babu Moger 
Tested-by: Maciej Wieczor-Retman 
Link: https://lore.kernel.org/r/20240308213846.77075-2-tony.luck@intel.com

x86/resctrl: Fix uninitialized memory read when last CPU of domain goes offline

2024-04-03T07:30:01+00:00

Tony encountered this OOPS when the last CPU of a domain goes
offline while running a kernel built with CONFIG_NO_HZ_FULL:

    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0
    Oops: 0000 [#1] PREEMPT SMP NOPTI
    ...
    RIP: 0010:__find_nth_andnot_bit+0x66/0x110
    ...
    Call Trace:
     
     ? __die()
     ? page_fault_oops()
     ? exc_page_fault()
     ? asm_exc_page_fault()
     cpumask_any_housekeeping()
     mbm_setup_overflow_handler()
     resctrl_offline_cpu()
     resctrl_arch_offline_cpu()
     cpuhp_invoke_callback()
     cpuhp_thread_fun()
     smpboot_thread_fn()
     kthread()
     ret_from_fork()
     ret_from_fork_asm()
     

The NULL pointer dereference is encountered while searching for another
online CPU in the domain (of which there are none) that can be used to
run the MBM overflow handler.

Because the kernel is configured with CONFIG_NO_HZ_FULL the search for
another CPU (in its effort to prefer those CPUs that aren't marked
nohz_full) consults the mask representing the nohz_full CPUs,
tick_nohz_full_mask. On a kernel with CONFIG_CPUMASK_OFFSTACK=y
tick_nohz_full_mask is not allocated unless the kernel is booted with
the "nohz_full=" parameter and because of that any access to
tick_nohz_full_mask needs to be guarded with tick_nohz_full_enabled().

Replace the IS_ENABLED(CONFIG_NO_HZ_FULL) with tick_nohz_full_enabled().
The latter ensures tick_nohz_full_mask can be accessed safely and can be
used whether kernel is built with CONFIG_NO_HZ_FULL enabled or not.

[ Use Ingo's suggestion that combines the two NO_HZ checks into one. ]

Fixes: a4846aaf3945 ("x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow")
Reported-by: Tony Luck 
Signed-off-by: Reinette Chatre 
Signed-off-by: Ingo Molnar 
Reviewed-by: Babu Moger 
Link: https://lore.kernel.org/r/ff8dfc8d3dcb04b236d523d1e0de13d2ef585223.1711993956.git.reinette.chatre@intel.com
Closes: https://lore.kernel.org/lkml/ZgIFT5gZgIQ9A9G7@agluck-desk3/

x86/resctrl: Remove lockdep annotation that triggers false positive

2024-02-22T15:15:38+00:00

get_domain_from_cpu() walks a list of domains to find the one that
contains the specified CPU. This needs to be protected against races
with CPU hotplug when the list is modified. It has recently gained
a lockdep annotation to check this.

The lockdep annotation causes false positives when called via IPI as the
lock is held, but by another process. Remove it.

  [ bp: Refresh it ontop of x86/cache. ]

Fixes: fb700810d30b ("x86/resctrl: Separate arch and fs resctrl locks")
Reported-by: Tony Luck 
Signed-off-by: James Morse 
Signed-off-by: Borislav Petkov (AMD) 
Link: https://lore.kernel.org/all/ZdUSwOM9UUNpw84Y@agluck-desk3

x86/resctrl: Separate arch and fs resctrl locks

2024-02-19T18:28:07+00:00

resctrl has one mutex that is taken by the architecture-specific code, and the
filesystem parts. The two interact via cpuhp, where the architecture code
updates the domain list. Filesystem handlers that walk the domains list should
not run concurrently with the cpuhp callback modifying the list.

Exposing a lock from the filesystem code means the interface is not cleanly
defined, and creates the possibility of cross-architecture lock ordering
headaches. The interaction only exists so that certain filesystem paths are
serialised against CPU hotplug. The CPU hotplug code already has a mechanism to
do this using cpus_read_lock().

MPAM's monitors have an overflow interrupt, so it needs to be possible to walk
the domains list in irq context. RCU is ideal for this, but some paths need to
be able to sleep to allocate memory.

Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part of a cpuhp
callback, cpus_read_lock() must always be taken first.
rdtgroup_schemata_write() already does this.

Most of the filesystem code's domain list walkers are currently protected by
the rdtgroup_mutex taken in rdtgroup_kn_lock_live().  The exceptions are
rdt_bit_usage_show() and the mon_config helpers which take the lock directly.

Make the domain list protected by RCU. An architecture-specific lock prevents
concurrent writers. rdt_bit_usage_show() could walk the domain list using RCU,
but to keep all the filesystem operations the same, this is changed to call
cpus_read_lock().  The mon_config helpers send multiple IPIs, take the
cpus_read_lock() in these cases.

The other filesystem list walkers need to be able to sleep.  Add
cpus_read_lock() to rdtgroup_kn_lock_live() so that the cpuhp callbacks can't
be invoked when file system operations are occurring.

Add lockdep_assert_cpus_held() in the cases where the rdtgroup_kn_lock_live()
call isn't obvious.

Resctrl's domain online/offline calls now need to take the rdtgroup_mutex
themselves.

  [ bp: Fold in a build fix: https://lore.kernel.org/r/87zfvwieli.ffs@tglx ]

Signed-off-by: James Morse 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Shaopeng Tan 
Reviewed-by: Reinette Chatre 
Reviewed-by: Babu Moger 
Tested-by: Shaopeng Tan 
Tested-by: Peter Newman 
Tested-by: Babu Moger 
Tested-by: Carl Worth  # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-25-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD)