linux.git/Documentation/scheduler, branch v6.9

Merge tag 'sched-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2024-03-24T18:11:05+00:00

Pull scheduler doc clarification from Thomas Gleixner:
 "A single update for the documentation of the base_slice_ns tunable to
  clarify that any value which is less than the tick slice has no effect
  because the scheduler tick is not guaranteed to happen within the set
  time slice"

* tag 'sched-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/doc: Update documentation for base_slice_ns and CONFIG_HZ relation

sched/doc: Update documentation for base_slice_ns and CONFIG_HZ relation

2024-03-21T19:34:16+00:00

The tunable base_slice_ns is dependent on CONFIG_HZ (i.e. TICK_NSEC)
for any significant performance improvement. The reason being the
scheduler tick is not frequent enough to force preemption when
base_slice expires in case of:

           base_slice_ns < TICK_NSEC

The below data is of stress-ng:

	Number of CPU: 1
	Stressor threads: 4
	Time: 30sec

	On CONFIG_HZ=1000

	| base_slice | avg-run (msec) | context-switches |
	| ---------- | -------------- | ---------------- |
	| 3ms        | 2.914          | 10342            |
	| 6ms        | 4.857          | 6196             |
	| 9ms        | 6.754          | 4482             |
	| 12ms       | 7.872          | 3802             |
	| 22ms       | 11.294         | 2710             |
	| 32ms       | 13.425         | 2284             |

	On CONFIG_HZ=100

	| base_slice | avg-run (msec) | context-switches |
	| ---------- | -------------- | ---------------- |
	| 3ms        | 9.144          | 3337             |
	| 6ms        | 9.113          | 3301             |
	| 9ms        | 8.991          | 3315             |
	| 12ms       | 12.935         | 2328             |
	| 22ms       | 16.031         | 1915             |
	| 32ms       | 18.608         | 1622             |

	base_slice: the value of base_slice in ms
	avg-run (msec): average time of the stressor threads got on cpu before
	it got preempted
	context-switches: number of context switches for the stress-ng process

Signed-off-by: Mukesh Kumar Chaurasiya 
Signed-off-by: Ingo Molnar 
Cc: Randy Dunlap 
Link: https://lore.kernel.org/r/20240320173815.927637-2-mchauras@linux.ibm.com

membarrier: Create Documentation/scheduler/membarrier.rst

2024-02-15T16:04:12+00:00

To gather the architecture requirements of the "private/global
expedited" membarrier commands.  The file will be expanded to
integrate further information about the membarrier syscall (as
needed/desired in the future).  While at it, amend some related
inline comments in the membarrier codebase.

Suggested-by: Mathieu Desnoyers 
Signed-off-by: Andrea Parri 
Reviewed-by: Mathieu Desnoyers 
Link: https://lore.kernel.org/r/20240131144936.29190-3-parri.andrea@gmail.com
Signed-off-by: Palmer Dabbelt

sched/fair: Remove SCHED_FEAT(UTIL_EST_FASTUP, true)

2023-12-23T14:59:56+00:00

sched_feat(UTIL_EST_FASTUP) has been added to easily disable the feature
in order to check for possibly related regressions. After 3 years, it has
never been used and no regression has been reported. Let's remove it
and make fast increase a permanent behavior.

Signed-off-by: Vincent Guittot 
Signed-off-by: Ingo Molnar 
Tested-by: Lukasz Luba 
Reviewed-by: Lukasz Luba 
Reviewed-by: Dietmar Eggemann 
Reviewed-by: Hongyan Xia 
Reviewed-by: Tang Yizhou 
Reviewed-by: Yanteng Si  [for the Chinese translation]
Reviewed-by: Alex Shi 
Link: https://lore.kernel.org/r/20231201161652.1241695-2-vincent.guittot@linaro.org

sched/doc: Update documentation after renames and synchronize Chinese version

2023-11-26T15:24:48+00:00

Update the documentation after these changes, which didn't entirely
propagate the changes:

 e23edc86b09d ("sched/fair: Rename check_preempt_curr() to wakeup_preempt()")
 03b7fad167ef ("sched: Add task_struct pointer to sched_class::set_curr_task")
 2f88c8e802c8 ("sched/eevdf/doc: Modify the documented knob to base_slice_ns as well")

[ mingo: Reworked the changelog. ]

Signed-off-by: Wenyu Huang 
Signed-off-by: Ingo Molnar 
Cc: linux-kernel@vger.kernel.org

Merge tag 'asm-generic-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic

2023-11-02T01:28:33+00:00

Pull ia64 removal and asm-generic updates from Arnd Bergmann:

 - The ia64 architecture gets its well-earned retirement as planned,
   now that there is one last (mostly) working release that will be
   maintained as an LTS kernel.

 - The architecture specific system call tables are updated for the
   added map_shadow_stack() syscall and to remove references to the
   long-gone sys_lookup_dcookie() syscall.

* tag 'asm-generic-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  hexagon: Remove unusable symbols from the ptrace.h uapi
  asm-generic: Fix spelling of architecture
  arch: Reserve map_shadow_stack() syscall number for all architectures
  syscalls: Cleanup references to sys_lookup_dcookie()
  Documentation: Drop or replace remaining mentions of IA64
  lib/raid6: Drop IA64 support
  Documentation: Drop IA64 from feature descriptions
  kernel: Drop IA64 support from sig_fault handlers
  arch: Remove Itanium (IA-64) architecture

sched/topology: Remove the EM_MAX_COMPLEXITY limit

2023-10-09T11:07:27+00:00

The Energy Aware Scheduler (EAS) estimates the energy consumption
of placing a task on different CPUs. The goal is to minimize this
energy consumption. Estimating the energy of different task placements
is increasingly complex with the size of the platform.

To avoid having a slow wake-up path, EAS is only enabled if this
complexity is low enough.

The current complexity limit was set in:

  b68a4c0dba3b1 ("sched/topology: Disable EAS on inappropriate platforms")

... based on the first implementation of EAS, which was re-computing
the power of the whole platform for each task placement scenario, see:

  390031e4c309 ("sched/fair: Introduce an energy estimation helper function")

... but the complexity of EAS was reduced in:

  eb92692b2544d ("sched/fair: Speed-up energy-aware wake-ups")

... and find_energy_efficient_cpu() (feec) algorithm was updated in:

  3e8c6c9aac42 ("sched/fair: Remove task_util from effective utilization in feec()")

find_energy_efficient_cpu() (feec) is now doing:

	feec()
	\_ for_each_pd(pd) [0]
	  // get max_spare_cap_cpu and compute_prev_delta
	  \_ for_each_cpu(pd) [1]

	  \_ eenv_pd_busy_time(pd) [2]
		\_ for_each_cpu(pd)

	  // compute_energy(pd) without the task
	  \_ eenv_pd_max_util(pd, -1) [3.0]
	    \_ for_each_cpu(pd)
	  \_ em_cpu_energy(pd, -1)
	    \_ for_each_ps(pd)

	  // compute_energy(pd) with the task on prev_cpu
	  \_ eenv_pd_max_util(pd, prev_cpu) [3.1]
	    \_ for_each_cpu(pd)
	  \_ em_cpu_energy(pd, prev_cpu)
	    \_ for_each_ps(pd)

	  // compute_energy(pd) with the task on max_spare_cap_cpu
	  \_ eenv_pd_max_util(pd, max_spare_cap_cpu) [3.2]
	    \_ for_each_cpu(pd)
	  \_ em_cpu_energy(pd, max_spare_cap_cpu)
	    \_ for_each_ps(pd)

	[3.1] happens only once since prev_cpu is unique. With the same
	      definitions for nr_pd, nr_cpus and nr_ps, the complexity is of:

		nr_pd * (2 * [nr_cpus in pd] + 2 * ([nr_cpus in pd] + [nr_ps in pd]))
		+ ([nr_cpus in pd] + [nr_ps in pd])

		 [0]  * (     [1] + [2]      +       [3.0] + [3.2]                  )
		+ [3.1]

		= nr_pd * (4 * [nr_cpus in pd] + 2 * [nr_ps in pd])
		+ [nr_cpus in prev pd] + nr_ps

The complexity limit was set to 2048 in:

  b68a4c0dba3b1 ("sched/topology: Disable EAS on inappropriate platforms")

... to make "EAS usable up to 16 CPUs with per-CPU DVFS and less than 8
performance states each". For the same platform, the complexity would
actually be of:

  16 * (4 + 2 * 7) + 1 + 7 = 296

Since the EAS complexity was greatly reduced since the limit was
introduced, bigger platforms can handle EAS.

For instance, a platform with 112 CPUs with 7 performance states
each would not reach it:

  112 * (4 + 2 * 7) + 1 + 7 = 2024

To reflect this improvement in the underlying EAS code, remove
the EAS complexity check.

Note that a limit on the number of CPUs still holds against
EM_MAX_NUM_CPUS to avoid overflows during the energy estimation.

[ mingo: Updates to the changelog. ]

Signed-off-by: Pierre Gondois 
Signed-off-by: Ingo Molnar 
Reviewed-by: Lukasz Luba 
Reviewed-by: Dietmar Eggemann 
Link: https://lore.kernel.org/r/20231009060037.170765-2-sshegde@linux.vnet.ibm.com

sched/topology: Consolidate and clean up access to a CPU's max compute capacity

2023-10-09T10:59:48+00:00

Remove the rq::cpu_capacity_orig field and use arch_scale_cpu_capacity()
instead.

The scheduler uses 3 methods to get access to a CPU's max compute capacity:

 - arch_scale_cpu_capacity(cpu) which is the default way to get a CPU's capacity.

 - cpu_capacity_orig field which is periodically updated with
   arch_scale_cpu_capacity().

 - capacity_orig_of(cpu) which encapsulates rq->cpu_capacity_orig.

There is no real need to save the value returned by arch_scale_cpu_capacity()
in struct rq. arch_scale_cpu_capacity() returns:

 - either a per_cpu variable.

 - or a const value for systems which have only one capacity.

Remove rq::cpu_capacity_orig and use arch_scale_cpu_capacity() everywhere.

No functional changes.

Some performance tests on Arm64:

  - small SMP device (hikey): no noticeable changes
  - HMP device (RB5):         hackbench shows minor improvement (1-2%)
  - large smp (thx2):         hackbench and tbench shows minor improvement (1%)

Signed-off-by: Vincent Guittot 
Signed-off-by: Ingo Molnar 
Reviewed-by: Dietmar Eggemann 
Link: https://lore.kernel.org/r/20231009103621.374412-2-vincent.guittot@linaro.org

sched/rt/docs: Use 'real-time' instead of 'realtime'

2023-10-02T13:17:14+00:00

Standardize on a single variant.

Signed-off-by: Cyril Hrubis 
Signed-off-by: Ingo Molnar 
Link: https://lore.kernel.org/r/20231002115553.3007-4-chrubis@suse.cz

sched/rt/docs: Clarify & fix sched_rt_* sysctl docs

2023-10-02T13:17:13+00:00

- Describe explicitly that sched_rt_runtime_us is allocated from
  sched_rt_period_us and hence always less or equal to that value.

- The limit for sched_rt_runtime_us is not INT_MAX-1, but rather it's
  limited by the value of sched_rt_period_us. If sched_rt_period_us is
  INT_MAX then sched_rt_runtime_us can be set to INT_MAX as well.

Signed-off-by: Cyril Hrubis 
Signed-off-by: Ingo Molnar 
Link: https://lore.kernel.org/r/20231002115553.3007-3-chrubis@suse.cz