linux.git/kernel/sched/fair.c, branch v5.14

Merge tag 'sched-urgent-2021-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2021-07-11T18:13:57+00:00

Pull scheduler fixes from Ingo Molnar:
 "Three fixes:

   - Fix load tracking bug/inconsistency

   - Fix a sporadic CFS bandwidth constraints enforcement bug

   - Fix a uclamp utilization tracking bug for newly woken tasks"

* tag 'sched-urgent-2021-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/uclamp: Ignore max aggregation if rq is idle
  sched/fair: Fix CFS bandwidth hrtimer expiry type
  sched/fair: Sync load_sum with load_avg after dequeue

sched/fair: Fix CFS bandwidth hrtimer expiry type

2021-07-02T13:58:24+00:00

The time remaining until expiry of the refresh_timer can be negative.
Casting the type to an unsigned 64-bit value will cause integer
underflow, making the runtime_refresh_within return false instead of
true. These situations are rare, but they do happen.

This does not cause user-facing issues or errors; other than
possibly unthrottling cfs_rq's using runtime from the previous period(s),
making the CFS bandwidth enforcement less strict in those (special)
situations.

Signed-off-by: Odin Ugedal 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Ben Segall 
Link: https://lore.kernel.org/r/20210629121452.18429-1-odin@uged.al

sched/fair: Sync load_sum with load_avg after dequeue

2021-07-02T13:58:23+00:00

commit 9e077b52d86a ("sched/pelt: Check that *_avg are null when *_sum are")
reported some inconsitencies between *_avg and *_sum.

commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
fixed some but one remains when dequeuing load.

sync the cfs's load_sum with its load_avg after dequeuing the load of a
sched_entity.

Fixes: 9e077b52d86a ("sched/pelt: Check that *_avg are null when *_sum are")
Reported-by: Sachin Sant 
Signed-off-by: Vincent Guittot 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Odin Ugedal 
Tested-by: Sachin Sant 
Link: https://lore.kernel.org/r/20210701171837.32156-1-vincent.guittot@linaro.org

Merge tag 'sched-urgent-2021-06-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2021-06-30T22:37:49+00:00

Pull scheduler fixes from Ingo Molnar:

 - Fix a small inconsistency (bug) in load tracking, caught by a new
   warning that several people reported.

 - Flip CONFIG_SCHED_CORE to default-disabled, and update the Kconfig
   help text.

* tag 'sched-urgent-2021-06-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Disable CONFIG_SCHED_CORE by default
  sched/fair: Ensure _sum and _avg values stay consistent

Merge tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2021-06-28T19:14:19+00:00

Pull scheduler udpates from Ingo Molnar:

- Changes to core scheduling facilities:

- Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables
coordinated scheduling across SMT siblings. This is a much
requested feature for cloud computing platforms, to allow the
flexible utilization of SMT siblings, without exposing untrusted
domains to information leaks & side channels, plus to ensure more
deterministic computing performance on SMT systems used by
heterogenous workloads.

There are new prctls to set core scheduling groups, which allows
more flexible management of workloads that can share siblings.

- Fix task->state access anti-patterns that may result in missed
wakeups and rename it to ->__state in the process to catch new
abuses.

- Load-balancing changes:

- Tweak newidle_balance for fair-sched, to improve 'memcache'-like
workloads.

- "Age" (decay) average idle time, to better track & improve
workloads such as 'tbench'.

- Fix & improve energy-aware (EAS) balancing logic & metrics.

- Fix & improve the uclamp metrics.

- Fix task migration (taskset) corner case on !CONFIG_CPUSET.

- Fix RT and deadline utilization tracking across policy changes

- Introduce a "burstable" CFS controller via cgroups, which allows
bursty CPU-bound workloads to borrow a bit against their future
quota to improve overall latencies & batching. Can be tweaked via
/sys/fs/cgroup/cpu//cpu.cfs_burst_us.

- Rework assymetric topology/capacity detection & handling.

- Scheduler statistics & tooling:

- Disable delayacct by default, but add a sysctl to enable it at
runtime if tooling needs it. Use static keys and other
optimizations to make it more palatable.

- Use sched_clock() in delayacct, instead of ktime_get_ns().

- Misc cleanups and fixes.

* tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
sched/doc: Update the CPU capacity asymmetry bits
sched/topology: Rework CPU capacity asymmetry detection
sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag
psi: Fix race between psi_trigger_create/destroy
sched/fair: Introduce the burstable CFS controller
sched/uclamp: Fix uclamp_tg_restrict()
sched/rt: Fix Deadline utilization tracking during policy change
sched/rt: Fix RT utilization tracking during policy change
sched: Change task_struct::state
sched,arch: Remove unused TASK_STATE offsets
sched,timer: Use __set_current_state()
sched: Add get_current_state()
sched,perf,kvm: Fix preemption condition
sched: Introduce task_is_running()
sched: Unbreak wakeups
sched/fair: Age the average idle time
sched/cpufreq: Consider reduced CPU capacity in energy calculation
sched/fair: Take thermal pressure into account while estimating energy
thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure
sched/fair: Return early from update_tg_cfs_load() if delta == 0
...

sched/fair: Ensure _sum and _avg values stay consistent

2021-06-28T13:42:24+00:00

The _sum and _avg values are in general sync together with the PELT
divider. They are however not always completely in perfect sync,
resulting in situations where _sum gets to zero while _avg stays
positive. Such situations are undesirable.

This comes from the fact that PELT will increase period_contrib, also
increasing the PELT divider, without updating _sum and _avg values to
stay in perfect sync where (_sum == _avg * divider). However, such PELT
change will never lower _sum, making it impossible to end up in a
situation where _sum is zero and _avg is not.

Therefore, we need to ensure that when subtracting load outside PELT,
that when _sum is zero, _avg is also set to zero. This occurs when
(_sum < _avg * divider), and the subtracted (_avg * divider) is bigger
or equal to the current _sum, while the subtracted _avg is smaller than
the current _avg.

Reported-by: Sachin Sant 
Reported-by: Naresh Kamboju 
Signed-off-by: Odin Ugedal 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Vincent Guittot 
Tested-by: Sachin Sant 
Link: https://lore.kernel.org/r/20210624111815.57937-1-odin@uged.al

sched/fair: Introduce the burstable CFS controller

2021-06-24T07:07:50+00:00

The CFS bandwidth controller limits CPU requests of a task group to
quota during each period. However, parallel workloads might be bursty
so that they get throttled even when their average utilization is under
quota. And they are latency sensitive at the same time so that
throttling them is undesired.

We borrow time now against our future underrun, at the cost of increased
interference against the other system users. All nicely bounded.

Traditional (UP-EDF) bandwidth control is something like:

  (U = \Sum u_i) <= 1

This guaranteeds both that every deadline is met and that the system is
stable. After all, if U were > 1, then for every second of walltime,
we'd have to run more than a second of program time, and obviously miss
our deadline, but the next deadline will be further out still, there is
never time to catch up, unbounded fail.

This work observes that a workload doesn't always executes the full
quota; this enables one to describe u_i as a statistical distribution.

For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
(the traditional WCET). This effectively allows u to be smaller,
increasing the efficiency (we can pack more tasks in the system), but at
the cost of missing deadlines when all the odds line up. However, it
does maintain stability, since every overrun must be paired with an
underrun as long as our x is above the average.

That is, suppose we have 2 tasks, both specify a p(95) value, then we
have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
everything is good. At the same time we have a p(5)p(5) = 0.25% chance
both tasks will exceed their quota at the same time (guaranteed deadline
fail). Somewhere in between there's a threshold where one exceeds and
the other doesn't underrun enough to compensate; this depends on the
specific CDFs.

At the same time, we can say that the worst case deadline miss, will be
\Sum e_i; that is, there is a bounded tardiness (under the assumption
that x+e is indeed WCET).

The benefit of burst is seen when testing with schbench. Default value of
kernel.sched_cfs_bandwidth_slice_us(5ms) and CONFIG_HZ(1000) is used.

	mkdir /sys/fs/cgroup/cpu/test
	echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us

	./schbench -m 1 -t 3 -r 20 -c 80000 -R 10

The average CPU usage is at 80%. I run this for 10 times, and got long tail
latency for 6 times and got throttled for 8 times.

Tail latencies are shown below, and it wasn't the worst case.

	Latency percentiles (usec)
		50.0000th: 19872
		75.0000th: 21344
		90.0000th: 22176
		95.0000th: 22496
		*99.0000th: 22752
		99.5000th: 22752
		99.9000th: 22752
		min=0, max=22727
	rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%

The interferenece when using burst is valued by the possibilities for
missing the deadline and the average WCET. Test results showed that when
there many cgroups or CPU is under utilized, the interference is
limited. More details are shown in:
https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/

Co-developed-by: Shanpei Chen 
Signed-off-by: Shanpei Chen 
Co-developed-by: Tianchen Ding 
Signed-off-by: Tianchen Ding 
Signed-off-by: Huaixin Chang 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Ben Segall 
Acked-by: Tejun Heo 
Link: https://lore.kernel.org/r/20210621092800.23714-2-changhuaixin@linux.alibaba.com

sched/fair: Ensure that the CFS parent is added after unthrottling

2021-06-22T12:06:57+00:00

Ensure that a CFS parent will be in the list whenever one of its children is also
in the list.

A warning on rq->tmp_alone_branch != &rq->leaf_cfs_rq_list has been
reported while running LTP test cfs_bandwidth01.

Odin Ugedal found the root cause:

	$ tree /sys/fs/cgroup/ltp/ -d --charset=ascii
	/sys/fs/cgroup/ltp/
	|-- drain
	`-- test-6851
	    `-- level2
		|-- level3a
		|   |-- worker1
		|   `-- worker2
		`-- level3b
		    `-- worker3

Timeline (ish):
- worker3 gets throttled
- level3b is decayed, since it has no more load
- level2 get throttled
- worker3 get unthrottled
- level2 get unthrottled
  - worker3 is added to list
  - level3b is not added to list, since nr_running==0 and is decayed

 [ Vincent Guittot: Rebased and updated to fix for the reported warning. ]

Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
Reported-by: Sachin Sant 
Suggested-by: Vincent Guittot 
Signed-off-by: Rik van Riel 
Signed-off-by: Vincent Guittot 
Signed-off-by: Ingo Molnar 
Tested-by: Sachin Sant 
Acked-by: Odin Ugedal 
Link: https://lore.kernel.org/r/20210621174330.11258-1-vincent.guittot@linaro.org

sched: Change task_struct::state

2021-06-18T09:43:09+00:00

Change the type and name of task_struct::state. Drop the volatile and
shrink it to an 'unsigned int'. Rename it in order to find all uses
such that we can use READ_ONCE/WRITE_ONCE as appropriate.

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Daniel Bristot de Oliveira 
Acked-by: Will Deacon 
Acked-by: Daniel Thompson 
Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org

Merge branch 'sched/urgent' into sched/core, to resolve conflicts

2021-06-18T09:31:25+00:00

This commit in sched/urgent moved the cfs_rq_is_decayed() function:

  a7b359fc6a37: ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")

and this fresh commit in sched/core modified it in the old location:

  9e077b52d86a: ("sched/pelt: Check that *_avg are null when *_sum are")

Merge the two variants.

Conflicts:
	kernel/sched/fair.c

Signed-off-by: Ingo Molnar