linux-stable.git/kernel/sched/cputime.c, branch v4.9.331

sched/cputime: Fix ksoftirqd cputime accounting regression

2018-10-20T07:51:32+00:00

commit 25e2d8c1b9e327ed260edd13169cc22bc7a78bc6 upstream.

irq_time_read() returns the irqtime minus the ksoftirqd time. This
is necessary because irq_time_read() is used to substract the IRQ time
from the sum_exec_runtime of a task. If we were to include the softirq
time of ksoftirqd, this task would substract its own CPU time everytime
it updates ksoftirqd->sum_exec_runtime which would therefore never
progress.

But this behaviour got broken by:

  a499a5a14db ("sched/cputime: Increment kcpustat directly on irqtime account")

... which now includes ksoftirqd softirq time in the time returned by
irq_time_read().

This has resulted in wrong ksoftirqd cputime reported to userspace
through /proc/stat and thus "top" not showing ksoftirqd when it should
after intense networking load.

ksoftirqd->stime happens to be correct but it gets scaled down by
sum_exec_runtime through task_cputime_adjusted().

To fix this, just account the strict IRQ time in a separate counter and
use it to report the IRQ time.

Reported-and-tested-by: Jesper Dangaard Brouer 
Signed-off-by: Frederic Weisbecker 
Reviewed-by: Rik van Riel 
Acked-by: Jesper Dangaard Brouer 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Stanislaw Gruszka 
Cc: Thomas Gleixner 
Cc: Wanpeng Li 
Link: http://lkml.kernel.org/r/1493129448-5356-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Ivan Delalande 
Signed-off-by: Greg Kroah-Hartman

sched/cputime: Increment kcpustat directly on irqtime account

2018-10-20T07:51:32+00:00

commit a499a5a14dbd1d0315a96fc62a8798059325e9e6 upstream.

The irqtime is accounted is nsecs and stored in
cpu_irq_time.hardirq_time and cpu_irq_time.softirq_time. Once the
accumulated amount reaches a new jiffy, this one gets accounted to the
kcpustat.

This was necessary when kcpustat was stored in cputime_t, which could at
worst have jiffies granularity. But now kcpustat is stored in nsecs
so this whole discretization game with temporary irqtime storage has
become unnecessary.

We can now directly account the irqtime to the kcpustat.

Signed-off-by: Frederic Weisbecker 
Cc: Benjamin Herrenschmidt 
Cc: Fenghua Yu 
Cc: Heiko Carstens 
Cc: Linus Torvalds 
Cc: Martin Schwidefsky 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Stanislaw Gruszka 
Cc: Thomas Gleixner 
Cc: Tony Luck 
Cc: Wanpeng Li 
Link: http://lkml.kernel.org/r/1485832191-26889-17-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Ivan Delalande 
Signed-off-by: Greg Kroah-Hartman

sched/cputime: Convert kcpustat to nsecs

2018-10-20T07:51:32+00:00

commit 7fb1327ee9b92fca27662f9b9d60c7c3376d6c69 upstream.

Kernel CPU stats are stored in cputime_t which is an architecture
defined type, and hence a bit opaque and requiring accessors and mutators
for any operation.

Converting them to nsecs simplifies the code and is one step toward
the removal of cputime_t in the core code.

Signed-off-by: Frederic Weisbecker 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Heiko Carstens 
Cc: Martin Schwidefsky 
Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Stanislaw Gruszka 
Cc: Wanpeng Li 
Link: http://lkml.kernel.org/r/1485832191-26889-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar 
[colona: minor conflict as 527b0a76f41d ("sched/cpuacct: Avoid %lld seq_printf
 warning") is missing from v4.9]
Signed-off-by: Ivan Delalande 
Signed-off-by: Greg Kroah-Hartman

sched/irqtime: Consolidate irqtime flushing code

2016-09-30T09:46:41+00:00

The code performing irqtime nsecs stats flushing to kcpustat is roughly
the same for hardirq and softirq. So lets consolidate that common code.

Signed-off-by: Frederic Weisbecker 
Reviewed-by: Rik van Riel 
Cc: Eric Dumazet 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Paolo Bonzini 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Wanpeng Li 
Link: http://lkml.kernel.org/r/1474849761-12678-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar

sched/irqtime: Consolidate accounting synchronization with u64_stats API

2016-09-30T09:46:40+00:00

The irqtime accounting currently implement its own ad hoc implementation
of u64_stats API. Lets rather consolidate it with the appropriate
library.

Signed-off-by: Frederic Weisbecker 
Reviewed-by: Rik van Riel 
Cc: Eric Dumazet 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Paolo Bonzini 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Wanpeng Li 
Link: http://lkml.kernel.org/r/1474849761-12678-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar

sched/irqtime: Remove needless IRQs disablement on kcpustat update

2016-09-30T09:46:39+00:00

The callers of the functions performing irqtime kcpustat updates have
IRQS disabled, no need to disable them again.

Signed-off-by: Frederic Weisbecker 
Reviewed-by: Rik van Riel 
Cc: Eric Dumazet 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Paolo Bonzini 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Wanpeng Li 
Link: http://lkml.kernel.org/r/1474849761-12678-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar

sched/irqtime: No need for preempt-safe accessors

2016-09-30T09:46:38+00:00

We can safely use the preempt-unsafe accessors for irqtime when we
flush its counters to kcpustat as IRQs are disabled at this time.

Signed-off-by: Frederic Weisbecker 
Reviewed-by: Rik van Riel 
Cc: Eric Dumazet 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Paolo Bonzini 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Wanpeng Li 
Link: http://lkml.kernel.org/r/1474849761-12678-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar

sched/cputime: Improve scalability by not accounting thread group tasks pending runtime

2016-08-18T09:53:46+00:00

Commit:

  d670ec13178d0 ("posix-cpu-timers: Cure SMP wobbles")

started accounting thread group tasks pending runtime in thread_group_cputime().

Another commit:

  6e998916dfe32 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency")

updated scheduler runtime statistics (call update_curr()) when reading task pending
runtime. Those changes cause bad performance of SYS_times() and
SYS_clock_gettimes(CLOCK_PROCESS_CPUTIME_ID) syscalls, especially on
larger systems with many CPUs.

While we would like to have cpuclock monotonicity kept i.e. have
problems fixed by above commits stay fixed, we also would like to have
good performance.

However when we notice that change from commit d670ec13178d0 is not
longer needed to solve problem addressed by that commit, because of
change from the second commit 6e998916dfe32, we can get room for
optimization. Since we update task while reading it's pending runtime
in task_sched_runtime(), clock_gettime(CLOCK_PROCESS_CPUTIME_ID) will
see updated values and on testcase from d670ec13178d0 process cpuclock
will not be smaller than thread cpuclock.

I tested the patch on testcases from commits d670ec13178d0,
6e998916dfe32 and some other cpuclock/cputimers testcases and
did not found cpuclock monotonicity problems or other malfunction.

This patch has the drawback that we will not provide thread group cputime
up-to-date to the last moment. For example when arming cputime timer,
we will arm it with possibly a bit outdated values and that timer will
trigger earlier compared to behaviour without the patch. However that
was the behaviour before d670ec13178d0 commit (kernel v3.1) so it's
unlikely to affect applications.

Patch improves related syscall performance, as measured by Giovanni's
benchmarks described in commit:

  6075620b0590e ("sched/cputime: Mitigate performance regression in times()/clock_gettime()")

The benchmark results are:

SYS_clock_gettime():

  threads    4.7-rc7     3.18-rc3              4.7-rc7 + prefetch    4.7-rc7 + patch
                         (pre-6e998916dfe3)
  2          3.48        2.23 ( 35.68%)        3.06 ( 11.83%)        1.08 ( 68.81%)
  5          3.33        2.83 ( 14.84%)        3.25 (  2.40%)        0.71 ( 78.55%)
  8          3.37        2.84 ( 15.80%)        3.26 (  3.30%)        0.56 ( 83.49%)
  12         3.32        3.09 (  6.69%)        3.37 ( -1.60%)        0.42 ( 87.28%)
  21         4.01        3.14 ( 21.70%)        3.90 (  2.74%)        0.35 ( 91.35%)
  30         3.63        3.28 (  9.75%)        3.36 (  7.41%)        0.28 ( 92.23%)
  48         3.71        3.02 ( 18.69%)        3.11 ( 16.27%)        0.39 ( 89.39%)
  79         3.75        2.88 ( 23.23%)        3.16 ( 15.74%)        0.46 ( 87.76%)
  110        3.81        2.95 ( 22.62%)        3.25 ( 14.80%)        0.56 ( 85.41%)
  128        3.88        3.05 ( 21.28%)        3.31 ( 14.76%)        0.62 ( 84.10%)

SYS_times():

  threads    4.7-rc7     3.18-rc3              4.7-rc7 + prefetch    4.7-rc7 + patch
                         (pre-6e998916dfe3)
  2          3.65        2.27 ( 37.94%)        3.25 ( 11.03%)        1.62 ( 55.71%)
  5          3.45        2.78 ( 19.34%)        3.17 (  7.92%)        2.33 ( 32.28%)
  8          3.52        2.79 ( 20.66%)        3.22 (  8.69%)        2.06 ( 41.44%)
  12         3.29        3.02 (  8.33%)        3.36 ( -2.04%)        2.00 ( 39.18%)
  21         4.07        3.10 ( 23.86%)        3.92 (  3.78%)        2.07 ( 49.18%)
  30         3.87        3.33 ( 13.80%)        3.40 ( 12.17%)        1.89 ( 51.12%)
  48         3.79        2.96 ( 21.94%)        3.16 ( 16.61%)        1.69 ( 55.46%)
  79         3.88        2.88 ( 25.82%)        3.28 ( 15.42%)        1.60 ( 58.81%)
  110        3.90        2.98 ( 23.73%)        3.38 ( 13.35%)        1.73 ( 55.61%)
  128        4.00        3.10 ( 22.40%)        3.38 ( 15.45%)        1.66 ( 58.52%)

Reported-and-tested-by: Giovanni Gherdovich 
Signed-off-by: Stanislaw Gruszka 
Cc: Linus Torvalds 
Cc: Mel Gorman 
Cc: Mike Galbraith 
Cc: Paolo Bonzini 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: Wanpeng Li 
Link: http://lkml.kernel.org/r/20160817093043.GA25206@redhat.com
Signed-off-by: Ingo Molnar

sched/cputime: Resync steal time when guest & host lose sync

2016-08-18T09:19:48+00:00

Commit:

  57430218317e ("sched/cputime: Count actually elapsed irq & softirq time")

... fixed a bug but also triggered a regression:

On an i5 laptop, 4 pCPUs, 4vCPUs for one full dynticks guest, there are four
CPU hog processes(for loop) running in the guest, I hot-unplug the pCPUs
on host one by one until there is only one left, then observe CPU utilization
via 'top' in the guest, it shows:

  100% st for cpu0(housekeeping)
   75% st for other CPUs (nohz full mode)

However, w/o this commit it shows the correct 75% for all four CPUs.

When a guest is interrupted for a longer amount of time, missed clock ticks
are not redelivered later. Because of that, we should not limit the amount
of steal time accounted to the amount of time that the calling functions
think have passed.

However, the interval returned by account_other_time() is NOT rounded down
to the nearest jiffy, while the base interval in get_vtime_delta() it is
subtracted from is, so the max cputime limit is required to avoid underflow.

This patch fixes the regression by limiting the account_other_time() from
get_vtime_delta() to avoid underflow, and lets the other three call sites
(in account_other_time() and steal_account_process_time()) account however
much steal time the host told us elapsed.

Suggested-by: Rik van Riel 
Suggested-by: Paolo Bonzini 
Signed-off-by: Wanpeng Li 
Reviewed-by: Rik van Riel 
Cc: Frederic Weisbecker 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Radim Krcmar 
Cc: Thomas Gleixner 
Cc: kvm@vger.kernel.org
Link: http://lkml.kernel.org/r/1471399546-4069-1-git-send-email-wanpeng.li@hotmail.com
[ Improved the changelog. ]
Signed-off-by: Ingo Molnar

sched/cputime: Fix NO_HZ_FULL getrusage() monotonicity regression

2016-08-18T08:48:46+00:00

Mike reports:

 Roughly 10% of the time, ltp testcase getrusage04 fails:
 getrusage04    0  TINFO  :  Expected timers granularity is 4000 us
 getrusage04    0  TINFO  :  Using 1 as multiply factor for max [us]time increment (1000+4000us)!
 getrusage04    0  TINFO  :  utime:           0us; stime:         179us
 getrusage04    0  TINFO  :  utime:        3751us; stime:           0us
 getrusage04    1  TFAIL  :  getrusage04.c:133: stime increased > 5000us:

And tracked it down to the case where the task simply doesn't get
_any_ [us]time ticks.

Update the code to assume all rtime is utime when we lack information,
thus ensuring a task that elides the tick gets time accounted.

Reported-by: Mike Galbraith 
Tested-by: Mike Galbraith 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Frederic Weisbecker 
Cc: Fredrik Markstrom 
Cc: Linus Torvalds 
Cc: Paolo Bonzini 
Cc: Peter Zijlstra 
Cc: Radim 
Cc: Rik van Riel 
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: Vince Weaver 
Cc: Wanpeng Li 
Cc: stable@vger.kernel.org # 4.3+
Fixes: 9d7fb0427648 ("sched/cputime: Guarantee stime + utime == rtime")
Signed-off-by: Ingo Molnar