linux-stable.git/kernel/sched/cputime.c, branch linux-4.4.y

sched/cputime: Fix NO_HZ_FULL getrusage() monotonicity regression

2016-09-07T06:32:41+00:00

commit 173be9a14f7b2e901cf77c18b1aafd4d672e9d9e upstream.

Mike reports:

 Roughly 10% of the time, ltp testcase getrusage04 fails:
 getrusage04    0  TINFO  :  Expected timers granularity is 4000 us
 getrusage04    0  TINFO  :  Using 1 as multiply factor for max [us]time increment (1000+4000us)!
 getrusage04    0  TINFO  :  utime:           0us; stime:         179us
 getrusage04    0  TINFO  :  utime:        3751us; stime:           0us
 getrusage04    1  TFAIL  :  getrusage04.c:133: stime increased > 5000us:

And tracked it down to the case where the task simply doesn't get
_any_ [us]time ticks.

Update the code to assume all rtime is utime when we lack information,
thus ensuring a task that elides the tick gets time accounted.

Reported-by: Mike Galbraith 
Tested-by: Mike Galbraith 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Frederic Weisbecker 
Cc: Fredrik Markstrom 
Cc: Linus Torvalds 
Cc: Paolo Bonzini 
Cc: Peter Zijlstra 
Cc: Radim 
Cc: Rik van Riel 
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: Vince Weaver 
Cc: Wanpeng Li 
Fixes: 9d7fb0427648 ("sched/cputime: Guarantee stime + utime == rtime")
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

sched/cputime: Fix steal_account_process_tick() to always return jiffies

2016-04-12T16:08:35+00:00

commit f9c904b7613b8b4c85b10cd6b33ad41b2843fa9d upstream.

The callers of steal_account_process_tick() expect it to return
whether a jiffy should be considered stolen or not.

Currently the return value of steal_account_process_tick() is in
units of cputime, which vary between either jiffies or nsecs
depending on CONFIG_VIRT_CPU_ACCOUNTING_GEN.

If cputime has nsecs granularity and there is a tiny amount of
stolen time (a few nsecs, say) then we will consider the entire
tick stolen and will not account the tick on user/system/idle,
causing /proc/stats to show invalid data.

The fix is to change steal_account_process_tick() to accumulate
the stolen time and only account it once it's worth a jiffy.

(Thanks to Frederic Weisbecker for suggestions to fix a bug in my
first version of the patch.)

Signed-off-by: Chris Friesen 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Thomas Gleixner 
Cc: Frederic Weisbecker 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Link: http://lkml.kernel.org/r/56DBBDB8.40305@mail.usask.ca
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

sched/cputime: Fix invalid gtime in proc

2015-12-04T09:18:49+00:00

/proc/stats shows invalid gtime when the thread is running in guest.
When vtime accounting is not enabled, we cannot get a valid delta.
The delta is calculated with now - tsk->vtime_snap, but tsk->vtime_snap
is only updated when vtime accounting is runtime enabled.

This patch makes task_gtime() just return gtime without computing the
buggy non-existing tickless delta when vtime accounting is not enabled.

Use context_tracking_is_enabled() to check if vtime is accounting on
some cpu, in which case only we need to check the tickless delta. This
way we fix the gtime value regression on machines not running nohz full.

The kernel config contains CONFIG_VIRT_CPU_ACCOUNTING_GEN=y and
CONFIG_NO_HZ_FULL_ALL=n and boot without nohz_full.

I ran and stop a busy loop in VM and see the gtime in host.
Dump the 43rd field which shows the gtime in every second:

	 # while :; do awk '{print $3" "$43}' /proc/3955/task/4014/stat; sleep 1; done
	S 4348
	R 7064566
	R 7064766
	R 7064967
	R 7065168
	S 4759
	S 4759

During running busy loop, it returns large value.

After applying this patch, we can see right gtime.

	 # while :; do awk '{print $3" "$43}' /proc/10913/task/10956/stat; sleep 1; done
	S 5338
	R 5365
	R 5465
	R 5566
	R 5666
	S 5726
	S 5726

Signed-off-by: Hiroshi Shimamoto 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Chris Metcalf 
Cc: Christoph Lameter 
Cc: Linus Torvalds 
Cc: Luiz Capitulino 
Cc: Mike Galbraith 
Cc: Paul E . McKenney 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1447948054-28668-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar

kvm/x86: Hyper-V HV_X64_MSR_VP_RUNTIME support

2015-10-01T13:06:33+00:00

HV_X64_MSR_VP_RUNTIME msr used by guest to get
"the time the virtual processor consumes running guest code,
and the time the associated logical processor spends running
hypervisor code on behalf of that guest."

Calculation of this time is performed by task_cputime_adjusted()
for vcpu task.

Necessary to support loading of winhv.sys in guest, which in turn is
required to support Windows VMBus.

Signed-off-by: Andrey Smetanin 
Reviewed-by: Roman Kagan 
Signed-off-by: Denis V. Lunev 
CC: Paolo Bonzini 
CC: Gleb Natapov 
Signed-off-by: Paolo Bonzini

sched/cputime: Guarantee stime + utime == rtime

2015-08-03T10:21:21+00:00

While the current code guarantees monotonicity for stime and utime
independently of one another, it does not guarantee that the sum of
both is equal to the total time we started out with.

This confuses things (and peoples) who look at this sum, like top, and
will report >100% usage followed by a matching period of 0%.

Rework the code to provide both individual monotonicity and a coherent
sum.

Suggested-by: Fredrik Markstrom 
Reported-by: Fredrik Markstrom 
Tested-by: Fredrik Markstrom 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Frederic Weisbecker 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Stanislaw Gruszka 
Cc: Thomas Gleixner 
Cc: jason.low2@hp.com
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar

sched, timer: Convert usages of ACCESS_ONCE() in the scheduler to READ_ONCE()/WRITE_ONCE()

2015-05-08T10:11:32+00:00

ACCESS_ONCE doesn't work reliably on non-scalar types. This patch removes
the rest of the existing usages of ACCESS_ONCE() in the scheduler, and use
the new READ_ONCE() and WRITE_ONCE() APIs as appropriate.

Signed-off-by: Jason Low 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Thomas Gleixner 
Acked-by: Rik van Riel 
Acked-by: Waiman Long 
Cc: Andrew Morton 
Cc: Aswin Chandramouleeswaran 
Cc: Borislav Petkov 
Cc: Davidlohr Bueso 
Cc: Frederic Weisbecker 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Mel Gorman 
Cc: Mike Galbraith 
Cc: Oleg Nesterov 
Cc: Paul E. McKenney 
Cc: Preeti U Murthy 
Cc: Scott J Norton 
Cc: Steven Rostedt 
Link: http://lkml.kernel.org/r/1430251224-5764-2-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar

sched, time: Fix build error with 64 bit cputime_t on 32 bit systems

2014-10-03T03:46:55+00:00

On 32 bit systems cmpxchg cannot handle 64 bit values, so
some additional magic is required to allow a 32 bit system
with CONFIG_VIRT_CPU_ACCOUNTING_GEN=y enabled to build.

Make sure the correct cmpxchg function is used when doing
an atomic swap of a cputime_t.

Reported-by: Arnd Bergmann 
Signed-off-by: Rik van Riel 
Acked-by: Arnd Bergmann 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: umgwanakikbuti@gmail.com
Cc: fweisbec@gmail.com
Cc: srao@redhat.com
Cc: lwoodman@redhat.com
Cc: atheurer@redhat.com
Cc: oleg@redhat.com
Cc: Andrew Morton 
Cc: Benjamin Herrenschmidt 
Cc: Heiko Carstens 
Cc: Linus Torvalds 
Cc: Martin Schwidefsky 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: linux390@de.ibm.com
Cc: linux-arch@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Link: http://lkml.kernel.org/r/20140930155947.070cdb1f@annuminas.surriel.com
Signed-off-by: Ingo Molnar

sched, time: Fix lock inversion in thread_group_cputime()

2014-09-19T10:35:17+00:00

The sig->stats_lock nests inside the tasklist_lock and the
sighand->siglock in __exit_signal and wait_task_zombie.

However, both of those locks can be taken from irq context,
which means we need to use the interrupt safe variant of
read_seqbegin_or_lock. This blocks interrupts when the "lock"
branch is taken (seq is odd), preventing the lock inversion.

On the first (lockless) pass through the loop, irqs are not
blocked.

Reported-by: Stanislaw Gruszka 
Signed-off-by: Rik van Riel 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: prarit@redhat.com
Cc: oleg@redhat.com
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/1410527535-9814-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar

sched, time: Atomically increment stime & utime

2014-09-08T06:17:02+00:00

The functions task_cputime_adjusted and thread_group_cputime_adjusted()
can be called locklessly, as well as concurrently on many different CPUs.

This can occasionally lead to the utime and stime reported by times(), and
other syscalls like it, going backward. The cause for this appears to be
multiple threads racing in cputime_adjust(), both with values for utime or
stime that is larger than the original, but each with a different value.

Sometimes the larger value gets saved first, only to be immediately
overwritten with a smaller value by another thread.

Using atomic exchange prevents that problem, and ensures time
progresses monotonically.

Signed-off-by: Rik van Riel 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: umgwanakikbuti@gmail.com
Cc: fweisbec@gmail.com
Cc: akpm@linux-foundation.org
Cc: srao@redhat.com
Cc: lwoodman@redhat.com
Cc: atheurer@redhat.com
Cc: oleg@redhat.com
Link: http://lkml.kernel.org/r/1408133138-22048-4-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar

time, signal: Protect resource use statistics with seqlock

2014-09-08T06:17:01+00:00

Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
issues on large systems, due to both functions being serialized with a
lock.

The lock protects against reporting a wrong value, due to a thread in the
task group exiting, its statistics reporting up to the signal struct, and
that exited task's statistics being counted twice (or not at all).

Protecting that with a lock results in times() and clock_gettime() being
completely serialized on large systems.

This can be fixed by using a seqlock around the events that gather and
propagate statistics. As an additional benefit, the protection code can
be moved into thread_group_cputime(), slightly simplifying the calling
functions.

In the case of posix_cpu_clock_get_task() things can be simplified a
lot, because the calling function already ensures that the task sticks
around, and the rest is now taken care of in thread_group_cputime().

This way the statistics reporting code can run lockless.

Signed-off-by: Rik van Riel 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Alex Thorlton 
Cc: Andrew Morton 
Cc: Daeseok Youn 
Cc: David Rientjes 
Cc: Dongsheng Yang 
Cc: Geert Uytterhoeven 
Cc: Guillaume Morin 
Cc: Ionut Alexa 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Li Zefan 
Cc: Michal Hocko 
Cc: Michal Schmidt 
Cc: Oleg Nesterov 
Cc: Vladimir Davydov 
Cc: umgwanakikbuti@gmail.com
Cc: fweisbec@gmail.com
Cc: srao@redhat.com
Cc: lwoodman@redhat.com
Cc: atheurer@redhat.com
Link: http://lkml.kernel.org/r/20140816134010.26a9b572@annuminas.surriel.com
Signed-off-by: Ingo Molnar