linux-stable.git/kernel/sched/cputime.c, branch v4.4.2

sched/cputime: Fix invalid gtime in proc

2015-12-04T09:18:49+00:00

/proc/stats shows invalid gtime when the thread is running in guest.
When vtime accounting is not enabled, we cannot get a valid delta.
The delta is calculated with now - tsk->vtime_snap, but tsk->vtime_snap
is only updated when vtime accounting is runtime enabled.

This patch makes task_gtime() just return gtime without computing the
buggy non-existing tickless delta when vtime accounting is not enabled.

Use context_tracking_is_enabled() to check if vtime is accounting on
some cpu, in which case only we need to check the tickless delta. This
way we fix the gtime value regression on machines not running nohz full.

The kernel config contains CONFIG_VIRT_CPU_ACCOUNTING_GEN=y and
CONFIG_NO_HZ_FULL_ALL=n and boot without nohz_full.

I ran and stop a busy loop in VM and see the gtime in host.
Dump the 43rd field which shows the gtime in every second:

	 # while :; do awk '{print $3" "$43}' /proc/3955/task/4014/stat; sleep 1; done
	S 4348
	R 7064566
	R 7064766
	R 7064967
	R 7065168
	S 4759
	S 4759

During running busy loop, it returns large value.

After applying this patch, we can see right gtime.

	 # while :; do awk '{print $3" "$43}' /proc/10913/task/10956/stat; sleep 1; done
	S 5338
	R 5365
	R 5465
	R 5566
	R 5666
	S 5726
	S 5726

Signed-off-by: Hiroshi Shimamoto 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Chris Metcalf 
Cc: Christoph Lameter 
Cc: Linus Torvalds 
Cc: Luiz Capitulino 
Cc: Mike Galbraith 
Cc: Paul E . McKenney 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1447948054-28668-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar

kvm/x86: Hyper-V HV_X64_MSR_VP_RUNTIME support

2015-10-01T13:06:33+00:00

HV_X64_MSR_VP_RUNTIME msr used by guest to get
"the time the virtual processor consumes running guest code,
and the time the associated logical processor spends running
hypervisor code on behalf of that guest."

Calculation of this time is performed by task_cputime_adjusted()
for vcpu task.

Necessary to support loading of winhv.sys in guest, which in turn is
required to support Windows VMBus.

Signed-off-by: Andrey Smetanin 
Reviewed-by: Roman Kagan 
Signed-off-by: Denis V. Lunev 
CC: Paolo Bonzini 
CC: Gleb Natapov 
Signed-off-by: Paolo Bonzini

sched/cputime: Guarantee stime + utime == rtime

2015-08-03T10:21:21+00:00

While the current code guarantees monotonicity for stime and utime
independently of one another, it does not guarantee that the sum of
both is equal to the total time we started out with.

This confuses things (and peoples) who look at this sum, like top, and
will report >100% usage followed by a matching period of 0%.

Rework the code to provide both individual monotonicity and a coherent
sum.

Suggested-by: Fredrik Markstrom 
Reported-by: Fredrik Markstrom 
Tested-by: Fredrik Markstrom 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Frederic Weisbecker 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Stanislaw Gruszka 
Cc: Thomas Gleixner 
Cc: jason.low2@hp.com
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar

sched, timer: Convert usages of ACCESS_ONCE() in the scheduler to READ_ONCE()/WRITE_ONCE()

2015-05-08T10:11:32+00:00

ACCESS_ONCE doesn't work reliably on non-scalar types. This patch removes
the rest of the existing usages of ACCESS_ONCE() in the scheduler, and use
the new READ_ONCE() and WRITE_ONCE() APIs as appropriate.

Signed-off-by: Jason Low 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Thomas Gleixner 
Acked-by: Rik van Riel 
Acked-by: Waiman Long 
Cc: Andrew Morton 
Cc: Aswin Chandramouleeswaran 
Cc: Borislav Petkov 
Cc: Davidlohr Bueso 
Cc: Frederic Weisbecker 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Mel Gorman 
Cc: Mike Galbraith 
Cc: Oleg Nesterov 
Cc: Paul E. McKenney 
Cc: Preeti U Murthy 
Cc: Scott J Norton 
Cc: Steven Rostedt 
Link: http://lkml.kernel.org/r/1430251224-5764-2-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar

sched, time: Fix build error with 64 bit cputime_t on 32 bit systems

2014-10-03T03:46:55+00:00

On 32 bit systems cmpxchg cannot handle 64 bit values, so
some additional magic is required to allow a 32 bit system
with CONFIG_VIRT_CPU_ACCOUNTING_GEN=y enabled to build.

Make sure the correct cmpxchg function is used when doing
an atomic swap of a cputime_t.

Reported-by: Arnd Bergmann 
Signed-off-by: Rik van Riel 
Acked-by: Arnd Bergmann 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: umgwanakikbuti@gmail.com
Cc: fweisbec@gmail.com
Cc: srao@redhat.com
Cc: lwoodman@redhat.com
Cc: atheurer@redhat.com
Cc: oleg@redhat.com
Cc: Andrew Morton 
Cc: Benjamin Herrenschmidt 
Cc: Heiko Carstens 
Cc: Linus Torvalds 
Cc: Martin Schwidefsky 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: linux390@de.ibm.com
Cc: linux-arch@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Link: http://lkml.kernel.org/r/20140930155947.070cdb1f@annuminas.surriel.com
Signed-off-by: Ingo Molnar

sched, time: Fix lock inversion in thread_group_cputime()

2014-09-19T10:35:17+00:00

The sig->stats_lock nests inside the tasklist_lock and the
sighand->siglock in __exit_signal and wait_task_zombie.

However, both of those locks can be taken from irq context,
which means we need to use the interrupt safe variant of
read_seqbegin_or_lock. This blocks interrupts when the "lock"
branch is taken (seq is odd), preventing the lock inversion.

On the first (lockless) pass through the loop, irqs are not
blocked.

Reported-by: Stanislaw Gruszka 
Signed-off-by: Rik van Riel 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: prarit@redhat.com
Cc: oleg@redhat.com
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/1410527535-9814-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar

sched, time: Atomically increment stime & utime

2014-09-08T06:17:02+00:00

The functions task_cputime_adjusted and thread_group_cputime_adjusted()
can be called locklessly, as well as concurrently on many different CPUs.

This can occasionally lead to the utime and stime reported by times(), and
other syscalls like it, going backward. The cause for this appears to be
multiple threads racing in cputime_adjust(), both with values for utime or
stime that is larger than the original, but each with a different value.

Sometimes the larger value gets saved first, only to be immediately
overwritten with a smaller value by another thread.

Using atomic exchange prevents that problem, and ensures time
progresses monotonically.

Signed-off-by: Rik van Riel 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: umgwanakikbuti@gmail.com
Cc: fweisbec@gmail.com
Cc: akpm@linux-foundation.org
Cc: srao@redhat.com
Cc: lwoodman@redhat.com
Cc: atheurer@redhat.com
Cc: oleg@redhat.com
Link: http://lkml.kernel.org/r/1408133138-22048-4-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar

time, signal: Protect resource use statistics with seqlock

2014-09-08T06:17:01+00:00

Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
issues on large systems, due to both functions being serialized with a
lock.

The lock protects against reporting a wrong value, due to a thread in the
task group exiting, its statistics reporting up to the signal struct, and
that exited task's statistics being counted twice (or not at all).

Protecting that with a lock results in times() and clock_gettime() being
completely serialized on large systems.

This can be fixed by using a seqlock around the events that gather and
propagate statistics. As an additional benefit, the protection code can
be moved into thread_group_cputime(), slightly simplifying the calling
functions.

In the case of posix_cpu_clock_get_task() things can be simplified a
lot, because the calling function already ensures that the task sticks
around, and the rest is now taken care of in thread_group_cputime().

This way the statistics reporting code can run lockless.

Signed-off-by: Rik van Riel 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Alex Thorlton 
Cc: Andrew Morton 
Cc: Daeseok Youn 
Cc: David Rientjes 
Cc: Dongsheng Yang 
Cc: Geert Uytterhoeven 
Cc: Guillaume Morin 
Cc: Ionut Alexa 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Li Zefan 
Cc: Michal Hocko 
Cc: Michal Schmidt 
Cc: Oleg Nesterov 
Cc: Vladimir Davydov 
Cc: umgwanakikbuti@gmail.com
Cc: fweisbec@gmail.com
Cc: srao@redhat.com
Cc: lwoodman@redhat.com
Cc: atheurer@redhat.com
Link: http://lkml.kernel.org/r/20140816134010.26a9b572@annuminas.surriel.com
Signed-off-by: Ingo Molnar

sched: Change thread_group_cputime() to use for_each_thread()

2014-08-20T07:47:18+00:00

Change thread_group_cputime() to use for_each_thread() instead of
buggy while_each_thread(). This also makes the pid_alive() check
unnecessary.

Signed-off-by: Oleg Nesterov 
Signed-off-by: Peter Zijlstra 
Cc: Mike Galbraith 
Cc: Hidetoshi Seto 
Cc: Frank Mayhar 
Cc: Frederic Weisbecker 
Cc: Andrew Morton 
Cc: Sanjay Rao 
Cc: Larry Woodman 
Cc: Rik van Riel 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/20140813192000.GA19327@redhat.com
Signed-off-by: Ingo Molnar

sched: Sanitize irq accounting madness

2014-05-07T09:51:30+00:00

Russell reported, that irqtime_account_idle_ticks() takes ages due to:

       for (i = 0; i < ticks; i++)
               irqtime_account_process_tick(current, 0, rq);

It's sad, that this code was written way _AFTER_ the NOHZ idle
functionality was available. I charge myself guitly for not paying
attention when that crap got merged with commit abb74cefa ("sched:
Export ns irqtimes through /proc/stat")

So instead of looping nr_ticks times just apply the whole thing at
once.

As a side note: The whole cputime_t vs. u64 business in that context
wants to be cleaned up as well. There is no point in having all these
back and forth conversions. Lets standardise on u64 nsec for all
kernel internal accounting and be done with it. Everything else does
not make sense at all for fine grained accounting. Frederic, can you
please take care of that?

Reported-by: Russell King 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Paul E. McKenney 
Signed-off-by: Peter Zijlstra 
Cc: Venkatesh Pallipadi 
Cc: Shaun Ruffell 
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1405022307000.6261@ionos.tec.linutronix.de
Signed-off-by: Ingo Molnar