linux-stable.git/kernel, branch v3.2.55

sched/rt: Avoid updating RT entry timeout twice within one tick period

2014-02-15T19:20:18+00:00

commit 57d2aa00dcec67afa52478730f2b524521af14fb upstream.

The issue below was found in 2.6.34-rt rather than mainline rt
kernel, but the issue still exists upstream as well.

So please let me describe how it was noticed on 2.6.34-rt:

On this version, each softirq has its own thread, it means there
is at least one RT FIFO task per cpu. The priority of these
tasks is set to 49 by default. If user launches an RT FIFO task
with priority lower than 49 of softirq RT tasks, it's possible
there are two RT FIFO tasks enqueued one cpu runqueue at one
moment. By current strategy of balancing RT tasks, when it comes
to RT tasks, we really need to put them off to a CPU that they
can run on as soon as possible. Even if it means a bit of cache
line flushing, we want RT tasks to be run with the least latency.

When the user RT FIFO task which just launched before is
running, the sched timer tick of the current cpu happens. In this
tick period, the timeout value of the user RT task will be
updated once. Subsequently, we try to wake up one softirq RT
task on its local cpu. As the priority of current user RT task
is lower than the softirq RT task, the current task will be
preempted by the higher priority softirq RT task. Before
preemption, we check to see if current can readily move to a
different cpu. If so, we will reschedule to allow the RT push logic
to try to move current somewhere else. Whenever the woken
softirq RT task runs, it first tries to migrate the user FIFO RT
task over to a cpu that is running a task of lesser priority. If
migration is done, it will send a reschedule request to the found
cpu by IPI interrupt. Once the target cpu responds the IPI
interrupt, it will pick the migrated user RT task to preempt its
current task. When the user RT task is running on the new cpu,
the sched timer tick of the cpu fires. So it will tick the user
RT task again. This also means the RT task timeout value will be
updated again. As the migration may be done in one tick period,
it means the user RT task timeout value will be updated twice
within one tick.

If we set a limit on the amount of cpu time for the user RT task
by setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted
upon reaching the soft limit.

But exactly when the SIGXCPU signal should be sent depends on the
RT task timeout value. In fact the timeout mechanism of sending
the SIGXCPU signal assumes the RT task timeout is increased once
every tick.

However, currently the timeout value may be added twice per
tick. So it results in the SIGXCPU signal being sent earlier
than expected.

To solve this issue, we prevent the timeout value from increasing
twice within one tick time by remembering the jiffies value of
last updating the timeout. As long as the RT task's jiffies is
different with the global jiffies value, we allow its timeout to
be updated.

Signed-off-by: Ying Xue 
Signed-off-by: Fan Du 
Reviewed-by: Yong Zhang 
Acked-by: Steven Rostedt 
Cc: 
Link: http://lkml.kernel.org/r/1342508623-2887-1-git-send-email-ying.xue@windriver.com
Signed-off-by: Ingo Molnar 
[ lizf: backported to 3.4: adjust context ]
Signed-off-by: Li Zefan 
[bwh: Backported to 3.2: adjust filename]
Signed-off-by: Ben Hutchings

sched: Unthrottle rt runqueues in __disable_runtime()

2014-02-15T19:20:18+00:00

commit a4c96ae319b8047f62dedbe1eac79e321c185749 upstream.

migrate_tasks() uses _pick_next_task_rt() to get tasks from the
real-time runqueues to be migrated. When rt_rq is throttled
_pick_next_task_rt() won't return anything, in which case
migrate_tasks() can't move all threads over and gets stuck in an
infinite loop.

Instead unthrottle rt runqueues before migrating tasks.

Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair()

Signed-off-by: Peter Boonstoppel 
Signed-off-by: Peter Zijlstra 
Cc: Paul Turner 
Link: http://lkml.kernel.org/r/5FBF8E85CA34454794F0F7ECBA79798F379D3648B7@HQMAIL04.nvidia.com
Signed-off-by: Ingo Molnar 
[ lizf: backported to 3.4: adjust context ]
Signed-off-by: Li Zefan 
[bwh: Backported to 3.2:
 - Adjust filenames
 - unthrottle_offline_cfs_rqs() is already static, but defined in sched.c
   after including sched_fair.c, so add forward declaration
 - unthrottle_offline_cfs_rqs() also needs to be defined for all CONFIG_SMP
   configurations now]
Signed-off-by: Ben Hutchings

sched,rt: fix isolated CPUs leaving root_task_group indefinitely throttled

2014-02-15T19:20:17+00:00

commit e221d028bb08b47e624c5f0a31732c642db9d19a upstream.

Root task group bandwidth replenishment must service all CPUs, regardless of
where the timer was last started, and regardless of the isolation mechanism,
lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy.

Signed-off-by: Mike Galbraith 
Signed-off-by: Peter Zijlstra 
Link: http://lkml.kernel.org/r/1344326558.6968.25.camel@marge.simpson.net
Signed-off-by: Thomas Gleixner 
[bwh: Backported to 3.2: adjust filename]
Signed-off-by: Ben Hutchings

sched/rt: Fix SCHED_RR across cgroups

2014-02-15T19:20:17+00:00

commit 454c79999f7eaedcdf4c15c449e43902980cbdf5 upstream.

task_tick_rt() has an optimization to only reschedule SCHED_RR tasks
if they were the only element on their rq.  However, with cgroups
a SCHED_RR task could be the only element on its per-cgroup rq but
still be competing with other SCHED_RR tasks in its parent's
cgroup.  In this case, the SCHED_RR task in the child cgroup would
never yield at the end of its timeslice.  If the child cgroup
rt_runtime_us was the same as the parent cgroup rt_runtime_us,
the task in the parent cgroup would starve completely.

Modify task_tick_rt() to check that the task is the only task on its
rq, and that the each of the scheduling entities of its ancestors
is also the only entity on its rq.

Signed-off-by: Colin Cross 
Signed-off-by: Peter Zijlstra 
Link: http://lkml.kernel.org/r/1337229266-15798-1-git-send-email-ccross@android.com
Signed-off-by: Ingo Molnar 
[bwh: Backported to 3.2: adjust filename]
Signed-off-by: Ben Hutchings

sched/rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities

2014-02-15T19:20:14+00:00

commit 757dfcaa41844595964f1220f1d33182dae49976 upstream.

This patch touches the RT group scheduling case.

Functions inc_rt_prio_smp() and dec_rt_prio_smp() change (global) rq's
priority, while rt_rq passed to them may be not the top-level rt_rq.
This is wrong, because changing of priority on a child level does not
guarantee that the priority is the highest all over the rq. So, this
leak makes RT balancing unusable.

The short example: the task having the highest priority among all rq's
RT tasks (no one other task has the same priority) are waking on a
throttle rt_rq.  The rq's cpupri is set to the task's priority
equivalent, but real rq->rt.highest_prio.curr is less.

The patch below fixes the problem.

Signed-off-by: Kirill Tkhai 
Signed-off-by: Peter Zijlstra 
CC: Steven Rostedt 
Link: http://lkml.kernel.org/r/49231385567953@web4m.yandex.ru
Signed-off-by: Ingo Molnar 
[bwh: Backported to 3.2: adjust filename]
Signed-off-by: Ben Hutchings

ftrace: Initialize the ftrace profiler for each possible cpu

2014-02-15T19:20:13+00:00

commit c4602c1c818bd6626178d6d3fcc152d9f2f48ac0 upstream.

Ftrace currently initializes only the online CPUs. This implementation has
two problems:
- If we online a CPU after we enable the function profile, and then run the
  test, we will lose the trace information on that CPU.
  Steps to reproduce:
  # echo 0 > /sys/devices/system/cpu/cpu1/online
  # cd /tracing/
  # echo  >> set_ftrace_filter
  # echo 1 > function_profile_enabled
  # echo 1 > /sys/devices/system/cpu/cpu1/online
  # run test
- If we offline a CPU before we enable the function profile, we will not clear
  the trace information when we enable the function profile. It will trouble
  the users.
  Steps to reproduce:
  # cd /tracing/
  # echo  >> set_ftrace_filter
  # echo 1 > function_profile_enabled
  # run test
  # cat trace_stat/function*
  # echo 0 > /sys/devices/system/cpu/cpu1/online
  # echo 0 > function_profile_enabled
  # echo 1 > function_profile_enabled
  # cat trace_stat/function*
  # run test
  # cat trace_stat/function*

So it is better that we initialize the ftrace profiler for each possible cpu
every time we enable the function profile instead of just the online ones.

Link: http://lkml.kernel.org/r/1387178401-10619-1-git-send-email-miaox@cn.fujitsu.com

Signed-off-by: Miao Xie 
Signed-off-by: Steven Rostedt 
Signed-off-by: Ben Hutchings

ftrace: Fix function graph with loading of modules

2014-01-03T04:33:36+00:00

commit 8a56d7761d2d041ae5e8215d20b4167d8aa93f51 upstream.

Commit 8c4f3c3fa9681 "ftrace: Check module functions being traced on reload"
fixed module loading and unloading with respect to function tracing, but
it missed the function graph tracer. If you perform the following

 # cd /sys/kernel/debug/tracing
 # echo function_graph > current_tracer
 # modprobe nfsd
 # echo nop > current_tracer

You'll get the following oops message:

 ------------[ cut here ]------------
 WARNING: CPU: 2 PID: 2910 at /linux.git/kernel/trace/ftrace.c:1640 __ftrace_hash_rec_update.part.35+0x168/0x1b9()
 Modules linked in: nfsd exportfs nfs_acl lockd ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables uinput snd_hda_codec_idt
 CPU: 2 PID: 2910 Comm: bash Not tainted 3.13.0-rc1-test #7
 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
  0000000000000668 ffff8800787efcf8 ffffffff814fe193 ffff88007d500000
  0000000000000000 ffff8800787efd38 ffffffff8103b80a 0000000000000668
  ffffffff810b2b9a ffffffff81a48370 0000000000000001 ffff880037aea000
 Call Trace:
  [] dump_stack+0x4f/0x7c
  [] warn_slowpath_common+0x81/0x9b
  [] ? __ftrace_hash_rec_update.part.35+0x168/0x1b9
  [] warn_slowpath_null+0x1a/0x1c
  [] __ftrace_hash_rec_update.part.35+0x168/0x1b9
  [] ? __mutex_lock_slowpath+0x364/0x364
  [] ftrace_shutdown+0xd7/0x12b
  [] unregister_ftrace_graph+0x49/0x78
  [] graph_trace_reset+0xe/0x10
  [] tracing_set_tracer+0xa7/0x26a
  [] tracing_set_trace_write+0x8b/0xbd
  [] ? ftrace_return_to_handler+0xb2/0xde
  [] ? __sb_end_write+0x5e/0x5e
  [] vfs_write+0xab/0xf6
  [] ftrace_graph_caller+0x85/0x85
  [] SyS_write+0x59/0x82
  [] ftrace_graph_caller+0x85/0x85
  [] system_call_fastpath+0x16/0x1b
 ---[ end trace 940358030751eafb ]---

The above mentioned commit didn't go far enough. Well, it covered the
function tracer by adding checks in __register_ftrace_function(). The
problem is that the function graph tracer circumvents that (for a slight
efficiency gain when function graph trace is running with a function
tracer. The gain was not worth this).

The problem came with ftrace_startup() which should always be called after
__register_ftrace_function(), if you want this bug to be completely fixed.

Anyway, this solution moves __register_ftrace_function() inside of
ftrace_startup() and removes the need to call them both.

Reported-by: Dave Wysochanski 
Fixes: ed926f9b35cd ("ftrace: Use counters to enable functions to trace")
Signed-off-by: Steven Rostedt 
Signed-off-by: Ben Hutchings

ftrace: Check module functions being traced on reload

2014-01-03T04:33:36+00:00

commit 8c4f3c3fa9681dc549cd35419b259496082fef8b upstream.

There's been a nasty bug that would show up and not give much info.
The bug displayed the following warning:

 WARNING: at kernel/trace/ftrace.c:1529 __ftrace_hash_rec_update+0x1e3/0x230()
 Pid: 20903, comm: bash Tainted: G           O 3.6.11+ #38405.trunk
 Call Trace:
  [] warn_slowpath_common+0x7f/0xc0
  [] warn_slowpath_null+0x1a/0x20
  [] __ftrace_hash_rec_update+0x1e3/0x230
  [] ftrace_hash_move+0x28/0x1d0
  [] ? kfree+0x2c/0x110
  [] ftrace_regex_release+0x8e/0x150
  [] __fput+0xae/0x220
  [] ____fput+0xe/0x10
  [] task_work_run+0x72/0x90
  [] do_notify_resume+0x6c/0xc0
  [] ? trace_hardirqs_on_thunk+0x3a/0x3c
  [] int_signal+0x12/0x17
 ---[ end trace 793179526ee09b2c ]---

It was finally narrowed down to unloading a module that was being traced.

It was actually more than that. When functions are being traced, there's
a table of all functions that have a ref count of the number of active
tracers attached to that function. When a function trace callback is
registered to a function, the function's record ref count is incremented.
When it is unregistered, the function's record ref count is decremented.
If an inconsistency is detected (ref count goes below zero) the above
warning is shown and the function tracing is permanently disabled until
reboot.

The ftrace callback ops holds a hash of functions that it filters on
(and/or filters off). If the hash is empty, the default means to filter
all functions (for the filter_hash) or to disable no functions (for the
notrace_hash).

When a module is unloaded, it frees the function records that represent
the module functions. These records exist on their own pages, that is
function records for one module will not exist on the same page as
function records for other modules or even the core kernel.

Now when a module unloads, the records that represents its functions are
freed. When the module is loaded again, the records are recreated with
a default ref count of zero (unless there's a callback that traces all
functions, then they will also be traced, and the ref count will be
incremented).

The problem is that if an ftrace callback hash includes functions of the
module being unloaded, those hash entries will not be removed. If the
module is reloaded in the same location, the hash entries still point
to the functions of the module but the module's ref counts do not reflect
that.

With the help of Steve and Joern, we found a reproducer:

 Using uinput module and uinput_release function.

 cd /sys/kernel/debug/tracing
 modprobe uinput
 echo uinput_release > set_ftrace_filter
 echo function > current_tracer
 rmmod uinput
 modprobe uinput
 # check /proc/modules to see if loaded in same addr, otherwise try again
 echo nop > current_tracer

 [BOOM]

The above loads the uinput module, which creates a table of functions that
can be traced within the module.

We add uinput_release to the filter_hash to trace just that function.

Enable function tracincg, which increments the ref count of the record
associated to uinput_release.

Remove uinput, which frees the records including the one that represents
uinput_release.

Load the uinput module again (and make sure it's at the same address).
This recreates the function records all with a ref count of zero,
including uinput_release.

Disable function tracing, which will decrement the ref count for uinput_release
which is now zero because of the module removal and reload, and we have
a mismatch (below zero ref count).

The solution is to check all currently tracing ftrace callbacks to see if any
are tracing any of the module's functions when a module is loaded (it already does
that with callbacks that trace all functions). If a callback happens to have
a module function being traced, it increments that records ref count and starts
tracing that function.

There may be a strange side effect with this, where tracing module functions
on unload and then reloading a new module may have that new module's functions
being traced. This may be something that confuses the user, but it's not
a big deal. Another approach is to disable all callback hashes on module unload,
but this leaves some ftrace callbacks that may not be registered, but can
still have hashes tracing the module's function where ftrace doesn't know about
it. That situation can cause the same bug. This solution solves that case too.
Another benefit of this solution, is it is possible to trace a module's
function on unload and load.

Link: http://lkml.kernel.org/r/20130705142629.GA325@redhat.com

Reported-by: Jörn Engel 
Reported-by: Dave Jones 
Reported-by: Steve Hodgson 
Tested-by: Steve Hodgson 
Signed-off-by: Steven Rostedt 
[bwh: Backported to 3.2: adjust context, indentation]
Signed-off-by: Ben Hutchings

ftrace: Create ftrace_hash_empty() helper routine

2014-01-03T04:33:36+00:00

commit 06a51d9307380c78bb5c92e68fc80ad2c7d7f890 upstream.

There are two types of hashes in the ftrace_ops; one type
is the filter_hash and the other is the notrace_hash. Either
one may be null, meaning it has no elements. But when elements
are added, the hash is allocated.

Throughout the code, a check needs to be made to see if a hash
exists or the hash has elements, but the check if the hash exists
is usually missing causing the possible "NULL pointer dereference bug".

Add a helper routine called "ftrace_hash_empty()" that returns
true if the hash doesn't exist or its count is zero. As they mean
the same thing.

Last-bug-reported-by: Jiri Olsa 
Signed-off-by: Steven Rostedt 
Signed-off-by: Ben Hutchings

ftrace: Fix ftrace hash record update with notrace

2014-01-03T04:33:35+00:00

commit c842e975520f8ab09e293cc92f51a1f396251fd5 upstream.

When disabling the "notrace" records, that means we want to trace them.
If the notrace_hash is zero, it means that we want to trace all
records. But to disable a zero notrace_hash means nothing.

The check for the notrace_hash count was incorrect with:

	if (hash && !hash->count)
		return

With the correct comment above it that states that we do nothing
if the notrace_hash has zero count. But !hash also means that
the notrace hash has zero count. I think this was done to
protect against dereferencing NULL. But if !hash is true, then
we go through the following loop without doing a single thing.

Fix it to:

	if (!hash || !hash->count)
		return;

Signed-off-by: Steven Rostedt 
Signed-off-by: Ben Hutchings