| Age | Commit message (Collapse) | Author |
|
commit 8fdc65391c6ad16461526a685f03262b3b01bfde upstream.
Stephane reported that commit:
3cbaa5906967 ("perf: Fix ctx time tracking by introducing EVENT_TIME")
introduced a regression wrt. time tracking, as easily observed by:
> This patch introduce a bug in the time tracking of events when
> multiplexing is used.
>
> The issue is easily reproducible with the following perf run:
>
> $ perf stat -a -C 0 -e branches,branches,branches,branches,branches,branches -I 1000
> 1.000730239 652,394 branches (66.41%)
> 1.000730239 597,809 branches (66.41%)
> 1.000730239 593,870 branches (66.63%)
> 1.000730239 651,440 branches (67.03%)
> 1.000730239 656,725 branches (66.96%)
> 1.000730239 <not counted> branches
>
> One branches event is shown as not having run. Yet, with
> multiplexing, all events should run especially with a 1s (-I 1000)
> interval. The delta for time_running comes out to 0. Yet, the event
> has run because the kernel is actually multiplexing the events. The
> problem is that the time tracking is the kernel and especially in
> ctx_sched_out() is wrong now.
>
> The problem is that in case that the kernel enters ctx_sched_out() with the
> following state:
> ctx->is_active=0x7 event_type=0x1
> Call Trace:
> [<ffffffff813ddd41>] dump_stack+0x63/0x82
> [<ffffffff81182bdc>] ctx_sched_out+0x2bc/0x2d0
> [<ffffffff81183896>] perf_mux_hrtimer_handler+0xf6/0x2c0
> [<ffffffff811837a0>] ? __perf_install_in_context+0x130/0x130
> [<ffffffff810f5818>] __hrtimer_run_queues+0xf8/0x2f0
> [<ffffffff810f6097>] hrtimer_interrupt+0xb7/0x1d0
> [<ffffffff810509a8>] local_apic_timer_interrupt+0x38/0x60
> [<ffffffff8175ca9d>] smp_apic_timer_interrupt+0x3d/0x50
> [<ffffffff8175ac7c>] apic_timer_interrupt+0x8c/0xa0
>
> In that case, the test:
> if (is_active & EVENT_TIME)
>
> will be false and the time will not be updated. Time must always be updated on
> sched out.
Fix this by always updating time if EVENT_TIME was set, as opposed to
only updating time when EVENT_TIME changed.
Reported-by: Stephane Eranian <eranian@google.com>
Tested-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: kan.liang@intel.com
Cc: namhyung@kernel.org
Fixes: 3cbaa5906967 ("perf: Fix ctx time tracking by introducing EVENT_TIME")
Link: http://lkml.kernel.org/r/20160329072644.GB3408@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 201c2f85bd0bc13b712d9c0b3d11251b182e06ae upstream.
In the error path, event_file not being NULL is used to determine
whether the event itself still needs to be free'd, so fix it up to
avoid leaking.
Reported-by: Leon Yu <chianglungyu@gmail.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 130056275ade ("perf: Do not double free")
Link: http://lkml.kernel.org/r/87twk06yxp.fsf@ashishki-desk.ger.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 920c720aa5aa3900a7f1689228fdfc2580a91e7e upstream.
Similar to commit b4b29f94856a ("locking/osq: Fix ordering of node
initialisation in osq_lock") the use of xchg_acquire() is
fundamentally broken with MCS like constructs.
Furthermore, it turns out we rely on the global transitivity of this
operation because the unlock path observes the pointer with a
READ_ONCE(), not an smp_load_acquire().
This is non-critical because the MCS code isn't actually used and
mostly serves as documentation, a stepping stone to the more complex
things we've build on top of the idea.
Reported-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Fixes: 3552a07a9c4a ("locking/mcs: Use acquire/release semantics")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
cgroup_subsys->post_attach callback
commit 5cf1cacb49aee39c3e02ae87068fc3c6430659b0 upstream.
Since e93ad19d0564 ("cpuset: make mm migration asynchronous"), cpuset
kicks off asynchronous NUMA node migration if necessary during task
migration and flushes it from cpuset_post_attach_flush() which is
called at the end of __cgroup_procs_write(). This is to avoid
performing migration with cgroup_threadgroup_rwsem write-locked which
can lead to deadlock through dependency on kworker creation.
memcg has a similar issue with charge moving, so let's convert it to
an official callback rather than the current one-off cpuset specific
function. This patch adds cgroup_subsys->post_attach callback and
makes cpuset register cpuset_post_attach_flush() as its ->post_attach.
The conversion is mostly one-to-one except that the new callback is
called under cgroup_mutex. This is to guarantee that no other
migration operations are started before ->post_attach callbacks are
finished. cgroup_mutex is one of the outermost mutex in the system
and has never been and shouldn't be a problem. We can add specialized
synchronization around __cgroup_procs_write() but I don't think
there's any noticeable benefit.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 346c09f80459a3ad97df1816d6d606169a51001a upstream.
The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list
with the following backtrace:
[ 601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
[ 601.347574] Tainted: G O 4.4.5-1-storage+ #6
[ 601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 601.348142] kworker/u129:5 D ffff880803077988 0 1636 2 0x00000000
[ 601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
[ 601.348999] ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
[ 601.349662] ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
[ 601.350333] ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
[ 601.350965] Call Trace:
[ 601.351203] [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[ 601.351444] [<ffffffff815b01d5>] schedule+0x35/0x80
[ 601.351709] [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
[ 601.351958] [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
[ 601.352208] [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
[ 601.352446] [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[ 601.352688] [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
[ 601.352951] [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[ 601.353196] [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
[ 601.353440] [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
[ 601.353689] [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
[ 601.353958] [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
[ 601.354200] [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
[ 601.354441] [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
[ 601.354688] [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
[ 601.354932] [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
[ 601.355193] [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
[ 601.355432] [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
[ 601.355679] [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
[ 601.355925] [<ffffffff81198379>] vfs_write+0xa9/0x1a0
[ 601.356164] [<ffffffff811c59d8>] kernel_write+0x38/0x50
The underlying device is a null_blk, with default parameters:
queue_mode = MQ
submit_queues = 1
Verification that nullb0 has something inflight:
root@pserver8:~# cat /sys/block/nullb0/inflight
0 1
root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
...
/sys/block/nullb0/mq/0/cpu2/rq_list
CTX pending:
ffff8838038e2400
...
During debug it became clear that stalled request is always inserted in
the rq_list from the following path:
save_stack_trace_tsk + 34
blk_mq_insert_requests + 231
blk_mq_flush_plug_list + 281
blk_flush_plug_list + 199
wait_on_page_bit + 192
__filemap_fdatawait_range + 228
filemap_fdatawait_range + 20
filemap_write_and_wait_range + 63
blkdev_fsync + 27
vfs_fsync_range + 73
blkdev_write_iter + 202
__vfs_write + 170
vfs_write + 169
kernel_write + 56
So blk_flush_plug_list() was called with from_schedule == true.
If from_schedule is true, that means that finally blk_mq_insert_requests()
offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue,
i.e. it calls kblockd_schedule_delayed_work_on().
That means, that we race with another CPU, which is about to execute
__blk_mq_run_hw_queue() work.
Further debugging shows the following traces from different CPUs:
CPU#0 CPU#1
---------------------------------- -------------------------------
reqeust A inserted
STORE hctx->ctx_map[0] bit marked
kblockd_schedule...() returns 1
<schedule to kblockd workqueue>
request B inserted
STORE hctx->ctx_map[1] bit marked
kblockd_schedule...() returns 0
*** WORK PENDING bit is cleared ***
flush_busy_ctxs() is executed, but
bit 1, set by CPU#1, is not observed
As a result request B pended forever.
This behaviour can be explained by speculative LOAD of hctx->ctx_map on
CPU#0, which is reordered with clear of PENDING bit and executed _before_
actual STORE of bit 1 on CPU#1.
The proper fix is an explicit full barrier <mfence>, which guarantees
that clear of PENDING bit is to be executed before all possible
speculative LOADS or STORES inside actual work function.
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Michael Wang <yun.wang@profitbricks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fe1bce9e2107ba3a8faffe572483b6974201a0e6 upstream.
Otherwise an incoming waker on the dest hash bucket can miss
the waiter adding itself to the plist during the lockless
check optimization (small window but still the correct way
of doing this); similarly to the decrement counterpart.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: bigeasy@linutronix.de
Cc: dvhart@infradead.org
Link: http://lkml.kernel.org/r/1461208164-29150-1-git-send-email-dave@stgolabs.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 89e9e66ba1b3bde9d8ea90566c2aee20697ad681 upstream.
If userspace calls UNLOCK_PI unconditionally without trying the TID -> 0
transition in user space first then the user space value might not have the
waiters bit set. This opens the following race:
CPU0 CPU1
uval = get_user(futex)
lock(hb)
lock(hb)
futex |= FUTEX_WAITERS
....
unlock(hb)
cmpxchg(futex, uval, newval)
So the cmpxchg fails and returns -EINVAL to user space, which is wrong because
the futex value is valid.
To handle this (yes, yet another) corner case gracefully, check for a flag
change and retry.
[ tglx: Massaged changelog and slightly reworked implementation ]
Fixes: ccf9e6a80d9e ("futex: Make unlock_pi more robust")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1460723739-5195-1-git-send-email-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2f5177f0fd7e531b26d54633be62d1d4cb94621c upstream.
The CPU controller hasn't kept up with the various changes in the whole
cgroup initialization / destruction sequence, and commit:
2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
caused it to explode.
The reason for this is that zombies do not inhibit css_offline() from
being called, but do stall css_released(). Now we tear down the cfs_rq
structures on css_offline() but zombies can run after that, leading to
use-after-free issues.
The solution is to move the tear-down to css_released(), which
guarantees nobody (including no zombies) is still using our cgroup.
Furthermore, a few simple cleanups are possible too. There doesn't
appear to be any point to us using css_online() (anymore?) so fold that
in css_alloc().
And since cgroup code guarantees an RCU grace period between
css_released() and css_free() we can forgo using call_rcu() and free the
stuff immediately.
Suggested-by: Tejun Heo <tj@kernel.org>
Reported-by: Kazuki Yamaguchi <k@rhe.jp>
Reported-by: Niklas Cassel <niklas.cassel@axis.com>
Tested-by: Niklas Cassel <niklas.cassel@axis.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
Link: http://lkml.kernel.org/r/20160316152245.GY6344@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit cdc4e47da8f4c32eeb6b2061a8a834f4362a12b7 ]
Lots of places in the kernel use memcpy(buf, comm, TASK_COMM_LEN); but
the result is typically passed to print("%s", buf) and extra bytes
after zero don't cause any harm.
In bpf the result of bpf_get_current_comm() is used as the part of
map key and was causing spurious hash map mismatches.
Use strlcpy() to guarantee zero-terminated string.
bpf verifier checks that output buffer is zero-initialized,
so even for short task names the output buffer don't have junk bytes.
Note it's not a security concern, since kprobe+bpf is root only.
Fixes: ffeedafbf023 ("bpf: introduce current->pid, tgid, uid, gid, comm accessors")
Reported-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e9532e69b8d1d1284e8ecf8d2586de34aec61244 upstream.
On CPU hotplug the steal time accounting can keep a stale rq->prev_steal_time
value over CPU down and up. So after the CPU comes up again the delta
calculation in steal_account_process_tick() wreckages itself due to the
unsigned math:
u64 steal = paravirt_steal_clock(smp_processor_id());
steal -= this_rq()->prev_steal_time;
So if steal is smaller than rq->prev_steal_time we end up with an insane large
value which then gets added to rq->prev_steal_time, resulting in a permanent
wreckage of the accounting. As a consequence the per CPU stats in /proc/stat
become stale.
Nice trick to tell the world how idle the system is (100%) while the CPU is
100% busy running tasks. Though we prefer realistic numbers.
None of the accounting values which use a previous value to account for
fractions is reset at CPU hotplug time. update_rq_clock_task() has a sanity
check for prev_irq_time and prev_steal_time_rq, but that sanity check solely
deals with clock warps and limits the /proc/stat visible wreckage. The
prev_time values are still wrong.
Solution is simple: Reset rq->prev_*_time when the CPU is plugged in again.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: commit 095c0aa83e52 "sched: adjust scheduler cpu power for stolen time"
Fixes: commit aa483808516c "sched: Remove irq time from available CPU power"
Fixes: commit e6e6685accfa "KVM guest: Steal time accounting"
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603041539490.3686@nanos
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 276142730c39c9839465a36a90e5674a8c34e839 upstream.
When suspending to RAM, waking up and later suspending to disk,
we gratuitously runtime resume devices after the thaw phase.
This does not occur if we always suspend to RAM or always to disk.
pm_complete_with_resume_check(), which gets called from
pci_pm_complete() among others, schedules a runtime resume
if PM_SUSPEND_FLAG_FW_RESUME is set. The flag is set during
a suspend-to-RAM cycle. It is cleared at the beginning of
the suspend-to-RAM cycle but not afterwards and it is not
cleared during a suspend-to-disk cycle at all. Fix it.
Fixes: ef25ba047601 (PM / sleep: Add flags to indicate platform firmware involvement)
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3debb0a9ddb16526de8b456491b7db60114f7b5e upstream.
The trace_printk() code will allocate extra buffers if the compile detects
that a trace_printk() is used. To do this, the format of the trace_printk()
is saved to the __trace_printk_fmt section, and if that section is bigger
than zero, the buffers are allocated (along with a message that this has
happened).
If trace_printk() uses a format that is not a constant, and thus something
not guaranteed to be around when the print happens, the compiler optimizes
the fmt out, as it is not used, and the __trace_printk_fmt section is not
filled. This means the kernel will not allocate the special buffers needed
for the trace_printk() and the trace_printk() will not write anything to the
tracing buffer.
Adding a "__used" to the variable in the __trace_printk_fmt section will
keep it around, even though it is set to NULL. This will keep the string
from being printed in the debugfs/tracing/printk_formats section as it is
not needed.
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Fixes: 07d777fe8c398 "tracing: Add percpu buffers for trace_printk()"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a29054d9478d0435ab01b7544da4f674ab13f533 upstream.
If tracing contains data and the trace_pipe file is read with sendfile(),
then it can trigger a NULL pointer dereference and various BUG_ON within the
VM code.
There's a patch to fix this in the splice_to_pipe() code, but it's also a
good idea to not let that happen from trace_pipe either.
Link: http://lkml.kernel.org/r/1457641146-9068-1-git-send-email-rabin@rab.in
Reported-by: Rabin Vincent <rabin.vincent@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit cb86e05390debcc084cfdb0a71ed4c5dbbec517d upstream.
Joel Fernandes reported that the function tracing of preempt disabled
sections was not being reported when running either the preemptirqsoff or
preemptoff tracers. This was due to the fact that the function tracer
callback for those tracers checked if irqs were disabled before tracing. But
this fails when we want to trace preempt off locations as well.
Joel explained that he wanted to see funcitons where interrupts are enabled
but preemption was disabled. The expected output he wanted:
<...>-2265 1d.h1 3419us : preempt_count_sub <-irq_exit
<...>-2265 1d..1 3419us : __do_softirq <-irq_exit
<...>-2265 1d..1 3419us : msecs_to_jiffies <-__do_softirq
<...>-2265 1d..1 3420us : irqtime_account_irq <-__do_softirq
<...>-2265 1d..1 3420us : __local_bh_disable_ip <-__do_softirq
<...>-2265 1..s1 3421us : run_timer_softirq <-__do_softirq
<...>-2265 1..s1 3421us : hrtimer_run_pending <-run_timer_softirq
<...>-2265 1..s1 3421us : _raw_spin_lock_irq <-run_timer_softirq
<...>-2265 1d.s1 3422us : preempt_count_add <-_raw_spin_lock_irq
<...>-2265 1d.s2 3422us : _raw_spin_unlock_irq <-run_timer_softirq
<...>-2265 1..s2 3422us : preempt_count_sub <-_raw_spin_unlock_irq
<...>-2265 1..s1 3423us : rcu_bh_qs <-__do_softirq
<...>-2265 1d.s1 3423us : irqtime_account_irq <-__do_softirq
<...>-2265 1d.s1 3423us : __local_bh_enable <-__do_softirq
There's a comment saying that the irq disabled check is because there's a
possible race that tracing_cpu may be set when the function is executed. But
I don't remember that race. For now, I added a check for preemption being
enabled too to not record the function, as there would be no race if that
was the case. I need to re-investigate this, as I'm now thinking that the
tracing_cpu will always be correct. But no harm in keeping the check for
now, except for the slight performance hit.
Link: http://lkml.kernel.org/r/1457770386-88717-1-git-send-email-agnel.joel@gmail.com
Fixes: 5e6d2b9cfa3a "tracing: Use one prologue for the preempt irqs off tracer function tracers"
Reported-by: Joel Fernandes <agnel.joel@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 378c6520e7d29280f400ef2ceaf155c86f05a71a upstream.
This commit fixes the following security hole affecting systems where
all of the following conditions are fulfilled:
- The fs.suid_dumpable sysctl is set to 2.
- The kernel.core_pattern sysctl's value starts with "/". (Systems
where kernel.core_pattern starts with "|/" are not affected.)
- Unprivileged user namespace creation is permitted. (This is
true on Linux >=3.8, but some distributions disallow it by
default using a distro patch.)
Under these conditions, if a program executes under secure exec rules,
causing it to run with the SUID_DUMP_ROOT flag, then unshares its user
namespace, changes its root directory and crashes, the coredump will be
written using fsuid=0 and a path derived from kernel.core_pattern - but
this path is interpreted relative to the root directory of the process,
allowing the attacker to control where a coredump will be written with
root privileges.
To fix the security issue, always interpret core_pattern for dumps that
are written under SUID_DUMP_ROOT relative to the root directory of init.
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2b021cbf3cb6208f0d40fd2f1869f237934340ed upstream.
Before 2e91fa7f6d45 ("cgroup: keep zombies associated with their
original cgroups"), all dead tasks were associated with init_css_set.
If a zombie task is requested for migration, while migration prep
operations would still be performed on init_css_set, the actual
migration would ignore zombie tasks. As init_css_set is always valid,
this worked fine.
However, after 2e91fa7f6d45, zombie tasks stay with the css_set it was
associated with at the time of death. Let's say a task T associated
with cgroup A on hierarchy H-1 and cgroup B on hiearchy H-2. After T
becomes a zombie, it would still remain associated with A and B. If A
only contains zombie tasks, it can be removed. On removal, A gets
marked offline but stays pinned until all zombies are drained. At
this point, if migration is initiated on T to a cgroup C on hierarchy
H-2, migration path would try to prepare T's css_set for migration and
trigger the following.
WARNING: CPU: 0 PID: 1576 at kernel/cgroup.c:474 cgroup_get+0x121/0x160()
CPU: 0 PID: 1576 Comm: bash Not tainted 4.4.0-work+ #289
...
Call Trace:
[<ffffffff8127e63c>] dump_stack+0x4e/0x82
[<ffffffff810445e8>] warn_slowpath_common+0x78/0xb0
[<ffffffff810446d5>] warn_slowpath_null+0x15/0x20
[<ffffffff810c33e1>] cgroup_get+0x121/0x160
[<ffffffff810c349b>] link_css_set+0x7b/0x90
[<ffffffff810c4fbc>] find_css_set+0x3bc/0x5e0
[<ffffffff810c5269>] cgroup_migrate_prepare_dst+0x89/0x1f0
[<ffffffff810c7547>] cgroup_attach_task+0x157/0x230
[<ffffffff810c7a17>] __cgroup_procs_write+0x2b7/0x470
[<ffffffff810c7bdc>] cgroup_tasks_write+0xc/0x10
[<ffffffff810c4790>] cgroup_file_write+0x30/0x1b0
[<ffffffff811c68fc>] kernfs_fop_write+0x13c/0x180
[<ffffffff81151673>] __vfs_write+0x23/0xe0
[<ffffffff81152494>] vfs_write+0xa4/0x1a0
[<ffffffff811532d4>] SyS_write+0x44/0xa0
[<ffffffff814af2d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
It doesn't make sense to prepare migration for css_sets pointing to
dead cgroups as they are guaranteed to contain only zombies which are
ignored later during migration. This patch makes cgroup destruction
path mark all affected css_sets as dead and updates the migration path
to ignore them during preparation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a1ee1932aa6bea0bb074f5e3ced112664e4637ed upstream.
While working on a script to restore all sysctl params before a series of
tests I found that writing any value into the
/proc/sys/kernel/{nmi_watchdog,soft_watchdog,watchdog,watchdog_thresh}
causes them to call proc_watchdog_update().
NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
There doesn't appear to be a reason for doing this work every time a write
occurs, so only do it when the values change.
Signed-off-by: Josh Hunt <johunt@akamai.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7400d3bbaa229eb8e7631d28fb34afd7cd2c96ff upstream.
decay_load_missed() cannot handle nagative values, so we need to prevent
using the function with a negative value.
Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: perterz@infradead.org
Fixes: 59543275488d ("sched/fair: Prepare __update_cpu_load() to handle active tickless")
Link: http://lkml.kernel.org/r/20160115070749.GA1914@X58A-UD3R
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f9c904b7613b8b4c85b10cd6b33ad41b2843fa9d upstream.
The callers of steal_account_process_tick() expect it to return
whether a jiffy should be considered stolen or not.
Currently the return value of steal_account_process_tick() is in
units of cputime, which vary between either jiffies or nsecs
depending on CONFIG_VIRT_CPU_ACCOUNTING_GEN.
If cputime has nsecs granularity and there is a tiny amount of
stolen time (a few nsecs, say) then we will consider the entire
tick stolen and will not account the tick on user/system/idle,
causing /proc/stats to show invalid data.
The fix is to change steal_account_process_tick() to accumulate
the stolen time and only account it once it's worth a jiffy.
(Thanks to Frederic Weisbecker for suggestions to fix a bug in my
first version of the patch.)
Signed-off-by: Chris Friesen <chris.friesen@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/56DBBDB8.40305@mail.usask.ca
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 927a5570855836e5d5859a80ce7e91e963545e8f upstream.
The error path in perf_event_open() is such that asking for a sampling
event on a PMU that doesn't generate interrupts will end up in dropping
the perf_sched_count even though it hasn't been incremented for this
event yet.
Given a sufficient amount of these calls, we'll end up disabling
scheduler's jump label even though we'd still have active events in the
system, thereby facilitating the arrival of the infernal regions upon us.
I'm fixing this by moving account_event() inside perf_event_alloc().
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/1456917854-29427-1-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
In memremap's helper function try_ram_remap(), we dereference a struct
page pointer that was derived from a PFN that is known to be covered by
a 'System RAM' iomem region, and is thus assumed to be a 'valid' PFN,
i.e., a PFN that has a struct page associated with it and is covered by
the kernel direct mapping.
However, the assumption that there is a 1:1 relation between the System
RAM iomem region and the kernel direct mapping is not universally valid
on all architectures, and on ARM and arm64, 'System RAM' may include
regions for which pfn_valid() returns false.
Generally speaking, both __va() and pfn_to_page() should only ever be
called on PFNs/physical addresses for which pfn_valid() returns true, so
add that check to try_ram_remap().
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Functions which the compiler has instrumented for KASAN place poison on
the stack shadow upon entry and remove this poision prior to returning.
In the case of CPU hotplug, CPUs exit the kernel a number of levels deep
in C code. Any instrumented functions on this critical path will leave
portions of the stack shadow poisoned.
When a CPU is subsequently brought back into the kernel via a different
path, depending on stackframe, layout calls to instrumented functions
may hit this stale poison, resulting in (spurious) KASAN splats to the
console.
To avoid this, clear any stale poison from the idle thread for a CPU
prior to bringing a CPU online.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The check for whether we overlap "System RAM" needs to be done at
section granularity. For example a system with the following mapping:
100000000-37bffffff : System RAM
37c000000-837ffffff : Persistent Memory
...is unable to use devm_memremap_pages() as it would result in two
zones colliding within a given section.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Given we have uninitialized list_heads being passed to list_add() it
will always be the case that those uninitialized values randomly trigger
the poison value. Especially since a list_add() operation will seed the
stack with the poison value for later stack allocations to trip over.
For example, see these two false positive reports:
list_add attempted on force-poisoned entry
WARNING: at lib/list_debug.c:34
[..]
NIP [c00000000043c390] __list_add+0xb0/0x150
LR [c00000000043c38c] __list_add+0xac/0x150
Call Trace:
__list_add+0xac/0x150 (unreliable)
__down+0x4c/0xf8
down+0x68/0x70
xfs_buf_lock+0x4c/0x150 [xfs]
list_add attempted on force-poisoned entry(0000000000000500),
new->next == d0000000059ecdb0, new->prev == 0000000000000500
WARNING: at lib/list_debug.c:33
[..]
NIP [c00000000042db78] __list_add+0xa8/0x140
LR [c00000000042db74] __list_add+0xa4/0x140
Call Trace:
__list_add+0xa4/0x140 (unreliable)
rwsem_down_read_failed+0x6c/0x1a0
down_read+0x58/0x60
xfs_log_commit_cil+0x7c/0x600 [xfs]
Fixes: commit 5c2c2587b132 ("mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Eryu Guan <eguan@redhat.com>
Tested-by: Eryu Guan <eguan@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"A feature was added in 4.3 that allowed users to filter trace points
on a tasks "comm" field. But this prevented filtering on a comm field
that is within a trace event (like sched_migrate_task).
When trying to filter on when a program migrated, this change
prevented the filtering of the sched_migrate_task.
To fix this, the event fields are examined first, and then the extra
fields like "comm" and "cpu" are examined. Also, instead of testing
to assign the comm filter function based on the field's name, the
generic comm field is given a new filter type (FILTER_COMM). When
this field is used to filter the type is checked. The same is done
for the cpu filter field.
Two new special filter types are added: "COMM" and "CPU". This allows
users to still filter the tasks comm for events that have "comm" as
one of their fields, in cases that users would like to filter
sched_migrate_task on the comm of the task that called the event, and
not the comm of the task that is being migrated"
* tag 'trace-fixes-v4.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Do not have 'comm' filter override event 'comm' field
|
|
Commit 9f61668073a8d "tracing: Allow triggers to filter for CPU ids and
process names" added a 'comm' filter that will filter events based on the
current tasks struct 'comm'. But this now hides the ability to filter events
that have a 'comm' field too. For example, sched_migrate_task trace event.
That has a 'comm' field of the task to be migrated.
echo 'comm == "bash"' > events/sched_migrate_task/filter
will now filter all sched_migrate_task events for tasks named "bash" that
migrates other tasks (in interrupt context), instead of seeing when "bash"
itself gets migrated.
This fix requires a couple of changes.
1) Change the look up order for filter predicates to look at the events
fields before looking at the generic filters.
2) Instead of basing the filter function off of the "comm" name, have the
generic "comm" filter have its own filter_type (FILTER_COMM). Test
against the type instead of the name to assign the filter function.
3) Add a new "COMM" filter that works just like "comm" but will filter based
on the current task, even if the trace event contains a "comm" field.
Do the same for "cpu" field, adding a FILTER_CPU and a filter "CPU".
Cc: stable@vger.kernel.org # v4.3+
Fixes: 9f61668073a8d "tracing: Allow triggers to filter for CPU ids and process names"
Reported-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
"A rather largish series of 12 patches addressing a maze of race
conditions in the perf core code from Peter Zijlstra"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Robustify task_function_call()
perf: Fix scaling vs. perf_install_in_context()
perf: Fix scaling vs. perf_event_enable()
perf: Fix scaling vs. perf_event_enable_on_exec()
perf: Fix ctx time tracking by introducing EVENT_TIME
perf: Cure event->pending_disable race
perf: Fix race between event install and jump_labels
perf: Fix cloning
perf: Only update context time when active
perf: Allow perf_release() with !event->ctx
perf: Do not double free
perf: Close install vs. exit race
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixlet from Thomas Gleixner:
"A trivial printk typo fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Fix trivial typo in printk() message
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Another small bug reported to me by Chunyu Hu.
When perf added a "reg" function to the function tracing event (not a
tracepoint), it caused that event to be displayed as a tracepoint and
could cause errors in tracepoint handling. That was solved by adding
a flag to ignore ftrace non-tracepoint events. But that flag was
missed when displaying events in available_events, which should only
contain tracepoint events.
This broke a documented way to enable all events with:
cat available_events > set_event
As the function non-tracepoint event would cause that to error out.
The commit here fixes that by having the available_events file not
list events that have the ignore flag set"
* tag 'trace-fixes-v4.5-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix showing function event in available_events
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
- Two fixes for compatibility with the ACPI 6.1 specification.
Without these fixes multi-interface DIMMs will fail to be probed, and
address range scrub commands to find memory errors will give results
that the kernel will mis-interpret. For multi-interface DIMMs Linux
will accept either the original 6.0 implementation or 6.1.
For address range scrub we'll only support 6.1 since ACPI formalized
this DSM differently than the original example [1] implemented in
v4.2. The expectation is that production systems will only ever ship
the ACPI 6.1 address range scrub command definition.
- The wider async address range scrub work targeting 4.6 discovered
that the original synchronous implementation in 4.5 is not sizing its
return buffer correctly.
- Arnd caught that my recent fix to the size of the pfn_t flags missed
updating the flags variable used in the pmem driver.
- Toshi found that we mishandle the memremap() return value in
devm_memremap().
* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
nvdimm: use 'u64' for pfn flags
devm_memremap: Fix error value when memremap failed
nfit: update address range scrub commands to the acpi 6.1 format
libnvdimm, tools/testing/nvdimm: fix 'ars_status' output buffer sizing
nfit: fix multi-interface dimm handling, acpi6.1 compatibility
|
|
Since there is no serialization between task_function_call() doing
task_curr() and the other CPU doing context switches, we could end
up not sending an IPI even if we had to.
And I'm not sure I still buy my own argument we're OK.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: oleg@redhat.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160224174948.340031200@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Completely reworks perf_install_in_context() (again!) in order to
ensure that there will be no ctx time hole between add_event_to_ctx()
and any potential ctx_sched_in().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: oleg@redhat.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160224174948.279399438@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Similar to the perf_enable_on_exec(), ensure that event timings are
consistent across perf_event_enable().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: oleg@redhat.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160224174948.218288698@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The recent commit 3e349507d12d ("perf: Fix perf_enable_on_exec() event
scheduling") caused this by moving task_ctx_sched_out() from before
__perf_event_mask_enable() to after it.
The overlooked consequence of that change is that task_ctx_sched_out()
would update the ctx time fields, and now __perf_event_mask_enable()
uses stale time.
In order to fix this, explicitly stop our context's time before
enabling the event(s).
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Fixes: 3e349507d12d ("perf: Fix perf_enable_on_exec() event scheduling")
Link: http://lkml.kernel.org/r/20160224174948.159242158@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Currently any ctx_sched_in() call will re-start the ctx time tracking,
this means that calls like:
ctx_sched_in(.event_type = EVENT_PINNED);
ctx_sched_in(.event_type = EVENT_FLEXIBLE);
will have a hole in their ctx time tracking. This is likely harmless
but can confuse things a little. By adding EVENT_TIME, we can have the
first ctx_sched_in() (is_active: 0 -> !0) start the time and any
further ctx_sched_in() will leave the timestamps alone.
Secondly, this allows for an early disable like:
ctx_sched_out(.event_type = EVENT_TIME);
which would update the ctx time (if the ctx is active) and any further
calls to ctx_sched_out() would not further modify the ctx time.
For ctx_sched_in() any 0 -> !0 transition will automatically include
EVENT_TIME.
For ctx_sched_out(), any transition that clears EVENT_ALL will
automatically clear EVENT_TIME.
These two rules ensure that under normal circumstances we need not
bother with EVENT_TIME and get natural ctx time behaviour.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: oleg@redhat.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160224174948.100446561@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Because event_sched_out() checks event->pending_disable _before_
actually disabling the event, it can happen that the event fires after
it checks but before it gets disabled.
This would leave event->pending_disable set and the queued irq_work
will try and process it.
However, if the event trigger was during schedule(), the event might
have been de-scheduled by the time the irq_work runs, and
perf_event_disable_local() will fail.
Fix this by checking event->pending_disable _after_ we call
event->pmu->del(). This depends on the latter being a compiler
barrier, such that the compiler does not lift the load and re-creates
the problem.
Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: oleg@redhat.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160224174948.040469884@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
perf_install_in_context() relies upon the context switch hooks to have
scheduled in events when the IPI misses its target -- after all, if
the task has moved from the CPU (or wasn't running at all), it will
have to context switch to run elsewhere.
This however doesn't appear to be happening.
It is possible for the IPI to not happen (task wasn't running) only to
later observe the task running with an inactive context.
The only possible explanation is that the context switch hooks are not
called. Therefore put in a sync_sched() after toggling the jump_label
to guarantee all CPUs will have them enabled before we install an
event.
A simple if (0->1) sync_sched() will not in fact work, because any
further increment can race and complete before the sync_sched().
Therefore we must jump through some hoops.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: oleg@redhat.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160224174947.980211985@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Alexander reported that when the 'original' context gets destroyed, no
new clones happen.
This can happen irrespective of the ctx switch optimization, any task
can die, even the parent, and we want to continue monitoring the task
hierarchy until we either close the event or no tasks are left in the
hierarchy.
perf_event_init_context() will attempt to pin the 'parent' context
during clone(). At that point current is the parent, and since current
cannot have exited while executing clone(), its context cannot have
passed through perf_event_exit_task_context(). Therefore
perf_pin_task_context() cannot observe ctx->task == TASK_TOMBSTONE.
However, since inherit_event() does:
if (parent_event->parent)
parent_event = parent_event->parent;
it looks at the 'original' event when it does: is_orphaned_event().
This can return true if the context that contains the this event has
passed through perf_event_exit_task_context(). And thus we'll fail to
clone the perf context.
Fix this by adding a new state: STATE_DEAD, which is set by
perf_release() to indicate that the filedesc (or kernel reference) is
dead and there are no observers for our data left.
Only for STATE_DEAD will is_orphaned_event() be true and inhibit
cloning.
STATE_EXIT is otherwise preserved such that is_event_hup() remains
functional and will report when the observed task hierarchy becomes
empty.
Reported-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: oleg@redhat.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Fixes: c6e5b73242d2 ("perf: Synchronously clean up child events")
Link: http://lkml.kernel.org/r/20160224174947.919845295@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: oleg@redhat.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160224174947.860690919@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
In the err_file: fput(event_file) case, the event will not yet have
been attached to a context. However perf_release() does assume it has
been. Cure this.
Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: oleg@redhat.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160224174947.793996260@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
In case of: err_file: fput(event_file), we'll end up calling
perf_release() which in turn will free the event.
Do not then free the event _again_.
Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: oleg@redhat.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160224174947.697350349@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Consider the following scenario:
CPU0 CPU1
ctx = find_get_ctx();
perf_event_exit_task_context()
mutex_lock(&ctx->mutex);
perf_install_in_context(ctx, ...);
/* NO-OP */
mutex_unlock(&ctx->mutex);
...
perf_release()
WARN_ON_ONCE(event->state != STATE_EXIT);
Since the event doesn't pass through perf_remove_from_context()
because perf_install_in_context() NO-OPs because the ctx is dead, and
perf_event_exit_task_context() will not observe the event because its
not attached yet, the event->state will not be set.
Solve this by revalidating ctx->task after we acquire ctx->mutex and
failing the event creation as a whole.
Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: oleg@redhat.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160224174947.626853419@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The ftrace:function event is only displayed for parsing the function tracer
data. It is not used to enable function tracing, and does not include an
"enable" file in its event directory.
Originally, this event was kept separate from other events because it did
not have a ->reg parameter. But perf added a "reg" parameter for its use
which caused issues, because it made the event available to functions where
it was not compatible for.
Commit 9b63776fa3ca9 "tracing: Do not enable function event with enable"
added a TRACE_EVENT_FL_IGNORE_ENABLE flag that prevented the function event
from being enabled by normal trace events. But this commit missed keeping
the function event from being displayed by the "available_events" directory,
which is used to show what events can be enabled by set_event.
One documented way to enable all events is to:
cat available_events > set_event
But because the function event is displayed in the available_events, this
now causes an INVALID error:
cat: write error: Invalid argument
Reported-by: Chunyu Hu <chuhu@redhat.com>
Fixes: 9b63776fa3ca9 "tracing: Do not enable function event with enable"
Cc: stable@vger.kernel.org # 3.4+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
devm_memremap() returns an ERR_PTR() value in case of error.
However, it returns NULL when memremap() failed. This causes
the caller, such as the pmem driver, to proceed and oops later.
Change devm_memremap() to return ERR_PTR(-ENXIO) when memremap()
failed.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@vger.kernel.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Two more small fixes.
One is by Yang Shi who added a READ_ONCE_NOCHECK() to the scan of the
stack made by the stack tracer. As the stack tracer scans the entire
kernel stack, KASAN triggers seeing it as a "stack out of bounds"
error. As the scan is looking at the contents of the stack from
parent functions. The NOCHECK() tells KASAN that this is done on
purpose, and is not some kind of stack overflow.
The second fix is to the ftrace selftests, to retrieve the PID of
executed commands from the shell with '$!' and not by parsing 'jobs'"
* tag 'trace-fixes-v4.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing, kasan: Silence Kasan warning in check_stack of stack_tracer
ftracetest: Fix instance test to use proper shell command for pids
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"A handful of CPU hotplug related fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Plug potential memory leak in CPU_UP_PREPARE
perf/core: Remove the bogus and dangerous CPU_DOWN_FAILED hotplug state
perf/core: Remove bogus UP_CANCELED hotplug state
perf/x86/amd/uncore: Plug reference leak
|
|
In __request_region, if a conflict with a BUSY and MUXED resource is
detected, then the caller goes to sleep and waits for the resource to be
released. A pointer on the conflicting resource is kept. At wake-up
this pointer is used as a parent to retry to request the region.
A first problem is that this pointer might well be invalid (if for
example the conflicting resource have already been freed). Another
problem is that the next call to __request_region() fails to detect a
remaining conflict. The previously conflicting resource is passed as a
parameter and __request_region() will look for a conflict among the
children of this resource and not at the resource itself. It is likely
to succeed anyway, even if there is still a conflict.
Instead, the parent of the conflicting resource should be passed to
__request_region().
As a fix, this patch doesn't update the parent resource pointer in the
case we have to wait for a muxed region right after.
Reported-and-tested-by: Vincent Pelletier <plr.vincent@gmail.com>
Signed-off-by: Simon Guinot <simon.guinot@sequanux.org>
Tested-by: Vincent Donnefort <vdonnefort@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Merge fixes from Andrew Morton:
"10 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: slab: free kmem_cache_node after destroy sysfs file
ipc/shm: handle removed segments gracefully in shm_mmap()
MAINTAINERS: update Kselftest Framework mailing list
devm_memremap_release(): fix memremap'd addr handling
mm/hugetlb.c: fix incorrect proc nr_hugepages value
mm, x86: fix pte_page() crash in gup_pte_range()
fsnotify: turn fsnotify reaper thread into a workqueue job
Revert "fsnotify: destroy marks with call_srcu instead of dedicated thread"
mm: fix regression in remap_file_pages() emulation
thp, dax: do not try to withdraw pgtable from non-anon VMA
|
|
When enabling stack trace via "echo 1 > /proc/sys/kernel/stack_tracer_enabled",
the below KASAN warning is triggered:
BUG: KASAN: stack-out-of-bounds in check_stack+0x344/0x848 at addr ffffffc0689ebab8
Read of size 8 by task ksoftirqd/4/29
page:ffffffbdc3a27ac0 count:0 mapcount:0 mapping: (null) index:0x0
flags: 0x0()
page dumped because: kasan: bad access detected
CPU: 4 PID: 29 Comm: ksoftirqd/4 Not tainted 4.5.0-rc1 #129
Hardware name: Freescale Layerscape 2085a RDB Board (DT)
Call trace:
[<ffffffc000091300>] dump_backtrace+0x0/0x3a0
[<ffffffc0000916c4>] show_stack+0x24/0x30
[<ffffffc0009bbd78>] dump_stack+0xd8/0x168
[<ffffffc000420bb0>] kasan_report_error+0x6a0/0x920
[<ffffffc000421688>] kasan_report+0x70/0xb8
[<ffffffc00041f7f0>] __asan_load8+0x60/0x78
[<ffffffc0002e05c4>] check_stack+0x344/0x848
[<ffffffc0002e0c8c>] stack_trace_call+0x1c4/0x370
[<ffffffc0002af558>] ftrace_ops_no_ops+0x2c0/0x590
[<ffffffc00009f25c>] ftrace_graph_call+0x0/0x14
[<ffffffc0000881bc>] fpsimd_thread_switch+0x24/0x1e8
[<ffffffc000089864>] __switch_to+0x34/0x218
[<ffffffc0011e089c>] __schedule+0x3ac/0x15b8
[<ffffffc0011e1f6c>] schedule+0x5c/0x178
[<ffffffc0001632a8>] smpboot_thread_fn+0x350/0x960
[<ffffffc00015b518>] kthread+0x1d8/0x2b0
[<ffffffc0000874d0>] ret_from_fork+0x10/0x40
Memory state around the buggy address:
ffffffc0689eb980: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4 f4 f4
ffffffc0689eba00: f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00
>ffffffc0689eba80: 00 00 f1 f1 f1 f1 00 f4 f4 f4 f3 f3 f3 f3 00 00
^
ffffffc0689ebb00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffffffc0689ebb80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
The stacker tracer traverses the whole kernel stack when saving the max stack
trace. It may touch the stack red zones to cause the warning. So, just disable
the instrumentation to silence the warning.
Link: http://lkml.kernel.org/r/1455309960-18930-1-git-send-email-yang.shi@linaro.org
Signed-off-by: Yang Shi <yang.shi@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching
Pull livepatching fixes from Jiri Kosina:
- regression (from 4.4) fix for ordering issue, introduced by an
earlier ftrace change, that broke live patching of modules.
The fix replaces the ftrace module notifier by direct call in order
to make the ordering guaranteed and well-defined. The patch, from
Jessica Yu, has been acked both by Steven and Rusty
- error message fix from Miroslav Benes
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
ftrace/module: remove ftrace module notifier
livepatch: change the error message in asm/livepatch.h header files
|