linux.git/kernel/trace, branch for-next

tracing: Print lazy preemption model

2025-01-14T14:44:33+00:00

Print lazy preemption model in ftrace header when latency-format=1.

 # cat /sys/kernel/debug/sched/preempt
 none voluntary full (lazy)

Without patch:
  latency: 0 us, #232946/232946, CPU#40 | (M:unknown VP:0, KP:0, SP:0 HP:0 #P:80)
                                             ^^^^^^^

With Patch:
  latency: 0 us, #1897938/25566788, CPU#16 | (M:lazy VP:0, KP:0, SP:0 HP:0 #P:80)
                                                ^^^^

Now that lazy preemption is part of the kernel, make sure the tracing
infrastructure reflects that.

Link: https://lore.kernel.org/20250103093647.575919-1-sshegde@linux.ibm.com
Signed-off-by: Shrikanth Hegde 
Acked-by: Masami Hiramatsu (Google) 
Signed-off-by: Steven Rostedt (Google)

tracing: Fix irqsoff and wakeup latency tracers when using function graph

2025-01-14T14:38:09+00:00

The function graph tracer has become generic so that kretprobes and BPF
can use it along with function graph tracing itself. Some of the
infrastructure was specific for function graph tracing such as recording
the calltime and return time of the functions. Calling the clock code on a
high volume function does add overhead. The calculation of the calltime
was removed from the generic code and placed into the function graph
tracer itself so that the other users did not incur this overhead as they
did not need that timestamp.

The calltime field was still kept in the generic return entry structure
and the function graph return entry callback filled it as that structure
was passed to other code.

But this broke both irqsoff and wakeup latency tracer as they still
depended on the trace structure containing the calltime when the option
display-graph is set as it used some of those same functions that the
function graph tracer used. But now the calltime was not set and was just
zero. This caused the calculation of the function time to be the absolute
value of the return timestamp and not the length of the function.

 # cd /sys/kernel/tracing
 # echo 1 > options/display-graph
 # echo irqsoff > current_tracer

The tracers went from:

 #   REL TIME      CPU  TASK/PID       ||||     DURATION                  FUNCTION CALLS
 #      |          |     |    |        ||||      |   |                     |   |   |   |
        0 us |   4)    -0    |  d..1. |   0.000 us    |  irqentry_enter();
        3 us |   4)    -0    |  d..2. |               |  irq_enter_rcu() {
        4 us |   4)    -0    |  d..2. |   0.431 us    |    preempt_count_add();
        5 us |   4)    -0    |  d.h2. |               |    tick_irq_enter() {
        5 us |   4)    -0    |  d.h2. |   0.433 us    |      tick_check_oneshot_broadcast_this_cpu();
        6 us |   4)    -0    |  d.h2. |   2.426 us    |      ktime_get();
        9 us |   4)    -0    |  d.h2. |               |      tick_nohz_stop_idle() {
       10 us |   4)    -0    |  d.h2. |   0.398 us    |        nr_iowait_cpu();
       11 us |   4)    -0    |  d.h1. |   1.903 us    |      }
       11 us |   4)    -0    |  d.h2. |               |      tick_do_update_jiffies64() {
       12 us |   4)    -0    |  d.h2. |               |        _raw_spin_lock() {
       12 us |   4)    -0    |  d.h2. |   0.360 us    |          preempt_count_add();
       13 us |   4)    -0    |  d.h3. |   0.354 us    |          do_raw_spin_lock();
       14 us |   4)    -0    |  d.h2. |   2.207 us    |        }
       15 us |   4)    -0    |  d.h3. |   0.428 us    |        calc_global_load();
       16 us |   4)    -0    |  d.h3. |               |        _raw_spin_unlock() {
       16 us |   4)    -0    |  d.h3. |   0.380 us    |          do_raw_spin_unlock();
       17 us |   4)    -0    |  d.h3. |   0.334 us    |          preempt_count_sub();
       18 us |   4)    -0    |  d.h1. |   1.768 us    |        }
       18 us |   4)    -0    |  d.h2. |               |        update_wall_time() {
      [..]

To:

 #   REL TIME      CPU  TASK/PID       ||||     DURATION                  FUNCTION CALLS
 #      |          |     |    |        ||||      |   |                     |   |   |   |
        0 us |   5)    -0    |  d.s2. |   0.000 us    |  _raw_spin_lock_irqsave();
        0 us |   5)    -0    |  d.s3. |   312159583 us |      preempt_count_add();
        2 us |   5)    -0    |  d.s4. |   312159585 us |      do_raw_spin_lock();
        3 us |   5)    -0    |  d.s4. |               |      _raw_spin_unlock() {
        3 us |   5)    -0    |  d.s4. |   312159586 us |        do_raw_spin_unlock();
        4 us |   5)    -0    |  d.s4. |   312159587 us |        preempt_count_sub();
        4 us |   5)    -0    |  d.s2. |   312159587 us |      }
        5 us |   5)    -0    |  d.s3. |               |      _raw_spin_lock() {
        5 us |   5)    -0    |  d.s3. |   312159588 us |        preempt_count_add();
        6 us |   5)    -0    |  d.s4. |   312159589 us |        do_raw_spin_lock();
        7 us |   5)    -0    |  d.s3. |   312159590 us |      }
        8 us |   5)    -0    |  d.s4. |   312159591 us |      calc_wheel_index();
        9 us |   5)    -0    |  d.s4. |               |      enqueue_timer() {
        9 us |   5)    -0    |  d.s4. |               |        wake_up_nohz_cpu() {
       11 us |   5)    -0    |  d.s4. |               |          native_smp_send_reschedule() {
       11 us |   5)    -0    |  d.s4. |   312171987 us |            default_send_IPI_single_phys();
    12408 us |   5)    -0    |  d.s3. |   312171990 us |          }
    12408 us |   5)    -0    |  d.s3. |   312171991 us |        }
    12409 us |   5)    -0    |  d.s3. |   312171991 us |      }

Where the calculation of the time for each function was the return time
minus zero and not the time of when the function returned.

Have these tracers also save the calltime in the fgraph data section and
retrieve it again on the return to get the correct timings again.

Cc: Masami Hiramatsu 
Cc: Mathieu Desnoyers 
Cc: Mark Rutland 
Link: https://lore.kernel.org/20250113183124.61767419@gandalf.local.home
Fixes: f1f36e22bee9 ("ftrace: Have calltime be saved in the fgraph storage")
Signed-off-by: Steven Rostedt (Google)

tracing/kprobes: Fix to free objects when failed to copy a symbol

2025-01-09T23:57:18+00:00

In __trace_kprobe_create(), if something fails it must goto error block
to free objects. But when strdup() a symbol, it returns without that.
Fix it to goto the error block to free objects correctly.

Link: https://lore.kernel.org/all/173643297743.1514810.2408159540454241947.stgit@devnote2/

Fixes: 6212dd29683e ("tracing/kprobes: Use dyn_event framework for kprobe events")
Signed-off-by: Masami Hiramatsu (Google) 
Reviewed-by: Steven Rostedt (Google)

Merge tag 'ftrace-v6.13-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

2025-01-03T18:04:43+00:00

Pull ftrace fixes from Steven Rostedt:

 - Add needed READ_ONCE() around access to the fgraph array element

   The updates to the fgraph array can happen when callbacks are
   registered and unregistered. The __ftrace_return_to_handler() can
   handle reading either the old value or the new value. But once it
   reads that value it must stay consistent otherwise the check that
   looks to see if the value is a stub may show false, but if the
   compiler decides to re-read after that check, it can be true which
   can cause the code to crash later on.

 - Make function profiler use the top level ops for filtering again

   When function graph became available for instances, its filter ops
   became independent from the top level set_ftrace_filter. In the
   process the function profiler received its own filter ops as well.
   But the function profiler uses the top level set_ftrace_filter file
   and does not have one of its own. In giving it its own filter ops, it
   lost any user interface it once had. Make it use the top level
   set_ftrace_filter file again. This fixes a regression.

* tag 'ftrace-v6.13-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ftrace: Fix function profiler's filtering functionality
  fgraph: Add READ_ONCE() when accessing fgraph_array[]

ftrace: Fix function profiler's filtering functionality

2025-01-02T22:21:33+00:00

Commit c132be2c4fcc ("function_graph: Have the instances use their own
ftrace_ops for filtering"), function profiler (enabled via
function_profile_enabled) has been showing statistics for all functions,
ignoring set_ftrace_filter settings.

While tracers are instantiated, the function profiler is not. Therefore, it
should use the global set_ftrace_filter for consistency.  This patch
modifies the function profiler to use the global filter, fixing the
filtering functionality.

Before (filtering not working):
```
root@localhost:~# echo 'vfs*' > /sys/kernel/tracing/set_ftrace_filter
root@localhost:~# echo 1 > /sys/kernel/tracing/function_profile_enabled
root@localhost:~# sleep 1
root@localhost:~# echo 0 > /sys/kernel/tracing/function_profile_enabled
root@localhost:~# head /sys/kernel/tracing/trace_stat/*
  Function                               Hit    Time            Avg
     s^2
  --------                               ---    ----            ---
     ---
  schedule                               314    22290594 us     70989.15 us
     40372231 us
  x64_sys_call                          1527    8762510 us      5738.382 us
     3414354 us
  schedule_hrtimeout_range               176    8665356 us      49234.98 us
     405618876 us
  __x64_sys_ppoll                        324    5656635 us      17458.75 us
     19203976 us
  do_sys_poll                            324    5653747 us      17449.83 us
     19214945 us
  schedule_timeout                        67    5531396 us      82558.15 us
     2136740827 us
  __x64_sys_pselect6                      12    3029540 us      252461.7 us
     63296940171 us
  do_pselect.constprop.0                  12    3029532 us      252461.0 us
     63296952931 us
```

After (filtering working):
```
root@localhost:~# echo 'vfs*' > /sys/kernel/tracing/set_ftrace_filter
root@localhost:~# echo 1 > /sys/kernel/tracing/function_profile_enabled
root@localhost:~# sleep 1
root@localhost:~# echo 0 > /sys/kernel/tracing/function_profile_enabled
root@localhost:~# head /sys/kernel/tracing/trace_stat/*
  Function                               Hit    Time            Avg
     s^2
  --------                               ---    ----            ---
     ---
  vfs_write                              462    68476.43 us     148.217 us
     25874.48 us
  vfs_read                               641    9611.356 us     14.994 us
     28868.07 us
  vfs_fstat                              890    878.094 us      0.986 us
     1.667 us
  vfs_fstatat                            227    757.176 us      3.335 us
     18.928 us
  vfs_statx                              226    610.610 us      2.701 us
     17.749 us
  vfs_getattr_nosec                     1187    460.919 us      0.388 us
     0.326 us
  vfs_statx_path                         297    343.287 us      1.155 us
     11.116 us
  vfs_rename                               6    291.575 us      48.595 us
     9889.236 us
```

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250101190820.72534-1-enjuk@amazon.com
Fixes: c132be2c4fcc ("function_graph: Have the instances use their own ftrace_ops for filtering")
Signed-off-by: Kohei Enju 
Signed-off-by: Steven Rostedt (Google)

fgraph: Add READ_ONCE() when accessing fgraph_array[]

2025-01-02T22:21:18+00:00

In __ftrace_return_to_handler(), a loop iterates over the fgraph_array[]
elements, which are fgraph_ops. The loop checks if an element is a
fgraph_stub to prevent using a fgraph_stub afterward.

However, if the compiler reloads fgraph_array[] after this check, it might
race with an update to fgraph_array[] that introduces a fgraph_stub. This
could result in the stub being processed, but the stub contains a null
"func_hash" field, leading to a NULL pointer dereference.

To ensure that the gops compared against the fgraph_stub matches the gops
processed later, add a READ_ONCE(). A similar patch appears in commit
63a8dfb ("function_graph: Add READ_ONCE() when accessing fgraph_array[]").

Cc: stable@vger.kernel.org
Fixes: 37238abe3cb47 ("ftrace/function_graph: Pass fgraph_ops to function graph callbacks")
Link: https://lore.kernel.org/20241231113731.277668-1-zilin@seu.edu.cn
Signed-off-by: Zilin Guan 
Signed-off-by: Steven Rostedt (Google)

tracing: Have process_string() also allow arrays

2024-12-31T05:10:32+00:00

In order to catch a common bug where a TRACE_EVENT() TP_fast_assign()
assigns an address of an allocated string to the ring buffer and then
references it in TP_printk(), which can be executed hours later when the
string is free, the function test_event_printk() runs on all events as
they are registered to make sure there's no unwanted dereferencing.

It calls process_string() to handle cases in TP_printk() format that has
"%s". It returns whether or not the string is safe. But it can have some
false positives.

For instance, xe_bo_move() has:

 TP_printk("move_lacks_source:%s, migrate object %p [size %zu] from %s to %s device_id:%s",
            __entry->move_lacks_source ? "yes" : "no", __entry->bo, __entry->size,
            xe_mem_type_to_name[__entry->old_placement],
            xe_mem_type_to_name[__entry->new_placement], __get_str(device_id))

Where the "%s" references into xe_mem_type_to_name[]. This is an array of
pointers that should be safe for the event to access. Instead of flagging
this as a bad reference, if a reference points to an array, where the
record field is the index, consider it safe.

Link: https://lore.kernel.org/all/9dee19b6185d325d0e6fa5f7cbba81d007d99166.camel@sapience.com/

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu 
Cc: Mathieu Desnoyers 
Link: https://lore.kernel.org/20241231000646.324fb5f7@gandalf.local.home
Fixes: 65a25d9f7ac02 ("tracing: Add "%s" check in test_event_printk()")
Reported-by: Genes Lists 
Tested-by: Gene C 
Signed-off-by: Steven Rostedt (Google)

Merge tag 'probes-fixes-v6.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

2024-12-27T19:03:15+00:00

Pull probes fix from Masami Hiramatsu:
 "Change the priority of the module callback of kprobe events so that it
  is called after the jump label list on the module is updated.

  This ensures the kprobe can check whether it is not on the jump label
  address correctly"

* tag 'probes-fixes-v6.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/kprobe: Make trace_kprobe's module callback called after jump_label update

tracing: Prevent bad count for tracing_cpumask_write

2024-12-24T02:59:15+00:00

If a large count is provided, it will trigger a warning in bitmap_parse_user.
Also check zero for it.

Cc: stable@vger.kernel.org
Fixes: 9e01c1b74c953 ("cpumask: convert kernel trace functions")
Link: https://lore.kernel.org/20241216073238.2573704-1-lizhi.xu@windriver.com
Reported-by: syzbot+0aecfd34fb878546f3fd@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=0aecfd34fb878546f3fd
Tested-by: syzbot+0aecfd34fb878546f3fd@syzkaller.appspotmail.com
Signed-off-by: Lizhi Xu 
Signed-off-by: Steven Rostedt (Google)

tracing/kprobe: Make trace_kprobe's module callback called after jump_label update

2024-12-23T15:08:13+00:00

Make sure the trace_kprobe's module notifer callback function is called
after jump_label's callback is called. Since the trace_kprobe's callback
eventually checks jump_label address during registering new kprobe on
the loading module, jump_label must be updated before this registration
happens.

Link: https://lore.kernel.org/all/173387585556.995044.3157941002975446119.stgit@devnote2/

Fixes: 614243181050 ("tracing/kprobes: Support module init function probing")
Signed-off-by: Masami Hiramatsu (Google)