These are some examples of using the perf Linux profiler, which has also been called Performance Counters for Linux (PCL), Linux perf events (LPE), or perf_events. Like Vince Weaver, I’ll call it perf_events so that you can search on that term later. Searching for just "perf" finds sites on the police, petroleum, weed control, and a T-shirt. This is not an official perf page, for either perf_events or the T-shirt.
perf_events is an event-oriented observability tool, which can help you solve advanced performance and troubleshooting functions. Questions that can be answered include:
- Why is the kernel on-…
These are some examples of using the perf Linux profiler, which has also been called Performance Counters for Linux (PCL), Linux perf events (LPE), or perf_events. Like Vince Weaver, I’ll call it perf_events so that you can search on that term later. Searching for just "perf" finds sites on the police, petroleum, weed control, and a T-shirt. This is not an official perf page, for either perf_events or the T-shirt.
perf_events is an event-oriented observability tool, which can help you solve advanced performance and troubleshooting functions. Questions that can be answered include:
- Why is the kernel on-CPU so much? What code-paths?
- Which code-paths are causing CPU level 2 cache misses?
- Are the CPUs stalled on memory I/O?
- Which code-paths are allocating memory, and how much?
- What is triggering TCP retransmits?
- Is a certain kernel function being called, and how often?
- What reasons are threads leaving the CPU?
perf_events is part of the Linux kernel, under tools/perf. While it uses many Linux tracing features, some are not yet exposed via the perf command, and need to be used via the ftrace interface instead. My perf-tools collection (github) uses both perf_events and ftrace as needed.
This page includes my examples of perf_events. A table of contents:
Key sections to start with are: Events, One-Liners, Presentations, Prerequisites, CPU statistics, Timed Profiling, and Flame Graphs. Also see my Posts about perf_events, and Links for the main (official) perf_events page, awesome tutorial, and other links. The next sections introduce perf_events further, starting with a screenshot, one-liners, and then background.
This page is under construction, and there’s a lot more to perf_events that I’d like to add. Hopefully this is useful so far.
1. Screenshot
Starting with a screenshot, here’s perf version 3.9.3 tracing disk I/O:
# perf record -e block:block_rq_issue -ag
^C
# ls -l perf.data
-rw------- 1 root root 3458162 Jan 26 03:03 perf.data
# perf report
[...]
# Samples: 2K of event 'block:block_rq_issue'
# Event count (approx.): 2216
#
# Overhead Command Shared Object Symbol
# ........ ............ ................. ....................
#
32.13% dd [kernel.kallsyms] [k] blk_peek_request
|
--- blk_peek_request
virtblk_request
__blk_run_queue
|
|--98.31%-- queue_unplugged
| blk_flush_plug_list
| |
| |--91.00%-- blk_queue_bio
| | generic_make_request
| | submit_bio
| | ext4_io_submit
| | |
| | |--58.71%-- ext4_bio_write_page
| | | mpage_da_submit_io
| | | mpage_da_map_and_submit
| | | write_cache_pages_da
| | | ext4_da_writepages
| | | do_writepages
| | | __filemap_fdatawrite_range
| | | filemap_flush
| | | ext4_alloc_da_blocks
| | | ext4_release_file
| | | __fput
| | | ____fput
| | | task_work_run
| | | do_notify_resume
| | | int_signal
| | | close
| | | 0x0
| | |
| | --41.29%-- mpage_da_submit_io
[...]
A perf record command was used to trace the block:block_rq_issue probe, which fires when a block device I/O request is issued (disk I/O). Options included -a to trace all CPUs, and -g to capture call graphs (stack traces). Trace data is written to a perf.data file, and tracing ended when Ctrl-C was hit. A summary of the perf.data file was printed using perf report, which builds a tree from the stack traces, coalescing common paths, and showing percentages for each path.
The perf report output shows that 2,216 events were traced (disk I/O), 32% of which from a dd command. These were issued by the kernel function blk_peek_request(), and walking down the stacks, about half of these 32% were from the close() system call.
Note that I use the "#" prompt to signify that these commands were run as root, and I’ll use "$" for user commands. Use sudo as needed.
2. One-Liners
Some useful one-liners I’ve gathered or written. Terminology I’m using, from lowest to highest overhead:
- statistics/count: increment an integer counter on events
- sample: collect details (eg, instruction pointer or stack) from a subset of events (once every ...)
- trace: collect details from every event
Listing Events
# Listing all currently known events:
perf list
# Listing sched tracepoints:
perf list 'sched:*'
Counting Events
# CPU counter statistics for the specified command:
perf stat command
# Detailed CPU counter statistics (includes extras) for the specified command:
perf stat -d command
# CPU counter statistics for the specified PID, until Ctrl-C:
perf stat -p PID
# CPU counter statistics for the entire system, for 5 seconds:
perf stat -a sleep 5
# Various basic CPU statistics, system wide, for 10 seconds:
perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles -a sleep 10
# Various CPU level 1 data cache statistics for the specified command:
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores command
# Various CPU data TLB statistics for the specified command:
perf stat -e dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses command
# Various CPU last level cache statistics for the specified command:
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches command
# Using raw PMC counters, eg, counting unhalted core cycles:
perf stat -e r003c -a sleep 5
# PMCs: counting cycles and frontend stalls via raw specification:
perf stat -e cycles -e cpu/event=0x0e,umask=0x01,inv,cmask=0x01/ -a sleep 5
# Count syscalls per-second system-wide:
perf stat -e raw_syscalls:sys_enter -I 1000 -a
# Count system calls by type for the specified PID, until Ctrl-C:
perf stat -e 'syscalls:sys_enter_*' -p PID
# Count system calls by type for the entire system, for 5 seconds:
perf stat -e 'syscalls:sys_enter_*' -a sleep 5
# Count scheduler events for the specified PID, until Ctrl-C:
perf stat -e 'sched:*' -p PID
# Count scheduler events for the specified PID, for 10 seconds:
perf stat -e 'sched:*' -p PID sleep 10
# Count ext4 events for the entire system, for 10 seconds:
perf stat -e 'ext4:*' -a sleep 10
# Count block device I/O events for the entire system, for 10 seconds:
perf stat -e 'block:*' -a sleep 10
# Count all vmscan events, printing a report every second:
perf stat -e 'vmscan:*' -a -I 1000
Profiling
# Sample on-CPU functions for the specified command, at 99 Hertz:
perf record -F 99 command
# Sample on-CPU functions for the specified PID, at 99 Hertz, until Ctrl-C:
perf record -F 99 -p PID
# Sample on-CPU functions for the specified PID, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID sleep 10
# Sample CPU stack traces (via frame pointers) for the specified PID, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID -g -- sleep 10
# Sample CPU stack traces for the PID, using dwarf (dbg info) to unwind stacks, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID --call-graph dwarf sleep 10
# Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds (< Linux 4.11):
perf record -F 99 -ag -- sleep 10
# Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds (>= Linux 4.11):
perf record -F 99 -g -- sleep 10
# If the previous command didn't work, try forcing perf to use the cpu-clock event:
perf record -F 99 -e cpu-clock -ag -- sleep 10
# Sample CPU stack traces for a container identified by its /sys/fs/cgroup/perf_event cgroup:
perf record -F 99 -e cpu-clock --cgroup=docker/1d567f4393190204...etc... -a -- sleep 10
# Sample CPU stack traces for the entire system, with dwarf stacks, at 99 Hertz, for 10 seconds:
perf record -F 99 -a --call-graph dwarf sleep 10
# Sample CPU stack traces for the entire system, using last branch record for stacks, ... (>= Linux 4.?):
perf record -F 99 -a --call-graph lbr sleep 10
# Sample CPU stack traces, once every 10,000 Level 1 data cache misses, for 5 seconds:
perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5
# Sample CPU stack traces, once every 100 last level cache misses, for 5 seconds:
perf record -e LLC-load-misses -c 100 -ag -- sleep 5
# Sample on-CPU kernel instructions, for 5 seconds:
perf record -e cycles:k -a -- sleep 5
# Sample on-CPU user instructions, for 5 seconds:
perf record -e cycles:u -a -- sleep 5
# Sample on-CPU user instructions precisely (using PEBS), for 5 seconds:
perf record -e cycles:up -a -- sleep 5
# Perform branch tracing (needs HW support), for 1 second:
perf record -b -a sleep 1
# Sample CPUs at 49 Hertz, and show top addresses and symbols, live (no perf.data file):
perf top -F 49
# Sample CPUs at 49 Hertz, and show top process names and segments, live:
perf top -F 49 -ns comm,dso
Static Tracing
# Trace new processes, until Ctrl-C:
perf record -e sched:sched_process_exec -a
# Sample (take a subset of) context-switches, until Ctrl-C:
perf record -e context-switches -a
# Trace all context-switches, until Ctrl-C:
perf record -e context-switches -c 1 -a
# Include raw settings used (see: man perf_event_open):
perf record -vv -e context-switches -a
# Trace all context-switches via sched tracepoint, until Ctrl-C:
perf record -e sched:sched_switch -a
# Sample context-switches with stack traces, until Ctrl-C:
perf record -e context-switches -ag
# Sample context-switches with stack traces, for 10 seconds:
perf record -e context-switches -ag -- sleep 10
# Sample CS, stack traces, and with timestamps (< Linux 3.17, -T now default):
perf record -e context-switches -ag -T
# Sample CPU migrations, for 10 seconds:
perf record -e migrations -a -- sleep 10
# Trace all connect()s with stack traces (outbound connections), until Ctrl-C:
perf record -e syscalls:sys_enter_connect -ag
# Trace all accepts()s with stack traces (inbound connections), until Ctrl-C:
perf record -e syscalls:sys_enter_accept* -ag
# Trace all block device (disk I/O) requests with stack traces, until Ctrl-C:
perf record -e block:block_rq_insert -ag
# Sample at most 100 block device requests per second, until Ctrl-C:
perf record -F 100 -e block:block_rq_insert -a
# Trace all block device issues and completions (has timestamps), until Ctrl-C:
perf record -e block:block_rq_issue -e block:block_rq_complete -a
# Trace all block completions, of size at least 100 Kbytes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'nr_sector > 200'
# Trace all block completions, synchronous writes only, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs == "WS"'
# Trace all block completions, all types of writes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs ~ "*W*"'
# Sample minor faults (RSS growth) with stack traces, until Ctrl-C:
perf record -e minor-faults -ag
# Trace all minor faults with stack traces, until Ctrl-C:
perf record -e minor-faults -c 1 -ag
# Sample page faults with stack traces, until Ctrl-C:
perf record -e page-faults -ag
# Trace all ext4 calls, and write to a non-ext4 location, until Ctrl-C:
perf record -e 'ext4:*' -o /tmp/perf.data -a
# Trace kswapd wakeup events, until Ctrl-C:
perf record -e vmscan:mm_vmscan_wakeup_kswapd -ag
# Add Node.js USDT probes (Linux 4.10+):
perf buildid-cache --add `which node`
# Trace the node http__server__request USDT event (Linux 4.10+):
perf record -e sdt_node:http__server__request -a
Dynamic Tracing
# Add a tracepoint for the kernel tcp_sendmsg() function entry ("--add" is optional):
perf probe --add tcp_sendmsg
# Remove the tcp_sendmsg() tracepoint (or use "--del"):
perf probe -d tcp_sendmsg
# Add a tracepoint for the kernel tcp_sendmsg() function return:
perf probe 'tcp_sendmsg%return'
# Show available variables for the kernel tcp_sendmsg() function (needs debuginfo):
perf probe -V tcp_sendmsg
# Show available variables for the kernel tcp_sendmsg() function, plus external vars (needs debuginfo):
perf probe -V tcp_sendmsg --externs
# Show available line probes for tcp_sendmsg() (needs debuginfo):
perf probe -L tcp_sendmsg
# Show available variables for tcp_sendmsg() at line number 81 (needs debuginfo):
perf probe -V tcp_sendmsg:81
# Add a tracepoint for tcp_sendmsg(), with three entry argument registers (platform specific):
perf probe 'tcp_sendmsg %ax %dx %cx'
# Add a tracepoint for tcp_sendmsg(), with an alias ("bytes") for the %cx register (platform specific):
perf probe 'tcp_sendmsg bytes=%cx'
# Trace previously created probe when the bytes (alias) variable is greater than 100:
perf record -e probe:tcp_sendmsg --filter 'bytes > 100'
# Add a tracepoint for tcp_sendmsg() return, and capture the return value:
perf probe 'tcp_sendmsg%return $retval'
# Add a tracepoint for tcp_sendmsg(), and "size" entry argument (reliable, but needs debuginfo):
perf probe 'tcp_sendmsg size'
# Add a tracepoint for tcp_sendmsg(), with size and socket state (needs debuginfo):
perf probe 'tcp_sendmsg size sk->__sk_common.skc_state'
# Tell me how on Earth you would do this, but don't actually do it (needs debuginfo):
perf probe -nv 'tcp_sendmsg size sk->__sk_common.skc_state'
# Trace previous probe when size is non-zero, and state is not TCP_ESTABLISHED(1) (needs debuginfo):
perf record -e probe:tcp_sendmsg --filter 'size > 0 && skc_state != 1' -a
# Add a tracepoint for tcp_sendmsg() line 81 with local variable seglen (needs debuginfo):
perf probe 'tcp_sendmsg:81 seglen'
# Add a tracepoint for do_sys_open() with the filename as a string (needs debuginfo):
perf probe 'do_sys_open filename:string'
# Add a tracepoint for myfunc() return, and include the retval as a string:
perf probe 'myfunc%return +0($retval):string'
# Add a tracepoint for the user-level malloc() function from libc:
perf probe -x /lib64/libc.so.6 malloc
# Add a tracepoint for this user-level static probe (USDT, aka SDT event):
perf probe -x /usr/lib64/libpthread-2.24.so %sdt_libpthread:mutex_entry
# List currently available dynamic probes:
perf probe -l
Mixed
# Trace system calls by process, showing a summary refreshing every 2 seconds:
perf top -e raw_syscalls:sys_enter -ns comm
# Trace sent network packets by on-CPU process, rolling output (no clear):
stdbuf -oL perf top -e net:net_dev_xmit -ns comm | strings
# Sample stacks at 99 Hertz, and, context switches:
perf record -F99 -e cpu-clock -e cs -a -g
# Sample stacks to 2 levels deep, and, context switch stacks to 5 levels (needs 4.8):
perf record -F99 -e cpu-clock/max-stack=2/ -e cs/max-stack=5/ -a -g
Special
# Record cacheline events (Linux 4.10+):
perf c2c record -a -- sleep 10
# Report cacheline events from previous recording (Linux 4.10+):
perf c2c report
Reporting
# Show perf.data in an ncurses browser (TUI) if possible:
perf report
# Show perf.data with a column for sample count:
perf report -n
# Show perf.data as a text report, with data coalesced and percentages:
perf report --stdio
# Report, with stacks in folded format: one line per stack (needs 4.4):
perf report --stdio -n -g folded
# List all events from perf.data:
perf script
# List all perf.data events, with data header (newer kernels; was previously default):
perf script --header
# List all perf.data events, with customized fields (< Linux 4.1):
perf script -f time,event,trace
# List all perf.data events, with customized fields (>= Linux 4.1):
perf script -F time,event,trace
# List all perf.data events, with my recommended fields (needs record -a; newer kernels):
perf script --header -F comm,pid,tid,cpu,time,event,ip,sym,dso
# List all perf.data events, with my recommended fields (needs record -a; older kernels):
perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso
# Dump raw contents from perf.data as hex (for debugging):
perf script -D
# Disassemble and annotate instructions with percentages (needs some debuginfo):
perf annotate --stdio
These one-liners serve to illustrate the capabilities of perf_events, and can also be used a bite-sized tutorial: learn perf_events one line at a time. You can also print these out as a perf_events cheatsheet.
3. Presentations
Kernel Recipes (2017)
At Kernel Recipes 2017 I gave an updated talk on Linux perf at Netflix, focusing on getting CPU profiling and flame graphs to work. This talk includes a crash course on perf_events, plus gotchas such as fixing stack traces and symbols when profiling Java, Node.js, VMs, and containers.
A video of the talk is on youtube and the slides are on slideshare:
There’s also an older version of this talk from 2015, which I’ve posted about.
4. Background
The following sections provide some background for understanding perf_events and how to use it. I’ll describe the prerequisites, audience, usage, events, and tracepoints.
4.1. Prerequisites
The perf tool is in the linux-tools-common package. Start by adding that, then running "perf" to see if you get the USAGE message. It may tell you to install another related package (linux-tools-kernelversion).
You can also build and add perf from the Linux kernel source. See the Building section.
To get the most out perf, you’ll want symbols and stack traces. These may work by default in your Linux distribution, or they may require the addition of packages, or recompilation of the kernel with additional config options.
4.2. Symbols
perf_events, like other debug tools, needs symbol information (symbols). These are used to translate memory addresses into function and variable names, so that they can be read by us humans. Without symbols, you’ll see hexadecimal numbers representing the memory addresses profiled.
The following perf report output shows stack traces, however, only hexadecimal numbers can be seen:
57.14% sshd libc-2.15.so [.] connect
|
--- connect
|
|--25.00%-- 0x7ff3c1cddf29
|
|--25.00%-- 0x7ff3bfe82761
| 0x7ff3bfe82b7c
|
|--25.00%-- 0x7ff3bfe82dfc
--25.00%-- [...]
If the software was added by packages, you may find debug packages (often "-dbgsym") which provide the symbols. Sometimes perf report will tell you to install these, eg: "no symbols found in /bin/dd, maybe install a debug package?".
Here’s the same perf report output seen earlier, after adding openssh-server-dbgsym and libc6-dbgsym (this is on ubuntu 12.04):
57.14% sshd libc-2.15.so [.] __GI___connect_internal
|
--- __GI___connect_internal
|
|--25.00%-- add_one_listen_addr.isra.0
|
|--25.00%-- __nscd_get_mapping
| __nscd_get_map_ref
|
|--25.00%-- __nscd_open_socket
--25.00%-- [...]
I find it useful to add both libc6-dbgsym and coreutils-dbgsym, to provide some symbol coverage of user-level OS codepaths.
Another way to get symbols is to compile the software yourself. For example, I just compiled node (Node.js):
# file node-v0.10.28/out/Release/node
node-v0.10.28/out/Release/node: ELF 64-bit LSB executable, ... not stripped
This has not been stripped, so I can profile node and see more than just hex. If the result is stripped, configure your build system not to run strip(1) on the output binaries.
Kernel-level symbols are in the kernel debuginfo package, or when the kernel is compiled with CONFIG_KALLSYMS.
4.3. JIT Symbols (Java, Node.js)
Programs that have virtual machines (VMs), like Java’s JVM and node’s v8, execute their own virtual processor, which has its own way of executing functions and managing stacks. If you profile these using perf_events, you’ll see symbols for the VM engine, which have some use (eg, to identify if time is spent in GC), but you won’t see the language-level context you might be expecting. Eg, you won’t see Java classes and methods.
perf_events has JIT support to solve this, which requires the VM to maintain a /tmp/perf-PID.map file for symbol translation. Java can do this with perf-map-agent, and Node.js 0.11.13+ with –perf_basic_prof. See my blog post Node.js flame graphs on Linux for the steps.
Note that Java may not show full stacks to begin with, due to hotspot on x86 omitting the frame pointer (just like gcc). On newer versions (JDK 8u60+), you can use the -XX:+PreserveFramePointer option to fix this behavior, and profile fully using perf. See my Netflix Tech Blog post, Java in Flames, for a full writeup, and my Java flame graphs section, which links to an older patch and includes an example resulting flame graph. I also summarized the latest in my JavaOne 2016 talk Java Performance Analysis on Linux with Flame Graphs.
4.4 Stack Traces
Always compile with frame pointers. Omitting frame pointers is an evil compiler optimization that breaks debuggers, and sadly, is often the default. Without them, you may see incomplete stacks from perf_events, like seen in the earlier sshd symbols example. There are three ways to fix this: either using dwarf data to unwind the stack, using last branch record (LBR) if available (a processor feature), or returning the frame pointers.
There are other stack walking techniques, like BTS (Branch Trace Store), and the new ORC unwinder. I’ll add docs for them at some point (and as perf support arrives).
Frame Pointers
The earlier sshd example was a default build of OpenSSH, which uses compiler optimizations (-O2), which in this case has omitted the frame pointer. Here’s how it looks after recompiling OpenSSH with -fno-omit-frame-pointer:
100.00% sshd libc-2.15.so [.] __GI___connect_internal
|
--- __GI___connect_internal
|
|--30.00%-- add_one_listen_addr.isra.0
| add_listen_addr
| fill_default_server_options
| main
| __libc_start_main
|
|--20.00%-- __nscd_get_mapping
| __nscd_get_map_ref
|
|--20.00%-- __nscd_open_socket
--30.00%-- [...]
Now the ancestry from add_one_listen_addr() can be seen, down to main() and __libc_start_main().
The kernel can suffer the same problem. Here’s an example CPU profile collected on an idle server, with stack traces (-g):
99.97% swapper [kernel.kallsyms] [k] default_idle
|
--- default_idle
0.03% sshd [kernel.kallsyms] [k] iowrite16
|
--- iowrite16
__write_nocancel
(nil)
The kernel stack traces are incomplete. Now a similar profile with CONFIG_FRAME_POINTER=y:
99.97% swapper [kernel.kallsyms] [k] default_idle
|
--- default_idle
cpu_idle
|
|--87.50%-- start_secondary
|
--12.50%-- rest_init
start_kernel
x86_64_start_reservations
x86_64_start_kernel
0.03% sshd [kernel.kallsyms] [k] iowrite16
|
--- iowrite16
vp_notify
virtqueue_kick
start_xmit
dev_hard_start_xmit
sch_direct_xmit
dev_queue_xmit
ip_finish_output
ip_output
ip_local_out
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
__tcp_push_pending_frames
tcp_sendmsg
inet_sendmsg
sock_aio_write
do_sync_write
vfs_write
sys_write
system_call_fastpath
__write_nocancel
Much better – the entire path from the write() syscall (__write_nocancel) to iowrite16() can be seen.
Dwarf
Since about the 3.9 kernel, perf_events has supported a workaround for missing frame pointers in user-level stacks: libunwind, which uses dwarf. This can be enabled using "–call-graph dwarf" (or "-g dwarf").
Also see the Building section for other notes about building perf_events, as without the right library, it may build itself without dwarf support.
LBR
You must have Last Branch Record access to be able to use this. It is disabled in most cloud environments, where you’ll get this error:
# perf record -F 99 -a --call-graph lbr
Error:
PMU Hardware doesn't support sampling/overflow-interrupts.
Here’s an example of it working:
# perf record -F 99 -a --call-graph lbr
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.903 MB perf.data (163 samples) ]
# perf script
[...]
stackcollapse-p 23867 [007] 4762187.971824: 29003297 cycles:ppp:
1430c0 Perl_re_intuit_start (/usr/bin/perl)
144118 Perl_regexec_flags (/usr/bin/perl)
cfcc9 Perl_pp_match (/usr/bin/perl)
cbee3 Perl_runops_standard (/usr/bin/perl)
51fb3 perl_run (/usr/bin/perl)
2b168 main (/usr/bin/perl)
stackcollapse-p 23867 [007] 4762187.980184: 31532281 cycles:ppp:
e3660 Perl_sv_force_normal_flags (/usr/bin/perl)
109b86 Perl_leave_scope (/usr/bin/perl)
1139db Perl_pp_leave (/usr/bin/perl)
cbee3 Perl_runops_standard (/usr/bin/perl)
51fb3 perl_run (/usr/bin/perl)
2b168 main (/usr/bin/perl)
stackcollapse-p 23867 [007] 4762187.989283: 32341031 cycles:ppp:
cfae0 Perl_pp_match (/usr/bin/perl)
cbee3 Perl_runops_standard (/usr/bin/perl)
51fb3 perl_run (/usr/bin/perl)
2b168 main (/usr/bin/perl)
Nice! Note that LBR is usually limited in stack depth (either 8, 16, or 32 frames), so it may not be suitable for deep stacks or flame graph generation, as flame graphs need to walk to the common root for merging.
Here’s that same program sampled using the by-default frame pointer walk:
# perf record -F 99 -a -g
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.882 MB perf.data (81 samples) ]
# perf script
[...]
stackcollapse-p 23883 [005] 4762405.747834: 35044916 cycles:ppp:
135b83 [unknown] (/usr/bin/perl)
stackcollapse-p 23883 [005] 4762405.757935: 35036297 cycles:ppp:
ee67d Perl_sv_gets (/usr/bin/perl)
stackcollapse-p 23883 [005] 4762405.768038: 35045174 cycles:ppp:
137334 [unknown] (/usr/bin/perl)
You can recompile Perl with frame pointer support (in its ./Configure, it asks what compiler options: add -fno-omit-frame-pointer). Or you can use LBR if it’s available, and you don’t need very long stacks.
4.5. Audience
To use perf_events, you’ll either:
- Develop your own commands
- Run example commands
Developing new invocations of perf_events requires the study of kernel and application code, which isn’t for everyone. Many more people will use perf_events by running commands developed by other people, like the examples on this page. This can work out fine: your organization may only need one or two people who can develop perf_events commands or source them, and then share them for use by the entire operation and support groups.
Either way, you need to know the capabilities of perf_events so you know when to reach for it, whether that means searching for an example command or writing your own. One goal of the examples that follow is just to show you what can be done, to help you learn these capabilities. You should also browse examples on other sites (Links).
If you’ve never used perf_events before, you may want to test before production use (it has had kernel panic bugs in the past). My experience has been a good one (no panics).
4.6. Usage
perf_events provides a command line tool, perf, and subcommands for various profiling activities. This is a single interface for the different instrumentation frameworks that provide the various events.
The perf command alone will list the subcommands; here is perf version 4.10 (for the Linux 4.10 kernel):
# perf
usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]
The most commonly used perf commands are:
annotate Read perf.data (created by perf record) and display annotated code
archive Create archive with object files with build-ids found in perf.data file
bench General framework for benchmark suites
buildid-cache Manage build-id cache.
buildid-list List the buildids in a perf.data file
config Get and set variables in a configuration file.
data Data file related processing
diff Read perf.data files and display the differential profile
evlist List the event names in a perf.data file
inject Filter to augment the events stream with additional information
kmem Tool to trace/measure kernel memory properties
kvm Tool to trace/measure kvm guest os
list List all symbolic event types
lock Analyze lock events
mem Profile memory accesses
record Run a command and record its profile into perf.data
report Read perf.data (created by perf record) and display the profile
sched Tool to trace/measure scheduler properties (latencies)
script Read perf.data (created by perf record) and display trace output
stat Run a command and gather performance counter statistics
test Runs sanity tests.
timechart Tool to visualize total system behavior during a workload
top System profiling tool.
probe Define new dynamic tracepoints
trace strace inspired tool
See 'perf help COMMAND' for more information on a specific command.
Apart from separate help for each subcommand, there is also documentation in the kernel source under tools/perf/Documentation. perf has evolved, with different functionality added over time, so on an older kernel you may be missing some subcommands or functionality. Also, its usage may not feel consistent as you switch between activities. It’s best to think of it as a multi-tool.
perf_events can instrument in three ways (now using the perf_events terminology):
- counting events in-kernel context, where a summary of counts is printed by perf. This mode does not generate a perf.data file.
- sampling events, which writes event data to a kernel buffer, which is read at a gentle asynchronous rate by the perf command to write to the perf.data file. This file is then read by the perf report or perf script commands.
- bpf programs on events, a new feature in Linux 4.4+ kernels that can execute custom user-defined programs in kernel space, which can perform efficient filters and summaries of the data. Eg, efficiently-measured latency histograms.
Try starting by counting events using the perf stat command, to see if this is sufficient. This subcommand costs the least overhead.
When using the sampling mode with perf record, you’ll need to be a little careful about the overheads, as the capture files can quickly become hundreds of Mbytes. It depends on the rate of the event you are tracing: the more frequent, the higher the overhead and larger the perf.data size.
To really cut down overhead and generate more advanced summaries, write BPF programs executed by perf. See the eBPF section.
4.7. Usage Examples
These example sequences have been chosen to illustrate some different ways that perf is used, from gathering to reporting.
Performance counter summaries, including IPC, for the gzip command:
# perf stat gzip largefile
Count all scheduler process events for 5 seconds, and count by tracepoint:
# perf stat -e 'sched:sched_process_*' -a sleep 5
Trace all scheduler process events for 5 seconds, and count by both tracepoint and process name:
# perf record -e 'sched:sched_process_*' -a sleep 5
# perf report
Trace all scheduler process events for 5 seconds, and dump per-event details:
# perf record -e 'sched:sched_process_*' -a sleep 5
# perf script
Trace read() syscalls, when requested bytes is less than 10:
# perf record -e 'syscalls:sys_enter_read' --filter 'count < 10' -a
Sample CPU stacks at 99 Hertz, for 5 seconds:
# perf record -F 99 -ag -- sleep 5
# perf report
Dynamically instrument the kernel tcp_sendmsg() function, and trace it for 5 seconds, with stack traces:
# perf probe --add tcp_sendmsg
# perf record -e probe:tcp_sendmsg -ag -- sleep 5
# perf probe --del tcp_sendmsg
# perf report
Deleting the tracepoint (–del) wasn’t necessary; I included it to show how to return the system to its original state.
Caveats
The use of -p PID as a filter doesn’t work properly on some older kernel versions (Linux 3.x): perf hits 100% CPU and needs to be killed. It’s annoying. The workaround is to profile all CPUs (-a), and filter PIDs later.
4.8. Special Usage
There’s a number of subcommands that provide special purpose functionality. These include:
- perf c2c (Linux 4.10+): cache-2-cache and cacheline false sharing analysis.
- perf kmem: kernel memory allocation analysis.
- perf kvm: KVM virtual guest analysis.
- perf lock: lock analysis.
- perf mem: memory access analysis.
- perf sched: kernel scheduler statistics. Examples.
These make use of perf’s existing instrumentation capabilities, recording selected events and reporting them in custom ways.
5. Events
perf_events instruments "events", which are a unified interface for different kernel instrumentation frameworks. The following map (from my SCaLE13x talk) illustrates the event sources:
The types of events are:
- Hardware Events: CPU performance monitoring counters.
- Software Events: These are low level events based on kernel counters. For example, CPU migrations, minor faults, major faults, etc.
- Kernel Tracepoint Events: This are static kernel-level instrumentation points that are hardcoded in interesting and logical places in the kernel.
- User Statically-Defined Tracing (USDT): These are static tracepoints for user-level programs and applications.
- Dynamic Tracing: Software can be dynamically instrumented, creating events in any location. For kernel software, this uses the kprobes framework. For user-level software, uprobes.
- Timed Profiling: Snapshots can be collected at an arbitrary frequency, using perf record -FHz. This is commonly used for CPU usage profiling, and works by creating custom timed interrupt events.
Details about the events can be collected, including timestamps, the code path that led to it, and other specific details. The capabilities of perf_events are enormous, and you’re likely to only ever use a fraction.
Currently available events can be listed using the list subcommand:
# perf list
List of pre-defined events (to be used in -e):
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
cache-references [Hardware event]
cache-misses [Hardware event]
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
stalled-cycles-backend OR idle-cycles-backend [Hardware event]
ref-cycles [Hardware event]
cpu-clock [Software event]
task-clock [Software event]
page-faults OR faults [Software event]
context-switches OR cs [Software event]
cpu-migrations OR migrations [Software event]
minor-faults [Software event]
major-faults [Software event]
alignment-faults [Software event]
emulation-faults [Software event]
L1-dcache-loads [Hardware cache event]
L1-dcache-load-misses [Hardware cache event]
L1-dcache-stores [Hardware cache event]
[...]
rNNN [Raw hardware event descriptor]
cpu/t1=v1[,t2=v2,t3 ...]/modifier [Raw hardware event descriptor]
(see 'man perf-list' on how to encode it)
mem:<addr>[:access] [Hardware breakpoint]
probe:tcp_sendmsg [Tracepoint event]
[...]
sched:sched_process_exec [Tracepoint event]
sched:sched_process_fork [Tracepoint event]
sched:sched_process_wait [Tracepoint event]
sched:sched_wait_task [Tracepoint event]
sched:sched_process_exit [Tracepoint event]
[...]
# perf list | wc -l
657
When you use dynamic tracing, you are extending this list. The probe:tcp_sendmsg tracepoint in this list is an example, which I added by instrumenting tcp_sendmsg(). Profiling (sampling) events are not listed.
5.1. Software Events
There is a small number of fixed software events provided by perf:
# perf list
List of pre-defined events (to be used in -e):
alignment-faults [Software event]
bpf-output [Software event]
context-switches OR cs [Software event]
cpu-clock [Software event]
cpu-migrations OR migrations [Software event]
dummy [Software event]
emulation-faults [Software event]
major-faults [Software event]
minor-faults [Software event]
page-faults OR faults [Software event]
task-clock [Software event]
[...]
These are also documented in the man page perf_event_open(2):
[...]
PERF_COUNT_SW_CPU_CLOCK
This reports the CPU clock, a high-resolution per-
CPU timer.
PERF_COUNT_SW_TASK_CLOCK
This reports a clock count specific to the task that
is running.
PERF_COUNT_SW_PAGE_FAULTS
This reports the number of page faults.
PERF_COUNT_SW_CONTEXT_SWITCHES
This counts context switches. Until Linux 2.6.34,
these were all reported as user-space events, after
that they are reported as happening in the kernel.
PERF_COUNT_SW_CPU_MIGRATIONS
This reports the number of times the process has
migrated to a new CPU.
PERF_COUNT_SW_PAGE_FAULTS_MIN
This counts the number of minor page faults. These
did not require disk I/O to handle.
[...]
The kernel also supports traecpoints, which are very similar to software events, but have a different more extensible API.
Software events may have a default period. This means that when you use them for sampling, you’re sampling a subset of events, not tracing every event. You can check with perf record -vv:
# perf record -vv -e context-switches /bin/true
Using CPUID GenuineIntel-6-55
------------------------------------------------------------
perf_event_attr:
type 1
size 112
config 0x3
{ sample_period, sample_freq } 4000
sample_type IP|TID|TIME|PERIOD
disabled 1
inherit 1
mmap 1
comm 1
freq 1
enable_on_exec 1
[...]
See the perf_event_open(2) man page for a description of these fields. This default means is that the kernel adjusts the rate of sampling so that it’s capturing about 4,000 context switch events per second. If you really meant to record them all, use -c 1:
# perf record -vv -e context-switches -c 1 /bin/true
Using CPUID GenuineIntel-6-55
------------------------------------------------------------
perf_event_attr:
type 1
size 112
config 0x3
{ sample_period, sample_freq } 1
sample_type IP|TID|TIME
disabled 1
inherit 1
mmap 1
comm 1
enable_on_exec 1
Check the rate of events using perf stat first, so that you can estimate the volume of data you’ll be capturing. Sampling a subset by default may be a good thing, especially for high frequency events like context switches.
Many other events (like tracepoints) have a default of 1 anyway. You’ll encounter a non-1 default for many software and hardware events.
5.2. Hardware Events (PMCs)
perf_events began life as a tool for instrumenting the processor’s performance monitoring unit (PMU) hardware counters, also called performance monitoring counters (PMCs), or performance instrumentation counters (PICs). These instrument low-level processor activity, for example, CPU cycles, instructions retired, memory stall cycles, level 2 cache misses, etc. Some will be listed as Hardware Cache Events.
PMCs are documented in the Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3B: System Programming Guide, Part 2 and the BIOS and Kernel Developer’s Guide (BKDG) For AMD Family 10h Processors. There are thousands of different PMCs available.
A typical processor will implement PMCs in the following way: only a few or several can be recorded at the same time, from the many thousands that are available. This is because they are a fixed hardware resource on the processor (a limited number of registers), and are programmed to begin counting the selected events.
For examples of using PMCs, see CPU Statistics.
5.3. Kernel Tracepoints
These tracepoints are hard coded in interesting and logical locations of the kernel, so that higher-level behavior can be easily traced. For example, system calls, TCP events, file system I/O, disk I/O, etc. These are grouped into libraries of tracepoints; eg, "sock:" for socket events, "sched:" for CPU scheduler events. A key value of tracepoints is that they should have a stable API, so if you write tools that use them on one kernel version, they should work on later versions as well.
Tracepoints are usually added to kernel code by placing a macro from include/trace/events/*. XXX cover implementation.
Summarizing the tracepoint library names and numbers of tracepoints, on my Linux 4.10 system:
# perf list | awk -F: '/Tracepoint event/ { lib[$1]++ } END {
for (l in lib) { printf " %-16.16s %d\n", l, lib[l] } }' | sort | column
alarmtimer 4 i2c 8 page_isolation 1 swiotlb 1
block 19 iommu 7 pagemap 2 syscalls 614
btrfs 51 irq 5 power 22 task 2
cgroup 9 irq_vectors 22 printk 1 thermal 7
clk 14 jbd2 16 random 15 thermal_power_ 2
cma 2 kmem 12 ras 4 timer 13
compaction 14 libata 6 raw_syscalls 2 tlb 1
cpuhp 3 mce 1 rcu 1 udp 1
dma_fence 8 mdio 1 regmap 15 vmscan 15
exceptions 2 migrate 2 regulator 7 vsyscall 1
ext4 95 mmc 2 rpm 4 workqueue 4
fib 3 module 5 sched 24 writeback 30
fib6 1 mpx 5 scsi 5 x86_fpu 14
filelock 10 msr 3 sdt_node 1 xen 35
filemap 2 napi 1 signal 2 xfs 495
ftrace 1 net 10 skb 3 xhci-hcd 9
gpio 2 nmi 1 sock 2
huge_memory 4 oom 1 spi 7
These include:
- block: block device I/O
- ext4: file system operations
- kmem: kernel memory allocation events
- random: kernel random number generator events
- sched: CPU scheduler events
- syscalls: system call enter and exits
- task: task events
It’s worth checking the list of tracepoints after every kernel upgrade, to see if any are new. The value of adding them has been debated from time to time, with it wondered how many people will use them (I do). There is a balance to aim for: I’d include the smallest number of probes that sufficiently covers common needs, and anything unusual or uncommon can be left to dynamic tracing.
For examples of using tracepoints, see Static Kernel Tracing.
5.4. User-Level Statically Defined Tracing (USDT)
Similar to kernel tracepoints, these are hardcoded (usually by placing macros) in the application source at logical and interesting locations, and presented (event name and arguments) as a stable API. Many applications already include tracepoints, added to support DTrace. However, many of these applications do not compile them in by default on Linux. Often you need to compile the application yourself using a –with-dtrace flag.
For example, compiling USDT events with this version of Node.js:
$ sudo apt-get install systemtap-sdt-dev # adds "dtrace", used by node build
$ wget https://nodejs.org/dist/v4.4.1/node-v4.4.1.tar.gz
$ tar xvf node-v4.4.1.tar.gz
$ cd node-v4.4.1
$ ./configure --with-dtrace
$ make -j 8
To check that the resulting node binary has probes included:
$ readelf -n node
Displaying notes found at file offset 0x00000254 with length 0x00000020:
Owner Data size Description
GNU 0x00000010 NT_GNU_ABI_TAG (ABI version tag)
OS: Linux, ABI: 2.6.32
Displaying notes found at file offset 0x00000274 with length 0x00000024:
Owner Data size Description
GNU 0x00000014 NT_GNU_BUILD_ID (unique build ID bitstring)
Build ID: 1e01659b0aecedadf297b2c56c4a2b536ae2308a
Displaying notes found at file offset 0x00e70994 with length 0x000003c4:
Owner Data size Description
stapsdt 0x0000003c NT_STAPSDT (SystemTap probe descriptors)
Provider: node
Name: gc__start
Location: 0x0000000000dc14e4, Base: 0x000000000112e064, Semaphore: 0x000000000147095c
Arguments: 4@%esi 4@%edx 8@%rdi
stapsdt 0x0000003b NT_STAPSDT (SystemTap probe descriptors)
Provider: node
Name: gc__done
Location: 0x0000000000dc14f4, Base: 0x000000000112e064, Semaphore: 0x000000000147095e
Arguments: 4@%esi 4@%edx 8@%rdi
stapsdt 0x00000067 NT_STAPSDT (SystemTap probe descriptors)
Provider: node
Name: http__server__response
Location: 0x0000000000dc1894, Base: 0x000000000112e064, Semaphore: 0x0000000001470956
Arguments: 8@%rax 8@-1144(%rbp) -4@-1148(%rbp) -4@-1152(%rbp)
stapsdt 0x00000061 NT_STAPSDT (SystemTap probe descriptors)
Provider: node
Name: net__stream__end
Location: 0x0000000000dc1c44, Base: 0x000000000112e064, Semaphore: 0x0000000001470952
Arguments: 8@%rax 8@-1144(%rbp) -4@-1148(%rbp) -4@-1152(%rbp)
stapsdt 0x00000068 NT_STAPSDT (SystemTap probe descriptors)
Provider: node
Name: net__server__connection
Location: 0x0000000000dc1ff4, Base: 0x000000000112e064, Semaphore: 0x0000000001470950
Arguments: 8@%rax 8@-1144(%rbp) -4@-1148(%rbp) -4@-1152(%rbp)
stapsdt 0x00000060 NT_STAPSDT (SystemTap probe descriptors)
Provider: node
Name: http__client__response
Location: 0x0000000000dc23c5, Base: 0x000000000112e064, Semaphore: 0x000000000147095a
Arguments: 8@%rdx 8@-1144(%rbp) -4@%eax -4@-1152(%rbp)
stapsdt 0x00000089 NT_STAPSDT (SystemTap probe descriptors)
Provider: node
Name: http__client__request
Location: 0x0000000000dc285e, Base: 0x000000000112e064, Semaphore: 0x0000000001470958
Arguments: 8@%rax 8@%rdx 8@-2184(%rbp) -4@-2188(%rbp) 8@-2232(%rbp) 8@-2240(%rbp) -4@-2192(%rbp)
stapsdt 0x00000089 NT_STAPSDT (SystemTap probe descriptors)
Provider: node
Name: http__server__request
Location: 0x0000000000dc2e69, Base: 0x000000000112e064, Semaphore: 0x0000000001470954
Arguments: 8@%r14 8@%rax 8@-4344(%rbp) -4@-4348(%rbp) 8@-4304(%rbp) 8@-4312(%rbp) -4@-4352(%rbp)
For examples of using USDT events, see Static User Tracing.
5.5. Dynamic Tracing
The difference between tracepoints and dynamic tracing is shown in the following fi