Skip to content

Trace analysis using trace.py

tgrabiec edited this page Sep 10, 2014 · 1 revision

The purpose of the script is to display trace data, possibly applying various transformations to make the data more digestable. It operates on a trace file, which needs to be obtained beforehand.

See Tracepoints page to learn more about tracing OSv, how to enable it, etc.

Invoking the script

Trace analysis requires a collection of scripts from osv.git/scripts. The command line interface is accessed by invoking scripts/trace.py from base OSv directory like this:

$ scripts/trace.py -h

It's perfectly ok to invoke the script like above but it may be more convenient to create an alias for a shorthand version in your development environment using the following shell commands (you can put it in .bashrc):

OSV_BASE=~/src/osv
PYTHONPATH=PYTHONPATH:${OSV_BASE}/scripts
alias trace=${OSV_BASE}/scripts/trace.py

Now you can call the script like this from any location:

$ trace -h

For brevity I will use the shorter version in this document.

Obtaining trace file

There are currently two ways to extract trace data from an OSv instance.

GDB

If running in KVM on an accessible host, the trace file can be obtained via GDB.

To save the trace execute:

$ trace extract

This will invoke GDB and the trace will be saved to a file named tracefile (by default).

REST API

Trace data can also be downloaded in the trace dump format via REST. This data can then be converted into the format used by the analysis tools.

To download:

$ trace download \
         [--url <prot><address>:<port> (default http://192.168.122.89:8000/] \
         [-f <trace dump file> (default "buffers")]

Downloaded data can then be converted to a tracefile by using:

$ trace convert [-f <trace dump file> (default "buffers")] [tracefile]

Or all in one go:

$ trace download --convert \
         [--url <prot><address>:<port> (default http://192.168.122.89:8000/] \
         [-f <trace dump file> (default "buffers")] \
         [tracefile]

Which will download to a temporary file and convert on the fly. (Recommended).

Thus, in most cases when you run locally with KVM

$ trace extract

would be equivalent to:

$ trace download --convert 

Simple example

First, run OSv with tracepoints enabled. I use one of the tests in this example:

$ scripts/run.py -H --trace=memory_malloc --trace-backtrace -e "/tests/misc-fs-stress.so file"

Explanation of parameters:

  • -H means the OSv instance should not terminate on poweroff, we use it because we need to get the trace before it powers off
  • --trace=memory_malloc enables memory_malloc tracepoint which we are interested in collecting
  • --trace-backtrace enables collecting of call backtrace at tracepoints; we want to know where the tracepoints are called from

Let the test run for a while, then collect the trace:

$ trace extract

Then you can print some statistics about the trace file:

$ trace summary
Collected 31742 samples spanning 2.57 s

Tracepoint statistics:

  name             count
  ----             -----
  memory_malloc    31742

You can list the samples:

$ trace list
0xffffc0003e990010  0 1389705265.413128853 memory_malloc        buf=0xffffc0003cb73500, len=152
0xffffc0003e990010  0 1389705265.413129568 memory_malloc        buf=0xffffc00036677980, len=80
0xffffc0003e990010  0 1389705265.413130045 memory_malloc        buf=0xffffc0003cb44b80, len=80
...

Meaning of the columns is:

Thread ID         CPU            TIMESTAMP TRACEPOINT           ARGUMENTS                   
0xffffc0003e990010  0 1389705265.413128853 memory_malloc        buf=0xffffc0003cb73500, len=152

If backtrace was collected you can show it by passing -b option to trace.py list:

$ trace list -b
0x0000000000d0ee58  0 1389705320.885835886 memory_malloc        buf=0xffffc00035f47800, len=192   [log, operator(), zfs_kmem_alloc, zfs_range_lock, zfs_get_data, zil_lwb_commit, zil_close, zfsvfs_teardown, sys_umount2, unmount_rootfs]
0x0000000000d0ee58  0 1389705320.885836601 memory_malloc        buf=0xffffc0003fe1f158, len=8   [log, operator(), zfs_kmem_alloc, dmu_buf_hold_array_by_dnode, zfs_get_data, zil_lwb_commit, zil_close, zfsvfs_teardown, sys_umount2, unmount_rootfs]

Symbol resolution

The backtrace which is associated with trace samples contains only raw addresses. The script is using addr2line to resolve these symbols. For that to work it needs to have an object file with debug information. By default it uses build/release/loader.elf for that. You can change the default by passing -d or --debug, in which case it uses build/debug/loader.elf. You can also point it to arbitrary object file using -e or --exe option. You can disable symbol resolution by passing -x or --no-resolve in which case only raw addresses are displayed.

The user should ensure that correct object is used. There could be no sign, other than suspicious symbol chain in the trace, that the wrong object file was used for symbol resolution.

By default only function name is displayed. More information can be added using these options, any combination is allowed:

  • -L, --show-line-number, eg: log:320
  • -A, --show-address, eg: log @ 0x2032e3
  • -F, --show-file-name, eg: log /home/tgrabiec/src/osv/include/osv/trace.hh

Applying all of the above (-LAF) will yield log /home/tgrabiec/src/osv/include/osv/trace.hh:320 @ 0x2032e3

Symbol resolution options can be used with all commands which display backtrace.

Timed tracepoints

Tracepoints only carry information about a point in time a particular place in code was reached. However we can use a pair of trace samples to calculate the amount of time it took to execute the code between them. By convention the ending sample has the name starting with the name of entry sample and ending with either _ret or _err. The pair can be used to create a so called timed trace, which is basically the entry trace plus duration.

You can print timed trace summary using --timed option of summary command:

$ trace summary --timed
Collected 36843 samples spanning 8.85 s

Tracepoint statistics:

  name              count
  ----              -----
  sched_wait        18420
  sched_wait_ret    18423

Timed tracepoints [ms]:

  name                    count      min      50%      90%      99%    99.9%      max           total
  ----                    -----      ---      ---      ---      ---    -----      ---           -----
  sched_wait              18255    0.000    0.043   15.210 2226.438 4232.178 6125.616     1066349.137

You can list timed samples using list-timed command. Passing --sort duration sorts by duration, descending. The duration is in the column next to the one with tracepoint name:

$ trace list-timed --sort duration
0xffffc00009680000  1 1389721284.391018152 6125.616 sched_wait           
0xffffc0003cc1d000  0 1389721284.387134552 6022.635 sched_wait           
0xffffc0003f386000  0 1389721281.852320671 4893.802 sched_wait           
0xffffc0003eece000  0 1389721281.852322102 4893.799 sched_wait           
0xffffc0003ef37000  0 1389721281.852324009 4893.791 sched_wait           
0xffffc0003ef4c000  0 1389721281.852328539 4893.791 sched_wait           
0xffffc0003ef61000  0 1389721281.852325916 4893.791 sched_wait           
0xffffc0003eef8000  0 1389721281.852327824 4893.790 sched_wait           

Profiles

When backtrace collecting is enabled (--trace-backtrace) trace samples carry information from where they were hit. This information can be used to identify hot places in code, or generally annotate code with information carried by trace samples. Such annotated call-graph is called a "profile" in this document.

Hit profile

The prof command can be used to display a profile with the number of times tracepoints were hit:

Example:

$ trace prof
(100.00%, #31742) All
 |-- (79.68%, #25291) zfs_kmem_alloc
 |    |-- (16.95%, #5380) zio_create
 |    |    |-- (8.96%, #2844) dmu_buf_hold_array_by_dnode
 |    |    |    |-- (8.86%, #2811) zfs_get_data
 |    |    |    |    zil_lwb_commit
 |    |    |    |    zil_commit
 |    |    |    |    zfsvfs_teardown
 |    |    |    |    sys_umount2
 |    |    |    |   
 |    |    |    |-- (0.10%, #32) dmu_buf_hold_array
 |    |    |    |    space_map_sync
 |    |    |    |    metaslab_sync
 |    |    |    |    vdev_sync
 |    |    |    |    spa_sync
 |    |    |    |   
 |    |    |    \-- (0.00%, #1) space_map_load
 |    |    |         metaslab_activate
 |    |    |         metaslab_group_alloc
 |    |    |         zio_dva_allocate
 |    |    |         zio_execute
(...)

The tree shows all paths leading to memory_malloc. On every level nodes are sorted by number of hits in descending order. So the most frequent path is at the top.

We can see that 79.68% of samples come from zfs_kmem_alloc, which is called from different places. Among these places the most frequent is zio_create, which is hit in 16.95% of all samples. And so on.

When there is only one path leading to given node the path elements are lined up one after another like this:

- (8.86%, #2811) zfs_get_data
   zil_lwb_commit
   zil_commit
   zfsvfs_teardown
   sys_umount2

By default prof shows all tracepoints. If you have many tracepoints in the trace you can narrow down analysis to one tracepoint using -t option, eg prof -t memory_malloc.

Wait profile

Wait profile belongs to the class of profiles which annotate call-graph with a time-weight, typically representing time spent inside particular place in code. These profiles are called time profiles.

Wait profile is using time spent in sched::thread::wait() to annotate the call-graph. The time is calculated based on samples of two tracepoints: sched_wait and sched_wait_ret. To be able to show this profile, you need to start OSv with --trace=sched_wait* or enable these tracepoints dynamically.

Example

First collect some some traces by running OSv like this:

$ scripts/run.py -H --trace=vfs_pwritev* --trace=sched_wait* --trace-backtrace -e "tests/misc-fs-stress.so ff"

The vfs_pwritev* tracepoints will be needed later to investigate latency of writes.

After the test is done execute:

$ trace extract
$ trace prof-wait -g thread

You will see a lot of output looking like this:

=== Thread 0xffffc0003d511000 ===

12.10 s (100.00%, #581) All
 |-- 12.10 s (99.98%, #561) do_wait_until
 |    TQ_SLEEP
 |    sched::thread::main()
 |   
 |-- 1.35 ms (0.01%, #14) waiter::wait(sched::timer*) const
 |    wait_until<memory::pool::free_different_cpu(memory::pool::free_object*, unsigned int)::__lambda1>
 |    memory::pool::free(void*)
 |    |-- 582.01 us (0.00%, #3) zio_destroy
 |    |    zio_execute
 |    |    zio_notify_parent
 |    |    zio_execute
 |    |    taskqueue_run_locked
 |    |   
 |    |-- 466.61 us (0.00%, #4) dbuf_write_done
 |    |    arc_write_done
 |    |    zio_done
 |    |    zio_execute
 |    |    zio_notify_parent
 |    |   
 |    |-- 274.89 us (0.00%, #6) zio_remove_child
 |    |    zio_execute
 |    |    |-- 196.16 us (0.00%, #4) zio_notify_parent
 |    |    |    zio_execute
 |    |    |    taskqueue_run_locked
 |    |    |   
 |    |    \-- 78.73 us (0.00%, #2) taskqueue_run_locked
 |    |         sched::thread::main()

What to do with all this information depends on what we're trying to find out. The test we ran is measuring throughput of POSIX write() which eventually calls zfs_write(). In addition to that the test records latency of write and prints a summary of after the test, which may look like this:

Latency of write() [s]:
0     0.000000092
0.5   0.000004264
0.9   0.000005597
0.99  0.000012213
0.999 0.004506400
1.0   2.430870533

As we can see, in this run there were calls which took as long as 2.4s to return. Suppose we want to find out why. Let's see if the latency is really inside vfs_pwritev:

$ trace summary --timed
(...)
Timed tracepoints [ms]:

  name                    count      min      50%      90%      99%    99.9%      max           total
  ----                    -----      ---      ---      ---      ---    -----      ---           -----
  vfs_pwritev             44949    0.004    0.005    0.008    1.089   18.284 2430.869        8086.956
(...)

Yes, max duration of vfs_pwritev is 2430ms, which is what the test reports. Let's locate the long sample by using list-timed:

$ trace list-timed --sort=duration | grep pwritev | head
0xffffc0003e990010  1 1389797838.338931799 2430.869 vfs_pwritev          3 0x00002000002ff9b0 0x2 0x-1
0xffffc0003e990010  0 1389797842.818786860 578.166 vfs_pwritev          3 0x00002000002ff9b0 0x2 0x-1
0xffffc0003e990010  1 1389797837.760517836 578.070 vfs_pwritev          3 0x00002000002ff9b0 0x2 0x-1
0xffffc0003e990010  0 1389797843.445839882 232.688 vfs_pwritev          3 0x00002000002ff9b0 0x2 0x-1
0xffffc0003e990010  1 1389797840.812511921 143.236 vfs_pwritev          3 0x00002000002ff9b0 0x2 0x-1
0xffffc0003e990010  0 1389797842.681091785  95.171 vfs_pwritev          3 0x00002000002ff9b0 0x2 0x-1
0xffffc0003e990010  1 1389797841.075021982  86.724 vfs_pwritev          3 0x00002000002ff9b0 0x2 0x-1

We now know when the issue happened. Now we can pass that information to prof-wait (note the presence of time units in parameters):

$ trace prof-wait --since 1389797838.338931799 --period 2430.869ms

What we do here is called time slicing. The profile will include only samples which overlap with specified period. If a sample crosses the edge of the time range its duration will be trimmed to reflect only the overlapping part.

Searching for zfs_write gives an evidence that the pause is caused by waiting on ZIO to complete:

=== Thread 0xffffc0003e990010 ===

2.43 s (100.00%, #1) All
 waiter::wait(sched::timer*) const
 condvar::wait(lockfree::mutex*, unsigned long)
 zio_wait
 dmu_tx_count_write
 dmu_tx_hold_write
 zfs_write
 vfs_file::write(uio*, int)
 sys_write

The investigation process now continues. Why are we waiting for ZIO to complete during async write? This subject is out of scope of this tutorial.

Profile options

The way a profile is displayed can be controlled with various options.

Grouping

By default samples are presented in on tree. The tool allows to group samples by a couple of properties via -g or --group-by option:

  • -g thread - groups data by thread
  • -g cpu - groups data by CPU

Orientation

By default frames closer to the tracepoint call-site are closer to the root. We call this kind of profile callee-oriented, because callees are higher. It shows paths leading to tracepoint.

You can reverse the orientation by passing -r or --caller-oriented. In this case we will have the most-bottom frames in the root, its callees branching from them, and finally tracepoint call-sites in the leafs. This orientation is useful to show how bigger operations are composed.

NOTE: be careful with caller-oriented profiles. If the maximum backtrace length used during collecting of trace is too low it may not cover bottom-most frames (thread_main()) for some samples. As a result traces which would otherwise end up in a single root will be spread across multiple 1st-level nodes.

Example of memory_malloc hit profile, caller oriented:

$ trace prof -r
(100.00%, #31742) All
 |-- (29.60%, #9397) sched::thread::main()
 |    |-- (10.02%, #3182) txg_sync_thread
 |    |    |-- (10.02%, #3180) spa_sync
 |    |    |    |-- (6.57%, #2086) bplist_iterate
 |    |    |    |    spa_free_sync_cb
 |    |    |    |    |-- (3.29%, #1043) zio_create
 |    |    |    |    |    zfs_kmem_alloc
 |    |    |    |    |   
 |    |    |    |    \-- (3.29%, #1043) zio_add_child
 |    |    |    |         zfs_kmem_alloc
 |    |    |    |   
 |    |    |    |-- (1.90%, #603) dsl_pool_sync
 |    |    |    |    |-- (1.78%, #565) dmu_objset_sync
 |    |    |    |    |    |-- (1.68%, #532) zil_sync
 |    |    |    |    |    |    bplist_append
 |    |    |    |    |    |    zfs_kmem_alloc
 |    |    |    |    |    |   
 |    |    |    |    |    |-- (0.03%, #11) zio_create
 |    |    |    |    |    |    zfs_kmem_alloc
 |    |    |    |    |    |   
 |    |    |    |    |    |-- (0.03%, #11) zio_add_child
 |    |    |    |    |    |    zfs_kmem_alloc
 |    |    |    |    |    |   
 |    |    |    |    |    \-- (0.03%, #11) arc_write
 |    |    |    |    |         zfs_kmem_alloc

Function rooting

Some times you want to focus on particular function. For example, I would like to know the paths leading to memory allocation which are coming from dsl_pool_sync(). There's an option which allows that: --function. If that option is used with --caller-oriented, it will show callees of that function. Otherwise it will show the callers.

Example (caller oriented, showing callees of dsl_pool_sync):

$ trace prof -r --function dsl_pool_sync
(100.00%, #813) All
 dsl_pool_sync
 |-- (76.51%, #622) dmu_objset_sync
 |    |-- (65.44%, #532) zil_sync
 |    |    bplist_append
 |    |    zfs_kmem_alloc
 |    |   
 |    |-- (7.01%, #57) dnode_sync
 |    |    |-- (6.64%, #54) dbuf_sync_leaf
 |    |    |    dbuf_write
 |    |    |    |-- (2.21%, #18) zio_create
 |    |    |    |    zfs_kmem_alloc
 |    |    |    |   
 |    |    |    |-- (2.21%, #18) zio_add_child
 |    |    |    |    zfs_kmem_alloc
 |    |    |    |   
 |    |    |    \-- (2.21%, #18) arc_write
 |    |    |         zfs_kmem_alloc
 |    |    |   
(...)

Filtering

Often the profile has too much data. There are some switches which you can use to filter data. One of them is --max-levels which reduces the number of tree levels displayed in the profile. Example:

$ trace prof --max-levels 1
(100.00%, #31742) All
 |-- (79.68%, #25291) zfs_kmem_alloc
 |   
 |-- (14.90%, #4728) operator new(unsigned long)
 |   
 |-- (3.30%, #1047) alloc_bio
 |   
 |-- (2.10%, #665) virtio::vring::add_buf(void*)
 |    virtio::vring::add_buf_wait(void*)
 |    virtio::blk::make_request(bio*)
 |    blk_strategy
 |    vdev_disk_start_bio
 |    zio_vdev_io_start
 |    zio_execute
 |   
 \-- (0.03%, #11) AcpiUtAllocate

You can also use --min-duration in time profiles to hide nodes with resident time shorter than specified.

You can filter-out infrequently hit paths using --min-hits, eg:

  • --min-hits 500 - shows only paths which were hit at least 500 times
  • --min-hist 10% - shows only paths which were visited at least in 10% of cases

Tracing network layer

OSv has support for creating trace samples from network packets. To enable capturing of packets enable the following tracepoints:

  • net_packet_in - generated when packet is received
  • net_packet_out - generated when packet is sent over non-local interface.
  • net_packet_handling - generated when packet is passed to the protocol handler and was previously queued. Not generated for packets which are handled immediately upon net_packet_in.
  • or just net_packet* to enable all

For example, the guest could be started like this:

$ sudo scripts/run.py -nv --trace net_packet*

To demonstrate that tracing works, let's put some network traffic through OSv:

$ curl http://osv:8000/os/version
"v0.07-113-gaeafbcb"

Let's extract the trace:

$ trace extract

Now we can examine the packets:

$ trace tcpdump
10:26:54.645328 IP 192.168.122.1.41735 > 192.168.122.89.8000: Flags [S], seq 3199657602, win 29200, options [mss 1460,sackOK,TS val 178516383 ecr 0,nop,wscale 7], length 0
10:26:54.645332 IP 192.168.122.1.41735 > 192.168.122.89.8000: Flags [S], seq 3199657602, win 29200, options [mss 1460,sackOK,TS val 178516383 ecr 0,nop,wscale 7], length 0
10:26:54.645423 ARP, Request who-has 192.168.122.1 tell 192.168.122.89, length 28
10:26:54.645529 ARP, Reply 192.168.122.1 is-at 56:a7:0e:6b:ac:e5, length 28
10:26:54.645530 ARP, Reply 192.168.122.1 is-at 56:a7:0e:6b:ac:e5, length 28
10:26:54.645554 IP 192.168.122.89.8000 > 192.168.122.1.41735: Flags [S.], seq 2124970473, ack 3199657603, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 4266404729 ecr 178516383], length 0
10:26:54.645660 IP 192.168.122.1.41735 > 192.168.122.89.8000: Flags [.], ack 1, win 229, options [nop,nop,TS val 178516383 ecr 4266404729], length 0
10:26:54.645662 IP 192.168.122.1.41735 > 192.168.122.89.8000: Flags [.], ack 1, win 229, options [nop,nop,TS val 178516383 ecr 4266404729], length 0

Note that some data packets have this X bytes missing! warning. That's because there is a limit on how many bytes are captured per-packet. That limit is currently 128 bytes (See capture_limit @ net_trace.cc).

You can also see raw samples in trace list:

$ trace list
(...)
0xffffc0003e24b2b0 virtio-net-rx    0 1398795409.487980366 net_packet_in       proto=9, data={52 54 00 12 34 56 ba ec 0e fc 7f a4 08 00 45 00 00 34 be f7 40 00 40 06 06 21 c0 a8 7a 01 c0 a8 7a 59 aa 60 1f 40 d2 07 1d a7 6e 9f 0f 43 80 10 00 e5 75 d2 00 00 01 01 08 0a 05 7d a3 a0 ee d6 97 0a}
0xffffc0003e24b2b0 virtio-net-rx    0 1398795409.488048553 net_packet_in       proto=9, data={52 54 00 12 34 56 ba ec 0e fc 7f a4 08 00 45 00 00 34 be f8 40 00 40 06 06 20 c0 a8 7a 01 c0 a8 7a 59 aa 60 1f 40 d2 07 1d a7 6e 9f 0f 44 80 11 00 e5 75 d2 00 00 01 01 08 0a 05 7d a3 a0 ee d6 97 0a}

You can convert packet data to human readable form in the listing by passing --tcpdump flag:

$ trace list --tcpdump
20:16:49.486215 IP 192.168.122.1.43616 > 192.168.122.89.8000: Flags [S], seq 3523681620, win 29200, options [mss 1460,sackOK,TS val 92119968 ecr 0,nop,wscale 7], length 0
20:16:49.486269 ARP, Request who-has 192.168.122.1 tell 192.168.122.89, length 28
20:16:49.486457 ARP, Reply 192.168.122.1 is-at ba:ec:0e:fc:7f:a4, length 28
20:16:49.486467 IP 192.168.122.89.8000 > 192.168.122.1.43616: Flags [S.], seq 1855917805, ack 3523681621, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 4007040777 ecr 92119968], length 0
20:16:49.486709 IP 192.168.122.1.43616 > 192.168.122.89.8000: Flags [.], ack 1, win 229, options [nop,nop,TS val 92119968 ecr 4007040777], length 0
20:16:49.487197 IP truncated-ip - 20 bytes missing! 192.168.122.1.43616 > 192.168.122.89.8000: Flags [P.], seq 1:83, ack 1, win 229, options [nop,nop,TS val 92119968 ecr 4007040777], length 82
20:16:49.487758 IP truncated-ip - 23 bytes missing! 192.168.122.89.8000 > 192.168.122.1.43616: Flags [P.], seq 1:86, ack 83, win 1040, options [nop,nop,TS val 4007040778 ecr 92119968], length 85
20:16:49.487840 IP 192.168.122.89.8000 > 192.168.122.1.43616: Flags [F.], seq 86, ack 83, win 1040, options [nop,nop,TS val 4007040778 ecr 92119968], length 0
20:16:49.487980 IP 192.168.122.1.43616 > 192.168.122.89.8000: Flags [.], ack 86, win 229, options [nop,nop,TS val 92119968 ecr 4007040778], length 0
20:16:49.488048 IP 192.168.122.1.43616 > 192.168.122.89.8000: Flags [F.], seq 83, ack 87, win 229, options [nop,nop,TS val 92119968 ecr 4007040778], length 0
20:16:49.488061 IP 192.168.122.89.8000 > 192.168.122.1.43616: Flags [.], ack 84, win 1040, options [nop,nop,TS val 4007040779 ecr 92119968], length 0

The last method may be very useful in some situations, for example when one needs to correlate network packet traffic with other tracepoints.

Another thing you can use the listing for is to determine which path in the OS triggered particular packet to be sent (remember to pass --trace-backtrace to run.py). Example:

$ trace list --tcpdump -bL
0xffffc0003e35a010                  0 1399278414.646342754 net_packet_handling  IP truncated-ip - 20 bytes missing! 192.168.122.1.41735 > 192.168.122.89.8000: Flags [P.], seq 3199657603:3199657685, ack 2124970474, win 229, options [nop,nop,TS val 178516383 ecr 4266404729], length 82   [log:363, trace_slow_path:343, operator():29, log_packet_handling(mbuf*, int):153, tcp_net_channel_packet:3654, operator():3689, _M_invoke:2071, std::function<void (mbuf*)>::operator()(mbuf*) const:2464, net_channel::process_queue():37, flush_net_channel:587, soreceive_generic:1243, kern_recvit:595, linux_recvmsg:1057, recvmsg:183, 0x10000009abf7, 0x10000009ac75, 0x10000009f26a]
0xffffc0003e35a010                  0 1399278414.647043943 net_packet_out       IP truncated-ip - 22 bytes missing! 192.168.122.89.8000 > 192.168.122.1.41735: Flags [P.], seq 2124970474:2124970558, ack 3199657685, win 1040, options [nop,nop,TS val 4266404731 ecr 178516383], length 84   [log:363, trace_slow_path:343, operator():29, log_packet_out(mbuf*, int):148, ether_output_frame:392, ether_output:361, ip_output(mbuf*, mbuf*, route*, int, ip_moptions*, inpcb*):613, tcp_output:1235, tcp_output_send:282, tcp_usr_send:839, sosend_generic:1083, kern_sendit:506, linux_sendit:499, sendmsg:239]
0xffffc0003e35a010                  0 1399278414.647149563 net_packet_out       IP 192.168.122.89.8000 > 192.168.122.1.41735: Flags [F.], seq 2124970558, ack 3199657685, win 1040, options [nop,nop,TS val 4266404731 ecr 178516383], length 0   [log:363, trace_slow_path:343, operator():29, log_packet_out(mbuf*, int):148, ether_output_frame:392, ether_output:361, ip_output(mbuf*, mbuf*, route*, int, ip_moptions*, inpcb*):613, tcp_output:1235, tcp_output_disconnect:304, tcp_usr_shutdown:713, sys_shutdown:719, shutdown:298, 0x10000009ab60, 0x10000009b17b]

Tracing lock contention

Lock contention often drags the performance down. Contended locks can be easily identified using mutex_lock_wait tracepoint, which is hit when a thread is about to block after a failed lock attempt. In addition to that mutex_lock_wake tracepoint may be used together with mutex_lock_wait to measure the duration for which threads were scheduled out because of contention and thus asses its cost.

There is a command which shows a duration profile of mutex waits: trace prof-lock. To see that profile, the guest should be started like this:

$ scripts/run.py --trace mutex_lock_wa* --trace-backtrace

Then:

$ trace extract
$ trace summary
Collected 21376 samples spanning 14.66 s

Tracepoint statistics:

  name               count
  ----               -----
  mutex_lock_wait    10771
  mutex_lock_wake    10605

$ trace prof-lock
8.38 m (100.00%, #10770) All
 |-- 7.86 m (93.80%, #7363) lockfree::mutex::lock()
 |    lockfree_mutex_lock
 |    |-- 2.87 m (34.18%, #2494) in_matroute
 |    |    rtalloc1_fib
 |    |    rtalloc_ign_fib
 |    |    ip_output(mbuf*, mbuf*, route*, int, ip_moptions*, inpcb*)
 |    |    tcp_output
 |    |    |-- 2.78 m (33.11%, #2321) tcp_output_send
 |    |    |    tcp_usr_send
 |    |    |    sosend_generic
 |    |    |    kern_sendit
 |    |    |    sendit
 |    |    |   
 |    |    |-- 5.32 s (1.06%, #171) tcp_do_segment
 |    |    |    tcp_net_channel_packet
 |    |    |    std::function<void (mbuf*)>::operator()(mbuf*) const
 |    |    |    net_channel::process_queue()
 |    |    |    |-- 4.15 s (0.83%, #128) flush_net_channel
 |    |    |    |    sosend_generic
 |    |    |    |   
 |    |    |    \-- 1.16 s (0.23%, #43) socket_file::poll(int)
(...)

The profile above reports that a lock taken inside in_matroute is contended, causing threads to be stalled for almost 3 minutes in 14.66 seconds. There are other contended locks too, but this is the most contended one (34%).

The reported duration is longer than the tracing period because blocks of all threads are summed up. Individual threads cannot block for longer than tracing period, which can be seen when the profile is grouped by thread:

$ trace prof-lock -g thread

=== Thread 0xffffc00018c75010 ===

2.78 s (100.00%, #65) All
 |-- 2.16 s (77.59%, #34) lockfree::mutex::lock()
 |    lockfree_mutex_lock
 |    |-- 963.92 ms (34.69%, #12) in_matroute
 |    |    rtalloc1_fib
 |    |    rtalloc_ign_fib
 |    |    ip_output(mbuf*, mbuf*, route*, int, ip_moptions*, inpcb*)
 |    |    tcp_output
 |    |    |-- 942.47 ms (33.92%, #11) tcp_output_send
 |    |    |    tcp_usr_send
 |    |    |    sosend_generic
 |    |    |    kern_sendit
 |    |    |    sendit
 |    |    |   
(...)

It's not always obvious which lock is contended based on the function name only when that function takes multiple locks. The exact lock can be identified by uncovering line numbers with -L and file names with -F, if needed, like this:

$ trace prof-lock -LF
8.38 m (100.00%, #10770) All
 |-- 7.86 m (93.80%, #7363) lockfree::mutex::lock() /home/tgrabiec/src/osv/core/lfmutex.cc:87
 |    lockfree_mutex_lock /home/tgrabiec/src/osv/core/lfmutex.cc:282
 |    |-- 2.87 m (34.18%, #2494) in_matroute /home/tgrabiec/src/osv/bsd/sys/netinet/in_rmx.cc:114
 |    |    rtalloc1_fib /home/tgrabiec/src/osv/bsd/sys/net/route.cc:346

Tracing memory allocations

Memory allocations can be tracked using memory_malloc_* tracepoints. Collected data contains requested alignment and size, the allocator responsible for each allocation as well as the number of bytes actually allocated, which, due to the internal fragmentation, often will be different than the requested object size. There is a command trace memory-analyzer which shows this information in a readable way. Moreover, if backtraces are collected code responsible for particular type of memory allocations is identified and presented in a manner similar to the other tracing commands described in this document.

Example

To analyse memory allocations, start the guest like this:

$  scripts/run.py --trace memory_malloc_* --trace-backtrace -H -e tests/tst-align.so

When the test finishes execute the following commands to extract and print memory allocation analysis:

$ trace extract
$ trace memory-analyzer

Output will be similar to this:

All
count: 7714
unused: 15%
 |-- large
 |   count: 2043
 |   unused: 15%
 |    |-- 16 alignment
 |    |   count: 153
 |    |   unused: 19%
 |    |    |-- 8192 allocated
 |    |    |   count: 50
 |    |    |   unused: 52%
 |    |    |    |-- 2560 requested
 |    |    |    |   count: 5
 |    |    |    |   unused: 68%
 |    |    |    |    \-- (100.00% #5) zfs_kmem_alloc
 |    |    |    |        vdev_queue_io_to_issue
 |    |    |    |        vdev_queue_io_done
 |    |    |    |        zio_vdev_io_done
 |    |    |    |        zio_execute
 |    |    |    |        taskqueue_run_locked
 |    |    |    |        taskqueue_thread_loop
 |    |    |    |        sched::thread::main()
 |    |    |    |   
 |    |    |    |-- 3072 requested
 |    |    |    |   count: 2
 |    |    |    |   unused: 62%
 |    |    |    |    \-- (100.00% #2) zfs_kmem_alloc
 |    |    |    |        vdev_queue_io_to_issue
 |    |    |    |        vdev_queue_io_done
 |    |    |    |        zio_vdev_io_done
 |    |    |    |        zio_execute
 |    |    |    |        taskqueue_run_locked
 |    |    |    |        taskqueue_thread_loop
 |    |    |    |        sched::thread::main()

large memory allocator serviced 2043 out of 7714 registered allocations. 153 of them requested the default alignment 16. 50 resulted in allocating 8192 bytes, but five of them asked only for 2560 bytes.

Filtering

Rare allocations can be filtered out using option --min-count which makes memory analyzer show only allocations at least as frequent as the given argument. Moreover, backtraces can be filtered using options --min-hits and --max-levels exactly in the same way like it is described in "Profile options" section of this document. When option --no-backtrace is added backtraces will not be shown at all even if they were collected.

Grouping

Using -g or --group-by option memory allocations can be grouped by the following criteria:

  • allocator - memory allocator responsible for an allocation
  • alignment - requested alignment
  • allocated - number of bytes actually allocated
  • requested - number of requested bytes

The default grouping is -g allocator alignment allocated requested.

Example:

$ trace memory-analyzer -g alignment --no-backtrace
All
count: 7714
unused: 15%
 |-- 16 alignment
 |   count: 4924
 |   unused: 18%
 |   
 |-- 32 alignment
 |   count: 200
 |   unused: 33%
 |   
 |-- 512 alignment
 |   count: 200
 |   unused: 74%
 |   
 |-- 1024 alignment
 |   count: 600
 |   unused: 26%
 |   
 |-- 2048 alignment
 |   count: 290
 |   unused: 25%
 |   

Sorting

Option -s or --sort allows sorting allocations by one of the following criteria:

  • count - number of allocations of that type
  • unused - percentage of memory lost due to internal fragmentation
  • size - size of an allocation

Example:

All
count: 7714
unused: 15%
 |-- mempool
 |   count: 5671
 |   unused: 24%
 |    |-- 2048 alignment
 |    |   count: 200
 |    |   unused: 74%
 |    |    \-- 2048 allocated
 |    |        count: 200
 |    |        unused: 74%
 |    |         |-- 8 requested
 |    |         |   count: 100
 |    |         |   unused: 99%
 |    |         |   
 |    |         \-- 1024 requested
 |    |             count: 100
 |    |             unused: 50%
 |    |   
 |    |-- 512 alignment
 |    |   count: 200
 |    |   unused: 74%

CPU sampler

OSv has a simple timer-based stack-sampling profiler which is using tracing infrastructure.

It can be enabled from the command line with the --sampler switch:

$ scripts/run.py --sampler 

By default it will sample each CPU 1000 times a second. The frequency can be changed by putting the desired frequency in Hz after the --sampler switch, eg the following will result in sampling rate of 500 times a seconds:

$ scripts/run.py --sampler 500

It can also be controlled remotely using RESTful API. To start it:

$ curl -X POST http://osv:8000/trace/sampler?freq=1000

To stop it pass 0 as the frequency:

$ curl -X POST http://osv:8000/trace/sampler?freq=0

You can extract samples at any time with trace extract or trace download. The summary will look like this:

$ trace summary
Collected 23999 samples spanning 6.42 s

Tracepoint statistics:

  name            count
  ----            -----
  sampler_tick    23999

Currently the best way to visualize the samples is by using trace prof, which will present a statistical picture of what's running on CPUs:

$ trace prof
(100.00%, #23998) All
 |-- (84.25%, #20218) sched::cpu::idle()
 |    operator()
 |    sched::thread::main()
 |   
 |-- (3.51%, #843) sched::cpu::do_idle()
 |    sched::cpu::idle()
 |    operator()
 |    sched::thread::main()
 |   
 |-- (2.05%, #493) do_wait_until
 |    |-- (1.82%, #436) wait_until<virtio::virtio_driver::wait_for_queue(virtio::vring*, bool (virtio::vring::*)()const)::__lambda3>
 |    |    virtio::virtio_driver::wait_for_queue(virtio::vring*, bool (virtio::vring::*)() const)
 |    |    |-- (1.70%, #409) virtio::net::receiver()
 |    |    |    operator()
 |    |    |    sched::thread::main()
 |    |    |   
 |    |    |-- (0.09%, #21) virtio::blk::req_done()
 |    |    |    operator()
 |    |    |    sched::thread::main()
 |    |    |   
 |    |    \-- (0.03%, #6) virtio::rng::refill()
 |    |         virtio::rng::worker()
 |    |         operator()
 |    |   
 |    |-- (0.23%, #55) wait_until<mmu::flush_tlb_all()::__lambda3>
 |    |    mmu::flush_tlb_all()
 |    |    |-- (0.23%, #54) pagecache::access_scanner::run()
 |    |    |    sched::thread::main()
 |    |    |   

The profile above tells us that the guest was pretty much idle in terms of CPU usage.

Note that the profile shows all the collected samples which in this case span 6 seconds. The maximum period may vary depending on the size of trace buffers and sampling frequency. It may be worthwhile to use time slicing to see how the load evolved. For example, to show the last second:

$ end=$(trace list | tail -n1 | awk '{print $4}')
$ trace prof --until $end --period 1s
(100.00%, #3583) All
 |-- (81.11%, #2906) sched::cpu::idle()
 |    operator()
 |    sched::thread::main()
 |   
 |-- (4.61%, #165) sched::cpu::do_idle()
 |    sched::cpu::idle()
 |    operator()
 |    sched::thread::main()
 |   
 |-- (4.61%, #165) void sched::thread::do_wait_for<lockfree::mutex, sched::wait_object<net_channel>, sched::wait_object<waitqueue>, sched::wait_object<sched::timer>, sched::wait_object<signal_catcher> >(lockfree::mutex&, sched::wait_object<net_channel>&&, sched::wait_object<waitqueue>&&, sched::wait_object<sched::timer>&&, sched::wait_object<signal_catcher>&&)
 |    void sched::thread::wait_for<net_channel&, waitqueue&, sched::timer&, signal_catcher&>(lockfree::mutex&, net_channel&, waitqueue&, sched::timer&, signal_catcher&)
 |    int sbwait_tmo<osv::clock::uptime>(socket*, sockbuf*, boost::optional<std::chrono::time_point<osv::clock::uptime, osv::clock::uptime::duration> >)
 |    socket_file::poll_sync(pollfd&, boost::optional<std::chrono::time_point<osv::clock::uptime, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > >)
 |    poll_one
 |    poll

A note on sampling precision

It is important to note that this sampler has some limitations which should be taken under consideration when interpreting the results. Because the profiler is using regular APIC timer for sampling it will only show code which is running with interrupts enabled. It is good to compare the output with perf kvm top.

Tracing one-liners

Some examples assume you have these tools from https://github.com/tgrabiec/shell-tools on PATH.

How many threads with given name run on each vCPU?

$ trace list | clmn 3 2 1 | sort | uniq | clmn 1 2 | uniq -c
     32 0 >>>>>>java.so
      3 0 >>>>>java.so
      1 0 async_worker0
      1 0 balancer0
      1 0 idle0
      1 0 page_pool_l1_0
      1 0 percpu0
      1 0 rand_harvestq
      1 0 solthread-0x2ab
      1 0 virtio-tx-worke
      1 0 zil_clean
      6 1 >>>>>>>java.so
     18 1 >>>>>>java.so
      4 1 >>>>>java.so
      3 1 >>java.so
      1 1 async_worker1

Read: There are 32 threads with name >>>>>>java.so on vCPU 0.

Clone this wiki locally