Stack walking should use eh_elfs from frdwarf #83

Open
stephenrkell opened this issue Mar 2, 2024 · 2 comments

@stephenrkell (Owner)

Currently we use vanilla libunwind, but to improve performance for stack-heavy workloads, we could bundle the 'compiled' frame information from frdwarf/dwarf-assembly and then use its special libunwind. In particular, the compiled frame information should be generated as a further kind of metadata.

@stephenrkell (Owner, Author) commented Mar 2, 2024

I have sketched out some improvements to the eh_elfs approach, which eliminate the binary search tree in favour of a more compact table. Implementing and evaluating this would be an extension of the work. Those notes (written a few years ago) follow.


Our eh-elfs "compiled" DWARF frame information takes a lot of space, mainly because of the binary search tree that maps an address to the surrounding range of addresses over which the unwind description is identical. These ranges then map onto a (much smaller) set of code fragments that write the register values, i.e. that actually perform the unwind.

Even simpler idea: instead of the binary tree, use a vector. Each record of the vector represents a range, and the vector is sorted by base address. Do binary search over this. (Cost: roughly one pointer per range.)

More precisely: each vector entry needs to record the following (a rough sketch in C follows the list):

  • base address (32 bits, as a DSO-base-relative vaddr)
  • length (how many bits?)
  • pointer to its fragment (32 bits minus a few)
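
(Purely as an illustration, not something that exists yet: a minimal C sketch of such an entry and its lookup. The 20/12 bit split and all names are assumptions, since the notes leave the exact widths open.)

  #include <stdint.h>
  #include <stddef.h>

  /* One entry per address range, sorted by base (hypothetical layout:
     the notes leave the length and fragment-index widths open). */
  struct range_entry {
      uint32_t base;       /* DSO-base-relative vaddr of the range start */
      uint32_t len  : 20;  /* range length in bytes (width is a guess) */
      uint32_t frag : 12;  /* index of the unwind code fragment */
  };                       /* 8 bytes total, matching the estimate below */

  /* Binary search over the sorted vector; returns the covering entry or NULL. */
  static const struct range_entry *
  range_lookup(const struct range_entry *v, size_t n, uint32_t addr)
  {
      size_t lo = 0, hi = n;
      while (lo < hi) {
          size_t mid = lo + (hi - lo) / 2;
          if (addr < v[mid].base)
              hi = mid;
          else if (addr >= v[mid].base + v[mid].len)
              lo = mid + 1;
          else
              return &v[mid];
      }
      return NULL;
  }

With 8 bytes per entry this matches the 8 x 20898 estimate further down.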

How big are the frame tables relative to text? Let's try glibc 2.28.

  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
  [14] .text             PROGBITS        0000000000025320 025320 14608b 00  AX  0   0 16
  [18] .eh_frame_hdr     PROGBITS        000000000018e33c 18e33c 0061f4 00   A  0   0  4
  [19] .eh_frame         PROGBITS        0000000000194530 194530 020b38 00   A  0   0  8

So for 1335435 bytes of text, we get 133944 bytes of .eh_frame ... roughly 10% overhead.

How many distinct address-range entries would our table have?

$ readelf -wF /lib/x86_64-linux-gnu/libc-2.28.so | grep '^[0-9a-f]\{16\} [a-z]' | wc -l
20898

How many distinct row contents would our table have?

$ readelf -wF /lib/x86_64-linux-gnu/libc-2.28.so | sed -n '/\(^[0-9a-f]\{16\} \)\([a-z]\)/ {s//\2/;p}' | sort | uniq | wc -l
361

OK. So if (hypothetically) we spend 8 bytes per address range (in the vector) and 16 bytes per distinct row (i.e. 16 bytes is the average length of the instruction fragment), we would have space overhead:

$ echo $(( 8 * 20898 + 16 * 361 ))
172960
... a little more than the 133944 bytes of .eh_frame, but not a lot more

What's the length distribution of the address ranges?

$ readelf -wF /lib/x86_64-linux-gnu/libc-2.28.so | grep '^[0-9a-f]\{16\} [a-z]' | grep -v '^0000000000000000'| sort -n | while read addr rest; do if [[ -n "$last_addr" ]]; then echo $(( 0x$addr - 0x$last_addr )); fi; last_addr="$addr"; done | sort -n | uniq -c 
frequency  range size (bytes)
-----------------------------
   4128 1
   4920 2
    420 3
   1824 4
    811 5
    434 6
    871 7
    453 8
    325 9
    285 10
    222 11
    657 12
    139 13
    106 14
    112 15
    195 16
     84 17
     98 18
     77 19
    110 20
     45 21
     36 22
     31 23
     80 24
     35 25
     40 26
     18 27
     44 28
     21 29
     48 30
     27 31
     94 32
     26 33
     55 34
     17 35
     31 36
     17 37
     28 38
     26 39
     40 40
     20 41
     18 42
     ...
      1 3882
      1 3915
      1 3973
      1 4019
      1 4053
      1 4054
      1 4060
      1 4061
      1 4064
      1 4067
      1 4072
      1 4079
      1 4096
      1 4097
      1 4127
      1 4168
      1 4256
      1 4264
      1 4297
      1 4311
      1 4616
      1 4652
      1 4992
      1 5006
      1 5037
      1 5104
      1 5112
    ... 
      1 69854
      1 183500
      1 198632
      1 240122
      1 449191
      1 449197
      1 449215
      1 453732
      1 501585

(suspicious about some of the last ones, because we also got some negative answers -- probably because "sort -n" does not order the hexadecimal addresses correctly, so some deltas were taken out of order; a plain lexicographic sort of the fixed-width addresses would fix this)
      1 -1337237
      1 -1328779
      1 -1327843
      1 -1132634
      1 -876492

OK, so if we can make a table with 8-byte entries and if our return functions can fit in 16 bytes average, then we should have a better time/space trade-off than the "fully compiled" approach. It would still use a bit more space.

Can we do better?

We could possibly play with 'shortcut vectors' to avoid representing base addresses in full, i.e. to cut down on the 8-byte size of the vector entry.

The shortcut vector is basically the first stage in a two-stage lookup. Say we divide the address space into 2048-byte regions. We then discard all but the bottom 11 bits of each vector entry's base address field. The first stage of the lookup involves recovering those bits. How?

The shortcut vector has one entry for every 2048-byte region, and points to the lowest 'main' vector entry overlapping any part of that region. The main vector entry may span multiple 2048-byte regions, i.e. its implicit base address may be a predecessor of the looked-up 2048-byte region. We could just forbid this and require duplicate entries in this case. Or we could store the delta somewhere -- which would need to be in the shortcut vector (because, by construction, such a main vector entry will be reached from more than one shortcut vector entry).

(Thinking more carefully: let's flip this. For any vector entry, what is the base? The entry has its own range. We need to look at the 2048-byte regions that that range overlaps. The base is the lowest of these 2048-byte-aligned addresses. i.e. the lowest 2048-byte region that could reach that vector entry. Is this just "the entry's base address aligned down to 2048 bytes?" Yes, I think so. We only need to record the base modulo 2048.)

So what about a record whose range spans a 2048-byte boundary? Suppose we reach it from the higher 2048-byte region. Then our shortcut record needs to know how many 2048-byte boundaries we cross to get down to the bottom of the vector entry's range, i.e. "how many 2048s to round down" ... and the maximum count we can store bounds the maximum range size of one record.
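
(A hedged one-function C sketch of that recovery step; 'r', 'd' and 'off' are hypothetical names for the region index, the stored round-down delta and the entry's base modulo 2048.)

  #include <stdint.h>

  /* Recover the full (DSO-base-relative) base address of the first main-vector
     entry reached from 2048-byte region 'r', given the round-down delta 'd'
     stored in the shortcut entry and 'off', the entry's base modulo 2048. */
  static uint32_t entry_base(uint32_t r, uint32_t d, uint32_t off)
  {
      return ((r - d) << 11) | off;
  }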

  • If we set it at 0, it means all ranges must be split at 2048-byte boundaries. This will increase the number of ranges in the vector, but not by a lot.

  • Suppose vector entries use 11 bits for the start offset from the 2048-byte region base and 11 bits for the size (11 bits suffice, since we split ranges at region boundaries). That leaves only 10 bits for the fragment ID -- not enough.

  • Instead of 2048 we could go for 512-byte regions, at the cost of a larger shortcut vector (and a little more splitting): 9 bits for the start offset from the region base, 9 bits for size, 14 bits for the fragment ID (is 16384 fragments enough? our libc has only 361!)

How big is our shortcut vector? It just records:

  • the index of the first main vector entry that spans this 512-byte region.
  • the number of such spanning entries (at most 512)

This fits in 4 bytes: say, 23 for the idx and 9 for the n (is 23 enough? yes because 2^23 is 8M entries, each spanning 0.5 kB, so covers 4G of code)
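
(A hedged C sketch of this 512-byte-region variant, assuming ranges are split at region boundaries -- the "set it at 0" option above -- so every main entry lies inside a single region. Field names and the linear scan within a region are my own choices, and corner cases such as representing a full 512-byte range in a 9-bit size field are glossed over.)

  #include <stdint.h>

  struct main_entry {          /* 32 bits per address range */
      uint32_t start : 9;      /* offset of the range start within its 512-byte region */
      uint32_t size  : 9;      /* range length in bytes (ranges were split, so fits one region) */
      uint32_t frag  : 14;     /* index of the unwind code fragment */
  };

  struct shortcut_entry {      /* 32 bits per 512-byte region of text */
      uint32_t first : 23;     /* index of the first main entry in this region */
      uint32_t count : 9;      /* number of main entries in this region */
  };

  /* Two-stage lookup: pc is a vaddr relative to the area the shortcut vector
     covers; returns the fragment index, or -1 if no range covers pc. */
  static int lookup_frag(const struct shortcut_entry *sv,
                         const struct main_entry *mv, uint32_t pc)
  {
      uint32_t region = pc >> 9;          /* which 512-byte region */
      uint32_t off = pc & 0x1ff;          /* offset within that region */
      const struct shortcut_entry *s = &sv[region];
      for (uint32_t i = 0; i < s->count; i++) {
          const struct main_entry *e = &mv[s->first + i];
          if (off >= e->start && off - e->start < e->size)
              return e->frag;
      }
      return -1;
  }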

So for our libc's 1335kB text we would be paying:

  • ~2600 x 4 bytes of shortcut vector (one entry per 512-byte region of the 1335kB text) => ~10kB
  • (20898 x 4)*OVERHEAD bytes of vector entry => 90kB?
  • 361 x 16 bytes (MAYBE) of code => 6kB or maybe a bit more

... smaller even than .eh_frame!

  • If we set it at ~15, then one vector entry can span up to 32kB. In that case our shortcut needs begin/end idxs into the vector (how many entries in the vector? in my libc there are 21k rows, each spanning between 1 byte and 32kB of text) and the round-down/round-up numbers for the begin/end entries.

OK. So shortcut should be a 64-bit record, say 2x 28-bit and 2x 4-bit numbers. 8 bytes per 2048 bytes is very lean (cf. 10% .eh_frame vs .text size)

Then each vector record only needs to record the following (again, a rough C sketch follows the list):

  • its start offset from the 2048-byte base (11 bits)
  • the idx of the 16-byte-aligned row function (21 bits is enough)
    i.e. a 32-bit number.
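
(Read literally, the packing in this variant might look like the following C; field names and the exact begin/end and rounding semantics are guesses, since the notes leave them open. There is no explicit length field, so presumably a range's end is implied by the next entry's start offset.)

  #include <stdint.h>

  struct shortcut_rec {            /* 64 bits per 2048-byte region of text */
      uint64_t begin_idx   : 28;   /* first main-vector entry overlapping this region */
      uint64_t end_idx     : 28;   /* last main-vector entry overlapping this region */
      uint64_t begin_round : 4;    /* "how many 2048s to round down" for the begin entry */
      uint64_t end_round   : 4;    /* round-down/round-up count for the end entry */
  };

  struct vec_rec {                 /* 32 bits per address range */
      uint32_t start : 11;         /* start offset from the 2048-byte base */
      uint32_t frag  : 21;         /* index of the 16-byte-aligned row function */
  };

This gives the 8 bytes per 2048-byte region and 4 bytes per range used in the tally below.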

So for our libc's 1335kB text we would be paying

  • ~650 x 8 bytes of shortcut vector (one entry per 2048-byte region of the 1335kB text) => ~5kB
  • 20898 x 4 bytes of vector entry => 80kB
  • 361 x 16 bytes (MAYBE) of code => 6kB or maybe a bit more

... even smaller! But more complex

Remember: our size overhead in the OOPSLA '19 paper was in the 3x--4x range, and time improvements were in the 13x--102x range. We want similar time improvements and less size overhead.

@stephenrkell (Owner, Author)

It would be fun to compare against Go's pclntab encoding. This was mentioned to me by Felix Geisendörfer, who has also written a blog post about it (belated thanks, Felix).
