Stack walking should use eh_elfs from frdwarf #83
I have sketched out some improvements to the eh_elfs approach that eliminate the binary search tree in favour of a more compact table. Implementing and evaluating this would be an extension of the work. Those notes follow (written a few years ago).

Our eh_elfs "compiled" DWARF frame information takes a lot of space, mainly because of the binary tree that maps an address to its range of surrounding addresses for which the unwind description is identical. These ranges then map on to a (much smaller) set of code fragments that write the register values, i.e. that actually perform the unwind.

Even simpler idea: instead of the binary tree, use a vector. Each record of the vector represents a range, and the vector is sorted by base address. Do binary search over this. (Cost: roughly one pointer per range.) More precisely: each vector entry needs to record
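The per-entry field list is cut off in the note above, so the layout below is an assumption: a minimal C sketch in which each 8-byte entry holds a 32-bit offset from the start of .text plus a 32-bit index into the table of unwind fragments, looked up by an ordinary binary search for the containing range. The names (`range_entry`, `lookup_range`, etc.) are placeholders, not the actual eh_elfs structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-byte vector entry: the field layout is an assumption,
 * since the note's list of per-entry fields is cut off above. */
struct range_entry {
    uint32_t base_off;   /* range start, as an offset from the text base */
    uint32_t frag_idx;   /* index of the unwind fragment shared by this range */
};

/* Binary search over the vector, which is sorted by base_off.
 * Returns the entry whose range contains 'off', i.e. the last entry
 * whose base_off is <= off, or NULL if off precedes all entries. */
static const struct range_entry *
lookup_range(const struct range_entry *v, size_t n, uint32_t off)
{
    size_t lo = 0, hi = n;              /* search within [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (v[mid].base_off <= off) lo = mid + 1;
        else hi = mid;
    }
    return (lo == 0) ? NULL : &v[lo - 1];
}
```

Under this layout the per-range cost is 8 bytes plus the amortised cost of the shared fragments, which is the trade-off the numbers below try to estimate.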
How big are the frame tables relative to text? Let's try glibc 2.28.
So for 1335435 bytes of text, we get 133944 bytes of .eh_frame ... roughly 10% overhead. How many distinct address-range entries would our table have?
How many distinct row contents would our table have?
OK. So if (hypothetically) we spend 8 bytes per address range (in the vector) and 16 bytes per distinct row (i.e. 16 bytes is the average length of the instruction fragment), we would have space overhead:
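For concreteness, here is that back-of-envelope as a tiny helper; the range and row counts for glibc 2.28 are not reproduced in the note, so they are left as parameters, and the helper name is just illustrative.

```c
#include <stddef.h>

/* Space overhead implied by the hypothesis above: 8 bytes per address
 * range in the vector, plus 16 bytes per distinct row (i.e. per shared
 * unwind fragment), measured relative to the size of .text.
 * For comparison, the .eh_frame overhead quoted above for glibc 2.28 is
 * 133944.0 / 1335435, i.e. roughly 10%. */
static double table_overhead(size_t nranges, size_t ndistinct_rows,
                             size_t text_bytes)
{
    return (8.0 * nranges + 16.0 * ndistinct_rows) / (double) text_bytes;
}
```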
What's the length distribution of the address ranges?
OK, so if we can make a table with 8-byte entries and if our return functions can fit in 16 bytes on average, then we should have a better time/space trade-off than the "fully compiled" approach. It would still use a bit more space. Can we do better?

We could possibly play with 'shortcut vectors' to avoid representing base addresses in full, i.e. to cut down on the 8-byte size of the vector entry. The shortcut vector is basically the first stage in a two-stage lookup. Say we divide the address space into 2048-byte regions. We then discard all but the bottom 11 bits of each vector entry's base address field; the first stage of the lookup recovers the discarded high-order bits. How? The shortcut vector has one entry for every 2048-byte region, and points to the lowest 'main' vector entry overlapping any part of that region. The main vector entry may span multiple 2048-byte regions, i.e. its implicit base address may be a predecessor of the looked-up 2048-byte region. We could just forbid this and require duplicate entries in such cases.

(Thinking more carefully: let's flip this. For any vector entry, what is the base? The entry has its own range. We need to look at the 2048-byte regions that that range overlaps. The base is the lowest of these 2048-byte-aligned addresses, i.e. the lowest 2048-byte region that could reach that vector entry. Is this just "the entry's base address aligned down to 2048 bytes"? Yes, I think so. We only need to record the base modulo 2048.)

So what about a record whose range spans a 2048-byte boundary? Then suppose we will reach it from the higher 2048-byte boundary. So in our shortcut record, we need to know how many 2048-byte boundaries we cross to get down to the bottom of the vector entry's range, i.e. "how many 2048s to round down" ... the maximum number of 2048s tells us the maximum range size of one record.
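A sketch of how the two-stage lookup might go, using the 4-byte shortcut record described just below (23-bit entry index plus 9-bit round-down count) and a main-vector entry slimmed down to the base modulo 2048 plus a fragment index. The struct names and exact widths are assumptions for illustration, not the actual eh_elfs/frdwarf layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record layouts; field widths follow the notes, names are
 * placeholders. */
struct shortcut_entry {
    uint32_t idx : 23;  /* lowest main-vector entry overlapping this 2048-byte region */
    uint32_t n   : 9;   /* 2048-byte boundaries to round down to reach that entry's base */
};

struct main_entry {
    uint16_t base_lo;   /* entry's base address modulo 2048 */
    uint16_t frag_idx;  /* index of the shared unwind fragment for this range */
};

/* Two-stage lookup: stage one indexes the shortcut vector directly by
 * the 2048-byte region containing 'addr'; stage two scans only the
 * main-vector slice belonging to that region. */
static const struct main_entry *
lookup(uintptr_t addr, uintptr_t text_base,
       const struct shortcut_entry *shortcut, size_t nregions,
       const struct main_entry *main_v, size_t nentries)
{
    uintptr_t off = addr - text_base;
    size_t region = off >> 11;
    uint32_t off11 = off & 2047;
    if (region >= nregions) return NULL;

    size_t start = shortcut[region].idx;
    size_t end = (region + 1 < nregions) ? shortcut[region + 1].idx
                                         : nentries;

    /* Entry 'start' covers the beginning of this region; its full base is
     * text_base + (((uintptr_t)(region - shortcut[region].n) << 11)
     *              | main_v[start].base_lo).
     * Every later entry in [start+1, end) begins inside this region, so
     * comparing the low 11 bits is enough to find the containing range. */
    size_t best = start;
    for (size_t i = start + 1; i < end; i++) {
        if (main_v[i].base_lo <= off11) best = i;
        else break;
    }
    return &main_v[best];
}
```

The slice scan could equally be a binary search, but with 2048-byte regions the slices should usually be short.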
How big is our shortcut vector? It just records:
This fits in 4 bytes: say, 23 bits for the idx and 9 for the n. (Is 23 enough? Yes: 2^23 is 8M entries, each spanning 0.5 kB, so that covers 4 GB of code.) So for our libc's 1335 kB of text we would be paying:
... smaller even than .eh_frame!
OK. So the shortcut should be a 64-bit record, say 2x 28-bit and 2x 4-bit numbers. 8 bytes per 2048 bytes is very lean (cf. the ~10% .eh_frame-to-.text ratio); a possible packing is sketched at the end of these notes. Then each vector record only needs to record
So for our libc's 1335kB text we would be paying
... even smaller, but more complex. Remember: our size overhead in the OOPSLA '19 paper was in the 3x--4x range, and time improvements were in the 13x--102x range. We want similar time improvements and less size overhead.
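Purely as an illustration of the bit budget mentioned above: the notes give the widths (2x 28-bit plus 2x 4-bit) but not what the second index/count pair would encode, so the field names below are placeholders only.

```c
#include <stdint.h>

/* Back-of-envelope for glibc 2.28's 1335435 bytes of text:
 * 1335435 / 2048 ~= 652 regions, so ~652 x 8 bytes ~= 5.2 kB for the
 * shortcut vector alone, versus ~134 kB of .eh_frame. */

/* One 64-bit shortcut record per 2048-byte region, packing two 28-bit
 * indices and two 4-bit counts as per the stated bit budget. */
struct shortcut64 {
    uint64_t idx0 : 28;
    uint64_t n0   : 4;
    uint64_t idx1 : 28;
    uint64_t n1   : 4;
};
```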
It would be fun to compare against Go's pclntab encoding. This was mentioned to me by Felix Geisendörfer, who has also written a blog post about it (belated thanks, Felix).
Currently we use vanilla libunwind, but to improve performance for stack-heavy workloads, we could bundle the 'compiled' frame information from frdwarf/dwarf-assembly and then use its special libunwind. In particular, the compiled frame information should be generated as a further kind of metadata.