
Post-hoc allocation site analysis #99

Open
stephenrkell opened this issue Nov 17, 2024 · 4 comments

Comments

@stephenrkell
Owner

It would be very useful to be able to run a dumpallocs-style analysis for a built binary, given its debug info and source tree but without rebuilding it -- which takes time.

This is particularly important for introspecting on the ld.so (#98) because we don't want to have to rebuild the whole ld.so, but we could reasonably expect a source tree to be available.

An obvious problem is that dumpallocs requires a complete .i file to analyse, but DWARF does not currently include enough information to reconstruct the .i for any compilation unit. It does, however, contain various hints: many header files will be mentioned in the line table (but many won't!) and we have information about the compiler including its version (if that matters) and some of its command-line arguments (apparently just the cc1-style ones, though).
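As a starting point for harvesting those hints, here is a minimal sketch using elfutils libdw (compile with -ldw; it assumes the binary's path is passed as argv[1]). It prints each compilation unit's name together with its DW_AT_producer string, which is where the compiler version and (some of) the command-line arguments end up:

    #include <dwarf.h>
    #include <elfutils/libdw.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        Dwarf *dw = dwarf_begin(fd, DWARF_C_READ);
        if (!dw) return 1;
        Dwarf_Off off = 0, next;
        size_t hdr_size;
        /* walk the CU headers in .debug_info */
        while (dwarf_nextcu(dw, off, &next, &hdr_size, NULL, NULL, NULL) == 0)
        {
            Dwarf_Die cu;
            if (dwarf_offdie(dw, off + hdr_size, &cu))
            {
                Dwarf_Attribute a;
                const char *name = dwarf_diename(&cu);
                const char *producer = dwarf_attr(&cu, DW_AT_producer, &a)
                    ? dwarf_formstring(&a) : NULL;
                printf("%s\n\t%s\n", name ? name : "(unnamed CU)",
                    producer ? producer : "(no DW_AT_producer)");
            }
            off = next;
        }
        dwarf_end(dw);
        return 0;
    }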

(One interesting question is to what extent the macro information, enabled at -g3 with GCC, fills these gaps. I'm not pursuing this because it's almost never used in the field.)

I had a go at creating a simple awk script that mimics the preprocessor and tries to reconstruct where the included files came from. On simple examples this works*, but on realistic examples (e.g. compilation units from glibc's ld.so), even with maximum guesswork, it falls down for a number of reasons:

  • generated headers
  • computed includes, i.e. where macro expansion is used to generate the include spec
  • ambiguity across include paths, i.e. the absence of information about include paths' ordering
  • #include_next, again in the absence of information about include paths' ordering (this and computed includes are sketched just below)
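Concretely, here are hypothetical snippets (the header names are made up) showing include forms that the line table alone cannot disambiguate:

    /* computed include: the include spec only exists after macro expansion,
       so a naive preprocessor-mimicking scan never sees a literal file name */
    #define MACHINE_HEADER <dl-machine-x86_64.h>
    #include MACHINE_HEADER

    /* include_next: which file this resolves to depends on the ordering of
       the include paths, which DWARF does not record */
    #include_next <limits.h>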

A better way forward might be to reconstitute a 'good enough' .i file even in the absence of such information. That would require a forgiving parser. C is already forgiving about missing function prototypes, so we'd mostly be worried about type information (which of course affects the parse tree!). We could even use dwarfidl to generate a rendering of all the type information up-front, and then our parser would just have to be forgiving of duplicates.

Some problems might remain, e.g. function-like macros used as syntax generators -- if we don't manage to slurp the definition of such a macro, we will choke. However, we expect to get most headers, so it could still work most of the time.
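For example (a made-up macro, not from any real codebase), a chunk like the following is not even syntactically valid C unless we have slurped the macro's definition, because the macro generates the function header itself:

    /* hypothetical syntax-generating macro; with its definition missing, the
       use site below reads as a call expression followed by a stray brace
       block, which no C parser will accept at top level */
    #define DEFINE_HOOK(name) void name##_hook(void)

    DEFINE_HOOK(malloc)
    {
        /* ... */
    }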

* The definition of 'works' here is already a bit generous. Since we don't have information about -D options on the command line, or builtin macros defined by the compiler, we don't know which #if / #ifdef branches are taken. So the tool will follow both branches while keeping track of a 'path condition'. The idea would then be to synthesise an environment of -D options that could have generated the output file's DWARF, e.g. one that would have included the header files that the line table reports as having been read (among others, but hitting all of those). That is already into SMT-solver territory.
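To illustrate the path-condition idea with a hypothetical fragment (the header and macro names are invented): if the line table reports that rtld-private.h was read, then any satisfying -D environment must make SHARED defined.

    /* with no record of -D options, the tool follows both branches, each
       under its own path condition (SHARED vs. !SHARED) */
    #ifdef SHARED
    # include "rtld-private.h"   /* reported by the line table => SHARED held */
    #else
    # include "static-start.h"
    #endif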

@stephenrkell
Owner Author

(Of course the right solution is to extend the toolchain, e.g. by expanding what is captured in debugging information, such that it's possible to reconstruct the .i file. But my current use case is to provide workarounds for existing, already-built binaries, for which any such extension comes too late.)

@stephenrkell
Owner Author

stephenrkell commented Nov 28, 2024

Roughly the solution I have in mind is:

  • chunk the .c file by splitting it after every top-level closing brace and top-level semicolon
  • select those chunks that begin with a function declaration, after filtering out intervening preprocessor directives, and ideally extracting its name
  • prepend each such chunk with a full dwarfidl dump of its environment, ideally specialised to the function name (a hypothetical example of the result is sketched after this list)
  • run each chunk through a C parser and, if successful, proceed with the allocation site analysis.
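For concreteness, here is a hypothetical example of what one chunk might look like after the dwarfidl-prepend step -- every name in it is invented, and the preamble merely stands in for whatever dwarfidl would actually render from the DWARF:

    /* --- preamble synthesised from DWARF type/variable/function info --- */
    typedef unsigned long size_t;
    struct link_map;                       /* an opaque declaration is enough to parse */
    extern struct link_map *_hypothetical_loaded_list;
    extern void *malloc(size_t);

    /* --- original chunk, split off after the previous top-level '}' --- */
    static void *
    allocate_entry(size_t sz)
    {
        return malloc(sz);                 /* the allocation site the analysis should find */
    }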

Chunks that use functionlike macros in a functionlike way will have "undeclared function" problems but will probably parse.

Chunks that use functionlike macros in other ways will probably have syntactic errors.
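For the first of those cases, a chunk like this (hypothetical names) still goes through the parser; ROUND_UP just looks like a call to an undeclared function, which old-style C tolerates:

    #include <stdlib.h>

    void *grow_table(unsigned n)
    {
        /* ROUND_UP is a functionlike macro whose definition we failed to
           slurp; used in a functionlike way it still parses (e.g. under
           -std=gnu89, with an implicit-declaration warning) */
        return malloc(ROUND_UP(n, 16));
    }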

Chunks that use non-functionlike macros will have name resolution problems for types or variables that are being referenced through a macro. Will they still parse? Not in general... the classic C (A)*B ambiguity is one thing that will cause problems. If A is in need of macro expansion but the macro is missing, even a lexer-hacked parser has no way forward.
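A minimal illustration of that ambiguity (nothing here is from a real codebase): whether the return expression is a cast or a multiplication depends entirely on what A is, so a parser that has lost A's definition behind a missing macro cannot choose a single parse.

    typedef int A;            /* if A names a type ... */

    int f(void)
    {
        int b = 3, *B = &b;
        return (A)*B;         /* ... this is a cast applied to *B; if A were a
                                 variable in scope, it would parse as A times B */
    }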

@stephenrkell
Owner Author

Probably the best we could do is pre-scan for unresolvable identifiers, then guess which of them name types. This is overkill for now... just discard the chunks that fail to parse, i.e. hope they are not functions that do allocations.

@stephenrkell
Owner Author

I think the slight generalisation of this to a "C-repairing parser" could be a nice student project. Step 1 is to write a parser for modern / GNU-extended C that is tolerant of type/var ambiguities, producing multiple parses if necessary. Step 2 is to traverse the resulting ASTs to scrape a list of the depended-on definitions for each top-level definition -- again, there may be "type or value" disjunction as a result of the ambiguity. Step 3 is to synthesise definitions of the dependencies, using the DWARF information. Extensions could try to work with syntax-generating macros somehow (imagine how to handle Linux's MODULE_* macros, say).
