
Post-hoc allocation site analysis #99

Open
stephenrkell opened this issue Nov 17, 2024 · 4 comments

Comments

@stephenrkell
Owner

It would be very useful to be able to run a dumpallocs-style analysis for a built binary, given its debug info and source tree but without rebuilding it -- which takes time.

This is particularly important for introspecting on the ld.so (#98) because we don't want to have to rebuild the whole ld.so, but we could reasonably expect a source tree to be available.

An obvious problem is that dumpallocs requires a complete .i file to analyse, but DWARF does not currently include enough information to reconstruct the .i for any compilation unit. It does, however, contain various hints: many header files will be mentioned in the line table (but many won't!) and we have information about the compiler including its version (if that matters) and some of its command-line arguments (apparently just the cc1-style ones, though).
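As a starting point for harvesting those hints, here is a minimal sketch using elfutils libdw (compile with -ldw; it assumes the binary's path is passed as argv[1]). It prints each compilation unit's name together with its DW_AT_producer string, which is where the compiler version and (some of) the command-line arguments end up:

    #include <dwarf.h>
    #include <elfutils/libdw.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        Dwarf *dw = dwarf_begin(fd, DWARF_C_READ);
        if (!dw) return 1;
        Dwarf_Off off = 0, next;
        size_t hdr_size;
        /* walk the CU headers in .debug_info */
        while (dwarf_nextcu(dw, off, &next, &hdr_size, NULL, NULL, NULL) == 0)
        {
            Dwarf_Die cu;
            if (dwarf_offdie(dw, off + hdr_size, &cu))
            {
                Dwarf_Attribute a;
                const char *name = dwarf_diename(&cu);
                const char *producer = dwarf_attr(&cu, DW_AT_producer, &a)
                    ? dwarf_formstring(&a) : NULL;
                printf("%s\n\t%s\n", name ? name : "(unnamed CU)",
                    producer ? producer : "(no DW_AT_producer)");
            }
            off = next;
        }
        dwarf_end(dw);
        return 0;
    }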

(One interesting question is to what extent the macro information, enabled at -g3 with GCC, fills these gaps. I'm not pursuing this because it's almost never used in the field.)

I had a go at creating a simple awk script that mimics the preprocessor and tries to reconstruct where the included files came from. On simple examples this works*, but on realistic examples (e.g. compilation units from glibc's ld.so), even with maximum guesswork, it falls down for a number of reasons:

  • generated headers
  • computed includes, i.e. where macro expansion is used to generate the include spec
  • ambiguity across include paths, i.e. the absence of information about include paths' ordering
  • #include_next, again in the absence of information about include paths' ordering (this and computed includes are sketched just below)
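Concretely, here are hypothetical snippets (the header names are made up) showing include forms that the line table alone cannot disambiguate:

    /* computed include: the include spec only exists after macro expansion,
       so a naive preprocessor-mimicking scan never sees a literal file name */
    #define MACHINE_HEADER <dl-machine-x86_64.h>
    #include MACHINE_HEADER

    /* include_next: which file this resolves to depends on the ordering of
       the include paths, which DWARF does not record */
    #include_next <limits.h>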

A better way forward might be to reconstitute a 'good enough' .i file even in the absence of such information. That would require a forgiving parser. C is already forgiving about missing function prototypes, so we'd mostly be worried about type information (which of course affects the parse tree!). We could even use dwarfidl to generate a rendering of all the type information up-front, and then our parser would just have to be forgiving of duplicates.

Some problems might remain, e.g. function-like macros used as syntax generators -- if we don't manage to slurp the definition of such a macro, we will choke. However, we expect to get most headers, so it could still work most of the time.
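For example (a made-up macro, not from any real codebase), a chunk like the following is not even syntactically valid C unless we have slurped the macro's definition, because the macro generates the function header itself:

    /* hypothetical syntax-generating macro; with its definition missing, the
       use site below reads as a call expression followed by a stray brace
       block, which no C parser will accept at top level */
    #define DEFINE_HOOK(name) void name##_hook(void)

    DEFINE_HOOK(malloc)
    {
        /* ... */
    }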

* The definition of 'works' here is already a bit generous. Since we don't have information about -D options on the command line, or builtin macros defined by the compiler, we don't know which #if / #ifdef branches are taken. So the tool will follow both branches while keeping track of a 'path condition'. The idea would then be to synthesise an environment of -D options that could have generated the output file's DWARF, e.g. one that would have included the header files that the line table reports as having been read (among others, but hitting all of those). That is already into SMT-solver territory.
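To illustrate the path-condition idea with a hypothetical fragment (the header and macro names are invented): if the line table reports that rtld-private.h was read, then any satisfying -D environment must make SHARED defined.

    /* with no record of -D options, the tool follows both branches, each
       under its own path condition (SHARED vs. !SHARED) */
    #ifdef SHARED
    # include "rtld-private.h"   /* reported by the line table => SHARED held */
    #else
    # include "static-start.h"
    #endif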

@stephenrkell
Owner Author

(Of course the right solution is to extend the toolchain, e.g. by expanding what is captured in debugging information, such that it's possible to reconstruct the .i file. But my current use case is to provide workarounds for existing, already-built binaries, for which any such extension comes too late.)

@stephenrkell
Owner Author

stephenrkell commented Nov 28, 2024

Roughly the solution I have in mind is:

  • chunk the .c file by splitting it after every top-level closing brace and top-level semicolon
  • select those chunks that begin with a function declaration, after filtering out intervening preprocessor directives, and ideally extracting its name
  • prepend each such chunk with a full dwarfidl dump of its environment, ideally specialised to the function name (a hypothetical example of the result is sketched after this list)
  • run each chunk through a C parser and, if successful, proceed with the allocation site analysis.
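For concreteness, here is a hypothetical example of what one chunk might look like after the dwarfidl-prepend step -- every name in it is invented, and the preamble merely stands in for whatever dwarfidl would actually render from the DWARF:

    /* --- preamble synthesised from DWARF type/variable/function info --- */
    typedef unsigned long size_t;
    struct link_map;                       /* an opaque declaration is enough to parse */
    extern struct link_map *_hypothetical_loaded_list;
    extern void *malloc(size_t);

    /* --- original chunk, split off after the previous top-level '}' --- */
    static void *
    allocate_entry(size_t sz)
    {
        return malloc(sz);                 /* the allocation site the analysis should find */
    }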

Chunks that use functionlike macros in a functionlike way will have "undeclared function" problems but will probably parse.

Chunks that use functionlike macros in other ways will probably have syntactic errors.
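For the first of those cases, a chunk like this (hypothetical names) still goes through the parser; ROUND_UP just looks like a call to an undeclared function, which old-style C tolerates:

    #include <stdlib.h>

    void *grow_table(unsigned n)
    {
        /* ROUND_UP is a functionlike macro whose definition we failed to
           slurp; used in a functionlike way it still parses (e.g. under
           -std=gnu89, with an implicit-declaration warning) */
        return malloc(ROUND_UP(n, 16));
    }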

Chunks that use non-functionlike macros will have name resolution problems for types or variables that are being referenced through a macro. Will they still parse? Not in general... the classic C (A)*B ambiguity is one thing that will cause problems. If A is in need of macro expansion but the macro is missing, even a lexer-hacked parser has no way forward.
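A minimal illustration of that ambiguity (nothing here is from a real codebase): whether the return expression is a cast or a multiplication depends entirely on what A is, so a parser that has lost A's definition behind a missing macro cannot choose a single parse.

    typedef int A;            /* if A names a type ... */

    int f(void)
    {
        int b = 3, *B = &b;
        return (A)*B;         /* ... this is a cast applied to *B; if A were a
                                 variable in scope, it would parse as A times B */
    }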

@stephenrkell
Owner Author

Probably the best we could do is pre-scan for unresolvable identifiers, then guess which of them name types. This is overkill for now... just discard the chunks that fail to parse, i.e. hope they are not functions that do allocations.

@stephenrkell
Owner Author

I think the slight generalisation of this to a "C-repairing parser" could be a nice student project. Step 1 is to write a parser for modern / GNU-extended C that is tolerant of type/var ambiguities, producing multiple parses if necessary. Step 2 is to traverse the resulting ASTs to scrape a list of the depended-on definitions for each top-level definition -- again, there may be "type or value" disjunction as a result of the ambiguity. Step 3 is to synthesise definitions of the dependencies, using the DWARF information. Extensions could try to work with syntax-generating macros somehow (imagine how to handle Linux's MODULE_* macros, say).
