Home

Welcome to the bil-to-boogie-translator wiki!

Architecture Overview of the Translator

Parser

The parser has two stages, the parsing phase and the walking phase.

Parsing phase

In the parsing phase primarily depends on the rules laid out in the BilAdt.g4 grammar file. The text is converted into lexical tokens, which are then parsed to form a parse tree.

Antlr does this by converting the grammar into java source, which is capable of taking a string input and creating the aforementioned parse tree.

Walking/Listening phase

The Walking/Listening phase primarily depends on the AdtStatementLoader.scala file. The Antlr walker essentially traverses the parse tree generated in the previous phase, and performs actions when entering and exiting each token.

The primary goal of this stage is to produce a "State" data structure from our parse tree, which is our representation of the program. We output information from this stage such as the list of statements, and we also keep a dictionary of expressions which we use in the statements.

Analysis phases (including verification conditions)

[TODO]

Boogie code generation

[TODO]

TODO List

minor fixes to old-state version:
- turning off constant propagation
- adding examples.
- update readme on how to run new examples
New Intermediate Representation (IR) Design:
- review intermediate representation (State data structure)
- update tests
investigate ARM stack semantics
cleanup generation of memory accesses
generating rely-guarantee & verification conditions only when needed (memory accesses, branches etc)
combine reg31 and stack pointer
review constant propagation and folding
- use information from new parser
- split inference from replacing?
start with simple boolean security levels
fix rely conditions and implement guarantees
alternate representation to bit-vectors
- use ints instead where possible
- or simple booleans for single bits
improved memory model
- points-to analysis (partition memory)
- investigate LLVM data structure analysis algorithm (https://llvm.org/pubs/2007-06-10-PLDI-DSA.pdf)
- separate stack from heap (more reliably)

Design Decisions

2022-06-01: decided to record design decisions.

Future Ideas

infer loop invariants

Pros and Cons of parsing the ADT

Pros

Explicit bit vector lengths / types.
Explicit control flow graph information
More intuitive casting syntax
It should be a complete representation of what the lifter knows - i.e. there should never be missing information (probably)

Cons

It is extremely verbose and somewhat difficult to parse compared to the BIL
Unsure whether it's officially supported or maintained

Derivation of the ADT grammar

The grammar of the parser is inspired by the existing parser found at BinaryAnalysisPlatform/bap-python, in particular, the nodes found in adt.py, br.py and bil.py. The grammar attempts to reflect the inheritance hierarchy - the alternatives of a token are its subclasses.

For example, exp and term are subclasses of ADT, and thus in the grammar we have adt : exp | term | ....

General notes

Negative numbers are encoded somewhat strangely. It takes the 2's complement representation of the number, and interprets the binary string as if it were an unsigned integer.
- For example, the number -2 in 64 bits is 1111111111111111111111111111111111111111111111111111111111111110, which is interpreted in the ADT as Int(18446744073709551614, 64).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly