Skip to content
Thomas Le edited this page Jun 24, 2022 · 18 revisions

Welcome to the bil-to-boogie-translator wiki!

Architecture Overview of the Translator

Parser

The parser has two stages, the parsing phase and the walking phase.

Parsing phase

In the parsing phase primarily depends on the rules laid out in the BilAdt.g4 grammar file. The text is converted into lexical tokens, which are then parsed to form a parse tree.

Antlr does this by converting the grammar into java source, which is capable of taking a string input and creating the aforementioned parse tree.

Walking/Listening phase

The Walking/Listening phase primarily depends on the AdtStatementLoader.scala file. The Antlr walker essentially traverses the parse tree generated in the previous phase, and performs actions when entering and exiting each token.

The primary goal of this stage is to produce a "State" data structure from our parse tree, which is our representation of the program. We output information from this stage such as the list of statements, and we also keep a dictionary of expressions which we use in the statements.

Analysis phases (including verification conditions)

[TODO]

Boogie code generation

[TODO]

TODO List

  • minor fixes to old-state version:
    • turning off constant propagation
    • adding examples.
    • update readme on how to run new examples
  • New Intermediate Representation (IR) Design:
    • review intermediate representation (State data structure)
    • update tests
  • investigate ARM stack semantics
  • cleanup generation of memory accesses
  • generating rely-guarantee & verification conditions only when needed (memory accesses, branches etc)
  • combine reg31 and stack pointer
  • review constant propagation and folding
    • use information from new parser
    • split inference from replacing?
  • start with simple boolean security levels
  • fix rely conditions and implement guarantees
  • alternate representation to bit-vectors
    • use ints instead where possible
    • or simple booleans for single bits
  • improved memory model

Design Decisions

  • 2022-06-01: decided to record design decisions.

Future Ideas

  • infer loop invariants

Pros and Cons of parsing the ADT

Pros

  • Explicit bit vector lengths / types.
  • Explicit control flow graph information
  • More intuitive casting syntax
  • It should be a complete representation of what the lifter knows - i.e. there should never be missing information (probably)

Cons

  • It is extremely verbose and somewhat difficult to parse compared to the BIL
  • Unsure whether it's officially supported or maintained

Derivation of the ADT grammar

The grammar of the parser is inspired by the existing parser found at BinaryAnalysisPlatform/bap-python, in particular, the nodes found in adt.py, br.py and bil.py. The grammar attempts to reflect the inheritance hierarchy - the alternatives of a token are its subclasses.

For example, exp and term are subclasses of ADT, and thus in the grammar we have adt : exp | term | ....

General notes

  • Negative numbers are encoded somewhat strangely. It takes the 2's complement representation of the number, and interprets the binary string as if it were an unsigned integer.
    • For example, the number -2 in 64 bits is 1111111111111111111111111111111111111111111111111111111111111110, which is interpreted in the ADT as Int(18446744073709551614, 64).
Clone this wiki locally