Skip to content

Commit

Permalink
readme: update removing script and compression
Browse files Browse the repository at this point in the history
We also add some suggestions to install each dependency, and mention of the debugging scripts available.
  • Loading branch information
katrinafyi authored Jun 27, 2024
1 parent ce92032 commit 261e5a6
Showing 1 changed file with 39 additions and 50 deletions.
89 changes: 39 additions & 50 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,68 +4,53 @@

This codebase serves to tie several verification tools together.
The [GTIRB](https://github.com/grammatech/gtirb) intermediate representation produced by the [Datalog Disassembler](https://github.com/GrammaTech/ddisasm) is deserialised using [Google Protocol Buffers](https://developers.google.com/protocol-buffers). This is then dismantled and the [ASLi ASL Interpreter](https://github.com/UQ-PAC/asl-interpreter) is used to add instruction semantics for each instruction opcode. These are then reserialised back into the original IR protobufs alongside the original data produced by DDisasm (.gts file).
The semantic information itself is printed to stdout.
The semantic information itself is also printed to stdout.

## Requirements
To build and run this you will need:
* ddisasm and dependencies thereof
* asl-interpreter and dependencies thereof
* mra_tools
* protoc
* The following OPAM packages:
* ocaml-protoc-plugin
* hexstring
* base64

To massively simplify this, simply run ```scripts/build-all.sh``` in your preferred install directory. This script assumes a completely fresh Ubuntu 20.04.5 installation. It is advised to run this script within a fresh VM but it should work on established installations. A complete installation can take several hours and will prompt for the sudo password at least once.
To build and run this, you will need:
* ASLp —
following [its readme](https://github.com/UQ-PAC/aslp?tab=readme-ov-file#installing-dependencies),
install dependencies then follow the steps for use in other projects.
* ddisasm — installing via [the Nix package](https://github.com/katrinafyi/pac-nix) is recommended.
* protoc — installed by the system package manager.
* various OCaml libraries — install these with `opam install --deps-only ./*.opam`
* Scala / SBT (optional) — only needed for the "retrieve" tool, install with [Coursier](https://get-coursier.io/docs/cli-installation) (see also: [BASIL docs](https://github.com/UQ-PAC/BASIL/tree/main/docs/development)).

As a reference for these steps on Ubuntu 20.04, you may see `scripts/build-all.sh` (however, using this script directly is not advised).

## Usage
```
wget https://raw.githubusercontent.com/UQ-PAC/gtirb-semantics/main/scripts/build-all.sh
chmod 744 build-all.sh
./build-all.sh
dune exec gtirb_semantics [INPUT GTIRB FILE] [OUTPUT GTS FILE]
```
Partway through, opam will ask if it can modify ```~/.profile``` and if it can add a hook to its init scripts. Answer ```N``` to both.

## Usage
GTIRB files can be obtained with:
```
dune exec gtirb_semantics input_path prelude_path mra_dir asli_dir output_path
ddisasm ./a.out --ir a.gtirb
```

## GTIRB Specifics
The serialised output is almost identical to that produced by ddisasm except with a few differences:
* All Sections except ".text" have been removed from each compilation module. This is as they are not useful for analysis purposes, and take up a lot of effectively dead space.
* The auxdata attached to the top-level IR has been removed. It contained only the ddisasm version number and is not useful.
* The semantic information provided by ASLi has been added as an auxdata record for each compilation module. As these outputs are verbose and quite large when formatted as readable text, they have been compressed. The compression scheme is as below:

| Item | no_blocks | Compression Map | Compressed Text |
|----------|-----------|-----------------------------------|-----------------|
| Size (B) | 1 | (3 * no_blocks) - (4 * no_blocks) | Remainder |

The no_blocks field contains the number of blocks in the compression map.
Each block in the compression map follows the below format:

| Item | Ascii | no_bits | Compressed |
|----------|-------|---------|------------|
| Size (B) | 1 | 1 | 1-2 |

The Ascii field contains the original character pre-compression. The no_bits field contains the number of bits in the compressed representation of the character in question. The compressed field contains the compressed representation of that character aligned to the right. The resulting decompressed text output needs only to be wrapped in curly braces before being parsed as JSON with your favourite JSON library.
The semantic information JSON data is structured as so:
```
{
uuid : [
[
opcode_0_semantics
], [
opcode_1_semantics
], ...
]
}
```
Where ```uuid``` is the base64 string of a UUID corresponding to a code block within the GTIRB structure.
Each ```opcode_n_semantics``` are readable strings.
* The file does not have the 8-byte magic prefix used by GTIRB.
* The semantic information provided by ASLi has been added as an auxdata record for each compilation module.
The semantic information JSON data is structured as so:
```js
{
uuid : [
[
opcode_0_semantics
],
[
opcode_1_semantics
],
...
]
}
```
Where ```uuid``` is the base64 string of a UUID corresponding to a code block within the GTIRB structure.
Each ```opcode_n_semantics``` are readable strings of the ASL AST.

## Use with other tools
Some boilerplate Scala code has been provided in ```extras/retreive```. This minimal solution deserialises a .gts file and retrieves the IPCFG, text sections for each module, and semantic information for each module.
Some boilerplate Scala code has been provided in ```extras/retrieve```. This minimal solution deserialises a .gts file and retrieves the IPCFG, text sections for each module, and semantic information for each module.

A GTIRB spelunking tool has been provided in ```extras/spelunking```. It is runnable with ```python3 spelunk.py gtirb_file search_key``` where ```search_key``` is any of the below:

Expand All @@ -79,11 +64,15 @@ A GTIRB spelunking tool has been provided in ```extras/spelunking```. It is runn
| symbols | Symbols from each compilation module |
| texts | Text sections from each compilation module |

This has been provided to make it easier to extract relevant information from the GTIRB IR when developing future tools. Alternatively, see the ```--json``` option in ```ddisdasm``` for producing a readable JSON representation of the GTIRB IR, although this will be extremely verbose.
This has been provided to make it easier to extract relevant information from the GTIRB IR when developing future tools. Alternatively, see the ```--json``` option in ```ddisasm``` for producing a readable JSON representation of the GTIRB IR, although this will be extremely verbose.

This tool is easily extendable to accommodate any increased spelunking needs in the future.
It is important to note that the spelunker will not recognise .gts files due to the structural differences.

`scripts/debug-gts.py` is a tool for converting the .gts into a human-readable JSON format, with instruction names and opcode alongside each block of semantics.

`scripts/proto-json.py` converts to/from GTIRB/gts and a JSON format. This can be useful for exploring the GTIRB output with tools such as jq.

## Disassembly Pipeline
An example pipeline of disassembly -> instruction lifting -> semantic info -> compression -> serialisation -> deserialisation -> decompression is located in scripts/pipeline.sh.
This will disassemble an example ARM64 binary and produce:
Expand Down

0 comments on commit 261e5a6

Please sign in to comment.