Instruction addrs, block labels & llvm disassembly #6

ailrst · 2024-02-15T01:34:11Z

Adds more information and structure to the json inserted into the gtirb output.
This sacrifices some compactness for readability and convenience.

This also adds llvm-16 as a dependency of this package.

For each instruction we now have an address (just block_address + i * opcode_size), the llvm disassembly, and the opcode itself.

For each block if a label can be found in any of the Module.symbols that refers to the block uuid then it is added.

The output now looks as below, so we will hold off on merging this until the basil gtirb loader is updated accordingly.

  "JgXUTTeoR9qjEjJwNU/5+w==": {
    "label": "main",
    "addr": 1812,
    "instructions": [
      {
        "assembly": "adrp x0, #131072",
        "addr": 1812,
        "opcode_le": "00 01 00 90",
        "opcode_be": "0x90000100",
        "semantics": [
          "Stmt_Assign(LExpr_Array(LExpr_Var(_R),Expr_LitInt(\"0\")),Expr_LitBits(\"0000000000000000000000000000000000000000000000100000000000000000\"))"
        ]
      },
      {
        "assembly": "add x0, x0, #28",
        "addr": 1816,
        "opcode_le": "00 70 00 91",
        "opcode_be": "0x91007000",
        "semantics": [
          "Stmt_Assign(LExpr_Array(LExpr_Var(_R),Expr_LitInt(\"0\")),Expr_TApply(add_bits.0,[(Expr_LitInt(\"64\"))],[(Expr_Array(Expr_Var(_R),Expr_LitInt(\"0\")));(Expr_LitBits(\"000000000000000000
0000000000000000000000000000000000000000011100\"))]))"
        ]
      },

l-kent · 2024-02-15T02:34:53Z

I'm not sure any of this really serves much purpose. Readability of the internal JSON does not matter - no one needs to directly read the internal JSON, and we have debug-gts.py to create more human-readable outputs.

What does the LLVM disassembly look like? I don't see it in your example output. I also don't see the point in adding that and the opcode to the internal JSON for the semantics - those aren't going to be any use to BASIL so they will just take up space. If they are necessary for something else, it would make more sense to add those to debug-gts.py output.

The labels you add will in practice just be the function name for function entry blocks, so adding it to the internal semantics representation doesn't really serve much purpose when we already have access to that information.

We can already derive the address of blocks and instructions from the GTIRB data structures, so there is not much benefit to duplicating that information. It would be useful to have the addresses of instructions displayed in debug-gts.py though (the addresses of blocks are already displayed).

It would be easier to add support for changes like this to BASIL after the changes have been merged in - it is not necessary to wait for BASIL to be updated.

ailrst · 2024-02-16T00:27:57Z

I think there is benefit to having a readable/debuggable format in the internal json.

What does the LLVM disassembly look like? I don't see it in your example output

This line is the LLVM disassembly

"assembly": "adrp x0, #131072",

l-kent · 2024-02-16T00:43:03Z

What is that benefit supposed to be? As I said, no one needs to directly read the internal JSON. The debug-gts.py output already contains most of this information, and would be a more suitable location for the rest of this.

katrinafyi · 2024-02-16T01:57:44Z

As I see it, the goal of this PR is to consolidate the gtirb processing into one tool. debug-gts.py was a bit of a quick-fix solution to the unreadable gtirb problem.

Regarding space, increasing the gtirb size could lead to a faster increase in the git repository size. But I think a better solution here is to (eventually) move the gtirb files outside of git. Is this done for reproducibility of the results (i.e. avoiding compiler and ddisasm differences)? If so, this might be solvable by making the compiler->ddisasm pipeline deterministsic. Otherwise, the binary files could also be stored elsewhere with less space constraints.

One more advantage of Ocaml is we can use Aslp's pp_stmt method to obtain pretty-printed semantics. This can provide slightly nicer output such as, for adds x1, x2, x3,

constant bits ( 64 ) Cse0__5 = add_bits.0 {{ 64 }} ( __array _R [ 2 ],__array _R [ 3 ] ) ;
PSTATE . V = not_bits.0 {{ 1 }} ( cvt_bool_bv.0 {{  }} ( eq_bits.0 {{ 65 }} ( SignExtend.0 {{ 64,65 }} ( Cse0__5,65 ),add_bits.0 {{ 65 }} ( SignExtend.0 {{ 64,65 }} ( __array _R [ 2 ],65 ),SignExtend.0 {{ 64,65 }} ( __array _R [ 3 ],65 ) ) ) ) ) ;
PSTATE . C = not_bits.0 {{ 1 }} ( cvt_bool_bv.0 {{  }} ( eq_bits.0 {{ 65 }} ( ZeroExtend.0 {{ 64,65 }} ( Cse0__5,65 ),add_bits.0 {{ 65 }} ( ZeroExtend.0 {{ 64,65 }} ( __array _R [ 2 ],65 ),ZeroExtend.0 {{ 64,65 }} ( __array _R [ 3 ],65 ) ) ) ) ) ;
PSTATE . Z = cvt_bool_bv.0 {{  }} ( eq_bits.0 {{ 64 }} ( Cse0__5,'0000000000000000000000000000000000000000000000000000000000000000' ) ) ;
PSTATE . N = Cse0__5 [ 63 +: 1 ] ;
__array _R [ 1 ] = Cse0__5 ;

This is not trivial to replicate in other languages.

l-kent · 2024-02-16T03:02:46Z

The .gtirb and executable files are not included in the BASIL repo, but the .gts and .json (from debug-gts.py) files are. Reproducibility of the results is necessary - we want to be testing BASIL for specific .gts files, not testing the entire pipeline all at once. Making the pipeline deterministic and only storing the test source files is not a solution to this, because changes to the lifting/disassembly pipeline would then propagate to BASIL and it would be more difficult to identify the source of issues. It would also significantly slow the execution of tests.

Storing all this information directly in the .gts does nothing for convenience, as it will still be necessary to run a script to extract the json from the .gts file in order to actually read it, meaning we end up in exactly the same place except with extra redundancy.

Since the .json files are directly derivable from the .gts at present, it would be possible to not store the .json files and only create them locally (from the .gts files which are in the repo) in order to save space.

There is no point to adding further information to the .gts files unless it is necessary for BASIL, or not directly derivable from the information that is already there.

katrinafyi · 2024-02-16T04:14:16Z

Well, the point of making the compile/ddisasm pipeline reproducible is to also pin the pipeline, thus avoiding changes, and producing byte-for-byte identical gts files from only the c source files - addressing the need for reproducible results. However, this approach is looking like a lot of work and might not be practical.

My second point was that storing binary files (of any kind) in git can quickly bloat the repository size. At the moment, the .gts files are 57MB and it will increase by this much every time the files are updated. Is this something on your mind? If size is a concern, it might be a good idea to avoid committing .gts files at all? The reproducible pipeline is only one such solution for this.

As for this PR, I would like to see the pretty-printed semantics included but I am otherwise neutral.

l-kent · 2024-02-16T04:34:34Z

We can't avoid committing the .gts files at present, and ultimately the reproducible pipeline is a separate issue to the question of what should be contained within the .gts files right now. I don't have a problem if you want to pursue a stable, reproducible pipeline, but we also aren't at a stage where the pipeline is stable.

I'm not necessarily opposed to including the pretty-printed semantics (though they aren't in this PR), they're something that may not be directly derivable from what's already there. I suppose it may be possible to replace debug-gts.py with something written in OCaml that serves the same purpose and can make use of the ASLp pretty printer but that's probably a lot of hassle - maybe worth pursuing though?

see: UQ-PAC/gtirb-semantics#6

we just use Printf now.

a bit of a hack to support compilation on debian.

see: UQ-PAC/gtirb-semantics#6

schema for semantics and successors are tweaked to avoid storing data within JSON keys. addresses are also now formatted as hex.

ailrst mentioned this pull request Feb 15, 2024

Gtirb to ir UQ-PAC/BASIL#161

Merged

katrinafyi mentioned this pull request Feb 16, 2024

fix for missing semantics. #7

Merged

katrinafyi added a commit to katrinafyi/pac-nix that referenced this pull request Feb 25, 2024

gtirb-semantics: update package in line with llvm FFI usage.

065a1e0

see: UQ-PAC/gtirb-semantics#6

ailrst and others added 9 commits March 1, 2024 14:06

add human readable llvm-mc disas to output

370643f

lookup labels

7d56278

add prettyprinted semantics

06b08ff

tweak dune files for easier Nix

5455f2c

clean up opcode handling of endianness.

18b0b32

remove hexstring library.

d6e98fb

we just use Printf now.

restore flags to bin/dune.

e26174c

a bit of a hack to support compilation on debian.

more robust llvm-config usage.

d359e05

print json output

96e938f

katrinafyi force-pushed the instruction-addrs branch from a174d4b to 96e938f Compare March 1, 2024 04:12

katrinafyi added a commit to katrinafyi/pac-nix that referenced this pull request Mar 1, 2024

gtirb-semantics: update package in line with llvm FFI usage.

9168027

see: UQ-PAC/gtirb-semantics#6

katrinafyi added a commit that referenced this pull request Sep 18, 2024

debug_gts: rearrange semantics schema and add address (fixes #6)

b6f6674

schema for semantics and successors are tweaked to avoid storing data within JSON keys. addresses are also now formatted as hex.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Instruction addrs, block labels & llvm disassembly #6

Instruction addrs, block labels & llvm disassembly #6

ailrst commented Feb 15, 2024

l-kent commented Feb 15, 2024

ailrst commented Feb 16, 2024

l-kent commented Feb 16, 2024

katrinafyi commented Feb 16, 2024

l-kent commented Feb 16, 2024 •

edited

Loading

katrinafyi commented Feb 16, 2024 •

edited

Loading

l-kent commented Feb 16, 2024

Instruction addrs, block labels & llvm disassembly #6

Are you sure you want to change the base?

Instruction addrs, block labels & llvm disassembly #6

Conversation

ailrst commented Feb 15, 2024

l-kent commented Feb 15, 2024

ailrst commented Feb 16, 2024

l-kent commented Feb 16, 2024

katrinafyi commented Feb 16, 2024

l-kent commented Feb 16, 2024 • edited Loading

katrinafyi commented Feb 16, 2024 • edited Loading

l-kent commented Feb 16, 2024

l-kent commented Feb 16, 2024 •

edited

Loading

katrinafyi commented Feb 16, 2024 •

edited

Loading