-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Instruction addrs, block labels & llvm disassembly #6
base: main
Are you sure you want to change the base?
Conversation
I'm not sure any of this really serves much purpose. Readability of the internal JSON does not matter - no one needs to directly read the internal JSON, and we have debug-gts.py to create more human-readable outputs. What does the LLVM disassembly look like? I don't see it in your example output. I also don't see the point in adding that and the opcode to the internal JSON for the semantics - those aren't going to be any use to BASIL so they will just take up space. If they are necessary for something else, it would make more sense to add those to debug-gts.py output. The labels you add will in practice just be the function name for function entry blocks, so adding it to the internal semantics representation doesn't really serve much purpose when we already have access to that information. We can already derive the address of blocks and instructions from the GTIRB data structures, so there is not much benefit to duplicating that information. It would be useful to have the addresses of instructions displayed in debug-gts.py though (the addresses of blocks are already displayed). It would be easier to add support for changes like this to BASIL after the changes have been merged in - it is not necessary to wait for BASIL to be updated. |
I think there is benefit to having a readable/debuggable format in the internal json.
This line is the LLVM disassembly
|
What is that benefit supposed to be? As I said, no one needs to directly read the internal JSON. The debug-gts.py output already contains most of this information, and would be a more suitable location for the rest of this. |
As I see it, the goal of this PR is to consolidate the gtirb processing into one tool. debug-gts.py was a bit of a quick-fix solution to the unreadable gtirb problem. Regarding space, increasing the gtirb size could lead to a faster increase in the git repository size. But I think a better solution here is to (eventually) move the gtirb files outside of git. Is this done for reproducibility of the results (i.e. avoiding compiler and ddisasm differences)? If so, this might be solvable by making the compiler->ddisasm pipeline deterministsic. Otherwise, the binary files could also be stored elsewhere with less space constraints. One more advantage of Ocaml is we can use Aslp's pp_stmt method to obtain pretty-printed semantics. This can provide slightly nicer output such as, for adds x1, x2, x3, constant bits ( 64 ) Cse0__5 = add_bits.0 {{ 64 }} ( __array _R [ 2 ],__array _R [ 3 ] ) ;
PSTATE . V = not_bits.0 {{ 1 }} ( cvt_bool_bv.0 {{ }} ( eq_bits.0 {{ 65 }} ( SignExtend.0 {{ 64,65 }} ( Cse0__5,65 ),add_bits.0 {{ 65 }} ( SignExtend.0 {{ 64,65 }} ( __array _R [ 2 ],65 ),SignExtend.0 {{ 64,65 }} ( __array _R [ 3 ],65 ) ) ) ) ) ;
PSTATE . C = not_bits.0 {{ 1 }} ( cvt_bool_bv.0 {{ }} ( eq_bits.0 {{ 65 }} ( ZeroExtend.0 {{ 64,65 }} ( Cse0__5,65 ),add_bits.0 {{ 65 }} ( ZeroExtend.0 {{ 64,65 }} ( __array _R [ 2 ],65 ),ZeroExtend.0 {{ 64,65 }} ( __array _R [ 3 ],65 ) ) ) ) ) ;
PSTATE . Z = cvt_bool_bv.0 {{ }} ( eq_bits.0 {{ 64 }} ( Cse0__5,'0000000000000000000000000000000000000000000000000000000000000000' ) ) ;
PSTATE . N = Cse0__5 [ 63 +: 1 ] ;
__array _R [ 1 ] = Cse0__5 ; This is not trivial to replicate in other languages. |
The Storing all this information directly in the Since the .json files are directly derivable from the There is no point to adding further information to the |
Well, the point of making the compile/ddisasm pipeline reproducible is to also pin the pipeline, thus avoiding changes, and producing byte-for-byte identical gts files from only the c source files - addressing the need for reproducible results. However, this approach is looking like a lot of work and might not be practical. My second point was that storing binary files (of any kind) in git can quickly bloat the repository size. At the moment, the .gts files are 57MB and it will increase by this much every time the files are updated. Is this something on your mind? If size is a concern, it might be a good idea to avoid committing .gts files at all? The reproducible pipeline is only one such solution for this. As for this PR, I would like to see the pretty-printed semantics included but I am otherwise neutral. |
We can't avoid committing the .gts files at present, and ultimately the reproducible pipeline is a separate issue to the question of what should be contained within the .gts files right now. I don't have a problem if you want to pursue a stable, reproducible pipeline, but we also aren't at a stage where the pipeline is stable. I'm not necessarily opposed to including the pretty-printed semantics (though they aren't in this PR), they're something that may not be directly derivable from what's already there. I suppose it may be possible to replace debug-gts.py with something written in OCaml that serves the same purpose and can make use of the ASLp pretty printer but that's probably a lot of hassle - maybe worth pursuing though? |
we just use Printf now.
a bit of a hack to support compilation on debian.
a174d4b
to
96e938f
Compare
schema for semantics and successors are tweaked to avoid storing data within JSON keys. addresses are also now formatted as hex.
Adds more information and structure to the json inserted into the gtirb output.
This sacrifices some compactness for readability and convenience.
This also adds llvm-16 as a dependency of this package.
For each instruction we now have an address (just block_address + i * opcode_size), the llvm disassembly, and the opcode itself.
For each block if a label can be found in any of the
Module.symbols
that refers to the block uuid then it is added.The output now looks as below, so we will hold off on merging this until the basil gtirb loader is updated accordingly.