Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gtirb to ir #161

Merged
merged 87 commits into from
Feb 22, 2024
Merged

Gtirb to ir #161

merged 87 commits into from
Feb 22, 2024

Conversation

Megatomato
Copy link
Contributor

@Megatomato Megatomato commented Jan 31, 2024

Provides a new frontend for the BASIL tool, taking dissasembly data from ddisasm and GTIRB instead of BAP.

To use, just use a .gts file as the argument to the tool, instead of the .adt, and it spits out a .bpl file like normal. It seeks to replicate the old interface as much as it can.

You can switch between the two interfaces by commenting the GTIRB LOGIC code and everything beneath it in the LoadBap function in RunUtils, and uncommenting all the code underneath BAP LOGIC. Also, be sure to change the type of loadBap from Program to BAPProgram, as the new interface directly outputs an IR program in one translation parse.

Broadly, it is about as good as the old interface, with the notable exception that GTIRB lacks function parameters and returns, so those are hardcoded as of now. Alistair has also lifted every example in the tests folder to a .gts.

Megatomato added 30 commits July 6, 2023 14:51
@ailrst
Copy link
Contributor

ailrst commented Feb 8, 2024

aarch64-linux-gnu-gcc (Debian 13.2.0-2) 13.2.0

@l-kent
Copy link
Contributor

l-kent commented Feb 8, 2024

I have 11.3.0, which explains the difference.

@Megatomato
Copy link
Contributor Author

Megatomato commented Feb 8, 2024

Rather than creating a completely new UUID, it would be best to just give it the same label as its originator block, just with $_n (for the nth additional block created from the original block) after it, so we can maintain the correspondence.

This was my idea initally as well, and I wish we could, but then it messes up the Decoding/Encoding from Base64. The main reason why these UUIDs are so annoying is because the jumps are filtered using GTIRBs Edge class, which takes two ByteStrings (one source and one target). Thus, when new blocks are created, edges need to be added, and these edges require the decoded base64 string.

@l-kent
Copy link
Contributor

l-kent commented Feb 8, 2024

That seems like a problem that is possible to solve though, I'll work on that at some point.

Comment on lines 136 to 148
name = symbols.find(functionNames(uuid) == _.uuid).get.name

val entryBlocks: mutable.Set[ByteString] = functionEntries(uuid)

val result = entryBlocks
.flatMap(e => edgeMap.getOrElse(Base64.getUrlEncoder.encodeToString(e.toByteArray), None).iterator)
.flatMap(elem => mods.flatMap(_.proxies).find(_.uuid == elem.targetUuid))
.flatMap(proxy => symMap.getOrElse(proxy.uuid, None))
.map(p => p.name)

if (result.nonEmpty) {
return result.head //. head seems weird here but i guess it works
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this section do? How do the proxy blocks come into getting the name of a function associated with a UUID?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For weird linker functions introduced by the compiler, gtirb usually attaches the function name to a proxyblock instead of a block. In this case, I need to get the block that the proxybloxk is connected to, to find the right name of that function. This doesn't seem to be a problem for any functions the user has written themselves.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok. Those PLT bridging functions don't actually have names as such. When they jump to an external function that does have a name, BAP just uses that name for the bridging function too. Since DDisasm doesn't do that, I'll make it so what we're doing here is more explicit.

@ailrst
Copy link
Contributor

ailrst commented Feb 15, 2024

I have a pr to add more information to the gtirb output UQ-PAC/gtirb-semantics#6

It now stores the label next to the block (if exists), and the disassembly and address next to the instructions.

This makes the semantics located at block["instructions"][i]["semantics"].

@Megatomato
Copy link
Contributor Author

@l-kent
Is there anything else we need to add/ review on this pr?
I'm asking since it's my last day, so if you want any changes with the base gtirb functionality, today would probably be better.

Thank you.

@l-kent
Copy link
Contributor

l-kent commented Feb 16, 2024

I don't need anything else from you, thanks.

@l-kent
Copy link
Contributor

l-kent commented Feb 16, 2024

I'm still working on cleaning it up to get it to parse all of CNTLM, and solving the issue with correspondence between the UUIDs for split blocks, but that's almost done now.

@l-kent l-kent force-pushed the walter-gtirb-to-ir branch from 2dddffe to 0e5489a Compare February 21, 2024 01:23
@l-kent
Copy link
Contributor

l-kent commented Feb 21, 2024

I'm ready to merge this. I've cleaned up some of the big cntlm files from this branch's history (just the ones committed in this branch).

@ailrst if you want to have a quick look before I merge feel free

private val symbols = mods.flatMap(_.symbols).map(s => s.uuid -> s).toMap
private val uuidToSymbols = createSymbolMap()
private val uuidToProcedure: mutable.Map[ByteString, Procedure] = mutable.Map()
private val entranceUUIDtoProcedure: mutable.Map[ByteString, Procedure] = mutable.Map()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between uuidToProcedure and entranceUUIDtoProcedure? Can we comment these mutable maps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GTIRB functions have a UUID that represents them which is used in various mappings, but this is separate from the UUID for the function's entrance block, which we also commonly encounter and need to know which procedure it refers to.

private val uuidToBlock: mutable.Map[ByteString, Block] = mutable.Map()
private val proxies = mods.flatMap(_.proxies.map(p => p.uuid -> p)).toMap
private val blockOutgoingEdges = createCFGMap()
private val externalProcedures = mutable.Map[String, Procedure]()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot of mutable state here and it is not clear when in the translation process it becomes valid / populated, it would be good to make this code clearer at some point about where these values come from and what they are used for (i.e. they are returned by functions and passed via parameters rather than being shared class state).

"$" + procedure.name + "$__" + blockCount + "__$" + byteStringToString(label).replace("=", "").replace("-", "~").replace("/", "\'")
}

private def cleanUpIfPCAssign(block: Block, procedure: Procedure): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we comment what this does? It seems to both resolve the TempIfs and indirect calls?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it handles the block splitting required to clean up the TempIfs and any stray _PC assignments that were not already removed, which exist due to indirect calls that were not identified by DDisasm due to a bug in it (so far these are only blr instructions that DDisasm does not handle properly)

var currentStatement = currentBlock.statements.head()
var breakLoop = false
val queue = mutable.Queue[Block]()
while (!breakLoop) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use loop.breakable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't any more convenient here

newBlocks.append(afterBlock)
afterBlock.replaceJump(currentBlock.jump)
// TODO currently assume return target is afterBlock, probably best to check properly though once calculating address of instructions is done
afterBlock
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we create an issue for this? We have address instructions now I believe

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the addresses of instructions are derived and just used as part of the statement labels. The return target being the rest of the block is a safe assumption if it is a BLR instruction (so far the only relevant case), but I don't know if there's anything else DDisasm fails to handle like this.

throw Exception(s"inconsistent size parameters in Mem.set.0: ${ctx.getText}")
}
if (mysteryArg != 0) {
Logger.debug(s"mystery 3rd arg of Mem.set.0 has value $mysteryArg: ${ctx.getText}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 3rd arg is AccessType, maybe worth being less strict since we could see vector and atomic? It guess it hasn't shown up in practice but maybe worth commenting.

enumeration AccType {AccType_NORMAL, AccType_VEC,        // Normal loads and stores
                     AccType_STREAM, AccType_VECSTREAM,  // Streaming loads and stores
                     AccType_ATOMIC, AccType_ATOMICRW,   // Atomic loads and stores
                     AccType_ORDERED, AccType_ORDEREDRW, // Load-Acquire and Store-Release
                     AccType_ORDEREDATOMIC,              // Load-Acquire and Store-Release with atomic access
                     AccType_ORDEREDATOMICRW,
                     AccType_LIMITEDORDERED,             // Load-LOAcquire and Store-LORelease
                     AccType_UNPRIV,                     // Load and store unprivileged
                     AccType_IFETCH,                     // Instruction fetch
                     AccType_PTW,                        // Page table walk
                     AccType_NONFAULT,                   // Non-faulting loads
                     AccType_CNOTFIRST,                  // Contiguous FF load, not first element
                     AccType_NV2REGISTER,                // MRS/MSR instruction used at EL1 and which is converted
                                                         // to a memory access that uses the EL2 translation regime
                     // Other operations
                     AccType_DC,                         // Data cache maintenance
                     AccType_DC_UNPRIV,                  // Data cache maintenance instruction used at EL0
                     AccType_IC,                         // Instruction cache maintenance
                     AccType_DCZVA,                      // DC ZVA instructions
                     AccType_AT};                        // Address translation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like a reason to throw an exception if we encounter another value, since we have not encountered any others yet and they may require a different translation. Where are the semantics of Mem.set and Mem.read defined? They aren't something that directly exists in the original ASL

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its the implementation of the Mem[] array access operator https://github.com/UQ-PAC/aslp/blob/partial_eval/mra_tools/arch/arch.asl#L11795, and the actual name comes from the internal implementation of the interpreter https://github.com/UQ-PAC/aslp/blob/partial_eval/libASL/tcheck.ml#L1663

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for that. Looking at that I feel comfortable enough with the approach of just logging it for further investigation if we encounter a non-0 value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree that should be enough

@l-kent l-kent merged commit c5d396a into main Feb 22, 2024
1 check passed
@l-kent l-kent deleted the walter-gtirb-to-ir branch February 23, 2024 02:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants