LLVM verification #356
base: main
Conversation
Signed-off-by: Afonso Oliveira <[email protected]>
Force-pushed from a56b0b1 to d9b50b2
I think this is a good start, and is heading in the right direction. Some immediate comments before I look again in the next few days.
Is there anything specific you want feedback on?
Which instructions are you running into issues with so far?
Another overall thought I have is: if this is a test/validation suite, could this be structured using pytest? For instance, you could parse the YAML and the JSON to match up instruction descriptions, and use that to create parameterized fixtures (one instance per matched description) - the advantage of this is that you get all of the niceness of pytest's asserts, and pytest's test suite reports, without having to reimplement all the testcase management. This also makes it easier to split the test code that checks the encoding matches from tests that check the assembly strings match (for example, but we can think of others).
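A minimal sketch of that pytest idea, under stated assumptions: the loader names (`load_udb_instructions`, `load_tblgen_instructions`) and the data shapes are hypothetical stand-ins for the real YAML/JSON parsing in the script, and the pairing key (lower-cased mnemonic from `AsmString`) is just one plausible choice.

```python
import pytest

def load_udb_instructions():
    # Would parse the UDB instruction YAML; stubbed here for illustration.
    return {"addi": {"asm": "addi"}}

def load_tblgen_instructions():
    # Would parse the llvm-tblgen --dump-json output; stubbed for illustration.
    return {"ADDI": {"AsmString": "addi $rd, $rs1, $imm12"}}

def matched_pairs():
    udb = load_udb_instructions()
    llvm = load_tblgen_instructions()
    # Pair entries by the lower-cased mnemonic taken from AsmString.
    by_mnemonic = {v["AsmString"].split()[0].lower(): v for v in llvm.values()}
    return [(name, udb[name], by_mnemonic[name])
            for name in udb if name in by_mnemonic]

# One test instance per matched instruction; pytest reports each separately,
# so encoding checks and assembly-string checks can live in separate tests.
@pytest.mark.parametrize("name,udb_inst,llvm_inst", matched_pairs())
def test_assembly_string_matches(name, udb_inst, llvm_inst):
    assert llvm_inst["AsmString"].split()[0].lower() == udb_inst["asm"]
```

With real loaders, each matched instruction would show up as its own pass/fail line in the pytest report, which replaces the hand-rolled testcase management.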
output_stream.write("-" * 20 + "\n")
output_stream.write(f"Name: {name}\n")
output_stream.write(f"Assembly Format: {safe_get(data, 'AsmString', 'N/A')}\n")
output_stream.write(f"Size: {safe_get(data, 'Size', 'N/A')} bytes\n")
If an instruction doesn't have a Size, it cannot be encoded, so it's likely not of interest to checking the encoding.
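That filter could look like the following sketch; the `Size` field name follows the tablegen JSON dump, but the helper itself is illustrative, not from the PR.

```python
def has_encoding(record: dict) -> bool:
    # Records without a non-zero Size cannot be encoded, so the
    # encoding checks can skip them entirely.
    size = record.get("Size", 0)
    return isinstance(size, int) and size > 0
```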
ext/auto-inst/parsing.py
Outdated
output_stream.write(f"Commutable: {'Yes' if safe_get(data, 'isCommutable', 0) else 'No'}\n")
output_stream.write(f"Memory Load: {'Yes' if safe_get(data, 'mayLoad', 0) else 'No'}\n")
output_stream.write(f"Memory Store: {'Yes' if safe_get(data, 'mayStore', 0) else 'No'}\n")
output_stream.write(f"Side Effects: {'Yes' if safe_get(data, 'hasSideEffects', 0) else 'No'}\n")
At some point we probably need a good discussion about these and how we verify them.
Broadly:
- isCommutable will only be set if we can teach LLVM how to commute the instruction, using a hook implemented in C++. We do this quite a lot, e.g. for inverting conditions. This is not something the assembler will do; it happens earlier, during code generation.
- mayLoad and mayStore should be accurate enough, but I guess you'll need to statically analyse the pseudocode to work out if a load or store happens. We really only model loads/stores to conventional memory (and not, e.g., loads for page table walks or permission checks). I'm not sure there's any modelling of ordering. These are used to prevent code motion during code generation.
- hasSideEffects is a catch all for "be very careful with this instruction during codegen", and usually points to other effects that LLVM doesn't model. Generic CSR accesses are part of this (except the floating point CSR, which I think we model correctly), but so are other effects I haven't thought hard about.
Do you think those should be explained in the UDB as well?
I think this probably needs a longer discussion on how denormalised this information should be. i.e., the information is there if you statically analyse the pseudocode (which is absolutely something we should be able to do with the pseudocode), but we probably don't want that to be the only way to find out this sort of thing, as the pseudocode operations might be a large amount of code.
I don't think you should be exactly matching LLVM's internal representation, but I do think there is the opportunity to denormalise more information that might be generally useful for this kind of tool.
I agree that this needs a more appropriate discussion. I think @dhower-qc has been working on instruction representations lately, so he probably also has been thinking about this.
I don't have time for this discussion before mid-January.
Let's try to get the things that are already in the YAML done first, i.e.:
- Instruction Encodings
- Extensions and Profiles
ext/auto-inst/parsing.py
Outdated
"""
Attempt to find a matching key in json_data for instr_name, considering different
naming conventions: replacing '.' with '_', and trying various case transformations.
"""
I really suggest using the name from AsmString rather than this.
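A hedged sketch of that suggestion — the helper name is made up, but the `AsmString` shape ("mnemonic then operands", sometimes tab-separated) follows the tablegen JSON dump:

```python
def mnemonic_from_asmstring(asm_string: str) -> str:
    # tablegen AsmStrings look like "addi $rd, $rs1, $imm12", sometimes with
    # a tab between mnemonic and operands; the mnemonic is the first token.
    # Lower-casing lines it up with UDB instruction names like "add.uw".
    return asm_string.replace("\t", " ").split()[0].lower()
```

This sidesteps the case/underscore guessing entirely, because the mnemonic in AsmString is already the assembly spelling.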
ok, I'll change that, thanks for the suggestion!
Thanks for your feedback!
Not yet, the point was just some general considerations about the initial approach.
I still didn't find a pattern, but I'll try to fix this soon, and if I run into any issue I'm not able to solve by myself, I'll bring it up, thanks!
I'll take a look into pytest and see how to port this, thanks for the suggestion!
Signed-off-by: Afonso Oliveira <[email protected]>
current known bugs are:
I'm working on both now.
Signed-off-by: Afonso Oliveira <[email protected]>
instr_name = instr_name.lower().strip()

# Search through all entries in json_data
for key, value in json_data.items():
You can use the !instanceof metadata to only search through things with an encoding, I pointed this out in the comment about the initial approach. This will save you iterating through aliases (and other data like isel patterns).
From what I can see !instanceof follows this structure, but I'm finding it hard to understand how to use it for the parsing purposes, can you please further explain?
{
  "!instanceof": {
    "AES_1Arg_Intrinsic": [
      "int_arm_neon_aesimc",
      "int_arm_neon_aesmc"
    ],
    "AES_2Arg_Intrinsic": [
      "int_arm_neon_aesd",
      "int_arm_neon_aese"
    ],
    "ALUW_rr": [
      "ADDW",
      "ADD_UW",
      "DIVUW",
      "DIVW",
      "MULW",
      "PACKW",
      "REMUW",
      "REMW",
      "ROLW",
      "RORW",
      "SH1ADD_UW",
      "SH2ADD_UW",
      "SH3ADD_UW",
      "SLLW",
      "SRAW",
      "SRLW",
      "SUBW"
    ],
    "ALU_ri": [
      "ADDI",
      "ANDI",
      "ORI",
      "SLTI",
      "SLTIU",
      "XORI"
    ],
So, let's assume you parsed the whole json object into a python variable called json.
json["!instanceof"] is a map from tablegen classes to a list of definition names that are instances of that class (or its sub-classes). These definition names appear in the top-level json.
You would use code like the following:
for def_name in json["!instanceof"]["RVInstCommon"]:
    def_data = json[def_name]
    ...
This saves you having to look at all the tablegen data that is not an instruction (so an alias or a pattern or CSR or something).
Note you'll still have to look at isPseudo and isCodeGenOnly, and potentially exclude items where one or both of those is true.
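Putting both halves of that advice together, a sketch like the following could work — assuming, per the suggestion above, that "RVInstCommon" is the right base class for the RISC-V tablegen hierarchy:

```python
def encodable_instructions(tblgen: dict):
    """Yield (name, record) pairs for real, encodable instructions."""
    # Walk only the definitions that are instances of the instruction
    # base class, skipping aliases, isel patterns, CSR defs, etc.
    for name in tblgen["!instanceof"].get("RVInstCommon", []):
        record = tblgen[name]
        # Skip pseudo and codegen-only records: they have no real encoding.
        if record.get("isPseudo") or record.get("isCodeGenOnly"):
            continue
        yield name, record
```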
Following up on #258.
I've been doing work on what @lenary proposed as a first approach, I think this is an ok first mock-up. Still a WIP since many instructions still have bugs, but if you have any comments or recommendations, please LMK :).
I did not add LLVM as a submodule yet because it may be easier licensing-wise to just point at it in the script? Usage is python3 riscv_parser.py <tablegen_json_file> <arch_inst_directory>. I've also used jq to enhance readability of the output of llvm-tblgen -I llvm/include -I llvm/lib/Target/RISCV llvm/lib/Target/RISCV/RISCV.td --dump-json -o <path-to-json-output>.