
LLVM verification #356

Draft: wants to merge 7 commits into main
Conversation

AFOliveira (Collaborator)

Following up on #258.

I've been working on what @lenary proposed as a first approach, and I think this is an OK first mock-up. It's still a WIP, since many instructions still have bugs, but if you have any comments or recommendations, please let me know :).

I did not add LLVM as a submodule yet because it may be easier, licensing-wise, to just point at it in the script? Usage is python3 riscv_parser.py <tablegen_json_file> <arch_inst_directory>. I've also used jq to improve the readability of the output of llvm-tblgen -I llvm/include -I llvm/lib/Target/RISCV llvm/lib/Target/RISCV/RISCV.td --dump-json -o <path-to-json-output>.

AFOliveira requested a review from lenary on December 13, 2024, 13:48
lenary (Collaborator) left a comment

I think this is a good start, and is heading in the right direction. Some immediate comments before I look again in the next few days.

Is there anything specific you want feedback on?

Which instructions are you running into issues with so far?

Another overall thought I have is: if this is a test/validation suite, could this be structured using pytest? For instance, you could parse the YAML and the JSON to match up instruction descriptions, and use that to create parameterized fixtures (one instance per matched description) - the advantage of this is that you get all of the niceness of pytest's asserts, and pytest's test suite reports, without having to reimplement all the testcase management. This also makes it easier to split the test code that checks the encoding matches from tests that check the assembly strings match (for example, but we can think of others).
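Roughly, something like the following could work (just a sketch; the loader helpers and field names here are hypothetical, not part of this PR):

import pytest

# Hypothetical helpers: each is assumed to return a dict keyed by mnemonic.
from parsing import load_udb_instructions, load_llvm_instructions

UDB = load_udb_instructions("arch/inst")
LLVM = load_llvm_instructions("riscv_tblgen.json")

# One test instance per instruction description present in both sources.
MATCHED = sorted(set(UDB) & set(LLVM))

@pytest.fixture(params=MATCHED, ids=str)
def instruction_pair(request):
    name = request.param
    return UDB[name], LLVM[name]

def test_encoding_matches(instruction_pair):
    udb, llvm = instruction_pair
    assert udb["encoding"] == llvm["encoding"]

def test_asm_string_matches(instruction_pair):
    udb, llvm = instruction_pair
    assert udb["assembly"] == llvm["AsmString"]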

output_stream.write("-" * 20 + "\n")
output_stream.write(f"Name: {name}\n")
output_stream.write(f"Assembly Format: {safe_get(data, 'AsmString', 'N/A')}\n")
output_stream.write(f"Size: {safe_get(data, 'Size', 'N/A')} bytes\n")
lenary (Collaborator)

If an instruction doesn't have a Size, it cannot be encoded, so it's likely not of interest to checking the encoding.
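For instance (just a sketch, reusing the script's existing safe_get helper), the dump loop could skip such records up front:

size = safe_get(data, "Size", 0)
if not size:
    continue  # no Size means no encoding, so not relevant to encoding checks
output_stream.write(f"Size: {size} bytes\n")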

Comment on lines 165 to 168
output_stream.write(f"Commutable: {'Yes' if safe_get(data, 'isCommutable', 0) else 'No'}\n")
output_stream.write(f"Memory Load: {'Yes' if safe_get(data, 'mayLoad', 0) else 'No'}\n")
output_stream.write(f"Memory Store: {'Yes' if safe_get(data, 'mayStore', 0) else 'No'}\n")
output_stream.write(f"Side Effects: {'Yes' if safe_get(data, 'hasSideEffects', 0) else 'No'}\n")
lenary (Collaborator)

At some point we probably need a good discussion about these and how we verify them.

Broadly:

  • isCommutable will only be set if we can teach LLVM how to commute the instruction - using a hook implemented in C++. We do this quite a lot for e.g. inverting conditions. This is not something the assembler will do, it's done earlier, in code generation though.
  • mayLoad and mayStore should be accurate enough, but I guess you'll need to statically analyse the pseudocode to work out if a load or store happens. We really only model loads/stores to conventional memory (and not, e.g., loads for page table walks or permission checks). I'm not sure there's any modelling of ordering. These are used to prevent code motion during code generation.
  • hasSideEffects is a catch all for "be very careful with this instruction during codegen", and usually points to other effects that LLVM doesn't model. Generic CSR accesses are part of this (except the floating point CSR, which I think we model correctly), but so are other effects I haven't thought hard about.

AFOliveira (Collaborator, Author)

Do you think those should be explained in the UDB as well?

lenary (Collaborator)

I think this probably needs a longer discussion about how denormalised this information should be. That is, the information is there if you statically analyse the pseudocode (which should absolutely be something we are able to do with the pseudocode), but we probably don't want that to be the only way to find out this sort of thing, as the pseudocode for an operation might be a large amount of code.

I don't think you should be exactly matching LLVM's internal representation, but I do think there is the opportunity to denormalise more information that might be generally useful for this kind of tool.

AFOliveira (Collaborator, Author), Dec 14, 2024

I agree that this needs a more thorough discussion. I think @dhower-qc has been working on instruction representations lately, so he has probably also been thinking about this.

lenary (Collaborator)

I don't have time for this discussion before mid-January.

Let's try to get the things that are already in the YAML done first, i.e.:

  • Instruction Encodings
  • Extensions and Profiles

ext/auto-inst/parsing.py
Comment on lines 246 to 249
"""
Attempt to find a matching key in json_data for instr_name, considering different
naming conventions: replacing '.' with '_', and trying various case transformations.
"""
lenary (Collaborator)

I really suggest using the name from AsmString rather than this.
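For example, a rough sketch of matching on the mnemonic taken from AsmString (assuming the usual "mnemonic\t$operands" layout of that field):

def mnemonic_from_asmstring(record):
    # AsmString is typically "mnemonic\t$operands", so the first whitespace-
    # separated token is the assembly mnemonic (e.g. "add", "c.addi").
    asm = record.get("AsmString", "")
    return asm.split()[0].lower() if asm else None

def find_matching_key(json_data, instr_name):
    instr_name = instr_name.lower().strip()
    for key, value in json_data.items():
        if isinstance(value, dict) and mnemonic_from_asmstring(value) == instr_name:
            return key
    return None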

AFOliveira (Collaborator, Author)

OK, I'll change that; thanks for the suggestion!

AFOliveira (Collaborator, Author)

> I think this is a good start, and is heading in the right direction. Some immediate comments before I look again in the next few days.

Thanks for your feedback!

> Is there anything specific you want feedback on?

Not yet; the point was just to get some general thoughts on the initial approach.

> Which instructions are you running into issues with so far?

I haven't found a pattern yet, but I'll try to fix this soon, and if I run into any issue I can't solve by myself, I'll bring it up. Thanks!

> Another overall thought I have is: if this is a test/validation suite, could this be structured using pytest? For instance, you could parse the YAML and the JSON to match up instruction descriptions, and use that to create parameterized fixtures (one instance per matched description) - the advantage of this is that you get all of the niceness of pytest's asserts, and pytest's test suite reports, without having to reimplement all the testcase management. This also makes it easier to split the test code that checks the encoding matches from tests that check the assembly strings match (for example, but we can think of others).

I'll take a look into pytest and see how to port this; thanks for the suggestion!

Signed-off-by: Afonso Oliveira <[email protected]>
Signed-off-by: Afonso Oliveira <[email protected]>
Signed-off-by: Afonso Oliveira <[email protected]>
AFOliveira (Collaborator, Author) commented on Dec 19, 2024

Current known bugs:

  • Can't parse complex YAML immediate encodings (e.g. 31|7|30-25|11-8); a possible parsing sketch follows below.
  • Some instructions need to be fixed in the UDB, since this has already caught some errors.
  • Need to split the pytest run into separate unit tests rather than one big test.
  • Use AsmString from the JSON.

I'm working on both now.
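One way the complex immediate encodings could be handled (just a sketch; it assumes the location string lists the instruction-bit ranges of the immediate from its most-significant bit downward, separated by |):

def expand_location(location):
    # Expand a location string such as "31|7|30-25|11-8" into the list of
    # instruction bit positions it covers, ordered from the variable's MSB down.
    bits = []
    for chunk in location.split("|"):
        if "-" in chunk:
            msb, lsb = map(int, chunk.split("-"))
            bits.extend(range(msb, lsb - 1, -1))
        else:
            bits.append(int(chunk))
    return bits

# expand_location("31|7|30-25|11-8")
# -> [31, 7, 30, 29, 28, 27, 26, 25, 11, 10, 9, 8]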

lenary (Collaborator) commented on Dec 19, 2024

Somewhere, __pycache__ and *.pyc entries are missing from a .gitignore. :)

Signed-off-by: Afonso Oliveira <[email protected]>
instr_name = instr_name.lower().strip()

# Search through all entries in json_data
for key, value in json_data.items():
lenary (Collaborator)

You can use the !instanceof metadata to only search through things with an encoding; I pointed this out in my comment about the initial approach. This will save you iterating through aliases (and other data, like isel patterns).

AFOliveira (Collaborator, Author)

From what I can see, !instanceof follows this structure, but I'm finding it hard to understand how to use it for parsing purposes. Can you please explain further?

{
  "!instanceof": {
    "AES_1Arg_Intrinsic": [
      "int_arm_neon_aesimc",
      "int_arm_neon_aesmc"
    ],
    "AES_2Arg_Intrinsic": [
      "int_arm_neon_aesd",
      "int_arm_neon_aese"
    ],
    "ALUW_rr": [
      "ADDW",
      "ADD_UW",
      "DIVUW",
      "DIVW",
      "MULW",
      "PACKW",
      "REMUW",
      "REMW",
      "ROLW",
      "RORW",
      "SH1ADD_UW",
      "SH2ADD_UW",
      "SH3ADD_UW",
      "SLLW",
      "SRAW",
      "SRLW",
      "SUBW"
    ],
    "ALU_ri": [
      "ADDI",
      "ANDI",
      "ORI",
      "SLTI",
      "SLTIU",
      "XORI"
    ],

lenary (Collaborator)

So, let's assume you parsed the whole JSON object into a Python variable called json.

json["!instanceof"] is a map, from tablegen classes, to a list of definition names that are instances of that class (or its sub-classes). These definition names appear in the top-level json.

You would use code like the following:

for def_name in json["!instanceof"]["RVInstCommon"]:
    def_data = json[def_name]
    ...

This saves you having to look at all the tablegen data that is not an instruction (so an alias or a pattern or CSR or something).

Note you'll still have to look at isPseudo and isCodeGenOnly and potentially exclude items where one or both of those is true.
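Putting those checks together, roughly (a sketch; field names as they appear in the --dump-json output):

def encoded_instructions(json):
    # Walk only the RISC-V instruction records, skipping pseudo-instructions
    # and codegen-only variants that never appear in assembly.
    for def_name in json["!instanceof"]["RVInstCommon"]:
        def_data = json[def_name]
        if def_data.get("isPseudo", 0) or def_data.get("isCodeGenOnly", 0):
            continue
        yield def_name, def_data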
