Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add binutils encoding/match verification for all instructions description. #258

Closed
wants to merge 17 commits into from

Conversation

AFOliveira
Copy link
Collaborator

I added subtrees for binutils to have the necessary information. I know this adds a lot of unnecessary information, but the other options were submodules (which is around 3gb) or copy pasting the the files(which is not updatable).

The script is using only binutils to generate all the encondings. Warnings are instructions that the script can't parse (yet), and errors are for differences between binutils and the yamls. (Which surprisingly exist on the DB right now).

Should I add this to the regression test?

…22ac94

git-subtree-dir: ext/binutils-gdb/include/opcode
git-subtree-split: eabd522ac949549199802378ee3a04cd55dc3275
git-subtree-dir: ext/binutils-gdb/opcodes
git-subtree-split: ddce56a70c12095b8f72c2f48ad4194df9a62386
@dhower-qc
Copy link
Collaborator

Unfortunately, we can't add these files in the repo since they are GPL licensed. Perhaps we can use a submodule with a shallow fetch to avoid the space overhead?

Signed-off-by: Afonso Oliveira <[email protected]>
Signed-off-by: Afonso Oliveira <[email protected]>
@AFOliveira
Copy link
Collaborator Author

@dhower-qc I updated it, but I believe that the shallow fetch will have to be at local submodule initialization.

@dhower-qc
Copy link
Collaborator

@dhower-qc I updated it, but I believe that the shallow fetch will have to be at local submodule initialization.

Thanks. I've asked our legal to team to review this setup, just to be sure that we aren't inadvertently tainting UnifiedDB with GPL. I'll get back when I have their opinion (should be a few days).

@AFOliveira
Copy link
Collaborator Author

AFOliveira commented Nov 13, 2024

Thanks. I've asked our legal to team to review this setup, just to be sure that we aren't inadvertently tainting UnifiedDB with GPL. I'll get back when I have their opinion (should be a few days).

Great!

If we also want to add this to the regression test, I have that done with just one thing missing. I was not able to add pyyaml to singularity neither through pip install nor apt-get install -y --no-install-recommends python3-yaml. Do you think you can set that up if this goes through?

@dhower-qc
Copy link
Collaborator

For libraries, I've been installing them outside the container so we don't have to rebuild it every time a dependence changes. You'll see in bin/setup that it runs bundle (ruby package management) and installs them in $root/.home/.gems (that is, in the repo working directory). It also runs npm to install the Node dependencies locally ($root/node_modules). When the container starts, it actually maps $root/.home as the home directory.

We should be able to replicate this for python. pip is already installed in the container.

Signed-off-by: Afonso Oliveira <[email protected]>
Signed-off-by: Afonso Oliveira <[email protected]>
@AFOliveira
Copy link
Collaborator Author

Added the regression test part.

@AFOliveira
Copy link
Collaborator Author

#269 should fix this regression test failing

@dhower-qc
Copy link
Collaborator

Sorry for the run-around; I added some files in #270 in order to support python dependencies better (also needed by pre-commit that @kbroch-rivosinc is adding). With that, you can run ./bin/python to pick up pyyaml in the venv. I've already added pyyaml to requirements.txt, which will be automatically installed.

@AFOliveira
Copy link
Collaborator Author

AFOliveira commented Nov 15, 2024

I'll wait for your merge into main and then rebase to remove the python part.

Signed-off-by: Afonso Oliveira <[email protected]>
@dhower-qc
Copy link
Collaborator

Just an update: I have heard back from our legal team, but they have more questions. Appreciate your patience while we sort it out.

@AFOliveira
Copy link
Collaborator Author

AFOliveira commented Nov 19, 2024

Thank you for the feedback. This already functional with the newest PRs, but let me know when and if you get the greenlight so I make it mergeable.

@AFOliveira
Copy link
Collaborator Author

@dhower-qc any news on this?

@dhower-qc
Copy link
Collaborator

Yes, the advice is that:

  • Having a submodule to binutils (GPL code) is ok in isolation
  • Basing BSD-3 code (UnifiedDB) on GPL code (binutils) could result in a copyleft of UnifiedDB (want to avoid!)
  • Having a comparison script is OK if it is "cleanroomed". So the engineer creating the script and viewing GPL code (Afonso) cannot be the one fixing the BSD-3 code. Instead, Afonso would have to create a bug report, without proposing a solution, for another engineer to fix.

In summary, yuck.

I wonder if there is a way to compare instead again LLVM? @lenary, any thoughts on this?

@lenary
Copy link
Collaborator

lenary commented Dec 10, 2024

You also definitely don't want to bring in all of LLVM, so whether we're comparing against LLVM or against binutils, I think the best approach is to use environment/configure variables to point at an external source/build tree (maybe both?) which testing can use, without modifying that tree. This is how the LLVM test suite works for e.g. SPEC which is a proprietary test suite (and others).

As for actually testing in this manner against LLVM, there are a few ways forward. Probably the best is to dump the info from the RISC-V tablegen descriptions into JSON (this would have to be done with llvm-tblgen, which needs to be built), and then write scripts to parse the JSON and process it against the YAML encodings. This is effectively a build step (invoking the LLVM build for llvm-tblgen), and then invoking that tool and using the data it emits (from processing LLVM code/data). I hope that lawyers would find this approach acceptable, but I wouldn't like to speak for them.

I would need to more closely examine the RISC-V tablegen output, to make sure it has what we need for validation, but I believe it should, having frequently read the expanded tablegen information.

This does not solve the problem of "is binutils correct" at all. Another approach is that, given Clang/LLVM and GCC/Binutils aim to be compatible with each other, to use the assembler (clang or as respectively) and disassembler (llvm-objdump or objdump respectively) at a higher level. This is a little more involved, because we'd need to e.g. disable the automatic compression of instructions that the assemblers do when smaller instructions are available, and do things like turn off alias printing in the disassembler (effectively the reverse process, but aliases are also used for canonical disassembly, like mov xM, xN instead of addi xM, xN, 0). On the other hand, this is easier because you only need to point a verifier at those two tools, and you shouldn't even need the Binutils/LLVM sources.

@AFOliveira
Copy link
Collaborator Author

Yes, the advice is that:

Having a submodule to binutils (GPL code) is ok in isolation
Basing BSD-3 code (UnifiedDB) on GPL code (binutils) could result in a copyleft of UnifiedDB (want to avoid!)
Having a comparison script is OK if it is "cleanroomed". So the engineer creating the script and viewing GPL code (Afonso) cannot be the one fixing the BSD-3 code. Instead, Afonso would have to create a bug report, without proposing a solution, for another engineer to fix.
In summary, yuck.

I wonder if there is a way to compare instead again LLVM? @lenary, any thoughts on this?

Yuck :)

I believe this is of great utility for the UDB and I truly would like to find a way of fitting it without harming the UDB with the whole licensing spaghetti, maybe keeping it outside the repo (as @lenary described) and just adding some external reference would be possible? I believe we will also have this problem with LLVM, no? Even with the more permissive licensing I guess we will have to have some cautions.

I've actualy looked into LLVM to find a way of doing this before, but due to my lack of knowledge I couldn't find what I needed and dropped it.

This does not solve the problem of "is binutils correct" at all. Another approach is that, given Clang/LLVM and GCC/Binutils aim to be compatible with each other, to use the assembler (clang or as respectively) and disassembler (llvm-objdump or objdump respectively) at a higher level. This is a little more involved, because we'd need to e.g. disable the automatic compression of instructions that the assemblers do when smaller instructions are available, and do things like turn off alias printing in the disassembler (effectively the reverse process, but aliases are also used for canonical disassembly, like mov xM, xN instead of addi xM, xN, 0). On the other hand, this is easier because you only need to point a verifier at those two tools, and you shouldn't even need the Binutils/LLVM sources.

A question on this, how would I be able to dump all the instructions out of this disassembler, any files I should look for?

I've tried getting the info from LLVM but dropped it due to my (non-existing) progress, I believe the first pointer is already enough for me to give it a second shot, I'll keep you guys updated on how it goes.

@lenary
Copy link
Collaborator

lenary commented Dec 11, 2024

A question on this, how would I be able to dump all the instructions out of this disassembler, any files I should look for?

With an LLVM source tree, and having built llvm-tblgen, I the command you need is:

llvm-tblgen -I llvm/include -I llvm/lib/Target/RISCV llvm/lib/Target/RISCV/RISCV.td --dump-json -o <path-to-json-output>

This will get you absolutely everything in the tablegen files for RISC-V (and some of the target-independent definitions), so is about 180MB. The tablegen programming model is a bit weird, classes/instances are more for prototypical inheritance than java-style inheritance, and things can inherit from multiple classes and get the properties of all of them.

For a first clue, the !instanceof key contains a map of which keys in the main map are an instance of which class. RISC-V instructions should all inherit from RVInstCommon, so pulling that list should tell you all the object names that are relevant. We document some of the fields in llvm/lib/Target/RISCV/RISCVInstrFormats.td, and some will come from llvm/include/llvm/Target/Target.td.

The encoding information is in the Inst field (including which bits come from which operands), and the AsmString should contain at least the mnemonic, but also the operands. Ignore SoftFail, we don't use it on RISC-V, but we do use TSFlags (target-specific flags), which are documented in RISCVInstrFormats.td.

Similarly, the Extension info is in instances of RISCVExtension, and profiles info are in instances of RISCVProfile. Some things are not in Tablegen (relocations), but you should have more than enough to start with from here.

@lenary
Copy link
Collaborator

lenary commented Dec 11, 2024

(You asked for "dump all the instructions out the disassembler", which the closest way to do so is the tablegen information, via json - normally tablegen generates C++ which is invoked by other hand-written C++)

Part of my envisioning with the second assembler/disassembler approach, is that you use the database to generate valid or invalid bitpatterns (in an object) or instructions (as text), and pass those in to the tools, and look at the output. This sort of approaches the problem in more of a black box manner, but also has throughput difficulties if you want to scan the entire <=32-bit instruction space.

@AFOliveira AFOliveira mentioned this pull request Dec 13, 2024
@AFOliveira
Copy link
Collaborator Author

Closing this due to the licensing problems.

@AFOliveira AFOliveira closed this Dec 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants