Skip to content
This repository has been archived by the owner on Feb 29, 2024. It is now read-only.

16-bit instruction to shift by a register value #165

Open
jalobaba opened this issue Jun 9, 2022 · 25 comments
Open

16-bit instruction to shift by a register value #165

jalobaba opened this issue Jun 9, 2022 · 25 comments
Assignees
Labels
future idea Something for a future version of the spec

Comments

@jalobaba
Copy link

jalobaba commented Jun 9, 2022

I recently was reviewing the Zc spec and noticed it was missing 16-bit shift instructions by a register value. Zc only currently has 16-bit shift instructions that shift by an immediate value.

@tariqkurd told me these instructions haven't been thought of yet. Can we explore the code size benefit they provide and consider them for Zc?

Note that Arm Thumb-2 includes 16-bit shift by register instructions as do some other ISAs.

BTW, I'm new to this group. I've been watching for a while but plan to get more involved. I work for Qualcomm in San Diego and am a processor architect & designer (mostly embedded processors).

@tariqkurd-repo
Copy link
Contributor

Hi, thanks @jalobaba, I appreciate your input.
The first phase of Zc is adding 16-bit encodings (excluding those relating to EABI which isn't ratified).
The second phase is likely to be adding 32-bit encodings, but TBD.

To encode 16-bit shift by register value, I think it would need to go into the following reserved spaces:

101 010 xxxx 10
101 10x xxxx 10

both of which overlap with C.FSDSP

The simple approach is to have encodings for CM.SLL, CM.SRL, C.SRA, which would take up both of those spaces, and each require two register operands. For example:

CM.SLL rsd', rsd', rs2'

rsd' = rsd' << rs2

out of range shifts return zero.

is this what you are thinking?

@abukharmeh how hard is this to add into the analysis script? If it's easy then I'm wondering if we should try and get them in now - given that it's much more efficient to add all the (non-EABI) 16-bit encodings at once. I think it's worth assessing them anyway to see if they help the benchmark suite. It would be more efficient if you ran it, but it would certainly help if we could get more people involved to assess new instructions.

Tariq

@abukharmeh
Copy link
Contributor

It should be fairly easy if we know what patterns we are trying to optimise, is this trying to achieve the same thing as #141 ?

@tariqkurd-repo
Copy link
Contributor

tariqkurd-repo commented Jun 14, 2022

It's different, that one has an immediate shift distance and I think different source and destination registers. It's worth nothing that these wouldn't expand into an existing 32 bit encoding which might be a problem, bits it's certainly worth knowing how valuable they are

@abukharmeh
Copy link
Contributor

abukharmeh commented Jun 14, 2022

It's worth nothing that these wouldn't expand into an existing 32 bit encoding Why not SRL AND SLL ?

image

This does not look promising unless I am missing something, I will let you know exact % savings later in the week, or perhaps Saturday.

@abukharmeh abukharmeh self-assigned this Jun 14, 2022
@jalobaba
Copy link
Author

The proposed 16-bit shift instructions will expand into existing 32-bit instructions.

image

@jalobaba
Copy link
Author

Tariq, you asked if these were the instructions I was proposing:
CM.SLL rsd', rsd', rs2'

Yes, exactly this with 2 registers and destructive shift.

As for out of range shifts (i.e. rs2' > 31), I'd match the behavior of the 32-bit RISC-V shift instructions. I just checked the spec and the 32-bit shift instructions just use the bottom 5-bit of rs2 so I would match this for the 16-bit versions and then there is no special behavior for out of range.

@tariqkurd-repo
Copy link
Contributor

yes - you're right - CM.SLL would be the 16-bit encoding of SLL, and similarly for SRL/CM.SRL, SRA/CM.SRA

give that there doesn't seem to be much value in these in our existing benchmark suit @jalobaba do you have an open source application which sees a benefit from them?

@jalobaba
Copy link
Author

No, I don't have an open-source application that benefits from 16-bit shift by register value.

We have several groups at Qualcomm with RISC-V code for their microcontroller subsystems. I'm starting to reach out to them and get an effort started to evaluate Zc on those code bases. Some use GCC and some use LLVM so we can use your preliminary versions of those to evaluate Zc. We can also try Ibrahim's script. Is that useful to you to help determine the benefits of 16-bit shift by register?

What is the best way to analyze our internal code to determine the benefit of 16-bit shift by register? I can always grep through the disassembly but is there a better way? How did you determine that the benchmarks you are using to evaluate Zc don't have much benefit for 16-bit shift by register? I'm hoping I could use the same approach.

@tariqkurd-repo
Copy link
Contributor

tariqkurd-repo commented Jun 15, 2022 via email

@abukharmeh
Copy link
Contributor

I think it would be easier to add that to the script, you just need to add two lines similar to the following snippet, and then you can get the exact savings
https://github.com/riscv/riscv-code-size-reduction/blob/44ac85786eed91ff3768de5cdaca27113f0209dc/benchmarks/HCA/hca#L121-L124

@jalobaba
Copy link
Author

Thanks Ibrahim. I'll be working on this next week.

@abukharmeh
Copy link
Contributor

I would be happy if you submit a PR for it :)

Even if it didn't up in the spec, it would help people try it on their applications etc.

So what you need to do is

  1. Add a new action name, e.g elif action == 'cm_srl'
  2. Use retrieve field function with required instructions list set as [srl], and additional condition as new lambda function that checks that the one of the source register and the destination register are equal, and all registers are within the compressed format (Can use utils.regwithin as that is just a function that checks if a register with an ABI name is from ones that fit the compressed encoding)
  3. Use apply_optimizations to patch the instructions, the most important factor is to change WoE (Width of encoding to 16)

Then just use the script

hca -a cm_srl -f name_of_your_elf_fille

@abukharmeh
Copy link
Contributor

abukharmeh commented Jun 26, 2022

Hi, as promised earlier, here are the complete results from our benchmark suite

image

I doubt we can justify the encoding space for these instructions with these results !

Are we happy to close this now @jalobaba @tariqkurd-repo

@jnk0le
Copy link
Contributor

jnk0le commented Jun 26, 2022

0.09% and 0.05% for rsd+rs2 instruction. Bang for the code point ratio seems to be in line with other instructions (c.mul is also 0.09%)

@abukharmeh
Copy link
Contributor

Yes, you are right, they would be 2^(3+3)=64 points instructions as C.MUL which has similar savings !

@tariqkurd-repo
Copy link
Contributor

tariqkurd-repo commented Jun 27, 2022

Hi @abukharmeh are we also considering CM.SRA? Are there any results for that one?
C.MUL has better resutls than these two added together, and of course each of these has the same number of code-points as C.MUL so the overall benefit is less than half. Note that C.MUL saves >1% on several benchmarks, so the overall average isn't very useful in this case.

I'm interested to see CM.SRA if possible.

At the moment I feel that we need to put more thought into the very small amount of remaining encoding space - we can't allocate all if it to Zc, and so that we should proceed without these instructions. that doesn't mean that we can't add them in the next revision of Zc if they can be justified against other options at that point.

...otherwise we also risk failing to ratify anything this year... and that would certainly be worse for RISC-V in general than proceeding without them.

what do people think?

@abukharmeh
Copy link
Contributor

abukharmeh commented Jun 27, 2022

what do people think? I agree, I think its too late to add anything if we want ratification this year, and I dont see the savings from these instructions big enough to justify any addition to the current revision of extension !

@jnk0le
Copy link
Contributor

jnk0le commented Jun 27, 2022

c.mul peaks in pure dsp code, many of instances being multiply accumulate that can be handled by P ext as a single 32bit instruction. Otherwise it's average 0.2x% or lower bottomed down by debian packages.

In terms of code point efficiency c.lhu,c.lh,c.sh are even worser

@jnk0le
Copy link
Contributor

jnk0le commented Jun 27, 2022

...otherwise we also risk failing to ratify anything this year... and that would certainly be worse for RISC-V in general than proceeding without them

think its too late to add anything if we want ratification this year,!

I see one significant issue with "adding zce insrcutions later"; fragmentation. And those are likely to end up not used by A profile packages at all

eg. Zcev1 goes into profile with mandatory vector, such profile is very likely to become defacto lowest common denominator for targeting software (and optionally RV64GC as "legacy" one). If Zcev2 brings extra 0.5% code size reduction - nobody would like to break with LCD as well as provide yet another binary for just that 0.5%.

zcm* situation is a bit better, but not entirely.

@Silabs-ArjanB
Copy link

Silabs-ArjanB commented Jun 28, 2022

what do people think? I agree, I think its too late to add anything if we want ratification this year, and I dont see the savings from these instructions big enough to justify any addition to the current revision of extension

In our opinion it is definitely not worth putting ratification this year at risk by adding new instructions. Given that Zc* already consists of 7 groups and that likely/hopefully other ideas we will pop up, these instructions (and others) can be added later (if needed in a new sub-extension).

@tariqkurd-repo
Copy link
Contributor

yes exactly - and delaying Zc for several months for minimal extra code-size improvement will only make the fragmentation worse.

@tariqkurd-repo
Copy link
Contributor

agreed - so no change now - and we can consider these in the future.

@tariqkurd-repo tariqkurd-repo added the future idea Something for a future version of the spec label Jun 28, 2022
@jnk0le
Copy link
Contributor

jnk0le commented Jun 28, 2022

consists of 7 groups

In current zce spec, i see only 4 (+1 for RVE eabi) "fragmentation" groups, 3 (+1) of them being embedded only.

yes exactly - and delaying Zc for several months for minimal extra code-size improvement will only make the fragmentation worse.

given that the current compiler results match exactly the script ones. (they don't, table added on the right was misleading)
Maybe the compiler proof step could be skipped for obviously simple things like sll/srl/sra in v0.80

@tariqkurd-repo
Copy link
Contributor

there will only be a v0.80 if the architecture review committee request it

@abukharmeh
Copy link
Contributor

@tariqkurd-repo Can we keep the issue open as its future idea ?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
future idea Something for a future version of the spec
Projects
None yet
Development

No branches or pull requests

5 participants