diff --git a/source/VexiiRiscv/BranchPrediction/index.rst b/source/VexiiRiscv/BranchPrediction/index.rst index 08d29f9..a73b824 100644 --- a/source/VexiiRiscv/BranchPrediction/index.rst +++ b/source/VexiiRiscv/BranchPrediction/index.rst @@ -1,54 +1,54 @@ Branch Prediction -================== +================= -The branch prediction is implemented as follow : +The branch prediction is implemented as follows : - During fetch, a BTB, GShare, RAS memory is used to provide an early branch prediction (BtbPlugin / GSharePlugin) - In Decode, the DecodePredictionPlugin will ensure that no "non jump/branch instruction" predicted as a jump/branch continues down the pipeline. -- In Execute, the prediction made is checked and eventualy corrected. Also a stream of data is generated to feed the BTB / GShare memories with good data to learn. +- In Execute, the prediction made is checked and eventually corrected. Also a stream of data is generated to feed the BTB / GShare memories with good data to learn. -Here is a diagram of the whole architecture : +Here is a diagram of the whole architecture : .. image:: /asset/picture/branch_prediction.png While it would have been possible in the decode stage to correct some mispredictions from the BTB / RAS, it isn't done, in order to improve timings and reduce area. BtbPlugin -------------------------- +--------- Will : - Implement a branch target buffer in the fetch pipeline - Implement a return address stack buffer - Predict which slices of the fetched word are the last slice of a branch/jump -- Predict the branch/ĵump target +- Predict the branch/jump target - Use the FetchConditionalPrediction plugin (GSharePlugin) to know if a branch should be taken - Apply the prediction (flush + pc update + history update) -- Learn using the LearnPlugin interface. Only learn on missprediction. To avoid write to read hazard, the fetch stage is blocked when it learn. 
-- Implement "ways" named chunks which are staticaly assigned to groups of word's slices, allowing to predict multiple branch/jump present in the same word +- Learn using the LearnPlugin interface. Only learn on misprediction. To avoid write to read hazards, the fetch stage is blocked when it learns. +- Implement chunks named "ways", which are statically assigned to groups of the word's slices, allowing it to predict multiple branches/jumps present in the same word GSharePlugin -------------------------- +------------ -Will : +Will : - Implement a FetchConditionalPrediction (GShare flavor) - Learn using the LearnPlugin interface. Write to read hazards are handled via a bypass - Will not apply the prediction via flush / pc change, another plugin will do that DecodePredictionPlugin -------------------------- +---------------------- The purpose of this plugin is to ensure that no branch/jump prediction was made for non branch/jump instructions. In case this is detected, the plugin will just flush the pipeline and set the fetch PC to redo everything, but this time with a "first prediction skip" BranchPlugin --------------- +------------ Placed in the execute pipeline, it will ensure that the branch prediction was correct, else it corrects it. It also generates a learn interface. LearnPlugin --------------- +----------- This plugin will collect all the learn interfaces (generated by the BranchPlugin) and produce a single stream of learn interfaces for the BtbPlugin / GShare plugin to use. diff --git a/source/VexiiRiscv/Debug/index.rst b/source/VexiiRiscv/Debug/index.rst index f730b3f..f76a7e9 100644 --- a/source/VexiiRiscv/Debug/index.rst +++ b/source/VexiiRiscv/Debug/index.rst @@ -1,6 +1,5 @@ Debug -============ - +===== .. 
toctree:: :maxdepth: 2 diff --git a/source/VexiiRiscv/Debug/jtag.rst b/source/VexiiRiscv/Debug/jtag.rst index 64df23a..a6e7bf1 100644 --- a/source/VexiiRiscv/Debug/jtag.rst +++ b/source/VexiiRiscv/Debug/jtag.rst @@ -1,5 +1,5 @@ JTAG -============================== +==== VexiiRiscv supports debugging by implementing the official RISC-V debug spec. diff --git a/source/VexiiRiscv/Decode/index.rst b/source/VexiiRiscv/Decode/index.rst index fc86bec..1ad31ff 100644 --- a/source/VexiiRiscv/Decode/index.rst +++ b/source/VexiiRiscv/Decode/index.rst @@ -1,7 +1,7 @@ Decode -============ +====== -A few plugins operate in the fetch stage : +A few plugins operate in the decode stage : - DecodePipelinePlugin - AlignerPlugin @@ -11,41 +11,41 @@ A few plugins operate in the fetch stage : DecodePipelinePlugin -------------------------- +-------------------- Provide the pipeline framework for all the decode related hardware. It uses the spinal.lib.misc.pipeline API but implements multiple "lanes" in it. AlignerPlugin -------------------------- +------------- -Decode the words froms the fetch pipeline into aligned instructions in the decode pipeline. Its complexity mostly come from the necessity to support having RVC [and BTB], mostly by adding additional cases to handle. +Decode the words from the fetch pipeline into aligned instructions in the decode pipeline. Its complexity mostly comes from the necessity to support having RVC [and BTB], mostly by adding additional cases to handle. 1) RVC allows 32 bits instructions to be unaligned, meaning they can cross between 2 fetched words, so it needs some internal buffers / states to work. -2) The BTB may have predicted (falsly) a jump instruction where there is none, which may cut the fetch of an 32 bits instruction in the middle. +2) The BTB may have predicted (falsely) a jump instruction where there is none, which may cut the fetch of a 32 bits instruction in the middle. 
-The AlignerPlugin is designed as following : +The AlignerPlugin is designed as follows : - Has an internal fetch word buffer in order to support 32 bits instructions with RVC - First it scans every possible instruction position, ex : RVC with 64 bits fetch words => 2x64/16 scanners. Extracting the instruction length, presence of all the instruction data (slices) and necessity to redo the fetch because of a bad BTB prediction. - Then it has one extractor per decoding lane. They will check the scanner for the first valid instructions. -- Then each extractor is feeded into the decoder pipeline. +- Then each extractor is fed into the decoder pipeline. .. image:: /asset/picture/aligner.png DecoderPlugin -------------------------- +------------- Will : - Decode instructions -- Generate ilegal instruction exception +- Generate illegal instruction exceptions - Generate "interrupt" instructions DecodePredictionPlugin -------------------------- +---------------------- The purpose of this plugin is to ensure that no branch/jump prediction was made for non branch/jump instructions. In case this is detected, the plugin will just flush the pipeline and set the fetch PC to redo everything, but this time with a "first prediction skip" @@ -53,20 +53,20 @@ In case this is detected, the plugin will just flush the pipeline and set the fe See more in the Branch prediction chapter DispatchPlugin -------------------------- +-------------- -Will : +Will : - Collect instructions from the end of the decode pipeline - Try to dispatch them ASAP on the multiple "layers" available -Here is a few explenation about execute lanes and layers : +Here are a few explanations about execute lanes and layers : - An execute lane represents a path through which an instruction can be executed. 
- An execute lane can have one or many layers, which can be used to implement things such as early ALU / late ALU - Each layer will have a static scheduling priority -The DispatchPlugin doesn't require lanes or layers to be symetric in any way. +The DispatchPlugin doesn't require lanes or layers to be symmetric in any way. diff --git a/source/VexiiRiscv/Execute/custom.rst b/source/VexiiRiscv/Execute/custom.rst index 2d4e86a..6b765ac 100644 --- a/source/VexiiRiscv/Execute/custom.rst +++ b/source/VexiiRiscv/Execute/custom.rst @@ -1,10 +1,10 @@ Custom instruction -============================== +================== There are multiple ways you can add custom instructions into VexiiRiscv. The following chapter will provide some demos. SIMD add ------------ +-------- Let's define a plugin which will implement a SIMD add (4x8bits adder), working on the integer register file. @@ -22,7 +22,7 @@ For instance the Plugin configuration could be : plugins += new SimdAddPlugin(early0) // <- We will implement this plugin Plugin implementation -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +^^^^^^^^^^^^^^^^^^^^^ Here is an example of how this plugin could be implemented : @@ -40,69 +40,69 @@ Here is a example how this plugin could be implemented : import vexiiriscv.compat.MultiPortWritesSymplifier import vexiiriscv.riscv.{IntRegFile, RS1, RS2, Riscv} - //This plugin example will add a new instruction named SIMD_ADD which do the following : + // This plugin example will add a new instruction named SIMD_ADD which does the following : // - //RD : Regfile Destination, RS : Regfile Source - //RD( 7 downto 0) = RS1( 7 downto 0) + RS2( 7 downto 0) - //RD(16 downto 8) = RS1(16 downto 8) + RS2(16 downto 8) - //RD(23 downto 16) = RS1(23 downto 16) + RS2(23 downto 16) - //RD(31 downto 24) = RS1(31 downto 24) + RS2(31 downto 24) + // RD : Regfile Destination, RS : Regfile Source + // RD( 7 downto 0) = RS1( 7 downto 0) + RS2( 7 downto 0) + // RD(15 downto 8) = RS1(15 downto 8) + RS2(15 downto 8) + // RD(23 downto 16) 
= RS1(23 downto 16) + RS2(23 downto 16) + // RD(31 downto 24) = RS1(31 downto 24) + RS2(31 downto 24) // - //Instruction encoding : - //0000000----------000-----0001011 <- Custom0 func3=0 func7=0 - // |RS2||RS1| |RD | + // Instruction encoding : + // 0000000----------000-----0001011 <- Custom0 func3=0 func7=0 + // |RS2||RS1| |RD | // - //Note : RS1, RS2, RD positions follow the RISC-V spec and are common for all instruction of the ISA + // Note : RS1, RS2, RD positions follow the RISC-V spec and are common for all instructions of the ISA object SimdAddPlugin{ - //Define the instruction type and encoding that we wll use + // Define the instruction type and encoding that we will use val ADD4 = IntRegFile.TypeR(M"0000000----------000-----0001011") } - //ExecutionUnitElementSimple is a plugin base class which will integrate itself in a execute lane layer - //It provide quite a few utilities to ease the implementation of custom instruction. - //Here we will implement a plugin which provide SIMD add on the register file. - class SimdAddPlugin(val layer : LaneLayer) extends ExecutionUnitElementSimple(layer) { + // ExecutionUnitElementSimple is a plugin base class which will integrate itself in an execute lane layer + // It provides quite a few utilities to ease the implementation of custom instructions. + // Here we will implement a plugin which provides SIMD add on the register file. + class SimdAddPlugin(val layer : LaneLayer) extends ExecutionUnitElementSimple(layer) { - //Here we create an elaboration thread. The Logic class is provided by ExecutionUnitElementSimple to provide functionalities + // Here we create an elaboration thread. The Logic class is provided by ExecutionUnitElementSimple to provide functionalities val logic = during setup new Logic { - //Here we could have lock the elaboration of some other plugins (ex CSR), but here we don't need any of that - //as all is already sorted out in the Logic base class. 
- //So we just wait for the build phase + // Here we could have locked the elaboration of some other plugins (ex CSR), but here we don't need any of that + // as all is already sorted out in the Logic base class. + // So we just wait for the build phase awaitBuild() - //Let's assume we only support RV32 for now + // Let's assume we only support RV32 for now assert(Riscv.XLEN.get == 32) - //Let's get the hardware interface that we will use to provide the result of our custom instruction + // Let's get the hardware interface that we will use to provide the result of our custom instruction val wb = newWriteback(ifp, 0) - - //Specify that the current plugin will implement the ADD4 instruction + + // Specify that the current plugin will implement the ADD4 instruction val add4 = add(SimdAddPlugin.ADD4).spec - //We need to specify on which stage we start using the register file values + // We need to specify on which stage we start using the register file values add4.addRsSpec(RS1, executeAt = 0) add4.addRsSpec(RS2, executeAt = 0) - //Now that we are done specifying everything about the instructions, we can release the Logic.uopRetainer - //This will allow a few other plugins to continue their elaboration (ex : decoder, dispatcher, ...) + // Now that we are done specifying everything about the instructions, we can release the Logic.uopRetainer + // This will allow a few other plugins to continue their elaboration (ex : decoder, dispatcher, ...) 
uopRetainer.release() - //Let's define some logic in the execute lane [0] + // Let's define some logic in the execute lane [0] val process = new el.Execute(id = 0) { - //Get the RISC-V RS1/RS2 values from the register file + // Get the RISC-V RS1/RS2 values from the register file val rs1 = el(IntRegFile, RS1).asUInt val rs2 = el(IntRegFile, RS2).asUInt - //Do some computation + // Do some computation val rd = UInt(32 bits) rd( 7 downto 0) := rs1( 7 downto 0) + rs2( 7 downto 0) rd(15 downto 8) := rs1(15 downto 8) + rs2(15 downto 8) rd(23 downto 16) := rs1(23 downto 16) + rs2(23 downto 16) rd(31 downto 24) := rs1(31 downto 24) + rs2(31 downto 24) - //Provide the computation value for the writeback + // Provide the computation value for the writeback wb.valid := SEL wb.payload := rd.asBits } @@ -111,7 +111,7 @@ Here is a example how this plugin could be implemented : VexiiRiscv generation -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +^^^^^^^^^^^^^^^^^^^^^ Then, to generate a VexiiRiscv with this new plugin, we could run the following App : @@ -144,7 +144,7 @@ To run this App, you can go to the NaxRiscv directory and run : sbt "runMain vexiiriscv.execute.VexiiSimdAddGen" Software test -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +^^^^^^^^^^^^^ Then let's write some assembly test code : (https://github.com/SpinalHDL/NaxSoftware/tree/849679c70b238ceee021bdfd18eb2e9809e7bdd0/baremetal/simdAdd) @@ -157,16 +157,16 @@ Then let's write some assembly test code : (https://github.com/SpinalHDL/NaxSoft #include "../../driver/sim_asm.h" #include "../../driver/custom_asm.h" - //Test 1 + // Test 1 li x1, 0x01234567 li x2, 0x01FF01FF - opcode_R(CUSTOM0, 0x0, 0x00, x3, x1, x2) //x3 = ADD4(x1, x2) + opcode_R(CUSTOM0, 0x0, 0x00, x3, x1, x2) // x3 = ADD4(x1, x2) - //Print result value + // Print result value li x4, PUT_HEX sw x3, 0(x4) - //Check result + // Check result li x5, 0x02224666 bne x3, x5, fail @@ -184,15 +184,15 @@ Compile it with make clean rv32im Simulation 
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +^^^^^^^^^^ -You could run a simulation using this testbench : +You could run a simulation using this testbench : - Bottom of https://github.com/SpinalHDL/VexiiRiscv/blob/dev/src/main/scala/vexiiriscv/execute/SimdAddPlugin.scala .. code:: scala - object VexiiSimdAddSim extends App{ + object VexiiSimdAddSim extends App { val param = new ParamSimple() val testOpt = new TestOptions() @@ -231,7 +231,7 @@ Which will output the value 02224666 in the shell and show traces in simWorkspac Note that --no-rvls-check is required as spike does not implement that custom simdAdd. Conclusion -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +^^^^^^^^^^ So overall this example didn't introduce how to specify some additional decoding, nor how to define multi-cycle ALU. (TODO). But you can take a look at the IntAluPlugin, ShiftPlugin, DivPlugin, MulPlugin and BranchPlugin which are doing those things using the same ExecutionUnitElementSimple base class. diff --git a/source/VexiiRiscv/Execute/fpu.rst b/source/VexiiRiscv/Execute/fpu.rst index 955277e..c721195 100644 --- a/source/VexiiRiscv/Execute/fpu.rst +++ b/source/VexiiRiscv/Execute/fpu.rst @@ -1,27 +1,27 @@ FPU -======== +=== -The VexiiRiscv FPU has the following caracteristics : +The VexiiRiscv FPU has the following characteristics : - By default, it is fully compliant with the IEEE-754 spec (subnormal, rounding, exception flags, ..) - There are options to reduce its footprint at the cost of compliance (reduced FMA accuracy, and dropped subnormal support) -- It isn't a single chunky module, instead it is composed of many plugins in the same ways than the rest of the CPU. +- It isn't a single chunky module, instead it is composed of many plugins in the same way as the rest of the CPU. 
+- It is tightly coupled to the execute pipeline - All operations can be issued at the rate of 1 instruction per cycle, except for FDIV/FSQRT/Subnormals - By default, it is deeply pipelined to help with FPGA timings (10 stages FMA) -- Multiple hardware ressources are sharred between multiple instruction (ex rounding, adder (FMA+FADD) +- Multiple hardware resources are shared between multiple instructions (ex : rounding, adder (FMA+FADD)) +- The VexiiRiscv scheduler takes care not to schedule an instruction which would use the same resource as an older instruction - FDIV and FMUL reuse the integer pipeline DIV and MUL hardware - Subnormal numbers are handled by recoding/encoding them on operands and results of math instructions. This will trigger some little state machines which will halt the CPU for a few cycles (2-3 cycles) Plugins architecture ----------------------- +-------------------- -There is a few fundation plugins that compose the FPU : +There are a few foundation plugins that compose the FPU : -- FpuUnpackPlugin : Will decode the RS1/2/3 operands (isZero, isInfinit, ..) aswell as recode them in a floating point format which simplify subnormals into regular floating point values +- FpuUnpackPlugin : Will decode the RS1/2/3 operands (isZero, isInfinity, ..) 
as well as recode them in a floating point format which simplifies subnormals into regular floating point values - FpuPackPlugin : Will apply rounding to floating point results, recode them into IEEE-754 (including subnormal) before sending those to the WriteBackPlugin(float) - WriteBackPlugin(float) : Allows to write values back to the register file (it is the same implementation as the WriteBackPlugin(integer)) - FpuFlagsWriteback : Allows instructions to set FPU exception flags @@ -31,30 +31,34 @@ There is a few fundation plugins that compose the FPU : Area / Timings options ----------------------- -To improve the FPU area and timings (especialy on FPGA), there is currently two main options implemented. +To improve the FPU area and timings (especially on FPGA), there are currently two main options implemented. The first option is to reduce the FMA (Float Multiply Add instruction A*B+C) accuracy. -The reason is that the mantissa result of the multiply operation (for 64 bits float) is 2x(52+1)=106 bits, then we need to take those bits and implement the floating point adder against the third opperand. So, instead of having to do a 52 bits + 52 bits floating point adder, we need to do a 106 bits + 52 bits floating point adder, which is quite heavy, increase the timings and latencies while being (very likely) overkilled. +The reason is that the mantissa result of the multiply operation (for 64 bits float) is 2x(52+1)=106 bits, +then we need to take those bits and implement the floating point adder against the third operand. +So, instead of having to do a 52 bits + 52 bits floating point adder, +we need to do a 106 bits + 52 bits floating point adder, which is quite heavy, +increases the timings and latencies while being (very likely) overkill. So this option throws away about half of the multiplication mantissa result. The second option is to disable subnormal support, and instead consider those values as normal floating point numbers. 
This reduces the area by not having to handle subnormals (it removes big barrel shifters) -, aswell as improving timings. +, as well as improving timings. -The down side is that the floating point value range is slightly reduced, -and if the user provide floating point constants which are subnormals number, +The downside is that the floating point value range is slightly reduced, +and if the user provides floating point constants which are subnormal numbers, they will be considered as 2^exp_subnormal numbers. -In practice those two option do not seems to creates issues (for regular use cases), +In practice those two options do not seem to create issues (for regular use cases), as it was tested by running debian with various software and graphical environments. Optimized software ------------------------------ +------------------ -If you used the default FPU configuration (deeply pipelined), and you want to achieve a high FPU bandwidth, -your software need to be carefull about dependencies between instruction. -For instance, a FMA instruction will have around 10 cycle latency before providing its results, -so if you want for instance to multipliy 1000 values against some constants +If you use the default FPU configuration (deeply pipelined), and you want to achieve a high FPU bandwidth, +your software needs to be careful about dependencies between instructions. +For instance, an FMA instruction will have around 10 cycles of latency before providing its result, +so if you want for instance to multiply 1000 values against some constants and accumulate the results together, you will need to accumulate things using multiple accumulators and then, only at the end, aggregate the accumulators together. -So think about code pipelining. GCC will not necessarly do a got job about it, +So think about code pipelining. GCC will not necessarily do a good job of it, as it may assume that the FPU has a much lower latency, or just optimize for code size. 
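The multiple-accumulator advice above can be sanity-checked with a toy timing model (plain Python, not SpinalHDL; the 10 cycle FMA latency comes from the text, everything else is a simplifying assumption) :

```python
# Toy timing model of a pipelined FPU : throughput of 1 op/cycle, but each
# result only becomes available LAT cycles after issue. Accumulation forms a
# dependency chain, so a single accumulator stalls LAT cycles per FMA, while
# several independent accumulators let the pipeline stay busy.
LAT = 10   # assumed FMA latency, as described in the text
N = 1000   # number of values to accumulate

def cycles(n_acc):
    # Each chain performs ceil(N / n_acc) dependent FMAs; the chains overlap,
    # so the longest chain dominates, plus the issue skew between chains.
    per_chain = -(-N // n_acc)
    return per_chain * LAT + (n_acc - 1)

print(cycles(1))   # one accumulator : latency-bound
print(cycles(10))  # enough accumulators to hide the 10 cycle latency
```

With these assumptions, 10 accumulators cut the accumulation time by almost 10x, which is the effect the paragraph describes.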
diff --git a/source/VexiiRiscv/Execute/index.rst b/source/VexiiRiscv/Execute/index.rst index d16a507..64a4a35 100644 --- a/source/VexiiRiscv/Execute/index.rst +++ b/source/VexiiRiscv/Execute/index.rst @@ -1,5 +1,5 @@ Execute -============ +======= .. toctree:: diff --git a/source/VexiiRiscv/Execute/introduction.rst b/source/VexiiRiscv/Execute/introduction.rst index 82aa74d..9d11633 100644 --- a/source/VexiiRiscv/Execute/introduction.rst +++ b/source/VexiiRiscv/Execute/introduction.rst @@ -1,13 +1,13 @@ Introduction -============================== +============ -The execute pipeline has the following properties : +The execute pipeline has the following properties : - Support multiple lanes of execution. - Support multiple implementations of the same instruction on the same lane (late-alu) via the concept of "layer" - each layer is owned by a given lane - each layer can implement multiple instructions and store a data model of their requirements. -- The whole pipeline never collapse bubbles, all lanes of every stage move forward together as one. +- The whole pipeline never collapses bubbles, all lanes of every stage move forward together as one. - Elements of the pipeline are allowed to stop the whole pipeline via a shared freeze interface. @@ -16,11 +16,11 @@ Here is a class diagram : .. 
image:: /asset/picture/execute_structure.png -The main thing about it is that for every uop implementation in the pipeline, there is the elaboration time information for : +The main thing about it is that for every uop implementation in the pipeline, there is the elaboration time information for : -- How/where to retreive the result of the instruction (rd) +- How/where to retrieve the result of the instruction (rd) - From which point in the pipeline it uses which register file (rs) -- From which point in the pipleine the instruction can be considered as done (completion) +- From which point in the pipeline the instruction can be considered as done (completion) - Until which point in the pipeline the instruction may flush younger instructions (mayFlushUpTo) - From which point in the pipeline the instruction should not be flushed anymore because it has already produced side effects (dontFlushFrom) - The list of decoded signals/values that the instruction is using (decodings) diff --git a/source/VexiiRiscv/Execute/lsu.rst b/source/VexiiRiscv/Execute/lsu.rst index b221318..97c31ef 100644 --- a/source/VexiiRiscv/Execute/lsu.rst +++ b/source/VexiiRiscv/Execute/lsu.rst @@ -1,25 +1,25 @@ Load Store Unit (LSU) -======================== +===================== -VexiiRiscv has 2 implementions of LSU : +VexiiRiscv has 2 implementations of the LSU : - LsuCachelessPlugin for microcontrollers, which doesn't implement any cache - LsuPlugin / LsuL1Plugin which can work together to implement load and store through an L1 cache Without L1 ----------------- +---------- -Implemented by the LsuCachelessPlugin, it should be noted that to -reach good frequencies on FPGA SoC, forking the memory request at -execute stage 1 seems to provide the best results (instead of execute stage 0), -as it relax the AGU timings aswell as the PMA (Physical Memory Attributes) checks. 
+Implemented by the LsuCachelessPlugin, it should be noted that to +reach good frequencies on FPGA SoC, forking the memory request at +execute stage 1 seems to provide the best results (instead of execute stage 0), +as it relaxes the AGU timings as well as the PMA (Physical Memory Attributes) checks. .. image:: /asset/picture/lsu_nol1.png With L1 ----------------- +------- -This configuration supports : +This configuration supports : - N ways (limited to 4 KB per way if the MMU is enabled) - Non-blocking design, able to handle multiple cache line refills and writebacks @@ -27,7 +27,7 @@ This configuration supports : .. image:: /asset/picture/lsu_l1.png -This LSU implementation is partitionned between 2 plugins : +This LSU implementation is partitioned between 2 plugins : The LsuPlugin : @@ -53,25 +53,25 @@ For multiple reasons (ease of implementation, FMax, hardware usage), VexiiRiscv - Cache miss, MMU miss - Refill / Writeback aliasing (4KB) -- Unreaded data bank durring load (ex : load durring data bank refill) -- Load which hit the store queue +- Unread data bank during load (ex : load during data bank refill) +- Load which hits the store queue - Store miss while the store queue is full - ... -In those situation, the LsuPlugin will trigger an "hardware trap" +In those situations, the LsuPlugin will trigger a "hardware trap" which will flush the pipeline and reschedule the failed instruction to the fetch unit. Memory coherency ------------------- +---------------- Memory coherency (L1) with other memory agents (CPU, DMA, ..) is supported through a MESI implementation which can be bridged to a tilelink memory bus. 
- + So, the L1 cache will have the following stream interfaces : -- read_cmd : To send memory block aquire requests (invalid/shared -> shared/exclusive) +- read_cmd : To send memory block acquire requests (invalid/shared -> shared/exclusive) - read_rsp : For responses of the above requests -- read_ack : To send aquire requests completion -- write_cmd : To send release a memory block permition (shared/exclusive -> invalid) +- read_ack : To send acquire requests completion +- write_cmd : To release a memory block permission (shared/exclusive -> invalid) - write_rsp : For responses of the above requests - probe_cmd : To receive probe requests (toInvalid/toShared/toUnique) - probe_rsp : To send responses to the above requests (isInvalid/isShared/isUnique) @@ -79,7 +79,7 @@ So, the L1 cache will have the following stream interfaces : PICTURE Prefetching --------------- +----------- Currently there are two implementations of prefetching @@ -87,62 +87,62 @@ Currently there is two implementation of prefetching - PrefetchRptPlugin : Enable prefetching for instructions which have a constant stride between accesses PrefetchRptPlugin -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +^^^^^^^^^^^^^^^^^ -This prefetcher is capable of reconizing instructions which have a constant stride between their +This prefetcher is capable of recognizing instructions which have a constant stride between their own previous accesses in order to prefetch multiple strides ahead. - Will learn memory access patterns from the LsuPlugin traces -- Patterns need to have a constant stride in order to be reconized +- Patterns need to have a constant stride in order to be recognized - By default, can keep track of the access patterns of up to 128 instructions (1 way * 128 sets, pc indexed) .. image:: /asset/picture/lsu_prefetch.png -This can improve performance dramasticaly (for some use cases). -For instance, on a 100 Mhz SoC in a FPGA, equipied of a 16x800 MT/s DDR3, -the load bandwidth went from 112 MB/s to 449 MB/s. 
(sequencial load) +This can improve performance dramatically (for some use cases). +For instance, on a 100 MHz SoC in an FPGA, equipped with a 16x800 MT/s DDR3, +the load bandwidth went from 112 MB/s to 449 MB/s. (sequential load) -Here is a description of the table fields : +Here is a description of the table fields : -"Tag" : Allows to get a better idea if the given instruction (PC) is the one owning -the table entry by comparing more PC's MSB bits. +"Tag" : Allows to get a better idea if the given instruction (PC) is the one owning +the table entry by comparing more of the PC's MSB bits. An entry is "owned" by an instruction if its tag matches the given instruction PC's msb bits. "Address" : Previous virtual address generated by the instruction "stride" : Number of bytes expected between memory accesses -"Score" : Allows to know if the given entry is usefull or not. Each time +"Score" : Allows to know if the given entry is useful or not. Each time the instruction keeps the same stride, the score increases, else it decreases. -If another instruction (with another tag) want to use an entry, +If another instruction (with another tag) wants to use an entry, the score field has to be low enough. -"Advance" : Allows to keep track how far the prefetching for the given -instruction already went. This field is cleared when a entry switch +"Advance" : Allows to keep track of how far the prefetching for the given +instruction already went. This field is cleared when an entry switches to a new instruction -"Missed" : This field was added in order to reduce the spam of +"Missed" : This field was added in order to reduce the spam of redundant prefetch requests which were happening for load/store intensive code. 
-For instance, for a deeply unrolled memory clear loop will generate (x16), -as each store instruction PC will be tracked individualy, -and as each execution of a given instruction will stride over a full cache line, -this will generate one hardware prefetch request on each store instruction every -time, spamming the LSU pipeline with redundant requests +For instance, in a deeply unrolled (x16) memory clear loop, +as each store instruction PC will be tracked individually, +and as each execution of a given instruction will stride over a full cache line, +this will generate one hardware prefetch request on each store instruction every +time, spamming the LSU pipeline with redundant requests and reducing overall performance. This "missed" field works as follows : - It is cleared when a stride disruption happens (ex : new memcopy execution) - It is set on cache miss (set wins over clear) -- An instruction will only trigger a prefetch if it miss or +- An instruction will only trigger a prefetch if it misses or if its "missed" field is already set. -For example, in a hardware simulation test -(RV64, 20 cycles memory latency, 16xload loop), this addition increased +For example, in a hardware simulation test +(RV64, 20 cycles memory latency, 16xload loop), this addition increased the memory read bandwidth from 3.6 bytes/cycle to 6.8 bytes/cycle. -Note that if you want to take full advantage of this prefetcher, you need to -have enough hardware refill/writeback slots in the LsuL1Plugin. +Note that if you want to take full advantage of this prefetcher, you need to +have enough hardware refill/writeback slots in the LsuL1Plugin. Also, prefetches which fail (ex : because of hazards in L1) aren't replayed. 
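As a rough behavioral illustration of the table described above ("Tag" / "Address" / "stride" / "Score" / "Missed"), here is a hypothetical Python model of one entry's update rule. The field semantics follow the text, but the score thresholds, the prefetch distance, and the omission of the "Advance" field are simplifications of mine, not the actual RTL :

```python
# Hypothetical sketch of one stride-prefetcher table entry (not the RTL).
SCORE_MAX = 15        # assumed saturation value
SCORE_TO_STEAL = 2    # assumed threshold under which another PC may steal the entry
AHEAD = 4             # assumed number of strides prefetched ahead

class RptEntry:
    def __init__(self):
        self.tag = None      # PC MSBs owning the entry
        self.address = 0     # previous virtual address generated by the instruction
        self.stride = 0      # bytes expected between accesses
        self.score = 0       # usefulness of the entry
        self.missed = False  # set on cache miss, cleared on stride disruption

    def access(self, tag, address, cache_miss):
        """Update the entry; return the prefetch addresses to issue (possibly none)."""
        if tag != self.tag:
            if self.score > SCORE_TO_STEAL:
                return []    # current owner is still useful, don't steal
            self.__init__()  # steal the entry for the new instruction
            self.tag = tag
            self.address = address
            return []
        stride = address - self.address
        self.address = address
        if stride != self.stride or stride == 0:
            # stride disruption : decay the score, clear "missed"
            self.score = max(self.score - 1, 0)
            self.stride = stride
            self.missed = False
            return []
        self.score = min(self.score + 1, SCORE_MAX)
        if cache_miss:
            self.missed = True            # set wins over clear
        if not (cache_miss or self.missed):
            return []                     # skip redundant prefetches for hitting streams
        return [address + self.stride * k for k in range(1, AHEAD + 1)]
```

After two same-stride accesses establish the pattern, a miss triggers prefetches several strides ahead, and subsequent hits of the same stream keep prefetching only because "missed" stays set.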
diff --git a/source/VexiiRiscv/Execute/plugins.rst b/source/VexiiRiscv/Execute/plugins.rst
index dbc59a6..18e14ae 100644
--- a/source/VexiiRiscv/Execute/plugins.rst
+++ b/source/VexiiRiscv/Execute/plugins.rst
@@ -1,16 +1,16 @@
 Plugins
-============
+=======

 infrastructures
--------------------
+---------------

-Many plugins operate in the fetch stage. Some provide infrastructures :
+Many plugins operate in the execute stage. Some provide infrastructures :

 ExecutePipelinePlugin
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^

-Provide the pipeline framework for all the execute related hardware with the following specificities :
+Provide the pipeline framework for all the execute related hardware with the following specificities :

 - It is based on the spinal.lib.misc.pipeline API and can host multiple "lanes" in it.
 - For flow control, the lanes can only freeze the whole pipeline
@@ -18,60 +18,60 @@ Provide the pipeline framework for all the execute related hardware with the fol

 ExecuteLanePlugin
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^

 Implement an execution lane in the ExecutePipelinePlugin

 RegFilePlugin
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^

-Implement one register file, with the possibility to create new read / write port on demande
+Implement one register file, with the possibility to create new read / write ports on demand

 SrcPlugin
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^

 Provide some early integer values which can mux between RS1/RS2 and multiple RISC-V instruction's literal values

 RsUnsignedPlugin
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^

 Used by mul/div in order to get an unsigned RS1/RS2 value early in the pipeline

 IntFormatPlugin
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^

-Alows plugins to write integer values back to the register file through a optional sign extender.
+Allows plugins to write integer values back to the register file through an optional sign extender.
It uses WriteBackPlugin as value backend.

 WriteBackPlugin
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^

 Used by plugins to provide the RD value to write back to the register file

 LearnPlugin
-^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^

 Will collect all interfaces which provide jump/branch learning interfaces to aggregate them into a single one, which will then be used by branch prediction plugins to learn.

 Instructions
--------------------
+------------

-Some implement regular instructions
+Some implement regular instructions

 IntAluPlugin
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^

 Implement the arithmetic, binary and literal instructions (ADD, SUB, AND, OR, LUI, ...)

 BarrelShifterPlugin
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^

 Implement the shift instructions in a non-blocking way (no iterations). Fast but "heavy".

 BranchPlugin
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^

-Will :
+Will :

 - Implement branch/jump instructions
 - Correct the PC / History in the case the branch prediction was wrong
@@ -79,54 +79,54 @@ Will :

 MulPlugin
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^

 - Implement the multiplication operation using partial multiplications and then summing their results
 - Done over multiple stages
-- Can optionaly extends the last stage for one cycle in order to buffer the MULH bits
+- Can optionally extend the last stage for one cycle in order to buffer the MULH bits

 DivPlugin
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^

-- Implement the division/remain
+- Implement the division/remainder
 - 2 bits per cycle are solved.
- When it starts, it scans the numerator's leading bits for 0, and can skip dividing them (can skip blocks of XLEN/4)

 LsuCachelessPlugin
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^

 - Implement load / store through a cacheless memory bus
 - Will fork the cmd as soon as fork stage is valid (with no flush)
-- Handle backpresure by using a little fifo on the response data
+- Handle backpressure by using a little fifo on the response data

 Special
--------------------
+-------

 Some implement CSR, privileges and special instructions

 CsrAccessPlugin
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^

 - Implement the CSR instructions
 - Provide an API for other plugins to specify their hardware mapping

 CsrRamPlugin
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^

 - Implement a shared on chip ram
-- Provide an API which allows to staticaly allocate space on it
+- Provide an API which allows to statically allocate space on it
 - Provide an API to create read / write ports on it
-- Used by various plugins to store the CSR contents in a FPGA efficient way
+- Used by various plugins to store the CSR contents in an FPGA efficient way

 PrivilegedPlugin
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^

 - Implement the RISCV privileged spec
 - Implement the trap buffer / FSM
 - Use the CsrRamPlugin to implement various CSRs such as MTVAL, MTVEC, MEPC, MSCRATCH, ...
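The CsrRamPlugin usage described above (static allocation plus read / write ports) can be modeled in software. This is a hypothetical sketch of the idea, not the actual SpinalHDL API: all names and the linear allocation policy are assumptions.

```python
# Hypothetical model of a CsrRamPlugin-style shared ram: plugins statically
# allocate slots at elaboration time, then create read / write ports on them.
# Names and allocation policy are assumptions for this sketch, not the real API.

class SharedCsrRam:
    def __init__(self, words):
        self.words = words
        self.data = [0] * words
        self.next_free = 0

    def allocate(self, count=1):
        """Statically reserve `count` words, returning the base slot index."""
        assert self.next_free + count <= self.words, "CSR ram is full"
        base = self.next_free
        self.next_free += count
        return base

    def write_port(self, slot):
        # In hardware this would be a dedicated write port on the ram
        return lambda value: self.data.__setitem__(slot, value)

    def read_port(self, slot):
        return lambda: self.data[slot]

# e.g. a privileged plugin reserving storage for MEPC and MSCRATCH
ram = SharedCsrRam(words=16)
mepc = ram.allocate()
mscratch = ram.allocate()
ram.write_port(mepc)(0x80000000)
```

The point of such a shared ram is that many rarely-accessed CSRs can live in one FPGA block ram instead of each costing dedicated flip-flops.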
PerformanceCounterPlugin
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^

 - Implement the privileged performance counters in a very FPGA friendly way
 - Use the CsrRamPlugin to store most of the counter bits
@@ -135,6 +135,6 @@ PerformanceCounterPlugin

 EnvPlugin
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^

 - Implement a few instructions such as MRET, SRET, ECALL, EBREAK

diff --git a/source/VexiiRiscv/Fetch/index.rst b/source/VexiiRiscv/Fetch/index.rst
index a5dac88..3e35ae1 100644
--- a/source/VexiiRiscv/Fetch/index.rst
+++ b/source/VexiiRiscv/Fetch/index.rst
@@ -1,8 +1,8 @@
 Fetch
-============
+=====

-A few plugins operate in the fetch stage :
+A few plugins operate in the fetch stage :

 - FetchPipelinePlugin
 - PcPlugin
@@ -15,12 +15,12 @@ A few plugins operate in the fetch stage :

 FetchPipelinePlugin
--------------------------
+-------------------

 Provide the pipeline framework for all the fetch related hardware. It uses the native spinal.lib.misc.pipeline API without any restriction.

 PcPlugin
--------------------------
+--------

 Will :

@@ -31,9 +31,9 @@ Will :

 Jump interfaces will impact the PC value injected in the fetch stage in a combinatorial manner to reduce latency.

 FetchCachelessPlugin
--------------------------
+--------------------

-Will :
+Will :

 - Generate a fetch memory bus
 - Connect that memory bus to the fetch pipeline with a response buffer
@@ -41,11 +41,10 @@ Will :
 - Always generate aligned memory accesses

-
 FetchL1Plugin
------------------
+-------------

-Will :
+Will :

 - Implement an L1 fetch cache (non-blocking)
 - Generate a fetch memory bus for the SoC interconnect
@@ -53,40 +53,40 @@ Will :

 PrefetcherNextLinePlugin
-------------------------------
+------------------------

-Currently, there is one instruction L1 prefetcher implementation (PrefetchNextLinePlugin).
+Currently, there is one instruction L1 prefetcher implementation (PrefetchNextLinePlugin).
-It is a very simple implementation :
+It is a very simple implementation :

 - On L1 access miss, it triggers the prefetching of the next cache line
 - On L1 access hit, if the cache line accessed is the same as the last prefetch, it triggers the prefetching of the next cache line

-In short it can only prefetch one cache block ahead and assume that if there was a cache miss on a block,
-then the following blocks are likely worth prefetching aswell.
+In short it can only prefetch one cache block ahead and assumes that if there was a cache miss on a block,
+then the following blocks are likely worth prefetching as well.

 .. image:: /asset/picture/fetch_prefetch_nl.png

-Note, for the best results, the FetchL1Plugin need to have
+Note, for the best results, the FetchL1Plugin needs to have
 2 hardware refill slots instead of 1 (default).

 The prefetcher can be turned off by setting the CSR 0x7FF bit 0.

 BtbPlugin
--------------------------
+---------

 See more in the Branch prediction chapter

 GSharePlugin
--------------------------
+------------

 See more in the Branch prediction chapter

 HistoryPlugin
--------------------------
+-------------

-Will :
+Will :

 - implement the branch history register
 - inject the branch history in the first fetch stage

diff --git a/source/VexiiRiscv/Framework/index.rst b/source/VexiiRiscv/Framework/index.rst
index 5fee107..280bf2f 100644
--- a/source/VexiiRiscv/Framework/index.rst
+++ b/source/VexiiRiscv/Framework/index.rst
@@ -1,9 +1,9 @@
 Framework
-============
+=========

 Dependencies
-------------------------------
+------------

 VexRiscv is based on a few tools / API

@@ -18,33 +18,35 @@ VexRiscv is based on a few tools / API

 Scala / SpinalHDL
-------------------------------
+-----------------

-This combination alows to goes way behond what regular HDL alows in terms of hardware description capabilities.
-You can find some documentation about SpinalHDL here :
+This combination allows going way beyond what regular HDLs allow in terms of hardware description capabilities.
+You can find some documentation about SpinalHDL here :

 - https://spinalhdl.github.io/SpinalDoc-RTD/master/index.html

 Plugin
--------------------------
+------

-One main design aspect of VexiiRiscv is that all its hardware is defined inside plugins. When you want to instanciate a VexiiRiscv CPU, you "only" need to provide a list of plugins as parameters. So, plugins can be seen as both parameters and hardware definition from a VexiiRiscv perspective.
+One main design aspect of VexiiRiscv is that all its hardware is defined inside plugins.
+When you want to instantiate a VexiiRiscv CPU, you "only" need to provide a list of plugins as parameters.
+So, plugins can be seen as both parameters and hardware definitions from a VexiiRiscv perspective.

-So it is quite different from the regular HDL component/module paradigm. Here are the adventages of this aproache :
+So it is quite different from the regular HDL component/module paradigm. Here are the advantages of this approach :

 - The CPU can be extended without modifying its core source code, just add a new plugin in the parameters
 - You can swap a specific implementation for another just by swapping plugins in the parameter list. (ex : branch prediction, mul/div, ...)
-- It is decentralised by nature, you don't have a fat toplevel of doom, software interface between plugins can be used to negociate things durring elaboration time.
+- It is decentralized by nature, you don't have a fat toplevel of doom, software interfaces between plugins can be used to negotiate things during elaboration time.
-The plugins can fork elaboration threads which cover 2 phases :
+The plugins can fork elaboration threads which cover 2 phases :

-- setup phase : where plugins can aquire elaboration locks on each others
-- build phase : where plugins can negociate between each others and generate hardware
+- setup phase : where plugins can acquire elaboration locks on each other
+- build phase : where plugins can negotiate with each other and generate hardware

 Simple all-in-one example
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^

-Here is a simple example :
+Here is a simple example :

 .. code-block:: scala

@@ -63,7 +65,7 @@ Here is a simple example :
     }

     object Gen extends App{
-      // Generate the verilog
+      // Generate the verilog
       SpinalVerilog{
         val plugins = ArrayBuffer[FiberPlugin]()
         plugins += new FixedOutputPlugin()
@@ -72,7 +74,7 @@ Here is a simple example :
     }

-Will generate
+Will generate :

 .. code-block:: verilog

@@ -86,10 +88,10 @@ Will generate

-Negociation example
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Negotiation example
+^^^^^^^^^^^^^^^^^^^

-Here is a example where there a plugin which count the number of hardware event comming from other plugins :
+Here is an example with a plugin which counts the number of hardware events coming from other plugins :

 ..
code-block:: scala

@@ -99,34 +101,34 @@ Here is a example where there a plugin which count the number of hardware event
     import spinal.lib.CountOne
     import vexiiriscv._
     import scala.collection.mutable.ArrayBuffer
-
+
     class EventCounterPlugin extends FiberPlugin{
       val lock = Retainer() // Will allow other plugins to block the elaboration of "logic" thread
       val events = ArrayBuffer[Bool]() // Will allow other plugins to add event sources
-      val logic = during build new Area{
-        lock.await() // Active blocking
+      val logic = during build new Area {
+        lock.await() // Active blocking
         val counter = Reg(UInt(32 bits)) init(0)
         counter := counter + CountOne(events)
       }
     }

-    //For the demo we want to be able to instanciate this plugin multiple times, so we add a prefix parameter
+    // For the demo we want to be able to instantiate this plugin multiple times, so we add a prefix parameter
     class EventSourcePlugin(prefix : String) extends FiberPlugin{
       withPrefix(prefix)

       // Create a thread starting from the setup phase (this allows running some code before the build phase, and thus locking some other plugins' retainers)
-      val logic = during setup new Area{
+      val logic = during setup new Area {
        val ecp = host[EventCounterPlugin] // Search for the single instance of EventCounterPlugin in the plugin pool

        // Generate a lock to prevent the EventCounterPlugin elaboration until we release it.
-        // this will allow us to add our localEvent to the ecp.events list
-        val ecpLocker = ecp.lock()
-
+        // this will allow us to add our localEvent to the ecp.events list
+        val ecpLocker = ecp.lock()
+
        // Wait for the build phase before generating any hardware
        awaitBuild()

        // Here the local event is an input of the VexiiRiscv toplevel (just for the demo)
-        val localEvent = in Bool()
+        val localEvent = in Bool()
        ecp.events += localEvent

        // As everything is done, we now allow the ecp to elaborate itself
@@ -134,8 +136,8 @@ Here is a example where there a plugin which count the number of hardware event
       }
     }

-    object Gen extends App{
-      SpinalVerilog{
+    object Gen extends App {
+      SpinalVerilog {
         val plugins = ArrayBuffer[FiberPlugin]()
         plugins += new EventCounterPlugin()
         plugins += new EventSourcePlugin("lane0")
@@ -182,12 +184,12 @@ Here is a example where there a plugin which count the number of hardware event

 Database
---------------------
+--------

-Quite a few things behave kinda like variable specific for each VexiiRiscv instance. For instance XLEN, PC_WIDTH, INSTRUCTION_WIDTH, ... So they are end up with things that we would like to share between plugins of a given VexiiRiscv instance with the minimum code possible to keep things slim. For that, a "database" was added.
+Quite a few things behave kind of like variables specific to each VexiiRiscv instance, for instance XLEN, PC_WIDTH, INSTRUCTION_WIDTH, ... So we end up with things that we would like to share between the plugins of a given VexiiRiscv instance with the minimum code possible, to keep things slim. For that, a "database" was added.

-You can see it in the VexRiscv toplevel :
+You can see it in the VexiiRiscv toplevel :

 ..
code-block:: scala

@@ -221,22 +223,22 @@ What it does is that all the plugin thread will run in the context of that datab
         Global.VIRTUAL_WIDTH.set(39)
       }
     }
-
-  object Gen extends App{
+
+  object Gen extends App{
     SpinalVerilog{
       val plugins = ArrayBuffer[FiberPlugin]()
       plugins += new LoadStorePlugin()
       plugins += new MmuPlugin()
       VexiiRiscv(plugins)
     }
-  }
+  }

 Pipeline API
---------------------
+------------

-In short, the design use a pipeline API in order to :
+In short, the design uses a pipeline API in order to :

-- Propagate data into the pipeline automaticaly
+- Propagate data into the pipeline automatically
 - Allow design space exploration with less pain (retiming, moving around the architecture)
 - Reduce boilerplate code

diff --git a/source/VexiiRiscv/HowToUse/index.rst b/source/VexiiRiscv/HowToUse/index.rst
index c585ba6..2aa4323 100644
--- a/source/VexiiRiscv/HowToUse/index.rst
+++ b/source/VexiiRiscv/HowToUse/index.rst
@@ -45,8 +45,8 @@ On debian :

     # RVLS / Spike dependencies
     sudo apt-get install device-tree-compiler libboost-all-dev

-    # Install ELFIO, used to load elf file in the sim
-    git clone https://github.com/serge1/ELFIO.git
+    # Install ELFIO, used to load elf files in the sim
+    git clone https://github.com/serge1/ELFIO.git
     cd ELFIO
     git checkout d251da09a07dff40af0b63b8f6c8ae71d2d1938d # Avoid C++17
     sudo cp -R elfio /usr/include
@@ -55,7 +55,7 @@ On debian :

 Repo setup
----------------

-After installing the dependencies (see above) :
+After installing the dependencies (see above) :

 .. code-block:: bash

@@ -117,7 +117,7 @@ You can get a list of the supported parameters via :

 Run a simulation
------------------
+----------------

 Note that Vexiiriscv uses mostly an opt-in configuration. So, most performance related configurations are disabled by default.

@@ -131,18 +131,18 @@ Note that Vexiiriscv use mostly an opt-in configuration.
So, most performance re

 This will generate a simWorkspace/VexiiRiscv/test folder which contains :

 - test.fst : A wave file which can be opened with gtkwave. It shows all the CPU signals
-- konata.log : A wave file which can be open with https://github.com/shioyadan/Konata, it shows the pipeline behaviour of the CPU
+- konata.log : A log file which can be opened with https://github.com/shioyadan/Konata, it shows the pipeline behavior of the CPU
 - spike.log : The execution logs of Spike (golden model)
 - tracer.log : The execution logs of VexiiRiscv (Simulation model)

-Here is an example of the additional argument you can use to improve the IPC :
+Here is an example of the additional arguments you can use to improve the IPC :

 .. code-block:: bash

    --with-btb --with-gshare --with-ras --decoders 2 --lanes 2 --with-aligner-buffer --with-dispatcher-buffer --with-late-alu --regfile-async --allow-bypass-from 0 --div-radix 4

-Here is a screen shot of a cache-less VexiiRiscv booting linux :
+Here is a screenshot of a cache-less VexiiRiscv booting linux :

 .. image:: /asset/picture/konata.png

@@ -153,15 +153,15 @@ Synthesis / Inferation
-----------------------

 VexiiRiscv is designed in a way which should make it easy to deploy on all FPGAs,
-including the ones without support for asyncronous memory read
+including the ones without support for asynchronous memory read
 (LUT ram / distributed ram / MLAB).

 The one exception is the MMU, but if configured to only read the memory on cycle 0
-(no tag hit), then the synthesis tool should be capable of inferring that asyncronus
-read into a syncronous one (RAM block, work on Efinix FPGA)
+(no tag hit), then the synthesis tool should be capable of inferring that asynchronous
+read into a synchronous one (RAM block, works on Efinix FPGAs)

 By default SpinalHDL will generate memories in a Verilog/VHDL inferable way.
-Otherwise, for ASIC, you likely want to enable the automatic memory blackboxing,
-which will instead replace all memories defined in the design by a consistant blackbox
+Otherwise, for ASIC, you likely want to enable the automatic memory blackboxing,
+which will instead replace all memories defined in the design by a consistent blackbox
 module/component, the user then having to provide those blackbox implementations.

 Currently all memories used are "simple dual port ram". While this is the best for FPGA usages,

diff --git a/source/VexiiRiscv/Introduction/index.rst b/source/VexiiRiscv/Introduction/index.rst
index 955d23d..7292c82 100644
--- a/source/VexiiRiscv/Introduction/index.rst
+++ b/source/VexiiRiscv/Introduction/index.rst
@@ -9,9 +9,9 @@ In a few words, VexiiRiscv :
 - Should fit well on FPGA and ASIC

 Other doc / media / talks
---------------------------
+-------------------------

-Here is a list of links to ressources which present or document VexiiRiscv :
+Here is a list of links to resources which present or document VexiiRiscv :

 - FSiC 2024 : https://wiki.f-si.org/index.php?title=Moving_toward_VexiiRiscv
 - COSCUP 2024 : https://coscup.org/2024/en/session/PVAHAS

@@ -21,9 +21,9 @@ Here is a list of links to ressources which present or document VexiiRiscv :

 Technicalities
------------------------------

-VexiiRiscv is a from scratch second iteration of VexRiscv, with the following goals :
+VexiiRiscv is a from scratch second iteration of VexRiscv, with the following goals :

-- To imlement RISC-V 32/64 bits IMAFDCSU
+- To implement RISC-V 32/64 bits IMAFDCSU
 - Could start as small as VexRiscv, but could scale further in performance
 - Optional late-alu
 - Optional multi issue
@@ -32,10 +32,10 @@ VexiiRiscv is a from scratch second iteration of VexRiscv, with the following go
 - Proper branch prediction
 - ...
-On this date (09/08/2024) the status is :
+As of this date (09/08/2024) the status is :

 - RISC-V 32/64 IMAFDCSU supported (Multiply / Atomic / Float / Double / Supervisor / User)
-- Can run baremetal applications (2.50 dhrystone/mhz, 5.24 coremark/mhz)
+- Can run baremetal applications (2.50 Dhrystone/MHz, 5.24 CoreMark/MHz)
 - Can run linux/buildroot/debian on FPGA hardware (via litex)
 - single/dual issue supported
 - late-alu supported
@@ -51,28 +51,28 @@ Here is a diagram with 2 issue / early+late alu / 6 stages configuration (note t

 .. image:: /asset/picture/architecture_all_1.png

 Navigating the code
----------------------------------
+-------------------

-Here are a few key / typical code examples :
+Here are a few key / typical code examples :

 - The CPU toplevel : src/main/scala/vexiiriscv/VexiiRiscv.scala
 - A cpu configuration generator : dev/src/main/scala/vexiiriscv/Param.scala
-- Some globaly shared definitions : src/main/scala/vexiiriscv/Global.scala
+- Some globally shared definitions : src/main/scala/vexiiriscv/Global.scala
 - Integer ALU plugin : src/main/scala/vexiiriscv/execute/IntAluPlugin.scala

 Also, one quite important thing is to use a text editor / IDE which supports curly brace folding and to start with them fully folded, as the code extensively uses nested structures.

 Check list
------------------------
+----------

-Here is a list of important assumptions and things to know about :
+Here is a list of important assumptions and things to know about :

-- trap/flush/pc request from the pipeline, once asserted one cycle can not be undone. This also mean that while a given instruction is stuck somewere, if that instruction did raised on of those request, nothing should change the execution path. For instance, a sudden cache line refill completion should not lift the request from the LSU asking a redo (due to cache refill hazard).
+- trap/flush/pc requests from the pipeline, once asserted for one cycle, can not be undone.
This also means that while a given instruction is stuck somewhere, if that instruction raised one of those requests, nothing should change the execution path. For instance, a sudden cache line refill completion should not lift the request from the LSU asking a redo (due to cache refill hazard).
 - In the execute pipeline, stage.up(RS1/RS2) is the value to be used, while stage.down(RS1/RS2) should not be used, as it implements the bypassing for the next stage
-- Fetch.ctrl(0) isn't persistant.
+- Fetch.ctrl(0) isn't persistent.

 About VexRiscv (not VexiiRiscv)
-------------------------------------
+-------------------------------

 There are a few reasons why VexiiRiscv exists instead of doing incremental upgrades on VexRiscv

diff --git a/source/VexiiRiscv/Performance/index.rst b/source/VexiiRiscv/Performance/index.rst
index 8eb1db1..abb8a8c 100644
--- a/source/VexiiRiscv/Performance/index.rst
+++ b/source/VexiiRiscv/Performance/index.rst
@@ -1,7 +1,7 @@
 Performance / Area / FMax
 =========================

-It is still very early in the developement, but here are some metrics :
+It is still very early in the development, but here are some metrics :

 +---------------+----------------+
 | Name          | Max IPC        |
 +---------------+----------------+
@@ -14,7 +14,7 @@ It is still very early in the developement, but here are some metrics :
 +---------------+----------------+
 | GShare        | 4KB            |
 +---------------+----------------+
 | Dhrystone/MHz | 2.50           |
 +---------------+----------------+
 | Coremark/MHz  | 5.24           |
 +---------------+----------------+

@@ -27,23 +27,23 @@ It is too early for area / fmax metric, there is a lot of design space explorati

 Tuning
--------------
+------

 VexiiRiscv can scale a lot as a function of its plugins/parameters.
It can scale from a simple microcontroller (ex : M0) up to an application processor (A53).

-On FPGA there is a few options which can be key in order to scale up the IPC while preserving the FMax :
+On FPGA there are a few options which can be key in order to scale up the IPC while preserving the FMax :

-- --relaxed-btb : When the BTB is enabled, by default it is implemented as a single cycle predictor,
-  This can be easily be the first critical path to appear.
-  This option make the BTB implementation spread over 2 cycles,
-  which relax the timings at the cost of 1 cycle penality on every successfull branch predictions.
+- --relaxed-btb : When the BTB is enabled, by default it is implemented as a single cycle predictor.
+  This can easily be the first critical path to appear.
+  This option makes the BTB implementation spread over 2 cycles,
+  which relaxes the timings at the cost of a 1 cycle penalty on every successful branch prediction.

-- --relaxed-branch : By default, the BranchPlugin will flush/setPc in the same stage
-  than its own ALU. This is good for IPC but can easily be a critical path.
-  This option will add one cycle latency between the ALU and the side effects (flush/setPc)
+- --relaxed-branch : By default, the BranchPlugin will flush/setPc in the same stage
+  as its own ALU. This is good for IPC but can easily be a critical path.
+  This option will add one cycle of latency between the ALU and the side effects (flush/setPc)
 in order to improve timings. If you enable the branch prediction, then the impact on the IPC should be quite low.
-- --fma-reduced-accuracy and --fpu-ignore-subnormal both reduce and can improve the fmax
+- --fma-reduced-accuracy and --fpu-ignore-subnormal can both improve the fmax
   at the cost of accuracy

diff --git a/source/VexiiRiscv/Soc/index.rst b/source/VexiiRiscv/Soc/index.rst
index e0f5d6a..457bde4 100644
--- a/source/VexiiRiscv/Soc/index.rst
+++ b/source/VexiiRiscv/Soc/index.rst
@@ -1,5 +1,5 @@
 SoC
-============
+===

 This is currently WIP.

diff --git a/source/VexiiRiscv/Soc/litex.rst b/source/VexiiRiscv/Soc/litex.rst
index 890cc6e..3c2c9e7 100644
--- a/source/VexiiRiscv/Soc/litex.rst
+++ b/source/VexiiRiscv/Soc/litex.rst
@@ -1,14 +1,14 @@
 Litex
-------------------------------
+-----

 VexiiRiscv can also be deployed using Litex.

-You can find some fully self contained example about how to generate the software and hardware files to run buildroot and debian here :
+You can find some fully self contained examples of how to generate the software and hardware files to run buildroot and debian here :

 - https://github.com/SpinalHDL/VexiiRiscv/tree/dev/doc/litex

-For instance, you can run the following litex command to generate a linux capable SoC on the digilent_nexys_video dev kit (RV32IMA):
+For instance, you can run the following litex command to generate a linux capable SoC on the digilent_nexys_video dev kit (RV32IMA):

 ..
code:: shell

@@ -20,13 +20,13 @@ Here is an example for a dual core, debian capable (RV64GC) with L2 cache and a

     python3 -m litex_boards.targets.digilent_nexys_video --cpu-type=vexiiriscv --cpu-variant=debian --cpu-count=2 --with-video-framebuffer --with-sdcard --with-ethernet --with-coherent-dma --l2-byte=262144 --build --load

-Additional arguements can be provided to customize the VexiiRiscv configuration, for instance the following will enable the PMU, 0 cycle latency register file, multiple outstanding D$ refill/writeback and store buffer:
+Additional arguments can be provided to customize the VexiiRiscv configuration, for instance the following will enable the PMU, a 0 cycle latency register file, multiple outstanding D$ refill/writeback, and the store buffer:

 .. code:: shell

    --vexii-args="--performance-counters 9 --regfile-async --lsu-l1-refill-count 2 --lsu-l1-writeback-count 2 --lsu-l1-store-buffer-ops=32 --lsu-l1-store-buffer-slots=2"

-To generate a DTS, i recommand adding `--soc-json build/csr.json` to the command line, and then running :
+To generate a DTS, I recommend adding `--soc-json build/csr.json` to the command line, and then running :

 .. code:: shell

@@ -45,17 +45,17 @@ That linux.dts will miss the CLINT definition (used by opensbi), so you need to

         &L3 3 &L3 7>;
         reg = <0xf0010000 0x10000>;
     };
-
-Then you can convert the linux.dts into linux.dtb via :
+
+Then you can convert the linux.dts into linux.dtb via :

 .. code:: shell

     dtc -O dtb -o build/linux.dtb build/linux.dts

-To run debian, you would need to change the dts boot device to your block device, aswell as removing the initrd from the dts. You can find more information about how to setup the debian images on https://github.com/SpinalHDL/NaxSoftware/tree/main/debian_litex
+To run debian, you would need to change the dts boot device to your block device, as well as remove the initrd from the dts.
You can find more information about how to set up the debian images at https://github.com/SpinalHDL/NaxSoftware/tree/main/debian_litex

-But note that for opensbi, use instead the following (official upstream opensbi using the generic platform, which will also contains the dtb):
+But note that for opensbi, use instead the following (official upstream opensbi using the generic platform, which will also contain the dtb):

 .. code:: shell

diff --git a/source/VexiiRiscv/Soc/microsoc.rst b/source/VexiiRiscv/Soc/microsoc.rst
index d6f741d..e4dee5d 100644
--- a/source/VexiiRiscv/Soc/microsoc.rst
+++ b/source/VexiiRiscv/Soc/microsoc.rst
@@ -2,18 +2,18 @@ This is currently WIP.

 MicroSoc
-------------------------------
+--------

 MicroSoC is a little SoC based on VexiiRiscv and a tilelink interconnect.

 .. image:: /asset/picture/microsoc.png

-Here you can see the default vexiiriscv architecture for this SoC :
+Here you can see the default vexiiriscv architecture for this SoC :

 .. image:: /asset/picture/microsoc_vexii.png

-Its goals are :
+Its goals are :

 - Provide a simple reference design
 - To be a simple and light FPGA SoC