
Shared Resource Bus Old Graphics Demo


For now, the new AXI DDR memory based graphics demo drawing Mandelbrot lives here; it will eventually replace the graphics demo currently on the main shared resource bus page:


New Shared Resource Bus Graphics Demo

This graphics demo differs from the old one (which is worth reading for more details). Instead of requiring on-chip block RAM, this new demo can use off-chip DDR memory for full color frame buffers. Additionally, this demo focuses on a more complex rendering computation that can benefit from PipelineC's auto-pipelining.

TODO GENERAL GRAPHICS DEMO DIAGRAM

Dual Frame Buffer

The graphics_demo.c file is an example exercising a dual frame buffer as a shared bus resource from dual_frame_buffer.c. The demo slowly cycles through R,G,B color ranges, requiring for each pixel: a read from frame buffer RAM, minimal computation to update pixel color, and a write back to frame buffer RAM for display.

The frame buffer is configured to use a Xilinx AXI DDR controller starting inside ddr_dual_frame_buffer.c. The basic shared resource bus setup for connecting to the Xilinx DDR memory controller AXI bus can be found in axi_xil_mem.c. In that file an instance of an axi_shared_bus_t shared resource bus (defined in axi_shared_bus.h) called axi_xil_mem is declared using the shared_resource_bus_decl.h file include-as-macro helper.

Displaying Frame Buffer Pixels

In addition to the 'user' rendering threads, the frame buffer memory shared resource also needs to supply a stream of pixel reads fast enough to meet the VGA pixel clock timing of the connected display.
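
As a ballpark for that requirement: standard 640x480@60Hz VGA timing uses an 800x525 total raster at a 25.175 MHz pixel clock, so on average the read port must sustain roughly 640*480*60 ≈ 18.4M pixel reads per second, with short-term burstiness absorbed by the async pixel FIFO described below.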

Unlike the old demo, in this demo ddr_dual_frame_buffer.c uses a separate 'read-only priority port' wire, axi_xil_rd_pri_port_mem_host_to_dev_wire, to simply connect a VGA position counter to a dedicated read request side of the shared resource bus. Responses from the bus are the pixels, which are written directly into the vga_pmod_async_pixels_fifo.c display stream.

MAIN_MHZ(host_vga_reader, XIL_MEM_MHZ)
void host_vga_reader()
{
  static uint1_t frame_buffer_read_port_sel_reg;

  // READ REQUEST SIDE
  // Increment VGA counters and do read for each position
  static vga_pos_t vga_pos;
  // Read and increment pos if room in fifos (cant be greedy since will 100% hog priority port)
  uint1_t fifo_ready;
  #pragma FEEDBACK fifo_ready
  // Read from the current read frame buffer addr
  uint32_t addr = pos_to_addr(vga_pos.x, vga_pos.y);
  axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.data.user.araddr = dual_ram_to_addr(frame_buffer_read_port_sel_reg, addr);
  axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.data.user.arlen = 1-1; // size=1 minus 1: 1 transfer cycle (non-burst)
  axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.data.user.arsize = 2; // 2^2=4 bytes per transfer
  axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.data.user.arburst = BURST_FIXED; // Not a burst, single fixed address per transfer
  axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.valid = fifo_ready;
  uint1_t do_increment = fifo_ready & axi_xil_rd_pri_port_mem_dev_to_host_wire.read.req_ready;
  vga_pos = vga_frame_pos_increment(vga_pos, do_increment);

  // READ RESPONSE SIDE
  // Get read data from the AXI RAM bus
  uint8_t data[4];
  uint1_t data_valid = 0;
  data = axi_xil_rd_pri_port_mem_dev_to_host_wire.read.data.burst.data_resp.user.rdata;
  data_valid = axi_xil_rd_pri_port_mem_dev_to_host_wire.read.data.valid;
  // Write pixel data into fifo
  pixel_t pixel;
  pixel.a = data[0];
  pixel.r = data[1];
  pixel.g = data[2];
  pixel.b = data[3];
  pixel_t pixels[1];
  pixels[0] = pixel;
  fifo_ready = pmod_async_fifo_write_logic(pixels, data_valid);
  axi_xil_rd_pri_port_mem_host_to_dev_wire.read.data_ready = fifo_ready;

  frame_buffer_read_port_sel_reg = frame_buffer_read_port_sel;
}
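
The pos_to_addr and dual_ram_to_addr helpers used above boil down to row-major pixel addressing plus selecting which of the two frame buffers to access. A minimal sketch, assuming 4 bytes per pixel and a hypothetical per-buffer offset constant (the real definitions live in dual_frame_buffer.c/ddr_dual_frame_buffer.c and also account for tiling), might look like:

// Illustrative sketch of the addressing helpers (see dual_frame_buffer.c for the real code)
uint32_t pos_to_addr(uint16_t x, uint16_t y)
{
  // Row-major pixel index, 4 bytes (ARGB) per pixel
  // (the demo additionally scales x,y down by TILE_FACTOR, omitted here)
  uint32_t pixel_index = (y * FRAME_WIDTH) + x;
  return pixel_index << 2; // * 4 bytes
}
uint32_t dual_ram_to_addr(uint1_t ram_sel, uint32_t addr)
{
  // The second frame buffer lives at a fixed offset in DDR
  // FRAME_BUFFER_RAM_SIZE is an illustrative name for that offset
  uint32_t base = 0;
  if(ram_sel){
    base = FRAME_BUFFER_RAM_SIZE;
  }
  return base + addr;
}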

Threads + Kernel

Computation Kernel

In graphics_demo.c the pixel_kernel function implements incrementing RGB channel values as a simple test pattern.
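
A minimal sketch of such a kernel, assuming the signature used below in pixels_kernel_seq_range (the real pixel_kernel in graphics_demo.c differs in detail), might be:

// Illustrative test-pattern kernel: step each color channel of the pixel
pixel_t pixel_kernel(kernel_args_t args, pixel_t pixel, uint16_t x, uint16_t y)
{
  pixel.r += 1;
  pixel.g += 1;
  pixel.b += 1;
  return pixel;
}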

The pixels_kernel_seq_range function iterates over a range of the frame area, executing pixel_kernel for each pixel. The frame area is defined by start and end x and y positions.

// Single 'thread' state machine running pixel_kernel "sequentially" across an x,y range
void pixels_kernel_seq_range(
  kernel_args_t args,
  uint16_t x_start, uint16_t x_end, 
  uint16_t y_start, uint16_t y_end)
{
  uint16_t x;
  uint16_t y;
  for(y=y_start; y<=y_end; y+=TILE_FACTOR)
  {
    for(x=x_start; x<=x_end; x+=TILE_FACTOR)
    {
      if(args.do_clear){
        pixel_t pixel = {0};
        frame_buf_write(x, y, pixel);
      }else{
        // Read the pixel from the 'read' frame buffer
        pixel_t pixel = frame_buf_read(x, y);
        pixel = pixel_kernel(args, pixel, x, y);
        // Write pixel back to the 'write' frame buffer
        frame_buf_write(x, y, pixel);
      }
    }
  }
}

Multiple Threads

Multiple host threads can read and write the frame buffers, each executing its own sequential run of pixels_kernel_seq_range. This is accomplished by manually instantiating multiple derived FSM thread pixels_kernel_seq_range_FSM modules inside a function called render_demo_kernel. The NUM_TOTAL_THREADS = (NUM_X_THREADS*NUM_Y_THREADS) copies of pixels_kernel_seq_range all run in parallel, splitting FRAME_WIDTH across NUM_X_THREADS threads and FRAME_HEIGHT across NUM_Y_THREADS threads.
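
For example, with NUM_X_THREADS = 4 and NUM_Y_THREADS = 2 on a 640x480 frame, render_demo_kernel below starts 8 threads and each one sequentially covers its own 160x240 region (stepping by TILE_FACTOR within that region).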

// Module that runs pixel_kernel for every pixel
// by instantiating multiple simultaneous 'threads' of pixel_kernel_seq_range
void render_demo_kernel(
  kernel_args_t args,
  uint16_t x, uint16_t width,
  uint16_t y, uint16_t height
){
  // Wire up N parallel pixel_kernel_seq_range_FSM instances
  // Per-thread completion flags, none done yet
  uint1_t thread_done[NUM_X_THREADS][NUM_Y_THREADS] = {0};
  uint32_t i,j;
  uint1_t all_threads_done = 0;
  while(!all_threads_done)
  {
    pixels_kernel_seq_range_INPUT_t fsm_in[NUM_X_THREADS][NUM_Y_THREADS];
    pixels_kernel_seq_range_OUTPUT_t fsm_out[NUM_X_THREADS][NUM_Y_THREADS];
    all_threads_done = 1;
    
    uint16_t thread_x_size = width >> NUM_X_THREADS_LOG2;
    uint16_t thread_y_size = height >> NUM_Y_THREADS_LOG2;
    for (i = 0; i < NUM_X_THREADS; i+=1)
    {
      for (j = 0; j < NUM_Y_THREADS; j+=1)
      {
        if(!thread_done[i][j])
        {
          fsm_in[i][j].input_valid = 1;
          fsm_in[i][j].output_ready = 1;
          fsm_in[i][j].args = args;
          fsm_in[i][j].x_start = (thread_x_size*i) + x;
          fsm_in[i][j].x_end = fsm_in[i][j].x_start + thread_x_size - 1;
          fsm_in[i][j].y_start = (thread_y_size*j) + y;
          fsm_in[i][j].y_end = fsm_in[i][j].y_start + thread_y_size - 1;
          fsm_out[i][j] = pixels_kernel_seq_range_FSM(fsm_in[i][j]);
          thread_done[i][j] = fsm_out[i][j].output_valid;
        }
        all_threads_done &= thread_done[i][j];
      }
    }
    __clk();
  }
}

render_demo_kernel can then simply run in a loop, trying for the fastest frames per second possible.

void main()
{
  kernel_args_t args;
  ...
  while(1)
  {
    // Render entire frame
    render_demo_kernel(args, 0, FRAME_WIDTH, 0, FRAME_HEIGHT);
  }
}

The actual main() in graphics_demo.c also does some extra DDR initialization, is slowed down so the test pattern renders gradually, and toggles the dual frame buffer's 'which buffer is read for display' select signal after each render_demo_kernel iteration: frame_buffer_read_port_sel = !frame_buffer_read_port_sel;.
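
Roughly, that frame loop looks like the following sketch (DDR initialization and the pacing/slow-down details are omitted; see graphics_demo.c for the real code):

while(1)
{
  // Render the entire frame via the parallel kernel threads
  render_demo_kernel(args, 0, FRAME_WIDTH, 0, FRAME_HEIGHT);
  // Swap which buffer is read for display vs. written by the kernels
  frame_buffer_read_port_sel = !frame_buffer_read_port_sel;
  // (args are also slowly stepped here to animate the test pattern)
}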

Pipelines as Shared Resource

The above graphics demo uses an AXI RAM frame buffer as the resource shared on a bus.

Another common use case is having an automatically pipelined function as the shared resource. shared_resource_bus_pipeline.h is a header-as-macro helper for declaring a pipeline instance connected to multiple host state machines via a shared resource bus.

// Example declaration using helper header-as-macro
#define SHARED_RESOURCE_BUS_PIPELINE_NAME         name
#define SHARED_RESOURCE_BUS_PIPELINE_OUT_TYPE     output_t
#define SHARED_RESOURCE_BUS_PIPELINE_FUNC         the_func_to_pipeline
#define SHARED_RESOURCE_BUS_PIPELINE_IN_TYPE      input_t
#define SHARED_RESOURCE_BUS_PIPELINE_HOST_THREADS NUM_THREADS
#define SHARED_RESOURCE_BUS_PIPELINE_HOST_CLK_MHZ HOST_CLK_MHZ
#define SHARED_RESOURCE_BUS_PIPELINE_DEV_CLK_MHZ  DEV_CLK_MHZ
#include "shared_resource_bus_pipeline.h"

In the above example a function output_t the_func_to_pipeline(input_t) is made into a pipeline instance that is used like output_t name(input_t) from NUM_THREADS derived FSM host threads (running at HOST_CLK_MHZ). The function is automatically pipelined to meet the target DEV_CLK_MHZ operating frequency.
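
From the host side the pipeline then behaves like an ordinary blocking function call, e.g. inside one of the NUM_THREADS derived FSM threads:

// Each call sends a request onto the shared bus and waits for this thread's result
input_t i;
// ... fill in i ...
output_t o = name(i);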

Mandelbrot Demo

As a demonstration of shared_resource_bus_pipeline.h, the mandelbrot_demo.c file instantiates several computation pipeline devices inside shared_mandelbrot_dev.c.

For example the device for computing Mandelbrot iterations is declared as such:

// Do N Mandelbrot iterations per call to mandelbrot_iter_func
#define ITER_CHUNK_SIZE 6
#define MAX_ITER 32
typedef struct mandelbrot_iter_t{
  complex_t c;
  complex_t z;
  complex_t z_squared;
  uint1_t escaped;
  uint32_t n;
}mandelbrot_iter_t;
#define ESCAPE 2.0
mandelbrot_iter_t mandelbrot_iter_func(mandelbrot_iter_t inputs)
{
  mandelbrot_iter_t rv = inputs;
  uint32_t i;
  for(i=0;i<ITER_CHUNK_SIZE;i+=1)
  {
    // Mimic while loop
    if(!rv.escaped & (rv.n < MAX_ITER))
    {
      // float_lshift multiplies by a power of two by adjusting the exponent only (here *2)
      rv.z.im = float_lshift((rv.z.re*rv.z.im), 1) + rv.c.im;
      rv.z.re = rv.z_squared.re - rv.z_squared.im + rv.c.re;
      rv.z_squared.re = rv.z.re * rv.z.re;
      rv.z_squared.im = rv.z.im * rv.z.im;
      rv.n = rv.n + 1;
      rv.escaped = (rv.z_squared.re+rv.z_squared.im) > (ESCAPE*ESCAPE);
    }
  }
  return rv;
}
#define SHARED_RESOURCE_BUS_PIPELINE_NAME         mandelbrot_iter
#define SHARED_RESOURCE_BUS_PIPELINE_OUT_TYPE     mandelbrot_iter_t
#define SHARED_RESOURCE_BUS_PIPELINE_FUNC         mandelbrot_iter_func
#define SHARED_RESOURCE_BUS_PIPELINE_IN_TYPE      mandelbrot_iter_t
#define SHARED_RESOURCE_BUS_PIPELINE_HOST_THREADS NUM_USER_THREADS
#define SHARED_RESOURCE_BUS_PIPELINE_HOST_CLK_MHZ HOST_CLK_MHZ
#define SHARED_RESOURCE_BUS_PIPELINE_DEV_CLK_MHZ  MANDELBROT_DEV_CLK_MHZ
#include "shared_resource_bus_pipeline.h"

Other devices include screen_to_complex, for converting a screen position into a complex plane position, as well as iter_to_color, which takes an integer number of iterations and returns an RGB color.
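
A rendering thread can then be pictured as chaining these devices per pixel, roughly like the sketch below (argument and return types are illustrative; see mandelbrot_demo.c for the real code):

// Illustrative per-pixel flow through the shared pipeline devices
mandelbrot_iter_t iter;
iter.c = screen_to_complex(screen_pos); // screen x,y -> complex plane c
iter.z.re = 0.0; iter.z.im = 0.0;       // z starts at 0
iter.z_squared = iter.z;
iter.escaped = 0;
iter.n = 0;
while(!iter.escaped & (iter.n < MAX_ITER))
{
  iter = mandelbrot_iter(iter); // ITER_CHUNK_SIZE iterations per call
}
pixel_t color = iter_to_color(iter.n); // iteration count -> RGB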

Simulation

Using the setup from inside the PipelineC-Graphics repo, the following commands will compile and run the demo with a 480p display output and 8x tiling down to an 80x60 pixel frame buffer.

rm -Rf ./build
../PipelineC/src/pipelinec mandelbrot_pipelinec_app.c --out_dir ./build --comb --sim --verilator --run -1
verilator -Mdir ./obj_dir -Wno-UNOPTFLAT -Wno-WIDTH -Wno-CASEOVERLAP --top-module top -cc ./build/top/top.v -O3 --exe main.cpp -I./build/verilator -CFLAGS -DUSE_VERILATOR -CFLAGS -DFRAME_WIDTH=640 -CFLAGS -DFRAME_HEIGHT=480 -LDFLAGS $(shell sdl2-config --libs)
cp ./main.cpp ./obj_dir
make CXXFLAGS="-DUSE_VERILATOR -I../../PipelineC/ -I../../PipelineC/pipelinec/include -I../build/verilator -I.." -C ./obj_dir -f Vtop.mk
./obj_dir/Vtop

Alternatively, cloning the mandelbrot branch allows you to simply run make mandelbrot_verilator instead.

Conclusion

Using these shared resource buses it's possible to picture even more complex host threads and computation devices. For instance, a single thread could start multiple shared resource bus transactions at once, further increasing the parallelism that can be exploited.
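
For example, if a bus exposes separate non-blocking 'start' and blocking 'finish' style calls (the function names below are purely illustrative), one thread could overlap the latency of two frame buffer reads:

// Hypothetical: kick off two reads before collecting either result
frame_buf_read_start(x0, y0);
frame_buf_read_start(x1, y1);
pixel_t p0 = frame_buf_read_finish();
pixel_t p1 = frame_buf_read_finish();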

Generally, the functionality in shared_resource_bus.h will continue to be improved and made easier to adapt to more design situations.

Please reach out if interested in giving anything a try or making improvements, happy to help! -Julian