Shared Resource Bus Old Graphics Demo

For now, the new AXI DDR memory based graphics demo drawing Mandelbrot lives here; it will eventually be swapped with the graphics demo currently on the shared resource bus page:


New Shared Resource Bus Graphics Demo

This graphics demo differs from the old one (which is still worth reading for more background). Instead of using on-chip block RAM, this new demo uses off-chip DDR memory for full color frame buffers. Additionally, this demo focuses on a more complex rendering computation that can benefit from PipelineC's auto-pipelining.

TODO GENERAL GRAPHICS DEMO DIAGRAM

Dual Frame Buffer

The graphics_demo.c file is an example exercising a dual frame buffer as a shared bus resource from dual_frame_buffer.c. The demo slowly cycles through R,G,B color ranges, requiring for each pixel: a read from frame buffer RAM, minimal computation to update the pixel color, and a write back to frame buffer RAM for display.

The frame buffer is configured to use a Xilinx AXI DDR controller starting inside ddr_dual_frame_buffer.c. The basic shared resource bus setup for connecting to the Xilinx DDR memory controller AXI bus can be found in axi_xil_mem.c. In that file an instance of an axi_shared_bus_t shared resource bus (defined in axi_shared_bus.h) called axi_xil_mem is declared using the shared_resource_bus_decl.h file include-as-macro helper.
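The declaration pattern looks roughly like the following sketch (illustrative only; the exact parameter macro names are defined by shared_resource_bus_decl.h and axi_shared_bus.h, and may differ from what is shown here):

// Hypothetical sketch of the include-as-macro declaration pattern
#define SHARED_RESOURCE_BUS_NAME       axi_xil_mem
#define SHARED_RESOURCE_BUS_NUM_HOSTS  NUM_HOST_THREADS
#define SHARED_RESOURCE_BUS_NUM_DEVS   1
#include "shared_resource_bus_decl.h"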

Displaying Frame Buffer Pixels

In addition to the 'user' rendering threads, the frame buffer memory shared resource needs to read out pixels at a rate that meets the streaming requirements of the VGA pixel clock timing for the connected display.
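For example, standard 640x480 60Hz VGA timing uses a ~25.175 MHz pixel clock; at 4 bytes per pixel that is roughly 100 MB/s of sustained read bandwidth just for display refresh.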

Unlike the old demo, in this demo ddr_dual_frame_buffer.c uses a separate 'read-only priority port' wire, axi_xil_rd_pri_port_mem_host_to_dev_wire, to connect a VGA counter directly to a dedicated read request side of the shared resource bus. Responses from the bus are the pixels, which are written directly into the vga_pmod_async_pixels_fifo.c display stream.

MAIN_MHZ(host_vga_reader, XIL_MEM_MHZ)
void host_vga_reader()
{
  static uint1_t frame_buffer_read_port_sel_reg;

  // READ REQUEST SIDE
  // Increment VGA counters and do read for each position
  static vga_pos_t vga_pos;
  // Read and increment pos if room in FIFOs (can't be greedy since it would hog the priority port)
  uint1_t fifo_ready;
  #pragma FEEDBACK fifo_ready
  // Read from the current read frame buffer addr
  uint32_t addr = pos_to_addr(vga_pos.x, vga_pos.y);
  axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.data.user.araddr = dual_ram_to_addr(frame_buffer_read_port_sel_reg, addr);
  axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.data.user.arlen = 1-1; // arlen = (num transfers - 1) = 0: single transfer (non-burst)
  axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.data.user.arsize = 2; // 2^2=4 bytes per transfer
  axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.data.user.arburst = BURST_FIXED; // Not a burst, single fixed address per transfer
  axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.valid = fifo_ready;
  uint1_t do_increment = fifo_ready & axi_xil_rd_pri_port_mem_dev_to_host_wire.read.req_ready;
  vga_pos = vga_frame_pos_increment(vga_pos, do_increment);

  // READ RESPONSE SIDE
  // Get read data from the AXI RAM bus
  uint8_t data[4];
  uint1_t data_valid = 0;
  data = axi_xil_rd_pri_port_mem_dev_to_host_wire.read.data.burst.data_resp.user.rdata;
  data_valid = axi_xil_rd_pri_port_mem_dev_to_host_wire.read.data.valid;
  // Write pixel data into fifo
  pixel_t pixel;
  pixel.a = data[0];
  pixel.r = data[1];
  pixel.g = data[2];
  pixel.b = data[3];
  pixel_t pixels[1];
  pixels[0] = pixel;
  fifo_ready = pmod_async_fifo_write_logic(pixels, data_valid);
  axi_xil_rd_pri_port_mem_host_to_dev_wire.read.data_ready = fifo_ready;

  frame_buffer_read_port_sel_reg = frame_buffer_read_port_sel;
}

Threads + Kernel

Computation Kernel

In graphics_demo.c the pixel_kernel function implements incrementing RGB channel values as a test pattern.
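A minimal sketch of what such a kernel looks like (the actual code lives in graphics_demo.c; the kernel_args_t field names here are illustrative):

// Sketch: increment color channels to make a slowly changing test pattern
pixel_t pixel_kernel(kernel_args_t args, pixel_t pixel, uint16_t x, uint16_t y)
{
  pixel.r += args.r_inc; // hypothetical per-frame increment args
  pixel.g += args.g_inc;
  pixel.b += args.b_inc;
  return pixel;
}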

The pixels_kernel_seq_range function iterates over a range of the frame area, executing pixel_kernel for each pixel. The frame area is defined by start and end x and y positions.

// Single 'thread' state machine running pixel_kernel "sequentially" across an x,y range
void pixels_kernel_seq_range(
  kernel_args_t args,
  uint16_t x_start, uint16_t x_end, 
  uint16_t y_start, uint16_t y_end)
{
  uint16_t x;
  uint16_t y;
  for(y=y_start; y<=y_end; y+=TILE_FACTOR)
  {
    for(x=x_start; x<=x_end; x+=TILE_FACTOR)
    {
      if(args.do_clear){
        pixel_t pixel = {0};
        frame_buf_write(x, y, pixel);
      }else{
        // Read the pixel from the 'read' frame buffer
        pixel_t pixel = frame_buf_read(x, y);
        pixel = pixel_kernel(args, pixel, x, y);
        // Write pixel back to the 'write' frame buffer
        frame_buf_write(x, y, pixel);
      }
    }
  }
}

Multiple Threads

Multiple host threads can read and write the frame buffers, each executing its own sequential run of pixels_kernel_seq_range. This is accomplished by manually instantiating multiple derived FSM thread pixels_kernel_seq_range_FSM modules inside a function called render_demo_kernel. The NUM_TOTAL_THREADS = (NUM_X_THREADS*NUM_Y_THREADS) copies of pixels_kernel_seq_range all run in parallel, splitting FRAME_WIDTH across NUM_X_THREADS threads and FRAME_HEIGHT across NUM_Y_THREADS threads.

// Module that runs pixel_kernel for every pixel
// by instantiating multiple simultaneous 'threads' of pixel_kernel_seq_range
void render_demo_kernel(
  kernel_args_t args,
  uint16_t x, uint16_t width,
  uint16_t y, uint16_t height
){
  // Wire up N parallel pixel_kernel_seq_range_FSM instances
  uint1_t thread_done[NUM_X_THREADS][NUM_Y_THREADS] = {0}; // no thread finished yet
  uint32_t i,j;
  uint1_t all_threads_done = 0;
  while(!all_threads_done)
  {
    pixels_kernel_seq_range_INPUT_t fsm_in[NUM_X_THREADS][NUM_Y_THREADS];
    pixels_kernel_seq_range_OUTPUT_t fsm_out[NUM_X_THREADS][NUM_Y_THREADS];
    all_threads_done = 1;
    
    uint16_t thread_x_size = width >> NUM_X_THREADS_LOG2;
    uint16_t thread_y_size = height >> NUM_Y_THREADS_LOG2;
    for (i = 0; i < NUM_X_THREADS; i+=1)
    {
      for (j = 0; j < NUM_Y_THREADS; j+=1)
      {
        if(!thread_done[i][j])
        {
          fsm_in[i][j].input_valid = 1;
          fsm_in[i][j].output_ready = 1;
          fsm_in[i][j].args = args;
          fsm_in[i][j].x_start = (thread_x_size*i) + x;
          fsm_in[i][j].x_end = fsm_in[i][j].x_start + thread_x_size - 1;
          fsm_in[i][j].y_start = (thread_y_size*j) + y;
          fsm_in[i][j].y_end = fsm_in[i][j].y_start + thread_y_size - 1;
          fsm_out[i][j] = pixels_kernel_seq_range_FSM(fsm_in[i][j]);
          thread_done[i][j] = fsm_out[i][j].output_valid;
        }
        all_threads_done &= thread_done[i][j];
      }
    }
    __clk();
  }
}

render_demo_kernel can then simply run in a loop, aiming for the highest frame rate possible.

void main()
{
  kernel_args_t args;
  ...
  while(1)
  {
    // Render entire frame
    render_demo_kernel(args, 0, FRAME_WIDTH, 0, FRAME_HEIGHT);
  }
}

The actual graphics_demo.c main() does some extra DDR initialization, is slowed down so the test pattern renders gradually, and toggles the dual frame buffer's 'which buffer is the read buffer' select signal after each render_demo_kernel iteration: frame_buffer_read_port_sel = !frame_buffer_read_port_sel;.
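A minimal sketch of that structure (omitting the initialization and slow-down details):

while(1)
{
  // Render the entire frame into the 'write' buffer
  render_demo_kernel(args, 0, FRAME_WIDTH, 0, FRAME_HEIGHT);
  // Swap which buffer is read for display and which is rendered into
  frame_buffer_read_port_sel = !frame_buffer_read_port_sel;
}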

Pipelines as Shared Resource

The above graphics demo uses an AXI RAM frame buffer as the resource shared on a bus.

Another common use case is having an automatically pipelined function as the shared resource. shared_resource_bus_pipeline.h is a header-as-macro helper for declaring a pipeline instance connected to multiple host state machines via a shared resource bus.

// Example declaration using helper header-as-macro
#define SHARED_RESOURCE_BUS_PIPELINE_NAME         name
#define SHARED_RESOURCE_BUS_PIPELINE_OUT_TYPE     output_t
#define SHARED_RESOURCE_BUS_PIPELINE_FUNC         the_func_to_pipeline
#define SHARED_RESOURCE_BUS_PIPELINE_IN_TYPE      input_t
#define SHARED_RESOURCE_BUS_PIPELINE_HOST_THREADS NUM_THREADS
#define SHARED_RESOURCE_BUS_PIPELINE_HOST_CLK_MHZ HOST_CLK_MHZ
#define SHARED_RESOURCE_BUS_PIPELINE_DEV_CLK_MHZ  DEV_CLK_MHZ
#include "shared_resource_bus_pipeline.h"

In the above example a function output_t the_func_to_pipeline(input_t) is made into a pipeline instance used like output_t name(input_t) from NUM_THREADS derived FSM host threads (running at HOST_CLK_MHZ). The function is automatically pipelined to meet the target DEV_CLK_MHZ operating frequency.
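From the host side the pipeline then looks like an ordinary blocking function call, along these lines (names match the example declaration above):

// Inside one of the NUM_THREADS derived FSM host threads
input_t args;
// ... prepare args ...
output_t result = name(args); // issues a request to the shared pipeline, waits for the result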

Mandelbrot Demo

TODO

  • describe shared res bus devices:
    • screen to complex
    • mandelbrot iters (see the sketch after this list)
    • iter count to color
    • frame buffer
  • next state function
    • signal to compute next state each time frame rendered
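Until that write-up is done, here is a sketch of the classic escape-time iteration that a 'mandelbrot iters' device computes (illustrative only; the demo's actual types, iteration limit, and pipelining may differ):

// Count iterations until z = z^2 + c escapes |z| > 2 (or the limit is hit)
uint32_t mandelbrot_iters(float cr, float ci)
{
  float zr = 0.0;
  float zi = 0.0;
  uint32_t n = 0;
  while(n < MAX_ITER) // MAX_ITER is a hypothetical iteration limit
  {
    if(((zr*zr) + (zi*zi)) > 4.0) break; // escaped: |z|^2 > 4
    float zr_next = (zr*zr) - (zi*zi) + cr; // z = z^2 + c
    zi = (2.0*zr*zi) + ci;
    zr = zr_next;
    n += 1;
  }
  return n;
}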

Simulation

Scaling

resources fmax etc

TODO 'cpu style single floating point unit?'


Game of Life Demo

Using the multi-threaded dual frame buffer graphics demo setup discussed above, the final specifics for a Game of Life demo are ready to assemble:

The per-pixel kernel function implementing Game of Life runs the familiar alive-neighbor counting algorithm to compute each cell's next alive/dead state:

// Func run for every n_pixels_t chunk
void pixels_buffer_kernel(uint16_t x_buffer_index, uint16_t y)
{
  // Read the pixels from the 'read' frame buffer
  n_pixels_t pixels = dual_frame_buf_read(x_buffer_index, y);

  // Run Game of Life kernel for each pixel
  uint32_t i;
  uint16_t x = x_buffer_index << RAM_PIXEL_BUFFER_SIZE_LOG2;
  for (i = 0; i < RAM_PIXEL_BUFFER_SIZE; i+=1)
  { 
    pixels.data[i] = cell_next_state(pixels.data[i], x+i, y);
  }  
  
  // Write pixels back to the 'write' frame buffer 
  dual_frame_buf_write(x_buffer_index, y, pixels);
}
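The cell_next_state function called above implements the standard Game of Life rule; a minimal sketch, assuming a hypothetical count_live_neighbors helper that reads the eight surrounding cells from the 'read' frame buffer:

// Sketch: Conway's rule for one cell (helper and types are illustrative)
uint1_t cell_next_state(uint1_t alive, uint16_t x, uint16_t y)
{
  uint8_t neighbors = count_live_neighbors(x, y); // hypothetical helper
  // Live cell survives with 2 or 3 neighbors; dead cell is born with exactly 3
  uint1_t next = alive ? ((neighbors==2)|(neighbors==3)) : (neighbors==3);
  return next;
}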

Conclusion

Using these shared resource buses, it's possible to picture even more complex host threads and computation devices. For instance, a long pixel rendering pipeline like the one used in Sphery vs. Shapes could be adapted into a device resource shared among many rendering threads.

Generally, the functionality in shared_resource_bus.h will continue to be improved and made easier to adapt to more design situations.

Please reach out if interested in giving anything a try or making improvements, happy to help! -Julian