
Old Shared Resource Bus Graphics Demo

Dual Frame Buffer

The graphics_demo.c file is an example exercising two frame buffer devices as shared bus resources from shared_dual_frame_buffer.c.

For example, the following state machine reads from the shared resource frame_buf0_shared_bus or frame_buf1_shared_bus based on a select variable:

uint1_t frame_buffer_read_port_sel;
n_pixels_t dual_frame_buf_read(uint16_t x_buffer_index, uint16_t y)
{
  uint32_t addr = pos_to_addr(x_buffer_index, y);
  n_pixels_t resp;
  if(frame_buffer_read_port_sel){
    resp = frame_buf1_shared_bus_read(addr);
  }else{
    resp = frame_buf0_shared_bus_read(addr);
  }
  return resp;
}

One of the host threads using the frame buffers is the always-reading display logic that pushes pixels out the VGA port.

void host_vga_reader()
{
  vga_pos_t vga_pos;
  while(1)
  {
    // Read the pixels at x,y pos
    uint16_t x_buffer_index = vga_pos.x >> RAM_PIXEL_BUFFER_SIZE_LOG2;
    n_pixels_t pixels = dual_frame_buf_read(x_buffer_index, vga_pos.y);
    
    // Write it into async fifo feeding vga pmod for display
    pmod_async_fifo_write(pixels);

    // Execute a cycle of vga timing to get x,y and increment for next time
    vga_pos = vga_frame_pos_increment(vga_pos, RAM_PIXEL_BUFFER_SIZE);
  }
}

Threads + Kernel

RAM Access Width

The frame buffer from frame_buffer.c stores RAM_PIXEL_BUFFER_SIZE pixels at each RAM address. This is done by defining a wrapper 'chunk of n pixels' struct.

// Must be divisor of FRAME_WIDTH across x direction
typedef struct n_pixels_t{
  uint1_t data[RAM_PIXEL_BUFFER_SIZE];
}n_pixels_t;
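
The pos_to_addr helper used by dual_frame_buf_read maps an (x_buffer_index, y) position to a RAM address under this one-chunk-per-address layout. The real function lives in frame_buffer.c; a minimal sketch, assuming a simple row-major ordering of chunks, might look like:

// Hypothetical sketch only: one n_pixels_t chunk per address, rows stored back to back
uint32_t pos_to_addr(uint16_t x_buffer_index, uint16_t y)
{
  // Each row of the frame occupies FRAME_WIDTH/RAM_PIXEL_BUFFER_SIZE addresses
  uint32_t chunks_per_row = FRAME_WIDTH >> RAM_PIXEL_BUFFER_SIZE_LOG2;
  return (y * chunks_per_row) + x_buffer_index;
}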

Computation Kernel

The pixels_buffer_kernel function reads an n_pixels_t worth of pixels, runs some kernel function on each pixel sequentially, and then writes the resulting group of pixel values back.

void pixels_buffer_kernel(uint16_t x_buffer_index, uint16_t y)
{
  // Read the pixels from the 'read' frame buffer
  n_pixels_t pixels = dual_frame_buf_read(x_buffer_index, y);

  // Run kernel for each pixel
  uint32_t i;
  uint16_t x = x_buffer_index << RAM_PIXEL_BUFFER_SIZE_LOG2;
  for (i = 0; i < RAM_PIXEL_BUFFER_SIZE; i+=1)
  { 
    pixels.data[i] = some_kernel_func(pixels.data[i], x+i, y);
  }  
  
  // Write pixels back to the 'write' frame buffer 
  dual_frame_buf_write(x_buffer_index, y, pixels);
}
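
The dual_frame_buf_write counterpart is not shown above. Conceptually it mirrors dual_frame_buf_read but targets the frame buffer that is not currently selected for reading, so the VGA reader keeps displaying the last completed frame while the next one renders. A sketch, assuming write helpers analogous to the *_shared_bus_read functions (the actual implementation lives alongside dual_frame_buf_read in the demo sources):

// Sketch: write to the opposite buffer from the one being read/displayed
void dual_frame_buf_write(uint16_t x_buffer_index, uint16_t y, n_pixels_t pixels)
{
  uint32_t addr = pos_to_addr(x_buffer_index, y);
  if(frame_buffer_read_port_sel){
    // Reads come from buf1, so render into buf0
    frame_buf0_shared_bus_write(addr, pixels);
  }else{
    // Reads come from buf0, so render into buf1
    frame_buf1_shared_bus_write(addr, pixels);
  }
}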

The pixels_kernel_seq_range function iterates over a range of the frame area, executing pixels_buffer_kernel for each set of pixels. The frame area is defined by start and end x and y positions.

void pixels_kernel_seq_range(
  uint16_t x_start, uint16_t x_end, 
  uint16_t y_start, uint16_t y_end)
{
  uint16_t x_buffer_index_start = x_start >> RAM_PIXEL_BUFFER_SIZE_LOG2;
  uint16_t x_buffer_index_end = x_end >> RAM_PIXEL_BUFFER_SIZE_LOG2;
  uint16_t x_buffer_index;
  uint16_t y;
  for(y=y_start; y<=y_end; y+=1)
  {
    for(x_buffer_index=x_buffer_index_start; x_buffer_index<=x_buffer_index_end; x_buffer_index+=1)
    {
      pixels_buffer_kernel(x_buffer_index, y);
    }
  }
}

Multiple Threads

Multiple host threads can read and write the frame buffers, each executing its own copy of pixels_kernel_seq_range. This is accomplished by manually instantiating multiple derived-FSM pixels_kernel_seq_range_FSM modules inside a function called render_frame. The NUM_TOTAL_THREADS = (NUM_X_THREADS*NUM_Y_THREADS) copies of pixels_kernel_seq_range all run in parallel, splitting FRAME_WIDTH across NUM_X_THREADS threads and FRAME_HEIGHT across NUM_Y_THREADS threads.

// Module that runs pixel_kernel for every pixel
// by instantiating multiple simultaneous 'threads' of pixel_kernel_seq_range
void render_frame()
{
  // Wire up N parallel pixel_kernel_seq_range_FSM instances
  uint1_t thread_done[NUM_X_THREADS][NUM_Y_THREADS];
  uint32_t i,j;
  uint1_t all_threads_done;
  while(!all_threads_done)
  {
    pixels_kernel_seq_range_INPUT_t fsm_in[NUM_X_THREADS][NUM_Y_THREADS];
    pixels_kernel_seq_range_OUTPUT_t fsm_out[NUM_X_THREADS][NUM_Y_THREADS];
    all_threads_done = 1;
    
    uint16_t THREAD_X_SIZE = FRAME_WIDTH / NUM_X_THREADS;
    uint16_t THREAD_Y_SIZE = FRAME_HEIGHT / NUM_Y_THREADS;
    for (i = 0; i < NUM_X_THREADS; i+=1)
    {
      for (j = 0; j < NUM_Y_THREADS; j+=1)
      {
        if(!thread_done[i][j])
        {
          fsm_in[i][j].input_valid = 1;
          fsm_in[i][j].output_ready = 1;
          fsm_in[i][j].x_start = THREAD_X_SIZE*i;
          fsm_in[i][j].x_end = (THREAD_X_SIZE*(i+1))-1;
          fsm_in[i][j].y_start = THREAD_Y_SIZE*j;
          fsm_in[i][j].y_end = (THREAD_Y_SIZE*(j+1))-1;
          fsm_out[i][j] = pixels_kernel_seq_range_FSM(fsm_in[i][j]);
          thread_done[i][j] = fsm_out[i][j].output_valid;
        }
        all_threads_done &= thread_done[i][j];
      }
    }
    __clk();
  }
  // Final step in rendering frame is switching to read from newly rendered frame buffer
  frame_buffer_read_port_sel = !frame_buffer_read_port_sel;
}

render_frame is then simply run in a loop, trying for the fastest frames per second possible.

void main()
{
  while(1)
  {
    render_frame();
  }
}

Game of Life Demo

Using the multi-threaded dual frame buffer graphics demo setup discussed above, the final specifics for a Game of Life demo are ready to assemble:

The per-pixel kernel function implementing Game of Life runs the familiar alive neighbor cell counting algorithm to compute the cell's next alive/dead state:

// Func run for every n_pixels_t chunk
void pixels_buffer_kernel(uint16_t x_buffer_index, uint16_t y)
{
  // Read the pixels from the 'read' frame buffer
  n_pixels_t pixels = dual_frame_buf_read(x_buffer_index, y);

  // Run Game of Life kernel for each pixel
  uint32_t i;
  uint16_t x = x_buffer_index << RAM_PIXEL_BUFFER_SIZE_LOG2;
  for (i = 0; i < RAM_PIXEL_BUFFER_SIZE; i+=1)
  { 
    pixels.data[i] = cell_next_state(pixels.data[i], x+i, y);
  }  
  
  // Write pixels back to the 'write' frame buffer 
  dual_frame_buf_write(x_buffer_index, y, pixels);
}

Working N Pixels at a Time

Memory is accessed RAM_PIXEL_BUFFER_SIZE pixels/cells at a time. However, simple implementations of Game of Life typically have individually addressable pixels/cells. To accommodate this, a wrapper pixel_buf_read function is used to read single pixels.

// Frame buffer reads N pixels at a time
// ~Convert that into function calls reading 1 pixel/cell at a time
// VERY INEFFICIENT, reading N pixels to return just 1 for now...
uint1_t pixel_buf_read(uint32_t x, uint32_t y)
{
  // Read the pixels from the 'read' frame buffer
  uint16_t x_buffer_index = x >> RAM_PIXEL_BUFFER_SIZE_LOG2;
  n_pixels_t pixels = dual_frame_buf_read(x_buffer_index, y);
  // Select the single pixel offset of interest (bottom bits of x)
  ram_pixel_offset_t x_offset = x;
  return pixels.data[x_offset];
}
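
The cell_next_state function called from the kernel above is not shown on this page. A minimal sketch built on pixel_buf_read is given below, assuming out-of-frame neighbors simply count as dead (the demo's actual edge handling may differ):

// Sketch of the per-cell Game of Life kernel using single-pixel reads
uint1_t cell_next_state(uint1_t alive, uint16_t x, uint16_t y)
{
  // Count live neighbors in the 3x3 window around (x,y)
  uint8_t live_neighbors = 0;
  int32_t dx, dy;
  for(dy = -1; dy <= 1; dy += 1)
  {
    for(dx = -1; dx <= 1; dx += 1)
    {
      // Skip the center cell itself
      if(!((dx==0) & (dy==0)))
      {
        int32_t nx = x + dx;
        int32_t ny = y + dy;
        // Treat out-of-frame neighbors as dead (no wrap-around in this sketch)
        if((nx >= 0) & (nx < FRAME_WIDTH) & (ny >= 0) & (ny < FRAME_HEIGHT))
        {
          live_neighbors += pixel_buf_read(nx, ny);
        }
      }
    }
  }
  // Standard Game of Life rules
  uint1_t next_alive = alive;
  if(alive & ((live_neighbors < 2) | (live_neighbors > 3))){
    next_alive = 0; // Dies from under/overpopulation
  }else if(!alive & (live_neighbors == 3)){
    next_alive = 1; // Becomes alive via reproduction
  }
  return next_alive;
}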

Results and Improvements

Initially, with just one thread rendering the entire screen, the frame rate is a slow ~0.5 FPS:

// ~0.5 FPS
#define NUM_X_THREADS 1 
#define NUM_Y_THREADS 1
#define HOST_CLK_MHZ 25.0 // Threads clock
#define DEV_CLK_MHZ 100.0 // Frame buffers clock
#define RAM_PIXEL_BUFFER_SIZE 16 // Frame buffer RAM width in pixels
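
For a sense of scale, assuming a standard 640x480 VGA frame (307,200 cells), ~0.5 FPS works out to roughly 150,000 cells per second, or on the order of 160 host clock cycles per cell at 25MHz; a hint at how much time goes to the multi-cycle frame buffer reads and derived FSM overhead addressed below.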

Multiple Threads

The first and easiest way of scaling up this design for higher FPS is to use more threads to tile the screen rendering.

// 1.1 FPS
// 2 threads
#define NUM_X_THREADS 2
#define NUM_Y_THREADS 1
// 2.3 FPS
// 4 threads
#define NUM_X_THREADS 2
#define NUM_Y_THREADS 2
// 4.5 FPS
// 8 threads
#define NUM_X_THREADS 4
#define NUM_Y_THREADS 2

At this point my personal FPGA is at about 50% LUT resource usage. These derived finite state machine threads have lots of room for optimizations and are what generally limit expansion to more threads at this time.

Cached Reads

As described above, the current 'read a single cell' function is very inefficient: it reads RAM_PIXEL_BUFFER_SIZE pixels and selects just one to return.

uint1_t pixel_buf_read(uint32_t x, uint32_t y);

However, when computing the Game of Life next-state kernel function, the same section of RAM_PIXEL_BUFFER_SIZE pixels is read multiple times (while counting living neighbor cells).

Simple 1 Entry Cache

The simplest way to avoid repeated reads is to keep around the last read's result and re-use it if requested again:

// 8.0 FPS
// 8 threads, 1 cached read
uint1_t pixel_buf_read(uint32_t x, uint32_t y)
{
  // Cache registers
  static uint16_t cache_x_buffer_index = FRAME_WIDTH; // invalid init
  static uint16_t cache_y = FRAME_HEIGHT; // invalid init
  static n_pixels_t cache_pixels;
  // Read the pixels from the 'read' frame buffer or from cache
  n_pixels_t pixels;
  uint16_t x_buffer_index = x >> RAM_PIXEL_BUFFER_SIZE_LOG2;
  uint1_t cache_match = (x_buffer_index==cache_x_buffer_index) & (y==cache_y);
  if(cache_match)
  {
    // Use cache
    pixels = cache_pixels;
  }
  else
  {
    // Read RAM and update cache
    pixels = dual_frame_buf_read(x_buffer_index, y);
    cache_x_buffer_index = x_buffer_index;
    cache_y = y;
    cache_pixels = pixels;
  }
  // Select the single pixel offset of interest (bottom bits of x)
  ram_pixel_offset_t x_offset = x;
  return pixels.data[x_offset];
}

This increases rendering to 8.0 FPS.

3 Line Cache

Game of Life repeatedly reads in a 3x3 grid around each cell. RAM_PIXEL_BUFFER_SIZE is typically >3 so one read can capture the entire x direction for several cells, but multiple reads are needed for the three y direction lines:

// 13.9 FPS
// 3 'y' lines of reads cached
uint1_t pixel_buf_read(uint32_t x, uint32_t y)
{
  // Cache registers
  static uint16_t cache_x_buffer_index = FRAME_WIDTH;
  static uint16_t cache_y[3] = {FRAME_HEIGHT, FRAME_HEIGHT, FRAME_HEIGHT};
  static n_pixels_t cache_pixels[3];
  // Read the pixels from the 'read' frame buffer or from cache
  n_pixels_t pixels;
  uint16_t x_buffer_index = x >> RAM_PIXEL_BUFFER_SIZE_LOG2;
  // Check cache for match (only one will match)
  uint1_t cache_match = 0;
  uint8_t cache_sel; // Which of 3 cache lines
  uint32_t i;
  for(i=0; i<3; i+=1)
  {
    uint1_t match_i = (x_buffer_index==cache_x_buffer_index) & (y==cache_y[i]);
    cache_match |= match_i;
    if(match_i){
      cache_sel = i;
    }
  } 
  if(cache_match)
  {
    pixels = cache_pixels[cache_sel];
  }
  else
  {
    // Read RAM and update cache
    pixels = dual_frame_buf_read(x_buffer_index, y);
    // If got a new x pos to read then default clear/invalidate entire cache
    if(x_buffer_index != cache_x_buffer_index)
    {
      ARRAY_SET(cache_y, FRAME_HEIGHT, 3)
    }
    cache_x_buffer_index = x_buffer_index;
    // Least recently used style shift out cache entries
    // to make room for keeping new one most recent at [0]
    ARRAY_1SHIFT_INTO_BOTTOM(cache_y, 3, y)
    ARRAY_1SHIFT_INTO_BOTTOM(cache_pixels, 3, pixels)
  }
  // Select the single pixel offset of interest (bottom bits of x)
  ram_pixel_offset_t x_offset = x;
  return pixels.data[x_offset];
}

Rendering speed is increased to 13.9 FPS.

Wider RAM Data Bus

In the above tests RAM_PIXEL_BUFFER_SIZE=16, that is, a group of N=16 pixels is stored at each RAM address. Increasing the width of the data bus (while keeping clock rates the same) results in more available memory bandwidth, allowing more pixels per second to be written to/read from RAM.
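
As a rough upper bound, ignoring arbitration and handshaking overhead, a 16-pixel-wide port at the 100MHz device clock could move at most 100M x 16 = 1.6 Gpixels/s; doubling the width to 32 pixels doubles that ceiling to 3.2 Gpixels/s at the same clock.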

// 16.1 FPS
#define RAM_PIXEL_BUFFER_SIZE 32

If the width is made very wide, e.g. RAM_PIXEL_BUFFER_SIZE=FRAME_WIDTH, then the RAM essentially becomes a line buffer holding FRAME_HEIGHT lines (the entire screen).

However, especially given the read data caching described above, this design is not currently memory bandwidth limited. Increasing the RAM data width shows diminishing returns, e.g. a width of 64 pixels reaches just 17 FPS while doubling the caching resources.

It is actually of greater benefit to save resources by using the original 16-pixel-wide RAM data, as this just barely allows for another doubling of the number of threads:

// ~30 FPS, 16 threads
#define RAM_PIXEL_BUFFER_SIZE 16
#define NUM_X_THREADS 4
#define NUM_Y_THREADS 4

Faster Clock Rates

Increasing the clock rates and eventually 'overclocking' the design is the final easy axis to explore for increasing rendering speed.

// ~30 FPS, 16 threads
#define HOST_CLK_MHZ 25.0 // Threads clock
#define DEV_CLK_MHZ 100.0 // Frame buffers clock
#define RAM_PIXEL_BUFFER_SIZE 16
#define NUM_X_THREADS 4
#define NUM_Y_THREADS 4

The host and device clock rates above were chosen because they easily meet timing. However, there is room to push the design further and see where timing fails and where visible glitches occur.

The host clock domain of user threads is currently frequency limited by excess logic produced from derived finite state machines. The device clock domain containing the frame buffer RAM and arbitration logic is currently limited by the arbitration implementation built into shared_resource_bus.h.

The design begins to fail timing with just a slightly higher host clock of 30MHz:

// 32 FPS, 16 threads
#define HOST_CLK_MHZ 30.0 
#define DEV_CLK_MHZ 100.0

However, in hardware testing the design runs with no visible issues with a host clock as fast as 45MHz:

// 46 FPS, 16 threads
#define HOST_CLK_MHZ 45 
#define DEV_CLK_MHZ 100.0

Any faster host clock results in a design that fails to work: it shows only a still image, with execution seemingly stalled on the very first frame.

Finally, the device clock running the frame buffers can be increased as well:

// 48 FPS, 16 threads
#define HOST_CLK_MHZ 45 
#define DEV_CLK_MHZ 150.0

This frame buffer RAM clock has been seen to work as high as ~250MHz. However, this is unnecessary since the design is not memory bandwidth limited and the returns quickly diminish: only a 2 FPS gain for an extra 50MHz over the original 100MHz. Beyond a 250MHz device clock the design fails to display any image on the monitor.

Conclusion

Using these shared resource buses, it's possible to picture even more complex host threads and computation devices. For instance, a long pixel rendering pipeline like the one used in Sphery vs. Shapes could be adapted to be a device resource shared among many rendering threads.

Generally, the functionality in shared_resource_bus.h will continue to be improved and made easier to adapt to more design situations.

Please reach out if interested in giving anything a try or making improvements, happy to help! -Julian
