Shared Resource Bus Old Graphics Demo
The graphics_demo.c file is an example exercising two frame buffer devices as shared bus resources from shared_dual_frame_buffer.c.
For example, here is the state machine that reads from shared resource frame_buf0_shared_bus or frame_buf1_shared_bus based on a select variable:
uint1_t frame_buffer_read_port_sel;
n_pixels_t dual_frame_buf_read(uint16_t x_buffer_index, uint16_t y)
{
  uint32_t addr = pos_to_addr(x_buffer_index, y);
  n_pixels_t resp;
  if(frame_buffer_read_port_sel){
    resp = frame_buf1_shared_bus_read(addr);
  }else{
    resp = frame_buf0_shared_bus_read(addr);
  }
  return resp;
}
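The pos_to_addr helper used above is not shown in this excerpt; conceptually it maps an (x chunk index, y) position to a flat RAM address. A minimal sketch, assuming a simple row-major layout of n_pixels_t chunks (the real helper lives in frame_buffer.c and may differ):
// Sketch only (assumed row-major chunk layout, not the exact frame_buffer.c code):
// each line of the frame holds FRAME_WIDTH/RAM_PIXEL_BUFFER_SIZE chunks
uint32_t pos_to_addr(uint16_t x_buffer_index, uint16_t y)
{
  uint32_t chunks_per_line = FRAME_WIDTH >> RAM_PIXEL_BUFFER_SIZE_LOG2;
  return (y * chunks_per_line) + x_buffer_index;
}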
One of the host threads using the frame buffers is always-reading logic that pushes pixels out the VGA port for display:
void host_vga_reader()
{
  vga_pos_t vga_pos;
  while(1)
  {
    // Read the pixels at x,y pos
    uint16_t x_buffer_index = vga_pos.x >> RAM_PIXEL_BUFFER_SIZE_LOG2;
    n_pixels_t pixels = dual_frame_buf_read(x_buffer_index, vga_pos.y);
    // Write it into async fifo feeding vga pmod for display
    pmod_async_fifo_write(pixels);
    // Execute a cycle of vga timing to get x,y and increment for next time
    vga_pos = vga_frame_pos_increment(vga_pos, RAM_PIXEL_BUFFER_SIZE);
  }
}
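The vga_frame_pos_increment helper comes from the VGA pmod code and handles full VGA timing. As a rough mental model only (ignoring sync/blanking, and not the real implementation), it advances the read position by one chunk of pixels per call:
// Simplified conceptual sketch, NOT the real VGA timing helper:
// advance x by 'amount' pixels, wrapping to the next line and next frame
vga_pos_t vga_frame_pos_increment(vga_pos_t pos, uint16_t amount)
{
  pos.x += amount;
  if(pos.x >= FRAME_WIDTH)
  {
    pos.x = 0;
    pos.y += 1;
    if(pos.y >= FRAME_HEIGHT)
    {
      pos.y = 0;
    }
  }
  return pos;
}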
The frame buffer from frame_buffer.c stores RAM_PIXEL_BUFFER_SIZE pixels at each RAM address. This is done by defining a wrapper 'chunk of N pixels' struct:
// Must be divisor of FRAME_WIDTH across x direction
typedef struct n_pixels_t{
  uint1_t data[RAM_PIXEL_BUFFER_SIZE];
}n_pixels_t;
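The related constants and types used throughout the rest of the code follow from this chunk size; a sketch of how they are assumed to be defined for RAM_PIXEL_BUFFER_SIZE=16 (the exact definitions are in the demo sources):
// Assumed definitions for a 16-pixel chunk (sketch, not the exact source):
#define RAM_PIXEL_BUFFER_SIZE 16
#define RAM_PIXEL_BUFFER_SIZE_LOG2 4
// Offset of a single pixel within a chunk (the bottom LOG2 bits of x)
typedef uint4_t ram_pixel_offset_t;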
The pixels_buffer_kernel function reads an n_pixels_t worth of pixels, runs some kernel function on each pixel sequentially, and then writes the resulting group of pixel values back:
void pixels_buffer_kernel(uint16_t x_buffer_index, uint16_t y)
{
  // Read the pixels from the 'read' frame buffer
  n_pixels_t pixels = dual_frame_buf_read(x_buffer_index, y);
  // Run kernel for each pixel
  uint32_t i;
  uint16_t x = x_buffer_index << RAM_PIXEL_BUFFER_SIZE_LOG2;
  for (i = 0; i < RAM_PIXEL_BUFFER_SIZE; i+=1)
  {
    pixels.data[i] = some_kernel_func(pixels.data[i], x+i, y);
  }
  // Write pixels back to the 'write' frame buffer
  dual_frame_buf_write(x_buffer_index, y, pixels);
}
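The dual_frame_buf_write function is not shown here; it mirrors dual_frame_buf_read but targets the opposite frame buffer from the one currently selected for reading, giving the ping-pong dual buffering. A sketch consistent with that behavior (the _write helper names are assumed to parallel the _read ones shown earlier):
// Sketch: writes go to whichever buffer is NOT selected for reading/display
// (shared bus _write helper names assumed to mirror the _read versions)
void dual_frame_buf_write(uint16_t x_buffer_index, uint16_t y, n_pixels_t pixels)
{
  uint32_t addr = pos_to_addr(x_buffer_index, y);
  if(frame_buffer_read_port_sel){
    frame_buf0_shared_bus_write(addr, pixels);
  }else{
    frame_buf1_shared_bus_write(addr, pixels);
  }
}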
The pixels_kernel_seq_range function iterates over a range of the frame area, executing pixels_buffer_kernel for each set of pixels. The frame area is defined by start and end x and y positions:
void pixels_kernel_seq_range(
  uint16_t x_start, uint16_t x_end,
  uint16_t y_start, uint16_t y_end)
{
  uint16_t x_buffer_index_start = x_start >> RAM_PIXEL_BUFFER_SIZE_LOG2;
  uint16_t x_buffer_index_end = x_end >> RAM_PIXEL_BUFFER_SIZE_LOG2;
  uint16_t x_buffer_index;
  uint16_t y;
  for(y=y_start; y<=y_end; y+=1)
  {
    for(x_buffer_index=x_buffer_index_start; x_buffer_index<=x_buffer_index_end; x_buffer_index+=1)
    {
      pixels_buffer_kernel(x_buffer_index, y);
    }
  }
}
Multiple host threads can read and write the frame buffers, each executing its own copy of pixels_kernel_seq_range. This is accomplished by manually instantiating multiple derived FSM thread pixels_kernel_seq_range_FSM modules inside a function called render_frame. The NUM_TOTAL_THREADS = (NUM_X_THREADS*NUM_Y_THREADS) copies of pixels_kernel_seq_range all run in parallel, splitting FRAME_WIDTH across NUM_X_THREADS threads and FRAME_HEIGHT across NUM_Y_THREADS threads.
// Module that runs pixel_kernel for every pixel
// by instantiating multiple simultaneous 'threads' of pixel_kernel_seq_range
void render_frame()
{
  // Wire up N parallel pixel_kernel_seq_range_FSM instances
  uint1_t thread_done[NUM_X_THREADS][NUM_Y_THREADS];
  uint32_t i,j;
  uint1_t all_threads_done;
  while(!all_threads_done)
  {
    pixels_kernel_seq_range_INPUT_t fsm_in[NUM_X_THREADS][NUM_Y_THREADS];
    pixels_kernel_seq_range_OUTPUT_t fsm_out[NUM_X_THREADS][NUM_Y_THREADS];
    all_threads_done = 1;
    uint16_t THREAD_X_SIZE = FRAME_WIDTH / NUM_X_THREADS;
    uint16_t THREAD_Y_SIZE = FRAME_HEIGHT / NUM_Y_THREADS;
    for (i = 0; i < NUM_X_THREADS; i+=1)
    {
      for (j = 0; j < NUM_Y_THREADS; j+=1)
      {
        if(!thread_done[i][j])
        {
          fsm_in[i][j].input_valid = 1;
          fsm_in[i][j].output_ready = 1;
          fsm_in[i][j].x_start = THREAD_X_SIZE*i;
          fsm_in[i][j].x_end = (THREAD_X_SIZE*(i+1))-1;
          fsm_in[i][j].y_start = THREAD_Y_SIZE*j;
          fsm_in[i][j].y_end = (THREAD_Y_SIZE*(j+1))-1;
          fsm_out[i][j] = pixels_kernel_seq_range_FSM(fsm_in[i][j]);
          thread_done[i][j] = fsm_out[i][j].output_valid;
        }
        all_threads_done &= thread_done[i][j];
      }
    }
    __clk();
  }
  // Final step in rendering frame is switching to read from newly rendered frame buffer
  frame_buffer_read_port_sel = !frame_buffer_read_port_sel;
}
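For reference, the pixels_kernel_seq_range_INPUT_t/OUTPUT_t wrapper types are generated by the PipelineC tool for the derived FSM; conceptually they bundle the function arguments with valid/ready handshake flags. A sketch based only on the fields used above (the generated structs may contain additional handshake signals):
// Conceptual sketch of the tool-generated derived FSM wrapper types,
// based on the fields used in render_frame above:
typedef struct pixels_kernel_seq_range_INPUT_t{
  uint1_t input_valid;  // inputs below are valid, start the FSM
  uint1_t output_ready; // caller is ready to accept the FSM's output
  uint16_t x_start;
  uint16_t x_end;
  uint16_t y_start;
  uint16_t y_end;
}pixels_kernel_seq_range_INPUT_t;
typedef struct pixels_kernel_seq_range_OUTPUT_t{
  uint1_t output_valid; // FSM has finished this invocation
}pixels_kernel_seq_range_OUTPUT_t;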
render_frame is then simply run in a loop, trying for the fastest frames per second possible:
void main()
{
  while(1)
  {
    render_frame();
  }
}
Using the multi-threaded dual frame buffer graphics demo setup discussed above, the final specifics for a Game of Life demo are ready to assemble:
The per-pixel kernel function implementing Game of Life runs the familiar alive neighbor cell counting algorithm to compute the cell's next alive/dead state:
// Func run for every n_pixels_t chunk
void pixels_buffer_kernel(uint16_t x_buffer_index, uint16_t y)
{
  // Read the pixels from the 'read' frame buffer
  n_pixels_t pixels = dual_frame_buf_read(x_buffer_index, y);
  // Run Game of Life kernel for each pixel
  uint32_t i;
  uint16_t x = x_buffer_index << RAM_PIXEL_BUFFER_SIZE_LOG2;
  for (i = 0; i < RAM_PIXEL_BUFFER_SIZE; i+=1)
  {
    pixels.data[i] = cell_next_state(pixels.data[i], x+i, y);
  }
  // Write pixels back to the 'write' frame buffer
  dual_frame_buf_write(x_buffer_index, y, pixels);
}
Memory is accessed RAM_PIXEL_BUFFER_SIZE pixels/cells at a time. However, simple implementations of Game of Life typically have individually addressable pixels/cells. To accommodate this, a wrapper pixel_buf_read function is used to read single pixels:
// Frame buffer reads N pixels at a time
// ~Convert that into function calls reading 1 pixel/cell at a time
// VERY INEFFICIENT, reading N pixels to return just 1 for now...
uint1_t pixel_buf_read(uint32_t x, uint32_t y)
{
  // Read the pixels from the 'read' frame buffer
  uint16_t x_buffer_index = x >> RAM_PIXEL_BUFFER_SIZE_LOG2;
  n_pixels_t pixels = dual_frame_buf_read(x_buffer_index, y);
  // Select the single pixel offset of interest (bottom bits of x)
  ram_pixel_offset_t x_offset = x;
  return pixels.data[x_offset];
}
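For completeness, here is a minimal sketch of what the cell_next_state kernel might look like when built on top of pixel_buf_read, counting the eight neighbors and applying the standard Game of Life rules (edge handling here is an assumption; the demo sources may wrap or clamp differently):
// Sketch only: standard Game of Life next-state rule using the
// single-pixel read wrapper above. Edge handling is an assumption.
uint1_t cell_next_state(uint1_t alive_now, uint16_t x, uint16_t y)
{
  uint8_t live_neighbors = 0;
  int32_t dx, dy;
  for(dy = -1; dy <= 1; dy += 1)
  {
    for(dx = -1; dx <= 1; dx += 1)
    {
      if(!((dx==0) & (dy==0)))
      {
        int32_t nx = (int32_t)x + dx;
        int32_t ny = (int32_t)y + dy;
        if((nx >= 0) & (nx < FRAME_WIDTH) & (ny >= 0) & (ny < FRAME_HEIGHT))
        {
          live_neighbors += pixel_buf_read(nx, ny);
        }
      }
    }
  }
  // A live cell survives with 2 or 3 live neighbors,
  // a dead cell becomes alive with exactly 3
  uint1_t next_alive;
  if(alive_now)
  {
    next_alive = (live_neighbors == 2) | (live_neighbors == 3);
  }
  else
  {
    next_alive = (live_neighbors == 3);
  }
  return next_alive;
}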
Initially, with just one thread rendering the entire screen, the frame rate is a slow ~0.5 FPS:
// ~0.5 FPS
#define NUM_X_THREADS 1
#define NUM_Y_THREADS 1
#define HOST_CLK_MHZ 25.0 // Threads clock
#define DEV_CLK_MHZ 100.0 // Frame buffers clock
#define RAM_PIXEL_BUFFER_SIZE 16 // Frame buffer RAM width in pixels
The first and easiest way of scaling up this design for higher FPS is to use more threads to tile the screen rendering.
// 1.1 FPS
// 2 threads
#define NUM_X_THREADS 2
#define NUM_Y_THREADS 1
// 2.3 FPS
// 4 threads
#define NUM_X_THREADS 2
#define NUM_Y_THREADS 2
// 4.5 FPS
// 8 threads
#define NUM_X_THREADS 4
#define NUM_Y_THREADS 2
At this point my personal FPGA is at about 50% LUT resource usage. These derived finite state machine threads have lots of room for optimizations and are what generally limit expansion to more threads at this time.
As described above, the current 'read a single cell' function is very inefficient. It reads RAM_PIXEL_BUFFER_SIZE pixels and selects just one to return.
uint1_t pixel_buf_read(uint32_t x, uint32_t y);
However, in computing the Game of Life next-state kernel function, the same section of RAM_PIXEL_BUFFER_SIZE pixels is read multiple times (when counting living neighbor cells).
The simplest way to avoid repeated reads is to keep around the last read's result and reuse it if requested again:
// 8.0 FPS
// 8 threads, 1 cached read
uint1_t pixel_buf_read(uint32_t x, uint32_t y)
{
  // Cache registers
  static uint16_t cache_x_buffer_index = FRAME_WIDTH; // invalid init
  static uint16_t cache_y = FRAME_HEIGHT; // invalid init
  static n_pixels_t cache_pixels;
  // Read the pixels from the 'read' frame buffer or from cache
  n_pixels_t pixels;
  uint16_t x_buffer_index = x >> RAM_PIXEL_BUFFER_SIZE_LOG2;
  uint1_t cache_match = (x_buffer_index==cache_x_buffer_index) & (y==cache_y);
  if(cache_match)
  {
    // Use cache
    pixels = cache_pixels;
  }
  else
  {
    // Read RAM and update cache
    pixels = dual_frame_buf_read(x_buffer_index, y);
    cache_x_buffer_index = x_buffer_index;
    cache_y = y;
    cache_pixels = pixels;
  }
  // Select the single pixel offset of interest (bottom bits of x)
  ram_pixel_offset_t x_offset = x;
  return pixels.data[x_offset];
}
This increases rendering to 8.0 FPS.
Game of Life repeatedly reads a 3x3 grid around each cell. RAM_PIXEL_BUFFER_SIZE is typically >3, so one read can capture the entire x direction for several cells, but multiple reads are needed for the three y-direction lines:
// 13.9 FPS
// 3 'y' lines of reads cached
uint1_t pixel_buf_read(uint32_t x, uint32_t y)
{
  // Cache registers
  static uint16_t cache_x_buffer_index = FRAME_WIDTH;
  static uint16_t cache_y[3] = {FRAME_HEIGHT, FRAME_HEIGHT, FRAME_HEIGHT};
  static n_pixels_t cache_pixels[3];
  // Read the pixels from the 'read' frame buffer or from cache
  n_pixels_t pixels;
  uint16_t x_buffer_index = x >> RAM_PIXEL_BUFFER_SIZE_LOG2;
  // Check cache for match (only one will match)
  uint1_t cache_match = 0;
  uint8_t cache_sel; // Which of 3 cache lines
  uint32_t i;
  for(i=0; i<3; i+=1)
  {
    uint1_t match_i = (x_buffer_index==cache_x_buffer_index) & (y==cache_y[i]);
    cache_match |= match_i;
    if(match_i){
      cache_sel = i;
    }
  }
  if(cache_match)
  {
    pixels = cache_pixels[cache_sel];
  }
  else
  {
    // Read RAM and update cache
    pixels = dual_frame_buf_read(x_buffer_index, y);
    // If got a new x pos to read then default clear/invalidate entire cache
    if(x_buffer_index != cache_x_buffer_index)
    {
      ARRAY_SET(cache_y, FRAME_HEIGHT, 3)
    }
    cache_x_buffer_index = x_buffer_index;
    // Least recently used style shift out cache entries
    // to make room for keeping new one most recent at [0]
    ARRAY_1SHIFT_INTO_BOTTOM(cache_y, 3, y)
    ARRAY_1SHIFT_INTO_BOTTOM(cache_pixels, 3, pixels)
  }
  // Select the single pixel offset of interest (bottom bits of x)
  ram_pixel_offset_t x_offset = x;
  return pixels.data[x_offset];
}
Rendering speed is increased to 13.9 FPS.
In the above tests RAM_PIXEL_BUFFER_SIZE=16, that is, a group of N=16 pixels is stored at each RAM address. Increasing the width of the data bus (while keeping clock rates the same) results in more available memory bandwidth, making it possible to read and write more pixels per second from RAM:
// 16.1 FPS
#define RAM_PIXEL_BUFFER_SIZE 32
If the width is made very wide, ex. RAM_PIXEL_BUFFER_SIZE=FRAME_WIDTH, then the RAM essentially becomes a per-line buffer of FRAME_HEIGHT lines (the entire screen).
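As a purely hypothetical configuration (not one of the data points measured in this demo), that extreme would look like:
// Hypothetical extreme (untested here): one RAM address per full line,
// making the frame buffer FRAME_HEIGHT addresses of whole-line chunks
#define RAM_PIXEL_BUFFER_SIZE FRAME_WIDTH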
However, especially given the read data caching described above, this design is not currently memory bandwidth limited. Increasing the RAM data width shows diminishing returns, ex. 64 bits gets to just 17 FPS for double the caching resources.
It is actually of greater benefit to save resources by using the original 16-pixel-wide RAM data, as this just barely allows for another doubling of the number of threads:
// ~30 FPS, 16 threads
#define RAM_PIXEL_BUFFER_SIZE 16
#define NUM_X_THREADS 4
#define NUM_Y_THREADS 4
Increasing the clock rates and eventually 'overclocking' the design is the final easy axis to explore for increasing rendering speed.
// ~30 FPS, 16 threads
#define HOST_CLK_MHZ 25.0 // Threads clock
#define DEV_CLK_MHZ 100.0 // Frame buffers clock
#define RAM_PIXEL_BUFFER_SIZE 16
#define NUM_X_THREADS 4
#define NUM_Y_THREADS 4
The host and device clock domains were chosen to easily meet timing. However, there is room to push the design further and see where timing fails, and where visible glitches occur.
The host clock domain of user threads is currently frequency limited by excess logic produced from derived finite state machines. The device clock domain containing the frame buffer RAM and arbitration logic is currently limited by the arbitration implementation built into shared_resource_bus.h.
The design begins to fail timing with just a slightly higher host clock of 30MHz:
// 32 FPS, 16 threads
#define HOST_CLK_MHZ 30.0
#define DEV_CLK_MHZ 100.0
However, in hardware testing the design runs with no visible issues with a host clock as fast as 45MHz:
// 46 FPS, 16 threads
#define HOST_CLK_MHZ 45
#define DEV_CLK_MHZ 100.0
Any faster host clock results in a design that fails to work: it shows only a still image, with execution seemingly stalled on the very first frame.
Finally, the device clock running the frame buffers can be increased as well:
// 48 FPS, 16 threads
#define HOST_CLK_MHZ 45
#define DEV_CLK_MHZ 150.0
This frame buffer RAM clock has been seen to work as high as ~250MHz. However, this is unnecessary since the design is not memory bandwidth limited and the returns quickly diminish: only a 2 FPS gain for the extra 50MHz increase from the original 100MHz. Beyond a 250MHz device clock, the design fails to display any image on the monitor.
Using these shared resource buses it's possible to picture even more complex host threads and computation devices. For instance, a long pixel rendering pipeline like the one used in Sphery vs. Shapes could be adapted to be a device resource shared among many rendering threads.
Generally the functionality in shared_resource_bus.h will continue to be improved and made easier to adapt in more design situations.
Please reach out if interested in giving anything a try or making improvements, happy to help! -Julian