For now the new AXI DDR memory based graphics demo drawing Mandelbrot will live here, to be swapped out for the graphics demo currently on the shared resource bus page: Shared Resource Bus Old Graphics Demo.
This graphics demo differs from the old one (which is worth reading for more details). Instead of using on-chip block RAM, this new demo uses off-chip DDR memory for full color frame buffers. Additionally, this demo focuses on a more complex rendering computation that can benefit from PipelineC's auto-pipelining.
TODO GENERAL GRAPHICS DEMO DIAGRAM
The graphics_demo.c file is an example exercising a dual frame buffer as a shared bus resource from dual_frame_buffer.c. The demo slowly cycles through R,G,B color ranges, requiring for each pixel: a read from frame buffer RAM, minimal computation to update pixel color, and a write back to frame buffer RAM for display.
The frame buffer is configured to use a Xilinx AXI DDR controller, starting inside ddr_dual_frame_buffer.c. The basic shared resource bus setup for connecting to the Xilinx DDR memory controller AXI bus can be found in axi_xil_mem.c. In that file an instance of an axi_shared_bus_t shared resource bus (defined in axi_shared_bus.h), called axi_xil_mem, is declared using the shared_resource_bus_decl.h include-as-macro helper.
In addition to 'user' rendering threads, the frame buffer memory shared resource needs to be reading pixels at a rate that can meet the streaming requirement of the VGA resolution pixel clock timing for connecting a display.
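For a sense of scale: assuming a standard 640x480 @ 60 Hz VGA mode (~25.175 MHz pixel clock) and the 4-byte pixels used in the reader below, the display read stream must sustain roughly one pixel read every ~40 ns, on the order of 100 MB/s of read bandwidth.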
Unlike the old demo, in this demo ddr_dual_frame_buffer.c uses a separate 'read-only priority port' wire, axi_xil_rd_pri_port_mem_host_to_dev_wire, to simply wire a VGA counter to a dedicated read request side of the shared resource bus. Responses from the bus are the pixels that are written directly into the vga_pmod_async_pixels_fifo.c display stream.
MAIN_MHZ(host_vga_reader, XIL_MEM_MHZ)
void host_vga_reader()
{
  static uint1_t frame_buffer_read_port_sel_reg;
  // READ REQUEST SIDE
  // Increment VGA counters and do read for each position
  static vga_pos_t vga_pos;
  // Read and increment pos if room in fifos (cant be greedy since will 100% hog priority port)
  uint1_t fifo_ready;
  #pragma FEEDBACK fifo_ready
  // Read from the current read frame buffer addr
  uint32_t addr = pos_to_addr(vga_pos.x, vga_pos.y);
  axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.data.user.araddr = dual_ram_to_addr(frame_buffer_read_port_sel_reg, addr);
  axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.data.user.arlen = 1-1; // size=1 minus 1: 1 transfer cycle (non-burst)
  axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.data.user.arsize = 2; // 2^2=4 bytes per transfer
  axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.data.user.arburst = BURST_FIXED; // Not a burst, single fixed address per transfer
  axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.valid = fifo_ready;
  uint1_t do_increment = fifo_ready & axi_xil_rd_pri_port_mem_dev_to_host_wire.read.req_ready;
  vga_pos = vga_frame_pos_increment(vga_pos, do_increment);
  // READ RESPONSE SIDE
  // Get read data from the AXI RAM bus
  uint8_t data[4];
  uint1_t data_valid = 0;
  data = axi_xil_rd_pri_port_mem_dev_to_host_wire.read.data.burst.data_resp.user.rdata;
  data_valid = axi_xil_rd_pri_port_mem_dev_to_host_wire.read.data.valid;
  // Write pixel data into fifo
  pixel_t pixel;
  pixel.a = data[0];
  pixel.r = data[1];
  pixel.g = data[2];
  pixel.b = data[3];
  pixel_t pixels[1];
  pixels[0] = pixel;
  fifo_ready = pmod_async_fifo_write_logic(pixels, data_valid);
  axi_xil_rd_pri_port_mem_host_to_dev_wire.read.data_ready = fifo_ready;
  frame_buffer_read_port_sel_reg = frame_buffer_read_port_sel;
}
In graphics_demo.c the pixel_kernel function implements incrementing RGB channel values as a test pattern. The pixels_kernel_seq_range function iterates over a range of the frame area, executing pixel_kernel for each pixel. The frame area is defined by start and end x and y positions.
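The exact test-pattern kernel lives in graphics_demo.c; as a minimal illustrative sketch only (assuming nothing about the real kernel_args_t fields), an RGB-incrementing pixel_kernel could look like:
pixel_t pixel_kernel(kernel_args_t args, pixel_t pixel, uint16_t x, uint16_t y)
{
  // Illustrative only: slowly cycle colors by bumping each channel
  pixel.r += 1;
  pixel.g += 1;
  pixel.b += 1;
  return pixel;
}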
// Single 'thread' state machine running pixel_kernel "sequentially" across an x,y range
void pixels_kernel_seq_range(
  kernel_args_t args,
  uint16_t x_start, uint16_t x_end,
  uint16_t y_start, uint16_t y_end)
{
  uint16_t x;
  uint16_t y;
  for(y=y_start; y<=y_end; y+=TILE_FACTOR)
  {
    for(x=x_start; x<=x_end; x+=TILE_FACTOR)
    {
      if(args.do_clear){
        pixel_t pixel = {0};
        frame_buf_write(x, y, pixel);
      }else{
        // Read the pixel from the 'read' frame buffer
        pixel_t pixel = frame_buf_read(x, y);
        pixel = pixel_kernel(args, pixel, x, y);
        // Write pixel back to the 'write' frame buffer
        frame_buf_write(x, y, pixel);
      }
    }
  }
}
Multiple host threads can be reading and writing the frame buffers, each trying to execute its own sequential run of pixels_kernel_seq_range. This is accomplished by manually instantiating multiple derived FSM thread pixels_kernel_seq_range_FSM modules inside of a function called render_demo_kernel. The NUM_TOTAL_THREADS = (NUM_X_THREADS*NUM_Y_THREADS) copies of pixels_kernel_seq_range all run in parallel, splitting FRAME_WIDTH across NUM_X_THREADS threads and FRAME_HEIGHT across NUM_Y_THREADS threads.
// Module that runs pixel_kernel for every pixel
// by instantiating multiple simultaneous 'threads' of pixel_kernel_seq_range
void render_demo_kernel(
  kernel_args_t args,
  uint16_t x, uint16_t width,
  uint16_t y, uint16_t height
){
  // Wire up N parallel pixel_kernel_seq_range_FSM instances
  uint1_t thread_done[NUM_X_THREADS][NUM_Y_THREADS];
  uint32_t i,j;
  uint1_t all_threads_done;
  while(!all_threads_done)
  {
    pixels_kernel_seq_range_INPUT_t fsm_in[NUM_X_THREADS][NUM_Y_THREADS];
    pixels_kernel_seq_range_OUTPUT_t fsm_out[NUM_X_THREADS][NUM_Y_THREADS];
    all_threads_done = 1;
    uint16_t thread_x_size = width >> NUM_X_THREADS_LOG2;
    uint16_t thread_y_size = height >> NUM_Y_THREADS_LOG2;
    for (i = 0; i < NUM_X_THREADS; i+=1)
    {
      for (j = 0; j < NUM_Y_THREADS; j+=1)
      {
        if(!thread_done[i][j])
        {
          fsm_in[i][j].input_valid = 1;
          fsm_in[i][j].output_ready = 1;
          fsm_in[i][j].args = args;
          fsm_in[i][j].x_start = (thread_x_size*i) + x;
          fsm_in[i][j].x_end = fsm_in[i][j].x_start + thread_x_size - 1;
          fsm_in[i][j].y_start = (thread_y_size*j) + y;
          fsm_in[i][j].y_end = fsm_in[i][j].y_start + thread_y_size - 1;
          fsm_out[i][j] = pixels_kernel_seq_range_FSM(fsm_in[i][j]);
          thread_done[i][j] = fsm_out[i][j].output_valid;
        }
        all_threads_done &= thread_done[i][j];
      }
    }
    __clk();
  }
}
render_demo_kernel can then simply run in a loop, trying for the fastest frames per second possible.
void main()
{
  kernel_args_t args;
  ...
  while(1)
  {
    // Render entire frame
    render_demo_kernel(args, 0, FRAME_WIDTH, 0, FRAME_HEIGHT);
  }
}
The actual graphics_demo.c file main() does some extra DDR initialization, is slowed down to render the test pattern slowly, and manages the toggling of the dual frame buffer 'which is the read buffer' select signal after each render_demo_kernel iteration: frame_buffer_read_port_sel = !frame_buffer_read_port_sel;
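A rough sketch of where that toggle sits in the render loop (omitting the DDR initialization and the intentional slowdown):
// Sketch: inside the real main() loop
while(1)
{
  // Render the full frame into the current 'write' buffer
  render_demo_kernel(args, 0, FRAME_WIDTH, 0, FRAME_HEIGHT);
  // Swap which buffer is displayed ('read') vs. rendered into ('write')
  frame_buffer_read_port_sel = !frame_buffer_read_port_sel;
}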
The above graphics demo uses an AXI RAM frame buffer as the resource shared on a bus.
Another common use case is having an automatically pipelined function as the shared resource. shared_resource_bus_pipeline.h is a header-as-macro helper for declaring a pipeline instance connected to multiple host state machines via a shared resource bus.
// Example declaration using helper header-as-macro
#define SHARED_RESOURCE_BUS_PIPELINE_NAME name
#define SHARED_RESOURCE_BUS_PIPELINE_OUT_TYPE output_t
#define SHARED_RESOURCE_BUS_PIPELINE_FUNC the_func_to_pipeline
#define SHARED_RESOURCE_BUS_PIPELINE_IN_TYPE input_t
#define SHARED_RESOURCE_BUS_PIPELINE_HOST_THREADS NUM_THREADS
#define SHARED_RESOURCE_BUS_PIPELINE_HOST_CLK_MHZ HOST_CLK_MHZ
#define SHARED_RESOURCE_BUS_PIPELINE_DEV_CLK_MHZ DEV_CLK_MHZ
#include "shared_resource_bus_pipeline.h"
In the above example a function output_t the_func_to_pipeline(input_t) is made into a pipeline instance used like output_t name(input_t) from NUM_THREADS derived FSM host threads (running at HOST_CLK_MHZ). The pipeline is automatically pipelined to meet the target DEV_CLK_MHZ operating frequency.
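From one of those host threads the call site then looks like an ordinary function call (names follow the defines above; the input value is just a placeholder):
// Inside a derived FSM host thread running at HOST_CLK_MHZ
input_t i;            // placeholder input value
output_t o = name(i); // request goes out on the shared bus, thread waits for the pipeline result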
TODO:
- describe shared res bus devices:
  - screen to complex
  - Mandelbrot iters
  - iter count to color
  - frame buffer
  - next state function
  - signal to compute next state each time frame rendered
- resources, fmax, etc.
TODO: 'cpu style single floating point unit?'
Using the multi-threaded dual frame buffer graphics demo setup discussed above, the final specifics for a Game of Life demo are ready to assemble. The per-pixel kernel function implementing Game of Life runs the familiar alive-neighbor cell counting algorithm to compute each cell's next alive/dead state:
// Func run for every n_pixels_t chunk
void pixels_buffer_kernel(uint16_t x_buffer_index, uint16_t y)
{
  // Read the pixels from the 'read' frame buffer
  n_pixels_t pixels = dual_frame_buf_read(x_buffer_index, y);
  // Run Game of Life kernel for each pixel
  uint32_t i;
  uint16_t x = x_buffer_index << RAM_PIXEL_BUFFER_SIZE_LOG2;
  for (i = 0; i < RAM_PIXEL_BUFFER_SIZE; i+=1)
  {
    pixels.data[i] = cell_next_state(pixels.data[i], x+i, y);
  }
  // Write pixels back to the 'write' frame buffer
  dual_frame_buf_write(x_buffer_index, y, pixels);
}
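cell_next_state itself is the familiar Conway rule. A minimal sketch, assuming a hypothetical uint1_t cell_alive(uint16_t x, uint16_t y) helper that reads a neighbor cell from the 'read' frame buffer, and ignoring frame-edge handling:
uint1_t cell_next_state(uint1_t alive, uint16_t x, uint16_t y)
{
  // Count live cells among the 8 neighbors (cell_alive is a hypothetical helper)
  uint8_t count = 0;
  int32_t dx, dy;
  for(dy = -1; dy <= 1; dy += 1)
  {
    for(dx = -1; dx <= 1; dx += 1)
    {
      if(!((dx==0) & (dy==0)))
      {
        count += cell_alive(x + dx, y + dy);
      }
    }
  }
  // Conway's rules: a live cell survives with 2 or 3 live neighbors,
  // a dead cell becomes alive with exactly 3
  uint1_t next;
  if(alive){
    next = (count==2) | (count==3);
  }else{
    next = (count==3);
  }
  return next;
}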
Using these shared resource buses, it's possible to picture even more complex host threads and computation devices. For instance, a long pixel rendering pipeline like the one used in Sphery vs. Shapes could be adapted to be a device resource shared among many rendering threads.
Generally the functionality in shared_resource_bus.h will continue to be improved and made easier to adapt in more design situations.
Please reach out if interested in giving anything a try or making improvements, happy to help! -Julian