Minimalist software decoder with state-of-the-art performance for the H.264/AVC video format.
Please note this is a work in progress and will be ready for use after making GStreamer/VLC plugins.
- Supports Progressive High and MVC 3D profiles, up to level 6.2
- Any resolution up to 8K UHD
- 8-bit 4:2:0 planar YUV output
- Slices and Arbitrary Slice Order
- Slice and frame multi-threading
- Per-slice reference picture list
- Memory Management Control Operations
- Long-term reference frames
- Windows: x86, x64
- Linux: x86, x64, ARM64
- Mac OS: x64
edge264 is entirely developed in C using 128-bit vector extensions and vector intrinsics, and can be compiled with GNU GCC or LLVM Clang. The SDL2 runtime library may optionally be used to enable display with edge264_test.
Here are the make options for tuning the compiled library file:
- CC - C compiler used to convert source files to object files (default cc)
- CFLAGS - additional compilation flags passed to CC
- ARCH - target architecture that will be passed to -march (default native)
- OS - target operating system (defaults to host)
- TARGETCC - C compiler used to link object files into the library file (defaults to CC)
- LDFLAGS - additional compilation flags passed to TARGETCC
- VARIANTS - comma-separated list of additional variants included in the library and selected at runtime (default debug)
  - x86-64-v2 - variant compiled for x86-64 microarchitecture level 2 (SSSE3, SSE4.1 and POPCNT)
  - x86-64-v3 - variant compiled for x86-64 microarchitecture level 3 (AVX2, BMI, LZCNT, MOVBE)
  - debug - variant compiled with debugging support (-g and print calls for headers and slices)
- BUILD_TEST - toggles compilation of edge264_test (default yes)
$ make ARCH=x86-64 VARIANTS=x86-64-v2,x86-64-v3 BUILD_TEST=no # example release build
The automated test program edge264_test can browse files in a given directory, decoding each <video>.264 file and comparing its output with the sibling file <video>.yuv if found. On the set of AVCv1, FRExt and MVC conformance bitstreams, 109/224 files are decoded without errors; the rest use features that are not yet supported.
$ make
$ ./edge264_test --help # prints all options available
$ ffmpeg -i vid.mp4 -vcodec copy -bsf h264_mp4toannexb -an vid.264 # optional, converts from MP4 format
$ ./edge264_test -d vid.264 # replace -d with -b to benchmark instead of display
Here is a complete example that opens an input file in Annex B format from command line, and dumps its decoded frames in planar YUV order to standard output. See edge264_test.c for a more complete example which can also display frames.
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include "edge264.h"

int main(int argc, char *argv[]) {
	if (argc < 2)
		return 1; // expects the path to an Annex B file
	int fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;
	struct stat st;
	fstat(fd, &st);
	uint8_t *buf = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED)
		return 1;
	const uint8_t *nal = buf + 3 + (buf[2] == 0); // skip the [0]001 delimiter
	const uint8_t *end = buf + st.st_size;
	Edge264Decoder *dec = edge264_alloc(-1, NULL, NULL); // auto number of threads
	Edge264Frame frm;
	int res;
	do {
		// decode one NAL unit, nal is advanced to the next one on success
		res = edge264_decode_NAL(dec, nal, end, 0, NULL, NULL, &nal);
		// drain every frame that is ready, writing Y then Cb then Cr row by row
		while (!edge264_get_frame(dec, &frm, 0)) {
			for (int y = 0; y < frm.height_Y; y++)
				write(1, frm.samples[0] + y * frm.stride_Y, frm.width_Y);
			for (int y = 0; y < frm.height_C; y++)
				write(1, frm.samples[1] + y * frm.stride_C, frm.width_C);
			for (int y = 0; y < frm.height_C; y++)
				write(1, frm.samples[2] + y * frm.stride_C, frm.width_C);
		}
	} while (res == 0 || res == ENOBUFS);
	edge264_free(&dec);
	munmap(buf, st.st_size);
	close(fd);
	return 0;
}
const uint8_t * find_start_code(buf, end)
Return a pointer to the next three-byte sequence 001, or end if not found.
- const uint8_t * buf - first byte of buffer to search into
- const uint8_t * end - first invalid byte past the buffer that stops the search
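To make the contract above concrete, here is a minimal scalar sketch with the same behaviour as described: it returns a pointer to the first 00 00 01 sequence, or end when none is found. The names naive_find_start_code and count_nal_units are hypothetical and not part of edge264; the library's own function may well be implemented differently (for instance with vector code).

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical scalar stand-in for find_start_code: returns a pointer to the
// first 00 00 01 sequence in [buf, end), or end if there is none.
static const uint8_t *naive_find_start_code(const uint8_t *buf, const uint8_t *end) {
    for (const uint8_t *p = buf; end - p >= 3; p++) {
        if (p[0] == 0 && p[1] == 0 && p[2] == 1)
            return p;
    }
    return end;
}

// Counts the NAL units of an Annex B buffer by hopping between start codes.
static int count_nal_units(const uint8_t *buf, const uint8_t *end) {
    int count = 0;
    const uint8_t *p = naive_find_start_code(buf, end);
    while (p != end) {
        count++;
        p = naive_find_start_code(p + 3, end); // resume right after the 001
    }
    return count;
}
```

The first byte of each NAL unit sits three bytes after the returned pointer, which is why the search resumes at p + 3.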
Edge264Decoder * edge264_alloc(n_threads, trace_headers, trace_slices)
Allocate and initialize a decoding context.
- int n_threads - number of background worker threads, with 0 to disable multithreading and -1 to detect the number of logical cores at runtime
- FILE * trace_headers - if not NULL, the file to print header values while decoding (⚠️ large, enabling it requires the debug variant, otherwise the function will fail at runtime)
- FILE * trace_slices - if not NULL, the file to print slice values while decoding (⚠️ very large, requires debug too)
void edge264_flush(dec)
For use when seeking, stop all background processing and clear all delayed frames. The parameter sets are kept, thus do not need to be sent again if they did not change.
- Edge264Decoder * dec - initialized decoding context
void edge264_free(pdec)
Deallocate the entire decoding context, and unset the pointer.
- Edge264Decoder ** pdec - pointer to a decoding context, initialized or not
int edge264_decode_NAL(dec, buf, end, non_blocking, free_cb, free_arg, next_NAL)
Decode a single NAL unit containing any parameter set or slice.
- Edge264Decoder * dec - initialized decoding context
- const uint8_t * buf - first byte of NAL unit (containing nal_unit_type)
- const uint8_t * end - first byte past the buffer (max buffer size is 2^31-1 on 32-bit and 2^63-1 on 64-bit)
- int non_blocking - set to 1 if the current thread has other processing thus cannot block here
- void (* free_cb)(void * free_arg, int ret) - callback that may be called from another thread when multithreaded, to signal the end of parsing and release the NAL buffer
- void * free_arg - custom value that will be passed to free_cb
- const uint8_t ** next_NAL - if not NULL and the return code is 0|ENOTSUP|EBADMSG, will receive a pointer to the next NAL unit after the next start code in an Annex B stream
Return codes are:
- 0 on success
- ENOTSUP on unsupported stream (decoding may proceed but could return zero frames)
- EBADMSG on invalid stream (decoding may proceed but could show visual artefacts; if you can check with another decoder that the stream is actually flawless, please consider filing a bug report 🙏)
- EINVAL if the function was called with dec == NULL or dec->buf == NULL
- ENODATA if the function was called while dec->buf >= dec->end
- ENOMEM if malloc failed to allocate memory
- ENOBUFS if more frames should be consumed with edge264_get_frame to release a picture slot
- EWOULDBLOCK if the non-blocking function would have to wait before a picture slot is available
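One possible way for a caller to act on these return codes is sketched below. The enum NalAction and the function classify_decode_result are hypothetical helpers, not part of edge264. The grouping is an interpretation of the descriptions above: codes for which next_NAL is filled let the caller move on, ENOBUFS and EWOULDBLOCK ask for the same NAL unit to be submitted again later, and everything else stops the loop.

```c
#include <errno.h>

// Hypothetical helper (not part of edge264): what the caller does next.
enum NalAction {
    NAL_ADVANCE, // next_NAL was filled, continue with the following NAL unit
    NAL_RETRY,   // consume frames (or wait), then submit the same NAL unit again
    NAL_STOP     // end of buffer or unrecoverable error
};

static enum NalAction classify_decode_result(int res) {
    switch (res) {
    case 0:
    case ENOTSUP: // unsupported stream, decoding may proceed
    case EBADMSG: // invalid stream, decoding may proceed with artefacts
        return NAL_ADVANCE;
    case ENOBUFS:     // a picture slot must be released with edge264_get_frame
    case EWOULDBLOCK: // non-blocking call would have to wait for a slot
        return NAL_RETRY;
    default: // EINVAL, ENODATA, ENOMEM or anything unexpected
        return NAL_STOP;
    }
}
```

Note that the complete example earlier in this document is stricter and stops on ENOTSUP and EBADMSG; whether to continue past them is a choice left to the application.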
int edge264_get_frame(dec, out, borrow)
Fetch the next frame ready for output.
- Edge264Decoder * dec - initialized decoding context
- Edge264Frame * out - a structure that will be filled with data for the frame returned
- int borrow - if 0 the frame may be accessed until the next call to edge264_decode_NAL, otherwise the frame should be explicitly returned with edge264_return_frame. Note that access is not exclusive, it may be used concurrently as reference for other frames.
Return codes are:
- 0 on success (one frame is returned)
- EINVAL if the function was called with dec == NULL or out == NULL
- ENOMSG if there is no frame to output at the moment
While reference frames may be decoded ahead of their actual display (e.g. with the B-Pyramid technique), all frames are buffered for reordering before being released for display:
- Decoding a non-reference frame releases it and all frames set to be displayed before it.
- Decoding a key frame releases all stored frames (but not the key frame itself which might be reordered later).
- Exceeding the maximum number of frames held for reordering releases the next frame in display order.
- Lacking an available frame buffer releases the next non-reference frame in display order (to salvage its buffer) and all reference frames displayed before it.
void edge264_return_frame(dec, return_arg)
Give back ownership of the frame if it was borrowed from a previous call to edge264_get_frame.
- Edge264Decoder * dec - initialized decoding context
- void * return_arg - the value stored inside the frame to return
typedef struct Edge264Frame {
	const uint8_t *samples[3]; // Y/Cb/Cr planes
	const uint8_t *samples_mvc[3]; // second view
	const uint8_t *mb_errors; // probabilities (0..100) for each macroblock to be erroneous, NULL if there are no errors, values are spaced by stride_mb in memory
	int8_t pixel_depth_Y; // 0 for 8-bit, 1 for 16-bit
	int8_t pixel_depth_C;
	int16_t width_Y;
	int16_t width_C;
	int16_t height_Y;
	int16_t height_C;
	int16_t stride_Y;
	int16_t stride_C;
	int16_t stride_mb;
	int32_t TopFieldOrderCnt;
	int32_t BottomFieldOrderCnt;
	int16_t frame_crop_offsets[4]; // {top,right,bottom,left}, useful to derive the original frame with 16x16 macroblocks
	void *return_arg;
} Edge264Frame;
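Since each plane is described by a width, a height, a stride and a pixel depth, copying it into a tightly packed buffer takes one memcpy per row. The helper below is a sketch under assumptions: pack_plane is a hypothetical name, widths are assumed to count samples, strides are assumed to count bytes, and pixel_depth follows the convention in the structure above (0 for 8-bit, 1 for 16-bit samples).

```c
#include <stdint.h>
#include <string.h>

// Hypothetical helper: copies one plane into dst without the row padding and
// returns the number of bytes written.
// Assumptions: width counts samples, stride counts bytes, and pixel_depth is
// 0 for 8-bit samples or 1 for 16-bit samples.
static size_t pack_plane(uint8_t *dst, const uint8_t *src,
                         int width, int height, int stride, int pixel_depth) {
    size_t row_bytes = (size_t)width << pixel_depth; // 1 or 2 bytes per sample
    for (int y = 0; y < height; y++)
        memcpy(dst + y * row_bytes, src + (size_t)y * stride, row_bytes);
    return row_bytes * height;
}
```

For the luma plane of a frame frm, a call could look like pack_plane(dst, frm.samples[0], frm.width_Y, frm.height_Y, frm.stride_Y, frm.pixel_depth_Y), with dst sized by the caller.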
- Any invalid or corrupted header is ignored by edge264, i.e. if an invalid parameter set is received then the previous one will still be kept.
- Any invalid or corrupted frame will be signaled by setting mb_errors in Edge264Frame. Since edge264 cannot detect exactly where a corruption occurs, it returns a 0-100% integer probability for each macroblock to contain errors caused by the corruption. This probability is rounded upward, such that all macroblocks inside a corrupted slice will at least have the value 1. edge264 does a basic job at reconstructing the corrupted macroblocks with neighboring frames, so media players are encouraged to use mb_errors to provide a better reconstruction.
- Multithreading (in progress)
- Error recovery (in progress)
- Integration in VLC/ffmpeg/GStreamer
- ARM32
- PAFF and MBAFF
- 4:0:0, 4:2:2 and 4:4:4
- 9-14 bit depths with possibility of different luma/chroma depths
- Transform-bypass for macroblocks with QP==0
- SEI messages
- AVX-2 optimizations
I use edge264 to experiment on new programming techniques to improve performance and code size over existing decoders, and presented a few of these techniques at FOSDEM'24 and FOSDEM'25.
- Single header file - It contains all struct definitions, common constants and enums, SIMD aliases, inline functions and macros, and exported functions for each source file. To understand the code base you should look at this file first.
- Code blocks instead of functions - The main decoding loop is a forward pipeline designed as a DAG loosely resembling hardware decoders, with nodes being non-inlined functions and edges being tail calls. It helps share code branches wherever possible, which reduces code size and helps the decoder fit in L1 cache.
- Tree branching - Directional intra modes are implemented with a jump table to the leaves of a tree then unconditional jumps down to the trunk. It allows sharing the bottom code among directional modes, to reduce code size.
- Global context register - The pointer to the main structure holding context data was assigned to a register when supported by the compiler (GCC). This technique was dropped as Clang eventually reached on-par performance, so there is little incentive to maintain this hack.
- Default neighboring values (search unavail_mb) - Tests for availability of neighbors are replaced with fake neighboring macroblocks around each frame. It reduces the number of conditional tests inside the main decoding loop, thus reduces code size and branch predictor pressure.
- Relative neighboring offsets (look for A4x4_int8 and related variables) - Access to left/top macroblock values is done with direct offsets in memory instead of copying their values to a buffer beforehand. It helps to reduce the reads and writes in the main decoding loop.
- Parsing uneven block shapes (look at function parse_P_sub_mb) - Each Inter macroblock partitioning specified with mb_type and sub_mb_type is first converted to a bitmask, then iterated on set bits to fetch the correct number of reference indices and motion vectors. This helps to reduce code size and number of conditional blocks.
- Using vector extensions - GCC's vector extensions are used along with vector intrinsics to write more compact code. All intrinsics from Intel are aliased with shorter names, which also provides an enumeration of all SIMD instructions used in the decoder.
- Register-saturating SIMD - Some critical SIMD algorithms use more simultaneous vectors than available registers, effectively saturating the register bank and generating stack spills on purpose. In some cases this is more efficient than splitting the algorithm into smaller bits, and has the additional benefit of scaling well with later CPUs.
- Piston cached bitstream reader - The bitstream bits are read in a size_t[2] intermediate cache with a trailing set bit to keep track of the number of cached bits, giving access to 32/64 bits per read from the cache, and allowing wide refills from memory.
- On-the-fly SIMD unescaping - The input bitstream is unescaped on the fly using vector code, avoiding a full preprocessing pass to remove escape sequences, and thus reducing memory reads/writes.
- Multiarch SIMD programming - Using vector extensions along with aliased intrinsics allows supporting both Intel SSE and ARM NEON with around 80% common code and few #if #else blocks, while keeping state-of-the-art performance for both architectures.
- The Structure of Arrays pattern - The frame buffer is stored with arrays for each distinct field rather than an array of structures, to express operations on frames with bitwise and vector operators (see AoS and SoA). The task buffer for multithreading also relies on it partially.
- Deferred error checking - Error detection is performed once in each type of NAL unit (search for return statements), by clamping all input values to their expected ranges, then expecting rbsp_trailing_bit afterwards (with very high probability of catching an error if the stream is corrupted). This design choice is discussed in A case about parsing errors.
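The "Parsing uneven block shapes" item in the list above can be illustrated with a small sketch. It is not the code of parse_P_sub_mb: the table sub_masks and the functions partition_mask and list_partitions are hypothetical, and the mask layout (one bit per 4x4 block in z-order, set where a partition starts) is an assumption chosen for the example. It relies on __builtin_ctz, which GCC and Clang both provide.

```c
#include <stdint.h>

// 4-bit masks of the 4x4 blocks (z-order inside one 8x8 block: top-left,
// top-right, bottom-left, bottom-right) where a partition starts, indexed by
// the P sub_mb_type values 0..3, i.e. 8x8, 8x4, 4x8 and 4x4.
static const uint8_t sub_masks[4] = {0x1, 0x5, 0x3, 0xF};

// Builds a 16-bit mask for the whole macroblock from its four sub_mb_type.
static unsigned partition_mask(const uint8_t sub_mb_type[4]) {
    unsigned mask = 0;
    for (int i = 0; i < 4; i++)
        mask |= (unsigned)sub_masks[sub_mb_type[i]] << (4 * i);
    return mask;
}

// Visits each set bit once, lowest first. Stores the index of the 4x4 block
// starting each partition and returns how many motion vectors are expected.
static int list_partitions(unsigned mask, int blocks[16]) {
    int n = 0;
    while (mask) {
        blocks[n++] = __builtin_ctz(mask); // index of the lowest set bit
        mask &= mask - 1;                  // clear that bit
    }
    return n;
}
```

A single loop over set bits then handles every combination of partition shapes, which is where the saving in conditional blocks comes from.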
Other yet-to-be-presented bits:
- Minimalistic API with FFI-friendly design (7 functions and 1 structure).
- The bitstream caches for CAVLC and CABAC (search for rbsp_reg) are stored in two size_t variables each, which may be mapped to Global Register Variables in the future.
- The decoding of input symbols is interspersed with their parsing (instead of parsing to a struct then decoding the data). It deduplicates branches and loops that are present in both parsing and decoding, and even eliminates the need to store some symbols (e.g. mb_type, sub_mb_type, mb_qp_delta).
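The idea of a bitstream cache that tracks its own fill level with a trailing set bit can be sketched as follows. This is a deliberately simplified illustration using a single 64-bit word and byte-wise refills, whereas edge264 uses two size_t variables and wide refills; BitReader, init_reader, refill, cached_bits and read_bits are hypothetical names. The sketch assumes GCC or Clang for __builtin_ctzll, and that the caller never reads more bits than remain.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t cache;      // unread bits at the top, then one set marker bit, then zeros
    const uint8_t *next; // next byte to load from memory
    const uint8_t *end;  // first byte past the input
} BitReader;

// The marker is the lowest set bit, so its position gives the fill level.
static int cached_bits(const BitReader *r) {
    return 63 - __builtin_ctzll(r->cache);
}

// Appends whole bytes below the unread bits while at least 8 bits are free.
static void refill(BitReader *r) {
    while (r->next < r->end && cached_bits(r) <= 55) {
        int m = __builtin_ctzll(r->cache);           // marker position
        r->cache &= r->cache - 1;                    // clear the marker
        r->cache |= (uint64_t)*r->next++ << (m - 7); // new byte takes bits m..m-7
        r->cache |= 1ULL << (m - 8);                 // marker moves below it
    }
}

static void init_reader(BitReader *r, const uint8_t *buf, const uint8_t *end) {
    r->cache = 1ULL << 63; // empty cache: only the marker
    r->next = buf;
    r->end = end;
    refill(r);
}

// Reads n bits (1..32), most significant first.
// Precondition: at least n bits remain in the cache plus the input.
static uint32_t read_bits(BitReader *r, int n) {
    if (cached_bits(r) < n)
        refill(r);
    uint32_t val = (uint32_t)(r->cache >> (64 - n));
    r->cache <<= n; // the marker shifts up along with the remaining bits
    return val;
}
```

Because the marker travels with the data, no separate bit counter has to be kept in sync, and the whole reader state fits in registers.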