JPEG 1 sequential

Welcome to the jpeg-decoder-py wiki!

reference

https://parametric.press/issue-01/unraveling-the-jpeg/
https://en.wikipedia.org/wiki/ The progressive encoding is not described in detail. But if you want to begin with sequential, wiki is enough.
https://www.w3.org/Graphics/JPEG/itu-t81.pdf. It provides some flow graphs which are worth reading.
Compressed Image File Formats JPEG, PNG, GIF, XBM, BMP. John Miano. It provides algorithm codes.

introduction

JPEG is a method of lossy compression for digital images. The JPEG standard specifies the codec, which defines how an image is compressed into a stream of bytes and decompressed back into an image, but not the file format used to contain that stream.

JPEG/Exif is the most common image format used by digital cameras and other photographic image capture devices; along with JPEG/JFIF, it is the most common format for storing and transmitting photographic images on the World Wide Web, and has been the de facto file format for images encoded by the JPEG standard.

Here we only talk about encoding. Suppose the input image is a 32×36 matrix of pixels, each pixel consists of 3 integers between 0 and 255. We will convert it to a sequence of bits.

color space transforming

Color space is the way to represent colors.

RGB: three color components(or called channels), red, green, blue, 0-255

YCbCr: one luminance(luma) component, Y (brightness, what you see on a black and white TV, 0 for white, 255 for black); and two chrominance(chroma, or called color difference) components, Cb (relative blueness, also 0-255), and Cr (relative redness)

JPEG uses YCbCr to make the image more compressable. A brief description of transforming.

Y = 0.299 * R + 0.587*G + 0.114*B
Cb = -0.1687*R - 0.3313*G + 0.5*B + 128
Cr = 0.5*R - 0.4187*G - 0.0813*B + 128

chrominance subsampling

Human eyes are more sensitive to variations in brightness than color. If some chroma components are changed, we can't notice it. Thus we don't need to keep track of the chroma of all pixels.

The method we use is called subsampling or downsampling. It reduces the sampling frequency of some components to reduce the amount of samples. If the reduction is done by a factor of 2 in the horizontal direction, it means one sample value maps two horizontally adjecent pixels. The value may be the average.

Some subsampling scheme:

4:4:4 no downsampling
4:2:2 reduction by a factor of 2 in the horizontal direction
4:2:0 (most common) reduction by a factor of 2 in both the horizontal and vertical directions

JPEG files don't use reduction factor to describe subsampling. Instead, sampling frequency(1, 2 or 4) is used. It is a relative value. The component with the maximum frequency is not performed subsampling, and the component with less frequency is shrunk.

Suppose we have the following sampling frequency.

    vertical    horizontal
Y    2          2
Cb   1          1
Cr   1          1

The sampling frequency of Cb, Cr is only half of Y. This means that for every two samples taken for the Y in each direction, one sample is taken for the Cb and Cr. Each Y sample maps to a pixel and each Cb or Cr sample maps to 2×2 pixels. In the following part, I will use these frequencies as the example.

block spliting

The samples for each component must be split into 8×8 blocks, then 64 samples in a block are processed together. Blocks are also called data units. For DCT-based JPEG, the term block is more easy to understand.

The sample block is always 8×8. For Y which is not performed subsampling, its corresponding pixel block is also 8×8. But for Cb or Cr, the pixel block is 16×16. Keep in mind the difference between the sample block and the pixel block, in the following part, if the term block is used alone, it refers to a sample block.

pixel_block_height = 8 * max_vf // vf
pixel_block_width = 8 * max_hf // hf

Incomplete blocks are not allowed. In vertical direction, if the image_height is not a multiple of pixel_block_height, extra samples are added.

number_of_rows_of_blocks = ceil(image_height / pixel_block_height)
number_of_blocks_per_row = ceil(image_width / pixel_block_width)

The image is 32×36, so there are 4×5 Y blocks, 2×3 Cb or Cr blocks.

Discrete Cosine Transform

The values in a block are between 0 and 255. They are shifted to center around zero, then transformed by DCT (we use type-Ⅱ) into a4 values referred to as DCT coefficients. One of these values, which is in the top-left corner representing low frequency, is referred to as the DC coefficient and the other 63 as the AC coefficients.

Let the block be B, Coef = dct2(shift(B)), DC = Coef[0][0]

Quantization

The step where actual lossy compression happens.

The human eye is good at seeing small differences in brightness over a relatively large area, but not so good at distinguishing the exact strength of a high frequency brightness variation. This allows one to greatly reduce the amount of information in the high frequency components, and the general gist of the image will still be there.

There are 8×8 quantization tables depending on the quality setting saved in the file header. What we do is to divide each coefficient by corresponding quantization value and round to the nearest integer. Often the value for high frequency is rounded to 0.

Qcoef[i][j] = round(Coef[i][j]/QT[i][j])

scans

Now we have several blocks for each component, each block has 1 quantized DC coefficient and 63 ACs. We will encode these values within one or multiple scans. Each scan iterates over MCUs(minimum coded unit). MCUs are encoded one row at a time, top to bottom, left to right. The encoded data in one scan are stored together, in the order they were scanned.

interleaved

A scan with only one component is referred to as non-interleaved. A MCU only contains one block.

The number of MCU rows and columns is equal to number_of_row_of_blocks and number_of_blocks_per_row of the component.

get number_of_row_of_blocks and number_of_blocks_per_row of the component

for i in range(number_of_row_of_blocks):
    for j in range(number_of_blocks_per_row):
        encode the block with index (i, j)

non-interleaved

A scan with more than one component is known as interleaved.

The MCU of a interleaved scan is determined by the sampling frequencies of all components, not just the components in the current scan. The biggest pixel block is 16×16, it contains 4 Y block(referred to as Y00,Y01,Y10,Y11), 1 Cb block and 1 Cr block. So a MCU has 4Y 1Cb 1Cr. The number of blocks each component contributes to an MCU is the product of the component's sampling frequencies, vf×hf. Within an MCU, the blocks for each component are encoded from top to bottom, left to right. Components are not interleaved within an MCU.

If one scan contains Y Cb Cr
(MCU1 Y00 Y01 Y10 Y11 Cb Cr), (MCU2 Y00 Y01 Y10 Y11 Cb Cr), ...

The number of MCU rows and columns are determined by max_vf and max_hf of all components in the image.

number_of_rows_of_MCUs = ceil(image_height / (8 * max_vf))
number_of_MCUs_per_row = ceil(image_width / (8 * max_hf))

So we have 2×3 interleaved MCUs. Take care, there are 2×3×4 Y blocks, more than previous 4×5. So we need add dummy Y blocks to form a complete MCU.

Interleaved sequential scan

If a component is encoded in only one scan, it is known as sequential.

Interleaved sequential scan is the common method. There is only one scan. In a component block, DC are encoded first, followed by ACs in zigzag order.

zig-zag (0,0) -> (0,1) -> (1,0) ->(2,0) -> (1,1) -> (0,2) -> ...

DC encoding

Delta-encoding

It is the technique of storing each value as a relative value compared to something before it, instead of storing its absolute value.

The encoded DC value is actually the difference between the DC coefficient value in the current data unit and the DC value from the last data unit processed for the same component.

For Y: (MCU1_Y00_DC), ACs, (MCU1_Y01_DC - MCU1_Y00_DC), ..., (MCU2_Y00_DC - MCU1_Y11_DC), ...

DC represents the average of the block, so for an image, it may be close to DCs of adjecent blocks. By using delta-encoding, DCs are converted to smaller values.

12 13 14 14 14 13 13 14

12 1 1 0 0 -1 0 1

Huffman encoding

Huffman encoding: given a set of symbols(like numbers, characters, etc.) and their weight(usually proportional to frequency), find a prefix-free binary code(a set of codewords, the codeword is a bit string, like 010) with minimum total codeword length.

Huffman tables map between symbols and codewords. The JPEG standard provides general-purpose Huffman tables; encoders may also choose to generate Huffman tables optimized for the actual frequency distributions in images being encoded.

After delta-encoding, DC value is then converted to a value knowns as magnitude(or called SIZE) and magnitude bits known as amplitude. For positive values, like 7, its binary representation is 111, so the magnitude is 3, and the amplitude is [1,1,1]. For negetive values, like -7, the magnitude is the same as its absolute value 7, and the amplitude of 7 is flipped to get [0,0,0]. As for 0, the magnitude is 0, and the amplitude is empty [].

Then, the magnitude is Huffman-encoded. The Huffman codeword is appended to the output stream and the bits of amplitude follows.

Suppose the codeword for symbol 3 is 010, the output stream receives [0,1,0,1,1,1].

AC encoding

AC encoding doesn't use delta-encoding, and it uses a variation of Huffman encoding, which combines run length encoding with Huffman encoding.

run length encoding

After quantization, many AC coefficients are 0. It's a waste to explicitly encode and store them. We just encode non-zero ACs and record how many zeroes are between two non-zero ACs, which is known as runlength of zeroes.

let x be the non-zero AC, we get a triple (runlength, magnitude, amplitude)

runlength is the number of zeroes between the previous non-zero AC and x.
magnitude and amplitude are the same as DC encoding

We use a byte to represent runlength and magnitude. The high 4 bits are runlength and low 4 bits are magnitude. Then it's Huffman-encoded. Also, the Huffman codeword is appended to the output stream and the bits of amplitude follows.

runlength is between 0 and 15. What if there are more than 15 continuous zeroes? When find 16 zeroes, the byte is (15, 0), ZLF, representing the 16th zero. And the next non-zero AC considers the 16th zero as the previous non-zero AC.

When the remianing of the block are all zeroes, the byte is (0, 0), which means EOB, End of Block, representing all these zeroes.

Given following 63 ACs in a block.

12, 0, 0, 5, (16, 0), 0, 4, (41, 0)
(n, 0) represents n continous 0

They are converted to

(0, 4, [1,1,0,0]), (2, 3, [1,0,1]), (15, 0, []), (1, 3, [1,0,0]), (0,0,[])

Then Huffman encoded.

non-interleaved sequential

The procedure for a block is the same as interleaved sequential.

The progressive encoding is more complex, I leave it to another post.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly