Skip to content

KevinzChen04/C_compiler

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

C_compiler

Project Overview

The modern C compiler replicates what current compilers(Klang, GDB, Swift) use to speed up C code. This is done through a process of translating C code into simpler languages, built with a specific optimization in mind, going from C->LB->LA->IR->L3->L2->L1 before being translated to assembly and machine code.

Languages

L1

L1 is a barebones language that serves as a middle ground between the back-end compiler and assembly. The language definition is very similar to assembly.

L1 Language

Figure 1: L1 Language Syntax

L2

In L2, the language supports both variables and registers. The focus of this language is to map variables to registers. This is done through graph coloring, where a connection between two variables represents that they must be colored with two different registers. This graph is built using liveness analysis. For every line, we create a list of variables read and stored. If a variable is read, previously stored variables cannot share the same register. If graph coloring is not possible with 16 registers, we will "spill" the variable name, allocating the variable's value onto the stack. L2's syntax is built to support this:

L2 Language

Figure 2: L2 Language Syntax

L3

L3 is about instruction selection, combining multiple lines of code into one for optimizations. To do so, L3 merges instructions into large tree expressions:

L3 tree

Figure 3: L3 Tree

Then, we merge the tree to simplify the total number of expressions and convert them to L2 instructions. Thus simplifying the overall number of instructions.
L3 merging trees

Figure 4: L3 merging trees

This is supported by L3's simpler syntax:
L3 syntax

Figure 5: L3 language syntax

IR

IR is not part of the optimization, but rather a simplified view between the front-end and back-end. This is reflected in the syntax, as its forceful use of basic blocks makes programs easier to read during debugging.

IR syntax

Figure 6: IR language syntax

However, there is a problem going from IR to L3, due to IR's unique syntax. This is solved with control flow graph linearization. Naively we could have each false branch fall through to a force branch. However, we implemented CFG which utilized Markov chains which maximizes fall-through links while minimizing jumps.

LA

An underlying problem throughout the previous compilers was how to differentiate a data value from a memory location. Previously, data members ended with a "1" bit while memory locations ended with a "0" bit. In LA, values can only be data members or an array or tuple of data members. In addition to this quick fix, we also implemented automatic tensor errors and type declarations.

LA syntax

Figure 7: LA language syntax

LB

Finally, in LB, we added scopes, which delimit variable names and allow for the creation of "if" and "while" statements.

LB syntax

Figure 8: LB language syntax

Conclusion

In the end, our compiler was able to compete with up to 90% of the modern-day compilers. The reason why it was slower was due to modern compilers further optimizing spilling.

All pictures are credited to Professor Simone Campanoni

About

C compiler using modern techniques

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages