┏━━━━━━━━━━━━━━━━━━━━━━┓
┃ CPU_OPTIMIZATION ┃
┗━━━━━━━━━━━━━━━━━━━━━━┛
SCOPE ==> #Can be:
# - "whole program optimization"/"link-time"
# - "interprocedural optimization" (IPO): between functions
# - "intraprocedural optimization"/"global": to each function
# - "loop optimization": inside loop structure
# - "local": several instructions
# - "peephole": few instructions
PROFILE-GUIDED OPTIMIZATION (PGO) #When using runtime information collected by a profiler
==> #Also called "profile-directed feedback" (PDF) or "feedback-directed optimization" (FDO)
MACHINE CODE OPTIMIZATION ==> #Optimization of CPU instructions for a specific architecture
#Performed after all other optimizations
POST-PASS OPTIMIZATION ==> #Optimizing from the machine code instead of the source code.
┌──────────────┐
│ ANALYSIS │
└──────────────┘
ALIAS ANALYSIS ==> #Determine whether two variables|pointers can refer to the same memory location
POINTER|POINTS-TO ANALYSIS ==> #Determine which memory locations a pointer might point to
ESCAPE ANALYSIS ==> #Determine which scopes might access a pointer, e.g. whether it escapes the function that allocated it
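#Example: a minimal C sketch of why alias analysis matters ("restrict" is
#standard C99; function and parameter names are illustrative):

    /* Without "restrict", the compiler must assume "in" and "out" may
       alias, forcing it to reload in[i] after every store to out[i].
       The no-alias guarantee enables reordering and vectorization. */
    void scale(int n, const double *restrict in, double *restrict out,
               double k) {
      for (int i = 0; i < n; i++)
        out[i] = in[i] * k;
    }
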
SHAPE ANALYSIS ==> #Verify properties of dynamic variables
#Examples:
# - checking that dynamically allocated memory is freed
# - bounds checking
# - detecting dangling pointer dereferences
ARRAY ACCESS ANALYSIS ==> #Determine which indices of an array are used
CONTROL DEPENDENCIES ==> #Determine dependencies between statements, in terms of concurrency and shared resources|variables
CONTROL FLOW ANALYSIS (CFA) ==> #Determine control flow
DATA FLOW ANALYSIS ==> #Determine variables' scope and lifetime according to control flow
# - "live variable analysis": variables' lifetime
# - "available expression analysis": variables' scope
┌──────────────────┐
│ CODE REMOVAL │
└──────────────────┘
DEAD CODE ELIMINATION (DCE) ==> #Removing code whose result is never used, or that can never execute ("unreachable code")
#Also called "dead code removal|stripping"
#"Dead store elimination": when removing variables
#"Bounds-checking elimination": when removing bounds-checking code
JUMP THREADING ==> #When a jump goes to another jump, reduce it to a single jump
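#Example: an illustrative C sketch of a jump to a jump being threaded:

    int classify(int x) {
      if (x == 0)
        goto a;    /* before threading: jumps to "a", which jumps again */
      return 1;    /* after threading: "goto a" is rewritten as "goto b" */
    a:
      goto b;
    b:
      return 0;
    }
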
CONSTANT FOLDING ==> #Performing computation at compile time when the operands are constants
#"Constant propagation":
# - replacing variables that hold constants with the constants themselves
# - "sparse conditional": when performed recursively while also removing branches on constants
┌──────────────────────┐
│ CODE DUPLICATION │
└──────────────────────┘
COMMON SUBEXPRESSION ELIMINATION #Refactoring duplicated code into a single place
(CSE) ==> #Pro: fewer CPU cycles
#Con: more local state, e.g. higher register pressure
#"Partial redundancy elimination": when performing loop-invariant code motion at the same time
VALUE NUMBERING ==> #Replacing two variables initialized with the same value by a single variable
#Same pros and cons as CSE
#Can be global or local, depending on the scope it operates on
LOOP-INVARIANT CODE #Moving code that computes the same result on every iteration out of the loop
MOTION|HOISTING ==> #Same pros and cons as CSE
#Opposite is "remat[erialization]"
#"Loop unswitching":
# - when the hoisted code is a branch
# - i.e. the branch becomes the parent of several copies of the loop, some without the conditional code
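#Example: a minimal C sketch of hoisting an invariant strlen() call
#(illustrative):

    #include <ctype.h>
    #include <string.h>

    /* Before: strlen(s) is loop-invariant but re-evaluated each iteration. */
    void upcase_before(char *s) {
      for (size_t i = 0; i < strlen(s); i++)
        s[i] = (char)toupper((unsigned char)s[i]);
    }

    /* After hoisting: the invariant call is computed once. */
    void upcase_after(char *s) {
      size_t n = strlen(s);
      for (size_t i = 0; i < n; i++)
        s[i] = (char)toupper((unsigned char)s[i]);
    }
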
┌───────────────────────────┐
│ LOCALITY OF REFERENCE │
└───────────────────────────┘
REGISTER ALLOCATION ==> #Deciding which variables should go into CPU registers as opposed to RAM
#Problem:
# - registers are faster
# - moving data from registers to RAM ("spilling") and back ("filling|reloading") is slow
# - "register pressure": how many registers are needed at once; high pressure causes spilling
# - number of registers is limited
#I.e. the goal is to keep the variables with the highest temporal locality of reference in registers
# - i.e. must track jumps, as they connect different groups of variables
#How:
# - graph coloring:
# - vertices are variables
# - variables within same scope are connected with "interference edges"
# - variables connected through jumps are connected with "preference edges"
# - do a K-coloring where:
# - K is number of registers
# - two vertices connected by an interference edge must have different colors, because they are live in the same scope
# - the inverse for preference edges: they should get the same color
# - if a vertex cannot be colored, it means the variable will be spilled
# - produces the best allocations, but is expensive to compute, i.e. better suited to ahead-of-time compilation
# - linear scan allocation:
# - track variables' lifetimes
# - variables with longer lifetimes are more likely to be spilled, i.e. should not be put in registers
# - heuristic graph coloring: graph coloring algorithms using heuristics to make them run faster
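#Example: a greedy K-coloring toy in C (illustrative only; real allocators
#add preference edges, coalescing, and spill cost heuristics):

    #include <stdio.h>

    #define V 5   /* variables a..e (vertices) */
    #define K 2   /* available registers r0..r1 (colors) */

    int main(void) {
      /* interference edges: 1 means "live at the same time" */
      int edge[V][V] = {
        {0, 1, 1, 0, 0},
        {1, 0, 1, 1, 0},
        {1, 1, 0, 1, 1},
        {0, 1, 1, 0, 1},
        {0, 0, 1, 1, 0},
      };
      int color[V];
      for (int v = 0; v < V; v++) {
        int used[K] = {0};
        for (int u = 0; u < v; u++)   /* colors taken by earlier neighbors */
          if (edge[v][u] && color[u] >= 0)
            used[color[u]] = 1;
        color[v] = -1;                /* -1 means spilled to RAM */
        for (int c = 0; c < K; c++)
          if (!used[c]) { color[v] = c; break; }
        if (color[v] >= 0)
          printf("var %c -> r%d\n", 'a' + v, color[v]);
        else
          printf("var %c -> spilled\n", 'a' + v);
      }
      return 0;
    }
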
LOOP FISSION|DISTRIBUTION ==> #Spreading the instructions within one loop across several loops (with the same range)
#Goal: parallelism, better locality of reference
#"Nuclear computation|fission|fusion": same but for threads instead of loops
LOOP FUSION|JAMMING ==> #Inverse of loop fission
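#Example: a minimal C sketch of fission (fusion is the reverse; illustrative):

    void fused(int n, double *a, double *b) {
      for (int i = 0; i < n; i++) {   /* alternates between a and b */
        a[i] *= 2.0;
        b[i] += 1.0;
      }
    }

    void fissioned(int n, double *a, double *b) {
      for (int i = 0; i < n; i++)   /* streams through a only... */
        a[i] *= 2.0;
      for (int i = 0; i < n; i++)   /* ...then through b only */
        b[i] += 1.0;
    }
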
LOOP INTERCHANGE ==> #Swapping inner loop with outer loop
#Goal: better locality of reference, e.g. to match the memory layout (row|column-major order) of matrices
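#Example: a minimal C sketch; C is row-major, so the inner loop should walk
#the rightmost index (illustrative):

    #define N 512

    double sum_slow(double m[N][N]) {
      double s = 0;
      for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
          s += m[i][j];   /* stride-N accesses: cache-hostile */
      return s;
    }

    double sum_fast(double m[N][N]) {
      double s = 0;
      for (int i = 0; i < N; i++)   /* loops interchanged */
        for (int j = 0; j < N; j++)
          s += m[i][j];   /* stride-1 accesses: cache-friendly */
      return s;
    }
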
LOOP NEST OPTIMIZATION (LNO) ==> #Divide a loop into a nested set of smaller loops
#Goal: making sure each tile's working set fits within the CPU cache (the "tiling size")
#Also called "loop tiling|blocking" or "strip mine and interchange"
CACHE OBLIVIOUS ALGORITHM ==> #Algorithm that works well regardless of the CPU cache sizes
#Usually obtained by dividing the problem into very small subproblems and using recursion or greediness
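#Example: a cache-oblivious transpose sketch in C; recursion halves the
#problem until some level fits every cache, with no tuned tile size
#(illustrative):

    #define N 1024

    static void rec(double a[N][N], double b[N][N],
                    int r0, int r1, int c0, int c1) {
      if (r1 - r0 <= 16 && c1 - c0 <= 16) {   /* tiny base case */
        for (int i = r0; i < r1; i++)
          for (int j = c0; j < c1; j++)
            b[j][i] = a[i][j];
      } else if (r1 - r0 >= c1 - c0) {        /* split the longer dimension */
        int rm = (r0 + r1) / 2;
        rec(a, b, r0, rm, c0, c1);
        rec(a, b, rm, r1, c0, c1);
      } else {
        int cm = (c0 + c1) / 2;
        rec(a, b, r0, r1, c0, cm);
        rec(a, b, r0, r1, cm, c1);
      }
    }
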
┌──────────────┐
│ INLINING │
└──────────────┘
INLINE EXPANSION ==> #Replacing function call by function body
#Pros:
# - faster: fewer jumps (which take time and prevent prefetching), smaller stack frame
# - flattening instructions to allow other optimizations
#Cons:
# - takes more space
# - more cache thrashing
#I.e. best for small|simple functions
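#Example: a minimal C sketch (illustrative; "inline" is a hint, the compiler
#decides):

    static inline int square(int x) { return x * x; }

    int sum_of_squares(int n) {
      int s = 0;
      for (int i = 1; i <= n; i++)
        s += square(i);   /* expanded to "s += i * i": no call, no jump */
      return s;
    }
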
TAIL CALL ELIMINATION ==> #Like expansion, but performed on the last instruction of a function
#I.e. the current stack frame can be thrown away
#"Tail recursion":
# - when performed on a recursive call
# - allows inlining recursion
# - allows unbounded recursion depth without overflowing the stack
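#Example: a tail-recursive factorial in C (illustrative; C compilers may
#apply the optimization, but the language does not guarantee it):

    static long fact_acc(long n, long acc) {
      if (n <= 1)
        return acc;
      return fact_acc(n - 1, acc * n);   /* tail call: nothing happens after
                                            it, so the frame can be reused */
    }

    long factorial(long n) { return fact_acc(n, 1); }
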
LOOP UNROLLING|UNWINDING ==> #Flattening a loop into duplicated code
#Pros and cons:
# - like inline expansion
# - also slower if iteration itself contains jumps|branches
#Can be "manual|static" (done by programmer) or "dynamic" (done by compiler)
LOOP SPLITTING ==> #Dividing loops into several loops with same body but different ranges
#Goal: breaking things down to allow other optimizations
#"Loop peeling":
# - when doing it to separate out only the first|last iteration, which is known to be optimizable
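#Example: a minimal C sketch of peeling the first iteration (illustrative):

    void rotate_before(int *dst, const int *src, int n) {
      for (int i = 0; i < n; i++)
        dst[i] = src[i == 0 ? n - 1 : i - 1];   /* branch every iteration */
    }

    void rotate_after(int *dst, const int *src, int n) {
      if (n > 0)
        dst[0] = src[n - 1];          /* peeled first iteration */
      for (int i = 1; i < n; i++)
        dst[i] = src[i - 1];          /* branch-free body */
    }
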
STRENGTH REDUCTION ==> #Replacing expensive instructions with several equivalent cheaper instructions
#Examples:
# - x ** 3 to x * x * x
# - x * 3 to x + x + x
# - x / 8 to x >> 3 ("right-shifting", for unsigned integers)
# - x * 8 to x << 3 ("left-shifting")
#Goal: same result using cheaper instructions; also breaks things down to allow other optimizations
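#Example: a minimal C sketch of in-loop strength reduction, replacing a
#multiplication by an addition on an induction variable (illustrative):

    void fill_before(long *a, int n) {
      for (int i = 0; i < n; i++)
        a[i] = (long)i * 8;   /* one multiply per iteration */
    }

    void fill_after(long *a, int n) {
      long v = 0;
      for (int i = 0; i < n; i++) {
        a[i] = v;
        v += 8;               /* cheaper addition instead */
      }
    }
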
┌────────────────┐
│ PIPELINING │
└────────────────┘
PIPELINE STALL ==> #On a CPU with instruction pipelining, when one stage is delayed,
#causing other stages to be delayed as well (with NOP instructions)
#Also called "bubble" or "pipeline break"
#Jumps can produce pipeline stalls, e.g. when a branch is mispredicted
LOOP INVERSION ==> #Replacing a while loop by an if + do-while
#Goal: one fewer jump instruction per iteration, leading to fewer pipeline stalls
#Can also help with loop-invariant code motion
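#Example: a minimal C sketch (illustrative):

    void zero_while(int *a, int n) {
      int i = 0;
      while (i < n)       /* jump at the top plus a jump back per iteration */
        a[i++] = 0;
    }

    void zero_inverted(int *a, int n) {
      int i = 0;
      if (i < n) {        /* guard runs once */
        do {
          a[i++] = 0;
        } while (i < n);  /* single backward branch per iteration */
      }
    }
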
SOFTWARE PIPELINING ==> #Like CPU pipelining, but at the software level
#I.e. overlapping instructions from different loop iterations, launching at once several independent instructions with similar computation time
#A form of "instruction scheduling"
┌─────────────────┐
│ PARALLELISM │
└─────────────────┘
PARALLELISM ==> #See parallelism doc