┏━━━━━━━━━━━━━━━━━━━━━━┓
┃ CPU_OPTIMIZATION ┃
┗━━━━━━━━━━━━━━━━━━━━━━┛
SCOPE ==> #Can be:
# - "whole program optimization"/"link-time"
# - "interprocedural optimization" (IPO): between functions
# - "intraprocedural optimization"/"global": to each function
# - "loop optimization": inside loop structure
# - "local": several instructions
# - "peephole": few instructions
PROFILE-GUIDED OPTIMIZATION (PGO) #When using runtime information collected by a profiler
==> #Also called "profile-directed feedback" (PDF) or "feedback-directed optimization" (FDO)
MACHINE CODE OPTIMIZATION ==> #Optimization of CPU instructions for a specific architecture
#Performed after all other optimizations
POST-PASS OPTIMIZATION ==> #Optimizing from the machine code instead of the source code.
┌──────────────┐
│ ANALYSIS │
└──────────────┘
ALIAS ANALYSIS ==> #Determine whether two variables|pointers can refer to the same memory location
POINTER|POINTS-TO ANALYSIS ==> #Determine which memory locations a pointer might point to
ESCAPE ANALYSIS ==> #Determine which scopes might access a pointer, e.g. whether it escapes the function that allocated it
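#Example: a minimal C sketch of why alias analysis matters ("restrict" is
#standard C99; function and parameter names are illustrative):

    /* Without "restrict", the compiler must assume "in" and "out" may
       alias, forcing it to reload in[i] after every store to out[i].
       The no-alias guarantee enables reordering and vectorization. */
    void scale(int n, const double *restrict in, double *restrict out,
               double k) {
      for (int i = 0; i < n; i++)
        out[i] = in[i] * k;
    }
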
SHAPE ANALYSIS ==> #Verify properties of dynamic variables
#Examples:
# - checking that dynamically allocated memory is freed
# - bounds checking
# - detecting dangling pointer dereferences
ARRAY ACCESS ANALYSIS ==> #Determine which indices of an array are used
CONTROL DEPENDENCIES ==> #Determine dependencies between statements, in terms of concurrency and shared resources|variables
CONTROL FLOW ANALYSIS (CFA) ==> #Determine control flow
DATA FLOW ANALYSIS ==> #Determine variables' scope and lifetime according to control flow
# - "live variable analysis": variables' lifetime
# - "available expression analysis": variables' scope
┌──────────────────┐
│ CODE REMOVAL │
└──────────────────┘
DEAD CODE ELIMINATION (DCE) ==> #Removing code whose result is never used, or that can never execute ("unreachable code")
#Also called "dead code removal|stripping"
#"Dead store elimination": when removing variables
#"Bounds-checking elimination": when removing bounds-checking code
JUMP THREADING ==> #When a jump goes to another jump, reduce it to a single jump
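#Example: an illustrative C sketch of a jump to a jump being threaded:

    int classify(int x) {
      if (x == 0)
        goto a;    /* before threading: jumps to "a", which jumps again */
      return 1;    /* after threading: "goto a" is rewritten as "goto b" */
    a:
      goto b;
    b:
      return 0;
    }
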
CONSTANT FOLDING ==> #Performing computation at compile time when the operands are constants
#"Constant propagation":
# - replacing variables that hold constants with the constants themselves
# - "sparse conditional": when performed recursively while also removing branches on constants
┌──────────────────────┐
│ CODE DUPLICATION │
└──────────────────────┘
COMMON SUBEXPRESSION ELIMINATION #Refactoring duplicated code into a single place
(CSE) ==> #Pro: fewer CPU cycles
#Con: more local state, e.g. higher register pressure
#"Partial redundancy elimination": when performing loop-invariant code motion at the same time
VALUE NUMBERING ==> #Replacing two variables initialized with the same value by a single variable
#Same pros and cons as CSE
#Can be global or local, depending on the scope it operates on
LOOP-INVARIANT CODE #Moving code that computes the same result on every iteration out of the loop
MOTION|HOISTING ==> #Same pros and cons as CSE
#Opposite is "remat[erialization]"
#"Loop unswitching":
# - when the hoisted code is a branch
# - i.e. the branch becomes the parent of several copies of the loop, some without the conditional code
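#Example: a minimal C sketch of hoisting an invariant strlen() call
#(illustrative):

    #include <ctype.h>
    #include <string.h>

    /* Before: strlen(s) is loop-invariant but re-evaluated each iteration. */
    void upcase_before(char *s) {
      for (size_t i = 0; i < strlen(s); i++)
        s[i] = (char)toupper((unsigned char)s[i]);
    }

    /* After hoisting: the invariant call is computed once. */
    void upcase_after(char *s) {
      size_t n = strlen(s);
      for (size_t i = 0; i < n; i++)
        s[i] = (char)toupper((unsigned char)s[i]);
    }
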
┌───────────────────────────┐
│ LOCALITY OF REFERENCE │
└───────────────────────────┘
REGISTER ALLOCATION ==> #Deciding which variables should go into CPU registers as opposed to RAM
#Problem:
# - registers are faster
# - moving data from registers to RAM ("spilling") and back ("filling|reloading") is slow
# - "register pressure": how many registers are needed at once; high pressure causes spilling
# - number of registers is limited
#I.e. the goal is to keep the variables with the highest temporal locality of reference in registers
# - i.e. must track jumps, as they connect different groups of variables
#How:
# - graph coloring:
# - vertices are variables
# - variables within same scope are connected with "interference edges"
# - variables connected through jumps are connected with "preference edges"
# - do a K-coloring where:
# - K is number of registers
# - two vertices connected by an interference edge must have different colors, because they are live in the same scope
# - the inverse for preference edges: they should get the same color
# - if a vertex cannot be colored, it means the variable will be spilled
# - produces the best allocations, but is expensive to compute, i.e. better suited to ahead-of-time compilation
# - linear scan allocation:
# - track variables' lifetimes
# - variables with longer lifetimes are more likely to be spilled, i.e. should not be put in registers
# - heuristic graph coloring: graph coloring algorithms using heuristics to make them run faster
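#Example: a greedy K-coloring toy in C (illustrative only; real allocators
#add preference edges, coalescing, and spill cost heuristics):

    #include <stdio.h>

    #define V 5   /* variables a..e (vertices) */
    #define K 2   /* available registers r0..r1 (colors) */

    int main(void) {
      /* interference edges: 1 means "live at the same time" */
      int edge[V][V] = {
        {0, 1, 1, 0, 0},
        {1, 0, 1, 1, 0},
        {1, 1, 0, 1, 1},
        {0, 1, 1, 0, 1},
        {0, 0, 1, 1, 0},
      };
      int color[V];
      for (int v = 0; v < V; v++) {
        int used[K] = {0};
        for (int u = 0; u < v; u++)   /* colors taken by earlier neighbors */
          if (edge[v][u] && color[u] >= 0)
            used[color[u]] = 1;
        color[v] = -1;                /* -1 means spilled to RAM */
        for (int c = 0; c < K; c++)
          if (!used[c]) { color[v] = c; break; }
        if (color[v] >= 0)
          printf("var %c -> r%d\n", 'a' + v, color[v]);
        else
          printf("var %c -> spilled\n", 'a' + v);
      }
      return 0;
    }
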
LOOP FISSION|DISTRIBUTION ==> #Spreading the instructions within one loop across several loops (with the same range)
#Goal: parallelism, better locality of reference
#"Nuclear computation|fission|fusion": same but for threads instead of loops
LOOP FUSION|JAMMING ==> #Inverse of loop fission
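#Example: a minimal C sketch of fission (fusion is the reverse; illustrative):

    void fused(int n, double *a, double *b) {
      for (int i = 0; i < n; i++) {   /* alternates between a and b */
        a[i] *= 2.0;
        b[i] += 1.0;
      }
    }

    void fissioned(int n, double *a, double *b) {
      for (int i = 0; i < n; i++)   /* streams through a only... */
        a[i] *= 2.0;
      for (int i = 0; i < n; i++)   /* ...then through b only */
        b[i] += 1.0;
    }
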
LOOP INTERCHANGE ==> #Swapping inner loop with outer loop
#Goal: better locality of reference, e.g. to match the memory layout (row|column-major order) of matrices
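#Example: a minimal C sketch; C is row-major, so the inner loop should walk
#the rightmost index (illustrative):

    #define N 512

    double sum_slow(double m[N][N]) {
      double s = 0;
      for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
          s += m[i][j];   /* stride-N accesses: cache-hostile */
      return s;
    }

    double sum_fast(double m[N][N]) {
      double s = 0;
      for (int i = 0; i < N; i++)   /* loops interchanged */
        for (int j = 0; j < N; j++)
          s += m[i][j];   /* stride-1 accesses: cache-friendly */
      return s;
    }
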
LOOP NEST OPTIMIZATION (LNO) ==> #Divide a loop into a nested set of smaller loops
#Goal: making sure each tile's working set fits within the CPU cache (the "tiling size")
#Also called "loop tiling|blocking" or "strip mine and interchange"
CACHE OBLIVIOUS ALGORITHM ==> #Algorithm that works well regardless of the CPU cache sizes
#Usually obtained by dividing the problem into very small subproblems and using recursion or greediness
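#Example: a cache-oblivious transpose sketch in C; recursion halves the
#problem until some level fits every cache, with no tuned tile size
#(illustrative):

    #define N 1024

    static void rec(double a[N][N], double b[N][N],
                    int r0, int r1, int c0, int c1) {
      if (r1 - r0 <= 16 && c1 - c0 <= 16) {   /* tiny base case */
        for (int i = r0; i < r1; i++)
          for (int j = c0; j < c1; j++)
            b[j][i] = a[i][j];
      } else if (r1 - r0 >= c1 - c0) {        /* split the longer dimension */
        int rm = (r0 + r1) / 2;
        rec(a, b, r0, rm, c0, c1);
        rec(a, b, rm, r1, c0, c1);
      } else {
        int cm = (c0 + c1) / 2;
        rec(a, b, r0, r1, c0, cm);
        rec(a, b, r0, r1, cm, c1);
      }
    }
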
┌──────────────┐
│ INLINING │
└──────────────┘
INLINE EXPANSION ==> #Replacing function call by function body
#Pros:
# - faster: fewer jumps (which take time and prevent prefetching), smaller stack frame
# - flattening instructions to allow other optimizations
#Cons:
# - takes more space
# - more cache thrashing
#I.e. best for small|simple functions
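#Example: a minimal C sketch (illustrative; "inline" is a hint, the compiler
#decides):

    static inline int square(int x) { return x * x; }

    int sum_of_squares(int n) {
      int s = 0;
      for (int i = 1; i <= n; i++)
        s += square(i);   /* expanded to "s += i * i": no call, no jump */
      return s;
    }
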
TAIL CALL ELIMINATION ==> #Like expansion, but performed on the last instruction of a function
#I.e. the current stack frame can be thrown away
#"Tail recursion":
# - when performed on a recursive call
# - allows inlining recursion
# - allows unbounded recursion depth without overflowing the stack
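#Example: a tail-recursive factorial in C (illustrative; C compilers may
#apply the optimization, but the language does not guarantee it):

    static long fact_acc(long n, long acc) {
      if (n <= 1)
        return acc;
      return fact_acc(n - 1, acc * n);   /* tail call: nothing happens after
                                            it, so the frame can be reused */
    }

    long factorial(long n) { return fact_acc(n, 1); }
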
LOOP UNROLLING|UNWINDING ==> #Flattening a loop into duplicated code
#Pros and cons:
# - like inline expansion
# - also slower if iteration itself contains jumps|branches
#Can be "manual|static" (done by programmer) or "dynamic" (done by compiler)
LOOP SPLITTING ==> #Dividing loops into several loops with same body but different ranges
#Goal: breaking things down to allow other optimizations
#"Loop peeling":
# - when doing it to separate out only the first|last iteration, which is known to be optimizable
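#Example: a minimal C sketch of peeling the first iteration (illustrative):

    void rotate_before(int *dst, const int *src, int n) {
      for (int i = 0; i < n; i++)
        dst[i] = src[i == 0 ? n - 1 : i - 1];   /* branch every iteration */
    }

    void rotate_after(int *dst, const int *src, int n) {
      if (n > 0)
        dst[0] = src[n - 1];          /* peeled first iteration */
      for (int i = 1; i < n; i++)
        dst[i] = src[i - 1];          /* branch-free body */
    }
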
STRENGTH REDUCTION ==> #Replacing expensive instructions with several equivalent cheaper instructions
#Examples:
# - x ** 3 to x * x * x
# - x * 3 to x + x + x
# - x / 8 to x >> 3 ("right-shifting", for unsigned integers)
# - x * 8 to x << 3 ("left-shifting")
#Goal: same result using cheaper instructions; also breaks things down to allow other optimizations
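#Example: a minimal C sketch of in-loop strength reduction, replacing a
#multiplication by an addition on an induction variable (illustrative):

    void fill_before(long *a, int n) {
      for (int i = 0; i < n; i++)
        a[i] = (long)i * 8;   /* one multiply per iteration */
    }

    void fill_after(long *a, int n) {
      long v = 0;
      for (int i = 0; i < n; i++) {
        a[i] = v;
        v += 8;               /* cheaper addition instead */
      }
    }
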
┌────────────────┐
│ PIPELINING │
└────────────────┘
PIPELINE STALL ==> #On a CPU with instruction pipelining, when one stage is delayed,
#causing other stages to be delayed as well (with NOP instructions)
#Also called "bubble" or "pipeline break"
#Jumps can produce pipeline stalls, e.g. when a branch is mispredicted
LOOP INVERSION ==> #Replacing a while loop by an if + do-while
#Goal: one fewer jump instruction per iteration, leading to fewer pipeline stalls
#Can also help with loop-invariant code motion
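#Example: a minimal C sketch (illustrative):

    void zero_while(int *a, int n) {
      int i = 0;
      while (i < n)       /* jump at the top plus a jump back per iteration */
        a[i++] = 0;
    }

    void zero_inverted(int *a, int n) {
      int i = 0;
      if (i < n) {        /* guard runs once */
        do {
          a[i++] = 0;
        } while (i < n);  /* single backward branch per iteration */
      }
    }
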
SOFTWARE PIPELINING ==> #Like CPU pipelining, but at the software level
#I.e. overlapping instructions from different loop iterations, launching at once several independent instructions with similar computation time
#A form of "instruction scheduling"
┌─────────────────┐
│ PARALLELISM │
└─────────────────┘
PARALLELISM ==> #See parallelism doc