<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- saved from url=(0014)about:internet -->
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Norms</title>
<base target="_blank"/>
<style type="text/css">
body, td {
font-family: sans-serif;
background-color: white;
font-size: 12px;
margin: 8px;
}
tt, code, pre {
font-family: 'DejaVu Sans Mono', 'Droid Sans Mono', 'Lucida Console', Consolas, Monaco, monospace;
}
h1 {
font-size:2.2em;
}
h2 {
font-size:1.8em;
}
h3 {
font-size:1.4em;
}
h4 {
font-size:1.0em;
}
h5 {
font-size:0.9em;
}
h6 {
font-size:0.8em;
}
a:visited {
color: rgb(50%, 0%, 50%);
}
pre {
margin-top: 0;
max-width: 95%;
border: 1px solid #ccc;
}
pre code {
display: block; padding: 0.5em;
}
code.r {
background-color: #F8F8F8;
}
table, td, th {
border: none;
}
blockquote {
color:#666666;
margin:0;
padding-left: 1em;
border-left: 0.5em #EEE solid;
}
hr {
height: 0px;
border-bottom: none;
border-top-width: thin;
border-top-style: dotted;
border-top-color: #999999;
}
@media print {
* {
background: transparent !important;
color: black !important;
filter:none !important;
-ms-filter: none !important;
}
body {
font-size:12pt;
max-width:100%;
}
a, a:visited {
text-decoration: underline;
}
hr {
visibility: hidden;
page-break-before: always;
}
pre, blockquote {
padding-right: 1em;
page-break-inside: avoid;
}
tr, img {
page-break-inside: avoid;
}
img {
max-width: 100% !important;
}
@page :left {
margin: 15mm 20mm 15mm 10mm;
}
@page :right {
margin: 15mm 10mm 15mm 20mm;
}
p, h2, h3 {
orphans: 3; widows: 3;
}
h2, h3 {
page-break-after: avoid;
}
}
</style>
<!-- MathJax scripts -->
<script type="text/javascript" src="https://c328740.ssl.cf1.rackcdn.com/mathjax/2.0-latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
</head>
<body>
<p>Often we want to measure the size of some mathematical object. When the object can be built out of real numbers in the way that scalars, vectors and matrices can, there is a consistent strategy for defining the size of the object as a whole. We will call the size of the object its norm and write the norm of \(x\) as \(N(x)\) or, more commonly, as \(||x||\).</p>
<p>For scalars, we can use absolute values as a measure of size:<br/>
\[<br/>
N(x) = ||x|| \triangleq |x|<br/>
\]<br/>
For reasons that will be clear in a moment, we can also write this in a “more complicated” way:<br/>
\[<br/>
N(x) = ||x|| \triangleq (|x|^2)^{\frac{1}{2}}<br/>
\]</p>
<p>For \(n\)-dimensional vectors, we can use Euclidean norms as a measure of size:<br/>
\[<br/>
N(x) = ||x|| \triangleq (\sum_{i = 1}^{n} x_{i}^2)^{\frac{1}{2}}<br/>
\]<br/>
Again, we can write this in a “more complicated” way:<br/>
\[<br/>
N(x) = ||x|| \triangleq (\sum_{i = 1}^{n} |x_{i}|^2)^{\frac{1}{2}}<br/>
\]</p>
<p>For \(m\)-by-\(n\) matrices, we can use what are called Frobenius norms as a measure of size. We write them in the “more complicated” way from the start, since you are less likely to have seen them in another form:<br/>
\[<br/>
N(X) = ||X|| \triangleq (\sum_{i = 1}^{m} \sum_{j = 1}^{n} |x_{i, j}|^2)^{\frac{1}{2}}<br/>
\]</p>
<p>When these measures of size are all presented right next to one another, it becomes clear that they are actually the same mathematical operation given different names depending on the type of object being measured. For any object with extents along \(k\) dimensions built up out of real numbers, you sum the squared entries and then compute the square root. Scalars are a degenerate case in which there are \(0\) dimensions.</p>
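<p>To make the point concrete, here is a minimal sketch in R (the function name <code>unified_norm</code> and the example values are purely illustrative): one operation, applied to a scalar, a vector and a matrix, reproduces the absolute value, the Euclidean norm and the Frobenius norm.</p>
<pre><code class="r"># One operation for scalars, vectors and matrices:
# square the entries, sum them, take the square root.
unified_norm &lt;- function(x) {
  sqrt(sum(abs(x)^2))
}

unified_norm(-3)                        # scalar: |x| = 3
unified_norm(c(3, 4))                   # vector: Euclidean norm = 5
unified_norm(matrix(c(1, 2, 3, 4), 2))  # matrix: Frobenius norm = sqrt(30)
</code></pre>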
<p>Note that the operations of squaring and computing square roots occur in all of these definitions. This is not a deep fact about the essential truths of our universe: we typically square for mathematical convenience, because the resulting norm functions are differentiable. Nothing about the concept of size requires it, and modern statistics and machine learning have moved beyond this convention to explore a much richer set of norms.</p>
<p>This richer set of norms, which provides a natural extension to the squared norms defined above, consists of the \(L_{p}\) norms, also known as the Minkowski \(p\)-metrics. For scalars, the \(L_{p}\) norm is<br/>
\[<br/>
N_{p}(x) = ||x||_{p} \triangleq (|x|^p)^{\frac{1}{p}}<br/>
\]<br/>
For vectors, the \(L_{p}\) norm is,<br/>
\[<br/>
N_{p}(x) = ||x||_{p} \triangleq (\sum_{i = 1}^{n} |x_{i}|^p)^{\frac{1}{p}}<br/>
\]<br/>
For matrices, the \(L_{p}\) norm is<br/>
\[<br/>
N_{p}(X) = ||X||_{p} \triangleq (\sum_{i = 1}^{m} \sum_{j = 1}^{n} |x_{i, j}|^p)^{\frac{1}{p}}<br/>
\]</p>
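<p>As a sketch of the general pattern, a single R function handles all three kinds of objects for any \(p \geq 1\) (the name <code>lp_norm</code> is illustrative, not a standard library function):</p>
<pre><code class="r"># The L_p norm: raise the absolute entries to the p-th power,
# sum them, then take the p-th root.
lp_norm &lt;- function(x, p) {
  sum(abs(x)^p)^(1 / p)
}

lp_norm(c(3, 4), p = 2)                     # Euclidean norm: 5
lp_norm(c(3, 4), p = 1)                     # L_1 norm: 7
lp_norm(matrix(c(1, -2, 3, -4), 2), p = 1)  # matrix L_1 norm: 10
</code></pre>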
<p>The squared norms we described above are all special cases in which \(p = 2\). But now that you've seen all the forms of the \(L_{p}\) metric, the “more complicated” ways to write things are much more natural than the simpler ways, no?</p>
<p>Thankfully, once you understand norms, we can trivially describe the three fundamental objects of modern statistical theory:</p>
<ul>
<li>Means: \(\mu^{*} = \arg \min_{\mu} \sum_{i = 1}^{n} ||x_{i} - \mu||_{2}^{2}\)</li>
<li>OLS linear regressions: \(\beta^{*} = \arg \min_{\beta} ||Y - X\beta||_{2}\)</li>
<li>Rank-\(k\) SVD's: \(\bar{X}^{*} = \arg \min_{\bar{X}} ||X - \bar{X}||_{2}\), where \(\bar{X}\) has rank \(k\)</li>
</ul>
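<p>A quick numerical check of the first two items above (a sketch only; the simulated data and the use of the generic optimizers <code>optimize</code> and <code>optim</code> are my own choices, not part of the original argument): minimizing the summed squared \(L_{2}\) discrepancy recovers the sample mean, and minimizing \(||Y - X\beta||_{2}\) recovers the OLS coefficients.</p>
<pre><code class="r">set.seed(1)

# The mean minimizes the sum of squared L_2 discrepancies.
x &lt;- rnorm(100, mean = 3)
optimize(function(mu) sum((x - mu)^2), interval = range(x))$minimum
mean(x)  # should agree to several decimal places

# OLS minimizes ||Y - X beta||_2.
n &lt;- 100
X &lt;- cbind(1, rnorm(n))
Y &lt;- as.vector(X %*% c(1, 2) + rnorm(n))
optim(c(0, 0), function(beta) sqrt(sum((Y - X %*% beta)^2)))$par
coef(lm(Y ~ X[, 2]))  # should agree closely
</code></pre>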
<p>The reason for this is simple: almost all of statistical theory can be understood in terms of prediction error minimization. We have (1) a system that generates predictions, (2) a ground truth we want to predict and (3) a measure of the error in our predictions. We typically measure the error in our predictions using a norm of the discrepancy between our predictions and the ground truth. The standard algorithms in statistics are simply mechanisms for making out-of-the-box predictions that minimize errors measured using \(L_{2}\) norms.</p>
<p>Once you master means, OLS and SVD's, we can move on to:</p>
<ul>
<li>Medians: \(m^{*} = \arg \min_{m} \sum_{i = 1}^{n} ||x_{i} - m||_{1}\)</li>
<li>LAD linear regression: \(\beta^{*} = \arg \min_{\beta} ||Y - X\beta||_{1}\)</li>
<li>L1-norm low-rank approximation (a robust analogue of the SVD): \(\bar{X}^{*} = \arg \min_{\bar{X}} ||X - \bar{X}||_{1}\), where \(\bar{X}\) has rank \(k\)</li>
</ul>
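<p>The same numerical check works for the \(L_{1}\) versions (again just a sketch with simulated data and a generic optimizer): swapping the squared error for the absolute error turns the mean into the median and OLS into LAD regression.</p>
<pre><code class="r">set.seed(1)

# The median minimizes the sum of L_1 discrepancies.
x &lt;- rnorm(101, mean = 3)
optimize(function(m) sum(abs(x - m)), interval = range(x))$minimum
median(x)  # should agree to several decimal places

# LAD regression minimizes ||Y - X beta||_1.
n &lt;- 100
X &lt;- cbind(1, rnorm(n))
Y &lt;- as.vector(X %*% c(1, 2) + rnorm(n))
optim(c(0, 0), function(beta) sum(abs(Y - X %*% beta)))$par
</code></pre>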
<p>And after that, we can penalize the parameters of these basic models using the same norms (a numerical sketch follows the list). We then get:</p>
<ul>
<li>Bayesian inference for means with Normal priors: \(\mu^{*} = \arg \min_{\mu} \sum_{i = 1}^{n} ||x_{i} - \mu||_{2}^{2} + \lambda ||\mu||_{2}^{2}\)</li>
<li>Ridge regression: \(\beta^{*} = \arg \min_{\beta} ||Y - X\beta||_{2}^{2} + \lambda ||\beta||_{2}^{2}\)</li>
<li>L2 regularized SVD: \(\bar{X}^{*} = \arg \min_{\bar{X}} ||X - \bar{X}||_{2}^{2} + \lambda ||\bar{X}||_{2}^{2}\), where \(\bar{X}\) has rank \(k\)</li>
<li>Bayesian inference for medians with Normal priors: \(m^{*} = \arg \min_{m} \sum_{i = 1}^{n} ||x_{i} - m||_{1} + \lambda ||m||_{2}^{2}\)</li>
<li>LASSO regression: \(\beta^{*} = \arg \min_{\beta} ||Y - X\beta||_{2}^{2} + \lambda ||\beta||_{1}\)</li>
<li>L1 regularized SVD: \(\bar{X}^{*} = \arg \min_{\bar{X}} ||X - \bar{X}||_{2}^{2} + \lambda ||\bar{X}||_{1}\), where \(\bar{X}\) has rank \(k\)</li>
</ul>
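<p>As a final sketch (the simulated data, the value of \(\lambda\) and the reliance on a generic optimizer are all my own illustrative choices), the ridge and LASSO objectives are just the OLS objective with a norm of the coefficient vector added on:</p>
<pre><code class="r">set.seed(1)
n &lt;- 100
X &lt;- cbind(rnorm(n), rnorm(n))
Y &lt;- as.vector(X %*% c(1, 0) + rnorm(n))
lambda &lt;- 5

# Ridge: squared L_2 loss plus a squared L_2 penalty on beta.
ridge &lt;- optim(c(0, 0), function(b) sum((Y - X %*% b)^2) + lambda * sum(b^2))$par

# LASSO: squared L_2 loss plus an L_1 penalty on beta.
lasso &lt;- optim(c(0, 0), function(b) sum((Y - X %*% b)^2) + lambda * sum(abs(b)))$par

ridge  # the penalty pulls the fitted coefficients toward zero
lasso  # the L_1 penalty shrinks small coefficients toward (or to) zero
</code></pre>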
<p>We also note that the following models are all special cases of, variations of, or straightforward applications of the standard SVD:</p>
<ul>
<li>Factor analysis</li>
<li>PCA</li>
<li>MDS</li>
<li>Procrustes analysis</li>
<li>More…</li>
</ul>
<p>Really, predictive error minimization is a universal principle in statistical theory.</p>
</body>
</html>