% MGP approximates marginal GP predictions.
%
% Suppose we have a Gaussian process model on a latent function f:
%
% p(f | \theta) = GP(f; \mu(x; \theta), K(x, x'; \theta)),
%
% where \theta are the hyperparameters of the model. Suppose we have a
% dataset D = (X, y) of observations and a test point x*. This
% function returns the mean and variance of the approximate marginal
% predictive distributions for the associated observation value y* and
% latent function value f*:
%
% p(y* | x*, D) = \int p(y* | x*, D, \theta) p(\theta | D) d\theta,
% p(f* | x*, D) = \int p(f* | x*, D, \theta) p(\theta | D) d\theta,
%
% where we have marginalized over the hyperparameters \theta. These
% approximate predictive distributions are derived using the "MGP"
% approximation described in
%
% Garnett, R., Osborne, M., and Hennig, P. Active Learning of Linear
% Embeddings for Gaussian Processes. (2014). 30th Conference on
% Uncertainty in Artificial Intelligence (UAI 2014).
%
% Dependencies
% ------------
%
% This code is written to be interoperable with the GPML MATLAB
% toolbox, available here:
%
% http://www.gaussianprocess.org/gpml/code/matlab/doc/
%
% The GPML toolbox must be in your MATLAB path for this function to
% work. This function also depends on the gpml_extensions repository,
% available here:
%
% https://github.com/rmgarnett/gpml_extensions/
%
% which must also be in your MATLAB path.
%
% Notes
% -----
%
% The approximation relies first on a Gaussian approximation to the
% hyperparameter posterior p(\theta | D):
%
% p(\theta | D) ~ N(\hat{\theta}, \Sigma).
%
% The hyperparameter vector passed in here is assumed to represent the
% mean of this approximate posterior, \hat{\theta}. The user may, if
% desired, also specify the covariance or precision matrix of the
% approximate posterior. See "Specifying Covariance" and "Specifying
% Precision" below for details.
%
% If not specified, the code derives an approximate posterior by
% performing a Laplace approximation at the given value of
% \hat{\theta}. See "Laplace Approximation" below for details.
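%
% Concretely, writing V = Var[f* | x*, D, \hat{\theta}] for the latent
% predictive variance conditioned on \hat{\theta}, the MGP
% approximation leaves the predictive mean unchanged and inflates the
% predictive variance (this is the update implemented at the end of
% this file):
%
%   Var_MGP[f* | x*, D] = (4/3) V
%                       + (dmu/dtheta)' \Sigma (dmu/dtheta)
%                       + (dV/dtheta)' \Sigma (dV/dtheta) / (3 V),
%
% where the derivatives of the latent predictive mean mu and variance
% V are taken with respect to \theta and evaluated at \hat{\theta}.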
%
% Specifying Covariance
% ---------------------
%
% To specify the covariance matrix \Sigma of the approximate
% hyperparameter posterior distribution, augment the GPML
% hyperparameter struct theta (representing the approximate posterior
% mean \hat{\theta}) with a field .Sigma, containing a struct with
% fields:
%
% theta.Sigma.value
% theta.Sigma.covariance_ind
% theta.Sigma.likelihood_ind
% theta.Sigma.mean_ind
%
% theta.Sigma.value should be an (m x m) matrix containing the desired
% covariance matrix, where m is the total number of hyperparameters.
% The remaining fields contain indices into Sigma.value identifying
% the rows/columns corresponding to the covariance, likelihood, and
% mean hyperparameters, respectively. Note
% that this is the same structure as used for specifying the
% hyperparameter Hessian matrix HnlZ in the gpml_extensions
% repository; see hessians.m in that repository for more information.
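%
% As an illustrative sketch (the index assignments here are
% hypothetical and depend on the model), suppose the model has two
% covariance hyperparameters, one likelihood hyperparameter, and no
% mean hyperparameters, and that Sigma is a given (3 x 3) covariance
% matrix:
%
%   theta.Sigma.value          = Sigma;
%   theta.Sigma.covariance_ind = [1, 2]; % rows/columns for theta.cov
%   theta.Sigma.likelihood_ind = 3;      % row/column for theta.lik
%   theta.Sigma.mean_ind       = [];     % no mean hyperparameters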
%
% Specifying Precision
% --------------------
%
% To specify the precision matrix \Sigma^{-1} of the approximate
% hyperparameter posterior distribution, augment the GPML
% hyperparameter struct theta (representing the approximate posterior
% mean \hat{\theta}) with a field .Sigma_inv. See the "Specifying
% Covariance" section above for the expected format of this field,
% noting that in this case theta.Sigma_inv.value will
% contain the desired (m x m) hyperparameter precision matrix.
%
% Laplace Approximation
% ---------------------
%
% If the user does not specify the covariance or precision matrix of
% the approximate hyperparameter posterior distribution, then a
% Laplace approximation will be performed.
%
% In this case, this code requires that the provided hyperparameters
% be the MLE hyperparameters:
%
% \hat{\theta} = argmax_\theta log p(y | X, \theta),
%
% or, if using a hyperparameter prior p(\theta), the MAP
% hyperparameters:
%
% \hat{\theta} = argmax_\theta log p(y | X, \theta) + log p(\theta).
%
% Note that this function does not perform the maximization over
% \theta; rather, it assumes the given hyperparameters represent the
% MLE/MAP point.
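%
% One way to obtain such a point is to minimize the negative log
% marginal likelihood (or, with hyperparameter priors, the negative
% unnormalized log posterior) using GPML's minimize.m; a sketch, with
% an arbitrary budget of 100 function evaluations:
%
%   theta = minimize(theta, @gp, -100, inference_method, ...
%                    mean_function, covariance_function, likelihood, x, y);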
%
% Usage
% -----
%
% The usage of mgp.m is identical to that of the gp(...) function
% from the GPML toolbox in prediction mode:
%
% [y_star_mean, y_star_variance, ...
% f_star_mean, f_star_variance, ...
% log_probabilities, posterior, ...
% y_star_mean_gp, y_star_variance_gp, ...
% f_star_variance_gp, log_probabilities_gp] = ...
% mgp(theta, inference_method, mean_function, covariance_function, ...
% likelihood, x, y, x_star, y_star)
%
% There are four additional output arguments: y_star_mean_gp,
% y_star_variance_gp, f_star_variance_gp, and
% log_probabilities_gp. These provide:
%
% E[y* | x*, D, \hat{\theta}],
% Var[y* | x*, D, \hat{\theta}],
% Var[f* | x*, D, \hat{\theta}],
%
% and
%
% log p(y* | x*, D, \hat{\theta}),
%
% respectively. That is, these give the outputs that would have been
% produced by calling gp(...) directly, with no MGP corrections
% applied. Note that under the MGP approximation, f_star_mean is
% unchanged, so it equals what would otherwise be the (redundant)
% f_star_mean_gp output.
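%
% A minimal usage sketch for one-dimensional regression, assuming
% standard GPML mean/covariance/likelihood handles and an inference
% method providing the extended derivative outputs from
% gpml_extensions:
%
%   theta.mean = [];        % meanZero: no mean hyperparameters
%   theta.cov  = [0; 0];    % covSEiso: log length scale, log output scale
%   theta.lik  = log(0.1);  % likGauss: log noise standard deviation
%
%   [y_star_mean, y_star_variance, f_star_mean, f_star_variance] = ...
%       mgp(theta, inference_method, {@meanZero}, {@covSEiso}, ...
%           {@likGauss}, x, y, x_star);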
%
% Inputs
% ------
%
% theta: a GPML hyperparameter struct containing the
% mean of the approximate hyperparameter
% posterior (and possibly the
% covariance/precision of this approximation,
% see above)
% inference_method: a GPML inference method
% mean_function: a GPML mean function
% covariance_function: a GPML covariance function
% likelihood: a GPML likelihood
% x: training observation locations (n x D)
% y: training observation values (n x 1) or
% GPML posterior struct
% x_star: test observation locations (n_star x D)
% y_star: test observation values (n_star x 1) (optional)
%
% Outputs
% -------
%
% y_star_mean: the approximate E[y* | x*, D]
% y_star_variance: the approximate Var[y* | x*, D]
% f_star_mean: the approximate E[f* | x*, D]
% f_star_variance: the approximate Var[f* | x*, D]
% log_probabilities: if test observation values y_star are
% provided, a vector containing the
% approximate log predictive probabilities
%
% log p(y* | x*, D).
%
% posterior: a GPML posterior struct corresponding to the
% provided training data and the hyperparameters
% \hat{\theta}
%
% Outputs conditioned on \hat{\theta} (no MGP corrections applied,
% matches outputs from gp.m):
%
% y_star_mean_gp: E[y* | x*, D, \hat{\theta}]
% y_star_variance_gp: Var[y* | x*, D, \hat{\theta}]
% f_star_variance_gp: Var[f* | x*, D, \hat{\theta}]
% log_probabilities_gp: if test observation values y_star are
% provided, a vector containing the
% approximate log predictive probabilities
%
% log p(y* | x*, D, \hat{\theta}).
%
% See also GP.
% Copyright (c) 2014--2015 Roman Garnett.
function [y_star_mean, y_star_variance, ...
f_star_mean, f_star_variance, ...
log_probabilities, posterior, ...
y_star_mean_gp, y_star_variance_gp, ...
f_star_variance_gp, log_probabilities_gp] = ...
mgp(theta, inference_method, mean_function, covariance_function, ...
likelihood, x, y, x_star, y_star)
% perform initial argument checks/transformations
[theta, inference_method, mean_function, covariance_function, likelihood] ...
= check_arguments(theta, inference_method, mean_function, ...
covariance_function, likelihood, x);
% convenience handles
mu = @(varargin) feval(mean_function{:}, theta.mean, varargin{:});
K = @(varargin) feval(covariance_function{:}, theta.cov, varargin{:});
inference = @() feval(inference_method{:}, theta, ...
mean_function, covariance_function, likelihood, x, y);
% S will contain either Sigma or Sigma_inv, and Sigma_times will
% provide a handle for computing the product of \Sigma with a vector
have_posterior = false;
if (isfield(theta, 'Sigma'))
have_posterior = true;
S = theta.Sigma;
Sigma_times = @(v) (S.value * v);
elseif (isfield(theta, 'Sigma_inv'))
have_posterior = true;
S = theta.Sigma_inv;
Sigma_times = @(v) (S.value \ v);
end
% find GP posterior at the approximate posterior mean \hat{\theta}, as well
% as the derivatives of alpha and diag W^{-1} with respect to
% \theta
if (have_posterior)
[posterior, ~, ~, dalpha, dWinv] = inference();
else
% neither covariance nor precision matrix given, perform Laplace
% approximation
% set S to the Hessian of the negative log likelihood/posterior,
% which serves as the precision of the Laplace approximation
[posterior, ~, ~, dalpha, dWinv, S] = inference();
Sigma_times = @(v) (S.value \ v);
end
% find the predictive distribution conditioned on \theta
if ((nargin > 8) && (nargout > 8) && ~isempty(y_star))
% log probabilities requested
[y_star_mean_gp, y_star_variance_gp, f_star_mean, f_star_variance_gp, ...
log_probabilities_gp] = gp(theta, inference_method, mean_function, ...
covariance_function, likelihood, x, posterior, x_star, y_star);
else
[y_star_mean_gp, y_star_variance_gp, f_star_mean, f_star_variance_gp] ...
= gp(theta, inference_method, mean_function, covariance_function, ...
likelihood, x, posterior, x_star);
log_probabilities_gp = [];
end
% precompute k*' [ K + W^{-1} ]^{-1}; it's used a lot
k_star = K(x, x_star);
% handle different posterior parameterizations
if (is_chol(posterior.L))
% posterior.L contains chol(I + W^{1/2} K W^{1/2})
% [ K + W^{-1} ]^{-1} =
% W^{1/2} [I + W^{1/2} K W^{1/2} ]^{-1} W^{1/2}
k_star_V_inv = ...
bsxfun(@times, posterior.sW', ...
solve_chol(posterior.L, bsxfun(@times, posterior.sW, k_star))');
else
% posterior.L contains -[ K + W^{-1} ]^{-1}
k_star_V_inv = -k_star' * posterior.L;
end
% for each hyperparameter \theta_i, we must compute:
%
% d mu_{f | D}(x*; \theta) / d \theta_i, and
% d V_{f | D}(x*; \theta) / d \theta_i.
num_test = size(x_star, 1);
num_hyperparameters = size(S.value, 1);
df_star_mean = zeros(num_test, num_hyperparameters);
df_star_variance = zeros(num_test, num_hyperparameters);
% partials with respect to covariance hyperparameters
for i = 1:numel(S.covariance_ind)
dK = K(x, [], i);
dk_star = K(x, x_star, i);
dk_star_star = K(x_star, 'diag', i);
df_star_mean(:, S.covariance_ind(i)) = ...
dk_star' * posterior.alpha + k_star' * dalpha.cov(:, i);
df_star_variance(:, S.covariance_ind(i)) = ...
dk_star_star - ...
product_diag(k_star_V_inv, ...
2 * dk_star - (dK + diag(dWinv.cov(:, i))) * k_star_V_inv');
end
% partials with respect to likelihood hyperparameters
for i = 1:numel(S.likelihood_ind)
df_star_mean(:, S.likelihood_ind(i)) = k_star' * dalpha.lik(:, i);
df_star_variance(:, S.likelihood_ind(i)) = ...
product_diag(k_star_V_inv, ...
bsxfun(@times, dWinv.lik(:, i), k_star_V_inv'));
end
% partials with respect to mean hyperparameters
for i = 1:numel(S.mean_ind)
dm_star = mu(x_star, i);
df_star_mean(:, S.mean_ind(i)) = dm_star + k_star' * dalpha.mean(:, i);
df_star_variance(:, S.mean_ind(i)) = ...
product_diag(k_star_V_inv, ...
bsxfun(@times, dWinv.mean(:, i), k_star_V_inv'));
end
% the MGP approximation inflates the predictive variance to
% account for uncertainty in \theta
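% for each test point, with V = f_star_variance_gp, dm the
% corresponding row of df_star_mean, and dv the corresponding row of
% df_star_variance, this computes
%
%   V_MGP = (4/3) V + dm \Sigma dm' + (dv \Sigma dv') / (3 V)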
f_star_variance = ...
(4 / 3) * f_star_variance_gp + ...
product_diag(df_star_mean, Sigma_times(df_star_mean')) + ...
product_diag(df_star_variance, Sigma_times(df_star_variance')) ./ ...
(3 * f_star_variance_gp);
% approximate predictive distribution for y* (observations). If y* is
% given, compute log predictive probabilities.
if ((nargin > 8) && (nargout > 4) && ~isempty(y_star))
[log_probabilities, y_star_mean, y_star_variance] = ...
feval(likelihood{:}, theta.lik, y_star, f_star_mean, f_star_variance);
else
[~, y_star_mean, y_star_variance] = ...
feval(likelihood{:}, theta.lik, [], f_star_mean, f_star_variance);
log_probabilities = [];
end
end
% returns diag(AB) without computing the full product
function result = product_diag(A, B)
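% for A (n x m) and B (m x n), diag(AB)(i) = \sum_k A(i, k) B(k, i),
% which is the i-th column sum of (B .* A')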
result = sum(B .* A')';
end