Add post draft.

aterenin · Nov 15, 2024 · 41144b2
1 parent dde0961

Showing 9 changed files with 248 additions and 18 deletions.
diff --git a/content/2020-03-01-Variational-Integrator-Networks/index.md b/content/2020-03-01-Variational-Integrator-Networks/index.md
@@ -44,7 +44,7 @@ Using VINs allows us to easily learn models with physical forecasting behaviour
 
 # From Residual Networks to Variational Integrator Networks
 
-The idea is simple: if we view neural networks as dynamical systems[^haber][^E][^chen]---and discretize them in a manner that preserves qualitative physical properties[^marsden]---we can define network architectures that obey the laws of physics. 
+The idea is simple: if we view neural networks as dynamical systems[^haber] [^E] [^chen]---and discretize them in a manner that preserves qualitative physical properties[^marsden]---we can define network architectures that obey the laws of physics. 
 A particularly salient example of the kind of inductive bias we are interested in is the presence of conservation laws, for instance conservation of energy or conservation of momentum.
 
 A canonical description of classical physical dynamical systems is Lagrangian mechanics, where a system is completely characterized by its Lagrangian `$L(q, \dot{q}, t)$`, a scalar function that encodes underlying physical properties. 
@@ -57,7 +57,7 @@ $$
 $$
 ```
 
-discretized using an Euler scheme,[^haber][^E][^chen] giving
+discretized using an Euler scheme,[^haber] [^E] [^chen] giving
 
 ```
 $$

diff --git a/content/2020-09-25-Riemannian-Matern-GP/index.md b/content/2020-09-25-Riemannian-Matern-GP/index.md
@@ -39,7 +39,7 @@ where `$K_\nu$` is the modified Bessel function of the second kind, and `$\sigma
 As `$\nu\to\infty$`, the Matérn kernel converges to the widely-used squared exponential kernel.
 
 To generalize this class of Gaussian processes to the Riemannian setting, one might consider replacing Euclidean distances `$\Vert x-x' \Vert$` with the geodesic distance `$d_g(x, x')$`. 
-Unfortunately, this doesn't necessarily define a valid kernel: in particular, the geodesic squared exponential kernel already fails to be positive semi-definite for most manifolds, due to a recent no-go result.[^nogo][^nogo2]
+Unfortunately, this doesn't necessarily define a valid kernel: in particular, the geodesic squared exponential kernel already fails to be positive semi-definite for most manifolds, due to a recent no-go result.[^nogo] [^nogo2]
 We therefore adopt a different approach, which is not based on geodesics.
 
 # Stochastic partial differential equations
@@ -75,7 +75,7 @@ $$
 ```
 
 where `$C$` is a constant chosen so that the variance is `$\sigma^2$` on average.[^sqexp]
-By truncating this sum, we obtain a workable approximation for the kernel,[^sm] allowing us to train the process on data using standard methods, such as sparse inducing point techniques.[^vfe][^gpbd]
+By truncating this sum, we obtain a workable approximation for the kernel,[^sm] allowing us to train the process on data using standard methods, such as sparse inducing point techniques.[^vfe] [^gpbd]
 The resulting posterior Gaussian processes are visualized below.
 
 
@@ -88,7 +88,7 @@ This equation is very well-studied, and a number of scalable techniques for solv
 # Concluding remarks
 
 We present techniques for computing the kernels, spectral measures, and Fourier feature approximations of Riemannian Matérn and squared exponential Gaussian processes, using spectral techniques via the Laplace--Beltrami operator.
-This allows us to train these processes via standard techniques, such as variational inference via sparse inducing point methods,[^vfe][^gpbd] or Fourier feature methods.[^rff]
+This allows us to train these processes via standard techniques, such as variational inference via sparse inducing point methods,[^vfe] [^gpbd] or Fourier feature methods.[^rff]
 In turn, this allows Riemannian Matérn Gaussian processes to easily be deployed in mini-batch, online, and non-conjugate settings.
 We hope this work enables practitioners to easily deploy techniques such as Bayesian optimization in this setting.
 

diff --git a/content/2023-12-10-Stochastic-Gradient-Descent-GP/index.md b/content/2023-12-10-Stochastic-Gradient-Descent-GP/index.md
@@ -38,7 +38,7 @@ Let's see a simple comparison between standard large-scale Gaussian process appr
 {% end %}
 
 From this comparison, one can see that different large-scale Gaussian process approximations work well in different regimes.
-Conjugate-gradient-based Gaussian processes[^cg] work well under large-domain asymptotics, whereas sparse Gaussian processes trained via variational inference[^ip-s][^ip-v] work well under infill asymptotics.
+Conjugate-gradient-based Gaussian processes[^cg] work well under large-domain asymptotics, whereas sparse Gaussian processes trained via variational inference[^ip-s] [^ip-v] work well under infill asymptotics.
 One can show theory which suggests this this distinction holds beyond one-dimensional problems.[^ip-theory]
 In contrast, the stochastic gradient descent variant we present looks very reasonable in both cases: it empirically converges in most regions of state space under infill asymptotics, and converges everywhere under large-domain asymptotics.
 Let's look at this algorithm in more details.
@@ -48,7 +48,7 @@ Let's look at this algorithm in more details.
 
 To formulate stochastic gradient descent for posterior sampling, let's begin by writing down a random quadratic optimization problem for computing posterior samples.
 Let `$f \sim\mathrm{GP}(0,k)$` be the prior, and let `$\boldsymbol{y}\mid f\sim\mathrm{N}(f(\boldsymbol{x}), \mathbf\Sigma)$` be the likelihood.
-Let's begin with the *pathwise conditioning*[^efficient-sampling][^pathwise-conditioning] formula for posterior random functions, namely
+Let's begin with the *pathwise conditioning*[^efficient-sampling] [^pathwise-conditioning] formula for posterior random functions, namely
 
 ```
 $$
@@ -70,7 +70,7 @@ $$
 
 We can stochastically estimate the large sum using minibatches.
 Similarly, we can apply a Fourier-feature-based stochastic estimator for the squared norm term.
-We use *efficient sampling* to approximately sample the prior `$f(x_i)$` using Fourier features.[^efficient-sampling][^pathwise-conditioning]
+We use *efficient sampling* to approximately sample the prior `$f(x_i)$` using Fourier features.[^efficient-sampling] [^pathwise-conditioning]
 This gives us a subquadratic stochastic estimator for this optimization objective which is almost unbiased, in the sense that the only bias present is from efficiently sampling the prior.
 To reduce this objective's variance, we apply a number of tricks, including carefully shifting the `$\boldsymbol\varepsilon$` noise term into the regularizer, which are described in the paper.
 The result is a practical stochastic optimization objective for Gaussian process posterior samples.
@@ -83,7 +83,7 @@ In the paper, use stochastic gradient descent with Nesterov momentum, gradient c
 Let's see how this algorithm performs, in particular how it is affected by observation noise in the likelihood.
 
 {% figure(alt=["Convergence of stochastic gradient descent for the Gaussian process mean"] src=["exact_metrics.svg"] dark_invert=[true]) %}
-**Figure 2.** Convergence of stochastic gradient descent for the Gaussian process posterior mean, in terms of training and test error, along with Euclidean error for the representer weights
+**Figure 2.** Convergence of stochastic gradient descent for the Gaussian process posterior mean, in terms of training and test error, along with Euclidean error for the representer weights.
 {% end %}
 
 From this plot, it is clear that stochastic gradient descent does not converge approximately to the correct representer weights.
@@ -151,7 +151,7 @@ This suggests the benign non-convergence, which we previously saw in one dimensi
 
 # Conclusion
 
-In this, we explored using stochastic gradient descent to approximately compute Gaussian process posteriors, by way of means and function samples.
+In this work, we explored using stochastic gradient descent to approximately compute Gaussian process posteriors, by way of means and function samples.
 We examined how to derive appropriate stochastic optimization objectives for doing so, and showed that SGD can produce accurate predictions even in cases where it does not converge to the respective optimum under the given compute budget.
 We developed a spectral characterization of the effect of non-convergence in terms of the spectral basis functions.
 We showed that, on a Thompson sampling benchmark where well-calibrated uncertainty is critical, SGD matches or exceeds the performance of more computationally expensive baselines.

diff --git a/content/2024-05-02-vGPMP-Motion-Planning/index.md b/content/2024-05-02-vGPMP-Motion-Planning/index.md
@@ -31,13 +31,13 @@ Our results show the proposal achieves a reasonable balance between the motion p
 # Applying Variational Gaussian Processes to Motion Planning
 
 We begin with a motion planning framework, which we call *variational Gaussian process motion planning (vGPMP)*.
-This framework is based on variational Gaussian processes, which were originally introduced for scalability:[^vfe][^gpbd] here, we instead apply them to create a straightforward way to parameterize motion plans.
+This framework is based on variational Gaussian processes, which were originally introduced for scalability:[^vfe] [^gpbd] here, we instead apply them to create a straightforward way to parameterize motion plans.
 Let `$\mathcal{T}$` represent time: our motion plan is a map `$f: \mathcal{T} \to \mathbb{R}^d$`, where the output space represents each of the robot's joints.
 We parameterize `$f$` as a posterior Gaussian process, conditioned on `$f(\boldsymbol{z}) = \boldsymbol{u}$`, where `$\boldsymbol{z}$` is a set of inducing locations `$\boldsymbol{z} \in \mathcal{T}^m$`, and `$\boldsymbol{u}$` are robot joint states at times `$\boldsymbol{z}$`. 
 We interpret `$(z_j,u_j)$`-pairs as *waypoints* through which the robot should move.
 Our precise formulation in the paper also includes a bijective map which accounts for joint constraints: we suppress this here for simplicity.
 
-To draw motion plans, we apply *pathwise conditioning*,[^efficient-sampling][^pathwise-conditioning] and represent posterior samples as
+To draw motion plans, we apply *pathwise conditioning*,[^efficient-sampling] [^pathwise-conditioning] and represent posterior samples as
 
 ```
 $$
@@ -56,7 +56,7 @@ We illustrate this below.
 Computing the motion plan therefore entails optimizing these parameters with respect to an appropriate variational objective.
 Once optimized, in practice we can sample from the posterior using efficient sampling, that is, by first approximately sampling the prior `$f(\cdot)$` using Fourier features, then transforming the sampled prior motion plans into posterior motion plans. 
 This procedure allows us to draw random curves representing the posterior in a way that *resolves the stochasticity once in advance* per sample, after which we can evaluate and differentiate the motion plan at arbitrary time points without any additional sampling.
-Compared to prior work such as GPMP2 and its variants,[^gpmp][^gpmp2][^igpmp2] we support general kernels and avoid relying on specialized techniques for stochastic differential equations, thereby enabling explicit control of motion plan smoothness properties.
+Compared to prior work such as GPMP2 and its variants,[^gpmp] [^gpmp2] [^igpmp2] we support general kernels and avoid relying on specialized techniques for stochastic differential equations, thereby enabling explicit control of motion plan smoothness properties.
 Additionally, in contrast with prior work,[^gvi] our formulation bypasses the need to use interpolation to evaluate the posterior in-between a set of pre-specified time points.
 
 Following the framework of variational inference, the resulting variational posterior can be trained by solving the optimization problem 
@@ -86,7 +86,7 @@ This is done by composing the forward kinematics map `$\operatorname{k}_{\operat
 Then, we compute the hinge loss `$\operatorname{h}_\varepsilon(x) = \max(-x + \varepsilon, 0)$`, where `$\varepsilon$` is the *safety distance* parameter, and calculate its squared norm with respect to a diagonal scaling matrix `$\mathbf\Sigma_{\operatorname{obs}}$` which determines the overall importance of avoiding collisions in the objective.
 
 The soft constraint term, which can be used to encode desired behavior such as a grasping pose, is handled analogously.
-Compared to prior work,[^gpmp][^gpmp2][^igpmp2][^gvi] one of the key differences is the introduction of `$\sigma$`, which guarantees that joint limits are respected without the need for clamping or other post-processing-based heuristics.
+Compared to prior work,[^gpmp] [^gpmp2] [^igpmp2] [^gvi] one of the key differences is the introduction of `$\sigma$`, which guarantees that joint limits are respected without the need for clamping or other post-processing-based heuristics.
 
 # Experiments
 

diff --git a/content/2024-12-10-Pandoras-Box-BayesOpt/bayesian_regret.svg b/content/2024-12-10-Pandoras-Box-BayesOpt/bayesian_regret.svg
diff --git a/content/2024-12-10-Pandoras-Box-BayesOpt/contour_and_cost.svg b/content/2024-12-10-Pandoras-Box-BayesOpt/contour_and_cost.svg