-
Notifications
You must be signed in to change notification settings - Fork 0
/
conclusions.tex
64 lines (43 loc) · 13 KB
/
conclusions.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
\chapter{Conclusions}
\label{chp:conclusions}
To recap, we have presented three original pieces of work in this thesis. The common theme has been that neural networks and deep learning are crucial for scaling Bayesian inference on generative models, and conversely that traditional Bayesian inference is important for quantifying uncertainty in NN discriminative models. In this chapter, we summarize our contributions and suggest directions for future work.
\section{Achievements}
In Ch 3, we presented the NaMI algorithm for designing the structure of inference networks and demonstrated superiority of NaMI inverses on several models. A review was given in Ch 2 of the other parts of the NaMI framework assumed in \citet{WebbEtAl2018}, namely amortized VI and NN density estimators.
In Ch 4, we presented a novel framework for distributed Bayesian learning based on a modification of the standard EP algorithm. We provided evidence that our method was able to efficiently perform distributed learning of Bayesian NNs, suffering less from the stale-gradient problem of (non-Bayesian) distributed SGD algorithms, and outperforming all other methods in learning feedforward networks with many layers of depth.
In Ch 5, we developed a novel statistical measure of neural network robustness based on a traditional inference method and demonstrated its low bias and variance, and improved scalability relative to formal verification methods.
\section{Future work}
\subsection{A statistical approach to assessing neural network robustness}
When our method is applied to adversarial properties, it produces a measure of robustness on the NN \emph{for a given image}. However, it would be desirable if we could produce a metric that measures overall robustness of the neural network, averaging over the data distribution. We believe this is an interesting and easy extension that could be accomplished by modifying the robustness metric to,
\begin{align*}
\mathcal{I}[p,s] &\triangleq \int_{\mathcal{X}\times\mathcal{X}'}\mathbbm{1}_{\{s(\mathbf{x};\mathbf{x}')\}\ge0}p(\mathbf{x}\mid\mathbf{x}')p_\mathcal{D}(\mathbf{x}')d\mathbf{x}d\mathbf{x}',
\end{align*}
where the input model, $p(\mathbf{x},\mathbf{x}')$, can be broken up into two terms: the data distribution $p_\mathcal{D}(\cdot)$, and a uniform perturbation around the input, $p(\cdot|\mathbf{x}')$. The property function $s(\cdot)$ now becomes a function of the sampled input, $\mathbf{x}'$. Our algorithm based on AMLS would then be modified to perform the MH sampling over the modified input distribution---effectively the only change is performing approximate sampling from a distribution with double the dimension of the per-sample case.
A second extension is the investigation of improved sampling from the intermediate distributions, $\{\pi_n(\cdot)\}$. The simple Metropolis--Hastings (MH) scheme was found to be very robust and produce low bias/variance estimates in our experiments. However, it may not scale in the dimension, $D$, of larger datasets such as {\scshape ImageNet}. Even for {\scshape CIFAR}-10 (with around 4000 dimensions) the required mixing time was found to be quite large in our experiments, being around $M=2000$ steps to obtain satisfactory convergence to the target distribution. This becomes all the more important with larger models as well, for which both the constant factor of running the forward pass increases, and the number of Markov chains, $N$, that can be processed in parallel is reduced (due to limited GPU memory).
To improve the mixing time, and thus be able to reduce $M$ without introducing unacceptable bias, we propose to replace the simple MH scheme with Langevin Monte Carlo (LMC) \citep{Neal2011}. Theoretically, the computational time to reach a nearly independents sample scales as $O(D^2)$ for MH, whereas for LMC it scales as $O(D^{4/3})$. LMC can be understood as Hamiltonian Monte Carlo (HMC) with a single leapfrog step, or alternatively as a random walk that is informed by gradient information. The computational cost to reach an independent sample does scale slightly better for HMC, as $O(D^{5/4})$, although this is heavily outweighed by the increase in the constant factor due to the many more forward passes of the NN required during the leapfrog steps.
%We intend to show that LMC can produce the same results as the existing experiments but with reduced computation, judging by total wall-clock time. Then, we would like to prove that the method scales to models on {\scshape ImageNet}, which are both larger in input dimension and number of parameters.
We suggested in a previous version of \citet{webb2018statistical} that the adversarial samples produced by our algorithm may be of interest for adversarial training. However, it takes several minutes in typical usage to pass through all the levels to obtain samples from the final $\pi_K$. While it may not, therefore, be practical to \emph{directly} use the samples from our algorithm for robustness training, we believe that a robust training method could be formed from the fundamental idea of our work, of sampling adversarial, or near-adversarial, samples by MCMC methods. Let us elaborate.
Many robust training methods can be understood as approximations to the following variant of empirical risk minimization \citep{madry2017towards, tramer2017ensemble},
\begin{align*}
f^* &= \text{argmax}_{f\in\mathcal{F}}\mathbb{E}_{(\mathbf{x},y^*)\sim\mathcal{D}}\left[\max_{||\mathbf{x}'-\mathbf{x}||_\infty\le\epsilon}s(\mathbf{x}';f,\mathbf{x},y^*)\right],
\end{align*}
where $y^*$ is the true class label. For instance, the iterative fast gradient-sign method (I-FGSM) \citep{goodfellow2014explaining} approximates the inner max term by starting from $x$ and taking $k$ steps of size $\alpha=\epsilon/k$ in the direction of the sign of the gradient of the property function, clipping where necessary to keep within the $l_\infty$ $\epsilon$-ball and the valid range of pixels. In another method \citep{kolter2017provable, wong2018explaining}, a convex outer bound, $E\subseteq\{\mathbf{x}'\mid s(\mathbf{x}';f,\mathbf{x},y^*)\ge0\}$ is produced and the max performed over $\mathbf{x}'\in E$.
As pointed out in \citep{tramer2017ensemble}, if the approximation to the max term is poor, that is,
\begin{align*}
s(\mathbf{x}^*;f,\mathbf{x},y^*) &\ll \max_{||\mathbf{x}'-\mathbf{x}||_\infty\le\epsilon}s(\mathbf{x}';f,\mathbf{x},y^*)
\end{align*}
where $\mathbf{x}^*$ is the sample produced by the approximation method, then the adversarial training degrades the quality of adversarial samples generated by the model, which then degrades the quality of subsequent adversarial training. The result of this is that the model reaches an equilibrium during training where it can effectively defend against white-box but not black-box attacks, those where the adversarial samples are transferred from another model.
The authors suggest a new method for adversarial training in which they decouple adversarial training from adversarial generation, and use samples from an ensemble of additional source models that are held constant during the adversarial training of the target model (adding a small amount of random noise to each). In this way, the adversarial training does not influence the generation of samples used for the training. This relies on the heuristic that adversarial examples often transfer across models. However, it does seem unsatisfactory in that we are no longer directly solving the objective, and it may not be the case that examples do transfer across all architectures or that the transferable examples are representative of all adversarial examples on the model.
We propose to modify the objective to,
\begin{align*}
f^* &= \text{argmax}_{f\in\mathcal{F}}\mathbb{E}_{(\mathbf{x},y^*)\sim\mathcal{D}}\left[\mathbb{E}_{\mathbf{x}'\sim p(\cdot|\mathbf{x})}\left[s(\mathbf{x}';f,\mathbf{x},y^*)\right]\right],
\end{align*}
where, for instance, $p(\mathbf{x}'|\mathbf{x})\propto\exp(s(\cdot|\mathbf{x}))\mathbbm{1}_{\{||\mathbf{x}'-\mathbf{x}||_\infty\le\epsilon\}}$. We can draw approximate samples from $p$ by LMC.
This method has several advantages. Firstly, we can produce multiple proposal samples per data point directly from the model in question, as opposed to only a single proposal sample from the model like in the I-FGSM method and that of \citet{kolter2017provable}, or a number of samples per data point from surrogate models, like \citet{tramer2017ensemble}. Also, by the reliability of the sampling process, we can ensure that the proposals are diverse and are adversarial or near-adversarial, thus avoiding the poor equilibrium discussed before. As argued in \citet{webb2018statistical}, in many cases it makes more sense to define robustness in terms of the prevalence of adversarial examples in the vicinity of a data point rather than the distance to the nearest one, and for this reason we would expect a training algorithm that trains on multiple and diverse adversarial examples per data point to be more effective in reducing our measure of robustness. Our proposed method is also fast, requiring only a few extra gradient steps per minibatch over the I-FGSM.
%We would like to compare our method to other robust training methods like I-FGSM and \citet{kolter2017provable} measured by both our novel robustness metric and by the fraction of samples that are provably robust to perturbations within a given radius, and show that it is more efficient in quickly increasing robustness, while obtaining a better solution at convergence.
\subsection{Faithful inverses for effective amortized inference}
A very worthy outstanding challenge of Bayesian statistics is that of producing \emph{universal automated inference}. Such a hypothesized inference scheme is said to be \emph{universal} in that it would be applicable to higher-order probabilistic programs, those with recursion, other stochastic control flow mechanisms, and high-order functions, and, \emph{automated} in that it would not require any intervention by the user after having specified the model and the query of interest in the details of how inference is applied.
A universal inference scheme based on amortized VI and the Anglican PPL \citep{wood2014new} was attempted in \citet{LeEtAl2016}. Their method used particle-based inference with a data driven proposal to learn to perform inference in a universal PPL. As noted by the authors, it suffered from two shortcomings. Firstly, the method did not perform model learning, and as discovered in the experiments on CAPCHA solving, it is difficult to program a correct model ab initio. Secondly, the inference network conditioned on all previous random choices, not only the relevant ones (from the PGM perspective of \citet{WebbEtAl2018}). As demonstrated in our work, this slows learning considerably and leads convergence to a worse solution.
We believe this important work is based on the most promising approach for universal automated inference, that of amortized VI, and that it can be further developed to overcome the hurdles the authors discovered. Firstly, new learning algorithms have been developed that would be suitable for probabilistic programs, with their seqential nature. For instance, filtering variational objectives, in a sense, can differentiate through particle-based inference algorithms in order to perform model learning \citep{MaddisonEtAl2017, LeEtAl2017, NaessethEtAl2017}. Secondly, the development of the NaMI algorithm provides an intelligent way to automate the design of the structure of the inference network. Currently, either the modeler must design the inference network by hand, or use an unstructured fully connected one (e.g., see the existing autoguides of Pyro \citep{bingham2018pyro}).
We intend to integrate the NaMI algorithm into the Pyro PPL for designing the structure of an inference network, and use heuristics for designing the parametrization based on normalizing flows. PPLs provide the necessary abstractions, in this case \emph{poutines}, to manipulate a representation of the model in order to operate the NaMI algorithm and choose an appropriate factorization.
At first, we intend to demonstrate inference compilation on models with a fixed structure, such as traditional BNs. We believe the importance of a structured inference network increases in the scale of the model---an heuristic approach diverges more from the posterior as the number of variables increases, and a fully connected approach suffers the curse of dimensionality. Some traditional factor graph BNs, for example, those used for medical diagnosis, have thousands of variables, and would be suitable for this experimentation. Performing fast inference in these models is very important in industry. %We would then like to attempt to automate the design of inference networks for models with limited forms of dynamic structure.
Some challenges in automating the design of inference networks include automating the design of the NN encoder and decoder architectures and selection of hyperparameters like the learning rates. It is likely that meta-learning approaches are necessary, which learn from their past successes and failures and are able to generalize to novel scenarios.