The scale-invariant physics updates (SIPs) of the previous chapters illustrated the importance of inverting the direction of the update step (in addition to making use of higher order terms). We'll now turn to an alternative for achieving the inversion, the so-called Half-Inverse Gradients (HIGs) {cite}`schnell2022hig`. They come with their own set of pros and cons, and thus provide an interesting alternative for computing improved update steps for physics-based deep learning tasks.
```{admonition} HIGs versus SIPs
:class: tip

More specifically, the HIGs:
- do not require an analytical inverse solver (in contrast to SIPs),
- and they jointly invert the neural network part as well as the physical model.

As a drawback, HIGs:
- require an SVD for a large Jacobian matrix,
- and are based on first-order information (similar to regular gradients).

However, in contrast to regular gradients, they use the full Jacobian matrix. So as we'll see below, they typically outperform regular GD and Adam significantly.
```
As mentioned during the derivation of inverse simulator updates in {eq}`quasi-newton-update`, the update for regular Newton steps uses the inverse Hessian matrix. If we rewrite its update for the case of an $L^2$ loss, we arrive at the Gauss-Newton (GN) update:
$$ \Delta \theta_{\mathrm{GN}} = - \eta \Bigg( \bigg(\frac{\partial y}{\partial \theta}\bigg)^{T} \cdot \bigg(\frac{\partial y}{\partial \theta}\bigg) \Bigg)^{-1} \cdot \bigg(\frac{\partial y}{\partial \theta}\bigg)^{T} \cdot \bigg(\frac{\partial L}{\partial y}\bigg)^{T} . $$ (gauss-newton-update-full)
For a full-rank Jacobian $\frac{\partial y}{\partial \theta}$, this simplifies to
$$ \Delta \theta_{\mathrm{GN}} = - \eta \bigg(\frac{\partial y}{\partial \theta}\bigg) ^{-1} \cdot \bigg(\frac{\partial L}{\partial y}\bigg)^{T} . $$ (gauss-newton-update)
This looks much simpler, but it still leaves us with a Jacobian matrix to invert. This Jacobian is typically non-square and has small singular values, which cause problems during inversion: naively applied, methods like Gauss-Newton can quickly produce exploding updates. However, as we're dealing with cases where we have a physics solver in the training loop, the small singular values are often relevant for the physics. Hence, we don't want to simply discard these parts of the learning signal, but rather preserve as many of them as possible.
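To make the conditioning problem concrete, here is a small NumPy sketch of the simplified GN step from {eq}`gauss-newton-update`. The names (`gauss_newton_update`, `J`, `dL_dy`) and the toy Jacobian are illustrative assumptions, not code from {cite}`schnell2022hig`:

```python
import numpy as np

def gauss_newton_update(J, dL_dy, eta=1.0):
    """Naive GN step: invert the Jacobian J = dy/dtheta via its pseudo-inverse.
    Small singular values of J become huge entries of pinv(J), so the update
    can easily explode for ill-conditioned Jacobians."""
    return -eta * np.linalg.pinv(J) @ dL_dy

# tiny illustration with an ill-conditioned 2x2 Jacobian
J = np.array([[1.0, 0.0],
              [0.0, 1e-6]])            # the second singular value is tiny
dL_dy = np.array([0.1, 0.1])           # transposed loss gradient (dL/dy)^T
print(gauss_newton_update(J, dL_dy))   # second component is amplified by 1e6
```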
This motivates the HIG update, which employs a partial and truncated inversion of the form
$$ \Delta \theta_{\mathrm{HIG}} = - \eta \cdot \bigg(\frac{\partial y}{\partial \theta}\bigg)^{-1/2} \cdot \bigg(\frac{\partial L}{\partial y}\bigg)^{T} , $$ (hig-update)
where the exponent $-1/2$ denotes a truncated half-inversion: given an SVD of the Jacobian, $\frac{\partial y}{\partial \theta} = U \Lambda V^T$, the half-inverse is computed as $\big(\frac{\partial y}{\partial \theta}\big)^{-1/2} = V \Lambda^{-1/2} U^T$, where $\Lambda^{-1/2}$ contains the inverted square roots of the singular values, and singular values below a small threshold $\tau$ are truncated, i.e., set to zero.
_Truncation versus Clamping:_
It might seem attractive at first to clamp singular values to a small value $\tau$, instead of discarding them by setting them to zero. However, the singular vectors corresponding to these small singular values are exactly the ones which are potentially unreliable. Clamping to a small $\tau$ yields a large contribution during the inversion, and thus exactly these unreliable singular vectors would dominate the update. Hence, it's a much better idea to discard their content by setting their singular values to zero.
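To see the difference in numbers, here is a hedged mini-example for a vector of singular values; the values and the cut-off `tau` below are made up for illustration:

```python
import numpy as np

s = np.array([2.0, 0.5, 1e-7])   # example singular values, the last one is tiny
tau = 1e-4                        # hypothetical cut-off value

# truncation (used by HIGs): the unreliable direction is dropped entirely
trunc = np.where(s > tau, 1.0 / np.sqrt(s), 0.0)

# clamping: the tiny singular value turns into a huge factor 1/sqrt(tau) = 100
clamp = 1.0 / np.sqrt(np.maximum(s, tau))

print(trunc)   # approx. [0.71, 1.41, 0.0]
print(clamp)   # approx. [0.71, 1.41, 100.0]
```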
As shown in {cite}`schnell2022hig`, this use of a partial inversion, i.e. the half-inversion, regularizes the inverse and provides substantial improvements for learning, while reducing the chance of gradient explosions.
Figure `hig-spaces`: A visual overview of the different spaces involved in HIG training. Most importantly, it makes use of the joint, inverse Jacobian for neural network and physics.
The formulation above hides one important aspect of HIGs: the search direction we compute not only jointly takes into account the scaling of neural network and physics, but can also incorporate information from all the samples in a mini-batch. This has the advantage of finding the optimal direction (in an $L^2$ sense) for the whole batch, instead of simply averaging the per-sample directions as regular gradient-based training does.

To achieve this, the Jacobian matrix for {eq}`hig-update` is assembled by stacking the individual Jacobians of all samples in a mini-batch, and the half-inversion is applied to this combined matrix. The notation with a single Jacobian above should thus be read as a batch-wide, stacked matrix.
To summarize, computing the HIG update requires evaluating the individual Jacobians of a batch, performing an SVD of the combined Jacobian, truncating and half-inverting the singular values, and computing the update direction by re-assembling the half-inverted Jacobian matrix.
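These steps translate into only a few lines of code. The following NumPy sketch illustrates the procedure under the assumptions made above; the function name `hig_update`, the absolute threshold `tau`, and the calling convention are our own illustrative choices rather than the reference implementation of {cite}`schnell2022hig`:

```python
import numpy as np

def hig_update(J, dL_dy, eta=1.0, tau=1e-5):
    """One HIG step for a batch-stacked Jacobian J = dy/dtheta and the
    transposed loss gradient dL_dy = (dL/dy)^T, cf. the HIG update above."""
    # SVD of the combined Jacobian: J = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(J, full_matrices=False)
    # truncate: singular values below tau are set to zero, the rest are
    # half-inverted (inverse square root) as described in the text
    s_half_inv = np.zeros_like(s)
    keep = s > tau
    s_half_inv[keep] = 1.0 / np.sqrt(s[keep])
    # re-assemble the half-inverted Jacobian: J^(-1/2) = V @ diag(s^-1/2) @ U^T
    J_half_inv = Vt.T @ (s_half_inv[:, None] * U.T)
    # HIG update direction, scaled by the learning rate eta
    return -eta * J_half_inv @ dL_dy

# example with hypothetical sizes: 8 stacked output entries, 23 network parameters
J = np.random.randn(8, 23)
dL_dy = np.random.randn(8)
print(hig_update(J, dL_dy, eta=0.5).shape)  # -> (23,)
```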
%
This is a good time to illustrate the properties mentioned in the previous paragraphs with a real example. As learning target, we'll consider a simple two-dimensional setting: a two-component target function together with a scaled loss function, in which a scaling factor controls how well-conditioned the learning problem is. We'll use a small neural network with a single hidden layer consisting of 7 neurons with tanh() activations, and the objective is to learn the target function.
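For reference, a network of this size takes only a few lines to set up. The sketch below uses PyTorch purely for illustration, and the two-dimensional input and output sizes are an assumption on our part:

```python
import torch

# minimal stand-in for the toy-example network: a single hidden layer
# with 7 tanh neurons (input and output sizes of 2 are assumed here)
model = torch.nn.Sequential(
    torch.nn.Linear(2, 7),
    torch.nn.Tanh(),
    torch.nn.Linear(7, 2),
)

n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # 37 trainable parameters for this configuration
```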
Let's first look at the well-conditioned case.
Figure `hig-toy-example-good`: The example problem for a well-conditioned case, comparing Adam, GN, and HIGs.
As seen here, all three methods fare okay on all fronts for the well-conditioned case: the loss decreases to around 0.01 and better.

In addition, the neuron activations, which are shown in terms of mean and standard deviation, all show a broad range of values (as indicated by the solid-shaded regions representing the standard deviation). This means that the neurons of all three networks produce a wide range of values. While it's difficult to interpret specific values here, it's a good sign that different values are produced by different inputs. If this were not the case, i.e., if different inputs produced constant values despite the obviously different targets, it would indicate saturated neurons and a problematic training state.

Finally, the third graph on the right shows the evolution in terms of a single input-output pair. The starting point from the initial network state is shown in light gray, while the ground truth target is shown as a black dot.
Overall, the behavior of all three methods is largely in line with what we'd expect: while the loss surely could go down more, and some of the individual steps in the output space look less than optimal, all three methods make steady progress towards the target in this well-conditioned setting.
Now we can consider a less well-conditioned case, obtained by changing the scaling factor of the loss.
Figure `hig-toy-example-bad`: The example problem for an ill-conditioned case, comparing Adam, GN, and HIGs.
The loss curves now show a different behavior: both Adam and GN do not manage to decrease the loss beyond a level of around 0.2 (compared to the 0.01 and better from before). Adam has significant problems with the badly scaled component of the loss, while the HIGs still manage to reach a substantially lower loss.
This becomes even clearer in the middle graph, showing the activation statistics. The red curve of GN very quickly saturates at 1, without showing any variance. Hence, all neurons have saturated and no longer produce meaningful signals. This not only means that the target function isn't approximated well, it also means that future gradients will effectively be zero, and these neurons are lost to all subsequent learning iterations. Hence, this is a highly undesirable case that we want to avoid in practice. It's also worth pointing out that this doesn't always happen for GN; however, it happens regularly, e.g. when individual samples in a batch lead to rows of the Jacobian that are linearly dependent (or very close to it), which makes GN a sub-optimal choice.
The third graph on the right side of {numref}`hig-toy-example-bad` shows the resulting behavior in terms of the outputs. As already indicated by the loss values, both Adam and GN do not reach the target (the black dot). Interestingly, it's also apparent that both have much more problems along the badly scaled direction of the output, while the HIG runs approach the target despite the poor conditioning.
Note that for all examples so far, we've improved upon the differentiable physics (DP) training from the previous chapters. I.e., we've focused on combinations of neural networks and PDE solving operators. The latter need to be differentiable for training with regular GD, as well as for HIG-based training.
In contrast, for training with SIPs (from {doc}`physgrad-nn`), we even needed to provide a full inverse solver. As shown there, this has advantages, but differentiates SIPs from DP and HIGs. Thus, the HIGs share more similarities with, e.g., {doc}`diffphys-code-sol` and {doc}`diffphys-code-control`, than with the example in {doc}`physgrad-code`.
This is a good time to give a specific code example of how to train physical NNs with HIGs: we'll look at a classic case, a system of coupled oscillators.