
Defenses Against Adversarial Attacks


Gradient Masking

Derivative penalties regularize training by penalizing high first- (and second-) order derivatives of the network output with respect to the input. The term was introduced here. The loss function is extended by a regularization term that penalizes large input gradients:
$$l(\theta)=\lVert f_\theta(x)-y_\text{target}\rVert+\left\lVert\frac{\partial f_\theta(x)}{\partial x}\right\rVert$$
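A minimal PyTorch sketch of such a loss (the framework, the helper name `derivative_penalty_loss`, and the weight `lam` are assumptions; for simplicity the penalty is taken on the gradient of the fit term rather than the full Jacobian):

```python
import torch

def derivative_penalty_loss(f, x, y_target, lam=1.0):
    # Fit term plus a penalty on the input gradient (sketch).
    # lam is an assumed weight; the formula above uses an unweighted sum.
    x = x.clone().detach().requires_grad_(True)
    fit = torch.norm(f(x) - y_target)
    # create_graph=True keeps the penalty differentiable w.r.t. theta,
    # so the optimizer can minimize it together with the fit term.
    (grad_x,) = torch.autograd.grad(fit, x, create_graph=True)
    return fit + lam * torch.norm(grad_x)
```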

Growing CNNs

While progressively growing GANs have proven effective, it might be possible to transfer this method to CNNs by progressively growing their filters. That way the learned filters might differ substantially from those of conventionally trained networks, and black-box attacks with transferred examples might fail. I have not seen any research in this direction, but it might be something interesting to try.
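Purely as an illustration of what "growing filters" could mean, a hypothetical PyTorch helper that widens a conv layer while keeping the already-trained filters (the function name and the copy strategy are assumptions):

```python
import torch
import torch.nn as nn

def grow_conv2d(old: nn.Conv2d, extra_filters: int) -> nn.Conv2d:
    # Hypothetical helper: widen a conv layer by `extra_filters` output
    # channels while keeping the already-trained filters. The layer that
    # consumes these outputs would have to be widened accordingly.
    new = nn.Conv2d(old.in_channels, old.out_channels + extra_filters,
                    kernel_size=old.kernel_size, stride=old.stride,
                    padding=old.padding, bias=old.bias is not None)
    with torch.no_grad():
        new.weight[:old.out_channels] = old.weight
        if old.bias is not None:
            new.bias[:old.out_channels] = old.bias
    return new
```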

Fisher Information

Kirkpatrick et al. (2017) use the Fisher information matrix to determine the importance of every weight. This might be a helpful tool for network size reduction, i.e. kernel removal or simplification of the dense head layers (prior to the softmax).
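A sketch of the diagonal (empirical) Fisher information per parameter in PyTorch; mini-batch gradients are squared here as a cheap approximation of per-example gradients, and the function name is an assumption:

```python
import torch
import torch.nn.functional as F

def diagonal_fisher(model, data_loader, device="cpu"):
    # Accumulate squared gradients of the log-likelihood per parameter.
    # Mini-batch gradients are used as a cheap approximation of the
    # per-example gradients in the empirical Fisher.
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_samples = 0
    for x, y in data_loader:
        x, y = x.to(device), y.to(device)
        model.zero_grad()
        loss = F.nll_loss(F.log_softmax(model(x), dim=1), y)
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2 * x.size(0)
        n_samples += x.size(0)
    # Parameters with small values are candidates for removal/simplification.
    return {n: f / n_samples for n, f in fisher.items()}
```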

Adversarial Training

Training the network with adversarial examples is common practice and a necessary baseline to build on. However, it does not transfer well to unseen attacks, since it does not address the underlying cause of the network's vulnerability.

Adversarial examples can also be generated for the inputs of hidden layers.

Information on adversarial training can be found in Kurakin, Goodfellow, Bengio (2017).
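A minimal sketch of one-step (FGSM-based) adversarial training in PyTorch, roughly in the spirit of Kurakin et al. (2017); ε, the loss weighting, and the assumption that inputs lie in [0, 1] are mine:

```python
import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, eps=8 / 255):
    # One-step FGSM adversarial example; inputs are assumed to lie in [0, 1].
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    (grad,) = torch.autograd.grad(loss, x_adv)
    return (x_adv + eps * grad.sign()).clamp(0, 1).detach()

def adversarial_training_step(model, optimizer, x, y, adv_weight=0.5):
    # Weighted sum of the clean and the adversarial loss.
    x_adv = fgsm_example(model, x, y)
    loss = ((1 - adv_weight) * F.cross_entropy(model(x), y)
            + adv_weight * F.cross_entropy(model(x_adv), y))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```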

Mixed-Minibatch PGD (M-PGD)

Train on minibatches that contain adversarial examples generated by the PGD attack as well as normal training samples. Suggested by Kannan et al. (2018).
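A sketch of such a mixed minibatch with an L∞ PGD attack in PyTorch (ε, step size, number of steps, the 50/50 split, and inputs in [0, 1] are assumptions):

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, step=2 / 255, n_steps=10):
    # L-infinity PGD: random start, signed gradient steps, projection onto
    # the eps-ball around x.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(n_steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        (grad,) = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv + step * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def mpgd_batch(model, x, y):
    # Mixed minibatch: replace half of the samples by their PGD counterparts.
    half = x.size(0) // 2
    x_mixed = x.clone()
    x_mixed[:half] = pgd_attack(model, x[:half], y[:half])
    return x_mixed, y
```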

Bounding-Boxes

Approaches similar to bounding-box training: use only a portion (a crop) of the image as the network's input.
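A trivial sketch of feeding only a random crop of each image to the network (crop size and the (N, C, H, W) shape are assumptions):

```python
import torch

def random_crop_batch(x, crop_size=24):
    # Use only a random sub-window of a batch of images, shape (N, C, H, W).
    _, _, h, w = x.shape
    top = torch.randint(0, h - crop_size + 1, (1,)).item()
    left = torch.randint(0, w - crop_size + 1, (1,)).item()
    return x[:, :, top:top + crop_size, left:left + crop_size]
```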

Stacked HGDs

After training a High-level Guided Denoiser (HGD), train another HGD and put it in front of the "HGD-CNN" stack.
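A sketch of the stacking step in PyTorch: the already-trained HGD and the CNN are frozen and a new denoiser is trained in front of them (module and function names are assumptions):

```python
import torch.nn as nn

def stack_hgd(new_hgd: nn.Module, trained_hgd: nn.Module, cnn: nn.Module) -> nn.Module:
    # Freeze the already-trained HGD-CNN stack and put a fresh denoiser in
    # front of it; only `new_hgd` remains trainable.
    for module in (trained_hgd, cnn):
        for p in module.parameters():
            p.requires_grad = False
    return nn.Sequential(new_hgd, trained_hgd, cnn)
```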

Preprocessing Based Methods

Apply preprocessing methods to the inputs before feeding them into the network (a sketch of two of these follows the list).

  • Guided Denoiser (last year's winner)
  • JPEG compression
  • Auto-encoder
  • Median filter, averaging filter, Gaussian (low-pass) filter
  • Input dropout, i.e. randomly setting a few pixels of the image to 0 (and up-scaling the others)
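A sketch of input dropout and a median filter in PyTorch (function names, drop probability, and kernel size are assumptions):

```python
import torch
import torch.nn.functional as F

def input_dropout(x, drop_prob=0.1):
    # Randomly set whole pixels (all channels) to 0 and up-scale the rest.
    n, _, h, w = x.shape
    keep = (torch.rand(n, 1, h, w, device=x.device) > drop_prob).float()
    return x * keep / (1.0 - drop_prob)

def median_filter(x, k=3):
    # k x k median filter via unfold; x has shape (N, C, H, W).
    pad = k // 2
    padded = F.pad(x, (pad, pad, pad, pad), mode="reflect")
    patches = F.unfold(padded, kernel_size=k)            # (N, C*k*k, H*W)
    n, c = x.shape[:2]
    patches = patches.view(n, c, k * k, -1)
    return patches.median(dim=2).values.reshape(x.shape)
```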

Activation Anomaly Detection

Train a second network that takes the classifier's activations as input and outputs whether the current sample is adversarial. Back-propagate the loss to play a min-max game between the classifier and the adversarial-attack-detection network. A similar approach has been taken by Metzen et al. (2017), "On Detecting Adversarial Perturbations".

Extension: Train a detection network to predict the correct class labels based on the classifier's gradient, activations, and class output.
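A minimal sketch of such a detector in PyTorch; the activations are assumed to be available (e.g. via a forward hook on the classifier), and the class name, dimensions, and architecture are assumptions:

```python
import torch
import torch.nn as nn

class ActivationDetector(nn.Module):
    # Binary detector on top of a hidden activation of the classifier.
    # `act_dim` is the flattened activation size.
    def __init__(self, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))  # logit: adversarial vs. clean

    def forward(self, activations):
        return self.net(activations.flatten(start_dim=1))
```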

PCA-based detection

CNN Filter Encoding

Feed the CNN filter outputs (multiple filters) into an auto-encoder and build the following conv layers on top of the compressed representation.

Very similar to Inception blocks, except that our compression would be more aggressive (for instance n filters → log_2(n) filters) and we would place the 1x1 convolution in one of the lowest layers (closest to the input).

Extension: Learn a priority mapping of where filters are fed forward and where they are dropped (a weight map over the image for each filter, up-scaled using resize_images).
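A sketch of such a block in PyTorch, compressing n filters to roughly log_2(n) channels with a 1x1 convolution before the next conv layer (layer sizes and activation functions are assumptions):

```python
import math
import torch.nn as nn

def filter_encoding_block(in_filters, kernel_size=3):
    # Compress `in_filters` feature maps to ~log2(in_filters) channels with a
    # 1x1 convolution, then continue with an ordinary conv layer on top of
    # the compressed representation.
    code_filters = max(1, int(math.log2(in_filters)))
    return nn.Sequential(
        nn.Conv2d(in_filters, code_filters, kernel_size=1),
        nn.ReLU(),
        nn.Conv2d(code_filters, in_filters, kernel_size, padding=kernel_size // 2),
        nn.ReLU(),
    )
```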

Mixup

Suggested by Zhang et al. (2017), mixup: Beyond Empirical Risk Minimization.

Train with samples drawn from the linear interpolation of two data points; both the images and the labels are interpolated.
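A minimal mixup sketch in PyTorch with one-hot labels (α = 0.2 is an assumed value from the range used in the paper):

```python
import numpy as np
import torch

def mixup_batch(x, y_onehot, alpha=0.2):
    # Convex combination of two randomly paired samples; labels are one-hot
    # and are interpolated with the same coefficient as the images.
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix
```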

Internal Binarization

Enforce the encoded state after a layer to be binarized. This might come with robustness to small changes of the input.
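A sketch of binarizing an internal activation in PyTorch with a straight-through estimator (the estimator choice is an assumption; the idea above does not prescribe how to make the binarization differentiable):

```python
import torch

class Binarize(torch.autograd.Function):
    # Binarize activations to {-1, +1} in the forward pass; pass the gradient
    # straight through in the backward pass (straight-through estimator).
    @staticmethod
    def forward(ctx, x):
        return (x >= 0).float() * 2 - 1

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

def binarize(x):
    return Binarize.apply(x)
```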

Vector quantization, as done in the VQ-VAE paper, can be seen as a combination of filter compression and binarization.