Defenses Against Adversarial Attacks
Derivative penalties regularize training by penalizing high first- (and second-) order derivatives of the loss with respect to the input. The term was introduced here. The loss function is extended by a regularization term that penalizes high input gradients:
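The exact regularization term from the original reference is not reproduced in these notes; a plausible first-order form, with lambda as an assumed regularization weight, is:

```latex
\tilde{L}(x, y) = L(f(x), y) + \lambda \left\lVert \nabla_x L(f(x), y) \right\rVert_2^2
```

A second-order variant would add an analogous penalty on the second derivatives of the loss with respect to the input.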
While growing GANs have proven effective, it might be possible to transfer this method to CNNs by progressively growing their filters. That way the learned filters might be very different from those of standard training, and black-box attacks with transferred examples might fail. I have not seen any research in this direction, but it might be something interesting to try.
Kirkpatrick et al. (2017) use the Fisher information matrix to determine the importance of every weight. This might be a helpful tool for network size reduction, i.e. kernel removal or simplification of the dense head layers (prior to the softmax).
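A minimal sketch of how such per-weight importance scores could be computed (a batch-level empirical-Fisher approximation; `model` and `data_loader` are placeholder names, not from the original notes):

```python
# Rank parameters by an approximate diagonal Fisher information:
# the average squared gradient of the loss.
import torch
import torch.nn.functional as F

def diagonal_fisher(model, data_loader):
    fisher = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
    n_batches = 0
    for x, y in data_loader:
        model.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                fisher[name] += p.grad.detach() ** 2
        n_batches += 1
    # Average over batches; per-sample gradients would give the exact diagonal.
    return {name: f / max(n_batches, 1) for name, f in fisher.items()}

# Kernels / units with consistently low Fisher values are candidates for
# removal when shrinking the network.
```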
Training the network with adversarial examples is common practice and absolutely needed as a basis to build on. However, it does not transfer well to unseen attacks, since it does not address the network's underlying weakness.
Adversarial examples can also be generated as inputs to hidden layers.
Information on adversarial training can be found in Kurakin, Goodfellow, Bengio (2017).
Train with adversarial examples generated by the PGD attack alongside normal training samples, as suggested by Kannan et al. (2018).
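A minimal sketch of this kind of mixed clean/adversarial PGD training (epsilon, step size, and number of steps are assumed hyperparameters, not taken from the paper):

```python
# PGD attack plus a training step that mixes clean and adversarial batches.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon=8/255, step_size=2/255, num_steps=10):
    # Random start inside the epsilon-ball, then iterated signed-gradient steps.
    x_adv = (x + torch.empty_like(x).uniform_(-epsilon, epsilon)).clamp(0, 1)
    for _ in range(num_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + step_size * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

def training_step(model, optimizer, x, y):
    x_adv = pgd_attack(model, x, y)
    optimizer.zero_grad()
    # Train on clean and adversarial samples alike.
    loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```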
Approaches similar to bounding-box training: use only a portion of the image as input.
After training a High-level Guided Denoiser (HGD), train another HGD and put it in front of the "HGD-CNN" stack.
Apply preprocessing methods to the inputs before feeding them into the network (see the sketch after this list), for example:
- Guided Denoiser (last year's winner)
- JPEG compression
- Auto-encoder
- Median filter, averaging filter, Gaussian (low-pass) filter
- Input dropout, i.e. randomly setting a few pixels of the image to 0 (and scaling up the remaining pixel values)
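A minimal sketch of a few of these preprocessing steps (library choices and parameters such as JPEG quality, filter size, and drop probability are assumptions):

```python
# Simple input-preprocessing defenses: JPEG re-encoding, median filtering,
# and input dropout on uint8 images of shape (H, W, 3).
import io
import numpy as np
from PIL import Image
from scipy.ndimage import median_filter

def jpeg_compress(img, quality=75):
    # Re-encode as JPEG to strip high-frequency adversarial noise.
    buf = io.BytesIO()
    Image.fromarray(img).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.asarray(Image.open(buf))

def median_denoise(img, size=3):
    # Apply a median filter per channel.
    channels = [median_filter(img[..., c], size=size) for c in range(img.shape[-1])]
    return np.stack(channels, axis=-1)

def input_dropout(img, drop_prob=0.1, rng=None):
    # Randomly zero out pixels and rescale the rest (like inverted dropout).
    rng = rng or np.random.default_rng()
    mask = rng.random(img.shape[:2]) >= drop_prob
    out = img.astype(np.float32) * mask[..., None] / (1.0 - drop_prob)
    return np.clip(out, 0, 255).astype(np.uint8)
```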
Train a second network that takes the classifier's activations as input and outputs whether the current sample is adversarial or not. Back-propagate the loss to play a min-max game between the classifier and the adversarial-attack-detection network. Something like this has been done by Metzen et al. (2017), "On Detecting Adversarial Perturbations".
Extension: Train a detection network to predict the correct class labels based on the classifier's gradient, activations, and class output.
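A minimal sketch of such a detection network operating on intermediate activations (the hook location and detector architecture are assumptions, not taken from the paper):

```python
# Small binary classifier ("adversarial vs. clean") fed with the
# classifier's hidden activations.
import torch
import torch.nn as nn

class Detector(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 1),
        )

    def forward(self, activations):
        return self.net(activations)  # logit: > 0 means "adversarial"

# Usage idea: register a forward hook on an intermediate layer of the
# classifier, collect activations for clean and adversarial batches, and
# train the detector with binary cross-entropy on those activations.
```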
Feed the CNN filter outputs (multiple filters) into an auto-encoder and build the following conv layers on top of the compressed representation.
Very comparable to Inception blocks, except that our compression would be more aggressive (for instance n filters --> log_2(n) filters) and we would place the 1x1 conv in one of the lowest layers (closest to the input).
Extension: Learn a priority mapping of where filters are fed forward and where they are dropped (a weight map over the image for each filter, up-scaled using resize_images).
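A minimal sketch of the aggressive 1x1-conv compression mentioned above (layer sizes are illustrative assumptions):

```python
# Compress n input channels down to roughly log2(n) channels with a 1x1 conv
# placed early in the network, then build further conv layers on top.
import math
import torch.nn as nn

def compressed_block(in_channels):
    bottleneck = max(1, int(math.log2(in_channels)))
    return nn.Sequential(
        nn.Conv2d(in_channels, bottleneck, kernel_size=1),     # aggressive compression
        nn.ReLU(),
        nn.Conv2d(bottleneck, 64, kernel_size=3, padding=1),   # build on the compressed representation
        nn.ReLU(),
    )

# Example: compress 256 filters down to log2(256) = 8 channels.
block = compressed_block(256)
```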
Suggested by Zhang et al. (2017), "mixup: Beyond Empirical Risk Minimization".
Train with samples drawn from the linear interpolation of two data points, interpolating both the images and the labels.
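A minimal sketch of mixup training (the Beta(alpha, alpha) sampling follows the paper; alpha is a hyperparameter):

```python
# Mixup: train on convex combinations of pairs of images and their labels.
import numpy as np
import torch
import torch.nn.functional as F

def mixup_batch(x, y, alpha=0.2):
    # x: input batch, y: integer class labels.
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1 - lam) * x[perm]
    return x_mixed, y, y[perm], lam

def mixup_loss(logits, y_a, y_b, lam):
    # Interpolating one-hot labels is equivalent to interpolating the losses.
    return lam * F.cross_entropy(logits, y_a) + (1 - lam) * F.cross_entropy(logits, y_b)
```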
Enforce the encoded state after a layer to be binarized. This might come with robustness to small changes of the input.
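A minimal sketch of one way to binarize activations while keeping them trainable, using a straight-through gradient estimator (this particular estimator is an assumed implementation choice):

```python
# Hard 0/1 binarization in the forward pass; gradients pass through unchanged.
import torch

class Binarize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return (x > 0).float()  # hard binary code

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output      # straight-through estimator

def binarize(x):
    return Binarize.apply(x)
```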
Vector quantization as done in the VQ-VAE paper can be seen as a combination of filter compression and binarization.
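A minimal sketch of the VQ-VAE-style quantization step, where each encoder output vector is replaced by its nearest codebook entry (the straight-through gradient copy follows the paper; variable names and shapes are placeholders):

```python
# Nearest-neighbour codebook lookup as in VQ-VAE.
import torch

def quantize(z, codebook):
    # z: (N, D) encoder outputs, codebook: (K, D) learned embeddings.
    dists = torch.cdist(z, codebook)   # (N, K) pairwise distances
    indices = dists.argmin(dim=1)      # index of nearest codebook entry
    z_q = codebook[indices]            # quantized vectors
    # Straight-through: copy gradients from z_q back to z during training.
    return z + (z_q - z).detach(), indices
```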