Evaluation


This wiki page lists methods and ideas that can be used to score models with respect to their robustness against adversarial attacks.

Activations

Plot a histogram of activations to get a sense of how they behave differently when feeding adversarial examples vs. normal samples.
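A minimal sketch of such a comparison, assuming a `get_activations` helper that returns an array of hidden-layer activations for a batch (framework-specific and not defined here):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_activation_histograms(get_activations, clean_batch, adv_batch, bins=50):
    """Overlay activation histograms for clean vs. adversarial inputs.
    `get_activations(batch)` is assumed to return an array of activations."""
    clean_acts = np.asarray(get_activations(clean_batch)).ravel()
    adv_acts = np.asarray(get_activations(adv_batch)).ravel()

    plt.hist(clean_acts, bins=bins, alpha=0.5, density=True, label="clean")
    plt.hist(adv_acts, bins=bins, alpha=0.5, density=True, label="adversarial")
    plt.xlabel("activation value")
    plt.ylabel("density")
    plt.legend()
    plt.show()
```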

Plotting FGSM vs. Orthogonal Vector

Plot the classification in two dimensions, where the first dimension is given by the FGSM attack direction and the second is orthogonal to it. A plot of this kind can be seen in the image below. The evaluation is based on what these individual plots look like; ideally, after some regularization, the two distinct halves would no longer appear.
(Figure: classification over the plane spanned by the FGSM direction and an orthogonal direction.)
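A minimal sketch of how such a plot could be produced, assuming a `predict` function that returns predicted class labels for a batch and a precomputed FGSM direction (the sign of the loss gradient at x); both are model-specific placeholders, not part of this repository:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_fgsm_vs_orthogonal(predict, x, fgsm_dir, radius=0.5, steps=51):
    """Classification map in the plane spanned by the FGSM direction and a
    random orthogonal direction. `predict(batch)` -> predicted class labels."""
    d1 = fgsm_dir / np.linalg.norm(fgsm_dir)
    # Random second direction, made orthogonal to d1 via Gram-Schmidt.
    d2 = np.random.randn(*d1.shape)
    d2 -= np.dot(d2.ravel(), d1.ravel()) * d1
    d2 /= np.linalg.norm(d2)

    alphas = np.linspace(-radius, radius, steps)  # steps along the FGSM direction
    betas = np.linspace(-radius, radius, steps)   # steps along the orthogonal direction
    labels = np.zeros((steps, steps), dtype=int)
    for i, a in enumerate(alphas):
        batch = np.stack([x + a * d1 + b * d2 for b in betas])
        labels[:, i] = predict(batch)

    plt.imshow(labels, origin="lower",
               extent=[-radius, radius, -radius, radius])
    plt.xlabel("FGSM direction")
    plt.ylabel("orthogonal direction")
    plt.colorbar(label="predicted class")
    plt.show()
```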

Layerwise Perturbation Graph

The layerwise perturbation graph is an idea taken from last year's winning defense paper (Google Brain uses it as well). For two inputs x (a normal image) and x* (an adversarial example generated from x), the graph plots the difference of activation outputs for every layer.

The "activation difference" for layer l is defined as
E_l(x, x^*) = \frac{\lVert f_l(x) - f_l(x^*) \rVert}{\lVert f_l(x) \rVert}
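A minimal sketch of computing these values, assuming a `layer_outputs` helper that returns the list of per-layer activations f_l(x) for an input (framework-specific and not defined here):

```python
import numpy as np

def layerwise_perturbation(layer_outputs, x, x_adv):
    """Compute E_l(x, x*) = ||f_l(x) - f_l(x*)|| / ||f_l(x)|| for every layer.
    `layer_outputs(x)` is assumed to return a list of per-layer activations."""
    acts = layer_outputs(x)
    acts_adv = layer_outputs(x_adv)
    return [np.linalg.norm(a - b) / np.linalg.norm(a)
            for a, b in zip(acts, acts_adv)]
```

The resulting list can then be plotted over the layer index to obtain a graph like the one below.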

This is a plot from the source:
(Figure: layerwise perturbation graph from the source paper.)

Our first experiments on the topic are tracked in #14.

Influence Functions

Paper: Understanding Black-box Predictions via Influence Functions. Using influence functions, the authors "trace a model's prediction through the learning algorithm and back to its training data, thereby identifying training points most responsible for a given prediction."
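For reference, the central quantity in the paper is the influence of upweighting a training point z on the loss at a test point z_test, evaluated at the empirical risk minimizer \hat{\theta} with training-loss Hessian H_{\hat{\theta}}:

\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}}) = -\nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top H_{\hat{\theta}}^{-1} \nabla_\theta L(z, \hat{\theta})

Training points with the largest influence values are the ones most responsible for the prediction; in practice the paper computes them with Hessian-vector products rather than an explicit inverse.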

Linear Combinations

Feed linear combinations of two inputs and check whether the classification around the samples is correct. Determine the distance from an image (when linearly approaching an image of another class) at which the first misclassified input occurs. Analyze how noisy the classifications along the line are.
(Figure: classification over linear combinations between a "1" and a "0" sample from the training data.) Our first experiments can be found here.
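A minimal sketch of walking along such a line, assuming a `predict` function that returns the predicted class labels for a batch (model-specific, not defined here):

```python
import numpy as np

def classify_along_line(predict, x0, x1, steps=101):
    """Classify (1 - t) * x0 + t * x1 for t in [0, 1] and report the first
    point at which the prediction differs from the prediction at t = 0."""
    ts = np.linspace(0.0, 1.0, steps)
    batch = np.stack([(1 - t) * x0 + t * x1 for t in ts])
    labels = np.asarray(predict(batch))
    changed = np.flatnonzero(labels != labels[0])
    first_flip = ts[changed[0]] if changed.size else None
    return ts, labels, first_flip
```

The distance to the first label change is then first_flip * ||x1 - x0||, and counting how often the labels switch along the line gives a rough measure of how noisy the classifications are.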

Goodfellow presents a similar analysis here and shows that the classification works just fine in most directions, except for a few. Therefore, the linear combination method might not be efficient at spotting vulnerabilities of a model.