The implementation of DejaVu and the baselines will be made public after publication.
The datasets A, B, C, D are public at https://www.dropbox.com/sh/ist4ojr03e2oeuw/AAD5NkpAFg1nOI2Ttug3h2qja?dl=0.
In each dataset, graph.yml or graphs/*.yml are the FDGs, metrics.csv contains the metrics, and faults.csv contains the failures (including ground truths). FDG.pkl is a pickle of the FDG object, which contains all the above data.
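As an illustration of the dataset layout just described, the following minimal sketch locates the files of one dataset directory. The helper name `locate_dataset_files` and its return structure are hypothetical. The sketch does not unpickle FDG.pkl, since doing so presumably requires DejaVu's own class definitions, which are not assumed to be available here.

```python
from pathlib import Path
from typing import Dict, List, Union


def locate_dataset_files(root: Union[str, Path]) -> Dict[str, object]:
    """Locate the files of one dataset directory (A, B, C, or D).

    Hypothetical helper: it only checks paths and parses nothing.
    """
    base = Path(root)
    single_fdg = base / "graph.yml"
    # A dataset stores its FDGs either in graph.yml or in graphs/*.yml.
    if single_fdg.is_file():
        fdg_files: List[Path] = [single_fdg]
    else:
        # glob() on a missing directory simply yields nothing.
        fdg_files = sorted((base / "graphs").glob("*.yml"))
    required = {
        "metrics": base / "metrics.csv",  # metrics
        "faults": base / "faults.csv",    # failures, including ground truths
        "fdg_pickle": base / "FDG.pkl",   # pickled FDG object
    }
    # Report which of the single-file artifacts are absent.
    missing = [name for name, path in required.items() if not path.is_file()]
    return {"fdg_yml": fdg_files, **required, "missing": missing}
```

Calling it on a downloaded dataset directory returns the FDG files and lists any of metrics.csv, faults.csv, or FDG.pkl that could not be found.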
The official Train-Ticket repository is at https://github.com/fudanselab/train-ticket. Our scripts will be made public after publication.
Since the DejaVu model is trained on historical failures, it is straightforward to interpret how it diagnoses a given failure by identifying the historical failures from which it learns to localize the root causes.
Therefore, we propose a pairwise failure similarity function based on the aggregated features extracted by the DejaVu model.
Compared with raw metrics, the extracted features have much lower dimension and contain little of the irrelevant information that the DejaVu model ignores.
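To make the idea of a pairwise similarity over aggregated features concrete, here is a minimal sketch that compares two failures' feature vectors. It assumes cosine similarity as the measure, which this passage does not specify, and the function name `cosine_similarity` is illustrative only.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two failures' aggregated feature vectors.

    Assumption: cosine is used purely for illustration; the actual
    similarity function may differ.
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)
```

Because the extracted features are low-dimensional, such a comparison is cheap even against a large history of failures.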
However, computing failure similarity is nontrivial because of DejaVu's generalizability.
For example, suppose that the features are
To solve this problem, we calculate similarities based on failure classes rather than single failure units.
As shown in \cref{fig:local-interpretation}, for each failure unit of an incoming failure
For an incoming failure, we calculate its similarity to each historical failure and recommend the top-k most similar ones to engineers. We believe that our model learns to localize the root causes from these similar historical failures. Furthermore, engineers can directly refer to the failure tickets of these historical failures during diagnosis and mitigation. Note that, because the similarity calculation is imperfect, some of the most similar historical failures may belong to failure classes different from the localization result; we discard such historical failures.
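The retrieval step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not DejaVu's actual implementation. All names (`Failure`, `class_similarity`, `recommend`) are hypothetical. Each failure is assumed to map every failure class to the aggregated feature vectors of its failure units. The per-class score is assumed to be the cosine similarity of the class-mean vectors, averaged over shared classes. Historical failures whose root-cause class differs from the localization result are discarded before the top k are taken.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def _cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot / (na * nb)


def _mean(vectors: List[Vector]) -> List[float]:
    # Element-wise mean of the unit features within one failure class.
    return [sum(col) / len(vectors) for col in zip(*vectors)]


@dataclass
class Failure:
    failure_id: str
    # Hypothetical layout: failure class -> aggregated features of its units.
    features: Dict[str, List[Vector]]
    # Ground truth for historical failures, localization result otherwise.
    root_cause_class: str


def class_similarity(x: Failure, y: Failure) -> float:
    """Average per-class cosine similarity over the classes both failures share."""
    shared = [c for c in x.features
              if x.features[c] and y.features.get(c)]
    if not shared:
        return 0.0
    scores = [_cosine(_mean(x.features[c]), _mean(y.features[c])) for c in shared]
    return sum(scores) / len(scores)


def recommend(incoming: Failure, history: List[Failure],
              k: int) -> List[Tuple[str, float]]:
    """Top-k similar historical failures whose class matches the localization."""
    candidates = [(h.failure_id, class_similarity(incoming, h))
                  for h in history
                  if h.root_cause_class == incoming.root_cause_class]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:k]
```

Filtering before ranking guarantees up to k recommendations of the matching class. Filtering after ranking would instead return fewer than k when mismatched failures crowd the top of the list.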