A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Two Examples of Useful, Non-Robust Features

Goh, Gabriel

doi:10.23915/distill.00019.3

The elusive non-robust useful features, however, seem conspicuously absent in the above plot. Fortunately, we can construct such features by strategically combining elements of this basis.

It is thus surprising that the experiments of Madry et al. (with deterministic perturbations) do distinguish between the non-robust useful features generated from ensembles and those generated from containments. A succinct definition of a robust feature that separates these two worlds does not yet exist, and remains an open problem for the machine learning community.

Response Summary: The construction of explicit non-robust features is very interesting and makes progress towards the challenge of visualizing some of the useful non-robust features detected by our experiments. We also agree that non-robust features arising as “distractors” are indeed not precluded by our theoretical framework, even if they are precluded by our experiments. This simple theoretical framework sufficed for reasoning about and predicting the outcomes of our experiments. (We also presented a theoretical setting where we can analyze things fully rigorously in Section 4 of our paper.) However, this comment rightly identifies finding a more comprehensive definition of feature as an important future research direction.

Response: These experiments (visualizing the robustness and usefulness of different linear features) are very interesting! They both further corroborate the existence of useful, non-robust features and make progress towards visualizing what these non-robust features actually look like.

We also appreciate the point made by the provided construction of non-robust features (as defined in our theoretical framework) that are combinations of useful+robust and useless+non-robust features. Our theoretical framework indeed permits such a scenario, even if, as the commenter already notes, our experimental results do not. (In this sense, the experimental results and our main takeaway are actually stronger than our theoretical framework technically captures.) Specifically, in such a scenario, during the construction of the $\widehat{\mathcal{D}}_{det}$ dataset, only the non-robust and useless term of the feature would be flipped, while the robust term would still point to the original label. Thus, a classifier trained on such a dataset would associate the predictive robust feature with the wrong label and would thus not generalize on the test set. In contrast, our experiments show that classifiers trained on $\widehat{\mathcal{D}}_{det}$ do generalize.
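The scenario described above can be illustrated with a small numerical sketch. This is a toy construction and not the code behind our experiments: the two-coordinate data, the feature weights `w`, the $\ell_2$ budget `eps`, and the mean-difference classifier are all assumptions chosen for illustration. The feature is the sum of a useful, robust term (`x1`) and a useless, non-robust term (`10 * x2`); the relabeling step mimics the $\widehat{\mathcal{D}}_{det}$ construction in the binary case, where the target label is simply the opposite of the original one.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Toy binary data (assumed): x1 carries the label, x2 is label-independent noise."""
    y = rng.choice([-1.0, 1.0], size=n)
    x1 = y + 0.3 * rng.standard_normal(n)   # useful and robust: margin of about 1
    x2 = 0.05 * rng.standard_normal(n)      # useless: independent of y, tiny scale
    return np.stack([x1, x2], axis=1), y

# Hypothetical combined feature f(x) = w @ x = x1 + 10 * x2.
w = np.array([1.0, 10.0])
eps = 0.5  # assumed l2 perturbation budget

X, y = sample(5000)

# Usefulness E[y f(x)], and its worst case over an l2 ball of radius eps.
# For a linear feature the worst case subtracts eps * ||w||.
useful = np.mean(y * (X @ w))
robust_useful = useful - eps * np.linalg.norm(w)

# D_det-style dataset: target t = -y, perturb x towards t along the feature,
# then relabel the perturbed input with t.
t = -y
X_det = X + eps * t[:, None] * w / np.linalg.norm(w)

# In the new dataset the full feature agrees with t, but its robust term does not.
corr_feature = np.mean(t * (X_det @ w))
corr_robust_term = np.mean(t * X_det[:, 0])

# Simplest possible classifier, used here only for illustration:
# the mean-difference direction E[t * x].
w_hat = (t[:, None] * X_det).mean(axis=0)

# Evaluate on clean, correctly labeled test data.
X_test, y_test = sample(5000)
test_acc = np.mean(np.sign(X_test @ w_hat) == y_test)

print(f"usefulness of f:            {useful:+.2f}")
print(f"robust usefulness of f:     {robust_useful:+.2f}")
print(f"corr(f, t) in D_det:        {corr_feature:+.2f}")
print(f"corr(robust term, t):       {corr_robust_term:+.2f}")
print(f"clean test accuracy:        {test_acc:.3f}")
```

In this sketch the perturbation moves almost entirely along the `x2` coordinate, so `x1` keeps pointing to the original label and becomes anti-correlated with the new one. The trained classifier therefore puts a negative weight on `x1` and falls well below chance on clean test data, which is the failure to generalize that this scenario would predict and that our experiments do not exhibit.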

Overall, our focus while developing our theoretical framework was on enabling us to formally describe and predict the outcomes of our experiments. As the comment points out, putting forth a theoretical framework that captures non-robust features in a very precise way is an important future research direction in itself.

@article{goh2019a,
  author = {Goh, Gabriel},
  title = {A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Two Examples of Useful, Non-Robust Features},
  journal = {Distill},
  year = {2019},
  note = {https://distill.pub/2019/advex-bugs-discussion/response-3},
  doi = {10.23915/distill.00019.3}
}

[ilyas2019adversarial] Adversarial examples are not bugs, they are features
Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B. and Madry, A., 2019. arXiv preprint arXiv:1905.02175.
