"The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization" Paper Review

2020/09/08
Ho Seong Lee (hoya012)
Cognex Deep Learning Lab
Research Engineer
SNUAI 8th | The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization

Contents
• Introduction
• Related Work
• New Benchmarks
• DeepAugment
• Experiments
• Conclusion

Introduction
The human visual system is robust, but existing vision models are not.
• Humans can deal with many forms of corruption, such as blur, pixel noise, and abstract changes in
structure and style.
• Achieving robustness is essential in safety-critical and accuracy-critical applications.
[Figure: predictions on four versions of a dog image. Top row: Dog! Dog! Dog! Dog! Bottom row: Dog! starfish! baseball! drumstick!]

Introduction
Most work on robustness methods for vision has focused on adversarial examples.
• Several studies have begun to standardize and broaden the study of robustness.
• The first step is to establish benchmark datasets and evaluation metrics.

Related Works
Pioneers of robustness in ML - Dan Hendrycks (My Favorite Researcher..)
Reference: https://people.eecs.berkeley.edu/~hendrycks/

Related Works
Pioneers of robustness in ML – Madry Lab
Reference: http://madry-lab.ml/

Related Works
Pioneers of robustness in ML – Bethge Lab
Reference: http://bethgelab.org/

Related Works
Benchmarking Neural Network Robustness to Common Corruptions and Perturbations
• Creates the ImageNet-C and ImageNet-P test sets and benchmarks.
[Figure: example images from ImageNet-C and ImageNet-P]

Related Works
Benchmarking Neural Network Robustness to Common Corruptions and Perturbations
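• Robustness Metrics
ImageNet-C: $\mathrm{CE}_c^f = \left(\sum_{s=1}^{5} E_{s,c}^f\right) \Big/ \left(\sum_{s=1}^{5} E_{s,c}^{\mathrm{AlexNet}}\right)$, $\mathrm{mCE} = \mathrm{average}_c\left(\mathrm{CE}_c^f\right)$
* $E_{s,c}^f$ is the top-1 error of classifier $f$ on corruption type $c$ at each level of severity s (1 ≤ s ≤ 5).
ImageNet-P: mean Flip Rate (mFR), the average over perturbation types of the probability that the prediction
changes between consecutive frames of a perturbation sequence, normalized by AlexNet's value.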
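The ImageNet-C metric, mean Corruption Error (mCE), sums a model's error over the five severity levels of each corruption type, divides by AlexNet's summed error for that type, and averages over types. A minimal sketch of that arithmetic follows; the corruption names are only two of the benchmark's types, and every error rate below is a made-up illustrative number, not a measured result.

```python
from statistics import mean

def corruption_error(model_errors, alexnet_errors):
    """CE_c: error summed over severities 1..5, normalized by AlexNet's summed error."""
    return sum(model_errors) / sum(alexnet_errors)

def mean_corruption_error(model, alexnet):
    """mCE: average of CE_c over all corruption types present in `model`."""
    return mean(corruption_error(model[c], alexnet[c]) for c in model)

# Hypothetical top-1 error rates at severities 1..5 (illustrative, not measured).
alexnet = {
    "gaussian_noise": [0.70, 0.80, 0.90, 0.95, 0.98],
    "motion_blur": [0.60, 0.70, 0.80, 0.90, 0.95],
}
model = {
    "gaussian_noise": [0.35, 0.40, 0.45, 0.475, 0.49],
    "motion_blur": [0.30, 0.35, 0.40, 0.45, 0.475],
}

mce = mean_corruption_error(model, alexnet)
print(f"mCE = {mce:.3f}")
```

An mCE below 1 means the model is more robust than the AlexNet reference on average; the normalization keeps harder corruption types from dominating the mean.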

Related Works
Natural Adversarial Examples
• Introduces natural adversarial examples and creates ImageNet-A, a test set of 7,500 images (200 classes).
• Download numerous images related to an ImageNet class from the websites iNaturalist and Flickr.
• Delete the images that ResNet-50 classifies correctly. Finally, select a subset of high-quality images.
[Figure: example images from ImageNet-A]
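The filtering step above can be sketched as follows. This is only an illustration of the idea (keep what the reference classifier gets wrong); `predict` is a hypothetical stand-in for a pretrained ResNet-50, and the image ids and labels are invented.

```python
def adversarially_filter(candidates, predict):
    """Keep only the (image, label) pairs that the reference classifier gets wrong."""
    return [(image, label) for image, label in candidates if predict(image) != label]

# Hypothetical stand-in for a pretrained ResNet-50: maps an image id to a predicted label.
fake_predictions = {
    "img_0": "dragonfly",
    "img_1": "manhole cover",
    "img_2": "squirrel",
}

def predict(image):
    return fake_predictions[image]

# Downloaded candidates with their ground-truth labels (invented for illustration).
candidates = [("img_0", "dragonfly"), ("img_1", "dragonfly"), ("img_2", "squirrel")]

hard_examples = adversarially_filter(candidates, predict)
print(hard_examples)
```

The manual selection of high-quality images that follows the filtering is not modeled here.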

Related Works
AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty
• Proposes a technique to improve the robustness and uncertainty estimates of image classifiers.
• Uses a Jensen-Shannon (JS) divergence consistency loss.
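A rough sketch of the consistency term: given the predicted class distributions for a clean image and two augmented versions of it, the JS divergence is the average KL divergence of each distribution to their mixture. The function names and the example distributions are assumptions made for illustration; in training this term would be added to the usual classification loss.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_consistency(p_clean, p_aug1, p_aug2):
    """Jensen-Shannon divergence among three predicted class distributions."""
    dists = [p_clean, p_aug1, p_aug2]
    # Mixture distribution: element-wise mean of the three predictions.
    mixture = [sum(col) / 3 for col in zip(*dists)]
    return sum(kl(p, mixture) for p in dists) / 3

# Hypothetical softmax outputs for one clean image and two augmented copies.
p_clean = [0.7, 0.2, 0.1]
p_aug1 = [0.6, 0.3, 0.1]
p_aug2 = [0.5, 0.3, 0.2]
print(js_consistency(p_clean, p_aug1, p_aug2))
```

The loss is zero only when all three predictions agree, so minimizing it pushes the classifier toward consistent outputs across augmentations.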

New Benchmarks
Seven robustness hypotheses
• Larger Models: increasing model size improves robustness.
• Self-Attention: adding self-attention layers to models improves robustness.
• Diverse Data Augmentation: robustness can increase through data augmentation.
• Pretraining: pretraining on larger and more diverse datasets improves robustness.
• Texture Bias: convolutional networks are biased towards texture, which harms robustness.
• Only IID Accuracy Matters: accuracy on independent and identically distributed test data entirely
determines natural robustness.
• Synthetic ≠ Natural: synthetic robustness interventions including diverse data augmentations do not help
with robustness on naturally occurring distribution shifts.

New Benchmarks
Introduce three new robustness benchmarks
• It has been difficult to arbitrate these hypotheses because existing robustness datasets preclude the
possibility of controlled experiments by varying multiple aspects simultaneously.
• To address these issues and test the seven hypotheses outlined above, the authors create new test sets.

New Benchmarks
ImageNet-Renditions (ImageNet-R)
• A test set of 30,000 images containing various renditions (e.g., paintings, embroidery) of 200 ImageNet
object classes. The rendition styles (“Painting”, “Toy”) are not ImageNet-R’s classes.
• The original ImageNet dataset discouraged such images, since annotators were instructed to collect “photos
only, no painting, no drawings, etc.” (Deng, 2012). The authors do the opposite.

New Benchmarks
StreetView StoreFronts (SVSF)
• Contains business storefront images taken from Google Street View (20 classes).
• Investigate natural shifts in the image capture process using metadata (e.g., location, year, camera type).
• Create one training set (200K) and five test sets (10K each). The training set and one in-distribution test
set use images taken in US/Mexico/Canada during 2019 with the “new” camera. Unfortunately, the dataset is unreleased.
• The other four are out-of-distribution test sets: “2017”, “2018”, “France”, “old camera”.

New Benchmarks
DeepFashion Remixed
• Changes in camera operation can cause shifts in attributes such as object size, object occlusion, camera
viewpoint, and camera zoom.
• To measure this, create a multi-label training set (48K) and 8 out-of-distribution test sets (121K in total).
Training / in-distribution: medium scale, medium occlusion, side/back viewpoint, no zoom-in.
Out-of-distribution: small and large scale, minimal and heavy occlusion, frontal and not-worn viewpoints,
medium and large zoom-in.

DeepAugment
New data augmentation technique: DeepAugment
• In order to explore the Diverse Data Augmentation hypothesis, introduce a new data augmentation.
• Pass an image through an image-to-image network (such as an autoencoder or super-resolution network).
• But rather than processing the image normally, distort the internal weights and activations by applying
randomly sampled operations (zeroing, negating, convolving, transposing, applying activation functions, and so on).
• This creates diverse but semantically consistent images.
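The idea can be illustrated with a toy example. This is not the authors' implementation: the "network" here is a hypothetical two-layer linear map on four pixels, and only two of the listed operations (zeroing and negating weights) are applied; the names `perturb` and `deep_augment` and the `rate` parameter are assumptions made for the sketch.

```python
import random

def matvec(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def perturb(weights, rng, rate):
    """Return a copy of `weights` where each entry is, with probability `rate`,
    either zeroed or negated. The input matrix is left unchanged."""
    out = []
    for row in weights:
        new_row = []
        for w in row:
            if rng.random() < rate:
                w = rng.choice([0.0, -w])
            new_row.append(w)
        out.append(new_row)
    return out

def deep_augment(image, encoder, decoder, rng, rate=0.2):
    """Run `image` through the network with freshly distorted weights."""
    hidden = matvec(perturb(encoder, rng, rate), image)
    return matvec(perturb(decoder, rng, rate), hidden)

# Toy image-to-image network: identity encoder and decoder on 4 "pixels".
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
image = [0.1, 0.5, 0.9, 0.3]

augmented = deep_augment(image, identity, identity, random.Random(0))
print(augmented)
```

Because the weights are re-sampled on every call, the same input yields a different distorted output each time; a real image-to-image network would give much richer variations than this linear toy.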

DeepAugment

Experiments
Experimental Setup

Experiments
Experimental Results: ImageNet-R
• ImageNet-200: The original ImageNet test set restricted to ImageNet-R’s 200 classes.
• Pretraining → narrows the IID/OOD gap only slightly.
• Self-Attention → widens the IID/OOD gap.
• Diverse Data Augmentation, Larger Models → narrow the IID/OOD gap significantly!
[Table: Error Rate (↓) per method on IID (ImageNet-200) and OOD (ImageNet-R) data, with Diff annotations]
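To make the comparison concrete, here is a small sketch of how an IID/OOD gap and its change relative to a baseline could be computed, assuming the gap is defined as OOD error minus IID error. The error rates are hypothetical placeholders, not values from the paper's table.

```python
def iid_ood_gap(iid_error, ood_error):
    """Gap between out-of-distribution and in-distribution error (percentage points)."""
    return ood_error - iid_error

def gap_change(method, baseline):
    """Change in gap versus the baseline; negative means the method narrows the gap."""
    return iid_ood_gap(*method) - iid_ood_gap(*baseline)

# Hypothetical (IID error %, OOD error %) pairs, for illustration only.
baseline = (8.0, 64.0)
with_augmentation = (8.0, 58.0)

change = gap_change(with_augmentation, baseline)
print(f"gap change vs. baseline: {change:+.1f} points")
```

Comparing gaps rather than raw OOD error separates real robustness gains from improvements that only reflect better IID accuracy.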

Experiments
Experimental Results: StreetView StoreFronts
• No method helps much on country shift, where error rates roughly double across the board.
• Images captured in France contain noticeably different architecture styles and storefront designs.
• The authors were unable to find conspicuous and consistent indicators of the camera and year, which may
explain why the evaluated methods are relatively insensitive to these shifts.
• Data augmentation primarily helps combat texture bias, as with ImageNet-R.
• But existing augmentations are not diverse enough to capture high-level semantic shifts such as building
architecture.
Error Rate (↓)

Experiments
Experimental Results: DeepFashion Remixed
• All evaluated methods have an average OOD mAP that is close to the baseline.
• DFR’s size and occlusion shifts hurt performance the most.
• Nothing substantially improved OOD performance beyond what is explained by IID performance, so here
it would appear that Only IID Accuracy Matters.
mAP scores (↑)

Experiments
Experimental Results: ImageNet-C
• DeepAugment + AugMix → attains the state-of-the-art result.
• Evidence for Larger Models, Self-Attention, Diverse Data Augmentation, Pretraining, and Texture Bias.
• Evidence against Only IID Accuracy Matters.
Error Rate, mCE (↓)

Experiments
Experimental Results: Real Blurry Images
• ImageNet-C uses various synthetic corruptions, which differ from real-world corruptions.
• Collect a small dataset of 1,000 real-world blurry images and evaluate various models.
• Everything that helped in ImageNet-C was also helpful in Real Blurry Images.
• Evidence against Synthetic ≠ Natural.
Error Rate (↓)

Conclusion
• Introduce three new benchmarks, ImageNet-R, SVSF, and DFR. (+ Real Blurry Images)
• Introduce new data augmentation technique DeepAugment.
• With these benchmarks, evaluate seven robustness hypotheses.
• It seems that robustness has many faces (it is multivariate).
• If so, the research community should prioritize creating new robustness methods.