Designing Network Design Spaces
Ilija Radosavovic, et al., “Designing Network Design Spaces”
3rd May, 2020
PR12 Paper Review
JinWon Lee
Samsung Electronics
Designing Network Design Spaces
Introduction
• Over the past several years better architectures have resulted in
considerable progress in a wide range of visual recognition tasks.
▸ Ex) VGG, ResNet, MobileNet, EfficientNet, etc.
• While manual network design has led to large advances, finding well-
optimized networks manually can be challenging, especially as the
number of design choices increases.
• A popular approach to address this limitation is neural architecture
search (NAS).
• However, it does not enable discovery of network design principles
that deepen our understanding and allow us to generalize to new
settings.
Introduction
• In this work, the authors present a new network design paradigm
that combines the advantages of manual design and NAS.
• Instead of focusing on designing individual network instances, they
design design spaces that parametrize populations of networks.
Exploring Randomly Wired Neural Networks for Image Recognition (PR-155)
• Design a Network Generator, not an Individual Network!
Introduction
• The authors start with a relatively unconstrained design space they call
AnyNet and apply a human-in-the-loop methodology to arrive at a
low-dimensional design space consisting of simple “regular”
networks, RegNet.
• The RegNet design space generalizes to various compute regimes,
schedule lengths, and network block types.
• They analyze the RegNet design space and arrive at interesting
findings that do not match the current practice of network design.
Tools for Design Space Design
• Rather than designing or searching for a single best model under
specific settings, the authors study the behavior of populations of
models.
• They rely on the concept of network design spaces introduced by
Radosavovic et al., “On Network Design Spaces for Visual
Recognition,” ICCV 2019.
• The core idea of the paper is that we can quantify the quality of a design
space by sampling a set of models from that design space and
characterizing the resulting model error distribution.
Tools for Design Space Design
• To obtain a distribution of models, sample and train n models from a
design space.
• A primary tool for analyzing design space quality is the error
empirical distribution function (EDF). The error EDF of n models with
errors $e_i$ is given by:

$$F(e) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[e_i < e]$$

• $F(e)$ gives the fraction of models with error less than $e$.
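
As a quick illustration (not from the slides), a minimal NumPy/Matplotlib sketch of computing and plotting error EDFs; the model errors below are synthetic placeholders, not results from the paper:

```python
import numpy as np
import matplotlib.pyplot as plt

def error_edf(errors, e_grid):
    """F(e) = (1/n) * sum_i 1[e_i < e]: fraction of models with error below e."""
    errors = np.asarray(errors)
    return np.array([(errors < e).mean() for e in e_grid])

# Placeholder errors for two hypothetical design spaces (n models each).
rng = np.random.default_rng(0)
errors_a = rng.normal(45.0, 3.0, size=500)   # e.g., top-1 error (%)
errors_b = rng.normal(43.0, 2.5, size=500)

e_grid = np.linspace(35, 55, 200)
plt.plot(e_grid, error_edf(errors_a, e_grid), label="design space A")
plt.plot(e_grid, error_edf(errors_b, e_grid), label="design space B")
plt.xlabel("error")
plt.ylabel("fraction of models with error < e")
plt.legend()
plt.show()
```

A curve that rises earlier and more steeply indicates a design space with a larger fraction of good models.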
Tools for Design Space Design
• Given a population of trained models, we can plot and analyze
various network properties versus network error.
• For these plots, an empirical bootstrap is applied to estimate the
likely range in which the best models fall.
The blue shaded regions are ranges containing the best models with 95% confidence, and the black vertical line
indicates the most likely best value.
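
A generic sketch of the empirical bootstrap behind these plots; the resample fraction (25%) and round count are assumptions here, hedged rather than taken verbatim from the slides:

```python
import numpy as np

def bootstrap_best_range(x, err, n_boot=10_000, frac=0.25, alpha=0.05, seed=0):
    """Empirical bootstrap for the parameter value x of the best (min-error) model.

    Repeatedly resample a fraction of the (x, error) pairs with replacement,
    record the x of the lowest-error model in each resample, and return the
    (1 - alpha) range plus the most likely best value (the median).
    """
    rng = np.random.default_rng(seed)
    x, err = np.asarray(x), np.asarray(err)
    k = max(1, int(frac * len(x)))
    best_x = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, len(x), size=k)     # sample with replacement
        best_x[b] = x[idx[np.argmin(err[idx])]]   # x of the best model in this resample
    lo, hi = np.quantile(best_x, [alpha / 2, 1 - alpha / 2])
    return lo, hi, np.median(best_x)
```

The returned (lo, hi) interval corresponds to the shaded region in the plots, and the median to the vertical line.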
Tools for Design Space Design
• To summarize:
1. generate distributions of models obtained by sampling and
training n models from a design space.
2. compute and plot error EDFs to summarize design space quality.
3. visualize various properties of a design space and use an
empirical bootstrap to gain insight.
4. use these insights to refine the design space.
The AnyNet Design Space
• Given an input image, a network consists of a simple stem, followed by the
network body that performs the bulk of the computation, and a final network
head that predicts the output classes.
• Keep the stem and head fixed and as simple as possible, and instead focus on
the structure of the network body.
• The network body consists of 4 stages operating at progressively reduced
resolution; each stage consists of a sequence of identical blocks.
AnyNetX
• Most of the experiments use the standard residual bottleneck block
with group convolution. They refer to this as the X block, and the
AnyNet design space built on it as AnyNetX.
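
A PyTorch sketch of an X block as described above: a residual bottleneck whose 3×3 conv is grouped, with bottleneck ratio b meaning the 3×3 conv runs at width w_out/b. The layer ordering and the 1×1 projection shortcut follow common ResNet practice and are assumptions, not details from the slides:

```python
import torch
import torch.nn as nn

class XBlock(nn.Module):
    """Residual bottleneck with group conv: 1x1 -> 3x3 (grouped) -> 1x1 + skip."""

    def __init__(self, w_in, w_out, stride=1, b=1, g=16):
        super().__init__()
        w_b = w_out // b                 # bottleneck width (b = bottleneck ratio)
        assert w_b % g == 0, "group width g must divide the bottleneck width"
        self.f = nn.Sequential(
            nn.Conv2d(w_in, w_b, 1, bias=False),
            nn.BatchNorm2d(w_b), nn.ReLU(inplace=True),
            nn.Conv2d(w_b, w_b, 3, stride=stride, padding=1,
                      groups=w_b // g, bias=False),
            nn.BatchNorm2d(w_b), nn.ReLU(inplace=True),
            nn.Conv2d(w_b, w_out, 1, bias=False),
            nn.BatchNorm2d(w_out),
        )
        # 1x1 projection shortcut when the shape changes (assumed, as in ResNet).
        self.proj = nn.Sequential(
            nn.Conv2d(w_in, w_out, 1, stride=stride, bias=False),
            nn.BatchNorm2d(w_out),
        ) if (w_in != w_out or stride != 1) else None
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        shortcut = x if self.proj is None else self.proj(x)
        return self.relu(self.f(x) + shortcut)

# e.g., a stride-2 block entering a stage of width 128, with b=1 and group width 16:
# y = XBlock(64, 128, stride=2, b=1, g=16)(torch.randn(1, 64, 56, 56))
```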
AnyNetX
• The AnyNetX design space has 16 degrees of freedom as each
network consists of 4 stages and each stage 𝑖 has 4 parameters: the
number of blocks 𝑑𝑖, block width 𝑤𝑖, bottleneck ratio 𝑏𝑖, and group
width 𝑔𝑖.
• Resolution 𝑟 = 224 (fixed)
• To obtain valid models, we perform log-uniform sampling of 𝑑𝑖 ≤ 16,
𝑤𝑖 ≤ 1024 and divisible by 8, 𝑏𝑖 ∈ {1, 2, 4}, and 𝑔𝑖 ∈ {1, 2, … , 32}.
• There are $(16 \cdot 128 \cdot 3 \cdot 6)^4 \approx 10^{18}$ possible model configurations in
the AnyNetX design space.
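
A rough Python sketch of sampling one AnyNetX configuration under the stated ranges. The discrete log-uniform sampler and the omission of validity checks (e.g., that the group width divides the bottleneck width, or that the model lands in a target flop regime) are simplifications:

```python
import math
import random

def log_uniform_choice(values, rng):
    """Pick from a discrete set, approximately uniform in log-space."""
    logs = [math.log(v) for v in values]
    u = rng.uniform(min(logs), max(logs))
    return min(values, key=lambda v: abs(math.log(v) - u))

def sample_anynetx(rng=random.Random(0)):
    """One AnyNetX configuration: 4 stages, each with (d_i, w_i, b_i, g_i)."""
    stages = []
    for _ in range(4):
        stages.append(dict(
            d=log_uniform_choice(range(1, 17), rng),       # d_i <= 16
            w=log_uniform_choice(range(8, 1025, 8), rng),  # w_i <= 1024, multiple of 8
            b=log_uniform_choice([1, 2, 4], rng),          # bottleneck ratio b_i
            g=log_uniform_choice(range(1, 33), rng),       # group width g_i
        ))
    return stages
```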
Design Space Design Aims
1. To simplify the structure of the design.
2. To improve the interpretability of the design space.
3. To improve or maintain the design space quality.
4. To maintain model diversity in the design space.
AnyNetX(A, B, C)
• Refer to the unconstrained AnyNet design space as AnyNetXA.
▸ Shared bottleneck ratio $b_i = b$ for all stages i: AnyNetXA → AnyNetXB.
▸ Shared group width $g_i = g$ for all stages i: AnyNetXB → AnyNetXC.
AnyNetX(D, E)
• AnyNetXD is from examining typical network structures of both good
and bad networks from AnyNetXC.
▸ A pattern emerges: good networks have increasing widths.
• AnyNetXD constraint: AnyNetXC & 𝑤𝑖+1 ≥ 𝑤𝑖.
• In addition to the stage widths $w_i$ increasing with i, the stage depths $d_i$
likewise tend to increase for the best models.
• AnyNetXE constraint: AnyNetXD & 𝑑𝑖+1 ≥ 𝑑𝑖.
• Finally, the constraints on $w_i$ and $d_i$ each reduce the design space by $4!$,
with a cumulative reduction of $O(10^7)$ from AnyNetXA.
AnyNetX(D, E)
Linear Fits
• To gain further insight into the model structure, the best 20 models
from AnyNetXE are shown in a single plot.
• While there is significant variance in the individual models (gray
curves), in the aggregate a pattern emerges.
• In particular, in the same plot they show the line $w_j = 48 \cdot (j + 1)$ for
$0 \le j \le 20$.
Linear Fits
• Inspired by AnyNetXD and AnyNetXE, a linear parameterization of
block widths is introduced as follows:

$$u_j = w_0 + w_a \cdot j \quad \text{for } 0 \le j < d, \quad w_0 > 0, \; w_a > 0$$
• To quantize $u_j$, an additional parameter $w_m$ is introduced, and $s_j$ is defined so that:

$$u_j = w_0 \cdot w_m^{s_j}$$

• Then, to quantize $u_j$, simply round $s_j$ (denoted $\lfloor s_j \rceil$) and compute the quantized per-block widths $w_j$ via:

$$w_j = w_0 \cdot w_m^{\lfloor s_j \rceil}$$

• Converting the per-block widths $w_j$ to the per-stage format gives stage widths $w_i$ and stage depths $d_i$:

$$w_i = w_0 \cdot w_m^{i}, \qquad d_i = \sum_j \mathbf{1}[\lfloor s_j \rceil = i]$$
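
Putting the parameterization together, a minimal NumPy sketch that generates per-stage widths and depths from (d, w_0, w_a, w_m); snapping widths to multiples of 8 mirrors the AnyNetX width constraint and is an assumption about the exact rounding used:

```python
import numpy as np

def regnet_widths(d, w_0, w_a, w_m, q=8):
    """Generate per-stage (widths, depths) from the linear parameterization."""
    j = np.arange(d)
    u = w_0 + w_a * j                       # u_j = w_0 + w_a * j
    s = np.log(u / w_0) / np.log(w_m)       # solve u_j = w_0 * w_m**s_j for s_j
    w = w_0 * np.power(w_m, np.round(s))    # w_j = w_0 * w_m**round(s_j)
    w = (np.round(w / q) * q).astype(int)   # snap widths to multiples of q (assumed q=8)
    # Blocks sharing a quantized width form a stage; the counts are the depths d_i.
    widths, depths = np.unique(w, return_counts=True)
    return widths.tolist(), depths.tolist()

# Hypothetical parameters, just to show the shape of the output:
print(regnet_widths(d=16, w_0=48, w_a=36, w_m=2.5))
```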
Linear Fits
• The fitting error $e_{\text{fit}}$ is a mean log-ratio, measuring how well this linear parameterization fits the block widths of a given model.
The RegNet Design Space
• The design space of RegNet contains only simple, regular models.
▸ $d < 64$
▸ $w_0, w_a < 256$
▸ $1.5 \le w_m \le 3$
▸ $b$ and $g$ are the same as in AnyNet
• $w_m = 2$ and $w_0 = w_a$ give good performance, but to maintain
the diversity of models these constraints are not applied to the RegNet design space.
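
For illustration, a small self-contained sampler drawing one configuration from these RegNet ranges; the uniform sampling distributions are assumptions (the paper samples log-uniformly):

```python
import random

def sample_regnet(rng=random.Random(0)):
    """Draw one configuration from the RegNet parameter ranges above."""
    return dict(
        d=rng.randint(1, 63),           # depth: d < 64
        w_0=8 * rng.randint(1, 31),     # initial width: w_0 < 256, multiple of 8
        w_a=rng.uniform(8.0, 256.0),    # slope: w_a < 256
        w_m=rng.uniform(1.5, 3.0),      # width multiplier: 1.5 <= w_m <= 3
        b=rng.choice([1, 2, 4]),        # bottleneck ratio, as in AnyNet
        g=rng.choice(range(1, 33)),     # group width, as in AnyNet
    )

print(sample_regnet())
```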
Design Space Summary
Design Space Generalization
Common Design Patterns
• The deeper the model, the better the performance.
• Double the number of channels whenever the spatial activation size
is reduced.
• Skip connection is good.
• Bottleneck is good.
• Depthwise separable convolution is popular in the low-compute regime.
• Inverted bottleneck is also good.
RegNet Trends
• The depth of the best models is stable across regimes, with an optimal
depth of ~20 blocks (60 layers).
• This is in contrast to the common practice of using deeper models for
higher flop regimes.
RegNet Trends
• The best models use a bottleneck ratio 𝑏 of 1.0, which effectively
removes the bottleneck.
• The width multiplier $w_m$ of good models is ~2.5, similar but not
identical to the popular recipe of doubling widths across stages.
RegNet Trends
• The remaining parameters ($g$, $w_a$, $w_0$) increase with complexity.
Complexity Analysis
• While not a common measure of network complexity, activations can
heavily affect runtime on memory-bound hardware accelerators.
• Activations increase with the square root of flops, while parameters
increase linearly with flops.
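
A quick numeric check of this claim for a single stride-1 convolution (hypothetical layer sizes): at fixed resolution, flops and parameters grow quadratically with width while activations grow only linearly, so activations scale roughly as the square root of flops:

```python
def conv_complexity(w_in, w_out, k, r):
    """flops (multiply-adds), params, and activations for one k x k conv
    at output resolution r x r, stride 1, no bias."""
    flops = w_in * w_out * k * k * r * r
    params = w_in * w_out * k * k
    acts = w_out * r * r
    return flops, params, acts

# Doubling the width gives ~4x the flops and params but only ~2x the activations,
# i.e. activations grow like sqrt(flops) when scaling width at fixed resolution.
for w in (64, 128, 256):
    f, p, a = conv_complexity(w, w, k=3, r=14)
    print(f"w={w:3d}  flops={f:.2e}  params={p:.2e}  acts={a:.2e}")
```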
RegNetX Constrained
• Using these findings, the RegNetX design space is refined to RegNetX C:
▸ $b = 1$, $d \le 40$, and $w_m \ge 2$
▸ Parameters and activations limited following the complexity analysis
▸ Further depth limit: $12 \le d \le 28$
Alternate Design Choices
• Inverted bottleneck ($b < 1$) degrades the EDF slightly, and depthwise
conv performs even worse relative to $b = 1$ and $g \ge 1$.
• For RegNetX, a fixed resolution of 224x224 is best, even at higher flops.
• The Squeeze-and-Excitation (SE) op yields good gains: RegNetY.
Comparison to Existing Networks
• The higher flop models have a large number of blocks in the third
stage and a small number of blocks in the last stage.
• The group width 𝑔 increases with complexity, but depth 𝑑 saturates
for large models.
State of the Art Comparison: Mobile Regime
ResNeXt Comparison
EfficientNet Comparison
At low flops, EfficientNet outperforms RegNetY. At intermediate flops, RegNetY outperforms EfficientNet, and at higher flops both RegNetX and RegNetY perform better than EfficientNet.
Test Set Evaluation
Additional Ablations
• Fixed Depth
▸ Surprisingly, fixed-depth networks can match the performance of variable-depth networks for all flop regimes.
• Fewer Stages
▸ Top RegNet models at high flops have few blocks in the fourth stage, but 3-stage networks perform considerably worse.
• Inverted Bottleneck
▸ In the high-compute regime, $b < 1$ degrades results further.
Additional Ablations
• Swish vs ReLU
▸ Swish outperforms ReLU at low flops, but ReLU is better at high flops.
▸ Interestingly, if g is restricted to 1 (depthwise conv), Swish performs much better than ReLU.
Optimization Settings
• Initial learning rate and weight decay are stable across complexity regimes.
[Plots: initial learning rate and weight decay versus complexity, shown for RegNet and for EfficientNet]