[Korean] Neural Architecture Search with Reinforcement Learning

Presentation on “Neural
Architecture Search with
Reinforcement Learning”
Kiho Suh
Modulabs(모두의연구소), June 12th 2017

About Paper
• v1 posted on November 5, 2016
• Currently at v2
• Google Brain
• Barret Zoph, Quoc V. Le
• Presented as an ICLR 2017 oral presentation

Google’s AutoML
https://futurism.com/googles-new-ai-is-better-at-creating-ai-than-the-companys-engineers/ http://techm.kr/bbs/?t=154
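• A project to automate part of the machine learning development workflow
• What does the future hold for AI experts?
• This paper gives a good picture of what AutoML is trying to do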

Motivation for Architecture Search
• Designing neural networks is hard, and tuning them takes considerable effort. Even skilled practitioners need a lot of time.
• There is no single "right way" to design one.
• The paradigm has shifted from feature engineering to neural network engineering.
• Can good architectures be learned automatically?
• A deep learning model that builds deep learning models
• A first attempt in this area
Slide modified from Zoph, Le

Related Work
• Hyperparameter optimization
• Modern neuro-evolution algorithms
• Neural Architecture Search
• Learning to learn or Meta-Learning
Learning to learn by gradient descent by gradient descent, Andrychowicz et al. 2016

Neural Architecture Search
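• Key idea: the structure and connectivity of a neural network can be specified by a configuration string (this is how Caffe does it).
- Configuration of one layer: [“Filter Width: 5”, “Filter Height: 3”, “Num Filters: 24”]
• The idea is to use an RNN (the “Controller”) to generate this string, which specifies a neural network architecture.
• The network described by the string is trained to convergence. At convergence, the accuracy of the generated string is known.
• The generated architecture (the “Child Network”) is trained to find out how well it does on a validation set.
• The Controller's parameters are updated with reinforcement learning, based on the child model's accuracy.
• Because this is reinforcement learning, the accuracy serves as the reward signal.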
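As an illustration of the configuration-string idea, here is a minimal sketch of how layers could be flattened into the token format shown in the example above. `LayerConfig` and `network_to_tokens` are hypothetical names made up for this sketch; they are not from the paper or from Caffe.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LayerConfig:
    """Hypothetical container for one convolutional layer's hyperparameters."""
    filter_width: int
    filter_height: int
    num_filters: int

    def to_tokens(self) -> List[str]:
        # Mirror the slide's example format: one "Name: value" token per choice.
        return [
            f"Filter Width: {self.filter_width}",
            f"Filter Height: {self.filter_height}",
            f"Num Filters: {self.num_filters}",
        ]


def network_to_tokens(layers: List[LayerConfig]) -> List[str]:
    """Flatten a list of layers into one token sequence describing the network."""
    tokens: List[str] = []
    for layer in layers:
        tokens.extend(layer.to_tokens())
    return tokens


# Two-layer example; the second layer's values are arbitrary.
example = network_to_tokens([LayerConfig(5, 3, 24), LayerConfig(3, 3, 36)])
```

Each token corresponds to one decision the Controller would have to emit, so a network with more layers is simply a longer sequence.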

Neural Architecture Search
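1. The Controller RNN generates the architectural hyperparameters of a neural network. It stops generating once the number of layers exceeds a certain value. This value follows a schedule that increases as training progresses.
2. Once the Controller RNN has generated an architecture, a neural network with that architecture is built and trained.
3. At convergence, the network's accuracy on the validation set is recorded.
4. The Controller RNN's parameters θc are optimized to maximize the expected validation accuracy of the proposed architectures.
5. θc is updated using policy gradient.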
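The five steps can be sketched as a toy loop. This is a simplified illustration under loud assumptions: the "controller" is just a table of independent softmax logits instead of an RNN, and `toy_validation_accuracy` is a made-up score that replaces actually training a child network. All names and values here are hypothetical.

```python
import math
import random

# Hypothetical search space: one categorical choice per hyperparameter.
CHOICES = [[1, 3, 5, 7], [1, 3, 5, 7], [24, 36, 48, 64]]


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_architecture(theta, rng):
    """Step 1: sample one option index per hyperparameter from the policy."""
    return [rng.choices(range(len(row)), weights=softmax(row))[0] for row in theta]


def toy_validation_accuracy(actions):
    """Steps 2-3 stand-in: a made-up score instead of training a child network."""
    target = [1, 1, 3]  # arbitrary "best" choices, for illustration only
    return sum(a == t for a, t in zip(actions, target)) / len(target)


def reinforce_step(theta, actions, reward, lr):
    """Steps 4-5: move theta along grad log pi(actions) scaled by the reward."""
    for row, action in zip(theta, actions):
        probs = softmax(row)
        for i in range(len(row)):
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            row[i] += lr * reward * grad_log_prob


def run_search(steps=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * len(options) for options in CHOICES]  # plays the role of theta_c
    for _ in range(steps):
        actions = sample_architecture(theta, rng)
        reward = toy_validation_accuracy(actions)
        reinforce_step(theta, actions, reward, lr)
    return theta


theta = run_search()
# Most probable value for each hyperparameter after the search.
best = [CHOICES[k][max(range(len(row)), key=row.__getitem__)]
        for k, row in enumerate(theta)]
```

In the real method each reward costs a full training run of a child network, which is why the number of controller updates is so limited and the search is distributed.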

Neural Architecture Search for Convolutional Networks
[Figure: Controller RNN with embedding and softmax classifier layers]
Slide from Zoph, Le

Training with REINFORCE
• The architecture predicted by the controller RNN is viewed as a sequence of actions.
• R: the accuracy of the generated architecture, used as the reward signal
• REINFORCE was chosen because it is the simplest method and is easier to tune than alternatives, including Q-learning. At this scale, heavy tuning is impractical.
• For a single-layer CNN, T is 3: a1 is the filter height, a2 is the filter width, a3 is the number of filters.
• The reward signal R is non-differentiable, so θc has to be updated with policy gradient.
Standard REINFORCE Update Rule
• T: the number of hyperparameters the Controller has to predict to design a new neural network architecture
• θc: the parameters of the Controller RNN
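For reference, the objective and the REINFORCE gradient that these annotations refer to can be written as below. The notation follows Zoph and Le's paper: a_{1:T} is the sequence of T actions (hyperparameter choices), R is the reward, and θc are the Controller parameters.

```latex
J(\theta_c) = \mathbb{E}_{P(a_{1:T};\,\theta_c)}\left[ R \right]

\nabla_{\theta_c} J(\theta_c)
  = \sum_{t=1}^{T} \mathbb{E}_{P(a_{1:T};\,\theta_c)}
    \left[ \nabla_{\theta_c} \log P\!\left(a_t \mid a_{(t-1):1};\, \theta_c\right) R \right]
```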

Training with REINFORCE
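• Sample many architectures at one time and average them across the mini batch.
• b: a baseline that reduces the high variance of this estimate (it can be seen as a reference value to compare the reward against)
• m: the number of models in the mini batch
• T: the number of hyperparameters the Controller has to predict to design a new neural network architecture
• θc: the parameters of the Controller RNN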
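The mini-batch estimate with a baseline, as given in Zoph and Le's paper, is shown below. R_k is the validation accuracy of the k-th sampled architecture, m is the number of architectures in the batch, and b is the baseline; the paper describes b as an exponential moving average of previous architecture accuracies.

```latex
\nabla_{\theta_c} J(\theta_c) \approx
  \frac{1}{m} \sum_{k=1}^{m} \sum_{t=1}^{T}
  \nabla_{\theta_c} \log P\!\left(a_t \mid a_{(t-1):1};\, \theta_c\right) \left( R_k - b \right)
```

As long as b does not depend on the current action, subtracting it leaves the estimate unbiased while lowering its variance.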

Distributed Training
• The controller parameters are stored on S parameter servers (shards), which send the parameters to K controller replicas.
• Each controller replica samples m architectures and runs the resulting child models in parallel.
• The accuracy of each child model is recorded to compute the gradients with respect to θc, which are sent back to the parameter servers.
• 10 parameter servers
• 800 GPUs. Training a child model to obtain its accuracy takes several hours.
• 13,000 to 15,000 models were trained. It took Google 2 to 3 weeks.
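The pull, evaluate, push flow can be sketched on a single machine. This is a toy under stated assumptions: one parameter-server shard, replicas that run one after another instead of asynchronously, a single categorical hyperparameter, a made-up accuracy table in `train_child`, and the batch mean used as the baseline. None of these names or numbers come from the paper.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

NUM_CHOICES = 4  # hypothetical: one categorical hyperparameter with 4 options


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ParameterServer:
    """Hypothetical single-shard store for the controller parameters theta_c."""

    def __init__(self):
        self.theta = [0.0] * NUM_CHOICES

    def pull(self):
        return list(self.theta)

    def push(self, gradient, lr=0.5):
        # Gradient ascent: the controller maximizes expected reward.
        self.theta = [p + lr * g for p, g in zip(self.theta, gradient)]


def train_child(action):
    """Stand-in for training a child model: a made-up accuracy per choice."""
    return [0.60, 0.70, 0.90, 0.65][action]


def replica_step(server, m, rng):
    """One replica: pull theta_c, evaluate m sampled children in parallel, push."""
    probs = softmax(server.pull())
    actions = rng.choices(range(NUM_CHOICES), weights=probs, k=m)
    with ThreadPoolExecutor(max_workers=m) as pool:
        rewards = list(pool.map(train_child, actions))
    baseline = sum(rewards) / m  # assumption: batch mean as the baseline
    gradient = [0.0] * NUM_CHOICES
    for action, reward in zip(actions, rewards):
        for i in range(NUM_CHOICES):
            indicator = 1.0 if i == action else 0.0
            gradient[i] += (indicator - probs[i]) * (reward - baseline) / m
    server.push(gradient)
    return rewards


server = ParameterServer()
rng = random.Random(0)
for step in range(200):
    for replica in range(3):  # K = 3 replicas, run sequentially in this sketch
        replica_step(server, m=8, rng=rng)
```

In the actual setup the replicas run concurrently and each child occupies its own GPU for hours, so the parameter servers receive gradient updates asynchronously.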

Overview of Experiments
• Applied to CIFAR-10 and Penn Treebank, two of the most widely used datasets in deep learning.
• A CNN is generated for CIFAR-10, and an RNN cell is generated for Penn Treebank.
• State of the art on Penn Treebank; nearly state of the art on CIFAR-10 with a smaller and faster network.
• The cell found on Penn Treebank outperformed LSTM on other language modeling datasets and on translation.
https://www.cs.toronto.edu/~kriz/cifar.html https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

Neural Architecture Search for CIFAR-10
• Neural Architecture Search is applied to predict a convolutional network for CIFAR-10.
• For a fixed number of layers (15, 20, 13), the following are predicted:
- Filter width/height
- Stride width/height
- Number of filters

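• There are no skip connections (so the approach is limited).
• Filter height and width are picked by a softmax over the allowed values [1,3,5,7]. They are not continuous, so they are selected from a list.
• One layer = 128 unit RNN (pretty small)
• Skip connections are widely used in current models (e.g. ResNet, DenseNet).
• Choices per layer: filter height [1,3,5,7], filter width [1,3,5,7], stride height [1,2,3], stride width [1,2,3], number of filters [24,36,48,64]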
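A short sketch of what this discrete search space looks like in code. The assignment of the five value lists to hyperparameter names is an assumption (filter height, filter width, stride height, stride width, number of filters, in that order), and uniform sampling stands in for the controller's learned softmax.

```python
import random

# Per-layer choices; the name-to-list mapping is assumed from the list order.
SEARCH_SPACE = {
    "filter_height": [1, 3, 5, 7],
    "filter_width": [1, 3, 5, 7],
    "stride_height": [1, 2, 3],
    "stride_width": [1, 2, 3],
    "num_filters": [24, 36, 48, 64],
}


def configs_per_layer(space):
    """Number of distinct layer configurations in the space."""
    count = 1
    for options in space.values():
        count *= len(options)
    return count


def sample_layer(space, rng):
    """Uniform sampling, standing in for the controller's learned softmax."""
    return {name: rng.choice(options) for name, options in space.items()}


def sample_network(space, num_layers, rng):
    return [sample_layer(space, rng) for _ in range(num_layers)]


rng = random.Random(0)
network = sample_network(SEARCH_SPACE, num_layers=15, rng=rng)
per_layer = configs_per_layer(SEARCH_SPACE)  # 4 * 4 * 3 * 3 * 4 = 576
total = per_layer ** 15                      # 15 layers, no skip connections
```

Even without skip connections the count grows exponentially with depth, which is why exhaustive enumeration is not an option.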

CIFAR-10 Prediction Method
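• The search space is widened to include branching and residual connections.
• To widen the search space, the Controller also predicts skip connections.
• At layer N, N-1 sigmoids are gathered to decide which previous layers should be fed into layer N as inputs.
• If a layer has no input layers, the minibatch of images is used as its input.
• At the final layer, all layer outputs that have not been connected are concatenated.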
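The connection rules above can be sketched as follows. This is a simplified illustration: `skip_probs` is assumed to already hold the sigmoid outputs (in the real controller they come from the RNN's hidden states), and the function names are hypothetical.

```python
import random


def sample_skip_connections(skip_probs, rng):
    """skip_probs[n][j] is the sigmoid output for 'layer j feeds layer n' (j < n).

    Returns inputs[n]: the list of earlier layer indices feeding layer n,
    or ["image"] when no earlier layer was selected.
    """
    inputs = []
    for probs in skip_probs:
        chosen = [j for j, p in enumerate(probs) if rng.random() < p]
        inputs.append(chosen if chosen else ["image"])
    return inputs


def unconnected_layers(inputs):
    """Layers whose output is not consumed by any later layer.

    These are the outputs that get concatenated at the final layer.
    """
    used = {j for layer_inputs in inputs for j in layer_inputs if j != "image"}
    return [n for n in range(len(inputs)) if n not in used]


# Probabilities of 0.0 and 1.0 make this example deterministic.
probs = [[], [1.0], [0.0, 1.0], [0.0, 0.0, 0.0]]
inputs = sample_skip_connections(probs, random.Random(0))
dangling = unconnected_layers(inputs)
```

With these probabilities, layer 3 selects no earlier layer and falls back to the image, while layers 2 and 3 are never consumed and would be concatenated at the end.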

Skip Connection in ResNet
The formulation of F(x) +x can be realized by feedforward neural networks with “shortcut
connections” (Fig. 2). Shortcut connections [2, 34, 49] are those skipping one or more
layers. In our case, the shortcut connections simply perform identity mapping, and
their outputs are added to the outputs of the stacked layers (Fig. 2). Identity shortcut
connections add neither extra parameter nor computational
complexity.

Weight Matrices

CIFAR-10 Experiment Details
• 100 Controller replicas were used, each training 8 child networks concurrently.
• 800 GPUs were in use at any one time.
• Reward given to the Controller is the maximum validation accuracy of the last 5 epochs squared
• Of the 50,000 training examples, 45,000 were used for training and 5,000 for validation.
• Each child model was trained for 50 epochs, which took about half a day.
• 12,800 child models were run.
• Curriculum training was used for the Controller to gradually increase the number of layers.
• The search space is huge: 10^30 to 10^40
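A small sketch of the reward computation and the GPU arithmetic from this slide. `controller_reward` is a hypothetical helper; the exponent is left as a parameter and defaults to 2 to match "squared" as stated on the slide. One GPU per child network is an assumption.

```python
def controller_reward(val_accuracies, last_n=5, power=2):
    """Maximum validation accuracy over the last `last_n` epochs, raised to `power`."""
    if not val_accuracies:
        raise ValueError("need at least one epoch of validation accuracy")
    return max(val_accuracies[-last_n:]) ** power


# Made-up per-epoch validation accuracies for one child model.
history = [0.10, 0.90, 0.50, 0.60, 0.70, 0.80, 0.75]
reward = controller_reward(history)  # only the last 5 epochs are considered

num_replicas = 100
children_per_replica = 8
gpus_in_use = num_replicas * children_per_replica  # assumes one GPU per child
```

Note that the early spike of 0.90 is ignored because it falls outside the last 5 epochs, so the reward reflects where training ended up, not a transient peak.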

DenseNet and ResNet
DenseNet, Huang et al. 2016 ResNet, He et al. 2015
DenseNets connect each layer to every other
layer in a feed-forward fashion. They alleviate the
vanishing-gradient problem, strengthen feature
propagation, encourage feature reuse, and
substantially reduce the number of parameters.

Generated Convolutional Network from Neural
Architecture Search
5% faster
Best result of evolution (Real et al. 2017): 5.4%
Best result of Q-learning (Baker et al. 2017): 6.92%

Generated Convolutional Network from Neural Architecture Search
• It likes skip connections a lot.
• It likes rectangular filters (e.g. 7 x 4 filter).
• It likes to connect to all layers after the first convolution.

Recurrent Cell Prediction Method
• A search space was created to find RNN cells similar to LSTM or GRU.
• The search space was built with the LSTM cell as a reference.

Recurrent Cell Prediction Method
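[Figure panels: Controller RNN, Cell Search Space, Created New Cell]
• The cell's computation graph is represented as a tree.
• Each node decides which activation function and which combination method to use.
• In an LSTM, the combination method is the “addition function”.
• The tree has 8 leaf nodes (2 in this example).
• The Controller RNN labels each tree node with the combination function and the activation function to use.
• The Controller RNN's role is to look at the tree and decide which functions to select.
• Once the tree is generated, the cell is actually implemented.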
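To make the tree idea concrete, here is a minimal sketch of evaluating a 2-leaf tree cell. It is heavily simplified: it works on scalars instead of vectors and weight matrices, it omits the cell state, and the option names in `COMBINE` and `ACTIVATION` are illustrative choices, not the exact set used in the paper.

```python
import math

COMBINE = {
    "add": lambda a, b: a + b,
    "elem_mult": lambda a, b: a * b,
}
ACTIVATION = {
    "identity": lambda v: v,
    "tanh": math.tanh,
    "sigmoid": lambda v: 1.0 / (1.0 + math.exp(-v)),
    "relu": lambda v: max(0.0, v),
}


def node(combine, activation, left, right):
    """One tree node: combine two inputs, then apply an activation."""
    return ACTIVATION[activation](COMBINE[combine](left, right))


def two_leaf_cell(x_t, h_prev, labels, weights):
    """Evaluate a 2-leaf tree cell on scalars.

    labels: three (combine, activation) pairs for leaf 0, leaf 1, and the root;
            these are what the controller would predict.
    weights: four scalars standing in for the weight matrices at the leaves.
    """
    w1, w2, w3, w4 = weights
    leaf0 = node(*labels[0], w1 * x_t, w2 * h_prev)
    leaf1 = node(*labels[1], w3 * x_t, w4 * h_prev)
    return node(*labels[2], leaf0, leaf1)  # new hidden state h_t


labels = [("add", "tanh"), ("elem_mult", "relu"), ("elem_mult", "sigmoid")]
h_t = two_leaf_cell(x_t=1.0, h_prev=0.0, labels=labels, weights=(1.0, 1.0, 1.0, 1.0))
```

Changing the `labels` list yields a different cell without touching the evaluation code, which is what lets the controller describe a cell as a short sequence of choices.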

Penn Treebank Experiment Details
• Run Neural Architecture Search with our cell prediction
method on the Penn Treebank language modeling dataset
• 400 Controller replicas were used, each training 1 child network.
• 400 CPUs were in use at any one time.
• 15,000 child models were used in total. (The actual search space is 10^18.)
• The Controller's reward is c/(validation perplexity)^2
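The reward formula can be written as a one-line helper. The default `c=80.0` is the value the paper reports as typical; treat it as an assumption here, since the slide does not state it.

```python
def ptb_reward(validation_perplexity, c=80.0):
    """Controller reward for language modeling: c / perplexity^2.

    Lower perplexity is better, so the reward grows as perplexity falls.
    """
    if validation_perplexity <= 0:
        raise ValueError("perplexity must be positive")
    return c / validation_perplexity ** 2


reward_at_80 = ptb_reward(80.0)
reward_at_64 = ptb_reward(64.0)
```

Squaring the perplexity makes the reward more sensitive to improvements among models that are already good.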

Penn Treebank Results
LSTM Cell Neural Architecture Search (NAS) Cell

Other Experiment Info.
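• Penn Treebank: the cell is similar to LSTM cells. It does something similar to attention.
• Google Brain also tried building a CNN for CIFAR-10 by evolution with genetic programming, but it took longer and gave worse results.
• A comparison against Bayesian optimization was also made. It is not a fair comparison because NAS used more machines, but NAS gave better results and is more scalable. Bayesian optimization is hard because it needs a matrix inversion on a very large matrix (20,000 by 20,000). NAS has no inversion, so it is easier to train.
• An end token was added at the end. In practice, the controller prefers small networks that are easy to train. It starts from short string lists and grows them.
• At first the networks were diverse because they were random. As training converges, the controller narrows down to good architectures. This can become a problem, so a little noise is added to prevent overfitting.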

Penn Treebank Results
2x as fast
Slide from Zoph, Le

RHN (Recurrent Highway Network)
Recurrent Highway Network, Zilly et al. 2016

Comparison to Random Search
• Random search can be used instead of policy gradient to find the best network.
• However, policy gradient does better.

Transfer Learning on Character Level Language
Modeling
• Took the RNN cell that we evolved on Penn Treebank and tried it on other datasets. Here, we tried it on the character level Penn Treebank dataset.
Lower is better.

Transfer Learning on Neural Machine Translation
LSTM Cell
Google Neural Machine Translation, Wu et al. 2016
Model            WMT’14 en->de Test Set BLEU
GNMT LSTM        24.1
GNMT NAS Cell    24.6
• In GNMT, the LSTM was replaced with the cell found through NAS.
• The LSTM-specific hyperparameters (learning rate, weight initialization) were not re-tuned.
• BLEU improved by 0.5 points, which is a fairly meaningful result.
• Trained for 1 week on 96 GPUs.

Designing Neural Network Architectures Using RL
• Released on November 30, 2016
• MetaQNN, a meta-modeling algorithm based on RL to automatically generate high-performing CNN architectures for a given learning task.
• The learning agent is trained to sequentially choose CNN layers using Q-Learning with an epsilon-greedy exploration strategy and experience replay.
• The agent explores a large but finite space of possible architectures and iteratively discovers designs with improved performance on the learning task.
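For contrast with the policy-gradient controller, here is a minimal sketch of the two ingredients named above: epsilon-greedy action selection and the standard one-step Q-learning update. This is generic tabular Q-learning, not MetaQNN's actual implementation; experience replay is omitted, and the state names and values are made up.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon explore uniformly, otherwise pick the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=1.0):
    """Standard one-step Q-learning update on a dict of state -> list of Q-values."""
    best_next = max(q_table[next_state]) if next_state in q_table else 0.0
    target = reward + gamma * best_next
    q_table[state][action] += alpha * (target - q_table[state][action])


rng = random.Random(0)
greedy_choice = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng)

# Hypothetical states: each state is "which layer to add next".
q_table = {"layer0": [0.0, 0.0], "layer1": [1.0, 2.0]}
q_update(q_table, "layer0", action=1, reward=0.5, next_state="layer1")
```

Here the agent learns a value for each layer choice, whereas the NAS controller learns a probability distribution over choices directly.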

UNDERSTANDING DEEP LEARNING REQUIRES RETHINKING
GENERALIZATION (Zhang et al. 2016)
1. The effective capacity of neural networks is large enough for a brute-force memorization of the entire data set.
2. Even optimization on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels.
3. Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged.
We still understand very little about the “generalization” of neural network models.

Reference
• https://arxiv.org/pdf/1611.01578.pdf
• https://arxiv.org/abs/1609.08144.pdf
• https://arxiv.org/abs/1602.07261.pdf
• https://openreview.net/pdf?id=Sy8gdB9xx
• https://futurism.com/googles-new-ai-is-better-at-creating-ai-than-the-companys-engineers/
• http://www.iclr.cc/doku.php?id=iclr2017:schedule
• http://techm.kr/bbs/?t=154
• https://medium.com/intuitionmachine/deep-learning-the-unreasonable-effectiveness-of-randomness-14d5aef13f87
• http://rll.berkeley.edu/deeprlcourse/docs/quoc_barret.pdf
• http://www.1-4-5.net/~dmm/ml/nnrnn.pdf
• https://blog.acolyer.org/2017/05/10/neural-architecture-search-with-reinforcement-learning/

Appendix 1: REINFORCE in depth
likelihood parametrized by θ
Note by David Meyer

Appendix 1: REINFORCE in depth
Note by David Meyer

Appendix 2: Proof of the Policy Gradient Theorem
http://ufal.mff.cuni.cz/~straka/courses/npﬂ114/2016/
sutton-bookdraft2016sep.pdf
Page 269 in Reinforcement Learning: An
Introduction by Richard S. Sutton and
Andrew G. Barto

Appendix 2: Large-Scale Evolution of Image Classifiers
Large-Scale Evolution of Image Classifiers, Real et al. 2017

Appendix 3: Inception V4
Inception V4, Szegedy et al. 2016
