© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Alexander Smola
AWS Machine Learning
Personalization and Scalable Deep Learning with MXNET
Outline
• Personalization
• Latent Variable Models
• User Engagement and Return Times
• Deep Recommender Systems
• MXNet
• Basic concepts
• Launching a cluster in a minute
• ImageNet for beginners
Personalization
Latent Variable Models
• Temporal sequence of observations

Purchases, likes, app use, e-mails, ad clicks, queries, ratings
• Latent state to explain behavior
• Clusters (navigational, informational queries in search)
• Topics (interest distributions for users over time)
• Kalman Filter (trajectory and location modeling)
Action
Explanation
Latent Variable Models
• Temporal sequence of observations

Purchases, likes, app use, e-mails, ad clicks, queries, ratings
• Latent state to explain behavior
• Clusters (navigational, informational queries in search)
• Topics (interest distributions for users over time)
• Kalman Filter (trajectory and location modeling)
Action
Explanation
Are the parametric models really true?
Latent Variable Models
• Temporal sequence of observations

Purchases, likes, app use, e-mails, ad clicks, queries, ratings
• Latent state to explain behavior
• Nonparametric model / spectral
• Use data to determine shape
• Sidestep approximate inference
[Diagram: observed sequence x, latent state chain h]

$h_t = f(x_{t-1}, h_{t-1})$
$x_t = g(x_{t-1}, h_t)$
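As a minimal sketch (not the paper's code), the state update above in plain numpy, with tanh standing in for the learned transition f and the shapes chosen arbitrarily:

import numpy as np

def rnn_step(x_prev, h_prev, W_h, W_x, b):
    # h_t = f(x_{t-1}, h_{t-1}): a learned nonlinear map of the
    # previous observation and the previous latent state
    return np.tanh(W_h @ h_prev + W_x @ x_prev + b)

# e.g. a 4-dimensional latent state driven by 3-dimensional observations
rng = np.random.default_rng(0)
W_h, W_x, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4)
h = np.zeros(4)
for x in rng.normal(size=(10, 3)):   # a short sequence of observations
    h = rnn_step(x, h, W_h, W_x, b)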
Latent Variable Models
• Temporal sequence of observations

Purchases, likes, app use, e-mails, ad clicks, queries, ratings
• Latent state to explain behavior
• Plain deep network = RNN
• Deep network with attention = LSTM / GRU …

(learn when to update state, how to read out)
Long Short Term Memory
Hochreiter and Schmidhuber, 1997

$i_t = \sigma(W_i(x_t, h_t) + b_i)$
$f_t = \sigma(W_f(x_t, h_t) + b_f)$
$z_{t+1} = f_t \cdot z_t + i_t \cdot \tanh(W_z(x_t, h_t) + b_z)$
$o_t = \sigma(W_o(x_t, h_t, z_{t+1}) + b_o)$
$h_{t+1} = o_t \cdot \tanh z_{t+1}$
Long Short Term Memory
Hochreiter and Schmidhuber, 1997

$(z_{t+1}, h_{t+1}, o_t) = \mathrm{LSTM}(z_t, h_t, x_t)$
Treat it as a black box
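A minimal numpy sketch of the cell equations above (the dict-of-weights layout and the concatenation convention are illustrative assumptions):

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def lstm_step(x_t, h_t, z_t, W, b):
    # One step of the LSTM equations above; W and b hold the
    # parameters of the four gates, applied to (x_t, h_t) stacked.
    v = np.concatenate([x_t, h_t])
    i = sigmoid(W["i"] @ v + b["i"])                      # input gate
    f = sigmoid(W["f"] @ v + b["f"])                      # forget gate
    z_next = f * z_t + i * np.tanh(W["z"] @ v + b["z"])   # cell state
    # the output gate also sees the fresh cell state z_{t+1}
    o = sigmoid(W["o"] @ np.concatenate([v, z_next]) + b["o"])
    h_next = o * np.tanh(z_next)
    return z_next, h_next, o                              # the "black box"

# shapes, for input dim X and state dim H: W["i"], W["f"], W["z"] are
# (H, X+H); W["o"] is (H, X+2H); each b entry is (H,)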
User Engagement
[Illustration: app opens at 9:01, 8:55, 11:50, 12:30; next return: never? next week? (app frame: toutiao.com)]
User Engagement Modeling
• User engagement is gradual
• Daily average users?
• Weekly average users?
• Number of active users?
• Number of users?
• Abandonment is passive
• The last time you tweeted? Pin? Like? Skype?
• Churn models assume active abandonment 

(insurance, phone, bank)
User Engagement Modeling
• User engagement is gradual
• Model user returns
• Context of activity
• World events (elections, Super Bowl, …)
• User habits (morning reader, night owl)
• Previous reading behavior

(poor quality content will discourage return)
Survival Analysis 101
• Model population where something dramatic happens
• Cancer patients (death; efficacy of a drug)
• Atoms (radioactive decay)
• Japanese women (marriage)
• Users (opening the app)
• Survival probability
$\Pr(t_{\mathrm{survival}} \geq T) = \exp\left(-\int_0^T \lambda(t)\,dt\right)$

where $\lambda(t)$ is the hazard rate function.
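A minimal sketch (function names and the grid approximation are assumptions): computing the survival probability from an arbitrary hazard rate by numerical integration.

import numpy as np

def survival_prob(hazard, T, dt=0.01):
    # Pr(t_survival >= T) = exp(-integral_0^T lambda(t) dt),
    # approximated on a regular grid
    t = np.arange(0.0, T, dt)
    return float(np.exp(-np.sum(hazard(t)) * dt))

# a constant hazard recovers the exponential distribution:
# survival_prob(lambda t: 0.1 * np.ones_like(t), T=10.0) is about exp(-1)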
Session Model
• User activity is a sequence of times
• $b_i$ when the app is opened
• $e_i$ when the app is closed
• In between: wait for the user to return
• Model the likelihood of user activity (see the sketch below)
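For intuition, a sketch of the likelihood of one between-session gap under a hazard rate $\lambda$ (a standard temporal point-process term; the function names and grid are assumptions, not the paper's code):

import numpy as np

def gap_log_likelihood(hazard, e_i, b_next, dt=0.01):
    # log-likelihood of the user returning at b_next after closing at e_i:
    # log lambda(b_next) - integral from e_i to b_next of lambda(t) dt
    t = np.arange(e_i, b_next, dt)
    return float(np.log(hazard(b_next)) - np.sum(hazard(t)) * dt)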
[Fig. 1: A personalized time-aware architecture for survival analysis. One-hot UserIDs and TimeIDs feed lookup tables that produce user and time embeddings; together with external features, two hidden layers predict the rate. Given the data from the previous session, the model predicts the (quantized) rate values for the next session.]
Personalized LSTM
[Fig. 2: Unfolded LSTM network over three sessions (s-2, s-1, s), each with an input layer and two hidden layers. The input vector for session s is the concatenation of the user embedding, the time slot embedding, and the …]
• LSTM for global state update
• LSTM for individual state update
• Update both of them
• Learn using backprop and SGD
Jing and Smola, WSDM’17
Perplexity (quality of prediction)
[Fig. 6: Histogram of the time between two sessions (next visit time, in hours); Toutiao on top, Last.fm on the bottom. The small bump around 24 hours corresponds to users with a daily habit of using the app at the same time.]
• global constant model: a static model with only one parameter, assuming that the rate is constant throughout the time frame for all users.
• global+user constant model: a static model that assumes the rate is an additive function of a global constant and a user-specific constant.
• piecewise constant model: a more flexible static model that learns a parameter for each discretized bin.
• Hawkes process: a self-exciting point process that respects past sessions.
• integrated model: a combined model with all the above components.
• DNN: a model that assumes the rate is a function of time, user, and session features, parameterized by a deep neural network.
• LSTM: a recurrent neural network that incorporates past activities.
For completeness, we also report the result for Cox's model, where the hazard rate is given by

$\lambda_u(t) = \lambda_0(t)\,\exp(\langle w, x_u(t) \rangle)$  (28)

Perplexity is defined as

$\mathrm{perp} = \exp\Big(-\frac{1}{M} \sum_{u=1}^{m} \sum_{i=1}^{m_u} \log p(\{b_i, e_i\}; \theta)\Big)$  (29)
where M is the total number of sessions in the test set. The
lower the value, the better the model is at explaining the
test data. In other words, perplexity measures the amount
of surprise in a user’s behavior relative to our prediction.
Obviously a good model can predict well, hence there will
be less surprise.
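A minimal sketch of the metric (the array name is an assumption):

import numpy as np

def perplexity(log_probs):
    # perp = exp(-(1/M) * sum of per-session log-likelihoods), where
    # log_probs[j] = log p({b_i, e_i}; theta) for test session j
    M = len(log_probs)
    return float(np.exp(-np.sum(log_probs) / M))

# lower is better: a model assigning higher likelihood to held-out
# sessions is less "surprised" by the user's behavior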
6.6 Model Comparison
The summarized results are shown in Table 1. As can be seen
from the table, there is a big gap between linear models
and the two deep models. The Cox model is inferior to
our integrated model and significantly worse than the deep
networks.
model                 Toutiao   Last.fm
Cox Model             27.13     28.31
global constant       45.29     59.98
user constant         28.74     45.44
piecewise constant    26.88     26.12
Hawkes process        22.58     30.80
integrated model      21.56     26.06
DNN                   18.87     20.62
LSTM                  18.10     19.80

Table 1: Average perplexity evaluated on the test set for the different models.
Perplexity (quality of prediction)

[Fig. 7: Top row: average test perplexity as a function of the fraction of sessions (%) for the global constant, user constant, piecewise constant, Hawkes process, integrated, Cox, DNN, and LSTM models. Bottom row: relative improvements (%) of the LSTM over the integrated and Cox models. Left column: Toutiao data; right column: Last.fm.]

Jing and Smola, WSDM’17
[Fig. 9: Six randomly sampled learned predictive rate functions, three from Toutiao (left) and three from Last.fm (right). Each pair of figures shows the instantaneous rate $\lambda(t)$ (purple), the survival function $\Pr(\text{return} \geq t)$ (red), and the actual return time (blue).]
Recommender Systems
Recommender systems, not recommender archaeology
[Diagram: a users × items × time cube split at NOW: use the past to predict the future; don't predict the past (that's archaeology).]
The Netflix contest
got it wrong …
Getting it right
[Diagram: one LSTM tracks the change in user taste and expertise; a second LSTM tracks the change in item perception and novelty.]
Wu et al, WSDM’17
Prizes
Sanity Check
Deep Learning with MXNet
Caffe
Torch
Theano
TensorFlow
CNTK
Keras
Paddle
(image - Banksy/wikipedia)
Why yet another deep networks tool?
Why yet another deep networks tool?
• Frugality & resource efficiency

Engineered for cheap GPUs with smaller memory, slow networks
• Speed
• Linear scaling with #machines and #GPUs
• High efficiency on single machine, too (C++ backend)
• Simplicity

Mix declarative and imperative code
[Diagram: several frontend languages on top of one backend. A single implementation of the backend system and common operators gives a performance guarantee regardless of which frontend language is used.]
Imperative Programs
import numpy as np
a = np.ones(10)
b = np.ones(10) * 2
c = b * a
print(c)
d = c + 1

Easy to tweak with Python code

Pro
• Straightforward and flexible
• Takes advantage of language-native features (loops, conditionals, debugger)
Con
• Hard to optimize
Declarative Programs
A = Variable('A')
B = Variable('B')
C = B * A
D = C + 1
f = compile(D)
d = f(A=np.ones(10),
      B=np.ones(10)*2)

Pro
• More chances for optimization
• Works across different languages
Con
• Less flexible

[Graph: A and B feed ⨉ to produce C; C and the constant 1 feed + to produce D. C can share memory with D, because C is deleted later.]
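For reference, a sketch of the same computation in MXNet's symbolic API (MXNet 0.x style; the concrete arrays are placeholders):

import mxnet as mx

A = mx.sym.Variable('A')
B = mx.sym.Variable('B')
C = B * A
D = C + 1

# bind the declarative graph to concrete arrays, then execute it
ex = D.bind(ctx=mx.cpu(),
            args={'A': mx.nd.ones((10,)),
                  'B': mx.nd.ones((10,)) * 2})
print(ex.forward()[0].asnumpy())   # prints an array of 3s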
Imperative vs. Declarative for Deep Learning
The computational graph of the deep architecture (forward and backward passes) needs heavy optimization, which fits declarative programs. Updates and interactions with the graph need mutation and more language-native features, which suits imperative programs:
• Iteration loops
• Parameter updates: $w \leftarrow w - \eta\,\partial_w f(w)$
• Beam search
• Feature extraction …
Mixed Style Training Loop in MXNet
executor = neuralnetwork.bind()
for i in range(3):
    train_iter.reset()
    for dbatch in train_iter:
        args["data"][:] = dbatch.data[0]
        args["softmax_label"][:] = dbatch.label[0]
        executor.forward(is_train=True)
        executor.backward()
        for key in update_keys:
            args[key] -= learning_rate * grads[key]

• The executor is bound from a declarative program that describes the network
• Imperative NDArrays can be set as input nodes of the graph
• Imperative parameter update on the GPU
Mixed API for Quick Extensions
• Runtime switching between different graphs depending on the input
• Useful for sequence modeling (bucketing variable-length sentences) and image size reshaping
• Uses imperative code in Python; roughly 10 lines of additional Python code (see the sketch below)
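A sketch of bucketing with MXNet's BucketingModule (the toy network in sym_gen is a placeholder; a real use would unroll an RNN to the bucket's sequence length):

import mxnet as mx

def sym_gen(seq_len):
    # return a graph for sequences of length seq_len; parameters are
    # shared across buckets because the layer names coincide
    data = mx.sym.Variable('data')
    label = mx.sym.Variable('softmax_label')
    net = mx.sym.FullyConnected(data=data, num_hidden=128, name='fc')
    net = mx.sym.SoftmaxOutput(data=net, label=label, name='softmax')
    return net, ('data',), ('softmax_label',)

# one module, many graphs: a graph per bucket is built on demand
mod = mx.mod.BucketingModule(sym_gen, default_bucket_key=60,
                             context=mx.cpu())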
3D Image Construction
Deep3D
100 lines of Python code
https://github.com/piiswrong/deep3d
Distributed Deep Learning
## train
num_gpus = 4
gpus = [mx.gpu(i) for i in range(num_gpus)]
model = mx.model.FeedForward(
    ctx           = gpus,
    symbol        = softmax,
    num_round     = 20,
    learning_rate = 0.01,
    momentum      = 0.9,
    wd            = 0.00001)
model.fit(X = train, eval_data = val,
          batch_end_callback = mx.callback.Speedometer(batch_size=batch_size))

2 lines for multi-GPU
Scaling on p2.16xlarge

[Charts: average throughput per GPU and aggregate throughput as a function of the number of GPUs, with GPU-GPU synchronization, for AlexNet, Inception-v3, and ResNet-50; aggregate speedups reach 108x and 75x.]
Demo
Getting Started
• Website

http://mxnet.io/
• GitHub repository

git clone --recursive git@github.com:dmlc/mxnet.git
• Docker

docker pull dmlc/mxnet
• Amazon AWS Deep Learning AMI (with other toolkits & anaconda)

https://aws.amazon.com/marketplace/pp/B01M0AXXQB

http://bit.ly/deepami
• CloudFormation Template

https://github.com/dmlc/mxnet/tree/master/tools/cfn

http://bit.ly/deepcfn
Acknowledgements
• User engagement

How Jing, Chao-Yuan Wu
• Temporal recommenders

Chao-Yuan Wu, Alex Beutel, Amr Ahmed
• MXNet & Deep Learning AMI

Mu Li, Tianqi Chen, Bing Xu, Eric Xie, Joseph Spisak,
Naveen Swamy, Anirudh Subramanian and many more …
We are hiring
{smola, thakerb, spisakj}@amazon.com
