Presented: J. White Bear
An Online Spark Pipeline:
Semi-Supervised Learning and Online
Retraining with Spark Streaming.
IBM
Spark Technology Center
• Founded in 2015.
• Location:
– Physical: 505 Howard St., San Francisco CA
– Web: http://spark.tc Twitter: @apachespark_tc
• Mission:
– Contribute intellectual and technical capital to the Apache
Spark community.
– Make the core technology enterprise- and cloud-ready.
– Build data science skills to drive intelligence into business
applications — http://bigdatauniversity.com
• Key statistics:
– About 50 developers, co-located with 25 IBM designers.
– Major contributions to Apache Spark: http://jiras.spark.tc
– Apache SystemML is now a top-level Apache project!
– Founding member of UC Berkeley AMPLab and RISE
Lab
– Member of R Consortium and Scala Center
About Me
Education
• University of Michigan: Computer Science
• Databases, machine learning/computational biology, cryptography
• University of California San Francisco; University of California, Berkeley
• Multi-objective optimization, computational biology/bioinformatics
• McGill University
• Machine learning, multi-objective optimization for path planning, cryptography
Industry
• IBM
• Amazon
• TeraGrid
• Pfizer
• Research at UC Berkeley, Purdue University, and every university I ever attended :)
Fun Facts (?)
I love research for its own sake. I like robots, helping to cure diseases, advocating for social change and reform, and breaking encryptions. Also, most activities involving the ocean. And I usually hate taking pictures. :)
Why do we need online semi-supervised learning?
• Malware/fraud detection
• Stock prediction
• Real-time diagnosis
• NLP/speech recognition
• Real-time visual processing
Why online learning?
• Incremental/Sequential learning for real-time use
cases
• Predicts/learns from the newest data
• Optimized for low latency in real-time cases
• Often used in conjunction with streaming data
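A minimal sketch of what these bullets mean in Spark terms, using MLlib's StreamingLinearRegressionWithSGD; the stream directories and feature dimension are placeholder assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OnlineRegression {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("OnlineRegression"), Seconds(1))

    // Each file dropped into these (placeholder) directories becomes a micro-batch.
    val train = ssc.textFileStream("hdfs:///streams/train").map(LabeledPoint.parse)
    val test  = ssc.textFileStream("hdfs:///streams/test").map(LabeledPoint.parse)

    // The model is updated incrementally on every training micro-batch,
    // so predictions always reflect the newest data.
    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(3))
      .setStepSize(0.1)

    model.trainOn(train)
    model.predictOnValues(test.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```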
Semi-supervised learning
• Smaller training sets, less labeled data
• Classifying or predicting unlabeled data
• The underlying distribution P(x,y) may or may not be known/stable (PAC learning)
• How/Why do we bring these ideas together?
Key Challenges
• Acquiring a sufficient ratio of labeled training data
– Batch learning is much slower, and updating a model is more difficult
• Maintaining accuracy
– Concept drift
– Catastrophic events
• Meeting latency requirements, particularly in real-time scenarios like autonomous vehicles
• What about unlabeled data?....
The Framework: Hybrid Online and Semi-Supervised Learning
(Architecture diagram. Real-time streaming data enters a real-time consumption stage with feature extraction: feature hashing via the hashing trick, and matrix factorization. An Out-of-Core component keeps an RDD of all data not in cache and processes it in the background. An AllReduce component analyzes mini-batch data in memory using distributed node averaging over a spanning tree protocol (simulated in the POC), with mini-batches of RDDs distributed to nodes; it can trigger model updates and initiates the semi-supervised state, with sequential or subsampled mini-batches. A model retraining policy draws on a model/retraining cache: a global cache stores best-performer metrics from the out-of-core path, and a local cache stores best-performer metrics from AllReduce.)
A Reliable Effective Terascale Linear Learning System. Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, John Langford.
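The feature hashing ("hashing trick") stage maps raw tokens into a fixed-width vector with no dictionary pass, which suits unbounded streams. A minimal sketch with MLlib's HashingTF; the token sequence is illustrative:

```scala
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

// Each token is hashed to an index in a 2^18-dimensional space; hash
// collisions are tolerated in exchange for constant memory on a stream.
val hashingTF = new HashingTF(1 << 18)
val features: Vector = hashingTF.transform(Seq("open", "high", "low", "volume"))
```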
Why this framework?
• Abundance of unlabeled data requires semi-supervised learning
– Real-time learning in the IoT setting
• Semi-supervised learning can improve predictions
– Empirically studied for online use cases with big data
• We need to address the challenges of incremental, sequential learning without losing valuable historical data
• We need to optimize for low latency without overly sacrificing accuracy
• You may have an online use case, but you are not guaranteed labeled data, and certainly not in a timely fashion
• Hybrid frameworks address these challenges and allow for future
data mining to understand and correct for how your data changes
over time
Online and Semi-supervised Learning: Online Constraint
$$\min_{w \in \mathbb{R}^d} \sum_{i=1}^{n} l(w^{\top} x_i;\, y_i) + \lambda R(w) \qquad (1)$$

Supervised learning over a batch.

$$\frac{1}{n} \sum_{i=1}^{n} l(w^{\top} x_i;\, y_i) + \frac{\lambda}{n} R(w) \qquad (2)$$

Online learning over a mini-batch from a streaming data set.

$$\frac{m}{n} \sum_{i=S_k}^{n} l(w^{\top} x_i;\, y_i) + \frac{\lambda}{n} R(w) \qquad (3)$$

Distributed learning over a mini-batch from a streaming data set, where $S_k$ marks the start of the $k$-th mini-batch.
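Objective (3) is minimized incrementally in practice. One common update, shown here as a hedged sketch rather than the deck's exact solver, is a stochastic gradient step per streamed example:

$$w_{t+1} = w_t - \eta_t \left( \nabla_w\, l(w_t^{\top} x_i;\, y_i) + \frac{\lambda}{n}\, \nabla_w R(w_t) \right)$$

where $\eta_t$ is the step size; the All-Reduce stage then averages the resulting per-node weights.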
Online and Semi-supervised Learning: Semi-supervised Constraint
Let $x_p$ be an example, $x_p = (x_{p1}, x_{p2}, \ldots, x_{pD}, \omega)$, where $x_p$ belongs to class $\omega$ in a $D$-dimensional space; $x_{pi}$ is the value of the $i$-th feature of the $p$-th sample.

Assume a labeled set $L$ with $n$ instances of $x$, with $\omega$ known. Assume an unlabeled set $U$ with $m$ instances of $x$, with $\omega$ unknown. Assume that the number of labeled instances, $|L|$, is less than the number of unlabeled instances, $|U|$.

$L \cup U$ represents the training set, $T$. We want to infer a hypothesis using $T$ and use this hypothesis to predict labels we have not yet seen. This is the semi-supervised learning case.
Online and Semi-supervised Learning: Semi-supervised Constraint
(Flow diagram: labeled and unlabeled data form the initial training set; online/streaming labeled and unlabeled data then feed model retraining, which again draws on both labeled and unlabeled pools.)
The Framework: Batch Component
(The architecture diagram repeats with the Batch path highlighted: the Out-of-Core component holds an RDD of all data not in cache and processes it in the background through learning modules M1…MN, with customization and policy options: a retraining policy, an amending policy, and a committee policy.)
The Framework: Batch Component
• Features
– Out-of-Core, can run as a background process
– Ensemble learning: multiple-model / co-training learning algorithms
– Amending Policy
– Retrain Policy
– Active Learning Policy
– Custom Parameters
The Framework: Batch Component
Semi-Supervised Learning
Edited by Olivier Chapelle, Bernhard Schölkopf and Alexander Zien
Learning from labeled and unlabeled data
Tom Mitchell, CMU
The Framework: Batch Component
• Multiple Model Learning/ Co-training
– Expandable learning modules: multiple learning
algorithms
– Co-training for high-dimensionality, improved
confidence
– Accounts for different biases across learning
techniques to strengthen the hypothesis
– Improves confidence estimates in practice
– Yields better results when there is significant diversity among the models
– Reduces variance and helps to avoid overfitting
The Framework: Batch Component
Given a model, $m \in M$:

$$y_m(x) = h(x) + \epsilon_m(x) \qquad (4)$$

We can get the average error over all models, $E_{\text{average}}$:

$$E_{\text{average}} = \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}_x\!\left[\epsilon_m(x)^2\right] \qquad (5)$$

Regression: model averaging of the sum of squares / prediction confidence.
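A step the slide leaves implicit, stated here under the standard assumption (not from the deck) that the per-model errors $\epsilon_m(x)$ have zero mean and are uncorrelated: the averaged committee prediction $y_{\text{com}}(x) = \frac{1}{M}\sum_{m=1}^{M} y_m(x)$ then satisfies

$$E_{\text{com}} = \frac{1}{M}\, E_{\text{average}},$$

which is the usual variance-reduction argument for model averaging; correlated errors shrink the gain.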
Multiple Model-Based Reinforcement Learning. Kenji Doya, Kazuyuki Samejima, Ken-ichi Katagiri and Mitsuo Kawato.
Vainsencher, Daniel, Shie Mannor, and Huan Xu. "Learning multiple models via regularized weighting." Advances in Neural Information Processing Systems. 2013.
The Framework: Batch Component
• Amending Policy
– Instances are added sequentially and maintain order
– Only the most ‘confident’ predictions are added
– If validation is received, inaccurate predictions are removed and retrained (within a threshold)
– Low confidence, unlabeled data are stored for future
validation or rescoring
Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the
eleventh annual conference on Computational learning theory (COLT' 98). ACM, New York, NY, USA, 92-100.
DOI=http://dx.doi.org/10.1145/279943.279962
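As a concrete sketch of the amending step above (illustrative names; the confidence rule assumes a probabilistic binary classifier, not the deck's exact policy):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Add only the most confident predictions to the training set; hold the rest
// back for future validation or rescoring.
def amend(model: LogisticRegressionModel,
          labeled: RDD[LabeledPoint],
          unlabeled: RDD[Vector],
          threshold: Double = 0.9): (RDD[LabeledPoint], RDD[Vector]) = {
  model.clearThreshold() // predict() now returns raw class probabilities
  val scored = unlabeled.map(x => (x, model.predict(x))).cache()
  val confident = scored
    .filter { case (_, p) => p >= threshold || p <= 1.0 - threshold }
    .map { case (x, p) => LabeledPoint(if (p >= threshold) 1.0 else 0.0, x) }
  val lowConfidence = scored
    .filter { case (_, p) => p < threshold && p > 1.0 - threshold }
    .map(_._1)
  (labeled.union(confident), lowConfidence)
}
```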
The Framework: Batch Component
(Diagram: the amending policy moves instances between the training batch's labeled data and a pool of unlabeled, low-confidence data.)
Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the
eleventh annual conference on Computational learning theory (COLT' 98). ACM, New York, NY, USA, 92-100.
DOI=http://dx.doi.org/10.1145/279943.279962
The Framework: Batch Component
• Retrain Policy
– Loss Minimization
• Increases in expected loss above threshold
– Accuracy
• Ratio/Number of incorrect predictions
– Schedule
• Regular retraining schedule
– Optimizations
• Limited to a window over the given data set
– Train over a specific time period to optimize for catastrophic events
and concept shifts
• Subsampling
– Very large datasets can set a subsampling methodology to improve
batch processing speed
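The triggers above compose naturally into a single predicate; a minimal sketch, with illustrative thresholds rather than the framework's actual API:

```scala
// Retrain when expected loss rises above a threshold, accuracy falls below
// one, or the regular schedule comes due.
case class RetrainPolicy(maxLoss: Double, minAccuracy: Double, periodMs: Long)

def shouldRetrain(policy: RetrainPolicy,
                  expectedLoss: Double,
                  accuracy: Double,
                  lastRetrainMs: Long,
                  nowMs: Long = System.currentTimeMillis()): Boolean =
  expectedLoss > policy.maxLoss ||
  accuracy < policy.minAccuracy ||
  nowMs - lastRetrainMs >= policy.periodMs
```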
The Framework: Batch Component
• What happens to low confidence predictions?
• Active Learning Policy
– Separate dataset holds these instances
– Human labeling and/or an oracle is queried for accurate labels
• Google has over 10,000 contractors performing this function
– https://arstechnica.com/features/2017/04/the-secret-lives-of-google-raters/
• Facebook is hiring 3000 new content monitors for a job AI cannot do
– http://www.popsci.com/Facebook-hiring-3000-content-monitors
• Netflix/Amazon are bringing in humans to improve the CTR
recommendations
– http://blog.echen.me/2014/10/07/moving-beyond-ctr-better-
recommendations-through-human-evaluation/
The Framework: Batch Component
• Custom Parameters:
– Window Size
• Specify a time or range
– Amending times
• Schedule when the amending policy runs
• By default, runs are triggered automatically by accuracy rates
– Loss threshold
• Specify maximum loss for a given loss function
– Accuracy threshold
• Specify the precision/recall rates before retraining is called on
batch
– Subsampling
• Run a random subsample to train
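Collected in one place, the knobs above might look like the following parameter bundle (names are illustrative, not a shipped API):

```scala
import scala.concurrent.duration.FiniteDuration

// Illustrative batch-component configuration mirroring the options above.
case class BatchConfig(
  windowSize: FiniteDuration,         // time or range the batch trains over
  amendEvery: Option[FiniteDuration], // None = amend automatically on accuracy
  lossThreshold: Double,              // maximum loss for the given loss function
  accuracyThreshold: Double,          // precision/recall floor before retraining
  subsampleFraction: Option[Double]   // e.g. Some(0.1) trains on a random 10% sample
)
```

A subsampleFraction of Some(0.1) would translate directly to rdd.sample(withReplacement = false, fraction = 0.1) before training.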
The Framework: All-Reduce Component
A Reliable Effective Terascale Linear Learning System. Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, John Langford.
(The architecture diagram repeats with the AllReduce component highlighted: in-memory analysis of mini-batch data using distributed node averaging over a spanning tree protocol, simulated in the POC, with mini-batches of RDDs distributed to nodes; it can trigger model updates and initiates the semi-supervised state.)
The Framework: All-Reduce Component
A Reliable Effective Terascale Linear Learning System
Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, John Langford
• SGD / L-BFGS / regression
• In-memory
• Minimal network latency
• Mini-batches of n/m instances for each node
• The head node holds an average of the weight parameters
• Can be added to existing implementations to optimize for online training
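Stock Spark can approximate the node-averaging step with treeAggregate, which combines partial results up a tree much like the spanning-tree AllReduce the POC simulates. A sketch under that assumption, not the deck's implementation:

```scala
import breeze.linalg.DenseVector
import org.apache.spark.rdd.RDD

// Average per-node weight vectors with a tree-shaped reduction.
def averageWeights(localWeights: RDD[DenseVector[Double]],
                   dim: Int): DenseVector[Double] = {
  val (sum, count) = localWeights.treeAggregate((DenseVector.zeros[Double](dim), 0L))(
    seqOp = { case ((acc, n), w) => (acc + w, n + 1) },
    combOp = { case ((a1, n1), (a2, n2)) => (a1 + a2, n1 + n2) }
  )
  sum / count.toDouble
}
```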
The Framework: Cache Component
Bringing it all together…
(The architecture diagram repeats with the Model/Retraining Cache highlighted: the global cache stores best-performer metrics from the out-of-core path, and the local cache stores best-performer metrics from AllReduce, feeding the model retraining policy.)
The Framework: Cache Component
• Model Weights & Loss
• Custom Parameters
• When the retraining policy is triggered in the monitor
– A new best performing weight vector is selected from
the cache
– The running model weights are updated and used in
real-time predictions
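An illustrative model/retraining cache, hypothetical rather than the framework's code: snapshots of weights keyed by model, from which the best performer is pulled when retraining fires:

```scala
import scala.collection.mutable

// Keep (weights, loss) snapshots per model id; "best" = lowest observed loss.
final class ModelCache {
  private val entries = mutable.Map.empty[String, (Array[Double], Double)]

  def put(id: String, weights: Array[Double], loss: Double): Unit =
    entries(id) = (weights, loss)

  def bestWeights: Option[Array[Double]] =
    if (entries.isEmpty) None else Some(entries.values.minBy(_._2)._1)
}
```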
The Framework: Monitor Component
(The architecture diagram repeats with a Monitor added on the real-time streaming path, ahead of the consumption stage, watching predictions and deciding when the model retraining policy fires.)
The Framework: Monitor Component
• Retraining in All-Reduce happens after each online pass
• Option 1: Makes real-time prediction with only local latency when
initiated (Model Retrain Policy)
– very low latency
– stable predictions
– Significant change in weights since last pass
– Loss Minimization
• Increases in expected loss above threshold
– Accuracy
• Option 2: Can opt to use the All-Reduce weights after the online pass
– Low latency
– Greater sensitivity
– Periodic queries to the cache for best weights (Model Retrain Policy)
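The choice between the two options reads as one function over the cache sketch from the Cache Component section (illustrative, with the trade-off in the comments):

```scala
// Option 1 favors stability: serve the cached best performer.
// Option 2 favors freshness: serve the weights straight from the last
// All-Reduce pass, at the cost of greater sensitivity to noise.
def currentWeights(cache: ModelCache,
                   allReduceWeights: Array[Double],
                   preferFreshness: Boolean): Array[Double] =
  if (preferFreshness) allReduceWeights
  else cache.bestWeights.getOrElse(allReduceWeights)
```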
Performance: IBM’s Data Science Experience
IBM Data Science Experience is an environment that brings together everything a data scientist needs.
It includes the most popular open-source tools, such as Scala/Python/R/SQL, Jupyter notebooks, the RStudio IDE and Shiny apps, and Apache Spark, along with IBM's unique value-add functionality and community and social features, integrated as first-class citizens to make data scientists more successful.
Performance: Accuracy Gains
Semi-Supervised Learning
Edited by Olivier Chapelle, Bernhard Schölkopf and Alexander Zien
Learning from labeled and unlabeled data
Tom Mitchell, CMU
Performance: NASDAQ
• NASDAQ Data
– Daily stock market values since 1971; ~50K instances
– Features (excerpt)
• Date, Open, High, Low, Close, Volume, Adj Close
– Fully labeled, y = Close
– k-fold cross-validation
– Regression
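The "X% unlabeled" conditions in the charts that follow can be produced by randomly hiding labels on a fully labeled RDD; an illustrative sketch, not the deck's evaluation code:

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Split off a fraction of the data and drop its labels, returning the
// remaining labeled set and the now-unlabeled feature vectors.
def maskLabels(data: RDD[LabeledPoint], unlabeledFraction: Double, seed: Long = 42L)
    : (RDD[LabeledPoint], RDD[Vector]) = {
  val Array(unlabeled, labeled) =
    data.randomSplit(Array(unlabeledFraction, 1.0 - unlabeledFraction), seed)
  (labeled, unlabeled.map(_.features))
}
```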
Performance (1): Linear Regression
(Charts: average RMSD vs. percentage of unlabeled data, essentially flat near 789.54; and RMSD by k-fold (1-10), ranging roughly 788 to 791.5, comparing the supervised baseline against 10%, 20%, 30%, and 50% unlabeled data.)
Performance (1A): Classification
(Charts: classification accuracy by k-fold (1-10), roughly 0.795 to 0.83, for the supervised baseline and 10%, 20%, 30%, and 50% unlabeled data; and average accuracy vs. percentage of unlabeled data, roughly 0.8158 to 0.8172.)
Performance (2): Online & Semi-Supervised
• HealthStats provides key health, nutrition and population
statistics gathered from a variety of international sources
• Global health data since 1960; ~100K instances
– Features (excerpt)
• The dataset includes 345 indicators, such as immunization rates, malnutrition prevalence, and vitamin A supplementation rates, across 263 countries around the world. Data was collected yearly from 1960 to 2016. Fully labeled.
– k-fold cross-validation
– Classification
Performance (3): Online Semi-Supervised Learning
• IBM Employee Attrition and Performance
– Uncover the factors that lead to employee attrition
– Features (excerpt)
• Education: 1 'Below College', 2 'College', 3 'Bachelor', 4 'Master', 5 'Doctor'
• EnvironmentSatisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
• JobInvolvement: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
• JobSatisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
• PerformanceRating: 1 'Low', 2 'Good', 3 'Excellent', 4 'Outstanding'
• RelationshipSatisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
• WorkLifeBalance: 1 'Bad', 2 'Good', 3 'Better', 4 'Best'
– Fully labeled
– k-fold cross-validation
– Classification
Future Work
• More complex real-time data sets
• Real-Time Streaming and Semi-
Supervised Learning for Autonomous
Vehicles
• IoT and the Autonomous Vehicle in the
Clouds: Simultaneous Localization and
Mapping (SLAM) with Kafka and Spark
Streaming (Spark Summit East 2017)
• Full Framework!!!
Future Work
• Improving Batch retraining policy to incorporate
more information about the distributions of data
• Investigating model switching vs retraining
• Adding boosting mechanisms in batch
• Adding feature extraction for high dimensionality
• Expansion of Spark Streaming ML algorithms
Further Reading
A Reliable Effective Terascale Linear Learning System
Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, John Langford
Semi-Supervised Learning
Edited by Olivier Chapelle, Bernhard Schölkopf and Alexander Zien
Learning from labeled and unlabeled data
Tom Mitchell, CMU
Multiple Model-Based Reinforcement Learning
Kenji Doya, Kazuyuki Samejima, Ken-ichi Katagiri and Mitsuo Kawato
Vainsencher, Daniel, Shie Mannor, and Huan Xu. "Learning multiple models via regularized weighting."
Advances in Neural Information Processing Systems. 2013.
Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the
eleventh annual conference on Computational learning theory (COLT' 98). ACM, New York, NY, USA, 92-100.
DOI=http://dx.doi.org/10.1145/279943.279962
The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics). Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
Thank You.
J. White Bear (jwhiteb@us.ibm.com)
IBM Spark Technology Center
505 Howard St., San Francisco, CA