Presented: J. White Bear
An Online Spark Pipeline:
Semi-Supervised Learning and Online
Retraining with Spark Streaming.
IBM
Spark Technology Center
• Founded in 2015.
• Location:
– Physical: 505 Howard St., San Francisco CA
– Web: http://spark.tc Twitter: @apachespark_tc
• Mission:
– Contribute intellectual and technical capital to the Apache
Spark community.
– Make the core technology enterprise- and cloud-ready.
– Build data science skills to drive intelligence into business
applications — http://bigdatauniversity.com
• Key statistics:
– About 50 developers, co-located with 25 IBM designers.
– Major contributions to Apache Spark: http://jiras.spark.tc
– Apache SystemML is now a top-level Apache project!
– Founding member of UC Berkeley AMPLab and RISE
Lab
– Member of R Consortium and Scala Center
About Me
Education
• University of Michigan: Computer Science
• Databases, machine learning/computational biology, cryptography
• University of California San Francisco; University of California, Berkeley
• Multi-objective optimization, computational biology/bioinformatics
• McGill University
• Machine learning, multi-objective optimization for path planning, cryptography
Industry
• IBM
• Amazon
• TeraGrid
• Pfizer
• Research at UC Berkeley, Purdue University, and every university I ever attended :)
Fun Facts (?)
I love research for its own sake. I like robots, helping to cure diseases, advocating for social change and reform, and breaking encryptions. Also, most activities involving the ocean. And I usually hate taking pictures. :)
Why do we need online semi-supervised learning?
• Malware/fraud detection
• Stock prediction
• Real-time diagnosis
• NLP/speech recognition
• Real-time visual processing
Why online learning?
• Incremental/Sequential learning for real-time use
cases
• Predicts/learns from the newest data
• Optimized for low latency in real-time cases
• Often used in conjunction with streaming data
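A minimal sketch of what these bullets mean in Spark terms, using MLlib's StreamingLinearRegressionWithSGD; the stream directories and feature dimension are placeholder assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OnlineRegression {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("OnlineRegression"), Seconds(1))

    // Each file dropped into these (placeholder) directories becomes a micro-batch.
    val train = ssc.textFileStream("hdfs:///streams/train").map(LabeledPoint.parse)
    val test  = ssc.textFileStream("hdfs:///streams/test").map(LabeledPoint.parse)

    // The model is updated incrementally on every training micro-batch,
    // so predictions always reflect the newest data.
    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(3))
      .setStepSize(0.1)

    model.trainOn(train)
    model.predictOnValues(test.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```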
Semi-supervised learning
• Smaller training sets, less labeled data
• Classifying or predicting unlabeled data
• The underlying distribution P(x,y) may or may not be known/stable (PAC learning)
• How/Why do we bring these ideas together?
Key Challenges
• Acquiring a sufficient ratio of labeled training data
– Batch learning is much slower, and updating a model is more difficult
• Maintaining accuracy
– Concept drift
– Catastrophic events
• Meeting latency requirements, particularly in real-time scenarios like autonomous vehicles
• What about unlabeled data?....
The Framework: Hybrid Online and Semi-Supervised Learning
(Architecture diagram. Real-time streaming data enters a real-time consumption stage with feature extraction: feature hashing via the hashing trick, and matrix factorization. An Out-of-Core component keeps an RDD of all data not in cache and processes it in the background. An AllReduce component analyzes mini-batch data in memory using distributed node averaging over a spanning tree protocol (simulated in the POC), with mini-batches of RDDs distributed to nodes; it can trigger model updates and initiates the semi-supervised state, with sequential or subsampled mini-batches. A model retraining policy draws on a model/retraining cache: a global cache stores best-performer metrics from the out-of-core path, and a local cache stores best-performer metrics from AllReduce.)
A Reliable Effective Terascale Linear Learning System. Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, John Langford.
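The feature hashing ("hashing trick") stage maps raw tokens into a fixed-width vector with no dictionary pass, which suits unbounded streams. A minimal sketch with MLlib's HashingTF; the token sequence is illustrative:

```scala
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

// Each token is hashed to an index in a 2^18-dimensional space; hash
// collisions are tolerated in exchange for constant memory on a stream.
val hashingTF = new HashingTF(1 << 18)
val features: Vector = hashingTF.transform(Seq("open", "high", "low", "volume"))
```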
Why this framework?
• Abundance of unlabeled data requires semi-supervised learning
– Real-time learning in the IoT setting
• Semi-supervised learning can improve predictions
– Empirically studied for online use cases with big data
• We need to address the challenges of incremental, sequential learning without losing valuable historical data
• We need to optimize for low latency without overly sacrificing accuracy
• You may have an online use case, but you are not guaranteed labeled data, and certainly not in a timely fashion
• Hybrid frameworks address these challenges and allow for future
data mining to understand and correct for how your data changes
over time
Online and Semi-supervised Learning: Online Constraint
$$\min_{w \in \mathbb{R}^d} \sum_{i=1}^{n} l(w^{\top} x_i;\, y_i) + \lambda R(w) \qquad (1)$$

Supervised learning over a batch.

$$\frac{1}{n} \sum_{i=1}^{n} l(w^{\top} x_i;\, y_i) + \frac{\lambda}{n} R(w) \qquad (2)$$

Online learning over a mini-batch from a streaming data set.

$$\frac{m}{n} \sum_{i=S_k}^{n} l(w^{\top} x_i;\, y_i) + \frac{\lambda}{n} R(w) \qquad (3)$$

Distributed learning over a mini-batch from a streaming data set, where $S_k$ marks the start of the $k$-th mini-batch.
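Objective (3) is minimized incrementally in practice. One common update, shown here as a hedged sketch rather than the deck's exact solver, is a stochastic gradient step per streamed example:

$$w_{t+1} = w_t - \eta_t \left( \nabla_w\, l(w_t^{\top} x_i;\, y_i) + \frac{\lambda}{n}\, \nabla_w R(w_t) \right)$$

where $\eta_t$ is the step size; the All-Reduce stage then averages the resulting per-node weights.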
Online and Semi-supervised Learning: Semi-supervised Constraint
Let $x_p$ be an example, $x_p = (x_{p1}, x_{p2}, \ldots, x_{pD}, \omega)$, where $x_p$ belongs to class $\omega$ in a $D$-dimensional space; $x_{pi}$ is the value of the $i$-th feature of the $p$-th sample.

Assume a labeled set $L$ with $n$ instances of $x$, with $\omega$ known. Assume an unlabeled set $U$ with $m$ instances of $x$, with $\omega$ unknown. Assume that the number of labeled instances, $|L|$, is less than the number of unlabeled instances, $|U|$.

$L \cup U$ represents the training set, $T$. We want to infer a hypothesis using $T$ and use this hypothesis to predict labels we have not yet seen. This is the semi-supervised learning case.
Online and Semi-supervised Learning: Semi-supervised Constraint
(Flow diagram: labeled and unlabeled data form the initial training set; online/streaming labeled and unlabeled data then feed model retraining, which again draws on both labeled and unlabeled pools.)
The Framework: Batch Component
(The architecture diagram repeats with the Batch path highlighted: the Out-of-Core component holds an RDD of all data not in cache and processes it in the background through learning modules M1…MN, with customization and policy options: a retraining policy, an amending policy, and a committee policy.)
The Framework: Batch Component
• Features
– Out-of-Core, can run as a background process
– Ensemble learning: multiple-model / co-training learning algorithms
– Amending Policy
– Retrain Policy
– Active Learning Policy
– Custom Parameters
The Framework: Batch Component
Semi-Supervised Learning
Edited by Olivier Chapelle, Bernhard Schölkopf and Alexander Zien
Learning from labeled and unlabeled data
Tom Mitchell, CMU
The Framework: Batch Component
• Multiple Model Learning/ Co-training
– Expandable learning modules: multiple learning
algorithms
– Co-training for high-dimensionality, improved
confidence
– Accounts for different biases across learning
techniques to strengthen the hypothesis
– Improves confidence estimates in practice
– Yields better results when there is significant diversity among the models
– Reduces variance and helps to avoid overfitting
The Framework: Batch Component
Given a model, $m \in M$:

$$y_m(x) = h(x) + \epsilon_m(x) \qquad (4)$$

We can get the average error over all models, $E_{\text{average}}$:

$$E_{\text{average}} = \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}_x\!\left[\epsilon_m(x)^2\right] \qquad (5)$$

Regression: model averaging of the sum of squares / prediction confidence.
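A step the slide leaves implicit, stated here under the standard assumption (not from the deck) that the per-model errors $\epsilon_m(x)$ have zero mean and are uncorrelated: the averaged committee prediction $y_{\text{com}}(x) = \frac{1}{M}\sum_{m=1}^{M} y_m(x)$ then satisfies

$$E_{\text{com}} = \frac{1}{M}\, E_{\text{average}},$$

which is the usual variance-reduction argument for model averaging; correlated errors shrink the gain.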
Multiple Model-Based Reinforcement Learning. Kenji Doya, Kazuyuki Samejima, Ken-ichi Katagiri and Mitsuo Kawato.
Vainsencher, Daniel, Shie Mannor, and Huan Xu. "Learning multiple models via regularized weighting." Advances in Neural Information Processing Systems. 2013.
The Framework: Batch Component
• Amending Policy
– Instances are added sequentially and maintain order
– Only the most ‘confident’ predictions are added
– If validation is received, inaccurate predictions are removed and retrained (within a threshold)
– Low confidence, unlabeled data are stored for future
validation or rescoring
Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the
eleventh annual conference on Computational learning theory (COLT' 98). ACM, New York, NY, USA, 92-100.
DOI=http://dx.doi.org/10.1145/279943.279962
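As a concrete sketch of the amending step above (illustrative names; the confidence rule assumes a probabilistic binary classifier, not the deck's exact policy):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Add only the most confident predictions to the training set; hold the rest
// back for future validation or rescoring.
def amend(model: LogisticRegressionModel,
          labeled: RDD[LabeledPoint],
          unlabeled: RDD[Vector],
          threshold: Double = 0.9): (RDD[LabeledPoint], RDD[Vector]) = {
  model.clearThreshold() // predict() now returns raw class probabilities
  val scored = unlabeled.map(x => (x, model.predict(x))).cache()
  val confident = scored
    .filter { case (_, p) => p >= threshold || p <= 1.0 - threshold }
    .map { case (x, p) => LabeledPoint(if (p >= threshold) 1.0 else 0.0, x) }
  val lowConfidence = scored
    .filter { case (_, p) => p < threshold && p > 1.0 - threshold }
    .map(_._1)
  (labeled.union(confident), lowConfidence)
}
```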
The Framework: Batch Component
(Diagram: the amending policy moves instances between the training batch's labeled data and a pool of unlabeled, low-confidence data.)
Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the
eleventh annual conference on Computational learning theory (COLT' 98). ACM, New York, NY, USA, 92-100.
DOI=http://dx.doi.org/10.1145/279943.279962
The Framework: Batch Component
• Retrain Policy
– Loss Minimization
• Increases in expected loss above threshold
– Accuracy
• Ratio/Number of incorrect predictions
– Schedule
• Regular retraining schedule
– Optimizations
• Limited to a window over the given data set
– Train over a specific time period to optimize for catastrophic events
and concept shifts
• Subsampling
– Very large datasets can set a subsampling methodology to improve
batch processing speed
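The triggers above compose naturally into a single predicate; a minimal sketch, with illustrative thresholds rather than the framework's actual API:

```scala
// Retrain when expected loss rises above a threshold, accuracy falls below
// one, or the regular schedule comes due.
case class RetrainPolicy(maxLoss: Double, minAccuracy: Double, periodMs: Long)

def shouldRetrain(policy: RetrainPolicy,
                  expectedLoss: Double,
                  accuracy: Double,
                  lastRetrainMs: Long,
                  nowMs: Long = System.currentTimeMillis()): Boolean =
  expectedLoss > policy.maxLoss ||
  accuracy < policy.minAccuracy ||
  nowMs - lastRetrainMs >= policy.periodMs
```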
The Framework: Batch Component
• What happens to low confidence predictions?
• Active Learning Policy
– Separate dataset holds these instances
– Human labeling and/or an oracle is queried for accurate labels
• Google has over 10,000 contractors performing this function
– https://arstechnica.com/features/2017/04/the-secret-lives-of-google-raters/
• Facebook is hiring 3000 new content monitors for a job AI cannot do
– http://www.popsci.com/Facebook-hiring-3000-content-monitors
• Netflix/Amazon are bringing in humans to improve the CTR
recommendations
– http://blog.echen.me/2014/10/07/moving-beyond-ctr-better-
recommendations-through-human-evaluation/
The Framework: Batch Component
• Custom Parameters:
– Window Size
• Specify a time or range
– Amending times
• Schedule when the amending policy runs
• By default, runs are triggered automatically by accuracy rates
– Loss threshold
• Specify maximum loss for a given loss function
– Accuracy threshold
• Specify the precision/recall rates before retraining is called on
batch
– Subsampling
• Run a random subsample to train
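Collected in one place, the knobs above might look like the following parameter bundle (names are illustrative, not a shipped API):

```scala
import scala.concurrent.duration.FiniteDuration

// Illustrative batch-component configuration mirroring the options above.
case class BatchConfig(
  windowSize: FiniteDuration,         // time or range the batch trains over
  amendEvery: Option[FiniteDuration], // None = amend automatically on accuracy
  lossThreshold: Double,              // maximum loss for the given loss function
  accuracyThreshold: Double,          // precision/recall floor before retraining
  subsampleFraction: Option[Double]   // e.g. Some(0.1) trains on a random 10% sample
)
```

A subsampleFraction of Some(0.1) would translate directly to rdd.sample(withReplacement = false, fraction = 0.1) before training.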
The Framework: All-Reduce Component
A Reliable Effective Terascale Linear Learning System. Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, John Langford.
(The architecture diagram repeats with the AllReduce component highlighted: in-memory analysis of mini-batch data using distributed node averaging over a spanning tree protocol, simulated in the POC, with mini-batches of RDDs distributed to nodes; it can trigger model updates and initiates the semi-supervised state.)
The Framework: All-Reduce Component
A Reliable Effective Terascale Linear Learning System
Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, John Langford
• SGD / L-BFGS / regression
• In-memory
• Minimal network latency
• Mini-batches of n/m instances for each node
• The head node holds an average of the weight parameters
• Can be added to existing implementations to optimize for online training
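Stock Spark can approximate the node-averaging step with treeAggregate, which combines partial results up a tree much like the spanning-tree AllReduce the POC simulates. A sketch under that assumption, not the deck's implementation:

```scala
import breeze.linalg.DenseVector
import org.apache.spark.rdd.RDD

// Average per-node weight vectors with a tree-shaped reduction.
def averageWeights(localWeights: RDD[DenseVector[Double]],
                   dim: Int): DenseVector[Double] = {
  val (sum, count) = localWeights.treeAggregate((DenseVector.zeros[Double](dim), 0L))(
    seqOp = { case ((acc, n), w) => (acc + w, n + 1) },
    combOp = { case ((a1, n1), (a2, n2)) => (a1 + a2, n1 + n2) }
  )
  sum / count.toDouble
}
```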
The Framework: Cache Component
Bringing it all together…
(The architecture diagram repeats with the Model/Retraining Cache highlighted: the global cache stores best-performer metrics from the out-of-core path, and the local cache stores best-performer metrics from AllReduce, feeding the model retraining policy.)
The Framework: Cache Component
• Model Weights & Loss
• Custom Parameters
• When the retraining policy is triggered in the monitor
– A new best performing weight vector is selected from
the cache
– The running model weights are updated and used in
real-time predictions
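An illustrative model/retraining cache, hypothetical rather than the framework's code: snapshots of weights keyed by model, from which the best performer is pulled when retraining fires:

```scala
import scala.collection.mutable

// Keep (weights, loss) snapshots per model id; "best" = lowest observed loss.
final class ModelCache {
  private val entries = mutable.Map.empty[String, (Array[Double], Double)]

  def put(id: String, weights: Array[Double], loss: Double): Unit =
    entries(id) = (weights, loss)

  def bestWeights: Option[Array[Double]] =
    if (entries.isEmpty) None else Some(entries.values.minBy(_._2)._1)
}
```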
The Framework: Monitor Component
(The architecture diagram repeats with a Monitor added on the real-time streaming path, ahead of the consumption stage, watching predictions and deciding when the model retraining policy fires.)
The Framework: Monitor Component
• Retraining in All-Reduce happens after each online pass
• Option 1: Makes real-time prediction with only local latency when
initiated (Model Retrain Policy)
– very low latency
– stable predictions
– Significant change in weights since last pass
– Loss Minimization
• Increases in expected loss above threshold
– Accuracy
• Option 2: Can opt to use the All-Reduce weights after the online pass
– Low latency
– Greater sensitivity
– Periodic queries to the cache for best weights (Model Retrain Policy)
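The choice between the two options reads as one function over the cache sketch from the Cache Component section (illustrative, with the trade-off in the comments):

```scala
// Option 1 favors stability: serve the cached best performer.
// Option 2 favors freshness: serve the weights straight from the last
// All-Reduce pass, at the cost of greater sensitivity to noise.
def currentWeights(cache: ModelCache,
                   allReduceWeights: Array[Double],
                   preferFreshness: Boolean): Array[Double] =
  if (preferFreshness) allReduceWeights
  else cache.bestWeights.getOrElse(allReduceWeights)
```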
Performance: IBM’s Data Science Experience
IBM Data Science Experience is an environment that brings together everything a data scientist needs.
It includes the most popular open-source tools, such as Scala/Python/R/SQL, Jupyter notebooks, the RStudio IDE and Shiny apps, and Apache Spark, along with IBM's unique value-add functionality and community and social features, integrated as first-class citizens to make data scientists more successful.
Performance: Accuracy Gains
Semi-Supervised Learning
Edited by Olivier Chapelle, Bernhard Schölkopf and Alexander Zien
Learning from labeled and unlabeled data
Tom Mitchell, CMU
Performance: NASDAQ
• NASDAQ Data
– Daily stock market values since 1971; ~50K instances
– Features (excerpt)
• Date, Open, High, Low, Close, Volume, Adj Close
– Fully labeled, y = Close
– k-fold cross-validation
– Regression
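The "X% unlabeled" conditions in the charts that follow can be produced by randomly hiding labels on a fully labeled RDD; an illustrative sketch, not the deck's evaluation code:

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Split off a fraction of the data and drop its labels, returning the
// remaining labeled set and the now-unlabeled feature vectors.
def maskLabels(data: RDD[LabeledPoint], unlabeledFraction: Double, seed: Long = 42L)
    : (RDD[LabeledPoint], RDD[Vector]) = {
  val Array(unlabeled, labeled) =
    data.randomSplit(Array(unlabeledFraction, 1.0 - unlabeledFraction), seed)
  (labeled, unlabeled.map(_.features))
}
```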
Performance (1): Linear Regression
(Charts: average RMSD vs. percentage of unlabeled data, essentially flat near 789.54; and RMSD by k-fold (1-10), ranging roughly 788 to 791.5, comparing the supervised baseline against 10%, 20%, 30%, and 50% unlabeled data.)
Performance (1A): Classification
(Charts: classification accuracy by k-fold (1-10), roughly 0.795 to 0.83, for the supervised baseline and 10%, 20%, 30%, and 50% unlabeled data; and average accuracy vs. percentage of unlabeled data, roughly 0.8158 to 0.8172.)
Performance (2): Online & Semi-Supervised
• HealthStats provides key health, nutrition and population
statistics gathered from a variety of international sources
• Global health data since 1960; ~100K instances
– Features (excerpt)
• The dataset includes 345 indicators, such as immunization rates, malnutrition prevalence, and vitamin A supplementation rates, across 263 countries around the world. Data was collected yearly from 1960 to 2016. Fully labeled.
– k-fold cross-validation
– Classification
Performance (3): Online Semi-Supervised Learning
• IBM Employee Attrition and Performance
– Uncover the factors that lead to employee attrition
– Features (excerpt)
• Education: 1 'Below College', 2 'College', 3 'Bachelor', 4 'Master', 5 'Doctor'
• EnvironmentSatisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
• JobInvolvement: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
• JobSatisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
• PerformanceRating: 1 'Low', 2 'Good', 3 'Excellent', 4 'Outstanding'
• RelationshipSatisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
• WorkLifeBalance: 1 'Bad', 2 'Good', 3 'Better', 4 'Best'
– Fully labeled
– k-fold cross-validation
– Classification
Future Work
• More complex real-time data sets
• Real-Time Streaming and Semi-
Supervised Learning for Autonomous
Vehicles
• IoT and the Autonomous Vehicle in the
Clouds: Simultaneous Localization and
Mapping (SLAM) with Kafka and Spark
Streaming (Spark Summit East 2017)
• Full Framework!!!
Future Work
• Improving Batch retraining policy to incorporate
more information about the distributions of data
• Investigating model switching vs retraining
• Adding boosting mechanisms in batch
• Adding feature extraction for high dimensionality
• Expansion of Spark Streaming ML algorithms
Further Reading
A Reliable Effective Terascale Linear Learning System
Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, John Langford
Semi-Supervised Learning
Edited by Olivier Chapelle, Bernhard Schölkopf and Alexander Zien
Learning from labeled and unlabeled data
Tom Mitchell, CMU
Multiple Model-Based Reinforcement Learning
Kenji Doya, Kazuyuki Samejima, Ken-ichi Katagiri and Mitsuo Kawato
Vainsencher, Daniel, Shie Mannor, and Huan Xu. "Learning multiple models via regularized weighting."
Advances in Neural Information Processing Systems. 2013.
Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the
eleventh annual conference on Computational learning theory (COLT' 98). ACM, New York, NY, USA, 92-100.
DOI=http://dx.doi.org/10.1145/279943.279962
The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics). Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
Thank You.
J. White Bear (jwhiteb@us.ibm.com)
IBM Spark Technology Center
505 Howard St., San Francisco, CA