Practical Aspects of
Machine Learning on Big-Data platforms
(Big Learning)
Mohit Garg
Research Engineer
TXII, Airbus Group Innovations
25-05-2015
Motivation & Agenda
Motivation: Answer the questions
– Why multiple ecosystems for scalable ML?
– Is there a unified approach to a big-data platform for ML?
– How much is there to catch up on?
– How are industry leaders doing it?
– Put things into perspective!
Agenda: To present
– Quick brief on practical ML process
– Current landscape of Open source tools
– Evolutionary drivers with examples
– Case studies
– The Twitter Experience
– * Share journey and observations
(ongoing process)
ML (Optimization, LA, Stats)
scalability
Big Data (Schema, workflow, architecture)
Quick brief on ML process
Big learning   1.2
Quick brief - Process
[Diagram: parameters α, β – Train → Tune (cross-validation data) → Measure (test data)]
• Not applicable to all ML modeling techniques. Biologically-inspired algorithms are more of a paradigm
(a set of guidelines) than an algorithm, and require an algorithm definition under those guidelines (GA, ACO, ANN).
• Graph Source: http://www.astroml.org/sklearn_tutorial/practical.html
Bias Vs Variance
Learning Curve
Data Sampling | Algorithm | Model Evaluation | Model/Hypothesis
Quick brief – Workflow Example
Graph Source:
Quick brief – WF breakdown k-means
Quick brief – WF breakdown k-means
Input Data (Points)
Statement Block
(assigning
clusters)
Termination
condition (if no
change)
While (!termination_condition)
meta input
(new cluster centres)
updates
Quick brief – WF breakdown k-means
Only after iteration
is over
Input Data (Points)
Statement Block
(assigning
clusters)
Termination
condition (if no
change)
While (!termination_condition)
meta input
(new cluster centres)
updates
new cluster centres
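The loop structure sketched above can be written out in plain Python/NumPy; this is a minimal single-machine sketch for illustration (the data and k are whatever the caller supplies), with comments mapping each piece back to the workflow boxes:

```python
import numpy as np

def kmeans(points, k, seed=0):
    rng = np.random.default_rng(seed)
    # meta input for the first pass: initial cluster centres
    centres = points[rng.choice(len(points), size=k, replace=False)]
    while True:                                       # While (!termination_condition)
        # statement block: assign each point to its nearest centre
        labels = ((points[:, None] - centres) ** 2).sum(-1).argmin(1)
        # updates, only after the iteration is over: new cluster centres
        new = np.array([points[labels == j].mean(0) for j in range(k)])
        if np.allclose(new, centres):                 # termination: no change
            return centres, labels
        centres = new
```

Note the shape of the loop: every pass touches the entire dataset, which is exactly what makes a naive MapReduce port of such algorithms expensive.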
Quick brief – Pieces
• An ML algorithm is, in the end, a computer algorithm
• A complex design of blocks
– Blocks feeding each other: output becoming input
– Iterations over the entire data (gradient descent for linear/logistic
regression, k-means, etc.) – memory limitations
– Algorithms as non-linear workflows
• Principles when operating on large datasets
– Minimize I/O – don’t read/write disk again and again
– Minimize network transfer – localize logic (non-additive?) to
the data
– Use online-trainable algorithms – optimized parallel algorithms
– Ease of use – abstraction – package well for the end-user
Quick brief – then and now
• Small Data
– Static data
• Big Data
– Static Data
– But can’t run on a single machine
• Online Big Data
– Integrated with ETL
– Prediction and Learning together
– Twitter case study
[Diagram: the α, β – Train → Tune (cross-validation data) → Measure (test data) loop, shown twice – for static Big Data and for Online Big Data with Velocity]
Current Landscape
Current Landscape – Open Source tools
Stream
Sibyl
Current Landscape – Open Source tools
Data
Complexity & completeness
Stream
Sibyl
Evolutionary Drivers
Current Landscape – Open Source tools
Data
Complexity & completeness
Stream
Sibyl
Evolutionary Drivers
Data
Complexity & completeness
Stream
1. Loose Integration
2. Scientific
3. Architectural
4. Use Case*
Open Source tools – Landscape & Drivers
Data
Complexity & completeness
Stream
Sibyl
Is there a wholesome solution?
Loose Integration
Quick Review: MapReduce + Hadoop
• Bigger focus on
– Scaling to large data
– Scheduling and concurrency control
• Load balancing
– Fault tolerance
– Basically, saving the tonnes of user effort
required by older frameworks like MPI.
– The map and reduce can be ‘executables’ in
virtually any language (Streaming)
– *Maps (& reducers) don’t interact!
• MapReduce exploits massively
parallelizable problems – what about the rest of
them?
– Simple case: try finding the median of integers (1–40,
say) using MR
• Can we expect any algorithm to execute
with an MR implementation within the same time
bounds?
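To see why the median resists the one-shot MR pattern while the mean does not, compare what a fixed-size per-split summary buys you. A toy sketch in plain Python (the splits are made-up data standing in for mapper inputs):

```python
# Each inner list stands for one input split handled by one mapper.
splits = [[3, 1, 7], [40, 2], [9, 5, 6]]

# Mean: each "mapper" emits a fixed-size summary (sum, count);
# a single "reduce" adds them up -- embarrassingly parallel.
partials = [(sum(s), len(s)) for s in splits]
total = sum(p[0] for p in partials)
n = sum(p[1] for p in partials)
mean = total / n                                        # 9.125

# Median: no fixed-size per-split summary suffices -- you need a global
# order, i.e. a full sort (or several MR rounds: histogram, then refine).
median = sorted(x for s in splits for x in s)[n // 2]   # 6
```

The mean composes from partial results; the median does not, which is why it needs extra MR passes or a shuffle of all the data.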
Loose Integration
• A set of components/APIs
– exposing existing tools to Map-Reduce frameworks,
– to be compiled, optimized and deployed
– in streaming or pipe mode with those frameworks.
• Hadoop/MapReduce bindings available for
– R
– Python (NumPy, scikit)
• Focus on
– Accommodating the existing user base to leverage Hadoop data storage
– Easy & neat APIs for native users.
– No real effort on ‘bridging the gap’
R-Hadoop
Loose Integration – Pydoop Example
• Uses Hadoop Pipes as the underlying framework
• Based on CPython, so allows inclusion of scikit-learn, NumPy, etc.
• Lets you define map and reduce logic
• But does not provide better representations of ML algorithms
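Since the deck mentions Streaming, here is the shape of that contract in plain Python. This mimics what a Hadoop Streaming mapper/reducer pair does (it is not Pydoop's actual API); the word-count task is the stock example:

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Map phase: emit (key, value) pairs; mappers never talk to each other.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Between the phases Hadoop sorts/groups by key; here we do it ourselves.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

counts = dict(reducer(mapper(["big data big learning", "big models"])))
# counts == {'big': 3, 'data': 1, 'learning': 1, 'models': 1}
```

In a real Streaming job the two functions would be separate executables reading stdin and writing stdout, with Hadoop providing the sort between them.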
Scientific
Scientific - Interest
Scientific - Efforts
• Efforts come in waves with breakthroughs
• Efforts on
– Accuracy bounds & Convergence
– Execution time bounds
• Recent efforts in tandem with Big Data
– Distributable algorithms – Central Limit theorem (local
logic with simpler aggregations)
– Batch-to-Online model – ‘One pass mode’ (avoid
iterations)
• Example
– Distributable algorithms – ensemble classification (e.g.
Random Forest), k-means|| (scalable k-means++)
– Batch-to-online – SGD
• Note – the power ‘inherently’ lies in Big Data
– A simple algorithm with a larger dataset outperforms a complex
algorithm with a smaller dataset
Image-2 Source: Andrew Ng – Coursera ML-08
Logistic Classification
• Sample: Yes/No kind of answers
– Is this tweet spam?
– Will this tweeter log back in within 48 hours?
[Table: M training records – features X1 … XN, binary label Y ∈ {0, 1}]
[Diagram: inputs x1 … xN, weighted by Ө1 … ӨN, feed the hypothesis hӨ(x)]
Hypothesis hӨ(x) = 1 / (1 + e^(–ӨTx))
Cost(x) = hӨ(x) – y
J = Cost(X)
• Ө is the unknown variable
• Let’s start with a random value of Ө
• The aim is to change Ө to minimize J
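The hypothesis and the per-record error term above, written out in NumPy (a sketch – the θ, x, y values are made up for illustration):

```python
import numpy as np

def h(theta, x):
    # hypothesis h_theta(x) = 1 / (1 + e^(-theta^T x))
    return 1.0 / (1.0 + np.exp(-(theta @ x)))

theta = np.zeros(3)              # start from some value of theta (here zeros)
x = np.array([1.0, 0.5, -2.0])   # one record (first component is the bias term)
y = 1.0                          # its label
error = h(theta, x) - y          # the Cost(x) term driving the update
# with theta all zeros, h(theta, x) is exactly 0.5, so error is -0.5
```

Gradient descent (next slides) repeatedly nudges θ in the direction that shrinks this error over the whole dataset.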
Gradient Descent
Iterations
J = Cost(X, Ө) – minimize using Gradient Descent
Visualization in 2D
Gradient Descent
• Cost function requires all records.
While (W does not change)
{
// Load data
// find local losses
// Aggregate local losses
// Find gradient
// Update W
// Save W to disk
}
/* Multiple Passes */
J =Cost(X, Ө)
M1 M2 MN
Map – loads data
Reduce – Calculates
gradient and updates W
R
Saves W (intermediate)
User code
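The multiple-passes loop in the comments maps onto NumPy like this – a single-machine sketch of batch gradient descent for the logistic model (learning rate and iteration count are arbitrary illustrative choices):

```python
import numpy as np

def batch_gd(X, y, lr=0.5, iters=200):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):                       # each pass scans ALL records
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))   # "find local losses"
        grad = X.T @ (p - y) / len(y)            # "aggregate" -> gradient
        theta -= lr * grad                       # "update W"
    return theta                                 # MR would "save W" every pass
```

In the MapReduce version every trip around this loop is a full job – load data, compute, write W back to disk – which is exactly the I/O overhead the slide calls out.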
Stochastic Gradient Descent (SGD)
• No need to evaluate the full cost function for the gradient calculation
• Each iteration works on a single data point xi
• The gradient is calculated using only xi
• As good as performing SGD on a single machine. The reducer is a serious
bottleneck
M1 M2 MN
Map – loads data
Reduce – Calculates
gradient and updates W
R
Saves W (final)
// Load data
While (no samples left)
{
// Find gradient using xi
// Update W
}
// Save W
/* Single Pass */
User code
Ref: Bottou, Léon (1998). "Online Algorithms and Stochastic Approximations". Online Learning and Neural Networks
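The single-pass version in the comments, sketched in NumPy: the gradient at each step comes from one point alone (the learning rate is an arbitrary illustrative choice):

```python
import numpy as np

def sgd(X, y, lr=0.5):
    theta = np.zeros(X.shape[1])
    for xi, yi in zip(X, y):                        # single pass, no outer loop
        p = 1.0 / (1.0 + np.exp(-(xi @ theta)))
        theta -= lr * (p - yi) * xi                 # gradient from xi alone
    return theta                                    # "save W" once at the end
```

One pass over shuffled data is often enough to get close to the batch solution, which is what makes SGD attractive for data too big to iterate over repeatedly.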
SGD - Distributed
• Similar to SGD, but with multiple reducers
• Data is thoroughly randomized
• Multiple classifiers are learned together – ensemble classifiers
• The bottleneck of a single reducer (network data) is resolved
• Testing uses standard aggregation over the predictors’ results
M1 M2 MN
Map – loads data
Reduce – calculates
gradient and updates Wj
R1
W1
// Pre-process – Randomize
// Load data
While (no samples left)
{
// Find gradient using xi
// Update Wj
}
/* Single Pass and
distributed */
User code
R2
W2
Ref: L. Bottou. Large-scale machine learning with stochastic gradient descent. COMPSTAT, 2010.
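A single-process sketch of the ensemble idea above: shuffle, shard, run one-pass SGD per "reducer" to get its own Wj, then aggregate the predictors' results (here by averaging probabilities – one of several reasonable aggregation choices):

```python
import numpy as np

def sgd_shard(X, y, lr=0.5):
    # One pass of SGD -- what a single reducer Rj does to produce Wj.
    theta = np.zeros(X.shape[1])
    for xi, yi in zip(X, y):
        p = 1.0 / (1.0 + np.exp(-(xi @ theta)))
        theta -= lr * (p - yi) * xi
    return theta

def train_ensemble(X, y, n_shards=4, seed=0):
    # Pre-process: thoroughly randomize, then split across "reducers".
    idx = np.random.default_rng(seed).permutation(len(y))
    return [sgd_shard(X[s], y[s]) for s in np.array_split(idx, n_shards)]

def predict(models, X):
    # Aggregate over the predictors' results (average the probabilities).
    probs = np.mean([1.0 / (1.0 + np.exp(-(X @ w))) for w in models], axis=0)
    return probs > 0.5
```

Each shard trains independently, so the single-reducer bottleneck disappears; the price is an ensemble rather than one model.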
Architectural
1971 – Now – 2020
Moore’s Law vs Kryder’s Law
Source: Collective information from Wiki & its references
“if hard drives were to continue to progress
at their then-current pace of about 40%
per year, then in 2020 a two-platter, 2.5-
inch disk drive would store approximately
40 TB and cost about $40” - Kryder
Moore’s second law: “As the cost of computer
power to the consumer falls, the cost for
producers to fulfill Moore's law follows an
opposite trend: R&D, manufacturing, and
test costs have increased steadily with each
new generation of chips”
GAP
- Individual processors’ power is growing at a slower rate
- Data storage becomes easier & cheaper
- MORE data, FEWER processors – and the gap is
widening!
- Computer h/w architecture is working at its own pace to
provide faster buses, RAM & augmented GPUs.
Architectural – Forces
[Chart: data volume in exabytes, 6000 in 2012 (“we are here”) rising through 9000 to ~15000 by 2017, vs. percent of uncertain data (0–100); sources: sensors & devices, VoIP, enterprise data, social media; dimensions: Volume, Variety, Veracity]
Architectural – Forces
Source: IBM - Challenges and Opportunities with Big Data- Dr Hammou Messatfa
Mahout with MapReduce
• Key feature: somewhat loose & somewhat tight integration
– Among the earliest libraries to exploit batch-like scalable components and online
learning algorithms.
– Some algorithms re-engineered for MapReduce, some not.
– Performance hit for iterative algorithms. Huge I/O overhead
– Each (global) iteration means a Map-Reduce job :O
– Integration of new scalable learners is less active.
• Industry acceptance
– Accepted for scalable recommender systems
• Future
– Mahout Samsara for scalable low-level linear algebra as Scala & Spark bindings
Cascading
• Key feature: abstraction & packaging
– Lets you think of workflows as chains of MR
– Pre-existing methods for reading and storage
– Provides checkpoints in the workflow to save state.
– JUnit for test-case-driven s/w development.
• Industry acceptance
– Scalding is the Scala binding for Cascading, from Twitter
– Used by Bixo for anti-spam classification
– Used to load data by Elasticsearch & Cassandra
– eBay leverages the Scalding design for distributed computing.
Pig – Quick Summary
 High-level dataflow language (Pig Latin)
 Much simpler than Java
 Simplifies the data processing
 Puts the operations at the appropriate phases
 Chains multiple MR jobs
 Appropriate for ML workflows
 No need to take care of intermediate outputs
 Provides user-defined functions (UDFs) in Java,
integrable with Pig
Pig – Quick Summary
A = LOAD 'file1' AS (x, y, z);
B = LOAD 'file2' AS (t, u, v);
C = FILTER A BY y > 0;
D = JOIN C BY x, B BY u;
E = GROUP D BY z;
F = FOREACH E GENERATE group, COUNT(D);
STORE F INTO 'output';
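For comparison, the same dataflow expressed in pandas on hypothetical in-memory data (the values here are made up; Pig compiles the equivalent chain into MapReduce jobs instead):

```python
import pandas as pd

A = pd.DataFrame({"x": [1, 2, 3], "y": [5, -1, 2], "z": ["a", "b", "a"]})
B = pd.DataFrame({"t": [9, 8], "u": [1, 3], "v": [0, 0]})

C = A[A["y"] > 0]                            # FILTER A BY y > 0
D = C.merge(B, left_on="x", right_on="u")    # JOIN C BY x, B BY u
F = D.groupby("z").size()                    # GROUP D BY z; COUNT(D)
```

The point of Pig is that each line above becomes a declarative step whose intermediate outputs and job boundaries are managed for you.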
[Diagram: logical plan – LOAD, LOAD → FILTER → JOIN → GROUP → FOREACH → STORE – compiled into two chained MapReduce jobs; Map side: FILTER, LOCAL REARRANGE; Reduce side: PACKAGE, FOREACH]
ML-lib
 Part of the Apache Spark framework
 Data can come from HDFS, S3, local files
 Encapsulates run-time data as a Resilient Distributed Dataset (RDD)
 RDDs are in-memory data pieces
 Fault tolerant – an RDD knows how to recreate itself if its resident node goes down
 No distinction between map and reduce, just a task.
 Vigilant
 Bindings for R too – SparkR
 Real ingenuity in implementing new-generation algorithms (online &
distributed)
 For example, it has three versions of k-means – Lloyd, k-means++, k-means||
 Key feature
 Shared objects – tasks (belonging to one node) can share objects.
Spark (ML-lib)
Shared variable
Tez
 Apache Incubator project
 Fundamentally similar design principles to Spark
 Encapsulates run-time data as nodes, just like RDDs
 Key features
 In-memory data
 Shared objects – tasks (belonging to one node) can
share objects.
 Very few comparative studies available
 Not many contributions from the open community
Distributed R
 Open-source project led by HP
 Similar to R-Hadoop but with some genuinely new
features, such as
 User-defined array partitioning
 Local transformations/functions
 Master–worker synchronization
 Not the same ingenuity yet as seen in ML-lib.
 Only fundamentally scalable algorithms (online
& distributable) scale linearly.
 Tall claims of 50–100x time efficiency when
used with the HP Vertica database
Sibyl
 Not open-source yet, but some rumours!
 Claims to provide a GFS-based, highly scalable,
flexible infrastructure for embedding the ML
process in ETL
 Designed for supervised learning
 Focused on learning user behaviours
 YouTube video recommendations
 Spam filters
 Major design principle – columnar data
 Suitable for sparse datasets (new columns?)
 Compression techniques for columnar data are
much more efficient (structural similarity)
Columnar Data – LZO Compression
• Idea 1
– Compression should be ‘splittable’
– A large file can be compressed and
split into pieces the size of an HDFS block.
– Each block should hold its own
‘decompression key’

Compression | Size (GB) | Compression Time (s) | Decompression Time (s)
None        | 8.0       | -                    | -
Gzip        | 1.3       | 241                  | 72
LZO         | 2.0       | 55                   | 35

• Idea 2
– Compress data on Hadoop (save 3/4 of the
space)
– Save 75% of the I/O time!!
– Achieve decompression in < 75% of the I/O
time
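Plugging the table's numbers into a back-of-the-envelope model shows why LZO's weaker ratio can still win end-to-end. The 100 MB/s disk bandwidth is an assumed figure, and the model ignores overlap of reading and decompression:

```python
DISK_MB_S = 100.0                          # assumed sequential read speed (MB/s)

t_raw  = 8.0 * 1024 / DISK_MB_S            # read 8.0 GB uncompressed
t_gzip = 1.3 * 1024 / DISK_MB_S + 72       # read 1.3 GB, then decompress
t_lzo  = 2.0 * 1024 / DISK_MB_S + 35       # read 2.0 GB, then decompress
# roughly 82 s raw, 85 s gzip, 55 s lzo under these assumptions
```

Gzip's better ratio is eaten by its slow decompression; LZO satisfies Idea 2's condition that decompression costs less than the I/O time it saves.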
Conclusion
• Big Data has resurrected interest in ML algorithms.
• A two-way push is leading the confluence – online & distributed
learning (scientific) & flexible workflows (architectural) to
accommodate them.
• Facilitated by compression, serialization, in-memory computing, DAG
representations, columnar databases, etc.
• The majority of man-hours goes into engineering the pipelines.
• Industry is aiming to provide high-level abstractions over standard ML
algorithms, hiding the gory details.
Learning Resources
• MOOCs (Coursera)
– Machine Learning (Stanford)
– Design & Analysis of Algorithms (Stanford)
– R Programming Language (Johns Hopkins)
– Exploratory Data Analysis (Johns Hopkins)
• Online Competitions
– Kaggle Data Science platform
• Software Resources
– Matlab
– R
– scikit-learn (Python)
– APIs – ANN, JGAP
• 2009 – “Subspace extracting Adaptive Cellular Network for layered Architectures
with circular boundaries.” Paper on IEEE.
• 2006-07 – 1st prize, IBM’s Great Mind Challenge – “Transport Management System”,
a multi-TSP implementation using a Genetic Algorithm.
** Thank You **
More Related Content

PDF
Spark 101
Mohit Garg
 
PPTX
Big Data Analytics-Open Source Toolkits
DataWorks Summit
 
PPTX
Large Scale Machine learning with Spark
Md. Mahedi Kaysar
 
PDF
How Machine Learning and AI Can Support the Fight Against COVID-19
Databricks
 
PDF
Large-Scale Machine Learning with Apache Spark
DB Tsai
 
PDF
Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...
Databricks
 
PDF
Machine Learning by Example - Apache Spark
Meeraj Kunnumpurath
 
PDF
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Databricks
 
Spark 101
Mohit Garg
 
Big Data Analytics-Open Source Toolkits
DataWorks Summit
 
Large Scale Machine learning with Spark
Md. Mahedi Kaysar
 
How Machine Learning and AI Can Support the Fight Against COVID-19
Databricks
 
Large-Scale Machine Learning with Apache Spark
DB Tsai
 
Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...
Databricks
 
Machine Learning by Example - Apache Spark
Meeraj Kunnumpurath
 
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Databricks
 

What's hot (20)

PDF
Apache Spark: The Analytics Operating System
Adarsh Pannu
 
PDF
Recent Developments in Spark MLlib and Beyond
DataWorks Summit
 
PDF
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Databricks
 
PDF
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Databricks
 
PDF
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Databricks
 
PDF
Machine Learning using Apache Spark MLlib
IMC Institute
 
PDF
Apache Spark & MLlib
Grigory Sapunov
 
PDF
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Databricks
 
PPTX
Combining Machine Learning Frameworks with Apache Spark
Databricks
 
PDF
Composable Parallel Processing in Apache Spark and Weld
Databricks
 
PDF
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Spark Summit
 
PDF
Integrating Deep Learning Libraries with Apache Spark
Databricks
 
PDF
Ray and Its Growing Ecosystem
Databricks
 
PDF
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Spark Summit
 
PDF
H2O Design and Infrastructure with Matt Dowle
Sri Ambati
 
PDF
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)
Spark Summit
 
PDF
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark Summit
 
PDF
Introduction to Spark Training
Spark Summit
 
PDF
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Databricks
 
PDF
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Spark Summit
 
Apache Spark: The Analytics Operating System
Adarsh Pannu
 
Recent Developments in Spark MLlib and Beyond
DataWorks Summit
 
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Databricks
 
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Databricks
 
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Databricks
 
Machine Learning using Apache Spark MLlib
IMC Institute
 
Apache Spark & MLlib
Grigory Sapunov
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Databricks
 
Combining Machine Learning Frameworks with Apache Spark
Databricks
 
Composable Parallel Processing in Apache Spark and Weld
Databricks
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Spark Summit
 
Integrating Deep Learning Libraries with Apache Spark
Databricks
 
Ray and Its Growing Ecosystem
Databricks
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Spark Summit
 
H2O Design and Infrastructure with Matt Dowle
Sri Ambati
 
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)
Spark Summit
 
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark Summit
 
Introduction to Spark Training
Spark Summit
 
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Databricks
 
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Spark Summit
 
Ad

Similar to Big learning 1.2 (20)

PPTX
Azure Databricks for Data Scientists
Richard Garris
 
PPT
Scalable Machine Learning: The Role of Stratified Data Sharding
inside-BigData.com
 
PDF
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
Ioan Toma
 
PPTX
Shikha fdp 62_14july2017
Dr. Shikha Mehta
 
PDF
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Anyscale
 
PDF
Apache con big data 2015 - Data Science from the trenches
Vinay Shukla
 
PDF
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Paige_Roberts
 
PPTX
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Lucas Jellema
 
PPTX
Oow2016 review-db-dev-bigdata-BI
Getting value from IoT, Integration and Data Analytics
 
PPTX
Large-scale Recommendation Systems on Just a PC
Aapo Kyrölä
 
PDF
Deep Learning for Autonomous Driving
Jan Wiegelmann
 
PDF
High Performance Engineering - 01-intro.pdf
ss63261
 
PDF
Maximize Impact: Learn from the Dual Pillars of Open-Source Energy Planning T...
IEA-ETSAP
 
PDF
What is Distributed Computing, Why we use Apache Spark
Andy Petrella
 
PPTX
Chap 2 classification of parralel architecture and introduction to parllel p...
Malobe Lottin Cyrille Marcel
 
PDF
The Analytics Frontier of the Hadoop Eco-System
inside-BigData.com
 
PDF
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
Ilkay Altintas, Ph.D.
 
PPT
Data Streaming in Big Data and Data mining in streaming
AMSERMAKANITeaching
 
PDF
From Pipelines to Refineries: Scaling Big Data Applications
Databricks
 
PDF
E05312426
IOSR-JEN
 
Azure Databricks for Data Scientists
Richard Garris
 
Scalable Machine Learning: The Role of Stratified Data Sharding
inside-BigData.com
 
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
Ioan Toma
 
Shikha fdp 62_14july2017
Dr. Shikha Mehta
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Anyscale
 
Apache con big data 2015 - Data Science from the trenches
Vinay Shukla
 
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Paige_Roberts
 
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Lucas Jellema
 
Large-scale Recommendation Systems on Just a PC
Aapo Kyrölä
 
Deep Learning for Autonomous Driving
Jan Wiegelmann
 
High Performance Engineering - 01-intro.pdf
ss63261
 
Maximize Impact: Learn from the Dual Pillars of Open-Source Energy Planning T...
IEA-ETSAP
 
What is Distributed Computing, Why we use Apache Spark
Andy Petrella
 
Chap 2 classification of parralel architecture and introduction to parllel p...
Malobe Lottin Cyrille Marcel
 
The Analytics Frontier of the Hadoop Eco-System
inside-BigData.com
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
Ilkay Altintas, Ph.D.
 
Data Streaming in Big Data and Data mining in streaming
AMSERMAKANITeaching
 
From Pipelines to Refineries: Scaling Big Data Applications
Databricks
 
E05312426
IOSR-JEN
 
Ad

Recently uploaded (20)

PDF
Practical Measurement Systems Analysis (Gage R&R) for design
Rob Schubert
 
PDF
Mastering Financial Analysis Materials.pdf
SalamiAbdullahi
 
PPTX
Presentation on animal welfare a good topic
kidscream385
 
PPTX
Fluvial_Civilizations_Presentation (1).pptx
alisslovemendoza7
 
PDF
An Uncut Conversation With Grok | PDF Document
Mike Hydes
 
PPTX
MR and reffffffvvvvvvvfversal_083605.pptx
manjeshjain
 
PPTX
Data-Users-in-Database-Management-Systems (1).pptx
dharmik832021
 
PDF
Fundamentals and Techniques of Biophysics and Molecular Biology (Pranav Kumar...
RohitKumar868624
 
PPT
Real Life Application of Set theory, Relations and Functions
manavparmar205
 
PDF
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
PDF
blockchain123456789012345678901234567890
tanvikhunt1003
 
PDF
Blitz Campinas - Dia 24 de maio - Piettro.pdf
fabigreek
 
PPTX
lecture 13 mind test academy it skills.pptx
ggesjmrasoolpark
 
PDF
TIC ACTIVIDAD 1geeeeeeeeeeeeeeeeeeeeeeeeeeeeeer3.pdf
Thais Ruiz
 
PPTX
Multiscale Segmentation of Survey Respondents: Seeing the Trees and the Fores...
Sione Palu
 
PDF
The_Future_of_Data_Analytics_by_CA_Suvidha_Chaplot_UPDATED.pdf
CA Suvidha Chaplot
 
PDF
Technical Writing Module-I Complete Notes.pdf
VedprakashArya13
 
PPTX
M1-T1.pptxM1-T1.pptxM1-T1.pptxM1-T1.pptx
teodoroferiarevanojr
 
PDF
Blue Futuristic Cyber Security Presentation.pdf
tanvikhunt1003
 
PDF
717629748-Databricks-Certified-Data-Engineer-Professional-Dumps-by-Ball-21-03...
pedelli41
 
Practical Measurement Systems Analysis (Gage R&R) for design
Rob Schubert
 
Mastering Financial Analysis Materials.pdf
SalamiAbdullahi
 
Presentation on animal welfare a good topic
kidscream385
 
Fluvial_Civilizations_Presentation (1).pptx
alisslovemendoza7
 
An Uncut Conversation With Grok | PDF Document
Mike Hydes
 
MR and reffffffvvvvvvvfversal_083605.pptx
manjeshjain
 
Data-Users-in-Database-Management-Systems (1).pptx
dharmik832021
 
Fundamentals and Techniques of Biophysics and Molecular Biology (Pranav Kumar...
RohitKumar868624
 
Real Life Application of Set theory, Relations and Functions
manavparmar205
 
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
blockchain123456789012345678901234567890
tanvikhunt1003
 
Blitz Campinas - Dia 24 de maio - Piettro.pdf
fabigreek
 
lecture 13 mind test academy it skills.pptx
ggesjmrasoolpark
 
TIC ACTIVIDAD 1geeeeeeeeeeeeeeeeeeeeeeeeeeeeeer3.pdf
Thais Ruiz
 
Multiscale Segmentation of Survey Respondents: Seeing the Trees and the Fores...
Sione Palu
 
The_Future_of_Data_Analytics_by_CA_Suvidha_Chaplot_UPDATED.pdf
CA Suvidha Chaplot
 
Technical Writing Module-I Complete Notes.pdf
VedprakashArya13
 
M1-T1.pptxM1-T1.pptxM1-T1.pptxM1-T1.pptx
teodoroferiarevanojr
 
Blue Futuristic Cyber Security Presentation.pdf
tanvikhunt1003
 
717629748-Databricks-Certified-Data-Engineer-Professional-Dumps-by-Ball-21-03...
pedelli41
 

Big learning 1.2

  • 1. Practical Aspects of Machine Learning on Big-Data platforms (Big Learning) Mohit Garg Research Engineer TXII, Airbus Group Innovations 25-05-2015
  • 2. Motivation & Agenda Motivation: Answer the questions – Why multiple ecosystems for scalable ML? – Is there a unified approach for a big- data platform for ML? – How much to catch up on? – How industry leaders are doing it? – Put things into perspective ! Agenda: To present – Quick brief on practical ML process – Current landscape of Open source tools – Evolutionary drivers with examples – Case studies – The Twitter Experience – * Share journey and observations (ongoing process) ML (Optimization, LA, Stats) scalabilityBig Data (Schema, workflow, architecture)
  • 3. Quick brief on ML process
  • 5. Quick brief - Process α β Train Tune (Cross Validate Data) Measure (Test Data) • Not applicable to all ML modeling techniques. Biologically-inspired algorithms are more of a paradigm (guidelines) rather than algorithm, and requires algorithm definition under those guidelines (GA, ACO, ANN). • Graph Source: http://www.astroml.org/sklearn_tutorial/practical.html Bias Vs Variance Learning Curve Data Sampling Algorithm Model EvaluationModel/Hypothesis
  • 6. Quick brief – Workflow Example Graph Source:
  • 7. Quick brief – WF breakdown k-means
  • 8. Quick brief – WF breakdown k-means Input Data (Points) Statement Block (assigning clusters) Termination condition (if no change) While (!termination_condition) meta input (new cluster centres) updates
  • 9. Quick brief – WF breakdown k-means Only after iteration is over Input Data (Points) Statement Block (assigning clusters) Termination condition (if no change) While (!termination_condition) meta input (new cluster centres) updates new cluster centres
  • 10. Quick brief – Pieces • An ML-algorithm is finally a computer algorithm • Complex design or blocks of – Blocks feeding each other. Output becoming Input – Iterations over entire data (Gradient descent for Linear, Logistic Regression, K-means etc) – Memory limitations – Algorithms as non-linear workflows. • Principles when operating on Large datasets – Minimize I/O – Don’t read/write disk again and again – Minimize Network transfer – Localize logic (non-additive?) to data – Use online-trainable algorithms- Optimized Parallel Algorithms – Ease of use –Abstraction - Package well for end-user
  • 11. Quick brief – then and now • Small Data – Static data • Big Data – Static Data – But cant run on single machine • Online Big Data – Integrated with ETL – Prediction and Learning together – Twitter case study α β Train Tune (Cross Validate Data) Measure (Test Data) α β Train Tune (Cross Validate Data) Measure (Test Data) Velocity α β
  • 13. Current Landscape – Open Source tools Stream Sybil
  • 14. Current Landscape – Open Source tools Data Complexity&completeness Stream Sibyl
  • 16. Current Landscape – Open Source tools Data Complexity&completeness Stream Sibyl
  • 18. Open Source tools – Landscape & Drivers Data Complexity&completeness Stream SybilIs there a wholesome solution?
  • 20. Quick Review: MapReduce + Hadoop • Bigger focus on – Large scaling on data – Scheduling and Concurrency control • Load balancing – Fault tolerance – Basically, to save tonnes of user’s efforts required in older frameworks like MPI. – The map and reduce can be ‘executables ’ in virtually any language (Streaming) – *Maps (& reducers) don’t interact ! • MapReduce exploits massively parallelizable problems, what about rest of them? – Simple case: Try finding median of integers(1-40 say) using MR? • Can we expect any algorithm to execute in with MR implementation with same time- bounds?
  • 21. Loose Integration • Set of components/APIs – exposing existing tools with Map-Reduce frameworks – to be compiled, optimized and deployed in – streaming or pipe mode with frameworks. • Hadoop/MapReduce bindings available for – R – Python (numpy, sci-kit) • Focus on – Accommodating existing user-base to leverage hadoop data storage – Easy & neat APIs for native users. – No real effort on ‘bridging the gap’
  • 23. Loose Integration – Pydoop Example • Uses Hadoop Pipes as underlying framework • Based on Cpython, so provide inclusion of sci-kit, num-py etc • Lets you define map and reduce logics • But, does not provide better representations of ML Algorithms
  • 26. Scientific - Efforts • Efforts comes in waves with breakthroughs • Efforts on – Accuracy bounds & Convergence – Execution time bounds • Recent efforts in tandem with Big Data – Distributable algorithms – Central Limit theorem (local logic with simpler aggregations) – Batch-to-Online model – ‘One pass mode’ (avoid iterations) • Example – Distributable algorithms - Ensemble Classification (eg Random forest), K-means++|| – Batch-to-online - (SGD) • Note – Power ‘inherently’ lies in Big Data – Simple algorithm with larger dataset outperforms complex algorithms with smaller dataset Image-2 Source: Andrew Ng – Coursera ML-08 O1 O2 ON Ǝ
  • 27. Logistic Classification • Sample: Yes/No kind of answers – Is this tweet a spam? – Will this tweeter log back in 48 hours? X1 X2 …… XN Y X11 . . X1N 0 . . . . 1 . . . . 1 . . . . 1 . . . . 0 XM1 . . XMN 0 X Y x1 x2 xN hӨ (x) Ө1 Ө2 ӨN Hypothesis hӨ (x) = 1 / ( 1 + e – ӨTx) Cost(x) hӨ (x) - y J =Cost(X) • Ө is unknown variable • Lets start with random value of Ө • Aim is to change Ө to minimize J
  • 28. Gradient Descent Iterations J =Cost(X, Ө) minimize using Gradient DescentVisualization in 2D
  • 29. Gradient Descent • Cost function requires all records. While (W does not change) { // Load data // find local losses // Aggregate local losses // Find gradient // Update W // Save W to disk } /* Multiple Passes */ J =Cost(X, Ө) M1 M2 MN Map – loads data Reduce – Calculates gradient and updates W R Saves W (intermediate) User code
  • 30. Stochastic Gradient Descent (SGD) • No need to get cost function for gradient calculation • Each iteration on a data point - xi • Gradient calculated using only xi • As good as performing SGD on single machine. Reducer – a serious bottleneck M1 M2 MN Map – loads data Reduce – Calculates gradient and updates W R Saves W (final) // Load data While (no samples left) { // Find gradient using xi // Update W } // Save W /* Single Pass */ User code Ref: Bottou, Léon (1998). "Online Algorithms and Stochastic Approximations". Online Learning and Neural Networks
  • 31. SGD - Distributed • Similar to SGD, but have multiple reducers • Data is thoroughly randomized • Multiple classifiers are learned together – ensemble classifiers • Bottleneck of single reducer (Network Data) resolved • Testing using standard aggregation over predictors’ results M1 M2 MN Map –load data Reduce – Calculates gradient and updates WjR1 W1 // Pre-process – Randomize // Load data While (no samples left) { // Find gradient using xi // Update Wj } /* Single Pass and distributed */ User code R2 W2 Ref: L. Bottou. Large-scale machine learning with stochastic gradient descent. COMPSTAT, 2010.
  • 33. Now1971 2020 Moore’s Law vs Kryder’s Law Source: Collective information from Wiki & its references “if hard drives were to continue to progress at their then-current pace of about 40% per year, then in 2020 a two-platter, 2.5- inch disk drive would store approximately 40 TB and cost about $40” - Kryder Moore’s II law : “As the cost of computer power to the consumer falls, the cost for producers to fulfill Moore's law follows an opposite trend: R&D, manufacturing, and test costs have increased steadily with each new generation of chips” GAP - Individual processors’ power growing at slower rate - Data Storage becomes easier & cheaper. - MORE data, LESS processors – and the gap is widening ! - Computer h/w architecture working at its pace to provide faster buses, RAM & augmented GPUs. Architectural – Forces
  • 34. 2012 VolumeinExabytes 15000 2017 Percentage of uncertain data Percentofuncertaindata We are here Sensors & Devices VoIP Enterprise Data Social Media 6000 9000 100 0 50 VeracityVolume Variety Architectural – Forces Source: IBM - Challenges and Opportunities with Big Data- Dr Hammou Messatfa
  • 35. Mahout with MapReduce • Key feature: Somewhat loose & somewhat tight integration – Among the earliest library to exploit batch-like scalable components online learning algorithms. – Some algorithms re-engineered for MapReduce, some not. – Performance hit for iterative algorithms. Huge I/O overhead – Each (global) iteration means Map-Reduce job :O – Integration of new scalable learners less active. • Industry acceptance – Accepted for scalable Recommender systems • Future – Mahout Samsara for scalable low-level Linear Algebra as scala & spark bindings Sybil
  • 36. Cascading
• Key feature: abstraction & packaging
– Lets you think of workflows as chains of MR jobs
– Pre-existing methods for reading and storing data
– Provides checkpoints in the workflow to save state
– JUnit support for test-case-driven s/w development
• Industry acceptance
– Scalding, from Twitter, provides Scala bindings for Cascading
– Used by Bixo for anti-spam classification
– Used to load data by Elasticsearch & Cassandra
– eBay leverages the Scalding design for distributed computing
  • 38. Pig – Quick Summary
 High-level dataflow language (Pig Latin)
 Much simpler than Java
 Simplifies data processing
 Puts the operations at the appropriate phases
 Chains multiple MR jobs – appropriate for ML workflows
 No need to take care of intermediate outputs
 Provides user-defined functions (UDFs) in Java, integrable with Pig
  • 39. Pig – Quick Summary
A = LOAD 'file1' AS (x, y, z);
B = LOAD 'file2' AS (t, u, v);
C = FILTER A BY y > 0;
D = JOIN C BY x, B BY u;
E = GROUP D BY z;
F = FOREACH E GENERATE group, COUNT(D);
STORE F INTO 'output';
(Logical plan: LOAD → FILTER / LOAD → JOIN → GROUP → FOREACH → STORE, compiled into two MapReduce jobs with LOCAL REARRANGE / PACKAGE / FOREACH stages)
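To make the dataflow of the Pig script on the slide concrete, here is the same pipeline re-expressed in plain Python over toy in-memory stand-ins for file1/file2; the sample tuples are invented for illustration.

```python
# The slide's Pig script as plain Python: filter, join, group, count.
from collections import defaultdict

file1 = [(1, 5, 'a'), (2, -3, 'b'), (3, 7, 'a')]  # A: (x, y, z)
file2 = [(9, 1, 'p'), (8, 3, 'q')]                # B: (t, u, v)

C = [(x, y, z) for (x, y, z) in file1 if y > 0]   # FILTER A BY y > 0
D = [(c, b) for c in C for b in file2
     if c[0] == b[1]]                             # JOIN C BY x, B BY u
E = defaultdict(list)                             # GROUP D BY z
for c, b in D:
    E[c[2]].append((c, b))
F = {z: len(rows) for z, rows in E.items()}       # FOREACH E GENERATE group, COUNT(D)
```

Pig's win is that you write only the seven declarative lines and it plans the shuffles (the JOIN and GROUP each imply a reduce phase) without you managing intermediate outputs.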
  • 40. ML-lib
 Part of the Apache Spark framework
 Data can come from HDFS, S3 or local files
 Encapsulates run-time data as Resilient Distributed Datasets (RDDs)
 RDDs are in-memory data pieces
 Fault tolerant – an RDD knows how to recreate itself if its resident node goes down
 No distinction between map and reduce: everything is just a task
 Bindings for R too – SparkR
 Real ingenuity in implementing new-generation algorithms (online & distributed)
 For example, it has three versions of k-means: Lloyd, k-means++ and k-means||
 Key feature: shared objects – tasks belonging to one node can share objects
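As a feel for the smarter initializers the slide mentions, here is a minimal single-machine sketch of k-means++ seeding; this is not MLlib's implementation (MLlib's k-means|| parallelizes exactly this sampling step), and the 1-D point set is a toy assumption.

```python
# k-means++ seeding: each new centre is sampled with probability
# proportional to its squared distance to the nearest centre chosen so far,
# which spreads the initial centres across the data.
import random

def kmeanspp_init(points, k, seed=0):
    rng = random.Random(seed)
    centres = [rng.choice(points)]
    while len(centres) < k:
        # Squared distance of every point to its nearest existing centre.
        d2 = [min((p - c) ** 2 for c in centres) for p in points]
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centres.append(p)
                break
        else:
            centres.append(points[-1])  # guard against float round-off
    return centres

# Two well-separated 1-D clusters: seeding should pick one centre from each.
pts = [0.0, 0.1, 0.2, 100.0, 100.1, 100.2]
centres = kmeanspp_init(pts, 2)
```

Plain Lloyd with random initialization can put both seeds in one cluster; the distance-weighted sampling above makes that overwhelmingly unlikely, which is why the variants are worth separate implementations.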
  • 42. Tez
 Apache Incubator project
 Fundamentally similar design principles to Spark’s
 Encapsulates run-time data as DAG nodes, much like RDDs
 Key features
 In-memory data
 Shared objects – tasks belonging to one node can share objects
 Very few comparative studies available
 Few contributions from the open community so far
  • 44. Distributed R
 Open-source project led by HP
 Similar to RHadoop but with some genuinely new features:
 User-defined array partitioning
 Local transformations/functions
 Master–worker synchronization
 Not (yet) the same ingenuity as seen in ML-lib
 Only fundamentally scalable algorithms (online & distributable) scale linearly
 Tall claims of 50–100x time efficiency when used with the HP Vertica database
  • 45. Sibyl
 Not open source yet, but some rumours!
 Claims to provide a GFS-based, highly scalable, flexible infrastructure for embedding the ML process in ETL
 Designed for supervised learning
 Focused on learning user behaviours
 YouTube video recommendations
 Spam filters
 Major design principle – columnar data
 Suitable for sparse datasets (new columns?)
 Compression techniques for columnar data are much more efficient (structural similarity)
  • 46. Columnar data – LZO compression
• Idea 1 – Compression should be ‘splittable’
– A large file can be compressed and split into pieces equal in size to an HDFS block
– Each block should hold its own ‘decompression key’
• Idea 2 – Compress data on Hadoop (save ~3/4 of the space)
– Save ~75% of I/O time, as long as decompression costs less than the I/O time it saves
Compression | Size (GB) | Compression time (s) | Decompression time (s)
None        | 8.0       | –                    | –
Gzip        | 1.3       | 241                  | 72
LZO         | 2.0       | 55                   | 35
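A back-of-the-envelope check of Idea 2, using the table's numbers plus an assumed sequential disk throughput of 100 MB/s (the slide gives no throughput figure, so that constant is ours).

```python
# Time to get the data into a usable form = disk-read time + decompression
# time, for each option in the slide's table. Throughput is an assumption.
DISK_MB_PER_S = 100.0
MB_PER_GB = 1024.0

def time_to_read(size_gb, decompress_s=0.0):
    """Seconds to pull the data off disk and decompress it."""
    return size_gb * MB_PER_GB / DISK_MB_PER_S + decompress_s

raw_s  = time_to_read(8.0)                     # uncompressed: pure I/O
gzip_s = time_to_read(1.3, decompress_s=72.0)  # smallest, but slow to expand
lzo_s  = time_to_read(2.0, decompress_s=35.0)  # splittable and fast
```

Under this assumed throughput LZO beats raw reads while gzip does not, which is exactly the slide's condition: decompression must cost less than the I/O time it saves (and LZO additionally stays splittable across HDFS blocks).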
  • 47. Conclusion
• Big Data has resurrected interest in ML algorithms.
• A two-way push is leading the confluence – online & distributed learning (scientific) and flexible workflows (architectural) to accommodate them.
• Facilitated by compression, serialization, in-memory computing, DAG representations, columnar databases, etc.
• The majority of engineering hours goes into building the pipelines.
• Industry is aiming to provide high-level abstractions over standard ML algorithms, hiding the gory details.
  • 48. Learning Resources
• MOOCs (Coursera)
– Machine Learning (Stanford)
– Design & Analysis of Algorithms (Stanford)
– R Programming Language (Johns Hopkins)
– Exploratory Data Analysis (Johns Hopkins)
• Online competitions
– Kaggle data-science platform
• Software resources
– Matlab, R, scikit-learn (Python)
– APIs – ANN, JGAP
• 2009 – “Subspace extracting Adaptive Cellular Network for layered Architectures with circular boundaries”, IEEE paper.
• 2006–07 – 1st prize, IBM’s Great Mind Challenge – “Transport Management System”, a multi-TSP implementation using a genetic algorithm.