Practical Aspects of
Machine Learning on Big-Data platforms
(Big Learning)
Mohit Garg
Research Engineer
TXII, Airbus Group Innovations
25-05-2015
Motivation & Agenda
Motivation: Answer the questions
– Why multiple ecosystems for scalable ML?
– Is there a unified approach to a big-data platform for ML?
– How much is there to catch up on?
– How are industry leaders doing it?
– Put things into perspective!
Agenda: To present
– Quick brief on practical ML process
– Current landscape of Open source tools
– Evolutionary drivers with examples
– Case studies
– The Twitter Experience
– * Share journey and observations
(ongoing process)
ML (Optimization, LA, Stats)
scalability
Big Data (Schema, workflow, architecture)
Quick brief on ML process
Big learning   1.2
Quick brief - Process
[Diagram: parameters α, β – Train → Tune (cross-validation data) → Measure (test data)]
• Not applicable to all ML modeling techniques. Biologically-inspired algorithms are more of a paradigm
(a set of guidelines) than an algorithm, and require an algorithm definition under those guidelines (GA, ACO, ANN).
• Graph Source: http://www.astroml.org/sklearn_tutorial/practical.html
Bias Vs Variance
Learning Curve
Data Sampling | Algorithm | Model Evaluation | Model/Hypothesis
Quick brief – Workflow Example
Graph Source:
Quick brief – WF breakdown k-means
Quick brief – WF breakdown k-means
Input Data (Points)
Statement Block
(assigning
clusters)
Termination
condition (if no
change)
While (!termination_condition)
meta input
(new cluster centres)
updates
Quick brief – WF breakdown k-means
Only after iteration
is over
Input Data (Points)
Statement Block
(assigning
clusters)
Termination
condition (if no
change)
While (!termination_condition)
meta input
(new cluster centres)
updates
new cluster centres
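The loop structure sketched above can be written out in plain Python/NumPy; this is a minimal single-machine sketch for illustration (the data and k are whatever the caller supplies), with comments mapping each piece back to the workflow boxes:

```python
import numpy as np

def kmeans(points, k, seed=0):
    rng = np.random.default_rng(seed)
    # meta input for the first pass: initial cluster centres
    centres = points[rng.choice(len(points), size=k, replace=False)]
    while True:                                       # While (!termination_condition)
        # statement block: assign each point to its nearest centre
        labels = ((points[:, None] - centres) ** 2).sum(-1).argmin(1)
        # updates, only after the iteration is over: new cluster centres
        new = np.array([points[labels == j].mean(0) for j in range(k)])
        if np.allclose(new, centres):                 # termination: no change
            return centres, labels
        centres = new
```

Note the shape of the loop: every pass touches the entire dataset, which is exactly what makes a naive MapReduce port of such algorithms expensive.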
Quick brief – Pieces
• An ML algorithm is, in the end, a computer algorithm
• A complex design of blocks
– Blocks feeding each other: output becoming input
– Iterations over the entire data (gradient descent for linear/logistic
regression, k-means, etc.) – memory limitations
– Algorithms as non-linear workflows
• Principles when operating on large datasets
– Minimize I/O – don’t read/write disk again and again
– Minimize network transfer – localize logic (non-additive?) to
the data
– Use online-trainable algorithms – optimized parallel algorithms
– Ease of use – abstraction – package well for the end-user
Quick brief – then and now
• Small Data
– Static data
• Big Data
– Static Data
– But can’t run on a single machine
• Online Big Data
– Integrated with ETL
– Prediction and Learning together
– Twitter case study
[Diagram: the α, β – Train → Tune (cross-validation data) → Measure (test data) loop, shown twice – for static Big Data and for Online Big Data with Velocity]
Current Landscape
Current Landscape – Open Source tools
Stream
Sibyl
Current Landscape – Open Source tools
Data
Complexity & completeness
Stream
Sibyl
Evolutionary Drivers
Current Landscape – Open Source tools
Data
Complexity & completeness
Stream
Sibyl
Evolutionary Drivers
Data
Complexity & completeness
Stream
1. Loose Integration
2. Scientific
3. Architectural
4. Use Case*
Open Source tools – Landscape & Drivers
Data
Complexity & completeness
Stream
Sibyl
Is there a wholesome solution?
Loose Integration
Quick Review: MapReduce + Hadoop
• Bigger focus on
– Scaling to large data
– Scheduling and concurrency control
• Load balancing
– Fault tolerance
– Basically, saving the tonnes of user effort
required by older frameworks like MPI.
– The map and reduce can be ‘executables’ in
virtually any language (Streaming)
– *Maps (& reducers) don’t interact!
• MapReduce exploits massively
parallelizable problems – what about the rest of
them?
– Simple case: try finding the median of integers (1–40,
say) using MR
• Can we expect any algorithm to execute
with an MR implementation within the same time
bounds?
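To see why the median resists the one-shot MR pattern while the mean does not, compare what a fixed-size per-split summary buys you. A toy sketch in plain Python (the splits are made-up data standing in for mapper inputs):

```python
# Each inner list stands for one input split handled by one mapper.
splits = [[3, 1, 7], [40, 2], [9, 5, 6]]

# Mean: each "mapper" emits a fixed-size summary (sum, count);
# a single "reduce" adds them up -- embarrassingly parallel.
partials = [(sum(s), len(s)) for s in splits]
total = sum(p[0] for p in partials)
n = sum(p[1] for p in partials)
mean = total / n                                        # 9.125

# Median: no fixed-size per-split summary suffices -- you need a global
# order, i.e. a full sort (or several MR rounds: histogram, then refine).
median = sorted(x for s in splits for x in s)[n // 2]   # 6
```

The mean composes from partial results; the median does not, which is why it needs extra MR passes or a shuffle of all the data.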
Loose Integration
• A set of components/APIs
– exposing existing tools to Map-Reduce frameworks,
– to be compiled, optimized and deployed
– in streaming or pipe mode with those frameworks.
• Hadoop/MapReduce bindings available for
– R
– Python (NumPy, scikit)
• Focus on
– Accommodating the existing user base to leverage Hadoop data storage
– Easy & neat APIs for native users.
– No real effort on ‘bridging the gap’
R-Hadoop
Loose Integration – Pydoop Example
• Uses Hadoop Pipes as the underlying framework
• Based on CPython, so allows inclusion of scikit-learn, NumPy, etc.
• Lets you define map and reduce logic
• But does not provide better representations of ML algorithms
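Since the deck mentions Streaming, here is the shape of that contract in plain Python. This mimics what a Hadoop Streaming mapper/reducer pair does (it is not Pydoop's actual API); the word-count task is the stock example:

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Map phase: emit (key, value) pairs; mappers never talk to each other.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Between the phases Hadoop sorts/groups by key; here we do it ourselves.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

counts = dict(reducer(mapper(["big data big learning", "big models"])))
# counts == {'big': 3, 'data': 1, 'learning': 1, 'models': 1}
```

In a real Streaming job the two functions would be separate executables reading stdin and writing stdout, with Hadoop providing the sort between them.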
Scientific
Scientific - Interest
Scientific - Efforts
• Efforts come in waves with breakthroughs
• Efforts on
– Accuracy bounds & Convergence
– Execution time bounds
• Recent efforts in tandem with Big Data
– Distributable algorithms – Central Limit theorem (local
logic with simpler aggregations)
– Batch-to-Online model – ‘One pass mode’ (avoid
iterations)
• Example
– Distributable algorithms – ensemble classification (e.g.
Random Forest), k-means|| (scalable k-means++)
– Batch-to-online – SGD
• Note – the power ‘inherently’ lies in Big Data
– A simple algorithm with a larger dataset outperforms a complex
algorithm with a smaller dataset
Image-2 Source: Andrew Ng – Coursera ML-08
Logistic Classification
• Sample: Yes/No kind of answers
– Is this tweet spam?
– Will this tweeter log back in within 48 hours?
[Table: M training records – features X1 … XN, binary label Y ∈ {0, 1}]
[Diagram: inputs x1 … xN, weighted by Ө1 … ӨN, feed the hypothesis hӨ(x)]
Hypothesis hӨ(x) = 1 / (1 + e^(–ӨTx))
Cost(x) = hӨ(x) – y
J = Cost(X)
• Ө is the unknown variable
• Let’s start with a random value of Ө
• The aim is to change Ө to minimize J
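The hypothesis and the per-record error term above, written out in NumPy (a sketch – the θ, x, y values are made up for illustration):

```python
import numpy as np

def h(theta, x):
    # hypothesis h_theta(x) = 1 / (1 + e^(-theta^T x))
    return 1.0 / (1.0 + np.exp(-(theta @ x)))

theta = np.zeros(3)              # start from some value of theta (here zeros)
x = np.array([1.0, 0.5, -2.0])   # one record (first component is the bias term)
y = 1.0                          # its label
error = h(theta, x) - y          # the Cost(x) term driving the update
# with theta all zeros, h(theta, x) is exactly 0.5, so error is -0.5
```

Gradient descent (next slides) repeatedly nudges θ in the direction that shrinks this error over the whole dataset.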
Gradient Descent
Iterations
J = Cost(X, Ө) – minimize using Gradient Descent
Visualization in 2D
Gradient Descent
• Cost function requires all records.
While (W does not change)
{
// Load data
// find local losses
// Aggregate local losses
// Find gradient
// Update W
// Save W to disk
}
/* Multiple Passes */
J =Cost(X, Ө)
M1 M2 MN
Map – loads data
Reduce – Calculates
gradient and updates W
R
Saves W (intermediate)
User code
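The multiple-passes loop in the comments maps onto NumPy like this – a single-machine sketch of batch gradient descent for the logistic model (learning rate and iteration count are arbitrary illustrative choices):

```python
import numpy as np

def batch_gd(X, y, lr=0.5, iters=200):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):                       # each pass scans ALL records
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))   # "find local losses"
        grad = X.T @ (p - y) / len(y)            # "aggregate" -> gradient
        theta -= lr * grad                       # "update W"
    return theta                                 # MR would "save W" every pass
```

In the MapReduce version every trip around this loop is a full job – load data, compute, write W back to disk – which is exactly the I/O overhead the slide calls out.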
Stochastic Gradient Descent (SGD)
• No need to evaluate the full cost function for the gradient calculation
• Each iteration works on a single data point xi
• The gradient is calculated using only xi
• As good as performing SGD on a single machine. The reducer is a serious
bottleneck
M1 M2 MN
Map – loads data
Reduce – Calculates
gradient and updates W
R
Saves W (final)
// Load data
While (no samples left)
{
// Find gradient using xi
// Update W
}
// Save W
/* Single Pass */
User code
Ref: Bottou, Léon (1998). "Online Algorithms and Stochastic Approximations". Online Learning and Neural Networks
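The single-pass version in the comments, sketched in NumPy: the gradient at each step comes from one point alone (the learning rate is an arbitrary illustrative choice):

```python
import numpy as np

def sgd(X, y, lr=0.5):
    theta = np.zeros(X.shape[1])
    for xi, yi in zip(X, y):                        # single pass, no outer loop
        p = 1.0 / (1.0 + np.exp(-(xi @ theta)))
        theta -= lr * (p - yi) * xi                 # gradient from xi alone
    return theta                                    # "save W" once at the end
```

One pass over shuffled data is often enough to get close to the batch solution, which is what makes SGD attractive for data too big to iterate over repeatedly.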
SGD - Distributed
• Similar to SGD, but with multiple reducers
• Data is thoroughly randomized
• Multiple classifiers are learned together – ensemble classifiers
• The bottleneck of a single reducer (network data) is resolved
• Testing uses standard aggregation over the predictors’ results
M1 M2 MN
Map – loads data
Reduce – calculates
gradient and updates Wj
R1
W1
// Pre-process – Randomize
// Load data
While (no samples left)
{
// Find gradient using xi
// Update Wj
}
/* Single Pass and
distributed */
User code
R2
W2
Ref: L. Bottou. Large-scale machine learning with stochastic gradient descent. COMPSTAT, 2010.
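A single-process sketch of the ensemble idea above: shuffle, shard, run one-pass SGD per "reducer" to get its own Wj, then aggregate the predictors' results (here by averaging probabilities – one of several reasonable aggregation choices):

```python
import numpy as np

def sgd_shard(X, y, lr=0.5):
    # One pass of SGD -- what a single reducer Rj does to produce Wj.
    theta = np.zeros(X.shape[1])
    for xi, yi in zip(X, y):
        p = 1.0 / (1.0 + np.exp(-(xi @ theta)))
        theta -= lr * (p - yi) * xi
    return theta

def train_ensemble(X, y, n_shards=4, seed=0):
    # Pre-process: thoroughly randomize, then split across "reducers".
    idx = np.random.default_rng(seed).permutation(len(y))
    return [sgd_shard(X[s], y[s]) for s in np.array_split(idx, n_shards)]

def predict(models, X):
    # Aggregate over the predictors' results (average the probabilities).
    probs = np.mean([1.0 / (1.0 + np.exp(-(X @ w))) for w in models], axis=0)
    return probs > 0.5
```

Each shard trains independently, so the single-reducer bottleneck disappears; the price is an ensemble rather than one model.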
Architectural
1971 – Now – 2020
Moore’s Law vs Kryder’s Law
Source: Collective information from Wiki & its references
“if hard drives were to continue to progress
at their then-current pace of about 40%
per year, then in 2020 a two-platter, 2.5-
inch disk drive would store approximately
40 TB and cost about $40” - Kryder
Moore’s second law: “As the cost of computer
power to the consumer falls, the cost for
producers to fulfill Moore's law follows an
opposite trend: R&D, manufacturing, and
test costs have increased steadily with each
new generation of chips”
GAP
- Individual processors’ power is growing at a slower rate
- Data storage becomes easier & cheaper
- MORE data, FEWER processors – and the gap is
widening!
- Computer h/w architecture is working at its own pace to
provide faster buses, RAM & augmented GPUs.
Architectural – Forces
[Chart: data volume in exabytes, 6000 in 2012 (“we are here”) rising through 9000 to ~15000 by 2017, vs. percent of uncertain data (0–100); sources: sensors & devices, VoIP, enterprise data, social media; dimensions: Volume, Variety, Veracity]
Architectural – Forces
Source: IBM - Challenges and Opportunities with Big Data- Dr Hammou Messatfa
Mahout with MapReduce
• Key feature: somewhat loose & somewhat tight integration
– Among the earliest libraries to exploit batch-like scalable components and online
learning algorithms.
– Some algorithms re-engineered for MapReduce, some not.
– Performance hit for iterative algorithms. Huge I/O overhead
– Each (global) iteration means a Map-Reduce job :O
– Integration of new scalable learners is less active.
• Industry acceptance
– Accepted for scalable recommender systems
• Future
– Mahout Samsara for scalable low-level linear algebra as Scala & Spark bindings
Cascading
• Key feature: abstraction & packaging
– Lets you think of workflows as chains of MR
– Pre-existing methods for reading and storage
– Provides checkpoints in the workflow to save state.
– JUnit for test-case-driven s/w development.
• Industry acceptance
– Scalding is the Scala binding for Cascading, from Twitter
– Used by Bixo for anti-spam classification
– Used to load data by Elasticsearch & Cassandra
– eBay leverages the Scalding design for distributed computing.
Pig – Quick Summary
 High-level dataflow language (Pig Latin)
 Much simpler than Java
 Simplifies the data processing
 Puts the operations at the appropriate phases
 Chains multiple MR jobs
 Appropriate for ML workflows
 No need to take care of intermediate outputs
 Provides user-defined functions (UDFs) in Java,
integrable with Pig
Pig – Quick Summary
A = LOAD 'file1' AS (x, y, z);
B = LOAD 'file2' AS (t, u, v);
C = FILTER A BY y > 0;
D = JOIN C BY x, B BY u;
E = GROUP D BY z;
F = FOREACH E GENERATE group, COUNT(D);
STORE F INTO 'output';
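For comparison, the same dataflow expressed in pandas on hypothetical in-memory data (the values here are made up; Pig compiles the equivalent chain into MapReduce jobs instead):

```python
import pandas as pd

A = pd.DataFrame({"x": [1, 2, 3], "y": [5, -1, 2], "z": ["a", "b", "a"]})
B = pd.DataFrame({"t": [9, 8], "u": [1, 3], "v": [0, 0]})

C = A[A["y"] > 0]                            # FILTER A BY y > 0
D = C.merge(B, left_on="x", right_on="u")    # JOIN C BY x, B BY u
F = D.groupby("z").size()                    # GROUP D BY z; COUNT(D)
```

The point of Pig is that each line above becomes a declarative step whose intermediate outputs and job boundaries are managed for you.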
[Diagram: logical plan – LOAD, LOAD → FILTER → JOIN → GROUP → FOREACH → STORE – compiled into two chained MapReduce jobs; Map side: FILTER, LOCAL REARRANGE; Reduce side: PACKAGE, FOREACH]
ML-lib
 Part of the Apache Spark framework
 Data can come from HDFS, S3, local files
 Encapsulates run-time data as a Resilient Distributed Dataset (RDD)
 RDDs are in-memory data pieces
 Fault tolerant – an RDD knows how to recreate itself if its resident node goes down
 No distinction between map and reduce, just a task.
 Vigilant
 Bindings for R too – SparkR
 Real ingenuity in implementing new-generation algorithms (online &
distributed)
 For example, it has three versions of k-means – Lloyd, k-means++, k-means||
 Key feature
 Shared objects – tasks (belonging to one node) can share objects.
Spark (ML-lib)
Shared variable
Tez
 Apache Incubator project
 Fundamentally similar design principles to Spark
 Encapsulates run-time data as nodes, just like RDDs
 Key features
 In-memory data
 Shared objects – tasks (belonging to one node) can
share objects.
 Very few comparative studies available
 Not many contributions from the open community
Distributed R
 Open-source project led by HP
 Similar to R-Hadoop but with some genuinely new
features, such as
 User-defined array partitioning
 Local transformations/functions
 Master–worker synchronization
 Not the same ingenuity yet as seen in ML-lib.
 Only fundamentally scalable algorithms (online
& distributable) scale linearly.
 Tall claims of 50–100x time efficiency when
used with the HP Vertica database
Sibyl
 Not open-source yet, but some rumours!
 Claims to provide a GFS-based, highly scalable,
flexible infrastructure for embedding the ML
process in ETL
 Designed for supervised learning
 Focused on learning user behaviours
 YouTube video recommendations
 Spam filters
 Major design principle – columnar data
 Suitable for sparse datasets (new columns?)
 Compression techniques for columnar data are
much more efficient (structural similarity)
Columnar Data – LZO Compression
• Idea 1
– Compression should be ‘splittable’
– A large file can be compressed and
split into pieces the size of an HDFS block.
– Each block should hold its own
‘decompression key’

Compression | Size (GB) | Compression Time (s) | Decompression Time (s)
None        | 8.0       | -                    | -
Gzip        | 1.3       | 241                  | 72
LZO         | 2.0       | 55                   | 35

• Idea 2
– Compress data on Hadoop (save 3/4 of the
space)
– Save 75% of the I/O time!!
– Achieve decompression in < 75% of the I/O
time
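Plugging the table's numbers into a back-of-the-envelope model shows why LZO's weaker ratio can still win end-to-end. The 100 MB/s disk bandwidth is an assumed figure, and the model ignores overlap of reading and decompression:

```python
DISK_MB_S = 100.0                          # assumed sequential read speed (MB/s)

t_raw  = 8.0 * 1024 / DISK_MB_S            # read 8.0 GB uncompressed
t_gzip = 1.3 * 1024 / DISK_MB_S + 72       # read 1.3 GB, then decompress
t_lzo  = 2.0 * 1024 / DISK_MB_S + 35       # read 2.0 GB, then decompress
# roughly 82 s raw, 85 s gzip, 55 s lzo under these assumptions
```

Gzip's better ratio is eaten by its slow decompression; LZO satisfies Idea 2's condition that decompression costs less than the I/O time it saves.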
Conclusion
• Big Data has resurrected interest in ML algorithms.
• A two-way push is leading the confluence – online & distributed
learning (scientific) & flexible workflows (architectural) to
accommodate them.
• Facilitated by compression, serialization, in-memory computing, DAG
representations, columnar databases, etc.
• The majority of man-hours goes into engineering the pipelines.
• Industry is aiming to provide high-level abstractions over standard ML
algorithms, hiding the gory details.
Learning Resources
• MOOCs (Coursera)
– Machine Learning (Stanford)
– Design & Analysis of Algorithms (Stanford)
– R Programming Language (Johns Hopkins)
– Exploratory Data Analysis (Johns Hopkins)
• Online Competitions
– Kaggle Data Science platform
• Software Resources
– Matlab
– R
– scikit-learn (Python)
– APIs – ANN, JGAP
• 2009 – “Subspace extracting Adaptive Cellular Network for layered Architectures
with circular boundaries.” Paper on IEEE.
• 2006-07 – 1st prize, IBM’s Great Mind Challenge – “Transport Management System”,
a multi-TSP implementation using a Genetic Algorithm.
** Thank You **
More Related Content

PDF
Spark 101
Mohit Garg
 
PPTX
Big Data Analytics-Open Source Toolkits
DataWorks Summit
 
PPTX
Large Scale Machine learning with Spark
Md. Mahedi Kaysar
 
PDF
How Machine Learning and AI Can Support the Fight Against COVID-19
Databricks
 
PDF
Large-Scale Machine Learning with Apache Spark
DB Tsai
 
PDF
Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...
Databricks
 
PDF
Machine Learning by Example - Apache Spark
Meeraj Kunnumpurath
 
PDF
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Databricks
 
Spark 101
Mohit Garg
 
Big Data Analytics-Open Source Toolkits
DataWorks Summit
 
Large Scale Machine learning with Spark
Md. Mahedi Kaysar
 
How Machine Learning and AI Can Support the Fight Against COVID-19
Databricks
 
Large-Scale Machine Learning with Apache Spark
DB Tsai
 
Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...
Databricks
 
Machine Learning by Example - Apache Spark
Meeraj Kunnumpurath
 
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Databricks
 

What's hot (20)

PDF
Apache Spark: The Analytics Operating System
Adarsh Pannu
 
PDF
Recent Developments in Spark MLlib and Beyond
DataWorks Summit
 
PDF
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Databricks
 
PDF
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Databricks
 
PDF
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Databricks
 
PDF
Machine Learning using Apache Spark MLlib
IMC Institute
 
PDF
Apache Spark & MLlib
Grigory Sapunov
 
PDF
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Databricks
 
PPTX
Combining Machine Learning Frameworks with Apache Spark
Databricks
 
PDF
Composable Parallel Processing in Apache Spark and Weld
Databricks
 
PDF
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Spark Summit
 
PDF
Integrating Deep Learning Libraries with Apache Spark
Databricks
 
PDF
Ray and Its Growing Ecosystem
Databricks
 
PDF
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Spark Summit
 
PDF
H2O Design and Infrastructure with Matt Dowle
Sri Ambati
 
PDF
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)
Spark Summit
 
PDF
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark Summit
 
PDF
Introduction to Spark Training
Spark Summit
 
PDF
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Databricks
 
PDF
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Spark Summit
 
Apache Spark: The Analytics Operating System
Adarsh Pannu
 
Recent Developments in Spark MLlib and Beyond
DataWorks Summit
 
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Databricks
 
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Databricks
 
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Databricks
 
Machine Learning using Apache Spark MLlib
IMC Institute
 
Apache Spark & MLlib
Grigory Sapunov
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Databricks
 
Combining Machine Learning Frameworks with Apache Spark
Databricks
 
Composable Parallel Processing in Apache Spark and Weld
Databricks
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Spark Summit
 
Integrating Deep Learning Libraries with Apache Spark
Databricks
 
Ray and Its Growing Ecosystem
Databricks
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Spark Summit
 
H2O Design and Infrastructure with Matt Dowle
Sri Ambati
 
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)
Spark Summit
 
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark Summit
 
Introduction to Spark Training
Spark Summit
 
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Databricks
 
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Spark Summit
 
Ad

Similar to Big learning 1.2 (20)

PPTX
Azure Databricks for Data Scientists
Richard Garris
 
PPT
Scalable Machine Learning: The Role of Stratified Data Sharding
inside-BigData.com
 
PDF
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
Ioan Toma
 
PPTX
Shikha fdp 62_14july2017
Dr. Shikha Mehta
 
PDF
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Anyscale
 
PDF
Apache con big data 2015 - Data Science from the trenches
Vinay Shukla
 
PDF
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Paige_Roberts
 
PPTX
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Lucas Jellema
 
PPTX
Oow2016 review-db-dev-bigdata-BI
Getting value from IoT, Integration and Data Analytics
 
PPTX
Large-scale Recommendation Systems on Just a PC
Aapo Kyrölä
 
PDF
Deep Learning for Autonomous Driving
Jan Wiegelmann
 
PDF
High Performance Engineering - 01-intro.pdf
ss63261
 
PDF
Maximize Impact: Learn from the Dual Pillars of Open-Source Energy Planning T...
IEA-ETSAP
 
PDF
What is Distributed Computing, Why we use Apache Spark
Andy Petrella
 
PPTX
Chap 2 classification of parralel architecture and introduction to parllel p...
Malobe Lottin Cyrille Marcel
 
PDF
The Analytics Frontier of the Hadoop Eco-System
inside-BigData.com
 
PDF
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
Ilkay Altintas, Ph.D.
 
PPT
Data Streaming in Big Data and Data mining in streaming
AMSERMAKANITeaching
 
PDF
From Pipelines to Refineries: Scaling Big Data Applications
Databricks
 
PDF
E05312426
IOSR-JEN
 
Azure Databricks for Data Scientists
Richard Garris
 
Scalable Machine Learning: The Role of Stratified Data Sharding
inside-BigData.com
 
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
Ioan Toma
 
Shikha fdp 62_14july2017
Dr. Shikha Mehta
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Anyscale
 
Apache con big data 2015 - Data Science from the trenches
Vinay Shukla
 
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Paige_Roberts
 
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Lucas Jellema
 
Large-scale Recommendation Systems on Just a PC
Aapo Kyrölä
 
Deep Learning for Autonomous Driving
Jan Wiegelmann
 
High Performance Engineering - 01-intro.pdf
ss63261
 
Maximize Impact: Learn from the Dual Pillars of Open-Source Energy Planning T...
IEA-ETSAP
 
What is Distributed Computing, Why we use Apache Spark
Andy Petrella
 
Chap 2 classification of parralel architecture and introduction to parllel p...
Malobe Lottin Cyrille Marcel
 
The Analytics Frontier of the Hadoop Eco-System
inside-BigData.com
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
Ilkay Altintas, Ph.D.
 
Data Streaming in Big Data and Data mining in streaming
AMSERMAKANITeaching
 
From Pipelines to Refineries: Scaling Big Data Applications
Databricks
 
E05312426
IOSR-JEN
 
Ad

Recently uploaded (20)

PDF
Practical Measurement Systems Analysis (Gage R&R) for design
Rob Schubert
 
PDF
Mastering Financial Analysis Materials.pdf
SalamiAbdullahi
 
PPTX
Presentation on animal welfare a good topic
kidscream385
 
PPTX
Fluvial_Civilizations_Presentation (1).pptx
alisslovemendoza7
 
PDF
An Uncut Conversation With Grok | PDF Document
Mike Hydes
 
PPTX
MR and reffffffvvvvvvvfversal_083605.pptx
manjeshjain
 
PPTX
Data-Users-in-Database-Management-Systems (1).pptx
dharmik832021
 
PDF
Fundamentals and Techniques of Biophysics and Molecular Biology (Pranav Kumar...
RohitKumar868624
 
PPT
Real Life Application of Set theory, Relations and Functions
manavparmar205
 
PDF
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
PDF
blockchain123456789012345678901234567890
tanvikhunt1003
 
PDF
Blitz Campinas - Dia 24 de maio - Piettro.pdf
fabigreek
 
PPTX
lecture 13 mind test academy it skills.pptx
ggesjmrasoolpark
 
PDF
TIC ACTIVIDAD 1geeeeeeeeeeeeeeeeeeeeeeeeeeeeeer3.pdf
Thais Ruiz
 
PPTX
Multiscale Segmentation of Survey Respondents: Seeing the Trees and the Fores...
Sione Palu
 
PDF
The_Future_of_Data_Analytics_by_CA_Suvidha_Chaplot_UPDATED.pdf
CA Suvidha Chaplot
 
PDF
Technical Writing Module-I Complete Notes.pdf
VedprakashArya13
 
PPTX
M1-T1.pptxM1-T1.pptxM1-T1.pptxM1-T1.pptx
teodoroferiarevanojr
 
PDF
Blue Futuristic Cyber Security Presentation.pdf
tanvikhunt1003
 
PDF
717629748-Databricks-Certified-Data-Engineer-Professional-Dumps-by-Ball-21-03...
pedelli41
 
Practical Measurement Systems Analysis (Gage R&R) for design
Rob Schubert
 
Mastering Financial Analysis Materials.pdf
SalamiAbdullahi
 
Presentation on animal welfare a good topic
kidscream385
 
Fluvial_Civilizations_Presentation (1).pptx
alisslovemendoza7
 
An Uncut Conversation With Grok | PDF Document
Mike Hydes
 
MR and reffffffvvvvvvvfversal_083605.pptx
manjeshjain
 
Data-Users-in-Database-Management-Systems (1).pptx
dharmik832021
 
Fundamentals and Techniques of Biophysics and Molecular Biology (Pranav Kumar...
RohitKumar868624
 
Real Life Application of Set theory, Relations and Functions
manavparmar205
 
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
blockchain123456789012345678901234567890
tanvikhunt1003
 
Blitz Campinas - Dia 24 de maio - Piettro.pdf
fabigreek
 
lecture 13 mind test academy it skills.pptx
ggesjmrasoolpark
 
TIC ACTIVIDAD 1geeeeeeeeeeeeeeeeeeeeeeeeeeeeeer3.pdf
Thais Ruiz
 
Multiscale Segmentation of Survey Respondents: Seeing the Trees and the Fores...
Sione Palu
 
The_Future_of_Data_Analytics_by_CA_Suvidha_Chaplot_UPDATED.pdf
CA Suvidha Chaplot
 
Technical Writing Module-I Complete Notes.pdf
VedprakashArya13
 
M1-T1.pptxM1-T1.pptxM1-T1.pptxM1-T1.pptx
teodoroferiarevanojr
 
Blue Futuristic Cyber Security Presentation.pdf
tanvikhunt1003
 
717629748-Databricks-Certified-Data-Engineer-Professional-Dumps-by-Ball-21-03...
pedelli41
 

Big learning 1.2

  • 1. Practical Aspects of Machine Learning on Big-Data platforms (Big Learning) Mohit Garg Research Engineer TXII, Airbus Group Innovations 25-05-2015
  • 2. Motivation & Agenda Motivation: Answer the questions – Why multiple ecosystems for scalable ML? – Is there a unified approach for a big- data platform for ML? – How much to catch up on? – How industry leaders are doing it? – Put things into perspective ! Agenda: To present – Quick brief on practical ML process – Current landscape of Open source tools – Evolutionary drivers with examples – Case studies – The Twitter Experience – * Share journey and observations (ongoing process) ML (Optimization, LA, Stats) scalabilityBig Data (Schema, workflow, architecture)
  • 3. Quick brief on ML process
  • 5. Quick brief - Process α β Train Tune (Cross Validate Data) Measure (Test Data) • Not applicable to all ML modeling techniques. Biologically-inspired algorithms are more of a paradigm (guidelines) rather than algorithm, and requires algorithm definition under those guidelines (GA, ACO, ANN). • Graph Source: http://www.astroml.org/sklearn_tutorial/practical.html Bias Vs Variance Learning Curve Data Sampling Algorithm Model EvaluationModel/Hypothesis
  • 6. Quick brief – Workflow Example Graph Source:
  • 7. Quick brief – WF breakdown k-means
  • 8. Quick brief – WF breakdown k-means Input Data (Points) Statement Block (assigning clusters) Termination condition (if no change) While (!termination_condition) meta input (new cluster centres) updates
  • 9. Quick brief – WF breakdown k-means Only after iteration is over Input Data (Points) Statement Block (assigning clusters) Termination condition (if no change) While (!termination_condition) meta input (new cluster centres) updates new cluster centres
  • 10. Quick brief – Pieces • An ML-algorithm is finally a computer algorithm • Complex design or blocks of – Blocks feeding each other. Output becoming Input – Iterations over entire data (Gradient descent for Linear, Logistic Regression, K-means etc) – Memory limitations – Algorithms as non-linear workflows. • Principles when operating on Large datasets – Minimize I/O – Don’t read/write disk again and again – Minimize Network transfer – Localize logic (non-additive?) to data – Use online-trainable algorithms- Optimized Parallel Algorithms – Ease of use –Abstraction - Package well for end-user
  • 11. Quick brief – then and now • Small Data – Static data • Big Data – Static Data – But cant run on single machine • Online Big Data – Integrated with ETL – Prediction and Learning together – Twitter case study α β Train Tune (Cross Validate Data) Measure (Test Data) α β Train Tune (Cross Validate Data) Measure (Test Data) Velocity α β
  • 13. Current Landscape – Open Source tools Stream Sybil
  • 14. Current Landscape – Open Source tools Data Complexity&completeness Stream Sibyl
  • 16. Current Landscape – Open Source tools Data Complexity&completeness Stream Sibyl
  • 18. Open Source tools – Landscape & Drivers Data Complexity&completeness Stream SybilIs there a wholesome solution?
  • 20. Quick Review: MapReduce + Hadoop • Bigger focus on – Large scaling on data – Scheduling and Concurrency control • Load balancing – Fault tolerance – Basically, to save tonnes of user’s efforts required in older frameworks like MPI. – The map and reduce can be ‘executables ’ in virtually any language (Streaming) – *Maps (& reducers) don’t interact ! • MapReduce exploits massively parallelizable problems, what about rest of them? – Simple case: Try finding median of integers(1-40 say) using MR? • Can we expect any algorithm to execute in with MR implementation with same time- bounds?
  • 21. Loose Integration • Set of components/APIs – exposing existing tools with Map-Reduce frameworks – to be compiled, optimized and deployed in – streaming or pipe mode with frameworks. • Hadoop/MapReduce bindings available for – R – Python (numpy, sci-kit) • Focus on – Accommodating existing user-base to leverage hadoop data storage – Easy & neat APIs for native users. – No real effort on ‘bridging the gap’
  • 23. Loose Integration – Pydoop Example • Uses Hadoop Pipes as underlying framework • Based on Cpython, so provide inclusion of sci-kit, num-py etc • Lets you define map and reduce logics • But, does not provide better representations of ML Algorithms
  • 26. Scientific - Efforts • Efforts comes in waves with breakthroughs • Efforts on – Accuracy bounds & Convergence – Execution time bounds • Recent efforts in tandem with Big Data – Distributable algorithms – Central Limit theorem (local logic with simpler aggregations) – Batch-to-Online model – ‘One pass mode’ (avoid iterations) • Example – Distributable algorithms - Ensemble Classification (eg Random forest), K-means++|| – Batch-to-online - (SGD) • Note – Power ‘inherently’ lies in Big Data – Simple algorithm with larger dataset outperforms complex algorithms with smaller dataset Image-2 Source: Andrew Ng – Coursera ML-08 O1 O2 ON Ǝ
  • 27. Logistic Classification • Sample: Yes/No kind of answers – Is this tweet a spam? – Will this tweeter log back in 48 hours? X1 X2 …… XN Y X11 . . X1N 0 . . . . 1 . . . . 1 . . . . 1 . . . . 0 XM1 . . XMN 0 X Y x1 x2 xN hӨ (x) Ө1 Ө2 ӨN Hypothesis hӨ (x) = 1 / ( 1 + e – ӨTx) Cost(x) hӨ (x) - y J =Cost(X) • Ө is unknown variable • Lets start with random value of Ө • Aim is to change Ө to minimize J
  • 28. Gradient Descent Iterations J =Cost(X, Ө) minimize using Gradient DescentVisualization in 2D
  • 29. Gradient Descent • Cost function requires all records. While (W does not change) { // Load data // find local losses // Aggregate local losses // Find gradient // Update W // Save W to disk } /* Multiple Passes */ J =Cost(X, Ө) M1 M2 MN Map – loads data Reduce – Calculates gradient and updates W R Saves W (intermediate) User code
  • 30. Stochastic Gradient Descent (SGD) • No need to get cost function for gradient calculation • Each iteration on a data point - xi • Gradient calculated using only xi • As good as performing SGD on single machine. Reducer – a serious bottleneck M1 M2 MN Map – loads data Reduce – Calculates gradient and updates W R Saves W (final) // Load data While (no samples left) { // Find gradient using xi // Update W } // Save W /* Single Pass */ User code Ref: Bottou, Léon (1998). "Online Algorithms and Stochastic Approximations". Online Learning and Neural Networks
  • 31. SGD - Distributed • Similar to SGD, but have multiple reducers • Data is thoroughly randomized • Multiple classifiers are learned together – ensemble classifiers • Bottleneck of single reducer (Network Data) resolved • Testing using standard aggregation over predictors’ results M1 M2 MN Map –load data Reduce – Calculates gradient and updates WjR1 W1 // Pre-process – Randomize // Load data While (no samples left) { // Find gradient using xi // Update Wj } /* Single Pass and distributed */ User code R2 W2 Ref: L. Bottou. Large-scale machine learning with stochastic gradient descent. COMPSTAT, 2010.
  • 33. Now1971 2020 Moore’s Law vs Kryder’s Law Source: Collective information from Wiki & its references “if hard drives were to continue to progress at their then-current pace of about 40% per year, then in 2020 a two-platter, 2.5- inch disk drive would store approximately 40 TB and cost about $40” - Kryder Moore’s II law : “As the cost of computer power to the consumer falls, the cost for producers to fulfill Moore's law follows an opposite trend: R&D, manufacturing, and test costs have increased steadily with each new generation of chips” GAP - Individual processors’ power growing at slower rate - Data Storage becomes easier & cheaper. - MORE data, LESS processors – and the gap is widening ! - Computer h/w architecture working at its pace to provide faster buses, RAM & augmented GPUs. Architectural – Forces
  • 34. 2012 VolumeinExabytes 15000 2017 Percentage of uncertain data Percentofuncertaindata We are here Sensors & Devices VoIP Enterprise Data Social Media 6000 9000 100 0 50 VeracityVolume Variety Architectural – Forces Source: IBM - Challenges and Opportunities with Big Data- Dr Hammou Messatfa
  • 35. Mahout with MapReduce • Key feature: Somewhat loose & somewhat tight integration – Among the earliest library to exploit batch-like scalable components online learning algorithms. – Some algorithms re-engineered for MapReduce, some not. – Performance hit for iterative algorithms. Huge I/O overhead – Each (global) iteration means Map-Reduce job :O – Integration of new scalable learners less active. • Industry acceptance – Accepted for scalable Recommender systems • Future – Mahout Samsara for scalable low-level Linear Algebra as scala & spark bindings Sybil
  • 36. Cascading
• Key feature: abstraction & packaging
– Lets you think of workflows as chains of MR jobs
– Pre-existing methods for reading and storing data
– Provides checkpoints in the workflow to save state
– JUnit support for test-case-driven s/w development
• Industry acceptance
– Scalding, from Twitter, provides Scala bindings for Cascading
– Used by Bixo for anti-spam classification
– Used to load data by Elasticsearch & Cassandra
– eBay leverages the Scalding design for distributed computing
  • 38. Pig – Quick Summary
 High-level dataflow language (Pig Latin)
 Much simpler than Java
 Simplifies data processing
 Puts the operations at the appropriate phases
 Chains multiple MR jobs – appropriate for ML workflows
 No need to take care of intermediate outputs
 Provides user-defined functions (UDFs) in Java, integrable with Pig
  • 39. Pig – Quick Summary
A = LOAD 'file1' AS (x, y, z);
B = LOAD 'file2' AS (t, u, v);
C = FILTER A BY y > 0;
D = JOIN C BY x, B BY u;
E = GROUP D BY z;
F = FOREACH E GENERATE group, COUNT(D);
STORE F INTO 'output';
(Logical plan: LOAD → FILTER / LOAD → JOIN → GROUP → FOREACH → STORE, compiled into two MapReduce jobs with LOCAL REARRANGE / PACKAGE / FOREACH stages)
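To make the dataflow of the Pig script on the slide concrete, here is the same pipeline re-expressed in plain Python over toy in-memory stand-ins for file1/file2; the sample tuples are invented for illustration.

```python
# The slide's Pig script as plain Python: filter, join, group, count.
from collections import defaultdict

file1 = [(1, 5, 'a'), (2, -3, 'b'), (3, 7, 'a')]  # A: (x, y, z)
file2 = [(9, 1, 'p'), (8, 3, 'q')]                # B: (t, u, v)

C = [(x, y, z) for (x, y, z) in file1 if y > 0]   # FILTER A BY y > 0
D = [(c, b) for c in C for b in file2
     if c[0] == b[1]]                             # JOIN C BY x, B BY u
E = defaultdict(list)                             # GROUP D BY z
for c, b in D:
    E[c[2]].append((c, b))
F = {z: len(rows) for z, rows in E.items()}       # FOREACH E GENERATE group, COUNT(D)
```

Pig's win is that you write only the seven declarative lines and it plans the shuffles (the JOIN and GROUP each imply a reduce phase) without you managing intermediate outputs.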
  • 40. ML-lib
 Part of the Apache Spark framework
 Data can come from HDFS, S3 or local files
 Encapsulates run-time data as Resilient Distributed Datasets (RDDs)
 RDDs are in-memory data pieces
 Fault tolerant – an RDD knows how to recreate itself if its resident node goes down
 No distinction between map and reduce: everything is just a task
 Bindings for R too – SparkR
 Real ingenuity in implementing new-generation algorithms (online & distributed)
 For example, it has three versions of k-means: Lloyd, k-means++ and k-means||
 Key feature: shared objects – tasks belonging to one node can share objects
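As a feel for the smarter initializers the slide mentions, here is a minimal single-machine sketch of k-means++ seeding; this is not MLlib's implementation (MLlib's k-means|| parallelizes exactly this sampling step), and the 1-D point set is a toy assumption.

```python
# k-means++ seeding: each new centre is sampled with probability
# proportional to its squared distance to the nearest centre chosen so far,
# which spreads the initial centres across the data.
import random

def kmeanspp_init(points, k, seed=0):
    rng = random.Random(seed)
    centres = [rng.choice(points)]
    while len(centres) < k:
        # Squared distance of every point to its nearest existing centre.
        d2 = [min((p - c) ** 2 for c in centres) for p in points]
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centres.append(p)
                break
        else:
            centres.append(points[-1])  # guard against float round-off
    return centres

# Two well-separated 1-D clusters: seeding should pick one centre from each.
pts = [0.0, 0.1, 0.2, 100.0, 100.1, 100.2]
centres = kmeanspp_init(pts, 2)
```

Plain Lloyd with random initialization can put both seeds in one cluster; the distance-weighted sampling above makes that overwhelmingly unlikely, which is why the variants are worth separate implementations.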
  • 42. Tez
 Apache Incubator project
 Fundamentally similar design principles to Spark’s
 Encapsulates run-time data as DAG nodes, much like RDDs
 Key features
 In-memory data
 Shared objects – tasks belonging to one node can share objects
 Very few comparative studies available
 Few contributions from the open community so far
  • 44. Distributed R
 Open-source project led by HP
 Similar to RHadoop but with some genuinely new features:
 User-defined array partitioning
 Local transformations/functions
 Master–worker synchronization
 Not (yet) the same ingenuity as seen in ML-lib
 Only fundamentally scalable algorithms (online & distributable) scale linearly
 Tall claims of 50–100x time efficiency when used with the HP Vertica database
  • 45. Sibyl
 Not open source yet, but some rumours!
 Claims to provide a GFS-based, highly scalable, flexible infrastructure for embedding the ML process in ETL
 Designed for supervised learning
 Focused on learning user behaviours
 YouTube video recommendations
 Spam filters
 Major design principle – columnar data
 Suitable for sparse datasets (new columns?)
 Compression techniques for columnar data are much more efficient (structural similarity)
  • 46. Columnar data – LZO compression
• Idea 1 – Compression should be ‘splittable’
– A large file can be compressed and split into pieces equal in size to an HDFS block
– Each block should hold its own ‘decompression key’
• Idea 2 – Compress data on Hadoop (save ~3/4 of the space)
– Save ~75% of I/O time, as long as decompression costs less than the I/O time it saves
Compression | Size (GB) | Compression time (s) | Decompression time (s)
None        | 8.0       | –                    | –
Gzip        | 1.3       | 241                  | 72
LZO         | 2.0       | 55                   | 35
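A back-of-the-envelope check of Idea 2, using the table's numbers plus an assumed sequential disk throughput of 100 MB/s (the slide gives no throughput figure, so that constant is ours).

```python
# Time to get the data into a usable form = disk-read time + decompression
# time, for each option in the slide's table. Throughput is an assumption.
DISK_MB_PER_S = 100.0
MB_PER_GB = 1024.0

def time_to_read(size_gb, decompress_s=0.0):
    """Seconds to pull the data off disk and decompress it."""
    return size_gb * MB_PER_GB / DISK_MB_PER_S + decompress_s

raw_s  = time_to_read(8.0)                     # uncompressed: pure I/O
gzip_s = time_to_read(1.3, decompress_s=72.0)  # smallest, but slow to expand
lzo_s  = time_to_read(2.0, decompress_s=35.0)  # splittable and fast
```

Under this assumed throughput LZO beats raw reads while gzip does not, which is exactly the slide's condition: decompression must cost less than the I/O time it saves (and LZO additionally stays splittable across HDFS blocks).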
  • 47. Conclusion
• Big Data has resurrected interest in ML algorithms.
• A two-way push is leading the confluence – online & distributed learning (scientific) and flexible workflows (architectural) to accommodate them.
• Facilitated by compression, serialization, in-memory computing, DAG representations, columnar databases, etc.
• The majority of engineering hours goes into building the pipelines.
• Industry is aiming to provide high-level abstractions over standard ML algorithms, hiding the gory details.
  • 48. Learning Resources
• MOOCs (Coursera)
– Machine Learning (Stanford)
– Design & Analysis of Algorithms (Stanford)
– R Programming Language (Johns Hopkins)
– Exploratory Data Analysis (Johns Hopkins)
• Online competitions
– Kaggle data-science platform
• Software resources
– Matlab, R, scikit-learn (Python)
– APIs – ANN, JGAP
• 2009 – “Subspace extracting Adaptive Cellular Network for layered Architectures with circular boundaries”, IEEE paper.
• 2006–07 – 1st prize, IBM’s Great Mind Challenge – “Transport Management System”, a multi-TSP implementation using a genetic algorithm.