Automated Machine Learning @[24]7
Sourabh Chaki
Anticipate, Simplify, Learn
A Proactive Framework for Intuitive Customer Care
• Event: a discrete action (for example, Product Page)
• Interaction: a group of discrete events
– Web (Visitor): Homepage, Product Page, Cart, Purchased
– Chat (Visitor): Chat Offered, Chat Accepted, Chat Started, Chat Ended
– Voice (Caller): Call Connected, Call Ended, Call Stats
Graph Problem
• Scattered interactions (Web, Voice, Chat)
• Finding connected components
• Result: Visitor 1 Journey, Visitor 2 Journey
Graph processing in Hadoop
• HashToMin for Connected Components
– http://arxiv.org/pdf/1203.5387.pdf
• About log(n) MapReduce iterations to connect a graph of diameter n
• Challenges:
– Expressing a graph problem in terms of map and reduce
– MapReduce is a poor fit for iterative algorithms
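The Hash-to-Min idea can be sketched on in-memory collections. This is a minimal illustration, not the talk's Hadoop job: each `round` call stands in for one MapReduce iteration, and the names `HashToMin`, `round`, and `run` are hypothetical. Every vertex sends its whole cluster to the cluster's minimum vertex and sends that minimum to every cluster member, until nothing changes.

```scala
object HashToMin {
  type Clusters = Map[Long, Set[Long]]

  // One iteration (one map + reduce pass in the MapReduce setting).
  // "Map": each cluster c emits (min(c) -> c) and (u -> {min(c)}) for every u in c.
  // "Reduce": each vertex takes the union of everything it received.
  def round(clusters: Clusters): Clusters = {
    val messages: Seq[(Long, Set[Long])] = clusters.toSeq.flatMap { case (_, c) =>
      val m = c.min
      (m -> c) +: c.toSeq.map(u => u -> Set(m))
    }
    messages.groupBy(_._1).map { case (v, msgs) => v -> msgs.flatMap(_._2).toSet }
  }

  // Returns (vertex -> smallest vertex id in its component, iterations used).
  def run(edges: Seq[(Long, Long)]): (Map[Long, Long], Int) = {
    // Initial cluster of a vertex: the vertex itself plus its direct neighbours.
    val init: Clusters = edges
      .flatMap { case (a, b) => Seq(a -> Set(a, b), b -> Set(a, b)) }
      .groupBy(_._1)
      .map { case (v, cs) => v -> cs.flatMap(_._2).toSet }

    @annotation.tailrec
    def loop(current: Clusters, rounds: Int): (Clusters, Int) = {
      val next = round(current)
      if (next == current) (current, rounds) else loop(next, rounds + 1)
    }

    val (converged, rounds) = loop(init, 0)
    (converged.map { case (v, c) => v -> c.min }, rounds)
  }
}
```

At convergence the minimum vertex of each component holds the full component, and every other vertex holds only that minimum, which serves as the component label.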
GraphX
• Spark is a good fit for iterative programming
• In-memory data
• Think in terms of edges and vertices
• High-level APIs for graph processing
Connected Components in GraphX
• Vertices >> Edges
– Build the graph from edges only
– Keep interactions in an external key-value pair (KVP) RDD
• val cc = graph.connectedComponents()
• Join cc.vertices with interactions
• Group interactions by component leader
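The join and group steps can be sketched without Spark. In this plain-Scala illustration a `Map` stands in for `cc.vertices` (vertex id to component leader) and a `Seq` of pairs stands in for the external interactions RDD; `JourneyGrouping` and `groupByLeader` are hypothetical names, and interactions are assumed to be keyed by vertex id.

```scala
object JourneyGrouping {
  // componentOf : vertex id -> leader (smallest vertex id in the component)
  // interactions: (vertex id, interaction payload), kept outside the graph
  def groupByLeader[I](componentOf: Map[Long, Long],
                       interactions: Seq[(Long, I)]): Map[Long, Seq[I]] =
    interactions
      // Join: look up each interaction's leader; drop interactions with no vertex.
      .flatMap { case (vertexId, interaction) =>
        componentOf.get(vertexId).map(leader => leader -> interaction)
      }
      // Group: all interactions that share a leader form one journey.
      .groupBy(_._1)
      .map { case (leader, pairs) => leader -> pairs.map(_._2) }
}
```

Each resulting entry is one visitor journey: the leader id plus every interaction attached to any vertex in that component.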
Join
• Broadcast join
• Hash Partition Join
• Adaptive
– Small data: Broadcast
– Large data: HashPartition
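The adaptive choice can be sketched on local collections. This is an illustration under stated assumptions rather than the talk's Spark code: `AdaptiveJoin`, `broadcastThreshold`, and `numPartitions` are hypothetical names, and the size check is a simple row count. The broadcast variant builds a lookup table from the small side; the hash-partition variant co-locates equal keys on both sides and joins partition by partition.

```scala
object AdaptiveJoin {
  type KV[V] = Seq[(Long, V)]

  // Broadcast-style join: the small side becomes a lookup table that every
  // row of the large side probes, so the large side is never moved.
  def broadcastJoin[A, B](large: KV[A], small: KV[B]): Seq[(Long, (A, B))] = {
    val lookup: Map[Long, Seq[B]] =
      small.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }
    for {
      (k, a) <- large
      b <- lookup.getOrElse(k, Seq.empty)
    } yield (k, (a, b))
  }

  // Hash-partition join: both sides are split by key hash so equal keys land
  // in the same partition, then each partition is joined on its own.
  def hashPartitionJoin[A, B](left: KV[A], right: KV[B],
                              numPartitions: Int): Seq[(Long, (A, B))] = {
    def partitionOf(k: Long): Int = java.lang.Math.floorMod(k.hashCode, numPartitions)
    val l = left.groupBy(kv => partitionOf(kv._1))
    val r = right.groupBy(kv => partitionOf(kv._1))
    (0 until numPartitions).flatMap { p =>
      broadcastJoin(l.getOrElse(p, Seq.empty), r.getOrElse(p, Seq.empty))
    }
  }

  // Adaptive: broadcast when the right side is small, otherwise hash-partition.
  def join[A, B](left: KV[A], right: KV[B],
                 broadcastThreshold: Int, numPartitions: Int): Seq[(Long, (A, B))] =
    if (right.size <= broadcastThreshold) broadcastJoin(left, right)
    else hashPartitionJoin(left, right, numPartitions)
}
```

Both strategies return the same pairs; only the data movement differs, which is why the choice can be made from size alone.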
Memory Share
• Default shuffle memory: 20%
– Too low for a shuffle-heavy application
• Tuned: shuffle memory 40%, storage memory 40%
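As a configuration fragment, the 40/40 split might look like the following in spark-defaults.conf. This assumes the legacy (pre-unified) memory manager of Spark 1.x, where `spark.shuffle.memoryFraction` defaults to 0.2 and `spark.storage.memoryFraction` defaults to 0.6; the slides do not name the properties, so treat the mapping as an assumption.

```properties
# Assumed mapping of the slide's percentages onto legacy Spark 1.x properties
spark.shuffle.memoryFraction   0.4
spark.storage.memoryFraction   0.4
```

From Spark 1.6 onward these two properties are only honoured when `spark.memory.useLegacyMode` is set to true.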
Performance Gain
[Chart: Hadoop vs. Spark (GraphX) run time in minutes (y-axis, 0 to 60) by graph diameter (x-axis: 2, 4, 8), 350M events]
Interactions to Journey
• Events (Web, Voice, Chat) => Connecting User Journeys => User Journey => Journey Views
User Journey to Model
• Journey View => Pre-processing and model building => Model
Feature Engineering
• Training Set/Test Set
• Balanced Sampling
• Frequent Items
• Quantile Bins
• Transformation Function
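Two of these steps, the train/test split and balanced sampling, can be sketched on local collections. The slides do not say how balancing is done; this sketch assumes simple downsampling of every class to the size of the smallest one, and the names `Sampling`, `trainTestSplit`, and `balance` are hypothetical.

```scala
import scala.util.Random

object Sampling {
  // Shuffle with a fixed seed, then cut at trainPct percent of the rows.
  def trainTestSplit[A](rows: Seq[A], trainPct: Int, seed: Long): (Seq[A], Seq[A]) = {
    val shuffled = new Random(seed).shuffle(rows)
    shuffled.splitAt(rows.size * trainPct / 100)
  }

  // Assumed balancing rule: keep a random subset of each class so that all
  // classes end up with as many rows as the rarest class.
  def balance[A](rows: Seq[A], label: A => Int, seed: Long): Seq[A] = {
    val rng = new Random(seed)
    val byLabel = rows.groupBy(label)
    val target = byLabel.values.map(_.size).min
    byLabel.values.flatMap(group => rng.shuffle(group).take(target)).toSeq
  }
}
```

For example, `Sampling.trainTestSplit(rows, 60, seed = 42L)` yields a 60/40 split, and `Sampling.balance` applied to a set with 90 negative and 10 positive rows keeps 10 of each.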
Feature Engineering Config
trainTest=60,40
#features
catVarIndices=1,2
contVarIndices=3,4
labelIndex=0
#transformation
catVar.1.bin=5
catVar.2.bin=4
contVar.3.bin=5
contVar.4.bin=6
#sampling
label.1.wt=10
label.2.wt=100
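A config in this shape can be read with `java.util.Properties`, which already treats `#` lines as comments. The sketch below is one possible reader, not the talk's implementation: `FeatureConfig` and its field names are hypothetical, and the meaning of the `label.N.wt` weights is not stated on the slide, so they are only parsed, not interpreted.

```scala
import java.io.StringReader
import java.util.Properties
import scala.jdk.CollectionConverters._

final case class FeatureConfig(
    trainPct: Int,
    testPct: Int,
    labelIndex: Int,
    catVarIndices: Seq[Int],
    contVarIndices: Seq[Int],
    catVarBins: Map[Int, Int],    // column index -> number of bins
    contVarBins: Map[Int, Int],   // column index -> number of bins
    labelWeights: Map[String, Int])

object FeatureConfig {
  private val CatBin  = """catVar\.(\d+)\.bin""".r
  private val ContBin = """contVar\.(\d+)\.bin""".r
  private val LabelWt = """label\.(.+)\.wt""".r

  private def ints(csv: String): Seq[Int] = csv.split(",").map(_.trim.toInt).toSeq

  def parse(text: String): FeatureConfig = {
    val props = new Properties()
    props.load(new StringReader(text))
    val kv = props.stringPropertyNames().asScala
      .map(k => k -> props.getProperty(k).trim).toMap
    val split = ints(kv("trainTest"))
    FeatureConfig(
      trainPct = split(0),
      testPct = split(1),
      labelIndex = kv("labelIndex").toInt,
      catVarIndices = ints(kv("catVarIndices")),
      contVarIndices = ints(kv("contVarIndices")),
      catVarBins = kv.collect { case (CatBin(i), v) => i.toInt -> v.toInt },
      contVarBins = kv.collect { case (ContBin(i), v) => i.toInt -> v.toInt },
      labelWeights = kv.collect { case (LabelWt(l), v) => l -> v.toInt })
  }

  // The config from the slide, used here as sample input.
  val slideConfig: String =
    """trainTest=60,40
      |#features
      |catVarIndices=1,2
      |contVarIndices=3,4
      |labelIndex=0
      |#transformation
      |catVar.1.bin=5
      |catVar.2.bin=4
      |contVar.3.bin=5
      |contVar.4.bin=6
      |#sampling
      |label.1.wt=10
      |label.2.wt=100
      |""".stripMargin
}
```

Parsing into a typed case class keeps the rest of the pipeline free of string lookups, so a new data set only needs a new config file.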
Column-wise data analysis
• Per-column statistics: Quantile, TopK Frequent
• Sum(set1 + set2) = Sum(set1) + Sum(set2)
• Quantile(set1 + set2) ≠ Quantile(set1) + Quantile(set2)
• TopKFrequent(set1 + set2) ≠ TopKFrequent(set1) + TopKFrequent(set2)
Problems
• Not distributed
• Needs a data shuffle
• Shuffles on different columns
• High disk and network I/O
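A small worked example shows why exact quantiles force a shuffle while sums do not. The two sequences stand in for two data partitions; the object name `Mergeable` and the choice of the lower median are illustrative assumptions.

```scala
object Mergeable {
  // Lower median: the element at index (n - 1) / 2 of the sorted values.
  def lowerMedian(xs: Seq[Double]): Double = xs.sorted.apply((xs.size - 1) / 2)

  val set1: Seq[Double] = Seq(1.0, 2.0, 3.0)
  val set2: Seq[Double] = Seq(10.0, 20.0, 30.0, 40.0, 50.0)

  // Sum merges exactly: per-partition results combine into the global result.
  val sumMerges: Boolean = (set1 ++ set2).sum == set1.sum + set2.sum

  // The exact median does not: it needs all values of the column in one place.
  val globalMedian: Double = lowerMedian(set1 ++ set2)
  val medianOfMedians: Double = lowerMedian(Seq(lowerMedian(set1), lowerMedian(set2)))
}
```

Here the per-partition medians are 2.0 and 30.0, while the median of the combined data is 10.0, so no combination of the two partial answers alone recovers it.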
Algebird with Spark
F(set1 + set2) ≈ F(set1) + F(set2)
• QTree
• CountMinSketch
https://github.com/twitter/algebird
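The mergeable property can be illustrated with a toy Count-Min sketch. This is a hand-written stand-in for explanation only, not Algebird's `CountMinSketch`; the class `CountMin` and its methods are hypothetical. Sketches built on separate partitions are merged by cell-wise addition, and the merged sketch equals the one built over all the data.

```scala
import scala.util.hashing.MurmurHash3

// depth rows of width counters; row r uses hash seed r.
final case class CountMin(depth: Int, width: Int, table: Vector[Vector[Long]]) {
  private def col(row: Int, item: String): Int =
    java.lang.Math.floorMod(MurmurHash3.stringHash(item, row), width)

  // Increment one counter per row.
  def add(item: String, count: Long = 1L): CountMin =
    copy(table = table.zipWithIndex.map { case (cells, r) =>
      val c = col(r, item)
      cells.updated(c, cells(c) + count)
    })

  // Estimated frequency: the minimum over rows; it never undercounts.
  def estimate(item: String): Long =
    table.zipWithIndex.map { case (cells, r) => cells(col(r, item)) }.min

  // Merge two sketches of the same shape by adding counters cell by cell.
  def merge(other: CountMin): CountMin = {
    require(depth == other.depth && width == other.width, "sketch shapes must match")
    copy(table = table.zip(other.table).map { case (a, b) =>
      a.zip(b).map { case (x, y) => x + y }
    })
  }
}

object CountMin {
  def empty(depth: Int, width: Int): CountMin =
    CountMin(depth, width, Vector.fill(depth, width)(0L))
}
```

Because `merge` is associative and needs only the fixed-size tables, each partition can be summarised locally and the summaries reduced, with no shuffle of the raw column values.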
Binning Product Price
Product price => low / high value product
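import com.twitter.algebird.{QTree, QTreeSemigroup}

// Semigroup that merges QTrees (approximate quantile sketches)
val aggregator = new QTreeSemigroup[Double](8)
val qtree = journeyViewRDD
  .map(row => QTree(row(productPriceIndex)))
  .reduce(aggregator.plus(_, _))
// quantileBounds returns (lower, upper); take the lower bound as the median
val median = qtree.quantileBounds(0.5)._1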
class TransformationFunc(val median: Double) extends Serializable {
  def transform(value: Double): String =
    if (value < median) "low" else "high"
}
val transFunc = new TransformationFunc(median)
val priceCat = transFunc.transform(productPrice)
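The median split is the two-bin case of quantile binning. A sketch of the general case follows, assuming the cut points are computed exactly from a local sample rather than from a QTree; `QuantileBinner` and `fit` are hypothetical names.

```scala
// Serializable so the fitted function can be shipped with the model.
final class QuantileBinner(val cutPoints: Vector[Double]) extends Serializable {
  // Bin index = number of cut points that are <= value (0 until bins).
  def transform(value: Double): Int = cutPoints.count(_ <= value)
}

object QuantileBinner {
  // Choose bins - 1 cut points at equally spaced ranks of the sorted sample.
  def fit(values: Seq[Double], bins: Int): QuantileBinner = {
    require(bins >= 2 && values.nonEmpty, "need at least two bins and one value")
    val sorted = values.sorted.toVector
    val cuts = (1 until bins).map { i =>
      val rank = (i.toLong * sorted.size / bins).toInt
      sorted(math.min(rank, sorted.size - 1))
    }.toVector
    new QuantileBinner(cuts)
  }
}
```

With a setting such as `contVar.3.bin=5`, column 3 would be fitted with `bins = 5` and every value replaced by its bin index before training.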
Automated Feature Engineering
• Journey View + Feature Engineering Configs => Data Set and Transformation Function
Model Building
• Spark MLlib
• Random Forest for classification
• Model Testing
• Performance Metrics
Model Config
#model
model=randomforest
randomforest.algo=gini
randomforest.bin=20
randomforest.classes=2
randomforest.depth=10
randomforest.numTrees=10
randomforest.featureSubsetStrategy=auto
#model metric
modelMetric=ROC
Automated Model Building
• Data Set and Transformation Function + Model Configs => Model Building => Multiple Models => Model Testing => Best Model
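Picking the best of several models by the ROC metric can be sketched as follows. This is an illustration, not the talk's code: models are represented as plain scoring functions, `ModelSelection` is a hypothetical name, and the area under the ROC curve is computed by direct pairwise comparison, which is quadratic and only suitable for small test sets.

```scala
object ModelSelection {
  // AUC = probability that a random positive outscores a random negative,
  // counting ties as one half. Input: (score, label) with label 0 or 1.
  def auc(scored: Seq[(Double, Int)]): Double = {
    val pos = scored.collect { case (s, 1) => s }
    val neg = scored.collect { case (s, 0) => s }
    require(pos.nonEmpty && neg.nonEmpty, "need both classes in the test set")
    val wins = for (p <- pos; n <- neg)
      yield if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    wins.sum / (pos.size.toLong * neg.size)
  }

  // Score every candidate on the same test set and keep the highest AUC.
  def best(candidates: Map[String, Vector[Double] => Double],
           testSet: Seq[(Vector[Double], Int)]): (String, Double) =
    candidates
      .map { case (name, score) =>
        name -> auc(testSet.map { case (x, y) => score(x) -> y })
      }
      .maxBy(_._2)
}
```

In the automated setting, each model config would produce one candidate, and only the winner is passed on to the prediction stage.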
Prediction Entity
• Transformation Function Object
• Random Forest Model Object
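import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.RandomForestModel

class PredictionEntity(model: RandomForestModel,
                       transformationFunc: TransformationFunc)
    extends Serializable {
  def predict(vector: Vector): Double = {
    // transform the vector using transformationFunc
    // predict using model
    ???  // body not shown on the slide
  }
}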
Prediction outside Spark
• Need export
– No PMML support
• Model store
• Synchronous prediction call
• Model Store: wrapper over MySQL, holding the Prediction Entity as a serialized byte array
• Prediction: application server serving Web, Voice, and Chat
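The byte-array round trip can be sketched with standard Java serialization. The slides do not show the model-store wrapper, so this only covers turning a `Serializable` entity into bytes and back; `ModelBytes` and `ThresholdEntity` are hypothetical names, and `ThresholdEntity` is a stand-in for the real Prediction Entity.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object ModelBytes {
  // Serialize any Serializable object into a byte array (for example, for a BLOB column).
  def toBytes(entity: java.io.Serializable): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(entity) finally out.close()
    buffer.toByteArray
  }

  // Rebuild the object on the application server; the class must be on its classpath.
  def fromBytes[A](bytes: Array[Byte]): A = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[A] finally in.close()
  }
}

// Hypothetical stand-in for the slide's PredictionEntity.
final class ThresholdEntity(val threshold: Double) extends Serializable {
  def predict(value: Double): Double = if (value < threshold) 0.0 else 1.0
}
```

Because the transformation function and the model travel inside one object, the application server can answer a synchronous prediction call without a Spark context.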
Machine Learning Cycle
• Web, Voice, Chat => Event Stream => GraphX => Journey View => MLlib => Model => Model Store => Prediction Server
• Configurations
Thank You

Automated Machine Learning Using Spark MLlib to Improve Customer Experience (Sourabh Chaki, [24]7)