Spark-Cassandra Integration 2016
DuyHai DOAN
Apache Cassandra Evangelist
@doanduyhai
Main use-cases
•  Load data from various sources
•  Analytics (join, aggregate, transform, …)
•  Sanitize, validate, normalize, transform data
•  Schema migration, data conversion
Data import
•  Read data from CSV and dump it into Cassandra?
☞ Spark job to distribute the import!
Use case: load data from various sources
Demo
Data cleaning
Use case: sanitize, validate, normalize, transform data
•  Bugs in your application?
•  Dirty input data?
☞ Spark job to clean it up!
Demo
Schema migration
•  Business requirements change over time?
•  Current data model no longer relevant?
☞ Spark job to migrate the data!
Use case: schema migration, data conversion
Demo
Analytics
Given existing tables of performers and albums, I want:
①  the top 10 most common music styles (pop, rock, RnB, …)
②  performer productivity (album count) by origin country and by decade
☞ Spark job to compute the analytics!
Use case: analytics (join, aggregate, transform, …)
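The two questions can be sketched in plain Python on a tiny made-up data set. The record layouts (`performers` with name, country and styles; `albums` with performer and year) are assumptions for illustration, not the schema used in the demo, and a real job would express the same group-by logic as Spark transformations over the Cassandra tables.

```python
from collections import Counter

# Hypothetical sample rows; field names are assumptions, not the demo schema.
performers = [
    {"name": "A", "country": "FR", "styles": ["pop", "rock"]},
    {"name": "B", "country": "US", "styles": ["rock"]},
    {"name": "C", "country": "US", "styles": ["RnB", "pop", "rock"]},
]
albums = [
    {"performer": "A", "year": 1985},
    {"performer": "A", "year": 1992},
    {"performer": "B", "year": 1991},
    {"performer": "C", "year": 1999},
    {"performer": "C", "year": 2003},
]


def top_styles(performers, n=10):
    """Count how many performers carry each style and keep the n most common."""
    counts = Counter(style for p in performers for style in p["styles"])
    return counts.most_common(n)


def albums_by_country_and_decade(performers, albums):
    """Join albums to performers by name, then count per (country, decade)."""
    country_of = {p["name"]: p["country"] for p in performers}
    counts = Counter()
    for album in albums:
        decade = album["year"] // 10 * 10
        counts[(country_of[album["performer"]], decade)] += 1
    return dict(counts)
```

The second function is a join followed by an aggregation, which is the part that forces a shuffle once the data is spread over several Spark workers.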
Connector Architecture
•  Cluster Deployment
•  Data Locality
•  Failure Handling
•  Cross DC/cluster operations
Cluster Deployment
•  Stand-alone cluster
[Diagram: every node runs Cassandra (C*) alongside a Spark worker (SparkW); one node also hosts the Spark master (SparkM)]
Data Locality – remember token ranges?
A: $\left]-x, -\frac{3x}{4}\right]$
B: $\left]-\frac{3x}{4}, -\frac{2x}{4}\right]$
C: $\left]-\frac{2x}{4}, -\frac{x}{4}\right]$
D: $\left]-\frac{x}{4}, 0\right]$
E: $\left]0, \frac{x}{4}\right]$
F: $\left]\frac{x}{4}, \frac{2x}{4}\right]$
G: $\left]\frac{2x}{4}, \frac{3x}{4}\right]$
H: $\left]\frac{3x}{4}, x\right]$
[Diagram: ring of 8 Cassandra (C*) nodes, one per token range A to H]
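As a minimal sketch of the split above, the ring ]−x, x] can be cut into eight equal ranges, each open on the left and closed on the right, and a token looked up by binary search on the range ends. The function names are hypothetical, and `x` is left as a parameter rather than tied to a specific partitioner.

```python
from bisect import bisect_left


def token_ranges(x, nodes="ABCDEFGH"):
    """Split ]-x, x] into len(nodes) equal ranges, returned as (node, start, end)."""
    step = 2 * x // len(nodes)
    return [(n, -x + i * step, -x + (i + 1) * step) for i, n in enumerate(nodes)]


def owner(token, ranges):
    """Return the node whose range ]start, end] contains the token."""
    low, high = ranges[0][1], ranges[-1][2]
    if not low < token <= high:
        raise ValueError("token outside ]-x, x]")
    ends = [end for _, _, end in ranges]
    return ranges[bisect_left(ends, token)][0]


ranges = token_ranges(4)  # x = 4, so every range has width x/4 = 1
```

With `x = 4`, token 0 falls in range D and token 4 in range H, matching the closed right-hand bounds listed above.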
Data Locality – how to
Spark RDD partitions ↔ Cassandra token ranges
[Diagram: Spark RDD partitions mapped onto Cassandra token ranges, on a cluster with C* and SparkW on every node and one SparkM]
Perfect data locality scenario
•  read locally from Cassandra
•  use operations that do not require a shuffle in Spark (map, filter, …)
•  repartitionByCassandraReplica()
→ to a table having the same partition key as the original table
•  save back into this Cassandra table
Use case: sanitize, validate, normalize, transform data
Failure Handling
What if 1 node is down?
What if 1 node is overloaded?
☞ The Spark master will re-assign the job to another worker
Oh no, my data locality!!!
Data Locality Impl
abstract class RDD[T](…) {
  @DeveloperApi
  def compute(split: Partition, context: TaskContext): Iterator[T]

  protected def getPartitions: Array[Partition]

  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}
CassandraRDD
getPreferredLocations(split: Partition) returns the IP addresses of the Cassandra replicas corresponding to this Spark partition
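The idea can be modelled in a few lines of Python. This is a simplified sketch, not the connector's code: it assumes one token range per Spark partition and a placement where the replicas of a range are its owning node plus the next RF−1 nodes clockwise on the ring. `TokenRangePartition`, `replica_ips` and the IP addresses are hypothetical names and values.

```python
def replica_ips(range_index, node_ips, rf):
    """Owning node plus the next rf-1 nodes clockwise on the ring (assumed placement)."""
    n = len(node_ips)
    return [node_ips[(range_index + k) % n] for k in range(min(rf, n))]


class TokenRangePartition:
    """Stand-in for a Spark partition that covers exactly one token range."""

    def __init__(self, index, range_index):
        self.index = index
        self.range_index = range_index


def get_preferred_locations(split, node_ips, rf):
    """Preferred locations of a partition = IPs of the replicas holding its data."""
    return replica_ips(split.range_index, node_ips, rf)


ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5"]
last = TokenRangePartition(index=4, range_index=4)
```

For the last range of a 5-node ring with RF = 2, the preferred locations wrap around to the first node, so the scheduler has two candidate workers that can read the data locally.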
Failure Handling
If RF > 1, the Spark master chooses the next preferred location, which is a replica 😎
Tune parameters:
•  spark.locality.wait
•  spark.locality.wait.process
•  spark.locality.wait.node
Only works for fixed token ranges (vnodes)
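A rough sketch of the fallback, under stated assumptions: `pick_worker` is a hypothetical helper that ignores timing, whereas Spark actually waits for the configured locality timeout before giving up on a preferred location. The second part only renders the three parameters as `spark-submit --conf` arguments; the `3s` values are Spark's documented default for `spark.locality.wait`, shown for illustration rather than as a recommendation.

```python
def pick_worker(preferred, alive):
    """Return (worker, locality level): the first alive preferred location,
    otherwise any alive worker, in which case data locality is lost."""
    if not alive:
        raise RuntimeError("no worker available")
    for ip in preferred:
        if ip in alive:
            return ip, "NODE_LOCAL"
    return sorted(alive)[0], "ANY"


# Illustrative values only.
locality_conf = {
    "spark.locality.wait": "3s",
    "spark.locality.wait.process": "3s",
    "spark.locality.wait.node": "3s",
}


def to_submit_args(conf):
    """Render a config dict as a flat list of spark-submit arguments."""
    return [arg for key in sorted(conf) for arg in ("--conf", f"{key}={conf[key]}")]
```

With RF = 2 and one replica down, `pick_worker` still returns a node-local worker; only when every replica is unavailable does it fall back to an arbitrary worker.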
Cross cluster/DC operations
Tales from the field, SASI index benchmark
•  Deployment automation
•  Parallel ingestion
•  Migrating data
•  Spark + Cassandra 3.4 SASI index for topK query
Deployment Automation
Use Ansible to bootstrap a cluster
•  role tools (install vim, htop, dstat, fio, jmxterm, …)
•  role Cassandra. Do not declare every node as a seed.
•  role Spark (vanilla Spark). Slave on all nodes, master on a random node
DO NOT START ALL CASSANDRA NODES AT THE SAME TIME!
•  bootstrap the seed nodes first
•  wait ≥ 30 s between two node bootstraps so the nodes can agree on token ranges
•  watch -n 5 nodetool status
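The start order can be written down as a small planning function. This is only a sketch of the rule above (seeds first, at least 30 seconds between starts, not every node a seed); `bootstrap_plan` and the host names are hypothetical, and the function computes a schedule without starting anything.

```python
def bootstrap_plan(nodes, seeds, gap_secs=30):
    """Return [(node, start offset in seconds)]: seeds first, then the other
    nodes, with gap_secs between two consecutive starts."""
    if not seeds or not set(seeds) <= set(nodes):
        raise ValueError("seeds must be a non-empty subset of nodes")
    if set(seeds) == set(nodes):
        raise ValueError("do not declare every node as a seed")
    ordered = [n for n in nodes if n in seeds] + [n for n in nodes if n not in seeds]
    return [(node, i * gap_secs) for i, node in enumerate(ordered)]


plan = bootstrap_plan(["n1", "n2", "n3", "n4"], seeds=["n3"])
```

An automation tool would then start each node at its offset and check `nodetool status` between steps.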
Parallel ingestion for SASI index benchmark
Hardware specs
•  13 nodes
•  6-core CPU (HT)
•  4 SSDs in RAID 0 😎
•  64 GB of RAM
Cassandra conf:
•  G1GC, 32 GB JVM heap
•  compaction throughput = 256 MB/s
•  concurrent compactors = 2
3.2 billion rows in 17 h (compaction disabled)
RF = 2
☞ ≈ 8000 inserts per second (ips)
I/O idle, high CPU
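The slide does not say how the ≈ 8000 figure is derived. One reading that reproduces it, offered as an assumption rather than the author's stated method, is replica writes per second per node: every row is written RF times and the load is spread over the 13 nodes.

```python
rows = 3_200_000_000   # rows ingested
hours = 17             # wall-clock duration
nodes = 13             # cluster size from the hardware specs
rf = 2                 # replication factor

# Rows accepted by the cluster as a whole, per second.
cluster_rows_per_sec = rows / (hours * 3600)

# Each row is stored rf times, and the replica writes are spread over all nodes.
per_node_writes_per_sec = cluster_rows_per_sec * rf / nodes

print(round(cluster_rows_per_sec), round(per_node_writes_per_sec))
```

This gives roughly 52 000 rows per second cluster-wide and roughly 8 000 replica writes per second per node.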
Migrating Data
TopK query
Pass 1, for each music provider:
•  sum album sales counts by title
•  take the top N and assign weights in descending order (1st = 1000, 2nd = 999, …)
Pass 2, over all albums retrieved in pass 1:
•  re-sum the sum(sales count) and the weight, grouped by title
•  order again by sum(sales count) in descending order
•  take the top N
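The two passes can be sketched in plain Python on in-memory data. The function names, the input shape (a dict of provider to `(title, count)` pairs) and the tie-break on title are assumptions made for the example; the real job runs the same steps as Spark aggregations.

```python
from collections import defaultdict


def top_n_with_weights(sales, n, max_weight=1000):
    """Pass 1 for one provider: sum counts by title, keep the top n,
    and attach weights max_weight, max_weight - 1, ... in rank order."""
    totals = defaultdict(int)
    for title, count in sales:
        totals[title] += count
    # Ties are broken by title so the result is deterministic (an assumption).
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    return [(title, count, max_weight - i) for i, (title, count) in enumerate(ranked)]


def top_k(sales_by_provider, n):
    """Pass 2: merge the per-provider results, re-sum counts and weights
    by title, order by summed count and keep the top n."""
    counts, weights = defaultdict(int), defaultdict(int)
    for sales in sales_by_provider.values():
        for title, count, weight in top_n_with_weights(sales, n):
            counts[title] += count
            weights[title] += weight
    ranked = sorted(counts, key=lambda t: (-counts[t], t))[:n]
    return [(t, counts[t], weights[t]) for t in ranked]


# Hypothetical sales data for two providers.
data = {
    "p1": [("X", 5), ("Y", 3), ("X", 2), ("Z", 1)],
    "p2": [("Y", 6), ("Z", 4), ("X", 1)],
}
```

Because pass 2 only sees each provider's top N, a title that ranks low with one provider loses that provider's sales in the final sum (here the single sale of "X" at `p2` when N = 2), which is the trade-off this scheme accepts.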
Target data set = 3.2 billion rows
•  minimum filter = 1 month (period_end_month = 201404, for example)
•  worst filter = 3-month range
•  +8 other dynamic filters (music provider, distribution type, …)
☞ SASI indices for filtering
☞ Spark for aggregation
TopK query results
3.2 billion rows in total
•  random distribution over 3 years (36 months) → ≈ 88 million rows/month

Filters | #rows | Duration | #rows/sec
3 months | 376 947 612 | 14 mins (840 secs) | 448 747
1 month | 94 239 127 | 6.1 mins (366 secs) | 257 483
1 month + 1 provider | 7 267 983 | 2.1 mins (126 secs) | 57 682
1 month + 1 provider + 1 country | 2 737 178 | 1.5 mins (90 secs) | 30 413
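The throughput column can be re-derived from the other two: it is the row count divided by the duration in seconds, rounded down. A short check, using only the figures from the table:

```python
# (filter, rows, duration in seconds, rows/sec as reported)
results = [
    ("3 months", 376_947_612, 840, 448_747),
    ("1 month", 94_239_127, 366, 257_483),
    ("1 month + 1 provider", 7_267_983, 126, 57_682),
    ("1 month + 1 provider + 1 country", 2_737_178, 90, 30_413),
]


def rows_per_sec(rows, secs):
    """Throughput as whole rows per second (integer division rounds down)."""
    return rows // secs


recomputed = [rows_per_sec(rows, secs) for _, rows, secs, _ in results]
reported = [r for _, _, _, r in results]
```

The narrower the filter, the lower the rows-per-second figure, even though the total duration keeps shrinking.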
Q & A
duy_hai.doan@datastax.com
https://academy.datastax.com/
Thank You
