Spark-Cassandra Integration 2016
DuyHai DOAN
Apache Cassandra Evangelist

@doanduyhai
Main use-cases
•  Load data from various sources
•  Analytics (join, aggregate, transform, …)
•  Sanitize, validate, normalize, transform data
•  Schema migration, data conversion

Data import
•  Read data from CSV and dump into Cassandra ?
☞ Spark Job to distribute the import !

Data cleaning
•  Bugs in your application ?
•  Dirty input data ?
☞ Spark Job to clean it up!

Schema migration
•  Business requirements change with time ?
•  Current data model no longer relevant ?
☞ Spark Job to migrate data !

Analytics
Given existing tables of performers and albums, I want:
①  the top 10 most common music styles (pop, rock, RnB, …)?
②  performer productivity (album count) by origin country and by decade?
☞ Spark job to compute analytics!

Connector Architecture
•  Cluster Deployment
•  Data Locality
•  Failure Handling
•  Cross DC/cluster operations

Cluster Deployment
•  Stand-alone cluster
[Figure: 5 nodes, each running Cassandra (C*) and a Spark worker (SparkW); one node also runs the Spark master (SparkM)]

Data Locality – remember token ranges ?
A: ]−x, −3x/4]
B: ]−3x/4, −2x/4]
C: ]−2x/4, −x/4]
D: ]−x/4, 0]
E: ]0, x/4]
F: ]x/4, 2x/4]
G: ]2x/4, 3x/4]
H: ]3x/4, x]
[Figure: ring of 8 Cassandra (C*) nodes, one per token range A to H]
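The eight ranges above split the ring ]−x, x] into equal half-open slices. A minimal sketch of that split, taking x = 8 purely for readability; `split_ring` and `owner` are hypothetical helpers written for this illustration, not part of Cassandra or the connector.

```python
from string import ascii_uppercase

def split_ring(x, node_count):
    """Split the ring ]-x, x] into node_count equal half-open ranges ]start, end].

    Hypothetical helper for illustration; assumes 2 * x is divisible by node_count.
    """
    width = (2 * x) // node_count
    ranges = {}
    for i in range(node_count):
        start = -x + i * width
        ranges[ascii_uppercase[i]] = (start, start + width)
    return ranges

def owner(token, ranges):
    """Return the label of the range ]start, end] that contains token."""
    for label, (start, end) in ranges.items():
        if start < token <= end:
            return label
    raise ValueError(f"token {token} is outside the ring")

ranges = split_ring(x=8, node_count=8)
print(ranges["A"])       # (-8, -6), i.e. ]-x, -3x/4]
print(owner(0, ranges))  # D, because ]-x/4, 0] includes its upper bound
```

Each range excludes its lower bound and includes its upper bound, so every token belongs to exactly one range.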

Data Locality – how to
[Figure: Spark RDD partitions shown alongside Cassandra token ranges on the same 5-node cluster (C* and SparkW on every node, SparkM on one)]

Perfect data locality scenario
•  read locally from Cassandra
•  use operations that do not require shuffle in Spark (map, filter, …)
•  repartitionByCassandraReplica()
   → to a table having the same partition key as the original table
•  save back into this Cassandra table

Failure Handling
[Figure: the same 5-node cluster (C* and SparkW on every node, SparkM on one)]
What if 1 node is down?
What if 1 node is overloaded?
☞ Spark will re-assign the task to another worker

Failure Handling
Oh no, my data locality !!!

Data Locality Impl
abstract class RDD[T](…) {
  @DeveloperApi
  def compute(split: Partition, context: TaskContext): Iterator[T]
  protected def getPartitions: Array[Partition]
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}

CassandraRDD
def getPreferredLocations(split: Partition): IP addresses of the Cassandra replicas corresponding to this Spark partition
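To make the contract concrete, here is a minimal Python sketch of the idea, not the connector's actual code. `TokenRangePartition`, the replica map and the IP addresses are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class TokenRangePartition:
    """Hypothetical stand-in for a Spark partition backed by one token range."""
    index: int
    token_range: Tuple[int, int]  # half-open range ]start, end]

def get_preferred_locations(
    split: TokenRangePartition,
    replicas_by_range: Dict[Tuple[int, int], List[str]],
) -> List[str]:
    """Return the replica IPs for the partition's token range.

    An empty list mirrors the Nil default of RDD.getPreferredLocations:
    no locality preference.
    """
    return replicas_by_range.get(split.token_range, [])

# Made-up topology: RF = 2, so each range has two replicas.
replicas_by_range = {
    (-8, -6): ["10.0.0.1", "10.0.0.2"],
    (-6, -4): ["10.0.0.2", "10.0.0.3"],
}
split = TokenRangePartition(index=0, token_range=(-8, -6))
print(get_preferred_locations(split, replicas_by_range))  # ['10.0.0.1', '10.0.0.2']
```

The scheduler only receives host names or addresses; it is up to the RDD implementation to know which hosts hold the data.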

Failure Handling
If RF > 1 the Spark scheduler chooses the next preferred location, which is a replica 😎
Tune parameters:
•  spark.locality.wait
•  spark.locality.wait.process
•  spark.locality.wait.node
Only works for fixed token ranges (vnodes)
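A rough sketch of the fallback, assuming a simplified model in which the scheduler just walks the preferred locations in order. `pick_location` is a hypothetical helper; the real scheduler also waits up to the `spark.locality.wait*` durations before dropping to a less local level, which this sketch leaves out.

```python
from typing import Iterable, Optional, Set

def pick_location(preferred: Iterable[str], available: Set[str]) -> Optional[str]:
    """Return the first preferred host that is available, else None.

    None stands for "no local host": the task would then run on any worker
    and read its data over the network.
    """
    for host in preferred:
        if host in available:
            return host
    return None

replicas = ["10.0.0.1", "10.0.0.2"]             # RF = 2, made-up addresses
workers_up = {"10.0.0.2", "10.0.0.3"}           # 10.0.0.1 is down
print(pick_location(replicas, workers_up))      # 10.0.0.2, still a replica
print(pick_location(["10.0.0.1"], workers_up))  # None: with RF = 1 locality is lost
```

With RF = 1 there is no second replica to fall back to, which is why the slide conditions the benefit on RF > 1.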

Cross cluster/DC operations

Tales from the field, SASI index benchmark
•  Deployment automation
•  Parallel ingestion
•  Migrating data
•  Spark + Cassandra 3.4 SASI index for topK query

Deployment Automation
Use Ansible to bootstrap a cluster
•  role tools (install vim, htop, dstat, fio, jmxterm, …)
•  role Cassandra. Do not make every node a seed …
•  role Spark (vanilla Spark). Slave on all nodes, master on a random node
DO NOT START ALL CASSANDRA NODES AT THE SAME TIME!!!!
•  bootstrap the seed nodes first
•  wait ≥ 30 secs between 2 node bootstraps for token range agreement
•  watch -n 5 nodetool status
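The start-up order above can be sketched as a small driver loop. This is an illustration in Python rather than an Ansible playbook; `rolling_start` and its `start_node` hook are hypothetical, and the node names are made up.

```python
import time
from typing import Callable, List

def rolling_start(
    seeds: List[str],
    others: List[str],
    start_node: Callable[[str], None],
    pause_secs: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Start seed nodes first, then the rest, pausing between consecutive starts.

    start_node is a caller-supplied hook (for example an SSH or Ansible call);
    it is hypothetical here. Returns the order in which nodes were started.
    """
    order = seeds + [n for n in others if n not in seeds]
    for i, node in enumerate(order):
        if i > 0:
            sleep(pause_secs)  # leave time for token range agreement
        start_node(node)
    return order

# Dry run with fake hooks: record actions instead of touching real machines.
log = []
order = rolling_start(
    seeds=["node1", "node2"],
    others=["node3", "node4"],
    start_node=lambda n: log.append(f"start {n}"),
    sleep=lambda s: log.append(f"wait {s:g}s"),
)
print(log)
```

Injecting `sleep` keeps the dry run instantaneous; in a real run the default `time.sleep` enforces the pause.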

Parallel ingestion for SASI index benchmark
Hardware specs
•  13 nodes
•  6-core CPU (HT)
•  4 SSDs in RAID 0 😎
•  64 GB of RAM
Cassandra conf:
•  G1GC, 32 GB JVM heap
•  compaction throughput = 256 MB/s
•  concurrent compactors = 2

3.2 billion rows in 17h
(compaction disabled)
RF = 2
☞ ≈ 8000 ips
I/O idle, high CPU
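One way to read the ≈ 8000 figure is as replica writes per second per node. That reading is an assumption, not something the slide states: it takes "ips" to mean inserts per second, counts each row RF times, and spreads the load evenly over the 13 nodes from the hardware specs.

```python
rows = 3_200_000_000
hours = 17
replication_factor = 2
nodes = 13  # from the hardware specs; even load across nodes is assumed

seconds = hours * 3600
cluster_rows_per_sec = rows / seconds
replica_writes_per_node = cluster_rows_per_sec * replication_factor / nodes

print(round(cluster_rows_per_sec))     # 52288
print(round(replica_writes_per_node))  # 8044
```

Under those assumptions the arithmetic lands close to the quoted number.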

TopK query
Pass 1, for each music provider
•  sum album sales count by title
•  take the top N, assign weights in descending order (1st = 1000, 2nd = 999, …)
Pass 2, retrieve all albums from pass 1
•  re-sum sum(sales count) and weight, grouped by title
•  order again by sum(sales count) in descending order
•  take the top N
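The two passes can be sketched in plain Python on a tiny in-memory data set. This is one interpretation of the steps listed above, not the Spark job used in the benchmark; `top_k` and the sample sales are made up for illustration.

```python
from collections import defaultdict

def top_k(sales, n, top_weight=1000):
    """Two-pass top-N over (provider, title, sales_count) tuples.

    Returns a list of (title, total_sales, total_weight), best seller first.
    """
    # Pass 1: per provider, sum sales by title and keep the top n with weights.
    per_provider = defaultdict(lambda: defaultdict(int))
    for provider, title, count in sales:
        per_provider[provider][title] += count

    candidates = []  # (title, summed sales, weight) from every provider
    for totals in per_provider.values():
        ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
        for rank, (title, total) in enumerate(ranked):
            candidates.append((title, total, top_weight - rank))

    # Pass 2: re-sum sales and weights by title, order by sales, take the top n.
    merged = defaultdict(lambda: [0, 0])
    for title, total, weight in candidates:
        merged[title][0] += total
        merged[title][1] += weight
    ordered = sorted(merged.items(), key=lambda kv: kv[1][0], reverse=True)[:n]
    return [(title, total, weight) for title, (total, weight) in ordered]

sales = [
    ("p1", "Album A", 50), ("p1", "Album B", 30), ("p1", "Album A", 20),
    ("p2", "Album B", 60), ("p2", "Album C", 10),
]
print(top_k(sales, n=2))  # [('Album B', 90, 1999), ('Album A', 70, 1000)]
```

Album B is only second for provider p1 but wins overall once both providers' sums are merged, which is why the second pass is needed.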

TopK query
Target data set = 3.2 billion rows
•  minimum filter = 1 month (e.g. period_end_month = 201404)
•  worst-case filter = 3-month range
•  +8 other dynamic filters (music provider, distribution type, …)
☞ SASI indices for filtering
☞ Spark for aggregation

TopK query results
3.2 billion rows in total
•  random distribution over 3 years (36 months) → 88 million rows/month
Filters                          | #rows       | Duration            | #rows/sec
3 months                         | 376 947 612 | 14 mins (840 secs)  | 448 747
1 month                          | 94 239 127  | 6.1 mins (366 secs) | 257 483
1 month + 1 provider             | 7 267 983   | 2.1 mins (126 secs) | 57 682
1 month + 1 provider + 1 country | 2 737 178   | 1.5 mins (90 secs)  | 30 413
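The #rows/sec column can be recomputed from the row counts and durations in the table. The sketch below assumes the published figures are truncated rather than rounded, which is inferred from the 1-month row (94 239 127 / 366 ≈ 257 483.95, reported as 257 483).

```python
results = [
    ("3 months", 376_947_612, 840),
    ("1 month", 94_239_127, 366),
    ("1 month + 1 provider", 7_267_983, 126),
    ("1 month + 1 provider + 1 country", 2_737_178, 90),
]

def rows_per_sec(rows, secs):
    """Throughput truncated to whole rows per second, as in the table."""
    return rows // secs

throughput = {label: rows_per_sec(rows, secs) for label, rows, secs in results}
for label, value in throughput.items():
    print(f"{label}: {value:,} rows/sec")

# Average monthly volume for 3.2 billion rows spread over 36 months.
print(f"{3_200_000_000 / 36 / 1e6:.1f} million rows/month")  # 88.9 million rows/month
```

Throughput drops as the filters become more selective: the query returns fewer rows, but the duration shrinks much more slowly than the row count.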

Thank You
@doanduyhai
duy_hai.doan@datastax.com
https://academy.datastax.com/
