@chbatey
Christopher Batey

Technical Evangelist for Apache Cassandra
Time series analysis with Spark and Cassandra
Who am I?
• Technical Evangelist for Apache Cassandra
• Founder of Stubbed Cassandra
• Help out Apache Cassandra users
• DataStax
• Builds an enterprise-ready version of Apache Cassandra
• Previous: Cassandra-backed apps at BSkyB
Agenda
• Motivation
• Cassandra
• Replication
• Fault tolerance
• Data modelling
• Spark
• Use cases
• Stream processing
• Time series example: Weather station data
Weather data streaming
[Diagram labels: OLTP, OLAP, Batch; Incoming weather events; Apache Kafka (Producer, Consumer); NodeGuardian; Dashboard]
Run this yourself
• https://github.com/killrweather/killrweather
Cassandra
Cassandra for Applications
Common use cases
• Ordered data such as time series
- Event stores
- Financial transactions
- IoT, e.g. sensor data
• Non-functional requirements:
- Linear scalability
- High-throughput durable writes
- Multi-datacenter, including active-active
- Analytics without ETL
Cassandra
• Distributed masterless database (Dynamo)
• Column family data model (Google BigTable)
Datacenter and rack aware
• Multi-datacenter replication built in from the start
[Diagram: datacenters labelled Europe and USA]
• Analytics with Apache Spark
[Diagram labels: Online, Analytics]
Dynamo 101
• The parts Cassandra took
- Consistent hashing
- Replication
- Gossip
- Hinted handoff
- Anti-entropy repair
• And the parts it left behind
- Key/Value
- Vector clocks
Picking the right nodes
• You don’t want a full table scan on a 1000 node cluster!
• Dynamo to the rescue: Consistent Hashing
Murmur3 Example
• Data:
Primary key  Columns
jim          age: 36, car: ford, gender: M
carol        age: 37, car: bmw, gender: F
johnny       age: 12, gender: M
suzy         age: 10, gender: F
• Murmur3 hash values (simplified for the example):
Primary key  Murmur3 hash value
jim          350
carol        998
johnny       50
suzy         600
Real hash range: -9223372036854775808 to 9223372036854775807
Murmur3 Example
Four node cluster:
Node  Murmur3 start range  Murmur3 end range
A     0                    249
B     250                  499
C     500                  749
D     750                  999
Pictures are better
[Diagram: token ring with nodes A (0 to 249), B (250 to 499), C (500 to 749) and D (750 to 999)]
Murmur3 Example
Data is distributed as:
Node  Start range  End range  Primary key  Hash value
A     0            249        johnny       50
B     250          499        jim          350
C     500          749        suzy         600
D     750          999        carol        998
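The lookup in the table above can be sketched in plain Scala. This is only an illustration of the toy 0 to 999 ring from the slides: the object name `ToyRing`, the helper `ownerOf` and the hard-coded hash values are assumptions made for the example, and real Cassandra hashes keys with Murmur3 over the full signed 64-bit range.

```scala
// Toy token ring from the slides: tokens 0 to 999 split across four nodes.
// Hypothetical names; not Cassandra's own API.
object ToyRing {
  // (node, first token owned, last token owned)
  val ranges: Seq[(String, Int, Int)] = Seq(
    ("A", 0, 249),
    ("B", 250, 499),
    ("C", 500, 749),
    ("D", 750, 999)
  )

  // Illustrative hash values copied from the slide, not real Murmur3 output
  val slideHash: Map[String, Int] =
    Map("jim" -> 350, "carol" -> 998, "johnny" -> 50, "suzy" -> 600)

  // Find the node whose range contains the token, if any
  def ownerOf(token: Int): Option[String] =
    ranges.collectFirst {
      case (node, start, end) if token >= start && token <= end => node
    }

  def main(args: Array[String]): Unit =
    slideHash.toSeq.sortBy(_._2).foreach { case (key, token) =>
      println(s"$key (token $token) is owned by node ${ownerOf(token).getOrElse("none")}")
    }
}
```

Because the owner is derived from the key alone, a read for one primary key can go straight to the owning node instead of scanning the cluster.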
Replication
Replication strategy
• Simple
- Give it to the next node in the ring
- Don’t use this in production
• NetworkTopology
- Every Cassandra node knows its DC and Rack
- Replicas won’t be put on the same rack unless Replication Factor > # of racks
- Unfortunately Cassandra can’t create servers and racks on the fly to fix this :(
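A minimal sketch of the Simple strategy, assuming the four node ring A, B, C, D from the earlier example: the node that owns the token takes the first replica and the following nodes on the ring take the rest. `SimpleReplicas` and `replicasFor` are hypothetical names, and the sketch ignores datacenters and racks, which is exactly what NetworkTopology adds.

```scala
// Sketch of "give it to the next node in the ring" replica placement.
// Hypothetical helper, not Cassandra code.
object SimpleReplicas {
  val ring: Vector[String] = Vector("A", "B", "C", "D") // nodes in ring order

  // The owning node plus the next (rf - 1) nodes clockwise, wrapping around
  def replicasFor(primary: String, rf: Int): Seq[String] = {
    val start = ring.indexOf(primary)
    require(start >= 0, s"unknown node $primary")
    require(rf >= 1 && rf <= ring.size, "rf must be between 1 and the ring size")
    (0 until rf).map(i => ring((start + i) % ring.size))
  }

  def main(args: Array[String]): Unit =
    ring.foreach(node => println(s"$node -> ${replicasFor(node, 3).mkString(", ")}"))
}
```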
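Replication
[Diagram: a client writes at CL = 1 through a coordinator (C); DC1 and DC2 each hold RF 3 replicas, with a remote coordinator (RC) in the second datacenter. Caption: We have replication!]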
Tunable Consistency
• Data is replicated N times
• Every query you execute is given a consistency level
- ALL
- QUORUM
- LOCAL_QUORUM
- ONE
• Christos Kalantzis, Eventual Consistency != Hopeful Consistency: http://youtu.be/A6qzx_HE3EU?list=PLqcm6qE9lgKJzVvwHprow9h7KMpb5hcUU
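The arithmetic behind these levels is small enough to sketch. A quorum is a majority of the replicas, and a read is guaranteed to touch at least one replica that acknowledged the latest write when the read and write replica counts add up to more than the replication factor. The object and function names below are made up for the illustration.

```scala
// Sketch of the replica-count arithmetic behind tunable consistency.
// Hypothetical helper names; the formulas are the standard majority rules.
object Consistency {
  // Replicas that must respond for QUORUM at replication factor rf
  def quorum(rf: Int): Int = rf / 2 + 1

  // True when any read set must intersect any write set
  def overlaps(readReplicas: Int, writeReplicas: Int, rf: Int): Boolean =
    readReplicas + writeReplicas > rf

  def main(args: Array[String]): Unit = {
    val rf = 3
    println(s"QUORUM at RF $rf needs ${quorum(rf)} replicas")
    println(s"QUORUM reads + QUORUM writes overlap: ${overlaps(quorum(rf), quorum(rf), rf)}")
    println(s"ONE reads + ONE writes overlap: ${overlaps(1, 1, rf)}")
  }
}
```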
Scaling shouldn’t be hard
• Throw more nodes at a cluster
• Bootstrapping + joining the ring
• For large data sets this can take some time
Spark Time
Scalability & Performance
• Scalability
- No single point of failure
- No special nodes that become the bottleneck
- Work/data can be re-distributed
• Operational performance, i.e. single-digit ms
- Single node for query
- Single disk seek per query
But but…
• Sometimes you don’t need answers in milliseconds
• Reports / analysis
• Data models done wrong - how do I fix it?
• New requirements for old data?
• Ad-hoc operational queries
• Managers always want counts / maxes
Apache Spark
• Up to 10x faster on disk and 100x faster in memory than Hadoop MapReduce
• Works out of the box on EMR
• Fault tolerant distributed datasets
• Batch, iterative and streaming analysis
• In memory storage and disk
• Integrates with most file and storage options
Part of most Big Data Platforms
• All major Hadoop distributions include Spark
• Spark is also integrated with non-Hadoop big data platforms like DSE
• Spark applications can be written once and deployed anywhere
[Diagram: Deploy Spark Apps Anywhere. Labels: Core, SQL, Streaming, Machine Learning, Graph, Analytic, Search]
Components
[Diagram: Spark (general execution engine) at the base, with Shark or Spark SQL, Streaming, ML and Graph on top; labelled Cassandra compatible]
org.apache.spark.rdd.RDD
• Resilient Distributed Dataset (RDD)
• Created through transformations on data (map,filter..) or other RDDs
• Immutable
• Partitioned
• Reusable
RDD Operations
• Transformations - similar to the Scala collections API
- Produce new RDDs
- filter, flatMap, map, distinct, groupBy, union, zip, reduceByKey, subtract
• Actions
- Require materialization of the records to generate a value
- collect: Array[T], count, fold, reduce, ...
Word count
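import org.apache.spark.rdd.RDD

// sc is an existing SparkContext
val file: RDD[String] = sc.textFile("hdfs://...")

val counts: RDD[(String, Int)] = file
  .flatMap(line => line.split(" "))  // split each line into words
  .map(word => (word, 1))            // pair each word with a count of 1
  .reduceByKey(_ + _)                // sum the counts per word

counts.saveAsTextFile("hdfs://...")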
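Since RDD transformations are described as similar to the Scala collections API, the same pipeline can be tried on an in-memory `Seq` without a cluster. This is a sketch under that analogy only: plain collections have no `reduceByKey`, so `groupBy` followed by a sum stands in for it, and `LocalWordCount` is a hypothetical name.

```scala
// Word count on plain Scala collections, mirroring the RDD pipeline.
// groupBy + sum replaces reduceByKey, which exists only on pair RDDs.
object LocalWordCount {
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))  // split each line into words
      .filter(_.nonEmpty)                // drop empty tokens from repeated spaces
      .map(word => (word, 1))            // pair each word with a count of 1
      .groupBy(_._1)                     // gather the pairs for each word
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit =
    count(Seq("spark and cassandra", "cassandra and kafka")).toSeq
      .sortBy(_._1)
      .foreach { case (word, n) => println(s"$word: $n") }
}
```

The difference that matters is where the work runs: the collection version computes everything in one JVM, while the RDD version partitions the data and only materializes results when an action such as `saveAsTextFile` is called.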
Spark Versus Spark Streaming
[Diagram labels: Spark, zillions of bytes; Spark Streaming, gigabytes per second]
DStream - Micro Batches
[Diagram: a DStream as a sequence of μBatches, each an ordinary RDD. Processing of DStream = processing of μBatches (RDDs)]
DStream
• Continuous sequence of micro batches
• More complex processing models are possible with less effort
• Streaming computations as a series of deterministic batch computations on small time intervals
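The micro batch idea can be illustrated without Spark: events are assigned to fixed time intervals and each interval is then processed as an ordinary batch. The sketch below assumes non-negative millisecond timestamps; `Event`, `MicroBatches` and `batch` are hypothetical names and this is not how Spark Streaming is implemented internally.

```scala
// Sketch of micro-batching: group timestamped events into fixed intervals.
// Hypothetical types; assumes timestamps are non-negative milliseconds.
object MicroBatches {
  final case class Event(timestampMs: Long, payload: String)

  // Returns (interval start, events in that interval), ordered by interval start
  def batch(events: Seq[Event], intervalMs: Long): Seq[(Long, Seq[Event])] = {
    require(intervalMs > 0, "intervalMs must be positive")
    events
      .groupBy(e => e.timestampMs / intervalMs * intervalMs)
      .toSeq
      .sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(0, "a"), Event(500, "b"), Event(1200, "c"), Event(2100, "d"))
    batch(events, 1000).foreach { case (start, es) =>
      println(s"batch starting at $start ms: ${es.map(_.payload).mkString(", ")}")
    }
  }
}
```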
Time series example time
Deployment
• Spark worker in each of the Cassandra nodes
• Partitions made up of LOCAL Cassandra data
[Diagram: four nodes, each pairing a Spark worker (S) with Cassandra (C)]
Weather Station Analysis
• Weather station collects data
• Cassandra stores in sequence
• Spark rolls up data into new tables
Example: Windsor, California, July 1, 2014. High: 73.4F, Low: 51.4F
raw_weather_data
CREATE TABLE raw_weather_data (
    weather_station text,     // Composite of Air Force Datsav3 station number and NCDC WBAN number
    year int,                 // Year collected
    month int,                // Month collected
    day int,                  // Day collected
    hour int,                 // Hour collected
    temperature double,       // Air temperature (degrees Celsius)
    dewpoint double,          // Dew point temperature (degrees Celsius)
    pressure double,          // Sea level pressure (hectopascals)
    wind_direction int,       // Wind direction in degrees. 0-359
    wind_speed double,        // Wind speed (meters per second)
    sky_condition int,        // Total cloud cover (coded, see format documentation)
    sky_condition_text text,  // Non-coded sky conditions
    one_hour_precip double,   // One-hour accumulated liquid precipitation (millimeters)
    six_hour_precip double,   // Six-hour accumulated liquid precipitation (millimeters)
    PRIMARY KEY ((weather_station), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
The CLUSTERING ORDER BY clause reverses the data in the storage engine, so the newest readings come first.
Primary key relationship
PRIMARY KEY ((weather_station), year, month, day, hour)
• Partition key: weather_station
• Clustering columns: year, month, day, hour
Example: partition 10010:99999 holds one temperature per clustering key
Clustering key  Temperature
2005:12:1:7     -5.6
2005:12:1:8     -5.1
2005:12:1:9     -4.9
2005:12:1:10    -5.3
Data Locality
weather_station = '10010:99999' ?
[Diagram: a 1000 node cluster with the node that owns this partition marked "You are here!"]
Query patterns
• Range queries
• “Slice” operation on disk
• Partition key for locality, single seek on disk
• Programmers like this: rows come back sorted by event time
SELECT weather_station, hour, temperature
FROM raw_weather_data
WHERE weather_station = '10010:99999'
AND year = 2005 AND month = 12 AND day = 1
AND hour >= 7 AND hour <= 10;
[Diagram: partition 10010:99999 with cells for 2005:12:1:7 through 2005:12:1:12; the query reads the contiguous slice for hours 7 to 10]
weather_station  hour          temperature
10010:99999      2005:12:1:7   -5.6
10010:99999      2005:12:1:8   -5.1
10010:99999      2005:12:1:9   -4.9
10010:99999      2005:12:1:10  -5.3
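Why the range query is a single slice can be sketched with a sorted map standing in for one partition. The clustering columns form the sort key, so the hours 7 to 10 sit next to each other and can be read as one contiguous range. This is an in-memory analogy with hypothetical names; it sorts ascending, whereas the table above stores rows in descending order, and the temperatures for hours 11 and 12 are sample values.

```scala
import scala.collection.immutable.TreeMap

// One partition modelled as a sorted map keyed by the clustering columns.
// An analogy only; Cassandra's storage engine is not a TreeMap.
object PartitionSlice {
  type ClusteringKey = (Int, Int, Int, Int) // (year, month, day, hour)

  // Readings for weather station 10010:99999 on 2005-12-01
  val partition: TreeMap[ClusteringKey, Double] = TreeMap(
    (2005, 12, 1, 12) -> -5.4,
    (2005, 12, 1, 7) -> -5.6,
    (2005, 12, 1, 10) -> -5.3,
    (2005, 12, 1, 8) -> -5.1,
    (2005, 12, 1, 11) -> -4.9,
    (2005, 12, 1, 9) -> -4.9
  )

  // Hours fromHour to toHour inclusive, read as one contiguous range
  def hours(year: Int, month: Int, day: Int, fromHour: Int, toHour: Int): Seq[(Int, Double)] =
    partition
      .range((year, month, day, fromHour), (year, month, day, toHour + 1))
      .toSeq
      .map { case ((_, _, _, hour), temperature) => (hour, temperature) }

  def main(args: Array[String]): Unit =
    hours(2005, 12, 1, 7, 10).foreach { case (hour, t) => println(s"hour $hour: $t") }
}
```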
weather_station
CREATE TABLE weather_station (
    id text PRIMARY KEY,  // Composite of Air Force Datsav3 station number and NCDC WBAN number
    name text,            // Name of reporting station
    country_code text,    // 2 letter ISO Country ID
    state_code text,      // 2 letter state code for US stations
    call_sign text,       // International station call sign
    lat double,           // Latitude in decimal degrees
    long double,          // Longitude in decimal degrees
    elevation double      // Elevation in meters
);
Lookup table
daily_aggregate_temperature
CREATE TABLE daily_aggregate_temperature (
    weather_station text,
    year int,
    month int,
    day int,
    high double,
    low double,
    mean double,
    variance double,
    stdev double,
    PRIMARY KEY ((weather_station), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);

SELECT high, low FROM daily_aggregate_temperature
WHERE weather_station='010010:99999'
AND year=2005 AND month=12 AND day=3;
high | low
------+------
1.8 | -1.5
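The columns of this table can be derived from one day of hourly temperatures. The following is a plain Scala sketch of that roll-up, not the Spark job from the talk: `DailyAggregate`, `Stats` and `rollUp` are hypothetical names, and population variance is assumed because the slides do not say which definition is used.

```scala
// Sketch: roll one day of hourly temperatures up into the aggregate columns.
// Hypothetical names; population variance is an assumption.
object DailyAggregate {
  final case class Stats(high: Double, low: Double, mean: Double, variance: Double, stdev: Double)

  // Returns None when there are no readings for the day
  def rollUp(temps: Seq[Double]): Option[Stats] =
    if (temps.isEmpty) None
    else {
      val mean = temps.sum / temps.size
      val variance = temps.map(t => math.pow(t - mean, 2)).sum / temps.size
      Some(Stats(temps.max, temps.min, mean, variance, math.sqrt(variance)))
    }

  def main(args: Array[String]): Unit =
    println(rollUp(Seq(-5.6, -5.1, -4.9, -5.3)))
}
```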
daily_aggregate_precip
CREATE TABLE daily_aggregate_precip (
    weather_station text,
    year int,
    month int,
    day int,
    precipitation counter,
    PRIMARY KEY ((weather_station), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);

SELECT precipitation FROM daily_aggregate_precip
WHERE weather_station='010010:99999'
AND year=2005 AND month=12 AND day>=1 AND day <= 7;
[Chart: daily precipitation for days 1 to 7, y-axis 0 to 40; bar labels as extracted: 17, 26, 2, 0, 33, 12, 0]
Result
wsid | year | month | day | high | low
--------------+------+-------+-----+------+------
725300:94846 | 2012 | 9 | 30 | 18.9 | 10.6
725300:94846 | 2012 | 9 | 29 | 25.6 | 9.4
725300:94846 | 2012 | 9 | 28 | 19.4 | 11.7
725300:94846 | 2012 | 9 | 27 | 17.8 | 7.8
725300:94846 | 2012 | 9 | 26 | 22.2 | 13.3
725300:94846 | 2012 | 9 | 25 | 25 | 11.1
725300:94846 | 2012 | 9 | 24 | 21.1 | 4.4
725300:94846 | 2012 | 9 | 23 | 15.6 | 5
725300:94846 | 2012 | 9 | 22 | 15 | 7.2
725300:94846 | 2012 | 9 | 21 | 18.3 | 9.4
725300:94846 | 2012 | 9 | 20 | 21.7 | 11.7
725300:94846 | 2012 | 9 | 19 | 22.8 | 5.6
725300:94846 | 2012 | 9 | 18 | 17.2 | 9.4
725300:94846 | 2012 | 9 | 17 | 25 | 12.8
725300:94846 | 2012 | 9 | 16 | 25 | 10.6
725300:94846 | 2012 | 9 | 15 | 26.1 | 11.1
725300:94846 | 2012 | 9 | 14 | 23.9 | 11.1
725300:94846 | 2012 | 9 | 13 | 26.7 | 13.3
725300:94846 | 2012 | 9 | 12 | 29.4 | 17.2
725300:94846 | 2012 | 9 | 11 | 28.3 | 11.7
725300:94846 | 2012 | 9 | 10 | 23.9 | 12.2
725300:94846 | 2012 | 9 | 9 | 21.7 | 12.8
725300:94846 | 2012 | 9 | 8 | 22.2 | 12.8
725300:94846 | 2012 | 9 | 7 | 25.6 | 18.9
725300:94846 | 2012 | 9 | 6 | 30 | 20.6
725300:94846 | 2012 | 9 | 5 | 30 | 17.8
725300:94846 | 2012 | 9 | 4 | 32.2 | 21.7
725300:94846 | 2012 | 9 | 3 | 30.6 | 21.7
725300:94846 | 2012 | 9 | 2 | 27.2 | 21.7
725300:94846 | 2012 | 9 | 1 | 27.2 | 21.7
SELECT weather_station AS wsid, year, month, day, high, low
FROM daily_aggregate_temperature
WHERE weather_station = '725300:94846'
AND year=2012 AND month=9;
Weather Station Stream Analysis
• Weather station collects data
• Data processed in stream
• Data stored in Cassandra
Example: Windsor, California, today. Rainfall total: 1.2cm, High: 73.4F, Low: 51.4F
Incoming data from Kafka
725030:14732,2008,01,01,00,5.0,-3.9,1020.4,270,4.6,2,0.0,0.0
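Before a line like this can be stored it has to be split into typed fields. The sketch below assumes the 13 comma-separated values follow the column order of raw_weather_data with sky_condition_text left out; that mapping is inferred from the table definition rather than stated on the slide, and `RawWeather`, `RawWeatherData` and `parse` are hypothetical names.

```scala
import scala.util.Try

// Sketch: parse one comma-separated weather reading into a typed record.
// Field order is an assumption based on the raw_weather_data columns.
object RawWeather {
  final case class RawWeatherData(
      weatherStation: String, year: Int, month: Int, day: Int, hour: Int,
      temperature: Double, dewpoint: Double, pressure: Double,
      windDirection: Int, windSpeed: Double, skyCondition: Int,
      oneHourPrecip: Double, sixHourPrecip: Double)

  // Returns None for lines with the wrong field count or unparsable numbers
  def parse(line: String): Option[RawWeatherData] =
    line.split(",").map(_.trim) match {
      case Array(ws, y, m, d, h, t, dew, p, wd, wsp, sky, p1, p6) =>
        Try(RawWeatherData(ws, y.toInt, m.toInt, d.toInt, h.toInt,
          t.toDouble, dew.toDouble, p.toDouble,
          wd.toInt, wsp.toDouble, sky.toInt, p1.toDouble, p6.toDouble)).toOption
      case _ => None
    }

  def main(args: Array[String]): Unit =
    println(parse("725030:14732,2008,01,01,00,5.0,-3.9,1020.4,270,4.6,2,0.0,0.0"))
}
```

Returning an `Option` keeps malformed lines out of the stream without throwing, which is one reasonable choice for a long-running consumer.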
Creating a Stream
Saving the raw data
Building an aggregate
CREATE TABLE daily_aggregate_precip (
    weather_station text,
    year int,
    month int,
    day int,
    precipitation counter,
    PRIMARY KEY ((weather_station), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);
CQL Counter
Weather data streaming
[Diagram labels: Load generator or data import; Apache Kafka (Producer, Consumer); NodeGuardian; Dashboard]
Run this yourself
• https://github.com/killrweather/killrweather
Summary
• Cassandra
- always-on operational database
• Spark
- Batch analytics
- Stream processing and saving back to Cassandra
Thanks for listening
• Follow me on Twitter @chbatey
• Cassandra + fault tolerance posts aplenty:
• http://christopher-batey.blogspot.co.uk/
• Cassandra resources: http://planetcassandra.org/
• Full free day of Cassandra talks/training:
• http://www.eventbrite.com/e/cassandra-day-london-2015-april-22nd-2015-tickets-15053026006?aff=meetup1

Data Science Lab Meetup: Cassandra and Spark

  • 1. @chbatey Christopher Batey
 Technical Evangelist for Apache Cassandra Time series analysis with Spark and Cassandra
  • 2. @chbatey Who am I? • Technical Evangelist for Apache Cassandra •Founder of Stubbed Cassandra •Help out Apache Cassandra users • DataStax •Builds enterprise ready version of Apache Cassandra • Previous: Cassandra backed apps at BSkyB
  • 3. @chbatey Agenda • Motivation • Cassandra • Replication • Fault tolerance • Data modelling • Spark • Use cases • Stream processing • Time series example: Weather station data
  • 5. Weather data streaming Incoming weather events Apache Kafka Producer Consumer NodeGuardian Dashboard
  • 8. @chbatey Run this your self • https://github.com/killrweather/killrweather
  • 11. @chbatey Common use cases •Ordered data such as time series -Event stores -Financial transactions -IoT e.g Sensor data
  • 12. @chbatey Common use cases •Ordered data such as time series -Event stores -Financial transactions -IoT e.g Sensor data •Non functional requirements: -Linear scalability -High throughout durable writes -Multi datacenter including active-active -Analytics without ETL
  • 13. @chbatey Cassandra Cassandra • Distributed masterless database (Dynamo) • Column family data model (Google BigTable)
  • 14. @chbatey Datacenter and rack aware Europe • Distributed master less database (Dynamo) • Column family data model (Google BigTable) • Multi data centre replication built in from the start USA
  • 15. @chbatey Cassandra Online • Distributed master less database (Dynamo) • Column family data model (Google BigTable) • Multi data centre replication built in from the start • Analytics with Apache SparkAnalytics
  • 17. @chbatey Dynamo 101 • The parts Cassandra took - Consistent hashing - Replication - Gossip - Hinted handoff - Anti-entropy repair • And the parts it left behind - Key/Value - Vector clocks
  • 18. @chbatey Picking the right nodes • You don’t want a full table scan on a 1000 node cluster! • Dynamo to the rescue: Consistent Hashing
  • 19. @chbatey Murmer3 Example • Data: • Murmer3 Hash Values: jim age: 36 car: ford gender: M carol age: 37 car: bmw gender: F johnny age: 12 gender: M suzy: age: 10 gender: F Primary Key Murmur3 hash value jim 350 carol 998 johnny 50 suzy 600 Primary Key Real hash range: -9223372036854775808 to 9223372036854775807
  • 20. @chbatey Murmer3 Example Four node cluster: Node Murmur3 start range Murmur3 end range A 0 249 B 250 499 C 500 749 D 750 999
  • 22. @chbatey Murmer3 Example Data is distributed as: Node Start range End range Primary key Hash value A 0 249 johnny 50 B 250 499 jim 350 C 500 749 suzy 600 D 750 999 carol 998
  • 24. @chbatey Replication strategy • Simple - Give it to the next node in the ring - Don’t use this in production • NetworkTopology - Every Cassandra node knows its DC and Rack - Replicas won’t be put on the same rack unless Replication Factor > # of racks - Unfortunately Cassandra can’t create servers and racks on the fly to fix this :(
  • 26. 26
  • 27. @chbatey Tunable Consistency •Data is replicated N times •Every query that you execute you give a consistency -ALL -QUORUM -LOCAL_QUORUM -ONE • Christos Kalantzis Eventual Consistency != Hopeful Consistency: http:// youtu.be/A6qzx_HE3EU?list=PLqcm6qE9lgKJzVvwHprow9h7KMpb5hcUU
  • 28. @chbatey Scaling shouldn’t be hard • Throw more nodes at a cluster • Bootstrapping + joining the ring • For large data sets this can take some time
  • 30. @chbatey Scalability & Performance • Scalability - No single point of failure - No special nodes that become the bottle neck - Work/data can be re-distributed • Operational Performance i.e single digit ms - Single node for query - Single disk seek per query
  • 31. @chbatey But but… • Sometimes you don’t need a answers in milliseconds • Reports / analysis • Data models done wrong - how do I fix it? • New requirements for old data? • Ad-hoc operational queries • Managers always want counts / maxs
  • 33. @chbatey Apache Spark • 10x faster on disk,100x faster in memory than Hadoop MR • Works out of the box on EMR • Fault tolerant distributed datasets • Batch, iterative and streaming analysis • In memory storage and disk • Integrates with most file and storage options
  • 34. @chbatey Part of most Big Data Platforms Analytic Search • All Major Hadoop Distributions Include Spark • Spark Is Also Integrated With Non- Hadoop Big Data Platforms like DSE • Spark Applications Can Be Written Once and Deployed Anywhere SQL Machine Learning Streaming Graph Core Deploy Spark Apps Anywhere
  • 35. @chbatey Components Shark or
 Spark SQL Streaming ML Spark (General execution engine) Graph Cassandra Compatible
  • 36. @chbatey org.apache.spark.rdd.RDD • Resilient Distributed Dataset (RDD) • Created through transformations on data (map,filter..) or other RDDs • Immutable • Partitioned • Reusable
  • 37. @chbatey RDD Operations • Transformations - Similar to Scala collections API • Produce new RDDs • filter, flatmap, map, distinct, groupBy, union, zip, reduceByKey, subtract • Actions • Require materialization of the records to generate a value • collect: Array[T], count, fold, reduce..
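  • 38. @chbatey Word count
 import org.apache.spark.rdd.RDD
 // sc is an existing SparkContext
 val file: RDD[String] = sc.textFile("hdfs://...")
 val counts: RDD[(String, Int)] = file
   .flatMap(line => line.split(" "))  // split each line into words
   .map(word => (word, 1))            // pair each word with a count of 1
   .reduceByKey(_ + _)                // sum the counts per word
 counts.saveAsTextFile("hdfs://...")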
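For readers without a Spark cluster to hand, the same three steps can be traced on an in-memory list. This is only a plain-Python sketch of what flatMap, map and reduceByKey compute; it is not Spark code, it runs on one machine, and the function name `word_count` is made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def word_count(lines):
    """Count words the way the flatMap / map / reduceByKey pipeline does."""
    # flatMap: every line yields several words, flattened into one stream
    words = chain.from_iterable(line.split(" ") for line in lines)
    # map: pair each word with the count 1
    pairs = ((word, 1) for word in words)
    # reduceByKey(_ + _): sum the counts that share a key
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)


if __name__ == "__main__":
    print(word_count(["to be or", "not to be"]))
```

In Spark the same logic is spread across partitions, with reduceByKey combining partial sums from each partition.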
  • 39. Spark Versus Spark Streaming • Spark: zillions of bytes • Spark Streaming: gigabytes per second
  • 40. DStream - Micro Batches • DStream: continuous sequence of micro batches, each μBatch an ordinary RDD • Processing of DStream = processing of μBatches (RDDs) • More complex processing models are possible with less effort • Streaming computations as a series of deterministic batch computations on small time intervals
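The micro-batch idea can be sketched without Spark: events are grouped into fixed time intervals, and each interval is then processed as an ordinary batch. The helper names below (`micro_batches`, `process_stream`) are hypothetical, and real Spark Streaming batches by arrival time on a running stream rather than over a finished list.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp_seconds, value) events into fixed-interval batches.

    Returns a list of (batch_start, values) ordered by batch start time.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batch_start = int(timestamp // batch_seconds) * batch_seconds
        batches[batch_start].append(value)
    return [(start, batches[start]) for start in sorted(batches)]


def process_stream(events, batch_seconds, batch_fn):
    """Apply the same deterministic batch function to every micro batch."""
    return [(start, batch_fn(values))
            for start, values in micro_batches(events, batch_seconds)]


if __name__ == "__main__":
    events = [(0.5, 1), (1.2, 2), (1.9, 3), (4.0, 4)]
    print(process_stream(events, 2, sum))
```

Because each batch is just a list, any function written for batch data can be reused unchanged on the stream.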
  • 42. @chbatey Deployment • Spark worker on each of the Cassandra nodes • Partitions made up of LOCAL Cassandra data (diagram: four nodes, each pairing a Spark worker, S, with a Cassandra node, C)
  • 43. Weather Station Analysis • Weather station collects data • Cassandra stores in sequence • Spark rolls up data into new tables Windsor California July 1, 2014 High: 73.4F Low : 51.4F
  • 44. raw_weather_data
 CREATE TABLE raw_weather_data (
   weather_station text,     // Composite of Air Force Datsav3 station number and NCDC WBAN number
   year int,                 // Year collected
   month int,                // Month collected
   day int,                  // Day collected
   hour int,                 // Hour collected
   temperature double,       // Air temperature (degrees Celsius)
   dewpoint double,          // Dew point temperature (degrees Celsius)
   pressure double,          // Sea level pressure (hectopascals)
   wind_direction int,       // Wind direction in degrees. 0-359
   wind_speed double,        // Wind speed (meters per second)
   sky_condition int,        // Total cloud cover (coded, see format documentation)
   sky_condition_text text,  // Non-coded sky conditions
   one_hour_precip double,   // One-hour accumulated liquid precipitation (millimeters)
   six_hour_precip double,   // Six-hour accumulated liquid precipitation (millimeters)
   PRIMARY KEY ((weather_station), year, month, day, hour)
 ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
 The DESC clustering order reverses the data in the storage engine.
  • 45. Primary key relationship PRIMARY KEY ((weather_station), year, month, day, hour)
  • 46. Primary key relationship PRIMARY KEY ((weather_station), year, month, day, hour) • Partition Key: weather_station
  • 47. Primary key relationship PRIMARY KEY ((weather_station), year, month, day, hour) • Partition Key: weather_station • Clustering Columns: year, month, day, hour
  • 48. Primary key relationship PRIMARY KEY ((weather_station), year, month, day, hour) • Partition Key: weather_station, e.g. 10010:99999 • Clustering Columns: year, month, day, hour
  • 49. Primary key relationship PRIMARY KEY ((weather_station), year, month, day, hour) • Partition Key: weather_station, e.g. 10010:99999 • Clustering Columns: year, month, day, hour
 Partition 10010:99999: 2005:12:1:10 = -5.3 | 2005:12:1:9 = -4.9 | 2005:12:1:8 = -5.1 | 2005:12:1:7 = -5.6
  • 51. Query patterns • Range queries • “Slice” operation on disk • Single seek on disk • Partition key for locality
 SELECT weather_station, hour, temperature FROM raw_weather_data WHERE weather_station = '10010:99999' AND year = 2005 AND month = 12 AND day = 1 AND hour >= 7 AND hour <= 10;
 Partition 10010:99999: 2005:12:1:12 = -5.4 | 2005:12:1:11 = -4.9 | 2005:12:1:10 = -5.3 | 2005:12:1:9 = -4.9 | 2005:12:1:8 = -5.1 | 2005:12:1:7 = -5.6
  • 52. Query patterns • Range queries • “Slice” operation on disk • Programmers like this: results sorted by event time
 SELECT weather_station, hour, temperature FROM raw_weather_data WHERE weather_station = '10010:99999' AND year = 2005 AND month = 12 AND day = 1 AND hour >= 7 AND hour <= 10;
 weather_station | hour         | temperature
 ----------------+--------------+-------------
 10010:99999     | 2005:12:1:7  | -5.6
 10010:99999     | 2005:12:1:8  | -5.1
 10010:99999     | 2005:12:1:9  | -4.9
 10010:99999     | 2005:12:1:10 | -5.3
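Why a range query within one partition is cheap can be shown with a toy model: the partition key locates one sorted run of rows, and the clustering columns select a contiguous slice of it. The sketch below is a plain-Python stand-in, not Cassandra's storage engine. The dictionary layout and the `slice_partition` name are assumptions for illustration, and rows are kept ascending here for simplicity although the table above declares DESC order.

```python
from bisect import bisect_left, bisect_right

# partition key -> rows sorted by clustering key (year, month, day, hour)
partitions = {
    "10010:99999": [
        ((2005, 12, 1, 7), -5.6),
        ((2005, 12, 1, 8), -5.1),
        ((2005, 12, 1, 9), -4.9),
        ((2005, 12, 1, 10), -5.3),
        ((2005, 12, 1, 11), -4.9),
        ((2005, 12, 1, 12), -5.4),
    ]
}


def slice_partition(station, start_key, end_key):
    """Return rows with start_key <= clustering key <= end_key.

    One dictionary lookup finds the partition; two binary searches
    find the bounds of a contiguous slice of the sorted rows.
    """
    rows = partitions[station]
    keys = [key for key, _ in rows]
    lo = bisect_left(keys, start_key)
    hi = bisect_right(keys, end_key)
    return rows[lo:hi]


if __name__ == "__main__":
    for key, temperature in slice_partition(
            "10010:99999", (2005, 12, 1, 7), (2005, 12, 1, 10)):
        print(key, temperature)
```

The slice is contiguous because the rows are stored in clustering order, which is what makes the "single seek" claim plausible.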
  • 53. weather_station (lookup table)
 CREATE TABLE weather_station (
   id text PRIMARY KEY,  // Composite of Air Force Datsav3 station number and NCDC WBAN number
   name text,            // Name of reporting station
   country_code text,    // 2 letter ISO Country ID
   state_code text,      // 2 letter state code for US stations
   call_sign text,       // International station call sign
   lat double,           // Latitude in decimal degrees
   long double,          // Longitude in decimal degrees
   elevation double      // Elevation in meters
 );
  • 54. daily_aggregate_temperature
 CREATE TABLE daily_aggregate_temperature (
   weather_station text,
   year int,
   month int,
   day int,
   high double,
   low double,
   mean double,
   variance double,
   stdev double,
   PRIMARY KEY ((weather_station), year, month, day)
 ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);
 SELECT high, low FROM daily_aggregate_temperature WHERE weather_station='010010:99999' AND year=2005 AND month=12 AND day=3;
  high | low
 ------+------
   1.8 | -1.5
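The columns of this table correspond to simple statistics over one day's hourly readings. A minimal sketch of that roll-up, in plain Python rather than Spark, follows. The function name is hypothetical, and using population variance and standard deviation is an assumption, since the slides do not say which form is stored.

```python
from statistics import mean, pstdev, pvariance


def daily_aggregate(temperatures):
    """Summarise one day's hourly temperatures (degrees Celsius).

    Uses population variance/stdev; that choice is an assumption.
    """
    if not temperatures:
        raise ValueError("need at least one reading")
    return {
        "high": max(temperatures),
        "low": min(temperatures),
        "mean": mean(temperatures),
        "variance": pvariance(temperatures),
        "stdev": pstdev(temperatures),
    }


if __name__ == "__main__":
    print(daily_aggregate([-5.6, -5.1, -4.9, -5.3]))
```

Each result dictionary maps directly onto one row keyed by weather_station, year, month and day.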
  • 55. daily_aggregate_precip
 CREATE TABLE daily_aggregate_precip (
   weather_station text,
   year int,
   month int,
   day int,
   precipitation counter,
   PRIMARY KEY ((weather_station), year, month, day)
 ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);
 SELECT precipitation FROM daily_aggregate_precip WHERE weather_station='010010:99999' AND year=2005 AND month=12 AND day>=1 AND day <= 7;
 Chart: precipitation for days 1 to 7: 17, 26, 2, 0, 33, 12, 0
  • 56. Result
 wsid         | year | month | day | high | low
 -------------+------+-------+-----+------+------
 725300:94846 | 2012 |     9 |  30 | 18.9 | 10.6
 725300:94846 | 2012 |     9 |  29 | 25.6 |  9.4
 725300:94846 | 2012 |     9 |  28 | 19.4 | 11.7
 725300:94846 | 2012 |     9 |  27 | 17.8 |  7.8
 725300:94846 | 2012 |     9 |  26 | 22.2 | 13.3
 725300:94846 | 2012 |     9 |  25 |   25 | 11.1
 725300:94846 | 2012 |     9 |  24 | 21.1 |  4.4
 725300:94846 | 2012 |     9 |  23 | 15.6 |    5
 725300:94846 | 2012 |     9 |  22 |   15 |  7.2
 725300:94846 | 2012 |     9 |  21 | 18.3 |  9.4
 725300:94846 | 2012 |     9 |  20 | 21.7 | 11.7
 725300:94846 | 2012 |     9 |  19 | 22.8 |  5.6
 725300:94846 | 2012 |     9 |  18 | 17.2 |  9.4
 725300:94846 | 2012 |     9 |  17 |   25 | 12.8
 725300:94846 | 2012 |     9 |  16 |   25 | 10.6
 725300:94846 | 2012 |     9 |  15 | 26.1 | 11.1
 725300:94846 | 2012 |     9 |  14 | 23.9 | 11.1
 725300:94846 | 2012 |     9 |  13 | 26.7 | 13.3
 725300:94846 | 2012 |     9 |  12 | 29.4 | 17.2
 725300:94846 | 2012 |     9 |  11 | 28.3 | 11.7
 725300:94846 | 2012 |     9 |  10 | 23.9 | 12.2
 725300:94846 | 2012 |     9 |   9 | 21.7 | 12.8
 725300:94846 | 2012 |     9 |   8 | 22.2 | 12.8
 725300:94846 | 2012 |     9 |   7 | 25.6 | 18.9
 725300:94846 | 2012 |     9 |   6 |   30 | 20.6
 725300:94846 | 2012 |     9 |   5 |   30 | 17.8
 725300:94846 | 2012 |     9 |   4 | 32.2 | 21.7
 725300:94846 | 2012 |     9 |   3 | 30.6 | 21.7
 725300:94846 | 2012 |     9 |   2 | 27.2 | 21.7
 725300:94846 | 2012 |     9 |   1 | 27.2 | 21.7
 SELECT wsid, year, month, day, high, low FROM daily_aggregate_temperature WHERE wsid = '725300:94846' AND year=2012 AND month=9;
  • 57. Weather Station Stream Analysis • Weather station collects data • Data processed in stream • Data stored in Cassandra Windsor California Today Rainfall total: 1.2cm High: 73.4F Low : 51.4F
  • 58. Incoming data from Kafka 725030:14732,2008,01,01,00,5.0,-3.9,1020.4,270,4.6,2,0.0,0.0
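A raw line like the one above has to be split into typed fields before it can be stored. The sketch below shows one way to do that in plain Python. The field order is an assumption: it follows the raw_weather_data column order and skips sky_condition_text, because the sample line has 13 values rather than 14. The `RawWeather` and `parse_raw_weather` names are hypothetical.

```python
from typing import NamedTuple


class RawWeather(NamedTuple):
    """One hourly observation; field order is assumed, not confirmed."""
    weather_station: str
    year: int
    month: int
    day: int
    hour: int
    temperature: float
    dewpoint: float
    pressure: float
    wind_direction: int
    wind_speed: float
    sky_condition: int
    one_hour_precip: float
    six_hour_precip: float


def parse_raw_weather(line: str) -> RawWeather:
    """Parse one comma-separated observation into typed fields."""
    f = line.strip().split(",")
    if len(f) != 13:
        raise ValueError(f"expected 13 fields, got {len(f)}")
    return RawWeather(
        f[0], int(f[1]), int(f[2]), int(f[3]), int(f[4]),
        float(f[5]), float(f[6]), float(f[7]),
        int(f[8]), float(f[9]), int(f[10]),
        float(f[11]), float(f[12]),
    )


if __name__ == "__main__":
    sample = "725030:14732,2008,01,01,00,5.0,-3.9,1020.4,270,4.6,2,0.0,0.0"
    print(parse_raw_weather(sample))
```

Rejecting lines with the wrong number of fields keeps malformed events out of the aggregates further down the pipeline.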
  • 61. @chbatey Building an aggregate (CQL counter)
 CREATE TABLE daily_aggregate_precip (
   weather_station text,
   year int,
   month int,
   day int,
   precipitation counter,
   PRIMARY KEY ((weather_station), year, month, day)
 ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);
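The increment-per-key pattern behind a counter column can be mimicked in memory. This is only a sketch of the idea in plain Python: the class and method names are hypothetical, nothing here talks to Cassandra, and units or scaling of the amounts are left to the caller.

```python
from collections import defaultdict


class DailyPrecipCounter:
    """In-memory stand-in for a counter keyed by station and day."""

    def __init__(self):
        self._totals = defaultdict(int)

    def increment(self, station, year, month, day, amount):
        """Add an amount to the running total for one station and day."""
        self._totals[(station, year, month, day)] += amount

    def total(self, station, year, month, day):
        """Current total, or 0 if nothing has been recorded yet."""
        return self._totals.get((station, year, month, day), 0)


if __name__ == "__main__":
    counter = DailyPrecipCounter()
    counter.increment("010010:99999", 2005, 12, 1, 5)
    counter.increment("010010:99999", 2005, 12, 1, 12)
    print(counter.total("010010:99999", 2005, 12, 1))
```

Each streamed event only adds to the existing total, so the aggregate stays current without re-reading the raw rows.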
  • 62. Weather data streaming Load Generator or Data import Apache Kafka Producer Consumer NodeGuardian Dashboard
  • 65. @chbatey Run this yourself • https://github.com/killrweather/killrweather
  • 66. @chbatey Summary • Cassandra - always-on operational database • Spark - Batch analytics - Stream processing and saving back to Cassandra
  • 67. @chbatey Thanks for listening • Follow me on Twitter @chbatey • Cassandra + fault tolerance posts aplenty: http://christopher-batey.blogspot.co.uk/ • Cassandra resources: http://planetcassandra.org/ • Full free day of Cassandra talks/training: http://www.eventbrite.com/e/cassandra-day-london-2015-april-22nd-2015-tickets-15053026006?aff=meetup1