@chbatey
Christopher Batey

Technical Evangelist for Apache Cassandra
Cassandra Spark Integration
@chbatey
Agenda
• Spark intro
• Spark Cassandra connector
• Examples:
- Migrating from MySQL to Cassandra
- Cassandra schema migrations
- Import data from a flat file into Cassandra
- Spark SQL on Cassandra
- Spark Streaming and Cassandra
@chbatey
Scalability & Performance
• Scalability
- No single point of failure
- No special nodes that become the bottleneck
- Work/data can be re-distributed
• Operational performance, i.e. single-digit ms
- Single node for query
- Single disk seek per query
@chbatey
Cassandra cannot join or aggregate
[Diagram: a client asking the cluster "Where do I go for the max?"]
@chbatey
But but…
• Sometimes you don’t need answers in milliseconds
• Data models done wrong - how do I fix it?
• New requirements for old data?
• Ad-hoc operational queries
• Managers always want counts / maxes
@chbatey
Apache Spark
• 10x faster on disk, 100x faster in memory than Hadoop MR
• Works out of the box on EMR
• Fault Tolerant Distributed Datasets
• Batch, iterative and streaming analysis
• In Memory Storage and Disk
• Integrates with Most File and Storage Options
@chbatey
Components
[Diagram: components stacked on Spark, the general execution engine — Shark or Spark SQL, Streaming, ML, Graph — all Cassandra compatible]
@chbatey
Spark architecture
@chbatey
org.apache.spark.rdd.RDD
• Resilient Distributed Dataset (RDD)
• Created through transformations on data (map, filter, …) or other RDDs
• Immutable
• Partitioned
• Reusable
@chbatey
RDD Operations
• Transformations - similar to the Scala collections API
- Produce new RDDs
- filter, flatMap, map, distinct, groupBy, union, zip, reduceByKey, subtract
• Actions
- Require materialization of the records to generate a value
- collect: Array[T], count, fold, reduce, … (sketch below)
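A minimal sketch of the lazy-transformation / eager-action split (plain Scala against an existing SparkContext sc; the values are made up for illustration):

val nums = sc.parallelize(1 to 10)   // RDD[Int]
val evens = nums.filter(_ % 2 == 0)  // transformation: builds a new RDD, runs nothing yet
val doubled = evens.map(_ * 2)       // still lazy
val total = doubled.reduce(_ + _)    // action: materializes the records and runs the job
println(total)                       // 60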
@chbatey
Word count
val file: RDD[String] = sc.textFile("hdfs://...")

val counts: RDD[(String, Int)] = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://...")
@chbatey
Spark shell
Operator Graph: Optimisation and Fault Tolerance
[Diagram: RDD operator graph — map, filter, join, groupBy — split into Stages 1–3, with RDDs and cached partitions marked]
@chbatey
Partitioning
• Large data sets from S3, HDFS, Cassandra, etc.
• Split into small chunks called partitions
• Each operation is done locally on a partition before combining with other partitions
• So partitioning is important for data locality (a small sketch follows)
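For instance, a small sketch of inspecting and adjusting an RDD's partitioning (the file path and counts are illustrative):

val rdd = sc.textFile("hdfs://...", 8)    // ask for at least 8 partitions
println(rdd.partitions.length)            // how many partitions Spark actually created
val wider = rdd.repartition(16)           // full shuffle into 16 partitions
val narrower = rdd.coalesce(4)            // shrink the partition count, avoiding a full shuffle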
@chbatey
Spark Streaming
@chbatey
Cassandra
@chbatey
Spark on Cassandra
• Server-side filters (where clauses; see the sketch after this list)
• Cross-table operations (JOIN, UNION, etc.)
• Data locality-aware (speed)
• Data transformation, aggregation, etc.
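As an illustration of a server-side filter, the connector's where() pushes a CQL predicate down to Cassandra instead of filtering in Spark. A hedged sketch — the table comes from the later demos, and lastWeek is an assumed timestamp value:

import com.datastax.spark.connector._
val recent = sc.cassandraTable("test", "customer_events")
  .where("time > ?", lastWeek)  // lastWeek is an assumed value; the predicate runs inside Cassandra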
@chbatey
Spark Cassandra Connector
• Loads data from Cassandra to Spark
• Writes data from Spark to Cassandra
• Implicit Type Conversions and Object Mapping (sketch below)
• Implemented in Scala (offers a Java API)
• Open Source
• Exposes Cassandra Tables as Spark RDDs + Spark DStreams
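A short sketch of the object mapping, assuming the test.kv table used later in the demos: rows map straight onto a case class whose fields match the column names.

case class KV(key: String, value: Int)
val kvs = sc.cassandraTable[KV]("test", "kv")  // RDD[KV] instead of RDD[CassandraRow]
kvs.map(kv => KV(kv.key, kv.value + 1)).saveToCassandra("test", "kv")  // the mapping works for writes too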
@chbatey
Analytics Workload Isolation
@chbatey
Deployment
• Spark worker in each of the Cassandra nodes
• Partitions made up of LOCAL Cassandra data
[Diagram: four nodes, each running a Spark worker (S) alongside Cassandra (C)]
@chbatey
Example Time
@chbatey
It's on GitHub
"org.apache.spark" %% "spark-core" % sparkVersion



"org.apache.spark" %% "spark-streaming" % sparkVersion



"org.apache.spark" %% "spark-sql" % sparkVersion



"org.apache.spark" %% "spark-streaming-kafka" % sparkVersion



"com.datastax.spark" % "spark-cassandra-connector_2.10" % connectorVersion
@chbatey
Boilerplate
import com.datastax.spark.connector.rdd._
import org.apache.spark._
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql._

object BasicCassandraInteraction extends App {
  // "127.0.0.1" is the Cassandra host
  val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
  // "local[4]" is the Spark master, e.g. spark://host:port
  val sc = new SparkContext("local[4]", "AppName", conf)
  // cool stuff
}
@chbatey
Executing code against the driver
CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE KEYSPACE IF NOT EXISTS test WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }")
  session.execute("CREATE TABLE IF NOT EXISTS test.kv(key text PRIMARY KEY, value int)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('chris', 10)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('dan', 1)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('charlieS', 2)")
}
@chbatey
Reading data from Cassandra
CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE TABLE IF NOT EXISTS test.kv(key text PRIMARY KEY, value int)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('chris', 10)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('dan', 1)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('charlieS', 2)")
}

val rdd: CassandraRDD[CassandraRow] = sc.cassandraTable("test", "kv")
println(rdd.count())
println(rdd.first())
println(rdd.max()(new Ordering[CassandraRow] {
  override def compare(x: CassandraRow, y: CassandraRow): Int =
    x.getInt("value").compare(y.getInt("value"))
}))
@chbatey
Word Count + Save to Cassandra
val textFile: RDD[String] = sc.textFile("Spark-Readme.md")

val words: RDD[String] = textFile.flatMap(line => line.split("\\s+"))
val wordAndCount: RDD[(String, Int)] = words.map((_, 1))
val wordCounts: RDD[(String, Int)] = wordAndCount.reduceByKey(_ + _)

println(wordCounts.first())

wordCounts.saveToCassandra("test", "words", SomeColumns("word", "count"))
@chbatey
Migrating from an RDBMS
create table store(
  store_name varchar(32) primary key,
  location varchar(32),
  store_type varchar(10));

create table staff(
  name varchar(32) primary key,
  favourite_colour varchar(32),
  job_title varchar(32));

create table customer_events(
  id MEDIUMINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  customer varchar(12),
  time timestamp,
  event_type varchar(16),
  store varchar(32),
  staff varchar(32),
  foreign key fk_store(store) references store(store_name),
  foreign key fk_staff(staff) references staff(name))
@chbatey
Denormalised table
CREATE TABLE IF NOT EXISTS customer_events(
  customer_id text,
  time timestamp,
  id uuid,
  event_type text,
  store_name text,
  store_type text,
  store_location text,
  staff_name text,
  staff_title text,
  PRIMARY KEY ((customer_id), time, id))
@chbatey
Migration time

val customerEvents = new JdbcRDD(sc, () => { DriverManager.getConnection(mysqlJdbcString) },
  "select * from customer_events ce, staff, store where ce.store = store.store_name and ce.staff = staff.name " +
    "and ce.id >= ? and ce.id <= ?", 0, 1000, 6,
  (r: ResultSet) => {
    // tuple order must line up with the SomeColumns order below
    (r.getString("customer"),
      r.getTimestamp("time"),
      UUID.randomUUID(),
      r.getString("event_type"),
      r.getString("store_name"),
      r.getString("store_type"),
      r.getString("location"),
      r.getString("staff"),
      r.getString("job_title"))
  })

customerEvents.saveToCassandra("test", "customer_events",
  SomeColumns("customer_id", "time", "id", "event_type", "store_name", "store_type", "store_location", "staff_name", "staff_title"))
@chbatey
Issues with denormalisation
• What happens when I need to query the denormalised
data a different way?
@chbatey
Store it twice
CREATE TABLE IF NOT EXISTS customer_events(
  customer_id text,
  time timestamp,
  id uuid,
  event_type text,
  store_name text,
  store_type text,
  store_location text,
  staff_name text,
  staff_title text,
  PRIMARY KEY ((customer_id), time, id))

CREATE TABLE IF NOT EXISTS customer_events_by_staff(
  customer_id text,
  time timestamp,
  id uuid,
  event_type text,
  store_name text,
  store_type text,
  store_location text,
  staff_name text,
  staff_title text,
  PRIMARY KEY ((staff_name), time, id))
@chbatey
My reaction a year ago
@chbatey
Too simple
val events_by_customer = sc.cassandraTable("test", "customer_events")

events_by_customer.saveToCassandra("test", "customer_events_by_staff",
  SomeColumns("customer_id", "time", "id", "event_type", "staff_name",
    "staff_title", "store_location", "store_name", "store_type"))
@chbatey
Aggregations with Spark SQL
[Diagram: example table layout highlighting the partition key and clustering columns]
@chbatey
Now now…
val cc = new CassandraSQLContext(sc)
cc.setKeyspace("test")
val rdd: SchemaRDD = cc.sql("SELECT store_name, event_type, count(store_name) from customer_events GROUP BY store_name, event_type")
rdd.collect().foreach(println)
[SportsApp,WATCH_STREAM,1]
[SportsApp,LOGOUT,1]
[SportsApp,LOGIN,1]
[ChrisBatey.com,WATCH_MOVIE,1]
[ChrisBatey.com,LOGOUT,1]
[ChrisBatey.com,BUY_MOVIE,1]
[SportsApp,WATCH_MOVIE,2]
@chbatey
Lambda architecture
http://lambda-architecture.net/
@chbatey
Spark Streaming
@chbatey
Network word count
CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE TABLE IF NOT EXISTS test.network_word_count(word text PRIMARY KEY, number int)")
  session.execute("CREATE TABLE IF NOT EXISTS test.network_word_count_raw(time timeuuid PRIMARY KEY, raw text)")
}

val ssc = new StreamingContext(conf, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)
lines.map((UUIDs.timeBased(), _)).saveToCassandra("test", "network_word_count_raw")

val words = lines.flatMap(_.split("\\s+"))
val countOfOne = words.map((_, 1))
val reduced = countOfOne.reduceByKey(_ + _)
reduced.saveToCassandra("test", "network_word_count")
@chbatey
Kafka
• Partitioned pub sub system
• Very high throughput
• Very scalable
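For context on the KafkaUtils.createStream call on the next slide: the receiver-based Kafka stream is configured through a plain parameter map. A hedged sketch of what kafka.kafkaParams might contain (addresses and group id are illustrative):

val kafkaParams = Map(
  "zookeeper.connect" -> "localhost:2181",  // receiver-based streams track offsets via ZooKeeper
  "group.id" -> "spark-streaming-demo")     // consumer group for this streaming job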
@chbatey
Stream processing customer events
val joeBuy = write(CustomerEvent("joe", "chris", "WEB", "NEW_CUSTOMER", "lots of fancy content", event_type = "BUY"))
val joeBuy2 = write(CustomerEvent("joe", "chris", "WEB", "NEW_CUSTOMER", "lots of fancy content", event_type = "BUY"))
val joeSell = write(CustomerEvent("joe", "chris", "WEB", "NEW_CUSTOMER", "lots of fancy content", event_type = "SELL"))
val chrisBuy = write(CustomerEvent("chris", "chris", "WEB", "NEW_CUSTOMER", "lots of fancy content", event_type = "BUY"))

CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE TABLE IF NOT EXISTS streaming.customer_events_by_type ( nameAndType text primary key, number int)")
  session.execute("CREATE TABLE IF NOT EXISTS streaming.customer_events ( " +
    "customer_id text, " +
    "staff_id text, " +
    "store_type text, " +
    "group text static, " +
    "content text, " +
    "time timeuuid, " +
    "event_type text, " +
    "PRIMARY KEY ((customer_id), time) )")
}
@chbatey
Save + Process
val rawEvents: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafka.kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_ONLY)

val events: DStream[CustomerEvent] = rawEvents.map({ case (k, v) =>
  parse(v).extract[CustomerEvent]
})

events.saveToCassandra("streaming", "customer_events")

val eventsByCustomerAndType = events.map(event => (s"${event.customer_id}-${event.event_type}", 1)).reduceByKey(_ + _)

eventsByCustomerAndType.saveToCassandra("streaming", "customer_events_by_type")
@chbatey
Summary
• Cassandra is an operational database
• Spark gives us the flexibility to do slower things
- Schema migrations
- Ad-hoc queries
- Report generation
• Spark Streaming + Cassandra allow us to build online analytical platforms
@chbatey
Thanks for listening
• Follow me on Twitter @chbatey
• Cassandra + fault tolerance posts aplenty:
• http://christopher-batey.blogspot.co.uk/
• GitHub for all examples:
• https://github.com/chbatey/spark-sandbox
• Cassandra resources: http://planetcassandra.org/
• In London in April? http://www.eventbrite.com/e/cassandra-day-london-2015-april-22nd-2015-tickets-15053026006?aff=CommunityLanding
