@chbatey
Christopher Batey

Technical Evangelist for Apache Cassandra
Cassandra Spark Integration
@chbatey
Agenda
• Spark intro
• Spark Cassandra connector
• Examples:
- Migrating from MySQL to Cassandra
- Cassandra schema migrations
- Import data from flat file into Cassandra
- Spark SQL on Cassandra
- Spark Streaming and Cassandra
@chbatey
Scalability & Performance
• Scalability
- No single point of failure
- No special nodes that become the bottleneck
- Work/data can be re-distributed
• Operational performance, i.e. single-digit ms
- Single node for query
- Single disk seek per query
@chbatey
Cassandra cannot join or aggregate
Client
Where do I go for the max?
@chbatey
Denormalisation
@chbatey
But but…
• Sometimes you don’t need answers in milliseconds
• Data models done wrong - how do I fix it?
• New requirements for old data?
• Ad-hoc operational queries
• Managers always want counts / maxes
@chbatey
Apache Spark
• 10x faster on disk, 100x faster in memory than Hadoop MR
• Works out of the box on EMR
• Fault Tolerant Distributed Datasets
• Batch, iterative and streaming analysis
• In Memory Storage and Disk
• Integrates with Most File and Storage Options
@chbatey
Components
(Diagram: Spark as the general execution engine, with Shark or Spark SQL, Streaming, ML, and Graph on top - all Cassandra compatible.)
@chbatey
Spark architecture
@chbatey
org.apache.spark.rdd.RDD
• Resilient Distributed Dataset (RDD)
• Created through transformations on data (map, filter, …) or other RDDs
• Immutable
• Partitioned
• Reusable
@chbatey
RDD Operations
• Transformations - similar to the Scala collections API
• Produce new RDDs
• filter, flatMap, map, distinct, groupBy, union, zip, reduceByKey, subtract
• Actions
• Require materialisation of the records to generate a value
• collect: Array[T], count, fold, reduce, …
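A minimal sketch of the lazy/eager split (assuming a SparkContext named sc, as in the later examples): transformations only build the lineage graph; nothing executes until an action runs.

val nums = sc.parallelize(1 to 10)   // RDD[Int]
val evens = nums.filter(_ % 2 == 0)  // transformation: lazy, nothing runs yet
val doubled = evens.map(_ * 2)       // another lazy transformation
println(doubled.count())             // action: triggers the job, prints 5
println(doubled.reduce(_ + _))       // action: 4 + 8 + 12 + 16 + 20 = 60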
@chbatey
Word count
val file: RDD[String] = sc.textFile("hdfs://...")

val counts: RDD[(String, Int)] = file
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://...")
@chbatey
Spark shell
Operator Graph: Optimisation and Fault Tolerance
(Diagram: an operator graph - map, join, filter, groupBy - split into stages 1-3, with cached partitions marked within each RDD.)
@chbatey
Partitioning
• Large data sets from S3, HDFS, Cassandra etc.
• Split into small chunks called partitions
• Each operation is done locally on a partition before combining with other partitions
• So partitioning is important for data locality
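For illustration, a sketch of inspecting and changing the partitioning of an RDD (the path and counts are hypothetical):

val rdd = sc.textFile("hdfs://...", minPartitions = 8) // ask for at least 8 partitions
println(rdd.partitions.length)                         // how many partitions we actually got
val repartitioned = rdd.repartition(16)                // full shuffle into 16 partitions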
@chbatey
Spark Streaming
@chbatey
Cassandra
@chbatey
Spark on Cassandra
• Server-Side filters (where clauses)
• Cross-table operations (JOIN, UNION, etc.)
• Data locality-aware (speed)
• Data transformation, aggregation, etc.
@chbatey
Spark Cassandra Connector
• Loads data from Cassandra to Spark
• Writes data from Spark to Cassandra
• Implicit Type Conversions and Object Mapping
• Implemented in Scala (offers a Java API)
• Open Source
• Exposes Cassandra Tables as Spark RDDs + Spark DStreams
@chbatey
Analytics Workload Isolation
@chbatey
Deployment
• Spark worker in each of the Cassandra nodes
• Partitions made up of LOCAL Cassandra data
(Diagram: four nodes, each running a Spark worker (S) colocated with Cassandra (C).)
@chbatey
Example Time
@chbatey
It is on GitHub
"org.apache.spark" %% "spark-core" % sparkVersion
"org.apache.spark" %% "spark-streaming" % sparkVersion
"org.apache.spark" %% "spark-sql" % sparkVersion
"org.apache.spark" %% "spark-streaming-kafka" % sparkVersion
"com.datastax.spark" % "spark-cassandra-connector_2.10" % connectorVersion
@chbatey
Boiler plate
import com.datastax.spark.connector.rdd._
import org.apache.spark._
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql._

object BasicCassandraInteraction extends App {
  // Cassandra host
  val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
  // Spark master, e.g. spark://host:port
  val sc = new SparkContext("local[4]", "AppName", conf)
  // cool stuff
}
@chbatey
Executing code against the driver
CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE KEYSPACE IF NOT EXISTS test WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }")
  session.execute("CREATE TABLE IF NOT EXISTS test.kv(key text PRIMARY KEY, value int)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('chris', 10)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('dan', 1)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('charlieS', 2)")
}
@chbatey
Reading data from Cassandra
CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE TABLE IF NOT EXISTS test.kv(key text PRIMARY KEY, value int)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('chris', 10)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('dan', 1)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('charlieS', 2)")
}

val rdd: CassandraRDD[CassandraRow] = sc.cassandraTable("test", "kv")
println(rdd.count())
println(rdd.first())
println(rdd.max()(new Ordering[CassandraRow] {
  override def compare(x: CassandraRow, y: CassandraRow): Int =
    x.getInt("value").compare(y.getInt("value"))
}))
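The explicit Ordering can be trimmed with Ordering.by; a sketch with the same behaviour:

println(rdd.max()(Ordering.by[CassandraRow, Int](_.getInt("value"))))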
@chbatey
Word Count + Save to Cassandra
val textFile: RDD[String] = sc.textFile("Spark-Readme.md")

// split on whitespace; the backslash was lost in extraction ("s+" would split on runs of 's')
val words: RDD[String] = textFile.flatMap(line => line.split("\\s+"))
val wordAndCount: RDD[(String, Int)] = words.map((_, 1))
val wordCounts: RDD[(String, Int)] = wordAndCount.reduceByKey(_ + _)

println(wordCounts.first())

wordCounts.saveToCassandra("test", "words", SomeColumns("word", "count"))
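The save assumes the target table already exists; a sketch of the schema it implies:

CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE TABLE IF NOT EXISTS test.words(word text PRIMARY KEY, count int)")
}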
@chbatey
Migrating from an RDBMS
create table store(
  store_name varchar(32) primary key,
  location varchar(32),
  store_type varchar(10));

create table staff(
  name varchar(32) primary key,
  favourite_colour varchar(32),
  job_title varchar(32));

create table customer_events(
  id MEDIUMINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  customer varchar(12),
  time timestamp,
  event_type varchar(16),
  store varchar(32),
  staff varchar(32),
  foreign key fk_store(store) references store(store_name),
  foreign key fk_staff(staff) references staff(name));
@chbatey
Denormalised table
CREATE TABLE IF NOT EXISTS customer_events(
  customer_id text,
  time timestamp,
  id uuid,
  event_type text,
  store_name text,
  store_type text,
  store_location text,
  staff_name text,
  staff_title text,
  PRIMARY KEY ((customer_id), time, id))
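This key lets us read one customer's events in time order, filtered server side; a sketch using the connector's where clause:

val chrisEvents = sc.cassandraTable("test", "customer_events").where("customer_id = ?", "chris")
chrisEvents.collect().foreach(println)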
@chbatey
Migration time


val customerEvents = new JdbcRDD(sc, () => { DriverManager.getConnection(mysqlJdbcString) },
  "select * from customer_events ce, staff, store where ce.store = store.store_name and ce.staff = staff.name " +
    "and ce.id >= ? and ce.id <= ?",
  0, 1000, 6, // lower bound, upper bound, number of partitions for the id range
  (r: ResultSet) => {
    // tuple order must line up with the SomeColumns below:
    // store_type maps to store_type, location maps to store_location
    (r.getString("customer"),
      r.getTimestamp("time"),
      UUID.randomUUID(),
      r.getString("event_type"),
      r.getString("store_name"),
      r.getString("store_type"),
      r.getString("location"),
      r.getString("staff"),
      r.getString("job_title"))
  })

customerEvents.saveToCassandra("test", "customer_events",
  SomeColumns("customer_id", "time", "id", "event_type", "store_name", "store_type",
    "store_location", "staff_name", "staff_title"))
@chbatey
Issues with denormalisation
• What happens when I need to query the denormalised data a different way?
@chbatey
Store it twice
CREATE TABLE IF NOT EXISTS customer_events(
customer_id text,
time timestamp,
id uuid,
event_type text,
store_name text,
store_type text,
store_location text,
staff_name text,
staff_title text,
PRIMARY KEY ((customer_id), time, id))


CREATE TABLE IF NOT EXISTS customer_events_by_staff(
customer_id text,
time timestamp,
id uuid,
event_type text,
store_name text,
store_type text,
store_location text,
staff_name text,
staff_title text,
PRIMARY KEY ((staff_name), time, id))
@chbatey
My reaction a year ago
@chbatey
Too simple
val events_by_customer = sc.cassandraTable("test", "customer_events")

events_by_customer.saveToCassandra("test", "customer_events_by_staff",
  SomeColumns("customer_id", "time", "id", "event_type", "staff_name",
    "staff_title", "store_location", "store_name", "store_type"))
@chbatey
Aggregations with Spark SQL
(Diagram: a Cassandra table row layout showing the partition key and clustering columns.)
@chbatey
Now now…
val cc = new CassandraSQLContext(sc)
cc.setKeyspace("test")
val rdd: SchemaRDD = cc.sql("SELECT store_name, event_type, count(store_name) from customer_events " +
  "GROUP BY store_name, event_type")
rdd.collect().foreach(println)
[SportsApp,WATCH_STREAM,1]
[SportsApp,LOGOUT,1]
[SportsApp,LOGIN,1]
[ChrisBatey.com,WATCH_MOVIE,1]
[ChrisBatey.com,LOGOUT,1]
[ChrisBatey.com,BUY_MOVIE,1]
[SportsApp,WATCH_MOVIE,2]
@chbatey
Lambda architecture
http://lambda-architecture.net/
@chbatey
Spark Streaming
@chbatey
Network word count
CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE TABLE IF NOT EXISTS test.network_word_count(word text PRIMARY KEY, number int)")
  session.execute("CREATE TABLE IF NOT EXISTS test.network_word_count_raw(time timeuuid PRIMARY KEY, raw text)")
}

val ssc = new StreamingContext(conf, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)
lines.map((UUIDs.timeBased(), _)).saveToCassandra("test", "network_word_count_raw")

val words = lines.flatMap(_.split("\\s+")) // split on whitespace; the backslash was lost in extraction
val countOfOne = words.map((_, 1))
val reduced = countOfOne.reduceByKey(_ + _)
reduced.saveToCassandra("test", "network_word_count")

// not shown on the slide, but required to start processing:
ssc.start()
ssc.awaitTermination()
@chbatey
Kafka
• Partitioned pub/sub system
• Very high throughput
• Very scalable
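The next examples serialise a CustomerEvent case class to JSON; a sketch of the shape its fields suggest (json4s assumed, matching the parse/extract calls later):

import org.json4s.NoTypeHints
import org.json4s.native.Serialization
import org.json4s.native.Serialization.write

// hypothetical case class inferred from the fields used in the examples
case class CustomerEvent(customer_id: String, staff_id: String, store_type: String,
                         group: String, content: String, event_type: String)

implicit val formats = Serialization.formats(NoTypeHints)
val joeBuyJson: String = write(CustomerEvent("joe", "chris", "WEB", "NEW_CUSTOMER", "lots of fancy content", event_type = "BUY"))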
@chbatey
Stream processing customer events
val joeBuy = write(CustomerEvent("joe", "chris", "WEB", "NEW_CUSTOMER", "lots of fancy content", event_type = "BUY"))
val joeBuy2 = write(CustomerEvent("joe", "chris", "WEB", "NEW_CUSTOMER", "lots of fancy content", event_type = "BUY"))
val joeSell = write(CustomerEvent("joe", "chris", "WEB", "NEW_CUSTOMER", "lots of fancy content", event_type = "SELL"))
val chrisBuy = write(CustomerEvent("chris", "chris", "WEB", "NEW_CUSTOMER", "lots of fancy content", event_type = "BUY"))

CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE TABLE IF NOT EXISTS streaming.customer_events_by_type ( nameAndType text primary key, number int)")
  session.execute("CREATE TABLE IF NOT EXISTS streaming.customer_events ( " +
    "customer_id text, " +
    "staff_id text, " +
    "store_type text, " +
    "group text static, " +
    "content text, " +
    "time timeuuid, " +
    "event_type text, " +
    "PRIMARY KEY ((customer_id), time) )")
}
@chbatey
Save + Process
val rawEvents: ReceiverInputDStream[(String, String)] =
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafka.kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_ONLY)

val events: DStream[CustomerEvent] = rawEvents.map({ case (k, v) =>
  parse(v).extract[CustomerEvent]
})

events.saveToCassandra("streaming", "customer_events")

val eventsByCustomerAndType = events.map(event => (s"${event.customer_id}-${event.event_type}", 1)).reduceByKey(_ + _)

eventsByCustomerAndType.saveToCassandra("streaming", "customer_events_by_type")
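Note that reduceByKey counts within each batch, so every save overwrites the row with that batch's count; a running total needs state across batches, e.g. updateStateByKey (a sketch; the checkpoint directory is hypothetical):

ssc.checkpoint("/tmp/streaming-checkpoint") // stateful operations require checkpointing
val runningTotals = eventsByCustomerAndType.updateStateByKey[Int] { (batchCounts: Seq[Int], state: Option[Int]) =>
  Some(state.getOrElse(0) + batchCounts.sum)
}
runningTotals.saveToCassandra("streaming", "customer_events_by_type")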
@chbatey
Summary
• Cassandra is an operational database
• Spark gives us the flexibility to do slower things
- Schema migrations
- Ad-hoc queries
- Report generation
• Spark streaming + Cassandra allow us to build online analytical platforms
@chbatey
Thanks for listening
• Follow me on twitter @chbatey
• Cassandra + fault tolerance posts aplenty:
• http://christopher-batey.blogspot.co.uk/
• Github for all examples:
• https://github.com/chbatey/spark-sandbox
• Cassandra resources: http://planetcassandra.org/
