Escape From Hadoop: 
Ultra Fast Data Analysis 
with Apache Cassandra & Spark 
slides by Kurt Russell Spitzer
presented by Piotr Kołaczkowski
DataStax
Why escape from Hadoop? 
Hadoop 
Many Moving Pieces 
Map Reduce 
Single Points of Failure 
Lots of Overhead 
And there is a way out!
Spark Provides a Simple and Efficient
Framework for Distributed Computations

Node Roles: 2 (Master and Worker)
In-Memory Caching: Yes!
Fault Tolerance: Yes!
Great Abstraction For Datasets? RDD!

[Diagram: a Spark Master coordinating several Spark Workers; each Worker runs a Spark Executor over a Resilient Distributed Dataset (RDD)]
Spark is Compatible with 
HDFS, JDBC, Parquet, CSVs, …
AND
APACHE CASSANDRA
Apache Cassandra is a Linearly Scaling and
Fault Tolerant NoSQL Database

Linearly Scaling:
The power of the database increases linearly with the number of machines.
2x machines = 2x throughput
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

Fault Tolerant:
Nodes down != Database Down
Datacenter down != Database Down
Apache Cassandra Architecture is Very Simple

Node Roles: 1
Replication: Tunable
Consistency: Tunable

[Diagram: a client talking to a ring of identical C* nodes]
DataStax OSS Connector
Spark to Cassandra
https://github.com/datastax/spark-cassandra-connector

Cassandra → Spark: a C* keyspace/table is exposed as an RDD[CassandraRow] (or an RDD of tuples)

Bundled and Supported with
DSE 4.5!
DataStax Connector 
Spark to Cassandra 
By the numbers: 
● 370 commits 
● 17 branches 
● 10 releases 
● 11 contributors 
● 168 issues (65 open) 
● 98 pull requests (6 open)
Spark Cassandra Connector uses the DataStax
Java Driver to Read from and Write to C*

Each Executor maintains a connection to the C* cluster.
RDDs are read as separate splits based on token ranges
(e.g. tokens 1–1000, 1001–2000, …) covering the full token range.

[Diagram: a Spark Executor using the DataStax Java Driver to read token ranges from the C* cluster]
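A quick, hypothetical sanity check (not from the original deck): since each split corresponds to a group of token ranges, you can ask the RDD how many splits the connector created.

scala> sc.cassandraTable("newyork","presidentlocations").partitions.length
// one Spark partition per group of token ranges; the exact number depends on your cluster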
Co-locate Spark and C* for Best Performance

Running Spark Workers on the same nodes as your C*
cluster will save network hops when reading and writing.

[Diagram: a Spark Worker running on each C* node, plus a separate Spark Master]
Setting up C* and Spark

DSE 4.5.0 and later:
Just start your nodes with
dse cassandra -k

Apache Cassandra:
Follow the excellent guide by Al Tobey
http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
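If you are not on DSE, here is a minimal sketch of wiring the connector in yourself (the assembly jar name and connection host below are illustrative assumptions, not part of the deck):

// spark-shell --jars spark-cassandra-connector-assembly.jar
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._   // adds cassandraTable and saveToCassandra

val conf = new SparkConf()
  .setAppName("escape-from-hadoop-demo")
  .set("spark.cassandra.connection.host", "127.0.0.1")  // any reachable C* node
val sc = new SparkContext(conf)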
We need a Distributed System For 
Analytics and Batch Jobs 
But it doesn’t have to be complicated!
Even count needs to be distributed
Ask me to write a MapReduce job for word count, I dare you.
You could make this easier by adding yet another technology to your
Hadoop stack (Hive, Pig, Impala), or
we could just do one-liners in the Spark shell.
Basics: Getting a Table and Counting

CREATE KEYSPACE newyork WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
USE newyork;
CREATE TABLE presidentlocations ( time int, location text, PRIMARY KEY (time) );
INSERT INTO presidentlocations (time, location) VALUES ( 1, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 2, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 3, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 4, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 5, 'Air Force 1' );
INSERT INTO presidentlocations (time, location) VALUES ( 6, 'Air Force 1' );
INSERT INTO presidentlocations (time, location) VALUES ( 7, 'Air Force 1' );
INSERT INTO presidentlocations (time, location) VALUES ( 8, 'NYC' );
INSERT INTO presidentlocations (time, location) VALUES ( 9, 'NYC' );
INSERT INTO presidentlocations (time, location) VALUES ( 10, 'NYC' );

scala> sc.cassandraTable("newyork","presidentlocations").count
res3: Long = 10
Basics: take() and toArray

scala> sc.cassandraTable("newyork","presidentlocations").take(1)
res2: Array[com.datastax.spark.connector.CassandraRow] = Array(CassandraRow{time: 9, location: NYC})

scala> sc.cassandraTable("newyork","presidentlocations").toArray
res3: Array[com.datastax.spark.connector.CassandraRow] = Array(
  CassandraRow{time: 9, location: NYC},
  CassandraRow{time: 3, location: White House},
  …,
  CassandraRow{time: 6, location: Air Force 1})
Basics: Getting Row Values out of a CassandraRow

scala> sc.cassandraTable("newyork","presidentlocations").first.get[Int]("time")
res5: Int = 9

Typed getters: get[Int], get[String], get[List[...]], … get[Any]
Got null? Use get[Option[Int]]
Supported types: http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkSupportedTypes.html
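A small hypothetical illustration of the null-safe getter (assuming the presidentlocations table from the previous slides):

scala> val row = sc.cassandraTable("newyork","presidentlocations").first
scala> row.get[Option[String]]("location")   // Some(NYC) — or None if the column were null
scala> row.get[Int]("time")                  // plain getters throw on null instead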
Copy A Table
Say we want to restructure our table or add a new column?

CREATE TABLE characterlocations (
  time int,
  character text,
  location text,
  PRIMARY KEY (time,character)
);

scala> sc.cassandraTable("newyork","presidentlocations")
         .map( row => (
            row.get[Int]("time"),
            "president",
            row.get[String]("location")))
         .saveToCassandra("newyork","characterlocations")

cqlsh:newyork> SELECT * FROM characterlocations ;

 time | character | location
------+-----------+-------------
    5 | president | Air Force 1
   10 | president | NYC
  …
Filter a Table
What if we want to filter based on a non-clustering key column?

scala> sc.cassandraTable("newyork","presidentlocations")
         .filter( _.getInt("time") > 7 )
         .toArray
res9: Array[com.datastax.spark.connector.CassandraRow] = Array(
  CassandraRow{time: 9, location: NYC},
  CassandraRow{time: 10, location: NYC},
  CassandraRow{time: 8, location: NYC}
)
Backfill a Table with a Different Key!
If we actually want quick access to timelines, we need a C* table with a different structure.

CREATE TABLE timelines (
  time int,
  character text,
  location text,
  PRIMARY KEY ((character), time)
);

sc.cassandraTable("newyork","characterlocations")
  .saveToCassandra("newyork","timelines")

cqlsh:newyork> select * from timelines;

 character | time | location
-----------+------+-------------
 president |    1 | White House
 president |    2 | White House
 president |    3 | White House
 president |    4 | White House
 president |    5 | Air Force 1
 president |    6 | Air Force 1
 president |    7 | Air Force 1
 president |    8 | NYC
 president |    9 | NYC
 president |   10 | NYC
Import a CSV
I have some data in another source which I could really use in my Cassandra table.

sc.textFile("file:///home/pkolaczk/ReallyImportantDocuments/PlisskenLocations.csv")
  .map(_.split(","))
  .map(line => (line(0),line(1),line(2)))
  .saveToCassandra("newyork","timelines", SomeColumns("character", "time", "location"))

cqlsh:newyork> select * from timelines where character = 'plissken';

 character | time | location
-----------+------+-----------------
  plissken |    1 | Federal Reserve
  plissken |    2 | Federal Reserve
  plissken |    3 | Federal Reserve
  plissken |    4 | Court
  plissken |    5 | Court
  plissken |    6 | Court
  plissken |    7 | Court
  plissken |    8 | Stealth Glider
  plissken |    9 | NYC
  plissken |   10 | NYC
Perform a Join with MySQL
Maybe a little more than one line …

import java.sql._
import org.apache.spark.rdd.JdbcRDD

Class.forName("com.mysql.jdbc.Driver").newInstance();

val quotes = new JdbcRDD(
  sc,
  getConnection = () => DriverManager.getConnection("jdbc:mysql://localhost/escape_from_ny?user=root"),
  sql = "SELECT * FROM quotes WHERE ? <= ID and ID <= ?",
  lowerBound = 0,
  upperBound = 100,
  numPartitions = 5,
  mapRow = (r: ResultSet) => (r.getInt(2), r.getString(3))
)
quotes: org.apache.spark.rdd.JdbcRDD[(Int, String)] = JdbcRDD[9] at JdbcRDD at <console>:23
Perform a Join with MySQL
Maybe a little more than one line …

val locations = sc.cassandraTable("newyork","timelines")
  .filter(_.getString("character") == "plissken")
  .map(row => (row.getInt("time"), row.getString("location")))

quotes.join(locations)
  .take(1)
  .foreach(println)

(5, (
  Bob Hauk: There was an accident. About an hour ago, a small jet went down inside New York City.
  The President was on board.
  Snake Plissken: The president of what?,
  Court
))
Easy Objects with Case Classes
We have the technology to make this even easier!

case class TimelineRow(character: String, time: Int, location: String)

sc.cassandraTable[TimelineRow]("newyork","timelines")
  .filter(_.character == "plissken")
  .filter(_.time == 8)
  .toArray

res13: Array[TimelineRow] = Array(TimelineRow(plissken,8,Stealth Glider))
A Map Reduce for Word Count …

scala> sc.cassandraTable("newyork","presidentlocations")
         .map(_.getString("location"))
         .flatMap(_.split(" "))
         .map((_,1))
         .reduceByKey(_ + _)
         .toArray

res17: Array[(String, Int)] = Array((1,3), (House,4), (NYC,3), (Force,3), (White,4), (Air,3))
Selected RDD transformations 
● min(), max(), count() 
● reduce[T](f: (T, T) ⇒ T): T 
● fold[T](zeroValue: T)(op: (T, T) ⇒ T): T 
● aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U): U 
● flatMap[U](func: (T) ⇒ TraversableOnce[U]): RDD[U] 
● mapPartitions[U]( 
f: (Iterator[T]) ⇒ Iterator[U], 
preservesPartitioning: Boolean): RDD[U] 
● sortBy[K](f: (T) ⇒ K, ascending: Boolean = true) 
● groupBy[K](f: (T) ⇒ K): RDD[(K, Iterable[T])] 
● intersection(other: RDD[T]): RDD[T] 
● union(other: RDD[T]): RDD[T] 
● subtract(other: RDD[T]): RDD[T] 
● zip[U](other: RDD[U]): RDD[(T, U)] 
● keyBy[K](f: (T) ⇒ K): RDD[(K, T)] 
● sample(withReplacement: Boolean, fraction: Double)
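A hypothetical sketch (not in the deck) chaining a few of the transformations above on the same presidentlocations data:

scala> sc.cassandraTable("newyork","presidentlocations")
         .map(row => (row.getInt("time"), row.getString("location")))  // pair up (time, location)
         .groupBy(_._2)                                                // location -> all visits
         .mapValues(_.size)                                            // how many entries per location
         .collect
// expect (White House,4), (Air Force 1,3), (NYC,3) — ordering may vary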
RDD can do even more...
How Fast is it? 
● Reading big data from Cassandra: 
– Spark ~2x faster than Hadoop 
● Minimum latency (1 node, vnodes disabled, tiny data): 
– Spark: 0.7s 
– Hadoop: ~20s 
● Minimum latency (1 node, vnodes enabled): 
– Spark: 1s 
– Hadoop: ~8 minutes 
● In memory processing: 
– up to 100x faster than Hadoop
source: https://amplab.cs.berkeley.edu/benchmark/
In-memory Processing
Call cache or persist(storageLevel) to store RDD data in memory.

val rdd = sc.cassandraTable("newyork","presidentlocations")
  .filter(...)
  .map(...)
  .reduceByKey(...)
  .cache

rdd.first // slow, loads data from Cassandra and keeps it in memory
rdd.first // fast, doesn't read from Cassandra, reads from memory

Multiple StorageLevels available:
● MEMORY_ONLY
● MEMORY_ONLY_SER
● MEMORY_AND_DISK
● MEMORY_AND_DISK_SER
● DISK_ONLY
Replicated variants are also available: just append _2 to the constant name.
Fault Tolerance

[Diagram: a Cassandra RDD whose partitions are spread across Node 1, Node 2 and Node 3 with Replication Factor = 2; filter and map build a FilteredRDD and a MappedRDD on top of it, so if a node is lost the missing partitions can be recomputed from the surviving replicas]
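A hypothetical way to see the lineage Spark relies on for that recomputation (not in the deck):

scala> val mapped = sc.cassandraTable("newyork","presidentlocations")
         .filter(_.getInt("time") > 3)
         .map(_.getString("location"))
scala> mapped.toDebugString
// prints the chain of parent RDDs (Cassandra RDD -> filtered -> mapped; exact class names vary
// by Spark version); lost partitions are rebuilt by re-running this chain against surviving replicas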
Standalone App Example
https://github.com/RussellSpitzer/spark-cassandra-csv

[Diagram: a CSV of favorite cars (Car, Model, Color — Dodge, Caravan, Red; Ford, F150, Black; Toyota, Prius, Green) is read into an RDD[CassandraRow] and written, via a column mapping, to a FavoriteCars table in Cassandra]
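A minimal standalone-app sketch of the same CSV-to-Cassandra flow (the file name, keyspace, table and column names below are illustrative assumptions; see the linked repo for the real example):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object FavoriteCarsLoader {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("spark-cassandra-csv")
      .set("spark.cassandra.connection.host", "127.0.0.1")   // point at any C* node
    val sc = new SparkContext(conf)

    sc.textFile("file:///tmp/favorite_cars.csv")              // e.g. "Dodge, Caravan, Red"
      .map(_.split(",").map(_.trim))
      .map(cols => (cols(0), cols(1), cols(2)))
      .saveToCassandra("newyork", "favoritecars", SomeColumns("car", "model", "color"))

    sc.stop()
  }
}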
Useful modules / projects 
● Java API 
– for diehard Java developers 
● Python API 
– for those allergic to static types 
● Shark 
– Hive QL on Spark (discontinued) 
● Spark SQL 
– new SQL engine based on Catalyst query planner 
● Spark Streaming 
– microbatch streaming framework 
● MLLib 
– machine learning library 
● GraphX 
– efficient representation and processing of graph data
We're hiring! 
http://www.datastax.com/company/careers
Thanks for listening! 
There is plenty more we can do with Spark but … 
Questions?
