Escape From Hadoop: 
Ultra Fast Data Analysis 
with Apache Cassandra & Spark 
slides by Kurt Russell Spitzer
presented by Piotr Kołaczkowski
DataStax
Why escape from Hadoop? 
Hadoop 
Many Moving Pieces 
Map Reduce 
Single Points of Failure 
Lots of Overhead 
And there is a way out!
Spark Provides a Simple and Efficient
Framework for Distributed Computations

Node Roles: 2 (Master and Worker)
In-Memory Caching: Yes!
Fault Tolerance: Yes!
Great Abstraction For Datasets? RDD!

[Diagram: a Spark Master coordinating several Spark Workers; each Worker runs a Spark Executor over a Resilient Distributed Dataset (RDD)]
Spark is Compatible with 
HDFS, JDBC, Parquet, CSVs, …
AND
APACHE CASSANDRA
Apache Cassandra is a Linearly Scaling and
Fault Tolerant NoSQL Database

Linearly Scaling:
The power of the database increases linearly with the number of machines.
2x machines = 2x throughput
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

Fault Tolerant:
Nodes down != Database Down
Datacenter down != Database Down
Apache Cassandra Architecture is Very Simple

Node Roles: 1
Replication: Tunable
Consistency: Tunable

[Diagram: a client talking to a ring of identical C* nodes]
DataStax OSS Connector
Spark to Cassandra
https://github.com/datastax/spark-cassandra-connector

Cassandra → Spark: a C* keyspace/table is exposed as an RDD[CassandraRow] (or an RDD of tuples)

Bundled and Supported with
DSE 4.5!
DataStax Connector 
Spark to Cassandra 
By the numbers: 
● 370 commits 
● 17 branches 
● 10 releases 
● 11 contributors 
● 168 issues (65 open) 
● 98 pull requests (6 open)
Spark Cassandra Connector uses the DataStax
Java Driver to Read from and Write to C*

Each Executor maintains a connection to the C* cluster.
RDDs are read as separate splits based on token ranges
(e.g. tokens 1–1000, 1001–2000, …) covering the full token range.

[Diagram: a Spark Executor using the DataStax Java Driver to read token ranges from the C* cluster]
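A quick, hypothetical sanity check (not from the original deck): since each split corresponds to a group of token ranges, you can ask the RDD how many splits the connector created.

scala> sc.cassandraTable("newyork","presidentlocations").partitions.length
// one Spark partition per group of token ranges; the exact number depends on your cluster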
Co-locate Spark and C* for Best Performance

Running Spark Workers on the same nodes as your C*
cluster will save network hops when reading and writing.

[Diagram: a Spark Worker running on each C* node, plus a separate Spark Master]
Setting up C* and Spark

DSE 4.5.0 and later:
Just start your nodes with
dse cassandra -k

Apache Cassandra:
Follow the excellent guide by Al Tobey
http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
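If you are not on DSE, here is a minimal sketch of wiring the connector in yourself (the assembly jar name and connection host below are illustrative assumptions, not part of the deck):

// spark-shell --jars spark-cassandra-connector-assembly.jar
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._   // adds cassandraTable and saveToCassandra

val conf = new SparkConf()
  .setAppName("escape-from-hadoop-demo")
  .set("spark.cassandra.connection.host", "127.0.0.1")  // any reachable C* node
val sc = new SparkContext(conf)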
We need a Distributed System For 
Analytics and Batch Jobs 
But it doesn’t have to be complicated!
Even count needs to be distributed
Ask me to write a MapReduce job for word count, I dare you.
You could make this easier by adding yet another technology to your
Hadoop stack (Hive, Pig, Impala), or
we could just do one-liners in the Spark shell.
Basics: Getting a Table and Counting

CREATE KEYSPACE newyork WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
USE newyork;
CREATE TABLE presidentlocations ( time int, location text, PRIMARY KEY (time) );
INSERT INTO presidentlocations (time, location) VALUES ( 1, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 2, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 3, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 4, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 5, 'Air Force 1' );
INSERT INTO presidentlocations (time, location) VALUES ( 6, 'Air Force 1' );
INSERT INTO presidentlocations (time, location) VALUES ( 7, 'Air Force 1' );
INSERT INTO presidentlocations (time, location) VALUES ( 8, 'NYC' );
INSERT INTO presidentlocations (time, location) VALUES ( 9, 'NYC' );
INSERT INTO presidentlocations (time, location) VALUES ( 10, 'NYC' );

scala> sc.cassandraTable("newyork","presidentlocations").count
res3: Long = 10
Basics: take() and toArray

scala> sc.cassandraTable("newyork","presidentlocations").take(1)
res2: Array[com.datastax.spark.connector.CassandraRow] = Array(CassandraRow{time: 9, location: NYC})

scala> sc.cassandraTable("newyork","presidentlocations").toArray
res3: Array[com.datastax.spark.connector.CassandraRow] = Array(
  CassandraRow{time: 9, location: NYC},
  CassandraRow{time: 3, location: White House},
  …,
  CassandraRow{time: 6, location: Air Force 1})
Basics: Getting Row Values out of a CassandraRow

scala> sc.cassandraTable("newyork","presidentlocations").first.get[Int]("time")
res5: Int = 9

Typed getters: get[Int], get[String], get[List[...]], … get[Any]
Got null? Use get[Option[Int]]
Supported types: http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkSupportedTypes.html
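A small hypothetical illustration of the null-safe getter (assuming the presidentlocations table from the previous slides):

scala> val row = sc.cassandraTable("newyork","presidentlocations").first
scala> row.get[Option[String]]("location")   // Some(NYC) — or None if the column were null
scala> row.get[Int]("time")                  // plain getters throw on null instead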
Copy A Table
Say we want to restructure our table or add a new column?

CREATE TABLE characterlocations (
  time int,
  character text,
  location text,
  PRIMARY KEY (time,character)
);

scala> sc.cassandraTable("newyork","presidentlocations")
         .map( row => (
            row.get[Int]("time"),
            "president",
            row.get[String]("location")))
         .saveToCassandra("newyork","characterlocations")

cqlsh:newyork> SELECT * FROM characterlocations ;

 time | character | location
------+-----------+-------------
    5 | president | Air Force 1
   10 | president | NYC
  …
Filter a Table
What if we want to filter based on a non-clustering key column?

scala> sc.cassandraTable("newyork","presidentlocations")
         .filter( _.getInt("time") > 7 )
         .toArray
res9: Array[com.datastax.spark.connector.CassandraRow] = Array(
  CassandraRow{time: 9, location: NYC},
  CassandraRow{time: 10, location: NYC},
  CassandraRow{time: 8, location: NYC}
)
Backfill a Table with a Different Key!
If we actually want quick access to timelines, we need a C* table with a different structure.

CREATE TABLE timelines (
  time int,
  character text,
  location text,
  PRIMARY KEY ((character), time)
);

sc.cassandraTable("newyork","characterlocations")
  .saveToCassandra("newyork","timelines")

cqlsh:newyork> select * from timelines;

 character | time | location
-----------+------+-------------
 president |    1 | White House
 president |    2 | White House
 president |    3 | White House
 president |    4 | White House
 president |    5 | Air Force 1
 president |    6 | Air Force 1
 president |    7 | Air Force 1
 president |    8 | NYC
 president |    9 | NYC
 president |   10 | NYC
Import a CSV
I have some data in another source which I could really use in my Cassandra table.

sc.textFile("file:///home/pkolaczk/ReallyImportantDocuments/PlisskenLocations.csv")
  .map(_.split(","))
  .map(line => (line(0),line(1),line(2)))
  .saveToCassandra("newyork","timelines", SomeColumns("character", "time", "location"))

cqlsh:newyork> select * from timelines where character = 'plissken';

 character | time | location
-----------+------+-----------------
  plissken |    1 | Federal Reserve
  plissken |    2 | Federal Reserve
  plissken |    3 | Federal Reserve
  plissken |    4 | Court
  plissken |    5 | Court
  plissken |    6 | Court
  plissken |    7 | Court
  plissken |    8 | Stealth Glider
  plissken |    9 | NYC
  plissken |   10 | NYC
Perform a Join with MySQL
Maybe a little more than one line …

import java.sql._
import org.apache.spark.rdd.JdbcRDD

Class.forName("com.mysql.jdbc.Driver").newInstance();

val quotes = new JdbcRDD(
  sc,
  getConnection = () => DriverManager.getConnection("jdbc:mysql://localhost/escape_from_ny?user=root"),
  sql = "SELECT * FROM quotes WHERE ? <= ID and ID <= ?",
  lowerBound = 0,
  upperBound = 100,
  numPartitions = 5,
  mapRow = (r: ResultSet) => (r.getInt(2), r.getString(3))
)
quotes: org.apache.spark.rdd.JdbcRDD[(Int, String)] = JdbcRDD[9] at JdbcRDD at <console>:23
Perform a Join with MySQL
Maybe a little more than one line …

val locations = sc.cassandraTable("newyork","timelines")
  .filter(_.getString("character") == "plissken")
  .map(row => (row.getInt("time"), row.getString("location")))

quotes.join(locations)
  .take(1)
  .foreach(println)

(5, (
  Bob Hauk: There was an accident. About an hour ago, a small jet went down inside New York City.
  The President was on board.
  Snake Plissken: The president of what?,
  Court
))
Easy Objects with Case Classes
We have the technology to make this even easier!

case class TimelineRow(character: String, time: Int, location: String)

sc.cassandraTable[TimelineRow]("newyork","timelines")
  .filter(_.character == "plissken")
  .filter(_.time == 8)
  .toArray

res13: Array[TimelineRow] = Array(TimelineRow(plissken,8,Stealth Glider))
A Map Reduce for Word Count …

scala> sc.cassandraTable("newyork","presidentlocations")
         .map(_.getString("location"))
         .flatMap(_.split(" "))
         .map((_,1))
         .reduceByKey(_ + _)
         .toArray

res17: Array[(String, Int)] = Array((1,3), (House,4), (NYC,3), (Force,3), (White,4), (Air,3))
Selected RDD transformations 
● min(), max(), count() 
● reduce[T](f: (T, T) ⇒ T): T 
● fold[T](zeroValue: T)(op: (T, T) ⇒ T): T 
● aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U): U 
● flatMap[U](func: (T) ⇒ TraversableOnce[U]): RDD[U] 
● mapPartitions[U]( 
f: (Iterator[T]) ⇒ Iterator[U], 
preservesPartitioning: Boolean): RDD[U] 
● sortBy[K](f: (T) ⇒ K, ascending: Boolean = true) 
● groupBy[K](f: (T) ⇒ K): RDD[(K, Iterable[T])] 
● intersection(other: RDD[T]): RDD[T] 
● union(other: RDD[T]): RDD[T] 
● subtract(other: RDD[T]): RDD[T] 
● zip[U](other: RDD[U]): RDD[(T, U)] 
● keyBy[K](f: (T) ⇒ K): RDD[(K, T)] 
● sample(withReplacement: Boolean, fraction: Double)
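A hypothetical sketch (not in the deck) chaining a few of the transformations above on the same presidentlocations data:

scala> sc.cassandraTable("newyork","presidentlocations")
         .map(row => (row.getInt("time"), row.getString("location")))  // pair up (time, location)
         .groupBy(_._2)                                                // location -> all visits
         .mapValues(_.size)                                            // how many entries per location
         .collect
// expect (White House,4), (Air Force 1,3), (NYC,3) — ordering may vary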
RDD can do even more...
How Fast is it? 
● Reading big data from Cassandra: 
– Spark ~2x faster than Hadoop 
● Minimum latency (1 node, vnodes disabled, tiny data): 
– Spark: 0.7s 
– Hadoop: ~20s 
● Minimum latency (1 node, vnodes enabled): 
– Spark: 1s 
– Hadoop: ~8 minutes 
● In memory processing: 
– up to 100x faster than Hadoop
source: https://amplab.cs.berkeley.edu/benchmark/
In-memory Processing
Call cache or persist(storageLevel) to store RDD data in memory.

val rdd = sc.cassandraTable("newyork","presidentlocations")
  .filter(...)
  .map(...)
  .reduceByKey(...)
  .cache

rdd.first // slow, loads data from Cassandra and keeps it in memory
rdd.first // fast, doesn't read from Cassandra, reads from memory

Multiple StorageLevels available:
● MEMORY_ONLY
● MEMORY_ONLY_SER
● MEMORY_AND_DISK
● MEMORY_AND_DISK_SER
● DISK_ONLY
Replicated variants are also available: just append _2 to the constant name.
Fault Tolerance

[Diagram: a Cassandra RDD whose partitions are spread across Node 1, Node 2 and Node 3 with Replication Factor = 2; filter and map build a FilteredRDD and a MappedRDD on top of it, so if a node is lost the missing partitions can be recomputed from the surviving replicas]
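A hypothetical way to see the lineage Spark relies on for that recomputation (not in the deck):

scala> val mapped = sc.cassandraTable("newyork","presidentlocations")
         .filter(_.getInt("time") > 3)
         .map(_.getString("location"))
scala> mapped.toDebugString
// prints the chain of parent RDDs (Cassandra RDD -> filtered -> mapped; exact class names vary
// by Spark version); lost partitions are rebuilt by re-running this chain against surviving replicas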
Standalone App Example
https://github.com/RussellSpitzer/spark-cassandra-csv

[Diagram: a CSV of favorite cars (Car, Model, Color — Dodge, Caravan, Red; Ford, F150, Black; Toyota, Prius, Green) is read into an RDD[CassandraRow] and written, via a column mapping, to a FavoriteCars table in Cassandra]
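A minimal standalone-app sketch of the same CSV-to-Cassandra flow (the file name, keyspace, table and column names below are illustrative assumptions; see the linked repo for the real example):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object FavoriteCarsLoader {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("spark-cassandra-csv")
      .set("spark.cassandra.connection.host", "127.0.0.1")   // point at any C* node
    val sc = new SparkContext(conf)

    sc.textFile("file:///tmp/favorite_cars.csv")              // e.g. "Dodge, Caravan, Red"
      .map(_.split(",").map(_.trim))
      .map(cols => (cols(0), cols(1), cols(2)))
      .saveToCassandra("newyork", "favoritecars", SomeColumns("car", "model", "color"))

    sc.stop()
  }
}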
Useful modules / projects 
● Java API 
– for diehard Java developers 
● Python API 
– for those allergic to static types 
● Shark 
– Hive QL on Spark (discontinued) 
● Spark SQL 
– new SQL engine based on Catalyst query planner 
● Spark Streaming 
– microbatch streaming framework 
● MLLib 
– machine learning library 
● GraphX 
– efficient representation and processing of graph data
We're hiring! 
http://www.datastax.com/company/careers
Thanks for listening! 
There is plenty more we can do with Spark but … 
Questions?
