Cassandra And Spark Dataframes
Russell Spitzer
Software Engineer @ Datastax
Tungsten Gives DataFrames Off-Heap Power!
Memory can be compared off-heap and bitwise!
Code generation!
The Core is the Cassandra Source
https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra

/**
 * Implements [[BaseRelation]], [[InsertableRelation]] and [[PrunedFilteredScan]].
 * It inserts data into and scans a Cassandra table. If filterPushdown is true, it pushes
 * some filters down to CQL.
 */

[Diagram: a DataFrame is backed by a CassandraSourceRelation (source: org.apache.spark.sql.cassandra), which wraps a CassandraTableScanRDD and its Configuration]
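Since the source also implements InsertableRelation, writes go through the same class. A minimal sketch of the write path (assuming a keyspace test and table words that already exist; df is any DataFrame whose schema matches the table):

df.write
  .format("org.apache.spark.sql.cassandra") // same source class handles reads and writes
  .options(Map("table" -> "words", "keyspace" -> "test"))
  .save() // inserts the DataFrame's rows into test.words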
Configuration Can Be Done on a Per Source Level
Properties are scoped as clusterName:keyspaceName/propertyName.
Example: Changing Cluster/Keyspace Level Properties

val conf = new SparkConf()
  .set("ClusterOne/spark.cassandra.input.split.size_in_mb", "32")
  .set("default:test/spark.cassandra.input.split.size_in_mb", "128")

val lastdf = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "table" -> "words",
    "keyspace" -> "test",
    "cluster" -> "ClusterOne"
  ))
  .load()
With cluster = "ClusterOne", the source resolves properties in the ClusterOne namespace:
spark.cassandra.input.split.size_in_mb = 32

With cluster = "default" and keyspace = "test", the default:test setting applies:
spark.cassandra.input.split.size_in_mb = 128

With cluster = "default" and keyspace = "other", neither override matches, so the connector default is used.
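Because properties are resolved per source, two DataFrames reading the same table under different cluster names can coexist in one application. A sketch (the cluster name "ClusterOne" is a placeholder that would need its own connection settings, e.g. ClusterOne/spark.cassandra.connection.host):

// Reads test.words with the 32 MB split size from the ClusterOne namespace
val dfOne = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "words", "keyspace" -> "test", "cluster" -> "ClusterOne"))
  .load()

// Reads test.words with the 128 MB split size from default:test
val dfDefault = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "words", "keyspace" -> "test"))
  .load()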
Predicate Pushdown Is Automatic!
SELECT * FROM cassandraTable WHERE clusteringKey > 100
[Diagram: the DataFrame plan — DataFromC* → Filter (clusteringKey > 100) → Show]
Catalyst recognizes the filter over the Cassandra source and adds the where clause to CQL: "clusteringKey > 100"
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/PredicatePushDown.scala
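The same pushdown can be driven from the DataFrame API; a sketch (test.events is a hypothetical table with a clustering column clusteringKey; explain prints the plan so you can confirm the filter reached the source):

val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "events", "keyspace" -> "test")) // placeholder table
  .load()

val filtered = df.filter(df("clusteringKey") > 100) // Catalyst hands this predicate to the source
filtered.explain(true) // the physical plan lists the filters pushed to Cassandra
filtered.show()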
What can be pushed down?
1. Only push down non-partition key column predicates with =, >, <, >=, <= predicates.
2. Only push down primary key column predicates with = or IN predicates.
3. If there are regular columns in the pushdown predicates, they should have at least one EQ
expression on an indexed column and no IN predicates.
4. All partition column predicates must be included in the predicates to be pushed down; only
the last part of the partition key can be an IN predicate. For each partition column, only one
predicate is allowed.
5. For clustering column predicates, only the last predicate can be a non-EQ predicate (including
IN), and the preceding column predicates must be EQ predicates.
6. If there is only one clustering column predicate, it can be any non-IN predicate.
Nothing is pushed down if there is any OR condition or NOT IN condition.
7. Multiple predicates for the same column cannot be pushed down if any of them is an
equality or IN predicate.
What can be pushed down?
In short: if you could write it in CQL, it will get pushed down, as the sketch below illustrates.
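For example, against a hypothetical table with partition key pk and clustering key ck, the rules work out like this:

df.filter("pk = 5 AND ck > 100").show() // pushed down: a valid CQL restriction
df.filter("ck > 100 OR pk = 5").show()  // not pushed down: OR disables pushdown (rule 6)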
What are we Pushing Down To?
CassandraTableScanRDD
All of the underlying code is the same as with sc.cassandraTable, so everything about Reading and Writing applies.
https://academy.datastax.com/
Watch me talk about this in the privacy of your own home!
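The RDD-level equivalent, as a sketch (assuming the connector's implicits are imported and a test.events table exists):

import com.datastax.spark.connector._

val rdd = sc.cassandraTable("test", "events") // a CassandraTableScanRDD
  .where("clusteringKey > ?", 100)            // same CQL pushdown as the DataFrame source
rdd.take(10).foreach(println)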
How the
Spark Cassandra Connector
Reads Data
Spark RDDs
Represent a Large
Amount of Data
Partitioned into Chunks
[Diagram: an RDD of nine numbered partitions spread across Node 1 through Node 4]
Cassandra Data is Distributed By Token Range
[Diagram: a token ring running from 0 through 999; Node 1 through Node 4 each own slices of the ring — one contiguous range per node without vnodes, many small scattered ranges per node with vnodes]
The Connector Uses Information on the Node to Make Spark Partitions

spark.cassandra.input.split_size_in_mb = 1
Reported density is 100 tokens per mb

[Diagram: Node 1 owns token ranges 0-50, 120-220, 300-500, and 780-830. At 100 tokens per MB and a 1 MB split target, the connector walks the ranges and groups roughly 100 tokens into each Spark partition — ranges larger than the target, like 300-500, are split (into 300-400 and 400-500), and small neighboring ranges are combined, yielding Spark partitions 1 through 4]
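Raising the split size from an application is just a conf setting; a sketch (the dotted spelling spark.cassandra.input.split.size_in_mb is the property name used earlier in this deck):

val conf = new SparkConf()
  .set("spark.cassandra.input.split.size_in_mb", "64") // target ~64 MB of token range data per Spark partition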
Data is Retrieved Using the DataStax Java Driver

spark.cassandra.input.page.row.size = 50

[Diagram: Spark partition 4 covers token ranges 780-830 and 0-50 on Node 1. The executor issues one token range query per range:]

SELECT * FROM keyspace.table WHERE token(pk) > 780 AND token(pk) <= 830
SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50

The driver pages the results back 50 CQL rows at a time until each range is exhausted.
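The page size is tunable the same way; a sketch (this maps to the Java Driver's fetch size, so larger values mean fewer round trips per token range):

val conf = new SparkConf()
  .set("spark.cassandra.input.page.row.size", "1000") // CQL rows fetched per driver page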
How The Spark
Cassandra Connector
Writes Data
Spark RDDs
Represent a Large
Amount of Data
Partitioned into Chunks
[Diagram: the same picture as in the read section — an RDD of nine partitions across Node 1 through Node 4]
The Spark Cassandra Connector's saveToCassandra method can be called on almost all RDDs:
rdd.saveToCassandra("Keyspace", "Table")
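A minimal end-to-end sketch (the keyspace test and table kv are placeholders that must already exist, with matching column names):

import com.datastax.spark.connector._

val data = sc.parallelize(Seq((1, "one"), (2, "two")))
data.saveToCassandra("test", "kv", SomeColumns("key", "value")) // maps tuple fields to columns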
[Diagram: Spark partition 1 being written from Node 1]
A Java Driver connection is made to the local node and a prepared statement is built for the target table.
Batches are built from data in Spark partitions.
[Diagram: the partition holds CQL rows such as 1,1,1 / 1,2,1 / 2,1,1 / 3,8,1 / 3,2,1 / 3,4,1 / 3,5,1 / 3,1,1 / 1,4,1 / 5,4,1 / 2,4,1 / 8,4,1 / 9,4,1 / 3,9,1 flowing into the Java Driver]
By default these batches only contain CQL rows which share the same partition key (the PK=1 batch collects 1,1,1 and 1,2,1).

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5
When an element is not part of an existing batch, a new batch is started (a PK=2 batch opens alongside PK=1).
If a batch reaches batch.size.rows or batch.size.bytes, it is executed by the driver (the PK=3 batch fills with 3,2,1 / 3,4,1 / 3,5,1 / 3,8,1 and is flushed).
If more than batch.buffer.size batches are currently being built, the largest batch is executed by the Java Driver (opening a PK=5 batch pushes the count past 3, so the fullest batch, PK=1, is flushed).
If more batches are currently being executed by the Java Driver than concurrent.writes, we wait until one of the in-flight requests completes ("Write Acknowledged") before sending the next batch.
The last parameter, throughput_mb_per_sec, blocks further batches if we have written more than that much in the past second — even after a write is acknowledged, the writer stays in a "Block" state until the rate drops back under the cap.
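All five write knobs can be set on the SparkConf; a sketch using the values from the diagrams (production values would normally be much larger):

val conf = new SparkConf()
  .set("spark.cassandra.output.batch.grouping.key", "partition") // group rows by partition key
  .set("spark.cassandra.output.batch.size.rows", "4")            // flush a batch at 4 rows
  .set("spark.cassandra.output.batch.buffer.size", "3")          // at most 3 batches under construction
  .set("spark.cassandra.output.concurrent.writes", "2")          // at most 2 batches in flight
  .set("spark.cassandra.output.throughput_mb_per_sec", "5")      // rate-limit writes to 5 MB/s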
Thanks for Coming and I Hope You Have a Great Time at C* Summit
https://cassandrasummit-datastax.com/agenda/the-spark-cassandra-connector-past-present-and-future/
Also ask these guys really hard questions:
Jacek, Piotr, Alex
More Related Content

What's hot (20)

PPTX
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Matthias Niehoff
 
PDF
Cassandra spark connector
Duyhai Doan
 
PDF
Cassandra and Spark: Optimizing for Data Locality
Russell Spitzer
 
PPTX
Real time data pipeline with spark streaming and cassandra with mesos
Rahul Kumar
 
PDF
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Spark Summit
 
PDF
Spark and Cassandra 2 Fast 2 Furious
Russell Spitzer
 
PDF
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
StampedeCon
 
PDF
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Don Drake
 
PDF
Spark Streaming with Cassandra
Jacek Lewandowski
 
PDF
Spark And Cassandra: 2 Fast, 2 Furious
Jen Aman
 
PDF
OLAP with Cassandra and Spark
Evan Chan
 
PDF
Time series with Apache Cassandra - Long version
Patrick McFadin
 
PPTX
Lightning fast analytics with Cassandra and Spark
Victor Coustenoble
 
PDF
Spark Cassandra Connector: Past, Present, and Future
Russell Spitzer
 
PDF
DataEngConf SF16 - Spark SQL Workshop
Hakka Labs
 
PPTX
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
StampedeCon
 
PDF
Lightning fast analytics with Spark and Cassandra
Rustam Aliyev
 
PPTX
BI, Reporting and Analytics on Apache Cassandra
Victor Coustenoble
 
PPTX
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Ilya Ganelin
 
PPTX
Using Spark to Load Oracle Data into Cassandra
Jim Hatcher
 
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Matthias Niehoff
 
Cassandra spark connector
Duyhai Doan
 
Cassandra and Spark: Optimizing for Data Locality
Russell Spitzer
 
Real time data pipeline with spark streaming and cassandra with mesos
Rahul Kumar
 
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Spark Summit
 
Spark and Cassandra 2 Fast 2 Furious
Russell Spitzer
 
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
StampedeCon
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Don Drake
 
Spark Streaming with Cassandra
Jacek Lewandowski
 
Spark And Cassandra: 2 Fast, 2 Furious
Jen Aman
 
OLAP with Cassandra and Spark
Evan Chan
 
Time series with Apache Cassandra - Long version
Patrick McFadin
 
Lightning fast analytics with Cassandra and Spark
Victor Coustenoble
 
Spark Cassandra Connector: Past, Present, and Future
Russell Spitzer
 
DataEngConf SF16 - Spark SQL Workshop
Hakka Labs
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
StampedeCon
 
Lightning fast analytics with Spark and Cassandra
Rustam Aliyev
 
BI, Reporting and Analytics on Apache Cassandra
Victor Coustenoble
 
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Ilya Ganelin
 
Using Spark to Load Oracle Data into Cassandra
Jim Hatcher
 

Viewers also liked (19)

PDF
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
DataStax
 
PDF
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
DataStax
 
PDF
Bulk Loading into Cassandra
Brian Hess
 
PDF
Spark cassandra integration, theory and practice
Duyhai Doan
 
PDF
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
DataStax Academy
 
PDF
Datastax enterprise presentation
Duyhai Doan
 
PDF
Extending Word2Vec for Performance and Semi-Supervised Learning-(Michael Mala...
Spark Summit
 
PDF
Bulk Loading Data into Cassandra
DataStax
 
PDF
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Spark Summit
 
PDF
Structured streaming in Spark
Giri R Varatharajan
 
PPTX
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
DataStax
 
PDF
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
DataStax
 
PDF
Data Engineering with Solr and Spark
Lucidworks
 
PDF
DataStax: A deep look at the CQL WHERE clause
DataStax Academy
 
PDF
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Spark Summit
 
PDF
Introduction to PySpark
Russell Jurney
 
PDF
Java Performance Analysis on Linux with Flame Graphs
Brendan Gregg
 
PDF
Rethinking Streaming Analytics For Scale
Helena Edelson
 
PDF
DataStax: Spark Cassandra Connector - Past, Present and Future
DataStax Academy
 
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
DataStax
 
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
DataStax
 
Bulk Loading into Cassandra
Brian Hess
 
Spark cassandra integration, theory and practice
Duyhai Doan
 
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
DataStax Academy
 
Datastax enterprise presentation
Duyhai Doan
 
Extending Word2Vec for Performance and Semi-Supervised Learning-(Michael Mala...
Spark Summit
 
Bulk Loading Data into Cassandra
DataStax
 
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Spark Summit
 
Structured streaming in Spark
Giri R Varatharajan
 
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
DataStax
 
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
DataStax
 
Data Engineering with Solr and Spark
Lucidworks
 
DataStax: A deep look at the CQL WHERE clause
DataStax Academy
 
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Spark Summit
 
Introduction to PySpark
Russell Jurney
 
Java Performance Analysis on Linux with Flame Graphs
Brendan Gregg
 
Rethinking Streaming Analytics For Scale
Helena Edelson
 
DataStax: Spark Cassandra Connector - Past, Present and Future
DataStax Academy
 
Ad

Similar to Spark Cassandra Connector Dataframes (20)

PDF
Cassandra and Spark
datastaxjp
 
PDF
Cassandra London - C* Spark Connector
Christopher Batey
 
PDF
Analyzing Time Series Data with Apache Spark and Cassandra
Patrick McFadin
 
PDF
Manchester Hadoop Meetup: Cassandra Spark internals
Christopher Batey
 
PDF
Apache cassandra and spark. you got the the lighter, let's start the fire
Patrick McFadin
 
PDF
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Data Con LA
 
PDF
Breakthrough OLAP performance with Cassandra and Spark
Evan Chan
 
PPTX
Spark Cassandra Connector: Past, Present and Furure
DataStax Academy
 
PPTX
Lightning Fast Analytics with Cassandra and Spark
Tim Vincent
 
PDF
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Helena Edelson
 
PDF
DataSource V2 and Cassandra – A Whole New World
Databricks
 
PDF
Nike Tech Talk: Double Down on Apache Cassandra and Spark
Patrick McFadin
 
PDF
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Duyhai Doan
 
PDF
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
DataStax Academy
 
PDF
Apache cassandra & apache spark for time series data
Patrick McFadin
 
PDF
Fast track to getting started with DSE Max @ ING
Duyhai Doan
 
PDF
The Apache Cassandra ecosystem
Alex Thompson
 
PPTX
5 Ways to Use Spark to Enrich your Cassandra Environment
Jim Hatcher
 
PPTX
Presentation
Dimitris Stripelis
 
PDF
Kafka spark cassandra webinar feb 16 2016
Hiromitsu Komatsu
 
Cassandra and Spark
datastaxjp
 
Cassandra London - C* Spark Connector
Christopher Batey
 
Analyzing Time Series Data with Apache Spark and Cassandra
Patrick McFadin
 
Manchester Hadoop Meetup: Cassandra Spark internals
Christopher Batey
 
Apache cassandra and spark. you got the the lighter, let's start the fire
Patrick McFadin
 
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Data Con LA
 
Breakthrough OLAP performance with Cassandra and Spark
Evan Chan
 
Spark Cassandra Connector: Past, Present and Furure
DataStax Academy
 
Lightning Fast Analytics with Cassandra and Spark
Tim Vincent
 
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Helena Edelson
 
DataSource V2 and Cassandra – A Whole New World
Databricks
 
Nike Tech Talk: Double Down on Apache Cassandra and Spark
Patrick McFadin
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Duyhai Doan
 
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
DataStax Academy
 
Apache cassandra & apache spark for time series data
Patrick McFadin
 
Fast track to getting started with DSE Max @ ING
Duyhai Doan
 
The Apache Cassandra ecosystem
Alex Thompson
 
5 Ways to Use Spark to Enrich your Cassandra Environment
Jim Hatcher
 
Presentation
Dimitris Stripelis
 
Kafka spark cassandra webinar feb 16 2016
Hiromitsu Komatsu
 
Ad

More from Russell Spitzer (6)

PDF
Cassandra and Spark SQL
Russell Spitzer
 
PDF
Tale of Two Graph Frameworks: Graph Frames and Tinkerpop
Russell Spitzer
 
PPTX
Maximum Overdrive: Tuning the Spark Cassandra Connector
Russell Spitzer
 
PDF
Cassandra and IoT
Russell Spitzer
 
PDF
Cassandra Fundamentals - C* 2.0
Russell Spitzer
 
PDF
Escape From Hadoop: Spark One Liners for C* Ops
Russell Spitzer
 
Cassandra and Spark SQL
Russell Spitzer
 
Tale of Two Graph Frameworks: Graph Frames and Tinkerpop
Russell Spitzer
 
Maximum Overdrive: Tuning the Spark Cassandra Connector
Russell Spitzer
 
Cassandra and IoT
Russell Spitzer
 
Cassandra Fundamentals - C* 2.0
Russell Spitzer
 
Escape From Hadoop: Spark One Liners for C* Ops
Russell Spitzer
 

Recently uploaded (20)

PDF
SciPy 2025 - Packaging a Scientific Python Project
Henry Schreiner
 
PPTX
Hardware(Central Processing Unit ) CU and ALU
RizwanaKalsoom2
 
PDF
Why Businesses Are Switching to Open Source Alternatives to Crystal Reports.pdf
Varsha Nayak
 
PPTX
Empowering Asian Contributions: The Rise of Regional User Groups in Open Sour...
Shane Coughlan
 
PDF
4K Video Downloader Plus Pro Crack for MacOS New Download 2025
bashirkhan333g
 
PDF
Top Agile Project Management Tools for Teams in 2025
Orangescrum
 
PDF
Linux Certificate of Completion - LabEx Certificate
VICTOR MAESTRE RAMIREZ
 
PPTX
Help for Correlations in IBM SPSS Statistics.pptx
Version 1 Analytics
 
PDF
SAP Firmaya İade ABAB Kodları - ABAB ile yazılmıl hazır kod örneği
Salih Küçük
 
PPTX
Transforming Mining & Engineering Operations with Odoo ERP | Streamline Proje...
SatishKumar2651
 
PPTX
In From the Cold: Open Source as Part of Mainstream Software Asset Management
Shane Coughlan
 
PDF
AI + DevOps = Smart Automation with devseccops.ai.pdf
Devseccops.ai
 
PPTX
Tally_Basic_Operations_Presentation.pptx
AditiBansal54083
 
PPTX
Milwaukee Marketo User Group - Summer Road Trip: Mapping and Personalizing Yo...
bbedford2
 
PDF
Alexander Marshalov - How to use AI Assistants with your Monitoring system Q2...
VictoriaMetrics
 
PDF
iTop VPN With Crack Lifetime Activation Key-CODE
utfefguu
 
PPTX
Agentic Automation Journey Session 1/5: Context Grounding and Autopilot for E...
klpathrudu
 
PPTX
ChiSquare Procedure in IBM SPSS Statistics Version 31.pptx
Version 1 Analytics
 
PDF
TheFutureIsDynamic-BoxLang witch Luis Majano.pdf
Ortus Solutions, Corp
 
PDF
Generic or Specific? Making sensible software design decisions
Bert Jan Schrijver
 
SciPy 2025 - Packaging a Scientific Python Project
Henry Schreiner
 
Hardware(Central Processing Unit ) CU and ALU
RizwanaKalsoom2
 
Why Businesses Are Switching to Open Source Alternatives to Crystal Reports.pdf
Varsha Nayak
 
Empowering Asian Contributions: The Rise of Regional User Groups in Open Sour...
Shane Coughlan
 
4K Video Downloader Plus Pro Crack for MacOS New Download 2025
bashirkhan333g
 
Top Agile Project Management Tools for Teams in 2025
Orangescrum
 
Linux Certificate of Completion - LabEx Certificate
VICTOR MAESTRE RAMIREZ
 
Help for Correlations in IBM SPSS Statistics.pptx
Version 1 Analytics
 
SAP Firmaya İade ABAB Kodları - ABAB ile yazılmıl hazır kod örneği
Salih Küçük
 
Transforming Mining & Engineering Operations with Odoo ERP | Streamline Proje...
SatishKumar2651
 
In From the Cold: Open Source as Part of Mainstream Software Asset Management
Shane Coughlan
 
AI + DevOps = Smart Automation with devseccops.ai.pdf
Devseccops.ai
 
Tally_Basic_Operations_Presentation.pptx
AditiBansal54083
 
Milwaukee Marketo User Group - Summer Road Trip: Mapping and Personalizing Yo...
bbedford2
 
Alexander Marshalov - How to use AI Assistants with your Monitoring system Q2...
VictoriaMetrics
 
iTop VPN With Crack Lifetime Activation Key-CODE
utfefguu
 
Agentic Automation Journey Session 1/5: Context Grounding and Autopilot for E...
klpathrudu
 
ChiSquare Procedure in IBM SPSS Statistics Version 31.pptx
Version 1 Analytics
 
TheFutureIsDynamic-BoxLang witch Luis Majano.pdf
Ortus Solutions, Corp
 
Generic or Specific? Making sensible software design decisions
Bert Jan Schrijver
 

Spark Cassandra Connector Dataframes

  • 1. Cassandra And Spark Dataframes Russell Spitzer Software Engineer @ Datastax
  • 2. Cassandra And Spark Dataframes
  • 3. Cassandra And Spark Dataframes
  • 4. Cassandra And Spark Dataframes
  • 5. Cassandra And Spark Dataframes
  • 6. Tungsten Gives Dataframes OffHeap Power! Can compare memory off-heap and bitwise! Code generation!
  • 7. The Core is the Cassandra Source https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra- connector/src/main/scala/org/apache/spark/sql/cassandra /** * Implements [[BaseRelation]]]], [[InsertableRelation]]]] and [[PrunedFilteredScan]]]] * It inserts data to and scans Cassandra table. If filterPushdown is true, it pushs down * some filters to CQL * */ DataFrame source org.apache.spark.sql.cassandra
  • 8. The Core is the Cassandra Source https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra- connector/src/main/scala/org/apache/spark/sql/cassandra /** * Implements [[BaseRelation]]]], [[InsertableRelation]]]] and [[PrunedFilteredScan]]]] * It inserts data to and scans Cassandra table. If filterPushdown is true, it pushs down * some filters to CQL * */ DataFrame CassandraSourceRelation CassandraTableScanRDDConfiguration
  • 9. Configuration Can Be Done on a Per Source Level clusterName:keyspaceName/propertyName. Example Changing Cluster/Keyspace Level Properties val conf = new SparkConf() .set("ClusterOne/spark.cassandra.input.split.size_in_mb","32") .set("default:test/spark.cassandra.input.split.size_in_mb","128") val lastdf = sqlContext .read .format("org.apache.spark.sql.cassandra") .options(Map( "table" -> "words", "keyspace" -> "test" , "cluster" -> "ClusterOne" ) ).load()
  • 10. Configuration Can Be Done on a Per Source Level clusterName:keyspaceName/propertyName. Example Changing Cluster/Keyspace Level Properties val conf = new SparkConf() .set("ClusterOne/spark.cassandra.input.split.size_in_mb","32") .set("default:test/spark.cassandra.input.split.size_in_mb","128") val lastdf = sqlContext .read .format("org.apache.spark.sql.cassandra") .options(Map( "table" -> "words", "keyspace" -> "test" , "cluster" -> "ClusterOne" ) ).load() Namespace: ClusterOne spark.cassandra.input.split.size_in_mb=32
  • 11. Configuration Can Be Done on a Per Source Level clusterName:keyspaceName/propertyName. Example Changing Cluster/Keyspace Level Properties val conf = new SparkConf() .set("ClusterOne/spark.cassandra.input.split.size_in_mb","32") .set("default:test/spark.cassandra.input.split.size_in_mb","128") val lastdf = sqlContext .read .format("org.apache.spark.sql.cassandra") .options(Map( "table" -> "words", "keyspace" -> "test" , "cluster" -> "ClusterOne" ) ).load() Namespace: default Keyspace: test spark.cassandra.input.split.size_in_mb=128 Namespace: ClusterOne spark.cassandra.input.split.size_in_mb=32
  • 12. Configuration Can Be Done on a Per Source Level clusterName:keyspaceName/propertyName. Example Changing Cluster/Keyspace Level Properties val conf = new SparkConf() .set("ClusterOne/spark.cassandra.input.split.size_in_mb","32") .set("default:test/spark.cassandra.input.split.size_in_mb","128") val lastdf = sqlContext .read .format("org.apache.spark.sql.cassandra") .options(Map( "table" -> "words", "keyspace" -> "test" , "cluster" -> "ClusterOne" ) ).load() Namespace: default Keyspace: test spark.cassandra.input.split.size_in_mb=128 Namespace: ClusterOne spark.cassandra.input.split.size_in_mb=32
  • 13. Configuration Can Be Done on a Per Source Level clusterName:keyspaceName/propertyName. Example Changing Cluster/Keyspace Level Properties val conf = new SparkConf() .set("ClusterOne/spark.cassandra.input.split.size_in_mb","32") .set("default:test/spark.cassandra.input.split.size_in_mb","128") val lastdf = sqlContext .read .format("org.apache.spark.sql.cassandra") .options(Map( "table" -> "words", "keyspace" -> "test" , "cluster" -> "default" ) ).load() Namespace: default Keyspace: test spark.cassandra.input.split.size_in_mb=128 Namespace: ClusterOne spark.cassandra.input.split.size_in_mb=32
  • 14. Configuration Can Be Done on a Per Source Level clusterName:keyspaceName/propertyName. Example Changing Cluster/Keyspace Level Properties val conf = new SparkConf() .set("ClusterOne/spark.cassandra.input.split.size_in_mb","32") .set("default:test/spark.cassandra.input.split.size_in_mb","128") val lastdf = sqlContext .read .format("org.apache.spark.sql.cassandra") .options(Map( "table" -> "words", "keyspace" -> "other" , "cluster" -> "default" ) ).load() Namespace: default Keyspace: test spark.cassandra.input.split.size_in_mb=128 Namespace: ClusterOne spark.cassandra.input.split.size_in_mb=32 Connector Default
  • 15. Predicate Pushdown Is Automatic! Select * From cassandraTable where clusteringKey > 100
  • 16. Predicate Pushdown Is Automatic! Select * From cassandraTable where clusteringKey > 100 DataFrame DataFromC* Filter clusteringKey > 100 Show
  • 17. Predicate Pushdown Is Automatic! Select * From cassandraTable where clusteringKey > 100 DataFrame DataFromC* Filter clusteringKey > 100 Show Catalyst
  • 18. Predicate Pushdown Is Automatic! Select * From cassandraTable where clusteringKey > 100 DataFrame DataFromC* Filter clusteringKey > 100 Show Catalyst https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra- connector/src/main/scala/org/apache/spark/sql/cassandra/PredicatePushDown.scala
  • 19. Predicate Pushdown Is Automatic! Select * From cassandraTable where clusteringKey > 100 DataFrame DataFromC* AND add where clause to CQL "clusteringKey > 100" Show Catalyst https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra- connector/src/main/scala/org/apache/spark/sql/cassandra/PredicatePushDown.scala
  • 20. What can be pushed down? 1. Only push down no-partition key column predicates with =, >, <, >=, <= predicate 2. Only push down primary key column predicates with = or IN predicate. 3. If there are regular columns in the pushdown predicates, they should have at least one EQ expression on an indexed column and no IN predicates. 4. All partition column predicates must be included in the predicates to be pushed down, only the last part of the partition key can be an IN predicate. For each partition column, only one predicate is allowed. 5. For cluster column predicates, only last predicate can be non-EQ predicate including IN predicate, and preceding column predicates must be EQ predicates. 6. If there is only one cluster column predicate, the predicates could be any non-IN predicate. There is no pushdown predicates if there is any OR condition or NOT IN condition. 7. We're not allowed to push down multiple predicates for the same column if any of them is equality or IN predicate.
  • 21. What can be pushed down? If you could write in CQL it will get pushed down.
  • 22. What are we Pushing Down To? CassandraTableScanRDD All of the underlying code is the same as with sc.cassandraTable so everything with Reading and Writing
 applies
  • 23. What are we Pushing Down To? CassandraTableScanRDD All of the underlying code is the same as with sc.cassandraTable so everything with Reading and Writing
 applies https://academy.datastax.com/
 Watch me talk about this in the privacy of your own home!
  • 24. How the Spark Cassandra Connector Reads Data
  • 25. Spark RDDs Represent a Large Amount of Data Partitioned into Chunks RDD 1 2 3 4 5 6 7 8 9Node 2 Node 1 Node 3 Node 4
  • 26. Node 2 Node 1 Spark RDDs Represent a Large Amount of Data Partitioned into Chunks RDD 2 346 7 8 9 Node 3 Node 4 1 5
  • 27. Node 2 Node 1 RDD 2 346 7 8 9 Node 3 Node 4 1 5 Spark RDDs Represent a Large Amount of Data Partitioned into Chunks
  • 28. Cassandra Data is Distributed By Token Range
  • 29. Cassandra Data is Distributed By Token Range 0 500
  • 30. Cassandra Data is Distributed By Token Range 0 500 999
  • 31. Cassandra Data is Distributed By Token Range 0 500 Node 1 Node 2 Node 3 Node 4
  • 32. Cassandra Data is Distributed By Token Range 0 500 Node 1 Node 2 Node 3 Node 4 Without vnodes
  • 33. Cassandra Data is Distributed By Token Range 0 500 Node 1 Node 2 Node 3 Node 4 With vnodes
  • 34. Node 1 120-220 300-500 780-830 0-50 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb The Connector Uses Information on the Node to Make 
 Spark Partitions
  • 35. Node 1 120-220 300-500 0-50 The Connector Uses Information on the Node to Make 
 Spark Partitions 1 780-830 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb
  • 36. 1 Node 1 120-220 300-500 0-50 The Connector Uses Information on the Node to Make 
 Spark Partitions 780-830 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb
  • 37. 2 1 Node 1 300-500 0-50 The Connector Uses Information on the Node to Make 
 Spark Partitions 780-830 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb
  • 38. 2 1 Node 1 300-500 0-50 The Connector Uses Information on the Node to Make 
 Spark Partitions 780-830 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb
  • 39. 2 1 Node 1 300-400 0-50 The Connector Uses Information on the Node to Make 
 Spark Partitions 780-830 400-500 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb
  • 40. 21 Node 1 0-50 The Connector Uses Information on the Node to Make 
 Spark Partitions 780-830 400-500 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb
  • 41. 21 Node 1 0-50 The Connector Uses Information on the Node to Make 
 Spark Partitions 780-830 400-500 3 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb
  • 42. 21 Node 1 0-50 The Connector Uses Information on the Node to Make 
 Spark Partitions 780-830 3 400-500 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb
  • 43. 21 Node 1 0-50 The Connector Uses Information on the Node to Make 
 Spark Partitions 780-830 3 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb
  • 44. 4 21 Node 1 0-50 The Connector Uses Information on the Node to Make 
 Spark Partitions 780-830 3 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb
  • 45. 4 21 Node 1 0-50 The Connector Uses Information on the Node to Make 
 Spark Partitions 780-830 3 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb
  • 46. 421 Node 1 The Connector Uses Information on the Node to Make 
 Spark Partitions 3 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb
  • 47–64. Data is Retrieved Using the DataStax Java Driver (spark.cassandra.input.page.row.size = 50). For the token ranges 0–50 and 780–830 on Node 1, the connector issues:
  SELECT * FROM keyspace.table WHERE token(pk) > 780 and token(pk) <= 830
  SELECT * FROM keyspace.table WHERE token(pk) > 0 and token(pk) <= 50
  Each query streams its results back in pages of 50 CQL Rows per request, paging repeatedly until its token range is exhausted.
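  The page size is an ordinary Spark property. A minimal sketch of a paged read, assuming a local node and the property name used in this connector version; the keyspace and table names are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    val conf = new SparkConf()
      .setAppName("paged-read")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumption: a local node
      .set("spark.cassandra.input.page.row.size", "50")    // CQL rows per driver page

    val sc = new SparkContext(conf)

    // Each Spark partition runs its token-range SELECTs and pulls the
    // results back one 50-row page at a time.
    val rows = sc.cassandraTable("keyspace", "table")
    println(rows.count())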
  • 65. How The Spark Cassandra Connector Writes Data
  • 66–67. Spark RDDs Represent a Large Amount of Data Partitioned into Chunks: partitions 1–9 of the RDD are spread across Node 1, Node 2, Node 3 and Node 4.
  • 68. The Spark Cassandra Connector saveToCassandra method can be called on almost all RDDs: rdd.saveToCassandra("Keyspace","Table")
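  For instance, a minimal save might look like this; the keyspace, table, and column names are illustrative, not from the slides:

    import com.datastax.spark.connector._

    // Assumes an existing SparkContext `sc` and a table created as:
    //   CREATE TABLE test.kv (key int PRIMARY KEY, value text)
    val rdd = sc.parallelize(Seq((1, "one"), (2, "two"), (3, "three")))

    // Tuple elements map positionally onto the named columns.
    rdd.saveToCassandra("test", "kv", SomeColumns("key", "value"))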
  • 69. A Java Driver connection is made to the local node and a prepared statement is built for the target table.
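  Under the hood this is plain Java Driver usage. A hedged sketch with the 2.x-era driver API; the statement text and table names are assumptions for illustration:

    import com.datastax.driver.core.Cluster

    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build() // the local node
    val session = cluster.connect()

    // One INSERT is prepared per target table, then bound for each row.
    val prepared = session.prepare(
      "INSERT INTO keyspace.table (pk, ck, v) VALUES (?, ?, ?)")
    session.execute(prepared.bind(Int.box(1), Int.box(1), Int.box(1)))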
  • 70. Batches are built from data in Spark partitions; the incoming rows arrive with mixed partition keys (1,1,1 1,2,1 2,1,1 3,8,1 3,2,1 3,4,1 3,5,1 3,1,1 1,4,1 5,4,1 2,4,1 8,4,1 9,4,1 3,9,1).
  • 71. By default these batches only contain CQL Rows which share the same partition key. The walkthrough uses these settings:
  spark.cassandra.output.batch.grouping.key        partition
  spark.cassandra.output.batch.size.rows           4
  spark.cassandra.output.batch.buffer.size         3
  spark.cassandra.output.concurrent.writes         2
  spark.cassandra.output.throughput_mb_per_sec     5
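  These knobs are ordinary Spark properties; a sketch of setting the exact values used in this walkthrough:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.cassandra.output.batch.grouping.key", "partition") // group rows by partition key
      .set("spark.cassandra.output.batch.size.rows", "4")            // flush a batch at 4 rows
      .set("spark.cassandra.output.batch.buffer.size", "3")          // at most 3 open batches
      .set("spark.cassandra.output.concurrent.writes", "2")          // 2 batches in flight
      .set("spark.cassandra.output.throughput_mb_per_sec", "5")      // 5 MB/s write cap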
  • 72–75. When an element is not part of an existing batch, a new batch is started: the PK=1 rows accumulate in one batch, and the first PK=2 row opens a second batch beside it.
  • 76–79. If a batch size reaches batch.size.rows or batch.size.bytes it is executed by the driver: here the PK=3 batch fills to 4 rows (3,2,1 3,4,1 3,5,1 3,8,1) and is flushed while the PK=1 and PK=2 batches stay open (see the sketch after the next slide).
  • 80–83. If more than batch.buffer.size batches are currently being made, the largest batch is executed by the Java Driver: with buffer.size 3 and batches open for PK=1, PK=2 and PK=3, the arrival of a PK=5 row forces the largest open batch (PK=1) out to Cassandra.
  • 84–89. If more batches are currently being executed by the Java Driver than concurrent.writes, we wait until one of the requests has been completed: with two batches already in flight, the next flush blocks until a Write Acknowledged comes back and frees a slot, after which the PK=8 batch can be sent.
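  The concurrent.writes limit behaves like a semaphore with two permits. A hedged sketch of the idea, not the connector's real code; `send` stands in for handing a batch statement to the Java Driver:

    import java.util.concurrent.Semaphore
    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global

    val inFlight = new Semaphore(2) // spark.cassandra.output.concurrent.writes

    def executeBatch(send: () => Unit): Unit = {
      inFlight.acquire() // blocks the writer while 2 batches are already executing
      Future {
        send()           // asynchronous execution by the driver
      }.onComplete { _ =>
        inFlight.release() // Write Acknowledged: a slot frees up
      }
    }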
  • 90–95. The last parameter, throughput_mb_per_sec, blocks further batches if we have written more than that much in the past second: even as Write Acknowledged responses come back, the remaining batches (PK=3, PK=5, PK=8) are held (Block) until the 5 MB/s budget allows another write.
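  The throughput cap can be pictured as a one-second accounting window. An illustrative sketch only; the connector's real rate limiter differs:

    class ThroughputLimiter(bytesPerSecond: Long) {
      private var windowStart   = System.nanoTime()
      private var bytesInWindow = 0L

      // Blocks until writing `batchBytes` fits within the per-second budget.
      def maybeBlock(batchBytes: Long): Unit = synchronized {
        val now = System.nanoTime()
        if (now - windowStart >= 1000000000L) { // a new one-second window
          windowStart = now
          bytesInWindow = 0L
        }
        if (bytesInWindow + batchBytes > bytesPerSecond) {
          val waitMs = ((windowStart + 1000000000L - now) / 1000000L).max(0L)
          Thread.sleep(waitMs)                  // Block: budget exhausted
          windowStart = System.nanoTime()
          bytesInWindow = 0L
        }
        bytesInWindow += batchBytes
      }
    }

    // 5 MB/s, matching spark.cassandra.output.throughput_mb_per_sec = 5
    val limiter = new ThroughputLimiter(5L * 1024 * 1024)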
  • 96. Thanks for Coming and I Hope You Have a Great Time at C* Summit! http://cassandrasummit-datastax.com/agenda/the-spark-cassandra-connector-past-present-and-future/ Also ask these guys really hard questions: Jacek, Piotr, Alex