CASSANDRA DATA
MAINTENANCE WITH SPARK
Operate on your Data
WHAT IS SPARK?
A large-scale data processing framework
STEP 1:
Make Fake Data
(unless you have a million records to spare)
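import com.datastax.spark.connector._

// Build one fake row: (id, timestamp in epoch millis, token, session data).
def create_fake_record( num: Int ) = {
  (num,
   1453389992000L + num,
   s"My Token $num",
   s"My Session Data$num")
}

// Generate a million rows, group them by the Cassandra replica that owns them, then write.
// Keyspace and table names are kept exactly as shown on the slide.
sc.parallelize(1 to 1000000)
  .map( create_fake_record )
  .repartitionByCassandraReplica("maintdemo","user_visits",10)
  .saveToCassandra("user_visits","oauth_cache")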
THREE BASIC PATTERNS
• Read - Transform - Write (1:1) - .map()
• Read - Transform - Write (1:m) - .flatMap()
• Read - Filter - Delete (m:1) - it’s complicated
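
The first two patterns differ only in how many output rows each input row yields. A minimal sketch on plain Scala collections, which expose the same map/flatMap shape as an RDD; the `Visit` and `Event` types and both functions are hypothetical and not from the talk:

```scala
object PatternsSketch {
  // Hypothetical row types, for illustration only.
  final case class Visit(userId: Int, lastAccess: Long, token: String)
  final case class Event(userId: Int, kind: String)

  // 1:1 - every input row becomes exactly one output row.
  def normalize(v: Visit): Visit = v.copy(token = v.token.trim.toLowerCase)

  // 1:m - every input row becomes zero or more output rows.
  def explode(v: Visit): Seq[Event] =
    Seq(Event(v.userId, "login"), Event(v.userId, "token-issued"))

  val visits = Seq(Visit(1, 1453389992001L, " My Token 1 "))

  val rewritten: Seq[Visit] = visits.map(normalize)    // same row count
  val fannedOut: Seq[Event] = visits.flatMap(explode)  // row count may grow or shrink
}
```

On an RDD read from Cassandra, the same two functions would be passed to .map() or .flatMap() before the result is saved back.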
DELETES ARE TRICKY
• Keep tombstones in mind
• Select the records you want to delete, then loop
over those and issue deletes through the driver
• OR select the records you want to keep, rewrite
them, then delete the partitions they lived in… IN
THE PAST…
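
One reading of "in the past" is that the partition delete is issued with a write timestamp older than the rewritten rows, so the tombstone shadows only the original data. The sketch below models that rule in plain Scala. It illustrates Cassandra's timestamp-based reconciliation under that assumption and is not code from the talk; `Cell`, `survives`, and `deleteCql` are hypothetical names.

```scala
object DeleteInThePastSketch {
  // A stored value together with the write timestamp Cassandra keeps alongside it.
  final case class Cell(value: String, writeTimestamp: Long)

  // A tombstone shadows data written at or before its own timestamp;
  // only strictly newer writes remain visible.
  def survives(cell: Cell, tombstoneTimestamp: Long): Boolean =
    cell.writeTimestamp > tombstoneTimestamp

  // CQL text for a partition delete that carries an explicit timestamp.
  // Keyspace, table, and key column are placeholders supplied by the caller.
  def deleteCql(keyspace: String, table: String, keyColumn: String): String =
    s"DELETE FROM $keyspace.$table USING TIMESTAMP ? WHERE $keyColumn = ?"

  val original  = Cell("stale session", writeTimestamp = 1000L)
  val rewritten = Cell("kept session", writeTimestamp = 3000L)

  // Chosen after the original write but before the rewrite.
  val tombstoneTimestamp = 2000L
}
```

With these values the original cell is shadowed and the rewritten cell stays readable, which is the effect the second bullet relies on.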
DELETING
PREDICATE PUSHDOWN
• Use Cassandra-level filtering at every opportunity
• With DSE, benefit from predicate pushdown to
solr_query
GOTCHAS
• Null fields
• Writing jobs which aren’t or can’t be distributed.
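
The slide does not say which null-field problem it means. One plausible case, assumed here, is a nullable column read into a non-optional type. Modelling such a column as `Option` keeps the transform safe for both cases. The `Session` type and `describe` function are hypothetical:

```scala
object NullFieldSketch {
  // sessionData is nullable in this hypothetical table, so it is an Option.
  final case class Session(userId: Int, token: String, sessionData: Option[String])

  // Handles present and missing values without an explicit null check.
  def describe(s: Session): String =
    s.sessionData
      .map(d => s"${s.userId}: $d")
      .getOrElse(s"${s.userId}: <no session data>")

  val rows = Seq(Session(1, "t1", Some("cart=3")), Session(2, "t2", None))
  val described: Seq[String] = rows.map(describe)
}
```

It is also worth remembering that writing an explicit null to Cassandra creates a tombstone, which ties this gotcha back to the delete discussion.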
TIPS & TRICKS
• .spanBy( partition key ) - work on one Cassandra
partition at a time
• .repartitionByCassandraReplica()
• tune
spark.cassandra.output.throughput_mb_per_sec to
throttle writes
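
.spanBy() differs from a groupBy in that it only groups adjacent rows that share a key, which is cheap because a table scan already returns each partition's rows together. A plain-Scala model of that behaviour follows; the helper `spanByKey` is hypothetical and only mimics the semantics, it is not the connector's implementation:

```scala
object SpanBySketch {
  // Groups *consecutive* elements that share a key, without sorting or shuffling.
  def spanByKey[K, V](rows: Seq[V])(key: V => K): Seq[(K, Seq[V])] =
    rows.foldLeft(Vector.empty[(K, Vector[V])]) { (acc, row) =>
      val k = key(row)
      acc.lastOption match {
        // Same key as the previous row: extend the current group.
        case Some((lastKey, group)) if lastKey == k =>
          acc.init :+ (lastKey -> (group :+ row))
        // Key changed (or first row): start a new group.
        case _ =>
          acc :+ (k -> Vector(row))
      }
    }

  // (userid, last_access) pairs in the order a partition-ordered scan would return them.
  val scanned = Seq((1, 10L), (1, 20L), (2, 5L), (2, 7L), (3, 1L))
  val spans = spanByKey(scanned)(_._1)
}
```

If rows with the same key were not adjacent, this approach would emit the key more than once, which is why it is only appropriate when spanning by the partition key.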
USE CASE: CACHE MAINTENANCE
USE CASE: TRIM USER HISTORY
• Cassandra Data Model: PRIMARY KEY( userid,
last_access )
• Keep last X records
• .spanBy( partitionKey ), then .flatMap() filtering each partition's Seq
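
The per-partition filtering step can be sketched on a plain Scala collection. This assumes "last X" means the X rows with the greatest last_access; the `Visit` type and `rowsToDelete` function are hypothetical:

```scala
object TrimHistorySketch {
  // Hypothetical row type for PRIMARY KEY( userid, last_access ).
  final case class Visit(userid: Int, lastAccess: Long)

  // Rows to delete from one partition: everything except the `keep` most recent.
  def rowsToDelete(partitionRows: Iterable[Visit], keep: Int): Seq[Visit] =
    partitionRows.toSeq.sortBy(_.lastAccess).dropRight(keep)

  val history = Seq(Visit(7, 300L), Visit(7, 100L), Visit(7, 400L), Visit(7, 200L))
  val expired = rowsToDelete(history, keep = 2)
}
```

In a job, a function like this would be applied inside the .flatMap() to each (key, rows) pair produced by .spanBy(), yielding the rows to hand to the delete step.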
USE CASE: PUBLISH DATA
• Cassandra Data Model: publish_date field
• filter by date, map to new RDD matching
destination, saveToCassandra()
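
A minimal sketch of the filter and map steps on plain Scala collections, assuming that "filter by date" means selecting rows whose publish_date has already arrived. The `Draft` and `Published` types and the `toPublish` function are hypothetical:

```scala
object PublishSketch {
  // Hypothetical source and destination row shapes.
  final case class Draft(id: Int, title: String, publishDate: Long)
  final case class Published(id: Int, title: String)

  // Select rows whose publish date has arrived and reshape them for the destination table.
  def toPublish(drafts: Seq[Draft], now: Long): Seq[Published] =
    drafts.filter(_.publishDate <= now).map(d => Published(d.id, d.title))

  val drafts = Seq(Draft(1, "ready", 1000L), Draft(2, "future", 9000L))
  val due = toPublish(drafts, now = 5000L)
}
```

The resulting collection has the destination's shape, so on an RDD it could be passed directly to saveToCassandra().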
USE CASE: MULTITENANT
BACKUP AND RECOVERY
• Cassandra Data Model: PRIMARY KEY((tenant_id,
other_partition_key), other_cluster, …)
• Backup: filter for tenant_id and .foreach() write to
external location.
• Recovery: read backup and upsert
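
The backup and recovery round trip can be sketched in plain Scala. The `Row` shape, the tab-separated line format, and all function names below are assumptions made for illustration; the talk does not specify a backup format:

```scala
object TenantBackupSketch {
  // Hypothetical row shape for PRIMARY KEY((tenant_id, other_partition_key), other_cluster).
  final case class Row(tenantId: String, otherPartitionKey: String, otherCluster: Int, payload: String)

  // One tab-separated line per row; assumes no field contains a tab or newline.
  def encode(r: Row): String =
    Seq(r.tenantId, r.otherPartitionKey, r.otherCluster.toString, r.payload).mkString("\t")

  // Inverse of encode; the -1 limit keeps trailing empty fields.
  def decode(line: String): Row = {
    val Array(tenant, pk, cluster, payload) = line.split("\t", -1)
    Row(tenant, pk, cluster.toInt, payload)
  }

  // Backup: keep only one tenant's rows and serialise them.
  def backup(rows: Seq[Row], tenantId: String): Seq[String] =
    rows.filter(_.tenantId == tenantId).map(encode)

  // Recovery: parse the lines back into rows ready to be written.
  def restore(lines: Seq[String]): Seq[Row] = lines.map(decode)

  val table = Seq(Row("acme", "p1", 1, "a"), Row("other", "p1", 1, "b"), Row("acme", "p2", 3, ""))
  val lines = backup(table, "acme")
  val restored = restore(lines)
}
```

Because Cassandra writes are upserts, saving the restored rows overwrites any existing rows with the same primary key and leaves other tenants untouched.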
