Partners in Crime
Cassandra Analytics and ETL with Hadoop




Cassandra Summit 2010

Date: August 10th, 2010
What is Hadoop?

• Distributed processing framework (MapReduce)
  – Moves processing to the data
• Distributed filesystem
  – Allows data to move when processing can't
Why use Hadoop with Cassandra?

 Perfect partners for big data laundering

• Cassandra optimized for access
• Hadoop optimized for processing
  – Many analytics frameworks
  – Existing integrations
      • RDBMS → Hadoop → Cassandra
Cluster Layouts

• Existing Hadoop cluster?
  – Start Hadoop tasktrackers on Cassandra cluster
  – Processing performed on local nodes
Cluster Layouts

• No Hadoop cluster?
  – Start all Hadoop daemons on 2-3 nodes
      • MapReduce depends lightly on HDFS
  – Start Hadoop tasktrackers on Cassandra cluster
Hadoop Integration Points

• JVM MapReduce
  – Keys/values iterated in process
• Hadoop Streaming
  – Performs IPC on stdin/stdout to arbitrary processes
• Apache Pig
  – High level relational language (SQL alternative)
• Apache Hive
  – Forthcoming support for Cassandra storage
Demo

• Code
  – github.com/stuhood/cassandra-summit-demo
• Flow
  – Load with Hadoop Streaming
  – Analyze with Apache Pig
  – Load/Process with JVM MapReduce
Hadoop Streaming Summary

• Mapper/Reducer scripts
  – Any language
• Script is moved to the data


 cat $input | mapper | sort | reducer > $output
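The shell pipeline above can be imitated locally to see what each stage contributes. The following is a minimal sketch, assuming a word-count job and tab-separated key/value records; the function names are made up for illustration and are not taken from the demo repository.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would on stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts for each run of identical keys; input must already be sorted."""
    parsed = (record.split("\t", 1) for record in sorted_records)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_pipeline(lines):
    """Local stand-in for: cat $input | mapper | sort | reducer > $output"""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for record in run_pipeline(["a b a", "b c"]):
        print(record)
```

The sort step matters: the reducer only sees all values for a key together because equal keys are adjacent after sorting.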
ETL with Streaming

• ETL to Cassandra in ~50 lines
 Load!
ETL with Streaming

1) Files in HDFS
2) Hadoop Streaming
3) bin/load-mapper.py (the code you write)
4) Cassandra's Streaming Shim
5) Cassandra
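A hedged sketch of what the mapper in step 3 might do. The input layout (row key, column name, and value separated by tabs), the function names, and the JSON-lines output are all assumptions for illustration. The slides do not describe the record encoding that the streaming shim expects, and this is not the demo's actual bin/load-mapper.py.

```python
import json


def to_record(line):
    """Parse 'row_key<TAB>column<TAB>value'; return None for malformed lines."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        return None
    row_key, column, value = parts
    return {"key": row_key, "columns": {column: value}}


def load_mapper(lines):
    """Yield one serialized record per well-formed input line (hypothetical format)."""
    for line in lines:
        record = to_record(line)
        if record is not None:
            yield json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    import sys

    # Hadoop Streaming feeds input lines on stdin and collects stdout.
    for out in load_mapper(sys.stdin):
        print(out)
```

Skipping malformed lines rather than raising keeps one bad row from failing a whole task, though a real job would probably want to count them.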
Apache Pig Summary

• Declarative relational language
Analytics with Pig

• Analytics from Cassandra in ~20 lines
 Analyze!
Analytics with Pig

1) Data stored in Cassandra
2) Cassandra's Pig LoadFunc
3) bin/analyze.pig (the code you write)
4) Files in HDFS
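To give a feel for the kind of relational work a short Pig script expresses, here is a plain Python sketch of a group, count, and order step. It is not Pig and not the demo's bin/analyze.pig; the row shape and the function names are hypothetical.

```python
from collections import Counter


def group_and_count(rows, field):
    """Group rows by one field and count the members of each group."""
    return Counter(row[field] for row in rows)


def top_n(counts, n):
    """Order groups by descending count, then by key so ties are deterministic."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    rows = [
        {"user": "a", "page": "/x"},
        {"user": "b", "page": "/x"},
        {"user": "a", "page": "/y"},
    ]
    print(top_n(group_and_count(rows, "page"), 1))
```

In Pig the same steps are declared as relations and compiled into MapReduce jobs, so the script stays short while the work runs across the cluster.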
JVM MapReduce Summary

• Extend Mapper/Reducer base classes
• Hadoop:
  – Transports the Jar to nodes near the data
  – Efficiently streams data through
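The base-class pattern can be sketched in a few lines. Hadoop's real Mapper and Reducer classes are Java; the Python classes and the single-process run_job driver below are illustrative stand-ins under that assumption, not Hadoop's API.

```python
from collections import defaultdict


class Mapper:
    def map(self, key, value):
        """Yield zero or more (key, value) pairs; subclasses override this."""
        raise NotImplementedError


class Reducer:
    def reduce(self, key, values):
        """Yield zero or more (key, value) pairs for one key and all its values."""
        raise NotImplementedError


class LengthMapper(Mapper):
    def map(self, key, value):
        # Emit the length of each input string with a count of 1.
        yield len(value), 1


class SumReducer(Reducer):
    def reduce(self, key, values):
        yield key, sum(values)


def run_job(mapper, reducer, records):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper.map(key, value):
            groups[out_key].append(out_value)
    results = []
    for key in sorted(groups):
        results.extend(reducer.reduce(key, groups[key]))
    return results


if __name__ == "__main__":
    print(run_job(LengthMapper(), SumReducer(), [(0, "ab"), (1, "cd"), (2, "xyz")]))
```

The user code is only the two subclasses; everything in run_job is what the framework does on your behalf, distributed across nodes.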
Load/Process with MapReduce

• Efficient bulk loading in ~80 lines
 Summarize!
Load/Process with MapReduce

1) Files in HDFS
2) MapReduce
3) Mapper/Reducer (the code you write)
4) Cassandra's ColumnFamilyOutputFormat
5) Cassandra
Future Work

• Pig Output
• Hive
• Hadoop Streaming Input
• Optimizations
Questions?
References

• Code available at
  – github.com/stuhood/cassandra-summit-demo
• Open issues
  – CASSANDRA-1315
  – CASSANDRA-1322
  – CASSANDRA-1368
• “Hadoop + Cassandra” - Jeremy Hanna
  – slideshare.net/jeromatron/cassandrahadoop-4399672
