Partners in Crime: Cassandra Analytics and ETL with Hadoop

Cassandra Summit 2010

Date: August 10th, 2010

What is Hadoop?

• Distributed processing framework (MapReduce)
– Moves processing to the data
• Distributed filesystem
– Allows data to move when processing can't

Why use Hadoop with Cassandra?

Perfect partners for big data laundering

• Cassandra optimized for access
• Hadoop optimized for processing
– Many analytics frameworks
– Existing integrations
• RDBMS → Hadoop → Cassandra

Cluster Layouts

• Existing Hadoop cluster?
– Start Hadoop tasktrackers on Cassandra cluster
– Processing performed on local nodes

Cluster Layouts

• No Hadoop cluster?
– Start all Hadoop daemons on 2-3 nodes (MapReduce depends only lightly on HDFS)
– Start Hadoop tasktrackers on Cassandra cluster

Hadoop Integration Points

• JVM MapReduce
– Keys/values iterated in process
• Hadoop Streaming
– Performs IPC on stdin/stdout to arbitrary processes
• Apache Pig
– High level relational language (SQL alternative)
• Apache Hive
– Forthcoming support for Cassandra storage

Demo

• Code
– github.com/stuhood/cassandra-summit-demo
• Flow
– Load with Hadoop Streaming
– Analyze with Apache Pig
– Load/Process with JVM MapReduce

Hadoop Streaming Summary

• Mapper/Reducer scripts
– Any language
• Script is moved to the data

cat $input | mapper | sort | reducer > $output
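The pipeline above can be mimicked in a single process to show what each stage contributes. This is a minimal sketch, not code from the demo: the word-count job and the names `mapper`, `reducer`, and `run_pipeline` are illustrative assumptions. Real Streaming scripts read stdin and write tab-separated lines to stdout, and Hadoop performs the sort between them.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(sorted_pairs):
    """Sum the counts for each run of identical keys.

    Relies on the input being sorted by key, which is what the
    `sort` stage (the shuffle, in Hadoop) guarantees.
    """
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, sum(count for _, count in group)


def run_pipeline(lines):
    """In-process stand-in for: cat $input | mapper | sort | reducer"""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for word, total in run_pipeline(["a b a", "b c"]):
        print(f"{word}\t{total}")
```

Because the reducer only needs adjacent identical keys, it never holds more than one key's running total in memory.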

ETL with Streaming

• ETL to Cassandra in ~50 lines
Load!

ETL with Streaming

1) Files in HDFS
2) Hadoop Streaming
3) bin/load-mapper.py (the code you write)
4) Cassandra's Streaming Shim
5) Cassandra
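Step 3 is the only part the user writes. The slides do not show bin/load-mapper.py or the record format the Streaming shim expects, so the sketch below is hypothetical on both counts: it assumes tab-separated input of the form `row_key, name, value, name, value, ...` and emits `row_key<TAB>JSON` as a stand-in for whatever the shim actually consumes.

```python
import json
import sys


def parse_line(line):
    """Split one tab-separated line into (row_key, {column: value}).

    The input layout is an assumption made for this sketch.
    """
    fields = line.rstrip("\n").split("\t")
    row_key, rest = fields[0], fields[1:]
    # Pair up the remaining fields as column name / column value.
    columns = dict(zip(rest[0::2], rest[1::2]))
    return row_key, columns


def load_mapper(lines):
    """Yield one output record per non-blank input line."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines rather than emit empty rows
        row_key, columns = parse_line(line)
        # Stand-in output format; the real shim's format is not
        # described in the slides.
        yield f"{row_key}\t{json.dumps(columns, sort_keys=True)}"


if __name__ == "__main__":
    # Hadoop Streaming feeds input splits on stdin and collects stdout.
    for record in load_mapper(sys.stdin):
        print(record)
```

Keeping the parsing in a plain function makes the mapper easy to test locally before submitting it to the cluster.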

Apache Pig Summary

• Declarative relational language

Analytics with Pig

• Analytics from Cassandra in ~20 lines
Analyze!

Analytics with Pig

1) Data stored in Cassandra
2) Cassandra's Pig LoadFunc
3) bin/analyze.pig (the code you write)
4) Files in HDFS
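The contents of bin/analyze.pig are not shown in the slides. As a rough illustration of the kind of load, group, and count steps a short relational script expresses, here is the same logic spelled out in Python. The row shape (a key plus a list of (name, value) column pairs) and the function name `count_by_column` are assumptions made for this sketch.

```python
from collections import Counter


def count_by_column(rows, column):
    """Count rows per distinct value of one column.

    `rows` is an iterable of (row_key, [(name, value), ...]) tuples,
    an assumed shape for what a load function might hand over.
    Results are ordered by descending count, then by value.
    """
    counts = Counter()
    for _key, columns in rows:
        for name, value in columns:
            if name == column:
                counts[value] += 1
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    rows = [
        ("u1", [("city", "London"), ("name", "Ada")]),
        ("u2", [("city", "Paris")]),
        ("u3", [("city", "London")]),
    ]
    for value, total in count_by_column(rows, "city"):
        print(f"{value}\t{total}")
```

In Pig the grouping and counting would be declared rather than looped over, and the framework would plan the MapReduce jobs needed to run them.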

JVM MapReduce Summary

• Extend Mapper/Reducer base classes
• Hadoop:
– Transports the Jar to nodes near the data
– Efficiently streams data through

Load/Process with MapReduce

• Efficient bulk loading in ~80 lines
Summarize!

Load/Process with MapReduce

1) Files in HDFS
2) MapReduce
3) Mapper/Reducer (the code you write)
4) Cassandra's ColumnFamilyOutputFormat
5) Cassandra
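The "extend the base classes, let the framework stream data through" model in step 3 can be sketched without a cluster. The real job is written in Java against Hadoop's Mapper and Reducer classes; the Python classes below, the log-line input, and the `run_job` driver are all hypothetical stand-ins, and the reducer's (row key, columns) output only approximates what an output format would turn into Cassandra writes.

```python
from collections import defaultdict


class Mapper:
    """Stand-in base class: subclasses yield (key, value) pairs."""

    def map(self, key, value):
        raise NotImplementedError


class Reducer:
    """Stand-in base class: subclasses yield (row_key, columns) pairs."""

    def reduce(self, key, values):
        raise NotImplementedError


class StatusMapper(Mapper):
    """Emit (status, 1) from 'host status' lines (assumed input)."""

    def map(self, key, value):
        _host, status = value.split()
        yield status, 1


class SumReducer(Reducer):
    """Summarize each key as a single row with a 'count' column."""

    def reduce(self, key, values):
        yield key, {"count": sum(values)}


def run_job(records, mapper, reducer):
    """Tiny local driver: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper.map(key, value):
            grouped[out_key].append(out_value)
    output = {}
    for key in sorted(grouped):
        for row_key, columns in reducer.reduce(key, grouped[key]):
            output[row_key] = columns
    return output


if __name__ == "__main__":
    records = [(0, "a 200"), (1, "b 404"), (2, "c 200")]
    print(run_job(records, StatusMapper(), SumReducer()))
```

The user code never touches files or sockets; only the driver (Hadoop, in the real flow) decides where input comes from and where rows are written.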

Future Work

• Pig Output
• Hive
• Hadoop Streaming Input
• Optimizations

References

• Code available at
– github.com/stuhood/cassandra-summit-demo
• Open issues
– CASSANDRA-1315
– CASSANDRA-1322
– CASSANDRA-1368
• “Hadoop + Cassandra” - Jeremy Hanna
– slideshare.net/jeromatron/cassandrahadoop-4399672
