Pig with Cassandra: Adventures in Analytics

Motivation
  • What’s our need?
  • How do we get at data in Cassandra with ad-hoc queries?
  • Don’t reinvent the wheel

Enter Pig
  • Pig was created at Yahoo! as an abstraction for MapReduce
  • Designed to eat anything
  • LoadStoreFunc created for Cassandra

How it works
  • Perform queries over all rows in a column family or set of column families
  • Intermediate results stored in HDFS or CFS
  • Can mix and match inputs and outputs

Uses
  • Analytics
  • Data exploration
      • How many items did I get from New Jersey?
  • Data validation
      • How many items were missing a field and when were they created?
  • Data correction
      • Company name correction over all data
  • Expand Cassandra data model
      • Make a new column family for querying by US State and back-populate with Pig
  • Bootstrap local dev environment

Pygmalion
  • Figure in Greek mythology, sounds like Pig
  • UDFs and example scripts for using Pig with Cassandra
  • Used in production at The Dachis Group
  • https://github.com/jeromatron/pygmalion/

Digging in the Dirt
  • Pygmalion basic examples

Tips
  • Develop incrementally
  • Output intermediate data frequently to verify
  • Validate data on input if possible
  • Use Cassandra data type validation for inputs and outputs
  • Pygmalion for tabular data
  • Penny in Pig 0.9!

Cluster Configuration
  • Split cluster – virtual datacenters
  • Brisk (built-in Pig support in 1.0 beta 2+)
  • Task trackers on all analytic nodes
  • With HDFS:
      • Separate namenode/jobtracker
      • Data nodes on all analytic nodes
      • A few settings to bridge the two
      • Start the server processes
      • Distributed cache and intermediate data
  • With Brisk:
      • Startup includes CFS, job tracker, and task trackers

Topology configuration

    # from conf/cassandra-topology.properties
    #
    # Cassandra Node IP=Data Center:Rack

    10.20.114.10=DC-Analytics:Rack-1b
    10.20.114.11=DC-Analytics:Rack-1b
    10.20.114.12=DC-Analytics:Rack-2b

    10.0.0.10=DC-Realtime-East:Rack-1a
    10.0.0.11=DC-Realtime-East:Rack-1a
    10.0.0.12=DC-Realtime-East:Rack-2a

    10.21.119.13=DC-Realtime-West:Rack-1c
    10.21.119.14=DC-Realtime-West:Rack-1c
    10.21.119.15=DC-Realtime-West:Rack-2c

    # default for unknown nodes
    default=DC-Realtime-West:Rack-1c
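
To make the split-cluster layout concrete, here is a minimal Python sketch that parses text in the `IP=Data Center:Rack` format shown on this slide and groups nodes by virtual datacenter. It is only an illustration of the file format, not how Cassandra itself reads the file; the function names (`parse_topology`, `nodes_by_datacenter`, `locate`) are hypothetical, and the sample is shortened to a few of the slide's entries.

```python
from collections import defaultdict

# Shortened sample in the IP=DataCenter:Rack format from the slide.
TOPOLOGY = """\
# Cassandra Node IP=Data Center:Rack
10.20.114.10=DC-Analytics:Rack-1b
10.20.114.11=DC-Analytics:Rack-1b
10.20.114.12=DC-Analytics:Rack-2b
10.0.0.10=DC-Realtime-East:Rack-1a
10.21.119.13=DC-Realtime-West:Rack-1c

# default for unknown nodes
default=DC-Realtime-West:Rack-1c
"""


def parse_topology(text):
    """Return ({ip: (datacenter, rack)}, default_entry) from topology text."""
    nodes = {}
    default = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        datacenter, _, rack = value.partition(":")
        entry = (datacenter.strip(), rack.strip())
        if key.strip() == "default":
            default = entry
        else:
            nodes[key.strip()] = entry
    return nodes, default


def nodes_by_datacenter(nodes):
    """Group node IPs by datacenter name."""
    grouped = defaultdict(list)
    for ip, (datacenter, _rack) in sorted(nodes.items()):
        grouped[datacenter].append(ip)
    return dict(grouped)


def locate(ip, nodes, default):
    """Look up a node, falling back to the default entry for unknown IPs."""
    return nodes.get(ip, default)


nodes, default = parse_topology(TOPOLOGY)
analytics_nodes = nodes_by_datacenter(nodes)["DC-Analytics"]
```

A listing like `analytics_nodes` is a quick way to check which machines are meant to run task trackers, since the slides place those on all analytic nodes.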

Configuration Priorities
  • Data locality
      • Data locality – no really, biggest performance factor
  • Memory needs
      • Cassandra requires lots of memory
      • Hadoop requires lots of memory
      • Plan with your data model and analytics in mind
  • CPU needs
      • Cassandra doesn’t need a lot of CPU horsepower
      • Hadoop loves CPU cores
  • Interconnected
      • Analytic nodes need to be close to one another

Cassandra/Hadoop properties
  • Reference: org.apache.cassandra.hadoop.ConfigHelper.java
  • Basics
      • cassandra.thrift.address
      • cassandra.thrift.port
      • cassandra.partitioner.class
  • Consistency
      • cassandra.consistencylevel.read
      • cassandra.consistencylevel.write
  • Splits and batches
      • cassandra.input.split.size
      • cassandra.range.batch.size
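
As a rough sketch of how these settings might be written down, the Python below renders the property names from this slide in the standard Hadoop `<configuration>` XML layout. The property names come from the slide; every value is an illustrative assumption (the address is borrowed from the topology example), and `to_hadoop_xml` is a hypothetical helper. Where these properties actually get set in a given deployment is not covered by the slides.

```python
import xml.etree.ElementTree as ET

# Property names are from the slide; all values are illustrative assumptions.
CASSANDRA_HADOOP_PROPS = {
    # Basics
    "cassandra.thrift.address": "10.20.114.10",
    "cassandra.thrift.port": "9160",
    "cassandra.partitioner.class": "org.apache.cassandra.dht.RandomPartitioner",
    # Consistency
    "cassandra.consistencylevel.read": "ONE",
    "cassandra.consistencylevel.write": "ONE",
    # Splits and batches
    "cassandra.input.split.size": "65536",
    "cassandra.range.batch.size": "4096",
}


def to_hadoop_xml(props):
    """Render a dict of properties as Hadoop-style configuration XML."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


xml_text = to_hadoop_xml(CASSANDRA_HADOOP_PROPS)
```

Keeping the settings in one dictionary makes it easy to review the split and batch sizes alongside the consistency levels before a job runs.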

Future Work
  • Better data type handling (Cassandra-2777)
  • MapReduce over subsets of rows (Cassandra-1600)
  • MapReduce over secondary indexes (Cassandra-1600)
  • Pig pushdown projection
  • Pig pushdown filter
  • HCatalog support for Cassandra
  • Better Cassandra wide-row support (Cassandra-2688)
  • Support for immutable/snapshot inputs (Cassandra-2527)

Questions
  • Contact info
      • Jeremy Hanna
      • @jeromatron on twitter
      • jeremy.hanna1234 <at> gmail
      • jeromatron on irc (in #cassandra, #hadoop-pig, and #hadoop)



Editor's Notes

  • #3: Make this section interactive
      • How many are using Cassandra – find out why, what types of data
      • How many are using Hadoop – what types of data
      • What they would like to get from that data
  • #7: Mention Jacob’s involvement