Pig with Cassandra: Adventures in Analytics

This document discusses using Pig with Cassandra to perform analytics and data processing tasks. Pig allows running queries over Cassandra data and storing intermediate results in HDFS or Cassandra. Example uses include analytics, data exploration, validation, and correction. Configuration involves splitting the cluster into virtual datacenters and setting properties. Future work includes improving data type handling and adding support for secondary indexes and wide rows.

1. Pig with Cassandra
Adventures in Analytics

2. Motivation
- What's our need?
- How do we get at data in Cassandra with ad-hoc queries?
- Don't reinvent the wheel

3. Enter Pig
- Pig was created at Yahoo! as an abstraction for MapReduce
- Designed to eat anything
- loadstorefunc created for Cassandra

4. How it works
- Perform queries over all rows in a column family or set of column families
- Intermediate results stored in HDFS or CFS
- Can mix and match inputs and outputs

5. Uses
- Analytics
- Data exploration
  - How many items did I get from New Jersey?
- Data validation
  - How many items were missing a field and when were they created?
- Data correction
  - Company name correction over all data
- Expand Cassandra data model
  - Make a new column family for querying by US State and back-populate with Pig
- Bootstrap local dev environment
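The exploration and validation questions above come down to a filter followed by a count or a projection. In the talk these would be Pig scripts run over a Cassandra column family; what follows is only a minimal Python model of that logic over in-memory rows. The row layout, the row keys, and the column names (`state`, `company`, `created`) are hypothetical and do not come from the slides.

```python
from collections import Counter

# Hypothetical rows standing in for a column family: row key -> columns.
ITEMS = {
    "item-1": {"state": "NJ", "company": "Acme", "created": "2011-05-01"},
    "item-2": {"state": "TX", "company": "Acme", "created": "2011-05-03"},
    "item-3": {"state": "NJ", "created": "2011-06-10"},  # no "company" column
}


def count_by(rows, column):
    """Data exploration: count rows per value of one column."""
    return Counter(cols[column] for cols in rows.values() if column in cols)


def missing_field(rows, column, report_column):
    """Data validation: rows lacking `column`, reported with another column."""
    return {
        key: cols.get(report_column)
        for key, cols in rows.items()
        if column not in cols
    }


if __name__ == "__main__":
    # How many items did I get from New Jersey?
    print(count_by(ITEMS, "state")["NJ"])
    # Which items are missing a company name, and when were they created?
    print(missing_field(ITEMS, "company", "created"))
```

A real job would express the same steps as a load, a filter, and a group in Pig, so the work runs across the cluster instead of in one process.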

6. Pygmalion
- Figure in Greek mythology, sounds like Pig
- UDFs, example scripts for using Pig with Cassandra
- Used in production at The Dachis Group
- https://github.com/jeromatron/pygmalion/

7. Digging in the Dirt
- Pygmalion basic examples

8. Tips
- Develop incrementally
- Output intermediate data frequently to verify
- Validate data on input if possible
- Use Cassandra data type validation for inputs and outputs
- Pygmalion for tabular data
- Penny in Pig 0.9!

9. Cluster Configuration
- Split cluster – virtual datacenters
- Brisk (built-in Pig support in 1.0 beta 2+)
- Task trackers on all analytic nodes
- With HDFS:
  - Separate namenode/jobtracker
  - Data nodes on all analytic nodes
  - A few settings to bridge the two
  - Start the server processes
  - Distributed cache and intermediate data
- With Brisk:
  - Startup includes CFS, job tracker, and task trackers

10. Topology configuration

    # from conf/cassandra-topology.properties
    #
    # Cassandra Node IP=Data Center:Rack
    10.20.114.10=DC-Analytics:Rack-1b
    10.20.114.11=DC-Analytics:Rack-1b
    10.20.114.12=DC-Analytics:Rack-2b

    10.0.0.10=DC-Realtime-East:Rack-1a
    10.0.0.11=DC-Realtime-East:Rack-1a
    10.0.0.12=DC-Realtime-East:Rack-2a

    10.21.119.13=DC-Realtime-West:Rack-1c
    10.21.119.14=DC-Realtime-West:Rack-1c
    10.21.119.15=DC-Realtime-West:Rack-2c

    # default for unknown nodes
    default=DC-Realtime-West:Rack-1c
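To make the mapping in the topology file concrete, here is a small Python sketch that parses `ip=DataCenter:Rack` lines and falls back to the `default` entry for unknown nodes. It only models the lookup that the file describes; it is not Cassandra's own snitch code. The function names `parse_topology` and `locate` are made up for illustration, and the sample text is a subset of the entries on the slide.

```python
SAMPLE = """\
# Cassandra Node IP=Data Center:Rack
10.20.114.10=DC-Analytics:Rack-1b
10.20.114.12=DC-Analytics:Rack-2b
10.0.0.10=DC-Realtime-East:Rack-1a
# default for unknown nodes
default=DC-Realtime-West:Rack-1c
"""


def parse_topology(text):
    """Return ({ip: (datacenter, rack)}, default_entry) from properties text."""
    nodes, default = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        datacenter, _, rack = value.partition(":")
        entry = (datacenter.strip(), rack.strip())
        if key.strip() == "default":
            default = entry
        else:
            nodes[key.strip()] = entry
    return nodes, default


def locate(ip, nodes, default):
    """Look up a node's (datacenter, rack), using the default for unknown IPs."""
    return nodes.get(ip, default)


if __name__ == "__main__":
    nodes, default = parse_topology(SAMPLE)
    print(locate("10.20.114.10", nodes, default))
    print(locate("192.168.1.1", nodes, default))
    # List the nodes that belong to the analytics virtual datacenter.
    print(sorted(ip for ip, (dc, _) in nodes.items() if dc == "DC-Analytics"))
```

Listing the nodes per datacenter this way is a quick check that the analytic nodes are grouped into their own virtual datacenter, apart from the realtime ones.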

11. Configuration Priorities
- Data locality
- Data locality – no really, biggest performance factor
- Memory needs
  - Cassandra requires lots of memory
  - Hadoop requires lots of memory
  - Plan with your data model and analytics in mind
- CPU needs
  - Cassandra doesn't need a lot of CPU horsepower
  - Hadoop loves CPU cores
- Interconnected
  - Analytic nodes need to be close to one another

12. Cassandra/Hadoop properties
- Reference: org.apache.cassandra.hadoop.ConfigHelper.java
- Basics
  - cassandra.thrift.address
  - cassandra.thrift.port
  - cassandra.partitioner.class
- Consistency
  - cassandra.consistencylevel.read
  - cassandra.consistencylevel.write
- Splits and batches
  - cassandra.input.split.size
  - cassandra.range.batch.size
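As a rough illustration of how the properties listed above might be collected in one place, the Python sketch below renders them as a Hadoop-style `<configuration>` XML fragment. The property names come from the slide. Every value is an assumption chosen for illustration, and the helper `to_hadoop_xml` is hypothetical. Whether a given setup reads these from an XML file, from the job configuration, or from environment variables depends on the Cassandra and Pig versions in use, so check `ConfigHelper` for the release you run.

```python
import xml.etree.ElementTree as ET

# Property names are from the slide; all values are illustrative assumptions.
SETTINGS = {
    "cassandra.thrift.address": "10.20.114.10",
    "cassandra.thrift.port": "9160",
    "cassandra.partitioner.class": "org.apache.cassandra.dht.RandomPartitioner",
    "cassandra.consistencylevel.read": "ONE",
    "cassandra.consistencylevel.write": "ONE",
    "cassandra.input.split.size": "65536",
    "cassandra.range.batch.size": "4096",
}


def to_hadoop_xml(settings):
    """Render a dict of settings as a Hadoop-style configuration document."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_hadoop_xml(SETTINGS))
```

Keeping the settings in a single mapping makes it easier to see the three groups from the slide together: where to connect, which consistency levels to use, and how large the input splits and range batches are.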

13. Future Work
- Better data type handling (Cassandra-2777)
- MapReduce over subsets of rows (Cassandra-1600)
- MapReduce over secondary indexes (Cassandra-1600)
- Pig pushdown projection
- Pig pushdown filter
- HCatalog support for Cassandra
- Better Cassandra wide-row support (Cassandra-2688)
- Support for immutable/snapshot inputs (Cassandra-2527)

14. Questions
Contact info
- Jeremy Hanna
- @jeromatron on twitter
- jeremy.hanna1234 <at> gmail
- jeromatron on irc (in #cassandra, #hadoop-pig, and #hadoop)

Editor's Notes

#3: Make this section interactive
- How many are using Cassandra – find out why, what types of data
- How many are using Hadoop – what types of data
- What they would like to get from that data

#7: Mention Jacob's involvement