Cascalog
Data processing on Hadoop without the hassle


                                    Nathan Marz
                                     BackType
                                    @nathanmarz
What is Cascalog?

Layers, from highest to lowest level of abstraction:

   Cascalog    Variables and logic
   Cascading   Tuples, data workflows
   MapReduce   Key/value pairs, aggregation
Cascalog’s components

Cascading   (the job execution engine)
    +
 Datalog    (basis of the API design)
    +
 Clojure    (the host programming language)
Clojure

• General purpose programming language
• Dialect of Lisp that compiles to Java bytecode
• “Programmable programming language”:
  easy to build domain-specific languages
  (DSLs) in Clojure
Clojure examples

   Clojure code                 Result
   (+ 1 2 3)                    6
   (> 20 18)                    true
   (defn incr [x] (+ 1 x))
   (incr 3)                     4
Cascalog basics




 The “age” dataset
Cascalog basics

• Define and execute a query
• Where to emit results
• Output variables
• “Predicates”: constrain the output variables
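
A minimal sketch of such a query, not the code shown on the slide. It assumes the Cascalog 1.x `cascalog.api` namespace (names may differ in later releases) and uses a small in-memory `age` dataset invented here for illustration:

```clojure
(ns example.basics
  (:use cascalog.api))

;; Hypothetical stand-in for the "age" dataset: [person age] tuples.
;; An in-memory vector of tuples can be used as a generator.
(def age
  [["alice" 28]
   ["bob" 33]
   ["chris" 40]
   ["david" 25]])

;; ?<- defines and executes a query in one step.
(?<- (stdout)           ; where to emit results
     [?person]          ; output variables
     (age ?person 25))  ; predicate: keeps only people whose age is 25
```

The constant `25` in the predicate acts as a constraint on the second field of each `age` tuple, so only matching people are emitted.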
Predicates

• Input fields and output fields
• Fields can be constants or variables
• Variables are prefixed with ? or !
Predicates
• Functions
• Filters
• Aggregators
• Generators: finite sources of tuples
Example #1



    Generator   Filter
Example #2



Generator        Function
Example #3



Generator   Aggregator   Filter
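
The three combinations named in these examples can be sketched as follows. These are illustrative queries, not the ones on the slides, and they assume Cascalog 1.x namespaces (`cascalog.api`, `cascalog.ops`) plus a made-up in-memory `age` dataset:

```clojure
(ns example.predicates
  (:use cascalog.api)
  (:require [cascalog.ops :as c]))

;; Hypothetical [person age] tuples used as a generator.
(def age [["alice" 28] ["bob" 33] ["chris" 40] ["david" 25]])

;; Generator + filter: people younger than 30.
(?<- (stdout) [?person]
     (age ?person ?age)
     (< ?age 30))

;; Generator + function: :> binds the function's output to a new variable.
(?<- (stdout) [?person ?double-age]
     (age ?person ?age)
     (* ?age 2 :> ?double-age))

;; Generator + aggregator + filter: count the people younger than 30.
;; The order of the predicates does not change the meaning of the query.
(?<- (stdout) [?count]
     (age _ ?age)
     (c/count ?count)
     (< ?age 30))
```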
Join example

Sharing a variable between generators triggers a join
Joins are an implementation detail
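
A small sketch of an implicit join, assuming the Cascalog 1.x `cascalog.api` namespace. The `age` and `gender` datasets are invented here for illustration:

```clojure
(ns example.joins
  (:use cascalog.api))

;; Hypothetical datasets sharing a person field.
(def age    [["alice" 28] ["bob" 33] ["chris" 40]])
(def gender [["alice" "f"] ["bob" "m"] ["emily" "f"]])

;; ?person appears in both generators, so the two datasets are
;; joined on it. No join keyword is written anywhere in the query.
(?<- (stdout) [?person ?age ?gender]
     (age ?person ?age)
     (gender ?person ?gender))
```

Only people present in both datasets are emitted, which is why the query reads as a set of constraints rather than as a join plan.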
Demo time!
Why another query language for Hadoop?

Existing tools cause too much Accidental Complexity
Accidental complexity

  Complexity caused by the tool used
  to solve a problem rather than the
  problem itself
Accidental complexity


• Distinct query languages cause accidental
  complexity
• Example: SQL injection
Query language

• We want:
 • Ability to abstract
 • Ability to compose
Abstraction




Clojure function that returns a subquery
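
One way this kind of abstraction might look, as a sketch rather than the slide's code. It assumes the Cascalog 1.x `cascalog.api` namespace; `younger-than` and the sample data are hypothetical names introduced here:

```clojure
(ns example.abstraction
  (:use cascalog.api))

;; Hypothetical datasets.
(def age    [["alice" 28] ["bob" 33] ["david" 25]])
(def gender [["alice" "f"] ["bob" "m"] ["david" "m"]])

;; An ordinary Clojure function that builds and returns a subquery.
;; <- defines a query without executing it.
(defn younger-than
  [age-src max-age]
  (<- [?person]
      (age-src ?person ?age)
      (< ?age max-age)))

;; The returned subquery is used like any other generator.
(let [young (younger-than age 30)]
  (?<- (stdout) [?person ?gender]
       (young ?person)
       (gender ?person ?gender)))
```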
Abstraction




Defining and using custom operation
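
A sketch of a custom operation, assuming the Cascalog 1.x `defmapcatop` macro and `cascalog.ops` namespace (later releases may name these differently). The `sentences` dataset is made up for illustration:

```clojure
(ns example.custom-op
  (:use cascalog.api)
  (:require [cascalog.ops :as c]))

;; Hypothetical one-field dataset of sentences.
(def sentences [["the quick brown fox"] ["the lazy dog"]])

;; A custom "mapcat" operation: emits one output tuple per
;; whitespace-separated word of the input sentence.
(defmapcatop split [sentence]
  (seq (.split sentence "\\s+")))

;; Word count: the custom operation is used like any built-in predicate.
(?<- (stdout) [?word ?count]
     (sentences ?sentence)
     (split ?sentence :> ?word)
     (c/count ?count))
```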
Composability




Dynamic query with parameterized operation
Composability




 “Predicate macro”
Composability

       expands to




Using a predicate macro
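
A sketch of a predicate macro for computing an average, assuming Cascalog 1.x (`cascalog.api` including its `div` function, and `cascalog.ops`). The `average` definition and the `age` data are written here for illustration and may differ from the slide:

```clojure
(ns example.predicate-macro
  (:use cascalog.api)
  (:require [cascalog.ops :as c]))

;; Hypothetical [person age] tuples.
(def age [["alice" 28] ["bob" 33] ["chris" 40] ["david" 25]])

;; A predicate macro: one predicate that expands into three.
;; !v is the input variable and !avg the output variable.
(def average
  (<- [!v :> !avg]
      (c/count !c)
      (c/sum !v :> !s)
      (div !s !c :> !avg)))

;; Used like a built-in aggregator.
(?<- (stdout) [?avg]
     (age _ ?age)
     (average ?age :> ?avg))
```

Because the macro expands into `c/count` and `c/sum`, the composed average is built from the same aggregators a hand-written query would use.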
Contrast to Pig




“Average” is 300 lines of code in Pig
Optimized aggregators in Cascalog




Implementation of count and sum
Why another query language for Hadoop?

Existing tools cause too much Accidental Complexity
Composability




Value normalization example #1
Composability




Value normalization example #2
Composability


For each id:
 select value with the biggest timestamp




   Value normalization algorithm
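
The selection rule itself can be illustrated in plain Clojure over in-memory data. This is not the Cascalog implementation from the talk; `latest-values` and the `[id value timestamp]` tuple layout are assumptions made for this sketch:

```clojure
;; Each tuple is assumed to be [id value timestamp].
(defn latest-values
  "For each id, keep the value whose timestamp is the largest.
  Returns a map of id -> value."
  [tuples]
  (->> tuples
       (group-by first)                      ; id -> all tuples for that id
       (map (fn [[id rows]]
              (let [[_ value _] (apply max-key #(nth % 2) rows)]
                [id value])))
       (into {})))

(latest-values [[1 "a" 10]
                [1 "b" 20]
                [2 "c" 5]])
;; => {1 "b", 2 "c"}
```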
Composability




Implementing value normalization
Composability




Using value normalization
Try Cascalog yourself!
Project Page
http://www.github.com/nathanmarz/cascalog

Introductory Tutorial
http://nathanmarz.com/blog/introducing-cascalog/


       5 minutes to install Clojure, Hadoop, and
       Cascalog locally! See project README
BackType is hiring

          Think Cascalog’s cool?
 Come build amazing software at BackType.



http://www.backtype.com/jobs
Questions?


Follow me on Twitter at @nathanmarz
      nathan.marz@gmail.com

Cascalog at Strange Loop