© Cloudera, Inc. All rights reserved.
Lab #1 - VM setup
http://tiny.cloudera.com/StrataLab1
Lab #2 - Create a movies dataset
http://tiny.cloudera.com/StrataLab2
© Cloudera, Inc. All rights reserved.
Strata London 2015
Building an Apache Hadoop
Data Application
Ryan Blue, Joey Echeverria, Tom White
© Cloudera, Inc. All rights reserved.
Content for today’s tutorial
●The Hadoop Ecosystem
●Storage on Hadoop
●Movie ratings app: Data ingest
●Movie ratings app: Data analysis
© Cloudera, Inc. All rights reserved.
The Hadoop Ecosystem
© Cloudera, Inc. All rights reserved.
A Hadoop Stack
© Cloudera, Inc. All rights reserved.
Processing frameworks
●Code: MapReduce, Crunch, Spark, Tez
●SQL: Hive, Impala, Phoenix, Trafodion, Drill, Presto
●Tuples: Cascading, Pig
●Streaming: Spark streaming (micro-batch), Storm, Samza
© Cloudera, Inc. All rights reserved.
Coding frameworks
●Crunch
● A layer around MR (or Spark) that simplifies writing pipelines
●Spark
● A completely new framework for processing pipelines
● Takes advantage of memory, runs a DAG without extra map phases
●Tez
● DAG-based, like Spark’s execution engine without user-level API
© Cloudera, Inc. All rights reserved.
SQL on Hadoop
●Hive for batch processing
●Impala for low-latency queries
●Phoenix and Trafodion for transactional queries on HBase
© Cloudera, Inc. All rights reserved.
Ingest tools
●Relational: Sqoop, Sqoop2
●Record channel: Kafka, Flume
●Files: NiFi
●Numerous commercial options
© Cloudera, Inc. All rights reserved.
Ingest tools
●Relational: Sqoop, Sqoop2
●Record channel: Kafka, Flume
●Files: NiFi
[Diagram: Database and App sending records through Flume agents into HDFS on Hadoop]
© Cloudera, Inc. All rights reserved.
Relational DB to Hadoop
●Sqoop
● CLI to run MR-based import jobs
●Sqoop2
● Addresses Sqoop's configuration problems with a credentials service
● More flexible: can run on non-MapReduce frameworks
● New and under active development
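As a rough illustration (connection details, table name, and paths are made up, not the lab's), a Sqoop import into HDFS looks something like:

# Hypothetical import of a "ratings" table from MySQL into HDFS as Avro
sqoop import \
  --connect jdbc:mysql://db.example.com/movies \
  --username etl --password-file /user/etl/.db-password \
  --table ratings \
  --as-avrodatafile \
  --target-dir /data/ratings/raw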
© Cloudera, Inc. All rights reserved.
Ingest tools
●Relational: Sqoop, Sqoop2
●Record channel: Kafka, Flume
●Files: NiFi
[Diagram: Database and App sending records through Flume agents and channels into HDFS on Hadoop]
© Cloudera, Inc. All rights reserved.
Record streams to Hadoop
●Flume - source, channel, sink architecture
● Well-established and integrated with other tools
● No order guarantee, duplicates are possible
●Kafka - pub-sub model for low latencies
● Partitioned, provides ordering guarantees, easier to eliminate duplicates
● More resilient to node failure with consumer groups
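For orientation, a minimal Flume agent wiring a source through a channel into an HDFS sink might look like this sketch (agent, port, and path names are illustrative, not from the labs):

# Hypothetical agent: one Avro source, one memory channel, one HDFS sink
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1
agent1.sources.src1.type = avro
agent1.sources.src1.bind = 0.0.0.0
agent1.sources.src1.port = 41414
agent1.sources.src1.channels = ch1
agent1.channels.ch1.type = memory
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /data/ratings/incoming
agent1.sinks.sink1.channel = ch1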
© Cloudera, Inc. All rights reserved.
Files to Hadoop
●NiFi
● Web GUI for drag & drop configuration of a data flow
● Enterprise features: back-pressure, monitoring, lineage, etc.
● Integration to and from spool directory, HTTP, FTP, SFTP, and HDFS
● Originally for files, capable of handling record-based streams
● Currently in the Apache Incubator, and widely deployed privately
© Cloudera, Inc. All rights reserved.
NiFi
© Cloudera, Inc. All rights reserved.
Data storage in Hadoop
© Cloudera, Inc. All rights reserved.
HDFS Blocks
●Blocks
● Increase parallelism
● Balance work
● Replicated
●Configured by dfs.blocksize
● Client-side setting
[Diagram: data1.avro and data2.avro split into replicated HDFS blocks]
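Because dfs.blocksize is a client-side setting, it can be overridden per write; a hypothetical example (path and size are placeholders):

# Write a file with a 256 MB block size instead of the cluster default
hdfs dfs -D dfs.blocksize=268435456 -put data1.avro /data/movies/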
© Cloudera, Inc. All rights reserved.
Splittable File Formats
●Splittable: Able to process part of a file
● Process blocks in parallel
●Avro is splittable
●Gzipped content is not splittable
●CSV is effectively not splittable
© Cloudera, Inc. All rights reserved.
File formats
●Existing formats: XML, JSON, Protobuf, Thrift
●Designed for Hadoop: SequenceFile, RCFile, ORC
●Makes me sad: Delimited text
●Recommended: Avro or Parquet
© Cloudera, Inc. All rights reserved.
Avro
●Recommended row-oriented format
● Broken into blocks with sync markers for splitting
● Binary encoding with block-level compression
●Avro schema
● Required to read any binary-encoded data!
● Written in the file header
●Flexible object models
© Cloudera, Inc. All rights reserved.
Avro in-memory object models
●generic
● Object model that can be used with any schema
●specific - compile schema to java object
● Generates type-safe runtime objects
●reflect - java object to schema
● Uses existing classes and objects
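A minimal sketch of the generic object model in Java, using a made-up ratings schema (the field names are assumptions, not the lab schema):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class GenericExample {
  public static void main(String[] args) {
    String schemaJson = "{\"type\": \"record\", \"name\": \"Rating\", \"fields\": ["
        + "{\"name\": \"movie_id\", \"type\": \"long\"},"
        + "{\"name\": \"rating\", \"type\": \"int\"}]}";
    Schema schema = new Schema.Parser().parse(schemaJson);

    // generic: works with any schema, fields are accessed by name
    GenericRecord record = new GenericData.Record(schema);
    record.put("movie_id", 42L);
    record.put("rating", 5);
    System.out.println(record);
  }
}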
© Cloudera, Inc. All rights reserved.
Lab #3 - Using avro-tools
http://tiny.cloudera.com/StrataLab3
© Cloudera, Inc. All rights reserved.
Row- and column-oriented formats
●Able to reduce I/O when projecting columns
●Better encoding and compression
Images © Twitter, Inc.
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
© Cloudera, Inc. All rights reserved.
Parquet
●Recommended column-oriented format
● Splittable by organizing into row groups
● Efficient binary encoding, supports compression
●Uses other object models
● Record construction API rather than object model
● parquet-avro - Use Avro schemas with generic or specific records
● parquet-protobuf, parquet-thrift, parquet-hive, etc.
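A hedged sketch of the parquet-avro route (class and package names follow recent parquet-mr releases; the output path and schema are placeholders):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class ParquetAvroExample {
  // Write Avro generic records into a Parquet file using an Avro schema
  public static void writeRatings(Schema schema, Iterable<GenericRecord> records)
      throws Exception {
    try (ParquetWriter<GenericRecord> writer =
        AvroParquetWriter.<GenericRecord>builder(new Path("/data/ratings/part-0.parquet"))
            .withSchema(schema)
            .build()) {
      for (GenericRecord record : records) {
        writer.write(record);
      }
    }
  }
}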
© Cloudera, Inc. All rights reserved.
Parquet trade-offs
●Rows are buffered into groups that target a final size
●Row group size
● Memory consumption grows with row group size
● Larger groups get more I/O benefit and better encoding
●Memory consumption grows for each open file
© Cloudera, Inc. All rights reserved.
Lab #4 - Using parquet-tools
http://tiny.cloudera.com/StrataLab4
© Cloudera, Inc. All rights reserved.
Partitioning
●Splittable file formats aren’t enough
●Not processing data is better than processing in parallel
●Organize data to avoid processing: Partitioning
●Use HDFS paths for a coarse index: data/y=2015/m=03/d=14/
© Cloudera, Inc. All rights reserved.
Partitioning Caution
●Partitioning in HDFS is the primary index to data
● Should reflect the most common access pattern
● Test partition strategies for multiple workloads
●Should balance file size with workload
● Lots of small files are bad for HDFS - partitioning should be more coarse
● Larger files take longer to find data - partitioning should be more specific
© Cloudera, Inc. All rights reserved.
Implementing partitioning
●Build your own - not recommended
●Hive and Impala managed
● Partitions are treated as data columns
● Insert statements must include partition calculations
●Kite managed
● Partition strategy configuration file
● Compatible with Hive and Impala
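To make the Hive/Impala-managed approach concrete, a hedged HiveQL sketch (table and column names are invented): partition columns are declared in the DDL, and inserts must compute their values.

-- Hypothetical partitioned table; y/m/d become directories under the table path
CREATE TABLE ratings (movie_id BIGINT, rating INT, ts TIMESTAMP)
PARTITIONED BY (y INT, m INT, d INT)
STORED AS AVRO;

-- Dynamic-partition insert: partition values are calculated from each row
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE ratings PARTITION (y, m, d)
SELECT movie_id, rating, ts, year(ts), month(ts), day(ts)
FROM ratings_staging;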
© Cloudera, Inc. All rights reserved.
Kite
●High-level data API for Hadoop
● Built around datasets, not files
● Tasks like partitioning are done internally
●Tools built around the data API
● Command-line
● Integration in Flume, Sqoop, NiFi, etc.
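A hedged sketch of the Kite CLI workflow (dataset name, file names, and the partition strategy contents are assumptions; the labs show the exact commands):

# Hypothetical partition strategy file (partitions.json), deriving year/month/day
# from a "ts" field
[
  {"type": "year",  "source": "ts"},
  {"type": "month", "source": "ts"},
  {"type": "day",   "source": "ts"}
]

# Create a partitioned dataset from an Avro schema and the strategy above
kite-dataset create ratings --schema rating.avsc --partition-by partitions.json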
© Cloudera, Inc. All rights reserved.
Lab #5 - Create a partitioned
dataset
http://tiny.cloudera.com/StrataLab5
© Cloudera, Inc. All rights reserved.
Movie ratings app:
Data ingest pipeline
© Cloudera, Inc. All rights reserved.
Movie ratings scenario
●Your company runs a web application where users can rate movies
●You want to use Hadoop to analyze ratings over time
● Avoid scraping the production database for changes
● Instead, you want to log every rating submitted
[Diagram: Ratings App backed by a Database]
© Cloudera, Inc. All rights reserved.
Movie ratings app
●Log ratings to Flume
●Otherwise unchanged
[Diagram: Ratings App logging to Flume agents, which write to a Ratings dataset in HDFS on Hadoop; the Database is unchanged]
© Cloudera, Inc. All rights reserved.
Lab #6 - Create a Flume pipeline
http://tiny.cloudera.com/StrataLab6
© Cloudera, Inc. All rights reserved.
Movie ratings app:
Analyzing ratings data
© Cloudera, Inc. All rights reserved.
Movie ratings analysis
●Now you have several months of data
●You can query it in Hive and Impala for most cases
●Some questions are difficult to formulate as SQL
● Are there any movies that people either love or hate?
© Cloudera, Inc. All rights reserved.
Analyzing ratings
●Map
● Extract the key (movie_id) and value (rating) from each record
●Reduce
● Receive all of the ratings grouped by movie_id
● Count the number of ratings the movie received at each rating value
● If the counts show two peaks, output the movie_id and the counts
● Peak detection: the difference between successive counts goes from negative to positive
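A minimal sketch of the reduce-side logic described above, in plain Java rather than the lab's Crunch code (a 1-5 rating scale is an assumption):

public class PeakDetection {
  // Given all ratings for one movie, build a histogram over the rating values
  // and report whether it is bimodal: a second peak after a valley.
  public static boolean hasTwoPeaks(Iterable<Integer> ratings) {
    int[] counts = new int[6];               // counts[1..5] = ratings at that value
    for (int rating : ratings) {
      counts[rating]++;
    }
    boolean sawNegative = false;
    for (int r = 2; r <= 5; r++) {
      int diff = counts[r] - counts[r - 1];
      if (diff < 0) {
        sawNegative = true;                   // counts fell after a first peak
      } else if (diff > 0 && sawNegative) {
        return true;                          // counts rise again: a second peak
      }
    }
    return false;
  }
}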
© Cloudera, Inc. All rights reserved.
Crunch background
●Functions stacked up before a group-by operation form a map phase
●Similarly, functions stacked after a group-by form a reduce phase
●Additional group-by operations automatically set up more MapReduce rounds
PTable<Long, Double> table = collection
    .by(new GetMovieID(), Avros.longs())           // map: key each record by movie ID
    .mapValues(new GetRating(), Avros.ints())      // map: extract the rating value
    .groupByKey()                                  // shuffle: group ratings by movie ID
    .mapValues(new AverageRating(), Avros.doubles()); // reduce: average rating per movie
© Cloudera, Inc. All rights reserved.
Lab #7 - Analyze ratings with Crunch
http://tiny.cloudera.com/StrataLab7
© Cloudera, Inc. All rights reserved.
Thank you
blue@cloudera.com
joey@rocana.com
tom@cloudera.com
http://ingest.tips/