Hadoop and Big Data Training
Lessons learned
What’s Cloudera?
• Leading company in the NoSQL and cloud computing space
• Most popular Hadoop distribution
• Ex-employees from Google, Facebook, Oracle and other leading tech companies
• Sample billion-dollar client list: eBay, JPMorgan Chase, Experian, Groupon, Morgan Stanley, Nokia, Orbitz, National Cancer Institute, RIM, The Walt Disney Company
• Consulting and training services
Why this training?
• MongoDB is great for OLTP
• It is not an OLAP DB, and not really aspiring to become one
• Big Data is coming in, creating a need for more advanced analysis processes
Intended audience
• Software engineers and friends
What’s Hadoop?
• The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
• Modules:
  • Hadoop Common
  • Hadoop Distributed File System (HDFS™)
  • Hadoop YARN
  • Hadoop MapReduce
How does it fit in our Big Goal?
• MongoDB for OLTP
• RDBMS (MySQL) for config data
• Hadoop for OLAP
What’s MapReduce?
• MapReduce is a programming model for processing large data sets, and the name of an implementation of the model by Google. MapReduce is typically used to do distributed computing on clusters of computers. (Wikipedia)
• Practically?
  • Can perform computations in a distributed fashion
  • Highly scalable
  • Inherently highly available
  • Fault tolerant by design
Bindings
• Native Java API
• Any language, even scripting ones, using Streaming
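To make the Streaming contract concrete, here is a minimal word-count pair sketched in Python (our own illustration, not part of the original training material). In a real Streaming job each function would be a separate executable reading lines from stdin and printing tab-separated key/value pairs to stdout; here they are chained locally, with a sort standing in for the shuffle:

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit a ('word', 1) pair for every word seen
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: sum counts per word; assumes the pairs arrive
    # sorted by key, which the shuffle/sort phase guarantees
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# Chain map -> sort -> reduce locally, as Hadoop would across the cluster
counts = dict(reducer(sorted(mapper(["the cat sat", "the cat"]))))
print(counts)  # -> {'cat': 2, 'sat': 1, 'the': 2}
```

Such scripts are typically submitted with the hadoop-streaming jar via its -mapper and -reducer options; the exact jar path depends on the distribution.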
MapReduce framework vs. MapReduce functionality
• Several NoSQL technologies provide MR functionality
MR functionality
• A compromise…
  • e.g. MongoDB
  • CouchDB, where even the equivalent of select * from foo; is a MapReduce view
MapReduce V1 vs. MapReduce V2
• MR V1 cannot scale past 4k nodes per cluster
• More importantly for our goals, MR V1 is monolithic
MR V2 YARN
• Pluggable implementations on top of Hadoop
• A whole new set of problems can be solved:
  • Graph processing
  • MPI
MR V1 Architecture
MR V1 daemons
• Client
• NameNode (HDFS)
• JobTracker
• DataNode (HDFS) + TaskTracker
MR V2 Architecture
MR V2 daemons
• Client
• ResourceManager / ApplicationsManager
• NodeManager
• ApplicationMaster (resource containers)
Data Locality in Hadoop
• First replica placed on the client’s node (or a random node if the client is off-cluster)
• Second replica off-rack
• Third on the same rack as the second, but on a different node
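The placement policy can be sketched as a toy function (our own simplification; the real BlockPlacementPolicyDefault also accounts for node load and free space):

```python
import random

def place_replicas(topology, client_node=None):
    """Pick 3 nodes for a block's replicas, mimicking the HDFS defaults.
    topology: dict mapping rack name -> list of node names."""
    all_nodes = [(r, n) for r, nodes in topology.items() for n in nodes]
    # 1st replica: the client's own node, or a random one for off-cluster clients
    if client_node is None:
        first = random.choice(all_nodes)
    else:
        first = next((r, n) for r, n in all_nodes if n == client_node)
    # 2nd replica: a node on a different rack than the first
    off_rack = [(r, n) for r, n in all_nodes if r != first[0]]
    second = random.choice(off_rack)
    # 3rd replica: same rack as the second, different node
    same_rack = [(r, n) for r, n in all_nodes
                 if r == second[0] and n != second[1]]
    third = random.choice(same_rack)
    return [first, second, third]

topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas(topology, client_node="n1"))
```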
HDFS - Architecture
• Hot (a good fit):
  • Very large files
  • Streaming data access (seek time ~<1% of transfer time)
  • Commodity hardware (no iPhones…)
• Not (a poor fit):
  • Low-latency data access
  • Lots of small files
  • Multiple writers, arbitrary file modifications
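The seek-time bullet above is why HDFS blocks are so large. Working the rule of thumb backwards with illustrative numbers (10 ms seek time, 100 MB/s transfer rate; both are assumptions, not benchmarks):

```python
seek_time = 0.010            # assumed seek time: 10 ms
transfer_rate = 100 * 10**6  # assumed transfer rate: 100 MB/s
target_ratio = 0.01          # seek time should be ~1% of transfer time

# Transfer must take seek_time / target_ratio = 1 s,
# so a block must hold a full second's worth of transfer.
block_size = transfer_rate * seek_time / target_ratio
print(block_size / 10**6)  # -> 100.0 (MB), the order of HDFS's 64/128 MB defaults
```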
HDFS – NameNode
• NameNode (master)
  • Filesystem tree
  • Metadata for all files and directories
  • Namespace image and edit log
• Secondary NameNode
  • Not a backup node!
  • Periodically merges the edit log into the namespace image
  • Could take ~30 minutes to come back online
HDFS HA - NameNode
• Hadoop 2.x brings in HDFS HA
• Active/standby configuration for NameNodes
• Gotchas:
  • Shared storage for the edit log
  • DataNodes send block reports to both NameNodes
  • Failover needs to be transparent to clients
HDFS – Read
HDFS - Read
• Client requests the file from the NameNode (locations for the first 10 blocks)
• The NameNode returns the addresses of the DataNodes holding each block
• The client contacts the DataNodes directly
• Blocks are read in order
HDFS - Write
HDFS - Write
• Initial RPC call to the NameNode to create the file
  • Permission and file-existence checks in the NameNode, etc.
• As we write, data queues up in the client, which asks the NameNode for DataNodes to store it
• The list of DataNodes forms a pipeline
• An ack queue verifies that all replicas have been written
• Close the file
Job Configuration
• setInputFormatClass
• setOutputFormatClass
• setMapperClass
• setReducerClass
• set(Map)OutputKeyClass
• set(Map)OutputValueClass
• setNumReduceTasks
Job Configuration
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true); // or job.submit() to return without blocking
Job Configuration
• Simple to invoke:
  bin/hadoop jar WordCount inputPath outputPath
MapReduce phases
Mapper – Life cycle
• Mapper inputs <K1,V1>, outputs <K2,V2>
Shuffle and Sort
• All values for the same key are guaranteed to end up in the same reducer, sorted by key
• Mapper output <K2,V2>: <‘the’,1>, <‘the’,2>, <‘cat’,1>
• Reducer input <K2,[V2]>: <‘cat’,[1]>, <‘the’,[1,2]>
Reducer – Life cycle
• Reducer inputs <K2,[V2]>, outputs <K3,V3>
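The mapper, shuffle/sort and reducer contracts above can be simulated in a few lines (a local sketch of the model, not Hadoop code), reproducing the word-count grouping from the shuffle-and-sort slide:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Tiny in-memory MapReduce: map -> shuffle/sort -> reduce."""
    # Map phase: each record yields zero or more <K2,V2> pairs
    intermediate = [kv for rec in records for kv in map_fn(rec)]
    # Shuffle and sort: group all values by key, keys in sorted order
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: each <K2,[V2]> yields one <K3,V3>
    return [reduce_fn(k, groups[k]) for k in sorted(groups)]

# Word count over the slides' example input
count_map = lambda line: [(w, 1) for w in line.split()]
count_reduce = lambda word, ones: (word, sum(ones))
print(run_mapreduce(["the cat", "the"], count_map, count_reduce))
# -> [('cat', 1), ('the', 2)]
```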
Hadoop interfaces and classes
 >=0.23 new API favoring abstract classes
 <0.23 old API with interfaces
 Packages mapred.* OLD API, mapreduce.* NEW API
31
Speculative execution
• At least one minute into a mapper or reducer, the JobTracker decides based on the task’s progress
• Each task’s progress is compared against the average progress, with a configurable threshold
• The task is relaunched on a different node and the two attempts race…
• Sometimes not wanted:
  • Cluster utilization
  • Non-idempotent partial output (OutputCollector)
Input/Output Formats
• InputFormat<K,V> -> FileInputFormat<K,V> -> TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat
• Default is TextInputFormat: key = byte offset, value = line
• KeyValueTextInputFormat: key <tab> value
• SequenceFileInputFormat: binary, splittable format
• Corresponding output formats exist for each
Compression
• The billion-files problem
  • 300 B/file * 10^9 files -> ~300 GB of NameNode RAM
• Big Data storage
• Solutions:
  • Containers
  • Compression
Containers
• HAR files (splittable)
• Sequence files, RC files, Avro files (splittable, compressible)
Compression codecs
• The LZO, LZ4 and Snappy codecs offer the best value for money in compression speed
• bzip2 offers native splitting but can be slow
Long story short
• Use compression with sequence files, or
• Use a compression format that supports splitting, or
• Split the file into chunks at the application layer, with chunk size aligned to the HDFS block size, or
• Don’t bother
Partitioner
• Default is HashPartitioner
• Why implement our own partitioner?
• Sample case: total ordering
  • 1 reducer
  • Multiple reducers?
Partitioner
• TotalOrderPartitioner
• Samples the input to determine the key ranges per reducer for maximum performance
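A partitioner is just a function from a key to a reducer index. The default hashing and the range-based total-order variant can be sketched as follows (our own illustration; the real TotalOrderPartitioner reads its split points, obtained by sampling, from a file):

```python
def hash_partition(key, num_reducers):
    # Default behaviour: the same key always lands on the same reducer
    return hash(key) % num_reducers

def range_partition(key, boundaries):
    # Total ordering: reducer i receives keys < boundaries[i], the last
    # reducer gets the rest, so concatenating the reducers' outputs in
    # order yields a globally sorted result
    for i, bound in enumerate(boundaries):
        if key < bound:
            return i
    return len(boundaries)

# With sampled split points ['g', 'p'], three reducers cover the key
# ranges [..'g'), ['g'..'p') and ['p'..]
print([range_partition(k, ["g", "p"]) for k in ("cat", "hat", "zoo")])
# -> [0, 1, 2]
```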
Hadoop Ecosystem
• Pig
  • Apache Pig is a platform for analyzing large data sets. Pig’s language, Pig Latin, lets you specify a sequence of data transformations such as merging data sets, filtering them, and applying functions to records or groups of records.
  • Procedural language, lazily evaluated, with pipeline split support
  • Closer to developers (or relational algebra aficionados) than not
Hadoop Ecosystem
• Hive
  • Access to Hadoop clusters for non-developers
    • Data analysts, data scientists, statisticians, SDMs etc.
  • Subset of SQL-92 plus Hive extensions
    • INSERT OVERWRITE, but no UPDATE or DELETE
    • No transactions
    • No indexes; parallel scanning instead
    • “Near” real time
    • Only equality joins
Hadoop Ecosystem
• Mahout
  • Collaborative filtering
  • User- and item-based recommenders
  • K-Means and Fuzzy K-Means clustering
  • Mean Shift clustering
  • Dirichlet process clustering
  • Latent Dirichlet Allocation
  • Singular value decomposition
  • Parallel frequent pattern mining
  • Complementary Naive Bayes classifier
  • Random forest decision-tree-based classifier
Hadoop ecosystem
• Algorithmic categories:
  • Classification
  • Clustering
  • Pattern mining
  • Regression
  • Dimension reduction
  • Recommendation engines
  • Vector similarity
  • …
Reporting Services
• Pentaho, MicroStrategy and Jasper can all hook up to a Hadoop cluster
References
• Hadoop: The Definitive Guide, 3rd edition
• hadoop.apache.org
• Hadoop in Practice
• Cloudera custom training slides


Editor's Notes

  • #10 (MR functionality): combiners are invoked by design in MongoDB
  • #39 (Partitioner): 1 reducer is the default config