Hadoop - Introduction to Hadoop
Hadoop Introduction
Data Scalability Problems
• Search Engine
o 10KB / doc * 20B docs = 200TB
o Reindex every 30 days: 200TB / 30 days ≈ 6.7 TB/day
• Log Processing / Data Warehousing
o 0.5KB/events * 3B pageview events/day = 1.5TB/day
o 100M users * 5 events * 100 feed/event * 0.1KB/feed = 5TB/day
• Multipliers: 3 copies of data, 3-10 passes of raw data
• Processing Speed (Single Machine)
o 2-20MB/second * 100K seconds/day = 0.2-2 TB/day
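The arithmetic on this slide can be rechecked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 10^9 KB = 10^6 MB), and the variable names are illustrative rather than taken from the slides.

```python
# Back-of-the-envelope check of the data-volume figures above.
# Assumption: decimal units, i.e. 1 TB = 1e9 KB = 1e6 MB.
KB_PER_TB = 1e9
MB_PER_TB = 1e6

# Search engine: 10 KB per document, 20 billion documents.
index_tb = 10 * 20e9 / KB_PER_TB
reindex_tb_per_day = index_tb / 30          # full reindex every 30 days

# Log processing: 0.5 KB per event, 3 billion page-view events per day.
pageview_tb_per_day = 0.5 * 3e9 / KB_PER_TB

# Feeds: 100M users * 5 events * 100 feeds/event * 0.1 KB/feed.
feed_tb_per_day = 100e6 * 5 * 100 * 0.1 / KB_PER_TB

# Single machine: 2-20 MB/second over roughly 100K seconds per day.
single_machine_tb_per_day = [mb_s * 100_000 / MB_PER_TB for mb_s in (2, 20)]

print(index_tb, round(reindex_tb_per_day, 2))
print(pageview_tb_per_day, feed_tb_per_day)
print(single_machine_tb_per_day)
```

Comparing the daily volumes with the single-machine range shows why one machine cannot keep up once the replication and multi-pass multipliers are applied.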
Google’s Solution
• Google File System – SOSP’2003
• Map-Reduce – OSDI’2004
• Sawzall – Scientific Programming Journal’2005
• Big Table – OSDI’2006
• Chubby – OSDI’2006
Open Source World’s Solution
• Google File System – Hadoop Distributed FS
• Map-Reduce – Hadoop Map-Reduce
• Sawzall – Pig, Hive, JAQL
• Big Table – Hadoop HBase, Cassandra
• Chubby – Zookeeper
Hadoop History
• Jan 2006 – Doug Cutting joins Yahoo
• Feb 2006 – Hadoop splits out of Nutch and Yahoo starts using it.
• Dec 2006 – Yahoo creating 100-node Webmap with Hadoop
• Apr 2007 – Yahoo on 1000-node cluster
• Dec 2007 – Yahoo creating 1000-node Webmap with Hadoop
• Jan 2008 – Hadoop made a top-level Apache project
• Sep 2008 – Hive added to Hadoop as a contrib project
Hadoop Introduction
• Written in Java
o Does work with other languages
• Runs on
o Linux, Windows and more
o Commodity hardware with high failure rate
Current Status of Hadoop
• Largest Cluster
o 2000 nodes (8 cores, 4TB disk)
• Used by 40+ companies / universities over the world
o Yahoo, Facebook, etc
o Cloud Computing Donation from Google and IBM
• Startup focusing on providing services for Hadoop
o Cloudera
Hadoop Components
• Hadoop Distributed File System (HDFS)
• Hadoop Map-Reduce
• Contrib projects
o Hadoop Streaming
o Pig / JAQL / Hive
o HBase
o Hama / Mahout
Hadoop Distributed File System
Goals of HDFS
• Very Large Distributed File System
o 10K nodes, 100 million files, 10 PB
• Convenient Cluster Management
o Load balancing
o Node failures
o Cluster expansion
• Optimized for Batch Processing
o Allows moving computation to the data
o Maximize throughput
HDFS Architecture
HDFS Details
• Data Coherency
o Write-once-read-many access model
o Client can only append to existing files
• Files are broken up into blocks
o Typically 128 MB block size
o Each block replicated on multiple DataNodes
• Intelligent Client
o Client can find location of blocks
o Client accesses data directly from DataNode
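The block and replica bookkeeping described above can be illustrated with a toy model. This is a sketch under stated assumptions, not HDFS code: `split_into_blocks` and `place_replicas` are hypothetical helpers, the replication factor of 3 is borrowed from the "3 copies of data" multiplier earlier in the deck, and the round-robin placement stands in for whatever placement policy the real NameNode applies.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical block size quoted above
REPLICATION = 3                 # assumption: matches the "3 copies of data" multiplier


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement of each block on `replication` distinct nodes.

    Assumes len(datanodes) >= replication; real placement is more elaborate.
    """
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(300 * 1024 * 1024)          # a 300 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks), placement)
```

A client that knows this mapping can read any block straight from one of the listed DataNodes, which is the "intelligent client" idea in the last bullet.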
HDFS User Interface
• Java API
• Command Line
o hadoop dfs -mkdir /foodir
o hadoop dfs -cat /foodir/myfile.txt
o hadoop dfs -rm /foodir/myfile.txt
o hadoop dfsadmin -report
o hadoop dfsadmin -decommission datanodename
More about HDFS
• Hadoop FileSystem API
o HDFS
o Local File System
o Kosmos File System (KFS)
o Amazon S3 File System
Hadoop Map-Reduce and Hadoop Streaming
• Map/Reduce works like a parallel Unix pipeline:
o cat input | grep | sort | uniq -c | cat > output
o Input | Map | Shuffle & Sort | Reduce | Output
• Framework does inter-node communication
o Failure recovery, consistency etc
o Load balancing, scalability etc
• Fits a lot of batch processing applications
o Log processing
o Web index building
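The pipeline analogy above can be mimicked in memory. The sketch below is a single-process illustration of the Map, Shuffle & Sort, and Reduce stages, not Hadoop's API; `run_map_reduce`, `grep_mapper`, and `count_reducer` are hypothetical names chosen to mirror `grep | sort | uniq -c`.

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    """Run map, shuffle & sort, and reduce over an in-memory list of records."""
    # Map: each record may emit zero or more (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(mapper(record))

    # Shuffle & sort: group values by key, then visit keys in sorted order.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Reduce: each key and its list of values may emit zero or more results.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def grep_mapper(line):
    """Plays the role of `grep error`: keep matching lines, drop the rest."""
    return [(line, 1)] if "error" in line else []


def count_reducer(key, values):
    """Plays the role of `uniq -c`: count occurrences of each distinct line."""
    return [(key, sum(values))]


log = ["error disk", "ok", "error net", "error disk"]
print(run_map_reduce(log, grep_mapper, count_reducer))
```

In a real cluster the framework runs many mappers and reducers on different nodes and handles the inter-node shuffle; the user code stays as small as the two functions here.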
Physical Flow
Example Code
Hadoop Streaming
• Allows Map and Reduce functions to be written in any language
o Hadoop Map/Reduce itself accepts only Java
• Example: Word Count
o hadoop streaming
    -input /user/zshao/articles
    -mapper 'tr " " "\n"'
    -reducer 'uniq -c'
    -output /user/zshao/
    -numReduceTasks 32
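Because streaming only requires programs that read stdin and write stdout, the `tr` and `uniq -c` commands above could be swapped for a script in any language. The following is a hedged Python sketch of such a script: the `map`/`reduce` command-line argument is an assumption of this example, and `mapper` only roughly matches `tr " " "\n"` (it skips the empty lines that repeated spaces would produce).

```python
import sys
from itertools import groupby


def mapper(lines):
    # Roughly `tr " " "\n"`: emit one word per output line.
    for line in lines:
        for word in line.split():
            yield word


def reducer(sorted_words):
    # Like `uniq -c`: count runs of identical adjacent lines.
    # Relies on the framework delivering the words already sorted.
    words = (word.rstrip("\n") for word in sorted_words)
    for word, run in groupby(words):
        yield f"{sum(1 for _ in run)}\t{word}"


if __name__ == "__main__":
    # Hypothetical convention: pass "map" or "reduce" as the first argument.
    role = sys.argv[1] if len(sys.argv) > 1 else ""
    if role in ("map", "reduce"):
        step = mapper if role == "map" else reducer
        for out in step(sys.stdin):
            print(out)
```

Locally the same logic can be exercised without a cluster by sorting the mapper output before handing it to the reducer, which is what the shuffle stage does at scale.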
Example: Log Processing
• Generate #pageview and #distinct users for each page each day
o Input: timestamp url userid
• Generate the number of page views
o Map: emit < <date(timestamp), url>, 1>
o Reduce: add up the values for each row
• Generate the number of distinct users
o Map: emit < <date(timestamp), url, userid>, 1>
o Reduce: For the set of rows with the same <date(timestamp), url>, count the number of distinct users by "uniq -c"
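The two jobs above can be sketched in plain Python to make the choice of keys concrete. This is an illustration only: it assumes `timestamp` is a Unix epoch in seconds interpreted as UTC (the slide does not say), and `day`, `page_views`, and `distinct_users` are hypothetical helper names.

```python
from collections import defaultdict
from datetime import datetime, timezone


def day(timestamp):
    """date(timestamp); assumes Unix epoch seconds, interpreted as UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d")


def page_views(events):
    """Map emits <<date, url>, 1>; reduce sums the 1s for each key."""
    counts = defaultdict(int)
    for timestamp, url, userid in events:
        counts[(day(timestamp), url)] += 1
    return dict(counts)


def distinct_users(events):
    """Map emits <<date, url, userid>, 1>; reduce counts distinct users per <date, url>."""
    # Building a set collapses repeated <date, url, userid> keys, as the shuffle would.
    keys = {(day(timestamp), url, userid) for timestamp, url, userid in events}
    counts = defaultdict(int)
    for date, url, _userid in keys:
        counts[(date, url)] += 1
    return dict(counts)


events = [(0, "/a", "u1"), (10, "/a", "u1"), (20, "/a", "u2"), (86400, "/a", "u1")]
print(page_views(events))
print(distinct_users(events))
```

The only difference between the two jobs is whether `userid` is part of the map key, which is what turns a raw count into a distinct count.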
Example: PageRank
• In each Map/Reduce Job:
o Map: emit <link, eigenvalue(url)/#links> for each input: <url, <eigenvalue, vector<link>> >
o Reduce: add all values up for each link, to generate the new eigenvalue for that link.
• Run 50 map/reduce jobs till the eigenvalues are stable.
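One iteration of the job described above can be sketched as a pure function. This is a minimal in-memory illustration, assuming every link points at a URL that is itself in the input and using no damping factor, since the slide does not mention one. The value the slide calls the "eigenvalue" is named `rank` here; `pagerank_step` and `pagerank` are hypothetical names.

```python
def pagerank_step(graph):
    """One map/reduce job. graph maps url -> (rank, list of outgoing links)."""
    contributions = {url: 0.0 for url in graph}
    # Map: each url emits <link, rank / #links> for every outgoing link.
    for url, (rank, links) in graph.items():
        for link in links:
            # Reduce: add up all values received for each link.
            contributions[link] = contributions.get(link, 0.0) + rank / len(links)
    return {url: (contributions[url], links) for url, (_, links) in graph.items()}


def pagerank(graph, max_jobs=50, tol=1e-9):
    """Repeat the job up to max_jobs times, stopping early once ranks are stable."""
    for _ in range(max_jobs):
        new_graph = pagerank_step(graph)
        delta = max(abs(new_graph[url][0] - graph[url][0]) for url in graph)
        graph = new_graph
        if delta < tol:
            break
    return {url: rank for url, (rank, _) in graph.items()}


# Tiny three-page web: A links to B and C, B links to C, C links to A.
web = {
    "A": (1 / 3, ["B", "C"]),
    "B": (1 / 3, ["C"]),
    "C": (1 / 3, ["A"]),
}
ranks = pagerank(web)
print(ranks)
```

Each call to `pagerank_step` corresponds to a full pass over the link graph, which is why the slide counts the work in whole map/reduce jobs.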
TODO: Split Job Scheduler and Map-Reduce
• Allow easy plug-in of different scheduling algorithms
o Scheduling based on job priority, size, etc
o Scheduling for CPU, disk, memory, network bandwidth
o Preemptive scheduling
• Allow MPI or other jobs to run on the same cluster
o PageRank is best done with MPI
Hive - SQL on top of Hadoop
Map-Reduce and SQL
• Map-Reduce is scalable
• SQL has a huge user base
• SQL is easy to code
• Solution: Combine SQL and Map-Reduce
o Hive on top of Hadoop (open source)
o Aster Data (proprietary)
o Greenplum (proprietary)
Hive
• A database/data warehouse on top of Hadoop
o Rich data types (structs, lists and maps)
o Efficient implementations of SQL filters, joins and group-bys on top of Map-Reduce
• Allows users to access Hive data without using Hive
Dealing with Structured Data
• Type system
o Primitive types
o Recursively build up using Composition/Maps/Lists
• Generic (De)Serialization Interface (SerDe)
o To recursively list schema
o To recursively access fields within a row object
• Serialization families implement interface
o Thrift DDL based SerDe
o Delimited text based SerDe
o You can write your own SerDe
• Schema Evolution
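To make the SerDe idea concrete, here is a toy delimited-text serializer/deserializer. It is a sketch of the concept only: Hive's actual SerDe interface is a Java API that this does not reproduce, and the class name, method names, and flat (non-recursive) column list are all assumptions of this example.

```python
class DelimitedTextSerDe:
    """Toy delimited-text SerDe over a flat list of (column name, type) pairs."""

    def __init__(self, columns, delimiter="\t"):
        self.columns = columns        # e.g. [("userid", int), ("url", str)]
        self.delimiter = delimiter

    def schema(self):
        """List the schema as (column name, type name) pairs."""
        return [(name, typ.__name__) for name, typ in self.columns]

    def deserialize(self, line):
        """Turn one line of text into a row object (here, a dict)."""
        fields = line.rstrip("\n").split(self.delimiter)
        return {name: typ(raw) for (name, typ), raw in zip(self.columns, fields)}

    def serialize(self, row):
        """Turn a row object back into one line of delimited text."""
        return self.delimiter.join(str(row[name]) for name, _ in self.columns)


serde = DelimitedTextSerDe([("userid", int), ("url", str)])
row = serde.deserialize("42\t/home\n")
print(serde.schema(), row, serde.serialize(row))
```

Keeping the schema, the reader, and the writer behind one interface is what lets the query engine stay agnostic about how a table's bytes are laid out.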
MetaStore
• Stores Table/Partition properties:
o Table schema and SerDe library
o Table Location on HDFS
o Logical Partitioning keys and types
o Other information
• Thrift API
o Current clients in PHP (Web Interface), Python (old CLI), Java (Query Engine and CLI), Perl (Tests)
• Metadata can be stored as text files or even in a SQL backend
Hive CLI
• DDL:
o create table/drop table/rename table
o alter table add column
• Browsing:
o show tables
o describe table
o cat table
• Loading Data
• Queries
Web UI for Hive
• MetaStore UI:
o Browse and navigate all tables in the system
o Comment on each table and each column
o Also captures data dependencies
• HiPal:
o Interactively construct SQL queries by mouse clicks
o Support projection, filtering, group by and joining
o Also support
Hive Query Language
• Philosophy
o SQL
o Map-Reduce with custom scripts (hadoop streaming)
• Query Operators
o Projections
o Equi-joins
o Group by
o Sampling
o Order By
Hive QL – Custom Map/Reduce Scripts
• Extended SQL:
    FROM (
      FROM pv_users
      MAP pv_users.userid, pv_users.date
      USING 'map_script' AS (dt, uid)
      CLUSTER BY dt) map
    INSERT INTO TABLE pv_users_reduced
      REDUCE map.dt, map.uid
      USING 'reduce_script' AS (date, count);
• Map-Reduce: similar to Hadoop Streaming
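The slides do not show what `map_script` and `reduce_script` contain. One plausible pair, sketched in Python, is given below under stated assumptions: rows reach the scripts as tab-separated text lines, the map script re-keys each row by date, and the reduce script counts rows per date after `CLUSTER BY dt` has brought equal dates together. The function bodies are illustrative guesses, not the original scripts.

```python
from itertools import groupby


def map_script(lines):
    """Input columns (userid, date); output columns (dt, uid)."""
    for line in lines:
        userid, date = line.rstrip("\n").split("\t")
        yield f"{date}\t{userid}"


def reduce_script(lines):
    """Input columns (dt, uid), grouped by dt; output columns (date, count)."""
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for dt, group in groupby(rows, key=lambda row: row[0]):
        yield f"{dt}\t{sum(1 for _ in group)}"


pv_users = ["u1\t2008-09-01\n", "u2\t2008-09-01\n", "u1\t2008-09-02\n"]
clustered = sorted(map_script(pv_users))    # stands in for CLUSTER BY dt
print(list(reduce_script(clustered)))
```

In a deployed script each function would read from stdin and print to stdout, just as with Hadoop Streaming.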
Thank You!
For more information:
http://vibranttechnologies.co.in/hadoop-classes-in-mumbai.html


Editor's Notes

  • #12: Very Large Distributed File System: 10K nodes, 100 million files, 10 PB. Assumes commodity hardware: files are replicated to handle hardware failure; failures are detected and recovered from. Optimized for batch processing: data locations are exposed so that computations can move to where the data resides; provides very high aggregate bandwidth.
  • #13: Name node: single point of failure, so we have a secondary name node. Secondary name node: reads the transaction log from the name node and uploads the FSImage to the name node. A single name node avoids metadata conflicts, etc. Data node: easy to join and leave the cluster. Heartbeat protocol.
  • #15: Block placement policy. Block balancing. Block replication on node failure.