Hadoop – An Introduction
Shankar Radhakrishnan, HCL Technologies
Agenda
  • State of the Data
  • What is Hadoop
  • Hadoop Ecosystem
  • References
State of the data
  • Data driven businesses
  • Businesses have been collecting information all the time
  • Mine more == Collect more (and vice-versa)
  • Challenges
      • Application Complexities
      • Data growth
      • Infrastructure
      • Economics
  • Need of the day
Data driven business
  • Applications
      • Searches, Message posts, Comments, Emails, Blogs, Photos, Video Clips, Product Listings
      • ERP, CRM, Databases, Internal Applications, Customer/Consumer facing products
      • Mobile
  • Context
      • Web, Customers, Products, Business Systems, Processes, Services
  • Support Systems
      • CRM, SOA, Recommendation Systems/processes, Data warehouses, Business Intelligence, BPM
Mine more
  • Drivers
      • ROI
      • Customer Retention
      • Product Affinity
      • Market Trends
      • Research Analysis
      • Customer/Consumer Analytics
  • Process
      • Clustering
      • Classification
      • Build Relationships
      • Regression
  • Types
      • Structured
      • Semi-structured
      • Unstructured
Challenges
  • Complex Applications
      • Data integration is a good but complex problem to solve
  • Data Growth
      • Growth is exponential
  • Infrastructure
      • Availability
      • Unscalable hardware
  • Economics
      • Managing high data volume comes at a price
      • Failures are very costly
Need of the day
  • System that can handle high volume data
  • System that can perform complex operations
  • Scalable
  • Robust
  • Highly Available
  • Fault Tolerant
  • Cheap
What is Hadoop
  • Top level Apache project
  • Open source
  • Inspired by Google’s white papers on Map/Reduce (MR) and the Google File System (GFS)
  • Originally developed to support the Apache Nutch search engine
  • Software framework written in Java
  • Designed
      • For sophisticated analysis
      • To deal with structured and unstructured complex data
Why Hadoop?
  • Runs on commodity hardware
  • Shared-nothing architecture
  • Scale hardware whenever you want
  • System compensates for hardware scaling and issues (if any)
  • Run large-scale, high volume data processes
  • Scales well with complex analysis jobs
  • Handles failures
  • Ideal to consolidate data from both new and legacy data sources
  • Value to the business
Hadoop in an enterprise - Example
Hadoop Ecosystem
  • HDFS: Hadoop Distributed File System
  • Map/Reduce: Software framework for clustered, distributed data processing
  • ZooKeeper: Coordination service for distributed applications
  • Avro: Data serialization
  • Chukwa: Data collection system to monitor distributed systems
  • HBase: Data storage for distributed large tables
  • Hive: Data warehousing infrastructure
  • Pig: High-level query language
HDFS – Hadoop Distributed File System
  • Master/Slave Architecture
  • Runs on commodity hardware
  • Fault Tolerant
  • Handle large volumes of data
  • Provides High Throughput
  • Streaming data-access
  • Simple file coherency model
  • Portable to heterogeneous hardware and software
  • Robust
      • Handles disk failures, replication (& re-replication)
      • Performs cluster rebalancing, data integrity checks
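To make replication and re-replication concrete, here is a minimal toy sketch in Python, not HDFS code. It assumes a file is cut into fixed-size blocks, that the master keeps a map of which data nodes hold each block, and that replicas are placed round-robin. Real HDFS placement is rack-aware and more involved. The function names (place_blocks, re_replicate), the node names, and the sizes are all hypothetical, chosen only for illustration.

```python
def place_blocks(file_size, block_size, data_nodes, replication=3):
    """Return {block_id: [data nodes holding a replica]} for one file."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    replicas = min(replication, len(data_nodes))
    placement = {}
    for block_id in range(num_blocks):
        # Toy round-robin placement: each replica lands on a distinct node.
        placement[block_id] = [
            data_nodes[(block_id + r) % len(data_nodes)] for r in range(replicas)
        ]
    return placement


def re_replicate(placement, failed_node, data_nodes, replication=3):
    """Drop a failed node, then restore the replica count from live nodes."""
    live = [n for n in data_nodes if n != failed_node]
    target = min(replication, len(live))
    for holders in placement.values():
        holders[:] = [n for n in holders if n != failed_node]
        for candidate in live:
            if len(holders) >= target:
                break
            if candidate not in holders:
                holders.append(candidate)
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical data nodes
blocks = place_blocks(file_size=200, block_size=64, data_nodes=nodes)
re_replicate(blocks, failed_node="dn2", data_nodes=nodes)
```

After the simulated failure of dn2, every block again has three replicas on the surviving nodes, which is the behaviour the "replication (& re-replication)" bullet describes.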
HDFS – Example
  • Name node
      • File system operations
      • Maps data-nodes
  • Data node
      • Process read/write
      • Handles Data-blocks
      • Replication
Hadoop Map/Reduce
  • Tagged by a job
  • Splits input data-set into separate chunks
  • Processed by map tasks, in parallel
  • Sorts the output of the maps
  • Processed by reduce tasks, in parallel
  • Typically stored and processed in a file system
  • Framework takes care of
      • Scheduling tasks
      • Monitoring
      • Re-executing failed tasks
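The split, map, sort, reduce flow can be sketched in a few lines of plain Python. This is an in-memory illustration under stated assumptions, not the Hadoop API. A thread pool stands in for the cluster, and run_job, split_input, the sample records, and the two user functions are hypothetical names made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby
from operator import itemgetter


def split_input(records, chunk_size):
    """Split the input data set into independent chunks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def run_job(records, map_fn, reduce_fn, chunk_size=2, workers=4):
    chunks = split_input(records, chunk_size)

    # Map phase: each chunk is processed by its own map task, in parallel.
    def map_task(chunk):
        return [pair for record in chunk for pair in map_fn(record)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        map_outputs = list(pool.map(map_task, chunks))

    # Sort phase: order intermediate pairs by key so equal keys are adjacent.
    pairs = sorted((p for out in map_outputs for p in out), key=itemgetter(0))
    grouped = [(k, [v for _, v in g]) for k, g in groupby(pairs, key=itemgetter(0))]

    # Reduce phase: one reduce call per distinct key, in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda kv: reduce_fn(kv[0], kv[1]), grouped))


def city_temperature(record):
    """Map: turn a 'city,temp' record into a (city, temp) pair."""
    city, temp = record.split(",")
    return [(city, int(temp))]


def hottest(city, temps):
    """Reduce: keep the highest temperature seen for a city."""
    return city, max(temps)


records = ["pune,31", "delhi,40", "pune,35", "delhi,38", "goa,29"]
print(run_job(records, city_temperature, hottest))
# prints {'delhi': 40, 'goa': 29, 'pune': 35}
```

Scheduling, monitoring, and re-executing failed tasks are left out of the sketch. In Hadoop the framework handles those, so the job author writes only the map and reduce functions.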
Example: Mapper Function
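The code shown on this slide did not survive in the transcript. As a stand-in, here is a minimal word-count mapper sketch in Python. It assumes the tab-separated key/value text convention used by Hadoop Streaming. The function name mapper and the sample lines are hypothetical, and this is not the author's original example.

```python
def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' pair per word in the input lines."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


# Hypothetical input lines, standing in for one chunk of the data set.
pairs = list(mapper(["Hello Hadoop", "hello world"]))
```

Each map task would run a function like this over its own chunk. The mapper does no counting itself. It only tags every word with a 1, and the framework sorts the pairs before the reduce step.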

Example: Reduce Function
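The code for this slide is also missing from the transcript. A minimal word-count reducer sketch in Python might look like the following. It assumes the input is lines of the form 'word<TAB>count' that are already sorted by word, which is what the sort step between map and reduce provides. The function name reducer and the sample input are hypothetical.

```python
from itertools import groupby


def reducer(sorted_pairs):
    """Sum the counts for each word; input lines must be sorted by word."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_pairs)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical sorted map output for two short lines of text.
sorted_pairs = ["hadoop\t1", "hello\t1", "hello\t1", "world\t1"]
counts = list(reducer(sorted_pairs))
```

Because equal keys arrive next to each other, the reducer can stream through its input and keep only one running total at a time.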