Data Processing in the Cloud
Parand Tony Darugar
http://parand.com/say/ [email_address]
What is Hadoop
  • Flexible infrastructure for large scale computation and data processing on a network of commodity hardware.
Why?
  • A common infrastructure pattern extracted from building distributed systems
  • Scale
  • Incremental growth
  • Cost
  • Flexibility
Built-in Resilience to Failure
  • When dealing with large numbers of commodity servers, failure is a fact of life
  • Assume failure, build protections and recovery into your architecture
  • Data level redundancy
  • Job/Task level monitoring and automated restart and re-allocation
Current State of Hadoop Project
  • Top-level Apache Foundation project
  • In production use at Yahoo, Facebook, Amazon, IBM, Fox, NY Times, Powerset, …
  • Large, active user base, mailing lists, user groups
  • Very active development, strong development team
Widely Adopted
  • A valuable and reusable skill set
  • Taught at major universities
  • Easier to hire for
  • Easier to train on
  • Portable across projects, groups
Plethora of Related Projects
  • Pig
  • Hive
  • HBase
  • Cascading
  • Hadoop on EC2
  • JAQL, X-Trace, Happy, Mahout
What is Hadoop
  • The Linux of distributed processing.
How Does Hadoop Work?
Hadoop File System
  • A distributed file system for large data
  • Your data in triplicate
  • Built-in redundancy, resiliency to large scale failures
  • Intelligent distribution, striping across racks
  • Accommodates very large data sizes
  • On commodity hardware
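
The "data in triplicate, striped across racks" idea can be illustrated with a small sketch. This is not the actual HDFS placement policy; the function names, the round-robin scheme, and the rack and node labels are all assumptions made for illustration.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def interleave_racks(nodes_by_rack: dict[str, list[str]]) -> list[str]:
    """Order nodes so that neighbours in the list sit on different racks where possible."""
    racks = [list(nodes) for nodes in nodes_by_rack.values()]
    ordered = []
    while any(racks):
        for rack in racks:
            if rack:
                ordered.append(rack.pop(0))
    return ordered


def place_replicas(num_blocks: int, nodes_by_rack: dict[str, list[str]],
                   replicas: int = 3) -> dict[int, list[str]]:
    """Hypothetical round-robin placement: each block gets `replicas` distinct nodes."""
    ordered = interleave_racks(nodes_by_rack)
    if replicas > len(ordered):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block_id: [ordered[(block_id + k) % len(ordered)] for k in range(replicas)]
        for block_id in range(num_blocks)
    }


# Example cluster (made-up names): two racks with two nodes each.
cluster = {"rack1": ["a", "b"], "rack2": ["c", "d"]}
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), cluster)
```

Because the node list alternates between racks, every block in this example ends up with copies on both racks, so losing one rack still leaves a readable replica.
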
Programming Model: Map/Reduce
  • Very simple programming model:
      • Map(anything) -> key, value
      • Sort, partition on key
      • Reduce(key, value) -> key, value
  • No parallel processing / message passing semantics
  • Programmable in Java or any other language (streaming)
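
As a rough illustration of the map, sort, reduce flow, here is a single-process word count in plain Python. It does not use Hadoop itself; `run_map_reduce`, `map_words`, and `reduce_counts` are hypothetical names, and a real job would run the same three phases spread over many machines.

```python
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(key, values):
    """Reduce: sum all the counts emitted for one key."""
    yield key, sum(values)


def run_map_reduce(records, mapper, reducer):
    """Single-process stand-in for the framework: map, sort on key, reduce per key."""
    mapped = [pair for record in records for pair in mapper(record)]
    mapped.sort(key=itemgetter(0))  # the "sort, partition on key" step
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, (value for _, value in group)))
    return output


word_counts = run_map_reduce(["a b a", "b c"], map_words, reduce_counts)
# word_counts == [("a", 2), ("b", 2), ("c", 1)]
```

Note that neither `map_words` nor `reduce_counts` contains any threading or messaging code, which is the point of the model: the framework owns the parallelism.
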
Processing Model
  • Create or allocate a cluster
  • Put data onto the file system:
      • Data is split into blocks, stored in triplicate across your cluster
  • Run your job:
      • Your Map code is copied to the allocated nodes, preferring nodes that contain copies of your data
      • Move computation to data, not data to computation
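
"Move computation to data" amounts to a scheduling preference. The sketch below is a simplified, greedy illustration and not Hadoop's actual scheduler; `assign_map_tasks`, the slot counts, and the node names are assumptions.

```python
def assign_map_tasks(block_locations, free_slots):
    """Assign one map task per block, preferring a node that already holds the block.

    block_locations: {block_id: [nodes holding a replica]}
    free_slots:      {node: number of free task slots}
    Returns {block_id: (node, is_data_local)}.
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for block_id, holders in block_locations.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        candidates = local or [node for node, free in slots.items() if free > 0]
        if not candidates:
            break  # no capacity left; the remaining tasks would have to wait
        node = candidates[0]
        slots[node] -= 1
        assignment[block_id] = (node, bool(local))
    return assignment


# Made-up example: node "a" holds every block but has only one free slot.
tasks = assign_map_tasks(
    block_locations={0: ["a", "b"], 1: ["a", "c"], 2: ["a"]},
    free_slots={"a": 1, "b": 1, "d": 1},
)
```

In this example only block 0 runs where its data lives; blocks 1 and 2 fall back to nodes that must read the block over the network, which is the case the locality preference tries to minimize.
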
Processing Model
  • Monitor workers, automatically restarting failed or slow tasks
  • Gather output of Map, sort and partition on key
  • Run Reduce tasks
  • Monitor workers, automatically restarting failed or slow tasks
  • Results of your job are now available on the Hadoop file system
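
The automatic restart of failed tasks can be sketched as a retry loop that moves the task to another node. This is a minimal illustration under assumed names (`run_with_restarts`, `flaky_task`); it covers failures only and does not model the handling of slow tasks.

```python
def run_with_restarts(task, nodes, max_attempts=3):
    """Run `task(node)` on successive nodes until it succeeds or attempts run out.

    Returns (result, number_of_attempts_used).
    """
    errors = []
    for attempt in range(max_attempts):
        node = nodes[attempt % len(nodes)]  # re-allocate to the next node
        try:
            return task(node), attempt + 1
        except Exception as exc:  # any failure triggers a restart elsewhere
            errors.append((node, exc))
    raise RuntimeError(f"task failed after {max_attempts} attempts: {errors}")


def flaky_task(node):
    """Hypothetical task that always fails on the node named 'bad'."""
    if node == "bad":
        raise IOError("disk failure")
    return f"done on {node}"


result, attempts = run_with_restarts(flaky_task, ["bad", "good"])
# result == "done on good", attempts == 2
```
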
Hadoop on the Grid
  • Managed Hadoop clusters
  • Shared resources, improved utilization
  • Standard data sets, storage
  • Shared, standardized operations management
  • Hosted internally or externally (e.g. on EC2)
Usage Patterns
ETL
  • Put large data source (e.g. log files) onto the Hadoop File System
  • Perform aggregations, transformations, normalizations on the data
  • Load into RDBMS / data mart
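
A toy version of the ETL pattern, with the aggregation done in-process rather than as a Hadoop job and with SQLite standing in for the RDBMS. The log format ("timestamp path status"), the `hits` table, and the function names are all assumptions for illustration.

```python
import sqlite3
from collections import Counter


def aggregate_status_counts(log_lines):
    """Transform: count requests per (path, status), skipping malformed lines."""
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) != 3 or not parts[2].isdigit():
            continue  # normalization step: drop lines that do not parse
        _timestamp, path, status = parts
        counts[(path, int(status))] += 1
    return counts


def load_counts(conn, counts):
    """Load: write the aggregated rows into a relational table."""
    conn.execute("CREATE TABLE IF NOT EXISTS hits (path TEXT, status INTEGER, n INTEGER)")
    conn.executemany(
        "INSERT INTO hits VALUES (?, ?, ?)",
        [(path, status, n) for (path, status), n in counts.items()],
    )
    conn.commit()


log_lines = ["t1 /a 200", "t2 /a 200", "t3 /b 404", "garbage"]
conn = sqlite3.connect(":memory:")
load_counts(conn, aggregate_status_counts(log_lines))
rows = conn.execute("SELECT path, status, n FROM hits ORDER BY path").fetchall()
```

On a cluster, the counting step would be the part that runs as a Map/Reduce job over the full log data; only the much smaller aggregated result is loaded into the database.
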
Reporting and Analytics
  • Run canned and ad-hoc queries over large data
  • Run analytics and data mining operations on large data
  • Produce reports for end-user consumption or loading into data mart
Data Processing Pipelines
  • Multi-step pipelines for data processing
  • Coordination, scheduling, data collection and publishing of feeds
  • SLA carrying, regularly scheduled jobs
Machine Learning & Graph Algorithms
  • Traverse large graphs and data sets, building models and classifiers
  • Implement machine learning algorithms over massive data sets
General Back-End Processing
  • Implement significant portions of back-end, batch oriented processing on the grid
  • General computation framework
  • Simplify back-end architecture
What Next?
  • Download Hadoop: http://hadoop.apache.org/
  • Try it on your laptop
  • Try Pig: http://hadoop.apache.org/pig/
  • Deploy to multiple boxes
  • Try it on EC2

Cloud Computing: Hadoop
