Introduction to Hadoop
and Big Data Processing
Presented by Sam Ng
Date: 7 September 2018
2
• The number of IoT units installed in 2018 is double the number
installed in 2016, and it is expected to double again within the
following two years.
• Sensor data will therefore grow rapidly as IoT devices are
widely adopted.
Introduction
3
• About a terabyte of sound data is generated per year if a car
manufacturer records sound files for quality control on a
single product line.
• Assume a 30-second sound file is 5.046980702 MB and the
manufacturer produces 200,000 cars of a single model per year.
If one file is recorded for each car, the recorded files total
985.7384183 GB (at 1,024 MB per GB).
However, the manufacturer may record more than one file per car.
Introduction
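The estimate on this slide can be reproduced with a few lines of Python. This is a minimal sketch using only the figures quoted above; the function name and the files_per_car parameter are illustrative additions, and the factor of 1,024 MB per GB is inferred from the slide's own result.

```python
# Figures taken from the slide.
FILE_SIZE_MB = 5.046980702   # size of one 30-second sound file
CARS_PER_YEAR = 200_000      # cars of a single model built per year
MB_PER_GB = 1024             # inferred: the slide's total matches binary units


def total_size_gb(files_per_car: int = 1) -> float:
    """Return the yearly volume of recordings in GB."""
    total_mb = FILE_SIZE_MB * CARS_PER_YEAR * files_per_car
    return total_mb / MB_PER_GB


print(f"{total_size_gb():.4f} GB per year with one file per car")
print(f"{total_size_gb(3):.1f} GB per year with three files per car")
```

With one file per car the result is just under 1 TB; every additional recording per car adds the same amount again.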
4
• An example solution for automobile manufacturers
Introduction
5
• A Brief History of Hadoop
• What is HDFS and how to use it
• What is Map Reduce
• Advanced Map Reduce
• Namenode Resilience
• Directed Acyclic Graph
• Hadoop Ecosystem
• How to configure security for a Hadoop Cluster
Agenda
6
• In 2003, Google published the paper “The Google File System”,
describing the scalable distributed file system it was using.

http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
• That paper, together with Google's 2004 MapReduce paper, inspired
Doug Cutting and Mike Cafarella to create the open-source framework
Hadoop, built around the core concept “MapReduce” borrowed from
Google. Cutting joined Yahoo! in 2006, where much of Hadoop's early
development took place.
• The name Hadoop has no technical meaning. It comes from a toy
yellow elephant that belonged to Doug Cutting's son, which is also
why the project logo is a yellow elephant.
A Brief History of Hadoop
7
• Projects related to Hadoop tend to use animal names or
animal logos, such as Pig and Hive. Together, these
components make up the Hadoop ecosystem.
• Fittingly, the coordination and configuration management
service in the Hadoop ecosystem is called “ZooKeeper”.
A Brief History of Hadoop
8
• NameNodes: record where each file's blocks are stored, and log
what is created and modified
• DataNodes: store the data blocks. The default HDFS block size is
128 MB, far larger than the blocks of a local file system, which
are typically 512 bytes, 4 kB, 8 kB, 16 kB or 32 kB. (The block
size on my MacBook is 512 bytes.)
• Client nodes: run the client's applications
• Please note that HDFS refers only to the file system. To run
processing jobs on the stored data, a resource manager such as
“YARN” is required.

What is HDFS
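To make the 128 MB block size concrete, the sketch below counts how many blocks a file would occupy. It is an illustration only: the function names are made up for this example, and it models nothing beyond the arithmetic (the final block of a file is simply shorter than the rest).

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size quoted on the slide


def block_count(file_size_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed for a file (ceiling division)."""
    if file_size_bytes < 0:
        raise ValueError("file size cannot be negative")
    return -(-file_size_bytes // block_size)


def last_block_bytes(file_size_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Size of the final, possibly partial, block."""
    if file_size_bytes == 0:
        return 0
    remainder = file_size_bytes % block_size
    return remainder or block_size


MB = 1024 * 1024
size = 300 * MB  # hypothetical 300 MB file
print(block_count(size), "blocks; last block holds",
      last_block_bytes(size) // MB, "MB")
```

A 300 MB file therefore needs three blocks (128 MB, 128 MB and 44 MB), and the NameNode only has to track those three entries rather than hundreds of thousands of small local-file-system blocks.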
9
• UI (Ambari, Hue)
• CLI, with commands modeled on Unix ones such as ls and mkdir
• HTTP / HTTPS proxies
• Java interface
• NFS Gateway (to mount HDFS as a file system on a server)
How to use HDFS
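As an illustration of HTTP access, the sketch below builds request URLs for WebHDFS, the REST interface that ships with Hadoop. It assumes the HTTP option on this slide refers to WebHDFS or a compatible proxy; the host name namenode.example.com and the path /data/sound are hypothetical, and 9870 is the default NameNode HTTP port in Hadoop 3.x (50070 in 2.x). The code only constructs the URL and does not contact a cluster.

```python
from typing import Dict, Optional
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, path: str, op: str, port: int = 9870,
                scheme: str = "http",
                params: Optional[Dict[str, str]] = None) -> str:
    """Build a WebHDFS URL of the form /webhdfs/v1/<path>?op=<OP>."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = urlencode({"op": op, **(params or {})})
    # quote() leaves "/" intact and escapes characters such as spaces.
    return f"{scheme}://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# List a directory, the HTTP counterpart of "ls" on the command line.
print(webhdfs_url("namenode.example.com", "/data/sound", "LISTSTATUS"))
```

The resulting URL can be fetched with any HTTP client; on a secured cluster the request would additionally need to authenticate.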
10
• Map data: transform the data into another structure suited to the
problem by associating each piece of data with a key-value pair
• Reduce data: aggregate the values that share a key (whatever you
want to compute for each key, e.g. count, maximum)
What is Map Reduce?
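The two steps can be sketched in plain Python with the classic word-count example. This is a single-process illustration of the idea, not Hadoop's API: the function names are made up for this sketch, and the intermediate shuffle-and-sort step that Hadoop performs between map and reduce is simulated with a dictionary.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) key-value pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, List[int]]]:
    """Group all values by key and sort by key, as happens between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: aggregate the values of one key, here by summing them."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle_and_sort(pairs))


print(word_count(["big data", "Big elephant"]))
```

Swapping sum for max in reduce_phase gives the "maximum" aggregation mentioned above without touching the map step.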
11
What is Map Reduce?
Magic Happened!
12
What is Map Reduce?
Shuffle and sort
13
What is Map Reduce? (Advanced)
14
• The single point of failure in a Hadoop cluster is the NameNode.
While the loss of any other machine (intermittently or
permanently) does not result in data loss, NameNode loss results
in cluster unavailability. The permanent loss of NameNode data
would render the cluster's HDFS inoperable.
Namenode Resilience
15
• Back up the metadata (the file-to-block mapping and the edit logs)
• Secondary NameNode (keeps a periodically merged copy of the
metadata; it is a checkpoint, not a hot standby)
• HDFS Federation (a separate NameNode for each namespace
volume) -> only a portion of the data becomes unavailable when a
NameNode is down
• HDFS High Availability (a shared edit log on a reliable
file system) -> ZooKeeper keeps track of the active
NameNode
Namenode Resilience
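The relationship between a metadata snapshot and the edit log can be pictured with a toy model. The sketch below is an assumption-laden simplification, not HDFS's on-disk format: the namespace is reduced to a set of paths, an edit is a hypothetical (operation, path) pair, and checkpoint() mimics how a periodic merge folds the logged edits into a fresh snapshot.

```python
from typing import List, Set, Tuple

Edit = Tuple[str, str]  # hypothetical record: (operation, path)


def checkpoint(image: Set[str], edit_log: List[Edit]) -> Tuple[Set[str], List[Edit]]:
    """Replay the edit log on a copy of the image; return (new image, empty log)."""
    namespace = set(image)  # never modify the previous snapshot in place
    for operation, path in edit_log:
        if operation == "create":
            namespace.add(path)
        elif operation == "delete":
            namespace.discard(path)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return namespace, []


old_image = {"/logs/a.wav"}
edits = [("create", "/logs/b.wav"), ("delete", "/logs/a.wav")]
new_image, new_log = checkpoint(old_image, edits)
print(sorted(new_image), new_log)
```

In this model, losing the edit log loses every change since the last snapshot and losing both loses the namespace, which is why the metadata is backed up or shared.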
16
• Instead of a fixed Map-then-Reduce sequence, a DAG engine plans
the fastest way to calculate the result for each scenario.
• Using a DAG, Spark claims to be up to 100 times faster than
Hadoop MapReduce.
Directed Acyclic Graph
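A directed acyclic graph of processing stages can be ordered so that every stage runs only after the stages it depends on. The sketch below shows that idea with Python's standard graphlib module (Python 3.9 or later); the stage names are hypothetical, and this is not how Spark's scheduler is implemented.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical job: each stage maps to the set of stages it depends on.
stages: Dict[str, Set[str]] = {
    "read_sensors": set(),
    "read_catalog": set(),
    "filter": {"read_sensors"},
    "join": {"filter", "read_catalog"},
    "aggregate": {"join"},
}


def execution_order(dag: Dict[str, Set[str]]) -> List[str]:
    """Return one valid order in which the stages can run.

    Raises graphlib.CycleError if the graph contains a cycle.
    """
    return list(TopologicalSorter(dag).static_order())


print(execution_order(stages))
```

Because the two read stages do not depend on each other, an engine is free to run them in parallel, which is one reason a DAG plan can beat a rigid sequence of MapReduce jobs.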
17
Hadoop Ecosystem
18
• Spark
• Pig
Machine Learning Tools:
• H2O
• Spark ML or MLlib
• Mahout
Database Tools:
• Hive
• HBase
Data analysis tools in Hadoop
Hadoop Security
Security Concern:
- Confidentiality
- Integrity
- Availability
- Authentication
- Authorization
- Accounting
Threats:
- Unauthorized access
- Insider threat
- DoS
- Data threat
Vulnerabilities:
- Password truncated
- Ping of death
Consideration:
- Network Security
- Data Flow
- Client Access
- Admin Traffic
Defense in Depth
“Any single security measure is not likely to mitigate all threats”
20
Hadoop Security - Kerberos
Thank you!
