Hadoop Beginnings
What is Hadoop?
Apache Hadoop is an open-source software framework for storage and
large-scale processing of data sets on clusters of commodity hardware.
Hadoop Beginnings
Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
It was originally developed to support distribution of the Nutch
Search Engine Project.
Doug, who was working at Yahoo! at the time and is now chief architect
at Cloudera, named the project after his son's toy elephant, Hadoop.
History of Hadoop
HADOOP FEATURES
Fault Tolerance
Reliability
High Availability
Scalability
Cost-Effectiveness
Data Locality
Scalability
Scalability is at the core of a Hadoop system.
With cheap compute and storage, we can distribute and scale across
nodes very easily, in a very cost-effective manner.
Apache Hadoop Framework
& its Basic Modules
Hadoop Common: It contains libraries and utilities needed
by other Hadoop modules.
Hadoop Distributed File System (HDFS): It is a distributed file
system that stores data on commodity machines, providing very high
aggregate bandwidth across the entire cluster.
Hadoop YARN: It is a resource management platform responsible for
managing compute resources in the cluster and using them to schedule
users' applications.
Hadoop MapReduce: It is a programming model for large-scale data
processing.
Apache Framework Basic Modules
HDFS
Hadoop distributed file system
A distributed, scalable, and portable file system written in Java for
the Hadoop framework.
A Hadoop instance typically has a single NameNode and a cluster of
DataNodes that form the HDFS cluster.
HDFS stores large files, typically in the range of gigabytes to
terabytes, and now petabytes, across multiple machines. It achieves
reliability by replicating the data across multiple hosts, and
therefore does not require RAID storage on the hosts.
HDFS: Hadoop distributed file system
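Since HDFS reliability comes from block replication rather than RAID, the
storage overhead is easy to reason about. Below is a minimal Python sketch of
that arithmetic; the 128 MB block size and replication factor of 3 are common
defaults assumed here, not values taken from the slides.

```python
import math

def hdfs_footprint(file_size_gb, block_size_mb=128, replication=3):
    """Estimate how a file splits into blocks and how much raw capacity it uses."""
    blocks = math.ceil(file_size_gb * 1024 / block_size_mb)  # logical blocks
    block_replicas = blocks * replication                    # physical copies stored
    raw_capacity_gb = file_size_gb * replication             # total disk consumed
    return blocks, block_replicas, raw_capacity_gb

# Example: a 1 GB file -> 8 blocks, 24 stored block replicas, 3 GB of raw storage.
print(hdfs_footprint(1))
```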
MapReduce Engine
The typical MapReduce engine consists of a JobTracker, to which client
applications can submit MapReduce jobs. The JobTracker pushes work out to the
available TaskTrackers in the cluster, striving to keep the work as close to
the data as possible and as balanced as possible.
Apache Hadoop NextGen MapReduce (YARN)
What is YARN?
YARN enhances the power of a Hadoop compute cluster without being limited by
the MapReduce framework.
Scalability: The processing power in data centers continues to grow quickly.
Because the YARN ResourceManager focuses exclusively on scheduling, it can
manage those very large clusters quite easily.
MapReduce Compatibility: YARN is completely compatible with MapReduce.
Existing MapReduce applications and end users can run on top of YARN without
disrupting any of their existing processes.
Improved Cluster Utilization: The ResourceManager is a pure scheduler that
optimizes cluster utilization according to criteria such as capacity
guarantees, fairness, and SLAs (service level agreements).
YARN supports workflows other than just MapReduce. We can now bring in
additional programming models, such as graph processing or iterative
modeling, and it is now possible to process the data in HBase. This is
especially useful when we talk about machine learning applications.
YARN allows multiple access engines, either open source or proprietary, to
use Hadoop as a common standard for batch or interactive processing, and even
real-time engines that can simultaneously access the same data. So you can
put streaming applications on top of YARN inside a Hadoop architecture and
seamlessly work and communicate between these environments.
Tool designed for efficiently transferring bulk
data between Apache Hadoop and structured
datastores such as relational databases
Apache Sqoop
HBase is a key component of the Hadoop stack, as its design caters to
applications that require really fast random access to significant data sets.
Column-oriented database management system
Key-value store
Based on Google BigTable
Can hold extremely large data sets
Dynamic data model
Not a relational DBMS
HBASE
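To make the key-value, column-family model concrete, here is a minimal sketch
using the third-party happybase Python client (an assumption; the HBase shell
or the Java API are the usual interfaces). The host, table, and column names
are hypothetical.

```python
import happybase

connection = happybase.Connection('hbase-thrift-host')   # Thrift gateway host
table = connection.table('web_pages')

# Write: a row key plus column-family:qualifier -> value, all byte strings.
table.put(b'com.example/index.html',
          {b'content:html': b'<html>...</html>',
           b'meta:status': b'200'})

# Random read by row key -- the fast-lookup pattern HBase is designed for.
row = table.row(b'com.example/index.html')
print(row[b'meta:status'])
```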
High-level programming on top of Hadoop MapReduce
The language: Pig Latin
Expresses data analysis problems as data flows
Originally developed at Yahoo! in 2006
PIG
It provides an abstraction over MapReduce, making it
easier for users to write complex data processing tasks
without having to write lengthy Java code.
Data warehouse software that facilitates querying and managing large
datasets residing in distributed storage
SQL-like language!
Facilitates querying and managing large datasets in
HDFS
Mechanism to project structure onto this data and
query the data using a SQL-like language called
HiveQL
Apache Hive
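As a concrete illustration of projecting structure onto HDFS data and querying
it with HiveQL, here is a minimal sketch using the third-party PyHive client
(an assumption; HiveQL is normally run from the hive or beeline CLI). The
host, database, table, and path names are hypothetical.

```python
from pyhive import hive

conn = hive.Connection(host='hive-server-host', port=10000, database='default')
cursor = conn.cursor()

# Project structure onto files already sitting in HDFS (an external table),
# then query them with the SQL-like HiveQL language.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS page_views (user_id STRING, url STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/page_views'
""")
cursor.execute("SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url")
for url, hits in cursor.fetchall():
    print(url, hits)
```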
• Workflow scheduler system to manage Apache Hadoop jobs
• Oozie Coordinator jobs
• Supports MapReduce, Pig, Apache Hive, Sqoop, etc.
• Oozie performs the task of a scheduler: it schedules jobs and binds them
together as a single unit. There are two kinds of jobs, i.e., Oozie workflow
jobs and Oozie Coordinator jobs. Oozie workflow jobs are those that need to
be executed in a sequentially ordered manner, whereas Oozie Coordinator jobs
are those that are triggered when some data or an external stimulus becomes
available.
Oozie
Provides operational services for a Hadoop cluster.
A centralized service for:
maintaining configuration information
naming services
providing distributed synchronization
providing group services
Zookeeper
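The coordination primitives listed above can be illustrated with the
third-party kazoo Python client for ZooKeeper (an assumption; Hadoop itself
uses the Java client). The host address and znode paths are hypothetical.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts='zk-host:2181')
zk.start()

# Centralized configuration: store a small piece of config under a znode.
zk.ensure_path('/myapp/config')
zk.set('/myapp/config', b'replication=3')

# Naming / group membership: each worker registers an ephemeral znode that
# disappears automatically if the worker dies, giving live group membership.
zk.ensure_path('/myapp/workers')
zk.create('/myapp/workers/worker-', b'host1', ephemeral=True, sequence=True)
print(zk.get_children('/myapp/workers'))

zk.stop()
```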
Flume
Distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data
Apache Flume is an open-source data ingestion tool that is part of the
Apache Hadoop ecosystem.
It uses a simple, extensible data model that allows for all kinds of online
analytic applications.
Apache Spark is a fast and general engine for large-scale data processing.
Spark is a scalable data analytics platform that incorporates primitives for
in-memory computing, and therefore offers performance advantages over
Hadoop's traditional cluster storage approach. It is implemented in, and
supports, the Scala language, and provides a unique environment for data
processing.
Spark is really great for more complex kinds of analytics, and it has strong
support for machine learning libraries.
It is yet another open-source computing framework; it was originally
developed at the AMPLab at the University of California, Berkeley, and was
later donated to the Apache Software Foundation, where it remains today.
Spark
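A minimal PySpark sketch of the in-memory style of processing described above
(an illustrative word count; the input path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

lines = spark.sparkContext.textFile("hdfs:///data/input.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b)
               .cache())            # keep the result in memory for reuse

print(counts.take(10))
spark.stop()
```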
Avro
• Avro format is a row-based storage format for Hadoop, which is
widely used as a serialization platform.
• Avro format stores the schema in JSON format, making it easy to
read and interpret by any program.
• The data itself is stored in a binary format, making Avro files compact
and efficient.
• Avro format is a language-neutral data serialization system. It can be
processed by many languages (currently C, C++, C#, Java, Python,
and Ruby).
• A key feature of the Avro format is its robust support for data schemas
that change over time, i.e., schema evolution. Avro handles schema changes
like missing fields, added fields, and changed fields.
• Avro format provides rich data structures. For example, you can
create a record that contains an array, an enumerated type, and a
sub-record.
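A minimal sketch of Avro's JSON-schema-plus-binary-data layout, using the
third-party fastavro library (an assumption; the reference "avro" package
works similarly). The schema, field names, and file name are hypothetical.

```python
import fastavro

schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        # Schema evolution: new fields get a default so older records still read.
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

records = [{"name": "Ada", "age": 36, "email": None}]

with open("users.avro", "wb") as out:          # the schema is stored with the data
    fastavro.writer(out, fastavro.parse_schema(schema), records)

with open("users.avro", "rb") as fin:          # any reader can recover the schema
    for rec in fastavro.reader(fin):
        print(rec)
```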
Parquet
• Parquet, an open-source file format for Hadoop, stores nested data
structures in a flat columnar format.
• Compared to a traditional row-oriented approach, the Parquet file format
is more efficient in terms of storage and performance.
• It is especially good for queries that read particular columns from a
“wide” (with many columns) table since only needed columns are
read, and IO is minimized.
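The column-pruning benefit can be seen in a minimal sketch using pandas (with
a Parquet engine such as pyarrow installed, which is an assumption; Parquet is
equally usable from Hive, Spark, etc.). The file and column names are
hypothetical.

```python
import pandas as pd

wide = pd.DataFrame({
    "user_id": [1, 2, 3],
    "url": ["/a", "/b", "/c"],
    "payload": ["...", "...", "..."],   # stand-in for many more columns
})
wide.to_parquet("events.parquet")        # data is written column by column

# Only the requested columns are read from disk; the rest are skipped,
# which is what minimizes IO for "wide" tables.
subset = pd.read_parquet("events.parquet", columns=["user_id", "url"])
print(subset)
```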
RC File (Record Columnar Files)
• The RC file was the first columnar file format in Hadoop and has
significant compression and query performance benefits.
• But it does not support schema evolution: if you want to add anything to
an RC file, you have to rewrite the whole file, which is a slow process.
Hadoop Streaming
• The mapper and the reducer (in a Hadoop Streaming job) are scripts that
read the input line by line from stdin and emit the output to stdout (see
the sketch after this list).
• The streaming utility creates a Map/Reduce job, submits the job to an
appropriate cluster, and monitors the job's progress until completion.
• When a script is specified for the mappers, each mapper task launches the
script as a separate process when the mapper is initialized.
• The mapper task converts its inputs (key, value pairs) into lines and
pushes the lines to the standard input of the process. Meanwhile, the mapper
collects the line-oriented outputs from the standard output of the process
and converts each line into a (key, value) pair, which is collected as the
result of the mapper.
• When a reducer script is specified, each reducer task launches the script
as a separate process when the reducer is initialized.
• As the reducer task runs, it converts its input key/value pairs into lines
and feeds the lines to the standard input of the process. Meanwhile, the
reducer gathers the line-oriented outputs from the stdout of the process and
converts each line into a key/value pair, which is collected as the result
of the reducer.
• For both the mapper and the reducer, the prefix of a line up to the first
tab character is the key, and the rest of the line (excluding the tab
character) is the value. If there is no tab character in the line, the
entire line is considered the key and the value is null. This is
customizable by setting the -inputformat command option for the mapper and
the -outputformat option for the reducer.
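The following is a minimal sketch of the stdin/stdout contract described
above: a word-count mapper and reducer suitable for Hadoop Streaming,
combined into one script for brevity (normally mapper and reducer are
separate scripts). The paths and streaming-jar location in the comment are
hypothetical.

```python
# Typical invocation (paths are placeholders):
#   hadoop jar hadoop-streaming.jar \
#       -input /data/input -output /data/output \
#       -mapper "python3 wordcount.py map" \
#       -reducer "python3 wordcount.py reduce"
import sys

def mapper():
    # Emit "word<TAB>1" for every word; the tab separates key and value.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so counts for a word are contiguous.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```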
Hadoop Pipes
• Hadoop Pipes is a subproject of Apache Hadoop, designed to support writing
MapReduce applications in C++.
• Hadoop Pipes uses sockets to enable task trackers to communicate with the
processes running the C++ map or reduce functions.
• A socket is a communication mechanism that allows processes
to communicate with each other, either on the same machine
or across a network.
• Sockets are typically accessed through programming interfaces
known as socket APIs
Implementation and Performance:
•Hadoop Streaming: In Hadoop Streaming, the communication
between the Hadoop framework and the external program is
typically done through standard input and output streams. This
introduces some overhead in data serialization and deserialization,
which can impact performance, especially for large datasets.
•Hadoop Pipes: Hadoop Pipes offers a more efficient mechanism for
communication between Hadoop tasks and external C++ programs.
It uses a native interface and binary serialization, which can result in
better performance compared to Hadoop Streaming, especially for
computationally intensive tasks.
Anatomy of Map Reduce
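The original slides for this part were diagrams; as a stand-in, here is a
minimal, self-contained Python sketch (an illustration only, not Hadoop code)
of the three phases a MapReduce job goes through: map, shuffle/sort (grouping
by key), and reduce.

```python
from collections import defaultdict

def map_phase(text):
    # Map: emit (key, value) pairs -- here, (word, 1) for every word.
    return [(word, 1) for word in text.split()]

def shuffle_phase(pairs):
    # Shuffle/sort: group all values belonging to the same key together.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key.
    return {key: sum(values) for key, values in groups.items()}

print(reduce_phase(shuffle_phase(map_phase("the quick brown fox the fox"))))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 2}
```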
Job Scheduling in MapReduce
How does the Job Scheduling Work?
• In Hadoop, different clients submit their jobs to be performed. The jobs
are managed by the JobTracker or the ResourceManager.
• There are three different scheduling schemes:
  • First In First Out (FIFO) Scheduler
  • Capacity Scheduler
  • Fair Scheduler
• The JobTracker comes with these three scheduling techniques, and the
default is FIFO. The ResourceManager supports the Capacity Scheduler and the
Fair Scheduler, with the Capacity Scheduler as the default.
FIFO Scheduler
Advantages
• Jobs are served in the order of their submission.
• This scheduler is easy to understand and does not require any
configuration.
Disadvantages
• For shared clusters, this scheduler might not work best. If larger jobs
arrive before shorter ones, the larger jobs will use all the resources in
the cluster. The shorter jobs then sit in the queue for a long time and have
to wait for their turn, which leads to starvation (see the toy simulation
below).
• The balance of resource allocation between long and short applications is
not considered.
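To illustrate the starvation effect, here is a toy Python simulation (an
illustration only, not Hadoop code) comparing FIFO with an idealized fair
share of a fixed number of task slots; the job sizes and slot count are made
up.

```python
def fifo_finish_times(jobs, slots):
    """jobs: list of (name, task_count); each task takes 1 time unit on a slot."""
    t, finish = 0, {}
    for name, tasks in jobs:                 # run jobs strictly in submission order
        t += -(-tasks // slots)              # ceil(tasks / slots) time units
        finish[name] = t
    return finish

def fair_finish_times(jobs, slots):
    """Each running job gets an equal share of the slots (simplified)."""
    remaining = {name: tasks for name, tasks in jobs}
    t, finish = 0, {}
    while remaining:
        share = max(1, slots // len(remaining))   # equal split of slots
        t += 1
        for name in list(remaining):
            remaining[name] -= share
            if remaining[name] <= 0:
                finish[name] = t
                del remaining[name]
    return finish

jobs = [("big", 100), ("small", 4)]
print(fifo_finish_times(jobs, slots=4))  # {'big': 25, 'small': 26} -> small job starves
print(fair_finish_times(jobs, slots=4))  # small job finishes at t=2, big at t=26
```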
• In the Capacity Scheduler, there are multiple queues to schedule the tasks.
• For each queue, there are dedicated slots in the cluster. When no other
jobs are running, the tasks of one queue can occupy as many slots as they
need.
(A computer cluster is a set of loosely or tightly connected computers that
work together so that, in many respects, they can be viewed as a single
system.)
• When a new job arrives in another queue, it takes back the slots that are
dedicated to its queue.
Advantages
• This scheduler provides capacity guarantees and safeguards to the
organizations utilizing the cluster.
• It maximizes the throughput and utilization of resources in the Hadoop
cluster.
Disadvantages
• Compared to the other two schedulers, the Capacity Scheduler is considered
complex.
• The Fair Scheduler is very similar to the Capacity Scheduler.
• When a higher-priority job arrives in the same queue, it is processed in
parallel by taking over a portion of the dedicated slots.
Advantages
• It gives a reasonable way to share the cluster between a number of users.
• The Fair Scheduler can work with application priorities. The priorities
are used as weights to determine the fraction of the total resources that
each application should get.
Disadvantages
• Configuration is required.
Job Failure