Hadoop Beginnings
What is Hadoop?
Apache Hadoop is an open-source software framework for storage and
large-scale processing of data sets on clusters of commodity hardware.
Hadoop Beginnings
Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
It was originally developed to support distribution of the Nutch
Search Engine Project.
Doug, who was working at Yahoo! at the time and is now chief architect
at Cloudera, named the project after his son's toy elephant, Hadoop.
History of Hadoop
HADOOP FEATURES
Fault Tolerance
Reliability
High Availability
Scalability
Cost-Effectiveness
Data Locality
Scalability
Scalability is at the core of a Hadoop system.
With cheap compute and storage, we can distribute and scale across
nodes very easily, in a very cost-effective manner.
Apache Hadoop Framework
& its Basic Modules
Hadoop Common: It contains libraries and utilities needed
by other Hadoop modules.
Hadoop Distributed File System (HDFS): It is a distributed file
system that stores data on commodity machines, providing very high
aggregate bandwidth across the entire cluster.
Hadoop YARN: It is a resource management platform responsible for
managing compute resources in the cluster and using them to schedule
users' applications.
Hadoop MapReduce: It is a programming model for large-scale data
processing.
Apache Framework Basic Modules
HDFS
Hadoop distributed file system
A distributed, scalable, and portable file system written in Java for
the Hadoop framework.
A Hadoop instance typically has a single NameNode and a cluster of
DataNodes that form the HDFS cluster.
HDFS stores large files, typically in the range of gigabytes to
terabytes, and now petabytes, across multiple machines. It achieves
reliability by replicating the data across multiple hosts, and
therefore does not require RAID storage on the hosts.
HDFS: Hadoop distributed file system
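Since HDFS reliability comes from block replication rather than RAID, the
storage overhead is easy to reason about. Below is a minimal Python sketch of
that arithmetic; the 128 MB block size and replication factor of 3 are common
defaults assumed here, not values taken from the slides.

```python
import math

def hdfs_footprint(file_size_gb, block_size_mb=128, replication=3):
    """Estimate how a file splits into blocks and how much raw capacity it uses."""
    blocks = math.ceil(file_size_gb * 1024 / block_size_mb)  # logical blocks
    block_replicas = blocks * replication                    # physical copies stored
    raw_capacity_gb = file_size_gb * replication             # total disk consumed
    return blocks, block_replicas, raw_capacity_gb

# Example: a 1 GB file -> 8 blocks, 24 stored block replicas, 3 GB of raw storage.
print(hdfs_footprint(1))
```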
MapReduce Engine
The typical MapReduce engine consists of a JobTracker, to which client
applications can submit MapReduce jobs. The JobTracker pushes work out to the
available TaskTrackers in the cluster, striving to keep the work as close to
the data as possible and as balanced as possible.
Apache Hadoop NextGen MapReduce (YARN)
What is YARN?
YARN enhances the power of a Hadoop compute cluster without being limited by
the MapReduce framework.
Scalability: The processing power in data centers continues to grow quickly.
Because the YARN ResourceManager focuses exclusively on scheduling, it can
manage those very large clusters quite easily.
MapReduce Compatibility: YARN is completely compatible with MapReduce.
Existing MapReduce applications and end users can run on top of YARN without
disrupting any of their existing processes.
Improved Cluster Utilization: The ResourceManager is a pure scheduler that
optimizes cluster utilization according to criteria such as capacity
guarantees, fairness, and SLAs (service level agreements).
YARN supports workflows other than just MapReduce. We can now bring in
additional programming models, such as graph processing or iterative
modeling, and it is now possible to process the data in HBase. This is
especially useful when we talk about machine learning applications.
YARN allows multiple access engines, either open source or proprietary, to
use Hadoop as a common standard for batch or interactive processing, and even
real-time engines that can simultaneously access the same data. So you can
put streaming applications on top of YARN inside a Hadoop architecture and
seamlessly work and communicate between these environments.
Tool designed for efficiently transferring bulk
data between Apache Hadoop and structured
datastores such as relational databases
Apache Sqoop
HBase is a key component of the Hadoop stack, as its design caters to
applications that require really fast random access to significant data sets.
Column-oriented database management system
Key-value store
Based on Google BigTable
Can hold extremely large data sets
Dynamic data model
Not a relational DBMS
HBASE
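To make the key-value, column-family model concrete, here is a minimal sketch
using the third-party happybase Python client (an assumption; the HBase shell
or the Java API are the usual interfaces). The host, table, and column names
are hypothetical.

```python
import happybase

connection = happybase.Connection('hbase-thrift-host')   # Thrift gateway host
table = connection.table('web_pages')

# Write: a row key plus column-family:qualifier -> value, all byte strings.
table.put(b'com.example/index.html',
          {b'content:html': b'<html>...</html>',
           b'meta:status': b'200'})

# Random read by row key -- the fast-lookup pattern HBase is designed for.
row = table.row(b'com.example/index.html')
print(row[b'meta:status'])
```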
High-level programming on top of Hadoop MapReduce
The language: Pig Latin
Expresses data analysis problems as data flows
Originally developed at Yahoo! in 2006
PIG
It provides an abstraction over MapReduce, making it
easier for users to write complex data processing tasks
without having to write lengthy Java code.
Data warehouse software that facilitates querying and managing large
datasets residing in distributed storage
SQL-like language!
Facilitates querying and managing large datasets in
HDFS
Mechanism to project structure onto this data and
query the data using a SQL-like language called
HiveQL
Apache Hive
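As a concrete illustration of projecting structure onto HDFS data and querying
it with HiveQL, here is a minimal sketch using the third-party PyHive client
(an assumption; HiveQL is normally run from the hive or beeline CLI). The
host, database, table, and path names are hypothetical.

```python
from pyhive import hive

conn = hive.Connection(host='hive-server-host', port=10000, database='default')
cursor = conn.cursor()

# Project structure onto files already sitting in HDFS (an external table),
# then query them with the SQL-like HiveQL language.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS page_views (user_id STRING, url STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/page_views'
""")
cursor.execute("SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url")
for url, hits in cursor.fetchall():
    print(url, hits)
```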
• Workflow scheduler system to manage Apache Hadoop jobs
• Oozie Coordinator jobs
• Supports MapReduce, Pig, Apache Hive, Sqoop, etc.
• Oozie performs the task of a scheduler: it schedules jobs and binds them
together as a single unit. There are two kinds of jobs, i.e., Oozie workflow
jobs and Oozie Coordinator jobs. Oozie workflow jobs are those that need to
be executed in a sequentially ordered manner, whereas Oozie Coordinator jobs
are those that are triggered when some data or an external stimulus becomes
available.
Oozie
Provides operational services for a Hadoop cluster.
A centralized service for:
maintaining configuration information
naming services
providing distributed synchronization
providing group services
Zookeeper
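The coordination primitives listed above can be illustrated with the
third-party kazoo Python client for ZooKeeper (an assumption; Hadoop itself
uses the Java client). The host address and znode paths are hypothetical.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts='zk-host:2181')
zk.start()

# Centralized configuration: store a small piece of config under a znode.
zk.ensure_path('/myapp/config')
zk.set('/myapp/config', b'replication=3')

# Naming / group membership: each worker registers an ephemeral znode that
# disappears automatically if the worker dies, giving live group membership.
zk.ensure_path('/myapp/workers')
zk.create('/myapp/workers/worker-', b'host1', ephemeral=True, sequence=True)
print(zk.get_children('/myapp/workers'))

zk.stop()
```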
Flume
Distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data
Apache Flume is an open-source data ingestion tool that is part of the
Apache Hadoop ecosystem.
It uses a simple, extensible data model that allows for all kinds of online
analytic applications.
Apache Spark is a fast and general engine for large-scale data processing.
Spark is a scalable data analytics platform that incorporates primitives for
in-memory computing, and therefore offers performance advantages over
Hadoop's traditional cluster storage approach. It is implemented in, and
supports, the Scala language, and provides a unique environment for data
processing.
Spark is really great for more complex kinds of analytics, and it has strong
support for machine learning libraries.
It is yet another open-source computing framework; it was originally
developed at the AMPLab at the University of California, Berkeley, and was
later donated to the Apache Software Foundation, where it remains today.
Spark
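A minimal PySpark sketch of the in-memory style of processing described above
(an illustrative word count; the input path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

lines = spark.sparkContext.textFile("hdfs:///data/input.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b)
               .cache())            # keep the result in memory for reuse

print(counts.take(10))
spark.stop()
```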
Avro
• Avro format is a row-based storage format for Hadoop, which is
widely used as a serialization platform.
• Avro format stores the schema in JSON format, making it easy to
read and interpret by any program.
• The data itself is stored in a binary format, making Avro files compact
and efficient.
• Avro format is a language-neutral data serialization system. It can be
processed by many languages (currently C, C++, C#, Java, Python,
and Ruby).
• A key feature of the Avro format is its robust support for data schemas
that change over time, i.e., schema evolution. Avro handles schema changes
like missing fields, added fields, and changed fields.
• Avro format provides rich data structures. For example, you can
create a record that contains an array, an enumerated type, and a
sub-record.
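A minimal sketch of Avro's JSON-schema-plus-binary-data layout, using the
third-party fastavro library (an assumption; the reference "avro" package
works similarly). The schema, field names, and file name are hypothetical.

```python
import fastavro

schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        # Schema evolution: new fields get a default so older records still read.
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

records = [{"name": "Ada", "age": 36, "email": None}]

with open("users.avro", "wb") as out:          # the schema is stored with the data
    fastavro.writer(out, fastavro.parse_schema(schema), records)

with open("users.avro", "rb") as fin:          # any reader can recover the schema
    for rec in fastavro.reader(fin):
        print(rec)
```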
Parquet
• Parquet, an open-source file format for Hadoop, stores nested data
structures in a flat columnar format.
• Compared to a traditional row-oriented approach, the Parquet file format
is more efficient in terms of storage and performance.
• It is especially good for queries that read particular columns from a
“wide” (with many columns) table since only needed columns are
read, and IO is minimized.
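The column-pruning benefit can be seen in a minimal sketch using pandas (with
a Parquet engine such as pyarrow installed, which is an assumption; Parquet is
equally usable from Hive, Spark, etc.). The file and column names are
hypothetical.

```python
import pandas as pd

wide = pd.DataFrame({
    "user_id": [1, 2, 3],
    "url": ["/a", "/b", "/c"],
    "payload": ["...", "...", "..."],   # stand-in for many more columns
})
wide.to_parquet("events.parquet")        # data is written column by column

# Only the requested columns are read from disk; the rest are skipped,
# which is what minimizes IO for "wide" tables.
subset = pd.read_parquet("events.parquet", columns=["user_id", "url"])
print(subset)
```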
RC File (Record Columnar Files)
• The RC file was the first columnar file format in Hadoop and has
significant compression and query performance benefits.
• But it does not support schema evolution: if you want to add anything to
an RC file, you have to rewrite the whole file, which is a slow process.
Hadoop Streaming
• The mapper and the reducer (in a Hadoop Streaming job) are scripts that
read the input line by line from stdin and emit the output to stdout (see
the sketch after this list).
• The streaming utility creates a Map/Reduce job, submits the job to an
appropriate cluster, and monitors the job's progress until completion.
• When a script is specified for the mappers, each mapper task launches the
script as a separate process when the mapper is initialized.
• The mapper task converts its inputs (key, value pairs) into lines and
pushes the lines to the standard input of the process. Meanwhile, the mapper
collects the line-oriented outputs from the standard output of the process
and converts each line into a (key, value) pair, which is collected as the
result of the mapper.
• When a reducer script is specified, each reducer task launches the script
as a separate process when the reducer is initialized.
• As the reducer task runs, it converts its input key/value pairs into lines
and feeds the lines to the standard input of the process. Meanwhile, the
reducer gathers the line-oriented outputs from the stdout of the process and
converts each line into a key/value pair, which is collected as the result
of the reducer.
• For both the mapper and the reducer, the prefix of a line up to the first
tab character is the key, and the rest of the line (excluding the tab
character) is the value. If there is no tab character in the line, the
entire line is considered the key and the value is null. This is
customizable by setting the -inputformat command option for the mapper and
the -outputformat option for the reducer.
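The following is a minimal sketch of the stdin/stdout contract described
above: a word-count mapper and reducer suitable for Hadoop Streaming,
combined into one script for brevity (normally mapper and reducer are
separate scripts). The paths and streaming-jar location in the comment are
hypothetical.

```python
# Typical invocation (paths are placeholders):
#   hadoop jar hadoop-streaming.jar \
#       -input /data/input -output /data/output \
#       -mapper "python3 wordcount.py map" \
#       -reducer "python3 wordcount.py reduce"
import sys

def mapper():
    # Emit "word<TAB>1" for every word; the tab separates key and value.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so counts for a word are contiguous.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```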
Hadoop Pipes
• Hadoop Pipes is a subproject of Apache Hadoop, designed to support writing
MapReduce applications in C++.
• Hadoop Pipes uses sockets to enable task trackers to communicate with the
processes running the C++ map or reduce functions.
• A socket is a communication mechanism that allows processes
to communicate with each other, either on the same machine
or across a network.
• Sockets are typically accessed through programming interfaces
known as socket APIs
Implementation and Performance:
•Hadoop Streaming: In Hadoop Streaming, the communication
between the Hadoop framework and the external program is
typically done through standard input and output streams. This
introduces some overhead in data serialization and deserialization,
which can impact performance, especially for large datasets.
•Hadoop Pipes: Hadoop Pipes offers a more efficient mechanism for
communication between Hadoop tasks and external C++ programs.
It uses a native interface and binary serialization, which can result in
better performance compared to Hadoop Streaming, especially for
computationally intensive tasks.
Anatomy of Map Reduce
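The original slides for this part were diagrams; as a stand-in, here is a
minimal, self-contained Python sketch (an illustration only, not Hadoop code)
of the three phases a MapReduce job goes through: map, shuffle/sort (grouping
by key), and reduce.

```python
from collections import defaultdict

def map_phase(text):
    # Map: emit (key, value) pairs -- here, (word, 1) for every word.
    return [(word, 1) for word in text.split()]

def shuffle_phase(pairs):
    # Shuffle/sort: group all values belonging to the same key together.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key.
    return {key: sum(values) for key, values in groups.items()}

print(reduce_phase(shuffle_phase(map_phase("the quick brown fox the fox"))))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 2}
```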
Job Scheduling in MapReduce
How does the Job Scheduling Work?
• In Hadoop, different clients submit their jobs to be performed. The jobs
are managed by the JobTracker or the ResourceManager.
• There are three different scheduling schemes:
  • First In First Out (FIFO) Scheduler
  • Capacity Scheduler
  • Fair Scheduler
• The JobTracker comes with these three scheduling techniques, and the
default is FIFO. The ResourceManager supports the Capacity Scheduler and the
Fair Scheduler, with the Capacity Scheduler as the default.
FIFO Scheduler
Advantages
• Jobs are served in the order of their submission.
• This scheduler is easy to understand and does not require any
configuration.
Disadvantages
• For shared clusters, this scheduler might not work best. If larger jobs
arrive before shorter ones, the larger jobs will use all the resources in
the cluster. The shorter jobs then sit in the queue for a long time and have
to wait for their turn, which leads to starvation (see the toy simulation
below).
• The balance of resource allocation between long and short applications is
not considered.
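To illustrate the starvation effect, here is a toy Python simulation (an
illustration only, not Hadoop code) comparing FIFO with an idealized fair
share of a fixed number of task slots; the job sizes and slot count are made
up.

```python
def fifo_finish_times(jobs, slots):
    """jobs: list of (name, task_count); each task takes 1 time unit on a slot."""
    t, finish = 0, {}
    for name, tasks in jobs:                 # run jobs strictly in submission order
        t += -(-tasks // slots)              # ceil(tasks / slots) time units
        finish[name] = t
    return finish

def fair_finish_times(jobs, slots):
    """Each running job gets an equal share of the slots (simplified)."""
    remaining = {name: tasks for name, tasks in jobs}
    t, finish = 0, {}
    while remaining:
        share = max(1, slots // len(remaining))   # equal split of slots
        t += 1
        for name in list(remaining):
            remaining[name] -= share
            if remaining[name] <= 0:
                finish[name] = t
                del remaining[name]
    return finish

jobs = [("big", 100), ("small", 4)]
print(fifo_finish_times(jobs, slots=4))  # {'big': 25, 'small': 26} -> small job starves
print(fair_finish_times(jobs, slots=4))  # small job finishes at t=2, big at t=26
```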
• In the Capacity Scheduler, there are multiple queues to schedule the tasks.
• For each queue, there are dedicated slots in the cluster. When no other
jobs are running, the tasks of one queue can occupy as many slots as they
need.
(A computer cluster is a set of loosely or tightly connected computers that
work together so that, in many respects, they can be viewed as a single
system.)
• When a new job arrives in another queue, it takes back the slots that are
dedicated to its queue.
Advantages
• This scheduler provides capacity guarantees and safeguards to the
organizations utilizing the cluster.
• It maximizes the throughput and utilization of resources in the Hadoop
cluster.
Disadvantages
• Compared to the other two schedulers, the Capacity Scheduler is considered
complex.
• The Fair Scheduler is very similar to the Capacity Scheduler.
• When a higher-priority job arrives in the same queue, it is processed in
parallel by taking over a portion of the dedicated slots.
Advantages
• It gives a reasonable way to share the cluster between a number of users.
• The Fair Scheduler can work with application priorities. The priorities
are used as weights to determine the fraction of the total resources that
each application should get.
Disadvantages
• Configuration is required.
Job Failure