Copyright © 2017, edureka and/or its affiliates. All rights reserved.
HADOOP COMPONENTS
• WHAT IS HADOOP?
• HADOOP CORE COMPONENTS
• HADOOP ARCHITECTURE
• MAJOR HADOOP COMPONENTS
www.edureka.co
WHAT IS HADOOP?
HADOOP
Hadoop is an open-source distributed processing framework that manages data processing and storage for big data applications running on clustered systems.
HADOOP CORE COMPONENTS
• MAPREDUCE
• COMMON UTILITIES
• HDFS
• YARN
[Diagram: Hadoop daemons by layer and role]
• HDFS: NameNode and Secondary NameNode (master), DataNode (slave)
• YARN: ResourceManager (master), NodeManager (slave)
HADOOP ARCHITECTURE
[Diagram: checkpointing between the NameNode and the Secondary NameNode. Labels: FS-image, Edit Log, Edit Log (New), FS-image (Final).]
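One way to read the FS-image and Edit Log labels in this diagram is as a checkpoint: the recorded edits are replayed onto the last FS-image to produce a final FS-image, while new changes go to a fresh Edit Log. The sketch below is a toy model of that idea only. The set-of-paths namespace, the `(op, path)` edit format, and the function names are assumptions made for illustration, not HDFS data structures or APIs.

```python
# Toy model of an HDFS checkpoint; the data layout here is hypothetical.

def apply_edit(namespace, edit):
    """Apply one edit-log entry (op, path) to a namespace held as a set of paths."""
    op, path = edit
    if op == "create":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fs_image, edit_log):
    """Merge an FS-image with its edit log and return the final FS-image."""
    merged = set(fs_image)  # copy, so the input image is left untouched
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged


fs_image = {"/data/a.csv", "/data/b.csv"}
edit_log = [("create", "/data/c.csv"), ("delete", "/data/a.csv")]

final_image = checkpoint(fs_image, edit_log)
new_edit_log = []  # changes made after the checkpoint go to a fresh log
print(sorted(final_image))
```

Keeping the merge as a pure function of the old image and the log mirrors why the checkpoint can be computed away from the NameNode.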
[Diagram: YARN components. A Client and a Resource Manager, plus three nodes that each run a Node Manager, an App Manager, and a Container. Arrow labels: Node Status, Resource Request, MapReduce Status.]
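The Node Status and Resource Request labels in this diagram can be illustrated with a deliberately simplified allocator. This is a sketch under stated assumptions: it uses a first-fit rule over free memory, which is a stand-in chosen for brevity and not how YARN's real schedulers decide placement. The node names, memory figures, and the `allocate` function are hypothetical.

```python
def allocate(node_free_mb, request_mb):
    """Grant a container on the first node with enough free memory."""
    for node, free_mb in node_free_mb.items():
        if free_mb >= request_mb:
            node_free_mb[node] = free_mb - request_mb
            return node
    return None  # no node can satisfy the request right now


# Hypothetical node status reports: node name -> free memory in MB.
node_free_mb = {"node-1": 2048, "node-2": 4096, "node-3": 1024}

first = allocate(node_free_mb, 3072)   # fits only on node-2
second = allocate(node_free_mb, 2048)  # fits on node-1
third = allocate(node_free_mb, 2048)   # nothing large enough is left
print(first, second, third)
```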
MAJOR HADOOP COMPONENTS
[Diagram: Hadoop ecosystem categories, built on Storage (HDFS) and Resource Management (YARN)]
• Storage Managers
• General Purpose Execution Engines
• Database Management Engines
• Data Abstraction Engines
• Realtime Data Streaming Frameworks
• Graph Processing Frameworks
• Machine Learning Engines
• Hadoop Cluster Management Software
HADOOP STORAGE MANAGERS
HDFS
• Hadoop Distributed File System.
• Primary data storage layer in Hadoop.
• Used in distributed data processing environments.
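To make the storage role concrete, the sketch below estimates how a file is split into blocks and how much raw capacity its replicas consume. It assumes a 128 MB block size and a replication factor of 3, which are common HDFS defaults but are configurable per cluster. The function names are illustrative only.

```python
import math

BLOCK_SIZE_MB = 128      # assumed block size (a common HDFS default)
REPLICATION_FACTOR = 3   # assumed replication factor (a common HDFS default)


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb, replication=REPLICATION_FACTOR):
    """Raw cluster storage consumed once every block is replicated."""
    return file_size_mb * replication


print(block_count(500))      # 4 blocks: 128 + 128 + 128 + 116 MB
print(raw_storage_mb(500))   # 1500 MB across the cluster
```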
HCATALOG
• Table and storage management layer for Hadoop.
• Exposes the tabular data of the Hive metastore to other applications such as Pig and MapReduce.
ZOOKEEPER
• Centralized open-source coordination server.
• Provides a distributed configuration service, a synchronization service, and a naming registry for large distributed systems.
OOZIE
• Server-based workflow scheduling system.
• Schedules Apache Hadoop jobs.
• Manages workflows defined as Directed Acyclic Graphs (DAGs) of actions.
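The DAG idea can be sketched without Oozie itself: given actions and their dependencies, a scheduler needs an order in which every action runs after the ones it depends on. The example uses Python's standard `graphlib` module (Python 3.9+). The workflow and its action names are hypothetical, and real Oozie workflows are defined in XML rather than in code like this.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}


def execution_order(dag):
    """Return one valid order in which the workflow's actions can run."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(workflow)
print(order)
```

If the dependencies contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is why such workflows must be acyclic.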
GENERAL PURPOSE EXECUTION ENGINES
MAPREDUCE
• Software framework for distributed processing.
• Splits input data into chunks that map tasks process in parallel; reduce tasks then aggregate the results.
• Modeled on the map and reduce operations of functional programming.
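A single-process word count shows the shape of the model. This is a minimal sketch in plain Python, not the Hadoop MapReduce API: the `map_phase`, `shuffle`, and `reduce_phase` names are illustrative, and on a cluster each phase would run as parallel tasks over chunks of the input.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


chunks = ["big data big cluster", "data node data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```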
SPARK
• General-purpose cluster computing framework.
• Can perform real-time data streaming and ETL.
• Processes streaming data in micro-batches.
TEZ
• High-performance data processing framework.
• Executes a series of MapReduce jobs as a single job.
• Used in batch processing environments.
HADOOP DATABASE MANAGEMENT ENGINES
HIVE
• Data warehouse software project.
• Enables SQL-like queries over data stored in Hadoop.
• Used for ETL, with Hive DDL and DML statements.
SPARK SQL
• Distributed SQL query engine.
• Enables structured data processing.
• Used to import data from RDDs, Hive, Parquet files, and other sources.
IMPALA
• In-memory query engine.
• Integrates with the Hive metastore to share table information between the components.
• Used to query data stored in Hadoop clusters.
APACHE DRILL
• Low-latency distributed query engine.
• Queries a variety of data stores with a single query.
• Supports different kinds of NoSQL databases.
HBASE
• Open-source, non-relational distributed database.
• Runs on top of HDFS and provides random, real-time read/write access to large tables.
HADOOP DATA ABSTRACTION ENGINES
APACHE PIG
• High-level scripting language (Pig Latin).
• Enables users to write complex data transformations.
• Performs ETL and analyses huge datasets.
APACHE SQOOP
• Command-line interface application for transferring data between relational databases and Hadoop.
• Data ingestion tool.
• Imports and exports structured data at enterprise scale.
HADOOP REAL-TIME STREAMING FRAMEWORKS
SPARK STREAMING
• Spark Streaming is an extension of the core Spark API.
• Enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
• Provides a high-level abstraction called a discretized stream (DStream), which represents a continuous stream of data.
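The idea behind a discretized stream can be sketched in plain Python: incoming events are cut into small batches by time interval, and each batch is then processed like an ordinary collection. This is an illustration of the concept only and does not use the Spark Streaming API. The `discretize` function, the event tuples, and the one-second interval are assumptions.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp, value) events into batches of fixed duration."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(value)
    # Return the non-empty batches in time order.
    return [batches[index] for index in sorted(batches)]


# Hypothetical events: (seconds since start, reading)
events = [(0.2, 5), (0.9, 7), (1.1, 3), (2.5, 8), (2.7, 1)]

batch_sums = [sum(batch) for batch in discretize(events, batch_interval=1.0)]
print(batch_sums)  # one result per micro-batch
```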
APACHE KAFKA
• Open-source stream-processing platform.
• Ingests and moves large amounts of data very quickly.
• Lets applications publish and subscribe to streams of records.
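Publish and subscribe over a stream of records can be pictured as an append-only log that each consumer reads from its own position. The class below is a toy in-memory model of that idea, not the Kafka client API. The `Topic` class, its methods, and the consumer names are hypothetical.

```python
class Topic:
    """A toy append-only log that consumers read at their own pace."""

    def __init__(self):
        self.records = []
        self.offsets = {}  # consumer name -> index of next record to read

    def publish(self, record):
        """Append a record to the end of the log."""
        self.records.append(record)

    def poll(self, consumer):
        """Return the records this consumer has not seen yet."""
        start = self.offsets.get(consumer, 0)
        unread = self.records[start:]
        self.offsets[consumer] = len(self.records)
        return unread


topic = Topic()
topic.publish("page_view")
topic.publish("click")

first = topic.poll("analytics")   # both records published so far
topic.publish("purchase")
second = topic.poll("analytics")  # only the new record
other = topic.poll("billing")     # a new consumer starts from the beginning
```

Because records stay in the log after being read, a second consumer can replay the same stream independently of the first.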
APACHE FLUME
• Open-source, distributed, and reliable service for log data.
• Architecture is based on streaming data flows.
• Collects, aggregates, and moves large amounts of log data.
HADOOP GRAPH PROCESSING FRAMEWORKS
APACHE GIRAPH
• Iterative graph processing framework.
• Utilizes Apache Hadoop's MapReduce
implementation to process graphs.
• Used to analyse social media data
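Iterative graph processing of this kind is often described as a sequence of supersteps in which every vertex reads its neighbours' values and updates its own. The sketch below finds connected components that way in plain Python. It is an assumption-laden illustration, not Giraph code: the `connected_components` function and the edge list are hypothetical, and a real job would distribute vertices across workers.

```python
def connected_components(edges):
    """Label each vertex with the smallest vertex id in its component."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    labels = {vertex: vertex for vertex in neighbours}
    changed = True
    while changed:  # each pass plays the role of one superstep
        changed = False
        # Messages are computed from the labels of the previous superstep.
        incoming = {
            vertex: min(labels[n] for n in neighbours[vertex])
            for vertex in neighbours
        }
        for vertex, smallest in incoming.items():
            if smallest < labels[vertex]:
                labels[vertex] = smallest
                changed = True
    return labels


edges = [(1, 2), (2, 3), (4, 5)]
components = connected_components(edges)
print(components)
```

The loop stops when a superstep changes no label, which is the usual halting condition for this style of computation.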
APACHE GRAPHX
• GraphX is Apache Spark's API for graphs and graph-parallel computation.
• Offers performance comparable to the fastest specialized graph processing systems.
• Works seamlessly with both graphs and collections.
• Includes a growing library of graph algorithms.
HADOOP MACHINE LEARNING FRAMEWORKS
H2O
• H2O is open-source software for big-data analysis.
• H2O allows users to fit thousands of potential models as part of discovering patterns in data.
• H2O uses iterative methods that provide quick
answers using all of the client's data.
ORYX
• A generic lambda architecture tier that provides batch, speed, and serving layers.
• Specialized for real-time, large-scale machine learning.
• Includes end-to-end implementations of standard ML algorithms as applications.
SPARK MLlib
• Spark MLlib is a scalable Machine Learning
Library.
• It enables us to perform Machine Learning
operations in Spark.
AVRO
• Avro is a row-oriented remote procedure call and data serialization framework.
• Supports dynamic typing and schema evolution.
• Used for data serialization and RPC.
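Schema evolution means data written with an older schema can still be read with a newer one. The sketch below mimics the basic resolution rules in plain Python: fields the reader does not know are dropped, and fields missing from the record take the reader's default. It does not use the Avro library. The schema representation, field names, and `resolve` function are hypothetical simplifications.

```python
# Toy reader schema: field name -> default value (None means "required").
# The field names are hypothetical.
READER_SCHEMA = {"id": None, "name": None, "country": "unknown"}


def resolve(record, reader_schema):
    """Project a record onto the reader schema, filling in defaults."""
    resolved = {}
    for field, default in reader_schema.items():
        if field in record:
            resolved[field] = record[field]
        elif default is not None:
            resolved[field] = default
        else:
            raise KeyError(f"missing required field: {field}")
    return resolved


old_record = {"id": 1, "name": "Ada"}  # written before 'country' existed
new_record = {"id": 2, "name": "Lin", "country": "SG", "age": 30}

print(resolve(old_record, READER_SCHEMA))
print(resolve(new_record, READER_SCHEMA))
```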
THRIFT
• It is an Interface definition language and binary
communication protocol.
• It allows users to define data types and service
interfaces in a simple definition file
• Thrift is used in building RPC Clients and Servers.
MAHOUT
• Provides implementations of distributed machine learning algorithms.
• Runs on Hadoop, which stores and processes big data across clusters of computers using simple programming models.
HADOOP CLUSTER MANAGEMENT SOFTWARE
AMBARI
• Hadoop Cluster Management Software.
• Ambari enables system administrators to
provision, manage and monitor a Hadoop cluster.
ZOOKEEPER
• Centralized open-source server.
• Manages configuration across nodes.
• Implements reliable messaging.
• Implements redundant services.
• Synchronizes process execution.
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
