The MapReduce Programming Paradigm
• Hadoop MapReduce is a software framework for distributed processing of
large data sets on computing clusters.
• It is a sub-project of the Apache Hadoop project.
• Apache Hadoop is an open-source framework that allows you to store and
process big data in a distributed environment across clusters of computers
using simple programming models.
• MapReduce splits the input data set into a number of parts and runs a
program on all the parts in parallel.
• The term MapReduce refers to two separate and distinct tasks.
• The first is the map operation, which takes a set of data and converts it into another
set of data, where individual elements are broken down into tuples (key/value
pairs).
• The reduce operation combines those data tuples based on the key and
accordingly modifies the value of the key.
• MapReduce is a programming paradigm that was designed to allow parallel
distributed processing of large sets of data, converting them to sets of
tuples, and then combining and reducing those tuples into smaller sets of
tuples.
• With respect to MapReduce, tuples refer to key-value pairs by which data is
grouped, sorted, and processed.
• In the map task, you break your data into key-value pairs, transform it, and
filter it. The data is then assigned to nodes for processing.
• In the reduce task, you aggregate that data down to smaller-sized datasets.
Data from the reduce step is transformed into a standard key-value format,
where the key acts as the record identifier and the value is the value being
identified by the key. A short word-count sketch follows.
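As an illustration of the map and reduce operations, here is a minimal word-count sketch using the Hadoop Java MapReduce API. It is only a sketch: the class names and the assumption that each input record is one line of text are illustrative, not part of the original slides.

java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: break each input line into (word, 1) key/value pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);              // emit (word, 1)
      }
    }
  }
}

// Reduce: combine all values that share a key into a single count.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));  // emit (word, total count)
  }
}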
Hadoop JobTracker and TaskTracker Design
• In a Hadoop system there are five services always running in the background (called Hadoop daemon services).
• Daemon Services of Hadoop
• 1. NameNode
• 2. Secondary NameNode
• 3. JobTracker
• 4. DataNode
• 5. TaskTracker
• Job Tracker is the master daemon for both Job resource management and scheduling / monitoring of
jobs. It acts as a liaison between Hadoop and your application.
• MapReduce Workflow: Hadoop divides the job into tasks. There are two types of
tasks: 1. Map tasks (splits and mapping)
• 2. Reduce tasks (shuffling and reducing)
• The complete execution process (execution of both Map and Reduce tasks) is
controlled by two types of entities:
• 1. JobTracker: acts as the master (responsible for the complete execution of the
submitted job)
• 2. Multiple TaskTrackers: act as slaves, each of them performing part of the job
• 1. The user copies all input files to the distributed file system using NameNode metadata.
• 2. Jobs are submitted to the job client; they apply to the input files stored in the DataNodes.
• 3. The client gets information about the input files to be processed from the NameNode.
• 4. The client creates splits of all the files for the job.
• 5. After splitting the files, the client stores metadata about this job in the DFS.
• 6. The client then submits this job to the JobTracker.
7. The JobTracker now comes into the picture and initializes the job in the job queue.
8. The JobTracker reads the job files from the DFS submitted by the client.
9. The JobTracker then creates map and reduce tasks for the job, and the input splits are
assigned to the mappers.
There are as many mappers as there are input splits. Every map task works on an
individual split and creates its output.
10. Now the TaskTrackers come into the picture: tasks are submitted to every TaskTracker
by the JobTracker, which receives a heartbeat from each TaskTracker to confirm that it is
working properly. This heartbeat is sent to the JobTracker by every TaskTracker every
3 seconds.
If any TaskTracker does not send a heartbeat to the JobTracker within 3 seconds, the
JobTracker waits for 30 more seconds, after which it considers that TaskTracker to be in a
dead state and updates its metadata about it.
11. Tasks are picked from the splits.
How status updates are propagated through the
MapReduce 1 system
• Job Completion
• When the jobtracker receives a notification that the last task for a job is
complete (this will be the special job cleanup task), it changes the status for
the job to “successful.”
• Then, when the Job polls for status, it learns that the job has completed
successfully, so it prints a message to tell the user and then returns from the
waitForCompletion() method.
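A minimal driver sketch showing the waitForCompletion() call referred to above. It reuses the illustrative WordCountMapper and WordCountReducer classes from the earlier sketch; the input and output paths come from command-line arguments.

java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // waitForCompletion() submits the job and polls its status until it
    // finishes, printing progress; it returns true on success.
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}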
YARN MapReduce
• YARN meets the scalability shortcomings of “classic” MapReduce by splitting the responsibilities
of the jobtracker into separate entities.
• The jobtracker takes care of both job scheduling (matching tasks with tasktrackers) and task
progress monitoring.
• YARN separates these two roles into two independent daemons:
• 1. Resource manager:
-> To manage the use of resources across the cluster.
• 2. Application master:
-> To manage the lifecycle of applications running on the cluster
THREE COMPONENTS OF HADOOP
• HDFS
• MAP REDUCE
• YARN
• YARN (Yet Another Resource Negotiator) is used to manage the
Hadoop cluster.
MAIN COMPONENTS OF YARN
ARCHITECTURE
• Client: the user submits the job to the job client.
• Resource Manager: the client, after copying the resources to HDFS, submits the job to the RM.
• Node Manager: the RM contacts the NM to allocate a container and launch the AM.
• Application Master: decides how to run the map and reduce phases.
• Container
• YarnChild is a Java program for executing tasks; it runs in a separate JVM to
isolate user code from long-running system daemons.
• YARN does not support JVM reuse.
APACHE PIG
• Pig is a high-level platform or tool which is used to process large datasets.
• It provides a high level of abstraction for processing on top of MapReduce.
• It provides a high-level scripting language, known as Pig Latin, which is
used to develop data analysis code.
• A high-level scripting language makes code easier to write and understand;
users can express operations more naturally.
• It provides a more user-friendly syntax for programming.
Examples of scripting languages
High-level scripting languages are designed to be easy to read and write, with a
focus on rapid development. Here are some examples:
1. Python:
• Known for its readability and simplicity.
• Widely used in web development, data science, machine learning, and automation.
python
print("Hello, World!")
2. JavaScript:
• Mainly used for front-end web development.
• Also used on the server side with Node.js.
javascript
console.log("Hello, World!");
3. Bash:
• Designed for shell scripting and automation in Unix-like operating
systems.
bash
echo "Hello, World!"
4. Ruby:
• Emphasizes simplicity and productivity.
• Commonly used for web development.
ruby
puts "Hello, World!"
About Apache Pig
• First, to process the data stored in HDFS, programmers write scripts
using the Pig Latin language.
• Internally, the Pig Engine (a component of Apache Pig) converts all these scripts
into specific map and reduce tasks.
• These tasks are not visible to the programmers, in order to provide a high
level of abstraction.
• Pig Latin and the Pig Engine are the two main components of the Apache Pig
tool.
• The result of Pig is always stored in HDFS.
Features of Apache Pig:
• For performing several operations, Apache Pig provides rich sets of operators such as
filtering, joining, sorting, and aggregation.
• Easy to learn, read and write. Especially for SQL programmers, Apache Pig is a boon.
• Apache Pig is extensible, so you can write your own processing logic as user-defined
functions (UDFs) in Python, Java, or other programming languages (a Java sketch follows this list).
• Join operations are easy in Apache Pig.
• Fewer lines of code.
• Apache Pig allows splits in the pipeline.
• The data structure is multivalued, nested, and richer.
• Pig can handle the analysis of both structured and unstructured data.
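A minimal sketch of a Pig UDF in Java, as mentioned in the extensibility feature above. The class name and behaviour (upper-casing the first string argument) are illustrative assumptions, not something defined in the slides.

java
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A trivial Pig UDF that upper-cases its first string argument.
public class UpperCase extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;   // pass nulls through
    }
    return input.get(0).toString().toUpperCase();
  }
}

Once the compiled class is packaged in a jar, it is typically registered in a Pig Latin script with REGISTER and then called like any built-in function.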
Installing and Running Pig
• Runs as a client-side application.
• Even if you want to run Pig on a Hadoop cluster, there is nothing extra to install
on the cluster: Pig launches jobs and interacts with HDFS (or other Hadoop
filesystems) from your workstation.
• Installation is straightforward. Java 6 is a prerequisite (and on Windows, you will
need Cygwin).
• Download a stable release from http://pig.apache.org/releases.html.
• Cygwin gives users a Linux-like environment on Windows.
• Execution Types: Pig has two execution types or modes:
• 1. local mode and
• 2. MapReduce mode.
• Local mode: In local mode, Pig runs in a single JVM and accesses the local
filesystem.
• This mode is suitable only for small datasets and when trying out Pig.
• MapReduce mode
• In MapReduce mode:
• Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster.
• The cluster may be a pseudo- or fully distributed cluster.
• MapReduce mode (with a fully distributed cluster) is what you use when you want to
run Pig on large datasets.
• To use MapReduce mode, you first need to check that the version of Pig you
downloaded is compatible with the version of Hadoop you are using.
Running Pig Programs
• There are three ways of executing Pig programs, all of which work in both local and MapReduce mode:
• Script
• Grunt
• Embedded
• 1. Script: Pig can run a script file that contains Pig commands. For example, pig script.pig
runs the commands in the local file script.pig.
• 2.Grunt
• Grunt is an interactive shell for running Pig commands. Grunt is started when no file is specified for Pig to run, and the -e option is
not used. It is also possible to run Pig scripts from within Grunt using run and exec.
• 3. Embedded
• You can run Pig programs from Java using the PigServer class, much like you can use JDBC to run SQL programs from Java.
• For programmatic access to Grunt, use PigRunner.
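A minimal sketch of the embedded approach using the PigServer class mentioned above. The relation names, file paths, and the simple LOAD/FILTER pipeline are illustrative assumptions only.

java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
  public static void main(String[] args) throws Exception {
    // Run Pig in local mode; ExecType.MAPREDUCE would target a Hadoop cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);
    // Register Pig Latin statements; they are executed lazily.
    pig.registerQuery("records = LOAD 'input.txt' AS (id:int, value:int);");
    pig.registerQuery("filtered = FILTER records BY value > 0;");
    // store() triggers execution of the pipeline and writes the result.
    pig.store("filtered", "output");
  }
}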
Apache Pig Execution Modes
• You can run Apache Pig in two modes:
• Local mode and
• MapReduce (HDFS) mode.
• Local Mode
• In this mode, all the files are installed and run from your local
host and local file system.
• There is no need for Hadoop or HDFS. This mode is generally
used for testing purposes.
MapReduce Mode
• MapReduce mode is where we load or process the data
that exists in the Hadoop File System (HDFS) using Apache
Pig.
• In this mode, whenever we execute the Pig Latin
statements to process the data, a MapReduce job is
invoked in the back-end to perform a particular operation
on the data that exists in the HDFS.
Pig commands
• Apache Pig provides a set of commands in its scripting language, for example:
• 1. LOAD: loads data into a relation for a Pig script
• 2. FILTER: e.g., FILTER A BY x > 0
• 3. GROUP: e.g., GROUP B BY x;
• 4. FOREACH: e.g., FOREACH A GENERATE x;
• 5. JOIN
• 6. STORE
• 🡪 These commands are designed to make it easier for users to express complex data
transformations.
About Pig Latin
• We can run Pig Latin commands in three ways (script, Grunt, and embedded).
• Grunt is an interactive shell.
• Pig Latin is used for data processing.
• Pig Latin scripts help in the analysis of large data sets.
• It provides an abstraction over the MapReduce programming model, making
it easier for users to express complex data transformations.
Comparison with Databases
• Pig Latin is similar to SQL.
• The presence of such operators as GROUP BY and DESCRIBE reinforces this impression.
• However, there are several differences between the two languages, and between Pig and RDBMSs in general.
• The most significant difference is that Pig Latin is a data flow programming language, whereas SQL is a declarative programming
language.
• In other words, a Pig Latin program is a step-by-step set of operations on an input relation, in which each step is a single transformation.
• In many ways, programming in Pig Latin is like working at the level of an RDBMS query planner, which figures out how to turn a
declarative statement into a system of steps.
• RDBMSs store data in tables, with tightly predefined schemas.
• Pig is more relaxed about the data that it processes: you can define a schema at runtime, but it’s optional.
• Pig provides built-in load functions for common formats such as delimited text files. Unlike with a traditional database, there is no data import process to load the data
into an RDBMS.
• The data is loaded from the filesystem (usually HDFS) as the first step in the processing.
• Pig’s support for complex, nested data structures differentiates it from SQL, which operates on
flatter data structures.
• Also, Pig’s ability to use UDFs and streaming operators that are tightly integrated with the language
and Pig’s nested data structures makes Pig Latin more customizable than most SQL dialects.
• RDBMSs have several features to support online, low-latency queries, such as transactions
and indexes, that are absent in Pig.
• As mentioned earlier, Pig does not support random reads or queries on the order of tens of
milliseconds.
• Nor does it support random writes to update small portions of data; all writes are bulk, streaming
writes, just like MapReduce.
• Hive sits between Pig and conventional RDBMSs.
• Like Pig, Hive is designed to use HDFS for storage, but otherwise there are
some significant differences.
• Its query language, HiveQL, is based on SQL, and anyone who is familiar with SQL would have
little trouble writing queries in HiveQL.
LITTLE INFO ABOUT Apache Hive
Apache Hive is a data warehouse and ETL tool which provides an
SQL-like interface between the user and the Hadoop Distributed File
System (HDFS).
It is built on top of Hadoop.
It is a software project that provides data query and analysis.
It facilitates reading, writing and handling large datasets that are stored
in distributed storage and queried using Structured Query Language
(SQL) syntax.
Used for
It is frequently used for data warehousing tasks like
❑Data Encapsulation,
❑Ad-hoc Queries, and
❑ Analysis of huge datasets.
Hive developed by
• Hive was initially developed by Facebook.
• It is also used by companies such as Amazon and Netflix.
• It delivers standard SQL functionality for analytics.
• Apache Hive is a data warehouse software project that is built on top of the
Hadoop ecosystem.
• It provides an SQL-like interface to query and analyze large datasets stored
in Hadoop’s distributed file system (HDFS)
• Hive uses a language called HiveQL, which is similar to SQL
to allow users to express data queries, transformations, and analyses in a
familiar syntax.
🡪 HiveQL statements are compiled into MapReduce jobs, which are then
executed on the Hadoop cluster to process the data.
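One common way to submit HiveQL from a program is through Hive's JDBC interface. This is only a sketch: the HiveServer2 URL, credentials, and the events table are assumptions for illustration.

java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // The Hive JDBC driver (hive-jdbc) must be on the classpath.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Connect to a HiveServer2 instance; URL and credentials are placeholders.
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "hiveuser", "");
    Statement stmt = con.createStatement();
    // Hive compiles this HiveQL query into one or more MapReduce jobs.
    ResultSet rs = stmt.executeQuery(
        "SELECT category, COUNT(*) FROM events GROUP BY category");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    rs.close();
    stmt.close();
    con.close();
  }
}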
Hive used for
• Hive can be used for a variety of data processing tasks, such as
data warehousing
• ETL (extract, transform, load)
• Ad-hoc data analysis.
• It is widely used in the big data industry, especially in
companies that have adopted the Hadoop ecosystem as their
primary data processing platform.
Components of Hive:
1. WebHCat –
It provides a service which can be used by the user to run Hadoop
MapReduce (or YARN), Pig, and Hive tasks.
2. HCatalog –
It is a Hive component and is a table and storage management
layer for Hadoop.
It enables users with various data processing tools, such as Pig and
MapReduce, to read and write data on the grid easily.
Modes of Hive:
1. Local Mode –
It is used when Hadoop is built in pseudo mode with only one data node,
when the data size is small enough to be restricted to a single local machine, and
when processing is faster on smaller datasets existing on the local
machine.
2. MapReduce Mode –
It is used when Hadoop is built with multiple data nodes and data is divided
across various nodes; it works on huge datasets and queries are executed
in parallel.
Features of Hive:
1. It provides indexes, including bitmap indexes, to accelerate queries
(index types including compaction and bitmap indexes as of version 0.10).
2. Metadata storage in an RDBMS reduces the time needed to perform semantic
checks during query execution.
3. Built-in user-defined functions (UDFs) for manipulating strings, dates,
and other data. Hive supports extending the UDF set to handle use cases
not covered by the built-in functions (a Java sketch follows this list).
4. It stores schemas in a database and processes the data in the
Hadoop Distributed File System (HDFS).
5. It is built for Online Analytical Processing (OLAP).
6. It delivers various types of querying languages, frequently known
as Hive Query Language (HQL or HiveQL).
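As a sketch of extending Hive with a custom UDF, as mentioned in the features above. The class name and behaviour (trimming and lower-casing a string) are illustrative assumptions; this uses the classic UDF base class.

java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A trivial Hive UDF that trims and lower-cases a string column.
public class NormalizeString extends UDF {
  public Text evaluate(Text input) {
    if (input == null) {
      return null;   // pass nulls through
    }
    return new Text(input.toString().trim().toLowerCase());
  }
}

The compiled class is typically added to Hive with ADD JAR and registered with CREATE TEMPORARY FUNCTION before it can be used in a HiveQL query.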
THANK YOU