The MapReduce Programming Paradigm
• Hadoop MapReduce is a software framework for distributed processing of
large data sets on computing clusters.
• It is a sub-project of the Apache Hadoop project.
• Apache Hadoop is an open-source framework that allows you to store and
process big data in a distributed environment across clusters of computers
using simple programming models.
• MapReduce splits the input data set into a number of parts and runs a
program on all the parts in parallel.
• The term MapReduce refers to two separate and distinct tasks.
• The first is the map operation, which takes a set of data and converts it into another
set of data, where individual elements are broken down into tuples (key/value
pairs).
• The reduce operation combines those data tuples based on the key and
accordingly modifies the value of the key.
• MapReduce is a programming paradigm that was designed to allow parallel
distributed processing of large sets of data, converting them to sets of
tuples, and then combining and reducing those tuples into smaller sets of
tuples.
• With respect to MapReduce, tuples refer to key-value pairs by which data is
grouped, sorted, and processed.
• In the map task, you break your data into key-value pairs, transform it, and
filter it. The data is then assigned to nodes for processing.
• In the reduce task, you aggregate that data down to smaller-sized datasets.
Data from the reduce step is transformed into a standard key-value format,
where the key acts as the record identifier and the value is the value being
identified by the key. A short word-count sketch follows.
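As an illustration of the map and reduce operations, here is a minimal word-count sketch using the Hadoop Java MapReduce API. It is only a sketch: the class names and the assumption that each input record is one line of text are illustrative, not part of the original slides.

java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: break each input line into (word, 1) key/value pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);              // emit (word, 1)
      }
    }
  }
}

// Reduce: combine all values that share a key into a single count.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));  // emit (word, total count)
  }
}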
Hadoop JobTracker and TaskTracker Design
• In a Hadoop system there are five services always running in the background (called Hadoop daemon services).
• Daemon Services of Hadoop
• 1. NameNode
• 2. Secondary NameNode
• 3. JobTracker
• 4. DataNode
• 5. TaskTracker
• Job Tracker is the master daemon for both Job resource management and scheduling / monitoring of
jobs. It acts as a liaison between Hadoop and your application.
• MapReduce Workflow: Hadoop divides the job into tasks. There are two types of
tasks: 1. Map tasks (splits and mapping)
• 2. Reduce tasks (shuffling and reducing)
• The complete execution process (execution of both Map and Reduce tasks) is
controlled by two types of entities:
• 1. JobTracker: acts as the master (responsible for the complete execution of the
submitted job)
• 2. Multiple TaskTrackers: act as slaves, each of them performing part of the job
• 1. The user copies all input files to the distributed file system using NameNode metadata.
• 2. Jobs are submitted to the job client; they apply to the input files stored in the DataNodes.
• 3. The client gets information about the input files to be processed from the NameNode.
• 4. The client creates splits of all the files for the job.
• 5. After splitting the files, the client stores metadata about this job in the DFS.
• 6. The client then submits this job to the JobTracker.
7. The JobTracker now comes into the picture and initializes the job in the job queue.
8. The JobTracker reads the job files from the DFS submitted by the client.
9. The JobTracker then creates map and reduce tasks for the job, and the input splits are
assigned to the mappers.
There are as many mappers as there are input splits. Every map task works on an
individual split and creates its output.
10. Now the TaskTrackers come into the picture: tasks are submitted to every TaskTracker
by the JobTracker, which receives a heartbeat from each TaskTracker to confirm that it is
working properly. This heartbeat is sent to the JobTracker by every TaskTracker every
3 seconds.
If any TaskTracker does not send a heartbeat to the JobTracker within 3 seconds, the
JobTracker waits for 30 more seconds, after which it considers that TaskTracker to be in a
dead state and updates its metadata about it.
11. Tasks are picked from the splits.
How status updates are propagated through the
MapReduce 1 system
• Job Completion
• When the jobtracker receives a notification that the last task for a job is
complete (this will be the special job cleanup task), it changes the status for
the job to “successful.”
• Then, when the Job polls for status, it learns that the job has completed
successfully, so it prints a message to tell the user and then returns from the
waitForCompletion() method.
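A minimal driver sketch showing the waitForCompletion() call referred to above. It reuses the illustrative WordCountMapper and WordCountReducer classes from the earlier sketch; the input and output paths come from command-line arguments.

java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // waitForCompletion() submits the job and polls its status until it
    // finishes, printing progress; it returns true on success.
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}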
YARN MapReduce
• YARN meets the scalability shortcomings of “classic” MapReduce by splitting the responsibilities
of the jobtracker into separate entities.
• The jobtracker takes care of both job scheduling (matching tasks with tasktrackers) and task
progress monitoring.
• YARN separates these two roles into two independent daemons:
• 1. Resource manager:
-> To manage the use of resources across the cluster.
• 2. Application master:
-> To manage the lifecycle of applications running on the cluster
THREE COMPONENTS OF HADOOP
• HDFS
• MAP REDUCE
• YARN
• YARN (Yet Another Resource Negotiator) is used to manage the
Hadoop cluster.
MAIN COMPONENTS OF YARN
ARCHITECTURE
• Client: the user submits the job to the job client.
• Resource Manager: the client, after copying the resources to HDFS, submits the job to the RM.
• Node Manager: the RM contacts the NM to allocate a container and launch the AM.
• Application Master: decides how to run the map and reduce phases.
• Container
• YarnChild is a Java program for executing tasks; it runs in a separate JVM to
isolate user code from long-running system daemons.
• YARN does not support JVM reuse.
APACHE PIG
• Pig is a high-level platform or tool which is used to process large datasets.
• It provides a high level of abstraction for processing on top of MapReduce.
• It provides a high-level scripting language, known as Pig Latin, which is
used to develop data analysis code.
• A high-level scripting language makes code easier to write and understand;
users can express operations more naturally.
• It provides a more user-friendly syntax for programming.
Examples of scripting languages
High-level scripting languages are designed to be easy to read and write, with a
focus on rapid development. Here are some examples:
1. Python:
• Known for its readability and simplicity.
• Widely used in web development, data science, machine learning, and automation.
python
print("Hello, World!")
2. JavaScript:
• Mainly used for front-end web development.
• Also used on the server side with Node.js.
javascript
console.log("Hello, World!");
3. Bash:
• Designed for shell scripting and automation in Unix-like operating
systems.
bash
echo "Hello, World!"
4. Ruby:
• Emphasizes simplicity and productivity.
• Commonly used for web development.
ruby
puts "Hello, World!"
About Apache Pig
• First, to process the data stored in HDFS, programmers write scripts
using the Pig Latin language.
• Internally, the Pig Engine (a component of Apache Pig) converts all these scripts
into specific map and reduce tasks.
• These tasks are not visible to the programmers, in order to provide a high
level of abstraction.
• Pig Latin and the Pig Engine are the two main components of the Apache Pig
tool.
• The result of Pig is always stored in HDFS.
Features of Apache Pig:
• For performing several operations, Apache Pig provides rich sets of operators such as
filtering, joining, sorting, and aggregation.
• Easy to learn, read and write. Especially for SQL programmers, Apache Pig is a boon.
• Apache Pig is extensible, so you can write your own processing logic as user-defined
functions (UDFs) in Python, Java, or other programming languages (a Java sketch follows this list).
• Join operations are easy in Apache Pig.
• Fewer lines of code.
• Apache Pig allows splits in the pipeline.
• The data structure is multivalued, nested, and richer.
• Pig can handle the analysis of both structured and unstructured data.
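A minimal sketch of a Pig UDF in Java, as mentioned in the extensibility feature above. The class name and behaviour (upper-casing the first string argument) are illustrative assumptions, not something defined in the slides.

java
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A trivial Pig UDF that upper-cases its first string argument.
public class UpperCase extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;   // pass nulls through
    }
    return input.get(0).toString().toUpperCase();
  }
}

Once the compiled class is packaged in a jar, it is typically registered in a Pig Latin script with REGISTER and then called like any built-in function.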
Installing and Running Pig
• Runs as a client-side application.
• Even if you want to run Pig on a Hadoop cluster, there is nothing extra to install
on the cluster: Pig launches jobs and interacts with HDFS (or other Hadoop
filesystems) from your workstation.
• Installation is straightforward. Java 6 is a prerequisite (and on Windows, you will
need Cygwin).
• Download a stable release from http://pig.apache.org/releases.html.
• Cygwin gives users a Linux-like environment on Windows.
• Execution Types: Pig has two execution types or modes:
• 1. local mode and
• 2. MapReduce mode.
• Local mode: In local mode, Pig runs in a single JVM and accesses the local
filesystem.
• This mode is suitable only for small datasets and when trying out Pig.
• MapReduce mode
• In MapReduce mode:
• Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster.
• The cluster may be a pseudo- or fully distributed cluster.
• MapReduce mode (with a fully distributed cluster) is what you use when you want to
run Pig on large datasets.
• To use MapReduce mode, you first need to check that the version of Pig you
downloaded is compatible with the version of Hadoop you are using.
Running Pig Programs
• There are three ways of executing Pig programs, all of which work in both local and MapReduce mode:
• Script
• Grunt
• Embedded
• 1. Script: Pig can run a script file that contains Pig commands. For example, pig script.pig
runs the commands in the local file script.pig.
• 2.Grunt
• Grunt is an interactive shell for running Pig commands. Grunt is started when no file is specified for Pig to run, and the -e option is
not used. It is also possible to run Pig scripts from within Grunt using run and exec.
• 3. Embedded
• You can run Pig programs from Java using the PigServer class, much like you can use JDBC to run SQL programs from Java.
• For programmatic access to Grunt, use PigRunner.
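A minimal sketch of the embedded approach using the PigServer class mentioned above. The relation names, file paths, and the simple LOAD/FILTER pipeline are illustrative assumptions only.

java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
  public static void main(String[] args) throws Exception {
    // Run Pig in local mode; ExecType.MAPREDUCE would target a Hadoop cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);
    // Register Pig Latin statements; they are executed lazily.
    pig.registerQuery("records = LOAD 'input.txt' AS (id:int, value:int);");
    pig.registerQuery("filtered = FILTER records BY value > 0;");
    // store() triggers execution of the pipeline and writes the result.
    pig.store("filtered", "output");
  }
}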
Apache Pig Execution Modes
• You can run Apache Pig in two modes:
• Local mode and
• MapReduce (HDFS) mode.
• Local Mode
• In this mode, all the files are installed and run from your local
host and local file system.
• There is no need for Hadoop or HDFS. This mode is generally
used for testing purposes.
MapReduce Mode
• MapReduce mode is where we load or process the data
that exists in the Hadoop File System (HDFS) using Apache
Pig.
• In this mode, whenever we execute the Pig Latin
statements to process the data, a MapReduce job is
invoked in the back-end to perform a particular operation
on the data that exists in the HDFS.
Pig commands
• Apache Pig provides a set of commands in its scripting language, for example:
• 1. LOAD: loads data into a relation for a Pig script
• 2. FILTER: e.g., FILTER A BY x > 0
• 3. GROUP: e.g., GROUP B BY x;
• 4. FOREACH: e.g., FOREACH A GENERATE x;
• 5. JOIN
• 6. STORE
• 🡪 These commands are designed to make it easier for users to express complex data
transformations.
About Pig Latin
• We can run Pig Latin commands in three ways (script, Grunt, and embedded).
• Grunt is an interactive shell.
• Pig Latin is used for data processing.
• Pig Latin scripts help in the analysis of large data sets.
• It provides an abstraction over the MapReduce programming model, making
it easier for users to express complex data transformations.
Comparison with Databases
• Pig Latin is similar to SQL.
• The presence of such operators as GROUP BY and DESCRIBE reinforces this impression.
• However, there are several differences between the two languages, and between Pig and RDBMSs in general.
• The most significant difference is that Pig Latin is a data flow programming language, whereas SQL is a declarative programming
language.
• In other words, a Pig Latin program is a step-by-step set of operations on an input relation, in which each step is a single transformation.
• In many ways, programming in Pig Latin is like working at the level of an RDBMS query planner, which figures out how to turn a
declarative statement into a system of steps.
• RDBMSs store data in tables, with tightly predefined schemas.
• Pig is more relaxed about the data that it processes: you can define a schema at runtime, but it’s optional.
• Pig provides built-in load functions for common formats such as delimited text files. Unlike with a traditional database, there is no data import process to load the data
into an RDBMS.
• The data is loaded from the filesystem (usually HDFS) as the first step in the processing.
• Pig’s support for complex, nested data structures differentiates it from SQL, which operates on
flatter data structures.
• Also, Pig’s ability to use UDFs and streaming operators that are tightly integrated with the language
and Pig’s nested data structures makes Pig Latin more customizable than most SQL dialects.
• RDBMSs have several features to support online, low-latency queries, such as transactions
and indexes, that are absent in Pig.
• As mentioned earlier, Pig does not support random reads or queries on the order of tens of
milliseconds.
• Nor does it support random writes to update small portions of data; all writes are bulk, streaming
writes, just like MapReduce.
• Hive sits between Pig and conventional RDBMSs.
• Like Pig, Hive is designed to use HDFS for storage, but otherwise there are
some significant differences.
• Its query language, HiveQL, is based on SQL, and anyone who is familiar with SQL would have
little trouble writing queries in HiveQL.
LITTLE INFO ABOUT Apache Hive
Apache Hive is a data warehouse and ETL tool which provides an
SQL-like interface between the user and the Hadoop Distributed File
System (HDFS).
It is built on top of Hadoop.
It is a software project that provides data query and analysis.
It facilitates reading, writing and handling large datasets that are stored
in distributed storage and queried using Structured Query Language
(SQL) syntax.
Used for
It is frequently used for data warehousing tasks like
❑Data Encapsulation,
❑Ad-hoc Queries, and
❑ Analysis of huge datasets.
Hive developed by
• Hive was initially developed by Facebook.
• It is also used by companies such as Amazon and Netflix.
• It delivers standard SQL functionality for analytics.
• Apache Hive is a data warehouse software project that is built on top of the
Hadoop ecosystem.
• It provides an SQL-like interface to query and analyze large datasets stored
in Hadoop’s distributed file system (HDFS)
• Hive uses a language called HiveQL, which is similar to SQL
to allow users to express data queries, transformations, and analyses in a
familiar syntax.
🡪 HiveQL statements are compiled into MapReduce jobs, which are then
executed on the Hadoop cluster to process the data.
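One common way to submit HiveQL from a program is through Hive's JDBC interface. This is only a sketch: the HiveServer2 URL, credentials, and the events table are assumptions for illustration.

java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // The Hive JDBC driver (hive-jdbc) must be on the classpath.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Connect to a HiveServer2 instance; URL and credentials are placeholders.
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "hiveuser", "");
    Statement stmt = con.createStatement();
    // Hive compiles this HiveQL query into one or more MapReduce jobs.
    ResultSet rs = stmt.executeQuery(
        "SELECT category, COUNT(*) FROM events GROUP BY category");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    rs.close();
    stmt.close();
    con.close();
  }
}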
Hive used for
• Hive can be used for a variety of data processing tasks, such as
data warehousing
• ETL (extract, transform, load)
• Ad-hoc data analysis.
• It is widely used in the big data industry, especially in
companies that have adopted the Hadoop ecosystem as their
primary data processing platform.
Components of Hive:
1. WebHCat –
It provides a service which can be used by the user to run Hadoop
MapReduce (or YARN), Pig, and Hive tasks.
2. HCatalog –
It is a Hive component and is a table and storage management
layer for Hadoop.
It enables users with various data processing tools, such as Pig and
MapReduce, to read and write data on the grid easily.
Modes of Hive:
1. Local Mode –
It is used when Hadoop is built in pseudo mode with only one data node,
when the data size is small enough to be restricted to a single local machine, and
when processing is faster on smaller datasets existing on the local
machine.
2. MapReduce Mode –
It is used when Hadoop is built with multiple data nodes and data is divided
across various nodes; it works on huge datasets and queries are executed
in parallel.
Features of Hive:
1. It provides indexes, including bitmap indexes, to accelerate queries
(index types including compaction and bitmap indexes as of version 0.10).
2. Metadata storage in an RDBMS reduces the time needed to perform semantic
checks during query execution.
3. Built-in user-defined functions (UDFs) for manipulating strings, dates,
and other data. Hive supports extending the UDF set to handle use cases
not covered by the built-in functions (a Java sketch follows this list).
4. It stores schemas in a database and processes the data in the
Hadoop Distributed File System (HDFS).
5. It is built for Online Analytical Processing (OLAP).
6. It delivers various types of querying languages, frequently known
as Hive Query Language (HQL or HiveQL).
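As a sketch of extending Hive with a custom UDF, as mentioned in the features above. The class name and behaviour (trimming and lower-casing a string) are illustrative assumptions; this uses the classic UDF base class.

java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A trivial Hive UDF that trims and lower-cases a string column.
public class NormalizeString extends UDF {
  public Text evaluate(Text input) {
    if (input == null) {
      return null;   // pass nulls through
    }
    return new Text(input.toString().trim().toLowerCase());
  }
}

The compiled class is typically added to Hive with ADD JAR and registered with CREATE TEMPORARY FUNCTION before it can be used in a HiveQL query.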
THANK YOU