Unit II
Hadoop Ecosystem
PIG, Zookeeper, how it helps in monitoring a cluster, how HBase uses Zookeeper, and how to build
applications with Zookeeper.
SPARK: Introduction to Data Analysis with Spark, Downloading Spark and Getting Started, Programming
with RDDs, Machine Learning with MLlib.
HBase
• Limitations of Hadoop: Hadoop can perform only batch processing, and data will be accessed only in
a sequential manner. That means one has to search the entire dataset even for the simplest of jobs.
• A huge dataset when processed results in another huge data set, which should also be processed
sequentially. At this point, a new solution is needed to access any point of data in a single unit of time
(random access).
• HBase is part of the Hadoop ecosystem that offers random, real-time read/write access to data in
the Hadoop File System. It is an open-source, distributed Hadoop database that has its genesis in
Google’s Bigtable.
• It is written in Java.
• It is now an integral part of the Apache Software Foundation and the Hadoop ecosystem.
• It is a high-availability database that runs exclusively on top of HDFS.
• It is a column-oriented database built on top of HDFS.
• Why should you use HBase Technology?
• Along with HDFS and MapReduce, HBase is one of the core components of the Hadoop ecosystem. Here are some salient
features of HBase which make it significant to use:
• Apache HBase has a completely distributed architecture.
• It can easily work on extremely large scale data.
• HBase offers high security and easy management, along with high write throughput.
• It can be used for both structured and semi-structured data.
• Moreover, MapReduce jobs can be backed by HBase tables.
HBase and HDFS
HDFS:
• HDFS is a distributed file system suitable for storing large files.
• It does not support fast individual record lookups.
• It provides high-latency batch processing; there is no concept of low-latency random access.
• It provides only sequential access to data.
HBase:
• HBase is a database built on top of HDFS.
• It provides fast lookups for larger tables.
• It provides low-latency access to single rows from billions of records (random access).
• HBase internally uses hash tables and provides random access, and it stores the data in indexed HDFS
files for faster lookups.
Storage Mechanism in HBase
• HBase is a column-oriented database and the tables in it are sorted by row.
• Table is a collection of rows.
• Row is a collection of column families.
• Column family is a collection of columns.
• Column is a collection of key value pairs.
Example layout: each row is identified by a Rowid; the table has four column families, and each column
family contains the columns col1, col2, and col3. Rows 1, 2, and 3 hold their cells under these column
families.
Apache HBase Architecture
• We know HBase acts like a big table to record the data, and tables are split into regions. Regions are in
turn divided vertically by column family to create stores, and these stores are saved as files in HDFS.
• HBase has three major components: the master server, the client library, and the region servers. Region
servers can be added or removed as per the requirements of the organization.
• MasterServer: It allocates regions to the region servers with the help of Zookeeper and balances the load
across the region servers. The MasterServer is responsible for schema changes and other metadata
operations, such as creating tables and column families.
• Regions: Regions are nothing but tables which are split into small tables and spread across the region
servers.
• RegionServer: Region servers communicate with other components and complete the below tasks:
• It communicates with the client to handle data related tasks.
• It takes care of the read and write tasks of the regions under it.
• It decides the size of a region based on the threshold it has.
• The memory (MemStore) acts as a temporary space to store the data. When anything is written to
HBase, it is initially stored in memory and later flushed to HFiles, where data is stored in blocks.
• Zookeeper: Zookeeper is an open-source project that provides services such as managing
configuration data, distributed synchronisation, naming, etc. It helps the master server discover the
available servers, and it helps clients communicate with region servers.
Secondary Indexing
• Secondary indexes are an orthogonal way to access data from its primary access path. In HBase, you
have a single index that is lexicographically sorted on the primary row key. Access to records in any way
other than through the primary row requires scanning over potentially all the rows in the table to test
them against your filter. With secondary indexing, the columns or expressions you index form an
alternate row key to allow point lookups and range scans along this new axis.
• Covered Indexes: With a covered index, we do not need to go back to the primary table once we have
found the index entry. Instead, we bundle the data we care about right in the index rows, saving read-time
overhead. For example, the following would create an index on the v1 and v2 columns and include the
v3 column in the index as well, to avoid having to fetch it from the data table:
• CREATE INDEX my_index ON my_table (v1,v2) INCLUDE(v3)
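• For instance, a query that touches only the indexed and included columns can be answered from the
index alone; a minimal sketch, reusing the hypothetical my_table above:
• -- v1 and v2 (the indexed columns) and v3 (the included column) are all stored in the index,
• -- so the data table is never consulted for this query
• SELECT v3 FROM my_table WHERE v1 = 'x' AND v2 = 'y'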
• Functional Indexes: Functional indexes allow you to create an index not just on columns, but on an
arbitrary expression. Then, when a query uses that expression, the index may be used to retrieve
the results instead of the data table. For example, you could create an index on
UPPER(FIRST_NAME||' '||LAST_NAME) to allow case-insensitive searches on the combined
first name and last name of a person.
• For example, the following would create this functional index:
• CREATE INDEX UPPER_NAME_IDX ON EMP (UPPER(FIRST_NAME||' '||LAST_NAME))
• With this index in place, when the following query is issued, the index would be used instead of the
data table to retrieve the results:
• SELECT EMP_ID FROM EMP WHERE UPPER(FIRST_NAME||' '||LAST_NAME)='JOHN DOE'
Applications of HBase
• It is used for write-heavy applications.
• HBase is used whenever we need to provide fast random access to available data.
• Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
Apache Hive
• Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a
massive scale. A data warehouse provides a central store of information that can easily be analyzed
to make informed, data driven decisions. Hive allows users to read, write, and manage petabytes of
data using SQL.
• Hive is built on top of Apache Hadoop, which is an open-source framework used to efficiently store
and process large datasets. As a result, Hive is closely integrated with Hadoop, and is designed to
work quickly on petabytes of data. What makes Hive unique is the ability to query large datasets,
leveraging Apache Tez or MapReduce, with a SQL-like interface.
• Motivation:
• Yahoo worked on Pig to facilitate application deployment on Hadoop; their need was mainly
focused on unstructured data. Simultaneously, Facebook started working on deploying
warehouse solutions on Hadoop, which resulted in Hive. The size of data being collected and
analyzed in industry for business intelligence (BI) is growing rapidly, making traditional
warehousing solutions prohibitively expensive.
How does Hive work?
• Hive was created to allow non-programmers familiar with SQL to work with petabytes of data,
using a SQL-like interface called HiveQL. Hive uses batch processing so that it works quickly across
a very large distributed database.
• Hive transforms HiveQL queries into MapReduce or Tez jobs that run on Apache Hadoop’s
distributed job scheduling framework, Yet Another Resource Negotiator (YARN). It queries data
stored in a distributed storage solution, like the Hadoop Distributed File System (HDFS) or Amazon
S3.
• Hive stores its database and table metadata in a metastore, which is a database or file backed
store that enables easy data abstraction and discovery.
• Hive includes HCatalog, which is a table and storage management layer that reads data from the
Hive metastore to facilitate seamless integration between Hive, Apache Pig, and MapReduce.
• By using the metastore, HCatalog allows Pig and MapReduce to use the same data structures as
Hive, so that the metadata doesn’t have to be redefined for each engine.
Benefits of Hive
Fast: Hive is designed to quickly handle petabytes of data using batch processing.
Familiar: Hive provides a familiar, SQL-like interface that is accessible to non-programmers.
Scalable: Hive is easy to distribute and scale based on your needs.
It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce or Spark jobs.
It is capable of analyzing large datasets stored in HDFS.
It allows different storage types such as plain text, RCFile, and HBase.
It uses indexing to accelerate queries.
It can operate on compressed data stored in the Hadoop ecosystem.
It supports user-defined functions (UDFs), through which users can plug in their own functionality.
When to use Hive
● Most suitable for data warehouse applications where relatively static data is analyzed.
● Fast response time is not required.
● Data is not changing rapidly.
● An abstraction to underlying MR program.
● Hive of course is a good choice for queries that lend themselves to being expressed in SQL,
particularly long-running queries where fault tolerance is desirable.
● Hive can be a good choice if you’d like to write feature-rich, fault-tolerant, batch (i.e., not near-real-
time) transformation or ETL jobs in a pluggable SQL engine.
Hive Architecture
Hive Client: Hive allows writing applications in
various languages, including Java, Python, and C++.
It supports different types of clients, such as:
● Thrift Server - It is a cross-language service
provider platform that serves requests
from all the programming languages that
support Thrift.
● JDBC Driver - It is used to establish a
connection between hive and Java
applications. The JDBC Driver is present in
the class
org.apache.hadoop.hive.jdbc.HiveDriver.
● ODBC Driver - It allows the applications that
support the ODBC protocol to connect to
Hive.
Hive Services: The following are the services provided by Hive
● Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and commands.
● Hive Web User Interface - The Hive Web UI is just an alternative of Hive CLI. It provides a web-based GUI for
executing Hive queries and commands.
● Hive MetaStore - It is a central repository that stores all the structural information of the various tables and partitions in
the warehouse. It also includes metadata of columns and their type information, the serializers and deserializers
used to read and write data, and the corresponding HDFS files where the data is stored.
● Hive Server - It is also referred to as the Apache Thrift Server. It accepts requests from different clients and forwards
them to the Hive Driver.
● Hive Driver - It receives queries from different sources like the web UI, CLI, Thrift, and JDBC/ODBC drivers. It transfers
the queries to the compiler.
● Hive Compiler - The purpose of the compiler is to parse the query and perform semantic analysis on the different
query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
● Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of MapReduce and HDFS
tasks. The execution engine then executes the incoming tasks in the order of their dependencies.
Partitioning in Hive
The partitioning in Hive means dividing the table into some parts based on the values of a particular
column like date, course, city or country. The advantage of partitioning is that since the data is stored in
slices, the query response time becomes faster.
As Hadoop is used to handle huge amounts of data, it is important to use an efficient approach to access
it, and partitioning in Hive is a good example of this.
Let's assume we have data on 10 million students studying in an institute, and we have to fetch the
students of a particular course. With a traditional approach, we would have to scan the entire dataset,
which degrades performance. In such a case, we can adopt the better approach, i.e., partitioning in Hive,
and divide the data into different slices based on particular columns.
The partitioning in Hive can be executed in two ways -
● Static partitioning
● Dynamic partitioning
Static Partitioning: In static or manual partitioning, it is required to pass the values of partitioned
columns manually while loading the data into the table. Hence, the data file doesn't contain the partitioned
columns.
create table student (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';
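A sketch of loading one slice into this table: in static partitioning, the partition value is supplied explicitly
in the LOAD statement. The file path and the course value ('BigData') below are assumptions used only
for illustration.
-- every row in this file is assumed to belong to the 'BigData' course
load data local inpath '/home/hive/student_bigdata.csv'
into table student
partition (course = 'BigData');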
Dynamic Partitioning: In dynamic partitioning, the values of partitioned columns exist within the
table. So, it is not required to pass the values of partitioned columns manually.
Enable the dynamic partition by using the following commands: -
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
Create a partition table by using the following command: -
hive> create table student_part (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';
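With the properties above enabled, partitions are created from the data itself at insert time. A minimal
sketch, assuming a non-partitioned staging table named student (as in the static example) that also
contains a course column:
-- the last column in the SELECT list feeds the dynamic partition column (course)
insert overwrite table student_part partition (course)
select id, name, age, institute, course from student;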
Comparison with Traditional Database
Below are the key features of Hive that differ from RDBMS.
● Hive resembles a traditional database by supporting an SQL interface, but it is not a full database.
Hive is better described as a data warehouse than as a database.
● Hive enforces the schema at read time, whereas an RDBMS enforces the schema at write time.
● In an RDBMS, a table’s schema is enforced at data load time; if the data being loaded doesn’t conform
to the schema, it is rejected. This design is called schema on write.
● Hive, by contrast, doesn’t verify the data when it is loaded, but rather when it is retrieved. This is called
schema on read.
● Schema on read makes for a very fast initial load, since the data does not have to be read,
parsed, and serialized to disk in the database’s internal format. The load operation is just a file copy
or move.
● Schema on write makes query time performance faster, since the database can index columns
and perform compression on the data but it takes longer to load data into the database.
● Hive is based on the notion of Write once, Read many times but RDBMS is designed for Read
and Write many times.
● In RDBMS, record level updates, insertions and deletes, transactions and indexes are
possible. Whereas these are not allowed in Hive because Hive was built to operate over HDFS
data using MapReduce, where full-table scans are the norm and a table update is achieved by
transforming the data into a new table.
● In an RDBMS, the maximum data size handled is typically in the tens of terabytes, whereas Hive can
handle hundreds of petabytes with ease.
● As Hadoop is a batch-oriented system, Hive doesn’t support OLTP (Online Transaction
Processing) but it is closer to OLAP (Online Analytical Processing) but not ideal since there is
significant latency between issuing a query and receiving a reply, due to the overhead of Mapreduce
jobs and due to the size of the data sets Hadoop was designed to serve.
● RDBMS is best suited for dynamic data analysis and where fast responses are expected but Hive
is suited for data warehouse applications, where relatively static data is analyzed, fast response
times are not required, and when the data is not changing rapidly.
● To overcome the limitations of Hive, HBase is being integrated with Hive to support record level
operations and OLAP.
● Hive is easily scalable at low cost, whereas an RDBMS is far less scalable and is very costly to scale
up.
HIVE Data Types
Hive data types are categorized in numeric types, string types, misc types, and complex types. A list of
Hive data types is given below.
Integer types: TINYINT, SMALLINT, INT, BIGINT
Floating-point/decimal types: FLOAT, DOUBLE, DECIMAL
Date/time types: TIMESTAMP, DATE
String types: STRING, VARCHAR, CHAR
Complex types: STRUCT, MAP, ARRAY
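As an illustration of the complex types, a short sketch of a table definition; the table name, columns, and
delimiters are made up for this example:
create table employee_details (
  id int,
  name string,
  skills array<string>,                   -- ARRAY: an ordered list of values
  phone map<string, string>,              -- MAP: key-value pairs, e.g. 'home' -> a phone number
  address struct<city:string, zip:string> -- STRUCT: a group of named fields
)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by ':';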
HiveQL
• Hive Query Language (HiveQL) is a query language in Apache Hive for processing and analyzing
structured data. It separates users from the complexity of Map Reduce programming. It reuses
common concepts from relational databases, such as tables, rows, columns, and schema, to ease
learning. Hive provides a CLI for Hive query writing using Hive Query Language (HiveQL).
• Most interactions tend to take place over a command line interface (CLI). Generally, HiveQL syntax is
similar to the SQL syntax that most data analysts are familiar with. Hive supports four file formats
which are: TEXTFILE, SEQUENCEFILE, ORC and RCFILE (Record Columnar File).
• Hive Queries: Hive provides an SQL-type query language for ETL purposes on top of the Hadoop file
system. Hive Query Language (HiveQL) provides an SQL-like environment in Hive to work with tables,
databases, and queries. Different clauses can be used with Hive to perform different types of data
manipulation and querying.
Hive queries provide the following features:
• Data modeling such as Creation of databases, tables, etc.
• ETL functionalities such as Extraction, Transformation, and Loading data into tables
• Joins to merge different data tables
• User specific custom scripts for ease of code
• Faster querying tool on top of Hadoop
Querying Data
Create a Database: Create a database named “company” by running the create command:
create database company;
Next, verify the database is created by running the show command:
show databases;
Open the “company” database by using the following command:
use company;
Create a Table in Hive
Use column names when creating a table. Create the table by running the following command:
create table employees (id int, name string, country string, department string, salary int);
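If employees.txt is a plain text file with comma-separated fields (an assumption here), the table is usually
declared with an explicit delimiter so that the LOAD in the next step is parsed as expected; a sketch:
create table employees (id int, name string, country string, department string, salary int)
row format delimited
fields terminated by ',';   -- matches the field separator used in the data file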
• Load Data From a File: You have created a table, but it is empty because data has not yet been loaded
from the file located in the /hdoop directory.
• Load data by running the load command:
load data inpath '/hdoop/employees.txt' overwrite into table employees;
• Verify if the data is loaded by running the select command:
select * from employees;
• Display Hive Data: You have several options for displaying data from the table.
• Display Columns: Display columns of a table by running the desc command:
• desc employees;
• Display Selected Data
• select name,country from employees;
Sorting and Aggregating
• ORDER AND SORT
• ORDER BY (ASC|DESC): This is similar to the RDBMS ORDER BY statement. A sorted order is maintained
across all of the output from every reducer. It performs the global sort using only one reducer, so it
takes a longer time to return the result. Usage with LIMIT is strongly recommended for ORDER BY.
• SORT BY (ASC|DESC): This indicates which columns to sort when ordering the reducer input records.
This means it completes sorting before sending data to the reducer.
• DISTRIBUTE BY – Rows with matching column values will be partitioned to the same reducer. When
used alone, it does not guarantee sorted input to the reducer. The DISTRIBUTE BY statement is similar
to GROUP BY in RDBMS in terms of deciding which reducer to distribute the mapper output to. When
using with SORT BY, DISTRIBUTE BY must be specified before the SORT BY statement.
• CLUSTER BY – This is a shorthand operator to perform DISTRIBUTE BY and SORT BY operations on the
same group of columns. And, it is sorted locally in each reducer. The CLUSTER BY statement does not
support ASC or DESC yet.
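• As an example of combining these clauses, a minimal sketch against the employees table used earlier
(department and salary columns assumed); note that DISTRIBUTE BY is written before SORT BY:
• -- rows for the same department go to the same reducer,
• -- and each reducer then sorts its own rows by salary
• SELECT name, department, salary FROM employees DISTRIBUTE BY department SORT BY salary DESC;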
Sorting and Aggregating
• Sort, order, distribute & cluster:
• The SORT BY and ORDER BY clauses are used to define the order of the output data. However,
DISTRIBUTE BY and CLUSTER BY clauses are used to distribute the data to multiple reducers based
on the key columns.
• We can use Sort by or Order by or Distribute by or Cluster by clauses in a hive SELECT query to get
the output data in the desired order.
• SORT BY:
• The SORT BY clause sorts the data per reducer. As a result, if we have N reducers, we will have N
sorted files in the output. These files can have overlapping data ranges. Also, the output data is not
globally sorted, because Hive sorts the rows before feeding them to the reducers based on the key
columns used in the SORT BY clause. The syntax of the SORT BY clause is as
below:
• SELECT Col1, Col2,……ColN FROM TableName SORT BY Col1 <ASC | DESC>, Col2 <ASC | DESC>, ….
ColN <ASC | DESC>
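• A minimal sketch against the assumed employees table; fixing the number of reducers makes the
per-reducer (rather than global) ordering easy to see, because two independently sorted files are
produced:
• set mapreduce.job.reduces=2;   -- force two reducers, hence two sorted output files
• SELECT name, salary FROM employees SORT BY salary DESC;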
• ORDER BY:
• ORDER BY clause orders the data globally. Because it ensures the global ordering of the data,
all the data need to be passed from a single reducer only. As a result, the order by clause
outputs one single file only.
• Bringing all the data to one single reducer can become a performance killer, especially if the
output dataset is significantly large. So, we should generally avoid the ORDER BY clause in Hive
queries over large datasets.
• However, if we need to enforce a global ordering of the data, and the output dataset is not
that big, we can use this hive clause to order the final dataset globally.
• The syntax of the ORDER BY clause in hive is as below:
• SELECT Col1, Col2,……ColN FROM TableName ORDER BY Col1 <ASC | DESC>, Col2 <ASC |
DESC>, …. ColN <ASC | DESC>
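• As recommended earlier, ORDER BY is usually paired with LIMIT so that the single reducer only has to
emit a small, globally sorted result; a sketch against the assumed employees table:
• -- global sort on one reducer, but only the top 10 rows are returned
• SELECT name, salary FROM employees ORDER BY salary DESC LIMIT 10;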
• DISTRIBUTE BY:
• The DISTRIBUTE BY clause is used to distribute the input rows among reducers. It ensures that all
rows with the same key column values go to the same reducer. So, if we need to partition
the data on some key column, we can use the DISTRIBUTE BY clause in Hive queries.
However, the DISTRIBUTE BY clause does not sort the data, either at the reducer level or globally.
Also, the same key values might not be placed next to each other in the output dataset.
• As a result, the DISTRIBUTE BY clause may output N number of unsorted files where N is the
number of reducers used in the query processing. But, the output files do not contain
overlapping data ranges.
• The syntax of the DISTRIBUTE BY clause in hive is as below:
• SELECT Col1, Col2,……ColN FROM TableName DISTRIBUTE BY Col1, Col2, ….. ColN
• CLUSTER BY
• CLUSTER BY clause is a combination of DISTRIBUTE BY and SORT BY clauses together. That
means the output of the CLUSTER BY clause is equivalent to the output of DISTRIBUTE BY +
SORT BY clauses. The CLUSTER BY clause distributes the data based on the key column and
then sorts the output data by putting the same key column values adjacent to each other. So,
the output of the CLUSTER BY clause is sorted at the reducer level. As a result, we can get N
number of sorted output files where N is the number of reducers used in the query
processing. Also, the CLUSTER by clause ensures that we are getting non-overlapping data
ranges into the final outputs. However, if the query is processed by only one reducer the
output will be equivalent to the output of the ORDER BY clause.
• The syntax of the CLUSTER BY clause is as below:
• SELECT Col1, Col2,……ColN FROM TableName CLUSTER BY Col1, Col2, ….. ColN
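• A sketch on the assumed employees table; the two statements below are equivalent, since CLUSTER BY
is shorthand for DISTRIBUTE BY and SORT BY on the same columns:
• SELECT name, department FROM employees CLUSTER BY department;
• -- equivalent to:
• SELECT name, department FROM employees DISTRIBUTE BY department SORT BY department;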
• Hive Aggregate Functions: Hive aggregate functions are widely used built-in functions that take a
set of values and return a single value. When used with a group, they aggregate all values in each
group and return one value per group.
• Aggregate functions in Hive can be used with or without GROUP BY; however, they are mostly used
with GROUP BY.
• Most of these functions ignore NULL values.
Functions and descriptions:
• COUNT() – Returns the count of all rows in a table, including rows containing NULL values. When a
column is specified as input, NULL values in that column are ignored for the count. Duplicates can be
excluded by using DISTINCT. Return type: BIGINT
• SUM() – Returns the sum of all values in a column. When used with a group, it returns the sum for
each group. Duplicates can be excluded by using DISTINCT. Return type: DOUBLE
• AVG() – Returns the average of all values in a column. When used with a group, it returns an
average for each group. Return type: DOUBLE
• MIN() – Returns the minimum value of the column across all rows. When used with a group, it returns
the minimum for each group. Return type: DOUBLE
• MAX() – Returns the maximum value of the column across all rows. When used with a group, it returns
the maximum for each group. Return type: DOUBLE
• VARIANCE(col) – Returns the variance of a numeric column for all rows or for each group. Return type:
DOUBLE
• STDDEV_SAMP(col) – Returns the sample standard deviation of all values in a column or for each group.
Return type: DOUBLE
• CORR(col1, col2) – Returns the Pearson coefficient of correlation of a pair of numeric columns in the
group. Return type: DOUBLE
• Examples:
• select count(*) from employee;
• select count(salary) from employee;
• select count(distinct gender, salary) from employee;
• select sum(salary) from employee;
• select sum(distinct salary) from employee;
• select avg(salary) from employee group by age;
• select age,avg(salary) from employee group by age;
• select min(salary) from employee;
• select max(salary) from employee;
• select variance(salary) from employee;
• select stddev_pop(salary) from employee;
Joins & Sub queries
• HiveQL – JOIN: The HiveQL JOIN clause is used to combine the data of two or more tables based on a
related column between them. The various types of HiveQL joins are:
• Inner Join
• Left Outer Join
• Right Outer Join
• Full Outer Join
• Inner Join: The records common to both tables are retrieved by an inner join. Here we perform the join
query using the JOIN keyword between the tables.
• Left Outer Join: The HiveQL LEFT OUTER JOIN returns all the rows from the left table even if there
are no matches in the right table. If the ON clause matches zero records in the right table, the join
still returns a record in the result, with NULL in each column from the right table.
• Right Outer Join: The HiveQL RIGHT OUTER JOIN returns all the rows from the right table even if
there are no matches in the left table. If the ON clause matches zero records in the left table, the join
still returns a record in the result, with NULL in each column from the left table. A RIGHT join always
returns all records from the right table and the matched records from the left table; if the left table has
no corresponding value for a column, NULL is returned in that place.
• Full Outer Join: It combines the records of both tables based on the JOIN condition given in the query.
It returns all the records from both tables and fills in NULL values for the columns that have no
matching values on either side.
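• A small sketch of these join types, assuming a hypothetical departments (dept_name, location) table
alongside the employees table used earlier; only the join keyword changes between the variants:
• -- inner join: only employees whose department exists in departments
• SELECT e.name, d.location FROM employees e JOIN departments d ON (e.department = d.dept_name);
• -- left outer join: every employee, with a NULL location when there is no match
• SELECT e.name, d.location FROM employees e LEFT OUTER JOIN departments d ON (e.department = d.dept_name);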
• Sub queries: A Query present within a Query is known as a sub query. The main query will depend
on the values returned by the subqueries. Subqueries can be classified into two types
• Subqueries in FROM clause
• Subqueries in WHERE clause
• When to use:
• To get a particular value combined from two column values from different tables.
• When the values of one table depend on another table.
• For comparative checking of one column’s values against another table.
• SELECT col FROM (SELECT a+b AS col FROM t1) t2;
• SELECT * FROM A WHERE A.a IN (SELECT foo FROM B);
Apache PIG
• Pig Hadoop is basically a high-level platform for the analysis of huge datasets. Pig Hadoop was
developed by Yahoo! and is generally used with Hadoop to perform a lot of data administration
operations.
• For writing data analysis programs, Pig renders a high-level programming language called Pig Latin.
Several operators are provided by Pig Latin using which personalized functions for writing, reading, and
processing of data can be developed by programmers.
• For analyzing data through Apache Pig, we need to write scripts using Pig Latin. Then, these scripts need to
be transformed into MapReduce tasks. This is achieved with the help of Pig Engine.
What is Apache Pig?
• Apache Pig is an abstraction over MapReduce. It is a tool/platform which is used to analyze larger sets
of data representing them as data flows. Pig is generally used with Hadoop; we can perform all the data
manipulation operations in Hadoop using Apache Pig.
• To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language
provides various operators using which programmers can develop their own functions for reading,
writing, and processing data.
• To analyze data using Apache Pig, programmers need to write scripts using Pig Latin language. All these
scripts are internally converted to Map and Reduce tasks. Apache Pig has a component known as Pig
Engine that accepts the Pig Latin scripts as input and converts those scripts into MapReduce jobs.
Why Do We Need Apache Pig?
• Using Pig Latin, programmers can perform MapReduce tasks easily without having to type complex
codes in Java.
• Apache Pig uses a multi-query approach, thereby reducing the length of code. For example, an
operation that would require you to type 200 lines of code (LoC) in Java can be done by typing
as few as 10 LoC in Apache Pig. Ultimately, Apache Pig reduces the development time by
almost 16 times.
• Pig Latin is an SQL-like language, and it is easy to learn Apache Pig when you are familiar with SQL.
• Apache Pig provides many built-in operators to support data operations like joins, filters, ordering,
etc. In addition, it also provides nested data types like tuples, bags, and maps that are missing from
MapReduce.
Features of Pig
• Rich set of operators − It provides many operators to perform operations like join, sort, filter, etc.
• Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script if you are good
at SQL.
• Optimization opportunities − The tasks in Apache Pig optimize their execution automatically, so
the programmers need to focus only on semantics of the language.
• Extensibility − Using the existing operators, users can develop their own functions to read, process,
and write data.
• UDF’s − Pig provides the facility to create User-defined Functions in other programming languages
such as Java and invoke or embed them in Pig Scripts.
• Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured as well as
unstructured. It stores the results in HDFS.
Apache Pig vs. MapReduce
• Apache Pig is a data flow language, whereas MapReduce is a data processing paradigm.
• Pig is a high-level language, whereas MapReduce is low level and rigid.
• Performing a join operation in Apache Pig is pretty simple, whereas it is quite difficult in MapReduce to
perform a join between datasets.
• Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig,
whereas exposure to Java is a must to work with MapReduce.
• Apache Pig uses a multi-query approach, thereby reducing the length of the code to a great extent,
whereas MapReduce requires almost 20 times more lines of code to perform the same task.
• There is no need for compilation in Pig; on execution, every Apache Pig operator is converted internally
into a MapReduce job, whereas MapReduce jobs have a long compilation process.
Pig vs. SQL
• Pig Latin is a procedural language, whereas SQL is a declarative language.
• In Apache Pig, schema is optional; we can store data without designing a schema (fields are referenced
positionally as $0, $1, etc.), whereas schema is mandatory in SQL.
• The data model in Apache Pig is nested relational, whereas the data model used in SQL is flat relational.
• Apache Pig provides limited opportunity for query optimization, whereas there is more opportunity for
query optimization in SQL.
Apache Pig vs. Hive
• Apache Pig uses a language called Pig Latin and was originally created at Yahoo, whereas Hive uses a
language called HiveQL and was originally created at Facebook.
• Pig Latin is a data flow language, whereas HiveQL is a query processing language.
• Pig Latin is a procedural language and fits the pipeline paradigm, whereas HiveQL is a declarative
language.
• Apache Pig can handle structured, unstructured, and semi-structured data, whereas Hive is mostly for
structured data.
Applications of Apache Pig
• To process huge data sources such as web logs.
• To perform data processing for search platforms.
• To process time sensitive data loads.
Apache Pig Components
• Initially the Pig Scripts are handled by the Parser. It checks the
syntax of the script, does type checking, and other miscellaneous
checks. The output of the parser will be a DAG (directed acyclic
graph), which represents the Pig Latin statements and logical
operators.
• In the DAG, the logical operators of the script are represented as the
nodes and the data flows are represented as edges.
• Optimizer: The logical plan (DAG) is passed to the logical optimizer,
which carries out logical optimizations such as projection
pushdown.
• Compiler: The compiler compiles the optimized logical plan into a
series of MapReduce jobs.
• Execution engine: Finally, the MapReduce jobs are submitted to
Hadoop in sorted order and executed on Hadoop, producing the
desired results.
Unit II Hadoop Ecosystem_Updated.pptx

More Related Content

PPTX
BDA: Introduction to HIVE, PIG and HBASE
tripathineeharika
 
PPTX
Hive - A theoretical overview in Detail.pptx
Mithun DSouza
 
PDF
BIGDATA ppts
Krisshhna Daasaarii
 
PPTX
Hive and querying data
KarthigaGunasekaran1
 
PPTX
Apache HBase™
Prashant Gupta
 
PPTX
HIVE-NEED, CHARACTERISTICS, OPTIMIZATION
HariPalani10
 
PPTX
hive architecture and hive components in detail
HariKumar544765
 
PPTX
Big Data and Cloud Computing
Farzad Nozarian
 
BDA: Introduction to HIVE, PIG and HBASE
tripathineeharika
 
Hive - A theoretical overview in Detail.pptx
Mithun DSouza
 
BIGDATA ppts
Krisshhna Daasaarii
 
Hive and querying data
KarthigaGunasekaran1
 
Apache HBase™
Prashant Gupta
 
HIVE-NEED, CHARACTERISTICS, OPTIMIZATION
HariPalani10
 
hive architecture and hive components in detail
HariKumar544765
 
Big Data and Cloud Computing
Farzad Nozarian
 

Similar to Unit II Hadoop Ecosystem_Updated.pptx (20)

PDF
Techincal Talk Hbase-Ditributed,no-sql database
Rishabh Dugar
 
PPTX
Apache Hive
tusharsinghal58
 
PPTX
Apache hive introduction
Mahmood Reza Esmaili Zand
 
PPTX
An Introduction-to-Hive and its Applications and Implementations.pptx
iaeronlineexm
 
PPTX
Hive
Manas Nayak
 
PPTX
Big Data UNIT 2 AKTU syllabus all topics covered
chinky1118
 
PPTX
Cloudera Hadoop Distribution
Thisara Pramuditha
 
PPTX
01-Introduction-to-Hive.pptx
VIJAYAPRABAP
 
PPTX
Introduction to Hadoop
Dr. C.V. Suresh Babu
 
PPTX
Big Data Analytics Presentation on the resourcefulness of Big data
nextstep013
 
PDF
What is Apache Hadoop and its ecosystem?
tommychauhan
 
PPTX
Big data and tools
Shivam Shukla
 
PPTX
HBase.pptx
Sadhik7
 
PPTX
Big data solutions in Azure
Mostafa
 
ODP
Apache hive1
sheetal sharma
 
PPTX
HADOOP ECOSYSTEM ALL ABOUT HADOOP,HADOOP PPT BIG DATA.pptx
Himani271945
 
PPTX
Impala for PhillyDB Meetup
Shravan (Sean) Pabba
 
PPTX
Building Big data solutions in Azure
Mostafa
 
PPTX
hadoop-ecosystem-ppt.pptx
raghavanand36
 
PPTX
Hadoop and their in big data analysis EcoSystem.pptx
Rahul Borate
 
Techincal Talk Hbase-Ditributed,no-sql database
Rishabh Dugar
 
Apache Hive
tusharsinghal58
 
Apache hive introduction
Mahmood Reza Esmaili Zand
 
An Introduction-to-Hive and its Applications and Implementations.pptx
iaeronlineexm
 
Big Data UNIT 2 AKTU syllabus all topics covered
chinky1118
 
Cloudera Hadoop Distribution
Thisara Pramuditha
 
01-Introduction-to-Hive.pptx
VIJAYAPRABAP
 
Introduction to Hadoop
Dr. C.V. Suresh Babu
 
Big Data Analytics Presentation on the resourcefulness of Big data
nextstep013
 
What is Apache Hadoop and its ecosystem?
tommychauhan
 
Big data and tools
Shivam Shukla
 
HBase.pptx
Sadhik7
 
Big data solutions in Azure
Mostafa
 
Apache hive1
sheetal sharma
 
HADOOP ECOSYSTEM ALL ABOUT HADOOP,HADOOP PPT BIG DATA.pptx
Himani271945
 
Impala for PhillyDB Meetup
Shravan (Sean) Pabba
 
Building Big data solutions in Azure
Mostafa
 
hadoop-ecosystem-ppt.pptx
raghavanand36
 
Hadoop and their in big data analysis EcoSystem.pptx
Rahul Borate
 
Ad

Recently uploaded (20)

PDF
Security features in Dell, HP, and Lenovo PC systems: A research-based compar...
Principled Technologies
 
PPTX
New ThousandEyes Product Innovations: Cisco Live June 2025
ThousandEyes
 
PDF
The Future of Mobile Is Context-Aware—Are You Ready?
iProgrammer Solutions Private Limited
 
PDF
Research-Fundamentals-and-Topic-Development.pdf
ayesha butalia
 
PDF
AI Unleashed - Shaping the Future -Starting Today - AIOUG Yatra 2025 - For Co...
Sandesh Rao
 
PDF
Make GenAI investments go further with the Dell AI Factory
Principled Technologies
 
PDF
Accelerating Oracle Database 23ai Troubleshooting with Oracle AHF Fleet Insig...
Sandesh Rao
 
PDF
Cloud-Migration-Best-Practices-A-Practical-Guide-to-AWS-Azure-and-Google-Clou...
Artjoker Software Development Company
 
PPTX
cloud computing vai.pptx for the project
vaibhavdobariyal79
 
PDF
Unlocking the Future- AI Agents Meet Oracle Database 23ai - AIOUG Yatra 2025.pdf
Sandesh Rao
 
PDF
CIFDAQ's Market Wrap : Bears Back in Control?
CIFDAQ
 
PDF
Using Anchore and DefectDojo to Stand Up Your DevSecOps Function
Anchore
 
PDF
BLW VOCATIONAL TRAINING SUMMER INTERNSHIP REPORT
codernjn73
 
PDF
Data_Analytics_vs_Data_Science_vs_BI_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
PPTX
Applied-Statistics-Mastering-Data-Driven-Decisions.pptx
parmaryashparmaryash
 
PDF
How Open Source Changed My Career by abdelrahman ismail
a0m0rajab1
 
PDF
Automating ArcGIS Content Discovery with FME: A Real World Use Case
Safe Software
 
PPTX
Dev Dives: Automate, test, and deploy in one place—with Unified Developer Exp...
AndreeaTom
 
PDF
MASTERDECK GRAPHSUMMIT SYDNEY (Public).pdf
Neo4j
 
PDF
Economic Impact of Data Centres to the Malaysian Economy
flintglobalapac
 
Security features in Dell, HP, and Lenovo PC systems: A research-based compar...
Principled Technologies
 
New ThousandEyes Product Innovations: Cisco Live June 2025
ThousandEyes
 
The Future of Mobile Is Context-Aware—Are You Ready?
iProgrammer Solutions Private Limited
 
Research-Fundamentals-and-Topic-Development.pdf
ayesha butalia
 
AI Unleashed - Shaping the Future -Starting Today - AIOUG Yatra 2025 - For Co...
Sandesh Rao
 
Make GenAI investments go further with the Dell AI Factory
Principled Technologies
 
Accelerating Oracle Database 23ai Troubleshooting with Oracle AHF Fleet Insig...
Sandesh Rao
 
Cloud-Migration-Best-Practices-A-Practical-Guide-to-AWS-Azure-and-Google-Clou...
Artjoker Software Development Company
 
cloud computing vai.pptx for the project
vaibhavdobariyal79
 
Unlocking the Future- AI Agents Meet Oracle Database 23ai - AIOUG Yatra 2025.pdf
Sandesh Rao
 
CIFDAQ's Market Wrap : Bears Back in Control?
CIFDAQ
 
Using Anchore and DefectDojo to Stand Up Your DevSecOps Function
Anchore
 
BLW VOCATIONAL TRAINING SUMMER INTERNSHIP REPORT
codernjn73
 
Data_Analytics_vs_Data_Science_vs_BI_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
Applied-Statistics-Mastering-Data-Driven-Decisions.pptx
parmaryashparmaryash
 
How Open Source Changed My Career by abdelrahman ismail
a0m0rajab1
 
Automating ArcGIS Content Discovery with FME: A Real World Use Case
Safe Software
 
Dev Dives: Automate, test, and deploy in one place—with Unified Developer Exp...
AndreeaTom
 
MASTERDECK GRAPHSUMMIT SYDNEY (Public).pdf
Neo4j
 
Economic Impact of Data Centres to the Malaysian Economy
flintglobalapac
 
Ad

Unit II Hadoop Ecosystem_Updated.pptx

  • 2. PIG, Zookeeper, how it helps in monitoring a cluster, HBase uses Zookeeper and how to Build Applications with Zookeeper. SPARK: Introduction to Data Analysis with Spark, Downloading Spark and Getting Started, Programming with RDDs, Machine Learning with MLlib.
  • 3. HBase • Limitations of Hadoop: Hadoop can perform only batch processing, and data will be accessed only in a sequential manner. That means one has to search the entire dataset even for the simplest of jobs. • A huge dataset when processed results in another huge data set, which should also be processed sequentially. At this point, a new solution is needed to access any point of data in a single unit of time (random access). • HBase is part of the Hadoop ecosystem which offers random real-time read/write access to data in the Hadoop File System. HBase is a Hadoop project which is Open Source, distributed Hadoop database which has its genesis in the Google’s Bigtable. • Its programming language is Java. • Now, it is an integral part of the Apache Software Foundation and the Hadoop ecosystem. • Also, it is a high availability database which exclusively runs on top of the HDFS. • It is a column-oriented database built on top of HDFS.
  • 4. • Why should you use HBase Technology? • Along with HDFS and MapReduce, HBase is one of the core components of the Hadoop ecosystem. Here are some salient features of HBase which make it significant to use: • Apache HBase has a completely distributed architecture. • It can easily work on extremely large scale data. • HBase offers high security and easy management which results in unprecedented high write throughput. • For both structured and semi-structured data types we can use it. • Moreover, the MapReduce jobs can be backed with HBase Tables.
  • 5. HBase and HDFS HDFS HBase HDFS is a distributed file system suitable for storing large files. HBase is a database built on top of the HDFS. HDFS does not support fast individual record lookups. HBase provides fast lookups for larger tables. It provides high latency batch processing; no concept of batch processing. It provides low latency access to single rows from billions of records (Random access). It provides only sequential access of data. HBase internally uses Hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups.
  • 6. Storage Mechanism in HBase • HBase is a column-oriented database and the tables in it are sorted by row. • Table is a collection of rows. • Row is a collection of column families. • Column family is a collection of columns. • Column is a collection of key value pairs. Rowid Column Family Column Family Column Family Column Family col1 col2 col3 col1 col2 col3 col1 col2 col3 col1 col2 col3 1 2 3
  • 7. Apache HBase Architecture • We know HBase is acting like a big table to record the data, and tables are split into regions. Again, Regions are divided vertically by family column to create stores. These stores are called files in HDFS. • HBase has three major components which are master servers, client library, and region servers. It's up to the requirement of the organization whether to add region servers or not. • MasterServer : It allocates regions to the region servers with the help of zookeeper. It balances the load across the region servers. MasterServer is responsible for changes like schema changes and metadata operations, like creating column families and tables. • Regions: Regions are nothing but tables which are split into small tables and spread across the region servers. • RegionServer: Region servers communicate with other components and complete the below tasks: • It communicates with the client to handle data related tasks. • It takes care of the read and write tasks of the regions under it. • It decides the size of a region based on the threshold it has.
  • 9. • The memory here acts as a temporary space to store the data. When anything is entered into Hbse, it is initially stored in the memory, and later, it will be transferred to HFiles where data is stored in blocks. • Zookeeper: Zookeeper is an open source project, and it facilitates the services like managing the configuration data, providing distributed synchronisation, naming, etc. It helps the master server in discovering the available servers. Zookeeper helps the client servers in communicating with region servers.
  • 10. Secondary Indexing • Secondary indexes are an orthogonal way to access data from its primary access path. In HBase, you have a single index that is lexicographically sorted on the primary row key. Access to records in any way other than through the primary row requires scanning over potentially all the rows in the table to test them against your filter. With secondary indexing, the columns or expressions you index form an alternate row key to allow point lookups and range scans along this new axis. • Covered Indexes: In covered indexes - we do not need to go back to the primary table once we have found the index entry. Instead, we bundle the data we care about right in the index rows, saving read- time overhead.For example, the following would create an index on the v1 and v2 columns and include the v3 column in the index as well to prevent having to get it from the data table: • CREATE INDEX my_index ON my_table (v1,v2) INCLUDE(v3)
  • 11. • Functional Indexes: Functional indexes allow you to create an index not just on columns, but on an arbitrary expressions. Then when a query uses that expression, the index may be used to retrieve the results instead of the data table. For example, you could create an index on UPPER(FIRST_NAME||‘ ’||LAST_NAME) to allow you to do case insensitive searches on the combined first name and last name of a person. • For example, the following would create this functional index: • CREATE INDEX UPPER_NAME_IDX ON EMP (UPPER(FIRST_NAME||' '||LAST_NAME)) • With this index in place, when the following query is issued, the index would be used instead of the data table to retrieve the results: • SELECT EMP_ID FROM EMP WHERE UPPER(FIRST_NAME||' '||LAST_NAME)='JOHN DOE'
  • 12. Applications of HBase • It is used whenever there is a need to write heavy applications. • HBase is used whenever we need to provide fast random access to available data. • Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
  • 13. Apache Hive • Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale. A data warehouse provides a central store of information that can easily be analyzed to make informed, data driven decisions. Hive allows users to read, write, and manage petabytes of data using SQL. • Hive is built on top of Apache Hadoop, which is an open-source framework used to efficiently store and process large datasets. As a result, Hive is closely integrated with Hadoop, and is designed to work quickly on petabytes of data. What makes Hive unique is the ability to query large datasets, leveraging Apache Tez or MapReduce, with a SQL-like interface. • Motivation: • Yahoo worked on Pig to facilitate application deployment on Hadoop. Their need mainly was focused on unstructured data Simultaneously Facebook started working on deploying warehouse solutions on Hadoop that resulted in Hive. The size of data being collected and analyzed in industry for business intelligence (BI) is growing rapidly making traditional warehousing solution prohibitively expensive.
  • 14. How does Hive work? • Hive was created to allow non-programmers familiar with SQL to work with petabytes of data, using a SQL-like interface called HiveQL. Hive uses batch processing so that it works quickly across a very large distributed database. • Hive transforms HiveQL queries into MapReduce or Tez jobs that run on Apache Hadoop’s distributed job scheduling framework, Yet Another Resource Negotiator (YARN). It queries data stored in a distributed storage solution, like the Hadoop Distributed File System (HDFS) or Amazon S3. • Hive stores its database and table metadata in a metastore, which is a database or file backed store that enables easy data abstraction and discovery. • Hive includes HCatalog, which is a table and storage management layer that reads data from the Hive metastore to facilitate seamless integration between Hive, Apache Pig, and MapReduce. • By using the metastore, HCatalog allows Pig and MapReduce to use the same data structures as Hive, so that the metadata doesn’t have to be redefined for each engine.
  • 15. Benefits of Hive B Fast: Hive is designed to quickly handle petabytes of data using batch processing. Familiar: Hive provides a familiar, SQL-like interface that is accessible to non-programmers. Scalable: Hive is easy to distribute and scale based on your needs. It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce or Spark jobs. It is capable of analyzing large datasets stored in HDFS. It allows different storage types such as plain text, RCFile, and HBase. It uses indexing to accelerate queries. It can operate on compressed data stored in the Hadoop ecosystem. It supports user-defined functions (UDFs) where user can provide its functionality.
  • 16. When to use Hive ● Most suitable for data warehouse applications where relatively static data is analyzed. ● Fast response time is not required. ● Data is not changing rapidly. ● An abstraction to underlying MR program. ● Hive of course is a good choice for queries that lend themselves to being expressed in SQL, particularly long-running queries where fault tolerance is desirable. ● Hive can be a good choice if you’d like to write feature-rich, fault-tolerant, batch (i.e., not near-real- time) transformation or ETL jobs in a pluggable SQL engine.
  • 17. Hive Architecture Hive Client: Hive allows writing applications in various languages, including Java, Python, and C++. It supports different types of clients such as:- ● Thrift Server - It is a cross-language service provider platform that serves the request from all those programming languages that supports Thrift. ● JDBC Driver - It is used to establish a connection between hive and Java applications. The JDBC Driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver. ● ODBC Driver - It allows the applications that support the ODBC protocol to connect to Hive.
  • 18. Hive Services: The following are the services provided by Hive ● Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and commands. ● Hive Web User Interface - The Hive Web UI is just an alternative of Hive CLI. It provides a web-based GUI for executing Hive queries and commands. ● Hive MetaStore - It is a central repository that stores all the structure information of various tables and partitions in the warehouse. It also includes metadata of column and its type information, the serializers and deserializers which is used to read and write data and the corresponding HDFS files where the data is stored. ● Hive Server - It is referred to as Apache Thrift Server. It accepts the request from different clients and provides it to Hive Driver. ● Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and JDBC/ODBC driver. It transfers the queries to the compiler. ● Hive Compiler - The purpose of the compiler is to parse the query and perform semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs. ● Hive Execution Engine - Optimizer generates the logical plan in the form of DAG of map-reduce tasks and HDFS tasks. In the end, the execution engine executes the incoming tasks in the order of their dependencies.
  • 19. Partitioning in Hive The partitioning in Hive means dividing the table into some parts based on the values of a particular column like date, course, city or country. The advantage of partitioning is that since the data is stored in slices, the query response time becomes faster. As we know that Hadoop is used to handle the huge amount of data, it is always required to use the best approach to deal with it. The partitioning in Hive is the best example of it. Let's assume we have a data of 10 million students studying in an institute. Now, we have to fetch the students of a particular course. If we use a traditional approach, we have to go through the entire data. This leads to performance degradation. In such a case, we can adopt the better approach i.e., partitioning in Hive and divide the data among the different datasets based on particular columns. The partitioning in Hive can be executed in two ways - ● Static partitioning ● Dynamic partitioning
  • 20. Static Partitioning: In static or manual partitioning, it is required to pass the values of partitioned columns manually while loading the data into the table. Hence, the data file doesn't contain the partitioned columns. create table student (id int, name string, age int, institute string) partitioned by (course string) row format delimited fields terminated by ','; Dynamic Partitioning: In dynamic partitioning, the values of partitioned columns exist within the table. So, it is not required to pass the values of partitioned columns manually. Enable the dynamic partition by using the following commands: - hive> set hive.exec.dynamic.partition=true; hive> set hive.exec.dynamic.partition.mode=nonstrict; Create a partition table by using the following command: - hive> create table student_part (id int, name string, age int, institute string) partitioned by (course string) row format delimited fields terminated by ',';
  • 21. Comparison with Traditional Database Below are the key features of Hive that differ from RDBMS. ● Hive resembles a traditional database by supporting SQL interface but it is not a full database. Hive can be better called as data warehouse instead of database. ● Hive enforces schema on read time whereas RDBMS enforces schema on write time. ● In RDBMS, a table’s schema is enforced at data load time, If the data being loaded doesn’t conform to the schema, then it is rejected. This design is called schema on write. ● But Hive doesn’t verify the data when it is loaded, but rather when a it is retrieved. This is called schema on read. ● Schema on read makes for a very fast initial load, since the data does not have to be read, parsed, and serialized to disk in the database’s internal format. The load operation is just a file copy or move. ● Schema on write makes query time performance faster, since the database can index columns and perform compression on the data but it takes longer to load data into the database.
  • 22. ● Hive is based on the notion of Write once, Read many times but RDBMS is designed for Read and Write many times. ● In RDBMS, record level updates, insertions and deletes, transactions and indexes are possible. Whereas these are not allowed in Hive because Hive was built to operate over HDFS data using MapReduce, where full-table scans are the norm and a table update is achieved by transforming the data into a new table. ● In RDBMS, maximum data size allowed will be in 10’s of Terabytes but whereas Hive can 100’s Petabytes very easily. ● As Hadoop is a batch-oriented system, Hive doesn’t support OLTP (Online Transaction Processing) but it is closer to OLAP (Online Analytical Processing) but not ideal since there is significant latency between issuing a query and receiving a reply, due to the overhead of Mapreduce jobs and due to the size of the data sets Hadoop was designed to serve.
  • 23. ● RDBMS is best suited for dynamic data analysis and where fast responses are expected but Hive is suited for data warehouse applications, where relatively static data is analyzed, fast response times are not required, and when the data is not changing rapidly. ● To overcome the limitations of Hive, HBase is being integrated with Hive to support record level operations and OLAP. ● Hive is very easily scalable at low cost but RDBMS is not that much scalable that too it is very costly scale up.
  • 24. HIVE Data Types Hive data types are categorized in numeric types, string types, misc types, and complex types. A list of Hive data types is given below. Integer Types: TINYINT, SMALLINT, INT, BIGINT Decimal Type: Float, Double Date/Time Types: TIMESTAMP, DATES String Types: STRING, Varchar, Char Complex Type: Struct, Map, Array
  • 25. HiveQL • Hive Query Language (HiveQL) is a query language in Apache Hive for processing and analyzing structured data. It separates users from the complexity of Map Reduce programming. It reuses common concepts from relational databases, such as tables, rows, columns, and schema, to ease learning. Hive provides a CLI for Hive query writing using Hive Query Language (HiveQL). • Most interactions tend to take place over a command line interface (CLI). Generally, HiveQL syntax is similar to the SQL syntax that most data analysts are familiar with. Hive supports four file formats which are: TEXTFILE, SEQUENCEFILE, ORC and RCFILE (Record Columnar File). • Hive Queries: Hive provides SQL type querying language for the ETL purpose on top of Hadoop file system. Hive Query language (HiveQL) provides SQL type environment in Hive to work with tables, databases, queries. We can have a different type of Clauses associated with Hive to perform different type data manipulations and querying.
  • 26. Hive queries provide the following features: • Data modeling, such as creation of databases, tables, etc. • ETL functionality: extraction, transformation, and loading of data into tables • Joins to merge different data tables • User-specific custom scripts for ease of coding • A faster querying tool on top of Hadoop
  • 27. Querying Data Create a Database: Create a database named “company” by running the create command: create database company; Next, verify the database was created by running the show command: show databases; Open the “company” database with the following command: use company; Create a Table in Hive: Specify column names and types when creating a table. Create the table by running the following command: create table employees (id int, name string, country string, department string, salary int);
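  • If the table will be populated from a delimited text file (as on the next slide), a row format is usually declared as well. A minimal sketch, assuming the file uses commas as field separators:
  CREATE TABLE employees (
    id INT, name STRING, country STRING, department STRING, salary INT
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','   -- assumption: adjust the delimiter to match the actual file
  STORED AS TEXTFILE;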
  • 28. • Load Data From a File: The table has been created, but it is empty until data is loaded from the source file in HDFS (here '/hdoop/employees.txt'). • Load the data by running the load command: load data inpath '/hdoop/employees.txt' overwrite into table employees; • Verify that the data was loaded by running a select: select * from employees; • Display Hive Data: There are several options for displaying data from the table. • Display Columns: Show the columns of a table by running the desc command: desc employees; • Display Selected Data: select name, country from employees;
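  • As a side note, if the source file sits on the local filesystem rather than in HDFS, Hive's LOCAL keyword can be used; a hedged sketch with an illustrative path:
  LOAD DATA LOCAL INPATH '/tmp/employees.txt' OVERWRITE INTO TABLE employees;   -- copies the local file into the table's warehouse directory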
  • 29. Sorting and Aggregating • ORDER AND SORT • ORDER BY (ASC|DESC): This is similar to the RDBMS ORDER BY statement. A sorted order is maintained across all of the output from every reducer. It performs a global sort using only one reducer, so it takes longer to return the result; using LIMIT with ORDER BY is strongly recommended. • SORT BY (ASC|DESC): This indicates which columns to sort by when ordering the records fed to each reducer; the rows are sorted before they are sent to each reducer, so each reducer's output is sorted, but the overall output is not globally sorted. • DISTRIBUTE BY: Rows with matching column values are partitioned to the same reducer. When used alone, it does not guarantee sorted input to the reducer. The DISTRIBUTE BY statement is similar to GROUP BY in an RDBMS in the sense of deciding which reducer the mapper output is sent to. When used with SORT BY, DISTRIBUTE BY must be specified before the SORT BY statement. • CLUSTER BY: This is shorthand for performing DISTRIBUTE BY and SORT BY on the same set of columns; the data is sorted locally within each reducer. The CLUSTER BY statement does not support ASC or DESC yet.
  • 30. Sorting and Aggregating • Sort, order, distribute & cluster: • The SORT BY and ORDER BY clauses define the order of the output data, while the DISTRIBUTE BY and CLUSTER BY clauses distribute the data to multiple reducers based on the key columns. • We can use SORT BY, ORDER BY, DISTRIBUTE BY or CLUSTER BY in a Hive SELECT query to get the output data in the desired order. • SORT BY: • The SORT BY clause sorts the data per reducer. As a result, if we have N reducers, we get N sorted files in the output, and these files can have overlapping data ranges. The output is not globally sorted, because each reducer sorts only the rows routed to it, using the key columns given in the SORT BY clause. The syntax of the SORT BY clause is as below: • SELECT Col1, Col2, ... ColN FROM TableName SORT BY Col1 <ASC | DESC>, Col2 <ASC | DESC>, ... ColN <ASC | DESC>
  • 31. • ORDER BY: • The ORDER BY clause orders the data globally. Because it must guarantee a global ordering, all the data has to pass through a single reducer, so the ORDER BY clause produces a single output file. • Funneling all the data through one reducer can become a performance killer, especially if the output dataset is large, so ORDER BY should generally be avoided in Hive queries over large data. • However, if we need to enforce a global ordering and the output dataset is not that big, we can use this clause to order the final dataset globally. • The syntax of the ORDER BY clause in Hive is as below: • SELECT Col1, Col2, ... ColN FROM TableName ORDER BY Col1 <ASC | DESC>, Col2 <ASC | DESC>, ... ColN <ASC | DESC>
  • 32. • DISTRIBUTE BY: • The DISTRIBUTE BY clause is used to distribute the input rows among reducers. It ensures that all rows with the same key column values go to the same reducer, so if we need to partition the data on some key column, we can use the DISTRIBUTE BY clause in a Hive query. However, DISTRIBUTE BY does not sort the data, either at the reducer level or globally, so rows with the same key values might not be placed next to each other in the output. • As a result, the DISTRIBUTE BY clause may produce N unsorted output files, where N is the number of reducers used in query processing, but no key value is split across more than one output file. • The syntax of the DISTRIBUTE BY clause in Hive is as below: • SELECT Col1, Col2, ... ColN FROM TableName DISTRIBUTE BY Col1, Col2, ... ColN
  • 33. • CLUSTER BY • The CLUSTER BY clause is a combination of the DISTRIBUTE BY and SORT BY clauses: its output is equivalent to DISTRIBUTE BY + SORT BY on the same columns. CLUSTER BY distributes the data based on the key columns and then sorts each reducer's output, so rows with the same key values end up adjacent to each other. The output is therefore sorted at the reducer level, and we get N sorted output files, where N is the number of reducers used in query processing; each key value appears in only one of those files. If the query is processed by only one reducer, the output is equivalent to that of the ORDER BY clause. • The syntax of the CLUSTER BY clause is as below: • SELECT Col1, Col2, ... ColN FROM TableName CLUSTER BY Col1, Col2, ... ColN
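  • As a quick hedged illustration on the employees table created earlier (assuming department and salary are sensible columns to distribute and sort on):
  SELECT id, name, salary FROM employees ORDER BY salary DESC LIMIT 10;                                        -- global sort; one reducer, one sorted file
  SELECT id, name, salary FROM employees SORT BY salary DESC;                                                  -- each reducer's output sorted, not globally
  SELECT id, name, department, salary FROM employees DISTRIBUTE BY department;                                 -- same department goes to the same reducer, no sorting
  SELECT id, name, department, salary FROM employees DISTRIBUTE BY department SORT BY department, salary DESC; -- DISTRIBUTE BY must precede SORT BY
  SELECT id, name, department, salary FROM employees CLUSTER BY department;                                    -- shorthand for DISTRIBUTE BY department SORT BY department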
  • 35. • Hive Aggregate Functions: Hive aggregate functions are among the most used built-in functions; they take a set of values and return a single value, and when used with a group they aggregate the values in each group and return one value per group. • Aggregate functions in Hive can be used with or without GROUP BY, but they are mostly used with GROUP BY. • Most of these functions ignore NULL values.
  • 37. Functions and descriptions: • COUNT(): Returns the count of all rows in a table, including rows containing NULL values; when a column is given as input, NULL values in that column are ignored for the count, and duplicates can be excluded with DISTINCT. Return: BIGINT. • SUM(): Returns the sum of all values in a column; when used with a group, it returns the sum for each group, and duplicates can be excluded with DISTINCT. Return: DOUBLE. • AVG(): Returns the average of all values in a column; when used with a group, it returns an average for each group. Return: DOUBLE. • MIN(): Returns the minimum value of the column from all rows; when used with a group, it returns a minimum for each group. Return: DOUBLE. • MAX(): Returns the maximum value of the column from all rows; when used with a group, it returns a maximum for each group. Return: DOUBLE. • VARIANCE(col): Returns the variance of a numeric column for all rows or for each group. Return: DOUBLE. • STDDEV_SAMP(col): Returns the sample standard deviation of all values in a column or for each group. Return: DOUBLE. • CORR(col1, col2): Returns the Pearson coefficient of correlation of a pair of numeric columns in the group. Return: DOUBLE.
  • 38. • Examples: • select count(*) from employee; • select count(salary) from employee; • select count(distinct gender, salary) from employee; • select sum(salary) from employee; • select sum(distinct salary) from employee; • select avg(salary) from employee group by age; • select age,avg(salary) from employee group by age; • select min(salary) from employee; • select max(salary) from employee; • select variance(salary) from employee; • select stddev_pop(salary) from employee;
  • 39. Joins & Sub queries • HiveQL – JOIN: The HiveQL JOIN clause is used to combine the data of two or more tables based on a related column between them. The various types of HiveQL joins are: • Inner Join • Left Outer Join • Right Outer Join • Full Outer Join
  • 40. • Inner Join: Only the records common to both tables are retrieved; the join is written with the JOIN keyword between the tables. • Left Outer Join: A LEFT OUTER JOIN returns all rows from the left table even when there are no matches in the right table; if the ON clause matches zero records in the right table, the join still returns a row with NULL in each column from the right table. • Right Outer Join: A RIGHT OUTER JOIN returns all rows from the right table even when there are no matches in the left table; if the ON clause matches zero records in the left table, the join still returns a row with NULL in each column from the left table. • Full Outer Join: A FULL OUTER JOIN combines the records of both tables based on the JOIN condition given in the query; it returns all records from both tables and fills in NULL values for columns where no match exists on either side.
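  • A minimal hedged sketch of these joins, reusing the employees table from earlier together with a hypothetical departments(dept_name, location) table introduced only for illustration:
  SELECT e.name, d.location FROM employees e JOIN departments d ON (e.department = d.dept_name);             -- inner join: only matching rows
  SELECT e.name, d.location FROM employees e LEFT OUTER JOIN departments d ON (e.department = d.dept_name);  -- all employees, NULL location if no match
  SELECT e.name, d.location FROM employees e RIGHT OUTER JOIN departments d ON (e.department = d.dept_name); -- all departments, NULL name if no match
  SELECT e.name, d.location FROM employees e FULL OUTER JOIN departments d ON (e.department = d.dept_name);  -- all rows from both sides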
  • 41. • Sub queries: A query nested inside another query is known as a subquery. The main query depends on the values returned by its subqueries. Subqueries can be classified into two types: • Subqueries in the FROM clause • Subqueries in the WHERE clause • When to use: to derive a value that combines column values from different tables, when one table's values depend on another table, or to compare a column's values against values from another table. • SELECT col FROM ( SELECT a+b AS col FROM t1) t2 • SELECT * FROM A WHERE A.a IN (SELECT foo FROM B);
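  • Two more hedged examples on the employees table (big_departments is a hypothetical table used only for illustration; IN subqueries in WHERE require a Hive version that supports them):
  SELECT dept, avg(sal) AS avg_sal FROM (SELECT department AS dept, salary AS sal FROM employees) t GROUP BY dept;   -- subquery in the FROM clause
  SELECT name, department FROM employees WHERE department IN (SELECT dept_name FROM big_departments);               -- subquery in the WHERE clause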
  • 42. Apache PIG • Apache Pig is a high-level platform for analyzing huge datasets. It was developed by Yahoo! and is generally used with Hadoop to perform a wide range of data administration operations. • For writing data analysis programs, Pig provides a high-level language called Pig Latin. Pig Latin offers several operators with which programmers can develop personalized functions for reading, writing, and processing data. • To analyze data with Apache Pig, we write scripts in Pig Latin; these scripts are then transformed into MapReduce tasks with the help of the Pig Engine.
  • 43. What is Apache Pig? • Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large sets of data by representing them as data flows. Pig is generally used with Hadoop; we can perform all of the data manipulation operations in Hadoop using Apache Pig. • To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language provides various operators with which programmers can develop their own functions for reading, writing, and processing data. • To analyze data using Apache Pig, programmers write scripts in the Pig Latin language. All these scripts are internally converted into Map and Reduce tasks: Apache Pig has a component known as Pig Engine that accepts Pig Latin scripts as input and converts them into MapReduce jobs.
  • 44. Why Do We Need Apache Pig? • Using Pig Latin, programmers can perform MapReduce tasks easily without having to write complex Java code. • Apache Pig uses a multi-query approach, thereby reducing the length of code. For example, an operation that would require 200 lines of code (LoC) in Java can be done in as few as 10 LoC in Apache Pig; ultimately, Apache Pig reduces development time by almost 16 times. • Pig Latin is a SQL-like language, so it is easy to learn Apache Pig if you are familiar with SQL. • Apache Pig provides many built-in operators to support data operations like joins, filters, and ordering. In addition, it provides nested data types like tuples, bags, and maps that are missing from MapReduce.
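  • To give a feel for how compact Pig Latin is, here is a minimal hedged word-count sketch (the input path and relation names are purely illustrative):
  lines   = LOAD '/data/input.txt' AS (line:chararray);                   -- load raw lines of text
  words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;       -- split each line into words
  grouped = GROUP words BY word;                                          -- group identical words together
  counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;  -- count each group
  DUMP counts;                                                            -- print the result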
  • 45. Features of Pig • Rich set of operators − It provides many operators to perform operations like join, sort, filter, etc. • Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script if you are good at SQL. • Optimization opportunities − The tasks in Apache Pig optimize their execution automatically, so programmers need to focus only on the semantics of the language. • Extensibility − Using the existing operators, users can develop their own functions to read, process, and write data. • UDFs − Pig provides the facility to create User-Defined Functions in other programming languages such as Java and to invoke or embed them in Pig scripts. • Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured and unstructured, and stores the results in HDFS.
  • 46. Apache Pig vs MapReduce: • Apache Pig is a data flow language, whereas MapReduce is a data processing paradigm. • Pig is a high-level language; MapReduce is low level and rigid. • Performing a join in Apache Pig is pretty simple, whereas it is quite difficult in MapReduce to join datasets. • Any novice programmer with basic SQL knowledge can work conveniently with Apache Pig, whereas exposure to Java is a must to work with MapReduce. • Apache Pig's multi-query approach reduces code length to a great extent; MapReduce will require almost 20 times more lines to perform the same task. • Pig scripts need no separate compilation; on execution, every Apache Pig operator is converted internally into a MapReduce job. MapReduce jobs, by contrast, go through a long compilation process.
  • 47. Apache Pig vs SQL: • Pig Latin is a procedural language, whereas SQL is a declarative language. • In Apache Pig, schema is optional; we can store data without designing a schema (fields can then be referenced positionally as $0, $1, etc.). In SQL, a schema is mandatory. • The data model in Apache Pig is nested relational, while the data model used in SQL is flat relational. • Apache Pig provides limited opportunity for query optimization, whereas there is more opportunity for query optimization in SQL. Apache Pig vs Hive: • Apache Pig uses a language called Pig Latin, originally created at Yahoo; Hive uses a language called HiveQL, originally created at Facebook. • Pig Latin is a data flow language; HiveQL is a query processing language. • Pig Latin is a procedural language and fits the pipeline paradigm, whereas HiveQL is a declarative language. • Apache Pig can handle structured, unstructured, and semi-structured data; Hive is mostly for structured data.
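  • A small hedged sketch of this schema-optional style, loading a comma-separated file (the path is illustrative) and referring to fields by position:
  raw   = LOAD '/data/employees.txt' USING PigStorage(',');   -- no schema declared
  names = FOREACH raw GENERATE $1 AS name, $4 AS salary;      -- fields addressed positionally as $0, $1, ...
  DUMP names;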
  • 48. Applications of Apache Pig • To process huge data sources such as web logs. • To perform data processing for search platforms. • To process time sensitive data loads.
  • 49. Apache Pig Components • Parser: Initially the Pig scripts are handled by the Parser. It checks the syntax of the script, does type checking, and performs other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph) that represents the Pig Latin statements and logical operators. • In the DAG, the logical operators of the script are represented as nodes and the data flows are represented as edges. • Optimizer: The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown. • Compiler: The compiler compiles the optimized logical plan into a series of MapReduce jobs. • Execution engine: Finally, the MapReduce jobs are submitted to Hadoop in sorted order and executed on Hadoop, producing the desired results.