Unit II
Hadoop Ecosystem
PIG, Zookeeper, how it helps in monitoring a cluster, how HBase uses Zookeeper, and how to build
applications with Zookeeper.
SPARK: Introduction to Data Analysis with Spark, Downloading Spark and Getting Started, Programming
with RDDs, Machine Learning with MLlib.
HBase
• Limitations of Hadoop: Hadoop can perform only batch processing, and data will be accessed only in
a sequential manner. That means one has to search the entire dataset even for the simplest of jobs.
• A huge dataset when processed results in another huge data set, which should also be processed
sequentially. At this point, a new solution is needed to access any point of data in a single unit of time
(random access).
• HBase is part of the Hadoop ecosystem that offers random, real-time read/write access to data in
the Hadoop File System. It is an open-source, distributed Hadoop database that has its genesis in
Google’s Bigtable.
• It is written in Java.
• It is now an integral part of the Apache Software Foundation and the Hadoop ecosystem.
• It is a high-availability database that runs exclusively on top of HDFS.
• It is a column-oriented database built on top of HDFS.
• Why should you use HBase Technology?
• Along with HDFS and MapReduce, HBase is one of the core components of the Hadoop ecosystem. Here are some salient
features of HBase which make it significant to use:
• Apache HBase has a completely distributed architecture.
• It can easily work on extremely large scale data.
• HBase offers high security and easy management, along with high write throughput.
• It can be used for both structured and semi-structured data.
• Moreover, MapReduce jobs can be backed by HBase tables.
HBase and HDFS
HDFS:
• HDFS is a distributed file system suitable for storing large files.
• It does not support fast individual record lookups.
• It provides high-latency batch processing; there is no concept of low-latency random access.
• It provides only sequential access to data.
HBase:
• HBase is a database built on top of HDFS.
• It provides fast lookups for larger tables.
• It provides low-latency access to single rows from billions of records (random access).
• HBase internally uses hash tables and provides random access, and it stores the data in indexed HDFS
files for faster lookups.
Storage Mechanism in HBase
• HBase is a column-oriented database and the tables in it are sorted by row.
• Table is a collection of rows.
• Row is a collection of column families.
• Column family is a collection of columns.
• Column is a collection of key value pairs.
Example layout: each row is identified by a Rowid; the table has four column families, and each column
family contains the columns col1, col2, and col3. Rows 1, 2, and 3 hold their cells under these column
families.
Apache HBase Architecture
• We know HBase acts like a big table to record the data, and tables are split into regions. Regions are in
turn divided vertically by column family to create stores, and these stores are saved as files in HDFS.
• HBase has three major components: the master server, the client library, and the region servers. Region
servers can be added or removed as per the requirements of the organization.
• MasterServer: It allocates regions to the region servers with the help of Zookeeper and balances the load
across the region servers. The MasterServer is responsible for schema changes and other metadata
operations, such as creating tables and column families.
• Regions: Regions are nothing but tables which are split into small tables and spread across the region
servers.
• RegionServer: Region servers communicate with other components and complete the below tasks:
• It communicates with the client to handle data related tasks.
• It takes care of the read and write tasks of the regions under it.
• It decides the size of a region based on the threshold it has.
• The memory (MemStore) acts as a temporary space to store the data. When anything is written to
HBase, it is initially stored in memory and later flushed to HFiles, where data is stored in blocks.
• Zookeeper: Zookeeper is an open-source project that provides services such as managing
configuration data, distributed synchronisation, naming, etc. It helps the master server discover the
available servers, and it helps clients communicate with region servers.
Secondary Indexing
• Secondary indexes are an orthogonal way to access data from its primary access path. In HBase, you
have a single index that is lexicographically sorted on the primary row key. Access to records in any way
other than through the primary row requires scanning over potentially all the rows in the table to test
them against your filter. With secondary indexing, the columns or expressions you index form an
alternate row key to allow point lookups and range scans along this new axis.
• Covered Indexes: With a covered index, we do not need to go back to the primary table once we have
found the index entry. Instead, we bundle the data we care about right in the index rows, saving read-time
overhead. For example, the following would create an index on the v1 and v2 columns and include the
v3 column in the index as well, to avoid having to fetch it from the data table:
• CREATE INDEX my_index ON my_table (v1,v2) INCLUDE(v3)
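• For instance, a query that touches only the indexed and included columns can be answered from the
index alone; a minimal sketch, reusing the hypothetical my_table above:
• -- v1 and v2 (the indexed columns) and v3 (the included column) are all stored in the index,
• -- so the data table is never consulted for this query
• SELECT v3 FROM my_table WHERE v1 = 'x' AND v2 = 'y'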
• Functional Indexes: Functional indexes allow you to create an index not just on columns, but on an
arbitrary expression. Then, when a query uses that expression, the index may be used to retrieve
the results instead of the data table. For example, you could create an index on
UPPER(FIRST_NAME||' '||LAST_NAME) to allow case-insensitive searches on the combined
first name and last name of a person.
• For example, the following would create this functional index:
• CREATE INDEX UPPER_NAME_IDX ON EMP (UPPER(FIRST_NAME||' '||LAST_NAME))
• With this index in place, when the following query is issued, the index would be used instead of the
data table to retrieve the results:
• SELECT EMP_ID FROM EMP WHERE UPPER(FIRST_NAME||' '||LAST_NAME)='JOHN DOE'
Applications of HBase
• It is used for write-heavy applications.
• HBase is used whenever we need to provide fast random access to available data.
• Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
Apache Hive
• Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a
massive scale. A data warehouse provides a central store of information that can easily be analyzed
to make informed, data driven decisions. Hive allows users to read, write, and manage petabytes of
data using SQL.
• Hive is built on top of Apache Hadoop, which is an open-source framework used to efficiently store
and process large datasets. As a result, Hive is closely integrated with Hadoop, and is designed to
work quickly on petabytes of data. What makes Hive unique is the ability to query large datasets,
leveraging Apache Tez or MapReduce, with a SQL-like interface.
• Motivation:
• Yahoo worked on Pig to facilitate application deployment on Hadoop; their need was mainly
focused on unstructured data. Simultaneously, Facebook started working on deploying
warehouse solutions on Hadoop, which resulted in Hive. The size of data being collected and
analyzed in industry for business intelligence (BI) is growing rapidly, making traditional
warehousing solutions prohibitively expensive.
How does Hive work?
• Hive was created to allow non-programmers familiar with SQL to work with petabytes of data,
using a SQL-like interface called HiveQL. Hive uses batch processing so that it works quickly across
a very large distributed database.
• Hive transforms HiveQL queries into MapReduce or Tez jobs that run on Apache Hadoop’s
distributed job scheduling framework, Yet Another Resource Negotiator (YARN). It queries data
stored in a distributed storage solution, like the Hadoop Distributed File System (HDFS) or Amazon
S3.
• Hive stores its database and table metadata in a metastore, which is a database or file backed
store that enables easy data abstraction and discovery.
• Hive includes HCatalog, which is a table and storage management layer that reads data from the
Hive metastore to facilitate seamless integration between Hive, Apache Pig, and MapReduce.
• By using the metastore, HCatalog allows Pig and MapReduce to use the same data structures as
Hive, so that the metadata doesn’t have to be redefined for each engine.
Benefits of Hive
Fast: Hive is designed to quickly handle petabytes of data using batch processing.
Familiar: Hive provides a familiar, SQL-like interface that is accessible to non-programmers.
Scalable: Hive is easy to distribute and scale based on your needs.
It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce or Spark jobs.
It is capable of analyzing large datasets stored in HDFS.
It allows different storage types such as plain text, RCFile, and HBase.
It uses indexing to accelerate queries.
It can operate on compressed data stored in the Hadoop ecosystem.
It supports user-defined functions (UDFs), through which users can plug in their own functionality.
When to use Hive
● Most suitable for data warehouse applications where relatively static data is analyzed.
● Fast response time is not required.
● Data is not changing rapidly.
● An abstraction to underlying MR program.
● Hive of course is a good choice for queries that lend themselves to being expressed in SQL,
particularly long-running queries where fault tolerance is desirable.
● Hive can be a good choice if you’d like to write feature-rich, fault-tolerant, batch (i.e., not near-real-
time) transformation or ETL jobs in a pluggable SQL engine.
Hive Architecture
Hive Client: Hive allows writing applications in
various languages, including Java, Python, and C++.
It supports different types of clients, such as:
● Thrift Server - It is a cross-language service
provider platform that serves requests
from all the programming languages that
support Thrift.
● JDBC Driver - It is used to establish a
connection between hive and Java
applications. The JDBC Driver is present in
the class
org.apache.hadoop.hive.jdbc.HiveDriver.
● ODBC Driver - It allows the applications that
support the ODBC protocol to connect to
Hive.
Hive Services: The following are the services provided by Hive
● Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and commands.
● Hive Web User Interface - The Hive Web UI is just an alternative of Hive CLI. It provides a web-based GUI for
executing Hive queries and commands.
● Hive MetaStore - It is a central repository that stores all the structural information of the various tables and partitions in
the warehouse. It also includes metadata of columns and their type information, the serializers and deserializers
used to read and write data, and the corresponding HDFS files where the data is stored.
● Hive Server - It is also referred to as the Apache Thrift Server. It accepts requests from different clients and forwards
them to the Hive Driver.
● Hive Driver - It receives queries from different sources like the web UI, CLI, Thrift, and JDBC/ODBC drivers. It transfers
the queries to the compiler.
● Hive Compiler - The purpose of the compiler is to parse the query and perform semantic analysis on the different
query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
● Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of MapReduce and HDFS
tasks. The execution engine then executes the incoming tasks in the order of their dependencies.
Partitioning in Hive
The partitioning in Hive means dividing the table into some parts based on the values of a particular
column like date, course, city or country. The advantage of partitioning is that since the data is stored in
slices, the query response time becomes faster.
As Hadoop is used to handle huge amounts of data, it is important to use an efficient approach to access
it, and partitioning in Hive is a good example of this.
Let's assume we have data on 10 million students studying in an institute, and we have to fetch the
students of a particular course. With a traditional approach, we would have to scan the entire dataset,
which degrades performance. In such a case, we can adopt the better approach, i.e., partitioning in Hive,
and divide the data into different slices based on particular columns.
The partitioning in Hive can be executed in two ways -
● Static partitioning
● Dynamic partitioning
Static Partitioning: In static or manual partitioning, it is required to pass the values of partitioned
columns manually while loading the data into the table. Hence, the data file doesn't contain the partitioned
columns.
create table student (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';
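A sketch of loading one slice into this table: in static partitioning, the partition value is supplied explicitly
in the LOAD statement. The file path and the course value ('BigData') below are assumptions used only
for illustration.
-- every row in this file is assumed to belong to the 'BigData' course
load data local inpath '/home/hive/student_bigdata.csv'
into table student
partition (course = 'BigData');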
Dynamic Partitioning: In dynamic partitioning, the values of partitioned columns exist within the
table. So, it is not required to pass the values of partitioned columns manually.
Enable the dynamic partition by using the following commands: -
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
Create a partition table by using the following command: -
hive> create table student_part (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';
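With the properties above enabled, partitions are created from the data itself at insert time. A minimal
sketch, assuming a non-partitioned staging table named student (as in the static example) that also
contains a course column:
-- the last column in the SELECT list feeds the dynamic partition column (course)
insert overwrite table student_part partition (course)
select id, name, age, institute, course from student;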
Comparison with Traditional Database
Below are the key features of Hive that differ from RDBMS.
● Hive resembles a traditional database by supporting an SQL interface, but it is not a full database.
Hive is better described as a data warehouse than as a database.
● Hive enforces the schema at read time, whereas an RDBMS enforces the schema at write time.
● In an RDBMS, a table’s schema is enforced at data load time; if the data being loaded doesn’t conform
to the schema, it is rejected. This design is called schema on write.
● Hive, by contrast, doesn’t verify the data when it is loaded, but rather when it is retrieved. This is called
schema on read.
● Schema on read makes for a very fast initial load, since the data does not have to be read,
parsed, and serialized to disk in the database’s internal format. The load operation is just a file copy
or move.
● Schema on write makes query time performance faster, since the database can index columns
and perform compression on the data but it takes longer to load data into the database.
● Hive is based on the notion of Write once, Read many times but RDBMS is designed for Read
and Write many times.
● In RDBMS, record level updates, insertions and deletes, transactions and indexes are
possible. Whereas these are not allowed in Hive because Hive was built to operate over HDFS
data using MapReduce, where full-table scans are the norm and a table update is achieved by
transforming the data into a new table.
● In an RDBMS, the maximum data size handled is typically in the tens of terabytes, whereas Hive can
handle hundreds of petabytes with ease.
● As Hadoop is a batch-oriented system, Hive doesn’t support OLTP (Online Transaction
Processing) but it is closer to OLAP (Online Analytical Processing) but not ideal since there is
significant latency between issuing a query and receiving a reply, due to the overhead of Mapreduce
jobs and due to the size of the data sets Hadoop was designed to serve.
● RDBMS is best suited for dynamic data analysis and where fast responses are expected but Hive
is suited for data warehouse applications, where relatively static data is analyzed, fast response
times are not required, and when the data is not changing rapidly.
● To overcome the limitations of Hive, HBase is being integrated with Hive to support record level
operations and OLAP.
● Hive is easily scalable at low cost, whereas an RDBMS is far less scalable and is very costly to scale
up.
HIVE Data Types
Hive data types are categorized in numeric types, string types, misc types, and complex types. A list of
Hive data types is given below.
Integer types: TINYINT, SMALLINT, INT, BIGINT
Floating-point/decimal types: FLOAT, DOUBLE, DECIMAL
Date/time types: TIMESTAMP, DATE
String types: STRING, VARCHAR, CHAR
Complex types: STRUCT, MAP, ARRAY
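As an illustration of the complex types, a short sketch of a table definition; the table name, columns, and
delimiters are made up for this example:
create table employee_details (
  id int,
  name string,
  skills array<string>,                   -- ARRAY: an ordered list of values
  phone map<string, string>,              -- MAP: key-value pairs, e.g. 'home' -> a phone number
  address struct<city:string, zip:string> -- STRUCT: a group of named fields
)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by ':';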
HiveQL
• Hive Query Language (HiveQL) is a query language in Apache Hive for processing and analyzing
structured data. It separates users from the complexity of Map Reduce programming. It reuses
common concepts from relational databases, such as tables, rows, columns, and schema, to ease
learning. Hive provides a CLI for Hive query writing using Hive Query Language (HiveQL).
• Most interactions tend to take place over a command line interface (CLI). Generally, HiveQL syntax is
similar to the SQL syntax that most data analysts are familiar with. Hive supports four file formats
which are: TEXTFILE, SEQUENCEFILE, ORC and RCFILE (Record Columnar File).
• Hive Queries: Hive provides an SQL-type query language for ETL purposes on top of the Hadoop file
system. Hive Query Language (HiveQL) provides an SQL-like environment in Hive to work with tables,
databases, and queries. Different clauses can be used with Hive to perform different types of data
manipulation and querying.
Hive queries provide the following features:
• Data modeling such as Creation of databases, tables, etc.
• ETL functionalities such as Extraction, Transformation, and Loading data into tables
• Joins to merge different data tables
• User specific custom scripts for ease of code
• Faster querying tool on top of Hadoop
Querying Data
Create a Database: Create a database named “company” by running the create command:
create database company;
Next, verify the database is created by running the show command:
show databases;
Open the “company” database by using the following command:
use company;
Create a Table in Hive
Use column names when creating a table. Create the table by running the following command:
create table employees (id int, name string, country string, department string, salary int);
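If employees.txt is a plain text file with comma-separated fields (an assumption here), the table is usually
declared with an explicit delimiter so that the LOAD in the next step is parsed as expected; a sketch:
create table employees (id int, name string, country string, department string, salary int)
row format delimited
fields terminated by ',';   -- matches the field separator used in the data file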
• Load Data From a File: You have created a table, but it is empty because data has not yet been loaded
from the file located in the /hdoop directory.
• Load data by running the load command:
load data inpath '/hdoop/employees.txt' overwrite into table employees;
• Verify if the data is loaded by running the select command:
select * from employees;
• Display Hive Data: You have several options for displaying data from the table.
• Display Columns: Display columns of a table by running the desc command:
• desc employees;
• Display Selected Data
• select name,country from employees;
Sorting and Aggregating
• ORDER AND SORT
• ORDER BY (ASC|DESC): This is similar to the RDBMS ORDER BY statement. A sorted order is maintained
across all of the output from every reducer. It performs the global sort using only one reducer, so it
takes a longer time to return the result. Usage with LIMIT is strongly recommended for ORDER BY.
• SORT BY (ASC|DESC): This indicates which columns to sort when ordering the reducer input records.
This means it completes sorting before sending data to the reducer.
• DISTRIBUTE BY – Rows with matching column values will be partitioned to the same reducer. When
used alone, it does not guarantee sorted input to the reducer. The DISTRIBUTE BY statement is similar
to GROUP BY in RDBMS in terms of deciding which reducer to distribute the mapper output to. When
using with SORT BY, DISTRIBUTE BY must be specified before the SORT BY statement.
• CLUSTER BY – This is a shorthand operator to perform DISTRIBUTE BY and SORT BY operations on the
same group of columns. And, it is sorted locally in each reducer. The CLUSTER BY statement does not
support ASC or DESC yet.
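• As an example of combining these clauses, a minimal sketch against the employees table used earlier
(department and salary columns assumed); note that DISTRIBUTE BY is written before SORT BY:
• -- rows for the same department go to the same reducer,
• -- and each reducer then sorts its own rows by salary
• SELECT name, department, salary FROM employees DISTRIBUTE BY department SORT BY salary DESC;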
Sorting and Aggregating
• Sort, order, distribute & cluster:
• The SORT BY and ORDER BY clauses are used to define the order of the output data. However,
DISTRIBUTE BY and CLUSTER BY clauses are used to distribute the data to multiple reducers based
on the key columns.
• We can use Sort by or Order by or Distribute by or Cluster by clauses in a hive SELECT query to get
the output data in the desired order.
• SORT BY:
• The SORT BY clause sorts the data per reducer. As a result, if we have N reducers, we will have N
sorted files in the output. These files can have overlapping data ranges. Also, the output data is not
globally sorted, because Hive sorts the rows before feeding them to the reducers based on the key
columns used in the SORT BY clause. The syntax of the SORT BY clause is as
below:
• SELECT Col1, Col2,……ColN FROM TableName SORT BY Col1 <ASC | DESC>, Col2 <ASC | DESC>, ….
ColN <ASC | DESC>
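• A minimal sketch against the assumed employees table; fixing the number of reducers makes the
per-reducer (rather than global) ordering easy to see, because two independently sorted files are
produced:
• set mapreduce.job.reduces=2;   -- force two reducers, hence two sorted output files
• SELECT name, salary FROM employees SORT BY salary DESC;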
• ORDER BY:
• ORDER BY clause orders the data globally. Because it ensures the global ordering of the data,
all the data need to be passed from a single reducer only. As a result, the order by clause
outputs one single file only.
• Bringing all the data to one single reducer can become a performance killer, especially if the
output dataset is significantly large. So, we should generally avoid the ORDER BY clause in Hive
queries over large datasets.
• However, if we need to enforce a global ordering of the data, and the output dataset is not
that big, we can use this hive clause to order the final dataset globally.
• The syntax of the ORDER BY clause in hive is as below:
• SELECT Col1, Col2,……ColN FROM TableName ORDER BY Col1 <ASC | DESC>, Col2 <ASC |
DESC>, …. ColN <ASC | DESC>
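• As recommended earlier, ORDER BY is usually paired with LIMIT so that the single reducer only has to
emit a small, globally sorted result; a sketch against the assumed employees table:
• -- global sort on one reducer, but only the top 10 rows are returned
• SELECT name, salary FROM employees ORDER BY salary DESC LIMIT 10;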
• DISTRIBUTE BY:
• The DISTRIBUTE BY clause is used to distribute the input rows among reducers. It ensures that all
rows with the same key column values go to the same reducer. So, if we need to partition
the data on some key column, we can use the DISTRIBUTE BY clause in Hive queries.
However, the DISTRIBUTE BY clause does not sort the data, either at the reducer level or globally.
Also, the same key values might not be placed next to each other in the output dataset.
• As a result, the DISTRIBUTE BY clause may output N number of unsorted files where N is the
number of reducers used in the query processing. But, the output files do not contain
overlapping data ranges.
• The syntax of the DISTRIBUTE BY clause in hive is as below:
• SELECT Col1, Col2,……ColN FROM TableName DISTRIBUTE BY Col1, Col2, ….. ColN
• CLUSTER BY
• CLUSTER BY clause is a combination of DISTRIBUTE BY and SORT BY clauses together. That
means the output of the CLUSTER BY clause is equivalent to the output of DISTRIBUTE BY +
SORT BY clauses. The CLUSTER BY clause distributes the data based on the key column and
then sorts the output data by putting the same key column values adjacent to each other. So,
the output of the CLUSTER BY clause is sorted at the reducer level. As a result, we can get N
number of sorted output files where N is the number of reducers used in the query
processing. Also, the CLUSTER by clause ensures that we are getting non-overlapping data
ranges into the final outputs. However, if the query is processed by only one reducer the
output will be equivalent to the output of the ORDER BY clause.
• The syntax of the CLUSTER BY clause is as below:
• SELECT Col1, Col2,……ColN FROM TableName CLUSTER BY Col1, Col2, ….. ColN
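• A sketch on the assumed employees table; the two statements below are equivalent, since CLUSTER BY
is shorthand for DISTRIBUTE BY and SORT BY on the same columns:
• SELECT name, department FROM employees CLUSTER BY department;
• -- equivalent to:
• SELECT name, department FROM employees DISTRIBUTE BY department SORT BY department;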
• Hive Aggregate Functions: Hive aggregate functions are widely used built-in functions that take a
set of values and return a single value. When used with a group, they aggregate all values in each
group and return one value per group.
• Aggregate functions in Hive can be used with or without GROUP BY; however, they are mostly used
with GROUP BY.
• Most of these functions ignore NULL values.
Functions and descriptions:
• COUNT() – Returns the count of all rows in a table, including rows containing NULL values. When a
column is specified as input, NULL values in that column are ignored for the count. Duplicates can be
excluded by using DISTINCT. Return type: BIGINT
• SUM() – Returns the sum of all values in a column. When used with a group, it returns the sum for
each group. Duplicates can be excluded by using DISTINCT. Return type: DOUBLE
• AVG() – Returns the average of all values in a column. When used with a group, it returns an
average for each group. Return type: DOUBLE
• MIN() – Returns the minimum value of the column across all rows. When used with a group, it returns
the minimum for each group. Return type: DOUBLE
• MAX() – Returns the maximum value of the column across all rows. When used with a group, it returns
the maximum for each group. Return type: DOUBLE
• VARIANCE(col) – Returns the variance of a numeric column for all rows or for each group. Return type:
DOUBLE
• STDDEV_SAMP(col) – Returns the sample standard deviation of all values in a column or for each group.
Return type: DOUBLE
• CORR(col1, col2) – Returns the Pearson coefficient of correlation of a pair of numeric columns in the
group. Return type: DOUBLE
• Examples:
• select count(*) from employee;
• select count(salary) from employee;
• select count(distinct gender, salary) from employee;
• select sum(salary) from employee;
• select sum(distinct salary) from employee;
• select avg(salary) from employee group by age;
• select age,avg(salary) from employee group by age;
• select min(salary) from employee;
• select max(salary) from employee;
• select variance(salary) from employee;
• select stddev_pop(salary) from employee;
Joins & Sub queries
• HiveQL – JOIN: The HiveQL JOIN clause is used to combine the data of two or more tables based on a
related column between them. The various types of HiveQL joins are:
• Inner Join
• Left Outer Join
• Right Outer Join
• Full Outer Join
• Inner Join: The records common to both tables are retrieved by an inner join. Here we perform the join
query using the JOIN keyword between the tables.
• Left Outer Join: The HiveQL LEFT OUTER JOIN returns all the rows from the left table even if there
are no matches in the right table. If the ON clause matches zero records in the right table, the join
still returns a record in the result, with NULL in each column from the right table.
• Right Outer Join: The HiveQL RIGHT OUTER JOIN returns all the rows from the right table even if
there are no matches in the left table. If the ON clause matches zero records in the left table, the join
still returns a record in the result, with NULL in each column from the left table. A RIGHT join always
returns all records from the right table and the matched records from the left table; if the left table has
no corresponding value for a column, NULL is returned in that place.
• Full Outer Join: It combines the records of both tables based on the JOIN condition given in the query.
It returns all the records from both tables and fills in NULL values for the columns that have no
matching values on either side.
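• A small sketch of these join types, assuming a hypothetical departments (dept_name, location) table
alongside the employees table used earlier; only the join keyword changes between the variants:
• -- inner join: only employees whose department exists in departments
• SELECT e.name, d.location FROM employees e JOIN departments d ON (e.department = d.dept_name);
• -- left outer join: every employee, with a NULL location when there is no match
• SELECT e.name, d.location FROM employees e LEFT OUTER JOIN departments d ON (e.department = d.dept_name);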
• Sub queries: A Query present within a Query is known as a sub query. The main query will depend
on the values returned by the subqueries. Subqueries can be classified into two types
• Subqueries in FROM clause
• Subqueries in WHERE clause
• When to use:
• To get a particular value combined from two column values from different tables.
• When the values of one table depend on another table.
• For comparative checking of one column’s values against another table.
• SELECT col FROM (SELECT a+b AS col FROM t1) t2;
• SELECT * FROM A WHERE A.a IN (SELECT foo FROM B);
Apache PIG
• Pig Hadoop is basically a high-level platform for the analysis of huge datasets. Pig Hadoop was
developed by Yahoo! and is generally used with Hadoop to perform a lot of data administration
operations.
• For writing data analysis programs, Pig renders a high-level programming language called Pig Latin.
Several operators are provided by Pig Latin using which personalized functions for writing, reading, and
processing of data can be developed by programmers.
• For analyzing data through Apache Pig, we need to write scripts using Pig Latin. Then, these scripts need to
be transformed into MapReduce tasks. This is achieved with the help of Pig Engine.
What is Apache Pig?
• Apache Pig is an abstraction over MapReduce. It is a tool/platform which is used to analyze larger sets
of data representing them as data flows. Pig is generally used with Hadoop; we can perform all the data
manipulation operations in Hadoop using Apache Pig.
• To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language
provides various operators using which programmers can develop their own functions for reading,
writing, and processing data.
• To analyze data using Apache Pig, programmers need to write scripts using Pig Latin language. All these
scripts are internally converted to Map and Reduce tasks. Apache Pig has a component known as Pig
Engine that accepts the Pig Latin scripts as input and converts those scripts into MapReduce jobs.
Why Do We Need Apache Pig?
• Using Pig Latin, programmers can perform MapReduce tasks easily without having to type complex
codes in Java.
• Apache Pig uses a multi-query approach, thereby reducing the length of code. For example, an
operation that would require you to type 200 lines of code (LoC) in Java can be done by typing
as few as 10 LoC in Apache Pig. Ultimately, Apache Pig reduces the development time by
almost 16 times.
• Pig Latin is an SQL-like language, and it is easy to learn Apache Pig when you are familiar with SQL.
• Apache Pig provides many built-in operators to support data operations like joins, filters, ordering,
etc. In addition, it also provides nested data types like tuples, bags, and maps that are missing from
MapReduce.
Features of Pig
• Rich set of operators − It provides many operators to perform operations like join, sort, filter, etc.
• Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script if you are good
at SQL.
• Optimization opportunities − The tasks in Apache Pig optimize their execution automatically, so
the programmers need to focus only on semantics of the language.
• Extensibility − Using the existing operators, users can develop their own functions to read, process,
and write data.
• UDF’s − Pig provides the facility to create User-defined Functions in other programming languages
such as Java and invoke or embed them in Pig Scripts.
• Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured as well as
unstructured. It stores the results in HDFS.
Apache Pig vs. MapReduce
• Apache Pig is a data flow language, whereas MapReduce is a data processing paradigm.
• Pig is a high-level language, whereas MapReduce is low level and rigid.
• Performing a join operation in Apache Pig is pretty simple, whereas it is quite difficult in MapReduce to
perform a join between datasets.
• Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig,
whereas exposure to Java is a must to work with MapReduce.
• Apache Pig uses a multi-query approach, thereby reducing the length of the code to a great extent,
whereas MapReduce requires almost 20 times more lines of code to perform the same task.
• There is no need for compilation in Pig; on execution, every Apache Pig operator is converted internally
into a MapReduce job, whereas MapReduce jobs have a long compilation process.
Pig vs. SQL
• Pig Latin is a procedural language, whereas SQL is a declarative language.
• In Apache Pig, schema is optional; we can store data without designing a schema (fields are referenced
positionally as $0, $1, etc.), whereas schema is mandatory in SQL.
• The data model in Apache Pig is nested relational, whereas the data model used in SQL is flat relational.
• Apache Pig provides limited opportunity for query optimization, whereas there is more opportunity for
query optimization in SQL.
Apache Pig vs. Hive
• Apache Pig uses a language called Pig Latin and was originally created at Yahoo, whereas Hive uses a
language called HiveQL and was originally created at Facebook.
• Pig Latin is a data flow language, whereas HiveQL is a query processing language.
• Pig Latin is a procedural language and fits the pipeline paradigm, whereas HiveQL is a declarative
language.
• Apache Pig can handle structured, unstructured, and semi-structured data, whereas Hive is mostly for
structured data.
Applications of Apache Pig
• To process huge data sources such as web logs.
• To perform data processing for search platforms.
• To process time sensitive data loads.
Apache Pig Components
• Initially the Pig Scripts are handled by the Parser. It checks the
syntax of the script, does type checking, and other miscellaneous
checks. The output of the parser will be a DAG (directed acyclic
graph), which represents the Pig Latin statements and logical
operators.
• In the DAG, the logical operators of the script are represented as the
nodes and the data flows are represented as edges.
• Optimizer: The logical plan (DAG) is passed to the logical optimizer,
which carries out logical optimizations such as projection
pushdown.
• Compiler: The compiler compiles the optimized logical plan into a
series of MapReduce jobs.
• Execution engine: Finally, the MapReduce jobs are submitted to
Hadoop in sorted order and executed on Hadoop, producing the
desired results.
Unit II Hadoop Ecosystem_Updated.pptx

More Related Content

PPTX
BDA: Introduction to HIVE, PIG and HBASE
tripathineeharika
 
PPTX
Hive - A theoretical overview in Detail.pptx
Mithun DSouza
 
PDF
BIGDATA ppts
Krisshhna Daasaarii
 
PPTX
Hive and querying data
KarthigaGunasekaran1
 
PPTX
Apache HBase™
Prashant Gupta
 
PPTX
HIVE-NEED, CHARACTERISTICS, OPTIMIZATION
HariPalani10
 
PPTX
hive architecture and hive components in detail
HariKumar544765
 
PPTX
Big Data and Cloud Computing
Farzad Nozarian
 
BDA: Introduction to HIVE, PIG and HBASE
tripathineeharika
 
Hive - A theoretical overview in Detail.pptx
Mithun DSouza
 
BIGDATA ppts
Krisshhna Daasaarii
 
Hive and querying data
KarthigaGunasekaran1
 
Apache HBase™
Prashant Gupta
 
HIVE-NEED, CHARACTERISTICS, OPTIMIZATION
HariPalani10
 
hive architecture and hive components in detail
HariKumar544765
 
Big Data and Cloud Computing
Farzad Nozarian
 

Similar to Unit II Hadoop Ecosystem_Updated.pptx (20)

PDF
Techincal Talk Hbase-Ditributed,no-sql database
Rishabh Dugar
 
PPTX
Apache Hive
tusharsinghal58
 
PPTX
Apache hive introduction
Mahmood Reza Esmaili Zand
 
PPTX
An Introduction-to-Hive and its Applications and Implementations.pptx
iaeronlineexm
 
PPTX
Hive
Manas Nayak
 
PPTX
Big Data UNIT 2 AKTU syllabus all topics covered
chinky1118
 
PPTX
Cloudera Hadoop Distribution
Thisara Pramuditha
 
PPTX
01-Introduction-to-Hive.pptx
VIJAYAPRABAP
 
PPTX
Introduction to Hadoop
Dr. C.V. Suresh Babu
 
PPTX
Big Data Analytics Presentation on the resourcefulness of Big data
nextstep013
 
PDF
What is Apache Hadoop and its ecosystem?
tommychauhan
 
PPTX
Big data and tools
Shivam Shukla
 
PPTX
HBase.pptx
Sadhik7
 
PPTX
Big data solutions in Azure
Mostafa
 
ODP
Apache hive1
sheetal sharma
 
PPTX
HADOOP ECOSYSTEM ALL ABOUT HADOOP,HADOOP PPT BIG DATA.pptx
Himani271945
 
PPTX
Impala for PhillyDB Meetup
Shravan (Sean) Pabba
 
PPTX
Building Big data solutions in Azure
Mostafa
 
PPTX
hadoop-ecosystem-ppt.pptx
raghavanand36
 
PPTX
Hadoop and their in big data analysis EcoSystem.pptx
Rahul Borate
 
Techincal Talk Hbase-Ditributed,no-sql database
Rishabh Dugar
 
Apache Hive
tusharsinghal58
 
Apache hive introduction
Mahmood Reza Esmaili Zand
 
An Introduction-to-Hive and its Applications and Implementations.pptx
iaeronlineexm
 
Big Data UNIT 2 AKTU syllabus all topics covered
chinky1118
 
Cloudera Hadoop Distribution
Thisara Pramuditha
 
01-Introduction-to-Hive.pptx
VIJAYAPRABAP
 
Introduction to Hadoop
Dr. C.V. Suresh Babu
 
Big Data Analytics Presentation on the resourcefulness of Big data
nextstep013
 
What is Apache Hadoop and its ecosystem?
tommychauhan
 
Big data and tools
Shivam Shukla
 
HBase.pptx
Sadhik7
 
Big data solutions in Azure
Mostafa
 
Apache hive1
sheetal sharma
 
HADOOP ECOSYSTEM ALL ABOUT HADOOP,HADOOP PPT BIG DATA.pptx
Himani271945
 
Impala for PhillyDB Meetup
Shravan (Sean) Pabba
 
Building Big data solutions in Azure
Mostafa
 
hadoop-ecosystem-ppt.pptx
raghavanand36
 
Hadoop and their in big data analysis EcoSystem.pptx
Rahul Borate
 
Ad

Recently uploaded (20)

PDF
Security features in Dell, HP, and Lenovo PC systems: A research-based compar...
Principled Technologies
 
PPTX
New ThousandEyes Product Innovations: Cisco Live June 2025
ThousandEyes
 
PDF
The Future of Mobile Is Context-Aware—Are You Ready?
iProgrammer Solutions Private Limited
 
PDF
Research-Fundamentals-and-Topic-Development.pdf
ayesha butalia
 
PDF
AI Unleashed - Shaping the Future -Starting Today - AIOUG Yatra 2025 - For Co...
Sandesh Rao
 
PDF
Make GenAI investments go further with the Dell AI Factory
Principled Technologies
 
PDF
Accelerating Oracle Database 23ai Troubleshooting with Oracle AHF Fleet Insig...
Sandesh Rao
 
PDF
Cloud-Migration-Best-Practices-A-Practical-Guide-to-AWS-Azure-and-Google-Clou...
Artjoker Software Development Company
 
PPTX
cloud computing vai.pptx for the project
vaibhavdobariyal79
 
PDF
Unlocking the Future- AI Agents Meet Oracle Database 23ai - AIOUG Yatra 2025.pdf
Sandesh Rao
 
PDF
CIFDAQ's Market Wrap : Bears Back in Control?
CIFDAQ
 
PDF
Using Anchore and DefectDojo to Stand Up Your DevSecOps Function
Anchore
 
PDF
BLW VOCATIONAL TRAINING SUMMER INTERNSHIP REPORT
codernjn73
 
PDF
Data_Analytics_vs_Data_Science_vs_BI_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
PPTX
Applied-Statistics-Mastering-Data-Driven-Decisions.pptx
parmaryashparmaryash
 
PDF
How Open Source Changed My Career by abdelrahman ismail
a0m0rajab1
 
PDF
Automating ArcGIS Content Discovery with FME: A Real World Use Case
Safe Software
 
PPTX
Dev Dives: Automate, test, and deploy in one place—with Unified Developer Exp...
AndreeaTom
 
PDF
MASTERDECK GRAPHSUMMIT SYDNEY (Public).pdf
Neo4j
 
PDF
Economic Impact of Data Centres to the Malaysian Economy
flintglobalapac
 
Security features in Dell, HP, and Lenovo PC systems: A research-based compar...
Principled Technologies
 
New ThousandEyes Product Innovations: Cisco Live June 2025
ThousandEyes
 
The Future of Mobile Is Context-Aware—Are You Ready?
iProgrammer Solutions Private Limited
 
Research-Fundamentals-and-Topic-Development.pdf
ayesha butalia
 
AI Unleashed - Shaping the Future -Starting Today - AIOUG Yatra 2025 - For Co...
Sandesh Rao
 
Make GenAI investments go further with the Dell AI Factory
Principled Technologies
 
Accelerating Oracle Database 23ai Troubleshooting with Oracle AHF Fleet Insig...
Sandesh Rao
 
Cloud-Migration-Best-Practices-A-Practical-Guide-to-AWS-Azure-and-Google-Clou...
Artjoker Software Development Company
 
cloud computing vai.pptx for the project
vaibhavdobariyal79
 
Unlocking the Future- AI Agents Meet Oracle Database 23ai - AIOUG Yatra 2025.pdf
Sandesh Rao
 
CIFDAQ's Market Wrap : Bears Back in Control?
CIFDAQ
 
Using Anchore and DefectDojo to Stand Up Your DevSecOps Function
Anchore
 
BLW VOCATIONAL TRAINING SUMMER INTERNSHIP REPORT
codernjn73
 
Data_Analytics_vs_Data_Science_vs_BI_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
Applied-Statistics-Mastering-Data-Driven-Decisions.pptx
parmaryashparmaryash
 
How Open Source Changed My Career by abdelrahman ismail
a0m0rajab1
 
Automating ArcGIS Content Discovery with FME: A Real World Use Case
Safe Software
 
Dev Dives: Automate, test, and deploy in one place—with Unified Developer Exp...
AndreeaTom
 
MASTERDECK GRAPHSUMMIT SYDNEY (Public).pdf
Neo4j
 
Economic Impact of Data Centres to the Malaysian Economy
flintglobalapac
 
Ad

Unit II Hadoop Ecosystem_Updated.pptx

  • 2. PIG, Zookeeper, how it helps in monitoring a cluster, HBase uses Zookeeper and how to Build Applications with Zookeeper. SPARK: Introduction to Data Analysis with Spark, Downloading Spark and Getting Started, Programming with RDDs, Machine Learning with MLlib.
  • 3. HBase • Limitations of Hadoop: Hadoop can perform only batch processing, and data will be accessed only in a sequential manner. That means one has to search the entire dataset even for the simplest of jobs. • A huge dataset when processed results in another huge data set, which should also be processed sequentially. At this point, a new solution is needed to access any point of data in a single unit of time (random access). • HBase is part of the Hadoop ecosystem which offers random real-time read/write access to data in the Hadoop File System. HBase is a Hadoop project which is Open Source, distributed Hadoop database which has its genesis in the Google’s Bigtable. • Its programming language is Java. • Now, it is an integral part of the Apache Software Foundation and the Hadoop ecosystem. • Also, it is a high availability database which exclusively runs on top of the HDFS. • It is a column-oriented database built on top of HDFS.
  • 4. • Why should you use HBase Technology? • Along with HDFS and MapReduce, HBase is one of the core components of the Hadoop ecosystem. Here are some salient features of HBase which make it significant to use: • Apache HBase has a completely distributed architecture. • It can easily work on extremely large scale data. • HBase offers high security and easy management which results in unprecedented high write throughput. • For both structured and semi-structured data types we can use it. • Moreover, the MapReduce jobs can be backed with HBase Tables.
  • 5. HBase and HDFS HDFS HBase HDFS is a distributed file system suitable for storing large files. HBase is a database built on top of the HDFS. HDFS does not support fast individual record lookups. HBase provides fast lookups for larger tables. It provides high latency batch processing; no concept of batch processing. It provides low latency access to single rows from billions of records (Random access). It provides only sequential access of data. HBase internally uses Hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups.
  • 6. Storage Mechanism in HBase • HBase is a column-oriented database and the tables in it are sorted by row. • Table is a collection of rows. • Row is a collection of column families. • Column family is a collection of columns. • Column is a collection of key value pairs. Rowid Column Family Column Family Column Family Column Family col1 col2 col3 col1 col2 col3 col1 col2 col3 col1 col2 col3 1 2 3
  • 7. Apache HBase Architecture • We know HBase is acting like a big table to record the data, and tables are split into regions. Again, Regions are divided vertically by family column to create stores. These stores are called files in HDFS. • HBase has three major components which are master servers, client library, and region servers. It's up to the requirement of the organization whether to add region servers or not. • MasterServer : It allocates regions to the region servers with the help of zookeeper. It balances the load across the region servers. MasterServer is responsible for changes like schema changes and metadata operations, like creating column families and tables. • Regions: Regions are nothing but tables which are split into small tables and spread across the region servers. • RegionServer: Region servers communicate with other components and complete the below tasks: • It communicates with the client to handle data related tasks. • It takes care of the read and write tasks of the regions under it. • It decides the size of a region based on the threshold it has.
  • 9. • The memory here acts as a temporary space to store the data. When anything is entered into Hbse, it is initially stored in the memory, and later, it will be transferred to HFiles where data is stored in blocks. • Zookeeper: Zookeeper is an open source project, and it facilitates the services like managing the configuration data, providing distributed synchronisation, naming, etc. It helps the master server in discovering the available servers. Zookeeper helps the client servers in communicating with region servers.
  • 10. Secondary Indexing • Secondary indexes are an orthogonal way to access data from its primary access path. In HBase, you have a single index that is lexicographically sorted on the primary row key. Access to records in any way other than through the primary row requires scanning over potentially all the rows in the table to test them against your filter. With secondary indexing, the columns or expressions you index form an alternate row key to allow point lookups and range scans along this new axis. • Covered Indexes: In covered indexes - we do not need to go back to the primary table once we have found the index entry. Instead, we bundle the data we care about right in the index rows, saving read- time overhead.For example, the following would create an index on the v1 and v2 columns and include the v3 column in the index as well to prevent having to get it from the data table: • CREATE INDEX my_index ON my_table (v1,v2) INCLUDE(v3)
  • 11. • Functional Indexes: Functional indexes allow you to create an index not just on columns, but on an arbitrary expressions. Then when a query uses that expression, the index may be used to retrieve the results instead of the data table. For example, you could create an index on UPPER(FIRST_NAME||‘ ’||LAST_NAME) to allow you to do case insensitive searches on the combined first name and last name of a person. • For example, the following would create this functional index: • CREATE INDEX UPPER_NAME_IDX ON EMP (UPPER(FIRST_NAME||' '||LAST_NAME)) • With this index in place, when the following query is issued, the index would be used instead of the data table to retrieve the results: • SELECT EMP_ID FROM EMP WHERE UPPER(FIRST_NAME||' '||LAST_NAME)='JOHN DOE'
  • 12. Applications of HBase • It is used whenever there is a need to write heavy applications. • HBase is used whenever we need to provide fast random access to available data. • Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
  • 13. Apache Hive • Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale. A data warehouse provides a central store of information that can easily be analyzed to make informed, data driven decisions. Hive allows users to read, write, and manage petabytes of data using SQL. • Hive is built on top of Apache Hadoop, which is an open-source framework used to efficiently store and process large datasets. As a result, Hive is closely integrated with Hadoop, and is designed to work quickly on petabytes of data. What makes Hive unique is the ability to query large datasets, leveraging Apache Tez or MapReduce, with a SQL-like interface. • Motivation: • Yahoo worked on Pig to facilitate application deployment on Hadoop. Their need mainly was focused on unstructured data Simultaneously Facebook started working on deploying warehouse solutions on Hadoop that resulted in Hive. The size of data being collected and analyzed in industry for business intelligence (BI) is growing rapidly making traditional warehousing solution prohibitively expensive.
  • 14. How does Hive work? • Hive was created to allow non-programmers familiar with SQL to work with petabytes of data, using a SQL-like interface called HiveQL. Hive uses batch processing so that it works quickly across a very large distributed database. • Hive transforms HiveQL queries into MapReduce or Tez jobs that run on Apache Hadoop’s distributed job scheduling framework, Yet Another Resource Negotiator (YARN). It queries data stored in a distributed storage solution, like the Hadoop Distributed File System (HDFS) or Amazon S3. • Hive stores its database and table metadata in a metastore, which is a database or file backed store that enables easy data abstraction and discovery. • Hive includes HCatalog, which is a table and storage management layer that reads data from the Hive metastore to facilitate seamless integration between Hive, Apache Pig, and MapReduce. • By using the metastore, HCatalog allows Pig and MapReduce to use the same data structures as Hive, so that the metadata doesn’t have to be redefined for each engine.
  • 15. Benefits of Hive B Fast: Hive is designed to quickly handle petabytes of data using batch processing. Familiar: Hive provides a familiar, SQL-like interface that is accessible to non-programmers. Scalable: Hive is easy to distribute and scale based on your needs. It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce or Spark jobs. It is capable of analyzing large datasets stored in HDFS. It allows different storage types such as plain text, RCFile, and HBase. It uses indexing to accelerate queries. It can operate on compressed data stored in the Hadoop ecosystem. It supports user-defined functions (UDFs) where user can provide its functionality.
  • 16. When to use Hive ● Most suitable for data warehouse applications where relatively static data is analyzed. ● Fast response time is not required. ● Data is not changing rapidly. ● An abstraction to underlying MR program. ● Hive of course is a good choice for queries that lend themselves to being expressed in SQL, particularly long-running queries where fault tolerance is desirable. ● Hive can be a good choice if you’d like to write feature-rich, fault-tolerant, batch (i.e., not near-real- time) transformation or ETL jobs in a pluggable SQL engine.
  • 17. Hive Architecture Hive Client: Hive allows writing applications in various languages, including Java, Python, and C++. It supports different types of clients such as:- ● Thrift Server - It is a cross-language service provider platform that serves the request from all those programming languages that supports Thrift. ● JDBC Driver - It is used to establish a connection between hive and Java applications. The JDBC Driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver. ● ODBC Driver - It allows the applications that support the ODBC protocol to connect to Hive.
  • 18. Hive Services: The following are the services provided by Hive ● Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and commands. ● Hive Web User Interface - The Hive Web UI is just an alternative of Hive CLI. It provides a web-based GUI for executing Hive queries and commands. ● Hive MetaStore - It is a central repository that stores all the structure information of various tables and partitions in the warehouse. It also includes metadata of column and its type information, the serializers and deserializers which is used to read and write data and the corresponding HDFS files where the data is stored. ● Hive Server - It is referred to as Apache Thrift Server. It accepts the request from different clients and provides it to Hive Driver. ● Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and JDBC/ODBC driver. It transfers the queries to the compiler. ● Hive Compiler - The purpose of the compiler is to parse the query and perform semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs. ● Hive Execution Engine - Optimizer generates the logical plan in the form of DAG of map-reduce tasks and HDFS tasks. In the end, the execution engine executes the incoming tasks in the order of their dependencies.
  • 19. Partitioning in Hive The partitioning in Hive means dividing the table into some parts based on the values of a particular column like date, course, city or country. The advantage of partitioning is that since the data is stored in slices, the query response time becomes faster. As we know that Hadoop is used to handle the huge amount of data, it is always required to use the best approach to deal with it. The partitioning in Hive is the best example of it. Let's assume we have a data of 10 million students studying in an institute. Now, we have to fetch the students of a particular course. If we use a traditional approach, we have to go through the entire data. This leads to performance degradation. In such a case, we can adopt the better approach i.e., partitioning in Hive and divide the data among the different datasets based on particular columns. The partitioning in Hive can be executed in two ways - ● Static partitioning ● Dynamic partitioning
  • 20. Static Partitioning: In static or manual partitioning, it is required to pass the values of partitioned columns manually while loading the data into the table. Hence, the data file doesn't contain the partitioned columns. create table student (id int, name string, age int, institute string) partitioned by (course string) row format delimited fields terminated by ','; Dynamic Partitioning: In dynamic partitioning, the values of partitioned columns exist within the table. So, it is not required to pass the values of partitioned columns manually. Enable the dynamic partition by using the following commands: - hive> set hive.exec.dynamic.partition=true; hive> set hive.exec.dynamic.partition.mode=nonstrict; Create a partition table by using the following command: - hive> create table student_part (id int, name string, age int, institute string) partitioned by (course string) row format delimited fields terminated by ',';
  • 21. Comparison with Traditional Database Below are the key features of Hive that differ from RDBMS. ● Hive resembles a traditional database by supporting SQL interface but it is not a full database. Hive can be better called as data warehouse instead of database. ● Hive enforces schema on read time whereas RDBMS enforces schema on write time. ● In RDBMS, a table’s schema is enforced at data load time, If the data being loaded doesn’t conform to the schema, then it is rejected. This design is called schema on write. ● But Hive doesn’t verify the data when it is loaded, but rather when a it is retrieved. This is called schema on read. ● Schema on read makes for a very fast initial load, since the data does not have to be read, parsed, and serialized to disk in the database’s internal format. The load operation is just a file copy or move. ● Schema on write makes query time performance faster, since the database can index columns and perform compression on the data but it takes longer to load data into the database.
  • 22. ● Hive is based on the notion of Write once, Read many times but RDBMS is designed for Read and Write many times. ● In RDBMS, record level updates, insertions and deletes, transactions and indexes are possible. Whereas these are not allowed in Hive because Hive was built to operate over HDFS data using MapReduce, where full-table scans are the norm and a table update is achieved by transforming the data into a new table. ● In RDBMS, maximum data size allowed will be in 10’s of Terabytes but whereas Hive can 100’s Petabytes very easily. ● As Hadoop is a batch-oriented system, Hive doesn’t support OLTP (Online Transaction Processing) but it is closer to OLAP (Online Analytical Processing) but not ideal since there is significant latency between issuing a query and receiving a reply, due to the overhead of Mapreduce jobs and due to the size of the data sets Hadoop was designed to serve.
  • 23. ● RDBMS is best suited for dynamic data analysis and where fast responses are expected but Hive is suited for data warehouse applications, where relatively static data is analyzed, fast response times are not required, and when the data is not changing rapidly. ● To overcome the limitations of Hive, HBase is being integrated with Hive to support record level operations and OLAP. ● Hive is very easily scalable at low cost but RDBMS is not that much scalable that too it is very costly scale up.
  • 24. HIVE Data Types Hive data types are categorized in numeric types, string types, misc types, and complex types. A list of Hive data types is given below. Integer Types: TINYINT, SMALLINT, INT, BIGINT Decimal Type: Float, Double Date/Time Types: TIMESTAMP, DATES String Types: STRING, Varchar, Char Complex Type: Struct, Map, Array
  • 25. HiveQL • Hive Query Language (HiveQL) is a query language in Apache Hive for processing and analyzing structured data. It separates users from the complexity of Map Reduce programming. It reuses common concepts from relational databases, such as tables, rows, columns, and schema, to ease learning. Hive provides a CLI for Hive query writing using Hive Query Language (HiveQL). • Most interactions tend to take place over a command line interface (CLI). Generally, HiveQL syntax is similar to the SQL syntax that most data analysts are familiar with. Hive supports four file formats which are: TEXTFILE, SEQUENCEFILE, ORC and RCFILE (Record Columnar File). • Hive Queries: Hive provides SQL type querying language for the ETL purpose on top of Hadoop file system. Hive Query language (HiveQL) provides SQL type environment in Hive to work with tables, databases, queries. We can have a different type of Clauses associated with Hive to perform different type data manipulations and querying.
  • 26. Hive queries provide the following features: • Data modeling, such as creation of databases, tables, etc. • ETL functionality: extraction, transformation, and loading of data into tables • Joins to merge different data tables • User-specific custom scripts for ease of coding • A faster querying tool on top of Hadoop
  • 27. Querying Data Create a Database: Create a database named “company” by running the create command: create database company; Next, verify the database was created by running the show command: show databases; Open the “company” database with the following command: use company; Create a Table in Hive: Specify column names and types when creating a table. Create the table by running the following command: create table employees (id int, name string, country string, department string, salary int);
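  • If the table will be populated from a delimited text file (as on the next slide), a row format is usually declared as well. A minimal sketch, assuming the file uses commas as field separators:
  CREATE TABLE employees (
    id INT, name STRING, country STRING, department STRING, salary INT
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','   -- assumption: adjust the delimiter to match the actual file
  STORED AS TEXTFILE;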
  • 28. • Load Data From a File: The table has been created, but it is empty until data is loaded from the source file in HDFS (here '/hdoop/employees.txt'). • Load the data by running the load command: load data inpath '/hdoop/employees.txt' overwrite into table employees; • Verify that the data was loaded by running a select: select * from employees; • Display Hive Data: There are several options for displaying data from the table. • Display Columns: Show the columns of a table by running the desc command: desc employees; • Display Selected Data: select name, country from employees;
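  • As a side note, if the source file sits on the local filesystem rather than in HDFS, Hive's LOCAL keyword can be used; a hedged sketch with an illustrative path:
  LOAD DATA LOCAL INPATH '/tmp/employees.txt' OVERWRITE INTO TABLE employees;   -- copies the local file into the table's warehouse directory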
  • 29. Sorting and Aggregating • ORDER AND SORT • ORDER BY (ASC|DESC): This is similar to the RDBMS ORDER BY statement. A sorted order is maintained across all of the output from every reducer. It performs a global sort using only one reducer, so it takes longer to return the result; using LIMIT with ORDER BY is strongly recommended. • SORT BY (ASC|DESC): This indicates which columns to sort by when ordering the records fed to each reducer; the rows are sorted before they are sent to each reducer, so each reducer's output is sorted, but the overall output is not globally sorted. • DISTRIBUTE BY: Rows with matching column values are partitioned to the same reducer. When used alone, it does not guarantee sorted input to the reducer. The DISTRIBUTE BY statement is similar to GROUP BY in an RDBMS in the sense of deciding which reducer the mapper output is sent to. When used with SORT BY, DISTRIBUTE BY must be specified before the SORT BY statement. • CLUSTER BY: This is shorthand for performing DISTRIBUTE BY and SORT BY on the same set of columns; the data is sorted locally within each reducer. The CLUSTER BY statement does not support ASC or DESC yet.
  • 30. Sorting and Aggregating • Sort, order, distribute & cluster: • The SORT BY and ORDER BY clauses define the order of the output data, while the DISTRIBUTE BY and CLUSTER BY clauses distribute the data to multiple reducers based on the key columns. • We can use SORT BY, ORDER BY, DISTRIBUTE BY or CLUSTER BY in a Hive SELECT query to get the output data in the desired order. • SORT BY: • The SORT BY clause sorts the data per reducer. As a result, if we have N reducers, we get N sorted files in the output, and these files can have overlapping data ranges. The output is not globally sorted, because each reducer sorts only the rows routed to it, using the key columns given in the SORT BY clause. The syntax of the SORT BY clause is as below: • SELECT Col1, Col2, ... ColN FROM TableName SORT BY Col1 <ASC | DESC>, Col2 <ASC | DESC>, ... ColN <ASC | DESC>
  • 31. • ORDER BY: • The ORDER BY clause orders the data globally. Because it must guarantee a global ordering, all the data has to pass through a single reducer, so the ORDER BY clause produces a single output file. • Funneling all the data through one reducer can become a performance killer, especially if the output dataset is large, so ORDER BY should generally be avoided in Hive queries over large data. • However, if we need to enforce a global ordering and the output dataset is not that big, we can use this clause to order the final dataset globally. • The syntax of the ORDER BY clause in Hive is as below: • SELECT Col1, Col2, ... ColN FROM TableName ORDER BY Col1 <ASC | DESC>, Col2 <ASC | DESC>, ... ColN <ASC | DESC>
  • 32. • DISTRIBUTE BY: • The DISTRIBUTE BY clause is used to distribute the input rows among reducers. It ensures that all rows with the same key column values go to the same reducer, so if we need to partition the data on some key column, we can use the DISTRIBUTE BY clause in a Hive query. However, DISTRIBUTE BY does not sort the data, either at the reducer level or globally, so rows with the same key values might not be placed next to each other in the output. • As a result, the DISTRIBUTE BY clause may produce N unsorted output files, where N is the number of reducers used in query processing, but no key value is split across more than one output file. • The syntax of the DISTRIBUTE BY clause in Hive is as below: • SELECT Col1, Col2, ... ColN FROM TableName DISTRIBUTE BY Col1, Col2, ... ColN
  • 33. • CLUSTER BY • The CLUSTER BY clause is a combination of the DISTRIBUTE BY and SORT BY clauses: its output is equivalent to DISTRIBUTE BY + SORT BY on the same columns. CLUSTER BY distributes the data based on the key columns and then sorts each reducer's output, so rows with the same key values end up adjacent to each other. The output is therefore sorted at the reducer level, and we get N sorted output files, where N is the number of reducers used in query processing; each key value appears in only one of those files. If the query is processed by only one reducer, the output is equivalent to that of the ORDER BY clause. • The syntax of the CLUSTER BY clause is as below: • SELECT Col1, Col2, ... ColN FROM TableName CLUSTER BY Col1, Col2, ... ColN
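  • As a quick hedged illustration on the employees table created earlier (assuming department and salary are sensible columns to distribute and sort on):
  SELECT id, name, salary FROM employees ORDER BY salary DESC LIMIT 10;                                        -- global sort; one reducer, one sorted file
  SELECT id, name, salary FROM employees SORT BY salary DESC;                                                  -- each reducer's output sorted, not globally
  SELECT id, name, department, salary FROM employees DISTRIBUTE BY department;                                 -- same department goes to the same reducer, no sorting
  SELECT id, name, department, salary FROM employees DISTRIBUTE BY department SORT BY department, salary DESC; -- DISTRIBUTE BY must precede SORT BY
  SELECT id, name, department, salary FROM employees CLUSTER BY department;                                    -- shorthand for DISTRIBUTE BY department SORT BY department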
  • 35. • Hive Aggregate Functions: Hive aggregate functions are among the most used built-in functions; they take a set of values and return a single value, and when used with a group they aggregate the values in each group and return one value per group. • Aggregate functions in Hive can be used with or without GROUP BY, but they are mostly used with GROUP BY. • Most of these functions ignore NULL values.
  • 37. Functions and descriptions: • COUNT(): Returns the count of all rows in a table, including rows containing NULL values; when a column is given as input, NULL values in that column are ignored for the count, and duplicates can be excluded with DISTINCT. Return: BIGINT. • SUM(): Returns the sum of all values in a column; when used with a group, it returns the sum for each group, and duplicates can be excluded with DISTINCT. Return: DOUBLE. • AVG(): Returns the average of all values in a column; when used with a group, it returns an average for each group. Return: DOUBLE. • MIN(): Returns the minimum value of the column from all rows; when used with a group, it returns a minimum for each group. Return: DOUBLE. • MAX(): Returns the maximum value of the column from all rows; when used with a group, it returns a maximum for each group. Return: DOUBLE. • VARIANCE(col): Returns the variance of a numeric column for all rows or for each group. Return: DOUBLE. • STDDEV_SAMP(col): Returns the sample standard deviation of all values in a column or for each group. Return: DOUBLE. • CORR(col1, col2): Returns the Pearson coefficient of correlation of a pair of numeric columns in the group. Return: DOUBLE.
  • 38. • Examples: • select count(*) from employee; • select count(salary) from employee; • select count(distinct gender, salary) from employee; • select sum(salary) from employee; • select sum(distinct salary) from employee; • select avg(salary) from employee group by age; • select age,avg(salary) from employee group by age; • select min(salary) from employee; • select max(salary) from employee; • select variance(salary) from employee; • select stddev_pop(salary) from employee;
  • 39. Joins & Sub queries • HiveQL – JOIN: The HiveQL JOIN clause is used to combine the data of two or more tables based on a related column between them. The various types of HiveQL joins are: • Inner Join • Left Outer Join • Right Outer Join • Full Outer Join
  • 40. • Inner Join: Only the records common to both tables are retrieved; the join is written with the JOIN keyword between the tables. • Left Outer Join: A LEFT OUTER JOIN returns all rows from the left table even when there are no matches in the right table; if the ON clause matches zero records in the right table, the join still returns a row with NULL in each column from the right table. • Right Outer Join: A RIGHT OUTER JOIN returns all rows from the right table even when there are no matches in the left table; if the ON clause matches zero records in the left table, the join still returns a row with NULL in each column from the left table. • Full Outer Join: A FULL OUTER JOIN combines the records of both tables based on the JOIN condition given in the query; it returns all records from both tables and fills in NULL values for columns where no match exists on either side.
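  • A minimal hedged sketch of these joins, reusing the employees table from earlier together with a hypothetical departments(dept_name, location) table introduced only for illustration:
  SELECT e.name, d.location FROM employees e JOIN departments d ON (e.department = d.dept_name);             -- inner join: only matching rows
  SELECT e.name, d.location FROM employees e LEFT OUTER JOIN departments d ON (e.department = d.dept_name);  -- all employees, NULL location if no match
  SELECT e.name, d.location FROM employees e RIGHT OUTER JOIN departments d ON (e.department = d.dept_name); -- all departments, NULL name if no match
  SELECT e.name, d.location FROM employees e FULL OUTER JOIN departments d ON (e.department = d.dept_name);  -- all rows from both sides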
  • 41. • Sub queries: A query nested inside another query is known as a subquery. The main query depends on the values returned by its subqueries. Subqueries can be classified into two types: • Subqueries in the FROM clause • Subqueries in the WHERE clause • When to use: to derive a value that combines column values from different tables, when one table's values depend on another table, or to compare a column's values against values from another table. • SELECT col FROM ( SELECT a+b AS col FROM t1) t2 • SELECT * FROM A WHERE A.a IN (SELECT foo FROM B);
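  • Two more hedged examples on the employees table (big_departments is a hypothetical table used only for illustration; IN subqueries in WHERE require a Hive version that supports them):
  SELECT dept, avg(sal) AS avg_sal FROM (SELECT department AS dept, salary AS sal FROM employees) t GROUP BY dept;   -- subquery in the FROM clause
  SELECT name, department FROM employees WHERE department IN (SELECT dept_name FROM big_departments);               -- subquery in the WHERE clause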
  • 42. Apache PIG • Apache Pig is a high-level platform for analyzing huge datasets. It was developed by Yahoo! and is generally used with Hadoop to perform a wide range of data administration operations. • For writing data analysis programs, Pig provides a high-level language called Pig Latin. Pig Latin offers several operators with which programmers can develop personalized functions for reading, writing, and processing data. • To analyze data with Apache Pig, we write scripts in Pig Latin; these scripts are then transformed into MapReduce tasks with the help of the Pig Engine.
  • 43. What is Apache Pig? • Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large sets of data by representing them as data flows. Pig is generally used with Hadoop; we can perform all of the data manipulation operations in Hadoop using Apache Pig. • To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language provides various operators with which programmers can develop their own functions for reading, writing, and processing data. • To analyze data using Apache Pig, programmers write scripts in the Pig Latin language. All these scripts are internally converted into Map and Reduce tasks: Apache Pig has a component known as Pig Engine that accepts Pig Latin scripts as input and converts them into MapReduce jobs.
  • 44. Why Do We Need Apache Pig? • Using Pig Latin, programmers can perform MapReduce tasks easily without having to write complex Java code. • Apache Pig uses a multi-query approach, thereby reducing the length of code. For example, an operation that would require 200 lines of code (LoC) in Java can be done in as few as 10 LoC in Apache Pig; ultimately, Apache Pig reduces development time by almost 16 times. • Pig Latin is a SQL-like language, so it is easy to learn Apache Pig if you are familiar with SQL. • Apache Pig provides many built-in operators to support data operations like joins, filters, and ordering. In addition, it provides nested data types like tuples, bags, and maps that are missing from MapReduce.
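  • To give a feel for how compact Pig Latin is, here is a minimal hedged word-count sketch (the input path and relation names are purely illustrative):
  lines   = LOAD '/data/input.txt' AS (line:chararray);                   -- load raw lines of text
  words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;       -- split each line into words
  grouped = GROUP words BY word;                                          -- group identical words together
  counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;  -- count each group
  DUMP counts;                                                            -- print the result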
  • 45. Features of Pig • Rich set of operators − It provides many operators to perform operations like join, sort, filter, etc. • Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script if you are good at SQL. • Optimization opportunities − The tasks in Apache Pig optimize their execution automatically, so programmers need to focus only on the semantics of the language. • Extensibility − Using the existing operators, users can develop their own functions to read, process, and write data. • UDFs − Pig provides the facility to create User-Defined Functions in other programming languages such as Java and to invoke or embed them in Pig scripts. • Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured and unstructured, and stores the results in HDFS.
  • 46. Apache Pig vs MapReduce: • Apache Pig is a data flow language, whereas MapReduce is a data processing paradigm. • Pig is a high-level language; MapReduce is low level and rigid. • Performing a join in Apache Pig is pretty simple, whereas it is quite difficult in MapReduce to join datasets. • Any novice programmer with basic SQL knowledge can work conveniently with Apache Pig, whereas exposure to Java is a must to work with MapReduce. • Apache Pig's multi-query approach reduces code length to a great extent; MapReduce will require almost 20 times more lines to perform the same task. • Pig scripts need no separate compilation; on execution, every Apache Pig operator is converted internally into a MapReduce job. MapReduce jobs, by contrast, go through a long compilation process.
  • 47. Apache Pig vs SQL: • Pig Latin is a procedural language, whereas SQL is a declarative language. • In Apache Pig, schema is optional; we can store data without designing a schema (fields can then be referenced positionally as $0, $1, etc.). In SQL, a schema is mandatory. • The data model in Apache Pig is nested relational, while the data model used in SQL is flat relational. • Apache Pig provides limited opportunity for query optimization, whereas there is more opportunity for query optimization in SQL. Apache Pig vs Hive: • Apache Pig uses a language called Pig Latin, originally created at Yahoo; Hive uses a language called HiveQL, originally created at Facebook. • Pig Latin is a data flow language; HiveQL is a query processing language. • Pig Latin is a procedural language and fits the pipeline paradigm, whereas HiveQL is a declarative language. • Apache Pig can handle structured, unstructured, and semi-structured data; Hive is mostly for structured data.
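  • A small hedged sketch of this schema-optional style, loading a comma-separated file (the path is illustrative) and referring to fields by position:
  raw   = LOAD '/data/employees.txt' USING PigStorage(',');   -- no schema declared
  names = FOREACH raw GENERATE $1 AS name, $4 AS salary;      -- fields addressed positionally as $0, $1, ...
  DUMP names;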
  • 48. Applications of Apache Pig • To process huge data sources such as web logs. • To perform data processing for search platforms. • To process time sensitive data loads.
  • 49. Apache Pig Components • Parser: Initially the Pig scripts are handled by the Parser. It checks the syntax of the script, does type checking, and performs other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph) that represents the Pig Latin statements and logical operators. • In the DAG, the logical operators of the script are represented as nodes and the data flows are represented as edges. • Optimizer: The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown. • Compiler: The compiler compiles the optimized logical plan into a series of MapReduce jobs. • Execution engine: Finally, the MapReduce jobs are submitted to Hadoop in sorted order and executed on Hadoop, producing the desired results.