Introduction to Apache Hive
• Apache Hive is a Data Warehousing tool built on top of Hadoop and is used for data
analysis.
• Hive is targeted towards users who are comfortable with SQL.
• Hive's language, called HiveQL, is similar to SQL and is used for managing and querying structured data.
• This language also allows traditional map/reduce programmers to plug in their custom
mappers and reducers.
• The popular feature of Hive is that there is no need to learn Java.
• Hive, an open source data warehousing framework based on Hadoop, was developed by the
Data Infrastructure Team at Facebook.
• Hive is also one of the technologies that are being used to address the requirements at
Facebook.
• Hive is very popular with all the users internally at Facebook and is being used to run
thousands of jobs on the cluster with hundreds of users, for a wide variety of
applications.
• The Hive-Hadoop cluster at Facebook stores more than 2 PB of raw data and regularly
loads 15 TB of data daily.
Hive Architecture
Where to Use Hive
Role of Hive in Real-Time Applications
Limitations of Hive
SQL
• SQL stands for Structured Query Language.
• SQL is a language which helps us work with databases. A database does not
understand English or any other natural language.
• Just as we use Java or C# to create software, in a similar way we use SQL to work
with databases.
• SQL is the standard language of databases and is also pronounced "Sequel" by many
people.
• SQL itself is a declarative language.
• SQL deals with structured data and is meant for RDBMSs, that is, relational database
management systems.
• SQL supports schemas for data storage.
• We use SQL when we need frequent modification of records. SQL is used for better
performance.
HiveQL
• Hive's SQL dialect is known as HiveQL; it is a combination of
SQL-92, Oracle's SQL dialect, and MySQL.
• HiveQL also provides some features from later SQL standards,
such as the analytic functions from SQL:2003.
• There are also some Hive-specific extensions, like multi-table inserts
(sketched below), TRANSFORM, MAP, and REDUCE.
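One of those extensions, the multi-table insert, lets a single scan of a source table feed several targets. A minimal sketch (table and column names here are hypothetical):

FROM page_views pv
INSERT OVERWRITE TABLE mobile_views
  SELECT pv.url, pv.view_time WHERE pv.device = 'mobile'
INSERT OVERWRITE TABLE desktop_views
  SELECT pv.url, pv.view_time WHERE pv.device = 'desktop';

Because the FROM clause comes first, Hive reads page_views only once while writing both outputs.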
Hive Data Types
• Hive data types are categorized into numeric types, string types, miscellaneous types,
and complex types.
Integer Types

Type      Size                    Range
TINYINT   1-byte signed integer   -128 to 127
SMALLINT  2-byte signed integer   -32,768 to 32,767
INT       4-byte signed integer   -2,147,483,648 to 2,147,483,647
BIGINT    8-byte signed integer   -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

Decimal Types

Type      Size     Description
FLOAT     4-byte   Single-precision floating-point number
DOUBLE    8-byte   Double-precision floating-point number
Date/Time Types
TIMESTAMP
•It supports the traditional UNIX timestamp with optional nanosecond precision.
•As an integer numeric type, it is interpreted as a UNIX timestamp in seconds.
•As a floating-point numeric type, it is interpreted as a UNIX timestamp in
seconds with decimal precision.
•As a string, it follows the java.sql.Timestamp format "YYYY-MM-DD
HH:MM:SS.fffffffff" (9 decimal places of precision).
DATE
The Date value specifies a particular year, month, and day, in the form
YYYY-MM-DD; it does not include the time of day. The range of the
Date type is 0000-01-01 to 9999-12-31.
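As a quick sketch of these literal formats (the values are illustrative):

SELECT CAST('2024-01-15' AS DATE) AS d,
       CAST('2024-01-15 10:30:00.123456789' AS TIMESTAMP) AS ts;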
String Types
• STRING
• A string is a sequence of characters. Its values can be
enclosed within single quotes (') or double quotes (").
• VARCHAR
• Varchar is a variable-length type whose declared length,
between 1 and 65,535, specifies the maximum number of
characters allowed in the string.
• CHAR
• Char is a fixed-length type whose maximum length is
fixed at 255.
Complex Types

STRUCT: Similar to a C struct or an object; fields are accessed using "dot" notation.
Example: struct('James','Roy')
MAP: Contains key-value tuples; fields are accessed using array notation.
Example: map('first','James','last','Roy')
ARRAY: A collection of values of the same type, indexable using zero-based integers.
Example: array('James','Roy')
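A minimal sketch of declaring and querying these complex types (table and field names are hypothetical):

CREATE TABLE employee_complex (
  name   STRUCT<first:STRING, last:STRING>,
  phones MAP<STRING, STRING>,
  skills ARRAY<STRING>
);

SELECT name.first,      -- struct field via dot notation
       phones['home'],  -- map value via key
       skills[0]        -- array element via zero-based index
FROM employee_complex;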
Hive DDL Commands
DDL Command   Used With
CREATE        Database, Table
SHOW          Databases, Tables, Table Properties, Partitions, Functions, Index
DESCRIBE      Database, Table, View
USE           Database
DROP          Database, Table
ALTER         Database, Table
TRUNCATE      Table
Hive DDL commands are the statements used for defining and changing the
structure of a table or database in Hive. They are used to build or modify tables
and other objects in the database.
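A short sketch of these commands in use (database and table names are hypothetical):

CREATE DATABASE sales_db;
USE sales_db;
CREATE TABLE orders (id INT, amount DOUBLE);
SHOW TABLES;
DESCRIBE orders;
ALTER TABLE orders RENAME TO orders_2024;
TRUNCATE TABLE orders_2024;
DROP TABLE orders_2024;
DROP DATABASE sales_db;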
Hive DML Commands
• Hive DML (Data Manipulation Language) commands are used to
insert, update, retrieve, and delete data from a Hive table once the
table and database schema have been defined using Hive DDL
commands; a short sketch of their use follows the list below.
• The various Hive DML commands are:
• LOAD
• SELECT
• INSERT
• DELETE
• UPDATE
• EXPORT
• IMPORT
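A minimal sketch of the most common of these commands (paths and names are hypothetical; note that UPDATE and DELETE additionally require a transactional table):

LOAD DATA LOCAL INPATH '/tmp/orders.csv' INTO TABLE orders;
INSERT INTO TABLE orders VALUES (1, 99.5);
SELECT * FROM orders WHERE amount > 50;
EXPORT TABLE orders TO '/warehouse/backup/orders';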
Hive Sort by vs order by
• Hive supports SORT BY which sorts the data per
reducer. The difference between "order by" and
"sort by" is that the former guarantees total
order in the output while the latter only
guarantees ordering of the rows within a reducer.
If there is more than one reducer, "sort by" may
give a partially ordered final result.
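A side-by-side sketch (the employee table and salary column are hypothetical):

SELECT * FROM employee ORDER BY salary DESC;  -- total order across the whole output
SELECT * FROM employee SORT BY salary DESC;   -- ordered only within each reducer's output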
Hive Joining tables
• The HiveQL Join clause is used to combine the
data of two or more tables based on a related
column between them. The various types of
HiveQL joins are:
• Inner Join
• Left Outer Join
• Right Outer Join
• Full Outer Join
Here, we are going to execute the join
clauses on the records of two sample tables,
employee and employee_department:
Inner Join in HiveQL
The HiveQL inner join is used to return the rows of multiple tables
where the join condition is satisfied. In other words, the join criteria
must find matching records in every table being joined.
Example of Inner Join in Hive
In this example, we take two tables, employee and employee_department. The
primary key (empid) of the employee table corresponds to the foreign key (depid) of
the employee_department table. Let's perform the inner join operation using the
following steps:
•Select the database in which we want to create a table.
SELECT e1.empname, e2.department_name
FROM employee e1
JOIN employee_department e2 ON e1.empid = e2.depid;
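Using the same two tables, the outer-join variants differ only in which side's unmatched rows are kept; a sketch:

SELECT e1.empname, e2.department_name
FROM employee e1
LEFT OUTER JOIN employee_department e2 ON e1.empid = e2.depid;   -- keeps all employees

SELECT e1.empname, e2.department_name
FROM employee e1
FULL OUTER JOIN employee_department e2 ON e1.empid = e2.depid;   -- keeps unmatched rows from both sides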
What are the Hive Partitions?
• Apache Hive organizes
tables into partitions.
Partitioning is a way of
dividing a table into related
parts based on the values of
particular columns like date,
city, and department.
• Each table in Hive can have
one or more partition keys to
identify a particular partition.
Using partitions, it is easy to
run queries on slices of the
data.
Why is Partitioning Important?
• Today, huge amounts of data, in the range of petabytes, are stored in HDFS. Due to
this, it becomes very difficult for Hadoop users to query this data.
• Hive was introduced to lower this burden of data querying. Apache Hive converts
SQL queries into MapReduce jobs and then submits them to the Hadoop cluster.
When we submit a SQL query, Hive reads the entire data set.
• So, it becomes inefficient to run MapReduce jobs over a large table. This is
resolved by creating partitions in tables. Apache Hive makes implementing
partitions very easy: partition columns are declared at the time of table creation,
and Hive creates the partitions automatically as data is loaded.
• In the partitioning method, all the table data is divided into multiple partitions. Each
partition corresponds to a specific value (or values) of the partition column(s) and is
kept as a sub-directory inside the table's directory in HDFS.
• Therefore, when querying a particular table, only the appropriate partition that
contains the query value is read. This decreases the I/O time required by the
query and hence increases performance.
How to Create Partitions in
Hive?
• To create a partitioned table in Hive, the following
command is used:
CREATE TABLE table_name (column1 data_type,
column2 data_type) PARTITIONED BY (partition1
data_type, partition2 data_type, ...);
Hive Data Partitioning Example
• Now let's understand data partitioning in Hive with an
example. Consider a table named Tab1. The table
contains client details like id, name, dept, and yoj (year of
joining).
• Suppose we need to retrieve the details of all the clients
who joined in 2012. Then, the query searches the whole
table for the required information. But if we partition the
client data by year and store it in a separate file,
this will reduce the query processing time. The below
example will help us learn how to partition a file and
its data.
A file, say file1, contains the client data table:
tab1/clientdata/file1
id, name, dept, yoj
1, sunny, SC, 2009
2, animesh, HR, 2009
3, sumeer, SC, 2010
4, sarthak, TP, 2010
Now, let us partition the above data into two files by year:
tab1/clientdata/2009/file2
1, sunny, SC, 2009
2, animesh, HR, 2009
tab1/clientdata/2010/file3
3, sumeer, SC, 2010
4, sarthak, TP, 2010
• Now when we are retrieving the data from the table, only the data of the
specified partition will be queried. Creating a partitioned table is as follows:
• CREATE TABLE table_tab1 (id INT, name STRING, dept STRING, yoj INT) PARTITIONED BY
(year STRING);
• LOAD DATA LOCAL INPATH 'tab1/clientdata/2009/file2' OVERWRITE INTO TABLE
table_tab1 PARTITION (year='2009');
• LOAD DATA LOCAL INPATH 'tab1/clientdata/2010/file3' OVERWRITE INTO TABLE
table_tab1 PARTITION (year='2010');
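With the table partitioned this way, a query that filters on the partition column reads only the matching sub-directory; a sketch:

SELECT name, dept FROM table_tab1 WHERE year = '2009';  -- scans only the year=2009 partition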
Hive Static Partitioning
• Inserting input data files individually into specified partitions of a table is called static
partitioning (see the sketch after this list).
• Usually, static partitions are preferred when loading big files into Hive tables.
• Static partitioning saves time in loading data compared to dynamic partitioning.
• You "statically" add a partition to the table and move the file into that partition of the table.
• We can alter partitions when using static partitioning.
• You can get the partition column value from the filename, day of date, etc. without reading
the whole big file.
• If you want to use static partitioning in Hive, you should set the property
hive.mapred.mode = strict (this property is set in hive-site.xml).
• Static partitioning works in strict mode.
• In strict mode, you should use a WHERE clause on the partition column when using LIMIT.
• You can perform static partitioning on Hive managed tables or external tables.
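A minimal static-partition sketch (the file path is hypothetical); the partition value is stated explicitly, so Hive never has to inspect the file contents:

LOAD DATA LOCAL INPATH '/data/clients_2009.csv'
INTO TABLE table_tab1 PARTITION (year = '2009');

ALTER TABLE table_tab1 ADD PARTITION (year = '2011');  -- altering partitions is allowed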
Types of Hive Partitioning:
• Static Partitioning
• Dynamic Partitioning
Hive Dynamic Partitioning
• Loading a partitioned table through a single INSERT statement, letting Hive
determine the partition values, is known as dynamic partitioning.
• Usually, dynamic partitioning loads the data from a non-partitioned table.
• Dynamic partitioning takes more time in loading data compared to
static partitioning.
• When you have large data stored in a table, dynamic partitioning
is suitable.
• If you want to partition by a number of columns but you don't know
their values in advance, dynamic partitioning is also suitable.
• With dynamic partitioning, there is no strict-mode requirement for a
WHERE clause when using LIMIT.
• We can't perform ALTER on dynamic partitions.
• You can perform dynamic partitioning on Hive external tables and managed
tables.
• If you want to use dynamic partitioning in Hive, the mode must be
non-strict.
• The Hive dynamic-partition properties you should enable are shown in the
sketch below.
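A sketch of those properties together with a dynamic-partition insert (the staging table is hypothetical; Hive derives the partition value from the last column of the SELECT):

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE table_tab1 PARTITION (year)
SELECT id, name, dept, yoj, CAST(yoj AS STRING) AS year
FROM clients_staging;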
Hive Partitioning – Advantages and
Disadvantages
• a) Hive Partitioning Advantages
• Partitioning in Hive distributes the execution load
horizontally.
• With partitions, queries over a low volume of data
execute faster. For example, searching the population
of Vatican City returns very quickly, instead of
searching the entire world's population.
• b) Hive Partitioning Disadvantages
• There is the possibility of creating too many small
partitions, i.e., too many directories.
• Partitioning is effective for low-volume data, but
some queries, such as GROUP BY over a high volume
of data, still take a long time to execute. For example,
grouping the population of China will take a long
time compared to grouping the population of
Vatican City.
Bucketing in Hive
• What is Bucketing in Hive
• Basically, for decomposing table data sets into
more manageable parts, Apache Hive offers
another technique. That technique is what we
call Bucketing in Hive.
• Why Bucketing?
• Basically, the concept of Hive partitioning provides
a way of segregating Hive table data into multiple
files/directories. However, it only gives effective
results in a few scenarios, such as:
– When there is a limited number of partitions.
– Or, when partitions are of comparatively equal
size.
• However, this is not possible in all scenarios. For example, suppose we are
partitioning our tables based on geographic location, like country. Some
bigger countries will have large partitions (4-5 countries may themselves
contribute 70-80% of the total data).
• Meanwhile, small countries' data will create small partitions (all remaining
countries in the world may contribute just 20-30% of the total data). Hence,
in such cases, partitioning is not ideal.
• To solve this problem of over-partitioning, Hive offers the bucketing
concept. It is another effective technique for decomposing table data sets
into more manageable parts.
• Features of Bucketing in Hive
• Basically, this concept is based on applying a hash function to the bucketed column,
along with a mod by the total number of buckets.
• i. The hash_function depends on the type of the bucketing column.
ii. Records with the same value in the bucketed column will always be stored in the same
bucket.
iii. To divide the table into buckets, we use the CLUSTERED BY clause.
iv. Generally, in the table directory, each bucket is just a file, and bucket numbering is
1-based.
v. Bucketing can be done along with partitioning on Hive tables, and even without
partitioning.
vi. Bucketed tables will create almost equally distributed data file parts.
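As a worked sketch of the hash-and-mod rule (assuming an INT bucketing column, for which hash_function(x) is simply x): in a table with 4 buckets, a record with id = 21 maps to 21 mod 4 = 1, i.e., the second bucket file under the 1-based numbering above. Every record whose id leaves the same remainder (5, 9, 13, ...) lands in that same file.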
• Advantages of Bucketing in Hive
• i. Compared with non-bucketed tables, bucketed tables offer efficient
sampling.
ii. Map-side joins will be faster on bucketed tables than on non-bucketed tables,
as the data files are equal-sized parts.
iii. Similar to partitioning, bucketed tables offer faster query responses than
non-bucketed tables.
iv. This concept offers the flexibility to keep the records in each bucket
sorted by one or more columns.
v. Since the join of each bucket becomes an efficient merge-sort, this makes
map-side joins even more efficient.
Limitations of Bucketing in Hive
i. It doesn't ensure that the table is properly populated.
ii. So, we need to handle loading data into the buckets ourselves.
• Example Use Case for Bucketing in Hive
• To understand the remaining features of Hive bucketing, let's see an example
use case: creating buckets for a sample file of user records, such as
first_name, last_name, address, country, city, state, post, phone1, phone2, email, web
Rebbecca, Didio, 171 E 24th St, AU, Leith, TA, 7315, 03-8174-9123, 0458-665-290,
rebbecca.didio@didio.com.au, http://www.brandtjonathanfesq.com.au
Hence, let's create the table partitioned by country and bucketed by state,
sorted in ascending order of cities.
• Creation of Bucketed Tables
• However, with the help of the CLUSTERED BY clause
and the optional SORTED BY clause in a CREATE
TABLE statement, we can create bucketed tables.
Moreover, we can create a bucketed_user table
meeting the above requirements with the following
HiveQL.
• CREATE TABLE bucketed_user(
firstname VARCHAR(64),
lastname VARCHAR(64),
address STRING,
city VARCHAR(64),
state VARCHAR(64),
post STRING,
phone1 VARCHAR(64),
phone2 STRING,
email STRING,
web STRING
)
COMMENT 'A bucketed sorted user table'
PARTITIONED BY (country VARCHAR(64))
CLUSTERED BY (state) SORTED BY (city) INTO 32 BUCKETS
STORED AS SEQUENCEFILE;
As shown in the code, the bucketed and sorted columns
(state and city) are included in the table's column
definitions, unlike the partition column (country),
which appears only in the PARTITIONED BY clause and
not in the column list.
• Inserting Data Into Bucketed Tables
• We cannot directly load bucketed tables with the LOAD DATA (LOCAL) INPATH
command. Instead, to populate a bucketed table we need to use an INSERT
OVERWRITE TABLE ... SELECT ... FROM clause reading from another table.
• Hence, we will create one temporary table in Hive with all the columns of the input
file, and from that table we will copy the data into our target bucketed table.
• i. In bucketing, the property hive.enforce.bucketing = true plays a role similar to the
hive.exec.dynamic.partition = true property: by setting it, we enable bucketing to be
enforced while loading data into the Hive table.
• ii. Moreover, it will automatically set the number of reduce tasks equal to the
number of buckets mentioned in the table definition (for example, 32 in our case),
and it automatically selects the CLUSTERED BY column from the table definition.
• iii. Alternatively, if we do not set this property in the Hive session, we have to convey
the same information to Hive manually: set the number of reduce tasks to run (in
our case, by using set mapred.reduce.tasks=32) and add CLUSTER BY (state) and
SORT BY (city) clauses at the end of the INSERT ... SELECT statement.
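Putting the pieces together, a sketch of populating bucketed_user from a staging table (here hypothetically named temp_user, holding the same columns plus country; the partition column must come last in the SELECT):

SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE bucketed_user PARTITION (country)
SELECT firstname, lastname, address, city, state, post,
       phone1, phone2, email, web, country
FROM temp_user;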