Introduction to Apache Hive
• Apache Hive is a Data Warehousing tool built on top of Hadoop and is used for data
analysis.
• Hive is targeted towards users who are comfortable with SQL.
• Hive's language, called HiveQL, is similar to SQL and is used for managing and querying structured data.
• This language also allows traditional map/reduce programmers to plug in their custom
mappers and reducers.
• The popular feature of Hive is that there is no need to learn Java.
• Hive, an open source data warehousing framework based on Hadoop, was developed by the
Data Infrastructure Team at Facebook.
• Hive is also one of the technologies that are being used to address the requirements at
Facebook.
• Hive is very popular with all the users internally at Facebook and is being used to run
thousands of jobs on the cluster with hundreds of users, for a wide variety of
applications.
• The Hive-Hadoop cluster at Facebook stores more than 2 PB of raw data and regularly
loads 15 TB of data daily.
Hive Architecture
Where to Use Hive
Role of Hive in Real-Time Applications
Limitations of Hive
SQL
• SQL stands for Structured Query Language.
• SQL is a language which helps us work with databases. A database does not
understand English or any other natural language.
• Just as we use Java or C# to create software, in a similar way we use SQL to work
with databases.
• SQL is the standard language of databases and is also pronounced "Sequel" by many
people.
• SQL itself is a declarative language.
• SQL deals with structured data and is meant for RDBMSs, that is, relational database
management systems.
• SQL supports schemas for data storage.
• We use SQL when we need frequent modification of records. SQL is used for better
performance.
HiveQL
• Hive's SQL dialect is known as HiveQL; it is a combination of
SQL-92, Oracle's SQL dialect, and MySQL.
• HiveQL also provides some features from later SQL standards,
such as the analytic functions from SQL:2003.
• There are also some Hive-specific extensions, like multi-table inserts
(sketched below), TRANSFORM, MAP, and REDUCE.
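One of those extensions, the multi-table insert, lets a single scan of a source table feed several targets. A minimal sketch (table and column names here are hypothetical):

FROM page_views pv
INSERT OVERWRITE TABLE mobile_views
  SELECT pv.url, pv.view_time WHERE pv.device = 'mobile'
INSERT OVERWRITE TABLE desktop_views
  SELECT pv.url, pv.view_time WHERE pv.device = 'desktop';

Because the FROM clause comes first, Hive reads page_views only once while writing both outputs.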
Hive Data Types
• Hive data types are categorized into numeric types, string types, miscellaneous types,
and complex types.
Integer Types

Type      Size                    Range
TINYINT   1-byte signed integer   -128 to 127
SMALLINT  2-byte signed integer   -32,768 to 32,767
INT       4-byte signed integer   -2,147,483,648 to 2,147,483,647
BIGINT    8-byte signed integer   -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

Decimal Types

Type      Size     Description
FLOAT     4-byte   Single-precision floating-point number
DOUBLE    8-byte   Double-precision floating-point number
Date/Time Types
TIMESTAMP
•It supports the traditional UNIX timestamp with optional nanosecond precision.
•As an integer numeric type, it is interpreted as a UNIX timestamp in seconds.
•As a floating-point numeric type, it is interpreted as a UNIX timestamp in
seconds with decimal precision.
•As a string, it follows the java.sql.Timestamp format "YYYY-MM-DD
HH:MM:SS.fffffffff" (9 decimal places of precision).
DATE
The Date value specifies a particular year, month, and day, in the form
YYYY-MM-DD; it does not include the time of day. The range of the
Date type is 0000-01-01 to 9999-12-31.
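As a quick sketch of these literal formats (the values are illustrative):

SELECT CAST('2024-01-15' AS DATE) AS d,
       CAST('2024-01-15 10:30:00.123456789' AS TIMESTAMP) AS ts;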
String Types
• STRING
• A string is a sequence of characters. Its values can be
enclosed within single quotes (') or double quotes (").
• VARCHAR
• Varchar is a variable-length type whose declared length,
between 1 and 65,535, specifies the maximum number of
characters allowed in the string.
• CHAR
• Char is a fixed-length type whose maximum length is
fixed at 255.
Complex Types

STRUCT: Similar to a C struct or an object; fields are accessed using "dot" notation.
Example: struct('James','Roy')
MAP: Contains key-value tuples; fields are accessed using array notation.
Example: map('first','James','last','Roy')
ARRAY: A collection of values of the same type, indexable using zero-based integers.
Example: array('James','Roy')
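A minimal sketch of declaring and querying these complex types (table and field names are hypothetical):

CREATE TABLE employee_complex (
  name   STRUCT<first:STRING, last:STRING>,
  phones MAP<STRING, STRING>,
  skills ARRAY<STRING>
);

SELECT name.first,      -- struct field via dot notation
       phones['home'],  -- map value via key
       skills[0]        -- array element via zero-based index
FROM employee_complex;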
Hive DDL Commands
DDL Command   Used With
CREATE        Database, Table
SHOW          Databases, Tables, Table Properties, Partitions, Functions, Index
DESCRIBE      Database, Table, View
USE           Database
DROP          Database, Table
ALTER         Database, Table
TRUNCATE      Table
Hive DDL commands are the statements used for defining and changing the
structure of a table or database in Hive. They are used to build or modify tables
and other objects in the database.
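A short sketch of these commands in use (database and table names are hypothetical):

CREATE DATABASE sales_db;
USE sales_db;
CREATE TABLE orders (id INT, amount DOUBLE);
SHOW TABLES;
DESCRIBE orders;
ALTER TABLE orders RENAME TO orders_2024;
TRUNCATE TABLE orders_2024;
DROP TABLE orders_2024;
DROP DATABASE sales_db;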
Hive DML Commands
• Hive DML (Data Manipulation Language) commands are used to
insert, update, retrieve, and delete data from a Hive table once the
table and database schema have been defined using Hive DDL
commands; a short sketch of their use follows the list below.
• The various Hive DML commands are:
• LOAD
• SELECT
• INSERT
• DELETE
• UPDATE
• EXPORT
• IMPORT
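A minimal sketch of the most common of these commands (paths and names are hypothetical; note that UPDATE and DELETE additionally require a transactional table):

LOAD DATA LOCAL INPATH '/tmp/orders.csv' INTO TABLE orders;
INSERT INTO TABLE orders VALUES (1, 99.5);
SELECT * FROM orders WHERE amount > 50;
EXPORT TABLE orders TO '/warehouse/backup/orders';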
Hive Sort by vs order by
• Hive supports SORT BY which sorts the data per
reducer. The difference between "order by" and
"sort by" is that the former guarantees total
order in the output while the latter only
guarantees ordering of the rows within a reducer.
If there is more than one reducer, "sort by" may
give a partially ordered final result.
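A side-by-side sketch (the employee table and salary column are hypothetical):

SELECT * FROM employee ORDER BY salary DESC;  -- total order across the whole output
SELECT * FROM employee SORT BY salary DESC;   -- ordered only within each reducer's output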
Hive Joining tables
• The HiveQL Join clause is used to combine the
data of two or more tables based on a related
column between them. The various types of
HiveQL joins are:
• Inner Join
• Left Outer Join
• Right Outer Join
• Full Outer Join
Here, we are going to execute the join
clauses on the records of two sample tables,
employee and employee_department:
Inner Join in HiveQL
The HiveQL inner join is used to return the rows of multiple tables
where the join condition is satisfied. In other words, the join criteria
must find matching records in every table being joined.
Example of Inner Join in Hive
In this example, we take two tables, employee and employee_department. The
primary key (empid) of the employee table corresponds to the foreign key (depid) of
the employee_department table. Let's perform the inner join operation using the
following steps:
•Select the database in which we want to create a table.
SELECT e1.empname, e2.department_name
FROM employee e1
JOIN employee_department e2 ON e1.empid = e2.depid;
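Using the same two tables, the outer-join variants differ only in which side's unmatched rows are kept; a sketch:

SELECT e1.empname, e2.department_name
FROM employee e1
LEFT OUTER JOIN employee_department e2 ON e1.empid = e2.depid;   -- keeps all employees

SELECT e1.empname, e2.department_name
FROM employee e1
FULL OUTER JOIN employee_department e2 ON e1.empid = e2.depid;   -- keeps unmatched rows from both sides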
What are the Hive Partitions?
• Apache Hive organizes
tables into partitions.
Partitioning is a way of
dividing a table into related
parts based on the values of
particular columns like date,
city, and department.
• Each table in Hive can have
one or more partition keys to
identify a particular partition.
Using partitions, it is easy to
run queries on slices of the
data.
Why is Partitioning Important?
• Today, huge amounts of data, in the range of petabytes, are stored in HDFS. Due to
this, it becomes very difficult for Hadoop users to query this data.
• Hive was introduced to lower this burden of data querying. Apache Hive converts
SQL queries into MapReduce jobs and then submits them to the Hadoop cluster.
When we submit a SQL query, Hive reads the entire data set.
• So, it becomes inefficient to run MapReduce jobs over a large table. This is
resolved by creating partitions in tables. Apache Hive makes implementing
partitions very easy: partition columns are declared at the time of table creation,
and Hive creates the partitions automatically as data is loaded.
• In the partitioning method, all the table data is divided into multiple partitions. Each
partition corresponds to a specific value (or values) of the partition column(s) and is
kept as a sub-directory inside the table's directory in HDFS.
• Therefore, when querying a particular table, only the appropriate partition that
contains the query value is read. This decreases the I/O time required by the
query and hence increases performance.
How to Create Partitions in
Hive?
• To create a partitioned table in Hive, the following
command is used:
CREATE TABLE table_name (column1 data_type,
column2 data_type) PARTITIONED BY (partition1
data_type, partition2 data_type, ...);
Hive Data Partitioning Example
• Now let's understand data partitioning in Hive with an
example. Consider a table named Tab1. The table
contains client details like id, name, dept, and yoj (year of
joining).
• Suppose we need to retrieve the details of all the clients
who joined in 2012. Then, the query searches the whole
table for the required information. But if we partition the
client data by year and store it in a separate file,
this will reduce the query processing time. The below
example will help us learn how to partition a file and
its data.
A file, say file1, contains the client data table:
tab1/clientdata/file1
id, name, dept, yoj
1, sunny, SC, 2009
2, animesh, HR, 2009
3, sumeer, SC, 2010
4, sarthak, TP, 2010
Now, let us partition the above data into two files by year:
tab1/clientdata/2009/file2
1, sunny, SC, 2009
2, animesh, HR, 2009
tab1/clientdata/2010/file3
3, sumeer, SC, 2010
4, sarthak, TP, 2010
• Now when we are retrieving the data from the table, only the data of the
specified partition will be queried. Creating a partitioned table is as follows:
• CREATE TABLE table_tab1 (id INT, name STRING, dept STRING, yoj INT) PARTITIONED BY
(year STRING);
• LOAD DATA LOCAL INPATH 'tab1/clientdata/2009/file2' OVERWRITE INTO TABLE
table_tab1 PARTITION (year='2009');
• LOAD DATA LOCAL INPATH 'tab1/clientdata/2010/file3' OVERWRITE INTO TABLE
table_tab1 PARTITION (year='2010');
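With the table partitioned this way, a query that filters on the partition column reads only the matching sub-directory; a sketch:

SELECT name, dept FROM table_tab1 WHERE year = '2009';  -- scans only the year=2009 partition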
Hive Static Partitioning
• Inserting input data files individually into specified partitions of a table is called static
partitioning (see the sketch after this list).
• Usually, static partitions are preferred when loading big files into Hive tables.
• Static partitioning saves time in loading data compared to dynamic partitioning.
• You "statically" add a partition to the table and move the file into that partition of the table.
• We can alter partitions when using static partitioning.
• You can get the partition column value from the filename, day of date, etc. without reading
the whole big file.
• If you want to use static partitioning in Hive, you should set the property
hive.mapred.mode = strict (this property is set in hive-site.xml).
• Static partitioning works in strict mode.
• In strict mode, you should use a WHERE clause on the partition column when using LIMIT.
• You can perform static partitioning on Hive managed tables or external tables.
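A minimal static-partition sketch (the file path is hypothetical); the partition value is stated explicitly, so Hive never has to inspect the file contents:

LOAD DATA LOCAL INPATH '/data/clients_2009.csv'
INTO TABLE table_tab1 PARTITION (year = '2009');

ALTER TABLE table_tab1 ADD PARTITION (year = '2011');  -- altering partitions is allowed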
Types of Hive Partitioning:
• Static Partitioning
• Dynamic Partitioning
Hive Dynamic Partitioning
• Loading a partitioned table through a single INSERT statement, letting Hive
determine the partition values, is known as dynamic partitioning.
• Usually, dynamic partitioning loads the data from a non-partitioned table.
• Dynamic partitioning takes more time in loading data compared to
static partitioning.
• When you have large data stored in a table, dynamic partitioning
is suitable.
• If you want to partition by a number of columns but you don't know
their values in advance, dynamic partitioning is also suitable.
• With dynamic partitioning, there is no strict-mode requirement for a
WHERE clause when using LIMIT.
• We can't perform ALTER on dynamic partitions.
• You can perform dynamic partitioning on Hive external tables and managed
tables.
• If you want to use dynamic partitioning in Hive, the mode must be
non-strict.
• The Hive dynamic-partition properties you should enable are shown in the
sketch below.
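A sketch of those properties together with a dynamic-partition insert (the staging table is hypothetical; Hive derives the partition value from the last column of the SELECT):

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE table_tab1 PARTITION (year)
SELECT id, name, dept, yoj, CAST(yoj AS STRING) AS year
FROM clients_staging;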
Hive Partitioning – Advantages and
Disadvantages
• a) Hive Partitioning Advantages
• Partitioning in Hive distributes the execution load
horizontally.
• With partitions, queries over a low volume of data
execute faster. For example, searching the population
of Vatican City returns very quickly, instead of
searching the entire world's population.
• b) Hive Partitioning Disadvantages
• There is the possibility of creating too many small
partitions, i.e., too many directories.
• Partitioning is effective for low-volume data, but
some queries, such as GROUP BY over a high volume
of data, still take a long time to execute. For example,
grouping the population of China will take a long
time compared to grouping the population of
Vatican City.
Bucketing in Hive
• What is Bucketing in Hive
• Basically, for decomposing table data sets into
more manageable parts, Apache Hive offers
another technique. That technique is what we
call Bucketing in Hive.
• Why Bucketing?
• Basically, the concept of Hive partitioning provides
a way of segregating Hive table data into multiple
files/directories. However, it only gives effective
results in a few scenarios, such as:
– When there is a limited number of partitions.
– Or, when partitions are of comparatively equal
size.
• However, this is not possible in all scenarios. For example, suppose we are
partitioning our tables based on geographic location, like country. Some
bigger countries will have large partitions (4-5 countries may themselves
contribute 70-80% of the total data).
• Meanwhile, small countries' data will create small partitions (all remaining
countries in the world may contribute just 20-30% of the total data). Hence,
in such cases, partitioning is not ideal.
• To solve this problem of over-partitioning, Hive offers the bucketing
concept. It is another effective technique for decomposing table data sets
into more manageable parts.
• Features of Bucketing in Hive
• Basically, this concept is based on applying a hash function to the bucketed column,
along with a mod by the total number of buckets.
• i. The hash_function depends on the type of the bucketing column.
ii. Records with the same value in the bucketed column will always be stored in the same
bucket.
iii. To divide the table into buckets, we use the CLUSTERED BY clause.
iv. Generally, in the table directory, each bucket is just a file, and bucket numbering is
1-based.
v. Bucketing can be done along with partitioning on Hive tables, and even without
partitioning.
vi. Bucketed tables will create almost equally distributed data file parts.
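As a worked sketch of the hash-and-mod rule (assuming an INT bucketing column, for which hash_function(x) is simply x): in a table with 4 buckets, a record with id = 21 maps to 21 mod 4 = 1, i.e., the second bucket file under the 1-based numbering above. Every record whose id leaves the same remainder (5, 9, 13, ...) lands in that same file.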
• Advantages of Bucketing in Hive
• i. Compared with non-bucketed tables, bucketed tables offer efficient
sampling.
ii. Map-side joins will be faster on bucketed tables than on non-bucketed tables,
as the data files are equal-sized parts.
iii. Similar to partitioning, bucketed tables offer faster query responses than
non-bucketed tables.
iv. This concept offers the flexibility to keep the records in each bucket
sorted by one or more columns.
v. Since the join of each bucket becomes an efficient merge-sort, this makes
map-side joins even more efficient.
Limitations of Bucketing in Hive
i. It doesn't ensure that the table is properly populated.
ii. So, we need to handle loading data into the buckets ourselves.
• Example Use Case for Bucketing in Hive
• To understand the remaining features of Hive bucketing, let's see an example
use case: creating buckets for a sample file of user records, such as
first_name, last_name, address, country, city, state, post, phone1, phone2, email, web
Rebbecca, Didio, 171 E 24th St, AU, Leith, TA, 7315, 03-8174-9123, 0458-665-290,
rebbecca.didio@didio.com.au, http://www.brandtjonathanfesq.com.au
Hence, let's create the table partitioned by country and bucketed by state,
sorted in ascending order of cities.
• Creation of Bucketed Tables
• However, with the help of the CLUSTERED BY clause
and the optional SORTED BY clause in a CREATE
TABLE statement, we can create bucketed tables.
Moreover, we can create a bucketed_user table
meeting the above requirements with the following
HiveQL.
• CREATE TABLE bucketed_user(
firstname VARCHAR(64),
lastname VARCHAR(64),
address STRING,
city VARCHAR(64),
state VARCHAR(64),
post STRING,
phone1 VARCHAR(64),
phone2 STRING,
email STRING,
web STRING
)
COMMENT 'A bucketed sorted user table'
PARTITIONED BY (country VARCHAR(64))
CLUSTERED BY (state) SORTED BY (city) INTO 32 BUCKETS
STORED AS SEQUENCEFILE;
As shown in the code, the bucketed and sorted columns
(state and city) are included in the table's column
definitions, unlike the partition column (country),
which appears only in the PARTITIONED BY clause and
not in the column list.
• Inserting Data Into Bucketed Tables
• We cannot directly load bucketed tables with the LOAD DATA (LOCAL) INPATH
command. Instead, to populate a bucketed table we need to use an INSERT
OVERWRITE TABLE ... SELECT ... FROM clause reading from another table.
• Hence, we will create one temporary table in Hive with all the columns of the input
file, and from that table we will copy the data into our target bucketed table.
• i. In bucketing, the property hive.enforce.bucketing = true plays a role similar to the
hive.exec.dynamic.partition = true property: by setting it, we enable bucketing to be
enforced while loading data into the Hive table.
• ii. Moreover, it will automatically set the number of reduce tasks equal to the
number of buckets mentioned in the table definition (for example, 32 in our case),
and it automatically selects the CLUSTERED BY column from the table definition.
• iii. Alternatively, if we do not set this property in the Hive session, we have to convey
the same information to Hive manually: set the number of reduce tasks to run (in
our case, by using set mapred.reduce.tasks=32) and add CLUSTER BY (state) and
SORT BY (city) clauses at the end of the INSERT ... SELECT statement.
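Putting the pieces together, a sketch of populating bucketed_user from a staging table (here hypothetically named temp_user, holding the same columns plus country; the partition column must come last in the SELECT):

SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE bucketed_user PARTITION (country)
SELECT firstname, lastname, address, city, state, post,
       phone1, phone2, email, web, country
FROM temp_user;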