HIVE
Abhinav Tyagi
What is Hive?
 Hive is a data warehouse infrastructure tool to
process structured data in Hadoop. It resides on top
of Hadoop to summarize Big Data, and makes
querying and analysis easy.
 Hive was initially developed by Facebook; later
the Apache Software Foundation took it up and
developed it further as open source under the
name Apache Hive.
Features of Hive
 It stores schema in a database and processed data
in HDFS (Hadoop Distributed File System).
 It is designed for OLAP.
 It provides SQL type language for querying called
HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.
Architecture of Hive
 User Interface - Hive is a data warehouse infrastructure software that
creates the interaction between the user and HDFS. The user interfaces that
Hive supports are the Hive Web UI, the Hive command line, and Hive HDInsight.
 Metastore - Hive chooses respective database servers to store the
schema or metadata of tables, databases, columns in a table, their data
types, and the HDFS mapping.
 HiveQL Process Engine - HiveQL is similar to SQL for querying the schema
info in the Metastore. It is one of the replacements for the traditional
MapReduce approach. Instead of writing a MapReduce
program in Java, we can write a query for the MapReduce job and process
it.
 Execution Engine - The conjunction part of the HiveQL Process
Engine and MapReduce is the Hive Execution Engine.
The execution engine processes the query and generates
the same results as MapReduce. It uses the flavor of
MapReduce.
 HDFS or HBase - The Hadoop Distributed File System or HBase
is the data storage technique used to store data in the file
system.
Working of Hive
 Execute Query - The Hive interface, such as the command line or
Web UI, sends the query to the Driver to execute.
 Get Plan - The driver takes the help of the query compiler, which
parses the query to check the syntax and the query plan, i.e. the
requirements of the query.
 Get Metadata - The compiler sends a metadata request to the
Metastore.
 Send Metadata- Metastore sends metadata as a response to the
compiler.
 Send Plan- The compiler checks the
requirement and resends the plan to the
driver. Up to here, the parsing and compiling
of a query is complete.
 Execute Plan - The driver sends the execution
plan to the execution engine.
 Execute Job - Internally, the execution of the
job is a MapReduce job. The execution engine
sends the job to the JobTracker,
which resides in the NameNode, and it assigns this job to
the TaskTracker, which resides in the DataNode. Here, the
query executes the MapReduce job.
 Metadata Ops- Meanwhile in execution,
the execution engine can execute
metadata operations with Metastore.
 Fetch Result- The execution engine
receives the results from Data nodes.
 Send Results- The execution engine sends
those resultant values to the driver.
 Send Results- The driver sends the results
to Hive Interfaces.
Hive- Data Types
All the data types in Hive are classified into
four types:
Column Types
Literals
Null Values
Complex Types
Column Types
 Integral Types - Integer-type data can be specified using
the integral data types, such as INT. When the data range exceeds
the range of INT, you need to use BIGINT, and if the data
range is smaller than that of INT, you use SMALLINT. TINYINT
is smaller than SMALLINT.
 String Types - String-type data can be specified
using single quotes (' ') or double quotes (" "). Hive has
two string data types: VARCHAR and CHAR. Hive follows C-style
escape characters.
 Timestamp - It supports traditional UNIX timestamp with
optional nanosecond precision. It supports
java.sql.Timestamp format “YYYY-MM-DD
HH:MM:SS.fffffffff” and format “yyyy-mm-dd
hh:mm:ss.ffffffffff”.
 Dates - DATE values are described in year/month/day
format in the form YYYY-MM-DD.
 Decimals - The DECIMAL type in Hive is the same as the
BigDecimal format of Java. It is used for representing
immutable arbitrary-precision decimal values.
 Union Types - Union is a collection of heterogeneous
data types. You can create an instance using create
union.
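The column types above can be sketched in a single table definition (the table and column names here are illustrative, not from the original slides):

```sql
-- Hypothetical table illustrating Hive's column types
CREATE TABLE IF NOT EXISTS types_demo (
  tiny_col   TINYINT,                 -- 1-byte integer
  small_col  SMALLINT,                -- 2-byte integer
  int_col    INT,                     -- 4-byte integer
  big_col    BIGINT,                  -- 8-byte integer
  name_col   VARCHAR(50),             -- variable-length string
  code_col   CHAR(10),                -- fixed-length string
  ts_col     TIMESTAMP,               -- yyyy-mm-dd hh:mm:ss.fffffffff
  date_col   DATE,                    -- YYYY-MM-DD
  price_col  DECIMAL(10,2),           -- arbitrary-precision decimal
  mixed_col  UNIONTYPE<INT, STRING>   -- heterogeneous union
);
```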
Literals
 Floating Point Types - Floating point types are
nothing but numbers with decimal points.
Generally, this type of data is represented by the
DOUBLE data type.
 Decimal Type - Decimal type data is nothing but a
floating point value with a higher range than the
DOUBLE data type. The range of the decimal type is
approximately -10^308 to 10^308.
Complex Types
Arrays - Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps - Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
Structs - Structs in Hive are similar to using complex data with
comments.
Syntax: STRUCT<col_name : data_type [ COMMENT col_comment, … ]>
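The three complex types can be combined in one table definition; a minimal sketch (table, column names, and delimiters are illustrative):

```sql
-- Hypothetical table illustrating ARRAY, MAP, and STRUCT
CREATE TABLE IF NOT EXISTS employee_complex (
  name    STRING,
  skills  ARRAY<STRING>,                   -- e.g. a list of skill names
  phone   MAP<STRING, STRING>,             -- e.g. 'home' -> number
  address STRUCT<city:STRING, zip:INT>     -- named sub-fields
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';
```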
Create Database
 hive> CREATE DATABASE [IF NOT EXISTS] userdb;
 hive> CREATE SCHEMA userdb;
 hive> SHOW DATABASES;
Drop Database
 hive>DROP DATABASE [IF EXISTS] userdb;
 hive> DROP DATABASE [IF EXISTS] userdb CASCADE;
 hive> DROP SCHEMA userdb;
Create Table
hive> CREATE TABLE IF NOT EXISTS employee (eid int,
name String, salary String, destination String)
> COMMENT 'Employee details'
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> LINES TERMINATED BY '\n'
> STORED AS TEXTFILE;
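Once the table exists, data is typically loaded from a delimited file with LOAD DATA; a minimal sketch (the file path here is hypothetical):

```sql
-- Load a tab-delimited local file into the employee table
-- ('/home/user/sample.txt' is a hypothetical path)
hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt'
    > OVERWRITE INTO TABLE employee;
```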
Partition
 Hive organizes tables into partitions. It is a way of
dividing a table into related parts based on the values of
partitioned columns such as date, city, and department.
Using partition, it is easy to query a portion of the data.
 Adding a partition - Syntax - hive> ALTER TABLE employee
ADD PARTITION (year='2013') LOCATION '/2012/part2012';
 Dropping a partition - Syntax - hive> ALTER TABLE employee
DROP [IF EXISTS] PARTITION (year='2013');
id, name, dept, year
1, Mark, TP, 2012
2, Bob, HR, 2012
3, Sam,SC, 2013
4, Adam, SC, 2013
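The sample records above could live in a table partitioned by year; a minimal sketch (the table name is illustrative):

```sql
-- Partitioned version of the sample table above
CREATE TABLE IF NOT EXISTS employee_part (
  id INT, name STRING, dept STRING
)
PARTITIONED BY (year STRING);

-- A query restricted to one partition reads only that
-- partition's directory, not the whole table
SELECT id, name, dept FROM employee_part WHERE year = '2013';
```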
HiveQL - Select Where
 The Hive Query Language (HiveQL) is a query language for Hive to
process and analyze structured data in a Metastore.
 hive> SELECT * FROM employee WHERE salary>30000;
HiveQL - Select Order By
 The ORDER BY clause is used to retrieve the details based on one
column and sort the result set by ascending or descending order.
 hive> SELECT Id, Name, Dept FROM employee ORDER BY DEPT;
HiveQL - Select-Group By
 The GROUP BY clause is used to group all the records in a result set
using a particular collection column. It is used to query a group of
records.
 hive> SELECT Dept,count(*) FROM employee GROUP BY DEPT;
HiveQL - Select-Joins
 JOIN is a clause that is used for combining specific fields from two
tables by using values common to each one. It is used to combine
records from two or more tables in the database. It is more or less
similar to SQL JOIN.
 There are different types of joins given as follows:
• JOIN
• LEFT OUTER JOIN
• RIGHT OUTER JOIN
• FULL OUTER JOIN
JOIN
 The JOIN clause is used to combine and retrieve
records from multiple tables. JOIN is the same as
INNER JOIN in SQL. A JOIN condition is
constructed using the primary keys and foreign keys of
the tables.
 hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o ON (c.ID =
o.CUSTOMER_ID);
Left Outer Join
 The HiveQL LEFT OUTER JOIN returns all the rows from
the left table, even if there are no matches in the right
table. This means, if the ON clause matches 0 (zero)
records in the right table, the JOIN still returns a row in
the result, but with NULL in each column from the right
table.
 hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM
CUSTOMERS c LEFT OUTER JOIN ORDERS o ON (c.ID =
o.CUSTOMER_ID);
Right Outer Join
 The HiveQL RIGHT OUTER JOIN returns all the rows
from the right table, even if there are no matches in
the left table. If the ON clause matches 0 (zero)
records in the left table, the JOIN still returns a row
in the result, but with NULL in each column from the
left table.
 hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM
CUSTOMERS c RIGHT OUTER JOIN ORDERS o ON (c.ID =
o.CUSTOMER_ID);
Full Outer Join
 The HiveQL FULL OUTER JOIN combines the records
of both the left and the right outer tables that fulfill
the JOIN condition. The joined table contains either
all the records from both the tables, or fills in NULL
values for missing matches on either side.
 hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c FULL OUTER JOIN ORDERS o ON
(c.ID = o.CUSTOMER_ID);
Thank You
