Working with Hive
Topics to Cover
- Introduction to Hive and its Architecture
- Different Modes of executing Hive queries
- HiveQL (DDL & DML Operations)
- External vs. Managed Tables
- Hive vs. Impala
- User-Defined Functions (UDFs)
- Exercises
Introduction to Hive and its Architecture
Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It sits on top of
Hadoop to summarize Big Data and makes querying and analysis easy. This section is a brief tutorial that
introduces Apache Hive and HiveQL on top of the Hadoop Distributed File System (HDFS), and it can be
your first step towards becoming a successful Hadoop developer with Hive.
Prior knowledge of Core Java, SQL and database concepts, the Hadoop file system, and any flavor of the
Linux operating system is an added advantage and will speed up learning Hive.
Features of Hive
Here are the features of Hive:
• It stores the schema in a database (the metastore) and the processed data in HDFS.
• It is designed for OLAP (OnLine Analytical Processing).
• It provides an SQL-like query language called HiveQL (HQL).
• It is familiar, fast, scalable, and extensible.
It is equally important to understand what Hive is not:
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates
Hive Architecture
The architecture of Hive consists of several component units, each described below:
User Interface: Hive is data warehouse infrastructure software that enables interaction between the user
and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive
HDInsight (on Windows Server).
Metastore: Hive uses a database server of your choice to store the schema or metadata of tables,
databases, columns in a table, their data types, and the HDFS mappings.
HiveQL Process Engine: HiveQL is an SQL-like language for querying the schema information held in the
metastore. It is one of the replacements for the traditional MapReduce approach: instead of writing a
MapReduce program in Java, we write a HiveQL query and let Hive run it as a MapReduce job.
Execution Engine: The execution engine is the bridge between the HiveQL process engine and
MapReduce. It processes the query and produces the same results a hand-written MapReduce job would.
HDFS or HBase: The Hadoop Distributed File System (HDFS) or HBase is the storage layer where the
data itself is kept.
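As an illustration of what the metastore tracks, Hive's DESCRIBE FORMATTED statement prints the metadata recorded for a table. A minimal sketch, using the sales table that appears in the later labs as an example name:
-- Show the metadata the metastore keeps for a table: columns and types,
-- HDFS location, table type (MANAGED_TABLE or EXTERNAL_TABLE), and SerDe
DESCRIBE FORMATTED sales;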
How Hive Works
The following steps describe the workflow between Hive and the Hadoop framework:
1. Execute Query
The Hive interface, such as the command line or Web UI, sends the query to the Driver (over an interface
such as JDBC or ODBC) for execution.
2. Get Plan
The driver takes the help of the query compiler, which parses the query, checks the syntax, and works out
the query plan and requirements.
3. Get Metadata
The compiler sends a metadata request to the metastore (any database).
4. Send Metadata
The metastore sends the metadata back to the compiler as a response.
5. Send Plan
The compiler checks the requirements and resends the plan to the driver. At this point, the parsing and
compiling of the query are complete.
6. Execute Plan
The driver sends the execution plan to the execution engine.
7. Execute Job
Internally, the execution of the job is a MapReduce job. The execution engine submits the job to the
JobTracker (the MapReduce master), which assigns it to TaskTrackers running on the DataNodes. Here,
the query runs as a MapReduce job.
7.1 Metadata Ops
Meanwhile, during execution, the execution engine can perform metadata operations with the metastore.
8. Fetch Result
The execution engine receives the results from the DataNodes.
9. Send Results
The execution engine sends the resulting values to the driver.
10. Send Results
The driver sends the results to the Hive interfaces.
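The plan the compiler builds in steps 2 to 5 can be inspected without actually running the job by prefixing a query with EXPLAIN. A minimal sketch, again using the sales table from the later labs:
-- Show the MapReduce plan Hive would generate, without executing it
EXPLAIN SELECT COUNT(*) FROM sales;
-- EXPLAIN EXTENDED prints additional detail about each stage
EXPLAIN EXTENDED SELECT COUNT(*) FROM sales;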
HiveQL (DDL & DML Operations)
All the data types in Hive are classified into four categories, given as follows (an illustrative example appears after the list):
1. Column Types
2. Literals
3. Null Values
4. Complex Types
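As a hedged sketch of these categories, the hypothetical table below combines primitive column types with the complex types ARRAY, MAP, and STRUCT; literals and NULL values are shown in the sample query. The table name and columns are illustrative only.
CREATE TABLE employee_profile (
  emp_id   INT,                              -- primitive column type
  name     STRING,
  skills   ARRAY<STRING>,                    -- complex type: array
  phones   MAP<STRING, STRING>,              -- complex type: map
  address  STRUCT<city:STRING, zip:STRING>   -- complex type: struct
);
SELECT name, 100 AS bonus, NULL AS notes FROM employee_profile;   -- literals and a NULL value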
Create Database Statement
Create Database is a statement used to create a database in Hive. A database in Hive
is a namespace or a collection of tables. The syntax for this statement is as follows:
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>;
Here, IF NOT EXISTS is an optional clause that suppresses the error if a database with the same name
already exists. We can use SCHEMA in place of DATABASE in this command. The following query
creates a database named userdb:
hive> CREATE DATABASE IF NOT EXISTS userdb;
or
hive> CREATE SCHEMA userdb;
The following query lists the databases, to verify the result:
hive> SHOW DATABASES;
default
userdb
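To start working inside the new database, or to remove it later, the following statements can be used (a minimal sketch; CASCADE also drops any tables the database still contains):
hive> USE userdb;
hive> DROP DATABASE IF EXISTS userdb CASCADE;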
Create Table Statement
Create Table is a statement used to create a table in Hive. The syntax and example
are as follows:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
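As a concrete, hedged example of this syntax, the statements below create a simple managed table and load a local tab-delimited file into it; the table name, columns, and file path are illustrative only.
CREATE TABLE IF NOT EXISTS employee (
  eid INT,
  name STRING,
  salary FLOAT
)
COMMENT 'Employee details'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/home/cloudera/employee.txt' INTO TABLE employee;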
LAB 1
GETTING STARTED WITH HIVE ENVIRONMENT
Hive is an open-source project and can be downloaded from the Apache website: http://hive.apache.org
You can install it on the CentOS system that was set up in the previous lab exercises.
Hive comes preinstalled with the Cloudera CDH Virtual Machine, so a separate installation may not be required.
1. Start the CDH VM and log in as user cloudera.
2. In the web browser, click Hue and log in with the same credentials used for the VM login.
3. Click the Query Editors drop-down -> Hive.
Run a basic query, for example:
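The query shown in the screenshot is not reproduced here; any simple statement works as a first test in the Hue editor, for example:
SHOW TABLES;
SELECT * FROM sales LIMIT 5;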
Hive can also be run from the command line. For this, either open a terminal within your VM, or connect to it
through the PuTTY SSH application.
Execute the commands as given below:
login as: cloudera
cloudera@192.168.23.157's password: cloudera
[cloudera@quickstart ~]$ hive
2016-12-04 22:15:02,688 WARN [main] mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing
PrefixTreeCodec is not present. Continuing without it.
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> show tables;
OK
canada_regions
sales
things
Time taken: 1.166 seconds, Fetched: 3 row(s)
hive>
hive> select * from sales;
OK
Joe 2
Hank 4
Ali 0
Eve 3
Hank 2
Time taken: 0.98 seconds, Fetched: 5 row(s)
hive>
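Note the warning above that the Hive CLI is deprecated in favor of Beeline. On the same VM, a roughly equivalent Beeline session could be started as follows, assuming HiveServer2 is running on its default port 10000 (connection details may differ in your environment):
[cloudera@quickstart ~]$ beeline -u jdbc:hive2://localhost:10000 -n cloudera
0: jdbc:hive2://localhost:10000> SHOW TABLES;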
LAB 2
USING HIVE TO MAP AN EXTERNAL TABLE OVER WEBLOG DATA IN HDFS
You will often want to create tables over existing data that does not live within the managed Hive
warehouse in HDFS. Creating a Hive external table is one of the easiest ways to handle this scenario.
Queries from the Hive client will execute as they normally do over internally managed tables.
Make sure you have access to a Hadoop cluster with Hive installed. This recipe depends on having the
weblog_entries dataset available as weblog_entries.txt and loaded into the HDFS directory
/input/weblog/ (step 3 below copies the file there).
Carry out the following steps to map an external table in HDFS:
1. Open a text editor, like vi or gedit.
2. Add the CREATE TABLE syntax, as follows:
DROP TABLE IF EXISTS weblog_entries;
CREATE EXTERNAL TABLE weblog_entries (
md5 STRING,
url STRING,
request_date STRING,
request_time STRING,
ip STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION '/input/weblog/';
3. Save the script as weblog_create_external_table.hql in the working directory. Copy the weblog file into
HDFS.
[cloudera@localhost]$ hadoop fs -mkdir -p /input/weblog/
[cloudera@localhost]$ hadoop fs -put weblog_entries.txt /input/weblog/
4. Run the script from the operating system shell by supplying the -f option to the Hive client, as follows:
hive -f weblog_create_external_table.hql
5. You should see two successful commands issued to the Hive client.
OK
Time taken: 3.036 seconds
OK
Time taken: 3.389 seconds
Open Hive in the terminal and explore the newly created table.
[cloudera@quickstart data]$ hive
hive> show tables;
OK
sales
weblog_entries
Time taken: 1.139 seconds, Fetched: 4 row(s)
hive> desc weblog_entries;
OK
md5 string
url string
request_date string
request_time string
ip string
Time taken: 0.254 seconds, Fetched: 5 row(s)
hive>
hive> exit;
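To spot-check that the external table maps over the data itself (not just the schema), a couple of sample queries can be run; the exact rows returned depend on your weblog_entries.txt file.
hive> SELECT md5, url, ip FROM weblog_entries LIMIT 3;
hive> SELECT COUNT(*) FROM weblog_entries;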
LAB 3
USING HIVE TO DYNAMICALLY CREATE TABLES FROM THE RESULTS OF A WEBLOG QUERY
This lab will outline a shorthand technique for inline table creation when the query is executed. Having to
create every table definition up front is impractical and does not scale for large ETL. Being able to
dynamically define intermediate tables is tremendously useful for complex analytics with multiple staging
points.
In this lab, we will create a new table that contains three fields from the weblog entry dataset, namely
request_date, request_time, and url. In addition to this, we will define a new field called url_length.
This lab depends on having the weblog_entries dataset loaded into a Hive table through the previous lab
exercise. Issue the following command in Hive:
hive> desc weblog_entries;
Carry out the following steps to create an inline table definition using an alias:
1. Open a text editor, like vi or gedit.
2. Add the following inline creation syntax:
CREATE TABLE weblog_entries_with_url_length AS
SELECT url, request_date, request_time, length(url) as url_length
FROM weblog_entries;
3. Save the script as weblog_entries_create_table_as.hql in the active directory.
4. Run the script from the operating system shell by supplying the -f option to the Hive client, as follows:
hive -f weblog_entries_create_table_as.hql
5. To verify that the table was created successfully, issue the following command, using the -e option:
hive -e "describe weblog_entries_with_url_length"
6. You should see a table with three string fields and a fourth int field holding the URL length:
url string
request_date string
request_time string
url_length int
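Because CREATE TABLE ... AS SELECT materializes a regular managed table, the result can be queried like any other table. A quick sanity check on the derived url_length column might look like this (a sketch, not part of the original lab):
SELECT MIN(url_length), MAX(url_length), AVG(url_length)
FROM weblog_entries_with_url_length;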
LAB 4
USING HIVE TO INTERSECT WEBLOG IPS AND DETERMINE THE COUNTRY
Hive does not directly support foreign keys. Nevertheless, it is still very common to join records on
identically matching keys contained in one or more tables. This recipe will show a very simple inner join
over weblog data that links each request record in the weblog_entries table to a country, based on the
request IP.
For each record contained in the weblog_entries table, the query will print the record out with an
additional trailing value showing the determined country.
Make sure you have access to a Hadoop cluster with Hive installed. This lab depends on having the
weblog_entries dataset loaded into a Hive table in lab exercise 2.
Issue the following command in Hive:
describe weblog_entries
You should see the following response:
OK
md5 string
url string
request_date string
request_time string
ip string
Additionally, this recipe requires that the ip-to-country dataset be loaded into a Hive table named
ip_to_country, with the fields mapped to the respective data types as shown below.
1. Create the target directory and copy the file ip_to_country.txt into HDFS.
[cloudera@localhost data]$ hadoop fs -mkdir -p /input/ip_to_country/
[cloudera@localhost data]$ hadoop fs -put ip_to_country.txt /input/ip_to_country/
2. Add the CREATE TABLE syntax, as follows:
[cloudera@localhost]$ vi ip-to-country.hql
DROP TABLE IF EXISTS ip_to_country;
CREATE EXTERNAL TABLE ip_to_country (
ip string,
country string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION '/input/ip_to_country';
[cloudera@localhost data]$ hive -f ip-to-country.hql
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
OK
Time taken: 0.812 seconds
OK
Time taken: 0.601 seconds
[cloudera@localhost]$ hive -e "describe ip_to_country"
Performing an inner join in Hive:
1. Open a text editor, like vi or gedit.
2. Add the following join query:
SELECT wle.*, itc.country FROM weblog_entries wle
JOIN ip_to_country itc ON wle.ip = itc.ip;
3. Save the script as weblog_simple_ip_join.hql in the active directory.
4. Run the script from the operating system shell by supplying the -f option to the Hive client. You should
see the results of the SELECT statement printed to the console, with each weblog record followed by the
resolved country. The full printout will contain all 3000 rows.
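A natural extension of the plain join is to aggregate it, for example counting requests per country. This is a hedged sketch rather than part of the lab itself:
SELECT itc.country, COUNT(*) AS request_count
FROM weblog_entries wle
JOIN ip_to_country itc ON wle.ip = itc.ip
GROUP BY itc.country
ORDER BY request_count DESC;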