Working with Hive
Topics to Cover
- Introduction to Hive and its Architecture
- Different Modes of executing Hive queries
- HiveQL (DDL & DML Operations)
- External vs. Managed Tables
- Hive vs. Impala
- User-Defined Functions (UDFs)
- Exercises
Introduction to Hive and its Architecture
Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It sits on top of
Hadoop to summarize Big Data and makes querying and analysis easy. This section is a brief tutorial that
introduces Apache Hive and HiveQL on top of the Hadoop Distributed File System (HDFS), and it can be
your first step towards becoming a successful Hadoop developer with Hive.
Prior knowledge of Core Java, SQL and database concepts, the Hadoop file system, and any flavor of the
Linux operating system is an added advantage and will speed up learning Hive.
Features of Hive
Here are the features of Hive:
• It stores the schema in a database (the metastore) and the processed data in HDFS.
• It is designed for OLAP (OnLine Analytical Processing).
• It provides an SQL-like query language called HiveQL (HQL).
• It is familiar, fast, scalable, and extensible.
It is equally important to understand what Hive is not:
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates
Hive Architecture
The architecture of Hive consists of several component units, each described below:
User Interface: Hive is data warehouse infrastructure software that enables interaction between the user
and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive
HDInsight (on Windows Server).
Metastore: Hive uses a database server of your choice to store the schema or metadata of tables,
databases, columns in a table, their data types, and the HDFS mappings.
HiveQL Process Engine: HiveQL is an SQL-like language for querying the schema information held in the
metastore. It is one of the replacements for the traditional MapReduce approach: instead of writing a
MapReduce program in Java, we write a HiveQL query and let Hive run it as a MapReduce job.
Execution Engine: The execution engine is the bridge between the HiveQL process engine and
MapReduce. It processes the query and produces the same results a hand-written MapReduce job would.
HDFS or HBase: The Hadoop Distributed File System (HDFS) or HBase is the storage layer where the
data itself is kept.
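As an illustration of what the metastore tracks, Hive's DESCRIBE FORMATTED statement prints the metadata recorded for a table. A minimal sketch, using the sales table that appears in the later labs as an example name:
-- Show the metadata the metastore keeps for a table: columns and types,
-- HDFS location, table type (MANAGED_TABLE or EXTERNAL_TABLE), and SerDe
DESCRIBE FORMATTED sales;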
How Hive Works
The following steps describe the workflow between Hive and the Hadoop framework:
1. Execute Query
The Hive interface, such as the command line or Web UI, sends the query to the Driver (over an interface
such as JDBC or ODBC) for execution.
2. Get Plan
The driver takes the help of the query compiler, which parses the query, checks the syntax, and works out
the query plan and requirements.
3. Get Metadata
The compiler sends a metadata request to the metastore (any database).
4. Send Metadata
The metastore sends the metadata back to the compiler as a response.
5. Send Plan
The compiler checks the requirements and resends the plan to the driver. At this point, the parsing and
compiling of the query are complete.
6. Execute Plan
The driver sends the execution plan to the execution engine.
7. Execute Job
Internally, the execution of the job is a MapReduce job. The execution engine submits the job to the
JobTracker (the MapReduce master), which assigns it to TaskTrackers running on the DataNodes. Here,
the query runs as a MapReduce job.
7.1 Metadata Ops
Meanwhile, during execution, the execution engine can perform metadata operations with the metastore.
8. Fetch Result
The execution engine receives the results from the DataNodes.
9. Send Results
The execution engine sends the resulting values to the driver.
10. Send Results
The driver sends the results to the Hive interfaces.
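The plan the compiler builds in steps 2 to 5 can be inspected without actually running the job by prefixing a query with EXPLAIN. A minimal sketch, again using the sales table from the later labs:
-- Show the MapReduce plan Hive would generate, without executing it
EXPLAIN SELECT COUNT(*) FROM sales;
-- EXPLAIN EXTENDED prints additional detail about each stage
EXPLAIN EXTENDED SELECT COUNT(*) FROM sales;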
HiveQL (DDL & DML Operations)
All the data types in Hive are classified into four categories, given as follows (an illustrative example appears after the list):
1. Column Types
2. Literals
3. Null Values
4. Complex Types
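As a hedged sketch of these categories, the hypothetical table below combines primitive column types with the complex types ARRAY, MAP, and STRUCT; literals and NULL values are shown in the sample query. The table name and columns are illustrative only.
CREATE TABLE employee_profile (
  emp_id   INT,                              -- primitive column type
  name     STRING,
  skills   ARRAY<STRING>,                    -- complex type: array
  phones   MAP<STRING, STRING>,              -- complex type: map
  address  STRUCT<city:STRING, zip:STRING>   -- complex type: struct
);
SELECT name, 100 AS bonus, NULL AS notes FROM employee_profile;   -- literals and a NULL value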
Create Database Statement
Create Database is a statement used to create a database in Hive. A database in Hive
is a namespace or a collection of tables. The syntax for this statement is as follows:
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>;
Here, IF NOT EXISTS is an optional clause that suppresses the error if a database with the same name
already exists. We can use SCHEMA in place of DATABASE in this command. The following query
creates a database named userdb:
hive> CREATE DATABASE IF NOT EXISTS userdb;
or
hive> CREATE SCHEMA userdb;
The following query lists the databases, to verify the result:
hive> SHOW DATABASES;
default
userdb
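To start working inside the new database, or to remove it later, the following statements can be used (a minimal sketch; CASCADE also drops any tables the database still contains):
hive> USE userdb;
hive> DROP DATABASE IF EXISTS userdb CASCADE;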
Create Table Statement
Create Table is a statement used to create a table in Hive. The syntax and example
are as follows:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
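As a concrete, hedged example of this syntax, the statements below create a simple managed table and load a local tab-delimited file into it; the table name, columns, and file path are illustrative only.
CREATE TABLE IF NOT EXISTS employee (
  eid INT,
  name STRING,
  salary FLOAT
)
COMMENT 'Employee details'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/home/cloudera/employee.txt' INTO TABLE employee;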
LAB 1
GETTING STARTED WITH HIVE ENVIRONMENT
Hive is an open-source project and can be downloaded from the Apache website: http://hive.apache.org
You can install it on the CentOS system that was set up in the previous lab exercises.
Hive comes preinstalled with the Cloudera CDH Virtual Machine, so a separate installation may not be required.
1. Start the CDH VM and log in as user cloudera.
2. In the web browser, click Hue and log in with the same credentials used for the VM login.
3. Click the Query Editors drop-down -> Hive.
Run a basic query, for example:
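The query shown in the screenshot is not reproduced here; any simple statement works as a first test in the Hue editor, for example:
SHOW TABLES;
SELECT * FROM sales LIMIT 5;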
Hive can also be run from the command line. For this, either open a terminal within your VM, or connect to it
through the PuTTY SSH application.
Execute the commands as given below:
login as: cloudera
cloudera@192.168.23.157's password: cloudera
[cloudera@quickstart ~]$ hive
2016-12-04 22:15:02,688 WARN [main] mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing
PrefixTreeCodec is not present. Continuing without it.
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> show tables;
OK
canada_regions
sales
things
Time taken: 1.166 seconds, Fetched: 3 row(s)
hive>
hive> select * from sales;
OK
Joe 2
Hank 4
Ali 0
Eve 3
Hank 2
Time taken: 0.98 seconds, Fetched: 5 row(s)
hive>
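Note the warning above that the Hive CLI is deprecated in favor of Beeline. On the same VM, a roughly equivalent Beeline session could be started as follows, assuming HiveServer2 is running on its default port 10000 (connection details may differ in your environment):
[cloudera@quickstart ~]$ beeline -u jdbc:hive2://localhost:10000 -n cloudera
0: jdbc:hive2://localhost:10000> SHOW TABLES;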
LAB 2
USING HIVE TO MAP AN EXTERNAL TABLE OVER WEBLOG DATA IN HDFS
You will often want to create tables over existing data that does not live within the managed Hive
warehouse in HDFS. Creating a Hive external table is one of the easiest ways to handle this scenario.
Queries from the Hive client will execute as they normally do over internally managed tables.
Make sure you have access to a Hadoop cluster with Hive installed. This recipe depends on having the
weblog_entries dataset available as weblog_entries.txt and loaded into the HDFS directory
/input/weblog/ (step 3 below copies the file there).
Carry out the following steps to map an external table in HDFS:
1. Open a text editor, like vi or gedit.
2. Add the CREATE TABLE syntax, as follows:
DROP TABLE IF EXISTS weblog_entries;
CREATE EXTERNAL TABLE weblog_entries (
md5 STRING,
url STRING,
request_date STRING,
request_time STRING,
ip STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION '/input/weblog/';
3. Save the script as weblog_create_external_table.hql in the working directory. Copy the weblog file into
HDFS.
[cloudera@localhost]$ hadoop fs -mkdir -p /input/weblog/
[cloudera@localhost]$ hadoop fs -put weblog_entries.txt /input/weblog/
4. Run the script from the operating system shell by supplying the -f option to the Hive client, as follows:
hive -f weblog_create_external_table.hql
5. You should see two successful commands issued to the Hive client.
OK
Time taken: 3.036 seconds
OK
Time taken: 3.389 seconds
Open Hive in the terminal and explore the newly created table.
[cloudera@quickstart data]$ hive
hive> show tables;
OK
sales
weblog_entries
Time taken: 1.139 seconds, Fetched: 4 row(s)
hive> desc weblog_entries;
OK
md5 string
url string
request_date string
request_time string
ip string
Time taken: 0.254 seconds, Fetched: 5 row(s)
hive>
hive> exit;
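To spot-check that the external table maps over the data itself (not just the schema), a couple of sample queries can be run; the exact rows returned depend on your weblog_entries.txt file.
hive> SELECT md5, url, ip FROM weblog_entries LIMIT 3;
hive> SELECT COUNT(*) FROM weblog_entries;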
LAB 3
USING HIVE TO DYNAMICALLY CREATE TABLES FROM THE RESULTS OF A WEBLOG QUERY
This lab will outline a shorthand technique for inline table creation when the query is executed. Having to
create every table definition up front is impractical and does not scale for large ETL. Being able to
dynamically define intermediate tables is tremendously useful for complex analytics with multiple staging
points.
In this lab, we will create a new table that contains three fields from the weblog entry dataset, namely
request_date, request_time, and url. In addition to this, we will define a new field called url_length.
This lab depends on having the weblog_entries dataset loaded into a Hive table through the previous lab
exercise. Issue the following command in Hive:
hive> desc weblog_entries;
Carry out the following steps to create an inline table definition using an alias:
1. Open a text editor, like vi or gedit.
2. Add the following inline creation syntax:
CREATE TABLE weblog_entries_with_url_length AS
SELECT url, request_date, request_time, length(url) as url_length
FROM weblog_entries;
3. Save the script as weblog_entries_create_table_as.hql in the active directory.
4. Run the script from the operating system shell by supplying the -f option to the Hive client, as follows:
hive -f weblog_entries_create_table_as.hql
5. To verify that the table was created successfully, issue the following command, using the -e option:
hive -e "describe weblog_entries_with_url_length"
6. You should see a table with three string fields and a fourth int field holding the URL length:
url string
request_date string
request_time string
url_length int
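Because CREATE TABLE ... AS SELECT materializes a regular managed table, the result can be queried like any other table. A quick sanity check on the derived url_length column might look like this (a sketch, not part of the original lab):
SELECT MIN(url_length), MAX(url_length), AVG(url_length)
FROM weblog_entries_with_url_length;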
LAB 4
USING HIVE TO INTERSECT WEBLOG IPS AND DETERMINE THE COUNTRY
Hive does not directly support foreign keys. Nevertheless, it is still very common to join records on
identically matching keys contained in one or more tables. This recipe will show a very simple inner join
over weblog data that links each request record in the weblog_entries table to a country, based on the
request IP.
For each record contained in the weblog_entries table, the query will print the record out with an
additional trailing value showing the determined country.
Make sure you have access to a Hadoop cluster with Hive installed. This lab depends on having the
weblog_entries dataset loaded into a Hive table in lab exercise 2.
Issue the following command in Hive:
describe weblog_entries
You should see the following response:
OK
md5 string
url string
request_date string
request_time string
ip string
Additionally, this recipe requires that the ip-to-country dataset be loaded into a Hive table named
ip_to_country, with the fields mapped to the respective data types as shown below.
1. Create the target directory and copy the file ip_to_country.txt into HDFS.
[cloudera@localhost data]$ hadoop fs -mkdir -p /input/ip_to_country/
[cloudera@localhost data]$ hadoop fs -put ip_to_country.txt /input/ip_to_country/
2. Add the CREATE TABLE syntax, as follows:
[cloudera@localhost]$ vi ip-to-country.hql
DROP TABLE IF EXISTS ip_to_country;
CREATE EXTERNAL TABLE ip_to_country (
ip string,
country string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION '/input/ip_to_country';
[cloudera@localhost data]$ hive -f ip-to-country.hql
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
OK
Time taken: 0.812 seconds
OK
Time taken: 0.601 seconds
[cloudera@localhost]$ hive -e "describe ip_to_country"
Performing an inner join in Hive:
1. Open a text editor, like vi or gedit.
2. Add the following join query:
SELECT wle.*, itc.country FROM weblog_entries wle
JOIN ip_to_country itc ON wle.ip = itc.ip;
3. Save the script as weblog_simple_ip_join.hql in the active directory.
4. Run the script from the operating system shell by supplying the -f option to the Hive client. You should
see the results of the SELECT statement printed to the console, with each weblog record followed by the
resolved country. The full printout will contain all 3000 rows.
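A natural extension of the plain join is to aggregate it, for example counting requests per country. This is a hedged sketch rather than part of the lab itself:
SELECT itc.country, COUNT(*) AS request_count
FROM weblog_entries wle
JOIN ip_to_country itc ON wle.ip = itc.ip
GROUP BY itc.country
ORDER BY request_count DESC;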