Introduction to Apache Hive (and HCatalog)
Mark Grover
github.com/markgrover/nyc-hug-hive

Me!
•  Contributor to Apache Hive
•  Section Author of O’Reilly’s Programming Hive book
•  Software Developer at Cloudera
•  @mark_grover
•  mgrover@cloudera.com
•  https://github.com/markgrover/nyc-hug-hive
  
Agenda
•  What is Hive?
•  Why use Hive?
•  Hive features
•  Hive architecture
•  HCatalog
•  Demo!
  
Preamble
•  This is a remote talk
•  Feel free to ask questions any time!
  
Hive
•  Data warehouse system for Hadoop
•  Enables Extract/Transform/Load (ETL)
•  Associate structure with a variety of data formats
    •  Logical Table -> Physical Location
    •  Logical Table -> Physical Data Format Handler (SerDe)
•  Integrates with HDFS, HBase, MongoDB, etc.
•  Query execution in MapReduce
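
The mapping from a logical table to a physical location and a format handler can be pictured with a small Python sketch. This is not Hive code: the `Table` class, the `delimited_serde` function, and the sample path and columns are hypothetical names chosen for illustration. What it aims to show is that the structure lives in table metadata and is applied only when the raw data is read.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List


def delimited_serde(line: str, columns: List[str], sep: str = "\t") -> Dict[str, str]:
    """Hypothetical stand-in for a SerDe: turn one raw line into a named row."""
    return dict(zip(columns, line.rstrip("\n").split(sep)))


@dataclass
class Table:
    """Hypothetical logical table: a name plus pointers to location and format."""
    name: str
    location: str                 # where the raw files live (e.g. an HDFS path)
    columns: List[str]            # structure associated with the raw data
    serde: Callable[[str, List[str]], Dict[str, str]]

    def read(self, raw_lines: Iterable[str]) -> Iterator[Dict[str, str]]:
        # Structure is applied at read time; the raw data itself is untouched.
        for line in raw_lines:
            yield self.serde(line, self.columns)


page_views = Table("page_views", "/data/page_views", ["user_id", "url"], delimited_serde)
raw = ["42\t/home\n", "7\t/cart\n"]   # pretend these lines came from files at `location`
rows = list(page_views.read(raw))
```

Swapping `delimited_serde` for another function would let the same logical table sit on top of a different physical format, which is roughly the role a SerDe plays.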
  
Why use Hive?
•  MapReduce caters to developers
•  Run SQL-like queries that get compiled and run as MapReduce jobs
•  Data in Hadoop, though generally unstructured, has some loose structure associated with it
•  Benefits of MapReduce + HDFS (Hadoop)
    •  Fault tolerant
    •  Robust
    •  Scalable
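
To make "SQL-like queries compiled to MapReduce jobs" concrete, here is a rough Python sketch of how a query such as `SELECT url, COUNT(*) FROM page_views GROUP BY url` could be pictured as map, shuffle, and reduce steps. The table and column names are hypothetical, and the plan Hive actually generates is more involved than this.

```python
from collections import defaultdict


def map_phase(rows):
    # Emit (group key, 1) for each input row, as a mapper would.
    for row in rows:
        yield row["url"], 1


def shuffle(pairs):
    # Collect all values for the same key, as the framework does
    # between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    # Sum the values for each key, producing one output row per group.
    return {key: sum(values) for key, values in grouped.items()}


rows = [{"url": "/home"}, {"url": "/cart"}, {"url": "/home"}]
counts = reduce_phase(shuffle(map_phase(rows)))
```

With Hive, the one-line query replaces all three hand-written functions.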
  
Hive features
•  Create table, create view, create index - DDL
•  Select, where clause, group by, order by, joins
•  Pluggable User Defined Functions - UDFs (e.g. from_unixtime)
•  Pluggable User Defined Aggregate Functions - UDAFs (e.g. count, avg)
•  Pluggable User Defined Table Generating Functions - UDTFs (e.g. explode)
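
The three function kinds differ mainly in how many rows go in and come out. The Python sketch below mimics those shapes only; the `*_like` names are hypothetical, real Hive functions are written against Hive's Java interfaces, and formatting in UTC is an assumption made here to keep the output deterministic.

```python
from datetime import datetime, timezone


def from_unixtime_like(seconds):
    # UDF shape: one input value per row -> one output value per row.
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def avg_like(values):
    # UDAF shape: many input rows -> one output value.
    values = list(values)
    return sum(values) / len(values)


def explode_like(array):
    # UDTF shape: one input row -> zero or more output rows.
    for item in array:
        yield (item,)
```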
  
Hive features
•  Pluggable custom Input/Output format
•  Pluggable Serialization/Deserialization libraries (SerDes)
•  Pluggable custom map and reduce scripts
  
What Hive does NOT support
•  OLTP workloads - low latency
•  Correlated subqueries
•  Not super performant with small amounts of data
    •  How much data do you need to call it “Big Data”?
  
Other Hive features
•  Partitioning
•  Sampling
•  Bucketing
•  Various join optimizations
•  Integration with HBase and other storage handlers
•  Views – Unmaterialized
•  Complex data types – arrays, structs, maps
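
Partitioning and bucketing both decide where a row's data ends up. The Python sketch below illustrates the idea under stated assumptions: the function names and the warehouse path are hypothetical, and CRC32 is only a stand-in, since Hive uses its own hash function for bucketing.

```python
import zlib


def partition_path(table_location, partition):
    # Partitioning: one directory level per partition column, in key=value form.
    parts = [f"{key}={value}" for key, value in partition.items()]
    return "/".join([table_location.rstrip("/")] + parts)


def bucket_for(value, num_buckets):
    # Bucketing: hash the bucketing column and take it modulo the bucket count.
    # CRC32 is a stand-in here, not the hash Hive applies.
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


path = partition_path("/warehouse/page_views", {"dt": "2013-01-01", "country": "US"})
bucket = bucket_for("user-42", 4)
```

A query that filters on a partition column only needs to read the matching directories, while bucketing spreads rows of one partition across a fixed number of files.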
  
Connecting to Hive
•  Hive Shell
•  JDBC driver
•  ODBC driver
•  Thrift client
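
For the JDBC route, the connection string is the main thing a client has to get right. A minimal sketch, assuming a HiveServer2 endpoint, which uses the `jdbc:hive2://` scheme and port 10000 by default; the helper name and the host are hypothetical.

```python
def hiveserver2_jdbc_url(host, port=10000, database="default"):
    # Build a HiveServer2 JDBC URL of the form jdbc:hive2://host:port/database.
    return f"jdbc:hive2://{host}:{port}/{database}"


url = hiveserver2_jdbc_url("hive.example.com")
```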
  
Hive metastore
•  Backed by RDBMS
•  Derby, MySQL, PostgreSQL, etc. supported
•  Default: Embedded Derby
    •  Not recommended for anything but a quick Proof of Concept
•  3 different modes of operation:
    •  Embedded Derby (default)
    •  Local
    •  Remote
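
The three modes can be told apart from two Hive configuration properties, `hive.metastore.uris` and `javax.jdo.option.ConnectionURL`. The Python sketch below is a simplified classifier written for illustration, not Hive's own logic; the function name and the hosts in the usage lines are hypothetical.

```python
# Connection URL Hive falls back to when nothing else is configured.
EMBEDDED_DERBY_URL = "jdbc:derby:;databaseName=metastore_db;create=true"


def metastore_mode(conf):
    # Remote: the client talks to a separate metastore service over Thrift.
    if conf.get("hive.metastore.uris"):
        return "remote"
    url = conf.get("javax.jdo.option.ConnectionURL", EMBEDDED_DERBY_URL)
    # Embedded: the Derby database runs inside the Hive process itself.
    # (A "jdbc:derby://host..." URL points at a Derby network server instead.)
    if url.startswith("jdbc:derby:") and not url.startswith("jdbc:derby://"):
        return "embedded"
    # Local: metastore code runs in the Hive process, the database is external.
    return "local"


default_mode = metastore_mode({})
local_mode = metastore_mode({"javax.jdo.option.ConnectionURL": "jdbc:mysql://dbhost/metastore"})
remote_mode = metastore_mode({"hive.metastore.uris": "thrift://metahost:9083"})
```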
  
Hive architecture

Hive Remote Mode

Hive server

Problems with Hive Server
•  No sessions/concurrency
•  Essentially need 1 server per client
•  Security
•  Auditing/Logging
  
Hive server 2

Hive architecture
•  Compiler
    •  Parser
    •  Type checking
    •  Semantic Analyzer
    •  Plan Generation
    •  Task Generation
  
Hive architecture
•  Execution Engine
    •  Plan
    •  Operators
    •  SerDes
    •  UDFs/UDAFs/UDTFs
•  Metastore
    •  Stores schema of data
    •  HCatalog
  
Architecture Summary
•  Use remote metastore service for sharing the metastore with HCatalog and other tools
•  Use HiveServer2 for concurrent queries
  
Hive Metastore Remote Mode

HCatalog
•  Sub-component of Hive
•  Table and storage management service
•  Public APIs and web service wrappers for accessing metadata in the Hive metastore
•  Metastore contains information of interest to other tools (Pig, MapReduce jobs)
•  Exposes that information through a REST interface
•  WebHCat: web server for interacting with the Hive metastore
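
As a rough illustration of the REST interface, the Python sketch below builds the URL a client could send a GET request to in order to list the tables of a database through WebHCat. It assumes WebHCat's default port 50111 and its `templeton/v1` path prefix; the helper name, host, and user are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode


def webhcat_list_tables_url(host, database="default", user="hive", port=50111):
    # WebHCat takes the calling user as a "user.name" query parameter.
    query = urlencode({"user.name": user})
    return f"http://{host}:{port}/templeton/v1/ddl/database/{database}/table?{query}"


url = webhcat_list_tables_url("webhcat.example.com", user="mark")
```

Any HTTP client can then fetch that URL, which is what lets tools outside Hive read metastore information.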
  	
  
WebHCat
[Diagram with REST interfaces]
Applications of Hive
•  Web Analytics
•  Retail
•  Healthcare
•  Spam detection
•  Data Mining
•  Ad optimization
•  ETL workloads
  
How to install Hive
•  Download Apache Hive tarball
•  Use Apache Bigtop packages
•  Use a Demo VM
  
Want to learn more about Hive?

Contact info
@mark_grover
github.com/markgrover
linkedin.com/in/grovermark
mgrover@cloudera.com
  
	
  
	
  
