HCatalog 
© Copyright 2014 GetInData. All rights reserved. Not to be reproduced without prior written consent.
Motivation 
Inside a Data-Driven Company 
■ Analysts use multiple tools for processing data 
● Java MapReduce, Hive, Pig and more
■ Analysts create many (derivative) datasets 
● Different formats e.g. CSV, JSON, Avro, ORC 
● Files in HDFS 
● Tables in Hive
We simply pick the right tool and format for each application!
Pig Analysts 
■ To process data, they need to remember 
● where a dataset is located 
● what format it has 
● what the schema is 
Hive Analysts 
■ They store popular datasets in (external) Hive tables 
■ To analyze datasets generated by Pig analysts, they need to know
● where a dataset is located 
● what format it has 
● what the schema is 
● how to load it into a Hive table or partition
Changes, Changes And Changes 
Let’s start using a more efficient format today!
NO WAY! We would have to re-write, re-test and re-deploy our applications!
This means a lot of engineering work for us!
MR, Pig, Hive And Data Storage 
■ Hive reads data location, format and schema from metadata 
● Managed by Hive Metastore 
■ MapReduce encodes them in the application code 
■ Pig specifies them in the script 
● Schema can be provided by the Loader 
Conclusion 
■ MapReduce and Pig are sensitive to metadata changes! 
Data Availability 
Jeff: Is your dataset already generated? 
Tom: Check in HDFS! 
Jeff: What is the location? 
Tom: /data/streams/20140101 
Jeff: hdfs dfs -ls /data/streams/20140101 
Not yet…. :( 
Tom: Check it later! 
HCatalog 
■ Aims to solve these problems! 
■ First of all 
● It knows the location, format and schema of our datasets! 
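The idea can be pictured with a toy lookup table. This is only a sketch of the concept in Python, not HCatalog's API: the `DatasetInfo` class, the `CATALOG` dictionary and the `resolve` function are hypothetical names, and the single entry mirrors the streams dataset used later in this deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetInfo:
    """Metadata a catalog keeps so that scripts do not have to."""
    location: str
    file_format: str
    schema: tuple  # (column name, type) pairs


# Hypothetical in-memory catalog with one registered dataset.
CATALOG = {
    "streams": DatasetInfo(
        location="/data/streams",
        file_format="TEXTFILE",
        schema=(("time", "string"), ("host", "string"), ("userId", "int"),
                ("songId", "int"), ("duration", "int")),
    ),
}


def resolve(name):
    """Look a dataset up by name instead of hard-coding its path and schema."""
    try:
        return CATALOG[name]
    except KeyError:
        raise KeyError(f"dataset {name!r} is not registered") from None


info = resolve("streams")
print(info.location, info.file_format, [column for column, _ in info.schema])
```

A script that calls `resolve("streams")` keeps working when the location or format stored in the catalog changes, which is the property the following slides rely on.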
HCatalog With Pig 
■ Traditional way 
raw = load '/data/streams/20140101' using MyLoader()
as (time:chararray, host:chararray, userId:int, songId:int, duration:int);
…
store result into '/data/users/20140101' using MyStorer();
■ HCatalog way 
raw = load 'streams' using HCatLoader();
…
store result into 'users' using HCatStorer('date=20140101');
Interacting with HCatalog 
Demo 
1. Upload a dataset to HDFS
2. Add the dataset to HCatalog
3. Access the dataset with Hive and Pig 
HCatalog CLI 
1. Upload a dataset to HDFS
$ hdfs dfs -put streams /data 
Streams: timestamp, host, userId, songId, duration 
HCatalog CLI 
2. Add the dataset to HCatalog
■ A file with HCatalog DDL can be prepared 
CREATE EXTERNAL TABLE streams(time string, host string, userId int, songId int, duration int) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE 
LOCATION '/data/streams'; 
■ And executed by hcat -f 
$ hcat -f streams.hcatalog 
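When many datasets have to be registered, the DDL file can be generated instead of written by hand. The following Python sketch renders the same statement as above from a column list; the function name `render_external_table_ddl` is hypothetical, and it assumes a tab-delimited text dataset.

```python
def render_external_table_ddl(table, columns, location, delimiter="\\t"):
    """Render a CREATE EXTERNAL TABLE statement for a delimited text dataset."""
    column_list = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table}({column_list})\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}';\n"
    )


# Columns of the streams dataset from this demo.
STREAMS_COLUMNS = [
    ("time", "string"),
    ("host", "string"),
    ("userId", "int"),
    ("songId", "int"),
    ("duration", "int"),
]

ddl = render_external_table_ddl("streams", STREAMS_COLUMNS, "/data/streams")
print(ddl)
```

The printed text could be saved to a file such as streams.hcatalog and passed to hcat -f as shown on this slide.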
HCatalog CLI 
2. Add the dataset to HCatalog
■ Describe the dataset 
$ hcat -e "describe streams" 
OK 
time string None 
host string None 
userid int None 
songid int None 
duration int None 
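A script that needs the schema can read it from this output rather than hard-coding it. The Python sketch below parses the text shown above into (name, type) pairs; `parse_describe` is a hypothetical helper, and it assumes the whitespace-separated layout printed on this slide.

```python
# Output of: hcat -e "describe streams" (copied from the slide above).
DESCRIBE_OUTPUT = """OK
time string None
host string None
userid int None
songid int None
duration int None
"""


def parse_describe(output):
    """Turn 'name type comment' lines into a list of (name, type) pairs."""
    columns = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skips the status line ("OK") and blank lines
        columns.append((parts[0], parts[1]))
    return columns


print(parse_describe(DESCRIBE_OUTPUT))
```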
HCatalog CLI 
3. Access the dataset with Pig 
raw_streams = LOAD 'streams' USING org.apache.hcatalog.pig.HCatLoader(); 
all_count = FOREACH (GROUP raw_streams ALL) GENERATE COUNT(raw_streams); 
DUMP all_count; 
$ pig -useHCatalog streams.pig 
… 
(93294) 
HCatalog CLI 
3. Access the dataset with Hive 
$ hive -e "select count(*) from streams" 
OK 
93294 
Goals 
Three Main Goals 
1. Provide an abstraction on top of datasets stored in HDFS 
● Just use the name of the dataset, not the path 
2. Enable data discovery 
● Store all datasets (and their properties) in HCatalog 
● Integrate with Web UI 
3. Provide notifications for data availability 
● Process new data immediately when it appears 
Supported Formats and Projects
Supported Projects And Formats 
Custom Formats 
■ A custom format can be supported 
● But an InputFormat, an OutputFormat and a Hive SerDe must be provided
Pig Interface - HCatLoader 
■ Consists of HCatLoader and HCatStorer 
■ HCatLoader reads data from a dataset
● Indicate which partitions to scan by following the load statement with a partition filter statement
raw = load 'streams' using HCatLoader(); 
valid = filter raw by date == '20140101' and isValid(duration);
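The effect of a partition filter can be pictured as choosing directories before any data is read. The Python sketch below is a toy model of that selection, not the loader's implementation; the `PARTITIONS` mapping and its paths are hypothetical and assume a table partitioned by date.

```python
# Hypothetical partition values and directories for a table partitioned by date.
PARTITIONS = {
    "20131231": "/data/streams/date=20131231",
    "20140101": "/data/streams/date=20140101",
    "20140102": "/data/streams/date=20140102",
}


def partitions_to_scan(partitions, predicate):
    """Return only the directories whose partition value satisfies the filter."""
    return [path for value, path in sorted(partitions.items()) if predicate(value)]


# Same condition as the Pig filter above: date == '20140101'.
selected = partitions_to_scan(PARTITIONS, lambda date: date == "20140101")
print(selected)
```

Conditions on ordinary columns, such as the duration check in the Pig example, still have to be evaluated row by row on the selected partitions.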
Pig Interface - HCatStorer 
■ HCatStorer writes data to a dataset 
● A specification of partition keys can also be provided
● Possible to write to a single partition or multiple partitions 
store valid into 'streams_valid' using HCatStorer('date=20110924');
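The partition specification is a plain key=value string. The Python sketch below shows how such a string maps to a partition directory under Hive's usual key=value directory layout. Both function names are hypothetical, the table location is made up, and the comma-separated form for several keys is an assumption rather than something stated on the slide.

```python
def parse_partition_spec(spec):
    """Parse 'key=value' pairs (assumed comma-separated) into a dict."""
    result = {}
    for pair in spec.split(","):
        key, separator, value = pair.strip().partition("=")
        if not separator or not key or not value:
            raise ValueError(f"malformed partition spec: {pair!r}")
        result[key] = value
    return result


def partition_path(table_location, spec):
    """Build the directory a partition would occupy under the table location."""
    parts = parse_partition_spec(spec)
    segments = [f"{key}={value}" for key, value in parts.items()]
    return "/".join([table_location.rstrip("/")] + segments)


print(parse_partition_spec("date=20110924"))
print(partition_path("/data/streams_valid", "date=20110924"))
```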
MapReduce Interface 
■ Consists of HCatInputFormat and HCatOutputFormat 
■ HCatInputFormat accepts a dataset to read data from 
● Optionally, indicate which partitions to scan 
■ HCatOutputFormat accepts a dataset to write to 
● Optionally, indicate which partition to write to
● Possible to write to a single partition or multiple partitions 
Hive Interface 
■ There is no Hive-specific interface 
● Hive can read information from HCatalog directly 
■ Actually, HCatalog is now a submodule of Hive 
Conclusion 
■ HCatalog enables non-Hive projects to access Hive tables 
Components 
■ Hive Metastore to store information about datasets 
● A table per dataset is created (the same as in Hive) 
■ hcat CLI 
● Create and drop tables, specify table parameters, etc 
■ Programming interfaces 
● For MapReduce and Pig 
● New ones can be implemented e.g. for Crunch 
■ WebHCat server 
● More about it later 
Features 
Data Discovery 
■ A nice web UI can be built on top of HCatalog
● You have all Hive tables there for free! 
● See Yahoo!’s illustrative example below 
Pictures come from Yahoo’s presentation at Hadoop Summit San Jose 2014
Properties Of Datasets 
■ Can store data life-cycle management information 
● Cleaning, archiving and replication tools can learn which datasets are eligible for their services
ALTER TABLE intermediate.featured_streams SET TBLPROPERTIES ('retention' = '30');
SHOW TBLPROPERTIES intermediate.featured_streams;
SHOW TBLPROPERTIES intermediate.featured_streams("retention");
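A cleaning tool could use such a property as follows. This Python sketch assumes that the retention value is a number of days and that partitions are named by date in YYYYMMDD form. Neither assumption comes from the slide, and `expired_partitions` is a hypothetical helper.

```python
from datetime import date, datetime, timedelta


def expired_partitions(partition_dates, retention_days, today):
    """Return the partitions that are older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return [
        partition
        for partition in partition_dates
        if datetime.strptime(partition, "%Y%m%d").date() < cutoff
    ]


# Table properties are stored as strings, so the value is converted before use.
table_properties = {"retention": "30"}
partitions = ["20140101", "20140120", "20140210"]

to_clean = expired_partitions(
    partitions, int(table_properties["retention"]), today=date(2014, 2, 15)
)
print(to_clean)
```

With a 30-day window and 15 February 2014 as the current date, the cutoff is 16 January 2014, so only the 20140101 partition is selected.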
Notifications 
■ Provides notifications for certain events 
● e.g. Oozie or custom Java code can wait for those events and immediately schedule tasks depending on them
■ Multiple events can trigger notifications 
● A database, a table or a partition is added 
● A set of partitions is added 
● A database, a table or a partition is dropped 
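The pattern is publish and subscribe: a consumer registers interest in an event type and reacts when the event arrives, instead of polling HDFS. The Python sketch below shows only that pattern with a toy in-process bus. The `EventBus` class, the event name `partition_added` and the payload fields are hypothetical and do not reflect HCatalog's actual notification interface.

```python
from collections import defaultdict


class EventBus:
    """Toy in-process stand-in for a notification channel."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Deliver the event to every handler registered for this type.
        for handler in self._handlers[event_type]:
            handler(payload)


scheduled = []


def schedule_downstream_job(event):
    # A real consumer would submit a workflow here; the sketch only records the intent.
    scheduled.append(f"process {event['table']}/{event['partition']}")


bus = EventBus()
bus.subscribe("partition_added", schedule_downstream_job)  # hypothetical event name
bus.publish("partition_added", {"table": "streams", "partition": "date=20140101"})
print(scheduled)
```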
Evolution Of Data 
■ Allows data producers to change how they write data 
● No need to re-write existing data 
● HCatalog can read old and new data 
● Data consumers don’t have to change their applications 
■ Data location, format and schema can be changed 
● In the case of the schema, new fields can be added
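Adding a field at the end of the schema is what lets old and new data coexist. The Python sketch below is a toy model of that behaviour: records written before the change are padded with a missing value for the new field. The `country` field and the `read_record` helper are hypothetical.

```python
OLD_SCHEMA = ("time", "host", "userId", "songId", "duration")
NEW_SCHEMA = OLD_SCHEMA + ("country",)  # hypothetical field added later


def read_record(fields, schema=NEW_SCHEMA):
    """Map a record to the current schema, filling fields it lacks with None."""
    if len(fields) > len(schema):
        raise ValueError("record has more fields than the schema")
    padded = list(fields) + [None] * (len(schema) - len(fields))
    return dict(zip(schema, padded))


old_record = read_record(["2014-01-01T00:00:00", "host1", 1, 2, 30])
new_record = read_record(["2014-02-01T00:00:00", "host2", 3, 4, 45, "PL"])
print(old_record["country"], new_record["country"])
```

Consumers that only read the original five fields see the same values for both records, so they do not need to change.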
HCatalog Beyond HDFS 
WebHCat Server 
■ Provides a REST-like web API for HCatalog 
● Send requests to get information about datasets 
curl -s 'http://hostname:port/templeton/v1/ddl/database/db_name/table/table_name?user.name=adam'
● Send requests to run Pig or Hive scripts 
■ Previously called Templeton 
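The same request can be assembled programmatically. The Python sketch below only builds the URL from the curl example on this slide; `describe_table_url` is a hypothetical helper, and the port 50111 used in the call is an assumption about the deployment, not something stated on the slide.

```python
from urllib.parse import quote, urlencode


def describe_table_url(host, port, database, table, user):
    """Build the WebHCat URL that describes a table, as in the curl example."""
    path = (
        "/templeton/v1/ddl/database/"
        f"{quote(database, safe='')}/table/{quote(table, safe='')}"
    )
    query = urlencode({"user.name": user})
    return f"http://{host}:{port}{path}?{query}"


url = describe_table_url("hostname", 50111, "db_name", "table_name", "adam")
print(url)
```

The resulting string can be passed to curl or to urllib.request.urlopen against a running WebHCat server.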
There Is More! 
■ Data engineers, architects and instructors 
■ 4+ years of experience with Apache Hadoop
● Working with Hadoop clusters of hundreds of nodes
● Developing Hadoop applications in many cool frameworks
● Delivering Hadoop training for 2+ years
■ Passionate about data 
● Speaking at conferences and meetups 
● Blogging and reviewing books 
