Integrate Apache Flink with Apache Hive
Xuefu Zhang,
-- Senior Staff Engineer, Alibaba
-- Hive PMC, Apache Member
Bowen Li
-- Senior Engineer, Alibaba
Agenda
● Background
● Goals
● Technical Overview
● Current Progress
● Demo
● Q&A
Background
● Flink has achieved impressive success in stream processing
● Its scalability and potential have been proven and pushed further by Blink, now
part of Flink
● At Alibaba, Flink is used to process extremely large amounts of data at an
unprecedented scale
1.7B events/sec · EB total · PB every day · 1T events/day
Streaming SQL
● The majority of stream analytics can be expressed in SQL
● Streaming SQL gives users a non-programming way of writing and deploying
streaming jobs
● SQL needs metadata: sources, sinks, UDFs, views, etc.
● That metadata needs a store
Streaming SQL (cont’d)
● Currently, Flink stores metadata in memory
● The metadata is ill-organized and scattered across different components
● Poor usability, interoperability, productivity, and manageability
● Problem #1: Flink lacks a well-organized, persistent store for its metadata
Batch and SQL
● Stream analytics users usually also have offline, batch analytics
● ETL is still an important use case for big data
● AI/ML is a major driving force behind both real-time and batch analytics
○ Gathering data to train and test a model, then deploying it in stream processing
● SQL is the main tool for batch processing of big data
● Unfortunately, users have to run a different engine for non-stream processing
Batch and SQL (cont’d)
● Flink has shown clear advantages over other solutions for
high-volume stream processing
● In Blink, we systematically explored Flink’s capabilities in batch processing,
and it shows great potential
(Benchmark chart) Flink is the fastest due to its pipelined execution; Tez and Spark do not overlap the 1st and 2nd stages; MapReduce is slow despite overlapping stages.
Source: A Comparative Performance Evaluation of Flink, Dongwon Kim, POSTECH, Flink Forward 2015
Batch and SQL (cont’d)
● Batch demands more of SQL capability
● It also demands even stronger metadata management
● Hive is the de facto standard for big data/batch processing on Hadoop
● The Hive metadata store is the center of the big data ecosystem
● Problem #2: Flink lacks seamless access to Hive’s metadata and data
Heterogeneous Sources/Sinks
● Whether batch or streaming, Flink usually needs to access many data systems
○ Hive
○ MySQL
○ Key-value stores
○ Kafka streams
● Each comes with a different data catalog
● Problem #3: Flink needs a unified interface to interact with different data catalogs
Beyond Flink
● Batch has a larger set of use cases than streaming
● Many Hive users are not Flink users
● We would like Hive users to benefit from Flink’s batch capabilities
● Problem #4: Flink needs a story for Hive users
Four Goals
● Define unified catalog APIs
● Implement an in-memory catalog and a persistent catalog for Flink metadata
● Implement a Hive catalog, enabling deep integration with Hive
● Provide Flink as Hive’s new execution engine (long-term)
Technical Overview
● Define unified catalog APIs (FLIP-30)
● Three implementations
○ Generic in-memory catalog
○ Generic persistent catalog (based on Hive metastore)
○ Hive catalog
● Hive data access
● Hive on Flink is not yet planned
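As a rough, non-authoritative sketch of how the three catalog implementations above might be used from the Table API (class names follow the catalogs described in this deck, but constructor and method signatures here are assumptions, since the talk predates the targeted 1.9 release):

    // Illustrative sketch only: constructor/method signatures are assumptions
    // based on the catalogs described in this deck, not a finalized API.
    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.table.catalog.GenericInMemoryCatalog;
    import org.apache.flink.table.catalog.hive.HiveCatalog;

    public class CatalogRegistrationSketch {
        public static void main(String[] args) {
            TableEnvironment tableEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inBatchMode().build());

            // Generic, non-persistent catalog (today's default behavior)
            tableEnv.registerCatalog("inmem", new GenericInMemoryCatalog("inmem"));

            // Hive-backed catalog; the conf dir and version are assumed example values
            tableEnv.registerCatalog("myhive",
                new HiveCatalog("myhive", "default", "/etc/hive/conf", "2.3.4"));

            // Unqualified names now resolve against this catalog and database
            tableEnv.useCatalog("myhive");
            tableEnv.useDatabase("default");

            tableEnv.sqlQuery("SELECT * FROM some_hive_table");
        }
    }

The SQL Client demo later in the deck presumably drives the same setup through the client’s configuration rather than through code.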
Architecture
(Diagram) The Flink stack: Flink Deployment, Flink Runtime, query processing & optimization, Table API and SQL, SQL Client/Zeppelin, plus the Catalog APIs.
Catalog APIs and Implementations
(Diagram) The Catalog APIs define ReadableCatalog, extended by ReadableWritableCatalog. Three implementations sit behind them: GenericInMemoryCatalog, plus GenericHiveMetastoreCatalog and HiveCatalog, which share HiveCatalogBase and reach the Hive Metastore through a shim layer around HiveMetastoreClient. A CatalogManager, referenced by the TableEnvironment and the SQL Client, manages the registered catalogs. (Diagram legend: inheritance vs. reference.)
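Read as a class diagram, the key split is read-only access (ReadableCatalog) versus read-write access (ReadableWritableCatalog). A minimal sketch of that split, assuming simplified method names and signatures rather than the exact FLIP-30 interfaces:

    // Simplified sketch of the read/write split in the diagram above; method
    // names and signatures are assumptions, not the exact FLIP-30 interfaces.
    import java.util.List;

    interface ReadableCatalog {
        List<String> listDatabases();
        List<String> listTables(String databaseName);
        CatalogTable getTable(ObjectPath tablePath);
        boolean tableExists(ObjectPath tablePath);
    }

    interface ReadableWritableCatalog extends ReadableCatalog {
        void createDatabase(String name, CatalogDatabase db, boolean ignoreIfExists);
        void createTable(ObjectPath tablePath, CatalogTable table, boolean ignoreIfExists);
        void alterTable(ObjectPath tablePath, CatalogTable newTable, boolean ignoreIfNotExists);
        void dropTable(ObjectPath tablePath, boolean ignoreIfNotExists);
    }

    // Placeholder types so the sketch stands on its own.
    class ObjectPath { String databaseName; String objectName; }
    class CatalogDatabase { }
    class CatalogTable { }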
Hive Data Connector
(Diagram) HiveTableFactory, a BatchTableFactory, creates the connector’s source and sink:
● Read path: HiveTableSource (a BatchTableSource) reads Hive data through HiveTableInputFormat (an InputFormat)
● Write path: HiveTableSink (a BatchTableSink) writes Hive data through HiveTableOutputFormat (an OutputFormat)
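The read path mirrors Flink’s usual layering: a factory produces a table source, and the source delegates the actual reading to an InputFormat that understands Hive’s storage; the write path is symmetric with a sink and an OutputFormat. A rough, self-contained sketch of the read side (class shapes here are stand-ins, not the real connector classes):

    // Stand-in sketch of the read path above; the real HiveTableSource and
    // HiveTableInputFormat additionally handle partition pruning, Hive SerDes,
    // and type conversion.
    import java.util.Collections;

    class HiveTableInputFormatSketch {
        // In the real connector this is an InputFormat that reads splits of a
        // Hive table (text, ORC, Parquet, ...) from its storage location.
        Iterable<Object[]> readSplits() { return Collections.emptyList(); }
    }

    class HiveTableSourceSketch {
        private final HiveTableInputFormatSketch inputFormat = new HiveTableInputFormatSketch();

        // A BatchTableSource exposes the table's rows; here it simply delegates
        // to the InputFormat, matching HiveTableSource -> HiveTableInputFormat.
        Iterable<Object[]> scan() { return inputFormat.readSplits(); }
    }

    class HiveTableFactorySketch {
        // A BatchTableFactory creates the source (and, on the write path, the sink).
        HiveTableSourceSketch createTableSource() { return new HiveTableSourceSketch(); }
    }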
Current Progress, Development Plan, and Demo
Bowen Li
Integrating Flink with Hive
This is a major change, so the work needs to be broken into parts:
Part 1. Unified Catalog APIs (FLIP-30, FLINK-11275)
Part 2. Integrate Flink with Hive (FLINK-10556)
● for metadata, through the Hive Metastore (FLINK-10744)
● for data (FLINK-10729)
Part 3. Support a complete set of SQL DDL/DML in Flink (FLINK-10232)
1 - Unified Catalog APIs
Flink current status:
○ Barely any catalog support
○ Has a separate function catalog
Our highlighted improvements:
○ Introduced new catalog APIs and framework, and connected them to Calcite
● ReadableCatalog and ReadableWritableCatalog
● Meta-objects: database, table, view, partition, function, stats, etc.
● Operations: Create/Alter/Rename/Drop/Get/List/Exists
○ Unified the function catalog with the new catalog APIs and added support for persisting functions
1 - Unified Catalog APIs
Flink current status:
○ No well-structured hierarchy yet for managing metadata
○ Needs a better SQL user experience when referencing metadata
Our highlighted improvements:
● Introduced a two-level management structure: <catalog>.<db>.<meta-object>
● Added CatalogManager to resolve object names, e.g.
select * from defaultCatalog.defaultDb.Tbl => select * from Tbl
● Made Flink case-insensitive to object names, similar to Hive, MySQL, and Oracle
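The resolution rule in the example above can be illustrated with a small, hypothetical helper: a partially qualified name is completed from the session’s current catalog and database. This is not the actual CatalogManager code, just the idea:

    // Hypothetical illustration of the name-resolution rule above, not the
    // actual CatalogManager implementation.
    class NameResolverSketch {
        private final String currentCatalog;
        private final String currentDatabase;

        NameResolverSketch(String currentCatalog, String currentDatabase) {
            this.currentCatalog = currentCatalog;
            this.currentDatabase = currentDatabase;
        }

        /** Expands "tbl", "db.tbl", or "cat.db.tbl" into a fully qualified name. */
        String[] resolve(String name) {
            String[] parts = name.split("\\.");
            switch (parts.length) {
                case 1:  return new String[] {currentCatalog, currentDatabase, parts[0]};
                case 2:  return new String[] {currentCatalog, parts[0], parts[1]};
                case 3:  return parts;
                default: throw new IllegalArgumentException("Invalid object name: " + name);
            }
        }
    }

    // new NameResolverSketch("defaultCatalog", "defaultDb").resolve("Tbl")
    //   => ["defaultCatalog", "defaultDb", "Tbl"]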
1 - Unified Catalog APIs
Flink current status:
No production-ready catalogs
Our highlighted improvements:
Developed three production-ready catalogs
■ GenericInMemoryCatalog - in-memory, non-persistent, per session
■ HiveCatalog - compatible with Hive; reads/writes Hive meta-objects
■ GenericHiveMetastoreCatalog - persists Flink streaming and batch meta-objects
1 - Unified Catalog APIs
Catalogs are pluggable, which opens up opportunities to build catalogs for:
○ Streams and MQs
● Kafka (Confluent Schema Registry), Kinesis, RabbitMQ, Pulsar, etc.
○ Structured data
● RDBMSs like MySQL, etc.
○ Semi-structured data
● Elasticsearch, HBase, Cassandra, etc.
○ Your other favorite data management systems
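Because the API is pluggable, a catalog for any of these systems is just another implementation. A hypothetical skeleton for, say, a schema-registry-backed catalog (purely illustrative; no such catalog is described in this talk, and in the real API it would implement ReadableCatalog):

    // Hypothetical skeleton of a read-only catalog backed by a schema registry.
    // Purely illustrative; in the real API it would implement ReadableCatalog.
    import java.util.Collections;
    import java.util.List;

    class SchemaRegistryCatalogSketch {
        private final String registryUrl;  // e.g. "http://schema-registry:8081" (assumed)

        SchemaRegistryCatalogSketch(String registryUrl) {
            this.registryUrl = registryUrl;
        }

        List<String> listDatabases() {
            // A schema registry has no database concept; expose a single default one.
            return Collections.singletonList("default");
        }

        List<String> listTables(String databaseName) {
            // Would call the registry's REST API and return one "table" per subject.
            return Collections.emptyList();
        }

        String getTableSchema(String tableName) {
            // Would fetch the subject's latest schema and map it to Flink types.
            throw new UnsupportedOperationException("sketch only: " + registryUrl);
        }
    }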
2 - Flink-Hive Integration - Metadata - HiveCatalog
Our highlighted improvements:
Developed HiveCatalog, via which Flink can
● read Hive meta-objects such as tables, views, functions, and stats
● create and write Hive meta-objects to the Hive Metastore so that Hive can consume them
Flink can read and write Hive metadata through HiveCatalog
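In other words, once a HiveCatalog is the current catalog, a table created from Flink lands in the Hive Metastore in a form Hive itself can query. A hedged sketch of what that could look like (DDL support and exact syntax were still being finalized around the 1.9 timeframe, so treat this as illustrative, with assumed conf dir and version values):

    // Hedged sketch: DDL support and syntax around the 1.9 timeframe may differ;
    // the point is that the table definition is written into the Hive Metastore
    // so Hive can consume it too.
    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.table.catalog.hive.HiveCatalog;

    public class HiveVisibleTableSketch {
        public static void main(String[] args) {
            TableEnvironment tableEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inBatchMode().build());

            tableEnv.registerCatalog("myhive",
                new HiveCatalog("myhive", "default", "/etc/hive/conf", "2.3.4"));
            tableEnv.useCatalog("myhive");
            tableEnv.useDatabase("default");

            tableEnv.sqlUpdate(
                "CREATE TABLE page_views ("
                + "  user_id BIGINT,"
                + "  url STRING,"
                + "  view_time TIMESTAMP"
                + ")");

            // On the Hive side, e.g. `DESCRIBE page_views;` would now show the table.
        }
    }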
2 - Flink-Hive Integration - Metadata - GenericHiveMetastoreCatalog
Our highlighted improvements:
● Persisted Flink’s metadata (both streaming and batch) by using the Hive Metastore
purely as storage
HiveCatalog vs. GenericHiveMetastoreCatalog
● HiveCatalog: for Hive batch metadata; meta-objects Hive can understand
● GenericHiveMetastoreCatalog: for any streaming and batch metadata; meta-objects Hive may not understand
● Both are backed by the Hive Metastore
2. Flink-Hive Integration - Data
Our highlighted improvements:
Connector:
○ Developed source and sink to read/write partitioned and non-partitioned tables and views
○ Supported partition pruning
Data Types:
○ Support for all Hive simple and complex (array, map, struct) data types
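Supporting all Hive types means each Hive type has to be mapped onto a Flink SQL type. A toy, hypothetical version of that mapping (the real connector works on Flink’s type-system objects and handles nesting, precision, and many more cases):

    // Toy illustration of Hive-to-Flink type mapping; the real connector maps
    // onto Flink's type system objects and handles nested complex types.
    class HiveTypeMappingSketch {
        static String toFlinkType(String hiveType) {
            String t = hiveType.toLowerCase();
            // Complex types such as array<int>, map<string,int>, struct<...>
            // are mapped recursively in the real implementation.
            if (t.startsWith("array<") || t.startsWith("map<") || t.startsWith("struct<")) {
                return "ARRAY / MAP / ROW (element types mapped recursively)";
            }
            switch (t) {
                case "tinyint":  return "TINYINT";
                case "smallint": return "SMALLINT";
                case "int":      return "INT";
                case "bigint":   return "BIGINT";
                case "float":    return "FLOAT";
                case "double":   return "DOUBLE";
                case "boolean":  return "BOOLEAN";
                case "string":   return "STRING";
                default:
                    throw new IllegalArgumentException("Unmapped Hive type: " + hiveType);
            }
        }
    }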
2. Flink-Hive Integration - User-Defined Functions and Version Compatibility
● Hive user-defined functions
■ Supported Hive UDF
■ Working on support for Hive GenericUDF, UDTF, and UDAF
● Hive versions
■ Currently supports Hive 2.3.4 and 1.2.2 via shimming
■ Relies on Hive’s backward compatibility for 2.x and 1.x
● Working on direct support for more Hive versions, e.g. 2.1.1, 1.2.1
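The shim approach hides version-specific Metastore client calls behind a single interface, with one implementation per supported Hive version chosen at runtime. A schematic sketch of that pattern (names are illustrative, not Flink’s actual shim classes):

    // Schematic sketch of the shim pattern used for Hive version compatibility;
    // class and method names are illustrative, not Flink's actual shim code.
    interface HiveShimSketch {
        // Each Hive major version builds its metastore client differently.
        Object createMetastoreClient(Object hiveConf);
    }

    class HiveShim1x implements HiveShimSketch {
        @Override
        public Object createMetastoreClient(Object hiveConf) {
            // Would invoke the Hive 1.x client constructor / factory method.
            throw new UnsupportedOperationException("sketch only");
        }
    }

    class HiveShim2x implements HiveShimSketch {
        @Override
        public Object createMetastoreClient(Object hiveConf) {
            // Would invoke the Hive 2.x factory, whose signature differs from 1.x.
            throw new UnsupportedOperationException("sketch only");
        }
    }

    class HiveShimLoaderSketch {
        static HiveShimSketch load(String hiveVersion) {
            if (hiveVersion.startsWith("1.")) return new HiveShim1x();
            if (hiveVersion.startsWith("2.")) return new HiveShim2x();
            throw new UnsupportedOperationException("Unsupported Hive version: " + hiveVersion);
        }
    }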
Timeline
First targeted Flink release: 1.9.0, June 2019
Demo with Flink SQL CLI
• Query Hive Metadata
• Create Hive Source/Sink with HiveCatalog to read/write data
• Create CSV Source/Sink with GenericHiveMetastoreCatalog to read/write data
This tremendous amount of work could not happen without help and support.
Shout out to everyone in the community and on our team
who has been helping us with designs, code, feedback, etc.!
Conclusions
● Flink is good at stream processing, but batch processing is equally important
● Flink has shown its potential in batch processing
● Flink/Hive integration benefits both communities
● This is a big effort
● We are taking a phased approach
● Your contribution is greatly welcome and appreciated!
Call for Sponsors
Flink Forward China, Beijing, Dec 2019!
All major Chinese tech companies will attend.
Expected attendees: 3,000+
Reach out to flink-forward-china@list.alibaba-inc.com for details!
Thanks!