Integrate Apache Flink with Apache Hive
Xuefu Zhang,
-- Senior Staff Engineer, Alibaba
-- Hive PMC, Apache Member
Bowen Li
-- Senior Engineer, Alibaba
Agenda
● Background
● Goals
● Technical Overview
● Current Progress
● Demo
● Q&A
Background
● Flink has achieved impressive success in stream processing
● Its scalability and potential have been proven and pushed further by Blink, now
part of Flink
● At Alibaba, Flink is used to process extremely large amounts of data at an
unprecedented scale
1.7B events/sec · EB total · PB every day · 1T events/day
Streaming SQL
● The majority of stream analytics can be expressed in SQL
● Streaming SQL gives users a non-programming way of writing and deploying
streaming jobs
● SQL needs metadata: sources, sinks, UDFs, views, etc.
● That metadata needs a store
Streaming SQL (cont’d)
● Currently, Flink stores metadata in memory
● The metadata is ill-organized and scattered across different components
● Poor usability, interoperability, productivity, and manageability
● Problem #1: Flink lacks a well-organized, persistent store for its metadata
Batch and SQL
● Stream analytics users usually also have offline, batch analytics
● ETL is still an important use case for big data
● AI/ML is a major driving force behind both real-time and batch analytics
○ Gathering data to train and test a model, then deploying it in stream processing
● SQL is the main tool for batch processing of big data
● Unfortunately, users have to run a different engine for non-stream processing
Batch and SQL (cont’d)
● Flink has shown clear advantages over other solutions for
high-volume stream processing
● In Blink, we systematically explored Flink’s capabilities in batch processing,
and it shows great potential
(Benchmark chart) Flink is the fastest due to its pipelined execution; Tez and Spark do not overlap the 1st and 2nd stages; MapReduce is slow despite overlapping stages.
Source: A Comparative Performance Evaluation of Flink, Dongwon Kim, POSTECH, Flink Forward 2015
Batch and SQL (cont’d)
● Batch demands more of SQL capability
● It also demands even stronger metadata management
● Hive is the de facto standard for big data/batch processing on Hadoop
● The Hive metadata store is the center of the big data ecosystem
● Problem #2: Flink lacks seamless access to Hive’s metadata and data
Heterogeneous Sources/Sinks
● Whether batch or streaming, Flink usually needs to access many data systems
○ Hive
○ MySQL
○ Key-value stores
○ Kafka streams
● Each comes with a different data catalog
● Problem #3: Flink needs a unified interface to interact with different data catalogs
Beyond Flink
● Batch has a larger set of use cases than streaming
● Many Hive users are not Flink users
● We would like Hive users to benefit from Flink’s batch capabilities
● Problem #4: Flink needs a story for Hive users
Four Goals
● Define unified catalog APIs
● Implement an in-memory catalog and a persistent catalog for Flink metadata
● Implement a Hive catalog, enabling deep integration with Hive
● Provide Flink as Hive’s new execution engine (long-term)
Technical Overview
● Define unified catalog APIs (FLIP-30)
● Three implementations
○ Generic in-memory catalog
○ Generic persistent catalog (based on Hive metastore)
○ Hive catalog
● Hive data access
● Hive on Flink is not yet planned
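As a rough, non-authoritative sketch of how the three catalog implementations above might be used from the Table API (class names follow the catalogs described in this deck, but constructor and method signatures here are assumptions, since the talk predates the targeted 1.9 release):

    // Illustrative sketch only: constructor/method signatures are assumptions
    // based on the catalogs described in this deck, not a finalized API.
    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.table.catalog.GenericInMemoryCatalog;
    import org.apache.flink.table.catalog.hive.HiveCatalog;

    public class CatalogRegistrationSketch {
        public static void main(String[] args) {
            TableEnvironment tableEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inBatchMode().build());

            // Generic, non-persistent catalog (today's default behavior)
            tableEnv.registerCatalog("inmem", new GenericInMemoryCatalog("inmem"));

            // Hive-backed catalog; the conf dir and version are assumed example values
            tableEnv.registerCatalog("myhive",
                new HiveCatalog("myhive", "default", "/etc/hive/conf", "2.3.4"));

            // Unqualified names now resolve against this catalog and database
            tableEnv.useCatalog("myhive");
            tableEnv.useDatabase("default");

            tableEnv.sqlQuery("SELECT * FROM some_hive_table");
        }
    }

The SQL Client demo later in the deck presumably drives the same setup through the client’s configuration rather than through code.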
Architecture
(Diagram) The Flink stack: Flink Deployment, Flink Runtime, query processing & optimization, Table API and SQL, SQL Client/Zeppelin, plus the Catalog APIs.
Catalog APIs and Implementations
(Diagram) The Catalog APIs define ReadableCatalog, extended by ReadableWritableCatalog. Three implementations sit behind them: GenericInMemoryCatalog, plus GenericHiveMetastoreCatalog and HiveCatalog, which share HiveCatalogBase and reach the Hive Metastore through a shim layer around HiveMetastoreClient. A CatalogManager, referenced by the TableEnvironment and the SQL Client, manages the registered catalogs. (Diagram legend: inheritance vs. reference.)
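Read as a class diagram, the key split is read-only access (ReadableCatalog) versus read-write access (ReadableWritableCatalog). A minimal sketch of that split, assuming simplified method names and signatures rather than the exact FLIP-30 interfaces:

    // Simplified sketch of the read/write split in the diagram above; method
    // names and signatures are assumptions, not the exact FLIP-30 interfaces.
    import java.util.List;

    interface ReadableCatalog {
        List<String> listDatabases();
        List<String> listTables(String databaseName);
        CatalogTable getTable(ObjectPath tablePath);
        boolean tableExists(ObjectPath tablePath);
    }

    interface ReadableWritableCatalog extends ReadableCatalog {
        void createDatabase(String name, CatalogDatabase db, boolean ignoreIfExists);
        void createTable(ObjectPath tablePath, CatalogTable table, boolean ignoreIfExists);
        void alterTable(ObjectPath tablePath, CatalogTable newTable, boolean ignoreIfNotExists);
        void dropTable(ObjectPath tablePath, boolean ignoreIfNotExists);
    }

    // Placeholder types so the sketch stands on its own.
    class ObjectPath { String databaseName; String objectName; }
    class CatalogDatabase { }
    class CatalogTable { }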
Hive Data Connector
(Diagram) HiveTableFactory, a BatchTableFactory, creates the connector’s source and sink:
● Read path: HiveTableSource (a BatchTableSource) reads Hive data through HiveTableInputFormat (an InputFormat)
● Write path: HiveTableSink (a BatchTableSink) writes Hive data through HiveTableOutputFormat (an OutputFormat)
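The read path mirrors Flink’s usual layering: a factory produces a table source, and the source delegates the actual reading to an InputFormat that understands Hive’s storage; the write path is symmetric with a sink and an OutputFormat. A rough, self-contained sketch of the read side (class shapes here are stand-ins, not the real connector classes):

    // Stand-in sketch of the read path above; the real HiveTableSource and
    // HiveTableInputFormat additionally handle partition pruning, Hive SerDes,
    // and type conversion.
    import java.util.Collections;

    class HiveTableInputFormatSketch {
        // In the real connector this is an InputFormat that reads splits of a
        // Hive table (text, ORC, Parquet, ...) from its storage location.
        Iterable<Object[]> readSplits() { return Collections.emptyList(); }
    }

    class HiveTableSourceSketch {
        private final HiveTableInputFormatSketch inputFormat = new HiveTableInputFormatSketch();

        // A BatchTableSource exposes the table's rows; here it simply delegates
        // to the InputFormat, matching HiveTableSource -> HiveTableInputFormat.
        Iterable<Object[]> scan() { return inputFormat.readSplits(); }
    }

    class HiveTableFactorySketch {
        // A BatchTableFactory creates the source (and, on the write path, the sink).
        HiveTableSourceSketch createTableSource() { return new HiveTableSourceSketch(); }
    }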
Current Progress, Development Plan, and Demo
Bowen Li
Integrating Flink with Hive
This is a major change, so the work needs to be broken into parts:
Part 1. Unified Catalog APIs (FLIP-30, FLINK-11275)
Part 2. Integrate Flink with Hive (FLINK-10556)
● for metadata, through the Hive Metastore (FLINK-10744)
● for data (FLINK-10729)
Part 3. Support a complete set of SQL DDL/DML in Flink (FLINK-10232)
1 - Unified Catalog APIs
Flink current status:
○ Barely any catalog support
○ Has a separate function catalog
Our highlighted improvements:
○ Introduced new catalog APIs and framework, and connected them to Calcite
● ReadableCatalog and ReadableWritableCatalog
● Meta-objects: database, table, view, partition, function, stats, etc.
● Operations: Create/Alter/Rename/Drop/Get/List/Exists
○ Unified the function catalog with the new catalog APIs and added support for persisting functions
1 - Unified Catalog APIs
Flink current status:
○ No well-structured hierarchy yet for managing metadata
○ Needs a better SQL user experience when referencing metadata
Our highlighted improvements:
● Introduced a two-level management structure: <catalog>.<db>.<meta-object>
● Added CatalogManager to resolve object names, e.g.
select * from defaultCatalog.defaultDb.Tbl => select * from Tbl
● Made Flink case-insensitive to object names, similar to Hive, MySQL, and Oracle
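The resolution rule in the example above can be illustrated with a small, hypothetical helper: a partially qualified name is completed from the session’s current catalog and database. This is not the actual CatalogManager code, just the idea:

    // Hypothetical illustration of the name-resolution rule above, not the
    // actual CatalogManager implementation.
    class NameResolverSketch {
        private final String currentCatalog;
        private final String currentDatabase;

        NameResolverSketch(String currentCatalog, String currentDatabase) {
            this.currentCatalog = currentCatalog;
            this.currentDatabase = currentDatabase;
        }

        /** Expands "tbl", "db.tbl", or "cat.db.tbl" into a fully qualified name. */
        String[] resolve(String name) {
            String[] parts = name.split("\\.");
            switch (parts.length) {
                case 1:  return new String[] {currentCatalog, currentDatabase, parts[0]};
                case 2:  return new String[] {currentCatalog, parts[0], parts[1]};
                case 3:  return parts;
                default: throw new IllegalArgumentException("Invalid object name: " + name);
            }
        }
    }

    // new NameResolverSketch("defaultCatalog", "defaultDb").resolve("Tbl")
    //   => ["defaultCatalog", "defaultDb", "Tbl"]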
1 - Unified Catalog APIs
Flink current status:
No production-ready catalogs
Our highlighted improvements:
Developed three production-ready catalogs
■ GenericInMemoryCatalog - in-memory, non-persistent, per session
■ HiveCatalog - compatible with Hive; reads/writes Hive meta-objects
■ GenericHiveMetastoreCatalog - persists Flink streaming and batch meta-objects
1 - Unified Catalog APIs
Catalogs are pluggable, which opens up opportunities to build catalogs for:
○ Streams and MQs
● Kafka (Confluent Schema Registry), Kinesis, RabbitMQ, Pulsar, etc.
○ Structured data
● RDBMSs like MySQL, etc.
○ Semi-structured data
● Elasticsearch, HBase, Cassandra, etc.
○ Your other favorite data management systems
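Because the API is pluggable, a catalog for any of these systems is just another implementation. A hypothetical skeleton for, say, a schema-registry-backed catalog (purely illustrative; no such catalog is described in this talk, and in the real API it would implement ReadableCatalog):

    // Hypothetical skeleton of a read-only catalog backed by a schema registry.
    // Purely illustrative; in the real API it would implement ReadableCatalog.
    import java.util.Collections;
    import java.util.List;

    class SchemaRegistryCatalogSketch {
        private final String registryUrl;  // e.g. "http://schema-registry:8081" (assumed)

        SchemaRegistryCatalogSketch(String registryUrl) {
            this.registryUrl = registryUrl;
        }

        List<String> listDatabases() {
            // A schema registry has no database concept; expose a single default one.
            return Collections.singletonList("default");
        }

        List<String> listTables(String databaseName) {
            // Would call the registry's REST API and return one "table" per subject.
            return Collections.emptyList();
        }

        String getTableSchema(String tableName) {
            // Would fetch the subject's latest schema and map it to Flink types.
            throw new UnsupportedOperationException("sketch only: " + registryUrl);
        }
    }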
2 - Flink-Hive Integration - Metadata - HiveCatalog
Our highlighted improvements:
Developed HiveCatalog, via which Flink can
● read Hive meta-objects such as tables, views, functions, and stats
● create and write Hive meta-objects to the Hive Metastore so that Hive can consume them
Flink can read and write Hive metadata through HiveCatalog
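In other words, once a HiveCatalog is the current catalog, a table created from Flink lands in the Hive Metastore in a form Hive itself can query. A hedged sketch of what that could look like (DDL support and exact syntax were still being finalized around the 1.9 timeframe, so treat this as illustrative, with assumed conf dir and version values):

    // Hedged sketch: DDL support and syntax around the 1.9 timeframe may differ;
    // the point is that the table definition is written into the Hive Metastore
    // so Hive can consume it too.
    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.table.catalog.hive.HiveCatalog;

    public class HiveVisibleTableSketch {
        public static void main(String[] args) {
            TableEnvironment tableEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inBatchMode().build());

            tableEnv.registerCatalog("myhive",
                new HiveCatalog("myhive", "default", "/etc/hive/conf", "2.3.4"));
            tableEnv.useCatalog("myhive");
            tableEnv.useDatabase("default");

            tableEnv.sqlUpdate(
                "CREATE TABLE page_views ("
                + "  user_id BIGINT,"
                + "  url STRING,"
                + "  view_time TIMESTAMP"
                + ")");

            // On the Hive side, e.g. `DESCRIBE page_views;` would now show the table.
        }
    }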
2 - Flink-Hive Integration - Metadata - GenericHiveMetastoreCatalog
Our highlighted improvements:
● Persisted Flink’s metadata (both streaming and batch) by using the Hive Metastore
purely as storage
HiveCatalog vs. GenericHiveMetastoreCatalog
● HiveCatalog: for Hive batch metadata; meta-objects Hive can understand
● GenericHiveMetastoreCatalog: for any streaming and batch metadata; meta-objects Hive may not understand
● Both are backed by the Hive Metastore
2. Flink-Hive Integration - Data
Our highlighted improvements:
Connector:
○ Developed source and sink to read/write partitioned and non-partitioned tables and views
○ Supported partition pruning
Data Types:
○ Support for all Hive simple and complex (array, map, struct) data types
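Supporting all Hive types means each Hive type has to be mapped onto a Flink SQL type. A toy, hypothetical version of that mapping (the real connector works on Flink’s type-system objects and handles nesting, precision, and many more cases):

    // Toy illustration of Hive-to-Flink type mapping; the real connector maps
    // onto Flink's type system objects and handles nested complex types.
    class HiveTypeMappingSketch {
        static String toFlinkType(String hiveType) {
            String t = hiveType.toLowerCase();
            // Complex types such as array<int>, map<string,int>, struct<...>
            // are mapped recursively in the real implementation.
            if (t.startsWith("array<") || t.startsWith("map<") || t.startsWith("struct<")) {
                return "ARRAY / MAP / ROW (element types mapped recursively)";
            }
            switch (t) {
                case "tinyint":  return "TINYINT";
                case "smallint": return "SMALLINT";
                case "int":      return "INT";
                case "bigint":   return "BIGINT";
                case "float":    return "FLOAT";
                case "double":   return "DOUBLE";
                case "boolean":  return "BOOLEAN";
                case "string":   return "STRING";
                default:
                    throw new IllegalArgumentException("Unmapped Hive type: " + hiveType);
            }
        }
    }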
2. Flink-Hive Integration - User-Defined Functions and Version Compatibility
● Hive user-defined functions
■ Supported Hive UDF
■ Working on support for Hive GenericUDF, UDTF, and UDAF
● Hive versions
■ Currently supports Hive 2.3.4 and 1.2.2 via shimming
■ Relies on Hive’s backward compatibility for 2.x and 1.x
● Working on direct support for more Hive versions, e.g. 2.1.1, 1.2.1
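The shim approach hides version-specific Metastore client calls behind a single interface, with one implementation per supported Hive version chosen at runtime. A schematic sketch of that pattern (names are illustrative, not Flink’s actual shim classes):

    // Schematic sketch of the shim pattern used for Hive version compatibility;
    // class and method names are illustrative, not Flink's actual shim code.
    interface HiveShimSketch {
        // Each Hive major version builds its metastore client differently.
        Object createMetastoreClient(Object hiveConf);
    }

    class HiveShim1x implements HiveShimSketch {
        @Override
        public Object createMetastoreClient(Object hiveConf) {
            // Would invoke the Hive 1.x client constructor / factory method.
            throw new UnsupportedOperationException("sketch only");
        }
    }

    class HiveShim2x implements HiveShimSketch {
        @Override
        public Object createMetastoreClient(Object hiveConf) {
            // Would invoke the Hive 2.x factory, whose signature differs from 1.x.
            throw new UnsupportedOperationException("sketch only");
        }
    }

    class HiveShimLoaderSketch {
        static HiveShimSketch load(String hiveVersion) {
            if (hiveVersion.startsWith("1.")) return new HiveShim1x();
            if (hiveVersion.startsWith("2.")) return new HiveShim2x();
            throw new UnsupportedOperationException("Unsupported Hive version: " + hiveVersion);
        }
    }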
Timeline
First targeted Flink release: 1.9.0, June 2019
Demo with Flink SQL CLI
• Query Hive Metadata
• Create Hive Source/Sink with HiveCatalog to read/write data
• Create CSV Source/Sink with GenericHiveMetastoreCatalog to read/write data
This tremendous amount of work could not happen without help and support.
Shout out to everyone in the community and on our team
who has been helping us with designs, code, feedback, etc.!
Conclusions
● Flink is good at stream processing, but batch processing is equally important
● Flink has shown its potential in batch processing
● Flink/Hive integration benefits both communities
● This is a big effort
● We are taking a phased approach
● Your contribution is greatly welcome and appreciated!
Call for Sponsors
Flink Forward China, Beijing, Dec 2019!
All major Chinese tech companies will attend.
Expected attendees: 3,000+
Reach out to flink-forward-china@list.alibaba-inc.com for details!
Thanks!