TIMO WALTHER, SOFTWARE ENGINEER
FLINK FORWARD, BERLIN
SEPTEMBER 5, 2018
FLINK SQL IN ACTION
© 2018 data Artisans
ABOUT DATA ARTISANS
Original creators of
Apache Flink®
Open Source Apache Flink
+ dA Application Manager
+ dA Streaming Ledger
BIG APACHE FLINK SQL USERS
FLINK’S POWERFUL ABSTRACTIONS
Process Function (events, state, time): Stateful Event-Driven Applications
DataStream API (streams, windows): Stream- & Batch Data Processing
SQL / Table API (dynamic tables): High-level Analytics API
val stats = stream
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .reduce((a, b) => a.add(b))
def processElement(event: MyEvent, ctx: Context, out: Collector[Result]) = {
  // work with event and state
  (event, state.value) match { … }
  out.collect(…) // emit events
  state.update(…) // modify state
  // schedule a timer callback
  ctx.timerService.registerEventTimeTimer(event.timestamp + 500)
}
Layered abstractions to
navigate simple to complex use cases
APACHE FLINK’S RELATIONAL APIS
Unified APIs for batch & streaming data
A query specifies exactly the same result
regardless of whether its input is
static batch data or streaming data.
LINQ-style Table API:
tableEnvironment
  .scan("clicks")
  .groupBy('user)
  .select('user, 'url.count as 'cnt)
ANSI SQL:
SELECT user, COUNT(url) AS cnt
FROM clicks
GROUP BY user
QUERY TRANSLATION
[Diagram: A query written in the Table API or the SQL API (the two "clicks" queries from the previous slide) is checked by the Table API Validator or by the Calcite Parser & Validator. Tables (DataSet, DataStream, external tables) are registered in the Calcite Catalog. Both paths produce a Calcite Logical Plan, which the Calcite Optimizer translates with DataSet Rules into a DataSet Plan and a DataSet program when the input data is bounded (batch), or with DataStream Rules into a DataStream Plan and a DataStream program when the input data is unbounded (streaming).]
WHAT IF “CLICKS” IS A FILE?
Clicks
user   cTime      url
Mary   12:00:00   https://…
Bob    12:00:00   https://…
Mary   12:00:02   https://…
Liz    12:00:03   https://…

SELECT
  user,
  COUNT(url) AS cnt
FROM clicks
GROUP BY user

Result
user   cnt
Mary   2
Bob    1
Liz    1

Input data is read at once.
Result is produced at once.
WHAT IF “CLICKS” IS A STREAM?
Clicks
user   cTime      url
Mary   12:00:00   https://…
Bob    12:00:00   https://…
Mary   12:00:02   https://…
Liz    12:00:03   https://…

SELECT
  user,
  COUNT(url) AS cnt
FROM clicks
GROUP BY user

Result (continuously updated)
user   cnt
Mary   1 → 2
Bob    1
Liz    1

Input data is continuously read.
Result is continuously updated.
The result is the same!
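
The claim on the two "clicks" slides can be sketched in a few lines of plain Python. This is only an illustration of the semantics, not of how Flink executes the query: the record layout and the function names `batch_count` and `streaming_count` are assumptions made for this sketch.

```python
from collections import Counter

# Hypothetical click records (user, cTime) mirroring the rows on the slide.
CLICKS = [
    ("Mary", "12:00:00"),
    ("Bob", "12:00:00"),
    ("Mary", "12:00:02"),
    ("Liz", "12:00:03"),
]

def batch_count(rows):
    """Read all input at once and produce the result at once."""
    return dict(Counter(user for user, _ in rows))

def streaming_count(rows):
    """Read rows one at a time and yield the updated result after each row."""
    counts = {}
    for user, _ in rows:
        counts[user] = counts.get(user, 0) + 1
        yield dict(counts)  # snapshot of the continuously updated result

batch_result = batch_count(CLICKS)
snapshots = list(streaming_count(CLICKS))
```

After the last row has been consumed, the final streaming snapshot equals the batch result; the earlier snapshots are the intermediate states of the continuously updated table.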
WHY IS STREAM-BATCH UNIFICATION IMPORTANT?
• Usability
‒ ANSI SQL syntax: No custom “StreamSQL” syntax.
‒ ANSI SQL semantics: No stream-specific results.
• Portability
‒ Run the same query on bounded and unbounded data
‒ Run the same query on recorded and real-time data
• How can we achieve SQL semantics on streams?
[Figure: timeline from past to future with markers for the start of the stream and now; it contrasts bounded queries with unbounded queries.]
DATABASE SYSTEMS RUN QUERIES ON STREAMS
• Materialized views (MVs) are similar to regular views, but are persisted to disk or memory
‒ Used to speed up analytical queries
‒ MVs need to be updated when the base tables change
• MV maintenance is very similar to SQL on streams
‒ Base table updates are a stream of DML statements
‒ The MV definition query is evaluated on that stream
‒ The MV is the query result and is continuously updated
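
A minimal sketch of the materialized-view analogy, in plain Python: the base-table changes arrive as a stream of DML operations, and a COUNT-per-user view is maintained incrementally instead of being recomputed. The `(operation, user)` tuple format and the name `maintain_count_view` are assumptions for this illustration, not part of any database or Flink API.

```python
def maintain_count_view(dml_stream):
    """Apply INSERT/DELETE operations on the base table to a COUNT-per-user view."""
    view = {}
    for op, user in dml_stream:
        if op == "INSERT":
            view[user] = view.get(user, 0) + 1
        elif op == "DELETE":
            if user not in view:
                raise KeyError(f"no base-table row to delete for {user!r}")
            view[user] -= 1
            if view[user] == 0:
                del view[user]  # the group disappears with its last row
        else:
            raise ValueError(f"unsupported DML operation: {op}")
    return view

# Hypothetical stream of base-table changes.
DML = [("INSERT", "Mary"), ("INSERT", "Bob"), ("INSERT", "Mary"), ("DELETE", "Bob")]
view = maintain_count_view(DML)
```

Each operation touches only the affected group, which is the same kind of work a continuous query performs on an unbounded stream.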
CONTINUOUS QUERIES IN FLINK
• Core concept is a “Dynamic Table”
‒ Dynamic tables change over time
• Queries on dynamic tables
‒ produce new dynamic tables (which are updated based on input)
‒ do not terminate
• Stream ↔ Dynamic table conversions
STREAM ↔ DYNAMIC TABLE CONVERSIONS
• Append Conversions
‒ Records are only inserted (appended)
• Upsert Conversions
‒ Records are upserted/deleted
‒ Records have a (composite) unique key
• Retract Conversions
‒ Records are inserted/deleted
Example with an append-only result:
SELECT user, url
FROM clicks
WHERE url LIKE '%xyz.com'
Example with an updating result (needs an upsert or retract conversion):
SELECT user, COUNT(url)
FROM clicks
GROUP BY user
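
To make the difference between the conversions concrete, the following plain-Python sketch encodes the result of the GROUP BY query as a retract stream and as an upsert stream. The message layouts are assumptions chosen for illustration; the boolean flag loosely mirrors the add/retract flag of Flink's retract streams, but nothing here calls Flink.

```python
def retract_stream(users):
    """Encode a per-user count as (flag, user, cnt); flag False retracts an earlier row."""
    counts = {}
    out = []
    for user in users:
        old = counts.get(user)
        if old is not None:
            out.append((False, user, old))  # delete the previously emitted row
        counts[user] = (old or 0) + 1
        out.append((True, user, counts[user]))  # insert the new row
    return out

def upsert_stream(users):
    """Encode the same result as upserts keyed by user: one message per update."""
    counts = {}
    out = []
    for user in users:
        counts[user] = counts.get(user, 0) + 1
        out.append(("UPSERT", user, counts[user]))
    return out

USERS = ["Mary", "Bob", "Mary", "Liz"]  # users of the incoming clicks, in arrival order
retractions = retract_stream(USERS)
upserts = upsert_stream(USERS)
```

The upsert encoding needs the unique key (here the user) to overwrite the old row, which is why it emits fewer messages than the retract encoding. The filter query produces no updates at all, so a plain append stream is enough for it.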
SQL FEATURES
SQL FEATURE SET IN FLINK 1.6.0
• SELECT / FROM / WHERE
• GROUP BY / HAVING
‒ Non-windowed, TUMBLE, HOP, SESSION windows
• JOIN / IN
‒ Windowed INNER, LEFT / RIGHT / FULL OUTER JOIN
‒ Non-windowed INNER, LEFT / RIGHT / FULL OUTER JOIN
• [streaming only] OVER / WINDOW
‒ UNBOUNDED / BOUNDED PRECEDING
• [batch only] UNION / INTERSECT / EXCEPT / ORDER BY
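
As an illustration of the windowed GROUP BY listed above, this plain-Python sketch shows what a tumbling window does: each event is assigned to exactly one fixed-size, non-overlapping window, and the aggregate is computed per key and window. Timestamps as integer seconds and the names `tumble_start` and `tumbling_counts` are assumptions for this sketch; in SQL the grouping would be expressed with TUMBLE over a time attribute.

```python
from collections import defaultdict

def tumble_start(ts_seconds, size_seconds):
    """Return the start of the tumbling window that contains the timestamp."""
    return ts_seconds - (ts_seconds % size_seconds)

def tumbling_counts(events, size_seconds):
    """Count events per (user, window start) for fixed, non-overlapping windows."""
    counts = defaultdict(int)
    for user, ts in events:
        counts[(user, tumble_start(ts, size_seconds))] += 1
    return dict(counts)

# Hypothetical events as (user, timestamp in seconds).
EVENTS = [("Mary", 0), ("Bob", 0), ("Mary", 2), ("Liz", 3), ("Mary", 7)]
window_counts = tumbling_counts(EVENTS, size_seconds=5)
```

Because a window's result is final once the window has closed, a windowed aggregation of this kind can be emitted as an append-only stream.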
SQL FEATURE SET IN FLINK 1.6.0 (continued)
• Support for POJOs, maps, arrays, and other nested types
• Large set of built-in functions (150+)
‒ LIKE, EXTRACT, TIMESTAMPADD, FROM_BASE64, MD5, STDDEV_POP, AVG, …
• Support for custom UDFs (scalar, table, aggregate)
See also:
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/functions.html
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/udfs.html
UPCOMING SQL FEATURES
• Streaming enrichment joins (Temporal joins) [FLINK-9712]
• Support for complex event processing (CEP) [FLINK-6935]
‒ MATCH_RECOGNIZE
• More connectors and formats [FLINK-8535]
SELECT
  SUM(o.amount * r.rate) AS amount
FROM
  Orders AS o,
  LATERAL TABLE (Rates(o.rowtime)) AS r
WHERE r.currency = o.currency;
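
The idea behind the temporal join above can be sketched in plain Python: each order is enriched with the rate that was valid at the order's own rowtime, not with the latest rate. The data, the tuple layouts, and the helper `rate_at` are assumptions for this illustration; they only stand in for what Rates(o.rowtime) expresses in the query.

```python
from bisect import bisect_right

# Hypothetical rate history per currency: (valid_from_time, rate), sorted by time.
RATES = {
    "EUR": [(0, 1.25), (10, 1.5)],
    "USD": [(0, 1.0)],
}

def rate_at(currency, time):
    """Return the rate of the currency that was valid at the given time."""
    history = RATES[currency]
    valid_from = [t for t, _ in history]
    idx = bisect_right(valid_from, time) - 1  # last version with valid_from <= time
    if idx < 0:
        raise LookupError(f"no rate for {currency} at time {time}")
    return history[idx][1]

# Hypothetical orders as (rowtime, currency, amount).
ORDERS = [(2, "EUR", 5), (12, "EUR", 5), (3, "USD", 4)]

# SUM(o.amount * r.rate), using the rate version valid at each order's rowtime.
total = sum(amount * rate_at(currency, rowtime) for rowtime, currency, amount in ORDERS)
```

The two EUR orders are converted with different rates because the rate changed between them, which is the point of joining against a versioned table.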
WHAT CAN I BUILD WITH THIS?
• Data Pipelines
‒ Transform, aggregate, and move events in real-time
• Low-latency ETL
‒ Convert and write streams to file systems, DBMS, K-V stores, indexes, …
‒ Ingest appearing files to produce streams
• Stream & Batch Analytics
‒ Run analytical queries over bounded and unbounded data
‒ Query and compare historic and real-time data
• Power Live Dashboards
‒ Compute and update data to visualize in real-time
SQL CLIENT BETA
INTRODUCTION TO SQL CLIENT
• Newest member of the Flink SQL family (since Flink 1.5)
INTRODUCTION TO SQL CLIENT (continued)
• Goal: Flink without a single line of code
‒ only SQL and YAML
‒ "drag&drop" SQL JAR files for connectors and formats
• Built on top of Flink's Table & SQL API
• Useful for prototyping & submission
SQL CLIENT CONFIGURATION
See also:
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sqlClient.html
PLAY AROUND WITH FLINK SQL
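SELECT
  user,
  COUNT(url) AS cnt
FROM clicks
GROUP BY user
[Diagram: The SQL Client consists of a CLI and a Gateway. The CLI submits the query to the Gateway, which contains the Catalog, the Optimizer, and the Result Server, and which submits a job. The running query keeps state and reads from a database / HDFS or an event log. Results are returned to the CLI as a changelog or as a table.]
The environment is initialized by conf/sql-client-defaults.yaml and by --environment my-config.yaml, and is modified by DDL commands within the session.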
SUBMIT DETACHED QUERIES
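INSERT INTO dashboard
SELECT
  user,
  COUNT(url) AS cnt
FROM clicks
GROUP BY user
[Diagram: The SQL Client consists of a CLI and a Gateway. The CLI submits the query to the Gateway, which contains the Catalog, the Optimizer, and the Result Server, and which submits a job. The running query keeps state and reads from and writes to a database / HDFS or an event log. The CLI receives target information: the cluster ID and the job ID.]
The environment is initialized by conf/sql-client-defaults.yaml and by --environment my-config.yaml, and is modified by DDL commands within the session.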
ACTION TIME!
https://github.com/dataartisans/sql-training
SUMMARY
• Unification of stream and batch is important.
• Flink’s SQL solves many streaming and batch use cases.
• Runs in production at Alibaba, Uber, and others.
• The community is working on improving user interfaces.
• Get involved, discuss, and contribute!
THANK YOU!
@twalthr
@dataArtisans
@ApacheFlink
WE ARE HIRING
data-artisans.com/careers
Available on O’Reilly Early Release!

Flink Forward Berlin 2018: Timo Walther - "Flink SQL in Action"


Editor's Notes

  • #2: Today I want to show you the power and simplicity of Flink SQL, and why and how you can leverage it for your use cases. My name is …; I am a committer and PMC member, and I have been part of the Flink team since the early days. I work as a software engineer.
  • #3: • data Artisans was founded by the original creators of Apache Flink • We provide dA Platform, a complete stream processing infrastructure built on open-source Apache Flink, to manage a zoo of stream processing applications
  • #4:  If your company would like to be represented on the “Powered by Apache Flink SQL” page, email me.
  • #5: Flink offers APIs for different levels of abstraction that can all be mixed and matched. On the lowest level, ProcessFunctions give precise control over state and time, i.e., when to process data. On the intermediate level, the DataStream API provides higher-level primitives, such as window processing. Finally, on the top level, the relational APIs, SQL and the Table API, are centered around the concept of dynamic tables. This is what I’m going to talk about next.
  • #18: Now let me walk through an example that shows the power and simplicity of SQL.
  • #22: Non-programmatic way of configuring Flink jobs. Per-session and/or global configuration in YAML. Configures: tables from external systems; views defined in SQL; user-defined functions; execution properties (e.g. result mode, execution mode); deployment properties.
  • #25: The training environment consists of a Flink SQL client container to submit queries and visualize their results, a Flink master and a Flink worker container to execute queries, an Apache Kafka container to produce input streams and consume result streams, an Apache Zookeeper container (required by Kafka), and an ElasticSearch container to maintain an external materialized table.