Flink 2.1 SQL: Unlocking Real-time Data & AI Integration for Scalable Stream Processing

This article is adapted from Lincoln Lee’s presentation in the Real-Time AI track at Flink Forward Asia Singapore 2025.

Good afternoon, everyone! I'm Lincoln, a staff engineer at Alibaba Cloud and a PMC member of Apache Flink. Today, I will talk about the SQL advances for data and AI in the upcoming Flink 2.1 release.

Flink 2.1 SQL's Key Advancements

Let me outline the structure of today’s talk. We’ll explore three parts:

Data + AI: Bridging Real-time Data Processing with AI Capabilities in Flink SQL

Here, I’ll introduce how Flink SQL 2.1 bridges real-time data processing with AI capabilities. You’ll see how we’ve enhanced support for AI functions, from model registration to seamless integration with SQL via ML_PREDICT, enabling tasks like text generation with large models and RAG workflows. For more information on Flink SQL AI functions, refer to the official Flink documentation.

Optimized Joins: Addressing Critical Challenges in Flink Streaming Joins

Next, we’ll address a critical challenge in streaming joins. I’ll dive into two groundbreaking improvements: Delta Join, which eliminates state storage by combining indexes and changelog processing, and Multi-way Join, which reduces redundancy in multi-stream joins while maintaining low latency.

What’s Next: Flink SQL's Future Enhancements and Roadmap for Data & AI

Finally, I’ll share a roadmap for future enhancements, including vector search support in RAG pipelines and expanded AI function support. This journey will show how Flink 2.1 empowers you to build scalable, real-time AI pipelines seamlessly.

Before diving into Flink AI functions, let me start with a real-world problem.

Real-Time Product Compliance: A Challenging Use Case

Imagine you’re running a global e-commerce platform. You’ve got sellers uploading millions of product listings every day. But here’s the catch: you need to make sure every single listing complies with local laws in every country you operate in.

For example, a product titled ‘Grape Juice Beverage with Minimal Alcohol Content’ violates policies in specific countries because it contains ‘alcohol’.

Right now, teams use Flink SQL to build pipelines that assist human review, as sketched below:

  1. Read product listing data from a Kafka topic,
  2. Use a custom UDF such as keyword_match to check titles against a list of forbidden keywords,
  3. Output risky listings for human review.
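
A minimal sketch of such a pipeline, assuming a Kafka source table products, a sink table risky_products, and a previously registered keyword_match UDF (all names and connector options here are illustrative):

```sql
-- Illustrative rule-based compliance pipeline; table names, options, and the UDF are assumptions.
CREATE TABLE products (
  product_id BIGINT,
  title      STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'product_listings',
  'properties.bootstrap.servers' = '...',
  'format' = 'json'
);

CREATE TABLE risky_products (
  product_id BIGINT,
  title      STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'risky_listings',
  'properties.bootstrap.servers' = '...',
  'format' = 'json'
);

-- keyword_match is a custom scalar UDF that returns TRUE when the title hits a forbidden keyword.
INSERT INTO risky_products
SELECT product_id, title
FROM products
WHERE keyword_match(title);
```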

But here’s the problem – rule-based systems are super rigid.

Why Rule-based Keyword Matching Fails: False Positives & Negatives

Alright, let’s look at two concrete examples where rule-based keyword matching fails.

Case 1: The Overblocking Case (False Positive)

Imagine a product titled ‘grape juice (no alcohol)’. The keyword ‘alcohol’ triggers the rule, and the system flags it as risky. But it explicitly said ‘no alcohol’! This is a false positive – we’re blocking a safe product, wasting human review time, and risking customer frustration.

Case 2: The Underblocking Case (False Negative)

Now, check out this title: ‘Natural Vanilla Extract’. Our keyword list includes ‘alcohol’ and ‘wine’, but vanilla extract often contains alcohol! The rule misses it entirely – a false negative. This can lead to serious compliance penalties.

So… How do we fix this? We need something smarter than keywords.

Leveraging AI’s Semantic-based Analytics for Smarter Compliance

Let’s test if AI actually works better. Here’s a quick example with ChatGPT – but remember, this could be any LLM or custom model you train.

Step 1: Teach the AI the task

We give the model a clear prompt, including role, rules, and examples.

Step 2: Test with our problem case

When we feed it the tricky ‘Natural Vanilla Extract’ case, the AI correctly concludes that it ‘Contains alcohol’. This is exactly what we need.

Now, let’s try to integrate this into the UDF.

The Hidden Costs of Custom AI UDFs in Flink

We built a new UDF that connects directly to the LLM, and the pipeline looks almost the same as before. Same flow:

  1. Kafka In (Product titles stream in)
  2. LLM UDF (The upgraded UDF calls the model, so decisions are now more intelligent)
  3. Kafka Out (Results go to the same review topic – no changes needed downstream).

Looks perfect… but wait.

While this works for small-scale testing, real-world challenges hit fast.

Building a custom LLM UDF feels great at first, but here’s the reality check:

  1. Code Rewrite: If we want to switch from OpenAI to Alibaba Cloud, we rewrite the UDF code. Testing different models? More code changes. This approach doesn’t scale – it’s not realistic to rewrite UDF code every time a new model comes out or an API changes.
  2. Sync Requests = Traffic Jam: Every product title triggers a synchronous API call to the LLM. Each call takes 1-3 seconds, so throughput stays very low. To get higher throughput with async requests, we would have to rewrite the UDF yet again using AsyncScalarFunction, and then carefully handle async callbacks and error handling. This is no fun – there must be a better way.

Let’s see how Flink 2.1 solves this.

Apache Flink SQL Native AI Functions: Simplified AI Integration

Here’s how Flink SQL native AI functions work:

Use CREATE MODEL to register any LLM with a simple SQL command. Switch models? Just change the MODEL parameter – no code rewrite needed. Model management becomes simple.
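
For example, registering a chat model could look roughly like this (the provider, endpoint, and option keys are illustrative; see the Flink model DDL documentation for the exact options):

```sql
-- Illustrative model registration; all option keys and values are assumptions.
CREATE MODEL review_model
INPUT  (title STRING)
OUTPUT (response STRING)
WITH (
  'provider' = 'openai',
  'endpoint' = 'https://api.openai.com/v1/chat/completions',
  'model'    = 'gpt-4o',
  'api-key'  = '...'
);

-- Switching to another provider or model only means changing these options;
-- the queries that reference the model stay untouched.
```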

The new ML_PREDICT() function in Flink 2.1 is ready for these use cases:

  1. Chat/Completion Tasks: For scenarios like product compliance checks or sentiment analysis, just pass text to the model and it returns decisions.
  2. Embedding Generation: For feature extraction, it can power your RAG pipelines by generating vector embeddings from text.

Everything works directly in SQL. To enable async processing, just add a simple parameter. To switch models, reference a different MODEL name in your SQL query – no UDF code changes.

Real-Time Product Compliance with Flink SQL AI Functions: A Concrete Example

Let’s bring it all together with a concrete example. Here’s how Flink AI functions solve our product compliance challenge:

First, we create a compliance model using the CREATE MODEL syntax – specifying the provider (here, ModelScope from Alibaba Cloud), the model name qwen-turbo, and a system prompt that tells the AI its role as a product listing review expert.

When a product title like ‘Natural Vanilla Extract’ arrives, Flink sends it to the AI model via the ML_PREDICT function. The request is made asynchronously to ensure high throughput. The model analyzes the title and returns a JSON response.

Lastly, we insert results into the risk output topic only when the risk_rate exceeds the defined threshold.
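
Putting it together, the pipeline from this example could be sketched roughly as follows (the model options, the ML_PREDICT runtime-config map, the response fields, and the 0.8 threshold are all illustrative; consult the Flink 2.1 documentation for the exact syntax):

```sql
-- Illustrative end-to-end sketch; option keys, response format, and threshold are assumptions.
CREATE MODEL compliance_model
INPUT  (title STRING)
OUTPUT (response STRING)
WITH (
  'provider'      = 'modelscope',
  'model'         = 'qwen-turbo',
  'system-prompt' = 'You are a product listing review expert. Return JSON with risk_rate and reason.'
);

INSERT INTO risky_products
SELECT product_id, title
FROM ML_PREDICT(
  TABLE products,
  MODEL compliance_model,
  DESCRIPTOR(title),
  MAP['async', 'true']                                            -- async requests for higher throughput
)
WHERE CAST(JSON_VALUE(response, '$.risk_rate') AS DOUBLE) > 0.8;  -- keep only risky listings
```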

Optimizing Flink SQL AI Functions: Async Tuning & Resource Planning for Performance

Let’s dive into two critical optimizations: the async configuration and resource setting.

  1. Prefer Async Execution: Enabling async execution in ML_PREDICT() calls should be the first choice; it is far more cost-effective than simply increasing task parallelism. Use the allow_unordered output_mode for append-only streams so Flink can process results faster. Set max-concurrent-operations to match your LLM’s capacity, and set the async timeout parameter larger than the model’s worst-case latency so slow responses don’t trigger task failures. For more details on asynchronous operations, refer to the Flink documentation.
  2. Resource Planning with Little’s Law: Apply the formula L = λ × W for capacity planning:
  • L: Queue slots (pending requests)
  • λ: Request rate (QPS)
  • W: Average latency (model response time)

For example, a target of 100 QPS with a 99th-percentile latency of 1.2 seconds requires roughly 120 max concurrent operations (100 × 1.2 = 120). We also need to pay attention to the TaskManager memory settings, considering the queue length and the average row size. Proper tuning can significantly boost throughput and stability when running AI functions.
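
As a rough sketch of how these knobs fit together with the Little’s Law estimate (the runtime-config keys in the MAP argument are taken from the talk and may differ slightly from the released option names):

```sql
-- Little's Law: L = λ × W  →  100 QPS × 1.2 s (p99 latency) ≈ 120 in-flight requests.
SELECT product_id, response
FROM ML_PREDICT(
  TABLE products,
  MODEL compliance_model,
  DESCRIPTOR(title),
  MAP[
    'async', 'true',
    'max-concurrent-operations', '120',   -- sized from Little's Law
    'output-mode', 'allow_unordered',     -- fine for append-only streams
    'timeout', '30s'                      -- larger than the model's worst-case latency
  ]
);
```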

JSON Processing Revolution: Introducing Flink SQL's VARIANT Type

Before we dive into joins, let’s talk about something foundational: JSON.

JSON Everywhere: From Big Data to AI Workflows

Let’s start with a simple fact: JSON is everywhere. From traditional data pipelines to the new AI workflows, JSON is an important format for representing structured and semi-structured data. We see it in event logs, search documents, API payloads. RAG pipelines rely on JSON to store and query documents. Even LLM prompts and outputs are often formatted as JSON.

So far, Flink SQL has supported a lot of built-in JSON functions. But as JSON payloads get deeper and more dynamic, a performance challenge emerges.

The Hidden Cost of Traditional JSON Parsing in Flink

On the surface, JSON_VALUE and similar functions make it easy to access data inside a JSON string. But under the hood, each call triggers a full JSON parse – every time, for every row. This may work fine for simple cases, but with large datasets, nested structures, or deeply nested query paths like $.metadata.device, performance degrades fast.

There’s no schema awareness, no indexing inside the JSON, so SQL planners cannot optimize access.

Flink SQL’s New VARIANT Type: Efficient Semi-Structured Data Handling

Flink 2.1 introduces the new VARIANT type, a native, binary-encoded semi-structured type. Unlike plain JSON strings, VARIANT stores both metadata and values in a structured way, so accessing data.metadata.device becomes a direct offset lookup rather than a full parse.

Because it’s schema-aware, query optimization can be applied by the SQL planner in future releases. This makes it ideal for data pipelines. VARIANT unlocks performance and flexibility for working with JSON at scale. Learn more about the Flink VARIANT type.
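
Here is a small sketch contrasting the two approaches, assuming an events table with a JSON payload column of type STRING (the VARIANT access details are simplified; see the Flink VARIANT documentation for the exact functions):

```sql
-- String-based access: every call re-parses the whole JSON document.
SELECT JSON_VALUE(payload, '$.metadata.device') AS device
FROM events;

-- VARIANT-based access: parse once into a binary-encoded value; downstream reads of
-- fields such as metadata.device become offset lookups instead of repeated parsing.
SELECT PARSE_JSON(payload) AS v
FROM events;
```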

Optimized Joins: Solving Critical Challenges in Flink Stream Processing

Now let’s move to streaming joins, another core challenge in real-time processing. Flink SQL supports rich kinds of join types: regular joins, interval joins, temporal joins, lookup joins, and more. Each is designed for specific use cases. Among them, the regular join is the most intuitive—it looks exactly like a traditional SQL join, making it easy to write and understand.

Limitations of Regular Joins in Flink Streaming: Scalability Issues

Let’s see how it works. We have two input streams: pageviews and orders arrive from Kafka and are joined on the product_id column. The join operator keeps two state stores, the left state and the right state, both keyed by product_id. When a new event arrives, Flink looks up matching entries in the opposite state table, performs the join logic, and emits the output.
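
In SQL this is just a plain join of the two streams (column names here are illustrative):

```sql
-- Regular streaming join: both inputs are buffered in Flink state, keyed by product_id.
SELECT o.order_id, o.product_id, p.page_url
FROM orders AS o
JOIN pageviews AS p
  ON o.product_id = p.product_id;
```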

The regular join relies heavily on Flink’s state backend to buffer the input streams. When used properly, this provides high throughput and low latency, especially for small to medium-sized data streams. But as stream size grows, the regular join begins to struggle. The state maintained on each side of the join grows larger and larger. Eventually, this leads to slow state access, long checkpoints, and slow recovery. You might start to see latency spikes, backpressure, or even checkpoint failures. This is a classic tradeoff – simplicity and flexibility come at the cost of scalability.

So how do we fix the scaling problem of regular joins?

Delta Join: A Fundamentally Different Approach to Stateless Joins

That’s where Delta Join comes in - a fundamentally different approach. Unlike regular joins that keep all data in Flink’s state backend, Delta Join uses external storage in a system like Fluss, which is based on RocksDB. Here’s how it works:

Fluss sends changelog updates continuously, so the join always stays fresh. When a new event arrives, Delta Join performs an index lookup in Fluss — just like querying a key-value store.

As you can see, Delta Join is now stateless. All the large-state issues disappear. This unlocks massive join workloads that were previously infeasible.
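
From the SQL side, the query looks like a normal join over Fluss tables that are indexed on the join key; the planner decides whether it can be executed as a delta join (the optimizer option name below is an assumption used for illustration):

```sql
-- Both sides are Fluss tables bucketed/indexed on product_id; the join is written
-- like a regular join, and the planner rewrites it into a delta join when possible.
SET 'table.optimizer.delta-join.strategy' = 'AUTO';  -- illustrative config key

SELECT o.order_id, p.page_url
FROM fluss_orders AS o
JOIN fluss_pageviews AS p
  ON o.product_id = p.product_id;
```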

Delta Join In a Production Case: Real-world Performance Gains

Here’s a summary of what we’ve observed in production. Regular Join quickly becomes unsustainable at scale – requiring 100 TB+ of state, long checkpoint durations, and complex recovery. With Delta Join, we externalize state to Fluss, achieve second-level checkpointing, cut CPU and memory usage by more than 80%, reduce bootstrap time by 87%, and enable real-time traceability of join operators. This makes Flink more robust for large-scale join processing.

Cascaded Regular Join vs. Multi-Way Join: Eliminating Redundancy

Let’s take a closer look at the hidden cost of Regular Joins — especially in multi-stream joins.

Regular Join is binary-only: it joins two streams at a time. So if you want to join T1, T2, and T3, Flink builds a cascaded plan — first T1 joins T2, then that result joins T3.

But this creates serious inefficiencies:

  • Each join stage keeps its own full state.
  • Intermediate results (like T1 join T2) get stored again in the next join.
  • State size multiplies, and checkpoint duration spikes.

FLIP-415’s mini-batch join helps with intermediate output overhead – but not with state duplication. So... how do we fix this? Let’s talk about Multi-Way Join.

Multi-Way Join in Streaming: A New Strategy for Efficiency

To solve this, the Multi-Way Join was introduced — a new join strategy that removes redundancy at the root. It allows joining multiple streams with the same join key in a single operator.

Instead of cascading binary joins, we use one indexed state table for each input. In this way, there is no intermediate state duplication and no compounding checkpoint delay. The more streams you join, the more you benefit from this approach, as the sketch below shows.
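
For example, a three-stream join on the same key can be executed in a single multi-way join operator instead of a cascade of binary joins (table and column names are illustrative):

```sql
-- All three inputs share the join key order_id; with the multi-way join strategy,
-- Flink keeps one indexed state table per input instead of chained binary joins.
SELECT o.order_id, p.amount, s.status
FROM orders AS o
JOIN payments  AS p ON o.order_id = p.order_id
JOIN shipments AS s ON o.order_id = s.order_id;
```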

What’s Next: Flink SQL's Roadmap for Data & AI

Now, let’s wrap up with what’s next.

Apache Flink SQL for End-to-End RAG Pipelines

One near-term goal is enabling end-to-end RAG pipelines directly in SQL. Today, users can generate embeddings and push them into systems like Milvus, but retrieval is not yet supported.

The plan is to integrate vector search directly into Flink SQL through a new VECTOR_SEARCH() function. Combined with ML_PREDICT for both embedding and generation, this enables fully declarative RAG pipelines:

  1. Ingest and embed data
  2. Retrieve top-k neighbors via vector search
  3. Use results in downstream inference

This greatly simplifies building RAG pipelines.
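
To make the idea concrete, here is a purely illustrative sketch of what such a retrieval step might look like once VECTOR_SEARCH lands (the function is part of the roadmap, so its name, arguments, and the surrounding syntax are not final):

```sql
-- Hypothetical roadmap syntax, not available in Flink 2.1:
-- embed the question with ML_PREDICT, then retrieve the top-5 neighbors via VECTOR_SEARCH.
SELECT q.question, d.chunk
FROM ML_PREDICT(TABLE questions, MODEL embedding_model, DESCRIPTOR(question)) AS q,
     LATERAL TABLE(VECTOR_SEARCH(q.embedding, 5)) AS d(chunk, score);
```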

Expanding AI Support: Multimodal & Evaluation Functions

We’re also extending the scope of AI support in Flink SQL. In addition to existing text and embedding functions, we’re adding support for:

  • Multimodal models, including image and audio inputs.
  • Evaluation functions to assess model output quality during execution.

These additions will allow users to integrate and monitor model behavior directly inside data pipelines.

Continuous Optimization for Stream Join Performance

Join performance remains a key focus for streaming use cases. Delta Join already decouples state from the Flink task and addresses the scalability problem for high-cardinality joins. Looking ahead, we’re adding support for additional storage systems, including Apache Paimon, to enable near-real-time delta joins. We’re also improving support for more complex multi-stream joins, relaxing schema alignment requirements and supporting more query patterns.

Key Takeaways: Flink 2.1 SQL's Impact

To recap, here are the key takeaways from this talk:

Flink SQL for Data + AI

  • We’ve introduced SQL-native model management and AI functions to make AI integration easier and more consistent within SQL pipelines.
  • The new VARIANT type allows Flink SQL to handle semi-structured data like JSON more efficiently, with better performance and native planner optimization for the future.

Addressing the Stream Join Challenges

  • Delta Join removes local state by offloading storage, making large-scale joins more stable and resource-efficient.
  • Multi-Way Join eliminates redundant state in multi-stream joins, significantly improving performance at scale.

Roadmap

  • We’re working on deeper integration for AI pipelines, including vector search and multi-modal support.
  • Join performance remains a focus, with more flexibility and new storage integrations coming.

All these efforts aim to make Flink SQL a more complete solution for building real-time, intelligent data pipelines.
