2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipelines

© 2023 Cloudera, Inc. All rights reserved.
Unlocking Financial Data with Real-Time Pipelines
(Flink Analytics on Stocks with SQL)
Tim Spann
Principal Developer Advocate
28-February-2024

Agenda (45 minutes)
Introduction
Overview
Finance Data
Apache Kafka and Apache Flink
Demos

Financial institutions rely on accurate and timely data to drive critical decision-making,
risk assessment, and regulatory compliance. However, managing and processing
vast amounts of financial data in real time can be a daunting task. To overcome this
challenge, modern data engineering solutions combine technologies like
Apache Flink, Apache NiFi, Apache Kafka, and Apache Iceberg to create efficient
and reliable real-time data pipelines. In this talk, we will explore how this technology stack
can unlock the full potential of financial data, enabling organizations to make data-driven
decisions swiftly and with confidence.

Introduction: Financial institutions operate in a fast-paced environment where real-time
access to accurate and reliable data is crucial. Traditional batch processing falls short when it
comes to handling rapidly changing financial markets and responding to customer demands
promptly. In this talk, we will build real-time data pipelines with
Apache Flink, Apache NiFi, Apache Kafka, and Apache Iceberg to unlock the potential of
financial data. I will be using NiFi 2.0 with Python and vector databases.

Tim Spann
Twitter: @PaasDev // Blog: datainmotion.dev
Principal Developer Advocate.
Princeton Future of Data Meetup.
ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC
https://medium.com/@tspann
https://github.com/tspannhw

Future of Data - NYC + NJ + Philly + Virtual
@PaasDev
https://www.meetup.com/futureofdata-princeton/
From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ...

FLaNK Stack Weekly by Tim Spann
This week in Apache NiFi, Apache Flink, Apache Kafka, ML, AI, Apache Spark, Apache Iceberg, Python, Java, LLM, GenAI, Vector DB and Open Source friends.
https://bit.ly/32dAJft
https://www.meetup.com/futureofdata-princeton/

Overview

DATA VELOCITY in FINANCIAL SERVICES
Streaming capabilities vary, all enhance insight
[Diagram: data types placed along a velocity scale]
Velocity scale: Normal Streaming, Near-Real Time Streaming, Real-Time Streaming
Data types: Transaction Data, Core Banking, Risk Data, Behavioral Data, Cyber, Market Data, News Feeds, Customer Data, Connected Devices/Wearables, Chat Bots, Regulatory/Compliance, Social Media

NEXT GEN PLATFORM FOR TACKLING FINANCIAL CRIME
Leveraging data and analytics across the enterprise from the Edge to AI.
[Diagram with four numbered steps; labels recovered from the slide]
Data sources: BANKING DATA, EDGE DATA, ALTERNATIVE DATA, ENTERPRISE DATA, TRADING DATA
Ingest: Ingest Streaming Data, Ingest Data at Rest, Ingest Stream or Batch Data
Platform: Data Flow (CDF), Data Processing, Data Engineering, Data Warehouse, Operational DB, ML, Data Science Workbench, Enterprise Data Store, Deploy Models
Catalog | Schema | Security | Governance
FINANCIAL CRIME APPLICATIONS: Cyber Security, AML, Fraud, Surveillance
Users and tools: Data Scientists, Business Analysts, Analytical Tools, BI and Visualization
Teams speaking the same language

Kafka & Flink (Flink SQL with SQL Stream Builder) for real-time analytics
Architecture in the context of Financial Use Cases
[Diagram labels: Finance Data, DataFlow / NiFi, Kafka, Kafka topics, Flink SQL w/ SSB, Database, Machine learning, Data Warehouse, Data Viz, Monitoring, Alerting]

NIFI MEETS AI
Data in Motion With Cloudera: capture, process & distribute any data, anywhere
[Diagram labels: Structured Sources, Unstructured file types, Applications/APIs, Streams, Other enterprise data, Vector DB, AI Model, Materialized Views, Open Data Lakehouse]

Stop Fraud When It Happens: Real Life Example
Simplified example of deployed use case
[Diagram labels: Transactions Data, Account Data, Device Logs, Business Event Logic, Data Lakehouse, Flagged Records, Continuous Results, Fraud Analyst defined suspicious transaction, Real-Time Fraud Scoring, Freeze transaction, Fraud Monitoring Dashboard, DATA RELEVANCE]

FINANCE DATA

Extract Company Name from User Query via NLP
Convert Company Name to Stock Symbol via Finnhub REST
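
The second step can be sketched in Python as below. This is a minimal sketch, not the code from the demo repositories. It assumes Finnhub's symbol lookup endpoint at `/api/v1/search` with `q` and `token` query parameters and a response shaped like `{"count": n, "result": [{"symbol": ...}]}`; check both against the current Finnhub documentation. The helper names `build_lookup_url`, `pick_symbol` and `lookup_symbol` are hypothetical, and the NLP extraction step is not shown.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Finnhub symbol lookup endpoint; verify against the Finnhub docs.
FINNHUB_SEARCH_URL = "https://finnhub.io/api/v1/search"


def build_lookup_url(company_name, api_token):
    """Build the lookup URL for a company name (hypothetical helper)."""
    query = urllib.parse.urlencode({"q": company_name, "token": api_token})
    return f"{FINNHUB_SEARCH_URL}?{query}"


def pick_symbol(payload):
    """Return the first non-empty symbol in a lookup response, or None.

    Assumes the payload looks like {"count": n, "result": [{"symbol": ...}]}.
    """
    for entry in payload.get("result") or []:
        symbol = entry.get("symbol")
        if symbol:
            return symbol
    return None


def lookup_symbol(company_name, api_token, timeout=10):
    """Call the REST endpoint and return the first matching symbol (needs network)."""
    url = build_lookup_url(company_name, api_token)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return pick_symbol(json.load(response))


# Offline example with a hand-written payload (made up, not real API output).
sample_payload = {
    "count": 1,
    "result": [{"description": "EXAMPLE CORP", "symbol": "EXMP"}],
}
sample_symbol = pick_symbol(sample_payload)
```

Taking the first result is the simplest choice; a real flow would probably filter on security type or exchange before picking a symbol.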

REST API ARCHITECTURE - Using FLaNK to pull the data out of anything in near-real time
[Diagram labels recovered from the slide]
Stages: INGEST, PREPARE, PUBLISH
DATA SOURCES: Internal Users (After Sales), External Systems
ENTERPRISE LAKEHOUSE
CAPABILITY VIEW: INGESTION, MESSAGE HUB, STORAGE, BATCH, MANAGEMENT, STREAM, CONSUMPTION
Closed Loop Systems
Tools: SQL Stream Builder, Machine Learning, Data Visualization, Workload Manager, watsonx.data

KAFKA and FLINK

STREAMS MESSAGING WITH KAFKA
• Highly reliable distributed messaging system.
• Decouples applications and enables many-to-many patterns.
• Publish-subscribe semantics.
• Horizontal scalability.
• Efficient implementation that operates at speed with big data volumes.
• Organized by topic to support several use cases.
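
The publish-subscribe and decoupling points above can be illustrated with a toy in-memory model. This is a sketch only: `ToyBroker` is a hypothetical class, not a Kafka client, and it ignores partitions, persistence and replication. It shows one idea: each topic is an append-only log, and every consumer group tracks its own read position, so producers and consumers never need to know about each other.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a topic-based log. Not a Kafka client."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic -> append-only list of messages
        self._offsets = defaultdict(int)   # (topic, group) -> next offset to read

    def publish(self, topic, message):
        """Append a message to the end of the topic's log."""
        self._topics[topic].append(message)

    def poll(self, topic, group):
        """Return messages this group has not seen yet, then advance its offset."""
        log = self._topics[topic]
        start = self._offsets[(topic, group)]
        self._offsets[(topic, group)] = len(log)
        return log[start:]


broker = ToyBroker()
broker.publish("trades", {"symbol": "AAA", "price": 10.0})
broker.publish("trades", {"symbol": "BBB", "price": 20.0})

# Two independent consumer groups each receive the full stream.
fraud_batch = broker.poll("trades", group="fraud-scoring")
dashboard_batch = broker.poll("trades", group="dashboard")

# A later publish is delivered only as new data to a group that already caught up.
broker.publish("trades", {"symbol": "AAA", "price": 10.5})
fraud_next = broker.poll("trades", group="fraud-scoring")
```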

CONTINUOUS SQL
● SSB is a Continuous SQL engine
● It’s SQL, but with a slightly different mental model that has big implications
Traditional Parse/Execute/Fetch model vs. Continuous SQL model
Hint: The query is unbounded and never finishes, and time matters
AKA: SELECT * FROM foo WHERE 1=0 -- will run forever, even though it never returns a row
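
The difference in mental model can be sketched with a Python generator. This is an illustration, not how SSB or Flink is implemented: `continuous_max` is a hypothetical function that emits an updated result every time a matching event arrives instead of returning one final answer. The demo uses a short list so it terminates; with an unbounded source the loop would never end, which is the point of the `WHERE 1=0` example.

```python
def continuous_max(events, predicate):
    """Yield (symbol, running max price) each time a matching event arrives."""
    running = {}
    for event in events:  # with an unbounded source this loop never ends
        if not predicate(event):
            continue
        symbol, price = event["symbol"], event["price"]
        if symbol not in running or price > running[symbol]:
            running[symbol] = price
        yield symbol, running[symbol]


# A bounded stand-in for a stream, with made-up prices.
events = [
    {"symbol": "AAA", "price": 10.0},
    {"symbol": "BBB", "price": 5.0},
    {"symbol": "AAA", "price": 12.0},
    {"symbol": "AAA", "price": 11.0},
]

# Each input row produces an updated result row.
updates = list(continuous_max(events, lambda e: True))

# The WHERE 1=0 case: every event is read, none is emitted.
never_matches = list(continuous_max(events, lambda e: False))
```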

SQL STREAM BUILDER (SSB)
Democratize access to real-time data with just SQL
SQL Stream Builder allows developers, analysts, and data scientists to write streaming applications with industry-standard SQL.
No Java or Scala code development required.
Simplifies access to data in Kafka & Flink.
Connectors to batch data in HDFS, Kudu, Hive, S3, JDBC, CDC and more.
Enrich streaming data with batch data in a single tool.

SSB MATERIALIZED VIEWS
Key takeaway: MVs allow data scientists, analysts, and developers to consume data from the firehose

ICEBERG INTEGRATION
Robust Next Generation Architecture for Data Driven Business
Unified Processing Engine
Massive Open table format
Iceberg Support for Flink APIs through SSB
• Maximally open
• Maximally flexible
• Ultra high performance for MASSIVE data

DEMO
I Can Haz Data?
https://github.com/tspannhw/FLaNK-Py-Stocks
https://github.com/tspannhw/PaK-Stocks

Continuous SQL
select max(alt_baro) as MaxAltitudeFeet, min(alt_baro) as MinAltitudeFeet, avg(alt_baro) as AvgAltitudeFeet,
max(alt_geom) as MaxGAltitudeFeet, min(alt_geom) as MinGAltitudeFeet, avg(alt_geom) as AvgGAltitudeFeet,
max(gs) as MaxGroundSpeed, min(gs) as MinGroundSpeed, avg(gs) as AvgGroundSpeed,
count(alt_baro) as RowCount,
hex as ICAO, flight as IDENT
from `sr1`.`default_database`.`adsb`
group by flight, hex;
-- DISTANCE_BETWEEN is not a built-in Flink SQL function; it is assumed here to be a user-defined function registered in SSB.
select transcom.title, transcom.description, mta.VehicleRef,
       DISTANCE_BETWEEN(CAST(transcom.latitude as STRING), CAST(transcom.longitude as STRING),
                        mta.VehicleLocationLatitude, mta.VehicleLocationLongitude) as miles,
       mta.StopPointName, mta.Bearing, mta.DestinationName, mta.ExpectedArrivalTime,
       mta.VehicleLocationLatitude, mta.VehicleLocationLongitude,
       mta.ArrivalProximityText, mta.DistanceFromStop, mta.AimedArrivalTime, mta.`Date`,
       mta.ts, mta.uuid, mta.EstimatedPassengerCapacity, mta.EstimatedPassengerCount
from `schemareg1`.`default_database`.`mta` /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ mta
FULL OUTER JOIN `schemareg1`.`default_database`.`transcom` /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ transcom
  ON (transcom.latitude >= CAST(mta.VehicleLocationLatitude as float) - 0.3)
 AND (transcom.longitude >= CAST(mta.VehicleLocationLongitude as float) - 0.3)
 AND (transcom.latitude <= CAST(mta.VehicleLocationLatitude as float) + 0.3)
 AND (transcom.longitude <= CAST(mta.VehicleLocationLongitude as float) + 0.3)
WHERE mta.VehicleRef is not null
  AND transcom.title is not null
  AND DISTANCE_BETWEEN(CAST(transcom.latitude as STRING), CAST(transcom.longitude as STRING),
                       mta.VehicleLocationLatitude, mta.VehicleLocationLongitude) <= 120;
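
The join above pairs rows in two stages: a cheap bounding box of ±0.3 degrees, then a distance check. The same idea can be sketched in Python. The slides do not show how DISTANCE_BETWEEN is implemented, so the haversine formula in miles is an assumption here, and the function names and coordinates are made up for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def in_box(lat1, lon1, lat2, lon2, half_width=0.3):
    """Cheap pre-filter: both coordinates within half_width degrees."""
    return abs(lat1 - lat2) <= half_width and abs(lon1 - lon2) <= half_width


def is_nearby(event, vehicle, max_miles=120.0):
    """Apply the box filter first, then the exact distance check."""
    if not in_box(event["lat"], event["lon"], vehicle["lat"], vehicle["lon"]):
        return False
    distance = haversine_miles(event["lat"], event["lon"],
                               vehicle["lat"], vehicle["lon"])
    return distance <= max_miles


# Made-up coordinates for illustration.
traffic_event = {"lat": 40.75, "lon": -73.99}
close_bus = {"lat": 40.80, "lon": -73.95}
far_bus = {"lat": 42.00, "lon": -73.99}

close_match = is_nearby(traffic_event, close_bus)
far_match = is_nearby(traffic_event, far_bus)
```

The box filter runs first because it needs only comparisons, so the trigonometry is evaluated only for candidate pairs that are already close.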

DEMO

CDC with Flink SQL (SSB)

Streaming CDC with Cloudera SQL Stream Builder (Flink SQL)
https://github.com/tspannhw/FLaNK-CDC/blob/main/flinkcdc.MD

https://docs.cloudera.com/csa/1.10.0/how-to-ssb/topics/csa-ssb-cdc-connectors.html
CDC with Debezium and Flink
SQL Stream Builder with Flink SQL

CREATE TABLE `postgres_cdc_newjerseybus` (
`title` STRING,
`description` STRING,
`link` STRING,
`guid` STRING,
`advisoryAlert` STRING,
`pubDate` STRING,
`ts` STRING,
`companyname` STRING,
`uuid` STRING,
`servicename` STRING
) WITH (
'connector' = 'postgres-cdc',
'database-name' = 'tspann',
'hostname' = '192.168.1.153',
'password' = 'tspann',
'decoding.plugin.name' = 'pgoutput',
'schema-name' = 'public',
'table-name' = 'newjerseybus',
'username' = 'tspann',
'port' = '5432'
);
Flink SQL Tables - Debezium CDC From Database Tables
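
What a CDC source table delivers can be sketched as a stream of change events applied to a keyed copy of the table. The sketch below assumes Debezium-style events with an `op` code (`c` create, `r` snapshot read, `u` update, `d` delete) and `before` / `after` row images. The function name `apply_change`, the choice of `uuid` as the key, and the sample rows are assumptions for illustration, not output captured from the demo.

```python
def apply_change(state, event, key_field="uuid"):
    """Apply one Debezium-style change event to a dict keyed by key_field."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # Inserts, snapshot reads and updates all carry the new row in "after".
        row = event["after"]
        state[row[key_field]] = row
    elif op == "d":
        # Deletes carry the old row (at least its key) in "before".
        state.pop(event["before"][key_field], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")
    return state


# Hand-written events with made-up values.
events = [
    {"op": "c", "before": None,
     "after": {"uuid": "a1", "title": "Delay", "servicename": "bus"}},
    {"op": "u", "before": {"uuid": "a1", "title": "Delay", "servicename": "bus"},
     "after": {"uuid": "a1", "title": "Cleared", "servicename": "bus"}},
    {"op": "c", "before": None,
     "after": {"uuid": "b2", "title": "Detour", "servicename": "bus"}},
    {"op": "d", "before": {"uuid": "b2", "title": "Detour", "servicename": "bus"},
     "after": None},
]

table = {}
for change in events:
    apply_change(table, change)
```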

Flink SQL Tables - Upsert to Kafka Topics
CREATE TABLE `upsert_kafka_newjerseybus` (
`title` STRING,
`description` STRING,
`link` STRING,
`guid` STRING,
`advisoryAlert` STRING,
`pubDate` STRING,
`ts` STRING,
`companyname` STRING,
`uuid` STRING,
`servicename` STRING,
`eventTimestamp` TIMESTAMP(3),
WATERMARK FOR `eventTimestamp` AS `eventTimestamp` - INTERVAL '5' SECOND,
PRIMARY KEY (uuid) NOT ENFORCED
) WITH (
'connector' = 'upsert-kafka',
'topic' = 'kafka_newjerseybus',
'properties.bootstrap.servers' = 'kafka:9092',
'key.format' = 'json',
'value.format' = 'json'
);
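
Two ideas in the table definition above can be sketched in a few lines of Python. First, upsert semantics: a record with an existing key replaces the previous value, and a record with a null value is treated as a delete. Second, the watermark: it trails the highest event time seen so far by the declared five seconds. The function names and sample records are hypothetical, and the sketch leaves out how Flink actually emits watermarks periodically and per partition.

```python
from datetime import datetime, timedelta


def materialize(records):
    """Fold (key, value) records into the latest value per key; None deletes."""
    table = {}
    for key, value in records:
        if value is None:
            table.pop(key, None)   # tombstone: remove the key if present
        else:
            table[key] = value     # insert or overwrite
    return table


def watermark_after(event_times, delay=timedelta(seconds=5)):
    """Watermark = highest event time seen so far minus the allowed delay."""
    return max(event_times) - delay


# Made-up records keyed by uuid.
records = [
    ("u1", {"title": "Delay"}),
    ("u2", {"title": "Detour"}),
    ("u1", {"title": "Cleared"}),  # replaces the earlier u1 row
    ("u2", None),                  # deletes u2
]
latest = materialize(records)

# Out-of-order event times: the watermark follows the maximum, not the last one.
seen = [
    datetime(2024, 2, 28, 12, 0, 0),
    datetime(2024, 2, 28, 12, 0, 7),
    datetime(2024, 2, 28, 12, 0, 3),
]
current_watermark = watermark_after(seen)
```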

https://medium.com/@tspann/cdc-not-cat-data-capture-e43713879c03

PYTHON 2024
