© 2023 Cloudera, Inc. All rights reserved.
Unlocking Financial Data with Real-Time Pipelines
(Flink Analytics on Stocks with SQL)
Tim Spann
Principal Developer Advocate
28-February-2024
Agenda (45 minutes)
• Introduction
• Overview
• Finance Data
• Apache Kafka and Apache Flink
• Demos
Financial institutions thrive on accurate and timely data to drive critical decision-making
processes, risk assessments, and regulatory compliance. However, managing and processing
vast amounts of financial data in real-time can be a daunting task. To overcome this
challenge, modern data engineering solutions have emerged, combining powerful
technologies like Apache Flink, Apache NiFi, Apache Kafka, and Iceberg to create efficient
and reliable real-time data pipelines. In this talk, we will explore how this technology stack
can unlock the full potential of financial data, enabling organizations to make data-driven
decisions swiftly and with confidence.
Introduction: Financial institutions operate in a fast-paced environment where real-time
access to accurate and reliable data is crucial. Traditional batch processing falls short when it
comes to handling rapidly changing financial markets and responding to customer demands
promptly. In this talk, we will delve into the power of real-time data pipelines, using the
strengths of Apache Flink, Apache NiFi, Apache Kafka, and Iceberg, to unlock the potential of
financial data. I will be using NiFi 2.0 with Python and vector databases.
Tim Spann
Twitter: @PaasDev // Blog: datainmotion.dev
Principal Developer Advocate.
Princeton Future of Data Meetup.
ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC
https://medium.com/@tspann
https://github.com/tspannhw
Future of Data - NYC + NJ + Philly + Virtual
@PaasDev
https://www.meetup.com/futureofdata-princeton/
From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ...
FLaNK Stack Weekly by Tim Spann
This week in Apache NiFi, Apache Flink, Apache Kafka, ML, AI, Apache Spark, Apache Iceberg, Python, Java, LLM, GenAI, Vector DB and Open Source friends.
https://bit.ly/32dAJft
https://www.meetup.com/futureofdata-princeton/
Overview
DATA VELOCITY in FINANCIAL SERVICES
Streaming capabilities vary, all enhance insight
[Figure: data sources arranged by streaming velocity. Tiers: Normal Streaming, Near-Real Time Streaming, Real-Time Streaming. Sources: Transaction Data, Core Banking, Risk Data, Behavioral Data, Cyber, Market Data, News Feeds, Customer Data, Connected Devices/Wearables, Chat Bots, Regulatory/Compliance, Social Media.]
NEXT GEN PLATFORM FOR TACKLING FINANCIAL CRIME
Leveraging data and analytics across the enterprise from the Edge to AI.
[Figure: platform architecture with numbered callouts 1 to 4. Data inputs: BANKING DATA, EDGE DATA, ALTERNATIVE DATA, ENTERPRISE DATA, TRADING DATA. Ingest paths: Ingest Streaming Data, Ingest Data at Rest, Ingest Stream or Batch Data, Data Flow (CDF). Platform: Enterprise Data Store, Data Processing, Data Engineering, Data Warehouse, Operational DB, ML, Data Science Workbench, Deploy Models, with a shared Catalog | Schema | Security | Governance layer. Users: Data Scientists, Business Analysts, Analytical Tools, BI and Visualization, "Teams speaking the same language". FINANCIAL CRIME APPLICATIONS: Cyber Security, AML, Fraud, Surveillance.]
Architecture in the context of Financial Use Cases
Kafka & Flink (Flink SQL with SQL Stream Builder) for real-time analytics
[Figure: architecture diagram. Labels: Finance Data, DataFlow / NiFi, Kafka, Kafka topics, Flink SQL w/ SSB, Database, Machine learning, Data Warehouse, Data Viz, Monitoring, Alerting.]
NIFI MEETS AI
Data in Motion With Cloudera
Capture, process & distribute any data, anywhere
[Figure: labels: Unstructured file types, Structured Sources, Applications/APIs, Streams, Other enterprise data, AI Model, Vector DB, Materialized Views, Open Data Lakehouse.]
Stop Fraud When It Happens: Real-Life Example
Simplified example of deployed use case
[Figure: fraud detection flow. Labels: Transactions Data, Account Data, Device Logs, Business Event Logic, Data Lakehouse, Fraud Analyst defined suspicious transaction, Real-Time Fraud Scoring, Flagged Records, Continuous Results, Freeze transaction, Fraud Monitoring Dashboard, DATA RELEVANCE.]
FINANCE DATA
Extract Company Name from User Query via NLP
Convert Company Name to Stock Symbol via Finnhub REST
REST API ARCHITECTURE - Using FLaNK to pull the data out of anything in near-real time
[Figure: capability view with INGEST, PREPARE and PUBLISH stages. Labels: DATA SOURCES (Internal Users (After Sales), External Systems), INGESTION, MESSAGE HUB, STORAGE, BATCH, MANAGEMENT, STREAM, CONSUMPTION, ENTERPRISE LAKEHOUSE, Closed Loop Systems, SQL Stream Builder, Machine Learning, Data Visualization, Workload Manager, watsonx.data.]
KAFKA and FLINK
STREAMS MESSAGING WITH KAFKA
• Highly reliable distributed messaging system.
• Decouples applications and enables many-to-many patterns.
• Publish-subscribe semantics.
• Horizontal scalability.
• Efficient implementation that operates at speed with big data volumes.
• Organized by topic to support several use cases.
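The decoupling and many-to-many points can be illustrated with a toy model. The sketch below is a minimal in-memory stand-in, not Kafka and not a Kafka client: `TinyBroker`, the topic name `stocks`, and the symbols `AAA` and `BBB` are all hypothetical. It only shows that a publisher addresses a topic and never needs to know which consumers exist.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class TinyBroker:
    """Toy in-memory topic broker (illustration only, not Kafka)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        # Any number of consumers may attach to the same topic.
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        # The publisher only names the topic; it never references consumers.
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


broker = TinyBroker()
alerts: List[str] = []
archive: List[dict] = []


def alert_on_high_price(message: dict) -> None:
    # Hypothetical consumer 1: remember symbols trading above 100.
    if message["price"] > 100:
        alerts.append(message["symbol"])


broker.subscribe("stocks", alert_on_high_price)
broker.subscribe("stocks", archive.append)  # hypothetical consumer 2: keep everything

delivered = broker.publish("stocks", {"symbol": "AAA", "price": 101.5})
broker.publish("stocks", {"symbol": "BBB", "price": 9.0})
```

A real broker adds what this sketch leaves out: durable storage, partitions for horizontal scaling, and consumers that read at their own pace.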
CONTINUOUS SQL
● SSB is a Continuous SQL engine
● It's SQL with a slightly different mental model that has big implications
● Traditional Parse/Execute/Fetch model vs. Continuous SQL model
Hint: The query is unbounded and never finishes, and time matters
AKA: SELECT * FROM foo WHERE 1=0 -- will run forever
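One way to picture the continuous model is a query that emits a revised result every time a new event changes the answer, instead of returning once. The sketch below is a simplified illustration in plain Python, not how SSB or Flink is implemented; `continuous_max` and the sample ticks are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple


def continuous_max(events: Iterable[Tuple[str, float]]) -> Iterator[Tuple[str, float]]:
    """Emit an updated (key, max) row whenever an event raises a key's maximum."""
    current: Dict[str, float] = {}
    for key, value in events:
        if key not in current or value > current[key]:
            current[key] = value
            # On an unbounded stream this generator never finishes; it keeps
            # yielding revised results for as long as events arrive.
            yield key, value


# A finite list stands in for the stream so the example terminates.
ticks = [("AAA", 10.0), ("BBB", 5.0), ("AAA", 9.5), ("AAA", 12.0)]
updates = list(continuous_max(ticks))
```

The third tick produces no output because it does not change the maximum for `AAA`. With a real stream there would be no final answer, only the latest one.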
SQL STREAM BUILDER (SSB)
Democratize access to real-time data with just SQL
• SQL Stream Builder allows developers, analysts, and data scientists to write streaming applications with industry-standard SQL.
• No Java or Scala code development required.
• Simplifies access to data in Kafka & Flink. Connectors to batch data in HDFS, Kudu, Hive, S3, JDBC, CDC and more.
• Enrich streaming data with batch data in a single tool.
SSB MATERIALIZED VIEWS
Key takeaway: MVs allow data scientists, analysts, and developers to consume data from the firehose.
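The idea behind a materialized view over a stream can be sketched as a table that keeps only the latest row per key, so a reader can ask for the current state without replaying the firehose. This is a toy model under stated assumptions, not SSB's implementation; the `MaterializedView` class, the `symbol` key, and the sample rows are hypothetical.

```python
from typing import Dict, List, Optional


class MaterializedView:
    """Keeps the latest row per primary key so readers can poll a snapshot."""

    def __init__(self, key_field: str) -> None:
        self._key_field = key_field
        self._rows: Dict[str, dict] = {}

    def apply(self, row: dict) -> None:
        # A newer row for the same key replaces the older one.
        self._rows[row[self._key_field]] = row

    def get(self, key: str) -> Optional[dict]:
        return self._rows.get(key)

    def snapshot(self) -> List[dict]:
        # Sorted by key so the output order is stable.
        return sorted(self._rows.values(), key=lambda r: r[self._key_field])


view = MaterializedView("symbol")
for row in [
    {"symbol": "AAA", "price": 10.0},
    {"symbol": "BBB", "price": 5.0},
    {"symbol": "AAA", "price": 12.0},
]:
    view.apply(row)

latest = view.get("AAA")
```

Three events arrive but the snapshot holds two rows, one per key, which is what lets a dashboard or notebook read a small current view of a large stream.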
ICEBERG INTEGRATION
Robust Next Generation Architecture for Data Driven Business
Unified Processing Engine
Massive Open table format
Iceberg Support for Flink APIs through SSB
• Maximally open
• Maximally flexible
• Ultra high performance for MASSIVE data
DEMO
I Can Haz Data?
https://github.com/tspannhw/FLaNK-Py-Stocks
https://github.com/tspannhw/PaK-Stocks
Continuous SQL
-- Altitude and ground-speed statistics per aircraft from the adsb table
select max(alt_baro) as MaxAltitudeFeet, min(alt_baro) as MinAltitudeFeet, avg(alt_baro) as AvgAltitudeFeet,
       max(alt_geom) as MaxGAltitudeFeet, min(alt_geom) as MinGAltitudeFeet, avg(alt_geom) as AvgGAltitudeFeet,
       max(gs) as MaxGroundSpeed, min(gs) as MinGroundSpeed, avg(gs) as AvgGroundSpeed,
       count(alt_baro) as RowCount,
       hex as ICAO, flight as IDENT
from `sr1`.`default_database`.`adsb`
group by flight, hex;

-- transcom rows joined to mta vehicles inside a 0.3 degree bounding box, then filtered by distance
select transcom.title, transcom.description, mta.VehicleRef,
       DISTANCE_BETWEEN(CAST(transcom.latitude as STRING), CAST(transcom.longitude as STRING), mta.VehicleLocationLatitude, mta.VehicleLocationLongitude) as miles,
       mta.StopPointName, mta.Bearing, mta.DestinationName, mta.ExpectedArrivalTime, mta.VehicleLocationLatitude, mta.VehicleLocationLongitude,
       mta.ArrivalProximityText, mta.DistanceFromStop, mta.AimedArrivalTime, mta.`Date`, mta.ts, mta.uuid, mta.EstimatedPassengerCapacity, mta.EstimatedPassengerCount
from `schemareg1`.`default_database`.`mta` /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ mta
FULL OUTER JOIN `schemareg1`.`default_database`.`transcom` /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ transcom
  ON (transcom.latitude >= CAST(mta.VehicleLocationLatitude as float) - 0.3)
  AND (transcom.longitude >= CAST(mta.VehicleLocationLongitude as float) - 0.3)
  AND (transcom.latitude <= CAST(mta.VehicleLocationLatitude as float) + 0.3)
  AND (transcom.longitude <= CAST(mta.VehicleLocationLongitude as float) + 0.3)
WHERE mta.VehicleRef is not null
  AND transcom.title is not null
  AND DISTANCE_BETWEEN(CAST(transcom.latitude as STRING), CAST(transcom.longitude as STRING), mta.VehicleLocationLatitude, mta.VehicleLocationLongitude) <= 120;
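The slides do not show how `DISTANCE_BETWEEN` is defined. The sketch below assumes it returns the great-circle distance in miles between two latitude/longitude pairs passed as strings, which matches the `miles` alias and the STRING casts in the query; the function name `distance_between_miles`, the haversine formula, and the Earth radius constant are assumptions, not the deck's actual UDF.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumed constant


def distance_between_miles(lat1, lon1, lat2, lon2) -> float:
    """Haversine great-circle distance in miles.

    Accepts strings or numbers, since the query casts coordinates to STRING.
    """
    lat1, lon1, lat2, lon2 = (math.radians(float(v)) for v in (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# New York City to Philadelphia, coordinates given as strings
nyc_to_philly = distance_between_miles("40.7128", "-74.0060", "39.9526", "-75.1652")

# One degree of latitude along a meridian
one_degree = distance_between_miles(0, 0, 1, 0)
```

The bounding-box conditions in the join act as a cheap prefilter, so the more expensive distance calculation only runs on rows that are already close.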
DEMO
CDC with Flink SQL (SSB)
Streaming CDC with Cloudera SQL Stream Builder (Flink SQL)
https://github.com/tspannhw/FLaNK-CDC/blob/main/flinkcdc.MD
https://docs.cloudera.com/csa/1.10.0/how-to-ssb/topics/csa-ssb-cdc-connectors.html
CDC with Debezium and Flink
SQL Stream Builder with Flink SQL
Flink SQL Tables - Debezium CDC From Database Tables
CREATE TABLE `postgres_cdc_newjerseybus` (
  `title` STRING,
  `description` STRING,
  `link` STRING,
  `guid` STRING,
  `advisoryAlert` STRING,
  `pubDate` STRING,
  `ts` STRING,
  `companyname` STRING,
  `uuid` STRING,
  `servicename` STRING
) WITH (
  'connector' = 'postgres-cdc',
  'database-name' = 'tspann',
  'hostname' = '192.168.1.153',
  'password' = 'tspann',
  'decoding.plugin.name' = 'pgoutput',
  'schema-name' = 'public',
  'table-name' = 'newjerseybus',
  'username' = 'tspann',
  'port' = '5432'
);
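To make the CDC idea concrete, the sketch below applies Debezium-style change events to an in-memory copy of a table. It relies on the Debezium envelope fields `op` (`c` create, `r` snapshot read, `u` update, `d` delete), `before`, and `after`. Treating `uuid` as the row key is an assumption, since the source table definition above declares no primary key, and the sample rows are hypothetical.

```python
from typing import Dict


def apply_change(table: Dict[str, dict], event: dict, key: str = "uuid") -> None:
    """Apply one Debezium-style change event to a dict keyed by the row key."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # Create, snapshot read, and update all carry the new row in "after".
        row = event["after"]
        table[row[key]] = row
    elif op == "d":
        # Delete carries the removed row in "before".
        table.pop(event["before"][key], None)
    else:
        raise ValueError(f"unexpected op: {op!r}")


table: Dict[str, dict] = {}
events = [
    {"op": "c", "before": None, "after": {"uuid": "1", "title": "Delay"}},
    {"op": "u", "before": {"uuid": "1", "title": "Delay"},
     "after": {"uuid": "1", "title": "Delay cleared"}},
    {"op": "c", "before": None, "after": {"uuid": "2", "title": "Detour"}},
    {"op": "d", "before": {"uuid": "2", "title": "Detour"}, "after": None},
]
for event in events:
    apply_change(table, event)
```

After the four events the copy holds a single row, the updated advisory with `uuid` 1, which is the state a downstream query over the CDC table would see.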
Flink SQL Tables - Upsert to Kafka Topics
CREATE TABLE `upsert_kafka_newjerseybus` (
  `title` STRING,
  `description` STRING,
  `link` STRING,
  `guid` STRING,
  `advisoryAlert` STRING,
  `pubDate` STRING,
  `ts` STRING,
  `companyname` STRING,
  `uuid` STRING,
  `servicename` STRING,
  `eventTimestamp` TIMESTAMP(3),
  WATERMARK FOR `eventTimestamp` AS `eventTimestamp` - INTERVAL '5' SECOND,
  PRIMARY KEY (`uuid`) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'kafka_newjerseybus',
  'properties.bootstrap.servers' = 'kafka:9092',
  'key.format' = 'json',
  'value.format' = 'json'
);
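Two ideas in this table definition can be sketched in a few lines: upsert semantics, where the latest value per key wins and a null value means delete, and the watermark, which trails the largest event time seen by five seconds. The code below is a simplified illustration, not Flink's implementation; the function names and sample records are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Optional, Tuple

ALLOWED_DELAY = timedelta(seconds=5)  # mirrors INTERVAL '5' SECOND in the DDL


def replay_changelog(records: Iterable[Tuple[str, Optional[dict]]]) -> Dict[str, dict]:
    """Fold (key, value) records into the latest state; a None value deletes the key."""
    state: Dict[str, dict] = {}
    for key, value in records:
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state


def watermarks(event_times: Iterable[datetime]) -> List[datetime]:
    """Return the watermark after each event: largest event time seen minus 5 s."""
    result: List[datetime] = []
    max_seen: Optional[datetime] = None
    for ts in event_times:
        if max_seen is None or ts > max_seen:
            max_seen = ts
        result.append(max_seen - ALLOWED_DELAY)
    return result


state = replay_changelog([
    ("1", {"title": "Delay"}),
    ("2", {"title": "Detour"}),
    ("1", {"title": "Delay cleared"}),
    ("2", None),  # null value: the row for key "2" is removed
])

base = datetime(2024, 2, 28, 12, 0, 0)
marks = watermarks([
    base + timedelta(seconds=10),
    base + timedelta(seconds=8),   # out of order: the watermark does not move back
    base + timedelta(seconds=20),
])
```

The out-of-order event at 12:00:08 is still ahead of the watermark of 12:00:05, so the five-second allowance gives it time to arrive before results for that period are finalized.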
https://medium.com/@tspann/cdc-not-cat-data-capture-e43713879c03
PYTHON 2024
THANK YOU

More Related Content

PPTX
Nutanix
rosslili
 
PPTX
시스템 프로그램 설계1 최종발표
Jeongmin Cha
 
PDF
Amazon SageMaker 모델 학습 방법 소개::최영준, 솔루션즈 아키텍트 AI/ML 엑스퍼트, AWS::AWS AIML 스페셜 웨비나
Amazon Web Services Korea
 
PDF
[Webinar]: Working with Reactive Spring
Knoldus Inc.
 
PPTX
Serverless Siege: AWS Lambda Pentesting - OWASP Top 10 Serverless C0c0n 2023
Divyanshu
 
PDF
Building Event-Driven (Micro) Services with Apache Kafka
Guido Schmutz
 
PDF
Amazon SageMaker 모델 빌딩 파이프라인 소개::이유동, AI/ML 스페셜리스트 솔루션즈 아키텍트, AWS::AWS AIML 스...
Amazon Web Services Korea
 
PDF
Amazon Aurora 신규 서비스 알아보기::최유정::AWS Summit Seoul 2018
Amazon Web Services Korea
 
Nutanix
rosslili
 
시스템 프로그램 설계1 최종발표
Jeongmin Cha
 
Amazon SageMaker 모델 학습 방법 소개::최영준, 솔루션즈 아키텍트 AI/ML 엑스퍼트, AWS::AWS AIML 스페셜 웨비나
Amazon Web Services Korea
 
[Webinar]: Working with Reactive Spring
Knoldus Inc.
 
Serverless Siege: AWS Lambda Pentesting - OWASP Top 10 Serverless C0c0n 2023
Divyanshu
 
Building Event-Driven (Micro) Services with Apache Kafka
Guido Schmutz
 
Amazon SageMaker 모델 빌딩 파이프라인 소개::이유동, AI/ML 스페셜리스트 솔루션즈 아키텍트, AWS::AWS AIML 스...
Amazon Web Services Korea
 
Amazon Aurora 신규 서비스 알아보기::최유정::AWS Summit Seoul 2018
Amazon Web Services Korea
 

What's hot (20)

PDF
Chuleta de aprendizaje de Python3 (1).pdf
victorpedro20
 
PPTX
SignalR Overview
Michael Sukachev
 
PPTX
Hybrid cloud and azure stack
TechsternSolutions
 
PDF
(발표자료) CentOS EOL에 따른 대응 OS 검토 및 적용 방안.pdf
ssuserf8b8bd1
 
PDF
AWS Control Tower
CloudHesive
 
PDF
AWS CLOUD 2017 - AWS 기반 하이브리드 클라우드 환경 구성 전략 (김용우 솔루션즈 아키텍트)
Amazon Web Services Korea
 
PPTX
Centralized Logging System Using ELK Stack
Rohit Sharma
 
PPTX
Introduction to Apache ZooKeeper
Saurav Haloi
 
PPTX
The Top 5 Apache Kafka Use Cases and Architectures in 2022
Kai Wähner
 
PDF
Hcx intro preso v2
Parashar Singh
 
PDF
End-to-End Machine Learning with Amazon SageMaker
Sungmin Kim
 
PPTX
Everything you need to know about Azure Virtual Machines
Adil Arif
 
PPTX
VMware Advance Troubleshooting Workshop - Day 3
Vepsun Technologies
 
PPTX
Warehouse Storage Managment System Database Schema
Matt Saragusa
 
PDF
Aws glue를 통한 손쉬운 데이터 전처리 작업하기
Amazon Web Services Korea
 
PDF
GoDaddy Guide to cPanel and WordPress
GoDaddy
 
PDF
클라우드 마이그레이션을 통한 비지니스 성공 사례- AWS Summit Seoul 2017
Amazon Web Services Korea
 
PDF
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
Databricks
 
PDF
Understanding Kubernetes
Tu Pham
 
PDF
Amazon SageMaker 모델 배포 방법 소개::김대근, AI/ML 스페셜리스트 솔루션즈 아키텍트, AWS::AWS AIML 스페셜 웨비나
Amazon Web Services Korea
 
Chuleta de aprendizaje de Python3 (1).pdf
victorpedro20
 
SignalR Overview
Michael Sukachev
 
Hybrid cloud and azure stack
TechsternSolutions
 
(발표자료) CentOS EOL에 따른 대응 OS 검토 및 적용 방안.pdf
ssuserf8b8bd1
 
AWS Control Tower
CloudHesive
 
AWS CLOUD 2017 - AWS 기반 하이브리드 클라우드 환경 구성 전략 (김용우 솔루션즈 아키텍트)
Amazon Web Services Korea
 
Centralized Logging System Using ELK Stack
Rohit Sharma
 
Introduction to Apache ZooKeeper
Saurav Haloi
 
The Top 5 Apache Kafka Use Cases and Architectures in 2022
Kai Wähner
 
Hcx intro preso v2
Parashar Singh
 
End-to-End Machine Learning with Amazon SageMaker
Sungmin Kim
 
Everything you need to know about Azure Virtual Machines
Adil Arif
 
VMware Advance Troubleshooting Workshop - Day 3
Vepsun Technologies
 
Warehouse Storage Managment System Database Schema
Matt Saragusa
 
Aws glue를 통한 손쉬운 데이터 전처리 작업하기
Amazon Web Services Korea
 
GoDaddy Guide to cPanel and WordPress
GoDaddy
 
클라우드 마이그레이션을 통한 비지니스 성공 사례- AWS Summit Seoul 2017
Amazon Web Services Korea
 
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
Databricks
 
Understanding Kubernetes
Tu Pham
 
Amazon SageMaker 모델 배포 방법 소개::김대근, AI/ML 스페셜리스트 솔루션즈 아키텍트, AWS::AWS AIML 스페셜 웨비나
Amazon Web Services Korea
 
Ad

Similar to 2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipelines (20)

PDF
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
Timothy Spann
 
PDF
OSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
Timothy Spann
 
PDF
Meetup: Streaming Data Pipeline Development
Timothy Spann
 
PDF
Meetup - Brasil - Data In Motion - 2023 September 19
ssuser73434e
 
PDF
Meetup - Brasil - Data In Motion - 2023 September 19
Timothy Spann
 
PDF
JConWorld_ Continuous SQL with Kafka and Flink
Timothy Spann
 
PDF
Evolve 2023 NYC - Integrating AI Into Realtime Data Pipelines Demo
Timothy Spann
 
PDF
CoC23_Utilizing Real-Time Transit Data for Travel Optimization
Timothy Spann
 
PDF
BigDataFest_ Building Modern Data Streaming Apps
ssuser73434e
 
PDF
big data fest building modern data streaming apps
Timothy Spann
 
PDF
GSJUG: Mastering Data Streaming Pipelines 09May2023
Timothy Spann
 
PDF
Implement a Universal Data Distribution Architecture to Manage All Streaming ...
Timothy Spann
 
PDF
RTAS 2023: Building a Real-Time IoT Application
Timothy Spann
 
PDF
The Never Landing Stream with HTAP and Streaming
Timothy Spann
 
PDF
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Timothy Spann
 
PDF
OSSFinance_UnlockingFinancialDatawithReal-TimePipelines.pdf
Timothy Spann
 
PDF
NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...
Timothy Spann
 
PDF
Tracking crime as it occurs with apache phoenix, apache hbase and apache nifi
Timothy Spann
 
PDF
Future of Data Milwaukee Meetup Streaming Data Pipeline Development 28 June 2023
ssuser73434e
 
PDF
Meetup Streaming Data Pipeline Development
Timothy Spann
 
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
Timothy Spann
 
OSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
Timothy Spann
 
Meetup: Streaming Data Pipeline Development
Timothy Spann
 
Meetup - Brasil - Data In Motion - 2023 September 19
ssuser73434e
 
Meetup - Brasil - Data In Motion - 2023 September 19
Timothy Spann
 
JConWorld_ Continuous SQL with Kafka and Flink
Timothy Spann
 
Evolve 2023 NYC - Integrating AI Into Realtime Data Pipelines Demo
Timothy Spann
 
CoC23_Utilizing Real-Time Transit Data for Travel Optimization
Timothy Spann
 
BigDataFest_ Building Modern Data Streaming Apps
ssuser73434e
 
big data fest building modern data streaming apps
Timothy Spann
 
GSJUG: Mastering Data Streaming Pipelines 09May2023
Timothy Spann
 
Implement a Universal Data Distribution Architecture to Manage All Streaming ...
Timothy Spann
 
RTAS 2023: Building a Real-Time IoT Application
Timothy Spann
 
The Never Landing Stream with HTAP and Streaming
Timothy Spann
 
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Timothy Spann
 
OSSFinance_UnlockingFinancialDatawithReal-TimePipelines.pdf
Timothy Spann
 
NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...
Timothy Spann
 
Tracking crime as it occurs with apache phoenix, apache hbase and apache nifi
Timothy Spann
 
Future of Data Milwaukee Meetup Streaming Data Pipeline Development 28 June 2023
ssuser73434e
 
Meetup Streaming Data Pipeline Development
Timothy Spann
 
Ad

More from Timothy Spann (20)

PDF
14May2025_TSPANN_FromAirQualityUnstructuredData.pdf
Timothy Spann
 
PDF
Streaming AI Pipelines with Apache NiFi and Snowflake NYC 2025
Timothy Spann
 
PDF
2025-03-03-Philly-AAAI-GoodData-Build Secure RAG Apps With Open LLM
Timothy Spann
 
PDF
Conf42_IoT_Dec2024_Building IoT Applications With Open Source
Timothy Spann
 
PDF
2024 Dec 05 - PyData Global - Tutorial Its In The Air Tonight
Timothy Spann
 
PDF
2024Nov20-BigDataEU-RealTimeAIWithOpenSource
Timothy Spann
 
PDF
TSPANN-2024-Nov-CloudX-Adding Generative AI to Real-Time Streaming Pipelines
Timothy Spann
 
PDF
2024-Nov-BuildStuff-Adding Generative AI to Real-Time Streaming Pipelines
Timothy Spann
 
PDF
14 November 2024 - Conf 42 - Prompt Engineering - Codeless Generative AI Pipe...
Timothy Spann
 
PDF
2024 Nov 05 - Linux Foundation TAC TALK With Milvus
Timothy Spann
 
PPTX
tspann06-NOV-2024_AI-Alliance_NYC_ intro to Data Prep Kit and Open Source RAG
Timothy Spann
 
PDF
tspann08-Nov-2024_PyDataNYC_Unstructured Data Processing with a Raspberry Pi ...
Timothy Spann
 
PDF
2024-10-28 All Things Open - Advanced Retrieval Augmented Generation (RAG) Te...
Timothy Spann
 
PDF
10-25-2024_BITS_NYC_Unstructured Data and LLM_ What, Why and How
Timothy Spann
 
PDF
2024-OCT-23 NYC Meetup - Unstructured Data Meetup - Unstructured Halloween
Timothy Spann
 
PDF
DBTA Round Table with Zilliz and Airbyte - Unstructured Data Engineering
Timothy Spann
 
PDF
17-October-2024 NYC AI Camp - Step-by-Step RAG 101
Timothy Spann
 
PDF
11-OCT-2024_AI_101_CryptoOracle_UnstructuredData
Timothy Spann
 
PDF
2024-10-04 - Grace Hopper Celebration Open Source Day - Stefan
Timothy Spann
 
PDF
01-Oct-2024_PES-VectorDatabasesAndAI.pdf
Timothy Spann
 
14May2025_TSPANN_FromAirQualityUnstructuredData.pdf
Timothy Spann
 
Streaming AI Pipelines with Apache NiFi and Snowflake NYC 2025
Timothy Spann
 
2025-03-03-Philly-AAAI-GoodData-Build Secure RAG Apps With Open LLM
Timothy Spann
 
Conf42_IoT_Dec2024_Building IoT Applications With Open Source
Timothy Spann
 
2024 Dec 05 - PyData Global - Tutorial Its In The Air Tonight
Timothy Spann
 
2024Nov20-BigDataEU-RealTimeAIWithOpenSource
Timothy Spann
 
TSPANN-2024-Nov-CloudX-Adding Generative AI to Real-Time Streaming Pipelines
Timothy Spann
 
2024-Nov-BuildStuff-Adding Generative AI to Real-Time Streaming Pipelines
Timothy Spann
 
14 November 2024 - Conf 42 - Prompt Engineering - Codeless Generative AI Pipe...
Timothy Spann
 
2024 Nov 05 - Linux Foundation TAC TALK With Milvus
Timothy Spann
 
tspann06-NOV-2024_AI-Alliance_NYC_ intro to Data Prep Kit and Open Source RAG
Timothy Spann
 
tspann08-Nov-2024_PyDataNYC_Unstructured Data Processing with a Raspberry Pi ...
Timothy Spann
 
2024-10-28 All Things Open - Advanced Retrieval Augmented Generation (RAG) Te...
Timothy Spann
 
10-25-2024_BITS_NYC_Unstructured Data and LLM_ What, Why and How
Timothy Spann
 
2024-OCT-23 NYC Meetup - Unstructured Data Meetup - Unstructured Halloween
Timothy Spann
 
DBTA Round Table with Zilliz and Airbyte - Unstructured Data Engineering
Timothy Spann
 
17-October-2024 NYC AI Camp - Step-by-Step RAG 101
Timothy Spann
 
11-OCT-2024_AI_101_CryptoOracle_UnstructuredData
Timothy Spann
 
2024-10-04 - Grace Hopper Celebration Open Source Day - Stefan
Timothy Spann
 
01-Oct-2024_PES-VectorDatabasesAndAI.pdf
Timothy Spann
 

Recently uploaded (20)

PPTX
Introduction to Data Analytics and Data Science
KavithaCIT
 
PPTX
INFO8116 - Week 10 - Slides.pptx data analutics
guddipatel10
 
PPTX
short term project on AI Driven Data Analytics
JMJCollegeComputerde
 
PPTX
The whitetiger novel review for collegeassignment.pptx
DhruvPatel754154
 
PDF
Blue Futuristic Cyber Security Presentation.pdf
tanvikhunt1003
 
PPTX
Blue and Dark Blue Modern Technology Presentation.pptx
ap177979
 
PDF
Practical Measurement Systems Analysis (Gage R&R) for design
Rob Schubert
 
PDF
202501214233242351219 QASS Session 2.pdf
lauramejiamillan
 
PPTX
Web dev -ppt that helps us understand web technology
shubhragoyal12
 
PPT
From Vision to Reality: The Digital India Revolution
Harsh Bharvadiya
 
PPTX
Data-Users-in-Database-Management-Systems (1).pptx
dharmik832021
 
PDF
Fundamentals and Techniques of Biophysics and Molecular Biology (Pranav Kumar...
RohitKumar868624
 
PDF
An Uncut Conversation With Grok | PDF Document
Mike Hydes
 
PPTX
Introduction to computer chapter one 2017.pptx
mensunmarley
 
PPTX
Data Security Breach: Immediate Action Plan
varmabhuvan266
 
PPTX
Presentation on animal welfare a good topic
kidscream385
 
PDF
blockchain123456789012345678901234567890
tanvikhunt1003
 
PDF
Technical Writing Module-I Complete Notes.pdf
VedprakashArya13
 
PPT
Real Life Application of Set theory, Relations and Functions
manavparmar205
 
PDF
Key_Statistical_Techniques_in_Analytics_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
Introduction to Data Analytics and Data Science
KavithaCIT
 
INFO8116 - Week 10 - Slides.pptx data analutics
guddipatel10
 
short term project on AI Driven Data Analytics
JMJCollegeComputerde
 
The whitetiger novel review for collegeassignment.pptx
DhruvPatel754154
 
Blue Futuristic Cyber Security Presentation.pdf
tanvikhunt1003
 
Blue and Dark Blue Modern Technology Presentation.pptx
ap177979
 
Practical Measurement Systems Analysis (Gage R&R) for design
Rob Schubert
 
202501214233242351219 QASS Session 2.pdf
lauramejiamillan
 
Web dev -ppt that helps us understand web technology
shubhragoyal12
 
From Vision to Reality: The Digital India Revolution
Harsh Bharvadiya
 
Data-Users-in-Database-Management-Systems (1).pptx
dharmik832021
 
Fundamentals and Techniques of Biophysics and Molecular Biology (Pranav Kumar...
RohitKumar868624
 
An Uncut Conversation With Grok | PDF Document
Mike Hydes
 
Introduction to computer chapter one 2017.pptx
mensunmarley
 
Data Security Breach: Immediate Action Plan
varmabhuvan266
 
Presentation on animal welfare a good topic
kidscream385
 
blockchain123456789012345678901234567890
tanvikhunt1003
 
Technical Writing Module-I Complete Notes.pdf
VedprakashArya13
 
Real Life Application of Set theory, Relations and Functions
manavparmar205
 
Key_Statistical_Techniques_in_Analytics_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 

2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipelines

  • 1. © 2023 Cloudera, Inc. All rights reserved. Unlocking Financial Data with Real-Time Pipelines (Flink Analytics on Stocks with SQL ) Tim Spann Principal Developer Advocate 28-February-2024
  • 2. © 2023 Cloudera, Inc. All rights reserved.
  • 3. © 2023 Cloudera, Inc. All rights reserved. 3 Introduction Overview Finance Data Apache Kafka and Apache Flink Demos Agenda (45 minutes)
  • 4. © 2023 Cloudera, Inc. All rights reserved. 4 Financial institutions thrive on accurate and timely data to drive critical decision-making processes, risk assessments, and regulatory compliance. However, managing and processing vast amounts of financial data in real-time can be a daunting task. To overcome this challenge, modern data engineering solutions have emerged, combining powerful technologies like Apache Flink, Apache NiFi, Apache Kafka, and Iceberg to create efficient and reliable real-time data pipelines. In this talk, we will explore how this technology stack can unlock the full potential of financial data, enabling organizations to make data-driven decisions swiftly and with confidence. Introduction: Financial institutions operate in a fast-paced environment where real-time access to accurate and reliable data is crucial. Traditional batch processing falls short when it comes to handling rapidly changing financial markets and responding to customer demands promptly. In this talk, we will delve into the power of real-time data pipelines, utilizing the strengths of Apache Flink, Apache NiFi, Apache Kafka, and Iceberg, to unlock the potential of financial data. I will be utilizing NiFi 2.0 with Python and Vector Databases.
  • 5. © 2023 Cloudera, Inc. All rights reserved. 5 Tim Spann Twitter: @PaasDev // Blog: datainmotion.dev Principal Developer Advocate. Princeton Future of Data Meetup. ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC https://medium.com/@tspann https://github.com/tspannhw
  • 6. © 2023 Cloudera, Inc. All rights reserved. 6 Confidential—Restricted @PaasDev https://www.meetup.com/futureofdata-princeton/ From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ... Future of Data - NYC + NJ + Philly + Virtual
  • 7. © 2023 Cloudera, Inc. All rights reserved. 7 This week in Apache NiFi, Apache Flink, Apache Kafka, ML, AI, Apache Spark, Apache Iceberg, Python, Java, LLM, GenAI, Vector DB and Open Source friends. https://bit.ly/32dAJft https://www.meetup.com/futureofdata- princeton/ FLaNK Stack Weekly by Tim Spann
  • 8. © 2023 Cloudera, Inc. All rights reserved. Overview
  • 9. © 2023 Cloudera, Inc. All rights reserved. 9 DATA VELOCITY in FINANCIAL SERVICES Streaming capabilities vary, all enhance insight Transaction Data Core Banking Risk Data Behavioral Data Cyber Market Data News Feeds Customer Data Connected Devices/ Wearables Chat Bots Normal Streaming Regulatory, Compliance Near-Real Time Streaming Real-Time Streaming Social Media
  • 10. © 2023 Cloudera, Inc. All rights reserved. 10 NEXT GEN PLATFORM FOR TACKLING FINANCIAL CRIME Leveraging data and analytics across the enterprise from the Edge to AI. Ingest Streaming Data BANKING DATA Data Flow (CDF) Data Science Workbench FINANCIAL CRIME APPLICATIONS Data Scientists Data Processing Data Engineering Data Warehouse Operational DB Catalog | Schema | Security | Governance Business Analysts EDGE DATA ALTERNATIVE DATA Enterprise Data Store ML Cyber Security AML Fraud Surveillance Analytical Tools BI and Visualization Ingest Data at Rest Deploy Models Ingest Stream or Batch Data Teams speaking the same language ENTERPRISE DATA TRADING DATA ` Ingest 1 2 3 4
  • 11. 11 © 2023 Cloudera, Inc. All rights reserved. Kafka & Flink (Flink SQL with Stream SQL Builder) for real time analytics Kafka Kafka topics Database Machine learning Flink SQL w/ SSB Data Warehouse Data Viz Monitoring Alerting F in a n c e D a t a Architecture in the context of Financial Use Cases DataFlow / NiFi
  • 12. © 2023 Cloudera, Inc. All rights reserved. 12 NIFI MEETS AI Vector DB AI Model Unstructured file types Data in Motion With Cloudera Capture, process & distribute any data, anywhere Other enterprise data Open Data Lakehouse Materialized Views Structured Sources Applications/API’s Streams
  • 13. © 2023 Cloudera, Inc. All rights reserved. 13 Transactions Data Account Data Device Logs Business Event Logic Data Lakehouse Flagged Records Continuous Results Fraud Analyst defined suspicious transaction Real-Time Fraud Scoring Freeze transaction Fraud Monitoring Dashboard Stop Fraud When It Happens—Real Life Example Simplified example of deployed use case DATA RELEVANCE
  • 14. © 2023 Cloudera, Inc. All rights reserved. FINANCE DATA
  • 15. Extract Company Name from User Query via NLP Convert Company Name to Stock Symbol via Finnhub REST
  • 16. REST API ARCHITECTURE - Using FLaNK to pull the data out of anything in near-real time INGEST PREPARE PUBLISH DATA SOURCES Internal Users (After Sales) External Systems ENTERPRISE LAKEHOUSE CAPABILITY VIEW INGESTION MESSAGE HUB STORAGE BATCH MANAGEMENT STREAM CONSUMPTION Closed Loop Systems SQL Stream Builder Machine Learning Data Visualization Workload Manager watsonx.data
  • 17. © 2023 Cloudera, Inc. All rights reserved. KAFKA and FLINK
  • 18. © 2023 Cloudera, Inc. All rights reserved. © 2019 Cloudera, Inc. All rights reserved. 18 STREAMS MESSAGING WITH KAFKA • Highly reliable distributed messaging system. • Decouple applications, enables many-to-many patterns. • Publish-Subscribe semantics. • Horizontal scalability. • Efficient implementation to operate at speed with big data volumes. • Organized by topic to support several use cases.
  • 19. © 2023 Cloudera, Inc. All rights reserved. 19 CONTINUOUS SQL ● SSB is a Continuous SQL engine ● It’s SQL, but a slightly different mental model, but with big implications Traditional Parse/Execute/Fetch model Continuous SQL Model Hint: The query is boundless and never finishes, and time matters AKA: SELECT * FROM foo WHERE 1=0 -- will run forever
  • 20. 20 © 2022 Cloudera, Inc. All rights reserved. SQL STREAM BUILDER (SSB) SQL STREAM BUILDER allows developers, analysts, and data scientists to write streaming applications with industry standard SQL. No Java or Scala code development required. Simplifies access to data in Kafka & Flink. Connectors to batch data in HDFS, Kudu, Hive, S3, JDBC, CDC and more Enrich streaming data with batch data in a single tool Democratize access to real-time data with just SQL
  • 21. SSB MATERIALIZED VIEWS. Key takeaway: MVs allow data scientists, analysts, and developers to consume data from the firehose.
  • 23. ICEBERG INTEGRATION: Robust Next Generation Architecture for Data Driven Business. Unified Processing Engine. Massive Open table format. Iceberg Support for Flink APIs through SSB.
      • Maximally open
      • Maximally flexible
      • Ultra high performance for MASSIVE data
  • 24. DEMO: I Can Haz Data? https://github.com/tspannhw/FLaNK-Py-Stocks https://github.com/tspannhw/PaK-Stocks
  • 26. Continuous SQL
      -- Altitude and ground-speed statistics per aircraft from the ADS-B stream
      select max(alt_baro) as MaxAltitudeFeet, min(alt_baro) as MinAltitudeFeet, avg(alt_baro) as AvgAltitudeFeet,
             max(alt_geom) as MaxGAltitudeFeet, min(alt_geom) as MinGAltitudeFeet, avg(alt_geom) as AvgGAltitudeFeet,
             max(gs) as MaxGroundSpeed, min(gs) as MinGroundSpeed, avg(gs) as AvgGroundSpeed,
             count(alt_baro) as RowCount,
             hex as ICAO, flight as IDENT
      from `sr1`.`default_database`.`adsb`
      group by flight, hex;

      -- Join MTA vehicles to TRANSCOM events inside a 0.3-degree bounding box,
      -- keeping only pairs at most 120 miles apart
      select transcom.title, transcom.description, mta.VehicleRef,
             DISTANCE_BETWEEN(CAST(transcom.latitude as STRING), CAST(transcom.longitude as STRING),
                              mta.VehicleLocationLatitude, mta.VehicleLocationLongitude) as miles,
             mta.StopPointName, mta.Bearing, mta.DestinationName, mta.ExpectedArrivalTime,
             mta.VehicleLocationLatitude, mta.VehicleLocationLongitude, mta.ArrivalProximityText,
             mta.DistanceFromStop, mta.AimedArrivalTime, mta.`Date`, mta.ts, mta.uuid,
             mta.EstimatedPassengerCapacity, mta.EstimatedPassengerCount
      from `schemareg1`.`default_database`.`mta` /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ mta
      FULL OUTER JOIN `schemareg1`.`default_database`.`transcom` /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ transcom
        ON (transcom.latitude >= CAST(mta.VehicleLocationLatitude as float) - 0.3)
       AND (transcom.longitude >= CAST(mta.VehicleLocationLongitude as float) - 0.3)
       AND (transcom.latitude <= CAST(mta.VehicleLocationLatitude as float) + 0.3)
       AND (transcom.longitude <= CAST(mta.VehicleLocationLongitude as float) + 0.3)
      WHERE mta.VehicleRef is not null
        AND transcom.title is not null
        AND DISTANCE_BETWEEN(CAST(transcom.latitude as STRING), CAST(transcom.longitude as STRING),
                             mta.VehicleLocationLatitude, mta.VehicleLocationLongitude) <= 120
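The join in the second query combines a cheap bounding-box filter with a distance check. The Python sketch below mirrors that logic under stated assumptions: the slide does not show how the `DISTANCE_BETWEEN` function is implemented, so the haversine great-circle formula is used here as a plausible stand-in, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (haversine; assumed stand-in for the UDF)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def in_box(lat1, lon1, lat2, lon2, delta=0.3):
    """Bounding-box test equivalent to the four ON conditions in the join."""
    return abs(lat1 - lat2) <= delta and abs(lon1 - lon2) <= delta


def nearby(lat1, lon1, lat2, lon2, max_miles=120.0):
    """Cheap box filter first, then the more expensive distance check."""
    return in_box(lat1, lon1, lat2, lon2) and distance_miles(lat1, lon1, lat2, lon2) <= max_miles


print(nearby(40.7, -74.0, 40.9, -74.2))
```

Running the box test first keeps the trigonometry off pairs that are obviously far apart, which is the same reason the SQL puts the range predicates in the join condition.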
  • 27. DEMO
  • 28. CDC with Flink SQL (SSB)
  • 29. Streaming CDC with Cloudera SQL Stream Builder (Flink SQL) https://github.com/tspannhw/FLaNK-CDC/blob/main/flinkcdc.MD
  • 30. CDC with Debezium and Flink SQL Stream Builder with Flink SQL https://docs.cloudera.com/csa/1.10.0/how-to-ssb/topics/csa-ssb-cdc-connectors.html
  • 31. CDC with Debezium and Flink SQL Stream Builder with Flink SQL
  • 33. Flink SQL Tables - Debezium CDC From Database Tables
      CREATE TABLE `postgres_cdc_newjerseybus` (
        `title` STRING,
        `description` STRING,
        `link` STRING,
        `guid` STRING,
        `advisoryAlert` STRING,
        `pubDate` STRING,
        `ts` STRING,
        `companyname` STRING,
        `uuid` STRING,
        `servicename` STRING
      ) WITH (
        'connector' = 'postgres-cdc',
        'database-name' = 'tspann',
        'hostname' = '192.168.1.153',
        'password' = 'tspann',
        'decoding.plugin.name' = 'pgoutput',
        'schema-name' = 'public',
        'table-name' = 'newjerseybus',
        'username' = 'tspann',
        'port' = '5432'
      );
  • 34. Flink SQL Tables - Upsert to Kafka Topics
      CREATE TABLE `upsert_kafka_newjerseybus` (
        `title` STRING,
        `description` STRING,
        `link` STRING,
        `guid` STRING,
        `advisoryAlert` STRING,
        `pubDate` STRING,
        `ts` STRING,
        `companyname` STRING,
        `uuid` STRING,
        `servicename` STRING,
        `eventTimestamp` TIMESTAMP(3),
        WATERMARK FOR `eventTimestamp` AS `eventTimestamp` - INTERVAL '5' SECOND,
        PRIMARY KEY (uuid) NOT ENFORCED
      ) WITH (
        'connector' = 'upsert-kafka',
        'topic' = 'kafka_newjerseybus',
        'properties.bootstrap.servers' = 'kafka:9092',
        'key.format' = 'json',
        'value.format' = 'json'
      );
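What "upsert" means for a topic keyed by `uuid` can be shown with a few lines of Python. This is a conceptual sketch, not connector code: it assumes the usual upsert-kafka reading of a changelog, where the latest value for a key replaces earlier ones and a null value acts as a delete. The function name and the sample records are hypothetical.

```python
def apply_upserts(records):
    """Fold (key, value) records into the latest state per key; None deletes the key."""
    state = {}
    for key, value in records:
        if value is None:
            state.pop(key, None)  # tombstone: remove the row if present
        else:
            state[key] = value  # insert a new row or overwrite the old one
    return state


# Illustrative changelog keyed by uuid.
changelog = [
    ("u1", {"title": "Delay"}),
    ("u2", {"title": "Detour"}),
    ("u1", {"title": "Cleared"}),  # update for u1
    ("u2", None),  # delete for u2
]
print(apply_upserts(changelog))
```

A consumer that reads the topic from the earliest offset and folds it this way ends up with one current row per primary key, which is why the table declares `PRIMARY KEY (uuid)`.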
  • 35. https://medium.com/@tspann/cdc-not-cat-data-capture-e43713879c03
  • 37. PYTHON 2024