Building Modern Data Streaming
Apps with NiFi, Flink and Kafka
Tim Spann
Principal Developer Advocate
8-June-2023
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and Kafka
FLiPN-FLaNK Stack
Tim Spann
@PaasDev // Blog: www.datainmotion.dev
Principal Developer Advocate.
Princeton Future of Data Meetup.
ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC
https://github.com/tspannhw/EverythingApacheNiFi
https://medium.com/@tspann
Apache NiFi x Apache Kafka x Apache Flink x Java
FLiP Stack Weekly
This week in Apache NiFi, Apache Flink, Apache
Pulsar, Apache Spark, Apache Iceberg, Python,
Java and Open Source friends.
https://bit.ly/32dAJft
© 2023 Cloudera, Inc. All rights reserved.
Future of Data - Princeton + Virtual
@PaasDev
https://www.meetup.com/futureofdata-princeton/
From Big Data to AI to Streaming to Containers to
Cloud to Analytics to Cloud Storage to Fast Data to
Machine Learning to Microservices to ...
STREAMING
What is Real-Time?
Streaming From … To …
Data distribution as a first class citizen
IOT
Devices
LOG DATA
SOURCES
ON-PREM
DATA SOURCES
BIG DATA CLOUD
SERVICES
CLOUD BUSINESS
PROCESS SERVICES *
CLOUD DATA*
ANALYTICS /SERVICE
(Cloudera DW)
App
Logs
Laptops/Servers
Mobile Apps
Security
Agents
CLOUD
WAREHOUSE
UNIVERSAL
DATA DISTRIBUTION
(Ingest, Transform, Deliver)
Ingest
Processors
Ingest
Gateway
Router, Filter &
Transform
Processors
Destination
Processors
BUILDING REAL-TIME REQUIRES A TEAM
End to End Streaming Pipeline Example
Enterprise
sources
Weather
Errors
Aggregates
Alerts
Stocks
ETL
Analytics
Clickstream Market data
Machine logs Social
SQL
CDP: AN OPEN DATA LAKEHOUSE
METADATA AND
DATA CATALOG
OBSERVABILITY REPLICATION
SECURITY &
GOVERNANCE
Private Cloud
DATAFLOW FOR THE PUBLIC CLOUD
Analytics-in-Stream
Data Sources Streaming Storage
Substrate
Cloudera Stream Processing
Kafka + NiFi enables
real-time ingestion into
lakes / analytics services
Data Distribution
Service
Cloudera DataFlow
Warehouses & Operational DB
Data Lakes & Lake Houses
Data-At-Rest Analytics
Data Apps Powered by
Streaming Insights and used
by other Analytics Services
Kafka + Flink
enables streaming
analytics
Cloudera Stream Processing
Streaming
Analytics
Low Latency
Data Products
Data-In-Motion Streaming Analytics
Cloudera Edge Flow
Edge Ingest
Retail Use Case: Ingest Retail Goods Prices
Codeless Data Movement
Streams Messaging
Cluster
DataFlow Cluster
Cloud Object Stores
Custom
Producers/Consumers
Storage
Pricing Pipeline
Device Data
Shelf Price
Logs
Logs
Errors
Aggregates
Other data
SQL
Analytics
MiNiFi
Agent
Cloudera Flow
Management
Cloudera Stream
Processing
Cloudera Streaming
Analytics
Cloudera Edge
Management
APACHE KAFKA
What is Apache Kafka?
Distributed: horizontally scalable
Partitioned: the data is split-up and distributed across the brokers
Replicated: allows for automatic failover
Unique: Kafka does not track the consumption of messages (the consumers do)
Fast: designed from the ground up with a focus on performance and throughput
Kafka was built at LinkedIn and open sourced in 2011
It is now an Apache Software Foundation project
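To make the "Partitioned" point concrete, here is a minimal sketch of how keyed records could be spread across partitions so that the same key always lands on the same partition. The partition count, the function names and the use of crc32 are assumptions for illustration only; crc32 is a stand-in hash, not Kafka's own partitioner.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition count for one topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a record key to a partition; the same key always maps to the same partition."""
    # crc32 is a stand-in hash for this sketch, not Kafka's own partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def distribute(records):
    """Group (key, value) records by partition, preserving per-partition order."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key)].append((key, value))
    return dict(partitions)


if __name__ == "__main__":
    events = [("sensor-1", 20.5), ("sensor-2", 19.0), ("sensor-1", 21.1)]
    for partition, rows in sorted(distribute(events).items()):
        print(partition, rows)
```

Because ordering is only kept within a partition, choosing the key (for example a sensor or card id) decides which events stay in order relative to each other.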
Yes, Franz, It’s Kafka
Let’s do a metamorphosis on your data. Don’t fear changing data.
You don’t need to be a brilliant writer to stream
data.
Franz Kafka was a German-speaking
Bohemian novelist and short-story
writer, widely regarded as one of the
major figures of 20th-century
literature. His work fuses elements
of realism and the fantastic.
Wikipedia
What Can You Do With Apache Kafka?
Web site activity: track page views, searches, etc. in real time
Events & log aggregation: particularly in distributed systems where messages
come from multiple sources
Monitoring and metrics: aggregate statistics from distributed applications and
build a dashboard application
Stream processing: process raw data, clean it up, and forward it on to another
topic or messaging system
Real-time data ingestion: fast processing of a very large volume of messages
Kafka Terms
● Kafka is a publish/subscribe messaging system composed of the following
components:
○ Topic: a message feed
○ Producer: a process that publishes messages to a topic
○ Consumer: a process that subscribes to a topic and processes its
messages
○ Broker: a server in a Kafka cluster
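The terms above can be sketched as a toy in-memory model. This is not a Kafka client; the class and method names are hypothetical and only illustrate the idea that the topic is an append-only feed while each consumer keeps track of its own read position.

```python
class Topic:
    """An append-only message feed; the log itself stores no per-consumer state."""

    def __init__(self, name):
        self.name = name
        self.log = []

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1  # offset of the newly appended message


class Producer:
    """Publishes messages to a topic."""

    def __init__(self, topic):
        self.topic = topic

    def send(self, message):
        return self.topic.append(message)


class Consumer:
    """Subscribes to a topic and remembers its own offset."""

    def __init__(self, topic):
        self.topic = topic
        self.offset = 0

    def poll(self):
        # Return everything published since this consumer's last poll.
        messages = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return messages
```

Two consumers on the same topic read independently: one that starts late still sees every message from offset 0, which is what makes many-to-many publish/subscribe straightforward.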
• Highly reliable distributed
messaging system
• Decouple applications,
enables many-to-many
patterns
• Publish-Subscribe
semantics
• Horizontal scalability
• Efficient implementation
to operate at speed with
big data volumes
• Organized by topic to
support several use cases
Source
System
Source
System
Source
System
Kafka
Fraud
Detection
Security
Systems
Real-Time
Monitoring
Many-To-Many
Publish-Subscribe
EVENTS
APACHE FLINK
Flink SQL
https://www.datainmotion.dev/2021/04/cloudera-sql-stream-builder-ssb-updated.html
● Streaming Analytics
● Continuous SQL
● Continuous ETL
● Complex Event Processing
● Standard SQL Powered by Apache Calcite
Flink SQL
-- specify Kafka partition key on output
SELECT foo AS _eventKey FROM sensors;

-- use event time timestamp from Kafka
-- exactly-once compatible
SELECT eventTimestamp FROM sensors;

-- nested structure access (the nested column must be quoted)
SELECT foo.`bar` FROM `table`;

-- timestamps
SELECT * FROM payments
WHERE eventTimestamp > CURRENT_TIMESTAMP - INTERVAL '10' SECOND;

-- unnest
SELECT b.*, u.*
FROM bgp_avro b,
UNNEST(b.path) AS u(pathitem);

-- aggregations and windows
SELECT card,
  MAX(amount) AS theamount,
  TUMBLE_END(eventTimestamp, INTERVAL '5' MINUTE) AS ts
FROM payments
WHERE lat IS NOT NULL
  AND lon IS NOT NULL
GROUP BY card,
  TUMBLE(eventTimestamp, INTERVAL '5' MINUTE)
HAVING COUNT(*) > 4; -- more than 4 payments per card per window is treated as fraud

-- try to do this in ksql!
SELECT us_west.user_score + ap_south.user_score
FROM kafka_in_zone_us_west us_west
FULL OUTER JOIN kafka_in_zone_ap_south ap_south
ON us_west.user_id = ap_south.user_id;
Key Takeaway: Rich SQL grammar with advanced time and aggregation tools
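To show what the tumbling-window fraud query computes, here is a small batch-style sketch in Python. It assumes payments arrive as dicts with `card`, `amount`, `ts` (epoch seconds), `lat` and `lon`; those field shapes and the function names are assumptions for illustration. A streaming engine would emit each window as event time advances rather than processing a finished list.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # matches INTERVAL '5' MINUTE in the query


def tumble_start(ts: int, size: int = WINDOW_SECONDS) -> int:
    """Start of the fixed, non-overlapping window that contains ts."""
    return ts - (ts % size)


def flag_cards(payments, min_count=4):
    """Return one row per (card, window) having more than min_count payments."""
    groups = defaultdict(list)
    for p in payments:
        # WHERE lat IS NOT NULL AND lon IS NOT NULL
        if p["lat"] is None or p["lon"] is None:
            continue
        # GROUP BY card, TUMBLE(eventTimestamp, 5 minutes)
        groups[(p["card"], tumble_start(p["ts"]))].append(p["amount"])
    return [
        # ts mirrors TUMBLE_END: the exclusive end of the window
        {"card": card, "theamount": max(amounts), "ts": start + WINDOW_SECONDS}
        for (card, start), amounts in sorted(groups.items())
        if len(amounts) > min_count  # HAVING COUNT(*) > 4
    ]
```

Each payment belongs to exactly one window, so a burst that straddles a window boundary is counted in two separate groups; that is a property of tumbling windows in general, not of this sketch.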
DATAFLOW
APACHE NIFI
Apache NiFi
Enable easy ingestion, routing, management and delivery of any data anywhere (Edge, cloud, data
center) to any downstream system with built in end-to-end security and provenance
ACQUIRE PROCESS DELIVER
• Over 300 Prebuilt Processors
• Easy to build your own
• Parse, Enrich & Apply Schema
• Filter, Split, Merge & Route
• Throttle & Backpressure
• Guaranteed Delivery
• Full data provenance from acquisition to
delivery
• Diverse, Non-Traditional Sources
• Eco-system integration
Advanced tooling to industrialize flow development
(Flow Development Life Cycle)
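The "Throttle & Backpressure" bullet can be illustrated with a toy simulation: a bounded queue sits between a fast source and a slow consumer, and the source is paused whenever the queue reaches its threshold. The class name, the tick-based loop and the `drain_every` parameter are assumptions made for this sketch, not NiFi APIs.

```python
from collections import deque


class Connection:
    """A queue between two processing steps with an object-count threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.queue = deque()

    def is_full(self):
        return len(self.queue) >= self.threshold


def run_flow(items, connection, drain_every=3):
    """Feed items through the connection, pausing the source while it is full.

    Returns (delivered, paused_ticks).
    """
    delivered = []
    paused_ticks = 0
    pending = deque(items)
    tick = 0
    while pending or connection.queue:
        tick += 1
        if pending:
            if connection.is_full():
                paused_ticks += 1  # backpressure: the source is not scheduled
            else:
                connection.queue.append(pending.popleft())
        # The downstream step is slower: it takes one item every few ticks.
        if tick % drain_every == 0 and connection.queue:
            delivered.append(connection.queue.popleft())
    return delivered, paused_ticks
```

Nothing is dropped in this model: the slow consumer simply forces the source to wait, which is the trade backpressure makes in exchange for bounded queue growth.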
FTP
SFTP
HL7
UDP
XML
HTTP
EMAIL
HTML
IMAGE
SYSLOG
FTP
SFTP
HL7
UDP
XML
HTTP
EMAIL
HTML
IMAGE
SYSLOG
HASH
MERGE
EXTRACT
DUPLICATE
SPLIT
ROUTE TEXT
ROUTE CONTENT
ROUTE CONTEXT
CONTROL RATE
DISTRIBUTE LOAD
GEOENRICH
SCAN
REPLACE
TRANSLATE
CONVERT
ENCRYPT
TAIL
EVALUATE
EXECUTE
Provenance
Extensibility
● Built from the ground up with extensions in mind
● Service-loader pattern for…
○ Processors
○ Controller Services
○ Reporting Tasks
○ Prioritizers
● Extensions packaged as NiFi Archives (NARs)
○ Deploy to the NiFi lib directory and restart
○ Same model as standard components
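NiFi itself is Java, and its extensions are discovered through Java's service-loader mechanism. As a loose analogy only, the idea of a framework that finds extensions by type without hard-coding them can be sketched in Python with a registry and a decorator. Every name below is hypothetical.

```python
# Extension kinds named on the slide, each with its own lookup table.
REGISTRY = {
    "processor": {},
    "controller_service": {},
    "reporting_task": {},
    "prioritizer": {},
}


def extension(kind):
    """Class decorator that registers an implementation under its extension kind."""
    def register(cls):
        REGISTRY[kind][cls.__name__] = cls
        return cls
    return register


@extension("processor")
class UpperCaseText:
    """A trivial custom processor: upper-cases the content it is given."""

    def process(self, content: str) -> str:
        return content.upper()


def load(kind, name):
    """Instantiate an extension by name, as a framework would after discovery."""
    return REGISTRY[kind][name]()


if __name__ == "__main__":
    print(load("processor", "UpperCaseText").process("hello nifi"))
```

The point of the pattern is that custom components are looked up the same way as built-in ones, which matches the slide's "same model as standard components".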
Parquet
Reader/
Writers
● Native Record
Processors for Apache
Parquet Files!
● CSV <-> Parquet
● XML <-> Parquet
● AVRO <-> Parquet
● JSON <-> Parquet
● More...
ReadyFlow
Gallery
• Cloudera provided
flow definitions
• Cover most common
data flow use cases
• Optimized to work
with CDP
sources/destinations
• Can be deployed and
adjusted as needed
Flow
Catalog
• Central repository for
flow definitions
• Import existing NiFi
flows
• Manage flow
definitions
• Initiate flow
deployments
Apache NiFi with Python Custom Processors
Python as a 1st class citizen
Processing one million events per second with
NiFi
SOURCES AND SINKS
APACHE ICEBERG
A Flexible, Performant & Scalable Table Format
● Donated by Netflix to the Apache Software Foundation in 2018
● Flexibility
○ Hidden partitioning
○ Full schema evolution
● Data Warehouse Operations
○ Atomic Consistent Isolated Durable (ACID) Transactions
○ Time travel and rollback
● Supports best in class SQL performance
○ High performance at Petabyte scale
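"Time travel and rollback" follow from a table format that keeps immutable snapshots. The toy model below is not the Iceberg API; the class and method names are assumptions, and it only shows how reading an older snapshot and rolling back can work when each commit publishes a new snapshot instead of modifying data in place.

```python
class SnapshotTable:
    """Toy table where every commit produces a new immutable snapshot."""

    def __init__(self):
        self.snapshots = [()]  # snapshot 0 is the empty table
        self.current = 0

    def commit(self, rows):
        """Publish a new snapshot holding the current rows plus the new rows."""
        new_snapshot = self.snapshots[self.current] + tuple(rows)
        self.snapshots.append(new_snapshot)
        self.current = len(self.snapshots) - 1
        return self.current

    def scan(self, snapshot_id=None):
        """Read the current snapshot, or an older one for time travel."""
        sid = self.current if snapshot_id is None else snapshot_id
        return list(self.snapshots[sid])

    def rollback(self, snapshot_id):
        """Point the table back at an earlier snapshot; nothing is rewritten."""
        self.current = snapshot_id
```

Readers always see one complete snapshot, so a commit that is still being prepared is never partially visible; that is the intuition behind the ACID bullet as well.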
FREE LEARNING ENVIRONMENT
CSP Community Edition
● Kafka, KConnect, SMM,
SR, Flink, and SSB in
Docker
● Runs in Docker
● Try new features quickly
● Develop applications
locally
● Docker compose file of CSP to run from command line w/o any
dependencies, including Flink, SQL Stream Builder, Kafka, Kafka
Connect, Streams Messaging Manager and Schema Registry
○ $ docker compose up
● Licensed under the Cloudera Community License
● Unsupported
● Community Group Hub for CSP
● Find it on docs.cloudera.com under Applications
https://www.cloudera.com/downloads/cdf/csp-community-edition.html
Open Source Edition
● Apache NiFi in Docker
● Runs in Docker
● Try new features quickly
● Develop applications
locally
● Docker NiFi
○ docker run --name nifi -p 8443:8443 -d \
    -e SINGLE_USER_CREDENTIALS_USERNAME=admin \
    -e SINGLE_USER_CREDENTIALS_PASSWORD=ctsBtRBKHRAx69EqUghvvgEvjnaLjFEB \
    apache/nifi:latest
● Licensed under the Apache License 2.0
● Unsupported
https://hub.docker.com/r/apache/nifi
DEMO AND CODE
Collect: Bring Together
Aggregate all data from sensors, drones, logs, geo-location devices, images
from cameras, and results from running predictions on pre-trained models.
Conduct: Mediate the Data Flow
Mediate point-to-point and bi-directional data flows, distributing and
delivering data reliably to Apache Iceberg, S3, Snowflake, Slack and email.
Curate: Gain Insights
Orchestrate, parse, merge, aggregate, filter, join, transform, fork, query,
sort, dissect, store, enrich with weather, location, sentiment analysis,
image analysis, object detection, image recognition and more with Apache
Tika, Apache OpenNLP and Machine Learning.
https://github.com/tspannhw/FLaNK-TravelAdvisory
RESOURCES AND WRAP-UP
Streaming Resources
● https://dzone.com/articles/real-time-stream-processing-with-hazelcast-and-streamnative
● https://flipstackweekly.com/
● https://www.datainmotion.dev/
● https://www.flankstack.dev/
● https://github.com/tspannhw
● https://medium.com/@tspann
● https://medium.com/@tspann/predictions-for-streaming-in-2023-ad4d7395d714
● https://www.apachecon.com/acna2022/slides/04_Spann_Tim_Citizen_Streaming_Engineer.pdf
Resources
THANK YOU
