Best Practices for Developing With
Data Flow
21-March-2023
IMPORTANT
© 2023 Cloudera, Inc. All rights reserved.
WARNINGS
• Design sessions are stopped after 4 hours of inactivity
• Starting a session takes between 5 and 30 minutes; be patient
• Ending a session takes between 5 and 30 minutes; be patient
• Starting a session requires re-entering the Workload User Password in parameters and applying it
DATA IN MOTION -
Overview
UNIVERSAL DATA DISTRIBUTION
Connect to any data source anywhere, process and deliver to any destination
Stages: Ingest, Process, Distribute
• Sources: data born in the cloud and data born outside the cloud
• Ingest: Active (Connectors, Connect & Pull) and Passive (Gateway Endpoint, Send)
• Process: Route, Filter, Enrich, Transform
• Distribute: Connectors (Deliver) to any destination
CLOUDERA FLOW MANAGEMENT
Ingest and manage data from edge-to-cloud using a no-code interface
ACQUIRE PROCESS DELIVER
• Over 350 pre-built processors
• Easy to build your own processors
• Parse, enrich & apply schema
• Filter, Split, Merge & Route
• Throttle & Backpressure
• Guaranteed delivery
• Full data provenance
• Eco-system integration
Advanced tooling to industrialize flow development
(Flow Development Life Cycle)
Acquire from and deliver to: FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG
Process: HASH, MERGE, EXTRACT, DUPLICATE, SPLIT, ROUTE TEXT, ROUTE CONTENT, ROUTE CONTEXT, CONTROL RATE, DISTRIBUTE LOAD, GEOENRICH, SCAN, REPLACE, TRANSLATE, CONVERT, ENCRYPT, TAIL, EVALUATE, EXECUTE
FAQ
RECOMMENDATIONS
• Kafka Topic Names cannot have blank lines or spaces
• Schema Names cannot have blank lines or spaces
• Make sure everything in parameters is trimmed
• Every action is audited
• Shut down your flow when you are done for the day or not working
• All flows will be shut down after 4 hours of running
RELOGIN FREQUENTLY
SPACES AND EMPTY LINES MATTER
General Tips
RECOMMENDATIONS
For Developers
• Use the latest Chrome
• Disable virus scans
• Disable VPN
• Use a fast network
• Don't run too many other applications at the same time
• For security reasons, sessions time out
• Don't put extra spaces in parameters, names or anywhere else
• Don't use special characters in names
• Keep names unique prefaced with your username_
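The naming tips above can be checked mechanically before a name is typed into a parameter. The following is a minimal sketch and not part of the workshop material: the function name `check_name` and the allowed character set (letters, digits and underscores) are assumptions chosen for illustration, and your environment may permit other characters.

```python
import re

# Assumed rule set (hypothetical): letters, digits and underscores only.
_ALLOWED = re.compile(r"[A-Za-z0-9_]+")


def check_name(name, username):
    """Return a list of problems found in a topic, schema or flow name."""
    # Remove every whitespace character (spaces, tabs, line breaks).
    compact = "".join(name.split())
    if not compact:
        return ["is empty"]
    problems = []
    if compact != name:
        problems.append("contains spaces or line breaks")
    if not _ALLOWED.fullmatch(compact):
        problems.append("contains special characters")
    if not compact.startswith(username + "_"):
        problems.append(f"does not start with '{username}_'")
    return problems


if __name__ == "__main__":
    for candidate in ["tim_traveladvisory", " tim_traveladvisory\n", "travel advisory!"]:
        print(repr(candidate), check_name(candidate, "tim"))
```

An empty list means the name passes all three checks; anything else lists what to fix before the name is used.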
IMPORTANT
For Developers
• Start with a ReadyFlow for things like Kafka as they set up a lot of items for
you.
• If services are missing or not working, stop and restart the test session. This adds the SSL context.
• Publish your flow to the catalog; this is your backup.
USEFUL DOCUMENTATION
For Developers
• https://docs.cloudera.com/dataflow/cloud/readyflow-overview-listenhttp-kafka/topics/cdf-ingest-listenhttp-kafka-prerequisites.html
COMMON ERRORS
For Developers
• SSL Error
• Authorization
• Doesn’t exist
• Timed Out (log in again in another tab: https://login.cdpworkshops.cloudera.com/auth/realms/se-workshop-5/protocol/saml/clients/cdp-sso)
• If you stopped your session, you must reset your sensitive parameters after restarting, including the CDP Workload User Password.
ERROR – 2023-04-19 07:50:53 pm EDT
PublishKafka2RecordCDP[id=cb47d8e4-9b13-33ff-a8f8-d4b27bf3da58] Failed to send FlowFile[filename=508557dc-5eb1-4bcf-ade5-029b373a3a8d] to Kafka: org.apache.kafka.common.errors.TopicAuthorizationException: Not authorized to access topics: [tim_traveladvisory]
Flow Design Tips
Your items may be on other pages.
Make sure your session is active.
If things are slow, behaving strangely or not working, end your session. Once it has stopped, log off and close your browser. Then restart your browser, log back in and restart your session.
Name all your Connections
Click to see your warnings and errors. If you click twice or hold, you can then select and copy your errors.
Check the ERROR in Bulletins.
If you see a permission or missing item issue, make sure your Workload User Name and Workload Password are set correctly.
Make sure your table, topic or schema exists; check for typos.
If nothing else works, please go to the Slack channel and post your id, flow and error. We will check it for you.
Default SSL Context Keystore Password (Sensitive): if you stopped or restarted a Test Session, you may need to re-enter and apply your password.
Publish your flow to back it up.
Download your flow as JSON.
If you see error popups, refresh your browser and make sure you have a fast internet connection and are using Chrome.
If you see a popup with Invalid Revision or Invalid request, refresh your screen. If that doesn't work, restart Chrome or log out and log in again.
The cluster is running in the United States, so you may see latency.
Flow Design search matches from the start of the name; don't add a wildcard.
If you want to read the earliest messages, set the Kafka offset reset to earliest.
Set your workload passwords.
COMMON ERRORS
No blank lines or spaces in topic names
PublishKafka2RecordCDP[id=15c4f0be-11aa-3526-a169-427bbe63dbc1] Failed
to send FlowFile[filename=923449ec-e460-4cba-a656-28473fd56d06] to
Kafka: org.apache.kafka.common.errors.TopicAuthorizationException: Not
authorized to access topics: [ alexvkahan_syslog_critical
]
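The bulletin above reports an authorization failure, but the slide's point is that the topic name itself carries a stray space and line break. A small sketch of how such a bulletin could be inspected is shown below. The helper `suspicious_topics` is hypothetical, and it assumes the topic list is printed in the `[a, b]` form seen in these bulletins.

```python
import re

# Captures the text between the brackets of the "Not authorized" message.
_TOPICS = re.compile(r"Not authorized to access topics:\s*\[(.*?)\]", re.DOTALL)


def suspicious_topics(bulletin):
    """Return topic names from the bulletin that contain any whitespace."""
    match = _TOPICS.search(bulletin)
    if match is None:
        return []
    # Assumption: multiple topics are separated by a comma and one space.
    names = re.split(r", ?", match.group(1))
    return [name for name in names if name != "".join(name.split())]


bulletin = (
    "Failed to send FlowFile[filename=923449ec-e460-4cba-a656-28473fd56d06] to Kafka: "
    "org.apache.kafka.common.errors.TopicAuthorizationException: "
    "Not authorized to access topics: [ alexvkahan_syslog_critical\n]"
)

if __name__ == "__main__":
    for name in suspicious_topics(bulletin):
        # repr() makes the hidden space and line break visible.
        print(repr(name))  # prints ' alexvkahan_syslog_critical\n'
```

Printing with `repr` is the useful part: the leading space and trailing line break that are invisible in the UI show up as `' '` and `\n`.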
RESOURCES
● https://docs.cloudera.com/cdp-public-cloud/cloud/cli/topics/mc-installing-cdp-client.html
● https://www.datainmotion.dev/2023/04/dataflow-processors.html
● https://github.com/tspannhw/FLaNK-DataFlows
● https://github.com/tspannhw/FLaNK-TravelAdvisory
● https://github.com/tspannhw/FLaNK-AllTheStreams
● Bestinflow.slack.com
● https://docs.google.com/forms/d/1Ku2KSDFoxJy45jiOWuLRDi9Trpgm-42aaxeAVwy-fpo/edit
● https://www.cloudera.com/solutions/dim-developer.html
● https://community.cloudera.com/t5/Community-Articles/DataFlow-Designer-Event/ta-p/368947
● Cloudera DataFlow Designer: The Key to Agile Data Pipeline Development
CLOUDERA DataFlow in
CDP
OPERATIONAL CHALLENGES WITH NiFi FLOWS
• Resource contention: impacts performance of all flows in the cluster
• On-demand manual scaling: operational nightmare
• Comprehensive flow visibility: monitoring NiFi flows across multiple clusters can be challenging
• Oversizing clusters: high infrastructure costs
ARCHITECTURAL SHIFT IN DATAFLOW
Decouple NiFi Flow Designer and the NiFi Cluster Runtime to Support Diverse Runtimes
Classic NiFi Architecture
• NiFi Flow Designer and NiFi Cluster Runtime are tightly coupled inside the NiFi cluster
• Flow development happens in the NiFi Flow Designer of that cluster
New NiFi Architecture
• Flow development happens in the NiFi Flow Designer, decoupled from the runtimes
• Develop flows in the designer and deploy them to different runtimes based on use case
Runtimes
• CEM / MiNiFi: agent based deployment
• DataFlow Deployment: containerized NiFi clusters
• Kafka Connect: Kafka centric / microservices
• DataFlow Function: serverless functions
• NiFi Cluster: stateful traditional NiFi
Option 1: "NiFi on VMs"
Option 2: "NiFi on K8s"
DEPLOYMENT OPTIONS FOR DATAFLOW
Flow Designer: Classic NiFi Designer; New NiFi Designer
Version control: NiFi Registry; Flow Catalog; ReadyFlow Gallery
Deployment target: NiFi Cluster (CDP PVC)
• Form factor: CDP PVC BASE
• Runtime: VM based / bare metal
• Workload profile: uniform, flat workload profile
Deployment target: NiFi Cluster (DataHub)
• Form factor: CDP PC - DataHub
• Runtime: VM based
• Workload profile: uniform, flat workload profile
Deployment target: Kafka Connect
• Form factor: CDP PC - DataHub; CDP PVC BASE
• Runtime: VM based / bare metal
• Workload profile: source/sink for Kafka (stateless)
Deployment target: DataFlow Function
• Form factor: DataFlow as a Function
• Runtime: serverless
• Workload profile: not permanent, only used from time to time, event based
Deployment target: DataFlow Deployment
• Form factor: CDP PC - Data Services
• Runtime: container based
• Workload profile: constantly changing workload profile
CONTAINER BASED DATAFLOW (AS OF TODAY PUBLIC CLOUD ONLY)
Flow Deployment
• Allows easy flow deployment based on NiFi 1.20 across CDP environments (Dev, QA, Prod)
• Easy NiFi version upgrades
• Update/Add KPIs, Update Parameters, Change sizing configuration
• Automatic infrastructure scaling based on CPU utilization
Flow Monitoring
• Define and assign KPIs to your flows
• Central monitoring console for all your flows across environments
• Monitor flow metrics and infrastructure usage
• Define alerts for flows breaching assigned KPIs
Flow Catalog
• Keep track of your flow definitions and versions in a central catalog
• Reuse your existing NiFi flows by uploading them to the catalog
• Discover, search and reuse existing flows easily
DEPLOYMENT WITH SDLC SUPPORT
• New Flows Designer: Upload to Catalog
• Classic NiFi Flows (on-prem or on Data Hub): Upload to Catalog
• ReadyFlow Gallery: Instantiation into Catalog
• Flow Catalog: Select a Flow & use the deployment wizard
• Auto-scale Kubernetes clusters on CDP, with full monitoring
FLOW DEVELOPMENT BEST PRACTICES
• Name your processors and connections
• Parameterize connection information
• Tag sensitive properties as "sensitive"
• Define controller services at the process group level (except the Default NiFi SSL Context Service)
FLOW DEVELOPMENT BEST PRACTICES
CDPEnvironment Parameter & Default SSLContextService
CDP Environment Parameter
• Use whenever Hadoop configuration files are needed
• CDF detects parameter usage, obtains the Hadoop configuration files from SDX, makes them available to NiFi pods and replaces the parameter value accordingly
• No copying of config files required anymore
Default NiFi SSL Context Service
• Use whenever an SSL Context Service is required to interact with a CDP service in the target environment
• CDF detects the reference, creates a keystore and truststore for the target environment and configures a Default NiFi SSL Context Service accordingly
• No more manual creation of truststores and moving around of certificates to interact with CDP services
• Default NiFi SSL Context Service must be an external controller service, i.e. defined outside of the process group that's exported
Data Flow Design
for Everyone
• Cloud-native data flow
development
• Developers get their own
sandbox
• Start developing flows without
installing NiFi
• Redesigned visual canvas
• Optimized interaction patterns
• Integration into CDF-PC Catalog
for versioning
Context aware configuration panel
• Configuration side panel automatically represents the selected canvas component (reflects canvas selection)
• Developers can still navigate on the canvas while having configuration easily accessible
• Allows for quick access to configuration while eliminating clicks
Simplified Parameters and Controller Services
• Manage Services and Parameters centrally for your flow draft
• Upload files like JDBC drivers or scripts directly through the Designer UI
• Understand the impact of changing parameters through Referenced Components
• DefaultSSLContextService for secure interaction with CDP services is automatically set up
Interactive development through test sessions
• Start a test session running a specific NiFi version at any point in time
• Test sessions provide a NiFi runtime and allow starting/stopping processors and services
• Explore data in queues to validate processing logic
• Allows you to pin flow file attributes for quick comparison
Data Viewer
• View your data at every step of the flow
• Auto-detects the type and formats accordingly (JSON, Avro, XML, YAML)
• Allows you to pin attributes
• Download the FlowFile content
Copy & Paste
• Copy & Paste components between different drafts
• Paste clipboard content in a text editor to get the JSON representation
LOG ANALYTICS IMPLEMENTATION
• Runs on Primary Node
• 4 Concurrent Tasks
• Downstream Load Balancer
• Syslog RecordReader
• JSON RecordWriter
• SQL Filter
• Leverages Schema Registry
• Guaranteed Single Node Delivery
• CDP Username/Password
• NiFi Default SSL Context
Meaningful Queue Name
SYSLOG RFC 5424
• PRI: the "priority" value, computed as Facility (what kind of message) * 8 + Severity (how urgent the message is)
• VERSION: always "1" for RFC 5424
• TIMESTAMP: must follow ISO 8601 format; the "T" and "Z" characters must be uppercase
• HOSTNAME: using the FQDN (fully qualified domain name) is recommended
• APP-NAME: usually the name of the device or application that provided the message
• PROCID: often used to provide the process name or process ID ("-" means nil, as in the example)
• MSGID: should identify the type of message
• STRUCTURED-DATA: named lists of key-value pairs for easy parsing and searching
• MSG: details about the event
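To make the field layout concrete, here is a minimal sketch that splits an RFC 5424 line into its header fields and decodes PRI into facility and severity. It is an illustration only, not the Syslog RecordReader used in the flow: the function name `parse_header` is hypothetical, STRUCTURED-DATA and MSG are left together in `rest`, and the sample message is adapted from the examples in RFC 5424.

```python
import re

# Header layout: <PRI>VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID ...
_RFC5424 = re.compile(
    r"<(?P<pri>\d{1,3})>(?P<version>\d{1,2}) "
    r"(?P<timestamp>\S+) (?P<hostname>\S+) (?P<app_name>\S+) "
    r"(?P<procid>\S+) (?P<msgid>\S+) (?P<rest>.*)",
    re.DOTALL,
)


def parse_header(line):
    """Split an RFC 5424 message into header fields; nil values ("-") become None."""
    match = _RFC5424.fullmatch(line)
    if match is None:
        raise ValueError("not an RFC 5424 message")
    fields = match.groupdict()
    for key in ("timestamp", "hostname", "app_name", "procid", "msgid"):
        if fields[key] == "-":
            fields[key] = None
    # PRI = facility * 8 + severity, so divmod recovers both parts.
    pri = int(fields.pop("pri"))
    fields["facility"], fields["severity"] = divmod(pri, 8)
    return fields


sample = (
    "<34>1 2003-10-11T22:14:15.003Z mymachine.example.com su - ID47 "
    "- 'su root' failed for lonvick on /dev/pts/8"
)

if __name__ == "__main__":
    parsed = parse_header(sample)
    print(parsed["facility"], parsed["severity"])  # prints: 4 2
```

Going the other way, facility 20 with severity 5 gives a PRI of 20 * 8 + 5 = 165.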
QUEUE CONFIGURATION
• FlowFile Expiration - Data that cannot be processed in
a timely fashion can be automatically removed from
the flow
• Back Pressure Thresholds - Thresholds indicate how
much data should be allowed to exist in the queue
before the component that is the source of the
Connection is no longer scheduled to run. This allows
the system to avoid being overrun with data
• Load Balance Strategy – Strategy to distribute the
data in a flow across the nodes in the cluster. When
enabled, compression can be configured on FlowFile
contents and attributes
• Prioritization – Determines the order in which flow
files are processed
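The back pressure behaviour described above can be pictured with a toy model. The sketch below is not NiFi code: the class `Connection`, the function `schedule_source` and the object-count threshold of 5 are all made up for illustration. It assumes the threshold is checked before each run of the source, so a run that has already started may push the queue past the threshold.

```python
from collections import deque


class Connection:
    """Toy queue between two components with an object-count threshold."""

    def __init__(self, object_threshold):
        self.object_threshold = object_threshold
        self.queue = deque()

    def back_pressure_engaged(self):
        return len(self.queue) >= self.object_threshold


def schedule_source(connection, batches):
    """Run the source once per batch; skip runs while back pressure is engaged.

    Returns the batches that could not be delivered yet.
    """
    waiting = []
    for batch in batches:
        if connection.back_pressure_engaged():
            waiting.append(batch)  # source is not scheduled to run
        else:
            connection.queue.extend(batch)
    return waiting


if __name__ == "__main__":
    conn = Connection(object_threshold=5)
    waiting = schedule_source(conn, [["a", "b", "c"], ["d", "e", "f"], ["g"]])
    print(len(conn.queue), waiting)

    # The downstream component consumes two items, which releases back pressure.
    conn.queue.popleft()
    conn.queue.popleft()
    waiting = schedule_source(conn, waiting)
    print(len(conn.queue), waiting)
```

In this model the second batch is accepted because the queue held only three items when the run started, and the third batch waits until the downstream component has drained the queue below the threshold.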
RECORD-ORIENTED DATA WITH NIFI
• Record Readers - Avro, CSV, Grok, IPFIX, JASN1, JSON,
Parquet, Scripted, Syslog5424, Syslog, WindowsEvent,
XML
• Record Writers - Avro, CSV, FreeFormText, JSON,
Parquet, Scripted, XML
• Record Reader and Writer support referencing a
schema registry for retrieving schemas when
necessary.
• Enable processors that accept any data format
without having to worry about the parsing and
serialization logic.
• Allows us to keep FlowFiles larger, each consisting of
multiple records, which results in far better
performance.
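The idea behind readers and writers can be sketched outside NiFi: the processing step works on records and never touches the wire format. The snippet below is an analogy built on the Python standard library, not NiFi's record API. The names `read_csv`, `write_json`, `convert` and `tag_source` are hypothetical.

```python
import csv
import io
import json


def read_csv(content):
    """Reader: parse CSV text into a list of records (dicts)."""
    return list(csv.DictReader(io.StringIO(content)))


def write_json(records):
    """Writer: serialize a list of records as one JSON array."""
    return json.dumps(records)


def convert(content, reader, writer, transform=lambda record: record):
    """Process every record in one pass; the logic is independent of the format."""
    return writer([transform(record) for record in reader(content)])


def tag_source(record):
    # Example processing step: add a field to each record.
    return {**record, "source": "syslog"}


if __name__ == "__main__":
    content = "host,severity\nweb1,3\nweb2,6\n"
    print(convert(content, read_csv, write_json, transform=tag_source))
```

Swapping `read_csv` for another reader changes the input format without touching `tag_source`, and the whole batch of records stays in a single payload, which mirrors the larger-FlowFile point above.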
RUNNING SQL ON FLOWFILES
• Evaluates one or more SQL queries against the
contents of a FlowFile.
• This can be used, for example, for field-specific
filtering, transformation, and row-level filtering.
• Columns can be renamed, simple calculations and
aggregations performed.
• The SQL statement must be valid ANSI SQL and is
powered by Apache Calcite.
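As a rough illustration of row-level filtering, column renaming and a simple transformation in one statement, the sketch below runs a query over a handful of records. It uses Python's built-in SQLite as a stand-in, so the SQL dialect is SQLite's rather than Calcite's. The table name FLOWFILE follows the convention of NiFi's QueryRecord processor, while the helper `query_records` and the sample records are invented for this example.

```python
import sqlite3


def query_records(records, sql):
    """Load records into an in-memory table named FLOWFILE and run one query."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    # Assumption: all records share the same keys, and keys are plain identifiers.
    columns = list(records[0])
    conn.execute(f"CREATE TABLE FLOWFILE ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO FLOWFILE VALUES ({placeholders})",
        [tuple(record[column] for column in columns) for record in records],
    )
    rows = [dict(row) for row in conn.execute(sql)]
    conn.close()
    return rows


records = [
    {"host": "web1", "severity": 2, "message": "disk failure"},
    {"host": "web2", "severity": 6, "message": "login ok"},
    {"host": "db1", "severity": 3, "message": "replication lag"},
]

if __name__ == "__main__":
    critical = query_records(
        records,
        "SELECT host, UPPER(message) AS alert FROM FLOWFILE "
        "WHERE severity <= 3 ORDER BY host",
    )
    print(critical)
```

Only the two records with severity 3 or lower come back, with `message` upper-cased and renamed to `alert`.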
CLOUDERA DATA FLOW – PUBLIC CLOUD
READYFLOWS - FOR THE MOST COMMON USE CASES
• ReadyFlows Gallery: a list of pre-defined flows called ReadyFlows to accelerate flow authorship and deployment
• Select a ReadyFlow and use the deployment wizard
• Auto-Scale Kubernetes Clusters on CDP
• For new NiFi users
READYFLOW
GALLERY
• Cloudera provided flow
definitions
• Cover most common data flow
use cases
• Optimized to work with CDP
sources/destinations
• Can be deployed and adjusted
as needed
FLOW CATALOG
• Central repository for flow
definitions
• Import existing NiFi flows
• Manage flow definitions
• Initiate flow deployments
TURNS FLOW DEFINITIONS INTO FLOW DEPLOYMENTS
1.) Start Deployment Wizard
2.) NiFi Config
3.) Provide Parameters for NiFi
4.) Configure Sizing & Scaling
5.) Define KPIs
KEY
PERFORMANCE
INDICATORS
• Visibility into flow deployments
• Track high level flow
performance
• Track in-depth NiFi component
metrics
• Defined in Deployment Wizard
• Monitoring & Alerts in
Deployment Details
KPI Definition in Deployment Wizard KPI Monitoring
DASHBOARD
• Central Monitoring View
• Monitors flow deployments
across CDP environments
• Monitors flow deployment
health & performance
• Drill into flow deployment to
monitor system metrics and
deployment events
DEPLOYMENT
MANAGER
• Manage flow deployment
lifecycle
(Suspend/Start/Terminate)
• Add/Edit KPIs
• Change sizing configuration
• Update parameters
• Change NiFi version of the
deployment
• Gateway to NiFi canvas
NIFI VERSION
UPGRADES
• Pick up NiFi hotfixes easily
• Upgrade (or downgrade) the
hotfix version of existing
deployments
• Rolling upgrade (if the
deployment has >1 NiFi nodes)
Deploy a DataFlow
Function
DataFlow Functions provides an
efficient, cost optimized, scalable way to
run NiFi flows in a completely serverless
fashion for event-driven use cases.
Development & Runtime of DataFlow Functions
Step 1. Develop functions on a local workstation or in CDP Public Cloud using the no-code UI designer
Step 2. Run functions on serverless compute services in AWS, Azure & GCP: AWS Lambda, Azure Functions, Google Cloud Functions
DataFlow Functions Use Cases
Trigger Based, Batch, Scheduled and Microservice Use Cases
Serverless Trigger-Based File Processing Pipeline
Develop & run data processing pipelines when files are created or updated in any of the cloud object stores.
Example: When a photo is uploaded to object storage, a data flow is triggered which runs image resizing code and delivers the resized image to different locations.
Serverless Workflows / Orchestration
Chain different low-code functions to build complex workflows.
Example: Automate the handling of support tickets in a call center or orchestrate data movement across different cloud services.
Serverless Scheduled Tasks
Develop and run scheduled tasks without any code at pre-defined timed intervals.
Example: Offload an external database running on-premises into the cloud once a day, every morning at 4:00 a.m.
Serverless Microservices
Build and deploy serverless independent modules that power your application's microservices architecture.
Example: Event-driven functions for easy communication between thousands of decoupled services that power a ride-sharing application.
Serverless Web APIs
Easily build endpoints for your web applications with HTTP APIs without any code, using DFF and any of the cloud providers' function triggers.
Example: Build high-performance, scalable web applications across multiple data centers.
Serverless Customized Triggers
With the DFF State feature, build flows to create customized triggers allowing access to on-premises or external services.
Example: Near real-time offloading of files from a remote SFTP server.
Sample DataFlow Functions Use Case
DataFlow Functions easily enables near real-time file processing in a serverless architecture
Summary
Reduce
Operational Overhead
Lower TCO
Cost optimization
Pay for Usage
True Pay for Value
Faster ROI
Serverless NiFi
New Use Cases
No code UI
Rapid dev & test
CDF-PC: Run serverless NiFi with DataFlow Functions
• True serverless compute: leverages AWS Lambda, Azure Functions or Google Cloud Functions for compute. No servers to manage.
• Guaranteed delivery: acknowledges receipt of an event to the source system only once it has been delivered to the destination.
• Execute on events: supports cloud provider native triggers to launch a function.
CDF-PC Deployments: Resource Isolation & Monitoring
• Central Monitoring: monitor health and performance of all flow deployments across environments or clouds in a single dashboard
• Resource Isolation: turn process groups into separate flow deployments and assign minimum and maximum resources
• Inbound Connections: send data from clients to flow deployments for further distribution and leave the load balancer, DNS and security configuration to CDF-PC
CDF-PC Deployments: Auto-Scaling & Custom NARs
• Configure Auto Scaling with Cost Controls: for each flow, select the container node size and min/max node count for cost control
• Zero Data Loss Guarantee when scaling down: scaling down is supported, which requires complex coordination to ensure existing data sheds to other available nodes
• Support for Custom NARs: run your existing NiFi data flows that rely on custom NARs/components
CDF-PC: Upgrades & Automation
• Powerful CLI for automation: automate the entire flow lifecycle with the CLI, including single command flow deployment
• High velocity NiFi releases: new NiFi releases & hotfixes can be shipped at any time and are immediately available for flow deployments
CLOUDERA DATAFLOW FOR THE PUBLIC CLOUD (CDF-PC)
Cloud Native Data Distribution Powered by Apache NiFi
Solves the First & Last Mile Problem: easily connect to any data born on the edge, on-prem or in the cloud and deliver it to any destination.
Flow Developer Tooling: Catalog / ReadyFlow, Flow Designer, Dashboard
• Productivity Tooling for Developers: the flow designer combined with a catalog of flows provides developers the agility & extensibility to build data movement flows in minutes
Cloud Native Flow Runtime
• DataFlow Deployment (K8S / Containers): high throughput / low latency workloads
• DataFlow Functions (Function as a Service): event driven, micro-bursty workloads
• Multi-Cloud support for deploying flows on auto-scaling K8S NiFi clusters or as serverless functions in any cloud provider's Function as a Service runtime
Resources
● New - GA Announcement Blog Post
● New - Technical Blog: Self-service data pipeline development
● New - DataFlow Designer Product Tour
● New - Kafka to Iceberg Demo Video
● New - Kafka to Snowflake Demo Video
● New - What's New Post
● Deploying Functions
● Updated - Product Page
● Updated - Product Documentation
● Universal Data Distribution Blog series
DAILY ZOOM
https://cloudera.zoom.us/j/96460893376?pwd=eWZEVDhpZmpFSDNRejFzMXkvcHpOdz09
SLACK CHANNEL
https://github.com/tspannhw/FLaNK-DataFlows
SOURCE CODE AND EXAMPLES
https://github.com/tspannhw/FLaNK-DataFlows
Submit Your Flow
https://docs.google.com/forms/d/1Ku2KSDFoxJy45jiOWuLRDi9Trpgm-42aaxeAVwy-fpo
EXAMPLE
https://www.linkedin.com/posts/georgevetticaden_just-finished-up-a-trial-run-of-the-new-data-activity-7058557234556907520-W6O2/
HELP
Tim Spann
@PaasDev // Blog: www.datainmotion.dev
Principal Developer Advocate
Princeton Future of Data Meetup
https://medium.com/@tspann
https://github.com/tspannhw
https://bestinflow.slack.com/
https://docs.cloudera.com/dataflow/cloud/release-notes/topics/cdf-whats-new-latest.html
Contact Me!
Diagram labels: Source; NiFi (Data Collection); Kafka (Central Cache); Flink SQL / Flink Stream SQL / Streaming SQL (Real Time ETL, Real-time Analytic); Real Time Dashboard; Real Time OLAP
THANK YOU

More Related Content

What's hot (20)

PPTX
An Introduction to Confluent Cloud: Apache Kafka as a Service
confluent
 
PPTX
Apache Kafka 0.8 basic training - Verisign
Michael Noll
 
PPTX
Snowflake Overview
Snowflake Computing
 
PPTX
Unified Batch and Real-Time Stream Processing Using Apache Flink
Slim Baltagi
 
PDF
Introduction to Apache Hive
Avkash Chauhan
 
PDF
Apache Kafka Streams + Machine Learning / Deep Learning
Kai Wähner
 
PPTX
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
PDF
Free GitOps Workshop + Intro to Kubernetes & GitOps
Weaveworks
 
PPTX
Pythonsevilla2019 - Introduction to MLFlow
Fernando Ortega Gallego
 
PDF
Apply MLOps at Scale
Databricks
 
PDF
Snowflake free trial_lab_guide
slidedown1
 
PDF
Seamless MLOps with Seldon and MLflow
Databricks
 
PDF
Build Real-Time Applications with Databricks Streaming
Databricks
 
PDF
Introduction to MLflow
Databricks
 
PDF
Apache Kafka Fundamentals for Architects, Admins and Developers
confluent
 
PPTX
Introduction to Apache Flink
mxmxm
 
PDF
Apache Kafka Architecture & Fundamentals Explained
confluent
 
PPTX
Snowflake + Power BI: Cloud Analytics for Everyone
Angel Abundez
 
PDF
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
SANG WON PARK
 
PPTX
Kafka 101
Clement Demonchy
 
An Introduction to Confluent Cloud: Apache Kafka as a Service
confluent
 
Apache Kafka 0.8 basic training - Verisign
Michael Noll
 
Snowflake Overview
Snowflake Computing
 
Unified Batch and Real-Time Stream Processing Using Apache Flink
Slim Baltagi
 
Introduction to Apache Hive
Avkash Chauhan
 
Apache Kafka Streams + Machine Learning / Deep Learning
Kai Wähner
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
Free GitOps Workshop + Intro to Kubernetes & GitOps
Weaveworks
 
Pythonsevilla2019 - Introduction to MLFlow
Fernando Ortega Gallego
 
Apply MLOps at Scale
Databricks
 
Snowflake free trial_lab_guide
slidedown1
 
Seamless MLOps with Seldon and MLflow
Databricks
 
Build Real-Time Applications with Databricks Streaming
Databricks
 
Introduction to MLflow
Databricks
 
Apache Kafka Fundamentals for Architects, Admins and Developers
confluent
 
Introduction to Apache Flink
mxmxm
 
Apache Kafka Architecture & Fundamentals Explained
confluent
 
Snowflake + Power BI: Cloud Analytics for Everyone
Angel Abundez
 
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
SANG WON PARK
 
Kafka 101
Clement Demonchy
 

Similar to Best Practices For Workflow (20)

PDF
Meet the Committers Webinar_ Lab Preparation
Timothy Spann
 
PDF
PartnerSkillUp_Enable a Streaming CDC Solution
Timothy Spann
 
PDF
OSSFinance_UnlockingFinancialDatawithReal-TimePipelines.pdf
Timothy Spann
 
PPTX
Kafka for DBAs
Gwen (Chen) Shapira
 
PDF
Unconference Round Table Notes
Timothy Spann
 
PDF
Meetup - Brasil - Data In Motion - 2023 September 19
ssuser73434e
 
PDF
Meetup - Brasil - Data In Motion - 2023 September 19
Timothy Spann
 
PPTX
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Cloudera, Inc.
 
PDF
con8832-cloudha-2811114.pdf
Neaman Ahmed MBA ITIL OCP Automic
 
PPTX
Installing & Setting Up Apache Airflow (Local & Cloud) - AccentFuture
Shaik Dasthagiri
 
PDF
Boris Stoyanov - Troubleshooting the Virtual Router - Run and Get Diagnostics
ShapeBlue
 
PPTX
Edge to AI: Analytics from Edge to Cloud with Efficient Movement of Machine ...
Timothy Spann
 
PDF
Apache Kafka - Strakin Technologies Pvt Ltd
Strakin Technologies Pvt Ltd
 
PPTX
Decoupling Decisions with Apache Kafka
Grant Henke
 
PDF
Zero to Manageability in 60 Minutes: Building a Solid Foundation for Oracle E...
Courtney Llamas
 
PPTX
Windows azure overview for SharePoint Pros
Usama Wahab Khan Cloud, Data and AI
 
PDF
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
Timothy Spann
 
PPTX
Database as a Service, Collaborate 2016
Kellyn Pot'Vin-Gorman
 
PPTX
Azure News Slides for October2017 - Azure Nights User Group
Michael Frank
 
PPTX
[CON6985]Expanding DBaaS Beyond Data Centers Hybrid Cloud Onboarding via Orac...
Bharat Paliwal
 
Meet the Committers Webinar_ Lab Preparation
Timothy Spann
 
PartnerSkillUp_Enable a Streaming CDC Solution
Timothy Spann
 
OSSFinance_UnlockingFinancialDatawithReal-TimePipelines.pdf
Timothy Spann
 
Kafka for DBAs
Gwen (Chen) Shapira
 
Unconference Round Table Notes
Timothy Spann
 
Meetup - Brasil - Data In Motion - 2023 September 19
ssuser73434e
 
Meetup - Brasil - Data In Motion - 2023 September 19
Timothy Spann
 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Cloudera, Inc.
 
con8832-cloudha-2811114.pdf
Neaman Ahmed MBA ITIL OCP Automic
 
Installing & Setting Up Apache Airflow (Local & Cloud) - AccentFuture
Shaik Dasthagiri
 
Boris Stoyanov - Troubleshooting the Virtual Router - Run and Get Diagnostics
ShapeBlue
 
Edge to AI: Analytics from Edge to Cloud with Efficient Movement of Machine ...
Timothy Spann
 
Apache Kafka - Strakin Technologies Pvt Ltd
Strakin Technologies Pvt Ltd
 
Decoupling Decisions with Apache Kafka
Grant Henke
 
Zero to Manageability in 60 Minutes: Building a Solid Foundation for Oracle E...
Courtney Llamas
 
Windows azure overview for SharePoint Pros
Usama Wahab Khan Cloud, Data and AI
 
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
Timothy Spann
 
Database as a Service, Collaborate 2016
Kellyn Pot'Vin-Gorman
 
Azure News Slides for October2017 - Azure Nights User Group
Michael Frank
 
[CON6985]Expanding DBaaS Beyond Data Centers Hybrid Cloud Onboarding via Orac...
Bharat Paliwal
 
Ad


Best Practices For Workflow

  • 1. Best Practices for Developing With Data Flow 21-March-2023
  • 3. © 2023 Cloudera, Inc. All rights reserved. WARNINGS • We will stop your design sessions after 4 hours of inactivity • Starting a session takes between 5 and 30 minutes, so be patient • Ending a session takes between 5 and 30 minutes, so be patient • Starting a session requires re-entering the Workload User Password in the parameters and applying it.
  • 4. DATA IN MOTION - Overview
  • 5. UNIVERSAL DATA DISTRIBUTION Connect to any data source anywhere, process and deliver to any destination. [Diagram labels: Ingest, Process, Distribute; Active, Passive; Route, Filter, Enrich, Transform; Data born in the cloud; Data born outside the cloud; Any destination; Connectors; Gateway Endpoint; Connect & Pull; Send; Deliver]
  • 6. CLOUDERA FLOW MANAGEMENT Ingest and manage data from edge-to-cloud using a no-code interface ACQUIRE PROCESS DELIVER • Over 350 pre-built processors • Easy to build your own processors • Parse, enrich & apply schema • Filter, Split, Merge & Route • Throttle & Backpressure • Guaranteed delivery • Full data provenance • Eco-system integration Advanced tooling to industrialize flow development (Flow Development Life Cycle) [Diagram labels: FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG (shown twice); HASH, MERGE, EXTRACT, DUPLICATE, SPLIT, ROUTE TEXT, ROUTE CONTENT, ROUTE CONTEXT, CONTROL RATE, DISTRIBUTE LOAD, GEOENRICH, SCAN, REPLACE, TRANSLATE, CONVERT, ENCRYPT, TALL, EVALUATE, EXECUTE]
  • 7. FAQ
  • 8. RECOMMENDATIONS • Kafka Topic Names cannot have blank lines or spaces • Schema Names cannot have blank lines or spaces • Make sure everything in parameters is trimmed • Every action is audited • Shut down your flow when you are done for the day or not working • All flows will be shut down after 4 hours of running
  • 9. RELOGIN FREQUENTLY
  • 10. SPACES AND EMPTY LINES MATTER
  • 13. RECOMMENDATIONS For Developers • Use the latest Chrome • Disable virus scans • Disable VPN • Use a fast network • Don’t run too many other things • For security reasons, things time out • Don’t put extra spaces in parameters, names or anywhere else • Don’t use special characters in names • Keep names unique by prefixing them with your username_
  • 14. IMPORTANT For Developers • Start with a ReadyFlow for things like Kafka, as they set up a lot of items for you. • If you have services not working or missing, stop and restart the test session. It will add the SSL context. • Publish your flow to the catalog; this is your backup.
  • 15. USEFUL DOCUMENTATION For Developers • https://docs.cloudera.com/dataflow/cloud/readyflow-overview-listenhttp-kafka/topics/cdf-ingest-listenhttp-kafka-prerequisites.html
  • 16. COMMON ERRORS For Developers • SSL Error • Authorization • Doesn’t exist • Timed Out (Login again in another tab: https://login.cdpworkshops.cloudera.com/auth/realms/se-workshop-5/protocol/saml/clients/cdp-sso) • If you stopped your session, after you restart you have to reset your sensitive parameters, including the CDP Workload User Password. ERROR – 2023-04-19 07:50:53 pm EDT PublishKafka2RecordCDP[id=cb47d8e4-9b13-33ff-a8f8-d4b27bf3da58] Failed to send FlowFile[filename=508557dc-5eb1-4bcf-ade5-029b373a3a8d] to Kafka: org.apache.kafka.common.errors.TopicAuthorizationException: Not authorized to access topics: [tim_traveladvisory]
  • 18. Your items may be on other pages
  • 19. Make sure your session is active. If things are slow, weird or not working, click and end your session. Once it is stopped, log off, close your browser. Restart your browser, log back in and restart your session.
  • 20. Name all your Connections
  • 21. Click to see your warnings and errors. If you click twice or hold you can then select and copy your errors.
  • 22. Check the ERROR in Bulletins. If you see a permission or missing item issue make sure you have your Workload User Name and Workload Password set correctly. Make sure your table, topic or schema exist - check for typos. If nothing else works, please go to the Slack channel and post your id, flow and error. We will check it for you.
  • 23. Default SSL Context: the Keystore Password is a sensitive property. If you stopped or restarted a Test Session, you may need to re-enter and apply your password.
  • 24. Publish your flow to back it up
  • 26. If you see error popups, refresh your browser and make sure you have a fast internet connection and are using the Chrome browser.
  • 27. If you see a popup with Invalid Revision or Invalid request, refresh your screen. If that doesn’t work, restart Chrome or log out and log in again.
  • 28. The cluster is running in the United States, so it may have latency.
  • 29. Searching for Flow Designs matches the start of the name; don’t add a wildcard.
  • 30. If you want to get the earliest messages, set the Kafka consumer’s offset reset to earliest.
  • 31. Set your workload passwords.
  • 33. No blank lines or spaces in topic names PublishKafka2RecordCDP[id=15c4f0be-11aa-3526-a169-427bbe63dbc1] Failed to send FlowFile[filename=923449ec-e460-4cba-a656-28473fd56d06] to Kafka: org.apache.kafka.common.errors.TopicAuthorizationException: Not authorized to access topics: [ alexvkahan_syslog_critical ]
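The naming advice above (trim everything, no spaces or blank lines, prefix names with your username) can be checked before a value is pasted into a parameter. The following is a minimal sketch, not part of the deck: `clean_name` and `prefixed` are hypothetical helper names, and the character rule is Kafka's general topic-name rule (ASCII letters, digits, `.`, `_`, `-`, up to 249 characters), assumed to be acceptable for schema names as well.

```python
import re

# Kafka topic names: ASCII letters, digits, '.', '_' and '-', at most 249
# characters, and not "." or "..".
_LEGAL_NAME = re.compile(r"[A-Za-z0-9._-]{1,249}")


def clean_name(raw: str) -> str:
    """Trim surrounding spaces and blank lines, then reject anything still illegal."""
    name = raw.strip()
    if name in (".", "..") or not _LEGAL_NAME.fullmatch(name):
        raise ValueError(f"illegal name: {name!r}")
    return name


def prefixed(username: str, base: str) -> str:
    """Hypothetical helper: build a unique name of the form username_base."""
    return clean_name(f"{username.strip()}_{base.strip()}")
```

For example, `clean_name(" alexvkahan_syslog_critical ")` returns the name without the padding shown in the bulletin, while a name with an inner space such as `"my topic"` raises `ValueError`.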
  • 35. RESOURCES ● https://docs.cloudera.com/cdp-public-cloud/cloud/cli/topics/mc-installing-cdp-client.html ● https://www.datainmotion.dev/2023/04/dataflow-processors.html ● https://github.com/tspannhw/FLaNK-DataFlows ● https://github.com/tspannhw/FLaNK-TravelAdvisory ● https://github.com/tspannhw/FLaNK-AllTheStreams ● Bestinflow.slack.com ● https://docs.google.com/forms/d/1Ku2KSDFoxJy45jiOWuLRDi9Trpgm-42aaxeAVwy-fpo/edit ● https://www.cloudera.com/solutions/dim-developer.html ● https://community.cloudera.com/t5/Community-Articles/DataFlow-Designer-Event/ta-p/368947 ● Cloudera DataFlow Designer: The Key to Agile Data Pipeline Development
  • 37. OPERATIONAL CHALLENGES WITH NiFi FLOWS • Resource contention: impacts performance of all flows in the cluster • On-demand manual scaling: operational nightmare • Comprehensive flow visibility: monitoring NiFi flows across multiple clusters can be challenging • Oversizing clusters: high infrastructure costs
  • 38. ARCHITECTURAL SHIFT IN DATAFLOW Decouple NiFi Flow Designer and the NiFi Cluster Runtime to Support Diverse Runtimes Classic NiFi Architecture NiFi Flow Designer + NiFi Cluster Runtime are tightly coupled NiFi Cluster NiFi Cluster Runtime NiFi Flow Designer Flow Development New NiFi Architecture Runtimes Flow Development NiFi Flow Designer CEM / MiNiFi DataFlow Deployment Containerized NiFi Clusters Kafka Connect Kafka centric / Microservices DataFlow Function Serverless functions Develop flows in designer and deploy in different runtimes based on use case Agent Based Deployment NiFi Cluster Stateful traditional NiFi Option 1 “NiFi on VMs” Option 2 “NiFi on K8s”
  • 39. DEPLOYMENT OPTIONS FOR DATAFLOW Flow Designer: Classic NiFi Designer, New NiFi Designer. Version control: NiFi Registry, Flow Catalog. Deployment targets: • NiFi Cluster (CDP PVC): form factor CDP PVC Base; runtime VM based/bare metal; uniform, flat workload profile • NiFi Cluster (DataHub): form factor CDP PC DataHub; runtime VM based; uniform, flat workload profile • Kafka Connect: form factor CDP PC DataHub or CDP PVC Base; runtime VM based/bare metal; source/sink for Kafka (stateless) • DataFlow Function: form factor DataFlow as a Function; runtime serverless; not permanent, only used from time to time, event based • DataFlow Deployment: form factor CDP PC Data Services; runtime container based; constantly changing workload profile. ReadyFlow Gallery
  • 40. CONTAINER BASED DATAFLOW (AS OF TODAY PUBLIC CLOUD ONLY) Flow Deployment: • Allows easy flow deployment based on NiFi 1.20 across CDP environments (Dev, QA, Prod) • Define and assign KPIs to your flows • Easy NiFi version upgrades • Update/Add KPIs, Update Parameters, Change sizing configuration • Automatic infrastructure scaling based on CPU utilization. Flow Monitoring: • Central monitoring console for all your flows across environments • Monitor flow metrics and infrastructure usage • Define alerts for flows breaching assigned KPIs. Flow Catalog: • Keep track of your flow definitions and versions in a central catalog • Reuse your existing NiFi flows by uploading them to the catalog • Discover, search and reuse existing flows easily
  • 41. DEPLOYMENT WITH SDLC SUPPORT [Diagram labels: New Flows Designer; Classic NiFi Flows (on-prem or on Data Hub); ReadyFlow Gallery; Upload to Catalog; Instantiation into Catalog; Flow Catalog; Select a Flow & use the deployment wizard; Auto-scale Kubernetes clusters on CDP; With full monitoring]
  • 42. FLOW DEVELOPMENT BEST PRACTICES (good vs. bad) • Name your processors/connections • Parameterize connection information • Tag sensitive properties as “sensitive” • Define controller services on process group level (except Default NiFi SSL Context Service)
  • 43. FLOW DEVELOPMENT BEST PRACTICES CDPEnvironment Parameter & Default SSLContextService. CDP Environment Parameter: • Use whenever Hadoop configuration files are needed • CDF detects parameter usage, obtains the Hadoop configuration files from SDX, makes them available to NiFi pods and replaces the parameter value accordingly • No copying of config files required anymore. Default NiFi SSL Context Service: • Use whenever an SSL Context Service is required to interact with a CDP service in the target environment • CDF detects the reference, creates a keystore and truststore for the target environment and configures a Default NiFi SSL Context Service accordingly • No more manual creation of truststores and moving around of certificates to interact with CDP services • Default NiFi SSL Context Service must be an external controller service, i.e. defined outside of the process group that’s exported
  • 44. Data Flow Design for Everyone • Cloud-native data flow development • Developers get their own sandbox • Start developing flows without installing NiFi • Redesigned visual canvas • Optimized interaction patterns • Integration into CDF-PC Catalog for versioning
  • 45. Context aware configuration panel • Configuration side panel automatically represents the selected canvas component • Developers can still navigate on the canvas while having configuration easily accessible • Allows for quick access to configuration while eliminating clicks • Reflects canvas selection
  • 46. Simplified Parameters and Controller Services • Manage Services and Parameters centrally for your flow draft • Upload files like JDBC drivers or scripts directly through the Designer UI • Understand the impact of changing parameters through Referenced Components • DefaultSSLContextService for secure interaction with CDP services is automatically set up
  • 47. Interactive development through test sessions • Start a test session running a specific NiFi version at any point in time • Test sessions provide a NiFi runtime and allow starting/stopping processors and services • Explore data in queues to validate processing logic • Allows you to pin flow file attributes for quick comparison
  • 48. Data Viewer • View your data at every step of the flow • Auto-detects the type and formats accordingly (JSON, Avro, XML, YAML) • Allows you to pin attributes • Download the flowfile content
  • 49. Copy & Paste • Copy & Paste components between different drafts • Paste clipboard content in a text editor to get a JSON representation
  • 50. LOG ANALYTICS IMPLEMENTATION • Runs on Primary Node • 4 Concurrent Tasks • Downstream Load Balancer • Syslog RecordReader • JSON RecordWriter • SQL Filter • Leverages Schema Registry • Guaranteed Single Node Delivery • CDP Username/Password • NiFi Default SSL Context [Diagram labels: Meaningful Queue Name on each connection]
  • 51. SYSLOG RFC 5424 • PRI: or "priority", Facility (what kind of message) * 8 + Severity (how urgent is the message) • VERSION: version is always "1" for RFC 5424 • TIMESTAMP: valid timestamp examples (must follow ISO 8601 format with uppercase "T" and "Z") • HOSTNAME: using FQDN (fully qualified domain name) is recommended • APP-NAME: usually the name of the device or application that provided the message • PROCID: often used to provide the process name or process ID (it is "-", the nil value, in the example) • MSGID: should identify the type of message • STRUCTURED-DATA: named lists of key-value pairs for easy parsing and searching • MSG: details about the event
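The PRI arithmetic and the header field order on this slide can be sketched in a few lines. This is an illustrative sketch only, not the parser NiFi's Syslog5424 reader uses: `SyslogHeader`, `encode_pri`, `decode_pri` and `parse_header` are made-up names, and STRUCTURED-DATA and MSG are deliberately left unparsed.

```python
import re
from typing import NamedTuple, Tuple


class SyslogHeader(NamedTuple):
    facility: int
    severity: int
    version: int
    timestamp: str
    hostname: str
    app_name: str
    procid: str
    msgid: str
    rest: str  # STRUCTURED-DATA and MSG, left unparsed in this sketch


# <PRI>VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID [remainder]
_HEADER = re.compile(
    r"<(\d{1,3})>(\d{1,2}) (\S+) (\S+) (\S+) (\S+) (\S+)(?: (.*))?", re.DOTALL
)


def encode_pri(facility: int, severity: int) -> int:
    """PRI = Facility * 8 + Severity."""
    return facility * 8 + severity


def decode_pri(pri: int) -> Tuple[int, int]:
    """Split a PRI value back into (facility, severity)."""
    return divmod(pri, 8)


def parse_header(line: str) -> SyslogHeader:
    """Split an RFC 5424 line into its header fields; '-' is the nil value."""
    match = _HEADER.fullmatch(line)
    if match is None:
        raise ValueError("not an RFC 5424 formatted line")
    facility, severity = decode_pri(int(match.group(1)))
    return SyslogHeader(
        facility, severity, int(match.group(2)), match.group(3), match.group(4),
        match.group(5), match.group(6), match.group(7), match.group(8) or "",
    )
```

With the sample line from RFC 5424, `<34>1 2003-10-11T22:14:15.003Z mymachine.example.com su - ID47 - 'su root' failed`, the PRI of 34 decodes to facility 4 and severity 2, and PROCID is the nil value `-`.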
  • 52. QUEUE CONFIGURATION • FlowFile Expiration - Data that cannot be processed in a timely fashion can be automatically removed from the flow • Back Pressure Thresholds - Thresholds indicate how much data should be allowed to exist in the queue before the component that is the source of the Connection is no longer scheduled to run. This allows the system to avoid being overrun with data • Load Balance Strategy – Strategy to distribute the data in a flow across the nodes in the cluster. When enabled, compression can be configured on FlowFile contents and attributes • Prioritization – Determines the order in which flow files are processed
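The back pressure rule described above can be illustrated with a toy model. This is a simplified sketch, not NiFi code: the `Connection` class and its method names are invented for illustration, and the default limits mirror NiFi's stock connection defaults of 10,000 objects and 1 GB (taken here as 10^9 bytes). The check runs before each enqueue, so a single large item can push the queue past the size threshold.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque


@dataclass
class Connection:
    """Toy queue between two processors with object-count and size thresholds."""
    max_objects: int = 10_000
    max_bytes: int = 1_000_000_000
    _sizes: Deque[int] = field(default_factory=deque, init=False)
    _bytes: int = field(default=0, init=False)

    def back_pressure(self) -> bool:
        """True once either threshold is reached; the source should not be scheduled."""
        return len(self._sizes) >= self.max_objects or self._bytes >= self.max_bytes

    def offer(self, size: int) -> bool:
        """Source side: enqueue a FlowFile of `size` bytes unless back pressure applies."""
        if self.back_pressure():
            return False
        self._sizes.append(size)
        self._bytes += size
        return True

    def poll(self) -> int:
        """Destination side: dequeue the oldest FlowFile, which relieves back pressure."""
        size = self._sizes.popleft()
        self._bytes -= size
        return size
```

A connection created with `Connection(max_objects=3, max_bytes=100)` accepts three small items, refuses the fourth, and accepts again once the downstream side has polled one.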
  • 53. RECORD-ORIENTED DATA WITH NIFI • Record Readers - Avro, CSV, Grok, IPFIX, JASN1, JSON, Parquet, Scripted, Syslog5424, Syslog, WindowsEvent, XML • Record Writers - Avro, CSV, FreeFormText, JSON, Parquet, Scripted, XML • Record Readers and Writers support referencing a schema registry for retrieving schemas when necessary. • Enables processors that accept any data format without having to worry about the parsing and serialization logic. • Allows us to keep FlowFiles larger, each consisting of multiple records, which results in far better performance.
  • 54. RUNNING SQL ON FLOWFILES • Evaluates one or more SQL queries against the contents of a FlowFile. • This can be used, for example, for field-specific filtering, transformation, and row-level filtering. • Columns can be renamed, simple calculations and aggregations performed. • The SQL statement must be valid ANSI SQL and is powered by Apache Calcite.
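To get a feel for the kind of statement this slide describes, the sketch below runs SQL over a small batch of records. It is only an analogy under stated assumptions: it uses Python's built-in `sqlite3` rather than Apache Calcite, the table is named `FLOWFILE` to mirror how NiFi's QueryRecord processor exposes FlowFile content, and the record fields (`hostname`, `severity`, `message`) and the `query_records` helper are made up for illustration.

```python
import sqlite3
from typing import Dict, List

# Hypothetical syslog-like records standing in for the contents of one FlowFile.
RECORDS = [
    {"hostname": "web01", "severity": 2, "message": "disk failure"},
    {"hostname": "web01", "severity": 6, "message": "user login"},
    {"hostname": "db01", "severity": 3, "message": "replication lag"},
]


def query_records(records: List[Dict], sql: str) -> List[Dict]:
    """Load the records into an in-memory table named FLOWFILE and run `sql` on it."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute("CREATE TABLE FLOWFILE (hostname TEXT, severity INTEGER, message TEXT)")
    conn.executemany(
        "INSERT INTO FLOWFILE VALUES (:hostname, :severity, :message)", records
    )
    rows = [dict(row) for row in conn.execute(sql)]
    conn.close()
    return rows


# Row-level filtering plus a renamed column.
FILTER_SQL = (
    "SELECT hostname AS host, message FROM FLOWFILE "
    "WHERE severity <= 3 ORDER BY hostname"
)
# A simple aggregation.
COUNT_SQL = (
    "SELECT hostname, COUNT(*) AS events FROM FLOWFILE "
    "GROUP BY hostname ORDER BY hostname"
)
```

Calling `query_records(RECORDS, FILTER_SQL)` keeps only the two records with severity 3 or lower, and `COUNT_SQL` returns one row per host with its event count.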
  • 55. CLOUDERA DATA FLOW – PUBLIC CLOUD
  • 56. READYFLOWS - FOR THE MOST COMMON USE CASES ReadyFlows Gallery: a list of pre-defined flows called ReadyFlows to accelerate flow authorship and deployment, for new NiFi users. Select a ReadyFlow and use the deployment wizard. Auto-Scale Kubernetes Clusters on CDP.
  • 57. READYFLOW GALLERY • Cloudera provided flow definitions • Cover most common data flow use cases • Optimized to work with CDP sources/destinations • Can be deployed and adjusted as needed
  • 58. FLOW CATALOG • Central repository for flow definitions • Import existing NiFi flows • Manage flow definitions • Initiate flow deployments
  • 59. TURNS FLOW DEFINITIONS INTO FLOW DEPLOYMENTS 1.) Start Deployment Wizard 2.) NiFi Config 3.) Provide Parameters for NiFi 4.) Configure Sizing & Scaling 5.) Define KPIs
  • 60. KEY PERFORMANCE INDICATORS • Visibility into flow deployments • Track high level flow performance • Track in-depth NiFi component metrics • Defined in Deployment Wizard • Monitoring & Alerts in Deployment Details [Screenshots: KPI Definition in Deployment Wizard; KPI Monitoring]
  • 61. DASHBOARD • Central Monitoring View • Monitors flow deployments across CDP environments • Monitors flow deployment health & performance • Drill into flow deployment to monitor system metrics and deployment events
  • 62. DEPLOYMENT MANAGER • Manage flow deployment lifecycle (Suspend/Start/Terminate) • Add/Edit KPIs • Change sizing configuration • Update parameters • Change NiFi version of the deployment • Gateway to NiFi canvas
  • 63. NIFI VERSION UPGRADES • Pick up NiFi hotfixes easily • Upgrade (or downgrade) the hotfix version of existing deployments • Rolling upgrade (if the deployment has more than one NiFi node)
  • 65. DataFlow Functions provides an efficient, cost optimized, scalable way to run NiFi flows in a completely serverless fashion for event-driven use cases.
  • 66. Development & Runtime of DataFlow Functions Step 1. Develop functions on a local workstation or in CDP Public Cloud using the no-code UI designer Step 2. Run functions on serverless compute services in AWS, Azure & GCP: AWS Lambda, Azure Functions, Google Cloud Functions
  • 67. DataFlow Functions Use Cases: Trigger Based, Batch, Scheduled and Microservice Use Cases • Serverless Trigger-Based File Processing Pipeline: Develop & run data processing pipelines when files are created or updated in any of the cloud object stores. Example: When a photo is uploaded to object storage, a data flow is triggered which runs image resizing code and delivers the resized image to different locations. • Serverless Workflows / Orchestration: Chain different low-code functions to build complex workflows. Example: Automate the handling of support tickets in a call center or orchestrate data movement across different cloud services. • Serverless Scheduled Tasks: Develop and run scheduled tasks without any code on pre-defined timed intervals. Example: Offload an external database running on-premises into the cloud once a day every morning at 4:00 a.m. • Serverless Microservices: Build and deploy serverless independent modules that power your application's microservices architecture. Example: Event-driven functions for easy communication between thousands of decoupled services that power a ride-sharing application. • Serverless Web APIs: Easily build endpoints for your web applications with HTTP APIs without any code using DFF and any of the cloud providers' function triggers. Example: Build highly performant, scalable web applications across multiple data centers. • Serverless Customized Triggers: With the DFF State feature, build flows to create customized triggers allowing access to on-premises or external services. Example: Near real-time offloading of files from a remote SFTP server.
  • 68. Sample DataFlow Functions Use Case DataFlow Functions easily enables near real-time file processing in a serverless architecture
  • 69. Summary Reduce Operational Overhead Lower TCO Cost optimization Pay for Usage True Pay for Value Faster ROI Serverless NiFi New Use Cases No code UI Rapid dev & test
  • 70. CDF-PC: Run serverless NiFi with DataFlow Functions • True serverless compute: Leverages AWS Lambda, Azure Functions or Google Cloud Functions for compute. No servers to manage. • Guaranteed delivery: Acknowledges receipt of an event to the source system only once it has been delivered to the destination. • Execute on events: Supports cloud provider native triggers to launch a function.
  • 71. CDF-PC Deployments: Resource Isolation & Monitoring • Central Monitoring: Monitor health and performance of all flow deployments across environments or clouds in a single dashboard • Resource Isolation: Turn process groups into separate flow deployments and assign minimum and maximum resources • Inbound Connections: Send data from clients to flow deployments for further distribution and leave the load balancer, DNS and security configuration to CDF-PC
  • 72. CDF-PC Deployments: Auto-Scaling & Custom NARs • Configure Auto Scaling with Cost Controls: For each flow, select the container node size and min/max node count for cost control • Zero Data Loss Guarantee when scaling down: Supports scaling down, which requires complex coordination to ensure existing data sheds to other available nodes. • Support for Custom NARs: Run your existing NiFi data flows that rely on custom NARs/components
  • 73. CDF-PC: Upgrades & Automation • Powerful CLI for automation: Automate the entire flow lifecycle with the CLI, including single command flow deployment • High velocity NiFi releases: New NiFi releases & hotfixes can be shipped at any time and are immediately available for flow deployments
  • 74. CLOUDERA DATAFLOW FOR THE PUBLIC CLOUD (CDF-PC) Cloud Native Data Distribution Powered by Apache NiFi • Flow Developer Tooling (Flow Designer, new; Catalog / ReadyFlow; Dashboard): Productivity Tooling for Developers. Flow designer combined with a catalog of flows provides developers the agility & extensibility to build data movement flows in minutes • Cloud Native Flow Runtime: Multi-Cloud support for deploying flows on auto-scaling K8S NiFi clusters or as serverless functions in any cloud provider’s Function as a Service runtime • DataFlow Deployment (K8S / Containers): High Throughput / Low Latency Workloads • DataFlow Functions (Function as a Service, new): Event Driven micro-bursty Workloads • Solves the First & Last Mile Problem: Easily connect to any data born on the edge, on-prem or in the cloud and deliver it to any destination.
  • 75. Resources ● New - GA Announcement Blog Post ● New - Technical Blog: Self-service data pipeline development ● New - DataFlow Designer Product Tour ● New - Kafka to Iceberg Demo Video ● New - Kafka to Snowflake Demo Video ● New - What's New Post ● Deploying Functions ● Updated - Product Page ● Updated - Product Documentation ● Universal Data Distribution Blog series
  • 78. DAILY ZOOM https://cloudera.zoom.us/j/96460893376?pwd=eWZEVDhpZmpFSDNRejFzMXkvcHpOdz09
  • 79. SLACK CHANNEL https://github.com/tspannhw/FLaNK-DataFlows
  • 80. SOURCE CODE AND EXAMPLES https://github.com/tspannhw/FLaNK-DataFlows
  • 81. Submit Your Flow https://docs.google.com/forms/d/1Ku2KSDFoxJy45jiOWuLRDi9Trpgm-42aaxeAVwy-fpo
  • 83. EXAMPLE https://www.linkedin.com/posts/georgevetticaden_just-finished-up-a-trial-run-of-the-new-data-activity-7058557234556907520-W6O2/
  • 84. HELP Tim Spann @PaasDev // Blog: www.datainmotion.dev Principal Developer Advocate Princeton Future of Data Meetup https://medium.com/@tspann https://github.com/tspannhw https://bestinflow.slack.com/ https://docs.cloudera.com/dataflow/cloud/release-notes/topics/cdf-whats-new-latest.html Contact Me!
  • 85. Source NiFi Kafka Flink SQL Real Time Dashboard Flink Stream SQL Streaming SQL Source Source Data Collection Central Cache Real Time ETL Real-time Analytic Real Time OLAP