© 2024 Cloudera, Inc. All rights reserved.
Building Real-Time Generative AI
Pipelines
Tim Spann
Principal Developer Advocate
April 12, 2024
AI Max
Summit
This week in Apache NiFi, Apache Flink,
Apache Kafka, ML, AI, Apache Spark, Apache
Iceberg, Python, Java and Open Source
friends.
https://bit.ly/32dAJft
https://www.meetup.com/futureofdata-princeton/
FLaNK Stack Weekly by Tim Spann
@PaasDev
https://www.meetup.com/futureofdata-princeton/
From Big Data to AI to Streaming to Containers to
Cloud to Analytics to Cloud Storage to Fast Data to
Machine Learning to Microservices to ...
Future of Data - NYC + NJ + Philly + Virtual
https://linktr.ee/tspannhw
Tim Spann
Twitter: @PaasDev Blog: datainmotion.dev
Principal Developer Advocate
Princeton Future of Data Meetup
ex-Pivotal, ex-Hortonworks,
ex-StreamNative, ex-HPE,
ex-PwC, ex-EY.
https://medium.com/@tspann
https://github.com/tspannhw
Some common Vector DBs Open Community & Open Models
RAPID INNOVATION IN THE LLM SPACE
Too much to cover today, but you should know the common LLMs, frameworks and tools.
Notable LLMs
Closed Models: GPT-3.5, GPT-4, Claude 2
Open Models: Llama 2, Mistral 7B, Mixtral 8x7B, Code Llama
++ 100s more… check out the HuggingFace LLM Leaderboard (pretrained, domain fine-tuned, chat models, …)
Popular LLM Frameworks
When to use one over the other? Use LangChain if you need a general-purpose framework with
flexibility and extensibility. Consider LlamaIndex if you’re building a RAG-only app (retrieval/search).
LangChain is a framework for developing apps powered by LLMs
● Python and JavaScript libraries
● Provides modules for LLM interface, retrieval, & agents
LlamaIndex is a framework designed specifically for RAG apps
● Python and JavaScript libraries
● Provides built-in optimizations / techniques for advanced RAG
HuggingFace is an ML community for hosting &
collaborating on models, datasets, and ML applications
● Latest open source LLMs are in HuggingFace
● + great learning resources / demos
https://huggingface.co/
Open Source vs Self Hosted vs SaaS option
Enterprise Knowledge Base / Chatbot / Q&A
- Customer Support & Troubleshooting
- Enable open ended conversations with user
provided prompts
Code assistant:
- Provide relevant snippets of code as a
response to a request written in natural
language.
- Assist with creating test cases and
synthetic test data.
- Reference other relevant data such as a
company’s documentation to help provide
more accurate responses.
Social and emotional sensing
- Gauge emotions and opinions based on a
piece of text.
- Understand and deliver a more nuanced
message back based on sentiment.
ENTERPRISE WIDE USE CASES FOR AN LLM
Classification and Clustering
- Categorize and sort large volumes of data
into common themes and trends to support
more informed decision making.
Language Translation
- Globalize your content by feeding web
pages through LLMs for translation.
- Combine with chatbots to provide
multilingual support to your customer base.
Document Summarization
- Distill large amounts of text down to the
most relevant points.
Content Generation
- Provide detailed and contextually relevant
prompts to develop outlines, brainstorm
ideas and approaches for content.
Adoption depends on an enterprise’s risk tolerance, restrictions, decision rights and disclosure obligations.
NLP / AI / LLM
Generative AI
Which Model and When?
Use the right model for the right job: closed or open source
Closed Source vs. Open Source
● Usage can easily scale but so can your costs
● Rapidly improving AI models
● Most advanced AI models
● Excel at more specialized tasks
● Great for a wide range of tasks
● Better cost planning
● Compliance, privacy, and security risks
● More control over where & how models are deployed
Adoption of Generative AI is a Journey
Identifying AI challenges in the enterprise
Challenge: Data integration barriers
What’s missing:
● Streamlined access to enterprise data
Challenge: Rigid model infrastructure
What’s missing:
● Modularity
● Flexibility
● AI Ops
Challenge: Lack of security and transparency
What’s missing:
● Model control
● Built-in security
● Visibility & governance
Data = Organization Context
Your data enables contextually accurate responses from LLMs
Without your data: User Query → Large Language Model → Contextually Inaccurate Response
With your data: User Query + Data (Organization Context) → Large Language Model → Contextually Accurate Response
[Diagram: generative AI ecosystem and pipeline]
Ecosystem layers: CLOSED-SOURCE FOUNDATION MODELS; OPEN SOURCE FOUNDATION MODELS; FINE-TUNED MODELS; MODEL HUBS; PRIVATE VECTOR STORE; MANAGED VECTOR STORE; CLOUD INFRASTRUCTURE; SPECIALIZED HARDWARE
Examples shown: APIs: OpenAI (GPT-4 Turbo); Amazon Bedrock: Anthropic (Claude 2), Cohere…; Meta (Llama 2); Hugging Face; Applied Machine Learning Prototypes (AMPs); Milvus, Solr*; Pinecone
Pipeline stages: DATA WRANGLING; REAL-TIME DATA INGEST & ROUTING; AI MODEL TRAINING & INFERENCE / SERVING; DATA STORE & VISUALIZATION
Platform: Open Data Lakehouse
Inputs and applications shown: AI APPLICATIONS; Live Q&A; Travel Advisories; Weather Reports; Documents; Social Media; Databases; Transactions; Public Data Feeds; S3 / Files; Logs; ATM Data; Live Chat; …
ARCHITECTURE
[Diagram labels]
Stages: INTERACT; COLLECT; STORE; ENRICH, REPORT; Distribute; Visualize; Report, Automate; Predict, Automate (AI BASED ENHANCEMENTS)
Components: Data Flow; Messaging Broker; SQL Stream Builder; VECTOR DATABASE; LLM; Machine Learning; Data Warehouse; Data Visualization
Data labels: Input Sentences; Generated Text; Timestamps; Enrichments; Aggregations; Real-time alerting
DATAFLOW / STREAMING
LLM USE CASE
Vector DB
AI Model
Unstructured file types
Data in Motion on Cloudera Data Platform (CDP)
Capture, process & distribute any data, anywhere
Other enterprise data
Open Data Lakehouse
Materialized Views
Structured Sources
Applications/APIs
Streams
https://medium.com/cloudera-inc/getting-ready-for-apache-nifi-2-0-5a5e6a67f450
NiFi 2.0.0 Features
● Python Integration
● Parameters
● JDK 21+
● JSON Flow Serialization
● Rules Engine for Development Assistance
● Run Process Group as Stateless
● flow.json.gz
https://cwiki.apache.org/confluence/display/NIFI/NiFi+2.0+Release+Goals
DataFlow Pipelines Can Help
External Context Ingest
Ingesting, routing, cleaning, enriching, transforming,
parsing, chunking and vectorizing structured,
unstructured, semi-structured and binary data and
documents
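The chunking step can be sketched in a few lines of plain Python. This is a minimal illustration, not how any NiFi processor is implemented: the function name `chunk_text` and the `chunk_size` / `overlap` parameters are assumptions made for this example, and it splits on characters rather than tokens or sentences.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping character windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Each chunk repeats the last `overlap` characters of the previous one,
# so a sentence cut at a boundary still appears whole in one of them.
pieces = chunk_text("abcdefghij", chunk_size=4, overlap=1)
```

The overlap is a design choice: it costs some duplicate storage in the vector database but reduces the chance that the relevant passage is split across two chunks.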
Prompt engineering
Crafting and structuring queries to optimize
LLM responses
Context Retrieval
Enhancing LLM with external context such as
Retrieval Augmented Generation (RAG)
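Context retrieval and prompt construction can be sketched together. The example below is a toy under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Place the retrieved context ahead of the user question."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{joined}\n"
        f"Question: {query}\n"
    )


docs = [
    "NiFi 2.0 adds Python processors.",
    "Kafka stores events in topics.",
    "Iceberg tables support time travel.",
]
question = "Which release adds Python processors?"
top = retrieve(question, docs, k=1)
prompt = build_prompt(question, top)
```

In a real pipeline the prompt string would then be sent to the LLM; that call is left out here because it depends on the model and service in use.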
Roundtrip Interface
Act as a Discord, REST, Kafka, SQL or Slack bot to
round-trip discussions
UNSTRUCTURED DATA WITH NIFI
• Archives - tar, gzipped, zipped, …
• Images - PNG, JPG, GIF, BMP, …
• Documents - HTML, Markdown, RSS, PDF, Doc, RTF, Plain Text, …
• Videos - MP4, Clips, MOV, YouTube URL, …
• Sound - MP3, …
• Social / Chat - Slack, Discord, Twitter, REST, Email, …
• Identify Mime Types, Chunk Documents, Store to Vector Database
• Parse Documents - HTML, Markdown, PDF, Word, Excel, Powerpoint
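Routing by MIME type can be illustrated with the Python standard library. This is a simplified sketch: it guesses the type from the file name only, whereas a content-inspecting detector would be more robust, and the route names and the `route_for` helper are assumptions for this example.

```python
import mimetypes


def route_for(filename: str) -> str:
    """Pick a processing route from the guessed MIME type (hypothetical routes)."""
    mime, _encoding = mimetypes.guess_type(filename)
    if mime is None:
        return "unknown"
    major = mime.split("/")[0]
    if major in ("image", "video", "audio", "text"):
        return major
    if mime == "application/pdf":
        return "document"
    return "other"


# Example: decide where each incoming file should go.
routes = {name: route_for(name) for name in ["photo.png", "report.pdf", "clip.mp4"]}
```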
CLOUD ML/DL/AI/Vector Database Services
• Cloudera ML
• Amazon Polly, Translate, Textract, Transcribe, Bedrock, …
• Hugging Face
• IBM watsonx.ai
• Vector Stores Anywhere: Weaviate, Pinecone, Milvus, Chroma DB, Solr, …
https://medium.com/cloudera-inc/google-gemma-for-real-time-lightweight-open-llm-inference-88efe98e580f
Python Processors
Extract Entities
● Python 3.10+
● NLP, spaCy
● Extract locations
● Extract organizations
● Extract money
● Extract time
● Extract events
● Extract countries
● Extract objects, food, people, quantities
https://github.com/tspannhw/FLaNK-python-processors/blob/main/ExtractEntities.py
Extract Company Names
● Python 3.10+
● Hugging Face, NLP, spaCy, PyTorch
https://github.com/tspannhw/FLaNK-python-ExtractCompanyName-processor
WatsonX SDK To Foundation
● Python 3.10+
● LLM
● WatsonX.AI Foundation Models
● Inference
● Secure
● Official SDK from IBM
https://github.com/tspannhw/FLaNK-python-watsonx-processor
CaptionImage
● Python 3.10+
● Hugging Face
● Salesforce/blip-image-captioning-large
● Generate Captions for Images
● Adds captions to FlowFile Attributes
● Does not require download or copies of your images
https://github.com/tspannhw/FLaNK-python-processors
RESNetImageClassification
● Python 3.10+
● Hugging Face
● Transformers
● PyTorch
● Datasets
● microsoft/resnet-50
● Adds classification label to FlowFile Attributes
● Does not require download or copies of your images
https://github.com/tspannhw/FLaNK-python-processors
NSFWImageDetection
● Python 3.10+
● Hugging Face
● Transformers
● Falconsai/nsfw_image_detection
● Adds normal and nsfw to FlowFile Attributes
● Gives score on safety of image
● Does not require download or copies of your images
https://github.com/tspannhw/FLaNK-python-processors
FacialEmotionsImageDetection
● Python 3.10+
● Hugging Face
● Transformers
● facial_emotions_image_detection
● Image Classification
● Adds labels/scores to FlowFile Attributes
● Does not require download or copies of your images
https://github.com/tspannhw/FLaNK-python-processors
Other Python Processors
● Chunk Document, Parse Document
● Prompt ChatGPT
● Put Chroma, Query Chroma
● Put Pinecone, Query Pinecone
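The processors above share one shape: read the FlowFile, compute something, write the result to FlowFile attributes. A minimal sketch of that shape follows. The `nifiapi` import, the `Java` and `ProcessorDetails` inner classes and `transform` follow the NiFi 2.x Python processor API; the `WordCountAttributes` processor itself, its attribute names, and the fallback stand-in classes (present only so the sketch runs outside NiFi) are assumptions for illustration.

```python
try:
    # Available only inside a NiFi 2.x Python extension environment.
    from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult
except ImportError:
    # Stand-ins (assumption) so this sketch can be run outside NiFi.
    class FlowFileTransform:
        def __init__(self, **kwargs):
            pass

    class FlowFileTransformResult:
        def __init__(self, relationship, attributes=None, contents=None):
            self.relationship = relationship
            self.attributes = attributes
            self.contents = contents


class WordCountAttributes(FlowFileTransform):
    """Hypothetical processor: adds simple text statistics as attributes."""

    class Java:
        implements = ["org.apache.nifi.python.processor.FlowFileTransform"]

    class ProcessorDetails:
        version = "0.0.1-SNAPSHOT"
        description = "Adds word and character counts as FlowFile attributes."
        tags = ["example", "text"]

    def __init__(self, **kwargs):
        pass

    def transform(self, context, flowfile):
        text = flowfile.getContentsAsBytes().decode("utf-8", errors="replace")
        # Attribute values are strings; the FlowFile content is left unchanged.
        attributes = {
            "text.word.count": str(len(text.split())),
            "text.char.count": str(len(text)),
        }
        return FlowFileTransformResult(relationship="success", attributes=attributes)
```

A model-backed processor would replace the word count with a model call and write labels or scores into the same attribute map.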
FLINK SQL
FLINK SQL -> CLOUDERA MACHINE LEARNING MODELS
FLINK SQL -> NIFI -> HUGGING FACE GOOGLE GEMINI
SSB UDF JS/JAVA + GenAI = Real-Time GenAI SQL
https://medium.com/cloudera-inc/adding-generative-ai-results-to-sql-streams-513e1fd2a6af
SELECT CALLLLM(CAST(messagetext AS STRING)) AS generatedtext,
       messagerealname, messageusername, messagetext,
       messageusertz, messageid, threadts, ts
FROM flankslackmessages
WHERE messagetype = 'message'
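What the query does per row can be sketched in Python: filter to chat messages, call the model on the message text, and pass the other columns through. This is an illustration of the semantics only, not the SSB UDF itself (which the slide says is written in JS or Java); `enrich_messages` and the stub `fake_llm` are hypothetical names.

```python
from typing import Callable, Iterable, Iterator

# Columns the query passes through unchanged.
PASSTHROUGH = (
    "messagerealname", "messageusername", "messagetext",
    "messageusertz", "messageid", "threadts", "ts",
)


def enrich_messages(rows: Iterable[dict], call_llm: Callable[[str], str]) -> Iterator[dict]:
    """Mimic: SELECT CALLLLM(messagetext), ... WHERE messagetype = 'message'."""
    for row in rows:
        if row.get("messagetype") != "message":
            continue  # WHERE clause
        out = {"generatedtext": call_llm(str(row["messagetext"]))}
        out.update({col: row.get(col) for col in PASSTHROUGH})
        yield out


def fake_llm(text: str) -> str:
    """Stub standing in for the real model call."""
    return f"[generated reply to: {text}]"


stream = [
    {"messagetype": "message", "messagetext": "What is NiFi?", "messageid": "1"},
    {"messagetype": "join", "messagetext": "", "messageid": "2"},
]
results = list(enrich_messages(stream, fake_llm))
```

Because the function is called once per row, model latency applies to every event in the stream, which is worth keeping in mind when sizing such a job.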
SSB MATERIALIZED VIEWS
Key takeaway: MVs allow data scientists, analysts and developers to consume data from the firehose
Apache Flink SQL
Democratize access to real-time data with just SQL
Infer Tables from Kafka Topics with JSON or Avro
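The idea of inferring a table from a JSON topic can be sketched by mapping one sample message to SQL column types. This is a simplified illustration, not how SSB performs inference: the helper names are hypothetical, nested objects and arrays are flattened to STRING, and the connector options a real table definition would need are omitted.

```python
import json


def sql_type(value) -> str:
    """Map a decoded JSON value to a Flink SQL type name (simplified)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    return "STRING"  # strings, nulls, nested objects and arrays


def infer_create_table(table: str, sample_json: str) -> str:
    """Build a CREATE TABLE statement from one sample JSON record."""
    record = json.loads(sample_json)
    cols = ",\n".join(f"  `{name}` {sql_type(v)}" for name, v in record.items())
    return f"CREATE TABLE `{table}` (\n{cols}\n)"


sample = '{"messagetext": "hi", "ts": 1712923200.5, "messageid": 7, "bot": false}'
ddl = infer_create_table("flankslackmessages", sample)
```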
APACHE KAFKA
Let’s do a metamorphosis on your data. Don’t fear changing data.
You don’t need to be a brilliant writer to stream data.
Franz Kafka was a German-speaking
Bohemian novelist and short-story writer,
widely regarded as one of the major figures of
20th-century literature. His work fuses
elements of realism and the fantastic.
Wikipedia
YES, FRANZ, IT’S KAFKA
Streams Replication Manager (SRM)
• Event replication engine for Kafka
• Supports active-active, multi-cluster, cross-DC replication scenarios
• Leverages Kafka Connect for scalability and HA
• Replicates data and configurations (ACL, partitioning, new topics, etc.)
• Offset translation for simplified failover
• Integrates replication monitoring with SMM
APACHE ICEBERG
Cloudera’s Open Data Lakehouse
❏ Multi-function analytics for Streaming, Data Engineering, Data Warehouse and AI/ML with integrated data services
❏ Common security and governance policies and data lineage with SDX integration
❏ Common dataset with all CDP analytics engines without data duplication and movement
❏ Deployment freedom with Multi-Hybrid Cloud
[Diagram labels: Iceberg Tables; DATA WAREHOUSE; MACHINE LEARNING; DATA ENGINEERING; DATA FLOW; STREAM PROCESSING; Multi-Hybrid Cloud; Metadata | Security | Encryption | Control | Governance]
Compute Engine Interoperability & SDX Integration
● Snapshot isolation ensures consistent data access and processing with various compute engines including Hive, Spark, Impala and NiFi
● Security & Governance support (e.g. FGAC) through Ranger integration
● Data lineage support through Atlas integration
[Diagram labels: Apache Impala; Iceberg Tables; Ranger; Atlas]
FLINK & ICEBERG INTEGRATION
Robust Next Generation Architecture for Data Driven Business
Unified Processing Engine Massive Open table format
Iceberg Support for Flink APIs through SSB
• Maximally open
• Maximally flexible
• Ultra high performance for MASSIVE data
• Can be used as Source and Sink
• Supports batch and streaming modes
• Supports time travel
NIFI & ICEBERG INTEGRATION
• PutIceberg processor in CFM 2.1.6
• PutIcebergCDC
DEMO
I Can Haz
Data?
CSP Community Edition
● Docker compose file of CSP to run from the command line w/o any dependencies, including Flink, SQL Stream Builder, Kafka, Kafka Connect, Streams Messaging Manager and Schema Registry.
○ $ docker compose up
● Licensed under the Cloudera Community License
● Unsupported Commercially (Community Help - Ask Tim)
● Community Group Hub for CSP
● Find it on docs.cloudera.com (see QR Code)
● Kafka, Kafka Connect, SMM, SR, Flink, Flink SQL, MV, Postgresql, SSB
● Develop apps locally
Open Source Edition
• Apache NiFi in Docker
• Try new features quickly
• Develop applications locally
● Docker NiFi
○ docker run --name nifi -p 8443:8443 -d -e SINGLE_USER_CREDENTIALS_USERNAME=admin -e SINGLE_USER_CREDENTIALS_PASSWORD=ctsBtRBKHRAx69EqUghvvgEvjnaLjFEB apache/nifi:latest
● Licensed under the ASF License
● Unsupported
● NiFi 1.25 and NiFi 2.0.0-M2
https://hub.docker.com/r/apache/nifi
https://medium.com/cloudera-inc/streaming-street-cams-to-yolo-v8-with-python-and-nifi-to-minio-s3-3277e73723ce
Street Cameras
CEM, CDF, CSP
● https://medium.com/cloudera-inc/watching-airport-traffic-in-real-time-32c522a6e386
● https://medium.com/cloudera-inc/building-a-real-time-data-pipeline-a-comprehensive-tutorial-on-minifi-nifi-kafka-and-flink-ee03ee6722cb
● https://medium.com/cloudera-inc/finding-the-best-way-around-7491c76ca4cb
● https://medium.com/cloudera-inc/nyc-traffic-are-you-kidding-me-6d3fa853903b
CDF
● https://medium.com/@tspann/building-a-travel-advisory-app-with-apache-nifi-in-k8-969b44c84958
LLM, GenAI, HuggingFace, WatsonX, OLLAMA, Mistral, NiFi, Python, Slack, Pytorch
● https://medium.com/@tspann/using-ollama-with-mistral-and-apache-nifi-720c17f5ff12
● https://medium.com/cloudera-inc/google-gemma-for-real-time-lightweight-open-llm-inference-88efe98e580f
● https://medium.com/@tspann/image-processing-with-custom-python-and-nifi-2-0-06eadc62c03c
● https://medium.com/@tspann/ai-augmented-devrel-part-1-4058af905a89
● https://medium.com/cloudera-inc/mixtral-generative-sparse-mixture-of-experts-in-dataflows-59744f7d28a9
● https://medium.com/@tspann/building-an-llm-bot-for-meetups-and-conference-interactivity-c211ea6e3b61
● https://medium.com/@tspann/yet-another-python-processor-45aaae6fe406
https://github.com/tspannhw/PaK-Stocks
https://github.com/tspannhw/FLaNK-Py-Stocks
https://medium.com/cloudera-inc/let-nifi-worry-about-those-stocks-for-you-57d5f16b5e6b
Real-Time AI  Streaming - AI Max Princeton
CLOUDERA STREAM PROCESSING
Two Major Capabilities: Enterprise Messaging and Powerful Stream Processing
Cloudera Streams Messaging (CSM)
Powered by Apache Kafka
Enterprise-grade messaging products for Apache Kafka. Streams Messaging Manager to monitor/operate clusters, Streams Replication Manager for HA/DR deployments, Schema Registry for centralized schema management, and support for Kafka Connect and Cruise Control
Cloudera Streaming Analytics (CSA)
Powered by Apache Flink
Powered by Apache Flink with SQL Stream Builder, it provides low-latency stream processing capabilities with advanced windowing & state management made simple with SQL
ENTERPRISE MANAGEMENT CAPABILITIES FOR APACHE KAFKA
Extend streams messaging services for Schema Mgmt, Replication & Monitoring
● Schema Registry: Kafka Schema Governance
● Streams Replication Manager: Kafka Replication Service for Disaster Recovery
● Streams Messaging Manager: Management & Monitoring Service for all of your Kafka clusters
ENTERPRISE MANAGEMENT CAPABILITIES FOR APACHE KAFKA
Kafka Data Movement, Operations and Security Made Easier
● Kafka Connect Support: Simple Data Movement, Change Data Capture Connectors, Build Custom Connectors with NiFi
● Ranger Security: Improved ACL and Audit for Kafka, KConnect and Schema Registry
● Cruise Control Support: Intelligent Rebalancing & Self-Healing of your Kafka Clusters
NEXT GENERATION STREAMING ANALYTICS WITH APACHE FLINK
Low latency stateful stream processing
● Flink is a distributed data processing system ideally suited for real-time, event-driven applications.
● Unifies stream and batch processing
● Advanced features - late arriving data, checkpointing, event time processing, exactly-once processing
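Event-time processing and late-arriving data can be illustrated with a small tumbling-window counter. This is a plain-Python sketch of the concept, not Flink's implementation: the watermark here is simply the largest event time seen so far, and `tumbling_counts` and its parameters are hypothetical names.

```python
from collections import defaultdict


def tumbling_counts(events, window_ms, allowed_lateness_ms=0):
    """Count events per (window_start, key) using event time.

    events: iterable of (event_time_ms, key) in arrival order.
    Returns (counts, dropped), where dropped holds events that arrived
    after their window had already closed.
    """
    counts = defaultdict(int)
    dropped = []
    watermark = float("-inf")  # largest event time seen so far
    for event_time, key in events:
        window_start = (event_time // window_ms) * window_ms
        window_end = window_start + window_ms
        if window_end + allowed_lateness_ms <= watermark:
            dropped.append((event_time, key))  # too late for its window
        else:
            counts[(window_start, key)] += 1
        watermark = max(watermark, event_time)
    return dict(counts), dropped


# The last event belongs to the first 10-second window but arrives
# after an event from the second window has advanced the watermark.
arrivals = [(1_000, "a"), (2_000, "a"), (12_000, "a"), (3_000, "a")]
strict_counts, strict_dropped = tumbling_counts(arrivals, window_ms=10_000)
lenient_counts, lenient_dropped = tumbling_counts(
    arrivals, window_ms=10_000, allowed_lateness_ms=5_000
)
```

With no allowed lateness the out-of-order event is dropped; with a 5-second allowance it is still counted in its original window. That is the trade-off between result latency and completeness.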
Real-Time Insights
Event Processing
Low Latency
SQL STREAM BUILDER (SSB)
SQL STREAM BUILDER allows developers, analysts, and data scientists to write streaming applications with industry-standard SQL.
No Java or Scala code development required.
Simplifies access to data in Kafka & Flink. Connectors to batch data in HDFS, Kudu, Hive, S3, JDBC, CDC and more
Enrich streaming data with batch data in a single tool
Democratize access to real-time data with just SQL
LLMs ARE FOUNDATION MODELS
Base models that can be adapted for a wide range of use cases
Terabytes of Data (Multiple Formats) → Train → Foundation Models (Billions of Parameters) → Adapt → Question/Answering, Sentiment Analysis, Doc summarization, … ++ more
➔ Historically, data scientists trained specialized models against narrow datasets to solve specific tasks.
➔ LLMs are Foundation models that can be adapted to perform a variety of tasks.
◆ It is faster to “adapt” a foundation model than it is to train a specialized model from scratch
◆ Decouples “knowledge” from “intelligence”
◆ Opens up AI use cases to software developers (instead of just specialised data scientists)
THANK YOU

More Related Content

PDF
TCFPro24 Building Real-Time Generative AI Pipelines
Timothy Spann
 
PDF
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
Timothy Spann
 
PDF
Generative AI on Enterprise Cloud with NiFi and Milvus
Timothy Spann
 
PDF
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Timothy Spann
 
PDF
28March2024-Codeless-Generative-AI-Pipelines
Timothy Spann
 
PDF
Conf42-Python-Building Apache NiFi 2.0 Python Processors
Timothy Spann
 
PDF
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
Timothy Spann
 
PDF
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
Timothy Spann
 
TCFPro24 Building Real-Time Generative AI Pipelines
Timothy Spann
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
Timothy Spann
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Timothy Spann
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Timothy Spann
 
28March2024-Codeless-Generative-AI-Pipelines
Timothy Spann
 
Conf42-Python-Building Apache NiFi 2.0 Python Processors
Timothy Spann
 
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
Timothy Spann
 
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
Timothy Spann
 

Similar to Real-Time AI Streaming - AI Max Princeton (20)

PDF
Large Language Models, Data & APIs - Integrating Generative AI Power into you...
NETUserGroupBern
 
PPTX
[DSC Europe 24] Tomislav Tipuric - Exploring LLMs across clouds – A Year in t...
DataScienceConferenc1
 
PDF
Blending AI in Enterprise Architecture.pdf
Calvin Hendryx-Parker
 
PPTX
Spark and Deep Learning Frameworks at Scale 7.19.18
Cloudera, Inc.
 
PDF
Dell APEX Cloud Platform for Red Hat OpenShift: An easily deployable and powe...
Principled Technologies
 
PPTX
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Cloudera, Inc.
 
PDF
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
Matthew Sinclair
 
PPTX
03_aiops-1.pptx
FarazulHoda2
 
PDF
Open LLMs: Viable for Production or Low-Quality Toy?
M Waleed Kadous
 
PDF
Data Science and Machine Learning for the Enterprise
Cloudera, Inc.
 
PPTX
PyData: The Next Generation | Data Day Texas 2015
Cloudera, Inc.
 
PDF
KubeCon & CloudNative Con 2024 Artificial Intelligent
Emre Gündoğdu
 
PPTX
Machine Learning and Hadoop: Present and future
Cloudera, Inc.
 
PPTX
Hadoop and Machine Learning
joshwills
 
PDF
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
Pôle Systematic Paris-Region
 
PPTX
Data Science and CDSW
Jason Hubbard
 
PPTX
Hadoop for the Data Scientist: Spark in Cloudera 5.5
Cloudera, Inc.
 
PDF
Oleksii Ivanchenko: Generative AI architecture patterns in production (UA)
Lviv Startup Club
 
PPTX
Moving Beyond Lambda Architectures with Apache Kudu
Cloudera, Inc.
 
PPTX
Part 1: Lambda Architectures: Simplified by Apache Kudu
Cloudera, Inc.
 
Large Language Models, Data & APIs - Integrating Generative AI Power into you...
NETUserGroupBern
 
[DSC Europe 24] Tomislav Tipuric - Exploring LLMs across clouds – A Year in t...
DataScienceConferenc1
 
Blending AI in Enterprise Architecture.pdf
Calvin Hendryx-Parker
 
Spark and Deep Learning Frameworks at Scale 7.19.18
Cloudera, Inc.
 
Dell APEX Cloud Platform for Red Hat OpenShift: An easily deployable and powe...
Principled Technologies
 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Cloudera, Inc.
 
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
Matthew Sinclair
 
03_aiops-1.pptx
FarazulHoda2
 
Open LLMs: Viable for Production or Low-Quality Toy?
M Waleed Kadous
 
Data Science and Machine Learning for the Enterprise
Cloudera, Inc.
 
PyData: The Next Generation | Data Day Texas 2015
Cloudera, Inc.
 
KubeCon & CloudNative Con 2024 Artificial Intelligent
Emre Gündoğdu
 
Machine Learning and Hadoop: Present and future
Cloudera, Inc.
 
Hadoop and Machine Learning
joshwills
 
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
Pôle Systematic Paris-Region
 
Data Science and CDSW
Jason Hubbard
 
Hadoop for the Data Scientist: Spark in Cloudera 5.5
Cloudera, Inc.
 
Oleksii Ivanchenko: Generative AI architecture patterns in production (UA)
Lviv Startup Club
 
Moving Beyond Lambda Architectures with Apache Kudu
Cloudera, Inc.
 
Part 1: Lambda Architectures: Simplified by Apache Kudu
Cloudera, Inc.
 
Ad

More from Timothy Spann (20)

PDF
14May2025_TSPANN_FromAirQualityUnstructuredData.pdf
Timothy Spann
 
PDF
Streaming AI Pipelines with Apache NiFi and Snowflake NYC 2025
Timothy Spann
 
PDF
2025-03-03-Philly-AAAI-GoodData-Build Secure RAG Apps With Open LLM
Timothy Spann
 
PDF
Conf42_IoT_Dec2024_Building IoT Applications With Open Source
Timothy Spann
 
PDF
2024 Dec 05 - PyData Global - Tutorial Its In The Air Tonight
Timothy Spann
 
PDF
2024Nov20-BigDataEU-RealTimeAIWithOpenSource
Timothy Spann
 
PDF
TSPANN-2024-Nov-CloudX-Adding Generative AI to Real-Time Streaming Pipelines
Timothy Spann
 
PDF
2024-Nov-BuildStuff-Adding Generative AI to Real-Time Streaming Pipelines
Timothy Spann
 
PDF
14 November 2024 - Conf 42 - Prompt Engineering - Codeless Generative AI Pipe...
Timothy Spann
 
PDF
2024 Nov 05 - Linux Foundation TAC TALK With Milvus
Timothy Spann
 
PPTX
tspann06-NOV-2024_AI-Alliance_NYC_ intro to Data Prep Kit and Open Source RAG
Timothy Spann
 
PDF
tspann08-Nov-2024_PyDataNYC_Unstructured Data Processing with a Raspberry Pi ...
Timothy Spann
 
PDF
2024-10-28 All Things Open - Advanced Retrieval Augmented Generation (RAG) Te...
Timothy Spann
 
PDF
10-25-2024_BITS_NYC_Unstructured Data and LLM_ What, Why and How
Timothy Spann
 
PDF
2024-OCT-23 NYC Meetup - Unstructured Data Meetup - Unstructured Halloween
Timothy Spann
 
PDF
DBTA Round Table with Zilliz and Airbyte - Unstructured Data Engineering
Timothy Spann
 
PDF
17-October-2024 NYC AI Camp - Step-by-Step RAG 101
Timothy Spann
 
PDF
11-OCT-2024_AI_101_CryptoOracle_UnstructuredData
Timothy Spann
 
PDF
2024-10-04 - Grace Hopper Celebration Open Source Day - Stefan
Timothy Spann
 
PDF
01-Oct-2024_PES-VectorDatabasesAndAI.pdf
Timothy Spann
 
14May2025_TSPANN_FromAirQualityUnstructuredData.pdf
Timothy Spann
 
Streaming AI Pipelines with Apache NiFi and Snowflake NYC 2025
Timothy Spann
 
2025-03-03-Philly-AAAI-GoodData-Build Secure RAG Apps With Open LLM
Timothy Spann
 
Conf42_IoT_Dec2024_Building IoT Applications With Open Source
Timothy Spann
 
2024 Dec 05 - PyData Global - Tutorial Its In The Air Tonight
Timothy Spann
 
2024Nov20-BigDataEU-RealTimeAIWithOpenSource
Timothy Spann
 
TSPANN-2024-Nov-CloudX-Adding Generative AI to Real-Time Streaming Pipelines
Timothy Spann
 
2024-Nov-BuildStuff-Adding Generative AI to Real-Time Streaming Pipelines
Timothy Spann
 
14 November 2024 - Conf 42 - Prompt Engineering - Codeless Generative AI Pipe...
Timothy Spann
 
2024 Nov 05 - Linux Foundation TAC TALK With Milvus
Timothy Spann
 
tspann06-NOV-2024_AI-Alliance_NYC_ intro to Data Prep Kit and Open Source RAG
Timothy Spann
 
tspann08-Nov-2024_PyDataNYC_Unstructured Data Processing with a Raspberry Pi ...
Timothy Spann
 
2024-10-28 All Things Open - Advanced Retrieval Augmented Generation (RAG) Te...
Timothy Spann
 
10-25-2024_BITS_NYC_Unstructured Data and LLM_ What, Why and How
Timothy Spann
 
2024-OCT-23 NYC Meetup - Unstructured Data Meetup - Unstructured Halloween
Timothy Spann
 
DBTA Round Table with Zilliz and Airbyte - Unstructured Data Engineering
Timothy Spann
 
17-October-2024 NYC AI Camp - Step-by-Step RAG 101
Timothy Spann
 
11-OCT-2024_AI_101_CryptoOracle_UnstructuredData
Timothy Spann
 
2024-10-04 - Grace Hopper Celebration Open Source Day - Stefan
Timothy Spann
 
01-Oct-2024_PES-VectorDatabasesAndAI.pdf
Timothy Spann
 
Ad

Recently uploaded (20)

PDF
Practical Measurement Systems Analysis (Gage R&R) for design
Rob Schubert
 
PPTX
Presentation (1) (1).pptx k8hhfftuiiigff
karthikjagath2005
 
PDF
D9110.pdfdsfvsdfvsdfvsdfvfvfsvfsvffsdfvsdfvsd
minhn6673
 
PPTX
Databricks-DE-Associate Certification Questions-june-2024.pptx
pedelli41
 
PPTX
Introduction-to-Python-Programming-Language (1).pptx
dhyeysapariya
 

Real-Time AI Streaming - AI Max Princeton

  • 1. © 2024 Cloudera, Inc. All rights reserved. Building Real-Time Generative AI Pipelines Tim Spann Principal Developer Advocate April 12, 2024 AI Max Summit
  • 3. This week in Apache NiFi, Apache Flink, Apache Kafka, ML, AI, Apache Spark, Apache Iceberg, Python, Java and Open Source friends. https://bit.ly/32dAJft https://www.meetup.com/futureofdata-princeton/ FLaNK Stack Weekly by Tim Spann
  • 4. @PaasDev https://www.meetup.com/futureofdata-princeton/ From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ... Future of Data - NYC + NJ + Philly + Virtual https://linktr.ee/tspannhw
  • 5. Tim Spann Twitter: @PaasDev Blog: datainmotion.dev Principal Developer Advocate Princeton Future of Data Meetup ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-HPE, ex-PwC, ex-EY. https://medium.com/@tspann https://github.com/tspannhw
  • 6. RAPID INNOVATION IN THE LLM SPACE: Too much to cover today, but you should know the common LLMs, frameworks and tools. Notable LLMs: Closed Models: GPT3.5, GPT4, Claude2. Open Models: Llama2, Mistral7B, Mixtral8x7B, Code Llama, ++ 100s more… check out the HuggingFace LLM Leaderboard (pretrained, domain fine-tuned, chat models, …). Popular LLM Frameworks: Langchain is a framework for developing apps powered by LLMs ● Python and JavaScript Libraries ● Provides modules for LLM Interface, Retrieval, & Agents. LlamaIndex is a framework designed specifically for RAG apps ● Python and JavaScript Libraries ● Provides built-in optimizations / techniques for advanced RAG. When to use one over the other? Use Langchain if you need a general-purpose framework with flexibility and extensibility. Consider LlamaIndex if you’re building a RAG-only app (retrieval/search). Open Community & Open Models: HuggingFace is an ML community for hosting & collaborating on models, datasets, and ML applications ● Latest open source LLMs are in HuggingFace ● + great learning resources / demos https://huggingface.co/ Some common Vector DBs: Open Source vs Self Hosted vs SaaS option
  • 7. ENTERPRISE WIDE USE CASES FOR AN LLM. Enterprise Knowledge Base / Chatbot / Q&A: - Customer Support & Troubleshooting - Enable open-ended conversations with user-provided prompts. Code assistant: - Provide relevant snippets of code as a response to a request written in natural language. - Assist with creating test cases and synthetic test data. - Reference other relevant data such as a company’s documentation to help provide more accurate responses. Social and emotional sensing: - Gauge emotions and opinions based on a piece of text. - Understand and deliver a more nuanced message back based on sentiment. Classification and Clustering: - Categorize and sort large volumes of data into common themes and trends to support more informed decision making. Language Translation: - Globalize your content by feeding web pages through LLMs for translation. - Combine with chatbots to provide multilingual support to your customer base. Document Summarization: - Distill large amounts of text down to the most relevant points. Content Generation: - Provide detailed and contextually relevant prompts to develop outlines, brainstorm ideas and approaches for content. Adoption depends on an enterprise’s risk tolerance, restrictions, decision rights and disclosure obligations.
  • 8. NLP / AI / LLM Generative AI
  • 9. Which Model and When? Use the right model for the right job: closed or open-source. Closed Source Usage can easily scale but so can your costs Rapidly improving AI models Most advanced AI models Excel at more specialized tasks Great for a wide range of tasks Open Source Better cost planning Compliance, privacy, and security risks More control over where & how models are deployed
  • 10. Adoption of Generative AI is a Journey: Identifying AI challenges in the enterprise. Challenge: Data integration barriers. What’s missing: ● Streamlined access to enterprise data. Challenge: Rigid model infrastructure. What’s missing: ● Modularity ● Flexibility ● AI Ops. Challenge: Lack of security and transparency. What’s missing: ● Model control ● Built-in security ● Visibility & governance
  • 11. Data = Organization Context: Your data enables contextually accurate responses from LLMs. Without your data: User Query → Large Language Model → Contextually Inaccurate Response. With your data: User Query + Data (Organization Context) → Large Language Model → Contextually Accurate Response
  • 12. CLOSED-SOURCE FOUNDATION MODELS MODEL HUBS OPEN SOURCE FOUNDATION MODELS FINE-TUNED MODELS PRIVATE VECTOR STORE MANAGED VECTOR STORE CLOUD INFRASTRUCTURE Milvus, Solr* Meta (Llama 2) Applied Machine Learning Prototypes (AMPs) Hugging Face Pinecone SPECIALIZED HARDWARE APIs: OpenAI (GPT-4 Turbo) Amazon Bedrock: Anthropic (Claude 2), Cohere… DATA WRANGLING REAL-TIME DATA INGEST & ROUTING AI MODEL TRAINING & INFERENCE DATA STORE & VISUALIZATION Open Data Lakehouse DATA WRANGLING REAL-TIME DATA INGEST & ROUTING AI MODEL TRAINING & SERVING DATA STORE & VISUALIZATION AI APPLICATIONS
  • 13. Live Q&A Travel Advisories Weather Reports Documents Social Media Databases Transactions Public Data Feeds S3 / Files Logs ATM Data Live Chat … ARCHITECTURE INTERACT COLLECT STORE ENRICH, REPORT Distribute Collect Report REPORT Visualize Report, Automate AI BASED ENHANCEMENTS Predict, Automate VECTOR DATABASE LLM Machine Learning Data Visualization Data Flow Data Warehouse SQL Stream Builder Data Visualization Input Sentences Generated Text Timestamp Input Sentence Timestamps Enrichments Messaging Broker Real-time alerting Real-time alerting Aggregations
  • 14. DATAFLOW / STREAMING
  • 15. LLM USE CASE Vector DB AI Model Unstructured file types Data in Motion on Cloudera Data Platform (CDP) Capture, process & distribute any data, anywhere Other enterprise data Open Data Lakehouse Materialized Views Structured Sources Applications/API’s Streams
  • 16. https://medium.com/cloudera-inc/getting-ready-for-apache-nifi-2-0-5a5e6a67f450 NiFi 2.0.0 Features ● Python Integration ● Parameters ● JDK 21+ ● JSON Flow Serialization ● Rules Engine for Development Assistance ● Run Process Group as Stateless ● flow.json.gz https://cwiki.apache.org/confluence/display/NIFI/NiFi+2.0+Release+Goals
  • 17. DataFlow Pipelines Can Help. External Context Ingest: Ingesting, routing, cleaning, enriching, transforming, parsing, chunking and vectorizing structured, unstructured, semi-structured and binary data and documents. Prompt Engineering: Crafting and structuring queries to optimize LLM responses. Context Retrieval: Enhancing the LLM with external context, as in Retrieval Augmented Generation (RAG). Roundtrip Interface: Acting as a Discord, REST, Kafka, SQL or Slack bot to roundtrip discussions.
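
The context-retrieval and prompt-engineering steps on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the word-count `embed` function stands in for a real embedding model, the in-memory list stands in for a vector database, and every function name and sample document here is hypothetical rather than taken from the talk.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (the 'context retrieval' step)."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Place retrieved context ahead of the user question (the 'prompt engineering' step)."""
    context_block = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {query}"


# Hypothetical sample documents standing in for ingested enterprise data.
docs = [
    "NiFi routes and transforms data between systems.",
    "Kafka is a distributed event streaming platform.",
    "Iceberg is an open table format for large analytic datasets.",
]
question = "What is Kafka used for in event streaming?"
top = retrieve(question, docs, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real pipeline the prompt would then be sent to an LLM; that call is left out here because the model and endpoint are deployment-specific.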
  • 18. UNSTRUCTURED DATA WITH NIFI • Archives - tar, gzipped, zipped, … • Images - PNG, JPG, GIF, BMP, … • Documents - HTML, Markdown, RSS, PDF, Doc, RTF, Plain Text, … • Videos - MP4, Clips, Mov, YouTube URL… • Sound - MP3, … • Social / Chat - Slack, Discord, Twitter, REST, Email, … • Identify Mime Types, Chunk Documents, Store to Vector Database • Parse Documents - HTML, Markdown, PDF, Word, Excel, PowerPoint
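
As a rough illustration of the "Identify Mime Types, Chunk Documents" step, the sketch below uses only the Python standard library. It is not the NiFi processor implementation: the function names, the character-based chunking strategy and the default sizes are assumptions chosen for the example.

```python
import mimetypes


def identify_mime_type(filename: str) -> str:
    """Guess a MIME type from the file name; fall back to a generic binary type."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


print(identify_mime_type("report.pdf"))
print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
```

Overlapping chunks are a common choice because a sentence cut at a chunk boundary still appears whole in one of the neighbouring chunks; each chunk would then be embedded and written to the vector database.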
  • 19. CLOUD ML/DL/AI/Vector Database Services • Cloudera ML • Amazon Polly, Translate, Textract, Transcribe, Bedrock, … • Hugging Face • IBM Watson X.AI • Vector Stores Anywhere: Weaviate, Pinecone, Milvus, Chroma DB, SOLR, …
  • 22. Python Processors
  • 23. Extract Entities ● Python 3.10+ ● NLP, spaCy ● Extract locations ● Extract organizations ● Extract money ● Extract time ● Extract events ● Extract countries ● Extract objects, food, people, quantities https://github.com/tspannhw/FLaNK-python-processors/blob/main/ExtractEntities.py
  • 24. Extract Company Names ● Python 3.10+ ● Hugging Face, NLP, spaCy, PyTorch https://github.com/tspannhw/FLaNK-python-ExtractCompanyName-processor
  • 25. WatsonX SDK To Foundation ● Python 3.10+ ● LLM ● WatsonX.AI Foundation Models ● Inference ● Secure ● Official SDK from IBM https://github.com/tspannhw/FLaNK-python-watsonx-processor
  • 26. CaptionImage ● Python 3.10+ ● Hugging Face ● Salesforce/blip-image-captioning-large ● Generate Captions for Images ● Adds captions to FlowFile Attributes ● Does not require download or copies of your images https://github.com/tspannhw/FLaNK-python-processors
  • 27. RESNetImageClassification ● Python 3.10+ ● Hugging Face ● Transformers ● PyTorch ● Datasets ● microsoft/resnet-50 ● Adds classification label to FlowFile Attributes ● Does not require download or copies of your images https://github.com/tspannhw/FLaNK-python-processors
  • 28. NSFWImageDetection ● Python 3.10+ ● Hugging Face ● Transformers ● Falconsai/nsfw_image_detection ● Adds normal and nsfw to FlowFile Attributes ● Gives score on safety of image ● Does not require download or copies of your images https://github.com/tspannhw/FLaNK-python-processors
  • 29. FacialEmotionsImageDetection ● Python 3.10+ ● Hugging Face ● Transformers ● facial_emotions_image_detection ● Image Classification ● Adds labels/scores to FlowFile Attributes ● Does not require download or copies of your images https://github.com/tspannhw/FLaNK-python-processors
  • 30. Other Python Processors ● Chunk Document, Parse Document ● Prompt ChatGPT ● Put Chroma, Query Chroma ● Put Pinecone, Query Pinecone
  • 31. FLINK SQL
  • 32. FLINK SQL -> CLOUDERA MACHINE LEARNING MODELS
  • 33. FLINK SQL -> NIFI -> HUGGING FACE GOOGLE GEMINI
  • 34. SSB UDF JS/JAVA + GenAI = Real-Time GenAI SQL https://medium.com/cloudera-inc/adding-generative-ai-results-to-sql-streams-513e1fd2a6af SELECT CALLLLM(CAST(messagetext AS STRING)) AS generatedtext, messagerealname, messageusername, messagetext, messageusertz, messageid, threadts, ts FROM flankslackmessages WHERE messagetype = 'message'
  • 35. SSB MATERIALIZED VIEWS Key takeaway: MVs allow data scientists, analysts and developers to consume data from the firehose
  • 36. Apache Flink SQL: Democratize access to real-time data with just SQL
  • 37. Infer Tables from Kafka Topics with JSON or Avro
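
To give a feel for what inferring a table from JSON messages involves, here is a deliberately simplified Python sketch. It is not how SQL Stream Builder implements inference: the type mapping, the widening rules, the function names and the sample messages are all assumptions made for illustration, and the generated DDL omits the connector options a real Flink table definition would need.

```python
import json


def sql_type(value) -> str:
    """Map a decoded JSON value to an illustrative SQL column type."""
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    return "STRING"  # strings, nulls and nested values fall back to STRING in this sketch


def infer_columns(sample_messages: list[str]) -> dict[str, str]:
    """Infer column names and types from sample JSON messages, widening on conflict."""
    columns: dict[str, str] = {}
    for raw in sample_messages:
        for name, value in json.loads(raw).items():
            inferred = sql_type(value)
            previous = columns.setdefault(name, inferred)
            if previous != inferred:
                # BIGINT and DOUBLE widen to DOUBLE; any other conflict widens to STRING
                both_numeric = {previous, inferred} == {"BIGINT", "DOUBLE"}
                columns[name] = "DOUBLE" if both_numeric else "STRING"
    return columns


def create_table_ddl(table: str, columns: dict[str, str]) -> str:
    """Render the inferred columns as a bare CREATE TABLE statement."""
    cols = ", ".join(f"`{name}` {col_type}" for name, col_type in columns.items())
    return f"CREATE TABLE `{table}` ({cols})"


# Hypothetical sample messages, as they might be read from a Kafka topic.
samples = [
    '{"messageid": 1, "messagetext": "hello", "score": 1}',
    '{"messageid": 2, "messagetext": "world", "score": 0.5}',
]
schema = infer_columns(samples)
print(create_table_ddl("demo_messages", schema))
```

Avro topics carry their schema with the data, so for Avro the equivalent step is closer to translating an existing schema than to guessing one from samples.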
  • 38. APACHE KAFKA
  • 39. YES, FRANZ, IT’S KAFKA. Let’s do a metamorphosis on your data. Don’t fear changing data. You don’t need to be a brilliant writer to stream data. Franz Kafka was a German-speaking Bohemian novelist and short-story writer, widely regarded as one of the major figures of 20th-century literature. His work fuses elements of realism and the fantastic. (Wikipedia)
  • 40. Streams Replication Manager (SRM) • Event replication engine for Kafka • Supports active-active, multi-cluster, cross-DC replication scenarios • Leverages Kafka Connect for scalability and HA • Replicates data and configurations (ACLs, partitioning, new topics, etc.) • Offset translation for simplified failover • Integrates replication monitoring with SMM
  • 42. APACHE ICEBERG
  • 44. Cloudera’s Open Data Lakehouse ❏ Multi-function analytics for Streaming, Data Engineering, Data Warehouse and AI/ML with integrated data services ❏ Common security and governance policies and data lineage with SDX integration ❏ Common dataset with all CDP analytics engines without data duplication and movement ❏ Deployment freedom with Multi-Hybrid Cloud Iceberg Tables DATA WAREHOUSE MACHINE LEARNING DATA ENGINEERING DATA FLOW STREAM PROCESSING Multi-Hybrid Cloud Metadata | Security | Encryption | Control | Governance
  • 45. Compute Engine Interoperability & SDX Integration ● Snapshot isolation ensures consistent data access and processing with various compute engines including Hive, Spark, Impala and NiFi ● Security & Governance support (e.g. FGAC) through Ranger integration ● Data lineage support through Atlas integration Apache Impala Iceberg Tables Ranger Atlas
  • 46. FLINK & ICEBERG INTEGRATION Robust Next Generation Architecture for Data Driven Business Unified Processing Engine Massive Open table format Iceberg Support for Flink APIs through SSB • Maximally open • Maximally flexible • Ultra high performance for MASSIVE data • Can be used as Source and Sink • Supports batch and streaming modes • Supports time travel
  • 47. NIFI & ICEBERG INTEGRATION • PutIceberg processor in CFM 2.1.6 • PutIcebergCDC
  • 48. DEMO I Can Haz Data?
  • 49. CSP Community Edition ● Docker compose file of CSP to run from the command line without any dependencies, including Flink, SQL Stream Builder, Kafka, Kafka Connect, Streams Messaging Manager and Schema Registry. ○ $> docker compose up ● Licensed under the Cloudera Community License ● Commercially unsupported (Community Help - Ask Tim) ● Community Group Hub for CSP ● Find it on docs.cloudera.com (see QR Code) ● Kafka, Kafka Connect, SMM, SR, Flink, Flink SQL, MV, PostgreSQL, SSB ● Develop apps locally
  • 50. Open Source Edition • Apache NiFi in Docker • Try new features quickly • Develop applications locally ● Docker NiFi ○ docker run --name nifi -p 8443:8443 -d -e SINGLE_USER_CREDENTIALS_USERNAME=admin -e SINGLE_USER_CREDENTIALS_PASSWORD=ctsBtRBKHRAx69EqUghvvgEvjnaLjFEB apache/nifi:latest ● Licensed under the ASF License ● Unsupported ● NiFi 1.25 and NiFi 2.0.0-M2 https://hub.docker.com/r/apache/nifi
  • 51. https://medium.com/cloudera-inc/streaming-street-cams-to-yolo-v8-with-python-and-nifi-to-minio-s3-3277e73723ce Street Cameras
  • 52. CEM, CDF, CSP ● https://medium.com/cloudera-inc/watching-airport-traffic-in-real-time-32c522a6e386 ● https://medium.com/cloudera-inc/building-a-real-time-data-pipeline-a-comprehensive-tutorial-on-minifi-nifi-kafka-and-flink-ee03ee6722cb ● https://medium.com/cloudera-inc/finding-the-best-way-around-7491c76ca4cb ● https://medium.com/cloudera-inc/nyc-traffic-are-you-kidding-me-6d3fa853903b CDF ● https://medium.com/@tspann/building-a-travel-advisory-app-with-apache-nifi-in-k8-969b44c84958 LLM, GenAI, HuggingFace, WatsonX, OLLAMA, Mistral, NiFi, Python, Slack, PyTorch ● https://medium.com/@tspann/using-ollama-with-mistral-and-apache-nifi-720c17f5ff12 ● https://medium.com/cloudera-inc/google-gemma-for-real-time-lightweight-open-llm-inference-88efe98e580f ● https://medium.com/@tspann/image-processing-with-custom-python-and-nifi-2-0-06eadc62c03c ● https://medium.com/@tspann/ai-augmented-devrel-part-1-4058af905a89 ● https://medium.com/cloudera-inc/mixtral-generative-sparse-mixture-of-experts-in-dataflows-59744f7d28a9 ● https://medium.com/@tspann/building-an-llm-bot-for-meetups-and-conference-interactivity-c211ea6e3b61 ● https://medium.com/@tspann/yet-another-python-processor-45aaae6fe406
  • 53. https://github.com/tspannhw/PaK-Stocks https://github.com/tspannhw/FLaNK-Py-Stocks https://medium.com/cloudera-inc/let-nifi-worry-about-those-stocks-for-you-57d5f16b5e6b
  • 55. CLOUDERA STREAM PROCESSING: Two Major Capabilities: Enterprise Messaging and Powerful Stream Processing. Cloudera Streams Messaging (CSM), Powered by Apache Kafka: Enterprise-grade messaging products for Apache Kafka. Streams Messaging Manager to monitor/operate clusters, Streams Replication Manager for HA/DR deployments, Schema Registry for centralized schema management, and support for Kafka Connect and Cruise Control. Cloudera Streaming Analytics (CSA), Powered by Apache Flink: With SQL StreamBuilder, it provides low-latency stream processing capabilities with advanced windowing & state management made simple with SQL.
  • 56. ENTERPRISE MANAGEMENT CAPABILITIES FOR APACHE KAFKA Extend streams messaging services for Schema Mgmt, Replication & Monitoring Schema Registry Kafka Schema Governance Streams Replication Manager Kafka Replication Service for Disaster Recovery Streams Messaging Manager Management & Monitoring Service for all of your Kafka clusters
  • 57. ENTERPRISE MANAGEMENT CAPABILITIES FOR APACHE KAFKA: Kafka Data Movement, Operations and Security Made Easier. Kafka Connect Support: Simple Data Movement, Change Data Capture Connectors, Build Custom Connectors with NiFi. Ranger Security: Improved ACL and Audit for Kafka, KConnect and Schema Registry. Cruise Control Support: Intelligent Rebalancing & Self-Healing of your Kafka Clusters
  • 59. NEXT GENERATION STREAMING ANALYTICS WITH APACHE FLINK: Low latency stateful stream processing ● Flink is a distributed data processing system ideally suited for real-time, event-driven applications. ● Unifies stream and batch processing ● Advanced features: late-arriving data, checkpointing, event-time processing, exactly-once processing. Real-Time Insights, Event Processing, Low Latency
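
Event-time processing and late-arriving data are easier to picture with a small model. The Python sketch below is not Flink code and does not use any Flink API; it assumes a very simple watermark (the highest event time seen so far) and a hypothetical `allowed_lateness` setting to show how a tumbling window can still accept an out-of-order event, or drop one that arrives after the window has closed.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size, allowed_lateness=0):
    """Count events per event-time tumbling window.

    events: iterable of (event_time, key) pairs in arrival order.
    Returns (counts, dropped): counts maps (window_start, key) to a count, and
    dropped lists the events that arrived after their window had closed.
    """
    counts = defaultdict(int)
    dropped = []
    watermark = float("-inf")  # simplification: highest event time seen so far
    for event_time, key in events:
        watermark = max(watermark, event_time)
        window_start = (event_time // window_size) * window_size
        window_end = window_start + window_size
        if window_end + allowed_lateness <= watermark:
            dropped.append((event_time, key))  # too late: the window is already closed
            continue
        counts[(window_start, key)] += 1
    return dict(counts), dropped


# Arrival order differs from event-time order: (3, "a") and (8, "a") arrive late.
arrivals = [(1, "a"), (4, "a"), (12, "a"), (3, "a"), (25, "a"), (8, "a")]
counts, dropped = tumbling_window_counts(arrivals, window_size=10, allowed_lateness=5)
print(counts)
print(dropped)
```

Here the event at time 3 arrives after the event at time 12 but is still counted in the 0 to 10 window, because that window stays open until the watermark reaches 15. The event at time 8 arrives after the watermark has reached 25, so it is dropped.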
  • 60. SQL STREAM BUILDER (SSB): SQL STREAM BUILDER allows developers, analysts, and data scientists to write streaming applications with industry-standard SQL. No Java or Scala code development required. Simplifies access to data in Kafka & Flink. Connectors to batch data in HDFS, Kudu, Hive, S3, JDBC, CDC and more. Enrich streaming data with batch data in a single tool. Democratize access to real-time data with just SQL.
  • 61. LLMs ARE FOUNDATION MODELS: Base models that can be adapted for a wide range of use cases. Terabytes of Data (Multiple Formats) → Train → Foundation Models (Billions of Parameters) → Adapt → Question/Answering, Sentiment Analysis, Doc summarization, … ++ more ➔ Historically, data scientists trained specialized models against narrow datasets to solve specific tasks. ➔ LLMs are Foundation models that can be adapted to perform a variety of tasks. ◆ It is faster to “adapt” a foundation model than it is to train a specialized model from scratch ◆ Decouples “knowledge” from “intelligence” ◆ Opens up AI use cases to software developers (instead of just specialised data scientists)
  • 63. THANK YOU