Building Secure RAG Applications with Open Large Language Models
Tim Spann, Senior Solutions Engineer
Tim Spann
paasdev.bsky.social
@PaasDev // Blog: datainmotion.dev
Senior Solutions Engineer, Snowflake
NY/NJ/Philly - Cloud Data + AI Meetups
ex-Zilliz, ex-Pivotal, ex-Cloudera, ex-HPE,
ex-StreamNative, ex-EY, ex-Hortonworks.
https://medium.com/@tspann
https://github.com/tspannhw
This week in Apache NiFi, Apache Polaris,
Apache Flink, Apache Kafka, ML, AI,
Streamlit, Jupyter, Apache Iceberg, Python,
Java, LLM, GenAI, Snowflake, Unstructured
Data and Open Source friends.
https://bit.ly/32dAJft
AI + Streaming Weekly by Tim Spann
AGENDA
Introduction and Overview
Data
Demo
Resources
Building Secure RAG Apps Requires a Team
For Data
AWS S3
Bucket
Structured, Semi-structured, Unstructured Data
When you think of RAG, you think of unstructured data like documents or giant chunks of text.
Unstructured Data
● Lots of formats
● Text, Documents, PDF
● Images, Videos, Audio
● Email
● Variants
Unstructured
● Open data such as OpenAQ (air quality data)
● Location, Time, Sensors
● Apache Avro, Parquet, ORC
● JSON and XML
● Hierarchical Data
● Logs
● Key-Value
Semi-Structured Data
https://docs.snowflake.com/en/sql-reference/data-types-semistructured
Semi-structured
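Hierarchical, key-value style data like the formats listed above is often easier to query once nested fields are addressable by path. The following is a minimal sketch of that idea in plain Python, not how Snowflake stores semi-structured data. The `flatten` helper and the sample reading (field names and values) are hypothetical and only illustrate the shape of such a record.

```python
import json


def flatten(value, prefix=""):
    """Flatten nested dicts and lists into {path: scalar} pairs."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            flat.update(flatten(child, path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}[{index}]"))
    else:
        # Scalars (str, number, bool, None) end the recursion.
        flat[prefix] = value
    return flat


# Hypothetical air-quality reading; the fields are made up for illustration.
reading = json.loads("""
{
  "location": "station-a",
  "coordinates": {"latitude": 39.95, "longitude": -75.16},
  "sensors": [{"parameter": "pm25", "value": 8.1, "unit": "ug/m3"}]
}
""")

flat_reading = flatten(reading)
for path, value in flat_reading.items():
    print(path, "=", value)
```

Each nested field ends up under a path such as `coordinates.latitude` or `sensors[0].parameter`, which maps naturally onto columns or attributes downstream.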
Structured Data
● Snowflake Tables
● Snowflake Hybrid Tables
● Apache Iceberg Tables
● Relational Tables
● PostgreSQL Tables
● CSV, TSV
Structured
Apache Iceberg™ - Append
● NiFi - PutIcebergTable
● Snowpark - df.write.mode("append").save_as_table("atable_iceberg")
https://quickstarts.snowflake.com/guide/getting_started_iceberg_tables/
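The Snowpark append call on this slide can be wrapped in a small helper. This is a sketch under assumptions: `append_rows` and `demo` are hypothetical names, `connection_parameters` is a placeholder dict you would supply, and the target table is assumed to already exist as an Iceberg table. Only the `df.write.mode("append").save_as_table(...)` chain comes from the slide.

```python
def append_rows(df, table_name):
    """Append the rows of a Snowpark DataFrame to an existing table.

    `df` is expected to be a snowflake.snowpark.DataFrame. The function only
    relies on its `write.mode(...).save_as_table(...)` chain.
    """
    df.write.mode("append").save_as_table(table_name)
    return table_name


def demo(connection_parameters):
    """Hypothetical end-to-end usage; needs snowflake-snowpark-python and a real account."""
    from snowflake.snowpark import Session

    session = Session.builder.configs(connection_parameters).create()
    # Two made-up rows; the column names are placeholders.
    df = session.create_dataframe([[1, "first"], [2, "second"]], schema=["id", "label"])
    append_rows(df, "atable_iceberg")
    session.close()
```

`demo` is not called here because it needs live credentials. The helper itself has no Snowflake import, so it can be unit-tested with a stand-in object.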
I Can
Haz
Data?
Open Large Language Models
Snowflake Arctic Instruct
https://huggingface.co/Snowflake/snowflake-arctic-instruct
Snowflake's Arctic-embed-m-v2.0
https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0
Llama-3.3-70b, mixtral-8x7b, llama3.1-405b,
mistral-7b
Retrieval Augmented Generation (RAG)
Build
Ingest -> Extract -> Split -> Build Indexes
Serve
Orchestration | Observability <-> Retrieval <-> Inference
Open Source Option
Apache NiFi
Build
Ingest, Extract, Split, LLM Calls
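The Build (ingest, extract, split, build indexes) and Serve (retrieval, inference) stages can be sketched end to end in a few lines. This is a toy illustration, not the pipeline from the talk. Word counts stand in for a real embedding model such as Arctic-embed, a Python list stands in for a vector index, and the final prompt would be handed to an LLM that is not called here. All function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def split_text(text, max_words=40):
    """Split: break a document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(documents, max_words=40):
    """Build: chunk every document and store (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for doc in documents for chunk in split_text(doc, max_words)]


def retrieve(index, question, k=2):
    """Serve, retrieval step: return the k chunks most similar to the question."""
    query = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [chunk for chunk, _vector in ranked[:k]]


def build_prompt(question, chunks):
    """Serve, inference step: assemble the prompt an LLM would receive."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


documents = [
    "Apache NiFi moves and routes data between systems with provenance tracking.",
    "Apache Iceberg is an open table format for large analytic tables.",
]
index = build_index(documents)
question = "What does NiFi do with data?"
top_chunks = retrieve(index, question, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

Swapping `embed` for a real embedding model and the list for a vector database changes the quality of retrieval, but the shape of the two stages stays the same.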
Apache NiFi for Data Ingest, Movement and Routing
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow-specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Supports push and pull models
• Hundreds of processors
• Visual command and control
• Hundreds of sources
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
• Version control
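Backpressure and prioritized queuing are easier to reason about with a small model. The sketch below is a conceptual illustration only and is not NiFi's implementation. `BoundedPriorityQueue`, its threshold, and the sample items are hypothetical.

```python
import heapq
import itertools


class BoundedPriorityQueue:
    """A queue that refuses new items once a threshold is reached (backpressure)
    and hands out the lowest priority number first (prioritized queuing)."""

    def __init__(self, backpressure_threshold):
        self.backpressure_threshold = backpressure_threshold
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps equal priorities FIFO

    def offer(self, priority, item):
        """Return False when the queue is full so the producer can slow down."""
        if len(self._heap) >= self.backpressure_threshold:
            return False
        heapq.heappush(self._heap, (priority, next(self._order), item))
        return True

    def poll(self):
        """Return the highest-priority item, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = BoundedPriorityQueue(backpressure_threshold=2)
accepted = [
    queue.offer(5, "bulk log file"),
    queue.offer(1, "alert event"),
    queue.offer(3, "sensor reading"),  # rejected: the queue already holds 2 items
]
first_out = queue.poll()
print(accepted, first_out)
```

The rejected `offer` is the signal an upstream producer would use to pause until the consumer drains the queue.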
The Power of Apache NiFi
• Moving Binary, Unstructured, Image and Tabular Data
• Enrichment
• Universal Visual Processor
• Simple Event Processor
• Routing
• Feeding data to Central Messaging
• Support for modern protocols
• Kafka Protocol Source/Sink
• Pulsar Protocol Source/Sink
APACHE NIFI 2.0 FEATURES
Major Updates:
● Python Integration
● Parameterization
● JDK 21+
● Provenance / Data Lineage
● Rules Engine for Development Assistance
● Additional Azure Processors
● Integration with Zendesk, Slack,
● Database Tables as Schemas
● Amazon Glue Schema Registry
● OpenTelemetry Support
Real-Time Integration and AI
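The Python integration in NiFi 2.0 lets a processor be written as a Python class. Below is a minimal sketch of a FlowFileTransform processor that adds a word-count attribute. `CountWords` and the `word.count` attribute are hypothetical names. The `try/except` fallback classes are stand-ins added only so the sketch can be exercised outside NiFi, where the `nifiapi` module is not available.

```python
try:
    from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult
except ImportError:
    # Outside NiFi: minimal stand-ins, not part of the real API surface.
    class FlowFileTransform:
        pass

    class FlowFileTransformResult:
        def __init__(self, relationship, contents=None, attributes=None):
            self.relationship = relationship
            self.contents = contents
            self.attributes = attributes


class CountWords(FlowFileTransform):  # hypothetical processor name
    class Java:
        implements = ["org.apache.nifi.python.processor.FlowFileTransform"]

    class ProcessorDetails:
        version = "0.0.1-SNAPSHOT"
        description = "Adds a word.count attribute to each FlowFile."

    def __init__(self, **kwargs):
        pass

    def transform(self, context, flowfile):
        # Read the FlowFile content and decode it as text.
        text = flowfile.getContentsAsBytes().decode("utf-8", errors="replace")
        # FlowFile attribute values are strings.
        attributes = {"word.count": str(len(text.split()))}
        # Content is left unchanged; only attributes are added.
        return FlowFileTransformResult(relationship="success", attributes=attributes)
```

Inside NiFi the file would be placed in the configured Python extensions directory. The fallback branch is never taken there because `nifiapi` is provided by the runtime.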
Architecture
https://nifi.apache.org/docs/nifi-docs/html/overview.html
PROVENANCE
UNSTRUCTURED DATA WITH NIFI
• Archives - tar, gzipped, zipped, …
• Images - PNG, JPG, GIF, BMP, …
• Documents - HTML, Markdown, RSS, PDF, Doc, RTF, Plain Text, …
• Videos - MP4, Clips, MOV, YouTube URL, …
• Sound - MP3, …
• Social / Chat - Slack, Discord, Twitter, REST, Email, …
• Identify MIME Types, Chunk Documents, Store to Vector Database
• Parse Documents - HTML, Markdown, PDF, Word, Excel, PowerPoint
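Identifying the MIME type is usually the first routing decision for unstructured files. The sketch below uses Python's standard `mimetypes` module, which guesses from the file name only. It is an illustration of the routing idea and not how NiFi detects types, which can inspect content. The `ROUTES` table and `route_by_mime_type` are hypothetical names.

```python
import mimetypes

# Hypothetical mapping from MIME family to a downstream route.
ROUTES = {"text": "documents", "image": "images", "video": "videos", "audio": "sound"}


def route_by_mime_type(filename):
    """Return (route, mime_type) for a file name, based on its extension."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None:
        return "unknown", None
    if mime_type == "application/pdf":
        # PDFs are documents even though their family is "application".
        return "documents", mime_type
    family = mime_type.split("/", 1)[0]
    return ROUTES.get(family, "other"), mime_type


for name in ["report.pdf", "photo.png", "song.mp3", "README"]:
    print(name, "->", route_by_mime_type(name))
```

Files routed to "documents" would then go on to parsing and chunking, while images and audio would take model-specific paths.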
RECORD-ORIENTED DATA WITH NIFI
• Record Readers - Avro, CSV, Grok, IPFIX, JASN1, JSON, Parquet, Scripted, Syslog5424, Syslog, WindowsEvent, XML
• Record Writers - Avro, CSV, FreeFormText, JSON, Parquet, Scripted, XML
• Record Readers and Writers can reference a schema registry to retrieve schemas when necessary.
• Enables processors to accept any data format without having to worry about the parsing and serialization logic.
• Allows FlowFiles to stay larger, each consisting of multiple records, which results in far better performance.
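The reader/writer split above can be mimicked in plain Python to show why it helps: the processing step sees only records, so the same logic works for any input format. This is a conceptual sketch, not NiFi code. The function names, the `pm25` field, and the sample values are hypothetical.

```python
import csv
import io
import json


def read_csv_records(text):
    """Reader: CSV text to a list of dict records."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json_records(text):
    """Reader: JSON array text to a list of dict records."""
    return json.loads(text)


def write_json_lines(records):
    """Writer: records to newline-delimited JSON."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


def keep_high_pm25(records, threshold=10.0):
    """Format-agnostic processing step: it only ever sees dict records."""
    return [r for r in records if float(r["pm25"]) > threshold]


# Made-up sample inputs in two different formats.
csv_input = "city,pm25\nstation-a,12.5\nstation-b,6.0\n"
json_input = '[{"city": "station-c", "pm25": 15.2}, {"city": "station-d", "pm25": 1.0}]'

from_csv = write_json_lines(keep_high_pm25(read_csv_records(csv_input)))
from_json = write_json_lines(keep_high_pm25(read_json_records(json_input)))
print(from_csv)
print(from_json)
```

Note that the CSV reader yields strings while the JSON reader yields numbers. A schema, as supplied by a schema registry in NiFi, is what would normalize those types.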
Extract Company Names
● Python 3.10+
● Hugging Face, NLP, spaCy, PyTorch
https://github.com/tspannhw/FLaNK-python-ExtractCompanyName-processor
CaptionImage
● Python 3.10+
● Hugging Face
● Salesforce/blip-image-captioning-large
● Generate Captions for Images
● Adds captions to FlowFile Attributes
● Does not require download or copies of your images
https://github.com/tspannhw/FLaNK-python-processors
RESNetImageClassification
● Python 3.10+
● Hugging Face
● Transformers
● PyTorch
● Datasets
● microsoft/resnet-50
● Adds classification label to FlowFile Attributes
● Does not require download or copies of your images
https://github.com/tspannhw/FLaNK-python-processors
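Both image processors above reduce model output to a few attributes. The sketch below shows that reduction step in isolation and is not the code from the linked repository. The attribute names (`caption`, `label`, `score`) and the helper names are hypothetical. `load_pipelines` assumes the `transformers`, `torch` and `Pillow` packages are installed and that the "image-to-text" and "image-classification" pipeline tasks are available in the installed transformers version.

```python
def caption_attributes(captioner, image):
    """Turn image-captioning output into a caption attribute.

    `captioner` is any callable returning a list of {"generated_text": ...} dicts.
    """
    results = captioner(image)
    return {"caption": results[0]["generated_text"]}


def classification_attributes(classifier, image):
    """Keep only the best-scoring label from image-classification output.

    `classifier` is any callable returning a list of {"label": ..., "score": ...} dicts.
    """
    best = max(classifier(image), key=lambda result: result["score"])
    return {"label": best["label"], "score": f"{best['score']:.4f}"}


def load_pipelines():
    """Load the two models named on the slides; downloads weights on first use."""
    from transformers import pipeline

    captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-large")
    classifier = pipeline("image-classification", model="microsoft/resnet-50")
    return captioner, classifier
```

Keeping the attribute-building functions separate from model loading makes them testable with simple stand-in callables, without downloading any weights.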
Address To Lat/Long
● Python 3.10+
● geopy Library
● Nominatim
● OpenStreetMap (OSM)
● openstreetmap.org/copyright
● Returns as attributes and JSON file
● Works with partial addresses
● Categorizes location
● Bounding Box
https://github.com/tspannhw/FLaNKAI-Boston
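The address lookup described above can be sketched with geopy's Nominatim geocoder. This is an illustration under assumptions, not the processor from the linked repository. The attribute names and helper names are hypothetical, and the `type` and `boundingbox` keys are read defensively from the raw response because their presence depends on the Nominatim result.

```python
def address_to_attributes(address, geocode):
    """Geocode an address and return string attributes, or None if nothing matched.

    `geocode` is any callable that returns an object with latitude, longitude,
    address and raw fields, such as geopy's Nominatim(...).geocode.
    """
    location = geocode(address)
    if location is None:
        return None
    raw = location.raw
    return {
        "latitude": str(location.latitude),
        "longitude": str(location.longitude),
        "display_name": location.address,
        "category": str(raw.get("type", "")),
        "bounding_box": ",".join(str(edge) for edge in raw.get("boundingbox", [])),
    }


def make_nominatim_geocoder(user_agent):
    """Build a real geocoder; needs the geopy package and network access."""
    from geopy.geocoders import Nominatim

    # Nominatim's usage policy asks for an identifying user agent.
    return Nominatim(user_agent=user_agent).geocode
```

A call would look like `address_to_attributes("Princeton, NJ", make_nominatim_geocoder("my-app"))`, where "my-app" is a placeholder for your own application name.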
DEMO
RESOURCES AND WRAP-UP
https://www.linkedin.com/in/timothyspann/
Open Source Edition
● Apache NiFi in
Docker
● Runs in Docker
● Try new features
quickly
● Develop applications
locally
● Docker NiFi
○ docker run --name nifi -p 8443:8443 -d -e SINGLE_USER_CREDENTIALS_USERNAME=admin -e SINGLE_USER_CREDENTIALS_PASSWORD=ctsBtRBKHRAx69EqUghvvgEvjnaLjFEB apache/nifi:latest
● Licensed under the Apache License 2.0
● Unsupported
https://hub.docker.com/r/apache/nifi
Free Data and AI Event
● King of Prussia
● Princeton
● New York
● Virtual
https://www.snowflake.com/events/data-for-breakfast/
2025-03-03-Philly-AAAI-GoodData-Build Secure RAG Apps With Open LLM

