1 © Hortonworks Inc. 2011 – 2018 All Rights Reserved
Apache NiFi Integration with Apache Spark
Timothy Spann, Solutions Engineer
Disclaimer
• This document may contain product features and technology directions that are under
development, may be under development in the future, or may ultimately not be
developed.
• Technical feasibility, market demand, user feedback, and the Apache Software
Foundation community development process can all affect timing and final delivery.
• This document’s description of these features and technology directions does not
represent a contractual commitment, promise or obligation from Hortonworks to deliver
these features in any generally available product.
• Product features and technology directions are subject to change, and must not be
included in contracts, purchase orders, or sales agreements of any kind.
• Since this document contains an outline of general product development plans,
customers should not rely upon it when making a purchase decision.
Integration Options
• Apache Spark Integration via Kafka and Spark Streaming (1.6+)
• Apache Spark Integration via Kafka and Spark Structured Streaming (2.2+)
• Apache Spark Integration via Apache Livy
Apache Kafka and Apache NiFi Integration
NiFi and Kafka Are Complementary
NiFi
Provide dataflow solution
• Centralized management, from edge to core
• Great traceability, event level data provenance
starting when data is born
• Interactive command and control – real time
operational visibility
• Dataflow management, including prioritization,
back pressure, and edge intelligence
• Visual representation of global dataflow
Kafka
Provide durable stream store
• Low latency
• Distributed data durability
• Decentralized management of producers &
consumers
Integrated Provisioning and Security
Kafka 1.0 Support
To enhance data governance and lineage, users can
now manage access control policies using resource or
tag-based security in Ranger for Kafka 1.0 clusters.
Users can now install, configure, manage, upgrade,
monitor, and secure Kafka 1.0 clusters with Ambari.
New processors in NiFi and Streaming Analytics
Manager support Kafka 1.0 features including message
headers and transactions.
Apache NiFi and Kafka 1.0 – Use Case for Kafka Message Headers
Apache Spark – Apache Kafka – Apache NiFi Architecture
Architecture Example
[Diagram labels: Flow Management, Stream Analysis, Spark Processing, Acquire/Move, Routing & Filtering, Parse, Analyze, Model, Topic 1, Topic 2, JSON Data, AVRO Data, Join, Aggregate, Correlate, Pattern Matching, Windowing, Aggregations]
Stream Processing
Streaming Analytics Manager
Machine Learning
Distributed queue
Buffering
Process decoupling
Structured Streaming with SQL
Orchestration
Queueing
Simple Event Processing
Data Definition Between Environments
Schema Versioning
Key Integration Points – NiFi & Kafka
[Diagram labels: MiNiFi (×3), NiFi, Kafka, Consumer 1, Consumer 2, Consumer N]
• Producer Processors (Main)
• PublishKafka_0_11 (0.11 Kafka Client)
• PublishKafka_1_0 (1.0 Kafka Client)
• PublishKafkaRecord_0_11 (0.11 Kafka Client)
• PublishKafkaRecord_1_0 (1.0 Kafka Client)
Key Integration Points – NiFi & Kafka
[Diagram labels: Producer 1, Producer 2, Producer N, Kafka, NiFi, Destination 1, Destination 2, Destination 3]
• Consumer Processors (Main)
• ConsumeKafka_0_11 (0.11 Kafka Client)
• ConsumeKafka_1_0 (1.0 Kafka Client)
• ConsumeKafkaRecord_0_11 (0.11 Kafka Client)
• ConsumeKafkaRecord_1_0 (1.0 Kafka Client)
Better Together
[Diagram labels: MiNiFi (×2), NiFi, PublishKafka, Kafka, Incoming Topic, Results Topic, Spark, ConsumeKafka, Destinations]
• MiNiFi – Collection, filtering, and prioritization at the edge
• NiFi – Central data flow management, routing, enriching, and transformation
• Kafka – Central messaging bus for subscription by downstream consumers
• Spark – Streaming analytics focused on complex event processing
NiFi PublishKafkaRecord_1_0
[Diagram labels: Apache NiFi – Node 1 (PublishKafka), Apache NiFi – Node 2 (PublishKafka), Apache Kafka, Topic 1 – Partition 1, Topic 1 – Partition 2; legend entry: Concurrent Task]
• Each NiFi node runs an instance of PublishKafkaRecord_1_0
• Each instance has one or more concurrent tasks (threads)
• Each concurrent task is an independent producer that sends data round-robin to the partitions of a topic
• Records with schemas for performance
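The bullets above describe each concurrent task as an independent producer that spreads records round-robin over a topic's partitions. The following is a minimal model of that behavior in plain Scala. It is not NiFi or Kafka client code: `Assignment` and `roundRobin` are hypothetical names, partitions are numbered from 0, and a real Kafka producer may choose partitions differently depending on record keys and client version.

```scala
// Hypothetical model of round-robin partition assignment; this is not NiFi or Kafka client code.
final case class Assignment(node: String, task: Int, record: Int, partition: Int)

// Every concurrent task keeps its own counter, so each task cycles through
// the partitions independently of the other tasks and nodes.
def roundRobin(
    nodes: Seq[String],
    tasksPerNode: Int,
    recordsPerTask: Int,
    partitions: Int
): Seq[Assignment] =
  for {
    node   <- nodes
    task   <- 0 until tasksPerNode
    record <- 0 until recordsPerTask
  } yield Assignment(node, task, record, record % partitions)

// Two NiFi nodes, two concurrent tasks each, four records per task, two partitions.
val assignments =
  roundRobin(Seq("node-1", "node-2"), tasksPerNode = 2, recordsPerTask = 4, partitions = 2)

// Count how many records landed on each partition.
val perPartition: Map[Int, Int] =
  assignments.groupBy(_.partition).map { case (p, as) => p -> as.size }
```

In this model, adding nodes or concurrent tasks adds producers but leaves the spread across partitions even, which is the property the slide highlights.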
Apache Spark Streaming – Apache Kafka – Apache NiFi Architecture
Spark Streaming
• Spark Streaming is an extension of the core Spark API that supports scalable, high-throughput, and
fault-tolerant streaming applications.
• Data can be ingested from various sources such as Kafka, Flume, Twitter, ZeroMQ, or TCP
sockets.
• Data is processed using the now-familiar API: map, filter, reduce, join, and window.
• Processed data can be pushed out to databases, filesystems, or live dashboards.
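To make the map, filter, reduce, and window vocabulary concrete, here is a small sketch that models micro-batches with plain Scala collections. It deliberately does not use the Spark Streaming (DStream) API, so it runs without a cluster. The `Reading` type, the device names, and the values are made up for illustration.

```scala
// Plain Scala collections standing in for micro-batches; this is not the Spark Streaming (DStream) API.
final case class Reading(device: String, watts: Double)

// Each inner Seq models the records that arrived during one batch interval.
val batches: Seq[Seq[Reading]] = Seq(
  Seq(Reading("plug-a", 10.0), Reading("plug-b", 0.0)),
  Seq(Reading("plug-a", 20.0)),
  Seq(Reading("plug-b", 5.0), Reading("plug-a", 30.0))
)

// filter + map + reduce-by-key over one window of batches.
def totalsByDevice(window: Seq[Seq[Reading]]): Map[String, Double] =
  window.flatten
    .filter(_.watts > 0.0)          // filter: drop idle readings
    .map(r => r.device -> r.watts)  // map: turn each reading into a (key, value) pair
    .groupBy(_._1)                  // group the pairs by device
    .map { case (device, pairs) => device -> pairs.map(_._2).sum } // reduce: sum per device

// window: a sliding window two batches long that advances one batch at a time.
val windowedTotals: Seq[Map[String, Double]] =
  batches.sliding(2).map(totalsByDevice).toList
```

In Spark Streaming the same shape is expressed on a DStream, and the engine takes care of distributing the work and recovering from failures.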
Apache Spark Streaming Integration via Kafka
https://community.hortonworks.com/content/kbentry/173818/hdp-264-hdf-31-apache-spark-streaming-integration.html
Apache Spark Structured Streaming – Apache Kafka – Apache NiFi Architecture
Apache Spark Structured Streaming Integration via Kafka
https://community.hortonworks.com/articles/91379/spark-structured-streaming-with-nifi-and-kafka-usi.html
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KafkaSource.html
https://community.hortonworks.com/content/kbentry/174105/hdp-264-hdf-31-apache-spark-structured-streaming-i.html
// Subscribe to the Kafka topic as a streaming DataFrame (`spark` is the SparkSession).
val records = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "mykafkabroker:6667")
  .option("subscribe", "smartPlug2")
  .load()
Apache NiFi – Apache Kafka – Apache Spark
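For the NiFi to Kafka to Spark path, the Kafka source used with Structured Streaming exposes `key` and `value` as binary columns, so a common next step is to cast them to strings and start a streaming query. The sketch below makes several assumptions: it reuses the broker address and the `smartPlug2` topic from the earlier snippet, it expects the `spark-sql-kafka-0-10` package on the classpath, and it writes to the console sink purely for inspection. `kafkaSourceOptions` and `runConsoleQuery` are hypothetical helper names.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper: the Kafka source options from the slide, gathered in one map.
def kafkaSourceOptions(bootstrapServers: String, topic: String): Map[String, String] =
  Map(
    "kafka.bootstrap.servers" -> bootstrapServers,
    "subscribe"               -> topic
  )

// Hypothetical entry point; needs a reachable Kafka broker and blocks until the query stops.
def runConsoleQuery(): Unit = {
  val spark = SparkSession.builder().appName("nifi-kafka-structured-streaming").getOrCreate()

  val records = spark.readStream
    .format("kafka")
    .options(kafkaSourceOptions("mykafkabroker:6667", "smartPlug2"))
    .load()

  // Kafka keys and values arrive as binary; cast them to strings before use.
  val messages = records.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

  // Print each micro-batch to the console (a debugging sink, not a production one).
  val query = messages.writeStream
    .format("console")
    .outputMode("append")
    .start()

  query.awaitTermination()
}
```

When launched through `spark-submit`, the master and the Kafka package are typically supplied on the command line rather than in code.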
Apache Spark – Apache Livy
Introducing Apache Livy
• Apache Livy is the open source REST interface for interacting with Apache Spark from
anywhere
• Installed as Spark2 Ambari Service
[Diagram labels: Livy Client, HTTP, Livy Server, HTTP (RPC), Spark Interactive Session (SparkContext), Spark Batch Session (SparkContext)]
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_spark-component-guide/content/ch_submit-spark-apps-livy.html
Livy Server as a Session Management Service
[Diagram labels: Livy Server, Remote Spark Driver, Session, Remote Context, Interactive REST API, Batch REST API, Standard Spark Batch Job, Spark Executor (×4)]
https://livy.incubator.apache.org/docs/latest/rest-api.html
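As a rough illustration of the interactive REST API, the sketch below builds the two requests a client would typically send: one to create a session and one to run a statement in it. It uses only the JDK's `java.net.http` classes (Java 11 or later). The host name `livy-host` is a placeholder, 8998 is Livy's default port, and the helper names (`jsonString`, `post`, `createSession`, `runStatement`) are hypothetical. Authentication, TLS, and polling for statement results are left out.

```scala
import java.net.URI
import java.net.http.HttpRequest
import java.net.http.HttpRequest.BodyPublishers

// Assumed Livy endpoint: the host name is a placeholder, 8998 is Livy's default port.
val livyUrl = "http://livy-host:8998"

// Minimal JSON string escaping for code sent to Livy (backslash, quote, and newline only).
def jsonString(s: String): String =
  "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n") + "\""

// Build a JSON POST request against the Livy server.
def post(path: String, body: String): HttpRequest =
  HttpRequest.newBuilder(URI.create(livyUrl + path))
    .header("Content-Type", "application/json")
    .POST(BodyPublishers.ofString(body))
    .build()

// POST /sessions starts an interactive session; kind "spark" selects the Scala interpreter.
def createSession(): HttpRequest =
  post("/sessions", """{"kind": "spark"}""")

// POST /sessions/{id}/statements runs a code snippet inside an existing session.
def runStatement(sessionId: Int, code: String): HttpRequest =
  post(s"/sessions/$sessionId/statements", s"""{"code": ${jsonString(code)}}""")
```

A request built this way can be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. The response to the first call carries the session id that the second call needs.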
Apache Spark – Apache Livy – Apache NiFi Integration
Architecture Example
[Diagram labels: Flow Management, Analytics, Spark Processing, Routing & Filtering, Parse, Analyze, Session 1 (×2), Aggregate, SQL (×2), JSON Data]
NiFi to Spark Processing
Streaming Analytics Manager
Machine Learning
REST API
Enterprise Tested
Secure
Structured Streaming with SQL
Orchestration
Queueing
Simple Event Processing
Data Definition Between Environments
Schema Versioning
Key Integration Points – NiFi & Spark
[Diagram labels: MiNiFi (×3), NiFi, Livy, Spark, Spark 2, Spark N]
• Processor and Controller Service
• ExecuteSparkInteractive – processor that submits Spark code to a session provided by the Livy session service
• LivySessionService – controller service that manages a pool of Livy connections to Spark
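The idea of a service that hands out pooled sessions to a processor can be sketched with a small in-memory model. This is a hypothetical illustration of session pooling in general, not NiFi's LivySessionService implementation: `Session`, `SessionPool`, and their methods are invented names, and a real pool would also create, monitor, and replace sessions on the Livy server.

```scala
// Hypothetical model of a pool of Livy sessions; this is not NiFi's LivySessionService implementation.
final case class Session(id: Int)

final class SessionPool(size: Int) {
  private var idle: List[Session] = (0 until size).map(i => Session(i)).toList
  private var busy: Set[Session]  = Set.empty

  // Hand out an idle session, or None when every session is in use.
  def acquire(): Option[Session] = synchronized {
    idle match {
      case head :: tail =>
        idle = tail
        busy += head
        Some(head)
      case Nil => None
    }
  }

  // Return a session to the pool so another caller can reuse it.
  def release(session: Session): Unit = synchronized {
    if (busy.contains(session)) {
      busy -= session
      idle = session :: idle
    }
  }

  // Number of sessions currently available.
  def idleCount: Int = synchronized(idle.size)
}
```

Reusing a session means each submission lands in an already running Spark context instead of paying the start-up cost of a new one.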
Better Together
[Diagram labels: MiNiFi (×2), NiFi, ExecuteSparkInteractive, Livy, Session, Batch, Spark]
• MiNiFi – Collection, filtering, and prioritization at the edge
• NiFi – Central data flow management, routing, enriching, and transformation
• Livy – Secure HTTPS connection to running Spark batch and session jobs, with
cached RDD sharing and a live Spark context
• Spark – Streaming analytics focused on complex event processing
LivySessionService
Apache Spark – Apache Livy – Apache NiFi Architecture
Apache Spark Integration via Apache Livy
https://community.hortonworks.com/articles/171787/hdf-31-executing-apache-spark-via-executesparkinte.html
https://community.hortonworks.com/articles/171893/hdf-31-executing-apache-spark-via-executesparkinte-1.html
Questions?
Hortonworks Community Connection:
Data Ingestion and Streaming
https://community.hortonworks.com/
Contact
https://community.hortonworks.com/users/9304/tspann.html
https://dzone.com/users/297029/bunkertor.html
https://www.meetup.com/futureofdata-princeton/
https://twitter.com/PaaSDev
https://community.hortonworks.com/articles/174105/hdp-264-hdf-31-apache-spark-structured-streaming-i.html
Hortonworks Community Connection
Read access for everyone, join to participate and be recognized
• Full Q&A Platform (like StackOverflow)
• Knowledge Base Articles
• Code Samples and Repositories
Community Engagement
Participate now at: community.hortonworks.com
4,000+ Registered Users
10,000+ Answers
15,000+ Technical Assets
One Website!
Register at dataworkssummit.com
#DWS18
Berlin, Germany | April 16-19, 2018 | Estrel Hotel
San Jose, California | June 17-21, 2018 | McEnery Convention Center

Running Apache NiFi with Apache Spark : Integration Options
