BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF
HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH
Self-Service Data Ingestion Using NiFi,
StreamSets & Kafka
Guido Schmutz – 6.12.2017
@gschmutz guidoschmutz.wordpress.com
Guido Schmutz
Working at Trivadis for more than 21 years
Oracle ACE Director for Fusion Middleware and SOA
Consultant, Trainer, and Software Architect for Java, Oracle, SOA, and
Big Data / Fast Data
Head of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 30 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
Agenda
1. Data Flow Processing
2. Apache NiFi
3. StreamSets Data Collector
4. Kafka Connect
5. Summary
Data Flow Processing
Traditional Big Data Architecture
[Diagram: a Big Data cluster combining a distributed filesystem, parallel batch processing (machine learning, graph algorithms, natural language processing), NoSQL and search stores; fed via file/SQL import from Billing & Ordering, CRM / Profile and Marketing Campaigns; serving BI tools, the Enterprise Data Warehouse, search/explore and online & mobile apps.]
Event Hub – handle event stream data
[Diagram: the same Big Data cluster, now also fed by event hubs that collect stream data (location, social, clickstream, sensor data, call center, weather data, mobile apps) and forward it into the cluster via data flows, alongside the existing file/SQL imports.]
Event Hub – taking Velocity into account
[Diagram: stream analytics is added next to parallel batch processing; the event hubs feed both streaming analytics and the distributed filesystem, results and reference models are exchanged between batch and streaming analytics, and dashboards join BI tools, the Enterprise Data Warehouse and search/explore on the serving side.]
Event Hub – Asynchronous Microservice Architecture
[Diagram: containerized microservices with their own RDBMS/NoSQL stores consume from and produce to the event hubs and expose APIs, while the Big Data cluster keeps handling batch processing and analytics.]
[Same diagram, with the ingestion path split into three phases: Integrate, Sanitize / Normalize, Deliver.]
Continuous Ingestion – DataFlow Pipelines
[Diagram: sources (IoT sensors via IoT gateways and an MQTT broker, DB sources via CDC gateways or native CDC, file sources via log tailing, social feeds via native APIs, REST calls) flow through dataflow and messaging gateways into event hub topics and queues, from where stream processing delivers the data into the Big Data platform.]
DataFlow Pipeline
• Flow-based "programming"
• Ingest Data from various sources
• Extract – Transform – Load
• High-Throughput, straight-through
data flows
• Data Lineage
• Batch- or Stream-Processing
• Visual coding with flow editor
• Event Stream Processing (ESP) but
not Complex Event Processing (CEP)
Source: Confluent
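The flow-based idea above can be sketched as a chain of small stages, each passing records to the next. A minimal pure-Python illustration, not tied to any of the tools discussed here:

```python
import json

def ingest(raw_lines):
    # Ingest stage: parse each raw line into a record.
    for line in raw_lines:
        yield json.loads(line)

def transform(records):
    # Transform stage: enrich each record with a derived field.
    for rec in records:
        rec["eventtype_upper"] = rec["eventtype"].upper()
        yield rec

def deliver(records):
    # Deliver stage: collect records; a real flow would write to
    # Kafka, HDFS, a database, ...
    return list(records)

raw = ['{"truckid": "57", "eventtype": "Normal"}']
result = deliver(transform(ingest(raw)))
print(result[0]["eventtype_upper"])  # prints NORMAL
```

Each stage only knows its input and output, which is what lets flow editors rewire pipelines visually.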
Continuous Ingestion – Integrating data sources
• SQL Polling
• Change Data Capture (CDC)
• File Polling (File Tailing)
• File Stream (Appender)
• Sensor Stream
Ingestion with/without Transformation?
Zero Transformation
• No transformation, plain ingest, no
schema validation
• Keep the original format – Text,
CSV, …
• Allows storing data that may have
schema errors
Format Transformation
• Better named "Format Translation"
• Simply change the format
• Change format from Text to Avro
• Does schema validation
Enrichment Transformation
• Add new data to the message
• Do not change existing values
• Convert a value from one system to
another and add it to the message
Value Transformation
• Replaces values in the message
• Convert a value from one system to
another and change the value in-place
• Destroys the raw data!
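The difference between enrichment and value transformation can be made concrete with a tiny sketch (field names follow the demo message; the derived values are made up):

```python
message = {"truckid": "57", "latitude": "38.65", "longitude": "-90.21"}

def enrich(msg):
    # Enrichment: copy, add a new field, leave existing values untouched.
    out = dict(msg)
    out["country"] = "US"  # hypothetical derived value
    return out

def convert_in_place(msg):
    # Value transformation: replace the value in place.
    # The original raw string is lost.
    msg["latitude"] = float(msg["latitude"])
    return msg

enriched = enrich(message)
converted = convert_in_place(dict(message))
```

Enrichment keeps the raw data recoverable; value transformation does not, which is why it destroys the raw data.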
Why is Data Ingestion Difficult?
• Physical and logical infrastructure changes rapidly. Key challenges: Infrastructure Automation, Edge Deployment, Infrastructure Drift
• Data structures and formats evolve and change unexpectedly. Key challenges: Consumption Readiness, Corruption and Loss, Structure Drift
• Data semantics change with evolving applications. Key challenges: Timely Intervention, System Consistency, Semantic Drift
Source: StreamSets
Challenges for Ingesting Sensor Data
• Multitude of sensors
• Real-Time Streaming
• Multiple Firmware versions
• Bad Data from damaged
sensors
• Regulatory Constraints
• Data Quality
Source: Cloudera
Demo Case
[Diagram: Trucks 1–5 publish position messages to MQTT topics truck/nn/position; a yet-to-be-chosen component ("?") forwards them to the Kafka topic (truck position raw) and on to a raw data store.]
Sample message:
{"truckid":"57","driverid":"15","routeid":"1927624662","eventtype":"Normal","latitude":"38.65","longitude":"-90.21","correlationId":"4412891759760421296"}
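A small sketch of how such a payload relates to the per-truck MQTT topic shown in the diagram (pure Python, no broker involved):

```python
import json

# The demo payload, reassembled on one line.
payload = ('{"truckid":"57","driverid":"15","routeid":"1927624662",'
           '"eventtype":"Normal","latitude":"38.65","longitude":"-90.21",'
           '"correlationId":"4412891759760421296"}')

msg = json.loads(payload)
# Each truck publishes to its own topic, truck/nn/position.
topic = "truck/{}/position".format(msg["truckid"])
print(topic)  # prints truck/57/position
```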
Apache NiFi
Apache NiFi
• Originated at NSA as Niagarafiles – developed
behind closed doors for 8 years
• Open sourced December 2014, Apache Top
Level Project July 2015
• Look-and-Feel modernized in 2016
• Opaque, “file-oriented” payload
• Distributed system of processors with
centralized control
• Based on flow-based programming concepts
• Data Provenance and Data Lineage
• Web-based user interface
Processors for Source and Sink
• ConsumeXXXX (AMQP, EWS, IMAP, JMS, Kafka, MQTT, POP3, …)
• DeleteXXXX (DynamoDB, Elasticsearch, HDFS, RethinkDB, S3, SQS, ...)
• FetchXXXX (AzureBlobStorage, ElasticSearch, File, FTP, HBase, HDFS, S3 ...)
• ExecuteXXXX (FlumeSink, FlumeSource, Script, SQL, ...)
• GetXXXX (AzureEventHub, Couchbase, DynamoDB, File, FTP, HBase, HDFS,
HTTP, Ignite, JMSQueue, JMSTopic, Kafka, Mongo, Solr, Splunk, SQS, TCP, ...)
• ListenXXXX (HTTP, RELP, SMTP, Syslog, TCP, UDP, WebSocket, ...)
• PublishXXXX (Kafka, MQTT)
• PutXXXX (AzureBlobStorage, AzureEventHub, CassandraQL, CloudWatchMetric,
Couchbase, DynamoDB, Elasticsearch, Email, FTP, File, HBase, HDFS, HiveQL,
Kudu, Lambda, Mongo, Parquet, Slack, SQL, TCP, ...)
• QueryXXXX (Cassandra, DatabaseTable, DNS, Elasticsearch)
Processors for Processing
• ConvertXxxxToYyyy
• ConvertRecord
• EnforceOrder
• EncryptContent
• ExtractXXXX (AvroMetadata,
EmailAttachments, Grok,
HL7Attributes, ImageMetadata, ...)
• GeoEnrichIP
• JoltTransformJSON
• MergeContent
• ReplaceText
• ResizeImage
• SplitXXXX (Avro, Content, JSON,
Record, Xml, ...)
• TailFile
• TransformXML
• UpdateAttribute
Demo Case
[Diagram: Trucks publish to MQTT brokers on ports 1883 and 1884 (topics truck/nn/position); a NiFi "MQTT to Kafka" flow forwards the messages to the Kafka topic (truck position raw), and a "Kafka to Raw" flow persists them to the raw data store.]
Sample message:
{"truckid":"57","driverid":"15","routeid":"1927624662","eventtype":"Normal","latitude":"38.65","longitude":"-90.21","correlationId":"4412891759760421296"}
Demo: Dataflow for MQTT to Kafka
Demo: MQTT Processor
Demo: Kafka Processor
Demo: Masking Field with ReplaceText Processor
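ReplaceText applies a regular-expression search and replace to the flow file content. The same masking idea in a standalone Python sketch (the pattern and mask value are illustrative, not the exact processor configuration used in the demo):

```python
import re

payload = '{"truckid":"57","driverid":"15","latitude":"38.65","longitude":"-90.21"}'

# Analogous to ReplaceText's "Search Value" / "Replacement Value"
# properties: mask the driverid value, leave everything else intact.
masked = re.sub(r'"driverid":"[^"]*"', '"driverid":"xxxx"', payload)
print(masked)
```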
StreamSets
StreamSets Data Collector
• Founded by ex-Cloudera, Informatica
employees
• Continuous, open-source, intent-driven big data
ingest
• Visible, record-oriented approach fixes
combinatorial explosion
• Batch or stream processing
• Standalone, Spark cluster, MapReduce cluster
• IDE for pipeline development by ‘civilians’
• Relatively new - first public release September
2015
• So far, vast majority of commits are from
StreamSets staff
StreamSets Origins
Source:	https://streamsets.com/connectors
An origin stage represents the
source for the pipeline. You can
use a single origin stage in a
pipeline.
The origins shown on the right are
available out of the box.
There is an API for writing custom origins.
StreamSets Processors
A processor stage represents a type of
data processing that you want to perform.
You can use as many processors in a
pipeline as you need.
Programming languages supported
• Java
• JavaScript
• Jython
• Groovy
• Java Expression Language (EL)
• Spark
Some of the processors available out of the
box:
• Expression Evaluator
• Field Flattener
• Field Hasher
• Field Masker
• Field Merger
• Field Order
• Field Splitter
• Field Zip
• Groovy Evaluator
• JDBC Lookup
• JSON Parser
• Spark Evaluator
• …
StreamSets Destinations
A destination stage represents
the target for a pipeline. You can
use one or more destinations in a
pipeline.
The destinations shown on the right are
available out of the box.
There is an API for writing custom destinations.
Source:	https://streamsets.com/connectors
Demo Case
[Diagram: Trucks publish to MQTT brokers on ports 1883 and 1884 (topics truck/nn/position); two StreamSets pipelines, "MQTT-1 to Kafka" and "MQTT-2 to Kafka" (the latter running on Data Collector Edge), forward the messages to the Kafka topic (truck position raw), and a "Kafka to Raw" pipeline persists them to the raw data store.]
Sample message:
{"truckid":"57","driverid":"15","routeid":"1927624662","eventtype":"Normal","latitude":"38.65","longitude":"-90.21","correlationId":"4412891759760421296"}
Demo: Dataflow for MQTT to Kafka
Demo: MQTT Source
Demo: Kafka Sink
Demo: Dataflow for MQTT to Kafka
Demo: Masking fields
Demo: Sending Message to Kafka in Avro
StreamSets Dataflow Performance Manager
• Map dataflows to topologies, manage releases &
track changes
• Measure KPIs and establish baselines for data
availability and accuracy
• Master dataflow operations through Data SLAs
Source:	https://streamsets.com/connectors
Kafka Connect
Kafka Connect - Overview
[Diagram: source connectors pull data from external systems into Kafka topics; sink connectors push data from Kafka topics into external systems.]
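Connectors run inside Kafka Connect worker processes. A minimal standalone-mode worker configuration might look like the following sketch (broker host, offsets file and plugin path are illustrative values):

```properties
# connect-standalone.properties (illustrative values)
bootstrap.servers=broker-1:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
# Standalone mode stores source offsets in a local file
offset.storage.file.filename=/tmp/connect.offsets
plugin.path=/opt/kafka/connect-plugins
```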
Kafka Connect – Single Message Transforms (SMT)
Simple Transformations for a single message
Defined as part of Kafka Connect
• some useful transforms provided out-of-the-box
• Easily implement your own
Optionally deploy 1+ transforms with each
connector
• Modify messages produced by source
connector
• Modify messages sent to sink connectors
Makes it much easier to mix and match connectors
Some of the currently available
transforms:
• InsertField
• ReplaceField
• MaskField
• ValueToKey
• ExtractField
• TimestampRouter
• RegexRouter
• SetSchemaMetaData
• Flatten
• TimestampConverter
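As an illustration, masking fields of the demo's truck messages could be declared directly in the connector configuration using the out-of-the-box MaskField transform (the connector name and field list are assumptions; the transform class ships with Kafka Connect):

```json
{
  "name": "mqtt-source-masked",
  "config": {
    "connector.class": "com.datamountaineer.streamreactor.connect.mqtt.source.MqttSourceConnector",
    "transforms": "mask",
    "transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value",
    "transforms.mask.fields": "latitude,longitude"
  }
}
```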
Kafka Connect – Many Connectors
60+ since first release (0.9+)
20+ from Confluent and Partners
Source:	http://www.confluent.io/product/connectors
Confluent	supported	Connectors
Certified	Connectors Community	Connectors
Demo Case
[Diagram: Trucks publish to MQTT brokers on ports 1883 and 1884 (topics truck/nn/position); Kafka Connect source connectors "MQTT-1 to Kafka" and "MQTT-2 to Kafka" forward the messages to the Kafka topic (truck position raw), and a "Kafka to Raw" connector persists them to the raw data store.]
Sample message:
{"truckid":"57","driverid":"15","routeid":"1927624662","eventtype":"Normal","latitude":"38.65","longitude":"-90.21","correlationId":"4412891759760421296"}
Demo: Dataflow for MQTT to Kafka
#!/bin/bash
curl -X "POST" "http://192.168.69.138:8083/connectors" \
  -H "Content-Type: application/json" \
  -d $'{
  "name": "mqtt-source",
  "config": {
    "connector.class": "com.datamountaineer.streamreactor.connect.mqtt.source.MqttSourceConnector",
    "connect.mqtt.connection.timeout": "1000",
    "tasks.max": "1",
    "connect.mqtt.kcql": "INSERT INTO truck_position SELECT * FROM truck/+/position",
    "name": "MqttSourceConnector",
    "connect.mqtt.service.quality": "0",
    "connect.mqtt.client.id": "tm-mqtt-connect-01",
    "connect.mqtt.converter.throw.on.error": "true",
    "connect.mqtt.hosts": "tcp://mosquitto-1:1883"
  }
}'
Summary
Summary
Apache NiFi
• visual dataflow modelling
• very powerful – “with power
comes responsibility”
• special package for Edge
computing
• data lineage and data
provenance
• support for backpressure
• no transport mechanism
(DEV/TST/PROD)
• custom processors
• supported by Hortonworks
StreamSets
• visual dataflow modelling
• very powerful – “with power
comes responsibility”
• special package for Edge
computing
• data lineage and data
provenance
• no transport mechanism
• custom sources, sinks,
processors
• supported by StreamSets
Kafka Connect
• declarative style data flows
• simplicity - “simple things
done simple”
• very well integrated with
Kafka – comes with Kafka
• Single Message Transforms
(SMT)
• use Kafka Streams for
complex data flows
• custom connectors
• supported by Confluent
Technology on its own won't help you.
You need to know how to use it properly.

More Related Content

What's hot (20)

PPTX
Big Data Application Architectures - Fraud Detection
DataWorks Summit/Hadoop Summit
 
PPTX
Databricks Fundamentals
Dalibor Wijas
 
PDF
Azure Synapse 101 Webinar Presentation
Matthew W. Bowers
 
PPTX
Real-Time Data Flows with Apache NiFi
Manish Gupta
 
PDF
Accelerate and modernize your data pipelines
Paul Van Siclen
 
PDF
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
DATAVERSITY
 
PDF
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Wes McKinney
 
PPTX
Hadoop Tutorial For Beginners
Dataflair Web Services Pvt Ltd
 
PDF
Data Architecture Strategies: The Rise of the Graph Database
DATAVERSITY
 
PDF
Introducing Databricks Delta
Databricks
 
PDF
Architecting Modern Data Platforms
Ankit Rathi
 
PPTX
Demystifying Data Warehouse as a Service
Snowflake Computing
 
PPTX
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
PDF
Spark SQL
Joud Khattab
 
PDF
Intro to Delta Lake
Databricks
 
PDF
Kappa vs Lambda Architectures and Technology Comparison
Kai Wähner
 
PPTX
Real-time Stream Processing with Apache Flink
DataWorks Summit
 
PDF
Architecture patterns for distributed, hybrid, edge and global Apache Kafka d...
Kai Wähner
 
PDF
Best Practices in Metadata Management
DATAVERSITY
 
PDF
Data Lake,beyond the Data Warehouse
Data Science Thailand
 
Big Data Application Architectures - Fraud Detection
DataWorks Summit/Hadoop Summit
 
Databricks Fundamentals
Dalibor Wijas
 
Azure Synapse 101 Webinar Presentation
Matthew W. Bowers
 
Real-Time Data Flows with Apache NiFi
Manish Gupta
 
Accelerate and modernize your data pipelines
Paul Van Siclen
 
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
DATAVERSITY
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Wes McKinney
 
Hadoop Tutorial For Beginners
Dataflair Web Services Pvt Ltd
 
Data Architecture Strategies: The Rise of the Graph Database
DATAVERSITY
 
Introducing Databricks Delta
Databricks
 
Architecting Modern Data Platforms
Ankit Rathi
 
Demystifying Data Warehouse as a Service
Snowflake Computing
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Spark SQL
Joud Khattab
 
Intro to Delta Lake
Databricks
 
Kappa vs Lambda Architectures and Technology Comparison
Kai Wähner
 
Real-time Stream Processing with Apache Flink
DataWorks Summit
 
Architecture patterns for distributed, hybrid, edge and global Apache Kafka d...
Kai Wähner
 
Best Practices in Metadata Management
DATAVERSITY
 
Data Lake,beyond the Data Warehouse
Data Science Thailand
 

Similar to Self-Service Data Ingestion Using NiFi, StreamSets & Kafka (20)

PDF
Self-Service Data Ingestion Using NiFi, StreamSets & Kafka
Guido Schmutz
 
PDF
The Never Landing Stream with HTAP and Streaming
Timothy Spann
 
PDF
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Timothy Spann
 
PDF
Meetup - Brasil - Data In Motion - 2023 September 19
ssuser73434e
 
PDF
Meetup - Brasil - Data In Motion - 2023 September 19
Timothy Spann
 
PDF
WarsawITDays_ ApacheNiFi202
Timothy Spann
 
PDF
xGem Data Stream Processing
Jorge Hirtz
 
PDF
GSJUG: Mastering Data Streaming Pipelines 09May2023
Timothy Spann
 
PDF
OSSNA Building Modern Data Streaming Apps
Timothy Spann
 
PDF
BigDataFest Building Modern Data Streaming Apps
ssuser73434e
 
PPTX
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
Data Con LA
 
PDF
Building Real-Time Travel Alerts
Timothy Spann
 
PPTX
Apache frameworks for Big and Fast Data
Naveen Korakoppa
 
PDF
BigDataFest_ Building Modern Data Streaming Apps
ssuser73434e
 
PDF
big data fest building modern data streaming apps
Timothy Spann
 
PPTX
NJ Hadoop Meetup - Apache NiFi Deep Dive
Bryan Bende
 
PDF
Streaming architecture patterns
hadooparchbook
 
PDF
Introduction to Apache NiFi dws19 DWS - DC 2019
Timothy Spann
 
PDF
Building Real-time Pipelines with FLaNK_ A Case Study with Transit Data
Timothy Spann
 
PDF
CoC23_Utilizing Real-Time Transit Data for Travel Optimization
Timothy Spann
 
Self-Service Data Ingestion Using NiFi, StreamSets & Kafka
Guido Schmutz
 
The Never Landing Stream with HTAP and Streaming
Timothy Spann
 
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Timothy Spann
 
Meetup - Brasil - Data In Motion - 2023 September 19
ssuser73434e
 
Meetup - Brasil - Data In Motion - 2023 September 19
Timothy Spann
 
WarsawITDays_ ApacheNiFi202
Timothy Spann
 
xGem Data Stream Processing
Jorge Hirtz
 
GSJUG: Mastering Data Streaming Pipelines 09May2023
Timothy Spann
 
OSSNA Building Modern Data Streaming Apps
Timothy Spann
 
BigDataFest Building Modern Data Streaming Apps
ssuser73434e
 
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
Data Con LA
 
Building Real-Time Travel Alerts
Timothy Spann
 
Apache frameworks for Big and Fast Data
Naveen Korakoppa
 
BigDataFest_ Building Modern Data Streaming Apps
ssuser73434e
 
big data fest building modern data streaming apps
Timothy Spann
 
NJ Hadoop Meetup - Apache NiFi Deep Dive
Bryan Bende
 
Streaming architecture patterns
hadooparchbook
 
Introduction to Apache NiFi dws19 DWS - DC 2019
Timothy Spann
 
Building Real-time Pipelines with FLaNK_ A Case Study with Transit Data
Timothy Spann
 
CoC23_Utilizing Real-Time Transit Data for Travel Optimization
Timothy Spann
 
Ad

More from Guido Schmutz (20)

PDF
30 Minutes to the Analytics Platform with Infrastructure as Code
Guido Schmutz
 
PDF
Event Broker (Kafka) in a Modern Data Architecture
Guido Schmutz
 
PDF
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
Guido Schmutz
 
PDF
ksqlDB - Stream Processing simplified!
Guido Schmutz
 
PDF
Kafka as your Data Lake - is it Feasible?
Guido Schmutz
 
PDF
Event Hub (i.e. Kafka) in Modern Data Architecture
Guido Schmutz
 
PDF
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Guido Schmutz
 
PDF
Event Hub (i.e. Kafka) in Modern Data (Analytics) Architecture
Guido Schmutz
 
PDF
Building Event Driven (Micro)services with Apache Kafka
Guido Schmutz
 
PDF
Location Analytics - Real-Time Geofencing using Apache Kafka
Guido Schmutz
 
PDF
Solutions for bi-directional integration between Oracle RDBMS and Apache Kafka
Guido Schmutz
 
PDF
What is Apache Kafka? Why is it so popular? Should I use it?
Guido Schmutz
 
PDF
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Guido Schmutz
 
PDF
Location Analytics Real-Time Geofencing using Kafka
Guido Schmutz
 
PDF
Streaming Visualisation
Guido Schmutz
 
PDF
Kafka as an event store - is it good enough?
Guido Schmutz
 
PDF
Solutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
Guido Schmutz
 
PDF
Fundamentals Big Data and AI Architecture
Guido Schmutz
 
PDF
Location Analytics - Real-Time Geofencing using Kafka
Guido Schmutz
 
PDF
Streaming Visualization
Guido Schmutz
 
30 Minutes to the Analytics Platform with Infrastructure as Code
Guido Schmutz
 
Event Broker (Kafka) in a Modern Data Architecture
Guido Schmutz
 
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
Guido Schmutz
 
ksqlDB - Stream Processing simplified!
Guido Schmutz
 
Kafka as your Data Lake - is it Feasible?
Guido Schmutz
 
Event Hub (i.e. Kafka) in Modern Data Architecture
Guido Schmutz
 
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Guido Schmutz
 
Event Hub (i.e. Kafka) in Modern Data (Analytics) Architecture
Guido Schmutz
 
Building Event Driven (Micro)services with Apache Kafka
Guido Schmutz
 
Location Analytics - Real-Time Geofencing using Apache Kafka
Guido Schmutz
 
Solutions for bi-directional integration between Oracle RDBMS and Apache Kafka
Guido Schmutz
 
What is Apache Kafka? Why is it so popular? Should I use it?
Guido Schmutz
 
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Guido Schmutz
 
Location Analytics Real-Time Geofencing using Kafka
Guido Schmutz
 
Streaming Visualisation
Guido Schmutz
 
Kafka as an event store - is it good enough?
Guido Schmutz
 
Solutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
Guido Schmutz
 
Fundamentals Big Data and AI Architecture
Guido Schmutz
 
Location Analytics - Real-Time Geofencing using Kafka
Guido Schmutz
 
Streaming Visualization
Guido Schmutz
 
Ad

Recently uploaded (20)

PDF
apidays Helsinki & North 2025 - API-Powered Journeys: Mobility in an API-Driv...
apidays
 
PDF
Context Engineering for AI Agents, approaches, memories.pdf
Tamanna
 
PDF
What does good look like - CRAP Brighton 8 July 2025
Jan Kierzyk
 
PDF
Incident Response and Digital Forensics Certificate
VICTOR MAESTRE RAMIREZ
 
PDF
apidays Helsinki & North 2025 - APIs in the healthcare sector: hospitals inte...
apidays
 
PPTX
apidays Munich 2025 - Building an AWS Serverless Application with Terraform, ...
apidays
 
PPTX
Rocket-Launched-PowerPoint-Template.pptx
Arden31
 
PPTX
apidays Helsinki & North 2025 - Vero APIs - Experiences of API development in...
apidays
 
PDF
2_Management_of_patients_with_Reproductive_System_Disorders.pdf
motbayhonewunetu
 
PPTX
Human-Action-Recognition-Understanding-Behavior.pptx
nreddyjanga
 
PDF
apidays Helsinki & North 2025 - Monetizing AI APIs: The New API Economy, Alla...
apidays
 
PPTX
Module-5-Measures-of-Central-Tendency-Grouped-Data-1.pptx
lacsonjhoma0407
 
PDF
How to Connect Your On-Premises Site to AWS Using Site-to-Site VPN.pdf
Tamanna
 
PDF
Avatar for apidays apidays PRO June 07, 2025 0 5 apidays Helsinki & North 2...
apidays
 
DOCX
AI/ML Applications in Financial domain projects
Rituparna De
 
PPTX
GenAI-Introduction-to-Copilot-for-Bing-March-2025-FOR-HUB.pptx
cleydsonborges1
 
PDF
Web Scraping with Google Gemini 2.0 .pdf
Tamanna
 
PDF
Context Engineering vs. Prompt Engineering, A Comprehensive Guide.pdf
Tamanna
 
PDF
Data Chunking Strategies for RAG in 2025.pdf
Tamanna
 
PDF
AUDITABILITY & COMPLIANCE OF AI SYSTEMS IN HEALTHCARE
GAHI Youssef
 
apidays Helsinki & North 2025 - API-Powered Journeys: Mobility in an API-Driv...
apidays
 
Context Engineering for AI Agents, approaches, memories.pdf
Tamanna
 
What does good look like - CRAP Brighton 8 July 2025
Jan Kierzyk
 
Incident Response and Digital Forensics Certificate
VICTOR MAESTRE RAMIREZ
 
apidays Helsinki & North 2025 - APIs in the healthcare sector: hospitals inte...
apidays
 
apidays Munich 2025 - Building an AWS Serverless Application with Terraform, ...
apidays
 
Rocket-Launched-PowerPoint-Template.pptx
Arden31
 
apidays Helsinki & North 2025 - Vero APIs - Experiences of API development in...
apidays
 
2_Management_of_patients_with_Reproductive_System_Disorders.pdf
motbayhonewunetu
 
Human-Action-Recognition-Understanding-Behavior.pptx
nreddyjanga
 
apidays Helsinki & North 2025 - Monetizing AI APIs: The New API Economy, Alla...
apidays
 
Module-5-Measures-of-Central-Tendency-Grouped-Data-1.pptx
lacsonjhoma0407
 
How to Connect Your On-Premises Site to AWS Using Site-to-Site VPN.pdf
Tamanna
 
Avatar for apidays apidays PRO June 07, 2025 0 5 apidays Helsinki & North 2...
apidays
 
AI/ML Applications in Financial domain projects
Rituparna De
 
GenAI-Introduction-to-Copilot-for-Bing-March-2025-FOR-HUB.pptx
cleydsonborges1
 
Web Scraping with Google Gemini 2.0 .pdf
Tamanna
 
Context Engineering vs. Prompt Engineering, A Comprehensive Guide.pdf
Tamanna
 
Data Chunking Strategies for RAG in 2025.pdf
Tamanna
 
AUDITABILITY & COMPLIANCE OF AI SYSTEMS IN HEALTHCARE
GAHI Youssef
 

Self-Service Data Ingestion Using NiFi, StreamSets & Kafka

  • 1. BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH Self-Service Data Ingestion Using NiFi, StreamSets & Kafka Guido Schmutz – 6.12.2017 @gschmutz guidoschmutz.wordpress.com
  • 2. Guido Schmutz Working at Trivadis for more than 21 years Oracle ACE Director for Fusion Middleware and SOA Consultant, Trainer Software Architect for Java, Oracle, SOA and Big Data / Fast Data Head of Trivadis Architecture Board Technology Manager @ Trivadis More than 30 years of software development experience Contact: [email protected] Blog: http://guidoschmutz.wordpress.com Slideshare: http://www.slideshare.net/gschmutz Twitter: gschmutz
  • 3. Agenda 1. Data Flow Processing 2. Apache NiFi 3. StreamSets Data Collector 4. Kafka Connect 5. Summary
  • 5. Hadoop Clusterd Hadoop Cluster Big Data Cluster Traditional Big Data Architecture BI Tools Enterprise Data Warehouse Billing & Ordering CRM / Profile Marketing Campaigns File Import / SQL Import SQL Search / Explore Online & Mobile Apps Search NoSQL Parallel Batch Processing Distributed Filesystem • Machine Learning • Graph Algorithms • Natural Language Processing
  • 6. Event Hub Event Hub Hadoop Clusterd Hadoop Cluster Big Data Cluster Event Hub – handle event stream data BI Tools Enterprise Data Warehouse Location Social Click stream Sensor Data Billing & Ordering CRM / Profile Marketing Campaigns Event Hub Call Center Weather Data Mobile Apps SQL Search / Explore Online & Mobile Apps Search Data Flow NoSQL Parallel Batch Processing Distributed Filesystem • Machine Learning • Graph Algorithms • Natural Language Processing
  • 7. Hadoop Clusterd Hadoop Cluster Big Data Cluster Event Hub – taking Velocity into account Location Social Click stream Sensor Data Billing & Ordering CRM / Profile Marketing Campaigns Call Center Mobile Apps Batch Analytics Streaming Analytics Results Parallel Batch Processing Distributed Filesystem Stream Analytics NoSQL Reference / Models SQL Search Dashboard BI Tools Enterprise Data Warehouse Search / Explore Online & Mobile Apps File Import / SQL Import Weather Data Event Hub Event Hub Event Hub
  • 8. Container Hadoop Clusterd Hadoop Cluster Big Data Cluster Event Hub – Asynchronous Microservice Architecture Location Social Click stream Sensor Data Billing & Ordering CRM / Profile Marketing Campaigns Call Center Mobile Apps Parallel Batch ProcessingDistributed Filesystem Microservice NoSQLRDBMS SQL Search BI Tools Enterprise Data Warehouse Search / Explore Online & Mobile Apps File Import / SQL Import Weather Data { } API Event Hub Event Hub Event Hub
  • 9. Container Hadoop Clusterd Hadoop Cluster Big Data Cluster Location Social Click stream Sensor Data Billing & Ordering CRM / Profile Marketing Campaigns Call Center Mobile Apps Parallel Batch ProcessingDistributed Filesystem Microservice NoSQLRDBMS SQL Search BI Tools Enterprise Data Warehouse Search / Explore Online & Mobile Apps File Import / SQL Import Weather Data { } API Event Hub Event Hub Event Hub Integrate Sanitize / Normalize Deliver
  • 10. IoT GW MQTT Broker Continuous Ingestion - DataFlow Pipelines DB Source Big Data Log Stream Processing IoT Sensor Event Hub Topic Topic REST Topic IoT GW CDC GW Connect CDC DB Source Log CDC Native IoT Sensor IoT Sensor 12 Dataflow GW Topic Topic Queue Messaging GW Topic Dataflow GW Dataflow Topic REST 12 File Source Log Log Log Social Native
  • 11. DataFlow Pipeline • Flow-based ”programming” • Ingest Data from various sources • Extract – Transform – Load • High-Throughput, straight-through data flows • Data Lineage • Batch- or Stream-Processing • Visual coding with flow editor • Event Stream Processing (ESP) but not Complex Event Processing (CEP) Source: Confluent
  • 12. SQL Polling Change Data Capture (CDC) File Polling (File Tailing) File Stream (Appender) Continuous Ingestion – Integrating data sources Sensor Stream
  • 13. Ingestion with/without Transformation? Zero Transformation • No transformation, plain ingest, no schema validation • Keep the original format – Text, CSV, … • Allows to store data that may have errors in the schema Format Transformation • Prefer name of Format Translation • Simply change the format • Change format from Text to Avro • Does schema validation Enrichment Transformation • Add new data to the message • Do not change existing values • Convert a value from one system to another and add it to the message Value Transformation • Replaces values in the message • Convert a value from one system to another and change the value in-place • Destroys the raw data!
  • 14. Why is Data Ingestion Difficult? Physical and Logical Infrastructure changes rapidly Key Challenges: Infrastructure Automation Edge Deployment Infrastructure Drift Data Structures and formats evolve and change unexpectedly Key Challenges: Consumption Readiness Corruption and Loss Structure Drift Data semantics change with evolving applications Key Challenges Timely Intervention System Consistency Semantic Drift Source: Streamsets
  • 15. Challenges for Ingesting Sensor Data • Multitude of sensors • Real-Time Streaming • Multiple Firmware versions • Bad Data from damaged sensors • Regulatory Constraints • Data Quality Source: Cloudera
  • 18. Apache NiFi • Originated at NSA as Niagarafiles – developed behind closed doors for 8 years • Open sourced December 2014, Apache Top Level Project July 2015 • Look-and-Feel modernized in 2016 • Opaque, “file-oriented” payload • Distributed system of processors with centralized control • Based on flow-based programming concepts • Data Provenance and Data Lineage • Web-based user interface
  • 19. Processors for Source and Sink • ConsumeXXXX (AMQP, EWS, IMAP, JMS, Kafka, MQTT, POP3, …) • DeleteXXXX (DynamoDB, Elasticsearch, HDFS, RethinkDB, S3, SQS, ...) • FetchXXXX (AzureBlobStorage, ElasticSearch, File, FTP, HBase, HDFS, S3 ...) • ExecuteXXXX (FlumeSink, FlumeSource, Script, SQL, ...) • GetXXXX (AzureEventHub, Couchbase, DynamoDB, File, FTP, HBase, HDFS, HTTP, Ignite, JMSQueue, JMSTopic, Kafka, Mongo, Solr, Splunk, SQS, TCP, ...) • ListenXXXX (HTTP, RELP, SMTP, Syslog, TCP, UDP, WebSocket, ...) • PublishXXXX (Kafka, MQTT) • PutXXXX (AzureBlobStorage, AzureEventHub, CassandraQL, CloudWatchMetric, Couchbase, DynamoDB, Elasticsearch, Email, FTP, File, Hbase, HDFS, HiveQL, Kudu, Lambda, Mongo, Parquet, Slack, SQL, TCP, ....) • QueryXXXX (Cassandra, DatabaseTable, DNS, Elasticserach)
  • 20. Processors for Processing • ConvertXxxxToYyyy • ConvertRecord • EnforceOrder • EncryptContent • ExtractXXXX (AvroMetdata, EmailAttachments, Grok, HL7Attributes, ImageMetadata, ...) • GeoEnrichIP • JoltTransformJSON • MergeContent • ReplaceText • ResizeImage • SplitXXXX (Avro, Content, JSON, Record, Xml, ...) • TailFile • TransformXML • UpdateAttribute
  • 21. Demo Case Truck-2 truck/nn/ position Truck-1 Truck-3 truck position raw truck/nn/ positionTruck-4 Truck-5 Raw Data Store MQTT to Kafka Kafka to Raw {"truckid":"57","driverid":"15","routeid":"1927624662 ","eventtype":"Normal","latitude":"38.65","longitude": "-90.21","correlationId":"4412891759760421296"} Port: 1883 Port: 1884
  • 22. Demo: Dataflow for MQTT to Kafka
  • 25. Demo: Masking Field with ReplaceText Processor
  • 27. StreamSets Data Collector • Founded by ex-Cloudera, Informatica employees • Continuous open source, intent-driven, big data ingest • Visible, record-oriented approach fixes combinatorial explosion • Batch or stream processing • Standalone, Spark cluster, MapReduce cluster • IDE for pipeline development by ‘civilians’ • Relatively new - first public release September 2015 • So far, vast majority of commits are from StreamSets staff
• 28. StreamSets Origins Source: https://streamsets.com/connectors An origin stage represents the source of a pipeline; you can use only a single origin stage per pipeline. The origins on the right are available out of the box; an API is provided for writing custom origins
• 29. StreamSets Processors A processor stage represents a type of data processing that you want to perform; you can use as many processors in a pipeline as you need. Programming languages supported: • Java • JavaScript • Jython • Groovy • Java Expression Language (EL) • Spark Some of the processors available out of the box: • Expression Evaluator • Field Flattener • Field Hasher • Field Masker • Field Merger • Field Order • Field Splitter • Field Zip • Groovy Evaluator • JDBC Lookup • JSON Parser • Spark Evaluator • …
• 30. StreamSets Destinations A destination stage represents the target for a pipeline; you can use one or more destinations in a pipeline. The destinations on the right are available out of the box; an API is provided for writing custom destinations Source: https://streamsets.com/connectors
• 31. Demo Case: Trucks 1–5 publish their positions to MQTT topics truck/nn/position via two brokers (ports 1883 and 1884). Two pipelines, "MQTT-1 to Kafka" and "MQTT-2 to Kafka" (the latter running on Data Collector Edge), forward the events to the Kafka topic truck_position; a "Kafka to Raw" pipeline lands them in the Raw Data Store. Sample message: {"truckid":"57","driverid":"15","routeid":"1927624662","eventtype":"Normal","latitude":"38.65","longitude":"-90.21","correlationId":"4412891759760421296"}
  • 32. Demo: Dataflow for MQTT to Kafka
  • 35. Demo: Dataflow for MQTT to Kafka
  • 37. Demo: Sending Message to Kafka in Avro
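Avro-encoded records on the topic can be inspected with Confluent's Avro console consumer, which decodes messages against the Schema Registry. A sketch; the broker and registry addresses are assumptions:

```shell
#!/bin/bash
# Read Avro-encoded truck positions, decoding them via the Schema Registry.
# Broker and registry addresses are assumptions; adjust to your setup.
BROKER=localhost:9092
REGISTRY=http://localhost:8081
TOPIC=truck_position
# Remove the leading 'echo' to actually consume (the tool ships with Confluent Platform):
echo kafka-avro-console-consumer --bootstrap-server "$BROKER" --topic "$TOPIC" --property schema.registry.url="$REGISTRY"
```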
  • 38. StreamSets Dataflow Performance Manager • Map dataflows to topologies, manage releases & track changes • Measure KPIs and establish baselines for data availability and accuracy • Master dataflow operations through Data SLAs Source: https://streamsets.com/connectors
  • 40. Kafka Connect - Overview Source Connector Sink Connector
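Connect workers run either standalone or distributed. A sketch of launching each mode with the sample properties files that ship with Apache Kafka; the file paths are assumptions:

```shell
#!/bin/bash
# Standalone mode: one worker, connector config passed on the command line.
STANDALONE="connect-standalone config/connect-standalone.properties config/connect-file-source.properties"
# Distributed mode: workers form a group; connectors are managed via the REST API (port 8083).
DISTRIBUTED="connect-distributed config/connect-distributed.properties"
# Printed rather than executed here; run the commands directly on a Kafka host.
echo "$STANDALONE"
echo "$DISTRIBUTED"
```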
  • 41. Kafka Connect – Single Message Transforms (SMT) Simple Transformations for a single message Defined as part of Kafka Connect • some useful transforms provided out-of-the-box • Easily implement your own Optionally deploy 1+ transforms with each connector • Modify messages produced by source connector • Modify messages sent to sink connectors Makes it much easier to mix and match connectors Some of currently available transforms: • InsertField • ReplaceField • MaskField • ValueToKey • ExtractField • TimestampRouter • RegexRouter • SetSchemaMetaData • Flatten • TimestampConverter
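The transforms above are attached declaratively in the connector configuration. A hedged sketch of masking a field with the out-of-the-box MaskField SMT; the transform alias and the choice of field are assumptions for illustration:

```shell
#!/bin/bash
# Keys to merge into a connector's "config" map to mask a field in every record.
# The alias "mask" and the field "correlationId" are illustrative choices.
CONFIG='{
  "transforms": "mask",
  "transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value",
  "transforms.mask.fields": "correlationId"
}'
# Apply e.g. via PUT http://<connect-host>:8083/connectors/<name>/config
echo "$CONFIG"
```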
  • 42. Kafka Connect – Many Connectors 60+ since first release (0.9+) 20+ from Confluent and Partners Source: http://www.confluent.io/product/connectors Confluent supported Connectors Certified Connectors Community Connectors
• 43. Demo Case: Trucks 1–5 publish their positions to MQTT topics truck/nn/position via two brokers (ports 1883 and 1884). Two connectors, "MQTT-1 to Kafka" and "MQTT-2 to Kafka", forward the events to the Kafka topic truck_position; a "Kafka to Raw" connector lands them in the Raw Data Store. Sample message: {"truckid":"57","driverid":"15","routeid":"1927624662","eventtype":"Normal","latitude":"38.65","longitude":"-90.21","correlationId":"4412891759760421296"}
• 44. Demo: Dataflow for MQTT to Kafka
#!/bin/bash
curl -X "POST" "http://192.168.69.138:8083/connectors" \
  -H "Content-Type: application/json" \
  -d $'{
  "name": "mqtt-source",
  "config": {
    "connector.class": "com.datamountaineer.streamreactor.connect.mqtt.source.MqttSourceConnector",
    "connect.mqtt.connection.timeout": "1000",
    "tasks.max": "1",
    "connect.mqtt.kcql": "INSERT INTO truck_position SELECT * FROM truck/+/position",
    "name": "MqttSourceConnector",
    "connect.mqtt.service.quality": "0",
    "connect.mqtt.client.id": "tm-mqtt-connect-01",
    "connect.mqtt.converter.throw.on.error": "true",
    "connect.mqtt.hosts": "tcp://mosquitto-1:1883"
  }
}'
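Once POSTed, the connector can be inspected through the same Connect REST API. A sketch using the demo's Connect host:

```shell
#!/bin/bash
# Inspect the connector and its tasks via the Kafka Connect REST API.
# The host/port come from the demo; adjust to your environment.
CONNECT_URL="http://192.168.69.138:8083"
# Remove the leading 'echo' to actually query:
echo curl -s "$CONNECT_URL/connectors/mqtt-source/status"
echo curl -s "$CONNECT_URL/connectors"
```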
• 46. Summary
Apache NiFi • visual dataflow modelling • very powerful – “with power comes responsibility” • special package for Edge computing • data lineage and data provenance • support for backpressure • no transport mechanism (DEV/TST/PROD) • custom processors • supported by Hortonworks
StreamSets • visual dataflow modelling • very powerful – “with power comes responsibility” • special package for Edge computing • data lineage and data provenance • no transport mechanism • custom sources, sinks, processors • supported by StreamSets
Kafka Connect • declarative style data flows • simplicity – “simple things done simple” • very well integrated with Kafka – comes with Kafka • Single Message Transforms (SMT) • use Kafka Streams for complex data flows • custom connectors • supported by Confluent
  • 47. Technology on its own won't help you. You need to know how to use it properly.