1 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Hortonworks Data Flow
Wrangling the Internet of Things
Pat Alwell – Solutions Engineer
Big Data Day Los Angeles, CA
August 2018
whoami
Pat Alwell
Solutions Engineer Hortonworks
AWS | Spark | Hadoop Admin
Career Started at Algebraix Data
Connect with me:
GitHub: https://github.com/patalwell
Email: palwell@hortonworks.com
Goals for the Session
• Demonstrate how organizations can leverage
Hortonworks DataFlow (HDF) to wrangle the Internet
of Things
Apache NiFi Managed Dataflow
[Diagram tiers: Sources, Regional Infrastructure, Core Infrastructure]
Nothing in HDF Makes Sense Except in Light
of Flow Based Programming
[Diagram: HDF components: Flow Management, Administration, Streaming]
Flow-based programming is an
abstraction of information packets,
algorithmic transformations, and a
common set of connections. The flow of
data is essentially equivalent to a
production line. Raw material is pushed
or pulled into a process and transformed
to meet an end goal.
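The production-line idea above can be sketched in a few lines of Python. This is only an illustration of the concept, not NiFi code: the names `run_line`, `parse`, and `convert` are hypothetical, and packets are modeled as plain dictionaries.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, List

Packet = Dict[str, object]  # an "information packet" (hypothetical model)


def run_line(raw: Iterable[Packet],
             steps: List[Callable[[Packet], Packet]]) -> List[Packet]:
    """Move packets through each step, with a queue (connection) between steps."""
    queues: List[Deque[Packet]] = [deque(raw)] + [deque() for _ in steps]
    for i, step in enumerate(steps):
        while queues[i]:
            # Pull from the upstream connection, transform, push downstream.
            queues[i + 1].append(step(queues[i].popleft()))
    return list(queues[-1])


def parse(packet: Packet) -> Packet:
    """Turn the raw reading into a number."""
    return {**packet, "celsius": float(packet["raw"])}


def convert(packet: Packet) -> Packet:
    """Add a Fahrenheit value derived from the Celsius value."""
    return {**packet, "fahrenheit": packet["celsius"] * 9 / 5 + 32}


finished = run_line([{"raw": "20.0"}, {"raw": "37.5"}], [parse, convert])
```

Each function plays the role of a station on the line, and the queues stand in for the connections between stations.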
NiFi Terminology
• Connection = Route between processors
  – Queues that can be dynamically prioritized
• Process Group = Logical group of processors and their connections
  – Receives data via input ports, sends data via output ports
• FlowFile = Unit of data moving through the system
  – Content + Attributes (Metadata)
• Processor = Processes data
  – Transforms/Writes FlowFiles
  – Creates Provenance
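As a rough mental model of these terms, a FlowFile can be pictured as a payload plus a map of attributes, and a processor as a function that produces a new FlowFile and records what it did. The sketch below is an assumption-laden illustration in Python, not NiFi's Java API: `FlowFile`, `to_upper`, and the `provenance` list are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class FlowFile:
    """Unit of data: content (payload) plus attributes (metadata)."""
    content: bytes
    attributes: Dict[str, str] = field(default_factory=dict)


# Hypothetical stand-in for a provenance repository: a simple event log.
provenance: List[str] = []


def to_upper(flow_file: FlowFile) -> FlowFile:
    """A toy processor: transforms the content and records a provenance event."""
    name = flow_file.attributes.get("filename", "unknown")
    provenance.append(f"content modified by to_upper: {name}")
    # Attributes travel with the data; the content is replaced.
    return FlowFile(flow_file.content.upper(), dict(flow_file.attributes))


incoming = FlowFile(b"temp=21.5", {"filename": "sensor-7.log"})
outgoing = to_upper(incoming)
```

Keeping attributes separate from content is what lets routing decisions be made on metadata without reading the payload.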
What is HDF?
How Can We Manage Flows with HDF?
HDF component: Flow Management
Apache NiFi supports powerful and scalable directed
graphs of data routing, transformation, and system
mediation logic.
Apache MiNiFi is a complementary data collection
approach that supplements the core tenets of NiFi in
dataflow management, focusing on the collection of data
at the source of its creation.
Visual Command and Control
A convenient graphical user interface that supports flow-based programming
• Drag and drop processors to build a flow
• Start, stop, and configure components in real time
• View errors and corresponding error messages
• View statistics and health of data flow
• Create templates of common processors and connections
Visual Command and Control
Over 200 processors designed to help you capture and deliver data to and from common sources
Examples include:
• Capturing logs from mobile devices and sensors, formatting them with regex, and pushing them into HDFS or S3
• Collecting sensor readings from GPIO headers and delivering the information to an application via Kafka
• Customer sentiment analysis by joining social media data to customer information within Hive or Phoenix
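To make the first example concrete, here is a minimal Python sketch of the "format logs with regex" step on its own. The log layout (timestamp, device id, `temp=` reading) and the names `LOG_PATTERN` and `parse_line` are assumptions made for illustration; in NiFi this work would be configured on a processor rather than written as a script.

```python
import re
from typing import Dict, Optional, Union

# Assumed log layout: "<timestamp> <device> temp=<number>"
LOG_PATTERN = re.compile(
    r"^(?P<ts>\S+) (?P<device>\S+) temp=(?P<temp>-?\d+(?:\.\d+)?)$"
)


def parse_line(line: str) -> Optional[Dict[str, Union[str, float]]]:
    """Return the structured fields of a log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record: Dict[str, Union[str, float]] = dict(match.groupdict())
    record["temp"] = float(match.group("temp"))
    return record


good = parse_line("2018-08-11T10:00:00Z sensor-7 temp=21.5\n")
bad = parse_line("not a sensor reading")
```

Lines that do not match return `None`, which mirrors the idea of routing unparseable data to a separate failure path instead of dropping it silently.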
220+ Processors for Deeper Ecosystem Integration
Actions: Hash, Extract, Merge, Duplicate, Scan, GeoEnrich, Replace, Convert, Split, Translate, Route Content, Route Context, Route Text, Control Rate, Distribute Load, Generate Table Fetch, Jolt Transform JSON, Prioritized Delivery, Encrypt, Tail, Evaluate, Execute, Fetch
Protocols and formats: HTTP, Syslog, Email, HTML, Image, HL7, FTP, UDP, XML, SFTP, AMQP, WebSocket
All Apache project logos are trademarks of the ASF and the respective projects.
Provenance and Lineage
Payload Prioritization
• Configure a prioritizer per connection
• Determine what is important for your data: time based, arrival order, importance of a data set
• Funnel many connections down to a single connection to prioritize across data sets
• Develop your own prioritizer if needed
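The idea of a per-connection prioritizer can be sketched as a queue ordered by a pluggable key function. This is an illustrative Python model only: `PrioritizedConnection` and the `priority` field are hypothetical, and NiFi's own prioritizers are Java components configured on a connection.

```python
import heapq
import itertools
from typing import Any, Callable, Dict, List, Tuple

Item = Dict[str, Any]


class PrioritizedConnection:
    """A queue whose dequeue order is decided by a prioritizer (key function)."""

    def __init__(self, prioritizer: Callable[[Item], Any]) -> None:
        self._prioritizer = prioritizer
        self._heap: List[Tuple[Any, int, Item]] = []
        self._arrival = itertools.count()  # tie-breaker: arrival order

    def put(self, item: Item) -> None:
        entry = (self._prioritizer(item), next(self._arrival), item)
        heapq.heappush(self._heap, entry)

    def get(self) -> Item:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


# Lower number means more important in this sketch.
conn = PrioritizedConnection(prioritizer=lambda item: int(item["priority"]))
conn.put({"id": "bulk-archive", "priority": "5"})
conn.put({"id": "alarm", "priority": "1"})
conn.put({"id": "alarm-followup", "priority": "1"})
order = [conn.get()["id"] for _ in range(len(conn))]
```

Swapping the key function (for example, to a timestamp) changes the policy without touching the queue, which is the sense in which a custom prioritizer can be developed if needed.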
Back-Pressure
Latency vs. Throughput
MiNiFi's Key Features
An embedded extension that supports flow-based programming on the edge
Agents provide:
• Small and lightweight footprint
• Central management of agents
• Generation of data provenance
• Integration with NiFi for follow-on dataflow management and full chain of custody of information
Java agent:
• < 40 MB binary distribution
• Requires JRE 1.8
• More feature complete
• Targeted for any system that can run a JVM
• Supported processors: https://github.com/apache/nifi-minifi/blob/6ddf8bb0ee3614320a53ce7f2e0b3950ee4d9c5f/minifi-docs/src/main/markdown/minifi-java-agent-quick-start.md
C++ agent:
• Dynamic heap of ~1 MB, depending on use case
• Targeted for resource-constrained environments
• Supported processors: https://github.com/apache/nifi-minifi-cpp/blob/master/PROCESSORS.md
How Can We Administer and Secure Flow Activity?
HDF component: Administration
"The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs." (https://ambari.apache.org/)
Cluster Administration and Role Based Security
Version Control for Flows
• NiFi Registry: sub-project of Apache NiFi
  – https://github.com/apache/nifi-registry
  – https://issues.apache.org/jira/projects/NIFIREG
• Complementary application, central location for storage/management of "versioned" resources
• Initial capability to store and retrieve "versioned flows"
• Integration on the NiFi side
  – Start/stop version control of a process group
  – Change version (upgrade/downgrade)
  – Import a new process group from a version
Variable Registry: Flow Ubiquity
• Parameterize configuration such as connection strings, file paths, etc.
• Referenced via Expression Language
  – Kafka Brokers = ${kafka.brokers}
• Variables are associated with a process group
  – Right-click on the canvas to view variables for the current process group
• Hierarchical order of precedence: a reference resolves to the closest definition relative to the component
• Editing variables automatically restarts any components referencing the variables!
[Diagram: nested process groups (Level 1, Level 2), each with its own variables]
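The "closest reference wins" rule can be sketched as a walk up the process-group hierarchy. The Python below is a simplified model under stated assumptions: `ProcessGroup` and `resolve` are hypothetical names, the broker addresses are made up, and only plain `${name}` references are handled, not the full Expression Language.

```python
import re
from typing import Dict, Optional


class ProcessGroup:
    """A process group with its own variables and an optional parent group."""

    def __init__(self, name: str, variables: Dict[str, str],
                 parent: Optional["ProcessGroup"] = None) -> None:
        self.name = name
        self.variables = variables
        self.parent = parent

    def lookup(self, key: str) -> Optional[str]:
        """Search this group first, then each ancestor, nearest first."""
        group: Optional[ProcessGroup] = self
        while group is not None:
            if key in group.variables:
                return group.variables[key]
            group = group.parent
        return None


_REFERENCE = re.compile(r"\$\{([^}]+)\}")


def resolve(text: str, group: ProcessGroup) -> str:
    """Replace ${name} references; unknown names are left untouched."""
    def substitute(match: "re.Match[str]") -> str:
        value = group.lookup(match.group(1))
        return match.group(0) if value is None else value
    return _REFERENCE.sub(substitute, text)


root = ProcessGroup("root", {"kafka.brokers": "core-broker:6667"})
edge = ProcessGroup("edge", {"kafka.brokers": "edge-broker:6667"}, parent=root)
ingest = ProcessGroup("ingest", {}, parent=root)
```

Here `resolve("${kafka.brokers}", edge)` picks up the override defined on `edge`, while the same reference inside `ingest` falls back to the value on `root`.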
Schema Registry
⬢ Data Governance
– Centralized registry to provide reusable schemas
– Version management to define relationships between schemas
– Validation to enable generic format conversion and generic routing
⬢ Operational Efficiency
– Centralized registry to avoid attaching a schema to every piece of data
– Version management so that consumers and producers can evolve at different rates
– Validation to ensure data quality
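The three ideas on this slide (central storage, versioning, validation) can be illustrated with a small in-memory model. This Python sketch is not the Hortonworks Schema Registry API: `SchemaRegistry`, its methods, and the `weather` schema are hypothetical, and a schema is reduced to a mapping of field names to Python types.

```python
from typing import Any, Dict, List, Optional

Schema = Dict[str, type]


class SchemaRegistry:
    """Stores every version of each named schema in one central place."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Schema]] = {}

    def register(self, name: str, schema: Schema) -> int:
        """Add a new version and return its 1-based version number."""
        self._versions.setdefault(name, []).append(dict(schema))
        return len(self._versions[name])

    def get(self, name: str, version: Optional[int] = None) -> Schema:
        """Fetch a specific version, or the latest when version is None."""
        versions = self._versions[name]
        return versions[-1] if version is None else versions[version - 1]

    def validate(self, name: str, record: Dict[str, Any],
                 version: Optional[int] = None) -> bool:
        """A record is valid if it has exactly the schema's fields and types."""
        schema = self.get(name, version)
        if set(record) != set(schema):
            return False
        return all(isinstance(record[key], kind) for key, kind in schema.items())


registry = SchemaRegistry()
v1 = registry.register("weather", {"station": str, "temp_c": float})
v2 = registry.register("weather", {"station": str, "temp_c": float, "humidity": float})
reading = {"station": "LAX", "temp_c": 21.5}
```

Because records refer to a schema by name and version instead of carrying it, a producer still writing version 1 and a consumer that understands version 2 can coexist.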
How Can We Take Advantage of Streaming Computations?
HDF component: Streaming
What is Kafka?
A distributed streaming platform that supports pub-sub systems
• Publish and subscribe to streams of records, similar to a message queue
• Store streams of records in a fault-tolerant way
• Process streams of records as they occur
• A topic is a partitioned log of events
Generally used to…
• Build real-time streaming data pipelines to reliably transfer data between systems
• Build real-time streaming applications that transform or react to the streams of data
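The phrase "a topic is a partitioned log" can be illustrated with a toy in-memory model. This Python sketch is not a Kafka client: `Topic`, `publish`, and `read` are hypothetical names, and CRC32 is used for key hashing purely to keep the example deterministic (it is an assumption, not Kafka's partitioner).

```python
import zlib
from typing import List, Tuple

Record = Tuple[str, str]  # (key, value)


class Topic:
    """An append-only log split into partitions; records are never removed on read."""

    def __init__(self, name: str, partitions: int) -> None:
        self.name = name
        self.partitions: List[List[Record]] = [[] for _ in range(partitions)]

    def publish(self, key: str, value: str) -> Tuple[int, int]:
        """Append to the partition chosen by the key; return (partition, offset)."""
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition: int, offset: int) -> List[Record]:
        """Return every record at or after the offset; consumers track offsets."""
        return self.partitions[partition][offset:]


topic = Topic("sensor-readings", partitions=3)
first = topic.publish("sensor-7", "temp=21.5")
second = topic.publish("sensor-7", "temp=21.9")
replay = topic.read(first[0], 0)
```

Records with the same key land in the same partition in publish order, and because reading does not delete anything, a second consumer can replay the log from offset 0.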
Streams Messaging Manager (*NEW)
"Kafka Blindness": customers who use Kafka today struggle with monitoring and managing Kafka clusters.
What is Storm?
A distributed, fault-tolerant service that processes streams of data
• Spout: captures data from external systems (Kafka, HBase, Hive, HDFS, and AWS Kinesis)
• Bolt: transforms and aggregates that data using filter, map, flatMap, aggregate, reduce, count, etc.
• Bolt: writes data back to an external system for storage or visualization (HBase, Hive, Druid)
• The chain is known as a topology. The topology runs under a master-slave type architecture.
Generally used for… *
• Processing streams: no need for intermediate queues.
• Continuous computation: send data to clients continuously so they can update and show results in real time, such as site metrics.
• Distributed remote procedure call: easily parallelize CPU-intensive operations.
* Leibiusky, Jonathan. Getting Started with Storm: Continuous Streaming Computation with Twitter's Cluster Technology (p. 1). O'Reilly Media. Kindle Edition.
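The spout-to-bolt chain can be pictured with Python generators standing in for the components. This is a single-process sketch of the topology shape only, not Storm's API and not distributed: `sentence_spout`, `split_bolt`, `count_bolt`, and `run_topology` are hypothetical names.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def sentence_spout(lines: Iterable[str]) -> Iterator[str]:
    """Spout: emits tuples from an external source (here, an in-memory list)."""
    for line in lines:
        yield line


def split_bolt(stream: Iterable[str]) -> Iterator[str]:
    """Bolt: a flatMap-style transformation from sentences to lowercase words."""
    for sentence in stream:
        for word in sentence.split():
            yield word.lower()


def count_bolt(stream: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Bolt: keeps a running count and emits the updated total for each word."""
    counts: Counter = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def run_topology(lines: Iterable[str]) -> Dict[str, int]:
    """Wire spout -> bolt -> bolt and collect the latest count per word."""
    latest: Dict[str, int] = {}
    for word, count in count_bolt(split_bolt(sentence_spout(lines))):
        latest[word] = count
    return latest


totals = run_topology(["the quick fox", "The lazy dog"])
```

Each stage consumes tuples as soon as the previous stage emits them, which is the sense in which no intermediate queue or batch step is required.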
What does our flow look like?
Join Us: Hortonworks Community Connection
⬢ Full Q&A Platform (like StackOverflow)
⬢ Knowledge Base Articles
⬢ Code Gallery and Samples
⬢ https://community.hortonworks.com
Care to Learn More?
⬢ Download our HDF Sandbox for Docker, VMware, or VirtualBox:
https://hortonworks.com/downloads/#sandbox
⬢ Follow our HDF tutorials:
https://hortonworks.com/tutorial/analyze-iot-weather-station-data-via-connected-data-architecture/
⬢ Reach out to an Enterprise Account Manager
Questions?
Apache NiFi / ETL Tools
NiFi
NOT schema dependent
• Dataflow management for both structured
and unstructured data, powered by
separation of metadata and payload
• Schema is not required, but you can have
schema
• Minimum modeling effort, just enough to
manage dataflows
• Do the plumbing job, maximize developers’
brainpower for creative work
⚠ Not designed to do heavy-lifting transformation
work for DB tables (JOIN datasets, etc.). You
can create custom processors to do that, but
NiFi has a long way to go to catch up with existing
ETL tools from a user experience perspective (GUI for
data wrangling, cleansing, etc.)
ETL (Informatica, etc.)
Schema dependent
• Tailored for databases and data warehouses
• ETL operations based on schema/data
modeling
• Highly efficient, optimized performance
⚠ Must prepare your data in advance; building data
models and maintaining schemas is time consuming
⚠ Not geared towards handling unstructured data,
PDF, Audio, Video, etc.
⚠ Not designed to solve dataflow problems
Apache NiFi / Integration, or ingestion, Frameworks
NiFi
End user facing dataflow management
tool
• Out of the box solution for dataflow
management
• Interactive command and control in the core,
design and deploy on the edge
• Flexible failure handling at each point of the
flow
• Visual representation of global dataflow and
connectivities
• Native cross data center communication
• Data provenance for traceability
⚠ Not a library to be embedded in other
applications
Integration framework (Spring
Integration, Camel, etc), ingestion
framework (Flume, etc)
Developer facing integration tool with a
focus on data ingestion
• A set of tools to orchestrate workflow
• A fixed design and deploy pattern
• Leverage messaging bus across
disconnected networks
⚠ Developer facing, custom coding needed to
optimize
⚠ Pre-built failure handling, lack of flexibility
⚠ No holistic view of global dataflow
⚠ No built-in data traceability
Apache NiFi / Messaging Bus Services
NiFi
Provide dataflow solution
• Centralized management, from edge to
core
• Great traceability, event level data
provenance starting when data is born
• Interactive command and control – real
time operational visibility
• Dataflow management, including
prioritization, back pressure, and edge
intelligence
• Visual representation of global dataflow
⚠ Not a messaging bus, flow maintenance
needed when you have frequent consumer
side updates
Messaging Bus (Kafka, JMS, etc.)
Provide messaging bus service
• Low latency
• Great data durability
• Decentralized management (producers &
consumers)
• Low broker maintenance for dynamic
consumer side updates
⚠ Not designed to solve dataflow problems
(prioritization, edge intelligence, etc.)
⚠ Traceability limited to in/out of topics, no lineage
⚠ Lack of global view of
components/connectivities
Apache NiFi / Processing Frameworks
NiFi
Simple event processing
• Primarily feed data into processing
frameworks, can process data, with a
focus on simple event processing
• Operate on a single piece of data, or in
correlation with an enrichment dataset
(enrichment, parsing, splitting, and
transformations)
• Can scale out, but scales up better to
take full advantage of hardware
resources and run concurrent processing
tasks/threads (processing terabytes of
data per day on a single node)
⚠ Not another distributed processing
framework; its role is to feed data into those
frameworks
Processing Frameworks (Storm, Spark, etc.)
Complex and distributed processing
• Complex processing from multiple streams
(JOIN operations)
• Analyzing data across time windows (rolling
window aggregation, standard deviation,
etc.)
• Scale out to thousands of nodes if needed
⚠ Not designed to collect data or manage data
flow
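For a sense of what "analyzing data across time windows" means, here is a minimal single-machine Python sketch of a rolling-window mean and standard deviation. It is only an illustration of the computation: `rolling_stats` is a hypothetical helper, the window is count-based instead of time-based, and frameworks such as Storm or Spark would distribute this work across nodes.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Deque, Iterable, Iterator, Tuple


def rolling_stats(values: Iterable[float],
                  size: int) -> Iterator[Tuple[float, float]]:
    """Yield (mean, population std dev) for each full window of the last `size` values."""
    window: Deque[float] = deque(maxlen=size)  # oldest value drops off automatically
    for value in values:
        window.append(value)
        if len(window) == size:
            yield mean(window), pstdev(window)


readings = [1.0, 2.0, 3.0, 4.0]
stats = list(rolling_stats(readings, size=2))
```

Unlike the per-record work described for NiFi, each result here depends on several records at once, which is why this kind of computation is handed to a processing framework.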

Data Con LA 2018 - Streaming and IoT by Pat Alwell

  • 1. 1 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Hortonworks Data Flow Wrangling the Internet of Things Pat Alwell – Solutions Engineer Big Data Day Los Angeles ,CA August 2018
  • 2. 2 © Hortonworks Inc. 2011 – 2017. All Rights Reserved whoami Pat Alwell Solutions Engineer Hortonworks AWS | Spark | Hadoop Admin Career Started at Algebraix Data Connect with me: GitHub  https://github.com/patalwell Email  [email protected] Goals for the Session • Demonstrate how organizations can leverage Hortonworks Dataflow (HDF) to wrangle the Internet of Things
  • 3. 3 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Apache NiFi Managed Dataflow SOURCES REGIONAL INFRASTRUCTURE CORE INFRASTRUCTURE
  • 4. 4 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Nothing in HDF Makes Sense Except in Light of Flow Based Programming Flow Management Administration HDF Streaming Flow-based programming is an abstraction of information packets, algorithmic transformations, and a common set of connections. The flow of data is essentially equivocal to a production line. Raw material is pushed or pulled into a process and transformed to meet an end goal.
  • 5. 5 © Hortonworks Inc. 2011 – 2017. All Rights Reserved • Connection = Route between processors  Queues that can be dynamically prioritized • Process Group = Logical Group of processors and their connections  Receive data via input ports, send data via output ports  FlowFile = Unit of data moving through the system  Content + Attributes (Metadata)  Processor = Process data  Transforms/Writes FlowFiles  Creates Provenance Nifi Terminology
  • 6. 6 © Hortonworks Inc. 2011 – 2017. All Rights Reserved What is HDF?
  • 7. 7 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Flow Management HDF How Can we Manage Flows with HDF? Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. Apache MiNiFi is a complementary data collection approach that supplements the core tenets of NiFi in dataflow management, focusing on the collection of data at the source of its creation.
  • 8. 8 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Visual Command and Control A Convenient Graphical User Interface that supports Flow Based Programming • Drag and drop processors to build a flow • Start, stop, and configure components in real time • View errors and corresponding error messages • View statistics and health of data flow • Create templates of common processor & connections
  • 9. 9 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Visual Command and Control Over 200 + Processors designed to help you capture and deliver data to and from common sources Examples Include: -Capturing Logs from Mobile Devices and Sensors; formatting said logs with Regex, pushing said logs into HDFS or S3 -Collecting sensor readings from GPIO headers and delivering the information to an application via Kafka -Customer sentiment analysis by joining social media data to customer information within Hive or Phoenix
  • 10. 10 © Hortonworks Inc. 2011 – 2017. All Rights Reserved 220+ Processors for Deeper Ecosystem Integration Hash Extract Merge Duplicate Scan GeoEnrich Replace ConvertSplit Translate Route Content Route Context Route Text Control Rate Distribute Load Generate Table Fetch Jolt Transform JSON Prioritized Delivery Encrypt Tail Evaluate Execute All Apache project logos are trademarks of the ASF and the respective projects. Fetch HTTP Syslog Email HTML Image HL7 FTP UDP XML SFTP AMQP WebSocket
  • 11. 11 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Provenance and Lineage
  • 12. 12 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Payload Prioritization • Configure a prioritizer per connection • Determine what is important for your data – time based, arrival order, importance of a data set • Funnel many connections down to a single connection to prioritize across data sets • Develop your own prioritizer if needed
  • 13. 13 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Back-Pressure
  • 14. 14 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Latency vs. Throughput
  • 15. 15 © Hortonworks Inc. 2011 – 2017. All Rights Reserved • Java • < 40MB binary distribution • Requires JRE 1.8 • More feature complete • Targeted for any system that can run a JVM • Supported Processors: https://github.com/apache/nifi- minifi/blob/6ddf8bb0ee3614320a53ce7f2e0b3950ee4d9c5f /minifi-docs/src/main/markdown/minifi-java-agent-quick- start.md • C++ • Dynamic heap of ~1MB based on use-case • Targeted for resource constrained environments • Supported Processors: https://github.com/apache/nifi- minifi-cpp/blob/master/PROCESSORS.md Minifi’s Key Features An Embedded Extension that supports Flow Based Programming on the Edge Agents Provide: • Small and lightweight footprint • Central management of agents • Generation of data provenance • Integration with NiFi for follow-on dataflow management and full chain of custody of information
  • 16. 16 © Hortonworks Inc. 2011 – 2017. All Rights Reserved How can we Administer and Secure flow activity? Administration HDF The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs. - https://ambari.apache.org/
  • 17. 17 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Cluster Administration and Role Based Security
  • 18. 18 © Hortonworks Inc. 2011 – 2017. All Rights Reserved • NiFi Registry - sub-project of Apache NiFi • https://github.com/apache/nifi-registry • https://issues.apache.org/jira/projects/NIFIREG • Complimentary application, central location for storage/management of “versioned” resources • Initial capability to store and retrieve “versioned flows” • Integration on NiFi side • Start/Stop version control of a process group • Change version (upgrade/downgrade) • Import new process group from a version Version Control for Flows
  • 19. 19 © Hortonworks Inc. 2011 – 2017. All Rights Reserved • Parameterize configuration like connection strings, file paths, etc. • Referenced via Expression language • Kafka Brokers = ${kafka.brokers} • Variables associated with a process group • Right-click on canvas to view variables for current process group • Hierarchical order of precedence, resolve closest reference to component • Editing variables automatically restarts any components referencing the variables! Level 1 Level 2 Vars Vars Variable Registry Flow Ubiquity
  • 20. 20 © Hortonworks Inc. 2011 – 2017. All Rights Reserved ⬢ Data Governance – Centralized registry to provide reusable schema – Version management to define relationship between schemas – Validation to enable generic format conversion and generic routing ⬢ Operational Efficiency – Centralized registry to avoid attaching schema to every piece of data – Version management to enable consumers and producers can evolve at different rates – Validation to ensure data quality Schema Registry
  • 21. 21 © Hortonworks Inc. 2011 – 2017. All Rights Reserved How Can we take advantage of Streaming computations? HDF Streaming
  • 22. 22 © Hortonworks Inc. 2011 – 2017. All Rights Reserved A Distributed Streaming Platform that supports Pub-Sub Systems • Publish and Subscribe to streams of records, similar to a messaging queue • Store streams of records in a fault-tolerant way • Process Streams of records as they occur • Topic is a partitioned Log of events Generally used to… • Build real-time streaming data pipelines to reliably transfer data between systems • Build real-time streaming applications that transform or react to the streams of data What is Kafka?
  • 23. 23 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Streaming Messaging Manager (*NEW) “Kafka Blindness” – Customers who use Kafka today struggle with monitoring and managing Kafka clusters.
  • 24. What is Storm?
      A distributed, fault-tolerant service that processes streams of data
          • Spout: captures data from external systems (Kafka, HBase, Hive, HDFS, and AWS Kinesis)
          • Bolt: transforms and aggregates that data using filter, map, flatMap, aggregate, reduce, count, etc.
          • Bolt: writes data back to an external system for storage or visualization (HBase, Hive, Druid)
          • The chain is known as a topology. The topology runs under a master-slave type architecture.
      Generally used for…
          • Processing streams: no need for intermediate queues.
          • Continuous computation: send data to clients continuously so they can update and show results in real time, such as site metrics.
          • Distributed remote procedure call: easily parallelize CPU-intensive operations. *
      * Leibiusky, Jonathan. Getting Started with Storm: Continuous Streaming Computation with Twitter's Cluster Technology (p. 1). O'Reilly Media. Kindle Edition.
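The spout-to-bolt chain described on the Storm slide can be mimicked with Python generators. This is a single-process analogy and not the Storm API: the function names are hypothetical, a list stands in for the external source, and a dictionary stands in for the external store.

```python
from collections import Counter


def sentence_spout(lines):
    """Spout: emits raw tuples from a source (here, an in-memory list)."""
    for line in lines:
        yield line


def split_bolt(stream):
    """Bolt: flatMap each sentence into lowercase words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) updates."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def store_bolt(stream, store):
    """Bolt: writes the latest count per word to an external store."""
    for word, count in stream:
        store[word] = count


def run_topology(lines):
    """Wire spout and bolts together; the wiring is the topology."""
    store = {}
    store_bolt(count_bolt(split_bolt(sentence_spout(lines))), store)
    return store


result = run_topology(["the quick fox", "the lazy dog"])
```

Each stage pulls one tuple at a time from the stage before it, so nothing is queued between stages. That is the "no need for intermediate queues" point on the slide, minus the distribution and fault tolerance that Storm adds.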
  • 25. What does our flow look like?
  • 26. Join Us: Hortonworks Community Connection
      ⬢ Full Q&A Platform (like StackOverflow)
      ⬢ Knowledge Base Articles
      ⬢ Code Gallery and Samples
      ⬢ https://community.hortonworks.com
  • 27. Care to Learn More?
      ⬢ Download our HDF Sandbox for Docker, VMware, or VirtualBox: https://hortonworks.com/downloads/#sandbox
      ⬢ Follow our HDF tutorials: https://hortonworks.com/tutorial/analyze-iot-weather-station-data-via-connected-data-architecture/
      ⬢ Reach out to an Enterprise Account Manager
  • 28. Questions?
  • 29. Apache NiFi / ETL Tools
      NiFi: NOT schema dependent
          • Dataflow management for both structured and unstructured data, powered by separation of metadata and payload
          • A schema is not required, but you can have one
          • Minimum modeling effort, just enough to manage dataflows
          • Does the plumbing job, freeing developers’ brainpower for creative work
          ⚠ Not designed for heavy-lifting transformation work on DB tables (JOIN datasets, etc.). You can create custom processors to do that, but NiFi has a long way to go to catch up with existing ETL tools from a user experience perspective (GUI for data wrangling, cleansing, etc.)
      ETL (Informatica, etc.): Schema dependent
          • Tailored for databases and data warehouses
          • ETL operations based on schema/data modeling
          • Highly efficient, optimized performance
          ⚠ You must pre-prepare your data; building data models and maintaining schemas is time consuming
          ⚠ Not geared towards handling unstructured data (PDF, audio, video, etc.)
          ⚠ Not designed to solve dataflow problems
  • 30. Apache NiFi / Integration or Ingestion Frameworks
      NiFi: End-user-facing dataflow management tool
          • Out-of-the-box solution for dataflow management
          • Interactive command and control in the core, design and deploy on the edge
          • Flexible failure handling at each point of the flow
          • Visual representation of the global dataflow and its connectivity
          • Native cross-data-center communication
          • Data provenance for traceability
          ⚠ Not a library to be embedded in other applications
      Integration frameworks (Spring Integration, Camel, etc.) and ingestion frameworks (Flume, etc.): Developer-facing integration tools with a focus on data ingestion
          • A set of tools to orchestrate workflows
          • A fixed design-and-deploy pattern
          • Leverage a messaging bus across disconnected networks
          ⚠ Developer facing; custom coding is needed to optimize
          ⚠ Pre-built failure handling that lacks flexibility
          ⚠ No holistic view of the global dataflow
          ⚠ No built-in data traceability
  • 31. Apache NiFi / Messaging Bus Services
      NiFi: Provides a dataflow solution
          • Centralized management, from edge to core
          • Great traceability: event-level data provenance starting when data is born
          • Interactive command and control with real-time operational visibility
          • Dataflow management, including prioritization, back pressure, and edge intelligence
          • Visual representation of the global dataflow
          ⚠ Not a messaging bus; flow maintenance is needed when consumer-side updates are frequent
      Messaging Bus (Kafka, JMS, etc.): Provides a messaging bus service
          • Low latency
          • Great data durability
          • Decentralized management (producers and consumers)
          • Low broker maintenance for dynamic consumer-side updates
          ⚠ Not designed to solve dataflow problems (prioritization, edge intelligence, etc.)
          ⚠ Traceability limited to in/out of topics, no lineage
          ⚠ No global view of components and their connectivity
  • 32. Apache NiFi / Processing Frameworks
      NiFi: Simple event processing
          • Primarily feeds data into processing frameworks; it can process data itself, with a focus on simple event processing
          • Operates on a single piece of data, or in correlation with an enrichment dataset (enrichment, parsing, splitting, and transformations)
          • Can scale out, but scales up better: it takes full advantage of hardware resources by running concurrent processing tasks/threads (processing terabytes of data per day on a single node)
          ⚠ Not another distributed processing framework; it is meant to feed data into them
      Processing Frameworks (Storm, Spark, etc.): Complex and distributed processing
          • Complex processing across multiple streams (JOIN operations)
          • Analyzing data across time windows (rolling window aggregation, standard deviation, etc.)
          • Scale out to thousands of nodes if needed
          ⚠ Not designed to collect data or manage data flow
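As a rough illustration of the "rolling window aggregation, standard deviation" work that the slide assigns to processing frameworks, here is a small Python sketch. It is not Storm or Spark code: `rolling_stats` is a hypothetical helper, the window is counted in events rather than in time, and the temperature readings are made up.

```python
from collections import deque
import statistics


def rolling_stats(values, window):
    """Yield (mean, population std dev) for each full window of `window` events."""
    buffer = deque(maxlen=window)  # the oldest event drops out automatically
    for value in values:
        buffer.append(value)
        if len(buffer) == window:
            yield statistics.mean(buffer), statistics.pstdev(buffer)


# Hypothetical temperature readings from one weather station.
readings = [20.0, 22.0, 24.0, 30.0]
windows = list(rolling_stats(readings, window=2))
```

The state held here is a single buffer for a single stream. A distributed framework keeps this kind of state per key across many nodes, which is the part NiFi is not designed to take on.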