MAKING BIG DATA COME ALIVE
Integrating Apache Spark And NiFi
For Data Lakes
Ron Bodkin Founder & President
Scott Reisdorf R&D Architect
Agenda
• Requirements
• Design
• Demo
Goals for a Data Lake
• A central repository with trusted, consistent data
• Reduce costs by offloading analytical systems and archiving cold data
• Derive value quickly with easier discovery and prototyping
• A laboratory for experimenting with new technologies and data
What’s Needed For A Hadoop Data Lake?
• Automation of pipelines with metadata and performance tracking
• Governance with clear distinction of roles and responsibilities
• SLA tracking with alerts on failures or violations
• Interactive data discovery and experimentation
Example Ingestion Project
• 4000+ unique flat files and RDBMS tables, plus a few streaming data feeds
• Mix of incremental and snapshot data
• Ingest into Hadoop (minimally HDFS and Hive tables)
• Cleansing/encryption and data validation
• Metadata capture
Focus shifts over time from data ingestion to transformation, then to analytics
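
The "mix of incremental and snapshot data" implies two different load semantics. A minimal sketch of the difference, in plain Python with in-memory dictionaries standing in for Hive tables; the function names, the `id` key, and the sample rows are hypothetical and not taken from the talk:

```python
def apply_snapshot(target, batch, key="id"):
    """A snapshot carries the full current state, so it replaces the target.

    The previous contents of `target` are intentionally ignored.
    """
    return {row[key]: row for row in batch}


def apply_incremental(target, batch, key="id"):
    """An incremental batch carries only changes, so it is merged by key."""
    merged = dict(target)
    for row in batch:
        merged[row[key]] = row  # insert new keys, overwrite changed ones
    return merged


# First load is a full snapshot, the second is a delta on top of it.
table = apply_snapshot({}, [{"id": 1, "status": "new"}, {"id": 2, "status": "new"}])
table = apply_incremental(table, [{"id": 2, "status": "shipped"}, {"id": 3, "status": "new"}])
```

A later snapshot that omits a key removes that row from the target, whereas an incremental batch can only add or overwrite rows.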
Design
Apache Spark Functions
• Cleanse
• Validate
• Profile
• Wrangle
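
The slide names these functions without showing them. As a rough illustration only, the sketch below runs cleanse, validate, and profile steps over a few in-memory records. It uses plain Python rather than Spark so that it runs without a cluster; the field names, rules, and function names are hypothetical, and the wrangle step is left out.

```python
from collections import Counter

# Hypothetical sample records; field names are illustrative only.
records = [
    {"id": "1", "email": " Alice@Example.com ", "age": "34"},
    {"id": "2", "email": "bob@example.com", "age": "-5"},
    {"id": "3", "email": "", "age": "41"},
]


def cleanse(record):
    """Standardize values: trim whitespace and lower-case the email."""
    cleaned = {key: value.strip() for key, value in record.items()}
    cleaned["email"] = cleaned["email"].lower()
    return cleaned


def validate(record):
    """Return a list of rule violations; an empty list means the record is valid."""
    problems = []
    if "@" not in record["email"]:
        problems.append("email: missing or malformed")
    if not record["age"].isdigit():
        problems.append("age: not a non-negative integer")
    return problems


def profile(rows, field):
    """Collect simple column statistics: row count, empty count, distinct count."""
    values = [row[field] for row in rows]
    return {
        "count": len(values),
        "empty": sum(1 for value in values if value == ""),
        "distinct": len(Counter(values)),
    }


cleansed = [cleanse(record) for record in records]
valid = [record for record in cleansed if not validate(record)]
invalid = [record for record in cleansed if validate(record)]
email_profile = profile(cleansed, "email")
```

In a Spark job the same three steps would typically be expressed as column transformations, filters, and aggregations over a DataFrame, so that the work is distributed across the cluster.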
Pipeline design with Apache NiFi
• Visual drag-and-drop
• Dozens of data connectors
• 150+ pre-built transforms
• Data lineage
• Batch and Streaming
• Extensible
© 2016 Think Big, a Teradata Company 7/10/2016
Role separation
Apache NiFi
• IT Designers design models in NiFi
• Register with framework
• Integrated development process
Think Big framework
• Users configure new feeds
• Based on common model
• Generated and executed in NiFi
Diagram arrows between the two: register, deploy
Design Approach
• User features around org. roles
• Visual design
• Streaming and Batch
• Fully governed
• Integrated Best Practices
• Secure, modern architecture
• Will be open source (Apache license)
Ingest and Prepare
• UI-guided feed creation
• Data protection
• Data cleanse
• Data validation
• Data profiling
• Powered by Apache Spark
Data Ingest Model
Sources:
• Extract Table (JDBC)
• Get File(s) (Filesystem)
• Message (JMS/Kafka)
• Other (HTTP/REST, etc.)
Processing steps, with the engine or store each one uses:
• Unpack and/or merge small files
• Put file (HDFS)
• Cleanse/Standardize (Spark)
• Data Profile (Spark)
• Validate (Spark)
• Index Text (Elasticsearch)
• Merge / Dedupe (Hive)
• Compress & Archive Originals (HDFS, S3)
Also shown in the diagram: Metadata, Data policies
• Metadata determines behavior of individual components
• Adds many Hadoop-specific higher-level NiFi processors
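
One way to read "metadata determines behavior of individual components" is that each feed carries a configuration record and generic steps look up their settings from it. The sketch below illustrates that idea in plain Python. The metadata keys, step names, and registry are hypothetical and do not represent the framework's actual model or any NiFi API.

```python
# Hypothetical feed metadata; keys and step names are illustrative only.
feed_metadata = {
    "name": "twitter_feed",
    "steps": ["cleanse", "validate"],
    "cleanse": {"trim_fields": ["user"]},
    "validate": {"required_fields": ["id", "tweet"]},
}


def cleanse(rows, config):
    """Trim whitespace from the fields listed in the step's metadata."""
    trimmed = []
    for row in rows:
        row = dict(row)  # copy so the input rows are not mutated
        for field in config["trim_fields"]:
            row[field] = row[field].strip()
        trimmed.append(row)
    return trimmed


def validate(rows, config):
    """Keep only rows where every required field is present and non-empty."""
    return [
        row for row in rows
        if all(row.get(field) for field in config["required_fields"])
    ]


# Generic components, selected and configured purely by metadata.
STEP_REGISTRY = {"cleanse": cleanse, "validate": validate}


def run_feed(rows, metadata):
    """Run the steps named in the metadata, in order, with their own settings."""
    for step_name in metadata["steps"]:
        rows = STEP_REGISTRY[step_name](rows, metadata[step_name])
    return rows


rows = [
    {"id": 1, "user": " ann ", "tweet": "hello"},
    {"id": 2, "user": "bob", "tweet": ""},
]
result = run_feed(rows, feed_metadata)
```

Under this design, adding a feed means writing a new metadata record rather than a new pipeline, which is consistent with the slides' point that users configure feeds based on a common model.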
Data self-service and “wrangle”
• Graphical SQL builder
• 100+ transform functions
• Machine learning
• Publish and schedule
• Powered by Apache Spark
Data Discovery
• Google-like searching
• Extensible metadata
• Data profile
• Data sampling
Operations
• Dashboard
• Health Monitoring
• Data Confidence
• SLA enforcement
• Alerts
• Performance reports
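
The slides do not say how SLA enforcement is implemented. As a minimal sketch, assuming an SLA is defined as a maximum allowed time since a feed's last successful run, a check could look like the following; the function name, feed names, and timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def check_sla(feed_name, last_success, now, max_age):
    """Return an alert message if the last success is older than max_age, else None."""
    age = now - last_success
    if age > max_age:
        return f"SLA violation: {feed_name} last succeeded {age} ago (limit {max_age})"
    return None


# A fixed "now" keeps the example deterministic.
now = datetime(2016, 7, 10, 12, 0)
alerts = [
    message
    for message in (
        check_sla("orders", datetime(2016, 7, 10, 11, 30), now, timedelta(hours=1)),
        check_sla("twitter_feed", datetime(2016, 7, 10, 8, 0), now, timedelta(hours=1)),
    )
    if message is not None
]
```

Here only the second feed exceeds its one-hour limit, so `alerts` holds a single message that a dashboard or notification channel could then surface.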
ElasticSearch – Full Text Indexing
• Powerful search capabilities for users against data (think Google-like searching)
• NiFi processor extracts source data from Hadoop table for indexing in ElasticSearch
• Incremental updates during ingest
Diagram: Data Lake, query "select id,user,tweet from twitter_feed", extract JSON
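
To make the extract-and-index step concrete, the sketch below turns rows shaped like the result of the slide's query into the newline-delimited JSON body that the Elasticsearch bulk API accepts (an action line followed by a document line). The rows are hard-coded stand-ins for the Hive result. Tracking the highest `id` already indexed is an assumption about how incremental updates might be detected; the slides do not specify a mechanism.

```python
import json


def to_bulk_lines(rows, index_name, last_indexed_id=0):
    """Build bulk-API lines for rows with an id greater than last_indexed_id."""
    lines = []
    for row in rows:
        if row["id"] <= last_indexed_id:
            continue  # assumed to be indexed by an earlier run
        # Action line: index this document under its source id.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": str(row["id"])}}))
        # Document line: the searchable fields.
        lines.append(json.dumps({"user": row["user"], "tweet": row["tweet"]}))
    return lines


# Stand-in for: select id,user,tweet from twitter_feed
rows = [
    {"id": 1, "user": "ann", "tweet": "hello data lake"},
    {"id": 2, "user": "bob", "tweet": "nifi and spark"},
]

# The bulk body must end with a newline.
bulk_body = "\n".join(to_bulk_lines(rows, "twitter_feed", last_indexed_id=1)) + "\n"
```

The resulting `bulk_body` would be sent to the cluster's `_bulk` endpoint with an NDJSON content type; the HTTP call is omitted so the example stays self-contained.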
Demo


Editor's Notes

  • #13: Notice that we delegate processing to the Spark and Hadoop cluster for much of our work