Data pipelines: building an efficient instrument to create a custom workflow
Speaker: Daniel Yavorovych
DevOpsFest2020
Daniel Yavorovych
CTO & Co-Founder at Dysnix
10+ years of *nix systems administration;
5+ years of DevOps and SRE;
7+ years of developing cloud solution architectures and high-load (HL) / high-availability (HA) infrastructures;
7+ years of developing high-performance server software (Python / Golang).
Real-Time Data Pipeline Processing
When is it needed?
Real-time processing is needed when data arrives continuously, for example from Twitter, news media, email, etc.
Why is this a problem?
Most Data Pipeline solutions assume Batch mode. Only a few alternatives support real-time processing; they are discussed next.
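To make the Batch versus real-time distinction concrete, here is a minimal Python sketch. It is not tied to any of the tools discussed in this talk, and the names `process`, `run_batch` and `run_stream` are hypothetical. Batch mode waits for the whole input before doing any work, while stream mode emits each result as soon as its message arrives.

```python
from typing import Iterable, Iterator, List


def process(message: str) -> str:
    """Stand-in for the real work done on one message (hypothetical)."""
    return message.strip().upper()


def run_batch(messages: Iterable[str]) -> List[str]:
    """Batch mode: collect the whole input first, then process it in one go."""
    collected = list(messages)  # blocks until the source is exhausted
    return [process(m) for m in collected]


def run_stream(messages: Iterable[str]) -> Iterator[str]:
    """Stream mode: yield each result as soon as its message arrives."""
    for m in messages:
        yield process(m)


if __name__ == "__main__":
    feed = ["tweet one ", "news item", " email"]
    print(run_batch(feed))
    for result in run_stream(iter(feed)):
        print(result)
```

With an endless source such as a Twitter feed, `run_batch` would never return, which is why Batch-only tools are a poor fit for continuously arriving data.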
Data Pipeline Solutions
Google Cloud Dataflow
Batch and Stream modes!
Fully integrated with AutoML, Pub/Sub and other GCP components
Vendor lock-in
Not expensive, but costs more than self-hosted solutions
Data Pipeline Solutions
Apache Airflow
Open Source & No vendor lock-in
User Interface for visualizing Data Pipelines and Processing
Support for various executors (Celery, Kubernetes) and integration with Apache Spark
No Stream Mode
Data Pipeline Solutions
Luigi
Open Source & No vendor lock-in
Not very scalable: you need to split tasks into projects for parallel execution
User Interface
Hard to use: DAG tasks cannot be viewed before execution, and viewing logs is difficult
Data Pipeline Solutions
argoproj/argo-events
Open source & No vendor lock-in
No User Interface
Real-time mode
Kubernetes-native solution
20+ event sources
Argo Workflows support (a container-native workflow engine)
New and still immature
Data Pipeline Solutions
Apache NiFi
Open Source & No vendor lock-in
Difficult integration with Kubernetes
Real-time mode
Flexible & User-friendly Interface for viewing Data Pipelines and Processing
Highly scalable
Lots of native Processors available
We choose NiFi because: the number of native Processors available
NiFi provides many ready-made Processors (about 300 of them), from the Twitter API and Slack to TCP and HTTP servers, S3, GCS and Google Pub/Sub.
We choose NiFi because: Custom Scripts
Missing a Processor? Write your own in one of the supported scripting languages: Clojure, ECMAScript, Groovy, Lua, Python, Ruby.
Will it work faster?
I rewrote some logic in Python to replace several native NiFi Processors, and the flow ran even faster...
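As an illustration of how small such a script can be, here is a sketch of a filter-and-rename step written as a plain Python function over text streams. It assumes the script is run by a Processor that pipes FlowFile content through an external command's stdin and stdout (NiFi's ExecuteStreamCommand works this way). The JSON field names `text`, `lang` and `source` are hypothetical, and this is not the speaker's actual code.

```python
import io
import json
from typing import TextIO


def transform(stdin: TextIO, stdout: TextIO) -> int:
    """Read JSON lines, keep English records, write one trimmed record per line.

    Returns the number of records written. Field names are illustrative only.
    """
    written = 0
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Skipped here for brevity; a real flow would route this to an error path.
            continue
        if not isinstance(record, dict) or record.get("lang") != "en":
            continue
        out = {
            "message": record.get("text", ""),
            "source": record.get("source", "unknown"),
        }
        stdout.write(json.dumps(out) + "\n")
        written += 1
    return written


if __name__ == "__main__":
    # In-memory demo; inside a flow you would call transform(sys.stdin, sys.stdout).
    sample = io.StringIO(
        '{"text": "hello", "lang": "en", "source": "twitter"}\n'
        "not json\n"
        '{"text": "hallo", "lang": "de"}\n'
    )
    result = io.StringIO()
    print(transform(sample, result), "record(s) kept")
    print(result.getvalue(), end="")
```

One script like this can stand in for a chain of separate parse, filter and rename Processors, which is the kind of substitution described above.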
We choose NiFi because: the ability to change data flows & queues in real time
You can stop a Processor or a group of Processors at any time, make changes, and start it again.
Meanwhile, all other Processors that do not depend on the stopped ones keep working. This lets you stop only the Processors that have errors or need changes.
While a Processor is stopped, all incoming messages accumulate in the NiFi queue.
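The queueing behaviour can be pictured with a toy Python model. This is only a sketch of the idea, not NiFi's API: the `Processor` class and the name `PutToStorage` are hypothetical. Upstream keeps enqueueing while the processor is stopped, and the backlog drains in order once it is started again.

```python
from collections import deque
from typing import Deque, List


class Processor:
    """Toy model of a processor with an inbound queue (not NiFi's API)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.running = True
        self.queue: Deque[str] = deque()
        self.done: List[str] = []

    def receive(self, message: str) -> None:
        """Upstream always enqueues, whether or not this processor is running."""
        self.queue.append(message)

    def tick(self) -> None:
        """Process everything that is queued, but only while running."""
        while self.running and self.queue:
            self.done.append(self.queue.popleft())


if __name__ == "__main__":
    sink = Processor("PutToStorage")  # hypothetical processor name
    sink.receive("m1")
    sink.tick()

    sink.running = False              # stop it to change its configuration
    for m in ("m2", "m3"):
        sink.receive(m)               # upstream keeps producing
        sink.tick()                   # no-op while stopped
    print(len(sink.queue), "message(s) queued while stopped")

    sink.running = True               # restart: the backlog drains in order
    sink.tick()
    print(sink.done)
```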
We choose NiFi because: NiFi Registry
NiFi Registry is a central location for the storage and management of shared resources across one or more instances of NiFi and/or MiNiFi.
This lets you not only share Processors and Process Groups between NiFi instances but also version your work (similar to Git), so you can always roll back to a previous version.
We choose NiFi because: Templates
NiFi templates let you export your whole data flow to an XML file with a few keystrokes, either as a backup or to hand off to another developer. A template can also be used as a base for presets (we'll talk about this later).
We choose NiFi because: External Auth & Users/Groups
NiFi has a flexible permission model for Users and Groups.
Permissions can be set both for operations (viewing / editing a Flow) and for specific objects (Processors / Process Groups).
NiFi also supports external authentication through LDAP, Kerberos and even OpenID Connect. For example, we integrated Keycloak to store user data in one place.
NiFi Arch
NiFi Arch: Cluster Mode
NiFi Scalability
Horizontal scaling
There is no fixed limit on the number of nodes in a single cluster; the only limits are node hardware and network performance
It's easy to join a new node to a running cluster
Source: bit.ly/nifi-limits
NiFi Scalability: Multiple Clusters
If even 10 nodes are not enough because you are limited by network bandwidth, you can build several NiFi clusters and connect them through Remote Process Groups.
NiFi & Kubernetes
Existing solutions:
https://medium.com/swlh/operationalising-nifi-on-kubernetes-1a8e0ae16a6c
https://hub.helm.sh/charts/cetic/nifi
https://community.cloudera.com/t5/Community-Articles/Deploy-NiFi-On-Kubernetes/ta-p/269758
The last Helm Chart was the most relevant and we took it as a basis
Helm chart
NiFi Registry
Grafana dashboard & Prometheus metrics (Grafana Dashboard ID: 12375)
Predefined NiFi Flow
Tips & Tricks
Use Kafka or any other Message Bus, so that your data stays safe if NiFi fails.
Although NiFi has a visual editor and plenty of Processors, flows must be built by a technically competent engineer; otherwise the data flow can become unstable.
For unpredictable inputs, use a rate-limiting Processor (such as ControlRate).
Use NiFi Registry - it will always allow you to roll back!
Don't try to use only native NiFi Processors: sometimes that gets too complicated, and it is easier to write a couple of lines in Python.
Don't gloss over errors! In NiFi you can handle errors the same way as regular data: send them to Slack or use them for your own purposes.
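The rate-limiting tip can be illustrated with a token bucket, one common way to smooth bursty input. This is a generic Python sketch and not how any NiFi Processor is implemented; the `TokenBucket` class and its parameters are hypothetical. Events that are not allowed through would simply wait in the upstream queue.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow `rate` events per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        """Return True if one event may pass now, False if it has to wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=2, capacity=2)
    burst = [bucket.allow() for _ in range(5)]
    print(burst)
```

The `capacity` value decides how large a burst gets through untouched, and `rate` sets the sustained throughput that downstream systems must be able to handle.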
Production Architecture Example
Conclusion
NiFi proved to be not only good for rapid prototyping of Data Pipeline Flows but also a good basis for scalable, high-load ELT systems
Of all the free self-hosted solutions considered, NiFi is the most modern and the most actively developed
Configuring a NiFi cluster in Kubernetes was not a trivial task, but after working through some difficulties we have a ready-to-use solution that meets all the requirements
NiFi is flexible: it does not lock everything into itself, and used properly it gives very good results when supporting really big projects of a similar kind
Dysnix Open Source
github.com/dysnix

Helm charts
Cryptocurrency nodes docker images
Prometheus exporters
Grafana dashboards
Terraform for Blockchain-ETL (project for Google Cloud Platform)
Daniel Yavorovych
CTO & Co-Founder at Dysnix
daniel@dysnix.com
https://www.linkedin.com/in/daniel-yavorovych/
Questions?
