Using Spark Streaming and NiFi for the next
generation of ETL in the enterprise
Darryl Dutton, Principal Consultant, T4G
Kenneth Poon, Director of Data Engineering, RBC
The Journey
Agenda
What is the Event Standardization
Service (ESS) Use Case
The drivers to modernize ESS
The solution for ESS and benefits
The project challenges
The Good, the Bad and the Ugly
Questions
What is the
ESS Use Case
Event Standardization Service (ESS) captures
customer activity across all channels, such as
Online Banking, Mobile Apps, Bank Branch,
Advice Center, etc…
ESS facilitates customer journey reporting to
turn raw event data into actionable insights
ESS provides APIs to customer-facing systems
to get insights on recent customer activity,
journeys, and life events.
ESS – Business Value
Understand customer activity across all
channels
Identify customer journeys from interactions
Identify life events from journeys to
optimize customer experience
Life Events
Customer Journey
Events / Interactions
What is the ESS Use Case - Legacy
Event Standardization Service – Legacy Architecture
[Architecture diagram — swimlanes: Data Source → Event Hub / Ingest → Processing → Data Storage and Batch Processing → Reporting / Analytics]
Real-time path: Business Events (real time) → IBM DataPower → IBM MQ → Teradata TPump (Stage 0) → 60-minute mini-batch (SQL) → Teradata Core EDW → Teradata Extended Model → Teradata report view
Batch path: Batch Source Events → IBM DataStage → batch SQL → Teradata Core EDW
Reporting: batch SQL and batch extracts feed an Oracle BI tool, apps, and ad hoc reporting
The drivers to
modernize
ESS
Provide real-time access to customer event and
journey data
Reduce cost to enhance, support, and maintain
Simplify onboarding process for new systems
Support exponential growth of event data
Provide users with self-serve validation tools
Key Solution
Components
Extract & Load
Transformation
Integration
Event Standardization Service – High Level Design
[Architecture diagram — swimlanes: Data Source → Event Hub / Ingest → Processing (NiFi and Spark Streaming on YARN) → Data Storage and Logic Processing → Reporting / Analytics]
Real-time path: Business Events (XML) → DataPower → IBM MQ → NiFi read-and-route processors for near-real-time events → Kafka (JSON) → Spark Streaming → Kafka (JSON) → Teradata Core EDW, other data stores, and downstream systems
Batch path: Batch Source Events → DataStage (text) → NiFi read-and-route processors for batch events → Kafka (JSON) → Spark Streaming
Persistence/OPS: NiFi processors for persistence/OPS write text and Parquet to HDFS; Kafka WAL/offsets; lookup/reference data; Elasticsearch and Kibana (OPS); email server for alerts
The solution for ESS
NiFi Implementation
[Diagram: NiFi cluster implementation with external HDFS]
Spark Implementation
[Diagram: Hadoop cluster — an edge node, a YARN Resource Manager, Node Managers 1–7, and HDFS data nodes; a Kafka cluster with Server 1 holding partitions P0–P2 and Server 2 holding P3–P5; the Spark app runs the Spark Driver (AM) on one Node Manager with six Spark Executors spread across the others]
Benefits
• Event data available for further
analytics in near real-time
• Scalability solved
• Tolerates longer outage windows
• Fast development and iterations
• Better data flow visibility
• Integration with legacy infrastructure
• Reinvestment of IT budget in newer open-source technologies
Project
Challenges
• Too many new things at once
• Lack of knowledge and
documentation of legacy systems
• Infrastructure readiness
• Implementing security requirements
• Versioning of different open source
Apache projects
• Getting to simple
The Good
The Bad &
The Ugly
NiFi Canvas – rapid build through configuration
NiFi Monitoring and Retry
NiFi – Integration & Load Testing
NiFi – Access Control (Groups, Users, LDAP integration)
NiFi – Supporting Different Environments
DEV
UAT
PROD
NiFi – Version Upgrade
1.3.0 → 1.5.0
Spark Streaming input source and output sink
“The streaming sinks are designed to be
idempotent for handling reprocessing.”
You need to handle duplicate replay/reprocessing logic when writing output if exactly-once processing is needed.
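One way to handle replays, sketched below under the assumption that each event carries a unique eventId and an event-time column (both names are illustrative, not from the ESS code): de-duplicate within a watermark window before writing, and keep the downstream write idempotent on the same key.

import org.apache.spark.sql.DataFrame

// A minimal dedup sketch (assumed column names eventId/eventTime):
// drop replayed records before the sink, bounding state with a watermark.
def dropReplays(events: DataFrame): DataFrame =
  events
    .withWatermark("eventTime", "10 minutes")   // how late a replay may arrive
    .dropDuplicates("eventId", "eventTime")     // ignore reprocessed duplicates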
Spark Structured Streaming….focus on logic code, not plumbing code
Spark Session
Read Stream
Transforms/Filters
Transforms/Filters
Transforms/Filters
Write Stream
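As a sketch of the shape above, with illustrative topic names, brokers, and schema (none of this is the ESS production code, and it assumes the Spark 2.2 Kafka source shown later in the spark-submit):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_json, to_json, struct}
import org.apache.spark.sql.types.{StructType, StringType, TimestampType}

object EventPipelineSketch {
  def main(args: Array[String]): Unit = {
    // Spark Session
    val spark = SparkSession.builder().appName("ess-pipeline-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical shape of the JSON event payload
    val eventSchema = new StructType()
      .add("eventId", StringType)
      .add("channel", StringType)
      .add("eventTime", TimestampType)

    // Read Stream: JSON events from Kafka
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9093")
      .option("subscribe", "ess.events.raw")
      .load()

    // Transforms/Filters: the logic code lives here
    val events = raw
      .select(from_json($"value".cast("string"), eventSchema).as("event"))
      .filter($"event".isNotNull)          // drop malformed JSON
      .select($"event.*")
      .filter($"channel".isNotNull)        // keep only routable events

    // Write Stream: standardized JSON back to Kafka
    events
      .select(to_json(struct($"*")).as("value"))
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9093")
      .option("topic", "ess.events.standardized")
      .option("checkpointLocation", "/tmp/ess/chk/standardized")
      .start()
      .awaitTermination()
  }
}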
Spark Structured Streaming….lazy design of sources and sinks
[Diagram, built up over several slides: one Spark Session can define multiple Read Streams, each with its own chain of Transforms/Filters and one or more Write Streams; nothing executes until the Write Streams are started]
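A sketch of that laziness, again with illustrative names: every readStream and transform call below only builds a plan, nothing touches Kafka until start() is called on a sink, and one SparkSession can host several independent queries.

import org.apache.spark.sql.SparkSession

object LazyStreamsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ess-lazy-sketch").getOrCreate()
    import spark.implicits._

    // Plans only -- no data moves yet
    val nearRealTime = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9093")
      .option("subscribe", "ess.events.nrt")
      .load()
      .selectExpr("CAST(value AS STRING) AS value")
      .filter($"value".isNotNull)

    val batchEvents = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9093")
      .option("subscribe", "ess.events.batch")
      .load()
      .selectExpr("CAST(value AS STRING) AS value")

    // Execution starts only here, one query per sink
    nearRealTime.writeStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9093")
      .option("topic", "ess.events.out")
      .option("checkpointLocation", "/tmp/ess/chk/nrt")
      .start()

    batchEvents.writeStream.format("parquet")
      .option("path", "/data/ess/batch")
      .option("checkpointLocation", "/tmp/ess/chk/batch")
      .start()

    spark.streams.awaitAnyTermination()   // block while both queries run
  }
}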
Spark Streaming Hosting on YARN….deploy, control and logging
[Diagram: a Spark Streaming application running on YARN data nodes, fed by NiFi processors and Kafka, reading from Kafka and writing to HDFS]
Deploy: spark-submit (package your own Spark version); HDFS used for temp files
Control: stop?…kill via text commands
Logging: logs and metrics ‘tailed’ and saved with Log4j2; email notification on failure
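The “Stop?…Kill” box is about shutting down a long-running query without killing the YARN application mid-batch. One common pattern, sketched here (the marker path is illustrative, and this is not the deck’s implementation): poll for a stop-marker file in HDFS and stop the query between micro-batches when it appears.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.streaming.StreamingQuery

// Graceful stop: create the marker file (e.g. /app/ess/STOP) to request shutdown.
def awaitStopMarker(query: StreamingQuery, marker: String): Unit = {
  val fs = FileSystem.get(new Configuration())
  val stopPath = new Path(marker)
  while (query.isActive && !fs.exists(stopPath)) {
    query.awaitTermination(10000L)   // wake every 10s to check the marker
  }
  if (query.isActive) query.stop()   // stop between micro-batches, then exit
}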
/app/spark/spark-2.2.0/bin/spark-submit \
  --jars spark-sql-kafka-0-10_2.11-2.2.0.jar \
  --class <com.MainClassName> \
  --master yarn \
  --deploy-mode cluster \
  --queue <your queue name> \
  --num-executors 18 \
  --executor-cores 1 \
  --executor-memory 4G \
  --driver-memory 4G \
  --driver-java-options="-XX:+UseConcMarkSweepGC -Dhdp.version=current -Dlog4j.configuration=./log4j.properties -Dconfig.file=./application.conf -Djava.security.auth.login.config=./kafka_client_jaas.conf -Djava.security.krb5.conf=./krb5.conf" \
  --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -Dhdp.version=current -Dlog4j.configuration=./log4j.properties -Dconfig.file=./application.conf -Djava.security.auth.login.config=./kafka_client_jaas.conf -Djava.security.krb5.conf=./krb5.conf" \
  --conf "spark.yarn.maxAppAttempts=4" \
  --conf "spark.yarn.am.attemptFailuresValidityInterval=1h" \
  --conf "spark.yarn.max.executor.failures=16" \
  --conf "spark.speculation=false" \
  --conf "spark.task.maxFailures=1" \
  --conf "spark.hadoop.fs.hdfs.impl.disable.cache=true" \
  --conf "spark.ui.showConsoleProgress=false" \
  --conf "spark.shuffle.consolidateFiles=true" \
  --conf "spark.locality.wait=1s" \
  --conf "spark.sql.tungsten.enabled=false" \
  --conf "spark.sql.codegen=false" \
  --conf "spark.sql.unsafe.enabled=false" \
  --conf "spark.streaming.backpressure.enabled=true" \
  --conf "spark.streaming.kafka.consumer.cache.enabled=false" \
  --conf "spark.ui.view.acls=*" \
  --principal <your principal name> \
  --keytab <keytab file path> \
  --files ./log4j.properties#log4j.properties,./log4j2.xml#log4j2.xml,./application.conf#application.conf,./metrics.properties#metrics.properties,./kafka_client_jaas.conf#kafka_client_jaas.conf,/app/pbrtappk/YYYYY#YYYYYYY,./krb5.conf,./client.truststore.jks $1
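A few of these settings are worth calling out for long-running streaming jobs: spark.yarn.maxAppAttempts=4 combined with spark.yarn.am.attemptFailuresValidityInterval=1h lets the application master restart after failures without ever exhausting its attempt budget, spark.streaming.backpressure.enabled=true keeps a slow batch from snowballing into ever-larger micro-batches, and the --principal/--keytab pair lets YARN renew Kerberos credentials for a job that outlives the ticket lifetime.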
Summary
• NiFi has been great for load/extract
• Use NiFi to handle routing & format changes
• Spark is good for transforms
• Operationalizing Spark Streaming is a challenge
• Deploying changes with NiFi is a challenge
• Keep it simple
Questions?
Darryl Dutton, T4G
darryl.dutton@T4G.com
Ready to Build Brilliant?
We’re always looking for new challenges
and teammates.
Connect with us!
800.399.5370
hello@t4g.com
www.t4g.com
Kenneth Poon, RBC
kenneth.t.poon@rbc.com
Helping clients thrive and
communities prosper.
Always hiring!
Simplify. Agile. Innovate.
jobs.rbc.com
Editor's Notes
  • #2: Darryl
  • #3: Darryl
  • #4: Darryl
  • #5: ESS (Event Standardization Service) is a new service built by RBC’s Data & Analytics group to collect customer interaction data across various channels (such as Online Banking, Mobile apps, Branch, ATMs, Advice Center) into a central repository, apply analytics to it, and then make the data available through APIs. The idea originated around 8 years ago with a system called ECS (Event Capture Service). ECS started collecting events from various channels and loading them into the data warehouse. Over time, upstream systems stopped sending new events because onboarding was difficult and expensive. The dataset became incomplete (new events were missing), making it unusable for customer journey reporting. Last year, RBC partnered with T4G to build a new event service (ESS) that would address all the pain points of the old system and be designed to capture ALL events across ALL channels, now and into the future. One of the top priorities of 2018 is to be able to link online and offline activities to get a holistic view of the customer journey.
  • #6: One of the common questions we get asked is what we are doing with all these events. The goal of ESS is to turn the raw events into actionable insights that can improve the customer experience and the bank’s bottom line. At a high level, we want to construct customer journeys from the interaction data, which can help predict life events. Through path analysis and prediction, knowing a customer’s current and next stage in life allows us to target them with more relevant offers in a timely manner, and even geo-targeted offers since we also track location. Here are some of the other use cases we are currently working on: Advisor Support – enable advisors to view real-time interaction data to assist with problem resolution; Digital → Offline Efficiencies – identify opportunities to reduce Advice Centre call volumes; Sales Attribution – identify the right digital marketing mix to drive sales, and link digital activity (research) to offline conversions (mortgages).
  • #7: Before we started building the new event service, we wanted to understand how the old service was designed and implemented, and find out the reasons why it became unusable over time. The first thing we found was that the technology stack was a bit outdated (but very mature and reliable). (click) Source systems would send XML events to a SOAP endpoint on DataPower, which got routed to an MQ queue and fed into the Teradata warehouse through the TPump utility and a BTEQ mini-batch process running every 60 minutes. (click) All the processing runs on the mainframe, triggered by JCLs and scheduled through Zeke. The data is then used for internal reporting on OBI and Tableau. (click) The batch feeds were copied from z/OS to DataStage, and then loaded to Teradata through the same mini-batch process. (click) (click) As you can tell, a lot of vendor products were used, making it difficult to find people in the market who have all of this expertise. Also, the folks who worked on it had either retired, switched teams, or left the bank. But technology was only half the problem. Having a rigid XML schema and a process-heavy development and deployment cycle meant it took months to deploy a simple change. These reasons made it very expensive to continue to use this system.
  • #8: Since we were going to re-architect the event service to make it easier for systems to use, we figured we would also modernize the tech stack to make it less costly to enhance, support, and maintain. As customers go digital and do more of their banking online and in apps, we are seeing the number of interaction events generated exponentially outgrow the number the old service was able to handle. RBC was falling behind in the channel analytics space, which is a huge lost opportunity for the bank if we can’t capitalize on all that customer data to analyze banking behavior and tendencies. Over the last 6 months, several new features have been rolled out across the different channels (especially digital and mobile apps), and we are happy to say that the new ESS service has been able to keep up with the demand. We were also able to go back and capture critical business events that were not onboarded before (such as branch and call center activity). I’ll hand it over to Darryl now to talk about the key components of the new solution.
  • #9: Darryl
  • #10: Darryl
  • #11: Darryl. NiFi: rapidly build data pipelines; required integrations supported; configuration over code; many available processors/services; easy ingestion, routing and splits; simple transforms and format changes; flow modification at runtime; built-in queuing and backpressure.
  • #12: Darryl
  • #13: Darryl
  • #14: Darryl. Spark: provides processing in near real time; micro-batching is good enough; complex transformation/enrichment; Structured Streaming…elegant; automatic retries; out-of-box integration with Kafka; use SQL on streaming data; the API allows a future path to ML.
  • #15: Darryl
  • #16: Darryl. Kafka: move data across boundaries; de-couple systems; reliable messaging; high performance, high volume; hold events for long outages; supported integrations.
  • #17: Darryl
  • #18: Darryl. We recently upgraded from 1.3.0 to 1.5.0 in Production. We had both instances up in parallel and migrated one template at a time so that we could easily roll back if something didn’t work in 1.5.0. We used an external ZooKeeper for the NiFi cluster because of an excessive logging issue (https://issues.apache.org/jira/browse/NIFI-3731).
  • #19: Darryl
  • #20: From what Darryl described, there are quite a few benefits in the new architecture: Instead of events being made available for processing after 60 minutes, Spark Streaming enables the events to be consumed in near real-time, within seconds. Building a distributed system allows us to scale horizontally – we can add more nodes to the NiFi cluster, increase the number of Kafka partitions, or increase the number of Spark executors as the volume of events increases over time. We moved away from vendor products and embraced open source, although we are using Confluent for Kafka and Hortonworks for Hadoop. Using open source frees us from vendor lock-in and assures long-term viability. It’s also easier to find developers who are more interested in working on new tech, which allows for succession planning. Not everything was new though: we did continue to use proven enterprise infrastructure (DataPower and MQ) as our REST API layer to receive events, to ensure high availability and fault tolerance. This was in lieu of having a REST proxy for Kafka available. Using NiFi sped up development (allowed for quick prototyping and testing), and also made it operationally easier to manage data-in-motion through visual controls.
  • #21: As with any new project, we encountered some challenges along the way. When you’re building a new system with all new tech, the last thing you need is to also introduce a new language – Scala. My developers (all from a Java background) didn’t know Scala, but wanted to use it for Spark Streaming. If you have zero experience in Scala, prototyping to get something to work is much different from writing a Production-grade app. Understanding what the old service did was also a challenge. We didn’t have the right skillset to understand the legacy implementation on the mainframe (and there was also a lack of documentation). Security requirements – communication over SSL and Kerberos authentication – meant lots of certs: between NiFi nodes in the cluster, connecting to HDFS, connecting to Kafka, connecting to Elasticsearch. Open source projects continuously release new versions: NiFi – we started with 1.1.0, then 1.3.0, and now 1.5.0. When we started, we only had Kafka 0.9, but needed Kafka 0.10 or higher to support SSL for Spark Streaming integration with Kafka. We had to wait 4-5 months for the Kafka 0.10 cluster to be ready. To prevent ESS from becoming obsolete over time, we’ll need to continue to optimize and simplify the tech stack so the next generation of folks don’t retire this system in five years.
  • #22: After a year working with Hadoop, NiFi, Spark, Kafka, and Elasticsearch, several teams at RBC are becoming proficient with them. However, this wasn’t the case last year. Last year, both the development and platform teams were learning at the same time. Dev teams were learning how to build Production-grade apps on Hadoop. The Platform team was learning how to manage and operate an enterprise Hadoop and Kafka environment that supports multitenancy. I’ll go over our experience with NiFi and Darryl will go over our experience with Spark Streaming.
  • #23: As a manager, one thing I love about NiFi is how quickly developers can whip up new data flows. NiFi is perfect for moving data from one or more sources to one or more sinks. Code reviews are much easier, as they’re not subject to differing coding styles, since NiFi is more like configuration as code. One new thing to gripe about now is how straight the connector lines are and the spacing between the processors. I was never a fan of drag-and-drop ETL dev tools at RBC (such as BusinessWorks and DataStage), but NiFi gives you more control, has an easy-to-use interface, and is more scalable. NiFi does data movement very well. Debugging was extremely easy with the provenance repository. If there was any failure, it was easy to find out which message failed to process and why.
  • #24: Monitoring failures and implementing retries is always a pain when you have to code it yourself. NiFi makes it very easy to configure. For retry, just have a self-loop. (click) We configure the penalty duration of the processor if we want to introduce a backoff and wait a certain amount of time before retrying. A typical use case for retry on failure is when you’re writing to a sink and there’s a connection failure (e.g. to HDFS, a Kafka topic, or Elasticsearch). (click, click) Whenever there is a failure, we used the MonitorActivity processor (click) to then send a consolidated email every 5 minutes to alert our support team. (click) Once the issue has been resolved, we send a recovery alert email. (click, click)
  • #25: For testing, we created a bunch of simple flows that read data from disk and publish to one of our ingestion points (either MQ or a Kafka topic). We did this for load testing when we needed to pump hundreds of thousands of events into the system in a very short period of time to simulate volume at 10x the peak, and to measure our expected throughput. We also had test Kafka consumers to verify we had indeed published to the topic successfully. The test classes normally would’ve taken us around 5-10 minutes to code in Java. In NiFi, we can whip these up in a minute or two.
  • #26: As with any GUI interface, we have to implement some sort of access control. We configured LDAP authentication against Active Directory for user login. We used SSL certificates for initial login and setting up secure cluster, and then disabled it. We created read-only, read-write, and admin NiFi groups, assigning different policies for each. This was fairly straightforward. In PROD, developers have read-only access, and support folks have write and admin access.
  • #27: Traditionally with our Java applications, we deploy the same JAR file in each environment, reading from different config files. Trying to replicate this in NiFi was not straightforward because not all configs could be externalized into variables. Oftentimes, we had to manually alter certain configuration values after importing a new flow into a different environment. (click) To reduce inadvertent changes to existing processors, we decided that for brand new flows or completely refactored flows, we would replace the entire flow XML (export from the lower environment and promote to the next environment). For minor changes, we would just manually re-apply them in the subsequent environment. (click) Another concern working with different environments in NiFi is that the NiFi canvas looks the same (grey background color) in each environment. We used a custom JavaScript plugin for Chrome and added code to change the canvas background color for each environment: green for DEV, yellow for UAT, and red for PROD. That way, we could be more careful when working in PROD.
  • #28: A few months ago in March, we upgraded from NiFi 1.3.0 to 1.5.0, after being on 1.3.0 for a good 8 months. There is no magic button for an in-place upgrade. What we did was set up a parallel NiFi cluster on different ports (and also a separate ZooKeeper cluster). (click) We stopped all the processors that ingested new data, let any in-flight messages finish processing (click), and then shut down the old NiFi instance. We then started the processors back up on the new 1.5.0 cluster. (click) There were two reasons we upgraded: to resolve a PROD issue where NiFi couldn’t start back up, and to use newer versions of the Kafka producer and consumer. Once in PROD, our data center lost power and all our servers shut down unexpectedly. When the servers came back up, we couldn’t start the NiFi instance due to a “No enum constant CONTENTMISSING” error (NIFI-4093), which apparently was fixed in NiFi 1.4. The JettyServer just couldn’t start up because of a bug where the wrong Enum was used to determine how to process an update to the FlowFile repository. At that time, the only way to get NiFi to start back up was to clear out the FlowFile repository, which meant all in-flight messages were lost. The other reason we upgraded was to keep up to date with the Kafka producer/consumer versions. We migrated to Kafka 0.11 early in the year, but NiFi 1.3.0 only had Kafka producers and consumers up to 0.10. Before we upgraded to NiFi 1.5.0, we downloaded the newer Kafka producer/consumer NAR files from NiFi 1.5.0 and used them in 1.3.0. Ever since upgrading to NiFi 1.5.0, our cluster has become much more stable and we haven’t faced any cluster issues in the last 3 months. Darryl will now talk about some of the good and bad with the Spark Streaming component.
  • #29: Darryl
  • #30: Darryl
  • #31: Darryl
  • #32: Darryl
  • #33: Darryl
  • #34: Darryl
  • #35: Darryl
  • #36: Darryl
  • #37: Darryl
  • #38: >> Darryl to speak first RBC is based in Toronto, Ontario, Canada, but we have offices around the world as well (New York, London, etc…) The Data & Analytics team is in Toronto and we are always looking to hire strong developers. Feel free to email me directly, connect over LinkedIn, or just visit jobs.rbc.com to explore available opportunities. Thanks