Standardized Data Management
Data Ingestion Platform
Arun Manivannan
Senior Data Engineer
Our Data
History
Our Dataflow
Discuss some interesting problems
Questions
So, what do we do now?
Data in SCB context
1. Variety of applications - Web, Batch, Mainframes
2. Hundreds of applications (~180 data lake source apps)
3. Variety of consumption patterns
4. Multi-regulatory guided data storage
5. 50+ countries
History
<2012: Traditional group-wise Data Warehouses
2012: Enterprise Data Management on Teradata
2014: EDM on Hadoop
Until a year ago: Legacy Ingestion framework
Now: Unified ingestion platform
What is our view of Ingestion Framework?
RECEIVE, CLEANSE, VALIDATE, RECORD, CONSTRUCT
PREPROCESSING, PROCESSING
SECURITY AND LINEAGE TRACKING
Cleanse
Validate
Essential pre-processing
1. Column and Row count validation
2. Embedded new line removal and special character replacement
3. Datatype validation
4. Data transformation
5. Value defaulting
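A minimal Python sketch of these pre-processing checks, not the framework's actual code. The schema, the pipe delimiter, the function names and the "accumulate lines until the column count is reached" strategy for embedded new lines are all assumptions made for illustration. Data transformation is left out because the slides do not describe it.

```python
# Hypothetical schema: (column name, type used for validation, default for empty values)
SCHEMA = [
    ("trans_id", str, None),
    ("user_id", str, "UNKNOWN"),
    ("amount", float, 0.0),
]


def merge_embedded_newlines(lines, n_columns, delimiter="|"):
    """Join physical lines until a record has the expected number of columns.

    Limitation of this sketch: a new line embedded in the LAST column
    cannot be detected this way.
    """
    buffer = ""
    for line in lines:
        line = line.rstrip("\r\n")
        buffer = f"{buffer} {line}" if buffer else line
        if buffer.count(delimiter) >= n_columns - 1:
            yield buffer
            buffer = ""
    if buffer:
        yield buffer  # incomplete record; column count validation rejects it


def replace_special_characters(text, replacement=" "):
    """Replace non-printable characters (tabs, control codes) with a space."""
    return "".join(ch if ch.isprintable() else replacement for ch in text)


def validate_record(record, schema=SCHEMA, delimiter="|"):
    """Column count validation, datatype validation and value defaulting."""
    fields = record.split(delimiter)
    if len(fields) != len(schema):
        raise ValueError(f"expected {len(schema)} columns, got {len(fields)}")
    row = {}
    for raw, (name, cast, default) in zip(fields, schema):
        raw = raw.strip()
        if raw == "":
            row[name] = default  # value defaulting
        else:
            try:
                row[name] = cast(raw)  # datatype validation
            except ValueError as exc:
                raise ValueError(f"column {name!r}: bad value {raw!r}") from exc
    return row


def preprocess(lines, expected_rows, schema=SCHEMA, delimiter="|"):
    """Run all checks and compare the row count with the expected one."""
    records = merge_embedded_newlines(lines, len(schema), delimiter)
    rows = [
        validate_record(replace_special_characters(r), schema, delimiter)
        for r in records
    ]
    if len(rows) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(rows)}")
    return rows
```

Calling `preprocess(lines, expected_rows=3)` on a three-record file returns one dict per record, or raises `ValueError` at the first failed check.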
Record
Construct
Backing technologies
Change Data Capture
Generation 2 Awesomeness
1. Faster development cycle for new applications. Easier to reason about the flow with NiFi’s visual flow representation.
2. Consistent tooling for preprocessing, ops data management, error reporting and archival
3. Significantly faster processing through Spark
4. ORC performs well for most of our consumption patterns, supporting both predicate and projection pushdown.
5. Security and managed concurrency
Extending NiFi via Custom Processors
Good Problems
Don’t bring me anything but trouble.
Good news weakens me.
- Charles Kettering
Types of Data
» Master (e.g. Customer data)
» Transaction (e.g. Banking transactions)
» Transaction data as Master (e.g. editable transactions)
Types of Data
[Figure: Master vs. Transactional data, History vs. Current. Labels: Data accumulated until Yesterday; Today’s changes; Today’s EOD snapshot data; Monday’s, Tuesday’s and Wednesday’s snapshot data.]
Frequency
» Streaming
» Hourly incremental
» Daily
» Weekly
Frequency - Hourly Incremental
[Figure: Data accumulated until Tuesday; Wednesday’s hourly changes; Wednesday’s snapshot data.]
Data formats
» Output of Change Data Capture systems (Delimited)
» Plain delimited
» Fixed width
» Avro messages
» XML
» Spreadsheets
Data formats - CDC Delimited
TIMESTAMP, TRANSACTION_ID, OPERATION_TYPE, USER_ID,<DATAFIELDVALUE1>,<DATAFIELDVALUE2>….
2017-07-20 22:38:04.00000|426664065479|B|DWUSER|OFFICE TELEPHONE NO|21234567890
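CDC columns: TIMESTAMP, TRANSACTION_ID, OPERATION_TYPE, USER_ID
Application columns: the data field values that follow
OPERATION_TYPE could be one of I, B, A, D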
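A small Python sketch of splitting such a line into its CDC columns and application columns. It assumes the pipe delimiter and timestamp layout of the sample record; `CdcRecord` and `parse_cdc_line` are hypothetical names, not part of the framework.

```python
from dataclasses import dataclass
from datetime import datetime

CDC_COLUMN_COUNT = 4                      # TIMESTAMP, TRANSACTION_ID, OPERATION_TYPE, USER_ID
VALID_OPERATIONS = {"I", "B", "A", "D"}   # values listed on the slide


@dataclass
class CdcRecord:
    timestamp: datetime
    transaction_id: str
    operation_type: str
    user_id: str
    values: list  # application columns, kept as raw strings


def parse_cdc_line(line, delimiter="|"):
    """Split one CDC delimited line into CDC columns and application columns."""
    fields = line.rstrip("\r\n").split(delimiter)
    if len(fields) < CDC_COLUMN_COUNT:
        raise ValueError(f"expected at least {CDC_COLUMN_COUNT} columns, got {len(fields)}")
    timestamp, transaction_id, operation_type, user_id = fields[:CDC_COLUMN_COUNT]
    if operation_type not in VALID_OPERATIONS:
        raise ValueError(f"unknown operation type {operation_type!r}")
    return CdcRecord(
        # Timestamp layout assumed from the sample record
        timestamp=datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S.%f"),
        transaction_id=transaction_id,
        operation_type=operation_type,
        user_id=user_id,
        values=fields[CDC_COLUMN_COUNT:],
    )


record = parse_cdc_line(
    "2017-07-20 22:38:04.00000|426664065479|B|DWUSER|OFFICE TELEPHONE NO|21234567890"
)
```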
CDC Delimited - Snapshot building
[Figure: Data accumulated until Yesterday and Today’s changes feed Snapshot building, which outputs History and Snapshot.]
D records move to History; I or latest A moves to Snapshot
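The snapshot-building rule (D records to History, I or latest A to Snapshot) can be sketched in Python as below. This is an illustration under stated assumptions, not the framework's implementation: records are plain dicts, `key_of` is a hypothetical function returning a record's business key, and sending B records to History is an assumption, since the slides only describe the I, A and D cases.

```python
def build_snapshot(previous_snapshot, changes, key_of):
    """Apply one day's CDC changes to the previous snapshot.

    previous_snapshot: dict mapping business key -> latest record
    changes: iterable of dicts with at least "timestamp" and "op" entries
    key_of: function returning the business key of a record (hypothetical)
    Returns (new_snapshot, records_for_history).
    """
    snapshot = dict(previous_snapshot)
    history = []
    # Process changes in timestamp order so the latest A record wins
    for change in sorted(changes, key=lambda c: c["timestamp"]):
        op = change["op"]
        key = key_of(change)
        if op in ("I", "A"):
            snapshot[key] = change       # I or latest A moves to Snapshot
        elif op == "D":
            snapshot.pop(key, None)      # deleted rows leave the snapshot
            history.append(change)       # D records move to History
        elif op == "B":
            history.append(change)       # assumption: B records are kept in History
        else:
            raise ValueError(f"unknown operation type {op!r}")
    return snapshot, history
```

Timestamps only need to sort correctly, so ISO-style strings or `datetime` objects both work here.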
Data formats - Delimited with Header/Trailer
H3
TRANS_ID_1|USER_ID1|100.00
TRANS_ID_2|USER_ID2|3.00
TRANS_ID_3|USER_ID1|20.00
T123.00
» Header and trailer have meta information about the data in the file
» Varieties of header and trailer formats:
  » Row count of the file
  » Sum or a rolling checksum of column(s) in that file
* Header and trailer information is used during reconciliation
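A minimal Python sketch of reconciling the sample file above. It assumes that `H3` carries the row count (3) and that `T123.00` carries the sum of the third column (100.00 + 3.00 + 20.00); the slides state only that headers and trailers hold row counts and sums or checksums, so this reading of the sample, along with the `reconcile` name and its parameters, is an assumption.

```python
from decimal import Decimal


def reconcile(lines, delimiter="|", sum_column=2):
    """Check the header row count and trailer sum against the data rows."""
    rows = [line.strip() for line in lines if line.strip()]
    if len(rows) < 2 or not rows[0].startswith("H") or not rows[-1].startswith("T"):
        raise ValueError("file must start with an H record and end with a T record")
    header, body, trailer = rows[0], rows[1:-1], rows[-1]

    expected_rows = int(header[1:])       # "H3" -> 3
    expected_sum = Decimal(trailer[1:])   # "T123.00" -> Decimal("123.00")

    # Decimal avoids floating point drift when summing monetary amounts
    actual_sum = sum(
        (Decimal(row.split(delimiter)[sum_column]) for row in body),
        Decimal("0"),
    )
    return {
        "row_count_ok": len(body) == expected_rows,
        "sum_ok": actual_sum == expected_sum,
        "actual_rows": len(body),
        "actual_sum": actual_sum,
    }


sample = [
    "H3",
    "TRANS_ID_1|USER_ID1|100.00",
    "TRANS_ID_2|USER_ID2|3.00",
    "TRANS_ID_3|USER_ID1|20.00",
    "T123.00",
]
result = reconcile(sample)
```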
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks for EDM, Our Unified Data Platform
The final piece to the puzzle
PROCESSING
Among others already discussed,
» Schema evolution
» Cascading re-runs
» Full-dump override
More awesomeness
1. Metadata management UI
2. Schema evolution & retrofitting historic data
3. Reruns, cascading reruns, full-dump override
4. One-touch deployment
To summarise
1. Our code is (still) manageable and extensible
2. The volume, the variety and the interesting combinations of data presented us with some very interesting problems to solve.
3. And… we intend to open source the framework for the general audience.
4. With HDF and HDP, we were able to abstract away the cross-cutting concerns such as security and concurrency.
THANK YOU !
Questions?
@arunma
