©2017 Discover Financial Services - Confidential and Proprietary - Do not copy or distribute
Next-gen Data Flow Platform for the Enterprise
Santosh Bardwaj
Vice President, Advanced Analytics
The opinions expressed in this presentation are those of the presenters,
in their individual capacities, and not necessarily those of Discover.
Agenda
1. What it takes to build an enterprise-ready platform
2. Discover’s next-gen data ingestion platform built on NiFi
3. Challenges and how we overcame them
4. Next steps with the platform
Discover is a leading U.S. direct bank & payments partner
• $37Bn Consumer Deposits
• $9Bn Private Student Loans
• $7Bn Personal Loans
• 1 in 4 Households (1)
• $60Bn in Credit Card Receivables
• Leading Cash Rewards
• $183Bn Payment Services Volume
• 185+ Countries/Territories
Deposits & Lending
Note(s)
Balances as of March 31, 2017; volume based on the trailing four quarters ending 1Q17; direct-to-consumer deposits include affinity deposits
1. TNS’ Consumer Payment Strategies Study
Advancing our data-analytic capabilities
Advanced Analytics Capabilities
• Ingest, classify and transform data from “source to insight” in minutes
• Centralize data, next-generation analytic tools and reporting on the Hadoop Data Lake
• Extend the Data Lake and advanced analytic stack on the Cloud to enable speed to market
• Operationalize business use cases leveraging advanced analytic capabilities
• Provide real-time customer insight and rapid deployment of new strategies into the decision engines
From hours to minutes
Built around a foundation of a continuous data pipeline and hybrid data-analytic lake
Unified data ingestion platform built on NiFi
Unified data ingestion platform
• Ingest data from source systems
• Push to the Enterprise Data Lake
• Governed process leveraging common, reusable templates
What is NiFi?
• Enables automated data flow management
• Acquires data from producers
• Delivers to consumers while orchestrating the flow
Why we chose NiFi to build our data ingestion platform
• Scalable and customizable
• Provenance
• Promotes reuse
• Secure
• User interface (drag & drop)
The next-gen platform built on NiFi and Spark is designed to streamline our data pipeline into a near real-time paradigm
[Diagram: nightly batch to near real-time]
• ~24 hours: Operational Database, database file extracts, SFTP, Raw Data Lake (flat file; limited user access and tools), ETL Grid, Source of Truth, ETL Grid, Enterprise DW
• Minutes: NiFi, Spark, Enterprise Data Lake (Hortonworks) holding raw data, source of truth, and source of truth (enriched)
• Phase 1: “True Sourcing”
• Phase 2: “Enriched Sourcing”
We are also extending this capability into the cloud
[Diagram: hybrid on-premise and cloud data flow]
• Components: batch sources, event bus, On-premise Data Lake, Operational Data Store, AWS Data Lake, model scoring/decisioning, real-time analytics
• Feeds: mini-batch, real-time, history
• Technologies shown: Kafka, Hortonworks, Amazon S3, Spark
Data Flow Categorization within the Hadoop Data Lake
• System of Record (SOR)
• Source of Truth (SOT)
• Source of Truth – Enriched (SOT-E)
Detail flow and foundational components
Stages: SOURCES, RAW, SOR, SOT, SOT-E
Detail flow
• Source files
• Landing area
• File catalog
• Convert to standard format
• Schema evolution
• Apply schema changes
• Raw data consumable
• Technical metadata
• Business metadata
• DQ checks
• Data enrichment (business transformation)
• Ability to export data out of Lake
Foundational components
• Continuous integration
• Monitoring
• Data lineage
• Data governance
• Exception handling
• Security
• Data reconciliation
Ingesting complex data - How complex?
File formats vary: some are easy to consume, others are hard
Example: Records with Dynamic arrays/vectors of primitives or strings
Schema: First Name, Last Name, Array_size of Sibling_Name[], Sibling_Name[0-N], City
Data:
John, Doe, 2, Susie, Chris, Chicago
Mary, Johnston, 3, Ashley, Tom, Mike, Atlanta
Frank, Smith, 1, Ralph, Toronto
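The counted-array layout above can be read with a length-prefixed slice. The following is a minimal Python sketch, not the processor described in the deck; the `Person` class and `parse_person` function are hypothetical names introduced only for illustration, and it assumes fields never contain embedded commas.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Person:
    first_name: str
    last_name: str
    siblings: List[str]
    city: str


def parse_person(line: str) -> Person:
    """Parse 'First, Last, N, Sibling_1 .. Sibling_N, City' (illustrative only)."""
    fields = [f.strip() for f in line.split(",")]
    first, last = fields[0], fields[1]
    count = int(fields[2])           # Array_size of Sibling_Name[]
    siblings = fields[3:3 + count]   # the next `count` fields form the array
    rest = fields[3 + count:]        # whatever follows the array
    # Exactly one field (City) must remain once the array is consumed.
    if len(siblings) != count or len(rest) != 1:
        raise ValueError(f"malformed record: {line!r}")
    return Person(first, last, siblings, rest[0])


records = [
    "John, Doe, 2, Susie, Chris, Chicago",
    "Mary, Johnston, 3, Ashley, Tom, Mike, Atlanta",
    "Frank, Smith, 1, Ralph, Toronto",
]
people = [parse_person(r) for r in records]
```

The point of the sketch is that the position of `City` is not fixed: it can only be found after the array size has been read, which is why a plain column-based loader cannot consume such files.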
Example: Records with an array of Struct data types
Schema: First Name, Array_size of CompanyStruct[], CompanyStruct.Name, CompanyStruct.City,
CompanyStruct.YearsWorked, Age
Data:
John, 1, Discover, Chicago, 3, 44
Mary, 3, Sales Unlimited, Dallas, 2, Auditors R’ Us, Atlanta, 5, Discover, Chicago, 4, 35
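An array of structs follows the same idea, except that each element spans a fixed number of fields. This is a hedged Python sketch under the same assumptions (no embedded commas); `Company`, `Employee`, `STRUCT_WIDTH` and `parse_employee` are hypothetical names, not part of the platform described here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Company:
    name: str
    city: str
    years_worked: int


@dataclass
class Employee:
    first_name: str
    companies: List[Company]
    age: int


STRUCT_WIDTH = 3  # CompanyStruct.Name, CompanyStruct.City, CompanyStruct.YearsWorked


def parse_employee(line: str) -> Employee:
    """Parse 'First, N, (Name, City, YearsWorked) x N, Age' (illustrative only)."""
    fields = [f.strip() for f in line.split(",")]
    first = fields[0]
    count = int(fields[1])              # Array_size of CompanyStruct[]
    end = 2 + count * STRUCT_WIDTH      # index of the Age field
    # The record must hold exactly the structs plus one trailing Age field.
    if len(fields) != end + 1:
        raise ValueError(f"malformed record: {line!r}")
    companies = [
        Company(fields[i], fields[i + 1], int(fields[i + 2]))
        for i in range(2, end, STRUCT_WIDTH)
    ]
    return Employee(first, companies, int(fields[end]))


john = parse_employee("John, 1, Discover, Chicago, 3, 44")
```

Because the expected field count is derived from the array size, a record with a missing delimiter is rejected with a `ValueError` rather than silently shifted by one column.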
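Our solution – A custom NiFi processor to handle complex data types
[Diagram: within a NiFi process group, a Spark converter takes the source data file (Data File.001) and the Discover schema (schema.json) and produces an Avro data file (Data File 001.avro) with its schema (Data File.avsc), which feed the ingestion pipeline. Zone labels: System of Record; Source of Truth - Source.]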
Continuous improvement of real-time data ingestion using NiFi
NiFi Ingestion Flow Version I
Source: Flat File. Destination: Hadoop
24 hours
NiFi Ingestion Flow Version II
Source: Event Bus. Destination: Hadoop
Complex logic, limited scale
NiFi Ingestion Flow Version III
Source: Event Bus. Destination: Hadoop
Custom NiFi processor developed in-house, reusable and scalable
Seconds
ETL on Hadoop progression
Data enrichment from SOR to SOT (~600 jobs)
• Version I: Traditional ETL tool. Run time: ~18 hours
• Version II: ETL on HiveQL. Run time: ~8 hours
• Version III: ETL on Spark (hand-coded). Run time: ~1 hour
• Coming soon: Automated (flow-based) ETL on Spark
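As a rough check on the scale of the improvement, the three approximate run times quoted on this slide (~18, ~8 and ~1 hours, taken in the order they appear) can be turned into speedup factors relative to the first figure. The variable names are illustrative, and the inputs are the slide's rounded values, so the results are approximate.

```python
# Approximate run times from the slide, in hours, in the order listed.
run_times_hours = [18, 8, 1]

baseline = run_times_hours[0]
# Speedup of each figure relative to the first one.
speedups = [baseline / t for t in run_times_hours]
# speedups == [1.0, 2.25, 18.0]
```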
Upcoming enhancements to our data pipeline
• Integrating data quality, catalog into NiFi flow
• Custom processors to parse complex data structures
• Enterprise scale ETL on Hadoop using Spark
• Self-service data pipelines
• Integrating batch and real-time data pipelines
Hiring Data Engineers
Q & A

Continuous Data Ingestion pipeline for the Enterprise

Editor's Notes

  • #5: Discover has a tradition of operating on mature data-analytic platforms such as TD and SAS. These platforms are proprietary and expensive. Since the beginning of this decade, several key trends have influenced the future of the industry: big data, open-source tools, real-time analytics, and cloud. Business goal: reinvent our key decisioning platforms such as fraud, credit decisioning, and collections, with faster and richer data, better-quality insights, and faster development and deployment. The technology foundation consists of Hadoop and a new data pipeline. Collectively, these should help improve our speed to market from days or hours to minutes.
  • #11: Multiple record formats can appear within a single file. Records contain complex data structures (sub-records, dynamic arrays/vectors). Formats include fixed-width, single- and multi-delimited, and mainframe.
  • #12: Systematically convert source files to a standard format with schema information attached. Apply our own “Discover Schema” (stored in JSON) to the raw source file (or use a CopyBook for mainframe files). Feed the source data and our “Discover Schema” into a Spark application. The “Discover Schema” is needed so our converter knows how to parse the incoming data file. The output is an Avro data file along with its corresponding .avsc schema. The Avro data and schema are then passed on to the ingestion pipeline for Hive loading and further processing.
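The schema half of that conversion can be sketched in a few lines of Python. The deck does not show what the “Discover Schema” JSON looks like, so the layout below (a `name`, a list of `fields`, and a `repeated` flag for counted arrays) is an invented stand-in, and `to_avsc` is a hypothetical helper. The converter described in the notes is a Spark application; this sketch only illustrates how a field description could map onto an Avro record schema, and it does not write Avro data.

```python
import json

# Hypothetical stand-in for a "Discover Schema"; the real layout is not shown in the deck.
discover_schema = {
    "name": "Person",
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "last_name", "type": "string"},
        {"name": "sibling_names", "type": "string", "repeated": True},
        {"name": "city", "type": "string"},
    ],
}


def to_avsc(schema: dict) -> dict:
    """Translate the assumed schema layout into an Avro record schema."""
    fields = []
    for field in schema["fields"]:
        avro_type = field["type"]
        if field.get("repeated"):
            # A counted array in the source file becomes an Avro array type.
            avro_type = {"type": "array", "items": avro_type}
        fields.append({"name": field["name"], "type": avro_type})
    return {"type": "record", "name": schema["name"], "fields": fields}


avsc = to_avsc(discover_schema)
avsc_text = json.dumps(avsc, indent=2)  # contents of a .avsc file
```

An `.avsc` file is plain JSON, which is why the standard `json` module is enough for this part; producing the `.avro` data file itself would need an Avro library.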