Mastering Customer
Data on Apache Spark
BIG DATA WAREHOUSING
MEETUP
AWS LOFT
APRIL 7, 2016
Agenda
6:30 Networking
Grab some food and drink... Make some friends.
6:50 Joe Caserta
President
Caserta Concepts
Welcome + Intro to BDW Meetup
About the Meetup. Why MDM needs Graph now.
7:15 Kevin Rasmussen
Big Data Engineer
Caserta Concepts
Deeper Dive into Spark/MDM/Graph
Deep dive into DUNBAR technology and how we
came up with it for Customer Data Integration
7:45 Vida Ha
Lead Solutions Engineer
Databricks
Using Apache Spark
Intro to the different components of Spark:
MLlib, GraphX, SQL, Streaming, Python, ETL
8:30 Q&A Ask Questions, Share your experience
About Caserta Concepts
• Consulting Data Innovation and Modern Data Engineering
• Award-winning company
• Internationally recognized work force
• Strategy, Architecture, Implementation, Governance
• Innovation Partner
• Strategic Consulting
• Advanced Architecture
• Build & Deploy
• Leader in Enterprise Data Solutions
• Big Data Analytics
• Data Warehousing
• Business Intelligence
• Data Science
• Cloud Computing
• Data Governance
Client Portfolio
Retail/eCommerce
& Manufacturing
Digital Media/AdTech
Education & Services
Finance, Healthcare
& Insurance
Partners
Awards & Recognition
This Meetup
CIL - Caserta
Innovations Lab
Experience
Big Data Warehousing Meetup
• Established in 2012 in NYC
• Meet monthly to share data best
practices, experiences
• 3,700+ Members
http://www.meetup.com/Big-Data-Warehousing/
http://www.slideshare.net/CasertaConcepts/
Examples of Topics
• Data Governance, Compliance &
Security in Hadoop w/Cloudera
• Real Time Trade Data
Monitoring with Storm &
Cassandra
• Predictive Analytics
• Exploring Big Data Analytics
Techniques w/Datameer
• Using a Graph DB for MDM &
Relationship Mgmt
• Data Science w/ Revolution
Analytics
• Processing 1.4 Trillion Events
in Hadoop
• Building a Relevance Engine
using Hadoop, Mahout & Pig
• Big Data 2.0 – YARN Distributed
ETL & SQL w/Hadoop
• Intro to NoSQL w/10GEN
MDM Information Ecosystem
[Diagram labels: Operational Master Data; Informational Master Data; Holistic Master Data Service; Leads; Policies; Claims; Enrolls; Sales; Finance; DW Dimensions & Cross-References; Marketing; Insights]
What is wrong with traditional approach to MDM
• Conceptual problems with the “enterprise” approach
• Long, complex implementations → low ROI
• Complex data model
• Too much human interaction
• Deliverable???
• Challenges with big data
• Data volumes
• Evolving data sources
• Need to further remove humans from the process
Graph Databases to the Rescue
• Hierarchical relationships are never rigid
• Relational models with tables and columns are not flexible enough
• Neo4j is the leading graph database
• Many MDM systems are going graph:
• Pitney Bowes - Spectrum MDM
• Reltio - Worry-Free Data for Life Sciences
How does a Graph DB help MDM
• Data is stored in its natural form → no mismatch between
requirements and data model
• Both Nodes and Relationships can have properties → supports sparse
and evolving data
• MDM for analytics → your MDM solution now delivers new
enablement, not just a back office system
• Relationship science
Open source and commercial
Gephi
Tom Sawyer
linkurio.us
Caserta Innovation Lab (CIL)
• Internal laboratory established to test & develop solution concepts and ideas
• Used to accelerate client projects
• Examples:
• Search (SOLR) based BI
• Big Data Governance Toolkit / Data Quality Sub-System
• Text Analytics on Social Network Data
• Continuous Integration / End-to-end streaming
• Recommendation Engine Optimization
• Relationship Intelligence / Spark Graph / CDI (Dunbar)
• CIL is hosted on
Introducing Dunbar (Relationship Intelligence / CDI )
Kevin Rasmussen
Big Data Engineer, Caserta Concepts
kevin@casertaconcepts.com
How many people do you know??
Anthropologists say it’s 150…
MAX
What if we could increase this number?
Opportunities
Closing
Revenue
Yeah, but that’s CRM 101… How can we improve this?
Expand Data Sources
Explore Relationships
Enhanced Insight
Project Dunbar - Internal
Developed in:
• Python
• Neo4j Database
• Build a social graph based on internal and external data
• Run pathing algorithms to understand strategic opportunity advantages
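As a rough illustration of what a "pathing algorithm" over a social graph does, here is a minimal breadth-first search in plain Python. It is only a sketch: Dunbar itself ran on Neo4j, and the contact names and the `shortest_path` helper below are hypothetical, made up for this example.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search over an adjacency dict; returns a node list or None."""
    if start == goal:
        return [start]
    previous = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in previous:
                continue
            previous[neighbor] = node
            if neighbor == goal:
                # Walk back from the goal to the start to rebuild the path.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None

# Hypothetical contacts: who knows whom (stored in both directions).
social_graph = {
    "me": ["alice", "bob"],
    "alice": ["me", "carol"],
    "bob": ["me"],
    "carol": ["alice", "prospect"],
    "prospect": ["carol"],
}

path = shortest_path(social_graph, "me", "prospect")
print(path)
```

The shortest chain of introductions is the kind of "strategic opportunity advantage" the slide refers to: the fewer hops to a prospect, the warmer the lead.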
Whoa… Not so FAST!
Throwing a bunch of unrelated points in a graph will not
give us a usable solution.
We need to
MASTER
our contact data…
• We need to clean and normalize our
incoming interaction and relationship
data (edges)
• Clean, normalize, and match our
entities (vertices)
Mastering Customer Data
Customer Data Integration (CDI)
is the process of consolidating and managing customer information from all
available sources.
In other words…
We need to figure out how to
LINK people across systems!
Steps required
Standardization
Matching
Survivorship
Validation
Traditional Standardization and Matching
Matching:
join based on combinations
of cleansed and standardized
data to create match results
Cleanse and Parse:
• Names
• Resolve nicknames
• Create deterministic hash,
phonetic representation
• Addresses
• Emails
• Phone Numbers
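The cleanse-and-parse step can be sketched in a few lines of standard-library Python. This is an assumption-laden illustration, not the tooling used in the talk: the nickname table, the helper names, and the choice of SHA-1 for the deterministic hash are all hypothetical, and the phonetic representation is left out.

```python
import hashlib
import re

# Hypothetical nickname table; a real one would be far larger.
NICKNAMES = {"bill": "william", "will": "william", "bob": "robert", "liz": "elizabeth"}

def standardize_name(raw):
    """Lowercase, strip punctuation, and resolve nicknames token by token."""
    tokens = re.sub(r"[^a-z\s]", "", raw.lower()).split()
    return " ".join(NICKNAMES.get(token, token) for token in tokens)

def standardize_email(raw):
    """Trim whitespace and lowercase the address."""
    return raw.strip().lower()

def standardize_phone(raw):
    """Keep digits only and drop a leading US country code."""
    digits = re.sub(r"\D", "", raw)
    return digits[1:] if len(digits) == 11 and digits.startswith("1") else digits

def match_key(*parts):
    """Deterministic hash: equal standardized inputs always give the same key."""
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()

key_a = match_key(standardize_name("Bill  O'Neil"), standardize_phone("+1 (212) 555-0100"))
key_b = match_key(standardize_name("william oneil"), standardize_phone("212.555.0100"))
print(key_a == key_b)
```

Two differently typed versions of the same person end up with the same key, which is what makes the later join-based matching possible.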
Great – But the NEW data is different
Reveal
• Wait for the customer to “reveal” themselves
• Create a link between anonymous self and known profile
Vector
• May need behavioral statistical profiling
• Compare use vectors
Rebuild
• Recluster all prior activities
• Rebuild the Graph
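One common way to "compare use vectors" is cosine similarity. The sketch below assumes each visitor is summarized as a list of visit counts per site section; the data, the customer IDs, and the choice of cosine similarity are illustrative assumptions, not something the slides specify.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical behavior vectors: visits per site section.
anonymous = [12, 0, 3, 7]
known_profiles = {
    "cust-1": [10, 1, 2, 8],
    "cust-2": [0, 9, 6, 0],
}

# Pick the known profile whose usage pattern is closest to the anonymous one.
best = max(known_profiles, key=lambda cid: cosine_similarity(anonymous, known_profiles[cid]))
print(best)
```

In practice a similarity threshold would be needed before linking an anonymous visitor to a known profile; picking the best candidate alone is not evidence of a match.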
..and then things got bigger
One of our customers wanted us to build it for them:
• Much larger dataset
• 6 million customers, > 30% duplication rate
• 100’s of millions of customer interactions
• Few direct links across channels
Why Spark
“Big Box” MDM tools vs ROI?
• Prohibitively expensive → limited by licensing $$$
• Typically limited to the scalability of a single server
We Spark!
• Local and distributed development are similar
• Beautiful high-level APIs
• Databricks Cloud is so easy
• Full universe of Python modules
• Open source and Free**
• Blazing fast!
Spark has become our
default processing engine
for a myriad of
engineering problems
Spark map operations
Cleansing, transformation, and standardization of both interaction and
customer data
Amazing universe of Python modules:
• Address Parsing: usaddress, postal-address, etc.
• Name Hashing: fuzzy, etc.
• Genderization: sexmachine, etc.
And all the goodies of the standard library!
We can now parallelize our workload against a large number of machines:
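The original slide showed code at this point that did not survive extraction. As a stand-in, here is a minimal sketch of a record-level cleansing function applied with a map. The field names and the `standardize_record` helper are hypothetical; the example uses Python's built-in `map` so it runs anywhere, and the same function could be handed to a PySpark `RDD.map` to spread the work across a cluster.

```python
def standardize_record(record):
    """Return a cleaned copy of one customer record (a plain dict)."""
    cleaned = dict(record)
    cleaned["email"] = record.get("email", "").strip().lower()
    cleaned["zip"] = record.get("zip", "").strip()[:5]            # keep 5-digit ZIP
    cleaned["name"] = " ".join(record.get("name", "").split()).title()
    return cleaned

# Hypothetical raw input rows.
raw_records = [
    {"id": 1, "name": "  joe   SMITH", "email": " Joe@Example.com ", "zip": "10001-1234"},
    {"id": 2, "name": "ann lee", "email": "ANN@EXAMPLE.COM", "zip": "07030"},
]

# Locally this is the built-in map; with PySpark the same function could be
# passed to RDD.map, e.g. sc.parallelize(raw_records).map(standardize_record).
clean_records = list(map(standardize_record, raw_records))
print(clean_records)
```

Because the function touches one record at a time and keeps no shared state, it parallelizes without any changes.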
Matching process
• We now have clean, standardized, linkable data
• We need to resolve the links between our customers
• Large table self joins
• We can even use SQL:
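The SQL from the original slide is not in the transcript. The sketch below shows the general shape of a self-join match using SQLite from the Python standard library so that it runs without a cluster; the table layout and sample rows are hypothetical, and the same kind of query could be issued through Spark SQL against a registered table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, phone TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [
        (1234, "2125550100", "joe@example.com"),
        (4849, "2125550100", "jsmith@example.com"),
        (5499, "6465550199", "jsmith@example.com"),
        (9001, "7185550123", "ann@example.com"),
    ],
)

# Self-join the table on each standardized attribute; a.id < b.id avoids
# matching a row with itself and emitting each pair twice.
matches = conn.execute(
    """
    SELECT a.id AS xid, b.id AS yid, 'phone' AS match_type
    FROM customer a JOIN customer b ON a.phone = b.phone AND a.id < b.id
    UNION ALL
    SELECT a.id, b.id, 'email'
    FROM customer a JOIN customer b ON a.email = b.email AND a.id < b.id
    ORDER BY xid, yid
    """
).fetchall()
print(matches)
```

Each match type contributes its own join, and the union of the results is the edge list (xid, yid, match_type) used in the next step.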
Matching process
The matching process output gives us the relationships between customers:
xid    yid    match_type
1234   4849   phone
4849   5499   email
5499   1235   address
4849   7788   cookie
5499   7788   cookie
4849   1234   phone
Great, but it’s not very usable: you need to traverse the dataset to find out that 1234
and 1235 are the same person (and this is a trivial case).
And we need to cluster and identify our survivors (vertices).
GraphX to the rescue
Don’t think table…
think Graph!
These matches are actually communities.
We just need to import our edges into a graph and “dump”
out communities.
[Figure: graph connecting customer IDs 1234, 4849, 5499, 7788, and 1235]
Connected components
Connected Components algorithm labels each connected
component of the graph with the ID of its lowest-numbered
vertex
This lowest number vertex can serve as our “survivor” (not
field survivorship)
Is it possible to write less code?
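To make the labeling concrete, here is a small union-find sketch in plain Python that assigns every vertex the lowest ID in its component. It is not GraphX code and does not run distributed; it only mirrors the output described above. The edges are the match results from the earlier slide, plus one hypothetical unrelated pair (9001, 9002) added to show a second component.

```python
def connected_components(edges):
    """Map every vertex to the lowest vertex ID in its component (union-find)."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving keeps trees shallow
            v = parent[v]
        return v

    for x, y in edges:
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            # Always keep the smaller ID as the root so it becomes the label.
            low, high = sorted((root_x, root_y))
            parent[high] = low

    return {v: find(v) for v in parent}

edges = [
    (1234, 4849), (4849, 5499), (5499, 1235),
    (4849, 7788), (5499, 7788), (4849, 1234),
    (9001, 9002),  # hypothetical extra pair, not from the slides
]

labels = connected_components(edges)
print(labels)
```

Every ID in the first community, including 1235, is labeled 1234, so 1234 is the "survivor" record for that customer.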
Field level survivorship rules
We now need to survive fields to our survivor to make it a “best record”.
Depending on the attribution we choose:
• Latest reported
• Most frequently used
We then do some simple ranking
(thanks, window functions)
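The two rules above can be sketched as a per-field ranking in plain Python. This is an illustration under assumptions: the observation layout, the `RULES` mapping, and the tie-break on report date are hypothetical. In SQL the same idea would typically be a ROW_NUMBER() window partitioned by survivor and field.

```python
from collections import Counter, defaultdict

# Hypothetical observations: (survivor_id, field, value, reported_on ISO date)
observations = [
    (1234, "email", "joe@old.example.com", "2014-03-01"),
    (1234, "email", "joe@example.com", "2016-01-15"),
    (1234, "phone", "2125550100", "2015-06-01"),
    (1234, "phone", "2125550100", "2015-09-12"),
    (1234, "phone", "6465550199", "2016-02-20"),
]

# Which rule applies to which field (an assumption for this example).
RULES = {"email": "latest", "phone": "most_frequent"}

def best_record(rows, rules):
    """Pick one winning value per (survivor, field) according to the rule."""
    grouped = defaultdict(list)
    for survivor_id, field, value, reported_on in rows:
        grouped[(survivor_id, field)].append((reported_on, value))

    result = defaultdict(dict)
    for (survivor_id, field), candidates in grouped.items():
        if rules.get(field) == "most_frequent":
            counts = Counter(value for _, value in candidates)
            # Rank by frequency, break ties with the latest report date.
            winner = max(candidates, key=lambda c: (counts[c[1]], c[0]))[1]
        else:
            winner = max(candidates)[1]  # latest reported; ISO dates sort as text
        result[survivor_id][field] = winner
    return dict(result)

golden = best_record(observations, RULES)
print(golden)
```

The email comes from the most recent report, while the phone number seen twice wins over the newer one seen once.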
What you really want to know...
With a customer dataset of approximately
6 million customers and 100’s of millions of
data points.
… when directly compared to traditional
“big box” enterprise MDM software
Was Spark faster….
Map operations like cleansing and
standardization saw a
10x improvement
on a 4 node r3.2xlarge cluster
But the real winner was the “Graph Trick”…
Survivorship step reduced from
2 hours 5 minutes
Connected Components
< 1 minute
** Includes loading RDDs from S3, building the graph, and running connected components
Next Steps…
• GraphFrames
• In development by Berkeley AMPLab
• Combines the full graph-algorithm nature of
GraphX with the relationship mining of Neo4j
Thank You / Q&A
Kevin Rasmussen
Big Data Engineer, Caserta Concepts
kevin@casertaconcepts.com
