Entity Resolution Service
Bringing Petabytes of Data Online for Instant Access
Gurpreet Singh, eBay
Ying Zhang, eBay
Problem Statement
eBay maintains hundreds of millions of accounts across our properties
and partners. The account data is sometimes unstructured, comes in
different formats and character sets, and changes over time. Identifying
which accounts belong to the same person enables us to personalize each
customer's experience, deliver great customer service, and fight fraud.
ERS brings xID, along with device information, online.
Background
Diagram: xIDpii, xIDguid, ERS
xID
Links user credentials and transient identities to recognize a customer:
– Hundreds of millions of user accounts.
– 20+ different linking strategies, with the flexibility to add more.
– Billions of cookies, browser fingerprints, and all mobile identities.
– Customer recognition across sessions, sites, browsers, and devices.
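The idea of many linking strategies that can be extended over time could be sketched as a small registry. This is only an illustration under stated assumptions: the slides do not name the strategies, so the `email` and `device` rules, the account fields, and every function name below are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable, Dict, List, Optional, Tuple

Account = Dict[str, str]
Edge = Tuple[str, str, str]  # (account_id_a, account_id_b, strategy_name)

# Hypothetical registry: strategy name -> function returning a link key or None.
STRATEGIES: Dict[str, Callable[[Account], Optional[str]]] = {}


def strategy(name: str):
    """Register a linking strategy; adding a new one needs no other changes."""
    def register(fn):
        STRATEGIES[name] = fn
        return fn
    return register


@strategy("email")
def by_email(acct: Account) -> Optional[str]:
    # Assumed rule: accounts with the same normalized email are linked.
    email = acct.get("email", "").strip().lower()
    return email or None


@strategy("device")
def by_device(acct: Account) -> Optional[str]:
    # Assumed rule: accounts seen on the same device are linked.
    return acct.get("device_id") or None


def build_edges(accounts: List[Account]) -> List[Edge]:
    """Emit one edge per pair of accounts that share a key under any strategy."""
    edges: List[Edge] = []
    for name, key_fn in STRATEGIES.items():
        buckets = defaultdict(list)
        for acct in accounts:
            key = key_fn(acct)
            if key is not None:
                buckets[key].append(acct["id"])
        for ids in buckets.values():
            edges.extend((a, b, name) for a, b in combinations(sorted(ids), 2))
    return edges
```

Each strategy only produces a key, so pairing logic lives in one place and a new edge type is a single decorated function. Pairwise expansion is quadratic per bucket, which is fine for a sketch but would need care at the scale the slides describe.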
xID Modular Solution
Modules: Sources, Edges, Graph, Table

Account   Entity
102832    10921
236896    10921
786273    10921
324324    23987
349709    73652
152631    73652
543273    37726
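The table above maps several accounts to one entity. One common way to derive such a mapping from pairwise edges is connected components via union-find; the sketch below assumes that approach, which the slides do not confirm. The function name is hypothetical, and the smallest account number stands in for the entity identifier, whereas the slide's entity IDs (such as 10921) are evidently assigned separately.

```python
from typing import Dict, Iterable, Tuple


def resolve_entities(edges: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Group linked accounts; return account -> representative account."""
    parent: Dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra == rb:
            return
        # Keep the smaller account number as representative (deterministic).
        if rb < ra:
            ra, rb = rb, ra
        parent[rb] = ra

    for a, b in edges:
        union(a, b)
    return {account: find(account) for account in list(parent)}


if __name__ == "__main__":
    # Illustrative edges only; the slides do not list the actual edges.
    sample = [(102832, 236896), (236896, 786273), (349709, 152631)]
    for account, entity in sorted(resolve_entities(sample).items()):
        print(account, entity)
```

Accounts with no edges never appear in the result and would each form their own entity, as 324324 and 543273 do in the table.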
SOURCES – Overall Data Flow
Architecture
Monitoring
Technology Stack
ERS
Enables access to the consolidated customer identity:
– RESTful service with ~2 ms response time
– Optimized access to customer identities
– 10K+ requests per second
– 70-node Couchbase cluster with 11 TB of memory across 3 data centers
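A rough back-of-envelope reading of these figures, as a sketch only. It assumes the 11 TB is the total across all 70 nodes, that memory and traffic are spread evenly, and that 1 TB = 1000 GB; none of this is stated on the slide. The in-flight estimate uses Little's law (concurrency = arrival rate × time in system), and because the slide gives "10K+" the results are floors rather than measurements.

```python
NODES = 70
TOTAL_MEMORY_TB = 11           # assumed to be the cluster-wide total
REQUESTS_PER_SECOND = 10_000   # slide says "10K+", so this is a floor
RESPONSE_TIME_S = 0.002        # ~2 ms


def per_node(total: float, nodes: int) -> float:
    """Even split of a cluster-wide quantity across nodes (an assumption)."""
    return total / nodes


def in_flight_requests(rps: float, latency_s: float) -> float:
    """Little's law: average number of requests being served at once."""
    return rps * latency_s


memory_per_node_gb = per_node(TOTAL_MEMORY_TB * 1000, NODES)
requests_per_node = per_node(REQUESTS_PER_SECOND, NODES)
in_flight = in_flight_requests(REQUESTS_PER_SECOND, RESPONSE_TIME_S)

print(f"memory per node   ~{memory_per_node_gb:.0f} GB")
print(f"requests per node ~{requests_per_node:.0f} per second")
print(f"requests in flight ~{in_flight:.0f}")
```

Under these assumptions each node holds roughly 157 GB and serves about 143 requests per second, with around 20 requests in flight across the service at any moment.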
Challenges
LARGE DATASET
SKILLSET
COMPLEX
TIMELINE
FUTURE GROWTH
Challenge: Large dataset
– Raw clickstream + PII on the order of petabytes
– 20+ edge types
– 100+ billion edges
Challenge: Complex
– Graph join process
– Geo-redundancy requirement
Challenge: Timeline
– Downstream waiting
– Infrastructure
– Deliver it ASAP!
Challenge: Skillset
– NoSQL knowledge
– Near-real-time processing
Challenge: Future growth
– More business use cases
– Data volume
– Data variability
– Data velocity
Key Takeaways
• Optimize the Couchbase (CB) data load to prevent negative service impact
• Reduce JSON size to increase the CB active resident ratio
• Achieve geo-redundancy through data replication
• Enterprise big data solutions are complex (velocity, volume, and variability)
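The JSON-size takeaway can be illustrated with a minimal sketch: smaller documents let more of them stay resident in memory. The talk does not say how the documents were shrunk, so the techniques here (short keys, dropping empty fields, compact separators), the field names, and the `SHORT_KEYS` mapping are all assumptions for illustration.

```python
import json
from typing import Any, Dict

# Hypothetical mapping from verbose field names to short keys.
SHORT_KEYS = {"entity_id": "e", "account_ids": "a", "device_ids": "d"}


def compact(doc: Dict[str, Any]) -> str:
    """Serialize a document with short keys, no empty fields, no whitespace."""
    slim = {}
    for key, value in doc.items():
        if value in (None, "", [], {}):
            continue  # empty fields cost bytes but carry no information
        slim[SHORT_KEYS.get(key, key)] = value
    return json.dumps(slim, separators=(",", ":"))


def encoded_size(text: str) -> int:
    """Size in bytes of the UTF-8 encoded document."""
    return len(text.encode("utf-8"))


if __name__ == "__main__":
    doc = {
        "entity_id": 10921,
        "account_ids": [102832, 236896, 786273],
        "device_ids": [],
        "last_seen": None,
    }
    before = encoded_size(json.dumps(doc))
    after = encoded_size(compact(doc))
    print(f"{before} bytes -> {after} bytes")
```

The trade-off is readability: readers of the stored documents need the same key mapping to interpret them, so the mapping has to be versioned alongside the service.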
THE END
