Building a Federated Data Directory Platform for Public Health
Mark Paul
Engineering Manager
Anshul Bajpai
Data Engineering Lead
Agenda
1. Problems with Centralised Data Directories
2. Solution: Federated Data Directory Platform
3. Design Patterns
4. Intelligent System of Record Ranking
5. Architecture Patterns
▪ Australian digital health
infrastructure
▪ National directory of health
services and the practitioners
who provide them
▪ National, government-owned,
not-for-profit organization
▪ Trusted health information and
advice for all Australians
#1 Australian health
information website
4.8m community
connections each month
Problems with Centralised
Data Directories
Healthcare Directories - Critical Healthcare Infrastructure
▪ Enables Care Coordination
▪ Single Point-of-Failure
▪ Bad Data Quality = Clinical Risk to
Patients
Healthcare Directories - Problems
Data Updated via Content
Management Systems and Call
Centres
This Model Is Reactive and Inefficient!
Data volatility (High Frequency of change to data)
Basic Centrally Managed Databases
Applications
Solution: Federated Data
Directory Platform
Federating Data is a Powerful Concept
Federated Database:
▪ Maps multiple autonomous database systems into
a single federated datastore
Federated Data Platform:
▪ Controlled aggregation to create “gold-standard
data” by using multiple Autonomous Origin Data
Sources
▪ Data Aggregation via Event Sourcing pipelines
Building the Federated Data “Puzzle”
Federal, State, Public/Private Hospitals, EMR, and other Commercial Vendors
participate as Systems of Records
Design Patterns for
Federated Data Platforms
Source Classification
▪ System of Record (SoR):
▪ Identify your Authoritative SoRs
SoRs have Role/s:
▪ Source of Truth
▪ Authoritative owner of a subset of data
▪ Source of Validation:
▪ Improve Data Quality
▪ Source of Notification:
▪ Increase “data currency”
Entity / Channel Setup
▪ Gold Entities
▪ Your final entity models (e.g. Healthcare Service,
Organisation, Practitioner)
▪ Raw Entities
▪ Raw (Source) entities in the pre-mapping stage that
will eventually be mapped to your gold entities
▪ Source Channels
▪ Pipeline channels that transition Raw Entities into a new
version of Gold Entities
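The classification above can be sketched as a small registry of sources, roles and channels. This is an illustration only: the class names, the `SourceChannel` fields and the two example entries are assumptions, not the platform's actual model.

```python
from dataclasses import dataclass
from enum import Enum


class SoRRole(Enum):
    SOURCE_OF_TRUTH = "SoT"         # authoritative owner of a subset of data
    SOURCE_OF_VALIDATION = "SoV"    # used to improve data quality
    SOURCE_OF_NOTIFICATION = "SoN"  # used to increase data currency


@dataclass(frozen=True)
class SourceChannel:
    """A pipeline channel that turns one raw entity type into a gold entity."""
    source: str
    role: SoRRole
    raw_entity: str   # source-specific model name (hypothetical)
    gold_entity: str  # final entity model, e.g. "HealthcareService"


# Hypothetical registry entries, for illustration only.
CHANNELS = [
    SourceChannel("Vendor Software", SoRRole.SOURCE_OF_TRUTH,
                  "vendor_clinic", "HealthcareService"),
    SourceChannel("Medicare", SoRRole.SOURCE_OF_VALIDATION,
                  "medicare_provider", "Practitioner"),
]


def channels_for(gold_entity: str) -> list:
    """Return every channel that can produce a new version of a gold entity."""
    return [c for c in CHANNELS if c.gold_entity == gold_entity]
```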
Attribute Sourcing
Id: "561f10e4-0109-b99f-a2df-c059f9dc4a9b"
name: "Cottesloe Medical Centre"
bookingProviders: [
  { Id: hotdoc, providerIdentifier: cottesloe-medical-centre },
  { Id: healthengine, providerIdentifier: ctl-m-re }
]
practitionerRelations: [
  { pracId: c618860e-a69a, type: providerNumber, value: 2djfkdn3k34 },
  { pracId: hsjfk3e-53vd, type: providerNumber, value: dsfh4kslfls }
]
Calendar: {
  openRules: […],
  closedRules: […]
}
Contacts: {
  Email: sss@gmail.com,
  Website: www.tt.com,
  Phone: 3242343
}
[Diagram: Data Federation for the Healthcare Service entity. Each attribute group is annotated with the source that supplies it: Internal, Vendor Software (SoT), Medicare (SoV) or Healthscope (SoN).]
Practitioner Relation: Details about Practitioners who work at a service
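Attribute-level sourcing can be sketched as a lookup that keeps only the attributes a given source is allowed to supply. The attribute-to-source pairing below is an assumption made for illustration (the exact pairing on the slide is not recoverable from the text), and `accept_update` is a hypothetical helper name.

```python
# Hypothetical attribute-to-source mapping; these assignments are assumptions.
ATTRIBUTE_OWNERS = {
    "name": "Internal",
    "bookingProviders": "Vendor Software",
    "practitionerRelations": "Medicare",
    "Calendar": "Vendor Software",
    "Contacts": "Healthscope",
}


def accept_update(source: str, update: dict) -> dict:
    """Keep only the attributes that `source` is allowed to supply."""
    return {
        attr: value
        for attr, value in update.items()
        if ATTRIBUTE_OWNERS.get(attr) == source
    }
```

With this mapping, an update from "Vendor Software" that carries both `name` and `Calendar` would only contribute `Calendar`.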
Pre-Processing → Raw (Bronze) → Stage (Silver) → Gold → Publishing
Pre-Processing Layer
Automated Pre Processing via Notebooks (origin API or
offline data extracts via SFTP, S3 pickup folders)
▪ Generate “Source Data” Event Object
{
DataPayload: <type> Raw Entity Model
Provenance: <type> Provenance
}
▪ DataPayload holds the “Raw Entity” (Source Specific
Model)
▪ Provenance used for source / origin identification
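A minimal sketch of the "Source Data" event as Python dataclasses. The provenance fields are a subset of those on the Data Provenance Object slide; the class names, `build_source_event` and the primary-key check are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass
class Provenance:
    # A subset of the fields listed on the Data Provenance Object slide.
    source_file_name: str
    flow_name: str
    owner_agency: str
    arrival_timestamp: str
    primary_key: list


@dataclass
class SourceDataEvent:
    data_payload: dict      # the Raw Entity, still in its source-specific model
    provenance: Provenance  # identifies the source / origin of the payload


def build_source_event(record: dict, provenance: Provenance) -> SourceDataEvent:
    """Wrap one extracted record; fail early if its primary key is missing."""
    missing = [k for k in provenance.primary_key if k not in record]
    if missing:
        raise ValueError(f"record lacks primary key field(s): {missing}")
    return SourceDataEvent(data_payload=record, provenance=provenance)
```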
Raw Processing Layer (Bronze)
Picks up Source Data Event from “Pre Processing Output”
▪ Performs routine, high level parsing / cleansing
▪ Generate “Core Data” Event Object
▪ Carried through each downstream layer, capturing
transition/operational changes at each layer
▪ Generate Event Trace ID - the end-to-end traceability
identification
▪ Data Lineage Object “begins” to capture the operational
outcomes of events
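The Raw layer steps above can be sketched as follows. The cleansing rule (trim strings, drop empty ones), the `CoreDataEvent` shape and the `PARSING` operation name are assumptions; only the trace ID and the idea of a lineage record that starts here come from the slide.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class CoreDataEvent:
    event_trace_id: str  # end-to-end traceability identifier
    payload: dict
    provenance: dict
    lineage: list = field(default_factory=list)  # operational outcomes so far


def cleanse(payload: dict) -> dict:
    """Routine cleansing: trim string values and drop the ones left empty."""
    cleaned = {}
    for key, value in payload.items():
        if isinstance(value, str):
            value = value.strip()
            if not value:
                continue
        cleaned[key] = value
    return cleaned


def to_core_event(source_event: dict) -> CoreDataEvent:
    """Turn a Source Data event into a Core Data event with a fresh trace ID."""
    event = CoreDataEvent(
        event_trace_id=str(uuid.uuid4()),
        payload=cleanse(source_event["DataPayload"]),
        provenance=source_event["Provenance"],
    )
    # The lineage record starts at this layer and grows downstream.
    event.lineage.append({
        "application_name": "RAW",
        "operation_name": "PARSING",  # hypothetical operation name
        "operation_result": "SUCCESS",
    })
    return event
```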
Stage Processing Layer (Silver)
Picks up Core Data Event from “Raw Output”
▪ Mapping Operation - Convert from Raw (Source) Entity
model to Gold Entity Model
▪ Referencing Operation - Enrichment using Reference
Data lookups
▪ Merging Operation - New Gold Version created
▪ Validation Operation: Final validation against Gold
Model validation Rules
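A rough sketch of the mapping, referencing and validation operations as three small functions. The field map, the state reference table and the two validation rules are hypothetical examples, not the platform's real rules.

```python
# Raw (source) field name -> gold field name. Hypothetical mapping.
FIELD_MAP = {"svc_name": "name", "state_cd": "stateCode"}

# Reference data used for enrichment. Hypothetical lookup table.
STATE_REFERENCE = {"WA": "Western Australia", "NSW": "New South Wales"}


def map_to_gold(raw: dict) -> dict:
    """Mapping operation: rename known raw fields, ignore everything else."""
    return {gold: raw[src] for src, gold in FIELD_MAP.items() if src in raw}


def apply_references(gold: dict) -> dict:
    """Referencing operation: enrich the entity from reference data lookups."""
    enriched = dict(gold)
    code = gold.get("stateCode")
    if code in STATE_REFERENCE:
        enriched["stateName"] = STATE_REFERENCE[code]
    return enriched


def validate(gold: dict) -> list:
    """Validation operation: return the list of violated gold model rules."""
    violations = []
    if not gold.get("name"):
        violations.append("name is required")
    if gold.get("stateCode") not in STATE_REFERENCE:
        violations.append("stateCode is not a known reference value")
    return violations
```

Returning the violations as data, rather than raising, lets the caller record them as lineage outcomes instead of dropping the event.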
Stage Processing Layer
▪ Merging Operation
▪ Matching by “primary key”
▪ Merging (based on last version) / Delta Determination
▪ Version Incrementation
▪ Metadata Attribution generated and appended
▪ Logs Every Change to Every Attribute on Every Event
▪ Individual Data Lineage Objects store all operational
outcomes on the event (attribute exceptions, violations,
status changes etc.)
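The merging steps above (match by primary key, delta determination, version incrementation, metadata attribution) can be sketched with an in-memory version store. The record layout and the `merge_event` name are assumptions for illustration.

```python
from typing import Optional


def merge_event(store: dict, key: str, incoming: dict, source: str) -> Optional[dict]:
    """Merge `incoming` into the latest version stored under `key`.

    Returns the new version record, or None when nothing changed.
    """
    versions = store.setdefault(key, [])  # match by primary key
    latest = versions[-1] if versions else {
        "version": 0, "attributes": {}, "metadata": {}}

    # Delta determination: keep only attributes whose value differs.
    delta = {a: v for a, v in incoming.items()
             if latest["attributes"].get(a) != v}
    if not delta:
        return None

    new_version = latest["version"] + 1  # version incrementation
    metadata = dict(latest["metadata"])
    for attr in delta:
        # Metadata attribution: who changed this attribute, and when.
        metadata[attr] = {"source": source, "version": new_version}

    record = {
        "version": new_version,
        "attributes": {**latest["attributes"], **delta},
        "metadata": metadata,
    }
    versions.append(record)
    return record
```

Because every version is appended rather than overwritten, earlier versions stay available for replay or rollback.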
Gold Processing Layer
Picks up Core Data Event from “Stage Output”
▪ Entity Relationship Validation - Ensures entity
relationships are “intact” - Prevents Orphans
▪ Re-Processing & Replay : Replay latest versions (for
new reference data, business / validation logic to apply)
▪ Data Science Layer - Data Quality Benchmarks
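Entity relationship validation can be sketched as an orphan check: every practitioner referenced by a service must exist as a practitioner entity. The field names follow the Attribute Sourcing sample; `find_orphans` is a hypothetical helper.

```python
def find_orphans(services: list, practitioner_ids: set) -> list:
    """Return (service id, pracId) pairs that point at a missing practitioner."""
    orphans = []
    for service in services:
        for relation in service.get("practitionerRelations", []):
            if relation["pracId"] not in practitioner_ids:
                orphans.append((service["Id"], relation["pracId"]))
    return orphans
```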
Data Provenance Object
event_trace_id: 79d77056-c773-4496-ac0d-5223c49e06f0
file_name: ext_provider_location_service_bf14feb8-538a-4f40-85eb-93b77d2c1704_2019-09-17T22:50:47Z.json
source_file_name: ext_provider_location_service_bf14feb8-538a-4f40-85eb-93b77d2c1704_2019-09-17T22:50:47Z.json
flow_name: nifi_flow_ext_provider_location_service_withstate_v1
owner_agency: HDAInternal
arrival_timestamp: 2019-09-17T22:50:46Z
primary_key: [“pLocSvcId”]
primary_key_temporal: TRUE
data_in_load_strategy: DELTA
unique_data_code: ext_provider_location_service
version: v0.0.1
source_identifier: TAL-2324
Trace an Event back to its Exact Origin
▪ Identify Upstream Source Identity & Raw Source File
▪ Inject Source (External) Identifier (e.g. Jira Ticket #)
Source Intention
▪ Target Entity (what this event intends to update)
Data Lineage Object
event_trace_id: 79d77056-c773-4496-ac0d-5223c49e06f0
application_id: STAGE-01
application_name: STAGE
application_description: Versioned Entity Data
application_version: 1.0.0
application_state: STAGE_REFERENCING
dms_event_id: 4000ae0b-6b08-4dce-a432-fff8e608e7ec
source_dms_event_id: 4de1802c-70e6-4552-b2b0-4349bfc3a073
operation: [{
  operation_name: ENTITY_REFERENCING,
  operation_rule_name: plsParsing,
  operation_result: SUCCESS,
  failure_severity: “”,
  attributes: [“”],
  created_time: 2019-09-30T22:39:47Z
}],
created_time: 2019-09-30T22:39:47Z
Encapsulate Operation Outcomes that occur to Entity
Events
▪ Capture deviation of Data Quality
▪ Exceptions / Warnings
▪ Exceptions - Fix Data
▪ Warning - Improve Data (Quality)
▪ Visibility of End to End Data Flow (via Operational
Outcomes Summary)
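An operational outcomes summary can be sketched by counting lineage operations per application, result and severity. The field names follow the Data Lineage Object; the `FAILURE` result and the `WARNING` / `EXCEPTION` severity values are assumptions, since the slide only shows a `SUCCESS` example.

```python
from collections import Counter


def summarise_outcomes(lineage_objects: list) -> Counter:
    """Count operations by (application_name, operation_result, failure_severity)."""
    summary = Counter()
    for lineage in lineage_objects:
        for op in lineage.get("operation", []):
            key = (
                lineage["application_name"],
                op["operation_result"],
                op.get("failure_severity", ""),
            )
            summary[key] += 1
    return summary
```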
Intelligent System of Record
Ranking
Problem
“Dedicated System of Record (SoR)”
Has full update authority over your data attributes
1. Data Quality Regressions flow into your System
2. Low Frequency of Change (Low Data Currency)
Solution
“Candidate Systems of Record (CSoR)”
Alternate “SoRs” that compete to update the same data
Healthcare Service { Opening Hours, Contact Details } is updated by SoR A, CSoR B and CSoR C
Manual Ranking
Healthcare Service Entity Attributes
Source    Opening Hours   Contact Details
SoR A     Priority 1      Priority 1
CSoR B    -               Priority 2
CSoR C    -               Priority 3
▪ In the same MicroBatch, SoR A wins over CSoR B and CSoR C
▪ Ranking assigned based on “business priority”
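Manual ranking can be sketched as a per-attribute priority table that is consulted whenever several sources update the same entity in one micro-batch. The priorities mirror the Manual Ranking table; the attribute keys, the update shape and `resolve_microbatch` are assumptions.

```python
# Priority per attribute and source (1 is highest), as in the Manual Ranking table.
PRIORITY = {
    "openingHours": {"SoR A": 1},
    "contactDetails": {"SoR A": 1, "CSoR B": 2, "CSoR C": 3},
}


def resolve_microbatch(updates: list) -> dict:
    """Pick, per attribute, the value from the best-ranked source in the batch."""
    winners = {}  # attribute -> (rank, source, value)
    for update in updates:
        source = update["source"]
        for attr, value in update["attributes"].items():
            rank = PRIORITY.get(attr, {}).get(source)
            if rank is None:
                continue  # this source has no authority over the attribute
            if attr not in winners or rank < winners[attr][0]:
                winners[attr] = (rank, source, value)
    return {attr: {"source": src, "value": val}
            for attr, (_, src, val) in winners.items()}
```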
Automatic Ranking
Healthcare Service Entity Lineage Events
Source    Total Updates   Contact Details - Lineage Warnings   Contact Details - Lineage Errors
SoR A     10              4                                    2
CSoR B    8               1                                    0
CSoR C    2               1                                    1
Contact Details   Original Priority   Updated Priority
SoR A             Priority 1          Priority 2
CSoR B            Priority 2          Priority 1
CSoR C            Priority 3          Priority 3
▪ Data Lineage outcomes aggregated over last 30 days
▪ “Priority boosted” based on “Recent Performance” of Sources
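One possible way to turn the 30-day lineage counts into an updated priority is a penalty rate per source. The slides do not state the actual boosting rule, so the formula below (warnings plus twice the errors, divided by total updates) is purely an assumption; it is chosen only because it reproduces the updated priorities shown for this example.

```python
# 30-day lineage outcomes for Contact Details, taken from the table above.
STATS = {
    "SoR A": {"total_updates": 10, "warnings": 4, "errors": 2},
    "CSoR B": {"total_updates": 8, "warnings": 1, "errors": 0},
    "CSoR C": {"total_updates": 2, "warnings": 1, "errors": 1},
}
ORIGINAL_PRIORITY = {"SoR A": 1, "CSoR B": 2, "CSoR C": 3}


def rerank(stats: dict, original: dict, error_weight: float = 2.0) -> dict:
    """Rank sources by penalty rate; ties fall back to the original priority."""

    def penalty(source: str) -> float:
        s = stats[source]
        if s["total_updates"] == 0:
            return float("inf")  # no recent evidence, rank last
        return (s["warnings"] + error_weight * s["errors"]) / s["total_updates"]

    ordered = sorted(stats, key=lambda src: (penalty(src), original[src]))
    return {src: rank for rank, src in enumerate(ordered, start=1)}
```

Here the penalty rates are 0.8 for SoR A, 0.125 for CSoR B and 1.5 for CSoR C, so CSoR B moves to priority 1.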
Intelligent Ranking - Future State
Healthcare Service Entity Features
Source        Total    Lineage   Lineage  Public      Completeness  Consistency  Accuracy  Conformity  Integrity  … Nth
              Updates  Warnings  Errors   Complaints  Score         Score        Score     Score       Score
                                          Count
SoR A         10       4         2        2           60            20           99        56          21         …
CSoR B        8        1         0        1           45            34           80        54          22
CSoR C        2        1         1        6           78            45           34        56          45
… Nth Source  …        …         …        …           …             …            …         …           …          …
▪ Sources and Features are growing
▪ “Seasonal” Data Regression
▪ Source Data Quality Model: “Confidence Score” based on “Past Performance” applied in “Real Time”
Architecture Patterns for
Federated Data Platforms
Architecture Overview
Logical Data Zones using Databricks DELTA
▪ Data Control Plane: LANDING, RAW, STAGE, GOLD (i.e. Bronze, Silver, Gold)
▪ Used DELTA Cache for performance optimisation (stream and batch workloads)
▪ Runs on AWS Accounts under our Security Policy and Regulatory Compliance
▪ Operational Control Plane: Cluster Administration and Management functions such as
Access Control, Jobs and Schedules
Data Plane & Processing Pipeline
Continuous Streaming Applications
▪ Enables “True Event Sourcing” via Streaming
Input, Kinesis, S3, and DELTA
▪ Running Micro-batches leads to smaller and
more manageable data volumes
▪ Recoverability through Checkpoints and
Reliability through Streaming Sinks to DELTA
tables
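The recoverability idea can be illustrated without any streaming engine: process events in micro-batches and commit an offset after each one, so a restart resumes where the last committed batch ended. This is a plain-Python stand-in and not the Databricks or Spark API; the checkpoint dict, the list sink and the `crash_after` switch are assumptions used only to simulate a failure.

```python
def run_microbatches(events, checkpoint, sink, batch_size=2, crash_after=None):
    """Process `events` in micro-batches, resuming from the checkpointed offset."""
    offset = checkpoint.get("offset", 0)
    batches_done = 0
    while offset < len(events):
        if crash_after is not None and batches_done == crash_after:
            raise RuntimeError("simulated crash between micro-batches")
        batch = events[offset:offset + batch_size]
        sink.extend(batch)              # write the micro-batch to the sink
        offset += len(batch)
        checkpoint["offset"] = offset   # commit progress after the write
        batches_done += 1
    return offset
```

In this sketch the sink write and the offset commit are two separate steps; a real sink would need them to be atomic or idempotent to avoid duplicates after a crash between the two.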
Data Issue Problem Statement:
A Downstream Health Integrator is complaining that unanticipated
special Unicode characters in the service description are
breaking their integration.
Restore & Recover Data Versions Seamlessly
▪ During Data quality issues, we can rewind to
previous versions
▪ Using Metadata attribution, Provenance and Data
Lineage features, we can trace the root cause to a
specific origin source with millisecond precision
▪ Complete audit trail and ability to provide SoR Data
Quality Reporting
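For the Unicode problem statement, a rewind can be sketched as: walk back through an entity's versions to the last one whose description is clean, and read the offending source from the attribute metadata. What counts as an "unanticipated" character is an assumption here (Unicode control, format, private-use and unassigned categories), and the version record layout is hypothetical.

```python
import unicodedata

# Unicode general categories treated as unexpected in this sketch (an assumption).
SUSPECT_CATEGORIES = {"Cc", "Cf", "Co", "Cn"}


def unexpected_characters(text: str) -> list:
    """Return the characters in `text` that fall in a suspect category."""
    return [ch for ch in text
            if unicodedata.category(ch) in SUSPECT_CATEGORIES and ch not in "\n\t"]


def last_clean_version(versions: list, attribute: str):
    """Walk back from the newest version to the last one without bad characters."""
    for record in reversed(versions):
        if not unexpected_characters(record["attributes"].get(attribute, "")):
            return record
    return None
```

The version that introduced the bad characters still carries its metadata, so the responsible source can be looked up before rolling back.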
Questions & Feedback
Mark Paul - @ThisIsMarkPaul
