Building a Federated Data Directory Platform for Public Health

Mark Paul
Engineering Manager
Anshul Bajpai
Data Engineering Lead

Agenda
1. Problems with Centralised Data Directories
2. Solution: Federated Data Directory Platform
3. Design Patterns
4. Intelligent System of Record Ranking
5. Architecture Patterns

▪ Australian digital health infrastructure
▪ National directory of health services and the practitioners who provide them
▪ National, government-owned, not-for-profit organization
▪ Trusted health information and advice for all Australians
▪ #1 Australian health information website
▪ 4.8m community connections each month

Problems with Centralised Data Directories

Healthcare Directories - Critical Healthcare Infrastructure
▪ Enables Care Coordination
▪ Single Point-of-Failure
▪ Bad Data Quality = Clinical Risk to Patients

Healthcare Directories - Problems
▪ Data updated via Content Management Systems and Call Centres
▪ Data volatility (high frequency of change to data)
▪ Basic centrally managed databases and applications
This model is reactive and inefficient!

Solution: Federated Data Directory Platform

Federating Data is a Powerful Concept
Federated Database:
▪ Maps multiple autonomous database systems into a single federated datastore
Federated Data Platform:
▪ Controlled aggregation to create “gold-standard data” by using multiple Autonomous Origin Data Sources
▪ Data Aggregation via Event Sourcing pipelines

Building the Federated Data “Puzzle”
Federal, State, Public/Private Hospitals, EMR, and other Commercial Vendors participate as Systems of Record

Design Patterns for Federated Data Platforms

Source Classification
▪ System of Record (SoR): identify your authoritative SoRs
▪ Each SoR has one or more roles:
▪ Source of Truth (SoT): authoritative owner of a subset of data
▪ Source of Validation (SoV): improves data quality
▪ Source of Notification (SoN): increases “data currency”

Entity / Channel Setup
▪ Gold Entities: your final entity models (e.g. Healthcare Service, Organisation, Practitioner)
▪ Raw Entities: raw (source) entities in the pre-mapping stage that will eventually be mapped to your gold entities
▪ Source Channels: pipeline channels that transition Raw Entities into new versions of Gold Entities
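The classification above can be sketched as a small data model. This is a minimal illustration, not the platform's actual schema: the class names, the raw entity names and the example channels are assumptions, while the role abbreviations and the source names Medicare and Vendor Software come from the slides.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, List


class SourceRole(Enum):
    SOURCE_OF_TRUTH = "SoT"          # authoritative owner of a subset of data
    SOURCE_OF_VALIDATION = "SoV"     # improves data quality
    SOURCE_OF_NOTIFICATION = "SoN"   # increases data currency


@dataclass(frozen=True)
class SystemOfRecord:
    name: str
    roles: FrozenSet[SourceRole]


@dataclass(frozen=True)
class SourceChannel:
    """A pipeline channel that maps one Raw Entity to one Gold Entity."""
    source: SystemOfRecord
    raw_entity: str    # hypothetical source-specific model name
    gold_entity: str   # final entity model, e.g. "HealthcareService"


def channels_for(gold_entity: str, channels: List[SourceChannel]) -> List[SourceChannel]:
    """Return every channel that can produce a new version of a Gold Entity."""
    return [c for c in channels if c.gold_entity == gold_entity]


# Illustrative setup only; the raw entity names are made up.
MEDICARE = SystemOfRecord("Medicare", frozenset({SourceRole.SOURCE_OF_VALIDATION}))
VENDOR = SystemOfRecord("Vendor Software", frozenset({SourceRole.SOURCE_OF_TRUTH}))
CHANNELS = [
    SourceChannel(MEDICARE, "medicare_provider_extract", "Practitioner"),
    SourceChannel(VENDOR, "vendor_service_feed", "HealthcareService"),
    SourceChannel(VENDOR, "vendor_practitioner_feed", "Practitioner"),
]
```

Keeping the role on the source and the mapping on the channel lets one source feed several Gold Entities without duplicating its classification.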

Attribute Sourcing
Id: "561f10e4-0109-b99f-a2df-c059f9dc4a9b"
name: "Cottesloe Medical Centre"
bookingProviders: [
  { Id: hotdoc, providerIdentifier: cottesloe-medical-centre },
  { Id: healthengine, providerIdentifier: ctl-m-re }
]
practitionerRelations: [
  { pracId: c618860e-a69a, type: providerNumber, value: 2djfkdn3k34 },
  { pracId: hsjfk3e-53vd, type: providerNumber, value: dsfh4kslfls }
]
Calendar: {
  openRules: […],
  closedRules: […]
}
Contacts: {
  Email: sss@gmail.com,
  Website: www.tt.com,
  Phone: 3242343
}
[Diagram: attributes of the Healthcare Service entity are sourced from Medicare (SoV), Healthscope (SoN), Vendor Software (SoT) and Internal sources. The Practitioner Relation holds details about practitioners who work at a service.]
Data Federation

Pre-Processing → Raw (Bronze) → Stage (Silver) → Gold → Publishing

Pre-Processing Layer
Automated pre-processing via Notebooks (origin API or offline data extracts via SFTP, S3 pickup folders)
▪ Generate “Source Data” Event Object
{
  DataPayload: <type> Raw Entity Model
  Provenance: <type> Provenance
}
▪ DataPayload holds the “Raw Entity” (Source Specific Model)
▪ Provenance used for source / origin identification
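As a rough sketch of the “Source Data” event described above, the following wraps a raw entity together with its provenance. The provenance field names are borrowed from the Data Provenance Object slide; the class names, the function name and its signature are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Provenance:
    source_file_name: str
    owner_agency: str
    arrival_timestamp: str   # ISO 8601, UTC
    unique_data_code: str


@dataclass(frozen=True)
class SourceDataEvent:
    data_payload: Dict[str, Any]   # the Raw Entity, still in the source-specific model
    provenance: Provenance


def build_source_data_event(
    raw_entity: Dict[str, Any],
    source_file_name: str,
    owner_agency: str,
    unique_data_code: str,
    arrived_at: Optional[datetime] = None,
) -> SourceDataEvent:
    """Wrap one raw record; the payload is copied but not interpreted here."""
    arrived_at = arrived_at or datetime.now(timezone.utc)
    return SourceDataEvent(
        data_payload=dict(raw_entity),
        provenance=Provenance(
            source_file_name=source_file_name,
            owner_agency=owner_agency,
            arrival_timestamp=arrived_at.strftime("%Y-%m-%dT%H:%M:%SZ"),
            unique_data_code=unique_data_code,
        ),
    )
```

The payload stays untouched at this stage, so mapping and validation remain the responsibility of the downstream layers.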

Raw Processing Layer (Bronze)
Picks up Source Data Event from “Pre-Processing Output”
▪ Performs routine, high-level parsing / cleansing
▪ Generates the “Core Data” Event Object, which is carried to each downstream layer and captures transition/operational changes at each layer
▪ Generates the Event Trace ID, the end-to-end traceability identifier
▪ Data Lineage Object “begins” to capture operational outcomes on events
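A minimal sketch of the Raw layer steps above, assuming the Source Data event arrives as a plain dictionary with `DataPayload` and `Provenance` keys. The whitespace trimming stands in for “high-level cleansing”, and the operation name `RAW_CLEANSING` is hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class LineageOperation:
    operation_name: str
    operation_result: str                      # e.g. "SUCCESS"
    attributes: List[str] = field(default_factory=list)


@dataclass
class CoreDataEvent:
    event_trace_id: str                        # end-to-end traceability identifier
    payload: Dict[str, Any]
    provenance: Dict[str, Any]
    lineage: List[LineageOperation] = field(default_factory=list)


def to_core_event(source_event: Dict[str, Any]) -> CoreDataEvent:
    """Cleanse the raw payload, assign a trace ID and start the lineage record."""
    raw = source_event["DataPayload"]
    # Stand-in cleansing rule: trim surrounding whitespace on string values.
    cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in raw.items()}
    touched = [k for k in raw if cleaned[k] != raw[k]]
    event = CoreDataEvent(
        event_trace_id=str(uuid.uuid4()),
        payload=cleaned,
        provenance=dict(source_event["Provenance"]),
    )
    # The lineage list begins here and is appended to by every later layer.
    event.lineage.append(LineageOperation("RAW_CLEANSING", "SUCCESS", touched))
    return event
```

Because the trace ID is created once and carried on the event, every later lineage entry can be joined back to the same origin.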

Stage Processing Layer (Silver)
Picks up Core Data Event from “Raw Output”
▪ Mapping Operation - Convert from Raw (Source) Entity model to Gold Entity Model
▪ Referencing Operation - Enrichment using Reference Data lookups
▪ Merging Operation - New Gold Version created
▪ Validation Operation - Final validation against Gold Model validation rules

Stage Processing Layer
▪ Merging Operation
▪ Matching by “primary key”
▪ Merging (based on last version) / Delta Determination
▪ Version Incrementation
▪ Metadata Attribution generated and appended
▪ Logs Every Change to Every Attribute on Every Event
▪ Individual Data Lineage Objects store all operational outcomes on the event (attribute exceptions, violations, status changes etc.)
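The merging steps above (match by primary key, determine the delta against the last version, increment the version, log every attribute change) can be sketched as follows. The in-memory `store` and the record layout are assumptions for illustration; the platform itself persists versions in Delta tables.

```python
from typing import Any, Dict, List, Optional


def merge_event(
    store: Dict[Any, List[Dict[str, Any]]],
    incoming: Dict[str, Any],
    primary_key: str,
) -> Optional[Dict[str, Any]]:
    """Merge an incoming entity into the version history held in `store`.

    Returns the new version record, or None when nothing changed.
    """
    key = incoming[primary_key]                 # matching by primary key
    history = store.setdefault(key, [])
    last = history[-1]["data"] if history else {}

    # Delta determination: one change entry per attribute that differs.
    changes = [
        {"attribute": attr, "old": last.get(attr), "new": new}
        for attr, new in incoming.items()
        if last.get(attr) != new
    ]
    if not changes:
        return None                             # no delta, so no new version

    record = {
        "version": history[-1]["version"] + 1 if history else 1,
        "data": {**last, **incoming},           # merge on top of the last version
        "changes": changes,                     # metadata attribution per attribute
    }
    history.append(record)
    return record
```

Returning `None` for an empty delta keeps replays and duplicate deliveries from inflating the version history.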

Gold Processing Layer
Picks up Core Data Event from “Stage Output”
▪ Entity Relationship Validation - Ensures entity relationships are “intact” - Prevents Orphans
▪ Re-Processing & Replay - Replay latest versions (for new reference data, business / validation logic to apply)
▪ Data Science Layer - Data Quality Benchmarks
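A rough sketch of the entity relationship validation described above: each practitioner relation must point at a known Healthcare Service and a known Practitioner, otherwise it is reported as an orphan. The `pracId` field appears in the slides; the `serviceId` field and the function name are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Set, Tuple


def find_orphans(
    practitioner_relations: Iterable[Dict[str, Any]],
    healthcare_service_ids: Set[str],
    practitioner_ids: Set[str],
) -> List[Tuple[Dict[str, Any], List[str]]]:
    """Return (relation, missing_references) for every relation with a broken link."""
    orphans = []
    for relation in practitioner_relations:
        missing = []
        if relation["serviceId"] not in healthcare_service_ids:
            missing.append("serviceId")
        if relation["pracId"] not in practitioner_ids:
            missing.append("pracId")
        if missing:
            orphans.append((relation, missing))
    return orphans
```

Reporting which reference is missing, rather than a bare pass/fail, gives the lineage record something actionable to store.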

Data Provenance Object
event_trace_id: 79d77056-c773-4496-ac0d-5223c49e06f0
file_name: ext_provider_location_service_bf14feb8-538a-4f40-85eb-93b77d2c1704_2019-09-17T22:50:47Z.json
source_file_name: ext_provider_location_service_bf14feb8-538a-4f40-85eb-93b77d2c1704_2019-09-17T22:50:47Z.json
flow_name: nifi_flow_ext_provider_location_service_withstate_v1
owner_agency: HDAInternal
arrival_timestamp: 2019-09-17T22:50:46Z
primary_key: ["pLocSvcId"]
primary_key_temporal: TRUE
data_in_load_strategy: DELTA
unique_data_code: ext_provider_location_service
version: v0.0.1
source_identifier: TAL-2324
Trace an Event back to its Exact Origin
▪ Identify Upstream Source Identity & Raw Source File
▪ Inject Source (External) Identifier (e.g. Jira Ticket #)
Source Intention
▪ Target Entity (what this event intends to update)

Data Lineage Object
event_trace_id: 79d77056-c773-4496-ac0d-5223c49e06f0
application_id: STAGE-01
application_name: STAGE
application_description: Versioned Entity Data
application_version: 1.0.0
application_state: STAGE_REFERENCING
dms_event_id: 4000ae0b-6b08-4dce-a432-fff8e608e7ec
source_dms_event_id: 4de1802c-70e6-4552-b2b0-4349bfc3a073
operation: [{
  operation_name: ENTITY_REFERENCING,
  operation_rule_name: plsParsing,
  operation_result: SUCCESS,
  failure_severity: "",
  attributes: [""],
  created_time: 2019-09-30T22:39:47Z
}],
created_time: 2019-09-30T22:39:47Z
Encapsulate Operation Outcomes that occur to Entity Events
▪ Capture deviation of Data Quality
▪ Exceptions / Warnings
▪ Exceptions - Fix Data
▪ Warning - Improve Data (Quality)
▪ Visibility of End to End Data Flow (via Operational Outcomes Summary)

Intelligent System of Record Ranking

Problem
“Dedicated System of Record (SoR)”: has full update authority over your data attributes
1. Data quality regressions flow into your system
2. Low frequency of change (low data currency)
Solution
“Candidate Systems of Record (CSoR)”: alternate “SoRs” that compete to update the same data
[Diagram: SoR A, CSoR B and CSoR C each update Healthcare Service { Opening Hours, Contact Details }]

Manual Ranking
Healthcare Service Entity Attributes
Source | Opening Hours | Contact Details
SoR A | Priority 1 | Priority 1
CSoR B | - | Priority 2
CSoR C | - | Priority 3
▪ Ranking assigned based on “business priority”
▪ In the same micro-batch, SoR A wins over CSoR B and CSoR C
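The manual ranking above can be sketched as a per-attribute priority table that decides the winner among competing updates in one micro-batch. The priorities mirror the slide; the attribute keys, the update layout and the function name are assumptions.

```python
from typing import Any, Dict, List

# Priority per attribute and source, as shown on the slide (lower number wins).
MANUAL_PRIORITY: Dict[str, Dict[str, int]] = {
    "openingHours": {"SoR A": 1},
    "contactDetails": {"SoR A": 1, "CSoR B": 2, "CSoR C": 3},
}


def resolve_micro_batch(
    updates: List[Dict[str, Any]],
    priority: Dict[str, Dict[str, int]] = MANUAL_PRIORITY,
) -> Dict[str, Dict[str, Any]]:
    """Pick, per attribute, the update from the highest-ranked source in the batch."""
    winners: Dict[str, Any] = {}
    for update in updates:
        rank = priority.get(update["attribute"], {}).get(update["source"])
        if rank is None:
            continue   # this source has no authority over the attribute
        best = winners.get(update["attribute"])
        if best is None or rank < best[0]:
            winners[update["attribute"]] = (rank, update)
    return {attribute: update for attribute, (rank, update) in winners.items()}
```

Updates from a source with no priority for an attribute are dropped, which is how CSoR B and CSoR C are kept away from Opening Hours in this sketch.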

Automatic Ranking
Healthcare Service Entity Lineage Events
Source | Total Updates | Contact Details - Lineage Warnings | Contact Details - Lineage Errors
SoR A | 10 | 4 | 2
CSoR B | 8 | 1 | 0
CSoR C | 2 | 1 | 1
▪ Data Lineage outcomes aggregated over last 30 days
▪ “Priority boosted” based on “Recent Performance” of Sources
Contact Details, Original Priority: Priority 1, Priority 2, Priority 3
Contact Details, Updated Priority: Priority 2, Priority 1, Priority 3
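One way to read the automatic ranking above is to reorder sources by how often their recent updates raised lineage warnings or errors. The slides do not state the scoring rule, so the issue-rate formula below is an assumption; the counts are the 30-day figures from the slide.

```python
from typing import Dict

# 30-day lineage outcomes for Contact Details, taken from the slide.
LINEAGE_30_DAYS: Dict[str, Dict[str, int]] = {
    "SoR A": {"updates": 10, "warnings": 4, "errors": 2},
    "CSoR B": {"updates": 8, "warnings": 1, "errors": 0},
    "CSoR C": {"updates": 2, "warnings": 1, "errors": 1},
}
ORIGINAL_PRIORITY: Dict[str, int] = {"SoR A": 1, "CSoR B": 2, "CSoR C": 3}


def issue_rate(stats: Dict[str, int]) -> float:
    """Assumed scoring rule: share of updates that produced a warning or error."""
    if stats["updates"] == 0:
        return float("inf")   # no recent evidence, so rank the source last
    return (stats["warnings"] + stats["errors"]) / stats["updates"]


def rerank(lineage: Dict[str, Dict[str, int]], original: Dict[str, int]) -> Dict[str, int]:
    """Boost sources with a lower recent issue rate; ties keep the manual order."""
    ordered = sorted(original, key=lambda s: (issue_rate(lineage[s]), original[s]))
    return {source: position + 1 for position, source in enumerate(ordered)}
```

With these figures the issue rates are 0.6 for SoR A, 0.125 for CSoR B and 1.0 for CSoR C, so CSoR B moves to priority 1 and SoR A drops to priority 2. That matches the updated priority order on the slide if its rows are read in source order.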

Intelligent Ranking - Future State
Healthcare Service Entity Features
Source | Total Updates | Lineage Warnings | Lineage Errors | Public Complaints Count | Completeness Score | Consistency Score | Accuracy Score | Conformity Score | Integrity Score | … Nth
SoR A | 10 | 4 | 2 | 2 | 60 | 20 | 99 | 56 | 21 | …
CSoR B | 8 | 1 | 0 | 1 | 45 | 34 | 80 | 54 | 22 |
CSoR C | 2 | 1 | 1 | 6 | 78 | 45 | 34 | 56 | 45 |
… Nth Source | … | … | … | … | … | … | … | … | … | …
▪ Sources and Features are growing
▪ “Seasonal” Data Regression
▪ Source Data Quality Model: “Confidence Score” based on “Past Performance” applied in “Real Time”

Architecture Patterns for Federated Data Platforms

Logical Data Zones using Databricks DELTA
▪ Data Control Plane: LANDING, RAW, STAGE, GOLD (i.e. Bronze, Silver, Gold)
▪ Used DELTA Cache for performance optimisation (stream and batch workloads)
▪ Runs on AWS Accounts under our Security Policy and Regulatory Compliance
▪ Operational Control Plane: cluster administration and management functions such as access control, jobs and schedules

Data Plane & Processing Pipeline

Continuous Streaming Applications
▪ Enables “True Event Sourcing” via Streaming Input, Kinesis, S3, and DELTA
▪ Running micro-batches leads to smaller and more manageable data volumes
▪ Recoverability through Checkpoints and Reliability through Streaming Sinks to DELTA tables
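The checkpoint idea above can be illustrated without a streaming engine. The sketch below is a plain-Python stand-in, not the Spark or Delta API: it processes events in micro-batches, records the committed offset in a checkpoint file after each batch, and resumes from that offset after a failure. The function name, the file format and the `fail_at_batch` switch are assumptions used only to simulate a crash.

```python
import json
from pathlib import Path
from typing import Any, List, Optional


def run_micro_batches(
    events: List[Any],
    checkpoint_path,
    sink: List[Any],
    batch_size: int = 2,
    fail_at_batch: Optional[int] = None,
) -> int:
    """Process events in micro-batches and return the last committed offset."""
    path = Path(checkpoint_path)
    # Resume from the last committed offset if a checkpoint exists.
    offset = json.loads(path.read_text())["offset"] if path.exists() else 0
    batch_number = 0
    while offset < len(events):
        if fail_at_batch is not None and batch_number == fail_at_batch:
            raise RuntimeError("simulated failure")
        batch = events[offset:offset + batch_size]
        sink.extend(batch)                                  # write the batch to the sink
        offset += len(batch)
        path.write_text(json.dumps({"offset": offset}))     # commit the checkpoint
        batch_number += 1
    return offset
```

In this toy version the sink write and the checkpoint commit are two separate steps, so a crash between them would replay one batch. A real streaming sink has to make that pair idempotent or transactional.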

Data Issue Problem Statement:
A downstream Health Integrator is complaining that unanticipated special Unicode characters in the service description are breaking their integration.

Restore & Recover Data Versions Seamlessly
▪ When data quality issues occur, we can rewind to previous versions
▪ Using Metadata Attribution, Provenance and Data Lineage features, we can trace the root cause to a specific origin source with up to millisecond precision
▪ Complete audit trail and ability to provide SoR Data Quality Reporting
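For the Unicode problem statement, the rewind-and-trace flow above might look like the following sketch over an in-memory version history. The record layout, the function names and the definition of an “unexpected” character (non-ASCII, or a control character other than newline and tab) are assumptions; in the platform the versions and trace IDs would come from the versioned tables and the provenance and lineage objects.

```python
from typing import Any, Callable, Dict, List, Optional


def has_unexpected_chars(text: str) -> bool:
    """Assumed rule: flag non-ASCII and control characters except newline and tab."""
    return any(ord(ch) > 127 or (ord(ch) < 32 and ch not in "\n\t") for ch in text)


def first_bad_version(
    history: List[Dict[str, Any]],
    attribute: str,
    is_bad: Callable[[str], bool] = has_unexpected_chars,
) -> Optional[Dict[str, Any]]:
    """Return the oldest version whose attribute value fails the check."""
    for record in history:   # history is ordered oldest to newest
        if is_bad(record["data"].get(attribute, "")):
            return record
    return None


def rewind(history: List[Dict[str, Any]], bad_version: int) -> Optional[Dict[str, Any]]:
    """Return the latest version recorded before the bad one, if any."""
    earlier = [record for record in history if record["version"] < bad_version]
    return earlier[-1] if earlier else None
```

The `event_trace_id` stored on the first bad version is what links the regression back to its provenance record, and from there to the originating source file.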

Questions & Feedback
Mark Paul - @ThisIsMarkPaul
