Building Real-Time Data Pipeline for Diabetes Medication Recommender System Using Databricks
#DevSAIS17
Arivoli Tirouvingadame
Data Platform Engineer, Qventus
Jayaradha Natarajan
Sr. Data Engineer, Change Healthcare
$whoami
• Jayaradha Natarajan
Sr. Data Engineer, Change Healthcare
www.github.com/jayaradha
Open Source Committer
https://l10n.gnome.org/teams/ta/
Organizer, Data Riders meetup group
www.meetup.com/datariders
• Arivoli Tirouvingadame
Data Platform Engineer, Qventus
http://www.github.com/olisource
Organizer, Data Riders meetup group
www.meetup.com/datariders
AI/ML in Healthcare
“AI will be ubiquitous in healthcare by 2025”
https://www.techemergence.com/machine-learning-in-healthcare-executive-consensus/
Healthcare Data
Patient
Visit
Lab
R&D
IoT Sensors
Prescription
“We are in the early days of AI assisting Physicians better prescribe medication”
https://www.truveris.com/resources/ai-in-healthcare-helping-physicians-better-prescribe-treatments
https://www.alchemyfoodtech.com/copy-of-diabetes-epidemic
- Current: 1 in 11 adults is diabetic
- By 2040: the diabetic population is expected to be twice the population of the USA
Life in a day … of a Diabetes patient
[Figure labels: Problem, Challenge, Symptoms]
How can we prescribe Diabetes medication better in near real-time?
Solution
- Use a Big Data pipeline to collect the patient's blood glucose level and medication before/after food, and predict better medication in near real-time
Data Collection: Collect sensor data (wearable devices)
Model: Model data using ML algorithms
Predict: Predict medication and alert the patient's mobile device
Glucose Monitors
Non-meter test strips
Hospital glucose meters
Blood testing with meters using test strips
Noninvasive meters
Continuous glucose monitors
Ingestion data
o Typically, raw data can be structured, semi-structured, or unstructured, with or without errors
o IoT devices (for example, Continuous Glucose Monitors) produce structured data, with or without errors
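Because structured sensor data can still arrive with errors, ingestion typically has to separate well-formed readings from malformed ones. A minimal sketch in plain Python, assuming JSON messages; the field names (`patient_id`, `timestamp`, `glucose_mg_dl`) are hypothetical and not taken from the talk:

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class GlucoseReading:
    # Hypothetical message layout, chosen only for illustration.
    patient_id: str
    timestamp: str
    glucose_mg_dl: float


def parse_reading(raw: str) -> Optional[GlucoseReading]:
    """Return a typed reading, or None when the message is malformed."""
    try:
        msg = json.loads(raw)
        return GlucoseReading(
            patient_id=str(msg["patient_id"]),
            timestamp=str(msg["timestamp"]),
            glucose_mg_dl=float(msg["glucose_mg_dl"]),
        )
    except (ValueError, KeyError, TypeError):
        # Invalid JSON, a missing field, or a non-numeric glucose value.
        return None
```

Returning `None` instead of raising lets a streaming job route bad messages to a separate location without stopping the pipeline.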
Data Storage and Cleansing
[Diagram: Sensor Data, Cleansed Data Storage, Model Storage, Raw Data Storage, Recommendation/Score Storage, Data Cleansing]
[Diagram labels: Age, Calorie intake, Blood glucose level]
Data Cleansing and modeling
o Data cleansing uses statistical analysis tools to read and audit data based on a list of pre-defined constraints.
[Diagram: Streaming Data, Range check, Validate Data, Split, Training data, Test data]
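The range check, validation, and split steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the speakers' implementation: the constraint bounds are placeholders (not clinical thresholds) and the field names are hypothetical.

```python
import random
from typing import Dict, List, Tuple

# Illustrative pre-defined constraints; the bounds are assumptions,
# not clinical thresholds.
CONSTRAINTS: Dict[str, Tuple[float, float]] = {
    "age": (0, 120),
    "calorie_intake": (0, 10000),
    "glucose_mg_dl": (10, 1000),
}


def is_valid(record: dict) -> bool:
    """Range-check every constrained field; missing or non-numeric fails."""
    for field, (low, high) in CONSTRAINTS.items():
        value = record.get(field)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        if not low <= value <= high:
            return False
    return True


def split(records: List[dict], test_fraction: float = 0.2,
          seed: int = 42) -> Tuple[List[dict], List[dict]]:
    """Shuffle deterministically, then split into (training, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Only records that pass `is_valid` would be passed to `split`, so that training and test data both come from cleansed data.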
ARCHITECTURE
Reference Architecture
[Diagram: EMR, Raw Data, Clean Data, Model, Train, Transformation/Cleansing, Prediction]
Architecture components
o Kafka: Get sensor data in real-time from wearable devices
o Apache Spark: Ingest raw data through Kafka. Use Structured Streaming for data verification, validation, cleansing, enrichment, etc., and store the results in S3 buckets
o MLlib: Process data stored in S3 buckets with machine learning libraries, so that insulin intake can be recommended
o AWS: Deploy the model and other related services on EC2, EMR, etc.
o Mobile or Web App: Notify patients with medication recommendations
o D3/Tableau: Visualize via charts/dashboards
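The Kafka-to-Spark-to-storage step above might look roughly like the following PySpark sketch. It assumes an existing `SparkSession` with the Spark Kafka integration package available; the broker address, topic, and paths are placeholders supplied by the caller, and the helper names are hypothetical.

```python
from typing import Dict


def kafka_source_options(bootstrap_servers: str, topic: str) -> Dict[str, str]:
    """Options for Spark's Kafka source, as named in the Spark Kafka guide."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def start_raw_ingest(spark, bootstrap_servers: str, topic: str,
                     output_path: str, checkpoint_path: str):
    """Stream raw sensor messages from Kafka into Parquet files.

    `spark` is an existing SparkSession (assumption); `output_path` and
    `checkpoint_path` could be S3 locations.
    """
    raw = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )
    # Kafka delivers the payload as binary; keep it as a string column.
    messages = raw.selectExpr("CAST(value AS STRING) AS json", "timestamp")
    return (
        messages.writeStream.format("parquet")
        .option("path", output_path)
        .option("checkpointLocation", checkpoint_path)
        .outputMode("append")
        .start()
    )
```

Storing the raw payload before any cleansing keeps the original data available for reprocessing if the validation rules change later.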
Pain points
o Maintaining multiple root accounts for Dev, Pre-Prod and Prod environments is expensive
o Choosing HIPAA-compliant services (most serverless technologies are not HIPAA compliant)
o We have to build a secure network from scratch and maintain it (for example, using Terraform, CloudFormation, etc.)
o End-to-end encryption: data-in-flight and data-at-rest encryption
HIPAA Challenges
o HIPAA requires healthcare data to be protected.
o Ensure the confidentiality, integrity, and availability of Protected Health Information (PHI) created, received, maintained, or transmitted.
o Protect against any reasonably anticipated threats and hazards to the security or integrity of PHI.
o Protect against reasonably anticipated uses or disclosures of PHI not permitted by the Privacy Rule.
DATABRICKS PIPELINE
Databricks - Kinesis - Connector
[Diagram: Structured Streaming, Spark ML, Kinesis, AWS Lambda, API Gateway]
Databricks - Kafka - Connector
[Diagram: Kafka, Connector, Train, Raw Data, Cleansed Data, Model, Spark to clean data, Spark ML data, Prediction]
Deployment
o Hybrid only or single tenant
o Selected AWS BAA HIPAA services
o Databricks auxiliary services (web app and cluster management software) would be in a Databricks-owned AWS account and run on a dedicated VPC instance.
o Spark clusters would continue to be deployed to the customer's AWS account and on dedicated instances.
o End-to-end encryption: data-in-flight and data-at-rest encryption
o Logging and monitoring
o Audit
https://docs.databricks.com/user-guide/advanced/hipaa-compliant-deployment.html
DEMO
Mobile App
Visualizations
Future directions
o Health: Extend the approach to any medication-management solution, including emergency medication management
o Wellness: Predict calorie intake
o Fitness: Predict the workouts needed
Acknowledgements
- Catherine Crofts, PhD, Auckland University of Technology, Auckland, New Zealand
https://www.researchgate.net/profile/Catherine_Crofts
- Baba Medicals (India)
- http://reference.medscape.com/drug/humalog-insulin-lispro-999005
References
o https://www.esri.com/
o https://risk.lexisnexis.com/
o https://symphonyhealth.prahs.com/
o https://docs.databricks.com/user-guide/advanced/hipaa-compliant-deployment.html
o https://databricks.com/blog/2017/05/18/taking-apache-sparks-structured-structured-streaming-to-production.html
o https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
o https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
