20211119 ntuh azure hpc workshop final
Microsoft Azure Cloud Computing
Predicting COVID-19 Mortality Using Polygenic Risk Score
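李建璋
Deputy Director, Center for Intelligent Healthcare, National Taiwan University Hospital
Clinical Professor, Department of Emergency Medicine, National Taiwan University Hospital
Principal Investigator, Biomedical Data Science Research Group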
Human Genome
Manhattan plot
Genome-wide Association Study
Regional Association Plot
Polygenic risk score
Effect sizes (weights) are estimated for each SNP; their weighted sum serves as the prediction for the trait
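As a minimal sketch of the idea above, a polygenic risk score can be computed as the sum of each SNP's effect size multiplied by the number of effect alleles an individual carries. The SNP identifiers, weights, and dosages below are hypothetical illustration values, not data from the study.

```python
def polygenic_risk_score(dosages, weights):
    """Sum weight * allele dosage over the SNPs present in both maps."""
    return sum(weights[snp] * dosages[snp] for snp in weights if snp in dosages)

# Hypothetical per-SNP effect sizes (for example, log odds ratios from a GWAS).
weights = {"rs_a": 0.25, "rs_b": -0.5, "rs_c": 0.125}
# Hypothetical effect-allele dosages (0, 1, or 2 copies) for one individual.
dosages = {"rs_a": 2, "rs_b": 1, "rs_c": 2}

score = polygenic_risk_score(dosages, weights)
print(score)
```

SNPs missing from the individual's genotype data are skipped here; a real pipeline would need an explicit policy for missing genotypes.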
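Taiwan Biobank (台灣人體生物資料庫)
1. Establish Taiwan's own reference sequence
2. Build sequence information for a healthy control group in Taiwan
3. Serve as a template for genotype imputation, increasing research value
4. Provide the distribution of low-frequency variants (rare alleles) in Taiwan's healthy population
5. Support the development of genome-wide association studies (GWAS)
TWB 2.0 is a genotyping array designed for the Han Chinese population in Taiwan. It contains variant sites that can be applied immediately in clinical practice, as well as variant sites for genome-wide association analysis in precision medicine research (714,431 SNPs).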
GWAS Summary (Manhattan) Plot of the Association Statistics Highlighting Susceptibility Loci with Genome-wide Significance for COVID-19 mortality
PRS-CS - Polygenic Prediction via Bayesian Regression and Continuous Shrinkage Priors
Pain O, PLoS Genet. 2021 May 4;17(5):e1009021.
• Age + Sex + UKB PRS-CS: 0.838 (0.804 - 0.872)
• Age + Sex + BMI + UKB PRS-CS: 0.844 (0.811 - 0.877)
Azure: Introduction
Azure Global Infrastructure (microsoft.com)
Microsoft is committed
to harnessing the power of technology
to help everyone, everywhere
build a more sustainable future. 2030
• On-demand global scale
• Linux, Open Source, and Red Hat
• Best for Microsoft workloads
• Purpose-built infrastructure
Compliance in the trusted cloud | Microsoft Azure
1. Digitally recreate the Eindhoven University of Technology sports center.
2. Simulate the fitness center’s airflow and understand how air purifiers and ventilation systems could join forces to help reduce contagion.
Using Fluent and the compute power of Ansys Cloud via Microsoft Azure HPC, the team accomplished their research in just 3 weeks instead of an estimated 3 years.
Breathe Easy: Conquering the Coronavirus With CFD | Ansys Advantage
• Up to 500K samples
• Up to 96 million SNPs
• Per-chromosome file sizes range from 37 GB (chromosome 22) to 188 GB (chromosome 2)
• Total data size handled: ~2.4 TB
Effective collaboration on the Cloud
[Architecture diagram, steps 1 to 4. Labels: user1, user2, user3; Role-Based Access Control; Blob; VM w/ NVMe; HPC cluster; CycleCloud. Workloads: Quality Control and LocusZoom, Q-Q Plot, Manhattan Plot, etc.; DNN, PRS, Jupyter; Genotype Imputation]
Azure VMs (costs for East US)
VM size | Processor | CPU | Memory | Base/Peak CPU frequency (GHz) | Local disk | Cost (on-demand) | Cost (low-priority)
HC44rs | Intel Xeon Platinum 8168 | 44 | 352 GiB | 2.7/3.4 | 700 GiB (NVMe) | $3.17/hr | $0.63/hr
HB120v3 | AMD EPYC 7V13 | 120 | 448 GiB | 2.45/3.675 | 2 x 960 GiB (NVMe) | $3.6/hr | $0.72/hr
E64dsv4 | Intel Xeon Platinum 8272CL | 64 | 504 GiB | 2.7/4.0 | 2400 GiB | $4.61/hr | $0.92/hr
L80sv2 | AMD EPYC 7551 | 80 | 640 GiB | 2.55/3.0 | 10 x 1.92 TB (NVMe) | $6.24/hr | $1.25/hr
Source: Pricing Calculator | Microsoft Azure
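One rough way to compare the VM sizes listed above is cost per core-hour. The sketch below copies the core counts and East US prices from the slide and assumes the "CPU" column is the core count; it ignores memory, disk, and per-core performance differences, so it is only a first-pass comparison.

```python
# (cores, on-demand $/hr, low-priority $/hr), copied from the slide.
VM_PRICES = {
    "HC44rs": (44, 3.17, 0.63),
    "HB120v3": (120, 3.6, 0.72),
    "E64dsv4": (64, 4.61, 0.92),
    "L80sv2": (80, 6.24, 1.25),
}

def cost_per_core_hour(vm, low_priority=False):
    """Hourly price divided by core count for the given VM size."""
    cores, on_demand, low = VM_PRICES[vm]
    return (low if low_priority else on_demand) / cores

cheapest = min(VM_PRICES, key=cost_per_core_hour)
for vm in VM_PRICES:
    print(f"{vm}: ${cost_per_core_hour(vm):.4f} per core-hour on demand")
print("Lowest on-demand cost per core:", cheapest)
```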
GWAS on predicting Covid-19 mortality rate
Experience on Azure HPC
Step 1. Data acquisition
• Genotypes: ~805,426 markers, size: ~300 GB
Step 2. QC (Quality control)
Step 3. GWAS (Genome-wide Association Study)
Step 4. PRS (Polygenic risk score)
• PRS C+T
• PRS CS
• PRS DNN
Step 5. Download the results
Step 1. Data acquisition
Genotypes
• ~805,426 markers
• size: ~300 GB
Imputed genotypes
• ~96 million variants
• size: ~2.4 TB
Covid-19 testing
Death register
Reference: UKB resource 530
UK Biobank
• ~500,000 individuals
• 40-69 years old
Step 1. Data acquisition
Download
QC, GWAS and PRS
Upload to cloud
Problem.
• Data must be downloaded by
• Not familiar with
• , download over 1 month
• Upload data is extremely slow
Reference: UKB category 263
Step 1. Data acquisition
Solution.
• Blob
• Download 2.4 TB in <20 mins (up to 32 Gbps)
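The download figure can be sanity-checked with simple arithmetic. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second) and an ideal link running at full line rate, which real transfers rarely sustain.

```python
def transfer_minutes(size_tb, link_gbps):
    """Ideal transfer time in minutes, using decimal units."""
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds / 60

minutes = transfer_minutes(2.4, 32)
print(f"{minutes:.1f} minutes at line rate")
```

At line rate the transfer takes about 10 minutes, so a figure under 20 minutes leaves room for protocol overhead and throughput below the 32 Gbps peak.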
Step 2. Quality control
For SNPs
• MAF >0.001
• INFO score >0.3
For Individuals
• sex mismatch
• extreme heterozygosity
• sex chromosome aneuploidy
• kinship inference
Problem.
• PLINK on-premises is too slow
• How to choose the appropriate VM?
Reference: UKB resource 531
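The SNP-level thresholds on this slide amount to a simple filter. The sketch below is illustrative only: the record layout and the field names `maf` and `info` are hypothetical, and in the workflow described here the filtering was done with PLINK rather than hand-written code.

```python
MAF_MIN = 0.001   # minor allele frequency threshold from the slide
INFO_MIN = 0.3    # imputation INFO score threshold from the slide

def passes_snp_qc(snp):
    """Keep a SNP only if both MAF and INFO exceed their thresholds."""
    return snp["maf"] > MAF_MIN and snp["info"] > INFO_MIN

# Hypothetical SNP records for illustration.
snps = [
    {"id": "snp1", "maf": 0.20, "info": 0.98},
    {"id": "snp2", "maf": 0.0005, "info": 0.95},  # fails MAF
    {"id": "snp3", "maf": 0.05, "info": 0.25},    # fails INFO
]
kept = [s["id"] for s in snps if passes_snp_qc(s)]
print(kept)
```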
Step 2. Quality control
Solution.
• Try chromosome 2 (7.1 million SNPs, 188 GB); HPC can help
• Over 150 VMs...
Credit: Raymond Meng-Ru Tsai
Step 2. Quality control
• On premises: >24 hours
• HB120v3 (Win10): ~1 hour
• Speedup: more than 24x
Credit: Raymond Meng-Ru Tsai
Step 2. Quality control
Solution.
• HB120v3 with local NVMe disks + TeraCopy
[Figure: disk I/O performance of the HB120v3 NVMe disks]
Credit: Raymond Meng-Ru Tsai
Step 3. GWAS
[Diagram: Chromosome 1 through Chromosome 22 are each analyzed on HB120v3 with NVMe disks to produce effect sizes]
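Because the association test runs independently per chromosome, the 22 jobs can be generated programmatically. The sketch below only builds the command lines and does not run them. The slides do not show the exact PLINK invocation, so the choice of PLINK 2's `--glm` and the file prefixes, phenotype file, and output directory are assumptions for illustration.

```python
def gwas_commands(prefix_template, pheno_file, out_dir, threads):
    """Build one PLINK 2 association command per autosome (1 to 22)."""
    commands = []
    for chrom in range(1, 23):
        commands.append([
            "plink2",
            "--bfile", prefix_template.format(chrom=chrom),  # hypothetical prefix
            "--pheno", pheno_file,                            # hypothetical file
            "--glm",
            "--threads", str(threads),
            "--out", f"{out_dir}/chr{chrom}",
        ])
    return commands

commands = gwas_commands("data/chr{chrom}", "pheno.txt", "results", threads=120)
print(len(commands), "commands;", " ".join(commands[0]))
```

Each command list could then be passed to `subprocess.run` on a single VM or submitted as a separate job on a cluster scheduler.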
Step 4. PRS
Tools: PLINK, BigSNPR, PRSice-2, Lassosum
Problem.
• BigSNPR is too slow
• Use PLINK
[Comparison table of PLINK and BigSNPR by input format (.bed, .bgen) and task (GWAS, PRS (C+T)); surviving cell values: PLINK 2, PLINK 1.9, Slow, Slow, Slow]
Step 5. Download GWAS/PRS result
[Diagram, steps 1 to 3. Labels: Blob; NVMe disks on HB120v3 (QC/GWAS, PRS); On premises]
• Download 2.4 TB <20 mins!
• HPC can help!
Randomly split the data into train/test sets in a stratified fashion.
Filter SNPs by GWAS statistics (p-value / effect size)
Use SMOTE to generate synthetic samples from the minority class.
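A stratified split keeps the ratio of deaths to survivors the same in both sets, which matters when the outcome is rare. The sketch below is a minimal pure-Python version under assumed settings (20% test fraction, fixed seed); the slides do not say which library or split ratio was used.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical imbalanced labels: 90 controls (0) and 10 cases (1).
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```

Synthetic oversampling such as SMOTE is normally applied to the training set only, after the split, so that no synthetic samples leak into the test set.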
Feature Engineering:
• posterior SNP effect sizes under continuous shrinkage (paper)
Training & Regularization:
• Learning Rate Decay/Batch Normalization/Dropout Layer
Loss Function:
• Weighted Loss/AUC-targeted loss function/L2 Norm
Hyperparameter tuning:
• Bayesian Optimization
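To illustrate the "Weighted Loss" item listed under the loss function, here is a minimal sketch of a class-weighted binary cross-entropy in plain Python. The `pos_weight` value is a hypothetical choice; the slides do not give the weighting actually used, and a real DNN would use its framework's built-in loss.

```python
import math

def weighted_bce(y_true, y_prob, pos_weight):
    """Mean binary cross-entropy with the positive class up-weighted."""
    eps = 1e-12  # clamp probabilities to avoid log(0)
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical batch: one positive (death) among four samples.
y_true = [1, 0, 0, 0]
y_prob = [0.5, 0.5, 0.5, 0.5]
loss = weighted_bce(y_true, y_prob, pos_weight=3.0)
print(loss)
```

With `pos_weight=3.0` the single positive sample contributes as much to the loss as the three negatives combined, which counteracts the class imbalance.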
Windows DSVM
• vCPU 24
• 224 GiB RAM
Blob
Conduct Study with Built-in Data Analytic Tools & Computing Resources
Beagle Imputation in SVS (slideshare.net)
• Memory usage
• # of samples
• # of SNPs
Shi, Shuo, et al. “Comprehensive Assessment of Genotype Imputation Performance.” Human Heredity, vol. 83, no. 3, S. Karger AG, 2017, pp. 107–16.
• An Azure HPC cluster can perform
Genotype Imputation on all 22
chromosomes simultaneously.
• Leveraging Azure CycleCloud to
parallelize the pipeline execution.
Genotype Imputation performance
• ~46M SNPs for 10,417 samples after Quality Control.
• VCF file size:
• Chromosome 9: 78 GB (input), 15 GB (output)
• Chromosome 21: 25 GB (input), 0.5 GB (output)
Total runtime and cost estimation
• The total accumulated compute time to complete all 22 chromosomes is estimated at ~400 hours using 22 HB120v3 VMs in parallel.
• Average Azure cost per sample can be as low as ~$0.22.
Azure HPC to accelerate Genome-wide Analysis study (GWAS) (microsoft.com)
AI for Health | Microsoft AI
Azure HPC to accelerate Genome-wide Analysis study (GWAS) (microsoft.com)
Breathe Easy: Conquering the Coronavirus With CFD | Ansys Advantage
Genotype Imputation: Beagle’s scalability (on-going)
Workflow:
1. Convert from PLINK binary (.bed, .bim, & .fam) to VCF (qctool):
2. Run imputation (Beagle 5.2):
• java -jar beagle.21Apr21.304.jar gt=c21.vcf out=c21out.gt
• Note: to mitigate the “java.lang.OutOfMemoryError: Java heap space” error:
  • -Xmx400g
  • window=10
  • VM w/ bigger size:
    • HB120v3-16 (448GB RAM, mem/core = 28GB)
    • M192idms-v2 (4,096GB RAM, mem/core = 21.3GB)
java -Xmx400g -Xms200g -jar beagle.21Apr21.304.jar gt=c2.vcf out=c2out_window1o_10hb120v3.gt window=10
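Since each chromosome is imputed independently, the Beagle command above can be generated once per chromosome and dispatched to separate VMs. The sketch below only builds the command lines, using the arguments shown on the slide (`-Xmx`, `gt=`, `out=`, `window=`); the `c{N}.vcf` and `c{N}out.gt` naming pattern is an assumption extrapolated from the slide's examples.

```python
def beagle_command(chrom, jar="beagle.21Apr21.304.jar", heap_gb=400, window=10):
    """Build the Beagle command line for one chromosome."""
    return [
        "java", f"-Xmx{heap_gb}g",
        "-jar", jar,
        f"gt=c{chrom}.vcf",      # assumed input naming pattern
        f"out=c{chrom}out.gt",   # assumed output naming pattern
        f"window={window}",
    ]

commands = [beagle_command(c) for c in range(1, 23)]
print(" ".join(commands[1]))
```

The heap size should stay below the VM's physical memory (448 GB on the HB120v3 sizes listed above), which is why 400 GB is used as the default here.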
Genotype Imputation performance
Current software for genotype imputation | SpringerLink, 2009
• BEAGLE's cumulative runtime was the shortest of all three programs (350 minutes; 366 minutes in memory-saving mode [5 per cent increase]).
• IMPUTE required a considerably longer time (433 minutes [24 per cent higher than that of BEAGLE]; 464 minutes when split into 18 chromosomal segments of ~10 Mb [7 per cent increase]).
• MACH was by far the slowest program (2781 minutes [695 per cent higher than that of BEAGLE], that is, about two days; 4421 minutes in memory-saving mode [59 per cent increase]).
A One-Penny Imputed Genome from Next-Generation Reference Panels, 2018
• Beagle 5.0, Beagle 4.1, Minimac4, Minimac3, and Impute4 performance comparison
• For 10k, 100k, 1M, and 10M reference samples and 1,000 target samples:
• Single-threaded: Beagle 5.0’s computation time was 3 times (10k), 12 times (100k), 43 times (1M), and 533 times (10M) faster than others.
• Multi-threaded (12 cores): Beagle 5.0’s computation time was 5 times (10k), 23 times (100k), 156 times (1M), and 458 times (10M) faster than others.
Comprehensive Assessment of Genotype Imputation Performance (karger.com), 2019
• Beagle 4.1 has almost the same performance as SHAPEIT2+IMPUTE2 and is much faster than IMPUTE2 & MACH+Minimac3
• Beagle 4.1 has the lowest memory usage
