Telomere-to-telomere assembly of complete human chromosomes
Karen Miga
UC Davis Genetics Seminar
Sept 30, 2019
@khmiga
New Era in Genetics and Genomics
We are finally reaching complete, high-quality
telomere-to-telomere chromosome assemblies
Human reference genome is incomplete.
• 368 unresolved issues, 102 gaps
• Segmental duplications, gene families, satellite
arrays, centromeres, rDNAs
• Uncharacterized sequence variation in the human
population
chr21
Our current understanding of genome biology and function: 30 Mb
~20 Mb ?
Challenge:
Generating assemblies across repetitive regions that
span hundreds of kilobases.
[Figure: repeats (100 kb+) with unique variants]
Can high-coverage ultra-long sequencing resolve
complete assemblies of the human genome?
MinION (100 kb+)
It’s time to finish the human genome
The Telomere-to-Telomere (T2T) consortium is an
open, community-based effort to generate the
first complete assembly of a human genome.
Our target: CHM13hTERT
Cell line from Urvashi Surti, Pitt; SKY karyotype from Jennifer Gerton and Tamara Potapova, Stowers
N=46; XX
Intramural Sequencing Center
CHM13 Sequencing
94 MinION/GridION flow cells
11.1M reads
155 Gb (1.6 Gb / flow cell) (50x)
99 Gb in reads >50 kb (32x)
78 Gb in reads >70 kb (25x)
Max mapped read length 1.04 Mb
From May 1, 2018 to Jan 8, 2019
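The fold-coverage figures on this slide follow from dividing sequenced bases by genome size. A minimal sketch of that arithmetic, assuming a genome size of about 3.1 Gb (the value implied by 155 Gb at 50x; it is not stated on the slide) and a hypothetical helper name:

```python
def fold_coverage(total_gb: float, genome_size_gb: float) -> float:
    """Fold coverage = sequenced bases / genome size (both in Gb here)."""
    return total_gb / genome_size_gb


# Assumption: ~3.1 Gb genome size, inferred from 155 Gb being reported as 50x.
GENOME_SIZE_GB = 3.1

for label, gb in [("all reads", 155), ("reads >50 kb", 99), ("reads >70 kb", 78)]:
    print(f"{label}: {fold_coverage(gb, GENOME_SIZE_GB):.1f}x")

# Yield per flow cell: 155 Gb over 94 flow cells.
print(f"per flow cell: {155 / 94:.1f} Gb")
```

Under that assumption the three totals round to 50x, 32x and 25x, which agrees with the slide.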
• 50x Nanopore ultra-long: contig building
• 60x PacBio: polishing
• 50x 10x Genomics: polishing
• BioNano: structural validation
Roadmap for completing the genome
Canu
• 2.94 Gbp assembly; NG50: 75 Mbp
• Exceeds the continuity of the reference genome GRCh38 (56 Mbp NG50 contig size).
• A subset of chromosome assemblies breaks only at the centromere.
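NG50 is the continuity measure quoted here. As a rough illustration (not the consortium's own tooling), it can be computed as the contig length at which the running total of contigs, sorted longest first, reaches half of the estimated genome size. The function name and the toy numbers are illustrative only:

```python
def ng50(contig_lengths, genome_size):
    """Return NG50: the length L such that contigs of length >= L
    together cover at least half of the estimated genome size."""
    target = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0  # the assembly covers less than half of the genome size


# Toy example: genome size 200, contigs sum to 150.
print(ng50([50, 40, 30, 20, 10], 200))  # 50 + 40 + 30 reaches 100, so NG50 = 30
```

Unlike N50, the denominator is the genome size rather than the assembly size, so assemblies of different total length can be compared on the same scale.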
Orthogonal Validation
Jo and Valerie
2.2 - 3.7 Mb
mean of 3010 kb (S.D. = 429; n = 49)
STRUCTURAL VARIANT
Assemble contigs using overlapping SV patterns
[Figure: structural variant patterns along the array; ends labeled Xq and Xp]
Scaffold Assembly of XCEN
Xq Xp
Rel3 Assembly: ~3.1 Mb
The assembly is a hypothesis(!)
Beth Sullivan, Jennifer Gerton, Edmund Howe
@NanoporeConf | #NanoporeConf
Marker-assisted mapping
Adam Phillippy, Arang Rhie, Sergey Koren
Create a scaffold of unique (single-copy) k-mers genome-wide
Anchor high-confidence long-read alignments to repeat assemblies
Confident mapping of long reads
using a single-copy k-mer strategy
Identify and mark all sites of unique anchors across the chromosome
chrX
• 21-mers that appear ~c times in Illumina data (c = sequencing coverage, the count expected for a single-copy k-mer)
• Also found in PacBio/Nanopore reads
• Less frequent in the centromere, but still there
• (Validated with Duplex-Seq)
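The selection of single-copy markers can be sketched as follows: count every k-mer in the short reads and keep those whose count falls in a band around the coverage c. This is a toy illustration, not the pipeline used in the talk. The function names and the tolerance band are assumptions, and the sketch ignores reverse complements and sequencing errors.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a collection of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def single_copy_kmers(counts, coverage, tolerance=0.5):
    """Keep k-mers whose count is within +/- tolerance * coverage of coverage.

    Assumption: a k-mer present once in the genome is seen about `coverage`
    times in the reads; repeats are seen at multiples of that.
    """
    low = coverage * (1 - tolerance)
    high = coverage * (1 + tolerance)
    return {kmer for kmer, n in counts.items() if low <= n <= high}


# Toy data: the same 8-base "genome" read 4 times, k = 3 (the slide uses k = 21).
reads = ["AAAAACGT"] * 4
counts = count_kmers(reads, 3)
print(sorted(single_copy_kmers(counts, coverage=4)))  # ['AAC', 'ACG', 'CGT']
```

The repeated k-mer AAA is counted 12 times and is rejected, while the three k-mers that occur once per copy of the sequence are kept.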
Filter long-read alignments, retaining those anchored by unique k-mers (chrX)
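One way to picture the filtering step: an alignment is kept only if the read actually contains a single-copy marker k-mer that lies inside the reference interval it aligns to. The sketch below is a simplified stand-in for the real method. The `Alignment` record, the function names, and the `min_anchors` parameter are all hypothetical, and real alignments would also need strand and coordinate checks.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass
class Alignment:
    read_id: str
    ref_start: int  # 0-based, inclusive
    ref_end: int    # exclusive
    read_seq: str


def is_anchored(aln, positions, marker_kmers, k, min_anchors=1):
    """True if the read carries >= min_anchors marker k-mers located
    inside the reference interval covered by the alignment."""
    lo = bisect_left(positions, aln.ref_start)
    hi = bisect_right(positions, aln.ref_end - k)
    hits = sum(1 for pos in positions[lo:hi] if marker_kmers[pos] in aln.read_seq)
    return hits >= min_anchors


def filter_alignments(alignments, marker_kmers, k, min_anchors=1):
    """marker_kmers maps reference position -> single-copy k-mer at that position."""
    positions = sorted(marker_kmers)
    return [a for a in alignments
            if is_anchored(a, positions, marker_kmers, k, min_anchors)]


markers = {10: "ACG", 50: "TTG"}
alignments = [
    Alignment("r1", 0, 20, "GGACGGG"),  # spans marker at 10 and contains ACG
    Alignment("r2", 0, 20, "GGGGGGG"),  # spans marker at 10 but lacks ACG
    Alignment("r3", 30, 45, "TTG"),     # contains TTG but does not span position 50
]
print([a.read_id for a in filter_alignments(alignments, markers, k=3)])  # ['r1']
```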
Spacing of single-copy k-mers can be irregular in
repeat-dense regions
X centromere array (CENX): 3.1 Mbp
• Number of k-mers: 2,034
• Spacing N50: 6,879 bp
• Longest distance between k-mers: 53,798 bp
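The spacing statistics can be derived from the sorted marker positions. The sketch below assumes that "Spacing N50" means the gap length at which gaps, sorted longest first, account for half of the total marker-to-marker distance. That reading is an assumption, and the function names are illustrative.

```python
def marker_gaps(positions):
    """Distances between consecutive marker positions."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def n50(values):
    """Smallest value v such that values >= v sum to at least half the total."""
    total = sum(values)
    running = 0
    for v in sorted(values, reverse=True):
        running += v
        if running * 2 >= total:
            return v
    return 0


# Toy marker positions: irregular spacing with one long marker-free stretch.
gaps = marker_gaps([0, 10, 30, 100])
print(gaps)       # [10, 20, 70]
print(n50(gaps))  # 70
print(max(gaps))  # 70
```

A long maximum gap relative to read length matters here, because a read that falls entirely inside a marker-free stretch cannot be anchored.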
• Unique k-mer-based filtering, Nanopore reads: nanopolish (two rounds)
• Unique k-mer-based filtering, PacBio (CLR) reads: arrow (two rounds)
• 10XG polishing: longranger + freebayes (two rounds)
GAGE pre-polishing
ChrX GAGE array: 19 tandemly arrayed ~9.4 kb repeats
[Plot: coverage (0 to 250) by base position, showing the most frequent base and the second most frequent base (error)]
GAGE with marker-assisted polishing
ChrX GAGE array: 19 tandemly arrayed ~9.4 kb repeats
[Plot: coverage (0 to 250) by base position, showing the most frequent base and the second most frequent base (error)]
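The two plotted series (most frequent base and second most frequent base at each position) can be computed from a pileup column, that is, the bases that aligned reads show at one reference position. A minimal sketch, with a plain string standing in for the column and hypothetical function names:

```python
from collections import Counter


def top_two_bases(pileup_column):
    """Return ((base, count), (base, count)) for the most and second most
    frequent bases in one pileup column; missing entries are (None, 0)."""
    ranked = Counter(pileup_column).most_common(2)
    first = ranked[0] if ranked else (None, 0)
    second = ranked[1] if len(ranked) > 1 else (None, 0)
    return first, second


def minor_fraction(pileup_column):
    """Fraction of reads supporting the second most frequent base."""
    (_, _), (_, second_count) = top_two_bases(pileup_column)
    depth = len(pileup_column)
    return second_count / depth if depth else 0.0


print(top_two_bases("AAAAAAAT"))   # (('A', 7), ('T', 1))
print(minor_fraction("AAAAAAAT"))  # 0.125
```

In a well-polished haploid region the second series should stay near the sequencing error rate; a persistent high minor fraction suggests a consensus error or reads mismapped from another repeat copy.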
CCS/HiFi Evaluation
HiFi alignments to evaluate polishing (chrX)
Centromere X, before polishing (DXZ1: 3.1 Mb)
Centromere X, after polishing (DXZ1: 3.1 Mb)
Note: the underlying satellite array structure remains the same.
Opens the whole genome to analysis
Ariel Gershman
Winston Timp’s
Laboratory
Finished T2T X Chromosome:
High Accuracy and High Continuity
1. Structurally validated assembly from telomere to telomere, including the 3.1 Mb tandem repeat at the X centromere and providing a complete assessment across tandemly repeated gene families.
2. Novel polishing strategy capable of improving the quality of large repeat-rich regions, demonstrating dramatic improvements in quality over the entirety of the X chromosome.
3. Statistics of CHM13 full-length BAC alignments to the polished assembly:

Assembly   BACs aligned     Median    Mean
UL-asm     275/341 (81%)    QV 37.4   QV 27.9
HiFi       153/341 (45%)    QV 37.7   QV 27.4

Vollger M, Logsdon G, et al. bioRxiv doi.org/10.1101/635037
It is time to finish the
human genome
NEW! Rel3 open data release
• github.com/nanopore-wgs-consortium/chm13
• 120x Nanopore reads: NHGRI, UW, Nottingham, UC Davis (PromethION, Megan Dennis)
• 50x 10x Genomics linked reads (NHGRI)
• 70x PacBio CLR reads (WashU)
• 24x PacBio HiFi reads (UW)
• 40x Hi-C (Arima Genomics)
• BioNano optical map (WashU)
• Unpolished Canu assemblies
Additional ultra-long ONT data from Glennis Logsdon (UW)
13.9X coverage

Read length   Coverage   Percent of data
>50 kbp       12X        86%
>100 kbp      9.1X       66%
>150 kbp      6.8X       49%
>200 kbp      4.9X       35%
>250 kbp      3.4X       24%

N50 = 147.1 kbp; N1 = 649.6 kbp; Max = 1538.3 kbp
[Histogram: number of reads by read length (kbp), log-scaled x-axis from 0.1 to 10,000]
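A table like the one on this slide can be built from a list of read lengths: for each threshold, sum the bases in reads longer than it, then divide by the genome size for coverage and by the total bases for percent of data. The sketch below uses toy numbers and a hypothetical function name; it is not the script that produced the slide.

```python
def length_table(read_lengths, thresholds, genome_size):
    """Return (threshold, fold coverage, percent of data) for reads
    strictly longer than each threshold."""
    total = sum(read_lengths)
    rows = []
    for threshold in thresholds:
        bases = sum(n for n in read_lengths if n > threshold)
        rows.append((threshold, bases / genome_size, 100.0 * bases / total))
    return rows


# Toy data: five reads against a "genome" of size 100.
for threshold, cov, pct in length_table([100, 80, 60, 40, 20], [50, 90], 100):
    print(f">{threshold}: {cov:.1f}X, {pct:.0f}% of data")
```

The slide's own figures are consistent with this definition: 12X out of 13.9X total is about 86% of the data.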
Triple the coverage, what changed?
• Minimal change in continuity
  • 79.5 Mbp (rel2) vs. 71.8 Mbp (rel3) NG50
  • Don’t judge assemblies based on continuity
• Tricky regions are fixed
  • GAGE and more SegDups automatically resolved
• Improved BAC validation
  • 288 (rel2) vs. 310 (rel3) of 341 BACs resolved
• 1 chromosome down, 23 to go…
Goal: a complete human genome in the next two years.
Challenges in front of us:
• Acrocentric p-arms
• Large segmental duplications
• Classical human satellites 2 and 3
Establishing new benchmarking standards (chrX)
Pioneering new pipelines: polishing, repeat assembly, and array structural validation.
Setting the bar higher for quality and completeness.
Editor's Notes

  • #8: KEY POINT HERE: spacing of unique variants… Some regions are easier than others….
  • #31: Number of k-mers: 2,034 Spacing N50: 6,879 Longest distance: 53,798 bp
  • #41, #42, #43: Median BAC QV 37.4 (mean QV 28.0) vs. median QV 37.6 (mean QV 27.4) for the best CHM13 HiFi asm. Resolves 85% of BACs at >99.8% identity vs. 54% for the prior PacBio asm. Total BACs: 341.
      Compressed: 166 1; Median: 99.9895% (QV 39.78811); Mean: 99.8706% (QV 28.88052)
      Mitchell HiFi: 153 1; Median: 99.9827% (QV 37.61954); Mean: 99.81871% (QV 27.41627)
      UL + 10x: 275 1; Median: 99.982% (QV 37.44727); Mean: 99.84145% (QV 27.99832)
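The QV figures in these notes are consistent with the standard Phred-style conversion from alignment identity, QV = -10 · log10(1 - identity). A short sketch of that conversion (the function name is illustrative):

```python
import math


def identity_to_qv(identity_percent: float) -> float:
    """Phred-scaled quality value from percent identity:
    QV = -10 * log10(error rate), with error rate = 1 - identity."""
    error = 1.0 - identity_percent / 100.0
    if error <= 0:
        return math.inf  # a perfect match has no finite QV
    return -10.0 * math.log10(error)


print(f"{identity_to_qv(99.982):.2f}")    # 37.45 (UL + 10x median)
print(f"{identity_to_qv(99.84145):.2f}")  # 28.00 (UL + 10x mean)
```

Because the scale is logarithmic, a handful of poorly matching BACs pulls the QV of the mean identity far below the QV of the median, which is why the two columns differ by roughly 10 QV.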