Centromeric Regions:
A source of new, unexplored human
sequence variation
Karen H. Miga
University of California, Santa Cruz
Jan 25, 2018
GIAB Workshop
Identifying Sequence Variants
[Figure: CHR 9, Allele 1 vs. Allele 2, for each variant class below]
Mobile element insertion (LINE)
Copy Number Variation
Inversion Polymorphism
Single Nucleotide Polymorphisms
…ATACGGATTTCATGACAGGTTA…
…ATACGGATTTGATGACAGGTTA…
Centromeres: Large Assembly Gaps
[Figure: the CENTROMERIC REGIONS between the p-arm and q-arm are Multi-Megabase Assembly Gaps (?)]
Inability to track variation
[Figure: Multi-Megabase Assembly Gaps (?) between the p-arm and q-arm]
CENTROMERIC REGIONS
Unable to identify using standard genomic data:
Mobile element insertion
Copy Number Variation
Inversion Polymorphism
SNPs
Cytogenetics: Identifying Sequence Variants
[Figure: CHR 9 alleles (Allele 1, Allele 2), chr 9qh and chr 9qh+]
H. E. Wyandt, V. S. Tonk, Human Chromosome Variation: Heteromorphism and Polymorphism, 2011
Centromeres Play a Role in Cell Division
Regulate Centromere Function
Contribute to Chromosome Cohesion
• 9qh+ men had significantly increased frequencies of hyperdiploid cells (Ford et al 1978)
• 9qh+ women showed significant differences in rates of aneuploidy (Ford et al 1978)
• 9qh+ is associated with an increased fraction of malformed spermatozoa (Eiben et al 1987)
• Inversions spanning 9qh relate to recurrent miscarriages in Italian populations (Del Porto et al 1993)
Uncharted Functional Regions of the
Human Genome
Part I: Constructing a reference map of centromeric DNAs
Part II: Expand the human “variation reference map” to
include centromeric DNAs
ALPHA SATELLITE: ~171 bp Tandem Repeat
[Figure: multi-megabase array of monomers 1 2 3 4 between the p-arm and q-arm]
Wide Range of Percent ID: ~60-100%
Human Centromeric DNA: Higher Order Repeats
“Higher Order Repeat”: Multi-monomeric Repeat Unit
[Figure: multi-megabase array of repeat units 1 2 3 4, 1 2 3 4, 1 2 3 4 between the p-arm and q-arm]
Narrow Range of Percent ID: 94% - 100%
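The contrast between the two percent-identity ranges can be sketched in a few lines of Python. This is only an illustration on invented toy monomers of equal length, not the method used in the talk: the function names are hypothetical, and real alpha satellite monomers would need a proper alignment rather than a position-by-position comparison.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("toy comparison requires equal-length sequences")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def mean_identity_at_lag(monomers: list[str], lag: int) -> float:
    """Average identity between each monomer and the one `lag` positions later."""
    if not 0 < lag < len(monomers):
        raise ValueError("lag must be between 1 and len(monomers) - 1")
    pairs = [(monomers[i], monomers[i + lag]) for i in range(len(monomers) - lag)]
    return sum(percent_identity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical 10-bp "monomers" 1 2 3 4, repeated three times as a 4-mer HOR.
unit = ["AAAAACCCCC", "AAAAAGGGGG", "TTTTTCCCCC", "TTTTTGGGGG"]
array = unit * 3

neighbour_identity = mean_identity_at_lag(array, 1)  # adjacent monomers differ
hor_identity = mean_identity_at_lag(array, 4)        # one full repeat unit apart
```

In this toy array, monomers one repeat unit apart are identical while adjacent monomers are not, which is the pattern the "wide" and "narrow" ranges describe.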
Human Centromeres:
Chromosome-Specific Satellite Sequence Organization
[Figure: chrX and chr3 centromeres, each between a p-arm and q-arm, with satellite arrays labeled Array “A”, Array “B” and Array “C”]
Human Centromeric DNA: Genome Model of Sequence Organization
[Figure, built up over several slides: satellite array between the p-arm and q-arm, annotated with sequence variants (-A-, -T-), an INVERSION, interspersed LINE, SINE and OTHER repeats, NON-ALPHA SATELLITE, and Non-satellite DNA with GENES]
Construct a new genomic reference for each centromeric region to broaden research in these areas
Genome Informatics
GM12878
B-lymphoblastoid
(Female/CEPH)
Datasets involved in Centromeric Reference Map
>200 ENCODE datasets
Prediction of Higher Order Repeats
PacBio ~10kb read: 5’… A B C D E F …3’
α-Centauri (centromeric automated repeat identification)
[Figure: the monomer order A B C D E F repeats 10x along the read and is drawn as a circular graph A, B, C, D, E, F with each edge labeled 10]
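The idea of reading a higher order repeat off the monomer order in a single long read can be sketched as follows. This is a minimal illustration and not the α-Centauri implementation: it assumes the monomers have already been classified into labels such as A to F, and the function names are hypothetical.

```python
from collections import Counter


def smallest_period(labels: str) -> int:
    """Smallest p such that the label string repeats with period p."""
    n = len(labels)
    for p in range(1, n + 1):
        if all(labels[i] == labels[i - p] for i in range(p, n)):
            return p
    return n


def edge_counts(labels: str) -> Counter:
    """Count how often each monomer is followed directly by another one."""
    return Counter(zip(labels, labels[1:]))


# Toy read: the monomer order A B C D E F observed ten times in a row.
read_labels = "ABCDEF" * 10

period = smallest_period(read_labels)  # length of the predicted HOR, in monomers
edges = edge_counts(read_labels)       # edge support for a circular HOR graph
```

On a linear read the closing edge (F followed by A) is seen one time fewer than the internal edges, so a real tool would have to decide how to treat read ends.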
Chromosome specific assignment
[Figure: circular HOR graph of monomers A B C D E F; chromosome of origin unknown (?)]
Experimental Evidence:
Chromosome-specific Satellite DNA tools to
Screening Somatic Cell Hybrid Panel
[Figure: HOR graph of monomers A B C D E F]
D7Z1 6-mer
Waye et al (1987)
98%
GenBank: M16101
Flow Sorted Chromosome
Alignment/Enrichment
Illumina sequencing of isolated human
chromosomes
Long Range Read Support
“Anchor” reads mapped to the assembled p-arm and/or q-arm
Chromosome specific assignment
Chromosome-assignment of Higher Order Repeats
Read Depth Estimates of Average Satellite Array Size
7q-arm
D7Z1 (6-mer)
7p-arm
D7Z2 (16-mer)
R Wevrick and H F Willard. NAR ( 1991 )
Array size estimate: ~2.65 Mb
Array estimate: ~0.42 Mb
[Figure: chromosome 7 centromere between the 7p-arm and 7q-arm, with arrays D7Z1 (6-mer; HOR graph of monomers A B C D E F) and D7Z2 (16-mer); D7Z1 (Illumina Read Database)]
Hybrid approach
Long reads inform sequence structure
Short, high-quality reads generate frequency estimates
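A read-depth estimate of average array size can be sketched as a simple proportion. The function below is an assumption-laden illustration rather than the pipeline behind the slide: it assumes uniform coverage, equal read lengths, and reads that have already been assigned unambiguously to one satellite array. All names and numbers are hypothetical toy values.

```python
def estimate_array_size_bp(satellite_reads: int,
                           total_reads: int,
                           haploid_genome_bp: int) -> float:
    """Scale the fraction of reads assigned to an array by the genome size.

    The result is an average size per haploid genome, because the two
    homologous arrays in a diploid sample cannot be separated by depth alone.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return satellite_reads / total_reads * haploid_genome_bp


# Toy numbers only: 500 of 1,000,000 reads match the array of interest,
# and the haploid genome is taken to be 3.1 Gb.
size_bp = estimate_array_size_bp(500, 1_000_000, 3_100_000_000)
size_mb = size_bp / 1e6
```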
[Histogram: number of individuals (0 to 200) versus Array Size (0.0 to 5.0 Mb) for D7Z1 and D7Z2]
Predicting HOR Repeat Variants
α-Centauri (centromeric automated repeat identification)
[Figure: between the 7p-arm and 7q-arm, a read 5’… …3’ carries a (6-mer) repeat of monomers A B C D E F and a (4-mer) variant; the HOR graph edges are weighted 1.0, 1.0, 1.0, 0.9, 0.9, 0.9 and 0.1]
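Edge weights of this kind can be sketched as the fraction of repeat units that traverse each edge. The example below is illustrative only: the split into units is assumed to be known already, the function name is hypothetical, and the toy variant (a unit that skips two monomers) was chosen so that the weights resemble those on the slide, whose actual variant cannot be recovered from the text.

```python
from collections import Counter


def edge_frequencies(units: list[str]) -> dict[tuple[str, str], float]:
    """Fraction of repeat units that traverse each monomer-to-monomer edge."""
    counts: Counter = Counter()
    for i, unit in enumerate(units):
        # Edges inside the unit, e.g. A->B, B->C, ...
        counts.update(zip(unit, unit[1:]))
        # Edge that closes the unit onto the next one (array treated as circular).
        next_unit = units[(i + 1) % len(units)]
        counts[(unit[-1], next_unit[0])] += 1
    return {edge: n / len(units) for edge, n in counts.items()}


# Toy array: nine 6-mer units and one 4-mer variant that skips C and D.
units = ["ABCDEF"] * 9 + ["ABEF"]
freqs = edge_frequencies(units)
```

With these toy units the edges A to B, E to F and F to A are used by every unit, B to C, C to D and D to E by nine in ten, and the skip edge B to E by one in ten.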
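Map Single Nucleotide Variants
[Figure: between the 7p-arm and 7q-arm, HOR graph of monomers A B C D E F plus a variant monomer B’ carrying single-nucleotide differences (-G-, -T-); edge weights 0.9, 1.0, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1; values 26 and 2565]
Account for SNVs (frequency and position) within the array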
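One way to picture a frequency estimate for a single-nucleotide variant is to count short reads that contain the reference or the alternate version of a k-mer spanning the site. This is a sketch under stated assumptions, not the method from the talk: the k-mers and reads are invented, reverse-complement matches are ignored, and the function name is hypothetical.

```python
from typing import Optional


def variant_frequency(reads: list[str], ref_kmer: str, alt_kmer: str) -> Optional[float]:
    """Fraction of k-mer observations that carry the alternate base.

    Uses str.count, which counts non-overlapping occurrences; that is
    adequate for k-mers that cannot overlap themselves.
    """
    ref = sum(read.count(ref_kmer) for read in reads)
    alt = sum(read.count(alt_kmer) for read in reads)
    total = ref + alt
    if total == 0:
        return None  # the site is not covered by any read
    return alt / total


# Toy data: the two k-mers differ at one position (C versus G).
ref_kmer = "GATTTCATGAC"
alt_kmer = "GATTTGATGAC"
reads = ["AA" + ref_kmer + "TT"] * 9 + ["AA" + alt_kmer + "TT"]

frequency = variant_frequency(reads, ref_kmer, alt_kmer)
```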
Incorporate Interspersed Repeats
[Figure: between the 7p-arm and 7q-arm, HOR graph of monomers A B C D E F and B’ with variants (-G-, -T-) and LINE insertions in the array]
L1/LINE L1Hs (2384 bp)
Detecting Array Inversions
[Figure: array between the 7p-arm and 7q-arm with variants (-G-, -T-) and an INVERSION]
Map shifts in orientation using long, error-corrected PacBio reads
228 bp alpha satellite partial monomer at rearrangement
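Mapping shifts in orientation along a read can be sketched as a scan for positions where consecutive monomers change strand. The example assumes each monomer in a long read has already been assigned a strand symbol; the function name and the toy read are hypothetical.

```python
def orientation_breakpoints(strands: str) -> list[int]:
    """Indices at which the monomer orientation differs from the previous monomer."""
    return [i for i in range(1, len(strands)) if strands[i] != strands[i - 1]]


# Toy read: four forward monomers, two inverted monomers, two forward monomers.
read_strands = "++++--++"
breakpoints = orientation_breakpoints(read_strands)
```

Two breakpoints on the same read, as in this toy case, would bracket a candidate inversion; a single breakpoint only shows that the read crosses one edge of it.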
Linking to chromosome arms and non-satellite DNA
[Figure: satellite array with an INVERSION, Non-Satellite DNA and GENES, between the p-arm and q-arm]
CEN3: 300 kb Segmental Duplication from 6p11.2
Gene: DNA Primase Polypeptide 2
[Figure: CEN X model between the p-arm and q-arm, with INVERSION, LINE, SINE and OTHER repeats and sequence variants (-A-, -T-)]
Construct a new graphical reference for each centromeric region to broaden research in these areas
Genome Informatics
Key Advantages of Satellite DNA Graphs
1. Eliminates sequence redundancy
Improves Unambiguous Short Read Mapping
[Figure: a short read (5’ REPEAT 3’) with an ambiguous placement (?) among REPEAT REPEAT REPEAT]
Centromere Graphs (Benedict Paten, Adam Novak)
Demonstrate unambiguous mapping of the majority (> 98%) of 1000 Genomes alpha satellite reads
2. Information describing long-range haplotypes is retained as defined “paths” in the graph
3. Graph data structure and sequence analysis tools will be consistent with the rest of the human genome
The major histocompatibility complex (Kiran Garimella & Gil McVean)
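The first two advantages can be illustrated with a very small data structure: identical repeat copies collapse into one node, and the original array survives as a path of node indices. This is a toy sketch with hypothetical names and invented sequences, not the graph format or tooling used for the centromere graphs in the talk.

```python
def build_graph(monomers: list[str]) -> tuple[list[str], list[int]]:
    """Collapse identical monomer sequences into nodes and record the path."""
    node_ids: dict[str, int] = {}
    nodes: list[str] = []
    path: list[int] = []
    for seq in monomers:
        if seq not in node_ids:
            node_ids[seq] = len(nodes)
            nodes.append(seq)
        path.append(node_ids[seq])
    return nodes, path


def spell(nodes: list[str], path: list[int]) -> str:
    """Reconstruct the linear sequence of a haplotype from its path."""
    return "".join(nodes[i] for i in path)


# Toy array of six "monomers", only three of which are distinct.
array = ["ACGT", "TTGA", "ACGT", "TTGA", "ACGT", "TTCA"]
nodes, path = build_graph(array)
haplotype = spell(nodes, path)
```

Here six monomers are stored as three nodes, and the path still spells the full array, so a second haplotype could share the same nodes with a different path.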
Part II: Variation Map
Expand the human “variation reference map” to include
centromeric DNAs
Study of Array Structural Variation
CENX: DXZ1 ~2 kb (12-mer)
[Figure: 12-mer repeat unit 1 2 3 4 5 6 7 8 9 10 11 12 in the array between the p-arm and q-arm, and the cenX Ref Graph with nodes 1 to 12]
Detection of Sequence Variants
Personal Genome Project trio: Ashkenazi Jewish ancestry
Zook, Justin M., et al. 2016
hg002 (son)
hg003 (father)
hg004 (mother)
45,43,53
[Figure: >98% canonical repeat; structural variant DEL ~0.3%]
REARRANGEMENTS SHARED BY TRIO
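Comparing variant calls across a trio can be sketched with set operations. The variant labels below are invented placeholders and the function names are hypothetical; the example only shows the bookkeeping, not how the rearrangements themselves were detected.

```python
def shared_by_trio(child: set[str], father: set[str], mother: set[str]) -> set[str]:
    """Variants observed in all three family members."""
    return child & father & mother


def absent_from_parents(child: set[str], father: set[str], mother: set[str]) -> set[str]:
    """Child variants seen in neither parent (candidate errors or new events)."""
    return child - (father | mother)


# Invented labels standing in for array rearrangements in each sample.
son = {"del_A", "dup_B", "inv_C"}
father = {"del_A", "dup_B"}
mother = {"del_A", "ins_D"}

shared = shared_by_trio(son, father, mother)
unexplained = absent_from_parents(son, father, mother)
```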
AJ Trio
Han Chinese
(HG00512)
Yoruba
(NG19340)
Puerto Rican
(HG00733)
Expand graph to include 4 reference populations
Collaboration: Ali Bashir and Matthew Pendleton; Icahn Institute
Inversion Polymorphism
NA24385
NA24149
Ashkenazi Jewish (AJ) Trio
Mobile element insertion
L1Hs/LINE
HuRef Genome, GM12878 Genome, CHM1 Genome, CHM13 Genome
CEN17 (D17Z1)
16-mer 99.6%, 14-mer 0.4%
16-mer 99.3%, 15-mer 0.5%, 17-mer 0.1%, 14-mer 0.1%
Copy Number Variation
Single Nucleotide Polymorphisms
…ATACGGATTTCATGACAGGTTA…
…ATACGGATTTGATGACAGGTTA…
[Figure: Allele 1 vs. Allele 2 shown for each variant class]
Illumina: Determine Frequency
Study of Array Size Variation
Miga et al (2014)
Individual A: 8.3 Mb
Individual B: 0.7 Mb
[Histogram: number of individuals (0 to 20) versus Array Size (0.5 to 9 Mb)]
Sequence Variation
Collection of 19 high coverage
genomes (~30-60X)
9 Populations, 3 Trios
Expand genome informatics to provide an
assessment of common satVARs in population
1000 Genome Data (1,092)
individuals from 26 distinct
populations
Identify a new source of human sequence variation
Satellite DNA
Variants
Associated
with Cancer
(Germline)
?
Catalogue of
all Common
Human
Satellite DNA
Variants
Novel Human Biomarkers:
Use of genomics to greatly improve CEN variant
detection
Increase population based sampling to improve
statistical tests
Does human sequence variation in
centromeric regions contribute to disease?
David Haussler
Benedict Paten
Jim Kent
(CGL, UCSC Browser,
Haussler Wet Lab)
Sofie Salama
Adam Novak
Maximilian Haeussler
Brian Raney
Ian Fiddes
Yulia Newton (Josh Stuart)
Jason Chin
Volkan Sevim
Creating (and mapping to) a
Universal Reference Genome
Benedict Paten, Adam Novak, David
Haussler, UC Santa Cruz
Acknowledgements
Alex Hastie
Denghong Zhang
Ali Bashir
Thomas Keane
Mark Akeson
Miten Jain
Hugh Olsen
