Rapid bacterial outbreak characterisation from whole genome sequencing
Torsten Seemann
Genome Science: Biology, Technology & Bioinformatics - Wed 3 Sep 2014 - Oxford, UK - #UKGS2014
About me
● Victorian Bioinformatics Consortium
o Monash University, Melbourne, Australia
● Microbial genomics
o bacterial pathogens; some parasites, viruses, fungi
● Tool development
o Prokka, Nesoni, VelvetOptimiser, Snippy, ...
Microbial Diagnostic Unit
● Oldest public health lab in Australia
o established 1897 in Melbourne
o large historical isolate collection back to 1950s
● National reference laboratory
o Salmonella, Listeria, EHEC
● WHO regional reference lab
o vaccine preventable invasive bacterial pathogens
New director
● Professor Ben Howden
o clinician, microbiologist, pathologist
o early adopter of genomics and bioinformatics
● Mandate
o modernise service delivery
o enhance research output and collaboration
o nationally lead the conversion to WGS
Outbreak scenario
● Receive samples (human, animal, enviro)
● Extract, culture, isolate
● Identification via phenotype, growth, media
● Typing: MLST, MLVA, PFGE, phage, sero, ...
● Screening: VITEK
● Report back to hospital, state government
Traditional typing
● Low resolution
o small subset of genome
▪ MLST ~7 core genes
▪ MLVA uses handful of VNTR regions
o requires constant curation of new genotypes
● Labour intensive
o time consuming
Whole Genome Sequencing
● Backward compatible
o can derive most traditional genotypes
● High resolution
o all variation, plasmids, AbR & virulence genes
● High throughput
o cheap, fast - one assay replaces many
Resistance to change
● Protecting empires
o “this is how we’ve always done it”, job redundancies
● Expense of instruments
o capital purchase, new staff, maintenance
● Lack of bioinformatics support
o infrastructure, software, training
● Legal requirements
o must do PFGE, validation, accreditation
A vision for Australia
● A common online system for all labs
o upload samples
o automated standard analysis pipelines
● Access control
o each lab controls their own data
o jurisdictions can share data in national outbreaks
● Deploy on our national research cloud
o no investment or expertise needed
o can deploy private version if desired
Suggested pipeline
● Input
o FASTQ files for each isolate
● Per isolate output
o de novo assembly & annotation
o typing (species dependent)
o antibiotic resistance & virulence genes
● Per outbreak output
o annotated phylogenomic tree
o SNP distances, clonality predictions
Design goals
● Speed
o multi-threaded wherever possible
● Modular
o Unix-style reusable components
● Deployable on cloud
o Amazon, Nectar (.au), CLIMB (.uk)
● Open source
o Auditable, community contribution
Progress
● Currently
o assessing existing components
o implementing new ones - all on GitHub
● No final product yet
o but some components are usable now
● Rolling out in 2015
o labs around Australia will opt in, most are keen
Identifying isolates
● De novo assembly approach
o assemble into contigs
o BLAST contigs against all microbial sequences
o best hits, highest coverage
● Assembly free method
o build index of all microbial k-mers w/ taxonomy
o scan k-mers from reads and tally
o Kraken, BioBloomTools, ...
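The assembly-free idea can be sketched in a few lines of Python. This is a toy illustration only, not how Kraken or BioBloomTools are implemented: the two reference sequences, the taxon names, k = 5 and all function names are made up for the example, and real tools use a much larger k with a taxonomy-aware index.

```python
from collections import Counter

def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_index(references, k):
    """Map each k-mer to the set of taxa whose reference contains it."""
    index = {}
    for taxon, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(taxon)
    return index

def classify(reads, index, k):
    """Tally k-mer hits per taxon; k-mers shared between taxa are skipped."""
    tally = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            taxa = index.get(kmer, ())
            if len(taxa) == 1:
                tally[next(iter(taxa))] += 1
    return tally

# Hypothetical toy references (a real index holds all microbial genomes).
REFERENCES = {
    "taxon_A": "ACGTACGTTAGCCGATAGCTAGGCTA",
    "taxon_B": "TTGACCGGTATTCCGGAATGCATGCA",
}
INDEX = build_index(REFERENCES, k=5)
READS = ["CGTTAGCCGATA", "TAGCTAGGCT", "GGTATTCC"]
TALLY = classify(READS, INDEX, k=5)
print(TALLY.most_common())
```

The taxon with the most unique k-mer hits is the likely identity; a mixed tally hints at contamination.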
Kraken report
%reads  clade_reads  taxon_reads  rank  taxid    name
  1.04         1046         1046  U     0        unclassified
 98.96        99624          142  -     1        root
 98.81        99473            1  -     131567   cellular organisms
 98.81        99472          194  D     2        Bacteria
 98.57        99233          111  P     1224     Proteobacteria
 98.45        99110          318  C     1236     Gammaproteobacteria
 98.07        98728            0  O     91347    Enterobacteriales
 98.07        98728        52477  F     543      Enterobacteriaceae
 44.95        45256          665  G     561      Escherichia
 44.20        44498        33391  S     562      Escherichia coli
  8.84         8899         8899  -     1274814  Escherichia coli APEC O78
  0.29          287            0  -     244319   Escherichia coli O26:H11
  0.29          287          287  -     573235   Escherichia coli O26:H11 str 11368
  0.21          216          216  -     316401   Escherichia coli ETEC H10407
  0.19          193            0  -     168807   Escherichia coli O127:H6
  0.19          193          193  -     574521   Escherichia coli O127:H6 str E2348/69
http://ccb.jhu.edu/software/kraken
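A report like this is easy to post-process. The sketch below assumes the usual tab-separated layout with six columns (percent, reads in clade, reads assigned directly, rank code, taxonomy ID, name) and embeds a few rows from the slide; the function names are illustrative, not part of Kraken.

```python
# A few rows from the report above, tab-separated as Kraken writes them.
REPORT = """\
  1.04\t1046\t1046\tU\t0\tunclassified
 98.81\t99472\t194\tD\t2\tBacteria
 98.07\t98728\t52477\tF\t543\tEnterobacteriaceae
 44.95\t45256\t665\tG\t561\tEscherichia
 44.20\t44498\t33391\tS\t562\tEscherichia coli
"""

def parse_report(text):
    """Turn each non-empty report line into a dict of typed fields."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        pct, clade, direct, rank, taxid, name = line.split("\t")
        rows.append({
            "percent": float(pct),
            "clade_reads": int(clade),
            "direct_reads": int(direct),
            "rank": rank.strip(),
            "taxid": int(taxid),
            "name": name.strip(),
        })
    return rows

def best_at_rank(rows, rank):
    """Return the row with the most clade reads at a rank code, or None."""
    candidates = [r for r in rows if r["rank"] == rank]
    return max(candidates, key=lambda r: r["clade_reads"]) if candidates else None

ROWS = parse_report(REPORT)
SPECIES = best_at_rank(ROWS, "S")
print(SPECIES["name"], SPECIES["percent"])
```

Here only about 44% of reads resolve to species level because 52477 reads were assigned directly to the family, so the family line is worth checking as well.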
Assembill
● Decent automated assemblies
o only 3 parameters: outdir + R1.fq.gz + R2.fq.gz
o supports multithreading at all steps
● Main steps
o adaptor removal & quality trimming (Skewer)
o selection of K from k-mer spectra (KmerGenie)
o de novo assembly (Velvet, SPAdes)
o ordering of contigs against reference (MUMmer)
Prokka
● Prokaryotic Annotation
o only 2 parameters: outdir + contigs.fa
o scales to about 32 threads
● Finds
o CDS, tRNA, tmRNA, rRNA, some ncRNA
o CRISPR, signal peptides
● Produces
o GenBank, GFF3, Sequin, FASTA, ...
mlst
● Multi-Locus Sequence Typing
o only 2 parameters: scheme + contigs.fa
● Can mass-screen hundreds of assemblies
o comes bundled with PubMLST database
● Output
o tab/comma separated values
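The principle behind sequence typing from contigs can be sketched as follows. This is a simplified illustration, not the mlst tool's implementation: it uses exact forward-strand string matching where a real tool would use BLAST-style search on both strands, and the three loci, allele sequences and profile table are all hypothetical (a real scheme has about seven genes).

```python
# Hypothetical allele database: locus -> {allele number: sequence}.
ALLELES = {
    "geneA": {1: "ATGGCA", 2: "ATGGTA"},
    "geneB": {1: "CCTGAA", 2: "CCTGGA"},
    "geneC": {1: "GGATCC", 2: "GGTTCC"},
}
# Hypothetical profile table: allele numbers (geneA, geneB, geneC) -> ST.
PROFILES = {(1, 1, 1): 1, (2, 1, 2): 2}

def call_alleles(contigs, alleles):
    """Return the allele number found for each locus, or None if absent."""
    calls = {}
    for locus, variants in alleles.items():
        calls[locus] = None
        for number, seq in variants.items():
            if any(seq in contig for contig in contigs):
                calls[locus] = number
                break
    return calls

def sequence_type(calls, profiles, loci):
    """Look up the ST for an allele profile; None means novel or incomplete."""
    key = tuple(calls[locus] for locus in loci)
    return profiles.get(key)

CONTIGS = ["TTATGGTAAACCTGAATT", "AAGGTTCCAA"]
CALLS = call_alleles(CONTIGS, ALLELES)
ST = sequence_type(CALLS, PROFILES, ["geneA", "geneB", "geneC"])
print(CALLS, ST)
```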
AbRicate
● Identify known antibiotic resistance genes
o only 1 parameter: contigs.fa
● Only as good as the underlying database
o bundled with ResFinder
o does not detect resistance conferred by SNPs (point mutations)
● Output
o tab/comma separated table
Wombac
● Quickly identify core genome SNPs
● Efficiently use all CPUs and RAM
● Re-use previous reference alignments
● Cheap to calculate new core subsets
Read alignment
Use BWA MEM
● Do not need to clip reads
● Deduces the fragment library attributes
● Marks multi-mapping reads properly
● Scales linearly to >100 cores
● Outputs SAM directly
Sorted BAM
● No intermediate files
o use Unix pipes
● Multiple CPUs with SAMtools 0.1.19+
o use the -@ command line parameter
bwa → samtools view → samtools sort → BAM
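One way to assemble that pipe from a script is sketched below. It only builds the command string; running it needs bwa and samtools installed and a BWA index for the reference. The file names are placeholders, and the `samtools sort -o` form assumes SAMtools 1.x syntax rather than the 0.1.19 output-prefix form.

```python
import shlex

def sorted_bam_pipeline(ref, r1, r2, out_bam, threads=8):
    """Build a shell pipeline: bwa mem | samtools view | samtools sort."""
    align = ["bwa", "mem", "-t", str(threads), ref, r1, r2]
    # -u writes uncompressed BAM, which is cheaper to pipe into sort.
    to_bam = ["samtools", "view", "-u", "-"]
    # Assumes SAMtools 1.x: -o names the output file, "-" reads stdin.
    sort = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    return " | ".join(shlex.join(cmd) for cmd in (align, to_bam, sort))

PIPELINE = sorted_bam_pipeline(
    "ref.fa", "R1.fq.gz", "R2.fq.gz", "isolate.bam", threads=4
)
print(PIPELINE)
# To execute: subprocess.run(PIPELINE, shell=True, check=True)
```

Because every stage reads from the previous one, no intermediate SAM file touches the disk.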
SNP calling
● FreeBayes
o set in haploid mode (-p 1)
o set regular parameters (mindepth, minfrac)
o call variants in all samples jointly (more power)
o single multi-isolate VCF output
freebayes -f ref.fa -p 1 *.bam > all.vcf
Parallel Freebayes
● FreeBayes is single threaded
o divide genome into regions
o run separate freebayes in parallel on each region
o merge the results
o scales nearly linearly!
fasta_generate_regions.py ref.fa.fai 100000 > regions.txt
freebayes-parallel regions.txt 32 -f ref.fa -p 1 *.bam > all.vcf
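The "divide genome into regions" step is simple enough to sketch directly. The function below is an illustration rather than the script shipped with FreeBayes; the contig names and lengths are invented, and the 0-based, end-exclusive coordinates are an assumption, so check the convention your variant caller expects.

```python
def generate_regions(contig_lengths, region_size):
    """Split each contig into consecutive windows of at most region_size bp.

    contig_lengths is a list of (name, length) pairs, e.g. read from a
    FASTA index. Coordinates are 0-based with an exclusive end (assumed).
    """
    regions = []
    for name, length in contig_lengths:
        for start in range(0, length, region_size):
            end = min(start + region_size, length)
            regions.append(f"{name}:{start}-{end}")
    return regions

# Hypothetical genome: one chromosome and one plasmid.
REGIONS = generate_regions([("chr", 250_000), ("plasmid", 40_000)], 100_000)
for region in REGIONS:
    print(region)
```

Each region can then be handed to its own worker, and the per-region results concatenated in order.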
Select core SNPs
● Core SNPs
o position present in every isolate
o more than one allele (not wholly conserved)
o usually ignore indels and other odd genotypes
● Recombination
o not all core SNPs are real
o many are the result of recombination
o should be filtered out, as they can distort tree topology
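The core SNP definition above translates into a short filter. This sketch assumes each isolate has already been reduced to a consensus sequence in reference coordinates, with `-` or `N` where no base could be called; the isolate names and sequences are invented, and recombination filtering is not attempted here.

```python
def core_snp_positions(alignment):
    """Return 0-based columns that are called in every isolate and variable."""
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    positions = []
    for i in range(length):
        column = {s[i].upper() for s in seqs}
        # Core: only unambiguous bases. SNP: more than one allele.
        if column <= set("ACGT") and len(column) > 1:
            positions.append(i)
    return positions

def core_snp_alignment(alignment):
    """Keep only the core SNP columns for each isolate."""
    cols = core_snp_positions(alignment)
    return {name: "".join(seq[i] for i in cols)
            for name, seq in alignment.items()}

# Hypothetical consensus sequences in reference coordinates.
ALIGNMENT = {
    "iso1": "ACGTACGTAC",
    "iso2": "ACGTTCGTAC",
    "iso3": "ACGAACG-AC",
    "iso4": "ACGTACGNAT",
}
POSITIONS = core_snp_positions(ALIGNMENT)
CORE = core_snp_alignment(ALIGNMENT)
print(POSITIONS, CORE)
```

Column 7 varies but is dropped because two isolates have no call there, which is exactly the "present in every isolate" rule.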
Wombac speed
● Example
o 130 E. coli isolates, MiSeq 300bp PE
o With 32 cores, used < 4GB RAM/core
o Took just over 1 hour
● Add a new sample
o Re-use existing alignments
o Will migrate to gVCF method that GATK will use
● Recalculate a core tree on subset
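Why a subset core is cheap can be shown with a toy example: once per-isolate consensus calls exist, choosing a different set of isolates only means re-filtering columns, with no re-alignment. The sketch below is not Wombac's code; the sequences and names are invented, and uncalled positions are marked `N` or `-`.

```python
from itertools import combinations

def core_columns(seqs):
    """Columns where every sequence has an unambiguous base."""
    return [i for i in range(len(seqs[0]))
            if all(s[i] in "ACGT" for s in seqs)]

def snp_distances(alignment, names):
    """Core size and pairwise SNP counts over the chosen isolates only."""
    seqs = [alignment[n] for n in names]
    cols = core_columns(seqs)
    dist = {}
    for (a, sa), (b, sb) in combinations(zip(names, seqs), 2):
        dist[(a, b)] = sum(sa[i] != sb[i] for i in cols)
    return len(cols), dist

# Hypothetical consensus calls; iso3 has poor coverage.
ALIGNMENT = {
    "iso1": "ACGTACGTAC",
    "iso2": "ACGTTCGTAC",
    "iso3": "ACNNNNG-AC",
    "iso4": "ACGTACGAAT",
}
ALL_CORE, ALL_DIST = snp_distances(ALIGNMENT, ["iso1", "iso2", "iso3", "iso4"])
SUB_CORE, SUB_DIST = snp_distances(ALIGNMENT, ["iso1", "iso2", "iso4"])
print(ALL_CORE, SUB_CORE, SUB_DIST)
```

Dropping the poorly covered isolate doubles the core in this example and reveals differences that the full-set core had hidden.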
Contact
Email torsten.seemann@gmail.com
Twitter @torstenseemann
Blog TheGenomeFactory.blogspot.com
Web bioinformatics.net.au

Rapid outbreak characterisation - UK Genome Sciences 2014 - wed 3 sep 2014
