João André Carriço,
Microbiology Institute and Instituto de Medicina Molecular,
Faculty of Medicine, University of Lisbon
jcarrico@fm.ul.pt twitter: @jacarrico
Whole genome sequencing for clinical microbiology:
Translation into routine applications
2 September 2017, Basel
A pipeline (in software engineering) consists of a chain of
processing elements arranged so that the output of each
element is the input of the next; the name is by analogy to
a physical pipeline
https://en.wikipedia.org/wiki/Pipeline_(software)
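The definition above can be sketched in a few lines of Python. This is only an illustration: the three step functions are hypothetical stand-ins for real processing modules, and `run_pipeline` is a name made up for this example.

```python
from functools import reduce
from typing import Callable, Iterable

Step = Callable[[str], str]


def run_pipeline(data: str, steps: Iterable[Step]) -> str:
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda value, step: step(value), steps, data)


# Hypothetical stand-ins for real processing modules.
def trim(seq: str) -> str:
    """Remove leading and trailing N characters."""
    return seq.strip("N")


def to_upper(seq: str) -> str:
    return seq.upper()


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


result = run_pipeline("NNaacgNN", [trim, to_upper, reverse_complement])
print(result)
```

Each step only needs to agree with its neighbours on the format it receives and returns, which is what makes the chain composable.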
Physical pipeline
Software Pipeline
Software /Algorithm module
Microbiological
Sample
The Ideal Scenario
Magic Box of
NGS Wonders for
Clinical Microbiology
Completely characterized strain:
• Species Identification
• Serotype
• Multilocus Sequence Type (MLST)
• cgMLST / wgMLST / SNPs
• Antibiotic resistance profile
• Virulence factors
• Other sequence-based typing method (SBTM) information, e.g.:
  ◦ spa (S. aureus)
  ◦ emm (Group A Streptococcus)
Actionable information for:
• Diagnostics
• Surveillance
• Outbreak detection
Magic Box of
NGS Wonders for
Clinical Microbiology
Pipelines of HTS analysis software
Software Pipelines: The Good, The Bad and The Ugly
• Comparability
  ◦ The same analysis workflow is applied to multiple samples
• Accountability
  ◦ Keeping track of which software (and version) did the analysis
• Modularity
  ◦ Adding new software to the pipeline without changing the existing one
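One way to picture the accountability point is a small provenance record written alongside the results. The sketch below is an assumption about how that could look, not something taken from any of the pipelines discussed here; `StepRecord`, `provenance_report`, and the tool names and version strings are all placeholders.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class StepRecord:
    """What ran, which version, and with which settings."""
    tool: str
    version: str
    parameters: dict = field(default_factory=dict)
    finished_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def provenance_report(sample_id: str, records: list) -> str:
    """Serialise the per-step records so they can be stored next to the results."""
    return json.dumps(
        {"sample": sample_id, "steps": [asdict(r) for r in records]}, indent=2
    )


# Tool names and version strings are placeholders, not real releases.
records = [
    StepRecord("assembler", "x.y.z", {"threads": 8}),
    StepRecord("annotator", "a.b.c"),
]
report = provenance_report("strain_001", records)
print(report)
```

Storing such a record per sample makes it possible to tell later whether two results were produced by the same software versions.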
• Bioinformatics workflow software:
  ◦ Nextflow: https://www.nextflow.io/
  ◦ Bionode Watermill: https://github.com/bionode/bionode-watermill
  ◦ Snakemake: https://snakemake.readthedocs.io/en/stable/
• Re-run as needed
  ◦ If a module fails to run, there is no need to re-run the whole analysis
• Compatible with High Performance Computing job schedulers (SLURM, etc.)
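The "re-run as needed" idea can be sketched with marker files: a step that already finished leaves a marker and is skipped on the next attempt. This is a simplified illustration, not how Nextflow, Bionode Watermill or Snakemake actually implement resuming; `run_step`, `run_all` and the simulated crash are invented for the example.

```python
import tempfile
from pathlib import Path


def run_step(name, func, workdir, log):
    """Run a step once; skip it if its completion marker already exists."""
    marker = workdir / f"{name}.done"
    if marker.exists():
        log.append(f"skip {name}")
        return
    func()
    marker.touch()  # only reached when func() did not raise
    log.append(f"run {name}")


def run_all(steps, workdir):
    """Run steps in order and stop at the first failure."""
    log = []
    for name, func in steps:
        try:
            run_step(name, func, workdir, log)
        except RuntimeError:
            log.append(f"fail {name}")
            break
    return log


attempts = {"assembly": 0}


def qc():
    pass


def assembly():
    attempts["assembly"] += 1
    if attempts["assembly"] == 1:
        raise RuntimeError("simulated crash on the first attempt")


steps = [("qc", qc), ("assembly", assembly)]
with tempfile.TemporaryDirectory() as tmp:
    first = run_all(steps, Path(tmp))
    second = run_all(steps, Path(tmp))
print(first, second)
```

On the second attempt only the failed step is executed again; the finished one is skipped.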
• Software validation
  ◦ Most software contains bugs that can affect the results. Pipelines can make the source of a problem harder to track down
• Reproducibility
  ◦ Running the same strain “should” yield the same results, but some software has stochastic steps
• Opacity
  ◦ Because the result depends on multiple pieces of software, it can be difficult to determine how the final results were obtained
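The reproducibility point can be illustrated with a toy stochastic step. In this sketch `subsample` stands in for any tool that uses random numbers (it is not a real pipeline module), and a checksum of the output is used to check whether two runs agree.

```python
import hashlib
import random


def subsample(reads, n, seed=None):
    """Toy stochastic step: pick n reads at random, optionally with a fixed seed."""
    rng = random.Random(seed)
    return rng.sample(reads, n)


def digest(items):
    """Checksum of a result, so two runs can be compared cheaply."""
    return hashlib.sha256("\n".join(items).encode()).hexdigest()


reads = [f"read_{i}" for i in range(1000)]

# With a fixed seed, repeated runs give identical output.
seeded_a = digest(subsample(reads, 100, seed=42))
seeded_b = digest(subsample(reads, 100, seed=42))

# Without a seed, two runs will almost certainly differ.
unseeded_a = digest(subsample(reads, 100))
unseeded_b = digest(subsample(reads, 100))

print(seeded_a == seeded_b, unseeded_a == unseeded_b)
```

Whether a given tool exposes a seed at all depends on the tool; where it does, recording the seed with the results is one way to make reruns comparable.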
• Database dependency
  ◦ Several bioinformatics tools depend on publicly available, curated databases. False positives and false negatives are difficult to assess.
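To make the false positive / false negative point concrete, the sketch below shows what the assessment would look like if a validated reference set were available, which is precisely what is often missing. The gene labels are invented and `confusion` is a helper written for this example.

```python
def confusion(predicted, truth, universe):
    """Count true/false positives and negatives for a set of database hits."""
    tp = len(predicted & truth)
    fp = len(predicted - truth)
    fn = len(truth - predicted)
    tn = len(universe - predicted - truth)
    return tp, fp, fn, tn


# Invented labels: genes in the database, genes truly present, genes reported.
universe = {"gene_a", "gene_b", "gene_c", "gene_d", "gene_e"}
truth = {"gene_a", "gene_b", "gene_c"}
predicted = {"gene_a", "gene_b", "gene_d"}

tp, fp, fn, tn = confusion(predicted, truth, universe)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(tp, fp, fn, tn, sensitivity, specificity)
```

A gene that is absent from the database can never be reported, so it only shows up as a false negative when an independent reference exists.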
Virulence Factor Databases
• VFDB (http://www.mgc.ac.cn/VFs/main.htm)
• Pathosystems Resource Integration Center (PATRIC) VF (https://www.patricbrc.org/)
• Victors (http://www.phidias.us/victors/)
• PHI-Base (http://www.phi-base.org/)
• MvirDB (http://mvirdb.llnl.gov/)
To know more:
- Presentation in the "Controversies in interpreting whole genome sequence data" session: http://eccmidlive.org/#resources/how-can-we-design-actionable-virulome-databases
• Comprehensive Antibiotic Resistance Database (CARD) (https://card.mcmaster.ca/)
• ResFinder 2.1 (https://cge.cbs.dtu.dk/services/ResFinder/); DB repository: https://bitbucket.org/genomicepidemiology/resfinder_db
• Repository of Antibiotic resistance Cassettes (RAC) (http://rac.aihi.mq.edu.au/rac/)
• Integrall: the integron database (http://integrall.bio.ua.pt/)
(…)
• Software dependencies
  ◦ If a piece of software is updated and its output changes, the pipeline breaks and needs to be revised
• Database / URL format changes
  ◦ When databases, or the URLs where data is stored in public repositories, change, several software modules can be affected (a.k.a. the NCBI effect)
• Setting up the pipeline
  ◦ Not as easy as it seems. The Bus effect.
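A defensive check at the boundary between two modules can at least make an upstream format change fail loudly instead of silently. The sketch below assumes a tab-delimited table with a header row; the column names and `parse_typing_table` are hypothetical and do not describe the output of any specific tool.

```python
import csv
import io

# Hypothetical columns that the downstream module relies on.
EXPECTED_COLUMNS = ["sample", "scheme", "st"]


def parse_typing_table(text):
    """Parse a tab-delimited table, refusing input whose header has changed."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [column for column in EXPECTED_COLUMNS if column not in header]
    if missing:
        raise ValueError(f"upstream output format changed; missing columns: {missing}")
    return list(reader)


old_format = "sample\tscheme\tst\ns1\tecoli\t131\n"
new_format = "sample\tscheme\tsequence_type\ns1\tecoli\t131\n"

rows = parse_typing_table(old_format)
try:
    parse_typing_table(new_format)
    error = ""
except ValueError as exc:
    error = str(exc)
print(rows, error)
```

The check does not prevent the breakage, but it points directly at the module whose output changed.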
The output of one piece of software is used as the input of another:
most bioinformatics software are pipelines!
INNUca → Assembly pipeline
Prokka → Genome annotation pipeline
Nullarbor → All-in-one pipeline
Web platforms → INNUENDO platform
https://www.cdc.gov/pulsenet/pathogens/wgs.html
Contamination
Mislabelling
E. coli
E. fergusonii
Mixture
Barcode bleaching
Wrong file assignment
INNUca: https://github.com/INNUENDOCON/INNUca
Dependencies:
• Bowtie2: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
• samtools: http://www.htslib.org/
• FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc
• SPAdes: http://cab.spbu.ru/software/spades/
• Pilon: https://github.com/broadinstitute/pilon
• MLST 2: https://github.com/tseemann/mlst
Features:
• Species confirmation
• Contamination detection
• Assembly correction
• Multiple allele detection -> multiple strains
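The "multiple allele detection -> multiple strains" feature can be pictured as follows: if a typing locus that should have a single allele in a pure culture shows more than one, the sample is flagged. This is a toy illustration of the idea only, not INNUca's implementation; the allele numbers are invented and `loci_with_multiple_alleles` is a hypothetical helper.

```python
def loci_with_multiple_alleles(calls):
    """Return the loci for which more than one distinct allele was detected."""
    return sorted(
        locus for locus, alleles in calls.items() if len(set(alleles)) > 1
    )


# Invented allele calls per locus for two samples.
pure_sample = {"adk": [10], "fumC": [11], "gyrB": [4]}
mixed_sample = {"adk": [10, 6], "fumC": [11], "gyrB": [4, 12]}

pure_flags = loci_with_multiple_alleles(pure_sample)
mixed_flags = loci_with_multiple_alleles(mixed_sample)
print(pure_flags, mixed_flags)
```

An empty list is consistent with a single strain; any flagged locus suggests a mixture or contamination and is a reason to inspect the sample before trusting downstream results.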
Output
20-40 mins per strain (60x-100x coverage; 8 CPUs)
High Performance Cluster:
6-7 nodes, 244 CPUs used: 3h57m for 124 E. coli ≈ 1.9 mins per strain
Benchmark
Contamination and
multi-strain detection
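The per-strain figure in the benchmark is the total wall-clock time divided by the number of strains. The short calculation below reproduces it and adds a back-of-envelope comparison with running the same 124 strains one after another at 20-40 minutes each; that comparison is an extrapolation made for this example, not a measurement from the slides.

```python
# Figures taken from the benchmark: 3h57m for 124 strains on the cluster.
wall_minutes = 3 * 60 + 57  # 237 minutes
strains = 124

per_strain = wall_minutes / strains  # amortised minutes per strain

# Extrapolation: the same strains processed one at a time at 20-40 minutes each.
serial_low = strains * 20
serial_high = strains * 40
speedup_low = serial_low / wall_minutes
speedup_high = serial_high / wall_minutes

print(round(per_strain, 1), round(speedup_low, 1), round(speedup_high, 1))
```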
• Genome annotation made easy, by Torsten Seemann (slides by Torsten)
• Genome annotation: adding biological information to the sequence by describing features
To know more:
http://www.slideshare.net/torstenseemann/prokka-rapid-bacterial-genome-annotation-abphm-2013
Available at: https://github.com/tseemann/prokka
• Complete pipeline from reads to reports, by Torsten Seemann
• Objective is to automate analysis for everyday use in public health labs / research settings
• Uses and distills the outputs of many pieces of software
• Available at: https://github.com/tseemann/nullarbor
Slide by Torsten Seemann
From: https://github.com/tseemann/nullarbor
Slides by Torsten Seemann
• Web platforms:
  ◦ Facilitate the use of pipelines by non-bioinformaticians (the old and boring Windows vs Linux software debate can end (?) …)
  ◦ Facilitate data sharing and comparison: creation of federated strain databases
A novel cross-sectorial platform for the
integration of genomics in surveillance of
foodborne pathogens
http://www.innuendoweb.org/
Target species:
Escherichia coli
Salmonella enterica
Yersinia enterocolitica
Campylobacter sp.
http://www.irida.ca/
INNUENDO Platform (architecture diagram)
Labelled components: sequence storage; LDAP; SLURM job scheduler; computation module; INNUca; ReMatCh; chewBBACA; PHYLOViZ Online; job processing application; web application; REST API (shown twice); client browser (Chrome); calculation server; metadata storage; frontend/DB server; NGSOnto
Slide credit:
Bruno Gonçalves
Target users: Reference laboratories. Small groups.
• Multi-user
• Create projects within a species for:
• Outbreaks
• Surveillance
Apply multiple pipelines to the same strains and queue them for processing using SLURM.
Can use a High Performance Computer if available
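Queuing pipelines through SLURM typically comes down to one `sbatch` submission per job. The sketch below only builds the command lines and does not submit anything; `sbatch_args` is a helper invented for this example, the wrapped shell commands are hypothetical scripts, and the job ID is a made-up value. It is not taken from the INNUENDO platform code.

```python
import shlex


def sbatch_args(job_name, command, cpus=8, after_ok=None):
    """Build an sbatch argument list; after_ok makes the job wait for another job."""
    args = [
        "sbatch",
        "--parsable",  # print only the job ID on submission
        f"--job-name={job_name}",
        f"--cpus-per-task={cpus}",
    ]
    if after_ok is not None:
        args.append(f"--dependency=afterok:{after_ok}")
    args.append(f"--wrap={command}")
    return args


# Hypothetical scripts: the second job should start only if the first succeeds.
assembly_job = sbatch_args("assembly_s1", "run_assembly.sh s1")
typing_job = sbatch_args("typing_s1", "run_typing.sh s1", cpus=4, after_ok="12345")

for job in (assembly_job, typing_job):
    print(shlex.join(job))
```

In a real setup the job ID passed to `after_ok` would come from the output of the first submission rather than being hard-coded.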
Aggregate selected strains from multiple projects into reports:
• Reports can be saved and exported
• Gene-by-gene analyses can be visualized directly in PHYLOViZ Online and the resulting trees saved and shared
• The N closest strains in the database can be added to the tree automatically
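Finding the N closest strains in a gene-by-gene setting can be sketched as counting allele differences between profiles and ranking the database by that count. The code below is a minimal illustration under that assumption; the profiles are invented, loci missing from either profile are simply ignored, and this is not claimed to be how the platform or PHYLOViZ Online computes it.

```python
def allelic_distance(a, b):
    """Number of loci, called in both profiles, at which the alleles differ."""
    shared = [
        locus for locus in a
        if locus in b and a[locus] is not None and b[locus] is not None
    ]
    return sum(1 for locus in shared if a[locus] != b[locus])


def n_closest(query, database, n):
    """Names of the n database profiles closest to the query (ties broken by name)."""
    ranked = sorted(
        database.items(),
        key=lambda item: (allelic_distance(query, item[1]), item[0]),
    )
    return [name for name, _ in ranked[:n]]


# Invented allelic profiles: locus -> allele number (None = not called).
query = {"l1": 1, "l2": 1, "l3": 1, "l4": 1}
database = {
    "A": {"l1": 1, "l2": 1, "l3": 1, "l4": 2},
    "B": {"l1": 1, "l2": 2, "l3": 2, "l4": 2},
    "C": {"l1": 1, "l2": 1, "l3": 1, "l4": 1},
    "D": {"l1": 1, "l2": 1, "l3": 2, "l4": None},
}

closest = n_closest(query, database, 2)
print(closest)
```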
Automatically adds the metadata filled in for the project, and several tree
analyses can be performed:
• NLVGraph
• Interactive distance matrix
• Dynamic exploration of wgMLST schemas
To know more: https://online.phyloviz.net/index
Input Output
Black box                    See-through box
Commercial/Freeware          Freeware
You get what it gives you    You can “tailor”
Ready to use                 “Major” headache
Stealth change               Visible change
Standalone                   Dependencies
Slide credit: Mario Ramirez
• Pipelines can provide actionable results for clinical microbiology from HTS data
• One must be aware of the limitations of each pipeline. Setting up a maintainable pipeline requires bioinformaticians.
• Most are Linux based, but web platforms can give non-bioinformaticians an easy-to-use route and are useful for stimulating data sharing.
• Pipelines benefit greatly from High Performance Computing clusters. Nevertheless, these need specialized personnel to install and maintain.
http://im.fm.ul.pt
INNUENDO project [GP/EFSA/AFSCO/2015/01/CT2]
BacGenTrack project [FCT / Scientific and Technological Research Council of Turkey, TUBITAK/0004/2014]
ONEIDA project (LISBOA-01-0145-FEDER-016417) co-funded by FEEI - “Fundos Europeus Estruturais e de
Investimento” from “Programa Operacional Regional Lisboa 2020” and by national funds from FCT -
“Fundação para a Ciência e Tecnologia”
Disclaimer
The conclusions, findings, and opinions expressed in this presentation reflect only the
view of the INNUENDO consortium members and not the official position of the
European Food Safety Authority nor of the Government of the Basque Country that are
not responsible for any use that may be made of the information they contain.
