Software Pipelines: The Good, The Bad and The Ugly

João André Carriço,
Microbiology Institute and Instituto de Medicina Molecular,
Faculty of Medicine, University of Lisbon
jcarrico@fm.ul.pt twitter: @jacarrico
Whole genome sequencing for clinical microbiology:
Translation into routine applications
2 September 2017, Basel

A pipeline (in software engineering) consists of a chain of
processing elements arranged so that the output of each
element is the input of the next; the name is by analogy to
a physical pipeline
https://en.wikipedia.org/wiki/Pipeline_(software)
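The definition above can be illustrated with a minimal Python sketch, assuming each processing element is a plain function. The three toy stages (strip_whitespace, to_upper, drop_ambiguous) are hypothetical stand-ins, not real analysis tools.

```python
from functools import reduce
from typing import Callable, Iterable

Stage = Callable[[str], str]


def run_pipeline(stages: Iterable[Stage], data: str) -> str:
    """Feed the output of each stage into the next one, in order."""
    return reduce(lambda value, stage: stage(value), stages, data)


# Hypothetical toy stages standing in for real processing elements.
def strip_whitespace(seq: str) -> str:
    return "".join(seq.split())


def to_upper(seq: str) -> str:
    return seq.upper()


def drop_ambiguous(seq: str) -> str:
    # Keep only unambiguous nucleotide characters.
    return "".join(base for base in seq if base in "ACGT")


result = run_pipeline([strip_whitespace, to_upper, drop_ambiguous], "ac gt\nnn ta")
print(result)
```

Changing the order of the list changes the chain, which is the sense in which each element only needs to agree with its neighbours on input and output.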

Physical pipeline
Software Pipeline
Software / Algorithm module

The Ideal Scenario
Microbiological Sample -> Magic Box of NGS Wonders for Clinical Microbiology ->
Completely characterized strain:
• Species Identification
• Serotype
• Multilocus Sequence Type (MLST)
• cgMLST / wgMLST / SNPs
• Antibiotic resistance profile
• Virulence factors
• Other SBTM information, e.g.:
  - spa (S. aureus)
  - emm (Group A Streptococcus)
Actionable information for:
• Diagnostics
• Surveillance
• Outbreak detection

Magic Box of NGS Wonders for Clinical Microbiology:
Pipelines of HTS analysis software

Software Pipelines: The Good, The Bad and The Ugly

• Comparability
  - The same analysis workflow is applied to multiple samples
• Accountability
  - Keeping track of which software (and version) performed the analysis
• Modularity
  - Adding new software to the pipeline without changing the existing modules
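A minimal Python sketch of these three properties, assuming a pipeline is an ordered list of steps that each carry a name and a version. The Step and Pipeline classes, the step names and the version strings are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    name: str       # hypothetical module name
    version: str    # version recorded for accountability
    run: Callable[[dict], dict]


@dataclass
class Pipeline:
    steps: list = field(default_factory=list)

    def add(self, step: Step) -> "Pipeline":
        # Modularity: a new step is appended; existing steps are untouched.
        self.steps.append(step)
        return self

    def process(self, sample: dict):
        # Comparability: every sample goes through the same ordered steps.
        provenance = []
        for step in self.steps:
            sample = step.run(sample)
            # Accountability: record which software and version ran.
            provenance.append(f"{step.name}=={step.version}")
        return sample, provenance


def count_reads(sample: dict) -> dict:
    return {**sample, "n_reads": len(sample["reads"])}


def mean_read_length(sample: dict) -> dict:
    lengths = [len(read) for read in sample["reads"]]
    return {**sample, "mean_length": sum(lengths) / len(lengths)}


pipeline = Pipeline().add(Step("read_counter", "1.0.0", count_reads))
pipeline.add(Step("length_stats", "0.2.1", mean_read_length))  # added later

result, provenance = pipeline.process({"id": "strain_A", "reads": ["ACGT", "ACGTAC"]})
print(result, provenance)
```

Storing the provenance list next to each result is one simple way to answer "which version produced this?" later on.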

• Bioinformatics workflow software:
  - Nextflow: https://www.nextflow.io/
  - Bionode Watermill: https://github.com/bionode/bionode-watermill
  - Snakemake: https://snakemake.readthedocs.io/en/stable/
• Re-run as needed: if a module fails, there is no need to re-run the whole analysis
• Compatible with High Performance Computing job schedulers (SLURM, etc.)
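The "re-run as needed" idea can be sketched in Python as simple file-based checkpointing: a step is skipped when its output file already exists. This is an illustration under that assumption, not how Nextflow, Snakemake or Bionode Watermill are implemented; run_step, run_all and the step names are hypothetical.

```python
import tempfile
from pathlib import Path


def run_step(name, func, input_text, workdir, log):
    """Run func unless its output file already exists in workdir."""
    out_file = workdir / f"{name}.out"
    if out_file.exists():
        log.append(f"skip {name}")
        return out_file.read_text()
    result = func(input_text)
    out_file.write_text(result)
    log.append(f"run {name}")
    return result


def run_all(steps, data, workdir, log):
    for name, func in steps:
        data = run_step(name, func, data, workdir, log)
    return data


def broken_assemble(text):
    raise RuntimeError("simulated crash in the second module")


log = []
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    # First attempt: the second module fails after the first has finished.
    try:
        run_all([("trim", str.strip), ("assemble", broken_assemble)], "  acgt  ", workdir, log)
    except RuntimeError:
        log.append("failed assemble")
    # Second attempt with a fixed module: the finished step is not repeated.
    final = run_all([("trim", str.strip), ("assemble", str.upper)], "  acgt  ", workdir, log)

print(log, final)
```

This sketch does not notice when the input or the step itself has changed, so a stale output would be reused; that check is left out to keep the example short.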

• Software validation
  - Most software contains bugs that can affect the results. Pipelines can make it harder to track down the problem
• Reproducibility
  - Running the same strain “should” yield the same results, but some software has stochastic steps
• Opacity
  - Because the results depend on multiple software tools, it can be difficult to determine how the final results were obtained
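On the reproducibility point, a small Python sketch of a stochastic step (random read subsampling) that becomes repeatable once its seed is fixed and recorded. The subsample_reads function and the read names are hypothetical; whether a given tool exposes a seed at all depends on that tool.

```python
import random


def subsample_reads(reads, n, seed=None):
    """Pick n reads at random; a fixed seed makes the choice repeatable."""
    rng = random.Random(seed)  # local generator, no global state touched
    return rng.sample(reads, n)


reads = [f"read_{i}" for i in range(100)]

first = subsample_reads(reads, 5, seed=42)
second = subsample_reads(reads, 5, seed=42)
print(first == second)
```

With seed=None the generator is seeded from the system, so two runs are free to differ, which is the behaviour the slide warns about.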

• Database dependency
  - Many bioinformatics tools depend on publicly available and curated databases. False positives / false negatives are difficult to assess.

Virulence Factor Databases
• VFDB (http://www.mgc.ac.cn/VFs/main.htm)
• Pathosystems Resource Integration Center (PATRIC) VF (https://www.patricbrc.org/)
• Victors (http://www.phidias.us/victors/)
• PHI-Base (http://www.phi-base.org/)
• MvirDB (http://mvirdb.llnl.gov/)
To know more:
- Presentation in the session "Controversies in interpreting whole genome sequence data": http://eccmidlive.org/#resources/how-can-we-design-actionable-virulome-databases

• Comprehensive Antibiotic Resistance Database (CARD) (https://card.mcmaster.ca/)
• ResFinder 2.1 (https://cge.cbs.dtu.dk/services/ResFinder/)
  (https://bitbucket.org/genomicepidemiology/resfinder_db) -> DB repository
• Repository of Antibiotic resistance Cassettes (RAC) (http://rac.aihi.mq.edu.au/rac/)
• INTEGRALL: The Integron Database (http://integrall.bio.ua.pt/)
(…)

• Software dependencies
  - If a piece of software is updated and its output changes, the pipeline breaks and needs to be revised
• Database / URL format changes
  - When databases, or the URLs where data are stored in public repositories, change, several software modules can be affected (a.k.a. the NCBI effect)
• Setting up the pipeline
  - Not as easy as it seems. The Bus effect.
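One hedged way to soften the "output changes, pipeline breaks" problem is to check a module's output format before handing it to the next module, so that a format change fails loudly at the boundary. The Python sketch below assumes a tab-separated output with a header row; the column names and parse_typing_output are hypothetical and do not describe any specific tool.

```python
import csv
import io

# Hypothetical header the downstream module was written against.
EXPECTED_COLUMNS = ["sample", "scheme", "st"]


def parse_typing_output(text):
    """Parse tab-separated output, refusing anything with an unexpected header."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(
            f"unexpected columns {reader.fieldnames}; expected {EXPECTED_COLUMNS}"
        )
    return list(reader)


old_format = "sample\tscheme\tst\nstrain_A\tecoli\t131\n"
new_format = "sample\tscheme\tsequence_type\nstrain_A\tecoli\t131\n"

rows = parse_typing_output(old_format)

try:
    parse_typing_output(new_format)
    format_change_detected = False
except ValueError:
    format_change_detected = True

print(rows, format_change_detected)
```

The check does not prevent the break, but it points at the module whose output changed instead of letting a wrong value travel further down the pipeline.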

The output of one piece of software is used as the input of another:
most bioinformatics software tools are pipelines!

INNUca -> Assembly pipeline
Prokka -> Genome annotation pipeline
Nullarbor -> All-in-one pipeline
Web platforms
• INNUENDO platform

https://www.cdc.gov/pulsenet/pathogens/wgs.html
[Figure labels: Contamination; Mislabelling; E. coli; E. fergusonii; Mixture; Barcode bleaching; Wrong file assignment]

INNUca: https://github.com/INNUENDOCON/INNUca
Dependencies:
• Bowtie2: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
• samtools: http://www.htslib.org/
• FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc
• SPAdes: http://cab.spbu.ru/software/spades/
• Pilon: https://github.com/broadinstitute/pilon
• MLST 2: https://github.com/tseemann/mlst
Features:
• Species confirmation
• Contamination detection
• Assembly correction
• Multiple allele detection -> multiple strains

Output
Benchmark:
• 20-40 mins per strain (60x-100x coverage; 8 CPUs)
• High Performance Cluster: 6-7 nodes, 244 CPUs used: 3h57m for 124 E. coli, ~1.9 mins per strain
Contamination and multi-strain detection
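The per-strain figure on the cluster follows directly from the numbers above. The Python sketch below reproduces it and adds a rough speed-up estimate; that estimate is an assumption, since it supposes the 20-40 min figure would apply to the same 124 strains processed one after another on a single 8-CPU machine.

```python
# Figures taken from the benchmark above.
cluster_minutes = 3 * 60 + 57      # 3h57m of wall-clock time on the cluster
n_strains = 124

minutes_per_strain = cluster_minutes / n_strains

# Assumption: the same strains run sequentially at 20-40 min each.
sequential_low = 20 * n_strains
sequential_high = 40 * n_strains
speedup_low = sequential_low / cluster_minutes
speedup_high = sequential_high / cluster_minutes

print(f"{minutes_per_strain:.1f} min per strain on the cluster")
print(f"estimated speed-up: {speedup_low:.1f}x to {speedup_high:.1f}x")
```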

• Prokka: genome annotation made easy, by Torsten Seemann (slides by Torsten)
• Genome annotation: adding biological information to the sequence by describing its features
To know more:
http://www.slideshare.net/torstenseemann/prokka-rapid-bacterial-genome-annotation-abphm-2013
Available at: https://github.com/tseemann/prokka

• Nullarbor: complete pipeline from reads to reports, by Torsten Seemann
• The objective is to automate analysis for everyday use in public health labs / research settings
• Uses and distills the outputs of many software tools
• Available at: https://github.com/tseemann/nullarbor

From: https://github.com/tseemann/nullarbor

• Web platforms:
  - Facilitate the use of pipelines by non-bioinformaticians (the old and boring Windows vs Linux software debate can end (?) …)
  - Facilitate data sharing and comparison: creation of federated strain databases

A novel cross-sectorial platform for the integration of genomics in surveillance of foodborne pathogens
http://www.innuendoweb.org/
Target species:
• Escherichia coli
• Salmonella enterica
• Yersinia enterocolitica
• Campylobacter sp.
http://www.irida.ca/

INNUENDO Platform
[Architecture diagram. Components shown: sequence storage; LDAP; SLURM job scheduler; computation module (INNUca, ReMatCh, chewBBACA, PHYLOViZ Online); job processing application; web application; REST APIs; client browser (Chrome); calculation server; metadata storage; frontend / DB server; NGSOnto]
Slide credit: Bruno Gonçalves
Target users: Reference laboratories. Small groups.

• Multi-user
• Create projects within a species for:
• Outbreaks
• Surveillance

Multiple pipelines can be applied to the same strains and queued for processing using SLURM.
A High Performance Computer can be used if available.
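Queuing chained pipeline steps on SLURM can be sketched in Python by building sbatch command lines, where a later job waits for an earlier one through --dependency=afterok. The script names, job names and job ID below are hypothetical, nothing is submitted, and this is not the INNUENDO platform's own code.

```python
def build_sbatch_command(job_name, command, cpus=8, after_ok=None):
    """Return an sbatch argument list; the caller decides whether to run it."""
    args = [
        "sbatch",
        "--parsable",                  # print the job ID in a machine-readable form
        f"--job-name={job_name}",
        f"--cpus-per-task={cpus}",
    ]
    if after_ok is not None:
        # Start only after the given job has finished successfully.
        args.append(f"--dependency=afterok:{after_ok}")
    args.append(f"--wrap={command}")
    return args


# Hypothetical two-step chain for one strain.
assembly_cmd = build_sbatch_command("assembly_strainA", "bash run_assembly.sh strainA")
typing_cmd = build_sbatch_command(
    "typing_strainA", "bash run_typing.sh strainA", cpus=4, after_ok="12345"
)

print(" ".join(assembly_cmd))
print(" ".join(typing_cmd))
```

In a real submission the job ID passed as after_ok would come from the output of the first sbatch call rather than being hard-coded.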

Aggregate selected strains from multiple projects into reports:
• Reports can be saved and exported
• Gene-by-gene analyses can be visualized directly in PHYLOViZ Online, and the resulting trees saved and shared. The N closest strains in the database can be added to the tree automatically
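Finding the N closest strains in a gene-by-gene setting can be sketched in Python by counting the loci at which two allelic profiles differ. This is an assumption about one reasonable distance, not the platform's documented method; the locus names, strain names and allele numbers are invented for illustration.

```python
import heapq


def allelic_distance(profile_a, profile_b):
    """Number of shared loci at which the two profiles carry different alleles."""
    shared = profile_a.keys() & profile_b.keys()
    return sum(1 for locus in shared if profile_a[locus] != profile_b[locus])


def n_closest(query, database, n):
    """Return the n (distance, strain name) pairs with the smallest distance."""
    scored = ((allelic_distance(query, profile), name) for name, profile in database.items())
    return heapq.nsmallest(n, scored)


# Hypothetical allelic profiles: locus name -> allele number.
query = {"locusA": 1, "locusB": 4, "locusC": 2, "locusD": 7}
database = {
    "strain_1": {"locusA": 1, "locusB": 4, "locusC": 2, "locusD": 7},
    "strain_2": {"locusA": 1, "locusB": 4, "locusC": 3, "locusD": 7},
    "strain_3": {"locusA": 2, "locusB": 5, "locusC": 3, "locusD": 8},
    "strain_4": {"locusA": 1, "locusB": 5, "locusC": 3, "locusD": 7},
}

closest = n_closest(query, database, 2)
print(closest)
```

Loci missing from either profile are ignored here; how missing data should be scored is a design choice this sketch does not settle.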

The metadata filled in for the project is added automatically, and several tree analyses can be performed:
• NLV graph
• Interactive distance matrix
• Dynamic exploration of wgMLST schemas
To know more: https://online.phyloviz.net/index

Input Output
Black box                    See-through box
Commercial/Freeware          Freeware
You get what it gives you    You can “tailor”
Ready to use                 “Major” headache
Stealth change               Visible change
Standalone                   Dependencies
Slide credit: Mario Ramirez

• Pipelines can provide actionable results for Clinical Microbiology out of HTS data
• One must be aware of the limitations of each pipeline. Setting up a maintainable pipeline requires bioinformaticians.
• Most pipelines are Linux based, but web platforms can give non-bioinformaticians an easy way to use them and are useful to stimulate data sharing.
• Pipelines greatly benefit from High Performance Computing clusters. Nevertheless, these need specialized personnel to install and maintain.

http://im.fm.ul.pt

INNUENDO project [GP/EFSA/AFSCO/2015/01/CT2]
BacGenTrack project [FCT / Scientific and Technological Research Council of Turkey, TUBITAK/0004/2014]
ONEIDA project (LISBOA-01-0145-FEDER-016417) co-funded by FEEI - “Fundos Europeus Estruturais e de
Investimento” from “Programa Operacional Regional Lisboa 2020” and by national funds from FCT -
“Fundação para a Ciência e Tecnologia”
Disclaimer
The conclusions, findings, and opinions expressed in this presentation reflect only the
view of the INNUENDO consortium members and not the official position of the
European Food Safety Authority nor of the Government of the Basque Country that are
not responsible for any use that may be made of the information they contain.
