ODIN “Big Bang” event, CERN, Thursday, 17 October 2013

www.slideshare.net/SusannaSansone

Data standards, sharing and publication
in the life sciences
Susanna-Assunta Sansone, PhD

Data Consultant,

Associate Director,

Honorary Academic Editor

Principal Investigator
Board of Directors
ODIN mission

Outline of my talk
Problem:

Identification of datasets is pivotal.
But meaningful sharing and (re)use
also depend on how well described
the datasets are.
Status quo:

In the life sciences there is a wealth
of ‘reporting standards’ set to
enhance and facilitate the
experimental descriptions.
Challenges:

Identify ‘reporting standards’ and
their organizations, track their use,
usability and impact (e.g. linking
them to datasets), credit their
developers, users (e.g. curators)...
My team’s activities and groups we work with
data management, biocuration and publication,
collaborative development of software, database, standards and ontology
• environmental genomics
• metabolomics
• metagenomics
• nanotechnology
• proteomics
• stem cell discovery
• system biology
• transcriptomics
• toxicogenomics
• environmental health

env

agro

tox/pharma

health
Photo: http://www.flickr.com/photos/notbrucelee/8016189356/ (CC BY)
Growing movement for reproducible research

Researchers and bioinformaticians in both academic and commercial arenas,
along with funding agencies and publishers, embrace the concept that, to be
comprehensible, interoperable and reusable, shared datasets should have
richly described:
• entities of interest, e.g., genes, metabolites, phenotypes,
  computational models, diseases ...
• experimental steps, e.g., provenance of study materials,
  technology and measurement types, experimentalists and curators ...
The necessity for well-annotated data
and unambiguous experimental
metadata was especially apparent
• during cross-study comparisons and
data analysis
• in preparation for reformatting the
datasets for submission to the
different EBI repositories, requiring
different levels of information

experimental design
sample characteristic(s)
experimental variable(s)
technology(s)
measurement(s)
protocol(s)
data file(s)

The International Conference on Systems Biology (ICSB), 22-28 August, 2008

Susanna-Assunta Sansone
www.ebi.ac.uk/net-project


One must strike a balance between
• depth and breadth of information; and
• sufficient information required to reuse the data

Make annotation explicit and discoverable

Capture all salient features of the experimental workflow

Structure the descriptions for consistency, tracking
A community mobilization to develop standards, e.g.:

Nanotechnology Working Group

de jure standard organizations

de facto grass-roots groups

Structural and operational differences
• organization types (open, closed to members, society, WG etc.)
• standards development (how to formulate, conduct and maintain)
• adoption, uptake, outreach (link to journals, funders and commercial sector)
• funds (sponsors, memberships, grants, volunteering)
Types of reporting standards

Nanotechnology Working Group

Including conceptual
model, conceptual
schema from which an
exchange format is
derived to allow data to
flow from one system to
another

Including controlled
vocabularies, taxonomies,
thesauri, ontologies etc. to
use the same word and
refer to the same ‘thing’

Including minimum
information reporting
requirements, or
checklists to report the
same core, essential
information
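
The three types of reporting standard listed above (minimum information checklists, terminologies, and exchange formats) can be pictured with a small Python sketch. Everything in it is an assumption made for illustration: the field names, the synonym table, and the choice of JSON as the exchange format are hypothetical and are not taken from any actual community standard.

```python
import json

# Hypothetical minimum-information checklist: fields a report must contain.
REQUIRED_FIELDS = [
    "experimental_design",
    "sample_characteristics",
    "technology",
    "measurement",
]

# Hypothetical controlled vocabulary: synonyms mapped to one preferred term,
# so that everyone uses the same word for the same 'thing'.
TECHNOLOGY_TERMS = {
    "ms": "mass spectrometry",
    "mass spec": "mass spectrometry",
    "mass spectrometry": "mass spectrometry",
    "nmr": "NMR spectroscopy",
    "nmr spectroscopy": "NMR spectroscopy",
}


def missing_fields(record):
    """Checklist step: return the required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


def normalize_technology(value):
    """Terminology step: map a free-text label to its preferred term.

    Returns None when the label is not in the vocabulary.
    """
    return TECHNOLOGY_TERMS.get(value.strip().lower())


def to_exchange_format(record):
    """Format step: serialize a complete, normalized record as JSON."""
    gaps = missing_fields(record)
    if gaps:
        raise ValueError(f"missing required fields: {gaps}")
    term = normalize_technology(record["technology"])
    if term is None:
        raise ValueError(f"unknown technology term: {record['technology']!r}")
    return json.dumps({**record, "technology": term}, sort_keys=True)
```

In this sketch a record that passes the checklist and uses a known term can flow from one system to another as a JSON string, while an incomplete or ambiguous record is rejected before it is shared.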
Fragmentation, duplications and gaps

[Figure: biologically-delineated views of the world (epidemiology, plant biology, microbiology) set against technologically-delineated views of the world (transcriptomics, proteomics, metabolomics; technologies shown include arrays, scanning, gels, columns, MS, NMR and FTIR). Generic features (‘common core’): description of source biomaterial; experimental design components.]

To compare and integrate data we need interoperable standards
Growing number of reporting standards

To track provenance of the information and ensure richness of data and
experimental metadata descriptions, to maximize reusability

Databases, annotation, curation tools

[Figure: cloud of standards acronyms with the counts "+ 303", "+ 150" and "+ 130"; labels shown: "Source: MIBBI, EQUATOR", "Source: BioPortal", "Estimated".]

Acronyms shown: MAGE-Tab, GCDML, AAO, SOFT, GELML, MITAB, ISA-Tab, OBI, FASTA, VO, PATO, DICOM, ENVO, XAO, DO, MIAPA, MIRIAM, MIQAS, MIX, MIGEN, MOD, SBRML, MzML, SEDML…, MIAME, CHEBI, SRAxml, CML, TEDDY, PRO, BTO, IDO…, MIAPE, CIMR, MIASE, REMARK, MIQE, CONSORT, MISFISHIE…
But how much do we know about these standards?
• A coherent, curated and searchable registry of standards for describing
and reporting experiments in life science, environmental, biomedical and
biotechnological domains
• Progressively associate standards to data policies and databases
• Develop assessment criteria for usability and popularity of standards
• Help stakeholders to make informed decisions on e.g. what standards or
databases to use or recommend; identify efforts they have funded
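
The registry goals above can be sketched as a small data structure. This is only an illustrative sketch under stated assumptions: the record fields, the `search` and `adoption_count` helpers, and the placeholder entries ("Checklist-A", "Database-X" and so on) are hypothetical and do not describe how any actual registry is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class StandardRecord:
    """One hypothetical registry entry for a reporting standard."""
    name: str
    kind: str        # "checklist", "terminology" or "format"
    domains: list    # domains the standard covers
    databases: list = field(default_factory=list)  # databases linked to it
    policies: list = field(default_factory=list)   # data policies linked to it


def search(registry, text):
    """Case-insensitive substring search over standard names and domains."""
    needle = text.lower()
    return [
        record for record in registry
        if needle in record.name.lower()
        or any(needle in domain.lower() for domain in record.domains)
    ]


def adoption_count(record):
    """Crude popularity indicator: number of linked databases and policies."""
    return len(record.databases) + len(record.policies)


# Placeholder entries; names are invented for the example.
registry = [
    StandardRecord(
        "Checklist-A", "checklist", ["transcriptomics"],
        databases=["Database-X"], policies=["Journal-Policy-1"],
    ),
    StandardRecord("Format-B", "format", ["metabolomics", "proteomics"]),
]
```

Counting links is only one possible assessment criterion; it is used here because it follows directly from associating standards with policies and databases.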
Will the ISNI-based ORCID affiliation module
cover standards organizations too?
User profiles populated from ORCID...

... credit for creating, contributing to, maintaining standards

Ownership of open standards can be problematic
in broad, grass-roots collaborations

It requires improved models to encourage
maintenance of and contributions to these
efforts; rewards and incentives need to be
identified for all contributors to support the
continued development of standards
... link to data records associated to publications

...and associated article-level metrics

We need “standards impact metrics” to evaluate use/usability

Working with data publication platforms:
“Invisible” use of standards in data reporting tools

One of the winners.
Project: integration of ORCID with
the ISAcreator, the editor tool,
helping curators and researchers to
describe experiments following
community standards.

ODIN mission

Summarizing my talk
Problem:

Identification of datasets is pivotal.
But meaningful sharing and (re)use
also depend on how well described
the datasets are.
Status quo:

In the life sciences there is a wealth
of ‘reporting standards’ set to
enhance and facilitate the
experimental descriptions.
Challenges addressed by

Identify ‘reporting standards’ and
their organizations, track their use,
usability and impact (e.g. linking
them to datasets), credit their
developers, users (e.g. curators)...
Acknowledgements
Philippe Rocca-Serra
Alejandra Gonzalez-Beltran
Eamonn Maguire
Collaborators:
OBO Foundry
COSMOS
GSC
Metabolomics Society
Data Dryad
Pistoia Alliance
Elixir UK
NPG’s Scientific Data
and many more….
