How to share useful data
Peter McQuilton
Biosharing.org
@drosophilic
Outline
• Data sharing
• Reusability and reproducibility
• How the lack of these affects scientific accountability and progress
• Experimental context
• What to report – what level of granularity
• How to report it – what format, structure
• Content standards
• How to find them
• Complying with repositories, funders and publishers
Research data life cycle
Image credit to:
Credit to: https://projects.ac/blog/five-top-reasons-to-protect-your-data-and-practise-safe-science/ 2014
Better data = better science
A community mobilization for “openness”
image by Greg Emmerich
http://discovery.urlibraries.org/ https://okfn.org
Open data
is a means to do
better science
more efficiently
http://pantonprinciples.org
https://creativecommons.org
Growing movement for FAIR data and research
outputs
But in all fairness, not much data is FAIR!
“Reproducing the method took several months of effort, and
required using new versions and new software that posed
challenges to reconstructing and validating the results”
Unfairness in both experimental and computational
areas
• Not always well cited or stored
o Software, code and workflows are hard(er) to get hold of
• Poorly described for third party reuse
o Different levels of detail and annotation
• Curation activities are perceived as time consuming
o Collection and harmonization of detailed methods and
experimental steps is rushed at the publication stage
Not very FAIR: low findability and
understandability
• Effectively document your data so that it can be understood
in the future
• Periodically move data to new storage media (drives
degrade over time)
• Keep more than one copy of data (local and cloud)
• Migrate data to new software versions
• Use a well documented and supported format
Ideally this should be covered in a data management plan at
the start of a project, so that you can factor any associated
time and resources into your budget.
What can I do to ensure my data are
shareable/usable in the future?
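One way to check that a second copy, or data moved to new storage media, is still intact is to compare checksums. The following is a minimal sketch, assuming the files are readable from local paths; the function names `sha256_of` and `copies_match` are illustrative and not part of any tool mentioned in this talk.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large data files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original: Path, copy: Path) -> bool:
    """True when both files have byte-for-byte identical content."""
    return sha256_of(original) == sha256_of(copy)
```

Recording the digest next to the data (for example in the data management plan) lets a later check detect silent corruption after a migration.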
Outline
• Data sharing
• Reusability and reproducibility
• How the lack of these affects scientific accountability and progress
• Experimental context - standards
• What to report – what level of granularity
• How to report it – what format, structure
• Content standards
• How to find them
• Complying with repositories, funders and publishers
Do you know what this is?
LS1_C2_LD_TP2_P1 file1-fastq.gz
…how NOT to report the experimental
information!
Sample name (?!) Data file
LS1_C2_LD_TP2_P1 file1-fastq.gz
We need to clearly describe the information
• LS1: liver sample 1
• C2: compound 2
• LD: low dose
• TP2: time point 2
• P1: protocol 1
• file1-fastq.gz: compressed data file for sequence information corresponding to this sample
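To make the point concrete, here is a minimal sketch of what a reader needs before the sample name becomes usable: a code book that maps each abbreviation to its meaning. The abbreviations are the ones decoded above; the dictionary names and the function `decode_sample_name` are hypothetical.

```python
import re

# Hypothetical code book: without it, the sample name cannot be interpreted.
PREFIXES = {
    "LS": "liver sample",
    "C": "compound",
    "TP": "time point",
    "P": "protocol",
}
CODES = {"LD": ("dose", "low")}


def decode_sample_name(name: str) -> dict:
    """Split an underscore-delimited sample name into labelled fields."""
    record = {}
    for token in name.split("_"):
        if token in CODES:
            key, value = CODES[token]
            record[key] = value
            continue
        # Remaining tokens are expected to be letters followed by a number.
        match = re.fullmatch(r"([A-Z]+)(\d+)", token)
        if match is None or match.group(1) not in PREFIXES:
            raise ValueError(f"Unrecognised token: {token!r}")
        record[PREFIXES[match.group(1)]] = int(match.group(2))
    return record


print(decode_sample_name("LS1_C2_LD_TP2_P1"))
```

Even decoded, the record still does not say which compound, which dose or which protocol was used, which is why the description should be reported explicitly rather than packed into a name.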
Without context data is meaningless
• We need to report sufficient
information to reuse the dataset
• We must strike a balance between
depth and breadth of information
Information intensive experiments
• Not too much
• Not too little
• …just right
Seven week old C57BL/6N mice were treated
with low-fat diet.
Liver was dissected out, hepatocytes prepared…
From natural language to ‘computable’ concepts
Age value: seven
Unit: week
Strain name: C57BL/6N
Subject of the experiment: mice
Type of diet and experimental condition: low-fat diet
Anatomy part: liver
Type of protocol – cell preparation
Type of protocol – sample treatment
Type of protocol – liver preparation
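The step from the natural-language sentence to 'computable' concepts can be sketched as a structured record. This is an illustration only: the class name `Sample` and its field names are assumptions, not a community standard, and in practice each value would normally also carry a term from a suitable ontology.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Sample:
    """One sample, with each concept from the sentence in its own field."""
    subject: str
    strain: str
    age_value: int
    age_unit: str
    diet: str
    anatomy_part: str
    protocols: List[str] = field(default_factory=list)


# Values taken from the example sentence on the slide.
sample = Sample(
    subject="mouse",
    strain="C57BL/6N",
    age_value=7,
    age_unit="week",
    diet="low-fat diet",
    anatomy_part="liver",
    protocols=["sample treatment", "liver preparation", "cell preparation"],
)

# A machine-readable form that software can filter and compare.
print(json.dumps(asdict(sample), indent=2))
```

Once age is stored as a number with a unit, a query such as "all samples from mice younger than ten weeks" becomes possible without reading free text.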
How do you know what to report, or how to
structure it?
• Data/content standards:
o Structure, enrich and report the description of the
datasets and the experimental context under which they
were produced
o Facilitate the discovery, sharing, understanding and
reuse of datasets
Outline
• Data sharing
• Reusability and reproducibility
• How the lack of these affects scientific accountability and progress
• Experimental context
• What to report – what level of granularity
• How to report it – what format, structure
• Content standards
• How to find them
• Complying with repositories, funders and publishers
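Examples of content standards (counts shown on the slide: 193, 85, 346)
• Minimum information reporting requirements, or checklists: MIAME, MIAPA, MIRIAM, MIQAS, MIX, MIGEN, ARRIVE, MIAPE, MIASE, MIQE, MISFISHIE, REMARK, CONSORT…
• Formats: MAGE-Tab, GCDML, SRAxml, SOFT, FASTA, DICOM, MzML, SBRML, SEDML, GELML, ISA-Tab, CML, MITAB…
• Controlled vocabularies and ontologies: AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO, TEDDY, PRO, XAO, DO, VO…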
There are over 600 content standards in the life sciences
[Figure labels: de jure and de facto standards; grass-roots groups; standards organizations; Nanotechnology Working Group]
Community mobilisation to develop content
standards
Databases have their own standards, e.g. at EBI:
Enablers: to better describe, share and query data
• Minimum information reporting requirements, or checklists
o Report the same core, essential information
• Controlled vocabularies, taxonomies, thesauri, ontologies etc.
o Use the same word and refer to the same ‘thing’
• Conceptual model, conceptual schema, or exchange formats
o Allow data to flow from one system to another
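The three kinds of enabler can be illustrated together in a few lines. This is a sketch under stated assumptions: the required fields, the vocabulary and all function names are hypothetical stand-ins for a real checklist, a real ontology and a real exchange format.

```python
import json

# Hypothetical minimum-information checklist: fields every record must carry.
REQUIRED_FIELDS = ("organism", "organism_part", "assay_type")

# Hypothetical controlled vocabulary: preferred term -> accepted synonyms.
ORGANISM_PART_TERMS = {
    "liver": {"liver", "hepatic tissue"},
    "kidney": {"kidney", "renal tissue"},
}


def missing_fields(record: dict) -> list:
    """Checklist: report which core fields are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


def normalise_term(value: str, vocabulary: dict) -> str:
    """Vocabulary: map a synonym onto the single preferred term."""
    lowered = value.strip().lower()
    for preferred, synonyms in vocabulary.items():
        if lowered in synonyms:
            return preferred
    raise ValueError(f"{value!r} is not in the vocabulary")


def export(record: dict) -> str:
    """Exchange format: serialise so that another system can read it."""
    return json.dumps(record, sort_keys=True)


record = {"organism": "mouse", "organism_part": "Hepatic tissue", "assay_type": "sequencing"}
record["organism_part"] = normalise_term(record["organism_part"], ORGANISM_PART_TERMS)
print(missing_fields(record), export(record))
```

Each function addresses a different question: is the core information there, does it use the agreed word, and can it move between systems.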
A web-based, curated and searchable registry ensuring that biological
standards and databases are registered, informative and discoverable; also
monitoring the development and evolution of standards, their use in databases
and the adoption of both in data policies.
Researchers, developers and curators lack support and guidance on how best to navigate and select
content standards, understand their maturity, or find databases that implement them.
Funders, journals and librarians do not have enough information to make informed decisions on which
content standards or databases to recommend in policies, fund or implement.
Our mission: To help people make the right choice
Three interlinked registries
Work out which format your data should be in for
submission to a particular database
STANDARD DATABASE
Standards and databases (and policies) cross-linked
From simple and advanced searches
The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta
Sansone www.ebi.ac.uk/net-project
Search and filter to find what is relevant to your type of data
Tracking evolution, e.g. deprecations and
substitutions
Create your own Collection
User profiles populated from ORCID...
... credit for creating, contributing to, maintaining standards, databases and
policies
Ownership of open standards can be problematic in
broad, grass-roots collaborations
Improved models are required to encourage
maintenance of, and contributions to, these
efforts. Rewards and incentives need to be
identified for all contributors supporting the
continued development of standards
What you can do with BioSharing…
“Which standard should I use for this data, considering I’d
like to publish in journal X?”
“Are we using the most up-to-date version of this standard?”
“My data is in X format, which databases take that format?”
How can you use community standards?
ISA model and related formats
These tools and formats will help you to:
ISA powers data collection, curation resources and repositories, e.g.:
ISA model and related formats
1. Create template(s) to fit the type of
experiments to be described
Create templates detailing the steps to
be reported for different investigations,
complying with community standards,
e.g. by configuring the value(s) allowed for
each field to be
• text (with/without regular expressions),
• ontology terms,
• numbers etc.
We have ‘ready to use’ configurations
compliant with community standards
and can create more according to
user needs
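The idea of configuring the values allowed for each field can be sketched as a small validator. This is not the ISA tools configuration format: the `TEMPLATE` layout, the field names and the `validate` function are assumptions made for illustration, and the set of allowed terms stands in for a real ontology lookup.

```python
import re

# Stand-in for an ontology lookup (hypothetical list of accepted terms).
ALLOWED_ORGANISM_PARTS = {"liver", "kidney", "heart"}

# Hypothetical template: one rule per field, covering the three value kinds.
TEMPLATE = {
    "sample_name": {"type": "text", "pattern": r"[A-Za-z0-9_]+"},
    "organism_part": {"type": "term", "allowed": ALLOWED_ORGANISM_PARTS},
    "age": {"type": "number"},
}


def validate(row: dict, template: dict = TEMPLATE) -> list:
    """Return a list of problems; an empty list means the row conforms."""
    errors = []
    for name, rule in template.items():
        value = row.get(name)
        if value is None:
            errors.append(f"{name}: missing")
        elif rule["type"] == "text":
            pattern = rule.get("pattern")
            if not isinstance(value, str) or (pattern and not re.fullmatch(pattern, value)):
                errors.append(f"{name}: does not match the expected pattern")
        elif rule["type"] == "term":
            if value not in rule["allowed"]:
                errors.append(f"{name}: not an allowed term")
        elif rule["type"] == "number":
            try:
                float(value)
            except (TypeError, ValueError):
                errors.append(f"{name}: not a number")
    return errors


print(validate({"sample_name": "LS1_C2", "organism_part": "liver", "age": "7"}))
```

Checking rows at the point of data entry means problems surface early, rather than at the publication stage.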
• The ISA model records the data’s provenance, how it was generated and
where it is located.
• Published Data Descriptors are indexed in all major bibliographic indexing
services (incl. PubMed)
• However, accompanying every Data Descriptor article there are metadata files,
specifically created to aid discovery and understanding of the data itself.
• Using the ISA (Investigation, Study, Assay) model, these metadata files
provide a machine readable overview of the study that generated the data.
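The nesting implied by the model's name (an investigation contains studies, which contain assays) can be sketched with plain data classes. This is not the ISA-Tab or ISA-JSON schema, nor the API of any ISA software: the class fields and the example titles and values below are assumptions chosen for illustration, with only the file name taken from the earlier slide.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Assay:
    """One measurement run and the data files it produced."""
    measurement_type: str
    technology_type: str
    data_files: List[str] = field(default_factory=list)


@dataclass
class Study:
    """A unit of research with its own subjects and design."""
    title: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    """The overall project context that groups related studies."""
    title: str
    studies: List[Study] = field(default_factory=list)


# Illustrative values only.
investigation = Investigation(
    title="Example investigation",
    studies=[
        Study(
            title="Example study",
            assays=[Assay("transcription profiling", "nucleotide sequencing", ["file1-fastq.gz"])],
        )
    ],
)

# asdict() recurses through the nested classes, giving a machine-readable overview.
print(json.dumps(asdict(investigation), indent=2))
```

Keeping the data file location inside the assay record is what ties each file back to how it was generated.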
• Filter datasets by
data repository or
metadata
• Boolean searches
• Future enhancements:
- Statistics
- Richer queries based
on semantics of the data
ISA-explorer: A demo tool for discovering and exploring Scientific
Data’s ISA-tab metadata
Visualise the data
associated with
a paper
http://tinyurl.com/isaexplorer
• Reusability and reproducibility
o Are pivotal to driving science and discovery
o Do your best to make your digital research outputs FAIR
• Experimental context
o Report the experimental context of your findings
o Do to your data what you wish that others would do to theirs
• Content standards
o Continuously evolving
o Make use of tools implementing standards, such as ISAtools
o Use biosharing.org to explore repositories, standards and policies
Summary
Acknowledgements
Find the right database for your data, and which data standard to
use – https://www.biosharing.org
Checking your data conforms to a standard, or making your own
templates – http://www.isa-tools.org
Where to keep research data: DCC checklist for evaluating data
repositories (DCC) - http://tinyurl.com/DCCResearchData
How and why you should manage your research data (JISC) -
http://tinyurl.com/JISCDMP
Useful links