How to share useful data

Peter McQuilton
Biosharing.org
@drosophilic

Outline
• Data sharing
• Reusability and reproducibility
• How the lack of these affects scientific accountability and progress
• Experimental context
• What to report – what level of granularity
• How to report it – what format, structure
• Content standards
• How to find them
• Complying with repositories, funders and publishers

Research data life cycle
Image credit: https://projects.ac/blog/five-top-reasons-to-protect-your-data-and-practise-safe-science/ (2014)
Better data = better science

A community mobilization for “openness”
image by Greg Emmerich
http://discovery.urlibraries.org/ https://okfn.org
Open data is a means to do better science more efficiently
http://pantonprinciples.org
https://creativecommons.org

Growing movement for FAIR data and research
outputs

But in all fairness, not much data is FAIR!

“Reproducing the method took several months of effort, and
required using new versions and new software that posed
challenges to reconstructing and validating the results”
Unfairness in both experimental and computational areas

• Not always well cited or stored
o Software, code and workflows are hard(er) to get hold of
• Poorly described for third party reuse
o Different levels of detail and annotation
• Curation activities are perceived as time consuming
o Collection and harmonization of detailed methods and
experimental steps is rushed at the publication stage
Not very FAIR: low findability and
understandability

• Effectively document your data so that it can be understood
in the future
• Periodically move data to new storage media (drives
degrade over time)
• Keep more than one copy of data (local and cloud)
• Migrate data to new software versions
• Use a well documented and supported format
Ideally this should be covered in a data management plan at
the start of a project, so that you can factor any associated
time and resources into your budget.
What can I do to ensure my data are
shareable/usable in the future?
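
Keeping more than one copy, or moving data to new media, only helps if the copies stay identical to the original. A minimal sketch of one way to check this, assuming SHA-256 checksums are an acceptable comparison; the function names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original, copy):
    """True if both files have identical content (same checksum)."""
    return sha256_of(original) == sha256_of(copy)
```

Storing the checksum next to the data also lets a future reader confirm that a migrated file is still the one that was described.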

Outline
• Data sharing
• How the lack of these affects scientific accountability and progress
• Experimental context - standards
• What to report – what level of granularity
• How to report it – what format, structure
• How to find them
• Complying with repositories, funders and publishers

Do you know what this is?
LS1_C2_LD_TP2_P1 file1-fastq.gz

…how NOT to report the experimental
information!

Sample name (?!) Data file

We need to clearly describe the information
• LS1: liver sample 1
• C2: compound 2
• LD: low dose
• TP2: time point 2
• P1: protocol 1
• file1-fastq.gz: compressed data file for sequence information
corresponding to this sample
Sample name (?!) Data file
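
The hidden meaning of the sample name can be written out as labelled fields that a program can read. A minimal sketch, assuming the underscore-separated naming scheme and the code meanings from this example; the function name, dictionaries and field names are hypothetical.

```python
import re

# Prefix meanings taken from the example sample name; the dictionaries and
# field names are illustrative, not a community standard.
PREFIXES = {
    "LS": ("sample", "liver sample"),
    "C": ("compound", "compound"),
    "TP": ("time_point", "time point"),
    "P": ("protocol", "protocol"),
}
DOSES = {"LD": "low dose"}


def describe_sample(name):
    """Expand a name like 'LS1_C2_LD_TP2_P1' into labelled fields."""
    record = {}
    for token in name.split("_"):
        if token in DOSES:
            record["dose"] = DOSES[token]
            continue
        match = re.fullmatch(r"([A-Z]+)(\d+)", token)
        if match is None or match.group(1) not in PREFIXES:
            raise ValueError(f"Unrecognised token: {token!r}")
        field, label = PREFIXES[match.group(1)]
        record[field] = f"{label} {match.group(2)}"
    return record


print(describe_sample("LS1_C2_LD_TP2_P1"))
```

Reporting the expanded record alongside the data file means nobody has to guess what each abbreviation stood for.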

Without context data is meaningless

• We need to report sufficient
information to reuse the dataset
• We must strike a balance between
depth and breadth of information
Information intensive experiments

Information intensive experiments
• Not too much
• Not too little
• ….just right

Seven week old C57BL/6N mice were treated
with low-fat diet.
Liver was dissected out, hepatocytes prepared…
From natural language to ‘computable’ concepts

• Age value: seven
• Unit: week
• Strain name: C57BL/6N
• Subject of the experiment: mice
• Type of diet and experimental condition: low-fat diet
• Anatomy part: liver
• Type of protocol – sample treatment: treated with low-fat diet
• Type of protocol – liver preparation: liver was dissected out
• Type of protocol – cell preparation: hepatocytes prepared
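
The example sentence about the seven week old mice can be captured as explicit fields instead of free text. A minimal sketch, assuming a plain Python dataclass; the class and field names are hypothetical, and in practice each value would ideally point to an ontology term rather than a bare string.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class SampleDescription:
    age_value: int
    age_unit: str
    strain: str
    subject: str
    diet: str
    anatomy_part: str
    protocols: list


# Values are read off the example sentence; field names are illustrative.
sample = SampleDescription(
    age_value=7,
    age_unit="week",
    strain="C57BL/6N",
    subject="mouse",
    diet="low-fat diet",
    anatomy_part="liver",
    protocols=["sample treatment", "liver preparation", "cell preparation"],
)

print(json.dumps(asdict(sample), indent=2))
```

Each field can now be searched, compared and validated on its own, which is not possible with the original sentence.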

How do you know what to report, or how to
structure it?
• Data/content standards:
o Structure, enrich and report the description of the
datasets and the experimental context under which they
were produced
o Facilitate the discovery, sharing, understanding and
reuse of datasets

There are over 600 content standards in the life sciences
(counts shown in the figure: 193, 85, 346)
• Reporting guidelines (checklists): MIAME, MIAPA, MIRIAM, MIQAS, MIX, MIGEN, ARRIVE, MIAPE, MIASE, MIQE, MISFISHIE, REMARK, CONSORT…
• Formats: MAGE-Tab, GCDML, SRAxml, SOFT, FASTA, DICOM, MzML, SBRML, SEDML, GELML, ISA-Tab, CML, MITAB…
• Terminologies: AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO, TEDDY, PRO, XAO, DO, VO…

Community mobilisation to develop content standards
• De jure: standards organizations
• De facto: grass-roots groups
• Example shown: Nanotechnology Working Group

Databases have their own standards, e.g. at EBI:

Enablers: to better describe, share and query data

• Minimum information reporting requirements, or checklists
o Report the same core, essential information
• Controlled vocabularies, taxonomies, thesauri, ontologies etc.
o Use the same word and refer to the same ‘thing’
• Conceptual model, conceptual schema, or exchange formats
o Allow data to flow from one system to another
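
The first two enablers can be illustrated in a few lines: a checklist names the fields every record must report, and a controlled vocabulary maps different spellings to one agreed term. A minimal sketch under stated assumptions: the required fields and the synonym table are made up for illustration and do not come from any published checklist or ontology.

```python
# Hypothetical checklist: the core fields every record must report.
REQUIRED_FIELDS = {"organism", "tissue", "protocol"}

# Hypothetical controlled vocabulary: synonyms mapped to one preferred term.
PREFERRED_TERM = {
    "mouse": "Mus musculus",
    "mice": "Mus musculus",
    "mus musculus": "Mus musculus",
}


def missing_fields(record):
    """Return the checklist fields that are absent or empty, sorted."""
    return sorted(f for f in REQUIRED_FIELDS if not record.get(f))


def normalise_organism(record):
    """Return a copy of the record with 'organism' mapped to its preferred term."""
    value = record.get("organism", "")
    preferred = PREFERRED_TERM.get(value.strip().lower(), value)
    return {**record, "organism": preferred}


record = {"organism": "Mice", "tissue": "liver"}
print(missing_fields(record))      # fields still to be reported
print(normalise_organism(record))  # same 'thing', same word
```

Unknown values pass through unchanged here; a stricter variant could reject anything outside the vocabulary.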

BioSharing is a web-based, curated and searchable registry that ensures
biological standards and databases are registered, informative and discoverable.
It also monitors the development and evolution of standards, their use in
databases and the adoption of both in data policies.

• Researchers, developers and curators lack support and guidance on how best to navigate and select
content standards, understand their maturity, or find databases that implement them
• Funders, journals and librarians do not have enough information to make informed decisions on which
content standards or databases to recommend in policies, fund or implement
Our mission: To help people make the right choice

Work out which format your data should be in for
submission to a particular database

STANDARD DATABASE
Standards and databases (and policies) cross-linked

From simple and advanced searches

The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta
Sansone www.ebi.ac.uk/net-project
Search and filter to find what is relevant to your type of data

Tracking evolution, e.g. deprecations and
substitutions

User profiles populated from ORCID...

... credit for creating, contributing to, maintaining standards, databases and
policies
Ownership of open standards can be problematic in
broad, grass-roots collaborations
It requires improved models. To encourage
maintenance of and contributions to these
efforts, rewards and incentives need to be
identified for all contributors, to support the
continued development of standards

What you can do with BioSharing…
“Which standard should I use for this data, considering I’d
like to publish in journal X?”
“Are we using the most up-to-date version of this standard?”
“My data is in X format, which databases take that format?”
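
Questions like these are lookups over cross-linked records of standards, databases and policies. A minimal sketch with a toy in-memory registry; every name in it is a placeholder, and this is not the BioSharing data model or API.

```python
# Toy registry: all database, format and journal names are placeholders.
DATABASES = {
    "database-a": {"accepts": {"format-x", "format-y"}},
    "database-b": {"accepts": {"format-y"}},
}
JOURNAL_POLICIES = {
    "journal-x": {"recommended_databases": {"database-a"}},
}


def databases_accepting(fmt):
    """Databases that take data in the given format."""
    return sorted(name for name, db in DATABASES.items() if fmt in db["accepts"])


def formats_for_journal(journal):
    """Formats accepted by any database the journal's policy recommends."""
    recommended = JOURNAL_POLICIES[journal]["recommended_databases"]
    formats = set()
    for name in recommended:
        formats |= DATABASES[name]["accepts"]
    return sorted(formats)


print(databases_accepting("format-y"))
print(formats_for_journal("journal-x"))
```

The value of a curated registry is that these links are maintained by people who track the standards, so the answers stay current.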

How can you use community standards?
ISA model and related formats
These tools and formats will help you to:

ISA powers data collection, curation resources and repositories, e.g.:
ISA model and related formats

1
Create template(s) to fit the type of
experiments to be described
Create templates detailing the steps to
be reported for different investigations,
complying with community standards,
e.g. by configuring the value(s) allowed for
each field to be:
• text (with/without regular expressions),
• ontology terms,
• numbers etc.
We have ‘ready to use’ community
standards compliant configurations
and can create more according to
user needs
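
The idea of a template that constrains each field can be sketched as follows. This is an illustration only, not the ISA tools' configuration format: the template structure, field names, regular expression and list of allowed terms are all assumptions.

```python
import re

# Hypothetical template: one rule per field (text with a regular
# expression, a fixed set of ontology terms, or a number).
TEMPLATE = {
    "sample_name": {"type": "text", "pattern": r"[A-Za-z0-9_]+"},
    "organism_part": {"type": "ontology", "terms": {"liver", "kidney"}},
    "age": {"type": "number"},
}


def validate(record, template=TEMPLATE):
    """Return a list of problems; an empty list means the record fits."""
    problems = []
    for field, rule in template.items():
        if field not in record:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        if rule["type"] == "text":
            if not isinstance(value, str) or not re.fullmatch(rule["pattern"], value):
                problems.append(f"{field}: does not match {rule['pattern']}")
        elif rule["type"] == "ontology":
            if value not in rule["terms"]:
                problems.append(f"{field}: {value!r} is not an allowed term")
        elif rule["type"] == "number":
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"{field}: expected a number")
    return problems


print(validate({"sample_name": "LS1_C2", "organism_part": "liver", "age": 7}))
print(validate({"sample_name": "LS1 C2", "organism_part": "heart"}))
```

Checking records against a template at data entry time is cheaper than harmonizing them at the publication stage.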

• The ISA model records the data’s provenance, how it was generated and
where it is located.
• Published Data Descriptors are indexed in all major bibliographic indexing
services (incl. PubMed)
• However, accompanying every Data Descriptor article there are metadata files,
specifically created to aid discovery and understanding of the data itself.
• Using the ISA (Investigation, Study, Assay) model, these metadata files
provide a machine readable overview of the study that generated the data.
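
The Investigation, Study, Assay hierarchy can be pictured as nested records. A minimal sketch, assuming plain dataclasses serialised to JSON; the field names and values are hypothetical (the data file name is borrowed from the earlier sample name example), and this is neither the ISA-Tab format nor the ISA tools API.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class Assay:
    measurement: str
    data_files: list = field(default_factory=list)


@dataclass
class Study:
    title: str
    assays: list = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: list = field(default_factory=list)


investigation = Investigation(
    title="Example investigation",
    studies=[
        Study(
            title="Example study",
            assays=[Assay(measurement="sequencing", data_files=["file1-fastq.gz"])],
        )
    ],
)

# A machine-readable overview: which study and assay produced which files.
print(json.dumps(asdict(investigation), indent=2))
```

Because the structure is explicit, a program can walk from an investigation down to the data files without parsing the article text.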

• Filter datasets by
data repository or
metadata
• Boolean searches
• Future enhancements:
o Statistics
o Richer queries based
on semantics of the data
ISA-explorer: A demo tool for discovering and exploring Scientific
Data’s ISA-Tab metadata

ISA-explorer: A demo tool for discovering and exploring Scientific
Data’s ISA-Tab metadata
Visualise the data associated with a paper
http://tinyurl.com/isaexplorer

• Data sharing
o Is pivotal to drive science and discoveries
o Do your best to make your digital research outputs FAIR
• Experimental context
o Report the experimental context of your findings
o Do to your data what you wish that others would do to theirs
• Content standards
o Continuously evolving
o Make use of tools implementing standards, such as ISAtools
o Use biosharing.org to explore repositories, standards and policies
Summary

Find the right database for your data, and which data standard to
use – https://www.biosharing.org
Check that your data conforms to a standard, or make your own
templates – http://www.isa-tools.org
Where to keep research data: DCC checklist for evaluating data
repositories (DCC) – http://tinyurl.com/DCCResearchData
How and why you should manage your research data (JISC) –
http://tinyurl.com/JISCDMP
Useful links
