Data sharing as part of the research workflow
Varsha Khodiyar, PhD
Data Curation Editor, Scientific Data
Nature Publishing Group
@varsha_khodiyar
@scientificdata
Perspective from Scientific Data
Data Perspective beyond Alliances, 3rd March 2016
Why the push to share data?
Research conduct
• Publication bias – what is submitted
• Experimental design
• Statistics
• Lab supervision and training
Research reporting and sharing
• Gels, microscopy images
• Statistical reporting
• Methods description
• Data deposition and availability
Generating research data is expensive
Just 18.1% of NIH grant applications were funded in 2014*
• Hours spent writing grants?
• Hours spent reviewing grants?
Resources are finite/expensive
• Modified animals
• Specialized reagents
Time and effort taken in the laboratory to generate good, valid data
* report.nih.gov/success_rates/Success_ByIC.cfm
Data needs to be…
• Discoverable: need to know it’s there
• Accessible: must be able to get to the data
• Usable: requires sufficient information about how the data was generated
• Persistent: historical data access as part of the scientific record, as well as for new research
• Reliable: data provenance informs data reuse decisions
Joint Declaration of Data Citation Principles. www.force11.org/group/joint-declaration-data-citation-principles-final
Achieving human and machine accessibility of cited data in scholarly publications. Starr et al. PeerJ Computer Science (2015). doi:10.7717/peerj-cs.1
Making data count. Kratz & Strasser. Sci. Data (2015). doi:10.1038/sdata.2015.39
The FAIR guiding principles for scientific data management and stewardship. Wilkinson et al. Sci. Data (in press)
Researchers already share data
• Most researchers are sharing data, and using the data of others
• Direct contact between researchers (on request) is a common way of sharing data
• Repositories are the second most common method of sharing
Kratz and Strasser (2015) doi: 10.1371/journal.pone.0117619
But…
Sharing of data upon request from published articles
• relies heavily on trust
• when stored informally, disappears at a rate of ~17% per year
(Vines et al. 2014; doi: 10.1016/j.cub.2013.11.014)
Data shared in a repository
• often not reusable due to insufficient context
• may not be possible to determine reliability (peer review?)
• may not be easily findable, if not referenced in a scholarly
article
• no scholarly credit for data producers
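As a rough illustration of the ~17% per year figure above, the sketch below treats it as a simple compounding annual loss. That reading is an assumption made for illustration only; the cited study may define the rate differently, and the function name is hypothetical.

```python
def fraction_remaining(years: int, annual_loss: float = 0.17) -> float:
    """Fraction of informally stored datasets still obtainable after `years`.

    Assumes (for illustration only) a constant, compounding annual loss.
    """
    if years < 0:
        raise ValueError("years must be non-negative")
    return (1.0 - annual_loss) ** years


if __name__ == "__main__":
    # Print the estimated share of datasets remaining at a few time points.
    for years in (1, 5, 10, 20):
        share = fraction_remaining(years)
        print(f"after {years:2d} years: {share:.1%} remaining")
```

Under this simplified reading, less than half of informally stored data would still be obtainable five years after publication.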
Data papers and journals
• Ensure formal storage in repository
• Allow space for authors to include sufficient context for
reuse
• Peer reviewers often specifically requested to comment
on data archive reusability
• Data papers are formal works, giving scholarly credit to data producers
• Formal data citations enabling data discovery via
bibliographic indexes that researchers are used to using
Data journals and multidisciplinary research
Cross-domain data sharing vital for solving the most pressing world
issues:
• Public health (social science, epidemiology & molecular biology)
• Resource management & sustainability (energy research, policy,
ecology & climate science)
Researchers in different domains use different vocabularies and express reliability in different ways, so clear descriptions of data become even more essential for cross-domain data sharing.
Multidisciplinary data journals (e.g. Data Science Journal, Scientific
Data):
• provide a data sharing outlet to researchers in all domains
• help datasets cross domain boundaries: data is more visible and searchable, i.e. less siloing
Data reuse by the research community
“The Data Descriptor made it easier
to use the data, for me it was critical
that everything was there…all the
technical details like voxel size.”
Professor Daniele Marinazzo
Data reuse by the non-research community
http://www.nytimes.com/interactive/2014/12/30/science/history-of-ebola-in-24-outbreaks.html
Increasing the discoverability of data
• Is data truly discoverable by researchers outside the original authors’ domain?
• Too many papers to read in each person’s own field.
• Could increasing the machine accessibility of data result in increased data reuse?
Data Descriptors have human and machine readable components
• Human readable representation of study, i.e. article (HTML & PDF)
• Machine readable representation of study, i.e. metadata
Metadata at Scientific Data
• We capture metadata about the data being described in each Data Descriptor
• The manuscript captures human readable metadata needed for data reuse
• The curated metadata records capture machine readable metadata needed for machine based data discovery
ISA format for machine readable metadata
• Study workflow
• Key sample characteristics
needed for data discovery
• Relates samples to data files
• Shows location of dataset
• Uses controlled vocabularies
and ontologies (where
possible)
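To make the elements listed above concrete, here is a minimal sketch of a machine readable study record. It is not the ISA specification itself: the class names, field names, file name and example.org URL are all hypothetical, chosen only to show a workflow, sample characteristics tagged with ontology terms, a sample-to-file mapping and a dataset location in one structure.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class OntologyTerm:
    label: str      # human readable name
    accession: str  # identifier from a controlled vocabulary


@dataclass
class Sample:
    name: str
    characteristics: Dict[str, OntologyTerm]
    data_files: List[str] = field(default_factory=list)


@dataclass
class StudyRecord:
    title: str
    workflow: List[str]    # ordered steps of the study
    dataset_location: str  # where the dataset is stored
    samples: List[Sample] = field(default_factory=list)

    def files_for(self, sample_name: str) -> List[str]:
        """Return the data files recorded for the named sample."""
        for sample in self.samples:
            if sample.name == sample_name:
                return sample.data_files
        return []


def example_record() -> StudyRecord:
    """Build a hypothetical record; every value here is a placeholder."""
    human = OntologyTerm(label="Homo sapiens", accession="NCBITaxon:9606")
    sample = Sample(
        name="sample-1",
        characteristics={"organism": human},
        data_files=["sample-1_scan.nii.gz"],
    )
    return StudyRecord(
        title="Example imaging study (hypothetical)",
        workflow=["sample collection", "image acquisition", "preprocessing"],
        dataset_location="https://example.org/datasets/placeholder",
        samples=[sample],
    )


def to_json(record: StudyRecord) -> str:
    """Serialise the record so that software can read it."""
    return json.dumps(asdict(record), indent=2)


if __name__ == "__main__":
    print(to_json(example_record()))
```

Recording the organism as an ontology term rather than free text means that different records can be compared by accession instead of by spelling.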
Metadata for data discovery
Search by:
• Data Repositories
• Experiment design
• Measurements made
• Technologies used
• Factor types
• Sample Characteristics
• Organism
• Environment types
• Geographic locations
scientificdata.isa-explorer.org
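The kind of faceted search listed above can be sketched as a filter over metadata records. This is an illustration only and does not describe how the explorer site is implemented; the records, facet names and the `search` function are hypothetical.

```python
# Hypothetical metadata records: each facet maps to a list of values.
RECORDS = [
    {
        "title": "Dataset A",
        "organism": ["Homo sapiens"],
        "technology": ["MRI"],
        "repository": ["Repository X"],
    },
    {
        "title": "Dataset B",
        "organism": ["Mus musculus"],
        "technology": ["MRI"],
        "repository": ["Repository Y"],
    },
    {
        "title": "Dataset C",
        "organism": ["Homo sapiens"],
        "technology": ["DNA sequencing"],
        "repository": ["Repository X"],
    },
]


def search(records, **facets):
    """Return the records that contain every requested facet value.

    Matching is exact, so it only works when records use the same terms.
    """
    def matches(record):
        return all(
            value in record.get(facet, [])
            for facet, value in facets.items()
        )

    return [record for record in records if matches(record)]


if __name__ == "__main__":
    hits = search(RECORDS, organism="Homo sapiens", technology="MRI")
    for hit in hits:
        print(hit["title"])
```

Because the match is exact, a record labelled "human" would not be found by a search for "Homo sapiens", which is one reason controlled vocabularies matter for discovery.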
Data as part of the publication workflow
Data Descriptors can be submitted and published at any point in the research workflow:
• Before the analysis has been published
• Alongside publication of the analysis article
• After the data analysis has been published
• By authors not intending to analyse the data
Data as part of the research workflow?
Papers are usually written after analyses, so key details can be forgotten
• Ideally metadata would be captured during data generation
process
• Takes time and effort to capture adequate metadata of
sufficient quality for data reuse
Machine readable metadata
• Metadata format needs to be decided prospectively
• Researchers require professional expertise and guidance to use
ontologies (essential for machine readability and discovery)
How to ensure data generators are able to capture metadata easily
and in sufficient detail for reuse?
Building infrastructure to promote data sharing as part of the research workflow
Discoverable
• Machine based data discovery
• Implement data citations
• Use community ontologies
Accessible & Persistent
• Encourage use of repositories
• Use persistent identifiers for data
Usable
• Metadata capture during data generation process
• Encourage use of minimal reporting standards
Reliable
• Encourage peer reviewers to evaluate data archive (structure, format) alongside the article
Researcher incentives
• Recognise data as a first class scholarly work
• Provide tools for data visualization and discovery
Scientific Data at RDA
Working groups
• Publishing Data Workflows (co-chair)
• BioSharing Registry (Susanna Sansone is co-chair)
Interest groups
• Publishing Data
• Data Fabric
• Data in Context
• Metadata
• Certification of Digital Repositories
Visit nature.com/sdata
Email scientificdata@nature.com
Tweet @ScientificData
Honorary Academic Editor
Susanna-Assunta Sansone
Managing Editor
Andrew L. Hufton
Data Curation Editor
Varsha K. Khodiyar
Advisory Panel and Editorial
Board including senior researchers,
funders, librarians and curators
