Developing metadata curation processes for data that can’t be shared openly

Rebecca Grant, Graham Smith, Iain Hrynaszkiewicz
Illustration inspired by the work of John Maynard Keynes

1
CC-BY-ND 2019
The context for curation support

2
Practical Challenges for Researchers in Data Sharing white paper
A global survey of nearly 8000 researchers
Stuart, David; Baynes, Grace; Hrynaszkiewicz, Iain; Allin, Katie; Penny, Dan; Lucraft, Mithu; Astell, Mathias (2018): Whitepaper: Practical challenges for researchers in data sharing. https://doi.org/10.6084/m9.figshare.5975011.v1

3
Private sharing of data is more common than public sharing of data
Global levels of data sharing:
• Poland – 76% (highest)
• Germany – 75%
• UK – 58%
• USA – 55%

4
Problems authors face in sharing datasets
• Costs of sharing data – 11.52%
• Lack of time to deposit data – 16.39%
• Not knowing which repository to use – 20.22%
• Unsure about copyright and licensing – 23.03%
• Organising data in a presentable and useful way – 28.24%
Total respondents: 7719

5
Data curation at Springer Nature
• Recommended repositories list
• Research Data Helpdesk
• Research Data Policies
Joe Salter, Journal Development Editor
Graham Smith, Senior Research Data Editor
Varsha Khodiyar, Data Curation Manager
Iain Hrynaszkiewicz, Head of Data Publishing
Rebecca Grant, Research Data Manager

6
Supporting data curation: a researcher’s dataset in a desktop folder
No one other than the creator can access the data, or even knows that it exists

7
Before curation begins
Once a dataset is received, we check that it is suitable for our curation services. Multiple files in any format are accepted.
Pre-curation data checks:
• The data aren’t sensitive
• The data don’t include direct or indirect human identifiers
• The data shouldn’t be in a community repository
• The data are associated with a trusted publication
After making these checks, we begin the curation process. If necessary, we may recommend that the dataset be split into smaller groups or collections.
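The four pre-curation checks can be read as a simple checklist. The sketch below is illustrative only: the `Submission` structure, its field names and the `failed_checks` helper are hypothetical, and the reading of "shouldn't be in a community repository" as "a community repository would be the better home" is an assumption rather than the curation team's stated rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Submission:
    """Answers a curator might record for one dataset (hypothetical structure)."""
    is_sensitive: bool
    has_human_identifiers: bool        # direct or indirect identifiers present
    suits_community_repository: bool   # assumption: a community repository is the better home
    has_trusted_publication: bool


def failed_checks(submission: Submission) -> List[str]:
    """Return the pre-curation checks that the submission does not pass."""
    checks = [
        ("the data are sensitive", submission.is_sensitive),
        ("the data include human identifiers", submission.has_human_identifiers),
        ("the data belong in a community repository", submission.suits_community_repository),
        ("no trusted publication is associated", not submission.has_trusted_publication),
    ]
    return [reason for reason, failed in checks if failed]


if __name__ == "__main__":
    example = Submission(
        is_sensitive=False,
        has_human_identifiers=False,
        suits_community_repository=True,
        has_trusted_publication=True,
    )
    # An empty list would mean the dataset can move on to curation.
    print(failed_checks(example) or "suitable for curation")
```

Returning the list of failed checks, rather than a single yes/no, keeps the reason visible so that the curator can pass it back to the researcher.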

8
Metadata curation output
Working with the researcher’s manuscript or published paper, we draft a comprehensive metadata record for the dataset, which is sent to the researcher for approval before being published. Embargoes can be applied if necessary.
The curated dataset will be published with its own metadata record, which includes rich descriptive information, reuse conditions, licence, DOI, metrics and keywords (this example is https://doi.org/10.6084/m9.figshare.5259415).

9
Addressing the challenges of data that can’t be openly shared

10
What makes research data sensitive?
• Personally identifiable information.
• Special categories of personal information (e.g. as specified by the GDPR or other data protection legislation).
• Data revealing the location of rare, endangered or commercially valuable species.
• Commercially sensitive data, for example relating to industrial partners or collected on their behalf.

11
Assessing sensitivity of personal data: direct identifiers
Direct identifiers are pieces of information that relate directly to an individual and, on their own, allow that individual to be clearly identified.
Datasets should include 0 direct identifiers, for example:
• Name
• Fingerprint
• Facial photographs
• Signature
• Biometric records
• Telephone number

12
Assessing sensitivity of personal data: indirect identifiers
Indirect identifiers are pieces of information that allow individuals to be identified when combined with other available information.
Datasets should include fewer than 3 indirect identifiers, for example:
• Gender
• Place of birth
• Income
• Race or ethnicity
• Unusual features, e.g. rare diseases, uncommon job titles, or a large number of children
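The two thresholds on the identifier slides (0 direct identifiers, fewer than 3 indirect identifiers) can be sketched as a first-pass screen of a dataset's column names. This is a minimal illustration, not the curation team's method: matching on column names is an assumption, the keyword sets simply reuse the examples from the slides, and `screen_columns` is a hypothetical helper. A real assessment still needs a curator to look at the values themselves.

```python
# Hypothetical keyword sets, taken from the examples listed on the slides.
DIRECT_IDENTIFIERS = {
    "name", "fingerprint", "facial photograph",
    "signature", "biometric record", "telephone number",
}
INDIRECT_IDENTIFIERS = {
    "gender", "place of birth", "income", "race", "ethnicity",
}

MAX_DIRECT = 0     # "Datasets should have 0 direct identifiers"
MAX_INDIRECT = 2   # "<3 indirect identifiers"


def normalise(column):
    """Lower-case a column name and turn underscores into spaces."""
    return column.strip().lower().replace("_", " ")


def screen_columns(columns):
    """Report which column names match known identifier keywords."""
    names = [normalise(c) for c in columns]
    direct = sorted(n for n in names if n in DIRECT_IDENTIFIERS)
    indirect = sorted(n for n in names if n in INDIRECT_IDENTIFIERS)
    return {
        "direct": direct,
        "indirect": indirect,
        "passes": len(direct) <= MAX_DIRECT and len(indirect) <= MAX_INDIRECT,
    }


if __name__ == "__main__":
    print(screen_columns(["Gender", "income", "place_of_birth", "tumour_grade"]))
```

Exact keyword matching will miss renamed or abbreviated columns, so a passing result here should be treated as a prompt for manual review rather than as clearance to share.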

13
Sensitive data can still be shared:
• De-identified and shared publicly, e.g. in a repository.
• Deposited in a controlled access repository.
• Access managed and controlled by the researcher (e.g. “available on request”).

14
The challenges of sharing sensitive data
Journal data policies may require data sharing, or a data availability statement describing how data can be accessed.
• Authors may not have the expertise to de-identify data appropriately; editors may not be able to advise.
• Alternatively, data are deposited in controlled access repositories (sometimes with minimal metadata).
• Authors may also choose to share data on request (i.e. no metadata is available at all).

15
Data available on request
doi:10.1038/s41523-018-0079-1

16
Metadata curation for the journal npj Breast Cancer
npj Breast Cancer is an open access, online-only, multidisciplinary research journal dedicated to publishing the finest research on breast cancer research and treatment.
Working with the curation team to provide editorial support for data sharing:
• Reviewing accepted manuscripts.
• Providing advice on data sharing.
• Creating a metadata catalogue of rich metadata records for every article in the journal’s repository.
• Writing detailed data availability statements.
• Building on existing data sharing practice at the journal and supporting more authors to share.

17
Curation workflow
• Curators connected with authors when an article is accepted in principle.
• Advice given on de-identification of data.
• Advice given on suitable disciplinary repositories.
• Curator reads paper for data-related information.
• Additional information requested from author.
• Curator creates rich metadata record and data availability statement (DAS).
• Metadata and DAS reviewed and approved by author.

18
Data reporting checklist

19
Gathering contextual metadata
Initially the team used a Google form to capture contextual information about the author’s study and accompanying datasets:
• Type/format of data
• Filenames
• Software required
• Funder
• Additional documentation
For sensitive clinical data that aren’t shared openly:
• Sample size
• Cohort size
• Registered trial number
• Access requirements
Authors were not responsive, even though this held up their articles during publication.
The form was not suitable for papers that cited multiple datasets, each of which needed to be described.

20
Adapting the metadata collection process
Metadata collection is now based on:
A review of the author’s paper.
+ A short spreadsheet filled out by the author.
+ A review of the author’s datasets available in other repositories.
+ Direct email to the author where necessary.
= Rich contextual information about the datasets
+ A consistent format for the metadata we use to describe studies

21
Example output: metadata record for data available on request
The dataset is available on request only, due to commercial sensitivity. The metadata record is stored in the npj Breast Cancer figshare portal and includes:
• Authors
• Title
• A description of the study design
• Data type
• Data format
• Number of files, file names
• Software required
• Access requirements
• Funder information
• Keywords
• Link to associated paper
• Metrics
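The fields of this metadata record can be sketched as a small data structure. The class and field names below are hypothetical and do not reproduce figshare's schema. Metrics are left out on the assumption that the repository generates them after publication, and the number of files is derived from the file names. All values in the usage example are placeholders.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class MetadataRecord:
    """Descriptive metadata for a dataset that is available on request only."""
    authors: List[str]
    title: str
    study_design: str
    data_type: str
    data_format: str
    file_names: List[str]
    software_required: str
    access_requirements: str
    funder_information: str
    keywords: List[str]
    associated_paper: str   # link to the associated paper

    @property
    def number_of_files(self) -> int:
        return len(self.file_names)


def missing_fields(record: MetadataRecord) -> List[str]:
    """List the fields that are still empty, e.g. to follow up with the author."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


if __name__ == "__main__":
    # Placeholder values only; none of this describes a real dataset.
    draft = MetadataRecord(
        authors=["A. Author"],
        title="Placeholder dataset title",
        study_design="Placeholder description of the study design",
        data_type="Clinical data",
        data_format=".xlsx",
        file_names=["placeholder_1.xlsx", "placeholder_2.xlsx"],
        software_required="Spreadsheet software",
        access_requirements="Available on request from the corresponding author",
        funder_information="",
        keywords=[],
        associated_paper="",
    )
    print(draft.number_of_files, missing_fields(draft))
```

Because the record holds only descriptive fields, it can be completed from the paper and the author's answers without the curator ever opening the sensitive files.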

22
Example output: data availability statement
https://doi.org/10.1038/s41523-019-0106-x

23
Before: data available on request
doi:10.1038/s41523-018-0079-1

24
Curation impacts so far
14 submissions (journal articles) to date:
• 3 were deposited in specialist repositories on the curator’s recommendation (including GEO and dbGaP). There was a potential risk to funding if this was not done.
• 1 was a commercially sensitive dataset which required assessment; the authors were advised not to share it openly. There was a potential risk of legal liability if it was shared without permission.
• 1 paper originally consisted of references to articles for 39 gene expression datasets, which the curator used to create a table including DOIs and accession numbers for each.
• 800+ views of the metadata records in the repository.

25
Other impacts
• Improvements to accessibility of all datasets, particularly those only available on request.
• Opportunity for researcher to identify issues in related publications, e.g. incorrect accession codes.
• Allows curation without access to sensitive datasets, capitalising on knowledge of the researcher and the journal editor.
• Increasing accessibility of curation to a larger proportion of researchers – does not exclude those who cannot share openly.
• Demonstrating an approach that’s compatible with generalist or institutional repositories.

26
Thank you
Rebecca Grant, Research Data Manager
Researchdata@springernature.com / Rebecca.Grant@springernature.com
https://go.nature.com/ResearchDataServices
https://researchdata.springernature.com/
The story behind the image
John Maynard Keynes (1883–1946)
John Maynard Keynes was a British economist who revolutionised the theory and practice of macroeconomics, reformed economics and had a profound influence on economic policy. This illustration represents the Keynesian model which shows that in a monetary economy it is possible to have periods of high unemployment unless governments use active monetary and fiscal policy to stimulate aggregate demand.

27
Image credits
Slide 10: photo by Frida Bredesen on Unsplash
Slide 12: photo by Ashley Edwards on Unsplash
