Developing metadata curation
processes for data that can’t
be shared openly
Rebecca Grant, Graham Smith, Iain
Hrynaszkiewicz
Illustration inspired by the work of John Maynard Keynes
Developing metadata curation processes for data that can’t be shared openly
CC-BY-ND 2019
The context for curation support
Stuart, David; Baynes, Grace; Hrynaszkiewicz, Iain; Allin, Katie;
Penny, Dan; Lucraft, Mithu; Astell, Mathias (2018): Whitepaper:
Practical challenges for researchers in data sharing
https://doi.org/10.6084/m9.figshare.5975011.v1
Practical Challenges for Researchers
in Data Sharing white paper
A global survey of nearly 8000
researchers
Global levels of data sharing:
• Poland – 76% (highest)
• Germany – 75%
• UK – 58%
• USA – 55%
Private sharing of data is more common than public sharing
of data
Problems authors face in sharing datasets (total respondents: 7719):
• Costs of sharing data: 11.52%
• Lack of time to deposit data: 16.39%
• Not knowing which repository to use: 20.22%
• Unsure about copyright and licensing: 23.03%
• Organising data in a presentable and useful way: 28.24%
• Recommended repositories list
• Research Data Helpdesk
• Research Data Policies
Joe Salter, Journal Development Editor
Graham Smith, Senior Research Data Editor
Varsha Khodiyar, Data Curation Manager
Iain Hrynaszkiewicz, Head of Data Publishing
Rebecca Grant, Research Data Manager
Data curation at Springer Nature
No one other than the
creator can access the
data, or even knows that
it exists
Supporting data curation: a researcher’s dataset in a
desktop folder
Once received, we check to make sure that the dataset is suitable for our curation services. Multiple files in any format are accepted.
Pre-curation data checks:
• The data aren’t sensitive
• The data don’t include direct or indirect human identifiers
• The data shouldn’t be in a community repository
• The data are associated with a trusted publication
After making these checks, we begin the curation process. If necessary, we may recommend that the dataset is split into smaller groups or collections.
Before curation begins
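The pre-curation checks on this slide could be recorded along the following lines. This is only a minimal sketch: the `Submission` class, its field names and the `pre_curation_issues` function are hypothetical names introduced for illustration, and each answer is assumed to come from a curator's judgement rather than from any automated test.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    """Hypothetical summary of a curator's answers for a received dataset."""
    is_sensitive: bool
    has_human_identifiers: bool
    belongs_in_community_repository: bool
    has_trusted_publication: bool


def pre_curation_issues(sub: Submission) -> list[str]:
    """Return the checks that failed; an empty list means curation can begin."""
    issues = []
    if sub.is_sensitive:
        issues.append("data are sensitive")
    if sub.has_human_identifiers:
        issues.append("data include direct or indirect human identifiers")
    if sub.belongs_in_community_repository:
        issues.append("data should be in a community repository")
    if not sub.has_trusted_publication:
        issues.append("data are not associated with a trusted publication")
    return issues
```

Returning the list of failed checks, rather than a single yes/no, lets the curator tell the researcher exactly why a dataset was not accepted.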
Working with the researcher’s manuscript or published paper, we draft a comprehensive metadata record for the dataset, which is sent to the researcher for approval before being published. Embargoes can be applied if necessary.
The curated dataset will be published with its own metadata record, which includes rich descriptive information, reuse conditions, licence, DOI, metrics and keywords (this example is https://doi.org/10.6084/m9.figshare.5259415).
Metadata curation output
Addressing the challenges of data that can’t be openly
shared
• Personally identifiable information.
• Special categories of personal information
(e.g. as specified by the GDPR or other
data protection legislation).
• Data revealing the location of rare,
endangered or commercially-valuable
species.
• Commercially sensitive data, for example
relating to industrial partners or collected
on their behalf.
What makes research data sensitive?
Direct identifiers relate directly to an individual and are information that, on its own, allows the clear identification of individuals.
Datasets should include 0 direct identifiers, such as:
• Name
• Fingerprint
• Facial photographs
• Signature
• Biometric records
• Telephone number
Assessing sensitivity of personal data: direct identifiers
Indirect identifiers are information that allows the identification of individuals through their combination with other available information.
Datasets should include fewer than 3 indirect identifiers, such as:
• Gender
• Place of birth
• Income
• Race or ethnicity
• Unusual features, e.g. rare diseases, uncommon job titles, or a large number of children
Assessing sensitivity of personal data: indirect identifiers
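The two thresholds (0 direct identifiers, fewer than 3 indirect identifiers) can be illustrated with a rough screening sketch over a dataset's column names. The identifier name sets and the `screen_columns` function are assumptions made for this example; matching on exact column names is a simplification and would not replace a curator's assessment.

```python
# Illustrative name lists based on the examples in the slides (assumed spellings).
DIRECT_IDENTIFIERS = {
    "name", "fingerprint", "facial photograph",
    "signature", "biometric record", "telephone number",
}
INDIRECT_IDENTIFIERS = {
    "gender", "place of birth", "income", "race or ethnicity",
    "rare disease", "job title", "number of children",
}


def screen_columns(columns):
    """Flag identifier-like column names and apply the two thresholds."""
    normalised = {c.strip().lower() for c in columns}
    direct = sorted(normalised & DIRECT_IDENTIFIERS)
    indirect = sorted(normalised & INDIRECT_IDENTIFIERS)
    # Rule of thumb from the slides: no direct identifiers, fewer than 3 indirect.
    passes = len(direct) == 0 and len(indirect) < 3
    return {"direct": direct, "indirect": indirect, "passes": passes}
```

For example, a table with columns `Gender`, `Income` and `Tumour size` would pass under these assumptions, while adding `Place of birth` would bring it to three indirect identifiers and fail the check.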
Sensitive data can still be shared:
• De-identified and shared publicly, e.g. in a repository.
• Deposited in a controlled access repository.
• Access managed and controlled by the researcher (e.g. “available on request”).
Journal data policies may require data sharing, or a data
availability statement describing how data can be accessed.
• Authors may not have the expertise to de-identify data
appropriately; editors may not be able to advise.
• Alternatively, data are deposited in controlled access
repositories (sometimes with minimal metadata).
• Authors may also choose to share data on request (i.e. no
metadata is available at all).
The challenges of sharing sensitive data
doi:10.1038/s41523-018-0079-1
Data available on request
npj Breast Cancer is an open access, online-only, multidisciplinary research journal dedicated to publishing the finest research on breast cancer research and treatment.
Working with the curation team to provide editorial support for data sharing:
• Reviewing accepted manuscripts.
• Providing advice on data sharing.
• Creating a metadata catalogue of rich metadata records for every article in the journal’s repository.
• Writing detailed data availability statements.
• Building on existing data sharing practice at the journal and supporting more authors to share.
Metadata curation for the journal npj Breast Cancer
• Curators connected with authors when an article is
accepted in principle.
• Advice given on de-identification of data.
• Advice given on suitable disciplinary repositories.
• Curator reads paper for data-related information.
• Additional information requested from author.
• Curator creates rich metadata record and data availability statement (DAS).
• Metadata and DAS reviewed and approved by author.
Curation workflow
Data reporting checklist
Initially the team used a Google form to capture contextual information about the author’s study and accompanying datasets:
• Type/format of data
• Filenames
• Software required
• Funder
• Additional documentation
For sensitive clinical data that aren’t shared openly:
• Sample size
• Cohort size
• Registered trial number
• Access requirements
Authors were not responsive, even though this held up their article during publication.
The form was not suitable for papers which cited multiple datasets that needed to be described.
Gathering contextual metadata
Metadata collection is now based on:
A review of the author’s paper.
+ A short spreadsheet filled out by the author.
+ A review of the author’s datasets available in other repositories.
+ Email directly to the author where necessary.
= Rich contextual information about the datasets
+ A consistent format for the metadata we use to describe studies
Adapting the metadata collection process
Example output: metadata record for data available on request
The dataset is available on request only due to commercial sensitivity. The metadata record is stored in the npj Breast Cancer figshare portal and includes:
• Authors
• Title
• A description of the study design
• Data type
• Data format
• Number of files, file names
• Software required
• Access requirements
• Funder information
• Keywords
• Link to associated paper
• Metrics
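The fields of such a metadata record could be represented as a simple data structure, sketched below. The `MetadataRecord` class, its attribute names and the `missing_fields` helper are hypothetical and do not reflect the figshare schema; metrics are left out on the assumption that the repository generates them after publication.

```python
from dataclasses import dataclass, field, fields


@dataclass
class MetadataRecord:
    """Illustrative record for a dataset that is available on request only."""
    authors: list[str] = field(default_factory=list)
    title: str = ""
    study_design: str = ""
    data_type: str = ""
    data_format: str = ""
    file_names: list[str] = field(default_factory=list)
    software_required: str = ""
    access_requirements: str = ""
    funder_information: str = ""
    keywords: list[str] = field(default_factory=list)
    associated_paper: str = ""

    @property
    def number_of_files(self) -> int:
        # Derived from the file names rather than stored separately.
        return len(self.file_names)


def missing_fields(record: MetadataRecord) -> list[str]:
    """List the fields still empty, e.g. to request them from the author."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]
```

A check like `missing_fields` mirrors the step in the workflow where additional information is requested from the author before the record is sent for approval.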
https://doi.org/10.1038/s41523-019-0106-x
Example output: data availability statement
doi:10.1038/s41523-018-0079-1
Before: data available on request
14 submissions (journal articles) to date:
• 3 were deposited in specialist repositories on the curator’s recommendation (including GEO and dbGaP). There was a potential risk to funding if this was not done.
• 1 was a commercially sensitive dataset which required assessment; the advice was not to share it openly. There was a potential risk of legal liability if it was shared without permission.
• 1 paper originally contained references to articles for 39 gene expression datasets, which the curator used to create a table including DOIs and accession numbers for each.
• 800+ views of the metadata records in the repository.
Curation impacts so far
Improvements to accessibility of all datasets, particularly those only
available on request.
Opportunity for researcher to identify issues in related publications,
e.g. incorrect accession codes.
Allows curation without access to sensitive datasets, capitalising on
knowledge of the researcher and the journal editor.
Increasing accessibility of curation to a larger proportion of researchers
– does not exclude those who cannot share openly.
Demonstrating an approach that’s compatible with generalist or
institutional repositories.
Other impacts
The story behind the image
John Maynard Keynes (1883–1946)
John Maynard Keynes was a British economist who
revolutionised the theory and practice of macroeconomics,
reformed economics and had a profound influence on
economic policy. This illustration represents the Keynesian
model which shows that in a monetary economy it is
possible to have periods of high unemployment unless
governments use active monetary and fiscal policy to
stimulate aggregate demand.
Rebecca Grant, Research Data Manager
Researchdata@springernature.com /
Rebecca.Grant@springernature.com
https://go.nature.com/ResearchDataServices
https://researchdata.springernature.com/
Thank you
Slide 10: photo by Frida Bredesen on Unsplash
Slide 12: photo by Ashley Edwards on Unsplash
Image credits

More Related Content

What's hot (20)

PDF
dkNET Webinar: dkNET Hypothesis Center Live Demo 09/24/2021
dkNET
 
PPTX
DataONE Education Module 08: Data Citation
DataONE
 
PDF
Metadata 2020 Vivo Conference 2018
Clare Dean
 
PPTX
dkNET Webinar: FAIR Data & Software in the Research Life Cycle 01/22/2021
dkNET
 
PDF
McGeary Data Curation Network: Developing and Scaling
National Information Standards Organization (NISO)
 
PPTX
Developing and assessing FAIR digital resources
Michel Dumontier
 
PPTX
Linked Data for Biopharma
Tom Plasterer
 
PPTX
From Data Policy Towards FAIR Data For All: How standardised data policies ca...
Rebecca Grant
 
PPTX
Building a Network of Interoperable and Independently Produced Linked and Ope...
Michel Dumontier
 
PDF
dkNET Poster ENDO 2016
dkNET
 
PDF
Edge Informatics and FAIR (Findable, Accessible, Interoperable and Reusable) ...
Tom Plasterer
 
PPTX
Compliance: Data Management Plans and Public Access to Data
Margaret Henderson
 
PPTX
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...
dkNET
 
PPTX
Inroads into Data: Getting Involved in Data at Your Institution
Margaret Henderson
 
PDF
dkNET ESP Meeting - February 2016
dkNET
 
PPTX
Horizon 2020 and the open research data pilot
Sarah Jones
 
PDF
Preparing your data for sharing and publishing
Varsha Khodiyar
 
PDF
dkNET Introductory Webinar 05/10/2017
dkNET
 
PPTX
Data Management Planning for Engineers
Sherry Lake
 
PPTX
23 things for Research Data - LIBER webinar 23 Feb 2017
ARDC
 
dkNET Webinar: dkNET Hypothesis Center Live Demo 09/24/2021
dkNET
 
DataONE Education Module 08: Data Citation
DataONE
 
Metadata 2020 Vivo Conference 2018
Clare Dean
 
dkNET Webinar: FAIR Data & Software in the Research Life Cycle 01/22/2021
dkNET
 
McGeary Data Curation Network: Developing and Scaling
National Information Standards Organization (NISO)
 
Developing and assessing FAIR digital resources
Michel Dumontier
 
Linked Data for Biopharma
Tom Plasterer
 
From Data Policy Towards FAIR Data For All: How standardised data policies ca...
Rebecca Grant
 
Building a Network of Interoperable and Independently Produced Linked and Ope...
Michel Dumontier
 
dkNET Poster ENDO 2016
dkNET
 
Edge Informatics and FAIR (Findable, Accessible, Interoperable and Reusable) ...
Tom Plasterer
 
Compliance: Data Management Plans and Public Access to Data
Margaret Henderson
 
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...
dkNET
 
Inroads into Data: Getting Involved in Data at Your Institution
Margaret Henderson
 
dkNET ESP Meeting - February 2016
dkNET
 
Horizon 2020 and the open research data pilot
Sarah Jones
 
Preparing your data for sharing and publishing
Varsha Khodiyar
 
dkNET Introductory Webinar 05/10/2017
dkNET
 
Data Management Planning for Engineers
Sherry Lake
 
23 things for Research Data - LIBER webinar 23 Feb 2017
ARDC
 

Similar to Developing metadata curation processes for data that can’t be shared openly (20)

PPTX
DataONE Education Module 02: Data Sharing
DataONE
 
PPTX
Life Science Analytics
Andrew Malinow, PhD
 
PDF
Facilitating good research data management practice as part of scholarly publ...
Varsha Khodiyar
 
PDF
Toward a FAIR Biomedical Data Ecosystem
Globus
 
PDF
Digital transformation to enable a FAIR approach for health data science
Varsha Khodiyar
 
PDF
Five essentials factors for unlocking the potential for Open Research Data
Varsha Khodiyar
 
PDF
2012 Fall Data Management Planning Workshop
Lizzy_Rolando
 
PPTX
tranSMART Community Meeting 5-7 Nov 13 - Session 5: Recent tranSMART Lessons ...
David Peyruc
 
PPTX
Intro to Data Management Plans
Sarah Jones
 
PDF
A Data Biosphere for Biomedical Research
Robert Grossman
 
PDF
What is Data Commons and How Can Your Organization Build One?
Robert Grossman
 
PPTX
Recognising data sharing
Jisc RDM
 
PDF
Data Governance in two different data archives: When is a federal data reposi...
Carolyn Ten Holter
 
PPTX
Shareable by Design: Making Better Use of your Research
London School of Hygiene and Tropical Medicine
 
PPTX
DMP health sciences
Sarah Jones
 
PPT
North American funders' DMP requirements
Sarah Jones
 
PPTX
Hahnel "Open Data Policies: Opportunities, compliance and technology strategies"
National Information Standards Organization (NISO)
 
PPTX
Publishing Data on the Web
Centro Web
 
PPTX
Perspectives on the Role of Trustworthy Repository Standards in Data Journal ...
The University of Edinburgh
 
PPTX
EUDAT & OpenAIRE Webinar: How to write a Data Management Plan - July 14, 2016...
EUDAT
 
DataONE Education Module 02: Data Sharing
DataONE
 
Life Science Analytics
Andrew Malinow, PhD
 
Facilitating good research data management practice as part of scholarly publ...
Varsha Khodiyar
 
Toward a FAIR Biomedical Data Ecosystem
Globus
 
Digital transformation to enable a FAIR approach for health data science
Varsha Khodiyar
 
Five essentials factors for unlocking the potential for Open Research Data
Varsha Khodiyar
 
2012 Fall Data Management Planning Workshop
Lizzy_Rolando
 
tranSMART Community Meeting 5-7 Nov 13 - Session 5: Recent tranSMART Lessons ...
David Peyruc
 
Intro to Data Management Plans
Sarah Jones
 
A Data Biosphere for Biomedical Research
Robert Grossman
 
What is Data Commons and How Can Your Organization Build One?
Robert Grossman
 
Recognising data sharing
Jisc RDM
 
Data Governance in two different data archives: When is a federal data reposi...
Carolyn Ten Holter
 
Shareable by Design: Making Better Use of your Research
London School of Hygiene and Tropical Medicine
 
DMP health sciences
Sarah Jones
 
North American funders' DMP requirements
Sarah Jones
 
Hahnel "Open Data Policies: Opportunities, compliance and technology strategies"
National Information Standards Organization (NISO)
 
Publishing Data on the Web
Centro Web
 
Perspectives on the Role of Trustworthy Repository Standards in Data Journal ...
The University of Edinburgh
 
EUDAT & OpenAIRE Webinar: How to write a Data Management Plan - July 14, 2016...
EUDAT
 
Ad

More from Rebecca Grant (7)

PDF
Increasing transparency in Medical Education through Open Data
Rebecca Grant
 
PPTX
Research in the time of Covid: Surveying impacts on Early Career Researchers
Rebecca Grant
 
PPTX
Managing Ireland's Research Data - 3 Research Methods
Rebecca Grant
 
PPTX
Do Open data badges influence author behaviour? A case study at Springer Nature
Rebecca Grant
 
PPTX
Positioning record keepers as data management professionals
Rebecca Grant
 
PPTX
A National Approach to Open Data in Ireland: Publishers and Research Data Man...
Rebecca Grant
 
PPTX
Records professionals and Research Data - a new role?
Rebecca Grant
 
Increasing transparency in Medical Education through Open Data
Rebecca Grant
 
Research in the time of Covid: Surveying impacts on Early Career Researchers
Rebecca Grant
 
Managing Ireland's Research Data - 3 Research Methods
Rebecca Grant
 
Do Open data badges influence author behaviour? A case study at Springer Nature
Rebecca Grant
 
Positioning record keepers as data management professionals
Rebecca Grant
 
A National Approach to Open Data in Ireland: Publishers and Research Data Man...
Rebecca Grant
 
Records professionals and Research Data - a new role?
Rebecca Grant
 
Ad

Recently uploaded (20)

PPTX
加拿大尼亚加拉学院毕业证书{Niagara在读证明信Niagara成绩单修改}复刻
Taqyea
 
PPTX
The _Operations_on_Functions_Addition subtruction Multiplication and Division...
mdregaspi24
 
PPTX
Exploring Multilingual Embeddings for Italian Semantic Search: A Pretrained a...
Sease
 
PDF
Choosing the Right Database for Indexing.pdf
Tamanna
 
PPT
Data base management system Transactions.ppt
gandhamcharan2006
 
PPTX
Rocket-Launched-PowerPoint-Template.pptx
Arden31
 
PPTX
Hadoop_EcoSystem slide by CIDAC India.pptx
migbaruget
 
PDF
How to Avoid 7 Costly Mainframe Migration Mistakes
JP Infra Pvt Ltd
 
DOCX
AI/ML Applications in Financial domain projects
Rituparna De
 
PDF
Context Engineering vs. Prompt Engineering, A Comprehensive Guide.pdf
Tamanna
 
PDF
Context Engineering for AI Agents, approaches, memories.pdf
Tamanna
 
PDF
apidays Helsinki & North 2025 - API-Powered Journeys: Mobility in an API-Driv...
apidays
 
PDF
Web Scraping with Google Gemini 2.0 .pdf
Tamanna
 
PPTX
Resmed Rady Landis May 4th - analytics.pptx
Adrian Limanto
 
PPTX
AI Presentation Tool Pitch Deck Presentation.pptx
ShyamPanthavoor1
 
PDF
apidays Helsinki & North 2025 - Monetizing AI APIs: The New API Economy, Alla...
apidays
 
PDF
Building Production-Ready AI Agents with LangGraph.pdf
Tamanna
 
PDF
apidays Helsinki & North 2025 - APIs in the healthcare sector: hospitals inte...
apidays
 
PPT
Lecture 2-1.ppt at a higher learning institution such as the university of Za...
rachealhantukumane52
 
PDF
Product Management in HealthTech (Case Studies from SnappDoctor)
Hamed Shams
 
加拿大尼亚加拉学院毕业证书{Niagara在读证明信Niagara成绩单修改}复刻
Taqyea
 
The _Operations_on_Functions_Addition subtruction Multiplication and Division...
mdregaspi24
 
Exploring Multilingual Embeddings for Italian Semantic Search: A Pretrained a...
Sease
 
Choosing the Right Database for Indexing.pdf
Tamanna
 
Data base management system Transactions.ppt
gandhamcharan2006
 
Rocket-Launched-PowerPoint-Template.pptx
Arden31
 
Hadoop_EcoSystem slide by CIDAC India.pptx
migbaruget
 
How to Avoid 7 Costly Mainframe Migration Mistakes
JP Infra Pvt Ltd
 
AI/ML Applications in Financial domain projects
Rituparna De
 
Context Engineering vs. Prompt Engineering, A Comprehensive Guide.pdf
Tamanna
 
Context Engineering for AI Agents, approaches, memories.pdf
Tamanna
 
apidays Helsinki & North 2025 - API-Powered Journeys: Mobility in an API-Driv...
apidays
 
Web Scraping with Google Gemini 2.0 .pdf
Tamanna
 
Resmed Rady Landis May 4th - analytics.pptx
Adrian Limanto
 
AI Presentation Tool Pitch Deck Presentation.pptx
ShyamPanthavoor1
 
apidays Helsinki & North 2025 - Monetizing AI APIs: The New API Economy, Alla...
apidays
 
Building Production-Ready AI Agents with LangGraph.pdf
Tamanna
 
apidays Helsinki & North 2025 - APIs in the healthcare sector: hospitals inte...
apidays
 
Lecture 2-1.ppt at a higher learning institution such as the university of Za...
rachealhantukumane52
 
Product Management in HealthTech (Case Studies from SnappDoctor)
Hamed Shams
 

Developing metadata curation processes for data that can’t be shared openly

  • 1. Developing metadata curation processes for data that can’t be shared openly Rebecca Grant, Graham Smith, Iain Hrynaszkiewicz IllustrationinspiredbytheworkofJohnMaynardKeynes
  • 2. 1 Developing metadata curation processes for data that can’t be shared openly CC-BY-ND 2019 1 The context for curation support
  • 3. 2 Developing metadata curation processes for data that can’t be shared openly CC-BY-ND 2019 Stuart, David; Baynes, Grace; Hrynaszkiewicz, Iain; Allin, Katie; Penny, Dan; Lucraft, Mithu; Astell, Mathias (2018): Whitepaper: Practical challenges for researchers in data sharing https://doi.org/10.6084/m9.figshare.5975011.v1 Practical Challenges for Researchers in Data Sharing white paper A global survey of nearly 8000 researchers
  • 4. 3 Developing metadata curation processes for data that can’t be shared openly CC-BY-ND 2019 Global levels of data sharing: • Poland – 76% (highest) • Germany – 75% • UK – 58% • USA – 55% Private sharing of data is more common than public sharing of data
  • 5. 4 Developing metadata curation processes for data that can’t be shared openly CC-BY-ND 2019 11.52% 16.39% 20.22% 23.03% 28.24% Costs of sharing data Lack of time to deposit data Not knowing which repository to use Unsure about copyright and licensing Organising data in a presentable and useful way Total respondents: 7719 Problems authors face in sharing datasets
  • 6. 5 Developing metadata curation processes for data that can’t be shared openly CC-BY-ND 2019 • Recommended repositories list • Research Data Helpdesk • Research Data Policies Joe Salter Journal Development Editor Graham Smith Senior Research Data Editor Varsha Khodiyar Data Curation Manager Iain Hrynaszkiewicz Head of Data Publishing Rebecca Grant Research Data Manager Data curation at Springer Nature
  • 7. 6 Developing metadata curation processes for data that can’t be shared openly CC-BY-ND 2019 No one other than the creator can access the data, or even knows that it exists Supporting data curation: a researcher’s dataset in a desktop folder
  • 8. 7 Developing metadata curation processes for data that can’t be shared openly CC-BY-ND 2019 Pre-curation data checks:  The data aren’t sensitive  The data don’t include direct or indirect human identifiers  The data shouldn’t be in a community repository  The data are associated with a trusted publication After making these checks, we begin the curation process. If necessary we may recommend that the dataset is split into smaller groups or collections. Once received, we check to make sure that the dataset is suitable for our curation services. Multiple files in any format are accepted. Before curation begins
  • 9. 8 Developing metadata curation processes for data that can’t be shared openly CC-BY-ND 2019 The curated dataset will be published with its own metadata record which includes rich descriptive information, reuse conditions, licence, DOI, metrics and keywords (this example is https://doi.org/10.6084/m9.figshare.5259 415) Working with the researcher’s manuscript or published paper, we draft a comprehensive metadata record for the dataset which is sent to the researcher for approval before being published. Embargoes can be applied if necessary. Metadata curation output
  • 10. 9 Developing metadata curation processes for data that can’t be shared openly CC-BY-ND 2019 9 Addressing the challenges of data that can’t be openly shared
  • 11. 10 Developing metadata curation processes for data that can’t be shared openly CC-BY-ND 2019 • Personally identifiable information. • Special categories of personal information (e.g. as specified by the GDPR or other data protection legislation). • Data revealing the location of rare, endangered or commercially-valuable species. • Commercially sensitive data, for example relating to industrial partners or collected on their behalf. What makes research data sensitive?
  • 12. 11 Developing metadata curation processes for data that can’t be shared openly CC-BY-ND 2019 Datasets should have 0 direct identifiers included  Name  Fingerprint  Facial photographs  Signature  Biometric records  Telephone number Direct identifiers relate directly to an individual and are information that, on its own, allows the clear identification of individuals. Assessing sensitivity of personal data: direct identifiers
  • 13. 12 Developing metadata curation processes for data that can’t be shared openly CC-BY-ND 2019 Datasets should have <3 indirect identifiers included Gender Place of birth Income Race or ethnicity Unusual features, e.g. rare diseases, uncommon job titles, or a large number of children Indirect identifiers are information that allows the identification of individuals through their combination with other available information. Assessing sensitivity of personal data: indirect identifiers
  • 14. 13 Developing metadata curation processes for data that can’t be shared openly CC-BY-ND 2019  De-identified and shared publicly, e.g. in a repository.  Deposited in a controlled access repository.  Access managed and controlled by the researcher (e.g. “available on request”). Sensitive data can still be shared:
  • 15. 14 Developing metadata curation processes for data that can’t be shared openly CC-BY-ND 2019 Journal data policies may require data sharing, or a data availability statement describing how data can be accessed. • Authors may not have the expertise to de-identify data appropriately; editors may not be able to advise. • Alternatively, data are deposited in controlled access repositories (sometimes with minimal metadata). • Authors may also choose to share data on request (e.g. no metadata is available at all). The challenges of sharing sensitive data
  • 16. 15 Developing metadata curation processes for data that can’t be shared openly CC-BY-ND 2019 doi:10.1038/s41523-018-0079-1 Data available on request
  • 17. 16 Developing metadata curation processes for data that can’t be shared openly CC-BY-ND 2019 Working with the curation team to provide editorial support for data sharing:  Reviewing accepted manuscripts.  Providing advice on data sharing.  Creating a metadata catalogue of rich metadata records for every article in the journal’s repository.  Writing detailed data availability statements.  Build on existing data sharing practice at the journal and support more authors to share. npj Breast Cancer is an open access, online- only, multidisciplinary research journal dedicated to publishing the finest research on breast cancer research and treatment. Metadata curation for the journal npj Breast Cancer
  • 18. 17 Developing metadata curation processes for data that can’t be shared openly CC-BY-ND 2019 • Curators connected with authors when an article is accepted in principle. • Advice given on de-identification of data. • Advice given on suitable disciplinary repositories. • Curator reads paper for data-related information. • Additional information requested from author. • Curator creates rich metadata record and DAS. • Metadata and DAS reviewed and approved by author. Curation workflow
  • 19. 18 Developing metadata curation processes for data that can’t be shared openly CC-BY-ND 2019 Data reporting checklist
  • 20. 19 Developing metadata curation processes for data that can’t be shared openly CC-BY-ND 2019 Authors were not responsive even though it held up their article during publication The form was not suitable for paper which cited multiple datasets which need to be described.  Type/format of data  Filenames  Software required  Funder  Additional documentation For sensitive clinical data that aren’t shared openly:  Sample size  Cohort size  Registered trial number  Access requirements Initially the team used a Google form to capture contextual information about the author’s study and accompanying datasets. Gathering contextual metadata
  • 21. 20 Developing metadata curation processes for data that can’t be shared openly CC-BY-ND 2019 Metadata collection is now based on: A review of the author’s paper. + A short spreadsheet filled out by the author. + A review of the author’s datasets available in other repositories. + Email directly to the author where necessary. = Rich contextual information about the datasets + A consistent format for the metadata we use to describe studies Adapting the metadata collection process
  • 22. 21 Developing metadata curation processes for data that can’t be shared openly CC-BY-ND 2019 Example output: metadata record for data available on request • Authors • Title • A description of the study design • Data type • Data format • Number of files, file names • Software required • Access requirements • Funder information • Keywords • Link to associated paper • Metrics The dataset is available on request only due to commercial sensitivity. The metadata record is stored in the npj Breast Cancer figshare portal and includes:
  • 23. 22 Developing metadata curation processes for data that can’t be shared openly CC-BY-ND 2019 https://doi.org/10.1038/s41523-019- 0106-x Example output: data availability statement
  • 24. 23 Developing metadata curation processes for data that can’t be shared openly CC-BY-ND 2019 doi:10.1038/s41523-018-0079-1 Before: data available on request
  • 25. 24 Developing metadata curation processes for data that can’t be shared openly CC-BY-ND 2019 14 submissions (journal articles) to date:  3 were deposited in specialist repositories on the curator’s recommendation (including GEO and dbGap) – Potential risk to funding if this was not done.  1 was a commercially sensitive dataset which required assessment, advised not to share openly – Potential risk of legal liability if shared without permission.  1 paper originally consisted of references to articles for 39 gene expression datasets, which the curator used to create a table including DOIs and accession numbers for each.  800+ views of the metadata records in the repository. Curation impacts so far
  • 26. Other impacts. Improvements to the accessibility of all datasets, particularly those only available on request. An opportunity for the researcher to identify issues in related publications, e.g. incorrect accession codes. Curation without access to sensitive datasets, capitalising on the knowledge of the researcher and the journal editor. Increased accessibility of curation to a larger proportion of researchers, without excluding those who cannot share openly. A demonstration of an approach that’s compatible with generalist or institutional repositories.
  • 27. Thank you. Rebecca Grant, Research Data Manager. https://go.nature.com/ResearchDataServices https://researchdata.springernature.com/ The story behind the image: John Maynard Keynes (1883–1946) was a British economist who revolutionised the theory and practice of macroeconomics, reformed economics and had a profound influence on economic policy. This illustration represents the Keynesian model, which shows that in a monetary economy it is possible to have periods of high unemployment unless governments use active monetary and fiscal policy to stimulate aggregate demand.
  • 28. Image credits. Slide 10: photo by Frida Bredesen on Unsplash. Slide 12: photo by Ashley Edwards on Unsplash.
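
The metadata record fields listed on slide 22 could be held in a simple structure such as the one below. This is only an illustrative sketch: the class name `MetadataRecord`, its attribute names and the `missing_fields` helper are assumptions made for this example, not the schema or tooling actually used by the curation team or by figshare. Metrics are left out because they would normally be generated by the repository rather than supplied by the author.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class MetadataRecord:
    """Hypothetical container mirroring the fields listed on slide 22."""
    authors: List[str] = field(default_factory=list)
    title: str = ""
    study_design: str = ""
    data_type: str = ""
    data_format: str = ""
    file_names: List[str] = field(default_factory=list)
    software_required: str = ""
    access_requirements: str = ""
    funder_information: str = ""
    keywords: List[str] = field(default_factory=list)
    associated_paper: str = ""

    @property
    def number_of_files(self) -> int:
        # Derived from the file names rather than stored separately.
        return len(self.file_names)

    def missing_fields(self) -> List[str]:
        """Return the names of fields that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Placeholder values only; no real dataset is described here.
record = MetadataRecord(
    title="Placeholder dataset title",
    access_requirements="Available on request only due to commercial sensitivity",
)
print(record.missing_fields())
```

A list of empty fields like this is one way a curator might decide what still needs to be gathered from the paper, the spreadsheet or an email to the author.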

Editor's Notes

  • #4: In 2017, Springer Nature surveyed more than 7,700 researchers worldwide, asking specifically about data sharing at the point of submitting an article for publication. The number of respondents from some regions of interest, notably Japan and China, meant that we could not do detailed analysis, so this year we have begun to extend our research to these territories.
  • #6: Main findings: Researchers do share and use one another’s data but lack places to put it. They would value a high quality data publication
  • #7: Mainly covering the left hand side of this list due to time
  • #12: A natural person is a person that is an individual human being, as opposed to a legal person, which may be a private (i.e., business entity or non-governmental organisation) or public (i.e., government) organisation
  • #13: If I were trying to identify a person in a dataset, what kind of information would allow me to recognise them uniquely?
  • #14: Features such as gender or place of birth aren’t usually unique in a dataset, but once they are combined they can become much more identifying. There are far more indirect identifiers than direct, and it can be more difficult to figure out whether they are identifying or not.
  • #15: A natural person is a person that is an individual human being, as opposed to a legal person, which may be a private (i.e., business entity or non-governmental organisation) or public (i.e., government) organisation
  • #16: A natural person is a person that is an individual human being, as opposed to a legal person, which may be a private (i.e., business entity or non-governmental organisation) or public (i.e., government) organisation
  • #21: Note that, as well as authors not filling out the form, the form was difficult to complete for multiple datasets from one study.
  • #22: The process takes around 4 hours of a curator’s time, including gathering the information, drafting the record and allowing review.