FAIRy stories: the FAIR Data principles in theory and in practice

Carole Goble
The University of Manchester, UK
carole.goble@manchester.ac.uk
The views expressed in this talk are my own
NSF Convergence Accelerator Series Tracks A&B webinar, 19th May 2021

March 18, 2021
http://spatial.ucsb.edu/2021/Natasha-Noy

Why do we need FAIR data in Research?
“there must be loads of legacy data. We’re desperately trying to go
back and look at what we knew from SARS 10 years ago”
https://www.covid19dataportal.org/
https://www.rd-alliance.org/group/rda-covid19-rda-covid19-omics-rda-covid19-epidemiology-rda-covid19-clinical-rda-covid19-1
https://doi.org/10.15497/rda00052

COVID Data sharing boost – mobilising people, infrastructure & initiatives
Spotlighted technical, territorial & practices
Provider: collection, upload and governance bottlenecks
User: finding and accessing datasets, licenses, data and metadata quality
Access to data for processing at scale, common standards
Behaviour inertia and relapse
Long term sustainability
“global pandemic is not sufficient to radically modify
scientific practices”*
* Larregue et al https://blogs.lse.ac.uk/impactofsocialsciences/2020/11/30/covid-19-where-is-the-data/

https://www.nature.com/articles/d41586-021-00305-7
https://www.nature.com/articles/s41597-020-0524-5

information flows, secondary use
Figure: Knowledge Turning, Information Flow. Josh Sommer, Chordoma Foundation, 2011
Community domain enclaves
Resource fragmentation
Flow across platforms/ sovereignties
Pan-discipline drivers
Knowledge churn, loss and cost

2016
A set of GUIDING PRINCIPLES to
enhance the value of all digital
resources and their reuse by PEOPLE
and by MACHINES
ALIGNING a COMMUNITY around
common data guidelines
FAIR Research Data

branding a trend
(re)-stimulating a
movement

What ARE the FAIR principles?
Aspirational guardrails
Not a standard, nor metrics
A contract between data
provider and user
In the original paper
https://www.go-fair.org/fair-principles/
Relaunch a dialogue - research and policy communities.
Reboot a journey - wider accessibility and reusability of data.

compare &
combine data
https://doi.org/10.1038/sdata.2016.18

“enhancing the ability of machines to
automatically find and use data or any digital
object, and support its reuse by individuals”
INCF Statement

Persistent identifiers
Globally unique, resolvable for
data and always for metadata
Structured metadata
Community defined descriptive
metadata using common
terminologies and standards
Linked Data
Vocabularies are FAIR, (meta)data
reference (meta)data, provenance
Automation-readiness
Access protocols
Open, free and universally
implementable comms protocols
Semantic Web ->
Linked Data ->
Knowledge Graphs.
Machine-processable
metadata.
[Icons: FAIRsharing]

Open as possible, Closed as necessary
Clear licences for innovation and reuse
Sensitive data, GDPR, IPR, jumpy Deans.
Crossing sovereignty boundaries
• Data sharing becomes data visiting &
federated analysis
An industry in controlled secure access….
• Data Usage Ontology, Beacon Passports,
Trusted Research Environments etc….
Terms of access and use: FAIR ≠ OPEN
FAIR OPEN
SAFE
Privacy preservation
Regulatory rigour

FAIR Implicit Assumptions & Implications
Data are first class objects
Primarily aimed at data creators
and providers for benefit of
consumers.
Operating in an (Open) Data
Ecosystem.
Adoption at scale in legacy
settings.
Data sharing

The Life Sciences & pan-European scale data infrastructure

The Life Sciences Infrastructure Zoo
Flows around a Federated & Diverse System
1466 data repositories
(100+ in EOSC-Life)
916 data format and metadata
standards*
from compounds to clinical trials
https://fairsharing.org/ accessed May 2021
Common standards & agreements
mappings of PIDs and metadata
moving metadata around
accountability and responsibility

FAIR players simplified
Researchers and
company
scientists who
generate and use
the data
Service providers
who manage data
and infrastructure
Local -> Global level
Public -> Commercial
Authorities who
drive policy, practice
& resources
Funders, Policy makers,
Publishers, Professional
societies, Standards
organisations, Institutions

Global and national initiatives
Dedicated projects
Community Orgs
Funders
Policy
Publishers
FAIR first stage
Dedicated Services

Where we are going
Where we are
[Susanna Sansone]
FAIR first stage

FAIR first stage :
Policymakers, Data service providers
How to define, measure compliance and certify FAIR data?
What is a dataset?
General repos vs Curated authoritative archives?
Principles for Data Repositories
https://www.rd-alliance.org/trust-principles-rda-community-effort
https://fairassist.org/

https://www.natureindex.com/news-blog/what-scientists-need-to-know-about-fair-data
Open Data Survey, 2019
81% of researcher
respondents
unfamiliar with FAIR

1. A common mechanism for metadata
Respect and work with the huge legacy
resources: repositories, registries, tools …
community standards
Find, register, index, search resources
Move metadata between services
without APIs
Repositories ->Tools, Aggregators (e.g. licenses)
-> Registries (upload, auto-curation)
Registries -> Registries (across disciplines)
Contribute to Knowledge Graphs
a little bit of semantics at scale
semantic underware
invisible to users
visible to developers & services

Picture: Carole Goble, Turing Lecture 2018
Schema.org: Semantic Mark up for the Web
Cartel of commercial search engines
Wide web use, web infrastructure
Web pages and sitemaps
Types (830+) IceCreamShop
Properties (1300+) hasMenu
Not targeted at science - too much / too little
Dataset type – 120 properties
(Google Data Profile requires 2 properties)
No type for Protein, Gene, Taxon
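
As a hedged illustration of what Schema.org mark-up for a dataset can look like, the sketch below builds a minimal JSON-LD Dataset description. It assumes the two required properties mentioned above are name and description; the function name, the example values and the licence URL are made up for illustration and are not taken from the talk.

```python
import json

def dataset_markup(name, description, **optional):
    """Return a Schema.org Dataset description as a JSON-LD string.

    Assumption: name and description are the two required properties;
    any further Schema.org properties are passed as keyword arguments.
    """
    doc = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
    }
    doc.update(optional)
    return json.dumps(doc, indent=2)

# Hypothetical example values, for illustration only.
markup = dataset_markup(
    "Example protein abundance measurements",
    "An illustrative description of what the dataset contains.",
    license="https://creativecommons.org/licenses/by/4.0/",
)
print(markup)
```

A web page would typically embed such a string in a script element of type application/ld+json so that search engines and registries can harvest it without a dedicated API.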

Harnessing Schema.org for Bioscience
Profile
Data model
Marginality information
Controlled vocabularies
Cardinality
Documentation
Examples
New (properties | types)
definition & consensus
deployment and use
tools & support
Opinionated conventions
Profiles & Link to domain ontologies
Add Bioscience properties & types if necessary
Examples & Usage Guidelines
Community

Harnessing Schema.org for Bioscience
ChemicalSubstance
definition & consensus
deployment and use
tools & support
Opinionated conventions
Profiles & Link to domain ontologies
Add Bioscience properties & types if necessary
Examples & Usage Guidelines
Community

Bioschemas metadata stratification
broad & shallow / deepish & narrowish
Generic
Subject
specific
MolecularEntity,
Protein,
Sample, Taxon,
ChemicalSubstance…
DataCatalog
Dataset
Dataset: 5 minimum, 8 recommended properties
license & provenance
https://bioschemas.org/profiles/
Crosswalks to metadata schemas *
• DCAT, DataCite, CrossRef, OpenAIRE, DDI
• DCT:issued <-> Schema:datePublished
What is a dataset?
Include community ontologies
• Type: ChemicalSubstance
• Property: biologicalRole
• ExpectedType: ChEBI ontology
* https://zenodo.org/record/4420116#.YKFOpaHTX18
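
A crosswalk between metadata schemas can be sketched as a simple property mapping. In the minimal example below only the pairing of dct:issued with schema:datePublished (the Schema.org property name) comes from the slide; the other pairs, the record and the function name are assumptions chosen for illustration, not the published crosswalk.

```python
# Illustrative crosswalk table. Only dct:issued -> schema:datePublished is
# taken from the slide; the remaining pairs are assumed for the example.
DCT_TO_SCHEMA = {
    "dct:issued": "schema:datePublished",
    "dct:title": "schema:name",
    "dct:description": "schema:description",
}

def crosswalk(record, mapping=DCT_TO_SCHEMA):
    """Rename mapped keys and return unmapped ones separately,
    so that nothing is silently dropped during conversion."""
    converted, unmapped = {}, {}
    for key, value in record.items():
        if key in mapping:
            converted[mapping[key]] = value
        else:
            unmapped[key] = value
    return converted, unmapped

# Hypothetical source record.
dct_record = {
    "dct:title": "Example dataset",
    "dct:issued": "2021-05-19",
    "dct:spatial": "Europe",
}
converted, unmapped = crosswalk(dct_record)
print(converted)
print(unmapped)
```

Returning the unmapped properties makes the gaps in a crosswalk visible, which is where most of the consensus work tends to be needed.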

Bioschemas Village
400+ People
22 Types
32 Profiles
65 Sites
60M+ Pages
20+ Countries
120 Profile deployments
bioschemas.org/liveDeploys

MolecularEntity ChemicalSubstance
Toxicology
Data Aggregator
[with thanks: Egon Willighagen]
MolecularEntity
Gene
Protein
Taxon
Dataset

Lessons: Putting FAIR into Practice
A little bit of semantics at scale -> build critical mass
Profiles
• Schema.org culture – Catch 22
• Consensus building, retention & Ontology-itis
Provider mark-up
• Developer friendly in-house tools & wacky web implementations
• Adoption incentives & costs of adapting database processes
Consumer services
• Adoption incentives – Catch 22 & tipping points
• DataCatalog & Dataset popular -> Google Dataset search
Consumer-provider readiness
• Tools and training community take-up….

2. Packaging Research Objects
Gather together into a “crate” files,
unbounded references, & other
crates.
FAIR content: metadata,
identifiers, provenance, citation
about the content
FAIR crates: metadata, PIDs,
provenance, citation about the
crate.
more FAIR middleware -> towards FAIR Digital Objects*
*FAIR Digital Objects for Science: From Data Pieces to Actionable Knowledge Units:
https://doi.org/10.3390/publications8020021

Why “crate up” objects? FAIR+R
Flows:
Researchers work with multiple and
different objects using multiple
infrastructures over periods of time
exchange between platforms and people
Parts:
Research has associated objects
linked together by context
metadata files with files
datasets, scripts, SOPs, articles …
held in different places
made at different times by
different people & processes
publish, report, reuse, cite, reproduce
register, deposit, archive, port
point to big, sensitive & active content

Aggregate files and/or any URI-addressable
content with structured metadata
Web and Linked Data Native
machine and human readable PIDs + JSON-LD +
Schema.org, search engine & developer friendly
Flex for open ended content, respect legacy
typed by a profile + add more schema.org and
domain ontologies
http://www.researchobject.org/ro-crate/
Archive file
format
FAIR Object Middleware
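
To make the "PIDs + JSON-LD + Schema.org" packaging idea concrete, here is a minimal sketch that assembles the metadata file of a crate. It assumes the RO-Crate 1.1 layout (a metadata descriptor plus a root Dataset in a flat @graph); the function name, file paths and labels are hypothetical, and a complete crate would need further properties such as a licence and a publication date.

```python
import json

def make_crate_metadata(name, files):
    """Build a minimal RO-Crate style metadata document.

    files maps a relative path or URI to a human readable label.
    Assumption: RO-Crate 1.1 conventions for the descriptor and root dataset.
    """
    graph = [
        {   # Descriptor: says which entity in the graph is the crate root.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # Root dataset: the crate itself, listing its parts by reference.
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "hasPart": [{"@id": path} for path in files],
        },
    ]
    # One entity per aggregated file or URI-addressable resource.
    graph += [
        {"@id": path, "@type": "File", "name": label}
        for path, label in files.items()
    ]
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}

# Hypothetical content: one local file and one remote resource.
crate = make_crate_metadata(
    "Example analysis crate",
    {
        "data/results.csv": "Result table",
        "https://example.org/protocols/sop-1.pdf": "Standard operating procedure",
    },
)
print(json.dumps(crate, indent=2))
```

Because parts are referenced by identifier, the same structure can point to big, sensitive or remote content without copying it into the package.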

FAIR Middleware
metadata carrying interchange format
Knowledge
Graph of
Research
Objects

It’s FAIR metadata middleware, stupid
• smart use of wheels already invented
• get tools, services on board
• developer friendly, firm best practice
Known and Unknown unknowns
One size does not fit all
• contextual interpretation
• descriptive open-endedness, multi-interpretation
Analogous to FAIR Software
• RDA/ReSA FAIR for Research Software (FAIR4RS) WG

3. Making (legacy) datasets FAIR: FAIRification
[Picture credit: Egon Willighagen]

Credit to: Ian Harrow, FAIR & OM projects
FAIR as enabler for the digital transformation
● Biopharma R&D productivity can be
improved by implementing the FAIR Data
Principles.
● FAIR enables powerful new AI analytics access
to data for machine learning and prediction
● Fairly AI Ready
● Challenges
○ change the culture, show business value,
achieve the ‘FAIR enough’
○ Sustain FAIR solutions and activities
Slide credit: Susanna Sansone

Making (legacy) datasets FAIR: FAIRification
> 100 Public-Private partnerships of
European Commission, universities, SMEs
and Big Pharma translational projects
Pharma’s own datasets

*https://www.go-fair.org/how-to-go-fair/fair-data-point/
Data visiting through a
FAIR Data Point*
Linked Data / RDF tech
Dataset transformation
Methodology
Linkset services
RDF Warehouse (Knowledge Graph)
- API not SPARQL
- Sustainability & maintenance
- Linksets PID mapping services

FAIRification of legacy datasets
Practical
advice
Assessment
processes
FAIR levels of
projects / data
Selection of
datasets
Cost/Benefit
analysis
Methodology
Steps for 1 or
more datasets
Cultural change
Legal templates
Squads & BYODs
Maturity models

Interlinking data from different sources
The lessons of good
global and persistent
identifiers.
Mapping identifiers
and services for
mapping ids to ids and
concepts to concepts.
https://fairplus.github.io/the-fair-cookbook/content/recipes/interoperability/identifier-mapping.html
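
Mapping ids to ids can be sketched as grouping identifiers that are linked by pairwise mappings. The example below is a simplified illustration, not the method of the recipe linked above: it assumes every mapping is an exact, symmetric and transitive equivalence, which real mapping sets often are not, and all identifiers and the function name are hypothetical.

```python
from collections import defaultdict

def build_equivalence_index(pairs):
    """Group identifiers connected by mapping pairs into equivalence sets.

    Assumption: each pair states an exact equivalence, so mappings can be
    followed in both directions and chained.
    """
    neighbours = defaultdict(set)
    for left, right in pairs:
        neighbours[left].add(right)
        neighbours[right].add(left)

    index = {}
    for start in neighbours:
        if start in index:
            continue
        # Walk everything reachable from this identifier.
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(neighbours[node] - group)
        frozen = frozenset(group)
        for member in group:
            index[member] = frozen
    return index

# Hypothetical identifiers from three made-up databases.
pairs = [
    ("dbA:0001", "dbB:X17"),
    ("dbB:X17", "dbC:chem-9"),
    ("dbA:0002", "dbB:X42"),
]
index = build_equivalence_index(pairs)
print(sorted(index["dbA:0001"]))  # ['dbA:0001', 'dbB:X17', 'dbC:chem-9']
```

In practice a mapping service would also record the mapping relation and its provenance, since chaining inexact mappings can merge things that are not the same.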

FAIR by Design
At the start of a collection, built in throughout the life cycle
change management, capacity building
FAIRifying Retrospectively
Legacy datasets, build a cohort,
cost benefit and FAIR readiness over a collection of datasets

FA(I)R
New FAIR Variants
FAIR++
Legal > Organisational >
Semantic > Technical*
Business and change analysis.
Cost Benefit Analysis.
Scientific / Business Value
Sustainability
“…make a decision that
these data are valuable
enough to invest in the work
required for FAIRification.”
interoperability
*EOSC Interoperability Framework

What does FAIRifying a dataset mean?
A database? A PDF? Depositing to a public archive?
Identifier and ontology selecting, assigning,
mapping between and to existing vocabs, and knowing
about ontology services.
High-fidelity ETL: lossless movement of (meta)data
from one system to another

FAIR enough.
Repository manager
Admin monitoring
Bioscientist
Scientific analysis
“Fairness does not mean everyone
gets the same. Fairness means
everyone gets what they need”
(Rick Riordan).
Maturity and importance spectrum
It's not all worth it.
FAIR gardens + FAIR scrub
How to assess FAIR maturity
levels, not to be certified but
to make decisions.
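
One way to picture an assessment that supports decisions rather than certification is a checklist scored per principle. The sketch below is purely illustrative: the indicators, their grouping and the function names are assumptions invented for this example and do not reproduce any published maturity model.

```python
# Hypothetical indicators per FAIR principle, invented for illustration.
CHECKLIST = {
    "F": ["has_persistent_id", "has_descriptive_metadata"],
    "A": ["metadata_retrievable_by_open_protocol"],
    "I": ["uses_community_vocabulary"],
    "R": ["has_clear_licence", "has_provenance"],
}

def assess(dataset_facts):
    """Return the fraction of indicators met for each principle."""
    scores = {}
    for principle, indicators in CHECKLIST.items():
        met = sum(1 for name in indicators if dataset_facts.get(name, False))
        scores[principle] = met / len(indicators)
    return scores

def gaps(dataset_facts):
    """List unmet indicators, i.e. the candidate FAIRification work items."""
    return [
        name
        for indicators in CHECKLIST.values()
        for name in indicators
        if not dataset_facts.get(name, False)
    ]

# Hypothetical facts about one legacy dataset.
facts = {
    "has_persistent_id": True,
    "has_clear_licence": True,
    "uses_community_vocabulary": False,
}
scores = assess(facts)
print(scores)
print(gaps(facts))
```

The list of gaps, weighed against the value of the dataset, is what feeds a cost/benefit decision on whether FAIRification is worth the investment.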

FAIR ≠ FREE - an expensive, expert team sport
Mostly manual,
mostly specific

“It is a truth
universally
acknowledged
that a
Knowledge
Graph must be
in want of FAIR
data.
And FAIR data
is in want of
Knowledge
Graphs.”
harvesting
added value
DataCite PID Graph
Bottlenecks:
identifiers and ontologies
curating and ingest pipelines of data providers

4. FAIR Data by Design at Source
Data management platform for Project Hubs
organising, cataloguing, sharing and publishing
multiple kinds of research objects in multiple
repositories for multi-partner projects.
Community developed Knowledge Hub
for guides, examples, tools, and pointers.
Assembled and written by Life Science
researchers and data stewards for their peers.
https://rdmkit.elixir-europe.org
https://fair-dom.org

Data creators
• Retention not sharing, act local not global
• Advantage*: intimate knowledge, data
flirting, credits & incentives
Process change and values
• Access to infrastructure with seamless
information flows, Values
• Time & resources to embed into practice
FAIR Stewardship skills
• Professionalisation & know-how
*Pasquetto, I. V., Borgman, C. L., & Wofford, M. F. (2019). Uses and Reuses of Scientific Data: The Data Creators’
Advantage. Harvard Data Science Review, 1(2). https://doi.org/10.1162/99608f92.fc14bf2d

Summary: FAIRy stories
Theory -> mobilised some
Practice -> marathon that takes a village
Move the story from data providers to
enabling creators & consumers prepare to
share FAIR -> Research on Research
Authorities Change Mgt
Stewardship
Service Providers
Sustained infrastructure

Acknowledgements
Special thanks to
• Stian Soiland-Reyes (Uni of Manchester/Uni of Amsterdam)
• Nick Juty & Ebtisam Alharbi (University of Manchester)
• Susanna Sansone (University of Oxford)
• Tony Burdett (EMBL-EBI)
• Ibrahim Emam (Imperial College)
• Egon Willighagen (Maastricht University)
• Alasdair Gray (Heriot-Watt University)
Manchester, Research Object, RDMkit, FAIRDOM, FAIRplus, Bioschemas colleagues
(about 130 people)
Icons from the noun project
(https://thenounproject.com/)
