Supported by the NIH grant 1U24 AI117966-01 to UCSD
PI, Co-Investigators at:
The model annotated with schema.org
Susanna-Assunta Sansone, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra
Oxford e-Research Centre, University of Oxford, UK
Just as JATS (Journal Article Tag Suite) is used by PubMed to index literature,
DATS (DatA Tag Suite) is needed to index data sources in a scalable way
in the DataMed prototype
A community effort
What is DATS supposed to do and be?
• Enable discoverability: find and access datasets
• Focus on surfacing key metadata descriptors, such as
  ◦ information about, and relations between, authors, datasets, publications, funding sources, the nature of the biological signal and perturbation, etc.
• Not the perfect model to represent the experimental details
  ◦ the level of detail and metadata needed to ensure interoperability and reusability is left to the indexed databases
• Better than just having keywords
  ◦ we have aimed for maximum coverage of use cases with a minimal number of data elements and relations
The development process in a nutshell
Metadata elements identified by combining two complementary approaches:
• USE CASES: top-down approach
• SCHEMAS: bottom-up approach
Model serialized as JSON schemas and mapped to schema.org (v1.0, v1.1, v2.0, v2.1)
Standing on the shoulders of giants
bottom-up approach
• schema.org
• DataCite
• RIF-CS
• W3C HCLS dataset descriptions (mapping of many models including DCAT, PROV, VoID, Dublin Core)
• Project Open Metadata (used by HealthData.gov; being added in this new iteration)
• ISA
• BioProject
• BioSample
• MINiML
• PRIDE-ml
• MAGE-TAB
• GA4GH metadata schema
• SRA XML
• CDISC SDM / elements of the BRIDG model
Convergence of elements extracted from competency questions and existing (generic and biomedical) data models (incl. DataCite, DCAT, schema.org, HCLS dataset, RIF-CS, ISA-Tab, SRA-xml etc.)
model for scalable indexing
Adoption of elements extracted from and from
core entities
extended entities
General design of the DATS model
• Dataset, a core entity catering for any unit of information
  ◦ archived experimental datasets, which do not change after deposition to the repository => examples available for dbGaP, GEO, ClinicalTrials.gov
  ◦ datasets in reference knowledge bases, describing dynamic concepts, such as “genes”, whose definition morphs over time => examples available for UniProt
• Dataset entity is also linked to other digital research objects
  ◦ Software and Data Standard, which are also part of the NIH Commons but are the focus of other discovery indexes, and are therefore not described in detail in this model
Serializations and use of schema.org
• DATS model in JSON schema, serialized as:
  ◦ JSON* format, and
  ◦ JSON-LD** with vocabulary from schema.org
  ◦ serializations in other formats can also be produced, as / if needed
• Benefits for DataMed and the databases indexed by DataMed
  ◦ Increased visibility (to popular search engines), accessibility (via common query interfaces) and possibly improved ranking
• Extending schema.org
  ◦ Submitted the DATS core elements missing from schema.org to its tracker
  ◦ Coordinating the extension of schema.org for life science via the bioschemas.org initiative (which ELIXIR is also part of)
* JavaScript Object Notation
** JavaScript Object Notation for Linked Data
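To make the JSON / JSON-LD distinction concrete, here is a minimal sketch of wrapping a dataset record as JSON-LD with the schema.org vocabulary. It is not an official DATS serialization: the record values, the example URLs and the helper name `to_jsonld` are invented for illustration. The property names (`name`, `identifier`, `creator`, `distribution`, `contentUrl`, `encodingFormat`, `license`) are schema.org terms, not DATS property names.

```python
import json

# A dataset record whose keys are schema.org property names.
# All values are invented for illustration only.
dataset = {
    "name": "Example expression dataset",
    "description": "Hypothetical dataset used only to illustrate serialization.",
    "identifier": "EXAMPLE:0001",
    "creator": [{"@type": "Person", "name": "Jane Doe"}],
    "distribution": [
        {
            "@type": "DataDownload",
            "contentUrl": "https://example.org/data/0001.tsv",
            "encodingFormat": "text/tab-separated-values",
        }
    ],
    "license": "https://creativecommons.org/licenses/by/4.0/",
}


def to_jsonld(record: dict) -> dict:
    """Add a JSON-LD context and type so the record reads as a schema.org Dataset."""
    return {"@context": "https://schema.org", "@type": "Dataset", **record}


jsonld = to_jsonld(dataset)
print(json.dumps(jsonld, indent=2))
```

The only difference between the two forms is the `@context` and `@type` annotations; those are what allow a generic JSON-LD consumer, such as a search engine crawler, to interpret the keys.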
core and extended elements
Core elements provide the basic info
• What is the dataset about?
  ◦ Material
• How was the dataset produced? Which information does it hold?
  ◦ Dataset / Data Type with its Information, Method, Platform, Instrument
• Where can a dataset be found?
  ◦ Dataset, Distribution, Access objects (links to License)
• When was the dataset produced, released etc.?
  ◦ Dates to specify the nature of an event {create, modify, start, end...} and its timestamp
• Who did the work, funded the research, hosts the resources etc.?
  ◦ Person, Organization and their roles, Grant
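The what / how / where / when / who grouping can be pictured as a small set of linked record types. The sketch below is only an illustration of that grouping: the class and field names are hypothetical and do not reproduce the DATS JSON schemas, and the example values are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Person:
    name: str
    roles: List[str] = field(default_factory=list)


@dataclass
class Grant:
    identifier: str
    funder: str  # name of the funding organization


@dataclass
class DateEvent:
    kind: str       # nature of the event: "create", "modify", "start", "end", ...
    timestamp: str  # ISO 8601 text, e.g. "2017-01-15"


@dataclass
class Distribution:
    access_url: str
    license: Optional[str] = None


@dataclass
class Dataset:
    title: str
    materials: List[str] = field(default_factory=list)               # what is it about?
    data_types: List[str] = field(default_factory=list)              # how produced, what it holds
    distributions: List[Distribution] = field(default_factory=list)  # where can it be found?
    dates: List[DateEvent] = field(default_factory=list)             # when?
    people: List[Person] = field(default_factory=list)               # who did the work?
    grants: List[Grant] = field(default_factory=list)                # who funded it?


# Invented example values, for illustration only.
example = Dataset(
    title="Hypothetical expression profiling dataset",
    materials=["liver tissue"],
    data_types=["transcription profiling"],
    distributions=[Distribution("https://example.org/datasets/1", license="CC BY 4.0")],
    dates=[DateEvent("create", "2017-01-15")],
    people=[Person("Jane Doe", roles=["principal investigator"])],
    grants=[Grant("EXAMPLE-0001", funder="Example Funder")],
)

print(example.title, "created on", example.dates[0].timestamp)
```

Every field except `title` defaults to empty, mirroring the idea that most descriptors are optional and an index can still hold a sparse record.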
Of the 18 core elements none is mandatory
Only a few properties of the 18 core elements are mandatory
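In a JSON schema, mandatory properties are listed under the `required` keyword, and everything else is optional. The sketch below shows that idea with a hand-written check rather than a full JSON Schema validator. The schema fragment is hypothetical: which properties the actual DATS schemas mark as required is not stated here, so `title` is an assumption chosen only for illustration.

```python
# Hypothetical schema fragment; the choice of "title" as the only
# required property is an assumption, not taken from the DATS schemas.
dataset_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "description": {"type": "string"},
        "creators": {"type": "array"},
    },
    "required": ["title"],
}


def missing_required(schema: dict, record: dict) -> list:
    """Return the required property names that are absent from the record."""
    return [name for name in schema.get("required", []) if name not in record]


print(missing_required(dataset_schema, {"title": "Example"}))        # []
print(missing_required(dataset_schema, {"description": "No title"}))  # ['title']
```

This covers only the presence check; a real validator would also verify types and nested objects.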
Implementations and documentation
Other adopters exporting DATS in their APIs
To evaluate DATS model capabilities
Work in progress: documentation and curation guidelines for adopters
relations to other BD2K efforts
• Mapping DATS to the OmicsDI model
  ◦ To be able to index datasets in this aggregator
• For datasets not yet in a formal repository
  ◦ the CEDAR metadata authoring tool can be used to provide DATS-compliant metadata to be later indexed by DataMed
• Ensure that the citation metadata for repositories’ landing pages maps to core DATS elements
Interlinking to other indexes
documentation


NIH BD2K DataMed model, DATS
