Give open access to your data
and make it ready to be mined
Daniel Jacob
UMR 1332 BFP – Metabolism Group
Bordeaux Metabolomics Facility
May 2016
Open Data for Access and Mining
A data explorer as a bonus
EDTMS
ODAM
Daniel Jacob – INRA UMR 1332 – May 2016
The experimental context: needs / wishes
Experimental workflow (with sample identifiers): seeding, harvesting, sample preparation, sample analysis
Figure: Experiment Design, Experiment Data Tables, Web API
Needs:
- identifiers centrally managed
- data sharing & data availability
- make both metadata and data available, to facilitate the subsequent data mining
- develop, if needed, lightweight tools: R scripts (Galaxy), lightweight GUI (R Shiny)
Open Data for Access and Mining: the core idea in one shot
Implementation of an Experiment Data Tables Management System (EDTMS)
• Data capture with minimal effort (PUT): merely dropping data files into a data repository (e.g. a local NAS or a remote storage space) should allow users to access them through a web API (GET).
• The data repository is mounted on the web server (myhost.org in the examples, i.e. http://myhost.org/).
• Data can be downloaded, explored and mined.
• No database schema, no programming code and no additional configuration on the server side.
Preparation and cleaning of the data subset files
Data subset files: plants.tsv, harvests.tsv, samples.tsv, compounds.tsv, enzymes.tsv
• Whatever the kind of experiment, the approach assumes a design of experiment (DoE) involving individuals, samples or other items as the main objects of study (e.g. plants, tissues, bacteria, …).
• It also assumes the observation of dependent variables resulting from the effects of some controlled experimental factors.
• Moreover, each object of study usually has an identifier, and the variables can be quantitative or qualitative.
• There can be a single type of object of study or several; in the latter case, there must be a relationship between the object types, which we assume to be of the "obtainedFrom" type.
Classification of each column within its right category
Data subset files: plants.tsv, harvests.tsv, samples.tsv, compounds.tsv, enzymes.tsv
Categories: identifier, factor, quantitative, qualitative, link
Data subset files and their associated metadata files must comply with the TSV (tab-separated values) format.
• You have to organize your data subsets so that links can be established between them.
• In practice, this means adding a column containing the identifiers of the entity to which you want to connect the subset, which implies an "obtainedFrom" relation.
• Note that this duplication of identifiers must be the only redundant information across all data subsets.
Connections between the dataset files based on identifiers
Data subset files and their entities (concepts): plants.tsv (Plants), harvests.tsv (Harvests), samples.tsv (Samples), compounds.tsv (Compounds), enzymes.tsv (Enzymes)
Figure legend:
- Identifier of the central entity of the subset
- Link between 2 subsets, established through identifiers (implies an "obtainedFrom" relation)
- Categories: identifier, factor, quantitative, qualitative, link
Creation of the metadata files
Supplementary files
To allow data to be explored and mined, we have to add some minimal but relevant metadata. For that, 2 metadata files are required:
• s_subsets.tsv: a file that associates each data subset with a key concept corresponding to the main entity of the subset, and that records the "obtainedFrom" relations between these concepts
• a_attributes.tsv: a file that annotates each attribute (concept/variable) with some minimal but relevant metadata
Data subset files and their associated metadata files must comply with the TSV (tab-separated values) format.
Note: TSV is an alternative to the common comma-separated values (CSV) format, which often causes difficulties because of the need to escape commas.
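The note on escaping commas can be illustrated with a short Python sketch. It writes the same two rows once with a comma delimiter and once with a tab delimiter, using the standard csv module. The row content (an attribute description containing a comma) is invented for the illustration and does not come from the FRIM1 dataset.

```python
import csv
import io


def serialize(rows, delimiter):
    """Write rows with the csv module and return the resulting text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical attribute descriptions; the second field contains a comma.
rows = [
    ["entry", "description"],
    ["treatment", "Control, or water stress"],
]

csv_text = serialize(rows, ",")
tsv_text = serialize(rows, "\t")

print(csv_text)  # the description has to be wrapped in quotes
print(tsv_text)  # the description is written unchanged
```

With a tab delimiter, free-text fields containing commas need no quoting, which keeps the files easy to edit in a spreadsheet.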
Creation of the metadata files
s_subsets.tsv: this metadata file associates a key concept with each data subset file.
Each row describes one data subset with:
- the unique rank number of the data subset
- the key concept (i.e. the main entity) associated with the subset, in the form of a short name
- the identifier of the central entity of the subset
- the link between 2 subsets (implies an "obtainedFrom" relation)
- the data file name
Example (figure): plants.tsv (Plants, rank 1, identifier PlanteID), harvests.tsv (Harvests, Lot), samples.tsv (Samples, SampleID), compounds.tsv (Compounds, SampleID), enzymes.tsv (Enzymes, SampleID); the subsets are ranked from 1 to 5.
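To make the role of the rank and of the "obtainedFrom" link more concrete, here is a minimal Python sketch that reads a small subsets table and walks the links from a subset back to the root. The column names (rank, link_to, subset, identifier, file), the value 0 for "no parent" and the ranks given to harvests and samples are assumptions made for this illustration; the slides only show the information each row carries, not the exact header of s_subsets.tsv.

```python
import csv
import io

# Hypothetical s_subsets.tsv content. The column names and the
# link_to values are assumptions made for this sketch.
S_SUBSETS = (
    "rank\tlink_to\tsubset\tidentifier\tfile\n"
    "1\t0\tplants\tPlanteID\tplants.tsv\n"
    "2\t1\tharvests\tLot\tharvests.tsv\n"
    "3\t2\tsamples\tSampleID\tsamples.tsv\n"
)


def read_subsets(text):
    """Index the rows of the subsets table by subset short name."""
    rows = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["subset"]: row for row in rows}


def lineage(subsets, name):
    """Return subset names from the root down to `name`, following links."""
    by_rank = {row["rank"]: row for row in subsets.values()}
    chain = []
    current = subsets[name]
    while current is not None:
        chain.append(current["subset"])
        # A link_to value with no matching rank (here 0) ends the walk.
        current = by_rank.get(current["link_to"])
    return list(reversed(chain))


subsets = read_subsets(S_SUBSETS)
print(lineage(subsets, "samples"))
```

The resulting chain is the list of subsets that would have to be combined to see a sample together with the harvest and the plant it was obtained from.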
Creation of the metadata files
a_attributes.tsv: this metadata file annotates each attribute (variable) with some minimal but relevant metadata.
Each attribute belongs to a subset (Plants, Harvests, Samples, Compounds, …) and is assigned a category (identifier, factor, quantitative, qualitative).
Updating the metadata files
Additional subsets and attributes can be added to s_subsets.tsv and a_attributes.tsv step by step, as soon as data are produced.
Uploading your datasets to the data repository
• Merely dropping data files into the data repository (e.g. a NAS) should allow users to access them through the web API.
• In the example, your data subset files are copied into a dataset entry named 'frim1' within the data repository, here the storage space Z:.
• Data subset files and their associated metadata files must comply with the TSV (tab-separated values) format.
• No database schema, no programming code and no additional configuration on the server side.
Data capture with minimal effort (PUT): the data repository is mounted on the server (myhost.org), which answers the GET requests.
Checking online whether your data subset files are consistent
http://myhost.org/check/frim1
The server (myhost.org) reads the dataset from the data repository (NAS storage). Many checks can be done automatically for you.
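The slides do not list which checks the online service performs. As an illustration of the kind of consistency rule implied by the linking scheme, the following Python sketch verifies that every link value in a child subset refers to an existing identifier in its parent subset, and that parent identifiers are unique. The function name, the column name treatment and the miniature tables are hypothetical.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def check_link(child_rows, link_column, parent_rows, id_column):
    """Return a list of human-readable problems (empty when consistent)."""
    problems = []
    parent_ids = [row[id_column] for row in parent_rows]
    duplicates = {i for i in parent_ids if parent_ids.count(i) > 1}
    for dup in sorted(duplicates):
        problems.append(f"duplicate identifier {dup!r} in {id_column}")
    known = set(parent_ids)
    for line_no, row in enumerate(child_rows, start=2):  # line 1 is the header
        if row[link_column] not in known:
            problems.append(
                f"line {line_no}: {link_column}={row[link_column]!r} has no parent"
            )
    return problems


# Hypothetical data: the harvest on the last line refers to an unknown plant.
plants = read_tsv("PlanteID\ttreatment\nP1\tControl\nP2\tWaterStress\n")
harvests = read_tsv("Lot\tPlanteID\nL1\tP1\nL2\tP2\nL3\tP9\n")

for problem in check_link(harvests, "PlanteID", plants, "PlanteID"):
    print(problem)
```

Running such a check locally before dropping the files into the repository is optional; it only mirrors, under the stated assumptions, one rule that the linking scheme requires.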
Open Data, Access and Mining: web API
Minimal effort (PUT), maximal efficiency (GET)
Workflow: seeding, harvesting, sample preparation, sample analysis, then data storage. The data (FRIM1 (*) in the example) are formatted as TSV, linked, checked and deposited (PUT).
After depositing your complete dataset as described previously:
• Open access is given to your data through the web API
• They are ready to be mined
• No specific code or additional configuration is needed
(*) https://www.erasysbio.net/index.php?index=266
Open Data, Access and Mining: web API
REST services: hierarchical tree of resource naming (URL)
GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … >
Resource tree:
<data format> (xml/tsv/json)
  <dataset name> (e.g. frim1)
    <subset> or (<subset>)
      <entry>/<value> (retrieving data)
      <category> (retrieving metadata)
Categories: identifier, factor, quantitative, qualitative, link
Example dataset: FRIM1 (*)
(*) https://doi.org/10.5281/zenodo.154041
Open Data, Access and Mining: web API
REST services: hierarchical tree of resource naming (URL)
GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … >
Field | Description | Example
<data format> | Format of the retrieved data; possible values are 'xml', 'tsv' or 'json' | xml
<dataset name> | Short name (tag) of your dataset | frim1
<subset> | Short name of a data subset | samples
<entry> | Name of an attribute entry, defined by the user in the a_attributes file (column 'entry') | sampleid
<category> | Name of the attribute category, assigned by the user in the a_attributes file (column 'category'); possible values are 'identifier', 'factor', 'qualitative', 'quantitative' | quantitative
(<subset>) | Set of data subsets obtained by merging all the subsets with a lower rank than the specified subset, following the pathway defined by the "obtainedFrom" links | (samples) = plants + harvests + samples
<value> | Exact value of the desired entry or category | 1, factor
Open Data, Access and Mining: web API
REST services: hierarchical tree of resource naming (URL)
GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … >
• Get the subset list of a dataset:
  http://myhost.org/getdata/<data format>/<dataset name>
• Get all values within a data subset:
  http://myhost.org/getdata/<data format>/<dataset name>/<subset>
• Get values within a data subset for a specific value of an entry:
  http://myhost.org/getdata/<data format>/<dataset name>/<subset>/<entry>/<value>
• Get all values within a set of data subsets:
  http://myhost.org/getdata/<data format>/<dataset name>/(<subset>)
• Get values within a set of data subsets for a specific value of an entry:
  http://myhost.org/getdata/<data format>/<dataset name>/(<subset>)/<entry>/<value>
• Get the attribute list within a set of data subsets for a specific category:
  http://myhost.org/getdata/<data format>/<dataset name>/(<subset>)/<category>
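The URL patterns above are regular enough to be assembled by a small helper. The following Python sketch is one possible way to do it; the function name getdata_url and its merged flag are invented for the illustration, and percent-encoding of entry names and values is an assumption about what the server accepts.

```python
from urllib.parse import quote


def getdata_url(base, data_format, dataset, subset=None, *path, merged=False):
    """Build a getdata URL following the patterns listed above.

    merged=True wraps the subset name in parentheses, which selects the
    merged set of subsets instead of the single subset.
    """
    parts = [base.rstrip("/"), "getdata", data_format, dataset]
    if subset is not None:
        parts.append(f"({subset})" if merged else subset)
        # Percent-encode the remaining segments (entry, value or category).
        parts.extend(quote(str(p), safe="") for p in path)
    return "/".join(parts)


BASE = "http://myhost.org"  # placeholder host used on the slides

print(getdata_url(BASE, "xml", "frim1"))
print(getdata_url(BASE, "xml", "frim1", "harvests", "lot", 1))
print(getdata_url(BASE, "xml", "frim1", "compounds", "quantitative", merged=True))
```

The three calls cover the subset list of a dataset, a data subset filtered on one entry value, and the attribute list of a category within a merged set of subsets.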
Open Data Access via web API: examples based on FRIM1
Metadata: http://myhost.org/getdata/xml/frim1
Data: http://myhost.org/getdata/xml/frim1/plants
Data: http://myhost.org/getdata/xml/frim1/harvests/lot/1
Metadata: http://myhost.org/getdata/xml/frim1/(compounds)/quantitative
http://myhost.org/getdata/xml/frim1/(samples)/treatment/Control
(samples) designates the set of data subsets obtained by merging all the subsets with a lower rank than the specified subset, following the pathway defined by the "obtainedFrom" links:
(samples) = plants + harvests + samples
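The merge performed for (samples) can be pictured as successive joins on the duplicated identifier columns. The following Python sketch reproduces that idea on three miniature tables; the table contents and the helper merge_on are hypothetical, and the server's actual merge logic is not described in the slides.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def merge_on(parent_rows, child_rows, key):
    """Attach the parent's columns to each child row sharing the same key."""
    parents = {row[key]: row for row in parent_rows}
    merged = []
    for child in child_rows:
        parent = parents.get(child[key])
        if parent is not None:  # keep only rows whose link resolves
            merged.append({**parent, **child})
    return merged


# Hypothetical miniature subsets; each child repeats its parent's identifier.
plants = read_tsv("PlanteID\ttreatment\nP1\tControl\nP2\tWaterStress\n")
harvests = read_tsv("Lot\tPlanteID\nL1\tP1\nL2\tP2\n")
samples = read_tsv("SampleID\tLot\nS1\tL1\nS2\tL2\n")

# (samples) = plants + harvests + samples
merged = merge_on(merge_on(plants, harvests, "PlanteID"), samples, "Lot")

# Local equivalent of the request .../(samples)/treatment/Control
control = [row for row in merged if row["treatment"] == "Control"]
print(control)
```

Because the factor 'treatment' is stored only once, in the plants subset, it becomes available for every sample only after the merge; this is why the duplicated identifiers are the only redundancy the data subsets need.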
Open Data Access via web API: application layer
Minimal effort, maximal efficiency: once the data (FRIM1 in the example) are formatted as TSV and linked, they can be used through an application layer.
Use existing tools
- Spreadsheets, RStudio, BioStatFlow, Galaxy, Cytoscape, …
Open Data Access via web API: application layer
Retrieving data within R: the R package Rodam
Open Data Access via web API: Rodam package
URL fields in this example: <data format> = tsv, <dataset name> = frim1, (<subset>) = (samples), <entry> = sample, <value> = 365
GET http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(samples)/sample/365
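The slides use the Rodam R package for this step. For readers working outside R, the same GET request can be issued with any HTTP client, since the response is plain TSV. The Python sketch below is an analogue under stated assumptions, not the Rodam API: the function names are invented, the response is assumed to be UTF-8 encoded, and the column names in the offline sample are placeholders rather than the real FRIM1 attributes.

```python
import csv
import io
from urllib.request import urlopen


def parse_tsv(text):
    """Turn a tab-separated response body into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def fetch_rows(url, timeout=30):
    """Download a TSV response and parse it (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        text = response.read().decode("utf-8")  # encoding is an assumption
    return parse_tsv(text)


# Offline demonstration on a hypothetical response body; the column
# names below are placeholders, not the real FRIM1 attributes.
sample_body = "SampleID\ttreatment\n365\tControl\n"
rows = parse_tsv(sample_body)
print(rows[0]["treatment"])

# With network access, and if the service is still online:
# rows = fetch_rows("http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(samples)/sample/365")
```

The network call is left commented out so that the sketch runs without depending on the availability of the server.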
Open Data Access via web API: Rodam package
Read metadata, i.e. the category types within the data: get the data subset 'activome' along with its metadata.
URL fields in this example: <data format> = tsv, <dataset name> = frim1, (<subset>) = (activome), <category> = factor
GET http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(activome)/factor
ODAM facilitates the subsequent data mining
Make both metadata and data available for data mining: from experimentation and analysis, through data and metadata, to data mining (MFA, rCCA, pLDA, …).
Figure: data subsets activome and qNMR_metabo (log10 transformed), retrieved with the Rodam package; all development stages, all treatments (Control, Water Stress).
Open Data Access via web API: application layer
Minimal effort, maximal efficiency (FRIM1 data formatted as TSV and linked)
Use existing tools
- Spreadsheets, RStudio, BioStatFlow, Galaxy, Cytoscape, …
Develop, if needed, lightweight tools
- R scripts (Galaxy), lightweight GUI (R Shiny)
FRIM - Fruit Integrative Modelling: data explorer
http://www.bordeaux.inra.fr/pmb/dataexplorer/?ds=frim1
• To remove an item from the selection: i) click on it, then ii) press the 'Suppr' (Delete) key
• Explore several possibilities by interacting with the graph
To summarize
1. Preparation and cleaning of the data subset files
2. Classification of each column within its right category
3. Connections between the dataset files based on identifiers
4. Creation of the definition files, namely s_subsets.tsv and a_attributes.tsv
5. Deposit of the dataset files in the data repository
6. Checking online whether your data subset files are consistent
7. Testing the web services online on your dataset
8. Use of the web API through an application layer (R scripts, data explorer, ...)
Data subset files and their associated metadata files must comply with the TSV (tab-separated values) format.
Note: TSV is an alternative to the common comma-separated values (CSV) format, which often causes difficulties because of the need to escape commas (see https://en.wikipedia.org/wiki/Tab-separated_values).
Advantages of this approach
Data sharing & data availability
- The "plants" table may be created even before planting the seeds.
- Similarly, the "harvests" table can be created as soon as the harvests are done, before any analysis.
- Thus, these tables are generated only once in the project, and sharing can be set up as soon as the seeds are planted. Each analysis then complements the dataset as soon as it produces its own data subset.
- Data are accessible to everyone as soon as they are produced.
Identifiers centrally managed
- Data are archived and compiled, so there is no longer any need for a laborious investigation to find out who holds the right identifiers, etc.
Facilitate the subsequent publication of data
- Data are already readily available online through the web API.
- Nothing prevents using these data to fill in existing databases, adding more elaborate annotations.
- Neither administrator privileges nor any programming skills are required.
Data capture with minimal effort (PUT); data analysis and mining with maximum efficiency (GET).
Minimal effort, maximum efficiency
Format the data
- Based on TSV: the choice is to keep the scientists' familiar way of working with worksheets, thus i) using the same tool for both data files and metadata definition files, and ii) requiring no programming skills
Give access through a web services layer
- Based on current standards (REST)
Use existing tools
- Spreadsheets, RStudio, BioStatFlow (biostatflow.org), Galaxy, Cytoscape, …
Develop, if needed, lightweight tools
- R scripts, lightweight GUI (R Shiny)
Have fun!
Open Data for Access and Mining
https://hub.docker.com/r/odam/getdata/
http://www.bordeaux.inra.fr/pmb/dataexplorer/
https://github.com/INRA/ODAM
https://cran.r-project.org/package=Rodam
https://zenodo.org/record/154041
An online example

Odam: Open Data, Access and Mining

  • 1. Give an open access to your data and make them ready to be mined Daniel Jacob UMR 1332 BFP – Metabolism Group Bordeaux Metabolomics Facility May 2016 Open Data for Access and Mining A data explorer as bonus EDTMS ODAM
  • 2. Daniel Jacob – INRA UMR 1332 –May 2016 The experimental context: needs / wishesseeding harvesting samples preparation samples analysis Sample identifiers 2 Experiment Data Tables Experiment Design Web API Develop if needed, lightweight tools - R scripts (Galaxy), lightweight GUI (R shiny) Make both metadata and data available for data mining identifiers centrally managed data sharing & data availability facilitate the subsequent data mining 1 2 3 EDTMS ODAM Open Data for Access and Mining : The core idea in one shot
  • 3. Daniel Jacob – INRA UMR 1332 –May 2016 Data repository Data capture Minimal effort (PUT) PUT myhost.org http://myhost.org/ mount GET Implementation of an Experiment Data Tables Management System (EDTMS) Experiment Data Tables Merely dropping data files in a data repository (e.g. a local NAS or distant storage space) should allow users to access them by web API Data can be downloaded, explored and mined No database schema, no programming code and no additional configuration on the server side. Open Data for Access and Mining : The core idea in one shot EDTMS ODAM 3
  • 4. Daniel Jacob – INRA UMR 1332 –May 2016 plants.tsv harvests.tsv samples.tsv compounds.tsv Data subset files enzymes.tsv • Whatever the kind of experiment, this assumes a design of experiment (DoE) involving individuals, samples or whatever things, as the main objects of study (e.g. plants, tissues, bacteria, …) • This also assumes the observation of dependent variables resulting of effects of some controlled experimental factors. • Moreover, the objects of study have usually an identifier for each of them, and the variables can be quantitative or qualitative. • We can have either one object type of study or several kinds, but in this latter case, it must exist a relationship between object types that we assume of “obtainedFrom" type. Preparation and cleaning of the data sub-sets of files EDTMS ODAM 4
  • 5. Daniel Jacob – INRA UMR 1332 –May 2016 plants.tsv harvests.tsv samples.tsv compounds.tsv Classification of each column within its right category enzymes.tsv Data subset files factor quantitative qualitative identifier link categories EDTMS ODAM 5 Data subsets files and their associated metadata files must be compliant with the TSV standard (Tab-Separator-Values) • You have to organize your data subsets so that links could be established between them. • In practical, it means to add a column containing the identifiers corresponding to the entity to which you want to connect the subset, implying a ‘obtainedFrom’ relation. • It is to be noted that this duplication of identifiers must be the only redundant information, through all data subsets.
  • 6. Daniel Jacob – INRA UMR 1332 –May 2016 plants.tsv harvests.tsv samples.tsv enzymes.tsv Data subset files compounds.tsv Plants Harvests Samples Compounds Enzymes Connections between the dataset files based on identifiers Entities (concepts) Link between 2 subsets being carried out from identifiers (implies a ‘obtainedFrom’ relation) Identifier of the central entity of the subset EDTMS ODAM factor quantitative qualitative identifier link categories 6
  • 7. Daniel Jacob – INRA UMR 1332 –May 2016 Supplementary files In order to allow data to be explored and mined, we have to adjoin some minimal but relevant metadata: For that, 2 metadata files are required • s_subsets.tsv: a file allowing to associate with each subset of data a key concept corresponding to the main entity of the subset and the relations of the type "obtainedFrom" between these concepts • a_attributes.tsv: a metadata file allowing each attribute (concept/variable) to be annotated with some minimal but relevant metadata Creation of the metadata files EDTMS ODAM 7 Data subsets files and their associated metadata files must be compliant with the TSV standard (Tab-Separator-Values)Note: TSV is an alternative to the common comma-separated values (CSV) format, which often causes difficulties because of the need to escape commas
  • 8. Daniel Jacob – INRA UMR 1332 –May 2016 s_subsets.tsv This metadata file allows to associate a key concept to each data subset file Creation of the metadata files EDTMS ODAM 8 Plants Compounds Enzymes Harvests Samples plants.tsv PlanteID harvests.tsv Lot samples.tsv SampleID compounds.tsv enzymes.tsv SampleID SampleID 1 2 3 4 5 Identifier of the central entity of the subset Link between 2 subsets (implies a ‘obtainedFrom’ relation) Unique rank number of the data subset Key concept (i.e. the main entity) associated to the subset in the form of a short name Plants1 factor quantitative qualitative identifier categories PlanteID plants.tsv Data file name
  • 9. Daniel Jacob – INRA UMR 1332 –May 2016 a_attributes.tsv This metadata file allows each attribute (variable) to be annotated with some minimal but relevant metadata Creation of the metadata files EDTMS ODAM 9 factor quantitative qualitative identifier categories Plants Harvests Samples Compounds … …
  • 10. Daniel Jacob – INRA UMR 1332 –May 2016 s_subsets.tsv a_attributes.tsv … … Additional subsets/ attributes can be added step by step, as soon as data are produced. Updating the metadata files EDTMS ODAM
  • 11. Daniel Jacob – INRA UMR 1332 –May 2016 Uploading your datasets in the data repository EDTMS ODAM No database schema, no programming code and no additional configuration on the server side. Your data subset files Your dataset entry (named ‘frim1’ as example) within the data repository Z: (Storage) Merely dropping data files on the data repository (e.g. NAS) should allow users to access them by web API Data subsets files and their associated metadata files must be compliant with the TSV standard (Tab-Separator-Values) Data repository PUT myhost.orgmount GET Data capture Minimal effort (PUT)
  • 12. Daniel Jacob – INRA UMR 1332 –May 2016 http://myhost.org/check/frim1 myhost.org StorageDataRepos NAS Checking online if your the data subset files are consistent EDTMS ODAM Many test checks can be automatically done for you
  • 13. Daniel Jacob – INRA UMR 1332 –May 2016 EDTMS ODAM Data storage seeding harvesting samples analysis samples preparation 13 GET , maximal efficiency (GET) After depositing your complete dataset as described previously: • An open access is given to your data through web API • They are ready to be mined • No specific code or additional configuration are needed (*) https://www.erasysbio.net/index.php?index=266 minimal effort (PUT) PUT Format TSV Data Data Linking Preparation and cleaning of the data sub-sets of files FRIM1(*) Check Open Data, Access and Mining : web API
  • 14. Daniel Jacob – INRA UMR 1332 –May 2016 Data Format TSV EDTMS ODAM Data linking Open Data, Access and Mining : web API REST Services: hierarchical tree of resource naming (URL) Retrieving data Retrieving metadata <data format> <dataset name> <subset> (<subset>) <entry><category> <value> <value> <value> <entry> GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … > factor quantitative qualitative identifier link categories FRIM1 (*) xml/tsv/json frim1 14 (*) https://doi.org/10.5281/zenodo.154041
  • 15. Open Data, Access and Mining: web API. REST services: hierarchical tree of resource naming (URL). GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … > Fields (description; example):
    - <data format>: format of the retrieved data; possible values are 'xml', 'tsv' or 'json'; e.g. xml
    - <dataset name>: short name (tag) of your dataset; e.g. frim1
    - <subset>: short name of a data subset; e.g. samples
    - <entry>: name of an attribute entry, defined by the user in the a_attributes.tsv file (column 'entry'); e.g. sampleid
    - <category>: name of the attribute category, assigned by the user in the a_attributes.tsv file (column 'category'); possible values are 'identifier', 'factor', 'qualitative', 'quantitative'; e.g. quantitative
    - (<subset>): set of data subsets obtained by merging all the subsets with lower rank than the specified subset, following the pathway defined by the "is_part_of" links; e.g. (samples) → plants + harvests + samples
    - <value>: exact value of the desired entry or category; e.g. 1, factor
  • 16. Open Data, Access and Mining: web API. REST services: hierarchical tree of resource naming (URL). GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … >
    - Get the subset list of a dataset: http://myhost.org/getdata/<data format>/<dataset name>
    - Get all values within a data subset: http://myhost.org/getdata/<data format>/<dataset name>/<subset>
    - Get values within a data subset for a specific value of an entry: http://myhost.org/getdata/<data format>/<dataset name>/<subset>/<entry>/<value>
    - Get all values within a set of data subsets: http://myhost.org/getdata/<data format>/<dataset name>/(<subset>)
    - Get values within a set of data subsets for a specific value of an entry: http://myhost.org/getdata/<data format>/<dataset name>/(<subset>)/<entry>/<value>
    - Get the attribute list within a set of data subsets for a specific category: http://myhost.org/getdata/<data format>/<dataset name>/(<subset>)/<category>
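The URL patterns listed on this slide can be assembled mechanically. The helper below is a hypothetical sketch (it is not part of ODAM or Rodam) that builds each pattern from its fields; the percent-encoding of path segments is an added precaution, not something stated in the slides.

```python
from urllib.parse import quote


def build_getdata_url(base_url, data_format, dataset, subset=None,
                      merged=False, selector=()):
    """Build a getdata URL following the hierarchical naming scheme.

    merged=True wraps the subset in parentheses, i.e. (<subset>).
    selector is either (entry, value) or (category,).
    """
    parts = [base_url.rstrip("/"), "getdata", data_format, dataset]
    if subset is not None:
        name = quote(subset, safe="")
        parts.append(f"({name})" if merged else name)
        # Entry/value or category segments only make sense below a subset.
        parts.extend(quote(str(item), safe="") for item in selector)
    return "/".join(parts)


# Subset list of the dataset
subset_list_url = build_getdata_url("http://myhost.org", "xml", "frim1")
# Values of the 'harvests' subset where the entry 'lot' equals 1
lot_url = build_getdata_url("http://myhost.org", "xml", "frim1", "harvests",
                            selector=("lot", 1))
# Quantitative attributes of the merged set (compounds)
quantitative_url = build_getdata_url("http://myhost.org", "xml", "frim1",
                                     "compounds", merged=True,
                                     selector=("quantitative",))
```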
  • 17. Open Data Access via web API: examples based on FRIM1.
    - Metadata: http://myhost.org/getdata/xml/frim1
    - Data: http://myhost.org/getdata/xml/frim1/plants
    - Data: http://myhost.org/getdata/xml/frim1/harvests/lot/1
    - Metadata: http://myhost.org/getdata/xml/frim1/(compounds)/quantitative
  • 18. Open Data Access via web API: examples based on FRIM1. http://myhost.org/getdata/xml/frim1/(samples)/treatment/Control The notation (samples) denotes the set of data subsets obtained by merging all the subsets with lower rank than the specified subset, following the pathway defined by the "obtainedFrom" links: (samples) → plants + harvests + samples.
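To make the merge behind (samples) concrete, here is a small sketch that joins hypothetical plants, harvests and samples rows along their identifier links and then keeps the rows with treatment 'Control'. It only illustrates the idea; the server-side implementation is not shown in the slides, and all column names and values are assumptions.

```python
def merge_subsets(parent_rows, child_rows, link_key):
    """Attach to each child row the columns of the parent row it links to."""
    parents = {row[link_key]: row for row in parent_rows}
    merged = []
    for child in child_rows:
        parent = parents.get(child[link_key])
        if parent is None:
            continue  # skip rows whose link cannot be resolved
        merged.append({**parent, **child})
    return merged


# Hypothetical rows for the three subsets.
plants = [{"PlantID": "1", "Treatment": "Control"},
          {"PlantID": "2", "Treatment": "WaterStress"}]
harvests = [{"Lot": "1", "PlantID": "1"}, {"Lot": "2", "PlantID": "2"}]
samples = [{"SampleID": "365", "Lot": "1"}, {"SampleID": "366", "Lot": "2"}]

# (samples) = plants + harvests + samples, merged from the lowest rank upward.
plants_harvests = merge_subsets(plants, harvests, "PlantID")
merged_samples = merge_subsets(plants_harvests, samples, "Lot")

# Equivalent of selecting .../(samples)/treatment/Control on the merged set.
control_only = [row for row in merged_samples if row["Treatment"] == "Control"]
```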
  • 19. Open Data Access via web API: application layer. Minimal effort, maximal efficiency: once the data are formatted as TSV and linked (FRIM1), use existing tools (spreadsheets, RStudio, BioStatFlow, Galaxy, Cytoscape, …).
  • 20. Open Data Access via web API: application layer. Retrieving data within R: the R package Rodam.
  • 21. Open Data Access via web API: Rodam package. Example with <data format> = tsv, <dataset name> = frim1, (<subset>) = (samples), <entry> = sample, <value> = 365: GET http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(samples)/sample/365
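The slides use the R package Rodam for this step. For readers working outside R, the same GET request can be sketched with the Python standard library alone. The sketch assumes that the service returns UTF-8 text with a header row when the 'tsv' format is requested; the function names are hypothetical.

```python
import csv
import io
from urllib.request import urlopen


def parse_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def fetch_tsv(url, timeout=30):
    """Send a GET request and parse the TSV response (UTF-8 assumed)."""
    with urlopen(url, timeout=timeout) as response:
        return parse_tsv(response.read().decode("utf-8"))


# Usage (requires network access and a running service):
# rows = fetch_tsv(
#     "http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(samples)/sample/365"
# )
```

Each returned row is a dict keyed by column name, so it can be passed directly to tools such as pandas or written back out as a spreadsheet.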
  • 22. Open Data Access via web API: Rodam package. Read the metadata, i.e. the category types within the data, and get the data subset 'activome' along with its metadata. Example with <data format> = tsv, <dataset name> = frim1, (<subset>) = (activome), <category> = factor: GET http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(activome)/factor
  • 23. Open Data Access via web API: Rodam package.
  • 24. Open Data Access via web API: Rodam package. Make both metadata and data available for data mining: ODAM facilitates the subsequent data mining (e.g. MFA, rCCA, pLDA, …). Figures: analyses of the 'activome' and 'qNMR_metabo' data subsets (log10 transformed), all developmental stages and all treatments (Control, Water Stress).
  • 25. Open Data Access via web API: application layer. Minimal effort, maximal efficiency: once the data are formatted as TSV and linked (FRIM1), use existing tools (spreadsheets, RStudio, BioStatFlow, Galaxy, Cytoscape, …) and, if needed, develop lightweight tools (R scripts (Galaxy), lightweight GUI (R Shiny)).
  • 26. FRIM - Fruit Integrative Modelling http://www.bordeaux.inra.fr/pmb/dataexplorer/?ds=frim1
  • 27. FRIM - Fruit Integrative Modelling http://www.bordeaux.inra.fr/pmb/dataexplorer/?ds=frim1
  • 28. FRIM - Fruit Integrative Modelling
  • 29. FRIM - Fruit Integrative Modelling. To remove an item from the selection: (i) click on it, then (ii) press the Delete ('Suppr') key.
  • 30. FRIM - Fruit Integrative Modelling
  • 31. FRIM - Fruit Integrative Modelling. Explore several possibilities by interacting with the graph.
  • 32. To summarize:
    1. Preparation and cleaning of the data subset files
    2. Classification of each column into its proper category
    3. Connections between the dataset files based on identifiers
    4. Creation of the definition files, namely s_subsets.tsv and a_attributes.tsv
    5. Deposit of the dataset files in the data repository
    6. Checking online whether your data subset files are consistent
    7. Testing the web services online on your dataset
    8. Use of the web API through an application layer (R scripts, data explorer, ...)
    Data subset files and their associated metadata files must comply with the TSV (tab-separated values) format. Note: TSV is an alternative to the common comma-separated values (CSV) format, which often causes difficulties because of the need to escape commas (see https://en.wikipedia.org/wiki/Tab-separated_values).
  • 33. Advantages of this approach (seeding, harvesting, samples preparation, samples analysis).
    Data sharing and data availability:
    - The "plants" table may be created even before the seeds are planted.
    - Similarly, the "harvests" table can be created as soon as the harvests are done, before any analysis.
    - These tables are thus generated only once in the project, and sharing can be set up as soon as the seeds are planted. Each analysis then complements the dataset as soon as it produces its own data subset.
    - Data are accessible to everyone as soon as they are produced.
    Identifiers centrally managed:
    - Data are archived and compiled, so there is no longer any need for a laborious investigation to find out who holds the right identifiers, etc.
  • 34. Advantages of this approach: facilitate the subsequent publication of data. Data capture with minimal effort (PUT), data analysis/mining with maximum efficiency (GET).
    - Data are already available online through the web API.
    - Nothing prevents this data from being used to populate existing databases, with more elaborate annotations added.
    - Neither administrator privileges nor programming skills are required.
  • 35. Advantages of this approach: minimal effort, maximum efficiency.
    - Format the data, based on TSV: the choice is to keep the scientist's good old habit of using worksheets, so that (i) the same tool serves for both the data files and the metadata definition files, and (ii) no programming skills are required.
    - Give access through a web services layer, based on current standards (REST).
    - Use existing tools: spreadsheets, RStudio, BioStatFlow (biostatflow.org), Galaxy, Cytoscape, …
    - Develop, if needed, lightweight tools: R scripts, lightweight GUI (R Shiny).
  • 36. Have fun! Daniel Jacob, UMR 1332 BFP, Metabolism Group, Bordeaux Metabolomics Facility, May 2016. Open Data for Access and Mining:
    - https://hub.docker.com/r/odam/getdata/
    - http://www.bordeaux.inra.fr/pmb/dataexplorer/ (an online example)
    - https://github.com/INRA/ODAM
    - https://cran.r-project.org/package=Rodam
    - https://zenodo.org/record/154041