Automation of Biological Data Analysis and Report Generation
Dmitry Grapov, PhD
Bots write the darndest things
http://www.latimes.com/local/lanow/earthquake-27-quake-strikes-near-westwood-california-rdivor,0,3229825.story#axzz2wQwc82EK
•fill in the template (easy)
•human-guided automation
(e.g. MetaboAnalyst,
intermediate)
•intelligent/reactive writing
(e.g. ~AI, advanced)
http://narrativescience.com/
Humans + Bots
Interaction:
•Bots and humans combine
in guided analyses
•Humans: make choices
(based on bot guides)
•Bots: automate!
Facilitate:
• workflow logging and
template creation
•reproducible results
Bot: Initial data and meta data
parsing and quality validation
(need: template input)
Human: data cleaning and
experimental design identification
(use: multiple choice, dynamic GUI)
Bot: instantiation of complex
workflows
Human: overview of bot
assumptions and results
Bot: Numerical and text output
generation
Humans + Bots write
darndender things?
Choose Your Own Life Adventure!
?
https://github.com/dgrapov/AdventureR
Data Analysis Tasks
Visualization (how does it look?)
• histograms, density plots, box plots, line plots, scatter plots, networks, etc.
Statistical Analysis (what is statistically significant?)
• summary tables, ANOVA, FDR adjustment, power analysis, etc.
Exploration (what are the major patterns/trends?)
• clustering, PCA, ICA, etc.
Predictive Modeling (what explains my hypothesis?)
• mixed effects, partial least squares (O-/PLS/-DA), etc.
Network Analysis and Mapping (how are things related?)
• Functional analysis: pathway enrichment or overrepresentation
• Networks: biochemical, structural, mass spectral and empirical networks
• Mapping: projection of analysis results onto network
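The FDR adjustment listed under Statistical Analysis can be sketched in a few lines. The example below is a minimal Python illustration of the Benjamini-Hochberg procedure, which is assumed here as the FDR method (the slides do not name one); the function name `benjamini_hochberg` is hypothetical and not taken from any of the tools mentioned in this deck.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # One p-value per metabolite, e.g. from per-variable ANOVA.
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

In an automated report, the adjusted values would typically be written to the summary table next to the raw p-values so a human reviewer can see both.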
WCMC Data Analysis Reports ™
Statistical analysis
Clustering
PCA
O-PLS-DA
Biochemical enrichment
Network mapping
Input template: BinBase
•inference of experimental
goals from sample meta data
•mapping variables to external
databases
Tasks:
Report:
Tools:
Automation Challenges
Data cleaning and quality validation
•use: quality control samples; identify: precision/accuracy,
normalization, batch corrections; mitigate: outliers, missing
values, batch effects, etc.
Identification of experimental goals
•use: meta data, identify: main and accessory effects;
choose: statistics, multivariate tests and visualizations
Integration of multiple tasks to evolve robust analyses
•tasks: statistics, multivariate, functional, networks, database
mapping, etc.
Data analysis report generation
•use: R, LaTeX, markdown
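As one illustration of the quality-validation step above, precision can be estimated from repeated quality control (QC) injections. The sketch below is in Python and uses relative standard deviation (RSD) as the precision metric; the 30% cutoff, the function names, and the input layout (metabolite name mapped to a list of QC intensities) are all assumptions for illustration, not values from the slides.

```python
from statistics import mean, stdev


def qc_rsd(qc_values):
    """Relative standard deviation (%) of one metabolite across QC samples.

    Assumes at least two QC measurements and a non-zero mean intensity.
    """
    return 100.0 * stdev(qc_values) / mean(qc_values)


def flag_imprecise(qc_table, max_rsd=30.0):
    """Return the metabolites whose QC RSD exceeds max_rsd (assumed cutoff)."""
    return [name for name, values in qc_table.items() if qc_rsd(values) > max_rsd]


if __name__ == "__main__":
    # Made-up intensities for illustration only.
    qc_table = {
        "glucose": [100, 102, 98, 101],
        "noisy feature": [10, 50, 90, 20],
    }
    print(flag_imprecise(qc_table))
```

A bot can compute this table unattended; whether a flagged variable is dropped, corrected, or kept is the kind of choice the slides assign to the human.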
Challenges to automated
metabolite ID mapping
Stereochemistry?
Search: catechin
Best Match:
Catechin
Biologically relevant:
D-catechin
Synonyms?
Search: UDP GlcNAc
FAIL: UDP GlcNac
PASS: UDP-GlcNac
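The synonym failure above comes down to spelling variants (space versus hyphen, letter case) that an exact string lookup treats as different names. One possible mitigation, sketched in Python under the assumption that synonyms are compared through a normalized key, is shown below; `synonym_key` is a hypothetical helper, not part of CTS, KEGG, or PubChem.

```python
import re


def synonym_key(name):
    """Lowercase a name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", name.lower())


if __name__ == "__main__":
    # Both spellings from the slide collapse to the same lookup key.
    print(synonym_key("UDP GlcNAc"), synonym_key("UDP-GlcNac"))
    # Caveat: sign-based stereo descriptors are stripped as well.
    print(synonym_key("(+)-catechin"), synonym_key("(-)-catechin"))
```

The same normalization that rescues "UDP GlcNAc" also erases the difference between "(+)-catechin" and "(-)-catechin", so a key like this helps with the synonym problem but makes the stereochemistry problem worse unless stereo descriptors are handled separately.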
Strategies for automated
metabolite ID mapping (from synonym)
#1: CTS+ #2: Web query #3: Curated DB
•Use CTS to translate
from synonyms to KEGG
(KID) and PubChem (CID)
•Use KEGGREST and
PUG to filter and choose
most appropriate IDs
•Use fuzzy matching and
word similarity metrics
(e.g. Damerau–
Levenshtein distance)
•Use KEGGREST +
PubChem PUG to
translate synonyms to
IDs
•For KEGG ID:
synonym → SID → KID
•Generate a curated DB
for KEGG and CID
translations +
•Include InChI Keys
•Map to other DBs
•Allow fuzzy matching
on synonyms
•e.g. IDEOM
http://bioinformatics.oxfordjournals.org/content/early/2012/02/04/bioinformatics.bts069
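The fuzzy matching mentioned in the strategies above can be illustrated with a small edit-distance sketch. The Python code below implements the optimal string alignment variant of Damerau-Levenshtein distance (a common restricted form, chosen here for brevity) and uses it to pick the closest synonym. The function names and the `max_distance=2` cutoff are assumptions for illustration, and the candidate list stands in for whatever synonym table CTS, KEGG, or a curated DB would supply.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def best_match(query, synonyms, max_distance=2):
    """Return the closest synonym within max_distance edits, or None."""
    if not synonyms:
        return None
    q = query.lower()
    distance, name = min((osa_distance(q, s.lower()), s) for s in synonyms)
    return name if distance <= max_distance else None


if __name__ == "__main__":
    candidates = ["UDP-GlcNAc", "UDP-glucose"]
    print(best_match("UDP GlcNac", candidates))
    # "catechin" is three edits from "epicatechin", so no match is returned.
    print(best_match("catechin", ["epicatechin"]))
```

The cutoff matters: too loose and distinct compounds such as catechin and epicatechin are merged, too strict and harmless spelling variants fail, which is one reason the slides pair fuzzy matching with ID filtering and a curated database.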
Interactive Analysis and
Report Generation
knitr (http://yihui.name/knitr/)
Analysis Report Generation
•Analysis on rails or open sandbox
•Humans facilitate robust results generation + Bots ensure reproduction
•Generation of Methods and Results should be automatable
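The simplest form of automated Methods generation is the "fill in the template" approach: analysis choices are logged as they are made, then substituted into fixed text. The sketch below uses Python's standard `string.Template` as a stand-in for a knitr document; the template wording, the field names (`normalization`, `test`, `adjustment`), and the `write_methods` helper are all hypothetical.

```python
from string import Template

# Fixed Methods wording with placeholders for the logged choices.
METHODS_TEMPLATE = Template(
    "Data were normalized using $normalization. Group differences were "
    "tested with $test, and p-values were adjusted for multiple "
    "comparisons using $adjustment."
)


def write_methods(log):
    """Render a Methods paragraph from a dict of logged analysis choices.

    Raises KeyError if a required choice was never logged, so an
    incomplete workflow cannot silently produce an incomplete report.
    """
    return METHODS_TEMPLATE.substitute(log)


if __name__ == "__main__":
    workflow_log = {
        "normalization": "QC-based batch correction",
        "test": "one-way ANOVA",
        "adjustment": "the Benjamini-Hochberg procedure",
    }
    print(write_methods(workflow_log))
```

Because the text is produced from the same log that drove the analysis, the written Methods cannot drift from what was actually run, which is the reproducibility benefit the slide attributes to the bot.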
Devium 2.0
Human-guided automated data
analysis and report generator
Human-guided automation could help
ensure robust results by making choices
which are otherwise difficult to automate.
https://github.com/dgrapov/DeviumWeb
MetaMapR
Linking data analysis and biology
https://github.com/dgrapov/MetaMapR
Integration of complex workflows is key to automation.
Future Goals
+ Workflows for complex experiments (e.g. time-course)
+ Biochemical functional analysis (pathway enrichment)
+ GUI for report generation (Devium 2.0)
+ Integrate multi-’Omic’ data sets (MetaMapR 2.0)
+ Scientific literature mining (RapportR)
+ Interactive plots and networks (JavaScript)
dgrapov@ucdavis.edu
metabolomics.ucdavis.edu
This research was supported in part by NIH 1 U24 DK097154

