Clustering the Royal Society of Chemistry
chemical repository to enable enhanced
navigation across millions of chemicals
Valery Tkachenko, Ken Karapetyan, Antony Williams,
Oliver Kohlbacher, Philipp Thiel, Colin Batchelor
ACS, 248th National Meeting
San Francisco, CA
August 14th
2014
Chemical space - 10^60
Navigation in chemical space
Clustering
Science dimensions
• ~30 million chemicals and growing
• Data sourced from >500 different sources
• Crowdsourced curation and annotation
• Ongoing deposition of data from our journals and our collaborators
• A structure centric hub for web-searching
ChemSpider
Properties
Classification
ChemSpider Data Slices
Tagging in ChemSpider
RSC Archive – since 1841
DERA -
Digitally Enabling RSC Archive
Twelve broad categories
Largest category is 30 times the size of the smallest
200 subcategories
How does it work?
Latent Semantic Analysis to build feature sets
for (1) articles (2) categories.
Features: words, citations and pairs of words.
Domain experts (Journal Development staff)
build a category vector.
All articles with a cosine similarity greater than
an adjustable threshold go into the category.
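The thresholding step described above can be sketched as follows. This is a minimal illustration, not the RSC implementation: it assumes the Latent Semantic Analysis step has already produced numeric feature vectors, and the names `cosine_similarity`, `assign_to_category`, the article ids, and all numbers are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)


def assign_to_category(article_vectors, category_vector, threshold):
    """Return ids of articles whose similarity to the category exceeds the threshold."""
    return [
        article_id
        for article_id, vector in article_vectors.items()
        if cosine_similarity(vector, category_vector) > threshold
    ]


# Hypothetical 3-dimensional feature vectors (made-up numbers, not RSC data).
articles = {
    "article-a": [0.9, 0.1, 0.0],
    "article-b": [0.0, 0.2, 0.9],
    "article-c": [0.7, 0.6, 0.1],
}
category = [1.0, 0.2, 0.0]  # stands in for the expert-built category vector

selected = assign_to_category(articles, category, threshold=0.8)
print(selected)
```

Raising the threshold narrows the category and lowering it widens the category, which is presumably why the slide calls it adjustable.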
RSC Data Repository
Structures similarity
Molecule Similarity
Similarity?
Suitable in silico representation: 2D binary fingerprints
Bit index: 0 1 2 3 4 5 6 7
X:         0 1 1 0 1 1 0 1
Y:         0 1 0 1 0 1 1 0
Structures similarity
Molecule Similarity
• Important fingerprint properties:
1. Length: length of the binary vector
2. Density: fraction of 1-bits
• Various fingerprint types exist
– Different atom typing and generation procedure
– Different properties (length, density, ...)
• Alternative representation: Feature list
– Store only index numbers of vector positions
– Memory-efficient storage
Length (16 bits): 0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0
Sparse fingerprint (sFP): 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0
Dense fingerprint (dFP): 1 1 0 1 0 1 1 0 0 1 1 1 0 1 1 1
Feature list: 0 1 0 1 0 1 1 0 → 1,3,5,6
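The fingerprint properties and the feature-list representation on this slide can be illustrated with a short sketch. The function names `to_feature_list` and `density` are hypothetical helpers chosen for this example; the bit vectors are the ones shown on the slide.

```python
def to_feature_list(bits):
    """Indices (0-based) of the 1-bits in a binary fingerprint."""
    return [i for i, bit in enumerate(bits) if bit == 1]


def density(bits):
    """Fraction of positions that are set to 1."""
    return sum(bits) / len(bits)


fingerprint = [0, 1, 0, 1, 0, 1, 1, 0]
sparse_fp = [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
dense_fp = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1]

features = to_feature_list(fingerprint)
print(features)             # [1, 3, 5, 6]
print(density(sparse_fp))   # 0.1875
print(density(dense_fp))    # 0.6875
```

For a sparse fingerprint the feature list stores only a few integers instead of every bit position, which is where the memory saving comes from.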
Structures similarity
Molecule Similarity
• Molecules as binary vectors
• Various chemoinformatics dis-/similarity measures:
– Euclidean distance
– Cosine similarity (inner product)
• Most frequently used: Tanimoto coefficient [2,3]
– Corresponds to Jaccard index
– Metric (as the distance 1 − Tanimoto)
– [0.0, 1.0] (dissimilar → similar)
2. Jaccard P., Bulletin de la Société Vaudoise des Sciences Naturelles (1901), 37, 547-579
3. Tanimoto T.T., IBM Internal Report (1957)
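As a worked example, the Tanimoto coefficient of two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below computes it on feature lists, using the X and Y vectors from the earlier fingerprint slide. The function name `tanimoto` and the choice to return 0.0 for two empty fingerprints are assumptions of this example, not something the slides specify.

```python
def tanimoto(features_a, features_b):
    """Tanimoto (Jaccard) coefficient of two fingerprints given as feature lists."""
    a, b = set(features_a), set(features_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # assumption: two empty fingerprints count as dissimilar
    return len(a & b) / union


# X = 0 1 1 0 1 1 0 1 and Y = 0 1 0 1 0 1 1 0, written as on-bit indices.
x = [1, 2, 4, 5, 7]
y = [1, 3, 5, 6]

similarity = tanimoto(x, y)   # 2 shared bits / 7 bits set in either = 2/7
distance = 1.0 - similarity   # the corresponding Jaccard distance
print(round(similarity, 4), round(distance, 4))
```

Working on sets of indices rather than full bit vectors keeps the cost proportional to the number of on-bits, which suits sparse fingerprints.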
Full Similarity Matrix Clustering
Results: Clustering the Available Chemspace
• ZINC all purchasable set: ~17 × 10^6 compounds (sFP)
• Tanimoto cutoff analysis: 0.76
• Opteron, 64 threads, 100 GB main memory
Total run-time: 64 hours
CCs decomposition: 12 hours
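The slide does not show the algorithm itself. The following is a toy sketch under two assumptions: that "CCs" stands for connected components, and that two compounds are linked when their Tanimoto similarity is at least the cutoff. The all-pairs loop and the union-find structure are illustrative choices. For ~17 × 10^6 compounds the full matrix has roughly 1.4 × 10^14 pairs, so the real computation needs far more engineering than this.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def connected_components(fingerprints, cutoff):
    """Group fingerprint indices into components of the similarity-threshold graph."""
    parent = list(range(len(fingerprints)))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Quadratic all-pairs comparison: acceptable for a toy set only.
    for i, j in combinations(range(len(fingerprints)), 2):
        if tanimoto(fingerprints[i], fingerprints[j]) >= cutoff:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    groups = {}
    for i in range(len(fingerprints)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical feature lists, not taken from ZINC.
toy = [
    {1, 2, 3, 4},
    {1, 2, 3, 4, 5},
    {1, 2, 3, 5},
    {10, 11, 12},
    {10, 11, 12, 13},
]

clusters = connected_components(toy, cutoff=0.76)
print(clusters)
```

In the toy data, compounds 0 and 2 have a similarity of only 0.6 but end up in the same component because both are linked to compound 1. Compounds 3 and 4 have a similarity of 0.75 and stay apart at a cutoff of 0.76, which shows how sensitive the decomposition is to the chosen cutoff.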
Federated linked system
Thank you
Email: tkachenkov@rsc.org
Slides: http://www.slideshare.net/valerytkachenko16


Editor's Notes

  • #19: Change to add more database, rearrange