Clustering the Royal Society of Chemistry
chemical repository to enable enhanced
navigation across millions of chemicals
Valery Tkachenko, Ken Karapetyan, Antony Williams,
Oliver Kohlbacher, Philipp Thiel, Colin Batchelor
ACS, 248th National Meeting
San Francisco, CA
August 14th
2014
Chemical space - 10^60
Navigation in chemical space
Clustering
Science dimensions
• ~30 million chemicals and growing
• Data sourced from >500 different sources
• Crowdsourced curation and annotation
• Ongoing deposition of data from our journals and our collaborators
• A structure-centric hub for web searching
ChemSpider
Properties
Classification
ChemSpider Data Slices
Tagging in ChemSpider
RSC Archive – since 1841
DERA - Digitally Enabling RSC Archive
Twelve broad categories
Largest category is 30 times the size of the smallest
200 subcategories
How does it work?
Latent Semantic Analysis builds feature sets for (1) articles and (2) categories.
Features: words, citations and pairs of words.
Domain experts (Journal Development staff) build a category vector.
All articles whose cosine similarity to the category vector is greater than an adjustable threshold go into the category.
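The thresholding step above can be sketched as follows. This is a minimal illustration, not the RSC implementation: it assumes the Latent Semantic Analysis step has already produced numeric feature vectors, and the function names, article ids, vector values and threshold are all hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

def assign_to_category(article_vectors, category_vector, threshold):
    """Return ids of articles whose similarity to the category exceeds the threshold."""
    return [
        article_id
        for article_id, vec in article_vectors.items()
        if cosine_similarity(vec, category_vector) > threshold
    ]

# Hypothetical 3-dimensional feature vectors (illustrative values only).
articles = {
    "a1": [1.0, 0.0, 0.0],
    "a2": [1.0, 1.0, 0.0],
    "a3": [0.0, 0.0, 1.0],
}
category = [1.0, 0.0, 0.0]  # stands in for the expert-built category vector
selected = assign_to_category(articles, category, threshold=0.5)
```

Raising the threshold makes a category stricter and smaller; lowering it admits more loosely related articles, which is presumably why the slide calls it adjustable.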
RSC Data Repository
Structures similarity
Molecule Similarity
Similarity?
Suitable in silico representation: 2D binary fingerprints
Bit index: 0 1 2 3 4 5 6 7
X:         0 1 1 0 1 1 0 1
Y:         0 1 0 1 0 1 1 0
• Important fingerprint properties:
1. Length: length of the binary vector
2. Density: fraction of 1-bits
• Various fingerprint types exist
– Different atom typing and generation procedure
– Different properties (length, density, ...)
• Alternative representation: Feature list
– Store only index numbers of vector positions
– Memory-efficient storage
Length (16 bits): 0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0
Sparse fingerprint (sFP): 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0
Dense fingerprint (dFP): 1 1 0 1 0 1 1 0 0 1 1 1 0 1 1 1
Feature list: 0 1 0 1 0 1 1 0 -> 1,3,5,6
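The fingerprint properties and the feature-list representation described above can be illustrated with a short sketch. The helper names are hypothetical, and the bit vectors are the ones shown on the slide, using zero-based bit positions as in the 1,3,5,6 example.

```python
def density(bits):
    """Fraction of 1-bits in a binary fingerprint."""
    return sum(bits) / len(bits)

def to_feature_list(bits):
    """Keep only the zero-based positions of the 1-bits."""
    return [i for i, b in enumerate(bits) if b]

def from_feature_list(indices, length):
    """Rebuild the full binary vector; the length must be stored separately."""
    on = set(indices)
    return [1 if i in on else 0 for i in range(length)]

example = [0, 1, 0, 1, 0, 1, 1, 0]
sparse = [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

features = to_feature_list(example)         # positions of the set bits
sparse_features = to_feature_list(sparse)   # 3 integers instead of 16 bits
```

The memory saving of a feature list grows as density falls, since storage scales with the number of 1-bits rather than with the fingerprint length.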
• Molecules as binary vectors
• Various chemoinformatics dissimilarity/similarity measures:
– Euclidean distance
– Cosine similarity (normalized inner product)
• Most frequently used: Tanimoto coefficient [2,3]
– Corresponds to the Jaccard index
– Metric (as the distance 1 - Tanimoto)
– Range [0.0, 1.0] (dissimilar → similar)
2. Jaccard P., Bulletin de la Société Vaudoise des Sciences Naturelles (1901), 37, 547-579
3. Tanimoto T.T., IBM Internal Report (1957)
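As a worked example, the Tanimoto coefficient of two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below applies it to the two 8-bit example vectors X and Y shown earlier. The function name is hypothetical, and returning 1.0 for two all-zero fingerprints is an assumed convention, not something the slides specify.

```python
def tanimoto(x, y):
    """Tanimoto (Jaccard) similarity of two equal-length binary vectors."""
    if len(x) != len(y):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(x, y) if a and b)    # bits set in both
    either = sum(1 for a, b in zip(x, y) if a or b)   # bits set in at least one
    if either == 0:
        return 1.0  # assumed convention for two empty fingerprints
    return both / either

X = [0, 1, 1, 0, 1, 1, 0, 1]
Y = [0, 1, 0, 1, 0, 1, 1, 0]

similarity = tanimoto(X, Y)   # 2 shared bits out of 7 set in either: 2/7
distance = 1.0 - similarity   # the corresponding Jaccard distance
```

X and Y share bits 1 and 5, and seven distinct positions are set across the two vectors, so the coefficient is 2/7, roughly 0.29.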
Full Similarity Matrix Clustering
Results: Clustering the Available Chemspace
• ZINC all purchasable set: ~17 × 10^6 compounds (sFP)
• Tanimoto cutoff analysis: 0.76
• Opteron, 64 threads, 100 GB main memory
Total run-time: 64 hours
CCs decomposition: 12 hours
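The slides do not describe the clustering algorithm itself. Assuming "CCs" stands for connected components, one way to picture the idea is to link every pair of compounds whose Tanimoto similarity reaches the cutoff and take each connected component as a cluster. The sketch below is a brute-force toy version of that idea using union-find on feature lists; the function names, the toy fingerprints and the use of >= at the cutoff are all assumptions, and an all-pairs loop like this would not scale to 17 million compounds.

```python
from itertools import combinations

def tanimoto_sets(a, b):
    """Tanimoto similarity of two fingerprints stored as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def connected_components(fingerprints, cutoff):
    """Cluster by linking every pair with similarity >= cutoff (union-find)."""
    parent = list(range(len(fingerprints)))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(fingerprints)), 2):
        if tanimoto_sets(fingerprints[i], fingerprints[j]) >= cutoff:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i  # merge the two components

    clusters = {}
    for i in range(len(fingerprints)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Hypothetical toy fingerprints as feature lists (sets of on-bit indices).
fps = [{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {1, 2, 3, 5}, {10, 11}, {10, 11, 12}]
clusters = connected_components(fps, cutoff=0.76)
```

In this toy data, compounds 0 and 2 have a similarity of only 0.6, yet they land in the same cluster because each is 0.8 similar to compound 1. That chaining effect is a property of connected-component clustering in general.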
Federated linked system
Thank you
Email: tkachenkov@rsc.org
Slides: http://www.slideshare.net/valerytkachenko16

