ChemSpider – The Vision and Challenges Associated with Building a Free Online Community Resource for Chemists
Antony Williams
AZ, February 2011
What’s the Status of Chemistry Online?
- Encyclopedic articles (Wikipedia)
- Chemical vendor databases
- Metabolic pathway databases
- Virtual screening databases
- Property databases
- Screening assay results
- Patents with chemical structures
- ADME/Tox data
- Scientific publications
- Compound aggregators
- Blogs/Wikis and Open Notebook Science
For Synthesis…TotallySynthetic.com
Org Prep Daily  (Blog)
Molbank (Open Access Journal)
Lots of “Public Compound” Databases
- PubChem
- Drugbank
- ChEBI/ChEMBL
- KEGG
- LipidMAPs
- ChemIDPlus
- eMolecules
- ZINC
- Lots of chemical vendors
- ChemSpider
Where Would You look?  What Do You Trust?
Linked Data on the Web
What is a compound? “ARTAs”
Vision: Connect Chemistry on the Web
- The internet is searchable by chemical structure and substructure (e.g. Wikipedia, Google Scholar)
- Chemistry articles are indexed and searchable by a free online service
- The web is linked together through the “language of chemistry”
- Publicly funded research data is linked
We Have Delivered the Vision: “Build a Structure Centric Community to Serve Chemists”
- Integrate chemical structure data on the web
- Create a “structure-based hub” to information, data and algorithmic predictions
- Let chemists contribute their own data
- Allow the community to curate/correct data
How Was ChemSpider Built?
- ChemSpider was a “hobby project”
- Housed in a basement and running off three servers – one bought, two built
- Sensitive to weather and power stability
- Went live at ACS Spring 2007 in Chicago
How Did We Build It?
- We deal in Molfiles or SDF files
- We do rudimentary filtering – valence checking, charge imbalance – prior to deposition
- We have our own “business logic” to standardize
- Link out to external sites where possible using IDs
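The rudimentary filtering mentioned on this slide (valence checking and charge imbalance) could look something like the sketch below. It is not ChemSpider's actual business logic, which the slides do not describe. The record layout (`atoms`, `bonds`), the function name `check_record` and the small valence table are all assumptions made for illustration, and the sketch expects hydrogens to be listed explicitly.

```python
from collections import defaultdict

# Assumed typical neutral valences for a few common elements (illustrative only).
TYPICAL_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}


def check_record(atoms, bonds):
    """Return a list of problems found in a simple structure record.

    atoms: list of (element, formal_charge) tuples.
    bonds: list of (atom_index_a, atom_index_b, bond_order) tuples, 0-based.
    """
    problems = []

    # Sum the bond orders around each atom.
    bond_order_sum = defaultdict(int)
    for a, b, order in bonds:
        bond_order_sum[a] += order
        bond_order_sum[b] += order

    # Valence check: flag atoms that carry more bonds than the element allows.
    for index, (element, charge) in enumerate(atoms):
        if element not in TYPICAL_VALENCE:
            continue  # element unknown to this sketch, so it is not judged
        # Simplification: each positive charge on N or O permits one extra bond.
        extra = charge if element in ("N", "O") and charge > 0 else 0
        allowed = TYPICAL_VALENCE[element] + extra
        if bond_order_sum[index] > allowed:
            problems.append(
                f"valence: atom {index} ({element}) has "
                f"{bond_order_sum[index]} bonds, max {allowed}"
            )

    # Charge balance check: formal charges over the whole record should sum to zero.
    net_charge = sum(charge for _, charge in atoms)
    if net_charge != 0:
        problems.append(f"charge imbalance: net charge {net_charge:+d}")

    return problems
```

A record that returns an empty list would pass on to standardization; anything else would be held back for review before deposition.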
www.chemspider.com
We Want to Answer Questions
Questions a chemist might ask…
- What is the melting point of n-heptanol?
- What is the chemical structure of Xanax?
- Chemically, what is phenolphthalein?
- What are the stereocenters of cholesterol?
- Where can I find publications about xylene?
- What are the different trade names for Ketoconazole?
- What is the NMR spectrum of Aspirin?
- What are the safety handling issues for Thymol Blue?
Search for a Chemical…by name
Link off a structure in ChemSpider
- Chemical suppliers
- Other publications
- Analytical Data
- Related Reactions
- Wikipedia
- Patents
- “Everything”
Available Information… Linked to vendors, safety data, toxicity, metabolism
Available Information….
Clickthrough to Patent (SureChem)
Crowdsourced “Annotations”
Registered users can add:
- Descriptions/Syntheses/Commentaries
- Links to PubMed articles
- Links to articles via DOIs
- Spectral data
- Crystallographic Information Files
- Photos
- MP3 files
- Videos
 
Spectra Linked
Spectra Linked
Search for a chemical…by structure Substructure search coming…
Inherited Errors
- Inherited errors from every database… all public compound databases, including ours, have errors
- “Incorrect” structures – assertions, timelines etc.
- “Incorrect” names associated with structures
- ENORMOUS CHALLENGE
What is the Structure of Vitamin K?
MeSH
A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K.
What is the Structure of Vitamin K1?
What is the Structure of Vitamin K1?
Vitamin K1
 
“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”
Variants of systematic names on PubChem
- 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
- 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
- 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
- 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
- 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
- 2-methyl-3-[(E)-3,7,11,15-tetramethyl
- 2-methyl-3-(3,7,11,15-tetramethyl
- 2-methyl-3-[(E)-3,7,11,15-tetramethyl
Question Everything online: www.dhmo.org
It’s all on Wikipedia…
Chemistry on The Internet Is Messy
It’s Methane…
What’s Methane?
What’s Methane?
What  ELSE  is Methane???
 
 
EPA’s DailyMed
EPA’s DailyMed
EPA’s DailyMed
Public Domain Chemistry Databases
- Our databases are a mess…
- Non-curated databases are proliferating errors
- We source and deposit data between databases
- Original sources of errors hard to determine
- Curation is time-consuming, challenging and exacting
- An examination of quality in databases – inter/intra lab comparison of processes for 150 drugs
Drug Name | Generic Name | ChEBI | ChemSpider | CAS Common Chemistry | ChemIDPlus | DailyMed | DrugBank | PubChem | Wikipedia
Spiriva | Tiotropium Bromide | No Hits | | No Hits | | | | 4/0 |
Depakote | Valproate semisodium | | | | | | | | No Structure
Basen | Voglibose | | | No Hits | | No Hits | | 2/1 |
Symbicort 1) | Budesonide | | | | | | | 8/1 |
Symbicort 2) | Formoterol | WRONG | | No Hits | | | | 6/1 |
Vytorin 1) | Ezetimibe | | | No Hits | | | | |
Vytorin 2) | Simvastatin | | | | | | | 2/1 |
Taxol | Paclitaxel | | | | | | | 44/1 |
Thalidomid | Thalidomide | No Hits | | | | | | |
Zocor | Simvastatin | | | | | | | 2/1 |
Crestor | Rosuvastatin | | | No Hits | | | | 2/1 |
Symbicort: Budesonide + Formoterol
Symbicort: Budesonide + Formoterol ChemIDPlus Wikipedia
DrugBank: Search Symbicort…
Symbicort: Budesonide + Formoterol
PubChem
- 8 structures called Budesonide. 1 “correct”
- 6 structures called Formoterol. 1 “correct”
- Search on “Symbicort” gives 1 structure.
Taxol: Paclitaxel  44  structures
Taxol: Paclitaxel  Bioassay  Data
Taxol: Paclitaxel Bioassay Data
Most bioassay data are associated with a structure that has one ambiguous stereocenter
Consider searching each of these chemical databases by chemical name (systematic name, trade name or synonym). Please mark each online resource according to how much you generally trust the results.
The Final Search Strategy
All Those Names, One Structure
Searching Chemistry on the Internet
- How complete a result set will we get if we search for “chemicals” by name?
- Is there a better way to link chemistry databases?
- Linking by “names” is dangerous
- Chemists want structure and SUBstructure searching
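To make the danger of name-based linking concrete, here is a minimal sketch that joins two record sets first on a name and then on a structure identifier. The function `link_by_key`, the field names and the `KEY-…` identifiers are hypothetical placeholders, not data from any real database.

```python
from collections import defaultdict


def link_by_key(records_a, records_b, key):
    """Pair up records from two sources that share the same value for `key`."""
    index = defaultdict(list)
    for record in records_b:
        index[record[key]].append(record)
    return [(a, b) for a in records_a for b in index.get(a[key], [])]


# Hypothetical records: both sources use the name "vitamin K",
# but they attach it to different structures.
source_a = [{"name": "vitamin K", "structure_key": "KEY-0001"}]
source_b = [
    {"name": "vitamin K", "structure_key": "KEY-0002"},
    {"name": "phytomenadione", "structure_key": "KEY-0001"},
]

linked_by_name = link_by_key(source_a, source_b, "name")
linked_by_structure = link_by_key(source_a, source_b, "structure_key")
```

In this toy case the name join pairs two different structures, while the structure join finds the matching record even though it is filed under another name.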
The InChI Identifier
Multiple Layers
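As a rough illustration of the layered design, the sketch below splits an InChI string on its `/` separators into a version, a formula and a list of prefixed layers. The helper name `split_inchi` is an assumption for this example, and the parsing is deliberately naive: it does not interpret the layers or validate the string. The example string is the standard InChI for ethanol.

```python
def split_inchi(inchi):
    """Split an InChI string into (version, formula, [(prefix, value), ...])."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    version, rest = parts[0], parts[1:]

    # The formula layer has no lowercase prefix letter; it may be absent.
    formula = ""
    if rest and rest[0] and not rest[0][0].islower():
        formula = rest.pop(0)

    # Remaining layers start with a one-letter prefix, e.g. c (connectivity),
    # h (hydrogens), q (charge), t/m/s (stereochemistry), i (isotopes).
    layers = [(part[0], part[1:]) for part in rest if part]
    return version, formula, layers


version, formula, layers = split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
```

Layers are kept as an ordered list rather than a dictionary because the same prefix letter can occur more than once in a full InChI.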
InChI Strings Hash to InChIKeys
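An InChIKey is a fixed-length, 27-character hashed form of the InChI string: a 14-letter first block, a 10-character second block whose last two characters carry a standard/non-standard flag and a version letter, and a final protonation character. The sketch below only takes a key apart; it does not compute the hash. The function name `split_inchikey` and the dictionary field names are assumptions for illustration, and the example key is the standard InChIKey for ethanol.

```python
import re

# 14 letters, hyphen, 8 letters + flag + version letter, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])-([A-Z])$")


def split_inchikey(key):
    """Break an InChIKey into its blocks; raise ValueError if the shape is wrong."""
    match = INCHIKEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    connectivity, detail, flag, version, protonation = match.groups()
    return {
        "connectivity": connectivity,  # hash of the basic skeleton
        "detail": detail,              # hash of the remaining layers
        "standard": flag == "S",       # "S" marks a standard InChIKey
        "version": version,
        "protonation": protonation,    # "N" means neutral
    }


blocks = split_inchikey("LFQSCWFLJHTTHZ-UHFFFAOYSA-N")
```

Because the key has a fixed length and a restricted alphabet, it is easy to index in a database and to use as a web search term, which is harder with a full InChI string.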
Oleoylethanolamine
InChIs have traction…
Vancomycin
Vancomycin
Vancomycin
- Search Molecular SKELETON
- Search Full Molecule
Full  Molecule  Search: 4 Hits
Full  Skeleton  Search: 104 Hits
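The two hit counts differ because a skeleton search ignores details such as stereochemistry, while a full-molecule search requires an exact match. One way such a pair of searches could be implemented is by comparing InChIKeys: the whole key for a full-molecule match, and only the first 14-character block for a skeleton match. The slides do not say that ChemSpider works this way, so treat this as an assumption. The record layout and the keys below are made-up placeholders, not real compounds.

```python
def full_molecule_search(records, query_key):
    """Return records whose entire InChIKey equals the query key."""
    return [r for r in records if r["inchikey"] == query_key]


def skeleton_search(records, query_key):
    """Return records that share the query's first (connectivity) block."""
    skeleton = query_key.split("-")[0]
    return [r for r in records if r["inchikey"].split("-")[0] == skeleton]


# Placeholder keys: records 1 and 2 share a skeleton but differ in detail.
records = [
    {"id": 1, "inchikey": "AAAAAAAAAAAAAA-BBBBBBBBSA-N"},
    {"id": 2, "inchikey": "AAAAAAAAAAAAAA-CCCCCCCCSA-N"},
    {"id": 3, "inchikey": "DDDDDDDDDDDDDD-BBBBBBBBSA-N"},
]

query = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
exact_hits = full_molecule_search(records, query)
skeleton_hits = skeleton_search(records, query)
```

Here the full-molecule search returns only record 1, while the skeleton search returns records 1 and 2, mirroring on a small scale the jump from 4 to 104 hits.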
Vancomycin
- Who will curate?
- How would you clean such a large dataset?
Vancomycin on ChemSpider
Name Searching is “Easier”
Name Searching is “Easier”
 
 
Content is King and Quality Costs
- Curated chemistry “content” is expensive to create
  - Patent searching
  - Structures and properties
  - Drug databases
  - Literature databases
- Chemical Abstracts Service (CAS), the “Gold Standard” in chemistry-related information
  - 104 years of content
  - >50 million substances
  - Proprietary platform
The EXPERTS must get it right?!
Wikipedia, C&E News, PubChem C&E News (from ACS)
Feedback from Steve Ritter
“Although CAS and C&EN are both part of the ACS Publications Division, we at C&EN still have to pay for our SciFinder access, strangely enough.”
“It would be nice to have an authoritative web-based source of standard, well-drawn structures for chemists to go to so they can freely cut and paste structures into their papers, PowerPoint presentations, and anything else they might need. Maybe Wikipedia will be that source one day.”
 
Search OEA
Search OEA
Search OEA
Semantic Mark-up for Chemistry
- Semantic mark-up for chemistry is here
- RSC Project Prospect (structure linking, IUPAC Gold Book ontology and other ontologies)
- Nature Publishing Group compound linking
Nature Chemistry Compound Pages
Project Prospect
Entity-Extraction, Mark-up, Annotate
Entity-Extraction, Mark-up, Annotate
And linked to STITCH…
Success Depends on Dictionaries
Online Curation
- Online databases generally do NOT allow curation or annotation
- If you find errors they stay there!
- ChemSpider allows immediate curation
Search “Vitamin H”
“ Curate” Identifiers
“ Curate” Identifiers
“ Curate” Identifiers
Crowd-sourcing Chemistry Curation
Crowdsourcing Works
- >130 people have deposited data and participated in data curation
- Different level curators check each other
- Wikipedia is the modern primary example
ChemSpider and Publishing
- The curation efforts on ChemSpider led to a set of validated dictionaries
- Integrate best-in-class entity extraction with validated name dictionaries
- Already text-mined the RSC archive and presently linking!
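Dictionary-driven entity extraction can be sketched very simply: scan the text for any name in a validated dictionary and report the identifier it maps to. The version below is a toy, not the entity extraction software referred to on the slide. The function names, the `ID-…` identifiers and the tiny dictionary are hypothetical, and real chemical names need far more careful tokenization than a regular expression provides.

```python
import re


def build_matcher(dictionary):
    """Compile one case-insensitive pattern that matches any dictionary name."""
    # Longest names first so "vitamin K1" is tried before "vitamin K".
    names = sorted(dictionary, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    # Lookarounds stop a name from matching inside a longer word.
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)


def extract_entities(text, dictionary):
    """Return (matched_text, identifier, offset) for each dictionary name found."""
    if not dictionary:
        return []
    lookup = {name.lower(): ident for name, ident in dictionary.items()}
    matcher = build_matcher(dictionary)
    return [
        (m.group(0), lookup[m.group(0).lower()], m.start())
        for m in matcher.finditer(text)
    ]


# Hypothetical validated dictionary: synonyms share one identifier.
names_to_ids = {
    "vitamin H": "ID-3",
    "biotin": "ID-3",
    "vitamin K": "ID-1",
    "vitamin K1": "ID-2",
}

hits = extract_entities("Vitamin H (biotin) is unrelated to vitamin K1.", names_to_ids)
```

The quality of the output depends entirely on the dictionary: a wrong name-to-structure assignment there is reproduced in every article that gets marked up.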
Crowdsourcing Synthesis ChemSpider SyntheticPages
Crowdsourcing Synthesis ChemSpider SyntheticPages
ChemSpider Everywhere: What do computers want? Web services
Web Services
ChemSpider Everywhere
- Linked from Wikipedia and many public databases
- Linked from Open Notebook Science sites
- Linked from blogs using Structure/Spectra EMBED
- Integrated into structure drawing packages
- Integrated into software offerings from Thermo, Waters, Agilent, Bruker
ChemSpider Everywhere : Embed
ChemSpider Everywhere: Spectral Game
ChemSpider Everywhere Crowdsourced Curation of Spectra
ChemSpider Everywhere : ChemMobi
Structure Database Lookup
Structure Database Lookup
Reaction Database Look-up
Reaction Database Look-up
There will always be gaps...
What ChemSpider does not deal with, yet...
- Materials
- Minerals
- Polymers
- Biological macromolecules
Collaborative Data Curation
- How can we COLLECTIVELY clean online data?
- Developing ways to share curation actions back to original data sources
- A mindset of bigger is better is problematic.
- How many “real chemicals” are in the public databases?
Future Work
- Continue curation work
- Extend search capabilities
- Expand existing databases
- Text-mine RSC archive and link chemistry
- Project: pre-competitive data sharing and linking for Life Sciences
- Integrate to metabolic pathways tools
The Future of Chemistry on the Web?
- Public compound databases federate & build a linked environment of validated data!
- Data validation needs are not ignored
- Publishers layer on information to make publications discoverable
- Public-Private databases can be linked
- Open Data proliferate
- The “Semantic Web” in action
It’s a long road ahead…
Thank you
Email: williamsa@rsc.org
Twitter: ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams
