Checking, Curating and Qualifying Chemistry to Build a Structure Centric Community for Chemists. Rutgers University, 12/2/2008. Antony Williams
ChemSpider - A Search Engine for Chemists. Questions a chemist might ask… What is the melting point of n-butanol? What is the chemical structure of Xanax? Chemically, what is phenolphthalein? What are the stereocenters of cholesterol? Where can I find publications about xylene? What are the different trade names for Ketoconazole? What is the NMR spectrum of Aspirin? What are the safety handling issues for Thymol Blue? ChemSpider can answer all of these questions.
Tell Me About Glutathione
Link outs
Links out to KEGG (Kyoto Encyclopedia of Genes and Genomes)
How many names does a compound have?
ChemSpider Data Content: Over 21.5 million unique chemical structures from ca. 150 data sources. Online databases: PubChem, DrugBank, KEGG, Wikipedia. Literature: PubMed, J Het Chem, Nature, RSC, Open Access. Chemical vendors: over 40 different vendors and growing. Personal depositions: individual contributions. Content database vendors. Analytical data collections. Patents. Web scraping. Content is linked back to the original data sources.
Complex Search
The Quality of Data Online… Aggregating data opens up quality issues. Structure-identifier associations are “dirty”. Structures are COMMONLY incorrect. Manual curation of small databases is enough work; what about millions of structures? Structures are far from perfect. What is a “correct structure”? Full stereochemistry? Historical timeline of structure? Who is the authority?
Quality is a Major Issue: Search Butanol (OLD EXAMPLE, now fixed)
Wikipedia Chemistry Curation project: Only ca. 5000 organic structures, 7000 total structures. Almost a year of work so far for a team of 6 people. Many errors removed in the process. Curation process is a daily event for users/depositors. Slow and torturous process. http://en.wikipedia.org/wiki/Talk:Tacrolimus#IUPAC_Name_and_structure
Wikipedia Curation: Looking for self-consistency across a Wikipedia page. The primary key is the article TITLE. The chemical shown needs to match the title. Cyclic self-consistency, and decisions must get made.
Other issues…
Charges
Sugars: Machine Readable vs Aesthetics (Haworth, Stereo, Fischer)
Wikipedia – Crowdsourcing Chemistry
Thymol Blue on ChemSpider. Data online includes: UV-vis spectrum; measured experimental properties; link to Wikipedia article; links to chromatography details; multiple identifiers/trade names etc.; links to vendors/suppliers/other databases; safety information. http://www.chemspider.com/q/thymol%20blue
Crowd-sourcing Curation: How to curate data for millions of structures? Robot processes can clean up depositions: search for Chloride and check the molecular formula for Cl; check for stereochemistry and remove names with stereo. Provide a simple-to-use platform to curate, annotate and tag data. Provide curator administration to prevent vandalism (Veropedia).
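The two robot checks named on this slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not ChemSpider's actual implementation: the function names, the regular expressions, and the reading of "remove names with stereo" (drop stereo-specific names from records whose structure carries no stereochemistry) are all hypothetical.

```python
import re

# Element symbols: one capital letter, optionally followed by one lowercase letter.
ELEMENT = re.compile(r"[A-Z][a-z]?")

# Assumed set of stereo descriptors in names: (R), (2R,3S), (E), (Z), cis-, trans-, (+), (-).
STEREO_MARK = re.compile(
    r"\((?:\d*[RSEZ])(?:,\d*[RSEZ])*\)|\b(?:cis|trans)-|\([+-]\)"
)


def elements_in(formula):
    """Return the set of element symbols found in a molecular formula."""
    return set(ELEMENT.findall(formula))


def chloride_name_conflicts(name, formula):
    """True when a name mentions chloride but the formula contains no Cl."""
    return "chloride" in name.lower() and "Cl" not in elements_in(formula)


def drop_stereo_names(names, structure_has_stereo):
    """Keep every name if the structure defines stereo; otherwise drop stereo-specific names."""
    if structure_has_stereo:
        return list(names)
    return [n for n in names if not STEREO_MARK.search(n)]
```

A record named "benzyl chloride" with formula C7H8 would be flagged by `chloride_name_conflicts`, while a flat (stereo-free) structure would lose a name such as "(R)-limonene" but keep "limonene". Anything flagged this way would still need a human curator to decide which side is wrong.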
Post Comments: Anyone can “Post Comments” associated with a structure. To curate data we require a login so that changes can be tracked.
Multi-level Curation and Approval
Crowd-sourcing Chemistry. Crowd-sourced curation: identify and tag errors, edit names and synonyms, identify records for deprecation. ALSO crowd-sourced deposition: anyone can deposit data (structures, text, images, analytical data).
Vancomycin: Originally 12 structures with vancomycin: incomplete stereochemistry; complete but different stereochemistry; different charge states. 1 remains after community collaboration with ChEBI.
“Collaboration” with ChEBI
Ginkgolide B
DailyMed
Quality of Structures
Quality of Structures!!!
 
“Entity Extraction”. Rule-based recognition of systematic names: use a lexeme of name fragments; rules for identifying bounds of a name. Look-up dictionary: drug names; trivial names; numbers (Registry IDs, EINECS/ELINCS). Massive look-up dictionary of validated identifiers on ChemSpider.
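The look-up side of this approach can be sketched in a few lines. This is an illustration under assumptions rather than the ChemMantis code: the tiny `LOOKUP` table stands in for the large validated dictionary, the function names are hypothetical, and "Registry IDs" is read here as CAS Registry Numbers, which carry a check digit that lets a robot reject malformed numbers.

```python
import re

# Hypothetical stand-in for the validated identifier dictionary.
LOOKUP = {
    "dichloromethane": "name",
    "ethanol": "name",
    "aspirin": "drug name",
}

# Longest terms first, guarded so a term is not matched inside a longer word.
_terms = sorted(LOOKUP, key=len, reverse=True)
TERM_RE = re.compile(
    r"(?<![A-Za-z0-9])(" + "|".join(map(re.escape, _terms)) + r")(?![A-Za-z0-9])",
    re.IGNORECASE,
)

# CAS Registry Number shape: 2-7 digits, 2 digits, 1 check digit.
CAS_RE = re.compile(r"(?<!\d)(\d{2,7})-(\d{2})-(\d)(?!\d)")


def cas_checksum_ok(part1, part2, check):
    """Weight digits 1, 2, 3, ... from the right; the sum mod 10 must equal the check digit."""
    digits = (part1 + part2)[::-1]
    total = sum(int(d) * weight for weight, d in enumerate(digits, start=1))
    return total % 10 == int(check)


def find_entities(text):
    """Return (start, end, matched text, kind) tuples in document order."""
    hits = [
        (m.start(), m.end(), m.group(1), LOOKUP[m.group(1).lower()])
        for m in TERM_RE.finditer(text)
    ]
    for m in CAS_RE.finditer(text):
        if cas_checksum_ok(*m.groups()):
            hits.append((m.start(), m.end(), m.group(0), "registry number"))
    return sorted(hits)
```

Rule-based recognition of systematic names (name fragments plus boundary rules) is a separate and much larger piece that this sketch does not attempt.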
Name Recognition: Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) at 0 °C were successively added (3,4-diaminophenyl)phenyl methanone 1 (0.40 g, 1.88 mmol) and an excess of anhydrous MgSO4 (2.00 g, 16.67 mmol). The resulting mixture was stirred for 6 hours at room temperature [18]. The mixture was filtered and washed with dichloromethane. Then the solvent was evaporated under reduced pressure to give azo Schiff base 3 as a red solid which was recrystallized from ethanol 95% (1.28 g, 91%).
Name Recognition: Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) at 0 °C were successively added (3,4-diaminophenyl)phenyl methanone 1 (0.40 g, 1.88 mmol) and an excess of anhydrous MgSO4 (2.00 g, 16.67 mmol). The resulting mixture was stirred for 6 hours at room temperature [18]. The mixture was filtered and washed with dichloromethane. Then the solvent was evaporated under reduced pressure to give azo Schiff base 3 as a red solid which was recrystallized from ethanol 95% (1.28 g, 91%).
ChemMantis: Chemical Markup And Nomenclature Transformation Integrated System
Document markup
Markup – 3 seconds!
On the fly conversion
Shorthand Formulae Supported
One Click to more Info…
Names and Structures: Dichloroacetone, Trichloromethylsilane
Ambiguity
Ambiguity in Abbreviations - DPA
IUPAC PAC Articles
Patents
A single configuration file defines entities for markup. Algorithms can be built for certain entities but the majority are dictionaries: vendors, physical properties, analytical. We can extend our system: should we integrate to PDB somehow?
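One way to picture a single configuration driving the markup is sketched below. The layout of `CONFIG`, the entity names, the placeholder vendor "ExampleChem", the temperature pattern, and the bracket-style output are all assumptions made for illustration; the slides do not show the real file format.

```python
import re

# Hypothetical configuration: most entities are dictionaries, one is pattern-based.
CONFIG = {
    "physical property": {"kind": "dictionary", "terms": ["melting point", "boiling point"]},
    "analytical": {"kind": "dictionary", "terms": ["NMR spectrum", "UV-vis spectrum"]},
    "vendor": {"kind": "dictionary", "terms": ["ExampleChem"]},  # placeholder vendor name
    "temperature": {"kind": "pattern", "regex": r"-?\d+(?:\.\d+)?\s?°C"},
}


def compile_entity(spec):
    """Turn one configuration entry into a compiled regular expression."""
    if spec["kind"] == "dictionary":
        terms = sorted(spec["terms"], key=len, reverse=True)
        body = "|".join(re.escape(t) for t in terms)
        return re.compile(r"(?<!\w)(?:" + body + r")(?!\w)", re.IGNORECASE)
    return re.compile(spec["regex"])


def mark_up(text, config=CONFIG):
    """Wrap every recognized entity as [type: text]; earlier and longer matches win overlaps."""
    spans = []
    for entity_type, spec in config.items():
        for m in compile_entity(spec).finditer(text):
            spans.append((m.start(), m.end(), entity_type))
    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    out, pos = [], 0
    for start, end, entity_type in spans:
        if start < pos:  # overlaps a span that was already emitted
            continue
        out.append(text[pos:start])
        out.append(f"[{entity_type}: {text[start:end]}]")
        pos = end
    out.append(text[pos:])
    return "".join(out)
```

Adding a new dictionary, for example species names or PDB codes, would then mean adding one entry to the configuration rather than changing the markup code.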
Nature Publications
Entity Balloons: Structures are the language of chemistry. Show structures to chemists and search/link from there. Link to PDB?
Other Dictionaries - Species: We are considering bacteria, fungi, enzymes, viruses. PDB codes?
Integrations Out to Other Sources
Reactions
Conclusions: The quality of structure-based data online should always be questioned, and that includes ChemSpider. Data on ChemSpider are being added and curated on a daily basis, but we always need more eyeballs helping. ChemSpider has a large validated structure-name dictionary. Chemical name extraction and document markup are very enabling.
