Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry
Nature Publishing Group, 11/2008
Antony Williams
Imagine a time when…
• The internet is searchable by chemical structure and substructure (e.g. Wikipedia, Google Scholar)
• Chemistry articles are indexed and searchable by a free online service
• The web is linked together through the “language of chemistry”
• Publicly funded research data can be shared and discussed in the open, maybe as ONS (Open Notebook Science)?
• Cheminformatics has as much of a public face as bioinformatics
ChemSpider - A Search Engine for Chemists
Questions a chemist might ask…
• What is the melting point of n-butanol?
• What is the chemical structure of Xanax?
• Chemically, what is phenolphthalein?
• What are the stereocenters of cholesterol?
• Where can I find publications about xylene?
• What are the different trade names for Ketoconazole?
• What is the NMR spectrum of Aspirin?
• What are the safety handling issues for Thymol Blue?
ChemSpider can answer all of these questions
What is a Structure? Ask a computer…ask a chemist
Tell Me About Glutathione
Link outs
Links out to KEGG (Kyoto Encyclopedia of Genes and Genomes)
How many names does a compound have?
ChemSpider Data Content
• Over 21.5 million unique chemical structures from ca. 150 data sources
• Online databases – PubChem, DrugBank, KEGG, Wikipedia
• Literature – PubMed, J Het Chem, Nature, RSC, Open Access
• Chemical vendors – over 40 different vendors and growing
• Personal depositions – individual contributions
• Content database vendors
• Analytical data collections
• Patents
• Web scraping
• Content is linked back to the original data sources
Other Searches
• What compounds have a mass of 300 ± 0.001?
• Or search a combination of intrinsic/predicted properties
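The mass query above amounts to a tolerance-window filter. The following is a minimal sketch of that idea only, not ChemSpider's implementation: the `Record` type, the `search_by_mass` function and the three masses are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    name: str                 # placeholder label, not a real compound
    monoisotopic_mass: float  # made-up value for illustration


def search_by_mass(records, target, tolerance):
    """Return the records whose mass lies within target +/- tolerance."""
    return [r for r in records if abs(r.monoisotopic_mass - target) <= tolerance]


# Hypothetical data set: two entries fall inside 300 +/- 0.001, one does not.
RECORDS = [
    Record("compound A", 300.0004),
    Record("compound B", 300.0020),
    Record("compound C", 299.9995),
]

hits = search_by_mass(RECORDS, target=300.0, tolerance=0.001)
print([r.name for r in hits])
```

A production system would presumably run the same range condition against an indexed mass column rather than scanning a list.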
Other Searches
Complex Search
The Quality of Data Online…
• Aggregating data opens up quality issues
• Structure-identifier associations are “dirty”
• Structures are COMMONLY incorrect
• Manual curation of small databases is enough work – what about millions of structures?
• Structures are far from perfect. What is a “correct structure”? Full stereochemistry? Historical timeline of structure? Who is the authority?
Who holds THE Quality Authority?
• Chemical Abstracts Service is the structural authority today.
• 1400 employees, world standard in chemistry information
• 101 years of knowledge, process and expertise.
• How can an online, free access system peacefully co-exist with the authority?
Quality is a Major Issue – Search Butanol (OLD EXAMPLE, now fixed)
Wikipedia Chemistry Curation project
• Only ca. 5000 organic structures, 7000 total structures
• Almost a year of work so far for a team of 6 people
• Many errors removed in the process.
• Curation process is a daily event for users/depositors
• Slow and torturous process
• http://en.wikipedia.org/wiki/Talk:Tacrolimus#IUPAC_Name_and_structure
Wikipedia Curation
• Looking for self-consistency across a Wikipedia page
• Primary key is the article TITLE
• The chemical shown needs to match the title
• Cyclic self-consistency – and decisions must get made
Viagra or Sildenafil
Other issues…
Charges
Sugars – Machine Readable vs Aesthetics Haworth  Stereo  Fischer
Wikipedia – Crowdsourcing Chemistry
Thymol Blue on ChemSpider
Data online includes:
• UV-vis spectrum
• Measured experimental properties
• Link to Wikipedia article
• Links to chromatography details
• Multiple identifiers/trade names etc.
• Links to vendors/suppliers/other databases
• Safety information
http://www.chemspider.com/q/thymol%20blue
Differences between ChemSpider/Wikipedia
• Wikipedia: ~5000 organics, 2000 others | ChemSpider: >21 million unique structures
• Wikipedia: Text | ChemSpider: Complex queries – Properties, Text, structure/substructure, OA publishers, Data Sources, …
• Wikipedia: Detailed compound monographs | ChemSpider: Compound monographs linked
• Wikipedia: ???? | ChemSpider: 6000 people/day; 1900 registered
• Wikipedia: No | ChemSpider: Prediction of properties
• Wikipedia: Active editors > 50 (?) | ChemSpider: Active depositors/curators – 30
• Wikipedia: No, but links. | ChemSpider: Analytical Data
Differences between Wikipedia/ChemSpider
• ChemSpider: Growing reputation as focused on quality | Wikipedia: Worldwide reputation as quality source – good and bad
• ChemSpider: Chemistry is the focus of ‘Spider | Wikipedia: Chemistry is a subset of the ‘Pedia
• ChemSpider: Mixed “licensing” | Wikipedia: GFDL licensing for everything
• ChemSpider: Growing team of advocates, curators and users | Wikipedia: Strong team of WP:Chem advocates, curators and admins
• ChemSpider: “Out of a basement” on three servers and 5 volunteers | Wikipedia: Established infrastructure and Wikimedia Foundation team
• ChemSpider: Primarily Microsoft .NET technologies with OS components | Wikipedia: Supported by the tried and tested MediaWiki platform
Crowd-sourcing Curation
• How to curate data for millions of structures?
• Robot processes can clean up depositions
  • Search for Chloride and check molecular formula for Cl
  • Check for stereochemistry and remove names with stereo
• Provide a simple-to-use platform to curate, annotate and tag data
• Provide curator administration to prevent vandalism (Veropedia)
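The "search for Chloride and check molecular formula for Cl" robot can be sketched as a simple name-versus-formula consistency check. This is an illustrative sketch under assumptions, not the actual ChemSpider robot: the function names and the sample depositions are hypothetical, and one formula is deliberately wrong to trigger the flag.

```python
import re

# An element symbol is one capital letter, optionally followed by a lowercase one.
ELEMENT = re.compile(r"[A-Z][a-z]?")


def elements_in(formula):
    """Return the set of element symbols appearing in a molecular formula."""
    return set(ELEMENT.findall(formula))


def flag_chloride_mismatches(records):
    """Flag names that mention 'chloride' but whose formula lacks Cl."""
    flagged = []
    for name, formula in records:
        if "chloride" in name.lower() and "Cl" not in elements_in(formula):
            flagged.append(name)
    return flagged


# Hypothetical depositions as (name, molecular formula) pairs.
DEPOSITIONS = [
    ("methyl chloride", "CH3Cl"),
    ("benzoyl chloride", "C7H6O"),  # deliberately wrong: this is benzaldehyde's formula
    ("ethanol", "C2H6O"),
]

print(flag_chloride_mismatches(DEPOSITIONS))
```

Flagged records would still need a human curator to decide whether the name or the structure is at fault.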
Post Comments
• Anyone can “Post Comments” associated with a structure.
• To curate data we require a login so that changes can be tracked
Multi-level Curation and Approval
Crowd-sourcing Chemistry
• Crowd-sourced curation: identify and tag errors, edit names and synonyms, identify records for deprecation
• ALSO crowd-sourced deposition: anyone can deposit data (structures, text, images, analytical data)
DailyMed
Quality of Structures
Quality of Structures!!!
Structure-Centric
• We want to search “information” by structure, substructure, similarity of structure
• Specific focus on Open Chemistry at present
• Standard approaches would be:
  • Identify chemical names (“entity extraction”)
  • Convert chemical names to structures and index
• ChemSpider has a validated dictionary of structure-name pairs
• Use name extraction, name conversion and dictionary look-up. THEN curate.
“Entity Extraction”
• Rule-based recognition of systematic names:
  • Use a lexicon of name fragments
  • Rules for identifying bounds of a name
• Look-up dictionary:
  • Drug names
  • Trivial names
  • Numbers: Registry IDs, EINECS/ELINCS
• Massive look-up dictionary of validated identifiers on ChemSpider
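The two recognition routes on this slide, dictionary look-up and rule-based recognition from name fragments, can be combined in a few lines. The sketch below is a toy under stated assumptions: the three-entry dictionary, the placeholder identifiers and the tiny fragment grammar are invented here and are far smaller than any real lexicon.

```python
import re

# Hypothetical mini-dictionary; identifiers are placeholders, not real ChemSpider IDs.
NAME_DICTIONARY = {
    "dichloromethane": "ID-0001",
    "ethanol": "ID-0002",
    "aspirin": "ID-0003",
}

# Toy fragment grammar for systematic names (an assumption, far from complete).
SYSTEMATIC_NAME = re.compile(
    r"(?:di|tri|chloro|bromo)*"  # optional multiplier/substituent prefixes
    r"(?:meth|eth|prop|but)"     # chain-length stem
    r"(?:ane|anol|ene)"          # suffix
)


def extract_entities(text):
    """Return (token, source) pairs for tokens recognised as chemical names."""
    found = []
    for token in re.findall(r"[A-Za-z]+", text):
        key = token.lower()
        if key in NAME_DICTIONARY:
            found.append((token, "dictionary"))
        elif SYSTEMATIC_NAME.fullmatch(key):
            found.append((token, "rule"))
    return found


TEXT = (
    "The mixture was filtered and washed with dichloromethane, "
    "recrystallized from ethanol, and treated with bromoethane. "
    "He took an aspirin."
)

print(extract_entities(TEXT))
```

The dictionary is checked first because a validated entry is more trustworthy than a grammar match; the rule catches names, such as bromoethane here, that the dictionary lacks.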
 
Name Recognition
Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) at 0 °C were successively added (3,4-diaminophenyl)phenyl methanone 1 (0.40 g, 1.88 mmol) and an excess of anhydrous MgSO4 (2.00 g, 16.67 mmol). The resulting mixture was stirred for 6 hours at room temperature [18]. The mixture was filtered and washed with dichloromethane. Then the solvent was evaporated under reduced pressure to give azo Schiff base 3 as a red solid which was recrystallized from ethanol 95% (1.28 g, 91 %)
Name Recognition (recognized names marked with asterisks)
Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry *CH2Cl2* (30.00 mL) at 0 °C were successively added *(3,4-diaminophenyl)phenyl methanone* 1 (0.40 g, 1.88 mmol) and an excess of anhydrous *MgSO4* (2.00 g, 16.67 mmol). The resulting mixture was stirred for 6 hours at room temperature [18]. The mixture was filtered and washed with *dichloromethane*. Then the solvent was evaporated under reduced pressure to give azo Schiff base 3 as a red solid which was recrystallized from *ethanol* 95% (1.28 g, 91 %)
How Many Chemical Names? “ She had the drive to derive success in any venture and was well versed in Karate. When the man in the tartan shirt approached her with a dagger in his hand she spat in his face, took the stance of a commando and took advantage of his shock to release the dagger from his grip and causing him to recoil. He went home and took an aspirin after the beating.”
How Many Chemical Names? “ She had the  drive  to derive  success  in any venture and was well  versed  in  Karate . When the man  in  the  tartan  shirt approached her with a  dagger  in his hand she  spat  in his face, took the stance  of  a  commando  and took  advantage  of his shock to  release  the dagger from his grip and causing him to  recoil .  He  went home and took an  aspirin  after  the  beating.”
ChemMantis: Chemical Markup And Nomenclature Transformation Integrated System
Making Open Access Articles Searchable: Proof of Concept
• Can we HOST chemistry Open Access articles on ChemSpider and add value?
• Can we identify chemical names in Open Access articles in a user-friendly manner?
• Can we convert names to structures in Open Access articles, expand ChemSpider and provide structure searching of Open Access chemistry articles?
• Can we provide an environment for chemists to mark up their own articles and crowd-source markup of an archive?
Document markup
• ChemSpider is now hosting Open Access articles from MDPI (Molecular Diversity Preservation International)
• Hosting the Molbank collection at present
A Standard for Document Markup?
• NLM-DTD: National Library of Medicine Document Type Definition
• Approved markup definitions to apply to journal articles – extended as necessary for our purposes
NLM/DTD markup
Chemistry and Biology Menus can be extended as necessary
Document markup
Markup – 3 seconds!
On the fly conversion
Shorthand Formulae Supported
One Click to more Info…
Structure Image Conversion
Two Seconds Later
Not Always Perfect….
A Platform for Markup
Can we provide a platform for document markup for chemists? Workflow:
• Upload Word docs or RTF files, or point to HTML and load
• Apply entity extraction, convert names to structures, mark up automatically and ask for user participation
• Publish final version with NLM-DTD markup
• Deposit all structures on ChemSpider under embargo and wait for article DOI to release
Challenges
• Computer software can generate chemical names better than the majority of chemists
• The majority of chemical names are generated by humans, and are incorrect: they convert to the wrong structure or are ambiguous
• One name, multiple structures
Names and Structures Dichloroacetone Trichloromethylsilane
Ambiguity
Ambiguity in Abbreviations - DPA
Ambiguity in Abbreviations - THF
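The "one name, multiple structures" problem can be detected mechanically once name-structure pairs are indexed by name. The sketch below assumes a flat list of pairs and uses unambiguous names as stand-ins for structures; a real dictionary would more plausibly store connection tables or structure identifiers. The pair list and the `find_ambiguous` function are hypothetical.

```python
from collections import defaultdict

# Hypothetical (name, structure) pairs; structures are written as unambiguous
# names purely for readability.
PAIRS = [
    ("dichloroacetone", "1,1-dichloropropan-2-one"),
    ("dichloroacetone", "1,3-dichloropropan-2-one"),
    ("trichloromethylsilane", "trichloro(methyl)silane"),
    ("trichloromethylsilane", "(trichloromethyl)silane"),
    ("THF", "tetrahydrofuran"),
    ("THF", "tetrahydrofolic acid"),
    ("aspirin", "2-acetoxybenzoic acid"),
]


def find_ambiguous(pairs):
    """Map each name (case-insensitive) that resolves to more than one structure."""
    index = defaultdict(set)
    for name, structure in pairs:
        index[name.lower()].add(structure)
    return {name: sorted(structs) for name, structs in index.items() if len(structs) > 1}


ambiguous = find_ambiguous(PAIRS)
for name, structures in ambiguous.items():
    print(name, "->", structures)
```

Names flagged this way are candidates for manual curation rather than automatic conversion.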
Import is Easy
• Make articles Public/Private (embargo date soon)
• Auto-markup and check by user
IUPAC PAC Articles
Supports Word .DOC, HTML, RTF
Drexel University Documents
Patents
Single Configuration File defines entities for markup
• Algorithms can be built for certain entities but the majority are dictionaries – vendors, physical properties, analytical
• We can extend our system to support your needs based on dictionaries – what does NPG need/not need?
Nature Publications
Entity Balloons
• Structures are the language of chemistry
• Show structures to chemists and search/link from there
Other Dictionaries - Species
We are considering:
• Bacteria
• Fungi
• Enzymes
• Viruses
• PDB codes…
Integrations Out to Other Sources
Reactions
Manual Curation is Always Necessary
Text-Indexing and ChemSpider?
• ChemSpider text-indexes almost 500,000 Open Access and Free Access articles
• The collection is growing and more publishers have already agreed.
• Theses will be included in the future.
Open Access Literature Search
Conclusions
• The quality of structure-based data online should always be questioned – that includes ChemSpider
• Data on ChemSpider are being added and curated on a daily basis, but we always need more eyeballs helping
• ChemSpider has a large validated structure-name dictionary
• Chemical name extraction and document markup are very enabling
Oops…
