[1]
Chemicalize.org, SureChemOpen, PubChem and
the InChIKey: A heavenly conjunction with
transformative utility
Christopher Southan, TW2Informatics, Göteborg, Sweden,
ChemAxon UGM, Budapest, May 2013
Image credit: http://www.eso.org/public/images/yb_vlt_moon_cnn_cc/
[2]
Dr Christopher Southan, Ph.D., M.Sc., B.Sc.
TW2Informatics: http://www.cdsouthan.info/Consult/CDS_cons.htm
Mobile: +46(0)702-530710
Skype: cdsouthan
Email: cdsouthan@hotmail.com
Twitter: http://twitter.com/#!/cdsouthan
Blog: http://cdsouthan.blogspot.com/
LinkedIn: http://www.linkedin.com/in/cdsouthan
Publications: http://www.citeulike.org/user/cdsouthan/order/year,,/publications
Presentations: http://www.slideshare.net/cdsouthan
[3]
Abstract
The ChemAxon name-to-structure functionality is not only a component of the SureChem
patent extraction pipeline but also powers chemicalize.org. Both operations are now
submitting sources to PubChem. The former has deposited structures that bring the
patent-extracted total in PubChem to 14.5 mill. CIDs. The deposition from chemicalize.org
is ~0.3 mill., but it has been actively selected by users and is 20% unique. The final
conjunction is that all three sources generate the InChIKey (IK), which turns Google into
a de facto merge of PubChem and ChemSpider covering ~50 mill. structures.
Chemicalize.org users can convert new patents, other external or internal documents
and web-based text. Individual results can be Googled or searched against
SureChemOpen, and bulk extractions can be triaged against PubChem. It thus becomes
possible to connect chemistry between patents, papers, abstracts and database
records via exact match or similarity searching. When SureChem and
chemicalize.org update their submissions, relationships with the other ~200 PubChem
sources (including ChEMBL and vendor databases) are re-computed and new CID
links are made. The synergy between SureChem and chemicalize.org is powerful because
matches between them (~0.15 mill.) via SureChemOpen give occurrence statistics
and the location of the structure within patents. The applications of chemicalize.org
are extended by web tools such as Venny for determining intersects from multiple
extractions and CheS-Mapper for cluster visualization. These utility expansions will be
illustrated by documents specifying BACE1 inhibitors for Alzheimer’s disease.
[4]
Auspicious Conjunctions 2012-13
• PubChem: global chemistry to slice ‘n dice
• SureChemOpen: majority of patent chemistry opened up
• Chemicalize.org: chemistry extractable from any text tombs
• Chemical images: patents extracted in SureChemOpen, OSRA
handles papers
• InChIKey indexing in Google
• ChemSpider: crowdsourcing chemistry quality
• Expanding toolbox, e.g. OPSIN, Venny, CheS-Mapper
• SciBite alerts
• Expanding preview and surfacing options, e.g. ChEMBLntd, GitHub,
OSDD, Open Lab Books, figshare, etc.
• Rise of mobile chemistry
[5]
Databases <> structures <> documents
Abstracts
Patents
Papers
15 mill
0.2 mill (MeSH)
0.8 mill
(ChEMBL)
12K
Google InChIKey ~ 50 million
(47m PubChem + 33m
UniChem + 28m ChemSpider)
[6]
Triaging chemistry from text
• Identify the structure specification types, e.g.
– Semantic names (all sources)
– Code names (press releases, papers and abstracts)
– IUPAC names (papers, patents and abstracts)
– Images (papers, patents, & Google images)
– SMILES (open lab books)
– InChI strings (open lab books)
– SDF files (open lab books & GitHub)
Convert these to a structure (e.g. SDF, SMILES, InChI) then:
– Search InChIKey in Google
– Search major databases
– Search SureChemOpen
– Compare extracted sets for intersects and diffs
– Extend exact match connectivity with similarity searching
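The "Search InChIKey in Google" step can be sketched in a few lines of Python. This is a minimal illustration, not the workflow's actual tooling: the function names are hypothetical, the check is purely syntactic (14 letters, 10 letters, 1 letter, hyphen-separated), and the key used is the standard InChIKey for aspirin. Searching on the first 14-character block alone is one plausible way to widen an exact match to records sharing the same connectivity.

```python
import re
from urllib.parse import quote_plus

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_inchikey(text: str) -> bool:
    """Syntactic check only; it does not prove the key maps to a real structure."""
    return INCHIKEY_RE.fullmatch(text.strip()) is not None


def connectivity_block(inchikey: str) -> str:
    """Return the first block, which hashes the connectivity layer."""
    return inchikey.strip().split("-")[0]


def google_query_url(inchikey: str) -> str:
    """Build an exact-phrase Google search URL for a full InChIKey."""
    if not is_inchikey(inchikey):
        raise ValueError(f"not an InChIKey: {inchikey!r}")
    # Double quotes request an exact-phrase match.
    return "https://www.google.com/search?q=" + quote_plus(f'"{inchikey.strip()}"')


aspirin = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
print(google_query_url(aspirin))
print(connectivity_block(aspirin))
```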
[7]
PubChem Composition
[8]
SureChemOpen Composition (in PubChem)
[9]
Chemicalize.org Composition (in PubChem)
[10]
BACE2 Conjunctions
[11]
BACE2 Conjunctions
[12]
Chemicalize.org Triage
[13]
BACE2 Conjunctions
1. WO2013054291 > chemicalize.org
2. Download 450 structures
3. Upload to PubChem search
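The slide performs step 3 through the PubChem web search. A scripted equivalent could look like the sketch below, which assumes the extracted structures have already been converted to InChIKeys. It builds PubChem PUG REST lookup URLs and splits a set of keys into matched and novel. The `triage` helper, the stub lookup table and the all-"A" placeholder key are hypothetical; a real run would fetch each URL instead of using the stub.

```python
from typing import Callable, Dict, Iterable, List, Tuple

PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_lookup_url(inchikey: str) -> str:
    """PUG REST URL that returns the CIDs registered for an InChIKey."""
    return f"{PUG_BASE}/compound/inchikey/{inchikey}/cids/JSON"


def triage(
    inchikeys: Iterable[str],
    lookup: Callable[[str], List[int]],
) -> Tuple[Dict[str, List[int]], List[str]]:
    """Split keys into those with CIDs (matched) and those without (novel)."""
    matched: Dict[str, List[int]] = {}
    novel: List[str] = []
    for key in dict.fromkeys(inchikeys):  # de-duplicate, keep first-seen order
        cids = lookup(key)
        if cids:
            matched[key] = cids
        else:
            novel.append(key)
    return matched, novel


# Stub standing in for network calls (assumption): aspirin is CID 2244,
# and the all-"A" key is a made-up placeholder for an unregistered structure.
STUB = {"BSYNRYMUTXBXSQ-UHFFFAOYSA-N": [2244]}
extracted = [
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
]
matched, novel = triage(extracted, lambda key: STUB.get(key, []))
print(len(matched), "matched;", len(novel), "novel")
```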
[14]
Clustering document extraction sets: CheS-Mapper
[15]
Venny: intersects, diffs, de-dupes and merges
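The same four operations can be approximated offline with plain Python sets, which may help when extractions are too large to paste into a web tool. This is a sketch, not Venny's implementation: `venn_regions` is a hypothetical helper, and the short labels K1 to K5 stand in for real InChIKeys.

```python
from typing import Dict, FrozenSet, Set


def venn_regions(named_sets: Dict[str, Set[str]]) -> Dict[FrozenSet[str], Set[str]]:
    """Group every item by the exact combination of sets that contain it."""
    regions: Dict[FrozenSet[str], Set[str]] = {}
    for item in set().union(*named_sets.values()):
        owners = frozenset(
            name for name, members in named_sets.items() if item in members
        )
        regions.setdefault(owners, set()).add(item)
    return regions


# Placeholder identifiers (assumption); building sets from lists de-dupes them.
extractions = {
    "patent": set(["K1", "K2", "K3", "K3"]),
    "paper": {"K2", "K3", "K4"},
    "pubchem": {"K3", "K5"},
}

regions = venn_regions(extractions)
merged = set().union(*extractions.values())             # merge
shared = set.intersection(*extractions.values())        # intersect of all sets
patent_only = regions[frozenset({"patent"})]            # diff
print(sorted(merged), sorted(shared), sorted(patent_only))
```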
[16]
Conclusions
• Transformative opening up of chemistry > biology via structure > document
connectivity
• Open mining of patent metadata and data
• Expanding toolbox
• Inexorable expansion of open-access publishing
But:
• Journal chemistry extraction > database records still slow
• Text mining of journals still restricted
• Author annotation and direct db submission rare
• Pharmaceutical research publications are still blinding structures (see
PMID: 23159359)
[17]
References
http://www.slideshare.net/cdsouthan/the-patent-chemistry-big-bang-in-pubchem
http://www.slideshare.net/cdsouthan/cs-cax-bioitchemicalizeposter03apr
http://www.ncbi.nlm.nih.gov/pubmed/23399051
http://www.ncbi.nlm.nih.gov/pubmed/23618056
http://www.ncbi.nlm.nih.gov/pubmed/23506624

