Automatic classification, logical definitions
Janna Hastings, EBI Cheminformatics and Metabolism
2nd ChEBI User Group Workshop, 24 June 2010
Chemistry is a domain with a rich heritage of classification based on structural features. ChEBI ontology 20.10.10
The ChEBI ontology contains a large asserted is-a hierarchy of chemical classes and compounds. Each chemical class is clearly defined in natural language.
Why automatic classification?
  • A reasoner can help manage your complex hierarchy
  • Minimise curation overhead by harnessing the knowledge already captured in the ontology and the structures already drawn for the new chemicals
  • Avoid redundancy and excessive pre-coordination of terms
OWL 2
  • Provides a decidable set of constructors for defining classes
  • Constructors: oneOf, disjointWith, sameClassAs, rdfs:subClassOf, unionOf, intersectionOf, complementOf, minCardinality, maxCardinality, cardinality, inverseOf, TransitiveProperty, SymmetricProperty, FunctionalProperty, InverseFunctionalProperty, allValuesFrom, someValuesFrom
Necessary and sufficient conditions
  • Necessary conditions:
  • ‘hydrocarbon molecular entity’ has_atom some ‘carbon atom’
  • ‘hydrocarbon molecular entity’ has_atom some ‘hydrogen atom’
  • Pictured example: neopentane
Necessary and sufficient conditions
  • Sufficient conditions:
  • ‘hydrocarbon molecular entity’ has_atom only (‘carbon atom’ or ‘hydrogen atom’)
  • Pictured example: neopentane
OWL Reasoning
  • neopentane has_atom some ‘carbon atom’
  • neopentane has_atom some ‘hydrogen atom’
  • Inferred: neopentane subClassOf (is_a) ‘hydrocarbon molecular entity’
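The hydrocarbon example above can be sketched outside OWL as a simple set test. This is only an illustration, not how an OWL reasoner works: it assumes the atom list for a molecule is complete (a closed world), whereas a reasoner would need the "only" condition asserted for the molecule as well. The function and variable names are hypothetical.

```python
# Illustrative element set for the hydrocarbon definition (not a ChEBI identifier).
HYDROCARBON_ELEMENTS = {"C", "H"}


def is_hydrocarbon(atoms):
    """Check the 'some' and 'only' conditions from the slides.

    Assumption: `atoms` lists every atom in the molecule (closed world).
    """
    elements = set(atoms)
    # has_atom some 'carbon atom' and has_atom some 'hydrogen atom'
    has_carbon_and_hydrogen = HYDROCARBON_ELEMENTS <= elements
    # has_atom only ('carbon atom' or 'hydrogen atom')
    only_carbon_or_hydrogen = elements <= HYDROCARBON_ELEMENTS
    return has_carbon_and_hydrogen and only_carbon_or_hydrogen


neopentane = ["C"] * 5 + ["H"] * 12      # C5H12
ethanol = ["C"] * 2 + ["H"] * 6 + ["O"]  # C2H6O

print(is_hydrocarbon(neopentane))
print(is_hydrocarbon(ethanol))
```

Neopentane passes both conditions; ethanol fails the "only" condition because of its oxygen atom.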
Parts and properties
  • A chemical ontology consists of chemical classes, which can be defined by parts of structures and/or properties of structures
  • carboxylic acid: if molecule has part some carboxy group
  • cyclic molecule: if molecule has property cyclic, i.e. a self-connected cyclic path exists through the molecule’s atoms
Pre-coordination vs. post-coordination
  • Given a set of properties that can be used in class definitions, you get an explosion of possible combinations
  • e.g. ‘cyclic’
Pre-coordination vs. post-coordination
  • Other properties: saturated, radical, ion/anion, ...
Pre-coordination vs. post-coordination
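The explosion of combinations can be made concrete with a small sketch. It enumerates every non-empty combination of a handful of properties named on the slides; the generated class names are hypothetical and only illustrate how many pre-coordinated terms would be needed.

```python
from itertools import combinations


def precoordinated_classes(properties):
    """Enumerate one hypothetical class name per non-empty property combination."""
    names = []
    for size in range(1, len(properties) + 1):
        for combo in combinations(properties, size):
            names.append(" ".join(combo) + " molecular entity")
    return names


properties = ["cyclic", "saturated", "radical", "ion"]
classes = precoordinated_classes(properties)

# n properties give 2**n - 1 non-empty combinations.
print(len(classes))
```

With post-coordination, only the individual properties need to be defined, and combinations are expressed in class definitions when needed.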
Logically defining chemical classes
  • Goal: transform the textual definitions into logical definitions, which are then accessible for automated reasoning
  • ‘carbonyl compound’ ↔ has_part some (‘carbonyl group’)
  • ‘carboxylic acid’ ↔ has_part some (‘carboxy group’)
  • ‘monocarboxylic acid’ ↔ has_part exactly 1 (‘carboxy group’)
  • ‘hydroxy monocarboxylic acid’ ↔ has_part exactly 1 (‘carboxy group’) and has_atom only some (‘hydrogen atom’ or ‘carbon atom’ or ‘oxygen atom’)
  • (Diagram: the classes are linked by is_a relationships)
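A minimal sketch of how the first three definitions could be evaluated once part counts are known. It is not an OWL reasoner: the definitions are plain predicates over a dictionary of part counts, and the counts for benzoic acid are hand-written assumptions (the carboxy group is counted as containing a carbonyl group).

```python
# Each class maps to a predicate over a {part name: count} dictionary.
# 'some' becomes "at least one"; 'exactly 1' becomes "equal to one".
DEFINITIONS = {
    "carbonyl compound": lambda parts: parts.get("carbonyl group", 0) >= 1,
    "carboxylic acid": lambda parts: parts.get("carboxy group", 0) >= 1,
    "monocarboxylic acid": lambda parts: parts.get("carboxy group", 0) == 1,
}


def classify(parts):
    """Return every class whose definition the part counts satisfy."""
    return [name for name, satisfied in DEFINITIONS.items() if satisfied(parts)]


# Assumed part counts, written by hand for illustration.
benzoic_acid_parts = {"carbonyl group": 1, "carboxy group": 1}
tricarboxylic_parts = {"carbonyl group": 3, "carboxy group": 3}

print(classify(benzoic_acid_parts))
print(classify(tricarboxylic_parts))
```

A molecule with three carboxy groups still satisfies the "some" definitions but fails the "exactly 1" definition, which mirrors the is_a ordering of the classes.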
Foundational classification of molecules
Foundational classification of molecules
  • XXXX molecular entity ≝ ∃ has_atom some XXXX atom
  • carbon molecular entity ≝ ∃ has_atom some carbon atom
  • has_part in ChEBI
Classification based on regularities in naming
  • name ends with -oic acid → is_a oxoacid (CHEBI:24833)
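The naming rule above amounts to a suffix test. A minimal sketch, assuming names are plain strings; the function name is hypothetical, and a name-based rule like this is a heuristic rather than a structural check.

```python
def suggests_oxoacid(name):
    """Return True if the name ends with '-oic acid' (case-insensitive)."""
    return name.strip().lower().endswith("oic acid")


print(suggests_oxoacid("benzoic acid"))
print(suggests_oxoacid("benzene"))
```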
Classification based on chemical structure
  • Best would be to include the structure in the ontology
  • Without structure, all parts must be explicitly asserted (combinatorial explosion for larger molecules)
  • But the structure of complex molecules breaks the OWL tree-model requirement: such a structure does not have a model in the shape of a tree
Recent work: description graphs
  • Description graphs are a recent extension to OWL 2 that allows graph structures to be captured at the class level
  • We generated these for chemicals in ChEBI
Rules for properties
    molecule(?x),
    atom(?a1), atom(?a2), atom(?a3), atom(?a4),
    bond(?b1), bond(?b2), bond(?b3), bond(?b4),
    has_atom(?x, ?a1), has_atom(?x, ?a2), has_atom(?x, ?a3), has_atom(?x, ?a4),
    has_bond(?a1, ?b1), has_bond(?a1, ?b4),
    has_bond(?a2, ?b1), has_bond(?a2, ?b2),
    has_bond(?a3, ?b2), has_bond(?a3, ?b3),
    has_bond(?a4, ?b3), has_bond(?a4, ?b4)
    -> cyclic_entity(?x)
  • Pictured examples: cyclobutane, tetrahedrane
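The rule above matches four atoms joined in a closed chain of four bonds. A minimal sketch of the same pattern over a plain atom-and-bond graph follows. It is an illustration, not the reasoner's matching procedure: bonds are modelled as unordered atom pairs rather than bond individuals, and the sketch requires the four atoms to be distinct, which the rule as written does not state. Only carbon atoms are listed in the example molecules.

```python
from itertools import permutations


def has_four_membered_ring(atoms, bonds):
    """Return True if four distinct atoms form the cycle a1-a2-a3-a4-a1."""
    bonded = {frozenset(pair) for pair in bonds}
    for a1, a2, a3, a4 in permutations(atoms, 4):
        ring = ((a1, a2), (a2, a3), (a3, a4), (a4, a1))
        if all(frozenset(pair) in bonded for pair in ring):
            return True
    return False


carbons = [0, 1, 2, 3]
cyclobutane_bonds = [(0, 1), (1, 2), (2, 3), (3, 0)]
tetrahedrane_bonds = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
butane_bonds = [(0, 1), (1, 2), (2, 3)]  # open chain, no ring

print(has_four_membered_ring(carbons, cyclobutane_bonds))
print(has_four_membered_ring(carbons, tetrahedrane_bonds))
print(has_four_membered_ring(carbons, butane_bonds))
```

A rule of this shape only covers four-membered cycles; other ring sizes would each need their own pattern.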
Rules for classes defined by parts
    molecule(?y),
    atom(?a0), oxygen_atom(?a1), carbon_atom(?a2), oxygen_atom(?a3),
    has_atom(?y, ?a0), has_atom(?y, ?a1), has_atom(?y, ?a2), has_atom(?y, ?a3),
    double_bond(?b0), single_bond(?b1), single_bond(?b2),
    has_bond(?a0, ?b2),
    has_bond(?a1, ?b1),
    has_bond(?a2, ?b0), has_bond(?a2, ?b1), has_bond(?a2, ?b2),
    has_bond(?a3, ?b0)
    -> carboxylic_acid(?y)
  • benzoic acid has this part, so it is a carboxylic acid
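The pattern in this rule is a carbon atom with a double bond to one oxygen, a single bond to another oxygen, and a single bond to any further atom. A minimal sketch of that test on a hand-built graph follows. The data layout (element dictionary plus `(atom, atom, order)` bond tuples) and all names are assumptions for illustration; hydrogens are omitted and the benzene ring is written with alternating bond orders. Like the rule as written, the sketch does not check what is attached to the single-bonded oxygen.

```python
def has_carboxy_pattern(atoms, bonds):
    """Return True if some carbon matches the rule's bond pattern.

    atoms: {atom id: element symbol}
    bonds: list of (atom id, atom id, bond order)
    """
    neighbours = {atom_id: [] for atom_id in atoms}
    for a, b, order in bonds:
        neighbours[a].append((b, order))
        neighbours[b].append((a, order))

    for atom_id, element in atoms.items():
        if element != "C":
            continue
        nbrs = neighbours[atom_id]
        double_oxygens = [n for n, o in nbrs if o == 2 and atoms[n] == "O"]
        single_oxygens = [n for n, o in nbrs if o == 1 and atoms[n] == "O"]
        single_any = [n for n, o in nbrs if o == 1]
        for oxygen in single_oxygens:
            # A second single-bonded neighbour (?a0) must differ from this oxygen.
            if double_oxygens and any(n != oxygen for n in single_any):
                return True
    return False


# Benzoic acid heavy atoms: ring carbons 0-5, carboxy carbon 6, oxygens 7 and 8.
benzoic_acid_atoms = {0: "C", 1: "C", 2: "C", 3: "C", 4: "C", 5: "C",
                      6: "C", 7: "O", 8: "O"}
benzoic_acid_bonds = [(0, 1, 2), (1, 2, 1), (2, 3, 2), (3, 4, 1), (4, 5, 2),
                      (5, 0, 1), (0, 6, 1), (6, 7, 2), (6, 8, 1)]

# Acetone heavy atoms: carbonyl carbon 1 with a double-bonded oxygen only.
acetone_atoms = {0: "C", 1: "C", 2: "C", 3: "O"}
acetone_bonds = [(0, 1, 1), (1, 2, 1), (1, 3, 2)]

print(has_carboxy_pattern(benzoic_acid_atoms, benzoic_acid_bonds))
print(has_carboxy_pattern(acetone_atoms, acetone_bonds))
```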
Testing the reasoning
  • Can we use a reasoner to deduce the classification hierarchy based on the graphs and rules?
  • No asserted hierarchy between test classes and molecules with generated graphs
Results
  • Inferred hierarchy shows classified molecules
That’s great, but...
Simple substructure search
  • Can be done with cheminformatics software outside the ontology for a defined list of groups
  • Get a list of groups in ChEBI
Substructure search
  • benzoic acid has this part, so it is a carboxylic acid
Goal
  • We extract features from the structural specifications of chemical compounds using standard cheminformatics techniques (CDK) and use these to automatically classify compounds into defined classes
  • Example compound: 3β-hydroxy-4β-methyl-5α-cholest-7-ene-4α-carboxylic acid
  • has_part exactly 1 (‘carboxy group’)
  • has_part some (‘cholesterol’)
  • has_part only some (‘carbon atom’ or ‘oxygen atom’ or ‘hydrogen atom’)
  • Class: hydroxy monocarboxylic acid
Elements of chemical class definitions
  • Composition and cardinality: "tricarboxylic acid" can be defined as a compound containing exactly three carboxy groups
  • Skeleton: "metalloporphyrins" can be defined as any compound containing a porphyrin skeleton and a metal atom
  • But beware! Skeleton is not always substructure
Elements of chemical class definitions
  • Number and arrangement of rings in a ring system: bicyclic compound, polycyclic cage
  • Properties such as charge and unpaired electrons: ion, radical
  • Structural formula: alkane: acyclic branched or unbranched hydrocarbon having the general formula CnH2n+2
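The general formula CnH2n+2 can be checked from atom counts alone. A minimal sketch, assuming the compound is already known to contain only carbon and hydrogen; the function name is hypothetical, and the check looks only at the formula, not at the drawn structure.

```python
def matches_alkane_formula(carbons, hydrogens):
    """Return True if the counts fit CnH2n+2 for some n >= 1."""
    return carbons >= 1 and hydrogens == 2 * carbons + 2


print(matches_alkane_formula(5, 12))  # neopentane, C5H12
print(matches_alkane_formula(4, 8))   # cyclobutane, C4H8
```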
‘Features’ must be explicitly asserted
  • All properties and parts have to be explicitly associated with molecules in the ontology
  • e.g. has_part, has_charge, has_attribute (XXX which has_value YYYY), has_ring_count
  • => adding new relationships
Conclusions
  • Chemical classes are defined based on features and parts of molecules
  • These class definitions can be captured explicitly in OWL as ‘necessary and sufficient conditions’
  • This allows automatic classification if the features are also asserted about the molecules
Thank you for your attention
2nd ChEBI UGM: Closing remarks
  • Relationships: more, more specific
  • Natural products: flag them
  • Change of focus from OBO to OWL
  • Expose fingerprints?
  • Commitment to BFO? (general classes) What about DOLCE? (also GFO)
  • Scope: become clearer
  • Semantify the web offering more (SADI)
  • ChEBI as ‘glue’: keep the links coming
  • Mine ChEMBL bioactivity data for ChEBI role assertions (inhibitor etc.)
  • Harness literature, map to MeSH

Automatic classification in ChEBI

Editor's Notes

  • #33 More notes in team discussion:
    • Prioritise standard InChIs
    • Add batch submissions (bulk submissions)
    • Which additional properties do we pre-calculate and make visible in ChEBI? (team discussion needed) (Lipinski?)
    • OWL improvement to make SPARQL querying easier and improve the relationship patterns (not ALWAYS subclassof exists some). This ties into the SADI-fying of ChEBI and should also involve thinking of and testing out specific use cases for *doing stuff with* the exported OWL file.
    • Concern about downgrade in quality caused by increase in scale (quantity of compounds)