Current Projects

Reconstructing the evolution of genes

The goal of this project is to reconstruct the evolutionary relationships of (almost) all genes found in a set of 143 organisms. The organisms were chosen to represent most of the major branches of the tree of life, with a bias, not surprisingly, toward humans and other organisms of interest to humans. The relationships are represented as a set of phylogenetic trees, where each tree covers one gene family. The internal nodes (branch points) of the trees are labeled with the evolutionary events inferred to have occurred: gene duplication, speciation, or horizontal gene transfer. We collaborate with the UniProt resource to obtain (and constantly improve) a list of known protein-coding genes for all 143 species, and with the Quest for Orthologs Consortium to assess and improve our phylogenetic trees, from which orthologs are inferred. We also coordinate closely with the Ensembl Compara resource, sharing our definitions of protein families.
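To illustrate how orthologs can be read off a tree whose internal nodes carry event labels, here is a minimal Python sketch. It assumes the common convention that two genes are orthologs when their most recent common ancestor node is a speciation event. The `Node` class, the event strings, and the gene names are hypothetical and chosen only for this example; this is not the actual tree format or inference code used in the project.

```python
from dataclasses import dataclass, field
from itertools import combinations, product
from typing import List, Optional, Set, Tuple


@dataclass
class Node:
    """Hypothetical tree node: leaves are genes, internal nodes carry an event."""
    name: str = ""
    event: Optional[str] = None  # "speciation", "duplication", "transfer"; None for leaves
    children: List["Node"] = field(default_factory=list)


def leaves_and_orthologs(node: Node) -> Tuple[List[str], Set[frozenset]]:
    """Return the genes below `node` and all ortholog pairs found below it.

    Assumption: a pair of genes is orthologous when the node at which their
    lineages join (their most recent common ancestor) is a speciation node.
    """
    if not node.children:
        return [node.name], set()

    leaves_per_child = []
    pairs: Set[frozenset] = set()
    for child in node.children:
        child_leaves, child_pairs = leaves_and_orthologs(child)
        leaves_per_child.append(child_leaves)
        pairs |= child_pairs

    # Genes under different children of this node join here for the first time.
    if node.event == "speciation":
        for left, right in combinations(leaves_per_child, 2):
            for a, b in product(left, right):
                pairs.add(frozenset((a, b)))

    all_leaves = [gene for group in leaves_per_child for gene in group]
    return all_leaves, pairs


# Toy family: an ancient duplication produced copies A and B,
# each of which later split at the human/mouse speciation.
tree = Node(event="duplication", children=[
    Node(event="speciation", children=[Node("HUMAN_A"), Node("MOUSE_A")]),
    Node(event="speciation", children=[Node("HUMAN_B"), Node("MOUSE_B")]),
])

genes, orthologs = leaves_and_orthologs(tree)
for pair in sorted(sorted(p) for p in orthologs):
    print(pair)
```

In this toy tree HUMAN_A and MOUSE_A are reported as orthologs, while HUMAN_A and MOUSE_B are not, because their lineages join at the duplication node.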

Reconstructing the evolution of gene functions

The goal of this project is to use gene trees, together with experimental knowledge of the functions of genes in a variety of highly studied "model" organisms, to reconstruct how gene functions evolved over time. The project combines a computational representation of gene function (GO annotations that describe findings from published experimental results) with a computational representation of gene evolutionary relationships (phylogenetic trees from PANTHER) into a computational model of the evolution of gene function. We take both semi-automated and fully automated approaches to creating these models. In the semi-automated approach (implemented as the Gene Ontology Phylogenetic Annotation Project), human biocurators construct the model manually using a software tool called PAINT (Phylogenetic Annotation INference Tool), co-developed by our group and the Berkeley BOP group. The evolutionary models for thousands of protein families can be explored using the PanTree resource, maintained by our group. We are also developing computational methods to construct these evolutionary models automatically.

Large scale classification of proteins (and protein-coding genes)

The goal of this project is to use the "natural classification" of genes embodied in the gene trees to classify genes that are not in the "reference" gene trees but are clearly related to genes that are. One of our major interests is predicting the functions of experimentally uncharacterized genes (for which we use the evolutionary reconstructions above), but the classifications cover other characteristics of genes as well. We do this by creating and expanding a "library" of hidden Markov models (HMMs) that can be used to classify newly sequenced protein sequences. Currently (version 17) the library contains over 15,500 families and over 124,000 subfamilies.

Protein sequences form families (by well-understood evolutionary processes including gene duplication, speciation and horizontal gene transfer), which can be further subdivided into subfamilies. The definition of subfamily varies in the scientific literature, but in PANTHER it is defined very precisely using the evolutionary trees: genes are in different subfamilies if they were separated by a gene duplication and divergence in their evolutionary histories. An HMM is constructed for each family and each subfamily. A newly sequenced protein sequence can be compared statistically to each HMM to find the closest match, which provides a classification for the sequence. We work closely with the InterPro central resource at the European Bioinformatics Institute to ensure the quality and utility of the HMMs, and their consistency with other related resources. The PANTHER HMM scoring tool is distributed both by our group and in the widely used InterProScan software package.
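The tree-based subfamily definition can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the PANTHER procedure: it treats every duplication node as a subfamily boundary and ignores the divergence criterion mentioned above. The `Node` class, the event strings, and the gene names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical tree node: leaves are genes, internal nodes carry an event."""
    name: str = ""
    event: Optional[str] = None  # "speciation" or "duplication"; None for leaves
    children: List["Node"] = field(default_factory=list)


def assign_subfamilies(root: Node) -> List[List[str]]:
    """Group genes so that no duplication node lies on the path between
    two genes in the same group (simplifying assumption, see lead-in)."""
    groups: Dict[int, List[str]] = {}
    counter = 0

    def new_id() -> int:
        nonlocal counter
        counter += 1
        return counter

    def visit(node: Node, group_id: int) -> None:
        if not node.children:
            groups.setdefault(group_id, []).append(node.name)
            return
        for child in node.children:
            # Each lineage leaving a duplication starts its own group;
            # lineages leaving a speciation stay in the parent's group.
            next_id = new_id() if node.event == "duplication" else group_id
            visit(child, next_id)

    visit(root, new_id())
    return sorted(sorted(members) for members in groups.values())


# Toy family: the fly has a single copy; a duplication in the lineage
# leading to mammals produced copies A and B before human and mouse split.
tree = Node(event="speciation", children=[
    Node("FLY_X"),
    Node(event="duplication", children=[
        Node(event="speciation", children=[Node("HUMAN_A"), Node("MOUSE_A")]),
        Node(event="speciation", children=[Node("HUMAN_B"), Node("MOUSE_B")]),
    ]),
])

subfamilies = assign_subfamilies(tree)
for members in subfamilies:
    print(members)
```

Under this simplification the toy family splits into three groups: the fly gene on its own, the two A copies, and the two B copies.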

Computational representation of biological systems

The Gene Ontology project, in which our group plays a central role, is an international effort to create a computer-accessible representation of biological systems. We are currently developing an extension to the Gene Ontology annotation paradigm, which we call Gene Ontology Causal Activity Models (GO-CAMs). The GO-CAM paradigm allows the Gene Ontology to be used to represent complex biological systems. It combines a biologist-friendly visualization of the causal "pathways" by which genes function in biological systems with the well-defined semantics of an ontology-based representation. Our group collaborates with the Berkeley BOP group to develop a specification for this representation, and to develop a software tool for creating and editing these biological models. The current tool, called Noctua, and the biological models created with it are available online.

Identifying genetic variants with the potential to cause (or increase the risk for) disease

The goal of this project is to create computational methods for analyzing genetic variants carried by individual people, to identify the variants with the greatest potential to cause disease. Our group focuses on variants that cause a protein in one person to differ from the same protein in another person. While it is easy to identify variants that will cause a difference in protein sequence, it is difficult to identify those that will cause a difference in protein function. Our group uses a reconstruction of the evolutionary history of a protein to identify the variants most likely to impact function. A variant that changes a sequence position that has rarely changed over many millions of years of evolution is highly likely to impact function.
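One simple way to turn "rarely changed over many millions of years" into a number is to ask how far back along a protein's lineage the present-day amino acid at a position has been kept. The Python sketch below assumes that reconstructed ancestral residues and branch lengths (in millions of years) are already available. The function name, the data layout, and the values are invented for illustration and do not describe the project's actual scoring method.

```python
from typing import List, Tuple


def preservation_time(modern_residue: str,
                      ancestors: List[Tuple[str, float]]) -> float:
    """Sum branch lengths back in time for as long as the residue is unchanged.

    `ancestors` lists, from the most recent ancestor back toward the root,
    the reconstructed residue at this position and the length (in millions
    of years) of the branch leading to that ancestor. Hypothetical layout.
    """
    total = 0.0
    for residue, branch_my in ancestors:
        if residue != modern_residue:
            break  # the position changed on this branch; stop counting
        total += branch_my
    return total


# Invented example: the modern protein has arginine (R) at this position.
lineage = [("R", 90.0), ("R", 220.0), ("K", 150.0), ("R", 300.0)]
score = preservation_time("R", lineage)
print(score)  # 310.0
```

A variant at a position with a long preservation time would be ranked as more likely to affect function than one at a position that changed recently.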

Software tools for analysis of high-throughput "omics" experiments

The goal of this project is to develop and enhance a set of online software tools for analyzing the output of high-throughput "omics" experiments. These experiments typically measure thousands of genes at the same time, making the results difficult to interpret. The PANTHER enrichment tools allow users to input a list of genes found in the experiment (e.g. a list of genes expressed at a higher level in a cancerous tissue than in a healthy one), and find the biological functions that are statistically associated with those genes. The analysis uses the Gene Ontology database of known functions, together with predicted functions inferred from the Gene Ontology Phylogenetic Annotation project and other curated sources.
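The idea behind this kind of enrichment analysis can be sketched with a one-sided hypergeometric test, a common choice for over-representation. The sketch below uses only the Python standard library and invented gene and term names. It omits multiple-testing correction and is not the implementation behind the PANTHER tools, whose statistical options may differ.

```python
from math import comb
from typing import Dict, List, Set, Tuple


def hypergeom_upper_tail(N: int, K: int, n: int, x: int) -> float:
    """P(X >= x) when drawing n genes from N, of which K carry the term."""
    total = comb(N, n)
    upper = min(K, n)
    favorable = sum(comb(K, k) * comb(N - K, n - k) for k in range(x, upper + 1))
    return favorable / total


def enrichment(study: Set[str],
               population: Set[str],
               annotations: Dict[str, Set[str]]) -> List[Tuple[str, int, float]]:
    """Return (term, genes in study with term, p-value), smallest p first."""
    results = []
    for term, annotated in annotations.items():
        annotated_in_pop = annotated & population
        hits = len(study & annotated_in_pop)
        p = hypergeom_upper_tail(len(population), len(annotated_in_pop),
                                 len(study), hits)
        results.append((term, hits, p))
    return sorted(results, key=lambda r: r[2])


# Invented toy data: 10 genes measured, 3 of them in the user's list.
population = {f"g{i}" for i in range(1, 11)}
annotations = {
    "DNA repair": {"g1", "g2", "g3"},
    "transport": {"g4", "g5", "g6", "g7", "g8"},
}
study = {"g1", "g2", "g3"}

results = enrichment(study, population, annotations)
for term, hits, p in results:
    print(f"{term}: {hits} genes, p = {p:.4f}")
```

Here all three study genes carry the "DNA repair" term, so that term gets the smallest p-value (1/120), while "transport" has no study genes and a p-value of 1.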