
Construction of the PhyloFacts Resource

Universal Proteome Explorer

Version 2.0. 24 July 2008. Families: 55,976. Hidden Markov models (family and subfamily): 1,440,976.


See our paper, "PhyloFacts: An online structural phylogenomic encyclopedia for protein functional and structural classification", in Genome Biology.

Coverage statistics: the fraction of sequences of selected model organisms that are found in or represented by PhyloFacts books

The PhyloFacts resource includes "books" for protein families across the Tree of Life. Each book includes a multiple sequence alignment, one or more phylogenetic trees, predicted subfamilies, predicted 3D protein structures, active sites and other key residues, cellular localization, and Gene Ontology (GO) annotations and evidence codes. PhyloFacts also includes hidden Markov models for classifying user-submitted sequences (DNA or protein) to protein families and subfamilies across the Tree of Life. Our primary current focus is on covering all the gene families represented in the human genome and all structural domains, but we plan to expand the resource to include all proteins in all species.

The protein families in this resource typically contain homologs from many species. The phylogenetic distribution of a protein family can range from highly restricted (e.g., to Hominidae or mammals) to spanning the entire Tree of Life. Gathering homologs from many divergent species lets us take advantage of experimental investigations in different systems, and allows powerful inferences of function and structure that might not otherwise be possible.


Sequence selection for phylogenomic inference of protein structure and function: requiring global homology

Because the primary aim of this resource is to enable functional classification of proteins having little (or no) experimental data, we require proteins in a "book" to have the same domain structure. This enables us to use consensus approaches to structure prediction across the family, and to use phylogenomic analysis for inferring molecular function (Sjölander 2004).

Since PSI-BLAST and most other homolog detection methods retrieve sequences with local similarity (i.e., potentially having different overall domain structures), we used the FlowerPower algorithm to gather homologs for a given seed (query) sequence. FlowerPower is similar to PSI-BLAST and other iterated search methods, but includes subfamily identification and subfamily HMM construction to expand the cluster in each iteration. This helps prevent profile drift. FlowerPower includes alignment analysis in each iteration of the algorithm to reduce the intrusion of non-homologs and to restrict the set to sequences having the same overall fold. More information on the FlowerPower algorithm is available here.
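The distinction between local and global similarity can be illustrated with a simple coverage test. The sketch below is not the FlowerPower algorithm; it only assumes that a hit counts as "global" when the aligned region covers a minimum fraction of both the seed and the hit. The `Hit` record, the `is_global_match` function, and the 0.7 threshold are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One pairwise alignment between a seed and a database sequence.

    Hypothetical record for illustration; coordinates are 1-based, inclusive.
    """
    hit_id: str
    seed_len: int    # full length of the seed (query) sequence
    hit_len: int     # full length of the database sequence
    seed_start: int  # alignment start on the seed
    seed_end: int    # alignment end on the seed
    hit_start: int   # alignment start on the hit
    hit_end: int     # alignment end on the hit


def coverage(start: int, end: int, length: int) -> float:
    """Fraction of a sequence spanned by the aligned region."""
    return (end - start + 1) / length


def is_global_match(hit: Hit, min_coverage: float = 0.7) -> bool:
    """True if the alignment covers enough of BOTH sequences.

    The 0.7 default is an assumed threshold, not a documented one.
    """
    seed_cov = coverage(hit.seed_start, hit.seed_end, hit.seed_len)
    hit_cov = coverage(hit.hit_start, hit.hit_end, hit.hit_len)
    return seed_cov >= min_coverage and hit_cov >= min_coverage


hits = [
    # Aligns along nearly the full length of both sequences.
    Hit("full_length_homolog", 300, 310, 5, 295, 8, 300),
    # Shares one domain with the seed but is three times longer overall.
    Hit("shared_domain_only", 300, 900, 10, 290, 400, 680),
]

global_hits = [h.hit_id for h in hits if is_global_match(h)]
print(global_hits)
```

A local-similarity search would report both hits, since both align well to the seed. Requiring coverage on the hit as well as the seed is what excludes the longer multi-domain protein.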

Multiple sequence alignment

The final cluster produced by FlowerPower is then re-aligned using the MUSCLE algorithm. Before HMM construction and phylogenetic tree construction, we mask the alignment to remove columns that contain many gap characters or that appear structurally divergent based on BLOSUM62 analysis. Each book webpage includes details about how the book was constructed. Where possible, we use expertly curated multiple sequence alignments from online resources (e.g., many of the books in our GPCR series are based on alignments downloaded from the GPCRDB resource).
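The gap-based part of alignment masking can be sketched as follows. This is a minimal illustration, not the PhyloFacts implementation: the function name `mask_gappy_columns` and the 50% gap threshold are assumptions, and the BLOSUM62-based divergence filter is not shown.

```python
def mask_gappy_columns(alignment: list[str], max_gap_fraction: float = 0.5) -> list[str]:
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction.

    `alignment` is a list of equal-length aligned sequences. Both '-' and '.'
    are treated as gap characters. The 0.5 default is an assumed threshold.
    """
    if not alignment:
        return []
    n_rows = len(alignment)
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    # Indices of the columns that survive masking.
    keep = [
        col for col in range(n_cols)
        if sum(row[col] in "-." for row in alignment) / n_rows <= max_gap_fraction
    ]
    return ["".join(row[col] for col in keep) for row in alignment]


aln = [
    "MK-LV--A",
    "MKTLV--A",
    "MR-LI-GA",
    "MK-LVS-A",
]
masked = mask_gappy_columns(aln)
for row in masked:
    print(row)
```

In this toy alignment, the third, sixth, and seventh columns are each 75% gaps and are removed, leaving five columns per sequence.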

Phylogenetic tree construction and subfamily identification

Because different tree-construction methods often produce significantly different topologies for protein superfamilies, we provide several alternative trees for users to examine. The SCI-PHY algorithm (Sjölander 1998) was used to construct a hierarchical tree and identify subfamilies. In addition to the SCI-PHY subfamilies and hierarchical tree, most protein family "books" also include trees constructed using several other methods. Neighbor-Joining trees are constructed using the PHYLIP software. Maximum Likelihood trees are estimated using the PhyML software, and Maximum Parsimony trees are estimated using the PAUP software.

HMM construction

Hidden Markov models are constructed for each family, and also for subfamilies identified using the SCI-PHY algorithm. We use the w0.5 software from the SAM HMM tools from UCSC to optimize the family HMM for remote homolog detection. Subfamily HMMs are constructed as described in Brown et al. (2005), and are optimized for selectivity. The combination enables us to detect novel (previously unknown) family members and to classify them to subtypes with high accuracy. Entirely novel subtypes can be discriminated through linear regression analysis of subfamily scores.

Functional annotation retrieval

We retrieve annotations from the UniProt resource, and also from GenBank. The UniProt resource includes Gene Ontology classifications and evidence codes for many sequences, making it possible to predict the molecular function, biological process and cellular localization of some subfamilies.

Hyperlinks to other online resources

Hyperlinks are provided to genome databases, key literature, and other resources.

Structure Prediction and Cellular Localization Prediction

We predict the presence of structural and functional domains by several approaches. First, we derive a consensus sequence for the multiple sequence alignment, and score that sequence against hidden Markov models downloaded from the Pfam resource and against the PhyloFacts Structure Prediction resource. We also submit the consensus sequence as a BLAST query against the Protein Data Bank (PDB), and to the Phobius TransMembrane and Signal Peptide Prediction resource.
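One common way to derive a consensus sequence from a multiple sequence alignment is a per-column majority rule, sketched below. The page does not state which consensus method PhyloFacts uses, so the rule itself, the function name `consensus_sequence`, and the decision to skip columns that are mostly gaps are all assumptions made for illustration.

```python
from collections import Counter


def consensus_sequence(alignment: list[str], gap_chars: str = "-.") -> str:
    """Majority-rule consensus of an alignment (illustrative only).

    For each column, the most frequent residue is emitted. Columns in which
    more than half of the sequences have a gap are skipped, so the result is
    an ungapped sequence suitable for use as a search query.
    """
    if not alignment:
        return ""
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    consensus = []
    for col in range(n_cols):
        residues = [row[col].upper() for row in alignment if row[col] not in gap_chars]
        # Skip columns where gaps are in the majority.
        if len(residues) * 2 < len(alignment):
            continue
        # Ties are broken in favor of the residue seen first in the column.
        consensus.append(Counter(residues).most_common(1)[0][0])
    return "".join(consensus)


aln = [
    "MK-LV--A",
    "MKTLV--A",
    "MR-LI-GA",
    "MK-LVS-A",
]
query = consensus_sequence(aln)
print(query)
```

The resulting string could then be passed as a query to tools such as a Pfam HMM search or a BLAST search against PDB. Those calls are omitted here because their exact invocation is not described on this page.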


References:

Phylogenomic inference and key methods

  • Sjölander, K., "Phylogenomic inference of protein molecular function: advances and challenges," Bioinformatics 2004, 20(2):170-179. Oxford University Press access.
  • Sjölander, K., "Phylogenetic inference in protein superfamilies: Analysis of SH2 domains," Proceedings of the Conference on Intelligent Systems for Molecular Biology 1998, 6:165-74. PubMed abstract. (Presents the SCI-PHY algorithm for protein subfamily identification.)
  • Brown D, Krishnamurthy N, Dale J, Christopher W, and Sjölander K, "Subfamily HMMs in Functional Genomics", Proceedings of the Pacific Symposium on Biocomputing, 2005. PSB proceedings.

If you have any questions or comments, please email phylo.