Construction of the PhyloFacts Resource
Version 2.0.
7 November 2009. Families: 58,955.
Hidden Markov Models (family and subfamily): 1,564,616.
The PhyloFacts resource includes "books" for protein families across the Tree of Life. Each book includes a multiple sequence alignment, one or more phylogenetic trees, predicted subfamilies, predicted 3D protein structures, active sites and other key residues, cellular localization, and Gene Ontology (GO) annotations and evidence codes. PhyloFacts includes hidden Markov models for classification of user-submitted (DNA or protein) sequences to protein families and subfamilies across the Tree of Life. Our primary current focus is on covering all the gene families represented in the human genome and all structural domains, but we plan to expand the resource to include all proteins in all species.
The protein families in this resource typically contain homologs from many species. The phylogenetic distribution of a protein family can vary from highly restricted (e.g., to hominidae or mammals) to throughout the tree of life. Gathering homologs from many divergent species enables us to take advantage of experimental investigations in different systems, and allows powerful inferences of function and structure that might not otherwise be possible.
Because the primary aim of this resource is to enable functional classification of proteins having little (or no) experimental data, we require proteins in a "book" to have the same domain structure.
This enables us to use consensus approaches to structure prediction across the family, and to use phylogenomic analysis for inferring molecular function (Sjölander 2004).
Since PSI-BLAST and most other homolog detection methods retrieve sequences with local similarity (i.e., potentially having different overall domain structures), we used the FlowerPower algorithm to gather homologs for a given seed (query) sequence. FlowerPower is similar to PSI-BLAST and other iterated search methods, but includes subfamily identification and subfamily HMM construction to expand the cluster in each iteration. This helps prevent profile drift. FlowerPower includes alignment analysis in each iteration of the algorithm to reduce the intrusion of non-homologs and to restrict the set to sequences having the same overall fold. More information on the FlowerPower algorithm is available here.
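The distinction between local similarity and shared overall domain structure can be illustrated with a simple coverage check. The sketch below is not the FlowerPower algorithm; it only shows one common way to screen hits for global agreement, by requiring the aligned region to cover most of both the query and the hit. The `Hit` record, the `covers_both` helper, the 0.7 threshold, and the example hits are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One pairwise hit; coordinates are 1-based and inclusive (hypothetical record)."""
    hit_id: str
    query_len: int
    hit_len: int
    query_start: int
    query_end: int
    hit_start: int
    hit_end: int


def covers_both(hit, min_coverage=0.7):
    """True if the aligned region spans at least min_coverage of query AND hit.

    The threshold is an illustrative assumption, not a value from PhyloFacts.
    """
    query_cov = (hit.query_end - hit.query_start + 1) / hit.query_len
    hit_cov = (hit.hit_end - hit.hit_start + 1) / hit.hit_len
    return query_cov >= min_coverage and hit_cov >= min_coverage


# Made-up hits: one global match, one hit with extra domains, one partial match.
hits = [
    Hit("full_length", 300, 310, 5, 295, 8, 300),
    Hit("extra_domain", 300, 900, 5, 295, 600, 890),
    Hit("fragment_match", 300, 120, 100, 200, 10, 110),
]

kept = [h.hit_id for h in hits if covers_both(h)]
print(kept)
```

A hit that matches the whole query but is three times its length (the second example) is the case that a purely local method would accept and a same-domain-structure requirement would reject.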
The final cluster produced by FlowerPower is then re-aligned using the MUSCLE algorithm. Prior to HMM construction and phylogenetic tree construction, we mask the alignment to remove columns that contain many gap characters or that appear to be structurally divergent, based on BLOSUM62 analysis.
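The gap-based part of alignment masking can be sketched in a few lines. This is a minimal illustration, assuming the alignment is held as a list of equal-length strings; the function name `mask_gappy_columns` and the 0.5 gap-fraction cutoff are assumptions, and the BLOSUM62-based divergence check mentioned above is not covered here.

```python
def mask_gappy_columns(alignment, max_gap_fraction=0.5, gap_chars="-."):
    """Drop alignment columns whose fraction of gap characters exceeds the cutoff.

    alignment: list of aligned sequences (strings of equal length).
    The cutoff is an illustrative assumption, not the PhyloFacts setting.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    n_rows = len(alignment)
    # Indices of the columns that survive masking.
    keep = [
        col
        for col in range(length)
        if sum(row[col] in gap_chars for row in alignment) / n_rows <= max_gap_fraction
    ]
    return ["".join(row[col] for col in keep) for row in alignment]


# Toy alignment: the third column is a gap in three of the four rows.
aln = ["MK-LV", "MR-LV", "MKALI", "M--LV"]
masked = mask_gappy_columns(aln)
print(masked)
```

Here the third column is removed because three of its four characters are gaps, while the second column, with a single gap, is kept.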
Each book webpage includes details about how the book was constructed. Where possible, we use expertly curated multiple sequence alignments from online resources (e.g., many of the books in our GPCR series are based on alignments downloaded from the GPCRDB resource).
Because different phylogenetic trees for a protein superfamily often differ significantly in topology, we provide several alternative trees for users to examine. The SCI-PHY algorithm (Sjölander 1998) was used to construct a hierarchical tree and identify subfamilies.
In addition to the SCI-PHY subfamilies and hierarchical tree, most protein family "books" also include several trees constructed using different methods. Neighbor-Joining trees are constructed using the PHYLIP software. Maximum Likelihood trees are estimated using the PhyML software, and Maximum Parsimony trees are estimated using the PAUP software.
We retrieve annotations from the UniProt resource, and also from GenBank. The UniProt resource includes Gene Ontology classifications and evidence codes for many sequences, making it possible to predict the molecular function, biological process and cellular localization of some subfamilies.
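One simple way to see how GO classifications and evidence codes could support a subfamily-level prediction is to count, for each GO term, how many subfamily members carry it with experimental support. The sketch below assumes annotations are available as `(sequence_id, go_term, evidence_code)` tuples; the function name, the sequence identifiers, and the example annotations are hypothetical, and this is not presented as the procedure PhyloFacts uses.

```python
from collections import Counter

# GO evidence codes in the "experimental" category.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def summarize_subfamily_go(annotations, subfamily_members):
    """Count subfamily members per GO term, using experimental evidence only.

    annotations: iterable of (sequence_id, go_term, evidence_code) tuples.
    Returns (go_term, member_count) pairs, most frequent first, ties by term ID.
    """
    members = set(subfamily_members)
    # A set so that each (member, term) pair is counted once.
    supported = {
        (seq_id, term)
        for seq_id, term, code in annotations
        if seq_id in members and code in EXPERIMENTAL_CODES
    }
    counts = Counter(term for _, term in supported)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sequence IDs and annotations, for illustration only.
annotations = [
    ("seqA", "GO:0004930", "IDA"),
    ("seqA", "GO:0005886", "IEA"),  # electronic annotation, ignored
    ("seqB", "GO:0004930", "IMP"),
    ("seqB", "GO:0005886", "IDA"),
    ("seqC", "GO:0004930", "IDA"),  # not a member of this subfamily
]
summary = summarize_subfamily_go(annotations, ["seqA", "seqB", "seqD"])
print(summary)
```

Restricting the count to experimental codes is a design choice made for this example; a less strict filter would also admit curated or electronic annotations.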
Hyperlinks are provided to genome databases, key literature, and other resources.
We predict the presence of structural and functional domains by several approaches. First, we derive a consensus sequence for the multiple sequence alignment, and score that sequence against hidden Markov models downloaded from the Pfam resource and submit it to the PhyloFacts Structure Prediction resource. We also submit the consensus sequence as a BLAST query against the Protein Data Bank (PDB), and to the Phobius TransMembrane and Signal Peptide Prediction resource.
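Deriving a consensus sequence from a multiple sequence alignment can be sketched as a per-column majority vote. The version below is a minimal illustration under stated assumptions: the alignment is a list of equal-length strings, columns that are mostly gaps are skipped, and ties are broken alphabetically. The function name and the 0.5 gap cutoff are hypothetical, and the consensus rule actually used for PhyloFacts books is not specified in the text.

```python
from collections import Counter


def consensus_sequence(alignment, max_gap_fraction=0.5, gap_chars="-."):
    """Majority-rule consensus of an alignment given as equal-length strings.

    Columns with more than max_gap_fraction gaps contribute nothing.
    Ties between residues are broken alphabetically so output is deterministic.
    """
    if not alignment:
        return ""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    consensus = []
    for col in range(length):
        residues = [row[col].upper() for row in alignment if row[col] not in gap_chars]
        gap_fraction = 1 - len(residues) / len(alignment)
        if not residues or gap_fraction > max_gap_fraction:
            continue  # skip all-gap or mostly-gap columns
        counts = Counter(residues)
        consensus.append(min(counts, key=lambda r: (-counts[r], r)))
    return "".join(consensus)


# Toy alignment; the third column is mostly gaps and is skipped by default.
toy = ["MK-LV", "MR-LV", "MKALI", "M--LV"]
print(consensus_sequence(toy))
```

The resulting string is an ordinary ungapped protein sequence, which is what makes it usable as a query for HMM scoring, BLAST, or a transmembrane predictor.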