PhyloBuilder has two modes for gathering homologs to a submitted query:
global-global (the program default; retrieved sequences align along their
entire lengths to the query) and global-local (retrieved sequences match the
query locally, but can have additional domains before or after the query).
PhyloBuilder gathers homologs for a user-supplied query sequence using
FlowerPower, constructs a multiple sequence alignment using MUSCLE,
applies alignment masking, constructs a phylogenetic tree using
Neighbor-Joining, identifies functional subfamilies using SCI-PHY,
identifies PFAM domains and homologous 3D structures, predicts the presence of
transmembrane domains and signal peptides using the Phobius server,
retrieves Gene Ontology and Enzyme Classification data, and constructs
a family hidden Markov model (HMM) and subfamily HMMs.
The pipeline is shown in Figure 1, below.
There is a single required input: a protein sequence in FASTA format. Optional parameters are detailed below.
Users submitting numerous PhyloBuilder jobs may find it helpful to give each job a meaningful name. If this field is left blank, a name is assigned automatically (we currently use the time and date of job submission).
The server will send a single email informing the user that the job has completed. The email will not be used for any other purpose.
Notes (optional)
The notes field allows a user to include some information (e.g., details about the submitted sequence) with the PhyloBuilder job. This text will appear in the PhyloBuilder Results page.
The input to PhyloBuilder is a single protein sequence in FASTA format.
FASTA format consists of a line beginning with a greater-than ('>')
character followed by the sequence identifier (no white space is allowed
between the '>' and the identifier); additional text may follow the
sequence identifier, including white space characters.
The amino acid sequence follows on subsequent lines in the single-letter
representation.
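The FASTA rules described above can be checked with a few lines of code. The following is a minimal sketch, not PhyloBuilder's own validation: the function name `parse_single_fasta` is hypothetical, and the sketch assumes that only alphabetic single-letter codes are accepted (the server may be more or less permissive).

```python
def parse_single_fasta(text):
    """Parse one FASTA record and return (identifier, description, sequence).

    Hypothetical helper for illustration; it is not part of PhyloBuilder.
    """
    lines = [line.rstrip() for line in text.strip().splitlines()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA input must begin with a '>' header line")

    header = lines[0][1:]
    # No white space is allowed between '>' and the identifier.
    if not header or header[0].isspace():
        raise ValueError("identifier must immediately follow '>'")

    # The identifier is the first token; any remaining text is free-form.
    parts = header.split(None, 1)
    identifier = parts[0]
    description = parts[1] if len(parts) > 1 else ""

    # The sequence may be wrapped across several lines.
    sequence = "".join(line.strip() for line in lines[1:])
    if not sequence:
        raise ValueError("no sequence lines found")
    if ">" in sequence:
        raise ValueError("only a single sequence is accepted")
    # Assumption: only alphabetic single-letter amino acid codes are allowed.
    if not sequence.isalpha():
        raise ValueError("sequence must use single-letter amino acid codes")
    return identifier, description, sequence.upper()
```

For example, `parse_single_fasta(">sp|P12345 example protein\nMKTAYIAK\nQRQISFVK\n")` returns the identifier `sp|P12345`, the description `example protein`, and the joined sequence `MKTAYIAKQRQISFVK`.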
Homolog selection (optional)
Two options are provided for homolog selection: global-global (to select sequences that align to the input sequence along their entire lengths) and global-local (to select sequences that match the input sequence but may have additional N- or C-terminal domains). The default is global-global.
Phylogenetic trees may be constructed for the sequences in the alignment.
The trees can be displayed on-line with a graphical viewer, and the tree data
files may be downloaded.
The default setting is to construct only a neighbor-joining tree.
Construction of maximum-likelihood and maximum-parsimony trees may be
selected.
Neighbor-joining trees are constructed with the PHYLIP neighbor-joining
software (protdist and neighbor), using default parameters.
Maximum likelihood trees are constructed with the PhyML software,
using the neighbor-joining tree as the starting tree, and
using default parameters.
The neighbor-joining and maximum likelihood trees are midpoint re-rooted
with the PHYLIP software (retree).
Maximum Parsimony trees are constructed with the PAUP* software,
using heuristic search with random sequence addition, f-value minimization,
midpoint rooting, and consensus by majority rule,
with a maximum of 100,000 trees.
Advanced FlowerPower settings allow the user to modify the homolog collection and inclusion strategy.
Users can override program defaults using the Advanced Settings page to
select a different sequence database, to select sequences with global-local
similarity to the submitted seed sequence, and to modify the minimum fractional
coverage or sequence similarity to a previously included sequence.
Fractional coverage thresholds can be set independently for the query and hit.
See more information at the FlowerPower webserver.
The sw parameter determines the way that sequences are aligned to an HMM using the UCSC SAM software; setting sw to 2 (the default) provides local-local alignment.
This parameter sets the number of times FlowerPower scores the homologs
retrieved by the PSI-BLAST search against subfamily hidden Markov models (SHMMs).
The default number of SHMM scoring iterations is 3.
Increasing the number of iterations may increase sensitivity (coverage), but will also increase program run time.
Maximum E-value for inclusion of sequences (default: depends on sequence length)
The E-value (or Expectation value) is an estimate of the number of hits found by chance alone in a database of a particular size. Since the E-value of an HMM score is affected by the HMM length (high-identity matches to short HMMs can have much less significance than moderate-identity matches to long HMMs), we use a length-dependent E-value cutoff (determined empirically), as follows:
Sequence length less than 65: 1.0e-2;
Number of PSI-BLAST iterations
In FlowerPower, the default setting is three iterations of PSI-BLAST for retrieval of the set of sequences for HMM search.
Although FlowerPower effectively prevents the negative consequences of profile drift (in particular, failing to recognize the query sequence), running more than three PSI-BLAST iterations is recommended only when using the global-local option of sequence search, or when a user wishes to identify very distant homologs. Increasing the number of PSI-BLAST iterations also comes at the cost of a significant increase in run time, both because the PSI-BLAST run itself takes longer and because a larger set of sequences must then be scored against the HMMs.
Required identity to existing sequence for inclusion
The default setting for minimum percent identity between a new sequence and a previously included sequence is 20%. This, in combination with conservatively set maximum E-values against a subfamily (or family) HMM, provides good remote homolog detection while minimizing the inclusion of non-homologs. Increasing the required identity criterion may result in a more closely related set of sequences, but is likely to result in missing some homologs. Lowering the required identity threshold will allow inclusion of more remotely related sequences, at the risk of including non-homologs and subsequent problems in phylogenomic inference of function.
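The identity criterion can be illustrated with a short sketch. The source does not state how FlowerPower computes percent identity, so the definition used here (identical residues divided by positions where both sequences have a residue) is an assumption, and the function names `percent_identity` and `meets_identity_threshold` are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap_chars="-."):
    """Percent identity over columns where both aligned sequences have a residue.

    Assumption: identity is measured over aligned (non-gap) columns only;
    FlowerPower's actual definition may differ.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a in gap_chars or b in gap_chars:
            continue  # skip columns with a gap in either sequence
        aligned += 1
        if a.upper() == b.upper():
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


def meets_identity_threshold(candidate, included, min_identity=20.0):
    """True if the candidate reaches min_identity to any included sequence."""
    return any(percent_identity(candidate, s) >= min_identity for s in included)
```

With the default of 20%, a candidate is accepted on this criterion as soon as it reaches the threshold against any one previously included sequence; in the pipeline described above, the E-value requirement against a subfamily or family HMM still applies in addition.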
The UniProt database is used by default for PSI-BLAST homolog search. The NR database is also available.
Fractional coverage of a sequence relative to an HMM (sequence coverage), or of an HMM relative to a sequence (HMM coverage), provides a measure of their overall structural similarity. We measure these by aligning a sequence to an HMM and determining the fraction of the sequence aligned to the HMM, and the fraction of the HMM match states through which a sequence passes in an optimal path through the HMM.
FlowerPower uses HMM-length-specific coverage requirements designed to optimize the retrieval of global and glocal homologs. Users can manipulate the query and hit coverage requirements to identify additional sequences with only partial similarity. This can be useful when restrictive coverage parameters produce very few homologs. However, we recommend caution in transferring annotations from any partial matches.
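The two coverage measures can be sketched from a single aligned row. This example assumes the row is written in the a2m convention used by SAM (uppercase letters are residues aligned to match states, lowercase letters are inserted residues, '-' marks a skipped match state, and '.' is padding); the function name `coverage_from_a2m` is hypothetical, and FlowerPower's own calculation may differ in detail.

```python
def coverage_from_a2m(a2m_row):
    """Return (sequence_coverage, hmm_coverage) for one a2m-formatted row.

    sequence_coverage: fraction of the sequence's residues aligned to match states.
    hmm_coverage: fraction of HMM match states through which the sequence passes.
    """
    match_residues = sum(1 for c in a2m_row if c.isupper())
    insert_residues = sum(1 for c in a2m_row if c.islower())
    skipped_match_states = a2m_row.count("-")

    sequence_length = match_residues + insert_residues
    hmm_length = match_residues + skipped_match_states

    sequence_coverage = match_residues / sequence_length if sequence_length else 0.0
    hmm_coverage = match_residues / hmm_length if hmm_length else 0.0
    return sequence_coverage, hmm_coverage
```

For the row `mkTAYI--KQ.rq`, six of the ten residues are aligned to match states (sequence coverage 0.6) and six of the eight match states are visited (HMM coverage 0.75).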
Q: What is a book?
A: We use the term "book" to reflect the complex data associated with each PhyloFacts family - multiple sequence alignment, phylogenetic trees, predicted 3D structures, identified subfamilies, GO annotations, family and subfamily hidden Markov models, etc. The term also reflects the division of PhyloFacts into libraries of different types (e.g., Plant Disease Resistance, GPCRs, microbial gene families, etc.).
Currently there are more than 55,000 completed books in
the PhyloFacts database, divided into various libraries.
Q: What are global homologs?
A: Global homologs are sequences that share homology along their entire lengths. This distinguishes them from sequences sharing only a local region of similarity and enables us to infer that they have evolved from a common ancestor without any intervening domain shuffling. Since the presence or absence of a domain can cause a dramatic change in function, transferring annotations between sequences must be restricted to those sharing the same overall domain architecture. FlowerPower default parameters are designed to retrieve global homologs to a user-submitted query.
Q: What does "domain architecture" mean?
A: The precise set and order of structural domains composing a protein.
Q: What is domain shuffling?
A: The biological process by which protein families acquire or lose domains, through domain fusion (merging two structures) or fission (dividing one structure into two separate structures).
Q: Why can't I just use a restrictive E-value cutoff in BLAST to infer function?
A: A significant BLAST E-value tells you that two sequences are homologous, but does not distinguish between global and local matches. It also does not take into account the possibility of functional divergence due to gene duplication or other types of evolutionary processes. There is also the problem of existing annotation errors in the databases, which have also been propagated using annotation transfer.
Q: What does GO stand for?
A: GO stands for the Gene Ontology. The GO Consortium has produced an ontology of gene/protein molecular functions, biological processes, and cellular locations. It represents a large effort to create a structured language under which many biological properties of gene products can be classified. The ontology takes the form of a directed acyclic graph of increasingly specific related terms. Because many groups use the Gene Ontology to annotate their sequences, the database represents a source of annotation of much higher quality than that found in the headers of most sequences. GO terms are normally accompanied by Evidence Codes, indicating the source of the annotation. The vast majority of GO annotations have been assigned entirely based on electronic evidence; a scant 3% of UniProt annotations have any experimental support.
Q: What does EC stand for?
A: EC stands for the Enzyme Commission, but you can think of it as an Enzyme Classification system. It represents a structured classification of enzymes by their general and specific biocatalytic properties. The relationship between enzyme specificity and evolutionary distance makes the analysis of enzymes a particularly effective use of phylogenomic inference.
Q: How long does PhyloBuilder take?
A: PhyloBuilder jobs are highly variable in the time required for completion. Longer input sequences with large numbers of homologs take longer than short sequences with few homologs. In general, requiring more SCI-PHY or PSI-BLAST iterations and lowering thresholds for sequence inclusion will cause more sequences to be accepted and will take more time. Using program defaults, most inputs will complete in less than an hour.
Q: What is FlowerPower?
A: FlowerPower is an algorithm optimized for retrieval of global homologs. It is similar to PSI-BLAST in its use of iterated profile search, inclusion of new sequences, and profile re-estimation, but differs in two significant ways: it uses subfamily HMMs instead of a single profile, and it uses alignment analysis to restrict included sequences to those meeting user-specified criteria. This combination improves alignment accuracy, minimizes the intrusion of non-homologs, and prevents profile drift. Details on FlowerPower and experimental results are available in [Krishnamurthy et al., 2006].
Use the FlowerPower webserver
Q: What is SCI-PHY really doing?
A: SCI-PHY finds a
way to divide a set of sequences into subtypes that effectively
approximates an expert division using only the multiple sequence
alignment as input. It does so by clustering sequences into the
largest possible groups such that each column in the cluster contains
amino acids that are biochemically similar, and avoids combining two
groups if that would produce a significant number of columns with
biochemically different amino acids. This criterion enables SCI-PHY to
produce clusters where all the members of a subfamily have a
consistent structure and function. Our tests show SCI-PHY subfamilies
almost always agree at the 4th digit of an EC classification, correlate
very closely to conserved subtrees found by standard phylogenetic analysis, and are consistent with expert divisions of sequences into subtypes.
Visit the SCI-PHY webserver
Q: How often can I use PhyloBuilder?
A: PhyloBuilder is computationally intensive and our compute cluster is relatively small. Our standard user quota is five submissions per day. Please
contact
Advanced - Phylogenetic tree construction
Advanced - FlowerPower settings
Sequence length less than or equal to 100: 1.0e-3;
Sequence length greater than 100: 1.0e-4.
These E-values are computed using an assumed database size of 100,000.
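The length-dependent cutoffs listed above (1.0e-2 for lengths below 65, 1.0e-3 for lengths up to and including 100, and 1.0e-4 for longer sequences) amount to a simple lookup. The following is a minimal sketch; the function name `max_evalue_for_length` is hypothetical and is not part of FlowerPower.

```python
def max_evalue_for_length(length):
    """Return the default maximum E-value for a sequence of the given length.

    Thresholds follow the documented defaults; the function itself is
    an illustrative helper, not FlowerPower code.
    """
    if length < 65:
        return 1.0e-2
    if length <= 100:
        return 1.0e-3
    return 1.0e-4
```

A hit would then be accepted on this criterion only if its E-value, computed against the assumed database size of 100,000, does not exceed the value returned for the relevant length.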
Frequently Asked Questions
Q: I'd like to customize the PhyloBuilder results page to include my own perspective. Is this possible?
A: Each PhyloBuilder book includes a link at the bottom of the page to start the editorial process. You will be asked to fill out a short form, after which you will be able to customize your PhyloBuilder book. We also encourage biologists to contribute to the community annotation of protein families in PhyloFacts. Contact us if you'd like to become a book editor.