The PhyloBuilder Pipeline

PhyloBuilder has two modes for gathering homologs to a submitted query: global-global (the program default; retrieved sequences align along their entire lengths to the query) and global-local (retrieved sequences match the query locally, but can have additional domains before or after the query). PhyloBuilder gathers homologs for a user-supplied query sequence using FlowerPower, constructs a multiple sequence alignment using MUSCLE, applies alignment masking, constructs a phylogenetic tree using Neighbor-Joining, identifies functional subfamilies using SCI-PHY, identifies PFAM domains and homologous 3D structures, predicts the presence of transmembrane domains and signal peptides using the Phobius server, retrieves Gene Ontology and Enzyme Classification data, and constructs a family hidden Markov model (HMM) and subfamily HMMs. The pipeline is shown in Figure 1, below.


Figure 1. The PhyloBuilder Pipeline

PhyloBuilder Inputs

There is a single required input: a protein sequence in FASTA format. Optional parameters are detailed below.

Results label (optional)

Users submitting numerous PhyloBuilder jobs may find it helpful to provide a meaningful name to a PhyloBuilder job. If this field is left blank, an automatic name will be provided (we currently use the time and date of job submission).

Email (required)

The server will send a single email informing the user that the job has completed. The email will not be used for any other purpose.

Notes (optional)

The notes field allows a user to include some information (e.g., details about the submitted sequence) with the PhyloBuilder job. This text will appear in the PhyloBuilder Results page.

Input: FASTA sequence

The input to PhyloBuilder is a single protein sequence in FASTA format. FASTA format consists of a line beginning with a greater-than ('>') character followed by the sequence identifier (no white space is allowed between the '>' and the identifier); additional text may follow the sequence identifier, including white space characters. The amino acid sequence follows on subsequent lines in the single-letter representation.

>gi|15277272|dbj|BAB63400.1| HLA-A [Homo sapiens]
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFFTSVQFVRFDSDAASQRME
PRAPWIEQEGPEYWDQETRNVKAQSQTDRVDLGTLRDVGSDGRFLRGYRQD
AYDGKDYIALNEDLRSWTAADMAAQITKRKWEAAHELENGKETLQRTDPPK
THMTHHPISDHEATLRCWALGFYPAEITLTWQRDGEQKWAAVVVPSGEEQR
YTCHVQHEGLPKPLTLRWELSSQPTIPIVGIIAGLVSSDRKGGSYTQAASS
DSAQGSDVSLTACKV

Figure 2. Example of FASTA format
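The format rules above can be checked before a sequence is submitted. The following is a minimal Python sketch, not PhyloBuilder's own validation code; the function name parse_single_fasta and the letters-only check on the sequence are assumptions made for illustration.

```python
def parse_single_fasta(text):
    """Parse text expected to hold exactly one FASTA record.

    Returns a tuple (identifier, description, sequence).
    Raises ValueError if the text breaks the rules described above.
    """
    # Drop blank lines and surrounding white space on each line.
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA input must begin with a '>' header line")

    header = lines[0][1:]
    # The identifier must follow '>' immediately, with no white space.
    if not header or header[0].isspace():
        raise ValueError("an identifier must follow '>' with no white space")

    # Text after the first white space is free-form description.
    parts = header.split(None, 1)
    identifier = parts[0]
    description = parts[1] if len(parts) > 1 else ""

    body = lines[1:]
    if any(ln.startswith(">") for ln in body):
        raise ValueError("only a single sequence is accepted")

    # Join the sequence lines, removing any internal white space.
    sequence = "".join("".join(ln.split()) for ln in body)
    if not sequence:
        raise ValueError("no sequence found after the header line")
    # Assumption: single-letter amino acid codes are alphabetic only.
    if not sequence.isalpha():
        raise ValueError("sequence contains non-letter characters")

    return identifier, description, sequence
```

For the record in Figure 2, the identifier would be "gi|15277272|dbj|BAB63400.1|" and the description "HLA-A [Homo sapiens]".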

Default Settings

Default parameter settings for FlowerPower are designed to optimize the retrieval of global homologs (sequences that share the same series of domains, are roughly the same length and can be aligned over their entire lengths).

Homolog selection (optional)

Two options are provided for homolog selection: global-global (to select global homologs) and global-local (to select sequences that match the input sequence but may have additional N- or C-terminal structure). The default is global-global.

Advanced - Phylogenetic tree construction

Phylogenetic trees may be constructed for the sequences in the alignment. The trees can be displayed on-line with a graphical viewer, and the tree data files may be downloaded.

The default setting is to construct only a neighbor-joining tree. Construction of maximum-likelihood and maximum-parsimony trees may be selected.

Neighbor-joining trees are constructed with the PHYLIP neighbor-joining software (protdist and neighbor), using default parameters. Maximum-likelihood trees are constructed with the PhyML software, using the neighbor-joining tree as the starting tree, and using default parameters. The neighbor-joining and maximum-likelihood trees are midpoint re-rooted with the PHYLIP software (retree). Maximum-parsimony trees are constructed with the PAUP* software, using heuristic search with random sequence addition, f-value minimization, midpoint rooting, and consensus by majority rule, with a maximum of 100,000 trees.

Advanced - FlowerPower settings

Advanced FlowerPower settings allow the user to modify the homolog collection and inclusion strategy. Users can override program defaults using the Advanced Settings page to select a different sequence database, to select sequences with global-local similarity to the submitted seed sequence, and to modify the minimum fractional coverage or sequence similarity to a previously included sequence. Fractional coverage thresholds can be set independently for the query and hit.

See more information at the FlowerPower webserver.

SW

The sw parameter determines the way that sequences are aligned to an HMM using the UCSC SAM software; setting sw to 2 (the default) provides local-local alignment.

Number of SHMM iterations

This sets the number of times FlowerPower searches the homologs retrieved by PSI-BLAST with subfamily hidden Markov models (SHMMs). The default is 3 iterations. Increasing the number of iterations may increase sensitivity (coverage), but will also increase program run time.

Maximum E-value for inclusion of sequences (default: depends on sequence length)

The E-value (or Expectation value) is an estimate of the number of hits found by chance alone in a database of a particular size. Since the E-value of an HMM score is affected by the HMM length (high-identity matches to short HMMs can have much less significance than moderate-identity matches to long HMMs), we use a length-dependent E-value cutoff (determined empirically), as follows:

    Sequence length less than 65: 1.0e-2;
    Sequence length from 65 to 100: 1.0e-3;
    Sequence length greater than 100: 1.0e-4.

These E-values are computed using an assumed database size of 100,000.
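
The length-dependent cutoff can be written as a small lookup. This is an illustrative Python sketch of the thresholds listed above, not FlowerPower's actual code; the function names are hypothetical, and treating a score exactly equal to the cutoff as acceptable is an assumption.

```python
def max_evalue_for_length(seq_len):
    """Return the default maximum E-value for a sequence of the given length."""
    if seq_len < 65:
        return 1.0e-2
    if seq_len <= 100:
        return 1.0e-3
    return 1.0e-4


def passes_evalue_cutoff(evalue, seq_len):
    """True if the E-value is at or below the length-dependent cutoff.

    Assumption: a hit exactly at the cutoff is accepted.
    """
    return evalue <= max_evalue_for_length(seq_len)
```

For example, a hit with an E-value of 5.0e-4 would be accepted at length 80 but rejected at length 150, because the longer length carries the stricter cutoff.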

Number of PSI-BLAST iterations

In FlowerPower, the default setting is three iterations of PSI-BLAST for retrieval of the set of sequences for HMM search.

Although FlowerPower is able to effectively prevent the negative consequences of profile drift (particularly, failing to recognize the query sequence), running more than three PSI-BLAST iterations is only recommended when using the global-local option of sequence search, or when a user wishes to potentially identify very distant homologs. Increasing the number of PSI-BLAST iterations also comes at the cost of a significant increase in run time, both because the PSI-BLAST run itself is longer and because HMM scoring of a larger set of sequences takes more time.

Required identity to existing sequence for inclusion

The default setting for minimum percent identity between a new sequence and a previously included sequence is 20%. This, in combination with conservatively set maximum E-values against a subfamily (or family) HMM, provides good remote homolog detection while minimizing the inclusion of non-homologs. Increasing the required identity criterion may result in a more closely related set of sequences, but is likely to result in missing some homologs. Lowering the required identity threshold will allow inclusion of more remotely related sequences, at the risk of including non-homologs and subsequent problems in phylogenomic inference of function.
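
To make the identity criterion concrete, here is a minimal Python sketch. It is not FlowerPower's implementation: the function names are hypothetical, and percent identity is assumed to be the number of identical residue pairs divided by the number of alignment columns in which both sequences have a residue (other definitions exist and give different values).

```python
def percent_identity(aligned_a, aligned_b, gap_chars="-."):
    """Percent identity between two rows of the same alignment.

    Assumption: columns where either row has a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a in gap_chars or b in gap_chars:
            continue  # skip gapped columns
        compared += 1
        if a.upper() == b.upper():
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


def meets_identity_requirement(candidate, included, min_percent=20.0):
    """True if the candidate reaches min_percent identity to at least
    one previously included sequence (all rows from one alignment)."""
    return any(percent_identity(candidate, s) >= min_percent for s in included)
```

With the 20% default, a candidate needs to reach the threshold against only one previously included sequence, which is what allows the set to grow outward toward more remote homologs.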

Search Database

The UniProt database is used by default for PSI-BLAST homolog search. The NR database is also available.

Coverage

Fractional coverage of a sequence relative to an HMM (sequence coverage), or of an HMM relative to a sequence (HMM coverage), provides a measure of their overall structural similarity. We measure these by aligning a sequence to an HMM and determining the fraction of the sequence aligned to the HMM, and the fraction of the HMM match states through which a sequence passes in an optimal path through the HMM. FlowerPower uses HMM-length-specific coverage requirements designed to optimize the retrieval of global and glocal homologs. Users can manipulate the query and hit coverage requirements to identify additional sequences with only partial similarity. This can be useful when restrictive coverage parameters produce very few homologs. However, we recommend caution in transferring annotations from any partial matches.
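
The two coverage measures can be illustrated with a short Python sketch. The data representation here is an assumption: the optimal path is given as a list of (match_state_index, sequence_position) pairs, one for each residue emitted by a match state. The function names and the thresholds in the usage note are hypothetical, since the actual HMM-length-specific requirements are not listed here.

```python
def coverage(aligned_pairs, seq_len, hmm_len):
    """Return (sequence_coverage, hmm_coverage) as fractions from 0 to 1.

    aligned_pairs: iterable of (match_state_index, sequence_position)
    pairs, one per residue aligned to an HMM match state.
    """
    if seq_len <= 0 or hmm_len <= 0:
        raise ValueError("sequence and HMM lengths must be positive")
    pairs = list(aligned_pairs)
    # Fraction of the sequence's residues aligned to the HMM.
    residues = {seq_pos for _, seq_pos in pairs}
    # Fraction of the HMM's match states the sequence passes through.
    states = {state for state, _ in pairs}
    return len(residues) / seq_len, len(states) / hmm_len


def meets_coverage(seq_cov, hmm_cov, min_seq_cov, min_hmm_cov):
    """True if both coverage fractions reach their minimum thresholds."""
    return seq_cov >= min_seq_cov and hmm_cov >= min_hmm_cov
```

For example, a 200-residue hit that aligns 80 residues to 80 of the 100 match states of an HMM has a sequence coverage of 0.4 and an HMM coverage of 0.8. With arbitrary thresholds of 0.7 for both, this hit would be rejected as a global match, although the high HMM coverage is the pattern expected of a global-local match with extra domains.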

Frequently Asked Questions

Q: What is a book?

A: We use the term "book" to reflect the complex data associated with each PhyloFacts family - multiple sequence alignment, phylogenetic trees, predicted 3D structures, identified subfamilies, GO annotations, family and subfamily hidden Markov models, etc. The term also reflects the division of PhyloFacts into libraries of different types (e.g., Plant Disease Resistance, GPCRs, microbial gene families, etc.). Currently there are more than 55,000 completed books in the PhyloFacts database, divided into various libraries.

Q: What are global homologs?

A: Global homologs are sequences that share homology along their entire lengths. This distinguishes them from sequences sharing only a local region of similarity and enables us to infer that they have evolved from a common ancestor without any intervening domain shuffling. Since the presence or absence of a domain can cause a dramatic change in function, transferring annotations between sequences must be restricted to those sharing the same overall domain architecture. FlowerPower default parameters are designed to retrieve global homologs to a user-submitted query.

Q: What does "domain architecture" mean?

A: The precise sequence and order of structural domains composing a sequence.

Q: What is domain shuffling?

A: The biological process by which protein families acquire or lose domains, through domain fusion (merging two structures) or fission (dividing one structure into two separate structures).

Q: Why can't I just use a restrictive E-value cutoff in BLAST to infer function?

A: A significant BLAST E-value tells you that two sequences are homologous, but does not distinguish between global and local matches. It also does not take into account the possibility of functional divergence due to gene duplication or other evolutionary processes. In addition, existing annotation errors in the databases have themselves been propagated by annotation transfer.

Q: What does GO stand for?

A: GO stands for the Gene Ontology. The GO consortium has produced an ontology of gene/protein molecular functions, biological processes and cellular locations. It represents a large effort to make a structured language under which many biological properties of gene products can be classified. The ontology takes the form of a directed acyclic graph of increasingly specific related terms. Because many groups use the Gene Ontology to annotate their sequences, the database represents a source of annotation of much higher quality than found in the headers of most sequences. GO terms are normally accompanied by Evidence Codes, indicating the source of the annotation. The vast majority of GO annotations have been assigned entirely based on electronic evidence; a scant 3% of UniProt annotations have any experimental support.

Q: What does EC stand for?

A: EC stands for the Enzyme Commission, but you can really think of it as an Enzyme Classification system. It represents a structured classification of enzymes by their general and specific biocatalytic properties. The relationship between enzyme specificity and evolutionary distance makes the analysis of enzymes a particularly effective use of phylogenomic inference.

Q: How long does PhyloBuilder take?

A: PhyloBuilder jobs are highly variable in the time required for completion. Longer input sequences with large numbers of homologs take longer than short sequences with few homologs. In general, requiring more SCI-PHY or PSI-BLAST iterations and lowering thresholds for sequence inclusion will cause more sequences to be accepted and will take more time. Using program defaults, most inputs will complete in less than an hour.

Q: What is FlowerPower?

A: FlowerPower is an algorithm optimized for retrieval of global homologs. It is similar to PSI-BLAST in its use of iterated profile search, inclusion of new sequences, and profile re-estimation, but has some distinct differences. The two most significant differences involve the use of subfamily HMMs instead of a single profile, and the use of alignment analysis to restrict included sequences to those meeting user-specified criteria. This combination improves alignment accuracy, minimizes the intrusion of non-homologs, and prevents profile drift. Details on FlowerPower and experimental results are available [Krishnamurthy et al., 2006].

Use the FlowerPower webserver

Q: What is SCI-PHY really doing?

A: SCI-PHY finds a way to divide a set of sequences into subtypes that effectively approximates an expert division using only the multiple sequence alignment as input. It does so by clustering sequences into the largest possible groups such that each column in the cluster contains amino acids that are biochemically similar, and avoids combining two groups if that would produce a significant number of columns with biochemically different amino acids. This criterion enables SCI-PHY to produce clusters where all the members of a subfamily have a consistent structure and function. Our tests show SCI-PHY subfamilies almost always agree at the 4th digit of an EC classification, correlate very closely to conserved subtrees found by standard phylogenetic analysis, and are consistent with expert divisions of sequences into subtypes.

Visit the SCI-PHY webserver

Q: How often can I use PhyloBuilder?

A: PhyloBuilder is computationally intensive and our compute cluster is relatively small. Our standard user quota is five submissions per day. Please contact us if you would like a temporary increase to your quota.

Q: I'd like to customize the PhyloBuilder results page to include my own perspective. Is this possible?

A: Each PhyloBuilder book includes a link at the bottom of the page to start the editorial process. You will be asked to fill out a short form, after which you will be able to customize your PhyloBuilder book. We also encourage biologists to contribute to the community annotation of protein families in PhyloFacts. Contact us if you'd like to become a book editor.