SCI-PHY

The SCI-PHY Web Server is available here.

SCI-PHY stands for Subfamily Classification In PHYlogenomics. SCI-PHY creates a full rooted tree whose leaves are the input sequences. Some of the internal nodes (forming a cross-section of the tree) are marked as subfamily nodes. All sequences descended from a subfamily node are considered to be grouped together in a biologically significant way.

The algorithm starts with every input sequence as a separate subtree at the root level, and then iteratively joins subtrees until a single rooted tree remains.
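The bottom-up joining described above can be sketched as generic agglomerative clustering. This is only an illustration of the control flow: the distance used here (average fraction of mismatched alignment columns) is a stand-in chosen for brevity and is an assumption, not the measure SCI-PHY uses, and the names build_tree, leaves and mismatch_fraction are hypothetical. The sketch also does not choose subfamily nodes.

```python
from itertools import combinations


def mismatch_fraction(a, b):
    """Fraction of alignment columns in which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def leaves(tree):
    """Return the leaf names of a tree (a leaf is a str, a join is a 2-tuple)."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def build_tree(seqs):
    """Join subtrees pairwise until one rooted tree remains.

    seqs maps sequence name -> aligned sequence (all of equal length,
    at least one entry). The distance is a placeholder, NOT SCI-PHY's.
    """
    forest = list(seqs)  # every sequence starts as its own subtree

    def distance(t1, t2):
        pairs = [(a, b) for a in leaves(t1) for b in leaves(t2)]
        return sum(mismatch_fraction(seqs[a], seqs[b]) for a, b in pairs) / len(pairs)

    while len(forest) > 1:
        # Pick the closest pair of subtrees and replace them with their join.
        i, j = min(
            combinations(range(len(forest)), 2),
            key=lambda ij: distance(forest[ij[0]], forest[ij[1]]),
        )
        joined = (forest[i], forest[j])
        forest = [t for k, t in enumerate(forest) if k not in (i, j)] + [joined]
    return forest[0]


toy = {"a": "AAAA", "b": "AAAT", "c": "TTTT", "d": "TTTA"}
tree = build_tree(toy)
```

For the toy input, the two similar pairs are joined first and their subtrees are joined last, giving a single root.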

See: Brown DP, Krishnamurthy N, Sjölander K. "Automated Protein Subfamily Identification and Classification." PLoS Computational Biology 2007, 3(8): e160. doi:10.1371/journal.pcbi.0030160

Source code. The source code is available as a gzipped tar file. SCI-PHY has been installed successfully on Linux (RedHat, Debian), Mac OS X, and Windows (2000 and XP); nevertheless, the program is supplied without warranty, and we are unable to provide support.


Aligned FASTA format

FASTA format is described here.

Typically, a FASTA format sequence consists of a line beginning with a ">" that provides the sequence name or description, followed by one or more lines with one-letter amino acid ("residue") codes.

In aligned FASTA format, gaps in one sequence (relative to the other sequences in the alignment) are indicated by a "-" character. Each sequence in the alignment must have the same number of characters (counting gaps as well as residues).

Spaces will be ignored.

Here is a simple example of a FASTA format alignment:

         >Q9ZVC2/318-350
         AGKTPLHIAAEMV----------------------SPDMVAVLLD----HHADPNVRTV
         >Q9UK73/218-249
         HGMTPLKVAAESC----------------------KADVVELLLS-----HADCDRRSR
         >LITA_LATMA/533-565
         NGYTPLHIAADSN----------------------KNDFVMFLIG----NNADVNVRTK
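The rules above (a ">" header line, spaces ignored, equal length for every sequence) can be checked before submitting an alignment. The following is a minimal sketch, not the parser SCI-PHY itself uses; the function names and the short two-sequence input are made up for illustration.

```python
def read_aligned_fasta(text):
    """Parse aligned FASTA text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            # Spaces inside sequence lines are ignored.
            chunks.append(line.replace(" ", ""))
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_alignment(records):
    """Return the alignment width, or raise ValueError if lengths differ."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) > 1:
        raise ValueError(f"sequences differ in length: {sorted(lengths)}")
    return lengths.pop() if lengths else 0


# Hypothetical two-sequence alignment; the space in seq1 is dropped.
example = """\
>seq1
AC D-EF
>seq2
AC--EF
"""
records = read_aligned_fasta(example)
width = check_alignment(records)
```

Here both sequences come out six characters long once the space is dropped, so the check passes.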

A2M format

UCSC a2m format is described here.

A2M stands for "Align to model". An A2M alignment is generated by aligning sequences to a hidden Markov model (HMM), and the format describes the path each sequence takes through the HMM. Uppercase characters are generated in an HMM match state, while lowercase characters are generated in an HMM insert state. Dashes ("-") indicate that a sequence used an HMM skip (or delete) state. Uppercase characters and "-" represent alignment columns, and each sequence must have exactly the same number of alignment columns. Lowercase characters (and dots, ".") represent insertion positions between alignment columns or at the ends of the sequence.

The dots are inserted so that all rows of the multiple alignment have exactly the same number of characters, allowing the alignment to be viewed or displayed with an alignment viewer or editor (such as Jalview or Belvu). Note that the SAM software may output alignments without the dots to save space, but such alignments are rejected by most alignment viewers.

Here is an example of an alignment in aligned FASTA format:

         >test1
         ----------CCCCCCCCCCCCCCCADEFGCCCCCCCCCCCCCCCC-----------------
         >test2
         EEEEEEEEEE---------------ADEFG----------------EEEEEEEEEEEEEEEEE

Here is the same alignment in A2M format:

         >test1
         cccccccccccccccADEFGcccccccccccccccc.
         >test2
         .....eeeeeeeeeeADEFGeeeeeeeeeeeeeeeee
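The A2M conventions described above can be made concrete with a short sketch. It keeps only the alignment columns (uppercase letters and "-") of each row and checks that every row has the same number of them. The function names are hypothetical and this is not part of SCI-PHY or SAM; the two rows are the test1 and test2 sequences from the A2M example.

```python
def alignment_columns(row):
    """Keep match-state (uppercase) and delete-state ('-') characters only."""
    return "".join(ch for ch in row if ch.isupper() or ch == "-")


def unaligned_residues(row):
    """Recover the ungapped sequence: drop '-' and '.', uppercase inserts."""
    return "".join(ch for ch in row if ch.isalpha()).upper()


def check_a2m(rows):
    """Return the number of alignment columns, or raise if rows disagree."""
    counts = {len(alignment_columns(row)) for row in rows.values()}
    if len(counts) > 1:
        raise ValueError(f"rows differ in alignment columns: {sorted(counts)}")
    return counts.pop() if counts else 0


rows = {
    "test1": "cccccccccccccccADEFGcccccccccccccccc.",
    "test2": ".....eeeeeeeeeeADEFGeeeeeeeeeeeeeeeee",
}
columns = {name: alignment_columns(row) for name, row in rows.items()}
n_columns = check_a2m(rows)
```

Both rows reduce to the five alignment columns ADEFG; the lowercase c and e runs are insertions and do not count as columns.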


How to define your own subfamilies

You may divide your alignment into subfamilies of sequences yourself rather than having SCI-PHY do so. In this case, SCI-PHY will calculate a hidden Markov model for each user-defined subfamily.

Subfamilies are defined by including a line,

         %subfamily <label>
         

in the alignment before each group of sequences that belongs to the subfamily. "<label>" is a short name for that subfamily.
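As an illustration, the sketch below writes an alignment with a "%subfamily <label>" line ahead of each group of sequences. The labels, sequence names and sequences are invented, and the exact whitespace the server accepts is an assumption; only the placement of the "%subfamily" line before each group comes from the description above.

```python
def format_with_subfamilies(subfamilies):
    """Render {label: [(name, aligned_sequence), ...]} as alignment text.

    A '%subfamily <label>' line precedes each group of sequences.
    """
    lines = []
    for label, records in subfamilies.items():
        lines.append(f"%subfamily {label}")
        for name, sequence in records:
            lines.append(f">{name}")
            lines.append(sequence)
    return "\n".join(lines) + "\n"


# Hypothetical subfamily labels and sequences, for illustration only.
groups = {
    "groupA": [("seq1", "ACD-EF"), ("seq2", "AC--EF")],
    "groupB": [("seq3", "GHD-EF")],
}
text = format_with_subfamilies(groups)
```

The resulting text lists groupA with its two sequences, then groupB with its single sequence, each group introduced by its own "%subfamily" line.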


Upload FASTA file

Alternatively, sequences in a file on your local computer may be used as the input seed. The sequence file must be in FASTA or A2M format.


Send email to

An email announcing completion of the SCI-PHY run will be sent to this address. The email will provide a link to the results.


HMMER format

The HMMER-format HMMs are derived from the SAM HMMs by using the sam2hmmer conversion.