SCI-PHY

Subfamily Classification In PHYlogenomics. SCI-PHY creates a full rooted tree, with the leaves being the input sequences. Some of the internal nodes (forming a cross-section of the tree) are marked as subfamily nodes. All the descendant sequences of these nodes are considered to be grouped together in a biologically significant way.

The algorithm starts with every input sequence as the root of its own single-node subtree, and then iteratively joins subtrees until a single rooted tree remains.
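The bottom-up joining described above can be illustrated with a generic agglomerative sketch. This is not SCI-PHY's actual scoring: the mismatch fraction and average linkage used here are stand-ins chosen for brevity, the names `mismatch_fraction`, `leaves`, and `build_tree` are hypothetical, and the step that marks subfamily nodes is not shown.

```python
from itertools import combinations


def mismatch_fraction(a, b):
    """Stand-in distance: fraction of aligned columns where a and b differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def leaves(tree):
    """Return the leaf names under a tree given as a name or a nested 2-tuple."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def build_tree(seqs):
    """Join subtrees bottom-up until one rooted tree remains.

    seqs maps a sequence name to its aligned sequence (equal lengths assumed).
    """
    forest = list(seqs)  # every sequence starts as its own subtree

    def dist(t1, t2):
        # Average linkage (an assumption): mean distance over all leaf pairs.
        pairs = [(a, b) for a in leaves(t1) for b in leaves(t2)]
        return sum(mismatch_fraction(seqs[a], seqs[b]) for a, b in pairs) / len(pairs)

    while len(forest) > 1:
        t1, t2 = min(combinations(forest, 2), key=lambda p: dist(*p))
        forest = [t for t in forest if t is not t1 and t is not t2]
        forest.append((t1, t2))  # the joined pair becomes a new internal node
    return forest[0]


# Made-up toy alignment, for illustration only.
toy = {"a": "ACDE", "b": "ACDE", "c": "ACDF", "d": "WWWW"}
tree = build_tree(toy)
print(tree)
```

In this toy input the identical sequences `a` and `b` are joined first, then `c`, and the unrelated `d` is attached last at the root.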

See: Brown DP, Krishnamurthy N, Sjölander K, "Automated Protein Subfamily Identification and Classification," PLoS Computational Biology 2007, 3(8): e160. doi:10.1371/journal.pcbi.0030160

Source code. Source code is available as a gzipped tar file. SCI-PHY has been installed successfully on Linux (RedHat, Debian), Mac OS X, and Windows (2000 and XP); nevertheless, the program is supplied without warranty, and we regret that we are unable to provide support.


Aligned FASTA format

FASTA format is described here.

Typically, a FASTA format sequence consists of a line beginning with a ">" that provides the sequence name or description, followed by one or more lines with one-letter amino acid ("residue") codes.

In aligned FASTA format, gaps in one sequence (relative to the other sequences in the alignment) are indicated by a "-" character. Each sequence in the alignment must have the same number of characters (counting gaps as well as residues).

Spaces will be ignored.

Here is a simple example of a FASTA format alignment:

>Q9ZVC2/318-350
AGKTPLHIAAEMV----------------------SPDMVAVLLD----HHADPNVRTV
>Q9UK73/218-249
HGMTPLKVAAESC----------------------KADVVELLLS-----HADCDRRSR
>LITA_LATMA/533-565
NGYTPLHIAADSN----------------------KNDFVMFLIG----NNADVNVRTK
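The two rules stated above (every sequence has the same number of characters, and spaces are ignored) can be checked before submitting an alignment. The following is a minimal sketch, not part of SCI-PHY; the function names `read_aligned_fasta` and `check_equal_length` and the toy sequences are hypothetical.

```python
def read_aligned_fasta(text):
    """Parse aligned FASTA text into a list of (header, sequence) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            # Spaces are ignored, so drop them before appending.
            records[-1][1] += line.replace(" ", "")
    return [(header, seq) for header, seq in records]


def check_equal_length(records):
    """Return the shared alignment length, or raise ValueError if rows differ."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError(f"sequences do not share one length: {sorted(lengths)}")
    return lengths.pop()


# Made-up toy alignment; gaps count toward the length, spaces do not.
example = """\
>seqA
MK-LV
>seqB
MKT LV
>seqC
M--LV
"""
records = read_aligned_fasta(example)
print(check_equal_length(records))
```

A sequence split over several lines is concatenated, so line wrapping does not affect the length check.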

A2M format

UCSC a2m format is described here.

In brief, A2M format is compatible with FASTA. Uppercase characters and "-" represent alignment columns, and there must be exactly the same number of alignment columns in each sequence. Lowercase characters (and spaces or ".") represent insertion positions between alignment columns or at the ends of the sequence. The spaces or periods in the multiple alignments are only for human readability and may be omitted.

Here is an example alignment in aligned FASTA format:

>test1
----------CCCCCCCCCCCCCCCADEFGCCCCCCCCCCCCCCCC-----------------
>test2
EEEEEEEEEE---------------ADEFG----------------EEEEEEEEEEEEEEEEE

Here is the same alignment in A2M format:

>test1
cccccccccccccccADEFGcccccccccccccccc.
>test2
.....eeeeeeeeeeADEFGeeeeeeeeeeeeeeeee
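The A2M rule that uppercase characters and "-" form the alignment columns, while lowercase characters, spaces, and "." mark insertions, can be sketched as follows. This is an illustrative helper rather than SCI-PHY or SAM code; the name `alignment_columns` is hypothetical, and the two rows are copied from the A2M example above.

```python
def alignment_columns(a2m_row):
    """Keep only alignment-column characters: uppercase letters and '-'."""
    return "".join(ch for ch in a2m_row if ch.isupper() or ch == "-")


rows = {
    "test1": "cccccccccccccccADEFGcccccccccccccccc.",
    "test2": ".....eeeeeeeeeeADEFGeeeeeeeeeeeeeeeee",
}

columns = {name: alignment_columns(row) for name, row in rows.items()}
# Every sequence must contribute the same number of alignment columns.
column_counts = {len(cols) for cols in columns.values()}
print(columns, column_counts)
```

Both rows reduce to the same five alignment columns, which is the consistency condition that A2M requires.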


How to define your own subfamilies

You may divide your alignment into subfamilies of sequences rather than having SCI-PHY do so. In this case SCI-PHY will calculate Hidden Markov Models for each user-defined subfamily.

Subfamilies are defined by including a line,

%subfamily <label>

in the alignment before each group of sequences that belongs to that subfamily. "<label>" is a short name for the subfamily.
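As an illustration of this layout, the sketch below reads an alignment that contains "%subfamily" lines and groups the sequence names under each label. It is not SCI-PHY's parser: the function name `parse_subfamilies`, the labels `groupA` and `groupB`, and the sequences are hypothetical, and it assumes the label is written without the angle brackets.

```python
def parse_subfamilies(text):
    """Map each %subfamily label to the names of the sequences that follow it."""
    groups = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("%subfamily"):
            parts = line.split(None, 1)
            current = parts[1] if len(parts) > 1 else ""
            groups.setdefault(current, [])
        elif line.startswith(">") and current is not None:
            fields = line[1:].split()
            # Use the first word of the header as the sequence name.
            groups[current].append(fields[0] if fields else "")
    return groups


# Made-up alignment with two user-defined subfamilies.
alignment = """\
%subfamily groupA
>seq1
MK-LV
>seq2
MKTLV
%subfamily groupB
>seq3
M--LV
"""
subfamilies = parse_subfamilies(alignment)
print(subfamilies)
```

Sequences that appear before the first "%subfamily" line are skipped by this sketch, since they have no label to attach to.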


Upload FASTA file

Sequences in a file on your local computer may be used as the input seed. The sequence file must be in FASTA or A2M format.




Send email to

An email announcing completion of the SCI-PHY run will be sent to this address. The email will include a link to the results.


HMMER format

The HMMER-format HMMs are derived from the SAM HMMs by using the sam2hmmer conversion.