Submit either an amino acid or a nucleotide sequence in FASTA format. The submitted sequence is classified to a protein family by Hidden Markov Model (HMM) scoring. Nucleotide sequences are first translated into all six frames and each frame is analyzed separately. Batch mode submission of up to five sequences is enabled. Results are returned by e-mail, and allow users to select families for more detailed classification of sequences to functional subfamilies based on scoring against subfamily HMMs.
PhyloFacts Sequence Search Results
Paste sequences in FASTA format. Sequences entered here will be used for searching the set of books and/or subfamilies in the model library. Sequences must be entered in FASTA format and may be either amino acids or nucleotides (DNA); nucleotide input makes it possible to search using an Expressed Sequence Tag (EST), for example.
Upload FASTA file. Sequences in a file on your local computer may be used for searching the model library. The sequence file must be in FASTA format.
Nucleotide sequence. Check this box if your input sequence represents nucleotides (DNA). The search of the PhyloFacts library "books" will be conducted by first translating the nucleotide sequence to proteins. Six translations will be used: three forward and three reverse (that is, the reversed input sequence). Each forward and reverse translation is derived by offsetting the translation "frame" by zero, one, or two nucleotide positions. Thus, the search of the PhyloFacts library will be done using six different "protein sequences."
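The six-frame translation described above can be illustrated with a minimal sketch. This is not the PhyloFacts implementation: the function names are hypothetical, the standard genetic code is assumed, and the reverse frames are taken from the reverse complement by default, which is the usual convention. Because the text above describes the reverse frames as coming from "the reversed input sequence", a flag is provided to use the plain reversed sequence instead.

```python
# Standard genetic code, laid out in TCAG order (first, second, third base).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna, reverse_complement=True):
    """Return the three forward and three reverse translations of dna.

    Assumption: reverse frames use the reverse complement unless
    reverse_complement=False, in which case the sequence is only reversed.
    """
    dna = dna.upper()
    reverse = dna[::-1]
    if reverse_complement:
        reverse = reverse.translate(COMPLEMENT)
    frames = {}
    for offset in range(3):  # frame offsets of zero, one, or two nucleotides
        frames[f"+{offset}"] = translate(dna[offset:])
        frames[f"-{offset}"] = translate(reverse[offset:])
    return frames
```

Each of the six resulting strings would then be searched as a separate "protein sequence".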
Send email to. Email will be sent to this address announcing completion of the search, and providing a URL link to the results.
Remember me. Your email address will be saved as a "cookie" within your current browser/computer if this box is checked. It will be used to automatically fill in your email address the next time you open this page from this browser and computer. Uncheck this box if you are not using your regular browser/computer.
Require protein families to match 60% or more of your sequence. Search results will be limited to those protein families which have a "global match" to your sequence. Global match is defined by a "bi-directional" coverage criterion that depends on the length of your sequence and the length of the protein family's HMM.
Bi-directional means that (1) the HMM coverage (the number of aligned characters between your sequence and the protein family hidden Markov model divided by the length of the protein family HMM) and (2) the sequence coverage (the number of aligned characters divided by the length of your sequence) both are at least as great as the criterion. In other words, bi-directional coverage means that the matching (aligned) regions between your sequence and the protein family are (1) a significant portion of the protein family consensus sequence or profile and (2) a significant portion of your sequence.
The coverage criterion used varies depending on the length of the protein family HMM as follows:
HMM length    Coverage criterion
<100          0.60
100-199       0.65
200+          0.70
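As a rough illustration of the bi-directional coverage test, the sketch below applies the thresholds from the table to a count of aligned positions. The function names are hypothetical, and treating the thresholds as inclusive follows the "at least as great as" wording above.

```python
def coverage_criterion(hmm_length):
    """Coverage threshold for a protein family HMM of the given length."""
    if hmm_length < 100:
        return 0.60
    if hmm_length < 200:
        return 0.65
    return 0.70


def is_global_match(aligned_positions, hmm_length, sequence_length):
    """True if both HMM coverage and sequence coverage meet the criterion."""
    criterion = coverage_criterion(hmm_length)
    hmm_coverage = aligned_positions / hmm_length
    sequence_coverage = aligned_positions / sequence_length
    return hmm_coverage >= criterion and sequence_coverage >= criterion
```

For example, 80 aligned positions against a 100-position HMM and a 120-residue query pass (coverages 0.80 and about 0.67 against a criterion of 0.65), while the same alignment against a 200-residue query fails on sequence coverage (0.40).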
Advanced options include:
Run BLAST pre-screen. BLAST will be used to identify books that match the query sequence. The BLAST search will be carried out against either each book's consensus sequence (if the "fast" checkbox is checked) or against all sequences in each book (if the "fast" checkbox is not checked). Once the BLAST search has identified books that match the query sequence, then the query sequence will be scored with each of those books' HMMs (and only those books' HMMs). BLAST searches run very quickly compared to HMM scoring; the BLAST pre-screen speeds up the search considerably.
If the "Run BLAST pre-screen" checkbox is not checked, then the query sequence will be scored against each book's HMM for every book in the specified libraries.
Fast BLAST pre-screen. If the "Run BLAST pre-screen" checkbox is checked, then the "fast" checkbox specifies whether the BLAST search will be carried out against each book's consensus sequence ("fast" checked) or against all sequences in each book ("fast" not checked).
Pre-screen e-value criterion. The e-value cut-off used to select books for further HMM scoring.
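The effect of the pre-screen options can be summarized in a small sketch. It only models the selection step: the function name and the dictionary of pre-screen e-values are hypothetical stand-ins, no real BLAST or HMM scoring is run, and treating the cut-off as inclusive is an assumption.

```python
def select_books_to_score(books, prescreen_evalues, evalue_cutoff, run_prescreen=True):
    """Return the books whose HMMs would be used to score the query.

    prescreen_evalues maps a book ID to the best pre-screen e-value for the
    query (hypothetical input); books with no hit are treated as e-value
    infinity. Without the pre-screen, every book is scored.
    """
    if not run_prescreen:
        return list(books)
    return [
        book
        for book in books
        if prescreen_evalues.get(book, float("inf")) <= evalue_cutoff
    ]
```

Only the books returned here would go on to the slower HMM scoring step.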
Batch mode. Each input sequence will be treated as if it had been entered as a separate search submission. Ordinarily, PhyloFacts will treat multiple input sequences as representing a single query. Sequences that all have the same length will be treated as an alignment. Sequences that have different lengths will first be aligned (with ClustalW) and then used to search the PhyloFacts library.
The PhyloFacts search always scores each input sequence with one HMM
from each book in the PhyloFacts library. The difference between
a standard PhyloFacts search and a batch search is in the presentation
of results: standard search results provide one top-scoring book from
each SCOP family for all input sequences. Batch mode results provide
first a summary table of the top-scoring book (for any SCOP family)
for each input sequence. The summary table allows users to select one
input sequence at a time to see the top-scoring books from each SCOP
family for that input sequence.
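The way multiple input sequences are interpreted can be sketched as a simple decision. The function name and the returned labels are hypothetical; the sketch only decides how the input would be handled and does not run ClustalW or any search.

```python
def plan_queries(sequences, batch_mode):
    """Describe how a list of input sequences would be treated.

    Returns a list of (treatment, sequences) pairs:
      - batch mode: one separate query per input sequence
      - otherwise: a single query, either used directly as an alignment
        (all sequences the same length) or aligned first (lengths differ)
    """
    if batch_mode or len(sequences) == 1:
        return [("single_sequence", [seq]) for seq in sequences]
    if len({len(seq) for seq in sequences}) == 1:
        return [("treat_as_alignment", list(sequences))]
    return [("align_with_clustalw_first", list(sequences))]
```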
Use HMMs based on.
The input sequence(s) will be scored by either the "best" HMM for each
book among the different HMMs available, or by the selected type of HMM
for each book. "Best" means the HMM which produces the greatest number
of "correct" search results ("true positives") before an "incorrect"
result ("false positive") when the HMM is used to score all sequences in
PDB40. "True positives" are search matches that are classified by SCOP
to be in the same "fold;" "false positives" are matches to PDB40 entries
that are classified by SCOP to be in a different "fold."
The best method can either search across SW methods ("all SW") or search
within the specified SW method ("specified SW").
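The "best" HMM criterion described above (true positives before the first false positive) can be illustrated as follows. This is a sketch under stated assumptions: each HMM's PDB40 hits are assumed to be already sorted from best to worst score and reduced to their SCOP fold labels, and the function names, HMM names, and fold labels are hypothetical.

```python
def true_positives_before_first_false_positive(ranked_hit_folds, book_fold):
    """Count leading hits in the book's fold before the first other fold."""
    count = 0
    for fold in ranked_hit_folds:
        if fold != book_fold:
            break  # first false positive ends the count
        count += 1
    return count


def pick_best_hmm(ranked_hits_by_hmm, book_fold):
    """Return the name of the HMM with the most leading true positives."""
    return max(
        ranked_hits_by_hmm,
        key=lambda name: true_positives_before_first_false_positive(
            ranked_hits_by_hmm[name], book_fold
        ),
    )
```

For instance, an HMM whose ranked hits have folds `["a.1", "a.1", "a.1", "b.2"]` scores 3 for a book in fold `a.1` and would be preferred over one with `["a.1", "b.2", "a.1", "a.1"]`, which scores 1.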
In the PhyloFacts library, to create HMMs for each "book" (based on a
PDB40 entry), homologous protein sequences were collected from the NCBI
NR database. Several different methods were used to identify homologous
protein sequences, and an HMM was created from the aligned homologous
protein sequences in each case.
The different methods include:
SW. Smith-Waterman scoring method. The input (query)
sequence is scored against each book in the PhyloFacts library
using the SAM hmmscore program. The hmmscore scoring option can be
chosen here.
Calibrated. Specifies whether the calibrated or uncalibrated
HMM for each PhyloFacts book will be used.
GHMM cutoff value.
The General Hidden Markov Model e-value limit. Only matches of the
input (query) sequence with books in the PhyloFacts library that
have a better (lower) e-value than this input will be returned.
Only HMMs matching. The search of the PhyloFacts library can be
limited to a subset of the "books" in the library. The books to be
searched are identified either by their SCOP
domain ID, for example, "d1g61a_", or by their Berkeley Phylogenomics
Group accession ID, for example "bpg011128".
Multiple domain IDs may be given, separated by spaces.
In addition, "regular expression" pattern
matching notation may be used; for example, "d1g6*" to match all books
with domain IDs beginning with "d1g6".
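A minimal sketch of restricting the search to a subset of books is shown below. The text calls the notation "regular expression" matching, but the example "d1g6*" reads like a shell-style wildcard, so this sketch assumes wildcard semantics and uses Python's `fnmatch` module. The function name is hypothetical, and the actual matching rules used by PhyloFacts may differ.

```python
from fnmatch import fnmatchcase


def select_books(book_ids, pattern_field):
    """Keep the book IDs that match any space-separated pattern.

    Assumption: patterns use shell-style wildcards ('*' matches any run of
    characters), so "d1g6*" matches every ID beginning with "d1g6".
    """
    patterns = pattern_field.split()
    return [
        book_id
        for book_id in book_ids
        if any(fnmatchcase(book_id, pattern) for pattern in patterns)
    ]
```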
Functional classification. This portion of the results display is limited to protein families that meet "global homology" criteria: each sequence in the family aligns to the family HMM over almost its full length. The family represents sequences having a common domain architecture. In addition, the matches in this portion of the results display are also required to have a "global match" to your sequence. Global match is defined by a "bi-directional" coverage criterion that depends on the length of your sequence and the length of the protein family's HMM.
Bi-directional means that (1) the HMM coverage (the number of aligned characters between your sequence and the protein family hidden Markov model divided by the length of the protein family HMM) and (2) the sequence coverage (the number of aligned characters divided by the length of your sequence) both are at least as great as the criterion. In other words, bi-directional coverage means that the matching (aligned) regions between your sequence and the protein family are (1) a significant portion of the protein family consensus sequence or profile and (2) a significant portion of your sequence.
The coverage criterion used varies depending on the length of the protein family HMM as follows:
HMM length    Coverage criterion
<100          0.60
100-199       0.65
200+          0.70
Note: The coverage requirements shown above are less restrictive than the coverage requirements we use in FlowerPower and are set to enable you to see additional possible matches to your submitted protein.
In the case of longer sequences or HMMs (e.g., greater than 1000 positions), this coverage may not be sufficient to prevent matches between sequences and HMMs representing different domain architectures.
Best other matching protein families.
"Other" matching protein families include "global homology" protein
families (see
Coverage topic) that do not have a global match
(full-length alignment) to your sequence, as well as non-global homology
protein families and protein families that do not represent structural
domains.
These include protein families representing conserved
regions and motifs, as well as some protein families that did not meet
all quality control criteria to be classified as global homology protein
families.
Max. e-value shown. The largest (that is, least significant) e-value
for a domain which will be shown on the map. Default: the "GHMM cutoff
value" specified on the PhyloFacts input page will be used, except in those
cases where a PhyloFacts book has a specific e-value criterion.
(The GHMM cutoff value has a default value of 1.0.)
If one or more sequences were input into the first text area, each of
these is scored against the general HMMs in the model
library.
Once the books whose GHMMs the input sequences scored best against
have been identified, the user can select one or more to further
examine, using subfamily HMMs. A subfamily HMM is computed by
partitioning the alignment into a group of subfamilies using the SCI-PHY
program.
For each subfamily, an HMM is generated by replacing the match state
probabilities in the general HMM. This process is further described
here.
The alignment of the input sequence to the HMM (either general or
subfamily) can be displayed together with the alignment of the book
seed or consensus sequence. This is produced using the
hmmscore program with -sw values of 0, 1, 2, and 3
to give all combinations of sequence and HMM local and global alignments.
While the process is running, a page is displayed that gives the input
sequences, parameters, and status information. Once it is finished a
table is presented that gives the best matches of books to input sequences,
grouped together by SCOP classification.
One or more matches may be selected by clicking the checkbox in the
second column, and then used for further searching by clicking
"Go" after "Search selected books for top-scoring subfamily HMMs
against query for selected families".
If subfamilies are searched, the process will be run a second time, matching
the sequences against subfamily HMMs for the selected books.
This column shows the result of subfamily classification by logistic
regression.
The logistic regression classification works by first scoring every sequence
in each subfamily with both its own subfamily's HMM and every other
subfamily's HMM.
These scores are then used to fit a logistic curve for each
subfamily that best predicts each sequence's membership in that
subfamily (indicator value = 1) or non-membership in that subfamily
(indicator value = 0). SAM reverse scores are used in this calculation.
The resulting logistic curve is then used to predict the query sequence's
probability of being a member of a subfamily given its score by that
subfamily's HMM.
Note that if your query sequence is a "novel subtype" — that is,
a sequence which is not classifiable to any of the existing SCI-PHY
subfamilies — then this will be indicated by a low probability
being reported for each SCI-PHY subfamily.
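The idea of fitting a logistic curve to HMM scores can be illustrated with a small sketch. This is not the PhyloFacts code: the function names are hypothetical, the fit uses plain gradient descent on standardized scores (the fitting procedure actually used is not described here), and the training scores in the usage note are made-up toy values.

```python
import math


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(scores, labels, learning_rate=0.5, steps=2000):
    """Fit P(member | score) for one subfamily.

    scores: HMM scores of training sequences against this subfamily's HMM.
    labels: 1 for members of the subfamily, 0 for non-members.
    Returns (weight, bias, mean, std) for use in membership_probability.
    """
    n = len(scores)
    mean = sum(scores) / n
    std = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5 or 1.0
    xs = [(s - mean) / std for s in scores]  # standardize for stable steps
    weight = bias = 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, labels):
            error = _sigmoid(weight * x + bias) - y
            grad_w += error * x
            grad_b += error
        weight -= learning_rate * grad_w / n
        bias -= learning_rate * grad_b / n
    return weight, bias, mean, std


def membership_probability(model, score):
    """Predicted probability that a query with this score is a member."""
    weight, bias, mean, std = model
    return _sigmoid(weight * (score - mean) / std + bias)
```

With toy data in which members score around -45 and non-members around -7, a query scoring -48 receives a probability above 0.5 and a query scoring -4 receives one below 0.5. A query that receives a low probability from every subfamily's curve corresponds to the "novel subtype" case.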
The subfamily logistic regression calculations were created by
Duncan Brown.
This column shows the "local-global" percent identity between the
query sequence and the family consensus sequence — that is, the
ratio of the number of identical positions ("local" to the query) to
the length of (number of nodes in) the HMM ("globally" within the HMM).
The numerator is the number of identical positions.
The denominator is the length of (number of nodes in) the family HMM.
If there is more than one alignment of the PhyloFacts book
HMM to the query (for example, when the PhyloFacts book represents a domain,
and there is more than one such domain in the query sequence), then
the reported percent identity shows the highest percent identity among the
multiple alignments.
Click "Go" under "View alignment" to see details.
This column shows the "local-local" percent identity between
the query and the family consensus sequence — that is, the
ratio of the number of identical positions ("local" to the query) to
the number of aligned positions within the HMM ("locally" within the HMM).
The numerator is the number of identical positions.
The denominator is the length of the aligned region between the query and the HMM.
As is the case with the "local-global" percent identity,
if there is more than one alignment of the PhyloFacts book
HMM to the query (for example, when the PhyloFacts book represents a domain,
and there is more than one such domain in the query sequence), then
the reported "local-local" percent identity shows the highest percent
identity among the multiple alignments.
Click "Go" under "View alignment" to see details.
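The two percent identity measures differ only in their denominator, which the following sketch makes explicit. It assumes each alignment is given as a pair of equal-length strings (query and family consensus) with "-" for gaps, and that an "aligned position" is a column where neither string has a gap; the function names and this representation are illustrative assumptions.

```python
def count_identities_and_aligned(query_aln, consensus_aln, gap="-"):
    """Return (identical positions, aligned positions) for one alignment."""
    identities = aligned = 0
    for q, c in zip(query_aln, consensus_aln):
        if q == gap or c == gap:
            continue  # gapped columns are not aligned positions
        aligned += 1
        if q.upper() == c.upper():
            identities += 1
    return identities, aligned


def percent_identities(alignments, hmm_length):
    """Return (local-global %, local-local %) as the highest over alignments.

    local-global: identities / number of nodes in the family HMM
    local-local:  identities / number of aligned positions
    """
    best_local_global = best_local_local = 0.0
    for query_aln, consensus_aln in alignments:
        identities, aligned = count_identities_and_aligned(query_aln, consensus_aln)
        best_local_global = max(best_local_global, 100 * identities / hmm_length)
        if aligned:
            best_local_local = max(best_local_local, 100 * identities / aligned)
    return best_local_global, best_local_local
```

For an HMM with 10 nodes and an alignment with 4 identities over 5 aligned positions, the local-global value is 40% and the local-local value is 80%.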
This column in the subfamily search results table shows the "global-local"
percent identity between the query and the subfamily/family HMM consensus
sequence, where the subfamily/family consensus sequence includes both
the subfamily-specific consensus amino acids, and — in remaining
positions — the family consensus amino acids.
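The fraction is calculated as the number of identities between the query and the subfamily/family consensus, divided by the number of family HMM positions.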
This column in the subfamily search results table shows the "global-local"
percent identity between the query and the subfamily consensus sequence,
where the subfamily consensus sequence includes only the
subfamily-specific portion of the subfamily HMM.
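The fraction is calculated as the number of identities between the query and the subfamily-specific consensus, divided by the number of subfamily-specific HMM positions.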
FASTA format. The FASTA format is described
here.
Basically, it consists of a sequence name or description line, beginning
with a > at the start of the line, followed by one or more lines of
sequence using the one-letter codes. For our purposes the codes X,
*, B, U, Z, and - should be avoided. The text description after the > symbol
is not interpreted by PhyloFacts.
An example of FASTA format:
Methods
Outputs
Probability query classified to subfamily
% id - HMM
% id - aligned pos.
% id - family HMM
     number of identities between query and subfamily/family consensus
id = -----------------------------------------------------------------
                      number of family HMM positions
% id - subfamily HMM
     number of identities between query and subfamily-specific consensus
id = -------------------------------------------------------------------
                 number of subfamily-specific HMM positions
File formats
Fasta format
>d1dlwa_ a.1.1.1 (A:) Truncated hemoglobin {Ciliate (Paramecium caudatum)}
slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalggpnawt
grnlkevhanmgvsnaqfttvighlrsaltgagvaaalveqtvavaetvrgdvvtv
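The FASTA rules described above can be checked before submission with a short sketch. The function names are hypothetical and this is not how PhyloFacts itself parses input; it simply splits records on lines beginning with ">" and reports any of the discouraged codes (X, *, B, U, Z, and -).

```python
DISCOURAGED_CODES = set("X*BUZ-")


def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def discouraged_codes(sequence):
    """Return the sorted discouraged codes that occur in the sequence."""
    return sorted(set(sequence.upper()) & DISCOURAGED_CODES)
```

Applied to the example record above, `parse_fasta` would return one pair whose description starts with "d1dlwa_" and whose sequence is the two sequence lines joined together.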