Supplementary Material for
SATCHMO-JS: A webserver for simultaneous protein multiple sequence alignment and tree construction
Raffi Hagopian, John R. Davidson, Ruchira S. Datta, Bushra Samad, Glen R. Jarvis and Kimmen Sjölander
Nucleic Acids Research Web Server Issue, 2010
Please visit the SATCHMO-JS webserver.
Introduction
Results on 983 structurally aligned pairs from the PREFAB benchmark dataset [1] show that SATCHMO-JS provides a statistically significant improvement in alignment accuracy over MUSCLE, MAFFT, ClustalW and the original SATCHMO algorithm (see Table 1).
Additional figures and data are provided below.
| | SATCHMO | MAFFT | MUSCLE | CLUSTALW |
|---|---|---|---|---|
| Q_Developer | 3.52E-13 | 1.14E-05 | 0.010 | 7.16E-57 |
| Q_Modeler | 8.66E-05 | 0.204 | 8.83E-06 | 7.42E-62 |
| Cline | 1.00E-10 | 0.157 | 0.007 | 1.05E-51 |
| Q_Combined | 4.30E-12 | 0.093 | 0.004 | 1.76E-56 |
Table 1 (above) shows the p-values for the significance of the improvement of SATCHMO-JS relative to the original SATCHMO, MAFFT, MUSCLE and ClustalW. P-values were computed on the dataset as a whole (all 983 reference pairs) using the Wilcoxon paired signed-rank test.
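As a rough illustration of how such a paired comparison works, the sketch below implements a two-sided Wilcoxon paired signed-rank test in plain Python using the normal approximation. The function name `wilcoxon_signed_rank` and the example scores are hypothetical. The source does not state which implementation or variant (exact or approximate) produced the p-values in the table, so this should not be read as the authors' procedure.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon paired signed-rank test (normal approximation).

    Returns (w_plus, p_value). Zero differences are dropped and tied
    absolute differences receive average ranks, with the usual tie
    correction applied to the variance.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0  # no informative pairs: nothing to reject

    # Rank the absolute differences, averaging ranks within tie groups.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    # W+ is the sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    var = n * (n + 1) * (2 * n + 1) / 24.0 - tie_term / 48.0
    z = (w_plus - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail area
    return w_plus, p_value


# Hypothetical per-pair scores for two methods on the same five reference pairs.
scores_a = [0.62, 0.71, 0.55, 0.90, 0.48]
scores_b = [0.58, 0.65, 0.57, 0.81, 0.40]
w_plus, p_value = wilcoxon_signed_rank(scores_a, scores_b)
print(f"W+ = {w_plus}, p = {p_value:.4f}")
```

For large samples such as the 983 pairs used here, the normal approximation is usually reasonable. For only a handful of pairs, an exact test would be preferable.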
Evaluation Score Results
Q_Developer
The Q_Developer score measures recall. It is described in Wang and Dunbrack [2].
Q_Developer = TP / (TP+FN)
TP = number of correctly aligned residue pairs in the sequence alignment (i.e., pairs that agree with the reference alignment).
FN = number of aligned residue pairs in the reference alignment that are not in the sequence alignment (i.e., pairs missed by the sequence alignment).
In other words, the Q_Developer score measures the fraction of the reference alignment that is correctly predicted by the sequence alignment, and is thus a measure of recall.
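The definition can be sketched in code as follows. This is a minimal illustration, assuming each pairwise alignment is given as two equal-length gapped strings with `-` as the only gap character. The helper names `aligned_pairs` and `q_developer` and the toy sequences are hypothetical, not part of the SATCHMO-JS software.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Return the set of (i, j) residue-index pairs aligned in two gapped rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    pairs = set()
    i = j = 0  # running residue indices in each ungapped sequence
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def q_developer(predicted, reference):
    """Recall: TP / (TP + FN), computed over sets of aligned residue pairs."""
    tp = len(predicted & reference)   # pairs present in both alignments
    fn = len(reference - predicted)   # reference pairs the prediction missed
    # Assumption: an empty reference alignment scores 0.0.
    return tp / (tp + fn) if reference else 0.0


# Toy example: the prediction misses one of the four reference pairs.
reference = aligned_pairs("ACDE", "ACDE")
predicted = aligned_pairs("ACD-E", "AC-DE")
print(q_developer(predicted, reference))
```

Representing each alignment as a set of index pairs makes TP and FN simple set operations, and the same representation works for the other pair-based scores.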
| % ID Bin | CLUSTALW | MUSCLE | MAFFT | SATCHMO-JS | SATCHMO |
|---|---|---|---|---|---|
| 0-5 | 0.000 | 0.050 | 0.012 | 0.023 | 0.052 |
| 5-10 | 0.085 | 0.100 | 0.099 | 0.170 | 0.140 |
| 10-15 | 0.257 | 0.325 | 0.330 | 0.371 | 0.378 |
| 15-20 | 0.361 | 0.483 | 0.499 | 0.565 | 0.566 |
| 20-25 | 0.504 | 0.672 | 0.640 | 0.692 | 0.638 |
| 25-30 | 0.647 | 0.771 | 0.761 | 0.787 | 0.734 |
| 30-35 | 0.712 | 0.834 | 0.817 | 0.857 | 0.784 |
| 35-40 | 0.739 | 0.853 | 0.846 | 0.900 | 0.843 |
| 40-100 | 0.918 | 0.927 | 0.944 | 0.966 | 0.923 |

Per-bin p-values for the Q_Developer improvement of SATCHMO-JS relative to each method:

| % ID Bin | CLUSTALW | MUSCLE | MAFFT | SATCHMO |
|---|---|---|---|---|
| 0-5 | 1 | 1 | 1 | 1 |
| 5-10 | 0.007334 | 0.02253 | 0.0267 | 0.06337 |
| 10-15 | 1.337e-07 | 0.0636 | 0.06515 | 0.872 |
| 15-20 | 3.342e-17 | 0.03935 | 0.03105 | 0.609 |
| 20-25 | 1.811e-16 | 0.9575 | 0.02304 | 7.863e-06 |
| 25-30 | 3.65e-11 | 0.4613 | 0.2609 | 5.686e-06 |
| 30-35 | 3.356e-05 | 0.9755 | 0.4267 | 8.368e-05 |
| 35-40 | 0.0001994 | 0.7154 | 0.6006 | 0.002895 |
| 40-100 | 0.0003571 | 0.001997 | 0.03125 | 1.936e-08 |
Q_Modeler
The Q_Modeler score measures precision. It is described in Wang and Dunbrack [2].
Q_Modeler = TP / (TP + FP)
TP = number of correctly aligned residue pairs in the sequence alignment (i.e., pairs that agree with the reference alignment).
FP = number of incorrectly aligned residue pairs in the sequence alignment (i.e., pairs in the sequence alignment that are not aligned in the reference).
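A minimal sketch of this precision calculation, assuming each alignment has already been reduced to a set of `(i, j)` residue-index pairs. The function name `q_modeler` and the example pairs are hypothetical and serve only to illustrate the formula.

```python
def q_modeler(predicted, reference):
    """Precision: TP / (TP + FP), computed over sets of aligned residue pairs."""
    tp = len(predicted & reference)   # predicted pairs confirmed by the reference
    fp = len(predicted - reference)   # predicted pairs absent from the reference
    # Assumption: a prediction that aligns nothing scores 0.0.
    return tp / (tp + fp) if predicted else 0.0


# Toy example: four predicted pairs, two of which appear in the reference.
reference = {(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)}
predicted = {(0, 0), (1, 1), (2, 3), (3, 4)}
print(q_modeler(predicted, reference))
```

Because the denominator counts only predicted pairs, a method can raise this score by aligning fewer residues, which is why it is reported alongside a recall measure.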
| % ID Bin | CLUSTALW | MUSCLE | MAFFT | SATCHMO-JS | SATCHMO |
|---|---|---|---|---|---|
| 0-5 | 0.000 | 0.013 | 0.006 | 0.006 | 0.020 |
| 5-10 | 0.055 | 0.066 | 0.082 | 0.116 | 0.097 |
| 10-15 | 0.226 | 0.276 | 0.312 | 0.319 | 0.338 |
| 15-20 | 0.335 | 0.444 | 0.495 | 0.527 | 0.543 |
| 20-25 | 0.497 | 0.639 | 0.640 | 0.658 | 0.642 |
| 25-30 | 0.651 | 0.754 | 0.772 | 0.774 | 0.750 |
| 30-35 | 0.686 | 0.796 | 0.801 | 0.806 | 0.783 |
| 35-40 | 0.691 | 0.805 | 0.803 | 0.834 | 0.813 |
| 40-100 | 0.858 | 0.867 | 0.885 | 0.901 | 0.876 |

Per-bin p-values for the Q_Modeler improvement of SATCHMO-JS relative to each method:

| % ID Bin | CLUSTALW | MUSCLE | MAFFT | SATCHMO |
|---|---|---|---|---|
| 0-5 | 1 | 1 | 1 | 1 |
| 5-10 | 0.003185 | 0.01801 | 0.2069 | 0.04501 |
| 10-15 | 6.606e-08 | 0.02383 | 0.5326 | 0.2551 |
| 15-20 | 3.02e-20 | 0.0009527 | 0.6601 | 0.7283 |
| 20-25 | 3.609e-17 | 0.4847 | 0.2169 | 0.009542 |
| 25-30 | 1.529e-10 | 0.6399 | 0.3065 | 0.003368 |
| 30-35 | 4.937e-05 | 0.7914 | 0.332 | 0.005784 |
| 35-40 | 0.0001482 | 0.8566 | 0.5382 | 0.1042 |
| 40-100 | 0.0002237 | 9.56e-06 | 0.03163 | 8.777e-07 |
Cline Shift
The Cline Shift score gives a small positive score to aligned positions that are close to the reference alignment, and a small penalty for overalignment. It is described in Cline et al. [3].
| % ID Bin | CLUSTALW | MUSCLE | MAFFT | SATCHMO-JS | SATCHMO |
|---|---|---|---|---|---|
| 0-5 | -0.016 | 0.033 | -0.009 | 0.001 | 0.001 |
| 5-10 | 0.030 | 0.035 | 0.060 | 0.113 | 0.074 |
| 10-15 | 0.235 | 0.287 | 0.314 | 0.332 | 0.353 |
| 15-20 | 0.355 | 0.465 | 0.502 | 0.549 | 0.556 |
| 20-25 | 0.511 | 0.659 | 0.644 | 0.678 | 0.644 |
| 25-30 | 0.673 | 0.775 | 0.780 | 0.795 | 0.752 |
| 30-35 | 0.712 | 0.823 | 0.816 | 0.840 | 0.791 |
| 35-40 | 0.727 | 0.830 | 0.830 | 0.878 | 0.831 |
| 40-100 | 0.887 | 0.897 | 0.912 | 0.930 | 0.898 |

Per-bin p-values for the Cline Shift improvement of SATCHMO-JS relative to each method:

| % ID Bin | CLUSTALW | MUSCLE | MAFFT | SATCHMO |
|---|---|---|---|---|
| 0-5 | 0.6875 | 0.8438 | 0.5625 | 0.8125 |
| 5-10 | 0.005889 | 0.02854 | 0.3122 | 0.06204 |
| 10-15 | 1.60E-05 | 0.174 | 0.9787 | 0.5443 |
| 15-20 | 1.96E-16 | 0.04465 | 0.5732 | 0.7483 |
| 20-25 | 1.28E-15 | 0.8633 | 0.4243 | 9.68E-05 |
| 25-30 | 1.44E-10 | 0.7866 | 0.5926 | 6.02E-06 |
| 30-35 | 2.86E-05 | 0.8849 | 0.672 | 0.0003387 |
| 35-40 | 0.0004037 | 0.7343 | 0.7518 | 0.00375 |
| 40-100 | 0.0005661 | 0.0006641 | 0.03723 | 4.45E-09 |
Q_Combined
The Q_Combined score penalizes both over- and under-alignment, and so reflects both precision and recall. It is described in Wang and Dunbrack [2].
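This page does not give a formula for Q_Combined. The sketch below assumes the common reading of the definition in [2]: correctly aligned pairs divided by all distinct pairs found in either alignment, i.e. TP / (TP + FP + FN). That formula, the function name `q_combined`, and the example pairs are assumptions made for illustration only.

```python
def q_combined(predicted, reference):
    """Assumed form: TP / (TP + FP + FN) over sets of aligned residue pairs."""
    tp = len(predicted & reference)      # pairs shared by both alignments
    union = len(predicted | reference)   # TP + FP + FN: every distinct pair
    # Assumption: two empty alignments score 0.0.
    return tp / union if union else 0.0


# Toy example: 2 shared pairs, 2 extra predicted pairs, 3 missed reference pairs.
reference = {(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)}
predicted = {(0, 0), (1, 1), (2, 3), (3, 4)}
print(q_combined(predicted, reference))
```

Under this form the score can never exceed either the precision or the recall of the same alignment, since both false positives and false negatives enlarge the denominator.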
| % ID Bin | CLUSTALW | MUSCLE | MAFFT | SATCHMO-JS | SATCHMO |
|---|---|---|---|---|---|
| 0-5 | 0.000 | 0.011 | 0.004 | 0.005 | 0.016 |
| 5-10 | 0.016 | 0.024 | 0.023 | 0.035 | 0.024 |
| 10-15 | 0.111 | 0.154 | 0.171 | 0.173 | 0.164 |
| 15-20 | 0.209 | 0.288 | 0.317 | 0.326 | 0.328 |
| 20-25 | 0.312 | 0.423 | 0.417 | 0.442 | 0.392 |
| 25-30 | 0.459 | 0.569 | 0.559 | 0.575 | 0.529 |
| 30-35 | 0.573 | 0.686 | 0.681 | 0.706 | 0.634 |
| 35-40 | 0.617 | 0.734 | 0.739 | 0.778 | 0.718 |
| 40-100 | 0.831 | 0.836 | 0.859 | 0.877 | 0.828 |

Per-bin p-values for the Q_Combined improvement of SATCHMO-JS relative to each method:

| % ID Bin | CLUSTALW | MUSCLE | MAFFT | SATCHMO |
|---|---|---|---|---|
| 0-5 | 1 | 1 | 1 | 1 |
| 5-10 | 0.185 | 0.3604 | 0.408 | 0.3575 |
| 10-15 | 5.243e-07 | 0.5067 | 0.8975 | 0.7249 |
| 15-20 | 6.631e-17 | 0.06909 | 0.9918 | 0.8895 |
| 20-25 | 4.484e-17 | 0.3209 | 0.2309 | 3.426e-05 |
| 25-30 | 4.683e-10 | 0.9803 | 0.3625 | 0.0004511 |
| 30-35 | 5.037e-05 | 0.626 | 0.5964 | 3.792e-05 |
| 35-40 | 0.0001537 | 0.8443 | 0.8508 | 0.004093 |
| 40-100 | 0.0002401 | 0.0003652 | 0.02154 | 2.856e-09 |
References cited:
- [1] Edgar, R., "MUSCLE: multiple sequence alignment with high accuracy and high throughput," Nucleic Acids Res. 2004 Mar 19;32(5):1792-7.
- [2] Wang, G. and Dunbrack, R.L. Jr., "Scoring profile-to-profile sequence alignments," Protein Sci. 2004 Jun;13(6):1612-26.
- [3] Cline, M., Hughey, R., Karplus, K., "Predicting reliable regions in protein sequence alignments," Bioinformatics. 2002 Feb;18(2):306-314.
Datasets Used in Validation Experiments and Other Downloads
- Download the datasets and the results of the experiments described in the paper (caution: files are large)
- Spreadsheet of SATCHMO-JS scores
- Figure 1.1
- Figure 1.2
- Figure 2
