SATCHMO-JS Alignment and Phylogenetic Tree Construction

Supplementary Material for

SATCHMO-JS: A webserver for simultaneous protein multiple sequence alignment and tree construction

Raffi Hagopian, John R. Davidson, Ruchira S. Datta, Bushra Samad, Glen R. Jarvis and Kimmen Sjölander
Nucleic Acids Research Web Server Issue, 2010

Please visit the SATCHMO-JS webserver.

Table of Contents

  1. Introduction
  2. Evaluation Scores
    1. Q_Developer
    2. Q_Modeler
    3. Cline Shift
    4. Q_Combined
  3. Datasets Used in Validation Experiments And Other Downloads

Introduction

Results on 983 structurally aligned pairs from the PREFAB benchmark dataset [1] show that SATCHMO-JS provides a statistically significant improvement in alignment accuracy over MUSCLE, MAFFT, ClustalW and the original SATCHMO algorithm (see Table 1).

Additional figures and data are provided below.


Table 1: SATCHMO-JS Performance Relative to Other Methods. ProbCons and MAFFT were run with 5 iterations of refinement; SATCHMO, SATCHMO-JS and T-Coffee used default parameters.
Score SATCHMO MAFFT MUSCLE CLUSTALW
Q_Developer 3.52E-13 1.14E-05 0.010 7.16E-57
Q_Modeler 8.66E-05 0.204 8.83E-06 7.42E-62
Cline 1.00E-10 0.157 0.007 1.05E-51
Q_Combined 4.30E-12 0.093 0.004 1.76E-56

The table above shows the p-values for the significance of the improvement of SATCHMO-JS relative to the original SATCHMO, MAFFT, MUSCLE and ClustalW. P-values were computed on the dataset as a whole (all 983 reference pairs) using Wilcoxon paired signed rank tests.
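As a rough illustration of the test named above, the following Python sketch computes an exact two-sided Wilcoxon paired signed rank p-value by enumerating every sign assignment. The function names, the two-sided alternative, the dropping of zero differences, and the toy scores are assumptions made for illustration; this page does not state which variant or software produced the values in Table 1.

```python
from itertools import product


def average_ranks(values):
    """Rank values in ascending order (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_signed_rank_exact(scores_a, scores_b):
    """Exact two-sided p-value for paired scores (assumed variant, small n only)."""
    # Assumption: pairs with identical scores are dropped before ranking.
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    center = sum(ranks) / 2  # expected W+ under the null hypothesis
    observed = abs(w_plus - center)
    extreme = 0
    # Under the null, each difference is equally likely to be positive or negative.
    for signs in product((False, True), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - center) >= observed - 1e-12:
            extreme += 1
    return extreme / 2 ** n


# Hypothetical per-pair scores for two methods on four reference pairs.
method_a = [0.9, 0.7, 0.8, 0.6]
method_b = [0.8, 0.5, 0.5, 0.2]
print(wilcoxon_signed_rank_exact(method_a, method_b))
```

Exhaustive enumeration doubles in cost with every added pair, so it is only practical for a handful of pairs. For a set the size of all 983 reference pairs, a normal approximation or a library routine such as `scipy.stats.wilcoxon` would be the realistic choice.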

Evaluation Score Results

Q_Developer

The Q_Developer score measures recall. It is described in Wang and Dunbrack [2].

Q_Developer = TP / (TP+FN)
TP = # of correctly aligned residue pairs in the sequence alignment (i.e., pairs that agree with the reference alignment).
FN = # of aligned residue pairs in the reference alignment which are not in the sequence alignment (i.e., pairs that are missed by the sequence alignment).

In other words, the Q_Developer score measures the fraction of the reference alignment that is correctly predicted by the sequence alignment, and is thus a measure of the recall.
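The definition above can be sketched in a few lines of Python. The sketch assumes each alignment is represented as a set of (i, j) pairs, meaning residue i of the first sequence is aligned to residue j of the second. That representation, the function name, and the toy data are illustrative assumptions, not the evaluation code used for the tables.

```python
def q_developer(predicted_pairs, reference_pairs):
    """Recall: fraction of reference residue pairs recovered by the prediction."""
    tp = len(predicted_pairs & reference_pairs)  # pairs present in both alignments
    fn = len(reference_pairs - predicted_pairs)  # reference pairs that were missed
    if tp + fn == 0:
        return 0.0  # assumption: an empty reference alignment scores 0
    return tp / (tp + fn)


# Toy data: the reference aligns three residue pairs, the prediction aligns four.
reference = {(0, 0), (2, 1), (3, 3)}
predicted = {(0, 0), (1, 1), (2, 2), (3, 3)}
print(q_developer(predicted, reference))
```

Here two of the three reference pairs are recovered, so the score is 2/3. The extra predicted pairs do not lower it, which is why recall alone cannot detect overalignment.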

Q-Developer Plot
Table 2: Q_Developer Score. ProbCons and MAFFT were run with 5 iterations of refinement; SATCHMO, SATCHMO-JS and T-Coffee used default parameters.
% ID Bin CLUSTALW MUSCLE MAFFT SATCHMO-JS SATCHMO
0-5 0.000 0.050 0.012 0.023 0.052
5-10 0.085 0.100 0.099 0.170 0.140
10-15 0.257 0.325 0.330 0.371 0.378
15-20 0.361 0.483 0.499 0.565 0.566
20-25 0.504 0.672 0.640 0.692 0.638
25-30 0.647 0.771 0.761 0.787 0.734
30-35 0.712 0.834 0.817 0.857 0.784
35-40 0.739 0.853 0.846 0.900 0.843
40-100 0.918 0.927 0.944 0.966 0.923

Table 3: P-values for Q_Developer Score. P-values were computed using Wilcoxon paired signed rank test.
% ID Bin CLUSTALW MUSCLE MAFFT SATCHMO
0-5 1 1 1 1
5-10 0.007334 0.02253 0.0267 0.06337
10-15 1.337e-07 0.0636 0.06515 0.872
15-20 3.342e-17 0.03935 0.03105 0.609
20-25 1.811e-16 0.9575 0.02304 7.863e-06
25-30 3.65e-11 0.4613 0.2609 5.686e-06
30-35 3.356e-05 0.9755 0.4267 8.368e-05
35-40 0.0001994 0.7154 0.6006 0.002895
40-100 0.0003571 0.001997 0.03125 1.936e-08

Q_Modeler

The Q_Modeler score measures precision. It is described in Wang and Dunbrack [2].

Q_Modeler = TP / (TP + FP)
TP = # of correctly aligned residue pairs in the sequence alignment (i.e., pairs that agree with the reference alignment).
FP = # of incorrectly aligned residue pairs in the sequence alignment (i.e., pairs in the sequence alignment that are not aligned in the reference).
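To make the precision calculation concrete, the sketch below first extracts aligned residue pairs from two gapped alignment rows and then applies the Q_Modeler formula. The helper `aligned_pairs`, the `-` gap character, and the toy sequences are assumptions for illustration only.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Return the set of (i, j) residue index pairs aligned in a pairwise alignment."""
    pairs = set()
    i = j = 0  # running residue indices, ignoring gap characters
    for char_a, char_b in zip(row_a, row_b):
        if char_a != gap and char_b != gap:
            pairs.add((i, j))
        if char_a != gap:
            i += 1
        if char_b != gap:
            j += 1
    return pairs


def q_modeler(predicted_pairs, reference_pairs):
    """Precision: fraction of predicted residue pairs that agree with the reference."""
    tp = len(predicted_pairs & reference_pairs)  # correctly aligned pairs
    fp = len(predicted_pairs - reference_pairs)  # predicted pairs absent from the reference
    if tp + fp == 0:
        return 0.0  # assumption: an empty predicted alignment scores 0
    return tp / (tp + fp)


# Toy example: the same two sequences (ACDE and ADFE) aligned two ways.
reference = aligned_pairs("ACD-E", "A-DFE")
predicted = aligned_pairs("ACDE", "ADFE")
print(q_modeler(predicted, reference))
```

The ungapped prediction aligns four pairs, of which only two agree with the reference, giving a Q_Modeler score of 0.5.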

Q-Modeler Plot
Table 4: Q_Modeler Score. ProbCons and MAFFT were run with 5 iterations of refinement; SATCHMO, SATCHMO-JS and T-Coffee used default parameters.
% ID Bin CLUSTALW MUSCLE MAFFT SATCHMO-JS SATCHMO
0-5 0.000 0.013 0.006 0.006 0.020
5-10 0.055 0.066 0.082 0.116 0.097
10-15 0.226 0.276 0.312 0.319 0.338
15-20 0.335 0.444 0.495 0.527 0.543
20-25 0.497 0.639 0.640 0.658 0.642
25-30 0.651 0.754 0.772 0.774 0.750
30-35 0.686 0.796 0.801 0.806 0.783
35-40 0.691 0.805 0.803 0.834 0.813
40-100 0.858 0.867 0.885 0.901 0.876

Table 5: P-values for Q_Modeler Score. P-values were computed using Wilcoxon paired signed rank test.
% ID Bin CLUSTALW MUSCLE MAFFT SATCHMO
0-5 1 1 1 1
5-10 0.003185 0.01801 0.2069 0.04501
10-15 6.606e-08 0.02383 0.5326 0.2551
15-20 3.02e-20 0.0009527 0.6601 0.7283
20-25 3.609e-17 0.4847 0.2169 0.009542
25-30 1.529e-10 0.6399 0.3065 0.003368
30-35 4.937e-05 0.7914 0.332 0.005784
35-40 0.0001482 0.8566 0.5382 0.1042
40-100 0.0002237 9.56e-06 0.03163 8.777e-07

Cline Shift

The Cline Shift score gives a small positive score for aligned residues that are close to their position in the reference alignment, and a small penalty for overalignment. It is described in Cline et al. [3].

Cline Shift Plot
Table 6: Cline Shift Score. ProbCons and MAFFT were run with 5 iterations of refinement; SATCHMO, SATCHMO-JS and T-Coffee used default parameters.
% ID Bin CLUSTALW MUSCLE MAFFT SATCHMO-JS SATCHMO
0-5 -0.016 0.033 -0.009 0.001 0.001
5-10 0.030 0.035 0.060 0.113 0.074
10-15 0.235 0.287 0.314 0.332 0.353
15-20 0.355 0.465 0.502 0.549 0.556
20-25 0.511 0.659 0.644 0.678 0.644
25-30 0.673 0.775 0.780 0.795 0.752
30-35 0.712 0.823 0.816 0.840 0.791
35-40 0.727 0.830 0.830 0.878 0.831
40-100 0.887 0.897 0.912 0.930 0.898

Table 7: P-values for Cline Shift Score. P-values were computed using Wilcoxon paired signed rank test.
% ID Bin CLUSTALW MUSCLE MAFFT SATCHMO
0-5 0.6875 0.8438 0.5625 0.8125
5-10 0.005889 0.02854 0.3122 0.06204
10-15 1.60E-05 0.174 0.9787 0.5443
15-20 1.96E-16 0.04465 0.5732 0.7483
20-25 1.28E-15 0.8633 0.4243 9.68E-05
25-30 1.44E-10 0.7866 0.5926 6.02E-06
30-35 2.86E-05 0.8849 0.672 0.0003387
35-40 0.0004037 0.7343 0.7518 0.00375
40-100 0.0005661 0.0006641 0.03723 4.45E-09

Q_Combined

The Q_Combined score penalizes over- and under-alignment, and measures both the precision and recall. It is described in Wang and Dunbrack [2].
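This page does not give the formula for Q_Combined. The sketch below assumes it is TP / (TP + FP + FN), that is, the correctly aligned pairs divided by all pairs appearing in either alignment. This is an assumed reading of [2] and should be checked against that paper. The set-of-pairs representation and the toy data are also illustrative assumptions.

```python
def q_combined(predicted_pairs, reference_pairs):
    """Assumed form: TP / (TP + FP + FN), i.e. shared pairs over the union of pairs."""
    tp = len(predicted_pairs & reference_pairs)
    union = len(predicted_pairs | reference_pairs)  # equals TP + FP + FN
    if union == 0:
        return 0.0  # assumption: two empty alignments score 0
    return tp / union


# Toy data: 2 shared pairs, 2 predicted-only pairs (FP), 1 reference-only pair (FN).
reference = {(0, 0), (2, 1), (3, 3)}
predicted = {(0, 0), (1, 1), (2, 2), (3, 3)}
print(q_combined(predicted, reference))
```

On this toy data the score is 2/5 = 0.4, lower than both the precision (0.5) and the recall (2/3) of the same alignment. Under the assumed formula, both missed pairs and spurious pairs enlarge the denominator, which matches the statement that the score penalizes over- and under-alignment.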

Q-Combined Score Plot
Table 8: Q_Combined Score. ProbCons and MAFFT were run with 5 iterations of refinement; SATCHMO, SATCHMO-JS and T-Coffee used default parameters.
% ID Bin CLUSTALW MUSCLE MAFFT SATCHMO-JS SATCHMO
0-5 0.000 0.011 0.004 0.005 0.016
5-10 0.016 0.024 0.023 0.035 0.024
10-15 0.111 0.154 0.171 0.173 0.164
15-20 0.209 0.288 0.317 0.326 0.328
20-25 0.312 0.423 0.417 0.442 0.392
25-30 0.459 0.569 0.559 0.575 0.529
30-35 0.573 0.686 0.681 0.706 0.634
35-40 0.617 0.734 0.739 0.778 0.718
40-100 0.831 0.836 0.859 0.877 0.828

Table 9: P-values for Q_Combined Score. P-values were computed using Wilcoxon paired signed rank test.
% ID Bin CLUSTALW MUSCLE MAFFT SATCHMO
0-5 1 1 1 1
5-10 0.185 0.3604 0.408 0.3575
10-15 5.243e-07 0.5067 0.8975 0.7249
15-20 6.631e-17 0.06909 0.9918 0.8895
20-25 4.484e-17 0.3209 0.2309 3.426e-05
25-30 4.683e-10 0.9803 0.3625 0.0004511
30-35 5.037e-05 0.626 0.5964 3.792e-05
35-40 0.0001537 0.8443 0.8508 0.004093
40-100 0.0002401 0.0003652 0.02154 2.856e-09

References cited:

  1. Edgar, R., "MUSCLE: multiple sequence alignment with high accuracy and high throughput," Nucleic Acids Res. 2004 Mar 19;32(5):1792-7.
  2. Wang, G. and Dunbrack, R.L. Jr., "Scoring profile-to-profile sequence alignments," Protein Sci. 2004 Jun;13(6):1612-26.
  3. Cline, M., Hughey, R. and Karplus, K., "Predicting reliable regions in protein sequence alignments," Bioinformatics. 2002 Feb;18(2):306-14.

Datasets Used in Validation Experiments And Other Downloads
