SATCHMO-JS Alignment and Phylogenetic Tree Construction

Details on SATCHMO and SATCHMO-JS

SATCHMO (Simultaneous Alignment and Tree Construction using Hidden Markov Models) uses HMM-HMM scoring and alignment to simultaneously estimate a phylogenetic tree and a multiple sequence alignment (MSA). Full details of the SATCHMO algorithm are available in [1].

SATCHMO uses a novel subtree-specific alignment masking procedure at each internal node of the tree to predict the conserved core structure for sequences descending from that node. The masked MSA is used to derive an HMM at that node, which is then used in all-vs-all HMM-HMM scoring to determine the branching order (and multiple alignment) from that point upwards to the root. Because of this subtree-masking protocol, SATCHMO HMMs and MSAs are typically shorter at the root of the SATCHMO tree than they are towards the leaves, reflecting the structural variability across the family as a whole. This is particularly true when sequences in a dataset are highly variable. The PhyloScope tree viewer (provided on the SATCHMO-JS webserver) allows users to interact with the SATCHMO tree and MSA and to view alignments at internal nodes of the tree.

SATCHMO all-vs-all HMM-HMM scoring and alignment is computationally expensive, limiting its applicability to relatively small datasets. The SATCHMO-JS algorithm addresses this using a jump-start protocol. We first align all the sequences in a dataset using the MAFFT algorithm [2]. The MAFFT MSA is submitted to QuickTree [3] to construct a Neighbor-Joining (NJ) tree. We then cut the NJ tree into subtrees such that no pair of sequences in a subtree has less than a pre-specified percent identity (the default is 35% pairwise identity). The MSAs for each subtree are then masked to remove columns composed entirely of gap characters, and used to jump-start SATCHMO. From this point on, we use the standard SATCHMO protocol, constructing an HMM for each subtree and using HMM-HMM scoring and alignment to determine the tree topology and MSA. Once a rooted tree has been produced, we submit the tree to the RAxML program [5] to optimize the tree edge lengths, keeping the SATCHMO tree topology fixed [4].
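The subtree criterion can be illustrated with a minimal Python sketch. It is not the SATCHMO-JS implementation: the function names are hypothetical, and the definition of percent identity used here (identical residues divided by the number of columns in which both sequences have a residue) is an assumption, since the text does not say which definition the server uses.

```python
from itertools import combinations

GAP_CHARS = "-."  # assumption: both characters denote gaps in the MSA


def percent_identity(a, b):
    """Percent identity between two rows of the same MSA.

    Assumption: only columns where both sequences have a residue are
    compared; other definitions of percent identity exist.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def subtree_passes(aligned_seqs, threshold=35.0):
    """Return True if every pair of sequences meets the identity threshold."""
    return all(
        percent_identity(a, b) >= threshold
        for a, b in combinations(aligned_seqs, 2)
    )
```

A candidate subtree would be accepted only if `subtree_passes` returns True for its aligned sequences; otherwise the tree would be cut further. The 35.0 default mirrors the threshold stated above.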

Details of the pipeline: The input to SATCHMO-JS is a set of unaligned protein sequences in FASTA format. Up to 300 sequences are allowed. The pipeline has six stages:

  1. Submitting the input dataset to MAFFT [2] to construct an initial MSA.
  2. Estimation of a Neighbor-Joining tree from the MAFFT MSA using QuickTree [3].
  3. Analysis of the MAFFT MSA and NJ tree to identify subtrees in which every pair of sequences has at least a pre-set percent identity (the program default is 35% identity).
  4. Masking of subtree MSAs to remove columns containing 100% gap characters.
  5. Submitting these subtree MSAs to the SATCHMO algorithm (i.e., jump-starting SATCHMO with a smaller number of inputs, so that the HMM-HMM scoring and alignment only needs to be performed from that point upwards to form a rooted tree and MSA).
  6. Optimizing the SATCHMO tree edge lengths using the RAxML software [5], keeping the SATCHMO tree topology fixed.
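Stage 4 is simple enough to sketch directly. The following Python fragment is an illustration only, not code from the SATCHMO-JS pipeline: the function name is hypothetical, and treating both `-` and `.` as gap characters is an assumption. Cutting a subtree out of the full MAFFT MSA can leave columns that contain only gaps for the remaining sequences, and this step drops them.

```python
def remove_all_gap_columns(msa, gap_chars="-."):
    """Drop every column of an MSA that consists entirely of gap characters.

    `msa` is a list of equal-length aligned sequences (strings). The row
    order is preserved, and columns with at least one residue are kept.
    """
    if not msa:
        return []
    width = len(msa[0])
    if any(len(row) != width for row in msa):
        raise ValueError("all rows of an MSA must have the same length")
    # Indices of columns in which at least one sequence has a residue.
    keep = [
        i for i in range(width)
        if any(row[i] not in gap_chars for row in msa)
    ]
    return ["".join(row[i] for i in keep) for row in msa]
```

For example, a two-sequence subtree MSA `["A-C-", "A-D-"]` would be reduced to `["AC", "AD"]`, while columns that are gapped in only some sequences are left untouched.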

The SATCHMO source code is available for download.
Supplementary Material

References cited:

  1. Edgar, R., and Sjölander, K., "SATCHMO: Sequence Alignment and Tree Construction using Hidden Markov models," Bioinformatics. 2003 Jul 22; 19(11):1404-11.
  2. Katoh K., Misawa K., Kuma K., and Miyata T., "MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform," Nucleic Acids Research, 2002, Vol. 30, No. 14 3059-3066.
  3. Howe K., Bateman A., and Durbin R., "QuickTree: building huge Neighbour-Joining trees of protein sequences," Bioinformatics, 2002, Vol. 18, No. 11 1546-1547.
  4. Hagopian, R., Davidson, J., Datta, R., Samad, B., Jarvis, G., and Sjölander, K., "SATCHMO-JS: a webserver for simultaneous protein multiple sequence alignment and phylogenetic tree construction," To appear in NAR Web Server Issue 2010.
  5. Stamatakis A., "RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models," Bioinformatics, 2006, Vol. 22, No. 21 2688-2690.