Wayne Christopher,
wayne@phylogenomics.berkeley.edu
Kimmen Sjolander,
kimmen@uclink.berkeley.edu
Berkeley Phylogenomics Group
University of California at Berkeley
Version 1.0, March 14, 2002
Warning: work in progress!
This document describes the operation of bete, an implementation of the Bayesan Evolutionary Tree Estimation algorithm presented in XX and XX. It also creates subfamily HMM's as described in XX.
bete takes as input a multiple sequence alignment in FASTA format and a Hidden Markov Model in SAM format, and produces a number of output files:
A full rooted tree is created, with the leaves being the input sequences. However, some of the internal nodes (forming a cross-section of the tree) are marked as subfamily nodes. All the descendant sequences of these nodes are considered to be grouped together in a biologically significant way.
The algorithm starts with all tree nodes at the root level, and then iteratively joins subtrees, until a single rooted tree is created. The order in which the nodes are joined, together with the criteria for determining what nodes are subtree nodes, is controlled using the options described below.
The output of bete can be viewed using the gtree viewer.
Bete is run as follows.
bete prefix options ...
All the output files are prefixed with prefix, and the default values for the input files are also taken from this string.
Read in the FASTA-format multiple sequence alignment and build a tree using the sequences found. The default value is prefix.fa.
Read in the SAM HMM model file and use it for constructing the subfamily HMM's and also for amino acid probabilities in some cases. The default value is prefix.mod. If this file doesn't exist, the SAM program modelfromalign will be automatically run using the specified MSA. If the -no_hmm option is given, no HMM will be used.
Use the specified Dirichlet density mixture file for computing probabilities. If this is not specified then the built-in recode4.20comp mixture from SAM version 3.3.1 is used. A mixture name of blocks9 can also be given which uses a built-in version of the Blocks9 mixture
Rather than computing the tree, read in the tree strucure from the specified Newick format file. Not yet implemented.
Normally if periods and lower case letters are seen in an alignment, bete skips that column and only uses it when writing out the final subfamily aligmnents. If this option is given it is treated as if it were uppercase, and periods are treated as if they were dashes. Note that either all characters in a column must be periods or lower case letters, corresponding to insert states in an HMM, or none can be.By default, periods are treated as dashes and lower case letters are treated as upper case.
Note that if the MSA has lower case letters or periods, modelfromalign may not produce a model of the correct length and bete will not run.
Only consider joining genes that have the given percentage of pairwise identity with each other. The agglomeration proceeds in two stages. First all clusters are turned into subtrees, and then the subtrees are joined together into the full tree.
Rather than building a tree for each cluster, keep all the clustered nodes as children of one node in a "star" topology. This saves some time when the precise structure of the nodes beneath a subfamily is not needed.
Rather than write out a full tree with subfamilies specially marked, write out a set of trees, one for each family. Most programs that read the Newick format can't handle more than one tree in a file, however.
If the input sequence has subfamilies already specifed, with % lines separating out the subfamilies, then bete will not recompute the tree topology. This option causes it to ignore these % lines in the input file.
At every stage of the tree construction, the pair of nodes with the smallest distance are selected for merging. There are a few ways that bete can compute the distance between two nodes. These include:
- tre: Total Relative Entropy - the average of the relative entropy of all the columns in the two subtrees. This is the default.
- affinity: the average affinity (see description below) of the two columns to each other. The distance is the negative of the affinity.
- pwid: the average pairwise identity of the columns in the two subtrees. The distance is the negative of the pairwise identity.
- logodds: for every sequence and position, the log-odds value of that residue (its emission probability from the HMM divided by its background probability) is computed. For an internal node the values are the weighted average of those of its children. The distance between two sequences is the Euclidean distance of the two log-odds vectors, divided by the sequence length including gaps. Note that alternate HMMs can be specified in the input .fa file for different sequences by adding a line of the form
%hmm file.mod before each group of sequences that share that model. The mod file must be an ASCII format SAM file.The default distance function is affinity.
- ecost: This option describes how subfamilies are identified in the tree.
- dist: When the distance between merged nodes goes below the specifed -cutoff value (or above in the case of TRE), the node that is created is marked as a subfamily root.
- ecost: The point in the agglomeration process where the encoding cost is the lowest is used to determine what the subfamilies are.
- entropy: The point in the agglomeration process where the entropy is the lowest is used to determine what the subfamilies are. This computes the sum, for every position and every subtree in the current partition, of the sum of logs of the unweighted counts of residues (for those that are not 0). This is less correct than encoding cost but it is faster and doesn't suffer from underflow, which the encoding cost does when there is a large collection of diverse sequences as the input.
If this option is given, and the -subfam value was set to dist, it specifies the point at which the subfamilies should be identified. When the distance goes above the specifed value (for TRE) or below (for affinity or pairwise identity) this is a signal for the grouping into subtrees to finish. The tree is still constructed all the way to a single root, but last point during the agglomeration phase where the distance between the child nodes satisfies the cutoff defines the final subfamily decomposition.Multiple -cutoff options may be given, in which case bete writes one subfamily decomposition for each value. These are stored in files prefix.subfam.value and prefix.sfreps.value for each value given by the user.
The default value is 2.5.
Rather than dividing the distance value between two sequences by the overlapping length of the sequences, the total value will be used instead. This will give more weight to sequences that overlap more.
If two subtrees have an overlap of less than ncols (i.e. the number of columns where both are non-gapped is less than that) they will not be considered for joining. The default is 10.
At each tree node, the children are given a weight which reflects the number of independent observations represented by the descendant sequences. This option specifies a maximum value for these weights. The default is 2.5.
When the encoding cost is being used for determining when to define subfamilies (i.e. no -cutoff value is given), normally the null hypothesis is first computed, which is that there is just one subfamily including all the sequences, and the savings over this null hypothesis is printed. Normally the savings starts below zero, grows to some positive value, and then drops back to zero. If this option is given then the null hypothesis cost is not computed and the absolute cost is reported.
When computing subfamily profiles for the creation of subfamily HMM's, do not consider the contributions of other subfamilies that have an affinity for the current subfamily at each position.
Normally when Dirichlet mixture posteriors are generated for subfamily HMM's, only the weights are recomputed from the mixture prior. If this option is given then the amino acid weights for each component are recomputed also.
The subfamily probabilities are a combination of the values derived from the counts in the current subfamily, and the values derived from related subfamilies. This option specifies an upper limit on the contribution of the other subfamilies. The default is 3.0.
Print a brief usage message.
This controls how verbose the program is. The values are:
- 0 Silent: no debugging messages are printed
- 1 Normal: a few helpful messages are printed about the progress
- 2 Verbose: lots of stuff is printed
- 3 GUI: this is a mode used by the GUI when running the program as a subprocess. Currently unimplemented.
When writing a gtree file, add information at each node giving the weight, the weighted counts for each position, the profiles (probabilities of amino acids) for each position, and the per-position affinity, TRE, and pairwise identity vectors for the children. For a large problem this data can take a lot of space (500 sequences with 500 resides can take up to 700 MB) so this is not written out by default.
Don't print profile files for every subfamily.
Don't print subfamily HMM files for every subfamily.
Rather than writing subfamily HMM's at every subfamily node, write a subtree HMM at every node with 2 or more children.
If this option is given then the input subfamilies are read in, and then the pairwise distance metric is printed to standard output and no other computation is done.
If this option is given then one input subfamilies is read in, and the encoding cost of this subtree is printed to standard output and no other computation is done.
Print the set of background probabilities associated with the Dirichlet mixture.
When reading in the sequences, the names that appear after the ≶ on the first line can be rather long. Often these names are lists of names separated by | vertical bars. If this option is given only the pos'th component is used for the name.
After the program is finished, the alignment is printed to the standard output in tree order.
This option is used with the -print_align option to determine the width of the page. The default is 80 columns.
Rather than building a tree, score each sequence against the input HMM. Not yet fully implemented
Don't try to read in an HMM and don't use it for probability estimation.
Use this many threads when performing the computation. Multithreading is currently used while determining the distance between pairs of nodes. If this option is not given, the program will use one thread per CPU.
The following output files are created.
For each subfamily, a subfamily HMM model file.
For each subfamily, list of the amino acid probabilies at each position.
A multiple sequence alignment file, in tree order, with lines beginning with % before each subfamily giving the name of that subfamily.
A multiple sequence alignment file, in tree order, with a sequence for each subfamily that is the consensus for that subfamily. The sequence names are the subfamily names.
The tree structure in Newick format.
The tree structure in Gtree format, with lots of extra data.
NOTE: this format is not currently being written. See the gtree format above.The tree structure in extended Newick format. This format is just like the regular Newick file format, except that at each node, where there would normally be an optional :DISTANCE value giving a floating point distance for that node, there can be a list of attributes given as
:[name1=value1, name2=value2, ...]
The attributes written by bete are:
- name=Nn : this is the internally-generated name for the internal node.
- subfamily=val : 1 if this node is a subfamily root node, 0 otherwise.
- dist=val : the distance from this node to its parent, using the selected metric.
- affinity_vector=val val ... : for every position in the sequence, the affinity value between the two subnodes.
- tre_vector=val val ... : for every position in the sequence, the total relative entropy value between the two subnodes.
- pwid_vector=val val ... : for every position in the sequence, the pairwise identity between the two subnodes.
- weight=val : ..
- weighted_counts=val : ..
- profiles=val : ..
Currently the extended format can be used only by the Gtree viewer.
A graph of cost vs iteration. This can be plotted using gnuplot using the command plot "prefix.cost"
A graph of node distance vs iteration.
Some discussion here ...