UC Bete User's Manual

Wayne Christopher, wayne@phylogenomics.berkeley.edu
Kimmen Sjolander, kimmen@uclink.berkeley.edu
Berkeley Phylogenomics Group
University of California at Berkeley
Version 1.0, March 14, 2002

Warning: work in progress!

This document describes the operation of bete, an implementation of the Bayesan Evolutionary Tree Estimation algorithm presented in XX and XX. It also creates subfamily HMM's as described in XX.

Overview

bete takes as input a multiple sequence alignment in FASTA format and a Hidden Markov Model in SAM format, and produces a number of output files:

A full rooted tree is created, with the leaves being the input sequences. However, some of the internal nodes (forming a cross-section of the tree) are marked as subfamily nodes. All the descendant sequences of these nodes are considered to be grouped together in a biologically significant way.

The algorithm starts with all tree nodes at the root level, and then iteratively joins subtrees, until a single rooted tree is created. The order in which the nodes are joined, together with the criteria for determining what nodes are subtree nodes, is controlled using the options described below.

The output of bete can be viewed using the gtree viewer.

Command line options

Bete is run as follows.

bete prefix options ...

All the output files are prefixed with prefix, and the default values for the input files are also taken from this string.

Input file options

-i msa_file
Read in the FASTA-format multiple sequence alignment and build a tree using the sequences found. The default value is prefix.fa.
-hmm sam_file
Read in the SAM HMM model file and use it for constructing the subfamily HMM's and also for amino acid probabilities in some cases. The default value is prefix.mod. If this file doesn't exist, the SAM program modelfromalign will be automatically run using the specified MSA. If the -no_hmm option is given, no HMM will be used.
-m mixture_file
Use the specified Dirichlet density mixture file for computing probabilities. If this is not specified then the built-in recode4.20comp mixture from SAM version 3.3.1 is used. A mixture name of blocks9 can also be given which uses a built-in version of the Blocks9 mixture
-in_tree tree_file
Rather than computing the tree, read in the tree strucure from the specified Newick format file. Not yet implemented.
-no_skip_inserts
Normally if periods and lower case letters are seen in an alignment, bete skips that column and only uses it when writing out the final subfamily aligmnents. If this option is given it is treated as if it were uppercase, and periods are treated as if they were dashes. Note that either all characters in a column must be periods or lower case letters, corresponding to insert states in an HMM, or none can be.

By default, periods are treated as dashes and lower case letters are treated as upper case.

Note that if the MSA has lower case letters or periods, modelfromalign may not produce a model of the correct length and bete will not run.

Tree construction options

-cluster pwid
Only consider joining genes that have the given percentage of pairwise identity with each other. The agglomeration proceeds in two stages. First all clusters are turned into subtrees, and then the subtrees are joined together into the full tree.
-cluster_star
Rather than building a tree for each cluster, keep all the clustered nodes as children of one node in a "star" topology. This saves some time when the precise structure of the nodes beneath a subfamily is not needed.
-toplevel_forest
Rather than write out a full tree with subfamilies specially marked, write out a set of trees, one for each family. Most programs that read the Newick format can't handle more than one tree in a file, however.
-no_tree
If the input sequence has subfamilies already specifed, with % lines separating out the subfamilies, then bete will not recompute the tree topology. This option causes it to ignore these % lines in the input file.

Distance metric options

-dist func
At every stage of the tree construction, the pair of nodes with the smallest distance are selected for merging. There are a few ways that bete can compute the distance between two nodes. These include:

The default distance function is affinity.

-subfam func
-cutoff value
If this option is given, and the -subfam value was set to dist, it specifies the point at which the subfamilies should be identified. When the distance goes above the specifed value (for TRE) or below (for affinity or pairwise identity) this is a signal for the grouping into subtrees to finish. The tree is still constructed all the way to a single root, but last point during the agglomeration phase where the distance between the child nodes satisfies the cutoff defines the final subfamily decomposition.

Multiple -cutoff options may be given, in which case bete writes one subfamily decomposition for each value. These are stored in files prefix.subfam.value and prefix.sfreps.value for each value given by the user.

The default value is 2.5.

-total
Rather than dividing the distance value between two sequences by the overlapping length of the sequences, the total value will be used instead. This will give more weight to sequences that overlap more.
-min_overlap ncols
If two subtrees have an overlap of less than ncols (i.e. the number of columns where both are non-gapped is less than that) they will not be considered for joining. The default is 10.
-max_indep_counts val
At each tree node, the children are given a weight which reflects the number of independent observations represented by the descendant sequences. This option specifies a maximum value for these weights. The default is 2.5.
-abs_ecost
When the encoding cost is being used for determining when to define subfamilies (i.e. no -cutoff value is given), normally the null hypothesis is first computed, which is that there is just one subfamily including all the sequences, and the savings over this null hypothesis is printed. Normally the savings starts below zero, grows to some positive value, and then drops back to zero. If this option is given then the null hypothesis cost is not computed and the absolute cost is reported.

Subfamily HMM options

-only_subfam_profile
When computing subfamily profiles for the creation of subfamily HMM's, do not consider the contributions of other subfamilies that have an affinity for the current subfamily at each position.
-change_subfam_alphas
Normally when Dirichlet mixture posteriors are generated for subfamily HMM's, only the weights are recomputed from the mixture prior. If this option is given then the amino acid weights for each component are recomputed also.
-max_subfam_contrib val
The subfamily probabilities are a combination of the values derived from the counts in the current subfamily, and the values derived from related subfamilies. This option specifies an upper limit on the contribution of the other subfamilies. The default is 3.0.

Miscellaneous options

-help or --help
Print a brief usage message.
-d debuglevel
This controls how verbose the program is. The values are:
-write_node_data
When writing a gtree file, add information at each node giving the weight, the weighted counts for each position, the profiles (probabilities of amino acids) for each position, and the per-position affinity, TRE, and pairwise identity vectors for the children. For a large problem this data can take a lot of space (500 sequences with 500 resides can take up to 700 MB) so this is not written out by default.
-no_profiles
Don't print profile files for every subfamily.
-no_shmms
Don't print subfamily HMM files for every subfamily.
-tree_hmms
Rather than writing subfamily HMM's at every subfamily node, write a subtree HMM at every node with 2 or more children.
-measure
If this option is given then the input subfamilies are read in, and then the pairwise distance metric is printed to standard output and no other computation is done.
-measure_ec
If this option is given then one input subfamilies is read in, and the encoding cost of this subtree is printed to standard output and no other computation is done.
-print_bg_probs
Print the set of background probabilities associated with the Dirichlet mixture.
-name_segment pos
When reading in the sequences, the names that appear after the ≶ on the first line can be rather long. Often these names are lists of names separated by | vertical bars. If this option is given only the pos'th component is used for the name.
-print_align
After the program is finished, the alignment is printed to the standard output in tree order.
-width ncols
This option is used with the -print_align option to determine the width of the page. The default is 80 columns.
-score
Rather than building a tree, score each sequence against the input HMM. Not yet fully implemented
-no_hmm
Don't try to read in an HMM and don't use it for probability estimation.
-nthreads num
Use this many threads when performing the computation. Multithreading is currently used while determining the distance between pairs of nodes. If this option is not given, the program will use one thread per CPU.

Output files

The following output files are created.

prefix.subfamily.mod
For each subfamily, a subfamily HMM model file.
prefix.subfamily.profile
For each subfamily, list of the amino acid probabilies at each position.
prefix..subfam
A multiple sequence alignment file, in tree order, with lines beginning with % before each subfamily giving the name of that subfamily.
prefix..sfreps
A multiple sequence alignment file, in tree order, with a sequence for each subfamily that is the consensus for that subfamily. The sequence names are the subfamily names.
prefix..tree
The tree structure in Newick format.
prefix..gtree
The tree structure in Gtree format, with lots of extra data.
prefix..xtree
NOTE: this format is not currently being written. See the gtree format above.

The tree structure in extended Newick format. This format is just like the regular Newick file format, except that at each node, where there would normally be an optional :DISTANCE value giving a floating point distance for that node, there can be a list of attributes given as

:[name1=value1, name2=value2, ...]

The attributes written by bete are:

Currently the extended format can be used only by the Gtree viewer.

prefix..cost
A graph of cost vs iteration. This can be plotted using gnuplot using the command plot "prefix.cost"
prefix..dist
A graph of node distance vs iteration.

Algorithms

Some discussion here ...