Wayne Christopher,
wayne@phylogenomics.berkeley.edu
Berkeley Phylogenomics Group
University of California at Berkeley
Version 1.0, April 14, 2002
Warning: work in progress!
This document describes the operation of gtree, a viewer for tree structure, multiple sequence alignments, and properties of genes. It is designed to work together with bete, which creates a tree structure for genes based on a multiple sequence alignment.
The preferred input for gtree is a multiple sequence alignment in FASTA format, and a tree file with the suffix .gtree. By default bete writes out a .gtree file. In addition, gtree looks for properties of the gene such as the source organism, protein function, etc. These properties will be automatically downloaded from SwissProt or GenBank as needed and will be cached in files in the directory .gtree/cache in the user's home directory.
If only an alignment file is available (suffix .fa or .a2m) then gtree will not construct a tree but simply display the alignment.
gtree takes the prefix for the input files as its first argument. That is, if it is run as
gtree prefix
it will read the files prefix.fa and prefix.gtree.
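As a rough illustration of the file lookup described above, the following Python sketch maps a prefix to an alignment file and a tree file. It is not gtree's own code. The function name resolve_inputs and the exists parameter are hypothetical, and trying the .a2m suffix as a fallback for the prefix form is an assumption based on the suffixes mentioned earlier.

```python
import os


def resolve_inputs(prefix, exists=os.path.exists):
    """Return (alignment_path, tree_path) for a prefix; either may be None.

    Assumption: .fa is tried before .a2m. The manual only states that
    prefix.fa and prefix.gtree are read.
    """
    alignment = None
    for suffix in (".fa", ".a2m"):
        if exists(prefix + suffix):
            alignment = prefix + suffix
            break
    tree_path = prefix + ".gtree"
    tree = tree_path if exists(tree_path) else None
    return alignment, tree
```

When the tree path comes back as None, a viewer following the description above would display only the alignment.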
Optional arguments may follow:
Unless this option is given, gtree will try to look up properties of genes in the tree if they are not found in the ~/.gtree/cache directory. Currently this requires Python version 2 and the BioPython utilities to be available.
The gtree viewing window is divided into three areas.
The three areas are independently scrollable in the horizontal direction but are all controlled in the vertical direction by the scrollbar to the right of the tree view.
Clicking on a tree node (either the dot at the root of the subtree, or the name) with the left mouse button will select that tree node. Selected tree nodes are drawn in a separate color, as are the sequences that correspond to them. If you select more than one tree node, each selected node and its descendants will be changed to a different color. If you double-click on any node this will first unselect all other nodes, and then select the one you picked. The menu selection Tree / Clear selection will unselect all nodes.
The currently selected nodes are used to determine how Sequence plots are drawn.
Many of the operations on specific tree nodes are accessible via clicking with the right mouse button on a node in the tree (the dot at the root of the subtree, or the name). These operations are as follows. Some operations are available only for internal nodes and some only for leaf nodes.
Collapse the node down into a single consensus sequence. The node in the tree is drawn as a larger dot. If the subtree is already collapsed this function will expand it again. This function is also available via clicking with the middle mouse button on the node.
Below is an example of a tree with some collapsed subtrees.
Open up all the internal nodes below the selected one.
This function collapses the clicked-on subtree, and at the same time opens up a new instance of the tree viewer that holds just this subtree. The different instances can be selected using the tabs at the top of the window.
This function closes all nodes that are not ancestors or descendants of the clicked-on one.
Change the name of a node.
If bete was run with the -write_node_data option, each node will have information in the .gtree file giving the weighted counts of all amino acids at all positions. This data is displayed in a matrix.
Similar to the above weighted count information, if bete was run with the -write_node_data option, each node will also contain profile data, which gives the probability of finding each amino acid at each position. This is computed using Dirichlet mixtures and hidden Markov models (further described in the bete documentation). This data is displayed in a matrix, as shown below. The weighted count matrix is similar.
Edit the properties of the clicked-on node. If the node is a subtree, and some properties are not filled in (as will be the case at the beginning), gtree tries to come up with a consensus substring for all the property values of the descendant leaf nodes. This is defined (approximately) as the most common substring of length 16 or more.
If the Show alternates button is pressed then the entries that don't contain the consensus substring are searched for consensus substrings of their own, and the process is repeated until there are no more entries left. All the alternate substrings are shown with the number of sequences that contain them in brackets to the left. Each sequence falls into only one group, however.
If the Full text button is pressed then all the values for all sequences in the subtree are listed, with the number of sequences that contain each string in brackets to the left. Unlike the Show alternates button, no consensus calculation is done: the full text of the field is used.
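The consensus and Show alternates behavior described above can be approximated with the Python sketch below. The manual itself calls the definition approximate, so this is only one plausible reading: the function names are hypothetical, ties are broken in favor of the longest substring, and a value shorter than the minimum length is treated as its own only candidate (both are assumptions).

```python
from collections import Counter


def candidate_substrings(value, min_len):
    """All substrings of value with at least min_len characters.

    Assumption: a value shorter than min_len is its own only candidate.
    """
    if len(value) < min_len:
        return {value}
    return {
        value[i:j]
        for i in range(len(value) - min_len + 1)
        for j in range(i + min_len, len(value) + 1)
    }


def consensus_substring(values, min_len=16):
    """Substring (length >= min_len) contained in the most values."""
    counts = Counter()
    for value in values:
        # A set is used so each substring counts once per value.
        counts.update(candidate_substrings(value, min_len))
    if not counts:
        return None
    # Most values first, then longest, then alphabetical for determinism.
    return min(counts, key=lambda s: (-counts[s], -len(s), s))


def alternate_groups(values, min_len=16):
    """Repeat the consensus search on the values not yet covered.

    Returns (count, substring) pairs; each value lands in exactly one group.
    """
    groups = []
    remaining = list(values)
    while remaining:
        consensus = consensus_substring(remaining, min_len)
        members = [v for v in remaining if consensus in v]
        groups.append((len(members), consensus))
        remaining = [v for v in remaining if consensus not in v]
    return groups
```

For example, with a shortened minimum length of 6, the made-up values "serine protease precursor", "serine protease", "putative serine protease", and "kinase" produce the groups (3, "serine protease") and (1, "kinase"). Enumerating every substring is quadratic in the length of each value, which is acceptable for short property strings but is not meant as an efficient implementation.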
The .gtree file can have alternate alignments at different places in the tree. These places are denoted by a red circle instead of a black one for the node (if the node is also a subtree a purple circle is used). Selecting Use alignment at one of these locations will cause the alignment section of the viewer to be updated with the new data. Rows that are not descendants of the selected node will be left blank.
If the name contains the strings gi| or sp|, this sub-menu will contain entries to look up the sequence in GenBank or SwissProt respectively. The user's web browser will be sent to the appropriate page.
By default the distances in the horizontal direction between two nodes equal the distance computed by bete for those two nodes, using affinity, total relative entropy, or pairwise identity, whichever was selected using the -dist option. Since these distances may make it hard to see the tree, two options are available to change the display.
The middle pane shows the multiple sequence alignment corresponding to the tree. It is divided horizontally by blue lines into the separate subfamilies that were identified by bete.
If a specific node has been collapsed the consensus sequence for that subtree is displayed. If the most common residue shows up in more than 90% of all the sequences that do not have gaps at that position, it is displayed in upper case. Otherwise it is displayed in lower case. If there are no residues at that position in the subtree, a dash is displayed.
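The upper case / lower case rule for collapsed nodes can be written out as a short Python sketch. The function names are hypothetical, and treating both "-" and "." as gaps and ignoring the case of the input residues are assumptions, not statements about gtree's internals.

```python
from collections import Counter

GAP_CHARS = "-."  # assumption: both characters count as gaps


def consensus_char(column, threshold=0.9):
    """Consensus character for one alignment column of a subtree."""
    residues = [c.upper() for c in column if c not in GAP_CHARS]
    if not residues:
        return "-"  # no residues at this position in the subtree
    residue, count = Counter(residues).most_common(1)[0]
    # Upper case only when the residue is in MORE than 90% of non-gap rows.
    if count / len(residues) > threshold:
        return residue
    return residue.lower()


def consensus_sequence(rows, threshold=0.9):
    """Consensus for a list of equal-length aligned sequences."""
    return "".join(consensus_char(col, threshold) for col in zip(*rows))
```

With the four made-up rows "ACD-", "ACD-", "AC--", and "GCF-", consensus_sequence returns "aCd-": the first column is only 75% A, the second is entirely C, the third is two-thirds D among its non-gap rows, and the fourth has no residues.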
Columns are colored according to their degree of conservation. Dark blue indicates perfect conservation, light blue 80% conservation, and gray 40% conservation.
As the mouse is moved across the columns, the column number is displayed in red below the scale bar in the upper left corner.
Directly above the sequence alignment area is the sequence plot. Depending on what is selected in the sequence plot menu this area shows the following types of data.
The consensus sequence for the selected nodes, or the whole tree if none are selected, is shown. When there is more than one residue at a given position the less-common ones are shown above the consensus. The residues are colored as follows.
The affinity between the first two selected subtrees, which are colored red and green in the tree and sequence plot, is displayed as a histogram.
The affinity is defined in the bete documentation. If fewer than two subtrees are selected then no affinity is computed, and if more are selected then the third and successive ones are ignored.
As the mouse is moved across successive columns the value of the affinity is displayed in red after the column number beneath the scale bar.
The affinity plot is shown below.
The total relative entropy between the two selected subtrees is plotted. Otherwise this option is similar to affinity.
The pairwise identity between the two selected subtrees is plotted. Otherwise this option is similar to affinity.
For every selected subtree, the percentage conservation is computed. This is the count of the most common residue divided by the number of non-gap sequences. The average of all these values across all the selected subtrees is plotted for each position. A high value means that within each subfamily, this position is highly conserved, even though it can vary among subfamilies.
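The conservation calculation just described can be sketched in Python as follows. The function names are hypothetical, and scoring a position where a subtree has only gaps as 0 is an assumption, since the text does not cover that case.

```python
from collections import Counter


def percent_conservation(column):
    """Count of the most common residue over the number of non-gap rows."""
    residues = [c for c in column if c not in "-."]
    if not residues:
        return 0.0  # assumption: an all-gap position scores zero
    top_count = Counter(residues).most_common(1)[0][1]
    return 100.0 * top_count / len(residues)


def conservation_plot(subtrees):
    """Average per-subtree conservation at each alignment position.

    subtrees is a list of subtrees, each a list of equal-length aligned rows.
    """
    n_cols = len(subtrees[0][0])
    plot = []
    for pos in range(n_cols):
        values = [
            percent_conservation([row[pos] for row in rows])
            for rows in subtrees
        ]
        plot.append(sum(values) / len(values))
    return plot
```

For two made-up subtrees ["AC", "AC"] and ["GC", "GT"], the result is [100.0, 75.0]. The first position scores 100 even though the subtrees disagree (A versus G), because each subtree is perfectly conserved internally, which is exactly the situation the text describes.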
The encoding cost for the selected sequences at each position is plotted. This is computed by bete.
The following commands are under the File menu.
Write out the current data with modifications as a .gtree file, and optionally write the alignment to a .fa file. Note that if you have turned off Show inserts, these columns will not be written to the alignment file.
Close the current sub-tree panel. If the current panel is Main then this exits the program.
Exit from the program.
The following commands are under the Tree menu.
Every subfamily in the tree (the blue or purple nodes) is collapsed to its consensus sequence.
All collapsed or closed nodes are opened.
Every node name that contains a given "glob" pattern (where * matches any set of characters, ? matches one character, and [] encloses sets or ranges of characters any one of which will match) will be displayed with a yellow background.
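The glob syntax described here is the same as that of Python's standard fnmatch module, so the matching can be illustrated with the sketch below. The function name is hypothetical. Because fnmatch matches the whole string, the pattern is wrapped in "*" on both sides to get the "contains" behavior; that wrapping is an assumption about how gtree applies the pattern.

```python
from fnmatch import fnmatchcase


def names_to_highlight(names, pattern):
    """Return the node names that contain a match for the glob pattern."""
    wrapped = "*" + pattern + "*"
    return [name for name in names if fnmatchcase(name, wrapped)]
```

Given the names "gi|113481|sp|P15228|AEP_MESMA", "sp|X00001|DEMO_HUMAN" (a made-up name), and "node_12", the pattern sp|[PX] selects the first two, node_?? selects the third, and _H?MAN selects only the second.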
The set of selected nodes is defined to be all the subfamily nodes.
All selected nodes are unselected.
The currently selected nodes and all their descendants are deleted from the display. If you Save the tree and the associated alignment, these nodes will not be written.
The user is prompted to select a cutoff value, or set of cutoffs, and the subfamily nodes are reassigned using the algorithm used by bete. That is, nodes where the cost value is less than the cutoff and the parent's cost is greater are considered subfamilies.
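The cutoff rule can be illustrated with a small Python sketch. This is not bete's implementation: the function name and the dictionary-based tree representation are hypothetical, and treating the root as if its parent had infinite cost (so that the root itself can qualify) is an assumption the text does not address.

```python
def subfamily_nodes(cost, parent, cutoff):
    """Nodes whose cost is below the cutoff while the parent's is above it.

    cost maps node name -> cost value; parent maps node name -> parent name
    (the root has no entry in parent).
    """
    selected = []
    for node, node_cost in cost.items():
        if node in parent:
            parent_cost = cost[parent[node]]
        else:
            parent_cost = float("inf")  # assumption: the root may qualify
        if node_cost < cutoff and parent_cost > cutoff:
            selected.append(node)
    return sorted(selected)
```

With made-up costs root=10, a=6, b=3, a1=2, a2=1, where a and b are children of root and a1 and a2 are children of a, a cutoff of 5 selects a1, a2, and b, while a cutoff of 8 selects a and b. A set of cutoffs could be handled by calling the function once per value.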
The nodes in the tree can be displayed in various ways. For most of the options it is assumed that the full name is a sequence of name / value pairs separated by vertical bars, ending with a common name, such as
>gi|113481|sp|P15228|AEP_MESMA
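A name in this form can be split into its parts with a few lines of Python. The function name is hypothetical and the sketch assumes the name is well formed, that is, an even number of bar-separated fields followed by the common name.

```python
def parse_node_name(full_name):
    """Split a FASTA-style name into ({name: value}, common_name)."""
    fields = full_name.lstrip(">").split("|")
    common_name = fields[-1]
    # Pair up the fields before the common name: name, value, name, value, ...
    pairs = dict(zip(fields[0:-1:2], fields[1:-1:2]))
    return pairs, common_name
```

For the example name above, this yields the pairs gi=113481 and sp=P15228 and the common name AEP_MESMA. A name without any vertical bars comes back with no pairs and the whole string as the common name.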
The different options for node display are
The following commands are under the Align menu.
If this option is selected, insert positions (those with lower case letters or dots) will be displayed, but not used for affinity or other calculations. If it's off then they won't be shown.
If this option is selected, rather than coloring a whole column according to the conservation of all the residues displayed, different subfamily sections of the column will be colored differently according to the conservation within that subfamily.
The following commands are under the Properties menu.
The following form is presented which allows the title, visibility, width in characters, and position of the various columns available in the property window to be modified.
The following commands are under the Graph menu.
This is the encoding cost as a function of tree-building steps done by bete.
This is the distance between merged nodes as a function of tree-building steps done by bete.
A log-odds plot is shown giving the average for all the nodes underneath each subfamily, as a function of position. If you click on a graph or legend entry with the left button it will toggle it on or off. If you click with the right button it will highlight the subfamily in the tree.
The current sequence plot is written out to a file, as X-Y data, one point per line. If the current plot type is Consensus the Y value is the residue name, and otherwise it is the value in the plot.
Write out the current sequence plot to a file in an X-Y format.
Load an X-Y file and plot it on the screen. If multiple files of this type are loaded they are all written to one window.
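To make the X-Y format concrete, here is a Python sketch that writes and reads one point per line. The exact layout gtree uses is not specified in this manual, so the single-space separator, the function names, and the rule of keeping non-numeric Y values (such as residue names from a Consensus plot) as strings are all assumptions.

```python
import io


def write_xy(points, stream):
    """Write (x, y) points to a text stream, one point per line."""
    for x, y in points:
        stream.write(f"{x} {y}\n")


def read_xy(stream):
    """Read points back; Y stays a string when it is not a number."""
    points = []
    for line in stream:
        if not line.strip():
            continue  # skip blank lines
        x_text, y_text = line.split(None, 1)
        y_text = y_text.strip()
        try:
            y = float(y_text)
        except ValueError:
            y = y_text  # e.g. a residue name from a Consensus plot
        points.append((int(x_text), y))
    return points


# Round trip through an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_xy([(1, 0.5), (2, 0.25)], buffer)
buffer.seek(0)
affinity_points = read_xy(buffer)
```

The same pair of functions works with files opened in text mode in place of the in-memory buffer.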
A matrix is created between all pairs of the currently selected set of subtrees. The value displayed is the average for all positions of the current sequence plot type. For affinity, TRE, and pairwise identity, these are the values that would be used to determine the order in which nodes are joined by bete.
The following commands are under the Help menu.
Invokes the user's web browser with the on-line manual for gtree.
Invokes the user's web browser with the on-line manual for bete.
Wayne Christopher, Ph.D.
Kimmen Sjolander, Ph.D.