Berkeley PHOG: PhyloFacts Orthology Group Prediction Web Server Supplement 2: Detailed Description of Experiments

Ruchira Datta, Bushra Samad, Christoph Neyer and Kimmen Sjölander

Ortholog detection is essential in functional annotation of genomes, with applications to phylogenetic tree construction, prediction of protein-protein interaction and other bioinformatics tasks. The PHOG web server employs a novel algorithm to identify orthologs based on phylogenetic analysis. To assess the accuracy of PHOG, we used a set of human sequences and their predicted orthologs in three model organisms -- Mus musculus (mouse), Danio rerio (zebrafish) and Drosophila melanogaster (fruit fly) -- from the TreeFam-A resource as a gold standard benchmark. TreeFam-A uses a sophisticated ortholog-identification protocol (including tree reconciliation and manual curation) providing for a high-accuracy dataset. Mouse, zebrafish and fruit fly were selected since they had been targeted for analysis by both OrthoMCL-DB and InParanoid and represented a range of evolutionary distances.

Dataset selection: We chose a set of 100 human sequences from TreeFam-A meeting the following requirements:

Each sequence had to have orthologs in the TreeFam-A curated seed families from at least two of the three target species: mouse, zebrafish and Drosophila melanogaster.

No pair of sequences in the set could have a BLAST E-value <= 1 (using a database size of 100,000) or share a common PFAM domain. This filtering removed potential homologs from the dataset, so that method performance on it should generalize.

We chose 100 human sequences as follows. We considered all human sequences for which TreeFam-A contained orthologs in at least two of the three target species: mouse, zebrafish and fruit fly. This yielded a set of 353 human sequences. We then ran BLAST for each sequence in this set against the entire set, setting the database length to 100,000, and removed all sequences for which any other sequence in the set had a BLAST E-value <= 1. This yielded a reduced set of 106 sequences. To further ensure that no pair of sequences was homologous, we searched for common PFAM domains by scoring each sequence against the PFAM-A library using hmmpfam (HMMs were downloaded from PFAM on January 13th, 2009). Four of the sequences had no significant PFAM hits (based on an E-value <= 0.001 and length >= 30); these four sequences were eliminated. Two pairs of sequences shared PFAM domains; we eliminated one member of each pair. This yielded a set of 100 sequences, containing a total of 138 PFAM domains. See Table S2-1.
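The two filtering steps described above can be sketched as follows. This is a minimal illustration, not the scripts used for the benchmark: the function names and the `evalues` and `domains` mappings are hypothetical, and because the text does not say which member of each domain-sharing pair was dropped, the sketch arbitrarily drops the one that appears later in the input order.

```python
def filter_by_blast(seq_ids, evalues, max_evalue=1.0):
    """Keep only sequences with no BLAST hit (E-value <= max_evalue) to any
    other sequence in the set.

    `evalues` (hypothetical input) maps (query, subject) pairs to E-values;
    pairs missing from the mapping are treated as having no reported hit.
    """
    flagged = set()
    for (query, subject), evalue in evalues.items():
        if query != subject and evalue <= max_evalue:
            # Both members of a hit are removed, as in the text.
            flagged.update((query, subject))
    return [s for s in seq_ids if s not in flagged]


def filter_by_pfam(seq_ids, domains):
    """Drop sequences with no significant PFAM hit, then drop one member of
    each group sharing a domain.

    `domains` (hypothetical input) maps a sequence id to the set of PFAM
    domains that passed the significance cutoffs. Assumption: the sequence
    seen later in `seq_ids` is the one dropped.
    """
    kept, seen_domains = [], set()
    for s in seq_ids:
        doms = domains.get(s, set())
        if not doms:              # no significant PFAM hit
            continue
        if doms & seen_domains:   # shares a domain with a kept sequence
            continue
        kept.append(s)
        seen_domains |= doms
    return kept
```

Applied in sequence, `filter_by_pfam(filter_by_blast(ids, evalues), domains)` mirrors the order of the steps in the text (BLAST filter first, then the PFAM filter).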

Ensuring coverage: To ensure that each sequence in the test set was included in PhyloFacts, we identified the region of each sequence corresponding to each PFAM domain, then used the PhyloBuilder software to cluster homologs and construct a phylogenetic tree (using global-local mode for domain-based homology and 5 subfamily HMM iterations). Users can submit their own sequences to this pipeline at the PhyloBuilder webserver at http://phylogenomics.berkeley.edu/phylobuilder.

Assessing method performance: For each human sequence and for each method, we identified predicted orthologs in mouse, zebrafish and Drosophila melanogaster, and compared these predicted orthologs against the orthologs identified by TreeFam-A.

We used TreeFam Release 6.0, InParanoid Version 6.1, OrthoMCL Version 2, and PHOG as of March 1st, 2009. For each of these methods, we used the precomputed predictions that were available on the respective websites. Ensembl Release 52 was used to cross-reference the identifiers used by some of the methods, along with the cross-reference provided by TreeFam. PHOG predictions were based on a total of 455 trees containing the human sequences, taken from the PhyloFacts resource as of March 1st, 2009.

Cross-referencing database identifiers: In order to compare the ortholog predictions made by the various methods, we had to bring all the identifiers to the same form. The form we used was the Ensembl gene identifier. We used the gene_stable_id, transcript, transcript_stable_id, translation, and translation_stable_id tables provided by Ensembl to convert Ensembl transcript or protein identifiers to gene identifiers. We also used the stable_id_event table provided by Ensembl to convert outdated Ensembl identifiers to their present forms. We used the ens_xref table provided by TreeFam to convert UniProt accessions to Ensembl identifiers.
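The identifier normalization described above amounts to a chain of lookups. The sketch below assumes the relevant Ensembl and TreeFam tables have already been loaded into three plain dictionaries; the function name, the dictionary names and the order in which the lookups are applied are assumptions made for illustration, not details given in the text.

```python
def to_gene_id(identifier, uniprot_to_ensembl, old_to_current, member_to_gene):
    """Map an identifier to a current Ensembl gene identifier.

    Hypothetical inputs:
      uniprot_to_ensembl: UniProt accession -> Ensembl identifier
                          (e.g. built from TreeFam's ens_xref table)
      old_to_current:     outdated Ensembl id -> replacement id
                          (e.g. built from Ensembl's stable_id_event table)
      member_to_gene:     transcript or protein id -> gene id
    Identifiers with no entry in a mapping pass through unchanged.
    """
    ident = uniprot_to_ensembl.get(identifier, identifier)
    # Follow the chain of replacements, guarding against cycles.
    seen = set()
    while ident in old_to_current and ident not in seen:
        seen.add(ident)
        ident = old_to_current[ident]
    return member_to_gene.get(ident, ident)
```

Once every prediction is expressed as a pair of gene identifiers, predictions from different methods can be compared by simple set operations.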

Statistical measures of performance: For each method, we computed the recall (or sensitivity) as the fraction of TreeFam-A orthology pairs found.

For reference, recall and precision are defined as follows:

Recall = TP/(TP+FN)

Precision = TP/(TP+FP)

where a True Positive (TP) is an orthology pair included in TreeFam-A that is also predicted by a method, a False Negative (FN) is an orthology pair included in TreeFam-A that is not predicted by a method (i.e., it is missed by the method), and a False Positive (FP) is an orthology pair predicted by a method that is not included in TreeFam-A.
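As a minimal sketch of these definitions, orthology pairs can be represented as sets of tuples, with TP, FP and FN obtained by set intersection and difference. The function name and the tuple representation are assumptions made for illustration.

```python
def precision_recall(predicted, reference):
    """Return (precision, recall) for a set of predicted orthology pairs
    scored against a reference set (here, the TreeFam-A pairs).

    Each pair is a hashable tuple such as (human_gene, ortholog_gene).
    """
    tp = len(predicted & reference)   # in the reference and predicted
    fp = len(predicted - reference)   # predicted but not in the reference
    fn = len(reference - predicted)   # in the reference but missed
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall
```

The `nan` return value for an empty denominator is a design choice of this sketch; it keeps a method with no predictions from being reported as having a precision of zero.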

The precision measure reported in the main body of the paper, reported as P1 in Supplement 3, is the most stringent. This measure labels all predicted orthologs (by InParanoid, OrthoMCL or PHOG) not found by TreeFam-A as False Positives. In other words, if TreeFam-A did not identify an ortholog and OrthoMCL did, OrthoMCL is counted as wrong and its computed precision decreases.

We also computed a less stringent measure of precision, reported as P2 in Supplement 3, based on a different definition of False Positive. This measure only calls a predicted orthology pair an error in the case where TreeFam-A selected a different sequence from the same species. In other words, if for a particular human sequence TreeFam-A selects no ortholog from mouse, but OrthoMCL does, OrthoMCL might possibly be correct. That orthology pair does not contribute to the OrthoMCL recall, but does not hurt its precision. On the other hand, if for a human sequence (H) TreeFam-A selects an ortholog from mouse (M1) but OrthoMCL selects a different mouse sequence (M2), the (H,M2) pair is called a false positive.
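The less stringent measure can be sketched as follows, under one reading of the definition above: a predicted pair absent from TreeFam-A counts as a false positive only when TreeFam-A lists at least one ortholog for the same human sequence in the same species. The function name and the (human, species, ortholog) tuple layout are assumptions made for illustration.

```python
def relaxed_precision(predicted, reference):
    """Precision in which a prediction is penalized only if the reference
    selected a different sequence from the same species.

    Pairs are (human_gene, species, ortholog_gene) tuples.
    """
    # (human, species) combinations for which the reference has an ortholog
    covered = {(h, sp) for h, sp, _ in reference}
    tp = sum(1 for pair in predicted if pair in reference)
    fp = sum(
        1
        for (h, sp, o) in predicted
        if (h, sp, o) not in reference and (h, sp) in covered
    )
    # Predictions for uncovered (human, species) combinations are ignored.
    return tp / (tp + fp) if tp + fp else float("nan")
```

For example, with reference pairs (H1, mouse, M1) and (H2, mouse, M2) and predictions (H1, mouse, M1), (H2, mouse, M9) and (H3, mouse, M7), this sketch gives 1/2, whereas the stringent definition would give 1/3.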

Results: We present the results in Figures S2-1 through S2-4 below. The point labeled PHOG-T(M) on each curve describes the performance at threshold value 0.09375, where the precision and recall for the human-mouse ortholog predictions are approximately equal. The point labeled PHOG-T(Z) on each curve describes the performance at threshold value 0.296875, where the precision and recall for the human-zebrafish ortholog predictions are approximately equal. The point labeled PHOG-T(F) on each curve describes the performance at threshold value 0.9375, where the precision and recall for the human-fruit fly ortholog predictions are approximately equal. Detailed results are provided in Supplement 3.
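One simple way to locate a threshold at which precision and recall are approximately equal is to scan a precomputed precision-recall curve for the point with the smallest gap between the two. The sketch below is illustrative only: the function name and the (threshold, precision, recall) tuple layout are assumptions, and it is not necessarily the procedure used to pick the thresholds reported here.

```python
def balanced_threshold(curve):
    """Return the (threshold, precision, recall) point whose precision and
    recall are closest to each other.

    `curve` is a non-empty list of (threshold, precision, recall) tuples,
    one per candidate tree-distance threshold.
    """
    return min(curve, key=lambda point: abs(point[1] - point[2]))
```

Running this separately on the curves restricted to each species pair would yield one threshold per species, analogous to the per-species operating points described above.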

Figure S2-1. Precision-Recall results for all methods against the dataset as a whole.

Performance was evaluated on 100 human proteins selected from the TreeFam-A manually curated orthology database, with orthologs to each human protein from mouse, zebrafish and fruit fly. Methods evaluated include several PHOG variants, OrthoMCL-DB, InParanoid and SCI-PHY. PHOG-S represents super-orthology predictions, PHOG-O represents standard orthology predictions and PHOG-T represents the tree-distance thresholded variants. PHOG-T variants PHOG-T(M), -T(Z) and -T(F) correspond to tree-distance thresholds selected for optimal performance on this dataset for mouse, zebrafish and fruit fly respectively. Tree distance thresholds were 0.09375 (mouse), 0.296875 (zebrafish) and 0.9375 (fruit fly). SCI-PHY uses hierarchical clustering and encoding cost measures to define functional subtypes and is included for comparison. Recall measures the fraction of TreeFam-A orthologs detected by a method. Precision measures the fraction of a method's predicted orthologs that are included in TreeFam-A. A True Positive (TP) is an orthology pair included in TreeFam-A that is also predicted by a method, a False Positive (FP) is an orthology pair predicted by a method that is not included in TreeFam-A, and a False Negative (FN) is a TreeFam-A ortholog that is missed by a method.

Figure S2-2. Results of ortholog prediction, restricted to human-mouse orthologs. See Figure S2-1 for details on precision and recall measures.

Figure S2-3. Results of ortholog prediction, restricted to human-zebrafish orthologs. See Figure S2-1 for details on precision and recall measures. OrthoMCL-DB's performance on zebrafish is uncharacteristically low, relative to its performance on other species.

Figure S2-4. Results of ortholog prediction, restricted to human-fruit fly orthologs. See Figure S2-1 for details on precision and recall measures.

Table S2-1. Benchmark dataset of 100 human proteins taken from the TreeFam-A resource. The Ensembl identifier and equivalent UniProt accessions are shown, along with the description of each protein (obtained from the UniProt resource).

Ensembl ID

UniProt Accession

Description

Glutathione S-transferase omega-2

DNA-directed RNA polymerase II subunit RPB7

Bardet-Biedl syndrome 5 protein

39S ribosomal protein L15

Mitochondrial import inner membrane translocase subunit TIM44

tRNA-dihydrouridine synthase 4-like

RNA-binding protein NOB1

Protein MAK16 homolog

Ligatin

Protein phosphatase 1 regulatory subunit 3A

Peptidyl-prolyl cis-trans isomerase NIMA-interacting 4

Cell cycle checkpoint control protein RAD9A

Peroxisomal membrane protein 2

Protein ACN9 homolog

Mitochondrial import receptor subunit TOM22 homolog

Mitochondrial import receptor subunit TOM7 homolog

GrpE protein homolog 1

rRNA-processing protein UTP23 homolog

39S ribosomal protein L51

Superoxide dismutase

Mitochondrial import inner membrane translocase subunit TIM50

Protein phosphatase 1 regulatory subunit 14A

Dual specificity protein phosphatase 6

DNA-directed RNA polymerases I and III subunit RPAC2

Heat shock protein beta-8

H/ACA ribonucleoprotein complex subunit 2

Aldo-keto reductase family 1 member B10

Nibrin

Nuclear pore complex protein Nup155

Nucleoporin 50 kDa

ATP-dependent DNA helicase 2 subunit 2

Elongator complex protein 3

Nucleoside diphosphate-linked moiety X motif 8

Erythrocyte band 7 integral membrane protein

Putative ATP-dependent Clp protease proteolytic subunit

Peptidyl-prolyl cis-trans isomerase FKBP1A

Origin recognition complex subunit 6

Transmembrane protein 111

Signal recognition particle 19 kDa protein

ATP synthase subunit O

ESF1 homolog

Annexin A1

Choline-phosphate cytidylyltransferase A

Transcription initiation factor IIE subunit beta

Lamin-B2

60 kDa SS-A/Ro ribonucleoprotein

Tumor suppressor candidate 4

Solute carrier family 35 member B1

SAC domain-containing protein 3

DNA-directed RNA polymerase II subunit RPB4

NADH dehydrogenase

Metastasis-associated protein MTA2

Lysophosphatidylcholine acyltransferase

Phosphoribosylformylglycinamidine synthase

Exocyst complex component 8

Splicing factor 3 subunit 1

NADH dehydrogenase

Splicing factor 3B subunit 3

Transmembrane protein C9orf7

Lamin-B receptor

NAD-dependent deacetylase sirtuin-4

Phenylalanyl-tRNA synthetase beta chain

Smoothened homolog

39S ribosomal protein L17

UPF0415 protein C7orf25

Ribonuclease UK114

Dihydrolipoyllysine-residue acetyltransferase component of pyruvate dehydrogenase complex

Integral membrane protein GPR177

UPF0587 protein C1orf123

Flap endonuclease 1

Geranylgeranyl transferase type-1 subunit beta

D-glucuronyl C5-epimerase

Deoxycytidylate deaminase

Brix domain-containing protein 2

RRP15-like protein

Gremlin-2

Piwi-like protein 1

Sororin

Mitotic spindle assembly checkpoint protein MAD2A

Translation initiation factor eIF-2B subunit gamma

Cytochrome b-c1 complex subunit Rieske

Caspase-14

Anaphase-promoting complex subunit 7

U3 small nucleolar ribonucleoprotein protein MPP10

Chromobox protein homolog 7

Glypican-1

Cyclin-dependent kinase 2-associated protein 2

26S proteasome non-ATPase regulatory subunit 8

Serine/threonine-protein phosphatase 2A catalytic subunit alpha isoform

UPF0480 protein C15orf24

26S proteasome non-ATPase regulatory subunit 7

Eukaryotic translation initiation factor 4E

Nuclear prelamin A recognition factor-like protein

Cytochrome c oxidase copper chaperone

Nucleolar protein 6

39S ribosomal protein L12

DNA-directed RNA polymerase II subunit RPB2

Chromatin accessibility complex protein 1

CCAAT/enhancer-binding protein gamma

DNA excision repair protein ERCC-1