Ruchira Datta, Bushra Samad, Christoph Neyer and Kimmen Sjölander
Ortholog detection is essential in functional annotation of genomes, with applications to phylogenetic tree construction, prediction of protein-protein interaction and other bioinformatics tasks. The PHOG web server employs a novel algorithm to identify orthologs based on phylogenetic analysis. To assess the accuracy of PHOG, we used a set of human sequences and their predicted orthologs in three model organisms -- Mus musculus (mouse), Danio rerio (zebrafish) and Drosophila melanogaster (fruit fly) -- from the TreeFam-A resource as a gold standard benchmark. TreeFam-A uses a sophisticated ortholog-identification protocol (including tree reconciliation and manual curation) providing for a high-accuracy dataset. Mouse, zebrafish and fruit fly were selected since they had been targeted for analysis by both OrthoMCL-DB and InParanoid and represented a range of evolutionary distances.
Dataset selection: We chose a set of 100 human sequences from TreeFam-A meeting the following requirements:
Each sequence had to have orthologs in the TreeFam-A curated seed families from at least two of the three target species: mouse, zebrafish and Drosophila melanogaster.
No pair of sequences in the set could have a BLAST E-value <= 1 (using a database size of 100,000) or share a common PFAM domain. This filtering removes potential homologs from the dataset, so that method performance on it should generalize.
We chose 100 human sequences as follows. We considered all human sequences for which TreeFam-A contained orthologs in at least two of the three target species: mouse, zebrafish and fruit fly. This yielded a set of 353 human sequences. We then ran BLAST for each sequence in this set against the entire set, setting the database length to 100,000, and removed all sequences for which any other sequence in the set had a BLAST E-value <= 1. This yielded a reduced set of 106 sequences. To further ensure that no pair of sequences was homologous, we searched for common PFAM domains by scoring each sequence against the PFAM-A library using hmmpfam (HMMs were downloaded from PFAM on January 13th, 2009). Four of the sequences had no significant PFAM hits (based on an E-value <= 0.001 and length >= 30); these four sequences were eliminated. Two pairs of sequences shared PFAM domains; we eliminated one member of each pair. This yielded a set of 100 sequences, containing a total of 138 PFAM domains. See Table S2-1.
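The two filtering steps above can be sketched roughly as follows. This is an illustrative sketch, not the code used for the benchmark: the function names, the dictionary layouts and the toy identifiers are all hypothetical. It assumes that both members of a BLAST hit pair are removed, and that for sequences sharing a PFAM domain the first one encountered is kept (the text does not say which member of each pair was eliminated).

```python
from typing import Dict, List, Set, Tuple


def filter_by_blast(seq_ids: List[str],
                    hits: Dict[Tuple[str, str], float],
                    max_evalue: float = 1.0) -> List[str]:
    """Drop every sequence that has a hit to another sequence at E-value <= max_evalue."""
    flagged: Set[str] = set()
    for (query, subject), evalue in hits.items():
        if query != subject and evalue <= max_evalue:
            # Assumption: both members of the hit pair are removed.
            flagged.update((query, subject))
    return [s for s in seq_ids if s not in flagged]


def filter_by_pfam(seq_ids: List[str],
                   domains: Dict[str, Set[str]]) -> List[str]:
    """Drop sequences without significant PFAM hits, then keep only one
    sequence out of any group that shares a PFAM domain."""
    kept: List[str] = []
    used_domains: Set[str] = set()
    for seq in seq_ids:
        seq_domains = domains.get(seq, set())
        if not seq_domains:
            continue  # no significant PFAM hit
        if seq_domains & used_domains:
            continue  # shares a domain with a sequence already kept
        kept.append(seq)
        used_domains |= seq_domains
    return kept


# Toy input with made-up identifiers and E-values.
ids = ["A", "B", "C", "D", "E"]
hits = {("A", "B"): 0.5, ("C", "D"): 5.0}
domains = {"C": {"PF_x"}, "D": {"PF_x", "PF_y"}, "E": set()}

after_blast = filter_by_blast(ids, hits)
after_pfam = filter_by_pfam(after_blast, domains)
```

In the toy input, A and B are removed by the BLAST step, E is removed for lacking a PFAM hit, and D is removed because it shares a domain with C.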
Ensuring coverage: To ensure that each sequence in the test set was included in PhyloFacts, we identified the region of each sequence corresponding to each PFAM domain, then clustered homologs and constructed a phylogenetic tree using the PhyloBuilder software (in global-local mode for domain-based homology, with 5 subfamily HMM iterations). Users can submit their own sequences to this pipeline at the PhyloBuilder webserver at http://phylogenomics.berkeley.edu/phylobuilder.
Assessing method performance: For each human sequence and for each method, we identified predicted orthologs in mouse, zebrafish and Drosophila melanogaster, and compared these predicted orthologs against the orthologs identified by TreeFam-A.
We used TreeFam Release 6.0, InParanoid Version 6.1, OrthoMCL Version 2, and PHOG as of March 1st, 2009. For each of these methods, we used the precomputed predictions that were available on the respective websites. Ensembl Release 52 was used to cross-reference the identifiers used by some of the methods, along with the cross-reference provided by TreeFam. PHOG predictions were based on a total of 455 trees containing the human sequences, taken from the PhyloFacts resource as of March 1st, 2009.
Cross-referencing database identifiers: To compare the ortholog predictions made by the various methods, we converted all identifiers to a common form, the Ensembl gene identifier. We used the gene_stable_id, transcript, transcript_stable_id, translation, and translation_stable_id tables provided by Ensembl to convert Ensembl transcript or protein identifiers to gene identifiers. We also used the stable_id_event table provided by Ensembl to convert outdated Ensembl identifiers to their present forms. We used the ens_xref table provided by TreeFam to convert UniProt accessions to Ensembl identifiers.
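The chain of conversions can be pictured with the following sketch. It assumes the relevant Ensembl and TreeFam tables have already been flattened into plain Python dictionaries; the dictionary names, the function name and the identifiers in the example are hypothetical and are not real database entries.

```python
from typing import Dict


def to_gene_id(identifier: str,
               uniprot_to_ensembl: Dict[str, str],
               retired_to_current: Dict[str, str],
               translation_to_transcript: Dict[str, str],
               transcript_to_gene: Dict[str, str]) -> str:
    """Map a UniProt accession or an Ensembl protein/transcript/gene
    identifier to a current Ensembl gene identifier, where possible."""
    # UniProt accession -> Ensembl identifier (role of TreeFam's ens_xref).
    current = uniprot_to_ensembl.get(identifier, identifier)

    # Outdated identifier -> present form (role of stable_id_event).
    # The 'seen' set guards against cycles in the mapping.
    seen = set()
    while current in retired_to_current and current not in seen:
        seen.add(current)
        current = retired_to_current[current]

    # Protein (translation) -> transcript -> gene.
    current = translation_to_transcript.get(current, current)
    current = transcript_to_gene.get(current, current)
    return current


# Made-up identifiers, for illustration only.
gene = to_gene_id(
    "UNIPROT_1",
    uniprot_to_ensembl={"UNIPROT_1": "PROT_old"},
    retired_to_current={"PROT_old": "PROT_new"},
    translation_to_transcript={"PROT_new": "TRANS_1"},
    transcript_to_gene={"TRANS_1": "GENE_1"},
)
```

Identifiers that are already gene identifiers pass through unchanged, so predictions from every method can be run through the same function before comparison.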
Statistical measures of performance: For each method, we computed the recall (or sensitivity) as the fraction of TreeFam-A orthology pairs found.
For reference, recall and precision are defined as follows:
Recall = TP/(TP+FN)
Precision = TP/(TP+FP)
where a True Positive (TP) is an orthology pair included in TreeFam-A that is also predicted by a method, a False Negative (FN) is an orthology pair included in TreeFam-A that is not predicted by a method (i.e., it is missed by the method), and a False Positive (FP) is an orthology pair predicted by a method that is not included in TreeFam-A.
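As an illustration of these definitions, the sketch below computes both measures from two sets of orthology pairs. The function name, the (human, ortholog) tuple representation and the toy identifiers are assumptions made for the example; returning 0.0 when a denominator is zero is likewise a convention chosen here, not one stated in the text.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (human sequence, predicted or reference ortholog)


def precision_recall(predicted: Set[Pair],
                     reference: Set[Pair]) -> Tuple[float, float]:
    """Return (precision, recall) of predicted pairs against reference pairs."""
    tp = len(predicted & reference)   # in the reference and predicted
    fp = len(predicted - reference)   # predicted but not in the reference
    fn = len(reference - predicted)   # in the reference but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy example with made-up identifiers: TP = 2, FP = 1, FN = 1.
reference = {("H1", "M1"), ("H1", "Z1"), ("H2", "M2")}
predicted = {("H1", "M1"), ("H2", "M2"), ("H2", "M3")}
precision, recall = precision_recall(predicted, reference)
```

With these toy sets both precision and recall equal 2/3.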
The precision measure reported in the main body of the paper, reported as P1 in Supplement 3, is the most stringent. This measure labels all predicted orthologs (by InParanoid, OrthoMCL or PHOG) not found by TreeFam-A as False Positives. In other words, if TreeFam-A did not identify an ortholog, and OrthoMCL did, OrthoMCL was counted as wrong, and the computed precision decreases.
We also computed a less stringent measure of precision, reported as P2 in Supplement 3, based on a different definition of False Positive. This measure only calls a predicted orthology pair an error in the case where TreeFam-A selected a different sequence from the same species. In other words, if for a particular human sequence TreeFam-A selects no ortholog from mouse, but OrthoMCL does, OrthoMCL might possibly be correct. That orthology pair does not contribute to the OrthoMCL recall, but does not hurt its precision. On the other hand, if TreeFam-A selects an ortholog from mouse (M1) but OrthoMCL selects a different ortholog from mouse (M2), the (H,M2) pair is called a false positive.
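The difference between the two precision measures can be made concrete with a small sketch. Here each orthology pair is written as a hypothetical (human, species, ortholog) triple so that the species of the ortholog is explicit; the function names and identifiers are illustrative assumptions, and 0.0 is returned when no pair is counted.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (human sequence, species, ortholog)


def strict_precision(predicted: Set[Triple], reference: Set[Triple]) -> float:
    """P1-style: every predicted pair absent from the reference is an error."""
    tp = len(predicted & reference)
    fp = len(predicted - reference)
    return tp / (tp + fp) if (tp + fp) else 0.0


def relaxed_precision(predicted: Set[Triple], reference: Set[Triple]) -> float:
    """P2-style: a predicted pair is an error only if the reference selected
    a different ortholog for the same human sequence in the same species."""
    covered = {(human, species) for human, species, _ in reference}
    tp = len(predicted & reference)
    fp = sum(1 for (human, species, orth) in predicted
             if (human, species, orth) not in reference
             and (human, species) in covered)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Toy example: the reference has a mouse ortholog for H1 but none for H2.
reference = {("H1", "mouse", "M1")}
predicted = {
    ("H1", "mouse", "M1"),  # true positive under both measures
    ("H1", "mouse", "M2"),  # false positive under both measures
    ("H2", "mouse", "M9"),  # false positive under P1, ignored under P2
}
p1 = strict_precision(predicted, reference)
p2 = relaxed_precision(predicted, reference)
```

In this toy case the strict measure gives 1/3 while the relaxed measure gives 1/2, because the (H2, M9) pair is simply not counted under the relaxed definition.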
Results: We present the results in Figures S2-1 through S2-4 below. The point labeled PHOG-T(M) on each curve describes the performance at threshold value 0.09375, where the precision and recall for the human-mouse ortholog predictions are approximately equal. The point labeled PHOG-T(Z) on each curve describes the performance at threshold value 0.296875, where the precision and recall for the human-zebrafish ortholog predictions are approximately equal. The point labeled PHOG-T(D) on each curve describes the performance at threshold value 0.9375, where the precision and recall for the human-fruit fly ortholog predictions are approximately equal. Detailed results are provided in Supplement 3.

Figure S2-1. Precision-Recall results for all methods against the dataset as a whole.
Performance was evaluated on 100 human proteins selected from the TreeFam-A manually curated orthology database, with orthologs to each human protein from mouse, zebrafish and fruit fly. Methods evaluated include several PHOG variants, OrthoMCL-DB, InParanoid and SCI-PHY. PHOG-S represents super-orthology predictions, PHOG-O represents standard orthology predictions and PHOG-T represents the tree-distance thresholded variants. PHOG-T variants PHOG-T(M), -T(Z) and -T(F) correspond to tree-distance thresholds selected for optimal performance on this dataset for mouse, zebrafish and fruit fly respectively. Tree distance thresholds were 0.09375 (mouse), 0.296875 (zebrafish) and 0.9375 (fruit fly). SCI-PHY uses hierarchical clustering and encoding cost measures to define functional subtypes and is included for comparison. Recall measures the fraction of TreeFam-A orthologs detected by a method. Precision measures the fraction of a method's predicted orthologs that are included in TreeFam-A. A True Positive (TP) is an orthology pair included in TreeFam-A that is also predicted by a method, a False Positive (FP) is an orthology pair predicted by a method that is not included in TreeFam-A, and a False Negative (FN) is a TreeFam-A ortholog that is missed by a method.

Figure S2-2. Results of ortholog prediction, restricted to human-mouse orthologs. See Figure S2-1 for details on precision and recall measures.

Figure S2-3. Results of ortholog prediction, restricted to human-zebrafish orthologs. See Figure S2-1 for details on precision and recall measures. OrthoMCL-DB's performance on zebrafish is uncharacteristically low, relative to its performance on other species.

Figure S2-4. Results of ortholog prediction, restricted to human-fruit fly orthologs. See Figure S2-1 for details on precision and recall measures.
Table S2-1. Benchmark dataset of 100 human proteins taken from the TreeFam-A resource. The Ensembl identifier and equivalent UniProt accessions are shown, along with the description of each protein (obtained from the UniProt resource).
| Ensembl ID | UniProt Accession | Description |
| | | Glutathione S-transferase omega-2 |
| | | DNA-directed RNA polymerase II subunit RPB7 |
| | | Bardet-Biedl syndrome 5 protein |
| | | 39S ribosomal protein L15 |
| | | Mitochondrial import inner membrane translocase subunit TIM44 |
| | | tRNA-dihydrouridine synthase 4-like |
| | | RNA-binding protein NOB1 |
| | | Protein MAK16 homolog |
| | | Ligatin |
| | | Protein phosphatase 1 regulatory subunit 3A |
| | | Peptidyl-prolyl cis-trans isomerase NIMA-interacting 4 |
| | | Cell cycle checkpoint control protein RAD9A |
| | | Peroxisomal membrane protein 2 |
| | | Protein ACN9 homolog |
| | | Mitochondrial import receptor subunit TOM22 homolog |
| | | Mitochondrial import receptor subunit TOM7 homolog |
| | | GrpE protein homolog 1 |
| | | rRNA-processing protein UTP23 homolog |
| | | 39S ribosomal protein L51 |
| | | Superoxide dismutase |
| | | Mitochondrial import inner membrane translocase subunit TIM50 |
| | | Protein phosphatase 1 regulatory subunit 14A |
| | | Dual specificity protein phosphatase 6 |
| | | DNA-directed RNA polymerases I and III subunit RPAC2 |
| | | Heat shock protein beta-8 |
| | | H/ACA ribonucleoprotein complex subunit 2 |
| | | Aldo-keto reductase family 1 member B10 |
| | | Nibrin |
| | | Nuclear pore complex protein Nup155 |
| | | Nucleoporin 50 kDa |
| | | ATP-dependent DNA helicase 2 subunit 2 |
| | | Elongator complex protein 3 |
| | | Nucleoside diphosphate-linked moiety X motif 8 |
| | | Erythrocyte band 7 integral membrane protein |
| | | Putative ATP-dependent Clp protease proteolytic subunit |
| | | Peptidyl-prolyl cis-trans isomerase FKBP1A |
| | | Origin recognition complex subunit 6 |
| | | Transmembrane protein 111 |
| | | Signal recognition particle 19 kDa protein |
| | | ATP synthase subunit O |
| | | ESF1 homolog |
| | | Annexin A1 |
| | | Choline-phosphate cytidylyltransferase A |
| | | Transcription initiation factor IIE subunit beta |
| | | Lamin-B2 |
| | | 60 kDa SS-A/Ro ribonucleoprotein |
| | | Tumor suppressor candidate 4 |
| | | Solute carrier family 35 member B1 |
| | | SAC domain-containing protein 3 |
| | | DNA-directed RNA polymerase II subunit RPB4 |
| | | NADH dehydrogenase |
| | | Metastasis-associated protein MTA2 |
| | | Lysophosphatidylcholine acyltransferase |
| | | Phosphoribosylformylglycinamidine synthase |
| | | Exocyst complex component 8 |
| | | Splicing factor 3 subunit 1 |
| | | NADH dehydrogenase |
| | | Splicing factor 3B subunit 3 |
| | | Transmembrane protein C9orf7 |
| | | Lamin-B receptor |
| | | NAD-dependent deacetylase sirtuin-4 |
| | | Phenylalanyl-tRNA synthetase beta chain |
| | | Smoothened homolog |
| | | 39S ribosomal protein L17 |
| | | UPF0415 protein C7orf25 |
| | | Ribonuclease UK114 |
| | | Dihydrolipoyllysine-residue acetyltransferase component of pyruvate dehydrogenase complex |
| | | Integral membrane protein GPR177 |
| | | UPF0587 protein C1orf123 |
| | | Flap endonuclease 1 |
| | | Geranylgeranyl transferase type-1 subunit beta |
| | | D-glucuronyl C5-epimerase |
| | | Deoxycytidylate deaminase |
| | | Brix domain-containing protein 2 |
| | | RRP15-like protein |
| | | Gremlin-2 |
| | | Piwi-like protein 1 |
| | | Sororin |
| | | Mitotic spindle assembly checkpoint protein MAD2A |
| | | Translation initiation factor eIF-2B subunit gamma |
| | | Cytochrome b-c1 complex subunit Rieske |
| | | Caspase-14 |
| | | Anaphase-promoting complex subunit 7 |
| | | U3 small nucleolar ribonucleoprotein protein MPP10 |
| | | Chromobox protein homolog 7 |
| | | Glypican-1 |
| | | Cyclin-dependent kinase 2-associated protein 2 |
| | | 26S proteasome non-ATPase regulatory subunit 8 |
| | | Serine/threonine-protein phosphatase 2A catalytic subunit alpha isoform |
| | | UPF0480 protein C15orf24 |
| | | 26S proteasome non-ATPase regulatory subunit 7 |
| | | Eukaryotic translation initiation factor 4E |
| | | Nuclear prelamin A recognition factor-like protein |
| | | Cytochrome c oxidase copper chaperone |
| | | Nucleolar protein 6 |
| | | 39S ribosomal protein L12 |
| | | DNA-directed RNA polymerase II subunit RPB2 |
| | | Chromatin accessibility complex protein 1 |
| | | CCAAT/enhancer-binding protein gamma |
| | | DNA excision repair protein ERCC-1 |