PhyloFacts::Genome Coverage

Model Organism and External Database Coverage Statistics

Fraction of select genomes and external databases matching PhyloFacts protein families

Genomes/Datasets                Classification  No. sequences  Global homology  Local similarity  Last update

Archaea
Archaeoglobus fulgidus          Euryarchaeota   2,432          52.6%            67.7%             2007-04-10
Halobacterium sp nrc-1          Euryarchaeota   2,455          44.8%            57.6%             2007-04-10
Methanococcus jannaschii        Euryarchaeota   1,715          38.8%            39.0%             2007-04-10
Methanosarcina acetivorans c2a  Euryarchaeota   4,540          40.6%            58.4%             2007-04-10
Methanosarcina mazei goe1       Euryarchaeota   3,371          43.9%            61.9%             2007-04-10
Methanopyrus kandleri av19      Euryarchaeota   1,687          44.5%            58.0%             2007-04-10
Pyrococcus abyssi ge5           Euryarchaeota   1,765          83.0%            92.0%             2007-04-10
Pyrococcus furiosus dsm 3638    Euryarchaeota   2,065          92.5%            98.2%             2007-04-10
Sulfolobus solfataricus         Crenarchaeota   2,960          45.7%            58.1%             2007-04-10

Bacteria
Aquifex aeolicus vf5            Aquificae       1,515          64.5%            78.1%             2007-04-10
Bacillus anthracis              Firmicutes      5,508          47.6%            56.3%             2007-04-10
Borrelia burgdorferi            Spirochaetes    1,740          29.6%            37.1%             2007-04-10
Brucella suis                   Proteobacteria  3,385          60.6%            66.4%             2007-04-10
Dechloromonas aromatica         Proteobacteria  4,155          85.4%            99.3%             2008-07-16
Escherichia coli                Proteobacteria  4,289          84.1%            96.8%             2007-07-13
Mycobacterium leprae            Actinobacteria  1,605          76.3%            85.1%             2007-04-10
Mycobacterium tuberculosis      Actinobacteria  3,924          64.9%            76.8%             2007-04-10
Nocardia farcinica              Actinobacteria  5,944          94.7%            98.2%             2007-04-10
Pseudomonas aeruginosa pao1     Proteobacteria  5,570          69.5%            74.9%             2007-04-10
Pseudomonas syringae b728a      Proteobacteria  5,090          66.5%            73.3%             2007-04-10
Pseudomonas syringae dc3000     Proteobacteria  5,471          62.6%            69.1%             2007-04-10
Yersinia pestis                 Proteobacteria  4,086          68.3%            75.1%             2007-04-10

Eukaryota
Arabidopsis thaliana            Viridiplantae   22,032         27.1%            71.7%             2007-07-13
Drosophila melanogaster         Metazoa         19,781         35.8%            72.9%             2007-07-13
Homo sapiens                    Metazoa         21,314         79.1%            95.1%             2007-07-16
Mus musculus                    Animalia        25,371         58.4%            87.0%             2007-04-10
Plasmodium falciparum           Protista        5,334          15.1%            44.5%             2007-04-10
Saccharomyces cerevisiae        Fungi           5,883          67.8%            90.8%             2007-07-16

Databases
SCOP                                            13,006         69.9%            96.1%             2007-04-10
PDB                                             29,318         90.3%            92.7%             2007-04-10



Notes - Coverage statistics

We regularly update PhyloFacts coverage statistics for several sequence databases: selected sequenced genomes across the Tree of Life, the Protein Data Bank (PDB) of solved structures, and the SCOP dataset of classified structural domains. Sequence data for completed genome projects are obtained from UniProt or directly from sequencing projects.

The "Global homology" statistic reports the percentage of sequences that match a global homology protein family with both a significant E-value and sufficient bi-directional coverage (the cutoffs for each are given below). The "Local similarity" statistic requires only an HMM score with a significant E-value against any protein family in PhyloFacts.
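As an illustration of how the two percentages relate, the following minimal Python sketch tallies both statistics from per-sequence results. The `SequenceResult` class and its field names are hypothetical and are not part of PhyloFacts; the sketch assumes the significance and coverage tests have already been applied to each sequence.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class SequenceResult:
    """Hypothetical per-sequence summary (field names are illustrative)."""
    # True if the sequence matches a global homology family with a
    # significant E-value and passes the bi-directional coverage cutoff.
    has_global_match: bool
    # True if the sequence has a significant HMM E-value to any family.
    has_local_match: bool


def coverage_statistics(results: Iterable[SequenceResult]) -> Tuple[float, float]:
    """Return (global homology %, local similarity %) for one dataset."""
    results = list(results)
    total = len(results)
    if total == 0:
        return 0.0, 0.0
    n_global = sum(1 for r in results if r.has_global_match)
    n_local = sum(1 for r in results if r.has_local_match)
    return 100.0 * n_global / total, 100.0 * n_local / total
```

Both percentages share the same denominator, the total number of sequences in the genome or database.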

Coverage for PDB and SCOP is computed by scoring sequences against family HMMs in the PhyloFacts protein structure library. Coverage statistics for genome sequences are computed by scoring against all protein families in the resource.

HMM E-values are computed by scoring sequences against the family HMM of each PhyloFacts protein family with the SAM hmmscore program, using local-local scoring (sw=2 and adpstyle=5) and an assumed database size of 100,000.

E-values for both BLAST and HMM scoring are considered significant if they meet the following length-dependent cutoffs: HMM length <65: 1.0e-02; HMM length 65-100: 1.0e-03; HMM length >100: 1.0e-04.
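The length-dependent cutoffs above can be written as a small lookup. This is a sketch only: the function names are illustrative, and it assumes that "meeting" a cutoff means the E-value is less than or equal to it.

```python
def evalue_cutoff(hmm_length: int) -> float:
    """Return the E-value significance cutoff for a given HMM length."""
    if hmm_length < 65:
        return 1.0e-02
    if hmm_length <= 100:
        return 1.0e-03
    return 1.0e-04


def is_significant(evalue: float, hmm_length: int) -> bool:
    """Assumption: an E-value is significant when it is <= the cutoff."""
    return evalue <= evalue_cutoff(hmm_length)
```

For example, an E-value of 5.0e-04 would count as significant for an HMM of length 80 but not for one of length 150.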

Bi-directional coverage cutoffs were defined to reduce the likelihood of assigning sequences to HMMs with different domain architectures:

Sequence/HMM length    Percent coverage
10-200                 65%
200-250                70%
250-300                73%
300-350                75%
350-400                78%
400-450                80%
450-500                83%
500+                   85%
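The cutoff table above can be sketched as a lookup in Python. Several details here are assumptions, because the table does not state them: each bin is treated as including its lower bound and excluding its upper bound (so a length of exactly 200 falls in the 70% bin), lengths below 10 are rejected, and the sequence and the HMM are each checked against the cutoff for their own length. All function and parameter names are illustrative.

```python
import bisect

# Lower bound of each length bin and the matching minimum percent coverage.
_LOWER_BOUNDS = [10, 200, 250, 300, 350, 400, 450, 500]
_CUTOFFS = [65, 70, 73, 75, 78, 80, 83, 85]


def coverage_cutoff(length: int) -> int:
    """Return the minimum percent coverage required for a given length.

    Assumption: bins are half-open, [lower, upper).
    """
    if length < _LOWER_BOUNDS[0]:
        raise ValueError("no coverage cutoff is defined for lengths below 10")
    index = bisect.bisect_right(_LOWER_BOUNDS, length) - 1
    return _CUTOFFS[index]


def passes_bidirectional_coverage(
    aligned_seq_positions: int,
    seq_length: int,
    aligned_hmm_positions: int,
    hmm_length: int,
) -> bool:
    """Assumption: both the sequence and the HMM must be covered by the
    alignment to at least the cutoff for their respective lengths."""
    seq_coverage = 100.0 * aligned_seq_positions / seq_length
    hmm_coverage = 100.0 * aligned_hmm_positions / hmm_length
    return (
        seq_coverage >= coverage_cutoff(seq_length)
        and hmm_coverage >= coverage_cutoff(hmm_length)
    )
```

Requiring coverage in both directions is what separates a full-length match from a hit to a single shared domain: a 400-residue sequence aligned over only 100 positions fails the check even if those positions cover most of a short HMM.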