Testing your comprehension

Note: This is designed to help you assess your comprehension of topics discussed in class.


Weeks 1-2. Fundamentals of molecular biology and evolution

  • What are the basic differences between prokaryotes and eukaryotes at the cellular level?
  • Define endosymbiosis, and give the evidence supporting it.
  • What do biologists mean by gene or protein "function"?
  • What is a pseudogene? What are the distinguishing features of a pseudogene?
  • How are genes identified (following sequencing and assembly)?
  • Where do the annotations (of protein function) in the sequences in GenBank and other databases come from?
  • What is meant by "gene structure"?
  • What is the difference between an exon and an intron?
  • What is a paralog? What is an ortholog? Which relationship between genes allows a biologist to assume functional similarity? Why?
  • Following gene duplication, duplicated genes can evolve novel functions or specificities. Name and describe the two main types of adaptation seen (hint: both end with "functionalization").
  • What do we mean by "domain fission" and "domain fusion"? What biological process produces this? Give an example of two protein folds related by one of these processes.
  • Describe the process of function prediction by homology, and the theory used to support the validity of this approach.
  • What are the four main sources of error in function prediction by homology?
  • What is meant by horizontal gene transfer?
  • What is meant by gene loss?
  • Define and contrast convergent and divergent evolution. Give an example of each process.
  • Protein structure prediction primarily helps provide a hint about what type of function? Select all that apply and give justification (hint: some terms are synonyms): molecular, biochemical, cellular, biological process.


    Protein structure

    Topics: Primary, secondary, tertiary and quaternary structure. Fold types. Localization. Post-translational modifications.

  • Name at least one webserver that provides secondary structure prediction.
  • What amino acid pattern in an MSA is indicative of an amphipathic helix?
  • What amino acid pattern in an MSA is indicative of a surface beta sheet?
  • What amino acid pattern in an MSA is indicative of a loop?
  • What amino acid pattern in an MSA is indicative of a region separating two structural domains?
  • What is meant by a protein domain?
  • Define and contrast conservative and silent substitutions, and missense and nonsense mutations.
  • Under what circumstances would profile construction reasonably use the observed frequencies in a multiple sequence alignment?
  • What is meant by protein primary, secondary, tertiary and quaternary structure?
  • What amino acids have hydroxyl groups?
  • What amino acid induces a kink?
  • What amino acids are typically found in collagen?
  • What is the smallest amino acid?
  • Name the aromatic amino acids.
  • Name the acidic amino acids.
  • Name the basic amino acids.
  • What is meant by an indel character?
  • What amino acid forms disulfide bridges?
  • Are disulfide bridges most commonly found in cytoplasmic proteins or in secreted proteins? Explain why this is the case.
  • What amino acid(s) have one codon? Which have two? What amino acids have the most codons? Is there a correspondence between the number of codons and the relative frequency of an amino acid?
  • What amino acids are most often involved in metal binding?
  • What are the characteristics of an active site?
  • What is an allosteric site on a protein?
  • Contrast competitive and non-competitive inhibition.

    Homology detection using pairwise sequence comparison, alignment and searches

    Topics: BLAST, Smith-Waterman, Needleman-Wunsch algorithms. Intermediate Sequence Search. Substitution matrices. Indel parameters. Masking. E-value computation.

  • How are E-values computed?
  • Which E-value indicates greater significance: an E-value of 0.000001 or an E-value of 1.0?
  • What level of percent identity in a pairwise alignment should you have in order to be able to assume that two proteins are homologous? (What other aspects of the alignment should be examined?)
  • What does it mean to be homologous?
  • Why does BLAST sometimes replace amino acids with X's?
  • If you use intermediate sequence search to predict homology, what do you need to confirm about the intermediate sequence?

    Iterated homolog detection

    Topics: PSI-BLAST, Target98, and Intermediate Sequence Search.

  • What is meant by profile drift?
  • How does PSI-BLAST compare to BLAST at recognizing remote homologs?
  • Explain how intermediate sequence search works and how you would avoid getting false positive matches using this approach.
  • What substitution matrix do the NCBI BLAST and PSI-BLAST servers use?

    Multiple sequence alignment

    Topics: Iterative vs progressive alignment algorithms; BAliBASE and validation datasets. Comparison with structural alignment.

  • Describe the difference between iterative and progressive alignment algorithms.
  • Is ClustalW progressive or iterative?
  • Is MUSCLE progressive or iterative?
  • Is the SAM buildmodel software progressive or iterative (when used to construct an MSA of unaligned sequences)?

    Hidden Markov Models and Profiles

    Topics: Estimating amino acid probabilities from small training sets. Homolog detection and alignment using these tools. Dirichlet mixture densities.

  • How is a profile related to an MSA?
  • What is the main difference between profile construction using Dirichlet mixture densities and profile construction using the Blosum62 substitution matrix?
  • What is the name of the HMM software system we've been using in the class, and where does it come from?
  • Describe what the w0.5 tool does, and the usage (inputs and outputs).
  • Given a UCSC A2M format alignment, you should be able to say which characters come from insert states, from match states and from delete states.
  • Describe the different results you'd get if you used Dirichlet mixture densities instead of substitution matrices to estimate amino acid distributions for a multiple sequence alignment.
  • When can you use the observed frequencies to estimate a profile? When do you need substitution matrices or Dirichlet mixture densities to estimate a profile?
  • What is the main difference between a profile and an HMM in database search for homologs?
  • What can HMM methods do that a profile cannot?
  • Define profile. What is the relationship between a profile and an MSA?
  • In a profile, represented by a matrix M, what does M[i][j] refer to? Be precise.
  • Name two different ways to estimate an amino acid distribution from a column in a multiple sequence alignment. What are the advantages and disadvantages of these methods?
  • Why do you think the self-substitution score (along the diagonal) of the Blosum62 matrix has different values for the amino acids?

    Protein structure prediction & homology model construction

    Topics: Structural Genomics Initiative; construction and use of comparative models. SCOP and Astral benchmark datasets. Secondary structure and solvent accessibility prediction.

  • Rank the following fold recognition methods by their relative sensitivity (for the same error rate): Target98, BLAST, PSI-BLAST, Intermediate Sequence Search. Refer to the work of Park and colleagues (one of the required readings for this week).
  • What aspects of the methods contribute to the relative performance?
  • What are the differences between PSI-BLAST and Target98 that might account for their differences in performance?
  • What are the differences between BLAST and PSI-BLAST that account for their differences in performance?
  • What is profile drift?
  • When can Intermediate Sequence Search be used effectively, and what kinds of errors could you make using this approach? How would you avoid making an error using this approach?
  • What is the aim of the Structural Genomics initiative?
  • Explain the difference between "fold recognition" and "comparative model construction".
  • Comparative model construction is also known as ______ model construction.
  • What can a biologist do with a comparative model? How does the evolutionary distance between a target and template determine the actual uses of the comparative model?
  • What homology model construction servers are available? How are they different from each other?
  • Name the steps in constructing a comparative model. Which steps are most important to the accuracy of the comparative model? Where can the greatest errors be produced?
  • Name one server that includes pre-computed comparative models.
  • Name a second webserver that computes comparative models on the fly, and the criteria it uses to select templates.
  • At what percent identity between a target and template can the comparative model be used for docking studies?
  • Describe how Park and colleagues used the Astral PDB40 dataset and the SCOP database to compare methods.
  • What is an "interleaved" or "genetic" domain?
  • When you submit a sequence to the NCBI PSI-BLAST server:
  • What is meant by "phylogenetic distribution" when I ask you to tell me the phylogenetic distribution of a gene?
  • Why would different domain structure prediction servers give you different ranges for the presence of a domain?
  • Name the domain prediction servers that are specifically for identifying structural domains. Of these, which give you the SCOP classification of the predicted domain?
  • Which domain prediction server(s) integrate secondary structure prediction and scoring into their workflow?
  • Which domain prediction server(s) will produce comparative models which you can download?
  • Which domain prediction server(s) include predictions for domains for which no solved structures are known?
  • If a structural domain prediction server predicts the presence of a structural domain (with a significant e-value) does that mean you can predict function? Why or why not? Give all the reasons.
  • Are all proteins of solved structure represented in SCOP?
  • What database stores the 3D coordinates of solved structures?
  • What is meant by protein structure prediction meta-servers?
  • What is meant by 2D-threading?
  • Describe the Rosetta method of protein structure prediction.
  • Explain the insight published by the Baker lab regarding the non-optimal local energetics of active site conformations in proteins.
  • What is meant by solvent accessibility?
  • What is the expected accuracy in predicting solvent accessibility?
  • What innovations in secondary structure prediction over the last couple of decades have led to improvements in accuracy, from the initial Chou-Fasman approach to the high-accuracy methods used today?

    Structure-function analysis

    Topics: Active site vs binding pocket characteristics; Identifying positions under diversifying selection. Evolutionary Trace and 3D structure analysis methods. Enzyme Classification.

  • Describe the inputs, outputs, and algorithm of Evolutionary Trace and the Eisenberg 3D cluster analysis methods.
  • What are the differences between binding pocket residues and catalytic residues? Do their conservation patterns differ?
  • Will two enzymes that have the same EC (enzyme classification) number up to the first 3 EC digits always have the same structure? Give an example to illustrate your answer.
  • If two sequences share about 40% identity, up to what level of the Enzyme Classification number can you predict with 90% accuracy?

    Phylogenetic tree construction and analysis; Phylogenomic analysis

    Topics: Evolution of gene families. Gene duplication, fission, fusion, repeats, horizontal transfer. Ortholog and paralog identification. Phylogenetic tree construction and analysis. Methods: NJ, Parsimony, ML; BETE, SATCHMO. Bootstrap.

  • What are the two main classes of phylogenetic reconstruction?
  • For each of the main methods in phylogenetic reconstruction (i.e., MP, ML, Neighbor-Joining) and the other heuristic approaches discussed in class (SATCHMO, BETE), to which class does it belong?
  • Describe the process of bootstrap analysis (i.e., what happens at each step of the analysis?)
  • What does a bootstrap value represent (concretely)?
  • Name and describe the fundamental assumptions of a phylogenetic analysis.
  • What does a node in a phylogenetic tree represent?
  • What does the root represent?
  • What do the branch lengths represent?
  • How do you root a phylogenetic tree? Name the two main methods and the issues associated with each.
  • Why is masking performed? Which of the fundamental assumptions of phylogenetic reconstruction is masking designed to address?
  • What level of bootstrap support should you have for a subtree to be considered supported by the analysis? (give a rough ballpark)
  • What does it mean to be monophyletic?
  • Is it scientifically correct to say all species are related by a Tree of Life? Why or why not?
  • What is the mathematical definition of a tree (in graph theoretical terms)?
  • Define ortholog.
  • Define paralog.
  • Define xenolog.
  • Does orthology imply that two genes have the same function?

    Protein localization

    Required reading: TBA
    Topics: Review of biological apparatus for cellular sorting; Methods: TMHMM, TargetP, Phobius.

  • Name one transmembrane prediction server and describe the method it uses to predict TM domains.
  • What is the most common source of false positive errors in TM prediction?
  • What does the Chen et al. paper suggest is one of the main reasons that method developers overestimate TM prediction accuracy?
  • Describe the earliest method of TM prediction developed.
  • Some types of signal peptides are located at the amino terminus, but at least one type is located in the middle of the protein. Which cellular compartment's localization signal is located in the middle of the protein?
  • Why is it important to know the cellular localization of proteins?
  • What is the difference between accuracy measures at a per-position basis, and measures based on segmental accuracy? How is "segmental accuracy" measured?

    Miscellaneous topics

    Required reading: TBA
    Topics: Predicting post-translational modifications using PROSITE; predicting protein-protein interactions using Rosetta Stone and phylogenetic profiles.

  • Explain what is meant by a Rosetta Stone protein and what it indicates.
  • Describe the phylogenetic profile method, what it aims to predict, and the sources of false positive (FP) and false negative (FN) errors.

    Miscellaneous questions about methods

  • A method that has very few false positives has which of the following? (choose all that apply): (a) high sensitivity, (b) high selectivity, (c) high coverage, (d) high specificity, (e) high recall, (f) high false negatives.
  • A method with few false negatives has which of the following? (choose all that apply): (a) high sensitivity, (b) high specificity, (c) low sensitivity, (d) low coverage, (e) high coverage, (f) high selectivity.
  • What is an ROC plot? What is it designed to show?
  • Name the benchmark databases used primarily for: fold prediction, multiple sequence alignment accuracy, cellular localization, active site residue prediction, subfamily identification, phylogenetic tree accuracy. If none are available, please note (N/A).
  • Define a false positive in the following contexts: fold prediction, TM prediction.
  • If A is homologous to B, and B is homologous to C, does it follow that A and C are homologous? Why or why not? Qualify your answer.
  • Is it possible for two proteins to be 50% homologous? Explain.
  • If one group publishes a method and says their specificity is 95% and a second group publishes a method and says their specificity is 98%, does it follow that the second group's method is superior?