AN INTRODUCTION TO IDENTIFYING HOMOLOGOUS SEQUENCES

Sequence similarity searching to identify homologous sequences is one of the first, and most informative, steps in any analysis of newly determined sequences. Modern protein sequence databases are very comprehensive, so that more than 80% of metagenomic sequence samples typically share significant similarity with proteins in sequence databases. Widely used similarity searching programs, like BLAST (Altschul et al., 1997; units 3.3 and 3.4), PSI-BLAST (Altschul et al., 1997), SSEARCH (Smith and Waterman, 1981; Pearson, 1991; unit 3.10), FASTA (Pearson and Lipman, 1988; unit 3.9), and HMMER3 (Johnson et al., 2010), produce accurate statistical estimates, ensuring that protein sequences that share significant similarity also have similar structures. Similarity searching is effective and reliable because sequences that share significant similarity can be inferred to be homologous; that is, they share a common ancestor. The units in this chapter present practical strategies for identifying homologous sequences in DNA and protein databases (units 3.3, 3.4, 3.5, 3.9, 3.10). Once homologs have been found, more accurate alignments can be built from multiple sequence alignments (unit 3.7), which can also form the basis for more sensitive searches, phenotype prediction, and evolutionary analysis. While similarity searching is an effective and reliable strategy for identifying homologs (sequences that share a common evolutionary ancestor), most similarity searches seek to answer a much more challenging question: "Is there a related sequence with a similar function?"