Epilogue: Multiple Sequence Alignment
Learn multiple sequence alignment with the help of a three-dimensional Manhattan and a greedy multiple alignment algorithm.
We'll cover the following...
Amino acid sequences of proteins performing the same function are likely to be somewhat similar, but these similarities may be elusive in the case of distant species. You now possess an arsenal of algorithms for aligning pairs of sequences, but if sequence similarity is weak, pairwise alignment may not identify biologically related sequences. However, simultaneous comparison of many sequences often allows us to find similarities that pairwise sequence comparison fails to reveal. Bioinformaticians sometimes say that while pairwise alignment whispers, multiple alignment shouts.
Building a three-dimensional Manhattan
We’re now ready to use pairwise sequence analysis to build up our intuition for comparison of multiple sequences. In our three-way alignment of A-domains from the introduction, we found 19 conserved columns:
However, similarities between A-domains are not limited to these 19 columns, as we can find semi-conservative columns, each of which has two matching amino acids:
T-way alignment
A multiple alignment of strings , …, , also called a t-way alignment, is specified by a matrix having t rows, where the i-th row contains the symbols of in order, interspersed with space symbols. We also assume that no column in multiple alignments contains only space symbols. In the 3-way alignment below, we’ve highlighted the most popular symbol in each column using upper case letters:
The multiple alignment matrix is a generalization of the pairwise alignment matrix to more than two sequences. The three arrays shown below this alignment record the respective number of symbols in ATGTTATA, AGCGATCA, and ATCGTCTC encountered up to a given position. Together, these three arrays correspond to a path in a three-dimensional grid:
As the alignment graph for two sequences is a grid of squares, the alignment graph for three sequences is a grid of cubes. Every node in the 3-way alignment graph has up to seven incoming edges, as shown in the figure below.
The score of multiple alignments is defined as the sum of scores of the alignment columns (or, equivalently, weights of edges in the alignment path), with an optimal alignment being one that maximizes this score. In the case of an amino acid alphabet, we can use a very general scoring method that is defined by a t-dimensional matrix containing entries that describe the scores of all possible combinations of t symbols (representing 20 amino acids and the space symbol). Intuitively, we should reward more conserved columns with higher scores. For more details, see DETOUR: Scoring Multiple Alignments. Intuitively, we should reward more conserved columns with higher scores. For example, in the Multiple Longest Common Subsequence Problem, the score of a column is equal to 1 if all of the column’s symbols are identical, and 0 if even one symbol disagrees.
Multiple Alignment Problem
Problem overview:
Find the highest-scoring alignment between multiple strings under a given scoring matrix.
Input: A collection of t strings and a t-dimensional matrix Score.
Output: A multiple alignment of these strings whose score (as defined by the matrix Score) is maximized among all possible alignments of these strings.
Sample input:
PLEASANTLY
MEANLY
HAPPILY
Output:
3
PLE—ASANT----LY
—ME-A----N—LY
-----HA-----PPILY
A straightforward dynamic programming algorithm applied to the t-dimensional alignment graph solves the Multiple Alignment Problem for t strings. For three sequences and ...