# Bayesian Algorithm for Local Sequence Alignment

## BALSA

This page provides guidance for users of both the BALSA web server and the stand-alone version of BALSA.

The Bayesian algorithm for local sequence alignment (BALSA) is a software package for pairwise sequence alignment that takes into account the uncertainty associated with alignment variables by incorporating in its forward sums a series of scoring matrices, gap parameters, and all possible alignments. The algorithm returns samples of alignments drawn from the posterior distribution, together with the posterior probabilities of the gap penalties and scoring matrices. In so doing, it incorporates information from the full ensemble of solutions, rather than only the single most probable alignment, which is the target of most algorithms. BALSA reports an ensemble centroid alignment, which is the alignment with the minimal Hamming distance from the ensemble of all sampled alignments. In addition, the algorithm returns credibility limits that provide a global assessment of the degree to which the members of the ensemble depart from the centroid. The ensemble centroid alignment yields tighter credibility limits on average than the optimal alignment. BALSA is described in Webb, B. J., J. S. Liu, et al. (2002). BALSA: Bayesian algorithm for local sequence alignment. Nucleic Acids Res 30(5): 1268-77; and Webb-Robertson, B. J., L. A. McCue, et al. (2008). Measuring Global Credibility with Application to Local Sequence Alignment. PLoS Comput Biol 4(5): e1000077.

Sequence alignment is one of the most important high-dimensional discrete prediction problems in biology. Sequence alignment methods commonly focus on identifying the highest scoring alignment between two sequences and assessing the statistical significance of this alignment. For a typical pair of sequences, there is a very large number of possible alignments, and this number grows rapidly with the length of the sequences being aligned. Regardless of the alignment procedure employed, when a single alignment is chosen for the comparison of two sequences, it is a point estimate selected from the large ensemble of all possible alignments. Given the immense size of the alignment space, it is not surprising that even the most probable alignment, and thus every individual alignment, often has a very small probability.

## Sequence Alignment

Recognizing this, BALSA is based on the following (Webb-Robertson 2008):

• BALSA samples a scoring matrix and gap penalties from their posterior distribution based on the set of parameters supplied by the user. Conditioning on these, a backtrace is sampled. This process is repeated 1000 times to sample a series of alignments.
• The strength of the recommendation of the data for any specific alignment is equal to its posterior probability under the assumed probabilistic model.
• A credibility limit is the radius of the smallest hyper-sphere around a proposed estimate that contains a specified proportion of the probability mass of the posterior distribution, where the radius is measured by the number of elements by which two solutions differ. The size of this limit characterizes an estimate's credibility.
• The estimate with the minimum credibility limit best represents the ensemble.
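The credibility limit defined in the list above can be sketched in a few lines. This is a minimal illustration, not BALSA's own code: it assumes an alignment is represented as a set of `(i, j)` index pairs (base `i` of one sequence paired with base `j` of the other), and the names `hamming` and `credibility_limit` are hypothetical. The posterior mass is approximated by the fraction of sampled alignments.

```python
import math


def hamming(a, b):
    """Number of aligned pairs that appear in exactly one of the two alignments."""
    return len(a ^ b)


def credibility_limit(estimate, samples, mass=0.95):
    """Smallest Hamming radius around `estimate` that contains at least
    a fraction `mass` of the sampled alignments."""
    distances = sorted(hamming(estimate, s) for s in samples)
    needed = math.ceil(mass * len(distances))  # samples the ball must contain
    return distances[needed - 1]


# Toy ensemble, invented for illustration: four sampled alignments.
samples = [
    frozenset({(0, 0), (1, 1), (2, 2)}),
    frozenset({(0, 0), (1, 1), (2, 2)}),
    frozenset({(0, 0), (1, 1)}),
    frozenset({(0, 1), (1, 2)}),
]
estimate = frozenset({(0, 0), (1, 1), (2, 2)})

for mass in (0.5, 0.75, 1.0):
    print(mass, credibility_limit(estimate, samples, mass))
```

Under this reading, comparing two candidate estimates amounts to calling `credibility_limit` on each with the same samples and preferring the one with the smaller radius.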

Details of the probability calculations and sampling process can be found in the links above.
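The two-stage sampling loop described in the first bullet can be outlined as follows. This is only a sketch of the control flow: `sample_alignments` is a hypothetical name, the posterior over parameter settings is passed in as a plain dictionary, and `sample_backtrace` is a placeholder for the stochastic traceback, which is not implemented here.

```python
import random


def sample_alignments(posterior, sample_backtrace, n_samples=1000, seed=None):
    """Draw a parameter setting from its posterior, then draw one alignment
    conditional on that setting; repeat n_samples times."""
    rng = random.Random(seed)
    settings = list(posterior)
    weights = [posterior[s] for s in settings]
    alignments = []
    for _ in range(n_samples):
        setting = rng.choices(settings, weights=weights, k=1)[0]
        alignments.append(sample_backtrace(setting, rng))
    return alignments


def toy_backtrace(setting, rng):
    """Placeholder (assumption): returns a fixed alignment instead of
    sampling a backtrace through the forward-sum matrix."""
    return frozenset({(0, 0), (1, 1)})


# Hypothetical posterior over (matrix, gap opening, gap extension) settings.
posterior = {("PAM10_DNA", -12, -1): 0.9, ("PAM50_DNA", -12, -1): 0.1}
draws = sample_alignments(posterior, toy_backtrace, n_samples=1000, seed=0)
print(len(draws))
```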

## An Example

To make this more concrete, we'll examine the results from aligning a pair of DNA sequences. The sequences are 1 KB upstream sequences of a pair of orthologous genes taken from a set of genes up-regulated in human skeletal muscle tissue and their mouse orthologs. The sequences are available for download here and here.

The sequences can be pasted into the text boxes or uploaded by clicking the browse button. We will use the mouse sequence as the query and the human sequence as the comparison sequence. We will choose a range of DNA PAM matrices, each with the default gap opening and gap extension penalties of -12 and -1, respectively.

You can run this example by selecting this link and clicking the Submit button.

Because BALSA samples alignments from the posterior distribution of all alignments, your results may vary slightly from the results shown below, but they should be substantially similar.

The first part of the output shows the posterior probabilities of the entered scoring matrices and gap opening and extension penalties, given the sequence data. The fact that the PAM10 parameters have a probability of approximately 1 indicates the closeness of the sequences.

```
P(PAM10_DNA, Gap Opening Penalty=-12, Gap Extension Penalty=-1 | R1,R2) = 1
P(PAM50_DNA, Gap Opening Penalty=-12, Gap Extension Penalty=-1 | R1,R2) = 3.68284e-27
P(PAM100_DNA, Gap Opening Penalty=-12, Gap Extension Penalty=-1 | R1,R2) = 9.86766e-81
P(PAM200_DNA, Gap Opening Penalty=-12, Gap Extension Penalty=-1 | R1,R2) = 8.23545e-142
```
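Posterior probabilities of this kind follow from Bayes' rule applied to the likelihood of the two sequences under each parameter setting. The sketch below shows the normalization step only. It assumes a uniform prior over the settings, and the log-likelihood values are invented for illustration; they are not the forward sums BALSA computed for this example. The function name `posterior_over_settings` is hypothetical.

```python
import math


def posterior_over_settings(log_likelihoods, log_priors=None):
    """Normalize log P(R1, R2 | setting) + log P(setting) into posterior
    probabilities, using log-sum-exp to avoid underflow."""
    names = list(log_likelihoods)
    if log_priors is None:
        # Assumption: uniform prior over the supplied settings.
        log_priors = {name: 0.0 for name in names}
    log_joint = [log_likelihoods[n] + log_priors[n] for n in names]
    top = max(log_joint)
    log_norm = top + math.log(sum(math.exp(v - top) for v in log_joint))
    return {n: math.exp(v - log_norm) for n, v in zip(names, log_joint)}


# Hypothetical log-likelihoods, one per scoring matrix (gap penalties fixed).
log_lik = {
    "PAM10_DNA": -1200.0,
    "PAM50_DNA": -1260.0,
    "PAM100_DNA": -1385.0,
    "PAM200_DNA": -1525.0,
}
post = posterior_over_settings(log_lik)
for name, p in post.items():
    print(f"P({name} | R1,R2) = {p:.6g}")
```

Because the differences in log-likelihood are large, nearly all of the posterior mass falls on a single setting, which is the pattern seen in the output above.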

Next, an animated histogram of the ensemble centroid alignment is shown. The centroid alignment is the alignment with the minimal Hamming distance from the ensemble of all sampled alignments. In this case, the Hamming distance is the number of base pairs by which the two alignments differ. The centroid alignment meets the exclusive pairing and the colinearity constraints of the alignment problem, but it does not necessarily meet the common requirement that a gap in one sequence cannot be followed by a gap in the other sequence. The histogram shows the probability of base pairing for each possible pair of aligned bases.
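One way to see why the centroid satisfies some alignment constraints but not others is the sketch below. It assumes alignments are sets of `(i, j)` index pairs and that the centroid is chosen among arbitrary sets of pairs; under that assumption, the set minimizing the total Hamming distance to the samples consists of exactly those pairs that occur in more than half of the samples. The function names are hypothetical and this is not BALSA's implementation.

```python
from collections import Counter


def pair_probabilities(samples):
    """Fraction of sampled alignments in which each (i, j) pair is aligned."""
    counts = Counter(pair for s in samples for pair in s)
    n = len(samples)
    return {pair: c / n for pair, c in counts.items()}


def centroid(samples):
    """Pairs aligned in more than half of the samples. Including a pair
    lowers the total Hamming distance only if it is present in a majority."""
    probs = pair_probabilities(samples)
    return frozenset(pair for pair, p in probs.items() if p > 0.5)


# Toy ensemble, invented for illustration.
samples = [
    frozenset({(0, 0), (1, 1), (2, 2)}),
    frozenset({(0, 0), (1, 1), (2, 2)}),
    frozenset({(0, 0), (1, 1)}),
    frozenset({(0, 1), (1, 2)}),
]
print(sorted(centroid(samples)))
```

Two pairs that can never occur in the same valid alignment, because they share a base or cross each other, have probabilities summing to at most 1, so they cannot both exceed 0.5. Exclusive pairing and colinearity therefore hold automatically, while nothing in the majority rule prevents a gap in one sequence from being followed by a gap in the other.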

## Centroid

The credibility limit is the minimum Hamming distance radius of a hyper-sphere containing a given percent of the posterior distribution. Centroid alignments dependably have tighter credibility limits than traditional maximum similarity alignments. The 85, 90 and 95% credibility limits are shown both as the minimum Hamming distance encompassing that percent of the sampled alignments and as a normalized distance. A perfect match would yield a normalized distance of zero, and in the case where the longest sampled alignment has no base pairings in common, the normalized distance is one. The normalized distance is useful for comparing alignments among pairs of sequences with differing lengths.

```
               ************ Credibility Limit(85,90,95) ************

The credibility limit of  85 %  :  219 0.15477
The credibility limit of  90 %  :  225 0.159011
The credibility limit of  95 %  :  237 0.167491
```
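The relationship between the two numbers on each line can be checked with simple arithmetic. The snippet below divides each reported Hamming distance by its normalized value to recover the implied maximum distance. How BALSA computes that maximum is not stated here, so the snippet only shows that the same normalizing constant is implied on all three lines; the name `implied_maximum` is hypothetical.

```python
# Values copied from the output above: level -> (Hamming distance, normalized).
reported = {
    85: (219, 0.15477),
    90: (225, 0.159011),
    95: (237, 0.167491),
}


def implied_maximum(distance, normalized):
    """Maximum distance implied by normalized = distance / maximum."""
    return distance / normalized


for level, (distance, normalized) in reported.items():
    print(level, round(implied_maximum(distance, normalized)))
```

All three lines imply a maximum of about 1415, so for this pair of sequences the 95% limit of 237 corresponds to roughly 17% of the largest possible distance.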

Finally, a histogram of the sampled alignments is shown. The histogram shows the posterior probability of any two bases being paired.

## Histogram
