This page provides guidance for users of both the BALSA web server and the stand-alone version of BALSA.
The Bayesian algorithm for local sequence alignment (BALSA) is a software package for pairwise sequence alignment that takes into account the uncertainty associated with alignment variables by incorporating in its forward sums a series of scoring matrices, gap parameters, and all possible alignments. The algorithm returns samples of alignments drawn from the posterior distribution, together with the posterior probabilities of the gap penalties and scoring matrices. In so doing, it incorporates information from the full ensemble of solutions, rather than only the single most probable alignment, which is the target of most algorithms. BALSA reports an ensemble centroid alignment, which is the alignment with the minimal Hamming distance from the ensemble of all sampled alignments. In addition, the algorithm returns credibility limits that provide a global assessment of the degree to which the members of the ensemble depart from the centroid. The ensemble centroid alignment yields tighter credibility limits on average than the optimal alignment. BALSA is described in Webb, B. J., J. S. Liu, et al. (2002). BALSA: Bayesian algorithm for local sequence alignment. Nucleic Acids Res 30(5): 1268-77; and Webb-Robertson, B. J., L. A. McCue, et al. (2008). Measuring Global Credibility with Application to Local Sequence Alignment. PLoS Comput Biol 4(5): e1000077.
Sequence alignment is one of the most important high-dimensional discrete prediction problems in biology. Sequence alignment methods commonly focus on identifying the highest scoring alignment between two sequences and assessing the statistical significance of this alignment. For a typical pair of sequences, there is a very large number of possible alignments, and this number grows rapidly with the length of the sequences being aligned. Regardless of the alignment procedure employed, when a single alignment is chosen for the comparison of two sequences, it is a point estimate selected from the large ensemble of all possible alignments. It is not surprising, given the immense size of the alignment space, that the most probable alignment, and thus every individual alignment, often has a very small probability.
Recognizing this, BALSA is based on the following (Webb-Robertson 2008):
Details of the probability calculations and sampling process can be found in the links above.
To make this more concrete, we'll examine the results from aligning a pair of DNA sequences. The sequences are the 1 kb upstream regions of a pair of orthologous genes, taken from a set of genes up-regulated in human skeletal muscle tissue and their mouse orthologs. The sequences are available for download here and here.
The sequences can be pasted into the text boxes or uploaded by clicking the browse button. We will use the mouse sequence as the query and the human sequence as the comparison sequence. We will choose a range of DNA PAM matrices, each with the default gap opening and gap extension penalties of -12 and -1, respectively.
You can run this example by selecting this link and clicking the Submit button.
Because BALSA samples alignments from the posterior distribution of all alignments, your results may vary slightly from the results shown below, but they should be substantially similar.
The first part of the output shows the posterior probabilities of each entered combination of scoring matrix, gap opening penalty, and gap extension penalty, given the sequence data. The fact that the PAM10 parameters have a probability of approximately 1 indicates the closeness of the sequences.
P(PAM10_DNA, Gap Opening Penalty=-12, Gap Extension Penalty=-1 | R1,R2) = 1
P(PAM50_DNA, Gap Opening Penalty=-12, Gap Extension Penalty=-1 | R1,R2) = 3.68284e-27
P(PAM100_DNA, Gap Opening Penalty=-12, Gap Extension Penalty=-1 | R1,R2) = 9.86766e-81
P(PAM200_DNA, Gap Opening Penalty=-12, Gap Extension Penalty=-1 | R1,R2) = 8.23545e-142
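Posterior probabilities this small are easiest to handle in log space. The sketch below is not BALSA's code; it only illustrates how a set of log marginal likelihoods, one per parameter combination, could be turned into posterior probabilities under the assumption of a uniform prior over the combinations entered. The function name and the log-likelihood values are hypothetical and chosen purely for illustration.

```python
import math


def posterior_from_log_likelihoods(log_liks):
    """Normalize log marginal likelihoods into posterior probabilities.

    Assumes a uniform prior over the parameter sets (an assumption of this
    sketch). Subtracting the maximum before exponentiating avoids underflow.
    """
    m = max(log_liks.values())
    weights = {name: math.exp(v - m) for name, v in log_liks.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Made-up log marginal likelihoods, NOT taken from a real BALSA run.
example = {"PAM10_DNA": -1000.0, "PAM50_DNA": -1060.0, "PAM100_DNA": -1185.0}
posterior = posterior_from_log_likelihoods(example)
for name, p in posterior.items():
    print(f"P({name} | R1,R2) = {p:.6g}")
```

A difference of only a few dozen log units between parameter sets is enough to push all but one posterior probability to effectively zero, which is the pattern seen in the output above.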
Next, an animated histogram of the ensemble centroid alignment is shown. The centroid alignment is the alignment with the minimal Hamming distance from the ensemble of all sampled alignments. In this case, the Hamming distance is the number of base pairs by which the two alignments differ. The centroid alignment meets the exclusive pairing and the colinearity constraints of the alignment problem, but it does not necessarily meet the common requirement that a gap in one sequence cannot be followed by a gap in the other sequence. The histogram shows the probability of base pairing for each possible pair of aligned bases.
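The definition of the centroid can be illustrated with a small sketch. It assumes each sampled alignment is represented as a set of (query position, comparison position) pairs, so that the Hamming distance between two alignments is the number of pairs found in exactly one of them. Under that representation, the set of pairs that occur in more than half of the samples minimizes the total Hamming distance to the samples. The function names and toy samples are hypothetical; this is not BALSA's implementation.

```python
from collections import Counter


def hamming(a, b):
    """Number of aligned pairs present in exactly one of the two alignments."""
    return len(set(a) ^ set(b))


def centroid(samples):
    """Pairs that occur in more than half of the sampled alignments."""
    counts = Counter(pair for s in samples for pair in set(s))
    half = len(samples) / 2
    return {pair for pair, c in counts.items() if c > half}


# Three toy sampled alignments, each a set of (i, j) aligned positions.
samples = [
    {(0, 0), (1, 1), (2, 2)},
    {(0, 0), (1, 1), (2, 3)},
    {(0, 0), (1, 2), (2, 3)},
]
c = centroid(samples)
print(sorted(c))                          # pairs kept by the majority rule
print([hamming(c, s) for s in samples])   # distance from centroid to each sample
```

Two pairs that conflict with each other (for example, the same base paired with two different partners) can never appear in the same sample, so they cannot both occur in more than half of the samples. This is why a centroid built this way still satisfies the exclusive pairing and colinearity constraints.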
The credibility limit is the minimum Hamming distance radius of a hyper-sphere containing a given percent of the posterior distribution. Centroid alignments dependably have tighter credibility limits than traditional maximum similarity alignments. The 85%, 90%, and 95% credibility limits are shown both as the minimum Hamming distance encompassing that percent of the sampled alignments and as a normalized distance. A perfect match would yield a normalized distance of zero, and in the case where the longest sampled alignment has no base pairings in common, the normalized distance is one. The normalized distance is useful for comparing alignments among pairs of sequences with differing lengths.
************ Credibility Limit(85,90,95) ************
The credibility limit of 85 % : 219 0.15477
The credibility limit of 90 % : 225 0.159011
The credibility limit of 95 % : 237 0.167491
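As a rough illustration of how such limits could be derived from samples, the sketch below takes the Hamming distance from the centroid to each sampled alignment and returns the smallest radius that covers the requested percentage of samples. The function name, the toy distances, and the `normalizer` argument are hypothetical; the exact normalization BALSA applies is not reproduced here and is simply passed in as an assumed maximum distance.

```python
def credibility_limit(distances, percent, normalizer=None):
    """Smallest radius r such that at least `percent` % of distances are <= r.

    `normalizer` is an assumed maximum possible distance used only to rescale
    the radius to the range [0, 1]; pass None to skip normalization.
    """
    ordered = sorted(distances)
    # Ceiling of percent * n / 100 using integer arithmetic only.
    k = max(1, -(-percent * len(ordered) // 100))
    radius = ordered[k - 1]
    normalized = None if normalizer is None else radius / normalizer
    return radius, normalized


# Toy distances from a centroid to ten sampled alignments (made up).
distances = [0, 1, 1, 2, 2, 2, 3, 3, 4, 10]
for pct in (85, 90, 95):
    radius, norm = credibility_limit(distances, pct, normalizer=20)
    print(f"The credibility limit of {pct} % : {radius} {norm}")
```

In the sample output above, dividing each reported Hamming distance by its normalized distance gives roughly 1415 in all three cases, which is consistent with a single normalizing constant being used for this pair of sequences.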
Finally, a histogram of the sampled alignments is shown. The histogram shows the posterior probability of any two bases being paired.