Significance of Pairwise Sequence Alignment Scores

Objective: Measuring the statistical significance of extreme sequence alignment scores is key to many important applications, but it is difficult. To approximate alignment score significance precisely, we draw random samples directly from a well-chosen importance-sampling probability distribution.

Inputs and Run Time: This server allows application of our technique to pairwise sequence alignment of nucleic acid and amino acid sequences. To keep the run time short, the precision for larger sequence lengths will not be as good as that for shorter sequence lengths. Specifically, the run time is approximately O(mns), where m and n are the sequence lengths and s is the number of samples. The run time is capped at mns = 1 × 10^9, approximately 1 hour, by reducing the number of samples as necessary. Please examine the output to discover how many samples were permitted.
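As a rough guide to how many samples a request may be granted, the cap can be sketched as below. This is a minimal illustration that assumes the server simply takes the largest whole number of samples satisfying mns ≤ 10^9; the function name `permitted_samples` is hypothetical, and the server's actual rounding may differ, so the output of a run remains the authoritative source.

```python
RUN_TIME_CAP = 1_000_000_000  # cap on m * n * s stated in the text above


def permitted_samples(m: int, n: int, requested: int, cap: int = RUN_TIME_CAP) -> int:
    """Return the largest sample count s <= requested with m * n * s <= cap.

    Assumption: the cap is enforced by integer truncation. The server may
    round differently or impose a minimum number of samples.
    """
    if m <= 0 or n <= 0 or requested <= 0:
        raise ValueError("sequence lengths and sample count must be positive")
    return min(requested, cap // (m * n))


if __name__ == "__main__":
    # The default 40 x 40 request with 1000 samples is far below the cap.
    print(permitted_samples(40, 40, 1000))
    # Two sequences of length 10000 leave room for only a few samples.
    print(permitted_samples(10_000, 10_000, 1000))
```

Under this assumption, precision for long sequences degrades because the cap, not the request, determines the sample count.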

Default values are 1000 samples for a 40 × 40 local sequence alignment of amino acid sequences using the BLOSUM62 scoring matrix, SWISSPROT residue frequencies, an insertion start score of -12, and an insertion extension score of -1; this set of values has a run time of about 10 seconds.

Outputs and Temperature: The temperature that parameterizes the importance sampling distribution will be chosen so that approximately half the generated importance samples contribute a non-zero value to the importance sampling sum that determines the p-value of the specified target score. The server provides p-values for scores near the target score.
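To illustrate the role of the temperature, here is a toy sketch of importance sampling for a tail probability. It is not the server's sampling distribution: it assumes a much simpler ungapped score that is the sum of n independent per-position scores (+1 with probability `p_match`, -1 otherwise), and it samples from an exponentially tilted distribution Q(x) proportional to P(x) exp(score(x) / T). All names are hypothetical. The function also reports the fraction of samples that contribute a non-zero term, which is the quantity the temperature is tuned to bring near one half.

```python
import math
import random


def tail_probability_is(n, p_match, target, temperature, samples, seed=0):
    """Estimate P(S >= target) by importance sampling (toy model).

    S is the sum of n independent scores, each +1 with probability p_match
    and -1 otherwise. Returns (estimate, fraction of non-zero samples).
    """
    rng = random.Random(seed)
    beta = 1.0 / temperature
    # Normalizer of the tilted per-position distribution Q(x) ~ P(x) * exp(x / T).
    z = p_match * math.exp(beta) + (1.0 - p_match) * math.exp(-beta)
    q_match = p_match * math.exp(beta) / z

    total = 0.0
    nonzero = 0
    for _ in range(samples):
        score = sum(1 if rng.random() < q_match else -1 for _ in range(n))
        if score >= target:
            # Likelihood ratio P(x) / Q(x) = z**n * exp(-score / T).
            total += math.exp(n * math.log(z) - beta * score)
            nonzero += 1
    return total / samples, nonzero / samples


if __name__ == "__main__":
    # With p_match = 0.25, T = 1 / ln 3 tilts the match probability to 0.75,
    # so the mean sampled score (20) sits at the target score.
    estimate, fraction = tail_probability_is(40, 0.25, 20, 1.0 / math.log(3.0), 20000)
    print(estimate, fraction)
```

If the temperature is too high, almost no samples reach the target score; if it is too low, the samples overshoot the target and their weights become negligible. A temperature that puts roughly half the samples at or above the target balances the two.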

To find the p-value that interests you, find the row beginning with the text "RESULT" followed by your score of interest, and read across the line to discover the estimate of the score's p-value, as well as various statistics to help you evaluate your confidence in the estimate. If the number of samples is not large enough, there may be some scores for which no p-value is computed. In this case, the mathematics indicates use of the p-value for the first higher score with an available p-value, but instead we recommend rerunning the simulation with more samples or a more appropriate target score.

Be aware that the p-value estimates for scores outside the central range of the displayed scores can be imprecise. Also note that some of the p-values listed in the 3rd output column are too small to represent as double-precision floating-point numbers and will underflow; to avoid this problem, work instead with the base-10 logarithm in the 6th output column.
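The following sketch shows one way to work with the base-10 logarithm of a p-value without ever forming the p-value itself. It is an illustration only: the helper names are hypothetical, and the Bonferroni correction is just one example of a calculation that can stay in log space. Doubles cannot represent positive values much below about 10^-308 (about 5 × 10^-324 with subnormals), so anything smaller becomes zero if exponentiated.

```python
import math


def log10_bonferroni(log10_p: float, num_tests: int) -> float:
    """Return log10(min(1, p * num_tests)) computed from log10(p)."""
    return min(0.0, log10_p + math.log10(num_tests))


def format_log10_p(log10_p: float) -> str:
    """Render 10**log10_p in scientific notation without computing it."""
    exponent = math.floor(log10_p)
    # The mantissa lies in [1, 10) before rounding to two decimals.
    mantissa = round(10.0 ** (log10_p - exponent), 2)
    if mantissa >= 10.0:  # rounding pushed the mantissa up to 10.00
        mantissa /= 10.0
        exponent += 1
    return f"{mantissa:.2f}e{exponent}"


if __name__ == "__main__":
    log10_p = -400.5  # hypothetical value read from the 6th output column
    print(format_log10_p(log10_p))
    print(log10_bonferroni(log10_p, 1000))
```

Sums and products of very small p-values can be handled in the same way, by adding logarithms rather than multiplying the underlying probabilities.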

Citing: If you use this technique or this server in your work, please cite the following in your publications:

Lee A. Newberg (2008) Significance of gapped sequence alignments. J Comput Biol, 15(9), 1187-1194. doi: 10.1089/cmb.2008.0125.

Target score: (less than maximum possible score)

Alignment type: local (Smith-Waterman) or global (Needleman-Wunsch)

Length of first sequence:

Length of second sequence:

Number of samples:

Insertion start score (enter as negative):

Insertion extension score (enter as negative):

Number of letters (e.g., DNA = 4, proteins = 20):

Equilibrium distribution of nucleotides/residues:

Scoring Matrix: