Background Composition

Sites are sampled based on the ratio of the motif probability to the background probability, as described by equations 1 - 4. By default the background probability is calculated by equation 2, which assumes a uniform background model throughout the sequences. Variation in local base composition can adversely affect sequence alignment. Because such variation can be complex in untranscribed sequence, and because binding motifs are often AT- or GC-rich, these adverse effects can be difficult to control using existing masking algorithms. As an alternative, the alignment algorithm can employ a heterogeneous background model of sequence composition to account for these variations. A two-step process is employed. First, each input sequence is analyzed for heterogeneity in base composition using the Bayesian segmentation algorithm (Liu 1999). This algorithm returns, for each position in the sequence, the probabilities of observing each of the four bases, p^0_{i,b}, i = 1..I, b = {A,T,C,G}, where I is the length of the sequence. These probabilities reflect both the sequence's composition heterogeneity and the uncertainty in this heterogeneity. Second, the extended Gibbs sampling algorithm incorporates this information as a local background model. Specifically, the probability that the sequence segment R_a, R_{a+1}, ..., R_{a+w}, where R_v is the base at position v in the sequence, is sampled as a motif site is proportional to the ratio of the probabilities of the segment under the site model, p^m_v, versus the background, i.e.

Equation 5

where w is the width of the site model.
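The effect of a position-specific background on this ratio can be illustrated with a small sketch. This is not the program's implementation: the function names, the 0-based indexing, the representation of the site model and background as lists of per-position dictionaries, and all probability values are assumptions made for illustration. In particular, the background values are made up rather than produced by the Bayesian segmentation algorithm, and normalizing over start positions is a simplification of the full sampling step.

```python
import math

def site_log_ratio(seq, a, motif, background):
    """Log of the site-model-to-background probability ratio for the
    segment of width len(motif) starting at 0-based position a.

    motif:      one dict per site column, mapping base -> probability
    background: one dict per sequence position, mapping base -> probability
                (hypothetical stand-in for the segmentation output)
    """
    total = 0.0
    for j, column in enumerate(motif):
        base = seq[a + j]
        total += math.log(column[base]) - math.log(background[a + j][base])
    return total

def site_sampling_probs(seq, motif, background):
    """Normalize the ratios over all start positions (illustrative only)."""
    w = len(motif)
    logs = [site_log_ratio(seq, a, motif, background)
            for a in range(len(seq) - w + 1)]
    top = max(logs)  # subtract the maximum before exponentiating
    weights = [math.exp(x - top) for x in logs]
    z = sum(weights)
    return [x / z for x in weights]

# Toy data: a width-2 site model favoring "AT".
seq = "GCGCATAT"
motif = [
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
]
# Uniform background: the same probabilities at every position.
uniform = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25} for _ in seq]
# Local background: GC-rich first half, AT-rich second half (made-up values).
gc_rich = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}
at_rich = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}
local = [gc_rich] * 4 + [at_rich] * 4

ratio_uniform = math.exp(site_log_ratio(seq, 4, motif, uniform))
ratio_local = math.exp(site_log_ratio(seq, 4, motif, local))
probs_local = site_sampling_probs(seq, motif, local)
```

In this toy case the "AT" segment at position 4 lies in an AT-rich region, so its ratio under the local background is smaller than under the uniform background: the local model discounts matches that are partly explained by the surrounding composition.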