Gibbs sampler versions

If you are running the Gibbs sampler locally, we suggest that you periodically check for updates and new version releases. Below are descriptions of a few specific changes to the Gibbs sampler that affect the web tutorial examples, along with the versions in which these changes occurred.

Palindromes

Previous versions (up to version 2.05) treated even-width and odd-width palindromes separately. An odd-width palindrome was specified by an odd motif width (e.g., 17 positions), with the center position in the model "on" but unpaired in the palindromic model. In more recent versions of Gibbs (versions 2.06 and higher) it is NOT necessary to specify even and odd palindromes separately.
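The pairing described above can be illustrated with a short Python sketch. It is not taken from the Gibbs source code; the function name and the 0-based position indexing are assumptions made only for illustration. In a palindromic model of a given width, position i is paired with position width - 1 - i, and an odd width leaves the center position unpaired.

```python
def palindrome_pairs(width):
    """Return (pairs, center) for a palindromic model of the given width.

    Hypothetical helper for illustration only; positions are 0-based.
    `pairs` lists the mirrored position pairs, and `center` is the index
    of the unpaired center position (None when the width is even).
    """
    if width <= 0:
        raise ValueError("width must be positive")
    pairs = [(i, width - 1 - i) for i in range(width // 2)]
    center = width // 2 if width % 2 == 1 else None
    return pairs, center


# A 17-position motif has 8 mirrored pairs and an unpaired center position.
pairs, center = palindrome_pairs(17)
print(len(pairs), center)
```

For an even width such as 16, the same function returns 8 pairs and no center position.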

Centroid solution

The Centroid solution option is available with versions 3.0 and higher.


Gibbs sampling modes

The Gibbs sampler currently (as of version 3.0) allows four sampling modes: site sampling, motif sampling, recursive sampling, and centroid sampling. These sampling modes were developed and implemented over time and, as listed, represent increasing levels of sophistication. Here we provide brief descriptions of appropriate uses for each of these sampling modes. For a more thorough description of the site, motif, and recursive sampling modes, and their use on the Gibbs web server, see our chapter in Current Protocols in Bioinformatics. For a more thorough description of the recursive and centroid sampling modes, and their use on the web server and at the command line, see our on-line tutorial on analysis of co-expression data.

Site Sampling

The Site Sampling mode was originally described in our first paper on Gibbs sampling for biological sequences (Lawrence et al., 1993). In this mode, the sampler will identify exactly one site per input sequence for a predicted motif. Given this restriction, the site sampling mode is not suitable for analysis of the high-throughput transcriptomics data that are now commonly generated. However, we continue to find site sampling useful for very specific cases.

Site sampling is appropriate when the input data consist of sequences for which you have a reasonable expectation that each sequence has one binding site for the transcription factor. Specifically, we use site sampling to analyze sequence data from DNaseI footprinting or EMSA (electrophoretic mobility shift assay) experiments. For example, a site sampling run for the seven DNaseI footprints of the E. coli PhoP transcription factor produces the following results.

Output from this example

Site sampling is invoked by:

Motif Sampling

The Motif Sampling mode was one of the first extensions to the Gibbs sampler, and was described in Neuwald et al. (1995). In this mode, the sampler will identify anywhere between zero sites and the maximum possible number of sites per sequence (e.g., a 50-base sequence could have at most five non-overlapping 10-mer sites). This allows the sampler quite a lot of freedom. This sampling mode is generally not as sensitive as recursive sampling or centroid sampling, but because it is less compute-intensive, it can be useful during initial, exploratory motif discovery tasks.
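The upper bound in the example above is simple integer division. The following Python sketch shows the arithmetic only; the function name is hypothetical and is not part of the Gibbs sampler.

```python
def max_nonoverlapping_sites(seq_len, motif_width):
    """Largest number of non-overlapping sites of motif_width that fit in seq_len.

    Hypothetical helper for illustration; it only reproduces the arithmetic
    in the text (sequence length divided by motif width, rounded down).
    """
    if seq_len < 0 or motif_width <= 0:
        raise ValueError("seq_len must be >= 0 and motif_width must be > 0")
    return seq_len // motif_width


# The example from the text: a 50-base sequence and a 10-mer motif.
print(max_nonoverlapping_sites(50, 10))
```

A sequence shorter than the motif width yields zero, which matches the lower end of the range the sampler may report in this mode.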

Motif sampling is appropriate when you have a reasonable expectation that the input sequences contain a common motif, but you are not certain that every sequence contains a site for the motif. It also suits cases in which some sequences may contain multiple sites for the motif and it is difficult to estimate an upper limit on the number of sites per sequence. One example in which we have found motif sampling useful is exploring bacterial genomes (specifically, the extracted intergenic sequences of a bacterial genome) for possible repetitive sequences. For example, a motif sampling run on all of the intergenic regions of the E. coli genome identifies the REP element (Rudd, 1998).

Output from this example

Note that in the example output file, we estimated a total of 100 motif sites at the command line. This is only an estimate; it directs the sampler to initiate the motif search by selecting 100 sites at random to build the initial model. The sampler, in fact, found many more than 100 sites for the strong REP motif. It is also important to note that we use this type of analysis only in an exploratory manner. We do not use the output as a definitive description of repetitive elements.

Motif sampling is invoked by:

Recursive Sampling

The Recursive Sampling mode implements a more advanced sampling algorithm than that previously used, and was described in Thompson et al. (2003). In this mode, the sampler will identify between zero sites and a maximum number of sites per sequence that is set by the user. This sampling mode is more compute-intensive than site sampling or motif sampling, but is typically more sensitive, and thus is currently the default mode for running the Gibbs Sampler on the web server.

Recursive sampling is appropriate when you have a reasonable expectation that the input sequences contain a common motif, and you can reasonably estimate an upper limit on the number of sites per sequence. We have used this sampling mode extensively for phylogenetic footprinting and analysis of co-expressed genes (e.g., Conlan et al., 2005 and Wan et al., 2004), and examples that use recursive sampling can be found on our tutorial pages.

Recursive sampling is invoked by:

Centroid Sampling

The Centroid Sampling mode is a modification of the Recursive Sampling mode. As in recursive sampling, the sampler will identify between zero sites and the maximum number of sites per sequence set by the user. Centroid sampling is the most recently developed sampling mode, and represents a significant departure from previous approaches, in that the algorithm does not search for an optimal solution, i.e., one that maximizes a motif score. Instead, in the centroid sampling mode, the algorithm provides a centroid motif solution: the alignment of sites that has the minimum total distance to the set of alignments sampled from the a posteriori probability distribution of alignments.
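The idea of a minimum-total-distance alignment can be illustrated with a small Python sketch. This is not the Gibbs sampler's implementation. It assumes that each sampled alignment is represented as a set of hypothetical (sequence index, start position) sites, that the distance between two alignments is the number of sites present in one but not the other, and that any set of sites is an allowed solution (constraints such as non-overlap and the per-sequence site limit are ignored). Under those assumptions, keeping every site that appears in more than half of the sampled alignments minimizes the total distance.

```python
from collections import Counter


def site_distance(a, b):
    """Number of sites present in exactly one of the two alignments."""
    return len(a ^ b)


def total_distance(candidate, samples):
    """Sum of distances from a candidate alignment to all sampled alignments."""
    return sum(site_distance(candidate, s) for s in samples)


def majority_centroid(samples):
    """Sites found in more than half of the sampled alignments.

    Illustrative assumption: alignments are sets of (sequence, position)
    tuples and no constraints on the solution are enforced.
    """
    counts = Counter(site for s in samples for site in s)
    return {site for site, n in counts.items() if 2 * n > len(samples)}


# Three made-up sampled alignments (toy data, not real sampler output).
samples = [
    {(0, 12), (1, 40)},
    {(0, 12), (1, 41)},
    {(0, 12), (1, 40), (2, 7)},
]
centroid = majority_centroid(samples)
print(sorted(centroid), total_distance(centroid, samples))
```

In this toy case the centroid keeps the two sites that recur across the samples and drops the two that appear only once, so its total distance is no larger than that of any individual sampled alignment.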

Centroid sampling is appropriate when you have a reasonable expectation that the input sequences contain a common motif, and you can reasonably estimate an upper limit on the number of sites per sequence. Examples using the centroid sampler can be found on our tutorial pages.

The Centroid Sampler is described in Thompson et al. (2007). The centroid sampler also allows the incorporation of a full phylogenetic model as described in Newberg et al. (2007).

Centroid sampling is invoked by:


