Prokaryotic Co-expression Data Analysis Tutorial

These pages will provide some guidance to users of both the Gibbs sampler web server as well as command line users of the stand-alone version of Gibbs. The Gibbs sampler offers a dizzying array of options. Some of these offer the opportunity to model specific information you may have about the data that you wish to analyze, reflecting knowledge of the biology in your experiments. Others are largely technical and meant for advanced users who wish to control details of how the sampling is done. We will focus these help pages on the options we use that help us model the biology of transcription regulation.

Described below are approaches we have used to examine co-expression data from microarray experiments and promoter fusion experiments. In both cases, we assume that genes are co-expressed because they are under the control of the same transcription factor(s), and we use Gibbs sampling to try to indentify the putative binding motif for this factor(s). The basic difference is that we expect microarray data to contain more noise (false positives) than promoter fusion experiments, and make some adjustments to our prior expectations that each promoter will contain a site for a motif. It is also important to remember that Gibbs sampling is a stochastic procedure, thus you will observe chance differences from run to run, particularly in noisy data.

First, careful attention to the input sequence data is important. Generally in bacteria, genes encoded in an operon are co-expressed, though only the DNA sequence upstream of the 5´ most proximal gene has a promoter, and hence transcription factor binding sites. We generally do not include sequences upstream of genes for which the next gene immediately upstream is encoded on the same strand, and there is < 50 bp of upstream intergenic sequence. In addition, when two genes are encoded on opposite strands and share an upstream integenic region (divergently transcribed genes), this region should only be included in an input sequence file once. There is no need to include such an upstream sequence twice, using both strands of the DNA. In fact, this is detrimental, because it provides the sampler with two copies of the same DNA sequence.

When searching for transcription factor (TF) binding sites in co-expression data, we use the centroid sampling or recursive sampling mode of the Gibbs sampler. Parameters that we use for this type of data (bacterial co-expresssion data) include:

The use of these parameters, and the reasons for their choice are described below.

Motif model width and fragmentation

The above listed range of motif widths (16-24) seems to cover an average width, given current knowledge of known bacterial TF binding sites, and furthermore, spans a couple turns of the DNA double helix. The parameters we use for width and palindromic models are designed to capture the features of binding sites for a classic bacterial helix-turn-helix (HTH) type transcription factor: HTH-type TFs are typically symmetric homodimers, thus they bind to symmetric (palindromic) DNA binding sites. Furthermore, the two HTH regions of the dimeric TF typically contact bases in two adjacent major grooves of the DNA, and thus the two halves of the palindromic binding site span well over 10 bases (the approximate number of bases per helical turn of B-form DNA). The bases contacted by a TF are not necessarily contiguous, thus we use fragmentation to allow the Gibbs sampler to ignore positions which do not participate in the protein-DNA interaction, and are therefore not conserved as part of the binding site.

The images below illustrate these ideas. E.coli Crp binding sites are palindromic, and the Crp motif is fragmented such that the central six positions are ignored, and the first eight and last eight positions (with stars below) make up the motif model. The X-ray crystal structure of Crp bound to DNA illustrates why the central bases do not appear conserved and may be ignored - those bases between the two halves of the palindromic binding site do not contact the protein.

aligned Crp binding sites:
TAATGTGAGTTAGCTCACTCAT
TTCTGTAACAGAGATCACACAA
TTTCGTGATGTTGCTTGCAAAA
AATTGTGACACAGTGCAAATTC
ATGCCTGACGGAGTTCACACTT
GATTGTGATTCGATTCACATTT
TGTTGTGATGTGGTTAACCCAA
CGGTGTGAAATACCGCACAGAT
ATTTGTGAGTGGTCGCACATAT
********      ********

CRP Dimer Homo dimeric structure indicates symmetric model (GAD)

The E.coli Crp homodimer binding to DNA.

Palindromes

As mentioned above, we often specify a palindromic model, when we have a reasonable expectation that binding sites in the sequence data may bind a HTH-type transcription factor. When using a palindromic model on the web server, reverse complementation of the input sequence data is automatically disabled. Using stand-alone Gibbs, reverse complementation needs to be disabled at the command line. Not doing this generally causes the motif model to become overly strong in one half and noticably weaker in the other half. This occurs because, if we specified 16 positions for the motif model and -R 1,1,8 for the palindrome in a Gibbs run, then the palindromic model is formed by combining the observed bases in the first 8 positions with the reverse complements of the last 8 positions. With reverse complemention of the sequence data also enabled, Gibbs will orient the sites to achieve the best possible alignment in one half, and when the two halves are combined, the overall motif model scores better; however you end up with a lopsided motif model.

It is also important to note that with the recent versions of the Gibbs sampler you need NOT do anything special to identify even width and odd width palindromes (as was necessary with older versions), the sampler uses fragmentation to determine the total width of the motif. For example, using a motif model width of 16 positions with the palindrome and fragmentation options -R 1,1,8 -M 1,24 (as in the examples below), an even width palindrome (PurR example) would be identified with 16 "on" positions and possibly 0, 2, 4, 6, or 8 positions fragmented ("off"), whereas an odd width palindrome (NtrC example) would be identified with 16 "on" positions and possibly 1, 3, 5, or 7 positions fragmented.

glnA logo - disable rev comp glnA logo - rev comp

The sequence logos above illustrate the effect of using reverse complementation with a palindromic model. The first image was created from Gibbs output with reverse complementation disabled. For the second image, Gibbs was allowed to reverse complement the motif sites. Notice how positions 11, 14, 16, and 17 have been strengthened while positions 2, 4, and 7 are weaker than in the first image. Logos can be generated online

These width and palindromic parameters have proven valuable in our research on bacterial transcription regulation. However, each user should evaluate whether palindromic models are appropriate or likely to reflect the biology in their system. For example, whether to use palindromic models should be carefully considered for species that do not encode a large proportion of HTH-type transcription factors. We typically run Gibbs multiple times on a data set while varying the width and palindrome parameters and compare the results. The examples provided below show the best results for each particular sequence data set.

Background composition

Knowing that DNA (particularly non-coding DNA) may vary locally in composition, we use a Bayesian segmentation algorithm to determine the position-specific composition of the input sequence data. This provides the probabilities of observing each of the four DNA bases at each position of the input sequences; these probabilities are then used in the background model during Gibbs sampling. If you choose not to use a position-specific background model, Gibbs will calculate and use a homogeneous background composition.

For example, the 500 bp region upstream of the translation start of YDR226W from Saccharomyces cerevisiae is composed (overall) of 26% A, 38% T, 18% C, and 16% G. However, the Bayesian segmentation algorithm (Liu,J. and Lawrence,C.E. (1999)) shows that the composition of this 500 bp region is heterogenous; that the probability of observing each of the four bases varies considerably over this DNA region (see below). The Gibbs web server will (by default) calculate and use this position-specific background composition. The stand-alone version of Gibbs, however, requires that you calculate this position-specific background composition and specify the composition file at the command line. If you do not do this, Gibbs will calculate and use a homogeneous background composition: p(A)=0.26, p(T)=0.38, p(C)=0.18, p(G)=0.16 for every position of the 500 bp region upstream of YDR226W.

Yeast background composition

The compositional variation of a 500 bp region upstream of the translation start site of the YDR226W/ADK1 gene from Saccharomyces cerevisiae. From Thompson et al. The probability of each base at each position is calculated by the Bayesian segmentation algorithm, Liu,J. and Lawrence,C.E. (1999). This algorithm returns the probabilities of observing each of the four bases at each position in the sequence.

The number of sites per sequence

As an optional parameter, Gibbs allows you to provide an estimate of the number of binding sites in the input data - this number affects the initial motif. Specifically, Gibbs initiates a motif search by selecting sites at random to build the initial model. If you provide an estimate of the number of binding sites, you are affecting how many sites are selected for this initial model. If not supplied, Gibbs will select one site per sequence by default. We have found this default to be adequate for most data sets. Furthermore, the Gibbs sampler is insensitive to small changes in this value.

A required parameter for the recursive sampling and the centroid sampling modes (which are used in the examples below), is the parameter for specifying the maximum number of sites per sequence. We recommend these sampling modes, and thus recommend using this parameter for the majority of motif search experiments. The particular biological reasoning behind this parameter is that we know that many TFs bind to more than one site in a promoter and often bind cooperatively, therefore we want to look for more than one site per sequence for a motif model ( -E 3 in the examples). For most applications, it is sufficient to simply provide this maximum number of sites, which will initialize Gibbs sampling with equal probabilities for each of the possible number of sites per sequence (e.g.: p = 0.25 for each 0, 1, 2, or 3 sites per sequence). However for some cases, we provide prior probabilities for the number of sites per sequence based on our confidence that each promoter sequence in our data will have a site in the motif. The promoter fusion example given below uses this option, and decisions about choosing these prior probabilites are described.

The Wilcoxon signed-rank test

In order to determine the statistical significance of a motif, the Wilcoxon signed-rank test can also be included as a parameter during a Gibbs run. This test formalizes the use of negative controls and is thus very useful. Briefly, given a set of negative controls in addition to the sequence data under study, the Wilcoxon signed-rank test will test the null hypothesis that sites are no more likely to occur in the study sequences than in the negative controls; the ranks of the scores of each site in the motif are also incorporated into the calculation of the p-value during this test. With stand-alone Gibbs run at the command line, you can provide Gibbs with a separate fasta file of negative control sequences; the requirement is that for each of the study sequences there be one control sequence of equal length. Alternatively, Gibbs will randomly shuffle each of the study sequences and use these "random" sequences as negative controls. This is the only option available on the Gibbs web server, and is the option invoked if a separate file of negative controls is not provided to stand-alone Gibbs at the command line. We use the Wilcoxon test frequently to evaluate the statistical significance of a motif of interest; however, it is important for users to note that motifs from a Gibbs run with the Wilcoxon test enabled should not be used as the final motif model for further analysis or publication, as these motifs typically include some (poorly matching) negative control sites.


Common paramters to use with the Gibbs Recursive Sampler and the Gibbs Centroid Sampler

Standard input and options for bacterial data:

fasta_file
width
-n specifies the DNA alphabet
-r disables reverse-complementation of the input sequence data
-R t,i,j specifies a palindromic model for motif model t for positions i to j
-M t,w allows the motif model t to fragment up to a width w (the positions of the motif model need not be contiguous)
-E m specifies the maximum number of sites per sequence that may be sampled into the motif model
-S num specifies the number of seeds (i.e.: the number of times that the Gibbs sampler is restarted with separate initial models)
-o name a name for the Gibbs output file
-B file the position-specific composition file, which contains the background composition for the input fasta sequences1

1 This file is generated automatically when "Background Model" is checked on the Gibbs web form. For stand-alone Gibbs, this file must be generated by the program "unified.cpp", which is provided with the Gibbs sampler distribution.


Examples: Analysis of co-expression data from a promoter fusion study and from a microarray study

Using co-expression data from promoter fusion studies, we expect little if any inclusion of false positives, and depending on the design of the experiment, we may reasonably expect that the expression observed is the result of a single common regulatory mechanism. Thus we can design a Gibbs sampling run in such a way as to capitalize on these expectations. However, using co-expression data from microarray studies, our expectations are somewhat different. Due to their genomic scale, we expect that microarray studies will include more noise (false positives), and possibly reflect the regulatory activity of more than one transcription factor. We can design our Gibbs sampling runs according to these expectations as well.

M.tuberculosis co-expression data from a promoter fusion study

The promoter fusion co-expression data in our example is from the promoter regions of M.tuberculosis genes that are up-regulated under each of the following three comparisons: 1) standing vs shaking culture conditions, 2) shaking cultures provided with 1.3% vs 20% oxygen, and 3) intracellular (within macrophages) vs extracellular growth (Florczyk et al., Infection & Immunity 71:5332-5343). This study showed that the promoters of 10 genes were similarly regulated; of these, a subset are divergently transcribed, resulting in a total of 7 unique intergenic promoter regions for Gibbs sampling. We used the basic Gibbs options described above, to follow our expectations that the observed co-expression is due to the action of a single TF, that the motif is likely palindromic (M.tuberculosis, like most bacteria, encodes many HTH-type TFs), and that multiple sites may be present in a sequence (allowing cooperative binding).


Gibbs Recursive Sampler command line

Gibbs.linux Mtb_7seq.fa 16 -n -r -R 1,1,8 -M 1,24 -E 3 -B Mtb_7seq.fa_info-det -o Mtb_7seq.4.out -S 20 -p 100 -i 1000 -P reg.0805.pr

Gibbs input and options:

Mtb_7seq.fa the fasta format data file, it contains the 7 intergenic sequences that contain the 10 M.tuberculosis promoters
16 specifies the width of the motif = 16
-n specifies the DNA alphabet
-r disables reverse-complementation of the input sequence data
-R 1,1,8 specifies a palindromic model for motif model number 1, between the first 8 and the last 8 model positions
-M 1,24 allows the motif model number 1 to fragment up to a width = 24 (the 16 positions of the motif model need not be contiguous)
-E 3 specifies 0, 1, 2, or 3 sites per sequence may be sampled into the motif model
-P reg.0805.pr specifies a prior file, which contains prior probabilites on the number of sites/sequence3
-o Mtb_7seq.4.out provides a name for the Gibbs output file
-B Mtb_7seq.fa_info-det specifies the position-specific composition file, which contains the background composition for the input fasta sequences1
-S 20 specifies 20 seeds (i.e.: the number of times that the Gibbs sampler is restarted with separate initial models)
-p 100 specifies a plateau period of 100 (i.e.: 100 iterations are performed without an increase in the MAP value)2
-i 1000 specifies 1000 maximum iterations (i.e.: at a maximum, 1000 iterations are allowed during each seed; also 1000 total iterations are performed during near optimal sampling)2

1 This file is generated automatically when "Background Model" is checked on the Gibbs web form. For stand-alone Gibbs, this file must be generated by the program "unified.cpp", which is provided with the Gibbs sampler distribution.
2 These options (-p and -i) control the number of iterations performed by the sampler; these options are used during the recursive sampling mode, but not the centroid sampling mode.
3 reg.0805.pr file
>BLOCKS
0.05 0.30 0.30 0.30 0.05
>
3 This prior actually also includes a probability for 4 sites (p = 0.05), but since we specify that only up to 3 sites are allowed, Gibbs will normalize these probabilities.

Output from this example     Run this example on the Gibbs server (Gibbs sampling is a stochastic process. Your results may differ from the example shown here.)

Mtb_7seq.4.out motif a

The sequence logo of the model produced by the example above.


Gibbs Centroid Sampler command line

Gibbs.linux Mtb_7seq.fa 16 -n -r -R 1,1,8 -M 1,24 -E 3 -S 20 -m -y -nopt -bayes 1000,5000 -align_centroid -B Mtb_7seq.fa_info-det -o Mtb_7seq.12.out -P reg.0805.pr

Gibbs input and options:

Mtb_7seq.fa the fasta format data file, it contains the 7 intergenic sequences that contain the 10 M.tuberculosis promoters
16 specifies the width of the motif = 16
-n specifies the DNA alphabet
-r disables reverse-complementation of the input sequence data
-R 1,1,8 specifies a palindromic model for motif model number 1, between the first 8 and the last 8 model positions
-M 1,24 allows the motif model number 1 to fragment up to a width = 24 (the 16 positions of the motif model need not be contiguous)
-E 3 specifies 0, 1, 2, or 3 sites per sequence may be sampled into the motif model
-P reg.0805.pr specifies a prior file, which contains prior probabilites on the number of sites/sequence3
-S 20 specifies 20 seeds (i.e.: the number of times that the Gibbs sampler is restarted with separate initial models)
-o Mtb_7seq.12.out provides a name for the Gibbs output file
-B Mtb_7seq.fa_info-det specifies the position-specific composition file, which contains the background composition for the input fasta sequences1
-m disables maximal MAP sampling2
-y disables frequency solution2
-nopt disables near-optimal sampling2
-bayes 1000,5000 specifies 1000 interations for the burn-in period and 5000 iterations for the sampling period2
-align_centroid will call the recursive sampler to provide an alignment and motif matrix for the centroid2

1 This file is generated automatically when "Background Model" is checked on the Gibbs web form. For stand-alone Gibbs, this file must be generated by the program "unified.cpp", which is provided with the Gibbs sampler distribution.
2 These options are needed for centroid sampling; they control the number of iterations performed by the sampler, and disable the MAP sampling
3 reg.0805.pr file
>BLOCKS
0.05 0.30 0.30 0.30 0.05
>
3 This prior actually also includes a probability for 4 sites (P = 0.05), but since we specify that only up to 3 sites are allowed, Gibbs will normalize these probabilities.

Output from this example     Run this example on the Gibbs server (Gibbs sampling is a stochastic process. Your results may differ from the example shown here.)

Mtb_7seq.12.out motif a

The sequence logo of the model produced by the example above.


Gibbs Centroid Sampler command line, with Wilcoxon test using randomly shuffled negative controls

Gibbs.linux Mtb_7seq.fa 16 -n -r -R 1,1,8 -M 1,24 -E 3 -S 20 -m -y -nopt -bayes 1000,5000 -align_centroid -B Mtb_7seq.fa_info-det -o Mtb_7seq.14.out -P reg.0805.pr -l

Gibbs input and options are the same as those above, adding only:

-l turns on the Wilcoxon test, randomly shuffled sequences will be generated as negative controls

Output from this example     Run this example on the Gibbs server (Gibbs sampling is a stochastic process. Your results may differ from the example shown here.)


M.tuberculosis co-expression data from a microarray study

The microarray co-expression data in our example is based on the results of Sherman et al.(Proc Natl Acad Sci 98:7534-7539) on the response of M.tuberculosis to hypoxia. Table 1 of that paper reports the expression of 47 genes was induced by hypoxia. A number of these genes are likely coded within operons and a few are divergently transcribed, thus we analyzed a set of 32 intergenic sequences containing promoters for these genes. We used Gibbs parameters designed to follow our expections that the data may include false positives, that the observed co-expression may be due to the action of more than one TF, that the motifs are likely palindromic (M.tuberculosis, like most bacteria, encodes many HTH-type TFs), and that multiple sites may be present in a sequence (allowing cooperative binding). Therefore, we searched for a two palindromic motifs simultaneously, and allowed 0, 1, 2, or 3 sites per input sequence (-E 3) at equal prior probabilities (the default, uninformed priors on the number of sites per sequence).


Gibbs Recursive Sampler command line

Gibbs.linux hypoxia.fa 16,16 -n -r -R 1,1,8,2,1,8 -M 1,24,2,24 -E 3 -S 20 -p 100 -i 1000 -B hypoxia.fa_info-det -o hypoxia.1.out

Gibbs input and options:

hypoxia.fa the fasta format data file, it contains the 32 M.tuberculosis intergenic sequences
16,16 specifies two motifs, each has width = 16
-n specifies the DNA alphabet
-r disables reverse-complementation of the input sequence data
-R 1,1,8,2,1,8 for each of the two motif models, specifies a palindromic model between the first 8 and the last 8 model positions
-M 1,24,2,24 allows each motif model to fragment up to a width = 24 (the 16 positions of the motif models need not be contiguous)
-E 3 specifies 0, 1, 2, or 3 sites per sequence may be sampled into the motif model
-S 20 specifies 20 seeds (i.e.: the number of times that the Gibbs sampler is restarted with separate initial models)
-o hypoxia.1.out provides a name for the Gibbs output file
-B hypoxia.fa_info-det specifies the position-specific composition file, which contains the background composition for the input fasta sequences1
-p 100 specifies a plateau period of 100 (i.e.: 100 iterations are performed without an increase in the MAP value)2
-i 1000 specifies 1000 maximum iterations (i.e.: at a maximum, 1000 iterations are allowed during each seed; also 1000 total iterations are performed during near optimal sampling)2

1 This file is generated automatically when "Background Model" is checked on the Gibbs web form. For stand-alone Gibbs, this file must be generated by the program "unified.cpp", which is provided with the Gibbs sampler distribution.
2 These options (-p and -i) control the number of iterations performed by the sampler; these options are used during the recursive sampling mode, but not the centroid sampling mode.

Output from this example     Run this example on the Gibbs server (Gibbs sampling is a stochastic process. Your results may differ from the example shown here.)

hypoxia.1.out motif a hypoxia.1.out motif b

The two motif models produced by the recursive sampler (Near Optimal solution) for the MTB hypoxia example. Motif a (on the left) resembles the model from the promoter fusion example. Motif b (on the right) is probably not significant.


Gibbs Centroid Sampler command line

Gibbs.linux hypoxia.fa 16,16 -n -r -R 1,1,8,2,1,8 -M 1,24,2,24 -E 3 -S 20 -m -y -nopt -bayes 2000,8000 -align_centroid -B hypoxia.fa_info-det -o hypoxia.3.out

Gibbs input and options:

hypoxia.fa the fasta format data file, it contains the 32 M.tuberculosis intergenic sequences
16,16 specifies two motifs, each has width = 16
-n specifies the DNA alphabet
-r disables reverse-complementation of the input sequence data
-R 1,1,8,2,1,8 for each of the two motif models, specifies a palindromic model between the first 8 and the last 8 model positions
-M 1,24,2,24 allows each motif model to fragment up to a width = 24 (the 16 positions of the motif models need not be contiguous)
-E 3 specifies 0, 1, 2, or 3 sites per sequence may be sampled into the motif model
-S 20 specifies 20 seeds (i.e.: the number of times that the Gibbs sampler is restarted with separate initial models)
-o hypoxia.3.out provides a name for the Gibbs output file
-B hypoxia.fa_info-det specifies the position-specific composition file, which contains the background composition for the input fasta sequences1
-m disables maximal MAP sampling2
-y disables frequency solution2
-nopt disables near-optimal sampling2
-bayes 2000,8000 specifies 2000 interations for the burn-in period and 8000 iterations for the sampling period2
-align_centroid will call the recursive sampler to provide an alignment and motif matrix for the centroid2

1 This file is generated automatically when "Background Model" is checked on the Gibbs web form. For stand-alone Gibbs, this file must be generated by the program "unified.cpp", which is provided with the Gibbs sampler distribution.
2 These options are needed for centroid sampling; they control the number of iterations performed by the sampler, and disable the MAP sampling

Output from this example     Run this example on the Gibbs server (Gibbs sampling is a stochastic process. Your results may differ from the example shown here.)

hypoxia.3.out motif

The one motif model produced by the centroid sampler for the MTB hypoxia example; this resembles the model from the promoter fusion example.


For these runs, we allowed the Gibbs sampler to search for two motifs simultaneously, in case the primary response to hypoxia in M.tuberculosis involves more than a single transcription factor. In the Gibbs Recursive Sampler output, the Near Optimal solution for the first motif (MOTIF a) included 32 total sites, and most of these sites (31 sites) were sampled during at least 50% of the iterations in the Frequency solution. This motif is very similar to the motif found in the promoter fusion example described above. The Near Optimal solution for the second motif (MOTIF b) included 14 sites, however only two sites were sampled during at least 50% of the iterations in the Frequency solution. Because sites were not frequently sampled, combined with the observation that the motif model has few well-conserved positions (low information bits for the positions of the motif), indicates that this motif is likely not significant (the Wilcoxon signed-rank test could test this assumption). The Gibbs Centroid Sampler is able to avoid falsely predicting a second motif; in the output, a single motif (labeled MOTIF b) is identified that includes 31 sites, and this motif is also very similar to the motif found in the promoter fusion example described above.



go back Gibbs home page

mail

Valid HTML 4.01!