Prokaryotic Phylogenetic Footprinting

These pages provide guidance both to users of the Gibbs sampler web server and to command-line users of the stand-alone version of Gibbs. The Gibbs sampler offers a dizzying array of options. Some of these let you model specific information you may have about the data you wish to analyze, reflecting knowledge of the biology in your experiments. Others are largely technical and meant for advanced users who wish to control details of how the sampling is done. These help pages focus on the options we use to model the biology of transcription regulation.

Phylogenetic footprinting (McCue et al. 2002, 2001) is a method for identifying transcription factor (TF) binding sites using a set of orthologous promoter regions. The basic assumption is that a given gene will likely be controlled by the same transcription factor in multiple closely related species.

When searching for bacterial TF binding sites by phylogenetic footprinting, we most often:

This range of motif widths (16-24) covers the typical width of known bacterial TF binding sites and, furthermore, spans a couple of turns of the DNA double helix. The parameters we use for width and palindromic models are designed to capture the features of binding sites for a classic bacterial helix-turn-helix (HTH) type transcription factor: HTH-type TFs are typically symmetric homodimers, and thus they bind to symmetric (palindromic) DNA binding sites. Furthermore, the two HTH regions of the dimeric TF typically contact bases in two adjacent major grooves of the DNA, and thus the two halves of the palindromic binding site span well over 10 bases (the approximate number of bases per helical turn of B-form DNA). The bases contacted by a TF are not necessarily contiguous, so we use fragmentation to allow the Gibbs sampler to ignore positions that do not participate in the protein-DNA interaction and are therefore not conserved as part of the binding site.

[Figure: CRP dimer; the homodimeric structure indicates a symmetric model (GAD).]

The CRP homodimeric structure binding to DNA.

When using a palindromic model, reverse complementation of the input sequence data is automatically disabled. Leaving it enabled generally causes the motif model to become very strong in one half and noticeably weaker in the other half. This occurs because, in a Gibbs run with 16 positions specified for the motif model and -R 1,1,8 for the palindrome, the first 8 positions are combined with the reverse complements of the last 8 positions to form the palindromic model. With reverse complementation of the sequence data also enabled, Gibbs will orient the sites to achieve the best possible alignment in one half; when the two halves are combined the overall motif model scores better, but you end up with a lopsided motif model. It is also important to note that with the current version of the Gibbs sampler* you need NOT do anything differently to identify even width and odd width palindromes: the sampler uses fragmentation to determine the total width of the motif. For example, using a motif model width of 16 positions with the palindrome and fragmentation options -R 1,1,8 -M 1,24 (as in the examples below), an even width palindrome (PurR example) would be identified with 16 "on" positions and 0, 2, 4, 6, or 8 positions fragmented ("off"), and an odd width palindrome (NtrC example) would be identified with 16 "on" positions and 1, 3, 5, or 7 positions fragmented.
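The pairing of the two halves described above can be illustrated with a small Python sketch. This is not the Gibbs implementation; it only assumes that position i of the model is pooled with the complemented counts of its mirror position (width - 1 - i). The function names and the two example sites are hypothetical.

```python
# Sketch only: fold a count matrix into a palindromic model by pooling each
# position with the reverse complement of its mirror position.
# The pairing rule (i with width-1-i) is an assumption made for illustration.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
BASES = "ACGT"


def count_matrix(sites):
    """Count each base at each position of equal-length aligned sites."""
    width = len(sites[0])
    counts = [{b: 0 for b in BASES} for _ in range(width)]
    for site in sites:
        for pos, base in enumerate(site):
            counts[pos][base] += 1
    return counts


def palindromic_counts(counts):
    """Pool position i with the complemented counts of position width-1-i."""
    width = len(counts)
    pooled = []
    for i in range(width):
        mirror = counts[width - 1 - i]
        pooled.append({b: counts[i][b] + mirror[COMPLEMENT[b]] for b in BASES})
    return pooled


# Two hypothetical 16-base sites, each equal to its own reverse complement.
sites = ["ACGCAAACGTTTGCGT", "ACGAAAACGTTTTCGT"]
model = palindromic_counts(count_matrix(sites))
```

After pooling, the counts at position i for base b equal the counts at the mirror position for the complement of b, so the two halves of the model carry the same information.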

* If you are running the Gibbs sampler locally, note that previous versions (up to version 2.05) treated even width and odd width palindromes separately. An odd width palindrome was specified by an odd motif width (e.g. 17 positions), with the center position in the model "on" but unpaired in the palindromic model. In the current version of Gibbs (version 2.06 and higher) it is NOT necessary to specify even and odd palindromes separately.
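The arithmetic behind the even and odd width cases can be checked with a short Python sketch. It assumes that the second value of -M 1,24 acts as an upper bound of 24 on the total motif width and that the parity of the total width (16 "on" positions plus the "off" positions) decides whether the palindrome is even or odd. The names are illustrative only.

```python
# Sketch only: enumerate how many "off" (fragmented) positions give an even
# or odd total motif width. MAX_WIDTH = 24 is assumed from "-M 1,24".
MODEL_WIDTH = 16  # "on" positions, as in the examples
MAX_WIDTH = 24    # assumed upper bound on the total width


def fragmented_counts(parity):
    """Return the "off" counts whose total width has the given parity (0 even, 1 odd)."""
    max_off = MAX_WIDTH - MODEL_WIDTH
    return [off for off in range(max_off + 1)
            if (MODEL_WIDTH + off) % 2 == parity]


even_off = fragmented_counts(0)  # even width palindromes
odd_off = fragmented_counts(1)   # odd width palindromes
```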

[Figures: glnA sequence logo without reverse complementation; glnA sequence logo with reverse complementation.]

The sequence logos above illustrate the effect of using reverse complementation with a palindromic model. The first image was created from Gibbs output with reverse complementation turned off. In the second, Gibbs was allowed to reverse complement the motif sites. Notice how positions 3, 5 and 7 have been strengthened while positions on the right are weaker. Logos can be generated online.

These width and palindromic parameters have proven valuable in our research on bacterial transcription regulation. However, each user should evaluate whether palindromic models are appropriate or likely to reflect the biology in their system. For example, whether to use palindromic models should be carefully considered for species that do not encode a large proportion of HTH-type transcription factors. We typically run Gibbs multiple times on a data set while varying these parameters and compare the MAP values of the results. The examples provided below show the best results for each particular sequence data set.

While it is not necessary to orient the input sequences in any particular way for phylogenetic footprinting, we have found that it may be useful to include prior information on the position of sites in the input sequences. To use such information, it is necessary to orient all the input sequences in a similar manner. Our sequence data contain a maximum of 500 bases upstream of an orthologous gene (fewer if we detected an upstream coding region closer than 500 bases), oriented 5' to 3' relative to the gene of interest. We have determined the distribution of the positions of 182 experimentally validated TF binding sites upstream of E. coli genes, and we include this distribution as a prior information spacing model. Using this spacing model is only valid if your sequence data are oriented 5' to 3', such that the gene of interest would begin at the end of each input sequence. This spacing distribution was used in the purL example, but not in the glnA example. See Thompson et al. for a histogram of this spacing distribution.
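As a rough illustration of the orientation described above, the following Python sketch extracts up to 500 bases upstream of a gene and returns them 5' to 3' relative to that gene, stopping early at a neighboring coding region. It is not the pipeline used to prepare our data; the function names, the 0-based half-open coordinates, and the boundary argument are assumptions made for the example.

```python
# Sketch only: extract an upstream region oriented 5' to 3' relative to a gene.
# Coordinates are 0-based, half-open intervals on the forward strand (assumed).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(genome, gene_start, gene_end, strand, boundary=None, max_len=500):
    """Return at most max_len bases upstream of the gene, ending at the gene start.

    boundary is the forward-strand coordinate where the nearest upstream
    coding region ends ("+" strand gene) or begins ("-" strand gene).
    """
    if strand == "+":
        left = max(0, gene_start - max_len)
        if boundary is not None:
            left = max(left, boundary)
        return genome[left:gene_start]
    if strand == "-":
        right = min(len(genome), gene_end + max_len)
        if boundary is not None:
            right = min(right, boundary)
        # Flip so the gene of interest would begin at the end of the sequence.
        return reverse_complement(genome[gene_end:right])
    raise ValueError("strand must be '+' or '-'")


toy_genome = "ACGTTGCAAGGCTTAC"
plus_region = upstream_region(toy_genome, 8, 12, "+", max_len=4)
minus_region = upstream_region(toy_genome, 4, 8, "-", max_len=4)
```

With sequences prepared this way, the last base of every input sequence sits immediately before the gene of interest, which is the condition required for the spacing model to be meaningful.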

Knowing that many bacterial TFs bind to more than one site in a promoter and often bind cooperatively, we want to look for more than one site per sequence for a motif model. Additionally, we have reasonably high confidence that among closely related species an orthologous gene will be regulated in a similar way. Therefore, we allow Gibbs to search for 0, 1, or 2 sites per input sequence (-E 2) and use a prior information file that provides a relatively low prior probability of finding 0 sites (P = 0.05), and equal probabilities of finding 1 or 2 sites (P = 0.35 each) per sequence. This prior (below) also includes a probability for 3 sites (P = 0.25), but since we specify that only up to 2 sites are allowed, Gibbs will normalize these probabilities. Thus we can be lazy and reuse this same prior file if we decide to do another Gibbs run to look for up to 3 sites per sequence. In the purL example, Gibbs detects 1 site per sequence from 5 of the 7 input sequences; no sites were detected in either the Shewanella oneidensis or Pseudomonas aeruginosa purL data. In fact, neither of these species encodes a PurR ortholog and thus no PurR binding sites are present in their purL promoters. In the glnA example, Gibbs detects 2 sites per sequence from 6 of the 7 input sequences in the Maximum MAP solution.

0.05 0.35 0.35 0.25

Prior file for the purL and glnA examples.
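To see what the normalization of this prior amounts to when only up to 2 sites are allowed, here is a minimal Python sketch. It assumes that Gibbs simply drops the entries beyond the maximum number of sites and rescales the rest to sum to 1; the exact procedure inside Gibbs may differ, and the function name is hypothetical.

```python
# Sketch only: truncate a sites-per-sequence prior and rescale it to sum to 1.
# Assumes simple truncation followed by renormalization.
def truncated_prior(weights, max_sites):
    """Keep the weights for 0..max_sites sites and renormalize them."""
    kept = weights[: max_sites + 1]
    total = sum(kept)
    if total <= 0:
        raise ValueError("prior weights must sum to a positive value")
    return [w / total for w in kept]


prior = [0.05, 0.35, 0.35, 0.25]   # values from the prior file above
probs = truncated_prior(prior, 2)  # as with -E 2
```

Under this assumption the probability of 0 sites becomes 0.05 / 0.75 (about 0.067), and 1 or 2 sites each become 0.35 / 0.75 (about 0.467).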

Knowing that non-coding DNA (in particular) may vary locally in composition, we use a Bayesian segmentation algorithm to determine the position-specific composition of the input sequence data. This provides the probabilities of observing each of the four DNA bases at each position of the input sequences; these probabilities are then used in the background model during Gibbs sampling. If you choose not to use a position-specific background model, Gibbs will calculate and use a homogeneous background composition.
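For comparison with the position-specific model, a homogeneous background composition can be sketched in Python as the overall frequency of each base in the input sequences. This is only an illustration: the estimator Gibbs actually uses (for example, any pseudocounts) is not described here, so the pseudocount argument and function name are assumptions.

```python
# Sketch only: estimate a single (homogeneous) background composition from
# the input sequences. The optional pseudocount is an illustrative assumption.
from collections import Counter


def homogeneous_background(sequences, pseudocount=0.0):
    """Return the overall frequency of A, C, G and T across all sequences."""
    counts = Counter()
    for seq in sequences:
        # Ignore ambiguity codes such as N.
        counts.update(base for base in seq.upper() if base in "ACGT")
    total = sum(counts[b] for b in "ACGT") + 4 * pseudocount
    if total == 0:
        raise ValueError("no A, C, G or T bases found")
    return {b: (counts[b] + pseudocount) / total for b in "ACGT"}


background = homogeneous_background(["ACGTAA", "TTGA"])
```

A position-specific background replaces this single set of four probabilities with a separate set for every position of every input sequence.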

[Figure: Yeast background composition.]

The compositional variation of a 500 bp region upstream of the translation start site of the YDR226W/ADK1 gene from Saccharomyces cerevisiae. From Thompson et al. The probability of each base at each position is calculated by the Bayesian segmentation algorithm of Liu, J. and Lawrence, C.E. (1999). This algorithm returns the probabilities of observing each of the four bases at each position in the sequence.

The following two phylogenetic footprinting examples are from data described in McCue et al. 2002, who report results based on the Near Optimal Gibbs solutions. The sequence data are from the following seven gamma proteobacteria:

purL example

This example finds an experimentally validated PurR binding site in the E. coli sequence (PurR binds to even palindromic sites).

Gibbs command line:

/local/compbio/programs/bin/Gibbs -PBernoulli purL.fa 16 -n -r -R 1,1,8 -M 1,24 -E 2 -S 40 -p 200 -i 1000 -o purL.16.p200i1000.out -B purL.fa_info-det -P

Gibbs input & options:

* This file is generated automatically when "Background Model" is checked on the Gibbs web form.

Gibbs sampling is a stochastic process. Your results may differ from the example shown here.

glnA example

This example finds experimentally validated NtrC binding sites in the E. coli sequence (NtrC binds to odd palindromic sites).

Gibbs command line:

/local/compbio/programs/bin/Gibbs -PBernoulli glnA.fa 16 -n -r -R 1,1,8 -M 1,24 -E 2 -S 40 -p 200 -i 1000 -o glnA.16.p200i1000.out -B glnA.fa_info-det -P

Gibbs input & options:

* This file is generated automatically when "Background Model" is checked on the Gibbs web form.

Gibbs sampling is a stochastic process. Your results may differ from the example shown here.
