Informed Priors

The Bayesian formula requires pseudocount parameters. Background pseudocounts are calculated as a fraction of the observed background counts. The user controls how much weight the pseudocounts carry by specifying the pseudocount weight, a value between 0 and 1 with a default of 0.1 (10% of the observed counts).

Motif pseudocounts are set in one of two ways. In the first method, no bias in the composition of the motif is assumed, and motif residue pseudocounts are set to the background pseudocounts. This approach implements "uninformed priors." When the composition of a motif is known to some degree, "informed priors" can be implemented. In this method, motif pseudocounts are calculated as before and then, depending on the confidence that particular residues occur at specific positions, multiplied by a user-supplied factor.

To supply prior information for a motif model, create a table with a column for each amino acid or nucleotide and a row for each position in the motif. For example, for a DNA motif model 10 elements wide, you would create a 10-row by 4-column table. The columns are in the order ATCG. For proteins, the table is 20 columns wide, in the order ACDEFGHIKLMNPQRSTVWY.

There should be one table for each motif for which there is prior information. The table begins with the header >PRIOR n, where n is the motif number. The table ends with >. The table entries are the percent probabilities of each residue occurring at each position.

The following example shows a prior table for a DNA motif model 22 elements in length.

>Prior 1
42 33 9 16
33 61 0 6
33 30 23 14
4 61 26 9
0 33 10 57
0 72 28 0
20 9 0 71
71 4 25 0
28 33 14 25
10 10 66 14
28 16 23 33
71 9 0 20
9 38 33 20
23 28 14 35
0 80 14 6
5 0 95 0
85 9 0 6
0 5 95 0
95 0 0 5
33 33 34 0
28 57 15 0
42 52 0 6
>

The pseudocount for a particular nucleotide at the ith position of a motif is the value of the corresponding column in the ith row, divided by 100 and multiplied by the estimated number of motif sites and the pseudocount weight. For example, if the estimated number of motif sites is 20 and the default pseudocount weight of 0.1 is used, (42/100) * 0.1 * 20 = 0.84 pseudocounts are added to the count for an A in position 1.
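
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the formula, not the software's own code; the function name prior_pseudocounts and its parameter names are hypothetical.

```python
def prior_pseudocounts(row_percent, n_sites, weight=0.1):
    """Convert one row of a >PRIOR table (percentages) into pseudocounts.

    Hypothetical helper: percent / 100 * pseudocount weight * estimated sites.
    """
    return [(p / 100.0) * weight * n_sites for p in row_percent]


# First row of the example table, columns in the order A T C G.
row = [42, 33, 9, 16]
counts = prior_pseudocounts(row, n_sites=20)
print([round(c, 2) for c in counts])  # [0.84, 0.66, 0.18, 0.32]
```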

An optional 5th column may be included. This column is a weight that replaces the default pseudocount weight for that position.
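
As a rough sketch of how a per-position weight might be applied, the following Python treats any value after the residue columns as the weight for that row. The function row_pseudocounts, its parameters, and the example weight of 0.5 are assumptions made for illustration.

```python
def row_pseudocounts(row, n_sites, default_weight=0.1, n_residues=4):
    """Pseudocounts for one >PRIOR row, honouring an optional weight column."""
    percents = row[:n_residues]
    # Assumption: a value after the residue columns is the per-position weight.
    weight = row[n_residues] if len(row) > n_residues else default_weight
    return [(p / 100.0) * weight * n_sites for p in percents]


default_row = row_pseudocounts([42, 33, 9, 16], n_sites=20)
weighted_row = row_pseudocounts([42, 33, 9, 16, 0.5], n_sites=20)
print([round(c, 2) for c in default_row])
print([round(c, 2) for c in weighted_row])
```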

Pseudocounts may also be added directly, in a manner similar to priors. Use >PSEUDO n, where n is the motif number. In this case, the pseudocount for a particular nucleotide at the ith position of a motif is the value of the column in the ith row multiplied by the pseudocount weight. An optional 5th column may be included; it is a weight that replaces the default pseudocount weight for that position. If the Calc. Default Pseudo Counts option is selected, the software specifies a weak, uniform-strength hint by inserting pseudocounts equal to 0.2 times the number of input FASTA sequences in each position.
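
For comparison with the >PRIOR calculation, a >PSEUDO row involves no division by 100 and no site estimate. The sketch below assumes a 4-column DNA row; the function name pseudo_row and the example values are hypothetical.

```python
def pseudo_row(row, default_weight=0.1, n_residues=4):
    """Pseudocounts for one >PSEUDO row: table value * pseudocount weight."""
    values = row[:n_residues]
    # Assumption: an extra trailing value overrides the default weight.
    weight = row[n_residues] if len(row) > n_residues else default_weight
    return [v * weight for v in values]


with_default = pseudo_row([10, 5, 0, 5])
with_override = pseudo_row([10, 5, 0, 5, 0.5])
print(with_default)
print(with_override)
```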

When using the option to limit the number of sites found in a sequence, it is possible to specify prior probabilities for each possible number of sites per sequence. Enter >BLOCKS followed by a list of prior probabilities for each number of sites. For example, if a maximum of 3 sites per sequence were specified,

>BLOCKS
0.05 0.35 0.35 0.25
>

assigns 5% probability to 0 sites per sequence, 35% to 1 site, and so on. The values will be normalized, so they do not need to add to 1. >BLOCKS is only valid when the recursive sampler is used. It is ignored otherwise.
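
Normalization here presumably means dividing each value by the sum of the list. A minimal Python sketch under that assumption (the function name normalize is hypothetical) shows that unnormalized weights such as 1 7 7 5 would give the same distribution as the example:

```python
def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in weights]


# Index k holds the prior probability of k sites per sequence.
probs = normalize([1, 7, 7, 5])
print([round(p, 2) for p in probs])  # [0.05, 0.35, 0.35, 0.25]
```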

You can also specify a fixed probability for the possible number of sites per sequence using the >SBLOCKS option. Unlike the >BLOCKS option, the probabilities entered with this option are not adjusted during processing. For example,

>SBLOCKS
0.05 0.35 0.35 0.25
>

specifies a fixed set of probabilities for 0, 1, 2 or 3 sites per sequence. At each sampling step, the number of sites to be sampled into the sequence is selected from this distribution. >SBLOCKS is only valid when the recursive sampler option is used. It is ignored otherwise.
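
Drawing the number of sites from a fixed distribution can be illustrated with Python's standard random module. This is a sketch of the sampling idea only, not the sampler's implementation; sample_site_count is a hypothetical name.

```python
import random


def sample_site_count(probs, rng=random):
    """Draw a number of sites; probs[k] is the probability of k sites."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)  # seeded only so the illustration is repeatable
draws = [sample_site_count([0.05, 0.35, 0.35, 0.25], rng) for _ in range(1000)]
print({k: draws.count(k) for k in range(4)})
```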

You can also specify the estimated number of motifs in the prior file. Enter >SEQ followed by a list containing the total number of relevant sequences (usually the total number of sequences in your data file) and the estimate for the number of motif sites for each model. For example, if 2 models were specified,

>SEQ
16 8 12
>

indicates that there are 16 sequences and there are 8 expected sites for model type 1 and 12 expected sites for the second model.
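
The layout of the >SEQ line can be made concrete with a small parsing sketch. The function parse_seq_line is hypothetical and assumes whitespace-separated integers, as in the example.

```python
def parse_seq_line(line):
    """Split a >SEQ data line into (number of sequences, sites per model)."""
    fields = [int(tok) for tok in line.split()]
    if len(fields) < 2:
        raise ValueError("expected a sequence count and at least one estimate")
    return fields[0], fields[1:]


n_sequences, expected_sites = parse_seq_line("16 8 12")
print(n_sequences, expected_sites)  # 16 [8, 12]
```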

Comments may be included in the priors file. Any line beginning with a ! is considered a comment. Comments may also be included in the generated output file by adding the >COMMENT option to the priors file. For example,

>COMMENT
E. coli data
plateau period = 50
>

Lines between >COMMENT and > will be printed at the top of the output.
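
To show how the pieces of a priors file fit together, here is a minimal Python sketch that splits the text into sections and skips ! comment lines. It is not the software's parser. The function read_sections is hypothetical, and it assumes that blank lines can be ignored and that every section is closed by a line containing only >.

```python
def read_sections(text):
    """Split priors-file text into (header, body_lines) pairs."""
    sections = []
    header, body = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # blank line or ! comment (assumed safe to skip)
        if line == ">":
            if header is not None:
                sections.append((header, body))
            header, body = None, []
        elif line.startswith(">"):
            header, body = line[1:].strip(), []
        elif header is not None:
            body.append(line)
    return sections


example = """! priors for the E. coli run
>COMMENT
E. coli data
plateau period = 50
>
>SEQ
16 8 12
>
"""
sections = read_sections(example)
for name, lines in sections:
    print(name, lines)
```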