The Phylogenetic Sampler

Using the Phylogenetic Sampler

To use the phylogenetic sampler, create a prior file. See http://bayesweb.wadsworth.org/gibbs/prior.html. Phylogenetic sequences should first be aligned. We have used CLUSTALW and MUSCLE successfully. If your multiple sequence alignment program places -'s in the sequences, replace them with N's.

In the prior file, describe the phylogeny using a tree in Newick format similar to the following:


! tree from /orthogibbs6/select_align/study72
! based on all 72 intergenics
>TREE 1 6
((KPNE:0.32222,(ECOL:0.00601,SFLE:0.02473):0.14695):0.03788,((SBON:0.06927,SENT:0.08519):0.12679,CROD:0.14350):0.00000);
>

The !'s are comments. The tree description starts with '>TREE n m' where n is the starting sequence that the tree describes and m is the ending sequence. '>TREE 1 6' says this tree describes sequences 1 through 6 in the Fasta file. The names of the species in the tree don't matter, but the sequences in the Fasta file must be in the same order as in the tree, i.e. it doesn't matter if the header in the first Fasta sequence contains KPNE, just that the sequence from the species described by the first branch of the tree is first in the file; the sequence for the ECOL branch in the example above is second in the Fasta file etc. If you have multiple aligned groups of sequences, called MASSes in (Newberg et al. 2007), then use a second >TREE statement. Unaligned sequences are not described by trees.

All of the unaligned sequences should be at the end of the Fasta file.

We also used sequence weights. They are described by a >WEIGHT command in the prior file. For example,


>WEIGHT
1 0.3358
2 0
3 0.325724
4 0.148832
5 0.217804
6 0.161971
7 0.578202
8 0.616242
>

The format of each line is 'sequence_number sequence_weight' Notice that only 6 sequences are described by the tree, but 8 sequences have weights. Sequences which do not have assigned weights default to a weight of 1.0. We calculated weights using the method of (Newberg et al. 2005).

You can use the program seq.weights.pl to calculate sequence weights from a tree. This is a perl program. It uses BioPerl, available from http://www.bioperl.org/wiki/Main_Page and Math::Matrix, available from CPAN, http://www.cpan.org/.

Here's a typical command line:

/users/thompson/gibbs/bin/Gibbs.opteron /users/thompson/phylo/study_set/data/real/purT.6.fa 16 12 -n -E 4 -r -R 1,1,8 -P /users/thompson/phylo/study_set/data/ortho.14.pr -o /users/thompson/phylo/study_set/output/purT.30.out -S 10 -bayes 2000,8000 -B /users/thompson/phylo/study_set/data/real/purT.6.fa_info-det

-r -R 1,1,8 tells the program to look for palindromes and only search the forward strand. Depending on your data, you may want to increase or decrease the number of burn in and sampling iterations.

Newberg, L. A., McCue, L. A., and Lawrence, C. E. (2005). The Relative Inefficiency of Sequence Weights Approaches in Determining a Nucleotide Position Weight Matrix. Statistical Applications in Genetics and Molecular Biology 4, 1-18.

Newberg, L. A., Thompson, W. A., Conlan, S., Smith, T. M., McCue, L. A., and Lawrence, C. E. (2007). A phylogenetic Gibbs sampler that yields centroid solutions for cis regulatory site prediction. Bioinformatics, btm241.

Here's a sample prior file:


! this command describes the probability of 0 through 4 sites in each sequence
>BLOCKS 1 1
0.2 0.2 0.2 0.2 0.2
>

! Uniform pseudocounts
! One model, 16 bases wide - each position is multiplied by the value of
! the -w parameter which defaults to 0.1
! Since we're goung to use a plaindromic model, the pseudocound will be
! doubled, i.e. there are 0.28 counts per nucleotide per position.

>PSEUDO 1
1.40 1.40 1.40 1.40
1.40 1.40 1.40 1.40
1.40 1.40 1.40 1.40
1.40 1.40 1.40 1.40
1.40 1.40 1.40 1.40
1.40 1.40 1.40 1.40
1.40 1.40 1.40 1.40
1.40 1.40 1.40 1.40
1.40 1.40 1.40 1.40
1.40 1.40 1.40 1.40
1.40 1.40 1.40 1.40
1.40 1.40 1.40 1.40
1.40 1.40 1.40 1.40
1.40 1.40 1.40 1.40
1.40 1.40 1.40 1.40
1.40 1.40 1.40 1.40
>

! tree from /orthogibbs6/select_align/study72
! based on all 72 intergenics
>TREE 1 6
((KPNE:0.32222,(ECOL:0.00601,SFLE:0.02473):0.14695):0.03788,((SBON:0.06927,SENT:0.08519):0.12679,CROD:0.14350):0.00000);
>

! weights for all sequences
>WEIGHT
1 0.3358
2 0
3 0.325724
4 0.148832
5 0.217804
6 0.161971
7 0.578202
8 0.616242
>



gibbs
Last modified: Wed Jul 18 10:32:54 EDT 2007