The paper is available from Genome Research (abstract).
All sequence files are in FASTA format. .gz files have been compressed with gzip.
2710.aligned.fa.gz - the set of 2710 human-mouse 10 kb sequence pairs used for sequence mining, aligned with BLASTZ and masked with RepeatMasker
24.aligned.pos.train.fa - the original 24 pairs of human-rodent sequences used as positive training. The file reported.sites contains the positions of the reported sites in this file.
100.aligned.neg.train.fa - 100 pairs of human and mouse sequences used as negative training
13.aligned.pos.valid.fa - 13 pairs of human-rodent sequences used as positive validation
100.aligned.neg.valid.fa - 100 pairs of human-mouse sequences used as negative validation
13.nonaligned.pos.valid.fa - the 13 pairs of human-rodent sequences used as positive validation, unaligned.
2910.nonaligned.human.fa.gz - the full set of unaligned human sequences. These sequences have been processed with RepeatMasker. The unaligned sequences from 2710.aligned.fa.gz, 100.aligned.neg.train.fa, and 100.aligned.neg.valid.fa are contained in this file.
2910.nonaligned.mouse.fa.gz - the full set of unaligned mouse sequences.
24.nonaligned.pos.train.fa.tar.gz - the 24 pairs of positive training files before alignment with BLASTZ
20.aligned.liver.fa - 20 aligned pairs of liver specific sequences
crp.fa - a sample FASTA file for testing Gibbs. It contains 18 E. coli sequences containing known CRP TFBS.
crp.sites.dat - a sample collection of sites for testing dscan
blastz.to.fasta.pl - a perl program for converting the output from BLASTZ to aligned fasta sequences.
Files labeled as .comp.gz are gzipped background composition files for use with Gibbs.
reported.sites - contains a list of reported sites and positions for 24.aligned.pos.train.fa
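Since the sequence files above are FASTA and some are gzip-compressed, a short reader may help when inspecting them. The following is a minimal sketch in Python, not part of the distributed tools; the function name read_fasta is hypothetical, and it assumes that any path ending in .gz is a plain gzip-compressed FASTA file.

```python
import gzip
from typing import Iterator, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) records from a plain or gzipped FASTA file."""
    # Assumption: a ".gz" suffix means gzip-compressed text, not a tar archive.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        header, chunks = None, []
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
        # Emit the final record at end of file.
        if header is not None:
            yield header, "".join(chunks)
```

For example, `for name, seq in read_fasta("2710.aligned.fa.gz"): ...` would iterate over the records of that file. The .tar.gz archives, such as 24.nonaligned.pos.train.fa.tar.gz, need to be unpacked with tar first, since this sketch does not handle archives.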
Bayes factor ratios
Each file contains columns for the human RefSeq ID, the mouse RefSeq ID, the Bayes ratio, and the number of modules found in 10 sampling steps.
ratio.2710.061804 - Bayes ratio for 2710 human-mouse sequence pairs
predicted.sites.2710 - contains all sites predicted during data mining in the 2710 human-mouse sequence pairs. It lists the site type, the predicted site and its position in the aligned sequences.
pos.validation.13 - ratios for the 13 positive validation pairs
neg.validation.100 - ratios for 100 negative validation pairs
pos.xvalid.21 - ratios from cross-validation for the 21 positive training sequence pairs with predicted modules
neg.xvalid.24 - ratios from cross-validation for the 24 negative training sequence pairs with predicted modules
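The four-column layout described for the ratio files can be read with a few lines of Python. This is only a sketch under stated assumptions: it assumes the columns are whitespace-separated and appear in the order listed above, and the names RatioRecord and parse_ratio_lines are hypothetical. Lines that do not fit this layout are skipped, so it is not suitable for predicted.sites.2710, which has a different format.

```python
from typing import Iterable, List, NamedTuple


class RatioRecord(NamedTuple):
    human_refseq: str
    mouse_refseq: str
    bayes_ratio: float
    n_modules: int


def parse_ratio_lines(lines: Iterable[str]) -> List[RatioRecord]:
    """Parse lines of a ratio file into records.

    Assumption: whitespace-separated columns in the order
    human RefSeq ID, mouse RefSeq ID, Bayes ratio, number of modules.
    """
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or fields[0].startswith("#"):
            continue  # blank, comment, or too-short line
        try:
            record = RatioRecord(
                fields[0], fields[1], float(fields[2]), int(fields[3])
            )
        except ValueError:
            continue  # header row or otherwise non-numeric line
        records.append(record)
    return records
```

A file could then be loaded with `with open("ratio.2710.061804") as fh: records = parse_ratio_lines(fh)`. If the files turn out to use a different delimiter or column order, the split and the field indices would need adjusting.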
hum.mouse.pr - an annotated prior file for use with the modular sampler. This file contains the parameters we used with Gibbs to analyze 24.aligned.pos.train.fa.
ortho.110206.pr - prior file for crp data
stb5.tar.gz - simulated yeast sequences
phylo.101706.1.pr - prior file for yeast sequences
studyset.tar.gz - prokaryotic data set
regulon.tar.gz - regulon data set
Information on obtaining the Gibbs sampler may be found at /gibbs/gibbs.html.
If you have comments or questions about these files or Gibbs, please contact Bill Thompson.