Supplementary files for "Decoding Human Regulatory Circuits"

The paper is available from Genome Research (abstract).

Data Files:

All sequence files are in FASTA format. .gz files have been compressed with gzip.
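As a minimal sketch of how these files could be read, the Python below opens a FASTA file, decompressing it first if the name ends in .gz, and returns (header, sequence) pairs. The function names open_maybe_gzip, parse_fasta and read_fasta are illustrative and are not part of Gibbs or any other tool mentioned here. It assumes standard FASTA records whose header lines begin with ">".

```python
import gzip
from pathlib import Path


def open_maybe_gzip(path):
    """Open a text file, decompressing on the fly if the name ends in .gz."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def parse_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def read_fasta(path):
    """Read every record of a plain or gzipped FASTA file into a list."""
    with open_maybe_gzip(path) as handle:
        return list(parse_fasta(handle))
```

For example, read_fasta("2710.aligned.fa.gz") would return a list of records. The .tar.gz files are archives rather than single gzipped FASTA files, so they would need to be unpacked first (for instance with tar or Python's tarfile module).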

2710.aligned.fa.gz - the set of 2710 human-mouse 10 kb sequence pairs used for sequence mining, aligned with BLASTZ and masked with RepeatMasker

24.aligned.pos.train.fa - original 24 pairs of human-rodent sequences used as positive training. The file, reported.sites, contains the positions of the reported sites in this file.

100.aligned.neg.train.fa - 100 pairs of human and mouse sequences used as negative training

13.aligned.pos.valid.fa - 13 pairs of human-rodent sequences used as positive validation

100.aligned.neg.valid.fa - 100 pairs of human-mouse sequences used as negative validation

13.nonaligned.pos.valid.fa - the 13 pairs of human-rodent sequences used as positive validation, before alignment

2910.nonaligned.human.fa.gz - the full set of unaligned human sequences. These sequences have been processed with RepeatMasker. The unaligned sequences from 2710.aligned.fa.gz, 100.aligned.neg.train.fa, and 100.aligned.neg.valid.fa are contained in this file.

2910.nonaligned.mouse.fa.gz - the full set of unaligned mouse sequences.

24.nonaligned.pos.train.fa.tar.gz - the 24 pairs of positive training files before alignment with BLASTZ

20.aligned.liver.fa - 20 aligned pairs of liver specific sequences

crp.fa - a sample FASTA file for testing Gibbs. It contains 18 E. coli sequences containing known CRP TFBS.

crp.sites.dat - a sample collection of sites for testing dscan

- a Perl program for converting the output from BLASTZ to aligned FASTA sequences

Files labeled as .comp.gz are gzipped background composition files for use with Gibbs.

Reported Sites


reported.sites - contains a list of reported sites and positions for 24.aligned.pos.train.fa

Bayes factor ratios


Each file contains columns for the human RefSeq Id, mouse RefSeq Id, Bayes ratio and number of modules found in 10 sampling steps.
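The four-column layout described above could be loaded along the following lines. This is only a sketch under stated assumptions: it assumes the columns are separated by whitespace and appear in the order listed, neither of which is confirmed here. Lines that do not fit that shape (blank lines or a possible header row) are skipped. The names RatioRecord and parse_ratio_lines are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class RatioRecord(NamedTuple):
    human_refseq: str   # human RefSeq Id
    mouse_refseq: str   # mouse RefSeq Id
    bayes_ratio: float  # Bayes ratio
    n_modules: int      # number of modules found in 10 sampling steps


def parse_ratio_lines(lines: Iterable[str]) -> Iterator[RatioRecord]:
    """Yield one RatioRecord per well-formed line.

    Assumption: whitespace-separated fields in the order given above.
    """
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # blank line or unexpected layout
        human, mouse, ratio, modules = fields
        try:
            yield RatioRecord(human, mouse, float(ratio), int(modules))
        except ValueError:
            continue  # non-numeric values, e.g. a header row


def read_ratio_file(path: str) -> list:
    """Read all records from a ratio file such as ratio.2710.061804."""
    with open(path) as handle:
        return list(parse_ratio_lines(handle))
```

If the files turn out to use a different delimiter or column order, only parse_ratio_lines would need to change.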

ratio.2710.061804 - Bayes ratio for 2710 human-mouse sequence pairs

predicted.sites.2710 - contains all sites predicted during data mining in the 2710 human-mouse sequence pairs. It lists the site type, the predicted site and its position in the aligned sequences.

pos.validation.13 - ratios for the 13 positive validation pairs

neg.validation.100 - ratios for 100 negative validation pairs

pos.xvalid.21 - ratio from cross-validation for 21 positive training sequence pairs with predicted modules

neg.xvalid.24 - ratio from cross-validation for 24 negative training sequence pairs with predicted modules

- an annotated prior file for use with the modular sampler. This file contains the parameters we used with Gibbs to analyze 24.aligned.pos.train.fa.

Supplementary files for "A phylogenetic Gibbs sampler for high-resolution comparative genomics studies of transcription regulation"

crp.tar.gz - simulated crp sequences - prior file for crp data

stb5.tar.gz - simulated yeast sequences - prior file for yeast sequences

studyset.tar.gz - prokaryotic data set

regulon.tar.gz - regulon data set

Information on obtaining the Gibbs sampler may be found at /gibbs/gibbs.html.

If you have comments or questions about these files or Gibbs, please contact Bill Thompson.