Supplementary files for "Decoding Human Regulatory Circuits"

The paper is available from Genome Research (abstract).

Data Files:

All sequence files are in FASTA format. .gz files have been compressed with gzip.
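A minimal sketch of reading these files, assuming only what is stated above (FASTA records, optionally gzip-compressed). The helper names open_maybe_gzip and read_fasta are illustrative and are not part of the distributed files.

```python
import gzip
from typing import Iterator, TextIO, Tuple


def open_maybe_gzip(path: str) -> TextIO:
    """Open a text file, decompressing on the fly when the name ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) tuples from a FASTA file."""
    header = None
    chunks = []
    with open_maybe_gzip(path) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header = line[1:].strip()
                chunks = []
            elif line and header is not None:
                # Sequence lines are concatenated; alignment gaps and
                # masked characters are kept exactly as written.
                chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)
```

For example, list(read_fasta("2710.aligned.fa.gz")) would load every record of that file into memory; iterating over read_fasta(...) directly avoids holding the whole file at once.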

2710.aligned.fa.gz - the set of 2710 human-mouse 10 kb sequence pairs used for sequence mining, aligned with BLASTZ and masked with RepeatMasker

24.aligned.pos.train.fa - the original 24 pairs of human-rodent sequences used as positive training. The file reported.sites contains the positions of the reported sites in this file.

100.aligned.neg.train.fa - 100 pairs of human and mouse sequences used as negative training

13.aligned.pos.valid.fa - 13 pairs of human-rodent sequences used as positive validation

100.aligned.neg.valid.fa - 100 pairs of human-mouse sequences used as negative validation

13.nonaligned.pos.valid.fa - 13 pairs of unaligned human-rodent sequences used as positive validation

2910.nonaligned.human.fa.gz - the full set of unaligned human sequences. These sequences have been processed with RepeatMasker. The unaligned sequences from 2710.aligned.fa.gz, 100.aligned.neg.train.fa, and 100.aligned.neg.valid.fa are contained in this file.

2910.nonaligned.mouse.fa.gz - the full set of unaligned mouse sequences.

24.nonaligned.pos.train.fa.tar.gz - the 24 pairs of positive training files before alignment with BLASTZ

20.aligned.liver.fa - 20 aligned pairs of liver specific sequences

crp.fa - a sample FASTA file for testing Gibbs. It contains 18 E. coli sequences containing known CRP TFBS.

crp.sites.dat - a sample collection of sites for testing dscan

blastz.to.fasta.pl - a Perl program for converting the output from BLASTZ to aligned FASTA sequences.

Files labeled as .comp.gz are gzipped background composition files for use with Gibbs.

Reported Sites
==============

reported.sites - contains a list of reported sites and positions for 24.aligned.pos.train.fa

Bayes factor ratios
===================

Each ratio file contains columns for the human RefSeq ID, the mouse RefSeq ID, the Bayes ratio, and the number of modules found in 10 sampling steps.

ratio.2710.061804 - Bayes ratio for 2710 human-mouse sequence pairs

predicted.sites.2710 - contains all sites predicted during data mining in the 2710 human-mouse sequence pairs. It lists the site type, the predicted site and its position in the aligned sequences.

pos.validation.13 - ratios for the 13 positive validation pairs

neg.validation.100 - ratios for 100 negative validation pairs

pos.xvalid.21 - ratios from cross-validation for 21 positive training sequence pairs with predicted modules

neg.xvalid.24 - ratios from cross-validation for 24 negative training sequence pairs with predicted modules

hum.mouse.pr - an annotated prior file for use with the modular sampler. This file contains the parameters we used with Gibbs to analyze 24.aligned.pos.train.fa.
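The ratio files in this section can be loaded with a few lines of code. The sketch below assumes the four columns appear in the order listed (human RefSeq ID, mouse RefSeq ID, Bayes ratio, number of modules) and are separated by whitespace; neither the delimiter nor the presence of a header row is specified here, so both are assumptions, and RatioRecord and parse_ratio_lines are illustrative names.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class RatioRecord:
    human_refseq: str
    mouse_refseq: str
    bayes_ratio: float
    n_modules: int


def parse_ratio_lines(lines: Iterable[str]) -> List[RatioRecord]:
    """Parse whitespace-separated ratio rows (assumed layout, see above)."""
    records = []
    for line in lines:
        fields = line.split()
        # Skip blank lines, comment lines, and rows with too few columns.
        if len(fields) < 4 or line.lstrip().startswith("#"):
            continue
        try:
            ratio = float(fields[2])
            n_modules = int(fields[3])
        except ValueError:
            # Non-numeric columns, e.g. a header row, are skipped.
            continue
        records.append(RatioRecord(fields[0], fields[1], ratio, n_modules))
    return records


# Possible usage, once the file has been downloaded:
#   with open("ratio.2710.061804") as handle:
#       rows = parse_ratio_lines(handle)
```

Rows that do not fit the assumed layout are dropped silently, so it is worth comparing len(rows) with the expected number of sequence pairs after loading.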

Supplementary files for "A phylogenetic Gibbs sampler for high-resolution comparative genomics studies of transcription regulation"

crp.tar.gz - simulated crp sequences

ortho.110206.pr - prior file for crp data

stb5.tar.gz - simulated yeast sequences

phylo.101706.1.pr - prior file for yeast sequences

studyset.tar.gz - prokaryotic data set

regulon.tar.gz - regulon data set
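The .tar.gz archives listed above can be inspected before unpacking. This is a small sketch using Python's standard tarfile module; the function name list_archive is illustrative, and nothing is assumed about what the archives contain.

```python
import tarfile
from typing import List


def list_archive(path: str) -> List[str]:
    """Return the names of the regular files inside a .tar.gz archive."""
    with tarfile.open(path, "r:gz") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]


# Possible usage, once the archive has been downloaded:
#   for name in list_archive("crp.tar.gz"):
#       print(name)
```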

Information on obtaining the Gibbs sampler may be found at /gibbs/gibbs.html.

If you have comments or questions about these files or Gibbs, please contact Bill Thompson.