dscan


Program to be run

Normally, dscan takes the aligned segments submitted by the user and creates a frequency model which is used to scan the database. When a palindromic model is suspected, use palindrome_dscan. palindrome_dscan creates a palindromic model from the submitted segments and uses it to scan the data. palindrome_dscan should only be used with nucleotide data and reverse complement searching should be turned off.

Scan E. coli intergenics

Full sets of E. coli and R. palustris intergenic regions are available as databases for searching. Clicking the E. coli or R. palustris button loads the database file into the Sequences to be searched textbox. Depending on the load on the server, it may take several minutes to load the file. The E. coli intergenic file consists of 2417 sequences. The FASTA headers contain the gene name, the genomic coordinates of the gene and its upstream neighbor, the length of the intergenic region, and its genomic coordinates. The E. coli intergenic regions were derived from the E. coli K12 genome entry in RefSeq, downloaded on Feb 28, 2003.

The R. palustris intergenic file consists of 2633 sequences. The FASTA headers contain the gene name, the genomic coordinates of the gene and its upstream neighbor, the length of the intergenic region, and its genomic coordinates. The R. palustris intergenic regions were derived from the Rhodopseudomonas palustris CGA009 genome entry in RefSeq (NC_005296), downloaded on Mar 8, 2004.

Choosing a scoring matrix

For scanning a DNA database, dscan allows a choice of either an Identity matrix or a PAM1 DNA scoring matrix. For proteins, a BLOSUM62 matrix is used for scoring. In either case, a product multinomial model may be used instead.

Aligned Segments

Aligned segments are the alignment produced by Gibbs. A separate list of the aligned segments can be produced using the Create Scan Output option on the Advanced Options page for Gibbs.

The segments begin with a mask row describing the fragmentation of the sites: a * for each conserved position and a . for each fragmented column. The first character in the mask must be a '*' and not a '.', i.e., the first position specified must be an ON position and not an OFF position.

For example


      **.*****.*..*.*****.**
      TTTTTTGATCGTTTTCACAAAA
      TTATTTGCACGGCGTCACACTT
      AACTGTGAGCATGGTCATATTT
      GTATGCAAAGGACGTCACATTA
      AGGTGTTAAATTGATCACGTTT
      TTATTTGAACCAGATCGCATTA
      AATTGTGATGTGTATCGAAGTG
      TTGTGTAAACGATTCCACTAAT
      TTATCTGCAATTCAGTACAAAA
      TAATGTGAGTTAGCTCACTCAT
      TTCTGTAACAGAGATCACACAA
      TTTCGTGATGTTGCTTGCAAAA
      AATTGTGACACAGTGCAAATTC
      ATGCCTGACGGAGTTCACACTT
      GATTGTGATTCGATTCACATTT
      TGTTGTGATGTGGTTAACCCAA
      CGGTGTGAAATACCGCACAGAT
      ATTTGTGAGTGGTCGCACATAT

dscan will create a frequency model from the conserved columns and use it to search the database sequences for similar sites.
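The step from aligned segments to a frequency model can be sketched as follows. This is a minimal illustration, not dscan's own code: the function names, the small toy alignment, and the pseudocount of 0.5 are all assumptions made for the example.

```python
from collections import Counter

def conserved_counts(mask, segments, alphabet="ATCG"):
    """Count letters in each conserved ('*') column of the aligned segments."""
    if not mask.startswith("*"):
        raise ValueError("the first mask character must be '*'")
    if any(len(seg) != len(mask) for seg in segments):
        raise ValueError("every segment must be as long as the mask")
    counts = []
    for pos, flag in enumerate(mask):
        if flag != "*":
            continue  # fragmented column: left out of the model
        column = Counter(seg[pos] for seg in segments)
        counts.append([column.get(letter, 0) for letter in alphabet])
    return counts

def to_frequencies(counts, pseudocount=0.5):
    """Convert counts to per-position frequencies (pseudocount is illustrative)."""
    freqs = []
    for row in counts:
        total = sum(row) + pseudocount * len(row)
        freqs.append([(c + pseudocount) / total for c in row])
    return freqs

# Toy alignment (hypothetical data): three segments, one fragmented column.
mask = "**.*"
segments = ["TTAG", "TAGG", "ATCG"]
counts = conserved_counts(mask, segments)
print(counts)  # [[1, 2, 0, 0], [1, 2, 0, 0], [0, 0, 0, 3]]
print(to_frequencies(counts))
```

Each inner list follows the A T C G column order, and only the three conserved columns contribute a row.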

Count Matrix

A count matrix is similar to the frequency matrix output by Gibbs. It contains a list of the counts of each nucleotide or amino acid for each position in the matrix.

The matrix begins with a mask row describing the fragmentation of the sites: a * for each conserved position and a . for each fragmented column. The first character in the mask must be a '*' and not a '.', i.e., the first position specified must be an ON position and not an OFF position.

Rows following the model mask specify model positions. There must be one row for each position in the model, including OFF positions. For example, if the model mask is 18 characters long, with 14 ON positions and 4 OFF positions, 18 rows of data must follow the mask in the frequency matrix. Data in the OFF positions of the matrix is ignored, but it must be numeric data, i.e., not alphabetic characters.

Columns specify counts of each possible alphabet letter.


	- nucleotide data: column order is A T C G; this order is compatible
	  with Gibbs.
	- protein data: alphabetic order of single-letter AA codes, i.e.,
	  A C D E F G H I K L M N P Q R S T V W Y

Values in a frequency matrix may be integers or floats, so both probability matrices (with each position summing to 1.0) and count matrices are allowed.

Note: count matrices only work with dscan, not palindrome_dscan.

For example


 **.*****.*..*.*****.**
    7   7   1   2
    6   9   0   2
    0   0   0   0
    0  14   3   0
    0   3   1  13
    0  16   1   0
    3   1   0  13
   16   0   1   0
    0   0   0   0
    4   3   6   4
    0   0   0   0
    0   0   0   0
    1   4   4   8
    0   0   0   0
    0  12   2   3
    0   1  16   0
   13   1   0   3
    2   1  14   0
   14   2   0   1
    0   0   0   0
    7  10   0   0
    5  10   1   1

dscan will create a frequency model from the conserved columns and use it to search the database sequences for similar sites.
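The format rules for a count matrix can be checked with a short reader such as the one below. It is a sketch under stated assumptions: parse_count_matrix is a hypothetical helper, not part of dscan, and normalizing each ON row to sum to 1.0 is one simple way to treat counts and probabilities alike.

```python
def parse_count_matrix(text, alphabet="ATCG"):
    """Read a mask row plus one numeric row per mask character.

    Returns one frequency row per ON ('*') position; OFF rows are
    validated as numeric and then ignored.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    mask, rows = lines[0], lines[1:]
    if not mask.startswith("*") or set(mask) - set("*."):
        raise ValueError("mask must start with '*' and contain only '*' and '.'")
    if len(rows) != len(mask):
        raise ValueError("need one data row per mask character, OFF positions included")
    model = []
    for flag, row in zip(mask, rows):
        values = [float(v) for v in row.split()]  # ValueError on non-numeric data
        if len(values) != len(alphabet):
            raise ValueError("each row needs one value per alphabet letter")
        if flag != "*":
            continue  # OFF position: data is ignored
        total = sum(values)
        if total <= 0:
            raise ValueError("an ON position needs a positive total")
        model.append([v / total for v in values])
    return model

# The first four rows of a matrix in the layout described above.
text = """
**.*
    7   7   1   2
    6   9   0   2
    0   0   0   0
    0  14   3   0
"""
model = parse_count_matrix(text)
print(len(model))  # 3 ON positions
print(model[0])
```

Because each ON row is divided by its own total, the same reader accepts integer counts and probability rows.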

Reverse Complement

The program normally scans data in the forward direction only. For nucleotide data, it is common to search both the forward and reverse complement strands. Checking this option causes the program to scan both strands. Sites found in the reverse complement direction are marked with an R. Note: when scanning for repeats, the sequences are scanned for all repeats in the forward direction and then in the reverse direction. Thus, dscan may miss cases where a site model appears in the same sequence in both the forward and reverse directions.
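For readers unfamiliar with the term, the reverse complement of a nucleotide sequence can be computed as in the sketch below. The function name is an assumption for illustration; the test sequence is the ecomale site from the sample output in the Program Output section, whose R entry is its reverse complement.

```python
# Map each nucleotide to its complement (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]

forward_site = "TTCTGTAACAGAGATCACACAA"
print(reverse_complement(forward_site))  # TTGTGTGATCTCTGTTACAGAA
```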

Expectation Value

The E-value is the number of sites with the same score or better that we would expect to find in a random database of the same size.

p-value

The p-value is the probability of finding a profile score of at least the value of the highest scoring segment in the sequence in a random sequence of the same length. A Bonferroni adjustment is made to adjust for the number of possible segments. If searching for multiple sites or multiple motifs, a different Bonferroni adjustment is made. In all cases, a second Bonferroni adjustment is made for the size of the database, either by multiplying by the number of sequences or by the effective size of the database. See Neuwald, Liu and Lawrence, Gibbs motif sampling, Protein Science (1995) 4:1618-1632 for details.

Print top N values

By default, dscan prints all sites found with an E-value less than the cutoff and the top p-value with a -log10(p-value) greater than the cutoff or, in the case of repeats, all p-values above the cutoff. It is possible, instead, to print the top N values regardless of the E-value cutoff.

Alternate Bonferroni Adjustment

Normally, dscan adjusts for the size of the database by multiplying the adjusted p-value for a sequence by the number of sequences in the database. When there is a large variation in the length of the sequences, this can underestimate some p-values and overestimate others. When searching for one model without repeats, an alternate adjustment is available which multiplies the p-value calculated for a single sequence by the effective length of the database.
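The two adjustments can be illustrated with a small calculation. The sketch below is an inference from the descriptions above, not dscan's source: it assumes the number of possible segments in a sequence is length - width + 1, and it assumes the first site in the sample output of the Program Output section (raw p-value 1.150e-07, model width 22) lies in a sequence of the average length 105. Under those assumptions the arithmetic agrees with the values printed there (5.0 and 3.46, with 36 sequences and an effective size of 3024).

```python
import math

def adjusted_p(raw_p, seq_len, width):
    """Bonferroni-adjust a raw p-value for the number of start positions."""
    n_segments = seq_len - width + 1  # assumption: one test per start position
    return min(1.0, raw_p * n_segments)

def e_value_by_count(adj_p, n_sequences):
    """Default adjustment: adjusted p-value times the number of sequences."""
    return adj_p * n_sequences

def e_value_by_effective_length(raw_p, effective_length):
    """Alternate adjustment: single-sequence p-value times the effective length."""
    return raw_p * effective_length

raw_p, seq_len, width = 1.150e-07, 105, 22
adj = adjusted_p(raw_p, seq_len, width)
print(round(-math.log10(adj), 1))                                       # 5.0
print(round(-math.log10(e_value_by_count(adj, 36)), 2))                 # 3.46
print(round(-math.log10(e_value_by_effective_length(raw_p, 3024)), 2))  # 3.46
```

Both adjustments give the same value here because 36 sequences of 84 possible start positions each add up to the effective size of 3024; they diverge only when sequence lengths vary.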

Program Output

The output below is the result of searching a small database of E. coli intergenic sequences with the segments listed above.


/tmp/dscan21491/dscan21491 /tmp/dscan21491/dbfile.txt /tmp/dscan21491/snfile.txt -P -n -R 
C: 740 (0.195767)
G: 740 (0.195767)
A: 1150 (0.304233)
T: 1150 (0.304233)

Total database length: 3780
Effective size for model 1: 3024

average length = 105.0
Distribution of -log10(E-values):
 '=' is 1 count.

   -4.00 : 5        |=====
   -3.00 : 0        |
   -2.00 : 3        |===
   -1.00 : 2        |==
    0.00 : 9        |=========
    1.00 : 9        |=========
    2.00 : 6        |======
    3.00 : 2        |==
    4.00 : 0        |
   total : 36      

    mean = 0.55919
   stdev = 1.99404
   range = -3.48058 .. 3.45876

[3.46] ecomale
  5.0 (1.150e-07):   14  TTACCGCCAA TTCTGTAACAGAGATCACACAA AGCGACGGTG  35  


[3.12] cole1 R
  4.7 (2.498e-07):   82  GGACTTCCAT TTTTGTGAAAACGATCAAAAAA ACAGTCTTTC   61


[2.93] cole1
  4.5 (3.865e-07):   61  GAAAGACTGT TTTTTTGATCGTTTTCACAAAA ATGGAAGTCC  82  


[2.66] (tdr)
  4.2 (7.304e-07):   78  TTGAAAGTTA ATTTGTGAGTGGTCGCACATAT CCTGTT      99  


[2.59] ecomale R
  4.1 (8.482e-07):   35  CACCGTCGCT TTGTGTGATCTCTGTTACAGAA TTGGCGGTAA   14


[2.54] ecolac
  4.1 (9.509e-07):    9    AACGCAAT TAATGTGAGTTAGCTCACTCAT TAGGCACCCC  30  


[2.44] ecodaop
  4.0 (1.206e-06):    7      AGTGAA TTATTTGAACCAGATCGCATTA CAGTGATGCA  28  


[2.16] ecobgirl
  3.7 (2.280e-06):   76  CAAAGTTAAT AACTGTGAGCATGGTCATATTT TTATCAAT    97  


	time: 0 seconds (0.00 minutes)

The first part of the output lists the nucleotide or amino acid content of the database along with the total database size. Next is a histogram of the distribution of -log10(E-values). The list of sites matching the model follows. The number in square brackets is the -log10(expectation value). The higher this number is, the less likely it would be to find a site with a score equal to or greater than the one found in a random database of the same size. The E-value is followed by the FASTA header of the sequence. Sites found in reverse complement have an R after the header.

The next line lists the -log10(adjusted p-value) followed by the raw p-value in parentheses. Immediately following is the starting position of the site found, some flanking sequence, and the listing of the site. Following this is the flanking sequence and the site ending position.
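For downstream processing, the site listing can be read with a few lines of Python. This is a sketch based only on the layout of the sample output above: parse_sites is a hypothetical helper, a trailing " R" on the header line is assumed to be the reverse complement marker, and the site is identified only when exactly three sequence fields (flank, site, flank) are present.

```python
import re

# score, (raw p-value):, start, flank/site/flank text, end
SITE_RE = re.compile(
    r"^\s*(-?\d+(?:\.\d+)?)\s+\(([^)]+)\):\s+(\d+)\s+(.*?)\s+(\d+)\s*$"
)

def parse_sites(text):
    """Collect one dict per reported site from dscan-style output text."""
    sites, current = [], None
    for line in text.splitlines():
        if line.startswith("["):
            score, _, rest = line[1:].partition("]")
            header = rest.strip()
            reverse = header.endswith(" R")  # assumption: " R" marks a reverse hit
            if reverse:
                header = header[:-2].rstrip()
            current = {"neg_log10_e": float(score), "header": header, "reverse": reverse}
            continue
        match = SITE_RE.match(line)
        if match and current is not None:
            fields = match.group(4).split()
            site = dict(current)
            site.update(
                neg_log10_adj_p=float(match.group(1)),
                raw_p=float(match.group(2)),
                start=int(match.group(3)),
                end=int(match.group(5)),
                # with three fields, the middle one is the site itself
                site=fields[1] if len(fields) == 3 else None,
            )
            sites.append(site)
    return sites

sample = """[3.46] ecomale
  5.0 (1.150e-07):   14  TTACCGCCAA TTCTGTAACAGAGATCACACAA AGCGACGGTG  35

[3.12] cole1 R
  4.7 (2.498e-07):   82  GGACTTCCAT TTTTGTGAAAACGATCAAAAAA ACAGTCTTTC   61
"""
for site in parse_sites(sample):
    print(site["header"], site["reverse"], site["start"], site["end"], site["site"])
```

A FASTA header that itself ends in " R" would be misread as a reverse hit, so the marker test is a heuristic rather than a guarantee.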