MAMOT - Example Application
Creating a model for p53 Transcription factor binding sites
Frederic Schuetz and Mauro Delorenzi
Initial model
An initial model was generated using information in
Table 1 of the publication by Hoh et al. (2002).
The "Weight Matrix" there gives essentially observed frequencies in 37 positive examples.
To create a first set of emission probability vectors,
these absolute frequencies were regularized by adding a pseudocount of 0.1
for each nucleotide
and position, except for nucleotides considered incompatible with binding in the
"Filter" matrix part of their Table 1.
For the latter a pseudocount of 0.00004 was used, which gives an emission probability
of about 10^-6 (roughly 0.00004 / 37) in these cases.
The resulting regularized relative frequencies
of the four nucleotides at each position were taken as initial values
of the emission probability matrix.
A mamot HMM model file was created and called InitialModelp53.txt.
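The regularization described above can be sketched as follows. This is only an illustration: the counts are hypothetical (they are not the values from Table 1 of Hoh et al.), and the function and variable names are made up for this example. Only the two pseudocount values and the total of 37 observations come from the text.

```python
# Hypothetical counts for one motif position (NOT taken from Hoh et al. 2002).
# They sum to 37, the number of positive examples mentioned in the text.
NUCLEOTIDES = "ACGT"
REGULAR_PSEUDOCOUNT = 0.1         # nucleotides compatible with binding
FORBIDDEN_PSEUDOCOUNT = 0.00004   # nucleotides excluded by the "Filter" matrix


def emission_probabilities(counts, forbidden=()):
    """Turn absolute counts into regularized relative frequencies."""
    adjusted = {}
    for nt in NUCLEOTIDES:
        pseudo = FORBIDDEN_PSEUDOCOUNT if nt in forbidden else REGULAR_PSEUDOCOUNT
        adjusted[nt] = counts.get(nt, 0) + pseudo
    total = sum(adjusted.values())
    return {nt: value / total for nt, value in adjusted.items()}


example_counts = {"A": 2, "C": 35, "G": 0, "T": 0}  # illustrative only
probs = emission_probabilities(example_counts, forbidden=("T",))
for nt in NUCLEOTIDES:
    print(nt, f"{probs[nt]:.3g}")
```

With these assumed counts, the excluded nucleotide T receives a probability close to 10^-6, while the unobserved but allowed nucleotide G keeps a small, clearly non-zero probability.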
Refined model
A refined model was obtained by training the HMM with the sequences published
by Wei et al. (2006).
These were obtained with a promising new technology, paired-end ditag (PET) sequencing,
which characterizes DNA fragments recovered by chromatin immunoprecipitation
after binding of the p53 protein.
States
This model has emitting states for each of the 20 informative binding positions,
one for a possible spacer between the first and second groups of 10
informative positions, and one state for "random" DNA sequence.
The 10th binding position is connected with the 11th binding position
and with the spacer state. The spacer state is connected with itself;
therefore spacers of any length are possible, although the parameters
chosen favor short spacers of generally 0-2 nucleotides.
The random state is connected to itself to allow for arbitrary length
of sequences between binding sites.
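The topology described above can be written down as a small transition table. The sketch below is not the mamot model file format, and all probabilities are hypothetical placeholders rather than the values used in the actual model. The transitions from the spacer to the 11th position, from the 20th position to the random state, and from the random state to the 1st position are assumptions needed to close the chain. It also shows why a self-connected spacer state gives a geometric distribution of spacer lengths.

```python
def build_transitions(p_enter_spacer=0.5, p_stay_spacer=0.3, p_stay_random=0.99):
    """Transition table for 20 binding states (B1..B20), a spacer (S) and random DNA (R).

    All probabilities are illustrative assumptions, not the published parameters.
    """
    transitions = {}
    for i in range(1, 20):
        transitions[f"B{i}"] = {f"B{i + 1}": 1.0}
    # The 10th position leads either directly to the 11th or into the spacer.
    transitions["B10"] = {"B11": 1.0 - p_enter_spacer, "S": p_enter_spacer}
    # The spacer loops on itself, so any spacer length is possible.
    transitions["S"] = {"S": p_stay_spacer, "B11": 1.0 - p_stay_spacer}
    # Assumed: after the last binding position the model returns to random DNA.
    transitions["B20"] = {"R": 1.0}
    # The random state loops on itself; leaving it starts a new site (assumed).
    transitions["R"] = {"R": p_stay_random, "B1": 1.0 - p_stay_random}
    return transitions


def spacer_length_probability(length, p_enter_spacer, p_stay_spacer):
    """Probability of a spacer of the given length between positions 10 and 11."""
    if length == 0:
        return 1.0 - p_enter_spacer
    return p_enter_spacer * p_stay_spacer ** (length - 1) * (1.0 - p_stay_spacer)


transitions = build_transitions()
for length in range(4):
    print(length, round(spacer_length_probability(length, 0.5, 0.3), 4))
```

Lowering the assumed self-transition probability of the spacer state shifts more of the probability mass toward spacers of length 0 to 2.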
Training model parameters
mamot -Bu -i 12 -w 0.01 -b -m InitialModelp53.txt -s seqsscoreM8
The Baum-Welch algorithm is applied (-B) to both strands (-u)
of each fragment independently. The transition parameters are
not changed during this learning; only the emission probabilities are updated
(-b; this can also be specified by
a comment line in the model definition file instead of an
option on the command line).
A small number of pseudocounts (0.01, option -w) is added.
For the training only the sequences with a PET score of 8 or more
are used. The score is the number of overlapping sequence
fragments isolated in the experiment.
The number of iterations was limited to a maximum of 12 (option -i),
with good convergence occurring in 8-10 iterations.
Evaluating and applying the model
As in the original publication, binding site search was applied
separately to the subsets stratified by PET score.
False positive prediction rates were estimated by using a set of
background genomic sequences. For each isolated
fragment in the pool of PET fragment sequences,
a sequence of the same
length was randomly selected from the human genome sequence.
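A length-matched background set of this kind could be drawn as in the sketch below. It is only an illustration of the idea: a short randomly generated string stands in for the human genome, the fragment sequences are made up, and the function name is hypothetical. The procedure actually used for the published analysis is not specified beyond the description above.

```python
import random


def sample_matched_background(fragment_lengths, genome, seed=0):
    """Draw one random genomic window per fragment, matching its length."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    background = []
    for length in fragment_lengths:
        if length > len(genome):
            raise ValueError("fragment is longer than the reference sequence")
        start = rng.randrange(len(genome) - length + 1)
        background.append(genome[start:start + length])
    return background


# Toy stand-in for a genome sequence (assumption, for illustration only).
toy_rng = random.Random(1)
toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(10_000))

# Made-up fragments; only their lengths matter here.
pet_fragments = ["ACGTACGTAC", "GGGCATGTCC", "TTAGACATGCCT"]
background = sample_matched_background([len(f) for f in pet_fragments], toy_genome)
print([len(seq) for seq in background])
```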
Binding site probabilities
Binding site probabilities were calculated by applying the Forward-Backward
algorithm to compute posterior probabilities for all the states in the model.
Program call example:
mamot -Du -m p53LearnedModel -s seqs/sequencefile
The generated matrix of probabilities was imported into R for statistical analysis.
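For readers unfamiliar with the Forward-Backward algorithm, the following is a minimal, generic sketch of how posterior state probabilities are obtained. It is not mamot's implementation; the two-state toy model, its parameters and the sequence are assumptions chosen only to keep the example short.

```python
def posterior_state_probabilities(seq, states, start, trans, emit):
    """Return, for each position, the posterior probability of each state."""
    n = len(seq)
    forward, scales = [], []
    for t, symbol in enumerate(seq):
        row = {}
        for s in states:
            if t == 0:
                incoming = start[s]
            else:
                incoming = sum(forward[t - 1][r] * trans[r][s] for r in states)
            row[s] = incoming * emit[s][symbol]
        scale = sum(row.values())  # rescale each step to avoid underflow
        forward.append({s: value / scale for s, value in row.items()})
        scales.append(scale)

    backward = [None] * n
    backward[n - 1] = {s: 1.0 for s in states}
    for t in range(n - 2, -1, -1):
        backward[t] = {
            s: sum(trans[s][r] * emit[r][seq[t + 1]] * backward[t + 1][r]
                   for r in states) / scales[t + 1]
            for s in states
        }
    # With this scaling, forward * backward is already the posterior.
    return [{s: forward[t][s] * backward[t][s] for s in states} for t in range(n)]


# Toy two-state model (illustrative parameters only).
states = ("background", "site")
start = {"background": 0.9, "site": 0.1}
trans = {"background": {"background": 0.9, "site": 0.1},
         "site": {"background": 0.5, "site": 0.5}}
emit = {"background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        "site": {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05}}

posteriors = posterior_state_probabilities("ATGGGGTA", states, start, trans, emit)
for position, row in enumerate(posteriors):
    print(position, round(row["site"], 3))
```

Each row of the result corresponds to one sequence position and sums to 1 over the states, which is the kind of matrix that can then be analysed further.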
Comments
The isolated fragments should be enriched for p53 binding sites
over their frequency in random sequences, and the enrichment
should be higher for higher scores.
To identify fragments containing a predicted binding site, the
maximum of the probabilities for the binding states was extracted
for each fragment. The distributions of these values for the PET (red) and
the random sequences (blue) are shown in the linked plots and summarized in
Table 1.
For comparison, the table also gives as a reference
the values of the sequence prediction model (p53PET)
used in the publication of Wei et al.
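The per-fragment summary described above, the maximum posterior probability over all binding states, can be sketched as follows. The original analysis was carried out in R; this Python version is only an illustration, and the matrix layout, state names and probability values are hypothetical.

```python
def max_binding_probability(posteriors, binding_states):
    """Largest posterior probability of any binding state at any position."""
    return max(row[state] for row in posteriors for state in binding_states)


# Hypothetical posterior matrices: one list of per-position rows per fragment.
# Only two binding states are shown to keep the example small.
fragments = {
    "fragment_1": [
        {"B1": 0.02, "B2": 0.01, "R": 0.97},
        {"B1": 0.91, "B2": 0.03, "R": 0.06},
    ],
    "fragment_2": [
        {"B1": 0.01, "B2": 0.02, "R": 0.97},
        {"B1": 0.03, "B2": 0.01, "R": 0.96},
    ],
}

summary = {
    name: max_binding_probability(matrix, ("B1", "B2"))
    for name, matrix in fragments.items()
}
print(summary)
```

One value per fragment is what the histograms and density plots are built from.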
For the set of score 2, only those fragments that the authors considered
to have a predicted binding site with their model of binding sites
were included in the supplementary materials of the publication.
As random sequences, a collection of fragments was chosen from the human
genome that matches, in number and length of the fragments, the collection
of fragments with PET scores of 2 or more. This is a different collection
from that in the paper, which is not provided in their online materials.
Conclusions
The prediction results obtained are in good agreement with
the expected enrichment and with the results reported in the original study.
The distribution of the maximum binding probability for the
positive testing sequence sets
is in almost all cases
clearly bimodal, with enrichment at distinctly low and high probability
values.
The histogram plot (PlotHist.pdf) and the plot of smoothed
estimated densities (PlotNormDens.pdf) show this for the pooled
analysis of all the positive and negative test sequences.
This suggests that the upper cluster
might correspond to real binding sites, while the lower
cluster might be due to false positive isolated fragments.
Alternatively, there could be strong and weak p53
binding sites, or some fragments might be associated with p53
only indirectly, through an intermediate
protein complex that binds to DNA and to the p53 protein.
Materials
Histogram plot
PlotHist.pdf
Plot of estimated densities
PlotNormDens.pdf
Table of results
Table1.pdf
Mamot Model Definition Files:
A. Initial
p53InitialModel.txt
B. Learned
p53LearnedModel.txt
Sequence sets
separated by PET score, as used
for Learning or Testing:
A. The Training Set
seqsPETabove8.txt
B. The "positive" Testing Sets:
seqsPET2available.txt
seqsPET3.txt
seqsPET4.txt
seqsPET5.txt
seqsPET6.txt
seqsPET7.txt
seqsPETabove7.txt
C. The "negative" Testing Set: random genome sequences
seqsRANDOM.txt
For questions and comments, email us:
Frederic.Schutz@isb-sib.ch
Mauro.Delorenzi@isrec.ch