Example Application of Mamot
Creating a model for p53 Transcription factor binding sites
Frederic Schuetz and Mauro Delorenzi
An initial model was generated using the information in
Table 1 of the publication by Hoh et al. (2002).
The "Weight Matrix" there gives essentially observed frequencies in 37 positive examples.
To create a first set of Emission Probability vectors,
these absolute frequencies were regularized by adding a pseudocount of 0.1
for each nucleotide
and position, except for those considered incompatible with binding in the
"Filter" matrix part of their Table 1.
For the latter a pseudocount of 0.00004 was used, which gives an emission probability
of about 1E-6 in these cases.
The resulting regularized relative frequencies
of the four nucleotides at each position were taken as initial values
of the Emission Probability Matrix.
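The regularization described above can be sketched in a few lines of Python. This is only an illustration: the counts and the set of "incompatible" nucleotides are made-up placeholders rather than the values from Table 1 of Hoh et al., the function name is hypothetical, and the sketch assumes that nucleotides marked as incompatible were never observed (count 0).

```python
# Sketch only: the counts and the "forbidden" set below are made-up
# placeholders, not the values from Table 1 of Hoh et al.
NUCLEOTIDES = "ACGT"
ALLOWED_PSEUDOCOUNT = 0.1        # value given in the text
FORBIDDEN_PSEUDOCOUNT = 0.00004  # value given in the text


def regularize(counts, forbidden):
    """Turn one position's absolute counts into emission probabilities.

    counts    -- dict nucleotide -> observed count at this position
    forbidden -- set of nucleotides the filter marks as incompatible
    """
    adjusted = {}
    for nt in NUCLEOTIDES:
        pseudo = FORBIDDEN_PSEUDOCOUNT if nt in forbidden else ALLOWED_PSEUDOCOUNT
        adjusted[nt] = counts[nt] + pseudo
    total = sum(adjusted.values())
    return {nt: value / total for nt, value in adjusted.items()}


# Hypothetical position: 37 sites, all with C, and G marked as incompatible.
example_counts = {"A": 0, "C": 37, "G": 0, "T": 0}
probs = regularize(example_counts, forbidden={"G"})
print(probs)
```

With 37 observed sites, the small pseudocount leads to a probability close to 0.00004 / 37.3, which is where the value of about 1E-6 comes from.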
A mamot HMM model file was created and called InitialModelp53.txt.
A refined model was obtained by training the HMM with the sequences published
by Wei et al. (2006).
These were obtained with paired-end ditag (PET) sequencing, a promising new technology
for characterizing DNA fragments recovered by chromatin immunoprecipitation
after binding of the p53 protein.
This model has emitting states for each of the 20 informative binding positions,
one for a possible spacer between the first and second groups of 10
informative positions, and one state for "random" DNA sequence.
The 10th binding position is connected with the 11th binding position
and with the spacer state. The spacer state is connected with itself,
so spacers of any length are possible, although the parameters
chosen favor short spacers of generally 0-2 nucleotides.
The random state is connected to itself to allow for arbitrary length
of sequences between binding sites.
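The effect of the spacer self-loop on spacer length can be illustrated with a small Python sketch. The two transition probabilities are invented for the example (the actual values are in the Mamot model file), and the sketch assumes that the spacer state can only loop on itself or move on to the 11th binding position.

```python
# Hypothetical transition probabilities; the real values are in the
# Mamot model file and are not reproduced here.
P_ENTER_SPACER = 0.3  # position 10 -> spacer (assumed)
P_STAY_SPACER = 0.2   # spacer -> spacer (assumed)


def spacer_length_probability(length, p_enter=P_ENTER_SPACER, p_stay=P_STAY_SPACER):
    """Probability that the spacer between positions 10 and 11 has `length` bases."""
    if length == 0:
        # Position 10 connects directly to position 11.
        return 1.0 - p_enter
    # Enter the spacer, loop (length - 1) times, then leave to position 11.
    return p_enter * p_stay ** (length - 1) * (1.0 - p_stay)


for length in range(5):
    print(length, round(spacer_length_probability(length), 4))
```

Any length has a non-zero probability, but the geometric decay introduced by the self-loop concentrates most of the mass on very short spacers.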
Training model parameters
mamot -Bu -i 12 -w 0.01 -b -m InitialModelp53.txt -s seqsscoreM8
The Baum-Welch algorithm is applied (-B) to both strands (-u)
of each fragment independently. The transition parameters are
not changed during training; only the emission probabilities are re-estimated
(-b; this can also be specified by
a comment line in the model definition file instead of as an
option on the command line).
A small number of pseudocounts (0.01, option -w) is added.
For the training, only the sequences with a PET score of 8 or more
are used. The score is the number of overlapping sequence
fragments isolated in the experiment.
The number of iterations was limited to a maximum of 12 (option -i),
with good convergence occurring in 8-10 iterations.
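As an illustration of what an emission-only update with pseudocounts looks like, here is a minimal Python sketch of a single re-estimation step. It is a generic textbook-style update, not Mamot's code: the function name, the toy posteriors and the exact way the pseudocount enters the expected counts are assumptions.

```python
NUCLEOTIDES = "ACGT"


def reestimate_emissions(sequence, posteriors, n_states, pseudocount=0.01):
    """One emission update from posterior state probabilities.

    sequence   -- DNA string
    posteriors -- posteriors[t][k] = P(state k at position t | sequence)
    Returns emissions[k][nucleotide]; transitions are left untouched.
    """
    # Start every expected count at the pseudocount so nothing becomes zero.
    expected = [{nt: pseudocount for nt in NUCLEOTIDES} for _ in range(n_states)]
    for t, nt in enumerate(sequence):
        for k in range(n_states):
            expected[k][nt] += posteriors[t][k]
    emissions = []
    for k in range(n_states):
        total = sum(expected[k].values())
        emissions.append({nt: expected[k][nt] / total for nt in NUCLEOTIDES})
    return emissions


# Toy example with two states and made-up posteriors.
toy_sequence = "ACGA"
toy_posteriors = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [1.0, 0.0]]
new_emissions = reestimate_emissions(toy_sequence, toy_posteriors, n_states=2)
print(new_emissions[0])
```

In a full training run this step would be repeated, with the posteriors recomputed from the updated emissions each time, until the likelihood stops improving or the iteration limit is reached.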
Evaluating and applying the model
As in the original publication, binding site search was applied
separately to the subsets stratified by PET score.
False positive prediction rates were estimated using a set of
background genomic sequences. For each isolated
fragment in the pool of PET fragment sequences,
a sequence of the same
length was selected at random from the human genome sequence.
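The construction of such a length-matched background set can be sketched as follows. The genome string, the fragment lengths and the function name are placeholders; a real run would read chromosome sequences from files and would probably also need to handle unsequenced regions, which this sketch ignores.

```python
import random


def sample_background(genome, fragment_lengths, seed=0):
    """Draw one random genomic window per fragment, with the same length."""
    rng = random.Random(seed)
    background = []
    for length in fragment_lengths:
        if length > len(genome):
            raise ValueError("fragment longer than the genome sequence")
        start = rng.randint(0, len(genome) - length)  # bounds are inclusive
        background.append(genome[start:start + length])
    return background


toy_genome = "ACGTTGCAAGCTTAGGCTAACGTAGCTAGGATCC"  # placeholder sequence
pet_lengths = [5, 12, 8]                            # placeholder lengths
controls = sample_background(toy_genome, pet_lengths)
print(controls)
```

Matching the lengths one to one keeps the number of candidate positions per sequence the same in the positive and the background set, so that the per-fragment statistics are comparable.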
Binding site probabilities
Binding site probabilities were calculated by applying the Forward-Backward
algorithm to compute posterior probabilities for all the states in the model.
Program call example:
mamot -Du -m p53LearnedModel -s seqs/sequencefile
The generated matrix of probabilities was imported into R for statistical analysis.
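For readers unfamiliar with the Forward-Backward computation, the following Python sketch shows how posterior state probabilities are obtained for a generic HMM. It is not Mamot's implementation; the two-state toy model and all names are assumptions chosen only to keep the example short.

```python
def posterior_probabilities(sequence, initial, transition, emission):
    """Scaled forward-backward; returns post[t][k] = P(state k at t | sequence).

    initial[k], transition[j][k] (from j to k) and emission[k][symbol]
    describe a generic HMM; they are not the p53 model.
    """
    n = len(initial)
    length = len(sequence)

    # Forward pass, rescaling each column to sum to 1 to avoid underflow.
    forward, scales = [], []
    for t, symbol in enumerate(sequence):
        if t == 0:
            column = [initial[k] * emission[k][symbol] for k in range(n)]
        else:
            column = [
                emission[k][symbol]
                * sum(forward[t - 1][j] * transition[j][k] for j in range(n))
                for k in range(n)
            ]
        scale = sum(column)
        forward.append([value / scale for value in column])
        scales.append(scale)

    # Backward pass, reusing the scale factors of the forward pass.
    backward = [[1.0] * n for _ in range(length)]
    for t in range(length - 2, -1, -1):
        nxt = sequence[t + 1]
        for j in range(n):
            backward[t][j] = sum(
                transition[j][k] * emission[k][nxt] * backward[t + 1][k]
                for k in range(n)
            ) / scales[t + 1]

    # With this scaling, forward * backward is already the posterior.
    return [[forward[t][k] * backward[t][k] for k in range(n)] for t in range(length)]


# Toy model: state 0 is AT-rich, state 1 is GC-rich (made-up parameters).
toy_initial = [0.5, 0.5]
toy_transition = [[0.9, 0.1], [0.1, 0.9]]
toy_emission = [
    {"A": 0.45, "C": 0.05, "G": 0.05, "T": 0.45},
    {"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05},
]
post = posterior_probabilities("AAAAGGGG", toy_initial, toy_transition, toy_emission)
for row in post:
    print([round(p, 3) for p in row])
```

Each row of the result corresponds to one sequence position and each column to one state, which is the kind of matrix that can then be analysed further.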
The isolated fragments should be enriched for p53 binding sites
over their frequency in random sequences, and the enrichment
should be higher for higher scores.
To identify fragments containing a predicted binding site, the
maximum of the probabilities for the binding states was extracted
for each fragment. The distributions of these values for the PET (red) and
the random sequences (blue) are shown in the linked plots and summarized in
the Table. For comparison, the Table also gives as reference
the values of the sequence prediction model (p53PET)
used in the publication of Wei et al.
For the set of score 2, only those fragments that the authors considered
to have a predicted binding site with their model of binding sites
were included in the supplementary materials of the publication.
As random sequences, a collection of fragments was chosen from the human
genome that matches, in number and length of the fragments, the collection
of fragments with PET scores of 2 or more. This is a different collection
from that in the paper, which is not provided in their online materials.
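The per-fragment summary described above, the maximum binding-state probability, can be sketched in Python as follows (the analysis on this page was done in R). The posterior matrices, the column layout, the threshold and the function names are all invented for the illustration; the sketch takes the maximum over all positions and all binding-state columns, which is one plausible reading of the procedure.

```python
def max_binding_probability(posteriors, binding_states):
    """Largest posterior over all positions and all binding-state columns."""
    return max(row[k] for row in posteriors for k in binding_states)


def fraction_above(values, threshold):
    """Share of fragments whose maximum exceeds the threshold."""
    return sum(value > threshold for value in values) / len(values)


# Made-up posterior matrices (rows = positions, columns = states);
# column 0 stands for the "random" state, columns 1-2 for binding states.
pet_fragments = [
    [[0.1, 0.8, 0.1], [0.2, 0.1, 0.7]],
    [[0.9, 0.05, 0.05], [0.95, 0.03, 0.02]],
]
random_fragments = [
    [[0.99, 0.005, 0.005], [0.98, 0.01, 0.01]],
    [[0.97, 0.02, 0.01], [0.96, 0.02, 0.02]],
]
BINDING = [1, 2]
THRESHOLD = 0.5  # arbitrary cut-off, for illustration only

pet_max = [max_binding_probability(f, BINDING) for f in pet_fragments]
random_max = [max_binding_probability(f, BINDING) for f in random_fragments]
print(fraction_above(pet_max, THRESHOLD), fraction_above(random_max, THRESHOLD))
```

Comparing the fraction of fragments above a cut-off in the PET set with the same fraction in the random set gives a rough estimate of the enrichment and of the false positive rate at that cut-off.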
The prediction results obtained are in good agreement with
this expectation and with the results reported in the original study.
The distribution of the maximum binding probability for the
positive testing sequence sets
is in almost all cases
clearly bimodal, with enrichment at distinctly low and high probability values.
The histogram plot (PlotHist.pdf) and the plot of smoothed
estimated densities (PlotNormDens.pdf) show this for the pooled
analysis of all the positive and negative test sequences.
This suggests that the upper cluster
might correspond to real binding sites, while the lower
cluster might be due to false positive isolated fragments.
Alternatively, there could be strong and weak p53
binding sites, or some fragments might be associated with p53
only indirectly, through an intermediate
protein complex that binds both to DNA and to the p53 protein.
Plot of P-values
Table of results
Mamot Model Definition Files:
separated by PET score, as used
for Learning or Testing:
A. The Training Set
B. The "positive" Testing Sets:
C. The "negative" Testing Set: random genome sequences