Example Application of Mamot

Creating a model for p53 transcription factor binding sites

Frederic Schuetz and Mauro Delorenzi

Initial model

An initial model was generated using information in Table 1 of the publication by Hoh et al. (2002). The "Weight Matrix" there essentially gives the observed nucleotide frequencies in 37 positive examples.
To create a first set of emission probability vectors, these absolute frequencies were regularized by adding pseudocounts: 0.1 for each nucleotide and position, except for those nucleotides considered incompatible with binding in the "Filter" matrix part of their Table 1. For the latter, a pseudocount of 0.00004 was used, which gives an emission probability of about 1E-6 in these cases.
The resulting regularized relative frequencies of the four nucleotides at each position were taken as initial values of the Emission Probability Matrix. A mamot HMM model file was created and called InitialModelp53.txt.
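The regularization step can be illustrated with a minimal Python sketch. The counts below are hypothetical and are not taken from Hoh et al.; only the pseudocount values (0.1 and 0.00004) and the total of 37 examples come from the text above. The sketch also assumes that a filtered nucleotide has an observed count of zero.

```python
def regularize(counts, pseudocounts):
    """Convert absolute nucleotide counts at one position into
    emission probabilities by adding pseudocounts and normalizing."""
    totals = [c + p for c, p in zip(counts, pseudocounts)]
    norm = sum(totals)
    return [t / norm for t in totals]

# Hypothetical counts for A, C, G, T at one position (37 examples in total).
# T is assumed to be marked as incompatible with binding in the filter.
counts = [30, 5, 2, 0]
pseudocounts = [0.1, 0.1, 0.1, 0.00004]

probs = regularize(counts, pseudocounts)
for nucleotide, p in zip("ACGT", probs):
    print(nucleotide, p)
```

With 37 observations, the filtered nucleotide ends up with 0.00004 / 37.30004, which is about 1.07E-6 and matches the order of magnitude quoted above.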

Refined model

A refined model was obtained by training the HMM on the sequences published by Wei et al. (2006). These were obtained with paired-end ditag (PET) sequencing, a promising new technology for characterizing DNA fragments recovered by chromatin immunoprecipitation after binding of the p53 protein.


This model has emitting states for each of the 20 informative binding positions, one for a possible spacer between the first and second groups of 10 informative positions, and one state for "random" DNA sequence.
The 10th binding position is connected with the 11th binding position and with the spacer state. The spacer state is connected with itself; therefore spacers of any length are possible, although the parameters chosen favor short spacers, generally of 0-2 nucleotides.
The random state is connected to itself to allow for arbitrary length of sequences between binding sites.
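The effect of the spacer topology on spacer length can be sketched as follows. The transition probabilities used here (p_enter, p_stay) are hypothetical values chosen for illustration, not the parameters of the actual model files, and the sketch assumes that the spacer state leaves to the 11th binding position with the remaining probability.

```python
def spacer_length_probability(k, p_enter, p_stay):
    """Probability of a spacer of exactly k nucleotides.

    p_enter: transition probability from binding position 10 to the spacer
             (position 10 goes directly to position 11 with 1 - p_enter).
    p_stay:  self-transition probability of the spacer state
             (assumed to leave to position 11 with 1 - p_stay).
    """
    if k == 0:
        return 1.0 - p_enter
    return p_enter * p_stay ** (k - 1) * (1.0 - p_stay)

# Hypothetical parameters, for illustration only.
p_enter, p_stay = 0.3, 0.3
dist = [spacer_length_probability(k, p_enter, p_stay) for k in range(6)]
for k, p in enumerate(dist):
    print(k, p)
```

Because the spacer state has a self-transition, the length distribution has a geometric tail: every length is possible, but with small p_enter and p_stay most of the probability mass sits on lengths 0-2.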

Training model parameters

mamot -Bu -i 12 -w 0.01 -b -m InitialModelp53.txt -s seqsscoreM8
The Baum-Welch algorithm is applied (-B) to both strands (-u) of each fragment independently. The transition parameters are not changed during this learning, only the emission probabilities (-b; this can also be specified by a comment line in the model definition file instead of as an option on the command line). A small number of pseudocounts (0.01, option -w) is added. For the training, only the sequences with a PET score of 8 or more are used. The score is the number of overlapping sequence fragments isolated in the experiment. The number of iterations was limited to a maximum of 12 (option -i), with good convergence occurring in 8-10 iterations.
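What "both strands" and "PET score of 8 or more" mean for the training data can be sketched in Python. This is only an illustration of the selection: mamot handles the strands itself through -u, and the (sequence, score) input layout and function names below are assumptions, not part of mamot.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def training_strands(fragments, min_score=8):
    """Keep fragments with a PET score >= min_score and return
    both strands of each as separate training sequences.

    fragments: list of (sequence, pet_score) pairs (hypothetical layout).
    """
    strands = []
    for seq, score in fragments:
        if score >= min_score:
            strands.append(seq)
            strands.append(reverse_complement(seq))
    return strands

# Toy input: only the first fragment passes the score threshold.
print(training_strands([("AACG", 9), ("TTTT", 3)]))
```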

Evaluating and applying the model

As in the original publication, binding site search was applied separately to the subsets stratified by PET score. False positive prediction rates were estimated by using a set of background genomic sequences. For each isolated fragment in the pool of PET fragment sequences, a sequence of the same length was selected pseudo-randomly from the human genome sequence.
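The construction of a length-matched background set can be sketched as follows. This is a minimal sketch, not the procedure actually used: the genome is represented as a single plain string, and skipping windows that contain N is an added assumption that the text does not mention.

```python
import random

def sample_background(genome, length, rng, max_tries=1000):
    """Draw one window of `length` bases at a random position in `genome`.
    Windows containing N are skipped (an assumption of this sketch)."""
    if length > len(genome):
        raise ValueError("fragment is longer than the genome string")
    for _ in range(max_tries):
        start = rng.randrange(len(genome) - length + 1)
        window = genome[start:start + length]
        if "N" not in window.upper():
            return window
    raise RuntimeError("no N-free window found")

def matched_background(fragments, genome, seed=0):
    """Return one background sequence per fragment, matched in length."""
    rng = random.Random(seed)
    return [sample_background(genome, len(f), rng) for f in fragments]

# Toy genome and fragments, for illustration only.
toy_genome = "ACGT" * 50 + "N" * 10 + "GGCA" * 50
toy_fragments = ["ACGTACGTAC", "GGGCCC", "TTTTTTTTTTTTTTT"]
background = matched_background(toy_fragments, toy_genome)
print([len(b) for b in background])
```

Matching number and length fragment by fragment keeps the comparison between the PET and random sets fair, since longer sequences offer more positions at which a spurious site can be predicted.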

Binding site probabilities

Binding site probabilities were calculated by applying the Forward-Backward algorithm to compute posterior probabilities for all the states in the model. Program call example:
mamot -Du -m p53LearnedModel -s seqs/sequencefile
The generated matrix of probabilities was imported into R for statistical analysis.
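The posterior computation can be illustrated with a small, self-contained Python version of the Forward-Backward algorithm with per-position scaling. This is a sketch on a toy two-state model (a background state and a single "site" state that prefers G); it is not mamot's implementation, and all parameter values are hypothetical.

```python
def posterior_state_probs(obs, init, trans, emit):
    """Scaled Forward-Backward.

    obs:   list of symbol indices
    init:  init[i]     = P(first state is i)
    trans: trans[i][j] = P(next state is j | current state is i)
    emit:  emit[i][s]  = P(symbol s | state i)
    Returns post[t][i] = P(state at position t is i | obs).
    """
    n, T = len(init), len(obs)
    fwd, scale = [], []
    for t in range(T):
        if t == 0:
            row = [init[i] * emit[i][obs[0]] for i in range(n)]
        else:
            row = [emit[j][obs[t]] * sum(fwd[t - 1][i] * trans[i][j] for i in range(n))
                   for j in range(n)]
        c = sum(row)                      # scaling factor for position t
        scale.append(c)
        fwd.append([x / c for x in row])
    bwd = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            bwd[t][i] = sum(trans[i][j] * emit[j][obs[t + 1]] * bwd[t + 1][j]
                            for j in range(n)) / scale[t + 1]
    return [[fwd[t][i] * bwd[t][i] for i in range(n)] for t in range(T)]

INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}
init = [0.9, 0.1]                          # state 0: background, state 1: site
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.25, 0.25, 0.25, 0.25],          # background: uniform
        [0.05, 0.05, 0.85, 0.05]]          # site: prefers G
obs = [INDEX[ch] for ch in "AAGGGGAA"]
post = posterior_state_probs(obs, init, trans, emit)
for ch, row in zip("AAGGGGAA", post):
    print(ch, row)
```

Each row of the result sums to 1, and the probability of the site state is higher inside the G run than in the flanking A positions.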


The isolated fragments should be enriched for p53 binding sites over their frequency in random sequences, and the enrichment should be higher for higher scores.
To identify fragments containing a predicted binding site, the maximum of the probabilities for the binding states was extracted for each fragment. The distributions of these values for the PET (red) and the random (blue) sequences are shown in the linked plots and summarized in Table 1. For comparison, the table also gives the values of the sequence prediction model (p53PET) used in the publication of Wei et al.
For the set with score 2, only those fragments that the authors considered to have a predicted binding site according to their model were included in the supplementary materials of the publication. As random sequences, a collection of fragments was chosen from the human genome that matches, in number and length of fragments, the collection of fragments with PET scores of 2 or more. This is a different collection from that in the paper, which is not provided in their online materials.
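The per-fragment summary described above (the maximum over the binding states) can be sketched in Python. The original analysis was done in R; the matrix layout (one row per sequence position, one column per state), the state indices, and the 0.5 threshold below are all assumptions made for illustration.

```python
def max_binding_probability(posteriors, binding_states):
    """Largest posterior probability of any binding state at any position.

    posteriors:     list of rows, one per sequence position;
                    row[s] is the posterior probability of state s.
    binding_states: column indices of the binding-position states.
    """
    return max(row[s] for row in posteriors for s in binding_states)

def fraction_above(values, threshold):
    """Fraction of fragments whose maximum reaches the threshold."""
    return sum(v >= threshold for v in values) / len(values)

# Toy posterior matrices with 3 states; states 0 and 1 stand in for
# binding states and state 2 for the random state.
pet_fragments = [
    [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]],
    [[0.05, 0.05, 0.9], [0.0, 0.1, 0.9]],
]
pet_max = [max_binding_probability(m, [0, 1]) for m in pet_fragments]
print(pet_max, fraction_above(pet_max, 0.5))
```

Applying the same two functions to the PET and to the random fragments gives the two distributions that are compared in the plots and in the table.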


The prediction results obtained are in good agreement with this expectation and with the results reported in the original study.
The distribution of the maximum binding probability for the positive testing sequence sets is in almost all cases clearly bimodal, with enrichment at distinctly low and high probability values. The histogram plot (PlotHist.pdf) and the plot of smoothed estimated densities (PlotNormDens.pdf) show this for the pooled analysis of all the positive and negative test sequences.
This suggests that the upper cluster might correspond to real binding sites, while the lower cluster might be due to false positive isolated fragments. Alternatively, there could be strong and weak p53 binding sites, or some fragments might be associated with p53 only indirectly, through a protein complex that binds both to DNA and to the p53 protein.


Histogram plot

Plot of P-values

Table of results

Mamot Model Definition Files:
A. Initial
B. Learned

Sequence sets

separated by PET score, as used for Learning or Testing:
A. The Training Set
B. The "positive" Testing Sets:
C. The "negative" Testing Set: random genome sequences

For questions and comments, Email us

Please send comments on web pages to bcf@isb-sib.ch