************************************************ ************************************************ ** ** ** MAMOT ** ** a program for HMM modelling ** ** 2008, Mauro C. Delorenzi ** ** ** ************************************************ ************************************************/ Contact: Mauro.Delorenzi@isb-sib.ch or Frederic.Schutz@isb-sib.ch If you report errors please describe the problem, indicate the program version and the command line and provide access to all input files. ************************************************************************** FUNCTIONS IMPLEMENTED Generation of random sequences mamot -G Baum-Welch (BW) LEARNING mamot -B Viterbi (Vit) LEARNING mamot -V Forward-backward "posterior" DECODING mamot -D FB (forward(-backward)) PROBABILITY mamot -P Viterbi probability and DECODING mamot -Q ************************************************************************* QUICK START The file atcg.model in directory INPUT provides a simple model to describe simulated AT- and CG-rich regions in the genome. You can try the following commands: 1) Generate 100 (pseudo-)random sequences of maximal length 1000 using the model and store them in file "sequences" mamot -Gf -r4 -m INPUT/atcg.model -n 100 -l 1000 >dnasequences The -f option creates a tab-delimited file called GeneratedSequences.txt which contains the true state and parameters for each generated symbol. The -r option sets a seed, by default the same is used and two identical default runs give the same random sequences. 2) Using the Forward-Backward posterior decoding algorithm, we create a file "PMatrix" containing the calculated probability for each state and each symbol: mamot -D -m INPUT/atcg.model dnasequences This also produces a file "FBprob" with the log Probability of each sequence. 3) Retrain the model using the generated sequences and the Baum-Welch algorithm: mamot -B -i 20 -w 0.1 -m INPUT/atcg.model dnasequences The new model will be stored in file "FinalModel". The initial model will be stored in file "InitialModel" (this is useful to check if the provided model file has been correctly interpreted by mamot). in this case the exercice shows the degree of fluctuations to be expected, which can be high, especially if the number of generated sequences is small. 4) Retrain with the Viterbi method: mamot -V -m INPUT/atcg.model sequences This produces files as with Baum-Welch training. 5) Decode with the Viterbi method: mamot -Q -m INPUT/atcg.model sequences This produces a file "FBprob" with the log Probability of each sequence. ************************************************************************* USAGE - MORE EXAMPLES General options --------------- -h : display help page -v : verbose output (mostly sent to cerr) -m : Specify the file containing the model (default: INPUT/R1.model) -s : Specify the file containing the sequences (default: SEQUENCES/seqFile) -f : Write additional information in a file (depending on the command) -G: Generation of random sequences ---------------------------------- -r n : seed for the random number generator (default 1) -n n : number of sequences to generate -l n : maximum length of each sequence (default 1000, 0 for no limit) Examples: mamot -G -r5 -m INPUT/atcg.model -n 200 -l 5000 > sequences mamot -Gf -l500 -m INPUT/casino.model -n 2 mamot -G -l 10 -n 3 > SEQUENCES/seqFile; mamot -P; cat FBprob -B and -V: Learning a model --------------------------- -j n : minimal number of EM iterations (but not below 2) -i n : maximal number of EM iterations -w n : a multiplicative weight factor for the pseudocounts added to the emission counts -d n : a multiplicative weight factor for the pseudocounts added to the emission counts (default: 100) Examples: mamot -Batp -j 3 -i 8 -w 0.1 -d 500 -m INPUT/atcg.model -s SEQUENCES/dnasequences mamot -Vtp -j 3 -i 10 -d 50 -w0.01 -m INPUT/TF.model -s SEQUENCES/dnasequences2 -P: Computing the Probability given the model ---------------------------------------------- mamot -P -m modelfile seqFile Writes log Prob. to the file FBprob -D and -Q: Decoding ------------------- mamot -D -m modelfile seqFile Writes posterior state probabilitie to the file PMatrix also the same output as -P to the file FBprob mamot -Q -m modelfile seqFile Writes posterior state probabilitie to the file yyyyyyyyyyyyyyyyyyyyy - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - INPUT seqFile: file with sequencs, name at max 100 chars, multifasta format modelfile: file with model, name at max 100 chars, please see example file lines cannot be longer than 1000 chars *********************************************************************** OPTIONS -m : specify the file containing the model (default: INPUT/R1.model) -s : specify the file containing the sequences (default: SEQUENCES/seqFile) -h : display help page -v : verbose output -f : write additional information in a file (depending on the command) -n : specify nb of sequences to be generated (default: 1) -l : when generating sequences, max length of each generated sequence -r : specify the seed for the random generator when generating sequences -i : maximal number of iterations in Baum Welch (BW) and Viterbi -j : minimal number of iterations in BW and Viterbi -d : threshold of absolute value of change of total log likelihood to stop BW or Vit learning (default is kMINdifftotLogLik) -b : in Baum Welch and Viterbi Learning do not update transition probabilities -c : in Baum Welch and Viterbi Learning do not update emission probabilities -t : in Baum Welch and Viterbi Learning tie emission distributions of states in the same tie group (pool contributions in Baum Welch) -p : in Baum Welch and Viterbi Learning tie emission distributions of pairs of complementary DNA states (pool contributions in Baum Welch) -w : multiplicative factor for pseudocounts for emissions, 1 for standard pseudocount scheme, (default is 0: no pseudocounts are added) -y : number of pseudocounts for transitions (default is 0: no pseudocounts added) -a : write also intermediate update models to file "UpdateModels" (after each iteration in BW and Vit learning) -g : limit the additional output of -Q to values above a cutoff -x : use state weights given in the model file in decoding functions -z : use the list of state-position combinations given in the file "ext"P as additional probabilities in decoding. The program will handle only one sequence. It can be used theoretically also in learning but the use of one sequence is then a very limiting restriction. This option is incompatible with special treatment of the two strands of DNA (options -u and -e) **************************************************************************** OBSERVATIONS Pseudocounts ------------ Note that the addition of pseudocounts while regularizing parameters impairs the monotonous likelihood increase of the BW algorithm and can lead to model instability and converge problems. By experience we recommend addition of a small number of pseudocounts when the sample size of the training set is limited and no pseudocounts when rich data is available for training. Emission counts: By default no transition pseudocounts are added. The number of pseudocounts added is given by the product of the number indicated under "NULLMODEL" in the model file, the number of letters of the alphabet used and the weight specified with -w on the command line. The default weight is 0, so that -w 1 is needed to activate the use of the pseudocounts. Transition counts: By default no transition pseudocounts are added. A number can be specified with the option -y. This number is then added to all transitions except those starting or ending in the BEGIN or END state. In many cases the use of pseudocounts for transitions can lead to unexpected consequences and is not generally recommended. It can also increase model complexity enormously and slow down learning. State Weights ------------ Need to activate the use of state weights given in the model file with -x Tying ------------ In the algorithm there is first a sharing of the counts among the pairs that are complementary in the sense of DNA. In a second step the sharing of the counts among all members of a set of tied states. Therefore if pairs (Aand cA) and (B, cB) are complementary and there is tying between A and B and between cA and cB, at the end also A and cB respectively B and cA should results complementary, counts will be effectively shared among all 4 states, in learning the 4 states will be completely coordinated. **************************************************************************** LIMITATIONS & TROUBLESHOOTING One of the most frequent problem arises when a user inadvertently used a model file that defines a model in which any sequence has probability zero. The mistake might not be by the program and lead to unpredictable behavior, possibly including segmentation faults. Sometimes, the program might not read the model file as it was intended, this can be checked by running a learning algorithmus and comparing the model defined in the file "InitialModel" with the expected model (unless the program aborts earlier). Beware of the use of files from MacIntosh or Windows system, the end-of-line symbol must be converted to the Unix \n symbol. MAMOT writes certain output to files with fixed names, so every time the program is invoked these files are overwritten. HIDDEN MARKOV MODEL ------------------- states: any number, any name (within reason) up to 25 chars letters: max. 30 for now, can handle only capital "latin" letters properly (in readSeq) emission probabilities must be completely listed in the order given by the alphabet transition probabilities CONSTANTS --------- Most constants used are accessible in file globals.h (changes must be followed by compilation to get a new executable) In particular, the maximal length of a sequence might have to be increased for wide scanning of DNA sequences or reduced on machines with little RAM. GENERAL LIMITATIONS ------------------- The symbols that can be used for the Alphabet is currently limited to the 52 single letters ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz (except for generating sequences) There is no global checking if the model "makes sense". For example if a model does not allow reaching of the END state, when generating rsndom sequences it will never stop, unless a maximal length is provided. If all sequences have a probability of zero in the model, the bahavior of mamot is unpredictable, a warning might be issued when using -P to compute the probability of any sequence. Need to give a full set of emission probabilities also for the BEGIN and END states, although these numbers are not used and therefore arbitrary. To simplify the code in some locations mamot uses arbitrary constants and represents for example log of zero with a small negative number, the resulting approximation should be sufficient in almost all situations. ************************************************************************** OBSOLETE OPTIONS -k : during first iteration in learning store sequences in memory and do not read it from file in successive iterations, -k must be followed by -l -l : when using -k, maximal length of sequences used in training -u : use both strands of a DNA sequence, apply decoding to both strands one after the other; respectively use both independently in learning a model (not recommended) -e : use both strands of a DNA sequence "conjointly" (not recommended) ************************************************************************** **************************************************************************