##############################################################
## Readme.txt file
##
## This file contains the commands for compiling the Java
## programs, executing the Java programs and reading the
## output
##############################################################

Download the zip file "GeneralMarkovModels.zip" for the source code and the
example data sets.

####################################
Step 1: Program compilation
####################################

This code was written in J2SE 1.4; to compile it with a newer version of
Java, pass the -Xlint option to javac. We will assume that the code for RBH
or SBH is stored in the directory C:/RBH or C:/SBH.

C:/RBH > set classpath=.;
C:/RBH > javac -Xlint MaximumAverageLikelihood.java

## At this point a few warnings will be generated that can
## be safely ignored. These warnings are generated because
## in the newer versions of Java (J2SE >= 1.5), the element
## type of objects belonging to the collections framework
## is expected to be explicitly declared.

####################################
Step 2: Program execution
####################################

C:/RBH > java MaximumAverageLikelihood

## At this stage the user is prompted to enter several input parameters.
## An example is shown below.

------- INPUT PARAMETERS -------
No of sites: 1206
No of data sets: 2
Input file: C:/Data/Example_Data.txt
Output file: C:/Data/RBH_Out.txt
Tree Representation: (SpeciesA,(SpeciesB,SpeciesC),(SpeciesD,SpeciesE))
Q-value iteration cut-off value: 0.0000001
Estimate invariant sites (Y/N): Y

####################################
Understanding the input parameters
####################################

1. No of sites: The total number of matched nucleotide sites.
2. No of data sets: The number of data sets contained in the input file.
3. Input file: Data file containing the sequences in sequential format.
   The first line should contain the number of species and the number of
   matched sites.
   The subsequent lines contain the actual sequences such that the first 10
   characters correspond to the species name. The entire sequence for a
   particular species should be specified on the same line. The input data
   file can contain more than one data set. Refer to the example file.
4. Output file: Name of the output file.
5. Tree Representation: Newick representation of an unrooted tree.
6. Q-value iteration cut-off value: If the difference between the new and
   previous estimates of the Q-matrices along an edge is less than the
   cut-off value, the Q-matrix update along that edge terminates. The lower
   the cut-off value, the longer the execution time. However, lower cut-off
   values are likely to result in more accurate parameter estimates.

####################################
Understanding the output parameters
####################################

Though the program generates several files for internal use, the file of
interest for the end-user is the one labelled "stats.txt". For each data set
provided in the "Input file", this file contains the following information:

i.   log-likelihood
ii.  pi_var, i.e., the vector of stationary probabilities for variable
     sites (in the order A, C, G, T)
iii. the proportion of total sites that are invariable
iv.  pi_inv, i.e., the probability that a site is of type A, C, G, or T
     given that it is invariable
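
####################################
Optional: checking the input file
####################################

The input file layout described under "Understanding the input parameters"
can be checked before a run. The sketch below is a hypothetical helper and is
not part of GeneralMarkovModels.zip; the class name InputFormatCheck and its
methods are made up for illustration. It assumes that the header line lists
the number of species first and the number of sites second, separated by
whitespace, that blank lines between data sets may be skipped, and that any
whitespace inside a sequence can be ignored. The actual program may be
stricter or more lenient on these points.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the distributed code). */
final class InputFormatCheck {

    /** One data set: species names and sequences, in file order. */
    static final class DataSet {
        final List<String> names = new ArrayList<>();
        final List<String> sequences = new ArrayList<>();
    }

    /** Parses all data sets; throws IllegalArgumentException on a malformed file. */
    static List<DataSet> parse(List<String> lines) {
        List<DataSet> dataSets = new ArrayList<>();
        int i = 0;
        while (i < lines.size()) {
            String header = lines.get(i++).trim();
            if (header.isEmpty()) {
                continue; // assumption: blank lines between data sets are allowed
            }
            // Assumption: "<number of species> <number of sites>".
            String[] parts = header.split("\\s+");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Bad header line: " + header);
            }
            int species = Integer.parseInt(parts[0]);
            int sites = Integer.parseInt(parts[1]);

            DataSet ds = new DataSet();
            for (int s = 0; s < species; s++, i++) {
                if (i >= lines.size() || lines.get(i).length() < 10) {
                    throw new IllegalArgumentException("Missing or short sequence line");
                }
                String line = lines.get(i);
                // The first 10 characters hold the species name.
                String name = line.substring(0, 10).trim();
                // Assumption: whitespace inside the sequence is not significant.
                String seq = line.substring(10).replaceAll("\\s+", "");
                if (seq.length() != sites) {
                    throw new IllegalArgumentException(
                        name + ": expected " + sites + " sites, found " + seq.length());
                }
                ds.names.add(name);
                ds.sequences.add(seq);
            }
            dataSets.add(ds);
        }
        return dataSets;
    }

    public static void main(String[] args) throws IOException {
        List<DataSet> sets = parse(Files.readAllLines(Paths.get(args[0])));
        System.out.println("Data sets found: " + sets.size());
    }
}
```

Run against the input file, the number of data sets reported should match the
value entered for "No of data sets", and the number of sites in the header of
each data set should match "No of sites".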