##############################################################
## Readme.txt file
##
## This file contains the commands for compiling and 
## executing the Java programs, and explains how to read 
## the output
##############################################################

Download the zip file "GeneralMarkovModels.zip" for the source 
code and the example data sets.

####################################
Step 1: Program compilation
####################################

	This code was written for J2SE 1.4; therefore, compiling 
	it with a newer version of Java produces warnings. Use 
	the -Xlint switch to have javac list them in detail. We 
	will assume that the code for RBH or SBH is stored in the 
	directory C:/RBH or C:/SBH, respectively.

	C:/RBH > set classpath=.;
	C:/RBH > javac -Xlint MaximumAverageLikelihood.java

	## At this point a few warnings will be generated that can 
	## be safely ignored. They arise because newer versions of 
	## Java (J2SE >= 1.5) expect the element type of objects 
	## belonging to the collections framework to be declared 
	## explicitly (generics). 
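
	## As an illustration of where such warnings come from, the 
	## sketch below contrasts a J2SE 1.4-style raw collection with 
	## its generic equivalent. The class UncheckedWarningDemo is 
	## hypothetical and is not part of the distributed source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class for illustration only; not part of the distributed source.
class UncheckedWarningDemo {

    // J2SE 1.4 style: a raw List. Compiled with -Xlint on Java 5 or later,
    // the call to add() is reported as an "unchecked" warning.
    static String firstRaw() {
        List names = new ArrayList();   // raw type: element type not declared
        names.add("SpeciesA");          // warning is reported here
        return (String) names.get(0);   // explicit cast required
    }

    // Java 5+ style: the element type is declared, so no warning and no cast.
    static String firstGeneric() {
        List<String> names = new ArrayList<String>();
        names.add("SpeciesA");
        return names.get(0);
    }

    public static void main(String[] args) {
        // Both methods behave identically at run time.
        System.out.println(firstRaw());
        System.out.println(firstGeneric());
    }
}
```

	## Both methods return the same value, which is why the 
	## warnings do not affect the results.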

####################################
Step 2: Program execution 
####################################

	C:/RBH > java MaximumAverageLikelihood

	## At this stage the user is prompted to enter several input parameters.
	## An example is shown below.

	 ------- INPUT PARAMETERS -------
	No of sites: 1206
	No of data sets: 2
	Input file: C:/Data/Example_Data.txt
	Output file: C:/Data/RBH_Out.txt
	Tree Representation: (SpeciesA,(SpeciesB,SpeciesC),(SpeciesD,SpeciesE))
	Q-value iteration cut-off value: 0.0000001
	Estimate invariant sites (Y/N): Y
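
	## The tree in the example above is written in Newick format: 
	## parentheses group species into clades and commas separate 
	## siblings. As a minimal sketch (not the program's own parser), 
	## the code below extracts the species names from such a string 
	## and checks that the parentheses are balanced. The class 
	## NewickLeaves is hypothetical, and it assumes that the tree 
	## carries no branch lengths.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the distributed source.
class NewickLeaves {

    // Returns the leaf (species) names in left-to-right order.
    // Assumes names contain no parentheses, commas or semicolons,
    // and that the tree has no branch lengths.
    static List<String> leaves(String newick) {
        List<String> names = new ArrayList<String>();
        StringBuilder token = new StringBuilder();
        int depth = 0;
        boolean afterClose = false; // true if the current token follows ')'
        for (int i = 0; i < newick.length(); i++) {
            char c = newick.charAt(i);
            if (c == '(' || c == ')' || c == ',' || c == ';') {
                String name = token.toString().trim();
                // A label directly after ')' names an internal node, not a leaf.
                if (!afterClose && name.length() > 0) {
                    names.add(name);
                }
                token.setLength(0);
                afterClose = (c == ')');
                if (c == '(') depth++;
                if (c == ')') depth--;
                if (depth < 0) {
                    throw new IllegalArgumentException("Unbalanced ')' at position " + i);
                }
            } else {
                token.append(c);
            }
        }
        if (depth != 0) {
            throw new IllegalArgumentException("Unbalanced parentheses");
        }
        return names;
    }

    public static void main(String[] args) {
        String tree = "(SpeciesA,(SpeciesB,SpeciesC),(SpeciesD,SpeciesE))";
        // prints [SpeciesA, SpeciesB, SpeciesC, SpeciesD, SpeciesE]
        System.out.println(leaves(tree));
    }
}
```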

####################################
Understanding the input parameters
####################################

1. No of sites: The total number of matched nucleotide sites.
2. No of data sets: The number of data sets contained in the input file.
3. Input file: Data file containing the sequences in sequential 
	   format. The first line should contain the number of 
	   species and the number of matched sites. The subsequent 
	   lines contain the actual sequences such that the first 
	   10 characters correspond to the species name. The entire 
	   sequence for a particular species should be specified in 
	   the same line. The input data file can contain more than 
	   one data set. Refer to the example file.
4. Output file: Name of the output file.
5. Tree Representation: Newick representation of an unrooted tree.
6. Q-value iteration cut-off value: If the difference between the new 
	and previous estimates of the Q-matrices along an edge is less 
	than the cut-off value, the Q-matrix update along that edge 
	terminates. The lower the cut-off value, the longer the 
	execution time. However, lower cut-off values are likely to result 
	in more accurate parameter estimates.
7. Estimate invariant sites (Y/N): Enter Y to have the program estimate 
	the proportion of invariable sites, or N otherwise.
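
The input file layout described in item 3 can be illustrated with a short 
sketch. This is not the program's own reader: the class SequentialFormat 
is hypothetical, and it assumes that the two numbers on the header line are 
separated by whitespace and that data sets follow one another directly in 
the file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader for illustration only; not part of the distributed source.
class SequentialFormat {

    // Parses every data set in the text. Each data set starts with a header
    // line "<number of species> <number of sites>", followed by one line per
    // species: the first 10 characters are the name, the rest is the sequence.
    static List<Map<String, String>> parse(String text) {
        String[] lines = text.split("\\r?\\n");
        List<Map<String, String>> dataSets = new ArrayList<Map<String, String>>();
        int i = 0;
        while (i < lines.length) {
            if (lines[i].trim().length() == 0) { // skip blank lines (assumption)
                i++;
                continue;
            }
            String[] header = lines[i++].trim().split("\\s+");
            if (header.length < 2) {
                throw new IllegalArgumentException("Header needs two numbers, line " + i);
            }
            int species = Integer.parseInt(header[0]);
            int sites = Integer.parseInt(header[1]);
            Map<String, String> sequences = new LinkedHashMap<String, String>();
            for (int s = 0; s < species; s++) {
                if (i >= lines.length || lines[i].length() <= 10) {
                    throw new IllegalArgumentException("Missing or short sequence line " + (i + 1));
                }
                String line = lines[i++];
                String name = line.substring(0, 10).trim(); // fixed-width name field
                String sequence = line.substring(10).trim();
                if (sequence.length() != sites) {
                    throw new IllegalArgumentException(
                        name + ": expected " + sites + " sites, found " + sequence.length());
                }
                sequences.put(name, sequence);
            }
            dataSets.add(sequences);
        }
        return dataSets;
    }

    public static void main(String[] args) {
        // Two tiny made-up data sets, each with 2 species and 4 sites.
        String text = "2 4\n"
                    + "SpeciesA  ACGT\n"
                    + "SpeciesB  ACGA\n"
                    + "2 4\n"
                    + "SpeciesA  TTGT\n"
                    + "SpeciesB  TTGA\n";
        List<Map<String, String>> dataSets = parse(text);
        System.out.println(dataSets.size() + " data sets read");
        System.out.println(dataSets.get(0));
    }
}
```

Note that names shorter than 10 characters must be padded with spaces so 
that the sequence starts at the 11th character.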

####################################
Understanding the output parameters
####################################

Although the program generates several files for internal use, the file 
of interest for the end user is the one labelled "stats.txt". For each 
data set provided in the "Input file", this file contains the following 
information:
i. the log-likelihood
ii. pi_var, i.e., the vector of stationary probabilities for 
	variable sites (in the order A, C, G, T)
iii. the proportion of total sites that are invariable
iv. pi_inv, i.e., the probability that a site is of type A, C, G, 
	or T given that it is invariable
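
To show how these quantities relate, the sketch below combines them into 
the expected overall frequency of each nucleotide by the law of total 
probability. It assumes that the variable sites are at stationarity, so 
that pi_var describes them at every species. The class SiteFrequencies 
and the numbers in it are hypothetical; they are not taken from the 
program or its output.

```java
// Hypothetical helper for illustration only; not part of the distributed source.
class SiteFrequencies {

    // Expected overall frequency of A, C, G, T (in that order):
    //   pInv * piInv[k] + (1 - pInv) * piVar[k]
    // where pInv is the proportion of invariable sites.
    static double[] overall(double pInv, double[] piInv, double[] piVar) {
        double[] freq = new double[4];
        for (int k = 0; k < 4; k++) {
            freq[k] = pInv * piInv[k] + (1.0 - pInv) * piVar[k];
        }
        return freq;
    }

    public static void main(String[] args) {
        // Made-up values, chosen only to illustrate the calculation.
        double pInv = 0.2;
        double[] piInv = {0.25, 0.25, 0.25, 0.25};
        double[] piVar = {0.30, 0.20, 0.20, 0.30};
        double[] freq = overall(pInv, piInv, piVar);
        String bases = "ACGT";
        for (int k = 0; k < 4; k++) {
            System.out.println(bases.charAt(k) + ": " + freq[k]);
        }
    }
}
```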