+--------------------------------------------------------------------+
|   Phylogenetic tests of the molecular clock and linearized tree    |
+--------------------------------------------------------------------+

These programs run on MS-DOS.

This self-extracting file has five programs:
    njboot  -- construct a neighbor-joining (NJ) tree
    postree -- create a PostScript file of trees
    tpcv    -- conduct the two-cluster test
    branch  -- conduct the branch length test
    branbst -- conduct the branch length test by bootstrap
  and example files:
    crab.dat -- 13 mtDNA sequences of crab species
    crab.nj  -- an NJ treefile of the crab sequences
    crab.tc  -- output of tpcv (result of the two-cluster test)
    crab.cn  -- output of tpcv (linearized tree)
    crab.br  -- output of branch (result of the branch length test)
    crab.ps  -- PostScript file for crab.br (output of postree)

    njboot.exe, postree.exe, tpcv.exe, and branbst.exe are 16-bit executables.
    They work only for small data sets.

    njboot2.exe, postree2.exe, tpcv2.exe, branch2.exe, and branbst2.exe are
    32-bit executables. They will work for large data sets.


Make sure that you use binary mode when you transfer the files by ftp.
The programs may not run on your data, because they were originally
developed on a Unix system and use a large amount of memory.
In that case it may be better to compile the programs on your machine, either
with a large memory model and a large stack size or with optimization for
a specific processor. The source code is provided in njboot.src.exe,
postree.src.exe, tpcv.src.exe, and branbst.src.exe.


There are two tests of rate constancy: (1) the two-cluster test and (2) the
branch length test. The two-cluster test is essentially the relative rate test
extended to many sequences. The branch length test examines whether the rate of
each sequence under the root of the tree differs from the average rate of all
sequences. Both tests assume that the tree topology is given and
that the outgroup is known, so that the root of the tree is also known.

Thus, to carry out these tests, first construct a neighbor-joining (NJ) tree.

(1) Sequence inputfile for njboot, tpcv, branch, and branbst
The file format is like the PHYLIP package format, but without the line that
gives the number of sequences and the number of sites.
Each sequence must be given on one line, with its name and the actual
sequence separated by at least one space.
All sequences must have the same number of sites.

 name1  AACT.........
 name2  TACT.........
 name3  TAAT.........
  .
  .


symbols   - : indels
          ? : missing 

In the analysis, all sites that include an indel or a missing character in
any sequence are eliminated.
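
The input format and the site elimination described above can be sketched in
Python as follows. This is only an illustration, not the code used by njboot
or tpcv; the function names read_sequences and drop_incomplete_sites and the
three short example sequences are made up for this sketch.

```python
def read_sequences(text):
    """Parse lines of the form 'name  SEQUENCE'; blank lines are skipped."""
    seqs = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if len(parts) != 2:
            raise ValueError(f"expected 'name sequence', got: {line!r}")
        name, seq = parts
        seqs[name] = seq.upper()
    if len({len(s) for s in seqs.values()}) > 1:
        raise ValueError("all sequences must have the same number of sites")
    return seqs


def drop_incomplete_sites(seqs):
    """Remove every site at which any sequence has '-' (indel) or '?' (missing)."""
    names = list(seqs)
    columns = zip(*(seqs[name] for name in names))
    kept = [col for col in columns if "-" not in col and "?" not in col]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


# Hypothetical data: the fifth site has an indel in name1 and a missing
# character in name3, so it is removed from all three sequences.
example = """
name1  AACT-GA
name2  TACTAGA
name3  TAAT?GC
"""
seqs = read_sequences(example)
clean = drop_incomplete_sites(seqs)
for name, seq in clean.items():
    print(name, seq)
```

Checking a data file this way before running the programs can catch lines of
unequal length early.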

(2) Construct the NJ tree
To construct an NJ tree, type

   njboot inputfile -d[distance option] > treefile

  If you type just njboot, it prints a usage description such as

njboot inputfile -b[bootstrap number] -d[distace option] -o[output file] -s[seed]
distance option
0    JC distance
1    K2P distance
2    Tajima Nei distance
3    Tamura distance
4    Gamma distance
5    amino Poisson
6    amino p-distance
7    amino Gamma
8    Tamura Nei
9    Tamura Nei Gamma
10    p-distance
11    p-distance with gap
12    V(proportion)
13    V K2P
14    V K2P gamma
15    V Hasegawa
16    V Hasegawa gamma
17    number of difference

 For example, type

    njboot crab.dat -d1 > crab.nj1

   crab.nj1 is the treefile of an NJ tree constructed with Kimura's
   two-parameter (K2P) distance.

(2) Printing the tree
To print the NJ tree, type
 postree crab.nj1 -o 1 -i

 in which sequence 1 (A-salina) is specified as the outgroup. You can use more
 than one sequence as the outgroup. This will create a PostScript file crab.ps.
 I assume that your machine is connected to a PostScript printer.
 So, type 
          print crab.ps 
  The NJ tree will be printed. The numbers on the interior nodes are the node
  numbers. You will need these numbers to read the result of the two-cluster
  test. If your machine is not connected to a PostScript printer, take
  this file to any computer connected to a PostScript printer and print it.
  For example, on a PC, type PRINT followed by the file name.


(3) Two-cluster test
 Type 
     tpcv crab.dat -tcrab.nj -d2 -o 1 > crab.tc

The file crab.tc will look like this:

file crab.dat: nseq=13 nsite=421 npsite=155 nasite=373 K2P-distance
outgroup 1 A-salina
 
node  L    R    delta     s.e.        Z     CP   height     s.e.        bA       bB       bC
  14  2 <  3 0.029684 0.023945 1.239691 78.14% 0.114878 0.014329  0.100036 0.129720 0.156687
  17 12 > 13 0.001399 0.001401 0.998365 67.78% 0.001344 0.001346  0.002043 0.000645 0.133376
  16 10 > 11 0.005836 0.005458 1.069374 71.08% 0.008156 0.003349  0.011074 0.005237 0.135988
  18 16 > 17 0.011517 0.007633 1.508842 86.64% 0.018945 0.004916  0.024703 0.013187 0.143052
  19  4 >  5 0.001700 0.004502 0.377639 28.86% 0.006763 0.003032  0.007613 0.005913 0.129617
  21  8 >  9 0.000141 0.001838 0.076647  5.58% 0.002695 0.001911  0.002766 0.002625 0.131972
  22  6 < 21 0.002440 0.004971 0.490855 37.58% 0.012308 0.004015  0.011088 0.013529 0.132144
  24 22 <  7 0.002165 0.010235 0.211477 16.64% 0.026045 0.005761  0.024963 0.027127 0.130204
  23 24 > 19 0.004014 0.008728 0.459944 34.72% 0.031728 0.005907  0.033735 0.029721 0.148330
  20 23 > 18 0.003459 0.012176 0.284107 22.06% 0.050815 0.007466  0.052544 0.049085 0.233646
  15 20 < 14 0.017302 0.028276 0.611912 45.82% 0.132218 0.012820  0.123567 0.140870 0.201980
Q=6.535938

This gives the result of the two-cluster test at each interior node.
For example, the first line shows node 14, which is connected to two
descendant nodes (sequences) 2 and 3. The schematic view of this is:

     | bC           delta = | bA - bB |
     |
   node 14        The difference between the branch lengths bA and bB is given
 bA / \           as delta, which in this case is 0.029684, and its standard
   /   \ bB       error is 0.023945.
  2     \         The Z value (Z = delta/s.e.) is 1.239691 and its CP value
         3        (= 1 - p-value) is 78.14%. This is not significant.

Here, the two clusters under node 14 contain one sequence each, but
the procedure is the same for clusters that contain many sequences.
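
The arithmetic for one row of the table can be retraced with a short Python
sketch. It takes bA, bB, and the standard error from the node 14 row and
assumes that CP is one minus the two-sided p-value of a standard normal
deviate. That assumption and the function name two_cluster_row are mine; the
CP printed by tpcv may differ slightly from this calculation, and Z differs in
the last digits because the printed inputs are rounded.

```python
import math


def two_cluster_row(b_a, b_b, se):
    """Return (delta, Z, CP) for one interior node.

    CP is computed as 1 - p for a two-sided normal test (an assumption
    of this sketch), i.e. CP = erf(Z / sqrt(2)).
    """
    delta = abs(b_a - b_b)
    z = delta / se
    cp = math.erf(z / math.sqrt(2.0))
    return delta, z, cp


# Values copied from the node 14 row of crab.tc.
delta, z, cp = two_cluster_row(b_a=0.100036, b_b=0.129720, se=0.023945)
print(f"delta={delta:.6f}  Z={z:.4f}  CP={100.0 * cp:.2f}%")
```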

In our paper, we describe a method for forcing rate constancy on a given
topology; the resulting tree is called a linearized tree. The linearized tree
is written to the file crab.cn. In this example all the branch lengths of the
linearized tree are positive, so only crab.cn is created. In other cases you
may also find a file whose suffix is cng. This happens when some branch lengths
of the linearized tree become negative. Those negative branch lengths are
forced to zero and the resulting tree is stored in sequencefile.cng.
In the example of the crab data, the height of node 14 from a tip of the tree is 
   H = (bA + bB)/2 = 0.114878  and its standard error is 0.014329.

At the end of the result, Q=6.535938 is given. This is the statistic for a
test of rate constancy that combines the rate differences at all the interior
nodes under the root. It is a chi-square test with n-1 degrees of freedom,
where n is the number of sequences under the root (n=12 in this case).
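
To turn Q into a p-value without statistical tables, the upper tail of the
chi-square distribution can be computed with the Python standard library. The
sketch below uses the standard finite series for integer degrees of freedom;
the function name chi2_upper_tail is hypothetical, and reading Q against the
upper tail is my interpretation of the test described above.

```python
import math


def chi2_upper_tail(x, df):
    """P(X >= x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2-1} (x/2)^r / r!
        term, series = 1.0, 0.0
        for r in range(df // 2):
            series += term
            term *= (x / 2.0) / (r + 1)
        return math.exp(-x / 2.0) * series
    # Odd df: normal tail plus a finite series in chi = sqrt(x).
    chi = math.sqrt(x)
    tail = math.erfc(chi / math.sqrt(2.0))  # two-sided normal tail
    density = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    term, series = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        series += term              # chi^(2r-1) / (1*3*...*(2r-1))
        term *= x / (2 * r + 1)
    return tail + 2.0 * density * series


# Q and n are taken from the crab example: n = 12 sequences under the root.
q_value, n_seq = 6.535938, 12
p_value = chi2_upper_tail(q_value, n_seq - 1)
print(f"Q={q_value}  df={n_seq - 1}  p={p_value:.3f}")
```

Under this reading, the p-value for the crab data is well above 0.05, so rate
constancy is not rejected at the 5% level.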

If you type just tpcv, it prints the following description.
tpcv  inputfile -t[treefile] -o [outgroup1 outgroup2 ... ] -d[distance] -c[codon position] -g[gamma a  value]
distance option
0 p-distance
1 JC-distance
2 K2P-distance
3 amino acid p-distance
4 amino acid Poisson correction
5 JC distance gamma
6 K2P distance gamma
7 amino acid gamma
8 Tajima and Nei
9 V K2P
10 V K2P gamma
11 V Hasegawa
12 V Hasegawa gamma
13 Tamura Nei
14 Tamura Nei gamma

Note that the distance option numbers differ from those of njboot. For example,
-d2 selects the K2P distance in tpcv but the Tajima-Nei distance in njboot.

(4) Branch length test
  Type
                          
  branch crab.dat -tcrab.nj -d2 -o 1 > crab.br

 crab.br is a treefile in which the branch lengths are reestimated by
 ordinary least squares.
 Type
  postree crab.br -o 1
  print crab.ps

 In the printed tree, the numbers in parentheses that follow the sequence names
 are the CP values (= 1 - p-value) of the branch length test.
More detailed results are given at the end of the treefile crab.br.
Display crab.br (for example, with cat crab.br on Unix or TYPE crab.br on
MS-DOS) to see the following.

branch length reestimation: K2P-distance was used
#sites 421   #actual sites 373   #polymorphic sites 155
/* sequence file: crab.dat: treefile: crab.nj
  nseq=13 nsite=421 npsite=155 nasite=373 K2P-distance
outgroup: 1 A-salina*/
average root-to-tip length 0.126451
3 C-sp. delta 0.029261 (0.026560) Z=1.101704  72.86%
2 C-vittat delta 0.000424 (0.026301) Z=0.016104   0.80%
13 P-l(GU) delta 0.011417 (0.009086) Z=1.256597  78.88%
12 P-l(NE) delta 0.010018 (0.009020) Z=1.110615  73.30%
11 P-p(GU) delta 0.002119 (0.010271) Z=0.206326  15.86%
10 P-p(NE) delta 0.003717 (0.010391) Z=0.357729  27.36%
5 P-camtsc delta 0.005026 (0.008548) Z=0.588035  43.80%
4 L-aequit delta 0.003326 (0.009184) Z=0.362163  28.12%
7 L-splend delta 0.001462 (0.009797) Z=0.149180  11.14%
9 P-acadia delta 0.000040 (0.008770) Z=0.004560   0.00%
8 P-bernha delta 0.000181 (0.008655) Z=0.020900   1.60%
6 E-tenuim delta 0.002330 (0.008825) Z=0.264015  20.52%
Q=6.535938

  +--------------------  A-salina (outgroup)
  |  b1 +--------------------- C-sp  
  |  +--|         b2
  +--|  +------------------ C-vittat
root |          b3
     +-- 

  This test examines, for each sequence, the difference between its
  root-to-tip distance and the average root-to-tip distance of all sequences
  under the root. The root-to-tip distance is the sum of the branch lengths
  from the root to a tip. For example, the root-to-tip distances for
  sequences C-sp and C-vittat are b1+b2 and b1+b3, respectively.
  In the result, delta is the difference of the root-to-tip distance from
  the average. After delta come its standard error (in parentheses), the
  Z value (delta/s.e.), and the CP value. The Q value is again the statistic
  for a test of rate constancy that combines all sequences. It is a chi-square
  test with n-1 degrees of freedom, where n is the number of sequences under
  the root.
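
  The root-to-tip distances and their deltas can be illustrated with a small
  Python sketch. The branch lengths, the third tip "other-tip", and the
  function names are hypothetical. The sketch also assumes that the average
  is a plain arithmetic mean and that delta is reported as an absolute value,
  which the program may or may not do in exactly this way.

```python
def root_to_tip(children, root):
    """Sum branch lengths from the root down to every tip.

    children maps an interior node to a list of (child, branch_length);
    any node that is not a key of children is treated as a tip.
    """
    dist = {}
    stack = [(root, 0.0)]
    while stack:
        node, d = stack.pop()
        if node not in children:
            dist[node] = d
        else:
            for child, length in children[node]:
                stack.append((child, d + length))
    return dist


def deltas_from_average(dist):
    """Return the mean root-to-tip distance and each tip's |distance - mean|."""
    mean = sum(dist.values()) / len(dist)
    return mean, {tip: abs(d - mean) for tip, d in dist.items()}


# Hypothetical branch lengths, labelled as in the diagram above.
b1, b2, b3 = 0.03, 0.10, 0.12
children = {
    "root": [("node", b1), ("other-tip", 0.12)],
    "node": [("C-sp", b2), ("C-vittat", b3)],
}
dist = root_to_tip(children, "root")
mean, delta = deltas_from_average(dist)
for tip in sorted(dist):
    print(f"{tip:10s} root-to-tip={dist[tip]:.4f} delta={delta[tip]:.4f}")
```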


Type branbst, and it will give a brief description.

branbst  inputfile -t[treefile] -d[distance] -o[outgroup] -c[codon position] -g[gamma a  value] -b[bootstrap replications] -s[seed]
0 JC distance
1 K2P distance
2 Tajima Nei distance
3 Tamura distance
4 Gamma distance
5 amino Poisson
6 amino p-distance
7 amino Gamma
8 Tamura Nei
9 Tamura Nei Gamma
10 p-distance
11 p-distance with gap
12 V(proportion)
13 V K2P
14 V K2P gamma
15 V Hasegawa
16 V Hasegawa gamma


(5) Treefile
The treefile used for tpcv, branch, and branbst is compatible with the output 
of METREE and NJBOOT2. The treefiles that tpcv, branch, and branbst
output can be viewed on screen with TREEVIEW.


The treefile can be converted to Newick format by cnvtre in this distribution.

cnvtre treefile > treefile.nwk

The treefile in Newick format can be displayed by MEGA or other software.

(6) Use a tree topology different from the output of njboot

   Sometimes you may want to use a tree topology produced by parsimony or the
   maximum likelihood method, which may differ from the one njboot generated.
   My programs cannot read treefiles from PHYLIP or PAUP.
   However, the format of the treefile for my programs is quite simple.
   So, if the number of your sequences is not too large, it is not difficult 
   to write a treefile for an arbitrary topology yourself.

This is a treefile of 6 crab sequences.
------------------------------------------------------------
   6 sequences
1 A-salina
2 C-vittat
3 C-sp.
4 L-aequit
5 P-camtsc
6 E-tenuim
  7 and   2        0.105263
  7 and   3        0.121527
  8 and   1        0.193376
  8 and   7        0.033368
  9 and   8        0.089167
  9 and   6        0.035602
 10 and   9        0.020522
 10 and   4        0.006968
 10 and   5        0.006558
-----------------------------------------------------------
The first line indicates the number of sequences in the data.
Then the names of the sequences are given. Each line gives the sequential
number and the name of one sequence. Below the sequence names, the tree
topology is given. Each line describes one branch of the tree: the numbers of
the nodes at the two ends of the branch and the branch length. For example,
the line
  7 and   2        0.105263
indicates that sequence 2 (C-vittat) is connected to internal node 7 and
that the branch length is 0.105263. You can see this in the actual tree shown
below.

    +----------------------------- 1 A-salina
    |               
    |               +----------- 2 C-vittat
    |       +-------7
    |       |       +----------- 3 C-sp
    +-------8       
            |   +--------------- 6 E-tenuim
            |   |
            +---9     +--------- 4 L-aequit
                +----10
                      +--------- 5 P-camtsc

To make a treefile for n sequences:
 1) Give internal nodes sequential numbers from n+1 to 2n-2 (the example above
    uses 7 to 10 for n=6). The order for assigning the numbers to the
    internal nodes does not matter.
 2) Describe each branch, following the above format. You can give any value
    to the branch length. You have 2n-3 branches.

Please remember that the names of the sequences in the sequence file and
those in the treefile must be exactly the same.
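
A hand-written treefile can be checked before use with a short Python sketch
like the one below. It reads only the layout shown in the 6-sequence example
and is not part of this distribution; the function names parse_treefile and
check_treefile are hypothetical. The checks assume a fully bifurcating tree,
in which every sequence ends exactly one branch and every internal node joins
three branches.

```python
TREEFILE = """\
   6 sequences
1 A-salina
2 C-vittat
3 C-sp.
4 L-aequit
5 P-camtsc
6 E-tenuim
  7 and   2        0.105263
  7 and   3        0.121527
  8 and   1        0.193376
  8 and   7        0.033368
  9 and   8        0.089167
  9 and   6        0.035602
 10 and   9        0.020522
 10 and   4        0.006968
 10 and   5        0.006558
"""


def parse_treefile(text):
    """Return (n, names, branches) from a treefile in the layout shown above."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0].split()[0])
    names = {}
    for ln in lines[1:n + 1]:
        number, name = ln.split(None, 1)
        names[int(number)] = name.strip()
    branches = []
    for ln in lines[n + 1:]:
        node_a, word, node_b, length = ln.split()
        if word != "and":
            raise ValueError(f"unexpected branch line: {ln!r}")
        branches.append((int(node_a), int(node_b), float(length)))
    return n, names, branches


def check_treefile(n, names, branches):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if sorted(names) != list(range(1, n + 1)):
        problems.append("sequence numbers must run from 1 to n")
    if len(branches) != 2 * n - 3:
        problems.append(f"expected {2 * n - 3} branches, found {len(branches)}")
    degree = {}
    for node_a, node_b, _ in branches:
        degree[node_a] = degree.get(node_a, 0) + 1
        degree[node_b] = degree.get(node_b, 0) + 1
    for tip in range(1, n + 1):
        if degree.get(tip, 0) != 1:
            problems.append(f"sequence {tip} must end exactly one branch")
    for node, count in sorted(degree.items()):
        if node > n and count != 3:
            problems.append(f"internal node {node} joins {count} branches, not 3")
    return problems


n, names, branches = parse_treefile(TREEFILE)
problems = check_treefile(n, names, branches)
print("treefile looks consistent" if not problems else "\n".join(problems))
```

The sequence names returned in names can then be compared with the names in
the sequence file.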



(7) If you have problems or comments, please send an email to

                          takezaki@med.kagawa-u.ac.jp

				Naoko Takezaki
                                     Kagawa University
				     1750-1 Ikenobe, Mikicho, Kitagun
				     Kagawa 761-0793
                                     Japan
                                     



