Skip to content

Commit 80cea17

Browse files
authored
Merge pull request #2 from sabifo4/main
Update scripts and README.md file to deal with Issue #1
2 parents 57fc543 + e2bd262 commit 80cea17

File tree

5 files changed

+125
-84
lines changed

5 files changed

+125
-84
lines changed

README.md

Lines changed: 95 additions & 50 deletions
Original file line numberDiff line numberDiff line change
@@ -1,62 +1,107 @@
11
# bs_inBV
2-
A tool to help generate the file in.BV when using "complex" substitution models for MCMCTree. Particularly, the hessian is approximated by a bootstrap approach.
32

4-
# Idea #
5-
People have increasingly recognized the importance of using more complex substitution models in modeling sequence evolution, particularly for deep-time evolution and in case the sequence divergence is large. MCMCTree supports only a number of substitution models. These models, however, may be available in other software specifically used for tree construction, e.g., IQ-Tree and RAxML.
3+
A tool with which the branch lengths, gradient, and Hessian used by `MCMCtree` to approximate the likelihood calculation are estimated with complex amino acid substitution models. Particularly, `bs_inBV` will return the `in.BV` file with the inferred branch lengths, gradient, and Hessian; the latter is approximated using a bootstrap approach.
4+
5+
## Idea
6+
7+
People have increasingly acknowledged the importance of using more complex amino acid substitution models in modeling sequence evolution, particularly when analysing deep phylogenies. `MCMCtree` supports various amino acid substitution models, being LG+F the most complex one. Nevertheless, more complex models are available in other phylogenetic software such as `IQ-TREE` or `RAxML`.
8+
9+
When enabling the approximate likelihood calculation with `MCMCtree` (i.e., `usedata = 2` followed by `usedata = 3`; please consult the [PAML documentation for details](https://github.com/abacus-gene/paml/blob/master/doc/pamlDOC.pdf)), `CODEML` (`BASEML` if nucleotide data) will be first "called" by `MCMCtree` when using `usedata = 2` to estimate the branch lengths, the gradient, and the Hessian under a maximum-likelihood approach; vectors and matrix that will be output in the so-called `in.BV` file. Subsequently, `MCMCtree` will use the branch lengths, gradient, and Hessian in this `in.BV` file to approximate the likelihood calculation during the MCMC when enabling `usedata = 3` in the control file. Particularly, `bs_inBV` has been written to use complex amino acid substitution models available in `IQ-TREE` to estimate the branch lengths and **approximate the Hessian matrix by calculating the negative inverse of the bootstrap covariance matrix of branch length estimates**. In addition, `bs_inBV` will then set all the elements of the gradient vector to zero.
610

7-
To work with the approximate likelihood method of MCMCTree (usedata=2), one needs to get the MLE (maximum likelihood estimate) of the branch length, as well as the gradient and hessian evaluated at the MLE. This tool takes the advantages of the abundant substitution models of IQ-Tree. Briefly, the tool uses IQ-Tree's estimate of the branch length, set the gradient to all zeros, and importantly, **approximate the hessian by calculating the negative inverse of the bootstrap covariance matrix of branch length estimates**.
811
![image](https://github.com/evolbeginner/bs_inBV/assets/8715751/6b7ae95a-f018-4331-8812-720601f637ed)
912

13+
## Installation
14+
15+
### Ruby gems
1016

11-
# Installation
12-
Make sure [RUBY](https://www.ruby-lang.org/en/) is installed.
17+
First and foremost, please make sure that [Ruby](https://www.ruby-lang.org/en/) is installed. In addition, you will need to install the following gems before running `bs_inBV` using the command `gem install <gem_name>`. The gems to be installed are the following:
1318

14-
Many scripts included in this product require the following RUBY packages. If any is not installed, please install it by `gem install package_name`.
1519
* parallel
1620
* colorize
21+
* find
22+
* getoptlong
23+
* time
24+
* fileutils
25+
* tmpdir
26+
* bio
27+
* bio-nwk
1728

18-
This product also employs several other computational tools. Please ensure that you have them installed.
19-
* [PAML](https://github.com/abacus-gene/paml)
20-
* [IQ-Tree](http://www.iqtree.org/)
21-
* [newick_utilities](https://github.com/tjunier/newick_utils)
22-
23-
# Usage #
24-
1. Perform a regular MCMCTree analysis. To save time, you can run with only the prior (e.g., set usedata = 0 in mcmctree.ctl) and very few iterations (set burnin=1, sampfreq=1, nsample=1 in mcmctree.ctl).
25-
The reason for this step is to make sure the order of the tips in the file in.BV generated by MCMCTree is the same as that of the species.tree specified by the user.
26-
27-
2. Extract a "reference" tree from the file in.BV generated by MCMCTree in the previous step.
28-
29-
`sed '4!d' in.BV > ref.tre`
30-
31-
3. Run the following
32-
33-
`ruby create_hessian_by_bootstrapping.rb --ali alignment --calibrated_tree species_tree --outdir C20 --ref ref.tre --force -b 1000 --cpu 8 -m LG+G+C20 --mcmctree_ctl mcmctree.ctl --run_mcmctree --pmsf`
34-
35-
Arguments:
36-
* `--ali`: alignment file
37-
* `--calibrated_tree`: the species tree used in regular MCMCTree
38-
* `--outdir`: output directory
39-
* `--force`: tells the program to overwrite the output if it exists
40-
* `-b`: the no. of bootstraps (note that only traditional bootstrapping is allowed as IQ-TRee's UFB cannot be used for a species tree with a fixed topology)
41-
* `--cpu`: no. of cores used in IQ-Tree
42-
* `-m`: the model passed to IQ-Tree
43-
* `--mcmctree_ctl`: the control file for MCMCTree
44-
* `--pmsf`: use the PMSF approximation
45-
46-
4. Output
47-
Go to the folder "Sample". Run `ruby create_hessian_by_bootstrapping.rb --ali combined.phy --calibrated_tree species.trees --outdir C20 --ref ref.tre --force -b 100 --cpu 8 -m LG+G+C20 --mcmctree_ctl mcmctree.ctl --run_mcmctree --pmsf`. Note that as detailed above, the files "ref.tre" and "mcmctree.ctl" are obtained by running a regular MCMCTree analysis prior to using create_hessian_by_bootstrapping. The example data are simulated using LG+C20+G assuming a root age of 4.0 Ga. For the details of simulation, see [link](https://sishuowang2022.wordpress.com/2023/06/20/substitution-models-lgcxx-and-cxxlg-differ-in-iq-tree/).
48-
* C20/split: IQ-Tree output
49-
* C20/mcmctree: MCMCTree results. The branch length estimated by the specified model (in the example "-m LG+G+C20"), gradients, and the approximated hessian by bootstrap in the in.BV
50-
51-
# Notes #
52-
1. With `--run_mcmctree`, MCMCTree will be run directly after generating in.BV. In case you want only the file in.BV based on your specified model say LG+G+C60, please do not use `--run_mcmctree`.
53-
2. Without `--pmsf` IQ-Tree is much slower because it is the traditional Cxx model that will be applied in IQ-Tree.
54-
3. In case of problems due to invertible variance-covariance matrix of the branch lengths, try either increasing bootstrap no. or avoiding too closely related species.
55-
4. For more details, please see **Note S4 and Figs. S9-S10** in the original refenrence (see below).
56-
5. As suggested by Sandra Alvarez [link](https://github.com/evolbeginner/bs_inBV/issues/1), **installation of some necessary packages are not clearly stated in the current version of README.md**. So users should be very careful when using v1.0 of the present tool.
57-
58-
# How to cite
59-
Dating the bacterial tree of life based on ancient symbiosis Sishuo Wang, Haiwei Luo bioRxiv 2023.06.18.545440; doi: https://doi.org/10.1101/2023.06.18.545440.
29+
### Third-party tools
30+
31+
In order to run `bs_inBV`, you will need to install the following programs:
32+
33+
* [`PAML`](https://github.com/abacus-gene/paml). Please refer to the [PAML Wiki](https://github.com/abacus-gene/paml/wiki#installation) for more details regarding the installation procedure.
34+
* [`IQ-TREE`](http://www.iqtree.org/). Please refer to the [IQ-TREE website](http://www.iqtree.org/#download) for more details regarding the installation procedure.
35+
* [`newick-utils`](https://github.com/tjunier/newick_utils). If you had any issues when installing `newick-utils` from the source code, please download [the pre-compiled binaries of Newick Utilities v1.6](https://web.archive.org/web/20210409163921/http://cegg.unige.ch/pub/newick-utils-1.6-Linux-x86_64-disabled-extra.tar.gz). These binaries are available on [this website](https://web.archive.org/web/20210409163921/http://cegg.unige.ch/newick_utils).
36+
* [`R`](https://cran.r-project.org/). Please make sure that you follow the installation instructions on the main website. WSL users will need to install R via the terminal too. In addition, please make sure that the `MASS` R package is installed as it is used by script [`reorder_node.rb`](reorder_node.rb).
37+
38+
### Changes to ruby scripts prior to running `bs_inBV`
39+
40+
At the time of writing, `bs_inBV` has pre-defined some paths an aliases to execute third-party tools in some of its ruby scripts. If you are to use this tool under a different environment, please make sure you change the following lines to match your PC settings:
41+
42+
* Users will need to update the path to PAML programs in line 40 in script [`do_mcmctree.rb`](lib/do_mcmctree.rb). If users have an alias to execute `MCMCtree`, they may also have to change line 43 accordingly.
43+
* If users have not exported `IQ-TREE` to their PATH or have an alias different from the one specified in line 26 in script [`create_hessian_by_bootstrapping.rb`](create_hessian_by_bootstrapping.rb), they will need to change that too. The same applies to lines 27-28 for `newick-utils` programs and line 29 for `MCMCtree` in that same script.
44+
45+
## Usage
46+
47+
* If you have already run `MCMCtree` and have the `in.BV` file, please run the following command to obtain a "reference" tree that will be subsequently used by `bs_inBV`:
48+
49+
```sh
50+
sed '4!d' in.BV > ref.tre
51+
```
52+
53+
* If you do not have an `in.BV` file, then you can run a quick dummy analysis with `MCMCtree` to obtain one. Note that the main reason for running this program now is just to make sure that the order of the tips that will be output in the `in.BV` is the same as that of the input tree file specified by the user. To save time, you can ask `MCMCtree` to sample from the prior (i.e., no data will be used, set `usedata = 0` in [the control file](sample/mcmctree.ctl)) and specify few iterations (e.g., set `burnin = 1`, `sampfreq = 1`, `nsample = 1` in [the control file](sample/mcmctree.ctl)). If you are using the control file given in the [`sample`](sample/mcmctree.ctl) directory, please also make sure that you set `usedata = 2`. If you have your own control file, please make the changes accordingly. Once you have a "dummy" `in.BV` file, you can extract the "reference" tree:
6054

61-
You may also need to cite corresponding papers for the use of PAML, IQ-Tree, and Newick_Utilities.
55+
```sh
56+
sed '4!d' in.BV > ref.tre
57+
```
58+
59+
* Now, you are ready to run `bs_inBV`! Please execute this tool by running the following command (make sure that the paths to the third-party tools have been modified to match your settings as [aforementioned](README.md#third-party-tools)):
60+
61+
```sh
62+
# Run from the same directory where you have this script
63+
# You can use relative/absolute paths to your input files
64+
ruby create_hessian_by_bootstrapping.rb --ali <path_to_alignment_file> --calibrated_tree <path_to_calibrated_tree_file> --outdir <path_to_output_directory> --ref <path_to_reference_tree>/ref.tre --force -b 1000 --cpu <number_CPUs_for_IQTREE_to_use> -m LG+G+C20 --mcmctree_ctl <path_to_MCMCtree_control_file>/mcmctree.ctl --run_mcmctree --pmsf
65+
```
66+
67+
> Options:
68+
>
69+
> * `--ali`: path to alignment file.
70+
> * `--calibrated_tree`: path to input calibrated tree file that will be used by `MCMCtree`.
71+
> * `--outdir`: path to the output directory.
72+
> * `--force`: tells the program to overwrite the output files, if any.
73+
> * `-b`: number of bootstraps that `IQ-TREE` will run. Please note that only traditional bootstrapping is allowed as the UFB approach implemented in `IQ-TREE` cannot be used when using a fixed tree topology (which is required by `MCMCtree`).
74+
> * `--cpu`: number of cores that will be used by `IQ-TREE`.
75+
> * `-m`: amino acid substitution model that `IQ-TREE` will use for branch lengths inference.
76+
> * `--mcmctree_ctl`: path to the control file that will execute `MCMCtree`. Please make sure that this control file has `usedata = 2 in.BV` if you want to run `MCMCtree` after `bs_inBV` outputs the `in.BV` file (enable this feature by adding option `--run_mcmctree`).
77+
> * `--pmsf`: use the PMSF approximation
78+
79+
### Example
80+
81+
There is a dataset that you can use in the [`sample`](sample) directory to test the usage of `bs_inBV`. First, please clone this repository and follow all the steps detailed above with regards to [gems, third-party tools, and code changes during the installation procedure](README.md#installation). Once everything is ready, then go to directory [`sample`](sample) and execute the following command:
82+
83+
```sh
84+
# Run this command from directory `sample`
85+
ruby ../create_hessian_by_bootstrapping.rb --ali combined.phy --calibrated_tree species.trees --outdir C20 --ref ref.tre --force -b 100 --cpu 8 -m LG+G+C20 --mcmctree_ctl mcmctree.ctl --run_mcmctree --pmsf
86+
```
87+
88+
Once the command above finishes, you will see the following output files:
89+
90+
* `C20/split`: this directory will contain all the output files generated by `IQ-TREE`. The name you choose for option `--outdir` will be given to the first directory (e.g., `C20` in this example).
91+
* `C20/mcmctree`: if option `--run_mcmctree` is not enabled, this directory will contain your input files and the `in.BV` file. If you decide to run `MCMCtree` by enabling option `--run_mcmctree`, then you will also find the output files generated by `MCMCtree` in this directory. Please note that the `in.BV` file will contain the vector of branch length estimated using the amino acid substitution model defined by the user (e.g.,`-m LG+G+C20` in this example), the gradient vector, and the Hessian matrix approximated using a bootstrap approach.
92+
93+
> **NOTE 1**: Please check above to learn how to generate file `ref.tre`. For more details about how to configure the control file for `MCMCtree`, please check the [PAML documentation](https://github.com/abacus-gene/paml/blob/master/doc/pamlDOC.pdf).
94+
> **NOTE 2**: The example data were simulated under model LG+C20+G and a root age of 4.0 Ga. For more details about the simulation, please [read this blog post](https://sishuowang2022.wordpress.com/2023/06/20/substitution-models-lgcxx-and-cxxlg-differ-in-iq-tree/).
95+
96+
## Notes
97+
98+
1. When using `--run_mcmctree`, `MCMCtree` will be execute right after generating the `in.BV` file. If you only want to obtain the `in.BV` file based on your specified amino model (e.g., LG+G+C60), please do not use `--run_mcmctree`.
99+
2. Without `--pmsf`, `IQ-TREE` is much slower because the traditional Cxx model is used by `IQ-TREE`.
100+
3. In case you experience any issues related to inverting the variance-covariance matrix of the branch lengths, please try to (i) increase the number of bootstrap or (ii) avoid including too closely-related species in your phylogeny.
101+
4. For more details, please see **Note S4 and Figs. S9-S10** in the original reference (see below).
102+
103+
## How to cite
104+
105+
Dating the bacterial tree of life based on ancient symbiosis Sishuo Wang, Haiwei Luo bioRxiv 2023.06.18.545440; doi: https://doi.org/10.1101/2023.06.18.545440.
62106

107+
You may also need to cite corresponding papers for the use of `PAML`, `IQ-TREE`, and `Newick Utilities`.

create_hessian_by_bootstrapping.rb

Lines changed: 7 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -9,9 +9,11 @@
99
require 'find'
1010
require 'colorize'
1111

12-
require_relative 'lib/Dir'
13-
require_relative 'lib/processbar'
14-
require_relative 'lib/do_mcmctree.rb'
12+
$LOAD_PATH << './lib'
13+
14+
require 'Dir.rb'
15+
require 'processbar.rb'
16+
require 'do_mcmctree.rb'
1517

1618

1719
############################################################
@@ -21,7 +23,7 @@
2123
DIR = File.dirname(__FILE__)
2224
LIB_DIR = File.join(DIR, 'lib')
2325

24-
IQTREE = 'iqtree'
26+
IQTREE = 'iqtree2'
2527
NW_STATS = 'nw_stats'
2628
NW_TOPOLOGY = 'nw_topology'
2729
MCMCTREE = 'mcmctree'
@@ -277,6 +279,4 @@ def split_ali_file(ali_file)
277279
`bash #{FIGTREE2NWK} -i FigTree.tre > figtree.nwk`
278280
puts "Done!" if $? == 0
279281
end
280-
end
281-
282-
282+
end

lib/Dir.rb

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,5 @@
1+
#! /usr/bin/env ruby
2+
13
require 'find'
24

35

@@ -76,6 +78,4 @@ def getFilesBySuffices(indir, suffices)
7678
def get_file_path(file)
7779
path = File.symlink?(file) ? File.readlink(file) : file
7880
return(path)
79-
end
80-
81-
81+
end

lib/do_mcmctree.rb

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -28,7 +28,7 @@
2828
require 'time'
2929
require 'colorize'
3030

31-
require 'Dir'
31+
require 'Dir.rb'
3232

3333

3434
#################################################################
@@ -37,7 +37,7 @@
3737
DIR = File.dirname(__FILE__)
3838
FIGTREE2NWK = File.join(DIR, "figtree2tree.sh")
3939

40-
PAML_DIR = File.expand_path("~/software/phylo/paml/")
40+
PAML_DIR = File.expand_path("~/PAML/bin/")
4141
MCMCTREE_CTL = File.join(PAML_DIR, 'mcmctree.ctl')
4242
CODEML_CTL = File.join(PAML_DIR, 'codeml.ctl')
4343
MCMCTREE = "mcmctree"
@@ -433,4 +433,4 @@ def run_codeml(outdir, seqtype, ndata, alpha, ncatG, cpu)
433433
puts
434434
puts
435435

436-
end
436+
end

sample/mcmctree.ctl

Lines changed: 17 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -1,33 +1,29 @@
11
seed = -1
2-
* seqfile = tmp/seqs.phy
3-
* treefile = tmp/calibration_tree.txt
42
seqfile = combined.phy
53
treefile = species.trees
64
mcmcfile = mcmc.txt
75
outfile = out.txt
86

9-
ndata = 1
10-
seqtype = 2
11-
usedata = 2 in.BV 1
12-
clock = 2
13-
* RootAge = 'B(3.20,4.51,0.025,0.01)' * safe constraint on root age, used if no fossil for root.
7+
ndata = 1 * Number of partitions
8+
seqtype = 2 * 0: nucleotides; 1:codons; 2:AAs
9+
usedata = 2 in.BV * 0: no data (prior); 1:exact likelihood;
10+
* 2:approximate likelihood; 3:out.BV (in.BV)
11+
clock = 3 * 1: global clock; 2: independent rates; 3: correlated rates
1412

15-
model = 0 * 0 : JC69, 1: K80, 2: F81, 3: F84, 4: HKY85
16-
alpha = 0.5
17-
ncatG = 4
18-
duplication = 0
19-
cleandata = 0 * remove sites With ambiguity data (1:yes, 0:no)?
13+
model = 3 * models for AAs or codon-translated AAs:
14+
* 0:poisson, 1:proportional,2:Empirical,3:Empirical+F
15+
* 6:FromCodon, 8:REVaa_0, 9:REVaa(nr=189) alpha = 0.5
16+
ncatG = 4 * No. categories in discrete gamma
17+
alpha = 0.5 * alpha for gamma rates at sites
2018

21-
BDparas = 1 1 0
19+
BDparas = 1 1 0.1 * birth, death, sampling
2220
kappa_gamma = 6 2 * gamma prior for kappa
2321
alpha_gamma = 1 1 * gamma prior for alpha
2422

25-
rgene_gamma = 1 50 1
26-
sigma2_gamma = 1 10 1
23+
rgene_gamma = 1 50 * gammaDir prior for rate for genes
24+
sigma2_gamma = 1 10 * gammaDir prior for sigma^2 (for clock=2 or 3)
2725

28-
finetune = 1: .1 .1 .1 .1 .1 .1 * auto (0 or 1): times, musigma2, rates, mixing, paras, FossilErr
29-
30-
print = 1
31-
burnin = 1000
32-
sampfreq = 2
33-
nsample = 5000
26+
print = 1 * 0: no mcmc sample; 1: everything except branch rates 2: everything
27+
burnin = 100000
28+
sampfreq = 1000
29+
nsample = 20000

0 commit comments

Comments
 (0)