Table of Contents
BAli-Phy is a Unix command line program that is developed primarily on Linux. BAli-Phy also runs on Windows and Mac OS X, but it is not a GUI program and so you must run it in a terminal. Therefore, you might want to keep a Unix tutorial or Unix cheat sheet handy while you work.
In addition to the main bali-phy executable, BAli-Phy comes with a collection of small command-line utilities such as alignment-cat, trees-consensus, etc. These utilities can be used to process alignments, assemble data sets, and summarize the results of MCMC.
We typically run BAli-Phy on workstations with at least 8Gb of RAM and 2 cores. More cores will allow you to run more MCMC chains at once, and more RAM will allow you to run larger data sets. However, it is often easier and faster to run BAli-Phy on a (Linux) computing cluster, if you have one available.
If you have previously installed bali-phy, you do not have to remove the old version before installing the new version. Simply follow the installation instructions for the new version. If you are manually adding the new version of bali-phy to your PATH, just make sure that the new version comes before the old version in the PATH, or remove the old version from the PATH.
In order to remove an older version, simply delete the directory bali-phy-
. This will completely uninstall the old version from the system. BAli-Phy does not create hidden files that will remain after you remove its directory.oldversion
First check that you have a 64-bit version of the Windows operation system installed. The executables for download will only run on a 64-bit installation of Windows.
Before you can use BAli-Phy on Windows, you need to install a Unix command-line environment. We recommand installing Cygwin. You may then access the Unix command line environment by running the Cygwin shell (not the normal windows command line). The Cygwin shell mounts the C:
drive on /cygdrive/c/
, so you can access the directory C:/Users/
as /cygdrive/c/Users/
from within the Cygwin shell, for example.
While running the Cygwin installer setup-x86_64.exe
, you will be given an opportunity to select additional packages.
From Science, select R.
From Math, select gnuplot.
From Interpreters, select perl.
From Web, select wget.
From Editors, select nano.
You can re-run the installer to add packages that you did not add during the initial install.
C:/
method because it is compiled as a
native windows executable.
The combination of native windows executables (which want C:/
)
and the Cygwin shell (which wants /cygdrive/c/
) can be
confusing. If you supply Cygwin filenames with
/cygdrive/
to native windows executables like BAli-Phy, then it
may complain that the files cannot be found.
You can optionally use MSYS2 instead of Cygwin. Both MSYS2 and Cygwin can be installed at the same time. After installing MSYS2, You may access the Unix command line environment by running the MSYS2 shell (not the normal windows command line). The MSYS2 shell mounts the C:
drive on /c/
, so you can access the directory C:/Users/
as /c/Users/
from within the MSYS2 shell, for example.
After installing MSYS2 you will need to install a few packages before you proceed. Run the MSYS2 shell, and enter the command:
%
pacman -S perl tar
After installing a Unix command line, use it to download and extract the bali-phy executables from the website:
%
mkdir -p ~/Applications
%
cd ~/Applications
%
wget http://www.bali-phy.org/files/bali-phy-3.4-win64.tar.gz
%
tar -zxf bali-phy-3.4-win64.tar.gz
Second, check that the bali-phy executable runs:
%
~/Applications/bali-phy-3.4/bin/bali-phy --version
You still need to add it to your PATH as described in Section 2.6, “Add BAli-Phy to your PATH
”.
First install the XCode (version 6 or higher) command line tools:
%
xcode-select --install
Then install homebrew and use homebrew to compile and install bali-phy:
%
brew tap brewsci/bio
%
brew install bali-phy
Check that the executable runs:
%
bali-phy --version
If you install with homebrew, you don't need to do anything extra to put bali-phy in your PATH.
Open a windows in the Terminal app to access the UNIX command line. Then download and extract the executables:
%
mkdir -p ~/Applications
%
cd ~/Applications
%
curl -O http://www.bali-phy.org/files/bali-phy-3.4-mac64.tar.gz
%
tar -zxf bali-phy-3.4-mac64.tar.gz
Check that the executable runs:
%
~/Applications/bali-phy-3.4/bin/bali-phy --version
You still need to add it to your PATH as described in Section 2.6, “Add BAli-Phy to your PATH
”.
You can install gnuplot via homebrew:
%
brew install gnuplot
You can install R via homebrew:
%
brew tap caskroom/cask
%
brew cask install xquartz
%
brew install r
However, note that this might conflict with R installed from other places, such as MRAN.
Check that the executable runs:%
sudo apt-get install bali-phy
If you install with apt-get, you don't need to do anything extra to put bali-phy in your PATH.%
bali-phy --version
First install wget. If you have Debian or Ubuntu Linux, type:
%
sudo apt-get install wget
Then download and extract the executables:
%
mkdir -p ~/Applications
%
cd ~/Applications
%
wget http://www.bali-phy.org/files/bali-phy-3.4-linux64.tar.gz
%
tar -zxf bali-phy-3.4-linux64.tar.gz
Second, check that the executable runs:
%
~/Applications/bali-phy-3.4/bin/bali-phy --version
You still need to add it to your PATH as described in Section 2.6, “Add BAli-Phy to your PATH
”.
If you have Debian or Ubuntu Linux, you can install other recommended programs by typing:
%
sudo apt-get install gnuplot
%
sudo apt-get install r-base
First check if the executable is in your PATH.
%
bali-phy --version
If this shows version info, then bali-phy is already in your PATH and you can skip this section. This should be true if you installed bali-phy using a package manager such as homebrew or apt, or if you've already added it to your PATH.
If bali-phy is not in your path, then you should see:
%
bali-phy --version
bali-phy: command not found.
If bali-phy is not in your PATH, then continue with this section.
Add bali-phy to your PATH, so that the shell knows where to find it. This command only affects the terminal in which it is typed, and will not affect new terminals:
%
export PATH=~/Applications/bali-phy-3.4/bin:$PATH
To set the PATH automatically for new terminals, type:
%
test -r ~/.bash_profile && echo 'export PATH=~/Applications/bali-phy-3.4/bin:$PATH' >> ~/.bash_profile
%
echo 'export PATH=~/Applications/bali-phy-3.4/bin:$PATH' >> ~/.profile
This will affect new terminals only after you log out and log back in though.
Now check that the executable runs:
%
bali-phy --version
If it does, then your PATH is set up correctly, and you can probably skip the rest of this section.
If you installed BAli-Phy to the directory
~/Applications
, then you can run
bali-phy by typing ~/Applications/bali-phy-3.4/bin/bali-phy.
However, it would be much nicer to simply type
bali-phy and let the computer find the
executable for you. This can be achieved by putting the directory
that contains the BAli-Phy executables into
your "path". The "path" is a colon-separated list of directories that is
searched to find program names that you type. It is stored in an
environment variable called PATH
.
Setting your PATH
is also a pre-requisite for running
the bp-analyze script to summarize your
MCMC runs.
You can examine the current value of this environment variable by typing:
%
echo $PATH
We will assume that you extracted the bali-phy archive in
~/Applications
and so you want to add
$HOME/Applications/bali-phy-3.4/bin
to your PATH
. (If you installed to another directory,
replace $HOME/Applications/bali-phy-3.4/
with that directory.)
The commands for doing this depend on what "shell" you are using. Type echo $SHELL to find out. If your shell is sh or bash then the command looks like this:
%
PATH=$HOME/Applications/bali-phy-3.4/bin:$PATH
If your shell is csh or tcsh, then the command looks like this:
%
setenv PATH $HOME/Applications/bali-phy-3.4/bin:$PATH
Note that these commands will only affect the window you are typing in, and will vanish when you reboot.
To make this change survives when you logout or reboot, open your shell configuration file in a text editor, and add the command on a line by itself. This will ensure that it is run every time you log in.
To find the right configuration file, look in your $HOME directory
for .profile
(for the Bourne shell sh),
.bash_profile
(for BASH), or
.login
(for tcsh). You may have to
create the file if it is not present. On Cygwin, you should
put the change in the file .bashrc
.
If you do not know which directory is your home directory, you can find its full name by typing:
%
echo $HOME
In order to determine that the software has been correctly installed, and the PATH
has been correctly set, run the following commands:
%
bali-phy ~/Applications/bali-phy-3.4/share/doc/bali-phy/examples/sequences/5S-rRNA/25.fasta --iter=150
%
bali-phy ~/Applications/bali-phy-3.4/share/doc/bali-phy/examples/sequences/5S-rRNA/25.fasta --iter=150
%
bp-analyze 25-1 25-2
Furthermore, the directories 25-1
and 25-2
should contain a file called C1.log
. You should be able to load these files in Tracer, although the chain will not really have converged yet.
Here are some examples and explanations of how to run bali-phy. You can get an overview of command line options by running bali-phy --help.
We recommend running multiple chains in parallel for each command, because
This can be done simply by starting several instances of the program, and does not require using MPI or special command-line options.
The simplest way to run BAli-Phy is to type all the arguments on the command line:
%
bali-phy sequences.fasta
You can run a traditional fixed-alignment Bayesian tree inference by adding -I none
:
%
bali-phy sequences.fasta -I none
You can also specify a range of characters to use in the analysis:
%
bali-phy sequences.fasta:1-30,90-100
Sequence files can be in FastA or PHYLIP format. FASTA format prefixes sequence names with ">":
>human this is a comment and is not part of the sequence name CTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAA GTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTT >tarsier this is also a comment CTGACTGCTGAAGAGAAGGCCGCCGTCACTGCCCTGTGGGGCAAGGTAGACGTGGAAGAT GTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTCGTCTACCCATGGACCCAGAGGTTCTTT >bushbaby CTGACTCCTGATGAGAAGAATGCCGTTTGTGCCCTGTGGGGCAAGGTGAATGTGGAAGAA GTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTTGTCTACCCATGGACCCAGAGGTTCTTT >hare CTGTCCGGTGAGGAGAAGTCTGCGGTCACTGCCCTGTGGGGCAAGGTGAATGTGGAAGAA GTTGGTGGTGAGACCCTGGGCAGGCTGCTGGTTGTCTACCCATGGACCCAGAGGTTCTTC
If the sequence-name line contains a space, BAli-Phy treats everything after the space as a comment.
The sequences in the file do not need to be aligned unless the alignment is being estimated.
Sensible defaults are supplied for command line options that are not specified. For example, if sequences.fasta
contains DNA sequences, then
%
bali-phy sequences.fasta
is equivalent to
%
bali-phy sequences.fasta -A DNA -S tn93 -I rs07
Default values that are used will always be displayed on the screen and in the output files so that you do not have to guess. You can specify a more complex substitution model using the -S
option.
%
bali-phy sequences.fasta -S lg08+Rates.gamma+inv
Every short option like -S
has an equivalent long option like --smodel
.
To see the most frequently-used command-line options, you can run
%
bali-phy help
In addition to using the command line, you may also specify
options in a file. Option files also use the long form of command line options.
Each option is given on its own line using the syntax "option =
value
" instead of the syntax "--option
value
". The value can be blank if the option does not take
an argument. The align
option indicates sequence files.
Lines that begin with # are comments, and blank lines are ignored.
For example, consider the following option file:
# sequence data for 3 genes/partitions align = ITS1.fasta align = 5.8S.fasta align = ITS2.fasta # linked substitution model for 1st and 3rd partition smodel = 1,3:tn93+Rates.free[n=3] # substitution model for 2nd partition smodel = 2:tn93 # indel model for second partition imodel = 2:none # linked scale for 1st and 3rd partition scale = 1,3: # choose a name for output directories name = ITS-analysis1
Options files are specified with the -c
option:
option_file
%
bali-phy -c analysis1.txt
# run the analysis%
bali-phy -c analysis1.txt --name ITS-analysis1b
# override the name
Options given on the command line will override values given in the option file.
Running bali-phy on a computing cluster is not necessary, but can speed up the analysis dramatically. This is because a cluster allows you to run several independent MCMC chains simultaneously and pool the resulting samples. You can run multiple chains simultaneously simply by starting several different instances of bali-phy. Each instance of bali-phy runs only one chain and does not require using MPI or special command-line options.
This approach to parallel computation is sometimes more efficient than MCMCMC-based parallelism involving heated chains. It is equivalent to running MCMCMC with no temperature difference between chains, with the exception that it allows results from all chains to be used, instead of just results from the single "cold" chain. Thus, if you run 10 independent chains in parallel, then you may gather samples 10 times faster that a single chain.
BAli-Phy can read in sequences
and alignments in both FastA and PHYLIP formats. Filenames for
FastA files should end in .fasta
,
.mpfa
, .fna
,
.fas
, .fsa
, or
.fa
. Filenames for PHYLIP files should
end in .phy
. If one of these extensions
is not used, then BAli-Phy will
attempt to guess which format is being used.
Large data sets run more slowly than small data sets. We recommend a conservative starting point with few taxa and short sequence lengths. You can then increase the size of your data set until a balance between speed and size is reached. The tool alignment-thin described in Section 11, “Alignment utilities” can be used to construct a smaller data set.
The number of MCMC samples that you need depends on whether you are primarily interested in obtaining a point estimate or in obtaining detailed measures of confidence and uncertainty. For detailed measures of confidence and uncertainty you should obtain a minimum of 10,000 samples after the Markov chain converges. For an estimate, you don't need very many samples after convergence. (But you may need many samples to be sure that you've converged!)
See also Section 3.5, “Running on computing clusters”.
BAli-Phy is quite CPU intensive, and so we recommend using 150 or fewer taxa in order to limit the time required to accumulate enough MCMC samples.
When designing an MCMC analysis, I recommend performing an initial analysis with a much smaller number of sequences. This smaller analysis will run much faster, and allow discovering mistakes much more quickly. Then, after you are sure that you are running the program correctly and have chosen the best model, you can ramp up the number of sequences towards your desired number.
Aligning just a pair of sequences takes time and memory, where represents the sequence length. Therefore sequences longer than (say) 1000 letters become increasingly impractical. However, you might try to see how long you can make your sequences before you run out of memory, or the program becomes too slow.
For multi-gene analyses, two separate data partitions (i.e. genes) of 500 letters will be twice as fast to align as one data partition of 1000 letters. So, it may be possible to analyze several genes as long as each gene individually is not too long.
Also, note that you can sometimes speed up the analysis of protein sequences by coding them as amino acids or codons, rather than nucleotides. This is because it decreases the sequence length.
BAli-Phy creates a new
directory to store its output files each time it is run. By default, the
directory name is the name of the sequence file, with a number
added on the end to make it unique. BAli-Phy
first checks if there is already a directory called
, and then moves on to
file
-1/
, etc. until it finds an
unused directory name.file
-2/
You can specify a different name to use instead of the
sequence-file name by using the --name
option.
BAli-Phy writes the following output files inside the directory that it creates:
C1.out |
Iteration numbers, probabilities, success probabilities for transition kernels, etc.. |
C1.P.fastas |
Sampled alignments for partition including ancestral sequences. |
C1.err |
Log file for hopefully irrelevant error messages. |
C1.MAP |
Successive estimates of the MAP alignment, tree and parameters. |
C1.log |
Numeric parameters: indel and substitution rates, etc. (One sample per line.) |
C1.trees |
Tree samples in Newick format. (One sample per line.) |
C1.run.json |
JSON file containing information about the command line, models, hostname, start time, etc. |
This section explains the meaning of the various field names in the file C1.log
.
prior |
The log prior probability. |
likelihood |
The log likelihood. |
posterior |
The log of the posterior probability. (The posterior probability is the product of the prior and the likelihood). |
prior_A |
The log-probability of the alignments in all partitions. |
|A| |
The total number of alignment columns across all partitions. |
#indels |
The total number of indel events across all partitions. (Adjacent indels that occur on the same branch are merged). |
|indels| |
The total length of indel events across all partitions. (Adjacent indels that occur on the same branch are merged). |
#substs |
The total unweighted parsimony score for substitutions across all partitions. |
P/likelihood |
The substitution log-likelihood for partition . |
P/prior_A |
The log-probability of the alignment for partition . |
P/|A| |
The length of the alignment in the th partition. |
P/#indels |
The number of indel events in partition , if we group adjacent indels that occur on the same branch. |
P/|indels| |
The length of indel events in partition , if we group adjacent indels that occur on the same branch. |
P/#substs |
The unweighted parsimony score for substitutions in partition . |
Scale[] * |T| |
The scaled branch lengths for partition group . |
|T| |
The unscaled tree length. (This will probably be around 1.0). |
Scale[] |
The average number of substitutions per site on the entire tree for partitions in the th scale group. |
S/ |
Parameter |
I/ |
Parameter |
The "prior" field includes the probability of the alignment, since the alignment is not observed.
The likelihood is the probabilistic analogue to summed mismatch penalties.
The prior_A is the probabilistic analogue to summed gap penalties.
The prefixes "S/" and "I/" will be dropped if not necessary to disambiguate parameters with the same name in different sub-models.
This section is primarily about extracting estimates from output files. See Section 10, “Convergence and Mixing: Is it done yet?” for methods of determining effective sample sizes, and for checking mixing and convergence.
To compute the majority consensus tree, do the following. (The program FigTree allows you to view the resulting tree file graphically.)
%
trees-consensusdir-1
/C1.treesdir-2
/C1.trees >c50.PP.tree
By default, the first 10% of tree samples are skipped as burn-in (--skip=10%
or -s 10%
) and every generation is analyzed (--subsample=1
or -x 1
). To discard the first 1000 tree samples and analyze every 10th sample:
%
trees-consensus -s 1000 -x 10dir-1
/C1.treesdir-2
/C1.trees >c50.PP.tree
By default, splits are included in the consensus tree if they have a
PP greater than 0.5. You can specify a more stringent level
(e.g. 0.66) by adding the option
--consensus-PP=0.66
as follows:
%
trees-consensus -s20% -x10 --consensus-PP=0.66dir-1
/C1.treesdir-2
/C1.trees >c66.PP.tree
You may also make the program write directly to the output file
(e.g. c66.PP.tree
) by using the more general form
--consensus-PP=0.66:c66.PP.tree
. Leaving off
the ":c66.PP.tree
" part (as we did above) or specifying
":-
" sends the output to the standard output
(e.g. the terminal, if not redirected).
%
trees-consensus -s20% -x10dir-1
/C1.treesdir-2
/C1.trees --consensus-PP=0.66:c66.PP.tree
You can supply multiple levels and filenames separated by commas. This is faster than running the program multiple times with different consensus levels.
%
trees-consensus -s20% -x10dir-1
/C1.treesdir-2
/C1.trees --consensus-PP=0.5:c50.PP.tree
,0.66:c66.PP.tree
Finally, you may use the option --consensus=
instead of the option --consensus-PP=
if you do
not wish the resulting tree to contain embedded posterior
probabilities on branches, as well as branch lengths.
%
trees-consensus -s20% -x10dir-1
/C1.treesdir-2
/C1.trees --consensus=0.5:c50.PP.tree
,0.66:c66.PP.tree
Both the --consensus=
and
--consensus-PP=
options may be given simultaneously.
See trees-consensus --help
for a complete list of options.
The greedy consensus tree may be used instead of a majority-consensus tree when a fully resolved (e.g. bifurcating) tree is required. When the topology has many tips and each topology may be sampled only once, the greedy consensus should be higher quality than the estimate of the MAP topology. To obtained a fully resolved tree, the greedy consensus strategy starts with the majority consensus and then adds the highest-supported split that does not conflict.
To compute the greedy consensus tree do:
%
trees-consensus --skip=burnin
dir-1
/C1.treesdir-2
/C1.trees --greedy-consensus=greedy.tree
To compute the maximum a posteriori tree do:
%
trees-consensus --skip=burnin
dir-1
/C1.treesdir-2
/C1.trees --map-tree=MAP.tree
When the tree has many tips, each topology may be sampled only once, leading to low quality estimates of the MAP topology. As a result, when you need a bifurcating tree you should probably use the greedy consensus instead.
%
trees-bootstrapdir-1
/C1.treesdir-2
/C1.trees
This command computes the effective sample size for the posterior probability of each split. It also computes the Average Standard Deviation of Split Frequencies (ASDSF) between two or more independent runs.
See Section 10, “Convergence and Mixing: Is it done yet?” for more information.
This command gives a median and confidence interval, ESS, and a stabilization time:
%
statreportdir-1
/C1.logdir-2
/C1.log > Report
When multiple runs are analyzed, this command gives PSRF and joint ESS values. The program Tracer allows you to view the same summaries graphically.
See Section 10, “Convergence and Mixing: Is it done yet?” for more information.
To construct an alignment estimate via posterior decoding, select any tree file tree
that corresponds to your alignment. It does not need to be fully resolved.
%
cut-rangedir
-1/C1.Pp
.fastasdir
-2/C1.Pp
.fastas --skip=burn-in
| alignment-chop-internal --treetree
| alignment-max > Pp
-max.fasta
You can optionally replace --tree
with tree
-N
, where n_sequences
n_sequences
is the number of non-ancestral sequences in your alignment.
You can use the program SeaView to view the alignment graphically.
To annotate a specific alignment alignment
.fasta, choose a fully resolved tree estimate tree
:
%
cut-rangedir
-1/C1.Pp
.fastasdir
-2/C1.Pp
.fastas --skip=burn-in
| alignment-chop-internal --treetree
| alignment-gildalignment
.fastatree
>alignment
-AU.prob%
alignment-drawalignment
.fasta --AUalignment
-AU.prob >alignment
-AU.html
The majority consensus tree is usually not fully resolved, so we recommend using the greedy consensus instead.
Instead of manually running each of the steps to analyze the
output files, you may instead run the PERL script
bp-analyze to execute these commands. The
script will create an HTML page
Results/index.html
that summarizes the
posterior distribution.
You may run bp-analyze inside the output directory, like this:
%
bp-analyze --skip=iterations
You may also run it with one or more output directories as arguments, like this:
%
bp-analyze --skip=iterations
directory
-1/directory
-2/
In this case, output from multiple runs will be used to assess convergence and mixing, as well as to increase the precision of the estimates.
All the commands that are executed by bp-analyze will be logged to
Results/bp-analyze.log
. You can also see these
commands as they are executed by supplying the --verbose option:
%
bp-analyze --skip=iterations
--verbose
The Results/
directory will contain
the following useful files:
Report |
A summary of numerical parameters: credible intervals and mixing. |
consensus |
A summary of supported splits (clades). |
c-levels.plot |
The number of splits (clades) supported at each LOD level. |
c50.tree | The majority consensus topology + branch lengths (Newick format) |
c50.PP.tree |
The majority consensus topology + branch lengths + Posterior Probabilities (Newick format) |
MAP.tree |
An estimate of the MAP topology + branch lengths (Newick format) |
The following files will be generated to summarize alignment uncertainty, unless the analysis uses a fixed alignment.
P |
An estimate of the alignment for partition
|
P |
An AU plot of the maximum posterior decoding alignment for partition
|
The following files describe convergence and mixing:
partitions.bs |
Confidence intervals on the support for partitions, generated using a block bootstrap. |
partitions.SRQ | A collection of SRQ plots for the supported partitions. |
c50.SRQ | An SRQ plot for the majority consensus tree. |
The SRQ plots can be viewed by typing "plot
'
" in
gnuplot.file
' with lines
This file reports the quality of estimates of support for each partition in terms of the posterior probability (PP) and log-10 odds (LOD). It also reports the auto-correlation time (ACT), the effective sample size (Ne), the number of samples that support (1) or do not support (0) the partition, and the number of regenerations. Only partitions with PP > 0.1 are shown by default.
The default substitution model for DNA and RNA is tn93.
All the DNA models are special cases of the GTR model.
Model | d.f. | Summary |
---|---|---|
jc69 | 0 | Equal rates and equal base frequencies. (Jukes and Cantor, 1969) |
k80 | 1 | Unequal transition & transversion rates, equal base frequencies. (Kimura, 1980) |
f81 | 3 | Equal exchangabilities, unequal frequencies. (Felsenstein, 1981) |
hky85 | 4 | Unequal Transition & transversion rates, unequal base frequencies. (Hasegawa, Kishino, and Yano, 1985) |
tn93 | 5 |
Unequal rates for transitions (purines), transitions (pyrimadies) and transversions, unequal base frequencies. (Tamura and Nei, 1993) |
gtr | 8 | Unequal exchangeabilities, unequal frequencies. (Tavare, 1986) |
Frequencies are estimated by default. Frequencies can be fixed by setting the pi
parameter to a constant value, if the model allows unequal frequencies.
Constant frequencies are specified as a list of pairs that associates each letter with its frequency:
gtr[pi=List[Pair["A",0.1],Pair["C",0.2],Pair["T",0.3],Pair["G",0.4]]
Frequencies can also be specified using functions:
gtr[pi=Frequencies.uniform]
Model | d.f. | Summary |
---|---|---|
Frequencies.uniform | 0 | Equal frequencies |
The default substitution model for proteins is lg08.
Model | d.f. | Summary |
---|---|---|
jc69 | 0 | Equal rates and equal frequencies. (Jukes and Cantor, 1969) |
f81 | 19 | Equal exchangabilities, unequal frequencies. (Felsenstein, 1981) |
| 19 |
Empirical exchange rates, all proteins. (Jones, Taylor, Thornton 1992) |
| 19 |
Empirical exchange rates, all proteins. Whelan and Goldman (2001) |
| 19 |
Empirical exchange rates, all proteins. Le and Gascuel (2008) |
| 19 | |
gtr | 208 | Unequal exchangeabilities, unequal frequencies. (Tavare, 1986) |
Frequencies are estimated by default. Frequencies can be fixed by setting the pi
parameter to a constant value, if the model allows unequal frequencies.
Constant frequencies are specified as a list of pairs that associates each letter with its frequency:
wag+f[pi=List[Pair["A",0.047],Pair["R",0.19],...]]
Frequencies can also be specified using functions:
wag+f[pi=Frequencies.uniform]]
Model | d.f. | Summary |
---|---|---|
Frequencies.uniform | 0 | Equal frequencies |
wag_freq | 0 | The constant amino-acid frequencies from the WAG paper. |
lg08_freq | 0 | The constant amino-acid frequencies from the LG08 paper. |
The +fe
model is shorthand for +f[pi=Frequencies.uniform]
:
wag+fe
The doublets alphabet consists of 16 RNA dinucleotides. It is used to model RNA stems, where two nucleotides matched in the RNA secondary structure are highly correlated.
The default substitution model for doublets is tn93_sym+x2_sym+f
.
As of version 3.4, BAli-Phy does not yet allow specifying which nucleotides are paired either with a string like ((.))
or with a "pairs" file. Instead you must manually extract the paired nucleotides and put them in their own partition (for stems), and then manually extract each loop and put it in its own partition.
The stems should be arranged so that paired nucleotides are adjacent. For example, suppose the sequence AGGCT
was paired according to ((.))
. Then the input file for the stems should contain a sequence of doublets that looks like ATGC
, where AT
is the first pair, and GC
is the second pair. Later versions of the software should allow extracting stems and loops from nucleotide sequences using parenthesis notation or a "pairs" file.
Model | d.f. | Summary |
---|---|---|
| df(nuc_model) |
The the same as Simultaneous changes of both letters are not allowed. Dinucleotide frequencies are the product of independent nucleotide frequencies. |
| df(nuc_model)+15 |
Mutation-selection model: neutral mutation follows Simultaneous changes of both letters are not allowed. |
| df(nuc_model)+15 | This model has separate frequencies for each dinucleotide. Simultaneous changes of both letters are not allowed. |
| 19 |
This model has separate frequencies for each dinucleotide, and distinguishes between transitions and transversion between match states (including GU/UG). Simultaneous changes of both letters are allowed, but only between match states. (Savill, N. et al 2001) |
gtr | 134 | Unequal exchangeabilities, unequal frequencies. It is unlikely that you would want to use this model, since it has so many parameters. (Tavare, 1986) |
Frequencies are estimated by default. Frequencies can be fixed by setting the pi
parameter to a constant value, if the model allows unequal frequencies.
Constant frequencies are specified as a list of pairs that associates each letter with its frequency.
hky85[pi=List[Pair["A",0.1],Pair["C",0.2],Pair["T",0.3],Pair["G",0.4]]]+x2 hky85_sym+x2_sym+f[pi=List[Pair["AA",0.01],Pair["AC",0.01],Pair["AG",0.01],Pair["AU",0.22],Pair["CA",0.01],Pair["CC",0.01],Pair["CG",0.22],Pair["CU",0.01],Pair["GA",0.01],Pair["GC",0.22],Pair["GG",0.01],Pair["GU",0.01],Pair["UA",0.22],Pair["UC",0.01],Pair["UG",0.01],Pair["UU",0.01]]]
Frequencies can also be specified using functions:
Model | d.f. | Summary |
---|---|---|
Frequencies.uniform | 0 | Equal frequencies on dinucleotides |
The triplets alphabet is similar to the codons alphabet, except that stop codons are included. Unlike the codons alphabet, the triplets alphabet has no knowledge of the genetic code.
The default substitution model for triplets is tn93+x3.
Model | d.f. | Summary |
---|---|---|
| df(nuc_model )+63 | GY94-style rate matrix constructed from nucleotide exchangability matrix. |
| df(nuc_model ) | MG94-style rate matrix constructed from nucleotide rate matrix. This model should give the same likelihood as |
| df(nuc_model )+63 | Mutation-selelection model with neutral mutation following |
Frequencies are estimated by default. Frequencies can be fixed by setting the pi
parameter to a constant value, if the model allows unequal frequencies.
Constant frequencies are specified as a list of pairs that associates each letter with its frequency.
hky85[pi=List[Pair["A",0.1],Pair["C",0.2],Pair["T",0.3],Pair["G",0.4]]]+x3
Frequencies can also be specified using functions:
hky85_sym+x3_sym+f[pi=f1x4] // nucleotide frequencies are estimated
Model | d.f. | Summary |
---|---|---|
Frequencies.uniform | 0 | Equal frequencies |
f1x4 | 3 | Constructs triplet frequencies from independent nucleotide frequencies. |
f3x4 | 9 | Constructs triplet frequencies from independent nucleotide frequencies for each codon position. |
The +fe
model is shorthand for +f[pi=Frequencies.uniform]
:
hky85_sym+x3_sym+fe
BAli-Phy interprets branch lengths for codon models as 1/3 the number of substitutions per triplet. Thus, they should be comparable to branch lengths under DNA/RNA nucleotide models.
The default substitution model for codons is gy94.
Model | d.f. | Summary |
---|---|---|
gy94 | 62 | Model of dN/dS with a separate frequency for each codon. Rate for changing a nucleotide depends on neighboring nucleotides. (Goldman and Yang, 1994) |
gy94[pi=f1x4] | 5 | The GY94 model with codon frequencies constructed from nucleotide frequencies. (Goldman and Yang, 1994) |
gy94[pi=f3x4] | 11 | The GY94 model with codon frequencies constructed from nucleotide frequencies for each codon position. (Goldman and Yang, 1994) |
gy94_ext[ | df(nuc_model )+61 | GY94 model extended with a generic nucleotide exchangeability matrix. (Goldman and Yang, 1994) |
mg94 | 4 |
Model of dN/dS with f81 as the neutral model. Rate for changing a nucleotide depends only on that nucleotide. (Muse and Gaut, 1994) |
mg94k | 5 | Model of dN/dS with hky85 as the neutral model. (Muse and Gaut, 1994) |
mg94_ext[ | df(nuc_model )+1 | Model of dN/dS with |
fMutSel | 65 | MG94-like model with fitnesses for each codon. (Yang and Nielsen, 2008) |
fMutSel0 | 24 | MG94-like model with fitnesses for each amino-acid. (Yang and Nielsen, 2008) |
| df(nuc_model)+60 | GY94-style rate matrix constructed from nucleotide exchangability matrix (dN/dS = 1). This model should give the same likelihood as |
| df(nuc_model) | MG94-style rate matrix constructed from nucleotide rate matrix (dN/dS = 1). |
| df(codon_model )+1 | Scales non-synonymous rates by |
| df(codon_model )+60 | Mutation-selelection model with neutral mutation following |
BAli-Phy interprets branch lengths for codon models as 1/3 of the number of substitutions per codon. Thus, they should be comparable to branch lengths under DNA/RNA models.
The x3
, x3_sym
, x3x3
, dNdS
, and mut_sel
models
can be used to build up codon models piecewise:
mg94
is equivalent to f81+x3+dNdS
.mg94k
is equivalent to hky85+x3+dNdS
.gy94
is equivalent to hky85_sym+x3_sym+f+dNdS
.fMutSel
is equivalent to gtr+x3+mut_sel+dNdS
.
Frequencies are estimated by default. Frequencies can be fixed by setting the pi
parameter to a constant value, if the model allows unequal frequencies.
Constant frequencies are specified as a list of pairs that associates each letter with its frequency.
gy94[pi=List[Pair["AAA",0.01],Pair["C",0.02],...]] mg94[pi=List[Pair["A",0.1],Pair["C",0.2],Pair["T",0.3],Pair["G",0.4]]]
Frequencies can also be specified using functions:
gy94[pi=f1x4] // nucleotide frequencies are estimated
Model | d.f. | Summary |
---|---|---|
Frequencies.uniform | 0 | Equal frequencies |
f1x4 | 3 | Constructs codon frequencies from independent nucleotide frequencies. |
f3x4 | 9 | Constructs codon frequencies from independent nucleotide frequencies for each codon position. |
When using a codon-based substitution model like gy94
, you may select the genetic code by specifying -A Codons[,
. Available genetic codes are genetic-code
]standard
, mt-vert
, mt-invert
, mt-yeast
, mt-protozoan
.
If the genetic code is not specified, then the standard code is used:
%
bali-physequence-file
-S gy94 -A Codons%
bali-physequence-file
-S gy94 -A Codons[RNA]
These examples specify the vertebrate mitochondrial code:
%
bali-physequence-file
-S gy94 -A Codons[DNA,mt-vert]%
bali-physequence-file
-S gy94 -A Codons[,mt-vert]
Model | d.f. | Summary |
---|---|---|
| df(submodel )+2 |
A mixture of conserved and neutral sites. (Wong, et. al. 2004) |
| df(submodel )+4 |
A mixture of conserved, neutral, and positively-selected sites. (Wong, et. al. 2004) |
m2a_test | df(submodel )+4 |
A Bayesian test for positive selection that compares M2a with M1a. (Wong, et. al. 2004) |
m3 | df(submodel )+2*-1 | An free mixture of categories of conserved dN/dS values. (Yang et. al. 2000) |
| df(submodel )+2*+1 |
A Bayesian test for positive selection based on the M3 model extended with an extra category of either neutral of positively-selected sites. |
| df(submodel )+2 | The M7 model places a beta distribution on dN/dS. (Yang et. al. 2000) |
m8a | df(submodel )+3 | The M8a model adds a category of neutral sites to the M7 model. (Swanson et. al. 2003) |
m8 | df(submodel )+4 | The M8 model adds a category of positively-selected sites to the M7 model. (Yang et. al. 2000) |
m8a_test | df(submodel )+4 | A Bayesian test for positive selection that compares the M8 to the M8a model. (Swanson et. al. 2003) |
branch_site | df(submodel )+4 | A Bayesian test for positive selection that on some (unknown) sites and some (known) branches. (Zhang et. al 2005) |
In order to use the branch-site substitution model, the user needs to specify an unrooted tree topology and disable topology changes in order to keep the topology fixed:
%
bali-phyalignment
.fasta -S branch_site -Ttree
.tree --disable=topology
The tree file should be in Newick format, with foreground branches labelled using NHX attributes. The NHX attribute must be applied to the branch, not the node, so it must occur after a colon.
Example 1. An tree with a foreground branch
The posterior probability of positive selection is the posterior mean of the posSelection parameter. This may be computed using the statreport program with the --mean
option. In case this probability is extremely close to 1 or 0, you may wish to add the option --Rao-Blackwellize branch_site:posSelection
. This will report the log-probability of positive selection each iteration. The user may exponentiate the reported values and then average them (using R, for example) in order to compute a more accurate estimate of the posterior probability of positive selection.
Complex substitution models in BAli-Phy are constructed as mixtures of reversible CTMC models that run at different rates (e.g. ) or have different parameters (e.g. an M2a codon model).
Model | d.f. | Summary |
---|---|---|
| df(submodel )+1 | Site rates follow a discrete approximation to the Gamma distribution (Yang 1994) |
| df(submodel )+1 | Site rates follow a discrete approximation to the logNormal distribution |
| df(submodel )+2(-1) | Sites fall in one of categories. Each category has its own rate. (Yang 1995) |
| df(submodel )+df(dist ) | Sites rates follow a discrete approximation to the distribution |
| df(submodel )+1 | Some fraction inv:p_inv of sites are invariable. |
Each of these models is a probability distribution on pairwise alignments. The probability distribution on multiple sequence alignments is constructed by factoring the multiple sequence alignment into pairwise alignments along each branch of the tree, as described in Redelings and Suchard (2005).
The default insertion/deletion model is rs07
.
Model | d.f. | Summary |
---|---|---|
rs05 | 3 |
A symmetric insertion-deletion model with geometrically-distributed indel lengths. Indels occur on all branches with the same probability, regardless of branch length. (Redelings and Suchard, 2005) |
rs07 | 2 |
A symmetric insertion-deletion model with geometrically-distributed indel lengths. Longer branches have more indels. (Redelings and Suchard, 2007) |
|
No indel model for the partition, indels uninformative. Fixed alignment for the partition. |
The user can specify priors and parameters for indel models (See section Section 8, “Models and Priors”):
rs07[log_rate~log_laplace[-4,0.707],mean_length=2]
Models, probability distributions, and functions are treated the same in BAli-Phy because all of them have parameters or arguments. Parameters have names in BAli-Phy. Parameter values are specified using square brackets as follows:
hky85[kappa=2] // model log[x=2] // function normal[mean=0,sigma=1] // probability distribution
It is possible to specify parameter values by position instead of by name:
hky85[2] log[2] normal[0,1]
It is even possible to mix positional and named arguments, as long as all the positional arguments come before all the named arguments:
normal[0,sigma=1] // OK normal[mean=0,1] // not OK
The order and type of parameters for a function can be found with the help
command. For example,
%
bali-phy help hky85
A value must be given for each parameter, unless the parameter has a default value (See Section 8.4, “Default values and default priors”).
If you are using the csh, tcsh, or zsh instead of the bash shell, then you need to put single quotes around terms with square brackets on the command-line:
%
bali-phy file.fasta -S 'hky85[kappa=2]'
If you do not add quotes, you will get an error message like "bali-phy: No match.
" (for csh) or "zsh: no matches found: hky85[kappa=2]
" (for zsh).
Models in phylogenetics literature are often combined using +
. For example, the model WAG+F+G4+I
starts with the WAG amino-acid model, and places several modifiers, like "+G4" on the right.
BAli-Phy follows this convention by treating A+B
as an abbreviation for B[submodel=A]
. When there are multiple '+
' symbols they associate to the left, so that A+B+C
is understood to mean (A+B)+C
. For example:
hky85+Rates.gamma // rewritten to Rates.gamma[submodel=hky85] hky85+inv // rewritten to inv[submodel=hky85] wag+f // rewritten to f[submodel=wag] wag+f+Rates.gamma+inv // rewritten to inv[submodel=Rates.gamma[submodel=f[submodel=wag]]]
This allows a simple method for combining models, when one model is an argument to another model.
Priors on model parameters are specified by giving a random value. Random values can be obtained from distributions using the function sample
. For example, this places a log-normal prior on the parameter kappa
of the hky85
model:
hky85[kappa=sample[log_normal[1,1]]]
You can write ~Dist
as a shorthand for sample[Dist]
:
hky85[kappa=~log_normal[1,1]]
The =~
can be further shortened to just ~
:
hky85[kappa~log_normal[1,1]]
It also is possible to use random values as inputs to other functions. For example:
add[1,~exponential[10]]
In such cases the parameter value should be specified with =
, as in the following example:
rs07[mean_length=add[1,~exponential[10]]]]
Random values and distributions have different types. For example, the
following is of type Distribution[Double]
:
uniform[0,1]
In contrast, the following are both of type Double
:
sample[uniform[0,1]] ~uniform[0,1]
This is important when passing distributions as arguments to other
distributions and functions. For example, the distribution iid
is used to generate a specific number of samples from another distribution. Thus, it needs to receive a distribution as an argument:
~iid[4,normal[0,1]] // OK : 4 samples from the normal[0,1] distribution ~iid[4,~normal[0,1]] // not OK: 4 samples from ... a random number?
Some function arguments have default values. For example, the Rates.gamma
parameter n
has a default value of 4. Thus the following are equivalent:
hky85+Rates.gamma[n=4]+inv hky85+Rates.gamma+inv
When the default value is random, then the argument has a default prior. For example, the kappa
parameter of hky85
has a default value of ~log_normal[log[2],0.25]
, so the following are equivalent:
hky85[kappa~log_normal[log[2],0.25]] hky85
The help
command can be used to determine the default value for a parameter, if there is one.
Every function has a result type, as well as an argument type for each argument. The argument type specifies what kind of arguments are acceptable, and the result type specifies what kind of result the function produces. Types include Int
for integers, Double
for double-precision floating point numbers, and String
for text strings. Integer arguments are implicitly converted to Double
when the argument type is Double
.
Some types contain parameters. For example List[Int]
indicates a list of integers and List[Double]
indicates a list of real numbers. In order to indicate a list of unknown type, we use a type variable a
and write List[a]
. Type variables always begin with a lower-case letter. They are able to match any specific type, and their value is found by pattern-matching. For example, the function add[x,y]
takes two arguments of type a
and has a result of type a
. Thus:
add[1,2] // arguments are a=Int, so result is of type Int add[1.0,2.0] // arguments are a=Double, so result is of type Double
Pair[a,b]
is a parameterized type that can be specialized to (for example) Pair[String,Double]
and Pair[Int,Int]
.
Types for components of substitution models are often parameterized by type of the alphabet. For example, hky85 has a result type of RevCTMC[a]
, where a
could be DNA
or RNA
. The use of alphabet types in substitution models prevents combining substitution models with mismatched alphabets.
You should analyze multiple genes under different evolutionary models by putting each one it its own data partition. Placing different genes in different partitions means that their alignments vary independently. It also prevents sequences in one gene from being aligned against sequences in another gene.
Different partitions share the same tree topology and a common set of unscaled branch lengths. However, branch lengths are scaled by a different factor in each partition, since some genes may evolve faster than others.
To put different genes in different partitions, you can place the sequences from each partition in a different FASTA or Phylip file. The sequence names in files for all partitions should be the same.
%
bali-phy gene1.fasta gene2.fasta
You can also select different sites from a single larger file:
%
bali-phy sequences.fasta:3-350 sequences.fasta:351-570
By default, each partition will have its own substitution model, insertion/deletion model, and scaled tree length. For example, even if all partitions are assigned a tn93
substitution model, their base frequencies will all be estimated independently. When parameters are estimated separately for two partitions, we say that the parameters for those partitions are "unlinked".
A substitution model or insertion-deletion model that is specified without qualification will apply to every partition. However, each partition will recieve its own copy of each model with unlinked parameter values:
%
bali-phy
sequence-file1
sequence-file2
-S tn93 -I rs07
You can select partition-specific values for 4 options: -S
, -I
, -A
, and --scale
. For example, to specify different substitution models but the same alphabet:
%
bali-phy
sequence-file1
sequence-file2
-S 1:tn93 -S 2:gtr -A DNA
You can fix the alignment and ignore insertion/deletion information in one partition, while allowing the alignment to vary and using insertion/deletion information in another partition:
%
bali-phy
sequence-file1
sequence-file2
-I 2:none
Since alignments are estimated by default, the alignment will be estimated in the first partition, but fixed in the second partition.
Specifying specify -I none
fixes the alignment in all partitions:
%
bali-phy
sequence-file1
sequence-file2
-I none
You can also specify that two partitions share a single copy of a single substitution model or indel model. For example, if two partitions both have a tn93
model, linking these models would force the partitions to have the same nucleotide frequencies and substitution rates. Linking partitions reduces the number of parameters that need to be estimated, and also pools information between the partitions:
%
bali-phy
sequence-file1
sequence-file2
-S 1,2:tn93 -I 1,2:rs07
By default each partition has a separate scale, but you can force groups of partitions to share a scale as follows:
%
bali-phy
sequence-file1
sequence-file2
--scale 1,2:
The --link
command is provided to allow specifying a model for each partition separately, and then afterwards choose which partitions to link.
%
bali-phy
sequence-file1
sequence-file2
-S 1:tn93 -S 2:tn93 --link=1,2 -t%
bali-phy
sequence-file1
sequence-file2
-S tn93 --link=1,2 -t
If the linked partitions are given different models, BAli-Phy will give an error and refuse to run:
%
bali-phy
bali-phy: Error! Partitions 1 and 2 cannot be linked because they have differing values 'tn93' and ''sequence-file1
sequence-file2
-S 1:tn93 --link=1,2 -t
You can also specify which of the 3 attributes "smodel", "imodel", and "scale" are being linked:
%
bali-phy
// Don't link the indel modelsequence-file1
sequence-file2
--link=1,2:smodel,scale -t
When using Markov chain Monte Carlo (MCMC) programs like MrBayes, BEAST or BAli-Phy, it is hard to determine in advance how many iterations are required to give a good estimate. The number depends on the specific data set that is being examined. As a result, BAli-Phy relies on the user to analyze the output of a running chain periodically in order to determine when enough samples have been obtained. This section describes a number of techniques to diagnose when more samples must be taken.
Some of the better diagnostics for lack of convergence rely on running at least 2 independent copies of the Markov chain (preferably 4-10) from different random starting points to see if the sampled posterior distributions for each chain are the same. Unfortunately, when the distributions all seem to be this same, this doesn't prove that they have all converged to the equilibrium distribution. However, if the distributions are different then you can reject either convergence or good mixing.
Convergence refers to the the tendency of a Markov chain to to "forget" its starting value and become typical of its equilibrium distribution. Note that convergence is a property of the Markov chain itself, not of individual runs of the Markov chain. Ideally a number of individual runs should be examined in order to determine how many initial iterations to discard as "burnin".
In MCMC, each sample is not fully independent of previous samples. In fact, even after a Markov chain has converged, it can get "stuck" in one part of the parameter space for a long time, before jumping to an equally important part. When this happens, each new sample contributes very little new information, and we need to obtain many more samples to get good precision on our parameter estimates. In such a case, we say that the chain isn't "mixing" well.
To calculate the ASDSF and MSDSF run:
%
trees-bootstrapdir-1
/C1.treesdir-2
/C1.trees ...dir-n
/C1.trees > partitions.bs
For each split, the SDSF value is just the standard deviation across runs of the Posterior Probabilities for that split. By averaging the resulting SDSF values across splits, we may obtain the ASDSF value (Huelsenbeck and Ronquist 2001). This is commonly considered acceptable if it is < 0.01.
However, it is also useful to consider the maximum of the SDSF values (MSDSF). This represents the range of variation in PP across the runs for the split with the most variation.
To generate the split-frequency comparison plot, you must have R installed. Locate the script compare-runs.R
. Then run:
%
trees-bootstrapdir-1
/C1.treesdir-2
/C1.trees ...dir-n
/C1.trees --LOD-table=LOD-table > partitions.bs%
R --slave --vanilla --args LOD-table compare-SF.pdf < compare-runs.R
Following Beiko et al (2006), this displays the variation in estimates of split frequencies across runs. Splits are arranged on the x-axis in increasing order of Posterior Probability (PP), which is obtained by averaging over runs. We then plot a vertical bar from the minimum PP to the maximum PP.
Potential Scale Reduction Factors check that different runs have similar posterior distributions. Only numerical variables may have a PSRF. To calculate the PSRF for each numerical parameter, you may run:
%
statreportdir-1
/C1.logdir-2
/C2.p ...dir-n
/C1.log > Report
The PSRF is a ratio of the width of the pooled distribution to the average width of each distribution, and should ideally be 1. The PSRF is customarily considered to be small enough if it is less than 1.01.
We compare the PSRF based on the length of 80% credible intervals (Brooks and Gelman 1998) and report the result as PSRF-80%CI. For integer-valued parameters, we avoid excessively large PSRF values by subtracting 1 from the width of the pooled CI.
We also report a new PSRF that is more sensitive for integer distributions. For each individual distribution, we find the 80% credible interval. We divide the probability of that interval (which may be more than 80%) by the probability of the same interval under the pooled distribution. The average of this measure over all distributions gives us a PSRF that we report as PSRF-RCF.
This convergence diagnostic gives a criterion for detecting when a parameter value has stabilized at different values in several independent runs, indicating a lack of convergence. This situation might occur if different runs of the Markov chain were trapped in different modes and failed to adequately mix between modes.
To calculate the split ESS values, run:
%
statreportdir-1
/C1.logdir-2
/C1.log ...dir-n
/C1.log > Report
We calculate effective sample sizes based on integrated autocorrelation times. This method has the nice property that simply duplicating every sample does not increase the ESS.
The program Tracer also computes ESS values.
To calculate the split ESS values, run:
%
trees-bootstrapdir-1
/C1.treesdir-2
/C1.trees ...dir-n
/C1.trees > partitions.bs
To compute the ESS for a split, we consider the presence or absence of a split in each iteration as a series of binary values. We compute the integrated autocorrelation time for this binary sequence, which leads to an ESS. This approach is similar to dividing the iterations into blocks and computing the ESS on the PP estimates in the blocks. It is also similar to estimating the variance reduction under a block bootstrap.
To obtain estimates of the stabilization time for each numerical parameter, you may run:
%
statreport C1.log > Report
Each series of values is counted as having stabilized after the series crosses its upper and then lower 95% confidence bounds twice (if the initial value is below the median) or crosses its lower and then upper confidence bounds twice (if the initial value is above the median). The confidence bounds are those based on its equilibrium distribution as calculated from the last third of the values in the sequence.
In addition to examining convergence diagnostics for continuous parameters, it is important to examine convergence diagnostics for the topology as well (Beiko et al 2006). In theory, we recommend the web tool Are We There Yet (AWTY) (Wilgenbush et al, 2004). However, AWTY gives incorrect results if you upload plain NEWICK tree samples -- which is what BAli-Phy outputs. Therefore, if you wish to use AWTY, you must convert the tree samples files to NEXUS before you upload them to AWTY in order to get correct results.
It is also be possible to assess stabilization of tree topologies using tools distributed with bali-phy by using commands like the following. Here, sub-sampling and burnin does not apply to the equilibrium tree files. Also, note that you need to manually construct the equilibrium samples, which we recommend to contain at least 500 trees; you might do this by sub-sampling using the BAli-Phy tool sub-sample.
To report the average distances within and between two tree samples:
%
trees-distances --skip=burnin
--subsample=factor
comparedir-1
/C1.treesdir-2
/C1.trees
To compute the distance from each tree in C1.trees to all trees equilibrium.trees, as a time series:
%
trees-distances --skip=burnin
--subsample=factor
convergenceC1.trees
equilibrium.trees
To assess when the above time series stabilizes:
%
trees-distances --skip=burnin
--subsample=factor
convergedC1.trees
equilibrium.trees
The stabilization criterion is the same one described above for numerical values.
Note that the running time is the product of the number of trees in the two files. Therefore, comparing two complete tree samples without sub-sampling will take too long.
All tools display their options if given the "--help
" argument on the command line.
Show basic information about the alignment:
%
alignment-info file.fasta%
alignment-info file.fasta file.tree
To select columns from an alignment:
%
alignment-cat -c1-10,50-100,600- file.fasta > result.fasta%
alignment-cat -c5-250/3 file.fasta > first_codon_position.fasta%
alignment-cat -c6-250/3 file.fasta > second_codon_position.fasta
To concatenate two or more alignments:
%
alignment-cat file1.fasta file2.fasta > all.fasta
Remove columns without a minimum number of letters:
%
alignment-thin --min-letters=5file
.fasta >file
-thinned.fasta
Remove sequences:
%
alignment-thin --remove=seq1,seq2file
.fasta >file
2.fasta
Remove short sequences:
%
alignment-thin --longer-than=250file
.fasta >file
-long.fasta
Remove sequences while preserving sequence diversity:
%
alignment-thin --down-to=30file
.fasta >file
-30taxa.fasta%
alignment-thin --down-to=30file
.fasta --protect=seq1,seq2 >file
-30taxa.fasta
Remove sequences that are missing conserved columns:
%
alignment-thin --remove-crazy=10file
.fasta >file
2.fasta
Draw an alignment to HTML, optionally coloring residues by AU.
%
alignment-drawfile
.fasta --show-ruler --color-scheme=DNA+contrast >file
.html%
alignment-drawfile
.fasta --show-ruler --AU=file
-AU.prob --color-scheme=DNA+contrast+fade+fade+fade+fade >file
-AU.html
Find the last (or first) FastA alignment in a file.
%
alignment-find --first <file
.fastas > first.fasta%
alignment-find <file
.fastas > last.fasta
Turn columns from a template alignment into alignment constraints:
%
alignment-indices template.fasta > constraints.txt%
alignment-indices -c100-110,200,300- template.fasta > constraints.txt
Each line in this file corresponds to one alignment column.
This program analyzes the tree sample contained in
file
. It reports the MAP topology, the
supported taxa partitions (including partial partitions), and the
majority consensus topology.
Usage: trees-bootstrap file1
[file2
... ] --predicates
predicate-file
[OPTIONS]
This program analyzes the tree samples contained in
file1
, file2
,
etc. It gives the support of each tree sample for each predicate in
predicate-file
, and reports a confidence
interval based on the block bootstrap.
Each predicate is the intersection of a set of partitions, and is specified as a list of partitions or (multifurcating) trees, one per line. Predicates are separated by blank lines.
Usage: trees-to-SRQ predicate-file
[OPTIONS] trees-file
This program analyzes the tree samples contained in
trees-file
. It uses them to produce an
SRQ plot for each predicate in
predicate-file
. Plots are produced in
gnuplot format, with one point per line
and with plots separated by a blank line.
If --mode sum
is specified, then a "sum"
plot is produced instead of an SRQ plot. In this plot, the slope of
the curve corresponds to the posterior probability of the event. If the
--invert
option is used then the slope of the
curve correspond to the probability of the inverse event. This is
recommended if the probability of the event is near 1.0, because the
sum plot does not distinguish variation in probabilities near 1.0 well.
Compiling BAli-Phy is intended to be a relatively painless process. However, most people will want to use the pre-compiled binaries as described in the standard installation instructions at Section 2, “Installation” instead of compiling BAli-Phy themselves. You might want to compile BAli-Phy yourself if you want to
Otherwise, the pre-compiled binaries will be fine.
In order to compile BAli-Phy, you need
We recommend the GNU C++ Compiler (GCC) version 5.0 (or higher) or the Clang compiler version 3.5.0 or higher. The Cairo graphics library is optional, but if it is missing, the drawtree tool that is used to draw consensus trees won't be built. See also Section 2.8, “Install programs used for viewing the results”.
On Debian and Ubuntu, you can type:
%
sudo apt-get install g++ git libcairo2-dev pandoc
If your version of Debian or Ubuntu is recent enough to contain meson version 0.45 or higher, you can install meson with apt-get:
Otherwise you can install meson through pip3:%
sudo apt-get install meson
%
dpkg -s meson | grep Version
Version: 0.45.1-2
%
sudo apt-get install python3 python3-pip ninja
%
python3 -m venv meson
%
source meson/bin/activate
%
pip3 install meson
On Mac OS X, the simplest way to get a compiler is to install XCode (version 6 or newer) command line tools, which come with clang.
%
xcode-select --install
To get the other tools, first install homebrew, and then type:
%
brew install git meson cairo pandoc
The MSYS2 project provides an MINGW64 compiler that can create native windows executables. MSYS2 itself is actually non-native (it is derived from cygwin), and therefore the MSYS2 shell refers to drives as /c/
instead of C:/
.
%
pacman --needed --noconfirm -Sy pacman-mirrors
%
pacman -Sy
%
pacman -S mingw-w64-x86_64-ninja
%
pacman -S mingw-w64-x86_64-toolchain
%
pacman -S mingw-w64-python3-pip
%
PATH=/c/msys64/mingw64/bin:$PATH
# Put the mingw64 executables into your path%
pip3 install meson
Keep in mind that MSYS2 keeps its (non-native) executables in C:/msys64/usr/bin
, while it keeps the (native) MINGW executables in C:/msys64/mingw64/bin
. If you want to use the native MINGW executables, you need to make sure that /c/msys64/mingw64/bin/
is in your PATH. If you forget to put the MINGW executables in the path, some of the installed MINGW programs (such as pip3 above) will show up as missing when you try to run them.
First check out the code using git:
%
git clone https://github.com/bredelings/BAli-Phy.git
%
cd BAli-Phy
%
git submodule update --init
Then run meson to configure the build process:
%
meson build --prefix=$HOME/Applications/bali-phy-3.4/
In the MSYS2 environment, the command is called meson.py instead of meson:
%
meson.py build --prefix=$HOME/Applications/bali-phy-3.4/
Finally, build and install the software:
%
ninja -C build
%
ninja -C build test
%
ninja -C build install
The command bali-phy and its associated tools should then be located in ~/Applications/bali-phy-3.4/bin/
. To install to another directory dir
, specify --prefix=dir
to meson.
You can select the C++ compiler by setting the CXX variable. A useful example of this is to use g++-8 on systems where g++ invokes a compiler that is too old:
%
CXX=g++-8 meson build --prefix=$HOME/Applications/bali-phy-3.4
You may also set compiler and linker options using the CPPFLAGS, CXXFLAGS, and LDFLAGS variables. For example, you can instruct the compiler to use all the features of your chip, instead of producing generic code that will run anywhere:
%
CXXFLAGS="-mtune=native -march=native" meson --prefix=$HOME/Applications/bali-phy-3.4
For example, you can set the CPPFLAGS and LDFLAGS variables to instruct the compiler where to look for libraries, such as cairo:
%
CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib" meson build --prefix=$HOME/Applications/bali-phy-3.4
Another useful example of this is to produce an OS X executable on that can run on older versions of OS X:
%
CXXFLAGS="-mmacosx-version-min=10.9" LDFLAGS="-mmacosx-version-min=10.9" meson build --prefix=$HOME/Applications/bali-phy-3.4
14.4.1. | Why is bali-phy still running? How long will it take? |
It runs until you stop it. Stop it when its done. The longer answer is that is is hard to predict how long MCMC will take to converge, since it depends on each data set in complex ways. Automatic rules for determining when to stop an MCMC chain can be difficult to get right. BAli-Phy does not contain an automatic stopping rule yet, so it relies on the user to run convergence diagnostics and determine when to stop the run. | |
14.4.2. | How do I stop a bali-phy run on my personal computer? |
Simply kill the process -- there is no special
command to stop bali-phy. If you are
running it on your personal workstation, then you can use
the command kill. To do that, you need
to find the PID (process ID) of the running program. You
can find this by examining the beginning of the file
Here the PID is 18838. Therefore you can type:
On some operating systems you can also type:
However, be aware that this will terminate all of your bali-phy runs on that computer. | |
14.4.3. | How do I stop a bali-phy run on a computing cluster? |
Simply terminate the submitted job. The specific command to terminate a job will depend on the queue manager that is installed on your cluster. Examine the documentation for your cluster, or ask your cluster support staff how to delete running jobs on your cluster. As an example, if the SGE software is used to submit jobs, then the command qstat should list your jobs and their job ID numbers (which is different than the process ID number). You can then use the command qdel to delete jobs by ID number. The SGE documentation describes how to use these commands. | |
14.4.4. | So, how can I know when to stop it? |
You can stop when it has both converged and also run for long enough to give you >1000 effectively independent samples. | |
14.4.5. | How can I tell when the chain has converged? |
See section Section 10, “Convergence and Mixing: Is it done yet?”. | |
14.4.6. | How can I check how many iterations the chain has finished? |
Run wc -l C1.log inside the output directory, and subtract 2. |
14.6.1. | How do I compute the clade support? |
Actually, BAli-Phy uses unrooted trees, so it only estimates bi-partition support. A bi-partition is a division of taxa into two groups, but it does not specify which group contains the root. | |
14.6.2. | How do I compute the split/bi-partition support? |
After you analyze the output (Section 5.4, “Summarizing the output - scripted”), the partition support is indicated in
|