ProGraphMSA project page

Introduction:

ProGraphMSA is a state-of-the-art multiple sequence alignment tool that produces phylogenetically sensible gap patterns while remaining robust to alternative splicing and to errors in the branching pattern of the guide tree. It achieves this by combining a graph-based sequence representation, as in POA, with the advantages of the phylogeny-aware algorithm in Prank. Further, we account for variations in the substitution pattern by using estimated amino acid frequencies and by implementing context-specific profiles as in CS-Blast.
The latest versions of ProGraphMSA include an alignment model supporting tandem repeat unit insertions and deletions.

Web-based Multiple Sequence Alignment:

Downloads:

If you experience any problems downloading or executing ProGraphMSA, please contact me at bladam@.szalkowsutki@##inf.ethz*.ch

Example:

Downloading and building ProGraphMSA from source on Linux:

# download and extract source
wget http://people.inf.ethz.ch/sadam/ProGraphMSA/files/ProGraphMSA-current.tar.gz
tar xzvf ProGraphMSA-current.tar.gz

# change into the source directory
cd ProGraphMSA

Running ProGraphMSA:

# perform an alignment and output stockholm format:
./ProGraphMSA.sh input_sequences.fasta -o output.stk

# perform an alignment and output fasta format:
./ProGraphMSA.sh --fasta input_sequences.fasta -o output.fasta
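When many input files need aligning, the commands above can be wrapped in a small loop. The following is a minimal sketch and not part of the ProGraphMSA distribution: the helper names count_fasta_records and align_dir, the ALIGNER variable, and the one-directory-in, one-directory-out layout are assumptions made for illustration. Files with fewer than two sequences are skipped, since a multiple alignment needs at least two.

```shell
#!/bin/sh
# Sketch only: ALIGNER, count_fasta_records and align_dir are illustrative
# names, not part of ProGraphMSA. Point ALIGNER at your ProGraphMSA.sh wrapper.
ALIGNER="${ALIGNER:-./ProGraphMSA.sh}"

# Count FASTA records by counting header lines that start with '>'.
count_fasta_records() {
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that.
    grep -c '^>' "$1" || true
}

# Align every *.fasta file in directory $1, writing <name>.stk into directory $2.
align_dir() {
    in_dir="$1"
    out_dir="$2"
    mkdir -p "$out_dir" || return 1
    for fasta in "$in_dir"/*.fasta; do
        # Skip the literal pattern when the directory has no .fasta files.
        [ -e "$fasta" ] || continue
        if [ "$(count_fasta_records "$fasta")" -lt 2 ]; then
            echo "skipping $fasta: fewer than two sequences" >&2
            continue
        fi
        name=$(basename "$fasta" .fasta)
        # Same invocation as the stockholm example above, one output per input.
        "$ALIGNER" "$fasta" -o "$out_dir/$name.stk" || echo "failed: $fasta" >&2
    done
}
```

A call such as `align_dir ./inputs ./alignments` would then produce one stockholm file per input. A failed alignment is reported on stderr and the loop continues with the next file.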

Downloading and installing tandem repeat detectors:

# download T-REKS
wget http://bioinfo.montp.cnrs.fr/t-reks/T-Reks.jar
# download and extract TRUST
wget http://www.ibi.vu.nl/programs/trustwww/trust.tgz
tar xzf trust.tgz
# adjust installation path in wrapper script
echo "Trust path: $(pwd)/Align"
${EDITOR} trust2treks.py

Running ProGraphMSA+TR:

# perform an alignment using T-REKS to detect tandem repeats and output stockholm format:
./ProGraphMSA+TR.sh input_sequences.fasta -o output.stk

# perform an alignment using T-REKS to detect tandem repeats and output fasta format:
./ProGraphMSA+TR.sh --fasta input_sequences.fasta -o output.fasta

# perform an alignment using TRUST to detect tandem repeats and output stockholm format:
./ProGraphMSA+TR.sh --custom_tr_cmd trust2treks.py input_sequences.fasta -o output.stk
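Tandem repeat detection can take a noticeable share of the run time, and the --trd_output and --read_repeats options listed under the command line parameters suggest that detector output can be saved and reused. The sketch below illustrates that idea under several assumptions that should be checked against your installed version: that ProGraphMSA+TR.sh forwards --trd_output to ProGraphMSA, that ProGraphMSA.sh accepts --read_repeats, and that the saved file can be read back unchanged. The names TR_ALIGNER, ALIGNER and align_with_cached_repeats are hypothetical.

```shell
#!/bin/sh
# Sketch only: these variable and function names are illustrative.
# Assumption: both wrapper scripts pass extra options through to ProGraphMSA.
TR_ALIGNER="${TR_ALIGNER:-./ProGraphMSA+TR.sh}"   # wrapper that runs the TR detector
ALIGNER="${ALIGNER:-./ProGraphMSA.sh}"            # plain wrapper

# Usage: align_with_cached_repeats <fasta file> <output file> <repeat cache file>
align_with_cached_repeats() {
    fasta="$1"
    out="$2"
    cache="$3"
    if [ -s "$cache" ]; then
        # Reuse earlier detector output instead of running the detector again.
        "$ALIGNER" --read_repeats "$cache" "$fasta" -o "$out"
    else
        # First run: detect repeats and keep the detector output for later.
        "$TR_ALIGNER" --trd_output "$cache" "$fasta" -o "$out"
    fi
}
```

For example, `align_with_cached_repeats input_sequences.fasta output.stk repeats.treks` runs the detector on the first call and reads repeats.treks on later calls, which can help when re-aligning the same sequences with different indel parameters.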

Documentation:

Command line parameters:

Usage: ProGraphMSA [--ancestral_seqs] [--all_trees] [-i <iterations>] [-T] [-M] [-m] [-a] [-C <count>] [-F] [--custom_model <file>] [-w] [-c <file>] [-r] [-R] [--custom_tr_cmd <command>] [--trd_output <filename>] [--read_repeats <T-Reks format output>] [--repalign] [--repeat_indel_ext <probability>] [--repeat_indel_rate <rate>] [-A] [-P <distance>] [-p <distance>] [-D <distance>] [-d <distance>] [-x <distance>] [-l <distance>] [-E <probability>] [-e <probability>] [-g <rate>] [-f] [--dna] [--codon] [-t <newick file>] [-o <filename>] [--] [--version] [-h] <fasta file>
Tandem-repeat related parameters:
-R, --repeats use T-REKS to identify tandem repeats
--custom_tr_cmd <command> custom command for detecting tandem-repeats
--trd_output <filename> write TR detector output to file
--read_repeats <T-REKS format output> read TR detector output from file
--repalign re-align detected tandem repeat units
--repeat_indel_ext <probability> repeat indel extension probability
--repeat_indel_rate <rate> insertion/deletion rate for repeat units (per site)
Guide tree, distances, and substitution model:
-i <iterations>, --iterations <iterations> number of iterations re-estimating guide tree [default: 2]
-m, --mldist use distances estimated by a Maximum-Likelihood method
-a, --nwdist estimate initial distance tree from Needleman-Wunsch alignments
-D <distance>, --max_dist <distance> maximum distance for alignment
-F, --estimate_aafreqs estimate equilibrium amino acid frequencies from input data
-w, --darwin use model of evolution from Darwin (GONNET matrix and different indel model parameters, otherwise WAG will be used)
--custom_model <file> custom substitution model in qmat format
-c <file>, --cs_profile <file> path to library of context-sensitive profiles (we distribute a copy in the 3rd_party folder)
-A, --no_force_align_m do not force alignment of initial Methionine
Parameters for adjusting the indel model:
-l <distance>, --edge_halflife <distance> edge half-life (evolutionary distance at which the probability of re-using an unused graph is halved)
-E <probability>, --end_indel_prob <probability> probability of mismatching sequence ends (set to -1 to disable this feature)
-e <probability>, --gap_ext <probability> gap extension probability
-g <rate>, --indel_rate <rate> insertion/deletion rate
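The description of the edge half-life can be read as an exponential decay. The formula below is one natural reading of the word "half-life", not a statement of the model as implemented; the symbols P_reuse, d and l are introduced here for illustration only.

```latex
% Assumed form: the re-use probability decays exponentially with
% evolutionary distance d and halves every l distance units.
P_{\text{reuse}}(d) = P_{\text{reuse}}(0) \cdot 2^{-d/l}
```

Under this assumption, a hypothetical half-life of l = 0.5 would reduce the re-use probability to 2^{-2} = 1/4 of its initial value at distance d = 1.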
Input/Output:
-f, --fasta output fasta format (instead of stockholm)
-t <newick file>, --tree <newick file> initial guide tree
-o <filename>, --output <filename> output file name
-I, --input_order output sequences in input order (default: tree order)
--dna align DNA sequences
--codon align DNA sequences based on a codon model
--ancestral_seqs output all ancestral sequences
<fasta file> (required) input sequences