Phylogenomics Workflow For Using Protein Data

This is an older version of the workflow which uses protein data as input. The Phyparts concordance analysis is also not included. I recommend using nucleotide data and the respective workflow instead.

Change headers in files

For JGI files:

cat Marbr1_GeneCatalog_proteins_20141203.aa.fasta | sed 's/jgi|//g' | sed 's/|/_/g' | sed 's/Marbr1_/Marbr1|/g' > Marbr1.names.fas

For Funannotate proteins.fas:

cat Nagfri.proteins.fas | sed 's/>NAGFRI_/>NAGFRI|g/g' > Nagfri.proteins_rename.fas

For NCBI-Genbank files. POR is the 3 letter accession.

cat Tolpar.fas | sed 's/ \[Tolypocladium paradoxum\]//g' | sed 's/>POR/>Tolpar|POR/g' | sed 's/ /_/g' | sed 's/,//g' | sed 's/\///g' | sed 's/(//g' | sed 's/)//g' | sed 's/[][]//g' > Tolpar.names.fas

Make a new directory and move all renamed files

mkdir good_names
mv *.names.fas good_names
cd good_names

Get the names of all files in the directory for the next step

echo $(dir)

Run ProteinOrtho

proteinortho -clean -project=project_name -cpus=$SLURM_NTASKS input_protein_file_1.fas input_protein_file_2.fas input_protein_file_3.fas

Get single copy, shared orthologous clusters. This example has 3 input proteomes:

grep $'^3\t3' protortho_output.tsv > new_all_3.tsv

Grab proteins from files with src_proteinortho_grab_proteins.pl. It's a good idea to take the header from the proteinortho output and include it at the top of your new tsv file. This can speed up the process.

perl ./src_proteinortho_grab_proteins.pl -exact -tofiles new_all_18.tsv input_protein_file_1.fas input_protein_file_2.fas input_protein_file_3.fas

Make a new directory and move single copy files

mkdir single_copy
mv *.fasta single_copy
cd single_copy

Create an lb file from files in this directory for MUSCLE input using new_create_muscle.sh

./new_create_muscle.sh > lb_cmd_file

Run MUSCLE with Open MPI.

mpirun lb lb_cmb_file

Move output files to new directory

mkdir muscle_out
mv *.fasta.out muscle_out
cd muscle_out

Create an lb file from files in this directory for trimAl input using create_trimal.sh

./create_trimal.sh > lb_cmd_trimal

Run trimAl

mpirun lb lb_cmd_trimal

Make a new folder and move trimmed files to this directory

mkdir trimmed
mv *.out.trim trimmed

Running RAxML-NG

Create an expected file from your trimmed alignments. Check this output file for the presence of all of your taxa

cat trimmed/*.fasta.out.trim | grep "^>" | tr "|" "\t" | cut -f 1 | sort | uniq > expected.txt

Concatenate your alignment with combine_fasaln.pl and your expected file

perl ./combine_fasaln_v2019.pl -o allseqs.fas -of fasta -d ./trimmed/ --expected expected.txt

Make a new directory and move your alignment

mkdir raxmlng
mv allseqs.fas raxmlng
cd raxmlng

Run RAxML-NG with WAG model (formerly PROTGAMMAWAG), 200 bs replicates and a seed value of 2 (this is arbitrary and can be changed)

raxml-ng-mpi --all -msa allseqs.fas --model WAG --opt-model on --seed 2 --bs-trees 200

Running ASTRAL

This process uses the files in trimmed/ to prepare individual gene trees to then estimate a species tree with ASTRAL

Rename fasta headers across alignments in trimmed/

for filename in *.trim
do
    cat "${filename}" | awk -F "|" '{print $1}' > "${filename}.renamed"
done

Create an lb file using create_astral.sh

./create_astral.sh > lb_cmd_astral

Run concatenated alignments of genes through RAxML to prepare individual gene trees

mpirun lb lb_cmd_astral

Concatenate all gene trees

cat raxml.bipartitions > all_gene_trees.tre

Optional: Filter gene trees with low support branches using Newick utilities. Example filters anything below 50%.

all_gene_trees.tre 'i & b<=50' o > all_gene_tree_BS50.tre

Run ASTRAL

java -jar astral.5.7.3.jar -i all_gene_trees.tre -o output.tre

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Workflow_for_protein_data.md

Workflow_for_protein_data.md

Phylogenomics Workflow For Using Protein Data

This is an older version of the workflow which uses protein data as input. The Phyparts concordance analysis is also not included. I recommend using nucleotide data and the respective workflow instead.

Running RAxML-NG

Running ASTRAL

Files

Workflow_for_protein_data.md

Latest commit

History

Workflow_for_protein_data.md

File metadata and controls

Phylogenomics Workflow For Using Protein Data

This is an older version of the workflow which uses protein data as input. The Phyparts concordance analysis is also not included. I recommend using nucleotide data and the respective workflow instead.

Running RAxML-NG

Running ASTRAL