
Commit

fixed header issue in docs
DomBennett committed May 30, 2020
1 parent faa8b8f commit 47053fb
Showing 1 changed file with 9 additions and 9 deletions.
18 changes: 9 additions & 9 deletions vignettes/phylotaR.Rmd
@@ -7,7 +7,7 @@ vignette: >
%\VignetteEncoding{UTF-8}
---

#Introduction
# Introduction

The first step to running a phylogenetic analysis is the identification of overlapping sequences. Often orthology is determined by pairing sequences whose gene names match (e.g. COI sequences with COI sequences, rbcl sequences with rbcl sequences). Problems can arise, however, if gene names differ between authors, if different gene sections are represented or if sequences are mislabelled. These issues can be especially problematic for large-scale analyses where individual errors cannot be detected.

@@ -16,13 +16,13 @@ The first step to running a phylogenetic analysis is the identification of overl
This R package, `phylotaR`, is an R implementation of this pipeline. In this vignette we will demonstrate how to run PhyLoTa using a small taxonomic group. The pipeline is composed of four automated stages (taxise, download, cluster, cluster2) and a final user-performed stage of cluster selection.


#Installing NCBI BLAST+ Tools
# Installing NCBI BLAST+ Tools

The PhyLoTa pipeline uses BLAST to identify orthologous sequence clusters. In order to run phylotaR, a local copy of the BLAST software must be installed on your computer. **Installing the phylotaR package does not install BLAST; it must be installed separately**. To install BLAST+, please see the NCBI website's [installation instructions](https://www.ncbi.nlm.nih.gov/books/NBK279671/).

#Pipeline
# Pipeline

##Setup
## Setup

For demonstration purposes we will run the pipeline on a small taxonomic group. Because they are charismatic and relatively well-studied, we will select the Night Monkey genus, [Aotus](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9504). Now that we have decided on a taxonomic group we need to find out its unique taxonomic ID. This can be looked up by navigating to the [NCBI taxonomy webpage](https://www.ncbi.nlm.nih.gov/taxonomy) and searching 'Aotus'. Doing this, we can see that the Aotus ID is **9504**. We will need this number for specifying the parameters in our pipeline. (Note that there is also a plant genus called Aotus.)

@@ -38,7 +38,7 @@ setup(wd = wd, txid = txid, ncbi_dr = ncbi_dr, v = TRUE)

The above imports the `phylotaR` package and initiates a cache that will contain the pipeline parameters. For this tutorial we will keep the parameters at their defaults. See the function `parameters()` for a complete list and description of all the parameters and their default values. For more detailed information on the parameters please see the publication, [phylotaR: An Automated Pipeline for Retrieving Orthologous DNA Sequences from GenBank in R](https://doi.org/10.3390/life8020020). `wd` must be a file path to the folder we called `aotus/`. `ncbi_dr` must be a file path to the folder containing all the NCBI BLAST+ tools -- see above 'Installing NCBI BLAST+ Tools'. Depending on your system and how you installed the tools, they may be in your system path, in which case you can simply supply '.' to the `ncbi_dr` argument. On my computer I provide the path to the folder where the `blastn` executable is located, e.g. `/usr/local/ncbi/blast/bin/`. Running `setup()` will verify whether the BLAST tools are installed correctly.
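If you are unsure where the BLAST+ tools live, base R can look for `blastn` on the system path. The following is a minimal sketch only: `find_ncbi_dr()` is a hypothetical helper and not part of `phylotaR`, and it assumes that `blastn` sits in the same folder as the other BLAST+ executables.

```r
# Hypothetical helper (not part of phylotaR): guess the folder that holds
# the BLAST+ executables by searching the system path for `blastn`.
find_ncbi_dr <- function(tool = "blastn") {
  path <- unname(Sys.which(tool))
  # Sys.which() returns "" (or NA on some systems) when nothing is found
  if (is.na(path) || !nzchar(path)) {
    return(NA_character_)
  }
  dirname(path)
}

ncbi_dr_guess <- find_ncbi_dr()
if (is.na(ncbi_dr_guess)) {
  message("blastn was not found on the system path; set ncbi_dr by hand.")
} else {
  message("BLAST+ tools appear to be in: ", ncbi_dr_guess)
}
```

If a folder is reported, it could be passed as the `ncbi_dr` argument of `setup()`; `setup()` remains the authoritative check that the tools work.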

##Running
## Running

After `setup()` has been run we can run the pipeline with the following command.

@@ -48,7 +48,7 @@ run(wd = wd)

This will run all the automated stages of the pipeline: taxise, download, cluster and cluster2. The first of these stages looks up all the taxonomic information available on the descendants of the parent ID provided, `txid`. The second downloads representative sequences for all identified descendants. No additional arguments are required other than `wd`, which specifies the working directory that contains the cache and all parameters as set up by `setup()`. In this folder you will also find a `log.txt` that reports detailed information on the progression of the pipeline as well as all the output files generated by each stage. Additionally, you will see a session info file and a BLAST version text file. These files, along with the log, can help with debugging if any errors occur. The whole pipeline can complete in around 2 minutes for Aotus using default parameters. Aotus, however, is a genus of only 13 taxa; larger clades will take much longer, particularly during the download stage.
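For longer runs it can be handy to peek at the end of the log from a second R session. The sketch below uses only base R; `tail_log()` is a hypothetical helper, not a `phylotaR` function, and it assumes the log is stored as `log.txt` directly inside `wd`, as described above.

```r
# Hypothetical helper (not part of phylotaR): return the last n lines of
# the pipeline log, assumed to be at <wd>/log.txt.
tail_log <- function(wd, n = 10) {
  log_file <- file.path(wd, "log.txt")
  if (!file.exists(log_file)) {
    stop("No log.txt found in ", wd)
  }
  utils::tail(readLines(log_file, warn = FALSE), n)
}

# Example usage, assuming `wd` points at a folder prepared by setup():
# writeLines(tail_log(wd = wd, n = 20))
```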

##Restarting
## Restarting

The pipeline can be halted and restarted. The cache records all data downloaded and generated by the pipeline. If there is a system crash or the user wishes to halt the program, the pipeline can be restarted from the same point it stopped with the function `restart()`. Additionally, owing to the potentially random nature of the pipeline, a user may wish to re-run the pipeline from certain stages. This can be achieved by first using `reset()` followed by `restart()`. For example, in the code below a completed pipeline is reset to 'cluster' and then restarted. After running these commands, the pipeline will run as if it has only just completed the download stage. Note that all resets and restarts are recorded in the log.

@@ -57,7 +57,7 @@ reset(wd = wd, stage = 'cluster')
restart(wd = wd)
```

###Changing parameters
### Changing parameters

Parameters can always be set by a user at the initiation of a folder with the `setup()` function. To change the parameter values after a folder has already been set up, a user can use `parameters_reset()`. For example, if the download stage is taking particularly long, the `btchsz` could be increased. This would raise the number of sequences downloaded per request. (Note that too high a `btchsz` may cause your NCBI Entrez access to be limited.)

@@ -69,7 +69,7 @@ restart(wd = wd)
# ^ restart from whatever point it was halted
```

##Cluster selection
## Cluster selection

After a pipeline has completed, the identified clusters can be interrogated. We can generate a phylota object using `read_phylota()` but in the code below we will load a pre-existing phylota object from the package data. The phylota object contains cluster, sequence and taxonomic information on all the clusters. It has eight data slots: cids, sids, txids, txdct, sqs, clstrs, prnt_id and prnt_nm. Each of these slots can be accessed with `@`; see ?\`Phylota-class\` for more information. The `phylotaR` package has a range of functions for probing clusters in a phylota object. For example, if we want to know how many different taxonomic groups are represented by each cluster we can use `get_ntaxa()`.

@@ -131,7 +131,7 @@ write_sqs(phylota = reduced, sid = sids, sq_nm = scientific_names,
# ^ to avoid clutter, we're writing to a temporary folder
```

##Testing output
## Testing output

We can sanity check our cluster sequences by running a very quick phylogenetic analysis using mafft and raxml. The code below will use the cluster to generate an alignment and a tree through R. For the code to run, mafft and raxml must be installed; it may also require tweaking to work on your system.
