diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 0c8bc3c..7f93a9b 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -15,11 +15,11 @@ repos: - id: markdownlint-cli2 files: \.(md|qmd)$ types: [file] - exclude: LICENSE.md + exclude: (README.md|LICENSE.md)$ - id: markdownlint-cli2-fix files: \.(md|qmd)$ types: [file] - exclude: LICENSE.md + exclude: (README.md|LICENSE.md)$ - repo: https://github.com/lorenzwalthert/precommit rev: v0.3.2.9027 hooks: diff --git a/README.Rmd b/README.Rmd index c030cb7..dad725e 100644 --- a/README.Rmd +++ b/README.Rmd @@ -118,8 +118,6 @@ optional arguments: --width WIDTH figure width in inches [default 6.0] ``` - - ### R package vignette Alternatively, import the library in an R script and use the package diff --git a/README.md b/README.md index e95ed88..b79eb5f 100644 --- a/README.md +++ b/README.md @@ -9,8 +9,15 @@ status](https://github.com/NBISweden/genecovr/workflows/R-CMD-check/badge.svg)](https://github.com/NBISweden/genecovr/actions) -Perform gene body coverage analyses in R to evaluate genome assembly -quality. +`genecovr` is an `R` package that provides plotting functions that +summarize gene transcript to genome alignments. The main purpose is to +assess the effect of polishing and scaffolding operations has on the +quality of a genome assembly. The gene transcript set is a large +sequence set consisting of assembled transcripts from RNA-seq data +generated in relation to a genome assembly project. Therefore, +`genecovr` serves as a complement to software such as +[BUSCO](https://busco.ezlab.org/), which evaluates genome assembly +quality using a smaller set of well-defined single-copy orthologs. ## Installation @@ -18,14 +25,17 @@ You can install the released version of genecovr from [NBIS GitHub](https://github.com/nbis) with: ``` r +# If necessary, uncomment to install devtools # install.packages("devtools") devtools::install_github("NBISweden/genecovr") ``` -## Quick usage +## Usage + +### genecovr script quick start There is a helper script for generating basic plots located in -PACKAGE\_DIR/bin/genecovr. Create a data input csv-delimited file with +PACKAGE_DIR/bin/genecovr. Create a data input csv-delimited file with columns 1. data label @@ -41,7 +51,59 @@ generate plots: PACKAGE_DIR/bin/genecovr indata.csv ``` -## Vignette +#### Example + +There are example files located in PACKAGE_DIR/inst/extdata consisting +of two psl alignment files containing gmap alignments and fasta indices +for the transcript sequences and two for different assembly versions: + +- nonpolished.fai - fasta index for raw assembly +- polished.fai - fasta index for polished assembly +- transcripts.fai - fasta index for transcript sequences +- transcripts2nonpolished.psl - gmap alignments, transcripts to raw + assembly +- transcripts2polished.psl - gmap alignments, transcripts to polished + assembly + +Using these files and the labels `non` and `pol` for the different +assemblies, a `genecovr` input file (called e.g., `assemblies.csv`) +would look as follows: + + nonpol,transcripts2nonpolished.psl,nonpolished.fai,transcripts.fai + pol,transcripts2polished.psl,polished.fai,transcripts.fai + +and the command to run would be: + + genecovr assemblies.csv + +#### genecovr options + +To list genecovr script options, type ’genecovr -h\`: + + usage: genecovr [-h] [-v] [-p number] + [-d OUTPUT_DIRECTORY] [--height HEIGHT] + [--width WIDTH] + csvfile + + positional arguments: + csvfile csv-delimited file with columns + 1. data label + 2. mapping file (supported formats: psl) + 3. assembly file (fasta or fasta index) + 4. transcript file (fasta or fasta index) + + optional arguments: + -h, --help show this help message and exit + -v, --verbose print extra output + -p number, --cpus number + number of cpus [default 1] + -d OUTPUT_DIRECTORY, --output-directory OUTPUT_DIRECTORY + output directory + --height HEIGHT figure height in inches [default 6.0] + --width WIDTH figure width in inches [default 6.0] + +### R package vignette -Alternatively, import the library as usual in an R script and use the -package functions. See the vignette for a minimum working example. +Alternatively, import the library in an R script and use the package +functions. See [Get started](articles/genecovr.html) or run +`vignette("genecovr")` for a minimum working example. diff --git a/vignettes/genecovr.Rmd b/vignettes/genecovr.Rmd index 5966487..b60a1b7 100644 --- a/vignettes/genecovr.Rmd +++ b/vignettes/genecovr.Rmd @@ -58,13 +58,14 @@ psize <- 3 ## On object representation of pairwise sequence alignments `genecovr` has functionality to read pairwise sequence alignment files -and converts the pairwise alignments to `AlignmentPairs` objects. An -`AlignmentPairs` object is a subclass of the Bioconductor class -`S4Vectors::Pairs`. A `Pairs` object in turn aligns two vectors along -slot names `first` and `second`, and the `AlignmentPairs` object adds -slots for the query and subject, and possibly extra slots related to -additional information in the alignment file. The query and subject -are `GRanges` objects or objects derived from the `GRanges`class. +and converts the pairwise alignments to `genecovr::AlignmentPairs` +objects. An `AlignmentPairs` object is a subclass of the Bioconductor +class `S4Vectors::Pairs`. A `Pairs` object in turn aligns two vectors +along slot names `first` and `second`, and the `AlignmentPairs` object +adds slots for the query and subject, and possibly extra slots related +to additional information in the alignment file. The query and subject +are `GenomicRanges::GRanges` objects or objects derived from the +`GRanges`class. # Analysing gene body coverage @@ -75,10 +76,10 @@ gmap files in psl format, `transcripts2nonpolished.psl` and `transcripts2polished.psl`. In addition there are fasta index files for both assemblies (`nonpolished.fai` and `polished.fai`) and for the transcriptome (`transcripts.fai`). The fasta indices are used to -generate `Seqinfo` objects that can be used to set sequence -information on the parsed output. We load the fasta indices and parse -the psl files with `readPsl`, storing the results in an -`AlignmentPairsList` for convenience. +generate `GenomeInfoDb::Seqinfo` objects that can be used to set +sequence information on the parsed output. We load the fasta indices +and parse the psl files with `genecovr::readPsl`, storing the results +in an `genecovr::AlignmentPairsList` for convenience. ``` {r gbc-load-data} assembly_fai_fn <- list( @@ -137,7 +138,7 @@ assembly, as expected. ## Plot summary indel and match distributions We can also select multiple columns to plot in the -`AlignmentPairsList`. +`genecovr::AlignmentPairsList`. ```{r gbc-plot-match-indel} cnames <- c("misMatches", "query.NumInsert", "query.BaseInsert") @@ -157,7 +158,7 @@ plot(apl, aes(x = id, y = get_expr(enquo(cnames))), which = "boxplot") + ## Plot number of indels -The function `insertionSummary` summarizes the number of insertions, +The function `genecovr::insertionSummary` summarizes the number of insertions, either at the transcript level (default) or per alignment. The intuition is that as assembly quality improves, the number of indels go down. @@ -188,17 +189,18 @@ ggplot(x, aes(id)) + ## Gene body coverage -The function `geneBodyCoverage` takes an `AlignmentPairs` object and -summarizes breadth of coverage and number of subject hits per -transcript. +The function `genecovr::geneBodyCoverage` takes an `AlignmentPairs` +object and summarizes breadth of coverage and number of subject hits +per transcript. ``` {r gbc-make-gene-body-coverage} gbc <- lapply(apl, geneBodyCoverage, min.match = 0.1) ``` -A summary can be obtained with the `summarizeGeneBodyCoverage` -function. We define a range of minimum match hit cutoffs to filter out -hits with too few matches in the aligned region. +A summary can be obtained with the +`genecovr::summarizeGeneBodyCoverage` function. We define a range of +minimum match hit cutoffs to filter out hits with too few matches in +the aligned region. ``` {r gbc-summarize-gene-body-coverage} min.match <- c(0.25, 0.5, 0.75, 0.9) @@ -237,7 +239,7 @@ ggplot( A fragmented assembly will lead to more transcripts mapping to several contigs. We calculate the number of subjects by coverage cutoff with -the function `countSubjectsByCoverage` +the function `genecovr::countSubjectsByCoverage` ```{r gbc-ncontigs-per-transcript} data <- dplyr::bind_rows(