Skip to content

Commit

Permalink
Revert "Update documentation links"
Browse files Browse the repository at this point in the history
This reverts commit 4dd96fd.
  • Loading branch information
percyfal committed Dec 19, 2023
1 parent 4dd96fd commit 9fb7e4d
Show file tree
Hide file tree
Showing 2 changed files with 149 additions and 24 deletions.
127 changes: 127 additions & 0 deletions README.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,127 @@
---
output: github_document
---

<!-- README.md is generated from README.Rmd with devtools::build_readme(). Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

# genecovr

<!-- badges: start -->
[![R build status](https://github.com/NBISweden/genecovr/workflows/R-CMD-check/badge.svg)](https://github.com/NBISweden/genecovr/actions)
<!-- badges: end -->

`genecovr` is an `R` package that provides plotting functions that
summarize gene transcript to genome alignments. The main purpose is to
assess the effect of polishing and scaffolding operations has on the
quality of a genome assembly. The gene transcript set is a large
sequence set consisting of assembled transcripts from RNA-seq data
generated in relation to a genome assembly project. Therefore,
`genecovr` serves as a complement to software such as
[BUSCO](https://busco.ezlab.org/), which evaluates genome assembly
quality using a smaller set of well-defined single-copy orthologs.

## Installation

You can install the released version of genecovr from [NBIS
GitHub](https://github.com/nbis) with:

``` r
# If necessary, uncomment to install devtools
# install.packages("devtools")
devtools::install_github("NBISweden/genecovr")
```

## Usage

### genecovr script quick start

There is a helper script for generating basic plots located in
PACKAGE_DIR/bin/genecovr. Create a data input csv-delimited file with
columns

1. data label
2. mapping file (supported formats: psl)
3. assembly file (fasta or fasta index)
4. transcript file (fasta or fasta index)

Columns 3 and 4 can be set to missing value (NA) in which case
sequence sizes will be inferred from the alignment files. Then run the
script to generate plots:

``` shell
PACKAGE_DIR/bin/genecovr indata.csv
```

#### Example

There are example files located in PACKAGE_DIR/inst/extdata consisting
of two psl alignment files containing gmap alignments and fasta
indices for the transcript sequences and two for different assembly
versions:

- nonpolished.fai - fasta index for raw assembly
- polished.fai - fasta index for polished assembly
- transcripts.fai - fasta index for transcript sequences
- transcripts2nonpolished.psl - gmap alignments, transcripts to raw assembly
- transcripts2polished.psl - gmap alignments, transcripts to polished
assembly

Using these files and the labels `non` and `pol` for the different
assemblies, a `genecovr` input file (called e.g., `assemblies.csv`)
would look as follows:

```
nonpol,transcripts2nonpolished.psl,nonpolished.fai,transcripts.fai
pol,transcripts2polished.psl,polished.fai,transcripts.fai
```

and the command to run would be:

```
genecovr assemblies.csv
```

#### genecovr options

To list genecovr script options, type 'genecovr -h`:

```
usage: genecovr [-h] [-v] [-p number]
[-d OUTPUT_DIRECTORY] [--height HEIGHT]
[--width WIDTH]
csvfile
positional arguments:
csvfile csv-delimited file with columns
1. data label
2. mapping file (supported formats: psl)
3. assembly file (fasta or fasta index)
4. transcript file (fasta or fasta index)
optional arguments:
-h, --help show this help message and exit
-v, --verbose print extra output
-p number, --cpus number
number of cpus [default 1]
-d OUTPUT_DIRECTORY, --output-directory OUTPUT_DIRECTORY
output directory
--height HEIGHT figure height in inches [default 6.0]
--width WIDTH figure width in inches [default 6.0]
```



### R package vignette

Alternatively, import the library in an R script and use the package
functions. See [Get started](articles/genecovr.html) or run
`vignette("genecovr")` for a minimum working example.
46 changes: 22 additions & 24 deletions vignettes/genecovr.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -58,14 +58,13 @@ psize <- 3
## On object representation of pairwise sequence alignments

`genecovr` has functionality to read pairwise sequence alignment files
and converts the pairwise alignments to `genecovr::AlignmentPairs`
objects. An `AlignmentPairs` object is a subclass of the Bioconductor
class `S4Vectors::Pairs`. A `Pairs` object in turn aligns two vectors
along slot names `first` and `second`, and the `AlignmentPairs` object
adds slots for the query and subject, and possibly extra slots related
to additional information in the alignment file. The query and subject
are `GenomicRanges::GRanges` objects or objects derived from the
`GRanges`class.
and converts the pairwise alignments to `AlignmentPairs` objects. An
`AlignmentPairs` object is a subclass of the Bioconductor class
`S4Vectors::Pairs`. A `Pairs` object in turn aligns two vectors along
slot names `first` and `second`, and the `AlignmentPairs` object adds
slots for the query and subject, and possibly extra slots related to
additional information in the alignment file. The query and subject
are `GRanges` objects or objects derived from the `GRanges`class.


# Analysing gene body coverage
Expand All @@ -76,10 +75,10 @@ gmap files in psl format, `transcripts2nonpolished.psl` and
`transcripts2polished.psl`. In addition there are fasta index files
for both assemblies (`nonpolished.fai` and `polished.fai`) and for the
transcriptome (`transcripts.fai`). The fasta indices are used to
generate `GenomeInfoDb::Seqinfo` objects that can be used to set
sequence information on the parsed output. We load the fasta indices
and parse the psl files with `genecovr::readPsl`, storing the results
in an `genecovr::AlignmentPairsList` for convenience.
generate `Seqinfo` objects that can be used to set sequence
information on the parsed output. We load the fasta indices and parse
the psl files with `readPsl`, storing the results in an
`AlignmentPairsList` for convenience.

``` {r gbc-load-data}
assembly_fai_fn <- list(
Expand Down Expand Up @@ -158,10 +157,10 @@ plot(apl, aes(x = id, y = get_expr(enquo(cnames))), which = "boxplot") +

## Plot number of indels

The function `genecovr::insertionSummary` summarizes the number of
insertions, either at the transcript level (default) or per alignment.
The intuition is that as assembly quality improves, the number of
indels go down.
The function `insertionSummary` summarizes the number of insertions,
either at the transcript level (default) or per alignment. The
intuition is that as assembly quality improves, the number of indels
go down.

First we show a plot with the number of insertions per alignment. A
consequence of this is that as a transcript may be split in multiple
Expand Down Expand Up @@ -189,18 +188,17 @@ ggplot(x, aes(id)) +

## Gene body coverage

The function `genecovr::geneBodyCoverage` takes an `AlignmentPairs`
object and summarizes breadth of coverage and number of subject hits
per transcript.
The function `geneBodyCoverage` takes an `AlignmentPairs` object and
summarizes breadth of coverage and number of subject hits per
transcript.

``` {r gbc-make-gene-body-coverage}
gbc <- lapply(apl, geneBodyCoverage, min.match = 0.1)
```

A summary can be obtained with the
`genecovr::summarizeGeneBodyCoverage` function. We define a range of
minimum match hit cutoffs to filter out hits with too few matches in
the aligned region.
A summary can be obtained with the `summarizeGeneBodyCoverage`
function. We define a range of minimum match hit cutoffs to filter out
hits with too few matches in the aligned region.

``` {r gbc-summarize-gene-body-coverage}
min.match <- c(0.25, 0.5, 0.75, 0.9)
Expand Down Expand Up @@ -239,7 +237,7 @@ ggplot(

A fragmented assembly will lead to more transcripts mapping to several
contigs. We calculate the number of subjects by coverage cutoff with
the function `genecovr::countSubjectsByCoverage`
the function `countSubjectsByCoverage`

```{r gbc-ncontigs-per-transcript}
data <- dplyr::bind_rows(
Expand Down

0 comments on commit 9fb7e4d

Please sign in to comment.