ARNAQ stands for Analysts' RNA QC. ARNAQ is a tool for exploring RNA sequencing datasets, with an eye to differential expression or other analysis. It is designed to make reviewing the data and isolating potential QC issues as simple and swift as possible.
It is not a replacement for the first-line QC performed by sequencing platforms and their associated tools. Instead, it supplements them by considering experimental groups and allowing a more nuanced approach to potentially problematic samples than a simple pass/fail flag.
Analysts who handle a lot of RNA projects will benefit from being able to perform these QC steps more quickly, and from avoiding copy/paste scripts, which are prone to errors when a line is not updated for the current project. The report contains descriptions of what the plots show, helping to present this work to collaborators who are less familiar with bioinformatics.
Investigators who have some familiarity with bioinformatics will benefit from shorter, clearer R scripts where they do not have to delve too deeply into the specifics of the packages used for this QC analysis, helping them be confident that the analysis they perform is reliable.
In both cases, the .svg export option produces high-quality, print-ready plots that can be easily incorporated into papers.
ARNAQ is named in reference to the board game 'Lost Ruins of Arnak'.
The goal of ARNAQ is to make the process of producing a context-aware QC report as quick and error-free as possible, so analyst time can instead be spent elsewhere.
Metadata about the experiment (such as the reference genome) and metadata about samples (such as group membership) are typically available outside R. ARNAQ therefore reads this information from simple files, which can either be generated by a pipeline or adapted from a pipeline's output files.
With these files created, the process for creating a report can be as simple as a single line:
arnaq()
The rationale is that the critical parameters for an analysis should not be buried inside a long list of R calls. By placing them in tables, and permitting automated generation as part of a pipeline, ARNAQ makes it easy to visually check that the correct metadata is being used.
Full installation and post-installation instructions are here.
ARNAQ requires metadata files for each project you run through it.
The first, resources.yml, contains metadata that applies to the project as a whole, including the names of data files that will be used to create the report.
The second, the samples metadata file (samples.txt by default), contains metadata specific to each sample. This includes the name to use for the samples in the report (which may differ from that in the data files), and information on which experimental group(s) the sample belongs to.
The entries for resources.yml should be:
- project_id: this is a name to use for the report, which will also be prepended to the files the script outputs.
- species: a species identifier, which will be listed in the report but not used for any processing. This can be a more presentation-ready equivalent of genome_reference, next.
- genome_reference: the directory name containing the .gtf file with gene definitions. See below for how to set this.
- count_table: this is the location of the count table file.
- summary_table: this is the location of the read assignment summary, generated by featureCounts or similar tool. If listed as 'None', it is assumed that this data is unavailable and the report will be produced without that section.
- duplication_table: this is the location of a table of duplication rates by sample. If listed as 'None', it is assumed to be unavailable.
- metrics_table: this is the location of the Picard metrics file. If listed as 'None', it is assumed to be unavailable.
- resource_dir: this is the parent directory for all genome references, as described below.
- report_template: this is a path to an rmarkdown template used to build the report. The default of INTERNAL will use the standard report in the ARNAQ package. You only need to change this if you intend to alter the template.
- biotype_conversion: the location of a table that maps the full list of Ensembl biotypes to a smaller number of categories, in order to keep the associated plots in the report readable. The default of INTERNAL will use the standard table included in the ARNAQ package.
- ercc_concentrations: the location of a table of data regarding concentrations of ERCC spike-ins in the two different mixes, used for QC plots to test accurate detection of those spike-ins. This is only needed if you are using ERCC spike-in features of ARNAQ.
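As an illustration of the entries above, a resources.yml might look like the following sketch. It assumes a flat key: value layout, and every value (project name, file names, directory paths) is a hypothetical placeholder rather than something shipped with ARNAQ. The ercc_concentrations entry is left out on the assumption that the ERCC spike-in features are not being used.

```yaml
# Hypothetical example; all names and paths are placeholders.
project_id: example_project
species: Homo sapiens            # shown in the report only, not used for processing
genome_reference: GRCh38         # directory under resource_dir holding the .gtf file
count_table: counts.txt
summary_table: counts.txt.summary
duplication_table: None          # 'None' means the data is unavailable
metrics_table: None
resource_dir: /path/to/genome_references
report_template: INTERNAL        # use the report bundled with ARNAQ
biotype_conversion: INTERNAL     # use the table bundled with ARNAQ
```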
The samples metadata file contains metadata about specific samples in a project. Its columns are:
- Fastq: unused by this tool, but can be used to store the original filenames for each sample, or some other early identifier. If in doubt, this can be left the same as the next column.
- Name: identifiers for samples that correspond to the column names in the count table and other data files.
- Display: names for the samples to display in the QC document. This can be identical to the Name column or you can give samples more readable names.
- Any number of additional columns, where each column is one set of groups that samples can belong to. This group information will be used in the report. Group names should not start with a number. Numerical data will not be handled correctly by ARNAQ, but you can include these columns anyway as long as you use the treat.groups parameter to only include factor-based metadata.
The order of lines in this file does not matter; the Name column is used to assign the names and metadata to the correct sample. The version of this table stored in the R session will be sorted to match the columns of the count data.
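The matching described above can be pictured with a few lines of base R. This is only a sketch of the idea, not ARNAQ's internal code: the sample names, the Treatment group column, and the count values are all made up for illustration.

```r
# Hypothetical samples metadata, deliberately in a different order
# from the count table columns.
samples <- data.frame(
  Fastq     = c("s3.fastq.gz", "s1.fastq.gz", "s2.fastq.gz"),
  Name      = c("S3", "S1", "S2"),
  Display   = c("Treated 1", "Control 1", "Control 2"),
  Treatment = c("treated", "control", "control"),
  stringsAsFactors = FALSE
)

# Hypothetical count table: genes in rows, samples in columns.
counts <- matrix(
  c(10, 0, 5, 12, 3, 7),
  nrow = 2,
  dimnames = list(c("geneA", "geneB"), c("S1", "S2", "S3"))
)

# Find the metadata row for each count column, using the Name column.
row_order <- match(colnames(counts), samples$Name)
if (anyNA(row_order)) {
  stop("Count table columns missing from the Name column: ",
       paste(colnames(counts)[is.na(row_order)], collapse = ", "))
}

# Reorder the metadata so its rows line up with the count columns.
samples_sorted <- samples[row_order, ]
rownames(samples_sorted) <- NULL

# Every column after the three fixed ones is a set of groups.
group_columns <- setdiff(colnames(samples_sorted),
                         c("Fastq", "Name", "Display"))
```

Stopping on an unmatched name, rather than silently dropping the sample, is a choice made for this sketch; it keeps a typo in the Name column from going unnoticed.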
- Description of the typical workflow for ARNAQ.
- Examples of report generation using included data.
- A list of the files created during an ARNAQ run.
- A list of the objects created in the R session during an ARNAQ run.
- The structure of the all.plots object, to enable fine-tuning of ARNAQ's plots.
Copies of the example reports generated by the Creating Example Reports vignette are viewable using a proxy service: