- Cloning
- Environment
- Snakemake usage of this repo
- Perform the processing of a single sample
- Guppy Basecall
- TailfindR
- Flair
- Albacore
- EpiNano
Make sure to clone the submodules, which comprise dependencies that cannot be managed by conda (or pip).
git clone --recursive https://github.com/sunsetyerin/nanopore_dRNA_m6A_pA_tail_isoform.git
or
git clone https://github.com/sunsetyerin/nanopore_dRNA_m6A_pA_tail_isoform.git
cd nanopore_dRNA_m6A_pA_tail_isoform
git submodule update --init --recursive
This repository has been developed using Python 3.7.3 through 3.7.6 in a miniconda environment.
Though conda is not required, it is highly recommended in order to manage the dependencies.
Users can choose to manage their dependencies manually and should consult the envs/environment.yaml file as reference.
Using snakemake to manage the environment
From a bash terminal:
snakemake --use-conda --conda-create-envs-only --cores [cores available]
Using conda to install the environment manually.
conda env create --file envs/environment.yaml
conda activate direct_rna
snakemake [args]
All scripts, steps and jobs are managed using snakemake.
Input files, workflow parameters and output paths are provided through config files (configs/*.yaml).
The workflow is designed to compare direct-RNA samples.
# To display the jobs and commands including subworkflows
snakemake -np
# To visualize the rule graph (a directed acyclic graph)
snakemake --rulegraph | dot | display -
# To launch the jobs
snakemake --use-conda
An example of a snakemake profile for the "numbers" cluster is available at configs/smk_profiles/numbers/config.yaml, and a second one for "slurmgpu" will be available eventually.
So far, the processing of a sample includes basecalling, polyA tail estimation, gene/isoform expression, and matching to Illumina gene expression (if available in the Flair configs; see the config file).
Config files contain the values of the variables needed to perform the analyses.
configs/sample_config.yaml is under development and is provided as an example. It is meant to be used to apply the analyses to the first data available from the PromethION platform.
Most of the data directories hosting results will be created as needed by snakemake, although it is possible to create the folders in advance.
Any of these folders can be replaced by a symlink in order to redirect voluminous files to appropriate spaces. The entire data/ folder can be written to a project space with a simple:
ln -s /PATH/TO/SCRATCH/data data
It is recommended to use symlinks to redirect the data/ folder to a scratch space, since the outputs can get particularly large.
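For workflows that prepare the workspace from a script rather than by hand, the same redirection can be sketched in Python. This is only an illustration of the ln -s command above: the function name redirect_data_dir is hypothetical and not part of this repository, and the scratch path is whatever location is appropriate on your system.

```python
import os
import tempfile
from pathlib import Path


def redirect_data_dir(scratch_dir, link_path):
    """Create `link_path` as a symlink pointing at `scratch_dir`.

    Hypothetical helper (not part of this repository). It refuses to
    overwrite an existing folder or a symlink that points elsewhere.
    """
    scratch = Path(scratch_dir)
    link = Path(link_path)
    # Make sure the target on the scratch space exists.
    scratch.mkdir(parents=True, exist_ok=True)
    if link.is_symlink():
        # Already redirected: accept it only if it resolves to the same target.
        if link.resolve() == scratch.resolve():
            return link
        raise FileExistsError(f"{link} already points elsewhere")
    if link.exists():
        raise FileExistsError(f"{link} exists and is not a symlink")
    # Use an absolute target so the link does not depend on the working directory.
    link.symlink_to(scratch.resolve(), target_is_directory=True)
    return link
```

Calling redirect_data_dir("/PATH/TO/SCRATCH/data", "data") from the repository root is intended to have the same effect as the shell command, with the added safeguard that an existing data/ folder is never replaced silently.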
Oxford Nanopore Technologies devices use the MinKNOW software to basecall reads in real time, at the expense of producing the event table (formerly produced by Albacore). Albacore is discontinued and Guppy is currently favored. Guppy can produce a move table during basecalling that is similar to the deprecated event table (make sure to use --fast5_out).
TailfindR requires one of those tables.
Download available here.
Current configs are using an installation available to dlhost01 & dlhost03.
/opt/ont-guppy_3.2.2/bin/guppy_basecaller
TailfindR is designed to be supplied with a single directory containing all the .fast5 files. It supports parallelization by specifying the number of cores to attribute to the TailfindR job. However, it was designed and tested with MinION output files, and I noticed that it doesn't scale well to a full PromethION flowcell. The first large run of TailfindR on gphost04, using the sequencing run of the COLO829 cell line with 1563 .fast5 files of ~4000 reads each (~6.25M reads), took 1.5 weeks to process all files and then failed to produce the compilation of the results after another 2 weeks.
The chosen workaround is to generate temporary directories, each containing a symlink to a single (or a few) .fast5 file(s) (see rule tailfindr_butler). A TailfindR job is then generated for each temporary directory. These numerous TailfindR jobs should be submitted to a cluster such as numbers through snakemake.
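The splitting step can be sketched as follows. This is a minimal illustration of the idea, not the actual tailfindr_butler rule: the function name make_batch_dirs, the batch_NNNNN directory naming and the files_per_dir parameter are all assumptions made for the example.

```python
import os
import tempfile
from pathlib import Path


def make_batch_dirs(fast5_dir, work_dir, files_per_dir=1):
    """Create one sub-directory per batch of .fast5 files, filled with symlinks.

    Hypothetical helper (not the repository's tailfindr_butler rule).
    Returns the created directories, one per intended TailfindR job.
    """
    if files_per_dir < 1:
        raise ValueError("files_per_dir must be at least 1")
    # Sort for a reproducible assignment of files to batches.
    fast5_files = sorted(Path(fast5_dir).glob("*.fast5"))
    work = Path(work_dir)
    batch_dirs = []
    for start in range(0, len(fast5_files), files_per_dir):
        batch = fast5_files[start:start + files_per_dir]
        batch_dir = work / f"batch_{start // files_per_dir:05d}"
        batch_dir.mkdir(parents=True, exist_ok=True)
        for src in batch:
            link = batch_dir / src.name
            # Link rather than copy, so no sequencing data is duplicated.
            if not link.is_symlink():
                link.symlink_to(src.resolve())
        batch_dirs.append(batch_dir)
    return batch_dirs
```

Each returned directory can then be handed to its own TailfindR job. With files_per_dir=1, a run of 1563 .fast5 files would yield 1563 small, independent jobs instead of one job that must process and compile everything at once.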
Flair is a workflow in itself and requires its own environment to run. Both the scripts used for direct_rna analysis and the environment .yaml files are provided in the flair submodule.
Unlike TailfindR, the input is a single .fast5, which prompts the need to concatenate the multiple 4000-read .fast5 files produced by Guppy.
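A concatenation step of this kind can be sketched as below. The sketch makes an assumption that should be stated loudly: it joins files byte for byte, which is only valid for plain-text read files such as the FASTQ batches Guppy writes, and not for HDF5-based .fast5 containers, which cannot be merged this way. The function name concatenate_files is hypothetical and not part of this repository or of Flair.

```python
import shutil
import tempfile
from pathlib import Path


def concatenate_files(input_paths, output_path):
    """Append each input file, in the order given, to a single output file.

    Hypothetical helper. Assumes plain-text inputs (e.g. FASTQ) whose
    last record ends with a newline; not suitable for HDF5 .fast5 files.
    """
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as src:
                # Stream in chunks so large files are never loaded in memory.
                shutil.copyfileobj(src, out)
    return Path(output_path)
```

Passing a sorted list of the per-batch files keeps the order of reads in the merged file reproducible from one run to the next.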
Albacore is a deprecated basecaller that was discontinued by Oxford Nanopore Technologies in favor of the currently maintained Guppy. It is included in this workflow as a requirement for the EpiNano m6A caller, which relies on the basecalling errors of Albacore to assign methylated sites.
EpiNano states that it requires Albacore v2.1.7 to provide accurate predictions. This version is unavailable. The community only has access to Albacore v2.3.4.
link to Nanoporetech community
NOTE: Albacore is used along with the multiread .fast5 wrapper available via the seamlessf5 package.
The current version of EpiNano allows calling m6A sites at the genomic position level. Calling at the read level is under development.