A Nextflow pipeline to tabulate synthetic barcode counts from NGS data
- Parameters
- Pipeline summary
- Dependiencies
- Installing the pipeline
- Running the pipeline
- Performance
- Module Descriptions
BARtab integrates with the R package bartools for downstream QC, analysis and visualization of population-level and single-cell level cellular barcoding datasets.
Please see our preprint for application examples.
Please check NEWS.md for changes in BARtab v1.4.
Usage: nextflow run danevass/bartab --indir <input dir>
--outdir <output dir>
--ref <path/to/reference/fasta>
--mode <single-bulk | paired-bulk | single-cell>
Input/output arguments:
--indir Directory containing input *.fastq.gz files.
Must contain R1 and R2 if running in mode paired-bulk or single-cell.
For single-cell mode, directory can contain BAM files.
--input_type Input file type, either fastq or bam, only relevant for single-cell mode [default = fastq]
--ref Path to a reference fasta file for the barcode / sgRNA library.
If null, reference-free workflow will be used for single-bulk and paired-bulk modes.
--mode Workflow to run. <single-bulk, paired-bulk, single-cell>
--outdir Output directory to place output [default = './']
Read merging arguments:
--mergeoverlap Length of overlap required to merge paired-end reads [default = 10]
Filtering arguments:
--minqual Minimum PHRED quality per base [default = 20]
--pctqual Percentage of bases within a read that must meet --minqual [default = 80]
--complexity_threshold Complexity filter [default = 0]
Minimum percentage of bases that are different from their next base (base[i] != base[i+1])
Trimming arguments:
--constants Which constant regions flanking barcode to search for in reads: up, down or both.
"all" runs all 3 modes and combines the results. <up, down, both, all> [default = 'up']
--upconstant Sequence of upstream constant region [default = 'CGATTGACTA'] // SPLINTR 1st gen upstream constant region
--downconstant Sequence of downstream constant region [default = 'TGCTAATGCG'] // SPLINTR 1st gen downstream constant region
--up_coverage Number of bases of the upstream constant that must be covered by the sequence [default = 3]
--down_coverage Number of bases of the downstream constant that must be covered by the sequence [default = 3]
--constantmismatches Proportion of mismatched bases allowed in constant regions [default = 0.1]
--min_readlength Minimum length of barcode sequence [default = 20]
--barcode_length Optional. Length of barcode if it is the same for all barcodes.
If constant regions are trimmed on both ends, reads are filtered for this length.
If either constant region is trimmed, this is the maximum sequence length.
If barcode_length is set, alignments to the middle of a barcode sequence are filtered out.
Mapping arguments:
--alnmismatches Number of allowed mismatches during reference mapping [default = 2]
--barcode_length (see trimming arguments)
--cluster_unmapped Cluster unmapped reads with starcode [default = false]
Reference-free arguments:
--cluster_distance Defines the Levenshtein distance for clustering lineage barcodes [default = 3].
--cluster_ratio Cluster ratio for message passing clustering.
A cluster of barcode sequences can absorb a smaller one only if it is at least x times bigger [default = 3].
Sincle-cell arguments:
--cb_umi_pattern Cell barcode and UMI pattern on read 1, required for fastq input.
N = UMI position, C = cell barcode position [default = CCCCCCCCCCCCCCCCNNNNNNNNNNNN]
--cellnumber Number of cells expected in sample, only required when fastq provided. whitelist_indir and cellnumber are mutually exclusive
--whitelist_indir Directory that contains a cell ID whitelist for each sample <sample_id>_whitelist.tsv
--umi_dist Hamming distance between UMIs to be collapsed during counting [default = 1]
--umi_count_filter Minimum number of UMIs per barcode per cell [default = 1]
--umi_fraction_filter Minimum fraction of UMIs per barcode per cell compared to dominant barcode in cell
(barcode supported by most UMIs) [default = 0.3]
--pipeline To specify if input fastq files were created by SAW pipeline
Resources:
--max_cpus Maximum number of CPUs [default = 6]
--max_memory Maximum memory [default = "14.GB"]
--max_time Maximum time [default = "40.h"]
Optional arguments:
-profile Configuration profile to use. Can use multiple (comma separated)
Available: conda, singularity, docker, slurm, lsf
--email Direct output messages to this address [default = '']
--help Print this help statement.
Author:
Dane Vassiliadis ([email protected])
Henrietta Holze ([email protected])
The pipeline can extract barcode counts from population-level DNA barcode amplicon sequencing, single-cell RNA-sequencing or single-cell barcode amplicon sequencing data.
BARtab can perform reference-free barcode extraction or perform alignment to a reference.
Detailed QC reports of the analysis steps are provided alongside barcode count tables.
The bulk workflow is executed with mode single-bulk
and paired-bulk
for single-end or paired-end reads, respectively.
- Report raw data quality using
fastqc
FASTQC - [Paired-end] Merge paired end reads using
FLASh
MERGE_READS - Quality filter reads using
fastp
FILTER_READS - Trim 5' and/or 3' constant regions using
cutadapt
CUTADAPT_READS - [Reference-based] Align to reference barcode library using
bowtie
BUILD_BOWTIE_INDEX, BOWTIE_ALIGN - [Reference-based optional] Filter alignments to either end of a reference sequence, not the middle FILTER_ALIGNMENTS
- [Reference-based] Count number of reads aligning per barcode using
samtools
SAMTOOLS, GET_BARCODE_COUNTS - [Reference-free optional] Trim barcode sequences to the same length if only one adapter was trimmed TRIM_BARCODE_LENGTH
- [Reference-free] Cluster barcode reads to account for PCR and sequencing errors using
starcode
STARCODE - Merge barcode count files for multiple samples COMBINE_BARCODE_COUNTS
- Report metrics for individual samples MULTIQC
The single-cell workflow can extract barcodes from both single-cell amplicon-sequencing libraries and directly from processed scRNA-seq data with mode single-cell
.
- [fastq] Report raw data quality using
fastqc
FASTQC - [fastq] Extraction of cell barcodes (optional) and UMIs using
umi-tools
UMITOOLS_WHITELIST, UMITOOLS_EXTRACT - [BAM] Filter unmapped reads containing cell barcode and UMI and convert to fastq using
samtools
BAM_TO_FASTQ - Trim 5' and/or 3' constant regions using
cutadapt
CUTADAPT_READS - [Reference-based] Align to reference barcode library using
bowtie
BUILD_BOWTIE_INDEX, BOWTIE_ALIGN - [Reference-based optional] Filter alignments to either end of a reference sequence, not the middle FILTER_ALIGNMENTS
- [Reference-free] Cluster barcodes and error-correct UMIs with
starcode-umi
STARCODE_SC - Remove reads from PCR chimerism REMOVE_PCR_CHIMERISM
- Error correct UMIs and count UMIs per barcode using
umi-tools
UMITOOLS_COUNT - Filter and tabulate barcodes per cell and produce QC plots PARSE_BARCODES_SC
- Report metrics for individual samples MULTIQC
BARtab allows for extraction of barcodes from stereo-seq data that was processed with the SAW pipeline (>=v6.1.0).
Input: one fastq file per sample with unmapped reads, generated by the mapping command of the SAW pipeline with the outUnMappedFq=1
flag.
Barcodes counts are tabulated by spot (coordinate) and can subsequently be binned the same way as the stereo-seq data.
- Filter barcode reads and trim 5' and/or 3' constant regions using
cutadapt
CUTADAPT_READS - Align to reference barcode library using
bowtie
BUILD_BOWTIE_INDEX, BOWTIE_ALIGN - [Optional] Filter alignments for sequences mapping to either end of a barcode FILTER_ALIGNMENTS
- Count barcodes from sam file COUNT_BARCODES_SAM
- Filter and tabulate barcodes per spot and produce QC plots PARSE_BARCODES_SC
- Report metrics for individual samples MULTIQC
See citations
-
Install Nextflow using the instructions found here (and here)
# download the executable curl get.nextflow.io | bash # move the nextflow file to a directory accessible by your $PATH variable sudo mv nextflow /usr/local/bin
-
Try out the pipeline
(this will automatically pull the pipeline, usually into~/.nextflow/assets/
)nextflow run danevass/bartab --help
Alternatively, clone the pipeline into a directory of your choice first with
nextflow clone danevass/bartab target_dir/
-
Install dependencies
Download the Docker image from docker hub.
docker pull henriettaholze/bartab:v1.4 nextflow run danevass/bartab -profile docker [options]
export NXF_SINGULARITY_LIBRARYDIR=MY_SINGULARITY_IMAGES # your singularity storage dir export NXF_SINGULARITY_CACHEDIR=MY_SINGULARITY_CACHE # your singularity cache dir singularity pull --dir $NXF_SINGULARITY_LIBRARYDIR henriettaholze-bartab-v1.4.img docker://henriettaholze/bartab:v1.4 nextflow run danevass/bartab -profile singularity [options]
- Install miniconda using the instructions found here: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html
- It is recommended to use mamba to create the conda environment
conda install -c conda-forge mamba
- Install BARtab dependencies by running
mamba env create -f environment.yaml
(orconda env create -f environment.yaml
) - Run the pipeline with
nextflow run danevass/bartab -profile conda [options]
The location of the conda environment is specified in
conf/conda.config
.
Print the help message with nextflow run danevass/bartab --help
.
To run a specific branch or the pipeline use -r <branch>
.
Run any of the test datasets using nextflow run danevass/bartab -profile <test_SE,test_PE,test_SE_ref_free,test_sc,test_sc_bam,test_sc_saw_fastq>,<conda,docker,singularity>,<slurm,lsf>
To run the pipeline with your own data, create a parameter yaml file and specify the location with -params-file
.
An example to run the single-end bulk workflow:
indir: "test/dat/test_SE"
ref: "test/ref/SPLINTR_mCHERRY_V2_barcode_reference_library.fasta"
mode: "single-bulk"
outdir: "test/test_out/single_end/"
upconstant: "TGACCATGTACGATTGACTA"
downconstant: "TGCTAATGCGTACTGACTAG"
constants: "up"
barcode_length: 60
min_readlength: 20
Use -w
to specify the location of the work directory and -resume
when only parts of the input have changed or only a subset of process has to be re-run.
nextflow run danevass/bartab \
-profile conda \
-params-file path/to/params/file.yaml \
-w "/scratch/work/" \
-resume
The reference must be a fasta file.
For SPLINTR libraries, barcodes are labelled based on their frequency in the library.
Giving barcodes labels is advantagous for plotting and comparing barcodes across samples.
All fastq files must be moved or linked to an input directory. This can be done as follows if files are in different subdirectories:
cd indir/
fastq_files="path/to/data/*/*fastq.gz"
for f in $fastq_files; do ln -s $f .; done
All fastq files can then be symlinked to an input directory. File names must match the pattern <sample_name>_R{1,2}*.{fastq,fq}.gz
.
Parameter input_type
must be set to fastq
.
If scRNA-seq has already been processed with Cell Ranger or STARSolo, cell barcode whitelists can be provieded to skip cell calling.
Set the parameter whitelist_indir
to the location of the folder with all whitelist files.
A list of cell IDs detected in a sample can be found in Cell Ranger results in outs/filtered_feature_bc_matrix/barcodes.tsv.gz
.
For each sample, a matching whitelist must be provided following the syntax <sample_id>_whitelist.tsv
with the same sample_id
as in the filename of the fastq files.
The whitelist files must be in tsv format and extensions like '-1' must be removed.
Example command:
zcat outs/filtered_feature_bc_matrix/barcodes.tsv.gz | sed 's/..$//g' > whitelist_indir/4-5nMT100nMP_S4_whitelist.tsv
If no cell ID whitelist is provided (with --whitelist_indir
),
cell barcodes are identified in R1 using umi-tools whitelist.
Barcodes can be extracted from scRNA-seq data, from Cell Ranger or STARSolo generated BAM files.
Reads containing barcode sequences will be in the unmapped fraction of reads after alignment.
To retain unmapped reads annotated with cell ID and UMI in the output, run STAR with the option --outSAMunmapped Within KeepPairs
.
All BAM files can then be symlinked to an input directory and the parameter input_type
set to bam
.
To symlink files and give them individual names,
cd indir/
ln -s cellranger/results/outs/possorted_genome_bam.bam Sample_Pool-1-231114.bam
When using the reference-based workflow, it is recommended to set the parameter constants
to all
to get maximum recovery of barcodes from randomly fragmented reads.
This parameter setting is not compatible with the reference-free workflow since barcode fragments covering different parts of a barcode cannot be clustered.
If a process fails and the error message is not clear, have a look in the run folder of the process.
Each process has runs in a separate folder in the work directory, e.g. /scratch/work/cf/d61146a435010dfeb92d2aee18947a
.
Check .command.err
and any other log files in the folder for error messages.
We recently did a comparison of BARtab with two other pipelines, pycashier and timemachine.
Code and results are available here: https://github.com/DaneVass/bartools_manuscript_code/tree/main/tools-comparison
Module descriptions contain additional information on third-party tools, parameters and output files.
Software check is always performed as first module.
Output files:
reports/software_check.txt
: Report of all software versions.
QC of fastq files is performed using FastQC.
Output files:
reports/fastqc/<sample_id>.html
: html report (copy)reports/fastqc/zips/<sample_id>.zip
(copy)
If running in mode paired-bulk, forward and reverse reads are merged using FLASh.
The minimum overlap of reads can be specified with the parameter mergeoverlap
(default 10 bases).
Output files:
merged_reads/<sample_id>.extendedFrags.fastq.gz
: merged reads (symlink)merged_reads/unmerged/<sample_id>.notCombined_<1,2>.fastq.gz
: reads that could not be merged (symlink)merged_reads/logs/<sample_id>.flash.log
: log (copy)merged_reads/logs/<sample_id>.hist
: Numeric histogram of merged read lengths. (copy)merged_reads/logs/<sample_id>/<sample_id>.histogram
: Visual histogram of merged read lengths. (symlink)
Reads are quality filtered using fastp.
The minimum quality score to keep can be specified with the parameter minqual
.
The minimum percent of bases that must have minqual
quality can be specified with the parameter pctqual
.
complexity_threshold
(default 0) can be set to remove reads with e.g. poly-A tails due to an empty barcode contstruct.
Complexity is defined as minimum percentage of bases that are different from their next base (base[i] != base[i+1]).
For a WSN barcode pattern, barcodes have a minimum complexity of 66%.
fastp by default removes reads with >5 ambiguous bases (N).
To skip read filtering, set minqual
to 0.
Output files:
filtered_reads/<sample_id>.filtered.fastq.gz
: filtered reads (symlink)filtered_reads/logs/<sample_id>.filter.log
: log (copy)filtered_reads/logs/<sample_id>.filter.fastp.json
: log (copy)filtered_reads/logs/<sample_id>.filter.fastp.html
: log (copy)
Constant regions are trimmed and reads are filtered for length and N bases using cutadapt.
Constants can be specified with the parameters upconstant
and downconstant
.
For each, minimum coverage can be specified with up_coverage
and down_coverage
(default 3).
If this is smaller than the length of the constant region, partial matches at the beginning or end of the sequence are accepted.
This is particularly useful in case of random fragmentation and variable stagger length.
In bulk mode, reads can be filtered for containing either upconstant (up
), downconstant (down
) or both (both
) with the parameter constants
.
If all reads cover the upstream constant but only some cover the downstream constant (e.g. due to variable length barcodes or stagger),
set constants
to both
and down_coverage
to 0
.
When contstants
is set to all
, reads are filtered in all three ways and fastq files of trimmed sequences are concatenated.
This is usefult for random fragmented barcode reads from scRNA-seq data.
Example for trimming options:
upconstant="ATGGAATTG"
downconstant="CGGAACCGA"
up_coverage=6
down_coverage=6
>seq1
ATGGAATTGACATCACGCTCAAGGATCCGGAACCGA
>seq2
GAATTGACATCACGCTCAAGGATCCGGAAC
>seq3
ATGGAATTGACATCACGCTCAAGGATC
>seq4
GAATTGACATCACGCTCAAGGATC
>seq5
ACATCACGCTCAAGGATCCGGAACCGA
>seq6
ACATCACGCTCAAGGATCCGGAAC
>seq7
ACATCACGCTCAAGGATCCGGA
>seq8
ACATCACGCTCCGGAACAAGGATC
>seq9
ACATCACGCTCAAGGATC
Option both
will trim sequence 1 and 2, up
will trim sequence 3 and 4, down
will trim sequence 5 and 6, all
will trim sequence 1-6.
The minimum read length can be specified with min_readlength
(default 20).
If a constant barcode length is set with barcode_length
, this is set as maximum sequence length.
For both
, only sequences matching exactly barcode_length
will be retained.
The fraction of mismatches in the constant region can be specified with constantmismatches
(default 0.1).
Output files:
trimmed_reads/<sample_id>.trimmed.fastq
: filtered and trimmed reads (symlink)trimmed_reads/logs/<sample_id>.cutadapt.log
: log (copy)
If a reference is provided, it is indexed using bowtie1.
Output files:
reference_index/
NAME.1.bt2
,NAME.2.bt2
,NAME.3.bt2
,NAME.4.bt2
,NAME.rev.1.bt2
, andNAME.rev.2.bt2
(symlink)
If a reference is provided, trimmed and filtered reads are aligned to the indexed reference using bowtie1.
--norc
is specified, bowtie will not attempt to align against the reverse-complement reference strand.
Only non-ambiguous alignments are reported with the flags -a --best --strata -m1
.
This is necessary when aligning short barcode fragments from scRNA-seq.
Note: This can results in not detection of specific barcodes if only a part of the barcode is sequenced.
If for example only the first 40 bases of a barcode are sequenced and there are non-unique barcodes in the reference based on the first 40 bases, no read will unambiguously match to these barcodes.
Sequences that map with the same number of mismatches to multiple barcodes will be discarded.
The number of allowed mismatches can be specified with the parameter alnmismatches
(default 2).
Output files:
mapped_reads/<sample_id>.mapped.sam
: Aligned reads (symlink)mapped_reads/unmapped/<sample_id>.unmapped.fastq.gz
: Unaligned reads (optional) (symlink)mapped_reads/logs/<sample_id>.bowtie.log
: log (copy)
If the barcodes have a consistent length specified with barcode_length
, alignments to the middle of a barcode sequence are filtered out.
Alignments that start at the first position or end at the last are retained.
This ensures confidence in barcodes detected with short mapping sequences (min_readlength
), or barcode fragments from scRNA-seq.
Output files:
mapped_reads/<sample_id>.mapped_filtered.sam
: Aligned and filtered reads (symlink)
Before clustering barcode reads with starcode, they are trimmed to min_readlength
if constants
is set to up
or down
.
Reasoning:
If the reads do not cover the whole barcode and a stagger is used, barcode reads will have different length.
Since starcode only collapses sequence clusters of unequal sizes (--cluster_ratio
), this would result in one cluster per stagger length for each barcode.
Trimming is either done on 3' or 5' end, depending on which constant was trimmed.
No trimming is performed if constants
is both
. all
is not compatible with clustering barcode reads.
MultiQC creates a report of metrics for fastqc, flash, fastp, cutadapt and bowtie for all samples.
Output files:
reports/multiqc/multiqc_report.html
: report for all samples (copy)
The SAM file of aligned barcode reads is sorted, indexed and compressed using samtools.
Barcode counts for each sample are compiled with samtools indexstats
.
Output files:
counts/<sample_id>_rawcounts.txt
: tsv containing barcode and count (copy)
If no reference is provided, the consensus barcode repertoire is derived using starcode.
Starcode clusters the filtered and trimmed barcode sequences based on their Levenshtein distance (substitutions, insertions, deletions).
The default Levenshtein distance for clustering is 3 which is conservative to avoid collapsing truly distingt barcodes.
The parameter cluster_distance
can be adjusted for the expected number of sequencing errors given the length of the barcode construct.
Examplary recommended cluster distances: ClonMapper (Gutierrez et al. 2021) (20bp barcode): 1 and FateMap (Goyal et al. 2023) (100bp barcode): 8.
The default cluster_ratio
to use for message passing clustering is 3.
This means that a cluster can absorb a smaller one only if it is at least 3 times bigger.
When barcodes are aligned to a reference, cluster_unmapped
can be set to true
to cluster unmapped barcode reads.
When using this option for barcodes from direct scRNA-seq data, barcode reads have different lengths due to random fragmentation.
Clustering unmapped reads may therefore not be as informative.
From the starcode manual:
The clustering method is Message Passing. This means that clusters are built bottom-up by merging small clusters into bigger ones. The process is recursive, so sequences in a cluster may not be neighbors, i.e., they may not be within the specified Levenshtein distance.
Output files:
starcode/<sample_id>[_unmapped]_starcode.tsv
: barcode counts with sequence of centroid of each barcode cluster and read count (copy)starcode/logs/<sample_id>[_unmapped]_starcode.log
: log (copy)
When starcode is run on reads that did not align to the barcode reference library, results are written to counts/unmapped/
and counts/logs/
instead of starcode/
.
Barcode counts of all samples are combined into one table with an outer join.
Output files:
counts/all_counts_combined.tsv
: table of barcodes and counts for each sample (copy)
If no cell ID whitelist is provided (with --whitelist_indir
), Cell barcodes are identified in R1 using umi-tools whitelist.
A whitelist of cell IDs can be found in Cell Ranger results in outs/filtered_feature_bc_matrix/<sample_id>-barcodes.tsv
but extensions like '-1' must be removed.
The expected number of cells should be specified with the parameter cellnumber
.
This should be approximately the number of cells loaded. The command is only utilized to extract cell barcodes and not to perform cell calling.
Cell calling should be done with tools such as Cell Ranger.
Barcodes identified in droplets that do not contain cells or doublets will be removed when merging the barcode counts table with e.g. QC'd Seurat object.
If the number of cells loaded differs a lot between samples, they must be processed separately with adjusted cellnumber
values.
Output files:
umitools/whitelist/<sample_id>_whitelist.tsv
: whitelisted cell barcodes and counts (symlink)umitools/whitelist/logs/<sample_id>_whitelist.log
: log (copy)
Reads that contain cell barcode and UMI are extracted using umi-tools extract.
Output files:
umitools/extract/<sample_id>_R2_extracted.fastq
: reads that contain cell barcode and UMI, both added to the read name (symlink)umitools/extract/logs/<sample_id>_exctract.log
: log (copy)
Reads are filtered with samtools for flag 4 and flags CB and UB to obtain unmapped reads that contain a cell barcode and UMI. At a later step (for efficiency), cell ID and UMI are added to the read headers with the module RENAME_READS_BAM (or RENAME_READS_SAW).
Output files:
fastq/<sample_id>_R2.fastq.gz
: reads containing cell barcode and UMI (symlink)
If no reference is available, starcode-umi
is used to cluster barcodes from scRNA-seq data.
We utilized the python script starcode-umi
, from starcode/starcode-umi, written to cluster UMI-tagged sequences.
Starcode-umi performs clustering and merging of barcode sequences to find the best possible clusters of UMI and sequence pairs.
Parameters cluster_distance
and cluster_ratio
can be adjusted for error correction of barcode sequences (see STARCODE).
See also section Trimming barcode length before clustering
This module also clusters unmapped reads if cluster_unmapped
is set to true
.
Output files:
starcode/[unmapped/]<sample_id>[_unmapped]_starcode.tsv
: the output of starcode with clustered and corrected cellbarcode-umi-barcode sequence and sequence count (symlink)starcode/logs/<sample_id>[_unmapped]_starcode.log
: starcode log (copy)
We assume that PCR chimerism happen mainly between the CB-UMI and the lineage barcode sequence.
Reads from PCR chimerism are removed by only retaining the cell barcode-umi-lineage barcode combination supported by the most reads.
This substantially decreases the amount of barcodes detected per cell.
Output files:
counts/[unmapped/]pcr_chimerism/<sample_id>[_unmapped][_starcode]_barcodes.tsv
: cell barcode-umi-lineage barcode combination, formatted as input to umi-tools count_tab (symlink)counts/logs/<sample_id>[_unmapped][_starcode]_pcr_chimerism.log
: log (copy)
Trimmed, filtered and aligned barcodes are counted using umi-tools count_tab.
The Hamming distance between UMIs to be collapsed within cells during counting can be specified with parameter umi_dist
(default and recommended 1).
Output files:
counts/<sample_id>.counts.tsv
: barcode UMI counts with columns barcode, cell barcode and deduplicated UMI count (copy)counts/logs/<sample_id>_counts.log
: log (copy)
Alternative to umitools_count for stereo-seq data processed with the SAW pipeline. Trimmed, filtered and aligned barcodes are counted from the SAM file (no pcr chimera removal or UMI error correction).
Output files:
counts/<sample_id>.counts.tsv
: barcode UMI counts with columns barcode, cell barcode and deduplicated UMI count (copy)
Since multiple barcodes can be detected in a cell, the counts table needs to be aggregated. This allows the results to be merged into the metadata of a single-cell object such as a Seurat or AnnData object.
Barcodes can be filtered based on the number of supporting UMIs (--umi_count_filter
) and by the number of UMIs in comparison to the dominant barcode per cell (--umi_fraction_filter
).
E.g. if barcode a has 5 supporting UMIs in a cell and a second barcode with 2 supporting UMIs and umi_fraction_filter
set to 0.3, 5 / 2 = 0.25 < 0.3
, so the second barcode will be discarded.
Barcodes and UMIs are semicolon-separated if multiple barcodes were detected per cell.
Output files:
counts/<sample_id>_cell_barcode_annotation.tsv
: aggregated barcode counts per cell with cell barcode as row index and barcode and UMI count as columns (copy)counts/qc/<sample_id>_barcodes_per_cell.pdf
: QC plot, number of detected barcode per cell (copy)counts/qc/<sample_id>_UMIs_per_bc.pdf
: QC plot, UMIs supporting the most frequent barcode per cell (copy)counts/qc/<sample_id>_avg_sequence_length.pdf
: QC plot, average mapped sequence length per barcode (only with reference)counts/qc/<sample_id>_barcodes_per_cell_filtered.pdf
: QC plot, number of detected barcode per cell (copy)counts/qc/<sample_id>_UMIs_per_bc_filtered.pdf
: QC plot, UMIs supporting the most frequent barcode per cell (copy)counts/qc/<sample_id>_avg_sequence_length_filtered.pdf
: QC plot, average mapped sequence length per barcode (only with reference) (copy)counts/qc/<sample_id>_avg_sequence_length.tsv
: average mapped sequence length per barcode as table (only with reference) (copy)
Essential output files are marked with **.
single_end/
├── counts
│ ├── **all_counts_combined.tsv**
│ ├── **test_SE_01_rawcounts.txt**
│ ├── **test_SE_02_rawcounts.txt**
│ └── **test_SE_03_rawcounts.txt**
├── filtered_reads
│ ├── logs
│ │ ├── test_SE_01.filter.fastp.html
│ │ ├── test_SE_01.filter.fastp.json
│ │ ├── test_SE_01.filter.log
│ │ ├── test_SE_02.filter.fastp.html
│ │ ├── test_SE_02.filter.fastp.json
│ │ ├── test_SE_02.filter.log
│ │ ├── test_SE_03.filter.fastp.html
│ │ ├── test_SE_03.filter.fastp.json
│ │ └── test_SE_03.filter.log
│ ├── test_SE_01.filtered.fastq.gz -> /scratch/users/hholze/BARtab/work/4b/5aa24ebed0ac626e6f9a3e01a8cda2/test_SE_01.filtered.fastq.gz
│ ├── test_SE_02.filtered.fastq.gz -> /scratch/users/hholze/BARtab/work/6b/2c6d8dbfd414d1f2cb07d636dd54ec/test_SE_02.filtered.fastq.gz
│ └── test_SE_03.filtered.fastq.gz -> /scratch/users/hholze/BARtab/work/ee/8aa047ee412dfc7d3efb5e40524770/test_SE_03.filtered.fastq.gz
├── mapped_reads
│ ├── logs
│ │ ├── test_SE_01.bowtie.log
│ │ ├── test_SE_02.bowtie.log
│ │ └── test_SE_03.bowtie.log
│ ├── test_SE_01.mapped_filtered.sam -> /scratch/users/hholze/BARtab/work/87/f338156278257ac88ff276253dcd6e/test_SE_01.mapped_filtered.sam
│ ├── test_SE_01.mapped.sam -> /scratch/users/hholze/BARtab/work/89/d74f3656808410912312463972204e/test_SE_01.mapped.sam
│ ├── test_SE_02.mapped_filtered.sam -> /scratch/users/hholze/BARtab/work/e2/b1d81c307389fd6a2ae690792520b5/test_SE_02.mapped_filtered.sam
│ ├── test_SE_02.mapped.sam -> /scratch/users/hholze/BARtab/work/69/ded496d4631e6d2952e585b81f84bc/test_SE_02.mapped.sam
│ ├── test_SE_03.mapped_filtered.sam -> /scratch/users/hholze/BARtab/work/45/c0bfc53016017909fd7bc8c6dee2dd/test_SE_03.mapped_filtered.sam
│ ├── test_SE_03.mapped.sam -> /scratch/users/hholze/BARtab/work/19/84b5109b37a0e5453fc046f38b968b/test_SE_03.mapped.sam
│ └── unmapped
│ ├── test_SE_01.unmapped.fastq.gz -> /scratch/users/hholze/BARtab/work/89/d74f3656808410912312463972204e/test_SE_01.unmapped.fastq.gz
│ ├── test_SE_02.unmapped.fastq.gz -> /scratch/users/hholze/BARtab/work/69/ded496d4631e6d2952e585b81f84bc/test_SE_02.unmapped.fastq.gz
│ └── test_SE_03.unmapped.fastq.gz -> /scratch/users/hholze/BARtab/work/19/84b5109b37a0e5453fc046f38b968b/test_SE_03.unmapped.fastq.gz
├── pipeline_info
│ ├── execution_report_2024-02-13_10-54-54.html
│ ├── execution_timeline_2024-02-13_10-54-54.html
│ ├── execution_trace_2024-02-13_10-54-54.txt
│ ├── pipeline_dag_2024-02-13_10-54-54.html
│ └── software_check.txt
├── reference_index
│ ├── test.ref.fasta.1.ebwt -> /scratch/users/hholze/BARtab/work/2b/cf68381a69d3f08ffef825f5c81e76/test.ref.fasta.1.ebwt
│ ├── test.ref.fasta.2.ebwt -> /scratch/users/hholze/BARtab/work/2b/cf68381a69d3f08ffef825f5c81e76/test.ref.fasta.2.ebwt
│ ├── test.ref.fasta.3.ebwt -> /scratch/users/hholze/BARtab/work/2b/cf68381a69d3f08ffef825f5c81e76/test.ref.fasta.3.ebwt
│ ├── test.ref.fasta.4.ebwt -> /scratch/users/hholze/BARtab/work/2b/cf68381a69d3f08ffef825f5c81e76/test.ref.fasta.4.ebwt
│ ├── test.ref.fasta.rev.1.ebwt -> /scratch/users/hholze/BARtab/work/2b/cf68381a69d3f08ffef825f5c81e76/test.ref.fasta.rev.1.ebwt
│ └── test.ref.fasta.rev.2.ebwt -> /scratch/users/hholze/BARtab/work/2b/cf68381a69d3f08ffef825f5c81e76/test.ref.fasta.rev.2.ebwt
├── reports
│ ├── fastqc
│ │ ├── test_SE_01_fastqc.html
│ │ ├── test_SE_02_fastqc.html
│ │ ├── test_SE_03_fastqc.html
│ │ └── zips
│ │ ├── test_SE_01_fastqc.zip
│ │ ├── test_SE_02_fastqc.zip
│ │ └── test_SE_03_fastqc.zip
│ └── multiqc
│ ├── multiqc_data
│ │ ├── multiqc_bowtie1.txt
│ │ ├── multiqc_citations.txt
│ │ ├── multiqc_cutadapt_cutadapt_up.txt
│ │ ├── multiqc_data.json
│ │ ├── multiqc_fastp_fastp_filter.txt
│ │ ├── multiqc_fastqc.txt
│ │ ├── multiqc_general_stats.txt
│ │ ├── multiqc.log
│ │ └── multiqc_sources.txt
│ └── **multiqc_report.html**
└── trimmed_reads
├── logs
│ ├── test_SE_01.cutadapt_up.log
│ ├── test_SE_02.cutadapt_up.log
│ └── test_SE_03.cutadapt_up.log
├── test_SE_01.trimmed.fastq.gz -> /scratch/users/hholze/BARtab/work/b8/d4049037e44cb9ef2e8123c4cf8bca/test_SE_01.trimmed.fastq.gz
├── test_SE_02.trimmed.fastq.gz -> /scratch/users/hholze/BARtab/work/af/fe172f0a69673bc61a0791fbffc6c3/test_SE_02.trimmed.fastq.gz
└── test_SE_03.trimmed.fastq.gz -> /scratch/users/hholze/BARtab/work/5b/5180a8298fd66d703cc75675fa14e2/test_SE_03.trimmed.fastq.gz
sc
├── counts
│ ├── logs
│ │ ├── test_sc_count.log
│ │ └── test_sc_pcr_chimerism.log
│ ├── pcr_chimerism
│ │ └── test_sc_mapped_barcodes.tsv -> /scratch/users/hholze/BARtab/work/91/80eb18d752e203a36e56c802287f1e/test_sc_mapped_barcodes.tsv
│ ├── qc
│ │ ├── **test_sc_avg_sequence_length_filtered.pdf**
│ │ ├── **test_sc_avg_sequence_length.pdf**
│ │ ├── test_sc_avg_sequence_length.tsv
│ │ ├── **test_sc_barcodes_per_cell_filtered.pdf**
│ │ ├── **test_sc_barcodes_per_cell.pdf**
│ │ ├── **test_sc_UMIs_per_bc_filtered.pdf**
│ │ └── **test_sc_UMIs_per_bc.pdf**
│ ├── **test_sc_cell_barcode_annotation.tsv**
│ └── **test_sc.counts.tsv**
├── filtered_reads
│ ├── logs
│ │ ├── test_sc.filter.fastp.html
│ │ ├── test_sc.filter.fastp.json
│ │ └── test_sc.filter.log
│ └── test_sc.filtered.fastq.gz -> /scratch/users/hholze/BARtab/work/3d/5eb0602672d14f41394bb45beb8453/test_sc.filtered.fastq.gz
├── mapped_reads
│ ├── logs
│ │ └── test_sc.bowtie.log
│ ├── test_sc.mapped_filtered.sam -> /scratch/users/hholze/BARtab/work/74/ba40770c2814b5b8ca350f68b238a3/test_sc.mapped_filtered.sam
│ ├── test_sc.mapped.sam -> /scratch/users/hholze/BARtab/work/85/342afa175dac82cc4c8c034659ec2a/test_sc.mapped.sam
│ └── unmapped
│ └── test_sc.unmapped.fastq.gz -> /scratch/users/hholze/BARtab/work/85/342afa175dac82cc4c8c034659ec2a/test_sc.unmapped.fastq.gz
├── pipeline_info
│ ├── execution_report_2024-02-13_10-41-07.html
│ ├── execution_timeline_2024-02-13_10-41-07.html
│ ├── execution_trace_2024-02-13_10-41-07.txt
│ ├── pipeline_dag_2024-02-13_10-41-07.html
│ └── software_check.txt
├── reference_index
│ ├── test.ref.fasta.1.ebwt
│ ├── test.ref.fasta.2.ebwt
│ ├── test.ref.fasta.3.ebwt
│ ├── test.ref.fasta.4.ebwt
│ ├── test.ref.fasta.rev.1.ebwt
│ └── test.ref.fasta.rev.2.ebwt
├── reports
│ ├── fastqc
│ │ ├── test_sc_R1_fastqc.html
│ │ ├── test_sc_R2_fastqc.html
│ │ └── zips
│ │ ├── test_sc_R1_fastqc.zip
│ │ └── test_sc_R2_fastqc.zip
│ └── multiqc
│ ├── multiqc_data
│ │ ├── multiqc_bowtie1.txt
│ │ ├── multiqc_citations.txt
│ │ ├── multiqc_cutadapt_cutadapt_up.txt
│ │ ├── multiqc_data.json
│ │ ├── multiqc_fastp_fastp_filter.txt
│ │ ├── multiqc_fastqc.txt
│ │ ├── multiqc_general_stats.txt
│ │ ├── multiqc.log
│ │ └── multiqc_sources.txt
│ └── **multiqc_report.html**
├── trimmed_reads
│ ├── logs
│ │ └── test_sc.cutadapt_up.log
│ └── test_sc.trimmed.fastq.gz -> /scratch/users/hholze/BARtab/work/bc/0dc0335a276e124249000dfeeaaeb6/test_sc.trimmed.fastq.gz
└── umitools
├── extract
│ ├── logs
│ │ └── test_sc_extract.log
│ └── test_sc_R2_extracted.fastq -> /scratch/users/hholze/BARtab/work/8a/1f38d4dcb11843353bcf43b4479a74/test_sc_R2_extracted.fastq
└── whitelist
├── logs
│ └── test_sc_whitelist.log
└── test_sc_whitelist.tsv -> /scratch/users/hholze/BARtab/work/40/89a05ef10cda4349ab4c71dcc74e50/test_sc_whitelist.tsv