Combinatorial or split-pool barcoding appends a unique set of barcodes to single cells or nuclei during multiple "rounds". These reactions are confined to each individual fixed and permeabilized cell, bypassing the need for physical barriers between cells such as microwells or nanodroplets.
This workflow is designed to pre-process mouse data from the Parse Biosciences Evercode whole-transcriptome (“WT”) mini, 100k, and 1M barcoding kits. Versions 1, 2, and 3 of the barcoding chemistry are all supported. As of October 2024, the genome and annotation are fixed at mm39 and vM32. This split-pool method may differ from others such as SHARE-seq due to its use of both oligo-dT and random hexamer primers during the first round of reverse transcription-based barcoding. The goal in using both primers is to span the entire transcript for more full-length coverage and to avoid polyA-biased reads. After all 3 rounds of barcoding, aliquots of cells are lysed and libraries are prepared from barcoded cDNA. At the end of short-read library prep, a final Illumina barcode is added to identify these lysed aliquots, or "subpools", and libraries are sequenced and demultiplexed as subpools containing a mix of all the samples on the plate.
The main deviations from the official company pipeline are the mapping/quantification software and custom handling of sample merging to produce Python anndata objects grouped by tissue for downstream processing with scanpy. This workflow is based on the kb-python (kallisto/bustools) suite, the official snRNA-seq tools for IGVF, and also includes optional code for genetic demultiplexing of individuals from sample barcoding wells loaded with 2 individuals with distinct genotypes.
This workflow does the following:
- Pseudoalignment with kb count: Associate reads with their cells/nuclei of origin, collapse duplicated reads according to unique molecular identifiers (UMIs), collapse reads from oligo-dT and random hexamer primers in the same well (`counts_unfiltered_modified`), and generate subpool-level cell x gene matrices.
- Genetic demultiplexing with klue: Optionally, quantify reads associated with distinct genotypes for each cell and merge with the cellbender adata. Assign the genotype with the higher count between the 2 expected genotypes, `tie` otherwise.
- Background removal with cellbender: Remove ambient RNA / PCR chimeras from the data and filter cells/nuclei.
- Add metadata and doublet scores with scanpy: Merge sample metadata with cellbender output by sample barcode-to-well mapping to create a cellbender-filtered adata complete with sample and subpool information. Perform doublet scoring within scanpy using scrublet.
- Merge samples across subpools by tissue for downstream analysis (see the sketch after this list).
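To make the last two steps concrete, here is a minimal sketch of doublet scoring and tissue-level merging, assuming one cellbender-filtered `.h5ad` per subpool and a `Tissue` column in `.obs`; the file names are hypothetical placeholders, not the pipeline's actual outputs:

```python
# Minimal sketch (not the pipeline's exact code): score doublets per subpool,
# then concatenate subpools and split by tissue for downstream analysis.
import scanpy as sc
import anndata as ad

# Hypothetical per-subpool cellbender-filtered AnnData files
adatas = [sc.read_h5ad(f) for f in ["subpool_1.h5ad", "subpool_2.h5ad"]]

for adata in adatas:
    sc.pp.scrublet(adata)  # adds doublet_score / predicted_doublet to .obs

merged = ad.concat(adatas, join="outer", label="subpool")
for tissue in merged.obs["Tissue"].unique():
    merged[merged.obs["Tissue"] == tissue].copy().write_h5ad(f"{tissue}.h5ad")
```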
Subpool FASTQ Files:
- Subpool_1_R1.fastq.gz (contains cDNA)
- Subpool_1_R2.fastq.gz (contains the cell barcode and UMI)
Fastq files split across multiple lanes are fine, e.g.
- Subpool_2_S1_L001_R1_001.fastq.gz
- Subpool_2_S1_L001_R2_001.fastq.gz
- Subpool_2_S1_L003_R1_001.fastq.gz
- Subpool_2_S1_L003_R2_001.fastq.gz etc.
The read name is not used directly in the pipeline, so it can be formatted however you like; just specify the relevant information in the fastq config file.
Note: Written for and tested on UCI's HPC. Follow these steps if you are using this workflow for the first time.
Clone this branch.

```
git clone https://github.com/mortazavilab/parse_pipeline.git
```
- Remember your login node or choose your favorite out of i15, i16, i17. You can switch login nodes via ssh, e.g. `ssh login-i15`.
- Start a tmux session via `tmux new -s mysession`, or whatever you want to name it. When you need to reconnect, type `tmux a -t mysession` (tmux cheatsheet) from the same login node as when you started the session. This way, if your internet goes out or you have to close your laptop, ongoing processes won't be terminated.
- Start an interactive session: `srun -A SEYEDAM_LAB --cpus-per-task=1 --mem 32G --pty bash -i`. This is so you don't clog up the login node. We shouldn't need many resources since we are just installing packages and running snakemake, which will launch more computationally-intensive jobs for you.
```
conda install -n base -c conda-forge mamba
mamba create -n snakemake -c conda-forge -c bioconda snakemake==7.32 python==3.9 scanpy==1.10.2 pandas==2.2.2 anndata==0.10.9 scikit-image==0.24.0
pip install --no-cache-dir -U git+https://github.com/broadinstitute/CellBender.git@04c2f5b460721fd55cf62a4cd23617b2555d69b8
```

The pip command installs the latest CellBender version (3.2.0). (Yes, unfortunately you need to install CellBender in this environment as well in order to use `anndata_from_h5` from `cellbender.remove_background.downstream` when making the adata from CellBender output.)
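For reference, that import is used roughly like this (a minimal sketch; the `.h5` path is a hypothetical placeholder):

```python
# Minimal sketch: load CellBender output into an AnnData object.
from cellbender.remove_background.downstream import anndata_from_h5

adata = anndata_from_h5("subpool_1_cellbender_filtered.h5")  # hypothetical path
print(adata)
```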
Tried multiple times to get CellBender working in the main snakemake environment with python 3.9 but to no avail; it's a known version issue regarding saving the checkpoint file. The checkpoint is very convenient in case the run fails due to lack of memory or job timeout, so I think the up-front effort in creating a second conda environment is worth it. Following the installation instructions for CellBender:
```
mamba create -n cellbender python=3.7
mamba activate cellbender
mamba install -c anaconda pytables
pip install torch==1.13.1
pip install --no-cache-dir -U git+https://github.com/broadinstitute/CellBender.git@04c2f5b460721fd55cf62a4cd23617b2555d69b8
```

Again, the pip command installs the latest CellBender version (3.2.0).
Klue installation instructions:
```
git clone https://github.com/Yenaled/klue
cd klue
mkdir build
cd build
cmake ..
make
make install
```
If step 7 (`make install`) does not work, follow these steps:
- `cd src` (within the build folder), then `pwd` and copy the path; for example, mine is `/dfs5/bio/erebboah/klue/build/src`.
- Then edit your `~/.bashrc` by adding this line to the bottom of the file: `export PATH="<your path>:$PATH"`. For example, I added `export PATH="/dfs5/bio/erebboah/klue/build/src:$PATH"`.
- `source ~/.bashrc`
- Test the installation by typing `klue` in the terminal; you should see version and usage information.
- Make a fastq config file with columns that match the many examples in the configs folder, e.g. igvf_004_config.tsv, and save it in the configs folder. Required columns are fastq, fastq_r2, subpool, plate, lane, run, droplets_included, learning_rate, and expected_cells. The last 3 columns set the CellBender parameters of the same name. We recommend entering the number of cells/nuclei expected in the subpool for expected_cells (for example, 67,000 for 1M kits and 13,000 for 100k kits), half the default learning rate (0.00005) for learning_rate, and approximately 3 times the expected cells for droplets_included (for example, 200,000 for 1M kits and 40,000 for 100k kits). In the case of subpar CellBender results with these parameters, adjust droplets_included up or down by 50,000 (1M kits).
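For illustration, a single config row for a 1M kit might look like this (the fastq paths, subpool, plate, lane, and run values are hypothetical placeholders; the CellBender values follow the recommendations above):

```
fastq	fastq_r2	subpool	plate	lane	run	droplets_included	learning_rate	expected_cells
/path/to/Subpool_1_R1.fastq.gz	/path/to/Subpool_1_R2.fastq.gz	Subpool_1	igvf_004	1	1	200000	0.00005	67000
```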
- Update sample_metadata.csv with relevant metadata for your experiment. The pipeline works on the assumption you have something in the following columns, in order to adhere to a common set of attributes especially when submitting data to the IGVF portal:
- Mouse_Tissue_ID
- Experiment
- bc1_well
- well_type
- SampleType
- Tissue
- Genotype
- Sex
- Age
- Protocol
- Chemistry
- Multiplexed_sample1
- Multiplexed_sample2
- Row
- Column
- DOB
- Age_days
- Body_weight_g
- Estrus_cycle
- Dissection_date
- Dissection_time
- ZT
- Dissector
- Tissue_weight_mg
- Notes
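If you want a quick sanity check that your metadata file has every required column before launching the pipeline, something like this works (a minimal sketch; the path assumes you saved the file in the configs folder):

```python
# Minimal sketch: verify sample_metadata.csv contains the expected columns.
import pandas as pd

required = [
    "Mouse_Tissue_ID", "Experiment", "bc1_well", "well_type", "SampleType",
    "Tissue", "Genotype", "Sex", "Age", "Protocol", "Chemistry",
    "Multiplexed_sample1", "Multiplexed_sample2", "Row", "Column", "DOB",
    "Age_days", "Body_weight_g", "Estrus_cycle", "Dissection_date",
    "Dissection_time", "ZT", "Dissector", "Tissue_weight_mg", "Notes",
]

meta = pd.read_csv("configs/sample_metadata.csv")
missing = [c for c in required if c not in meta.columns]
assert not missing, f"Missing metadata columns: {missing}"
```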
- Make a copy of `snakemake/Snakefile.smk` with your experiment name to edit. Keep the copy in the snakemake folder. You can also just edit `Snakefile.smk` directly, but I like to keep a copy of the Snakefile used for each run for my own records. You only need to edit 4 lines at most in the header region of the file:
- config_tsv: Path to the fastq config file which has the paths to your read1 and read2 input fastqs and the plate, lane, run, and sequencing platform.
- sample_csv: Path to the sample metadata. I typically update the Google spreadsheet, download the tab, and upload it to my configs folder. Each row represents a well in the sample barcoding plate with metadata information for that sample, some of which is repeated across all the samples on the plate, such as experiment name, kit, and tissue.
- kit: either WT (48 wells), WT_mini (12 wells), or WT_mega (96 wells)
- chemistry: either v1, v2, or v3
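Concretely, the header region you edit looks something like this (a sketch with hypothetical example values; the variable names match the list above):

```python
# Header region of your Snakefile copy (example values are hypothetical)
config_tsv = "configs/igvf_015_config.tsv"  # fastq config file
sample_csv = "configs/sample_metadata.csv"  # sample metadata
kit = "WT"         # WT (48 wells), WT_mini (12 wells), or WT_mega (96 wells)
chemistry = "v2"   # v1, v2, or v3
```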
Skip steps 1-4 below if you were following the setup instructions and are already in an interactive tmux session.
- Pay attention to your login node, or ssh to your favorite, e.g. `ssh login-i15`.
- Change directories to your pipeline directory, e.g. `cd /share/crsp/lab/seyedam/erebboah/parse_pipeline`. You MUST be in the `parse_pipeline` directory, not in a sub-directory like `parse_pipeline/configs` or `parse_pipeline/snakemake`, or it will not run.
- Start a tmux session, e.g. `tmux new -s mysession`.
- Start an interactive session: `srun -A SEYEDAM_LAB --cpus-per-task=1 --mem 32G --pty bash -i`.
- Activate your snakemake environment: `conda activate snakemake`. (You need to activate your snakemake conda environment again even if you were following the setup instructions, since sourcing the bashrc probably reset it to your base environment.)
- Check that snakemake is going to run the appropriate jobs (use the `-n` flag first). Make sure to change `Snakefile.smk` to the one you are actually using, for example `Snakefile_igvf015.smk`:
```
snakemake \
-s snakemake/Snakefile.smk \
-j 50 \
--latency-wait 120 \
--use-conda \
--cluster "sbatch -A {resources.account} --partition={resources.partition} --mem={resources.mem_gb}GB --gres={resources.gres} -c {resources.threads} --time=72:00:00" -n
```
- Actually run the pipeline (no `-n` flag):
```
snakemake \
-s snakemake/Snakefile.smk \
-j 50 \
--latency-wait 120 \
--use-conda \
--cluster "sbatch -A {resources.account} --partition={resources.partition} --mem={resources.mem_gb}GB --gres={resources.gres} -c {resources.threads} --time=72:00:00"
```
- FileNotFoundError/No such file or directory: Check your current directory (`pwd`). Make sure the 3 required input files exist and are in the correct locations: the fastq config, e.g. `igvf_###_config.tsv`, is in `parse_pipeline/configs`; `sample_metadata.csv` is in `parse_pipeline/configs`; and `Snakemake_###.smk` is in `parse_pipeline/snakemake`. Also make sure the fastq config file name is spelled correctly in your Snakemake smk file.
- AttributeError: Make sure the columns in `igvf_###_config.tsv` exactly match fastq, fastq_r2, subpool, plate, lane, run, droplets_included, learning_rate, and expected_cells.
- If your fastq or fastq_r2 paths are incorrect, the pipeline will appear to run fine at first: kallisto runs and the symlinked files have the correct names while pointing at the wrong data, but you get no cells and code downstream of kallisto will probably break (see the sketch after this list for a quick path check).
- If you put the wrong kit/chemistry in the snakemake file, you will get NaN barcodes and the pipeline fails due to an assert() call in the make cellbender adata rule.
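A quick way to catch bad fastq paths before launching the pipeline (a minimal sketch; the config file name is a hypothetical placeholder):

```python
# Minimal sketch: check that every fastq path in the config file exists.
import os
import pandas as pd

config = pd.read_csv("configs/igvf_004_config.tsv", sep="\t")
for col in ["fastq", "fastq_r2"]:
    for path in config[col]:
        if not os.path.exists(path):
            print(f"Missing: {path}")
```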
- Clean up extra files
- klue reference generation runs twice for the F1 plates; in other words, it makes both a NODJ.idx file and a B6NODF1J.idx file, which for now are identical, and the same goes for all the other 6 non-B6J genotypes. The good news is that this duplication only happens once: reference generation isn't repeated after the first pipeline run.
- Integrate report code
- Integrate experimentSpecs and analysisSpecs to replace manual metadata curation
- Support human genome
- Support custom humanized loci for Model AD