Skip to content

NabaviLab/SEACON

Repository files navigation

SEACON

SEACON (Single-cell Estimation of Allele-specific COpy Numbers) is a tool for allele-specific copy number profiling from single-cell DNA-sequencing data. The source code is also available from Zenodo.

SEACON can be cited as follows: "Improved allele-specific single-cell copy number estimation in low-coverage DNA-sequencing". Samson Weiner, Bingjun Li, and Sheida Nabavi. Under Review.

Contents

Installation

To install SEACON, begin by cloning the repository, then cd into it:

git clone --recursive https://github.com/NabaviLab/SEACON.git
cd SEACON

Next, follow either the instructions for either the standard installation or custom installation.

Standard Installation

To reproduce the environment needed to use SEACON, with all its dependencies and versions, create a new conda environment from the provided environment.yml file:

conda env create -f environment.yml

This will create a new conda environment named SEACON. Next, activate the environment:

conda activate SEACON

Lastly, execute the installation of the tool:

pip install .

Custom Installation

The environment can also be assembled manually. SEACON reliably works for Python version 3.10 and R version 4.1. Additionally, there are a number of python dependencies, one R package, and a few standard bioinformatics tools. All are available on conda, and are listed below.

Dependencies

SEACON requires the following python packages are installed:

Additionally, SEACON requires the following bioinformatics tools be available:

Additionally, SEACON integrates CHISEL. CHISEL is written in python2.7 and thus causes dependency conflicts, so it should be installed in a separate conda environment. SEACON assumes it is installed in an environment named "SEACON_chisel". Note that this step is automated with the provided initialize.sh script.

Inputs and Preparation

Sequencing reads

SEACON takes as input a path to a directory containing bam files, one for each cell. It assigns the name of the file to be the cell id. Information on obtaining bam files from raw fastq files is provided in the Raw data preprocessing.

Reference genome

A reference genome is needed for some of the steps in SEACON. For human reference genomes, this will typically be hg38 or hg19, and can be obtained from the UCSC genome browser here.

Mappability file

A mappability file in bigWig format must be provided for the given reference genome. These can be obtained from the same organization (for example, here). The 100-mers mappability track for both hg38 and hg19 have been provided for download at this Google Drive.

Phasing information

To compute BAFs, SEACON uses the estimation procedure of CHISEL, and is included as a submodule in this repository. This requires an input vcf file containing phased SNPs from a matched-normal sample or a pseudo-normal sample built from identified normal cells (see this section). Scripts are provided for obtaining the phased SNPs using additional bioinformatics tools. More information appears in the section Obtaining phased SNPs below. Additionally, precomputed BAFs can be provided if estimated using a different algorithm. For convenience, if using the BAF estimation procedure of CHISEL, the outputs of CHISEL will also be generated for comparison.

Usage

SEACON is run by calling the command seacon with a selected mode as follows (or by running the core/seacon.py script):

seacon [mode] [-o outdir] [options]

The modes are as follows: prep_readcount, prep_baf, call, and pseudonormal. SEACON is intended to be run using the first three modes in that order, which can also be achieved in a single call by passing mode full. The pseudonormal mode is designed to build a pseudobulk bam file from the normal cells to facilitate the BAF computation, but is only necessary if a matched-normal sample is unavailable. Each of the four modes are explained in further detail below.

For a full list of parameters and their default, use the following command:

seacon --help

Additionally, please be advised that multiple cores can be utilized by specifying the --num-processors parameter.

prep_readcount mode

The prep_readcount mode partitions the genome into bins, extracts the raw readcounts, filters using mappability and GC content, and normalizes the read counts using normal cells identified in the sample. It is executed as follows:

seacon prep_readcount -o out_dir -i bam_path -r ref.fa -b 5000000 --map-file hg38_mappability.bigWig [options]

If a matched-normal sample is available, use the --normal-bam parameter. The readcounts can also be normalized without using the normal cells with the --no-normal flag.

prep_baf mode

The prep_baf mode initializes input precomputed BAFs, or executes the BAF estimation procedure of CHISEL using a vcf file with phased SNPs. It is executed as follows:

seacon prep_baf -o out_dir -i bam_path --vcf phased_snps.vcf.gz --block-size 100 [options] 

Note that the block size is in kb. If providing preocmputed BAFs, use the --precomputed-baf parameter followed by the path to the input file. The file should be in tsv format with headers chrom, start, end, cell, BAF.

call mode

Once the previous two commands are run, the call mode infers the copy numbers.

seacon call -o out_dir [options]

pseudonormal mode

If no matched-normal sample is available, SEACON can construct a pseudonormal sample which can be used to construct an input vcf file. This mode should be called after running the prep_readcount mode, and is executed as follows:

seacon pseudonormal -i bam_dir -o out_dir

After this file is generated, use the provided gen_vcf.sh script in the scripts directory to identify the phased SNPs (see the section Obtaining phased SNPs). When this is complete, continue by running SEACON with the prep_baf mode.

Output files

After running the full SEACON pipeline, the final allele-specific copy number profiles are saved in the calls.tsv file, which has the following format:

cell   chrom   start          end              CN
cell1   chr1     1000001   2000000     1,1
cell1   chr1     2000001   3000000     1,2
cell1   chr1     3000001   4000000     2,3

The copy number profiles are also saved more compactly in the calls_flat.tsv file. Here, each row contains the cell name followed by its copy numbers across the genome. The ith column corresponds to the genomic bin specified by the ith line in the filtered_regions.bed file.

The readcounts, RDRs, and BAFs are saved in files readcounts.tsv, RDR.tsv, and BAF.tsv, respectively, also in compact format. Additionally, the set of normal cells identified in the sample are saved in the diploid.txt file.

If the BAF estimation procedure of CHISEL is used, the output of chisel can be found in file chisel_out/calls/calls.tsv. Please see the CHISEL documentation for more details.

Resource requirements and runtime

The runtime and memory requirements of SEACON greatly depends on the number of cells, coverage, and other factors. Resource usage and runtime will mostly consist of the raw data preprocessing and BAF computation, with the seacon call command executing relatively quickly. For this reason, we advise using SEACON on a high performance computing cluster with multiple cores. For convenience, we show example execution times of the three main modes of SEACON two whole-genome scDNA-seq datasets, one with 100 cells and the other 1000. Both datasets have a coverage of 0.1X coverage and 1Mbp bin sizes. The commands were allocated 10 cores of an Intel Xeon E5 v4 2.4GHz processors. The 100 cell dataset was allocated 20Gb memory, while the 1000 cell data was as allocated 40Gb. Note that the memory requirements are largely to run the CHISEL BAF estimation procedure - if using precomputed BAFs, the call command can be run with significantly less memory.

Num cell Coverage Bin Size prep_readcount prep_baf call
100 0.1X 1Mbp 128s 6404s 496s
1000 0.1X 1Mbp 926s 25.6h 12004s

Utilities

Raw data preprocessing

If raw reads are acquired in fastq format, we recommend the following steps (note that the bioinformatics tools bwa and samtools are required):

bwa mem -t 8 ref.fa cell.fastq.gz > cell.sam
samtools view -q 40 -hbS -o cell.bam cell.sam
samtools sort -o cell.sorted.bam cell.bam
samtools index cell.sorted.bam

Obtaining phased SNPs

If a matched-normal sample is available, or if a pseudonormal sample is constructed using the pseudonormal mode of SEACON, it is necessary to obtain a set of phased SNPs to compute the BAFs. A pipeline for obtaining a vcf file with phased SNPs is provided in the script gen_vcf.sh located in the scripts directory. This script requires bcftools and additionally a phasing tool such as Eagle2 or shapeit.

In the provided script gen_vcf.sh, the tool Eagle2 is used, and a prebuilt binary for linux can be installed here. Using reference-based phasing locally requires a reference panel to be downloaded, for example this reference from 1000 Genomes. Reference-free phasing is also possible. Additionally, Eagle2 can be used online through the the Michigan Imputation Server.

Breakpoint filtering scripts

We have provided scripts to smooth the output copy number profiles of any method using the breakpoint filtering heuristic implemented in SEACON. These scripts are located in the scripts directory and include one for the output of CHISEl and another for any arbitrary output. Instructions on using these scripts appear in the README file located in the scripts directory.

Example commands

The following is an example workflow of using SEACON

# Normalize readcounts and identify normal cells
seacon prep_readcount -o seacon_output -i dataset1_bams -r ../reference/hg38.fa -b 1000000 -map-file ../reference/mappability/hg38_mappability.bigWig -P 12

# Build pseudobulk sample from the normal cells identified with the previous step. Output is located in seacon_output/pseudonormal.bam
seacon pseudonormal -o seacon_output -i dataset1_bams

# Customize the provided script to obtain phased snps from the pseudonormal.bam file
bash scripts/gen_vcf.sh

# Compute the BAFs using the phased SNPs
seacon prep_baf -o seacon_output -i dataset1_bams --vcf phased_snps.vcf.gz --block-size 100

# With the input data ready, infer the copy numbers. Here, we are using a component merge tolerance threshold of 0.2 and a breakpoint filter of 5 cells to overcome noisy measurements. 
seacon call -o seacon_output --tolerance 0.2 --upper-filter 5 --max-wgd 2

Simulations

Datasets were generated using the CNAsim simulator. Information for recreating datasets used in the paper can be found in the README.md file in the simulation directory.

About

A tool for allele-specific copy number profiling.

Resources

License

Stars

Watchers

Forks

Packages

No packages published