SEACON (Single-cell Estimation of Allele-specific COpy Numbers) is a tool for allele-specific copy number profiling from single-cell DNA-sequencing data. The source code is also available from Zenodo.
SEACON can be cited as follows: "Improved allele-specific single-cell copy number estimation in low-coverage DNA-sequencing". Samson Weiner, Bingjun Li, and Sheida Nabavi. Under Review.
To install SEACON, begin by cloning the repository, then cd into it:
git clone --recursive https://github.com/NabaviLab/SEACON.git
cd SEACON
Next, follow either the instructions for either the standard installation or custom installation.
To reproduce the environment needed to use SEACON, with all its dependencies and versions, create a new conda environment from the provided environment.yml file:
conda env create -f environment.yml
This will create a new conda environment named SEACON. Next, activate the environment:
conda activate SEACON
Lastly, execute the installation of the tool:
pip install .
The environment can also be assembled manually. SEACON reliably works for Python version 3.10 and R version 4.1. Additionally, there are a number of python dependencies, one R package, and a few standard bioinformatics tools. All are available on conda, and are listed below.
SEACON requires the following python packages are installed:
Additionally, SEACON requires the following bioinformatics tools be available:
- bedtools
- bcftools
- bigWigAverageOverBed
These tools must be added to your
PATH
variable.
Additionally, SEACON integrates CHISEL. CHISEL is written in python2.7 and thus causes dependency conflicts, so it should be installed in a separate conda environment. SEACON assumes it is installed in an environment named "SEACON_chisel". Note that this step is automated with the provided initialize.sh script.
SEACON takes as input a path to a directory containing bam files, one for each cell. It assigns the name of the file to be the cell id. Information on obtaining bam files from raw fastq files is provided in the Raw data preprocessing.
A reference genome is needed for some of the steps in SEACON. For human reference genomes, this will typically be hg38 or hg19, and can be obtained from the UCSC genome browser here.
A mappability file in bigWig format must be provided for the given reference genome. These can be obtained from the same organization (for example, here). The 100-mers mappability track for both hg38 and hg19 have been provided for download at this Google Drive.
To compute BAFs, SEACON uses the estimation procedure of CHISEL, and is included as a submodule in this repository. This requires an input vcf file containing phased SNPs from a matched-normal sample or a pseudo-normal sample built from identified normal cells (see this section). Scripts are provided for obtaining the phased SNPs using additional bioinformatics tools. More information appears in the section Obtaining phased SNPs below. Additionally, precomputed BAFs can be provided if estimated using a different algorithm. For convenience, if using the BAF estimation procedure of CHISEL, the outputs of CHISEL will also be generated for comparison.
SEACON is run by calling the command seacon
with a selected mode as follows (or by running the core/seacon.py script):
seacon [mode] [-o outdir] [options]
The modes are as follows: prep_readcount
, prep_baf
, call
, and pseudonormal
. SEACON is intended to be run using the first three modes in that order, which can also be achieved in a single call by passing mode full
. The pseudonormal
mode is designed to build a pseudobulk bam file from the normal cells to facilitate the BAF computation, but is only necessary if a matched-normal sample is unavailable. Each of the four modes are explained in further detail below.
For a full list of parameters and their default, use the following command:
seacon --help
Additionally, please be advised that multiple cores can be utilized by specifying the --num-processors
parameter.
The prep_readcount mode partitions the genome into bins, extracts the raw readcounts, filters using mappability and GC content, and normalizes the read counts using normal cells identified in the sample. It is executed as follows:
seacon prep_readcount -o out_dir -i bam_path -r ref.fa -b 5000000 --map-file hg38_mappability.bigWig [options]
If a matched-normal sample is available, use the --normal-bam
parameter. The readcounts can also be normalized without using the normal cells with the --no-normal
flag.
The prep_baf mode initializes input precomputed BAFs, or executes the BAF estimation procedure of CHISEL using a vcf file with phased SNPs. It is executed as follows:
seacon prep_baf -o out_dir -i bam_path --vcf phased_snps.vcf.gz --block-size 100 [options]
Note that the block size is in kb. If providing preocmputed BAFs, use the --precomputed-baf
parameter followed by the path to the input file. The file should be in tsv format with headers chrom, start, end, cell, BAF.
Once the previous two commands are run, the call mode infers the copy numbers.
seacon call -o out_dir [options]
If no matched-normal sample is available, SEACON can construct a pseudonormal sample which can be used to construct an input vcf file. This mode should be called after running the prep_readcount mode, and is executed as follows:
seacon pseudonormal -i bam_dir -o out_dir
After this file is generated, use the provided gen_vcf.sh
script in the scripts directory to identify the phased SNPs (see the section Obtaining phased SNPs). When this is complete, continue by running SEACON with the prep_baf mode.
After running the full SEACON pipeline, the final allele-specific copy number profiles are saved in the calls.tsv file, which has the following format:
cell chrom start end CN
cell1 chr1 1000001 2000000 1,1
cell1 chr1 2000001 3000000 1,2
cell1 chr1 3000001 4000000 2,3
The copy number profiles are also saved more compactly in the calls_flat.tsv file. Here, each row contains the cell name followed by its copy numbers across the genome. The ith column corresponds to the genomic bin specified by the ith line in the filtered_regions.bed file.
The readcounts, RDRs, and BAFs are saved in files readcounts.tsv, RDR.tsv, and BAF.tsv, respectively, also in compact format. Additionally, the set of normal cells identified in the sample are saved in the diploid.txt file.
If the BAF estimation procedure of CHISEL is used, the output of chisel can be found in file chisel_out/calls/calls.tsv. Please see the CHISEL documentation for more details.
The runtime and memory requirements of SEACON greatly depends on the number of cells, coverage, and other factors. Resource usage and runtime will mostly consist of the raw data preprocessing and BAF computation, with the seacon call
command executing relatively quickly. For this reason, we advise using SEACON on a high performance computing cluster with multiple cores. For convenience, we show example execution times of the three main modes of SEACON two whole-genome scDNA-seq datasets, one with 100 cells and the other 1000. Both datasets have a coverage of 0.1X coverage and 1Mbp bin sizes. The commands were allocated 10 cores of an Intel Xeon E5 v4 2.4GHz processors. The 100 cell dataset was allocated 20Gb memory, while the 1000 cell data was as allocated 40Gb. Note that the memory requirements are largely to run the CHISEL BAF estimation procedure - if using precomputed BAFs, the call command can be run with significantly less memory.
Num cell | Coverage | Bin Size | prep_readcount | prep_baf | call |
---|---|---|---|---|---|
100 | 0.1X | 1Mbp | 128s | 6404s | 496s |
1000 | 0.1X | 1Mbp | 926s | 25.6h | 12004s |
If raw reads are acquired in fastq format, we recommend the following steps (note that the bioinformatics tools bwa and samtools are required):
bwa mem -t 8 ref.fa cell.fastq.gz > cell.sam
samtools view -q 40 -hbS -o cell.bam cell.sam
samtools sort -o cell.sorted.bam cell.bam
samtools index cell.sorted.bam
If a matched-normal sample is available, or if a pseudonormal sample is constructed using the pseudonormal mode of SEACON, it is necessary to obtain a set of phased SNPs to compute the BAFs. A pipeline for obtaining a vcf file with phased SNPs is provided in the script gen_vcf.sh
located in the scripts directory. This script requires bcftools and additionally a phasing tool such as Eagle2 or shapeit.
In the provided script gen_vcf.sh
, the tool Eagle2 is used, and a prebuilt binary for linux can be installed here. Using reference-based phasing locally requires a reference panel to be downloaded, for example this reference from 1000 Genomes. Reference-free phasing is also possible. Additionally, Eagle2 can be used online through the the Michigan Imputation Server.
We have provided scripts to smooth the output copy number profiles of any method using the breakpoint filtering heuristic implemented in SEACON. These scripts are located in the scripts directory and include one for the output of CHISEl and another for any arbitrary output. Instructions on using these scripts appear in the README file located in the scripts directory.
The following is an example workflow of using SEACON
# Normalize readcounts and identify normal cells
seacon prep_readcount -o seacon_output -i dataset1_bams -r ../reference/hg38.fa -b 1000000 -map-file ../reference/mappability/hg38_mappability.bigWig -P 12
# Build pseudobulk sample from the normal cells identified with the previous step. Output is located in seacon_output/pseudonormal.bam
seacon pseudonormal -o seacon_output -i dataset1_bams
# Customize the provided script to obtain phased snps from the pseudonormal.bam file
bash scripts/gen_vcf.sh
# Compute the BAFs using the phased SNPs
seacon prep_baf -o seacon_output -i dataset1_bams --vcf phased_snps.vcf.gz --block-size 100
# With the input data ready, infer the copy numbers. Here, we are using a component merge tolerance threshold of 0.2 and a breakpoint filter of 5 cells to overcome noisy measurements.
seacon call -o seacon_output --tolerance 0.2 --upper-filter 5 --max-wgd 2
Datasets were generated using the CNAsim simulator. Information for recreating datasets used in the paper can be found in the README.md file in the simulation directory.