Skip to content

3.1 scDNA: CopyKit

Junke Kris Wang edited this page Jul 4, 2024 · 4 revisions

Welcome to the single cell DNA Workshop!

Overview

The goal of CopyKit is to help you analyze single cell DNA sequencing datasets for copy number.

This tutorial presents an example workflow of CopyKit. Detailed information for methods used of each function can be found in CopyKit user guide.

The example dataset and Rscript can be downloaded here.

Installation

  1. Install the stable or development version from github using devtools. Note that knn smoothing is only available in development version.

  2. Install from package archive files, download v0.1.2 here.

devtools::install_github("navinlabcode/copykit")
devtools::install_github("navinlabcode/copykit@devel")

Check version and load libraries.

packageVersion("copykit")

## [1] '0.1.2'

library(copykit)

Parallelization

Whenever possible, CopyKit uses the BiocParallel framework.

library(BiocParallel)
n_threads = 8
register(MulticoreParam(workers = n_threads, progressbar = F), default = T)

Confirm parameters:

bpparam()

## class: MulticoreParam
##   bpisup: FALSE; bpnworkers: 50; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bpRNGseed: ; bptimeout: NA; bpprogressbar: FALSE
##   bpexportglobals: TRUE; bpexportvariables: FALSE; bpforceGC: FALSE
##   bpfallback: TRUE
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: FORK

Segmentation

Run CBS segmentation from bam files.

This part takes more than 10 mins depending on the sample size, thus will be skipped in the workshop.

tumor <- runVarbin(
          dir = "~/path/to/bam/files/",
          genome = "hg38",
          resolution = "220kb", # c("220kb", "55kb", "110kb", "195kb", "280kb", "500kb", "1Mb", "2.8Mb")
          remove_Y = FALSE,
          is_paired_end = FALSE,
          seed = 17,
          BPPARAM = bpparam()
        )

An example of basic segmentation and metric calculation of sequencing data from mock data generated from the built-in simulator.

copykit_obj <- mock_bincounts(ncells = 5,
                              ncells_diploid = 1,
                              position_gain = c(1,5,1000))

## Running variance stabilization transformation: ft

## Smoothing outlier bins.

## Running segmentation algorithm: CBS for genome hg38

## Merging levels.

## Done.

copykit_obj <- runMetrics(copykit_obj)

## Calculating overdispersion.

## Counting breakpoints.

## Done.

Calculated metric was stored in colData and can be visualized using plotMetrics().

kable(colData(copykit_obj))
sample ground_truth overdispersion breakpoint_count
V1 V1 FALSE 0.0002988 0
V2 V2 TRUE 0.0002742 0
V3 V3 TRUE 0.0002272 0
V4 V4 TRUE 0.0001224 0
V5 V5 TRUE 0.0001481 0
plotMetrics(copykit_obj, metric = "overdispersion", label = "ground_truth")

qc2-1

Example data set

First, we load in the example CopyKit object. This data set was generated from scDNA-seq of a human liver sample.

tumor <- readRDS("/DATAPATH/sample_obj.rds")

Then take a look at the data structure of CopyKit object.

head(tumor)

## class: CopyKit 
## dim: 6 600 
## metadata(3): genome resolution vst
## assays(6): bincounts ft ... ratios logr
## rownames(6): 1 2 ... 5 6
## rowData names(3): gc_content abspos arm
## colnames(600): PMTC6LiverC103DL6L7S3_S1255_L004_R1_001
##   PMTC6LiverC204DL1S7_S588_L002_R1_001 ...
##   PMTC6LiverC271AL6L7S5_S1423_L004_R1_001
##   PMTC6LiverC136AL6L7S3_S1288_L004_R1_001
## colData names(9): sample reads_assigned_bins ... reads_total
##   percentage_duplicates
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowRanges has: 6 ranges
## Phylo: Phylogenetic tree with 0 tips and  nodes
## consensus dim: 0 0

kable(rowData(tumor)[1:5,])
gc_content abspos arm
0.613786 1039596 p
0.591817 1264158 p
0.543128 1830107 p
0.572288 2065015 p
0.604703 2290536 p
kable(colData(tumor)[1:5,])
sample reads_assigned_bins reads_unmapped reads_duplicates reads_multimapped reads_unassigned reads_ambiguous reads_total percentage_duplicates
PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 506994 27749 67923 0 105388 115 708169 0.096
PMTC6LiverC204DL1S7_S588_L002_R1_001 PMTC6LiverC204DL1S7_S588_L002_R1_001 476881 36334 45716 0 98424 96 657451 0.070
PMTC6LiverC258AL1S3_S258_L001_R1_001 PMTC6LiverC258AL1S3_S258_L001_R1_001 549322 27843 50278 0 111227 107 738777 0.068
PMTC6LiverC154AL1S2_S154_L001_R1_001 PMTC6LiverC154AL1S2_S154_L001_R1_001 812987 36743 78314 0 155468 179 1083691 0.072
PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 624401 35599 74604 0 131717 111 866432 0.086
kable(segment_ratios(tumor)[1:5,1:5])
PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 PMTC6LiverC204DL1S7_S588_L002_R1_001 PMTC6LiverC258AL1S3_S258_L001_R1_001 PMTC6LiverC154AL1S2_S154_L001_R1_001 PMTC6LiverC247AL4L5S3_S1015_L003_R1_001
1 1 0.98 0.71 1
1 1 0.98 0.71 1
1 1 0.98 0.71 1
1 1 0.98 0.71 1
1 1 0.98 0.71 1

Quality Control

First, we want to detect euploid cells by calculating the sample-wise coefficient of variation from the segment ratio means. The threshold can be changed from the automatic detection to a custom threshold with the argument resolution. By setting a threshold = 0.1, findAneuploidCells() will mark as euploid all cells with a coefficient of variation less or equal than 0.1.

tumor <- findAneuploidCells(tumor, seed = 17)

## number of iterations= 15

## Copykit detected 244 that are possibly diploid cells using a resolution of: 0.067

## Added information to colData(CopyKit).

Then we detect low-quality samples by a k-nearest-neighbors filtering (default k=5). Cells in which the correlation value is lower than the defined threshold are classified as low-quality cells (default resolution=0.9).

tumor <- findOutliers(tumor, k = 5, resolution = 0.9)

## Calculating correlation matrix.

## Marked 70 cells as outliers.

## Adding information to metadata. Access with colData(scCNA).

## Done.

kable(colData(tumor)[1:5,])
sample reads_assigned_bins reads_unmapped reads_duplicates reads_multimapped reads_unassigned reads_ambiguous reads_total percentage_duplicates is_aneuploid find_normal_cv cell_corr_value outlier
PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 506994 27749 67923 0 105388 115 708169 0.096 FALSE 0.02 1.000 FALSE
PMTC6LiverC204DL1S7_S588_L002_R1_001 PMTC6LiverC204DL1S7_S588_L002_R1_001 476881 36334 45716 0 98424 96 657451 0.070 FALSE 0.00 1.000 FALSE
PMTC6LiverC258AL1S3_S258_L001_R1_001 PMTC6LiverC258AL1S3_S258_L001_R1_001 549322 27843 50278 0 111227 107 738777 0.068 FALSE 0.06 0.678 TRUE
PMTC6LiverC154AL1S2_S154_L001_R1_001 PMTC6LiverC154AL1S2_S154_L001_R1_001 812987 36743 78314 0 155468 179 1083691 0.072 TRUE 0.54 0.972 FALSE
PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 624401 35599 74604 0 131717 111 866432 0.086 FALSE 0.00 1.000 FALSE

Visualize the QC results by a heatmap.

plotHeatmap(tumor, 
            order_cells = "hclust",
            label = c('is_aneuploid','outlier'),
            row_split = 'outlier', 
            n_threads = n_threads)

qcht-1

In the last step of the QCm we remove unwanted cells by sub-setting the CopyKit object.

tumor <- tumor[,colData(tumor)$is_aneuploid == TRUE]
tumor <- tumor[,colData(tumor)$outlier == FALSE]

The dataset should now be ready to proceed with the analysis.

Clustering

We apply dimension reduction and embed the data into 2 dimensional space.

tumor <- runUmap(tumor, seed = 17)
kable(reducedDim(tumor)[1:5,])
PMTC6LiverC154AL1S2_S154_L001_R1_001 -2.8515888 -0.8837085
PMTC6LiverC72AL3S1_S840_L003_R1_001 -2.0377223 0.5955269
PMTC6LiverC223AL1S3_S223_L001_R1_001 0.5117559 -0.2887311
PMTC6LiverC153AL1S6_S537_L002_R1_001 -2.5070448 -0.5821292
PMTC6LiverC198DL1S7_S582_L002_R1_001 -1.1519634 -0.8655404
plotUmap(tumor)

umap-1

By default, CopyKit uses hdbscan as its clustering method and it is recommended to do so. To perform clustering, k should be pre-determined. To help with choosing k, CopyKit provides the helper function findSuggestedK() to bootstrap clustering over a range of k values, and returns the value that maximizes the Jaccard similarity.

tumor <- findSuggestedK(tumor, k_range = 3:7)

## Calculating Jaccard similarity for k range: 3 4 5 6 7

## 

## Suggested k = 7 with median jaccard similarity of: 0.943

plotSuggestedK(tumor)

findk-1

plotSuggestedK(tumor, geom = 'tile')

findk-2

Finally, we cluster the cells using hdbscan with our preferred k.

tumor <- findClusters(tumor)

## Using suggested k_subclones = 7

## Finding clusters, using method: hdbscan

## Found 2 outliers cells (group 'c0')

## Found 4 subclones.

## Done.

Check the clustering result.

tumor <- calcConsensus(tumor);tumor <- runConsensusPhylo(tumor)

plotUmap(tumor,  label = 'subclones')

visual-1

plotHeatmap(tumor, 
            order_cells = "hclust",
            row_split = "subclones",
            label = 'subclones', 
            n_threads = n_threads)

visual-2

It is possible that a subgroup of clustering outliers is identified. Outliers are added to subgroup c0 and should be removed.

tumor_c0 <- tumor[,colData(tumor)$subclones == 'c0']
tumor <- tumor[,colData(tumor)$subclones != 'c0']
colData(tumor)$subclones <- droplevels(colData(tumor)$subclones)

Excluded data can be easily merged back. This can also be used to merge 2 CopyKit objects.

merged <- cbind(tumor, tumor_c0)

Ploidy and Integer estimation

CopyKit supports different methods of calculating integer copy number profiles.

  1. Computational ploidies: scquantum is a method that estimates ploidy, and uses the ploidy estimate to scale the data to absolute copy numbers given bincount data from single-cell copy number profiling.
tumor <- calcInteger(tumor, 
                     method = 'scquantum', 
                     assay = 'bincounts')
  1. Fixed: a fixed value of ploidy (generally determined using Flow Cytometry) will be used to scale all cells.
tumor <- calcInteger(tumor, 
                     method = 'fixed', 
                     ploidy_value = 4.3)

Plot the integer copy number and check the rounding error

plotHeatmap(tumor, 
            assay = 'integer',
            order_cells = "hclust",
            row_split = "subclones",
            label = 'subclones', 
            n_threads = n_threads)

visualint-1

plotHeatmap(tumor, 
            assay = 'integer',
            order_cells = "hclust",
            row_split = "subclones",
            rounding_error = TRUE,
            label = 'subclones', 
            n_threads = n_threads)

visualint-2

We can also check profile of any cell using the plotRatio(). Here shown profile of a cell from c1.

plotRatio(tumor, sample_name = 'PMTC6LiverC281AL6L7S5_S1433_L004_R1_001')

ratio-1

Phylogeny

To run a phylogenetic analysis of cells’ copy number profiles, use the function runPhylo(). Available methods are Neighbor Joining and Balanced Minimum evolution. Trees can also be visualized by plotPhylo()

tumor <- runPhylo(tumor,
                  assay = 'segment_ratios',
                  metric = 'manhattan',
                  n_threads = n_threads)
phylo(tumor)

## 
## Phylogenetic tree with 311 tips and 310 internal nodes.
## 
## Tip labels:
##   PMTC6LiverC154AL1S2_S154_L001_R1_001, PMTC6LiverC72AL3S1_S840_L003_R1_001, PMTC6LiverC223AL1S3_S223_L001_R1_001, PMTC6LiverC153AL1S6_S537_L002_R1_001, PMTC6LiverC198DL1S7_S582_L002_R1_001, PMTC6LiverC161AL4L5S1_S929_L003_R1_001, ...
## 
## Rooted; includes branch lengths.

plotPhylo(tumor, label = 'subclones')

phylo-1

Single cell phylogeny is very prone to data noise. By taking the median of all cells from the same subclone, we can build subclonal consensus profiles and construct phylogeny from there. We can also root the consensus tree by a diploid profile.

tumor <- calcConsensus(tumor, assay = 'segment_ratios')
kable(consensus(tumor)[1:5,])
c1 c2 c3 c4
V1 0.69 0.73 0.68 0.69
V2 0.69 0.73 0.68 0.69
V3 0.69 0.73 0.68 0.69
V4 0.69 0.73 0.68 0.69
V5 0.69 0.73 0.68 0.69
tumor <- runConsensusPhylo(tumor, root = 'neutral')
consensusPhylo(tumor)

## 
## Phylogenetic tree with 4 tips and 3 internal nodes.
## 
## Tip labels:
##   c1, c2, c3, c4
## 
## Rooted; includes branch lengths.

plotPhylo(tumor, label = 'subclones', consensus = TRUE)

consensus-1

To further understand which copy number events contribute to the lineage, we can also plot consensus heatmaps annotated by oncogenes.

genes = c("CDKN2A",
          "FGFR1",
          "TP53",
          "PTEN",
          "MYC",
          "CDKN1A",
          "MDM2",
          "AURKA",
          "PIK3CA",
          "CCND1",
          "KRAS")

tumor <- calcConsensus(tumor, assay = 'integer')
plotHeatmap(tumor, 
            consensus = TRUE,
            label = 'subclones',
            genes = genes,
            n_threads = n_threads)

consensuspl-1

Or more intuitively, check copy number states across genes in different subclones.

plotGeneCopy(tumor, 
             genes = genes,
             label = 'subclones',
             dodge.width = 0.8)

consensusgene-1

Also note that all the plot functions except heatmap are based on ggplot2, thus we can further customize them by adding themes, etc.

plotGeneCopy(tumor, 
             genes = genes,
             label = 'subclones',
             dodge.width = 0.8) +
  ggplot2::ylim(c(0,4)) +
  ggplot2::ggtitle("Copy number status across genes") +
  ggplot2::theme_linedraw()

custplot-1

Session info

sessionInfo()

## R version 4.3.3 (2024-02-29)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 24.04 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /opt/miniconda3/lib/libopenblasp-r0.3.21.so;  LAPACK version 3.9.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: /UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] BiocParallel_1.34.2         copykit_0.1.2              
##  [3] DNAcopy_1.74.1              Rsubread_2.14.2            
##  [5] SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.2
##  [7] Biobase_2.60.0              GenomicRanges_1.52.1       
##  [9] GenomeInfoDb_1.36.4         IRanges_2.34.1             
## [11] S4Vectors_0.38.2            BiocGenerics_0.46.0        
## [13] MatrixGenerics_1.12.3       matrixStats_1.3.0          
## [15] knitr_1.47                 
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3      jsonlite_1.8.8          shape_1.4.6.1          
##   [4] magrittr_2.0.3          ggbeeswarm_0.7.2        modeltools_0.2-23      
##   [7] farver_2.1.2            rmarkdown_2.27          GlobalOptions_0.1.2    
##  [10] fs_1.6.4                zlibbioc_1.46.0         vctrs_0.6.5            
##  [13] memoise_2.0.1           RCurl_1.98-1.14         ggtree_3.6.0           
##  [16] htmltools_0.5.8.1       S4Arrays_1.0.6          forcats_1.0.0          
##  [19] BiocNeighbors_1.18.0    gridGraphics_0.5-1      sass_0.4.9             
##  [22] bslib_0.7.0             htmlwidgets_1.6.4       plotly_4.10.4          
##  [25] cachem_1.1.0            igraph_2.0.3            mime_0.12              
##  [28] lifecycle_1.0.4         iterators_1.0.14        pkgconfig_2.0.3        
##  [31] Matrix_1.6-1.1          R6_2.5.1                fastmap_1.2.0          
##  [34] GenomeInfoDbData_1.2.10 shiny_1.8.1.1           clue_0.3-65            
##  [37] digest_0.6.35           aplot_0.2.3             colorspace_2.1-0       
##  [40] ggnewscale_0.4.10       patchwork_1.2.0         irlba_2.3.5.1          
##  [43] labeling_0.4.3          fansi_1.0.6             httr_1.4.7             
##  [46] abind_1.4-5             compiler_4.3.3          withr_3.0.0            
##  [49] doParallel_1.0.17       viridis_0.6.5           highr_0.11             
##  [52] MASS_7.3-60             DelayedArray_0.26.7     rjson_0.2.21           
##  [55] bluster_1.12.0          gtools_3.9.5            tools_4.3.3            
##  [58] vipor_0.4.7             beeswarm_0.4.0          ape_5.8                
##  [61] prabclus_2.3-3          httpuv_1.6.15           nnet_7.3-19            
##  [64] glue_1.7.0              dbscan_1.1-12           nlme_3.1-165           
##  [67] promises_1.3.0          grid_4.3.3              cluster_2.1.6          
##  [70] generics_0.1.3          gtable_0.3.5            class_7.3-22           
##  [73] tidyr_1.3.1             data.table_1.15.4       utf8_1.2.4             
##  [76] XVector_0.40.0          flexmix_2.3-19          foreach_1.5.2          
##  [79] pillar_1.9.0            yulab.utils_0.1.4       later_1.3.2            
##  [82] robustbase_0.99-2       circlize_0.4.16         splines_4.3.3          
##  [85] dplyr_1.1.4             treeio_1.24.3           lattice_0.22-6         
##  [88] FNN_1.1.4               survival_3.7-0          tidyselect_1.2.1       
##  [91] ComplexHeatmap_2.16.0   miniUI_0.1.1.1          amap_0.8-19            
##  [94] gridExtra_2.3           xfun_0.45               mixtools_2.0.0         
##  [97] diptest_0.77-1          scquantum_1.0.0         DEoptimR_1.1-3         
## [100] lazyeval_0.2.2          ggfun_0.1.5             yaml_2.3.8             
## [103] evaluate_0.24.0         codetools_0.2-20        kernlab_0.9-32         
## [106] tibble_3.2.1            ggplotify_0.1.2         cli_3.6.2              
## [109] uwot_0.2.2              xtable_1.8-4            segmented_2.1-0        
## [112] munsell_0.5.1           jquerylib_0.1.4         Rcpp_1.0.12            
## [115] png_0.1-8               fastcluster_1.2.6       parallel_4.3.3         
## [118] ggplot2_3.5.1           mclust_6.1.1            ggalluvial_0.12.5      
## [121] bitops_1.0-7            viridisLite_0.4.2       tidytree_0.4.6         
## [124] scales_1.3.0            purrr_1.0.2             crayon_1.5.3           
## [127] fpc_2.2-12              GetoptLong_1.0.5        rlang_1.1.4