This repository houses the codebase for the paper Learning the cellular origins of cancer using single-cell chromatin landscapes. In this paper, we proposed SCOTI (Single-cell Cell Of Transformation Identification), which offers a straightforward and cost-effective framework enabling cancer biologists to ascertain the COT at cell-type resolution for a given cancer mutation profile via scATAC sequencing of suspected normal tissues of origin.
In the first section below, we provide an example of how to pre-process data and prepare it as input for SCOTI, in addition to then running SCOTI. To reproduce paper figures, see scripts in the directory paper
, named by the corresponding figure.
First, we need to up our conda environment. To do this, run
sh setup_env.sh
Then, activate the environment:
conda activate coo
and run:
Rscript post_installation.R
You should now be ready to proceed.
To run SCOTI, we first need to create aggregated, binned scATAC and mutation profiles. All scripts for data processing are in data/scripts
, and all relative paths below are relative to this directory.
We'll start with scATAC, and we'll use the data from Single-Cell Multiomic Analysis Identifies Regulatory Programs in Mixed-Phenotype Acute Leukemia as an example, which comes from PBMC and bonemarrow. After cloning this repository, download the fragment files:
mkdir -p ../bed_files/Greenleaf_test/migrated_to_hg19/
sh get_scATAC_data_from_links.sh ../greenleaf_blood_bm_ftp_links.txt ../bed_files/Greenleaf_test/migrated_to_hg19/ '*gz'
This will download the files to the directory data/bed_files/Greenleaf_test/migrated_to_hg19
.
Small aside: Note that the fragments from this study are already aligned to hg19. However, if your data is not and you would like to run SCOTI using the mutation data we used, you will have to align the fragments to hg19, since the mutation data we had available was aligned to hg19. To do this, make sure your fragment files are in the directory
data/bed_files/[DATASET_NAME]/
. Here,DATASET_NAME
is whatever name you want to give to your dataset. Then runRscript migrate_and_save_fragments.R --dataset [DATASET_NAME] --cores [NUM_CORES]
. Note that this script is parallelized, so you can specifyNUM_CORES
, where each core will be migrating one of the fragment files. So if you have 16 fragment files and 4 cores, then specifyingNUM_CORES
to 4 will process 4 files at a time.
We will then rename the files to match the pattern [TISSUE_TYPE]-[SAMPLE_NAME].bed.gz
:
cd data/bed_files/Greenleaf_test/migrated_to_hg19
for f in $(ls *); do mv $f $(echo $f | sed 's/.*_scATAC_//' | sed 's/.fragments.tsv.gz/.bed.gz/' | sed 's/_/-/'); done
Now, we need a metadata file that contains cell-type annotations. This file must contain the following columns:
- barcode: cell barcode corresponding to the barcodes in the fragment files e.g AACTGGTTCCACTAGA
- sample: sample name from which the cell comes from (this must correspond to
SAMPLE_NAME
in the name of the fragment file, see above). - cell_type: cell type annotation
- tissue: tissue name from which the cell comes from (again, must correspond to
TISSUE_TYPE
in the name of the fragment file, see above)
The metadata file needs to be in data/metadata
and must be called [ANNOTATION].csv
where ANNOTATION
is your desired name for the cell annotations that you used. For this example, we have a file called test_annotation.csv
.
We then create the binned scATAC profiles:
Rscript create_count_overlaps.R --dataset [DATASET_NAME] --cores [NUM_CORES] --annotation [ANNOTATION] --which_interval_ranges [INTERVAL_RANGES_NAME]
This is script is parallelized, and each core will be processing a given fragment file. Note that here, DATASET_NAME
and ANNOTATION
must be the ones you chose previously. INTERVAL_RANGES_NAMES
specifies the GenomicRanges object (bins) that we will align our fragments to. For this example, we have a file called data/test_ranges.RData
. This corresponds to the bins we used for our paper. So for INTERVAL_RANGES_NAME
, we would specify test_ranges
. Note that this file must be an .RData
file.
Thus, assuming we have 4 cores, in this example, we would run:
Rscript create_count_overlaps.R --dataset Greenleaf_test --cores 4 --annotation test_annotation --which_interval_ranges test_ranges
This will create our binned fragment files in data/processed_data/count_overlap_data/[ANNOTATION]/
with the name interval_ranges_[INTERVAL_RANGES_NAME]_[DATASET_NAME]_count_overlaps_[TISSUE_TYPE]-[SAMPLE_NAME].rds
. In addition, we will get metadata files in data/processed_data/cell_counts_per_sample/[ANNOTATION]/
with the names cell_counts_interval_ranges_[INTERVAL_RANGES_NAME]_[DATASET_NAME]_count_overlaps_[TISSUE_TYPE]-[SAMPLE_NAME].rds
. These files contain data about the number of cells per cell type.
Finally, we must combine the counts from identical cell types from different samples. We do this by running:
Rscript combine_overlaps.R --datasets [DATASET_NAME] --annotation [ANNOTATION] --which_interval_ranges [INTERVAL_RANGES_NAME]
In our example:
Rscript combine_overlaps.R --datasets Greenleaf_test --annotation test_annotation --which_interval_ranges test_ranges
This creates an aggregated scATAC profile, which can be found at data/processed_data/count_overlap_data/combined_count_overlaps/[ANNOTATION]/interval_ranges_[INTERVAL_RANGES_NAME]_[DATASET_NAME]_combined_count_overlaps.rds
. In the same directory, another file called interval_ranges_[INTERVAL_RANGES_NAME]_[DATASET_NAME]_combined_count_overlaps_metadata.rds
will also be created, which has information about the number of cells per cell type and tissue type.
In this example:
interval_ranges_test_ranges_Greenleaf_test_combined_count_overlaps.rds
and
interval_ranges_test_ranges_Greenleaf_test_combined_count_overlaps_metadata.rds
To create aggregated, binned mutation profiles, we first need a MAF file with the mutation (SNV) data. For this example, we will use Non-Hodgkin lymphoma (Lymph-BNHL). We have the corresponding MAF file, called Lymph-BNHL_SNV_with_SEX.txt
, in Google Drive. This should be downloaded, unzipped, and placed in data/mutation_data/
. Your file, if using your own mutation data, should be called [CANCER_TYPE]_SNV_with_SEX.txt
. Next, we begin by parsing the file:
python3 2_Sorting_MutationFileSex_CancerType.py --cancer_types [CANCER_TYPE]
We then intersect the mutations with the bins:
python3 3_Intersect_paz_cancertypes.py --cancer_types [CANCER_TYPE]
Finally, we aggregate the mutations across samples:
python3 4_AssembleCout_paz_Cancergroup.py --cancer_types [CANCER_TYPE]
This creates a file data/processed_data/[CANCER_TYPE].txt
that has the aggregated, binned mutation profile.
Finally, we add custom names to the bins, in addition to adding CANCER_TYPE
as a column name for our data (this step could have been incorporated in the previous scripts):
Rscript align_mutations_to_ranges --cancer_types [CANCER_TYPE]
This creates a csv file data/processed_data/[CANCER_TYPE].csv
that is ready to be input into SCOTI.
Putting these together for our example:
python3 2_Sorting_MutationFileSex_CancerType.py --cancer_types Lymph-BNHL
python3 3_Intersect_paz_cancertypes.py --cancer_types Lymph-BNHL
python3 4_AssembleCout_paz_Cancergroup.py --cancer_types Lymph-BNHL
Rscript align_mutations_to_ranges.R --cancer_types Lymph-BNHL
In our paper, for each cancer type, we ran the model 100 times to obtain robust predictions. In practice, this means we needed access to a compute cluster to parallelize the model training process. Below, we present two pipelines for obtaining predictions: an unparallelized approach and a parallelized approach. Of course, to obtain 100 predictions in a reasonable amount of time, particularly if using large feature spaces, you would want to use the parallelized option. In the case of the parallelized option, we assume you have access to a system that uses UGE as the job manager. However, even if this is not the case, it is very straighforward to modify prep_ML_model_scripts.py
, the script which does the parallelization, to use a different job manager.
The script analysis/ML/build_ML_model.py
runs SCOTI. Before looking at an example, we outline the important command line options below:
Cancer Types [--cancer_types]
The cancer types to run the model on. Note that this is not parallelized i.e specifying mutliple cancer types will run the model sequentially on each cancer type. Should correspond toCANCER_TYPE
above.
scATAC Datasets [--datasets]
Names of the scATAC datasets to consider. Should correspond toDATASET_NAME
above.
Cell Type Number Filter [--scATAC_cell_number_filter]
Filter for the minimum number of cells per cell type. All cells with lower than this minimum will be excluded.Annotation [--annotation_dir]
Name of annotation used. Should correspond toANNOTATION
above.
Fold For Test Set [--fold_for_test_set]
Which fold among the 10 contiguous genome regions to use as the test set. This number must be in the range 1-10.Seed Range [--seed_range]
Range of seeds to use when training the model and computing permutation importance for a particular train/test split (i.e for a particular chosen fold for the test set). Format: [LOWEST_SEED]-[HIGHEST_SEED] e.g 3-7, where 3 and 7 are inclusive.Number Of Optuna Trials Pre-backward Feature Selection [--n_optuna_trials_prebackward_selection]
Number of Optuna trials to run before running backward feature selection (i.e when using the full feature space)Number of Optuna Trials During Backward Feature Selection [--n_optuna_trials_backward_selection]
Number of Optuna trials to run when performing backward feature selection, for each iteration of BFS.Enable SQLite [--sqlite]
If specified, this will enable the use of SQLite to store Optuna results. This is the easiest way to store results, as it requires no database configuration on your part. It is only appropriate to set this option when you are not using parallelization. However, SQLite is not well suited for concurrent database access, so it should not be used when training models in parallel.Test Set Performance Number Of Features [--test_set_perf_num_features]
Number of features during BFS at which to save test set performance. This can be a list of numbers e.g 1 2 5 10 or simply "all." It is advisable to just set this option to "all", as computing test set performance requires negligeable compute. However, assuming the models trained during BFS are not deleted, these can always be calculated later (and the script does provide an option to do this by just re-running it, which won't re-train models, but load already trained models and compute the test set performance at the specified number of features).Cell Types To Keep [--cell_types_keep]
Cell types to keep despite not meeting the cell type filter. Format: [NAME_OF_CELL_TYPE]-[DATASET]. It is possible to provide more than 1 e.g C1-D1 C2-D2Enable Use Of Custom Mutations [--custom_mutations]
This is one of multiple mutually exclusive options, which specifies where to load mutations from. It should always be set when using your own mutations, and for the example below. Which options to set use when reproducing the results from the paper will be specified as needed.Tissues To Consider [--tissues_to_consider]
If you only want to use a particular subset of tissues from your scATAC data, then set this option to the name of these tissues as a list (and exactly as they are called in the metadata file created usingcombine_overlaps.R
under the column "tissue")
Which Interval Ranges [--which_interval_ranges]
Which interval ranges to use. Should correspond toWHICH_INTERVAL_RANGES
used previously.
Dataset Abbreviation [--dataset_abbrev]
Abbreviation for the name of the dataset being used. This abbreviation will be appended to feature names. Should be specified if not using one of the datasets we used in the paper.We will use the PBMC/bonemarrow + Lymph-BNHL data we generated previously as an example. Run:
python3 build_ML_model.py \
--cancer_types Lymph-BNHL \
--datasets Greenleaf_test \
--scATAC_cell_number_filter 100 \
--annotation_dir test_annotation \
--seed_range 1-1 \
--fold_for_test_set 1 \
--n_optuna_trials_prebackward_selection 50 \
--n_optuna_trials_backward_selection 50 \
--sqlite \
--test_set_perf_num_features all \
--custom_mutations \
--which_interval_ranges test_ranges \
--dataset_abbrev GL_t
Once the model is done training, we can then obtain a plot of the feature importance of different features at a given iteration of BFS using the script plot_top_features.R
. Before we do this, we outline the important command line options to do this:
Cancer Types [--cancer_types]
The cancer types to plot. Should correspond to cancer type used when training the model. Should be a string, separating tissues by a comma and no space in between e.g "cancer_type1,cancer_type2,..."Cell Type Number Filter [--cell_number_filter]
Should correspond to the filter used when training the model.Datasets [--datasets]
Should correspond to the datasets used when training the model. Should be a string, separating datasets by a comma and no space in between.Top features to plot [--top_features_to_plot]
Number of features at which to plot feature importances (assuming `robustness_analysis` is not enabled, see parallelized option below). Should be a string, separating numbers by a comma and no space in between.Tissues To Consider [--tissues_to_consider]
Should correspond to tissues considered when training the model. Should be a string, separating tissues by a comma and no space in between.Annotation [--annotation]
Should correspond to annotation used when training the model.Seed Range [--seed_range]
Should correspond to seeds used when training the model.Folds For Test Set [--folds_for_test_set]
Unlike when building the model, this parameter is "folds" plural i.e it should be a range in the format "A-B" where A, B are integers between 1 and 10 with A <= BCell Types To Keep [--cell_types_keep]
Should correspond to cell types kept when training the model. Should be a string, separating cell types by a comma and no space in betweenFor our example, we run:
Rscript plot_top_features.R \
--cancer_types=Lymph-BNHL \
--datasets=Greenleaf_test \
--cell_number_filter=100 \
--annotation=test_annotation \
--seed_range=1-1 \
--top_features_to_plot=10,5,2,1 \
--folds_for_test_set=1-1 \
--tissues_to_consider=all
This will generate the plot in
figures/models/XGB/Lymph-BNHL/scATAC_source_Greenleaf_test_cell_number_filter_100_annotation_test_annotation_seed_1_fold_for_test_set_1/backwards_elimination_results/
in the file called permutation_importance_bar_plot.png
. The plot is shown below. Note that as we would expect for Lymph-BNHL, the most important feature turns out to be bone marrow B cells (BMMC B), which corresponds to the putative cell of transformation. Note that each feature in the plot is appended with the number of features being considered. Since we specified top_features_to_plot=10,5,2,1
, we see the permutation importance of different features for each of these different number of features left.
To obtain more confidence in our prediction, we can run the model multiple times, using different random seeds, and different test sets. This is where parallelization comes into play. SQLite, which is what we used in the unparallelized option for storing results from Optuna, is not well suited for concurrent database access. Thus, we used a PostgreSQL database. The first step in thus to set this up. Begin by initializing a new PostgreSQL directory that we'll call sqldb
:
initdb -D sqldb
We will change the configurations of the database to make it less restrictive so that our database can accept connections from other hosts (the jobs that will be running the model). Keep in mind that our models were running on a secure server, you will need to make sure you adjust the security settings depending on how you plan to run the model. To do this, edit the file analysis/ML/sqldb/postgresql.conf
so that:
- listen_addresses = '*'
- max_connections = 10000
Make sure listen_addresses
is uncommented.
Also, in analysis/ML/sqldb/pg_hba.conf
, change the lines under the comments # IPv4 local connections:
and # IPv6 local connections:
to:
# IPv4 local connections:
host all all 0.0.0.0/0 trust
# IPv6 local connections:
host all all ::/0 trust
Next, we need to start the server. We will also get the hostname of the machine that the server is running on; this is helpful for later. Depending on how you do this, the hostname will change every time you start a new server, which you need to do every time you want to run models. Thus, it's best to just create a mini-script that echo
s the hostname to a file, which will then access from the script that runs the model, and then starts the server:
Create a script that you can run. I'll call it startpg
(in my case, it's in ~/.local/bin/
, which is one of the directories included in $PATH
, so running startpg
will run that script):
echo $(hostname) > analysis/ML/postgresql_hostname.txt
~/conda/coo/bin/pg_ctl -D analysis/ML/sqldb start
You will need to modify the path ~/conda/coo/bin/
depending on where your conda environment is installed.
Now, run:
startpg
We've now started the server (and we've recorded the hostname of the machine running it).
We now create a new user for the database called my_user
(can change to whatever you want). We will be asked for a passcode, and I'll use password
here:
createuser -P my_user
We then create a database called optuna_db
associated with this user (again, we can pick any name):
createdb -O my_user optuna_db
Now, change line 839 in ML_utils.py
to reflect the name of the database, as well as the username and password that you chose for your user (note that we're writing the password in the script, which is not good practice, but for our purposes it's fine). In our case:
postgresql://my_user:password@{hostname}:5432/optuna_db
We're now ready to submit multiple jobs in parallel to train our models. To do this, we will use the script prep_ML_model_scripts.py
. We first go over the important command line options:
Cancer Types [--cancer_types]
Analogous to option frombuild_ML_model.py
scATAC Datasets [--datasets]
Analogous to option frombuild_ML_model.py
Cell Type Number Filter [--scATAC_cell_number_filter]
Analogous to option frombuild_ML_model.py
Annotation [--annotation_dir]
Analogous to option frombuild_ML_model.py
Test Set Performance Number Of Features [--test_set_perf_num_features]
Analogous to option frombuild_ML_model.py
Number Of Optuna Trials Pre-backward Feature Selection [--n_optuna_trials_prebackward_selection]
Analogous to option frombuild_ML_model.py
Number of Optuna Trials During Backward Feature Selection [--n_optuna_trials_backward_selection]
Analogous to option frombuild_ML_model.py
Cell Types to Keep [--cell_types_keep]
Analogous to option frombuild_ML_model.py
Tissues To Consider [--tissues_to_consider]
Analogous to option frombuild_ML_model.py
Enable Use Of Custom Mutations [--custom_mutations]
Analogous to option frombuild_ML_model.py
Folds For Test Set [--fold_for_test_set_range]
Analogous to option fromplot_top_features.R
, NOT build_ML_model.py
Seeds Interval [--seed_interval]
Which seeds to use for training. Format: "A-B" where A, B are integers.Seed Interval Step [--seed_interval_step]
Each fold specified byfold_for_test_set_range
will be submitted as an individual job. This parameter determines how many seeds per fold to train for each job. So if e.g seed_interval
is set to 1-10 and seed_interval_step
is set to 2, then the jobs will train seeds 1-2, 3-4, 5-6, 7-8, and 9-10 respectively, where the seeds within a given range are trained sequentially, not in parallel (so here, if e.g we're considering fold 1, then the first job will first train a model using seed 1, then once that is done, will train the model with seed 2)
Cores [--cores]
Number of cores to use per job submissionMemory per Core [--mem_per_core]
Amount of memory to use per core (in GB)Time [--time]
Max time per job. Format: "hours:minutes:seconds"Submit Jobs [--submit_jobs]
Submit the jobs (useful to set off for debugging).Note that if you are not using UGE, you will need to modify lines 204-227 and 247 to reflect your job manager specifics.
We start by testing things out before submitting jobs, so we will first omit --submit_jobs
:
python3 prep_ML_model_scripts.py \
--cancer_types Lymph-BNHL \
--scATAC_cell_number_filter 100 \
--annotation_dir test_annotation \
--datasets Greenleaf_test \
--seed_interval=1-10 \
--n_optuna_trials_prebackward_selection 50 \
--n_optuna_trials_backward_selection 50 \
--fold_for_test_set_range 1-10 \
--test_set_perf_num_features all \
--cores=8 \
--seed_interval_step=5 \
--custom_mutations \
--mem_per_core 1 \
--which_interval_ranges=test_ranges \
--dataset_abbrev GL_t
This creates 20 bash scripts in analysis/ML
which correspond to the jobs that would be submitted if we were to keep --submit_jobs
. To make sure the jobs will run, we test out one of the scripts:
sh cancer_types_Lymph-BNHL_datasets_Greenleaf_test_scATAC_cell_number_filter_100_annotation_dir_test_annotation_tissues_to_consider_all_seed_range_1-5_fold_for_test_set_1.sh
If this runs fine, we can stop it and re-run the above command, this time with the --submit_jobs
option enabled:
python3 prep_ML_model_scripts.py \
--cancer_types Lymph-BNHL \
--scATAC_cell_number_filter 100 \
--annotation_dir test_annotation \
--datasets Greenleaf_test \
--seed_interval=1-10 \
--n_optuna_trials_prebackward_selection 50 \
--n_optuna_trials_backward_selection 50 \
--fold_for_test_set_range 1-10 \
--test_set_perf_num_features all \
--cores=8 \
--seed_interval_step=5 \
--custom_mutations \
--mem_per_core 1 \
--which_interval_ranges=test_ranges \
--dataset_abbrev GL_t \
--submit_jobs
Once the jobs end, we can then plot results using plot_top_features.R
. Here, we introduce new command line options that are helpful for the parallelized setting.
Robustness Analysis [--robustness_analysis]
This specifies that we're operating in the parallelized setting, so that we obtain the correct plots.Top Features To Plot [--top_features_to_plot]
This is not a new option, but it means something different here. When--robustness_analysis
is enabled, --top_features_to_plot
specifies the number of features remaining at which to plot the R^2 of the model when the top appearing feature across runs is actually the top feature for that particular model e.g if Lung AT2 is the top appearing feature across 100 runs of the model, then specifying --top_features_to_plot=1,2,5,10
will plot boxplots of the R^2 when 1, 2, 5, and 10 features remain after backward feature selection, and will only plot those models where the top feature is actually Lung AT2. For an example, see Supplementary Figure 1 of the paper, last column.
Top Features To Plot For Feature Importance [--top_features_to_plot_feat_imp]
This specifies the number of features remaining at which to plot feature importances (boxplots) for various features across the runs. Only the top 5 most important features will be plotted, where importance is first measured based on the number of times the feature appears, and ties are broken by median feature importance. For an example, see Figure 1, 1D.We now plot:
Rscript plot_top_features.R \
--cancer_types Lymph-BNHL \
--datasets Greenleaf_test \
--cell_number_filter 100 \
--annotation test_annotation \
--seed_range 1-10 \
--top_features_to_plot_feat_imp 2,5,10 \
--top_features_to_plot 1,2,5,10 \
--folds_for_test_set 1-10 \
--robustness_analysis \
--tissues_to_consider all
This creates multiple figures in figures/models/XGB/Lymph-BNHL/scATAC_source_Greenleaf_pbmc_bm_cell_number_filter_100_annotation_test_annotation_seed_all_seeds/
(displayed below, respectively):
Lymph-BNHL_top_feature_appearances.pdf
: this has the number of times different features appeared as the top feature across 100 runs of the modelLymph-BNHL_feature_importance_with_2_5_10_features_top_5_features.pdf
: this has the feature importances of the top 5 features across 100 runs of the model, where top 5 features is defined as the features that appeared the most after 15 iterations of BFS, with ties broken by median feature importanceLymph-BNHL_top_feature_test_set_perf_with_10_5_2_1_features.pdf
: this has the test set performance of the models with 10, 5, 2, and 1 features where the top feature corresponded to the eventual predicted COT.
To download all raw scATAC fragment files associated with this paper (except lung and kidney):
cd data/scripts
sh download_all_scatac_data.sh
This will download the different scATAC datasets used in their respective directories in data/bed_files
. Lung and Kidney data can be downloaded at https://drive.google.com/file/d/1Fmt4NlJ-J-EXqM7WCKckT1l9ldXRoNLK/view?usp=drive_link.
Annotation data for all datasets can be found at https://drive.google.com/file/d/1ZXkQI90rX8UVYKEa16jOuL-u5npUyHSi/view?usp=drive_link. Download these and place them in data/metadata
.
We also provide the processed scATAC (ready for input into ML models) in the following directories:
data/processed_data/count_overlap_data/combined_count_overlaps/finalized_annotation/
: Contains the most common annotations used across the paperdata/processed_data/count_overlap_data/combined_count_overlaps/Greenleaf_colon_remove_cancer_merge_normal_unaffected/
: In addition to normal colon scATAC data, this also contains data from polyp tissue for colon.data/processed_data/count_overlap_data/combined_count_overlaps/new_intermediate_blood_bm_annotation/
: Contains the data from Granja et. al separated by CD34+ and CD34- tissue.
If you want to run the model with the data from one of the latter two directories, but would also like to include other data, make sure to copy over the data from the first directory to one of the latter two. e.g if you want to run the model with fetal brain data from Trevino et. al (2021) and PBMC/bone marrow separated by CD34+ and CD34- (the third directory), you need to copy Greenleaf_brain_combined_count_overlaps.rds
and Greenleaf_brain_combined_count_overlaps_metadata.rds
from finalized_annotation
to new_intermediate_blood_bm_annotation.