Merge branch 'master' into scROSHI
Anne Bertolini committed Oct 10, 2024
2 parents 54e2d27 + 815c0cd commit 31dda98
Showing 26 changed files with 2,173 additions and 1,505 deletions.
12 changes: 5 additions & 7 deletions .github/workflows/linter.yml
@@ -16,10 +16,8 @@ jobs:
uses: actions/checkout@v2

- name: Lint Code Base
uses: github/super-linter@v4
env:
VALIDATE_ALL_CODEBASE: false
DEFAULT_BRANCH: master
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

VALIDATE_SNAKEMAKE_SNAKEFMT: true
uses: snakemake/[email protected]
with:
directory: .
snakefile: workflow/snakefile_basic.smk
args: "--lint"
63 changes: 63 additions & 0 deletions CHANGELOG.md
@@ -1,18 +1,80 @@
# Changelog

## [2.1.1] - 2024-08-15

### Changed

- adapt example cell type classification genes for ovarian cancer and melanoma samples.


## [2.1.0] - 2024-07-30

### Changed

- update cellranger rules
have new `cellranger_count_8` rule that includes syntax changes of cellranger v8. The new rule is the default; if an older version of cellranger should be used with the rule `cellranger_count` the
`ruleorder: cellranger_count > cellranger_count_8`
in the Snakefile must be adapted.
Also, add new rule `gunzip_and_link_cellranger` and separate these steps from the cellranger rule.

- update and describe in main README how cellranger expects to find the raw FASTQ files.

- update conda environments

- `celltyping.yaml`
- `identify_doublets.yaml`
- `sctransform_preprocessing.yaml`

- update `identify_doublets` output

- Have results of `identify_doublets` rule in own subdirectory instead of the `counts_filtered` directory.

- update `generate_qc_plots_*`

- have resulting QC plots written in own subdirectory instead of same directory as count files.
- Improve memory usage.
- Clean up script.

- update `filter_genes_and_cells.R`

- implement iterative filtering to make sure the selected thresholds for genes and cells apply to all genes/cells of the downstream analyses
- Clean up script.

- update `plotting.R`

- add more colours for clusters. Make sure even with a high number of clusters, enough colours are provided.
- make sure all cell types that are not found in a sample are still shown in the legend (with `show.legend = T`, adapt to new ggplot2 default settings)
- Clean up script.

- update `create_hdf5.py`
- make sure the script can work with Human and also Mouse data. Mouse Ensembl gene IDs are longer than 16 characters, and cannot be of type `dtype='S16'`.
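
The iterative filtering mentioned above for `filter_genes_and_cells.R` can be illustrated with a minimal Python sketch. This is not the pipeline's R implementation: the function name, the two thresholds (`min_genes_per_cell`, `min_cells_per_gene`), and the toy matrix are assumptions made purely for illustration. The idea it shows is that removing cells can push a gene below its threshold (and vice versa), so both filters are repeated until nothing changes.

```python
def iterative_filter(matrix, min_genes_per_cell, min_cells_per_gene):
    """Repeat cell and gene filtering until the kept sets stop changing.

    matrix is a list of rows (genes), each row a list of counts per cell.
    Returns the sorted indices of the kept genes and the kept cells.
    Hypothetical helper; thresholds count non-zero entries only.
    """
    genes = set(range(len(matrix)))
    cells = set(range(len(matrix[0]))) if matrix else set()
    while True:
        # Keep cells that express enough of the genes that are still kept.
        keep_cells = {
            c for c in cells
            if sum(1 for g in genes if matrix[g][c] > 0) >= min_genes_per_cell
        }
        # Keep genes that are expressed in enough of the remaining cells.
        keep_genes = {
            g for g in genes
            if sum(1 for c in keep_cells if matrix[g][c] > 0) >= min_cells_per_gene
        }
        if keep_cells == cells and keep_genes == genes:
            return sorted(genes), sorted(cells)
        genes, cells = keep_genes, keep_cells


# Toy genes x cells matrix: a single filtering pass would keep genes 0-2
# and all four cells, although gene 2 and cells 2-3 fail the thresholds
# once the other removals are taken into account.
toy = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
kept_genes, kept_cells = iterative_filter(toy, 2, 2)
print(kept_genes, kept_cells)
```

In this sketch the loop only stops when a full pass removes nothing, which is what guarantees that every remaining gene and cell satisfies the selected thresholds.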

### Fixed

- fix `sctransform_preprocessing.R`

- Filtering of raw input files is not applied to row and column names. This issue should have had no effect as long as filtered input data was provided (with minimum of QC on genes and cells).
- Changed to a check that stops the script if unfiltered input is detected.
- Script linting.


## [2.0.7] - 2023-03-20

### Changed

- specify which library should be used for the function `ggsave` to avoid conflict between the R packages `ggplot2` and `cowplot`

## [2.0.6] - 2023-02-14

### Fixed

- adapt script `query_civic_expr.py` to changed syntax in python package `civicpy` version 3.0. The script no longer works as is with previous versions of the package.
- adapt installation instructions for `civicpy` to require version 3.0

## [2.0.5] - 2023-01-11

### Fixed

- adapt script `query_civic_expr.py` to changed syntax in python package `civicpy` version 2.0. The script no longer works as is with previous versions of the package.
- adapt installation instructions for `civicpy` to require version 2.0

@@ -33,6 +95,7 @@
## [2.0.3] - 2023-03-20

### Changed

- specify which library should be used for the function ggsave to avoid conflict between ggplot2 and cowplot

## [2.0.2] - 2022-08-31
103 changes: 70 additions & 33 deletions README.md
@@ -1,15 +1,15 @@
# scAmpi - Single Cell Analysis mRNA pipeline

### General overview
## General overview

This scAmpi workflow is organized into two main parts: the `scAmpi_basic` part and the `scAmpi_clinical` part, which can be run independently. scAmpi_basic includes general scRNA processing steps, such as mapping, QC, normalisation, unsupervised clustering, cell type classification, and DE analysis.
This scAmpi workflow is organized into two main parts: the `scAmpi_basic` part and the `scAmpi_clinical` part, which can be run independently. scAmpi_basic includes general scRNA processing steps, such as mapping, QC, normalisation, unsupervised clustering, cell type classification, and DE analysis.
For more details see the [scAmpi publication](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010097).

scAmpi_clinical includes the search for disease relevant drug targets for differentially expressed genes. Note that the clinical part is only applied if at least one cluster identified in your sample is indicated as a diseased ("malignant") cell type.

![README_rulegraph](https://user-images.githubusercontent.com/38692323/175028270-2ac20406-720d-4941-bfb9-e924e5f65759.png)

### Installation instructions
## Installation instructions

scAmpi follows the best practices of the Snakemake workflow manager in providing the software needed to run the pipeline in per-rule conda environments. Those environments are specified in the `envs/` directory in yaml files that are named `{rule_name}.yaml`. The easiest way to install and use the software is by running Snakemake with the `--use-conda` parameter. Snakemake will try to find the environments of the yaml files the rules point to, and install them if they are not already available. The directory for installing the conda environments can be specified with the `--conda-prefix` parameter.

@@ -23,47 +23,83 @@ snakemake --use-conda --conda-create-envs-only --conda-prefix /my/directory/for/

- `--use-conda` instructs snakemake to utilize the `conda:` directive in the rules
- `--conda-create-envs-only` specifies that only the installation of conda environments is triggered, not the analysis of the samples.
- *(optional):* with `--conda-prefix /my/directory/for/conda/envs/` a directory for the installation of the conda environments can be specified.
- _(optional):_ with `--conda-prefix /my/directory/for/conda/envs/` a directory for the installation of the conda environments can be specified.

### Installation of tools for initial read mapping and counting
## Installation of tools for initial read mapping and counting

For the read mapping and UMI counting step scAmpi offers pre-defined rules for using either Cellranger or STARsolo. Neither tool is available for installation via conda; both need to be installed separately. Only one of the tools needs to be installed, depending on the method of choice.

- [Cellranger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger): Follow the instructions on the 10xGenomics installation support page to install cellranger and to include it into the PATH.
Webpage: [https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/installation](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/installation)
Webpage: [https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/installation](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/installation)
- [STAR](https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md) as open source alternative to Cellranger. For installation, follow the instructions in the excellent STAR documentation and include it in your PATH.

### Example data
## Example data

For a test run the freely available 10X Genomics data from PBMC cells can be used. A step by step guideline and example config file are provided in the directory `testdata/`. Note that this test run assumes that `cellranger` has been chosen for read mapping.

### Before running the pipeline
## Before running the pipeline

- **internet connection**
Some steps of the scAmpi workflow perform online queries. Please make sure that this is possible on your computing system, e.g. by loading the respective modules to enable the proxy connection. (Most systems will have this enabled per default).
- **internet connection**
Some steps of the scAmpi workflow perform online queries. Please make sure that this is possible on your computing system, e.g. by loading the respective modules to enable the proxy connection. (Most systems will have this enabled per default).

- **config file**
- input directory
Before running the pipeline the `config.yaml` file needs to be adapted to contain the **full path to input fastq files** for the intended analysis. It is provided in the
first section (`inputOutput`) of the config file.
Before running the pipeline the `config.yaml` file needs to be adapted to contain the **full path to input FASTQ files** for the intended analysis. It is provided in the
first section (`inputOutput`) of the config file. Cellranger expects one sub-directory per sample.
- resource information
In addition to the input path, further resource information must be provided in the section `resources`. This information is primarily specifying
input required for the cell type classification and the genomic reference used for the cellranger mapping. An example `config.yaml` file ready for adaptation, as
well as a brief description of the relevant config blocks, is provided in the directory `config/`.
- **sample map**
Provide a "sample_map", i.e. a tab delimited text file listing all samples that should be analysed (one row per sample).
The sample map must contain a column with the header `sample` (see example below). This ID will be used to name files and identify the sample throughout the pipeline.
An example file ready for adaptation is provided in the directory `config/`.
- **sample map**
Provide a "sample_map", i.e. a tab delimited text file listing all samples that should be analysed (one row per sample).
The sample map must contain a column with the header `sample` (see example below). This ID will be used to name files and identify the sample throughout the pipeline.
An example file ready for adaptation is provided in the directory `config/`.

Sample map example:
Sample map example:

```
sample
SAMPLE-1_scR
SAMPLE-2_scR
```
```
sample
SAMPLENAME1
SAMPLENAME2
```

- **raw FASTQ input**
Cellranger expects the input FASTQ files to follow a specific structure:

`/path/to/input_fastqs/SAMPLENAME1/SAMPLENAME1_S[Number]_L00[Lane Number]_[Read Type]_001.fastq.gz`

Where Read Type is one of:

- I1: Sample index read (optional)
- I2: Sample index read (optional)
- R1: Read 1
- R2: Read 2

### Running scAmpi
**NOTE:** SAMPLENAME can only contain the following characters [a-zA-Z0-9_-]+
E.g., if the SAMPLENAME contains a dot, cellranger will stop right away.

For very detailed information and example scenarios see the [10X Cellranger documentation](https://www.10xgenomics.com/support/software/cell-ranger/latest/analysis/inputs/cr-specifying-fastqs).

Cellranger expects a sub-directory per sample.

```
input_fastqs
├── SAMPLENAME1
│   ├── SAMPLENAME1_S4_L001_I1_001.fastq.gz
│   ├── SAMPLENAME1_S4_L001_R1_001.fastq.gz
│   └── SAMPLENAME1_S4_L001_R2_001.fastq.gz
└── SAMPLENAME2
    ├── SAMPLENAME2_S4_L001_I1_001.fastq.gz
    ├── SAMPLENAME2_S4_L001_R1_001.fastq.gz
    └── SAMPLENAME2_S4_L001_R2_001.fastq.gz
```

- **Running cellranger**
The default is now to run the new rule `cellranger_count_8`, which is adapted to the syntax of cellranger v8.
If an older version of cellranger should be used with the rule `cellranger_count`
the `ruleorder: cellranger_count > cellranger_count_8` in the snakefile must be adapted.
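
The sample map and FASTQ naming requirements above can be checked before starting a run. The following is a minimal sketch and not part of scAmpi: the helper names are hypothetical, and the regular expressions are transcribed from the naming pattern and the allowed character set given above (the authoritative rules are in the 10X Cellranger documentation).

```python
import csv
import io
import re

# Allowed characters for sample names, as stated in the NOTE above.
SAMPLE_RE = re.compile(r"[a-zA-Z0-9_-]+")


def read_sample_map(text):
    """Return the sample IDs from a tab-delimited sample map with a `sample` column."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    if not reader.fieldnames or "sample" not in reader.fieldnames:
        raise ValueError("sample map needs a column with the header 'sample'")
    return [row["sample"] for row in reader if row["sample"]]


def invalid_sample_names(samples):
    """Return the sample names that contain characters outside [a-zA-Z0-9_-]."""
    return [s for s in samples if not SAMPLE_RE.fullmatch(s)]


def is_expected_fastq(sample, filename):
    """Check SAMPLE_S[Number]_L00[Lane Number]_[Read Type]_001.fastq.gz for one sample."""
    pattern = re.escape(sample) + r"_S\d+_L00\d_(I1|I2|R1|R2)_001\.fastq\.gz"
    return re.fullmatch(pattern, filename) is not None


samples = read_sample_map("sample\nSAMPLENAME1\nSAMPLE.2\n")
print(invalid_sample_names(samples))
print(is_expected_fastq("SAMPLENAME1", "SAMPLENAME1_S4_L001_R1_001.fastq.gz"))
```

The sketch works on strings only; in practice the file names would come from listing each `input_fastqs/SAMPLENAME/` sub-directory.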

## Running scAmpi

Different use cases of scAmpi are covered by several snakefiles to choose from.

@@ -74,7 +110,7 @@ Different use cases of scAmpi are covered by several snakefiles to choose from.

Please find details below.

### scAmpi_basic part
## scAmpi_basic part

Example call:

@@ -84,15 +120,15 @@ snakemake -s workflow/snakefile_basic.smk --configfile config/config.yaml -j 1 -

Note that if the pipeline is run on a compute cluster with a job scheduling system (e.g. LSF) the commands need to be adjusted accordingly.

### scAmpi_clinical part
## scAmpi_clinical part

Example call (that includes the basic part as well):

```
snakemake -s workflow/snakefile_clinical.smk --configfile config/config.yaml -j 1 -p
```

### A note on using CIViC
## A note on using CIViC

The CIViC query implemented in scAmpi makes use of an offline cache file of the CIViC database. The cache is retrieved with the initial installation of the scAmpi software. Afterwards, users have to manually update the cache file if they want to use a new version.
To update the cache file, load the respective conda environment and open a Python session.
@@ -102,17 +138,18 @@ Then type:
>> civic.update_cache()
```

### A note on the clinical trials query
## A note on the clinical trials query

Information about clinical trials is downloaded from `clinicaltrials.gov` into a zipped file `cancer_clinicalTrials.zip`, which is unzipped for the subsequent queries. The resulting directory contains a large number of files that you can delete after a successful run, keeping only the zipped version.

### Running scAmpi_clinical independently
## Running scAmpi_clinical independently

It is possible to run the scAmpi_clinical part independently of scAmpi_basic, following some restrictions to the file names and formatting.

- Use the master snake file `workflow/snakefile_clinical-only.smk`.
- scAmpi_clinical expects as input the results of a DE analysis on cell cluster level
- The input files must follow the file name convention `SAMPLEID.CLUSTER.txt`

- SAMPLEID is the sample name specified in the sample map
- CLUSTER is the cell cluster ID
- `txt` is the expected suffix
@@ -125,17 +162,17 @@ gene_names diff padj test_statistic pct_nonzero
ATP1A1 1.679 3.05e-15 14.506 81.42
```

Here, "gene_names" contains the HGNC gene symbols, "diff" contains the fold change or a similar value, "padj" contains the adjusted p-value, "test_statistic" contains the value of the test statistics, and "pct_nonzero" contains the percentage of cells in this cluster with non-zero expression in the respective gene.
Results of this clinical pipeline run are the *in-silico* drug prediction and clinical annotations.
Here, "gene_names" contains the HGNC gene symbols, "diff" contains the fold change or a similar value, "padj" contains the adjusted p-value, "test_statistic" contains the value of the test statistic, and "pct_nonzero" contains the percentage of cells in this cluster with non-zero expression in the respective gene.
Results of this clinical pipeline run are the _in-silico_ drug prediction and clinical annotations.
Other side results, e.g. the minimum set cover computation, the plotting of drug predictions on the UMAP, and the gene set enrichment analysis, cannot be created in an independent clinical run as they rely on additional input files generated by the scAmpi_basic part.
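
To check input files for an independent scAmpi_clinical run against the conventions described above, a small validator can help. This is a minimal sketch under stated assumptions, not code from scAmpi: the helper names are hypothetical, the table is assumed to be whitespace- or tab-separated, and the header is assumed to consist of exactly the five columns shown in the example.

```python
EXPECTED_COLUMNS = ["gene_names", "diff", "padj", "test_statistic", "pct_nonzero"]


def parse_de_filename(filename):
    """Split `SAMPLEID.CLUSTER.txt` into (SAMPLEID, CLUSTER)."""
    parts = filename.rsplit(".", 2)
    if len(parts) != 3 or parts[2] != "txt" or not parts[0] or not parts[1]:
        raise ValueError(f"expected SAMPLEID.CLUSTER.txt, got {filename!r}")
    return parts[0], parts[1]


def parse_de_table(text):
    """Parse the DE table into a list of dicts; numeric columns become floats."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty DE table")
    header = lines[0].split()
    if header != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {header}")
    rows = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            raise ValueError(f"wrong number of fields in line: {line!r}")
        row = {"gene_names": fields[0]}
        # All remaining columns are numeric in the example shown above.
        for name, value in zip(header[1:], fields[1:]):
            row[name] = float(value)
        rows.append(row)
    return rows


sample_id, cluster = parse_de_filename("SAMPLENAME1.3.txt")
table = parse_de_table(
    "gene_names\tdiff\tpadj\ttest_statistic\tpct_nonzero\n"
    "ATP1A1\t1.679\t3.05e-15\t14.506\t81.42\n"
)
print(sample_id, cluster, table[0]["gene_names"])
```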

### Adapting/Integrating rules in Snakemake
## Adapting/Integrating rules in Snakemake

Snakemake is a Python-based workflow management system for building and executing pipelines. A pipeline is made up of ["rules"](snake/scAmpi_basic_rules.py) that represent single steps of the analysis. In a [yaml config file](config/config_scAmpi.yaml) parameters and rule-specific input can be adjusted to a new analysis without changing the rules. In a ["master" snake file](snake/snake_scAmpi_basic_master.snake) the desired end points of the analysis are specified. With the input and the desired output defined, Snakemake is able to infer all steps that have to be performed in between.

To change one of the steps, e.g. to a different software tool, one can create a new rule, insert a new code block into the config file, and include the input/output directory of this step in the master snake file. It is important to make sure that the format of the input and output of each rule is compatible with the previous and the subsequent rule. For more detailed information please have a look at the excellent [online documentation](https://snakemake.readthedocs.io/en/stable/index.html) of Snakemake.

### Quick start using test data
## Quick start using test data

To quickly start a scAmpi_basic run with PBMC test data you can follow the following steps:

@@ -144,7 +181,7 @@ To quickly start a scAmpi_basic run with PBMC test data you can follow the follo
- prepare Cellranger software and reference directory
- update the path to the cellranger reference directory in `testdata/config.yaml`
- download example data from the 10xGenomics website (for more detailed instructions see `testdata/README_testdata.md`)
- *optional*: to circumvent the time-consuming mapping step create the directory `results/counts_raw/` in your scAmpi repository, copy the raw matrix `testdata/5k_pbmc_v3.h5.tar` into the directory, extract the archive (e.g. `tar -xvf 5k_pbmc_v3.h5.tar`) and start the test run from this step.
- _optional_: to circumvent the time-consuming mapping step create the directory `results/counts_raw/` in your scAmpi repository, copy the raw matrix `testdata/5k_pbmc_v3.h5.tar` into the directory, extract the archive (e.g. `tar -xvf 5k_pbmc_v3.h5.tar`) and start the test run from this step.
- perform Snakemake dryrun to see a list of steps that will be performed
`snakemake -s workflow/snakefile_basic.smk --configfile testdata/config.yaml -n -p`
- start analysis run
3 changes: 2 additions & 1 deletion config/config.yaml
@@ -60,8 +60,9 @@ tools:
cellranger_count:
call: cellranger
local_cores: 12
mem_mb: 6000
mem_mb: 40000
runtime: 1440
create_bam: "true"
variousParams: ""

starsolo:
3 changes: 2 additions & 1 deletion config/sample_map.tsv
@@ -1,2 +1,3 @@
sample
SAMPLE-1_scR
SAMPLENAME1
SAMPLENAME2
9 changes: 5 additions & 4 deletions required_files/aml/celltype_config_aml.tsv
@@ -1,7 +1,8 @@
Major Subtype
B.cells_find.markers.1.union.group.14_classifier none
Plasma.cells_find.markers.1.group.19_classifier none
AML_find.markers.1.union.group.0.1.3.4.7.8.9.11.17.18_classifier HSC.Prog.putativeAML_bone.marrow_VanGalen19,AML.HSC.Prog.like_aml_VanGalen19,GMP.putativeAML_bone.marrow_VanGalen19,AML.GMP.like_aml_VanGalen19,Myeloid.putativeAML_bone.marrow_VanGalen19,AML.Myeloid.like_aml_VanGalen19
T.cells_find.markers.1.union.group.2.5.6.10.13_classifier T.cells.CD8_normal_Newman15,T.cells.CD4.naive_normal_Newman15,NK.cells.resting_normal_Newman15,NK.cells.activated_normal_Newman15
B.cells_find.markers.TP.AML.5.group.7.11_classifier none
Plasma.cells_find.markers.TP.AML.5.group.12_classifier none
AML_find.markers.TP.AML.5.group.0.1.4.8.10.15_classifier HSC.Prog.putativeAML_bone.marrow_VanGalen19,AML.HSC.Prog.like_aml_VanGalen19,GMP.putativeAML_bone.marrow_VanGalen19,AML.GMP.like_aml_VanGalen19,Myeloid.putativeAML_bone.marrow_VanGalen19,AML.Myeloid.like_aml_VanGalen19
T.cells_find.markers.TP.AML.5.group.2.3.9.16.17_classifier T.cells.CD8_normal_Newman15,T.cells.CD4.naive_normal_Newman15,NK.cells.resting_normal_Newman15,NK.cells.activated_normal_Newman15
Erythroid.cells_find.markers.1.group.12.15_classifier none
Plasmacytoid.dendritic.cell_melanoma_classifier none
Monocyte.like.cells_find.markers.TP.AML.5.group.6_classifier none