Merge branch 'development' into sjspielman/499-external-celltype-docs
sjspielman committed Nov 15, 2023
2 parents ffa8edf + 35d7cd4 commit 447f0c9
Showing 2 changed files with 73 additions and 50 deletions.
3 changes: 2 additions & 1 deletion .pre-commit-config.yaml
@@ -22,7 +22,7 @@ repos:
rev: v2.2.0
hooks:
- id: doctoc
args: [--update-only]
args: [--update-only, --title=**Table of Contents**]
- repo: https://github.com/astral-sh/ruff-pre-commit
# Ruff for linting and formatting python
rev: v0.1.5
@@ -44,6 +44,7 @@ repos:
rev: v3.0.3
hooks:
- id: prettier
exclude: '\.md$'
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.5.0
hooks:
120 changes: 71 additions & 49 deletions internal-instructions.md
@@ -13,7 +13,6 @@

<!-- END doctoc generated TOC please keep comment here to allow auto update -->


## Running `scpca-nf` as a Data Lab staff member

This section provides instructions for running the main workflow, found in [`main.nf`](main.nf).
@@ -39,28 +38,28 @@ nextflow run AlexsLemonade/scpca-nf -profile ccdl,batch

There are several flags and/or parameters which you may additionally wish to specify, as follows.

+ Nextflow flags:
  + `-resume`: Resume workflow from most recent checkpoint
  + `-with-tower`: Use `Nextflow Tower` to monitor workflow (requires separate [Nextflow Tower registration](https://tower.nf/))
+ Workflow parameters:
  + `--run_ids list,of,ids`: A custom comma-separated list of ids (run, library, or sample) for this run.
  + `--project list,of,project_ids`: A custom comma-separated list of project ids for this run
    [The default](config/profile_ccdl.config) run ids are `"SCPCR000001,SCPCS000101"`.
  + `--repeat_mapping`: Use this flag to repeat mapping, even if results already exist.
    + By default, the workflow checks whether each library has existing `alevin-fry` or `salmon` mapping results, and skips mapping for libraries with existing results.
      Using this flag will override that default behavior and repeat mapping even if the given library's results exist.
    + For more implementation details, please refer to the [external instructions](external-instructions.md#repeating-mapping-steps).
  + `--skip_genetic_demux`: Use this flag to skip genetic demultiplexing, which is turned on by default.
    + Genetic demultiplexing requires mapping of both bulk and single-cell data, followed by SNP calling and genetic demultiplexing, which can be quite time consuming.
    + When genetic demultiplexing is skipped, the workflow will still perform cellhash-based demultiplexing, if available for a given library.
  + `--repeat_genetic_demux`: Use this flag to repeat genetic demultiplexing, even if results already exist.
    + By default, the workflow checks whether each library has existing genetic demultiplexing results, and skips genetic demultiplexing for libraries with existing results.
      Using this flag will override that default behavior and repeat genetic demultiplexing even if the given library's results exist.
  + `--perform_celltyping`: Use this flag to perform cell type annotation, which is turned off by default.
  + `--repeat_celltyping`: Use this flag to repeat cell type annotation, even if results already exist.
    + By default, the workflow checks whether each library has existing cell type annotation results for `SingleR` and/or `CellAssign` (depending on references for that library).
      Using this flag will override that default behavior and repeat cell type annotation even if the given library's results exist.
    + This flag is _only considered_ if `--perform_celltyping` is also used.
- Nextflow flags:
  - `-resume`: Resume the workflow from the most recent checkpoint.
  - `-with-tower`: Use `Nextflow Tower` to monitor the workflow (requires separate [Nextflow Tower registration](https://tower.nf/)).
- Workflow parameters:
  - `--run_ids list,of,ids`: A custom comma-separated list of ids (run, library, or sample) for this run.
    [The default](config/profile_ccdl.config) run ids are `"SCPCR000001,SCPCS000101"`.
  - `--project list,of,project_ids`: A custom comma-separated list of project ids for this run.
  - `--repeat_mapping`: Use this flag to repeat mapping, even if results already exist.
    - By default, the workflow checks whether each library has existing `alevin-fry` or `salmon` mapping results, and skips mapping for libraries with existing results.
      Using this flag will override that default behavior and repeat mapping even if the given library's results exist.
    - For more implementation details, please refer to the [external instructions](external-instructions.md#repeating-mapping-steps).
  - `--skip_genetic_demux`: Use this flag to skip genetic demultiplexing, which is turned on by default.
    - Genetic demultiplexing requires mapping of both bulk and single-cell data, followed by SNP calling and genetic demultiplexing, which can be quite time-consuming.
    - When genetic demultiplexing is skipped, the workflow will still perform cellhash-based demultiplexing, if available for a given library.
  - `--repeat_genetic_demux`: Use this flag to repeat genetic demultiplexing, even if results already exist.
    - By default, the workflow checks whether each library has existing genetic demultiplexing results, and skips genetic demultiplexing for libraries with existing results.
      Using this flag will override that default behavior and repeat genetic demultiplexing even if the given library's results exist.
  - `--perform_celltyping`: Use this flag to perform cell type annotation, which is turned off by default.
  - `--repeat_celltyping`: Use this flag to repeat cell type annotation, even if results already exist.
    - By default, the workflow checks whether each library has existing cell type annotation results for `SingleR` and/or `CellAssign` (depending on references for that library).
      Using this flag will override that default behavior and repeat cell type annotation even if the given library's results exist.
    - This flag is _only considered_ if `--perform_celltyping` is also used.
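
The flags above can be combined in a single call. As a rough illustration only (this helper is not part of `scpca-nf`; the function name `build_scpca_command` and its keyword arguments are hypothetical), the Python sketch below assembles the argument list for `nextflow run` from the options listed above. Raising an error when `repeat_celltyping` is given without `perform_celltyping` is a choice made for this sketch; the workflow itself simply does not consider the flag in that case.

```python
import shlex

# Base invocation taken from the instructions above.
BASE_COMMAND = ["nextflow", "run", "AlexsLemonade/scpca-nf", "-profile", "ccdl,batch"]

# Boolean workflow flags described in the list above.
KNOWN_FLAGS = {
    "repeat_mapping",
    "skip_genetic_demux",
    "repeat_genetic_demux",
    "perform_celltyping",
    "repeat_celltyping",
}


def build_scpca_command(run_ids=None, project=None, resume=False, **flags):
    """Return the argument list for a `nextflow run` call (hypothetical helper).

    `flags` maps workflow flag names (without the leading `--`) to booleans,
    for example `repeat_mapping=True`.
    """
    unknown = set(flags) - KNOWN_FLAGS
    if unknown:
        raise ValueError(f"unknown flags: {sorted(unknown)}")
    # Assumption of this sketch: fail loudly instead of letting the flag be ignored.
    if flags.get("repeat_celltyping") and not flags.get("perform_celltyping"):
        raise ValueError("--repeat_celltyping requires --perform_celltyping")

    cmd = list(BASE_COMMAND)
    if resume:
        cmd.append("-resume")
    if run_ids:
        cmd += ["--run_ids", ",".join(run_ids)]
    if project:
        cmd += ["--project", ",".join(project)]
    # Append enabled boolean flags in a stable (alphabetical) order.
    cmd += [f"--{name}" for name in sorted(KNOWN_FLAGS) if flags.get(name)]
    return cmd


# Print a shell-ready command line for one run id with mapping repeated.
print(shlex.join(build_scpca_command(run_ids=["SCPCR000001"], repeat_mapping=True)))
```

Building a list rather than a single string keeps each argument separate, so the result could be passed directly to `subprocess.run` without shell quoting concerns.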

Please refer to [`nextflow.config`](nextflow.config) and [other configuration files](config/) for other parameters which can be modified.

@@ -80,7 +79,6 @@ Please refer to our [`CONTRIBUTING.md`](CONTRIBUTING.md#stub-workflows) for more

### Running `scpca-nf` for ScPCA Portal release


When running the workflow for a project or group of samples that is ready to be released on the ScPCA Portal, please use the tag for the latest release:

```
@@ -116,29 +114,33 @@ Make sure to adjust the settings to make the zip file publicly accessible.

## Maintaining references for `scpca-nf`


Inside the `references` folder are files and scripts related to maintaining the reference files available for use with `scpca-nf`.

1. `ref-metadata.tsv`: Each row of this TSV file corresponds to a reference that is available for mapping with `scpca-nf`.
The columns included specify the `organism` (e.g., `Homo_sapiens`), `assembly`(e.g.,`GRCh38`), and `version`(e.g., `104`) of the `fasta` obtained from [Ensembl](https://www.ensembl.org/index.html) that was used to build the reference files.
This file is used as input to the `build-index.nf` workflow, which will create all required index files for `scpca-nf` for the listed organisms in the metadata file, provided the `fasta` and `gtf` files are stored in the proper location on S3.
See [instructions for adding additional organisms](#adding-additional-organisms) for more details.
The columns specify the `organism` (e.g., `Homo_sapiens`), `assembly` (e.g., `GRCh38`), and `version` (e.g., `104`) of the `fasta` obtained from [Ensembl](https://www.ensembl.org/index.html) that was used to build the reference files.
This file is used as input to the `build-index.nf` workflow, which will create all required index files for `scpca-nf` for the listed organisms in the metadata file, provided the `fasta` and `gtf` files are stored in the proper location on S3.
See [instructions for adding additional organisms](#adding-additional-organisms) for more details.

2. `scpca-refs.json`: Each entry of this file contains a supported reference for mapping with `scpca-nf` and the name used to refer to that supported reference, e.g., `Homo_sapiens.GRCh38.104`.
For each supported reference, a list of all the reference files that are needed to run `scpca-nf` will be included.
This file is required as input to `scpca-nf`.
For each supported reference, a list of all the reference files that are needed to run `scpca-nf` will be included.
This file is required as input to `scpca-nf`.

3. `celltype-reference-metadata.tsv`: Each row of this TSV file corresponds to a supported cell type reference available for cell type assignment using `add-celltypes.nf`.
For all references, the following columns will be populated: `celltype_ref_name`, `celltype_ref_source` (e.g., `celldex`), supported `celltype_method` (e.g., `SingleR`).
All references obtained from the `PanglaoDB` source also require an `organs` column containing the list of supported `PanglaoDB` organs to include when building the reference.
This should be a comma-separated list of all organs to include.
To find all possible organs, see the `organs` column of `PanglaoDB_markers_2020-03-27.tsv`.
This file is required as input to the `build-celltype-ref.nf` workflow, which will create all required cell type references for performing cell type annotation from the main workflow.
See [instructions for adding additional cell type references](#adding-additional-cell-type-references) for more details.
3. `celltype-reference-metadata.tsv`: Each row of this TSV file corresponds to a supported cell type reference available for cell type annotation.
This file is required as input to the `build-celltype-ref.nf` workflow to create and/or update cell type references.
For all references, the following columns must be populated:

- `celltype_ref_name` (e.g., `BlueprintEncodeData` or `blood-compartment`)
- `celltype_ref_source` (e.g., `celldex` or `PanglaoDB`)
- `celltype_method` (e.g., `SingleR` or `CellAssign`)
- All references obtained from the `PanglaoDB` source also require an `organs` column containing the list of supported `PanglaoDB` organs to include when building the reference.
This should be a comma-separated list of all organs to include.
To find all possible organs, see the `organs` column of `PanglaoDB_markers_2020-03-27.tsv`.

See [instructions for adding additional cell type references](#adding-additional-cell-type-references) for more details.

4. `PanglaoDB_markers_2020-03-27.tsv`: This file is used to build the cell type references from `PanglaoDB`.
This file was obtained from clicking the `get tsv file` button on the [PanglaoDB Dataset page](https://panglaodb.se/markers.html?cell_type=%27choose%27) and replacing the date in the filename with a date in ISO8601 format.
This file is required as input to the `build-celltype-ref.nf` workflow, which will create all required cell type references for the main workflow to use during cell type annotation.
This file was obtained by clicking the `get tsv file` button on the [PanglaoDB Dataset page](https://panglaodb.se/markers.html?cell_type=%27choose%27) and replacing the date in the filename with a date in ISO 8601 format.
This file is required as input to the `build-celltype-ref.nf` workflow, which will create all required cell type references for the main workflow to use during cell type annotation.

### Adding additional organisms

@@ -147,7 +149,7 @@ Adding additional organisms is handled, in part, by the `build-index.nf` workflo
Follow the below steps to add support for additional references:

1. Download the desired `fasta` and `gtf` files for the organism of choice from `Ensembl`.
Add these to the `S3://scpca-references` bucket with the following directory structure, where the root directory here corresponds to the `organism` and the subdirectory corresponds to the `Ensembl` version:
Add these to the `S3://scpca-references` bucket with the following directory structure, where the root directory here corresponds to the `organism` and the subdirectory corresponds to the `Ensembl` version:

```
homo_sapiens
@@ -166,16 +168,36 @@ homo_sapiens

### Adding additional cell type references

Adding additional organisms is handled, in part, by the `build-celltype-ref.nf` workflow.

Adding additional references to use for cell type annotation is handled by the `build-celltype-ref.nf` workflow.

Follow the below steps to add support for additional cell type references.
Reference files are created and automatically named by the `build-celltype-ref.nf`.
We currently only support `celldex` and `PanglaoDB` for reference sources for `SingleR` and `CellAssign` cell type annotation, respectively.

1. Add the `celltype_ref_name`, `celltype_ref_source`, `celltype_method`, and `organs` (if applicable) for the new reference to `celltype-reference-metadata.tsv`.
2. Generate the new cell type references using `nextflow run build-celltype-ref.nf -profile ccdl,batch` from the root directory of this repository.
3. Ensure that the new reference files are public and in the correct location on S3:
- `SingleR` reference files, which are the full reference datasets from the `celldex` package, should be in `s3://scpca-references/celltype/singler_references` named as `celldex-<reference name>.rds`.
- `SingleR` trained model files for the given Nextflow parameter `singler_label_name` should be in `s3://scpca-references/celltype/singler_models` named as `<reference name>_models.rds`.
- `CellAssign` organ-specific reference gene matrices should be in `s3://scpca-references/celltype/cellassign_references` named as `PanglaoDB-<organ>.tsv`.
Follow these steps to add support for additional cell type references.

1. Add the `celltype_ref_name`, `celltype_ref_source`, `celltype_method`, and `organs` (if applicable) for the new reference to [`celltype-reference-metadata.tsv`](references/celltype-reference-metadata.tsv).

- `<celltype_ref_name>` represents the reference dataset name.
For use with `SingleR`, this should be taken directly from a `celldex` dataset.
For `CellAssign`, names are established by the Data Lab as `<tissue/organ>-compartment` to represent a set of markers for a given tissue/organ.
- `<celltype_ref_source>` represents the reference dataset source. Currently only `celldex` and `PanglaoDB` are supported for `SingleR` and `CellAssign`, respectively.
- `<celltype_method>` represents which annotation method to use with the specified reference, either `SingleR` or `CellAssign`.
- `organs` lists the organs to include when building references with `PanglaoDB` as the `celltype_ref_source`.
This must be a comma-separated list of all organs to include.

2. Generate the new cell type reference using `nextflow run build-celltype-ref.nf -profile ccdl,batch` from the root directory of this repository.
3. Ensure that the new reference files are public and in the correct location on S3.

`SingleR` reference files, which are the full reference datasets from the `celldex` package, should be in `s3://scpca-references/celltype/singler_references` and named as `<celltype_ref_name>_<celltype_ref_source>_<version>.rds`.
Corresponding "trained" model files for use in the cell type annotation workflow should be stored in `s3://scpca-references/celltype/singler_models`, named as `<celltype_ref_name>_<celltype_ref_source>_<version>_model.rds`.

- `<celltype_ref_name>` is a given `celldex` dataset.
- Note that the workflow parameter `singler_label_name` will determine which `celldex` dataset label is used for annotation; by default, we use `"label.ont"` (ontology labels).
- `<celltype_ref_source>` is `celldex`.
- `<version>` is the `celldex` version used during reference building, where we use dashes in place of periods (e.g., version `x.y.z` would be represented as `x-y-z`).

`CellAssign` organ-specific reference gene matrices should be stored in `s3://scpca-references/celltype/cellassign_references` and named as `<celltype_ref_name>_<celltype_ref_source>_<date>.tsv`.

- `<celltype_ref_name>` is a given reference name established by the Data Lab.
- `<celltype_ref_source>` is `PanglaoDB`.
- `<date>` is the `PanglaoDB` date, which serves as their version, in ISO 8601 format.
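
To make the naming conventions above concrete, here is a minimal Python sketch that derives the S3 locations a new reference is expected to occupy. The function name `expected_reference_files` is hypothetical and not part of this repository, and the celldex version `1.10.1` and the organ value `Blood` in the example calls are placeholders chosen for illustration, not values taken from the workflow. The source/method pairing check reflects the statement above that only `celldex` (for `SingleR`) and `PanglaoDB` (for `CellAssign`) are currently supported.

```python
from datetime import date

S3_ROOT = "s3://scpca-references/celltype"

# Supported source/method pairings, as stated in the instructions above.
METHOD_FOR_SOURCE = {"celldex": "SingleR", "PanglaoDB": "CellAssign"}


def expected_reference_files(ref_name, ref_source, method, version, organs=""):
    """Return the S3 paths a reference should occupy after building.

    `version` is the celldex version (e.g. "x.y.z") for celldex references,
    or the PanglaoDB date in ISO 8601 format for PanglaoDB references.
    """
    if METHOD_FOR_SOURCE.get(ref_source) != method:
        raise ValueError(f"unsupported source/method pair: {ref_source}/{method}")

    if ref_source == "celldex":
        # Periods in the celldex version are replaced with dashes.
        stem = f"{ref_name}_celldex_{version.replace('.', '-')}"
        return [
            f"{S3_ROOT}/singler_references/{stem}.rds",
            f"{S3_ROOT}/singler_models/{stem}_model.rds",
        ]

    # PanglaoDB references need an organs list in the metadata file.
    if not organs.strip():
        raise ValueError("PanglaoDB references require a comma-separated organs list")
    date.fromisoformat(version)  # raises ValueError unless formatted as YYYY-MM-DD
    return [f"{S3_ROOT}/cellassign_references/{ref_name}_PanglaoDB_{version}.tsv"]


# Placeholder inputs: the version and organ values below are illustrative only.
print(expected_reference_files("BlueprintEncodeData", "celldex", "SingleR", "1.10.1"))
print(expected_reference_files("blood-compartment", "PanglaoDB", "CellAssign", "2020-03-27", organs="Blood"))
```

A check like this could be run after `build-celltype-ref.nf` completes to compare the expected paths against what is actually present in the bucket.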
