Merge branch 'development' into sjspielman/499-external-celltype-docs

AlexsLemonade · Nov 15, 2023 · 447f0c9
2 parents ffa8edf + 35d7cd4
commit 447f0c9
Showing 2 changed files with 73 additions and 50 deletions.
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -22,7 +22,7 @@ repos:
     rev: v2.2.0
     hooks:
       - id: doctoc
-        args: [--update-only]
+        args: [--update-only, --title=**Table of Contents**]
   - repo: https://github.com/astral-sh/ruff-pre-commit
     # Ruff for linting and formatting python
     rev: v0.1.5
@@ -44,6 +44,7 @@ repos:
     rev: v3.0.3
     hooks:
       - id: prettier
+        exclude: '\.md$'
   - repo: https://github.com/pre-commit/pre-commit-hooks
     rev: v4.5.0
     hooks:

diff --git a/internal-instructions.md b/internal-instructions.md
@@ -13,7 +13,6 @@
 
 <!-- END doctoc generated TOC please keep comment here to allow auto update -->
 
-
 ## Running `scpca-nf` as a Data Lab staff member
 
 This section provides instructions for running the main workflow, found in [`main.nf`](main.nf).
@@ -39,28 +38,28 @@ nextflow run AlexsLemonade/scpca-nf -profile ccdl,batch
 
 There are several flags and/or parameters which you may additionally wish to specify, as follows.
 
-+ Nextflow flags:
-  + `-resume`: Resume workflow from most recent checkpoint
-  + `-with-tower`: Use `Nextflow Tower` to monitor workflow (requires separate [Nextflow Tower registration](https://tower.nf/))
-+ Workflow parameters:
-  + `--run_ids list,of,ids`: A custom comma-separated list of ids (run, library, or sample) for this run.
-  + `--project list,of,project_ids`: A custom comma-separated list of project ids for this run
-  [The default](config/profile_ccdl.config) run ids are `"SCPCR000001,SCPCS000101"`.
-  + `--repeat_mapping`: Use this flag to repeat mapping, even if results already exist.
-    + By default, the workflow checks whether each library has existing `alevin-fry` or `salmon` mapping results, and skips mapping for libraries with existing results.
-    Using this flag will override that default behavior and repeat mapping even if the given library's results exist.
-    + For more implementation details, please refer to the [external instructions](external-instructions.md#repeating-mapping-steps).
-  + `--skip_genetic_demux`: Use this flag to skip genetic demultiplexing, which is turned on by default.
-    + Genetic demultiplexing requires mapping of both bulk and single-cell data, followed by SNP calling and genetic demultiplexing, which can be quite time consuming.
-    + When genetic demultiplexing is skipped, the workflow will still perform cellhash-based demultiplexing, if available for a given library.
-  + `--repeat_genetic_demux`: Use this flag to repeat genetic demultiplexing, even if results already exist.
-    + By default, the workflow checks whether each library has existing genetic demultiplexing results, and skips genetic demultiplexing for libraries with existing results.
-    Using this flag will override that default behavior and repeat genetic demultiplexing even if the given library's results exist.
-  + `--perform_celltyping`: Use this flag to perform cell type annotation, which is turned off by default.
-  + `--repeat_celltyping`: Use this flag to repeat cell type annotation, even if results already exist.
-    + By default, the workflow checks whether each library has existing cell type annotation results for `SingleR` and/or `CellAssign` (depending on references for that library).
-    Using this flag will override that default behavior and repeat cell type annotation even if the given library's results exist.
-    + This flag is _only considered_ if `--perform_celltyping` is also used.
+- Nextflow flags:
+  - `-resume`: Resume workflow from most recent checkpoint
+  - `-with-tower`: Use `Nextflow Tower` to monitor workflow (requires separate [Nextflow Tower registration](https://tower.nf/))
+- Workflow parameters:
+  - `--run_ids list,of,ids`: A custom comma-separated list of ids (run, library, or sample) for this run.
+  - `--project list,of,project_ids`: A custom comma-separated list of project ids for this run
+    [The default](config/profile_ccdl.config) run ids are `"SCPCR000001,SCPCS000101"`.
+  - `--repeat_mapping`: Use this flag to repeat mapping, even if results already exist.
+    - By default, the workflow checks whether each library has existing `alevin-fry` or `salmon` mapping results, and skips mapping for libraries with existing results.
+      Using this flag will override that default behavior and repeat mapping even if the given library's results exist.
+    - For more implementation details, please refer to the [external instructions](external-instructions.md#repeating-mapping-steps).
+  - `--skip_genetic_demux`: Use this flag to skip genetic demultiplexing, which is turned on by default.
+    - Genetic demultiplexing requires mapping of both bulk and single-cell data, followed by SNP calling and genetic demultiplexing, which can be quite time consuming.
+    - When genetic demultiplexing is skipped, the workflow will still perform cellhash-based demultiplexing, if available for a given library.
+  - `--repeat_genetic_demux`: Use this flag to repeat genetic demultiplexing, even if results already exist.
+    - By default, the workflow checks whether each library has existing genetic demultiplexing results, and skips genetic demultiplexing for libraries with existing results.
+      Using this flag will override that default behavior and repeat genetic demultiplexing even if the given library's results exist.
+  - `--perform_celltyping`: Use this flag to perform cell type annotation, which is turned off by default.
+  - `--repeat_celltyping`: Use this flag to repeat cell type annotation, even if results already exist.
+    - By default, the workflow checks whether each library has existing cell type annotation results for `SingleR` and/or `CellAssign` (depending on references for that library).
+      Using this flag will override that default behavior and repeat cell type annotation even if the given library's results exist.
+    - This flag is _only considered_ if `--perform_celltyping` is also used.
 
 Please refer to [`nextflow.config`](nextflow.config) and [other configuration files](config/) for other parameters which can be modified.
 
@@ -80,7 +79,6 @@ Please refer to our [`CONTRIBUTING.md`](CONTRIBUTING.md#stub-workflows) for more
 
 ### Running `scpca-nf` for ScPCA Portal release
 
-
 When running the workflow for a project or group of samples that is ready to be released on ScPCA portal, please use the tag for the latest release:
 
 ```
@@ -116,29 +114,33 @@ Make sure to adjust the settings to make the zip file publicly accessible.
 
 ## Maintaining references for `scpca-nf`
 
-
 Inside the `references` folder are files and scripts related to maintaining the reference files available for use with `scpca-nf`.
 
 1. `ref-metadata.tsv`: Each row of this TSV file corresponds to a reference that is available for mapping with `scpca-nf`.
-The columns included specify the `organism` (e.g., `Homo_sapiens`), `assembly`(e.g.,`GRCh38`), and `version`(e.g., `104`) of the `fasta` obtained from [Ensembl](https://www.ensembl.org/index.html) that was used to build the reference files.
-This file is used as input to the `build-index.nf` workflow, which will create all required index files for `scpca-nf` for the listed organisms in the metadata file, provided the `fasta` and `gtf` files are stored in the proper location on S3.
-See [instructions for adding additional organisms](#adding-additional-organisms) for more details.
+   The columns included specify the `organism` (e.g., `Homo_sapiens`), `assembly`(e.g.,`GRCh38`), and `version`(e.g., `104`) of the `fasta` obtained from [Ensembl](https://www.ensembl.org/index.html) that was used to build the reference files.
+   This file is used as input to the `build-index.nf` workflow, which will create all required index files for `scpca-nf` for the listed organisms in the metadata file, provided the `fasta` and `gtf` files are stored in the proper location on S3.
+   See [instructions for adding additional organisms](#adding-additional-organisms) for more details.
 
 2. `scpca-refs.json`: Each entry of this file contains a supported reference for mapping with `scpca-nf` and the name used to refer to that supported reference, e.g., `Homo_sapiens.GRCh38.104`.
-For each supported reference, a list of all the reference files that are needed to run `scpca-nf` will be included.
-This file is required as input to `scpca-nf`.
+   For each supported reference, a list of all the reference files that are needed to run `scpca-nf` will be included.
+   This file is required as input to `scpca-nf`.
 
-3. `celltype-reference-metadata.tsv`: Each row of this TSV file corresponds to a supported cell type reference available for cell type assignment using `add-celltypes.nf`.
-For all references, the following columns will be populated: `celltype_ref_name`, `celltype_ref_source` (e.g., `celldex`), supported `celltype_method` (e.g., `SingleR`).
-All references obtained from the `PanglaoDB` source also require an `organs` column containing the list of supported `PanglaoDB` organs to include when building the reference.
-This should be a comma-separated list of all organs to include.
-To find all possible organs, see the `organs` column of `PanglaoDB_markers_2020-03-27.tsv`.
-This file is required as input to the `build-celltype-ref.nf` workflow, which will create all required cell type references for performing cell type annotation from the main workflow.
-See [instructions for adding additional cell type references](#adding-additional-cell-type-references) for more details.
+3. `celltype-reference-metadata.tsv`: Each row of this TSV file corresponds to a supported cell type reference available for cell type annotation.
+   This file is required as input to the `build-celltype-ref.nf` workflow to create and/or update cell type references.
+   For all references, the following columns must be populated:
+
+     - `celltype_ref_name` (e.g., `BlueprintEncodeData` or `blood-compartment`)
+     - `celltype_ref_source` (e.g., `celldex` or `PanglaoDB`)
+     - `celltype_method` (e.g., `SingleR` or `CellAssign`)
+     - All references obtained from the `PanglaoDB` source also require an `organs` column containing the list of supported `PanglaoDB` organs to include when building the reference.
+       This should be a comma-separated list of all organs to include.
+       To find all possible organs, see the `organs` column of `PanglaoDB_markers_2020-03-27.tsv`.
+
+   See [instructions for adding additional cell type references](#adding-additional-cell-type-references) for more details.
 
 4. `PanglaoDB_markers_2020-03-27.tsv`: This file is used to build the cell type references from `PanglaoDB`.
-This file was obtained from clicking the `get tsv file` button on the [PanglaoDB Dataset page](https://panglaodb.se/markers.html?cell_type=%27choose%27) and replacing the date in the filename with a date in ISO8601 format.
-This file is required as input to the `build-celltype-ref.nf` workflow, which will create all required cell type references for the main workflow to use during cell type annotation.
+   This file was obtained from clicking the `get tsv file` button on the [PanglaoDB Dataset page](https://panglaodb.se/markers.html?cell_type=%27choose%27) and replacing the date in the filename with a date in ISO8601 format.
+   This file is required as input to the `build-celltype-ref.nf` workflow, which will create all required cell type references for the main workflow to use during cell type annotation.
 
 ### Adding additional organisms
 
@@ -147,7 +149,7 @@ Adding additional organisms is handled, in part, by the `build-index.nf` workflo
 Follow the below steps to add support for additional references:
 
 1. Download the desired `fasta` and `gtf` files for the organism of choice from `Ensembl`.
-Add these to the `S3://scpca-references` bucket with the following directory structure, where the root directory here corresponds to the `organism` and the subdirectory corresponds to the `Ensembl` version:
+   Add these to the `S3://scpca-references` bucket with the following directory structure, where the root directory here corresponds to the `organism` and the subdirectory corresponds to the `Ensembl` version:
 
 ```
 homo_sapiens
@@ -166,16 +168,36 @@ homo_sapiens
 
 ### Adding additional cell type references
 
-Adding additional organisms is handled, in part, by the `build-celltype-ref.nf` workflow.
-
+Adding additional references to use for cell type annotation is handled by the `build-celltype-ref.nf` workflow.
 
-Follow the below steps to add support for additional cell type references.
+Reference files are created and automatically named by the `build-celltype-ref.nf` workflow.
 We currently only support `celldex` and `PanglaoDB` for reference sources for `SingleR` and `CellAssign` cell type annotation, respectively.
 
-1. Add the `celltype_ref_name`, `celltype_ref_source`, `celltype_method`, and `organs` (if applicable) for the new reference to `celltype-reference-metadata.tsv`.
-2. Generate the new cell type references using `nextflow run build-celltype-ref.nf -profile ccdl,batch` from the root directory of this repository.
-3. Ensure that the new reference files are public and in the correct location on S3:
-    - `SingleR` reference files, which are the full reference datasets from the `celldex` package, should be in `s3://scpca-references/celltype/singler_references` named as `celldex-<reference name>.rds`.
-    - `SingleR` trained model files for the given Nextflow parameter `singler_label_name` should be in `s3://scpca-references/celltype/singler_models` named as `<reference name>_models.rds`.
-    - `CellAssign` organ-specific reference gene matrices should be in `s3://scpca-references/celltype/cellassign_references` named as `PanglaoDB-<organ>.tsv`.
+Follow these steps to add support for additional cell type references.
+
+1. Add the `celltype_ref_name`, `celltype_ref_source`, `celltype_method`, and `organs` (if applicable) for the new reference to [`celltype-reference-metadata.tsv`](references/celltype-reference-metadata.tsv).
+
+    - `<celltype_ref_name>` represents the reference dataset name.
+      For use with `SingleR`, this should be taken directly from a `celldex` dataset.
+      For `CellAssign`, names are established by the Data Lab as `<tissue/organ>-compartment` to represent a set of markers for a given tissue/organ.
+    - `<celltype_ref_source>` represents the reference dataset source. Currently only `celldex` and `PanglaoDB` are supported for `SingleR` and `CellAssign`, respectively.
+    - `<celltype_method>` represents which annotation method to use with the specified reference, either `SingleR` or `CellAssign`.
+    - `organs` indicates which organs to include when creating references with `PanglaoDB` as the `celltype_ref_source`.
+       This must be a comma-separated list of all organs to include.
+
+2. Generate the new cell type reference using `nextflow run build-celltype-ref.nf -profile ccdl,batch` from the root directory of this repository.
+3. Ensure that the new reference files are public and in the correct location on S3.
+
+`SingleR` reference files, which are the full reference datasets from the `celldex` package, should be in `s3://scpca-references/celltype/singler_references` and named as `<celltype_ref_name>_<celltype_ref_source>_<version>.rds`.
+Corresponding "trained" model files for use in the cell type annotation workflow should be stored in `s3://scpca-references/celltype/singler_models`, named as `<celltype_ref_name>_<celltype_ref_source>_<version>_model.rds`.
+
+  - `<celltype_ref_name>` is a given `celldex` dataset.
+    - Note that the workflow parameter `singler_label_name` will determine which `celldex` dataset label is used for annotation; by default, we use `"label.ont"` (ontology labels).
+  - `<celltype_ref_source>` is `celldex`.
+  - `<version>` is the `celldex` version used during reference building, where we use dashes in place of periods (e.g., version `x.y.z` would be represented as `x-y-z`).
+
+`CellAssign` organ-specific reference gene matrices should be stored in `s3://scpca-references/celltype/cellassign_references` and named as `<celltype_ref_name>_<celltype_ref_source>_<date>.tsv`.
 
+  - `<celltype_ref_name>` is a given reference name established by the Data Lab.
+  - `<celltype_ref_source>` is `PanglaoDB`.
+  - `<date>` is the `PanglaoDB` date, which serves as their version, in ISO8601 format.
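
Note: the file naming scheme described in the added lines above can be sketched in Python as a quick sanity check when verifying files on S3. This is a minimal illustration only and is not part of `build-celltype-ref.nf`: the helper names are hypothetical, and the `celldex` version `1.10.1` used in the example is an assumed placeholder rather than a value taken from this commit. The S3 prefixes, the `BlueprintEncodeData` and `blood-compartment` names, and the `2020-03-27` date come from the instructions themselves.

```python
from datetime import date

# S3 prefixes as stated in the instructions above.
SINGLER_REFERENCES = "s3://scpca-references/celltype/singler_references"
SINGLER_MODELS = "s3://scpca-references/celltype/singler_models"
CELLASSIGN_REFERENCES = "s3://scpca-references/celltype/cellassign_references"


def dashed_version(version: str) -> str:
    """Replace periods with dashes, e.g. 'x.y.z' becomes 'x-y-z'."""
    return version.replace(".", "-")


def singler_reference_uri(ref_name: str, celldex_version: str) -> str:
    """Expected location of a full celldex reference dataset (hypothetical helper)."""
    name = f"{ref_name}_celldex_{dashed_version(celldex_version)}.rds"
    return f"{SINGLER_REFERENCES}/{name}"


def singler_model_uri(ref_name: str, celldex_version: str) -> str:
    """Expected location of the corresponding trained model (hypothetical helper)."""
    name = f"{ref_name}_celldex_{dashed_version(celldex_version)}_model.rds"
    return f"{SINGLER_MODELS}/{name}"


def cellassign_reference_uri(ref_name: str, panglaodb_date: str) -> str:
    """Expected location of a CellAssign reference matrix (hypothetical helper)."""
    # Raises ValueError if the string is not a valid ISO 8601 date;
    # isoformat() normalizes the result to YYYY-MM-DD.
    iso_date = date.fromisoformat(panglaodb_date).isoformat()
    return f"{CELLASSIGN_REFERENCES}/{ref_name}_PanglaoDB_{iso_date}.tsv"


if __name__ == "__main__":
    # "1.10.1" is an assumed celldex version, used only for illustration.
    print(singler_reference_uri("BlueprintEncodeData", "1.10.1"))
    print(singler_model_uri("BlueprintEncodeData", "1.10.1"))
    print(cellassign_reference_uri("blood-compartment", "2020-03-27"))
```

Comparing these expected paths against a bucket listing would be one way to carry out step 3 of the instructions; whether the files are public still needs to be checked separately.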