AlexsLemonade · jashapiro · Sep 24, 2024 · Aug 13, 2024 · Aug 13, 2024 · Aug 13, 2024
diff --git a/.github/workflows/spell-check.yml b/.github/workflows/spell-check.yml
@@ -1,4 +1,3 @@
-
 name: Spell check Markdown files
 
 # Controls when the action will run.
@@ -14,29 +13,32 @@ jobs:
   # This workflow contains a single job called "spell check"
   spell-check:
     runs-on: ubuntu-latest
-    container:
-      image: rocker/tidyverse:4.3.2
 
     # Steps represent a sequence of tasks that will be executed as part of the job
     steps:
-      - uses: actions/checkout@v2
-
-      - name: Install packages
-        run: Rscript --vanilla -e "install.packages('spelling', repos = c(CRAN = 'https://cloud.r-project.org'))"
+      - name: Checkout
+        uses: actions/checkout@v4
 
-      - name: Run spell check
-        id: spell_check_run
+      - name: Remove files that do not need to be spellchecked
         run: |
-          results=$(Rscript --vanilla "scripts/spell-check.R")
-          echo "::set-output name=sp_chk_results::$results"
-          cat spell_check_errors.tsv
-      - name: Archive spelling errors
-        uses: actions/upload-artifact@v2
+          rm ./LICENSE
+
+      - name: Spell check action
+        uses: alexslemonade/spellcheck@v0
+        id: spell
         with:
-          name: spell-check-results
+          dictionary: components/dictionary.txt
+
+      - name: Upload spell check errors
+        uses: actions/upload-artifact@v4
+        id: artifact-upload-step
+        with:
+          name: spell_check_errors
           path: spell_check_errors.tsv
 
-      # If there are too many spelling errors, this will stop the workflow
-      - name: Check spell check results - fail if too many errors
-        if: ${{ steps.spell_check_run.outputs.sp_chk_results > 0 }}
-        run: exit 1
+      - name: Fail if there are spelling errors
+        if: steps.spell.outputs.error_count > 0
+        run: |
+          echo "There were ${{ steps.spell.outputs.error_count }} errors"
+          column -t spell_check_errors.tsv
+          exit 1
diff --git a/docs/CHANGELOG.md b/docs/CHANGELOG.md
@@ -12,6 +12,11 @@ For more information about `AlexsLemonade/scpca-nf` versions, please see [the re
 <!-- PUT THE NEW CHANGELOG ENTRY RIGHT BELOW THIS -->
 <!-------------------------------------------------->
 
+## 2024.09.24
+
+* Metadata for all samples from all projects on the Portal can now be downloaded in a single tab-separated values file.
+* For more information on what to expect in the metadata file, see the {ref}`metadata section of the Downloadable files page <download_files:metadata>`.
+
 ## 2024.08.13
 
 * A new column, `age_timing`, is now present in the sample metadata tables included with each download.

diff --git a/docs/download_files.md b/docs/download_files.md
@@ -164,10 +164,8 @@ Metadata for all samples on the Portal is available to download separately from
 Each project page has an option to download metadata for all of its samples as a single zip file containing the `metadata.tsv` file and a `README.md` file.
 Project-specific metadata will contain all columns listed in [the above table](#metadata) and any additional project-specific columns, such as treatment or outcome.
 
-<!--
 Additionally, a single TSV file containing the metadata for all samples from all projects on the Portal is available for download.
 The Portal-wide metadata will contain all columns listed in [the above table](#metadata).
--->
 
 ## Multiplexed sample libraries
 
@@ -185,6 +183,7 @@ Because we do not perform demultiplexing to separate cells from multiplexed libr
 For more on the specific contents of multiplexed library `SingleCellExperiment` objects, see the {ref}`Additional SingleCellExperiment components for multiplexed libraries <sce_file_contents:additional singlecellexperiment components for multiplexed libraries>` section.
 
 The [metadata file](#metadata) for multiplexed libraries (`single_cell_metadata.tsv`) will have the same format as for individual samples, but each row will represent a particular sample/library pair, meaning that there may be multiple rows for each `scpca_library_id`, one for each `scpca_sample_id` within that library.
+In addition, an estimate of the total number of cells for each sample after demultiplexing will be found in the `sample_cell_estimate` column (as opposed to the `sample_cell_count_estimate` column used for non-multiplexed samples).
 
 
 ## Merged object downloads

diff --git a/docs/faq.md b/docs/faq.md
@@ -3,7 +3,7 @@
 ## Why did we use Alevin-fry for processing?
 
 We aimed to process all of the data in the portal such that it is comparable to widely used pipelines, namely Cell Ranger from 10x Genomics.
-In our own benchmarking, we found that [Alevin-fry](https://github.com/COMBINE-lab/alevin-fry) produces very similar results to [Cell Ranger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/count), while allowing faster, more memory efficient processing of single-cell and single-nuclei RNA-sequencing data.
+In our own benchmarking, we found that [Alevin-fry](https://github.com/COMBINE-lab/alevin-fry) ([He _et al._ (2022)](https://doi.org/10.1038/s41592-022-01408-3)) produces very similar results to [Cell Ranger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/count), while allowing faster, more memory-efficient processing of single-cell and single-nuclei RNA-sequencing data.
 In the configuration that we are using ("selective alignment" mapping to a human transcriptome that includes introns), Alevin-fry uses approximately 12-16 GB of memory per sample and completes mapping and quantification in less than an hour.
 By contrast, Cell Ranger uses up to 25-30 GB of memory per sample and takes anywhere from 2-8 hours to align and quantify one sample.
 Quantification of samples processed with both Alevin-fry and Cell Ranger resulted in similar distributions of mapped UMI count per cell and genes detected per cell for both tools.
@@ -17,7 +17,7 @@ We also compared the mean gene expression reported for each gene by both methods
 ![](https://github.com/AlexsLemonade/alsf-scpca/blob/c0c2442d7242f6e06a5ac6d1e45bd1951780da14/analysis/docs-figures/plots/gene_exp_correlation.png?raw=true)
 
 Recent reports from others support our findings.
-[He _et al._ (2021)](https://doi.org/10.1101/2021.06.29.450377) demonstrated that `alevin-fry` can process single-cell and single-nuclei data more quickly and efficiently then other available methods, while also decreasing the false positive rate of gene detection that is commonly seen in methods that utilize transcriptome alignment.
+[He _et al._ (2022)](https://doi.org/10.1038/s41592-022-01408-3) demonstrated that `alevin-fry` can process single-cell and single-nuclei data more quickly and efficiently than other available methods, while also decreasing the false positive rate of gene detection that is commonly seen in methods that utilize transcriptome alignment.
 [You _et al._ (2021)](https://doi.org/10.1101/2021.06.17.448895) and [Tian _et al._ (2019)](https://doi.org/10.1038/s41592-019-0425-8) have also noted that results from different pre-processing workflows for single-cell RNA-sequencing analysis tend to result in compatible results downstream.
 
 ## How do I use the provided RDS files in R?

diff --git a/docs/processing_information.md b/docs/processing_information.md
@@ -4,15 +4,15 @@
 
 ### Mapping and quantification using alevin-fry
 
-We used [`salmon alevin`](https://salmon.readthedocs.io/en/latest/alevin.html) and [`alevin-fry`](https://alevin-fry.readthedocs.io/en/latest/) to generate gene by cell counts matrices for all single-cell and single-nuclei samples.
+We used [`salmon`](https://salmon.readthedocs.io/en/latest) and [`alevin-fry`](https://alevin-fry.readthedocs.io/en/latest/) to generate gene by cell counts matrices for all single-cell and single-nuclei samples.
 In brief, we utilized [selective alignment](#selective-alignment) to the [`splici` index](#reference-transcriptome-index) for all single-cell and single-nuclei samples.
 
 #### Reference transcriptome index
 
 For all samples, we aligned FASTQ files to a reference transcriptome index referred to as the `splici` index.
 The [`splici` index](https://combine-lab.github.io/alevin-fry-tutorials/2021/improving-txome-specificity/) is built using transcripts from both spliced cDNA and intronic regions.
 Inclusion of intronic regions in the index used for alignment allowed us to capture both reads from mature, spliced cDNA and nascent, unspliced cDNA.
-Alignment of RNA-seq data to an index containing intronic regions has been shown to reduce spuriously detected genes ([He _et al._ 2021](https://doi.org/10.1101/2021.06.29.450377), [Kaminow _et al._ 2021](https://www.biorxiv.org/content/10.1101/2021.05.05.442755v1.full#sec-5)).
+Alignment of RNA-seq data to an index containing intronic regions has been shown to reduce spuriously detected genes ([He _et al._ (2022)](https://doi.org/10.1038/s41592-022-01408-3), [Kaminow _et al._ 2021](https://www.biorxiv.org/content/10.1101/2021.05.05.442755v1.full#sec-5)).
 In our hands, we have found that use of the `splici` index led to a more comparable distribution of unique genes found per cell to Cell Ranger than did use of an index obtained from spliced cDNA transcripts only.
 
 #### Selective alignment
@@ -21,7 +21,7 @@ We mapped reads to the transcriptome index using `salmon` with the default "sele
 Briefly, selective alignment uses a mapping score validated approach to identify maximal exact matches between reads and the provided index.
 For all samples, we used selective alignment to the `splici` index.
 
-A more detailed description of the mapping strategy invoked by `salmon` in conjunction with `alevin-fry` can be found in [Srivastava _et al._ (2020)](https://doi.org/10.1186/s13059-020-02151-8).
+More detailed descriptions of the mapping strategy invoked by `salmon` in conjunction with `alevin-fry` can be found in [Srivastava _et al._ (2020)](https://doi.org/10.1186/s13059-020-02151-8) and [He _et al._ (2022)](https://doi.org/10.1038/s41592-022-01408-3).
 
 #### Alevin-fry parameters
 
@@ -99,7 +99,7 @@ In these cases, the cell type annotations obtained from the submitter will be pr
 
 ## ADT quantification from CITE-seq experiments
 
-CITE-seq libraries with reads from antibody-derived tags (ADTs) were also quantified using  [`salmon alevin`](https://salmon.readthedocs.io/en/latest/alevin.html) and [`alevin-fry`](https://alevin-fry.readthedocs.io/en/latest/), rounded to integer values.
+CITE-seq libraries with reads from antibody-derived tags (ADTs) were also quantified using  [`salmon`](https://salmon.readthedocs.io/en/latest) and [`alevin-fry`](https://alevin-fry.readthedocs.io/en/latest/), rounded to integer values.
 
 Reference indices were constructed from the submitter-provided list of antibody barcode sequences corresponding to each library using the `--features` flag of `salmon index`.
 Mapping to these indices followed the same procedures as for RNA-seq data, including mapping with [selective alignment](#selective-alignment) and subsequent [quantification via alevin-fry](#alevin-fry-parameters).
@@ -130,7 +130,7 @@ Multiplexed libraries, or libraries with cells or nuclei from more than one biol
 
 ### Hashtag oligonucleotide (HTO) quantification
 
-HTO reads were also quantified using  [`salmon alevin`](https://salmon.readthedocs.io/en/latest/alevin.html) and [`alevin-fry`](https://alevin-fry.readthedocs.io/en/latest/), rounded to integer values.
+HTO reads were also quantified using  [`salmon`](https://salmon.readthedocs.io/en/latest) and [`alevin-fry`](https://alevin-fry.readthedocs.io/en/latest/), rounded to integer values.
 Reference indices were constructed from the submitter-provided list of HTO sequences corresponding to each library using the `--features` flag of `salmon index`.
 Mapping to these indices followed the same procedures as for RNA-seq data, including mapping with [selective alignment](#selective-alignment) and subsequent [quantification via alevin-fry](#alevin-fry-parameters).
 

diff --git a/scripts/spell-check.R b/scripts/spell-check.R