
Figure out what to do when the numbers of cells don't match up between CellAssign results and processed SCE object #591

Closed
allyhawkins opened this issue Nov 22, 2023 · 13 comments

@allyhawkins
Member

I was testing the workflow and ran into an issue with rendering the supplemental cell type report. The problem is in creating the CellAssign density plots. When running a library we had already processed, there were now NAs for the CellAssign prediction score, causing an error in:

y_max <- celltype_df$cellassign_max_prediction |>
  split(celltype_df$cellassign_celltype_annotation) |>
  purrr::map_dbl(
    \(x) max(density(x, bw = density_bw)$y)
  ) |>
  max()

After some investigation, it looks like the NAs are coming from barcodes that are not present in the CellAssign results but are present in the processed object. The problem I think we're going to have moving forward is that the exact cells present in the final object may sometimes change slightly when filtering. I double-checked that we set seeds in all the scripts where we need them (including when running CellAssign), but we have seen scenarios where the exact number of cells is altered slightly from run to run.

So we need to decide what to do about repeating CellAssign in that scenario. I think we have a few options:

  1. Remove the ability to skip CellAssign. This option makes me really sad, but it would ensure that any barcodes in the object are in fact getting annotated by CellAssign. 😢
  2. Continue to skip CellAssign, but remove any cells that are not classified before creating plots and print out a warning. The problem with this approach is that it seems misleading to me. It's not that CellAssign couldn't assign a score, it's that we didn't provide that cell to CellAssign to begin with, so we don't really know what it is.

I don't love option 1, but it's the only way I can think of to be sure the barcodes are the same. The only other option would be to actually check that the barcodes in the files match before skipping CellAssign.
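A barcode check along those lines could look something like the following sketch. The object and column names here (`sce`, `cellassign_predictions`, `barcode`) are assumptions for illustration, not the actual workflow variables:

```r
# Sketch: before skipping CellAssign, confirm the stored predictions
# cover exactly the same barcodes as the processed SCE object.
# `sce` and `cellassign_predictions` are hypothetical names.
sce_barcodes <- colnames(sce)
existing_barcodes <- cellassign_predictions$barcode

if (!setequal(sce_barcodes, existing_barcodes)) {
  warning(
    "Existing CellAssign results do not match the cells in the SCE object; ",
    "re-running CellAssign."
  )
  # ... fall through to re-running CellAssign instead of skipping it
}
```

`setequal()` ignores order and duplicates, which is what we want here since we only care whether the same set of cells was annotated.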

The other thing to note is that this is based on the assumption that CellAssign returns a score for every single cell that is provided as input. That's my understanding, and looking back at some of the analysis I did in sc-data-integration, that seemed to be the case. So it seems unlikely that CellAssign just didn't assign a score and also didn't return it in the predictions matrix.

Tagging @jashapiro @sjspielman for any thoughts on this.

@jashapiro
Member

The easy solution in that particular code is to use max(density(x, bw = density_bw)$y, na.rm = TRUE)

But that would still leave NAs that we would have to accommodate for plots and reports. I think that is okay for now, though we might want to add a warning that the CellAssign results were calculated with a differing cell set, and then add a separate category for the NA values, like we do with submitter celltypes. Presumably this would also affect SingleR, so we would want to pay attention to that as well. In that case they all end up as NA, but similar to the submitter cell types, we would then have two different kinds of NA values. ugh.

@allyhawkins
Member Author

But that would still leave NAs that we would have to accommodate for plots and reports. I think that is okay for now, though we might want to add a warning that the CellAssign results were calculated with a differing cell set, and then add a separate category for the NA values, like we do with submitter celltypes. Presumably this would also affect SingleR, so we would want to pay attention to that as well. In that case they all end up as NA, but similar to the submitter cell types, we would then have two different kinds of NA values. ugh.

My interpretation of this comment is to leave everything as is, but to add separate labels for cells not classified by SingleR and not classified by CellAssign? And then, when creating our diagnostic plots, we would remove those before plotting.
Also, if there is a non-zero number of cells that are not classified, then we will need a warning.

@allyhawkins
Member Author

I think to avoid two kinds of NAs, we want to grab the cell barcodes that are in the SCE object but not in the SingleR/CellAssign results and label them as "Unclassified" before even adding them into the SCE object. This would happen in add_celltypes_to_sce.R.
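A minimal sketch of that labeling step might look like this. The data frame and column names (`cellassign_df`, `barcode`, `cellassign_celltype_annotation`) are assumptions, not the actual add_celltypes_to_sce.R code:

```r
# Hypothetical sketch: any barcode present in the SCE but absent from the
# CellAssign results gets an explicit "Unclassified" label before the
# annotations are added to the SCE object.
missing_barcodes <- setdiff(colnames(sce), cellassign_df$barcode)

celltype_df <- dplyr::bind_rows(
  cellassign_df,
  data.frame(
    barcode = missing_barcodes,
    cellassign_celltype_annotation = "Unclassified"
  )
)
```

Columns not supplied for the new rows (e.g. the prediction score) would still be NA, but the annotation column itself would carry a single, unambiguous label.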

@jashapiro
Member

jashapiro commented Nov 22, 2023

I think to avoid two kinds of NA's, we want to grab the cell barcodes that are in the SCE object and not in the SingleR/CellAssign results and label them as "Unclassified" before even adding them into the SCE object. This would happen in add_celltypes_to_sce.R.

That makes sense to me.

But now I'm confused by the error again, and wondering if missing cells are really the issue? Presumably if all values were NA, we would also give the cellassign_celltype_annotation an NA value? In which case split() would drop those rows, and we wouldn't have gotten an error in the code you quoted. (Which doesn't mean that we don't need to deal with it. Just that I am confused by the fact that you saw an error.)
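The split() behavior in question is easy to see with a toy example:

```r
x <- c(1, 2, 3, 4)
groups <- c("a", "a", NA, "b")

# split() converts the grouping vector to a factor, and rows whose
# group is NA are dropped entirely (the value 3 disappears here).
split(x, groups)
# $a contains 1, 2 and $b contains 4; no group for the NA entry
```

So if both the annotation and the score were NA for the missing cells, those rows would silently vanish before density() was ever called.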

@jashapiro
Member

Also, as to timing of this error: I think we could call this a known bug and come back to it later. Running this release without any cell type skipping doesn't seem like a terrible idea. I don't mind a clean run...

@allyhawkins
Member Author

But now I'm confused by the error again, and wondering if missing cells are really the issue? Presumably if all values were NA, we would also give the cellassign_celltype_annotation an NA value? In which case split() would drop those rows, and we wouldn't have gotten an error in the code you quoted. (Which doesn't mean that we don't need to deal with it. Just that I am confused by the fact that you saw an error.)

It's because we are labeling cells that are NA as Unknown cell type, which could be an error in itself... but this is what I get when looking at the head of the colData of one of the processed SCEs:

> head(celltype_df)
                         barcodes cluster     UMAP1     UMAP2 singler_celltype_ontology singler_celltype_annotation cellassign_celltype_annotation
TCCACGTTCTGCCTCA TCCACGTTCTGCCTCA       1  5.817751  3.443341                CL:0000188     cell of skeletal muscle              Unknown cell type
ACTATTCCAGCAGATG ACTATTCCAGCAGATG       2 -4.599211 -5.128418                CL:0000192          smooth muscle cell              Unknown cell type
CTCCACAGTTATCTTC CTCCACAGTTATCTTC       3 -6.321206 -3.468336                CL:0000187                 muscle cell              Unknown cell type
ACTACGAAGAATACAC ACTACGAAGAATACAC       3 -9.004413 -3.580391                CL:0000540                      neuron              Unknown cell type
TACTGCCTCTGGTGCG TACTGCCTCTGGTGCG       3 -7.810103 -3.880036                CL:0000187                 muscle cell              Unknown cell type
CAACGATCAGCTCCTT CAACGATCAGCTCCTT       2 -4.917690 -5.254333                CL:0000187                 muscle cell              Unknown cell type
                 cellassign_max_prediction submitter_celltype_annotation
TCCACGTTCTGCCTCA                 0.9991841              tumor_Myoblast-A
ACTATTCCAGCAGATG                 0.9994990              tumor_Myoblast-B
CTCCACAGTTATCTTC                 0.9995808               tumor_Myocyte-A
ACTACGAAGAATACAC                        NA                    Epithelial
TACTGCCTCTGGTGCG                 0.9984105               tumor_Myocyte-A
CAACGATCAGCTCCTT                 0.9993721              tumor_Myoblast-A

Also, as to timing of this error: I think we could call this a known bug and come back to it later. Running this release without any cell type skipping doesn't seem like a terrible idea. I don't mind a clean run...

I'm not opposed to this! The only reason I picked this up is because I went to run the workflow on the samples we've been using for development in scpca/processed so it automatically used old results.

@allyhawkins
Member Author

It's because we have a function that sets the unknowns for both SingleR and CellAssign, so we are setting both NA and "other" to Unknown cell type:

unknown_string <- "Unknown cell type"

df |>
  dplyr::mutate(
    {{ annotation_column }} := dplyr::case_when(
      # singler condition
      is.na({{ annotation_column }}) ~ unknown_string,
      # cellassign condition
      {{ annotation_column }} == "other" ~ unknown_string,
      # otherwise, keep it
      .default = {{ annotation_column }}
    ) |>
      # order column by number of cells
      forcats::fct_infreq() |>
      # make "Unknown cell type" the last level
      forcats::fct_relevel(unknown_string, after = Inf)
  )

@jashapiro
Member

jashapiro commented Nov 22, 2023

Oh, I wonder if the bug may actually be different?

y_max <- celltype_df$cellassign_max_prediction |>
  split(celltype_df$cellassign_celltype_annotation) |>
  purrr::map_dbl(
    \(x) max(density(x, bw = density_bw)$y)
  ) |>
  max()

The above code may fail if density() fails, which happens when n is small. So we should filter out the little ones:

y_max <- celltype_df$cellassign_max_prediction |>
  split(celltype_df$cellassign_celltype_annotation) |>
  purrr::discard(\(x) sum(is.finite(x)) <= 2) |>
  purrr::map_dbl(
    \(x) max(density(x, bw = density_bw, na.rm = TRUE)$y)
  ) |>
  max(na.rm = TRUE)

@allyhawkins
Member Author

For some context, this is the actual error I was getting. It's definitely the NAs for the scores. We can add the filter too, but we definitely need the na.rm = TRUE.

[Screenshot of the error output, 2023-11-22]
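The NA failure mode is easy to reproduce directly, since density() refuses missing values unless told otherwise:

```r
scores <- c(0.9, NA, 0.95)

# density() errors on NA input by default:
# density(scores, bw = 0.01)
# Error in density.default(...): 'x' contains missing values

# with na.rm = TRUE the missing score is dropped and the call succeeds
d <- density(scores, bw = 0.01, na.rm = TRUE)
max(d$y)
```

This is why the na.rm = TRUE fix is needed regardless of whether we also discard the small groups.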

@jashapiro
Member

jashapiro commented Nov 22, 2023

It's because we have a function that sets the unknowns for both SingleR and CellAssign, so we are setting both NA and "other" to Unknown cell type.

That makes sense. Though we probably shouldn't do that in the future... Marking the "missing" calls because we were repeating cell typing seems like what we want to do in the future.

And we probably do still want to add the filter/discard() to prevent errors with density(). So those fixes can go in now, and fixing the way we handle preexisting cell type merging can come later...

Or maybe just go for it as part of this: change the missing cells to "Untyped" or "Unassigned" in add_celltypes_to_sce.R and then filter them out for the report, adding a warning if any are filtered that way.

@jashapiro
Member

Note I updated the code above to handle NA in the density() call itself, and changed the filtering.

@allyhawkins
Member Author

And we probably do still want to add the filter/discard() to prevent errors with density(). So those fixes can go in now, and fixing the way we handle preexisting cell type merging can come later...

Definitely making this change now.

Or maybe just go for it as part of this: change the missing cells to "Untyped" or "Unassigned" in add_celltypes_to_sce.R and then filter them out for the report, adding a warning if any are filtered that way.

I'm trying what I think will be an easy fix for this, if it takes too long then I'll just make the density change.

@allyhawkins
Member Author

Closed by #596
