
Figure out what to do when the numbers of cells don't match up between CellAssign results and processed SCE object #591

Closed
allyhawkins opened this issue Nov 22, 2023 · 13 comments

@allyhawkins
Member

I was testing the workflow and ran into an issue with rendering the supplemental cell type report. The problem is in creating the CellAssign density plots. When running a library we had already processed, there were now NAs for the CellAssign prediction score, causing an error in:

y_max <- celltype_df$cellassign_max_prediction |>
  split(celltype_df$cellassign_celltype_annotation) |>
  purrr::map_dbl(
    \(x) max(density(x, bw = density_bw)$y)
  ) |>
  max()

After some investigation, it looks like the NAs are coming from barcodes that are not present in the CellAssign results but are present in the processed object. The problem I think we're going to have moving forward is that the exact cells present in the final object may sometimes change slightly when filtering. I double-checked that we set seeds in all the scripts where we need them (including when running CellAssign), but we have seen scenarios where the exact number of cells is altered slightly from run to run.

So we need to decide what to do about repeating CellAssign in that scenario. I think we have a few options:

  1. Remove the ability to skip CellAssign. This option makes me really sad, but it would ensure that any barcodes in the object are in fact getting annotated by CellAssign. 😢
  2. Continue to skip CellAssign, but remove any cells that are not classified before creating plots and print out a warning. The problem with this approach is that it seems misleading to me. It's not that CellAssign couldn't assign a score, it's that we didn't provide that cell to CellAssign to begin with, so we don't really know what it is.

I don't love option 1, but it's the only way I can think of to be sure the barcodes are the same. The only other option would be to actually check that the barcodes in the files match before skipping CellAssign.
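A barcode check along those lines could look something like the following sketch. The object and column names here (`sce`, `cellassign_predictions`, `barcode`) are assumptions for illustration, not the actual workflow variables:

```r
# Sketch: before skipping CellAssign, confirm the stored predictions
# cover exactly the same barcodes as the processed SCE object.
# `sce` and `cellassign_predictions` are hypothetical names.
sce_barcodes <- colnames(sce)
existing_barcodes <- cellassign_predictions$barcode

if (!setequal(sce_barcodes, existing_barcodes)) {
  warning(
    "Existing CellAssign results do not match the cells in the SCE object; ",
    "re-running CellAssign."
  )
  # ... fall through to re-running CellAssign instead of skipping it
}
```

`setequal()` ignores order and duplicates, which is what we want here since we only care whether the same set of cells was annotated.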

The other thing to note is that this is based on the assumption that CellAssign returns a score for every single cell that is provided as input. That's my understanding, and looking back at some of the analysis I did in sc-data-integration, that seemed to be the case. So it seems unlikely that CellAssign just didn't assign a score and also didn't return it in the predictions matrix.

Tagging @jashapiro @sjspielman for any thoughts on this.

@jashapiro
Member

The easy solution in that particular code is to use max(density(x, bw = density_bw)$y, na.rm = TRUE)

But that would still leave NAs that we would have to accommodate for plots and reports. I think that is okay for now, though we might want to add a warning that the CellAssign results were calculated with a differing cell set, and then add a separate category for the NA values, like we do with submitter celltypes. Presumably this would also affect SingleR, so we would want to pay attention to that as well. In that case they all end up as NA, but similar to the submitter cell types, we would then have two different kinds of NA values. ugh.

@allyhawkins
Member Author

But that would still leave NAs that we would have to accommodate for plots and reports. I think that is okay for now, though we might want to add a warning that the CellAssign results were calculated with a differing cell set, and then add a separate category for the NA values, like we do with submitter celltypes. Presumably this would also affect SingleR, so we would want to pay attention to that as well. In that case they all end up as NA, but similar to the submitter cell types, we would then have two different kinds of NA values. ugh.

My interpretation of this comment is to leave everything as is, but to add separate labels for cells not classified by SingleR and not classified by CellAssign? And then, when creating our diagnostic plots, we would remove those before plotting.
Also, if there is a non-zero number of cells that are not classified, then we will need a warning.

@allyhawkins
Member Author

I think to avoid two kinds of NAs, we want to grab the cell barcodes that are in the SCE object but not in the SingleR/CellAssign results and label them as "Unclassified" before even adding them into the SCE object. This would happen in add_celltypes_to_sce.R.
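A minimal sketch of that labeling step might look like this. The data frame and column names (`cellassign_df`, `barcode`, `cellassign_celltype_annotation`) are assumptions, not the actual add_celltypes_to_sce.R code:

```r
# Hypothetical sketch: any barcode present in the SCE but absent from the
# CellAssign results gets an explicit "Unclassified" label before the
# annotations are added to the SCE object.
missing_barcodes <- setdiff(colnames(sce), cellassign_df$barcode)

celltype_df <- dplyr::bind_rows(
  cellassign_df,
  data.frame(
    barcode = missing_barcodes,
    cellassign_celltype_annotation = "Unclassified"
  )
)
```

Columns not supplied for the new rows (e.g. the prediction score) would still be NA, but the annotation column itself would carry a single, unambiguous label.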

@jashapiro
Member

jashapiro commented Nov 22, 2023

I think to avoid two kinds of NA's, we want to grab the cell barcodes that are in the SCE object and not in the SingleR/CellAssign results and label them as "Unclassified" before even adding them into the SCE object. This would happen in add_celltypes_to_sce.R.

That makes sense to me.

But now I'm confused by the error again, and wondering if missing cells are really the issue? Presumably if all values were NA, we would also give the cellassign_celltype_annotation an NA value? In which case split() would drop those rows, and we wouldn't have gotten an error in the code you quoted. (Which doesn't mean that we don't need to deal with it. Just that I am confused by the fact that you saw an error.)
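The split() behavior in question is easy to see with a toy example:

```r
x <- c(1, 2, 3, 4)
groups <- c("a", "a", NA, "b")

# split() converts the grouping vector to a factor, and rows whose
# group is NA are dropped entirely (the value 3 disappears here).
split(x, groups)
# $a contains 1, 2 and $b contains 4; no group for the NA entry
```

So if both the annotation and the score were NA for the missing cells, those rows would silently vanish before density() was ever called.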

@jashapiro
Member

Also, as to timing of this error: I think we could call this a known bug and come back to it later. Running this release without any cell type skipping doesn't seem like a terrible idea. I don't mind a clean run...

@allyhawkins
Member Author

But now I'm confused by the error again, and wondering if missing cells are really the issue? Presumably if all values were NA, we would also give the cellassign_celltype_annotation an NA value? In which case split() would drop those rows, and we wouldn't have gotten an error in the code you quoted. (Which doesn't mean that we don't need to deal with it. Just that I am confused by the fact that you saw an error.)

It's because we are labeling cells that are NA as Unknown cell type, which could be an error in itself... but this is what I get when looking at the head of the colData of one of the processed SCEs:

> head(celltype_df)
                         barcodes cluster     UMAP1     UMAP2 singler_celltype_ontology singler_celltype_annotation cellassign_celltype_annotation
TCCACGTTCTGCCTCA TCCACGTTCTGCCTCA       1  5.817751  3.443341                CL:0000188     cell of skeletal muscle              Unknown cell type
ACTATTCCAGCAGATG ACTATTCCAGCAGATG       2 -4.599211 -5.128418                CL:0000192          smooth muscle cell              Unknown cell type
CTCCACAGTTATCTTC CTCCACAGTTATCTTC       3 -6.321206 -3.468336                CL:0000187                 muscle cell              Unknown cell type
ACTACGAAGAATACAC ACTACGAAGAATACAC       3 -9.004413 -3.580391                CL:0000540                      neuron              Unknown cell type
TACTGCCTCTGGTGCG TACTGCCTCTGGTGCG       3 -7.810103 -3.880036                CL:0000187                 muscle cell              Unknown cell type
CAACGATCAGCTCCTT CAACGATCAGCTCCTT       2 -4.917690 -5.254333                CL:0000187                 muscle cell              Unknown cell type
                 cellassign_max_prediction submitter_celltype_annotation
TCCACGTTCTGCCTCA                 0.9991841              tumor_Myoblast-A
ACTATTCCAGCAGATG                 0.9994990              tumor_Myoblast-B
CTCCACAGTTATCTTC                 0.9995808               tumor_Myocyte-A
ACTACGAAGAATACAC                        NA                    Epithelial
TACTGCCTCTGGTGCG                 0.9984105               tumor_Myocyte-A
CAACGATCAGCTCCTT                 0.9993721              tumor_Myoblast-A

Also, as to timing of this error: I think we could call this a known bug and come back to it later. Running this release without any cell type skipping doesn't seem like a terrible idea. I don't mind a clean run...

I'm not opposed to this! The only reason I picked this up is because I went to run the workflow on the samples we've been using for development in scpca/processed so it automatically used old results.

@allyhawkins
Member Author

It's because we have a function that sets the unknowns for both SingleR and CellAssign, so we are setting both NA and "other" to Unknown cell type:

unknown_string <- "Unknown cell type"

df |>
  dplyr::mutate(
    {{ annotation_column }} := dplyr::case_when(
      # singler condition
      is.na({{ annotation_column }}) ~ unknown_string,
      # cellassign condition
      {{ annotation_column }} == "other" ~ unknown_string,
      # otherwise, keep it
      .default = {{ annotation_column }}
    ) |>
      # order column by number of cells
      forcats::fct_infreq() |>
      # make "Unknown cell type" the last level
      forcats::fct_relevel(unknown_string, after = Inf)
  )

@jashapiro
Member

jashapiro commented Nov 22, 2023

Oh, I wonder if the bug may actually be different?

y_max <- celltype_df$cellassign_max_prediction |>
  split(celltype_df$cellassign_celltype_annotation) |>
  purrr::map_dbl(
    \(x) max(density(x, bw = density_bw)$y)
  ) |>
  max()

The above code may fail if density() fails, which happens when n is small. So we should filter out the little ones:

y_max <- celltype_df$cellassign_max_prediction |>
  split(celltype_df$cellassign_celltype_annotation) |>
  purrr::discard(\(x) sum(is.finite(x)) <= 2) |>
  purrr::map_dbl(
    \(x) max(density(x, bw = density_bw, na.rm = TRUE)$y)
  ) |>
  max(na.rm = TRUE)

@allyhawkins
Member Author

For some context, this is the actual error I was getting. It's definitely the NAs for the scores. We can add the filter too, but we definitely need the na.rm = TRUE.

[Screenshot of the error output, 2023-11-22]
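The NA failure mode is easy to reproduce directly, since density() refuses missing values unless told otherwise:

```r
scores <- c(0.9, NA, 0.95)

# density() errors on NA input by default:
# density(scores, bw = 0.01)
# Error in density.default(...): 'x' contains missing values

# with na.rm = TRUE the missing score is dropped and the call succeeds
d <- density(scores, bw = 0.01, na.rm = TRUE)
max(d$y)
```

This is why the na.rm = TRUE fix is needed regardless of whether we also discard the small groups.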

@jashapiro
Member

jashapiro commented Nov 22, 2023

It's because we have a function that sets the unknowns for both SingleR and CellAssign, so we are setting both NA and "other" to Unknown cell type.

That makes sense. Though we probably shouldn't do that in the future... Marking the "missing" calls because we were repeating cell typing seems like what we want to do in the future.

And we probably do still want to add the filter/discard() to prevent errors with density(). So those fixes can go in now, and fixing the way we handle preexisting cell type merging can come later...

Or maybe just go for it as part of this: change the missing cells to "Untyped" or "Unassigned" in add_celltypes_to_sce.R and then filter them out for the report, adding a warning if any are filtered that way.

@jashapiro
Member

Note I updated the code above to handle NA in the density() call itself, and changed the filtering.

@allyhawkins
Member Author

And we probably do still want to add the filter/discard() to prevent errors with density(). So those fixes can go in now, and fixing the way we handle preexisting cell type merging can come later...

Definitely making this change now.

Or maybe just go for it as part of this: change the missing cells to "Untyped" or "Unassigned" in add_celltypes_to_sce.R and then filter them out for the report, adding a warning if any are filtered that way.

I'm trying what I think will be an easy fix for this, if it takes too long then I'll just make the density change.

@allyhawkins
Member Author

Closed by #596
