improve_07_annotation #994

maud-p · 2025-01-17T12:32:24Z

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

This PR is linked to the issue 856.

What is the goal of this pull request?

I aim here to improve the workflow of the annotations performed in 07_combined_annotation_across_samples_exploration.Rmd.

The suggested changes aim to overcome some misclassification that have been spotted in the first version of the annotation while working on the integration of the data.

In the first version of the annotation, I realized that a lot of immune cells were filtered out because of the cnv-threshold. See the expression of PTPRC = CD45 below:

We requested immune and endothelial cells to have no infered cnv to be labeled as normal. I think that the few CNV that were infered for immune cells are cell type specific expression of some genes that colocalized on one chromosome. Thus I decided to allow a bit of flexibility in regards to CNV for immune and endothelial cells and only based the annotation of immune and endothelial cells on the predicted.cell-type and associated prediction.score.

Similar to the immune cells, there is a high probability that we also infer false positive CNV in normal epithelial/stroma cells. I am not sure that it will be possible to differentiate between normal cells with false positive CNV from cancer cells. For that reason, I decided to label as normal cells with a cnv-score <= cnv-threshold and as cancer cells with a cnv-score > cnv-threshold + 2.
The choice of +2 is a bit arbitrary, but based on the approximate number of CNV that can be detected in endothelial and immune cells.
There is the possibility of cancer cell without any CNV. It would be extremly difficult in that case to differentiate them from normal cells. The strategy I decided to go with to reduce the risk of misclassification is to use the prediction score of label transfer. Indeed, we expect normal cells to resemble the cells from the fetal kidney reference and thus have a high prediction.score.
While working on the integration (see umap reduction here colored by cnv_score), I realized that samples -177, -180, -181, -190, -197 have strikingly very low cnv_score. The cnv_score is simply the number of chromosome presenting a CNV for each patient. Of note, these 5 samples are the ones with no immune and/or endothelial cells for which we couldn’t run infercnv properly. We ran infercnv without any reference. In that case, the mean over all cellular expression profiles is taken as a reference. If the sample is mostly composed of cancer cell, the cancer-associated CNV are taken as the normal reference...

I am afraid not to have a solution for this, but I am quite sure that the cells for these 5 patients are mis-classified. This will affect the downstream differential expression analysis cancer versus normal and the potential finding of Wilms tumor histology specific markers. This is why I would strongly recommand forcing the annotations for these patients to unknown.

I slightly changed the summarized CNV heatmap with do_Feature_mean. Initially, the feature ploted with do_Feature_mean was prop_cnv_chr[i], the proportion of the chromosome i affected by a CNV. The function do_Feature_mean took then the mean over all cells in a group. This can be misleading are it is then impossible to interpret the mean value. Example with a mean of 0.5, we could say:

half of the cells in the group gained/lossed the entire chromosome, half are unaffected
all cells gained/lossed half of the chromosome
... all other combinations possible!

For that reason, I changed the feature to has_cnv_chr[i], which is a binary information of the presence/absence of CNV in the chromosome i for the given cell. While taking the mean over all cells in a group, we then have the information of the percentage of cells within the group having any kind of CNV in this chromosome.

If known, do you anticipate filing additional pull requests to complete this analysis module?

-[ ] one for the integration of the samples
-[ ] one for differential expression analysis, looking for Wilms tumor, histology specific markers

Results

The annotation file is generated automatically while running the script and will be saved in the results folder.
The notebook is saved in the notebook folder.

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

Are there particularly areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

Author checklists

Analysis module and review

This analysis module uses the analysis template and has the expected directory structure.
The analysis module README.md has been updated to reflect code changes in this pull request.
The analytical code is documented and contains comments.
Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

Code in this pull request has been added to the GitHub Action workflow that runs this module.
The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

sjspielman · 2025-01-17T13:53:22Z

Hi @maud-p, just wanted to give you a heads-up for time frame! I anticipate getting to review this next Tuesday 1/21, since we are out of the office Monday 1/20 for a holiday and I don't think I'll have a chance to get to this today. Have a good weekend in the meantime! :)

maud-p · 2025-01-17T13:56:38Z

Hi @sjspielman ,
Thank you for letting me know. Please no rush for it, as I wanted to send the PR before leaving for vacations ;)
I'll be back Monday 26/01.
I also wish you a great weekend!

sjspielman

Generally speaking, these annotation changes you are proposing seem reasonable to me. I have a few initial thoughts about some of the way this is implemented. Once you make these changes, I will do a second round of more thorough code review. Let me know if I can clarify anything here!

Most importantly, the score threshold (0.85 being changed to 0.75) was not only used in this notebook. It was also used in several exploratory steps (we don't need to worry about those at this point), but more importantly it was used for inferCNV inference. We would like to keep this parameter consistent whenever it is used to create the annotations, which is during inferCNV inference and in this notebook. Therefore, to change the threshold, you'll want to take a step back here to apply the changes more broadly and make use of the arguments and parameters we have already established. I recommend that you:
- Define a variable at the top of this script 00_run_workflow.sh with the chosen threshold you'd like to use, 0.75
- Pass this variable into both the inferCNV step on line 150, and this notebook on the following line
- Then, re-run these steps to regenerate both the inferCNV results and this notebook; you'll want to make sure those new inferCNV results are also updated in your researcher results S3 bucket for me to see them as part of review.
- I recommend leaving all defaults at 0.85 for now.
The choice of +2 indeed seems sort of arbitrary, which isn't necessarily a problem. But, since it is not a rigorously tested value, I recommend making it a notebook parameter in case this changes in the future after additional investigation.

...yses/cell-type-wilms-tumor-06/notebook/07_combined_annotation_across_samples_exploration.Rmd

…ion_across_samples_exploration.Rmd Co-authored-by: Stephanie Spielman <[email protected]>

…old default to 0.85

…ing function

maud-p · 2025-01-27T12:09:50Z

Hi @sjspielman ,

thank you for the review 😄

I followed your advice to keep the predicted_celltype_threshold to 0.85 for now, however, re-running the script I realized we are missing quite a lot of immune and endothelium cells like this.
In a second step of the review, I'll re-run the 06_infercnv.R and 07_combined_annotation_across_samples_exploration.Rmd with 0.75 instead of 0.85, I think we should perform slightly better 🤞
❓ As for predicted_celltype_threshold, I'd like to add a parameter for cnv_threshold to be a vector (0 2), as changed in the default of the notebook 07_combined_annotation_across_samples_exploration.Rmd. But I didn't manage to write it in bash-R...

Thank you!

sjspielman · 2025-01-27T16:22:42Z

I followed your advice to keep the predicted_celltype_threshold to 0.85 for now, however, re-running the script I realized we are missing quite a lot of immune and endothelium cells like this.

To clarify, what I meant by this is to keep the defaults in the R code as 0.85 (e.g., the inferCNV script argument and the notebook parameter), but you can assign 0.75 in the shell script 00_run_workflow.sh and override those defaults. So, in the line where you created this variable in the shell script, you can make it equal to 0.75 so that value will be used instead of 0.85.

As for predicted_celltype_threshold, I'd like to add a parameter for cnv_threshold to be a vector (0 2), as changed in the default of the notebook 07_combined_annotation_across_samples_exploration.Rmd. But I didn't manage to write it in bash

For this, you'd need to define two separate parameters rather than using !r c(0,2). I think having two parameters here would probably be clearer as well. In other words:

Define 2 variables in the bash script for 0 and 2
Create 2 parameters in the R markdown notebook
Pass in your 2 variables and treat the 0 and 2 as separate quantities rather than indexing them out of a vector*

maud-p · 2025-01-28T13:01:59Z

Dear @sjspielman ,
thank you for clarifying! The last steps of the workflow are now running with the new predicted.celltype.threshold=0.75 and cnv_threshold_low=0, cnv_threshold_high=2. I'll update the s3 bucket hopefully tomorrow 😄

analyses/cell-type-wilms-tumor-06/00_run_workflow.sh

Co-authored-by: Stephanie Spielman <[email protected]>

maud-p · 2025-01-29T12:06:40Z

Dear @sjspielman ,

I just wanted to let you know that there will be 2 ~major changes to come tha will impact this PR:

in 07_combined_annotation_across_samples_exploration.Rmd in the second level annotation of normal cells. I understood why we were loosing immune and endothelial cells 🥳 in 06_infercnv.R we have been indeed renaming the immune and endothelium that passed the score_threshold as normal. Thus, the selection of immune and endothelium cells lines 461-499 should be changed as the following

cell_type_df <- cell_type_df |>
 mutate(second.level_annotation = case_when(
    ell_type_df$compartment %in% c("normal") &
     cell_type_df$cell_type %in% c("endothelial cell") ~ "endothelium",
   cell_type_df$compartment %in% c("normal") &
     cell_type_df$cell_type %in% c("B cell",
                                   "CD4-positive, alpha-beta T cell",
                                   "conventional dendritic cell",
                                   "lymphocyte",
                                   "macrophage",
                                   "mast cell",
                                   "monocyte",
                                   "natural killer cell",
                                   "neutrophil",
                                   "plasmacytoid dendritic cell") ~ "immune",

   .default = "unknown"
 ))

another major change concern the gene_order_file defined in 06a_build-geneposition.R and used for infercnv. I changed it to have a segmentation for each chromosome arms (p and q) per chromosome. I would indeed use this information to validate the infercv step.

I made all these changes locally and re-ran the workflow. I am now exploring the results to make sure it is worth going into it.

I would then suggest to postpone this PR and open another one to first change the 06a_build-geneposition.R and re-run the 06_infercnv.R with the new gene_order and predicted_celltype_threshold = 0.75.

Then maybe once this is fixed we can re-open the PR#994 and finalize the remaining changes?

Thank you in advance!

sjspielman · 2025-01-29T21:29:25Z

I understood why we were loosing immune and endothelial cells 🥳 in 06_infercnv.R we have been indeed renaming the immune and endothelium that passed the score_threshold as normal.

🥳🥳🥳 great catch!!

I would then suggest to postpone this PR and open another one to first change the 06a_build-geneposition.R and re-run the 06_infercnv.R with the new gene_order and predicted_celltype_threshold = 0.75.

Then maybe once this is fixed we can re-open the PR #994 and finalize the remaining changes?

That's fine with me! I was thinking that it might be fine to keep this PR open and come back to it later after merging in those other changes you describe, but the results might be different enough that a new PR is easier for you. So in any case - do what is easier for you :)

improve_07_annotation

9c17a36

maud-p requested a review from jaclyn-taroni as a code owner January 17, 2025 12:32

jaclyn-taroni requested review from sjspielman and removed request for jaclyn-taroni January 17, 2025 12:36

sjspielman reviewed Jan 22, 2025

View reviewed changes

maud-p and others added 6 commits January 27, 2025 10:32

Update analyses/cell-type-wilms-tumor-06/notebook/07_combined_annotat…

0f1d211

…ion_across_samples_exploration.Rmd Co-authored-by: Stephanie Spielman <[email protected]>

Update analyses/cell-type-wilms-tumor-06/notebook/07_combined_annotat…

fa20313

…ion_across_samples_exploration.Rmd Co-authored-by: Stephanie Spielman <[email protected]>

Update analyses/cell-type-wilms-tumor-06/notebook/07_combined_annotat…

bc6762e

…ion_across_samples_exploration.Rmd Co-authored-by: Stephanie Spielman <[email protected]>

add parameters for cnv_threshold and change predicted.celltype.thresh…

12d7f92

…old default to 0.85

add predicted.celltype.threshold parameter

8bcfffd

add ${predicted_celltype_threshold} parameter to the 06_infercnv call…

a7331e2

…ing function

maud-p requested a review from sjspielman January 27, 2025 12:10

maud-p added 2 commits January 28, 2025 14:03

add parameters for scores thresholds

71ecc6e

update params

9bce4a4

sjspielman reviewed Jan 28, 2025

View reviewed changes

analyses/cell-type-wilms-tumor-06/00_run_workflow.sh Outdated Show resolved Hide resolved

Update analyses/cell-type-wilms-tumor-06/00_run_workflow.sh

f8e3574

Co-authored-by: Stephanie Spielman <[email protected]>

maud-p mentioned this pull request Jan 30, 2025

06 infercnv update #1013

Open

8 tasks

simplify str_extract pattern

8fdee1a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

improve_07_annotation #994

improve_07_annotation #994

maud-p commented Jan 17, 2025

sjspielman commented Jan 17, 2025

maud-p commented Jan 17, 2025

sjspielman left a comment

maud-p commented Jan 27, 2025

sjspielman commented Jan 27, 2025 •

edited

Loading

maud-p commented Jan 28, 2025

maud-p commented Jan 29, 2025

sjspielman commented Jan 29, 2025

improve_07_annotation #994

Are you sure you want to change the base?

improve_07_annotation #994

Conversation

maud-p commented Jan 17, 2025

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

What is the goal of this pull request?

If known, do you anticipate filing additional pull requests to complete this analysis module?

Results

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

Are there particularly areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

Author checklists

Analysis module and review

Reproducibility checklist

sjspielman commented Jan 17, 2025

maud-p commented Jan 17, 2025

sjspielman left a comment

Choose a reason for hiding this comment

maud-p commented Jan 27, 2025

sjspielman commented Jan 27, 2025 • edited Loading

maud-p commented Jan 28, 2025

maud-p commented Jan 29, 2025

sjspielman commented Jan 29, 2025

sjspielman commented Jan 27, 2025 •

edited

Loading