
P2 02. Final model for the LINCS dataset (batch 1) #13

Open · EchteRobert opened this issue Oct 5, 2022 · 10 comments

EchteRobert commented Oct 5, 2022

Here I trained a model on all available data from batch 1 of the LINCS dataset, which can be listed with `aws s3 ls s3://cellpainting-gallery/cpg0004-lincs/broad/workspace/backend/2016_04_01_a549_48hr_batch1/`.

The model uses 1745 features because of an issue with 10 plates (broadinstitute/lincs-cell-painting#88 (comment)). In total, I trained the model on 136 plates and 5965 wells, covering 1228 unique compounds at the 10 uM dose point. During preprocessing I removed 1587 wells due to missing MoA or compound name (pert_iname) annotations. I used the following hyperparameters:

| Hyperparameter | Value |
| --- | --- |
| batch size | 36 |
| epochs | 100 |
| kFilters | 0.5 |
| latent dim | 2048 |
| learning rate | 0.0005 |
| nr cells | (1500, 800) |
| nr sets | 8 |
| optimizer | AdamW |
| output dim | 2048 |
| true batch size | 288 |
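
For reference, a minimal sketch of the same settings as a configuration dict; the key names here are hypothetical and the actual training script may organize them differently:

```python
# Hypothetical training configuration mirroring the hyperparameter table above.
config = {
    "batch_size": 36,
    "epochs": 100,
    "kFilters": 0.5,
    "latent_dim": 2048,
    "learning_rate": 5e-4,
    "nr_cells": (1500, 800),
    "nr_sets": 8,              # 36 (batch size) x 8 (nr sets) = 288, the "true batch size"
    "optimizer": "AdamW",
    "output_dim": 2048,
    "true_batch_size": 288,
}
```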

I assess the model on the 10 uM dose point using replicate and MoA prediction, and similarly on the 3.33 uM dose point, which is considered the test set.
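
For context, here is a minimal sketch of retrieval-style mean average precision (mAP) as used for replicate prediction (MoA prediction is analogous, grouping by MoA instead of compound). This is an illustrative implementation, not necessarily the exact evaluation code used here.

```python
import numpy as np

def mean_average_precision(profiles: np.ndarray, labels: np.ndarray) -> float:
    """Retrieval mAP: each profile queries all others; profiles with the same label are hits."""
    normed = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
    sim = normed @ normed.T  # cosine similarity between all well-level profiles
    aps = []
    for i in range(len(labels)):
        order = np.argsort(-sim[i])
        order = order[order != i]                 # exclude the query itself
        hits = labels[order] == labels[i]         # same compound (replicate) = hit
        if not hits.any():
            continue
        ranks = np.nonzero(hits)[0] + 1           # 1-based ranks of the hits
        precision_at_hits = np.cumsum(hits)[hits] / ranks
        aps.append(precision_at_hits.mean())
    return float(np.mean(aps))
```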

Results

  • The model significantly improves upon the average baseline for replicate and MoA prediction for both the 10 and 3.33 uM dose points.
  • It improves the mAP by 60% and 30% for the 10 uM dose point (training) and 3.33 uM dose point (test) data, respectively.
  • I could have trained the model a bit longer, e.g. 150 epochs, as the validation mAP and loss had not yet fully converged.

Results 10 uM dose point

Replicate prediction

Welch's t-test between mlp mAP and bm mAP: Ttest_indResult(statistic=84.81208433212997, pvalue=0.0)

| plate | Training mAP model | Training mAP BM | Training mAP shuffled |
| --- | --- | --- | --- |
| all plates | 0.7473 | 0.269 | 0 |
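
The t-tests here compare the two mAP distributions (model vs. baseline). A minimal sketch, assuming hypothetical arrays `mlp_map` and `bm_map` holding the mAP values:

```python
from scipy import stats

def compare_map(mlp_map, bm_map):
    """Welch's t-test (unequal variances) between model and baseline mAP values."""
    return stats.ttest_ind(mlp_map, bm_map, equal_var=False)
```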

MoA prediction

Welch's t-test between mlp mAP and bm mAP: Ttest_indResult(statistic=6.753694914168434, pvalue=1.5518902810751288e-11)

| plate | mAP model | mAP BM | mAP shuffled |
| --- | --- | --- | --- |
| all plates | 0.0541 | 0.0338 | 0.0002 |

Results 3.33 uM dose point

Replicate prediction

Welch's t-test between mlp mAP and bm mAP: Ttest_indResult(statistic=49.02599189522616, pvalue=0.0)

| plate | Training mAP model | Training mAP BM | Training mAP shuffled |
| --- | --- | --- | --- |
| all plates | 0.4465 | 0.1695 | 0 |

MoA prediction

Welch's t-test between mlp mAP and bm mAP: Ttest_indResult(statistic=3.525483296865904, pvalue=0.0004250301209859708)

| plate | mAP model | mAP BM | mAP shuffled |
| --- | --- | --- | --- |
| all plates | 0.042 | 0.0322 | 0 |

Loss curves [two screenshots]
All plate names SQ00014812_SQ00014813_SQ00014814_SQ00014815_SQ00014816_SQ00014817_SQ00014818_SQ00014819_SQ00014820_SQ00015041_SQ00015042_SQ00015043_SQ00015044_SQ00015045_SQ00015046_SQ00015047_SQ00015048_SQ00015049_SQ00015050_SQ00015051_SQ00015052_SQ00015053_SQ00015054_SQ00015055_SQ00015056_SQ00015057_SQ00015058_SQ00015059_SQ00015096_SQ00015097_SQ00015098_SQ00015099_SQ00015100_SQ00015101_SQ00015102_SQ00015103_SQ00015105_SQ00015106_SQ00015107_SQ00015108_SQ00015109_SQ00015110_SQ00015111_SQ00015112_SQ00015116_SQ00015117_SQ00015118_SQ00015119_SQ00015120_SQ00015121_SQ00015122_SQ00015123_SQ00015124_SQ00015125_SQ00015126_SQ00015127_SQ00015128_SQ00015129_SQ00015130_SQ00015131_SQ00015132_SQ00015133_SQ00015134_SQ00015135_SQ00015136_SQ00015137_SQ00015138_SQ00015139_SQ00015140_SQ00015141_SQ00015142_SQ00015143_SQ00015144_SQ00015145_SQ00015146_SQ00015147_SQ00015148_SQ00015149_SQ00015150_SQ00015151_SQ00015152_SQ00015153_SQ00015154_SQ00015155_SQ00015156_SQ00015157_SQ00015158_SQ00015159_SQ00015160_SQ00015162_SQ00015163_SQ00015164_SQ00015165_SQ00015166_SQ00015167_SQ00015168_SQ00015169_SQ00015170_SQ00015171_SQ00015172_SQ00015173_SQ00015194_SQ00015195_SQ00015196_SQ00015197_SQ00015198_SQ00015199_SQ00015200_SQ00015201_SQ00015202_SQ00015203_SQ00015204_SQ00015205_SQ00015206_SQ00015207_SQ00015208_SQ00015209_SQ00015210_SQ00015211_SQ00015212_SQ00015214_SQ00015215_SQ00015216_SQ00015217_SQ00015218_SQ00015219_SQ00015220_SQ00015221_SQ00015222_SQ00015223_SQ00015224_SQ00015229_SQ00015230_SQ00015231_SQ00015232_SQ00015233
EchteRobert commented Oct 7, 2022

As discussed during yesterday's check-in, I have computed Figure 4D as in the LINCS manuscript. I only have the 3.33 and 10 uM dose points available. In general, we see that:

  • the model amplifies the strength of profiles that already achieve a high mAP with average profiling
  • the model also strengthens some profiles that were not detectable before (increasing their mAP from <0.05 to >0.1)
  • relatively few model profiles lose performance compared to average profiling, which strengthens the case for this model (the main cost being the additional computation time)

Figure 4D [image]
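
A minimal sketch of one way to visualize this comparison, assuming hypothetical per-profile mAP arrays `map_average` (average profiling) and `map_model` (model profiling) for a single dose point; the manuscript's Figure 4D may use a different layout:

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_map_comparison(map_average: np.ndarray, map_model: np.ndarray) -> None:
    """Scatter per-profile mAP of model profiling against average profiling."""
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.scatter(map_average, map_model, s=8, alpha=0.6)
    lim = float(max(map_average.max(), map_model.max()))
    ax.plot([0, lim], [0, lim], "k--", linewidth=1)  # y = x: points above it improve over the baseline
    ax.set_xlabel("mAP, average profiling")
    ax.set_ylabel("mAP, model profiling")
    plt.show()
```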

EchteRobert commented Oct 13, 2022

Interpretability analysis rerun for LINCS data

From plate SQ00015142 I inspected images from well B13, which is 10 uM sulfafurazole, and computed the same saliencies as before for the Stain data. I chose the plate based on its large file size and the well at random. I first tried inspecting SQ00015106, but the seeding was so sparse that picking the top and bottom saliency cells yielded only a handful of cells in total. The seeding generally seems to be less dense than in the Stain experiments.
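
As a reminder of the selection step, a minimal sketch of picking the extreme-saliency cells for inspection, assuming a hypothetical per-cell DataFrame `cells` with a `saliency` column plus image/location metadata:

```python
import pandas as pd

def extreme_saliency_cells(cells: pd.DataFrame, n: int = 20):
    """Return the n highest- and n lowest-saliency cells for visual inspection."""
    return cells.nlargest(n, "saliency"), cells.nsmallest(n, "saliency")
```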

Main takeaways

No conclusion can be drawn from these results because the high and low saliency cells are not consistent in their appearance.

Images here!

[Six screenshots of high- and low-saliency cells]

EchteRobert commented Oct 14, 2022

Interpretability analysis rerun for LINCS data

From plate SQ00015131 I inspected images from well E13, which is 10 uM ganetespib with HSP inhibitor as its MoA, and computed the same saliencies as before for the Stain data. This MoA showed the largest relative improvement at both the 3.33 and 10 uM dose points when using model profiling versus average profiling.

Main takeaways

I think we can now see that the green-outlined cells tend to be brighter and have stronger contrast than the red-outlined cells. We also see that, again, features that compute the correlation between different channels are the most influential in deciding which cells are most or least important. IIUC, that means cells that are very flat are not important and cells that are 'fat' in the depth dimension are more important. The question then is: does it make sense that flat cells are less representative of the compound than fat cells?
I wonder if you can see a similar, more conclusive pattern here as well @AnneCarpenter?

Images here!

[Seven screenshots of high- and low-saliency cells]

bethac07 commented

@EchteRobert Do you happen to know which version of CellProfiler your features were made in, 3.X or 4.X? I don't know if it's fatal to your analysis, but we realized as we were putting 4.0 together that Costes features in CP3.X are improperly calculated.

EchteRobert commented

Ah, that's interesting @bethac07. According to the LINCS manuscript, it was version 2.3.1, so I'm guessing they were improperly calculated there as well. I'm wondering what the model is picking up then... Do you know how exactly they are calculated?

bethac07 commented

My level of understanding from memory (which I cannot stress enough may be wrong) and a bit of digging is this: Costes measurements are a special case of the Manders coefficient (which looks at which parts of an image are above a threshold in each of two channels), where in Costes that threshold is defined in a particular way. In at least CellProfiler 3, but possibly/probably also 2.3, there was an assumption that there were only 255 gray levels (numerical values), which is true for 8-bit images but wrong for the 16-bit images used here, which have 65535 gray levels. So the threshold was essentially always set to 255, which most of the image is brighter than, so the calculated correlation coefficients were nearly always 1.

So basically, I think it was measuring "pixels brighter than 255"?
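
To illustrate the effect described above (this is a minimal sketch with simulated intensities, not CellProfiler's actual code), a threshold stuck at 255 on 16-bit data passes nearly every pixel, so the coefficient saturates near 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated 16-bit pixel intensities for two channels (0-65535 scale);
# most foreground pixels are far brighter than 255.
ch1 = rng.integers(0, 20000, size=10_000).astype(np.float64)
ch2 = rng.integers(0, 20000, size=10_000).astype(np.float64)

threshold = 255  # 8-bit assumption applied to 16-bit data

mask2 = ch2 > threshold
print(f"fraction of ch2 pixels above threshold: {mask2.mean():.3f}")  # ~0.99

# Manders-style M1: fraction of channel-1 intensity in pixels where channel 2
# exceeds the threshold. With nearly every pixel above 255, it saturates near 1.
m1 = ch1[mask2].sum() / ch1.sum()
print(f"Manders-style M1 coefficient: {m1:.3f}")  # ~1.0
```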

EchteRobert commented Oct 17, 2022

Great catch Beth! I checked the values of those Costes Correlation features and they are indeed all equal (or almost equal) to 1. To find these features I simply looked at which features had the highest (absolute) saliency values. I think there are two possible explanations as to why these features popped up as 'most salient':

  1. Because a feature value of 1 is relatively large (I normalize all feature values within the plate). This explanation fits the L1 norm activation-based saliency method best.
  2. It also makes sense for the gradient-based saliency, as this method looks for features that, when changed, could strongly influence the outcome. Possibly, when most cells have a value of 1 for Costes Correlation and a few do not, those few cells become very important for the model's prediction.

Instead, I now calculated the correlation between saliency and feature values (something I also did before), and that points to different features, which I hope do have some actual meaning 😄. Below are the results for this particular well, for each of the saliency scores.
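
A minimal sketch of this correlation analysis, assuming hypothetical per-cell inputs: `features` (cells × plate-normalized CellProfiler features) and `saliency` (one score per cell):

```python
import pandas as pd

def saliency_feature_correlation(features: pd.DataFrame, saliency: pd.Series) -> pd.Series:
    """Pearson correlation of each feature with the per-cell saliency score."""
    return features.corrwith(saliency, method="pearson").sort_values()

# corr = saliency_feature_correlation(features, saliency)
# corr.tail(10)  # most positively correlated features
# corr.head(10)  # most negatively correlated features
```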

Main takeaways

  • The two major feature types are intensity and texture based features.
  • Interestingly, intensity-based features (particularly DNA) were also found to be important for the JUMP data; here, DNA is not among the top correlated features.
  • Texture-based features are new, which may have to do with the lower seeding density of the wells in LINCS.

Combined saliency score

| Feature name | Correlation (Pearson) |
| --- | --- |
| Cytoplasm_Texture_SumAverage_RNA_10_0 | 0.381074 |
| Cytoplasm_Texture_SumAverage_RNA_20_0 | 0.391287 |
| Cytoplasm_Correlation_Manders_AGP_RNA | 0.408331 |
| Nuclei_Texture_InfoMeas1_DNA_5_0 | 0.435283 |
| Cytoplasm_Texture_SumEntropy_RNA_20_0 | 0.438746 |
| Cytoplasm_Texture_Entropy_RNA_5_0 | 0.442981 |
| Cytoplasm_Texture_SumEntropy_RNA_10_0 | 0.443986 |
| Cytoplasm_Texture_Entropy_RNA_10_0 | 0.445096 |
| Cytoplasm_Texture_SumEntropy_RNA_5_0 | 0.445393 |
| Cytoplasm_Texture_Entropy_RNA_20_0 | 0.462262 |

| Feature name | Correlation (Pearson) |
| --- | --- |
| Cytoplasm_Texture_AngularSecondMoment_RNA_10_0 | -0.454882 |
| Cytoplasm_Texture_AngularSecondMoment_RNA_5_0 | -0.451614 |
| Cytoplasm_Texture_AngularSecondMoment_RNA_20_0 | -0.448406 |
| Nuclei_Intensity_IntegratedIntensity_RNA | -0.445297 |
| Nuclei_Intensity_IntegratedIntensity_Mito | -0.436190 |
| Cells_Intensity_MaxIntensity_RNA | -0.432253 |
| Nuclei_Intensity_MaxIntensity_RNA | -0.431139 |
| Nuclei_Intensity_IntegratedIntensity_ER | -0.422684 |
| Nuclei_Intensity_IntegratedIntensity_AGP | -0.413577 |
| Nuclei_Texture_Correlation_DNA_5_0 | -0.382188 |

L1 norm activation saliency score

| Feature name | Correlation (Pearson) |
| --- | --- |
| Nuclei_Texture_SumAverage_DNA_5_0 | 0.461082 |
| Nuclei_Granularity_1_Mito | 0.461331 |
| Cytoplasm_Texture_SumEntropy_RNA_5_0 | 0.465899 |
| Nuclei_RadialDistribution_MeanFrac_AGP_4of4 | 0.468284 |
| Nuclei_RadialDistribution_MeanFrac_ER_4of4 | 0.468673 |
| Nuclei_RadialDistribution_MeanFrac_Mito_4of4 | 0.471352 |
| Cytoplasm_Texture_SumEntropy_RNA_20_0 | 0.474167 |
| Cytoplasm_Texture_SumEntropy_RNA_10_0 | 0.477463 |
| Nuclei_Intensity_LowerQuartileIntensity_DNA | 0.483370 |
| Nuclei_Texture_InfoMeas1_DNA_5_0 | 0.500611 |

| Feature name | Correlation (Pearson) |
| --- | --- |
| Nuclei_Intensity_UpperQuartileIntensity_RNA | -0.567669 |
| Nuclei_Intensity_StdIntensity_RNA | -0.546395 |
| Nuclei_Intensity_MeanIntensity_RNA | -0.543263 |
| Cells_Intensity_StdIntensity_RNA | -0.542904 |
| Cells_Intensity_MaxIntensity_RNA | -0.542100 |
| Nuclei_Intensity_MADIntensity_RNA | -0.540240 |
| Nuclei_Intensity_MaxIntensity_RNA | -0.540037 |
| Nuclei_Intensity_StdIntensity_AGP | -0.535854 |
| Nuclei_Intensity_MADIntensity_AGP | -0.527600 |
| Nuclei_Intensity_UpperQuartileIntensity_ER | -0.521866 |

Gradient analysis score

| Feature name | Correlation (Pearson) |
| --- | --- |
| Cells_Texture_InverseDifferenceMoment_RNA_5_0 | -0.513449 |
| Cytoplasm_Intensity_IntegratedIntensityEdge_AGP | -0.510165 |
| Cells_Intensity_IntegratedIntensityEdge_AGP | -0.507586 |
| Cytoplasm_AreaShape_MaximumRadius | -0.501875 |
| Cells_Texture_InverseDifferenceMoment_RNA_10_0 | -0.500681 |
| Cytoplasm_Intensity_IntegratedIntensityEdge_RNA | -0.497952 |
| Cytoplasm_Texture_InverseDifferenceMoment_RNA_5_0 | -0.495278 |
| Cytoplasm_AreaShape_MeanRadius | -0.493570 |
| Cells_Texture_InverseDifferenceMoment_RNA_20_0 | -0.490450 |
| Cells_AreaShape_MinorAxisLength | -0.488041 |

| Feature name | Correlation (Pearson) |
| --- | --- |
| Cells_Texture_DifferenceEntropy_RNA_10_0 | 0.492800 |
| Cytoplasm_Texture_InfoMeas1_RNA_10_0 | 0.493645 |
| Cells_Texture_Contrast_RNA_5_0 | 0.494087 |
| Cytoplasm_Texture_DifferenceEntropy_RNA_5_0 | 0.494265 |
| Cells_Texture_DifferenceEntropy_RNA_5_0 | 0.506087 |
| Cells_Texture_InfoMeas1_DNA_5_0 | 0.513774 |
| Cells_Texture_InfoMeas1_DNA_10_0 | 0.514939 |
| Cells_Texture_InfoMeas1_RNA_10_0 | 0.519832 |
| Cytoplasm_Texture_InfoMeas1_RNA_5_0 | 0.522585 |
| Cells_Texture_InfoMeas1_RNA_5_0 | 0.541869 |

AnneCarpenter commented Oct 17, 2022 via email

EchteRobert commented

> Does this change the cells that would be green and red then?

It could change them, yes, and I think they did (but I don't have a lot of experience with analyzing cells by eye).

> By the way, for the colorblind you will eventually want to change to another color scheme.

Yes, I will change that!

EchteRobert commented

Model trained on 3.33 uM dose point.

3.33 uM

| plate | mAP model | mAP BM | mAP filtered BM | mAP shuffled |
| --- | --- | --- | --- | --- |
| all plates | 0.0456 | 0.0324 | 0.0323 | 0 |

10 uM

| plate | mAP model | mAP BM | mAP filtered BM | mAP shuffled |
| --- | --- | --- | --- | --- |
| all plates | 0.0475 | 0.034 | 0.034 | 0.0002 |
