
P2 02. Final model for the LINCS dataset (batch 1) #13

Open · EchteRobert opened this issue Oct 5, 2022 · 10 comments

EchteRobert commented Oct 5, 2022

Here I trained a model on all available data from batch 1 of the LINCS dataset, which can be listed with `aws s3 ls s3://cellpainting-gallery/cpg0004-lincs/broad/workspace/backend/2016_04_01_a549_48hr_batch1/`.

The model uses 1745 features because of an issue with 10 plates (broadinstitute/lincs-cell-painting#88 (comment)). In total, I trained the model on 136 plates and 5965 wells, covering 1228 unique compounds at the 10 uM dose point. During preprocessing I removed 1587 wells due to missing MoA or compound name (pert_iname) annotations. I used the following hyperparameters:

| Hyperparameter | Value |
| --- | --- |
| batch size | 36 |
| epochs | 100 |
| kFilters | 0.5 |
| latent dim | 2048 |
| learning rate | 0.0005 |
| nr cells | (1500, 800) |
| nr sets | 8 |
| optimizer | AdamW |
| output dim | 2048 |
| true batch size | 288 |
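
For reference, a minimal sketch of the same settings as a configuration dict; the key names here are hypothetical and the actual training script may organize them differently:

```python
# Hypothetical training configuration mirroring the hyperparameter table above.
config = {
    "batch_size": 36,
    "epochs": 100,
    "kFilters": 0.5,
    "latent_dim": 2048,
    "learning_rate": 5e-4,
    "nr_cells": (1500, 800),
    "nr_sets": 8,              # 36 (batch size) x 8 (nr sets) = 288, the "true batch size"
    "optimizer": "AdamW",
    "output_dim": 2048,
    "true_batch_size": 288,
}
```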

I assess the model on the 10 uM dose point using replicate and MoA prediction, and similarly on the 3.33 uM dose point, which is considered the test set.
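
For context, here is a minimal sketch of retrieval-style mean average precision (mAP) as used for replicate prediction (MoA prediction is analogous, grouping by MoA instead of compound). This is an illustrative implementation, not necessarily the exact evaluation code used here.

```python
import numpy as np

def mean_average_precision(profiles: np.ndarray, labels: np.ndarray) -> float:
    """Retrieval mAP: each profile queries all others; profiles with the same label are hits."""
    normed = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
    sim = normed @ normed.T  # cosine similarity between all well-level profiles
    aps = []
    for i in range(len(labels)):
        order = np.argsort(-sim[i])
        order = order[order != i]                 # exclude the query itself
        hits = labels[order] == labels[i]         # same compound (replicate) = hit
        if not hits.any():
            continue
        ranks = np.nonzero(hits)[0] + 1           # 1-based ranks of the hits
        precision_at_hits = np.cumsum(hits)[hits] / ranks
        aps.append(precision_at_hits.mean())
    return float(np.mean(aps))
```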

Results

  • The model significantly improves upon the average baseline for replicate and MoA prediction for both the 10 and 3.33 uM dose points.
  • It improves the mAP by 60% and 30% for the 10 uM dose point (training) and 3.33 uM dose point (test) data, respectively.
  • I could have trained the model a bit longer, e.g. 150 epochs, as the validation mAP and loss had not yet fully converged.

Results 10 uM dose point

Replicate prediction

Welch's t-test between mlp mAP and bm mAP: Ttest_indResult(statistic=84.81208433212997, pvalue=0.0)

| plate | Training mAP model | Training mAP BM | Training mAP shuffled |
| --- | --- | --- | --- |
| all plates | 0.7473 | 0.269 | 0 |
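
The t-tests here compare the two mAP distributions (model vs. baseline). A minimal sketch, assuming hypothetical arrays `mlp_map` and `bm_map` holding the mAP values:

```python
from scipy import stats

def compare_map(mlp_map, bm_map):
    """Welch's t-test (unequal variances) between model and baseline mAP values."""
    return stats.ttest_ind(mlp_map, bm_map, equal_var=False)
```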

MoA prediction

Welch's t-test between mlp mAP and bm mAP: Ttest_indResult(statistic=6.753694914168434, pvalue=1.5518902810751288e-11)

| plate | mAP model | mAP BM | mAP shuffled |
| --- | --- | --- | --- |
| all plates | 0.0541 | 0.0338 | 0.0002 |

Results 3.33 uM dose point

Replicate prediction

Welch's t-test between mlp mAP and bm mAP: Ttest_indResult(statistic=49.02599189522616, pvalue=0.0)

| plate | Training mAP model | Training mAP BM | Training mAP shuffled |
| --- | --- | --- | --- |
| all plates | 0.4465 | 0.1695 | 0 |

MoA prediction

Welch's t-test between mlp mAP and bm mAP: Ttest_indResult(statistic=3.525483296865904, pvalue=0.0004250301209859708)

| plate | mAP model | mAP BM | mAP shuffled |
| --- | --- | --- | --- |
| all plates | 0.042 | 0.0322 | 0 |

Loss curves [two screenshots]
All plate names SQ00014812_SQ00014813_SQ00014814_SQ00014815_SQ00014816_SQ00014817_SQ00014818_SQ00014819_SQ00014820_SQ00015041_SQ00015042_SQ00015043_SQ00015044_SQ00015045_SQ00015046_SQ00015047_SQ00015048_SQ00015049_SQ00015050_SQ00015051_SQ00015052_SQ00015053_SQ00015054_SQ00015055_SQ00015056_SQ00015057_SQ00015058_SQ00015059_SQ00015096_SQ00015097_SQ00015098_SQ00015099_SQ00015100_SQ00015101_SQ00015102_SQ00015103_SQ00015105_SQ00015106_SQ00015107_SQ00015108_SQ00015109_SQ00015110_SQ00015111_SQ00015112_SQ00015116_SQ00015117_SQ00015118_SQ00015119_SQ00015120_SQ00015121_SQ00015122_SQ00015123_SQ00015124_SQ00015125_SQ00015126_SQ00015127_SQ00015128_SQ00015129_SQ00015130_SQ00015131_SQ00015132_SQ00015133_SQ00015134_SQ00015135_SQ00015136_SQ00015137_SQ00015138_SQ00015139_SQ00015140_SQ00015141_SQ00015142_SQ00015143_SQ00015144_SQ00015145_SQ00015146_SQ00015147_SQ00015148_SQ00015149_SQ00015150_SQ00015151_SQ00015152_SQ00015153_SQ00015154_SQ00015155_SQ00015156_SQ00015157_SQ00015158_SQ00015159_SQ00015160_SQ00015162_SQ00015163_SQ00015164_SQ00015165_SQ00015166_SQ00015167_SQ00015168_SQ00015169_SQ00015170_SQ00015171_SQ00015172_SQ00015173_SQ00015194_SQ00015195_SQ00015196_SQ00015197_SQ00015198_SQ00015199_SQ00015200_SQ00015201_SQ00015202_SQ00015203_SQ00015204_SQ00015205_SQ00015206_SQ00015207_SQ00015208_SQ00015209_SQ00015210_SQ00015211_SQ00015212_SQ00015214_SQ00015215_SQ00015216_SQ00015217_SQ00015218_SQ00015219_SQ00015220_SQ00015221_SQ00015222_SQ00015223_SQ00015224_SQ00015229_SQ00015230_SQ00015231_SQ00015232_SQ00015233
EchteRobert commented Oct 7, 2022

As discussed during yesterday's check-in, I have computed Figure 4D as in the LINCS manuscript. I only have the 3.33 and 10 uM dose points available. In general, we see that:

  • the model amplifies the strength of profiles that already achieve a high mAP with average profiling
  • the model also strengthens some profiles that were not detectable before (increasing their mAP from <0.05 to >0.1)
  • relatively few model profiles lose performance compared to average profiling, which strengthens the case for this model (the main cost being the additional computation time)

Figure 4D [image]
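
A minimal sketch of one way to visualize this comparison, assuming hypothetical per-profile mAP arrays `map_average` (average profiling) and `map_model` (model profiling) for a single dose point; the manuscript's Figure 4D may use a different layout:

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_map_comparison(map_average: np.ndarray, map_model: np.ndarray) -> None:
    """Scatter per-profile mAP of model profiling against average profiling."""
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.scatter(map_average, map_model, s=8, alpha=0.6)
    lim = float(max(map_average.max(), map_model.max()))
    ax.plot([0, lim], [0, lim], "k--", linewidth=1)  # y = x: points above it improve over the baseline
    ax.set_xlabel("mAP, average profiling")
    ax.set_ylabel("mAP, model profiling")
    plt.show()
```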

EchteRobert commented Oct 13, 2022

Interpretability analysis rerun for LINCS data

From plate SQ00015142 I inspected images from well B13, which is 10 uM sulfafurazole, and computed the same saliencies as before for the Stain data. I chose the plate based on its large file size and the well at random. I first tried inspecting SQ00015106, but the seeding was so sparse that picking the top and bottom saliency cells yielded only a handful of cells in total. The seeding generally seems to be less dense than in the Stain experiments.
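
As a reminder of the selection step, a minimal sketch of picking the extreme-saliency cells for inspection, assuming a hypothetical per-cell DataFrame `cells` with a `saliency` column plus image/location metadata:

```python
import pandas as pd

def extreme_saliency_cells(cells: pd.DataFrame, n: int = 20):
    """Return the n highest- and n lowest-saliency cells for visual inspection."""
    return cells.nlargest(n, "saliency"), cells.nsmallest(n, "saliency")
```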

Main takeaways

No conclusion can be drawn from these results because the high and low saliency cells are not consistent in their appearance.

Images here!

[Six screenshots of high- and low-saliency cells]

EchteRobert commented Oct 14, 2022

Interpretability analysis rerun for LINCS data

From plate SQ00015131 I inspected images from well E13, which is 10 uM ganetespib with HSP inhibitor as its MoA, and computed the same saliencies as before for the Stain data. This MoA showed the largest relative improvement at both the 3.33 and 10 uM dose points when using model profiling versus average profiling.

Main takeaways

I think we can now see that the green-outlined cells tend to be brighter and have stronger contrast than the red-outlined cells. We also see that, again, features that compute the correlation between different channels are the most influential in deciding which cells are most or least important. IIUC, that means cells that are very flat are not important and cells that are 'fat' in the depth dimension are more important. The question then is: does it make sense that flat cells are less representative of the compound than fat cells?
I wonder if you can see a similar, more conclusive pattern here as well @AnneCarpenter?

Images here!

[Seven screenshots of high- and low-saliency cells]

bethac07 commented

@EchteRobert Do you happen to know which version of CellProfiler your features were made in, 3.X or 4.X? I don't know if it's fatal to your analysis, but we realized as we were putting 4.0 together that Costes features in CP3.X are improperly calculated.

EchteRobert commented

Ah, that's interesting @bethac07. According to the LINCS manuscript, it was version 2.3.1, so I'm guessing they were improperly calculated there as well. I'm wondering what the model is picking up then... Do you know how exactly they are calculated?

bethac07 commented

My level of understanding from memory (which I cannot stress enough may be wrong) and a bit of digging is this: Costes measurements are a special case of the Manders coefficient (which looks at which parts of an image are above a threshold in each of two channels), where in Costes that threshold is defined in a particular way. In at least CellProfiler 3, but possibly/probably also 2.3, there was an assumption that there were only 255 gray levels (numerical values), which is true for 8-bit images but wrong for the 16-bit images used here, which have 65535 gray levels. So the threshold was essentially always set to 255, which most of the image is brighter than, so the calculated correlation coefficients were nearly always 1.

So basically, I think it was measuring "pixels brighter than 255"?
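
To illustrate the effect described above (this is a minimal sketch with simulated intensities, not CellProfiler's actual code), a threshold stuck at 255 on 16-bit data passes nearly every pixel, so the coefficient saturates near 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated 16-bit pixel intensities for two channels (0-65535 scale);
# most foreground pixels are far brighter than 255.
ch1 = rng.integers(0, 20000, size=10_000).astype(np.float64)
ch2 = rng.integers(0, 20000, size=10_000).astype(np.float64)

threshold = 255  # 8-bit assumption applied to 16-bit data

mask2 = ch2 > threshold
print(f"fraction of ch2 pixels above threshold: {mask2.mean():.3f}")  # ~0.99

# Manders-style M1: fraction of channel-1 intensity in pixels where channel 2
# exceeds the threshold. With nearly every pixel above 255, it saturates near 1.
m1 = ch1[mask2].sum() / ch1.sum()
print(f"Manders-style M1 coefficient: {m1:.3f}")  # ~1.0
```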

EchteRobert commented Oct 17, 2022

Great catch Beth! I checked the values of those Costes Correlation features and they are indeed all equal (or almost equal) to 1. To find these features I simply looked at which features had the highest (absolute) saliency values. I think there are two possible explanations as to why these features popped up as 'most salient':

  1. Because a feature value of 1 is relatively large (I normalize all feature values within the plate). This explanation fits the L1 norm activation-based saliency method best.
  2. It also makes sense for the gradient-based saliency, as this method looks for features that, when changed, could strongly influence the outcome. Possibly, when most cells have a value of 1 for Costes Correlation and a few do not, those few cells become very important for the model's prediction.

Instead, I now calculated the correlation between saliency and feature values (something I also did before), and that points to different features, which I hope do have some actual meaning 😄. Below are the results for this particular well, for each of the saliency scores.
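
A minimal sketch of this correlation analysis, assuming hypothetical per-cell inputs: `features` (cells × plate-normalized CellProfiler features) and `saliency` (one score per cell):

```python
import pandas as pd

def saliency_feature_correlation(features: pd.DataFrame, saliency: pd.Series) -> pd.Series:
    """Pearson correlation of each feature with the per-cell saliency score."""
    return features.corrwith(saliency, method="pearson").sort_values()

# corr = saliency_feature_correlation(features, saliency)
# corr.tail(10)  # most positively correlated features
# corr.head(10)  # most negatively correlated features
```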

Main takeaways

  • The two major feature types are intensity and texture based features.
  • Interestingly, intensity-based features (particularly DNA) were also found to be important for the JUMP data; here, DNA is not among the top correlated features.
  • Texture-based features are new, which may have to do with the lower seeding density of the wells in LINCS.

Combined saliency score

| Feature name | Correlation (Pearson) |
| --- | --- |
| Cytoplasm_Texture_SumAverage_RNA_10_0 | 0.381074 |
| Cytoplasm_Texture_SumAverage_RNA_20_0 | 0.391287 |
| Cytoplasm_Correlation_Manders_AGP_RNA | 0.408331 |
| Nuclei_Texture_InfoMeas1_DNA_5_0 | 0.435283 |
| Cytoplasm_Texture_SumEntropy_RNA_20_0 | 0.438746 |
| Cytoplasm_Texture_Entropy_RNA_5_0 | 0.442981 |
| Cytoplasm_Texture_SumEntropy_RNA_10_0 | 0.443986 |
| Cytoplasm_Texture_Entropy_RNA_10_0 | 0.445096 |
| Cytoplasm_Texture_SumEntropy_RNA_5_0 | 0.445393 |
| Cytoplasm_Texture_Entropy_RNA_20_0 | 0.462262 |

| Feature name | Correlation (Pearson) |
| --- | --- |
| Cytoplasm_Texture_AngularSecondMoment_RNA_10_0 | -0.454882 |
| Cytoplasm_Texture_AngularSecondMoment_RNA_5_0 | -0.451614 |
| Cytoplasm_Texture_AngularSecondMoment_RNA_20_0 | -0.448406 |
| Nuclei_Intensity_IntegratedIntensity_RNA | -0.445297 |
| Nuclei_Intensity_IntegratedIntensity_Mito | -0.436190 |
| Cells_Intensity_MaxIntensity_RNA | -0.432253 |
| Nuclei_Intensity_MaxIntensity_RNA | -0.431139 |
| Nuclei_Intensity_IntegratedIntensity_ER | -0.422684 |
| Nuclei_Intensity_IntegratedIntensity_AGP | -0.413577 |
| Nuclei_Texture_Correlation_DNA_5_0 | -0.382188 |

L1 norm activation saliency score

| Feature name | Correlation (Pearson) |
| --- | --- |
| Nuclei_Texture_SumAverage_DNA_5_0 | 0.461082 |
| Nuclei_Granularity_1_Mito | 0.461331 |
| Cytoplasm_Texture_SumEntropy_RNA_5_0 | 0.465899 |
| Nuclei_RadialDistribution_MeanFrac_AGP_4of4 | 0.468284 |
| Nuclei_RadialDistribution_MeanFrac_ER_4of4 | 0.468673 |
| Nuclei_RadialDistribution_MeanFrac_Mito_4of4 | 0.471352 |
| Cytoplasm_Texture_SumEntropy_RNA_20_0 | 0.474167 |
| Cytoplasm_Texture_SumEntropy_RNA_10_0 | 0.477463 |
| Nuclei_Intensity_LowerQuartileIntensity_DNA | 0.483370 |
| Nuclei_Texture_InfoMeas1_DNA_5_0 | 0.500611 |

| Feature name | Correlation (Pearson) |
| --- | --- |
| Nuclei_Intensity_UpperQuartileIntensity_RNA | -0.567669 |
| Nuclei_Intensity_StdIntensity_RNA | -0.546395 |
| Nuclei_Intensity_MeanIntensity_RNA | -0.543263 |
| Cells_Intensity_StdIntensity_RNA | -0.542904 |
| Cells_Intensity_MaxIntensity_RNA | -0.542100 |
| Nuclei_Intensity_MADIntensity_RNA | -0.540240 |
| Nuclei_Intensity_MaxIntensity_RNA | -0.540037 |
| Nuclei_Intensity_StdIntensity_AGP | -0.535854 |
| Nuclei_Intensity_MADIntensity_AGP | -0.527600 |
| Nuclei_Intensity_UpperQuartileIntensity_ER | -0.521866 |

Gradient analysis score

| Feature name | Correlation (Pearson) |
| --- | --- |
| Cells_Texture_InverseDifferenceMoment_RNA_5_0 | -0.513449 |
| Cytoplasm_Intensity_IntegratedIntensityEdge_AGP | -0.510165 |
| Cells_Intensity_IntegratedIntensityEdge_AGP | -0.507586 |
| Cytoplasm_AreaShape_MaximumRadius | -0.501875 |
| Cells_Texture_InverseDifferenceMoment_RNA_10_0 | -0.500681 |
| Cytoplasm_Intensity_IntegratedIntensityEdge_RNA | -0.497952 |
| Cytoplasm_Texture_InverseDifferenceMoment_RNA_5_0 | -0.495278 |
| Cytoplasm_AreaShape_MeanRadius | -0.493570 |
| Cells_Texture_InverseDifferenceMoment_RNA_20_0 | -0.490450 |
| Cells_AreaShape_MinorAxisLength | -0.488041 |

| Feature name | Correlation (Pearson) |
| --- | --- |
| Cells_Texture_DifferenceEntropy_RNA_10_0 | 0.492800 |
| Cytoplasm_Texture_InfoMeas1_RNA_10_0 | 0.493645 |
| Cells_Texture_Contrast_RNA_5_0 | 0.494087 |
| Cytoplasm_Texture_DifferenceEntropy_RNA_5_0 | 0.494265 |
| Cells_Texture_DifferenceEntropy_RNA_5_0 | 0.506087 |
| Cells_Texture_InfoMeas1_DNA_5_0 | 0.513774 |
| Cells_Texture_InfoMeas1_DNA_10_0 | 0.514939 |
| Cells_Texture_InfoMeas1_RNA_10_0 | 0.519832 |
| Cytoplasm_Texture_InfoMeas1_RNA_5_0 | 0.522585 |
| Cells_Texture_InfoMeas1_RNA_5_0 | 0.541869 |

AnneCarpenter commented Oct 17, 2022 via email

EchteRobert commented

> Does this change the cells that would be green and red then?

It could change them, yes, and I think they did (but I don't have a lot of experience with analyzing cells by eye).

> By the way, for the colorblind you will eventually want to change to another color scheme.

Yes, I will change that!

EchteRobert commented

Model trained on 3.33 uM dose point.

3.33 uM

| plate | mAP model | mAP BM | mAP filtered BM | mAP shuffled |
| --- | --- | --- | --- | --- |
| all plates | 0.0456 | 0.0324 | 0.0323 | 0 |

10 uM

| plate | mAP model | mAP BM | mAP filtered BM | mAP shuffled |
| --- | --- | --- | --- | --- |
| all plates | 0.0475 | 0.034 | 0.034 | 0.0002 |
