
03. Model for Stain2 #5

Open · EchteRobert opened this issue Feb 28, 2022 · 20 comments

EchteRobert commented Feb 28, 2022

It is now clear that this feature aggregation model will only serve a certain feature set (i.e., a certain line of datasets); it is not designed to aggregate arbitrary feature sets, only to be invariant to the number of cells per well. I will start by creating a model that can beat the 'mean aggregation' baselines of the Stain2 batches, then move on to Stain3 and Stain4, and finally use Stain5 as the final test set.

Because of that, it would be ideal if all features were the same across Stain datasets. This is (somewhat) the case across Stain2, Stain3, and Stain4. However, Stain5 has a slightly different CellProfiler pipeline, resulting in a different and larger feature set. During preprocessing I found that the pipeline from raw single-cell features to data that can be fed directly to the model is quite slow. This is especially true when all features are used (4295 for Stain2-4 and 5794 for Stain5). Model inference and training also become increasingly slow as the number of features grows. From the initial experiments on CPJUMP1 we saw that not all features are needed to create a better profile than the baseline (#1). This is why I have chosen to select only the features common to Stain2-5. This has the advantage of speed, both in preprocessing and inference, and of compatibility, as no separate model will have to be trained to use Stain5 as the test set.

Assuming that the features across Stain2, Stain3, Stain4, and Stain5 are consistent within each experiment, there are 1324 features measured in all of them. The features are well distributed in terms of category: Cells: 441 features, Cytoplasm: 433 features, and Nuclei: 450 features. 1124 of them are decently uncorrelated (absolute Pearson correlation < 0.5) [one plate tested]. From here on, these are the features that will be used to train the model.
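As a rough illustration, a pandas sketch of this selection; the file names are hypothetical and the greedy correlation filter is my reading of the procedure, not the exact code used here:

```python
import pandas as pd

# Hypothetical per-plate CSVs of well-level features; one plate per Stain dataset.
stains = {name: pd.read_csv(f"{name}_plate.csv")
          for name in ["Stain2", "Stain3", "Stain4", "Stain5"]}

def feature_cols(df):
    # CellProfiler feature columns fall into the Cells/Cytoplasm/Nuclei compartments.
    return {c for c in df.columns if c.startswith(("Cells_", "Cytoplasm_", "Nuclei_"))}

# Intersection of features measured in every Stain dataset (~1324 here).
common = sorted(set.intersection(*(feature_cols(df) for df in stains.values())))

# Greedy decorrelation on one plate: keep a feature only if its absolute
# Pearson correlation with every feature kept so far is below 0.5 (~1124 survive).
corr = stains["Stain2"][common].corr().abs()
kept = []
for f in common:
    if all(corr.loc[f, k] < 0.5 for k in kept):
        kept.append(f)
```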

EchteRobert commented Feb 28, 2022

The Stain2 experiment (https://github.com/jump-cellpainting/pilot-analysis/issues/15) contains 14 batches, of which only 1 will not be used to train the model. This is BR00112200 (Confocal), which contains fewer features than the other batches because it is missing the RNA channel. All other batches will be used to train or validate the model. See the overview below:

Beautiful colours here!

Note that the Percent Strong shown here is calculated with an additional sphering operation.

[Image: Screen Shot 2022-02-28 at 2 20 31 PM]
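For context, sphering (whitening) estimates a decorrelating linear transform, typically from negative-control profiles, and applies it to all profiles; I believe pycytominer ships a comparable Spherize operation. A minimal numpy sketch of ZCA whitening, assuming `controls` is a wells × features array:

```python
import numpy as np

def fit_sphering(controls, eps=1e-6):
    """Fit a ZCA-whitening transform on negative-control profiles
    (controls: wells x features array)."""
    mu = controls.mean(axis=0)
    cov = np.cov(controls - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T  # rotate, scale, rotate back
    return mu, W

def sphere(profiles, mu, W):
    # Decorrelate all profiles with the transform learned on the controls.
    return (profiles - mu) @ W
```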

The Percent Strong/Replicating with feature-selected features - no sphering:

| Description | Percent Replicating |
| --- | --- |
| BR00113818.csv | 51.1 |
| BR00113819.csv | 51.1 |
| BR00113821.csv | 51.1 |
| BR00113820.csv | 56.7 |
| BR00112198.csv | 55.6 |
| BR00112204.csv | 63.3 |
| BR00112199.csv | 58.9 |
| BR00112200.csv | 63.3 |
| BR00112201.csv | 70.0 |
| BR00112197repeat.csv | 63.3 |
| BR00112203.csv | 52.2 |
| BR00112202.csv | 56.7 |
| BR00112197binned.csv | 58.9 |
| BR00112197standard.csv | 66.7 |

The Percent Strong/Replicating with the 1324 features as used by the model - I will use this as the reference BM:

| Description | Percent Replicating |
| --- | --- |
| BR00113818.csv | 52.2 |
| BR00113819.csv | 48.9 |
| BR00113821.csv | 47.8 |
| BR00113820.csv | 55.6 |
| BR00112198.csv | 56.7 |
| BR00112204.csv | 58.9 |
| BR00112199.csv | 57.8 |
| BR00112201.csv | 66.7 |
| BR00112197repeat.csv | 63.3 |
| BR00112203.csv | 56.7 |
| BR00112202.csv | 54.4 |
| BR00112197binned.csv | 58.9 |
| BR00112197standard.csv | 56.7 |
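For readers unfamiliar with the metric: Percent Replicating is, roughly, the percentage of compounds whose median replicate correlation exceeds a high percentile (e.g. the 95th) of a null distribution built from non-replicate groups. A simplified numpy sketch, assuming equal replicate counts per compound (the pilot-analysis notebooks are the authoritative implementation):

```python
import numpy as np

def percent_replicating(profiles, compounds, n_null=1000, pct=95, seed=0):
    """profiles: wells x features array; compounds: per-well compound labels."""
    rng = np.random.default_rng(seed)
    corr = np.corrcoef(profiles)  # well-by-well Pearson correlation
    compounds = np.asarray(compounds)

    def median_pairwise(idx):
        sub = corr[np.ix_(idx, idx)]
        return np.median(sub[np.triu_indices_from(sub, k=1)])

    uniq = np.unique(compounds)
    rep = np.array([median_pairwise(np.where(compounds == c)[0]) for c in uniq])
    k = np.sum(compounds == uniq[0])  # simplification: equal replicate counts
    null = np.array([median_pairwise(rng.choice(len(compounds), k, replace=False))
                     for _ in range(n_null)])
    return 100.0 * np.mean(rep > np.percentile(null, pct))
```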

EchteRobert commented Feb 28, 2022

Experiment 1

The first model is trained on BR00112197 binned, BR00112199 multiplane, and BR00112203 MitoCompare. These are the most distinct batches that could have been chosen; the feature distributions of all other batches are more similar to each other. The training and validation loss curves indicate slow but steady learning, and the model has not converged after 50 epochs. The PR is calculated for each batch as a whole, without the negative controls. The training data consists of 80% of each batch, meaning that the model has not seen the remaining 20% during training. The model will also be tested on a completely unseen batch.

Main Takeaways

  • The PR shows that the correlation between non-replicates is quite high, but the correlation between replicates is even higher. The model appears to cluster everything somewhat together, yet still separates the replicates adequately. This might indicate that it does not even use the full latent feature space yet.
  • Robust MAD normalization pushes the non-replicates more toward a zero-centered distribution; however, this comes at the cost of overall PR.
  • The model learns general aggregation methods, which also apply to a completely unseen batch: BR00113818 Redone.
  • Interestingly, the model performs slightly worse on the BR00112199 MultiPlane and BR00112197 binned batches, which it has partly seen during training, while it performs better on the BR00113818 Redone batch, which it has not seen before. The negative controls for these training plates have higher correlations than for the BR00113818 Redone plate.

Conclusion

The model shows promise in learning general aggregation methods that apply to unseen data, as long as the features remain constant. However, something unexpected is going on with the BR00112199 MultiPlane and BR00112197 binned batches. I will investigate whether these results are due to chance or whether something else is going on.

Results! Wooh!

[Image: Screen Shot 2022-02-28 at 2 24 06 PM]

BR00112203 MitoCompare - training data
[Image: Stain2_BR00112203_MitoCompare_PR]

BR00112203 MitoCompare Robust MAD normalized features
[Image: Stain2_BR00112203_MitoCompare_normalized_PR]

BR00112199 MultiPlane - training data
[Image: Stain2_BR00112199_MultiPlane_PR]

BR00112197 binned - training data
[Image: Stain2_BR00112197binned_PR]

BR00113818 Redone - not in training set
[Image: Stain2_BR00113818_Redone_PR]

EchteRobert commented Feb 28, 2022

While trying to find the cause of the possible issue described in #5 (comment), I found that the model creates a feature space that places profiles from the same batch closer together than the mean aggregation method does. Whether this is a good thing or not is not obvious to me. Note that BR00113818 is not in the training set of the MLP.
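The embeddings below were presumably produced with something like umap-learn; a sketch, where `mlp_profiles` and `mean_profiles` are hypothetical wells × features arrays and `plate_ids` colors the points by plate:

```python
import umap  # umap-learn
import matplotlib.pyplot as plt

# Hypothetical inputs: well-level profiles from the model and from mean aggregation.
emb_mlp = umap.UMAP(random_state=42).fit_transform(mlp_profiles)
emb_bm = umap.UMAP(random_state=42).fit_transform(mean_profiles)

# Coloring by plate reveals whether same-plate profiles cluster together.
plt.scatter(emb_mlp[:, 0], emb_mlp[:, 1], c=plate_ids, s=8, cmap="tab10")
plt.title("MLP profiles")
plt.show()
```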

Look at these patterns!

[Image: UMAP_MLP]

[Image: UMAP_BM]

EchteRobert commented Mar 1, 2022

Experiment 1 (continued)

As the model improved the PR/PS over the baseline on all of the previous plates, I will now test it on 5 more plates from the Stain2 dataset: BR00113818_Redone, BR00113819_Redone, BR00113820_Redone, BR00113821_Redone, and BR00112197_repeat. The PR/PS is reported below. I also plotted histograms of the number of cells per well for each plate.

Main takeaways

The model performs similarly to or better than the average aggregation method for 3 out of 5 plates. However, it significantly underperformed on the remaining two. I expected this to be related to the average number of cells present in the plates. Looking at the histograms of these two plates (BR00113820_Redone and BR00113821_Redone), we can see that this might indeed be the cause, as these two plates have a different distribution of cells per well and fewer cells overall.

Later addition: As discussed with @shntnu, I calculated the PC1 loadings per plate and the correlation between these loadings; see below. It shows that especially BR00112203 (training), BR00113819, BR00113820, and BR00113821 do not correlate well with the other plates in terms of PC1 loadings, i.e. other features are more important for describing the profiles of these plates. Note also that BR00112203 and BR00112199 are used as 2 of the 3 training plates, while they correlate particularly poorly with the two underperforming plates. Because BR00112203 (training) has the highest PR while its PC1 loadings correlate relatively weakly with those of all other plates, the model is expected to perform worse on the other plates.

Conclusion: the plates used during training probably bias the model toward a specific set of features, which are less relevant for the poorly performing plates.
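A sketch of the PC1-loadings comparison, assuming `plates` is a hypothetical dict mapping plate name to a wells × features array over the same 1324 features:

```python
import numpy as np
from sklearn.decomposition import PCA

# PC1 loadings: the first principal axis of each plate's profile matrix.
loadings = np.stack([PCA(n_components=1).fit(X).components_[0]
                     for X in plates.values()])

# Pearson correlation between the PC1 loadings of every pair of plates;
# a low value means different features dominate the two plates' variance.
# (PC loadings have an arbitrary sign, so |corr| may be the fairer comparison.)
pc1_similarity = np.corrcoef(loadings)
```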

Are you ready for this?

BR00112197_repeat
[Image: Stain2_BR00112197repeat_PR]

BR00113818_Redone
[Image: Stain2_BR00113818_Redone_PR]

BR00113819_Redone
[Image: Stain2_BR00113819_Redone_PR]

BR00113820_Redone
[Image: Stain2_BR00113820_Redone_PR]

BR00113821_Redone
[Image: Stain2_BR00113821_Redone_PR]

Don't forget to look at these!

[Image: BR00112197binned_hist]

[Image: BR00113820_hist]

[Image: BR00113821_hist]

This is additional stuff. Perhaps not as interesting as the first bit? You decide.

[Image: BR00112197repeat_hist]

[Image: BR00112199_hist]

[Image: BR00112203_hist]

[Image: BR00113818_hist]

[Image: BR00113819_hist]

PC1 loadings per plate
[Image: PC1_loadings_Stain2]

Number of cells per well per plate summary
[Image: Stain2_cells]

niranjchandrasekaran commented

> The model performs similarly to or better than the average aggregation method for 3 out of 5 plates. However, it significantly underperformed on the remaining two.

@EchteRobert Quick question - did you recompute Percent Replicating for the baseline using the 1324 features or are these values from the original baseline in https://github.com/jump-cellpainting/pilot-analysis/issues/15#issuecomment-670640802? If it is the latter, I would recommend doing the former so that we are comparing apples to apples.

Also, the cell count histograms surprised me. Given that the only difference between the plates is the dye concentration, I did not expect to see such a huge difference in the number of cells between plates.

EchteRobert commented

I did not, @niranjchandrasekaran. Good point. I will recalculate the baseline with the 1324 features.

Yes, it surprised me a bit too, although I cannot explain why that would be the case. In fact, in these two plates I encountered the first well that did not contain any cells at all.

niranjchandrasekaran commented

On checking the table in #5 (comment), I just realized that the two plates BR00113820_Redone and BR00113821_Redone have a different cell seeding density compared to the other plates. So they are expected to have a different number of cells.

EchteRobert commented Mar 2, 2022

Experiment (intermediate)

The previous results showed a high non-replicate correlation and, although the replicate correlation was even higher, we would rather see a lower non-replicate correlation, which would represent a cleaner profile, i.e. a sharper contrast between replicates and non-replicates.
To test this, John proposed changing my current feature normalization method (zero mean, unit standard deviation) to RobustMAD (sketched below). Second, I doubled the batch size during training. This means there are more negative pairs per batch (the number of pairs grows quadratically with batch size), which may push the learned profiles further apart.
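A minimal sketch of RobustMAD normalization (median-center, scale by the median absolute deviation); the epsilon guard for zero-MAD features is my addition:

```python
import numpy as np
from scipy.stats import median_abs_deviation

def robust_mad(X, eps=1e-6):
    """X: cells x features array. Center by the median and scale by the MAD
    (scale='normal' makes the MAD consistent with a Gaussian std)."""
    med = np.median(X, axis=0)
    mad = median_abs_deviation(X, axis=0, scale="normal")
    return (X - med) / (mad + eps)
```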

Main takeaways

The increased batch size in combination with RobustMAD normalization showed that the model has an extremely hard time learning. Upon inspecting the model's gradients, I saw that they vanished almost instantly within the first epochs. Returning to the original normalization removed this effect and allowed for better training.

Click here!

[Image: Screen Shot 2022-03-02 at 3 01 07 PM]

BR00112203 plate (previously highest PR)
[Image: Stain2_BR00112203_exp2_BS128_PR]

EchteRobert commented Mar 2, 2022

Experiment 2

As RobustMAD did not do what was expected and the non-replicate correlation did not decrease either, likely because the model was not learning at all, I trained another model with the previous normalization and a batch size of 80 (instead of the 128 used in the previous post). I also moved to 'cleaner' data (all 'green' plates as indicated in the table in #5 (comment)), which may cause the model to perform worse on the 'non-green' plates.

Main takeaways

The model is able to push the non-replicate correlation down somewhat; however, this comes at the cost of overfitting: it achieves this on the training plates, but not on the validation plates. I expect that more data will be needed to achieve the best of both worlds.

Losses and PRs!

[Image: Screen Shot 2022-03-02 at 4 17 39 PM]

BR00112197 standard - training data
[Image: Stain2_BR00112197standard_exp2_PR]

BR00113818 - non-training data
[Image: Stain2_BR00113818_PR]

EchteRobert commented Mar 2, 2022

Experiment 3

In #5 (comment) I showed that the model learns to amplify the plate-specific signal in the cell profiles. To counteract that, I trained a model that also learns from across-plate replicates. Additionally, one possible reason the non-replicate correlation has been so high so far may be that the model learns to separate the plates themselves. By doing that, it automatically pushes all same-plate profiles together, and non-replicate profile correlation becomes higher in general. Perhaps including across-plate replicates will reduce this effect by making the model utilize the full latent space.

Main takeaways

Non-replicate correlation does indeed appear to decrease somewhat, as expected, at least for the training plates. However, the model is overfitting very clearly, and the overall performance with respect to the previous model is much lower. Decreasing the batch size and increasing the number of plates used for training does not solve this problem. I suspect the model is memorizing specific compounds rather than learning an aggregation method.

UMAP patterns here!

UMAP BM, same plates as in #5 (comment)
[Image: UMAP_BM]

UMAP MLP
[Image: UMAP_MLP]

UMAP BM, training plates
['BR00112197standard': 0, 'BR00112199': 1, 'BR00112197repeat': 2]
[Image: UMAP_BM_train]

UMAP MLP, training plates
[Image: UMAP_MLP_train]

Percent histograms here!

Training plates

[Image: Stain2_BR00112197standard_PR]

[Image: Stain2_BR00112197repeat_PR]

[Image: Stain2_BR00112199_PR copy]

Test plate
[Image: Stain2_BR00113818_PR]

shntnu commented Mar 4, 2022

> As discussed with @shntnu, I calculated the PC1 loadings per plate and the correlation between these loadings.

@EchteRobert Awesome! What you essentially did here was measure the distribution similarity between all pairs of plates. The first PC is a quick way to do that.

Comparing the PC1 loadings of two multivariate distributions is a shortcut for comparing their covariance matrices. If the distributions are truly multivariate Gaussian (good luck with that, haha!), then it's actually a very good approximation (to the extent that PC1 explains a large fraction of the variance).

If you really want to go down this rabbit hole (⚠️ stop, don't! ⚠️), read up.

EchteRobert changed the title from "03. Model for all Stain datasets (2, 3, 4, and 5)" to "03. Model for Stain2 dataset" on Mar 8, 2022
EchteRobert commented Mar 8, 2022

Experiment 3V2

Learning from previous experiments, I used the following experiment setup:

  • Use 5 plates as training/validation data, the ones that have the lowest correlation with the other plates based on the PC1 loadings shown in #5 (comment). These are: ['BR00112197binned_FS', 'BR00112199_FS', 'BR00112203_FS', 'BR00113818_FS', 'BR00113820_FS']
  • Replicates are only considered within wells, as across-well replicates lead to poor performance (and it is perhaps also not a sensible training method given the evaluation method/goal of the model).
  • A larger batch size of 72 was used, to increase the number of negative pairs per batch.
  • 500 cells were consistently sampled 3 times per well per batch (this is no different from other experiments, but it may change in future ones, so I am pointing it out here).
  • The cosine similarity distance metric is used instead of the SNR distance, to ensure that hard positive mining is performed during the SupCon loss calculation (a minimal sketch of such a loss follows this list). We will see that this also changes the latent loss space for the better.
  • The number of parameters in the model is increased fourfold to reduce underfitting.
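A minimal PyTorch sketch of a supervised-contrastive (SupCon-style) loss on cosine similarity; this is a bare-bones version without the hard positive mining used here, and the temperature value is illustrative:

```python
import torch
import torch.nn.functional as F

def supcon_cosine_loss(z, labels, temperature=0.1):
    """z: (B, D) embeddings of sampled wells; labels: (B,) well/compound ids.
    Same-label embeddings are pulled together, the rest are pushed apart."""
    z = F.normalize(z, dim=1)                        # unit norm -> dot product = cosine
    sim = z @ z.T / temperature                      # (B, B) similarity logits
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))  # never contrast with yourself
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Mean log-probability over each anchor's positive pairs.
    loss = -(log_prob.masked_fill(~pos, 0.0)).sum(1) / pos.sum(1).clamp(min=1)
    return loss.mean()
```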

Below I will show:

  • The PC1 loadings of the model-aggregated cell profiles
  • The PR of all 13 plates in Stain2
  • The (mean) mean average precision (mAP) of the training and validation compounds for the benchmark (mean aggregation) and the model (a sketch of this metric follows this list). For the worst-performing ones I will show the mAP per compound for that validation set.
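For reference, a numpy sketch of how such a replicate-retrieval mAP can be computed (my reading of the metric, not necessarily the exact implementation used):

```python
import numpy as np

def mean_average_precision(profiles, compounds):
    """Rank all other profiles by cosine similarity for each well and score
    how early the same-compound replicates show up (average precision)."""
    X = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
    sim = X @ X.T
    np.fill_diagonal(sim, -np.inf)                # a profile cannot retrieve itself
    compounds = np.asarray(compounds)
    aps = []
    for i in range(len(X)):
        order = np.argsort(-sim[i])[:-1]          # self sorts last; drop it
        hits = (compounds[order] == compounds[i]).astype(float)
        if hits.sum() == 0:
            continue
        prec_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
        aps.append((prec_at_k * hits).sum() / hits.sum())
    return float(np.mean(aps))
```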

Main takeaways

  • The PC1 loadings of the model features are similar to those of the BM for most plates; however, the BR00113820 and BR00113821 plates are even stronger outliers now, and BR00112203 is a much smaller outlier.
  • The model now achieves higher PR scores than the baseline for all plates.
  • The PR distribution of the non-replicates is centered more around zero, due to the switch to cosine similarity (which is normalized).
  • The model has overfit the training set by quite a lot, as can be seen in both the PR and mAP scores.
  • The mAP of the MLP training compounds is higher than that of the BM training compounds, while the mAP of the MLP validation compounds is generally higher than that of the BM validation compounds. This observation shows the model's potential to generalize to unseen compounds. Perhaps with some form of regularization the model's generalization to unseen compound types can be increased.
PC1 loadings of the model profiles

[Image: PC1_loadings_MLP_Stain2exp3V2]

PR but in a new latent loss space!

| Plate | Percent Replicating |
| --- | --- |
| **Training** | |
| BR00112197binned | 88.9 |
| BR00112199 | 91.1 |
| BR00112203 | 88.9 |
| BR00113818 | 84.4 |
| BR00113820 | 97.8 |
| **Validation** | |
| BR00112197repeat | 72.2 |
| BR00112197standard | 72.2 |
| BR00112198 | 63.3 |
| BR00112201 | 72.2 |
| BR00112202 | 56.7 |
| BR00112204 | 61.1 |
| BR00113819 | 67.8 |
| BR00113821 | 50.0 |

[Image: Stain2_BR00113820_PR]

[Image: Stain2_BR00113821_PR]

A new metric approaches!

5 plates are used to train the model (as shown in the 'Plate' column). During training, 80% of the compounds are used to train the model and 20% of the compounds (the same ones for each plate) are used as a hold-out or validation set.

| Plate | Training compounds MLP | Training compounds BM | Validation compounds MLP | Validation compounds BM |
| --- | --- | --- | --- | --- |
| **Training** | | | | |
| BR00112197binned | 0.44 | 0.41 | 0.20 | 0.30 |
| BR00112199 | 0.38 | 0.32 | 0.20 | 0.28 |
| BR00112203 | 0.49 | 0.30 | 0.16 | 0.27 |
| BR00113818 | 0.43 | 0.28 | 0.17 | 0.30 |
| BR00113820 | 0.59 | 0.30 | 0.18 | 0.30 |
| **Validation** | | | | |
| BR00112197repeat | 0.29 | 0.41 | 0.25 | 0.31 |
| BR00112197standard | 0.32 | 0.40 | 0.27 | 0.28 |
| BR00112198 | 0.27 | 0.35 | 0.26 | 0.30 |
| BR00112201 | 0.26 | 0.40 | 0.22 | 0.32 |
| BR00112202 | 0.25 | 0.34 | 0.24 | 0.30 |
| BR00112204 | 0.24 | 0.35 | 0.25 | 0.29 |
| BR00113819 | 0.24 | 0.28 | 0.17 | 0.25 |
| BR00113821 | 0.19 | 0.24 | 0.12 | 0.22 |
mAP BR00112201

Plate: BR00112201
Total mean: 0.25251311463707016

Training samples mean AP: 0.259931

compound AP
PF-477736 1
AMG900 1
APY0201 1
AZD2014 1
GDC-0879 1
acriflavine 1
RG7112 0.930556
GSK-J4 0.897222
Compound2 0.830556
BLU9931 0.677167
BI-78D3 0.668651
SCH-900776 0.640873
CPI-0610 0.572222
SU3327 0.510317
ABT-737 0.480423
Compound7 0.472073
GNF-5 0.469444
MK-5108 0.447917
THZ1 0.422808
NVS-PAK1-1 0.347374
SU-11274 0.32939
GW-5074 0.246392
GSK2334470 0.246166
BX-912 0.24095
NVP-AEW541 0.23775
CHIR-99021 0.220037
dosulepin 0.202143
GSK-3-inhibitor-IX 0.172313
PD-198306 0.148742
PFI-1 0.14835
Compound3 0.145067
BMS-566419 0.12329
BMS-863233 0.121743
apratastat 0.118872
WZ4003 0.114163
ICG-001 0.11288
PNU-74654 0.0874405
ML324 0.0822136
Compound5 0.0819586
GW-3965 0.0698881
SGX523 0.0628168
AZ191 0.0614712
A-366 0.0492269
halopemide 0.0481211
FR-180204 0.0474747
BIX-02188 0.044098
Compound4 0.0427142
AZD7545 0.0417633
SHP 99.00 0.0412191
RGFP966 0.0397035
IOX2 0.0396046
CP-724714 0.0378228
EPZ015666 0.037468
AMG-925 0.0353015
VX-745 0.0336891
SGC-707 0.0329782
P5091 0.0326774
Compound6 0.0305971
delta-Tocotrienol 0.0295755
Compound1 0.0279454
PS178990 0.0278597
carmustine 0.0272295
T-0901317 0.0272058
andarine 0.0257093
UNC0642 0.0257052
dimethindene-(S)-(+) 0.0252354
ML-323 0.0244636
ML-298 0.0232809
Compound8 0.0218036
SAG 0.0198054
KH-CB19 0.0187536
filgotinib 0.0143387

Validation samples mean AP: 0.222843

compound AP
valrubicin 0.830159
sirolimus 0.647222
romidepsin 0.614379
ponatinib 0.489386
merimepodib 0.373039
ispinesib 0.357657
neratinib 0.250216
veliparib 0.0939503
orphenadrine 0.0710256
ruxolitinib 0.0683867
hydroxyzine 0.0374705
selumetinib 0.0353887
pomalidomide 0.0339397
skepinone-l 0.0242614
homochlorcyclizine 0.0220177
rheochrysidin 0.0216262
quazinone 0.0209096
purmorphamine 0.0201343

EchteRobert commented Mar 11, 2022

Here is an overview of all the PRs split by training/validation plates and training/validation compounds, as was done for the mAP.
Generally speaking, the PR values correlate highly with the mAP values reported in #5 (comment).

Excel table

[Image: Screen Shot 2022-03-11 at 5 59 03 PM]

EchteRobert commented

Experiments

The model shown in the previous comments is overfitting the training dataset. This means it does not beat the baseline in mean average precision when comparing the profiles it creates for validation (hold-out) compounds, validation (hold-out) plates, or both.
There are two main ideas to reduce overfitting on 1. plates and 2. compounds:

  1. Consider replicates across plates.
  2. Aggregate all same-compound cells from wells within a plate into a 'super well', if you will, and then sample new 'augmented wells' from this super well. This should increase the variability of single-cell well compositions and reduce compound overfitting (see the sketch after this list).
    (3. A possible extension of 1. and 2. is to also merge ALL compound wells across ALL plates, to form super super wells?)
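A sketch of idea 2, under the assumption that the single cells of each replicate well are available as arrays (the helper and its names are hypothetical):

```python
import numpy as np

def sample_augmented_well(replicate_wells, n_cells=500, rng=None):
    """replicate_wells: list of (n_cells_i, n_features) arrays, one per well
    that received the same compound on this plate."""
    rng = rng or np.random.default_rng()
    super_well = np.concatenate(replicate_wells, axis=0)   # pool all cells
    idx = rng.choice(len(super_well), size=n_cells,
                     replace=len(super_well) < n_cells)
    return super_well[idx]                                 # an 'augmented well'
```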

Main takeaways

I will not show the results as there are too many different experiments, but instead outline the most important findings.

  • Using across-plate replicates did not result in higher performance (PR/mAP) on validation plates. The results are instead somewhat worse than those of previous models. I expect this is because the training task (finding a latent space representation that attracts/repels across-plate replicates) differs too much from the evaluation task (checking whether these latent space representations attract/repel within-plate replicates).
  • Aggregating same-compound wells has a strong regularizing effect on model performance: training plates now achieve performance similar to validation plates, but they also no longer beat the baseline.
  • Training and validation loss decrease together nicely (no more overfitting) when across-plate replicates are no longer considered but super wells are still created.
    However, it turns out that by creating super wells and sampling augmented wells from them, the model learns something very different from the evaluation task. What it learns is not exactly clear, but I think that because samples are now all drawn from a similar (aggregated) distribution, and thus contain cells that originated from the same wells, they are much easier to distinguish. Basically, it is matching cells with the same feature profiles that originated from the same well, instead of finding a good aggregation method for the entire well.

Next up

A possible improvement is to tone down the data augmentation a bit: create super wells only 50% of the time, and sample from a single well the other 50%. Additionally, super wells are created by aggregating only 2 of the 4 available wells (chosen at random). A sketch of this sampling scheme is shown below.
Another improvement is to the normalization method: I will now normalize all wells using statistics computed across the entire plate before training the model on them; previously this normalization was done per well.
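A sketch of this toned-down sampling (hypothetical helper, same assumptions as the super-well sketch above):

```python
import numpy as np

def sample_training_well(replicate_wells, n_cells=500, p_super=0.5, rng=None):
    """50% of the time sample from a single well; otherwise build a super well
    from 2 randomly chosen replicate wells and sample from that."""
    rng = rng or np.random.default_rng()
    if rng.random() < p_super and len(replicate_wells) >= 2:
        picks = rng.choice(len(replicate_wells), size=2, replace=False)
        pool = np.concatenate([replicate_wells[i] for i in picks], axis=0)
    else:
        pool = replicate_wells[rng.integers(len(replicate_wells))]
    idx = rng.choice(len(pool), size=n_cells, replace=len(pool) < n_cells)
    return pool[idx]
```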

EchteRobert commented Mar 18, 2022

Experiment

Results of the 'Next up' experiment described here: #5 (comment)

Main takeaways

  • The model is now also able to beat the mAP of the validation compounds on the training plates.
  • It also beats the mAP of both the training and the validation compounds on some of the validation plates. It was not able to do either before.
  • The 4 plates where the model did not outperform the BM in any of the metrics are the furthest away from the training plates (see the PC1 loadings plot), so this is an expected result.

Next up

  • It's possible that a separate model is needed for the plates where the model did not perform as well yet. I will try training a separate model for those plates next.
  • I will also try training a model with across-plate replicates again, to see whether it improves generalization under this new training setup.

EXCITING!

Results in bold are the highest scores.

| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM |
| --- | --- | --- | --- | --- | --- | --- |
| **Training plates** | | | | | | |
| BR00112201 | **0.66** | 0.40 | **0.43** | 0.32 | **98.9** | 66.7 |
| BR00112198 | **0.56** | 0.35 | **0.4** | 0.30 | **100** | 56.7 |
| BR00112204 | **0.59** | 0.35 | **0.35** | 0.29 | **100** | 58.9 |
| **Validation plates** | | | | | | |
| BR00112202 | **0.44** | 0.34 | **0.31** | 0.30 | **93.3** | 54.4 |
| BR00112197standard | **0.47** | 0.40 | **0.34** | 0.28 | **94.4** | 56.7 |
| BR00112203 | 0.19 | **0.30** | 0.21 | **0.27** | 52.2 | **56.7** |
| BR00112199 | 0.3 | **0.32** | 0.23 | **0.28** | **76.7** | 57.8 |
| BR00113818 | **0.32** | 0.28 | 0.24 | **0.30** | **77.8** | 52.2 |
| BR00113819 | **0.32** | 0.28 | 0.21 | **0.25** | **70** | 48.9 |
| BR00112197repeat | **0.47** | 0.41 | **0.37** | 0.31 | **92.2** | 63.3 |
| BR00113820 | 0.27 | **0.30** | 0.24 | **0.30** | **58.9** | 55.6 |
| BR00113821 | 0.15 | **0.24** | 0.16 | **0.22** | 38.9 | **47.8** |
| BR00112197binned | 0.41 | 0.41 | **0.34** | 0.30 | **91.1** | 58.9 |

shntnu commented Mar 18, 2022

👀 🎊

EchteRobert commented Mar 21, 2022

Experiment

Building upon the setup of the previous experiment, I now train and evaluate a model on across-plate compound replicates. The training set consists of the same 3 plates: BR00112201, BR00112198, and BR00112204. The validation set contains only BR00112202, BR00112197standard, BR00113818, BR00113819, BR00112197repeat, and BR00112197binned. Note that I am only selecting the plates that are close to the training set here; this is because I am considering across-plate correlations, and the other 4 outlier plates emphasize different features. I group the outlier plates into a separate validation set and compute the results for this set for completeness' sake, but I do not think this last set is useful for analysis due to its different feature importances.

I compute the baseline mAP (and PR) for these two sets using the mean aggregation method with across-plate replicates of compounds (a sketch of this baseline is shown below), and do the same using the model aggregation method.
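For clarity, the mean-aggregation baseline is just the per-well average of the single-cell features; a pandas sketch with hypothetical file and column names:

```python
import pandas as pd

# Hypothetical single-cell table: Metadata_* columns plus the 1324 feature columns.
df = pd.read_csv("plate_single_cells.csv")
features = [c for c in df.columns if not c.startswith("Metadata_")]

# Mean aggregation: one profile per well, the average over its cells.
baseline = (df.groupby(["Metadata_Plate", "Metadata_Well", "Metadata_compound"])
              [features].mean().reset_index())
```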

Main takeaways

  • The model achieves better mAP scores than the baseline method at matching compounds across plates, in both the training and the validation set.
  • The model achieves worse mAP scores than the baseline method at matching compounds across plates in the outlier set. This is expected.
  • The model generally achieves lower mAP scores at finding within-plate replicates than the previous model (#5 (comment)). It also beats the baseline mean aggregation less often on validation plates. This seems like a logical consequence of requiring the model to adjust for various staining concentrations.

Next up

  • It's possible that a separate model is needed for the outlier plates. I will try training a separate model for those plates next. I am curious to see whether that model will in turn perform poorly on the training and validation plates used in this experiment.

CrissCross mAP 🔀

Across plate compound correlations

(Note: I do not report the PR because all values are close to 100 percent. I expect this is due to the high number of replicates now being considered; perhaps I need to increase the number of samples used for the non-replicate correlation calculation.)

| plate set | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM |
| --- | --- | --- | --- | --- |
| Training set | 0.48 | 0.30 | 0.35 | 0.30 |
| Validation set | 0.31 | 0.23 | 0.28 | 0.21 |
| Outlier set | 0.11 | 0.15 | 0.09 | 0.13 |

Within plate compound correlations

| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM |
| --- | --- | --- | --- | --- | --- | --- |
| **Training plates** | | | | | | |
| BR00112201 | 0.58 | 0.4 | 0.37 | 0.32 | 98.9 | 66.7 |
| BR00112198 | 0.53 | 0.35 | 0.34 | 0.3 | 97.8 | 56.7 |
| BR00112204 | 0.53 | 0.35 | 0.35 | 0.29 | 98.9 | 58.9 |
| **Validation plates** | | | | | | |
| BR00112202 | 0.43 | 0.34 | 0.36 | 0.3 | 88.9 | 54.4 |
| BR00112197standard | 0.46 | 0.4 | 0.39 | 0.28 | 92.2 | 56.7 |
| BR00112203 | 0.18 | 0.3 | 0.16 | 0.27 | 48.9 | 56.7 |
| BR00112199 | 0.28 | 0.32 | 0.18 | 0.28 | 68.9 | 57.8 |
| BR00113818 | 0.26 | 0.28 | 0.26 | 0.3 | 70 | 52.2 |
| BR00113819 | 0.25 | 0.28 | 0.19 | 0.25 | 72.2 | 48.9 |
| BR00112197repeat | 0.44 | 0.41 | 0.36 | 0.31 | 86.7 | 63.3 |
| BR00113820 | 0.25 | 0.3 | 0.2 | 0.3 | 64.4 | 55.6 |
| BR00113821 | 0.17 | 0.24 | 0.18 | 0.22 | 45.6 | 47.8 |
| BR00112197binned | 0.41 | 0.41 | 0.4 | 0.3 | 88.9 | 58.9 |

EchteRobert commented Mar 22, 2022

Experiment

To see if my hypothesis* is true, I trained a model on 2 of the outlier plates (BR00113819 and BR00113821). I then calculated the same performance metrics as before. The model was trained without creating pairs across plates, only within each plate.

*Training on plates that are similar according to the PC1 loadings plot will lead to poor model performance on plates that are dissimilar to the training plates.

Main takeaways

  • I expected the model to beat the baseline for plates BR00113818 and BR00113820, and although it did not perform very poorly on these plates, it did not beat the baseline in all metrics.
  • In fact, only for these two validation plates did the model outperform the baseline on the training compounds while performing worse on the validation compounds. The opposite is true for BR00112202, BR00112197standard, BR00112197repeat, BR00112204, and BR00112201. So it appears the model has overfit the training compounds for the plates similar to the training plates, but still learned a decent aggregation of the validation compounds for the validation plates.
  • None of the model predictions for the validation plates beat the baseline in all metrics. This may be because the differences between the training and validation plates are larger in this experiment than in the previous one.

Next up

Time to evaluate on Stain3.

TableTime!

| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM |
| --- | --- | --- | --- | --- | --- | --- |
| **Training plates** | | | | | | |
| BR00113819 | 0.58 | 0.28 | 0.28 | 0.25 | 97.8 | 48.9 |
| BR00113821 | 0.59 | 0.24 | 0.22 | 0.22 | 96.7 | 47.8 |
| **Validation plates** | | | | | | |
| BR00112202 | 0.33 | 0.34 | 0.34 | 0.3 | 80 | 54.4 |
| BR00112197standard | 0.32 | 0.4 | 0.34 | 0.28 | 78.9 | 56.7 |
| BR00112203 | 0.16 | 0.3 | 0.18 | 0.27 | 38.9 | 56.7 |
| BR00112199 | 0.17 | 0.32 | 0.16 | 0.28 | 40 | 57.8 |
| BR00113818 | 0.35 | 0.28 | 0.24 | 0.3 | 76.7 | 52.2 |
| BR00112198 | 0.27 | 0.35 | 0.28 | 0.3 | 66.7 | 56.7 |
| BR00112197repeat | 0.33 | 0.41 | 0.34 | 0.31 | 70 | 63.3 |
| BR00112204 | 0.28 | 0.35 | 0.35 | 0.29 | 66.7 | 58.9 |
| BR00113820 | 0.36 | 0.3 | 0.25 | 0.3 | 84.4 | 55.6 |
| BR00112197binned | 0.28 | 0.41 | 0.3 | 0.3 | 65.6 | 58.9 |
| BR00112201 | 0.38 | 0.4 | 0.34 | 0.32 | 86.7 | 66.7 |

EchteRobert commented

Evaluation

As an additional evaluation at the compound level, I compared the mAP between the model and the benchmark for the 'within-cluster plates' (see the PC1 loadings plot for the cluster) to see whether there are specific compounds that consistently perform worse or better with the model than with the benchmark.

Colorful bubble graph, training compounds!
[Image: Screen Shot 2022-03-31 at 4 47 28 PM]

Colorful bubble graph, validation compounds!
[Image: Screen Shot 2022-03-31 at 4 49 44 PM]

EchteRobert commented Apr 12, 2022

Evaluation Stain3 optimized model

After tuning a bunch of hyperparameters using Stain3 plates, I trained a model on Stain2 plates using the same hyperparameters and training methods to see whether this new setup transfers across datasets. I changed the data used to calculate the validation loss, so that selecting the model with the best validation loss actually yields the best performance on the validation compounds. See #6 (comment) for the discovery of this validation loss issue and #6 (comment) for the hyperparameter experiment details.

Main takeaways

  • The model has actually improved all scores for both training and validation data, showing that the optimized parameters work for this task in a more general sense than just for Stain3 plates.
  • The updated validation loss now better represents performance on the validation compounds: the model with the best validation loss performs equal to or better than the last-epoch model on the validation compounds for 6 out of 7 plates.

Results

mAP table with last epoch model here!

| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM |
| --- | --- | --- | --- | --- | --- | --- |
| **Training plates** | | | | | | |
| BR00112201 | 0.81 | 0.4 | 0.47 | 0.32 | 100 | 66.7 |
| BR00112198 | 0.78 | 0.35 | 0.49 | 0.3 | 100 | 56.7 |
| BR00112204 | 0.82 | 0.35 | 0.42 | 0.29 | 100 | 58.9 |
| **Validation plates** | | | | | | |
| BR00112202 | 0.52 | 0.34 | 0.35 | 0.3 | 94.4 | 54.4 |
| BR00112197standard | 0.54 | 0.4 | 0.44 | 0.28 | 95.6 | 56.7 |
| BR00112197repeat | 0.55 | 0.41 | 0.4 | 0.31 | 95.6 | 63.3 |
| BR00112197binned | 0.48 | 0.41 | 0.41 | 0.3 | 91.1 | 58.9 |

mAP table with best validation loss model here!

Numbers in bold are better than the last epoch model. Numbers in italic are worse.

| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM |
| --- | --- | --- | --- | --- | --- | --- |
| **Training plates** | | | | | | |
| BR00112201 | *0.65* | 0.4 | *0.45* | 0.32 | *98.9* | 66.7 |
| BR00112198 | *0.59* | 0.35 | 0.49 | 0.3 | *98.9* | 56.7 |
| BR00112204 | *0.59* | 0.35 | **0.46** | 0.29 | 100 | 58.9 |
| **Validation plates** | | | | | | |
| BR00112202 | *0.48* | 0.34 | **0.37** | 0.3 | **95.6** | 54.4 |
| BR00112197standard | *0.51* | 0.4 | 0.44 | 0.28 | *93.3* | 56.7 |
| BR00112197repeat | *0.49* | 0.41 | **0.47** | 0.31 | *93.3* | 63.3 |
| BR00112197binned | *0.46* | 0.41 | 0.41 | 0.3 | *85.6* | 58.9 |

EchteRobert changed the title from "03. Model for Stain2 dataset" to "03. Model for Stain2" on Apr 28, 2022