Train anomalyze models #47

MattsonCam · 2025-04-03T16:32:26Z

This PR deterministically samples single-cell data and trains anomalyze models. I haven't trained the anomalyze models yet, so you shouldn't expect to see them before this PR is approved. In a future PR, I will use these models to generate anomaly data for their respective datasets. Any constructive feedback is welcome when reviewing this PR.

- Deterministically sample single-cell data
- Train on sampled data and save trained models
- Monitor training


review-notebook-app · 2025-04-03T16:32:31Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

MikeLippincott

LGTM! I made mostly preference comments that you can feel free to ignore. I really like the documentation explaining the process though.

MikeLippincott · 2025-04-05T20:51:00Z

2.evaluate_data/identify_sc_outlier_per_plate/README

-This analysis identifies single-cell anomalies determined with an isolation forest for each plate independently of other plates.
-The isolation forests are constructed from both treatment and control cellprofiler plate features.
+#Identify single-cell anomalies
+This analysis identifies JUMP anomalies after sampling and training processed JUMP data.


Suggested change

-This analysis identifies JUMP anomalies after sampling and training processed JUMP data.
+This analysis identifies single-cell anomalies after sampling and training processed JUMP data.

MikeLippincott · 2025-04-05T20:52:02Z

2.evaluate_data/identify_sc_outlier_per_plate/README

+Anomalyze is trained on 6 combinations of processed data:
+- Normalized single-cell QC'd data
+- Normalized single-cell non-QC'd data
+- Normalized and feature-selected single-cell QC'd data
+- Normalized and feature-selected single-cell non-QC'd data
+- QC'd data aggregated to the well level before feature-selection
+- Non-QC'd data aggregated to the well level after feature-selection
+- Non-QC'd data aggregated to the well level before feature-selection
+- QC'd data aggregated to the well level after feature-selection


I like this. This is clearly written. Do you think a table would help to show operation order?

MikeLippincott · 2025-04-05T20:53:24Z

2.evaluate_data/identify_sc_outlier_per_plate/nbconverted/identify_anomalous_single_cells_fs.py

+plate_data_name = pathlib.Path(sys.argv[1]).name
+is_sc = sys.argv[2].lower() == "true"
+sampled_plate_jump_data_path = sys.argv[3]


Consider using argparse with keyword arguments for parsing. This is good though. Definitely a preference thing.
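
As a minimal sketch of the argparse suggestion, the three positional values could become named arguments. The flag names (`--plate_data_path`, `--is_sc`, `--sampled_plate_jump_data_path`) and the example paths are hypothetical and only mirror the variables in the diff; the `"true"` string comparison is kept so a calling shell script could pass the same values as before.

```python
import argparse
import pathlib


def parse_args(argv=None):
    """Parse named command-line arguments (flag names are hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Train an isolation forest on sampled plate data."
    )
    parser.add_argument(
        "--plate_data_path",
        type=pathlib.Path,
        required=True,
        help="Path to the plate data; only its file name is used.",
    )
    parser.add_argument(
        "--is_sc",
        # Same rule as the original script: the string "true" (any case) means True.
        type=lambda value: value.lower() == "true",
        default=False,
        help="Pass 'true' when the data is single-cell rather than well-level.",
    )
    parser.add_argument(
        "--sampled_plate_jump_data_path",
        type=pathlib.Path,
        required=True,
        help="Location of the sampled plate data.",
    )
    return parser.parse_args(argv)


# Example with an explicit argument list (placeholder paths) instead of sys.argv.
args = parse_args(
    [
        "--plate_data_path", "plates/example_plate.parquet",
        "--is_sc", "True",
        "--sampled_plate_jump_data_path", "sampled_plates",
    ]
)
plate_data_name = args.plate_data_path.name
is_sc = args.is_sc
sampled_plate_jump_data_path = args.sampled_plate_jump_data_path
```

Calling `parse_args()` with no argument reads `sys.argv`, and named flags make the invocation in the shell script self-describing.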

MikeLippincott · 2025-04-05T20:55:25Z

2.evaluate_data/identify_sc_outlier_per_plate/nbconverted/identify_anomalous_single_cells_fs.py

+num_estimators = 1_600 if is_sc else 1

-# Calculate anomalies
-isofor = IsolationForest(n_estimators=1000, random_state=0, n_jobs=-1)
+isofor = IsolationForest(n_estimators=num_estimators, random_state=0, n_jobs=-1)
+isofor.fit(featdf)


Is num_estimators supposed to be an int? If yes, consider changing 1_600. If not, consider adding a comment on the formatting.

Ahh, nvm. I see the int separator. Consider adding a justification for why 1_600?
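
For reference, a small sketch of the literal in question: underscores in Python numeric literals are purely visual, so `1_600` is the int `1600`. The constant and function names below are hypothetical; a named constant is one possible place to record the justification, which is not stated in this PR.

```python
# Underscores in numeric literals are ignored by Python, so 1_600 == 1600.
SC_NUM_ESTIMATORS = 1_600  # hypothetical name; the rationale for this value would be documented here


def choose_num_estimators(is_sc: bool) -> int:
    """Return the number of trees, mirroring `1_600 if is_sc else 1` from the diff."""
    return SC_NUM_ESTIMATORS if is_sc else 1


print(choose_num_estimators(True))
print(type(SC_NUM_ESTIMATORS).__name__)
```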

MikeLippincott · 2025-04-05T20:55:50Z

2.evaluate_data/identify_sc_outlier_per_plate/nbconverted/sample_anomalous_single_cells_fs.py

+
+# ### Import Libraries
+
+# In[ ]:


Looks like the imports cell was not run, FYI.

MikeLippincott · 2025-04-05T21:00:14Z

2.evaluate_data/identify_sc_outlier_per_plate/nbconverted/sample_anomalous_single_cells_fs.py

+if is_sc:
+    ds = DeterministicSampling(
+        _platedf=platedf,
+        _samples_per_plate=8_100,


How is the _samples_per_plate determined?

MikeLippincott · 2025-04-05T21:03:20Z

2.evaluate_data/utils/DeterministicSampling.py

+    """
+    _plate_column, _well_column and _cell_id_columns must be present in _platedf.
+    _cell_id_columns will be used with _plate_column and _well_column to uniquely identify each cell.
+    For example, _cell_id_columns in some projects could be ["Metadata_Site", "Metadata_ObjectNumber"]
+    """


For the docstrings for the class methods consider adding return type hints and return descriptions
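
A minimal sketch of what that could look like, assuming a NumPy-style docstring. The function `build_cell_key` and the column names `Metadata_Plate` and `Metadata_Well` are hypothetical and are not taken from `DeterministicSampling.py`; only the `_cell_id_columns` example comes from the docstring above.

```python
from typing import Mapping, Sequence


def build_cell_key(
    row: Mapping[str, object],
    plate_column: str,
    well_column: str,
    cell_id_columns: Sequence[str],
) -> str:
    """Build an identifier that is unique to one cell.

    Parameters
    ----------
    row : Mapping[str, object]
        One cell's metadata, keyed by column name.
    plate_column : str
        Name of the column holding the plate identifier.
    well_column : str
        Name of the column holding the well identifier.
    cell_id_columns : Sequence[str]
        Columns that, together with the plate and well, identify the cell.

    Returns
    -------
    str
        The plate, well, and cell ID values joined by underscores.
    """
    parts = [row[plate_column], row[well_column]]
    parts.extend(row[column] for column in cell_id_columns)
    return "_".join(str(part) for part in parts)


example_row = {
    "Metadata_Plate": "P1",
    "Metadata_Well": "A01",
    "Metadata_Site": 3,
    "Metadata_ObjectNumber": 17,
}
key = build_cell_key(
    example_row,
    "Metadata_Plate",
    "Metadata_Well",
    ["Metadata_Site", "Metadata_ObjectNumber"],
)
```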

MikeLippincott · 2025-04-05T21:04:43Z

2.evaluate_data/identify_sc_outlier_per_plate/sample_data_and_train_anomalyze.sh

@@ -0,0 +1,51 @@
+#!/bin/bash


Consider activating your conda env, if you have one, to run Python.

MikeLippincott · 2025-04-05T21:06:29Z

2.evaluate_data/identify_sc_outlier_per_plate/sample_data_and_train_anomalyze.sh

+py_path="nbconverted"
+
+jupyter nbconvert --to python --output-dir="${py_path}/" *.ipynb


consider merging these lines into one.

MikeLippincott · 2025-04-05T21:08:48Z

2.evaluate_data/identify_sc_outlier_per_plate/nbconverted/sample_anomalous_single_cells_fs.py

+if plate_data_path.exists():
+    sampled_platedf = pd.concat(
+        [sampled_platedf, pd.read_parquet(plate_data_path)], axis=0
+    )
+
+sampled_platedf.to_parquet(plate_data_path)


If you run the script for all plates and then run it again for all plates, does sampled_platedf end up with double the data? How would that affect anomalyze?

Are there safeguards for this?
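
One possible safeguard, sketched under assumptions: drop rows whose identifying columns already exist before writing the combined data back. The helper name `append_unique_cells` and the column names `Metadata_Plate` and `Metadata_Well` are hypothetical, and this is not how the PR currently handles reruns.

```python
import pandas as pd


def append_unique_cells(
    existing: pd.DataFrame, new: pd.DataFrame, id_columns: list
) -> pd.DataFrame:
    """Concatenate two samples and keep one row per unique cell."""
    combined = pd.concat([existing, new], axis=0, ignore_index=True)
    # Rows that repeat an already-stored cell are dropped; the first copy is kept.
    return combined.drop_duplicates(subset=id_columns, keep="first").reset_index(
        drop=True
    )


id_columns = [
    "Metadata_Plate",
    "Metadata_Well",
    "Metadata_Site",
    "Metadata_ObjectNumber",
]

# Toy stand-in for sampled_platedf.
sampled = pd.DataFrame(
    {
        "Metadata_Plate": ["P1", "P1", "P1"],
        "Metadata_Well": ["A01", "A01", "B02"],
        "Metadata_Site": [1, 1, 2],
        "Metadata_ObjectNumber": [1, 2, 1],
        "feature": [0.1, 0.2, 0.3],
    }
)

first_run = sampled.copy()  # what the first run would have written
second_run = append_unique_cells(first_run, sampled, id_columns)  # rerun on the same plate
naive_rerun = pd.concat([first_run, sampled], axis=0)  # plain concat doubles the rows
```

In the script, the deduplicated frame would be the one passed to `to_parquet`. Skipping plates whose output file already exists, or overwriting instead of appending, are simpler alternatives if appending across runs is not needed.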

Cameron Mattson added 4 commits April 3, 2025 10:12

- 2061048: Updated anomalyze training to:
  - Deterministically sample single-cell data
  - Train on sampled data and save trained models
  - Monitor training
- d387767: Updated README to describe the process for identifying anomalyze and training models
- 1394acd: Updated conda environment to include the farmhash
- b7e5879: Added script to perform sampling and anomalyze training
- 9dd2139: Changed unused variable when sampling

MattsonCam requested a review from MikeLippincott April 4, 2025 21:22

MikeLippincott approved these changes Apr 5, 2025
