Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Train anomalyze models #47

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Train anomalyze models #47

wants to merge 5 commits into from

Conversation

MattsonCam
Copy link
Member

@MattsonCam MattsonCam commented Apr 3, 2025

This pr deterministically samples single cell data and trains anomalyze models. I haven't trained the anomalyze models yet, so you shouldn't expect to see them before this pr is approved. In a future pr I will use these models to generate anomaly data for their respective datasets. Any constructive feedback is welcome when reviewing this pr.

Cameron Mattson added 4 commits April 3, 2025 10:12
- Deterministically sample single-cell data
- Train on sampled data and save trained models
- Monitor training
identifying anomalyze and training models
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Copy link
Member

@MikeLippincott MikeLippincott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I made mostly preference comments that you can feel free to ignore. I really like the documentation explaining the process though.

This analysis identifies single-cell anomalies determined with an isolation forest for each plate independently of other plates.
The isolation forests are constructed from both treatment and control cellprofiler plate features.
#Identify single-cell anomalies
This analysis identifies JUMP anomalies after sampling and training processed JUMP data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This analysis identifies JUMP anomalies after sampling and training processed JUMP data.
This analysis identifies single-cell anomalies after sampling and training processed JUMP data.

Comment on lines +3 to +11
Anomalyze is trained on 6 combinations of processed data:
- Normalized single-cell QC'd data
- Normalized single-cell non-QC'd data
- Normalized and feature-selected single-cell QC'd data
- Normalized and feature-selected single-cell non-QC'd data
- QC'd data aggregated to the well level before feature-selection
- Non-QC'd data aggregated to the well level after feature-selection
- Non-QC'd data aggregated to the well level before feature-selection
- QC'd data aggregated to the well level after feature-selection
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this. This is clearly written. Do you think a table would help to show operation order?

Comment on lines +28 to +30
plate_data_name = pathlib.Path(sys.argv[1]).name
is_sc = sys.argv[2].lower() == "true"
sampled_plate_jump_data_path = sys.argv[3]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using argparser with keyword args and parsing. This is good though. Definitely a preference thing.

Comment on lines +59 to +62
num_estimators = 1_600 if is_sc else 1

# Calculate anomalies
isofor = IsolationForest(n_estimators=1000, random_state=0, n_jobs=-1)
isofor = IsolationForest(n_estimators=num_estimators, random_state=0, n_jobs=-1)
isofor.fit(featdf)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is num_estimators supossed to be an int? If yes, consider changing 1_600. If not, consider adding a comment on the formatting

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, nvm. I see the int seperator. Consider adding a justification for why 1_600?


# ### Import Libraries

# In[ ]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like the imports was not run FYI

if is_sc:
ds = DeterministicSampling(
_platedf=platedf,
_samples_per_plate=8_100,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the _samples_per_plate determined?

Comment on lines +24 to +28
"""
_plate_column, _well_column and _cell_id_columns must be present in _platedf.
_cell_id_columns will be used with _plate_column and _well_column to uniquely identify each cell.
For example, _cell_id_columns in some projects could be ["Metadata_Site", "Metadata_ObjectNumber"]
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the docstrings for the class methods consider adding return type hints and return descriptions

@@ -0,0 +1,51 @@
#!/bin/bash
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider activating your conda env if you have one to run python

Comment on lines +10 to +12
py_path="nbconverted"

jupyter nbconvert --to python --output-dir="${py_path}/" *.ipynb
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider merging these lines into one.

Comment on lines +73 to +78
if plate_data_path.exists():
sampled_platedf = pd.concat(
[sampled_platedf, pd.read_parquet(plate_data_path)], axis=0
)

sampled_platedf.to_parquet(plate_data_path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you run the script for all plates then run this script again for all plates does this sampled_platedf have double the data then? How would that affect anomalyze?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there safe gaurds for this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants