Train anomalyze models #47
- Deterministically sample single-cell data
- Train on sampled data and save trained models
- Monitor training
identifying anomalyze and training models
LGTM! I made mostly preference comments that you can feel free to ignore. I really like the documentation explaining the process though.
This analysis identifies single-cell anomalies determined with an isolation forest for each plate independently of other plates.
The isolation forests are constructed from both treatment and control cellprofiler plate features.
#Identify single-cell anomalies
This analysis identifies JUMP anomalies after sampling and training processed JUMP data.
This analysis identifies JUMP anomalies after sampling and training processed JUMP data.
This analysis identifies single-cell anomalies after sampling and training processed JUMP data.
Anomalyze is trained on 6 combinations of processed data:
- Normalized single-cell QC'd data
- Normalized single-cell non-QC'd data
- Normalized and feature-selected single-cell QC'd data
- Normalized and feature-selected single-cell non-QC'd data
- QC'd data aggregated to the well level before feature-selection
- Non-QC'd data aggregated to the well level after feature-selection
- Non-QC'd data aggregated to the well level before feature-selection
- QC'd data aggregated to the well level after feature-selection
I like this. This is clearly written. Do you think a table would help to show operation order?
plate_data_name = pathlib.Path(sys.argv[1]).name
is_sc = sys.argv[2].lower() == "true"
sampled_plate_jump_data_path = sys.argv[3]
Consider using argparse with keyword arguments for parsing. This is good though. Definitely a preference thing.
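For reference, a minimal sketch of what the argparse version could look like. The flag names and help strings are hypothetical; only the three inputs (plate data path, single-cell flag, sampled data path) come from the script above.

```python
import argparse
import pathlib


def parse_args(argv=None):
    """Parse the three inputs the script currently reads from sys.argv."""
    parser = argparse.ArgumentParser(
        description="Train an isolation forest on one plate."
    )
    # Flag names are assumptions made for this sketch, not the PR's interface.
    parser.add_argument(
        "--plate-data-path", type=pathlib.Path, required=True,
        help="Path to the plate data file.",
    )
    parser.add_argument(
        "--is-sc", action="store_true",
        help="Set when the input is single-cell data.",
    )
    parser.add_argument(
        "--sampled-plate-jump-data-path", type=pathlib.Path, required=True,
        help="Path to the sampled plate data.",
    )
    return parser.parse_args(argv)


# Passing an explicit list keeps the example independent of the real command line.
example_args = parse_args(
    [
        "--plate-data-path", "data/plate_a.parquet",
        "--is-sc",
        "--sampled-plate-jump-data-path", "sampled",
    ]
)
plate_data_name = example_args.plate_data_path.name
is_sc = example_args.is_sc
```

With `action="store_true"` the single-cell flag no longer depends on comparing a string to "true", and `--help` documents the inputs for free.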
num_estimators = 1_600 if is_sc else 1

# Calculate anomalies
isofor = IsolationForest(n_estimators=1000, random_state=0, n_jobs=-1)
isofor = IsolationForest(n_estimators=num_estimators, random_state=0, n_jobs=-1)
isofor.fit(featdf)
Is num_estimators supposed to be an int? If yes, consider changing 1_600. If not, consider adding a comment on the formatting.
Ahh, nvm. I see the int separator. Consider adding a justification for why 1_600?
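For anyone else who trips over the literal: underscores in numeric literals (PEP 515, Python 3.6+) are purely visual, so the value is still a plain int. A quick check, with `is_sc` set to a placeholder value for illustration:

```python
is_sc = True  # placeholder; the script derives this from its arguments

# 1_600 and 1600 are the same integer; the underscore only groups digits.
num_estimators = 1_600 if is_sc else 1

print(type(num_estimators).__name__, num_estimators == 1600)
```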
# ### Import Libraries

# In[ ]:
Looks like the imports cell was not run, FYI.
if is_sc:
    ds = DeterministicSampling(
        _platedf=platedf,
        _samples_per_plate=8_100,
How is the _samples_per_plate determined?
""" | ||
_plate_column, _well_column and _cell_id_columns must be present in _platedf.
_cell_id_columns will be used with _plate_column and _well_column to uniquely identify each cell.
For example, _cell_id_columns in some projects could be ["Metadata_Site", "Metadata_ObjectNumber"]
""" |
For the docstrings for the class methods consider adding return type hints and return descriptions
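To illustrate the docstring style being suggested, here is a small standalone sketch with return type hints and a Returns section. The function names and the hash-based selection are hypothetical and are not taken from the DeterministicSampling class in this PR; they only show one way an order-independent sample can be documented.

```python
import hashlib


def cell_key(row: dict, id_columns: list[str]) -> str:
    """Build a string that identifies one cell.

    Args:
        row: Mapping of column name to value for a single cell.
        id_columns: Columns that together identify the cell.

    Returns:
        The values of ``id_columns`` joined with "|", in the order given.
    """
    return "|".join(str(row[col]) for col in id_columns)


def sample_cells(rows: list[dict], id_columns: list[str], n: int) -> list[dict]:
    """Pick the same cells on every run, regardless of input order.

    Assumes each row has a unique cell key.

    Returns:
        Up to ``n`` rows, ordered by the SHA-256 digest of their cell key.
    """

    def digest(row: dict) -> str:
        return hashlib.sha256(cell_key(row, id_columns).encode()).hexdigest()

    return sorted(rows, key=digest)[:n]
```

Stating the return type and ordering in the docstring makes it clear to callers what they get back without reading the method body.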
@@ -0,0 +1,51 @@
#!/bin/bash
Consider activating your conda environment, if you have one, before running Python.
py_path="nbconverted"

jupyter nbconvert --to python --output-dir="${py_path}/" *.ipynb
Consider merging these lines into one.
if plate_data_path.exists():
    sampled_platedf = pd.concat(
        [sampled_platedf, pd.read_parquet(plate_data_path)], axis=0
    )

sampled_platedf.to_parquet(plate_data_path)
If you run the script for all plates and then run it again for all plates, does sampled_platedf end up with double the data? How would that affect anomalyze?
Are there safeguards for this?
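One possible safeguard, sketched under assumptions: drop rows that share the same cell identifier after concatenating, so a rerun cannot double the data. The function name and the `Metadata_Plate` and `Metadata_Well` column names are hypothetical; `Metadata_Site` and `Metadata_ObjectNumber` are taken from the docstring example above. The sketch works on in-memory frames and leaves the parquet read and write out.

```python
import pandas as pd

# Assumed identifier columns; adjust to whatever uniquely identifies a cell.
KEY_COLUMNS = [
    "Metadata_Plate",
    "Metadata_Well",
    "Metadata_Site",
    "Metadata_ObjectNumber",
]


def merge_samples(existing, new, key_columns=KEY_COLUMNS):
    """Concatenate two sampled frames and keep one row per cell."""
    combined = pd.concat([existing, new], axis=0, ignore_index=True)
    # keep="first" retains the row already on disk when a cell appears twice.
    return combined.drop_duplicates(subset=key_columns, keep="first")


sample = pd.DataFrame(
    {
        "Metadata_Plate": ["P1", "P1", "P1"],
        "Metadata_Well": ["A01", "A01", "B02"],
        "Metadata_Site": [1, 1, 2],
        "Metadata_ObjectNumber": [1, 2, 1],
        "feature": [0.1, 0.2, 0.3],
    }
)

# Simulates running the script twice on the same plate.
rerun = merge_samples(sample, sample)
```

Another option would be to overwrite the file instead of appending, if appending across runs is not actually needed.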
This PR deterministically samples single-cell data and trains anomalyze models. I haven't trained the anomalyze models yet, so you shouldn't expect to see them before this PR is approved. In a future PR I will use these models to generate anomaly data for their respective datasets. Any constructive feedback is welcome when reviewing this PR.