-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve memory usage of MetaLearnerGridSearch
#62
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #62 +/- ##
==========================================
+ Coverage 94.85% 94.88% +0.03%
==========================================
Files 15 15
Lines 1534 1544 +10
==========================================
+ Hits 1455 1465 +10
Misses 79 79 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! :)
and you do not want to store all MetaLearners objects rather evaluate them after | ||
fitting each one and just store one. | ||
|
||
``grid_size_`` will contain the number of hyperparameter combinations after fitting. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't quite grasped yet what the motivation for this attribute is. If the user expects a generator, won't they just be used to querying it until it no longer yields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! But I think it may be useful in the case the user wants to show a progress bar or similar, this way it's easier for them to access the number of metalearners to fit instead of having to calculate it manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I understand what you mean. At the same time this use case is not super clear to me. Shall we simplify and remove it until we witness that there is a need for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I may not explain it correctly, in the case we set store_raw_results = False
and store_results = False
then the fit
method finishes "instantly" (it does not wait for fitting the individual metalearners). Then, afaict these are only fitted when the user requests them by iterating over the generator where it may be of use to use this grid_size_
to display some progress bar or similar. If you still think it's not clear lmk and we can discuss it further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, I added some explanation about this in the docstring:
1590e94
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
The idea of this PR is to improve the memory usage of
MetaLearnerGridSearch
. The main issue was that all metalearners instances were saved in memory. Each of this was referenced in two places, thejobs
list and theraw_results_
list.jobs
list now_FitAndScoreJob
stores only the factory and the parameters, then theMetaLearner
object is created inside thejoblib
job which runs_fit_and_score
.raw_results_
now one can choose to return a generator fromjoblib.parallel
instead of materializing all the results in a list.It is important to notice that if the user wants to iterate themselves over the results and not materialize them (for example, to compute policy values and only store the one with the highest policy) then
store_results
andstore_raw_results
must be set toFalse
as ifstore_results
isTrue
then the generator would be consumed when calling_format_results
.It also adds a
grid_size_
attribute which can be useful if a generator is returned.Checklist
CHANGELOG.rst
entry