It's time to allow partial data to be stored in the results folder. Currently, we hold all data in memory (in the Collector object) until the end of the run, then dump it into the results database all at once (sketched below). This avoids incomplete/partial states, but it has several drawbacks:
- This costs a lot of memory, particularly for long-running experiments (e.g. continual learning experiments).
- It puts a large, sudden burst of write pressure on the database at the end of a run. If many parallel processes do this at once, the chance of long lockups rises dramatically.
- The full Collector object currently needs to be checkpointed, which becomes very expensive as it fills up.
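For reference, a minimal sketch of the current pattern. The `Collector` here is a stand-in for the real object, and `dump_to_results_db` is a hypothetical name:

```python
from collections import defaultdict

class Collector:
    """Stand-in for the real Collector: everything accumulates in memory."""
    def __init__(self):
        self._data = defaultdict(list)

    def collect(self, key: str, value: float) -> None:
        self._data[key].append(value)

collector = Collector()
for step in range(1_000_000):
    collector.collect('reward', 0.0)  # held in memory for the entire run

# The results database is only touched once, at the very end:
# dump_to_results_db(collector)  # single large write -- the pain point above
```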
There are several challenges for the format. Some of them already exist, and we have simply been ignoring them because the probability of hitting them is small; that will no longer be an option:
- Data consistency. If a run is preempted, restarting it must not invalidate the existing data. This may mean deleting overlapping rows or adding other safeguards, preferably applied only once, for instance when the checkpoint loads (see the sketch after this list).
- Buffered writes. We don't want to hit the database frequently: too much potential for lockups and too much disk activity.
- Database lock handling. With many parallel writers contending for the same database, writes need timeouts and retries.
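A minimal sketch of how the consistency and locking pieces might fit together, assuming the results database is SQLite; the `results` table and its `step` column are hypothetical:

```python
import sqlite3
import time

def restore_consistency(db_path: str, checkpoint_step: int) -> None:
    """On resume, drop rows written after the checkpointed step so the
    restarted run can re-emit them without leaving duplicates behind."""
    with sqlite3.connect(db_path) as con:
        con.execute('DELETE FROM results WHERE step > ?', (checkpoint_step,))

def write_with_retry(db_path: str, rows: list, retries: int = 5) -> None:
    """Batch-insert buffered rows, backing off while the db is locked."""
    for attempt in range(retries):
        try:
            with sqlite3.connect(db_path, timeout=30) as con:
                con.executemany('INSERT INTO results VALUES (?, ?, ?)', rows)
            return
        except sqlite3.OperationalError:
            # Another process holds the write lock; back off and retry.
            time.sleep(0.1 * 2 ** attempt)
    raise RuntimeError('could not acquire the database lock')
```

`restore_consistency` would run exactly once, when a checkpoint is loaded; `write_with_retry` is what a buffered flush would call instead of writing row-by-row.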
Some concrete implementation notes:
- Experiment description -> experiment metadata -> experiment. The description should be used once to construct the metadata in a db; the experiment is then run from that db only. This lets us synchronize state and retain consistency, and it makes the path for computed `.py` experiment descriptions much simpler.
- We need to handle synchronizing the metadata db across local and server machines. Embedding it into the results database may be sufficient.
- Use a tmp location (e.g. `SLURM_TMPDIR`) for the partial results db, then synchronize that with the final results db. This should alleviate write pressure and make buffering less important.
- Have the db writer run asynchronously in a background thread (both of these are sketched after this list).
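Under the same SQLite assumption, a rough sketch of the background-thread writer; pointing `db_path` at `SLURM_TMPDIR` and copying the file into the final results db after `close()` covers the tmp-location note above:

```python
import queue
import sqlite3
import threading

class AsyncWriter:
    """Buffers rows in a queue and flushes them from a background thread,
    so the main loop never blocks on disk activity or lock contention."""

    def __init__(self, db_path: str, batch_size: int = 256):
        self._q: queue.Queue = queue.Queue()
        self._db_path = db_path
        self._batch = batch_size
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def put(self, row: tuple) -> None:
        self._q.put(row)

    def close(self) -> None:
        self._q.put(None)  # sentinel: drain the buffer and exit
        self._thread.join()

    def _loop(self) -> None:
        # The connection lives entirely in this thread, as sqlite3 requires.
        con = sqlite3.connect(self._db_path)
        buffer: list = []
        done = False
        while not done:
            item = self._q.get()
            if item is None:
                done = True
            else:
                buffer.append(item)
            # Flush when the buffer is full, or drain it on shutdown.
            if buffer and (len(buffer) >= self._batch or done):
                with con:
                    # Hypothetical three-column results schema.
                    con.executemany('INSERT INTO results VALUES (?, ?, ?)', buffer)
                buffer.clear()
        con.close()
```

Under SLURM, `db_path` might be `os.path.join(os.environ['SLURM_TMPDIR'], 'partial.db')`, with a single copy over to the shared results db after `close()` returns.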