🚀 Feature Request
It would be awesome to enable caching of epoch_sample_ids.
Motivation
Caching would remove a lot of redundant work that is currently re-executed on every run. It takes 20 minutes for my dataset's sample IDs to be created, which wastes a lot of budget for large-scale runs.
In my case, I'm specifically interested in the implementation in https://github.com/mosaicml/streaming/blob/2e9db78db6dd4108b697cfde92a95cd0de80539c/streaming/base/batching/random.py. What takes long there is dataset.resample_streams (with sampling_method="balanced") and get_shuffle (with shuffle_algo="py1e" in my case).
[Optional] Implementation
I've looked into this a bit, but get_shuffle's indirect dependence, via get_partitions, on sample_in_epoch (called drop_first in the called functions) seems to make this very difficult. Maybe someone with more knowledge of the codebase can chime in on this, though. I would personally be happy with a simple hacky solution for now. :)
For now I've implemented a quick-and-dirty hash-keyed NumPy file cache for dataset.resample_streams, which already saves around 40–50% of the time.
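Roughly, the hack looks like the sketch below; the cache directory, the choice of key inputs, and the commented-out call around dataset.resample_streams are only illustrative assumptions, not actual streaming internals.

```python
import hashlib
import os

import numpy as np


def cached_arrays(cache_dir, key_parts, compute):
    """Load cached arrays keyed by a hash of `key_parts`, or compute and store them.

    `key_parts` should capture everything the result depends on (epoch, seed,
    stream sizes, sampling method, ...), otherwise stale results come back.
    """
    digest = hashlib.sha256(repr(key_parts).encode()).hexdigest()
    path = os.path.join(cache_dir, f'{digest}.npz')
    if os.path.exists(path):
        with np.load(path) as data:
            # np.savez stores positional arrays as arr_0, arr_1, ...
            return tuple(data[f'arr_{i}'] for i in range(len(data.files)))
    arrays = compute()
    os.makedirs(cache_dir, exist_ok=True)
    np.savez(path, *arrays)
    return arrays


# Hypothetical usage; the exact return values of `resample_streams` are omitted here:
# results = cached_arrays(
#     '/tmp/streaming_cache',
#     (epoch, dataset.seed, dataset.sampling_method),
#     lambda: dataset.resample_streams(epoch),
# )
```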
Hey @janEbert, this seems sensible! We have chosen not to cache the epoch sample id tensor mainly because persistent storage may not be available in many training setups, so reading from a cached file is not always possible.
However, this could be an optional feature for users who do have this set up. Dumping the NumPy tensor to a file is honestly a good start -- we'd be happy to help review an implementation, and we always appreciate community PRs!
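Something along these lines would cover the common case as an opt-in; the function name and the build_fn hook below are placeholders rather than existing streaming code:

```python
import os

import numpy as np


def load_or_build_sample_ids(cache_path, build_fn):
    """Return the epoch sample id tensor from `cache_path` if present, else build and save it.

    `cache_path` should end in '.npy' so that np.save and np.load agree on the filename.
    """
    if os.path.exists(cache_path):
        return np.load(cache_path)
    sample_ids = build_fn()          # the existing (slow) sample id generation
    np.save(cache_path, sample_ids)  # only helps when persistent storage is available
    return sample_ids
```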
I see, that makes sense. It also seemed like the indices are re-calculated on each validation run, so there is really only a time saving when you start a run or load from a checkpoint.
Regarding the implementation, I'll be happy to put what I cooked up into a PR once I find some free time. Considering the re-calculation I mentioned above (if I interpreted my logs correctly), maybe the additional complexity is not really worth adding to the code base, though. :)