Nested Dataset Slowdown #593

vyeevani · 2024-10-03T15:54:53Z

TLDR: I'm working with robotics datasets. They are expressed as a nested dataset. Grain calls repr on the inner dataset, which slows down the loops significantly.

I think grain has serious benefits over tfds for nested datasets like this. Currently, rlds is the best way of handling these types of datasets. It somewhat abuses tfds dataset manipulation techniques since it doesn't have a better way of handling this. I think grain would be able to simplify a lot of these workflows and allow for much more complex things to be created, e.g. stitching multiple robot episodes together in unique ways that aren't easily expressible in tfds or rlds.

Fixing the above problem would be hugely beneficial, and the repr call doesn't feel like it's strictly necessary.

Simple example to showcase the problem:

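import grain.python as grain
import rlds
import tensorflow_datasets as tfds

# dataset_path points at an already-prepared RLDS dataset directory.
builder = tfds.builder_from_directory(builder_dir=dataset_path)
episode_data_source = builder.as_data_source("train", deserialize_method=tfds.core.decode.DeserializeMethod.DESERIALIZE_AND_DECODE)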
episode_index_sampler = grain.IndexSampler(
    num_records=2,
    num_epochs=1,
    shard_options=grain.ShardOptions(shard_index=0, shard_count=1, drop_remainder=True),
    shuffle=True,
    seed=0
)

steps_index_sampler = grain.IndexSampler(
    num_records=2,
    num_epochs=1,
    shard_options=grain.ShardOptions(shard_index=0, shard_count=1, drop_remainder=True),
    shuffle=True,
    seed=0
)

import pyinstrument
profiler = pyinstrument.Profiler()

profiler.start()

episode_data_loader = grain.DataLoader(data_source=episode_data_source, sampler=episode_index_sampler)
for episode_data in episode_data_loader:
    steps_data_source = episode_data[rlds.STEPS]
    steps_data_loader = grain.DataLoader(data_source=steps_data_source, sampler=steps_index_sampler)
    for steps_data in steps_data_loader:
        pass

profiler.stop()

# Save the flamegraph to an HTML file
with open('flamegraph.html', 'w') as f:
    f.write(profiler.output_html())

The profile shows the time going specifically into creating and validating the data loader state.
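To make the cost pattern concrete, here is a minimal pure-Python sketch. It assumes the slowdown comes from repr() being called on the inner data source each time a loader is set up, as described above. `SlowReprSource`, `CheapReprSource`, and `time_repr` are hypothetical names made up for illustration; none of them is part of grain, tfds, or rlds, and I have not verified that wrapping the steps source this way is accepted by grain.DataLoader.

```python
import time


class SlowReprSource:
    """Toy random-access source whose repr renders every record (hypothetical)."""

    def __init__(self, records):
        self._records = list(records)

    def __len__(self):
        return len(self._records)

    def __getitem__(self, index):
        return self._records[index]

    def __repr__(self):
        # Cost grows with the amount of data held, similar to printing nested arrays.
        return f"SlowReprSource({self._records!r})"


class CheapReprSource:
    """Hypothetical wrapper: forwards access but keeps repr constant-size."""

    def __init__(self, source):
        self._source = source

    def __len__(self):
        return len(self._source)

    def __getitem__(self, index):
        return self._source[index]

    def __repr__(self):
        # Only the length is rendered, never the records themselves.
        return f"CheapReprSource(len={len(self._source)})"


def time_repr(source, repeats=20):
    """Return the seconds spent calling repr(source) `repeats` times."""
    start = time.perf_counter()
    for _ in range(repeats):
        repr(source)
    return time.perf_counter() - start


steps = SlowReprSource(range(200_000))
wrapped = CheapReprSource(steps)
slow_seconds = time_repr(steps)
fast_seconds = time_repr(wrapped)
print(f"repr on raw source: {slow_seconds:.4f}s, on wrapper: {fast_seconds:.6f}s")
```

If the repr really is the bottleneck, wrapping episode_data[rlds.STEPS] in something like the wrapper above before handing it to the inner DataLoader might work as a stopgap, but that is a guess until someone confirms where the repr is used.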

vyeevani · 2024-10-03T16:17:00Z

vyeevani mentioned this issue Oct 3, 2024

PythonDataSource is creating a repr tensorflow/datasets#5633

Open
