Bug in HuggingfaceDatasetReader in streaming mode #308

Open
habanoz opened this issue Nov 29, 2024 · 2 comments

Comments

@habanoz

habanoz commented Nov 29, 2024

The bug is self-evident.

ex_iterable = dst._ex_iterable.shard_data_sources(rank, world_size)

The order of the rank and world_size arguments is incorrect: rank is passed as the num_shards parameter and world_size is passed as the index parameter.

https://github.com/huggingface/datasets/blob/06c3235a640d00bf59223ebabf3cb489a2891767/src/datasets/iterable_dataset.py#L144

This bug breaks sharding in streaming mode.
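
To illustrate the swapped arguments, here is a minimal sketch. The `shard_data_sources` function below is a hypothetical stand-in, not the actual `datasets` implementation: it only mirrors the `(num_shards, index)` parameter order described above and uses simple strided slicing, which is an assumption made for the example.

```python
def shard_data_sources(items, num_shards, index):
    """Hypothetical stand-in: keep every `num_shards`-th item, starting at `index`."""
    return items[index::num_shards]


items = list(range(8))
rank, world_size = 1, 4

# Buggy call: positional order (rank, world_size) binds
# rank -> num_shards and world_size -> index.
buggy = shard_data_sources(items, rank, world_size)

# Fixed call: keyword arguments make the binding explicit.
fixed = shard_data_sources(items, num_shards=world_size, index=rank)

# With the fixed call, the shards of all ranks are disjoint and cover the data.
all_shards = [
    shard_data_sources(items, num_shards=world_size, index=r)
    for r in range(world_size)
]
```

In this sketch the buggy call treats the data as a single shard and starts at offset 4, while the fixed call returns only the items that belong to rank 1. Passing the arguments by keyword would also guard against a future reordering of a private method's positional parameters.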

@hynky1999
Contributor

hynky1999 commented Nov 29, 2024

Hi, good spot. I remember noticing this too, I just forgot to create a PR.
The issue is that it's a private method and it was changed a month ago: huggingface/datasets@65f6eb5#diff-edc4da5f2179552e25f4f3dc9d6bf07265b68bbef048a8f712e798520a23d048L103

So now the args are different.

Do you think you could implement the fix? (fix the line + bump datasets so that it doesn't clash)

@habanoz
Author

habanoz commented Nov 29, 2024

@hynky1999 I have created PR #309.
