Introduces load_from_disk datasets #218

Open
wants to merge 2 commits into
base: main
from

Conversation

nickmitchko

This addition lets a user load a dataset directory they created locally with the save_to_disk functionality of Hugging Face Datasets. If you pass a dataset name that is a directory path (ending in / or \), it is treated as a local Hugging Face dataset and loaded from disk.

https://huggingface.co/docs/datasets/process#save
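
A minimal sketch of the dispatch described above, assuming the dataset name arrives as a plain string. The function names `is_local_dataset_dir` and `load_dataset_by_name` are hypothetical and are not taken from the PR diff; only `load_dataset` and `load_from_disk` are real `datasets` functions.

```python
def is_local_dataset_dir(name: str) -> bool:
    """Return True if the name ends in a path separator (/ or \\).

    Hypothetical helper: this mirrors the trailing-slash convention
    described in the PR, not the actual code in the diff.
    """
    return name.endswith(("/", "\\"))


def load_dataset_by_name(name: str):
    """Load a save_to_disk directory or fall back to load_dataset."""
    # Imported lazily so the helper above works without the library installed.
    from datasets import load_dataset, load_from_disk

    if is_local_dataset_dir(name):
        # Directory written earlier by Dataset.save_to_disk(...)
        return load_from_disk(name)
    # Anything else is left to Hugging Face to resolve (Hub name, script, etc.)
    return load_dataset(name)
```

With this convention, a name such as `my_dataset/` would go through `load_from_disk`, while a name without a trailing separator would still go through `load_dataset`.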

@nickmitchko nickmitchko changed the title Introduces Local Dataset Loading Ability Introduces load_from_disk datasets Jul 19, 2023
@artidoro
Owner

Ideally huggingface does the parsing for us. We should stay away from deciding what is local and what is on the hub.
Also isn't this handled by load_dataset?
https://huggingface.co/docs/datasets/package_reference/loading_methods#datasets.load_dataset

Please provide more details for why this would be needed or if it's already included in the repo.

@nickmitchko
Author

Ideally huggingface does the parsing for us. We should stay away from deciding what is local and what is on the hub. Also isn't this handled by load_dataset? https://huggingface.co/docs/datasets/package_reference/loading_methods#datasets.load_dataset

The impetus for this change was that load_dataset did not handle my dataset directory correctly. I can look up the exact error, but handling this use case (a trailing directory slash) works.

Please provide more details for why this would be needed or if it's already included in the repo.

You can't load a local dataset directory (one not published on the Hugging Face Hub) without it. I have data that can't be published on the Hub, which is why I need this functionality.

@nickmitchko
Author

If you check out the load_dataset method, it only opens CSV, JSON, Parquet, and other formats based on the Python script provided in the directory. save_to_disk doesn't save in any of these formats. Thus, if you concatenate a variety of datasets into one and then save the result to disk, you can't load it with the current load_dataset method.
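
As an alternative to the trailing-slash convention, a directory written by save_to_disk could be recognized by the metadata files it contains. This is only a sketch under an assumption: that recent `datasets` releases write `state.json` for a single Dataset and `dataset_dict.json` for a DatasetDict. The helper name `is_save_to_disk_dir` is hypothetical and not part of this PR.

```python
import os

# Assumption: marker files written by save_to_disk in recent `datasets` releases.
SAVE_TO_DISK_MARKERS = ("state.json", "dataset_dict.json")


def is_save_to_disk_dir(path: str) -> bool:
    """Return True if `path` is a directory holding a save_to_disk marker file."""
    if not os.path.isdir(path):
        return False
    return any(
        os.path.isfile(os.path.join(path, marker))
        for marker in SAVE_TO_DISK_MARKERS
    )
```

A check like this would not depend on how the user typed the path, though it does depend on the on-disk layout staying the same across library versions.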

@nickmitchko
Author

@artidoro thoughts?


2 participants