Fix dataset loading, and other minor fixes #21
Hi,

`DiskResource` has been completely changed to now support downloading from a HuggingFace datasets repository. (Just to keep things simple I completely removed the Google Cloud logic, but if you think it should stay then we can maybe just merge the two together.) As it stands, it's been hardcoded to download from this repo, but it can be changed to something else by overriding `DB_HF_DATA` (see `disk_resource.py`). It would be good if you could test this branch out with a sanitised `design_bench_data` folder to make sure that everything downloads correctly.

Sadly, most datasets are missing their pretrained oracle weights :(. This means that most tasks take forever to import, since the library will try to train an oracle instead. These are the only pretrained weights I have on hand:
If you are able to fill in some of these gaps that would be good.
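For testing the override, here is a minimal sketch of the repo-override pattern. The names below (including the placeholder repo id and the helper function) are illustrative assumptions, not the actual code; see `disk_resource.py` for the real `DB_HF_DATA` constant:

```python
# Illustrative sketch only: DB_HF_DATA is assumed to be a module-level
# constant naming the HuggingFace datasets repo that files are pulled from.
DB_HF_DATA = "some-user/design_bench_data"  # hypothetical placeholder repo id


def hf_resolve_url(filename, repo_id=None):
    """Build the standard HF 'resolve' URL for a file in a datasets repo."""
    repo_id = repo_id or DB_HF_DATA
    return f"https://huggingface.co/datasets/{repo_id}/resolve/main/{filename}"


# Overriding the source repo is then just a matter of rebinding DB_HF_DATA
# (or passing repo_id explicitly) before any resources are downloaded.
print(hf_resolve_url("some_task/data.npy", repo_id="me/my_sanitised_data"))
```

Pointing `repo_id` at a sanitised copy of `design_bench_data` would let you verify the download path without touching the default repo.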
Other changes:
- The warning "Setting 'max_len_sentences_pair' is now deprecated. This value is automatically set up. Setting 'max_len_single_sentence' is now deprecated. This value is automatically set up." has now been suppressed, since it spams the screen when you import.
- `np.loads`.

Thanks.
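On the warning suppression: the actual change may differ, but a minimal sketch of silencing that message with the stdlib `warnings` module (the filter pattern here is an assumption) looks like this:

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Ignore the tokenizer deprecation spam by matching the message prefix;
    # filterwarnings treats `message` as a regex matched at the start.
    warnings.filterwarnings(
        "ignore", message=r"Setting 'max_len_\w+' is now deprecated"
    )
    warnings.warn(
        "Setting 'max_len_sentences_pair' is now deprecated. "
        "This value is automatically set up.",
        DeprecationWarning,
    )

# The warning was filtered out, so nothing was recorded.
print(len(caught))
```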