Issues in sub-dagws #273

Open
KennethEnevoldsen opened this issue May 23, 2024 · 3 comments
Comments

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented May 23, 2024

Just found a tiny issue for all sub-dagws:

Here Peter specified source as dagw-xxx, but the validator checks whether the source equals the name of the dataset folder: dataset_name = document_file.parent.parent.name, which is dagw:

image

Therefore we will have:

Checking dataset: dagw:  86%|████████████████████████████████████████████▉       | 19/22 [00:09<00:02,  1.46it/s]ERROR:__main__:--- Dataset dagw failed validation ------------
ERROR:__main__:Datasheet dagw does not exist.
Error reading datasheet dagw: [Errno 2] No such file or directory: '/work/github/danish-foundation-models/docs/datasheets/dagw'
Error in document file dagw-retsinformationdk.jsonl.gz: Source should be dagw, but is dagw-retsinformationdk
Error in document file dagw-ep.jsonl.gz: Source should be dagw, but is dagw-ep
Error in document file dagw-hest.jsonl.gz: Source should be dagw, but is dagw-hest

The same applies to the check for whether datasheets exist: it would only check if dagw.md exists.
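A minimal sketch of why the source check fails, assuming the sub-dagw files all sit under a single dagw/documents folder. Only the expression document_file.parent.parent.name comes from the validator as quoted above; the function names and the example paths are hypothetical.

```python
from pathlib import Path
from typing import Optional


def expected_source(document_file: Path) -> str:
    # Same expression as quoted above: the dataset name is the folder
    # two levels above the document file.
    return document_file.parent.parent.name


def check_source(document_file: Path, source: str) -> Optional[str]:
    # Hypothetical helper: returns an error message, or None if the source matches.
    dataset_name = expected_source(document_file)
    if source != dataset_name:
        return (
            f"Error in document file {document_file.name}: "
            f"Source should be {dataset_name}, but is {source}"
        )
    return None


# Assumed layout: every sub-dagw file lives in the shared dagw folder.
shared = Path("dataset_folder/dagw/documents/dagw-ep.jsonl.gz")
print(check_source(shared, "dagw-ep"))

# Assumed layout after splitting: one folder per sub-dagw.
split = Path("dataset_folder/dagw-ep/documents/dagw-ep.jsonl.gz")
print(check_source(split, "dagw-ep"))
```

With the shared folder the derived name is dagw, so any dagw-xxx source is rejected; with one folder per sub-dagw the derived name and the source agree.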

So I guess we have to separate each sub-dagw into its own folder, like:

dataset_folder
│
└── dataset_name
    │
    ├── documents
    │   └── dataset_name.jsonl.gz  
    │
    └── attributes   # OPTIONAL: folder containing annotations from dataset cleaning
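The layout above could be produced with something like the following sketch. It assumes the sub-dagw files currently sit in one shared documents folder and are named dataset_name.jsonl.gz; the function name and its arguments are hypothetical and not part of the repository.

```python
import shutil
from pathlib import Path
from typing import List

SUFFIX = ".jsonl.gz"


def split_into_folders(shared_dir: Path, dataset_folder: Path) -> List[Path]:
    # Move each <name>.jsonl.gz from shared_dir/documents into
    # dataset_folder/<name>/documents/<name>.jsonl.gz.
    moved = []
    for doc in sorted((shared_dir / "documents").glob(f"*{SUFFIX}")):
        name = doc.name[: -len(SUFFIX)]  # e.g. "dagw-ep"
        target_dir = dataset_folder / name / "documents"
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / doc.name
        shutil.move(str(doc), str(target))
        moved.append(target)
    return moved
```

The optional attributes folder is not handled here; any annotations would need to be moved separately.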

Originally posted by @TTTTao725 in #266 (comment)

@KennethEnevoldsen
Contributor Author

@TTTTao725 yes I would suggest that we do that

@TTTTao725
Contributor

great! will do.

@TTTTao725
Contributor

Also, there is a typo here:
datasheet_path = datasheets_path / dataset_path.name

should be:
datasheet_path = datasheets_path / f'{dataset_path.name}.md'

already fixed it here.
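For illustration, a small sketch of the difference between the two expressions. The variable names datasheets_path and dataset_path come from the snippet above; the wrapper functions and the example paths are hypothetical.

```python
from pathlib import Path


def datasheet_path_before(datasheets_path: Path, dataset_path: Path) -> Path:
    # Original expression: no file extension, so it points at "dagw", not "dagw.md".
    return datasheets_path / dataset_path.name


def datasheet_path_after(datasheets_path: Path, dataset_path: Path) -> Path:
    # Corrected expression: appends ".md" to the dataset folder name.
    return datasheets_path / f"{dataset_path.name}.md"


# Example paths are assumptions chosen to match the log output quoted earlier.
sheets = Path("docs/datasheets")
dataset = Path("dataset_folder/dagw")
print(datasheet_path_before(sheets, dataset).name)
print(datasheet_path_after(sheets, dataset).name)
```

The missing extension is consistent with the earlier log line, where the reported path ends in docs/datasheets/dagw rather than dagw.md.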
