Create dataset loader for SIB-200 #266

Closed
SamuelCahyawijaya opened this issue Jan 1, 2024 · 7 comments · Fixed by #470
Labels
pr-ready A PR that closes this issue is Ready to be reviewed

Comments

@SamuelCahyawijaya
Collaborator

Dataloader name: sib200/sib200.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?sib200

| Dataset | sib200 |
|---|---|
| Description | SIB-200 is a topic classification dataset covering 200 languages, derived from the Flores-200 machine translation corpus. The English subset of Flores-200 was annotated with topic labels, and those labels were projected onto the parallel instances belonging to the remaining 203 languages covered in the corpus. In initial baseline experiments on SIB-200, languages unseen during pre-training of multilingual LMs, under-represented language families, and languages from Africa, the Americas, Oceania, and Southeast Asia often had the lowest performance. Note that, after annotation, topic categories with fewer than 80 sentences were removed from the final classification dataset. |
| Subsets | ace, ban, bjn, bug, ceb, ilo, ind, jav, kac, khm, lao, lus, min, mya, pag, shn, sun, tgl, tha, vie, war, zsm |
| Languages | ace, ban, bjn, bug, ceb, ilo, ind, jav, kac, khm, lao, lus, min, mya, pag, shn, sun, tgl, tha, vie, war, zsm |
| Tasks | Text Classification |
| License | Apache license 2.0 (apache-2.0) |
| Homepage | https://github.com/dadelani/sib-200 |
| HF URL | - |
| Paper URL | https://arxiv.org/pdf/2309.07445v1.pdf |
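As a rough illustration of how the subset list in the card could map onto per-language loader configs, here is a minimal sketch. The config naming pattern (`sib200_<subset>_source` and `sib200_<subset>_seacrowd_text`) and the helper names `build_config_names` and `parse_subset` are assumptions made for this example, not something stated in this issue.

```python
# Subset codes copied from the "Subsets" entry of the dataset card.
SUBSETS = [
    "ace", "ban", "bjn", "bug", "ceb", "ilo", "ind", "jav", "kac", "khm", "lao",
    "lus", "min", "mya", "pag", "shn", "sun", "tgl", "tha", "vie", "war", "zsm",
]


def build_config_names(dataset_name: str = "sib200") -> list[str]:
    """Return one source config and one text-schema config per subset.

    The "<dataset>_<subset>_<schema>" pattern is an assumed convention.
    """
    names = []
    for subset in SUBSETS:
        names.append(f"{dataset_name}_{subset}_source")
        names.append(f"{dataset_name}_{subset}_seacrowd_text")
    return names


def parse_subset(config_name: str, dataset_name: str = "sib200") -> str:
    """Extract the subset code from a config name and validate it."""
    prefix = f"{dataset_name}_"
    if not config_name.startswith(prefix):
        raise ValueError(f"Unexpected config name: {config_name!r}")
    subset = config_name[len(prefix):].split("_", 1)[0]
    if subset not in SUBSETS:
        raise ValueError(f"Unknown subset: {subset!r}")
    return subset
```

Validating the subset up front means a request for a language outside the list above fails with a clear error instead of silently loading nothing.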
SamuelCahyawijaya converted this from a draft issue on Jan 1, 2024
@muhsatrio
Contributor

#self-assign


Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@sabilmakbar
Collaborator

Hi @muhsatrio, may we know the update on this dataloader issue? It's been 2 weeks since the last poke from the SEACrowd stale-checker, and we might consider unassigning if there's no progress update in the next 24 hours.


Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@tellarin
Collaborator

@SamuelCahyawijaya I'd like to take on this loader if the current effort has stalled.

@tellarin
Collaborator

Can you also update the dataset card to list the HF repo? https://huggingface.co/datasets/Davlan/sib200

@holylovenia
Contributor

holylovenia commented Feb 19, 2024

@tellarin Thanks for the info. Added in the datasheet. I've assigned you to this issue.

@muhsatrio I'm removing your assignment due to no response. 🙏

holylovenia added the pr-ready label and removed the staled-issue label on Mar 11, 2024
holylovenia pushed a commit that referenced this issue Apr 21, 2024
* Initial dataloader structure

* Schema implementations

* Update sib_200.py

- Standardize `_LANGUAGES`
- Update task to TOPIC_MODELING
- Fix redundant ID

* Update sib_200.py

bug fix for loading specific subset

* Update sib_200.py

Rework dataset flow - remove `_load_hf_data_from_remote`

---------

Co-authored-by: Samuel Cahyawijaya <[email protected]>
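
The "Schema implementations" and "Fix redundant ID" items in the commit message above could look roughly like the sketch below. It is an illustration under stated assumptions, not the merged loader: the source field names (`index_id`, `category`, `text`), the unified fields (`id`, `text`, `label`), and the function name `generate_examples` are all hypothetical here.

```python
from typing import Dict, Iterable, Iterator, Tuple


def generate_examples(
    rows: Iterable[Dict], schema: str
) -> Iterator[Tuple[int, Dict]]:
    """Yield (key, example) pairs for the requested schema.

    Field names are assumptions for this sketch. The running index is used
    as the single key so that IDs stay unique within a split.
    """
    for idx, row in enumerate(rows):
        if schema == "source":
            # Pass the original columns through unchanged.
            yield idx, dict(row)
        elif schema == "seacrowd_text":
            # Map onto an assumed unified text-classification layout.
            yield idx, {
                "id": str(idx),
                "text": row["text"],
                "label": row["category"],
            }
        else:
            raise ValueError(f"Unsupported schema: {schema!r}")
```

Because this is a generator, an unsupported schema only raises once iteration starts, so a real loader would probably validate the schema name before reading any data.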
Projects
Status: Done
5 participants