Create dataset loader for SIB-200 #266

SamuelCahyawijaya · 2024-01-01T17:44:23Z

Dataloader name: sib200/sib200.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?sib200

Dataset	sib200
Description	SIB-200 is a topic classification dataset covering 200 languages and derived from the Flores-200 machine translation corpus. The English subset of Flores-200 was annotated with topic labels, and those labels were projected onto the parallel instances belonging to the remaining 203 languages covered in the corpus. In initial baseline experiments on SIB-200, languages unseen during pre-training of multilingual LMs, under-represented language families, and languages from regions of Africa, the Americas, Oceania, and SEA often had the lowest performance. Note that, after annotation, topic categories with fewer than 80 sentences were removed from the final classification dataset.
Subsets	ace, ban, bjn, bug, ceb, ilo, ind, jav, kac, khm, lao, lus, min, mya, pag, shn, sun, tgl, tha, vie, war, zsm
Languages	ace, ban, bjn, bug, ceb, ilo, ind, jav, kac, khm, lao, lus, min, mya, pag, shn, sun, tgl, tha, vie, war, zsm
Tasks	Text Classification
License	Apache license 2.0 (apache-2.0)
Homepage	https://github.com/dadelani/sib-200
HF URL	-
Paper URL	https://arxiv.org/pdf/2309.07445v1.pdf
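
For reference, the two construction steps mentioned in the description (removing rare topic categories, then projecting the English labels onto the parallel sentences) could be sketched roughly as below. This is only an illustration under assumptions: the function names `drop_rare_categories` and `project_labels`, the record layout, and the toy sentences are all hypothetical and are not taken from the SIB-200 or SEACrowd code. Only the threshold of 80 comes from the description.

```python
from collections import Counter


def drop_rare_categories(labels, min_count=80):
    """Keep only labels that occur on at least `min_count` sentences.

    `labels` maps a sentence ID to its topic label (hypothetical layout).
    """
    counts = Counter(labels.values())
    return {sid: lab for sid, lab in labels.items() if counts[lab] >= min_count}


def project_labels(english_labels, parallel_corpus):
    """Copy each English label to the sentence with the same ID in every language.

    `parallel_corpus` maps a language code to {sentence_id: text}.
    """
    return {
        lang: [
            {"id": sid, "text": sentences[sid], "label": label}
            for sid, label in english_labels.items()
            if sid in sentences
        ]
        for lang, sentences in parallel_corpus.items()
    }


# Toy data, invented for illustration only.
english_labels = {0: "science", 1: "science", 2: "travel"}
parallel_corpus = {
    "ind": {0: "kalimat nol", 1: "kalimat satu", 2: "kalimat dua"},
    "jav": {0: "ukara nol", 1: "ukara siji", 2: "ukara loro"},
}

# A threshold of 2 is used here so the toy data shows the effect;
# the description above states 80 for the real dataset.
kept = drop_rare_categories(english_labels, min_count=2)
projected = project_labels(kept, parallel_corpus)
```

Because Flores-200 is sentence-aligned, projection in this sketch is just a lookup by shared sentence ID, so every language ends up with the same label distribution.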

muhsatrio · 2024-01-04T05:53:20Z

#self-assign

github-actions · 2024-01-19T02:07:14Z

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

sabilmakbar · 2024-02-01T16:35:29Z

Hi @muhsatrio, may we know the update on this dataloader issue? It's been 2 weeks since the last poke from the SEACrowd stale-checker, and we might consider unassigning if there's no progress update in the next 24 hours.

github-actions · 2024-02-16T01:57:40Z

Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

tellarin · 2024-02-19T04:52:06Z

@SamuelCahyawijaya I'd like to take on this loader if the effort has stalled.

tellarin · 2024-02-19T04:52:49Z

Can you also update the dataset card to list the HF repo? https://huggingface.co/datasets/Davlan/sib200

holylovenia · 2024-02-19T06:43:09Z

@tellarin Thanks for the info. I've added it to the datasheet and assigned you to this issue.

@muhsatrio I'm removing your assignment due to no response. 🙏

* Initial dataloader structure
* Schema implementations
* Update sib_200.py
  - Standardize `_LANGUAGES`
  - Update task to TOPIC_MODELING
  - Fix redundant ID
* Update sib_200.py: bug fix for loading specific subset
* Update sib_200.py: rework dataset flow
  - Remove `_load_hf_data_from_remote`

---------

Co-authored-by: Samuel Cahyawijaya <[email protected]>

SamuelCahyawijaya added this to SEACrowd Data Hub Jan 1, 2024

SamuelCahyawijaya converted this from a draft issue Jan 1, 2024

github-actions bot assigned muhsatrio Jan 4, 2024

github-actions bot added the staled-issue label Jan 19, 2024

github-actions bot removed the staled-issue label Feb 2, 2024

github-actions bot added the staled-issue label Feb 16, 2024

holylovenia unassigned muhsatrio Feb 19, 2024

holylovenia assigned tellarin Feb 19, 2024

holylovenia removed the staled-issue label Feb 19, 2024

tellarin mentioned this issue Mar 2, 2024

Closes #266 | Create dataset loader for SIB-200 #470

Merged

8 tasks

github-actions bot added the staled-issue label Mar 5, 2024

holylovenia added the pr-ready label (a PR that closes this issue is ready to be reviewed) and removed the staled-issue label Mar 11, 2024

holylovenia closed this as completed in #470 Apr 21, 2024

github-project-automation bot moved this to Done in SEACrowd Data Hub Apr 21, 2024
