Create dataset loader for Glot500-c #682

Open
SamuelCahyawijaya opened this issue May 27, 2024 · 0 comments


Dataloader name: gloot500_c/gloot500_c.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?gloot500_c

| Dataset | gloot500_c |
| --- | --- |
| Description | Glot500-c is a text corpus covering 511 languages, on which the Glot500-m language model was trained. It is a subset of Glot2000-c, obtained by excluding languages below a minimum of 30,000 sentences. The corpus is about 600 GB in size and contains about 1.5 billion sentences. The data comes partly from crawling websites and partly from compiling existing datasets, so it may overlap with other SEACrowd datasets. It also means that licenses may differ across the underlying datasets, and some of them require specific registration and/or access requests. The work also includes several benchmarks and evaluation tasks on which Glot500-m is evaluated. |
| Subsets | - |
| Languages | bsb, iba, ind, eng, zsm, khm, lao, tha, tdt, por, ace, fil, vie, tih, mya, bcl, ceb, zlm, jav, sun, bjn, min, tgl, tam, hil, ilo, kac, war, ahk, dtp, ksw, lhu, pag, cmn, pam, bbc, ban, sxn, nia, btx, gor, mad, bts, mbb, prk, ibg, bhw, ifb, ifa, mrw |
| Tasks | Language Modeling, Named Entity Recognition, POS Tagging, Text Classification |
| License | Other (other) |
| Homepage | https://github.com/cisnlp/Glot500 |
| HF URL | https://huggingface.co/datasets/cis-lmu/Glot500 |
| Paper URL | https://aclanthology.org/2023.acl-long.61/ |
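
A minimal sketch of how a loader might narrow the corpus to the languages in the Languages entry above. It assumes that the per-language configurations are named `<iso639-3>_<Script>` (for example `ind_Latn`); that naming, the `select_configs` helper, and the example config names are illustrative assumptions and should be checked against the HF dataset card before use.

```python
# ISO 639-3 codes copied from the "Languages" entry of this issue.
SEA_LANGUAGES = {
    "bsb", "iba", "ind", "eng", "zsm", "khm", "lao", "tha", "tdt", "por",
    "ace", "fil", "vie", "tih", "mya", "bcl", "ceb", "zlm", "jav", "sun",
    "bjn", "min", "tgl", "tam", "hil", "ilo", "kac", "war", "ahk", "dtp",
    "ksw", "lhu", "pag", "cmn", "pam", "bbc", "ban", "sxn", "nia", "btx",
    "gor", "mad", "bts", "mbb", "prk", "ibg", "bhw", "ifb", "ifa", "mrw",
}


def select_configs(available_configs, wanted=SEA_LANGUAGES):
    """Return, sorted, the config names whose language prefix is in `wanted`.

    Assumption: config names look like "<iso639-3>_<Script>", e.g. "ind_Latn".
    Names without a script suffix are skipped.
    """
    selected = []
    for name in available_configs:
        lang, _, script = name.partition("_")
        if lang in wanted and script:
            selected.append(name)
    return sorted(selected)


if __name__ == "__main__":
    # Hypothetical config names, for illustration only.
    example = ["ind_Latn", "deu_Latn", "tha_Thai", "khm_Khmr", "fra_Latn"]
    print(select_configs(example))
```

With the Hugging Face `datasets` library, each selected name could then be passed as the configuration argument, e.g. `load_dataset("cis-lmu/Glot500", name, split="train", streaming=True)`. Streaming is likely worthwhile here given the roughly 600 GB corpus size, though whether the repository supports it is not confirmed in this issue.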