Glot500-c is a text corpus covering 511 languages, on which the Glot500-m language model was trained. It is a subset of Glot2000-c, obtained by excluding languages with fewer than 30,000 sentences. The corpus is about 600 GB in size and contains about 1.5 billion sentences. The data was obtained partly by crawling websites and partly by compiling existing datasets, so it may overlap with other SEACrowd datasets. Licenses may also differ across the underlying datasets, and some of them require specific registration and/or access requests. The work also includes several benchmarks and evaluation tasks on which Glot500-m is evaluated.
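
The minimum-sentence criterion described above can be sketched as follows. This is only an illustration, assuming the 30,000 threshold is applied per language; the function `select_languages` and the input format (one language code per sentence) are hypothetical and not taken from the Glot500 codebase.

```python
from collections import Counter

# Threshold stated in the corpus description; applying it per language is an assumption.
MIN_SENTENCES = 30_000


def select_languages(sentence_langs, min_sentences=MIN_SENTENCES):
    """Return the language codes that have at least `min_sentences` sentences.

    `sentence_langs` is any iterable yielding one language code per sentence
    (a hypothetical input format chosen for this sketch).
    """
    counts = Counter(sentence_langs)
    return {lang for lang, n in counts.items() if n >= min_sentences}
```

A dataloader could use a filter of this kind to report which languages of a larger collection would survive the cut, before downloading any of the underlying data.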
Dataloader name:
gloot500_c/gloot500_c.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?gloot500_c