Update (#2) · MJonibek/seacrowd-datahub@34e1749

Update (#2)

* Fix bug unique ids

* Closes SEACrowd#162 | Add Bloom-Captioning Dataloader (SEACrowd#198)

* Init dataloader bloom captioning

* Fix issue on multiple splits from its source

* Change local var

* Cater 'test' and 'val' split and fix the '_id' generation

* fix: remove abstract and change _LOCAL and _DESC

* fix: _DESC indent

* Format openslr.py and add init file

* Closes SEACrowd#271 | Implement dataloader for UiT-ViCTSD (SEACrowd#300)

* Implement UiT-ViCTSD dataloader

* Improve subset IDs, feature types, code to generate examples

* Closes SEACrowd#161 | Create dataset loader for ICON 161 (SEACrowd#317)

* Create icon.py

* Update icon.py

* Create __init__.py

* Closes SEACrowd#142 | Add Unimorph v4 dataloader (SEACrowd#168)

* Add Unimorph dataloader

Resolves SEACrowd#142

* Add Dataset to class name

* Closes SEACrowd#71 | Create dataset loader for MASSIVE (SEACrowd#196)

* add data loader for massive dataset

* modify the class name & refactor the function name

* change task name from pos tagging to slot filling & make check_file & change subset name to differentiate intent / slot filling tasks

* Closes SEACrowd#14 | Create dataset loader for ara-close-lange (SEACrowd#243)

* Add ara_close dataloader

* Rename class name to AraCloseDataset

* Closes SEACrowd#273 | Implement dataloader for UIT_ViON (SEACrowd#282)

* Implement dataloader for UIT_ViON

* Add __init__.py

* Add {lang} in subset id for openslr

* Closes SEACrowd#219 | Create dataloader for scb-mt-en-th-2020 (SEACrowd#287)

* Create dataloader for scb-mt-en-th-2020

* Rename the data loader files to its snakecase

* rename _DATASETNAME to snakecase

* Fix languages setting

* Update template.py

* Add docstring openslr.py

* Closes SEACrowd#277 | Implement dataloader for spamid_pair (SEACrowd#281)

* Implement dataloader for spamid_pair

* Update seacrowd/sea_datasets/spamid_pair/spamid_pair.py

Co-authored-by: Lj Miranda <[email protected]>

* Add __init__.py

* Update __init__.py

---------

Co-authored-by: Lj Miranda <[email protected]>

* Implemented dataloader for indoler

* Add imqa schema and VISUAL_QUESTION_ANSWERING task (SEACrowd#380)

* Update template.py

Update DownloadManager documentation link in template.py

* Closes SEACrowd#54 | Implement Dataloader for IndoSMD (SEACrowd#258)

* feat: indosmd dataloader for source

* refactor by pre-commit

* IndoSMD: reformatted by pre-commit

* Update changes on indosmd.py

* revised line 223 in indosmd.py

* Close #143 | Create dataset loader for Abui WordNet (SEACrowd#285)

* add tydiqa dataloader

* add id_vaccines_tweet dataloader

* add uit-vicc dataloader

* add ICON dataloader

* add iaap_squad dataloader

* add stb_ext dataloader

* Revert "add iaap_squad dataloader"

This reverts commit 1f8a591.

* Revert "add tydiqa dataloader"

This reverts commit 6bf4546.

* Revert "add id_vaccines_tweet dataloader"

This reverts commit 1154087.

* Revert "add uit-vicc dataloader"

This reverts commit 09661fa.

* Revert "add ICON dataloader"

This reverts commit 0891e58.

* Update stb_ext.py

* add abui_wordnet dataloader

* Revert "Update stb_ext.py"

This reverts commit 59c5301.

* Delete seacrowd/sea_datasets/stb_ext/stb_ext.py

* Delete seacrowd/sea_datasets/stb_ext/__init__.py

* Update abui_wordnet.py

* Update abui_wordnet.py

* Update abui_wordnet.py

---------

Co-authored-by: Lj Miranda <[email protected]>
Co-authored-by: Samuel Cahyawijaya <[email protected]>

* Added Morality Classification Tasks to constants.py (SEACrowd#371)

* Closes SEACrowd#216 | Create dataset loader for Mozilla Pontoon (SEACrowd#260)

* Begin first draft of Mozilla Pontoon dataloader

* Add dataloader for Mozilla Pontoon

* Remove enumerate in _generate_examples

* Fix issues due to changed format, rename features and config names

* Closes SEACrowd#157 | Create dataset loader for M3Exam (SEACrowd#302)

* Add m3exam dataloader

* Small change in m3exam.py

* Fix bug during downloading

* Add meta feature to seacrowd schema for m3exam

* Rename class M3Exam to M3ExamDataset

* Add image question answering

* Merge two source schemas into one for m3exam

* Fix image path, choices and answer in m3exam

* Update CODEOWNERS

* Rectify SEACrowd Internal Vars (SEACrowd#386)

* Add missing __init__.py

* add init

* fix bug in phoatis load

* add lang variables in dataloaders

* Add dataset use ack on source HF repo into description

* Closes SEACrowd#204 | Implement dataloader for Melayu_Sabah (SEACrowd#234)

* Implement dataloader for Melayu_Sabah

* Update name for the dataloader

* Add _CITATION

* Update seacrowd/sea_datasets/melayu_sabah/melayu_sabah.py

* Apply suggestions from review

* Moving unnecessary content in dialogue text

* Update melayu_sabah.py

* Improvement: Workflow Message to Mention Assignee in Stale Issues (SEACrowd#400)

* Update stale.yml (SEACrowd#327)

* Update stale.yml

Test on adding vars on assignee & author of Issues & PR

* Update stale.yml

* Update stale.yml

* Update stale.yml

* Update stale.yml

* Update stale.yml

* Closes SEACrowd#272 | Create dataset loader for SNLI (SEACrowd#290)

* [New Feature] Add SNLI dataloader

* [Fix] SNLI rev according to PR review

* [Chore] Add comment for accessibility

* Update common_parser.py (SEACrowd#333)

* Implement dataloader for UCLA Phonetic Corpus

* Implement dataloader for KDE4

* removed redundant builder_config

* Update cc3m_35l.py

Switched to no parallelization because the process kept being killed by the OS for some reason.

* Fix: Workflow Assignee Mention (SEACrowd#410)

* Update stale.yml

* Fix: wrong quote in message (SEACrowd#411)

* Update and fix bug on stale.yml

* Closes SEACrowd#17 | Implement dataloader for Philippine Fake News Corpus (SEACrowd#331)

* Implement dataloader

* Edit dataloader class name

* Simplify code

* Fix citation typo

* Closes SEACrowd#359 | Implement dataloader for LR-Sum (SEACrowd#368)

* Implement dataloader

* Fix short description

* feat: mswc dataloader skeleton

* feat: example for seacrowd schema

* Closes SEACrowd#265 | Implement dataloader for `myxnli` (SEACrowd#336)

* Implement dataloader for myxnli

* update myxnli

* Closes SEACrowd#112 | Implement Dataloader for Wisesight Thai Corpus (SEACrowd#279)

* Add wisesight_thai_sentiment dataset

* changes according to review

* changes according to review

* changes according to review

* Add changes according to review

* refactor: formatting

* fix: subset

* refactor: formatting

* Closes SEACrowd#6 | Add Loader for XCOPA (SEACrowd#286)

* initial add for loader

* edit to include multi language

* adjust comments

* apply suggestion

* fix by linter

---------

Co-authored-by: fawwaz.mayda <[email protected]>

* Closes SEACrowd#140 | Add Dengue Filipino (SEACrowd#259)

* add dengue filipino

* update license and tasks

* Update _LANGUAGE

* Update dengue_filipino.py

* feat: flores200 dataloader skeleton

* Set only one source schema

* Fix subnodes ids for root node alt_burmese_treebank

* implement Filipino Gay Language dataloader (SEACrowd#66)

* convert citation to raw string

* Closes SEACrowd#210 | Create dataset loader for Orchid Corpus (SEACrowd#303)

* Add orchid_pos dataloader

* Rename OrchidPOS to OrchidPOSDataset

* Fix parser bug in orchid_pos.py

* Add .strip() in source orchid_pos

* Change string for special char orchid_pos

* fix: remove useless loop

* refactor: remove unused loop

* Closes SEACrowd#159 | Create dataset loader for CC-Aligned (SEACrowd#298)

* Add cc_aligned_doc dataloader

* Rename class and format cc_aligned_doc

* Add SEACROWD_SCHEMA_NAME for cc_aligned_doc

* Closes SEACrowd#268 | Implement dataloader for Thai Toxicity Tweet Corpus (SEACrowd#301)

* Implement dataloader for Thai toxicity tweets

* Fix description grammar

* List labels as constant

* Change task to ABUSIVE_LANGUAGE_PREDICTION, improve _generate_examples

* Rename dataloader folder and file

* Remove comment, change license value

* Define SEACROWD_SCHEMA using _SUPPORTED_TASKS

* Fix bug where example ID and index do not match

* Closes SEACrowd#363 | Create dataset loader for identifikasi-bahasa (SEACrowd#379)

* [add]  initial commit

* [add] dataset loader for identifikasi_bahasa

* [refactor]  removed __main__

* Update seacrowd/sea_datasets/identifikasi_bahasa/identifikasi_bahasa.py

---------

Co-authored-by: Amir Djanibekov <[email protected]>

* Closes SEACrowd#182 | Implement dataloader for `roots_vi_ted` (SEACrowd#329)

* Implement dataloader for roots_vi_ted

* update

* update

* update

* remove local data

* reformat

* Closes SEACrowd#180 | Implement `IndoMMLU` dataloader (SEACrowd#324)

* Implement dataloader for indommlu

* update

* update

* Closes SEACrowd#345 | Implemented dataloader for vlsp2016_ner (SEACrowd#372)

* Implemented dataloader for vlsp2016_ner

* Format vlsp2016_ner.py

* Closes SEACrowd#276 | Implement PRDECT-ID dataloader (SEACrowd#322)

* Implement PRDECT-ID dataloader

Closes SEACrowd#276

* Add better type formatting

* Follow id_google_play_review for structure

* Include source configs for both emotion and sentiment

* Closes SEACrowd#9 | Add bhinneka_korpus dataset loader (SEACrowd#175)

* Add bhinneka_korpus dataset loader

* Updating the suggested changes

* Resolved review suggestions

* Create indonesian_news_dataset dataloader

* Closes SEACrowd#183 | Implement `wongnai_reviews` dataloader (SEACrowd#325)

* Implement dataloader for wongnai_reviews

* add __init__.py

* update

* update

* Implement change requested by holylovenia

* Closes SEACrowd#348 | Implemented dataloader for indoner_tourism (SEACrowd#373)

* Implemented dataloader for indoner_tourism

* Perform changes requested by ljvmiranda921

* Closes SEACrowd#361 | Create dataset loader for Thai-Lao Parallel Corpus (SEACrowd#384)

* [add] dataloader for tha_lao_embassy_parcor, no citation yet

* [add] citation; removed debug code

* [style] make format restyle

* [refactor]  removed TODO code

---------

Co-authored-by: Amir Djanibekov <[email protected]>

* Update constants.py

* Closes SEACrowd#305 | Implement dataloader for UIT_ViOCD (SEACrowd#335)

* Implement dataloader for UIT_ViOCD

* update according to the review

* Update _SUPPORTED_TASKS

* Closes SEACrowd#362 | Create dataset loader for GKLMIP Khmer News Dataset (SEACrowd#383)

* [add] dataloader for gklmip_newsclass

* [refactor]  changed licence value

---------

Co-authored-by: Amir Djanibekov <[email protected]>

* Closes SEACrowd#358 | Create dataset loader for GKLMIP Product Sentiment (SEACrowd#417)

* [add] dataset loader for gklmip_sentiment

* [refactor]  removed comment; removed "split" parameter in gen_kwargs

---------

Co-authored-by: Amir Djanibekov <[email protected]>

* Update constants.py

* Close SEACrowd#306 | Create dataset loader for ViHealthQA (SEACrowd#319)

* Create dataset loader for ViHealthQA SEACrowd#306

* add class docstring

* Update vihealthqa.py

* Closes SEACrowd#10 | Create beaye_lexicon dataset loader (SEACrowd#320)

* Create beaye_lexicon dataset loader

* add implementation of eng-day word pairs

* Closes SEACrowd#179 | Implement `indo_story_cloze` dataloader (SEACrowd#323)

* Implement indo_story_cloze dataloader.

* correct license

* update according to the feedback

* update

* Closes SEACrowd#353 | Create dataset loader for FilWordNet (SEACrowd#377)

* Add dataloader for FilWordNet

* Update seacrowd/sea_datasets/filwordnet/filwordnet.py

Co-authored-by: Lj Miranda <[email protected]>

* Update seacrowd/sea_datasets/filwordnet/filwordnet.py

Co-authored-by: Lj Miranda <[email protected]>

* Fix formatting

---------

Co-authored-by: Lj Miranda <[email protected]>

* feat: id_sentiment_analysis dataloader

* refactor: remove print

* refactor: default config name

* feat: subsets

* Closes SEACrowd#350 | Implement dataloader for Indonesian PRONER (SEACrowd#399)

* Implement dataloader for Indonesian PRONER

* Add manual and automatic subsets

---------

Co-authored-by: Railey Montalan <[email protected]>

* Implement dataloader for IMAD Malay Corpus (SEACrowd#402)

Co-authored-by: ssfei81 <[email protected]>

* Update id_wsd.py

* add thaigov (SEACrowd#412)

* add thaigov

* Update thaigov.py

* add inline comment for file structure

* Update and rename snli.py to snli_indo.py

* Rename SNLI to SNLI Indo

* Update snli_indo.py

* [add]  dataloader for sarawak_malay

* Closes SEACrowd#264 | Create dataset loader for mySentence SEACrowd#264 (SEACrowd#291)

* add mysentences dataloader

* align the config name to subset_id

* update mysentence config

* Update mysentence.py

* remove comment line

* Update mysentence.py

* Update mysentence config

* Update mysentence.py

* Update seacrowd/sea_datasets/mysentence/mysentence.py

Fix the subset_id case-checking for data download

* added __init__.py to ucla_phonetic

* updated dataloader according to suggestions

* Update memolon.py

* fix: subset_id format

* refactor: prepend dataset name to subset id

* fix: first language is set to latin english

* Add thai depression

* Create __init__.py

* Create __init__.py

* Create __init__.py

* Implement dataloader for SeaEval

* Update template.py instruction for dataloader class name (SEACrowd#334)

* Add documentation for dataloader class name

* Update template.py

* Update REVIEWING.md

This makes adding the "Dataset" suffix optional and points to templates/templates.py as a reference example

* Update REVIEWING.md

fix file reference name

---------

Co-authored-by: Salsabil Maulana Akbar <[email protected]>

* Closes SEACrowd#165 | Add BLOOM-LM dataset (SEACrowd#294)

* Init add BLOOM-LM dataset

* Adjusting changes based on review

* fix typing on _generate_examples

* update import based on formatter suggestion

* Closes SEACrowd#349 | Create dataset loader for QASiNa (SEACrowd#418)

* [add] dataloader for qasina

* [refactor] renamed dataset class

* [add]  added contex_title to qa_seacrowd schema

* [refactor, add]  changed QA type, added "answer_start", "contx_length" information to meta

* [refactor]  bug fixes

---------

Co-authored-by: Amir Djanibekov <[email protected]>

* Closes SEACrowd#263 | Implement dataloader for VIVOS (SEACrowd#398)

* Implement dataloader for

* Implement dataloader for VIVOS

* Add missing __init__.py file

* Change _LANGUAGES into list

---------

Co-authored-by: Railey Montalan <[email protected]>

* Closes SEACrowd#190 | Create dataset loader for TydiQA  (SEACrowd#251)

* add tydiqa dataloader

* Update tydiqa.py

* add example helper and update config

* Update tydiqa.py

* Update Configs and _info

* Update features in _info()

* Update tydiqa.py

This update covers the requested changes from @jen-santoso and @jamesjaya; please advise if any further changes are needed. Thanks.

* add tydiqa_id subset

* Update tydiqa.py

Reformat long lines in the code and add IndoNLG in citation

* remove tydiqa_id

* Closes SEACrowd#338 | Created DataLoader for IndonesianNMT (SEACrowd#367)

* Implementing Dataloader for indonesiannmt issue SEACrowd#338

* Update template.py

* Implementing Dataloader for indonesiannmt issue SEACrowd#338

* removed if __main__ section

* IndonesianNMT reconstructing dataloader

* Implement ssp task, implement suggestions

* format indonesiannmt

---------

Co-authored-by: Holy Lovenia <[email protected]>
Co-authored-by: Jonibek Mansurov <[email protected]>

* Closes SEACrowd#366 | Implement dataloader for Kheng.info Speech (SEACrowd#401)

* Implement dataloader for Kheng.info Speech

* Add init file

* Closes SEACrowd#226 | Vi Pubmed dataloader (SEACrowd#391)

* feat: vi_pubmed dataloader

* fix: homepage

* fix: non unique id error

* refactor: class name

* refactor: remove unused loop

* Create __init__.py

* [refactor]  removed comment

* Update flores200.py

* refactor: remove main function

* Closes SEACrowd#69 | Implement XStoryCloze Dataloader (SEACrowd#137)

* implement xstorycloze dataloader

* add __init__.py

* update

* remove ssp schema; add _LANGUAGES

* remove unnecessary import; pascal case for class name

* Closes SEACrowd#147 | implemented dataloader for gatitos dataset (SEACrowd#415)

* implemented dataloader for gatitos dataset

* added __init__.py to gatitos folder

* Updated gatitos

---------

Co-authored-by: ssfei81 <[email protected]>

* Update CODEOWNERS

* Patch Workflow on Stale Checking (SEACrowd#482)

* Update stale.yml

* Create add-new-comment-on-stale

* Update and rename stale.yml to stale-labeler.yml

* Update add-new-comment-on-stale

* Rename add-new-comment-on-stale to add-new-comment-on-stale.yml

* Sabilmakbar Patch Workflow (SEACrowd#484)

Bugfix on SEACrowd#482.

* Update add-new-comment-on-stale.yml

add workflow trigger criteria on PR message as well

* Update add-new-comment-on-stale.yml

* Update add-new-comment-on-stale.yml

fix yaml indent

* Update add-new-comment-on-stale.yml

* Closes SEACrowd#340 | Implement Dataloader for emotes_3k (SEACrowd#397)

* Implement Dataloader for emotes_3k

* Implement Dataloader for emotes_3k

* Tasks updated from sentiment analysis to morality classification

* Implement Change Request

* formatting emotes_3k

---------

Co-authored-by: Jonibek Mansurov <[email protected]>

* refactor: remove main function

Co-authored-by: Lj Miranda <[email protected]>

* Update constants.py

* Closes SEACrowd#311 | Add dataloader for indonesian_madurese_bible_translation (SEACrowd#337)

* add dataloader for indonesian_madurese_bible_translation

* update the license of indonesian_madurese_bible_translation

* Update indonesian_madurese_bible_translation.py

* modify based on comments from holylovenia

* [indonesian_madurese_bible_translation]

* update based on the reviewer's comments

* Remove `CONTRIBUTING.md`, update PR Message Template, and add bash to initialize dataset (SEACrowd#468)

* add bash to initialize dataset

* delete CONTRIBUTING.md since it's duplicated with DATALOADER.md

* update the docs slightly on suggesting new dataloader contributors to use template

* fix few wordings

* Add info on required vars '_LOCAL'

* Add checklist on __init__.py

* fix wording on 2nd checklist regarding 'my_dataset' that should've been a var instead of static val

* fix wordings on first section of PR msg

* add newline separator for better readability

* add info on some to-dos

* refactor: citation

* Closes SEACrowd#83 | Implement Dataloader for GlobalWoZ (SEACrowd#261)

* refactor by pre-commit

* reformatted by pre-commit

* refactor code for globalwoz

* Create dataset loader for IndoQA SEACrowd#430 (SEACrowd#431)

* Add CODE_SWITCHING_IDENTIFICATION task (SEACrowd#488)

* Closes SEACrowd#396 | Implement dataloader for CrossSum (SEACrowd#419)

* Implement dataloader

* Change to 3-letter ISO codes

* Change task to CROSS_LINGUAL_SUMMARIZATION

* Closes SEACrowd#92 | Create Jail break data loader (SEACrowd#390)

* feat: jailbreak dataloader

* fix: minor errors

* refactor: styling

* refactor: remove main entry

* refactor: class name

* refactor: remove unused loop

* fix: separate text column into different subsets

* Create __init__.py

* Implement CommonVoice 12.0 dataloader (SEACrowd#452)

* Closes SEACrowd#202 | Implement dataloader for WIT (SEACrowd#374)

* Implement dataloader for WIT

* Remove unnecessary commits

* Add to description

---------

Co-authored-by: Railey Montalan <[email protected]>

* Split into language subsets

* Split into language subsets

* Update seacrowd/sea_datasets/thai_depression/thai_depression.py

Co-authored-by: Lj Miranda <[email protected]>

* fix: change license to unknown

* fix: minor errors

* Closes SEACrowd#80 | Implement MSVD-Indonesian Dataloader (SEACrowd#135)

* implement id_msvd dataloader

* change logic for seacrowd schema (text first, then video); quality of life change to video schema

* revert seacrowd video key from "text" to "texts"

* change source logic to match original data implementation

* run make check_file

* Closes SEACrowd#34  |  Create dataset loader for MKQA (SEACrowd#177)

* Create dataset loader for MKQA SEACrowd#34

* Refactor class variables _LANGUAGES to global for MKQA SEACrowd#34

* Filter supported languages (SEA only) of seacrowd_qa schema for MKQA SEACrowd#34

* Filter supported languages (SEA only) of source schema for MKQA SEACrowd#34

* Filter supported languages (SEA only) for MKQA SEACrowd#34 (a leftover)

* Change language code from macrolanguage, msa to zlm, for MKQA SEACrowd#34

* Change to a more appropriate language code for the Malaysian variant used in MKQA SEACrowd#34

* Changed the value of field 'type' of QA schema to be more general, and moved the more specific value to 'meta' field for MKQA SEACrowd#34

* Replace None value to empty array in 'answer_aliases' sub-field for consistency in MKQA SEACrowd#34

* Closes SEACrowd#193 | Create dataset loader for MALINDO Morph (SEACrowd#332)

* Implement dataloader for MALINDO morph

* Specify file encoding and remove newlines when loading data

* Add blank __init__.py

* Fix typos in docstring

* Fix typos

* Update seacrowd/sea_datasets/malindo_morph/malindo_morph.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Update seacrowd/sea_datasets/malindo_morph/malindo_morph.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Update seacrowd/sea_datasets/malindo_morph/malindo_morph.py

---------

Co-authored-by: Jennifer Santoso <[email protected]>

* fix: subsets

* Closes SEACrowd#314 | Add dataloader for Indonesia chinese mt robust eval (SEACrowd#388)

* add dataloader for indonesian_madurese_bible_translation

* update dataloader for indonesia_chinese_mtrobusteval

* Delete seacrowd/sea_datasets/indonesian_madurese_bible_translation/indonesian_madurese_bible_translation.py

* Update indonesia_chinese_mtrobusteval.py

* update code based on the reviewer comments

* add __init__.py

* Update seacrowd/sea_datasets/indonesia_chinese_mtrobusteval/indonesia_chinese_mtrobusteval.py

* Update seacrowd/sea_datasets/indonesia_chinese_mtrobusteval/indonesia_chinese_mtrobusteval.py

---------

Co-authored-by: Jennifer Santoso <[email protected]>

* refactor: feature naming

Co-authored-by: Salsabil Maulana Akbar <[email protected]>

* fix: homepage url

* Closes SEACrowd#211 | Implement dataloader for SEAHORSE (SEACrowd#407)

* implement seahorse dataloader

* update

* update

* incorporate the latest comments though tensorflow still needed for tfds

* update

* update

* fix: lowercase feature name

* refactor: subset name

* fix: limit the sentence paths to the relevant languages

* refactor: remove possible error

* Change default split to TEST

* Closes SEACrowd#447 |  Create dataset loader for Aya Dataset (SEACrowd#457)

* Implementing data loader for Aya Dataset

* Fixing license serialization issue

* Update based on formatter for aya_dataset.py

* update xlsum to extend more langs

* update based on formatter

* Closes SEACrowd#360 | Implement dataloader for khpos (SEACrowd#376)

* Implement dataloader for khpos

* Remove unneeded comment

* Implemented Test and Validation loading

* Streamlining code

* Closes SEACrowd#116 | Add pho_ner_covid Dataloader (SEACrowd#461)

* feat: pho_ner_covid dataloader

* refactor: classname

Co-authored-by: Lj Miranda <[email protected]>

* fix: remove main function

Co-authored-by: Lj Miranda <[email protected]>

* refactor: remove inplace uses for dataframe

* refactor: remove duplicate statement

---------

Co-authored-by: Lj Miranda <[email protected]>

* refactor: remove trailing spaces

Co-authored-by: Salsabil Maulana Akbar <[email protected]>

* refactor: url format

* edit 'texts' to 'text' key (SEACrowd#499)

* Closes SEACrowd#217 | Implement dataloader for `wili_2018` (SEACrowd#381)

* Implement dataloader for wili_2018

* update

* Closes SEACrowd#104 | Add lazada_review_filipino (SEACrowd#409)

* Add lazada_review_filipino Closes SEACrowd#104

* Update lazada_review_filipino.py

Update config name

* Update lazada_review_filipino.py

fix typo

* Update lazada_review_filipino.py

bug fix - ValueError: Class label 5 greater than configured num_classes 5

* Update seacrowd/sea_datasets/lazada_review_filipino/lazada_review_filipino.py

---------

Co-authored-by: Samuel Cahyawijaya <[email protected]>
Co-authored-by: Lj Miranda <[email protected]>

* Adjust bash script test_example.sh and test_example_source_only.sh (SEACrowd#171)

* update: adjust test_example.sh and test_example_source_only.sh

* fix: minor error message when dataset is empty

* updated kde4 language codes to iso639-3

* fix: citation

* refactor: use base config class

* create dataset loader for myanmar-rakhine parallel (SEACrowd#471)

* add pyreadr==0.5.0 (SEACrowd#504)

usage: reads/writes R RData and Rds files into/from pandas data frames

* Closes SEACrowd#97 | Inter-Agency Task Force for the Management of Emerging Infectious Diseases (IATF) COVID-19 Resolutions  (SEACrowd#460)

* Closes SEACrowd#274 | Create OIL data loader (SEACrowd#389)

* initial commit

* refactor: move module

* feat: dataset implementation

* feat: oil dataloader

* refactor: move dataloader file

* refactor: move dataloader file

* fix: non unique id error

* refactor: file formatting

* refactor: remove comments

* fix: invalid config name exception raise

* refactor: audio cache file path

* fix: remove useless loop

* refactor: formatting

* Create __init__.py

* fix: citation

* fix: remove seacrowd schema

* Closes SEACrowd#49 | Updated existing TICO_19 dataloader to support more sea languages (SEACrowd#414)

* Updated existing TICO_19 dataloader to support more sea languages

* added sea languages to _LANGUAGES

---------

Co-authored-by: ssfei81 <[email protected]>

* Closes SEACrowd#443 | Add dataloader for ASR-STIDUSC (SEACrowd#493)

* Add dataloader for ASR-STIDUSC

* update task, dataset name, pythonic coding

* add relation extraction task (SEACrowd#502)

* fix: subset and config name

* Update bibtex id

* Closes SEACrowd#356 | Implement dataloader for CodeSwitch-Reddit (SEACrowd#451)

* Add CODE_SWITCHING_IDENTIFICATION task

* Implement dataloader

* Update codeswitch_reddit.py

fix column naming in source (using lowercase instead of capitalized)

* Closes SEACrowd#222 | Create dataset loader for CreoleRC (SEACrowd#469)

* Create dataset loader for CreoleRC

* remove changes to constants.py

* remove document_id, add normalized, add sanity check on offset value

* Update REVIEWING.md

Clarify wording in Dataloader Reviewing Doc

* Closes SEACrowd#341  | Create dataset loader for myParaphrase (SEACrowd#436)

* [add]  dataloader for my_paraphrase

* [refactor]  removed redundant breakpoint; put right default schema function

* [refactor]  changed schema for dataset

* [refactor]  split data into 3 categories(paraphrase, non_paraphrase, all)

* [refactor]  default config name is changed

* [refactor]  source configs for _paraphrase,_non_paraphrase,_all; altered schema naming

* [refactor]  cleaner conditioning, defined else clause

* Closes SEACrowd#269 | Create dataset loader for ViVQA SEACrowd#269 (SEACrowd#318)

* add vivqa dataloader

* Update vivqa.py

* update vivqa dataloader config

* Update vivqa.py

* add vivqa dataloader

* Update vivqa.py

* update vivqa dataloader config

* Update vivqa.py

* Update vivqa.py

* update

* Update vivqa.py

* Update vivqa.py

* Delete .idea/vcs.xml

* Delete .idea/seacrowd-datahub.iml

* Delete .idea/inspectionProfiles/profiles_settings.xml

* Delete .idea/inspectionProfiles/Project_Default.xml

* Update vivqa.py

* Revert "Merge branch 'vivqa' of github.com:gyyz/seacrowd-datahub into vivqa"

This reverts commit a96fa80, reversing
changes made to 23700ca.

* Delete .idea/vcs.xml

* Delete .idea/seacrowd-datahub.iml

* Delete .idea/inspectionProfiles/profiles_settings.xml

* Delete .idea/inspectionProfiles/Project_Default.xml

* Revert "Merge branch 'vivqa' of github.com:gyyz/seacrowd-datahub into vivqa"

This reverts commit a96fa80, reversing
changes made to 23700ca.

* Revert "Revert "Merge branch 'vivqa' of github.com:gyyz/seacrowd-datahub into vivqa""

This reverts commit 5f1a3d6.

* fixing trailing space and run Makefile

* Closes SEACrowd#445 | Create dataset loader for malaysia-tweets-with-sentiment-labels (SEACrowd#450)

* Fix typo syntax dictionary at constants.py

* Add dataloader for malaysia_tweets

* Completed requested changes

* add dataloader for ASR-Sindodusc (SEACrowd#491)

* Closes SEACrowd#475 | Add dataloader for indonglish-dataset (SEACrowd#490)

* create dataloader for indonglish

* make subset_id unique, use ClassLabel for label

* Closes SEACrowd#215 | Implement dataloader for `thai_gpteacher` (SEACrowd#382)

* Implement dataloader for thai_gpteacher

* update

* update

* Closes SEACrowd#275 | Create dataset loader for UIT-ViCoV19QA SEACrowd#275 (SEACrowd#463)

* add SeaCrowd dataloader for uit_vicov19qa

* Merge subsets to one

* remove unused imported package

* Closes SEACrowd#309 | Create dataset loader for Vietnamese Hate Speech Detection (UIT-ViHSD) #309Uit vihsd (SEACrowd#501)

* create dataloader for uit_vihsd

* Update uit_vihsd.py

* Add some info for the labels

* Update example for Seacrowd schema

* Closes SEACrowd#441 | Add dataloader for ASR-SMALDUSC (SEACrowd#492)

* Add dataloader for ASR-SMALDUSC

* add prompt field

* Closes SEACrowd#307 | Implement dataloader for ViSoBERT  (SEACrowd#466)

* Update constants.py

* Implement dataloader for ViSoBERT

* Fix conflicts with constants.py

* Combine source and seacrowd_ssp schemas

---------

Co-authored-by: Holy Lovenia <[email protected]>
Co-authored-by: Railey Montalan <[email protected]>

* add dataloader for wikitext_tl_39 (SEACrowd#486)

* Closes SEACrowd#393 | Create dataset loader for WEATHub (SEACrowd#496)

* [Feature] Add Weathub DataLoader

* [Fix] Add filter for SEA languages only + add constants + run formatter

* [Chore] Fix data loader naming

* [Fix] Implement requested changes from review

* Closes SEACrowd#188 | Implement dataloader for Sea-bench (SEACrowd#375)

* Implement dataloader for WIT

* Implement dataloader for sea_bench

* Remove WIT

* Remove logger and unnecessary variables

* Add instruction tuning and remove QA and summarization tasks

* Add __init__.py file

* Remove machine translation task

* Fix nitpicks

---------

Co-authored-by: Railey Montalan <[email protected]>

* Closes SEACrowd#115 | Create dataset loader for PhoMT dataset (SEACrowd#489)

* add dataloader for PhoMT dataset

* Update seacrowd/sea_datasets/phomt/phomt.py

Co-authored-by: Elyanah Aco <[email protected]>

* Update seacrowd/sea_datasets/phomt/phomt.py

Co-authored-by: Elyanah Aco <[email protected]>

* Update seacrowd/sea_datasets/phomt/phomt.py

Co-authored-by: Elyanah Aco <[email protected]>

* Update seacrowd/sea_datasets/phomt/phomt.py

Co-authored-by: Elyanah Aco <[email protected]>

* Update seacrowd/sea_datasets/phomt/phomt.py

Co-authored-by: Elyanah Aco <[email protected]>

* update text1/2 name for PhoMT dataset

* Update phomt.py to replace en&vi to eng&vie

---------

Co-authored-by: Elyanah Aco <[email protected]>

* Closes SEACrowd#310 | Create dataset loader for ViSpamReviews SEACrowd#310 (SEACrowd#454)

* add vispamreviews dataloader

* update vispamreviews

* update schema

* Closes SEACrowd#530 | Add/Update Dataloader Tatabahasa (SEACrowd#540)

* feat: dataloader QA commonsense-reasoning

* nitpick

* Closes SEACrowd#267  | Add dataloader for struct_amb_ind (SEACrowd#506)

* Implement dataloader for struct_amb_ind

* Update seacrowd/sea_datasets/struct_amb_ind/struct_amb_ind.py

Co-authored-by: Jonibek Mansurov <[email protected]>

---------

Co-authored-by: Jonibek Mansurov <[email protected]>

* Closes SEACrowd#347 | Create dataset loader for IndoWiki (SEACrowd#485)

* create dataset loader for IndoWiki

* remove seacrowd schema

* Closes SEACrowd#354 | Implement dataloader for ETOS (SEACrowd#416)

* Implement dataloader for ETOS

* Implement dataloader for ETOS

* Rename dataset class name to ETOSDataset

* Remove  schema due to insufficient annotations

* Change ETOS into a POS tagging dataset

* Add missing __init__.py file

* Fix nitpicks

* Add DEFAULT_CONFIG_NAME

---------

Co-authored-by: Railey Montalan <[email protected]>

* update common_parser for UD JV_CSUI (SEACrowd#558)

* Create dataset loader for UD Javanese-CSUI SEACrowd#427 (SEACrowd#432)

* Closes SEACrowd#446 | Add/Update Dataloader voxlingua (SEACrowd#543)

* add init voxlingua

* Update seacrowd/sea_datasets/voxlingua/voxlingua.py

Co-authored-by: Lj Miranda <[email protected]>

---------

Co-authored-by: Lj Miranda <[email protected]>

* Closes SEACrowd#428 | Create dataset loader for Indonesia BioNER (SEACrowd#434)

* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Update cc3m_35l.py

Changed "_LANGS" to "_LANGUAGES"

* init commit

* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Closes SEACrowd#344 | Create dataset loader for VLSP2016-SA (SEACrowd#500)

* [add]  dataloader for vlsp2016_sa[local]

* [refactor]  changed schema name

---------

Co-authored-by: Amir Djanibekov <[email protected]>

* Fix the private datasheet link in POINTS.md (SEACrowd#568)

* Closes SEACrowd#192 | Create dataset loader for MALINDO_parallel (SEACrowd#385)

* add malindo_parallel.py

* cleanup

* Class name fix

Co-authored-by: Lj Miranda <[email protected]>

* Remove sample licenses

Co-authored-by: Lj Miranda <[email protected]>

* fix dataset formatting error, use original dataset id

---------

Co-authored-by: Lj Miranda <[email protected]>

* Closes SEACrowd#114 | Implement dataloader for VnDT (SEACrowd#467)

* Implement dataloader for VnDT

* Add utility to impute missing sent_id and text fields from CoNLL files

* Fix imputed outputs

---------

Co-authored-by: Railey Montalan <[email protected]>

* add ocr task (SEACrowd#555)

* PR for update subset composition of TydiQA | Close SEACrowd#465 (SEACrowd#503)

* update subset composition

* Update Subset Composition

* Update Subset Composition

* update subset name

indonesian --> ind
thai --> tha

* Update nusaparagraph_emot.py

* Update nusaparagraph_emot.py

* Update configs.py

* Closes SEACrowd#346 | Implement dataloader for MUSE (Multilingual Unsupervised and Supervised Embeddings) (SEACrowd#406)

* Implement dataloader for MUSE (Multilingual Unsupervised and Supervised Embeddings)

* Create __init__.py for MUSE SEACrowd#346

* Remove unused comment lines for MUSE SEACrowd#346

* changed all 2 letters language codes to 3 letters

---------

Co-authored-by: ssfei81 <[email protected]>
Co-authored-by: Frederikus Hudi <[email protected]>

* Closes SEACrowd#12 | Add/Update Dataloader BalitaNLP (SEACrowd#550)

* Implement dataloader for balita_nlp

* Remove articles with missing images from imtext schema

* Add details to metadata

* Adding New Citation for Bhinneka korpus (SEACrowd#599)

* Add bhinnek_korpus dataset loader

* Updating the suggested changes

* Resolved review suggestions

* adding new citation

---------

Co-authored-by: Holy Lovenia <[email protected]>

* Closes SEACrowd#270 | Create dataset loader for OpenViVQA SEACrowd#270 (SEACrowd#464)

* add sample

* init submit for openvivqa dataloader

* Update openvivqa.py

* Update openvivqa.py

* update dict format

* Closes SEACrowd#516 | Add/Update Dataloader id_newspaper_2018 (SEACrowd#551)

* Implement dataloader for id_newspaper_2018

* Specify JSON encoding

* Closes SEACrowd#429 | Implement dataloader for filipino_hatespeech_election (SEACrowd#487)

* Add dataloader for filipino_hatespeech_election

* update task

* update

* Closes SEACrowd#52 | Add cosem dataloader (SEACrowd#473)

* feat: cosem dataloader

* fix: citation

* refactor: dataloader class name

* fix: file parsing logic

* fix: id format

* fix: tab separator bug in text

* fix: check for unique id

* Closes SEACrowd#424 | Add Dataloader Bactrian-X

* Import `schemas` beforehand on `templates/template.py` (SEACrowd#644)

* add import statement for schemas

* add import statement for schemas

* Closes SEACrowd#313 | Add dataloader for Saltik (SEACrowd#387)

* add dataloader for indonesian_madurese_bible_translation

* add dataloader for saltik

* Delete seacrowd/sea_datasets/indonesian_madurese_bible_translation/indonesian_madurese_bible_translation.py

* update based on the reviewer comment

* update based on the reviewer comment

* Remove the modified constants.py from PR

---------

Co-authored-by: Holy Lovenia <[email protected]>

* Add `.upper` method for `--schema` parameter (SEACrowd#648)

* add upper method for --schema

* revert code-style

* Closes SEACrowd#438 | Add dataloader for ASR-INDOCSC (SEACrowd#509)

* add dataloader for asr_indocsc

* Update asr_indocsc.py for data downloading instructions

---------

Co-authored-by: Salsabil Maulana Akbar <[email protected]>
Co-authored-by: Elyanah Aco <[email protected]>
Co-authored-by: Yuze GAO <[email protected]>
Co-authored-by: Lj Miranda <[email protected]>
Co-authored-by: XU, Yan (Yana) <[email protected]>
Co-authored-by: Haochen Li <[email protected]>
Co-authored-by: Jennifer Santoso <[email protected]>
Co-authored-by: Holy Lovenia <[email protected]>
Co-authored-by: Lucky Susanto <[email protected]>
Co-authored-by: Samuel Cahyawijaya <[email protected]>
Co-authored-by: Muhammad Dehan Al Kautsar <[email protected]>
Co-authored-by: Lj Miranda <[email protected]>
Co-authored-by: Lucky Susanto <[email protected]>
Co-authored-by: Maria Khelli <[email protected]>
Co-authored-by: Ishan Jindal <[email protected]>
Co-authored-by: ssfei81 <[email protected]>
Co-authored-by: IvanHalimP <[email protected]>
Co-authored-by: Enliven26 <[email protected]>
Co-authored-by: Dan John Velasco <[email protected]>
Co-authored-by: Chenxi <[email protected]>
Co-authored-by: Bhavish Pahwa <[email protected]>
Co-authored-by: FawwazMayda <[email protected]>
Co-authored-by: fawwaz.mayda <[email protected]>
Co-authored-by: Ilham F Putra <[email protected]>
Co-authored-by: rafif-kewmann <[email protected]>
Co-authored-by: mrafifrbbn <[email protected]>
Co-authored-by: Yong Zheng-Xin <[email protected]>
Co-authored-by: Amir Djanibekov <[email protected]>
Co-authored-by: Amir Djanibekov <[email protected]>
Co-authored-by: joan <[email protected]>
Co-authored-by: joanitolopo <[email protected]>
Co-authored-by: Railey Montalan <[email protected]>
Co-authored-by: Railey Montalan <[email protected]>
Co-authored-by: ssun32 <[email protected]>
Co-authored-by: Tyson <[email protected]>
Co-authored-by: Ilham Firdausi Putra <[email protected]>
Co-authored-by: Johanes Lee <[email protected]>
Co-authored-by: Akhdan Fadhilah <[email protected]>
Co-authored-by: Frederikus Hudi <[email protected]>
Co-authored-by: Börje Karlsson <[email protected]>
Co-authored-by: Muhammad Satrio Wicaksono <[email protected]>
Co-authored-by: Wenyu Zhang <[email protected]>
Co-authored-by: R. Damanhuri <[email protected]>
Co-authored-by: Patrick Amadeus Irawan <[email protected]>
Co-authored-by: Reza Qorib <[email protected]>
Co-authored-by: Bryan Wilie <[email protected]>
Co-authored-by: Muhammad Ravi Shulthan Habibi <[email protected]>


48 people authored Apr 18, 2024

1 parent a3a6a84 commit 34e1749

.github/CODEOWNERS

            
                  
@@ -1,3 +1,3 @@
     # These are the current maintainers/admin of the seacrowd-datahub repo
-    * @holylovenia @samuelcahyawijaya @sabilmakbar @jamesjaya @yongzx @gentaiscool @ljvmiranda921 @RosenZhang @fajri91
+    * @holylovenia @samuelcahyawijaya @sabilmakbar @jamesjaya @yongzx @gentaiscool @ljvmiranda921 @jen-santoso @danjohnvelasco @MJonibek @tellarin

.github/PULL_REQUEST_TEMPLATE.md

            
                  
@@ -1,11 +1,17 @@
-    Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.
+    Please name your PR title and the first line of PR message after the issue it will close. You can use the following examples:
+    **Title**: Closes #{ISSUE_NUMBER} | Add/Update Dataloader {DATALOADER_NAME}
+    **First line PR Message**: Closes #{ISSUE_NUMBER}
+    where you replace the {ISSUE_NUMBER} with the one corresponding to your dataset.
     ### Checkbox
     - [ ] Confirm that this PR is linked to the dataset issue.
-    - [ ] Create the dataloader script `seacrowd/sea_datasets/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
-    - [ ] Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
+    - [ ] Create the dataloader script `seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py` (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its `__init__.py` within `{my_dataset}` folder.
+    - [ ] Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_LOCAL`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
     - [ ] Implement `_info()`, `_split_generators()` and `_generate_examples()` in dataloader script.
     - [ ] Make sure that the `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
     - [ ] Confirm dataloader script works with `datasets.load_dataset` function.
-    - [ ] Confirm that your dataloader script passes the test suite run with `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py`.
+    - [ ] Confirm that your dataloader script passes the test suite run with `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py` or `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}`.
     - [ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
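
The naming rules in the updated PR template can be sketched as a few small helpers. This is only an illustration, not part of the repository: the function names `pr_title`, `expected_files`, and `unit_test_command` are hypothetical, and allowing digits in folder names is an assumption based on loaders such as `id_newspaper_2018` in the commit message above.

```python
import re
from typing import List, Optional

# Assumption: folder names may contain lowercase letters, digits, and underscores.
_NAME_RE = re.compile(r"[a-z0-9_]+")


def pr_title(issue_number: int, dataloader_name: str) -> str:
    """Format a PR title in the 'Closes #N | Add/Update Dataloader X' pattern."""
    return f"Closes #{issue_number} | Add/Update Dataloader {dataloader_name}"


def expected_files(dataset_name: str) -> List[str]:
    """Return the two files the checklist asks for: the loader script and __init__.py."""
    if not _NAME_RE.fullmatch(dataset_name):
        raise ValueError(f"use only lowercase and underscore: {dataset_name!r}")
    folder = f"seacrowd/sea_datasets/{dataset_name}"
    return [f"{folder}/{dataset_name}.py", f"{folder}/__init__.py"]


def unit_test_command(dataset_name: str, subset_id: Optional[str] = None) -> str:
    """Build the test command from the checklist, optionally with --subset_id."""
    script = f"seacrowd/sea_datasets/{dataset_name}/{dataset_name}.py"
    command = f"python -m tests.test_seacrowd {script}"
    if subset_id:
        command += f" --subset_id {subset_id}"
    return command


if __name__ == "__main__":
    print(pr_title(516, "id_newspaper_2018"))
    print(expected_files("xl_sum"))
    print(unit_test_command("xl_sum"))
```

A name such as `XlSum` is rejected by `expected_files`, which mirrors the "lowercase and underscore" wording of the checklist.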

.github/workflows/add-new-comment-on-stale.yml

@@ -0,0 +1,43 @@
+    # This workflow is a continuation of "Mark stale issues and pull requests" workflow, on adding customized comment.
+    # You can adjust the behavior by modifying this file.
+    # For more information, see:
+    # https://github.com/peter-evans/create-or-update-comment
+    name: Adding reminder comment on staled issues & PRs
+    on:
+      issues:
+        types:
+          - labeled
+      # read these to see why it uses 'pull_request_target' instead of 'pull_request':
+      # 1. https://securitylab.github.com/research/github-actions-preventing-pwn-requests/
+      # 2. https://github.com/peter-evans/create-or-update-comment?tab=readme-ov-file#action-inputs (note section)
+      pull_request_target:
+        types:
+          - labeled
+    jobs:
+      add-comment-on-staled-issue:
+        if: github.event.label.name == 'staled-issue'
+        runs-on: ubuntu-latest
+        permissions:
+          issues: write
+        steps:
+          - name: Remind assignee on staled Issue
+            uses: peter-evans/create-or-update-comment@v2
+            with:
+              issue-number: ${{github.event.issue.number}}
+              body: "Hi @${{github.event.issue.assignee.login}}, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help."
+      add-comment-on-staled-pr:
+        if: github.event.label.name == 'need-fu-pr'
+        runs-on: ubuntu-latest
+        permissions:
+          pull-requests: write
+        steps:
+          - name: Remind assignee and author on staled PR
+            uses: peter-evans/create-or-update-comment@v2
+            with:
+              issue-number: ${{github.event.pull_request.number}}
+              body: "Hi @${{join(github.event.pull_request.assignees.*.login, ', @')}} & @${{github.event.pull_request.user.login}}, may I know if you are still working on this PR?"
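
The two comment bodies in this workflow are built from GitHub expression syntax. A rough Python model of that string composition may make the result easier to picture. The function names are hypothetical, and the model assumes that `join(array, ', @')` behaves like Python's `str.join` with the given separator.

```python
from typing import List, Optional

ISSUE_LABEL = "staled-issue"  # label that triggers the issue reminder job
PR_LABEL = "need-fu-pr"       # label that triggers the PR reminder job


def issue_comment(assignee: str) -> str:
    """Model of the issue job body, which mentions a single assignee login."""
    return (
        f"Hi @{assignee}, may I know if you are still working on this issue? "
        "Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help."
    )


def pr_comment(assignees: List[str], author: str) -> str:
    """Model of the PR job body: '@' + join(assignees, ', @') plus the PR author."""
    mentions = "@" + ", @".join(assignees)
    return f"Hi {mentions} & @{author}, may I know if you are still working on this PR?"


def comment_for_label(label: str, assignees: List[str], author: str) -> Optional[str]:
    """Pick the comment the way the two `if:` conditions select a job."""
    if label == ISSUE_LABEL:
        return issue_comment(assignees[0] if assignees else "")
    if label == PR_LABEL:
        return pr_comment(assignees, author)
    return None  # any other label triggers neither job


if __name__ == "__main__":
    print(comment_for_label("need-fu-pr", ["alice", "bob"], "carol"))
```

In this model an empty assignee list leaves a bare `@` in the PR message. The labelling workflow sets `include-only-assigned: true`, so items it labels should normally have at least one assignee.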

.github/workflows/stale.yml → .github/workflows/stale-labeler.yml

            
                  
@@ -10,8 +10,8 @@ on:
       - cron: '20 1 * * *'
     jobs:
-      stale:
+      stale_detection:
+        name: Detect Stale Issues/PR
         runs-on: ubuntu-latest
         permissions:
           issues: write
@@ -20,11 +20,15 @@ jobs:
         steps:
         - uses: actions/stale@v8
           with:
             repo-token: ${{ secrets.GITHUB_TOKEN }}
-            stale-issue-message: 'Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.'
-            stale-issue-label: 'staled-issue'
+            # only labels the stale, the comment addition will be handled by another workflow
+            stale-issue-message: ""
+            stale-pr-message: ""
+            stale-issue-label: "staled-issue"
+            stale-pr-label: "need-fu-pr"
             days-before-stale: 14
             days-before-close: -1
             include-only-assigned: true
             exempt-issue-labels: 'in-progress,pr-ready'
-            operations-per-run: 100
+            operations-per-run: 200
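
The settings in this workflow can be read as a small decision rule. The sketch below is a simplified model of that rule, not the implementation of `actions/stale`. The function name `stale_label`, the `kind` argument, and the exact boundary comparison are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional

DAYS_BEFORE_STALE = 14
DAYS_BEFORE_CLOSE = -1  # negative value: stale items are labelled but never closed
EXEMPT_ISSUE_LABELS = {"in-progress", "pr-ready"}


def stale_label(
    kind: str,
    last_updated: datetime,
    now: datetime,
    assignees: Iterable[str],
    labels: Iterable[str],
) -> Optional[str]:
    """Return the label to apply to an 'issue' or 'pr', or None if it is not stale."""
    if not list(assignees):
        return None  # include-only-assigned: true
    if kind == "issue" and EXEMPT_ISSUE_LABELS & set(labels):
        return None  # exempt-issue-labels applies to issues only
    if now - last_updated < timedelta(days=DAYS_BEFORE_STALE):
        return None  # still within the 14-day activity window
    return "staled-issue" if kind == "issue" else "need-fu-pr"


if __name__ == "__main__":
    now = datetime(2024, 4, 18, 1, 20, tzinfo=timezone.utc)
    old = now - timedelta(days=15)
    print(stale_label("issue", old, now, ["alice"], []))
    print(stale_label("pr", old, now, ["alice"], []))
```

Because the stale messages are now empty strings, this workflow only applies the label; the reminder text comes from the separate comment workflow that reacts to the `labeled` event.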

CONTRIBUTING.md

This file was deleted.

DATALOADER.md

            
                  
@@ -100,20 +100,21 @@ Make sure your `pip` package points to your environment's source.
     ### 3. Implement your dataloader
-    Make a new directory within the `SEACrowd/seacrowd-datahub/sea_datasets` directory:
+    Use this bash script to initialize your new dataloader folder along with template of your dataloader script under `SEACrowd/seacrowd-datahub/sea_datasets` directory using this:
-        mkdir seacrowd-datahub/sea_datasets/<dataset_name>
+        sh templates/initiate_seacrowd_dataloader.sh <YOUR_DATALOADER_NAME>
+    The value of `<YOUR_DATALODER_NAME>` can be checked on the issue ticket that you were assigned to.
-    Please use lowercase letters and underscores when choosing a `<dataset_name>`.
+    i.e: for this [issue ticket](https://github.com/SEACrowd/seacrowd-datahub/issues/32), the dataloader name indicates `Dataloader name: xl_sum/xl_sum.py`, hence the value of `<YOUR_DATALOADER_NAME>` is `xl_sum`.
+    Please use PascalCase when choosing a `<dataset_name>`.
     To implement your dataset, there are three key methods that are important:
       * `_info`: Specifies the schema of the expected dataloader
       * `_split_generators`: Downloads and extracts data for each split (e.g. train/val/test) or associate local data with each split.
       * `_generate_examples`: Create examples from data that conform to each schema defined in `_info`.
-    To start, copy [templates/template.py](templates/template.py) to your `seacrowd/sea_datasets/<dataset_name>` directory with the name `<dataset_name>.py`. Within this file, fill out all the TODOs.
-        cp templates/template.py seacrowd/sea_datasets/<dataset_name>/<dataset_name>.py
+    After the bash above has been executed, you'll have your `seacrowd/sea_datasets/<dataset_name>` directory existed with the name `<dataset_name>.py`. Within this file, fill out all the TODOs based on the template.
     For the `_info_` function, you will need to define `features` for your
     `DatasetInfo` object. For the `bigbio` config, choose the right schema from our list of examples. You can find a description of these in the [Task Schemas Document](task_schemas.md). You can find the actual schemas in the [schemas directory](seacrowd/utils/schemas).

    @@ -133,7 +134,7 @@ To help you implement a dataset, you can see the implementation of [other datase
  
    #### Running & Debugging:

    You can run your data loader script during development by appending the following

-    statement to your code ([templates/template.py](templates/template.py) already includes this):
+    statement to your code (if you have your dataloader folder initialized using previous bash script, it already includes this, else you may add these by yourself):

    ```python

    if __name__ == "__main__":

    @@ -157,7 +158,7 @@ from datasets import load_dataset
  
    data = load_dataset("seacrowd/sea_datasets/<dataset_name>/<dataset_name>.py", name="<dataset_name>_seacrowd_<schema>")

    ```

-    Run these commands from the top level of the `nusa-crowd` repo (i.e. the same directory that contains the `requirements.txt` file).
+    Run these commands from the top level of the `seacrowd/seacrowd-datahub` repo (i.e. the same directory that contains the `requirements.txt` file).

    Once this is done, please also check if your dataloader satisfies our unit tests as follows by using this command in the terminal:

    @@ -195,6 +196,7 @@ Then, run the following commands to incorporate any new changes in the master br
  
    Or you can install the pre-commit hooks to automatically pre-check before commit by:

        pre-commit install

    **Run these commands in your custom branch**.

    Push these changes to **your fork** with the following command:
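
The three key methods named in the DATALOADER.md hunks above (`_info`, `_split_generators`, `_generate_examples`) can be pictured with a plain-Python stand-in. This is a sketch under stated assumptions, not the project template: a real loader subclasses `datasets.GeneratorBasedBuilder` and returns `datasets` objects, while the class name, the in-memory rows, the `_source` suffix, and the `text` schema name used here are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple

_DATASETNAME = "my_dataset"  # hypothetical dataset name


class MyDatasetSketch:
    """Plain-Python stand-in that mimics the shape of a dataloader class."""

    def __init__(self, config_name: str) -> None:
        self.config_name = config_name
        self.is_source = config_name.endswith("_source")

    def _info(self) -> Dict[str, str]:
        # Declares the schema; a real loader returns datasets.DatasetInfo(features=...).
        if self.is_source:
            return {"sentence": "string", "tag": "string"}
        return {"id": "string", "text": "string", "label": "string"}

    def _split_generators(self) -> List[Tuple[str, List[str]]]:
        # A real loader downloads or locates files here; this sketch uses in-memory rows.
        return [("train", ["good\tpos", "bad\tneg"]), ("test", ["fine\tpos"])]

    def _generate_examples(self, rows: List[str]) -> Iterator[Tuple[int, Dict[str, str]]]:
        # Yields (key, example) pairs whose fields match the schema from _info.
        for idx, row in enumerate(rows):
            sentence, tag = row.split("\t")
            if self.is_source:
                yield idx, {"sentence": sentence, "tag": tag}
            else:
                yield idx, {"id": str(idx), "text": sentence, "label": tag}


if __name__ == "__main__":
    for name in (f"{_DATASETNAME}_source", f"{_DATASETNAME}_seacrowd_text"):
        builder = MyDatasetSketch(name)
        for split, rows in builder._split_generators():
            for key, example in builder._generate_examples(rows):
                print(name, split, key, example)
```

The point of the sketch is the contract between the methods: every example yielded by `_generate_examples` carries exactly the fields that `_info` declares for the selected config.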

0 comments on commit `34e1749`