Commit
* Fix bug unique ids * Closes SEACrowd#162 | Add Bloom-Captioning Dataloader (SEACrowd#198) * Init dataloader bloom captioning * Fix issue on multiple splits from its source * Change local var * Cater 'test' and 'val' split and fix the '_id' generation * fix: remove abstreact and change _LOCAL and _DESC * fix: _DESC indent * Format openslr.py and add init file * Closes SEACrowd#271 | Implement dataloader for UiT-ViCTSD (SEACrowd#300) * Implement UiT-ViCTSD dataloader * Improve subset IDs, feature types, code to generate examples * Closes SEACrowd#161 | Create dataset loader for ICON 161 (SEACrowd#317) * Create icon.py * Update icon.py * Create __init__.py * Closes SEACrowd#142 | Add Unimorph v4 dataloader (SEACrowd#168) * Add Unimorph dataloader Resolves SEACrowd#142 * Add Dataset to class name * Closes SEACrowd#71 | Create dataset loader for MASSIVE (SEACrowd#196) * add data loader for massive dataset * modify the class name & refactor the function name * change task name from pos tagging to slot filling & make check_file & change subset name to differentiate intent / slot filling tasks * Closes SEACrowd#14 | Create dataset loader for ara-close-lange (SEACrowd#243) * Add ara_close dataloader * Rename class name to AraCloseDataset * Closes SEACrowd#273 | Implement dataloader for UIT_ViON (SEACrowd#282) * Implement dataloader for UIT_ViON * Add __init__.py * Add {lang} in subset id for openslr * Closes SEACrowd#219 | Create dataloader for scb-mt-en-th-2020 (SEACrowd#287) * Create dataloader for scb-mt-en-th-2020 * Rename the data loader files to its snakecase * rename _DATASETNAME to snakecase * Fix languages setting * Update template.py * Add docstring openslr.py * Closes SEACrowd#277 | Implement dataloader for spamid_pair (SEACrowd#281) * Implemente dataloader for spamid_pair * Update seacrowd/sea_datasets/spamid_pair/spamid_pair.py Co-authored-by: Lj Miranda <[email protected]> * Add __init__.py * Update __init__.py --------- Co-authored-by: Lj Miranda <[email 
protected]> * Implemented dataloader for indoler * Add imqa schema and VISUAL_QUESTION_ANSWERING task (SEACrowd#380) * Update template.py Update DownloadManager documentation link in template.py * Closes SEACrowd#54 | Implement Dataloader for IndoSMD (SEACrowd#258) * feat: indosmd dataloader for source * refactor by pre-commit * IndoSMD: reformatted by pre-commit * Update changes on indosmd.py * revised line 223 in indosmd.py * Close#143 | Create dataset loader for Abui WordNet (SEACrowd#285) * add tydiqa dataloader * add id_vaccines_tweet dataloader * add uit-vicc dataloader * add ICON dataloader * add iaap_squad dataloader * add stb_ext dataloader * Revert "add iaap_squad dataloader" This reverts commit 1f8a591. * Revert "add tydiqa dataloader" This reverts commit 6bf4546. * Revert "add id_vaccines_tweet dataloader" This reverts commit 1154087. * Revert "add uit-vicc dataloader" This reverts commit 09661fa. * Revert "add ICON dataloader" This reverts commit 0891e58. * Update stb_ext.py * add abui_wordnet dataloader * Revert "Update stb_ext.py" This reverts commit 59c5301. 
* Delete seacrowd/sea_datasets/stb_ext/stb_ext.py * Delete seacrowd/sea_datasets/stb_ext/__init__.py * Update abui_wordnet.py * Update abui_wordnet.py * Update abui_wordnet.py --------- Co-authored-by: Lj Miranda <[email protected]> Co-authored-by: Samuel Cahyawijaya <[email protected]> * Added Morality Classification Tasks to constants.py (SEACrowd#371) * Closes SEACrowd#216 | Create dataset loader for Mozilla Pontoon (SEACrowd#260) * Begin first draft of Mozilla Pontoon dataloader * Add dataloader for Mozilla Pontoon * Remove enumerate in _generate_examples * Fix issues due to changed format, rename features and config names * Closes SEACrowd#157 | Create dataset loader for M3Exam (SEACrowd#302) * Add m3exam dataloader * Small change in m3exam.py * Fix bug during downloading * Add meta feature to seacrowd schema for m3exam * Rename class M3Exam to M3ExamDataset * Add image question answering * Merge two source schemas into one for m3exam * Fix image path, choices and answer in m3exam * Update CODEOWNERS * Rectify SEACrowd Internal Vars (SEACrowd#386) * Add missing __init__.py * add init * fix bug in phoatis load * add lang variables in dataloaders * Add dataset use ack on source HF repo into description * Closes SEACrowd#204 | Implement dataloader for Melayu_Sabah (SEACrowd#234) * Implement dataloader for Melayu_Sabah * Update name for the dataloader * Add _CITATION * Update seacrowd/sea_datasets/melayu_sabah/melayu_sabah.py * Applu suggestions from review * Moving unnecessary content in dialogue text * Update melayu_sabah.py * Improvement: Workflow Message to Mention Assignee in Staled Issues (SEACrowd#400) * Update stale.yml (SEACrowd#327) * Update stale.yml Test on adding vars on assignee & author of Issues & PR * Update stale.yml * Update stale.yml * Update stale.yml * Update stale.yml * Update stale.yml * Closes SEACrowd#272 | Create dataset loader for SNLI (SEACrowd#290) * [New Feature] Add SNLI dataloader * [Fix] SNLI rev according to PR review * [Chore] 
Add comment for accessibility * Update common_parser.py (SEACrowd#333) * Implement dataloader for UCLA Phonetic Corpus * Implement dataloader for KDE4 * removed redundant builder_config * Update cc3m_35l.py Changed into no parallelization since it was kept being killed by the OS for some reason. * Fix: Workflow Assignee Mention (SEACrowd#410) * Update stale.yml * Fix: wrong quote in message (SEACrowd#411) * Update and fix bug on stale.yml * Closes SEACrowd#17 | Implement dataloader for Philippine Fake News Corpus (SEACrowd#331) * Implement dataloader * Edit dataloader class name * Simplify code * Fix citation typo * Closes SEACrowd#359 | Implement dataloader for LR-Sum (SEACrowd#368) * Implement dataloader * Fix short description * feat: mswc dataloader skeleton * feat: example for seacrowd schema * Closes SEACrowd#265 | Implement dataloader for `myxnli` (SEACrowd#336) * Implement dataloader for myxnli * update myxnli * Closes SEACrowd#112 | Implement Dataloader for Wisesight Thai Corpus (SEACrowd#279) * Add wisesight_thai_sentiment dataset * changes according to review * changes according to review * changes according to review * Add changes according to review * refactor: formatting * fix: subset * refactor: formatting * Closes SEACrowd#6 | Add Loader for XCOPA (SEACrowd#286) * initial add for loader * edit to include multi language * adjust comments * apply suggestion * fix by linter --------- Co-authored-by: fawwaz.mayda <[email protected]> * Closes SEACrowd#140 | Add Dengue Filipino (SEACrowd#259) * add dengue filipino * update license and tasks * Update _LANGUAGE * Update dengue_filipino.py * feat: flores200 dataloader skeleton * Set only one source schema * Fix subnodes ids for root node alt_burmese_treebank * implement Filipino Gay Language dataloader (SEACrowd#66) * convert citation to raw string * Closes SEACrowd#210 | Create dataset loader for Orchid Corpus (SEACrowd#303) * Add orchid_pos dataloader * Rename OrchidPOS to OrchidPOSDataset * Fix parser bug 
in orchid_pos.py * Add .strip() in source orchid_pos * Cahange string for special char orchid_pos * fix: remove useless loop * refactor: remove unused loop * Closes SEACrowd#159 | Create dataset loader for CC-Aligned (SEACrowd#298) * Add cc_aligned_doc dataloader * Rename class and format cc_aligned_doc * Add SEACROWD_SCHEMA_NAME for cc_aligned_doc * Closes SEACrowd#268 | Implement dataloader for Thai Toxicity Tweet Corpus (SEACrowd#301) * Implement dataloader for Thai toxicity tweets * Fix description grammar * List labels as constant * Change task to ABUSIVE_LANGUAGE_PREDICTION, improve _generate_examples * Rename dataloader folder and file * Remove comment, change license value * Define SEACROWD_SCHEMA using _SUPPORTED_TASKS * Fix bug where example ID and index do not match * Closes SEACrowd#363 | Create dataset loader for identifikasi-bahasa (SEACrowd#379) * [add] initial commit * [add] dataset loader for identifikasi_bahasa * [refactor] removed __main__ * Update seacrowd/sea_datasets/identifikasi_bahasa/identifikasi_bahasa.py --------- Co-authored-by: Amir Djanibekov <[email protected]> * Closes SEACrowd#182. 
| Implement dataloader for `roots_vi_ted` (SEACrowd#329) * Implement dataloader for roots_vi_ted * update * update * update * remove local data * reformat * Closes SEACrowd#180 | Implement `IndoMMLU` dataloader (SEACrowd#324) * Implement dataloader for indommlu * update * update * Closes SEACrowd#345 | Implemented dataloader for vlsp2016_ner (SEACrowd#372) * Implemented dataloader for vlsp2016_ner * Format vlsp2016_ner.py * Closes SEACrowd#276 | Implement PRDECT-ID dataloader (SEACrowd#322) * Implement PRDECT-ID dataloader Closes SEACrowd#276 * Add better type formatting * Follow id_google_play_review for structure * Include source configs for both emotion and sentiment * Closes SEACrowd#9 | Add bhinneka_korpus dataset loader (SEACrowd#175) * Add bhinnek_korpus dataset loader * Updating the suggested changes * Resolved review suggestions * Create indonesian_news_dataset dataloader * Closes SEACrowd#183 | Implement `wongnai_reviews` dataloader (SEACrowd#325) * Implement dataloader for wongnai_reviews * add __init__.py * update * update * Implement change requested by holylovenia * Closes SEACrowd#348 | Implemented dataloader for indoner_tourism (SEACrowd#373) * Implemented dataloader for indoner_tourism * Perform changes requested by ljvmiranda921 * Closes SEACrowd#361 | Create dataset loader for Thai-Lao Parallel Corpus (SEACrowd#384) * [add] dataloader for tha_lao_embassy_parcor, no citation yet * [add] citation; removed debug code * [style] make format restyle * [refactor] removed TODO code --------- Co-authored-by: Amir Djanibekov <[email protected]> * Update constants.py * Closes SEACrowd#305 | Implement dataloader for UIT_ViOCD (SEACrowd#335) * Implement dataloader for UIT_ViOCD * update according to the review * Update _SUPPORTED_TASKS * Closes SEACrowd#362 | Create dataset loader for GKLMIP Khmer News Dataset (SEACrowd#383) * [add] dataloader for gklmip_newsclass * [refactor] changed licence value --------- Co-authored-by: Amir Djanibekov <[email protected]> 
* Closes SEACrowd#358 | Create dataset loader for GKLMIP Product Sentiment (SEACrowd#417) * [add] dataset loader for gklmip_sentiment * [refactor] removed comment; removed "split" parameter in gen_kwargs --------- Co-authored-by: Amir Djanibekov <[email protected]> * Update constants.py * Close SEACrowd#306 | Create dataset loader for ViHealthQA (SEACrowd#319) * Create dataset loader for ViHealthQA SEACrowd#306 * add class docstring * Update vihealthqa.py * Closes SEACrowd#10 | Create beaye_lexicon dataset loader (SEACrowd#320) * Create beaye_lexicon dataset loader * add implementation of eng-day word pairs * Closes SEACrowd#179 | Implement `indo_story_cloze` dataloader (SEACrowd#323) * Implement indo_story_cloze dataloader. * correct license * update according to the feedback * update * Closes SEACrowd#353| Create dataset loader for FilWordNet (SEACrowd#377) * Add dataloader for FilWordNet * Update seacrowd/sea_datasets/filwordnet/filwordnet.py Co-authored-by: Lj Miranda <[email protected]> * Update seacrowd/sea_datasets/filwordnet/filwordnet.py Co-authored-by: Lj Miranda <[email protected]> * Fix formatting --------- Co-authored-by: Lj Miranda <[email protected]> * feat: id_sentiment_analysis dataloader * refactor: remove print * refactor: default config name * feat: subsets * Closes SEACrowd#350 | Implement dataloader for Indonesian PRONER (SEACrowd#399) * Implement dataloader for Indonesian PRONER * Add manual and automatic subsets --------- Co-authored-by: Railey Montalan <[email protected]> * Implement dataloader for IMAD Malay Corpus (SEACrowd#402) Co-authored-by: ssfei81 <[email protected]> * Update id_wsd.py * add thaigov (SEACrowd#412) * add thaigov * Update thaigov.py * add inline comment for file structure * Update and rename snli.py to snli_indo.py * Rename SNLI to SNLI Indo * Update snli_indo.py * [add] dataloader for sarawak_malay * Closes SEACrowd#264 | Create dataset loader for mySentence SEACrowd#264 (SEACrowd#291) * add mysentences dataloader * 
align the config name to subset_id * update mysentence config * Update mysentence.py * remove comment line * Update mysentence.py * Update mysentence config * Update mysentence.py * Update seacrowd/sea_datasets/mysentence/mysentence.py Fix the subset_id case-checking for data download * added __init__.py to ucla_phonetic * updated dataloader according to suggestions * Update memolon.py * fix: subset_id format * refactor: prepend dataset name to subset id * fix: first language is set to latin english * Add thai depression * Create __init__.py * Create __init__.py * Create __init__.py * Implement dataloader for SeaEval * Update template.py instruction for dataloader class name (SEACrowd#334) * Add documentation for dataloader class name * Update template.py * Update REVIEWING.md This modified the content of adding "Dataset" suffix into optional, and giving a reference to templates/templates.py for example * Update REVIEWING.md fix file reference name --------- Co-authored-by: Salsabil Maulana Akbar <[email protected]> * Closes SEACrowd#165 | Add BLOOM-LM dataset (SEACrowd#294) * Init add BLOOM-LM dataset * Adjusting changes based on review * fix typing on _generate_examples * update import based on formatter suggestion * Closes SEACrowd#349 | Create dataset loader for QASiNa (SEACrowd#418) * [add] dataloader for qasina * [refactor] renamed dataset class * [add] added contex_title to qa_seacrowd schema * [refactor, add] changed QA type, added "answer_start", "contx_length" information to meta * [refactor] bug fixes --------- Co-authored-by: Amir Djanibekov <[email protected]> * Closes SEACrowd#263 | Implement dataloader for VIVOS (SEACrowd#398) * Implement dataloader for * Implement dataloader for VIVOS * Add missing __init__.py file * Change _LANGUAGES into list --------- Co-authored-by: Railey Montalan <[email protected]> * Closes SEACrowd#190 | Create dataset loader for TydiQA (SEACrowd#251) * add tydiqa dataloader * Update tydiqa.py * add example helper and update 
config * Update tydiqa.py * Update Configs and _info * Update features in _info() * Update tydiqa.py This update covers the requested changes from @jen-santoso and @jamesjaya, please advice if needs any further changes. Thanks. * add tydiqa_id subset * Update tydiqa.py Reformat long lines in the code and add IndoNLG in citation * remove tydiqa_id * Closes SEACrowd#338 | Created DataLoader for IndonesianNMT (SEACrowd#367) * Implementing Dataloader for indonesiannmt issue SEACrowd#338 * Update template.py * Implementing Dataloader for indonesiannmt issue SEACrowd#338 * removed if __main__ section * IndonesianNMT reconstructing dataloader * Implement ssp task, implement suggestions * format indonesiannmt --------- Co-authored-by: Holy Lovenia <[email protected]> Co-authored-by: Jonibek Mansurov <[email protected]> * Closes SEACrowd#366 | Implement dataloader for Kheng.info Speech (SEACrowd#401) * Implement dataloader for Kheng.info Speech * Add init file * Closes SEACrowd#226 | Vi Pubmed dataloader (SEACrowd#391) * feat: vi_pubmed dataloader * fix: homepage * fix: non unique id error * refactor: class name * refactor: remove unused loop * Create __init__.py * [refactor] removed comment * Update flores200.py * refactor: remove main function * Closes SEACrowd#69 | Implement XStoryCloze Dataloader (SEACrowd#137) * implement xstorycloze dataloader * add __init__.py * update * remove ssp schema; add _LANGUAGES * remove unnecessary import; pascal case for class name * Closes SEACrowd#147 | implemented dataloader for gatitos dataset (SEACrowd#415) * implemented dataloader for gatitos dataset * added __init__.py to gatitos folder * Updated gatitos --------- Co-authored-by: ssfei81 <[email protected]> * Update CODEOWNERS * Patch Workflow on Stale Checking (SEACrowd#482) * Update stale.yml * Create add-new-comment-on-stale * Update and rename stale.yml to stale-labeler.yml * Update add-new-comment-on-stale * Rename add-new-comment-on-stale to add-new-comment-on-stale.yml * 
Sabilmakbar Patch Workflow (SEACrowd#484) Bugfix on SEACrowd#482. * Update add-new-comment-on-stale.yml add workflow trigger criteria on PR message aswell * Update add-new-comment-on-stale.yml * Update add-new-comment-on-stale.yml fix yaml indent * Update add-new-comment-on-stale.yml * Closes SEACrowd#340 | Implement Dataloader for emotes_3k (SEACrowd#397) * Implement Dataloader for emotes_3k * Implement Dataloader for emotes_3k * Tasks updated from sentiment analysis to morality classification * Implement Change Request * formatting emotes_3k --------- Co-authored-by: Jonibek Mansurov <[email protected]> * refactor: remove main function Co-authored-by: Lj Miranda <[email protected]> * Update constants.py * Closes SEACrowd#311 | Add dataloader for indonesian_madurese_bible_translation (SEACrowd#337) * add dataloader for indonesian_madurese_bible_translation * update the license of indonesian_madurese_bible_translation * Update indonesian_madurese_bible_translation.py * modify based on comments from holylovenia * [indonesian_madurese_bible_translation] * update based on the reviewer's comments * Remove `CONTRIBUTING.md`, update PR Message Template, and add bash to initialize dataset (SEACrowd#468) * add bash to initialize dataset * delete CONTRIBUTING.md since it's duplicated with DATALOADER.md * update the docs slightly on suggesting new dataloader contributors to use template * fix few wordings * Add info on required vars '_LOCAL' * Add checklist on __init__.py * fix wording on 2nd checklist regarding 'my_dataset' that should've been a var instead of static val * fix wordings on first section of PR msg * add newline separator for better readability * add info on some to-dos * refactor: citation * Closes SEACrowd#83 | Implement Dataloader for GlobalWoZ (SEACrowd#261) * refactor by pre-commit * reformatted by pre-commit * refactor code for globalwoz * Create dataset loader for IndoQA SEACrowd#430 (SEACrowd#431) * Add CODE_SWITCHING_IDENTIFICATION task (SEACrowd#488) 
* Closes SEACrowd#396 | Implement dataloader for CrossSum (SEACrowd#419) * Implement dataloader * Change to 3-letter ISO codes * Change task to CROSS_LINGUAL_SUMMARIZATION * Closes SEACrowd#92 | Create Jail break data loader (SEACrowd#390) * feat: jailbreak dataloader * fix: minor errors * refactor: styling * refactor: remove main entry * refactor: class name * refactor: remove unused loop * fix: separate text column into different subsets * Create __init__.py * Implement CommonVoice 12.0 dataloader (SEACrowd#452) * Closes SEACrowd#202 | Implement dataloader for WIT (SEACrowd#374) * Implement dataloader for WIT * Remove unnecessary commits * Add to description --------- Co-authored-by: Railey Montalan <[email protected]> * Split into language subsets * Split into language subsets * Update seacrowd/sea_datasets/thai_depression/thai_depression.py Co-authored-by: Lj Miranda <[email protected]> * fix: change lincense to unknown * fix: minor errors * Closes SEACrowd#80 | Implement MSVD-Indonesian Dataloader (SEACrowd#135) * implement id_msvd dataloader * change logic for seacrowd schema (text first, then video); quality of life change to video schema * revert seacrowd video key from "text" to "texts" * change source logic to match original data implementation * run make check_file * Closes SEACrowd#34 | Create dataset loader for MKQA (SEACrowd#177) * Create dataset loader for MKQA SEACrowd#34 * Refactor class variables _LANGUAGES to global for MKQA SEACrowd#34 * Filter supported languages (SEA only) of seacrowd_qa schema for MKQA SEACrowd#34 * Filter supported languages (SEA only) of source schema for MKQA SEACrowd#34 * Filter supported languages (SEA only) for MKQA SEACrowd#34 (a leftover) * Change language code from macrolanguage, msa to zlm, for MKQA SEACrowd#34 * Change to a more appropriate language code of for Malaysian variant used in MKQA SEACrowd#34 * Changed the value of field 'type' of QA schema to be more general, and moved the more specific value to 'meta' 
field for MKQA SEACrowd#34 * Replace None value to empty array in 'answer_aliases' sub-field for consistency in MKQA SEACrowd#34 * Closes SEACrowd#193 | Create dataset loader for MALINDO Morph (SEACrowd#332) * Implement dataloader for MALINDO morph * Specify file encoding and remove newlines when loading data * Add blank __init__.py * Fix typos in docstring * Fix typos * Update seacrowd/sea_datasets/malindo_morph/malindo_morph.py Co-authored-by: Jennifer Santoso <[email protected]> * Update seacrowd/sea_datasets/malindo_morph/malindo_morph.py Co-authored-by: Jennifer Santoso <[email protected]> * Update seacrowd/sea_datasets/malindo_morph/malindo_morph.py --------- Co-authored-by: Jennifer Santoso <[email protected]> * fix: subsets * Closes SEACrowd#314 | Add dataloader for Indonesia chinese mt robust eval (SEACrowd#388) * add dataloader for indonesian_madurese_bible_translation * update dataloader for indonesia_chinese_mtrobusteval * Delete seacrowd/sea_datasets/indonesian_madurese_bible_translation/indonesian_madurese_bible_translation.py * Update indonesia_chinese_mtrobusteval.py * update code based on the reviewer comments * add __init__.py * Update seacrowd/sea_datasets/indonesia_chinese_mtrobusteval/indonesia_chinese_mtrobusteval.py * Update seacrowd/sea_datasets/indonesia_chinese_mtrobusteval/indonesia_chinese_mtrobusteval.py --------- Co-authored-by: Jennifer Santoso <[email protected]> * refactor: feature naming Co-authored-by: Salsabil Maulana Akbar <[email protected]> * fix: homepage url * Closes SEACrowd#211 | Implement dataloader for SEAHORSE (SEACrowd#407) * implement seahorse dataloader * update * update * incorporate the latest comments though tensorflow still needed for tfds * update * update * fix: lowercase feature name * refactor: subset name * fix: limit the sentence paths to the relevant languages * refactor: remove possible error * Change default split to TEST * Closes SEACrowd#447 | Create dataset loader for Aya Dataset (SEACrowd#457) * 
Implementing data loader for Aya Dataset * Fixing license serialization issue * Update based on formatter for aya_dataset.py * update xlsum to extend more langs * update based on formatter * Closes SEACrowd#360 | Implement dataloader for khpos (SEACrowd#376) * Implement dataloader for khpos * Remove unneeded comment * Implemented Test and Validation loading * Streamlining code * Closes SEACrowd#116 | Add pho_ner_covid Dataloader (SEACrowd#461) * feat: pho_ner_covid dataloader * refactor: classname Co-authored-by: Lj Miranda <[email protected]> * fix: remove main function Co-authored-by: Lj Miranda <[email protected]> * refactor: remove inplace uses for dataframe * refactor: remove duplicate statement --------- Co-authored-by: Lj Miranda <[email protected]> * refactor: remove trailing spaces Co-authored-by: Salsabil Maulana Akbar <[email protected]> * refactor: url format * edit 'texts' to 'text' key (SEACrowd#499) * Closes SEACrowd#217 | Implement dataloader for `wili_2018` (SEACrowd#381) * Implement dataloader for wili_2018 * update * Closes SEACrowd#104 | Add lazada_review_filipino (SEACrowd#409) * Add lazada_review_filipino Closes SEACrowd#104 * Update lazada_review_filipino.py Update config name * Update lazada_review_filipino.py fix typo * Update lazada_review_filipino.py bug fix - ValueError: Class label 5 greater than configured num_classes 5 * Update seacrowd/sea_datasets/lazada_review_filipino/lazada_review_filipino.py --------- Co-authored-by: Samuel Cahyawijaya <[email protected]> Co-authored-by: Lj Miranda <[email protected]> * Adjust bash script test_example.sh and test_example_source_only.sh (SEACrowd#171) * update: adjust test_example.sh and test_example_source_only.sh * fix: minor error message when dataset is empty * updated kde4 language codes to iso639-3 * fix: citation * refactor: use base config class * create dataset loader for myanmar-rakhine parallel (SEACrowd#471) * add pyreadr==0.5.0 (SEACrowd#504) usage: reads/writes R RData and Rds files 
into/from pandas data frames * Closes SEACrowd#97 | Inter-Agency Task Force for the Management of Emerging Infectious Diseases (IATF) COVID-19 Resolutions (SEACrowd#460) * Closes SEACrowd#274 | Create OIL data loader (SEACrowd#389) * initial commit * refactor: move module * feat: dataset implementation * feat: oil dataloader * refactor: move dataloader file * refactor: move dataloader file * fix: non unique id error * refactor: file formating * refactor: remove comments * fix: invalid config name exception raise * refactor: audio cache file path * fix: remove useless loop * refactor: formatting * Create __init__.py * fix: citation * fix: remove seacrowd schema * Closes SEACrowd#49 | Updated existing TICO_19 dataloader to support more sea languages (SEACrowd#414) * Updated existing TICO_19 dataloader to support more sea languages * added sea languages to _LANGUAGES --------- Co-authored-by: ssfei81 <[email protected]> * Closes SEACrowd#443 | Add dataloader for ASR-STIDUSC (SEACrowd#493) * Add dataloader for ASR-STIDUSC * update task, dataset name, pythonic coding * add relation extraction task (SEACrowd#502) * fix: subset and config name * Update bibtex id * Closes SEACrowd#356 | Implement dataloader for CodeSwitch-Reddit (SEACrowd#451) * Add CODE_SWITCHING_IDENTIFICATION task * Implement dataloader * Update codeswitch_reddit.py fix column naming in source (using lowercase instead of capitalized) * Closes SEACrowd#222 | Create dataset loader for CreoleRC (SEACrowd#469) * Create dataset loaderfor CreoleRC * remove changes to constants.py * remove document_id, add normalized, add sanity check on offset value * Update REVIEWING.md Clarify wording in Dataloader Reviewing Doc * Closes SEACrowd#341 | Create dataset loader for myParaphrase (SEACrowd#436) * [add] dataloader for my_paraphrase * [refactor] removed redundant breakpoint; put right default schema function * [refactor] changed schema for dataset * [refactor] split data into 3 categories(paraphrase, 
non_paraphrase, all) * [refactor] default config name is changed * [refactor] source configs for _paraphrase,_non_paraphrase,_all; altered schema naming * [refactor] cleaner conditioning, defined else clause * Closes SEACrowd#269 | Create dataset loader for ViVQA SEACrowd#269 (SEACrowd#318) * add vivqa dataloader * Update vivqa.py * update viviq dataloader config * Update vivqa.py * add vivqa dataloader * Update vivqa.py * update viviq dataloader config * Update vivqa.py * Update vivqa.py * update * Update vivqa.py * Update vivqa.py * Delete .idea/vcs.xml * Delete .idea/seacrowd-datahub.iml * Delete .idea/inspectionProfiles/profiles_settings.xml * Delete .idea/inspectionProfiles/Project_Default.xml * Update vivqa.py * Revert "Merge branch 'vivqa' of github.com:gyyz/seacrowd-datahub into vivqa" This reverts commit a96fa80, reversing changes made to 23700ca. * Delete .idea/vcs.xml * Delete .idea/seacrowd-datahub.iml * Delete .idea/inspectionProfiles/profiles_settings.xml * Delete .idea/inspectionProfiles/Project_Default.xml * Revert "Merge branch 'vivqa' of github.com:gyyz/seacrowd-datahub into vivqa" This reverts commit a96fa80, reversing changes made to 23700ca. * Revert "Revert "Merge branch 'vivqa' of github.com:gyyz/seacrowd-datahub into vivqa"" This reverts commit 5f1a3d6. 
* fixing trailing space and run Makefile * Closes SEACrowd#445 | Create dataset loader for malaysia-tweets-with-sentiment-labels (SEACrowd#450) * Fix typo syntax dictionary at constants.py * Add dataloader for malaysia_tweets * Completed requested changes * add dataloader for ASR-Sindodusc (SEACrowd#491) * Closes SEACrowd#475 | Add dataloader for indonglish-dataset (SEACrowd#490) * create dataloader for indonglish * make subset_id unique, use ClassLabel for label * Closes SEACrowd#215 | Implement dataloader for `thai_gpteacher` (SEACrowd#382) * Implement dataloader for thai_gpteacher * update * update * Closes SEACrowd#275 | Create dataset loader for UIT-ViCoV19QA SEACrowd#275 (SEACrowd#463) * add SeaCrowd dataloader for uit_vicov19qa * Merge subsets to one * remove unused imported package * Closes SEACrowd#309 | Create dataset loader for Vietnamese Hate Speech Detection (UIT-ViHSD) #309Uit vihsd (SEACrowd#501) * create dataloader for uit_vihsd * Update uit_vihsd.py * Add some info for the labels * Update example for Seacrowd schema * Closes SEACrowd#441 | Add dataloader for ASR-SMALDUSC (SEACrowd#492) * Add dataloader for ASR-SMALDUSC * add prompt field * Closes SEACrowd#307 | Implement dataloader for ViSoBERT (SEACrowd#466) * Update constants.py * Implement dataloader for ViSoBERT * Fix conflicts with constants.py * Combine source and seacrowd_ssp schemas --------- Co-authored-by: Holy Lovenia <[email protected]> Co-authored-by: Railey Montalan <[email protected]> * add dataloader for wikitext_tl_39 (SEACrowd#486) * Closes SEACrowd#393 | Create dataset loader for WEATHub (SEACrowd#496) * [Feature] Add Weathub DataLoader * [Fix] Add filter for SEA languages only + add constants + run formatter * [Chore] Fix data loader naming * [Fix] Impelement request changes from review * Closes SEACrowd#188 | Implement dataloader for Sea-bench (SEACrowd#375) * Implement dataloader for WIT * Implement dataloader for sea_bench * Remove WIT * Remove logger and unnecessary 
variables * Add instruction tuning and remove QA and summarization tasks * Add __init__.py file * Remove machine translation task * Fix nitpicks --------- Co-authored-by: Railey Montalan <[email protected]> * Closes SEACrowd#115 | Create dataset loader for PhoMT dataset (SEACrowd#489) * add dataloader for PhoMT dataset * Update seacrowd/sea_datasets/phomt/phomt.py Co-authored-by: Elyanah Aco <[email protected]> * Update seacrowd/sea_datasets/phomt/phomt.py Co-authored-by: Elyanah Aco <[email protected]> * Update seacrowd/sea_datasets/phomt/phomt.py Co-authored-by: Elyanah Aco <[email protected]> * Update seacrowd/sea_datasets/phomt/phomt.py Co-authored-by: Elyanah Aco <[email protected]> * Update seacrowd/sea_datasets/phomt/phomt.py Co-authored-by: Elyanah Aco <[email protected]> * update text1/2 name for PhoMT dataset * Update phomt.py to replace en&vi to eng&vie --------- Co-authored-by: Elyanah Aco <[email protected]> * Closes SEACrowd#310 |Create dataset loader for ViSpamReviews SEACrowd#310 (SEACrowd#454) * add vispamreviews dataloader * update vispamreviews * update schema * Closes SEACrowd#530 | Add/Update Dataloader Tatabahasa (SEACrowd#540) * feat: dataloader QA commonsense-reasoning * nitpick * Closes SEACrowd#267 | Add dataloader for struct_amb_ind (SEACrowd#506) * Implement dataloader for struct_amb_ind * Update seacrowd/sea_datasets/struct_amb_ind/struct_amb_ind.py Co-authored-by: Jonibek Mansurov <[email protected]> --------- Co-authored-by: Jonibek Mansurov <[email protected]> * Closes SEACrowd#347 | Create dataset loader for IndoWiki (SEACrowd#485) * create dataset loader for IndoWiki * remove seacrowd schema * Closes SEACrowd#354 | Implement dataloader for ETOS (SEACrowd#416) * Implement dataloader for ETOS * Implement dataloader for ETOS * Rename dataset class name to ETOSDataset * Remove schema due to insufficient annotations * Change ETOS into a POS tagging dataset * Add missing __init__.py file * Fix nitpicks * Add DEFAULT_CONFIG_NAME --------- 
Co-authored-by: Railey Montalan <[email protected]>
* update common_parser for UD JV_CSUI (SEACrowd#558)
* Create dataset loader for UD Javanese-CSUI SEACrowd#427 (SEACrowd#432)
* Closes SEACrowd#446 | Add/Update Dataloader voxlingua (SEACrowd#543)
* add init voxlingua
* Update seacrowd/sea_datasets/voxlingua/voxlingua.py
Co-authored-by: Lj Miranda <[email protected]>
---------
Co-authored-by: Lj Miranda <[email protected]>
* Closes SEACrowd#428 | Create dataset loader for Indonesia BioNER (SEACrowd#434)
* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py
Co-authored-by: Jennifer Santoso <[email protected]>
* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py
Co-authored-by: Jennifer Santoso <[email protected]>
* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py
Co-authored-by: Jennifer Santoso <[email protected]>
* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py
Co-authored-by: Jennifer Santoso <[email protected]>
* Update cc3m_35l.py Changed "_LANGS" to "_LANGUAGES"
* init commit
* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py
Co-authored-by: Jennifer Santoso <[email protected]>
* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py
Co-authored-by: Jennifer Santoso <[email protected]>
* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py
Co-authored-by: Jennifer Santoso <[email protected]>
* Closes SEACrowd#344 | Create dataset loader for VLSP2016-SA (SEACrowd#500)
* [add] dataloader for vlsp2016_sa[local]
* [refactor] changed schema name
---------
Co-authored-by: Amir Djanibekov <[email protected]>
* Fix the private datasheet link in POINTS.md (SEACrowd#568)
* Closes SEACrowd#192 | Create dataset loader for MALINDO_parallel (SEACrowd#385)
* add malindo_parallel.py
* cleanup
* Class name fix
Co-authored-by: Lj Miranda <[email protected]>
* Remove sample licenses
Co-authored-by: Lj Miranda <[email protected]>
* fix dataset formatting error, use original dataset id
---------
Co-authored-by: Lj Miranda <[email protected]>
* Closes SEACrowd#114 | Implement dataloader for VnDT (SEACrowd#467)
* Implement dataloader for VnDT
* Add utility to impute missing sent_id and text fields from CoNLL files
* Fix imputed outputs
---------
Co-authored-by: Railey Montalan <[email protected]>
* add ocr task (SEACrowd#555)
* PR for update subset composition of TydiQA | Close SEACrowd#465 (SEACrowd#503)
* update csubset composition
* Update Subset Composition
* Update Subset Composition
* update subset name indonesian --> ind thai --> tha
* Update nusaparagraph_emot.py
* Update nusaparagraph_emot.py
* Update configs.py
* Closes SEACrowd#346 | Implement dataloader for MUSE (Multilingual Unsupervised and Supervised Embeddings) (SEACrowd#406)
* Implement dataloader for MUSE (Multilingual Unsupervised and Supervised Embeddings)
* Create __init__.py for MUSE SEACrowd#346
* Remove unused comment lines for MUSE SEACrowd#346
* changed all 2 letters language codes to 3 letters
---------
Co-authored-by: ssfei81 <[email protected]>
Co-authored-by: Frederikus Hudi <[email protected]>
* Closes SEACrowd#12 | Add/Update Dataloader BalitaNLP (SEACrowd#550)
* Implement dataloader for balita_nlp
* Remove articles with missing images from imtext schema
* Add details to metadata
* Adding New Citation for Bhinneka korpus (SEACrowd#599)
* Add bhinnek_korpus dataset loader
* Updating the suggested changes
* Resolved review suggestions
* adding new citation
---------
Co-authored-by: Holy Lovenia <[email protected]>
* Closes SEACrowd#270 | Create dataset loader for OpenViVQA SEACrowd#270 (SEACrowd#464)
* add sample
* init submit for openvivqa dataloader
* Update openvivqa.py
* Update openvivqa.py
* update dict format
* Closes SEACrowd#516 | Add/Update Dataloader id_newspaper_2018 (SEACrowd#551)
* Implement dataloader for id_newspaper_2018
* Specify JSON ecoding
* Closes SEACrowd#429 | Implement dataloader for filipino_hatespeech_election (SEACrowd#487)
* Add dataloader for filipino_hatespeech_election
* update task
* update
* Closes SEACrowd#52 | Add cosem dataloader (SEACrowd#473)
* feat: cosem dataloader
* fix: citation
* refactor: dataloader class name
* fix: file parsing logic
* fix: id format
* fix: tab separator bug in text
* fix: check for unique id
* Closes SEACrowd#424 | Add Dataloader Bactrian-X
* Import `schemas` beforehand on `templates/template.py` (SEACrowd#644)
* add import statement for schemas
* add import statement for schemas
* Closes SEACrowd#313 | Add dataloader for Saltik (SEACrowd#387)
* add dataloader for indonesian_madurese_bible_translation
* add dataloader for saltik
* Delete seacrowd/sea_datasets/indonesian_madurese_bible_translation/indonesian_madurese_bible_translation.py
* update based on the reviewer comment
* update based on the reviewer comment
* Remove the modified constants.py from PR
---------
Co-authored-by: Holy Lovenia <[email protected]>
* Add `.upper` method for `--schema` parameter (SEACrowd#648)
* add upper method for --schema
* revert code-style
* Closes SEACrowd#438 | Add dataloader for ASR-INDOCSC (SEACrowd#509)
* add dataloader for asr_indocsc
* Update asr_indocsc.py for data downloading instructions
---------
Co-authored-by: Salsabil Maulana Akbar <[email protected]>
Co-authored-by: Elyanah Aco <[email protected]>
Co-authored-by: Yuze GAO <[email protected]>
Co-authored-by: Lj Miranda <[email protected]>
Co-authored-by: XU, Yan (Yana) <[email protected]>
Co-authored-by: Haochen Li <[email protected]>
Co-authored-by: Jennifer Santoso <[email protected]>
Co-authored-by: Holy Lovenia <[email protected]>
Co-authored-by: Lucky Susanto <[email protected]>
Co-authored-by: Samuel Cahyawijaya <[email protected]>
Co-authored-by: Muhammad Dehan Al Kautsar <[email protected]>
Co-authored-by: Lj Miranda <[email protected]>
Co-authored-by: Lucky Susanto <[email protected]>
Co-authored-by: Maria Khelli <[email protected]>
Co-authored-by: Ishan Jindal <[email protected]>
Co-authored-by: ssfei81 <[email protected]>
Co-authored-by: IvanHalimP <[email protected]>
Co-authored-by: Enliven26 <[email protected]>
Co-authored-by: Dan John Velasco <[email protected]>
Co-authored-by: Chenxi <[email protected]>
Co-authored-by: Bhavish Pahwa <[email protected]>
Co-authored-by: FawwazMayda <[email protected]>
Co-authored-by: fawwaz.mayda <[email protected]>
Co-authored-by: Ilham F Putra <[email protected]>
Co-authored-by: rafif-kewmann <[email protected]>
Co-authored-by: mrafifrbbn <[email protected]>
Co-authored-by: Yong Zheng-Xin <[email protected]>
Co-authored-by: Amir Djanibekov <[email protected]>
Co-authored-by: Amir Djanibekov <[email protected]>
Co-authored-by: joan <[email protected]>
Co-authored-by: joanitolopo <[email protected]>
Co-authored-by: Railey Montalan <[email protected]>
Co-authored-by: Railey Montalan <[email protected]>
Co-authored-by: ssun32 <[email protected]>
Co-authored-by: Tyson <[email protected]>
Co-authored-by: Ilham Firdausi Putra <[email protected]>
Co-authored-by: Johanes Lee <[email protected]>
Co-authored-by: Akhdan Fadhilah <[email protected]>
Co-authored-by: Frederikus Hudi <[email protected]>
Co-authored-by: Börje Karlsson <[email protected]>
Co-authored-by: Muhammad Satrio Wicaksono <[email protected]>
Co-authored-by: Wenyu Zhang <[email protected]>
Co-authored-by: R. Damanhuri <[email protected]>
Co-authored-by: Patrick Amadeus Irawan <[email protected]>
Co-authored-by: Reza Qorib <[email protected]>
Co-authored-by: Bryan Wilie <[email protected]>
Co-authored-by: Muhammad Ravi Shulthan Habibi <[email protected]>