From 989ca859f37e73401104aa63526b5fbd0730c90f Mon Sep 17 00:00:00 2001
From: sileod
Date: Fri, 30 Jun 2023 11:41:05 +0200
Subject: [PATCH 1/6] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 365985d..a252f0a 100755
--- a/README.md
+++ b/README.md
@@ -44,7 +44,7 @@ You can also leverage [tasksource](https://github.com/sileod/tasksource/) with t
 ```py
 rte = tn.AutoTask("glue/rte")
 ```
-AutoTask guesses a tempalte based on the dataset structure.
+AutoTask guesses a template based on the dataset structure.
 It also accepts a dataset as input, if it fits the template (e.g. after tasksource custom preprocessing).
 ## Sampling
 ```py
 tn.Classification(dataset,nrow=5000,nrows_eval=500 oversampling=2)

From 45de34a5ff5a6e7c47462f20a0309edf4e81ae26 Mon Sep 17 00:00:00 2001
From: sileod
Date: Fri, 30 Jun 2023 11:46:35 +0200
Subject: [PATCH 2/6] Update README.md

---
 README.md | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index a252f0a..b5a6b4a 100755
--- a/README.md
+++ b/README.md
@@ -39,17 +39,18 @@ Tasknet is multitask by design. `model.task_models_list` contains one model per
 ## Installation
 `pip install tasknet`
 
+## Balancing dataset sizes
+```py
+tn.Classification(dataset,nrow=5000,nrows_eval=500 oversampling=2)
+```
+You can balance multiple datasets with `nrows` and `oversampling`. `nrows` is the maximal number of examples. If a dataset has less than `nrows`, it will be oversampled at most `oversampling` times.
+
 ## AutoTask
 You can also leverage [tasksource](https://github.com/sileod/tasksource/) with tn.AutoTask and have one-line access to 600+ datasets, see [implemented tasks](https://github.com/sileod/tasksource/blob/main/README.md).
 ```py
-rte = tn.AutoTask("glue/rte")
+rte = tn.AutoTask("glue/rte",nrow=5000)
 ```
 AutoTask guesses a template based on the dataset structure.
 It also accepts a dataset as input, if it fits the template (e.g. after tasksource custom preprocessing).
-## Sampling
-```py
-tn.Classification(dataset,nrow=5000,nrows_eval=500 oversampling=2)
-```
-You can balance multiple datasets with `nrows` and `oversampling`. `nrows` is the maximal number of examples. If a dataset has less than `nrows`, it will be oversampled at most `oversampling` times.

From 93ece10867054a29203e236e0be4fb838ef7b2b4 Mon Sep 17 00:00:00 2001
From: sileod
Date: Fri, 30 Jun 2023 11:47:10 +0200
Subject: [PATCH 3/6] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index b5a6b4a..1b18811 100755
--- a/README.md
+++ b/README.md
@@ -41,14 +41,14 @@ Tasknet is multitask by design. `model.task_models_list` contains one model per
 ## Balancing dataset sizes
 ```py
-tn.Classification(dataset,nrow=5000,nrows_eval=500 oversampling=2)
+tn.Classification(dataset, nrows=5000, nrows_eval=500, oversampling=2)
 ```
 You can balance multiple datasets with `nrows` and `oversampling`. `nrows` is the maximal number of examples. If a dataset has less than `nrows`, it will be oversampled at most `oversampling` times.
 
 ## AutoTask
 You can also leverage [tasksource](https://github.com/sileod/tasksource/) with tn.AutoTask and have one-line access to 600+ datasets, see [implemented tasks](https://github.com/sileod/tasksource/blob/main/README.md).
 ```py
-rte = tn.AutoTask("glue/rte",nrow=5000)
+rte = tn.AutoTask("glue/rte", nrows=5000)
 ```
 AutoTask guesses a template based on the dataset structure.
 It also accepts a dataset as input, if it fits the template (e.g. after tasksource custom preprocessing).

From e6a5f32d92fc5875d52935bc5e3eda79370c1abf Mon Sep 17 00:00:00 2001
From: sileod
Date: Fri, 30 Jun 2023 12:02:29 +0200
Subject: [PATCH 4/6] Delete .gitattributes

---
 .gitattributes | 1 -
 1 file changed, 1 deletion(-)
 delete mode 100644 .gitattributes

diff --git a/.gitattributes b/.gitattributes
deleted file mode 100644
index b457ff5..0000000
--- a/.gitattributes
+++ /dev/null
@@ -1 +0,0 @@
-*.py linguist-vendored

From 79afb6a4c5d00ddc02eb3f050f0175c168fea02d Mon Sep 17 00:00:00 2001
From: sileod
Date: Fri, 30 Jun 2023 12:14:10 +0200
Subject: [PATCH 5/6] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 1b18811..6fe2f19 100755
--- a/README.md
+++ b/README.md
@@ -20,7 +20,7 @@ import tasknet as tn; from datasets import load_dataset
 
 rte = tn.Classification(
     dataset=load_dataset("glue", "rte"),
-    s1="sentence1", s2="sentence2", y="label") #s2 is optional
+    s1="sentence1", s2="sentence2", y="label") #s2 is optional # See AutoTask for shorter code
 
 class hparams:
   model_name='microsoft/deberta-v3-base' # deberta models have the best results (and tasknet support)

From 859260b311e81fecf665d5e742fed12f9132897e Mon Sep 17 00:00:00 2001
From: sileod
Date: Fri, 30 Jun 2023 13:51:01 +0200
Subject: [PATCH 6/6] Update README.md

---
 README.md | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 6fe2f19..2cba434 100755
--- a/README.md
+++ b/README.md
@@ -12,7 +12,9 @@ The task templates follow the same interface. They implement `preprocess_function`, a
 data collator and `compute_metrics`.
 Look at [tasks.py](https://github.com/sileod/tasknet/blob/main/src/tasknet/tasks.py) and use existing templates as a starting point to implement a custom task template.
 
-## Task instances and example
+## Installation and example
+
+`pip install tasknet`
 
 Each task template has fields that should be matched with specific dataset columns. Classification has two text fields `s1`,`s2`, and a label `y`. Pass a dataset to a template, and fill in the mapping between the template fields and the dataset columns to instantiate a task.
 ```py
@@ -36,9 +38,6 @@ p([{'text':x.premise,'text_pair': x.hypothesis}]) # HuggingFace pipeline for inf
 ```
 Tasknet is multitask by design. `model.task_models_list` contains one model per task, with a shared encoder.
 
-## Installation
-`pip install tasknet`
-
 ## Balancing dataset sizes
 ```py
 tn.Classification(dataset, nrows=5000, nrows_eval=500, oversampling=2)
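
The sampling rule introduced in PATCH 2/6 (and whose signature is fixed in PATCH 3/6) reduces to simple arithmetic on the dataset size. The helper below is a hypothetical illustration of that rule as described in the "Balancing dataset sizes" prose; `effective_size` is not part of the tasknet API.

```py
# Hypothetical helper illustrating the `nrows`/`oversampling` rule from the
# "Balancing dataset sizes" section; `effective_size` is NOT a tasknet function.
def effective_size(n_examples: int, nrows: int, oversampling: int) -> int:
    # A dataset contributes at most `nrows` examples, and a dataset smaller
    # than `nrows` is repeated at most `oversampling` times.
    return min(nrows, n_examples * oversampling)

assert effective_size(10_000, nrows=5000, oversampling=2) == 5000  # capped at nrows
assert effective_size(1_000, nrows=5000, oversampling=2) == 2000   # oversampled 2x
assert effective_size(3_000, nrows=5000, oversampling=2) == 5000   # oversampling hits the cap
```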
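Taken together, the README state after PATCH 6/6 describes roughly the following workflow. This is a minimal sketch assuming only the API fragments visible in the diffs above: `tn.Model`, `tn.Trainer`, `trainer.pipeline()`, and the availability of "glue/mrpc" in tasksource do not appear in these patches and are assumptions about the surrounding README code.

```py
# Minimal multitask sketch of the workflow the final README describes.
# Assumes the tasknet API fragments visible in the diffs above; tn.Model,
# tn.Trainer, and trainer.pipeline() are not shown in these patches and
# are assumptions, as is the "glue/mrpc" tasksource id.
import tasknet as tn
from datasets import load_dataset

# Explicit column mapping, as in the PATCH 5/6 context lines
rte = tn.Classification(
    dataset=load_dataset("glue", "rte"),
    s1="sentence1", s2="sentence2", y="label")  # s2 is optional

# Or tasksource one-liners with the corrected `nrows` kwarg (PATCH 3/6);
# tasknet is multitask by design, so several tasks can share one encoder
tasks = [rte, tn.AutoTask("glue/mrpc", nrows=5000)]

class hparams:
    model_name = 'microsoft/deberta-v3-base'

model = tn.Model(tasks, hparams)             # assumption: not shown in the diffs
trainer = tn.Trainer(model, tasks, hparams)  # assumption: not shown in the diffs
trainer.train()

# One model per task, with a shared encoder (see the PATCH 6/6 hunk context)
assert len(model.task_models_list) == len(tasks)

p = trainer.pipeline()  # HuggingFace pipeline for inference
p([{'text': 'A man is sleeping.', 'text_pair': 'Someone is asleep.'}])
```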