Currently, I'm using a train/test validation split for AG News. The format of the dataset is this:
{"text": "Fears for T N pension after talks Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.", "label": 2, "input": "Classify the following news article as World, Sports, Business, Sci\\/Tech, and return the answer as the corresponding news article label.\ntext: Fears for T N pension after talks Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.", "output": "Business"}
{"text": "The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com) SPACE.com - TORONTO, Canada -- A second\\team of rocketeers competing for the #36;10 million Ansari X Prize, a contest for\\privately funded suborbital space flight, has officially announced the first\\launch date for its manned rocket.", "label": 3, "input": "Classify the following news article as World, Sports, Business, Sci\\/Tech, and return the answer as the corresponding news article label.\ntext: The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com) SPACE.com - TORONTO, Canada -- A second\\team of rocketeers competing for the #36;10 million Ansari X Prize, a contest for\\privately funded suborbital space flight, has officially announced the first\\launch date for its manned rocket.", "output": "Sci/Tech"}
I'm wondering if there are any other real datasets with a similar format that I could test with. I did try using the script below to download wikitext, but it is in a different format:
Script
```python
from datasets import load_dataset
import json

# Load the wikitext-103 dataset
dataset = load_dataset("wikitext", "wikitext-103-v1")

# Convert a dataset split into JSON format and save it to a file
def save_json(split, filename):
    data = [{"text": item["text"]} for item in dataset[split]]
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)

# Save training and test sets
save_json("train", "wikitext_train.json")
save_json("test", "wikitext_test.json")
print("Saved wikitext_train.json and wikitext_test.json")
```
When I use these files, I get an error like this:
File "/workspace/torchtune/torchtune/data/_messages.py", line 223, in __call__
[rank2]: {"type": "text", "content": sample[self.column_map["output"]]}
[rank2]: KeyError: 'output'
1|30|Loss: 1.8771843910217285: 100%|██████████| 30/30 [15:28<00:00, 31.92s/it][rank4]: Traceback (most recent call last): [rank4]: File "/workspace/torchtune/recipes/full_finetune_distributed.py", line 987, in <module> [rank4]: sys.exit(recipe_main()) [rank4]: File "/workspace/torchtune/torchtune/config/_parse.py", line 99, in wrapper [rank4]: sys.exit(recipe_main(conf)) [rank4]: File "/workspace/torchtune/recipes/full_finetune_distributed.py", line 982, in recipe_main [rank4]: recipe.train() [rank4]: File "/workspace/torchtune/recipes/full_finetune_distributed.py", line 896, in train [rank4]: for idx, batch in enumerate(self._val_dataloader): [rank4]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 731, in __next__ [rank4]: data = self._next_data() [rank4]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 787, in _next_data [rank4]: data = self._dataset_fetcher.fetch(index) # may raise StopIteration [rank4]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch [rank4]: data = [self.dataset[idx] for idx in possibly_batched_index] [rank4]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp> [rank4]: data = [self.dataset[idx] for idx in possibly_batched_index] [rank4]: File "/workspace/torchtune/torchtune/datasets/_sft.py", line 118, in __getitem__ [rank4]: return self._prepare_sample(sample) [rank4]: File "/workspace/torchtune/torchtune/datasets/_sft.py", line 121, in _prepare_sample [rank4]: transformed_sample = self._message_transform(sample) [rank4]: File "/workspace/torchtune/torchtune/data/_messages.py", line 223, in __call__ [rank4]: {"type": "text", "content": sample[self.column_map["output"]]} [rank4]: KeyError: 'output'
Can anyone help resolve this issue?
This error seems to be coming from your val_dataloader (`for idx, batch in enumerate(self._val_dataloader)`). This doesn't seem to be from our built-in full finetune distributed recipe. How did you set up the validation dataset? Can you share the full config?
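Before digging into the config, a quick sanity check can confirm the schema mismatch. A sketch of scanning a JSON dataset file for records missing the keys the message transform indexes (the `{"input", "output"}` key set is an assumption inferred from the `column_map` lookup in the traceback):

```python
import json
import os
import tempfile

# Assumed required keys, based on the column_map lookup in the traceback.
REQUIRED_KEYS = {"input", "output"}

def records_missing_keys(path, required=frozenset(REQUIRED_KEYS)):
    """Return the indices of records in a JSON-array file that lack
    any of the required keys."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return [i for i, rec in enumerate(data) if not required <= rec.keys()]

# Illustration with a throwaway file: the second record has only "text",
# like the wikitext export, so it would trigger the KeyError.
bad = [{"text": "a", "input": "q", "output": "Business"}, {"text": "b"}]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(bad, f)
    path = f.name
print(records_missing_keys(path))  # -> [1]
os.remove(path)
```

If this reports any indices for the validation file, the loader will raise the same `KeyError: 'output'` on those samples.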
Hello,
This is how my dataset is defined:
Detailed error: the same traceback as in my first message above.