How to change the datasets in JSON format? #2361

Open
kailashg26 opened this issue Feb 7, 2025 · 2 comments

Comments

kailashg26 commented Feb 7, 2025

Hello,

Currently, I'm using AG News with a train split and a test split that I use for validation. The dataset is formatted like this:

{"text": "Fears for T N pension after talks Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.", "label": 2, "input": "Classify the following news article as World, Sports, Business, Sci\\/Tech, and return the answer as the corresponding news article label.\ntext: Fears for T N pension after talks Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.", "output": "Business"}
{"text": "The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com) SPACE.com - TORONTO, Canada -- A second\\team of rocketeers competing for the #36;10 million Ansari X Prize, a contest for\\privately funded suborbital space flight, has officially announced the first\\launch date for its manned rocket.", "label": 3, "input": "Classify the following news article as World, Sports, Business, Sci\\/Tech, and return the answer as the corresponding news article label.\ntext: The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com) SPACE.com - TORONTO, Canada -- A second\\team of rocketeers competing for the #36;10 million Ansari X Prize, a contest for\\privately funded suborbital space flight, has officially announced the first\\launch date for its manned rocket.", "output": "Sci/Tech"}
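For reference, a minimal sketch of how records in this layout could be produced from the Hugging Face ag_news dataset (the label order and prompt text are taken from the samples above, so treat this as an illustration rather than the exact script):

from datasets import load_dataset
import json

# Class names in the order used by the Hugging Face "ag_news" label ids
LABELS = ["World", "Sports", "Business", "Sci/Tech"]
PROMPT = (
    "Classify the following news article as World, Sports, Business, Sci/Tech, "
    "and return the answer as the corresponding news article label.\ntext: "
)

def save_instruct_jsonl(split, filename):
    ds = load_dataset("ag_news", split=split)
    with open(filename, "w", encoding="utf-8") as f:
        for item in ds:
            record = {
                "text": item["text"],
                "label": item["label"],
                "input": PROMPT + item["text"],
                "output": LABELS[item["label"]],
            }
            # One JSON object per line, matching the samples above
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

save_instruct_jsonl("train", "ag_news_train.json")
save_instruct_jsonl("test", "ag_news_test.json")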

I'm wondering if there are any other real datasets in a similar format that I could test with. I did try using the script below to download wikitext, but it comes out in a different format:

Script

from datasets import load_dataset
import json

# Load the wikitext-103 dataset
dataset = load_dataset("wikitext", "wikitext-103-v1")

# Convert the dataset into JSON format
def save_json(split, filename):
    data = [{"text": item["text"]} for item in dataset[split]]

    # Save to a JSON file
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)

# Save training and test sets
save_json("train", "wikitext_train.json")
save_json("test", "wikitext_test.json")

print("Saved wikitext_train.json and wikitext_test.json")
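As a side note, json.dump with indent=4 writes one pretty-printed JSON array, while the AG News file above is JSON lines (one object per line). A minimal JSON-lines variant, if that layout is preferred, could look like this:

from datasets import load_dataset
import json

dataset = load_dataset("wikitext", "wikitext-103-v1")

# Write one JSON object per line (JSON lines), matching the AG News layout above
def save_jsonl(split, filename):
    with open(filename, "w", encoding="utf-8") as f:
        for item in dataset[split]:
            f.write(json.dumps({"text": item["text"]}, ensure_ascii=False) + "\n")

save_jsonl("train", "wikitext_train.jsonl")
save_jsonl("test", "wikitext_test.jsonl")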

When I use these files, I get an error like this:

 File "/workspace/torchtune/torchtune/data/_messages.py", line 223, in __call__
[rank2]:     {"type": "text", "content": sample[self.column_map["output"]]}
[rank2]: KeyError: 'output'

This is how my dataset is defined:

dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: json
  data_files: /workspace/torchtune-private/wikitext_train.json
  column: input
  split: train
seed: null
shuffle: False
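For comparison, a rough sketch of what this config corresponds to in Python (the tokenizer builder is an assumption, use the one for your model; note also that the wikitext JSON produced above only has a "text" key, not "input"):

from torchtune.datasets import text_completion_dataset
from torchtune.models.llama3 import llama3_tokenizer  # assumed tokenizer builder

tokenizer = llama3_tokenizer("/path/to/tokenizer.model")

# text_completion_dataset reads a single free-text column; it has no "input"/"output" mapping
ds = text_completion_dataset(
    tokenizer=tokenizer,
    source="json",
    data_files="/workspace/torchtune-private/wikitext_train.json",
    column="text",  # the wikitext records above only contain a "text" field
    split="train",
)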

Detailed error:

1|30|Loss: 1.8771843910217285: 100%|██████████| 30/30 [15:28<00:00, 31.92s/it]
[rank4]: Traceback (most recent call last):
[rank4]:   File "/workspace/torchtune/recipes/full_finetune_distributed.py", line 987, in <module>
[rank4]:     sys.exit(recipe_main())
[rank4]:   File "/workspace/torchtune/torchtune/config/_parse.py", line 99, in wrapper
[rank4]:     sys.exit(recipe_main(conf))
[rank4]:   File "/workspace/torchtune/recipes/full_finetune_distributed.py", line 982, in recipe_main
[rank4]:     recipe.train()
[rank4]:   File "/workspace/torchtune/recipes/full_finetune_distributed.py", line 896, in train
[rank4]:     for idx, batch in enumerate(self._val_dataloader):
[rank4]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 731, in __next__
[rank4]:     data = self._next_data()
[rank4]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 787, in _next_data
[rank4]:     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
[rank4]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank4]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank4]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
[rank4]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank4]:   File "/workspace/torchtune/torchtune/datasets/_sft.py", line 118, in __getitem__
[rank4]:     return self._prepare_sample(sample)
[rank4]:   File "/workspace/torchtune/torchtune/datasets/_sft.py", line 121, in _prepare_sample
[rank4]:     transformed_sample = self._message_transform(sample)
[rank4]:   File "/workspace/torchtune/torchtune/data/_messages.py", line 223, in __call__
[rank4]:     {"type": "text", "content": sample[self.column_map["output"]]}
[rank4]: KeyError: 'output'

Can anyone help in resolving this issue?

kailashg26 reopened this Feb 7, 2025
RdoubleA (Contributor) commented Feb 7, 2025

This error seems to be coming from your val_dataloader (for idx, batch in enumerate(self._val_dataloader)). This doesn't seem to be from our built-in full finetune distributed recipe. How did you set up the validation dataset? Can you share the full config?

kailashg26 (Author) commented Feb 7, 2025

I guess I understand the error now: I messed up the naming convention and got confused between the instruct and text-completion datasets. I've just changed it.
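For anyone who hits the same confusion, a rough sketch of the instruct-style counterpart (the builder and parameter names are from memory of torchtune's APIs and may differ across versions), which does expect the "input"/"output" columns that the KeyError above was complaining about:

from torchtune.datasets import instruct_dataset
from torchtune.models.llama3 import llama3_tokenizer  # assumed tokenizer builder

tokenizer = llama3_tokenizer("/path/to/tokenizer.model")

# Instruct-style data (like the AG News file above) goes through a prompt/response
# message transform, which is what looks up the "output" column and raised the
# KeyError when it was pointed at the wikitext file.
val_ds = instruct_dataset(
    tokenizer=tokenizer,
    source="json",
    data_files="ag_news_test.json",  # hypothetical file with "input"/"output" keys
    split="train",  # the HF "json" loader exposes a single "train" split per file
)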

Btw, could you please suggest a few chat datasets that have both a train and a test split?
