How to change the datasets in JSON format? #2361

Open
kailashg26 opened this issue Feb 7, 2025 · 2 comments

Comments

kailashg26 commented Feb 7, 2025

Hello,

Currently, I'm using AG News with a train split and a test split that I use for validation. The dataset is formatted like this:

{"text": "Fears for T N pension after talks Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.", "label": 2, "input": "Classify the following news article as World, Sports, Business, Sci\\/Tech, and return the answer as the corresponding news article label.\ntext: Fears for T N pension after talks Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.", "output": "Business"}
{"text": "The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com) SPACE.com - TORONTO, Canada -- A second\\team of rocketeers competing for the #36;10 million Ansari X Prize, a contest for\\privately funded suborbital space flight, has officially announced the first\\launch date for its manned rocket.", "label": 3, "input": "Classify the following news article as World, Sports, Business, Sci\\/Tech, and return the answer as the corresponding news article label.\ntext: The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com) SPACE.com - TORONTO, Canada -- A second\\team of rocketeers competing for the #36;10 million Ansari X Prize, a contest for\\privately funded suborbital space flight, has officially announced the first\\launch date for its manned rocket.", "output": "Sci/Tech"}
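For reference, a minimal sketch of how records in this layout could be produced from the Hugging Face ag_news dataset (the label order and prompt text are taken from the samples above, so treat this as an illustration rather than the exact script):

from datasets import load_dataset
import json

# Class names in the order used by the Hugging Face "ag_news" label ids
LABELS = ["World", "Sports", "Business", "Sci/Tech"]
PROMPT = (
    "Classify the following news article as World, Sports, Business, Sci/Tech, "
    "and return the answer as the corresponding news article label.\ntext: "
)

def save_instruct_jsonl(split, filename):
    ds = load_dataset("ag_news", split=split)
    with open(filename, "w", encoding="utf-8") as f:
        for item in ds:
            record = {
                "text": item["text"],
                "label": item["label"],
                "input": PROMPT + item["text"],
                "output": LABELS[item["label"]],
            }
            # One JSON object per line, matching the samples above
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

save_instruct_jsonl("train", "ag_news_train.json")
save_instruct_jsonl("test", "ag_news_test.json")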

I'm wondering if there are any other real datasets in a similar format that I could test with. I did try using the script below to download wikitext, but it comes out in a different format:

Script

from datasets import load_dataset
import json

# Load the wikitext-103 dataset
dataset = load_dataset("wikitext", "wikitext-103-v1")

# Convert the dataset into JSON format
def save_json(split, filename):
    data = [{"text": item["text"]} for item in dataset[split]]

    # Save to a JSON file
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)

# Save training and test sets
save_json("train", "wikitext_train.json")
save_json("test", "wikitext_test.json")

print("Saved wikitext_train.json and wikitext_test.json")
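As a side note, json.dump with indent=4 writes one pretty-printed JSON array, while the AG News file above is JSON lines (one object per line). A minimal JSON-lines variant, if that layout is preferred, could look like this:

from datasets import load_dataset
import json

dataset = load_dataset("wikitext", "wikitext-103-v1")

# Write one JSON object per line (JSON lines), matching the AG News layout above
def save_jsonl(split, filename):
    with open(filename, "w", encoding="utf-8") as f:
        for item in dataset[split]:
            f.write(json.dumps({"text": item["text"]}, ensure_ascii=False) + "\n")

save_jsonl("train", "wikitext_train.jsonl")
save_jsonl("test", "wikitext_test.jsonl")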

When I use these files, I get an error like this:

 File "/workspace/torchtune/torchtune/data/_messages.py", line 223, in __call__
[rank2]:     {"type": "text", "content": sample[self.column_map["output"]]}
[rank2]: KeyError: 'output'

This is how my dataset is defined:

dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: json
  data_files: /workspace/torchtune-private/wikitext_train.json
  column: input
  split: train
seed: null
shuffle: False
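For comparison, a rough sketch of what this config corresponds to in Python (the tokenizer builder is an assumption, use the one for your model; note also that the wikitext JSON produced above only has a "text" key, not "input"):

from torchtune.datasets import text_completion_dataset
from torchtune.models.llama3 import llama3_tokenizer  # assumed tokenizer builder

tokenizer = llama3_tokenizer("/path/to/tokenizer.model")

# text_completion_dataset reads a single free-text column; it has no "input"/"output" mapping
ds = text_completion_dataset(
    tokenizer=tokenizer,
    source="json",
    data_files="/workspace/torchtune-private/wikitext_train.json",
    column="text",  # the wikitext records above only contain a "text" field
    split="train",
)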

Detailed error:

1|30|Loss: 1.8771843910217285: 100%|██████████| 30/30 [15:28<00:00, 31.92s/it]
[rank4]: Traceback (most recent call last):
[rank4]:   File "/workspace/torchtune/recipes/full_finetune_distributed.py", line 987, in <module>
[rank4]:     sys.exit(recipe_main())
[rank4]:   File "/workspace/torchtune/torchtune/config/_parse.py", line 99, in wrapper
[rank4]:     sys.exit(recipe_main(conf))
[rank4]:   File "/workspace/torchtune/recipes/full_finetune_distributed.py", line 982, in recipe_main
[rank4]:     recipe.train()
[rank4]:   File "/workspace/torchtune/recipes/full_finetune_distributed.py", line 896, in train
[rank4]:     for idx, batch in enumerate(self._val_dataloader):
[rank4]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 731, in __next__
[rank4]:     data = self._next_data()
[rank4]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 787, in _next_data
[rank4]:     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
[rank4]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank4]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank4]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
[rank4]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank4]:   File "/workspace/torchtune/torchtune/datasets/_sft.py", line 118, in __getitem__
[rank4]:     return self._prepare_sample(sample)
[rank4]:   File "/workspace/torchtune/torchtune/datasets/_sft.py", line 121, in _prepare_sample
[rank4]:     transformed_sample = self._message_transform(sample)
[rank4]:   File "/workspace/torchtune/torchtune/data/_messages.py", line 223, in __call__
[rank4]:     {"type": "text", "content": sample[self.column_map["output"]]}
[rank4]: KeyError: 'output'

Can anyone help in resolving this issue?

kailashg26 reopened this Feb 7, 2025
RdoubleA (Contributor) commented Feb 7, 2025

This error seems to be coming from your val_dataloader (for idx, batch in enumerate(self._val_dataloader)). This doesn't seem to be from our built-in full finetune distributed recipe. How did you set up the validation dataset? Can you share the full config?

kailashg26 (Author) commented Feb 7, 2025

I guess I understand the error now: I messed up the naming convention and got confused between the instruct and text-completion datasets. I've just changed it.
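For anyone who hits the same confusion, a rough sketch of the instruct-style counterpart (the builder and parameter names are from memory of torchtune's APIs and may differ across versions), which does expect the "input"/"output" columns that the KeyError above was complaining about:

from torchtune.datasets import instruct_dataset
from torchtune.models.llama3 import llama3_tokenizer  # assumed tokenizer builder

tokenizer = llama3_tokenizer("/path/to/tokenizer.model")

# Instruct-style data (like the AG News file above) goes through a prompt/response
# message transform, which is what looks up the "output" column and raised the
# KeyError when it was pointed at the wikitext file.
val_ds = instruct_dataset(
    tokenizer=tokenizer,
    source="json",
    data_files="ag_news_test.json",  # hypothetical file with "input"/"output" keys
    split="train",  # the HF "json" loader exposes a single "train" split per file
)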

Btw, could you please suggest a few chat datasets that have both a train and a test split?
