Script to adapt Yahoo Q&A datasets #1984

Shadowner · 2023-03-05T23:28:50Z

Hello, I've made a Python script to adapt the Yahoo Q&A datasets to the expected Open Assistant dataset format.

Because the datasets don't include any toxicity parameter, I've added the option to generate one using the detoxify package, as it is Apache 2.0 licensed and gives pretty good results.

Example of gems found in the datasets that would need a small toxicity parameter:

  {
    "INSTRUCTION": "Why won't drivers use thier turn signals?",
    "RESPONSE": "Simple math: one hand on cell phone, other hand on coffee mug, right foot on gas pedal, left knee for steering.  What do you want them to use to activate the turn signal?",
    "SOURCE": "yahoo QA",
    "METADATA": {
      "category": "Cars & Transportation"
    }
  }

I could have added every Detoxify parameter to the metadata, but I thought it was too much.
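
For readers who want to see roughly what the toxicity step could look like, here is a minimal sketch. It is not the script from this PR: the helper names `make_detoxify_scorer` and `to_oa_row`, the `"toxicity"` metadata key, the rounding, and the choice to score the response text are all assumptions made for illustration. It relies only on Detoxify's documented `Detoxify("original").predict(text)` call, which returns a dict of scores including `"toxicity"`.

```python
from typing import Callable, Dict, Optional


def make_detoxify_scorer(model_name: str = "original") -> Callable[[str], float]:
    """Return a function that maps a text to its Detoxify 'toxicity' score.

    The import is lazy so the rest of the script works without detoxify
    installed; the model weights are downloaded on first use.
    """
    from detoxify import Detoxify  # requires `pip install detoxify`

    model = Detoxify(model_name)  # load the model once and reuse it

    def score(text: str) -> float:
        return float(model.predict(text)["toxicity"])

    return score


def to_oa_row(
    question: str,
    answer: str,
    category: str,
    score_fn: Optional[Callable[[str], float]] = None,
) -> Dict:
    """Build one row in the format shown above (hypothetical helper)."""
    row = {
        "INSTRUCTION": question,
        "RESPONSE": answer,
        "SOURCE": "yahoo QA",
        "METADATA": {"category": category},
    }
    if score_fn is not None:
        # Assumption: only the single 'toxicity' value is kept, and it is
        # computed on the response text rather than on the question.
        row["METADATA"]["toxicity"] = round(score_fn(answer), 4)
    return row
```

Passing `score_fn=make_detoxify_scorer()` would add the score to each row; leaving it as `None` keeps the metadata limited to the category, as in the example above.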

github-actions · 2023-03-05T23:31:02Z

❌ pre-commit failed.
Please run pre-commit run --all-files locally and commit the changes.
Find more information in the repository's CONTRIBUTING.md

olliestanley

Looks good - could you also add an arg push-to-hf to enable pushing the Dataset to HuggingFace in Parquet format? If you then run that code and push it as a Dataset to your HuggingFace account, you can include the dataset path here as part of the PR.
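
A rough sketch of how such an argument might be wired up, under stated assumptions: the flag is spelled `--push-to-hf` and takes a HuggingFace repository id, and `build_parser` and `push_rows` are hypothetical names, not code from this PR. It uses `Dataset.from_list` and `Dataset.push_to_hub` from the `datasets` library, which stores the uploaded data as Parquet; pushing requires being logged in to the HuggingFace Hub.

```python
import argparse
from typing import Dict, List


def build_parser() -> argparse.ArgumentParser:
    """Command-line options for the conversion script (hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Adapt Yahoo Q&A data to the Open Assistant format."
    )
    parser.add_argument(
        "--push-to-hf",
        metavar="REPO_ID",
        default=None,
        help="HuggingFace dataset repo to push to, e.g. <user>/<dataset>",
    )
    return parser


def push_rows(rows: List[Dict], repo_id: str) -> None:
    """Upload the converted rows to the HuggingFace Hub."""
    # Imported lazily so the script still runs without `datasets`
    # when the flag is not used.
    from datasets import Dataset

    Dataset.from_list(rows).push_to_hub(repo_id)


def main(rows: List[Dict], argv=None) -> None:
    args = build_parser().parse_args(argv)
    if args.push_to_hf:
        push_rows(rows, args.push_to_hf)
```

Taking the repository id as the flag's value, rather than using a plain boolean switch, avoids hard-coding an account name in the script.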

Shadowner requested review from Vechtomov, bitplane and huu4ontocord as code owners March 5, 2023 23:28

Shadowner force-pushed the datasets/yahoo_qa branch 2 times, most recently from dee2cdd to 09b4955 Compare March 6, 2023 00:16

Script to adapt Yahoo Q&A datasets

87e6236

Shadowner force-pushed the datasets/yahoo_qa branch from 09b4955 to 87e6236 Compare March 6, 2023 01:14

andreaskoepf added the data label Mar 10, 2023

olliestanley reviewed Apr 10, 2023
