Script to adapt Yahoo Q&A datasets #1984

Open
Shadowner wants to merge 1 commit into main from datasets/yahoo_qa

Conversation

Shadowner
Contributor

Hello, I've made a Python script to adapt the Yahoo Q&A datasets to the expected Open Assistant dataset format.

Because the datasets don't include any toxicity parameter, I've added the option to generate one using the detoxify package, as it is Apache 2.0 licensed and gives pretty good results.

Example of gems found in the datasets that would need a small toxicity parameter:

  {
    "INSTRUCTION": "Why won't drivers use thier turn signals?",
    "RESPONSE": "Simple math: one hand on cell phone, other hand on coffee mug, right foot on gas pedal, left knee for steering.  What do you want them to use to activate the turn signal?",
    "SOURCE": "yahoo QA",
    "METADATA": {
      "category": "Cars & Transportation"
    }
  }
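
For readers who want to see how a record of this shape might be built, here is a minimal sketch. It is not the script from this PR: the function name `to_oa_record` and its parameters are hypothetical, and only the output keys (`INSTRUCTION`, `RESPONSE`, `SOURCE`, `METADATA`) are taken from the example above.

```python
from typing import Any, Dict


def to_oa_record(question: str, answer: str, category: str) -> Dict[str, Any]:
    """Build one record in the shape of the example above.

    The parameter names are assumptions for illustration; the real Yahoo Q&A
    column names are not shown in this PR description.
    """
    return {
        # Strip stray whitespace around the text, but leave the wording untouched.
        "INSTRUCTION": question.strip(),
        "RESPONSE": answer.strip(),
        "SOURCE": "yahoo QA",
        "METADATA": {"category": category},
    }
```

Keeping the conversion in one small pure function makes it easy to unit-test separately from file reading and writing.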

I could have added every Detoxify parameter to the metadata, but I thought it was too much.
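
A rough sketch of what keeping only one Detoxify score could look like, under several assumptions: the `toxicity` metadata key, the helper names, and the choice to score the question and answer together are all hypothetical, not taken from the PR. The Detoxify import is done lazily so the rest of the code can be used with any scoring function.

```python
from typing import Any, Callable, Dict

Scorer = Callable[[str], float]


def detoxify_scorer(model_name: str = "original") -> Scorer:
    """Return a scorer backed by Detoxify (requires `pip install detoxify`)."""
    from detoxify import Detoxify  # lazy import: loading the model is slow

    model = Detoxify(model_name)

    def score(text: str) -> float:
        # predict() returns a dict of label -> score; keep only "toxicity".
        return float(model.predict(text)["toxicity"])

    return score


def add_toxicity(record: Dict[str, Any], scorer: Scorer, ndigits: int = 4) -> Dict[str, Any]:
    """Return a copy of `record` with one rounded toxicity score in METADATA."""
    # Assumption: score the question and answer as one string.
    text = f'{record["INSTRUCTION"]} {record["RESPONSE"]}'
    metadata = dict(record.get("METADATA", {}))
    metadata["toxicity"] = round(scorer(text), ndigits)  # hypothetical key name
    return {**record, "METADATA": metadata}
```

With Detoxify installed, `add_toxicity(record, detoxify_scorer())` would attach the score; passing a stub scorer instead keeps tests fast and offline.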


github-actions bot commented Mar 5, 2023

pre-commit failed.
Please run pre-commit run --all-files locally and commit the changes.
Find more information in the repository's CONTRIBUTING.md

Shadowner force-pushed the datasets/yahoo_qa branch 2 times, most recently from dee2cdd to 09b4955 on March 6, 2023 00:16
Collaborator

olliestanley left a comment


Looks good. Could you also add a push-to-hf argument that enables pushing the dataset to HuggingFace in Parquet format? If you then run the script and push the result as a dataset to your HuggingFace account, you can include the dataset path here as part of the PR.
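
One possible shape for such an argument, sketched with `argparse` and the `datasets` library. The flag is modelled here as taking a repository id, which is an assumption (the request could equally mean a boolean switch), and `build_parser` and `push_records` are hypothetical names. `Dataset.push_to_hub` uploads the data as Parquet files and expects you to be logged in to HuggingFace beforehand.

```python
import argparse
from typing import Any, Dict, List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Adapt Yahoo Q&A data (sketch).")
    parser.add_argument(
        "--push-to-hf",
        metavar="REPO_ID",
        default=None,
        # Assumption: the flag carries the target repo id, e.g. "user/dataset-name".
        help="If set, push the converted dataset to this HuggingFace repo.",
    )
    return parser


def push_records(records: List[Dict[str, Any]], repo_id: str) -> None:
    """Upload the converted records to the HuggingFace Hub."""
    from datasets import Dataset  # lazy import: only needed when pushing

    # from_list builds a Dataset from a list of dicts; push_to_hub stores Parquet.
    Dataset.from_list(records).push_to_hub(repo_id)


def main(records: List[Dict[str, Any]], argv: List[str]) -> None:
    args = build_parser().parse_args(argv)
    if args.push_to_hf:
        push_records(records, args.push_to_hf)
```

Leaving the default at `None` means the script behaves exactly as before unless the flag is given.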
