SubjQA wrong boolean values in entries #3

Open
albertvillanova opened this issue Jun 15, 2021 · 2 comments

Comments

@albertvillanova

As reported by @arnaudstiegler (huggingface/datasets#2503), there appear to be mismatches between some of the fields in your SubjQA dataset.

More concretely, the boolean is_ques_subjective does not seem to match the corresponding question_subj_level.

As an example, file books/splits/train.csv contains the row:

0002007770,books,interesting,matter,fascinating,part,0255768496a256c5ed7caed9d4e47e4c,a907837bafe847039c8da374a144bff9,What are the parts like?,2,0.0,False,a7f1a2503eac2580a0ebbc1d24fffca1,"While I would not recommend this book to a young reader due to a couple pretty explicate scenes I would recommend it to any adult who just loves a good book.  Once I started reading it I could not put it down.  I hesitated reading it because I didn't think that the subject matter would be interesting, but I was so wrong.  This is a wonderfully written book. ANSWERNOTFOUND",This is a wonderfully written book,"(324, 358)",2,1.0,True

where:

  • question_subj_level = 2
  • is_ques_subjective = False

whereas is_ques_subjective should be True because question_subj_level is below 4.
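
The check behind this claim can be sketched in a few lines. This is only an illustration of the documented rule (levels below 4 count as subjective) applied to the two values quoted above; the dict layout and the helper name `expected_from_level` are made up for this sketch and are not part of the dataset or any library.

```python
# Values copied from the example row above; the dict layout is illustrative only.
row = {"question_subj_level": 2, "is_ques_subjective": False}


def expected_from_level(level, threshold=4):
    """Documented rule: subjectivity levels below the threshold count as subjective."""
    return level < threshold


expected = expected_from_level(row["question_subj_level"])
mismatch = expected != row["is_ques_subjective"]

print("expected:", expected)
print("stored:", row["is_ques_subjective"])
print("mismatch:", mismatch)
```

Running the same comparison over every row of a split would show how widespread the disagreement with the documented rule is.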

Issue reported by @arnaudstiegler:

SubjQA seems to have a boolean that's consistently wrong.

It defines:

  • question_subj_level: The subjectivity level of the question (on a 1 to 5 scale with 1 being the most subjective).
  • is_ques_subjective: A boolean subjectivity label derived from question_subj_level (i.e., scores below 4 are considered as subjective)

However, is_ques_subjective seems to have wrong values in the entire dataset.

For instance, in the example in the dataset card, we have:

"question_subj_level": 2
"is_ques_subjective": false

However, according to the description, the question should be subjective since the question_subj_level is below 4

@behzadg

behzadg commented Aug 25, 2021

Thank you for pointing this out!

I did look into this carefully, and ...

  • All numerical values (i.e., subjectivity ratings & TextBlob scores) were accurate, and there are no mistakes there.
  • There is a documentation error regarding the "is_ques_subjective" and "is_ans_subjective" columns. These columns (which represent a boolean version of subjectivity) were not derived from the subjectivity ratings reported by annotators. Instead, they were derived from the TextBlob subjectivity scores (any score above 0.5 is considered subjective).

I'll perform one final check in the next couple of days, and update the Readme accordingly to fix the issue and avoid further confusion.
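
As a rough sanity check of the rule described in this comment, the sketch below applies a "score above 0.5" cutoff to the example row quoted in the issue. It assumes that the 0.0 and 1.0 values next to the booleans in that row are the TextBlob subjectivity scores for the question and the answer; the helper name and the dict keys are hypothetical labels chosen for this sketch, not dataset columns.

```python
def is_subjective_from_score(score, cutoff=0.5):
    """Rule as described in the comment: scores above the cutoff are subjective."""
    return score > cutoff


# (assumed TextBlob score, stored boolean) pairs read off the example row.
# The keys are labels for this sketch only.
example = {
    "question": (0.0, False),
    "answer": (1.0, True),
}

# For each field, does the score-based rule reproduce the stored boolean?
results = {
    name: is_subjective_from_score(score) == stored
    for name, (score, stored) in example.items()
}

print(results)
```

Under that assumption, the score-based rule reproduces both stored booleans for this row, which fits the explanation that the documentation, not the data, was in error.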

@behzadg behzadg closed this as completed Aug 25, 2021
@behzadg behzadg reopened this Aug 25, 2021