Textbound.brat_str() / textbound_str() does not always round-trip when split spans are separated by more than one newline #1

Open
brian-romanowski-nuance opened this issue Apr 11, 2022 · 1 comment


@brian-romanowski-nuance

I noticed that the string produced by textbound.brat_str() doesn't match the original line in the BRAT .ann file for a couple of documents in the n2c2 track 2 dataset: 4590 and 4693. The result is an off-by-one error in the split spans defining the covered tokens.

It looks like brat_str() and related functions depend on particular conventions when there are newlines between annotated tokens. When there is more than one newline, brat_str() seems to need an empty span to work correctly. Here's one case that does round-trip, from document 1029:

foo\n\nbarrr bat                                                             # annotated portion of the document
T6      History 228 231;232 232;233 242 foo  barrr bat       # .ann file

Note that there is a zero-length annotation span in there: (232, 232). And there are two spaces between "foo" and "barrr" in the covered text column corresponding to the two skipped newline characters.

However, in the problematic document 4590:

foooooo\n\nbarr batt                                                  # annotated portion of the document
T6      Amount 172 179;181 190     foooooo barr batt    # .ann file
T6      Amount 172 179;180 189     foooooo barr batt    # textbound.brat_str()

Notice the lack of a zero-length annotation span in the .ann file and the lack of a double space in the covered text column. I think that, due to this, there is an off-by-one difference in the output of brat_str(): the second span starts at 180 instead of 181.

Same thing happens for document 4693.
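
To make the arithmetic concrete, here is a minimal sketch of the kind of reconstruction that would produce both results above. It is not the actual brat_str() code. The helper names spans_from_fragments and format_spans are hypothetical, and the sketch assumes exactly one skipped character between consecutive text fragments.

```python
def spans_from_fragments(start, fragments, gap=1):
    """Rebuild (start, end) spans from a start offset and text fragments.

    Hypothetical helper: assumes exactly `gap` skipped characters
    (e.g. one newline) between consecutive fragments.
    """
    spans = []
    pos = start
    for frag in fragments:
        spans.append((pos, pos + len(frag)))
        pos += len(frag) + gap
    return spans


def format_spans(spans):
    """Format spans the way they appear in a .ann line: 'a b;c d;...'."""
    return ";".join(f"{s} {e}" for s, e in spans)


# Document 1029: the empty fragment stands in for the zero-length span,
# so the one-character-per-gap assumption holds and the spans match the .ann file.
doc_1029 = format_spans(spans_from_fragments(228, ["foo", "", "barrr bat"]))
print(doc_1029)  # 228 231;232 232;233 242

# Document 4590: no empty fragment, but two newlines were skipped,
# so the second span lands one character early (180 instead of 181).
doc_4590 = format_spans(spans_from_fragments(172, ["foooooo", "barr batt"]))
print(doc_4590)  # 172 179;180 189
```

Under that assumption the 1029 spans come out right only because the empty span absorbs the second newline, while in 4590 nothing does.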

Easiest fix is probably to modify the documents associated with the failure to round-trip. As far as I can tell, modifying my copy of the documents "fixed" the problem. Ideally, I suppose that it'd be best to store the original annotated spans rather than depending on conventions and attempting to reconstruct them in brat_str(). However, I totally understand not doing this since it'd involve bigger changes, the possibility of breaking something that the scoring code depends upon, and it's not clear whether anyone else would get bit by the buggy behavior.
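
For what it's worth, here is a rough sketch of what "store the original annotated spans" could look like. The TextboundLine class is hypothetical and is not part of this repo. It only assumes the standard BRAT standoff layout for text-bound lines (ID, tab, type and spans, tab, covered text).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextboundLine:
    """Hypothetical container that keeps the spans exactly as read from the .ann file."""
    id: str
    type_: str
    spans: List[Tuple[int, int]]
    text: str

    @classmethod
    def parse(cls, line: str) -> "TextboundLine":
        # Layout: "T6<TAB>Amount 172 179;181 190<TAB>covered text"
        id_, type_and_spans, text = line.rstrip("\n").split("\t", 2)
        type_, span_str = type_and_spans.split(" ", 1)
        spans = []
        for part in span_str.split(";"):
            start, end = part.split()
            spans.append((int(start), int(end)))
        return cls(id_, type_, spans, text)

    def to_str(self) -> str:
        # Emit the stored spans verbatim instead of reconstructing them from the text.
        span_str = ";".join(f"{s} {e}" for s, e in self.spans)
        return f"{self.id}\t{self.type_} {span_str}\t{self.text}"


line_4590 = "T6\tAmount 172 179;181 190\tfoooooo barr batt"
print(TextboundLine.parse(line_4590).to_str() == line_4590)
```

Since nothing is recomputed, the line round-trips whether or not the annotator inserted a zero-length span.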

@Lybarger
Owner

Lybarger commented Apr 12, 2022 via email
