I noticed that the string produced by textbound.brat_str() doesn't match the original line in the BRAT .ann file for a couple of documents in the n2c2 track 2 dataset: 4590 and 4693. The result is an off-by-one error in the split spans defining the covered tokens.
It looks like brat_str() and related functions depend on some particular conventions when there are newlines between annotated tokens. When there is more than one newline, it looks like brat_str() needs an empty span to work correctly. Here's one case from document 1029:
foo\n\nbarrr bat # annotated portion of the document
T6 History 228 231;232 232;233 242 foo  barrr bat # .ann file
Note that there is a zero-length annotation span in there: (232, 232). And there are two spaces between "foo" and "barrr" in the covered text column corresponding to the two skipped newline characters.
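However, in the problematic document 4590:
foooooo\n\nbarr batt # annotated portion of the document
T6 Amount 172 179;181 190 foooooo barr batt # .ann file
T6 Amount 172 179;180 189 foooooo barr batt # textbound.brat_str()
Notice the lack of a zero-length annotation span in the .ann file and the lack of a double space in the covered text column. I think that, due to this, there is an off-by-one difference in the output of brat_str().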
Same thing happens for document 4693.
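To make the two conventions concrete, here is a minimal sketch of how the covered text column relates to the spans, assuming the usual BRAT behavior of joining the text of each fragment with a single space. The helpers covered_text() and make_document() are hypothetical names for illustration and are not part of brat_scoring; the padded documents are synthetic stand-ins for the real notes.

```python
def covered_text(document: str, spans: list) -> str:
    """Join the text of each (start, end) fragment with a single space."""
    return " ".join(document[start:end] for start, end in spans)


def make_document(offset: int, snippet: str) -> str:
    """Pad a snippet with filler so it starts at the given character offset."""
    return "x" * offset + snippet


# Convention that round-trips: an empty fragment (232, 232) sits between
# the two newlines, which yields two spaces in the covered text.
doc_a = make_document(228, "foo\n\nbarrr bat")
text_a = covered_text(doc_a, [(228, 231), (232, 232), (233, 242)])

# Document 4590 style: no empty fragment, so only one space appears.
doc_b = make_document(172, "foooooo\n\nbarr batt")
text_b = covered_text(doc_b, [(172, 179), (181, 190)])

# Spans shifted one character left, as in the reported brat_str() output:
# the second fragment now starts on a newline and loses its last character.
text_shifted = covered_text(doc_b, [(172, 179), (180, 189)])

print(repr(text_a))
print(repr(text_b))
print(repr(text_shifted))
```

The shifted spans no longer cover the same characters, which is why the mismatch matters for anything that compares spans exactly.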
Easiest fix is probably to modify the documents associated with the failure to round-trip. As far as I can tell, modifying my copy of the documents "fixed" the problem. Ideally, I suppose that it'd be best to store the original annotated spans rather than depending on conventions and attempting to reconstruct them in brat_str(). However, I totally understand not doing this since it'd involve bigger changes, the possibility of breaking something that the scoring code depends upon, and it's not clear whether anyone else would get bit by the buggy behavior.
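For what it's worth, the "store the original spans" idea could look roughly like the sketch below. This is only an illustration under my own assumptions: TextboundRecord, parse_textbound_line(), and to_ann_line() are hypothetical names, not the existing brat_scoring API, and the parser assumes the standard tab-separated layout of a text-bound line (ID, then type and spans, then covered text).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextboundRecord:
    """Hypothetical container; not the brat_scoring Textbound class."""
    ident: str
    type_: str
    spans: tuple  # ((start, end), ...) exactly as written in the .ann file
    text: str


def parse_textbound_line(line: str) -> TextboundRecord:
    """Parse one text-bound line, keeping the spans verbatim."""
    ident, type_and_spans, text = line.rstrip("\n").split("\t", 2)
    type_, span_field = type_and_spans.split(" ", 1)
    spans = tuple(
        tuple(int(n) for n in fragment.split())
        for fragment in span_field.split(";")
    )
    return TextboundRecord(ident, type_, spans, text)


def to_ann_line(record: TextboundRecord) -> str:
    """Emit the stored spans as-is instead of reconstructing them."""
    span_field = ";".join(f"{start} {end}" for start, end in record.spans)
    return f"{record.ident}\t{record.type_} {span_field}\t{record.text}"


# Round-trip check on both span conventions from this issue.
lines = [
    "T6\tHistory 228 231;232 232;233 242\tfoo  barrr bat",
    "T6\tAmount 172 179;181 190\tfoooooo barr batt",
]
round_tripped = [to_ann_line(parse_textbound_line(line)) for line in lines]
print(round_tripped == lines)
```

Since nothing is re-derived from the document text, the output does not depend on whether an empty span was used for the extra newline.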
When I was putting together the scoring script, I cannibalized some code from another project. As part of this cannibalization, I only tested the scoring function. I did not run any additional tests on the portions of the code that export to *.ann files.
I can take a look next week and get back to you.
Best regards,
Kevin
From: Brian Romanowski ***@***.***>
Sent: Monday, April 11, 2022 4:28 PM
To: Lybarger/brat_scoring ***@***.***>
Cc: Subscribed ***@***.***>
Subject: [Lybarger/brat_scoring] Textbound.brat_str() / textbound_str() does not always round-trip when split spans are separated by more than one newline (Issue #1)
I noticed that the string produced by textbound.brat_str() doesn't match the original line in the BRAT .ann file for a couple of documents in the n2c2 track 2 dataset: 4590 and 4693. The result is an off-by-one error in the split spans defining the covered tokens.
It looks like brat_str() and related functions depend on some particular conventions when there are newlines between annotated tokens. When there is more than one newline, it looks like brat_str() needs an empty span to work correctly. Here's one case from document 1029:
foo\n\nbarrr bat # annotated portion of the document
T6 History 228 231;232 232;233 242 foo  barrr bat # .ann file
Note that there is a zero-length annotation span in there: (232, 232). And there are two spaces between "foo" and "barrr" in the covered text column corresponding to the two skipped newline characters.
However, in the problematic document 4590:
foooooo\n\nbarr batt # annotated portion of the document
T6 Amount 172 179;181 190 foooooo barr batt # .ann file
T6 Amount 172 179;180 189 foooooo barr batt # textbound.brat_str()
Notice the lack of a zero-length annotation span in the .ann file and the lack of a double space in the covered text column. I think that, due to this, there is an off-by-one difference in the output of brat_str().
Same thing happens for document 4693.
Easiest fix is probably to modify the documents associated with the failure to round-trip. As far as I can tell, modifying my copy of the documents "fixed" the problem. Ideally, I suppose that it'd be best to store the original annotated spans rather than depending on conventions and attempting to reconstruct them in brat_str(). However, I totally understand not doing this since it'd involve bigger changes, the possibility of breaking something that the scoring code depends upon, and it's not clear whether anyone else would get bit by the buggy behavior.