Textbound.brat_str() / textbound_str() does not always round-trip when split spans are separated by more than one newline #1

brian-romanowski-nuance · 2022-04-11T23:27:55Z

I noticed that the string produced by textbound.brat_str() doesn't match the original line in the BRAT .ann file for a couple of documents in the n2c2 track 2 dataset: 4590 and 4693. The result is an off-by-one error in the split spans defining the covered tokens.

It looks like brat_str() and related functions depend on some particular conventions when there are newlines between annotated tokens. When there is more than one newline, it looks like brat_str() needs an empty span to work correctly. Here's one case from document 1029:

foo\n\nbarrr bat                                                             # annotated portion of the document
T6      History 228 231;232 232;233 242 foo  barrr bat       # .ann file

Note that there is a zero-length annotation span in there: (232, 232). And there are two spaces between "foo" and "barrr" in the covered text column corresponding to the two skipped newline characters.

However, in the problematic document 4590:

foooooo\n\nbarr batt                                                  # annotated portion of the document
T6      Amount 172 179;181 190     foooooo barr batt    # .ann file
T6      Amount 172 179;180 189     foooooo barr batt    # textbound.brat_str()

Notice that the .ann file has no zero-length annotation span and no double space in the covered text column. I think this is what causes the off-by-one difference in the output of brat_str().

Same thing happens for document 4693.
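
Both cases above are consistent with a reconstruction that assumes exactly one separator character between consecutive fragments. The sketch below only illustrates that assumption. It is not the actual brat_scoring code, and `reconstruct_spans` and its arguments are hypothetical names.

```python
from typing import List, Tuple


def reconstruct_spans(start: int, fragments: List[str]) -> List[Tuple[int, int]]:
    """Rebuild split spans from a start offset and the text fragments.

    Hypothetical helper: assumes exactly ONE skipped character (e.g. a
    newline) between consecutive fragments, which is the convention the
    issue suspects brat_str() relies on.
    """
    spans = []
    pos = start
    for fragment in fragments:
        end = pos + len(fragment)
        spans.append((pos, end))
        pos = end + 1  # skip a single separator character
    return spans


# Document 1029: the empty middle fragment absorbs the second newline,
# so the reconstructed spans match the .ann file.
ok = reconstruct_spans(228, ["foo", "", "barrr bat"])
print(ok)  # [(228, 231), (232, 232), (233, 242)]

# Document 4590: no empty fragment, but two newlines in the document,
# so the second span comes out one character early.
bad = reconstruct_spans(172, ["foooooo", "barr batt"])
print(bad)  # [(172, 179), (180, 189)]; the .ann file has (181, 190)
```

Under this assumption, a gap of two newlines only round-trips when the annotation carries a zero-length span for the extra newline.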

Easiest fix is probably to modify the documents associated with the failure to round-trip. As far as I can tell, modifying my copy of the documents "fixed" the problem. Ideally, I suppose that it'd be best to store the original annotated spans rather than depending on conventions and attempting to reconstruct them in brat_str(). However, I totally understand not doing this since it'd involve bigger changes, the possibility of breaking something that the scoring code depends upon, and it's not clear whether anyone else would get bit by the buggy behavior.
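
For illustration, here is a minimal sketch of the "store the original spans" idea. `SpanPreservingTextbound` is a hypothetical class, not part of brat_scoring. It assumes only the standard BRAT standoff layout for text-bound lines: ID, tab, type and offsets, tab, covered text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpanPreservingTextbound:
    """Hypothetical text-bound annotation that keeps the spans as read."""

    id: str
    type_: str
    spans: List[Tuple[int, int]]
    text: str

    @classmethod
    def from_ann_line(cls, line: str) -> "SpanPreservingTextbound":
        # Layout: "T6<TAB>Amount 172 179;181 190<TAB>covered text"
        id_, type_and_spans, text = line.rstrip("\n").split("\t", 2)
        type_, span_str = type_and_spans.split(" ", 1)
        spans = []
        for chunk in span_str.split(";"):
            start, end = chunk.split()
            spans.append((int(start), int(end)))
        return cls(id_, type_, spans, text)

    def brat_str(self) -> str:
        # Emit the stored spans verbatim instead of reconstructing them.
        span_str = ";".join(f"{start} {end}" for start, end in self.spans)
        return f"{self.id}\t{self.type_} {span_str}\t{self.text}"


line = "T6\tAmount 172 179;181 190\tfoooooo barr batt"
tb = SpanPreservingTextbound.from_ann_line(line)
print(tb.brat_str() == line)
```

Because the offsets are never recomputed, the output does not depend on how many newlines separate the fragments. The cost is that any code which edits the text or offsets has to keep `spans` in sync.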


Lybarger · 2022-04-12T01:00:08Z

When I was putting together the scoring script, I cannibalized some code from another project. As part of this cannibalization, I only tested the scoring function. I did not run any additional tests on the portions of the code that export to *.ann files. I can take a look next week and get back to you.

Best regards,
Kevin
