It looks to me as if it occurs in both the XML and HTML -- the first instance is fine, but the second instance in paragraph 6 is a ligature in the XML as well.
We shouldn't attempt to fix this with a straightforward normalisation.
None of the Unicode normalisation forms handles both fi and superscripts correctly -- the canonical ones (NFC, NFD) do not get rid of the ligature, and the compatibility ones (NFKC, NFKD) do not preserve the superscript.
Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text.
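To make the trade-off concrete, here is a minimal sketch using Python's standard `unicodedata` module. The sample string is made up for illustration (it is not taken from the judgment); it combines the "fl" ligature (U+FB02) with a superscript two (U+00B2).

```python
import unicodedata

# Hypothetical sample: "fl" ligature (U+FB02) plus superscript two (U+00B2).
sample = "potato \ufb02our, 10\u00b2"

# Apply each of the four normalisation forms to the same input.
results = {
    form: unicodedata.normalize(form, sample)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, text in results.items():
    # ascii() makes any surviving non-ASCII code points visible as escapes.
    print(form, ascii(text))
```

The canonical forms leave both characters untouched, so the ligature survives. The compatibility forms expand the ligature to "fl" but also flatten the superscript to a plain "2", turning "10²" into "102".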
We'd need a carefully nuanced converter to make good decisions about the most popular legacy characters, and I'm wary that some ligatures like ɶ might have specific concrete meanings which would be lost by expansion. (https://caselaw.nationalarchives.gov.uk/ewca/civ/2015/541 talks about pronunciation and uses IPA, but doesn't talk about this sound.)
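As a rough sketch of what a more selective converter might look like (this is an assumption about one possible approach, not something we have implemented): expand only an explicit allowlist of typographic ligatures and leave every other character alone. The names `LIGATURES` and `expand_ligatures` are hypothetical, and the allowlist here is deliberately short.

```python
# Hypothetical allowlist: only the Latin f-ligatures from the
# Alphabetic Presentation Forms block. Anything not listed is left as-is.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# Build the translation table once; str.translate accepts string replacements.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Expand allowlisted ligatures, leaving all other characters untouched."""
    return text.translate(_TABLE)


# Superscript two (U+00B2) and IPA U+0276 are preserved; only U+FB02 is expanded.
cleaned = expand_ligatures("potato \ufb02our 10\u00b2 \u0276")
print(ascii(cleaned))
```

Which characters belong on the allowlist is exactly the judgement call that makes this hard, so the list above should be read as a placeholder rather than a recommendation.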
I don't think we'll fix this one quickly.
Thank you very much for the issue, though -- it's very good to have this in mind, particularly when considering search.
The PDF of [2024] UKFTT 31 (TC) contains a number of instances of the "fl" ligature (U+FB02).
This is seen repeatedly in the phrase "potato flour":
(Attached screencast: Screencast.from.17-01-24.08.49.19.webm)
I do not have access to the original DOCX, although I note the ligature is also present in the PDF judgement on the official Tribunals website.
The ligature is also present in the HTML version but not in the XML version.
I suggest that the text undergoes Unicode Normalisation before a PDF is created.
(Apologies if this isn't the correct repo. Feel free to move it somewhere more suitable.)