Ligatures aren't normalised in PDF or HTML #145

edent · 2024-01-17T08:57:07Z

The PDF of [2024] UKFTT 31 (TC) contains a number of instances of the "ﬂ" ligature (U+FB02).

This is seen repeatedly in the phrase "potato ﬂour":

[Attached screencast: Screencast.from.17-01-24.08.49.19.webm]

I do not have access to the original DOCX, although I note the ligature is also present in the PDF judgement on the official Tribunals website.

The ligature is also present in the HTML version but not in the XML version.

I suggest that the text undergoes Unicode Normalisation before a PDF is created.
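As a rough illustration of that suggestion, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `normalise_text` is hypothetical and is not taken from the project's code; NFKC is assumed because it is one of the forms that expands compatibility ligatures.

```python
import unicodedata


def normalise_text(text: str) -> str:
    """Apply NFKC, a compatibility normalisation that expands ligatures."""
    return unicodedata.normalize("NFKC", text)


# U+FB02 is the single-character "fl" ligature reported in this issue.
sample = "potato \ufb02our"
print(normalise_text(sample))  # potato flour
```

This expands the ligature into the two letters "f" and "l", but it also rewrites every other compatibility character in the text, so it is only a starting point.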

(Apologies if this isn't the correct repo. Feel free to move it somewhere more suitable.)

dragon-dxw · 2024-01-17T10:47:29Z

It looks to me as if it occurs in both the XML and the HTML -- the first instance is fine, but the second instance, in paragraph 6, is a ligature in the XML as well.

We shouldn't attempt to fix this with a straightforward normalisation.

(from section 1.2 of https://unicode.org/reports/tr15/)

None of the transformations handles both the fi ligature and superscripts correctly: the canonical forms (NFC and NFD) do not get rid of the ligature, and the compatibility forms (NFKC and NFKD) do not preserve the superscript.

Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text.
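The trade-off can be checked with a short sketch, again assuming Python's standard `unicodedata` module. The sample string is made up for illustration: it combines the fi ligature (U+FB01) with a superscript five (U+2075).

```python
import unicodedata

# "\ufb01" is the fi ligature; "\u2075" is superscript five, so this reads "five 2^5".
sample = "\ufb01ve 2\u2075"

# Apply each of the four normalisation forms to the same input.
results = {
    form: unicodedata.normalize(form, sample)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, value in results.items():
    print(form, value)
```

NFC and NFD leave the string unchanged, ligature included. NFKC and NFKD expand the ligature but also turn the superscript into an ordinary digit, giving "five 25".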

We'd need a carefully nuanced converter to make good decisions about the most popular legacy characters, and I'm wary that some ligatures like ɶ might have specific concrete meanings which would be lost by expansion. (https://caselaw.nationalarchives.gov.uk/ewca/civ/2015/541 talks about pronunciation and uses IPA, but doesn't talk about this sound.)
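One possible shape for such a converter, purely as a sketch: expand only an explicit allow-list of Latin presentation-form ligatures and leave everything else untouched. The table `LIGATURES` and the function `expand_ligatures` are hypothetical names, and the choice of which characters belong on the list is an assumption that would need review.

```python
# Hypothetical allow-list: only the Latin f-ligatures U+FB00 to U+FB04.
LIGATURES = {
    0xFB00: "ff",
    0xFB01: "fi",
    0xFB02: "fl",
    0xFB03: "ffi",
    0xFB04: "ffl",
}


def expand_ligatures(text: str) -> str:
    """Expand only the listed ligatures; every other character is kept as-is."""
    return text.translate(LIGATURES)


# The superscript (U+2075) and the IPA letter (U+0276) pass through unchanged.
print(expand_ligatures("potato \ufb02our 2\u2075 \u0276"))
```

Unlike NFKC, this leaves superscripts and IPA letters alone, at the cost of maintaining the list by hand.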

I don't think we'll fix this one quickly.

Thank you very much for the issue, though -- it's very good to have this in mind, particularly when considering search.
