Ligatures aren't normalised in PDF or HTML #145

Open
edent opened this issue Jan 17, 2024 · 1 comment

edent commented Jan 17, 2024

The PDF of [2024] UKFTT 31 (TC) contains a number of instances of the "fl" ligature (U+FB02).

This is seen repeatedly in the phrase "potato flour":

[video attachment: Screencast.from.17-01-24.08.49.19.webm]

I do not have access to the original DOCX, although I note the ligature is also present in the PDF judgement on the official Tribunals website.

The ligature is also present in the HTML version but not in the XML version.

I suggest that the text undergoes Unicode Normalisation before a PDF is created.

(Apologies if this isn't the correct repo. Feel free to move it somewhere more suitable.)

Collaborator

dragon-dxw commented Jan 17, 2024

It looks to me as if it occurs in both the XML and the HTML -- the first instance is fine, but the second instance, in paragraph 6, is a ligature in the XML as well.

We shouldn't attempt to fix this with a straightforward normalisation.

[image: figure from section 1.2 of https://unicode.org/reports/tr15/]

None of the transformations handles both the "fi" ligature and superscripts correctly -- the canonical ones (NFC, NFD) do not get rid of the ligature, and the compatibility ones (NFKC, NFKD) do not preserve the superscript.

> Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text.
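To make the trade-off concrete, here is a small sketch using Python's standard `unicodedata` module. The sample string and the `normalise_all` helper are illustrative only and are not taken from the judgment or from this project's code.

```python
import unicodedata

# The four normalisation forms defined by UAX #15.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalise_all(text: str) -> dict[str, str]:
    """Return the text under each normalisation form, keyed by form name."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


# Illustrative sample: the "fl" ligature (U+FB02) followed by "our",
# and an "x" followed by superscript two (U+00B2).
sample = "\ufb02our x\u00b2"
results = normalise_all(sample)

for form, value in results.items():
    # ascii() makes any surviving non-ASCII code points visible as escapes.
    print(form, ascii(value))
```

The canonical forms (NFC, NFD) leave both the ligature and the superscript untouched, while the compatibility forms (NFKC, NFKD) expand the ligature to "fl" but also flatten the superscript to a plain "2".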

We'd need a carefully nuanced converter to make good decisions about the most popular legacy characters, and I'm wary that some ligatures like ɶ might have specific concrete meanings which would be lost by expansion. (https://caselaw.nationalarchives.gov.uk/ewca/civ/2015/541 talks about pronunciation and uses IPA, but doesn't talk about this sound.)
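One possible shape for such a converter, as a rough sketch only: expand an explicit allowlist of Latin presentation-form ligatures and leave everything else alone. The `LIGATURE_EXPANSIONS` table and `expand_ligatures` function are hypothetical names, and which characters belong on the allowlist is an assumption that would need review against real judgments.

```python
# Hypothetical allowlist: typographic f-ligatures and the "st" ligature from
# the Alphabetic Presentation Forms block. Anything not listed here, such as
# superscripts or the IPA letter U+0276, is deliberately left untouched.
LIGATURE_EXPANSIONS = {
    0xFB00: "ff",
    0xFB01: "fi",
    0xFB02: "fl",
    0xFB03: "ffi",
    0xFB04: "ffl",
    0xFB06: "st",
}


def expand_ligatures(text: str) -> str:
    """Replace only the allowlisted ligatures; all other characters pass through."""
    return text.translate(LIGATURE_EXPANSIONS)


# Illustrative input: ligature, superscript two, and the IPA small capital OE.
example = "potato \ufb02our, x\u00b2, \u0276"
print(ascii(expand_ligatures(example)))
```

Unlike NFKC, this keeps the superscript and the IPA letter as they are, at the cost of maintaining the table by hand.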

I don't think we'll fix this one quickly.

Thank you very much for the issue, though -- it's very good to have this in mind, particularly when considering search.
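For the search case specifically, one option (an assumption, not something this project is known to do) is to apply compatibility normalisation only to a derived search key, leaving the stored and displayed text unchanged. The `search_key` and `matches` helpers below are hypothetical.

```python
import unicodedata


def search_key(text: str) -> str:
    """Build a lossy key for matching only; never write this back to the document."""
    # NFKC expands ligatures such as U+FB02; casefold() removes case differences.
    return unicodedata.normalize("NFKC", text).casefold()


def matches(query: str, document_text: str) -> bool:
    """Naive substring match on normalised keys, for illustration."""
    return search_key(query) in search_key(document_text)


# A plain-ASCII query finds text that was stored with the "fl" ligature.
print(matches("potato flour", "Potato \ufb02our"))
```

Because the key is used only for comparison, the formatting distinctions that NFKC erases are still present in the text shown to readers.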
