
Initial README content #3

Closed
twardoch opened this issue Mar 22, 2022 · 1 comment

Comments

@twardoch

Problem

When you choose a font to typeset some text, the very first question that interests you is: which fonts support the language(s) of my text? A font that doesn’t support those languages won’t be of any interest.

But what does it mean, exactly, that a font supports a given language? For Latin-script fonts, the task is reasonably easy and mostly amounts to: does the font have glyphs for all the Unicode codepoints used by the language? In reality, even this isn’t always trivial. To typeset text written in English, it’s not enough that the font has glyphs for the letters A–Z and a–z. It also needs digits and some punctuation. It probably needs some accented letters as well, because you may want to write names like Chloë or Brontë.

But it’s still a relatively easy task to check. The Unicode CLDR project collects “exemplar characters” in several categories. If the font contains glyphs for all these characters, you can say, “OK, this font supports this language”. The Rosetta Type Hyperglot project contains similar information, with some annotations.
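A codepoint-coverage check along these lines is a simple set comparison. The sketch below is illustrative: the exemplar set is made up for this example (it is not real CLDR data), and the comment about fontTools shows one hypothetical way to obtain a character map.

```python
def missing_exemplars(cmap, exemplars):
    """Return the exemplar characters NOT covered by the font's character map.

    cmap: set of supported Unicode codepoints (ints) -- e.g. obtainable
          via fontTools' font.getBestCmap().keys() (hypothetical usage).
    exemplars: iterable of characters the language requires.
    """
    return {ch for ch in exemplars if ord(ch) not in cmap}

# Illustrative exemplar set for English (not the real CLDR data):
english = set("abcdefghijklmnopqrstuvwxyz") | \
          set("ABCDEFGHIJKLMNOPQRSTUVWXYZ") | {"ë"}

# A font that covers only the basic Latin letters:
cmap = {ord(c) for c in
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"}

missing = missing_exemplars(cmap, english)
# "ë" is missing, so this font cannot typeset "Chloë" or "Brontë"
```

If `missing` is empty, the font passes the coverage test for that exemplar set; otherwise the missing characters tell you exactly what to report.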

Rationale behind Shaperglot

But this approach does not work for scripts that need “shaping”, a process that maps the input Unicode codepoints of the text into a series of glyphs in a way that is not a 1:1 correspondence. For scripts like Arabic or Devanagari, it’s not enough to check if the font has default glyphs for all Unicode codepoints from some set. You also need to check if the font has rules (features) that perform the shaping, so that the final rendered text is orthographically correct.

Shaperglot checks Unicode coverage, but also supports other tests. In particular, the idea is that:

  • you feed in a specially prepared text (a string of characters) and the font
  • you get the default series of glyphs
  • you run HarfBuzz and observe whether something changed (the final series of glyphs differs from the default series), and what specifically changed

The fact that a change happened indicates that there is some support for a language beyond just the Unicode codepoint coverage.

For example, if I put in the default i and apply the locl feature with the script tag latn and the language tag TRK, and I see that the output glyph (or series) is different from the input, I can say with higher certainty, “this font supports Turkish”.
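The shape-and-compare idea can be sketched with the uharfbuzz Python bindings. This is a minimal sketch, not Shaperglot’s actual implementation: it assumes uharfbuzz is installed, and the font path in the usage comment is hypothetical.

```python
from pathlib import Path


def glyph_stream_differs(before, after):
    """True if two shaped glyph sequences differ in glyph IDs or advances."""
    return before != after


def feature_changes_shaping(font_path, text, features, language=None):
    """Shape `text` with and without `features` and compare the results.

    Requires the uharfbuzz package; `font_path` points at a local font file.
    """
    import uharfbuzz as hb

    blob = hb.Blob(Path(font_path).read_bytes())
    font = hb.Font(hb.Face(blob))

    def shape(feats):
        buf = hb.Buffer()
        buf.add_str(text)
        buf.guess_segment_properties()
        if language:
            buf.language = language
        hb.shape(font, buf, feats)
        # Record glyph IDs plus advances, so positioning-only
        # changes also count as "something changed".
        return [(info.codepoint, pos.x_advance)
                for info, pos in zip(buf.glyph_infos, buf.glyph_positions)]

    return glyph_stream_differs(shape({}), shape(features))


# Hypothetical usage for the Turkish test described above:
# feature_changes_shaping("MyFont.ttf", "i", {"locl": True}, language="tr")
# True would suggest the font has Turkish-specific behaviour for "i".
```

A `True` result does not prove the substitution is *correct* — only that the feature does something, which is exactly the “change happened” signal discussed above.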

Shaperglot will not (yet ;) ) use computer vision to judge the quality of the change, but it’s based on a very reasonable assumption: if I put in some letter and ask HarfBuzz to apply a certain feature, and the result is the same as the input, then the feature is not meaningfully implemented, hence there is a problem.

The advantage of the Shaperglot approach is that the tests can be complex. Sometimes the meaningful change will come about only from a combination of certain features, not just one feature. Or there may be alternatives: some fonts may implement something via liga, while others implement the same thing via ccmp or calt. So the test may ask for all three features to be applied and check if something changed.
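Such a combination test only needs the comparison logic on top of any shaping backend. In this sketch, `shape(text, features)` stands for any callable that returns a glyph sequence (e.g. a wrapper around HarfBuzz); the feature combinations listed are the liga/ccmp/calt alternatives from the text.

```python
def any_combination_changes(shape, text, feature_sets):
    """True if any feature combination alters the default shaping of `text`.

    `shape(text, features)` is any callable returning a glyph sequence,
    e.g. a wrapper around a HarfBuzz binding (hypothetical here).
    """
    default = shape(text, {})
    return any(shape(text, feats) != default for feats in feature_sets)


# Try all three features together, then each one alone:
combos = [
    {"liga": True, "ccmp": True, "calt": True},
    {"liga": True},
    {"ccmp": True},
    {"calt": True},
]
```

Because only the *difference* matters, the same test passes regardless of which of the alternative features a particular font happens to use.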

Shaperglot has example implementations of tests for some languages, but needs more data.

In the future, additional, more sophisticated tests can be implemented. Test-driven development can help produce better fonts, and can also help provide better information about language support.

@simoncozens
Collaborator

Thanks for this. I wrote something similar independently, and (finally) integrated it.

simoncozens added a commit that referenced this issue Mar 24, 2023
* Update checker.py

Added mark2base test that uses the serialized buffer to see if a mark has a GPOS shift if placed after a target base mark.

* Use shaper to check whether glyphs exist, see #7

* Add youseedee to requirements

* Fix some lints

* Read your own config file, pylint

* More pylint fixes

* Pin protobuf dependency

* Further poetry dependency fixes

* Cache shaping

* Fix error message

* Implement an "unknown" state

* Implement the "report" option

* Speed up the mark checker

* Don't GSUB closure on pathological fonts

* Make pylint happier

* Make result status machine readable

* A new test for unencoded glyph variants. Fixes #8

* Use the language tag from the language we're checking

* Skip tests based on certain conditions (missing features), fixes #11

* Make linter happier

* Update orthographies check to include auxiliary chars

There is probably a more elegant way to implement this but I have merged auxiliary characters into the bases for the orthographies check. For the purposes of language support testing base and auxiliary characters need to be included to ensure loan words, names and place names can all be typed for a given language.

* Improve error messages

* Add Neil's work

* Pylint stuff

* Update shaping_differs.py

Fixed a TypeError caused by trying to concatenate YAML to str

* Make non-verbose less verbose

* Transfer IP to Google

---------

Co-authored-by: Simon Cozens <[email protected]>
Co-authored-by: Dave Crossland <[email protected]>