
Initial README content #3

Closed
twardoch opened this issue Mar 22, 2022 · 1 comment

Comments

@twardoch

Problem

When you choose a font to typeset some text, the very first question that interests you is: which fonts support the language(s) of my text? A font that doesn’t support those languages won’t be of any interest.

But what does it mean, exactly, that a font supports a given language? For Latin-script fonts, the task is reasonably easy and mostly amounts to: does the font have glyphs for all the Unicode codepoints used by the language? In reality, even this isn’t always trivial. To typeset text written in English, it’s not enough that the font has glyphs for the letters A–Z and a–z. It also needs digits and some punctuation. It probably needs some accented letters as well, because you may want to write names like Chloë or Brontë.

But it’s still a relatively easy task to check. The Unicode CLDR project collects “exemplar characters” in several categories. If the font contains glyphs for all these characters, you can say, “OK, this font supports this language”. The Rosetta Type Hyperglot project contains similar information, with some annotations.
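A codepoint-coverage check along these lines is a simple set comparison. The sketch below is illustrative: the exemplar set is made up for this example (it is not real CLDR data), and the comment about fontTools shows one hypothetical way to obtain a character map.

```python
def missing_exemplars(cmap, exemplars):
    """Return the exemplar characters NOT covered by the font's character map.

    cmap: set of supported Unicode codepoints (ints) -- e.g. obtainable
          via fontTools' font.getBestCmap().keys() (hypothetical usage).
    exemplars: iterable of characters the language requires.
    """
    return {ch for ch in exemplars if ord(ch) not in cmap}

# Illustrative exemplar set for English (not the real CLDR data):
english = set("abcdefghijklmnopqrstuvwxyz") | \
          set("ABCDEFGHIJKLMNOPQRSTUVWXYZ") | {"ë"}

# A font that covers only the basic Latin letters:
cmap = {ord(c) for c in
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"}

missing = missing_exemplars(cmap, english)
# "ë" is missing, so this font cannot typeset "Chloë" or "Brontë"
```

If `missing` is empty, the font passes the coverage test for that exemplar set; otherwise the missing characters tell you exactly what to report.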

Rationale behind Shaperglot

But this approach does not work for scripts that need “shaping”, a process that maps the input Unicode codepoints of the text into a series of glyphs in a way that is not a 1:1 correspondence. For scripts like Arabic or Devanagari, it’s not enough to check if the font has default glyphs for all Unicode codepoints from some set. You also need to check if the font has rules (features) that perform the shaping, so that the final rendered text is orthographically correct.

Shaperglot checks Unicode coverage, but also supports other tests. In particular, the idea is that:

  • you feed in a specially prepared text (a string of characters) and the font
  • you get the default series of glyphs
  • you run HarfBuzz and observe whether something changed (the final series of glyphs differs from the default series), and what specifically changed

The fact that a change happened indicates that there is some support for a language beyond just the Unicode codepoint coverage.

For example, if I put in the default i and apply the locl feature with the script tag latn and the language tag TRK, and I see that the output glyph (or series) is different from the input, I can say with higher certainty, “this font supports Turkish”.
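The shape-and-compare idea can be sketched with the uharfbuzz Python bindings. This is a minimal sketch, not Shaperglot’s actual implementation: it assumes uharfbuzz is installed, and the font path in the usage comment is hypothetical.

```python
from pathlib import Path


def glyph_stream_differs(before, after):
    """True if two shaped glyph sequences differ in glyph IDs or advances."""
    return before != after


def feature_changes_shaping(font_path, text, features, language=None):
    """Shape `text` with and without `features` and compare the results.

    Requires the uharfbuzz package; `font_path` points at a local font file.
    """
    import uharfbuzz as hb

    blob = hb.Blob(Path(font_path).read_bytes())
    font = hb.Font(hb.Face(blob))

    def shape(feats):
        buf = hb.Buffer()
        buf.add_str(text)
        buf.guess_segment_properties()
        if language:
            buf.language = language
        hb.shape(font, buf, feats)
        # Record glyph IDs plus advances, so positioning-only
        # changes also count as "something changed".
        return [(info.codepoint, pos.x_advance)
                for info, pos in zip(buf.glyph_infos, buf.glyph_positions)]

    return glyph_stream_differs(shape({}), shape(features))


# Hypothetical usage for the Turkish test described above:
# feature_changes_shaping("MyFont.ttf", "i", {"locl": True}, language="tr")
# True would suggest the font has Turkish-specific behaviour for "i".
```

A `True` result does not prove the substitution is *correct* — only that the feature does something, which is exactly the “change happened” signal discussed above.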

Shaperglot will not (yet ;) ) use computer vision to judge the quality of the change, but it’s based on a very reasonable assumption: if I put in some letter and ask HarfBuzz to apply a certain feature, and the result is the same as the input, then the feature is not meaningfully implemented, hence there is a problem.

The advantage of the Shaperglot approach is that the tests can be complex. Sometimes the meaningful change will come about only from a combination of certain features, not just one feature. Or there may be alternatives: some fonts may implement something via liga, while others implement the same thing via ccmp or calt. So the test may ask for all three features to be applied and check if something changed.
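Such a combination test only needs the comparison logic on top of any shaping backend. In this sketch, `shape(text, features)` stands for any callable that returns a glyph sequence (e.g. a wrapper around HarfBuzz); the feature combinations listed are the liga/ccmp/calt alternatives from the text.

```python
def any_combination_changes(shape, text, feature_sets):
    """True if any feature combination alters the default shaping of `text`.

    `shape(text, features)` is any callable returning a glyph sequence,
    e.g. a wrapper around a HarfBuzz binding (hypothetical here).
    """
    default = shape(text, {})
    return any(shape(text, feats) != default for feats in feature_sets)


# Try all three features together, then each one alone:
combos = [
    {"liga": True, "ccmp": True, "calt": True},
    {"liga": True},
    {"ccmp": True},
    {"calt": True},
]
```

Because only the *difference* matters, the same test passes regardless of which of the alternative features a particular font happens to use.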

Shaperglot has example implementations of tests for some languages, but needs more data.

In the future, additional, more sophisticated tests can be implemented. Test-driven development can help produce better fonts, and can also help provide better information about language support.

@simoncozens
Collaborator

Thanks for this. I wrote something similar independently, and (finally) integrated it.

simoncozens added a commit that referenced this issue Mar 24, 2023
* Update checker.py

Added mark2base test that uses the serialized buffer to see if a mark has a GPOS shift if placed after a target base mark.

* Use shaper to check whether glyphs exist, see #7

* Add youseedee to requirements

* Fix some lints

* Read your own config file, pylint

* More pylint fixes

* Pin protobuf dependency

* Further poetry dependency fixes

* Cache shaping

* Fix error message

* Implement an "unknown" state

* Implement the "report" option

* Speed up the mark checker

* Don't GSUB closure on pathological fonts

* Make pylint happier

* Make result status machine readable

* A new test for unencoded glyph variants. Fixes #8

* Use the language tag from the language we're checking

* Skip tests based on certain conditions (missing features), fixes #11

* Make linter happier

* Update orthographies check to include auxiliary chars

There is probably a more elegant way to implement this but I have merged auxiliary characters into the bases for the orthographies check. For the purposes of language support testing base and auxiliary characters need to be included to ensure loan words, names and place names can all be typed for a given language.

* Improve error messages

* Add Neil's work

* Pylint stuff

* Update shaping_differs.py

Fixed a TypeError caused by trying to concatenate YAML to str

* Make non-verbose less verbose

* Transfer IP to Google

---------

Co-authored-by: Simon Cozens <[email protected]>
Co-authored-by: Dave Crossland <[email protected]>