[glyphsets] Language definition overhaul #109

Merged
63 commits merged into main on Dec 1, 2023

Conversation

@yanone (Collaborator) commented Jun 1, 2023

This PR follows through with the overhaul of the character sets in gfglyphsets as outlined here.

@moyogo (Contributor) left a comment


It looks like Maltese, Latvian and Icelandic are under 5,000,000 speakers and should not be included. Either their inclusion needs to be hard-coded, or they should be excluded, or the requirements need to be adjusted.

For Lithuanian, it could be covered without the Lithuanian dictionary notation (which includes the soft-dotted I stuff) being covered, if the threshold were lowered to include those languages.

Not sure why Bavarian is commented out.

@yanone (Collaborator, Author) commented Jun 5, 2023

@moyogo I removed languages under 5M from Core, which also included Lithuanian (2.3M).

@RosaWagner RosaWagner changed the title from "Language definition overhaul" to "[glyphsets] Language definition overhaul" on Jun 21, 2023
@RosaWagner (Contributor)

Since Maltese, Latvian and Icelandic are the "main/primary" languages of countries in Europe, I would include them in Core. I don't know if that means the threshold should be moved or if these languages should just be included. We could try to see what happens if we move the threshold to 2M and decide from there.

@moyogo (Contributor) commented Nov 13, 2023

Personally I think everything should be in one place. It's really inconvenient to have to change 3 different repositories for each language.

@yanone I’m not suggesting having language-specific YAML files.
I’m suggesting keeping glyphset-specific info in glyphset definition files.

Considering that having multiple sources for the data is an issue, I’d suggest replacing the .stub.nam files with data in the same glyphset definition YAML files.

Something like:

stub:
  - 0x0024 DOLLAR SIGN
  - etc.
language_codes:
  - ca_Latn  # Catalan
  - cs_Latn  # Czech
  - etc.

If the unencoded glyphs are in the glyphset definition file then something like the following could be in there as well:

unencoded_glyphs:
  - periodcentered.loclCAT
  - periodcentered.loclCAT.case
  - etc.

If we do want the info in language data, it needs to be more flexible than glyph names.
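(For illustration, a consolidated definition like the one proposed above could be consumed with a few lines of Python. This is only a sketch: the file name, the exact keys, and PyYAML as the parser are assumptions made for this example, not an existing glyphsets API.)

import yaml  # PyYAML

# Hypothetical file following the structure proposed in this comment
with open("GF_Latin_Core.yaml") as f:
    definition = yaml.safe_load(f)

# "0x0024 DOLLAR SIGN" -> 0x0024 (the "- etc." placeholders above are not handled; a real file wouldn't contain them)
stub_codepoints = {int(entry.split()[0], 16) for entry in definition.get("stub", [])}

# Inline "# Catalan"-style comments are stripped by the YAML parser
language_codes = definition.get("language_codes", [])
unencoded_glyphs = definition.get("unencoded_glyphs", [])

print(len(stub_codepoints), "stub codepoints,", len(language_codes), "languages,", len(unencoded_glyphs), "unencoded glyphs")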

@yanone (Collaborator, Author) commented Nov 21, 2023

Bringing in @vv-monsalve here.

One of the reasons why we thought of the approach of putting glyph names into gflanguages is the support of some African languages that are largely unencoded. At least that's what I understood from Viviana. Putting large sets of unencoded glyphs that are language-specific into the definitions of gflanguages is the only way that makes sense to me, even though I understand that it's unsexy to put glyph names there, because they are/can be application-specific.

Viviana (or Denis), could you please provide an example for such language-specific glyphs other than the European ones I've already singled out? This is for my own understanding.

If it's indeed necessary to have unencoded glyph names in language definitions, I would ask to move forward with the proposal.

@moyogo (Contributor) commented Nov 22, 2023

Viviana (or Denis), could you please provide an example for such language-specific glyphs other than the European ones I've already singled out? This is for my own understanding.

@yanone These depend on the glyphset, the design of the glyphs themselves and the scope or target of the font (for Latin):

  • a
  • Alpha-latin, alpha-latin
  • Bhook
  • Bstroke and bstroke
  • Dhook
  • Dstroke and dstroke
  • f
  • fhook
  • Gstroke and gstroke
  • Gamma-latin
  • Hstroke and hstroke
  • istroke
  • lambdastroke
  • Nhookleft
  • Eng
  • Esh
  • Vhook and vhook
  • Ezh
  • Lcommaaccent, lcommaaccent, Ncommaaccent, ncommaaccent
  • Scedilla, scedilla, Tcedilla, tcedilla
  • kip
  • Any of the letters with circumflex and another top mark above
  • Any of the letters with ogonek
  • likely some others I’m forgetting

There are more for other writing systems (for Cyrillic, for example: be-cy, te-cy, sha-cy, pe-cy, ge-cy, gje-cy, gebar-cy, de-cy, ka-cy, zhe-cy, fi-cy, softsign-cy, hardsign-cy, gedescender-cy, gestrokehook-cy, tshe-cy, etc.).

@vv-monsalve (Contributor)

could you please provide an example for such language-specific glyphs

Currently, we are adding mainly the SSA languages. But if we eventually add e.g. an indigenous language like Piaroa, it would require glyphs like:

  • acedilla, icedilla, ocedilla, ucedilla
  • adieresiscedilla, odieresiscedilla, udieresiscedilla
  • uacutecedilla

@moyogo (Contributor) commented Nov 22, 2023

@yanone @vv-monsalve Did you mean multiple-to-one glyphs or language specific alternate single glyphs by "unencoded glyphs"?

The multiple-to-one glyphs are already listed in the language data.
They can be handled with mark positioning and contextual kerning, but if that doesn’t work, composite glyphs can be used. These can be derived from the language data exemplars, at least for Latin.

For example, a multiple-to-one glyph like a_cedilla is found as "{a̧}" in the exemplar characters of the languages using it.

An alternate single glyph is something like bstroke.alt (or, with a clearer name, bstroke.midoverlaystroke, or for Glyphs.app bstroke.EMPPLG0) for the languages that would use a glyph distinct from the default (for example when the default bstroke has its stroke through the ascender), for the given glyphset and scope.

@vv-monsalve (Contributor)

Did you mean multiple-to-one glyphs or language specific alternate single glyphs by "unencoded glyphs"?

I think I had in mind only the multiple-to-one.

For example, a multiple-to-one glyph like a_cedilla is found as "{a̧}" in the exemplar characters of the languages using it.

I was not aware of this option in the language definition. Would it also work for more complex or unusual combinations like ubartilde? For example, I haven't managed to copy/paste that glyph, dynamically composed, from InDesign into a .textproto file.
[Screenshot: ʉ̃ dynamically composed in InDesign]

If we can include these glyphs in the exemplar_chars in the language definitions, we can easily create a new .textproto file for the desired language and then include the language in the .yaml file. However, I'm not sure if we can create and add 'language.textproto' files for languages that aren't included in Unicode. Is our gflanguages data related to this?

An alternate single glyph is something like bstroke.alt

This seems to be more related to each font design, so I don't know if we should/can add them in the Glyphsets definitions.

This brings us back to what the goals are here. From what I understand, it is to have a centralized tool to create all the necessary files (from Glyphs to Glyphsets or nam) that define the required glyphs for a font's language support.

If the intention with, for example, the .stub.glyphs file is to create customized .plist files for projects, then yes, these alternate glyphs should be listed there.

@moyogo (Contributor) commented Nov 23, 2023

I was not aware of this option in the language definition. Would it also work for more complex or unusual combinations like ubartilde? For example, I haven't managed to copy/paste that glyph, dynamically composed, from InDesign into a .textproto file.

@yanone How did you get ʉ̃ in the InDesign document in the first place? If it was with the glyph palette and the glyph wasn’t named properly (uni0289_tildecomb) in the font used then InDesign may not be able to know what text string that glyph represents.

I’d recommend using an input tool like https://r12a.github.io/uniview/.

On macOS, you can use Edit > Emoji & Symbols from the menu (but Ctrl+Cmd+Space works better) to insert any character if you know or can guess the Unicode character name. In this case, searching "u bar" shows ʉ and searching "combining tilde" shows the combining tilde. It shows the character name when you hover, in case there are multiple characters that look like a match.
[Screenshots: macOS character viewer searching for "u bar" and "combining tilde"]

The better input method is to use a proper keyboard layout. For Piaroa, there is https://keyman.com/keyboards/pid_piaroa; the user can use AltGr+a AltGr+, to input ä̧, for example.

If we can include these glyphs in the exemplar_chars in the language definitions, we can easily create a new .textproto file for the desired language and then include the language in the .yaml file.

Yes, that’s basically what I’ve been doing for the GF Latin PriAfrican and GF Latin African PRs. I also have a script that lists the graphemes composed of sequences of characters like ʉ̃ or ä̧.
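(For illustration, such a script could look roughly like the following sketch. It assumes the gflanguages Python API, namely LoadLanguages() and the exemplar_chars fields, and the convention of wrapping multi-character graphemes in braces, e.g. "{a̧}"; it is not the actual script mentioned above.)

import unicodedata
from gflanguages import LoadLanguages

languages = LoadLanguages()  # dict keyed by IDs such as "ca_Latn"

for lang_id, lang in sorted(languages.items()):
    exemplars = " ".join([lang.exemplar_chars.base, lang.exemplar_chars.auxiliary])
    sequences = []
    for token in exemplars.split():
        grapheme = token.strip("{}")
        if len(grapheme) > 1:  # more than one codepoint: no single encoded character
            names = " + ".join(
                "U+%04X %s" % (ord(c), unicodedata.name(c, "?")) for c in grapheme
            )
            sequences.append("%s  (%s)" % (grapheme, names))
    if sequences:
        print(lang_id, lang.name)
        for seq in sequences:
            print("   ", seq)

(Note that this also catches encoded multigraphs like "ny"; filtering for tokens that contain combining marks would narrow the output to the unencoded sequences.)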

However, I'm not sure if we can create and add 'language.textproto' files for languages that aren't included in Unicode. Is our gflanguages data related to this?

gflanguages uses IETF BCP47 language tags as identifiers.
The BCP47 tags gflanguages uses are composed of an ISO 639 language tag, an ISO 15924 script tag and an optional ISO 3166 country or region code. They could also have optional variant or extension subtags when the language-script-country code is not specific enough.
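(As an illustration of how these identifiers are composed, here is a trivial splitter for underscore-separated IDs in the gflanguages style; the example tags are only meant to show the pattern, not to assert which entries exist in the data.)

def split_gflanguages_id(tag):
    """Split an ID like "sr_Latn_RS" into (ISO 639 language, ISO 15924 script, ISO 3166 region)."""
    parts = tag.split("_")
    language = parts[0]
    script = parts[1] if len(parts) > 1 else None
    region = parts[2] if len(parts) > 2 else None
    return language, script, region

for tag in ["ca_Latn", "wo_Latn", "sr_Latn_RS"]:
    print(tag, "->", split_gflanguages_id(tag))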

When you say these languages "aren’t included in Unicode", it’s confusing.
Piaroa is supported in Unicode (a̧ ä̧ ȩ i̧ o̧ u̧ ü̧), just not like European languages that have their accented letters as single characters.
Besides the input issue, our issue here is that fonts don’t support those Unicode character sequences when they should handle them with correct glyph positioning or glyph substitution.
This is no different than languages using complex scripts that don’t have every single character sequence encoded as single characters.

Languages not supported by Unicode are the ones where no character exists at all for their orthographies.

@yanone (Collaborator, Author) commented Nov 24, 2023

This is actually an unexpected turn of events.

Yes, sequences of characters such as {a̧} are already correctly handled in the assembly of the glyphsets as per this PR: to the Python code they appear as separate Unicode characters and are thus added separately to the list of required characters. Whether or not the composition of these sequences is implemented correctly in the fonts will be the responsibility of the nascent shaperglot-based language shaping checks, so in the definitions we need not concern ourselves with that any further, as long as they are defined in the correct {a̧} notation in gflanguages.

For illustration: if I paste the sequence into FontGoggles, the glyph is listed as its two separate characters:
[Screenshot: FontGoggles listing the pasted glyph as two separate characters]
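(The same check can be done in plain Python with the standard library; the sequences below are the ones discussed in this thread.)

import unicodedata

for grapheme in ("a\u0327", "\u0289\u0303"):  # a̧ and ʉ̃
    print(grapheme, ["U+%04X %s" % (ord(c), unicodedata.name(c)) for c in grapheme])
# a̧ ['U+0061 LATIN SMALL LETTER A', 'U+0327 COMBINING CEDILLA']
# ʉ̃ ['U+0289 LATIN SMALL LETTER U BAR', 'U+0303 COMBINING TILDE']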

So let’s take a step back then and reconsider which glyphs are actually unencoded that we need to see included in the language definitions.

Because if they are really just the caron.alt for Czech and Slovak as well as the periodcentered for Catalan, I suggest not adding that functionality to gflanguages and keeping those few unencoded glyphs here in gfglyphsets, be it in additional language-specific definition files or in glyphset-specific definition files. If it's just those few, I won't mind keeping them in additional language-specific files here in gfglyphsets, because I would expect them to basically never change, so the authoring effort for a third repository (the other two being gflanguages and shaperglot, which has its own shaping-specific language definitions) is negligible.

So, which actually unencoded glyphs do we need?

@moyogo (Contributor) commented Nov 25, 2023

So, which actually unencoded glyphs do we need?

It depends, as mentioned above.

For the Catalan periodcentered, the default periodcentered should be designed and kerned for Catalan in the first place (Catalan names are not used only in Catalan text).
The only reason the GF Latin Core glyphset needs a Catalan locl variant is that the periodcentered is not designed and kerned for it.

The same goes for the glyphs listed in #109 (comment): variants are needed when the default glyphs are not appropriate for specific cases.
For example, defining an unencoded Eng.alt for Wolof wo_Latn only makes sense if the default Eng is never appropriate; a font may very well have the preferred Wolof shape of the glyph as its default, and then Northern Sami would need to have an Eng variant instead.

@yanone (Collaborator, Author) commented Nov 27, 2023

Okay, thank you. I gather that the separate definition of these examples in gflanguages is unnecessary (for now), so I will close the PR over there and find another simple way to have those few glyphs defined over here in glyphsets.

I've been wanting to rewrite this PR here anyway, because I want to see the language list included in the actual Python module, since the list is needed for the fontbakery language shaping checks.

I'm gonna need a few more days for this.

@yanone yanone marked this pull request as ready for review November 29, 2023 12:49
@yanone (Collaborator, Author) commented Nov 29, 2023

After we discovered that we actually don't need a lot of language-specific unencoded glyphs (that was a misunderstanding), I've completely closed the PR over at gflanguages and kept those three language- and Glyphs.app-specific glyphs (like periodcentered.CAT and caron.alt) where they were, in the .stub.glyphs file. We can work on that again later if we ever actually need to define loads of unencoded glyphs, but for now we don't.

Instead, I've moved the per-glyphset language definitions into the Python package, alongside the final .nam files. These need to be distributed inside the Python package so that third-party applications like fontbakery's shape_languages check can read both the available language codes (consumed by shaperglot) and the codepoints from the .nam files, and compute beforehand which glyphsets a font even supports. The data.json database will eventually disappear once this transition is complete.
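(A rough sketch of how a third-party check could consume this data once the package ships it: read the codepoints from a glyphset's .nam file and compare them to a font's cmap. The file paths, the packaged layout, and the "0xXXXX NAME" line format with "#" comments are assumptions made for this example, not the final glyphsets API.)

from fontTools.ttLib import TTFont

def nam_codepoints(nam_path):
    codepoints = set()
    with open(nam_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            first = line.split()[0]
            if first.lower().startswith("0x"):
                codepoints.add(int(first, 16))
    return codepoints

def supports_glyphset(font_path, nam_path):
    cmap = set(TTFont(font_path).getBestCmap())  # codepoints the font encodes
    required = nam_codepoints(nam_path)
    missing = required - cmap
    return not missing, missing

ok, missing = supports_glyphset("MyFont.ttf", "GF_Latin_Core.nam")  # hypothetical paths
print("supported" if ok else "missing %d codepoints" % len(missing))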

I've removed the draft status of the PR and now officially ask for the final review.

After this is done and a new package is published, I will continue to rewrite the shape_languages check to consume the newly available data.

@yanone yanone requested review from moyogo and RosaWagner November 29, 2023 13:01
@yanone yanone merged commit 4483c1c into main Dec 1, 2023
9 checks passed
@yanone yanone deleted the language-definition-overhaul branch May 4, 2024 12:26