[glyphsets] Language definition overhaul #109

Merged
63 commits merged into main on Dec 1, 2023

Conversation

@yanone (Collaborator) commented Jun 1, 2023

This PR follows through with the overhaul of the character sets in gfglyphsets as outlined here.

@moyogo (Contributor) left a comment


It looks like Maltese, Latvian and Icelandic are under 5,000,000 speakers and should not be included. Either their inclusion needs to be hard-coded, or they should be excluded, or the requirements need to be adjusted.

For Lithuanian, it could be covered without the Lithuanian dictionary notation (which includes the soft-dotted I stuff) being covered, if the threshold were lowered to include those languages.

Not sure why Bavarian is commented out.

@yanone (Collaborator, Author) commented Jun 5, 2023

@moyogo I removed languages under 5M from Core, which also included Lithuanian (2.3M).

@RosaWagner RosaWagner changed the title from "Language definition overhaul" to "[glyphsets] Language definition overhaul" on Jun 21, 2023
@RosaWagner (Contributor)

Since Maltese, Latvian and Icelandic are the "main/primary" languages of countries in Europe, I would include them in Core. I don't know if that means the threshold should be moved or if these languages should just be included. We could try to see what happens if we move the threshold to 2M and decide from there.

@moyogo (Contributor) commented Nov 13, 2023

Personally I think everything should be in one place. It's really inconvenient to have to change 3 different repositories for each language.

@yanone I’m not suggesting having language-specific YAML files.
I’m suggesting keeping glyphset-specific info in glyphset definition files.

Considering that having multiple sources for the data is an issue, I’d suggest replacing the .stub.nam files with data in the same glyphset definition YAML files.

Something like:

stub:
  - 0x0024 DOLLAR SIGN
  - etc.
language_codes:
  - ca_Latn  # Catalan
  - cs_Latn  # Czech
  - etc.

If the unencoded glyphs are in the glyphset definition file then something like the following could be in there as well:

unencoded_glyphs:
  - periodcentered.loclCAT
  - periodcentered.loclCAT.case
  - etc.

If we do want the info in language data, it needs to be more flexible than glyph names.
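(For illustration, a consolidated definition like the one proposed above could be consumed with a few lines of Python. This is only a sketch: the file name, the exact keys, and PyYAML as the parser are assumptions made for this example, not an existing glyphsets API.)

import yaml  # PyYAML

# Hypothetical file following the structure proposed in this comment
with open("GF_Latin_Core.yaml") as f:
    definition = yaml.safe_load(f)

# "0x0024 DOLLAR SIGN" -> 0x0024 (the "- etc." placeholders above are not handled; a real file wouldn't contain them)
stub_codepoints = {int(entry.split()[0], 16) for entry in definition.get("stub", [])}

# Inline "# Catalan"-style comments are stripped by the YAML parser
language_codes = definition.get("language_codes", [])
unencoded_glyphs = definition.get("unencoded_glyphs", [])

print(len(stub_codepoints), "stub codepoints,", len(language_codes), "languages,", len(unencoded_glyphs), "unencoded glyphs")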

@yanone (Collaborator, Author) commented Nov 21, 2023

Bringing in @vv-monsalve here.

One of the reasons why we thought of the approach of putting glyph names into gflanguages is the support of some African languages that are largely unencoded. At least that's what I understood from Viviana. Putting large sets of unencoded glyphs that are language-specific into the definitions of gflanguages is the only way that makes sense to me, even though I understand that it's unsexy to put glyph names there, because they are/can be application-specific.

Viviana (or Denis), could you please provide an example for such language-specific glyphs other than the European ones I've already singled out? This is for my own understanding.

If it's indeed necessary to have unencoded glyph names in language definitions, I would ask to move forward with the proposal.

@moyogo (Contributor) commented Nov 22, 2023

Viviana (or Denis), could you please provide an example for such language-specific glyphs other than the European ones I've already singled out? This is for my own understanding.

@yanone These depend on the glyphset, the design of the glyphs themselves and the scope or target of the font (for Latin):

  • a
  • Alpha-latin, alpha-latin
  • Bhook
  • Bstroke and bstroke
  • Dhook
  • Dstroke and dstroke
  • f
  • fhook
  • Gstroke and gstroke
  • Gamma-latin
  • Hstroke and hstroke
  • istroke
  • lambdastroke
  • Nhookleft
  • Eng
  • Esh
  • Vhook and vhook
  • Ezh
  • Lcommaaccent, lcommaaccent, Ncommaaccent, ncommaaccent
  • Scedilla, scedilla, Tcedilla, tcedilla
  • kip
  • Any of the letters with circumflex and another top mark above
  • Any of the letters with ogonek
  • likely some others I’m forgetting

There are more for other writing systems (for Cyrillic, for example: be-cy, te-cy, sha-cy, pe-cy, ge-cy, gje-cy, gebar-cy, de-cy, ka-cy, zhe-cy, fi-cy, softsign-cy, hardsign-cy, gedescender-cy, gestrokehook-cy, tshe-cy, etc.).

@vv-monsalve (Contributor)

could you please provide an example for such language-specific glyphs

Currently, we are adding mainly the SSA languages. But if we eventually add e.g. an indigenous language like Piaroa, it would require glyphs like:

  • acedilla, icedilla, ocedilla, ucedilla
  • adieresiscedilla, odieresiscedilla, udieresiscedilla
  • uacutecedilla

@moyogo (Contributor) commented Nov 22, 2023

@yanone @vv-monsalve Did you mean multiple-to-one glyphs or language specific alternate single glyphs by "unencoded glyphs"?

The multiple-to-one glyphs are already listed in the language data.
They can be handled with mark positioning and contextual kerning, but if that doesn’t work, composite glyphs can be used. These can be derived from the language data exemplars, at least for Latin.

For example, a multiple-to-one glyph like a_cedilla is found as "{a̧}" in the exemplar characters of the languages using it.

An alternate single glyph is something like bstroke.alt (or, with a clearer name, bstroke.midoverlaystroke, or for Glyphs.app bstroke.EMPPLG0) for the languages that would use a glyph distinct from the default (for example when the default bstroke has its stroke through the ascender), for the given glyphset and scope.

@vv-monsalve (Contributor)

Did you mean multiple-to-one glyphs or language specific alternate single glyphs by "unencoded glyphs"?

I think I had in mind only the multiple-to-one.

For example, a multiple-to-one glyph like a_cedilla is found as "{a̧}" in the exemplar characters of the languages using it.

I was not aware of this option in the language definition. Would it also work for more complex or unusual combinations like ubartilde? For example, I haven't managed to copy/paste that glyph, dynamically composed, from InDesign into a .textproto file.
[Screenshot: ʉ̃ dynamically composed in InDesign]

If we can include these glyphs in the exemplar_chars in the language definitions, we can easily create a new .textproto file for the desired language and then include the language in the .yaml file. However, I'm not sure if we can create and add 'language.textproto' files for languages that aren't included in Unicode. Is our gflanguages data related to this?

An alternate single glyph is something like bstroke.alt

This seems to be more related to each font design, so I don't know if we should/can add them in the Glyphsets definitions.

This brings us back to what the goals are here. From what I understand, it is to have a centralized tool to create all the necessary files (from Glyphs to Glyphsets or nam) that define the required glyphs for a font's language support.

If the intention with, for example, the .stub.glyphs file is to create customized .plist files for projects, then yes, these alternate glyphs should be listed there.

@moyogo (Contributor) commented Nov 23, 2023

I was not aware of this option in the language definition. Would it also work for more complex or unusual combinations like ubartilde? For example, I haven't managed to copy/paste that glyph, dynamically composed, from InDesign into a .textproto file.

@yanone How did you get ʉ̃ in the InDesign document in the first place? If it was with the glyph palette and the glyph wasn’t named properly (uni0289_tildecomb) in the font used then InDesign may not be able to know what text string that glyph represents.

I’d recommend using an input tool like https://r12a.github.io/uniview/.

On macOS, you can use Edit > Emoji & Symbols from the menu (but Ctrl+Cmd+Space works better) to insert any character if you know or can guess the Unicode character name. In this case, searching "u bar" shows ʉ and searching "combining tilde" shows the combining tilde. It shows the character name when you hover, in case there are multiple characters that look like a match.
[Screenshots: macOS character viewer searching for "u bar" and "combining tilde"]

The better input method is to use a proper keyboard layout. For Piaroa, there is https://keyman.com/keyboards/pid_piaroa; the user can use AltGr+a AltGr+, to input ä̧, for example.

If we can include these glyphs in the exemplar_chars in the language definitions, we can easily create a new .textproto file for the desired language and then include the language in the .yaml file.

Yes, that’s basically what I’ve been doing for the GF Latin PriAfrican and GF Latin African PRs. I also have a script that lists the graphemes composed of sequences of characters like ʉ̃ or ä̧.
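(For illustration, such a script could look roughly like the following sketch. It assumes the gflanguages Python API, namely LoadLanguages() and the exemplar_chars fields, and the convention of wrapping multi-character graphemes in braces, e.g. "{a̧}"; it is not the actual script mentioned above.)

import unicodedata
from gflanguages import LoadLanguages

languages = LoadLanguages()  # dict keyed by IDs such as "ca_Latn"

for lang_id, lang in sorted(languages.items()):
    exemplars = " ".join([lang.exemplar_chars.base, lang.exemplar_chars.auxiliary])
    sequences = []
    for token in exemplars.split():
        grapheme = token.strip("{}")
        if len(grapheme) > 1:  # more than one codepoint: no single encoded character
            names = " + ".join(
                "U+%04X %s" % (ord(c), unicodedata.name(c, "?")) for c in grapheme
            )
            sequences.append("%s  (%s)" % (grapheme, names))
    if sequences:
        print(lang_id, lang.name)
        for seq in sequences:
            print("   ", seq)

(Note that this also catches encoded multigraphs like "ny"; filtering for tokens that contain combining marks would narrow the output to the unencoded sequences.)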

However, I'm not sure if we can create and add 'language.textproto' files for languages that aren't included in Unicode. Is our gflanguages data related to this?

gflanguages uses IETF BCP47 language tags as identifiers.
The BCP47 tags gflanguages uses are composed of an ISO 639 language tag, an ISO 15924 script tag and an optional ISO 3166 country or region code. They could also have optional variant or extension subtags when the language-script-country code is not specific enough.
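(As an illustration of how these identifiers are composed, here is a trivial splitter for underscore-separated IDs in the gflanguages style; the example tags are only meant to show the pattern, not to assert which entries exist in the data.)

def split_gflanguages_id(tag):
    """Split an ID like "sr_Latn_RS" into (ISO 639 language, ISO 15924 script, ISO 3166 region)."""
    parts = tag.split("_")
    language = parts[0]
    script = parts[1] if len(parts) > 1 else None
    region = parts[2] if len(parts) > 2 else None
    return language, script, region

for tag in ["ca_Latn", "wo_Latn", "sr_Latn_RS"]:
    print(tag, "->", split_gflanguages_id(tag))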

When you say these languages "aren’t included in Unicode", it’s confusing.
Piaroa is supported in Unicode (a̧ ä̧ ȩ i̧ o̧ u̧ ü̧), just not like European languages that have their accented letters as single characters.
Besides the input issue, our issue here is that fonts don’t support those Unicode character sequences when they should handle them with correct glyph positioning or glyph substitution.
This is no different than languages using complex scripts that don’t have every single character sequence encoded as single characters.

Languages not supported by Unicode are the ones where no character exists at all for their orthographies.

@yanone (Collaborator, Author) commented Nov 24, 2023

This is actually an unexpected turn of events.

Yes, sequences of characters such as {a̧} are already correctly handled in the assembly of the glyphsets as per this PR: to the Python code they appear as separate Unicode characters and are thus added separately to the list of required characters. Whether or not the composition of these sequences is implemented correctly in the fonts will be the responsibility of the nascent shaperglot-based language shaping checks, so in the definitions we need not concern ourselves with that any further, as long as they are defined in the correct {a̧} notation in gflanguages.

For illustration: if I paste the sequence into FontGoggles, the glyph is listed as its two separate characters:
[Screenshot: FontGoggles listing the pasted glyph as two separate characters]
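(The same check can be done in plain Python with the standard library; the sequences below are the ones discussed in this thread.)

import unicodedata

for grapheme in ("a\u0327", "\u0289\u0303"):  # a̧ and ʉ̃
    print(grapheme, ["U+%04X %s" % (ord(c), unicodedata.name(c)) for c in grapheme])
# a̧ ['U+0061 LATIN SMALL LETTER A', 'U+0327 COMBINING CEDILLA']
# ʉ̃ ['U+0289 LATIN SMALL LETTER U BAR', 'U+0303 COMBINING TILDE']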

So let’s take a step back then and reconsider which glyphs are actually unencoded that we need to see included in the language definitions.

Because if they are really just the caron.alt for Czech and Slovak as well as the periodcentered for Catalan, I suggest not adding that functionality to gflanguages and keeping those few unencoded glyphs here in gfglyphsets, be it in additional language-specific definition files or in glyphset-specific definition files. If it's just those few, I won't mind keeping them in additional language-specific files here in gfglyphsets, because I would expect them to basically never change, so the authoring effort for a third repository (the other two being gflanguages and shaperglot, which has its own shaping-specific language definitions) is negligible.

So, which actually unencoded glyphs do we need?

@moyogo (Contributor) commented Nov 25, 2023

So, which actually unencoded glyphs do we need?

It depends, as mentioned above.

For the Catalan periodcentered, the default periodcentered should be designed and kerned for Catalan in the first place (Catalan names are not used only in Catalan text).
The only reason the GF Latin Core glyphset needs a Catalan locl variant is that the periodcentered is not designed and kerned for it.

The same goes for the glyphs listed in #109 (comment): variants are needed when the default glyphs are not appropriate for specific cases.
For example, defining an unencoded Eng.alt for Wolof wo_Latn only makes sense if the default Eng is never appropriate; a font may very well have the preferred Wolof shape of the glyph as its default, and then Northern Sami would need to have an Eng variant instead.

@yanone (Collaborator, Author) commented Nov 27, 2023

Okay, thank you. I gather that the separate definition of these examples in gflanguages is unnecessary (for now), so I will close the PR over there and find another simple way to have those few glyphs defined over here in glyphsets.

I've been wanting to rewrite this PR here anyway, because I want to see the language list included in the actual Python module, since the list is needed for the fontbakery language shaping checks.

I'm gonna need a few more days for this.

@yanone yanone marked this pull request as ready for review November 29, 2023 12:49
@yanone (Collaborator, Author) commented Nov 29, 2023

After we discovered that we actually don't need a lot of language-specific unencoded glyphs (that was a misunderstanding), I've completely closed the PR over at gflanguages and kept those three language- and Glyphs.app-specific glyphs (like periodcentered.CAT and caron.alt) where they were, in the .stub.glyphs file. We can work on that again later if we ever actually need to define loads of unencoded glyphs, but for now we don't.

Instead, I've moved the per-glyphset language definitions into the Python package, alongside the final .nam files. These need to be distributed inside the Python package so that third-party applications like fontbakery's shape_languages check can read both the available language codes (consumed by shaperglot) and the codepoints from the .nam files, and compute beforehand which glyphsets a font even supports. The data.json database will eventually disappear once this transition is complete.
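(A rough sketch of how a third-party check could consume this data once the package ships it: read the codepoints from a glyphset's .nam file and compare them to a font's cmap. The file paths, the packaged layout, and the "0xXXXX NAME" line format with "#" comments are assumptions made for this example, not the final glyphsets API.)

from fontTools.ttLib import TTFont

def nam_codepoints(nam_path):
    codepoints = set()
    with open(nam_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            first = line.split()[0]
            if first.lower().startswith("0x"):
                codepoints.add(int(first, 16))
    return codepoints

def supports_glyphset(font_path, nam_path):
    cmap = set(TTFont(font_path).getBestCmap())  # codepoints the font encodes
    required = nam_codepoints(nam_path)
    missing = required - cmap
    return not missing, missing

ok, missing = supports_glyphset("MyFont.ttf", "GF_Latin_Core.nam")  # hypothetical paths
print("supported" if ok else "missing %d codepoints" % len(missing))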

I've removed the draft status of the PR and now officially ask for the final review.

After this is done and a new package is published, I will continue to rewrite the shape_languages check to consume the newly available data.

@yanone yanone requested review from moyogo and RosaWagner November 29, 2023 13:01
@yanone yanone merged commit 4483c1c into main Dec 1, 2023
9 checks passed
@yanone yanone deleted the language-definition-overhaul branch May 4, 2024 12:26