Add GF_Latin_PriAfrican #137

Merged 8 commits on Jan 2, 2024

Conversation

chrissimpkins
Member

Closes #136

@chrissimpkins
Member Author

chrissimpkins commented Oct 9, 2023

Please hold on a merge until we've completed the review of the data in this PR. I'll update in this thread when we are ready.

@chrissimpkins
Member Author

This PR will be reviewed by @moyogo and @NeilSureshPatel

@moyogo
Contributor

moyogo commented Oct 14, 2023

@chrissimpkins at first glance ebreve, ibreve and obreve look out of place. They don't seem to be used in those languages, and their uppercase forms are missing.

@NeilSureshPatel

@chrissimpkins I wanted to confirm that the priority list is meant to cover the languages in the countries Dave mentions in #136. We had discussed different tiers of priorities in the past, so I want to make sure I am looking at the right things. Thanks.

@chrissimpkins
Member Author

@chrissimpkins I wanted to confirm that the priority list is meant to cover the languages in the countries Dave mentions in #136. We had discussed different tiers of priorities in the past, so I want to make sure I am looking at the right things. Thanks.

I shared a Doc with you that details the list of languages.

@NeilSureshPatel

@chrissimpkins One more question. Are the glyphsets meant to be exhaustive or additive? There are things missing but if all fonts are required to support levels 1-4, then I can see that some of the missing things are not required if they are already covered in the other Latin coverage levels.

@chrissimpkins
Member Author

This is meant to be a superset of the Latin Core set to fully support the target languages. So, the question is: if a family includes Latin Core + Latin PriAfrican coverage, does it fully support the languages of interest?

@moyogo
Contributor

moyogo commented Oct 30, 2023

@chrissimpkins @davelab6

Language data

I’ve updated language data for some of the target languages in gflanguages PR 114. The Afrikaans base characters were incorrect; however, the update doesn’t affect this glyphset.

Note: Yoruba has been split into Yoruba (Nigeria) yo_Latn and Yoruba (Benin) yo_Latn_BJ. Yoruba (Benin) uses ɛ ɔ kp sh and uppercase instead of ẹ ọ p ṣ used in Yoruba (Nigeria).

As mentioned in the meeting:

GF Latin African Pri

Besides Ŋŋ needed for Luganda spoken in Uganda (and several languages not listed in the targeted languages) this is mostly a priority Nigerian language glyph set. The following target languages are already supported by GF Latin Core:

  • Afrikaans (af)
  • Oromo (om)
  • Swahili (sw)
  • Xhosa (xh)
  • Zulu (zu)

Note: Xhosa (xh) and Zulu (zu), spoken in South Africa, require lowercase-to-uppercase kerning because they use lowercase prefixes at the beginning of proper nouns. This can be narrowed down to the vowels a e i o u for the lowercase in front of uppercase. For example, "eVe" in "eVenda" should kern visually symmetrically in most designs.
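The kerning note above can be turned into a quick checklist of pairs to review. This is only a sketch: the function name is made up for illustration, and restricting the capitals to basic A-Z is an assumption, not something stated in the review.

```python
from itertools import product
from string import ascii_uppercase

# Lowercase prefix vowels named in the note above.
LOWERCASE_PREFIX_VOWELS = "aeiou"


def prefix_kerning_pairs(vowels=LOWERCASE_PREFIX_VOWELS, capitals=ascii_uppercase):
    """Return (lowercase, uppercase) pairs worth checking for kerning.

    Assumption: only basic A-Z capitals are considered here.
    """
    return list(product(vowels, capitals))


pairs = prefix_kerning_pairs()
print(len(pairs))           # 5 vowels x 26 capitals
print(("e", "V") in pairs)  # the "eV" pair from "eVenda"
```

A designer could feed such a list into a proofing string generator to inspect each pair in both directions (for example "eVe").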

This glyphset adds support for:

  • Hausa (ha) with Ɓ ɓ Ɗ ɗ Ƙ ƙ Ƴ ƴ
  • Igbo (ig) with Ị ị Ṅ ṅ Ọ ọ Ụ ụ
  • Luganda (lg) with Ŋ ŋ (once it’s added)
  • Yoruba (yo) with Ẹ ẹ Ọ ọ Ṣ ṣ Ḿ ḿ Ń ń Ǹ ǹ
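To see how the per-language additions listed above combine into one codepoint list, here is a minimal sketch using Python's standard unicodedata module. The dictionary is typed from the list above rather than read from gflanguages, and the helper name is hypothetical. The result includes Ń ń, which may already be covered by GF Latin Core.

```python
import unicodedata

# Hypothetical mapping copied from the list above (not read from gflanguages).
ADDED_BY_LANGUAGE = {
    "ha": "Ɓ ɓ Ɗ ɗ Ƙ ƙ Ƴ ƴ",
    "ig": "Ị ị Ṅ ṅ Ọ ọ Ụ ụ",
    "lg": "Ŋ ŋ",
    "yo": "Ẹ ẹ Ọ ọ Ṣ ṣ Ḿ ḿ Ń ń Ǹ ǹ",
}


def nam_lines(text):
    """Format the unique characters of text as '0xXXXX UNICODE NAME' lines."""
    # NFC normalization keeps accented letters as single precomposed characters.
    chars = set(unicodedata.normalize("NFC", text)) - {" "}
    return [f"0x{ord(c):04X} {unicodedata.name(c)}" for c in sorted(chars)]


all_added = " ".join(ADDED_BY_LANGUAGE.values())
for line in nam_lines(all_added):
    print(line)
```

Ọ ọ is shared by Igbo and Yoruba, so the union is smaller than the sum of the four lists.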

Characters that should be removed

Some characters should not be in the GF Latin African Pri glyphset:
0x2011 NON-BREAKING HYPHEN
0x2020 DAGGER
0x2021 DOUBLE DAGGER
0x2032 PRIME
0x2033 DOUBLE PRIME
They seem to come from an outdated version of the Yoruba language data. These African languages need them no more than the languages supported by other GF Latin sets do, such as English in GF Latin Core.

Missed opportunity

Since several countries listed in the shared document use none of the target languages it seems there are easy additions that should be recommended.

I’d strongly recommend adding 6 characters Ɛ ɛ Ɲ ɲ Ɔ ɔ:

  • Ɲ ɲ for Fula (ff_Latn), spoken by about 35 million people from Senegal in the West to Cameroon in the East. Ɓ ɓ Ɗ ɗ Ƴ ƴ are already in this set, and Ŋ ŋ should be as well.
  • Ɛ ɛ Ɔ ɔ for Akan (ak: Twi twi, Fante fat), spoken in Ghana by about 10 million people, and the related Baoulé (bci), spoken in Côte d’Ivoire by another 8 million people.
  • Ɛ ɛ Ɲ ɲ Ɔ ɔ for Bambara (bm) and Dioula (dyu) or closely related Manding languages, spoken in Côte d’Ivoire and neighbouring countries by about 15 million people. Ŋ ŋ should already be in this set.

Besides this total of about 70M speakers, many additional languages would be supported with these additions; the ones above are the larger ones in the countries listed.
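The speaker total and the number of new characters can be checked with a few lines of Python. The figures are the approximate ones quoted in the comment above; the data structure is just an illustration.

```python
# Approximate speaker counts (in millions) and characters quoted above.
PROPOSED = {
    "Fula (ff_Latn)": {"chars": "Ɲɲ", "speakers_millions": 35},
    "Akan (ak)": {"chars": "ƐɛƆɔ", "speakers_millions": 10},
    "Baoulé (bci)": {"chars": "ƐɛƆɔ", "speakers_millions": 8},
    "Bambara (bm) / Dioula (dyu)": {"chars": "ƐɛƝɲƆɔ", "speakers_millions": 15},
}

# Unique characters needed across all proposed languages.
new_chars = sorted(set("".join(entry["chars"] for entry in PROPOSED.values())))
total_millions = sum(entry["speakers_millions"] for entry in PROPOSED.values())

print(len(new_chars), " ".join(new_chars))
print(total_millions)
```

The sum comes to 68 million, which the comment rounds to 70M, for six added characters.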

Note: like Ŋ, Ɲ can have both an n-form and an N-form; the two forms are not needed at the same time.

@davelab6
Member

davelab6 commented Nov 1, 2023

Let's not miss the opportunity. I think the Eng here is also a locl version, so there are 2 versions of that to be added, with the needed feature code.

@davelab6
Member

davelab6 commented Nov 2, 2023

Note: Yoruba has been split into Yoruba (Nigeria) yo_Latn and Yoruba (Benin) yo_Latn_BJ. Yoruba (Benin) uses ɛ ɔ kp sh and uppercase instead of ẹ ọ p ṣ used in Yoruba (Nigeria).

I think you are saying we need both, even though it's split, which I agree with. Thanks to @Black-sage for explaining to me that the ethnic integration across state lines is strong :)

Revisions based on input from Neil Patel and Denis Jacquerye, including the comments in #137 (comment)
@moyogo
Contributor

moyogo commented Nov 4, 2023

@chrissimpkins Here’s the output for the .nam file with a patched version of Yanone’s assemble_charactersets.py (#109 and #142) given the languages (Afrikaans, Hausa, Igbo, Luganda, Oromo, Swahili, Xhosa, Yoruba, Zulu and the additional Akan, Bambara, Dioula, Fulfulde):

0x014A LATIN CAPITAL LETTER ENG
0x014B LATIN SMALL LETTER ENG
0x0181 LATIN CAPITAL LETTER B WITH HOOK
0x0186 LATIN CAPITAL LETTER OPEN O
0x018A LATIN CAPITAL LETTER D WITH HOOK
0x0190 LATIN CAPITAL LETTER OPEN E
0x0198 LATIN CAPITAL LETTER K WITH HOOK
0x0199 LATIN SMALL LETTER K WITH HOOK
0x019D LATIN CAPITAL LETTER N WITH LEFT HOOK
0x01B3 LATIN CAPITAL LETTER Y WITH HOOK
0x01B4 LATIN SMALL LETTER Y WITH HOOK
0x01F8 LATIN CAPITAL LETTER N WITH GRAVE
0x01F9 LATIN SMALL LETTER N WITH GRAVE
0x0253 LATIN SMALL LETTER B WITH HOOK
0x0254 LATIN SMALL LETTER OPEN O
0x0257 LATIN SMALL LETTER D WITH HOOK
0x025B LATIN SMALL LETTER OPEN E
0x0272 LATIN SMALL LETTER N WITH LEFT HOOK
0x0323 COMBINING DOT BELOW
0x1E3E LATIN CAPITAL LETTER M WITH ACUTE
0x1E3F LATIN SMALL LETTER M WITH ACUTE
0x1E44 LATIN CAPITAL LETTER N WITH DOT ABOVE
0x1E45 LATIN SMALL LETTER N WITH DOT ABOVE
0x1E62 LATIN CAPITAL LETTER S WITH DOT BELOW
0x1E63 LATIN SMALL LETTER S WITH DOT BELOW
0x1EB8 LATIN CAPITAL LETTER E WITH DOT BELOW
0x1EB9 LATIN SMALL LETTER E WITH DOT BELOW
0x1ECA LATIN CAPITAL LETTER I WITH DOT BELOW
0x1ECB LATIN SMALL LETTER I WITH DOT BELOW
0x1ECC LATIN CAPITAL LETTER O WITH DOT BELOW
0x1ECD LATIN SMALL LETTER O WITH DOT BELOW
0x1EE4 LATIN CAPITAL LETTER U WITH DOT BELOW
0x1EE5 LATIN SMALL LETTER U WITH DOT BELOW

Compared with this output, the latest file in the PR is missing the following:

0x0323 COMBINING DOT BELOW
0x1E62 LATIN CAPITAL LETTER S WITH DOT BELOW

Somehow we missed capital S with dot below (see #141).
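A comparison like the one above can be reproduced by parsing two .nam listings and taking a set difference. This is a sketch only: the two strings are short excerpts typed in for illustration (not the full files), and parse_nam is a hypothetical helper, not part of assemble_charactersets.py.

```python
def parse_nam(text):
    """Collect the codepoints from '0xXXXX NAME' lines, skipping blanks and comments."""
    codepoints = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        codepoints.add(int(line.split()[0], 16))
    return codepoints


# Illustrative excerpts only, not the complete listings.
GENERATED = """\
0x0323 COMBINING DOT BELOW
0x1E62 LATIN CAPITAL LETTER S WITH DOT BELOW
0x1E63 LATIN SMALL LETTER S WITH DOT BELOW
0x1EB8 LATIN CAPITAL LETTER E WITH DOT BELOW
"""
IN_PR = """\
0x1E63 LATIN SMALL LETTER S WITH DOT BELOW
0x1EB8 LATIN CAPITAL LETTER E WITH DOT BELOW
"""

missing = sorted(parse_nam(GENERATED) - parse_nam(IN_PR))
print([f"0x{cp:04X}" for cp in missing])
```

With these excerpts the difference is the combining dot below and capital S with dot below, matching the two entries reported as missing.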

@moyogo
Contributor

moyogo commented Nov 5, 2023

@chrissimpkins I opened #144 for those fixes.

@chrissimpkins
Member Author

@chrissimpkins I opened #144 for those fixes.

Reviewed and dropped a request there. Thank you very much Denis!

…tch1

Add Sdotbelow, dotbelowcomb to PriAfrican glyph set, update data.json mapping for PriAfrican glyph set
@chrissimpkins
Member Author

We merged Denis' #144 into this branch.

@EbenSorkin

I am excited for this to be ready.

@chrissimpkins
Member Author

I think that we are nearly ready to merge. Open for any final comments / recommendations before we do so.

Member

@davelab6 davelab6 left a comment


This adds 33 encoded characters; are there any unencoded (locl) forms to go with them?

Also, please update the root README to reflect this addition to the sets.

Lib/glyphsets/data.json (review thread resolved)
@moyogo
Contributor

moyogo commented Nov 7, 2023

This adds 33 encoded characters; are there any unencoded (locl) forms to go with them?

Considering the scope of the glyphset, they are not included.

locl forms make sense if the preferred forms are or need to be different from the ones in the fonts.

The N-form /Eng, usually encoded as the default /Eng or as /Eng.loclNSM and preferred in Northern Sami, likely belongs to GF Latin Beyond, since Northern Sami is covered by neither GF Latin Core nor GF Latin PriAfrican: it is spoken in Finland, Norway and Sweden.

Liberia’s /Bhook (used in Gio, Loma, Liberia Kpelle which are out of scope), similar to /Btopbar, is not included.
Note: while reviewing the target languages, it was noted a similar form (/Btopbar or rounder) was used in Xhosa and Zulu from the 1930s until the 1950s. However /Bhook doesn’t appear in current Xhosa or Zulu corpora.

The N-form and n-form of /Nhookleft are not differentiated here because, as with /Eng, the n-form should be the default when applicable.

@davelab6
Member

davelab6 commented Nov 8, 2023

Thanks for the quick and thorough response Denis!

I'll update the README

@vv-monsalve
Contributor

vv-monsalve commented Nov 9, 2023

This morning, we discussed the upcoming African glyphset definitions to clarify their differences. On the understanding that this one would be a subset of Latin-SSA, we should review the following for both this PR and the Latin-SSA set under development:

  • Naming
    Under the current system, the lowest level of language support for a script has been named "Kernel", and the next one above is "Core". We should probably keep using these words consistently to indicate levels within languages as well, and call this one "African Kernel" (e.g. GF_Latin_African_Kernel) and the one defined in "SSA latin all changes" #134 "African Core" (GF_Latin_African_Core).

  • Modularity
    The Glyphsets have been defined and built as modules that allow the use of different blocks of information to be combined to build sets of glyphs with different levels of complexity according to the needs of each project.
    Following that system, the glyphs included in this PR need to be removed from SSA latin all changes #134

cc @RosaWagner

@chrissimpkins
Member Author

chrissimpkins commented Nov 10, 2023

Waiting to resolve Viv's recommendation in #137 (comment) to merge this.

Thoughts @davelab6? The PriAfrican name was as you recommended in #136.

The modularity issue is a good point. Should this glyph set be a subset of the pan-African set? Or a separate set and projects must layer pan-African on top of this one to achieve the widest defined African lang support?

@vv-monsalve
Contributor

vv-monsalve commented Nov 10, 2023

The modularity issue is a good point. Should this glyph set be a subset of the pan-African set? Or a separate set and projects must layer pan-African on top of this one to achieve the widest defined support?

My understanding is if e.g. a is already included in one set, then the next module only includes e.g. abreveacute.
Looking at this again, this seems true between Core and Vietnamese, but not between Kernel and Core (the latter repeats the letters present in Kernel, e.g. A). We probably need Rosa's or Marc's confirmation.

@moyogo
Contributor

moyogo commented Nov 10, 2023

@vv-monsalve @chrissimpkins in #142, scripts/assemble_charactersets.py has been patched to allow glyphset definition YAML files to have an extends list of other glyphsets. The various glyphset files (nam, nice names, production names) will only list the additional glyphs, not the ones in the other glyphsets.

For example, the GF_Latin_PriAfrican.yaml used for #144 was:

extends:
  - GF Latin Core
language_codes:
  - af_Latn  # Afrikaans
  - ak_Latn  # Akan
  - bm_Latn  # Bambara
  - dyu_Latn # Dioula
  - ff_Latn  # Fulfulde
  - ha_Latn  # Hausa
  - ig_Latn  # Igbo
  - lg_Latn  # Luganda
  - om_Latn  # Oromo
  - sw_Latn  # Swahili
  - xh_Latn  # Xhosa
  - yo_Latn  # Yoruba
  - zu_Latn  # Zulu

Other than that, there will be overlap between some sets. For example, GF Latin Vietnamese overlaps with GF Latin African: they both extend GF Latin Core, but they share glyphs and have different purposes. Neither GF Latin African nor GF Latin Vietnamese extends the other.
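The effect of an extends list can be sketched as follows. This is not the code from scripts/assemble_charactersets.py; the tiny character sets and the helper names are invented for illustration, and the real definitions are derived from language codes rather than hard-coded characters.

```python
# Toy glyphset definitions (invented data, far smaller than the real sets).
CORE = set("AaEeOo")
GLYPHSETS = {
    "GF Latin Core": {"extends": [], "chars": CORE},
    "GF Latin PriAfrican": {
        "extends": ["GF Latin Core"],
        # Ɛ ɛ Ẹ ẹ Ọ ọ
        "chars": CORE | set("\u0190\u025B\u1EB8\u1EB9\u1ECC\u1ECD"),
    },
    "GF Latin Vietnamese": {
        "extends": ["GF Latin Core"],
        # Ạ ạ Ẹ ẹ Ọ ọ
        "chars": CORE | set("\u1EA0\u1EA1\u1EB8\u1EB9\u1ECC\u1ECD"),
    },
}


def inherited_chars(name, glyphsets):
    """Union of all characters provided by the glyphsets that `name` extends."""
    result = set()
    for parent in glyphsets[name]["extends"]:
        result |= glyphsets[parent]["chars"] | inherited_chars(parent, glyphsets)
    return result


def additional_chars(name, glyphsets):
    """Characters a glyphset lists itself, i.e. those not already inherited."""
    return glyphsets[name]["chars"] - inherited_chars(name, glyphsets)


african = additional_chars("GF Latin PriAfrican", GLYPHSETS)
vietnamese = additional_chars("GF Latin Vietnamese", GLYPHSETS)
shared = african & vietnamese
print(sorted(shared))
```

Neither set repeats the Core letters, yet the two still overlap on Ẹ ẹ Ọ ọ because neither extends the other.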

It makes sense that GF Latin Core should extend GF Latin Kernel. But is it convenient?

I don’t think the names GF Latin African Core and GF Latin African Kernel make sense.
GF Latin Kernel is meant to be used in all glyphsets, Latin or not, and GF Latin Core is meant to be used in all Latin glyphsets.
GF Latin PriAfrican is not a great name either as there may be different levels of priority. I don’t have better names to suggest.

@EbenSorkin

EbenSorkin commented Nov 10, 2023 via email

@davelab6
Member

I don't have a strong opinion about the details here. The business need is for a priority set of African languages to be supported, or for a kernel Latin set to be included in primarily non-Latin fonts.

@m4rc1e
Collaborator

m4rc1e commented Nov 21, 2023

Sorry for the delay.

I don't really get this discussion about having non-overlapping glyphsets.

If I check out the data.json file (the source of truth), we can see that each glyph also lists the glyphsets it appears in. We can see that the "a" is in Latin Kernel and Latin Core. If we add new glyphs for African Latin which are not in the data.json file, I expect them to list only the African glyphsets. If the glyph already exists in the data.json file, I expect the existing glyph's glyphset list to be updated to include the African glyphsets.
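The update rule described above could look roughly like this. The dictionary layout (a hex codepoint key with a "glyphsets" list) is an assumed, simplified schema for illustration and may not match the real data.json; register_glyph is a hypothetical helper.

```python
def register_glyph(data, char, glyphset):
    """Add `glyphset` to the glyph's list, creating the glyph entry if needed.

    Assumed schema: {"0xXXXX": {"glyphsets": [...]}} (not the real data.json layout).
    """
    key = f"0x{ord(char):04X}"
    entry = data.setdefault(key, {"glyphsets": []})
    if glyphset not in entry["glyphsets"]:
        entry["glyphsets"].append(glyphset)
    return data


# Existing entry: Ẹ (U+1EB8) is assumed here to be listed under the Vietnamese set.
data = {"0x1EB8": {"glyphsets": ["GF_Latin_Vietnamese"]}}

register_glyph(data, "\u0253", "GF_Latin_PriAfrican")  # new glyph ɓ
register_glyph(data, "\u1EB8", "GF_Latin_PriAfrican")  # existing glyph Ẹ
print(data)
```

A new glyph ends up listing only the African glyphset, while an existing glyph keeps its earlier memberships and gains the new one.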

I have no opinions on the actual glyphs included in this pr since you're all much smarter than me.

@vv-monsalve
Contributor

vv-monsalve commented Nov 21, 2023

We can see that the "a" is in Latin kernel and Latin core.

Yes, I saw this, but it only happens for those two glyphsets. If you go to e.g. GF_Latin_Beyond, that "a" is not there because, of course, it is already included in the required base glyphset: Latin_Core. And this is the whole idea of the modular system.

I was surprised to see a letter like "a" repeated in those Glyphsets (kernel + Core), tbh since the modular idea is precisely for each added definition to be built up over the previous one. But I don't know if there was a particular reason or need to repeat the glyphs in Kernel and Core.

Regardless of what we decide (to allow the repetition or not), we should define a consistent approach. I would advocate refraining from repetition and following the DRY (don't repeat yourself) principle in managing this information.

@RosaWagner
Contributor

RosaWagner commented Nov 21, 2023

The Kernel glyphset is not part of the Latin modular system because it is not the minimal required set for fonts supporting Latin languages; it is the minimal set for non-LCG families.

Core is the minimal glyphset tested for Latin support; all families supporting Latin should have Core+ParticularSet. We do not want Kernel+Core+OtherSet because (1) it starts to be complicated, and (2) it would set Kernel as the minimum required (which is not the case). So Core needs to contain all of the minimally required glyphs. We may modify Kernel in the future, and it shouldn’t affect Core.

African, Beyond, Vietnamese etc. are made in addition to Core, so there is no need to repeat codepoints between files. We chose the modular system to make clear that supporting a non-Core set alone would result in incomplete support for Google Fonts. As a matter of fact, these glyphsets were made in the context of Google Fonts, not for users outside this context to know what characters they need to support a certain region etc. Since all GF Latin families support Core, the modular system also brings clarity on the number of glyphs to add to support a particular set.
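The Core+ParticularSet idea can be expressed as a small coverage check. The character sets below are toy data chosen for illustration, and glyphs_to_add is a hypothetical helper, not a fontbakery or glyphsets API.

```python
def glyphs_to_add(font_chars, core, addon):
    """Split what a font still lacks into Core gaps and add-on gaps."""
    return {
        "core": core - font_chars,
        "addon": addon - core - font_chars,
    }


# Toy data: a font that covers a tiny "Core" plus part of an add-on set.
CORE = set("AaNn")
PRI_AFRICAN_ADDON = {"\u014A", "\u014B", "\u0181", "\u0253"}  # Ŋ ŋ Ɓ ɓ
font_chars = set("AaNn") | {"\u014A", "\u014B"}

result = glyphs_to_add(font_chars, CORE, PRI_AFRICAN_ADDON)
print(len(result["core"]), len(result["addon"]))
```

Because the font already covers Core, the count of missing add-on characters is the number of glyphs left to draw for that set.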

Now, if there is still the will to make a comprehensive set to say “if you support this, you support African languages” without caring for the minimal set required (i.e. not working in a modular way and including the basic alphabet everywhere), I honestly don’t see a problem with that. What does it change in the end? Even if all the sets work independently, Core will still be the minimal set required by fontbakery in the googlefonts profile. It would allow Core to be updated without affecting the other sets, and would probably simplify Jan’s tooling to build sets.

@vv-monsalve
Contributor

Kernel glyphset is not part of the Latin modular system because it is not the minimal required set for font supporting latin languages, it is the minimal set for non-LCG families.

Makes sense now, thanks.

@tphinney

tphinney commented Dec 7, 2023

@moyogo @EbenSorkin Are design notes around these characters all accumulated and summarized somewhere other than just this pull request? Things like whether one uses an N or n form for the cap form of a letter.... (Not that this should block the pull request.)

I think @RosaWagner makes a strong argument that this should include anything needed that is in addition to GF Latin Core. That is already the case with the PR as it stands, right?

So the remaining decision is: What name should be used? Dave says it does not matter to him. Seems like there are several options on the table.

Is there a time or process for any remaining decisions blocking completion of the pull request?

@EbenSorkin

EbenSorkin commented Dec 7, 2023 via email

@chrissimpkins
Member Author

@moyogo @EbenSorkin Are design notes around these characters all accumulated and summarized somewhere other than just this pull request? Things like whether one uses an N or n form for the cap form of a letter.... (Not that this should block the pull request.)

I think @RosaWagner makes a strong argument that this should include anything needed that is in addition to GF Latin Core. That is already the case with the PR as it stands, right?

So the remaining decision is: What name should be used? Dave says it does not matter to him. Seems like there are several options on the table.

Is there a time or process for any remaining decisions blocking completion of the pull request?

I can share the documentation with you Thomas. Denis and Neil were involved in defining the final set. The initial attempt was based on my own research on language support targets.

@chrissimpkins
Member Author

chrissimpkins commented Dec 14, 2023

Things like whether one uses an N or n form for the cap form of a letter....

For Eng, the answer might be both, depending on the languages that you intend to support. Mind looping Marianna and me into your repository tracker to discuss it with you? We just reviewed this in Google Sans. I don't believe that there is a way to document this type of requirement in glyphsets. What we likely need is full-fledged documentation rather than lists of codepoints to address this level of detail.

@chrissimpkins
Member Author

I am under the impression that nobody felt strongly enough about Pri and SSA to rename them. I'm not sure what the answer is to the rest of your question. I will be coming back to glyph design notes for Pri and SSA stuff but probably not this week.


The Pri reflects Google business needs and will be used to define some font development contracts. That is why the wording is as it is. It is not intended to indicate priorities for any other development team out there. I wouldn't get hung up on the name. We can revisit it and change if necessary down the road. The key right now is to signal that the data are stable and be able to provide it to development teams.

I don't have strong opinions about superset and subset issues. IMO a designer who is briefed to develop to a Fonts lang support standard probably wants as concise a way to understand that as possible. So, duplication across glyphsets like the SSA areas is likely fine in that sense. If this is a subset of full SSA, then the designer ignores this list and spends their time understanding the full SSA codepoint list if they are commissioned to develop that lang support, and uses this one if they are spec'd to develop projects intended for a smaller set of languages.

I respect all of the feedback here. Let's leave this PR open until next week so that all who have commented to date have an opportunity to weigh in further. If there is no more input, we will merge as is next week. It will be open to additional edits in future PRs.

@chrissimpkins chrissimpkins merged commit c92a5fd into main Jan 2, 2024
9 checks passed
@chrissimpkins chrissimpkins deleted the chrissimpkins-latin-priafrican branch January 2, 2024 21:22
@chrissimpkins
Member Author

Many thanks to all for the feedback here. And a big thanks to @NeilSureshPatel and @moyogo for the review on these codepoints. Greatly appreciated!
