Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design a cohesive solution for supported locales #58

Open
sffc opened this issue Apr 27, 2020 · 61 comments · May be fixed by #4607
Open

Design a cohesive solution for supported locales #58

sffc opened this issue Apr 27, 2020 · 61 comments · May be fixed by #4607
Assignees
Labels
A-design Area: Architecture or design C-meta Component: Relating to ICU4X as a whole S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality U-ecma402 User: ECMA-402 compatibility

Comments

@sffc
Copy link
Member

sffc commented Apr 27, 2020

I very often see clients who want to use ICU as a default behavior, but fall back to custom logic if ICU does not support a given locale.

The main problem, of course, is that the locale fallback chain is an essential piece of whether or not a locale is supported. If you have locale data for "en" and "en_001", but request "en_US" or "en_GB", the answer is that both of those locales are supported, even though they both load their data from a fallback locale.

I'm not 100% confident, but I think the prevailing use case is that programmers want to know whether the locale falls back all the way to root. If it gets "caught" by an intermediate language, then that's fine, as long as we don't use the stub data in root.

ECMA-402 has the concept of supportedLocalesOf. Although it's not clear in the MDN, it appears that this method has the ability to check for locale fallbacks. This is better than ICU4C's behavior of getAvailableLocales, which returns a string list and requires the user to figure out how to do fallbacks and matching on that list.

We could consider whether this use case fits in with the data provider, or whether we want to put it on APIs directly.

@sffc sffc added T-core Type: Required functionality help wanted Issue needs an assignee labels Apr 27, 2020
@sffc sffc added C-meta Component: Relating to ICU4X as a whole A-design Area: Architecture or design labels May 7, 2020
@sffc sffc added this to the 2020 Q3 milestone Jun 17, 2020
@sffc
Copy link
Member Author

sffc commented Jun 24, 2020

Putting this on the Version 1 backlog. We still need help on this.

@sffc sffc closed this as completed Jun 24, 2020
@sffc sffc removed this from the 2020 Q3 milestone Jun 24, 2020
@sffc sffc reopened this Sep 4, 2020
@sffc sffc added the discuss Discuss at a future ICU4X-SC meeting label Feb 20, 2021
@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label Mar 14, 2021
@sffc sffc self-assigned this Mar 28, 2022
@sffc sffc added this to the ICU4X 1.0 milestone Mar 28, 2022
@sffc sffc added S-medium Size: Less than a week (larger bug fix or enhancement) and removed backlog help wanted Issue needs an assignee labels Mar 28, 2022
@sffc
Copy link
Member Author

sffc commented Jun 22, 2022

Here's my current thinking.

There are three levels of locale support in a data provider:

  1. Stored Locales: Given a particular key, these are the locales that have an exact match without any fallbacking. This list is largely internal to the data provider. This is what we currently expose in the iterable data provider.
  2. Resolved Locales: These are the locales that are expressly guaranteed to have an exact match over all keys that the provider supports. This is what can be publicly consumable to answer the question of "supported locales". Resolved locales should generally include the language-script plus zero or more regional variants.
  3. Fallbackable Locales: This is the unbounded set of locales such that when vertical fallback is enabled, a resolved locale is reachable.

A particular provider must store metadata listing the resolved locales, which increases the complexity. I don't see a way around this. In the fs provider, the list is stored in the manifest.json file, and in the blob provider, it is stored alongside the big zeromap.

The list is could be exposed through the following trait:

pub trait SupportedLocales {
    /// Returns resolved locales or MissingResourceKey if the key is not supported
    fn supported_locales_for_key(&self, key: ResourceKey) -> Result<Vec<ResourceOptions>, DataError>;
}

In ForkByKeyProvider, the implementation works the same as load_payload: we loop over the providers until finding one that doesn't return MissingResourceKey.

To answer the ECMA-402 question "what are the supported locales for NumberFormat", you use this API with the key decimal/symbols@1. When we add more versions of these data keys, we should likely loop over all versions of the key that our code supports and take the union.

Another mode that aligns closer to ECMA-402 is

pub trait SupportedLocales {
    /// Returns the subset of locales that are fallbackable locales.
    fn supported_locales_of(&self, key: ResourceKey, locales: Vec<ResourceOptions>) -> Result<Vec<ResourceOptions>, DataError>;
}

ForkByKeyProvider would again look for the first provider that supports the key, and then return its result.

An issue here is that vertical fallback would need to be invoked. Therefore, rather than having this API, it may be better and cheaper to just run the full vertical fallback stack, but stop short of deserializing/downcasting.

@sffc
Copy link
Member Author

sffc commented Jun 23, 2022

I propose considering what I proposed above as the course of action, and removing this from the 1.0 critical path.

@jedel1043
Copy link
Contributor

jedel1043 commented Dec 14, 2022

I've been implementing the Intl spec for the past couple of months, and I can give a bit of my perspective on this issue.

From what I could see while implementing the locale resolution algorithms defined in the ECMA402 spec, what the API seems to try to accomplish is to determine if a specific locale will return "correct" results if used as the locale of a specific service. Everything else is just taking that and extending it to several different APIs that filter/choose/tune a list of user-provided locales to ensure that all locales passed to the services are always "valid" in a sense.

Maybe this means that the providers don't explicitly need a SupportedLocales feature, but more like a way to pass them a locale and a key/service to know if that key/service using that locale will return "correct" results.

@hsivonen
Copy link
Member

Collator is special in the sense that in the absence of natural-language output, und is more applicable as fallback than for services that involve natural-language output. Currently, if you request e.g. en, the collator in ICU4X falls back to und, which is correct in terms of comparison behavior, but existing Web-exposed behavior of Intl.Collator is that languages for which the root collation is known to be valid (without reordering), such as en, fr, etc., are supposed to behave in the outward API as if language-specific data existed for them. Furthermore, in Firefox and Chrome 1) und is treated as unsupported by supportedLocalesOf and 2) locales with actually-unsupported subtags are treated as supported by supportedLocalesOf if the language counts as supported.

console.log(Intl.Collator.supportedLocalesOf(['und', 'und-u-co-emoji', 'ban', 'id-u-co-pinyin', 'de-ID', "en", "fr", "el"])); logs the array [ "id-u-co-pinyin", "de-ID", "en", "fr", "el" ] in both Firefox and Chrome.

(In Safari, the array is prepended with "en-US-u-va-posix", which is just weird. Safari turns "und" (but not "und-u-co-emoji") into that, even though in CLDR the POSIX variant of English is not the same as the root. Weird.)

Boa currently logs the whole input array, because of "// TODO: ugly hack to accept locales that fallback to "und" in the collator/segmenter services".

It's not clear to me how "Resolved Locales" above should capture the "root is known to be valid" concept. As I understand it, we don't currently store this information. To the extent the provider infrastructure supports aliases, I guess one possibility would be list en, fr, etc. as aliases of und.

@sffc sffc modified the milestones: 1.5 Blocking ⟨P1⟩, ICU4X 2.0 May 23, 2024
@sffc sffc mentioned this issue Jul 2, 2024
@sffc
Copy link
Member Author

sffc commented Jul 2, 2024

We're landing DryDataProvider, the successor to can_load, in #5141. Still want to decide whether the trait function is default-impled. The discussion in my previous post above says that we would not have a default impl in 2.0.

@sffc sffc moved this from Unclaimed for sprint to Needs discussion to unblock in icu4x 2.0 Jul 2, 2024
robertbastian added a commit that referenced this issue Jul 4, 2024
@robertbastian robertbastian added the discuss-priority Discuss at the next ICU4X meeting label Sep 4, 2024
@robertbastian
Copy link
Member

  • @sffc - Could people call the impldry macros directly?
  • @robertbastian - Only on their own structs, which might duplicate data
  • @sffc - Maybe we should just pick a key and impl it for that one key
  • @robertbastian - Or just make it a blanket impl like BlobDataProvider does

Move this issue to needs-approval to discuss above point.

@robertbastian robertbastian added needs-approval One or more stakeholders need to approve proposal and removed discuss-priority Discuss at the next ICU4X meeting labels Sep 5, 2024
@sffc
Copy link
Member Author

sffc commented Sep 5, 2024

Making clients poke into the internals and pick a key with DryDataProvider and be surprised by the results and break when we change things is just not a great solution. This is an ECMA-402 thing, so we should make an API (however minimal) for it. ECMA-402 only requires supported_locales_of.

I posted my solution in #5504.

DryDataProvider solves the problem, but only if it has retain-base-languages behavior. We rejected retain-base-languages for default compiled data in #5195 and I stand strongly by that. However, we should still be able to implement this function in icu_plurals.

@sffc sffc assigned robertbastian and unassigned sffc Sep 5, 2024
@sffc
Copy link
Member Author

sffc commented Sep 5, 2024

My preferred fix is to make it such that impl DryDataProvider for provider::Baked has retain_base_languages behavior, even if impl DataProvider for provider::Baked has maximal deduplication behavior.

@sffc
Copy link
Member Author

sffc commented Sep 5, 2024

If we cannot agree to an in-library solution, then:

  1. We explicitly state that we are choosing to not support ECMA-402 supportedLocalesOf and resolvedOptions().locale out of the box
  2. Add a tutorial illustrating how a client who needs this behavior can implement it themselves

@sffc
Copy link
Member Author

sffc commented Sep 5, 2024

If we can't agree on an in-library solution, we need to figure out a userland solution that doesn't involve duplicating data. A few approaches we could take:

  1. impl DryDataProvider for provider::Baked for our chosen key either all the time or behind a cfg
  2. Change baked_exporter such that impl DryDataProvider does not depend on impl DataProvider
  3. Keep track somewhere of whether the provider was exported with retain_base_languages (in Baked, this can be done via make_provider!, and in blob, it can be another internal marker that we add) and make supported_locales_of return an error if the data was not built with retain_base_languages

@sffc
Copy link
Member Author

sffc commented Sep 5, 2024

I implemented a sketch of how option 3 could work in #5506.

Another option is that we could use something similar to #5506 but just record the locales that were exported in the data exporter. This was suggested earlier in the thread. The downside is that it gives you the locales for the exporter overall, not the locales for a specific key, so it might not work for LocaleFamily::FULL, for example. However, since we don't use LocaleFamily::FULL in ICU4X compiled data, it would be fine.

@robertbastian
Copy link
Member

Some solutions:

  1. DryDataProvider always has retain-base-language behavior, but DataProvider stays with the maximal deduplication behavior.
  2. SourceInfo contains the deduplication strategy
  3. SourceInfo contains the list of locales from datagen
  4. Add to docs, "Warning: this only works if you're using custom compiled data with retain-base-languages!"
  5. Have DRY_IF_RETAIN in the impl macro and put bounds on supported_locales_of so that it won't work unless you have custom compiled data with those bounds.
  6. Ship default compiled data with retain-base-languages

Discussion:

  • @robertbastian The way you implemented icu_plurals::provider::supported_locales_of as a per-crate API does not scale to other crates. We would need need this function basically for every constructor: icu_plurals::PluralRules::supported_locales_of
  • @sffc It shouldn't be up to the client to know what key they should invoke DryDataProvider on. I agree with adding static functions on the types.
  • @robertbastian I don't like the SourceInfo stuff.
  • @hsivonen It was mentioned that people who want this info would need to regen baked data. What are the practical consequences of doing that? If an application wants to build their own data / replace icu_collator_data, is that practical?
  • @robertbastian You set an environment variable.
  • @sffc If DryDataProvider always has retain-base-languages behavior, then most impls can share the trie. Something like 90% will be identical impls.
  • @robertbastian I'm not convinced; we don't control users' data sources. Also, there are advantages of retain-base-languages in your own data.
  • @sffc Do we need DRY_IF_RETAIN in 2.0?
  • @robertbastian It's the provider module, so we don't need to be stable, but we can add it semver-compatible way so long as we don't ship DRY in 2.0

Proposal:

  • Don't need to land SourceInfo at this time; stick with retain-base-languages as the solution to the supported-locales and resolved-locales problems
  • Make the nicer user-facing API that depends on DRY_IF_RETAIN, but does not need to land in 2.0
  • For now, don't ship impl DryDataProvider for provider::Baked, unless a client asks for it, in which case we add DRY and provide instructions with the caveat that it could break in 2.x

LGTM: @sffc @robertbastian

@sffc sffc moved this from Needs discussion to unblock to Being worked on in icu4x 2.0 Sep 19, 2024
@sffc sffc removed the needs-approval One or more stakeholders need to approve proposal label Sep 19, 2024
robertbastian added a commit that referenced this issue Sep 19, 2024
@robertbastian
Copy link
Member

What's left to do is to activate DRY on chosen data markers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-design Area: Architecture or design C-meta Component: Relating to ICU4X as a whole S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality U-ecma402 User: ECMA-402 compatibility
Projects
Status: Being worked on
Development

Successfully merging a pull request may close this issue.

6 participants