Thai #68
Comments
This is just informational. We may or may not want to integrate it. First, we may want to know whether the same is achievable with a table and hooks. According to https://www.loc.gov/catdir/cpso/romanization/thai.pdf, the process is not trivial.
Fixed with #79.
Unfortunately, Aksharamukha does not provide ALA-LC compatibility. There are two options:
From the 2024-06-20 meeting with LC catalogers and further discussion, we agreed on the following:
- Currently, the most accurate tool available (for S2R only) is Plangsarn. Its integration capabilities are still TBD.
- LC Thai catalogers' workflow needs the R2S function, because Romanized text is entered first in the cataloging process.
- Given these conditions, the ideal approach (in terms of accuracy, not time efficiency) would be to propose significant changes to the ALA-LC tables so that at least R2S transliteration becomes lossless, by mapping each Thai character variant to a Roman character with a distinguishing diacritic. This may be a long-running task, but it would be the most beneficial one for the community.
- Similarly, removing artificial word splitting from S2R transliteration can be proposed.
- It is unknown at the moment how the community will react to this proposal. @RandyBarry will lead this initiative.
- For S2R spacing issues, I'm testing the integration of ML-based part-of-speech analysis tools such as https://huggingface.co/KoichiYasuoka/roberta-base-thai-spm-upos, which have yielded good results but are not 100% accurate, especially on words that can be ambiguously compounded. This seems viable as an interim solution until a deterministic one based on an updated ALA-LC table is available.
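To illustrate the lossless R2S point, here is a minimal sketch of why a many-to-one table cannot be reversed and how a distinguishing diacritic would fix that. The lossy table reflects my understanding that several Thai consonants romanize to `kh` under ALA-LC. The diacritic assignments in `LOSSLESS_S2R` and the helper names are purely hypothetical, not part of any existing or proposed table.

```python
# Illustrative subset only; to my understanding these consonants all
# romanize to "kh" under the current ALA-LC table.
LOSSY_S2R = {"ก": "k", "ข": "kh", "ค": "kh", "ฆ": "kh"}

# HYPOTHETICAL lossless variant: each consonant gets a distinct Roman form
# by adding a combining diacritic. These assignments are invented here.
LOSSLESS_S2R = {
    "ก": "k",
    "ข": "kh",
    "ค": "kh\u0323",  # "kh" + combining dot below
    "ฆ": "kh\u0331",  # "kh" + combining macron below
}


def invert(table):
    """Map each Roman form to the list of Thai characters that produce it."""
    inverse = {}
    for thai, roman in table.items():
        inverse.setdefault(roman, []).append(thai)
    return inverse


def is_lossless(table):
    """A table is reversible only if every Roman form has exactly one source."""
    return all(len(sources) == 1 for sources in invert(table).values())


# With a lossless table, R2S becomes a plain dictionary lookup.
R2S = {roman: thai for thai, roman in LOSSLESS_S2R.items()}
```

With the lossy table, `invert(LOSSY_S2R)["kh"]` holds three candidates, so R2S has to guess; with the hypothetical lossless table every Roman form maps back to exactly one Thai character.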
Add Thai support.