Thai #68

scossu · 2023-11-08T13:40:59Z

Add Thai support.

scossu · 2023-11-08T13:46:07Z

Plangsarn for Thai:
The link is http://164.115.23.167/plangsarn/. It was developed jointly by Thammasat University and the National Electronics and Computer Technology Center of Thailand. It is free to use for anyone with Internet access and follows the ALA-LC romanization table. The converter's accuracy is about 90%.

This is just informational. We may or may not want to integrate it. First we may want to know if the same is achievable with a table and hooks. According to https://www.loc.gov/catdir/cpso/romanization/thai.pdf the process is not trivial.

scossu · 2023-12-04T00:54:41Z

Fixed with #79.

scossu · 2024-02-26T14:16:08Z

Unfortunately, Aksharamukha does not provide ALA-LC compatibility. There are two options:

- Adapt Plangsarn (if we have access to the source code to reverse-engineer)
- Start from scratch under catalogers' supervision.

scossu · 2024-07-03T13:13:49Z

From 2024-06-20 meeting with LC catalogers and further discussion, we agreed on the following:

- Script-to-Roman (S2R) transliteration is not deterministic because ALA-LC romanization adds spaces between words that are not present in the source script.
- Roman-to-script (R2S) transliteration is not deterministic because romanized Thai merges multiple forms of the same letter into one, and the original form cannot be recovered from the ALA-LC romanization.
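To make the R2S point concrete, here is a minimal sketch. It assumes a tiny illustrative subset of consonant mappings (not the full ALA-LC table, and ignoring positional rules); the names `S2R_SUBSET` and `candidates_by_roman` are hypothetical. It shows that once several Thai letters share one Roman letter, inverting the table yields a list of candidates rather than a single answer.

```python
from collections import defaultdict

# Illustrative subset only: three Thai consonants that romanize as "s",
# plus one unambiguous consonant for contrast. Not the full ALA-LC table.
S2R_SUBSET = {
    "ศ": "s",
    "ษ": "s",
    "ส": "s",
    "ม": "m",
}


def candidates_by_roman(table):
    """Group Thai letters by their Roman output (the inverse relation)."""
    inverse = defaultdict(list)
    for thai, roman in table.items():
        inverse[roman].append(thai)
    return dict(inverse)


R2S_CANDIDATES = candidates_by_roman(S2R_SUBSET)

# "m" maps back to exactly one letter, but "s" has three candidates,
# so the original spelling cannot be recovered from "s" alone.
ambiguous = {r: t for r, t in R2S_CANDIDATES.items() if len(t) > 1}
```

Any Roman letter that ends up in `ambiguous` needs outside knowledge (a dictionary or a cataloger) to be resolved.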

Currently, the most accurate tool available (for S2R ONLY) is Plangsarn. The integration capabilities of this tool are still TBD.

LC Thai catalogers' workflow needs the R2S function because romanized text is entered first in the cataloging process.

Given the above conditions, the ideal solution (in terms of accuracy, not time efficiency) would be to propose significant changes to the ALA-LC tables so that at least R2S transliteration becomes lossless, by mapping each Thai character variant to a distinct Roman character with an added diacritic. (This may be a long-running task, but it would be the most beneficial for the community.)
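A rough sketch of what such a lossless table would allow. The diacritic assignments below are invented purely for illustration and are not part of ALA-LC or of any actual proposal; `LOSSLESS_S2R`, `invert_table`, `s2r` and `r2s` are hypothetical names. The point is only that a one-to-one table can be inverted mechanically and round-trips exactly.

```python
# Hypothetical one-to-one table: each Thai letter gets its own Roman character.
# The diacritics are made up for this example, NOT taken from ALA-LC.
LOSSLESS_S2R = {
    "ศ": "\u015b",  # s with acute (precomposed)
    "ษ": "\u1e63",  # s with dot below (precomposed)
    "ส": "s",
}


def invert_table(table):
    """Invert a mapping, refusing tables where two keys share one value."""
    if len(set(table.values())) != len(table):
        raise ValueError("table is not one-to-one; inversion would lose data")
    return {roman: thai for thai, roman in table.items()}


LOSSLESS_R2S = invert_table(LOSSLESS_S2R)


def s2r(text):
    """Romanize character by character; unknown characters pass through."""
    return "".join(LOSSLESS_S2R.get(ch, ch) for ch in text)


def r2s(text):
    """Convert back to Thai script using the inverted table."""
    return "".join(LOSSLESS_R2S.get(ch, ch) for ch in text)
```

With this table `r2s(s2r(text))` returns the original text, whereas a table that sends all three letters to plain "s" would make `invert_table` raise.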

Similarly, removing artificial word splitting from S2R transliteration can be proposed. It is unknown at the moment how the community will react to this proposal. @RandyBarry will lead this initiative.

For S2R spacing issues, I'm testing the integration of ML-based part-of-speech analysis tools such as https://huggingface.co/KoichiYasuoka/roberta-base-thai-spm-upos, which have yielded good results (though not 100% accurate, especially on words that can be ambiguously compounded). This seems viable as an interim solution until a deterministic one based on an updated ALA-LC table is available.
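As a sketch of how tagger output could feed the spacing step: the function below assumes the tagger has already produced word boundaries as `(start, end)` character offsets into the unspaced Thai string. That span format, the function name `insert_word_spaces`, and the sample segmentation are assumptions for illustration, not the actual integration.

```python
def insert_word_spaces(text, spans):
    """Join the words delimited by (start, end) offsets with single spaces.

    `spans` is assumed to come from a word segmenter or POS tagger and to
    cover the text in order without overlaps.
    """
    previous_end = 0
    words = []
    for start, end in spans:
        if start < previous_end or end <= start or end > len(text):
            raise ValueError(f"invalid or overlapping span: {(start, end)}")
        words.append(text[start:end])
        previous_end = end
    return " ".join(words)


# Hand-written spans standing in for tagger output (illustrative only).
sample = "ภาษาไทย"
spaced = insert_word_spaces(sample, [(0, 4), (4, 7)])
```

Keeping the tagger behind a plain list of spans means an ambiguous compound can be corrected by editing the spans, without touching the romanization table.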

scossu added the script label Nov 8, 2023

scossu self-assigned this Nov 8, 2023

scossu added the help wanted label Nov 8, 2023

scossu closed this as completed Dec 4, 2023

scossu added this to the Phase 3 milestone Feb 26, 2024

scossu reopened this Feb 26, 2024
