Thai #68

Open
scossu opened this issue Nov 8, 2023 · 4 comments

Collaborator

scossu commented Nov 8, 2023

Add Thai support.

@scossu scossu added the script label Nov 8, 2023
@scossu scossu self-assigned this Nov 8, 2023
@scossu scossu added the help wanted label Nov 8, 2023
Collaborator Author

scossu commented Nov 8, 2023

Plangsarn for Thai:
Link: http://164.115.23.167/plangsarn/. It was developed jointly by Thammasat University and the National Electronics and Computer Technology Center of Thailand. It is free to use for anyone with Internet access and follows the ALA-LC romanization table. The converter's accuracy is reported to be around 90%.

This is just informational. We may or may not want to integrate it. First, we should find out whether the same result is achievable with a table and hooks. Judging by https://www.loc.gov/catdir/cpso/romanization/thai.pdf, the process is not trivial.

Collaborator Author

scossu commented Dec 4, 2023

Fixed with #79.

@scossu scossu closed this as completed Dec 4, 2023
@scossu scossu added this to the Phase 3 milestone Feb 26, 2024
Collaborator Author

scossu commented Feb 26, 2024

Unfortunately, Aksharamukha does not provide ALA-LC compatibility. There are two options:

  1. Adapt Plangsarn (if we have access to the source code to reverse-engineer)
  2. Start from scratch under catalogers' supervision.

@scossu scossu reopened this Feb 26, 2024
Collaborator Author

scossu commented Jul 3, 2024

From the 2024-06-20 meeting with LC catalogers and further discussion, we agreed on the following:

  1. Script-to-Roman (S2R) transliteration is not deterministic because ALA-LC romanization adds spaces between words that are not present in the source script.
  2. Roman-to-script (R2S) transliteration is not deterministic because ALA-LC romanization maps several distinct Thai letters to the same Roman letter, so the original letter cannot be recovered from the romanized form.
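
Point 2 can be illustrated with a minimal sketch. It assumes a tiny excerpt of the S2R consonant mapping (ซ, ศ, ษ, ส all romanized as "s", and ณ, น as "n"), which should be double-checked against the ALA-LC table. The names `S2R_EXCERPT` and `invert` are made up for illustration and do not refer to any existing code.

```python
from collections import defaultdict

# ASSUMPTION: small illustrative excerpt, to be verified against the
# ALA-LC Thai table. It is not a complete or authoritative mapping.
S2R_EXCERPT = {
    "ซ": "s", "ศ": "s", "ษ": "s", "ส": "s",
    "ณ": "n", "น": "n",
    "ม": "m",
}


def invert(table):
    """Return Roman -> list of candidate Thai letters."""
    inverse = defaultdict(list)
    for thai, roman in table.items():
        inverse[roman].append(thai)
    return dict(inverse)


R2S_CANDIDATES = invert(S2R_EXCERPT)

# Any Roman letter with more than one candidate cannot be reverted
# without extra information (a dictionary, context, or human input).
AMBIGUOUS = {r: c for r, c in R2S_CANDIDATES.items() if len(c) > 1}

if __name__ == "__main__":
    for roman, candidates in AMBIGUOUS.items():
        print(roman, "<-", candidates)
```

In this excerpt only "m" has a single candidate; "s" and "n" each fan out to several Thai letters, which is exactly what makes R2S lossy.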

Currently, the most accurate tool available (for S2R ONLY) is Plangsarn. The integration capabilities of this tool are still TBD.

The LC Thai catalogers' workflow needs the R2S function because romanized text is entered first in the cataloging process.

Given the above, the ideal solution (in terms of accuracy, not time efficiency) would be to propose significant changes to the ALA-LC tables so that at least R2S transliteration becomes lossless, by mapping each Thai character variant to its own Roman character with a distinguishing diacritic. (This may be a long-running task, but it would be the most beneficial for the community.)
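
As a rough sketch of what "lossless" would mean in practice: if every Thai letter gets a distinct Roman form, the table can be inverted mechanically and a round trip returns the original text. The diacritics below (`ś`, `ṣ`) are purely hypothetical placeholders, not a proposal for the actual table, and the function names are invented for this example.

```python
import unicodedata

# HYPOTHETICAL one-to-one mapping, for illustration only. The real
# choice of diacritics would be up to the ALA-LC revision process.
HYPOTHETICAL_S2R = {"ส": "s", "ศ": "ś", "ษ": "ṣ"}


def build_inverse(table):
    """Invert a table, refusing to do so if it is not one-to-one."""
    inverse = {
        unicodedata.normalize("NFC", roman): thai
        for thai, roman in table.items()
    }
    if len(inverse) != len(table):
        raise ValueError("table is not one-to-one; R2S would be lossy")
    return inverse


HYPOTHETICAL_R2S = build_inverse(HYPOTHETICAL_S2R)


def s2r(text):
    """Character-by-character S2R; unknown characters pass through."""
    romanized = "".join(HYPOTHETICAL_S2R.get(ch, ch) for ch in text)
    return unicodedata.normalize("NFC", romanized)


def r2s(text):
    """Character-by-character R2S using the inverted table."""
    text = unicodedata.normalize("NFC", text)
    return "".join(HYPOTHETICAL_R2S.get(ch, ch) for ch in text)


if __name__ == "__main__":
    sample = "ศษส"
    print(r2s(s2r(sample)) == sample)
```

The `build_inverse` check is the useful part: run against the current many-to-one mappings it would raise, which gives a quick way to test whether a proposed table revision is actually reversible.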

Similarly, we could propose removing the artificial word splitting from S2R transliteration. It is not yet known how the community will react to this proposal. @RandyBarry will lead this initiative.

For the S2R spacing issue, I'm testing the integration of ML-based part-of-speech analysis tools such as https://huggingface.co/KoichiYasuoka/roberta-base-thai-spm-upos, which have yielded good results (though not 100% accurate, especially on words that can be ambiguously compounded). This seems viable as an interim solution until a deterministic one based on an updated ALA-LC table is available.
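
To show why ambiguous compounds are the hard case, here is a small sketch that does not use the model above at all. It assumes a toy lexicon as a stand-in for whatever segmenter ends up being integrated, and simply enumerates every way a string can be split into known words. `TOY_LEXICON` and `segmentations` are names invented for this example.

```python
# ASSUMPTION: a toy lexicon standing in for a real segmenter or tagger.
TOY_LEXICON = {"ภาษา", "ไทย", "ตา", "กลม", "ตาก", "ลม"}


def segmentations(text, lexicon):
    """Return every split of `text` into words found in `lexicon`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for tail in segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results


if __name__ == "__main__":
    for source in ("ภาษาไทย", "ตากลม"):
        options = segmentations(source, TOY_LEXICON)
        # Each option is one candidate placement of word boundaries,
        # i.e. of the spaces that romanization would need to insert.
        print(source, [" ".join(words) for words in options])
```

With this lexicon, ภาษาไทย has a single split, while ตากลม can be read as either ตา + กลม or ตาก + ลม. A table alone cannot pick between the two, so the choice has to come from context, which is where a statistical tagger helps and also where it can be wrong.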
