For a romanization project I'm working on, I'm using the polyglot-tokenizer with good results for most of the Indian languages. Are you aware of it? My question is whether NLTK is better at tokenization.
Thank you.
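For reference, a minimal sketch of how the two could be compared on the same Hindi sentence; the `polyglot_tokenizer` calls (`Tokenizer`, `tokenize`) reflect my understanding of that package's API and are an assumption, as is the sample text:

```python
# Sketch: compare polyglot-tokenizer and NLTK on the same Hindi sentence.
# Assumes `pip install polyglot-tokenizer nltk` and that the polyglot_tokenizer
# API is Tokenizer(lang=...) / tokenize(); adjust if the package differs.
import nltk
from nltk.tokenize import word_tokenize
from polyglot_tokenizer import Tokenizer

nltk.download('punkt')  # NLTK's pretrained tokenizer models

text = "यह एक उदाहरण वाक्य है।"  # hypothetical sample sentence

pg_tokens = Tokenizer(lang='hi').tokenize(text)   # language-aware tokenizer
nltk_tokens = word_tokenize(text)                 # generic, English-oriented rules

print("polyglot:", pg_tokens)
print("nltk:    ", nltk_tokens)
```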
Thank you for testing it! I was aware of NLTK, but in the end I preferred polyglot-tokenizer because of its extended support for Indian languages. I'm currently looking at the Byte Pair Encoding approach to get rid of language-specific models and build a cross-lingual model; I saw you are working on the same thing. In my case I also do Indian language classification (the source is Wikipedia as well, for Indian-script languages), and most of the problems were actually due to the tokenizer rather than the classifier itself. Hopefully BPE will give better results for a language-agnostic approach!
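As a concrete illustration of the language-agnostic direction, here is a minimal sketch of training a single shared BPE vocabulary with the sentencepiece library over a mixed-language corpus; the file names and parameter values are placeholders, not the actual setup used here:

```python
# Sketch: train one BPE model over a multilingual corpus so that a single
# subword vocabulary covers several Indian-script languages.
# `corpus.txt` and the parameter values are illustrative placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='corpus.txt',        # mixed-language text, one sentence per line
    model_prefix='bpe_multi',
    vocab_size=8000,
    model_type='bpe',
    character_coverage=1.0,    # keep full coverage of Indic scripts
)

sp = spm.SentencePieceProcessor(model_file='bpe_multi.model')
print(sp.encode("यह एक उदाहरण वाक्य है।", out_type=str))  # subword pieces
```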