CocCocTokenizer is worse than VnCoreNLP Tokenizer in information-retrieval tasks #28

Open
duongkstn opened this issue Jul 4, 2023 · 1 comment

Comments

@duongkstn

Hi, have you ever measured metrics like Recall@1 or Recall@100 on any information retrieval task and compared the results with other Vietnamese tokenizers, say, VnCoreNLP?

On my own datasets, VnCoreNLP is a little bit better than CocCocTokenizer (I use the basic BM25 score).
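
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of Recall@k over basic BM25 scores. It is not the setup used above: the function names, the `k1`/`b` defaults, the toy corpus, and the whitespace `tokenize` placeholder are all assumptions, and it assumes exactly one relevant document per query. To compare tokenizers, you would pass a wrapper around CocCocTokenizer or VnCoreNLP as `tokenize` and keep everything else fixed.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every document against the query with basic Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def recall_at_k(queries, relevant_ids, docs, tokenize, k):
    """Fraction of queries whose single relevant document is ranked in the top k."""
    docs_tokens = [tokenize(d) for d in docs]
    hits = 0
    for query, rel in zip(queries, relevant_ids):
        scores = bm25_scores(tokenize(query), docs_tokens)
        ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
        hits += rel in ranked[:k]
    return hits / len(queries)


def whitespace_tokenize(text):
    """Placeholder tokenizer (assumption); swap in the tokenizer under test."""
    return text.lower().split()


# Toy data, invented for illustration only.
docs = [
    "hà nội là thủ đô của việt nam",
    "phở là món ăn nổi tiếng",
    "sông hồng chảy qua hà nội",
]
queries = ["thủ đô việt nam", "món phở"]
relevant_ids = [0, 1]

recall_1 = recall_at_k(queries, relevant_ids, docs, whitespace_tokenize, k=1)
```

Running the same `recall_at_k` call once per tokenizer, on the same queries and documents, gives numbers that differ only because of the segmentation.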

@bachan
Member

bachan commented Jul 4, 2023

I haven't been working at the company for quite a while, but I'll reply here.

I guess the goal when building this tokenizer was efficiency over precision, so that the memory footprint is relatively low and the throughput is high. We had to process billions of pages with that thing daily.

It is important to note that we tried to keep the cases where it's wrong reasonable from a text matching perspective. This means that even when coccoc-tokenizer is incorrect from a linguistic standpoint, the error still doesn't affect search ranking metrics.

Hope that helps. :)

@duongkstn changed the title from "CocCocTokenizer is wors than VnCoreNLP Tokenizer in information-retrieval tasks" to "CocCocTokenizer is worse than VnCoreNLP Tokenizer in information-retrieval tasks" on Jul 5, 2023