Hi, have you ever measured metrics like Recall@1 or Recall@100 accuracy on any information-retrieval tasks and compared the results to other Vietnamese tokenizing models, say, VnCoreNLP?
On my own datasets, VnCoreNLP is a little better than CocCocTokenizer (I use the basic BM25 score).
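For reference, the comparison described above can be sketched as a minimal evaluation loop: score tokenized documents against a tokenized query with basic BM25, rank them, and measure Recall@k. This is an illustrative sketch only, not the actual evaluation code used here; the function names and parameter defaults (`k1=1.5`, `b=0.75`) are assumptions, and in practice the token lists would come from the tokenizer being compared (CocCocTokenizer or VnCoreNLP).

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with basic BM25."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(score)
    return scores

def recall_at_k(ranked_doc_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k ranking."""
    return len(set(ranked_doc_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

# Hypothetical usage: rank documents by BM25 score, then evaluate Recall@1.
docs = [["hà", "nội"], ["sài", "gòn"]]          # tokenizer output per document
query = ["hà", "nội"]
scores = bm25_scores(query, docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(recall_at_k(ranking, relevant_ids=[0], k=1))
```

Running the same loop twice, once per tokenizer, and averaging Recall@k over a query set is enough to reproduce the kind of comparison reported above.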
I haven't worked at the company for quite a while now, but I'll reply here.
As I recall, the goal when building this tokenizer was efficiency over precision, so that the memory footprint stays relatively low and the throughput high. We had to process billions of pages with it daily.
It is important to note that we tried to keep the cases where it is wrong reasonable from a text-matching perspective: even when coccoc-tokenizer is incorrect from a linguistics standpoint, the error should not affect search-ranking metrics.
Hope that helps. :)
duongkstn changed the title from "CocCocTokenizer is wors than VnCoreNLP Tokenizer in information-retrieval tasks" to "CocCocTokenizer is worse than VnCoreNLP Tokenizer in information-retrieval tasks" on Jul 5, 2023