Low quality of Chinese to English #107
Comments
Hi, the problem seems to be this particular model, opusTCv20210807+nopar+ft95-sepvoc_transformer-small-align_2023-03-16. It appears to be a transformer-tiny model, and I haven't tested those with OPUS-CAT (it looks like they don't work). Normally those models are hidden in the model downloader, but this model is for some reason named transformer-small instead of transformer-tiny, so it's not picked up by the hiding function. The performance of the tiny models is usually pretty bad, so adding support for them is not a priority. Could you test with another model to see if it works, for instance the one selected below (opus+bt-2021-04-30):
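To illustrate why a misnamed model slips through, here is a minimal sketch of a name-based hiding filter. This is not OPUS-CAT's actual code: the function `is_hidden`, the marker tuple, and the third model name are hypothetical, and the only thing taken from the thread is that hiding depends on the model name.

```python
def is_hidden(model_name, hidden_markers=("transformer-tiny",)):
    """Return True if the model name contains a marker that should hide it.

    Hypothetical stand-in for the downloader's hiding function.
    """
    return any(marker in model_name for marker in hidden_markers)


models = [
    # Named transformer-small although it appears to be a tiny model.
    "opusTCv20210807+nopar+ft95-sepvoc_transformer-small-align_2023-03-16",
    "opus+bt-2021-04-30",
    # Hypothetical name, used only to show a model that does get hidden.
    "example_transformer-tiny_2023-01-01",
]

# Only names containing a hidden marker are filtered out.
visible = [name for name in models if not is_hidden(name)]
for name in visible:
    print(name)
```

A filter like this can only be as reliable as the naming, which is why the model in question still shows up in the list.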
Thanks for the swift reply - yes, no error with the model you recommended!
(I'm reopening this as an issue about the low quality of the Chinese models.) Would you mind adding a few comments on the types of errors you see with the Chinese model? We don't have the resources to properly validate the quality of all the models, especially for non-European languages and for languages for which there are no widely available public test sets that can be used for comparison. Currently all the Chinese models are multilingual, and this might have some effect on the quality. I wonder if training a bilingual Mandarin to English model would help (although maybe not, since the data is overwhelmingly Mandarin already in the multilingual models). Also, the bulk of the data we have for training Chinese models (https://opus.nlpl.eu/results/zh&en/corpus-result-table) is from the CCMatrix corpus, which is crawled data and has lots of problems. There's a decent amount of UN data, which might be of better quality, so oversampling that might help. If you see any improvement with fine-tuning, I'd appreciate it if you could mention it in this thread.
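For reference, oversampling here just means repeating the sentence pairs from the cleaner corpus so they make up a larger share of the training mix. A minimal sketch, assuming the corpora are already loaded as lists of (source, target) pairs; the function `oversample`, the factor of 3, and the placeholder pairs are illustrative assumptions, not settings used for any existing model.

```python
import random


def oversample(corpora, factors, seed=0):
    """Build a training mix in which some corpora are repeated.

    corpora: dict mapping corpus name to a list of (source, target) pairs.
    factors: dict mapping corpus name to an integer repeat count (default 1).
    """
    mixed = []
    for name, pairs in corpora.items():
        mixed.extend(pairs * factors.get(name, 1))
    # Shuffle with a fixed seed so the repeated pairs are spread out
    # reproducibly instead of forming one block.
    random.Random(seed).shuffle(mixed)
    return mixed


# Toy placeholder data, not real corpus content.
corpora = {
    "CCMatrix": [("zh-1", "en-1"), ("zh-2", "en-2")],
    "UN": [("zh-3", "en-3")],
}
mixed = oversample(corpora, {"UN": 3})
print(len(mixed))
```

Whether this helps would have to be checked empirically, since repeating a small corpus too often can also lead to overfitting on it.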
Sure, I can try to help. I have now tested on new text from another fan translation of a different book (one can find the ZH of that text here and that fan translation here). I regard the fan translations mentioned as the gold standard. Here are the first few sentences on the default model that I can screenshot: (screenshot not preserved) Here is another example from the 5th/6th paragraphs: (screenshot not preserved) Again the fine-tuned one is complete, although the understandability is clearly worse than even a Google Translate rendering of the Chinese page. I am surprised at the improvement given what I think is a rather limited fine-tuning dataset, although it is very much in the same domain (wuxia novels). If there are some formal quality metrics I should report, or if you need more info, let me know.
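On formal metrics: one option that fits the omission problem is a character n-gram F-score in the style of chrF, which weights recall more heavily than precision and so penalizes output that drops parts of the reference. Below is a simplified sketch, not the official chrF implementation; the function names and the example sentences are made up for illustration, and an established tool such as sacreBLEU would be the more standard choice for reported numbers.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after normalizing whitespace."""
    text = " ".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF-style score between 0 and 1 (higher is better)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to have n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta > 1 gives recall more weight than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# Made-up sentences: the first hypothesis omits half of the reference.
reference = "the sword fell to the ground"
truncated_score = simple_chrf("the sword", reference)
complete_score = simple_chrf("the sword fell to the ground", reference)
print(truncated_score < complete_score)
```

Scoring the base and fine-tuned outputs against the fan translation this way would give a rough number to go with the screenshots, with the caveat that a single reference translation of literary text is a noisy yardstick.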
Thanks for pointing this out. Something must be wrong with that base model; those omissions should not happen. This is probably due to some corruption in the training data, which is then overridden by the fine-tuning, which would explain the big quality change. It also looks like the OpenSubtitles corpus, which has a lot of Chinese in it, might not have been used in training the models. I will try to investigate this further when I have time, as having decent Chinese models would be nice. Mozilla is also training models that could potentially be deployed with OPUS-CAT, and they have a Chinese-English model under development: https://github.com/mozilla/firefox-translations-models/tree/main/models/dev/zhen. So rather than retraining the zh-en model, it might make more sense to add the Mozilla models to OPUS-CAT, in case the Mozilla model is clearly better.
How easy is it to add the Mozilla model, or any other, to OPUS-CAT? I am happy to try the same testing examples with their dev model or their production one when that happens.
The Mozilla models are Marian models like the other ones in OPUS-CAT, so it's not that difficult, but it still requires a bunch of changes to the downloader etc. I don't expect to have time to work on it until next year, when I will be changing the model downloader in any case. If you want to test the Mozilla model, it is supposed to be available in the Firefox Nightly build (https://www.mozilla.org/en-US/firefox/133.0a1/releasenotes/), but I couldn't get it working myself.
Understood. If you get time next year to work on it and update this issue, I will try to compare them. Thanks for the tip - I don't suppose it will work for me if it didn't for you, but I'll give it a go!
Hi - I am excited to try your program, as it was recommended to me by a university professor for a fine-tuning project I want to work on.
I am using the Linux release, v1.3.1-beta. I want to translate Chinese to English. Hopefully I installed everything correctly, but after I put in some text to test it, I immediately get the error in red:
Here should be the relevant bit of the log file in ~/.local/share/opuscat/logs
Thank you for any ideas on a fix!