Numbers getting changed after translation #24

Open
AM-ash-OR-AM-I opened this issue Oct 15, 2023 · 8 comments

Comments

@AM-ash-OR-AM-I

AM-ash-OR-AM-I commented Oct 15, 2023

I've deployed the model, and during inference I get the following:

```json
{
  "text": "*Apply Euclid's division algorithm to determine the Highest Common Factor (HCF) of $231$ and $396$.\n\n",
  "translated_text": " * ಯುಕ್ಲಿಡ್ನ ಡಿವಿಷನ್ ಅಲ್ಗಾರಿದಮ್ಅನ್ನು ಅನ್ವಯಿಸಿ, ಅತಿ ಹೆಚ್ಚು ಸಾಮಾನ್ಯ ಅಂಶವನ್ನು (ಎಚ್ಸಿಎಫ್) ನಿರ್ಧರಿಸಲು $239 ಮತ್ತು $396."
}
```

231 -> 239 (the Kannada output back-translates to "Apply Euclid's division algorithm to determine the Highest Common Factor (HCF) of $239 and $396"). The issue seems to occur only when $ is present; otherwise the numbers are preserved. What's the reason for this, and is there a possible solution?

@PranjalChitale
Collaborator

You can use our inference pipeline, which should handle these cases. You can follow the steps described here.

We tried the same example on our demo and it worked fine; the numbers were preserved.
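For reference, the HF-based flow looks roughly like the sketch below (assuming the `ai4bharat/indictrans2-en-indic-1B` checkpoint and the `IndicProcessor` preprocess/postprocess API from the companion toolkit; the linked instructions are authoritative, and the exact decode arguments may differ across toolkit versions):

```python
# Sketch of the recommended inference pipeline; see the repo's README
# for the authoritative steps.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit import IndicProcessor  # assumed package name

ckpt = "ai4bharat/indictrans2-en-indic-1B"
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt, trust_remote_code=True)
ip = IndicProcessor(inference=True)

src = ["Apply Euclid's division algorithm to determine the HCF of $231$ and $396$."]

# preprocess_batch wraps URLs/emails/patterns in <IDk> placeholders
batch = ip.preprocess_batch(src, src_lang="eng_Latn", tgt_lang="kan_Knda")
inputs = tokenizer(batch, padding="longest", return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, num_beams=5, max_length=256)
decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)

# postprocess_batch substitutes the original spans back for the placeholders
print(ip.postprocess_batch(decoded, lang="kan_Knda"))
```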

@GokulNC
Member

GokulNC commented Oct 22, 2023

I just tried the following sentence on the demo page:

India's foreign exchange reserves increased by USD $1.153 billion to USD $585.895 billion for the week ending October 13, reversing a trend of multiple weeks of decline.

It translated to Hindi as:

13 अक्टूबर को समाप्त सप्ताह के लिए भारत का विदेशी मुद्रा भंडार अमेरिकी डॉलर 1 बिलियन से बढ़कर अमेरिकी डॉलर 2 बिलियन हो गया, जो कई हफ्तों की गिरावट की प्रवृत्ति को उलट देता है।

(In English: "For the week ending October 13, India's foreign exchange reserves increased from USD 1 billion to USD 2 billion, reversing a trend of several weeks of decline." Note that 1.153 and 585.895 were collapsed to 1 and 2.)

Is it handled for floating point cases as well?
Thanks!

@jsk1808

jsk1808 commented Nov 22, 2023

I'm facing the same problem. The model is hallucinating numbers. Any updates on how to fix that?

@PranjalChitale
Collaborator

General comment about the numeral issue.

In some cases, we do observe that the placeholder-based approach in the inference engine can produce suboptimal results for inputs involving numerals: the model hallucinates the placeholder identifier as the actual number instead of retaining the placeholder, as observed in the example in the comment above.

You can consider removing the numeral pattern and letting the model handle numerals on its own, to avoid these placeholder-induced hallucinations.
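Concretely, the placeholder wrapping works roughly like the sketch below, so dropping the numeral regex from the pattern list leaves the numbers visible to the model. This is illustrative code only: the pattern definitions and the helper are hypothetical stand-ins, not the exact logic in the repo's inference normalization script.

```python
import re

# Hypothetical stand-ins for the repo's actual patterns.
URL_PATTERN = re.compile(r"\bhttps?://\S+")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
NUMERAL_PATTERN = re.compile(r"\$?\d+(?:\.\d+)?")  # candidate for removal

def wrap_with_placeholders(text, patterns):
    """Replace each pattern match with an <IDk> tag; remember the originals."""
    mapping, k = {}, 1
    for pattern in patterns:
        for match in pattern.findall(text):
            tag = f"<ID{k}>"
            text = text.replace(match, tag, 1)
            mapping[tag] = match
            k += 1
    return text, mapping

# NUMERAL_PATTERN intentionally omitted: numbers reach the model untouched,
# so there is no <IDk> identifier for it to hallucinate into a numeral.
wrapped, mapping = wrap_with_placeholders(
    "Determine the HCF of $231$ and $396$. See https://example.com",
    [URL_PATTERN, EMAIL_PATTERN],
)
print(wrapped)   # Determine the HCF of $231$ and $396$. See <ID1>
print(mapping)   # {'<ID1>': 'https://example.com'}
```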

@GokulNC
Member

GokulNC commented Mar 28, 2024

I see the following difference between training code and inference code:

During training / finetuning, the placeholder being used is <dnt> do_not_translate_this </dnt>.
Ref: https://github.com/AI4Bharat/IndicTrans2/blob/main/scripts/normalize_regex.py

But during inference, a different tag is being used altogether: <ID1>, <ID2>, etc.
Ref: https://github.com/AI4Bharat/IndicTrans2/blob/main/inference/normalize_regex_inference.py

Why is this the case? Doesn't this mean that the model isn't explicitly primed to retain the <ID> placeholders, and that this is the root cause of the issue above?

Shouldn't we be using <dnt> during inference as well?

Please correct me if I am wrong somewhere. Thanks!
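To make the mismatch concrete, here is what the two schemes produce for the same span (hypothetical helper functions for illustration; the real logic lives in the two scripts linked above):

```python
def tag_for_training(text: str, span: str) -> str:
    # scripts/normalize_regex.py style: the span stays visible inside the tags
    return text.replace(span, f"<dnt> {span} </dnt>")

def tag_for_inference(text: str, span: str, k: int = 1) -> str:
    # inference/normalize_regex_inference.py style: the span is replaced by
    # an opaque placeholder the model never saw during training
    return text.replace(span, f"<ID{k}>")

src = "HCF of 231 and 396"
print(tag_for_training(src, "231"))   # HCF of <dnt> 231 </dnt> and 396
print(tag_for_inference(src, "231"))  # HCF of <ID1> and 396
```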

@PranjalChitale
Collaborator

PranjalChitale commented Mar 28, 2024

Yes, we used the <dnt>-based approach during training. However, we apply a final stage of fine-tuning on BPCC-seed data, which contains little representation of such cases, so the model partly loses its ability to work with the tags. In the broader scheme of things, we chose improved translation quality over preserving this ability. Since the <dnt> approach doesn't work well with the final models, we switched to the placeholder-based approach, which we observe to be very effective in most cases; apart from numbers, we don't observe hallucinations in any other case.

Doing away with the numeral pattern might be a fix, but this needs to be extensively tested.
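One way to test it at scale would be a rough numeral-preservation check over a parallel test set. A sketch is below; it compares ASCII digit strings only and ignores native-script numerals, so failures should be treated as candidates for manual inspection rather than ground truth.

```python
import re

NUM_RE = re.compile(r"\d+(?:\.\d+)?")

def numbers_preserved(src: str, hyp: str) -> bool:
    """True if every numeral in the source also appears in the hypothesis."""
    hyp_nums = NUM_RE.findall(hyp)
    return all(n in hyp_nums for n in NUM_RE.findall(src))

assert numbers_preserved("HCF of 231 and 396", "231 ಮತ್ತು 396 ...")
assert not numbers_preserved("HCF of 231 and 396", "239 ಮತ್ತು 396 ...")
```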

@PranjalChitale
Collaborator

PranjalChitale commented Mar 28, 2024

Why is this the case? Doesn't this mean that the model isn't explicitly primed to retain the placeholders, and that this is the root cause of the issue above?

Yes, you are correct.

We don't explicitly use these <ID> tags during training; the choice was based on the empirical observation that the model preserves <ID> tags in most cases, though this cannot be 100% guaranteed.
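Given that retention cannot be 100% guaranteed, one defensive option is to verify the placeholders after decoding and fall back to translating the raw text when any tag went missing. This is a sketch with hypothetical `wrap`, `unwrap`, and `translate` callables standing in for the pipeline's actual steps:

```python
def translate_with_placeholder_check(text, translate, wrap, unwrap):
    """Translate with <IDk> placeholders, retranslating without them on failure."""
    wrapped, mapping = wrap(text)          # e.g. URLs/numerals -> <IDk> tags
    hyp = translate(wrapped)
    if any(tag not in hyp for tag in mapping):
        # A placeholder was dropped or hallucinated into something else;
        # retranslate the raw text and let the model handle those spans itself.
        return translate(text)
    return unwrap(hyp, mapping)            # substitute the originals back
```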

@GokulNC
Member

GokulNC commented Mar 28, 2024

Cool, thanks! Will finetune with <ID> tags instead of <dnt> tags.


Also just FYI, as you said above, the model does not seem to work well with <dnt> tags during inference either:

Input: Movie fans were much more positive, according to ratings on <dnt> Amazon.com </dnt>.
Output from IndicTrans: अमेज़न. कॉम पर रेटिंग के अनुसार, फिल्म के प्रशंसक बहुत अधिक सकारात्मक थे।

Although it ignores the <dnt> tags, it does not retain the phrase inside them as-is: "Amazon.com" comes out transliterated into Devanagari as "अमेज़न. कॉम".
