Translation processing problem #52

Open
Khalid-kamal opened this issue Oct 24, 2022 · 9 comments

Comments

@Khalid-kamal

When a sentence contains a run of dots in the middle, the translation is incomplete: only the part before the dots is translated, and everything after the dots is ignored. For example:
The officers and employees of the Bank, who are not local nationals of the Kingdom of ................... shall be exempt from customs duties and other levies, prohibitions and restrictions on the importation of motor vehicles and spare parts thereof, and household effects, equipment and furniture.
The output covers only the first part, up to "Kingdom of".

@SafeTex

SafeTex commented Oct 26, 2022

Hello Khalid

Are you translating into Arabic by any chance?

I wouldn't be surprised if this has something to do with right-to-left languages, but I'm only guessing, of course.

The thing is that when I tested what you said in one of my language pairs (Swedish to English), Opus CAT translated everything (see attached file)
dot translation

@Khalid-kamal
Author

So it seems that the problem lies in the target language, but this should not happen, since the tool counts the source words and compares them to the target words. It may be a bug that needs to be fixed.
Thanks for your guess.

@TommiNieminen
Collaborator

I can't reproduce this issue, at least not with the opus+bt-2021-04-13 English to Arabic model. Do you have more information about the contexts in which this issue occurs?

@Khalid-kamal
Author

Tommi,
Would you try this sentence and see the result:
1996 ................... among certain African states and international organizations;
image

@Khalid-kamal
Author

Here is the database
image

@TommiNieminen
Collaborator

That looks like a fine-tuned model, so it's possible that this is caused by the fine-tuning process. Since the data used for fine-tuning is generally very domain-specific, performance may degrade on source texts that don't belong to the fine-tuning domain (such as these kinds of texts, where a series of periods is used as a placeholder).

How much data did you use to fine-tune the model, and what sort of data was it? Another complicating factor is that the Arabic models are multilingual models, i.e. they support multiple variants of Arabic, which might affect fine-tuning.

@Khalid-kamal
Author

Over one million segments

@Khalid-kamal
Author

Most of the data is in, or close to, the main domain.

@TommiNieminen
Collaborator

OK, that's a lot of data. It does sound like the problem with the repeated periods is caused by the fine-tuning. If there are other errors in the translations besides the problem with the repeated periods, I would advise fine-tuning with a smaller, more targeted set of segments.

If the model translates OK otherwise, it's also possible to use a pre-edit rule to edit those problematic sentences automatically before they are translated. For instance, you could use a rule like this:

image

This rule would truncate every series of repeated periods to five periods, which might be easier for an MT model to handle.
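For readers who cannot see the screenshot, here is a rough sketch of what such a rule does, written in Python rather than in the tool's own rule editor. The exact pattern in the screenshot is not reproduced in this thread, so the regular expression below (`\.{6,}` replaced by five periods) and the helper name `truncate_periods` are assumptions made for illustration only.

```python
import re


def truncate_periods(segment: str, keep: int = 5) -> str:
    """Shorten every run of more than `keep` periods to exactly `keep` periods.

    Hypothetical helper: it mimics the effect of the pre-edit rule described
    above, not the tool's actual implementation.
    """
    # Match runs of at least keep + 1 consecutive periods; shorter runs are untouched.
    pattern = re.compile(r"\.{%d,}" % (keep + 1))
    return pattern.sub("." * keep, segment)


source = "1996 ................... among certain African states and international organizations;"
print(truncate_periods(source))
```

Runs of five or fewer periods, such as an ordinary ellipsis, are left as they are, so only the long placeholder runs are changed before the segment reaches the model.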
