More Consistent GET variables? #156

Open
Ryu945 opened this issue Jun 10, 2020 · 5 comments

Comments

@Ryu945

Ryu945 commented Jun 10, 2020

I noticed that when using /translate, the GET variable is "langpair" and it takes a value like "eng | spa".

When using /translateChain, the GET variable is "langpairs" and it takes a value like "eng | spa | fra".

I know the input is technically different, but shouldn't the variables be more consistent for the sake of usability?

Also, shouldn't there be some way for the API to pick appropriate middle languages on its own instead of requiring them to be specified? For example, depending on the quality of a language pair, sending "eng | fra" might mean using "eng | spa | fra" automatically, since spa and fra are both Romance languages and spa might have a high accuracy score with both eng and fra.

edit: While a cleaned-up interface is on the mind, what if /translate and /translateChain were both handled by /translate? It would run the appropriate code based on whether it is fed "eng | spa | fra" or "eng | spa".
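To make the single-endpoint idea concrete, here is a minimal sketch of how one parameter could be split and dispatched on the number of languages it contains. The function names `split_langs` and `choose_handler` are hypothetical, and this is not how the server is actually implemented; it only illustrates the parsing step being proposed.

```python
def split_langs(value):
    """Split a value such as "eng | spa | fra" into a list of language codes."""
    return [code.strip() for code in value.split("|") if code.strip()]


def choose_handler(value):
    """Pick a handler name from the number of languages in the value.

    Hypothetical dispatch: two codes means a direct translation,
    three or more means a chained translation.
    """
    langs = split_langs(value)
    if len(langs) < 2:
        raise ValueError("need at least a source and a target language")
    handler = "translate" if len(langs) == 2 else "translateChain"
    return handler, langs
```

With this sketch, `choose_handler("eng | spa")` selects the direct handler and `choose_handler("eng | spa | fra")` selects the chained one, which is exactly the parsing step the reply below argues against.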

@TinoDidriksen
Member

langpairs is fine for chained translations. It's no fun having to parse a variable to see what it is, so keeping it as a separate variable is best. If there were automatic chain detection, then langpair could make sense.

As to that, it's much harder than you think. That means we have to annotate which language family each ISO 639-3 code belongs to, plus what quality level each of our implementations has, and then work out a complicated weighting of paths. It may be that eng -> spa -> fra is the best language-family path, but what if we have 90% quality eng -> jpn and 90% jpn -> fra while the eng-spa-fra path is only 70%?

So it's doable, but someone needs to annotate families (easy, but boring), and figure out relative quality levels of all pairs (extremely hard), and then work out an algorithm for making best fit paths.
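The last step, finding a best-fit path once quality numbers exist, could be sketched roughly as below. This assumes each pair has a quality score between 0 and 1 and that the quality of a chain is the product of its steps; both are assumptions for illustration, not something the project defines. The function name `best_chain` and the scores in the usage example are made up. The search is Dijkstra's algorithm over costs of `-log(quality)`, so the cheapest path is the one with the highest product.

```python
import heapq
import math


def best_chain(quality, src, dst):
    """Return (path, score) for the chain with the highest product of qualities.

    quality maps (source, target) tuples to a score in (0, 1].
    Multiplying scores along a chain is an assumption of this sketch.
    """
    graph = {}
    for (a, b), q in quality.items():
        if 0 < q <= 1:
            # -log turns "maximise the product" into "minimise the sum".
            graph.setdefault(a, []).append((b, -math.log(q)))

    heap = [(0.0, [src])]
    done = set()
    while heap:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == dst:
            return path, math.exp(-cost)
        if node in done:
            continue
        done.add(node)
        for nxt, weight in graph.get(node, []):
            if nxt not in done:
                heapq.heappush(heap, (cost + weight, path + [nxt]))
    return None, 0.0


# Made-up scores mirroring the example above: eng-jpn-fra beats eng-spa-fra.
scores = {
    ("eng", "jpn"): 0.9,
    ("jpn", "fra"): 0.9,
    ("eng", "spa"): 0.8,
    ("spa", "fra"): 0.875,
}
path, score = best_chain(scores, "eng", "fra")
```

The hard part named above, obtaining trustworthy quality numbers for every pair, is untouched by this sketch; the path search itself is the easy piece.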

@Ryu945
Author

Ryu945 commented Jun 11, 2020

I know that if N is the total number of languages, then it would take a table of N*(N-1) entries, one for each ordered pair of languages.

Also I noticed you used jpn as an example. Was that recently added to the database?

@ftyers
Member

ftyers commented Jul 1, 2020

The way to do this right now is basically to use vocabulary coverage over a corpus. This is the best indicator of the quality of a pair.

This is something that could be automated on a rolling basis... download the latest Wikipedia (or Wikinews) dump, calculate the coverage, and store the number in a stats file in the pair.
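As a rough illustration of what "vocabulary coverage over a corpus" means, the sketch below counts the share of corpus tokens found in a known-word set. The `known_words` set is a hypothetical stand-in: a real measurement would presumably rely on the pair's own analyser rather than a plain word list, and the regex tokenisation here is deliberately naive.

```python
import re


def coverage(text, known_words):
    """Return the fraction of tokens in text that appear in known_words.

    known_words is a hypothetical stand-in for whatever the pair's
    analyser recognises; tokens are lowercased runs of word characters.
    """
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)


# Toy example: 4 of the 6 tokens are in the word list.
ratio = coverage("The cat sat on the mat", {"the", "cat", "on"})
```

Run over a Wikipedia or Wikinews dump instead of one sentence, the resulting number is the kind of figure that could be stored in a per-pair stats file.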

@Ryu945
Author

Ryu945 commented Jul 2, 2020

> The way to do this right now is basically to use vocabulary coverage over a corpus. This is the best indicator of the quality of a pair.
>
> This is something that could be automated on a rolling basis... download the latest Wikipedia (or Wikinews) dump, calculate the coverage, and store the number in a stats file in the pair.

Is there any way to calculate this now, before something like this gets implemented? I believe /calcCoverage only does it for the sentence you enter.

@ftyers
Member

ftyers commented Jul 2, 2020

@Ryu945 I can't do it, but you could! :) I don't expect that such a script should take longer than an hour or two to write.

3 participants