More Consistent GET variables? #156

Ryu945 · 2020-06-10T22:04:23Z

I noticed that when using /translate, the GET variable is "langpair" and it takes a value like "eng | spa".

When using /translateChain, the GET variable is "langpairs" and it takes a value like "eng | spa | fra".

I know the input is technically different, but shouldn't the variables be more consistent for usability's sake?

Also, shouldn't there be some way for the API to pick appropriate middle languages on its own instead of requiring them to be specified? For example, depending on the quality of a language pair, sending "eng | fra" might mean using "eng | spa | fra" automatically, since spa and fra are both Romance languages and spa might have a high accuracy score with both eng and fra.

edit: While a cleaned-up interface is on the mind, what if /translate and /translateChain were both handled by /translate? It would run the appropriate code based on whether it is fed "eng | spa | fra" or "eng | spa".
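To make the single-endpoint idea concrete, here is a minimal sketch of how one parameter could be parsed and routed. The function names `parse_langpair` and `dispatch` are hypothetical and are not part of the existing API; the pipe-separated format is taken from the examples above.

```python
def parse_langpair(value):
    """Split a pipe-separated value such as 'eng | spa | fra' into codes."""
    codes = [part.strip() for part in value.split("|")]
    if len(codes) < 2 or any(not code for code in codes):
        raise ValueError(f"expected at least two language codes, got {value!r}")
    return codes


def dispatch(value):
    """Decide which (hypothetical) code path a request would take."""
    codes = parse_langpair(value)
    # Two codes mean a direct translation; three or more mean a chain.
    mode = "direct" if len(codes) == 2 else "chain"
    return mode, codes
```

The trade-off is that the server has to inspect the value to learn what kind of request it is, instead of reading that off the variable name.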

TinoDidriksen · 2020-06-11T05:24:51Z

langpairs is fine for chained translations. It's no fun having to parse a variable to see what it is, so being a separate variable is best. If there was automatic chain detection, then langpair could make sense.

As to that, it's much harder than you think. That means we have to annotate which language family each ISO 639-3 code belongs to, plus what quality level each of our implementations has, and then work out a complicated weighting of paths. It may be that eng -> spa -> fra is the best language-family path, but what if we have 90% quality eng -> jpn and 90% jpn -> fra while the eng-spa-fra path is only 70%?

So it's doable, but someone needs to annotate families (easy, but boring), and figure out relative quality levels of all pairs (extremely hard), and then work out an algorithm for making best fit paths.
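As a rough illustration of the last step, assuming per-pair quality scores between 0 and 1 already existed, and assuming the quality of a chain is simply the product of its hops (both are assumptions, not something the project defines), a best-path search could be sketched like this. The numbers in `SAMPLE_QUALITY` are invented to mirror the eng/jpn/spa/fra example and are not measured values.

```python
import heapq

# Invented per-hop scores, for illustration only.
SAMPLE_QUALITY = {
    ("eng", "jpn"): 0.9,
    ("jpn", "fra"): 0.9,
    ("eng", "spa"): 0.8,
    ("spa", "fra"): 0.875,
}


def best_chain(quality, source, target):
    """Return (score, path) maximising the product of per-hop scores.

    Dijkstra-style search; valid because every score is at most 1,
    so extending a path can never increase its product.
    """
    # heapq is a min-heap, so scores are stored negated.
    heap = [(-1.0, [source])]
    settled = set()
    while heap:
        neg_score, path = heapq.heappop(heap)
        node = path[-1]
        if node == target:
            return -neg_score, path
        if node in settled:
            continue
        settled.add(node)
        for (src, dst), q in quality.items():
            if src == node and dst not in settled:
                heapq.heappush(heap, (neg_score * q, path + [dst]))
    return 0.0, []  # no chain connects source to target
```

With the sample table, `best_chain(SAMPLE_QUALITY, "eng", "fra")` prefers the route through jpn over the route through spa, which is the situation described above. The search itself is the easy part; producing trustworthy scores for the table is the hard part.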

Ryu945 · 2020-06-11T16:32:40Z

I know that if N is the total number of languages then it would take a table of N*(N-1) entries.
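The N*(N-1) figure counts every ordered pair of distinct languages, which is what the table would need if a score for eng -> spa is stored separately from spa -> eng (an assumption about how the scores would be kept). A quick check, with `directed_pairs` as a hypothetical helper:

```python
from itertools import permutations


def directed_pairs(languages):
    """List every ordered pair of distinct languages."""
    return list(permutations(languages, 2))


# Three languages give 3 * 2 = 6 directed pairs.
pairs = directed_pairs(["eng", "spa", "fra"])
```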

Also I noticed you used jpn as an example. Was that recently added to the database?

ftyers · 2020-07-01T20:44:06Z

The way to do this right now is basically to use vocabulary coverage over a corpus. This is the best indicator of the quality of a pair.

This is something that could be automated on a rolling basis... download the latest Wikipedia (or Wikinews) dump, calculate the coverage, and store the number in a stats file in the pair.
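A minimal sketch of the coverage calculation itself, assuming coverage means the share of corpus tokens that the pair recognises. The `is_known` callback is a hypothetical stand-in for whatever the pair's analyser reports, and the regex tokeniser is a simplification.

```python
import re


def coverage(text, is_known):
    """Return the fraction of tokens in text for which is_known(token) is true."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if is_known(token))
    return known / len(tokens)


# Toy usage with a set as the "dictionary"; a real run would read a
# Wikipedia or Wikinews dump and query the pair's analyser instead.
vocabulary = {"the", "cat", "sat"}
ratio = coverage("The cat sat on the mat", vocabulary.__contains__)
```

The resulting number is what would be written to the stats file for the pair.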

Ryu945 · 2020-07-02T17:39:55Z

> The way to do this right now is basically to use vocabulary coverage over a corpus. This is the best indicator of the quality of a pair.
>
> This is something that could be automated on a rolling basis... download the latest Wikipedia (or Wikinews) dump, calculate the coverage, and store the number in a stats file in the pair.

Is there any way to calculate this now, before something like this gets implemented? I believe /calcCoverage only does it for the sentence you enter.

ftyers · 2020-07-02T17:41:50Z

@Ryu945 I can't do it, but you could! :) I don't expect that such a script should take longer than an hour or two to write.

TinoDidriksen added the question label Jun 11, 2020

TinoDidriksen added the help wanted label Mar 28, 2021
