FYI: I did comprehensive testing of the character-alignment packages available for Python roughly five years ago, and the most important result was that you absolutely cannot trust them; you need to test them thoroughly. Issues to check:
- robustness (crashes on rare inputs or large problems)
- stack limitations (e.g. a poor recursive implementation)
- average/benign-case performance
- worst-case memory and time complexity
- whether or not you can get a coarse, cheap estimate first, or pass some upper limit on the distance
- whether the actual alignment is returned, or just the score
- whether multiple results are returned, or just the single best
- whether combining codepoints are glued to their base characters (i.e. treated as grapheme clusters) or treated as independent, both in the alignment and in the distance metric
- whether, and how efficiently and in parallel, 1:n or even n:m variants are implemented
- whether custom weights or cost functions can be passed in (though we could always make that a post-processing step, as long as we get the alignment path itself)
Tools besides rapidfuzz and jellyfish that you might want to list:
In the QA specs we claim to provide a recommendation for the edit-distance algorithm to be used.
For this we have to