add ukrainian #27

tggo · 2021-03-04T14:10:10Z

No description provided.

ojwb · 2022-11-08T05:39:20Z

@stefanvodita Comparing with your PR (#30), you update algorithms/index.tt (good); this one also adds a stopword list (good if it's a sensible list; passing it through Google Translate, it seems plausible to me).

Both lack any useful description of the algorithm though, which is unhelpful for anyone trying to understand it. Such information also helps with maintenance going forwards (knowing the ideas and choices behind the algorithm can help judge if a proposed change makes sense, or how to address a reported shortcoming).

The original algorithms provide a text description of the algorithm - I'm not sure this is actually worth the effort. It's perhaps useful if someone wants to implement the algorithm from scratch, but in the context of Snowball we tend to end up unsure what to do if the text and the Snowball implementation disagree.

What I think is useful is to detail the origins of the algorithm, and provide information about the design which isn't obvious from just reading the code, such as why particular design choices were made.

As best I can make out, the algorithm was developed by @Tapkomet. Comparing to the existing snowball algorithms it looks like it's loosely based on algorithms/russian.sbl.

There's a repo at https://github.com/Tapkomet/UAStemming but it doesn't seem to have much documentation. The only information about the algorithm design I found was:

The approach prefers removing suffixes and compeltely ignores preffixes. Roughly 95% accuracy when tested.

Although this algorithm appears to be derived from an existing algorithm, it doesn't use the RV and R2 regions that the stemmer it seems to be based on uses. Most of the existing Snowball stemmers use these or similar regions to prevent unhelpful changes to shorter words. See also: https://snowballstem.org/texts/r1r2.html

While it's not a requirement to use regions, the lack of documentation leaves me unsure whether the region approach was tried and rejected for some reason, or if it was removed without really understanding its purpose. The algorithm seems to instead use a simple length-based check (not hop 4 or (...)), which simply excludes any word < 4 characters long from stemming. The region approach instead counts spans of vowels and non-vowels, which seems to do a better job, at least for other languages.
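
To illustrate the difference between the two approaches, here is a rough Python sketch. It is not code from either PR: the function names are made up, the Ukrainian vowel set is an assumption, and the region definitions follow the R1/R2 description on the Snowball page linked above plus the RV definition used by the Russian stemmer (the region after the first vowel).

```python
# Assumed Ukrainian vowel letters, for illustration only; the PR may differ.
VOWELS = frozenset("аеєиіїоуюя")


def passes_length_check(word, minimum=4):
    """Rough equivalent of the hop-based guard: skip words shorter than `minimum`."""
    return len(word) >= minimum


def r1_start(word, vowels, start=0):
    """Index just after the first non-vowel that follows a vowel, searching
    from `start`; len(word) (the null region) if there is no such non-vowel."""
    for i in range(start + 1, len(word)):
        if word[i] not in vowels and word[i - 1] in vowels:
            return i + 1
    return len(word)


def regions(word, vowels=VOWELS):
    """Return the start indices of (RV, R1, R2) for `word`."""
    # RV: the region after the first vowel, or the end of the word if none.
    rv = next((i + 1 for i, ch in enumerate(word) if ch in vowels), len(word))
    # R1: after the first vowel/non-vowel transition in the whole word.
    r1 = r1_start(word, vowels)
    # R2: the same definition applied inside R1.
    r2 = r1_start(word, vowels, r1)
    return rv, r1, r2


if __name__ == "__main__":
    word = "книга"
    rv, r1, r2 = regions(word)
    print(passes_length_check(word), repr(word[rv:]), repr(word[r1:]), repr(word[r2:]))
```

For книга the length check passes, while the regions come out as RV = "га", R1 = "а" and an empty R2, so a rule restricted to R2 would leave the word alone even though it is 5 characters long. That is the kind of protection for shorter words that a plain length threshold doesn't give.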

stefanvodita · 2022-11-26T11:07:07Z

Thank you for the in-depth explanation @ojwb. Unfortunately I don't have any more context on @Tapkomet's implementation either.

Tapkomet · 2022-12-01T19:48:06Z

@ojwb @stefanvodita hello, I will offer what insight I have. Which isn't that much tbh. I originally made this project for my uni studies, and then never got back to it to actually get it to a practically usable state, which is also why I never documented it much. And, well, I've forgotten a lot.

You are correct that I based this on algorithms/russian.sbl

If I recall correctly, I ignored prefixes because I couldn't get them to work properly; they'd often cut the part of the word you shouldn't cut. As for RV/R2, IIRC I didn't really understand them and used hop 4 as a stopgap measure to get "good enough" results for my study project. As far as I know, Russian and Ukrainian words are structured similarly enough that the same approach is likely to be helpful.

On another note, I definitely recall that many of the substrings I used in the algorithm I added simply because that seemed to work well with my test data (which was a bunch of Ukrainian Wikipedia articles).

If there's anything else you want to know, I'll probably be able to figure it out with a bit of time, so ask away if you have any more questions. As well, feel free to make use of anything in my algorithm, I'd love to have some good come of it. Actually some people are already working on that, if you don't know them I can get you in touch if you're interested.

add ukrainian

b62b156

ojwb mentioned this pull request Sep 20, 2023

Add ukrainian stemmer snowballstem/snowball#178

Open
