add ukrainian #27

Open
wants to merge 1 commit into
base: main

tggo

@tggo tggo commented Mar 4, 2021

No description provided.

@ojwb
Member

ojwb commented Nov 8, 2022

@stefanvodita Comparing with your PR (#30), you update algorithms/index.tt (good); this one also adds a stopword list (good if it's a sensible list; having passed it through Google Translate, it seems plausible to me).

Both lack any useful description of the algorithm though, which is unhelpful for anyone trying to understand the algorithm. It also helps to have such information for maintenance going forwards (knowing the ideas and choices behind the algorithm can help judge if a proposed change makes sense, or how to address a reported short-coming).

The original algorithms provide a text description of the algorithm - I'm not sure this is actually worth the effort. It's perhaps useful if someone wants to implement the algorithm from scratch, but in the context of Snowball we tend to end up unsure what to do if the text and the Snowball implementation disagree.

What I think is useful is to detail the origins of the algorithm, and provide information about the design which isn't obvious from just reading the code, such as why particular design choices were made.

As best I can make out, the algorithm was developed by @Tapkomet. Comparing to the existing snowball algorithms it looks like it's loosely based on algorithms/russian.sbl.

There's a repo at https://github.com/Tapkomet/UAStemming but it doesn't seem to have much documentation. The only information about the algorithm design I found was:

The approach prefers removing suffixes and compeltely ignores preffixes. Roughly 95% accuracy when tested.

Although this algorithm appears to be derived from an existing algorithm, it doesn't use the RV and R2 regions that the stemmer it seems to be based on uses. Most of the existing Snowball stemmers use these or similar regions to prevent unhelpful changes to shorter words. See also: https://snowballstem.org/texts/r1r2.html
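For readers unfamiliar with the regions, here is a minimal Python sketch of the definitions as given on the r1r2 page and in the Russian stemmer description: R1 is the region after the first non-vowel following a vowel, R2 is the same rule applied inside R1, and RV (in the Russian stemmer) is the region after the first vowel. The function names are made up for illustration, and the vowel set is the one the Russian stemmer uses; a Ukrainian stemmer would need its own set.

```python
# Vowel set used by the Russian Snowball stemmer. A Ukrainian stemmer
# would need a different set (assumption: not taken from this PR).
RUSSIAN_VOWELS = frozenset("аеиоуыэюя")


def region_start(word: str, vowels: frozenset, start: int = 0) -> int:
    """Index just after the first non-vowel that follows a vowel,
    searching from `start`; len(word) (a null region) if there is none."""
    for i in range(start + 1, len(word)):
        if word[i] not in vowels and word[i - 1] in vowels:
            return i + 1
    return len(word)


def r1_r2(word: str, vowels: frozenset) -> tuple:
    """Return the (R1, R2) substrings of `word`."""
    r1 = region_start(word, vowels)
    r2 = region_start(word, vowels, r1)  # same rule, applied within R1
    return word[r1:], word[r2:]


def rv(word: str, vowels: frozenset) -> str:
    """Return RV: the region after the first vowel, or the (empty)
    end of the word if it contains no vowel."""
    for i, ch in enumerate(word):
        if ch in vowels:
            return word[i + 1:]
    return ""


if __name__ == "__main__":
    english = frozenset("aeiouy")
    print(r1_r2("beautiful", english))
    print(rv("противоестественном", RUSSIAN_VOWELS))
    print(r1_r2("противоестественном", RUSSIAN_VOWELS))
```

A suffix is then only removed if it lies entirely inside the relevant region, which is what protects short words.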

While it's not a requirement to use regions, the lack of documentation leaves me unsure whether the region approach was tried and rejected for some reason, or if it was removed without really understanding its purpose. The algorithm seems to instead use a simple length-based check (not hop 4 or (...)) which simply excludes any word shorter than 4 characters from stemming. The region approach instead counts spans of vowels and non-vowels, which seems to do a better job, at least for other languages.
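To make the difference concrete, here is a rough Python sketch contrasting the two guards. It is not the code from either stemmer: the function names are hypothetical, the vowel set is the Russian stemmer's, and the suffix "ами" is picked only for illustration. A real stemmer applies many suffix classes in several steps.

```python
# Russian stemmer vowel set; a Ukrainian set would differ (assumption).
VOWELS = frozenset("аеиоуыэюя")


def rv_start(word: str) -> int:
    """Index where RV begins: just after the first vowel, else len(word)."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return i + 1
    return len(word)


def strip_with_length_gate(word: str, suffix: str, min_len: int = 4) -> str:
    """Roughly what `not hop 4 or (...)` amounts to: words shorter than
    `min_len` characters are left alone, anything longer may be stripped."""
    if len(word) < min_len:
        return word
    if suffix and word.endswith(suffix):
        return word[:-len(suffix)]
    return word


def strip_within_rv(word: str, suffix: str) -> str:
    """Region-style guard: strip only if the whole suffix lies inside RV."""
    if suffix and word.endswith(suffix):
        if len(word) - len(suffix) >= rv_start(word):
            return word[:-len(suffix)]
    return word


if __name__ == "__main__":
    for w in ("домами", "сами"):
        print(w, strip_with_length_gate(w, "ами"), strip_within_rv(w, "ами"))
```

Both guards reduce "домами" to "дом", but for the 4-letter word "сами" the length check lets the suffix through and leaves only "с", while the RV check leaves the word unchanged because the suffix starts before RV does.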

@stefanvodita

Thank you for the in-depth explanation @ojwb. Unfortunately I don't have any more context on @Tapkomet's implementation either.

@Tapkomet

Tapkomet commented Dec 1, 2022

@ojwb @stefanvodita hello, I will offer what insight I have. Which isn't that much tbh. I originally made this project for my uni studies, and then never got back to it to actually get it to a practically usable state, which is also why I never documented it much. And, well, I've forgotten a lot.

You are correct that I based this on algorithms/russian.sbl

If I recall correctly, I ignored prefixes because I couldn't get them to work properly; they'd often cut a part of the word that shouldn't be cut. As for RV/R2, IIRC I didn't really understand them and used hop 4 as a stopgap measure to get "good enough" results for my study project. As far as I know, Russian and Ukrainian words are structured similarly enough that the same approach is likely to be helpful.

On another note, I definitely recall that I added many of the substrings used in the algorithm simply because they seemed to work well with my test data (which was a bunch of Ukrainian Wikipedia articles).

If there's anything else you want to know, I'll probably be able to figure it out with a bit of time, so ask away if you have any more questions. Also, feel free to make use of anything in my algorithm; I'd love to have some good come of it. Actually, some people are already working on that. If you don't know them, I can put you in touch if you're interested.

4 participants