add ukrainian #27
base: main
Conversation
@stefanvodita Comparing with your PR (#30): both lack any useful description of the algorithm, which is unhelpful for anyone trying to understand it. It also helps to have such information for maintenance going forwards (knowing the ideas and choices behind the algorithm can help judge whether a proposed change makes sense, or how to address a reported shortcoming).

The original algorithms provide a text description of the algorithm. I'm not sure this is actually worth the effort - it's perhaps useful if someone wants to implement the algorithm from scratch, but in the context of Snowball we tend to end up unsure what to do if the text and the Snowball implementation disagree. What I think is useful is to detail the origins of the algorithm, and to provide information about the design which isn't obvious from just reading the code, such as why particular design choices were made.

As best I can make out, the algorithm was developed by @Tapkomet. Comparing to the existing Snowball algorithms, it looks like it's loosely based on the Russian stemmer (russian.sbl). There's a repo at https://github.com/Tapkomet/UAStemming but it doesn't seem to have much documentation. The only information about the algorithm design I found was:
Although this algorithm appears to be derived from an existing algorithm, it doesn't use the RV and R2 regions which the stemmer it seems to be based on does use. Most of the existing Snowball stemmers use these or similar regions to prevent unhelpful changes to shorter words. See also: https://snowballstem.org/texts/r1r2.html

While it's not a requirement to use regions, the lack of documentation leaves me unsure whether the region approach was tried and rejected for some reason, or whether it was removed without really understanding its purpose. The algorithm instead seems to use a simple length-based check (hop 4).
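To make the region idea concrete, here is a small Python sketch of the R1/R2/RV definitions from the page linked above, using an assumed Ukrainian vowel set. It's purely illustrative - not code from this PR or from russian.sbl:

```python
# Illustrative sketch of the R1/R2/RV regions described at
# https://snowballstem.org/texts/r1r2.html - not the code from this PR.
# The vowel set below is an assumption (the ten Ukrainian vowel letters).

VOWELS = set("аеєиіїоуюя")

def r1_start(word, vowels=VOWELS):
    """R1 starts after the first non-vowel that follows a vowel."""
    for i in range(1, len(word)):
        if word[i] not in vowels and word[i - 1] in vowels:
            return i + 1
    return len(word)  # no such position: R1 is empty

def r2_start(word, vowels=VOWELS):
    """R2 is R1 computed again, inside R1."""
    r1 = r1_start(word, vowels)
    return r1 + r1_start(word[r1:], vowels)

def rv_start(word, vowels=VOWELS):
    """RV (as russian.sbl uses it) starts after the first vowel."""
    for i, ch in enumerate(word):
        if ch in vowels:
            return i + 1
    return len(word)  # no vowel: RV is empty
```

A suffix rule conditioned on R2 then only fires when the ending lies entirely at or after r2_start(word), which automatically leaves short words alone.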
@ojwb @stefanvodita hello, I will offer what insight I have, which isn't that much to be honest. I originally made this project for my uni studies, and then never got back to it to actually get it to a practically usable state, which is also why I never documented it much. And, well, I've forgotten a lot.

You are correct that I based this on algorithms/russian.sbl. If I recall correctly, I ignored prefixes because I couldn't get them to work properly; they'd often cut the part of the word you shouldn't cut. As for RV/R2, IIRC I didn't really understand them and used hop 4 as a stopgap measure to get "good enough" results for my study project. As far as I know, Russian and Ukrainian words are structured similarly enough that the same approach is likely to be helpful.

On another note, I definitely recall that many of the substrings I used in the algorithm I added simply because that seemed to work well with my test data (which was a bunch of Ukrainian Wikipedia articles).

If there's anything else you want to know, I'll probably be able to figure it out with a bit of time, so ask away if you have any more questions. As well, feel free to make use of anything in my algorithm, I'd love to have some good come of it. Actually some people are already working on that; if you don't know them I can get you in touch if you're interested.
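For anyone reading along, here is a rough picture of what the hop 4 stopgap amounts to: a minimum-remaining-length guard instead of a region guard. The endings listed and the exact condition checked are assumptions for illustration, not taken from the PR:

```python
# Rough sketch of a "minimum remaining length" guard in place of regions.
# The endings and the exact condition are assumptions for illustration -
# they are not taken from this PR.

ENDINGS = ("ами", "ові", "ою")

def strip_ending(word, min_stem=4):
    """Remove a known ending only if at least min_stem characters remain."""
    for ending in ENDINGS:
        if word.endswith(ending) and len(word) - len(ending) >= min_stem:
            return word[: -len(ending)]
    return word

print(strip_ending("книгами"))  # "книг" - long enough, ending removed
print(strip_ending("мами"))     # "мами" - too short, left alone
```

The region-based guard from the earlier sketch plays the same role, but with a per-word boundary instead of a fixed length.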