WIP: Add rules for Swedish #115
It seems not to work in the initial comment, since that is not created as an issue comment. It should work here though: /action blocklist sv 80 |
One issue I did think about is that 14-word sentences in Swedish can be pretty long, since Swedish, like German, writes compound words together, which produces fairly long words. Would it be reasonable to use a lower max word limit? The particular example sentence would probably be filtered out by a blocklist of words used fewer than 80 times ("meningssystem" is used 3 times when I ripgrep through the wikiextracted text), but some potentially very long sentences could still be constructed from pretty common compound words. |
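The idea behind a frequency-based blocklist can be sketched as follows. This is a minimal illustration in Python, not the extractor's actual implementation: the names `build_blocklist` and `uses_blocked_word`, the tokenization regex, and the sample threshold are all assumptions made for this example. Words that occur fewer than the threshold number of times in the extracted text end up on the blocklist, and any sentence containing one of them would be rejected.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of Unicode letters (covers å, ä, ö).
WORD_RE = re.compile(r"[^\W\d_]+")


def build_blocklist(text, min_count):
    """Return the set of words occurring fewer than min_count times in text."""
    counts = Counter(word.lower() for word in WORD_RE.findall(text))
    return {word for word, n in counts.items() if n < min_count}


def uses_blocked_word(sentence, blocklist):
    """Return True if the sentence contains at least one blocklisted word."""
    return any(word.lower() in blocklist for word in WORD_RE.findall(sentence))
```

With a threshold of 80, a word such as "meningssystem" that appears only 3 times in the corpus would land on the blocklist. Note that in this sketch a word that never appears in the corpus is not blocked, since it is never counted.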
That's definitely something to keep in mind while reviewing. How long does it take to say that sentence? |
About 8 seconds, timing myself. But I think it would be something like that for most people, if they don't stumble on the words, which is quite possible reading a sentence like that the first time. I'll keep that in mind for reviewing. Should the goal be for sentences to be fairly straightforward to say for most people, and not too long? |
I'd say around 8 seconds is fine. However, not all sentences should be that long, as recording could get quite exhausting after some time. |
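One way to account for long compound words would be to cap sentence length in characters as well as in words. A rough sketch, assuming hypothetical limits: `max_words=14` mirrors the 14-word limit mentioned above, while `max_chars=100` is an arbitrary placeholder and not a project setting. The function name `is_short_enough` is also made up for this example.

```python
def is_short_enough(sentence, max_words=14, max_chars=100):
    """Accept a sentence only if it is within both the word and character limits."""
    words = sentence.split()
    return len(words) <= max_words and len(sentence) <= max_chars
```

A character cap catches sentences that stay under the word limit but are still long to read aloud because of compound words.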
Job finished: https://github.com/Common-Voice/cv-sentence-extractor/actions/runs/159592047 |
@andersjohansson you'll find the blocklist at the top right of the following link as posted by the previous comment: https://github.com/Common-Voice/cv-sentence-extractor/actions/runs/159592047 Anything I could help you with? |
That’s great!
I’m away from my computer for a few weeks now so won’t be able to take it forward for a while.
|
The sample extraction seems to result in an empty file? Could it be that all sentences are rejected for some reason or is there some other problem? One problem that I have noted with Swedish Wikipedia is that it contains a massive amount of bot-articles by lsjbot (https://en.wikipedia.org/wiki/Lsjbot). This is fine for Wikipedia but very few of these articles contain suitable sample sentences. A lot of the words from these articles also contribute to the massive list of unusual words to block. Would it be possible to exclude these bot-articles in some way before extracting stuff? |
There seems to have been an error downloading the WikiExtractor script. I've manually restarted the job, let's see if that helps.
I thought there was a discussion around that somewhere, however I can't find it. As far as I remember this is not possible as we're not getting author information in the output of the WikiExtractor script. |
Looks like it doesn't. Will have a look tomorrow. |
/action blocklist sv 80 (ignore the output, this is for testing only) |
@andersjohansson I think I have fixed the issue for now. If you merge master into your branch and push it, it should generate a new sample output. |
Job finished: https://github.com/Common-Voice/cv-sentence-extractor/actions/runs/187520977 |
Initial rules for extracting Swedish. Seems to give reasonable output already, albeit with unusual words here and there that could very well be filtered out with a blocklist of uncommon words.
Let’s try to generate one:
/action blocklist sv 80