Is there a limit to the audio duration? #181
Comments
Hey @JJun-Guo, recordings in Common Voice are currently limited to 10 seconds. Here is a related recent discussion on allowing more: https://discourse.mozilla.org/t/discussion-relaxation-of-the-10-sec-recording-limitation/114142 |
Hi, what about the minimum duration? |
I need to check the code, but from memory it was 1 sec and was later dropped to 0.5... Actually, since the recording also includes silences, short utterances can easily be recorded by leaving a silence at the start or at the end of the recording. |
I was wrong. It is 1 sec; 0.5 sec is for the benchmark sentences (numbers etc.): https://github.com/common-voice/common-voice/blob/3bccdf446f6acd8a9afda1db7a9a1664457e611d/web/src/components/pages/contribution/speak/speak.tsx#L42 But as I stated at the link given in the previous post, state-of-the-art models work better with longer utterances. E.g., Whisper works best with 5-25 sec recordings... So it is better to measure an average character duration and calculate a minimum sentence length from there... |
Wouldn't such short recordings affect downstream tasks such as speech recognition? The recordings in the dataset are between 1 and 10 s long overall, so in which range is most of the data concentrated? |
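The idea above of deriving a minimum sentence length from an average character duration can be sketched roughly as follows. This is only an illustration: the function name, the `(sentence, seconds)` input format, and the 5-second target are assumptions, not anything taken from the Common Voice code. Because recordings include leading and trailing silence, the measured per-character duration is an upper estimate.

```python
import math


def min_sentence_length(clips, target_seconds):
    """Estimate how many characters are needed to fill target_seconds.

    clips is a list of (sentence, recording_seconds) pairs (hypothetical
    input format). Durations include silences, so this overestimates the
    time per character and therefore underestimates the needed length.
    """
    total_chars = sum(len(text) for text, _ in clips)
    total_seconds = sum(seconds for _, seconds in clips)
    if total_chars <= 0 or total_seconds <= 0:
        raise ValueError("need at least one non-empty clip with a positive duration")
    # chars needed = target duration / (average seconds per character),
    # rounded up so the estimate does not fall short of the target.
    return math.ceil(target_seconds * total_chars / total_seconds)


if __name__ == "__main__":
    sample = [("a" * 50, 4.0), ("b" * 50, 4.0)]  # made-up measurements
    print(min_sentence_length(sample, target_seconds=5.0))
```

In practice the measurement would be done per language, since character duration differs a lot between writing systems.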
AFAIK, a rule of thumb is to train a model on the kind of data it will see in the wild. For a general-purpose ASR model that is subjected to everyday speech, I think the data should include shorter recordings, because spontaneous speech and conversations contain them extensively, e.g. short answers to questions (yes, no, OK, fine, etc.) or "What do you want?" => "Tea...". I think it is best to have more-or-less evenly distributed durations (a flat curve), and thus sentence lengths. One could work on improving their Common Voice dataset to remedy peaks in the distribution. I created webapps where people can examine their datasets in more detail, which also helps in this area, for all CV languages. And this is the distribution in the text corpus: Because we had few CC0 sentence resources, we had to rely on volunteers writing common everyday sentences, which are short and dropped the average recording duration to 3.6 secs, from around 4 secs. We need to remedy this issue... You can check your language from here: You can also check the overall changes in time here: |
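To check how far a dataset is from the "flat curve" described above, one could bucket recording durations into fixed-width bins and look at how much of the data sits in the largest bin. The sketch below is a generic illustration with made-up durations; it is not how the mentioned webapps compute their charts.

```python
from collections import Counter


def duration_histogram(durations, bin_width=1.0):
    """Count recordings per duration bin, keyed by the bin's lower edge."""
    counts = Counter(int(d // bin_width) for d in durations)
    return {b * bin_width: counts[b] for b in sorted(counts)}


def peak_share(durations, bin_width=1.0):
    """Fraction of recordings that fall into the most populated bin."""
    histogram = duration_histogram(durations, bin_width)
    return max(histogram.values()) / len(durations)


if __name__ == "__main__":
    sample = [1.2, 1.8, 3.5, 3.6, 3.9, 9.0]  # made-up durations in seconds
    print(duration_histogram(sample))
    print(peak_share(sample))
```

A perfectly flat distribution over N bins would give a peak share of 1/N; values far above that point to a peak worth remedying.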
If you are working on the cv-sentence-extractor rules (first run): I think getting longer sentences is better. It is easier to get shorter sentences from other sources. Once the extractor has taken data from an article, it is done with that article. Some points on this:
|
Not wrong, but might be risky without proper testing. Note that if the Sentence Extractor can't find 3 sentences with the required length, it will not retry with fewer words; it will just use what it got and continue on to the next article. Of course, with proper analysis of the source it would be possible to fully optimize this. |
@MichaelKohler, can this be made adaptive? I mean, not to put an absolute minimum, but set a "requested_minimum", if the 3 sentences are not found, fill it with shorter ones... |
Yes, certainly would be an option, but that would need to be implemented. Overall this would mean going over the sentences multiple times for the case where it won't find enough sentences the first time, but probably not such a big hit on performance overall. In the end, for development purposes that won't matter and for the final run it's fine as well as that runs in the GitHub Action. |
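The adaptive "requested minimum" idea discussed above could look roughly like the following two-pass selection. This is a sketch under assumptions, not the Sentence Extractor's actual implementation (which is not written in Python): the function name and parameters are hypothetical, all other rule checks are left out, and filling the remaining slots with the longest of the shorter sentences is just one possible choice.

```python
import random


def pick_sentences(sentences, requested_min_words, max_per_article=3, rng=None):
    """Pick up to max_per_article sentences, preferring longer ones.

    Hypothetical sketch: first pass takes random sentences that meet the
    requested minimum; a second pass fills any remaining slots with
    shorter sentences instead of leaving them empty.
    """
    rng = rng or random.Random()
    shuffled = list(sentences)
    rng.shuffle(shuffled)

    def word_count(sentence):
        return len(sentence.split())

    # First pass: only sentences that meet the requested minimum.
    picked = [s for s in shuffled if word_count(s) >= requested_min_words]
    picked = picked[:max_per_article]

    # Second pass: fill the remaining slots with shorter sentences,
    # longest first (an assumption; random order would also work).
    if len(picked) < max_per_article:
        shorter = sorted(
            (s for s in shuffled if word_count(s) < requested_min_words),
            key=word_count,
            reverse=True,
        )
        picked.extend(shorter[: max_per_article - len(picked)])
    return picked


if __name__ == "__main__":
    article = ["one two three four five six", "one two", "one two three", "one"]
    print(pick_sentences(article, requested_min_words=5))
```

The second pass only runs for articles that did not yield enough sentences in the first pass, which matches the expectation above that the performance cost should stay small.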
As you know, working on this was on my to-do list, if only I can get really good results... I'll look into this. E.g., sorting sentences by length can help performance. |
Mh, this made me think. Now I wonder if the legal requirement is just "maximum 3 sentences per article" or if there could be issues if we always pick the 3 longest sentences. In some articles the longest 3 sentences might be the majority of content. Probably something that would need to be verified just to make sure. To be clear: I only ever knew about the "maximum 3 sentences per article" without any further restrictions, but I can't guarantee that this is exactly what the lawyers said. |
Very good point... But this is how it works now, isn't it? So, as of now, if an article has 3 sentences, they are taken if the rules match. |
Right now it's fully random, but rejecting what does not fit the rules. So generally, by analyzing the full Wikipedia dump, you could optimize the minimum words rule to get the most words out. But that would be different from always taking the longest sentences. Of course, depending on the requirements, additional rules can be added. At this point I don't even know if it would be a problem or not to do it that way. |
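Optimizing the minimum-words rule by analysis, as described above, could be approximated by sweeping candidate values and estimating how many words each one would yield. The sketch below uses a simplified model that is an assumption: per article, up to 3 sentences are drawn uniformly at random from those meeting the minimum, and no other rules apply. The function names are hypothetical.

```python
def expected_words(articles, min_words, max_per_article=3):
    """Expected total words extracted under a simplified random-pick model.

    articles is a list of articles, each a list of sentence strings.
    """
    total = 0.0
    for sentences in articles:
        eligible = [len(s.split()) for s in sentences if len(s.split()) >= min_words]
        if not eligible:
            continue  # nothing in this article passes the rule
        take = min(max_per_article, len(eligible))
        # A uniform random pick yields the mean eligible length per slot.
        total += take * sum(eligible) / len(eligible)
    return total


def best_min_words(articles, candidates):
    """Return the candidate minimum that maximizes the expected word count."""
    return max(candidates, key=lambda m: expected_words(articles, m))


if __name__ == "__main__":
    def sentence(n):
        return " ".join(["w"] * n)

    toy_articles = [
        [sentence(n) for n in (2, 2, 2, 10, 10, 10)],
        [sentence(n) for n in (4, 4)],
    ]
    for candidate in (1, 5, 11):
        print(candidate, expected_words(toy_articles, candidate))
    print(best_min_words(toy_articles, [1, 5, 11]))
```

The toy data shows the trade-off: a higher minimum raises the yield from articles with long sentences but loses articles that only contain short ones.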
As I mentioned above, with the state-of-the-art models and HW advancements, it is better to get longer audio, thus longer texts. A change in this repo towards this goal would be awesome. Especially because there is no going back once 3-4 word sentences are taken... With longer sentences, duplicates/similarities will also drop substantially, and more possible vocabulary will go into the text-corpus. I think more common words are already in the corpora or can easily be added from other sources, but less frequent ones will be needed by everyone (if too-technical/problematic/hard-to-read ones got correctly ruled out). If it is legally possible of course... |
@jessicarose Analogous to the other question I tagged you in, could you also check here whether we would, in theory, be allowed to always take the 3 longest sentences per article? Thanks! |
Sorry to ping the issue... I'm nearly finalizing my work and I need to ask if taking the longest three sentences will ever be possible - because there is no going back. |
@JJun-Guo, the recording limit is increased to 15 seconds in Common Voice v1.114.2. @MichaelKohler: Probably all rule files should adapt to this change, including the defaults. |
@HarikalarKutusu Thanks for keeping track of this. I agree. Do you know what the correct value for EN would be, so that we can set that as the default? And do you have time to reach out to all language contributors to get a new estimate? I'd be fine with one PR updating all the values, as I think it's a rather low-risk change. One thing to note is that some languages use characters and some use words. |
@MichaelKohler I think a 50% increase should be fine for both max words and characters. With the new v17.0, I can add some character speed measurements, possibly per user, and their distribution to the Analyzer, so that one can, for example, see the 95th-percentile coverage from those values. But that part should be handled by the communities, as you suggest. For those languages which already did run the I have time for PRs and posts in Discourse, but you might need to point to them in case somebody decides on a re-run... |
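The 50% figure follows from the recording limit going from 10 to 15 seconds. A minimal sketch of both approaches mentioned above, proportional scaling and a percentile-based estimate, is given below. All names and sample values are hypothetical, and the nearest-rank percentile over per-character durations is only one possible reading of "95th-percentile coverage".

```python
import math


def scale_limit(old_limit, old_seconds=10, new_seconds=15):
    """Scale a max-words or max-characters limit with the recording limit."""
    return math.floor(old_limit * new_seconds / old_seconds)


def percentile(values, percent):
    """Nearest-rank percentile; percent is an integer from 1 to 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(percent * len(ordered) / 100)
    return ordered[rank - 1]


def max_chars_for_coverage(seconds_per_char, limit_seconds=15, percent=95):
    """Longest sentence that `percent` of the measured clips would fit in."""
    slow = percentile(seconds_per_char, percent)
    return math.floor(limit_seconds / slow)


if __name__ == "__main__":
    print(scale_limit(14))  # hypothetical old max-words value
    measured = [0.0625] * 18 + [0.125, 0.25]  # made-up seconds per character
    print(max_chars_for_coverage(measured))
```

Since some languages use character limits and others word limits, the same scaling would be applied to whichever unit a rule file uses.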
I can try to keep this in mind :) |
I opened a discussion here, your input would be very valuable: |