Bulk Submissions
If you know of a public domain corpus of sentences with more than 100k sentences, you can manually submit a pull request to add this as a bulk dataset. However, you will need to manually perform QA (quality assurance) to make sure the sentences are valid and high-quality.
This Discourse post has a more detailed guide for how to do manual QA, but in brief:
- You need 2-3 native speakers to review a random sample of sentences to verify their correctness.
- The sentences should be spelt correctly.
- The sentences should be grammatically correct.
- The sentences should be speakable, which also means avoiding uncommon non-native words.
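Drawing the random sample for the reviewers can be scripted. The following is a minimal sketch, not part of the official tooling: the function name `draw_review_sample`, the fixed seed, and the choice to strip and de-duplicate sentences before sampling are all assumptions made for illustration.

```python
import random


def draw_review_sample(sentences, sample_size, seed=0):
    """Return a reproducible random sample of unique, non-empty sentences.

    Hypothetical helper for illustration; it is not part of the Common Voice tooling.
    """
    # Strip whitespace, drop blank lines, and de-duplicate while keeping order.
    cleaned = list(dict.fromkeys(s.strip() for s in sentences if s.strip()))
    # A seeded generator lets every reviewer regenerate the same sample.
    rng = random.Random(seed)
    # Never ask for more sentences than are available.
    k = min(sample_size, len(cleaned))
    return rng.sample(cleaned, k)


# Demo on a made-up corpus; in practice the sentences would be read from the .txt file.
corpus = [f"Sentence number {i}." for i in range(1000)]
sample = draw_review_sample(corpus, sample_size=50)
print(len(sample))
```

Fixing the seed means the same sample can be regenerated later if the review spreadsheet needs to be rebuilt.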
We're looking for an error rate below 5% on the random sample. You can use this tool with a confidence level of 99% and a margin of error of 2% to determine the sample size you need to review. Feel free to set up this QA however makes the most sense for you, but here's a sample Google Spreadsheets template from Mozilla and the one for Luganda. Once the review is complete, submit a pull request with the number of sentences submitted, a link to the manual QA results, and the percentage error rate. Here's an example PR.
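To sanity-check the sample size and the resulting error rate, here is a rough sketch. It assumes the linked tool uses the standard Cochran formula with a finite population correction and a worst-case proportion of 0.5; the tool's actual method is not stated here, so its result may differ slightly. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def required_sample_size(population, confidence=0.99, margin=0.02, p=0.5):
    """Estimate how many sentences to review (Cochran's formula, assumed)."""
    # Two-sided z-score for the requested confidence level (about 2.576 for 99%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Finite population correction shrinks the sample for smaller corpora.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


def error_rate(flagged, reviewed):
    """Fraction of reviewed sentences that reviewers marked as invalid."""
    if reviewed <= 0:
        raise ValueError("reviewed must be positive")
    return flagged / reviewed


# Example: a corpus of 100,000 sentences at 99% confidence and 2% margin.
n = required_sample_size(100_000)
print(n)
# Made-up review outcome: 120 flagged sentences among those reviewed.
rate = error_rate(120, n)
print(f"{rate:.2%}", "passes" if rate < 0.05 else "fails")
```

The 5% threshold in the last line mirrors the requirement above; the flagged count is invented purely for the demonstration.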
Quick Steps
Link to the Common Voice repository
- Fork and clone the repository
- git checkout -b add-batch-five-lug
- Add the filename.txt to the folder Server/data/lg/
- git add Server/data/lg/filename.txt
- git status
- git commit -m "Added new batch of 1000 sentences"
- git push origin add-batch-five-lug (commit 8007 error rate)
The link for storing the manual QA results for Luganda is here.