
Bulk Submissions

KakoozaJerry edited this page Jul 9, 2021 · 3 revisions

Bulk submission

If you know of a public-domain corpus containing more than 100k sentences, you can manually submit a pull request to add it as a bulk dataset. However, you will first need to perform manual QA (quality assurance) to make sure the sentences are valid and of high quality.

This Discourse post has a more detailed guide for how to do manual QA, but in brief:

  1. You need 2-3 native speakers to review a random sample of the sentences and verify their correctness.

  2. The sentences should be spelt correctly.

  3. The sentences should be grammatically correct.

  4. The sentences should be speakable (this also means avoiding uncommon non-native words).
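
Drawing the random sample for the reviewers can be scripted. The following is a minimal Python sketch, assuming the corpus is available as one sentence per line. The function name `draw_review_sample` and the seed value are illustrative choices and are not part of any Common Voice tooling.

```python
import random


def draw_review_sample(sentences, sample_size, seed=0):
    """Return `sample_size` sentences chosen uniformly at random, without replacement.

    A fixed seed makes the draw reproducible, so every reviewer can be
    given exactly the same sample.
    """
    # Strip surrounding whitespace and drop blank lines before sampling.
    cleaned = [s.strip() for s in sentences if s.strip()]
    if sample_size > len(cleaned):
        raise ValueError("sample_size exceeds the number of available sentences")
    return random.Random(seed).sample(cleaned, sample_size)


if __name__ == "__main__":
    # Tiny in-memory corpus for illustration only; a real corpus would be
    # read from a text file with open(path, encoding="utf-8").
    corpus = [f"Placeholder sentence {i}." for i in range(1, 101)]
    for sentence in draw_review_sample(corpus, 5, seed=42):
        print(sentence)
```

Sampling without replacement keeps any sentence from being reviewed twice, and sharing the seed lets all 2-3 reviewers work from the same list.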

We're looking for an error rate below 5% on the random sample. You can use this tool with a confidence level of 99% and a margin of error of 2% to determine the sample size you need to review. Feel free to set up this QA however makes the most sense for you, but here's a sample Google Sheets template from Mozilla and the one for Luganda. Once the review is complete, submit a pull request with the number of sentences submitted, a link to the manual QA results, and the error rate as a percentage. Here's an example PR.
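
For a rough idea of how many sentences the review involves, the sample size can also be estimated in code. This is a minimal sketch that assumes the standard Cochran formula with a finite population correction and the conservative proportion of 0.5. The linked tool may use a slightly different formula, so treat its output as the reference. The function names and the flagged-sentence count in the demo are hypothetical.

```python
import math
from statistics import NormalDist


def required_sample_size(population, confidence=0.99, margin=0.02, proportion=0.5):
    """Estimate the review sample size (Cochran's formula, finite population)."""
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * proportion * (1 - proportion) / (margin ** 2)
    # Shrink for the finite corpus size and round up to a whole sentence.
    return math.ceil(n0 / (1 + (n0 - 1) / population))


def error_rate(flagged, reviewed):
    """Fraction of reviewed sentences that the reviewers flagged as invalid."""
    if reviewed <= 0:
        raise ValueError("reviewed must be a positive number of sentences")
    return flagged / reviewed


if __name__ == "__main__":
    n = required_sample_size(100_000)
    print(f"Sentences to review for a 100k corpus: {n}")
    # Hypothetical review outcome: 150 flagged sentences in the sample.
    rate = error_rate(flagged=150, reviewed=n)
    print(f"Error rate: {rate:.2%}, below the 5% threshold: {rate < 0.05}")
```

Because the required sample grows very slowly with corpus size, the review effort stays roughly constant even for corpora far larger than 100k sentences.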

Quick Steps

Link to the Common Voice repository. Fork and clone the repository, then:

  • git checkout -b add-batch-five-lug
  • Add your sentence file (filename.txt) to the folder server/data/lg/
  • git add server/data/lg/filename.txt
  • git status
  • git commit -m "Added new batch of 1000 sentences"
  • git push origin add-batch-five-lug ---commit 8007 error rate
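
Before committing, it can help to confirm the sentence count that will be reported in the pull request. The following is a minimal Python sketch, assuming the batch file holds one sentence per line. The helper `summarize_batch` is hypothetical, not part of the Common Voice repository, and the demo sentences are placeholders.

```python
from collections import Counter


def summarize_batch(lines):
    """Count usable sentences and list exact duplicates in a batch."""
    # Ignore blank lines and surrounding whitespace.
    sentences = [line.strip() for line in lines if line.strip()]
    counts = Counter(sentences)
    duplicates = sorted(s for s, c in counts.items() if c > 1)
    return {
        "total": len(sentences),
        "unique": len(counts),
        "duplicates": duplicates,
    }


if __name__ == "__main__":
    # Placeholder batch; in practice, read the lines from your file with
    # open(path, encoding="utf-8").
    batch = [
        "First placeholder sentence.",
        "Second placeholder sentence.",
        "First placeholder sentence.",
        "",
    ]
    summary = summarize_batch(batch)
    print(f"Total sentences: {summary['total']}")
    print(f"Unique sentences: {summary['unique']}")
    print(f"Duplicated sentences: {summary['duplicates']}")
```

The `total` value is the figure to quote as the number of sentences submitted, and any entries under `duplicates` are worth removing before the push.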

The link for storing the manual QA results for Luganda is here.
