Filip Ginter committed Dec 10, 2024
1 parent 346d63a commit 54d591a
Showing 1 changed file with 8 additions and 12 deletions.
20 changes: 8 additions & 12 deletions release_checklist.html
@@ -186,8 +186,7 @@ <h1 id="repository-and-files">Repository and files</h1>

<h2 id="data-split">Data split</h2>

<p>These guidelines are strong recommendations rather than strict rules. In some
cases it may not be even possible to meet all of them with a given dataset.</p>
<p>These guidelines are strong recommendations rather than strict rules.</p>

<p>The general underlying idea is that we do not want to ban tiny datasets from
being released, but we only want to distinguish training and test sets if
@@ -201,8 +200,8 @@ <h2 id="data-split">Data split</h2>
in the given language, we may set aside a small sample and call it “train”.
This may be useful if the treebank is used in a shared task where the systems
are supposed to use cross-lingual projection techniques and cannot access the test data,
but the developers should have access at least to a sample of the language
so they can see the annotation, language properties etc.
but the developers should have access at least to a small sample of the language
so they can see the annotation, language properties, etc.
In other situations, users should still use the entire data for training and
testing via cross-validation.
Providing the sample is completely optional.
@@ -213,21 +212,18 @@ <h2 id="data-split">Data split</h2>
<li>If you have less than 20K words:
<ul>
<li>Option A: Keep everything as test data. Users will have to do 10-fold cross-validation if they want to train on it.</li>
<li>Option B: If there are no larger treebanks of this language, keep almost everything as test data but set aside a small sample (20 to 50 sentences) and call it “train”. Consider translating and annotating the 20 examples from the <a href="https://github.com/UniversalDependencies/cairo/blob/master/translations.txt">Cairo Cicling Corpus</a> (CCC) and providing them as the sample.</li>
<li>Option B: Only if there are no larger treebanks of this language, keep almost everything as test data but set aside a small sample (20 to 50 sentences) and call it “train”. Consider translating and annotating the 20 examples from the <a href="https://github.com/UniversalDependencies/cairo/blob/master/translations.txt">Cairo Cicling Corpus</a> (CCC) and providing them as the sample.</li>
</ul>
</li>
<li>If you have between 20K and 30K words, take 10K as test data and the rest as training data.</li>
<li>If you have between 30K and 100K words, take 10K as test data, 10K as dev data and the rest as training data.</li>
<li>If you have more than 100K words, take 80% as training data, 10% (min 10K words) as dev data and 10% (min 10K words) as test data.</li>
<li>If you have between 20K and 110K words, take a minimum of 10K words as test data, 10% of the remainder as dev data, and the remainder as training data.</li>
<li>If you have more than 110K words, take between 10K words and 10% of the data as test data, take another portion between 10K words and 10% of the data as dev data, and the remainder as training data.</li>
<li>If the treebank contains running text (rather than random shuffled sentences), make sure you split the data on document boundaries. Shuffling sentences should be avoided if possible, but sometimes it is necessary in order to prevent copyright issues. If you must shuffle, consider shuffling blocks of sentences (up to N characters long) rather than individual sentences.</li>
<li>If the treebank contains different domains or genres, try to distribute them proportionally to training, dev and test. Ideally, it should be also possible to tell them apart by sentence ids.</li>
<li>If the treebank contains different domains or genres, try to distribute them proportionally to training, dev and test. Ideally, it should also be possible to tell them apart by sentence ids.</li>
<li>If the data in the treebank overlap with another UD treebank of the same language, make sure that the overlapping sentences end up in the same part (training/dev/test) in both treebanks! (By overlap we mean duplicate source text but not individual simple sentences that occur naturally at different positions of independent texts.)</li>
<li>If this is one language of a multi-lingual parallel treebank, make sure that corresponding sentences in all languages end up in the same part (training/dev/test)!</li>
<li>It is desirable that the data split of one treebank is stable across UD releases, i.e. a sentence that was in training data in release N is not moved to dev or test data in release N+1. We want to prevent accidental misguided results of experiments where people take a parser trained on UD 1.1 and apply it to test data from UD 1.2. In exceptional cases some restructuring that violates this rule can be approved by the release team, provided there are good reasons for it. (One obviously valid reason is that a growing treebank exceeds the 20K-word threshold and is split to training-test.) If at all possible, please try to plan ahead and minimize the need for re-splits in the future.</li>
<li>It is desirable that the data split of one treebank is stable across UD releases, i.e., a sentence that was in training data in release N is not moved to dev or test data in release N+1. We want to prevent accidental misguided results of experiments where people take a parser trained on UD 1.1 and apply it to test data from UD 1.2. In exceptional cases some restructuring that violates this rule can be approved by the release team, provided there are good reasons for it. (One obviously valid reason is that a growing treebank exceeds the 20K-word threshold and is split to training and test.) If at all possible, please try to plan ahead and minimize the need for re-splits in the future.</li>
</ol>
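
<p>The size thresholds in the list above can be sketched as a small helper. This is only an illustration, not part of the UD tooling: the function name <code>suggest_split</code> and its <code>heldout_words</code> parameter are hypothetical, the sketch reads "the remainder" in both size rules as training data, and it uses exactly 10K words (the stated minimum) as test data in the 20K to 110K range. The optional "train" sample of Option B is not modeled.</p>

```python
def suggest_split(total_words, heldout_words=None):
    """Suggest target word counts for train/dev/test.

    Hypothetical helper following the size thresholds in the checklist.
    The counts are targets only; the actual split should still fall on
    document boundaries.
    """
    if total_words < 0:
        raise ValueError("total_words must be non-negative")

    # Fewer than 20K words: keep everything as test data (Option A).
    if total_words < 20_000:
        return {"train": 0, "dev": 0, "test": total_words}

    # 20K to 110K words: 10K test (assumed: exactly the minimum),
    # 10% of the remainder as dev, the rest as training data.
    if total_words <= 110_000:
        test = 10_000
        dev = (total_words - test) // 10
        return {"train": total_words - test - dev, "dev": dev, "test": test}

    # More than 110K words: test and dev each between 10K words and
    # 10% of the data. By default this sketch takes the upper bound.
    upper = total_words // 10
    if heldout_words is None:
        heldout_words = upper
    size = min(max(heldout_words, 10_000), upper)
    return {"train": total_words - 2 * size, "dev": size, "test": size}
```

<p>For example, <code>suggest_split(50_000)</code> suggests 36,000 training words, 4,000 dev words and 10,000 test words, while <code>suggest_split(200_000, heldout_words=10_000)</code> keeps dev and test at the 10K minimum and leaves 180,000 words for training.</p>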

<p>See also <a href="https://cl.lingfil.uu.se/pipermail/ud/2015-November/000095.html">this e-mail thread</a>.</p>
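
<p>The recommendation in the list above to shuffle blocks of sentences rather than individual sentences could look roughly like the following. This is a minimal sketch under stated assumptions: the function name <code>shuffle_in_blocks</code> is hypothetical, sentences are plain strings, block length is measured as the sum of sentence lengths in characters, and a single sentence longer than the limit simply forms a block of its own.</p>

```python
import random


def shuffle_in_blocks(sentences, max_chars, seed=0):
    """Shuffle consecutive blocks of sentences instead of single sentences.

    Hypothetical helper: sentences are grouped greedily into blocks of at
    most max_chars characters, then the blocks are shuffled. Sentence
    order inside each block is preserved.
    """
    blocks, current, current_len = [], [], 0
    for sentence in sentences:
        # Start a new block when adding this sentence would exceed the limit.
        if current and current_len + len(sentence) > max_chars:
            blocks.append(current)
            current, current_len = [], 0
        current.append(sentence)
        current_len += len(sentence)
    if current:
        blocks.append(current)

    # A fixed seed keeps the shuffle reproducible between runs.
    random.Random(seed).shuffle(blocks)
    return [sentence for block in blocks for sentence in block]
```

<p>Using a fixed seed is a design choice in this sketch: it makes the shuffled order reproducible, which helps keep the split stable across releases.</p>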

<h2 id="the-readme-file">The README file</h2>

<p>The <code class="language-plaintext highlighter-rouge">README</code> file is distributed together with the data and summarizes information about the treebank for its users;