Create Private Dataset Owner FAQ and Reviewer SOP (SEACrowd#156)
* Create PRIVATE.md

* Update PRIVATE.md

* Create REVIEWING.md
jcblaisecruz02 authored Dec 26, 2023
1 parent 8a9512f commit a3a6a84
Showing 2 changed files with 79 additions and 0 deletions.
26 changes: 26 additions & 0 deletions PRIVATE.md
# Private Dataset Owner FAQ
These frequently asked questions are compiled to assist owners of private, non-open-source datasets in making informed decisions about the SEACrowd Project.

1. **What is SEACrowd?** - SEACrowd is, in short, a project initiated by SEA researchers who aim to create the most comprehensive and standardized inventory of all datasets available for SEA languages. These datasets may then be used by AI, NLP, and Computational Linguistics researchers for open-source projects. You may check this link for more information.

2. **Why was I contacted by a SEACrowd Moderator regarding my dataset?** - A contributor to the project submitted a request to add your dataset to our inventory. As per our policy, we do not automatically add a private / non-open-source / non-openly-available dataset even if the submitted datasheet was approved by the dataset reviewers. We always strive to contact the original authors first for their permission and to identify what level of “openness” they wish to grant should they allow their dataset to be included in the inventory.

3. **If I open my dataset, does it mean that SEACrowd will own my dataset?** - No, the original dataset owner will always be you. SEACrowd only creates a datasheet so other people can find your dataset in our Catalogue and provides a dataloader so other researchers can load your dataset easily. Your original repository is still the original data source.

4. **Do I have to open-source my dataset?** - Yes. To be included in SEACrowd, you must open-source your dataset. Your dataset may be hosted anywhere as long as it is accessible and downloadable through a link. Additionally, you may opt to collect data about users who wish to access your dataset once the dataloader has been implemented and the SEACrowd catalog is live. You may find details for that [here](https://huggingface.co/docs/hub/datasets-gated).
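    For reference, gating on the HuggingFace Hub is declared in the YAML front matter of the dataset repository's `README.md`. A minimal sketch (the prompt wording and the extra fields below are illustrative, not required values):

    ```yaml
    ---
    # Ask requesters to agree to terms before they can access the files.
    extra_gated_prompt: "You agree to use this dataset for research purposes only."
    # Collect extra information from each requester.
    extra_gated_fields:
      Affiliation: text
      I agree to cite the dataset paper: checkbox
    ---
    ```

    With this front matter in place, users must submit the form (and you can review their answers) before downloading the data.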

5. **My dataset uses a specific license. Will there be a license change if I opt to allow my dataset to be listed?** - No, there will not be a license change. We understand that many datasets in the SEA region are locked behind licenses imposed by funding requirements, institutional regulations, and other constraints. SEACrowd itself (including the dataloader to be built for your dataset) uses the [Apache-2.0 open-source license](https://www.apache.org/licenses/LICENSE-2.0.html), but this only applies to our code and our project, which are separate artifacts from your data.

6. **I would like to give my permission to open-source and have my dataset listed. What should I do?** - Thank you for your contribution! Please reply to our moderator’s request email and, in writing, provide your permission. We may contact you further to request more information about your dataset (or to request a copy of a subset of your data) to assist in building the dataloader.

7. **I agree to open-source my data, but I am quite lost on how to make it available online. What should I do?** - You may host your data through numerous providers such as the [HuggingFace Hub](https://huggingface.co/datasets), or through your own project website or in a GitHub repository (if the size is small enough). If you are having trouble, we would be more than happy to assist you.

8. **Do you have a rubric on how dataset submissions are reviewed for inclusion to the inventory? How do I know that my dataset will be part of a properly curated inventory with good data standards?** - We provide a short rubric of our review guidelines [here](https://github.com/SEACrowd/seacrowd-datahub/blob/master/REVIEWING.md) for your reference.

9. **What are the advantages of adding my dataset to SEACrowd?** - Branding and advertisement are the strongest draws, as listing allows other researchers to discover that your dataset and organization exist. This may lead to potential collaborations and projects in the future.

10. **How many contribution points will I get for opening my dataset?** - Please take a look at our detailed [guide](https://github.com/SEACrowd/seacrowd-datahub/blob/master/POINTS.md) on contribution points on GitHub.

11. **What languages are you allowing to be included in your catalogue?** - We provide a list of languages we support [here](https://github.com/SEACrowd/seacrowd-datahub/blob/master/LANGUAGES.md).

For any other questions, please do not hesitate to contact any of our moderators.
53 changes: 53 additions & 0 deletions REVIEWING.md
# Datasheet Reviewer SOP

Generally, the objective of datasheet review is to ensure that:

### 1. The dataset is available and accessible.
FAQs:
1. Can the dataset be free-upon-request?
* Yes. For example, we can approve datasets that are hosted on hubs such as HuggingFace but gated behind a required acknowledgement of terms and conditions. The dataset must indicate that it is free-upon-request.

### 2. There must be no duplicate datasheets.
FAQs:
1. What if a contributor submitted a new datasheet for `X` but a datasheet for `X` is already approved?
    * *Is the new datasheet more complete and better than the existing datasheet?*
        * **Yes** → Proceed with the normal review process, then change the existing datasheet’s status to “Deprecated”.
        * **No** → Reject.

2. What if more than one contributor submitted new datasheets of the same dataset, and all of them have not yet been approved? (After 23 Nov)
* Pick one that is relatively better than the others, fix the incorrect/inconsistent parts, then “Approve”.

3. What if more than one contributor submitted new datasheets of the same dataset, and all of them have not yet been approved? (Before 23 Nov)
* Pick one that is more complete than the others, fix any incorrect/inconsistent parts, then “Approve”.
* For the others, use “sharing points” status.
    * Split the datasheet’s points between the contributors. It doesn’t have to be an equal split; the contributor who gives more complete information can receive more points.
* For example, for a datasheet worth 6 points, the assignment could be: contributor 1 gets 3 points, contributor 2 gets 2 points, contributor 3 gets 1 point.
    * We may simplify this by accepting only the subset of identical submissions with the most complete information and setting a ratio split of points (e.g., 7:3).
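The split above can be sketched as a small helper. This is an illustration only, not part of the SEACrowd tooling; the function name and the rule of giving leftover points to earlier (more complete) contributions are assumptions:

```python
def split_points(total: int, ratio: list[int]) -> list[int]:
    """Split a datasheet's points between contributors by an integer ratio.

    Any remainder goes to the highest-ranked (most complete) contributions first.
    """
    denom = sum(ratio)
    # Floor each contributor's proportional share.
    shares = [total * r // denom for r in ratio]
    # Hand out leftover points starting from the first contributor.
    for i in range(total - sum(shares)):
        shares[i] += 1
    return shares

# The 6-point example from above, split 3:2:1 between three contributors:
print(split_points(6, [3, 2, 1]))  # [3, 2, 1]
# A two-way 7:3 split of the same 6 points:
print(split_points(6, [7, 3]))     # [5, 1]
```

The total is always preserved, so the sum of the shares equals the datasheet's point value regardless of the ratio chosen.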

### 3. The information provided in the datasheet is correct, the information aligns with the dataset, and the dataset is relevant to SEA.
FAQs:
1. What should I do if the datasheet has incorrect or missing information?
* There are multiple ways to correct this:
* Ask the contributor to fix it (with some guidance) using the edit link (column AU in the approval sheet).
* Reviewer uses the edit link (column AU) to fix it themselves.
    * **[NOT RECOMMENDED]** Reviewer uses the hidden sheet (_raw) to edit the cells directly. This is only appropriate when a large amount of data across subsets of the dataset must be edited.

2. What should I check and how should I proceed?
* See the checklist below.

## Approval Checklist
Check the following before approving:
1. Data availability (is it free and open-source or is it private?)
2. Dataset splits (whether train, validation, or test splits are available)
3. Dataset size (in lines, disk size, or any provided metric)
4. Dataset license
5. Task type (whether the data can be represented as the mentioned task)
6. Paper (whether it links to the correct publication; the archival version takes priority)
7. Languages (a list of all languages the dataset covers)

## What to do next?
1. Change the status to **Rejected**, **Approved**, or **Sharing Points**.
2. Add notes and obtained points (in column BB in the approval sheet).
3. Check the scoring guide and see which languages get additional points (if any).
4. Add the dataloader name (use Python snake_case).
5. Wait for a GitHub issue to be generated for the approved datasheet.
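As a rough illustration of the snake_case convention for dataloader names, a helper like the following could derive one from a dataset title (this function and the example title are hypothetical, not SEACrowd tooling):

```python
import re

def to_snake_case(name: str) -> str:
    """Convert a dataset title to a Python snake_case identifier."""
    # Lowercase, collapse every run of non-alphanumeric characters
    # into a single underscore, then trim stray edge underscores.
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")

print(to_snake_case("IndoNLU Benchmark v1.0"))  # indonlu_benchmark_v1_0
```

Whatever convention is used, the key point is that the name must be a valid Python module name, since it becomes the dataloader's filename.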
