
Commit

Merge pull request #37 from SunbirdAI/dataset-docs
adding documentation for salt dataset
evie-8 authored Nov 28, 2024
2 parents 6e0a5cc + 39a4d0a commit 470bd95
Showing 19 changed files with 391 additions and 99 deletions.
4 changes: 3 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -170,7 +170,9 @@ poetry.toml
# ruff
.ruff_cache/

.history

# LSP config files
pyrightconfig.json

# End of https://www.toptal.com/developers/gitignore/api/python
4 changes: 2 additions & 2 deletions README.md
@@ -1,10 +1,10 @@
# leb 💬
# SALT 💬
Language experimentation tools to accompany the SALT dataset

## Docs
After editing a documentation .md file:

1. You can view the documentation locally by running `mkdocs serve
1. You can view the documentation locally by running `mkdocs serve`
2. If all looks good, run `./build_and_deploy_docs.sh` to build and deploy the documentation


1 change: 1 addition & 0 deletions docs/API/index.md
@@ -0,0 +1 @@
# SUNBIRDAI API
Binary file added docs/assets/favicon.ico
Binary file not shown.
Binary file added docs/assets/logo.png
Binary file not shown.
2 changes: 2 additions & 0 deletions docs/blog/index.md
@@ -0,0 +1,2 @@
# Blog

21 changes: 9 additions & 12 deletions docs/index.md
@@ -1,15 +1,12 @@
This site contains the project documentation for the `leb` project used for the [Sunbird AI Language Projects](https://sunbird.ai/portfolio/african-languages/).
# SALT Documentation

## Welcome to the SALT project documentation!

# Leb Documentation
This documentation serves as the official guide for the [**SALT**](https://github.com/SunbirdAI/salt) project, which is part of the [Sunbird AI Language Projects](https://sunbird.ai/portfolio/african-languages/). The goal of this documentation is to provide you with comprehensive information on how to use the SALT project effectively.

Welcome to the Leb project documentation!
<!-- ## Table Of Contents
This documentation serves as the official guide for the **Leb** project, which is part of the [Sunbird AI Language Projects](https://sunbird.ai/portfolio/african-languages/). The goal of this documentation is to provide you with comprehensive information on how to use the Leb project effectively.

## Table Of Contents

- [💬 LEB](index.md)
- [💬 SALT](index.md)
- [Getting Started](#getting-started)
- [Introduction](tutorials/01-introduction.md)
- [Installation](tutorials/02-installation.md)
@@ -18,14 +15,14 @@ This documentation serves as the official guide for the **Leb** project, which i
- [Beginner](#beginner)
- [Basics](tutorials/04-basics.md)
- [Data Exploration](tutorials/05-data-exploration.md)
- [Leb Datasets](#leb-datasets)
- [SALT Datasets](#salt-datasets)
- [Text Datasets](tutorials/06-text-datasets.md)
- [Speech Datasets](tutorials/07-speech-datasets.md)
- [Leb Models](#leb-models)
- [SALT Models](#salt-models)
- [Translation Models](tutorials/08-translation-models.md)
- [ASR Models](tutorials/09-asr-models.md)
- [TTS Models](tutorials/10-tts-models.md)
- [Leb Pipelines](#leb-pipelines)
- [SALT Pipelines](#salt-pipelines)
- [Data Loading](tutorials/11-data-loading.md)
- [Training](tutorials/12-training.md)
- [Speaker Diarization](#diarization)
@@ -37,4 +34,4 @@ This documentation serves as the official guide for the **Leb** project, which i
Quickly find what you're looking for depending on your use case by looking at the different sections and subsections.

-->
2 changes: 1 addition & 1 deletion docs/reference.md
@@ -1,4 +1,4 @@
This part of the project documentation focuses on
an **information-oriented** approach. Use it as a
reference for the technical implementation of the
`leb` project code.
`SALT` project code.
3 changes: 3 additions & 0 deletions docs/stylesheets/custom.css
@@ -0,0 +1,3 @@
.md-footer-meta__inner {
display: none;
}
40 changes: 26 additions & 14 deletions docs/tutorials/04-basics.md
@@ -15,7 +15,7 @@ set up the configs
```python

yaml_config = '''
huggingface_load:
path: Sunbird/salt
split: train
name: text-all
@@ -34,20 +34,32 @@ ds = leb.dataset.create(config)
list(ds.take(5))

```
output

```
[{'source': '>>lug<< Eggplants always grow best under warm conditions.',
'target': 'Bbiringanya lubeerera asinga kukulira mu mbeera ya bugumu'},
{'source': '>>ach<< Eggplants always grow best under warm conditions.',
'target': 'Bilinyanya pol kare dongo maber ka lyeto tye'},
{'source': '>>lug<< Farmland is sometimes a challenge to farmers.',
'target': "Ettaka ly'okulimirako n'okulundirako ebiseera ebimu kisoomooza abalimi"},
{'source': '>>ach<< Farmland is sometimes a challenge to farmers.',
'target': 'Ngom me pur i kare mukene obedo peko madit bot lupur'},
{'source': '>>lug<< Farmers should be encouraged to grow more coffee.',
 'target': 'Abalimi balina okukubirizibwa okwongera okulima emmwanyi'}]
```
output

```json
[
{
"source": ">>lug<< Eggplants always grow best under warm conditions.",
"target": "Bbiringanya lubeerera asinga kukulira mu mbeera ya bugumu"
},
{
"source": ">>ach<< Eggplants always grow best under warm conditions.",
"target": "Bilinyanya pol kare dongo maber ka lyeto tye"
},
{
"source": ">>lug<< Farmland is sometimes a challenge to farmers.",
"target": "Ettaka ly'okulimirako n'okulundirako ebiseera ebimu kisoomooza abalimi"
},
{
"source": ">>ach<< Farmland is sometimes a challenge to farmers.",
"target": "Ngom me pur i kare mukene obedo peko madit bot lupur"
},
{
"source": ">>lug<< Farmers should be encouraged to grow more coffee.",
"target": "Abalimi balina okukubirizibwa okwongera okulima emmwanyi"
}
]
```

This is how a basic data loader works
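The output above follows a simple tagging convention: each source sentence is prefixed with a `>>lang<<` token naming the target language. A minimal sketch of that convention in plain Python (illustrative only; `make_pairs` is a hypothetical helper, not part of `leb`, which handles this internally):

```python
# Illustrative only: mimics the ">>lang<<" tagging convention seen in the
# loader output above; the real leb.dataset.create builds these pairs itself.
def make_pairs(sentence, translations):
    """Build one (source, target) pair per target language.

    translations: dict mapping a language code (e.g. 'lug') to the
    translated sentence.
    """
    return [
        {"source": f">>{lang}<< {sentence}", "target": text}
        for lang, text in translations.items()
    ]

pairs = make_pairs(
    "Eggplants always grow best under warm conditions.",
    {"lug": "Bbiringanya lubeerera asinga kukulira mu mbeera ya bugumu"},
)
print(pairs[0]["source"])  # >>lug<< Eggplants always grow best under warm conditions.
```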
Empty file removed docs/tutorials/06-text-datasets.md
Empty file.
42 changes: 42 additions & 0 deletions docs/tutorials/07-speech-datasets.md
@@ -0,0 +1,42 @@
# Sunbird African Language Technology (SALT) dataset

SALT is a multi-way parallel text and speech corpus of English and six languages widely spoken in Uganda and East Africa: `Luganda`, `Lugbara`, `Acholi`, `Runyankole`, `Ateso` and `Swahili`.
The core of the dataset is a set of `25,000` sentences covering a range of topics of local relevance, such as agriculture, health and society.
Each sentence is translated into all languages, to support machine translation, and speech recordings are made for approximately `5,000` of the sentences both by a variety of speakers in natural settings (suitable for ASR) and by professionals in a studio setting (suitable for text-to-speech).

## Subsets

| Subset name | Contents |
| --------------------- | --------------------------------------------------------------------------------- |
| text-all | Text translations of each sentence. |
| multispeaker-`{lang}` | Speech recordings of each sentence, by a variety of speakers in natural settings. |
| studio-`{lang}` | Speech recordings in a studio setting, suitable for text-to-speech. |

The sentence IDs map across subsets, so that for example the text of a sentence in Acholi can be mapped to the studio recording of that concept being expressed in Swahili.
The subsets can therefore be combined to support the training and evaluation of several further tasks, such as speech-to-text translation and speech-to-speech translation.
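Because sentence IDs align across subsets, combining two subsets amounts to a join on the ID field. A hedged sketch of that idea in plain Python — the field names (`id`, `audio`) are illustrative, not necessarily the dataset's exact schema:

```python
def join_by_id(text_records, audio_records):
    """Pair each audio record with the text record sharing its sentence ID."""
    text_by_id = {r["id"]: r for r in text_records}
    return [
        (text_by_id[a["id"]], a)
        for a in audio_records
        if a["id"] in text_by_id
    ]

# Toy records standing in for rows of, say, text-all and studio-swa.
texts = [{"id": 1, "ach": "..."}, {"id": 2, "ach": "..."}]
audio = [{"id": 2, "audio": "<waveform>"}]
paired = join_by_id(texts, audio)
print(len(paired))  # 1
```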

## Language support

| ISO 639-3 | Language | Translated text | Multispeaker speech | Studio speech |
| --------- | ------------------------ | --------------- | ------------------- | ------------- |
| eng | English (Ugandan accent) | Yes | Yes | Yes |
| lug | Luganda | Yes | Yes | Yes |
| ach | Acholi | Yes | Yes | Yes |
| lgg | Lugbara | Yes | Yes | Yes |
| teo | Ateso | Yes | Yes | Yes |
| nyn | Runyankole | Yes | Yes | Yes |
| swa | Swahili | Yes | No | Yes |
| ibo | Igbo | Yes | No | No |

## Helper utilities

Code for convenient experimentation with multilingual models can be found at [https://github.com/SunbirdAI/salt](https://github.com/SunbirdAI/salt).
See example notebooks [here](https://github.com/SunbirdAI/salt/tree/main/notebooks).

## Collaborators

This dataset was collected in practical collaboration between Sunbird AI and the Makerere University AI Lab (Ugandan languages) and KenCorpus, Maseno University (Swahili).

## Reference

[Machine Translation For African Languages: Community Creation Of Datasets And Models In Uganda](https://openreview.net/pdf?id=BK-z5qzEU-9). Benjamin Akera, Jonathan Mukiibi, Lydia Sanyu Naggayi, Claire Babirye, Isaac Owomugisha, Solomon Nsumba, Joyce Nakatumba-Nabende, Engineer Bainomugisha, Ernest Mwebaze, John Quinn. 3rd Workshop on African Natural Language Processing, 2022.
9 changes: 6 additions & 3 deletions docs/tutorials/09-asr-models.md
@@ -13,7 +13,8 @@ Before getting started, ensure that you have the following prerequisites:
## Installation
To begin, install the necessary dependencies by running the following commands:

```{bash}
```bash

!pip install -q jiwer evaluate
!pip install -qU accelerate
!pip install -q transformers[torch]
@@ -29,7 +30,8 @@ These commands will install the required libraries, including Jiwer, Evaluate, A
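Jiwer's core metric is the word error rate (WER): the word-level edit distance between hypothesis and reference, divided by the reference length. A stdlib-only illustration of the computation (in practice you would simply call `jiwer.wer`):

```python
def wer(reference, hypothesis):
    """Word error rate via Levenshtein distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

print(wer("the cat sat", "the cat sat down"))  # one insertion / 3 words ≈ 0.333
```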
Create a YAML configuration file named asr_config.yml with the necessary settings for your training. Here's an example configuration:


```{yaml}
```yaml

train:
source:
language: [luganda, english]
@@ -80,7 +82,8 @@ To use the trained model for inference, follow these steps:
1. Load the trained model and processor:


```{python}
```python

model = Wav2Vec2ForCTC.from_pretrained("path/to/trained/model")
processor = Wav2Vec2Processor.from_pretrained("path/to/processor")
```
21 changes: 12 additions & 9 deletions docs/tutorials/13-diarization.md
@@ -10,7 +10,7 @@ Speaker Diarization at Sunbird is performed using pyannote's speaker-diarization

The libraries needed to perform speaker diarization, run the pipeline efficiently, and compute various metrics are installed and imported below.

```python
```bash
!pip install pyctcdecode
!pip install kenlm
!pip install jiwer
@@ -19,8 +19,9 @@ The necessary libraries to perform speaker diarization required for efficient ex
!pip install pandas
!pip install pyannote.audio
!pip install onnxruntime
```


```python
import torch
from huggingface_hub import hf_hub_download
from transformers import (
@@ -65,7 +66,7 @@ tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(model_id)

#### Tokenizer setup

```python
tokenizer.set_target_lang("eng")
model.load_adapter("eng_meta")
```
@@ -82,6 +83,7 @@ sorted_vocab_dict = {k.lower(): v for k, v in sorted(vocab_dict.items(), key=lam
```

#### Language model file setup

Within the `Sunbird/sunbird-mms` huggingface repository is a subfolder named `language_model` containing various language models capable of efficient transcription.

```python
@@ -136,7 +138,7 @@ pipe = AutomaticSpeechRecognitionPipeline(
transcription = pipe("/content/Kibuuka_eng.mp3")
```

The resulting dictionary `transcription` will contain a `text` key with all the transcribed text, as well as a `chunks` key containing the individual text segments along with their timestamps, in the format below:

```python
{
@@ -165,7 +167,7 @@ import librosa
SAMPLE_RATE = 16000

def load_audio(file: str, sr: int = SAMPLE_RATE) -> np.ndarray:

try:
# librosa automatically resamples to the given sample rate (if necessary)
# and converts the signal to mono (by averaging channels)
@@ -175,6 +177,7 @@ def load_audio(file: str, sr: int = SAMPLE_RATE) -> np.ndarray:

return audio
```

The `load_audio` function takes an audio file and a sampling rate as parameters. The sampling rate used for this speaker diarization is 16000; it should match the sampling rate used when transcribing the audio with Sunbird MMS, to ensure consistency with the output.

**Diarization Pipeline**
@@ -183,7 +186,6 @@ The class `Diarization Pipeline` is a custom class created to facilitate the dia

It returns a pandas DataFrame with columns for the segment, label, speaker, start time, and end time of each speaker segment.


```python
class DiarizationPipeline:
def __init__(
@@ -242,7 +244,7 @@ The function iterates through segments of a transcript and assigns the speaker l
If there is no overlap, the `fill_nearest` parameter can be set to `True`; the function will then assign speakers to segments by finding the closest speaker in time.

The function takes parameters:

`diarize_df`: a pandas DataFrame returned by the `DiarizationPipeline`, containing the diarization information with columns like `start`, `end` and `speaker`.

`transcript_result`: A dictionary with a key `chunks` that contains a list of transcript `Segments` obtained from the ASR pipeline.
@@ -264,7 +266,7 @@ The function takes parameters:
```python

def assign_word_speakers(diarize_df, transcript_result, fill_nearest=False):

transcript_segments = transcript_result["chunks"]

for seg in transcript_segments:
@@ -288,6 +290,7 @@ def assign_word_speakers(diarize_df, transcript_result, fill_nearest=False):
```
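At its core, this assignment is an interval-overlap computation between each transcript chunk and each diarization row. A pandas-free sketch of that logic (illustrative only; `assign_speaker` is a simplified stand-in for the full function above, without the `fill_nearest` fallback):

```python
def assign_speaker(chunk, diarize_rows):
    """Pick the diarization speaker with maximum time overlap with the chunk.

    chunk: {"timestamp": (start, end), ...} as produced by the ASR pipeline
    diarize_rows: list of {"start": float, "end": float, "speaker": str}
    Returns None when no row overlaps the chunk.
    """
    c_start, c_end = chunk["timestamp"]
    best_speaker, best_overlap = None, 0.0
    for row in diarize_rows:
        # Length of the intersection of [c_start, c_end] and [row.start, row.end]
        overlap = min(c_end, row["end"]) - max(c_start, row["start"])
        if overlap > best_overlap:
            best_speaker, best_overlap = row["speaker"], overlap
    return best_speaker

rows = [
    {"start": 0.0, "end": 5.0, "speaker": "SPEAKER_00"},
    {"start": 5.0, "end": 10.0, "speaker": "SPEAKER_01"},
]
print(assign_speaker({"timestamp": (4.0, 7.0)}, rows))  # SPEAKER_01
```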

**Running the diarization model**

```python
diarize_model = DiarizationPipeline(use_auth_token=hf_token, device=device)
diarize_segments = diarize_model("/content/Kibuuka_eng.mp3", min_speakers=1, max_speakers=2)
@@ -445,4 +448,4 @@ output
{'text': 'you', 'timestamp': (45.48, 45.54), 'speaker': 'SPEAKER_01'},
{'text': 'are', 'timestamp': (45.56, 45.62), 'speaker': 'SPEAKER_01'},
{'text': 'married', 'timestamp': (45.68, 45.92), 'speaker': 'SPEAKER_01'}]}
```
