
Commit

Merge pull request #37 from SunbirdAI/dataset-docs
adding documentation for salt dataset
evie-8 authored Nov 28, 2024
2 parents 6e0a5cc + 39a4d0a commit 470bd95
Showing 19 changed files with 391 additions and 99 deletions.
4 changes: 3 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -170,7 +170,9 @@ poetry.toml
# ruff
.ruff_cache/

.history

# LSP config files
pyrightconfig.json

# End of https://www.toptal.com/developers/gitignore/api/python
4 changes: 2 additions & 2 deletions README.md
@@ -1,10 +1,10 @@
# leb 💬
# SALT 💬
Language experimentation tools to accompany the SALT dataset

## Docs
After editing a documentation .md file:

1. You can view the documentation locally by running `mkdocs serve
1. You can view the documentation locally by running `mkdocs serve`
2. If all looks good, run `./build_and_deploy_docs.sh` to build and deploy the documentation


1 change: 1 addition & 0 deletions docs/API/index.md
@@ -0,0 +1 @@
# SUNBIRDAI API
Binary file added docs/assets/favicon.ico
Binary file not shown.
Binary file added docs/assets/logo.png
Binary file not shown.
2 changes: 2 additions & 0 deletions docs/blog/index.md
@@ -0,0 +1,2 @@
# Blog

21 changes: 9 additions & 12 deletions docs/index.md
@@ -1,15 +1,12 @@
This site contains the project documentation for the `leb` project used for the [Sunbird AI Language Projects](https://sunbird.ai/portfolio/african-languages/).
# SALT Documentation

## Welcome to the SALT project documentation!

# Leb Documentation
This documentation serves as the official guide for the [**SALT**](https://github.com/SunbirdAI/salt) project, which is part of the [Sunbird AI Language Projects](https://sunbird.ai/portfolio/african-languages/). The goal of this documentation is to provide you with comprehensive information on how to use the SALT project effectively.

Welcome to the Leb project documentation!
<!-- ## Table Of Contents
This documentation serves as the official guide for the **Leb** project, which is part of the [Sunbird AI Language Projects](https://sunbird.ai/portfolio/african-languages/). The goal of this documentation is to provide you with comprehensive information on how to use the Leb project effectively.

## Table Of Contents

- [💬 LEB](index.md)
- [💬 SALT](index.md)
- [Getting Started](#getting-started)
- [Introduction](tutorials/01-introduction.md)
- [Installation](tutorials/02-installation.md)
@@ -18,14 +15,14 @@ This documentation serves as the official guide for the **Leb** project, which i
- [Beginner](#beginner)
- [Basics](tutorials/04-basics.md)
- [Data Exploration](tutorials/05-data-exploration.md)
- [Leb Datasets](#leb-datasets)
- [SALT Datasets](#salt-datasets)
- [Text Datasets](tutorials/06-text-datasets.md)
- [Speech Datasets](tutorials/07-speech-datasets.md)
- [Leb Models](#leb-models)
- [SALT Models](#salt-models)
- [Translation Models](tutorials/08-translation-models.md)
- [ASR Models](tutorials/09-asr-models.md)
- [TTS Models](tutorials/10-tts-models.md)
- [Leb Pipelines](#leb-pipelines)
- [SALT Pipelines](#salt-pipelines)
- [Data Loading](tutorials/11-data-loading.md)
- [Training](tutorials/12-training.md)
- [Speaker Diarization](#diarization)
@@ -37,4 +34,4 @@ This documentation serves as the official guide for the **Leb** project, which i
Quickly find what you're looking for depending on your use case by looking at the different sections and subsections.

-->
2 changes: 1 addition & 1 deletion docs/reference.md
@@ -1,4 +1,4 @@
This part of the project documentation focuses on
an **information-oriented** approach. Use it as a
reference for the technical implementation of the
`leb` project code.
`SALT` project code.
3 changes: 3 additions & 0 deletions docs/stylesheets/custom.css
@@ -0,0 +1,3 @@
.md-footer-meta__inner {
display: none;
}
40 changes: 26 additions & 14 deletions docs/tutorials/04-basics.md
@@ -15,7 +15,7 @@ set up the configs
```python

yaml_config = '''
huggingface_load:
path: Sunbird/salt
split: train
name: text-all
@@ -34,20 +34,32 @@ ds = leb.dataset.create(config)
list(ds.take(5))

```
output

```
[{'source': '>>lug<< Eggplants always grow best under warm conditions.',
'target': 'Bbiringanya lubeerera asinga kukulira mu mbeera ya bugumu'},
{'source': '>>ach<< Eggplants always grow best under warm conditions.',
'target': 'Bilinyanya pol kare dongo maber ka lyeto tye'},
{'source': '>>lug<< Farmland is sometimes a challenge to farmers.',
'target': "Ettaka ly'okulimirako n'okulundirako ebiseera ebimu kisoomooza abalimi"},
{'source': '>>ach<< Farmland is sometimes a challenge to farmers.',
'target': 'Ngom me pur i kare mukene obedo peko madit bot lupur'},
{'source': '>>lug<< Farmers should be encouraged to grow more coffee.',
 'target': 'Abalimi balina okukubirizibwa okwongera okulima emmwanyi'}]
```
output

```json
[
{
"source": ">>lug<< Eggplants always grow best under warm conditions.",
"target": "Bbiringanya lubeerera asinga kukulira mu mbeera ya bugumu"
},
{
"source": ">>ach<< Eggplants always grow best under warm conditions.",
"target": "Bilinyanya pol kare dongo maber ka lyeto tye"
},
{
"source": ">>lug<< Farmland is sometimes a challenge to farmers.",
"target": "Ettaka ly'okulimirako n'okulundirako ebiseera ebimu kisoomooza abalimi"
},
{
"source": ">>ach<< Farmland is sometimes a challenge to farmers.",
"target": "Ngom me pur i kare mukene obedo peko madit bot lupur"
},
{
"source": ">>lug<< Farmers should be encouraged to grow more coffee.",
"target": "Abalimi balina okukubirizibwa okwongera okulima emmwanyi"
}
]
```

This is how a basic data loader works
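The output above follows a simple tagging convention: each source sentence is prefixed with a `>>lang<<` token naming the target language. A minimal sketch of that convention in plain Python (illustrative only; `make_pairs` is a hypothetical helper, not part of `leb`, which handles this internally):

```python
# Illustrative only: mimics the ">>lang<<" tagging convention seen in the
# loader output above; the real leb.dataset.create builds these pairs itself.
def make_pairs(sentence, translations):
    """Build one (source, target) pair per target language.

    translations: dict mapping a language code (e.g. 'lug') to the
    translated sentence.
    """
    return [
        {"source": f">>{lang}<< {sentence}", "target": text}
        for lang, text in translations.items()
    ]

pairs = make_pairs(
    "Eggplants always grow best under warm conditions.",
    {"lug": "Bbiringanya lubeerera asinga kukulira mu mbeera ya bugumu"},
)
print(pairs[0]["source"])  # >>lug<< Eggplants always grow best under warm conditions.
```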
Empty file removed docs/tutorials/06-text-datasets.md
Empty file.
42 changes: 42 additions & 0 deletions docs/tutorials/07-speech-datasets.md
@@ -0,0 +1,42 @@
# Sunbird African Language Technology (SALT) dataset

SALT is a multi-way parallel text and speech corpus of English and six languages widely spoken in Uganda and East Africa: `Luganda`, `Lugbara`, `Acholi`, `Runyankole`, `Ateso` and `Swahili`.
The core of the dataset is a set of `25,000` sentences covering a range of topics of local relevance, such as agriculture, health and society.
Each sentence is translated into all languages, to support machine translation, and speech recordings are made for approximately `5,000` of the sentences both by a variety of speakers in natural settings (suitable for ASR) and by professionals in a studio setting (suitable for text-to-speech).

## Subsets

| Subset name | Contents |
| --------------------- | --------------------------------------------------------------------------------- |
| text-all | Text translations of each sentence. |
| multispeaker-`{lang}` | Speech recordings of each sentence, by a variety of speakers in natural settings. |
| studio-`{lang}` | Speech recordings in a studio setting, suitable for text-to-speech. |

The sentence IDs map across subsets, so that for example the text of a sentence in Acholi can be mapped to the studio recording of that concept being expressed in Swahili.
The subsets can therefore be combined to support the training and evaluation of several further tasks, such as speech-to-text translation and speech-to-speech translation.
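Because sentence IDs align across subsets, combining two subsets amounts to a join on the ID field. A hedged sketch of that idea in plain Python — the field names (`id`, `audio`) are illustrative, not necessarily the dataset's exact schema:

```python
def join_by_id(text_records, audio_records):
    """Pair each audio record with the text record sharing its sentence ID."""
    text_by_id = {r["id"]: r for r in text_records}
    return [
        (text_by_id[a["id"]], a)
        for a in audio_records
        if a["id"] in text_by_id
    ]

# Toy records standing in for rows of, say, text-all and studio-swa.
texts = [{"id": 1, "ach": "..."}, {"id": 2, "ach": "..."}]
audio = [{"id": 2, "audio": "<waveform>"}]
paired = join_by_id(texts, audio)
print(len(paired))  # 1
```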

## Language support

| ISO 639-3 | Language | Translated text | Multispeaker speech | Studio speech |
| --------- | ------------------------ | --------------- | ------------------- | ------------- |
| eng | English (Ugandan accent) | Yes | Yes | Yes |
| lug | Luganda | Yes | Yes | Yes |
| ach | Acholi | Yes | Yes | Yes |
| lgg | Lugbara | Yes | Yes | Yes |
| teo | Ateso | Yes | Yes | Yes |
| nyn | Runyankole | Yes | Yes | Yes |
| swa | Swahili | Yes | No | Yes |
| ibo | Igbo | Yes | No | No |

## Helper utilities

Code for convenient experimentation with multilingual models can be found at [https://github.com/SunbirdAI/salt](https://github.com/SunbirdAI/salt).
See example notebooks [here](https://github.com/SunbirdAI/salt/tree/main/notebooks).

## Collaborators

This dataset was collected in practical collaboration between Sunbird AI and the Makerere University AI Lab (Ugandan languages) and KenCorpus, Maseno University (Swahili).

## Reference

[Machine Translation For African Languages: Community Creation Of Datasets And Models In Uganda](https://openreview.net/pdf?id=BK-z5qzEU-9). Benjamin Akera, Jonathan Mukiibi, Lydia Sanyu Naggayi, Claire Babirye, Isaac Owomugisha, Solomon Nsumba, Joyce Nakatumba-Nabende, Engineer Bainomugisha, Ernest Mwebaze, John Quinn. 3rd Workshop on African Natural Language Processing, 2022.
9 changes: 6 additions & 3 deletions docs/tutorials/09-asr-models.md
@@ -13,7 +13,8 @@ Before getting started, ensure that you have the following prerequisites:
## Installation
To begin, install the necessary dependencies by running the following commands:

```{bash}
```bash

!pip install -q jiwer evaluate
!pip install -qU accelerate
!pip install -q transformers[torch]
@@ -29,7 +30,8 @@ These commands will install the required libraries, including Jiwer, Evaluate, A
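Jiwer's core metric is the word error rate (WER): the word-level edit distance between hypothesis and reference, divided by the reference length. A stdlib-only illustration of the computation (in practice you would simply call `jiwer.wer`):

```python
def wer(reference, hypothesis):
    """Word error rate via Levenshtein distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

print(wer("the cat sat", "the cat sat down"))  # one insertion / 3 words ≈ 0.333
```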
Create a YAML configuration file named asr_config.yml with the necessary settings for your training. Here's an example configuration:


```{yaml}
```yaml

train:
source:
language: [luganda, english]
@@ -80,7 +82,8 @@ To use the trained model for inference, follow these steps:
1. Load the trained model and processor:


```{python}
```python

model = Wav2Vec2ForCTC.from_pretrained("path/to/trained/model")
processor = Wav2Vec2Processor.from_pretrained("path/to/processor")
```
21 changes: 12 additions & 9 deletions docs/tutorials/13-diarization.md
@@ -10,7 +10,7 @@ Speaker Diarization at Sunbird is performed using pyannote's speaker-diarization

The libraries needed to perform speaker diarization, run the pipeline efficiently, and compute various metrics are installed and imported below.

```python
```bash
!pip install pyctcdecode
!pip install kenlm
!pip install jiwer
@@ -19,8 +19,9 @@ The necessary libraries to perform speaker diarization required for efficient ex
!pip install pandas
!pip install pyannote.audio
!pip install onnxruntime
```


```python
import torch
from huggingface_hub import hf_hub_download
from transformers import (
@@ -65,7 +66,7 @@ tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(model_id)

#### Tokenizer setup

```python
tokenizer.set_target_lang("eng")
model.load_adapter("eng_meta")
```
@@ -82,6 +83,7 @@ sorted_vocab_dict = {k.lower(): v for k, v in sorted(vocab_dict.items(), key=lam
```

#### Language model file setup

Within the `Sunbird/sunbird-mms` huggingface repository is a subfolder named `language_model` containing various language models capable of efficient transcription.

```python
@@ -136,7 +138,7 @@ pipe = AutomaticSpeechRecognitionPipeline(
transcription = pipe("/content/Kibuuka_eng.mp3")
```

The resulting dictionary `transcription` will contain a `text` key with all the transcribed text, as well as a `chunks` key containing the individual text segments along with their timestamps, in the format below:

```python
{
@@ -165,7 +167,7 @@ import librosa
SAMPLE_RATE = 16000

def load_audio(file: str, sr: int = SAMPLE_RATE) -> np.ndarray:

try:
# librosa automatically resamples to the given sample rate (if necessary)
# and converts the signal to mono (by averaging channels)
@@ -175,6 +177,7 @@ def load_audio(file: str, sr: int = SAMPLE_RATE) -> np.ndarray:

return audio
```

The `load_audio` function takes an audio file and a sampling rate as parameters. The sampling rate used for this speaker diarization is 16000; it should match the sampling rate used when transcribing the audio with Sunbird MMS, to ensure consistency with the output.

**Diarization Pipeline**
@@ -183,7 +186,6 @@ The class `Diarization Pipeline` is a custom class created to facilitate the dia

It returns a pandas DataFrame with columns for the segment, label, speaker, start time, and end time of each speaker segment.


```python
class DiarizationPipeline:
def __init__(
@@ -242,7 +244,7 @@ The function iterates through segments of a transcript and assigns the speaker l
If there is no overlap, the `fill_nearest` parameter can be set to `True`; the function will then assign speakers to segments by finding the closest speaker in time.

The function takes parameters:

`diarize_df`: a pandas DataFrame returned by the `DiarizationPipeline`, containing the diarization information with columns like `start`, `end` and `speaker`.

`transcript_result`: A dictionary with a key `chunks` that contains a list of transcript `Segments` obtained from the ASR pipeline.
@@ -264,7 +266,7 @@ The function takes parameters:
```python

def assign_word_speakers(diarize_df, transcript_result, fill_nearest=False):

transcript_segments = transcript_result["chunks"]

for seg in transcript_segments:
@@ -288,6 +290,7 @@ def assign_word_speakers(diarize_df, transcript_result, fill_nearest=False):
```
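At its core, this assignment is an interval-overlap computation between each transcript chunk and each diarization row. A pandas-free sketch of that logic (illustrative only; `assign_speaker` is a simplified stand-in for the full function above, without the `fill_nearest` fallback):

```python
def assign_speaker(chunk, diarize_rows):
    """Pick the diarization speaker with maximum time overlap with the chunk.

    chunk: {"timestamp": (start, end), ...} as produced by the ASR pipeline
    diarize_rows: list of {"start": float, "end": float, "speaker": str}
    Returns None when no row overlaps the chunk.
    """
    c_start, c_end = chunk["timestamp"]
    best_speaker, best_overlap = None, 0.0
    for row in diarize_rows:
        # Length of the intersection of [c_start, c_end] and [row.start, row.end]
        overlap = min(c_end, row["end"]) - max(c_start, row["start"])
        if overlap > best_overlap:
            best_speaker, best_overlap = row["speaker"], overlap
    return best_speaker

rows = [
    {"start": 0.0, "end": 5.0, "speaker": "SPEAKER_00"},
    {"start": 5.0, "end": 10.0, "speaker": "SPEAKER_01"},
]
print(assign_speaker({"timestamp": (4.0, 7.0)}, rows))  # SPEAKER_01
```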

**Running the diarization model**

```python
diarize_model = DiarizationPipeline(use_auth_token=hf_token, device=device)
diarize_segments = diarize_model("/content/Kibuuka_eng.mp3", min_speakers=1, max_speakers=2)
@@ -445,4 +448,4 @@ output
{'text': 'you', 'timestamp': (45.48, 45.54), 'speaker': 'SPEAKER_01'},
{'text': 'are', 'timestamp': (45.56, 45.62), 'speaker': 'SPEAKER_01'},
{'text': 'married', 'timestamp': (45.68, 45.92), 'speaker': 'SPEAKER_01'}]}
```
