
Conformer: Convolution-augmented Transformer for Speech Recognition


This repository provides an implementation of the paper Conformer: Convolution-augmented Transformer for Speech Recognition. It includes training scripts with support for distributed GPU training using Lightning AI, as well as a web app for inference built with Gradio and a CTC decoder backed by KenLM.

📄 Paper and Blog References

  • Conformer: Convolution-augmented Transformer for Speech Recognition (https://arxiv.org/abs/2005.08100)


Installation

1. Clone the Repository

git clone https://github.com/LuluW8071/Conformer.git
cd Conformer

2. Install Dependencies

Before installing dependencies, ensure the following are installed:

  • CUDA Toolkit (for GPU training)
  • PyTorch (CPU or GPU version)
  • SoX, along with build tools and compression libraries:
    sudo apt update
    sudo apt install sox libsox-fmt-all build-essential zlib1g-dev libbz2-dev liblzma-dev

Install the remaining dependencies:

pip install -r requirements.txt
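
Optionally, verify that PyTorch can see your GPU before training. This small check script is ours, not part of the repo:

# check_gpu.py -- optional sanity check (not part of this repo)
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))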

Usage

Audio Preprocessing

1. Common Voice Conversion

To preprocess the Common Voice dataset:

python3 common_voice.py \
    --file_path /path/to/validated.tsv \
    --save_json_path converted_clips \
    -w 4 \
    --percent 10
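
The manifest schema (field names, how --percent splits the data) is defined by common_voice.py; as a quick sanity check you can load whatever it wrote and inspect the first entry. A generic sketch, assuming the output is a single JSON document:

# inspect_manifest.py -- generic sanity check; the schema is defined by common_voice.py
import json

# path follows --save_json_path above; if the file turns out to be
# JSON-lines, read it line by line instead of using json.load
with open("converted_clips/train.json") as f:
    data = json.load(f)

print("entries:", len(data))
print("first entry:", data[0])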

2. Personal Recordings

To record your own voice, use Mimic Record Studio, then prepare the recordings for training:

python3 mimic_record.py \
    --input_file /path/to/transcript.txt \
    --output_dir /path/to/save \
    --percent 20 \
    --upsample 5  # Duplicate 5 times in train json only

Note: The --upsample flag duplicates entries in the train JSON only, to increase the effective sample size of the personal recordings.
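
Conceptually, this upsampling is plain duplication of manifest entries; a minimal sketch of the idea (not the repo's code):

# each training entry appears `factor` times in the upsampled manifest
def upsample(entries, factor):
    return [entry for entry in entries for _ in range(factor)]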

3. Merge JSON Files

Combine personal recordings and datasets into a single JSON file:

python3 merge_jsons.py personal/train.json converted_clips/train.json \
    --output merged_train.json

Perform the same operation for the validation JSON files, as shown below.
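
For example (file names are illustrative; use whatever your preprocessing steps actually produced):

python3 merge_jsons.py personal/valid.json converted_clips/valid.json \
    --output merged_valid.json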


Training

Before starting, add your Comet ML API key and project name to the .env file.
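
A minimal .env might look like the following; the exact variable names depend on how train.py reads them, so treat these as assumptions:

# .env (variable names are assumptions; check train.py and your Comet ML setup)
COMET_API_KEY=your_api_key_here
COMET_PROJECT_NAME=your_project_name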

To train the Conformer model:

python3 train.py \
    -g 4 \                    # Number of GPUs
    -w 8 \                    # Number of CPU workers
    --epochs 100 \            # Number of epochs
    --batch_size 32 \         # Batch size
    -lr 4e-5 \                # Learning rate
    --precision 16-mixed \    # Mixed precision training
    --checkpoint_path /path/to/checkpoint.ckpt  # Optional: Resume from a checkpoint

Exporting the Model

To serialize the model so it can run in optimized C++ runtimes, export the PyTorch model with TorchScript:

python3 torchscript.py \
    --model_checkpoint /path/to/checkpoint.ckpt \
    --save_path /path/to/optimized_model.pt
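
The exported file can then be loaded without any of the Lightning training code. A minimal sketch using the standard TorchScript API (the forward-pass signature depends on the repo's feature pipeline, so only loading and inspection are shown):

# load_scripted.py -- sketch of loading the TorchScript export
import torch

model = torch.jit.load("/path/to/optimized_model.pt")
model.eval()
print(model)  # inspect the scripted module structure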

Inference

Gradio Demo

python3 gradio_demo.py \
    --model_path /path/to/optimized_model.pt \
    --share     # Optional: To share the Gradio app publicly

Web Flask Demo

python3 app.py \
    --model_path /path/to/optimized_model.pt

See the notebook for inference examples.


Experiment Details

Datasets

| Dataset | Usage | Duration (Hours) | Description |
|---|---|---|---|
| Mozilla Common Voice 7.0 + Personal Recordings | Training | ~1855 + 20 | Crowd-sourced and personal audio recordings |
| Mozilla Common Voice 7.0 + Personal Recordings | Validation | ~161 + 2 | Validation split (8%) |
| LibriSpeech | Training | ~960 | Train-clean-100, Train-clean-360, Train-other-500 |
| LibriSpeech | Validation | ~10.5 | Test-clean, Test-other |

Results

Loss Curves

Loss curves for LibriSpeech and for Mozilla Corpus + Personal Recordings.

Word Error Rate (WER)

Note

The model trained on the Mozilla Corpus shows a slightly higher WER than the one trained on LibriSpeech; however, the Mozilla validation set is roughly 15 times larger than the LibriSpeech validation set.

| Dataset | WER (%) | Model Link |
|---|---|---|
| LibriSpeech | 22.94 | 🔗 |
| Mozilla Corpus | 25.29 | 🔗 |

Expected WER with CTC + KenLM decoding: ~15%.
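
For reference, KenLM-backed CTC beam search can be run with torchaudio's bundled decoder. A sketch under stated assumptions: the lexicon, tokens, and LM paths are placeholders, the emission shape is made up, and this is not necessarily the decoder wired into this repo:

# ctc_kenlm_sketch.py -- illustrative CTC + KenLM beam search via torchaudio
import torch
from torchaudio.models.decoder import ctc_decoder

decoder = ctc_decoder(
    lexicon="lexicon.txt",   # word -> token spelling map (placeholder path)
    tokens="tokens.txt",     # acoustic model output symbols (placeholder path)
    lm="kenlm.arpa",         # KenLM language model (placeholder path)
    beam_size=50,
    lm_weight=2.0,
)

# emissions: (batch, time, num_tokens) log-probabilities from the acoustic
# model; num_tokens (32 here) must match the size of the token set
emissions = torch.randn(1, 100, 32).log_softmax(dim=-1)
best = decoder(emissions)[0][0]  # top hypothesis for the first utterance
print(" ".join(best.words))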


Citation

@misc{gulati2020conformer,
      title={Conformer: Convolution-augmented Transformer for Speech Recognition},
      author={Anmol Gulati and James Qin and Chung-Cheng Chiu and Niki Parmar and Yu Zhang and Jiahui Yu and Wei Han and Shibo Wang and Zhengdong Zhang and Yonghui Wu and Ruoming Pang},
      year={2020},
      eprint={2005.08100},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2005.08100},
}