
Conformer: Convolution-augmented Transformer for Speech Recognition


This repository provides an implementation of the paper Conformer: Convolution-augmented Transformer for Speech Recognition. It includes training scripts with support for distributed GPU training using Lightning AI, as well as a web app for inference built with Gradio and a CTC decoder backed by KenLM.

📄 Paper and Blog References

  • Conformer: Convolution-augmented Transformer for Speech Recognition (https://arxiv.org/abs/2005.08100)


Installation

1. Clone the Repository

git clone https://github.com/LuluW8071/Conformer.git
cd Conformer

2. Install Dependencies

Before installing dependencies, ensure the following are installed:

  • CUDA Toolkit (for GPU training)
  • PyTorch (CPU or GPU version)
  • SoX, along with build tools and compression libraries:
    sudo apt update
    sudo apt install sox libsox-fmt-all build-essential zlib1g-dev libbz2-dev liblzma-dev

Install the remaining dependencies:

pip install -r requirements.txt
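
Optionally, verify that PyTorch can see your GPU before training. This small check script is ours, not part of the repo:

# check_gpu.py -- optional sanity check (not part of this repo)
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))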

Usage

Audio Preprocessing

1. Common Voice Conversion

To preprocess the Common Voice dataset:

python3 common_voice.py \
    --file_path /path/to/validated.tsv \
    --save_json_path converted_clips \
    -w 4 \
    --percent 10
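
The manifest schema (field names, how --percent splits the data) is defined by common_voice.py; as a quick sanity check you can load whatever it wrote and inspect the first entry. A generic sketch, assuming the output is a single JSON document:

# inspect_manifest.py -- generic sanity check; the schema is defined by common_voice.py
import json

# path follows --save_json_path above; if the file turns out to be
# JSON-lines, read it line by line instead of using json.load
with open("converted_clips/train.json") as f:
    data = json.load(f)

print("entries:", len(data))
print("first entry:", data[0])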

2. Personal Recordings

To record your own voice, use Mimic Record Studio, then prepare the recordings for training:

python3 mimic_record.py \
    --input_file /path/to/transcript.txt \
    --output_dir /path/to/save \
    --percent 20 \
    --upsample 5  # Duplicate 5 times in train json only

Note: The --upsample flag duplicates entries in the train JSON only, to increase the effective sample size of the personal recordings.
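
Conceptually, this upsampling is plain duplication of manifest entries; a minimal sketch of the idea (not the repo's code):

# each training entry appears `factor` times in the upsampled manifest
def upsample(entries, factor):
    return [entry for entry in entries for _ in range(factor)]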

3. Merge JSON Files

Combine personal recordings and datasets into a single JSON file:

python3 merge_jsons.py personal/train.json converted_clips/train.json \
    --output merged_train.json

Perform the same operation for the validation JSON files, as shown below.
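
For example (file names are illustrative; use whatever your preprocessing steps actually produced):

python3 merge_jsons.py personal/valid.json converted_clips/valid.json \
    --output merged_valid.json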


Training

Before starting, add your Comet ML API key and project name to the .env file.
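
A minimal .env might look like the following; the exact variable names depend on how train.py reads them, so treat these as assumptions:

# .env (variable names are assumptions; check train.py and your Comet ML setup)
COMET_API_KEY=your_api_key_here
COMET_PROJECT_NAME=your_project_name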

To train the Conformer model:

python3 train.py \
    -g 4 \                    # Number of GPUs
    -w 8 \                    # Number of CPU workers
    --epochs 100 \            # Number of epochs
    --batch_size 32 \         # Batch size
    -lr 4e-5 \                # Learning rate
    --precision 16-mixed \    # Mixed precision training
    --checkpoint_path /path/to/checkpoint.ckpt  # Optional: Resume from a checkpoint

Exporting the Model

To serialize the model so it can run in optimized C++ runtimes, export the PyTorch model with TorchScript:

python3 torchscript.py \
    --model_checkpoint /path/to/checkpoint.ckpt \
    --save_path /path/to/optimized_model.pt
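
The exported file can then be loaded without any of the Lightning training code. A minimal sketch using the standard TorchScript API (the forward-pass signature depends on the repo's feature pipeline, so only loading and inspection are shown):

# load_scripted.py -- sketch of loading the TorchScript export
import torch

model = torch.jit.load("/path/to/optimized_model.pt")
model.eval()
print(model)  # inspect the scripted module structure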

Inference

Gradio Demo

python3 gradio_demo.py \
    --model_path /path/to/optimized_model.pt \
    --share     # Optional: To share the Gradio app publicly

Web Flask Demo

python3 app.py \
    --model_path /path/to/optimized_model.pt

See the notebook for inference examples.


Experiment Details

Datasets

| Dataset | Usage | Duration (Hours) | Description |
|---|---|---|---|
| Mozilla Common Voice 7.0 + Personal Recordings | Training | ~1855 + 20 | Crowd-sourced and personal audio recordings |
| Mozilla Common Voice 7.0 + Personal Recordings | Validation | ~161 + 2 | Validation split (8%) |
| LibriSpeech | Training | ~960 | Train-clean-100, Train-clean-360, Train-other-500 |
| LibriSpeech | Validation | ~10.5 | Test-clean, Test-other |

Results

Loss Curves

Loss curves for LibriSpeech and for Mozilla Corpus + Personal Recordings.

Word Error Rate (WER)

Note

The model trained on the Mozilla Corpus shows a slightly higher WER than the one trained on LibriSpeech; however, the Mozilla validation set is roughly 15 times larger than the LibriSpeech validation set.

| Dataset | WER (%) | Model Link |
|---|---|---|
| LibriSpeech | 22.94 | 🔗 |
| Mozilla Corpus | 25.29 | 🔗 |

Expected WER with CTC + KenLM decoding: ~15%.
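
For reference, KenLM-backed CTC beam search can be run with torchaudio's bundled decoder. A sketch under stated assumptions: the lexicon, tokens, and LM paths are placeholders, the emission shape is made up, and this is not necessarily the decoder wired into this repo:

# ctc_kenlm_sketch.py -- illustrative CTC + KenLM beam search via torchaudio
import torch
from torchaudio.models.decoder import ctc_decoder

decoder = ctc_decoder(
    lexicon="lexicon.txt",   # word -> token spelling map (placeholder path)
    tokens="tokens.txt",     # acoustic model output symbols (placeholder path)
    lm="kenlm.arpa",         # KenLM language model (placeholder path)
    beam_size=50,
    lm_weight=2.0,
)

# emissions: (batch, time, num_tokens) log-probabilities from the acoustic
# model; num_tokens (32 here) must match the size of the token set
emissions = torch.randn(1, 100, 32).log_softmax(dim=-1)
best = decoder(emissions)[0][0]  # top hypothesis for the first utterance
print(" ".join(best.words))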


Citation

@misc{gulati2020conformer,
      title={Conformer: Convolution-augmented Transformer for Speech Recognition},
      author={Anmol Gulati and James Qin and Chung-Cheng Chiu and Niki Parmar and Yu Zhang and Jiahui Yu and Wei Han and Shibo Wang and Zhengdong Zhang and Yonghui Wu and Ruoming Pang},
      year={2020},
      eprint={2005.08100},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2005.08100},
}