This repository contains an implementation of an end-to-end automatic speech recognition (ASR) system built around the Monotonic Chunkwise Attention (MoChA) mechanism. The goal of this project is to provide a complete pipeline that transcribes speech audio into text.
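MoChA (Chiu and Raffel, 2018) constrains the attention head to move monotonically over the encoder states: at each output step it scans forward, picks a stopping frame, and then attends softly over a small fixed-width chunk ending at that frame, which keeps attention computation local and enables online (streaming) decoding. As an illustration only (this is not code from this repository, and all names are made up), here is a minimal NumPy sketch of one hard, inference-time MoChA step:

```python
import numpy as np

def hard_mocha_step(mono_energy, chunk_energy, w, t_prev):
    """One hard (inference-time) MoChA step.

    mono_energy, chunk_energy: (T,) attention energies for the current output step
    w: chunk width; t_prev: frame selected at the previous output step.
    Returns (attention_weights, selected_frame).
    """
    p_select = 1.0 / (1.0 + np.exp(-mono_energy))   # sigmoid selection probabilities
    for t in range(t_prev, len(mono_energy)):       # monotonic: never move backwards
        if p_select[t] >= 0.5:                      # first frame that "fires"
            lo = max(0, t - w + 1)                  # chunk of up to w frames ending at t
            e = np.exp(chunk_energy[lo:t + 1] - chunk_energy[lo:t + 1].max())
            weights = np.zeros_like(chunk_energy)
            weights[lo:t + 1] = e / e.sum()         # softmax restricted to the chunk
            return weights, t                       # context = weights @ encoder_states
    return None, t_prev                             # no frame selected this step

rng = np.random.default_rng(0)
weights, t = hard_mocha_step(rng.normal(size=50), rng.normal(size=50), w=4, t_prev=0)
```

During training, MoChA replaces this hard selection with expected attention weights computed in closed form, so the whole model remains differentiable.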
Before getting started, make sure you have the following prerequisites installed:
- Python 3.9
- TensorFlow 2
- NumPy
- librosa
- tqdm
To install the required packages, run:

```bash
pip install -r requirements.txt
```
- Download your desired speech recognition dataset (e.g., LibriSpeech, Common Voice) and organize it into the `data/raw_data` directory, with separate subdirectories for the train, dev, and test sets (see the layout sketch after this list).
- Run the preprocessing script to convert the raw audio files into suitable features (e.g., MFCCs or log-mel filterbank energies) and store them in the `data/preprocessed_data` directory:

```bash
python src/data_utils.py
```
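With the layout described above, the raw data directory would look something like this:

```
data/raw_data/
├── train/
├── dev/
└── test/
```

Feature extraction of this kind is typically done with librosa; the following is a hedged sketch of what `data_utils.py` plausibly does per utterance. The sample rate, window sizes, mel-band count, and file names are common ASR defaults assumed for illustration, not values taken from this codebase:

```python
import librosa
import numpy as np

# Load one utterance; 16 kHz is a common ASR sample rate (assumed, not confirmed).
y, sr = librosa.load("data/raw_data/train/example.wav", sr=16000)

# 80-band log-mel filterbank energies with 25 ms windows and a 10 ms hop.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80)
log_mel = librosa.power_to_db(mel)

# Alternatively, 13 MFCCs per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Store features as (frames, features) for the model to consume.
np.save("data/preprocessed_data/train/example.npy", log_mel.T)
```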
To train the MoChA ASR model, run:

```bash
python src/train.py
```

This trains the model on the preprocessed data and saves the best-performing model in the `models/saved_models` directory.
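In TensorFlow 2, the "keep only the best model" behavior is typically implemented with a `tf.keras.callbacks.ModelCheckpoint`. A minimal sketch of that pattern, not necessarily how `train.py` does it; the monitored metric and checkpoint file name are assumptions:

```python
import tensorflow as tf

# Keep only the checkpoint with the best validation loss (metric name assumed).
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath="models/saved_models/best_model",  # directory named in this README
    monitor="val_loss",
    mode="min",
    save_best_only=True,
    save_weights_only=True,
)
# Passed to model.fit(..., callbacks=[checkpoint_cb]) during training.
```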
To evaluate the trained model on the test dataset, run:

```bash
python src/evaluate.py
```

This loads the best-performing model from `models/saved_models` and calculates performance metrics (e.g., Word Error Rate) on the test dataset.
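Word Error Rate is the word-level Levenshtein (edit) distance between the reference and the hypothesis, normalized by the number of reference words: WER = (substitutions + deletions + insertions) / N. A self-contained example of the metric (illustrative, not code taken from `evaluate.py`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # substitution / match
                d[i - 1][j] + 1,                               # deletion
                d[i][j - 1] + 1,                               # insertion
            )
    return d[-1][-1] / max(len(ref), 1)

# One substitution ("sat" -> "sit") and one deletion ("the"): 2 / 6 ≈ 0.33.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```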
After training and evaluating the model, you can use it for your own speech recognition tasks by importing the `MoChAASR` class from `src/model.py` and loading the trained weights:

```python
from src.model import MoChAASR

# Build the model and restore the trained weights saved by src/train.py.
model = MoChAASR()
model.load_weights('path/to/saved/model/weights')

# Transcribe a single audio file to text.
transcription = model.transcribe('path/to/audio/file')
print(transcription)
```
This project is licensed under the MIT License. See the LICENSE file for more details.