End-to-end speech recognition using RNN Transducers in Tensorflow 2.0

nichongjia-2007/rnnt-speech-recognition

RNN-Transducer Speech Recognition

End-to-end speech recognition using RNN-Transducer in Tensorflow 2.0

Overview

This speech recognition model is based on Google's research paper "Streaming End-to-end Speech Recognition For Mobile Devices" and is implemented in Python 3 using Tensorflow 2.0.

Usage

The main script of the repository is run_rnnt.py. Everything is run through it; you only need to specify the parameters for the task at hand.

Here is a list of all parameters:

python run_rnnt.py --help

       USAGE: run_rnnt.py [flags]
flags:

run_rnnt.py:

  --batch_size: Batch size.
    (default: '64')
    (an integer)

  --checkpoint: Path of checkpoint to load (default to latest in 'model_dir')

  --dataset_name: <common-voice>: Dataset to use.

  --dataset_path: Dataset path.
  
  --encoder_layers: Number of encoder layers.
    (default: '8')
    (an integer)

  --encoder_size: Units per encoder layer.
    (default: '2048')
    (an integer)

  --epochs: Number of training epochs.
    (default: '20')
    (an integer)

  --eval_size: Eval size.
    (default: '1000')
    (an integer)

  --input: Input file.

  --joint_net_size: Joint network units.
    (default: '640')
    (an integer)

  --keep_top: Maximum checkpoints to keep.
    (default: '5')
    (an integer)

  --learning_rate: Training learning rate.
    (default: '0.0001')
    (a number)

  --max_data: Max size of data.
    (an integer)

  --mode: <train|eval|transcribe-file>: Mode to run in.

  --model_dir: Model output directory.
    (default: './model')

  --pred_net_layers: Number of prediction network layers.
    (default: '2')
    (an integer)

  --shuffle_buffer_size: Shuffle buffer size.
    (an integer)

  --softmax_size: Units in softmax layer.
    (default: '4096')
    (an integer)

  --steps_per_checkpoint: Number of steps between each checkpoint.
    (default: '1000')
    (an integer)
    
  --steps_per_log: Number of steps between each log written.
    (default: '100')
    (an integer)

  --tb_log_dir: Tensorboard log directory.
    (default: './logs')

  --tpu: GCP TPU to use.
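
When several of the flags above need to be set at once, it can be convenient to assemble the command line programmatically. The following is a minimal sketch, not part of the repository: `build_command` is a hypothetical helper, the dataset path is a placeholder, and the only names taken from the README are the flag names in the help listing.

```python
import shlex


def build_command(mode, **flags):
    """Return an argv list for run_rnnt.py (hypothetical helper).

    Flag names are passed through unchanged, so they should match
    the names shown by `python run_rnnt.py --help`.
    """
    argv = ["python", "run_rnnt.py", "--mode", mode]
    for name, value in flags.items():
        if value is None:
            continue  # omit the flag so the script uses its default
        argv += ["--" + name, str(value)]
    return argv


cmd = build_command(
    "train",
    dataset_name="common-voice",
    dataset_path="/data/common-voice",  # placeholder path, adjust to your setup
    batch_size=32,
    learning_rate=None,  # skipped, so the default applies
)

# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root if you prefer to launch the script from Python.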

Getting Started

NOTE: If you are not training using Docker, you must run the following commands and also set up the loss function (instructions for this can be found in warp-transducer/tensorflow_binding).

To set up your environment, run the following commands:

git clone --recurse-submodules https://github.com/noahchalifour/rnnt-speech-recognition.git
cd rnnt-speech-recognition
pip install tensorflow==2.1.0 # or tensorflow-gpu==2.1.0 for GPU support
pip install -r requirements.txt

Once your environment is set up, you are ready to start training your own models.

Supported Datasets

Currently we only support the Common Voice dataset. We plan on adding support for other datasets in the future.

Training a Model

Training on Host

To train a simple model, run the following command:

python run_rnnt.py \
    --mode train \
    --dataset_name common-voice \
    --dataset_path <path to your dataset>

Training in Docker Container

You can also train your model in a Docker container based on the Tensorflow Docker image. To do so, run the following command:

NOTE: Specify all your parameters in ALL CAPS as environment variables when training in a Docker container.

docker run -d --name rnnt-speech-recognition \
    -e MODE=train \
    -e DATASET_NAME=common-voice \
    -e DATASET_PATH=<path to your dataset> \
    noahchalifour/rnnt-speech-recognition
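
The ALL CAPS convention from the note above can be sketched as a small mapping from flag names to `-e` arguments. This is an illustrative helper only: `docker_env_args` is a hypothetical name, the dataset path is a placeholder, and the sketch assumes every flag follows the same upper-casing rule as the three shown in the command. It does not deal with making the dataset visible inside the container.

```python
def docker_env_args(**flags):
    """Turn run_rnnt.py flag names into `-e NAME=value` pairs (hypothetical helper)."""
    args = []
    for name, value in flags.items():
        # The README asks for parameters in ALL CAPS as environment variables.
        args += ["-e", "{}={}".format(name.upper(), value)]
    return args


docker_cmd = (
    ["docker", "run", "-d", "--name", "rnnt-speech-recognition"]
    + docker_env_args(
        mode="train",
        dataset_name="common-voice",
        dataset_path="/data/common-voice",  # placeholder path
    )
    + ["noahchalifour/rnnt-speech-recognition"]
)

print(" ".join(docker_cmd))
```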

Evaluation

To run evaluation, use the following command:

python run_rnnt.py \
    --mode eval \
    --dataset_name common-voice \
    --dataset_path <path to your dataset>

Inference

Transcribing a WAV file

To transcribe a WAV file, run the following command:

python run_rnnt.py \
    --mode transcribe-file \
    --input <path to wav file>
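
Before passing a file to `--input`, it may help to check its basic properties. The README does not state which sample rate or channel layout the model expects, so the sketch below only reports what the file contains, using Python's standard `wave` module. `describe_wav` is a hypothetical helper, and the demo file it writes (one second of 16 kHz mono silence) exists purely so the example runs on its own.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return basic properties of a PCM WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_seconds": frames / rate,
        }


# Write a small demo file: one second of 16-bit mono silence at 16 kHz.
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 16000)

info = describe_wav(demo_path)
print(info)
```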

Real-time transcription

To run real-time transcription using your computer microphone, run the following command:

python run_rnnt.py \
    --mode realtime

Author

Noah Chalifour
