RNN-Transducer Speech Recognition

End-to-end speech recognition using RNN-Transducer in TensorFlow 2.0

Overview

This speech recognition model is based on Google's "Streaming End-to-end Speech Recognition For Mobile Devices" research paper and is implemented in Python 3 using TensorFlow 2.0.

Usage

The main script of the repository is run_rnnt.py. Everything runs through it; you control what it does by specifying parameters.

Here is a list of all parameters:

python run_rnnt.py --help

       USAGE: run_rnnt.py [flags]
flags:

run_rnnt.py:

  --batch_size: Batch size.
    (default: '64')
    (an integer)

  --checkpoint: Path of checkpoint to load (default to latest in 'model_dir')

  --dataset_name: <common-voice>: Dataset to use.

  --dataset_path: Dataset path.
  
  --encoder_layers: Number of encoder layers.
    (default: '8')
    (an integer)

  --encoder_size: Units per encoder layer.
    (default: '2048')
    (an integer)

  --epochs: Number of training epochs.
    (default: '20')
    (an integer)

  --eval_size: Eval size.
    (default: '1000')
    (an integer)

  --input: Input file.

  --joint_net_size: Joint network units.
    (default: '640')
    (an integer)

  --keep_top: Maximum checkpoints to keep.
    (default: '5')
    (an integer)

  --learning_rate: Training learning rate.
    (default: '0.0001')
    (a number)

  --max_data: Max size of data.
    (an integer)

  --mode: <train|eval|transcribe-file>: Mode to run in.

  --model_dir: Model output directory.
    (default: './model')

  --pred_net_layers: Number of prediction network layers.
    (default: '2')
    (an integer)

  --shuffle_buffer_size: Shuffle buffer size.
    (an integer)

  --softmax_size: Units in softmax layer.
    (default: '4096')
    (an integer)

  --steps_per_checkpoint: Number of steps between each checkpoint.
    (default: '1000')
    (an integer)
    
  --steps_per_log: Number of steps between each log written.
    (default: '100')
    (an integer)

  --tb_log_dir: Tensorboard log directory.
    (default: './logs')

  --tpu: GCP TPU to use.
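
If you launch runs from Python rather than typing the command by hand, the flags above can be assembled programmatically. The following is a minimal sketch, not part of the repository: `build_command` is a hypothetical helper, and the dataset path is a placeholder. It only uses flag names that appear in the help output.

```python
import shlex


def build_command(mode, script="run_rnnt.py", **flags):
    """Build an argv list for run_rnnt.py (hypothetical helper).

    Keyword names are used verbatim as flag names, so they should match
    the names printed by `python run_rnnt.py --help`.
    """
    argv = ["python", script, "--mode", mode]
    for name, value in sorted(flags.items()):
        if value is None:
            continue  # leave the flag unset so the script's default applies
        argv += [f"--{name}", str(value)]
    return argv


cmd = build_command(
    "train",
    dataset_name="common-voice",
    dataset_path="/data/common-voice",  # placeholder path, replace with yours
    batch_size=32,
    checkpoint=None,  # skipped: falls back to the latest in model_dir
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.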

Getting Started

NOTE: If you are not training with Docker, you must run the commands below and also set up the loss function (instructions for this can be found in warp-transducer/tensorflow_binding).

To set up your environment, run the following commands:

git clone --recurse-submodules https://github.com/noahchalifour/rnnt-speech-recognition.git
cd rnnt-speech-recognition
pip install tensorflow==2.1.0 # or tensorflow-gpu==2.1.0 for GPU support
pip install -r requirements.txt

Once your environment is set up, you are ready to start training your own models.

Supported Datasets

Currently we only support the Common Voice dataset. We plan on adding support for other datasets in the future.

Training a Model

Training on Host

To train a simple model, run the following command:

python run_rnnt.py \
    --mode train \
    --dataset_name common-voice \
    --dataset_path <path to your dataset>

Training in Docker Container


You can also train your model in a Docker container based on the TensorFlow Docker image. To do so, run the following command:

NOTE: When training in a Docker container, specify all your parameters as environment variables with the names in ALL CAPS.

docker run -d --name rnnt-speech-recognition \
    -e MODE=train \
    -e DATASET_NAME=common-voice \
    -e DATASET_PATH=<path to your dataset> \
    noahchalifour/rnnt-speech-recognition
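
The ALL CAPS rule above can be applied mechanically. The sketch below is a hypothetical helper, not part of the repository; it assumes every flag maps to the upper-cased flag name, as the note describes, and uses `BATCH_SIZE` only as an illustration of that rule.

```python
def docker_env_args(params):
    """Turn {flag_name: value} into `-e NAME=value` arguments for docker run.

    Assumes each flag name maps to its upper-cased form, per the note above.
    """
    args = []
    for name, value in params.items():
        args += ["-e", f"{name.upper()}={value}"]
    return args


params = {"mode": "train", "dataset_name": "common-voice", "batch_size": 32}
cmd = [
    "docker", "run", "-d", "--name", "rnnt-speech-recognition",
    *docker_env_args(params),
    "noahchalifour/rnnt-speech-recognition",
]
print(" ".join(cmd))
```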

Evaluation

To run evaluation, use the following command:

python run_rnnt.py \
    --mode eval \
    --dataset_name common-voice \
    --dataset_path <path to your dataset>

Inference

Transcribing a WAV file

To transcribe a WAV file, run the following command:

python run_rnnt.py \
    --mode transcribe-file \
    --input <path to wav file>
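
Before transcribing, it can help to check what a WAV file actually contains. This README does not state the sample rate or channel count the model expects, so the sketch below only reports the header values using Python's standard `wave` module. `describe_wav` is a hypothetical helper, and the 16 kHz mono demo file is an arbitrary example, not a requirement of this project.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return basic header information for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }


# Demo: write one second of 16-bit mono silence, then inspect it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)       # 2 bytes per sample = 16-bit PCM
    out.setframerate(16000)   # arbitrary demo value
    out.writeframes(b"\x00\x00" * 16000)

info = describe_wav(demo_path)
print(info)
```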

Real-time transcription

To run real-time transcription using your computer microphone, run the following command:

python run_rnnt.py \
    --mode realtime

Author

Noah Chalifour, [email protected]
