This repo is the official PyTorch implementation of 'VINP: Variational Bayesian Inference with Neural Speech Prior for Joint ASR-Effective Speech Dereverberation and Blind RIR Identification', which has been submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP).
Please see `requirements.txt`.
Step 1. Prepare clean source speech and noise recordings in .wav or .flac format.
Step 2. Prepare reverberant and direct-path RIRs:
python dataset/gen_rir.py -c [config/config_gen_rir.json]
Step 3. Save the lists of file paths for the source speech, simulated RIRs (.npz), and noise to .txt files:
python dataset/gen_fpath_txt.py -i [dirpath] -o [.txt filepath] -e [filename extension]
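The exact simulator behind `gen_rir.py` is configured via its JSON file and is not reproduced here. As a rough conceptual sketch only (a toy model, not the repo's implementation), a reverberant RIR can be approximated as exponentially decaying noise, with the direct-path RIR keeping only the first few milliseconds, saved together in the .npz format that Step 3 lists:

```python
import numpy as np

def make_rir_pair(t60=0.5, fs=16000, length=8000, seed=0):
    """Toy RIR pair: exponentially decaying noise (reverberant)
    and its truncated direct-path counterpart."""
    rng = np.random.default_rng(seed)
    t = np.arange(length) / fs
    decay = np.exp(-6.908 * t / t60)           # amplitude envelope: 60 dB energy drop at t60
    rir = rng.standard_normal(length) * decay  # reverberant RIR
    direct = np.zeros(length)
    n_direct = int(0.0025 * fs)                # keep the first 2.5 ms as the direct path
    direct[:n_direct] = rir[:n_direct]
    return rir, direct

rir, direct = make_rir_pair()
np.savez("toy_rir.npz", rir=rir, direct=direct)  # .npz, as expected by Step 3
```

The decay constant 6.908 is 3·ln(10), so the energy of the envelope falls by 60 dB at t = t60.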
Prepare the official single-channel test sets of the REVERB Challenge dataset.
Step 1. Prepare the RIRs from the 'Single' subfolder of the ACE Challenge dataset.
Step 2. Downsample the RIRs to 16 kHz:
python dataset/gen_16kHz_ACE_RIR.py -i [ACE 'Single' dirpath] -o [saved dirpath]
Step 3. Save the lists of file paths for the source speech, ACE RIRs, and noise to .txt files:
python dataset/gen_fpath_txt.py -i [dirpath] -o [.txt filepath] -e [filename extension]
Step 4. Generate the test set (consisting of reverberant speech and labels):
python dataset/gen_SimACE_testset.py --[keyword] [arg]
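The keyword arguments of `gen_SimACE_testset.py` are not listed above. Conceptually, each simulated test utterance is the source speech convolved with a reverberant RIR plus noise at a chosen SNR, while the label is the source convolved with the direct-path RIR. A minimal sketch of that simulation (a hypothetical helper, not the repo's script):

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_pair(source, rir, direct_rir, noise, snr_db=10.0):
    """Reverberant mixture and its direct-path label for one utterance."""
    reverb = fftconvolve(source, rir)[: len(source)]        # reverberant speech
    label = fftconvolve(source, direct_rir)[: len(source)]  # direct-path label
    noise = noise[: len(source)]
    # scale the noise so the mixture has the requested SNR w.r.t. the reverberant speech
    snr = 10 ** (snr_db / 10)
    noise = noise * np.sqrt(np.sum(reverb**2) / (snr * np.sum(noise**2) + 1e-12))
    return reverb + noise, label

fs = 16000
src = np.random.default_rng(1).standard_normal(fs)   # 1 s of surrogate "speech"
rir = np.exp(-6.908 * np.arange(4000) / (0.5 * fs))  # toy decaying RIR
mix, label = simulate_pair(src, rir, rir[:40], src[::-1].copy(), snr_db=10.0)
```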
Step 1. Edit the config file (for example, config/config_VINP_oSpatialNet.toml or config/config_VINP_TCNSAS.toml).
Step 2. Run one of the following:
# train from scratch
torchrun --standalone --nnodes=1 --nproc_per_node=[number of GPUs] train.py -c [config filepath] -p [saved dirpath]
# resume training
torchrun --standalone --nnodes=1 --nproc_per_node=[number of GPUs] train.py -c [config filepath] -p [saved dirpath] -r
# use pretrained checkpoints
torchrun --standalone --nnodes=1 --nproc_per_node=[number of GPUs] train.py -c [config filepath] -p [saved dirpath] --start_ckpt [pretrained model filepath]
Run
python enhance_rir_avg.py -c [config filepath] --ckpt [list of checkpoints] -i [reverberant speech dirpath] -o [output dirpath] -d [GPU id]
Evaluation results are saved to the output folder.
For SimData, run
bash eval/eval_all.sh -i [speech dirpath] -r [reference dirpath]
For RealData, the reference is not available. Run
bash eval/eval_all.sh -i [speech dirpath]
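The metrics computed by `eval_all.sh` are not enumerated above. For reference, one widely used intrusive dereverberation metric (whether the script includes it is an assumption) is scale-invariant SDR, which can be computed as:

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB between an estimate and a reference signal."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # project the estimate onto the reference to obtain the target component
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    residual = estimate - target
    return 10 * np.log10(np.sum(target**2) / np.sum(residual**2))
```

Because of the projection, rescaling the estimate leaves the score unchanged, which is the point of the "scale-invariant" variant.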
For SimData, run
python eval/eval_ASR_REVERB_SimData.py -i [speech dirpath] -m [Whisper model name: tiny, small, or medium]
For RealData, run
python eval/eval_ASR_REVERB_RealData.py -i [speech dirpath] -m [Whisper model name: tiny, small, or medium]
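The standard score for such ASR evaluations is word error rate (WER): word-level edit distance divided by the reference length. A minimal self-contained sketch (not the repo's scoring code, which may apply additional text normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match or substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```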
Step 1. Estimate RT60 and DRR from the estimated RIRs:
python estimate_T60_DRR.py -i [estimated RIR dirpath]
Step 2. Compare against the references:
python eval/eval_T60_or_DRR.py -o [estimated RT60 or DRR .json] -r [reference RT60 or DRR .json]
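The method used by `estimate_T60_DRR.py` is not specified above. A common approach (assumed here for illustration, not necessarily the repo's) is Schroeder backward integration for RT60 and an early/late energy split around the direct-path peak for DRR:

```python
import numpy as np

def estimate_t60_drr(rir, fs=16000, direct_ms=2.5):
    """RT60 from the Schroeder energy decay curve (line fit on the -5..-25 dB range),
    DRR as direct-window vs late energy around the strongest peak."""
    energy = rir**2
    edc = np.cumsum(energy[::-1])[::-1]                 # Schroeder backward integration
    edc_db = 10 * np.log10(edc / edc[0] + 1e-12)
    # fit a line between -5 dB and -25 dB, extrapolate the decay to -60 dB
    idx = np.where((edc_db <= -5) & (edc_db >= -25))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)     # dB per second
    t60 = -60.0 / slope
    # DRR: energy in a short window around the peak vs everything after it
    peak = int(np.argmax(np.abs(rir)))
    k = int(direct_ms * 1e-3 * fs)
    direct_e = np.sum(energy[max(0, peak - k): peak + k])
    late_e = np.sum(energy[peak + k:]) + 1e-12
    return t60, 10 * np.log10(direct_e / late_e)
```

On a synthetic exponentially decaying RIR with a known T60, the fitted estimate lands close to the ground truth, which is how the sketch can be sanity-checked.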
If you find our work helpful, please cite:
@misc{wang2025vinpvariationalbayesianinference,
title={VINP: Variational Bayesian Inference with Neural Speech Prior for Joint ASR-Effective Speech Dereverberation and Blind RIR Identification},
author={Pengyu Wang and Ying Fang and Xiaofei Li},
year={2025},
eprint={2502.07205},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2502.07205},
}