This repository hosts the models for the paper "Analysis of XLS-R for Speech Quality Assessment".
Comparison of model performance on each unseen corpus individually (NISQA, IUB) and on both corpora combined (Unseen). The metric is RMSE; lower is better.
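As a reminder of what the metric measures, here is a minimal sketch of RMSE computed per corpus and on the combined set. The toy numbers are hypothetical and not from the paper, and treating "Unseen" as the pooled samples of both corpora is an assumption about how that column was computed.

```python
import math

def rmse(predictions, targets):
    """Root-mean-square error between predicted and ground-truth MOS values."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical toy data (not from the paper): (predictions, MOS labels) per corpus.
nisqa = ([3.0, 2.0, 4.5], [3.5, 2.5, 4.0])
iub = ([1.5, 4.0], [1.0, 4.0])

rmse_nisqa = rmse(*nisqa)
rmse_iub = rmse(*iub)
# Assumption: "Unseen" pools the samples of both corpora before computing RMSE.
rmse_unseen = rmse(nisqa[0] + iub[0], nisqa[1] + iub[1])

print(f"NISQA: {rmse_nisqa:.4f}, IUB: {rmse_iub:.4f}, Unseen: {rmse_unseen:.4f}")
```

Note that a pooled RMSE is weighted by corpus size, so it is generally not the plain average of the two per-corpus values.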
V1 Results
Model | NISQA | IUB | Unseen |
---|---|---|---|
XLS-R 300M Layer24 Bi-LSTM [1] | 0.5907 | 0.5067 | 0.5323 |
DNSMOS [2] | 0.8718 | 0.5452 | 0.6565 |
MFCC Transformer | 0.8280 | 0.7775 | 0.7924 |
XLS-R 300M Layer5 Transformer | 0.6256 | 0.5049 | 0.5425 |
XLS-R 300M Layer21 Transformer | 0.5694 | 0.5025 | 0.5227 |
XLS-R 300M Layer5+21 Transformer | 0.5683 | 0.4886 | 0.5129 |
XLS-R 1B Layer10 Transformer | 0.5456 | 0.5815 | 0.5713 |
XLS-R 1B Layer41 Transformer | 0.5657 | 0.4656 | 0.4966 |
XLS-R 1B Layer10+41 Transformer | 0.5748 | 0.5288 | 0.5425 |
XLS-R 2B Layer10 Transformer | 0.6277 | 0.4899 | 0.5334 |
XLS-R 2B Layer41 Transformer | 0.5724 | 0.4897 | 0.5150 |
XLS-R 2B Layer10+41 Transformer | 0.6036 | 0.4743 | 0.5150 |
Human | 0.6738 | 0.6573 | 0.6629 |
V2 Results
UPDATE: The code has been updated to use version 2 of the models. Version 1 mistakenly used the final model checkpoint; version 2 uses the checkpoint with the minimum validation loss.
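The difference between the two versions can be illustrated with a small sketch. The function name and the loss values below are hypothetical and are not taken from the repository's training code.

```python
def select_checkpoint(val_losses):
    """Return the index of the epoch with the lowest validation loss."""
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    # On ties, min() keeps the earliest epoch.
    return min(range(len(val_losses)), key=val_losses.__getitem__)

# Hypothetical validation loss per epoch (illustrative values only).
val_losses = [0.62, 0.48, 0.41, 0.44, 0.47]

final_epoch = len(val_losses) - 1           # the kind of checkpoint version 1 used
best_epoch = select_checkpoint(val_losses)  # the kind of checkpoint version 2 uses

print(f"final epoch: {final_epoch}, best epoch: {best_epoch}")
```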
Model | NISQA | IUB | Unseen |
---|---|---|---|
XLS-R 300M Layer24 Bi-LSTM [1] | 0.5907 | 0.5067 | 0.5323 |
DNSMOS [2] | 0.8718 | 0.5452 | 0.6565 |
MFCC Transformer | 0.9291 | 0.7415 | 0.8003 |
XLS-R 300M Layer5 Transformer | 0.6494 | 0.5117 | 0.5550 |
XLS-R 300M Layer21 Transformer | 0.5852 | 0.4838 | 0.5152 |
XLS-R 300M Layer5+21 Transformer | 0.5861 | 0.4768 | 0.5108 |
XLS-R 1B Layer10 Transformer | 0.6217 | 0.4763 | 0.5225 |
XLS-R 1B Layer41 Transformer | 0.5615 | 0.4646 | 0.4946 |
XLS-R 1B Layer10+41 Transformer | 0.6024 | 0.4624 | 0.5068 |
XLS-R 2B Layer10 Transformer | 0.5227 | 0.4447 | 0.4686 |
XLS-R 2B Layer41 Transformer | 0.5295 | 0.4926 | 0.5035 |
XLS-R 2B Layer10+41 Transformer | 0.5191 | 0.4573 | 0.4760 |
Human | 0.6738 | 0.6573 | 0.6629 |
[1] B. Tamm, H. Balabin, R. Vandenberghe and H. Van hamme, "Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications," Proc. Interspeech 2022, 2022, pp. 4083-4087, doi: 10.21437/Interspeech.2022-10147.
[2] C. K. A. Reddy, V. Gopal and R. Cutler, "DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 2021, pp. 6493-6497, doi: 10.1109/ICASSP39728.2021.9414878.
MOS predictions on two unseen datasets: NISQA (top) and IU Bloomington (bottom). Our proposed model based on embeddings extracted from the 10th layer of the pre-trained XLS-R 2B outperforms DNSMOS and the MFCC baseline. The human ACRs are also visualized for the IUB corpus.
Excellent (MOS = 4.808)
Audio Sample | Model | Prediction | Error |
---|---|---|---|
iub-excellent.mp4 | DNSMOS | 3.699 | -1.109 |
iub-excellent.mp4 | MFCC Transformer | 3.497 | -1.311 |
iub-excellent.mp4 | XLS-R 2B Layer10 Transformer | 3.935 | -0.873 |
Good (MOS = 4.104)
Audio Sample | Model | Prediction | Error |
---|---|---|---|
iub-good.mp4 | DNSMOS | 3.269 | -0.835 |
iub-good.mp4 | MFCC Transformer | 2.498 | -1.606 |
iub-good.mp4 | XLS-R 2B Layer10 Transformer | 3.793 | -0.311 |
Fair (MOS = 3.168)
Audio Sample | Model | Prediction | Error |
---|---|---|---|
iub-fair.mp4 | DNSMOS | 3.309 | +0.141 |
iub-fair.mp4 | MFCC Transformer | 3.931 | +0.763 |
iub-fair.mp4 | XLS-R 2B Layer10 Transformer | 3.080 | -0.088 |
Poor (MOS = 2.240)
Audio Sample | Model | Prediction | Error |
---|---|---|---|
iub-poor.mp4 | DNSMOS | 2.704 | +0.464 |
iub-poor.mp4 | MFCC Transformer | 1.927 | -0.313 |
iub-poor.mp4 | XLS-R 2B Layer10 Transformer | 2.284 | +0.044 |
Bad (MOS = 1.416)
Audio Sample | Model | Prediction | Error |
---|---|---|---|
iub-bad.mp4 | DNSMOS | 2.553 | +1.137 |
iub-bad.mp4 | MFCC Transformer | 1.806 | +0.390 |
iub-bad.mp4 | XLS-R 2B Layer10 Transformer | 2.312 | +0.896 |
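The Error column in the sample tables is consistent with the prediction minus the human MOS, so a negative value means the model underestimates quality. A minimal sketch that reproduces a few of the listed values (the helper name is chosen here for illustration):

```python
def signed_error(prediction, mos):
    """Signed error: positive means the model overestimates the human MOS."""
    return prediction - mos

# (model, prediction, human MOS) copied from the "Excellent" and "Fair" samples.
rows = [
    ("DNSMOS", 3.699, 4.808),
    ("XLS-R 2B Layer10 Transformer", 3.935, 4.808),
    ("DNSMOS", 3.309, 3.168),
]

# Format with an explicit sign and three decimals, as in the tables.
formatted = [f"{signed_error(pred, mos):+.3f}" for _, pred, mos in rows]
for (model, _, _), err in zip(rows, formatted):
    print(f"{model}: {err}")
```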
pip install xls-r-sqa
To install from source instead, first clone the repository.
git clone https://github.com/lcn-kul/xls-r-analysis-sqa.git
Next, install the requirements to a virtual environment of your choice.
cd xls-r-analysis-sqa/
pip3 install -r requirements.txt
This code uses truncated XLS-R models. By default, the code will attempt to auto-download the required truncated XLS-R model from Hugging Face whenever you create an E2EModel that uses XLS-R. For example:
from xls_r_sqa.config import XLSR_2B_TRANSFORMER_32DEEP_CONFIG
from xls_r_sqa.e2e_model import E2EModel

model = E2EModel(
    config=XLSR_2B_TRANSFORMER_32DEEP_CONFIG,
    xlsr_layers=10,
    auto_download=True,  # default is True
)
If you do not wish to auto-download, or if you would like to choose your own save location, there are two manual approaches:
- Download Truncated Models: Clone the truncated XLS-R repositories from Hugging Face (using Git LFS). Follow the instructions in xls_r_sqa/models/xls-r-trunc/README.md.
- Truncate Full XLS-R Yourself: Download the full pre-trained XLS-R models (see the instructions in xls_r_sqa/models/xls-r/README.md), then run truncate_w2v2.py to create the truncated versions locally.
Warning: The combined size of all truncated XLS-R repos is approximately 15 GB (plus .git overhead, which effectively doubles the storage needed). Make sure you have sufficient disk space before downloading or truncating them yourself.
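One way to check free disk space before downloading is sketched below using only the Python standard library. The 30 GB threshold is derived from the estimate in the warning (15 GB, doubled for .git overhead). The helper name is hypothetical and not part of this repository.

```python
import shutil

GB = 10 ** 9
# ~15 GB of truncated models, roughly doubled by .git overhead (see the warning).
REQUIRED_BYTES = 2 * 15 * GB

def has_enough_space(path=".", required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem containing `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes

if __name__ == "__main__":
    free_gb = shutil.disk_usage(".").free / GB
    print(f"Free space: {free_gb:.1f} GB, enough: {has_enough_space('.')}")
```

Pass the directory where the models will be stored as `path` if it is on a different filesystem than the current working directory.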
A working example is provided in test_e2e_sqa.py.
@INPROCEEDINGS{10248049,
author={Tamm, Bastiaan and Vandenberghe, Rik and Van Hamme, Hugo},
booktitle={2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
title={Analysis of XLS-R for Speech Quality Assessment},
year={2023},
volume={},
number={},
pages={1-5},
doi={10.1109/WASPAA58266.2023.10248049}
}