lcn-kul/xls-r-analysis-sqa


xls-r-analysis-sqa

1. Overview

This repository hosts the models for the paper "Analysis of XLS-R for Speech Quality Assessment".

1.1. Performance On Unseen Datasets

Comparison of model performance on each unseen corpus individually (NISQA, IUB) and combined together (Unseen). The metric is RMSE, lower is better.
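For readers who want to reproduce the metric on their own predictions, here is a minimal sketch of RMSE between predicted and ground-truth MOS values. The function name `rmse` and the sample numbers are illustrative assumptions, not part of this repository's code.

```python
import math


def rmse(predictions, targets):
    """Root-mean-square error between two equal-length sequences of MOS values."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only (not taken from the paper's evaluation sets).
predicted_mos = [2.0, 4.0]
true_mos = [1.0, 3.0]
print(rmse(predicted_mos, true_mos))  # every prediction is off by exactly 1.0
```

A lower value means the predictions sit closer to the human ratings on average, with large individual errors penalized more heavily than small ones.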

V1 Results
| Model | NISQA | IUB | Unseen |
|---|---|---|---|
| XLS-R 300M Layer24 Bi-LSTM [1] | 0.5907 | 0.5067 | 0.5323 |
| DNSMOS [2] | 0.8718 | 0.5452 | 0.6565 |
| MFCC Transformer | 0.8280 | 0.7775 | 0.7924 |
| XLS-R 300M Layer5 Transformer | 0.6256 | 0.5049 | 0.5425 |
| XLS-R 300M Layer21 Transformer | 0.5694 | 0.5025 | 0.5227 |
| XLS-R 300M Layer5+21 Transformer | 0.5683 | 0.4886 | 0.5129 |
| XLS-R 1B Layer10 Transformer | 0.5456 | 0.5815 | 0.5713 |
| XLS-R 1B Layer41 Transformer | 0.5657 | 0.4656 | 0.4966 |
| XLS-R 1B Layer10+41 Transformer | 0.5748 | 0.5288 | 0.5425 |
| XLS-R 2B Layer10 Transformer | 0.6277 | 0.4899 | 0.5334 |
| XLS-R 2B Layer41 Transformer | 0.5724 | 0.4897 | 0.5150 |
| XLS-R 2B Layer10+41 Transformer | 0.6036 | 0.4743 | 0.5150 |
| Human | 0.6738 | 0.6573 | 0.6629 |

V2 Results

UPDATE: The code now uses version 2 of the models. Version 1 mistakenly used the final model checkpoint; version 2 uses the checkpoint with the minimum validation loss.
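The difference between the two selection rules can be sketched as follows. This is a simplified illustration, not the repository's training code: the helper `select_best_checkpoint` and the loss values are hypothetical.

```python
def select_best_checkpoint(val_losses):
    """Return the index of the epoch with the lowest validation loss.

    On ties, the earliest epoch wins.
    """
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    return min(range(len(val_losses)), key=val_losses.__getitem__)


# Hypothetical per-epoch validation losses: the model starts to overfit after epoch 2.
val_losses = [0.42, 0.35, 0.31, 0.33, 0.36]

final_epoch = len(val_losses) - 1                  # what version 1 effectively used
best_epoch = select_best_checkpoint(val_losses)    # what version 2 uses

print(final_epoch, best_epoch)
```

With these made-up losses the final checkpoint is epoch 4, while the minimum-validation-loss checkpoint is epoch 2.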

| Model | NISQA | IUB | Unseen |
|---|---|---|---|
| XLS-R 300M Layer24 Bi-LSTM [1] | 0.5907 | 0.5067 | 0.5323 |
| DNSMOS [2] | 0.8718 | 0.5452 | 0.6565 |
| MFCC Transformer | 0.9291 | 0.7415 | 0.8003 |
| XLS-R 300M Layer5 Transformer | 0.6494 | 0.5117 | 0.5550 |
| XLS-R 300M Layer21 Transformer | 0.5852 | 0.4838 | 0.5152 |
| XLS-R 300M Layer5+21 Transformer | 0.5861 | 0.4768 | 0.5108 |
| XLS-R 1B Layer10 Transformer | 0.6217 | 0.4763 | 0.5225 |
| XLS-R 1B Layer41 Transformer | 0.5615 | 0.4646 | 0.4946 |
| XLS-R 1B Layer10+41 Transformer | 0.6024 | 0.4624 | 0.5068 |
| XLS-R 2B Layer10 Transformer | 0.5227 | 0.4447 | 0.4686 |
| XLS-R 2B Layer41 Transformer | 0.5295 | 0.4926 | 0.5035 |
| XLS-R 2B Layer10+41 Transformer | 0.5191 | 0.4573 | 0.4760 |
| Human | 0.6738 | 0.6573 | 0.6629 |

[1] B. Tamm, H. Balabin, R. Vandenberghe and H. Van hamme, "Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications," Proc. Interspeech 2022, 2022, pp. 4083-4087, doi: 10.21437/Interspeech.2022-10147.

[2] C. K. A. Reddy, V. Gopal and R. Cutler, "DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 2021, pp. 6493-6497, doi: 10.1109/ICASSP39728.2021.9414878.

1.2. Visualization of MOS Predictions

MOS predictions on two unseen datasets: NISQA (top) and IU Bloomington (bottom). Our proposed model based on embeddings extracted from the 10th layer of the pre-trained XLS-R 2B outperforms DNSMOS and the MFCC baseline. The human ACRs are also visualized for the IUB corpus.

1.3. Example Audio Segments

🔊 Excellent (MOS = 4.808)

Audio sample: iub-excellent.mp4

| Model | Prediction | Error |
|---|---|---|
| DNSMOS | 3.699 | -1.109 |
| MFCC Transformer | 3.497 | -1.311 |
| XLS-R 2B Layer10 Transformer | 3.935 | -0.873 |

🔊 Good (MOS = 4.104)

Audio sample: iub-good.mp4

| Model | Prediction | Error |
|---|---|---|
| DNSMOS | 3.269 | -0.835 |
| MFCC Transformer | 2.498 | -1.606 |
| XLS-R 2B Layer10 Transformer | 3.793 | -0.311 |

🔊 Fair (MOS = 3.168)

Audio sample: iub-fair.mp4

| Model | Prediction | Error |
|---|---|---|
| DNSMOS | 3.309 | +0.141 |
| MFCC Transformer | 3.931 | +0.763 |
| XLS-R 2B Layer10 Transformer | 3.080 | -0.088 |

🔊 Poor (MOS = 2.240)

Audio sample: iub-poor.mp4

| Model | Prediction | Error |
|---|---|---|
| DNSMOS | 2.704 | +0.464 |
| MFCC Transformer | 1.927 | -0.313 |
| XLS-R 2B Layer10 Transformer | 2.284 | +0.044 |

🔊 Bad (MOS = 1.416)

Audio sample: iub-bad.mp4

| Model | Prediction | Error |
|---|---|---|
| DNSMOS | 2.553 | +1.137 |
| MFCC Transformer | 1.806 | +0.390 |
| XLS-R 2B Layer10 Transformer | 2.312 | +0.896 |
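
The Error column in the examples above is the model prediction minus the human MOS, so a positive value means the model overestimated quality. A small sketch using the "Bad" sample's numbers (the helper name `signed_error` is an assumption made for illustration):

```python
def signed_error(prediction, mos):
    """Format prediction minus ground-truth MOS with an explicit sign and 3 decimals."""
    return f"{prediction - mos:+.3f}"


# Values copied from the "Bad (MOS = 1.416)" example.
mos = 1.416
predictions = {
    "DNSMOS": 2.553,
    "MFCC Transformer": 1.806,
    "XLS-R 2B Layer10 Transformer": 2.312,
}

for model_name, prediction in predictions.items():
    print(model_name, signed_error(prediction, mos))
```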

2. Installation

Option A: Install via pip (Recommended)

pip install xls-r-sqa

Option B: Install From Source

First, clone the repository.

git clone https://github.com/lcn-kul/xls-r-analysis-sqa.git

Next, install the requirements to a virtual environment of your choice.

cd xls-r-analysis-sqa/
pip3 install -r requirements.txt

3. Truncated XLS-R Models

This code uses truncated XLS-R models. By default, the code will attempt to auto-download the required truncated XLS-R model from Hugging Face whenever you create an E2EModel that uses XLS-R. For example:

from xls_r_sqa.config import XLSR_2B_TRANSFORMER_32DEEP_CONFIG
from xls_r_sqa.e2e_model import E2EModel

model = E2EModel(
    config=XLSR_2B_TRANSFORMER_32DEEP_CONFIG,
    xlsr_layers=10,
    auto_download=True  # <-- default is True
)

If you do not wish to auto-download, or if you would like to choose your own save location, there are two manual approaches:

  1. Download Truncated Models: Clone the truncated XLS-R repositories from Hugging Face (using Git LFS). Follow [these instructions] in xls_r_sqa/models/xls-r-trunc/README.md.

  2. Truncate Full XLS-R Yourself: Download the full pre-trained XLS-R models (see [these instructions] in xls_r_sqa/models/xls-r/README.md) and then run truncate_w2v2.py to create the truncated versions locally.

Warning: The combined size of all truncated XLS-R repos is approximately 15 GB (plus .git overhead, effectively doubling the storage needed). Make sure you have sufficient disk space before downloading or truncating them yourself.
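
Before cloning or truncating, it may be worth checking free disk space programmatically. The sketch below uses only the Python standard library. The 30 GB threshold is an assumption derived from the warning above (about 15 GB of models, roughly doubled by .git overhead), and the function name is hypothetical.

```python
import shutil

# Assumed requirement: ~15 GB of truncated models, doubled for .git overhead.
REQUIRED_GB = 30


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """Return True if the filesystem containing `path` has at least `required_gb` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3


if not has_enough_space("."):
    print(f"Warning: less than {REQUIRED_GB} GB free in the current directory.")
```

Point `path` at the directory where the models will be stored if it lives on a different drive than the working directory.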

4. Usage

A working example is provided in test_e2e_sqa.py.

5. Citation

@INPROCEEDINGS{10248049,
  author={Tamm, Bastiaan and Vandenberghe, Rik and Van Hamme, Hugo},
  booktitle={2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)}, 
  title={Analysis of XLS-R for Speech Quality Assessment}, 
  year={2023},
  volume={},
  number={},
  pages={1-5},
  doi={10.1109/WASPAA58266.2023.10248049}
}
