Based on an issue from the openWakeWord repo, I'm exploring why the initial pronunciation of certain single words is incorrect for most combinations of speaker IDs and noise scale values (though some seem fine).
It seems to be due to this line, where the phonemes for the target text have the "^" phoneme (id 1) prepended. I experimented by instead prepending the "^_" phoneme sequence (ids [1, 0]) when the target text is only a single word, and the produced speech then sounds correct.
This is fairly odd behavior, as the pronunciation of these same words is correct when they are part of a multi-word text sequence. I could theorize that since most TTS datasets are trained on sentences and not single words this is an example of unexpected out-of-domain behavior, but ultimately I'm not sure.
Does using the [1, 0] phoneme sequence seem like a viable workaround? Might it cause any detrimental side effects that I'm not considering?
I think you are right: this is a bug. All phoneme symbols need to be accompanied by the padding symbol (id 0). The symbol above is the so-called BOS (beginning-of-sentence) symbol. Looking at the code, the EOS (end-of-sentence) symbol ($) should also be padded with a 0, but isn't.
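A minimal sketch of the interleaving described above, assuming the common VITS-style convention where the padding symbol follows every phoneme id. The id values for "_" (0), "^" (1), and "$" (2) and the helper name are assumptions for illustration, not taken from the actual repo:

```python
PAD_ID = 0  # "_" padding symbol
BOS_ID = 1  # "^" begin-of-sentence symbol
EOS_ID = 2  # "$" end-of-sentence symbol (id assumed for illustration)

def intersperse_with_pad(phoneme_ids):
    """Interleave the padding id after BOS, each phoneme, and EOS.

    For input [p1, p2] this yields [1, 0, p1, 0, p2, 0, 2, 0], so that
    every symbol, including BOS and EOS, is followed by the pad id.
    """
    out = [BOS_ID, PAD_ID]
    for pid in phoneme_ids:
        out.extend([pid, PAD_ID])
    out.extend([EOS_ID, PAD_ID])
    return out

print(intersperse_with_pad([15, 27, 42]))
# [1, 0, 15, 0, 27, 0, 42, 0, 2, 0]
```

With this shape, a single-word input still has its BOS symbol followed by the pad id, matching the [1, 0] workaround described above.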