Goal: train a model to detect when an audio segment contains "um" or other filler words.
Project was started at Hack the North 2019.
- work done during Hack the North
- post-hackathon
Easiest method is probably to treat this as an audio classification task for every n seconds of audio. Since there's no actual long term time dependency when detecting a single word, a CNN should be good enough. The input to the model will be a spectrogram, generated by taking consecutive Fourier transforms over the audio segment.
note: spectrograms show the intensity of frequencies as they change over time
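A minimal sketch of the "consecutive Fourier transforms" idea, assuming NumPy and 16 kHz mono audio. The frame length, hop size and log scaling are illustrative choices, not the settings of tensorflow's pipeline or of this project.

```python
import numpy as np


def spectrogram(samples, frame_len=400, hop=160):
    """Log-magnitude spectrogram from consecutive short-time Fourier transforms.

    frame_len and hop are in samples; 400 and 160 correspond to 25 ms and
    10 ms at 16 kHz (assumed values, chosen only for illustration).
    """
    samples = np.asarray(samples, dtype=np.float64)
    if len(samples) < frame_len:
        raise ValueError("audio is shorter than one frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (len(samples) - frame_len) // hop
    # Cut the signal into overlapping frames and taper each one.
    frames = np.stack([
        samples[i * hop: i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies of a real signal.
    magnitudes = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitudes + 1e-6)  # shape: (time, frequency)


# Usage: a 2 second, 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(2 * sample_rate) / sample_rate
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
```

Each row of `spec` is one time step and each column one frequency bin, which is the 2D "image" a CNN can consume.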
Use the simple CNN described in http://www.isca-speech.org/archive/interspeech_2015/papers/i15_1478.pdf
Google already has a training system set up (https://www.tensorflow.org/tutorials/sequences/audio_recognition) so that might be easy enough to build on. The performance here might not be state of the art, but this is a hackathon, and I can build on top of this later after a proof of concept.
The requirement here is an unscripted, labeled dataset containing filler words such as "um". I've thought about using Google's or Microsoft's API to label some audio, but those models sometimes actively ignore filler words. That makes some sense as a design decision, but is entirely useless here.
SSPNet Vocalization Corpus (http://www.dcs.gla.ac.uk/vincia/?p=378) contains:
2763 audio clips (11 seconds each) containing at least one laughter or filler instance. Overall, the corpus involves 120 subjects (63 females and 57 males). The clips are extracted from phone calls where two fully unacquainted speakers try to solve the Winter Survival Task.
In this case, we're interested in the filler instances. After filtering for those, there are a total of 2988 instances of filler moments in this dataset. Not a whole lot to be training on, but this is the only dataset I was able to find that I could actually access.
I'm not including the laugh parts at all, so that the final dataset would be balanced-ish instead of having significantly more non-filler moments. It's still not totally balanced, but that's probably okay. Reflective of the actual testing distribution and whatever.
It's actually not at all balanced... note to self: try undersampling the not-ums if the initial model doesn't work
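A rough sketch of what undersampling could look like, assuming the examples are available as (clip_id, label) pairs. The label names "filler" and "not_filler" and the helper itself are hypothetical, not part of tensorflow's pipeline.

```python
import random


def undersample(examples, ratio=1.0, seed=0):
    """Randomly drop non-filler examples until at most ratio * (#filler) remain.

    examples: list of (clip_id, label) pairs; the label names "filler" and
    "not_filler" are hypothetical.
    """
    fillers = [e for e in examples if e[1] == "filler"]
    others = [e for e in examples if e[1] != "filler"]
    keep = min(len(others), int(ratio * len(fillers)))
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    balanced = fillers + rng.sample(others, keep)
    rng.shuffle(balanced)
    return balanced


# Usage: 10 filler clips against 90 non-filler clips.
examples = [(f"clip{i}", "filler") for i in range(10)]
examples += [(f"clip{i}", "not_filler") for i in range(10, 100)]
balanced = undersample(examples)
```

Only the training set should be undersampled like this; validation should keep the real distribution so the numbers stay meaningful.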
Other issue: they speak with British accents. That's likely to mess with the results.
alternative datasets: https://catalog.ldc.upenn.edu/LDC2005S16
Google's predefined model operates on 1 second segments, but in the dataset there are 81 instances where the filler segment exceeds 1 second. Only 10 instances contain more than 2 seconds of filler, so 2 seconds can be used as the filter.
A 2 second sliding window can be used to create many 2 second audio clips. They will be labeled according to whether or not they contain a filler word
question: what should I do for 2 second segments which contain only part of a filler word?
Decision: if the part inside the segment lasts longer than a second or is more than 90% of the original filler section, then it's a filler
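The sliding window and the decision above could be sketched like this. It reads the rule as applying to the overlap between a window and a filler annotation, and the 0.5 second hop is an assumption (the notes don't say how far the window moves each step).

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_windows(clip_len, fillers, window=2.0, hop=0.5):
    """Return (start, end, is_filler) for each sliding window over a clip.

    fillers: list of (start, end) filler annotations in seconds.
    A window counts as a filler if it shares more than 1 second with a
    filler annotation, or more than 90% of that annotation's duration.
    """
    if clip_len < window:
        return []
    n_windows = int((clip_len - window) / hop) + 1
    labeled = []
    for i in range(n_windows):
        start = i * hop
        end = start + window
        is_filler = False
        for f_start, f_end in fillers:
            shared = overlap(start, end, f_start, f_end)
            if shared > 1.0 or shared > 0.9 * (f_end - f_start):
                is_filler = True
                break
        labeled.append((start, end, is_filler))
    return labeled


# Usage: one 11 second clip with a single "um" from 3.2 s to 3.6 s.
windows = label_windows(11.0, [(3.2, 3.6)])
positives = [(s, e) for s, e, is_filler in windows if is_filler]
```

With these numbers the window starting at 1.5 s only catches 0.3 s of the 0.4 s filler, so it stays a non-filler, while the windows starting at 2.0, 2.5 and 3.0 s contain the whole thing.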
The validation and test sets should contain 2 types of instances:
- instances where the person speaking was not included in the training set
- instances where the specific filler instance was not included in the training set
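One way such a split could be built, as a sketch only. The dictionary keys "speaker" and "instance_id" and the 10% fractions are hypothetical, and this is not the quick hack that was actually used with tensorflow's pipeline.

```python
import random


def split_instances(instances, speaker_frac=0.1, instance_frac=0.1, seed=0):
    """Split filler instances into train and two kinds of held-out data.

    instances: list of dicts with hypothetical keys "speaker" and
    "instance_id". Returns (train, unseen_speaker, unseen_instance).
    """
    rng = random.Random(seed)
    speakers = sorted({x["speaker"] for x in instances})
    rng.shuffle(speakers)
    n_held = max(1, int(speaker_frac * len(speakers)))
    held_out_speakers = set(speakers[:n_held])

    train, unseen_speaker, unseen_instance = [], [], []
    for x in instances:
        if x["speaker"] in held_out_speakers:
            # Type 1: this speaker never appears in training.
            unseen_speaker.append(x)
        elif rng.random() < instance_frac:
            # Type 2: speaker may be known, but this instance is held out.
            unseen_instance.append(x)
        else:
            train.append(x)
    return train, unseen_speaker, unseen_instance


# Usage: 200 instances spread evenly over 20 speakers.
instances = [{"speaker": f"s{i % 20}", "instance_id": i} for i in range(200)]
train, unseen_speaker, unseen_instance = split_instances(instances)
```

The split is done per filler instance rather than per 2 second window, so that overlapping windows cut from the same instance can't end up on both sides.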
a quick hack was used to make tensorflow's training pipeline use my custom train val split
after running tensorflow's train script, the validation set is obviously very wrong.
potential reason: didn't configure tensorflow properly (something something bazel, something something ./configure)
However, the overall shape of the graph seems right. I've also tested with the provided testing options:
with label_wav.py:
left (score = 0.80921)
right (score = 0.12201)
_unknown_ (score = 0.04661)
with generate_streaming_test_wav
spent much too long trying to get this working. Many many hours spent on something that doesn't actually contribute to how this would work, leaving very little time to train my actual model
so at this point it's late at night and I realize that the checkpoint file cannot automatically accommodate the change in input and output vectors (ie. input is now 2000ms, output is now just a couple of classes)
Last minute retraining of the base model. Adjusted unknown_percentage to 50 since there's only a single other class. It starts right away with 50% accuracy, which makes sense as there are only 2 options. Converged somewhat at only 1.6k steps when I stopped it in order to train my actual um model. The lack of reliable validation data is really hurting right now, as there isn't really a way to check for overfitting and not enough time to write something up, since I still haven't finished figuring out how to get the model to predict on a real time audio stream
in conclusion: bad planning on my part
this incredibly messed up graph is the training result. The bit at the start was when I started, then restarted a training run, but couldn't be bothered to remove it. The base model trained for 1.5k steps before switching over to using the um dataset. Later on, I forgot to turn off the training run and I really hope it didn't overfit too badly, but again, validation isn't working. The fact that the model is converging so quickly might be worrying or it might just be because there are so few classes. And data. There isn't enough data.
However, it does seem to be converging to something, which is good! loss isn't going down.
This section is where I spent both not enough time and, at the same time, too much time. The planned demo involved a real time beep as the speaker says a filler word (i.e. "um"). The only inference that's really working is reading 2 second wav files and giving you a result, so that's what the demo will be. I'll add the ability to speak into the microphone and save a bunch of wav files into a folder, which would be sent for inference.
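The "save a bunch of wav files into a folder" step could look roughly like this, using only Python's standard wave module. It assumes the recording already exists as one wav file (capturing from the microphone is not shown), and the function and file names are made up for illustration.

```python
import tempfile
import wave
from pathlib import Path


def split_wav(source, out_dir, chunk_seconds=2.0):
    """Write consecutive chunk_seconds-long pieces of a wav file into out_dir.

    A trailing piece shorter than chunk_seconds is dropped, since the model
    expects fixed-length input.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    with wave.open(str(source), "rb") as src:
        frame_bytes = src.getsampwidth() * src.getnchannels()
        frames_per_chunk = int(chunk_seconds * src.getframerate())
        index = 0
        while True:
            data = src.readframes(frames_per_chunk)
            if len(data) < frames_per_chunk * frame_bytes:
                break  # end of file, or an incomplete last chunk
            path = out_dir / f"chunk_{index:04d}.wav"
            with wave.open(str(path), "wb") as dst:
                dst.setnchannels(src.getnchannels())
                dst.setsampwidth(src.getsampwidth())
                dst.setframerate(src.getframerate())
                dst.writeframes(data)
            paths.append(path)
            index += 1
    return paths


# Usage: 5 seconds of 16 kHz, 16-bit mono silence gives two full chunks.
with tempfile.TemporaryDirectory() as tmp:
    recording = Path(tmp) / "recording.wav"
    with wave.open(str(recording), "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(16000)
        w.writeframes(b"\x00\x00" * 16000 * 5)
    chunks = split_wav(recording, Path(tmp) / "chunks")
    chunk_names = [p.name for p in chunks]
```

Each file in the output folder can then be passed to whatever does inference on a single 2 second wav.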
- I have not implemented any data augmentation methods. I should try to do some.
- dataset: find a bigger, more varied dataset with many different accents
- CNN is fine for this use case since I'm only detecting specific sounds. If I want to incorporate other filler words that are also used in normal English, such as "like", I'll probably have to do something with recurrent networks or similar so that the model will be able to take that time dependency into account
- actually have a working demonstration with real time detection. Given the architecture, the model is theoretically fast enough even on my CPU; I just didn't have the time to implement it
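For the data augmentation item in the list above, a minimal sketch of two common waveform augmentations (random time shift and additive noise), assuming NumPy and float audio in [-1, 1]. The shift range and noise level are guesses, not tuned values.

```python
import numpy as np


def augment(samples, rng, max_shift=1600, noise_level=0.005):
    """Randomly shift a clip in time and add Gaussian noise.

    samples: 1D float array in [-1, 1]. max_shift is in samples
    (1600 is 100 ms at 16 kHz, an assumed value).
    """
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.roll(samples, shift)  # np.roll returns a copy
    # Silence the part that wrapped around so it doesn't leak across ends.
    if shift > 0:
        shifted[:shift] = 0.0
    elif shift < 0:
        shifted[shift:] = 0.0
    noise = rng.normal(0.0, noise_level, size=samples.shape)
    return np.clip(shifted + noise, -1.0, 1.0)


# Usage: augment a 2 second, 16 kHz test tone.
rng = np.random.default_rng(0)
clip = 0.5 * np.sin(2 * np.pi * 440 * np.arange(32000) / 16000)
augmented = augment(clip, rng)
```

Applying this on the fly during training gives a slightly different version of each clip every epoch, which should help a bit with how little data there is.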
The two most important things to fix:
- The training pipeline was messed up somewhere and was unable to show validation loss properly
- the dataset was small and only contained British English
Since I'm no longer constrained by the time limit of a hackathon, may as well just rewrite the whole thing without bothering with tensorflow's thing.
I'll still train on tensorflow's speech commands dataset first before training on "um"s. This time, I'll try to find a way to train the base model on the full dataset instead of just a single word. That means replacing the output layer (its shape changes) when transferring.
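What "replace the output layer" means in practice, as a framework-free sketch with NumPy. The weight names, layer shapes and the 12-class base model are all made up for illustration. A real implementation would do the same thing through whatever framework the rewrite ends up using.

```python
import numpy as np


def replace_output_layer(base_weights, n_classes, rng):
    """Copy every layer except the output layer, which is re-initialised.

    base_weights: dict of name -> array, with hypothetical keys
    "output/kernel" of shape (hidden, old_classes) and "output/bias".
    """
    hidden_size = base_weights["output/kernel"].shape[0]
    # Keep the feature-extracting layers learned on the base dataset.
    new_weights = {
        name: value.copy()
        for name, value in base_weights.items()
        if not name.startswith("output/")
    }
    # Fresh output layer sized for the new task.
    new_weights["output/kernel"] = rng.normal(
        0.0, 0.01, size=(hidden_size, n_classes)
    )
    new_weights["output/bias"] = np.zeros(n_classes)
    return new_weights


# Usage: a base model with 12 output classes becomes a 2-class um detector.
rng = np.random.default_rng(0)
base = {
    "conv1/kernel": rng.normal(size=(20, 8, 1, 64)),
    "output/kernel": rng.normal(size=(128, 12)),
    "output/bias": np.zeros(12),
}
um_model = replace_output_layer(base, n_classes=2, rng=rng)
```

Only the last layer starts from scratch, so the checkpoint mismatch from the hackathon (different output size) stops being a problem.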
in addition to the filler words dataset from before, new potential datasets:
- CallHome English corpus of telephone speech (https://ca.talkbank.org/access/CallHome/eng.html)
- Santa Barbara Corpus of Spoken American English (https://www.linguistics.ucsb.edu/research/santa-barbara-corpus)
- The Buckeye Speech Corpus (https://buckeyecorpus.osu.edu/) if the above isn't enough
- LDC Spoken Language Sampler (https://catalog.ldc.upenn.edu/LDC2017S16)