Speech Testing GUI
- Ubuntu 14.04 Trusty Tahr
- ROS Indigo with additional packages installed
- Install the `hlpr_speech` required dependencies
- `hlpr_speech` cloned to your workspace's `src/` directory
- Run `catkin_make` in your workspace root and then `source devel/setup.bash` (if using the default Bash shell)
- With a `roscore` started, run `rosrun rqt_speech_testing rqt_speech_testing`
  - If the GUI fails to start, you may have to run `rqt --force-discover`
To load and recognize speech from a `.wav` file, click the open button or enter the path to the file in the text field and press Enter. The `.wav` file will then be loaded and keyword recognition will run on it. The recognized speech output will be displayed in a tree format in the output window towards the bottom.
Audio can also be loaded from a folder of `.wav` files for easy comparison of a dataset before and after tuning the underlying speech recognition system that the Speech Testing GUI interfaces with (the `hlpr_speech_recognition` Python module).
In order for pocketsphinx to properly recognize the speech in your audio files, the `.wav` files must be exported at a sample rate of 16,000 Hz in 16-bit signed PCM format. If you're looking for a good audio editor and recording program that can do this easily, check out Audacity.
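Before loading a dataset, it can help to confirm that each file meets these requirements. The following is a minimal sketch using Python's standard `wave` module; the function name `check_wav_format` is made up for this example and is not part of `hlpr_speech_recognition`.

```python
import contextlib
import sys
import wave


def check_wav_format(path, expected_rate=16000, expected_sample_width=2):
    """Return a list of format problems for a .wav file (empty if none found).

    expected_sample_width is in bytes, so 2 means 16-bit samples.
    16-bit PCM samples in WAV files are signed by convention.
    """
    problems = []
    # The wave module only reads uncompressed PCM and raises wave.Error
    # for other encodings, so opening the file is itself a PCM check.
    with contextlib.closing(wave.open(path, "rb")) as wav:
        if wav.getframerate() != expected_rate:
            problems.append(
                "sample rate is %d Hz, expected %d Hz"
                % (wav.getframerate(), expected_rate)
            )
        if wav.getsampwidth() != expected_sample_width:
            problems.append(
                "sample width is %d bit, expected %d bit"
                % (wav.getsampwidth() * 8, expected_sample_width * 8)
            )
    return problems


if __name__ == "__main__":
    # Usage: python check_wav.py file1.wav file2.wav ...
    for wav_path in sys.argv[1:]:
        issues = check_wav_format(wav_path)
        print("%s: %s" % (wav_path, "OK" if not issues else "; ".join(issues)))
```

Files that report a problem can be re-exported from an audio editor with the settings described above.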
- Full path to the recognized file or `Recording` if live audio is being recognized (see below)
- ISO 8601 timestamp of when recognition occurred and the recognized text
  - Note that the recognized text will always be in uppercase because that's how the keywords are defined within the `hlpr_speech_recognition/data/` directory
  - Audio that contains speech or other sounds that can't be matched with sufficient confidence will return the recognized text string `UNKNOWN`
Audio data can also be recorded live from the Kinect or another microphone on your computer. First, ensure that it is the default input device and then start the Speech Testing GUI. Begin recording audio for recognition by clicking the "Record" button. A new entry will be added to the output window labeled "Recording". Any audio recognized will then be output in the same format described above. To stop audio recording, click the "Record" button again.
At the bottom of the window are two buttons for acting on the generated output. The first, "Clear output", empties the output view. The second exports the output view in JSON format, which makes it easy to diff results after tuning and lets other scripts, systems, or applications parse them.
For example, this is the exported JSON of the first screenshot:
    [
      {
        "name": "/home/petschekr/Music/HLPR-Speech Test/Close your hand.wav",
        "recognizedText": [
          {
            "timestamp": "2017-06-06 10:07:40",
            "text": "CLOSE YOUR HAND"
          }
        ]
      }
    ]
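As a rough illustration of comparing exports before and after tuning, the sketch below reads two exported files and reports entries whose recognized text changed. It relies only on the `name`, `recognizedText`, and `text` fields shown in the example above; the file names `before.json` and `after.json` and both helper functions are assumptions made for this example.

```python
import json


def texts_by_name(entries):
    """Map each entry name to the list of recognized text strings.

    Entries that share a name (for example several "Recording" entries)
    are merged in the order they appear in the export.
    """
    result = {}
    for entry in entries:
        texts = [item["text"] for item in entry["recognizedText"]]
        result.setdefault(entry["name"], []).extend(texts)
    return result


def diff_exports(before, after):
    """Return {name: (old_texts, new_texts)} for names whose texts differ.

    A name missing from one export shows up with None on that side.
    Timestamps are ignored because they change on every run.
    """
    changes = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes


if __name__ == "__main__":
    # Hypothetical file names for two exports of the same dataset.
    with open("before.json") as f:
        before = texts_by_name(json.load(f))
    with open("after.json") as f:
        after = texts_by_name(json.load(f))
    for name, (old, new) in diff_exports(before, after).items():
        print("%s: %s -> %s" % (name, old, new))
```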
- Speech recognition is handled by the `SpeechRecognizer` class within the `hlpr_speech_recognition` module and not by this GUI directly. To tweak the performance of pocketsphinx, you'll want to head there.
  - The pull request that adds this GUI also brought improvements to how `SpeechRecognizer` matches keyphrases. It now applies the recognition threshold relative to the other matches that the engine returns. Previously, recognition results were only accepted if their probability was over 100% because the minimum probability (returned by the engine as the log10 of the actual probability) was set to 0 and 10^0 = 1.00. The absolute minimum threshold has now been set to -1500, which seems to be sufficient to reject background noise but not speech.
- Check the output from the terminal in which you started the Speech Testing GUI for additional, verbose information about what is going on during speech recognition
- When the recognition is restarted, its input arguments are printed
- When matching a keyphrase, the engine will list possible matches in a list of tuples, e.g. `[('CLOSE YOUR HAND', -758, 3, 74)]`, where the tuples are in the format `(phrase, log10 probability, start frame of match, end frame of match)`
- When matching a keyphrase, the engine will also print whether it detected a keyphrase with sufficient confidence, or whether it could not find a match or wasn't confident enough in the result and returned `UNKNOWN`
- Weights were added to the keyphrase list to improve accuracy, but these require further tuning. See `hlpr_speech_recognition/data/kps.txt`.
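To make the match tuples and thresholds in these notes more concrete, here is a small sketch of how a result could be chosen from such a list. It is not the actual `SpeechRecognizer` logic: the function `pick_keyphrase` and the `relative_margin` parameter are hypothetical, and only the tuple layout, the -1500 minimum, and the `UNKNOWN` string come from the notes above.

```python
ABSOLUTE_MIN_SCORE = -1500  # absolute minimum threshold mentioned in the notes
UNKNOWN = "UNKNOWN"


def pick_keyphrase(matches, min_score=ABSOLUTE_MIN_SCORE, relative_margin=0):
    """Choose a phrase from (phrase, score, start_frame, end_frame) tuples.

    relative_margin is a hypothetical knob: the best match must lead the
    runner-up by at least this much, otherwise the result is UNKNOWN.
    """
    if not matches:
        return UNKNOWN
    # Higher (less negative) scores mean more confident matches.
    ranked = sorted(matches, key=lambda match: match[1], reverse=True)
    best_phrase, best_score = ranked[0][0], ranked[0][1]
    if best_score < min_score:
        return UNKNOWN  # too unlikely in absolute terms, e.g. background noise
    if len(ranked) > 1 and best_score - ranked[1][1] < relative_margin:
        return UNKNOWN  # not clearly better than the next candidate
    return best_phrase


if __name__ == "__main__":
    print(pick_keyphrase([("CLOSE YOUR HAND", -758, 3, 74)]))
```

Running the sketch against tuples copied from the terminal output is one way to see how a different minimum score or margin would change which phrases are accepted.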