Scripts for alignment of laboratory speech production data
- Kyle Gorman
- Michael Wagner
- FQRSC Nouvelle Chercheur NP-132516
- SSHRC Canada Research Chair 218503
- SSHRC Digging Into Data Challenge Grant 869-2009-0004
See included "LICENSE"
USAGE: ./align.py [OPTIONS] data_to_be_aligned/

Option              Function
-a                  Perform speaker adaptation,
                    w/ or w/o prior training
-d dictionary       Specify a dictionary file [default: dictionary.txt]
-h                  Display this message
-m                  List files containing
                    out-of-dictionary words
-n n                Number of training iterations [default: 4]
                    for each step of training
                    (NB: available only with -t)
-s samplerate (Hz)  Samplerate for models [default: 8000]
                    (NB: available only with -t)
-t training_data/   Perform model training
Forced alignment can be thought of as the process of finding the times at which individual sounds and words appear in an audio recording under the constraint that words in the recording follow the same order as they appear in the transcript. This is accomplished much in the same way as traditional speech recognition, but the problem is somewhat easier given the constraints on the "language model" imposed by the transcript.
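As a toy illustration of this constraint (a sketch only, not the procedure that HTK or align.py actually runs), suppose each audio frame has already been scored against each unit of the transcript. The function name force_align and the layout of the scores table are assumptions made for this example. Dynamic programming then finds the best segmentation in which units appear in transcript order and each unit covers at least one frame:

```python
def force_align(scores):
    """Toy forced alignment by dynamic programming.

    scores[t][s] is a (hypothetical) log-likelihood of frame t under
    transcript unit s, with units listed in transcript order.
    Returns the index of the first frame assigned to each unit.
    """
    n_frames, n_units = len(scores), len(scores[0])
    neg_inf = float("-inf")
    best = [[neg_inf] * n_units for _ in range(n_frames)]
    advanced = [[False] * n_units for _ in range(n_frames)]
    best[0][0] = scores[0][0]  # the path must begin in the first unit
    for t in range(1, n_frames):
        for s in range(n_units):
            stay = best[t - 1][s]
            move = best[t - 1][s - 1] if s > 0 else neg_inf
            if move > stay:
                best[t][s] = move + scores[t][s]
                advanced[t][s] = True  # unit s begins at frame t
            else:
                best[t][s] = stay + scores[t][s]
    # Trace back from the last frame of the last unit.
    starts = [0] * n_units
    s = n_units - 1
    for t in range(n_frames - 1, 0, -1):
        if advanced[t][s]:
            starts[s] = t
            s -= 1
    return starts
```

Because the transcript fixes the order of the units, the search only has to decide where each boundary falls, which is why the problem is easier than open-ended recognition.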
The primary use of forced alignment is to eliminate the need for human annotation of time-boundaries for acoustic events of interest. Perhaps you are interested in sound change: forced alignment can be used to locate individual vowels in a sociolinguistic interview for formant measurement. Perhaps you are interested in laboratory speech production: forced alignment can be used to locate the target word for pitch measurement.
Yes! If you have a few hours of high quality speech and associated word-level transcripts, Prosodylab-Aligner can induce a new acoustic model, then compute the best alignments for said data according to the acoustic model.
Forced alignment works well for audio from speakers of similar dialects with little background noise. Aligning data with considerable dialect variation, or speech embedded in noise or music, is much harder and remains at the edge of the current state of the art.
You can train your own acoustic models, using as much training data as possible, or try to reduce the noise in your test data before aligning.
The Hidden Markov Model Toolkit (HTK) is a set of programs for speech recognition and forced alignment. The HTK book describes how to train acoustic models and perform forced alignment. However, the procedure is rather complex and the error messages are cryptic. Prosodylab-Aligner essentially automates the HTK forced alignment workflow.
The Penn Forced Aligner (P2FA) provides forced alignment for American English using an acoustic model derived from audio of US Supreme Court oral arguments. Prosodylab-Aligner has a number of additional capabilities, most importantly acoustic model training, and it is possible in theory to use Prosodylab-Aligner to simulate P2FA.
The scripts require a version of Python no earlier than 2.6, a BASH-compatible shell located in /bin/sh, and curl. All these will be installed on recent Macintosh computers as well as most computers running Linux. The scripts included here also assume that HTK and SoX are installed on your system. While these scripts can also be made to work on Windows computers, it is non-trivial and not described here.
On Linux or similar POSIX-based systems, SoX can be obtained from the distribution-specific package manager (apt-get, yum, etc.), or can be compiled from source without too much difficulty.
On Mac OS X it may be obtained from Fink or DarwinPorts, though compiling by hand may be somewhat difficult. Fortunately, the SoX maintainers provide compiled binaries for Mac OS X. You can simply download these binaries from the following URL (click on the link after the text "Looking for the latest version?"):
http://sox.sourceforge.net
The zip file can be expanded by double-clicking on it. The resulting files must be placed in your $PATH. A simple way to do this is to navigate to the resulting directory, and issue the following command:
$ sudo mv rec play sox soxi /usr/local/bin
This will prompt for your password; type it in (it will not "echo", as ***), and hit Enter when you're done.
You can confirm that SoX is installed by issuing the following command in any directory:
$ sox --version
sox: SoX v14.3.2
Note that your version may be different: align.py has been tested with this version, but it should work with somewhat older versions as well as with newer ones for the foreseeable future.
You will need first to download HTK's source code.
Note that you will have to make an account and agree to their restricted distribution license. Once you obtain the "tarball", the following command (adjusting for version number) should unpack it:
$ tar -xvzf htk-3.4.1.tar.gz
Once you extract the application, navigate into the resulting directory:
$ cd htk
Edit the file configure, making the following changes:
- On line 5507, replace -m32 with -s (make 64-bit stripped binaries instead of 32-bit unstripped binaries)
- On line 6788, replace -O2 with -O3 (make fully optimized binaries)
This will produce smaller, faster binaries than otherwise. Then run the following commands
$ ./configure --disable-hslab --disable-hlmtools
...
$ make all
...
$ sudo make install
...
By default, no C compiler is installed on Mac OS X. There are a few quick ways to get one. You can get a full set of compilers by downloading Xcode from the Mac App Store. This package is really quite large and may take days(!) to download. A good alternative is to download the new Command Line Tools for Xcode package on the Mac App Store, which is much smaller. You will need a free registration to download either package.
Once that's taken care of, execute the following commands in the "htk/" directory you just navigated to:
$ ./configure --disable-hslab --disable-hlmtools
...
$ make all
...
$ sudo make install
...
You can confirm that HTK is installed by issuing the following command in any directory:
$ HCopy -V
HTK Version Information
Module Version Who Date : CVS Info
HCopy 3.4.1 CUED 12/03/09 : $Id: HCopy.c,v 1.1.1.1 2006/10/11 09:54:59 jal58 Exp $
...
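Both checks above can also be scripted. The following is a minimal sketch in Python 3 (so it is independent of align.py itself); the helper name missing_tools is an assumption made for this example, and it only tests whether each program can be found on $PATH, not whether it runs correctly:

```python
import shutil


def missing_tools(tools=("sox", "HCopy", "curl")):
    """Return the names in `tools` that cannot be found on $PATH."""
    return [name for name in tools if shutil.which(name) is None]
```

An empty list from missing_tools() suggests that SoX, HTK, and curl are all reachable from the shell.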
First, obtain an appropriate pronunciation dictionary. Since many of the intended users are American English speakers, I've provided a script (get_dict.sh) which will download the CMU pronunciation dictionary automatically.
$ ./get_dict.sh
Other dictionaries can be found online, or written in the CMU format for specific tasks. If you're working with RP speakers, CELEX might be a good one.
Imagine you simply want to align multiple audio files with their associated label files, in the following format:
$ file data/myexp_1_1_1.*
data/myexp_1_1_1.lab: ASCII text
data/myexp_1_1_1.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 22050 Hz
$ cat data/myexp_1_1_1.lab
BARACK OBAMA WAS TALKING ABOUT HOW THERE'S A MISUNDERSTANDING THAT ONE MINORITY GROUP CAN'T GET ALONG WITH ANOTHER SUCH AS AFRICAN AMERICANS AND LATINOS AND HE'S SAID THAT HE HIMSELF HAS SEEN IT HAPPEN THAT THEY CAN AND HE'S BEEN INVOLVED WITH GROUPS OF OTHER MINORITIES
In the case that you only want to align one .wav/.lab pair, perhaps to test out the system, the script align_ex.sh is provided, and can be used as follows:
$ ./align_ex.sh data/myexp_1_1_1.wav data/myexp_1_1_1.lab
...
Assuming alignment is successful, this script will copy the resulting TextGrid file (called myexp_1_1_1.TextGrid) to the current directory for your inspection.
If you'd like to align multiple .wav/.lab file pairs, and they're all in a single directory, aligning them is as simple as:
$ ./align.py data/
...
This will compute the best alignments, and then place them in Praat TextGrids in the data/ directory.
Several errors can occur at this stage.
First, if a .lab file in data/ is not paired with a .wav file in the same directory, or vice versa, then align.py will quit and write the names of the unpaired files to unpaired.txt. You can read this file to figure out which files are missing, or use it to delete the files that are present but unpaired. The following command deletes the files listed in unpaired.txt (the -d option is specific to GNU xargs):
$ xargs -d '\n' rm < unpaired.txt
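If you would rather check for unpaired files before running align.py, the pairing rule can be sketched in a few lines of Python 3. This is an illustration only: the function name find_unpaired is an assumption, and the code does not reproduce align.py's own check or write unpaired.txt.

```python
import os


def find_unpaired(directory):
    """List .wav files without a matching .lab file, and vice versa."""
    wavs, labs = set(), set()
    for name in os.listdir(directory):
        stem, ext = os.path.splitext(name)
        if ext == ".wav":
            wavs.add(stem)
        elif ext == ".lab":
            labs.add(stem)
    unpaired = [stem + ".wav" for stem in wavs - labs]
    unpaired += [stem + ".lab" for stem in labs - wavs]
    return sorted(unpaired)
```

Calling find_unpaired("data/") returns an empty list when every recording has a transcript and every transcript has a recording.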
Secondly, a word in your .lab files may be missing from the dictionary. Such words are written to outofdict.txt. You can transcribe these in outofdict.txt using a text editor, then mix them back in like so:
$ ./sort.py dictionary.txt outofdict.txt > tmp;
$ mv tmp dictionary.txt
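The internals of sort.py are not described here. Assuming that it simply pools the entries of the files it is given, drops duplicates, and sorts them, the same effect could be sketched in Python 3 as follows (merge_dictionaries is a hypothetical name, and the files are assumed to be plain text with one entry per line):

```python
def merge_dictionaries(*paths):
    """Pool the non-empty lines of several dictionary files, sorted."""
    entries = set()
    for path in paths:
        with open(path) as handle:
            for line in handle:
                line = line.strip()
                if line:
                    entries.add(line)  # the set removes exact duplicates
    return sorted(entries)
```

Writing the returned lines to a temporary file and then renaming it, as in the commands above, avoids reading and overwriting dictionary.txt at the same time.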
If you call align.py with the argument -m, each word in outofdict.txt is paired with a list of .lab files where it occurs. This may be useful for fixing typos in the .lab files.
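To see what such a report amounts to, here is a minimal Python 3 sketch of the lookup. It is not align.py's code: the function names are assumptions, the dictionary is assumed to be in the CMU layout (the word first, then its phones, separated by whitespace), and the files are assumed to be plain ASCII.

```python
def load_dictionary_words(dictionary_path):
    """Collect the first whitespace-separated field of each line."""
    words = set()
    with open(dictionary_path) as handle:
        for line in handle:
            fields = line.split()
            if fields:
                words.add(fields[0])
    return words


def out_of_dictionary(lab_paths, dictionary_words):
    """Map each unknown word to the sorted .lab files containing it."""
    missing = {}
    for path in lab_paths:
        with open(path) as handle:
            for word in handle.read().split():
                if word not in dictionary_words:
                    missing.setdefault(word, set()).add(path)
    return {word: sorted(paths) for word, paths in missing.items()}
```

A word that appears in only one .lab file is often a typo in that transcript rather than a genuine gap in the dictionary.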
Also, if SoX is not installed, but is needed because the audio is in a different format than the provided models (which are mono and sampled at 8000 Hz), an error will be raised.
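You can tell in advance whether a given file matches the provided models by reading its header. The sketch below uses Python 3's standard wave module; the function name needs_resampling is an assumption, and the check only covers channel count and sample rate for uncompressed PCM .wav files.

```python
import wave


def needs_resampling(wav_path, model_rate=8000):
    """True if the file is not mono or is not sampled at model_rate Hz."""
    with wave.open(wav_path, "rb") as audio:
        return audio.getnchannels() != 1 or audio.getframerate() != model_rate
```

For the 22050 Hz example file shown earlier, this would return True, so SoX would be required.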
Lastly, the file align.py may not be marked as executable on your system, in which case you'll get an error like the following:
$ ./align.py data/
-bash: ./align.py: Permission denied
On Linux or Mac OS X, the following command should do the trick:
$ chmod +x ./align.py
Then, run align.py as above.
The align.py script makes prodigious use of "temporary" disk space. On Linux (in particular), it is possible that this space is limited by the OS, and align.py will fail with a number of cascading errors referring to disk space. A simple way to fix this is to use a temporary directory located somewhere else. If the environment variable $TMPDIR is defined and points to a writeable directory, align.py will use it.
$ mkdir ~/tmpdir
$ export TMPDIR=~/tmpdir
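The rule described above (use $TMPDIR only if it is set and names a writeable directory) can be sketched in Python 3 as follows. The function name choose_tmpdir is an assumption, and this is not a copy of what align.py does internally:

```python
import os
import tempfile


def choose_tmpdir(environ=os.environ):
    """Return $TMPDIR if it is a writeable directory, else the default."""
    candidate = environ.get("TMPDIR")
    if candidate and os.path.isdir(candidate) and os.access(candidate, os.W_OK):
        return candidate
    return tempfile.gettempdir()  # system default temporary directory
```

Running this after the export command above is a quick way to confirm that the new directory would be picked up.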
The align.py script also allows you to train your own models, where the folder for training is specified by a directory after the -t flag:
$ ./align.py -t training_data/ data/
...
Please note: THIS REQUIRES A LOT OF DATA to work well, and further takes a long time when there is a lot of data. It is also possible to train on your test data, and in fact it is something we do quite often at the lab. That looks like:
$ ./align.py -t data/ data/
...
When -t is specified, a few other command-line options to align.py become available. The -s flag specifies the sample rate of the models; if SoX is installed, any training or testing data that does not match this rate will be resampled to it. For instance, to use 44100 Hz models, you could say:
$ ./align.py -s 44100 -t data data
...
Note that the slash character </> is not obligatory in specifying directories: align.py assumes these are directory names, possibly including wildcards, and expands the wildcards if possible.
A dictionary other than the default dictionary.txt can be specified with the -d flag:
$ ./align.py -d MY_DICTIONARY.txt -t data data
...
Lastly, the -n flag may be used to specify the number of training iterations per "round". align.py performs three rounds of training, each of which takes approximately the same time, so increasing this value by one adds three iterations in total. By default, -n is 4 (so 12 iterations of training in all), but the following command would set it to 5 (or 15 iterations of training):
$ ./align.py -n 5 -t data data
...
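The arithmetic behind these figures is simply the number of rounds times the value of -n. As a small Python 3 sketch (the names ROUNDS and total_iterations are assumptions made for this example):

```python
ROUNDS = 3  # align.py performs three rounds of training


def total_iterations(n=4):
    """Total training iterations for a given value of the -n flag."""
    return ROUNDS * n
```

Since each round takes roughly the same time, training time should grow roughly in proportion to this total.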
Other options are documented above.
Users who are familiar with Python are encouraged to import align.py as a Python module if it makes sense for their application.
Many users have requested the ability to store an acoustic model for future use. Prosodylab-Aligner is not built with this in mind, but it is certainly possible for technically-inclined users to save their acoustic models for reuse.
- Open align.py in a text editor.
  - Change the global variable DEBUG to True.
  - Then, edit the global variable CMU_PHONES so that it contains the same phoneset as your training data.
  - Exit the text editor.
- Gather the training data and perform model training with the -t flag.
- At the end of training and alignment, align.py will print out the location of the temporary directory where the resulting acoustic models are stored.
  - Navigate to this directory, then to the subdirectory HMM.
  - You will see a number of numbered subdirectories here. Go to the second-highest numbered subdirectory (e.g., if the last subdirectory is 9, go to 8).
  - Copy the files hmmdefs and macros to the subdirectory MOD where Prosodylab-Aligner is located.
- To return to normal operation, change the global variable DEBUG in align.py back to False.
Note that this will overwrite the default acoustic model, so you may want to keep multiple copies of the Prosodylab-Aligner directory.
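The copying step can be scripted once you know the temporary directory that align.py printed. The following Python 3 sketch assumes the layout described above (an HMM subdirectory holding numbered subdirectories, and a MOD subdirectory in the Prosodylab-Aligner directory); the function names are assumptions, and the DEBUG and CMU_PHONES edits still have to be made by hand.

```python
import os
import shutil


def second_highest_numbered(hmm_dir):
    """Return the path of the second-highest numbered subdirectory."""
    names = [name for name in os.listdir(hmm_dir)
             if name.isdigit() and os.path.isdir(os.path.join(hmm_dir, name))]
    if len(names) < 2:
        raise ValueError("expected at least two numbered subdirectories")
    names.sort(key=int)  # numeric order, so that 10 sorts after 9
    return os.path.join(hmm_dir, names[-2])


def save_acoustic_model(tmp_dir, aligner_dir):
    """Copy hmmdefs and macros into MOD; this overwrites the old model."""
    source = second_highest_numbered(os.path.join(tmp_dir, "HMM"))
    target = os.path.join(aligner_dir, "MOD")
    for name in ("hmmdefs", "macros"):
        shutil.copy(os.path.join(source, name), os.path.join(target, name))
    return source
```

Pointing aligner_dir at a separate copy of the Prosodylab-Aligner directory keeps the default acoustic model intact.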