Documentation/principles of the dictionary maintenance #22

Open
kkm000 opened this issue Oct 6, 2019 · 2 comments

kkm000 commented Oct 6, 2019

Hi Nickolay, I tried to find a description of the criteria by which the dictionary is transcribed (from other sources). One feature that strikes me as particularly odd is the set of entries marked with the digit 1 for primary stress on more than one vowel. Many of these are compounds--and, without understanding even the basic principles, I am not trying to get into the topic of compound treatment. But there are also words that are neither initialisms nor compounds and still show multiple "primary stresses" (if I should take the digit markers at face value). For example, the stressed final -ee more or less consistently yields IY1 (see the entries for inductee, markee, pawnee), and many such words have another primary stress elsewhere in addition to the final IY1. According to AmHer, inductee has a secondary stress on /in/-, and the other two, bisyllabic, examples carry no secondary stress at all. So the entry for inductee is off the mark, with its double primary stress:

CMUDict IH2 N D AH1 K T IY1
AmHer   IH2 N D AH0 K T IY1 (assuming ə -> AH0)

On the other hand, the phonemic transcription of a sample of a few words in -ee carrying the stress 1 elsewhere (e.g. manatee M AE1 N AH0 T IY2) does match that of AmHer. Looks like the common theme here is the final stressed IY1.

Is this just an error, or is there more to it? If this should be fixed, I have a list. It is not split into initialisms, compounds and simple words, but I can manually select the last category; it is not that large. Most of the rejects are compounds, even in the weakest sense of the word (e.g. remake, where re is a morpheme that would not stand on its own).

The cmusphinx-devel list on SF has had almost no traffic for the last 2 years, so I do not think it makes more sense to bring the question there than here--or does it? Or has the list moved elsewhere?

I am using the dictionary in my research, and simply discarding entries with multiple primary stresses (this is how science is supposed to work, isn't it? If the data does not fit the theory, too bad for the data :D ). But this is still confusing.

nshmyrev (Contributor) commented Oct 7, 2019

Hello Kirill

Professor Alex Rudnicky is actually responsible for cmudict; he should be able to answer all questions. Cmusphinx-devel is the right place to ask, and you can also write to Alexander Rudnicky [email protected] directly. He is usually very responsive.

CMUDict was created manually, so errors and inconsistencies may have crept in over the years. TTS people maintain their own version of cmudict. Probably some of those errors could be fixed.

There is also Pronlex, which is known to give better accuracy and to be much more consistent; see https://arxiv.org/pdf/1801.00059.pdf.

Personally, I would prefer something next-generation, with automatically generated audio examples for each word and probably an automatic check of the pronunciation by the engine.

Alexir commented Oct 23, 2023

It is the case that cmudict was created manually and that multiple people worked on it over time. So inconsistencies are inevitable.

In the case of INDUCTEE, the proper stress may be ambiguous. To me, either vowel could be stressed in casual speech; at the very least the first one (currently AH1) should be marked AH2. If the person who did the entry had just done INDUCT, I can see that they might have slipped. One project for someone might be to find all such cases and fix them.
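
As a starting point for such a project, here is a minimal Python sketch that flags entries carrying more than one primary stress. It assumes the older plain-text layout (word, whitespace, ARPAbet phones, `;;;` comment lines, `WORD(2)` for variants); other releases of the file may use different conventions. The function names are made up for illustration, and the two sample transcriptions are the ones quoted earlier in this thread.

```python
import re

# Sample lines in the assumed plain-text layout. A real run would read the
# lines from a local copy of the dictionary file instead.
SAMPLE = """\
;;; comment lines are assumed to start with three semicolons
INDUCTEE  IH2 N D AH1 K T IY1
MANATEE  M AE1 N AH0 T IY2
"""

def parse_line(line):
    """Return (word, phones), or None for blank and comment lines."""
    line = line.strip()
    if not line or line.startswith(";;;"):
        return None
    word, *phones = line.split()
    # Variant pronunciations are assumed to be written WORD(2), WORD(3), ...
    word = re.sub(r"\(\d+\)$", "", word)
    return word, phones

def primary_stress_count(phones):
    """Count phones whose stress digit is 1 (primary stress)."""
    return sum(1 for p in phones if p.endswith("1"))

def multiple_primary(lines):
    """Collect (word, phones) pairs with more than one primary stress."""
    flagged = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None and primary_stress_count(parsed[1]) > 1:
            flagged.append(parsed)
    return flagged

flagged = multiple_primary(SAMPLE.splitlines())
for word, phones in flagged:
    print(word, " ".join(phones))
```

The output still needs manual review, since compounds and initialisms are flagged along with simple words.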

Note that not all entries have stress markings. cmudict was created for use in speech recognition. At some point, it was noticed that ASR models don't seem to use stress per se; the information is captured in other ways (in the acoustic model). So stress was left out when words were added. Other uses, such as TTS, do need the stress information.
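
For anyone who needs stress (e.g. for TTS), entries without markings can be separated out before use. The sketch below is one possible check, assuming the standard 15 ARPAbet vowel symbols and that an unmarked entry shows its vowels with no trailing digit. The helper name `lacks_stress` and the `manatee_unmarked` entry are hypothetical, made up for illustration; only the marked manatee transcription comes from this thread.

```python
# The 15 ARPAbet vowel symbols; a stress-marked vowel carries a trailing
# digit (0, 1 or 2), so a bare symbol from this set means "no stress given".
ARPABET_VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}

def lacks_stress(phones):
    """True if at least one vowel phone has no stress digit."""
    return any(p in ARPABET_VOWELS for p in phones)

entries = {
    # Transcription quoted earlier in this thread.
    "manatee": "M AE1 N AH0 T IY2".split(),
    # Hypothetical entry with the stress digits stripped, for illustration.
    "manatee_unmarked": "M AE N AH T IY".split(),
}

unmarked = [word for word, phones in entries.items() if lacks_stress(phones)]
print(unmarked)
```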
