Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add FactoredDiphoneBlockV2 with right context output for training #57

Merged
merged 4 commits into from
Aug 23, 2024

Conversation

NeoLegends
Copy link
Member

This PR adds an output on p(r | c, x). This usually gives some small WER improvements if you also have a loss on this in training.

When decoding w/ the joint output, the right context is ignored (as we are still using a diphone model).

V2 to avoid hash breakage.

@NeoLegends NeoLegends added the enhancement New feature or request label Aug 20, 2024
@NeoLegends NeoLegends self-assigned this Aug 20, 2024
@NeoLegends NeoLegends force-pushed the moritz-fh-diphone-w-right-ctx branch from bff53f4 to 444eca7 Compare August 20, 2024 13:31
@NeoLegends NeoLegends changed the title Add FactoredDiphoneBlockV2 with right context output for training Add FactoredDiphoneBlockV2 with right context output for training Aug 20, 2024
@NeoLegends NeoLegends force-pushed the moritz-fh-diphone-w-right-ctx branch from d56700e to 7068e4f Compare August 20, 2024 13:44
Comment on lines +210 to +211
features_right = torch.cat((features, center_states_embedded), -1) # B, T, F+E
logits_right = self.right_context_encoder(features_right)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now the only difference to a FactoredTriphoneBlock would be that right_context_encoder would additionally also take contexts_embedded_left as input, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct.

Even the decoding can remain the same, as you can (successfully) run decodings w/ only the diphone part of a triphone model.

If you wanted I could also add that variant (i.e. the triphone) to this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This usually gives some small WER improvements if you also have a loss on this in training.

Did you ever try how "factored diphone with auxilliary right context loss" compares to "factored triphone but in recognition we only use diphone"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. IIRC the difference was non-existent or minimal, but I'd have to look up the exact results.

Comment on lines 201 to 202
:param contexts_left: The left contexts used to compute p(c|l,x), shape B, T.
:param contexts_center: The center states used to compute p(r|c,x), shape B, T.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is sort of implicit in your naming scheme (e.g. of center_state_embedding vs left_context_embedding) But there is iirc the difference that for l and r we only predict/encode the phoneme identity but c consists of the center phoneme and the state index. This is I think automatically handled via get_center_dim.

Should/can this be better documented in the docstrings?
E.g. I do not find it obvious that contexts_left can be from [1, num_contexts] while contexts_center should be from [1, some_factor * num_contexts]. Could this be a source of (user) errors or is it otherwise also obvious in our implementations?

Copy link
Member Author

@NeoLegends NeoLegends Aug 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'll extend the documentation. Making this a bit more obvious was one of the first things I did at i6, so the other implementations should be fine.

@NeoLegends NeoLegends merged commit 06a43f3 into main Aug 23, 2024
2 checks passed
@NeoLegends NeoLegends deleted the moritz-fh-diphone-w-right-ctx branch August 23, 2024 11:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants