Add FactoredDiphoneBlockV2 with right context output for training #57
Conversation
features_right = torch.cat((features, center_states_embedded), -1)  # B, T, F+E
logits_right = self.right_context_encoder(features_right)
Now the only difference to a FactoredTriphoneBlock would be that right_context_encoder would additionally also take contexts_embedded_left as input, right?
Correct.
Even the decoding can remain the same, as you can (successfully) run decodings w/ only the diphone part of a triphone model.
If you wanted, I could also add that variant (i.e. the triphone) to this PR.
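To make the difference concrete, here is a minimal sketch (assuming the tensor and module names from this PR; not the actual block code, and the triphone variant shown here is hypothetical):

```python
import torch

def diphone_right_logits(features, center_states_embedded, right_context_encoder):
    # FactoredDiphoneBlockV2: p(r|c,x) sees only the features and the embedded center state.
    features_right = torch.cat((features, center_states_embedded), dim=-1)  # B, T, F+E
    return right_context_encoder(features_right)

def triphone_right_logits(features, center_states_embedded, contexts_embedded_left, right_context_encoder):
    # Hypothetical FactoredTriphoneBlock: p(r|c,l,x) would additionally see the embedded left context.
    features_right = torch.cat((features, center_states_embedded, contexts_embedded_left), dim=-1)  # B, T, F+2E
    return right_context_encoder(features_right)
```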
This usually gives some small WER improvements if you also have a loss on this in training.
Did you ever try how "factored diphone with auxiliary right context loss" compares to "factored triphone, but in recognition we only use the diphone"?
Yes. IIRC the difference was non-existent or minimal, but I'd have to look up the exact results.
:param contexts_left: The left contexts used to compute p(c|l,x), shape B, T.
:param contexts_center: The center states used to compute p(r|c,x), shape B, T.
It is sort of implicit in your naming scheme (e.g. center_state_embedding vs. left_context_embedding), but there is IIRC the difference that for l and r we only predict/encode the phoneme identity, while c consists of the center phoneme and the state index. This is, I think, automatically handled via get_center_dim.
Should/can this be better documented in the docstrings? E.g. I do not find it obvious that contexts_left can be from [1, num_contexts] while contexts_center should be from [1, some_factor * num_contexts]. Could this be a source of (user) errors, or is it otherwise also obvious in our implementations?
Good point. I'll extend the documentation. Making this a bit more obvious was one of the first things I did at i6, so the other implementations should be fine.
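For illustration, a hypothetical sketch of the label ranges being discussed (the inventory size and states-per-phoneme here are made-up numbers, and the real get_center_dim may factor in additional things such as word-end classes):

```python
# Illustrative values only (assumptions, not the actual configuration).
num_contexts = 42        # size of the phoneme/context inventory
states_per_phoneme = 3   # HMM states per center phoneme

# Left/right context labels index only the phoneme identity:
#   contexts_left, contexts_right take values in [0, num_contexts)
left_right_dim = num_contexts

# Center labels index the phoneme identity *and* the HMM state index:
#   contexts_center takes values in [0, num_contexts * states_per_phoneme)
center_dim = num_contexts * states_per_phoneme  # cf. get_center_dim
```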
This PR adds an output on p(r | c, x). This usually gives some small WER improvements if you also have a loss on this in training. When decoding w/ the joint output, the right context is ignored (as we are still using a diphone model).
V2 to avoid hash breakage.
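For context, a rough sketch of how the new right-context output could be paired with an auxiliary loss in training (function names, target shapes, and the loss weighting are assumptions, not the actual training setup):

```python
import torch
import torch.nn.functional as F

def training_losses(logits_center, logits_right, targets_center, targets_right, aux_scale=0.5):
    # Main framewise loss on the diphone output, logits shape B, T, C -> B, C, T for cross_entropy.
    loss_center = F.cross_entropy(logits_center.transpose(1, 2), targets_center)
    # Auxiliary loss on the new p(r|c,x) output; only used in training,
    # the right-context logits are ignored when decoding with the joint output.
    loss_right = F.cross_entropy(logits_right.transpose(1, 2), targets_right)
    return loss_center + aux_scale * loss_right
```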