Add FactoredDiphoneBlockV2 with right context output for training #57
Conversation
features_right = torch.cat((features, center_states_embedded), -1)  # B, T, F+E
logits_right = self.right_context_encoder(features_right)
Now the only difference to a FactoredTriphoneBlock would be that right_context_encoder would additionally also take contexts_embedded_left as input, right?
Correct.
Even the decoding can remain the same, as you can (successfully) run decodings w/ only the diphone part of a triphone model.
If you wanted, I could also add that variant (i.e. the triphone) to this PR.
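To make the difference concrete, here is a minimal sketch (assuming the tensor and module names from this PR; not the actual block code, and the triphone variant shown here is hypothetical):

```python
import torch

def diphone_right_logits(features, center_states_embedded, right_context_encoder):
    # FactoredDiphoneBlockV2: p(r|c,x) sees only the features and the embedded center state.
    features_right = torch.cat((features, center_states_embedded), dim=-1)  # B, T, F+E
    return right_context_encoder(features_right)

def triphone_right_logits(features, center_states_embedded, contexts_embedded_left, right_context_encoder):
    # Hypothetical FactoredTriphoneBlock: p(r|c,l,x) would additionally see the embedded left context.
    features_right = torch.cat((features, center_states_embedded, contexts_embedded_left), dim=-1)  # B, T, F+2E
    return right_context_encoder(features_right)
```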
This usually gives some small WER improvements if you also have a loss on this in training.
Did you ever try how "factored diphone with auxiliary right context loss" compares to "factored triphone, but in recognition we only use the diphone"?
Yes. IIRC the difference was non-existent or minimal, but I'd have to look up the exact results.
:param contexts_left: The left contexts used to compute p(c|l,x), shape B, T.
:param contexts_center: The center states used to compute p(r|c,x), shape B, T.
It is sort of implicit in your naming scheme (e.g. center_state_embedding vs. left_context_embedding), but there is IIRC the difference that for l and r we only predict/encode the phoneme identity, while c consists of the center phoneme and the state index. This is, I think, automatically handled via get_center_dim.
Should/can this be better documented in the docstrings? E.g. I do not find it obvious that contexts_left can be from [1, num_contexts] while contexts_center should be from [1, some_factor * num_contexts]. Could this be a source of (user) errors, or is it otherwise also obvious in our implementations?
Good point. I'll extend the documentation. Making this a bit more obvious was one of the first things I did at i6, so the other implementations should be fine.
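For illustration, a hypothetical sketch of the label ranges being discussed (the inventory size and states-per-phoneme here are made-up numbers, and the real get_center_dim may factor in additional things such as word-end classes):

```python
# Illustrative values only (assumptions, not the actual configuration).
num_contexts = 42        # size of the phoneme/context inventory
states_per_phoneme = 3   # HMM states per center phoneme

# Left/right context labels index only the phoneme identity:
#   contexts_left, contexts_right take values in [0, num_contexts)
left_right_dim = num_contexts

# Center labels index the phoneme identity *and* the HMM state index:
#   contexts_center takes values in [0, num_contexts * states_per_phoneme)
center_dim = num_contexts * states_per_phoneme  # cf. get_center_dim
```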
This PR adds an output on p(r | c, x). This usually gives some small WER improvements if you also have a loss on this in training. When decoding w/ the joint output, the right context is ignored (as we are still using a diphone model).
V2 to avoid hash breakage.
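For context, a rough sketch of how the new right-context output could be paired with an auxiliary loss in training (function names, target shapes, and the loss weighting are assumptions, not the actual training setup):

```python
import torch
import torch.nn.functional as F

def training_losses(logits_center, logits_right, targets_center, targets_right, aux_scale=0.5):
    # Main framewise loss on the diphone output, logits shape B, T, C -> B, C, T for cross_entropy.
    loss_center = F.cross_entropy(logits_center.transpose(1, 2), targets_center)
    # Auxiliary loss on the new p(r|c,x) output; only used in training,
    # the right-context logits are ignored when decoding with the joint output.
    loss_right = F.cross_entropy(logits_right.transpose(1, 2), targets_right)
    return loss_center + aux_scale * loss_right
```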