With CLIP, we compute image and text embeddings and use their similarities to retrieve the best-matching text. I tried the same thing with motion and text embeddings, but it does not work.
For example, using the AMASS dataset with a batch size of 2 and the texts 'jump' and 'dancing':
import torch
import clip

# Encode the motions and L2-normalize the embeddings.
emb = enc.encode_motions(batch['x']).to(device)
emb /= emb.norm(dim=-1, keepdim=True)
# Tokenize and encode the captions with CLIP, then L2-normalize.
text_inputs = torch.cat([clip.tokenize(c) for c in batch["clip_text"]]).to(device)
text_features = clip_model.encode_text(text_inputs).float()
text_features /= text_features.norm(dim=-1, keepdim=True)
# Scaled cosine similarity between every motion and every caption, softmaxed over captions.
logit_scale = clip_model.logit_scale.exp()
similarity = (logit_scale * emb @ text_features.T).softmax(dim=-1)
# Rank all captions for the first motion in the batch.
values, indices = similarity[0].topk(len(batch["clip_text"]))
# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{batch['clip_text'][index]:>16s}: {100 * value.item():.2f}%")
Expected output for similarity[0]: a high probability for "jump".
Instead, I get a high probability for "dancing". I have tested this with multiple batches, and most of the time the correct text does not get the highest similarity. Am I running inference incorrectly?
That's weird. Your code looks good to me, and we know from the action classification experiment that cosine similarity should work at least to some extent. Did you try using that experiment as a reference?
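One way to check whether cosine similarity works "to some extent" is to measure top-1 retrieval accuracy over a whole batch instead of inspecting a single softmax row. The sketch below is only an illustration, not code from the repository: `retrieval_top1_accuracy` is a hypothetical helper, the tensors are synthetic stand-ins for the motion and text embeddings, and it assumes that row i of the motion embeddings is paired with row i of the text embeddings and that captions within the batch are distinct.

```python
import torch


def retrieval_top1_accuracy(motion_emb: torch.Tensor, text_emb: torch.Tensor) -> float:
    """Fraction of motions whose most similar text is their own paired caption.

    Hypothetical helper: assumes row i of motion_emb is paired with row i of text_emb.
    """
    # L2-normalize so the dot product below is a cosine similarity.
    motion_emb = motion_emb / motion_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    cosine = motion_emb @ text_emb.T  # shape: (num_motions, num_texts)
    predicted = cosine.argmax(dim=-1)  # best-matching text index per motion
    target = torch.arange(cosine.shape[0], device=cosine.device)
    return (predicted == target).float().mean().item()


# Synthetic stand-ins (assumption): real inputs would be the motion encoder
# output and the CLIP text features for the same batch.
torch.manual_seed(0)
text_emb = torch.randn(8, 512)
aligned = text_emb + 0.1 * torch.randn(8, 512)  # motions close to their captions
unrelated = torch.randn(8, 512)                 # motions with no relation to the captions

print("aligned:  ", retrieval_top1_accuracy(aligned, text_emb))
print("unrelated:", retrieval_top1_accuracy(unrelated, text_emb))
```

If the accuracy on real batches stays near chance (1 / batch size), the two embedding spaces are probably not aligned for those inputs. Printing the raw cosine matrix is also more informative than the softmax here, because `logit_scale` can turn a very small similarity gap into a near-100% probability.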