Why are you predicting actions for every frame of the video (output shape (b, f, action_dim, vocab_size)) instead of the expected (b, action_dim, vocab_size) for next-action prediction? The cross entropy loss for the final action prediction (labeled single eval loss) seems rather high, although it is still an improvement over the rt1x released by Google and over Octo:
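To make the shape question concrete, here's a toy sketch of the distinction I mean (random tensors and made-up dims, not the repo's actual code): the per-frame head implies a training loss over all frames, while the "single eval loss" is the cross entropy on the final frame only, i.e. the next-action prediction a (b, action_dim, vocab_size) head would give.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes; the model emits logits for every frame.
b, f, action_dim, vocab_size = 16, 6, 11, 256
logits = torch.randn(b, f, action_dim, vocab_size)          # model output
targets = torch.randint(0, vocab_size, (b, f, action_dim))  # tokenized actions

# Training loss over all frames (what the per-frame head implies):
train_loss = F.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)

# "Single eval loss": cross entropy on the last frame only, i.e. the
# expected next-action prediction.
single_eval_loss = F.cross_entropy(
    logits[:, -1].reshape(-1, vocab_size), targets[:, -1].reshape(-1)
)
```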
Additionally, the training cross entropy loss over the entire frame prediction seems to saturate before reaching 0 for the LR schedules I tried:
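For reference, the schedules I tried were of this general shape (a sketch with placeholder hyperparameters, not my exact config): linear warmup into cosine decay.

```python
import torch

# Placeholder model/optimizer just so the schedulers have something to drive.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Linear warmup for the first 1000 steps, then cosine decay.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=1000)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100_000)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[1000]
)
```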
Are the past images in a video used to condition the hidden layers, as in https://deepimagination.cc/eDiff-I/?
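By "condition the hidden layers" I mean something like the following (a hypothetical sketch of one plausible mechanism; none of these names come from the RT-X repo): hidden tokens cross-attending to embeddings of the image history.

```python
import torch
import torch.nn as nn

class FrameHistoryConditioning(nn.Module):
    """Hidden tokens cross-attend to per-frame embeddings of the past images."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hidden: torch.Tensor, frame_embeds: torch.Tensor) -> torch.Tensor:
        # hidden: (b, n, dim) tokens in some hidden layer
        # frame_embeds: (b, f, dim) one embedding per past frame
        out, _ = self.attn(query=self.norm(hidden), key=frame_embeds, value=frame_embeds)
        return hidden + out  # residual conditioning
```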
Additional info:
- I'm only able to run a batch size of 16 on my GPUs, so maybe that is the issue. Or potentially the data augmentation from https://github.com/octo-models/octo/blob/main/examples/06_pytorch_oxe_dataloader.py is the issue (see the augmentation sketch after this list).
- I am using a pre-trained MaxViT from pytorch with your classifier_free_guidance layers, as seen here: https://github.com/kyegomez/RT-X/blob/031e6edb1734774e772f497b11fb49df634fef8d/rtx/rtx1.py#L402 (I'm happy to make a pull request to add this option here as well; see the loading sketch after this list).
- I am using https://github.com/sebbyjp/robo_transformers for comparison to the official rt1x and octo baselines.
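On the augmentation point: this is guesswork at what could hurt at small batch sizes, and these exact transforms are placeholders, not necessarily the ones in the octo example script. The kind of per-frame augmentation I have in mind looks like:

```python
import torch
from torchvision import transforms

# Placeholder augmentation pipeline (random crop + color jitter per frame).
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

frames = torch.rand(6, 3, 256, 256)  # (f, c, h, w) video clip in [0, 1]
augmented = torch.stack([augment(frame) for frame in frames])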
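And here is how I load the pre-trained backbone (torchvision's MaxViT-T); wiring its stages through the classifier_free_guidance conditioning layers is what the linked rtx1.py code does, and is omitted here:

```python
import torch
from torchvision.models import maxvit_t, MaxVit_T_Weights

# Load ImageNet-pretrained MaxViT-T and drop the classification head so the
# backbone returns spatial features instead of logits.
backbone = maxvit_t(weights=MaxVit_T_Weights.DEFAULT)
backbone.classifier = torch.nn.Identity()

features = backbone(torch.rand(1, 3, 224, 224))  # final-stage feature map
```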