Hi @lpiccinelli-eth,
First of all, thank you for sharing your valuable research.
While reviewing the code that implements the details of your paper, I came across a question. It seems that the SphHarm class at this link is not being used.
According to the paper, the camera is first predicted from the input image, and Laplace Spherical Harmonic Encoding (SHE) is then applied to create camera embeddings. Afterward, cross-attention is used to estimate depth.
Could you please clarify where this part is implemented in the code? Additionally, could you explain why the code might work even without applying the above-mentioned SHE?
Hi @pnpmpnp, I appreciate your interest!
To answer your question: for V1 we used the functional version of the spherical encoding, which is the one we suggest using right now.
In particular, we used degree 8 (to be honest, that is a bit of overkill; degree 3 is already fine), as you can see on line 18 of the V1 decoder.
If you have any other questions, do not hesitate to ask.
@lpiccinelli-eth, Thank you very much for your prompt reply.
Your clear answer resolved the part of my question that I was unsure about.
Since I'm not very familiar with geometric embeddings built on techniques like spherical harmonics or Fourier transforms, I'm having difficulty fully understanding how this module works and what benefits to expect from it. Could you share any papers or resources that helped you build intuition on this topic?
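For intuition on what a spherical harmonic encoding of a ray direction computes, here is a minimal sketch in plain Python. It is not the UniDepth implementation: the function names `legendre` and `sh_encode`, the ordering of the output, and the sign convention are assumptions made for illustration, and the repository's functional version may differ (for example in normalization or in which terms it keeps). The sketch evaluates all real spherical harmonics up to a given degree for one 3D direction, which yields (degree + 1)^2 values: 16 for degree 3 and 81 for degree 8.

```python
import math


def legendre(l, m, x):
    """Associated Legendre function P_l^m(x) for 0 <= m <= l (standard recurrence)."""
    pmm = 1.0
    if m > 0:
        somx2 = math.sqrt(max(0.0, 1.0 - x * x))
        fact = 1.0
        for _ in range(m):
            pmm *= -fact * somx2
            fact += 2.0
    if l == m:
        return pmm
    pmmp1 = x * (2 * m + 1) * pmm
    if l == m + 1:
        return pmmp1
    for ll in range(m + 2, l + 1):
        pll = ((2 * ll - 1) * x * pmmp1 - (ll + m - 1) * pmm) / (ll - m)
        pmm, pmmp1 = pmmp1, pll
    return pmmp1


def sh_encode(direction, degree):
    """Hypothetical helper: real spherical harmonics of a direction, degrees 0..degree."""
    x, y, z = direction
    norm = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / norm, y / norm, z / norm
    cos_theta = z                # cosine of the polar angle
    phi = math.atan2(y, x)       # azimuth angle
    feats = []
    for l in range(degree + 1):
        for m in range(-l, l + 1):
            am = abs(m)
            # Orthonormalization constant for degree l, order |m|
            k = math.sqrt((2 * l + 1) / (4 * math.pi)
                          * math.factorial(l - am) / math.factorial(l + am))
            p = legendre(l, am, cos_theta)
            if m == 0:
                feats.append(k * p)
            elif m > 0:
                feats.append(math.sqrt(2) * k * math.cos(m * phi) * p)
            else:
                feats.append(math.sqrt(2) * k * math.sin(am * phi) * p)
    return feats


# Example: a ray pointing slightly off the optical axis (made-up values).
ray = (0.1, -0.2, 1.0)
low = sh_encode(ray, 3)    # 16 values
high = sh_encode(ray, 8)   # 81 values
```

Each additional degree adds higher angular frequencies, so nearby directions become easier to tell apart in the embedding, in the same way that higher Fourier frequencies separate nearby positions. This is consistent with the remark above that degree 3 may already be enough in practice.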