In the function for sequence alignment (but the same can be said about `_expand_sequence`), the code is supposed to replace embeddings in `full_seq` (77, 768) between `start` and `end` with the ones from `seq`. However, a transpose operation is performed first, so `full_seq` ends up with shape (768, 77) and the assignment `full_seq[start:end]` slices over the wrong (embedding) dimension. Similarly, `seq` is also indexed incorrectly.
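To make this concrete, here is a minimal sketch of the problem (shapes and variable names are illustrative, not taken verbatim from the repository):

```python
import torch

full_seq = torch.zeros(77, 768)   # (sequence length, embedding dim)
seq = torch.randn(77, 768)
start, end = 3, 7

# Buggy pattern: after the transpose, dim 0 is the 768-dim embedding axis,
# so the slice silently overwrites embedding rows, not token positions.
full_seq_t = full_seq.transpose(0, 1)                     # (768, 77)
full_seq_t[start:end] = seq.transpose(0, 1)[start:end]    # wrong dimension

# Intended behaviour: replace token positions start..end along the
# sequence dimension of the (77, 768) tensor.
full_seq[start:end] = seq[start:end]
```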
Moreover, I believe the calculation of spans is also incorrect, as it operates on words without accounting for the possibility of a word being broken into multiple tokens. In the repository of the paper author, this function
```python
import numpy as np


def get_token_alignment_map(tree, tokens):
    # If no tokens are given, map every leaf (plus the end marker) to itself.
    if tokens is None:
        return {i: [i] for i in range(len(tree.leaves()) + 1)}

    def get_token(token):
        # Strip the BPE end-of-word marker "</w>".
        return token[:-4] if token.endswith("</w>") else token

    idx_map = {}
    j = 0
    max_offset = np.abs(len(tokens) - len(tree.leaves()))
    mytree_prev_leaf = ""
    for i, w in enumerate(tree.leaves()):
        token = get_token(tokens[j])
        idx_map[i] = [j]
        if token == mytree_prev_leaf + w:
            mytree_prev_leaf = ""
            j += 1
        else:
            if len(token) < len(w):
                # The word was split into several sub-word tokens: keep
                # consuming tokens until they reassemble the word,
                # recording each token index for this leaf.
                prev = ""
                while prev + token != w:
                    prev += token
                    j += 1
                    token = get_token(tokens[j])
                    idx_map[i].append(j)
                # assert j - i <= max_offset
            else:
                # One token spans more than one leaf: remember the prefix
                # and reuse the same token for the next leaf.
                mytree_prev_leaf += w
                j -= 1
            j += 1
    idx_map[i + 1] = [j]
    return idx_map
```
is used to perform this mapping between word spans and token spans.
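For example, on a hypothetical input (assuming an nltk `Tree`, of which only `.leaves()` is used; the token split shown is made up for illustration and is not the actual CLIP BPE output):

```python
from nltk import Tree

tree = Tree.fromstring("(NP (DT a) (JJ red) (NN snowman))")
# Pretend the tokenizer splits "snowman" into two sub-word tokens.
tokens = ["a</w>", "red</w>", "snow", "man</w>"]

print(get_token_alignment_map(tree, tokens))
# {0: [0], 1: [1], 2: [2, 3], 3: [4]}
```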
@elvisnava Thank you very much for pointing this out. To be honest, the code in this repository is based on an earlier release by the author, provided as supplementary material (the zip file) on OpenReview.
Also, thank you for pointing out the problem of a possible lack of alignment between tokens and spans. This appears to be handled properly by the official implementation, and we will incorporate it into our pipeline with reference to that implementation.
I will fix this issue as soon as I can. Thanks again for raising it.
Thank you for your response. I believe that in the original code, the dimensions of `seq` and `full_seq` are meant to be (B, 77, 768), with B being the batch size. In that case, the transpose operation would correctly operate on the sequence dimension (77).
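For reference, a small sketch of how I understand the original code's shapes (again, illustrative rather than copied from the repository):

```python
import torch

B = 2
full_seq = torch.zeros(B, 77, 768)   # (batch, sequence length, embedding dim)
seq = torch.randn(B, 77, 768)
start, end = 3, 7

# With the batch dimension present, transpose(0, 1) gives (77, B, 768), so
# slicing the first dimension does select token positions as intended.
full_seq_t = full_seq.transpose(0, 1)
full_seq_t[start:end] = seq.transpose(0, 1)[start:end]
full_seq = full_seq_t.transpose(0, 1)                 # back to (B, 77, 768)
```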
I would advise anyone against using this implementation until these issues are fixed.