update indexes_containing_token function #1050

Merged
merged 1 commit into InternLM:main on Jan 26, 2024

Conversation

AllentDan (Collaborator)

No description provided.

    self.logger.warning(
        f'The token {token}, its length of indexes {indexes} is '
        'not 1. Currently, it cannot be used as a stop word.')
    indexes = []
self._indexes_tokens_deque.append((token, indexes))
Collaborator

This should always equal 1 here, right? The entry function already filters out the case where it is greater than 1:

def indexes_containing_token(self, token):
    """Return all the possible indexes, whose decoding output may contain
    the input token."""
    encoded = self.encode(token, add_bos=False)
    if len(encoded) > 1:
        self.logger.warning(
            f'The token {token}, its length of indexes {encoded} is '
            'greater than 1. Currently, it cannot be used as a stop word.')
        return []
    return self.model.indexes_containing_token(token)

AllentDan (Collaborator, Author)

Right, but there is no guarantee that this function is only ever called through that entry point.
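
For illustration, a minimal sketch of the defensive behaviour when the wrapper method is called directly; the import path, model name, and example strings below are assumptions for illustration, not something this PR prescribes:

from lmdeploy.tokenizer import Tokenizer

tokenizer = Tokenizer('internlm/internlm-chat-7b')  # assumed model path

# A string that encodes to a single id can serve as a stop word, so the
# method returns its candidate indexes.
print(tokenizer.indexes_containing_token('<eoa>'))

# A string that encodes to more than one id triggers the guard: a warning
# is logged and an empty list is returned instead of raising.
print(tokenizer.indexes_containing_token('hello world'))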

irexyc (Collaborator) commented Jan 26, 2024

Do you mean that the token_id output by the model may fall outside the vocab, and that generation should also stop when such an id appears?

AllentDan (Collaborator, Author) commented Jan 26, 2024

> Do you mean that the token_id output by the model may fall outside the vocab, and that generation should also stop when such an id appears?

The main issue is that reading the vocab_size attribute directly from Hugging Face's AutoTokenizer is not reliable: some models add extra tokens of their own, so an index can exceed vocab_size. Normally, the vocab_size_with_added added in this PR should cover everything, but let's hold off on using it for now.
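
To make the pitfall concrete, a hedged sketch using Hugging Face's AutoTokenizer; the model name and the added token are placeholders, not taken from this PR:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('internlm/internlm-chat-7b',
                                    trust_remote_code=True)
tok.add_tokens(['<my_added_token>'])  # placeholder added token

new_id = tok.convert_tokens_to_ids('<my_added_token>')
print(tok.vocab_size)  # base vocabulary only; excludes added tokens
print(len(tok))        # base vocabulary plus added tokens
print(new_id)          # typically >= tok.vocab_size

The vocab_size_with_added mentioned above would presumably correspond to len(tok) here, which is why it covers the full range of ids while vocab_size alone may not.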

lvhan028 merged commit a93a8fb into InternLM:main on Jan 26, 2024
4 of 6 checks passed