Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

【Hackathon 7th No.43】完善 TokenizerFast 功能支持 part 1 #9407

Merged
merged 15 commits into from
Nov 27, 2024

Conversation

yinfan98
Copy link
Contributor

PR types

Others

PR changes

Models

Description

为Bloom提供tokenizer fast支持,顺便想问一下。是对test里每个def我都要添加一个fast的测试吗~感谢🙏

Copy link

paddle-bot bot commented Nov 11, 2024

Thanks for your contribution!

@yinfan98 yinfan98 changed the title 【Hackathon 7th No.43】完善 TokenizerFast 功能支持 part 【Hackathon 7th No.43】完善 TokenizerFast 功能支持 part 1 Nov 11, 2024
@DrownFish19
Copy link
Collaborator

  • 如讨论结果,如果存在对应测试case,需要对齐添加。
  • 基础测试函数已经在common中添加,基本已经覆盖所有fast类型tokenizer,如果存在特殊case,可以在test_tokenizer.py中重写函数以跳过。

Copy link

codecov bot commented Nov 20, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 7 lines in your changes missing coverage. Please review.

Project coverage is 52.96%. Comparing base (1ba3bef) to head (8907b0c).
Report is 14 commits behind head on develop.

Files with missing lines Patch % Lines
paddlenlp/transformers/bloom/tokenizer_fast.py 78.78% 7 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9407      +/-   ##
===========================================
- Coverage    53.08%   52.96%   -0.12%     
===========================================
  Files          687      689       +2     
  Lines       109472   109412      -60     
===========================================
- Hits         58114    57952     -162     
- Misses       51358    51460     +102     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@yinfan98
Copy link
Contributor Author

cc:@DrownFish19 麻烦再帮我看下pr吧,感谢

@@ -0,0 +1,131 @@
# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
#
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

辛苦在这里增加一下HuggingFace的Copyright

("bloom", "BloomTokenizer"),
(
"bloom",
("BloomTokenizer", "BloomTokenizerFast" if is_tokenizers_available() else None),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

建议这里换行一下,格式统一

"LlamaTokenizer": LlamaConverter,
"BertTokenizer": BertConverter,
}
SLOW_TO_FAST_CONVERTERS = {"LlamaTokenizer": LlamaConverter, "BertTokenizer": BertConverter}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里的convert是可以通用吗?后续可以验证一下

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里应该没有新加bloom的,因为我看在hf上bloom只有fast,没有convert的流程

Copy link
Collaborator

@DrownFish19 DrownFish19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@DrownFish19 DrownFish19 merged commit a9a6b80 into PaddlePaddle:develop Nov 27, 2024
10 of 12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants