7B model training data contamination (suspected?) #551

Open
shifop opened this issue Jan 22, 2025 · 1 comment
Labels
bug Something isn't working

Comments


shifop commented Jan 22, 2025

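According to the repository README, the training data is:
https://huggingface.co/datasets/shibing624/chinese_text_correction
The test data is:
SIGHAN-2015 (sighan2015_test.tsv)
EC-LAW (ec_law_test.tsv)
MCSC (mcsc_test.tsv)

On inspection, the EC-LAW and MCSC test data overlap with the training data. This matches the reported results on the three test sets: EC-LAW and MCSC score close to 1, while SIGHAN-2015, oddly, reaches only 0.4917.

Were the samples that appear in the test sets removed from the training data before training?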
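For anyone who wants to reproduce the overlap check described above, here is a minimal sketch. It is not the script used for this report. It assumes each test `.tsv` stores the source sentence in its first tab-separated column and that the training sentences are already available as a list of strings; the helper names (`normalize`, `read_first_column`, `overlap_rate`, `drop_test_sentences`) are hypothetical.

```python
import csv
import unicodedata


def normalize(text: str) -> str:
    """NFKC-normalize and drop all whitespace so trivial variants compare equal."""
    return "".join(unicodedata.normalize("NFKC", text).split())


def read_first_column(path: str) -> list[str]:
    """Read the first tab-separated field of every non-empty line.

    Assumption: the first column holds the source sentence.
    """
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row[0] for row in reader if row and row[0]]


def overlap_rate(train_sentences, test_sentences) -> float:
    """Fraction of test sentences that also occur in the training sentences."""
    train = {normalize(s) for s in train_sentences}
    test = [normalize(s) for s in test_sentences]
    if not test:
        return 0.0
    return sum(s in train for s in test) / len(test)


def drop_test_sentences(train_sentences, test_sentences) -> list[str]:
    """Return the training sentences that do not occur in the test set."""
    test = {normalize(s) for s in test_sentences}
    return [s for s in train_sentences if normalize(s) not in test]
```

With the files named in this issue, the check would look like `overlap_rate(train_sentences, read_first_column("ec_law_test.tsv"))`. Exact matching after normalization only catches verbatim duplicates, so near-duplicates would need a fuzzier comparison.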

shifop added the bug label Jan 22, 2025
@shibing624
Owner

The training data did include the test-set data.
