Statistics reported in README don't match the actual statistics of the corpus #2

Open
alexeyev opened this issue May 17, 2023 · 9 comments

@alexeyev

Dear colleagues, thank you for your fantastic work on the long-awaited treebank!

I thought I should report this just in case: both the .conllu files and stats.xml show a total of 781 sentences in the corpus, while the README states that there are 6400.
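For anyone who wants to reproduce the count, here is a minimal sketch of how sentences and words can be tallied from CoNLL-U text. It assumes only the standard CoNLL-U layout (sentences separated by blank lines, comment lines starting with `#`, an ID in the first tab-separated column). The function name `conllu_counts` and the sample data are made up for illustration and are not part of the treebank or its tooling.

```python
def conllu_counts(text):
    """Return (sentences, words) for a string of CoNLL-U data."""
    sentences = 0
    words = 0
    in_sentence = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            # Comment lines (sent_id, text, ...) carry no tokens.
            continue
        token_id = line.split("\t", 1)[0]
        if not in_sentence:
            sentences += 1
            in_sentence = True
        # Plain integer IDs are syntactic words; ranges such as "1-2"
        # (multiword tokens) and decimals such as "1.1" (empty nodes)
        # are not counted as words.
        if token_id.isdigit():
            words += 1
    return sentences, words


# Illustrative two-sentence sample, not taken from the treebank.
sample = (
    "# sent_id = 1\n"
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "2\t!\t!\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "1-2\tdon't\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\tdo\tdo\tAUX\t_\t_\t0\troot\t_\t_\n"
    "2\tn't\tnot\tPART\t_\t_\t1\tadvmod\t_\t_\n"
    "\n"
)

print(conllu_counts(sample))
```

Running the same function over the concatenated contents of the .conllu files gives a sentence total that can be compared with the figure in the README.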

Best regards.

@alexeyev
Author

It is also rather unusual that the training data segment is 8.5 times smaller than the test set; what is the motivation for that? Thank you.

@martinpopel
Member

I agree the "6400 sentences (7.4K tokens)" is suspicious (it would mean 1.2 tokens per sentence) even without looking at the .conllu and stats.xml files. I guess it should be "7.4K words (6.4K words excluding punctuation)".
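The arithmetic behind that suspicion can be checked in a couple of lines. This is only a sanity-check sketch: 7400 stands in for the rounded "7.4K" figure, 6400 and 781 are the sentence counts mentioned in this thread, and the helper name `tokens_per_sentence` is made up for illustration.

```python
def tokens_per_sentence(tokens, sentences):
    """Average number of tokens per sentence."""
    return tokens / sentences


# README figures read literally: 7.4K tokens (approximate) in 6400 sentences.
readme_ratio = tokens_per_sentence(7400, 6400)

# Same approximate token count with the 781 sentences reported by stats.xml.
actual_ratio = tokens_per_sentence(7400, 781)

print(f"README as written: {readme_ratio:.1f} tokens per sentence")
print(f"With 781 sentences: {actual_ratio:.1f} tokens per sentence")
```

An average just above one token per sentence is implausible for running text, whereas the second ratio is in a range one would expect for short sentences.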

> the training data segment is 8.5 times smaller than the test set; what are the motivations for that?

See the data split guidelines. For treebanks with fewer than 20k words, they suggest either keeping everything as test data or setting aside 20-50 sentences as "train". Here we see 80 sentences in train with a strange sent_id system (...,797, 798, 789_, 790_, ... 799_, 800), but the fact that train is 8.5 times smaller than test is OK.
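The odd sent_id values can be listed with a short script. The sketch below assumes that a "regular" sent_id in this treebank is purely numeric, which is an assumption made for illustration and not a UD requirement. The helper name `irregular_sent_ids` is hypothetical, and the sample lines use only the IDs quoted above.

```python
import re

# Matches CoNLL-U comment lines of the form "# sent_id = <value>".
SENT_ID = re.compile(r"^#\s*sent_id\s*=\s*(\S+)")


def irregular_sent_ids(lines):
    """Return sent_id values that are not purely numeric (assumed convention)."""
    found = []
    for line in lines:
        match = SENT_ID.match(line)
        if match and not match.group(1).isdigit():
            found.append(match.group(1))
    return found


# Comment lines built from the IDs mentioned in this thread.
sample_lines = [
    "# sent_id = 797",
    "# sent_id = 798",
    "# sent_id = 789_",
    "# sent_id = 790_",
    "# sent_id = 799_",
    "# sent_id = 800",
]

print(irregular_sent_ids(sample_lines))
```

Passing the lines of the train file to the same function would list every ID with a trailing underscore or other non-numeric suffix.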

@alexeyev
Author

> See the data split guidelines.

Thank you for the prompt response and the pointers!

dan-zeman added a commit that referenced this issue May 17, 2023
@alexeyev
Author

I was going to close this issue, but now I see that quite a few changes have been made to the treebank recently. Could you please update the README and provide at least a brief description of the modifications?

Has the annotation scheme changed? Do the earlier sentences remain in place?

Thank you in advance and thank you for your work.

@ibrahimbenli
Contributor

We have expanded the treebank. The old sentences are still there. There are currently 2.4K sentences and 23K words, and it will continue to expand. The dataset mostly contains headlines from news websites. We are also updating the README file.

@alexeyev
Author

Thank you for your response and clarifications.

I see that @dan-zeman has updated the stats, thanks!

Looks like the README file still states that there are 781 sentences.

@dan-zeman
Member

The file stats.xml gets updated automatically during release. Updating README.md is your responsibility :-) (in particular the Changelog section, but of course if other sections become invalid when new data is added, please fix them too). During release, parts of the README will be automatically copied to the website, so again it is important to have the README in dev up to date by the release data freeze deadline.

@alexeyev
Author

I am just an occasional user and not a contributor to this project, so I do not have any information on what is new in the recent additions. But I could of course try to make a pull request after the New Year holidays.

@ibrahimbenli
Contributor

ibrahimbenli commented Dec 31, 2024 via email
