Statistics reported in README don't match the actual statistics of the corpus #2
Comments
It is also rather unusual that the training data segment is 8.5 times smaller than the test set; what are the motivations for that? Thank you. |
I agree the "6400 sentences (7.4K tokens)" is suspicious (it would mean 1.2 tokens per sentence) even without looking at the
See the data split guidelines. For treebanks with fewer than 20k words, they suggest either keeping everything as test data or setting aside 20-50 sentences as "train". Here we see 80 sentences in train with strange |
Thank you for the prompt response and the pointers! |
I was going to close this issue, but now I see that quite a few changes have been made to the treebank recently. Could you please update the README and provide at least a brief description of the modifications? Has the annotation scheme changed? Do the earlier sentences remain in place? Thank you in advance and thank you for your work. |
We have expanded the treebank. The old sentences are still there. There are currently 2.4K sentences and 23K words, and it will continue to expand. The dataset mostly contains headlines from news websites. We are also updating the README file. |
Thank you for your response and clarifications. I see that @dan-zeman has updated the stats, thanks! Looks like the README file still states that there are 781 sentences. |
The file |
I am just an occasional user and not a contributor to this project, so I do not have any information on what is new in the recent additions. But I could of course try to make a pull request after the New Year holidays. |
Hello! Happy New Year.
We are sorry.
We have updated the README file in the dev branch and are waiting for the next release date.
Sincerely,
İbrahim Benli
|
Dear colleagues, thank you for your fantastic work on the long-awaited treebank!
Decided that I should report this to you just in case: one can see from both the .conllu files and stats.xml that there is a total of 781 sentences in the corpus, while the README file states there are 6400 of them.
Best regards.
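For anyone who wants to reproduce this kind of check, here is a minimal sketch in Python of counting sentences and words in CoNLL-U text. It is not the script used for this report or the official UD tooling; the function name `conllu_stats` and the placeholder sample are made up for illustration. It assumes the standard CoNLL-U layout: comment lines start with `#`, a blank line ends a sentence, and only plain integer IDs count as words (multiword-token ranges such as `1-2` and empty nodes such as `1.1` are skipped).

```python
def conllu_stats(text):
    """Return (sentences, words) for a string of CoNLL-U data.

    Hypothetical helper for illustration only. Words are token lines
    whose ID column is a plain integer.
    """
    sentences = 0
    words = 0
    in_sentence = False
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if in_sentence:
                sentences += 1
            in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comment (sent_id, text, ...).
            continue
        token_id = line.split("\t", 1)[0]
        in_sentence = True
        if token_id.isdigit():
            words += 1
    if in_sentence:
        # The file did not end with a blank line.
        sentences += 1
    return sentences, words


# Placeholder data, not taken from the treebank.
sample = (
    "# sent_id = 1\n"
    "1\tA\ta\tX\t_\t_\t0\troot\t_\t_\n"
    "2\tB\tb\tX\t_\t_\t1\tdep\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "1-2\tCD\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\tC\tc\tX\t_\t_\t0\troot\t_\t_\n"
    "2\tD\td\tX\t_\t_\t1\tdep\t_\t_\n"
    "\n"
)

print(conllu_stats(sample))

# Sanity check on the README figures: 7.4K tokens over 6400 sentences.
print(round(7400 / 6400, 2))
```

Running the function over the contents of each .conllu file and summing the results gives numbers that can be compared directly with the README and stats.xml.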