Sindhi features update #1067
Comments
I am not sure I understand what you mean by "overwriting the features in the existing config files". Sindhi has no language-specific documentation (and at least the one-page index page is required before any Sindhi treebank can be released). So it uses only features that are documented globally. At present the following features are allowed:
This is probably so. The main problem with that dataset is that it has only morphological annotation and no syntax, so it is not a treebank.
There's this block in feats.json in the tools repo:
There's also a block in https://github.com/UniversalDependencies/tools/blob/19c980e95ed0944dd5ecd262322403f8a77cee69/data/data.json#L1308. So for those blocks, can I just go through and update the features allowed via the link you gave, and it will all work out? Or do I need to make some other change elsewhere? What if there are features used in the prototype dataset but not in our proposed dataset? Should we leave those checked in case that dataset is finished with a different standard for the features, or should we just uncheck those for now, and if someone looks to make progress on that dataset, we'll have a discussion about unifying the features at that time?
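For reference, a minimal sketch of how one might inspect the current Sindhi entry locally, assuming a clone of the tools repo and that Sindhi is keyed by its ISO 639-1 code "sd" in feats.json; the exact schema isn't reproduced here, so the loop just prints whatever the entry contains:

```python
import json

# Sketch only: assumes a local clone of UniversalDependencies/tools and that
# Sindhi is keyed by its ISO 639-1 code "sd"; the exact layout of feats.json
# is not spelled out here, so we simply print whatever the entry contains.
with open("tools/data/feats.json", encoding="utf-8") as f:
    feats = json.load(f)

sindhi = feats.get("sd") or feats.get("features", {}).get("sd")
if sindhi is None:
    print("no Sindhi entry found; check the top-level layout of feats.json")
else:
    for feature, spec in sorted(sindhi.items()):
        print(feature, spec)
```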
NO. As the documentation says (and the same warning is on the first line of the file), do not edit it directly. Instead, use the online forms I linked above. As for unchecking the features used in the older dataset, I think there is no doubt that the two genders, two numbers and three persons will be needed. I'm not 100% sure about the case values, but they look somewhat expectable to me, too. So the question won't be whether to remove them but rather where to allow them – for example, the
Can do, thanks. Maybe a similar warning at the top of docfeats with the appropriate URL for that page, assuming it's also generated? https://github.com/UniversalDependencies/tools/blob/master/data/docfeats.json |
PS: agreed that the wide distribution of where those features occur is a little suspicious. We're in the process of annotating features in the dev branch of our dataset, and as we make progress we'll update the allowed features.
The file is generated, but there is no URL/form that produces it. Instead, the file describes the machine-readable part of the documentation.
Is there something similar for XPOS validations? I don't see it in udtools/data |
No, because XPOS is not UD. The validator will only impose some restrictions, such as that it cannot contain a tab character (probably it cannot contain any space character, but I would have to check). Otherwise it's up to you what you put there and whether you want to validate it at your end. It's not even language-specific; it's treebank-specific.
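A minimal local check in that spirit could look like the sketch below, assuming XPOS is the fifth tab-separated column of a CoNLL-U token line and that ruling out all whitespace (not just tabs) is acceptable for this treebank; the file name is a hypothetical placeholder:

```python
# Sketch of a local XPOS sanity check over a CoNLL-U file. A literal tab
# inside XPOS would already break the 10-column layout, so we only look for
# other whitespace here; this may be stricter than the official validator.
def check_xpos(path):
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # blank separators and sentence-level comments
            cols = line.split("\t")
            if len(cols) != 10:
                continue  # not a regular token line
            xpos = cols[4]
            if xpos != "_" and any(ch.isspace() for ch in xpos):
                print(f"{path}:{lineno}: XPOS contains whitespace: {xpos!r}")

check_xpos("sd_example-ud-dev.conllu")  # hypothetical file name
```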
Got it. Is there a framework for language-specific validation, or is that something we do on our own before submitting the treebank? I do have a local validation script and would be happy either to merge it or just keep running it myself.
We have an English-specific validation script. In EWT it is at https://github.com/UniversalDependencies/UD_English-EWT/blob/dev/not-to-release/tools/neaten.py |
Excellent, thanks. I wasn't sure whether the tools supported it or whether individual languages/treebanks did it themselves.
There's also https://github.com/mr-martian/UD-GreekCheck for Ancient Greek |
As you know, the UD validation infrastructure includes some tests that are language-specific. But it does not run custom scripts provided with the data. |
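To make the idea of a local, treebank-specific pass concrete, here is a rough sketch of the kind of check such a script can do; the ALLOWED table is an illustrative placeholder, not the agreed Sindhi feature inventory, and the file name is hypothetical:

```python
# Sketch of a local feature check in the spirit of a treebank-specific script
# like neaten.py: flag feature values outside a locally agreed inventory.
# The ALLOWED table is a placeholder, not the actual Sindhi inventory.
ALLOWED = {
    "Gender": {"Masc", "Fem"},
    "Number": {"Sing", "Plur"},
    "Person": {"1", "2", "3"},
}

def check_feats(path):
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            cols = line.split("\t")
            if len(cols) != 10 or cols[5] == "_":
                continue  # skip non-token lines and empty FEATS
            for fv in cols[5].split("|"):
                feat, _, value = fv.partition("=")
                if feat in ALLOWED and value not in ALLOWED[feat]:
                    print(f"{path}:{lineno}: unexpected {feat}={value}")

check_feats("sd_example-ud-dev.conllu")  # hypothetical file name
```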
Hey all,
We've been working on a new Sindhi dataset of about 5000 sentences. I'd say we're on track to have it finished by the time of the next UD release in May, or in the worst case by next November.
@muteeurahman
As part of this, we redefined a set of features that we think better fit the data. The existing features were based on a smaller, unfinished dataset whose original author is sometimes rather hard to reach. That dataset hasn't been updated in several years, so our expectation is that there isn't a timeline for publishing it.
In a case like this, should we just overwrite the features in the existing config files? Should we merge our features with the original features proposed for Sindhi?
Thanks in advance