
Sindhi features update #1067

Open
AngledLuffa opened this issue Nov 28, 2024 · 14 comments

@AngledLuffa

Hey all,

We've been working on a new Sindhi dataset of about 5000 sentences. I'd say we're on track to have it finished in time for the next UD release in May, or in the worst case the one next November.

@muteeurahman

As part of this, we redefined a set of features that we think better fits the data. The existing features were defined for a smaller, unfinished dataset whose original author is sometimes rather hard to reach. That dataset hasn't been updated in several years, so our expectation is that there is no timeline for publishing it.

In a case like this, should we just overwrite the features in the existing config files? Should we merge our features with the original features proposed for Sindhi?

Thanks in advance

@dan-zeman
Member

I am not sure I understand what you mean by "overwriting the features in the existing config files". Sindhi has no language-specific documentation (and at least the one-page index page is required before any Sindhi treebank can be released). So it uses only features that are documented globally. At present the following features are allowed: Case (Abl, Acc, Gen, Nom); Gender (Masc and Fem); Number (Sing and Plur); and Person (1, 2, 3). If you need other feature-value pairs that are already documented globally, you can simply check them in the feature registration form linked above. If you need something that is not documented yet, you will have to provide the documentation page first (in the format expected by the validation system).
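For concreteness, the inventory listed above could be checked against a FEATS string roughly like this. This is only a rough sketch, not the actual validator code, and it covers just the four features named above:

# Rough sketch: the feature values currently permitted for Sindhi, as listed above,
# and a check of one CoNLL-U FEATS column value against them.
PERMITTED_SD = {
    "Case": {"Abl", "Acc", "Gen", "Nom"},
    "Gender": {"Fem", "Masc"},
    "Number": {"Plur", "Sing"},
    "Person": {"1", "2", "3"},
}

def check_feats(feats):
    """Return a list of problems found in one FEATS column value."""
    problems = []
    if feats == "_":
        return problems
    for pair in feats.split("|"):
        feature, _, value = pair.partition("=")
        if feature not in PERMITTED_SD:
            problems.append("feature %s is not permitted for sd" % feature)
        elif value not in PERMITTED_SD[feature]:
            problems.append("value %s=%s is not permitted for sd" % (feature, value))
    return problems

print(check_feats("Case=Nom|Gender=Masc|Number=Sing"))  # []
print(check_feats("Case=Erg|Number=Dual"))               # two problems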

@dan-zeman
Member

That dataset hasn't been updated in several years, so our expectation is that there is no timeline for publishing it.

This is probably so. The main problem with that dataset is that it has only morphological annotation and no syntax, so it is not a treebank.

@AngledLuffa
Author

I am not sure I understand what you mean by "overwriting the features in the existing config files"

There's this block in feats.json in the tools repo:

"sd": {
"Abbr": {"type": "universal", "doc": "global", "permitted": 1, "errors": [], "uvalues": ["Yes"], "lvalues": [], "unused_uvalues": [], "unused_lvalues": [], "evalues": [], "byupos": {}},
"Case": {"type": "universal", "doc": "global", "permitted": 1, "errors": [], "uvalues": ["Abl", "Acc", "Gen", "Nom"], "lvalues": [], "unused_uvalues": ["Abe", "Abs", "Add", "Ade", "All", "Ben", "Cau", "Cmp", "Cns", "Com", "Dat", "Del", "Dis", "Ela", "Equ", "Erg", "Ess", "Ill", "Ine", "Ins", "Lat", "Loc", "Par", "Per", "Sbe", "Sbl", "Spl", "Sub", "Sup", "Tem", "Ter", "Tra", "Voc"], "unused_lvalues": [], "evalues": [], "byupos": {"ADJ": {"Acc": 1, "Nom": 1}, "ADP": {"Acc": 1, "Nom": 1}, "DET": {"Acc": 1, "Nom": 1}, "NOUN": {"Abl": 1, "Acc": 1, "Nom": 1}, "PRON": {"Abl": 1, "Acc": 1, "Gen": 1, "Nom": 1}, "PROPN": {"Abl": 1, "Acc": 1, "Nom": 1}, "VERB": {"Acc": 1, "Nom": 1}}, "lastchanged": "2023-07-25-15-50-45", "lastchanger": "dan-zeman"},
"Foreign": {"type": "universal", "doc": "global", "permitted": 1, "errors": [], "uvalues": ["Yes"], "lvalues": [], "unused_uvalues": [], "unused_lvalues": [], "evalues": [], "byupos": {}},
"Form": {"type": "lspec", "doc": "none", "permitted": 0, "errors": [], "uvalues": [], "lvalues": [], "unused_uvalues": [], "unused_lvalues": [], "evalues": ["Bound", "Simple"], "byupos": {}},
"Gender": {"type": "universal", "doc": "global", "permitted": 1, "errors": [], "uvalues": ["Fem", "Masc"], "lvalues": [], "unused_uvalues": ["Com", "Neut"], "unused_lvalues": [], "evalues": [], "byupos": {"ADJ": {"Fem": 33, "Masc": 67}, "ADV": {"Masc": 8}, "DET": {"Fem": 18, "Masc": 33}, "INTJ": {"Masc": 2}, "NOUN": {"Fem": 636, "Masc": 669}, "PRON": {"Fem": 10, "Masc": 9}, "PROPN": {"Fem": 96, "Masc": 201}, "VERB": {"Masc": 6}}},
"Number": {"type": "universal", "doc": "global", "permitted": 1, "errors": [], "uvalues": ["Plur", "Sing"], "lvalues": [], "unused_uvalues": ["Coll", "Count", "Dual", "Grpa", "Grpl", "Inv", "Pauc", "Ptan", "Tri"], "unused_lvalues": [], "evalues": [], "byupos": {"ADJ": {"Plur": 1, "Sing": 1}, "ADV": {"Plur": 1, "Sing": 1}, "AUX": {"Plur": 1, "Sing": 1}, "DET": {"Plur": 1, "Sing": 1}, "NOUN": {"Plur": 1, "Sing": 1}, "NUM": {"Plur": 1, "Sing": 1}, "PRON": {"Plur": 1, "Sing": 1}, "PROPN": {"Plur": 1, "Sing": 1}, "VERB": {"Plur": 1, "Sing": 1}}, "lastchanged": "2024-09-23-10-54-21", "lastchanger": "dan-zeman"},
"Person": {"type": "universal", "doc": "global", "permitted": 1, "errors": [], "uvalues": ["1", "2", "3"], "lvalues": [], "unused_uvalues": ["0", "4"], "unused_lvalues": [], "evalues": [], "byupos": {"ADJ": {"2": 9, "3": 406}, "ADV": {"3": 8}, "DET": {"3": 205}, "INTJ": {"3": 2}, "NOUN": {"3": 1251}, "PRON": {"1": 55, "2": 28, "3": 78}, "PROPN": {"3": 289}, "VERB": {"3": 15}}},
"Typo": {"type": "universal", "doc": "global", "permitted": 1, "errors": [], "uvalues": ["Yes"], "lvalues": [], "unused_uvalues": [], "unused_lvalues": [], "evalues": [], "byupos": {}}
},
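Just to make sure I'm reading that block right, here's a quick sketch of how I'd interpret it. It assumes a local checkout of the tools repo, and the exact nesting around the "sd" key in the real file may differ:

import json

# Path is an assumption (a local checkout of UniversalDependencies/tools).
with open("tools/data/feats.json", encoding="utf-8") as f:
    data = json.load(f)

# The "sd" block shown above; adjust if the file nests it differently.
sd = data.get("sd") or data.get("features", {}).get("sd", {})

for feature, info in sd.items():
    status = "permitted" if info.get("permitted") else "not permitted"
    values = info.get("uvalues") or info.get("evalues") or []
    print("%s: %s, values %s" % (feature, status, values))
    # "byupos" records which UPOS tags each value has been registered for.
    for upos, value_counts in info.get("byupos", {}).items():
        print("    %s: %s" % (upos, sorted(value_counts)))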

There's also a block in docfeats.json

https://github.com/UniversalDependencies/tools/blob/19c980e95ed0944dd5ecd262322403f8a77cee69/data/data.json#L1308
https://github.com/UniversalDependencies/tools/blob/19c980e95ed0944dd5ecd262322403f8a77cee69/data/docfeats.json#L189
https://github.com/UniversalDependencies/tools/blob/19c980e95ed0944dd5ecd262322403f8a77cee69/data/feats.json#L3276

So for those blocks, can I just go through and update the allowed features via the form you linked, and it will all work out? Or do I need to make some other change elsewhere?

What about features that are used in the prototype dataset but not in our proposed dataset: should we leave those checked in case that dataset is eventually finished with a different standard for the features? Or should we uncheck them for now, and if someone later tries to make progress on that dataset, have a discussion about unifying the features at that point?

@dan-zeman
Member

dan-zeman commented Nov 28, 2024

NO.

As the documentation says (and the same warning is on the first line of feats.json, as well as in README.md in the folder with the JSON files), you should not touch the JSON files.

Instead, use the online forms I linked above.

As for unchecking the features used in the older dataset, I think there is no doubt that the two genders, two numbers and three persons will be needed, I'm not 100% sure about the case values but they look somehow expectable to me, too. So the question won't be whether to remove them but rather where to allow them – for example, the Gender feature is currently allowed (among others) for adverbs and interjections, which I find suspicious at best. Given that the dataset was never valid and released, feel free to uncheck the combinations that do not make sense. When someone tries to make the dataset valid, they can either fix the data, or re-allow the feature where needed.

@AngledLuffa
Author

Can do, thanks. Maybe a similar warning at the top of docfeats with the appropriate URL for that page, assuming it's also generated?

https://github.com/UniversalDependencies/tools/blob/master/data/docfeats.json

@AngledLuffa
Author

PS: agreed that the wide distribution of where those features occur is a little suspicious. We're in the process of annotating features in the dev branch of our dataset, and we'll update the allowed features as we make progress.

@dan-zeman
Member

Can do, thanks. Maybe a similar warning at the top of docfeats with the appropriate URL for that page, assuming it's also generated?

https://github.com/UniversalDependencies/tools/blob/master/data/docfeats.json

The file is generated, but there is no URL/form that produces it. Instead, the file describes the machine-readable part of the docs repository, and it's updated when the contents of docs are modified.

@AngledLuffa
Author

Is there something similar for XPOS validations? I don't see it in udtools/data

@dan-zeman
Member

Is there something similar for XPOS validations? I don't see it in udtools/data

No, because XPOS is not UD.

The validator only imposes a few restrictions, such as that it cannot contain a tab character (probably it cannot contain any whitespace, but I would have to check). Otherwise it's up to you what you put there and whether you want to validate it on your end. It's not even language-specific; it's treebank-specific.

@AngledLuffa
Author

Got it. Is there a framework for language-specific validation, or is that something we do on our own before submitting the treebank? I do have a local validation script and would be happy either to merge it somewhere or just keep running it myself.
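To give a sense of what I mean, here's a toy sketch of the kind of treebank-specific check such a script might run. The XPOS whitelist and the feature-by-UPOS table are made up for illustration, not our actual scheme:

# Toy sketch of a local, treebank-specific check over a CoNLL-U file.
import sys

XPOS_WHITELIST = {"NN", "VB", "JJ"}                       # hypothetical tagset
FEATS_BY_UPOS = {"NOUN": {"Case", "Gender", "Number"}}    # hypothetical table

def validate(path):
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # skip blank lines and sentence-level comments
            cols = line.split("\t")
            if len(cols) != 10 or "-" in cols[0] or "." in cols[0]:
                continue  # skip MWT ranges, empty nodes, malformed lines
            upos, xpos, feats = cols[3], cols[4], cols[5]
            if xpos != "_" and xpos not in XPOS_WHITELIST:
                print("line %d: unexpected XPOS %s" % (lineno, xpos))
            if feats != "_":
                for pair in feats.split("|"):
                    name = pair.split("=")[0]
                    if name not in FEATS_BY_UPOS.get(upos, set()):
                        print("line %d: %s not expected on %s" % (lineno, name, upos))

if __name__ == "__main__":
    validate(sys.argv[1])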

@nschneid
Contributor

nschneid commented Dec 6, 2024

We have an English-specific validation script. In EWT it is at https://github.com/UniversalDependencies/UD_English-EWT/blob/dev/not-to-release/tools/neaten.py

@AngledLuffa
Author

Excellent, thanks. I wasn't sure whether the tools supported it or whether individual languages/treebanks did it themselves.

@mr-martian
Contributor

There's also https://github.com/mr-martian/UD-GreekCheck for Ancient Greek

@dan-zeman
Member

As you know, the UD validation infrastructure includes some tests that are language-specific. But it does not run custom scripts provided with the data.
