
Sindhi features update #1067

Open
AngledLuffa opened this issue Nov 28, 2024 · 14 comments

@AngledLuffa

Hey all,

We've been working on a new Sindhi dataset of about 5000 sentences. I'd say we're on track to have it finished in time for the next UD release in May, or in the worst case the one next November.

@muteeurahman

As part of this, we redefined a set of features that we think better fits the data. The existing features were defined for a smaller, unfinished dataset whose original author is sometimes rather hard to reach. That dataset hasn't been updated in several years, so our expectation is that there is no timeline for publishing it.

In a case like this, should we just overwrite the features in the existing config files? Should we merge our features with the original features proposed for Sindhi?

Thanks in advance

@dan-zeman
Member

I am not sure I understand what you mean by "overwriting the features in the existing config files". Sindhi has no language-specific documentation (and at least the one-page index page is required before any Sindhi treebank can be released). So it uses only features that are documented globally. At present the following features are allowed: Case (Abl, Acc, Gen, Nom); Gender (Masc and Fem); Number (Sing and Plur); and Person (1, 2, 3). If you need other feature-value pairs that are already documented globally, you can simply check them in the feature registration form linked above. If you need something that is not documented yet, you will have to provide the documentation page first (in the format expected by the validation system).
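For concreteness, the inventory listed above could be checked against a FEATS string roughly like this. This is only a rough sketch, not the actual validator code, and it covers just the four features named above:

# Rough sketch: the feature values currently permitted for Sindhi, as listed above,
# and a check of one CoNLL-U FEATS column value against them.
PERMITTED_SD = {
    "Case": {"Abl", "Acc", "Gen", "Nom"},
    "Gender": {"Fem", "Masc"},
    "Number": {"Plur", "Sing"},
    "Person": {"1", "2", "3"},
}

def check_feats(feats):
    """Return a list of problems found in one FEATS column value."""
    problems = []
    if feats == "_":
        return problems
    for pair in feats.split("|"):
        feature, _, value = pair.partition("=")
        if feature not in PERMITTED_SD:
            problems.append("feature %s is not permitted for sd" % feature)
        elif value not in PERMITTED_SD[feature]:
            problems.append("value %s=%s is not permitted for sd" % (feature, value))
    return problems

print(check_feats("Case=Nom|Gender=Masc|Number=Sing"))  # []
print(check_feats("Case=Erg|Number=Dual"))               # two problems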

@dan-zeman
Member

That dataset hasn't been updated in several years, so our expectation is that there is no timeline for publishing it.

This is probably so. The main problem with that dataset is that it has only morphological annotation and no syntax, so it is not a treebank.

@AngledLuffa
Author

I am not sure I understand what you mean by "overwriting the features in the existing config files"

There's this block in feats.json in the tools repo:

"sd": {
"Abbr": {"type": "universal", "doc": "global", "permitted": 1, "errors": [], "uvalues": ["Yes"], "lvalues": [], "unused_uvalues": [], "unused_lvalues": [], "evalues": [], "byupos": {}},
"Case": {"type": "universal", "doc": "global", "permitted": 1, "errors": [], "uvalues": ["Abl", "Acc", "Gen", "Nom"], "lvalues": [], "unused_uvalues": ["Abe", "Abs", "Add", "Ade", "All", "Ben", "Cau", "Cmp", "Cns", "Com", "Dat", "Del", "Dis", "Ela", "Equ", "Erg", "Ess", "Ill", "Ine", "Ins", "Lat", "Loc", "Par", "Per", "Sbe", "Sbl", "Spl", "Sub", "Sup", "Tem", "Ter", "Tra", "Voc"], "unused_lvalues": [], "evalues": [], "byupos": {"ADJ": {"Acc": 1, "Nom": 1}, "ADP": {"Acc": 1, "Nom": 1}, "DET": {"Acc": 1, "Nom": 1}, "NOUN": {"Abl": 1, "Acc": 1, "Nom": 1}, "PRON": {"Abl": 1, "Acc": 1, "Gen": 1, "Nom": 1}, "PROPN": {"Abl": 1, "Acc": 1, "Nom": 1}, "VERB": {"Acc": 1, "Nom": 1}}, "lastchanged": "2023-07-25-15-50-45", "lastchanger": "dan-zeman"},
"Foreign": {"type": "universal", "doc": "global", "permitted": 1, "errors": [], "uvalues": ["Yes"], "lvalues": [], "unused_uvalues": [], "unused_lvalues": [], "evalues": [], "byupos": {}},
"Form": {"type": "lspec", "doc": "none", "permitted": 0, "errors": [], "uvalues": [], "lvalues": [], "unused_uvalues": [], "unused_lvalues": [], "evalues": ["Bound", "Simple"], "byupos": {}},
"Gender": {"type": "universal", "doc": "global", "permitted": 1, "errors": [], "uvalues": ["Fem", "Masc"], "lvalues": [], "unused_uvalues": ["Com", "Neut"], "unused_lvalues": [], "evalues": [], "byupos": {"ADJ": {"Fem": 33, "Masc": 67}, "ADV": {"Masc": 8}, "DET": {"Fem": 18, "Masc": 33}, "INTJ": {"Masc": 2}, "NOUN": {"Fem": 636, "Masc": 669}, "PRON": {"Fem": 10, "Masc": 9}, "PROPN": {"Fem": 96, "Masc": 201}, "VERB": {"Masc": 6}}},
"Number": {"type": "universal", "doc": "global", "permitted": 1, "errors": [], "uvalues": ["Plur", "Sing"], "lvalues": [], "unused_uvalues": ["Coll", "Count", "Dual", "Grpa", "Grpl", "Inv", "Pauc", "Ptan", "Tri"], "unused_lvalues": [], "evalues": [], "byupos": {"ADJ": {"Plur": 1, "Sing": 1}, "ADV": {"Plur": 1, "Sing": 1}, "AUX": {"Plur": 1, "Sing": 1}, "DET": {"Plur": 1, "Sing": 1}, "NOUN": {"Plur": 1, "Sing": 1}, "NUM": {"Plur": 1, "Sing": 1}, "PRON": {"Plur": 1, "Sing": 1}, "PROPN": {"Plur": 1, "Sing": 1}, "VERB": {"Plur": 1, "Sing": 1}}, "lastchanged": "2024-09-23-10-54-21", "lastchanger": "dan-zeman"},
"Person": {"type": "universal", "doc": "global", "permitted": 1, "errors": [], "uvalues": ["1", "2", "3"], "lvalues": [], "unused_uvalues": ["0", "4"], "unused_lvalues": [], "evalues": [], "byupos": {"ADJ": {"2": 9, "3": 406}, "ADV": {"3": 8}, "DET": {"3": 205}, "INTJ": {"3": 2}, "NOUN": {"3": 1251}, "PRON": {"1": 55, "2": 28, "3": 78}, "PROPN": {"3": 289}, "VERB": {"3": 15}}},
"Typo": {"type": "universal", "doc": "global", "permitted": 1, "errors": [], "uvalues": ["Yes"], "lvalues": [], "unused_uvalues": [], "unused_lvalues": [], "evalues": [], "byupos": {}}
},
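Just to make sure I'm reading that block right, here's a quick sketch of how I'd interpret it. It assumes a local checkout of the tools repo, and the exact nesting around the "sd" key in the real file may differ:

import json

# Path is an assumption (a local checkout of UniversalDependencies/tools).
with open("tools/data/feats.json", encoding="utf-8") as f:
    data = json.load(f)

# The "sd" block shown above; adjust if the file nests it differently.
sd = data.get("sd") or data.get("features", {}).get("sd", {})

for feature, info in sd.items():
    status = "permitted" if info.get("permitted") else "not permitted"
    values = info.get("uvalues") or info.get("evalues") or []
    print("%s: %s, values %s" % (feature, status, values))
    # "byupos" records which UPOS tags each value has been registered for.
    for upos, value_counts in info.get("byupos", {}).items():
        print("    %s: %s" % (upos, sorted(value_counts)))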

There's also a block in docfeats.json

https://github.com/UniversalDependencies/tools/blob/19c980e95ed0944dd5ecd262322403f8a77cee69/data/data.json#L1308
https://github.com/UniversalDependencies/tools/blob/19c980e95ed0944dd5ecd262322403f8a77cee69/data/docfeats.json#L189
https://github.com/UniversalDependencies/tools/blob/19c980e95ed0944dd5ecd262322403f8a77cee69/data/feats.json#L3276

So for those blocks, can I just go through and update the allowed features via the form you linked, and it will all work out? Or do I need to make some other change elsewhere?

What about features that are used in the prototype dataset but not in our proposed dataset: should we leave those checked in case that dataset is eventually finished with a different standard for the features? Or should we uncheck them for now, and if someone later tries to make progress on that dataset, have a discussion about unifying the features at that point?

@dan-zeman
Member

dan-zeman commented Nov 28, 2024

NO.

As the documentation says (and the same warning is on the first line of feats.json, as well as in README.md in the folder with the JSON files), you should not touch the JSON files.

Instead, use the online forms I linked above.

As for unchecking the features used in the older dataset, I think there is no doubt that the two genders, two numbers and three persons will be needed, I'm not 100% sure about the case values but they look somehow expectable to me, too. So the question won't be whether to remove them but rather where to allow them – for example, the Gender feature is currently allowed (among others) for adverbs and interjections, which I find suspicious at best. Given that the dataset was never valid and released, feel free to uncheck the combinations that do not make sense. When someone tries to make the dataset valid, they can either fix the data, or re-allow the feature where needed.

@AngledLuffa
Author

Can do, thanks. Maybe a similar warning at the top of docfeats with the appropriate URL for that page, assuming it's also generated?

https://github.com/UniversalDependencies/tools/blob/master/data/docfeats.json

@AngledLuffa
Author

PS: agreed that the wide distribution of where those features occur is a little suspicious. We're in the process of annotating features in the dev branch of our dataset, and we'll update the allowed features as we make progress.

@dan-zeman
Member

Can do, thanks. Maybe a similar warning at the top of docfeats with the appropriate URL for that page, assuming it's also generated?

https://github.com/UniversalDependencies/tools/blob/master/data/docfeats.json

The file is generated, but there is no URL/form that produces it. Instead, the file describes the machine-readable part of the docs repository, and it's updated when the contents of docs are modified.

@AngledLuffa
Author

Is there something similar for XPOS validations? I don't see it in udtools/data

@dan-zeman
Member

Is there something similar for XPOS validations? I don't see it in udtools/data

No, because XPOS is not UD.

The validator only imposes a few restrictions, such as that it cannot contain a tab character (probably it cannot contain any whitespace, but I would have to check). Otherwise it's up to you what you put there and whether you want to validate it on your end. It's not even language-specific; it's treebank-specific.

@AngledLuffa
Author

Got it. Is there a framework for language-specific validation, or is that something we do on our own before submitting the treebank? I do have a local validation script and would be happy either to merge it somewhere or just keep running it myself.
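To give a sense of what I mean, here's a toy sketch of the kind of treebank-specific check such a script might run. The XPOS whitelist and the feature-by-UPOS table are made up for illustration, not our actual scheme:

# Toy sketch of a local, treebank-specific check over a CoNLL-U file.
import sys

XPOS_WHITELIST = {"NN", "VB", "JJ"}                       # hypothetical tagset
FEATS_BY_UPOS = {"NOUN": {"Case", "Gender", "Number"}}    # hypothetical table

def validate(path):
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # skip blank lines and sentence-level comments
            cols = line.split("\t")
            if len(cols) != 10 or "-" in cols[0] or "." in cols[0]:
                continue  # skip MWT ranges, empty nodes, malformed lines
            upos, xpos, feats = cols[3], cols[4], cols[5]
            if xpos != "_" and xpos not in XPOS_WHITELIST:
                print("line %d: unexpected XPOS %s" % (lineno, xpos))
            if feats != "_":
                for pair in feats.split("|"):
                    name = pair.split("=")[0]
                    if name not in FEATS_BY_UPOS.get(upos, set()):
                        print("line %d: %s not expected on %s" % (lineno, name, upos))

if __name__ == "__main__":
    validate(sys.argv[1])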

@nschneid
Contributor

nschneid commented Dec 6, 2024

We have an English-specific validation script. In EWT it is at https://github.com/UniversalDependencies/UD_English-EWT/blob/dev/not-to-release/tools/neaten.py

@AngledLuffa
Author

Excellent, thanks. I wasn't sure whether the tools supported it or whether individual languages/treebanks did it themselves.

@mr-martian
Contributor

There's also https://github.com/mr-martian/UD-GreekCheck for Ancient Greek

@dan-zeman
Member

As you know, the UD validation infrastructure includes some tests that are language-specific. But it does not run custom scripts provided with the data.
