Feature documentation tools/data/feats.json #1055
Comments
I'll let @dan-zeman answer this—not sure why some values appear to be binary (e.g. for
For |
I just had a closer look, there are many strange things for many languages:
These JSON files were originally meant to be written and read by my scripts only, which is why they are undocumented and sometimes messy. This should be improved when I have time. I did not foresee the use case with annotation tools, but it makes perfect sense.

Parts of the file are artifacts of the transition from an older validation procedure to the new one. The permitted UPOS-Feature-Value triples were initialized by collecting their occurrences from the treebanks, so that no dataset would become invalid just because this type of test was introduced. Once initialized, people can edit them here, and then the value will be boolean. The validator allows a triple if it finds it in the JSON with any nonzero value (if it is 7279, it is clearly the count from some version of the data; if it is 1, it may be the result of manual editing, or also a count in the case of rare features).

Now the important question is what is and what is not an error. Clearly there are many triples that the JSON file (and therefore the validator) allows although they should not be allowed. For example, Ideally we would want to be able to uncheck the wrong combination and leave it up to the treebank maintainers to fix the data. But then they should get a four-year grace period to do so. This is the standard procedure with tests that I implement directly in the validator script. But it does not work (yet) with the feature registration system, where anybody can edit the features, and if a feature is newly disallowed, the treebanks that have it will immediately become invalid (as opposed to LEGACY).
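The acceptance rule described above (a triple is allowed if it is stored with any nonzero value) can be sketched as follows. This is only an illustration, not the validator's actual code: the nested layout feature -> UPOS -> value -> number, the name `is_permitted`, and the sample numbers are assumptions made for the example.

```python
# Hypothetical excerpt for one language. The nesting
# feature -> UPOS -> value -> number is an assumption, and the
# numbers are invented for illustration.
SAMPLE = {
    "Tense": {
        "VERB": {"Past": 7279, "Pres": 1},
        "NOUN": {"Past": 1, "Fut": 0},
    }
}


def is_permitted(table, upos, feature, value):
    """Return True if the triple is stored with any nonzero value.

    A missing feature, UPOS, or value is treated like a stored zero.
    """
    number = table.get(feature, {}).get(upos, {}).get(value, 0)
    return bool(number)


if __name__ == "__main__":
    print(is_permitted(SAMPLE, "VERB", "Tense", "Past"))  # large count
    print(is_permitted(SAMPLE, "NOUN", "Tense", "Past"))  # value 1
    print(is_permitted(SAMPLE, "NOUN", "Tense", "Fut"))   # stored zero
```

Under this rule a count of 1 and a manually set 1 are indistinguishable, which is why a questionable triple such as NOUN with Tense=Past still passes.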
Is the information about what has been set manually via the link you gave and what has been counted still available (for those cases where the value is 1)? In this case, could we have (temporarily) a second file (
Apparently there are also cases where a feature is allowed for a given UPOS but never occurs in the data, since it is a wrong assignment (and not assigned for future data), e.g. French allows
The information about what has been set manually is not specifically saved anywhere but there are two sources from which you could deduce it. First, there is the git history. It will tell you that the initialization from data occurred in December 2020. Any later modifications would be either manual edits or pre-generated records when a new language was added to UD (but these would have no data-induced feature counts). The second (and arguably easier to use) source of information is the
Nevertheless, it may not be too informative about features you may want to disallow (while the validator still accepts them), for the reasons I indicated earlier: people are discouraged from removing things, so manual edits typically mean that something was added.
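Because the file does not record provenance, the stored number is the only hint available inside the JSON itself. A rough sketch of that heuristic, under the same assumed layout (feature -> UPOS -> value -> number); the function names and sample data are hypothetical:

```python
def classify(number):
    """Guess the origin of a stored number, following the thread's reasoning."""
    if not number:
        return "not permitted"
    if number == 1:
        # Either a manual edit or a feature seen exactly once in the data.
        return "ambiguous"
    return "count from data"


def ambiguous_triples(table):
    """List (upos, feature, value) triples whose stored number is exactly 1."""
    return sorted(
        (upos, feature, value)
        for feature, by_upos in table.items()
        for upos, values in by_upos.items()
        for value, number in values.items()
        if number == 1
    )


if __name__ == "__main__":
    # Invented sample data; the layout is an assumption.
    table = {
        "Tense": {"VERB": {"Past": 7279}, "NOUN": {"Past": 1}},
        "Degree": {"ADJ": {"Cmp": 350}, "PUNCT": {"Pos": 1}},
    }
    for triple in ambiguous_triples(table):
        print(triple)
```

Triples flagged as ambiguous still need a check against the git history or the treebank data before anything is concluded about them.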
I think you could read the current |
The |
I don't think it would make sense to update it automatically on GitHub, also because different people and different applications may prefer different ways of modifying the file.
Yes, sorry about that - those have all been fixed upstream already and will propagate to the UD repo on the next release.
I wonder whether I correctly understand the definition of which features can go (or not) with which UPOS in data/feats.json. For example, for English there is the following definition for the feature Tense:

Does this mean that tokens with the UPOS NOUN (only value Past) or SCONJ (values Past and Pres) can have a feature Tense? Or is this an error, and this list was created automatically by scanning all English treebanks? There is in fact one token with UPOS NOUN and feature Tense in UD_English-PUD/en_pud-ud-test.conllu (sentence n02034018).

Other features like VerbForm also list NOUN as a possible UPOS (or Degree, which can go with PUNCT, one instance in UD_English-GUM/en_gum-ud-train.conllu).

For French, the feature Mood can go with PRON (example in UD_French-FQB/fr_fqb-ud-test.conllu) and Gender with ADP (also in French-FQB). feats.json also allows Tense for NOUN for French, even though there is no instance in any of the French treebanks.

Maybe I have misunderstood the structure of this file (I could not find any documentation either), or are these things to be corrected?

The reason I ask is that I would like to exploit this file in ConlluEditor to disallow annotating invalid features for a given UPOS.
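As an illustration of the intended use, here is a minimal sketch of how an annotation tool might check the FEATS column of one CoNLL-U token line against a table of permitted triples. It is not ConlluEditor's implementation: the function name `invalid_features`, the table layout (feature -> UPOS -> value -> number, for one language) and the sample data are assumptions.

```python
def invalid_features(conllu_line, table):
    """Return the (upos, feature, value) triples on a token line that are not permitted."""
    columns = conllu_line.rstrip("\n").split("\t")
    upos, feats = columns[3], columns[5]  # CoNLL-U: UPOS is column 4, FEATS column 6
    if feats == "_":
        return []
    bad = []
    for pair in feats.split("|"):
        feature, _, values = pair.partition("=")
        # A feature may carry several comma-separated values.
        for value in values.split(","):
            if not table.get(feature, {}).get(upos, {}).get(value, 0):
                bad.append((upos, feature, value))
    return bad


if __name__ == "__main__":
    # Invented table and token line, for illustration only.
    table = {
        "Number": {"NOUN": {"Sing": 10, "Plur": 5}},
        "Tense": {"VERB": {"Past": 3}},
    }
    line = "1\tdogs\tdog\tNOUN\tNNS\tNumber=Plur|Tense=Past\t0\troot\t_\t_"
    print(invalid_features(line, table))
```

Such a check is only as reliable as the table behind it: as long as the table contains triples collected from erroneous annotations, those combinations will not be flagged.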