
3.1. Conll‐X formalization guidelines

Flavio Pisciotta edited this page Aug 8, 2024 · 111 revisions

1. General Principles

In order to make our entries compatible with Universal Dependencies-related projects, we include a CoNLL-X formalization of our constructional entries.

In such a formalization, a construction is defined as a set of directed, acyclic, labeled graphs with constructional elements as nodes and relations as edges. The format was chosen to maximize compatibility with UD notation, so that constructions formalized with our notation can be (semi-)automatically matched on UD-parsed sentences.

Fig 1. Dependency representation for the construction "fare una Nata" (see Section 2).

For clarity’s sake, we will refer to each element of a construction as a “token”; note, however, that our notion of token can apply below the word level. As a matter of fact, in AdoC we represent constructions at the sentence, phrase and word level (constructions at a higher level of complexity, e.g., the textual level, are not represented in our resource for now). Thus, as can be seen in Figure 2, morphological constructions are defined as graphs too.

Fig 2. Dependency representation for the morphological construction "Proper Noun-ata" (see Section 2).

As Figure 2 shows, the theoretical choice of including constructions at the word level poses the challenge of representing subword tokens and relations. Since different solutions to this challenge have recently been proposed 1 2 3, we will specify in the text or in the footnotes in which ways our approach differs from the ones proposed in the literature, in order to make the approaches comparable.

2. The annotation format

The CoNLL-X format we employ is inspired by UD formalization. Formalized constructions are encoded in plain text files (UTF-8) that contain three types of lines:

  • Construction-level comments starting with a hash (#), which we discuss in Section 4. Formally, they are comment lines that occur at the beginning of a construction, before the token lines, and they specify information that applies to the construction as a whole.
  • Token lines containing the annotation of a component/token/node in 13 fields separated by single tab characters, which we describe in Section 3.
  • Blank lines marking the end of a construction.

As an example of how a complete formalization should appear, we show the construction entry for Light Verb Constructions (Verb + Noun) in which the Noun is formed by a Proper Noun + the suffix -ata. We assume that the construction with id 10 (see vertical links) is a more general 'Verb + Noun' Light Verb Construction.

ex. Gianni ha fatto una Berlusconata 'Gianni did something typical of Berlusconi/something Berlusconi-like'

# cxn_id = 1
# name = fare una PROPN-ata
# function = ref:A does something typical of ref:D-1
# horizontal_links =
# vertical_links = 10
ID UD.FORM LEMMA UPOS FEATS HEAD DEPREL REQUIRED WITHOUT SEM_FEATS SEM_ROLES ADJACENCY IDENTITY
A _ _ NOUN, PROPN, PRON    _ B nsubj   0   _ _ Agent   _ _
B _ fare VERB   _ 0 root   1   _ _ _   _ _
C una uno DET  Gender=Fem D det   1   _ _ _   _ _
D _ _ NOUN Gender=Fem B obj   1   _ _ _   _ _
D-1 _ _ PROPN Animacy=Hum D root/m   1   _ _ _   _ _
D-2 _ -ata BMORPH   _  D-1 der/m    1   _ _ _   D-1 _
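
As a rough illustration of the three line types, the following Python sketch reads one formalized construction from a string. It assumes that token lines are strictly tab-separated and that a header row with the column names may be present; the function name parse_construction and the dictionary layout are hypothetical choices made for this example and do not describe any existing AdoC tool.

```python
FIELDS = ["ID", "UD.FORM", "LEMMA", "UPOS", "FEATS", "HEAD", "DEPREL",
          "REQUIRED", "WITHOUT", "SEM_FEATS", "SEM_ROLES", "ADJACENCY", "IDENTITY"]


def parse_construction(text):
    """Split one construction into metadata (dict) and token rows (list of dicts)."""
    metadata, tokens = {}, []
    for line in text.splitlines():
        if not line.strip():
            if metadata or tokens:
                break  # a blank line marks the end of the construction
            continue   # ignore blank lines before the construction starts
        if line.startswith("#"):
            # Construction-level comment of the form "# key = value"
            key, _, value = line[1:].partition("=")
            metadata[key.strip()] = value.strip()
            continue
        cells = line.split("\t")
        if cells == FIELDS:
            continue  # header row with the column names
        if len(cells) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(cells)}: {line!r}")
        tokens.append(dict(zip(FIELDS, cells)))
    return metadata, tokens


# Two token lines of the example above, rebuilt with explicit tabs.
example = "\n".join([
    "# cxn_id = 1",
    "# name = fare una PROPN-ata",
    "\t".join(FIELDS),
    "\t".join(["B", "_", "fare", "VERB", "_", "0", "root", "1", "_", "_", "_", "_", "_"]),
    "\t".join(["C", "una", "uno", "DET", "Gender=Fem", "D", "det", "1", "_", "_", "_", "_", "_"]),
    "",
])
metadata, tokens = parse_construction(example)
```

Raising an error on a wrong field count is a design choice of the sketch: it makes a missing column visible immediately instead of silently shifting the remaining values.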

2.1. What counts as a construction? Capturing formal variation

In AdoC, we consider constructions as prototypes which can be instantiated by slightly different constructs; thus, two formal variants can instantiate the same construction as long as the formal variation does not entail some difference in meaning, distribution, etc. However, since the aim of the formalization is to match patterns in UD-annotated sentences, problems could arise due to such formal variation in the actual occurrences.

If the formal variants cannot be captured by means of regular expressions in the form and lemma fields, or by specifying multiple options as values in other fields, we create more than one graph for a single entry, adding a, b, etc. to the identifier of the graph.

For instance, the morphological construction with the prefix semi- and an adjectival base (entry 169) can appear both bonded and hyphenated. Since we cannot handle both a morphological and a multiword form in the same representation, we add to the same file a second graph headed # cxn_id = 169b, while the other metadata stay the same (apart from the token referenced in the function):

# cxn_id = 169
# name = semiADJ
# function = not fully or not properly ref:A-2
# ...	

ID   UD.FORM     LEMMA     UPOS   ...
A       _        semi.+     ADJ   ...
A-1     _        semi     BMORPH  ...
A-2     _         _         ADJ   ...

# cxn_id = 169b
# name = semiADJ
# function = not fully or not properly ref:C
# ...

ID   UD.FORM     LEMMA     UPOS   ...
A      semi      semi       ADJ   ...
B       _         -        PUNCT  ...
C       _         _         ADJ   ...

3. Fields

In this section, we describe the fields, i.e., the columns in our CoNLL-X format. The fields contain information related to the tokens, while construction-level information is specified in the Metadata. In accordance with CoNLL-U guidelines, the fields must meet the following constraints:

  • Fields must not be empty.
  • Fields other than FORM and LEMMA must not contain space characters.
  • Underscore (_) is used to denote unspecified values in all fields except ID.

We add two further conventions:

  • Double slash (//) denotes that the field is not applicable to the specific token.
  • In cases where multiple possibilities can apply (e.g., UPOS can have two possible values, NOUN or VERB for instance), these are concatenated by means of a comma.
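
The constraints and conventions listed above can be checked mechanically. The sketch below is only one possible reading of them: check_field and interpret are hypothetical helper names, and the three labels returned by interpret are an assumption introduced for this example.

```python
def check_field(name, value):
    """Return a list of problems found in one field value (empty if none)."""
    problems = []
    if value == "":
        problems.append(f"{name}: fields must not be empty")
    # Only the form and lemma columns may contain space characters.
    if name not in ("UD.FORM", "LEMMA") and any(ch.isspace() for ch in value):
        problems.append(f"{name}: field must not contain space characters")
    if name == "ID" and value == "_":
        problems.append("ID: underscore is not allowed in the ID field")
    return problems


def interpret(value):
    """Classify a field value as unspecified, not applicable, or a list of options."""
    if value == "_":
        return ("unspecified", [])
    if value == "//":
        return ("not_applicable", [])
    return ("values", value.split(","))
```

For instance, interpret("NOUN,VERB") yields the two alternatives NOUN and VERB, while "_" and "//" are kept apart so that a matcher can treat "no constraint" and "field does not apply" differently.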

Our format includes the following 13 fields:

ID

The ID represents the token index, specified as an uppercase alphabetic letter starting at A for each new construction.

For tokens containing subword tokens (i.e., morphological constructions), the ID of each subword token is formed by adding to the ID of the word-token a hyphen and an integer starting at 1 (e.g., A-1, A-2, etc.):

# cxn_id = 169
# name = semiADJ
# ...

ID   UD.FORM     LEMMA     UPOS   ...
A       _        semi.+     ADJ   ...
A-1     _        semi     BMORPH  ...
A-2     _         _         ADJ   ...

UD.FORM

Superficial form of the token for specified fillers. It is specified in case of uninflecting or formally fixed elements. For instance, in the case of the idiom essere in grado di V 'to be able to V', the superficial form is specified for tokens B and D, since they are uninflecting (in 'in' and di 'of' are prepositions), while in the case of token C the form is specified since the noun grado is a fixed element in this idiom.

# ...
# name = essere in grado di V
# function = be able to ref:E
# ...

ID   UD.FORM     LEMMA     UPOS   ...
A       _        essere     AUX   ...
B      in         in        ADP   ...
C     grado      grado      NOUN  ...
D      di         di        ADP   ...
E      _           _       VERB   ...

The field is generally left unspecified in the case of morphological constructions.

LEMMA

Lemma or stem of the token for specified fillers. In the case of morphological constructions, the lemma at the word level takes the form of a regular expression 4, while the lemma of each specified subword token is the morpheme itself.

Allomorphy can sometimes be handled at the word level through the regular expression, as in the case of the prefix in- 'in, un', which has four possible allomorphs: in-, im-, ir-, il-. Note that in the example below we write the regex using the constructs accepted in Grew requests.

# ...
# name = inADJ
# ...

ID   UD.FORM        LEMMA                     UPOS     ...
A       _       in.+\|im.+\|irr.+\|ill.+      ADJ      ...
A-1     _           in                       BMORPH    ...
A-2     _            _                        ADJ      ...
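
To show how such a LEMMA pattern could be tested against actual lemmas, here is a minimal Python sketch. It assumes that the only difference from Python's re syntax in these patterns is the escaped alternation bar (\|), and that a pattern must match the whole lemma; both points are assumptions of this example, and lemma_matches is a hypothetical helper name.

```python
import re


def lemma_matches(pattern, lemma):
    """Check a lemma against a LEMMA pattern written with escaped alternation bars."""
    # Assumption: "\|" is the alternation operator; Python's re module writes it as "|".
    python_pattern = pattern.replace(r"\|", "|")
    # Assumption: the pattern has to cover the whole lemma, not just a part of it.
    return re.fullmatch(python_pattern, lemma) is not None


IN_ADJ = r"in.+\|im.+\|irr.+\|ill.+"
candidates = ["inutile", "impossibile", "irregolare", "illegale", "bello"]
matching = [lemma for lemma in candidates if lemma_matches(IN_ADJ, lemma)]
```

Note that a pattern of this kind only filters on the shape of the lemma: an adjective that merely begins with in- without containing the prefix would pass the same test.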

UPOS

This field contains the part-of-speech tags for each token. As far as word-level is concerned, we refer to Universal part-of-speech tags. In the case of subword tokens, there are two possibilities:

  • elements which exist as free lexemes are tagged according to their part-of-speech;
  • bound forms, including affixes, combining forms, and affixoids, are tagged as BMORPH (which stands for Bound MORPHeme) 5.

Below we see three examples: the first is the already introduced semi+Adjective for derivation, the second is an example of Noun + Noun compounding with capo, and the third is an example of neoclassical compounding involving two combining forms, one of which is specified (-logia).

# ...
# name = semiADJ
# ...

ID   UD.FORM     LEMMA     UPOS   ...
A       _        semi.+     ADJ   ...
A-1     _        semi     BMORPH  ...
A-2     _         _         ADJ   ...
# ...
# name = capoN
# function = head or boss of ref:A-2
# ...

ID   UD.FORM     LEMMA     UPOS   ...
A       _       capo.+     NOUN   ...
A-1     _        capo      NOUN   ...
A-2     _         _        NOUN   ...

# ...
# name = Xlogia
# function = study of ref:A-1
# ...

ID   UD.FORM     LEMMA     UPOS     ...   
A       _       .+logia    NOUN     ...   
A-1     _         _        BMORPH   ...   
A-2     _        logia     BMORPH   ...   

Note that in morphological constructions the token at the word level bears its own part of speech in the UPOS field: in this way, we can specify the output category of the morphological process.

FEATS

This field includes a list of lexical types and morphosyntactic features from the universal feature inventory (listed in Table 1) or from the language-specific extension for Italian. These features are used to specify:

  • Morphosyntactic and lexical constraints on open slots in the construction (i.e., tokens underspecified for their form and lemma).
  • Morphosyntactic constraints on lexically filled slots left underspecified for their form.

Thus, we do not use this field to annotate all the applicable features for each token (as happens in UD treebanks), but only to express the information needed to constrain the matching process, or to express semantic constraints, regardless of whether they are currently annotated in corpora.

Lexical features*    Inflectional features*
                     Nominal*       Verbal*
PronType             Gender         VerbForm
NumType              Animacy        Mood
Poss                 NounClass      Tense
Reflex               Number         Aspect
Other                Case           Voice
Abbr                 Definite       Evident
Typo                 Deixis         Polarity
Foreign              DeixisRef      Person
ExtPos               Degree         Polite
                                    Clusivity

Tab 1. List of Universal features in Universal Dependencies (from the UD website)

It is possible to include multiple features in the same field by separating them with a vertical bar (|). As we see in the example below (the form with the clitic pronoun of (a) NP tocca V-inf 'NP has to V'), we employ features both for matching underspecified tokens (A must be both a personal pronoun and a clitic pronoun) and to constrain lexically specified tokens (B is an impersonal verb, so it can only appear in a 3rd person singular form).

# cxn_id = 156b
# name = (a) NP tocca V-inf
# function = ref:A has to ref:C
# ...

ID    UD.FORM   LEMMA    UPOS   FEATS
A        _       _       PRON  PronType=Prs|Clitic=Yes
B        _     toccare   VERB  Number=Sing|Person=3
C        _       _       VERB  VerbForm=Inf
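
The way a FEATS constraint restricts matching can be sketched as a subset test: every feature named in the construction must be present on the UD token with an admitted value, while any additional features on the token are ignored. The helper names below are hypothetical, and treating comma-separated values as alternatives on both sides is an assumption of this example.

```python
def parse_feats(value):
    """Turn 'Name=Value|Name=Value' into a dict; '_' and '//' mean no features."""
    if value in ("_", "//"):
        return {}
    return dict(pair.split("=", 1) for pair in value.split("|"))


def feats_satisfied(constraint, token_feats):
    """True if the UD token carries every feature required by the construction."""
    available = parse_feats(token_feats)
    for name, wanted in parse_feats(constraint).items():
        if name not in available:
            return False
        # Comma-separated values are read as alternatives on both sides.
        if not set(wanted.split(",")) & set(available[name].split(",")):
            return False
    return True


# Token B of the example: toccare restricted to the 3rd person singular.
constraint_b = "Number=Sing|Person=3"
ud_feats = "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin"
ok = feats_satisfied(constraint_b, ud_feats)
```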

Apart from lexical and morphosyntactic information, we also use this field to annotate some semantic features which are not expressed in Italian by dedicated morphemes, but are available in the tagset: Animacy and Aspect. Moreover, we extend the use of the Definite feature: while in the Italian language-specific extension only definite and indefinite are used for determiners, we allow the use of specific indefinite too, when needed to express certain constraints. It is clear, however, that since these features are generally not annotated in Italian corpora, for the moment they only serve the purpose of formally expressing constraints, without any benefit in matching UD-annotated patterns.

HEAD

This field contains the syntactic head of the current token, which is either the ID of the head or zero (0) in case a token is the root (i.e., the governor node) in the construction. As a convention, in morphological constructions we consider the full word as the head of the root subword element. See, for instance, the Light Verb Construction with Proper Nouns (Section 2) and its representations (they are separated since subword and above-word relations constitute graphs on two different levels):

# ... 
# name = fare una PROPN-ata
# ...

ID    UD.FORM   LEMMA   UPOS                 ...    HEAD  
A        _       _      NOUN, PROPN, PRON    ...    B
B        _      fare    VERB                 ...    0
C        una    uno     DET                  ...    D
D        _     .+ata    NOUN                 ...    B
D-1      _       _      PROPN                ...    D
D-2      _      ata     BMORPH               ...    D-1  

DEPREL

In this field, we specify the type of relation that a token has with the token specified as its head. As for UPOS and FEATS, we stick to Universal Dependency Relations, and to the language-specific subtypes defined for Italian.

Generally, we follow the annotation found in UD-annotated Italian treebanks, and we tend to apply the annotation found in corpora even when our theoretical analysis of the dependency relations would be different (mostly in the case of chunks and MWEs). We do this because theoretical considerations are left to the full entry of the construction, while here we mainly pursue the practical aim of matching UD-annotated patterns.

                      Nominals     Clauses   Modifier words   Function Words
Core arguments        nsubj        csubj
                      obj          ccomp
                      iobj         xcomp
Non-core dependents   obl          advcl     advmod*          aux
                      vocative               discourse        cop
                      expl                                    mark
                      dislocated
Nominal dependents    nmod         acl       amod             det
                      appos                                   clf
                      nummod                                  case

Coordination   Headless   Loose       Special      Other
conj           fixed      list        compound     punct
cc             flat       parataxis   orphan       root
                                      goeswith     dep
                                      reparandum

Tab 2. Set of Universal Dependency Relations (from the UD website)

In our constructions, we use root as the DEPREL for the token with HEAD = 0. However, this is mostly a convention since the token tagged as root does not always (or may never) constitute the root in actual UD-annotated sentences. This tag is only used to mark the element that has no dependency relations to other tokens inside the construction. Thus, in the case of word-level constructions, since we have only one word-level token, this token (corresponding to the full word) will be our root.

The presence of word-level, morphological constructions poses the challenge of introducing a new set of dependency relations between subword elements. We employ a set of relations, partially inspired by previous work 6, all ending in /m:

  • root/m : the root inside the morphological construction. In derivation, it corresponds to the stem, while in compounding it corresponds to the head of the compound. In the case of coordinate compounds, we conventionally mark the first token as root/m.
  • der/m : the relation that links derivational affixes to the stem.
  • case/m : the relation that links the complement to the head in subordinate compounds (e.g., capostazione 'station master')
  • mod/m : the relation that links the attribute to the head in attributive compounds (e.g., altopiano 'upland')
  • conj/m : the relation that links the second constituent to the first (the head) in coordinate compounds (e.g., cartongesso 'drywall')

Below we show three examples: prefixation with semi-, Noun+Noun compounds with capo, and neoclassical compounds with -logia:

# ...
# name = semiADJ
# ...

ID   UD.FORM     LEMMA     UPOS   ...   HEAD   DEPREL
A       _        semi.+     ADJ   ...   0      root
A-1     _        semi     BMORPH  ...   A-2    der/m
A-2     _         _         ADJ   ...   A      root/m
# ...
# name = capoN
# ...

ID   UD.FORM     LEMMA     UPOS   ...   HEAD   DEPREL
A       _       capo.+     NOUN   ...   0      root
A-1     _        capo      NOUN   ...   A      root/m
A-2     _         _        NOUN   ...   A-1    case/m
# ...
# name = Xlogia
# ...

ID   UD.FORM     LEMMA     UPOS     ...   HEAD   DEPREL
A       _       .+logia    NOUN     ...   0      root
A-1     _         _        BMORPH   ...   A-2    case/m
A-2     _        logia     BMORPH   ...   A      root/m
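
The HEAD and DEPREL conventions lend themselves to a simple consistency check. The sketch below assumes one token with HEAD 0 per graph, that root and HEAD 0 always co-occur, and that a root/m token attaches to the word-level token whose ID is its prefix; check_graph is a hypothetical name, and the check is an illustration rather than an official validator.

```python
def check_graph(tokens):
    """tokens: list of dicts with ID, HEAD and DEPREL. Returns a list of problems."""
    heads = {t["ID"]: t["HEAD"] for t in tokens}
    problems = []
    roots = [tid for tid, head in heads.items() if head == "0"]
    if len(roots) != 1:
        problems.append(f"expected one token with HEAD 0, found {len(roots)}")
    for t in tokens:
        tid, head, rel = t["ID"], t["HEAD"], t["DEPREL"]
        if (head == "0") != (rel == "root"):
            problems.append(f"{tid}: HEAD 0 and DEPREL root must co-occur")
        if head != "0" and head not in heads:
            problems.append(f"{tid}: unknown head {head}")
        # Convention: the full word (e.g. A) is the head of the root subword token.
        if rel == "root/m" and head != tid.split("-")[0]:
            problems.append(f"{tid}: root/m should attach to the full word")
        # Follow HEAD links upwards; meeting a token twice means the graph has a cycle.
        seen, node = set(), tid
        while node in heads:
            if node in seen:
                problems.append(f"{tid}: cycle through {node}")
                break
            seen.add(node)
            node = heads[node]
    return problems


semi_adj = [
    {"ID": "A", "HEAD": "0", "DEPREL": "root"},
    {"ID": "A-1", "HEAD": "A-2", "DEPREL": "der/m"},
    {"ID": "A-2", "HEAD": "A", "DEPREL": "root/m"},
]
semi_adj_problems = check_graph(semi_adj)
```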

REQUIRED

Encodes whether the token has to be obligatorily expressed (valued as 1) in the construction or can be omitted (valued as 0). This field is used to include formal variants in the matching process (see Section 2.1), as in the common case of subject omission. For instance, we set as not required both the subject (Patient) and the prepositional phrase da N (Agent) in the Passive Construction with venire:

# ... 
# name = venire Passive Construction
# ...

ID    UD.FORM   LEMMA     UPOS                 ...    HEAD   DEPREL       REQUIRED  
A        _       _        NOUN, PROPN, PRON    ...    C      nsubj:pass   0
B        _       venire   AUX                  ...    C      aux:pass     1
C        _       _        VERB                 ...    0      root         1
D        _       da       ADP                  ...    E      case         0
E        _       _        NOUN, PROPN, PRON    ...    C      obl          0
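
One way to see what REQUIRED means for matching is to enumerate the formal variants it licenses. The following sketch lists every combination of optional tokens, keeping only those in which each realized token has its head realized as well; this head-closure condition is an assumption added for the example, and formal_variants is a hypothetical name.

```python
from itertools import combinations


def formal_variants(tokens):
    """Return the sets of token IDs that a match of the construction may contain."""
    required = {t["ID"] for t in tokens if t["REQUIRED"] == "1"}
    optional = [t["ID"] for t in tokens if t["REQUIRED"] == "0"]
    head = {t["ID"]: t["HEAD"] for t in tokens}
    variants = []
    for size in range(len(optional) + 1):
        for extra in combinations(optional, size):
            kept = required | set(extra)
            # Assumption: a token can only be realized if its head is realized too.
            if all(head[tid] == "0" or head[tid] in kept for tid in kept):
                variants.append(kept)
    return variants


venire_passive = [
    {"ID": "A", "HEAD": "C", "REQUIRED": "0"},
    {"ID": "B", "HEAD": "C", "REQUIRED": "1"},
    {"ID": "C", "HEAD": "0", "REQUIRED": "1"},
    {"ID": "D", "HEAD": "E", "REQUIRED": "0"},
    {"ID": "E", "HEAD": "C", "REQUIRED": "0"},
]
variants = formal_variants(venire_passive)
```

Under these assumptions the preposition da (D) never appears without its noun (E), whereas the sketch still admits E without D; a real matcher might want to rule that combination out as well.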

WITHOUT

This field codes which (if any) field values are to be excluded in the matching process. Since its information always refers to some other fields, it is expressed as follows:

COLUMN_NAME=value1,value2,...

It is also possible to annotate restrictions pertaining to more than one field by separating them with a vertical bar (|), as in FEATS:

COLUMN_NAME1=value1,value2|COLUMN_NAME2=value1

For instance, we want to exclude some specific numerals from the open slot of the idiom Num N in croce 'barely/only Num of Ns, a few Ns', since it would be impossible to have zero 'zero' and uno 'one':

# ...
# cxn = Num N in croce
# function = barely ref:A of ref:B/a few ref:B
# ...

ID    UD.FORM    LEMMA     UPOS    ...   REQUIRED    WITHOUT
A       _         _        NUM     ...    1          LEMMA='zero','uno'
B       _         _        NOUN    ...    0          _
C      in         in       ADP     ...    1          _
D      croce      croce    NOUN    ...    1          _

In some cases, it could be necessary to constrain not the token itself, but its possible children nodes (i.e., tokens governed by that token). In these cases, we use the prefix CHILDREN: before the value of the field. The most common case is specifying that a token cannot have children bearing some specific relation to it, as in intransitive constructions, in which no token may be linked to the verb by an obj relation. Another possible case is that of impersonal constructions with si, where we exclude tokens linked to the verb by a subj relation:

# ...
# cxn = impersonal si construction
# ...

ID    UD.FORM    LEMMA     UPOS    ...   REQUIRED    WITHOUT
A      si         si       PRON    ...    1          _
B       _         _        VERB    ...    1          CHILDREN:DEPREL=subj

or, in the case of complement-taking predicates like sembra 'it seems' used as parenthetical verbs, we could exclude the possibility to have any kind of children:

# ...
# cxn = parenthetical sembra
# ...

ID    UD.FORM    LEMMA      UPOS    ...   REQUIRED    WITHOUT
A       _        _          VERB    ...    1          _
B       _        sembrare   VERB    ...    1          CHILDREN:DEPREL=.+

SEM_FEATS

This field is a semantic "counterpart" of the field FEATS. Here, we specify semantic constraints on the open slots in the constructions that are not covered in FEATS. Since there are currently no semantically annotated corpora for Italian, we chose to employ tagsets from resources that are commonly used and/or can possibly be linked to the UD project. However, such constraints do not actually constrain the matching process at the moment, and serve rather as a description of the constructions.
We used two main sources for the tagsets: Open Multilingual Wordnet Topics describe the ontological class of the token, while we adopted the tagset for Aktionsart from the Unimorph project. We illustrate the semantic features and their tagsets in a dedicated page.

Semantic features
Nouns       Verbs        Adjectives   Adverbs
OntoClass   OntoClass    OntoClass    AdvClass
            Aktionsart   AdjClass

Tab 3. List of Semantic features by part-of-speech

The annotation for this field is formally identical to the one in FEATS, and it is also possible to annotate the same token for multiple features by separating them with a vertical bar (|). Below, we see Light Verb Constructions with fare 'do' and nouns of psychological state:

# ...
# cxn = fare Npsych
# function = cause to feel ref:B
# ...

ID    UD.FORM    LEMMA      UPOS    ...   SEM_FEATS    
A       _        fare       VERB    ...    _
B       _        _          NOUN    ...    OntoClass=feeling

or the progressive construction (stare + Gerund 'be Ving'), that is acceptable in standard Italian only with dynamic verbs:

# ...
# cxn = stare Vger
# function = be doing ref:B
# ...

ID    UD.FORM    LEMMA      UPOS    ...   SEM_FEATS    
A       _        stare      VERB    ...    _
B       _        _          VERB    ...    Aktionsart=DYN

At the moment, this annotation mainly targets nouns and verbs, since there are no shared semantic classifications for adjectives and adverbs. In the case of adjectives, it is possible to use the three classes found in MultiWordNet (all, relational, participial), but this classification is not fully semantic in nature, and in some cases it could be necessary to adopt a finer-grained tagset: see for example the evaluative construction with the suffix -astro, which expresses approximation when used with colour adjectives:

# ...
# cxn = ADJastro
# function = not properly ref:A-1
# ...

ID    UD.FORM    LEMMA      UPOS    ...   SEM_FEATS    
A       _        .+astro    ADJ     ...    _
A-1     _        _          ADJ     ...    AdjClass=colour
A-2     _        astro      BMORPH  ...    _

Thus, we are considering using some tagsets from the literature or from projects on other languages, even though they do not come from resources for Italian.

SEM_ROLES

In this field, we annotate the semantic roles of the participants, mainly in argument structure constructions. The semantic role is specified both in open and filled slots. At the moment, this field is not intended to enhance the matching process; instead this field could be employed to annotate the roles in argument structure constructions matched in corpora.

We decided to use an adapted version of the role hierarchy from UVI (Unified Verb Index). For an in-depth description of the tagset we refer to the dedicated page of the wiki.

1st layer   2nd layer      3rd layer                               4th layer
Affector    Causer         Agent                                   Co-agent
            Stimulus
            Precondition
Undergoer   Pivot
            Instrument
            Patient        Co-patient
                           Experiencer
            Theme          Co-theme
                           Topic
                           Asset
            Beneficiary    Maleficiary
            Eventuality    Subeventuality
Property    Attribute
            Manner
            Value          Extent                                  Duration
Place       Locus          Location                                Axis
                                                                   Initial_location Initial_location_st
                                                                   Destination Destination_st
            Source         Initial_location Initial_location_dyn
                           Material
                           Initial_state
            Goal           Destination Destination_dyn
                           Recipient
                           Result                                  Product
            Trajectory

Tab 4. Hierarchy of Semantic roles from UVI (adapted from UVI website).

We chose the UVI tagset for two reasons:

  • it is organized hierarchically;
  • it can be mapped onto tagsets used in other resources.

For the first point, Table 4 shows the hierarchical organization of semantic roles: the hierarchy has max. 4 layers, and all the roles come down to four macroroles (Affector, Undergoer, Property, Place). In this way, it is possible to annotate the semantic roles in constructions at different levels of generality and abstraction. For example, in the passive construction with venire 'come' as auxiliary, the participant expressed by the da N prepositional phrase could be an animate Affector (i.e., an Agent), but also an inanimate one (a Causer), or an event (a Precondition). Thus, we could choose to use the Affector role, since it is the most general one:

# ... 
# name = venire Passive Construction
# ...

ID    UD.FORM   LEMMA     UPOS                 ...    HEAD   DEPREL       ...   SEM_ROLES  
A        _       _        NOUN, PROPN, PRON    ...    C      nsubj:pass   ...   Patient
B        _       venire   AUX                  ...    C      aux:pass     ...   //
C        _       _        VERB                 ...    0      root         ...   //
D        _       da       ADP                  ...    E      case         ...   //
E        _       _        NOUN, PROPN, PRON    ...    C      obl          ...   Affector

As for the second point, not only is UVI linked to VerbNet, FrameNet, and PropBank, but we also propose a tentative mapping with the participant roles listed in Croft's (2022) glossary of Comparative Concepts, which are currently used in the MoCCA project (A Model of Comparative Concepts for Aligning Constructicons). This gives us the possibility of aligning our constructions in the future with entries from other resources (be they Constructicons for other languages or not).

As shown in Table 4, we slightly modified the original tagset. More specifically:

  • we deleted Locus, since no definition of this role was given in UVI, nor examples containing this role were found in the database; thus, it was judged to be redundant with respect to Location.
  • we deleted Maleficiary since it encoded a too fine-grained distinction with respect to Beneficiary.
  • we deleted all the labels marking the presence of a second participant partially having the same role as another one in the sentence (i.e., Co-Agent, Co-Theme, Co-Patient, Subeventuality), since they would probably be unused in abstract and semi-schematic constructions. A possibility could be to introduce a Comitative label as a generalization of the Co-* roles.
  • we split Initial_location and Destination, making a distinction between their use in states and in dynamic events. We made this choice since both labels were used in more than one subcategory of Place, and no clear distinction was drawn between, e.g., Initial_location as a subtype of Location and as a subtype of Source. Thus, we decided to consider Initial_location and Destination as subtypes of Location when they are employed in stative situations, and as subtypes of Source and Goal, respectively, when they appear in dynamic events.

ADJACENCY

The field constrains the linear adjacency of a token to other tokens of the construction. When it is not possible to have elements intervening between the token annotated and the preceding one (i.e., the left-adjacent token), the field is filled with the ID of the left-adjacent token. For instance, in the idiom Num N in croce 'barely/only Num of Ns, a few Ns', it is not possible to separate the noun (when present) from in, nor in and croce:

# ...
# cxn = Num N in croce
# function = barely ref:A of ref:B/a few ref:B
# ...

ID    UD.FORM    LEMMA     UPOS     ...    ADJACENCY
A       _         _        NUM      ...    _     
B       _         _        NOUN     ...    _     
C      in         in       ADP      ...    B      
D      croce      croce    NOUN     ...    C      
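
Once a construction has been matched on a sentence, the ADJACENCY constraint reduces to a comparison of linear positions. The sketch below assumes that the match is available as a mapping from construction token IDs to 0-based sentence positions, and that the constraint is simply skipped when one of the two tokens is not realized; both are assumptions of this example, and adjacency_ok is a hypothetical name.

```python
def adjacency_ok(tokens, positions):
    """tokens: dicts with ID and ADJACENCY; positions: matched token ID -> sentence index."""
    for t in tokens:
        left = t["ADJACENCY"]
        if left in ("_", "//"):
            continue  # no adjacency constraint on this token
        if t["ID"] not in positions or left not in positions:
            continue  # assumption: nothing to check if either token is not realized
        if positions[t["ID"]] != positions[left] + 1:
            return False
    return True


num_n_in_croce = [
    {"ID": "A", "ADJACENCY": "_"},
    {"ID": "B", "ADJACENCY": "_"},
    {"ID": "C", "ADJACENCY": "B"},
    {"ID": "D", "ADJACENCY": "C"},
]
# Hypothetical match in which the four tokens occupy consecutive positions.
contiguous = adjacency_ok(num_n_in_croce, {"A": 0, "B": 1, "C": 2, "D": 3})
```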

This field is annotated also in morphological constructions.

IDENTITY

In this field we annotate coindexation phenomena. As in WITHOUT, in this field we specify information related to other columns, since coindexation can happen at different levels (morphosyntactic features, ontological class, lemma, etc.). Thus, the values in IDENTITY are expressed as follows:

COLUMN_NAME=token_ID

where COLUMN_NAME is the name of the field in which coindexation takes place, and token_ID is the ID of the token whose values also apply to the annotated token. For instance, in the discontinuous reduplication construction Noun non Noun 'not properly a Noun', the second noun (C) must appear in the same form as the first one (A):

# ...
# cxn = N non N
# function = not properly a ref:A
# ...

ID    UD.FORM    LEMMA     UPOS     ...    IDENTITY
A       _         _        NOUN     ...    _     
B      non       non       ADV      ...    _     
C       _         _        NOUN     ...    UD.FORM=A          
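
A coindexation constraint of this kind can be verified on a candidate match by comparing the relevant column of the two UD tokens. In the sketch below, the mapping from our column names to the keys of a UD token dictionary (UD_KEY) and the exact, case-sensitive comparison are assumptions made for the example; identity_ok is a hypothetical name.

```python
# Assumed mapping from our column names to the keys of a parsed UD token.
UD_KEY = {"UD.FORM": "form", "LEMMA": "lemma", "UPOS": "upos"}


def identity_ok(tokens, match):
    """tokens: dicts with ID and IDENTITY; match: token ID -> UD token (dict)."""
    for t in tokens:
        if t["IDENTITY"] in ("_", "//"):
            continue
        column, _, other = t["IDENTITY"].partition("=")
        key = UD_KEY[column]
        # The annotated token must share the value of `column` with token `other`.
        if match[t["ID"]][key] != match[other][key]:
            return False
    return True


n_non_n = [
    {"ID": "A", "IDENTITY": "_"},
    {"ID": "B", "IDENTITY": "_"},
    {"ID": "C", "IDENTITY": "UD.FORM=A"},
]
# Hypothetical match for "amico non amico".
same_form = identity_ok(n_non_n, {
    "A": {"form": "amico"}, "B": {"form": "non"}, "C": {"form": "amico"},
})
```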

4. Metadata

While the Fields contain information and constraints related to the tokens, in the metadata we include information related to the construction as a whole. Metadata express database-related information (e.g., the identifier), as well as linguistic and network-related information. As we already mentioned (Section 2), metadata consist of comment lines, preceded by a hash (#), that are placed above the columns containing the Fields. The number of comment lines for each construction is variable, but a construction must contain at least three pieces of information, that is, cxn_id, name, and function.

Two other optional fields, horizontal_links and vertical_links, describe how the construction is connected to other constructions in the database.

Finally, there is the possibility to add a number of optional metadata that describe holistic properties of the construction.

cxn_id

The unique and stable identifier for the constructional entry, corresponding to the one automatically assigned in the entry for the construction.

name

The "human-intelligible" name for the construction, corresponding to the one chosen in the entry for the construction.

function

A natural language string that briefly defines the function of the construction. Here, by function we mainly refer to the meaning in the narrow sense, leaving out as much as possible the description of pragmatic and information structural aspects. Differently from the Definition field in the constructional entry, here the definition should be as concise as possible, consisting, when possible, in a paraphrase of the constructional meaning. We do not follow any semantic formalism, in order to avoid limitations in the expression of the range of possible functions.

The string can include references to specific tokens (in the form of ref:token_ID) to map the functional information on constructional elements:

# cxn_id = 1
# name = fare una PROPN-ata
# function = ref:A does something typical of ref:D-1
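
Since token references follow a fixed shape, they can be extracted from the function string with a regular expression, for example to verify that every referenced token actually exists in the graph. The pattern below assumes IDs of the form described in Section 3 (one uppercase letter, optionally followed by a hyphen and an integer); the helper names are hypothetical.

```python
import re

# One uppercase letter, optionally followed by "-<integer>" for subword tokens.
REF = re.compile(r"ref:([A-Z](?:-\d+)?)")


def referenced_tokens(function):
    """Return the token IDs mentioned in a function string, in order of appearance."""
    return REF.findall(function)


def dangling_refs(function, token_ids):
    """Return the referenced IDs that do not correspond to any token of the graph."""
    return [ref for ref in referenced_tokens(function) if ref not in token_ids]


refs = referenced_tokens("ref:A does something typical of ref:D-1")
```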

Links

Two optional comment lines specify the links included in our database, i.e., horizontal_links and vertical_links. These metadata encode the same information contained in the database entry for the constructions, so we refer to the relevant sections in the wiki for a more in-depth description and the theoretical justification of the fields.

By including links to other constructions in the formalizations, we aim to build a graph connecting the constructions in our database. This graph differs from the construction-graphs in that it works both on the paradigmatic level and at different levels of abstraction, while construction-graphs work on the syntagmatic level. The graph will include two kinds of edges, represented by the two types of metadata we describe below.

horizontal_links

Comment line that contains the cxn_ids of related constructions at the same level of abstraction as the one in question. The linked constructions have some kind of paradigmatic/synonymic relation with the construction annotated.

Horizontal links are bidirectional, and thus if a construction X is horizontally linked to a construction Y, we will annotate the relation in both the formalizations.

In case of more than one id, we use a space as separator:

# horizontal_links = 1 34 39

vertical_links

Comment line that contains the cxn_ids of related constructions at a higher level of abstraction from the one in question. The construction annotated instantiates or is a polysemic/metaphorical extansion of the linked constructions.

Note that, unlike horizontal_links, vertical_links is unidirectional: we only annotate higher-level constructions in the formalization of their daughters, and not vice versa.

In case of more than one id, we use a space as separator:

# vertical_links = 10
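Since vertical links point only from daughters to higher-level constructions, the more abstract constructions above a given entry can be collected by following the links upward. The sketch below assumes that one wants to follow the links transitively, which the guidelines do not state. The function name `ancestors` and the ID 42 are hypothetical.

```python
def ancestors(cxn_id, vertical_links):
    """Collect every construction reachable by following vertical links upward."""
    seen = set()
    stack = list(vertical_links.get(cxn_id, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(vertical_links.get(parent, ()))
    return seen


# Invented data: construction 1 links to 10, which in turn links to 42.
vertical = {1: [10], 10: [42], 42: []}
print(sorted(ancestors(1, vertical)))
# → [10, 42]
```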

Holistic properties

Holistic properties are a group of metadata that encode, at the construction level, the same features that Fields normally express for individual tokens.

This kind of information becomes very relevant when we look at the behaviour of multiword expressions (MWEs). In many cases, the behaviour of an MWE and the kind of relation it has with the rest of the sentence differ radically from what the features of its parts predict.

An example is the MWE giorno dopo giorno 'day after day', which consists of two nouns and a preposition but acts as an adverb, and thus should as a whole bear an advmod relation to the verb. Ideally, we should keep the MWE together as a single layer with its own features, since the adverbial function is not attributable to any of its elements:

# treebank = PoSTWITA
# sent_id = 4287
# text = giorno dopo giorno mi accorgo quanto Louis sia importante per me 
# text_translation = 'Day after day I realise how important Louis is to me' 

ID	FORM	               LEMMA	UPOS	...	HEAD	DEPREL
1-3	giorno dopo giorno	_	ADV	...	5	advmod
1	giorno	              giorno	NOUN	...	_	_
2	dopo	              dopo	ADP	...	_	_
3	giorno	              giorno	NOUN	...	_	_
4	mi	              mi	PRON	...	5	expl
5	accorgo	              accorgere	VERB	...	0	root
...	...	              ...       ...	...	...	...

However, the UD format is based on a lexicalist approach to syntax, so it cannot capture information such as the output category for units above the word level. This is not a problem for morphological constructions, since we can encode output information at the word level in a UD-compatible format. MWEs like giorno dopo giorno, however, are treated compositionally in UD-annotated sentences, which yields a misleading analysis of the construction, as shown by the annotation of the sentence in PoSTWITA:

# sent_id = 4287
# text = Giorno dopo giorno mi accorgo quanto Louis sia importante per me 
# text_translation = 'Day after day I realise how important Louis is to me'

ID	FORM	               LEMMA	UPOS	...	HEAD	DEPREL
1	giorno	              giorno	NOUN	...	5       obl
2	dopo	              dopo	ADP	...	3	case
3	giorno	              giorno	NOUN	...	1	nmod
4	mi	              mi	PRON	...	5	expl
5	accorgo	              accorgere	VERB	...	0	root
...	...	              ...       ...	...	...	...

Since we cannot overcome this problem at the token level, and in order to keep the patterns matchable, we propose to encode this information as metadata: a comment line names the relevant Field and specifies its value at the constructional level. In our example, we annotate that the MWE as a whole works as an adverb and stands in a relation of adverbial modification to its head in the sentence.

# UPOS = ADV
# DEPREL = advmod 
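A minimal sketch of how such construction-level comment lines might be read into a dictionary is given below. The function name `parse_metadata` is hypothetical. The tuple `HOLISTIC_FIELDS` is an assumption that covers only the two Fields used in this example. It is not an exhaustive list from the guidelines.

```python
def parse_metadata(lines):
    """Collect '# key = value' comment lines into a dictionary."""
    metadata = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("#"):
            continue  # token lines and blank lines are ignored here
        key, sep, value = line[1:].partition("=")
        if sep:
            metadata[key.strip()] = value.strip()
    return metadata


# Assumed subset of Field names that may appear as holistic properties.
HOLISTIC_FIELDS = ("UPOS", "DEPREL")

meta = parse_metadata(["# UPOS = ADV", "# DEPREL = advmod "])
holistic = {key: value for key, value in meta.items() if key in HOLISTIC_FIELDS}
print(holistic)
# → {'UPOS': 'ADV', 'DEPREL': 'advmod'}
```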

Depending on the case, we could also specify information such as the semantic class, aktionsart, and so on. As we saw, expressing holistic properties of this kind is unproblematic for morphological constructions, since the fields are available for the full word:

# name = ADJmente
# function = in a ref:A-1 way

ID     FORM         LEMMA      UPOS     ...
A      .+mente      .+mente    ADV      ...
A-1     _           _          ADJ      ...
A-2     _           mente      BMORPH   ...
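To illustrate how a word-level token line like A above could be checked against a parsed word, the sketch below treats FORM as a regular expression that must match the whole word form and requires UPOS to be identical. Both the function `matches_token` and the full-match semantics are assumptions made for illustration, not the project's actual matcher.

```python
import re


def matches_token(pattern_form, pattern_upos, form, upos):
    """Check a word against a token pattern: regex on FORM, equality on UPOS."""
    form_ok = re.fullmatch(pattern_form, form) is not None
    return form_ok and pattern_upos == upos


print(matches_token(".+mente", "ADV", "lentamente", "ADV"))
# → True
print(matches_token(".+mente", "ADV", "demente", "ADJ"))
# → False
```

The second call shows why the UPOS constraint matters: the adjective demente 'demented' satisfies the regular expression on its own, so a form-only match would be noisy.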

Footnotes

[1]: Bedire et al. (2021)

[2]: Zeman (2023)

[3]: Guillaume et al. (2024)

[4]: Obviously, employing regular expressions yields noisy results. A possible option in the future would be to use the formalism employed by the morphological analyzer AnIta (Tamburini & Melandri 2012), which in our example would correspond to in>.+. However, currently there are no fully morphologically annotated corpora for Italian.

[5]: This differs from the approach in Guillaume et al. (2024), where the UPOS X is employed. This tag is generally used in UD for elements that cannot be assigned a part of speech. Thus, the status of the subword element is left unspecified here, and later specified in the FEATS field through the proposed feature TokenType (with the possible values: Root, InflAff, DerAff, Word).

[6]: Different solutions to this problem have been proposed in the literature. Bedire et al. (2021) use dep:der to express the dependency between derivational affixes and the stem. Zeman (2023) only presents a compounding case, in which a canonical syntactic dependency relation is used, and employs wroot to mark the head of the compound. The solution closest to ours is proposed by Guillaume et al. (2024). We adopted their proposal of marking subword relations with /m, and the idea of employing syntactic dependencies in compounds. There are, however, some differences: for derivation and subordinate compounds they use an original label, comp/m, which in the case of compounds is further specified with the kind of subordinative relation (e.g., in babysitter, baby has a relation of comp:obj/m to sitter).