Alignment is an important part of UMR annotation that anchors UMR concept nodes to the sentence. Unlike in AMR, alignments in UMR are supposed to be available for every annotated sentence and are stored directly in the same file as the graphs. However, at present they are not documented in the UMR annotation guidelines, so we have to guess the rules from the released data or from what the UMR annotation tool does.
A UMR file has four annotation blocks for each sentence. Each block starts
with a comment line (first character is #) and ends with an empty line. The
alignment is described in the third block but the first block is important as
well because it defines the tokens to which the alignment refers. The format
of the first block is not unified and varies across languages in UMR 1.0. The
important point is that the surface sentence must be presented as a sequence
of tokens where neighboring tokens must be separated by a space character. So
for example, we must insert a space between a word and a following comma.
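The block and token conventions above can be sketched in Python as follows. This is only an illustration: the function names `split_blocks` and `index_tokens` are hypothetical and not part of any UMR tool, and because the format of the first block varies across languages, the sketch assumes that the tokenized sentence has already been extracted from it as a plain string.

```python
def split_blocks(text):
    """Split file content into blocks separated by empty lines.

    Each block is returned as a list of its lines; per the description
    above, the first line of a block is expected to start with '#'.
    """
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip() == "":
            if current:
                blocks.append(current)
                current = []
        else:
            current.append(line)
    if current:  # the last block may lack a trailing empty line
        blocks.append(current)
    return blocks


def index_tokens(sentence):
    """Map 1-based token indices to the space-separated tokens."""
    return {i: tok for i, tok in enumerate(sentence.split(), start=1)}
```

For the sentence "Hello , world ." this yields four tokens, with the comma at index 2; alignment ranges refer to these indices.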
The third block has as many non-comment lines as there are concept nodes in
the sentence-level graph. The order of the lines (nodes) is not significant.
Each such line starts with a node id (variable), followed by a colon and a
space, followed by numeric references to token ranges. For example, the
following line says that node s16p is aligned to the first token of the
sentence (whose index is 1, not 0):
s16p: 1-1
Note that even if the node is aligned to a single token, it is still presented as a range.
Sometimes a node corresponds to multiple tokens that are not adjacent in the sentence. It means that we need discontinuous alignments. Nothing similar occurs in the UMR 1.0 data, so we define the notation here. An alignment line may contain multiple ranges separated by a comma and a space. When this happens, the sub-ranges must be ordered by the token numbers and the first token number of a sub-range must be higher than the last token number of the previous sub-range + 1 (that is, there must be a gap containing at least one token, otherwise the sub-ranges could be merged).
s16p: 1-2, 4-4
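As an illustration of the notation, the following Python sketch turns a set of 1-based token indices into an alignment line with the minimal number of sub-ranges. The function name `format_alignment` is hypothetical; the sketch assumes at least one positive token index is given.

```python
def format_alignment(node_id, token_indices):
    """Format token indices as 'node: a-b, c-d' with merged sub-ranges."""
    indices = sorted(set(token_indices))
    if not indices or indices[0] < 1:
        raise ValueError("expected at least one token index >= 1")
    ranges = []
    start = prev = indices[0]
    for i in indices[1:]:
        if i == prev + 1:
            prev = i  # adjacent token: extend the current sub-range
        else:
            ranges.append((start, prev))  # gap: close the sub-range
            start = prev = i
    ranges.append((start, prev))
    return node_id + ": " + ", ".join(f"{a}-{b}" for a, b in ranges)
```

Because adjacent indices are always merged, the output satisfies the ordering and gap requirements by construction.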
Abstract concepts (reifications, discourse relations etc.) that do not have a corresponding token on the surface still have an alignment line but their alignment is 0-0. Naturally, 0 cannot be combined with real token ranges, hence 0-1 would be illegal.
Certain special nodes (such as the name concept attached via a :name relation
to a concept representing a named entity) are anchored to "-1" instead of 0,
so their range is -1--1. See also issue #2 in the UMR annotation repository.
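The format rules described so far can be checked mechanically. The Python sketch below parses one alignment line and validates it; the names `parse_alignment_line`, `RANGE_RE` and `SPECIAL` are hypothetical, and the checks encode only our reading of the rules above (sub-ranges ordered and separated by a gap, 0-0 and -1--1 allowed only on their own).

```python
import re

# One sub-range such as "1-2", "0-0" or "-1--1".
RANGE_RE = re.compile(r"(-?\d+)-(-?\d+)")
# Alignments that do not refer to real tokens.
SPECIAL = ([(0, 0)], [(-1, -1)])


def parse_alignment_line(line):
    """Return (node_id, [(start, end), ...]) or raise ValueError."""
    node_id, sep, rest = line.strip().partition(": ")
    if not sep or not node_id:
        raise ValueError(f"missing 'node: ' prefix in {line!r}")
    ranges = []
    for part in rest.split(", "):
        match = RANGE_RE.fullmatch(part)
        if match is None:
            raise ValueError(f"malformed range {part!r}")
        ranges.append((int(match.group(1)), int(match.group(2))))
    if ranges in SPECIAL:
        return node_id, ranges
    previous_end = None
    for start, end in ranges:
        if start < 1 or end < start:
            raise ValueError(f"illegal range {start}-{end}")
        # A gap of at least one token is required between sub-ranges.
        if previous_end is not None and start <= previous_end + 1:
            raise ValueError("sub-ranges overlap, touch or are unordered")
        previous_end = end
    return node_id, ranges
```

With these checks, a line such as "s16p: 1-2, 4-4" is accepted, while 0-1 and the mergeable "1-2, 3-4" are rejected.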
??? – to investigate
??? – to investigate whether two nodes can map to the same token
These are our (ÚFAL) guidelines. They may be inspired by what we saw in UMR 1.0 but they do not attempt to mimic exactly the approach taken there.
- The easiest alignment is between a content word and the concept node that represents it: entities to nouns, states to adjectives or verbs, and processes to verbs.
- Overtly expressed discourse connectives often have their own nodes, too.
- Auxiliary verbs are aligned together with the main verb to the same event concept. The same holds for non-referential reflexive markers (smát se “to laugh”) and for verbal particles (come up).
- Some prepositions may have their own concept nodes. If they do not, then they should be aligned to the same node as their noun (they are like case markers in other languages). Note that this may lead to discontinuous alignment if there is an adjective between the preposition and the noun.
- Subordinating conjunctions are to clauses what prepositions are to nominals, so we might treat them accordingly and align them with verbs, unless they have their own node. This would be parallel to languages where subordination is marked morphologically on the verb.
- Numerical quantities do not have their own concept node because they are annotated as numerical :quant attributes, e.g. (s1h / house :quant 10). Therefore we should align s1h to the whole expression ten houses. (This differs from approximate quantities, which have their own node attached by the :quant relation; e.g. (s1h / house :quant (s1s / several)) will have s1h aligned to houses and s1s to several.)
- Punctuation tokens are normally not aligned with nodes. The exception is when a node is aligned to a range of tokens with a punctuation symbol somewhere in the middle, so that excluding it from the alignment would break the otherwise contiguous alignment into two sub-ranges.
- Reifications (the *-91 event concepts) are meant as abstract concepts, meaning that they typically do not have a corresponding token. However, if there is a token that is not aligned to anything else and that gave rise to the event, we should align it with the *-91 node. In particular, the copula (být “to be”) will often correspond to have-mod-91.
- The abstract concepts person, thing etc. may be aligned to overtly expressed pronouns. If the concept is only inferred from morphological agreement marked on the verb, it will stay unaligned.
- A somewhat paradoxical situation arises with named entities. Typically there is an abstract concept (person, organization etc.) with a name child node. The abstract parent is aligned to the name tokens in the sentence. The name child stays unaligned, although it directly points to the orthographic words of the name via its :opN attributes. (This rule is inferred from the data released in UMR 1.0.)
  - Nevertheless, there are situations when a name node is aligned to the name tokens. If the parent node has other children and they are aligned, then the parent node will not be aligned to the name tokens, hence the name node will align with them. For example, the Philippine island of Leyte is analyzed as (s4i2 / island :wiki "Leyte" :name (s4n2 / name :op1 "Leyte") :place (s4c / country :wiki "Philippines" :name (s4n3 / name :op1 "Philippine"))) where s4i2 is aligned to island, s4n2 to Leyte, s4c to Philippine, and s4n3 is unaligned.
- More generally, the approach in UMR 1.0 seems to be:
  - If the parent node covers the same tokens as its child node, the alignment will be assigned to the parent and the child will be formally unaligned. For example (english_umr-0003, snt8), doctor is represented as (s8p2 / person :ARG1-of (s8h / have-role-91 :ARG3 (s8d / doctor))); the token is aligned to s8p2, while s8h and s8d are unaligned.
  - If the parent node has multiple children that together completely cover the parent's span, the alignment will be assigned to the children and the parent will be formally unaligned. For example (english_umr-0003, snt6), next several days is represented as (s6t / temporal-quantity :quant (s6s2 / several) :unit (s6d / day) :mod (s6n / next)); here, s6s2, s6d and s6n are aligned to their respective tokens while s6t is unaligned.
- While the above rules strive to align as many non-punctuation tokens as possible, it is not required that all of them are aligned to concepts. There may be words that are not even distantly related to any individual node; such words will stay unaligned.
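The punctuation exception in the list above lends itself to a small sketch. The Python function below is one possible reading of that rule, not an established procedure: the name `bridge_punctuation` is hypothetical, and we assume that a gap between two aligned tokens is absorbed only if every token in the gap is punctuation.

```python
def bridge_punctuation(aligned, punctuation):
    """Add punctuation tokens that lie between aligned tokens.

    aligned:     1-based indices of tokens aligned to one node
    punctuation: set of 1-based indices of punctuation tokens
    Returns the sorted indices, with punctuation-only gaps filled in.
    """
    indices = sorted(set(aligned))
    result = set(indices)
    for left, right in zip(indices, indices[1:]):
        gap = range(left + 1, right)  # empty if the tokens are adjacent
        if all(i in punctuation for i in gap):
            result.update(gap)
    return sorted(result)
```

If tokens 3 and 5 are aligned to a node and token 4 is a comma, the result covers tokens 3 to 5 and can be written as the single range 3-5. Punctuation outside the aligned span, or a gap containing a non-punctuation token, is left untouched.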