Regex group matching with enhanced parentage constraint only works with corresponding basic parentage constraint #46

Open
nschneid opened this issue Dec 24, 2024 · 9 comments

@nschneid
Contributor

I am trying to match and reattach all enhanced edges meeting certain criteria, copying the edeprel, but there is an error about a missing regex bracket group. See the commented-out line:

; invert headedness. Apply only to the first flat dependent after the honorific 
upos=/PROPN/&lemma=/{St\.?}/&edep=/(.*)/&deprel=/(.*)/&storage!=/HONORIFIC/;func=/flat/;edep=/(.*)/	#1>#2;#1~#2;#3>#1	#2>#1;#1:deprel=nmod:desc;#1:edep=OLD$1;#2~#1;#1:edep=nmod:desc;#3>#2;#2:deprel=$2;#2:edep=OLDflat;#3~#2;#2:edep=$1;#1:storage=HONORIFIC

; reattach other dependents including other flat dependents. may be some amods that are incorrectly moved
;storage=/HONORIFIC/&deprel=/nmod:desc/;upos=/.*/;edep=/(.*)/	#2>#1;#1~#3	#2~#3;#3:edep=$1
; the above line produces an error that I think is a bug: The action '#3:edep=$1' refers to a missing regex bracket group '$1'
; works when the edep constraint #1~#3 is tied to a basic deprel constraint #1>#3 as follows:
storage=/HONORIFIC/&deprel=/nmod:desc/;upos=/.*/;edep=/(.*)/	#2>#1;#1>#3;#1~#3	#3:edep=OLD$1;#2~#3;#3:edep=$1

storage=/HONORIFIC/&deprel=/nmod:desc/;upos=/.*/;upos=/.*/	#2>#1;#1>#3	#2>#3
@nschneid
Contributor Author

nschneid commented Dec 25, 2024

I think this is the only issue blocking nmod:desc implementation for honorific titles in EWT. There needs to be a way to reattach enhanced deps that are there due to relative clauses, coordination, and control involving the name.

If the bracket group matching problem is thorny I would actually prefer a more direct syntax for this, as discussed in #40.

@amir-zeldes
Owner

amir-zeldes commented Dec 25, 2024

I think the issue is that you're using edep for the capture replacement rather than edom. From the documentation:

  • edep – this field searches over potentially multiple enhanced dependency labels, meaning that edep=/nsubj:xsubj/ will match any tokens which have at least one enhanced edge with the label nsubj:xsubj
  • edom – this field also searches over all enhanced edges, but it can be captured in order to simultaneously assign an edge parent and relation. For example the matcher edom=/(.*subj:xsubj)/ will capture both the designated label AND its enhanced parent, which can then be assigned to another token as an action (#2:edom=$1)

The problem is that, unlike other fields, DEPS contains multiple edge and label annotations, so it's hard to operate on it in a simple way. I haven't implemented the notation suggested in #40 yet, though it's still conceivable (but it's not likely I'll do this very soon; it seems nontrivial and I have a bunch of other things in the queue... PRs welcome though!)
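
To illustrate why DEPS is harder to edit than single-valued fields, here is a minimal Python sketch that splits a DEPS value into (head, label) pairs following the CoNLL-U format. The function name `parse_deps` and the example value are made up for illustration; this is not DepEdit's internal code.

```python
def parse_deps(deps: str) -> list[tuple[str, str]]:
    """Split a CoNLL-U DEPS value into (head_id, label) pairs."""
    if deps in ("_", ""):
        return []  # no enhanced edges on this token
    pairs = []
    for edge in deps.split("|"):
        # Split on the first colon only: the label may contain colons itself.
        head, label = edge.split(":", 1)
        pairs.append((head, label))
    return pairs


# Hypothetical token with two enhanced edges
edges = parse_deps("5:nsubj|13:nsubj:xsubj")
labels = [label for _, label in edges]
print(edges)
print(labels)
```

A label-only view of these edges drops the head IDs, which matters once a token has two edges with the same label coming from different heads.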

@nschneid
Contributor Author

nschneid commented Dec 25, 2024

Maybe I misunderstood the edom example in the docs. Does .* match the beginning of the enhanced edge string including head ID, e.g. 13:n in 13:nsubj:xsubj? I was trying to add an edge with a ~ action so I figured edep would be the way to get just the edeprel part of the string.

Note that

storage=/HONORIFIC/&deprel=/nmod:desc/;upos=/.*/;edep=/(.*)/	#2>#1;#1>#3;#1~#3	#3:edep=OLD$1;#2~#3;#3:edep=$1

does work, but including the #1>#3 constraint limits the number of rule applications too much.

@nschneid
Contributor Author

nschneid commented Dec 25, 2024

It might also help to give pseudocode in the docs for the rule matching algorithm. I am assuming something like:

for each sentence in data:
    for each rule in script:
        while there is a new set of node and edge bindings such that the constraints in the first 2 columns are met:
            apply modification actions
            if special action `last` is specified:
                continue to next sentence
            if special action `once` is specified:
                continue to next rule

I have wondered about the order and timing of bindings (are nodes matched left to right? computed once per sentence per rule, or with each rule application?). E.g. I was trying to get away with a rule that would apply to the leftmost flat dependent, update the head's storage, and skip subsequent flat dependents, but it didn't work so I ended up making a separate rule explicitly constraining the order of flat dependents.

Also I mentioned "node and edge bindings" contemplating the case where a pair of nodes has two enhanced dependencies between them, so a rule could apply twice to the same nodes. But maybe that is not supported.

@amir-zeldes
Owner

> including the #1>#3 constraint limits the number of rule applications too much

Can you post a minimal conllu input example and note what you expect to happen to the sentence which doesn't? Then I can take a look in the debugger

> (are nodes matched left to right? computed once per sentence per rule, or with each rule application?... the case where a pair of nodes has two enhanced dependencies between them, so a rule could apply twice to the same nodes

Yes, the algorithm you outlined above is right (for each sentence, each rule is applied, unless last or once is used as you wrote). Node and edge matches are computed only once, before the rule is applied, and then it is applied to all matches, so it's not important if node sets are ordered left to right or not - each set of matching nodes will be subjected to the 'action' column directives. There is no iterative changing and rechecking of constraints - they are only tested before a rule is applied. This also means there can never be an infinite loop à la Grew, because any changes triggered by a rule do not impact that rule's inputs anymore.
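
The "match once, then apply to all matches" behavior described above can be sketched in Python as follows. This is only an illustration under simplifying assumptions (one node per match, tokens as plain dicts); `apply_rule` and `mark_and_demote_next` are hypothetical names, not DepEdit functions.

```python
def apply_rule(tokens, constraint, action):
    """Apply `action` to every token index that satisfies `constraint`.

    All matches are collected before any action runs, so an action
    cannot add or remove matches for the same rule.
    """
    matches = [i for i, tok in enumerate(tokens) if constraint(tok)]
    for i in matches:
        action(tokens, i)
    return matches


def mark_and_demote_next(tokens, i):
    # Mark the matched token and change the label of the following token.
    tokens[i]["storage"] = "DONE"
    if i + 1 < len(tokens):
        tokens[i + 1]["func"] = "dep"


sentence = [
    {"form": "St.", "func": "flat"},
    {"form": "Francis", "func": "flat"},
    {"form": "Hospital", "func": "nmod"},
]
matched = apply_rule(sentence, lambda tok: tok["func"] == "flat", mark_and_demote_next)
print(matched)
print(sentence)
```

In this sketch the second token is still acted on even though the first action already changed its `func`, because the match list was fixed beforehand. Under this model, a rule cannot use its own side effects to skip later matches.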

@nschneid
Contributor Author

nschneid commented Dec 26, 2024

> Node and edge matches are computed only once, before the rule is applied, and then it is applied to all matches, so it's not important if node sets are ordered left to right or not - each set of matching nodes will be subjected to the 'action' column directives.

Well, the order will matter for last and once, right? Or just for once (can a last rule apply multiple times)?

Also, I assume the node bindings for any rule application must be distinct (#1 and #2 can never be the same node)—is that correct?

> Can you post a minimal conllu input example and note what you expect to happen to the sentence which doesn't? Then I can take a look in the debugger

Sure, attaching the .conllu and the script (adding .txt extensions for GitHub reasons):

sample.conllu.txt
flat2nmoddesc.ini.txt

With the rule I want to write (R1), it gives an error: The action '#3:edep=OLD$1' refers to a missing regex bracket group '$1'. With the variant (R2), it correctly updates 3 tokens ("at", "St.", "Francis") but not "which", where only the edep should be modified.

@nschneid
Contributor Author

OK, I finally figured out how to use the VS Code debugger and isolated the cause of the problem: sometimes #2 and #3 in the rule are bound to the same node, and this apparently means the bracket group match of #3 is unavailable, triggering the error.

I don't see a direct way to specify a node-inequality constraint so I copied the rule and specified #2.*#3 in the first instance and #3.*#2 in the second. That appears to succeed as a workaround. But I'm guessing #2 and #3 possibly matching the same node wasn't the intended behavior.
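
To make the same-node binding issue concrete, here is a small Python sketch of enumerating variable bindings with and without a distinctness check. It is not DepEdit's matcher; the function `bindings` and the node IDs are invented for illustration.

```python
from itertools import product


def bindings(candidates_per_var, distinct=False):
    """Enumerate assignments of node IDs to rule variables #1, #2, ..."""
    result = []
    for combo in product(*candidates_per_var):
        # Optionally reject assignments that reuse a node for two variables.
        if distinct and len(set(combo)) != len(combo):
            continue
        result.append(combo)
    return result


# Hypothetical candidates: #1 can only be node 4, while nodes 2 and 7 each
# satisfy both a permissive #2 definition and a permissive #3 definition.
candidates = [[4], [2, 7], [2, 7]]
all_bindings = bindings(candidates)
distinct_bindings = bindings(candidates, distinct=True)
print(all_bindings)
print(distinct_bindings)
```

Without the check, assignments such as (4, 2, 2) appear, where #2 and #3 are the same node. The two ordering constraints in the workaround above exclude those assignments because a node cannot precede itself.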

@nschneid
Contributor Author

When setting the new edeprel I can't figure out how to prevent it from overriding a different one. This is happening at least for nodes that have two edeps with the same deprel (an nsubj which corresponds to the basic dep, and an nsubj into the relative clause).

Maybe there's a way to do this by capturing the head ID and setting edom explicitly? I haven't been able to get this to work though (either by capturing num of a node, or capturing head of a node—either way it gets written as 0 in the new edep).

@nschneid
Contributor Author

nschneid commented Dec 30, 2024

> Maybe there's a way to do this by capturing the head ID and setting edom explicitly? I haven't been able to get this to work though (either by capturing num of a node, or capturing head of a node—either way it gets written as 0 in the new edep).

Debugging why it comes out as 0, it seems that internally nodes have high offsets like tokoffset of 2670. When setting edom to $14||NEWEDEP, it is capturing the actual node number of 8, but then tokoffset is subtracted from that so it ends up negative and then is truncated to 0 during serialization. Not sure how to interpret the tokoffset logic.

I would love to be able to substitute a node variable instead: edom=#1||NEWEDEP. But that doesn't work.
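
The arithmetic behind the 0 can be sketched as follows, assuming the behavior is exactly as observed in the debugger above (the offset is subtracted from the captured value and negative results are clamped to 0 on output). `serialize_head` is a hypothetical helper, not a DepEdit function, and the second call is only a guess at what an offset-adjusted input would look like.

```python
def serialize_head(captured_head: int, tokoffset: int) -> int:
    """Convert a head ID to a sentence-relative ID, clamping at 0.

    Assumption: mirrors the observed subtract-then-truncate behavior.
    """
    sentence_relative = captured_head - tokoffset
    return max(sentence_relative, 0)


# Captured sentence-relative node number 8 with an offset of 2670
print(serialize_head(8, 2670))
# Hypothetical: a value that already includes the offset
print(serialize_head(8 + 2670, 2670))
```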
