Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regex fails to match (bad trie optimization?) #22892

Closed
mauke opened this issue Jan 7, 2025 · 1 comment
Closed

regex fails to match (bad trie optimization?) #22892

mauke opened this issue Jan 7, 2025 · 1 comment

Comments

@mauke
Copy link
Contributor

mauke commented Jan 7, 2025

Description
Some regexes involving alternations unexpectedly fail to match.

See also: Initial report by jwz and discussion at https://www.jwz.org/blog/2025/01/now-i-have-two-problems/.

  • "ABCDE" =~ m/ABCF|BCDE|C(G)/ fails
  • "ABCDE" =~ m/ABCF|BCDE|CG/ succeeds
  • "ABCDE" =~ m/ABCF|BCDE|C[Gg]/ fails
  • "ABCDE" =~ m/ABCF|BCD[Ee]|C[Gg]/ succeeds

All of these should match.

The last perl version that handled this correctly was 5.8.9. Everything from 5.10 on seems to be broken, which points to the trie optimization introduced in 5.10.

Steps to Reproduce

$ perl -le 'print 0 + ("ABCDE" =~ m/ABCF|BCDE|C(G)/);'
0

Expected behavior

$ perl -le 'print 0 + ("ABCDE" =~ m/ABCF|BCDE|C(G)/);'
1

Perl configuration

Summary of my perl5 (revision 5 version 40 subversion 0) configuration:
   
  Platform:
    osname=linux
    osvers=6.5.0-10040-tuxedo
    archname=x86_64-linux-thread-multi-ld
    uname='linux luum 6.5.0-10040-tuxedo #44 smp preempt_dynamic wed may 8 17:36:39 utc 2024 x86_64 x86_64 x86_64 gnulinux '
    config_args='-de -Dprefix=/home/mauke/perl5/perlbrew/perls/perl-5.40.0 -Dcc=cgcc -Dman1dir=none -Dman3dir=none -Dusethreads -Duselongdouble -Aoptimize=-flto -Aeval:scriptdir=/home/mauke/perl5/perlbrew/perls/perl-5.40.0/bin'
    hint=recommended
    useposix=true
    d_sigaction=define
    useithreads=define
    usemultiplicity=define
    use64bitint=define
    use64bitall=define
    uselongdouble=define
    usemymalloc=n
    default_inc_excludes_dot=define
  Compiler:
    cc='cgcc'
    ccflags ='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    optimize='-O2 -flto'
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    ccversion=''
    gccversion='11.4.0'
    gccosandvers=''
    intsize=4
    longsize=8
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=16
    longdblkind=3
    ivtype='long'
    ivsize=8
    nvtype='long double'
    nvsize=16
    Off_t='off_t'
    lseeksize=8
    alignbytes=16
    prototype=define
  Linker and Libraries:
    ld='cgcc'
    ldflags =' -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib/x86_64-linux-gnu /usr/lib /usr/lib64
    libs=-lpthread -ldb -ldl -lm -lcrypt -lutil -lc
    perllibs=-lpthread -ldl -lm -lcrypt -lutil -lc
    libc=/lib/x86_64-linux-gnu/libc.so.6
    so=so
    useshrplib=false
    libperl=libperl.a
    gnulibc_version='2.35'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs
    dlext=so
    d_dlsymun=undef
    ccdlflags='-Wl,-E'
    cccdlflags='-fPIC'
    lddlflags='-shared -O2 -flto -L/usr/local/lib -fstack-protector-strong'


Characteristics of this binary (from libperl): 
  Compile-time options:
    HAS_LONG_DOUBLE
    HAS_STRTOLD
    HAS_TIMES
    MULTIPLICITY
    PERLIO_LAYERS
    PERL_COPY_ON_WRITE
    PERL_DONT_CREATE_GVSV
    PERL_HASH_FUNC_SIPHASH13
    PERL_HASH_USE_SBOX32
    PERL_MALLOC_WRAP
    PERL_OP_PARENT
    PERL_PRESERVE_IVUV
    PERL_USE_SAFE_PUTENV
    USE_64_BIT_ALL
    USE_64_BIT_INT
    USE_ITHREADS
    USE_LARGE_FILES
    USE_LOCALE
    USE_LOCALE_COLLATE
    USE_LOCALE_CTYPE
    USE_LOCALE_NUMERIC
    USE_LOCALE_TIME
    USE_LONG_DOUBLE
    USE_PERLIO
    USE_PERL_ATOF
    USE_REENTRANT_API
    USE_THREAD_SAFE_LOCALE
  Built under linux
  Compiled at Jun  9 2024 23:04:17
  %ENV:
    PERLBREW_BASHRC_VERSION="0.74"
    PERLBREW_HOME="/home/mauke/.perlbrew"
    PERLBREW_MANPATH="/home/mauke/perl5/perlbrew/perls/perl-5.40.0/man"
    PERLBREW_PATH="/home/mauke/perl5/perlbrew/bin:/home/mauke/perl5/perlbrew/perls/perl-5.40.0/bin"
    PERLBREW_PERL="perl-5.40.0"
    PERLBREW_ROOT="/home/mauke/perl5/perlbrew"
    PERLBREW_VERSION="0.98"
    PERLDOC="-oman"
    PERL_UNICODE="SAL"
  @INC:
    /home/mauke/perl5/perlbrew/perls/perl-5.40.0/lib/site_perl/5.40.0/x86_64-linux-thread-multi-ld
    /home/mauke/perl5/perlbrew/perls/perl-5.40.0/lib/site_perl/5.40.0
    /home/mauke/perl5/perlbrew/perls/perl-5.40.0/lib/5.40.0/x86_64-linux-thread-multi-ld
    /home/mauke/perl5/perlbrew/perls/perl-5.40.0/lib/5.40.0
@demerphq
Copy link
Collaborator

demerphq commented Feb 2, 2025

Something is borked in the AHO-CORASICK code. I'm surprised this wasn't noticed long ago.

It looks to me like its bailing out of the AHO-CORASICK search too early.

FTR: The general idea of AHO-CORASICK is that we build a fail table so that when we fail to find an accepting transition in a given state we fail to the next most appropriate state. We seem to be doing that, but then bailing out of the process too early. The fail transition to state 7, is simultaneously an accepting state for 'C' at ofs 2, but a continuing state of BCDE on the letter D at ofs 1. For some reason we are exiting right after we accept the C, without realizing we can go further.

I need to dig into this, just leaving a record in case I cant return immediately to it.

Simpler case:

./perl -Ilib -Mre=Debug,TRIE -le 'print $& if ("ABCDEX" =~ m/ABCF|BCDE|C/);
'
Compiling REx "ABCF|BCDE|C"
  Looking for TRIE'able sequences. Tail node is  13:END
  - 1:BRANCH (buf:0/0) (1) -> 3:EXACT <ABCF>	=> 13:END	(First==-1,Last==-1,Cur==1,tt==END,ntt==EXACT,nntt==END)
  - 5:BRANCH (buf:0/0) (5) -> 7:EXACT <BCDE>	=> 13:END	(First==1,Last==-1,Cur==5,tt==EXACT,ntt==EXACT,nntt==END)
  - 9:BRANCH (buf:0/0) (9) -> 11:EXACT <C>	=> 13:END	(First==1,Last==5,Cur==9,tt==EXACT,ntt==EXACT,nntt==END)
  - END (13) <SCAN FINISHED> (First==1, Last==9, Cur==13, tt==EXACT)
    make_trie start==1, first==1, last==13, tail==13 depth=1
    TRIE(NATIVE): W:3 C:9 Uq:6 Min:1 Max:4
      Char : Match Base  Ofs     A   B   C   F   D   E
      State|-----------------------------------------------
      #   1|       @   6 + 0[    2   6   A   .   .   .]
      #   2|       @   8 + 1[    .   3   .   .   .   .]
      #   3|       @   8 + 2[    .   .   4   .   .   .]
      #   4|       @   8 + 3[    .   .   .   5   .   .]
      #   5| W   1 @   0 
      #   6|       @   A + 2[    .   .   7   .   .   .]
      #   7|       @   9 + 4[    .   .   .   .   8   .]
      #   8|       @   9 + 5[    .   .   .   .   .   9]
      #   9| W   2 @   0 
      #   A| W   3 @   0 
    word_info N:(prev,len)= 1:(0,4) 2:(0,4) 3:(0,1)
Stclass Failtable (11 states): 0, 0, 1, 6, 7, 1, 1, 10, 1, 1, 1
Final program:
   1: TRIEC-EXACT<S:1/10 W:3 L:1/4 C:9/6>[ABC] (13)
      <ABCF> 
      <BCDE> 
      <C> 
  13: END (0)
stclass AHOCORASICKC-EXACT<S:1/10 W:3 L:1/4 C:9/6>[ABC] minlen 1 
Matching REx "ABCF|BCDE|C" against "ABCDEX"
Matching stclass AHOCORASICKC-EXACT<S:1/10 W:3 L:1/4 C:9/6>[ABC] against "ABCDEX" (6 bytes)
   0 <> <ABCDEX>             |   0|  Charid:  1 CP:  41 State:    1, word=0 - legal
   1 <A> <BCDEX>             |   0|  Charid:  2 CP:  42 State:    2, word=0 - legal
   2 <AB> <CDEX>             |   0|  Charid:  3 CP:  43 State:    3, word=0 - legal
   3 <ABC> <DEX>             |   0|  Charid:  5 CP:  44 State:    4, word=3 - fail
   3 <ABC> <DEX>             |   0|  Fail transition to State:    7, word=3 - legal
Matches word #3 at position 2. Trying full pattern...
   2 <AB> <CDEX>             |   0| 1:TRIEC-EXACT<S:1/10 W:3 L:1/4 C:9/6>[ABC](13)
   2 <AB> <CDEX>             |   0| TRIE: State:    1 Accepted: N TRIE: Charid:  3 CP:  43 After State:    a
   3 <ABC> <DEX>             |   0| TRIE: State:    a Accepted: Y TRIE: Charid:  0 CP:   0 After State:    0
                             |   0| TRIE: got 1 possible matches
                             |   0| TRIE matched word #3, continuing
                             |   0| TRIE: only one match left, short-circuiting: #3 <C>
   3 <ABC> <DEX>             |   0| 13:END(0)
Match successful!
C
Freeing REx: "ABCF|BCDE|C"

demerphq added a commit that referenced this issue Feb 7, 2025
…ases properly

In some circumstances the AHO-CORASICK logic wasn't matching properly
when there were two possibilities whose proper prefix matches a proper
suffix of a third possibilty, and one of those possibilities was shorter
than the other.

This was because we were resetting the failed flag properly. This bug
must be rare because it took more than a decade for anyone to notice.

This patch fixes the problem by resetting the failed flag after a
successful transition.

A good example of this problem is as follows:

    "ABCDE" =~ m/ABCF|BCDE|C/

This should match 'BCDE' and not 'C'. Because of the flag issue we were
matching 'C' instead.

This fixes #22892
demerphq added a commit that referenced this issue Feb 7, 2025
In some circumstances the AHO-CORASICK logic wasn't matching properly
when there were two possibilities whose proper prefix matches a proper
suffix of a third possibilty, and one of those possibilities was shorter
than the other.

This was because we were NOT resetting the 'failed' flag properly.
This bug must be rare because it took more than a decade for anyone
to notice.

This patch fixes the problem by resetting the failed flag after a
successful transition.

A good example of this problem is as follows:

    "ABCDE" =~ m/ABCF|BCDE|C/

This should match 'BCDE' and not 'C'. Because of the flag issue we were
matching 'C' instead.

This fixes #22892
demerphq added a commit that referenced this issue Feb 9, 2025
and add a bit of extra metadata to both. I used a cruder version
of this to debug #22892
demerphq added a commit that referenced this issue Feb 11, 2025
and add a bit of extra metadata to both. I used a cruder version
of this to debug #22892
demerphq added a commit that referenced this issue Feb 11, 2025
and add a bit of extra metadata to both. I used a cruder version
of this to debug #22892
scottchiefbaker pushed a commit to scottchiefbaker/perl5 that referenced this issue Feb 11, 2025
In some circumstances the AHO-CORASICK logic wasn't matching properly
when there were two possibilities whose proper prefix matches a proper
suffix of a third possibilty, and one of those possibilities was shorter
than the other.

This was because we were NOT resetting the 'failed' flag properly.
This bug must be rare because it took more than a decade for anyone
to notice.

This patch fixes the problem by resetting the failed flag after a
successful transition.

A good example of this problem is as follows:

    "ABCDE" =~ m/ABCF|BCDE|C/

This should match 'BCDE' and not 'C'. Because of the flag issue we were
matching 'C' instead.

This fixes Perl#22892
scottchiefbaker pushed a commit to scottchiefbaker/perl5 that referenced this issue Feb 11, 2025
and add a bit of extra metadata to both. I used a cruder version
of this to debug Perl#22892
scottchiefbaker pushed a commit to scottchiefbaker/perl5 that referenced this issue Feb 14, 2025
and add a bit of extra metadata to both. I used a cruder version
of this to debug Perl#22892
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants