It seems that the regex method returns an empty string inside a loop or a Spark UDF #2

Open
davidegazze opened this issue Jun 21, 2022 · 2 comments

@davidegazze

Hi,
TrieRegEx has really helped me improve my regex performance.
However, when I use TrieRegEx inside a Spark UDF, regex() sometimes returns an empty string.

To simplify, the same behaviour appears when I instantiate TrieRegEx inside a loop:

# Without del: a fresh TrieRegEx is created on every iteration
from trieregex import TrieRegEx

i = 0
VALUES = ["REGEX1", "REGEX2"]
while i < 20:
    TRIE_VALUES = TrieRegEx(*VALUES)
    i = i + 1
    if len(TRIE_VALUES.regex()) < 1:
        print(f"ERROR on loop i:{i}")
        print(f"TRIE_VALUES: '{TRIE_VALUES.regex()}' (len: {len(TRIE_VALUES.regex())})")
        break

I get:

ERROR on loop i:5 # Where the number can change
TRIE_VALUES: '' (len: 0)

My workaround for this case is to add a del at the end of each iteration:

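# With del: the instance is explicitly deleted at the end of each iteration
from trieregex import TrieRegEx

i = 0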
VALUES = ["REGEX1", "REGEX2"]
while i < 20:
    TRIE_VALUES = TrieRegEx(*VALUES)
    i = i + 1
    if len(TRIE_VALUES.regex()) < 1:
        print(f"ERROR on loop i:{i}")
        print(f"TRIE_VALUES: '{TRIE_VALUES.regex()}' (len: {len(TRIE_VALUES.regex())})")
        break
    del TRIE_VALUES

With the code above it works well.
However, if I use TrieRegEx inside a pandas UDF, I get the same bug.

My pandas UDF looks something like this:

def trieregex_udf(df):
    # Read source
    values = ...  # read_values()
    trie = TrieRegEx(*values)
    regex = trie.regex()
    # Apply regex to DF
    output = ...
    return output

output = df.groupby("id").applyInPandas(trieregex_udf, schema="v string").toPandas()

Sometimes trie.regex() returns an empty string.
The problem seems to occur only when TrieRegEx is instantiated inside the UDF; if I pass the resulting regex in instead, everything works well.
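
The workaround of passing in the finished regex, rather than building the trie inside the UDF, could be sketched roughly as follows. To keep it runnable with the standard library alone, the pattern string here is a hand-written stand-in for whatever trie.regex() returns outside the UDF, and `make_matcher` and `extract` are hypothetical names. In a real job the body of `extract` would live inside the pandas UDF.

```python
import re
from typing import Callable, List


def make_matcher(pattern: str) -> Callable[[List[str]], List[str]]:
    """Compile an already-built pattern once and return a function that applies it."""
    if not pattern:
        # An empty pattern matches the empty string everywhere, so fail loudly
        # instead of silently producing useless matches.
        raise ValueError("regex pattern is empty")
    compiled = re.compile(pattern)

    def extract(texts: List[str]) -> List[str]:
        # Return the first match per text, or "" when nothing matches.
        results = []
        for text in texts:
            match = compiled.search(text)
            results.append(match.group(0) if match else "")
        return results

    return extract


# Assumption: stand-in for the string produced once, outside the UDF, by trie.regex().
PATTERN = r"REGEX[12]"
extract = make_matcher(PATTERN)
print(extract(["xx REGEX1 yy", "nothing here", "REGEX2"]))
```

Because only a plain string crosses into the function, no TrieRegEx instance is ever created per group, which is the situation where the empty result was observed.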

@alaa-maverick

alaa-maverick commented Jul 3, 2022

I'm having the same issue. When you create an instance in a loop, sometimes it returns an empty regex:

from trieregex import TrieRegEx as TRE

words = ['lemon', 'lime', 'pomelo', 'orange', 'citron', 'grapefruit', 'grape', 'tangerine', 'tangelo']
empty_cnt = 0
for i in range(100):
    trie = TRE()
    trie.add(*words)
    reg = trie.regex()
    if len(reg) < 1:
        empty_cnt += 1

print(f"{empty_cnt} empty")

Output:
48 empty

However, if you move the line trie = TRE() out of the loop, the output is never empty.
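
Until the cause is understood, one defensive option is to validate the generated pattern before using it. The sketch below is only an assumption about how such a guard might look: `checked_pattern` and `naive_alternation` are hypothetical helpers, and `build` stands for any callable that returns a regex string (for example `lambda ws: TRE(*ws).regex()`). If the built pattern is empty or fails to fully match one of the input words, it falls back to a plain escaped alternation, which is likely slower than a trie-based pattern but should accept the same words.

```python
import re
from typing import Callable, Sequence


def naive_alternation(words: Sequence[str]) -> str:
    """Fallback pattern: escaped words joined with '|', longest first."""
    ordered = sorted(set(words), key=lambda w: (-len(w), w))
    return "(?:" + "|".join(re.escape(w) for w in ordered) + ")"


def checked_pattern(words: Sequence[str], build: Callable[[Sequence[str]], str]) -> str:
    """Return build(words) if it fully matches every word, else a fallback."""
    pattern = build(words)
    if pattern:
        compiled = re.compile(pattern)
        if all(compiled.fullmatch(w) for w in words):
            return pattern
    return naive_alternation(words)


words = ["lemon", "lime", "grape", "grapefruit"]
# Simulate the reported failure with a builder that returns an empty string.
pattern = checked_pattern(words, lambda ws: "")
print(pattern)
```

Sorting the fallback longest-first keeps a shorter word such as "grape" from shadowing "grapefruit" when the pattern is used with search rather than fullmatch.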

@chrisPiemonte

Any news?
