Hi,
TrieRegEx has really helped me improve my regex performance.
However, when I use TrieRegEx inside a Spark UDF, it sometimes returns an empty string.
To simplify, I can reproduce the same behaviour by creating a TrieRegEx inside a loop:
# Without del
from trieregex import TrieRegEx

i = 0
VALUES = ["REGEX1", "REGEX2"]
while i < 20:
    # A fresh instance is created on every iteration
    TRIE_VALUES = TrieRegEx(*VALUES)
    i = i + 1
    if len(TRIE_VALUES.regex()) < 1:
        print(f"ERROR on loop i:{i}")
        print(f"TRIE_VALUES: '{TRIE_VALUES.regex()}' (len: {len(TRIE_VALUES.regex())})")
        break
I get:
ERROR on loop i:5 # Where the number can change
TRIE_VALUES: '' (len: 0)
My workaround for this case is to add a del like this:
# With del
from trieregex import TrieRegEx

i = 0
VALUES = ["REGEX1", "REGEX2"]
while i < 20:
    TRIE_VALUES = TrieRegEx(*VALUES)
    i = i + 1
    if len(TRIE_VALUES.regex()) < 1:
        print(f"ERROR on loop i:{i}")
        print(f"TRIE_VALUES: '{TRIE_VALUES.regex()}' (len: {len(TRIE_VALUES.regex())})")
        break
    # Drop the instance explicitly before the next iteration
    del TRIE_VALUES
With the code above it works well.
However, if I use TrieRegEx inside a pandas UDF, I get the same bug:
sometimes trie.regex() returns an empty string.
The problem seems to occur only when the TrieRegEx is instantiated inside the UDF; if I build the regex outside and pass in the resulting pattern, everything works.
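For anyone who wants to see what "pass the result regex" could look like, here is a minimal sketch of that approach with Spark left out. It is only an illustration: `build_pattern` and `make_extractor` are hypothetical helpers, and `build_pattern` is a plain-`re` stand-in for `TrieRegEx(*VALUES).regex()` so the snippet runs without the library. The point is that the pattern string is built once, outside the function that processes the data, and only the string is handed over.

```python
import re

VALUES = ["REGEX1", "REGEX2"]


def build_pattern(words):
    # Stand-in (assumption) for TrieRegEx(*words).regex():
    # a simple escaped alternation, longest words first.
    ordered = sorted(words, key=len, reverse=True)
    return "|".join(re.escape(w) for w in ordered)


# Build the pattern ONCE, outside the per-batch function.
PATTERN = build_pattern(VALUES)


def make_extractor(pattern):
    # Only the pattern string crosses into the worker function;
    # no trie object is created per call.
    compiled = re.compile(pattern)

    def extract(values):
        out = []
        for v in values:
            m = compiled.search(v)
            out.append(m.group(0) if m else None)
        return out

    return extract


extract = make_extractor(PATTERN)
result = extract(["xx REGEX1 yy", "nothing", "REGEX2"])
```

In a real pandas UDF the body of `extract` would operate on a pandas Series instead of a list, but the idea is the same: the UDF closes over a plain string rather than creating a trie on each call.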
I'm having the same issue. When you create an instance in a loop, sometimes it returns an empty regex:
from trieregex import TrieRegEx as TRE

words = ['lemon', 'lime', 'pomelo', 'orange', 'citron', 'grapefruit', 'grape', 'tangerine', 'tangelo']
empty_cnt = 0
for i in range(100):
    trie = TRE()
    trie.add(*words)
    reg = trie.regex()
    if len(reg) < 1:
        empty_cnt += 1
print(f"{empty_cnt} empty")
Output: 48 empty
However, if you move the line trie = TRE() out of the loop, the output is never empty.
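One reason this is easy to miss: an empty pattern is still a valid regex, and it matches at position 0 of any string, so nothing fails loudly downstream. Until the cause is found, a small guard may help. The sketch below uses only the standard `re` module; `compile_checked` is a hypothetical helper name, not part of trieregex.

```python
import re


def compile_checked(pattern):
    # Refuse to compile an empty pattern instead of silently
    # getting a regex that matches everything.
    if not pattern:
        raise ValueError("refusing to compile an empty pattern")
    return re.compile(pattern)


# An empty pattern "matches" a string that contains none of the words.
empty_match = re.search("", "no fruit here")

# A non-empty pattern behaves as expected.
lime_match = compile_checked("lime|lemon").search("a lime")
```

Wrapping the output of `trie.regex()` in a check like this turns the intermittent empty result into an immediate error rather than a wrong match.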