BUG: Fix #57608: queries on categorical string columns in HDFStore.select() return unexpected results. #61225
base: main
Fixes #57608: queries on categorical string columns in HDFStore.select() return unexpected results. In __init__() of class Selection (pandas/io/pytables.py), the method self.terms.evaluate() was not returning the correct value for the where condition. The issue stemmed from convert_value() of class BinOp (pandas/core/computation/pytables.py), where searchsorted() did not return the correct index when matching the where condition against the metadata (the categories table). searchsorted() performs a binary search and is only correct on sorted input, which the categories array is not guaranteed to be. Replacing searchsorted() with np.where() resolves this issue.
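The failure mode described above can be reproduced outside pandas. The following is a minimal sketch, assuming the categories are held in a plain NumPy string array in an order that is not lexicographic; the array contents and variable names are illustrative and are not taken from the pandas source.

```python
import numpy as np

# Illustrative categories stored in a non-sorted order (an assumption for this
# sketch; in pandas this role is played by the `metadata` array).
metadata = np.array(["c", "a", "b"])
conv_val = "c"

# Binary search: only meaningful when `metadata` is sorted.
by_search = int(metadata.searchsorted(conv_val, side="left"))

# Linear scan: position of the first element equal to `conv_val`.
by_scan = int(np.where(metadata == conv_val)[0][0])

print("searchsorted:", by_search)
print("np.where:    ", by_scan)
```

Here "c" sits at position 0, which the linear scan reports, while the binary search returns an insertion point that does not correspond to the stored position.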
pandas/core/computation/pytables.py
@@ -239,7 +239,8 @@ def stringify(value):
             if conv_val not in metadata:
                 result = -1
             else:
-                result = metadata.searchsorted(conv_val, side="left")
+                # Find the index of the first match of conv_val in metadata
+                result = np.where(metadata == conv_val)[0][0]
Suggested change:
- result = np.where(metadata == conv_val)[0][0]
+ result = np.flatnonzero(metadata == conv_val)[0]
Also, is it possible to know ahead of time whether metadata is sorted, so we can use searchsorted? It will be much faster in that case.
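As a small illustration of the suggestion above: for a one-dimensional boolean mask, `np.where(mask)[0]` and `np.flatnonzero(mask)` yield the same indices, and on input that is already sorted `searchsorted` agrees with both. The arrays below are made up for this sketch.

```python
import numpy as np

categories = np.array(["c", "a", "b", "a"])
mask = categories == "a"

# Both expressions give the position of the first True in a 1-D mask.
via_where = int(np.where(mask)[0][0])
via_flatnonzero = int(np.flatnonzero(mask)[0])

# np.unique returns sorted unique values, so binary search is valid here.
sorted_categories = np.unique(categories)
via_searchsorted = int(sorted_categories.searchsorted("b", side="left"))
via_scan_sorted = int(np.flatnonzero(sorted_categories == "b")[0])

print(via_where, via_flatnonzero)
print(via_searchsorted, via_scan_sorted)
```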
@@ -239,7 +239,13 @@ def stringify(value):
             if conv_val not in metadata:
                 result = -1
             else:
-                result = metadata.searchsorted(conv_val, side="left")
+                # Check if metadata is sorted
Well, we probably won't want to do this check here, because it also incurs a performance penalty. I was just asking whether something in the preprocessing code above already checks this for us. If not, just using np.flatnonzero here directly is fine.
It doesn't look like there's a check to ensure the metadata is ordered. This part works with just the array of category values, so it may not be able to confirm whether they were ordered.
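The trade-off discussed in this thread can be sketched as follows. `first_match_index` and its `known_sorted` flag are hypothetical names introduced only for illustration and are not part of pandas. The point is that binary search is worthwhile only when sortedness is already known from elsewhere, since verifying it on the spot is itself a linear pass over the array.

```python
import numpy as np


def first_match_index(metadata, value, known_sorted=False):
    """Return the position of the first element equal to `value`, or -1.

    Hypothetical helper for illustration only; not part of pandas.
    """
    if known_sorted:
        # Binary search is valid only because the caller vouches for order.
        idx = int(np.searchsorted(metadata, value, side="left"))
        if idx < len(metadata) and metadata[idx] == value:
            return idx
        return -1
    # No ordering guarantee: fall back to a linear scan.
    matches = np.flatnonzero(metadata == value)
    return int(matches[0]) if matches.size else -1


unsorted_categories = np.array(["c", "a", "b"])
sorted_categories = np.array(["a", "b", "c"])

print(first_match_index(unsorted_categories, "b"))
print(first_match_index(sorted_categories, "b", known_sorted=True))
```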
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.