-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize Snowflake/Snowpark Compare #354
Conversation
Attached is a brief benchmark on the new optimizations. We can see there's very little difference in performance between the 2-8 column use cases in both 1-10 million rows, meanwhile we see a very significant difference performing a compare using a model monitoring table (with 15 join colums and 95 compare columns). This is because the model monitoring table compare's runtime is heavily centered on the intersect portion where columns are compared against eachother, as opposed to the merge portion where the provided tables/dataframes are merged. This is expected, as all performance changes in this PR are centered specifically on the intersect component, making comparisons on large tables with many columns far more feasible from a performance perspective. Factors that determine whether the runtime of the compare is centered on the intersect includes:
The comparison data used for the 2-8 col, 1-10 million row benchmark are made up of float and int columns which take 0.1-0.2 seconds per match calculation (max_diff, null, match count). This would at best shave off 2-3 seconds runtime for the 8 column comparisons, which is insignificant compared against Snowflake performance volatility. The model monitoring table, on the other hand, is made up of a vast array of columns that can take up to 1.5 seconds to compare per match calculation per column. This depends primarily on the datatype and the size of the intersect dataframe. With these new performance improvements, the Snowflake Compare runtime is actually shorter than the time it would take just to load Snowflake tables into dataframes to set up a Pandas Compare. Meaning if you have any Snowflake tables that you want to compare, it will almost always be faster to do a Snowflake compare than it will be to load the tables into Pandas dataframes and do a Pandas compare (all this without having to store two entire tables as dataframes in memory). The larger the tables, the more true this will be. |
@rhaffar can you also post the results from running the snowflake tests. Just make sure to omit any sensitive info in paths etc. Just so we have some record that those test also work. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for the great work.
All I could clip without catching sensitive info, but yep all 66 snowflake tests passing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Additional context:
Benchmark in following comment