Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize Snowflake/Snowpark Compare #354

Merged
merged 3 commits into from
Nov 19, 2024

Conversation

rhaffar
Copy link
Contributor

@rhaffar rhaffar commented Nov 12, 2024

  • Make the max_diff, null_dif, and matching calculations non-blocking so that Snowflake may process them all simultaneously.
  • Multithread the column stat calculation so that several column-pairs can forward their calculations requests to Snowflake simultaneously.
  • Patched an intermittent bug where the Snowflake optimizer in an attempt to apply pushdown filtering would end up applying an incorrect filter which would ruin the pre-merge df1 and df2 dataframes.

Additional context:

  • The multithreading solution required splitting apart the adding of match rows to the intersect dataframe from the calculation of matches/max_diff/null_diff. This is because in order to keep the function thread safe, we want to avoid updating the intersect dataframe across multiple threads simultaneously. Any pointers on making the code there a little more clean is appreciated.
  • The intermittent issue was (and still is) hard to pin down. Ultimately, the issue was resolved by forcing caching on df1 and df2 prior to the join which generates the temp columns, since it was this join that was being ruined by the Snowflake optimizer. It might have been enough to just force a collection prior to the join, but this seems to work fine as well.

Benchmark in following comment

@rhaffar
Copy link
Contributor Author

rhaffar commented Nov 12, 2024

benchmark

Attached is a brief benchmark on the new optimizations. We can see there's very little difference in performance between the 2-8 column use cases in both 1-10 million rows, meanwhile we see a very significant difference performing a compare using a model monitoring table (with 15 join colums and 95 compare columns). This is because the model monitoring table compare's runtime is heavily centered on the intersect portion where columns are compared against eachother, as opposed to the merge portion where the provided tables/dataframes are merged. This is expected, as all performance changes in this PR are centered specifically on the intersect component, making comparisons on large tables with many columns far more feasible from a performance perspective.

Factors that determine whether the runtime of the compare is centered on the intersect includes:

  1. A large number of columns.
  2. Columns that are more expensive to compare.

The comparison data used for the 2-8 col, 1-10 million row benchmark are made up of float and int columns which take 0.1-0.2 seconds per match calculation (max_diff, null, match count). This would at best shave off 2-3 seconds runtime for the 8 column comparisons, which is insignificant compared against Snowflake performance volatility. The model monitoring table, on the other hand, is made up of a vast array of columns that can take up to 1.5 seconds to compare per match calculation per column. This depends primarily on the datatype and the size of the intersect dataframe.

With these new performance improvements, the Snowflake Compare runtime is actually shorter than the time it would take just to load Snowflake tables into dataframes to set up a Pandas Compare. Meaning if you have any Snowflake tables that you want to compare, it will almost always be faster to do a Snowflake compare than it will be to load the tables into Pandas dataframes and do a Pandas compare (all this without having to store two entire tables as dataframes in memory). The larger the tables, the more true this will be.

@fdosani
Copy link
Member

fdosani commented Nov 13, 2024

@rhaffar can you also post the results from running the snowflake tests. Just make sure to omit any sensitive info in paths etc. Just so we have some record that those test also work.

Copy link
Contributor

@gladysteh99 gladysteh99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the great work.

@rhaffar
Copy link
Contributor Author

rhaffar commented Nov 18, 2024

@rhaffar can you also post the results from running the snowflake tests. Just make sure to omit any sensitive info in paths etc. Just so we have some record that those test also work.

passing

All I could clip without catching sensitive info, but yep all 66 snowflake tests passing

Copy link
Member

@fdosani fdosani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fdosani fdosani merged commit 09686a9 into capitalone:develop Nov 19, 2024
48 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants