-
Notifications
You must be signed in to change notification settings - Fork 129
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Deploying to gh-pages from @ e7cd7e3 🚀
- Loading branch information
Showing
28 changed files
with
1,825 additions
and
72 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
# Sphinx build info version 1 | ||
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done. | ||
config: 5d67110e4b7f3308f48f9b0005f1974e | ||
config: 1f4b85819b4429b3833c6521f227662a | ||
tags: 645f666f9bcd5a90fca523b33c5a78b7 |
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,268 @@ | ||
Snowpark/Snowflake Usage | ||
======================== | ||
|
||
For ``SnowflakeCompare`` | ||
|
||
- ``on_index`` is not supported. | ||
- Joining is done using ``EQUAL_NULL`` which is the equality test that is safe for null values. | ||
- Compares ``snowflake.snowpark.DataFrame``, which can be provided as either raw Snowflake dataframes | ||
or the as the names of full names of valid snowflake tables, which we will process into Snowpark dataframes. | ||
|
||
|
||
SnowflakeCompare Object Setup | ||
--------------------------------------------------- | ||
There are two ways to specify input dataframes for ``SnowflakeCompare`` | ||
|
||
Provide Snowpark dataframes | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
.. code-block:: python | ||
from snowflake.snowpark import Session | ||
from snowflake.snowpark import Row | ||
import datetime | ||
import datacompy.snowflake as sp | ||
connection_parameters = { | ||
... | ||
} | ||
session = Session.builder.configs(connection_parameters).create() | ||
data1 = [ | ||
Row(acct_id=10000001234, dollar_amt=123.45, name='George Maharis', float_fld=14530.1555, | ||
date_fld=datetime.date(2017, 1, 1)), | ||
Row(acct_id=10000001235, dollar_amt=0.45, name='Michael Bluth', float_fld=1.0, | ||
date_fld=datetime.date(2017, 1, 1)), | ||
Row(acct_id=10000001236, dollar_amt=1345.0, name='George Bluth', float_fld=None, | ||
date_fld=datetime.date(2017, 1, 1)), | ||
Row(acct_id=10000001237, dollar_amt=123456.0, name='Bob Loblaw', float_fld=345.12, | ||
date_fld=datetime.date(2017, 1, 1)), | ||
Row(acct_id=10000001239, dollar_amt=1.05, name='Lucille Bluth', float_fld=None, | ||
date_fld=datetime.date(2017, 1, 1)), | ||
] | ||
data2 = [ | ||
Row(acct_id=10000001234, dollar_amt=123.4, name='George Michael Bluth', float_fld=14530.155), | ||
Row(acct_id=10000001235, dollar_amt=0.45, name='Michael Bluth', float_fld=None), | ||
Row(acct_id=None, dollar_amt=1345.0, name='George Bluth', float_fld=1.0), | ||
Row(acct_id=10000001237, dollar_amt=123456.0, name='Robert Loblaw', float_fld=345.12), | ||
Row(acct_id=10000001238, dollar_amt=1.05, name='Loose Seal Bluth', float_fld=111.0), | ||
] | ||
df_1 = session.createDataFrame(data1) | ||
df_2 = session.createDataFrame(data2) | ||
compare = sp.SnowflakeCompare( | ||
session, | ||
df_1, | ||
df_2, | ||
join_columns=['acct_id'], | ||
rel_tol=1e-03, | ||
abs_tol=1e-04, | ||
) | ||
compare.matches(ignore_extra_columns=False) | ||
# This method prints out a human-readable report summarizing and sampling differences | ||
print(compare.report()) | ||
Provide the full name (``{db}.{schema}.{table_name}``) of valid Snowflake tables | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
Given the dataframes from the prior examples... | ||
|
||
.. code-block:: python | ||
df_1.write.mode("overwrite").save_as_table("toy_table_1") | ||
df_2.write.mode("overwrite").save_as_table("toy_table_2") | ||
compare = sp.SnowflakeCompare( | ||
session, | ||
f"{db}.{schema}.toy_table_1", | ||
f"{db}.{schema}.toy_table_2", | ||
join_columns=['acct_id'], | ||
rel_tol=1e-03, | ||
abs_tol=1e-04, | ||
) | ||
compare.matches(ignore_extra_columns=False) | ||
# This method prints out a human-readable report summarizing and sampling differences | ||
print(compare.report()) | ||
Reports | ||
------- | ||
|
||
A report is generated by calling ``report()``, which returns a string. | ||
Here is a sample report generated by ``datacompy`` for the two tables above, | ||
joined on ``acct_id`` (Note: the names for your dataframes are extracted from | ||
the name of the provided Snowflake table. If you chose to directly use Snowpark | ||
dataframes, then the names will default to ``DF1`` and ``DF2``.):: | ||
|
||
DataComPy Comparison | ||
-------------------- | ||
|
||
DataFrame Summary | ||
----------------- | ||
|
||
DataFrame Columns Rows | ||
0 DF1 5 5 | ||
1 DF2 4 5 | ||
|
||
Column Summary | ||
-------------- | ||
|
||
Number of columns in common: 4 | ||
Number of columns in DF1 but not in DF2: 1 | ||
Number of columns in DF2 but not in DF1: 0 | ||
|
||
Row Summary | ||
----------- | ||
|
||
Matched on: ACCT_ID | ||
Any duplicates on match values: No | ||
Absolute Tolerance: 0 | ||
Relative Tolerance: 0 | ||
Number of rows in common: 4 | ||
Number of rows in DF1 but not in DF2: 1 | ||
Number of rows in DF2 but not in DF1: 1 | ||
|
||
Number of rows with some compared columns unequal: 4 | ||
Number of rows with all compared columns equal: 0 | ||
|
||
Column Comparison | ||
----------------- | ||
|
||
Number of columns compared with some values unequal: 3 | ||
Number of columns compared with all values equal: 1 | ||
Total number of values which compare unequal: 6 | ||
|
||
Columns with Unequal Values or Types | ||
------------------------------------ | ||
|
||
Column DF1 dtype DF2 dtype # Unequal Max Diff # Null Diff | ||
0 DOLLAR_AMT double double 1 0.0500 0 | ||
2 FLOAT_FLD double double 3 0.0005 2 | ||
1 NAME string(16777216) string(16777216) 2 NaN 0 | ||
|
||
Sample Rows with Unequal Values | ||
------------------------------- | ||
|
||
ACCT_ID DOLLAR_AMT (DF1) DOLLAR_AMT (DF2) | ||
0 10000001234 123.45 123.4 | ||
|
||
ACCT_ID NAME (DF1) NAME (DF2) | ||
0 10000001234 George Maharis George Michael Bluth | ||
1 10000001237 Bob Loblaw Robert Loblaw | ||
|
||
ACCT_ID FLOAT_FLD (DF1) FLOAT_FLD (DF2) | ||
0 10000001234 14530.1555 14530.155 | ||
1 10000001235 1.0000 NaN | ||
2 10000001236 NaN 1.000 | ||
|
||
Sample Rows Only in DF1 (First 10 Columns) | ||
------------------------------------------ | ||
|
||
ACCT_ID_DF1 DOLLAR_AMT_DF1 NAME_DF1 FLOAT_FLD_DF1 DATE_FLD_DF1 | ||
0 10000001239 1.05 Lucille Bluth NaN 2017-01-01 | ||
|
||
Sample Rows Only in DF2 (First 10 Columns) | ||
------------------------------------------ | ||
|
||
ACCT_ID_DF2 DOLLAR_AMT_DF2 NAME_DF2 FLOAT_FLD_DF2 | ||
0 10000001238 1.05 Loose Seal Bluth 111.0 | ||
|
||
|
||
Convenience Methods | ||
------------------- | ||
|
||
There are a few convenience methods and attributes available after the comparison has been run: | ||
|
||
.. code-block:: python | ||
compare.intersect_rows[['name_df1', 'name_df2', 'name_match']].show() | ||
# -------------------------------------------------------- | ||
# |"NAME_DF1" |"NAME_DF2" |"NAME_MATCH" | | ||
# -------------------------------------------------------- | ||
# |George Maharis |George Michael Bluth |False | | ||
# |Michael Bluth |Michael Bluth |True | | ||
# |George Bluth |George Bluth |True | | ||
# |Bob Loblaw |Robert Loblaw |False | | ||
# -------------------------------------------------------- | ||
compare.df1_unq_rows.show() | ||
# --------------------------------------------------------------------------------------- | ||
# |"ACCT_ID_DF1" |"DOLLAR_AMT_DF1" |"NAME_DF1" |"FLOAT_FLD_DF1" |"DATE_FLD_DF1" | | ||
# --------------------------------------------------------------------------------------- | ||
# |10000001239 |1.05 |Lucille Bluth |NULL |2017-01-01 | | ||
# --------------------------------------------------------------------------------------- | ||
compare.df2_unq_rows.show() | ||
# ------------------------------------------------------------------------- | ||
# |"ACCT_ID_DF2" |"DOLLAR_AMT_DF2" |"NAME_DF2" |"FLOAT_FLD_DF2" | | ||
# ------------------------------------------------------------------------- | ||
# |10000001238 |1.05 |Loose Seal Bluth |111.0 | | ||
# ------------------------------------------------------------------------- | ||
print(compare.intersect_columns()) | ||
# OrderedSet(['acct_id', 'dollar_amt', 'name', 'float_fld']) | ||
print(compare.df1_unq_columns()) | ||
# OrderedSet(['date_fld']) | ||
print(compare.df2_unq_columns()) | ||
# OrderedSet() | ||
Duplicate rows | ||
-------------- | ||
|
||
Datacompy will try to handle rows that are duplicate in the join columns. It does this behind the | ||
scenes by generating a unique ID within each unique group of the join columns. For example, if you | ||
have two dataframes you're trying to join on acct_id: | ||
|
||
=========== ================ | ||
acct_id name | ||
=========== ================ | ||
1 George Maharis | ||
1 Michael Bluth | ||
2 George Bluth | ||
=========== ================ | ||
|
||
=========== ================ | ||
acct_id name | ||
=========== ================ | ||
1 George Maharis | ||
1 Michael Bluth | ||
1 Tony Wonder | ||
2 George Bluth | ||
=========== ================ | ||
|
||
Datacompy will generate a unique temporary ID for joining: | ||
|
||
=========== ================ ======== | ||
acct_id name temp_id | ||
=========== ================ ======== | ||
1 George Maharis 0 | ||
1 Michael Bluth 1 | ||
2 George Bluth 0 | ||
=========== ================ ======== | ||
|
||
=========== ================ ======== | ||
acct_id name temp_id | ||
=========== ================ ======== | ||
1 George Maharis 0 | ||
1 Michael Bluth 1 | ||
1 Tony Wonder 2 | ||
2 George Bluth 0 | ||
=========== ================ ======== | ||
|
||
And then merge the two dataframes on a combination of the join_columns you specified and the temporary | ||
ID, before dropping the temp_id again. So the first two rows in the first dataframe will match the | ||
first two rows in the second dataframe, and the third row in the second dataframe will be recognized | ||
as uniquely in the second. | ||
|
||
Additional considerations | ||
------------------------- | ||
- It is strongly recommended against joining on float columns (or any column with floating point precision). | ||
Columns joining tables are compared on the basis of an exact comparison, therefore if the values comparing | ||
your float columns are not exact, you will likely get unexpected results. | ||
- Case-sensitive columns are only partially supported. We essentially treat case-sensitive | ||
columns as if they are case-insensitive. Therefore you may use case-sensitive columns as long as | ||
you don't have several columns with the same name differentiated only be case sensitivity. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.