Deploying to gh-pages from @ e7cd7e3 🚀

fdosani committed Oct 30, 2024
1 parent bef5c63 commit 1069d2c
Showing 28 changed files with 1,825 additions and 72 deletions.
2 changes: 1 addition & 1 deletion .buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 5d67110e4b7f3308f48f9b0005f1974e
+config: 1f4b85819b4429b3833c6521f227662a
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file modified .doctrees/api/datacompy.doctree
Binary file not shown.
Binary file modified .doctrees/developer_instructions.doctree
Binary file not shown.
Binary file modified .doctrees/environment.pickle
Binary file not shown.
Binary file modified .doctrees/index.doctree
Binary file not shown.
Binary file added .doctrees/snowflake_usage.doctree
Binary file not shown.
8 changes: 8 additions & 0 deletions _sources/api/datacompy.rst.txt
@@ -44,6 +44,14 @@ datacompy.polars module
:undoc-members:
:show-inheritance:

datacompy.snowflake module
--------------------------

.. automodule:: datacompy.snowflake
:members:
:undoc-members:
:show-inheritance:

Module contents
---------------

20 changes: 19 additions & 1 deletion _sources/developer_instructions.rst.txt
@@ -43,6 +43,24 @@ Run ``python -m pytest`` to run all unittests defined in the subfolder
`pytest-runner <https://pypi.python.org/pypi/pytest-runner>`_.


Snowflake testing
-----------------
Testing the Snowflake compare requires access to a Snowflake cluster, as Snowflake does not support running locally.
This means that Snowflake tests are not run in CI/CD, and changes to ``SnowflakeCompare`` must be validated by
running these tests locally.

Note that you must have the following environment variables set in order to instantiate a Snowflake Connection (for testing purposes):

- "SF_ACCOUNT": with your SF account
- "SF_UID": with your SF username
- "SF_PWD": with your SF password
- "SF_WAREHOUSE": with your desired SF warehouse
- "SF_DATABASE": with a valid database with which you have access
- "SF_SCHEMA": with a valid schema belonging to the provided database

Once these are set, you are free to run the suite of Snowflake tests.
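
As a minimal sketch, a session for local testing can then be built from those variables
(assuming the ``snowflake-snowpark-python`` package is installed; the exact fixture wiring
used by the test suite may differ):

.. code-block:: python

    import os

    from snowflake.snowpark import Session

    # Map the SF_* environment variables onto Snowpark connection parameters.
    connection_parameters = {
        "account": os.environ["SF_ACCOUNT"],
        "user": os.environ["SF_UID"],
        "password": os.environ["SF_PWD"],
        "warehouse": os.environ["SF_WAREHOUSE"],
        "database": os.environ["SF_DATABASE"],
        "schema": os.environ["SF_SCHEMA"],
    }
    session = Session.builder.configs(connection_parameters).create()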


Management of Requirements
--------------------------

@@ -130,4 +148,4 @@ Finally upload to PyPi::
twine upload --repository-url https://test.pypi.org/legacy/ dist/*

# real pypi
-twine upload dist/*
+twine upload dist/*
3 changes: 2 additions & 1 deletion _sources/index.rst.txt
@@ -11,6 +11,7 @@ Contents
Installation <install>
Pandas Usage <pandas_usage>
Spark Usage <spark_usage>
Snowflake Usage <snowflake_usage>
Polars Usage <polars_usage>
Fugue Usage <fugue_usage>
Benchmarks <benchmark>
@@ -28,4 +29,4 @@ Indices and tables

* :ref:`genindex`
* :ref:`modindex`
-* :ref:`search`
+* :ref:`search`
268 changes: 268 additions & 0 deletions _sources/snowflake_usage.rst.txt
@@ -0,0 +1,268 @@
Snowpark/Snowflake Usage
========================

For ``SnowflakeCompare``:

- ``on_index`` is not supported.
- Joining is done using ``EQUAL_NULL``, the equality test that is safe for null values.
- Compares ``snowflake.snowpark.DataFrame``, which can be provided either as raw Snowpark
  dataframes or as the fully qualified names of valid Snowflake tables, which will be processed
  into Snowpark dataframes.

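As a quick illustration of the null-safe semantics (a sketch assuming an open ``session``;
``EQUAL_NULL`` is a built-in Snowflake function):

.. code-block:: python

    # EQUAL_NULL treats two NULLs as equal, unlike the plain = operator,
    # so rows with NULL join keys can still be matched to each other.
    session.sql(
        "SELECT EQUAL_NULL(NULL, NULL) AS null_safe, NULL = NULL AS plain_equal"
    ).show()
    # null_safe -> True, plain_equal -> NULL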

SnowflakeCompare Object Setup
-----------------------------
There are two ways to specify input dataframes for ``SnowflakeCompare``:

Provide Snowpark dataframes
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    from snowflake.snowpark import Session
    from snowflake.snowpark import Row
    import datetime
    import datacompy.snowflake as sp

    connection_parameters = {
        ...
    }
    session = Session.builder.configs(connection_parameters).create()

    data1 = [
        Row(acct_id=10000001234, dollar_amt=123.45, name='George Maharis', float_fld=14530.1555,
            date_fld=datetime.date(2017, 1, 1)),
        Row(acct_id=10000001235, dollar_amt=0.45, name='Michael Bluth', float_fld=1.0,
            date_fld=datetime.date(2017, 1, 1)),
        Row(acct_id=10000001236, dollar_amt=1345.0, name='George Bluth', float_fld=None,
            date_fld=datetime.date(2017, 1, 1)),
        Row(acct_id=10000001237, dollar_amt=123456.0, name='Bob Loblaw', float_fld=345.12,
            date_fld=datetime.date(2017, 1, 1)),
        Row(acct_id=10000001239, dollar_amt=1.05, name='Lucille Bluth', float_fld=None,
            date_fld=datetime.date(2017, 1, 1)),
    ]
    data2 = [
        Row(acct_id=10000001234, dollar_amt=123.4, name='George Michael Bluth', float_fld=14530.155),
        Row(acct_id=10000001235, dollar_amt=0.45, name='Michael Bluth', float_fld=None),
        Row(acct_id=None, dollar_amt=1345.0, name='George Bluth', float_fld=1.0),
        Row(acct_id=10000001237, dollar_amt=123456.0, name='Robert Loblaw', float_fld=345.12),
        Row(acct_id=10000001238, dollar_amt=1.05, name='Loose Seal Bluth', float_fld=111.0),
    ]
    df_1 = session.createDataFrame(data1)
    df_2 = session.createDataFrame(data2)

    compare = sp.SnowflakeCompare(
        session,
        df_1,
        df_2,
        join_columns=['acct_id'],
        rel_tol=1e-03,
        abs_tol=1e-04,
    )
    compare.matches(ignore_extra_columns=False)

    # This method prints out a human-readable report summarizing and sampling differences
    print(compare.report())

Provide the full name (``{db}.{schema}.{table_name}``) of valid Snowflake tables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Given the dataframes from the prior examples...

.. code-block:: python

    df_1.write.mode("overwrite").save_as_table("toy_table_1")
    df_2.write.mode("overwrite").save_as_table("toy_table_2")

    compare = sp.SnowflakeCompare(
        session,
        f"{db}.{schema}.toy_table_1",
        f"{db}.{schema}.toy_table_2",
        join_columns=['acct_id'],
        rel_tol=1e-03,
        abs_tol=1e-04,
    )
    compare.matches(ignore_extra_columns=False)

    # This method prints out a human-readable report summarizing and sampling differences
    print(compare.report())

Reports
-------

A report is generated by calling ``report()``, which returns a string.
Here is a sample report generated by ``datacompy`` for the two tables above,
joined on ``acct_id``. (Note: the names for your dataframes are extracted from
the names of the provided Snowflake tables. If you choose to use Snowpark
dataframes directly, the names will default to ``DF1`` and ``DF2``.)::

    DataComPy Comparison
    --------------------

    DataFrame Summary
    -----------------

      DataFrame  Columns  Rows
    0       DF1        5     5
    1       DF2        4     5

    Column Summary
    --------------

    Number of columns in common: 4
    Number of columns in DF1 but not in DF2: 1
    Number of columns in DF2 but not in DF1: 0

    Row Summary
    -----------

    Matched on: ACCT_ID
    Any duplicates on match values: No
    Absolute Tolerance: 0
    Relative Tolerance: 0
    Number of rows in common: 4
    Number of rows in DF1 but not in DF2: 1
    Number of rows in DF2 but not in DF1: 1

    Number of rows with some compared columns unequal: 4
    Number of rows with all compared columns equal: 0

    Column Comparison
    -----------------

    Number of columns compared with some values unequal: 3
    Number of columns compared with all values equal: 1
    Total number of values which compare unequal: 6

    Columns with Unequal Values or Types
    ------------------------------------

           Column         DF1 dtype         DF2 dtype  # Unequal  Max Diff  # Null Diff
    0  DOLLAR_AMT            double            double          1    0.0500            0
    2   FLOAT_FLD            double            double          3    0.0005            2
    1        NAME  string(16777216)  string(16777216)          2       NaN            0

    Sample Rows with Unequal Values
    -------------------------------

           ACCT_ID  DOLLAR_AMT (DF1)  DOLLAR_AMT (DF2)
    0  10000001234            123.45             123.4

           ACCT_ID      NAME (DF1)            NAME (DF2)
    0  10000001234  George Maharis  George Michael Bluth
    1  10000001237      Bob Loblaw         Robert Loblaw

           ACCT_ID  FLOAT_FLD (DF1)  FLOAT_FLD (DF2)
    0  10000001234       14530.1555        14530.155
    1  10000001235           1.0000              NaN
    2  10000001236              NaN            1.000

    Sample Rows Only in DF1 (First 10 Columns)
    ------------------------------------------

       ACCT_ID_DF1  DOLLAR_AMT_DF1       NAME_DF1  FLOAT_FLD_DF1  DATE_FLD_DF1
    0  10000001239            1.05  Lucille Bluth            NaN    2017-01-01

    Sample Rows Only in DF2 (First 10 Columns)
    ------------------------------------------

       ACCT_ID_DF2  DOLLAR_AMT_DF2          NAME_DF2  FLOAT_FLD_DF2
    0  10000001238            1.05  Loose Seal Bluth          111.0


Convenience Methods
-------------------

There are a few convenience methods and attributes available after the comparison has been run:

.. code-block:: python

    compare.intersect_rows[['name_df1', 'name_df2', 'name_match']].show()
    # --------------------------------------------------------
    # |"NAME_DF1"      |"NAME_DF2"            |"NAME_MATCH"  |
    # --------------------------------------------------------
    # |George Maharis  |George Michael Bluth  |False         |
    # |Michael Bluth   |Michael Bluth         |True          |
    # |George Bluth    |George Bluth          |True          |
    # |Bob Loblaw      |Robert Loblaw         |False         |
    # --------------------------------------------------------

    compare.df1_unq_rows.show()
    # ---------------------------------------------------------------------------------------
    # |"ACCT_ID_DF1"  |"DOLLAR_AMT_DF1"  |"NAME_DF1"     |"FLOAT_FLD_DF1"  |"DATE_FLD_DF1"  |
    # ---------------------------------------------------------------------------------------
    # |10000001239    |1.05              |Lucille Bluth  |NULL             |2017-01-01      |
    # ---------------------------------------------------------------------------------------

    compare.df2_unq_rows.show()
    # -------------------------------------------------------------------------
    # |"ACCT_ID_DF2"  |"DOLLAR_AMT_DF2"  |"NAME_DF2"        |"FLOAT_FLD_DF2"  |
    # -------------------------------------------------------------------------
    # |10000001238    |1.05              |Loose Seal Bluth  |111.0            |
    # -------------------------------------------------------------------------

    print(compare.intersect_columns())
    # OrderedSet(['acct_id', 'dollar_amt', 'name', 'float_fld'])
    print(compare.df1_unq_columns())
    # OrderedSet(['date_fld'])
    print(compare.df2_unq_columns())
    # OrderedSet()

Duplicate rows
--------------

Datacompy will try to handle rows that are duplicated in the join columns. It does this behind the
scenes by generating a unique ID within each group of duplicated join-column values. For example, if
you have two dataframes you're trying to join on ``acct_id``:

=========== ================
acct_id     name
=========== ================
1           George Maharis
1           Michael Bluth
2           George Bluth
=========== ================

=========== ================
acct_id     name
=========== ================
1           George Maharis
1           Michael Bluth
1           Tony Wonder
2           George Bluth
=========== ================

Datacompy will generate a unique temporary ID for joining:

=========== ================ ========
acct_id     name             temp_id
=========== ================ ========
1           George Maharis   0
1           Michael Bluth    1
2           George Bluth     0
=========== ================ ========

=========== ================ ========
acct_id     name             temp_id
=========== ================ ========
1           George Maharis   0
1           Michael Bluth    1
1           Tony Wonder      2
2           George Bluth     0
=========== ================ ========

It then merges the two dataframes on the combination of the ``join_columns`` you specified and the
temporary ID, before dropping ``temp_id`` again. So the first two rows in the first dataframe will
match the first two rows in the second dataframe, and the third row in the second dataframe will be
recognized as unique to the second.
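
The idea can be sketched with Snowpark window functions (illustrative only, and not
necessarily datacompy's exact internal implementation; ``df_1`` and ``df_2`` are the
dataframes from the earlier example):

.. code-block:: python

    from snowflake.snowpark.functions import col, row_number
    from snowflake.snowpark.window import Window

    # Number rows within each group of join-column values (zero-based), so
    # duplicates on acct_id receive distinct temporary IDs.
    w = Window.partition_by("acct_id").order_by(col("name"))
    df_1_tmp = df_1.with_column("temp_id", row_number().over(w) - 1)
    df_2_tmp = df_2.with_column("temp_id", row_number().over(w) - 1)

    # Merge on the join columns plus the temporary ID, then drop it again.
    joined = df_1_tmp.join(df_2_tmp, on=["acct_id", "temp_id"]).drop("temp_id")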

Additional considerations
-------------------------
- Joining on float columns (or any column with floating point precision) is strongly discouraged.
  Join columns are compared using exact equality, so if the values in your float columns are not
  exact, you will likely get unexpected results; see the sketch below for one possible workaround.
- Case-sensitive columns are only partially supported. We essentially treat case-sensitive
  columns as if they are case-insensitive. Therefore you may use case-sensitive columns as long as
  you don't have several columns with the same name differentiated only by case sensitivity.
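
A hedged sketch of the float workaround mentioned above: round a hypothetical ``float_key``
join column to a fixed scale on both sides before comparing (``sp``, ``session``, ``df_1``,
and ``df_2`` as in the earlier examples):

.. code-block:: python

    from snowflake.snowpark.functions import col, round as sf_round

    # Round the hypothetical float join key to 4 decimal places on both sides
    # so the exact equality used for joining behaves predictably.
    df_1_rounded = df_1.with_column("float_key", sf_round(col("float_key"), 4))
    df_2_rounded = df_2.with_column("float_key", sf_round(col("float_key"), 4))

    compare = sp.SnowflakeCompare(
        session,
        df_1_rounded,
        df_2_rounded,
        join_columns=["float_key"],
    )
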
2 changes: 1 addition & 1 deletion _static/documentation_options.js
@@ -1,5 +1,5 @@
const DOCUMENTATION_OPTIONS = {
-VERSION: '0.14.1',
+VERSION: '0.14.2',
LANGUAGE: 'en',
COLLAPSE_INDEX: false,
BUILDER: 'html',