Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for replace option when using insert_rows with MsSqlHook #40836

Merged
merged 24 commits into from
Jul 26, 2024

Conversation

dabla
Copy link
Contributor

@dabla dabla commented Jul 17, 2024

This PR adds support for the replace option when using insert_rows with the MsSqlHook. Unfortunately MSSQL doesn't have a REPLACE or an UPSERT SQL syntax. In order to achieve the same functionality, you have to use the MERGE INTO statement. In order to be able to use that statement in a dynamic way, you have to get the primary keys of the table you want to replace into. This means the current implementation of the _generate_insert_sql with the _replace_statement_format field won't work, as some assumptions have been made there which makes it impossible to achieve the same with the current _generate_insert_sql implementation of the DbApiHook.

So to make it work for MsSqlHook, I've override the _generate_insert_sql method in the MsSqlHook and also added a method named get_primary_keys which dynamically retrieves the PK associated to the given table as this will be needed for the MERGE INTO SQL statement. If the _generate_insert_sql method is called for insert statement, then the original implementation is being called, otherwise the MSSQL specific implementation is being executed.

Another advantage is that you can benefit from fast performance while upserting with the executemany option.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

… can also support the replace option and generate a MERGE INTO statement for MSSQL so replace with insert_rows also works on MSSQL
@dabla dabla marked this pull request as ready for review July 17, 2024 15:21
@dabla
Copy link
Contributor Author

dabla commented Jul 18, 2024

@potiuk Is there a way to ignore the required license header in sql files within tests? Searched through the project but didn't find any example, saw other sql file and seemed that one wasn't impacted so wonder why

@dabla
Copy link
Contributor Author

dabla commented Jul 18, 2024

Never mind the hive.sql also has the header, problem is I use the content of the sql for my assertion in the test, as I could have it as a python string as it would be sensitive to formatting and thus make false assertion failures

@dabla
Copy link
Contributor Author

dabla commented Jul 18, 2024

I've added the license header in the sql file and written a function which filters out the license header so the assertion still works

@dabla
Copy link
Contributor Author

dabla commented Jul 18, 2024

It's a pitty that I can't use the functools.lru_cache decorator? I could use the cachetools cache decorator, but then that would add an extra dependency. Dunno why unctools.lru_cache is banned, as per default it has a maxsize of 128 entries per cached method? This method needs to be cache as it will be called multiple times per chunked insert rows, otherwise we could end up having the same issue as in #40609. A decorator also looks nicer instead of manually implementing it with a class member as it is a cross cutting concern.

@potiuk
Copy link
Member

potiuk commented Jul 18, 2024

It's a pitty that I can't use the functools.lru_cache decorator? I could use the cachetools cache decorator, but then that would add an extra dependency. Dunno why unctools.lru_cache is banned, as per default it has a maxsize of 128 entries per cached method? This method needs to be cache as it will be called multiple times per chunked insert rows, otherwise we could end up having the same issue as in #40609. A decorator also looks nicer instead of manually implementing it with a class member as it is a cross cutting concern.

Functools.lru_cache should not be used on object methods only on functions. https://stackoverflow.com/questions/33672412/python-functools-lru-cache-with-instance-methods-release-object

And we are using methodtools.lru_cache instead (already our dependency).

See #37757. Maybe a good idea (feel free to add PR) is to update the description of error and suggest to use methodtools insteadl

@dabla
Copy link
Contributor Author

dabla commented Jul 18, 2024

It's a pitty that I can't use the functools.lru_cache decorator? I could use the cachetools cache decorator, but then that would add an extra dependency. Dunno why unctools.lru_cache is banned, as per default it has a maxsize of 128 entries per cached method? This method needs to be cache as it will be called multiple times per chunked insert rows, otherwise we could end up having the same issue as in #40609. A decorator also looks nicer instead of manually implementing it with a class member as it is a cross cutting concern.

Functools.lru_cache should not be used on object methods only on functions. https://stackoverflow.com/questions/33672412/python-functools-lru-cache-with-instance-methods-release-object

And we are using methodtools.lru_cache instead (already our dependency).

See #37757. Maybe a good idea (feel free to add PR) is to update the description of error and suggest to use methodtools insteadl

Thank you Jarek good to know didn't catch that one but should have ;)

@dabla
Copy link
Contributor Author

dabla commented Jul 19, 2024

Integration test for mssql is failing, but don't know why cause I didn't change anything related to connection here for MsSqlHook.

FAILED tests/integration/providers/microsoft/mssql/hooks/test_mssql.py::TestMsSqlHook::test_get_conn - pymssql.exceptions.OperationalError: (20009, b'DB-Lib error message 20009, severity 9:\nUnable to connect: Adaptive Server is unavailable or does not exist (mssql)\n')
FAILED tests/integration/providers/microsoft/mssql/hooks/test_mssql.py::TestMsSqlHook::test_autocommit - pymssql.exceptions.OperationalError: (20009, b'DB-Lib error message 20009, severity 9:\nUnable to connect: Adaptive Server is unavailable or does not exist (mssql)\n')
ERROR tests/integration/providers/apache/hive/transfers/test_mssql_to_hive.py::TestMsSqlToHiveTransfer::test_execute - pymssql.exceptions.OperationalError: (20009, b'DB-Lib error message 20009, severity 9:\nUnable to connect: Adaptive Server is unavailable or does not exist (mssql)\n')
ERROR tests/integration/providers/google/cloud/transfers/test_bigquery_to_mssql.py::TestBigQueryToMsSqlOperator::test_execute - pymssql.exceptions.OperationalError: (20009, b'DB-Lib error message 20009, severity 9:\nUnable to connect: Adaptive Server is unavailable or does not exist (mssql)\n')
ERROR tests/integration/providers/google/cloud/transfers/test_mssql_to_gcs.py::TestMsSqlToGoogleCloudStorageOperator::test_execute - pymssql.exceptions.OperationalError: (20009, b'DB-Lib error message 20009, severity 9:\nUnable to connect: Adaptive Server is unavailable or does not exist (mssql)\n')

@potiuk
Copy link
Member

potiuk commented Jul 19, 2024

Not the most stable server via docker compose - it happens. Rerun it.

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants