-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add skip_conflicts
option to load
function
#143
Comments
Having looked further into this, it won't be easy to implement. There are two main hurdles:
Here is some code that ChatGPT wrote to demonstrate what the command should be for PostgreSQL. You can see that it explicitly uses the |
Here is ChatGPT's suggestion for:
SELECT column_name
FROM information_schema.key_column_usage
WHERE constraint_name = (
SELECT constraint_name
FROM information_schema.table_constraints
WHERE table_name = 'your_table_name' AND constraint_type = 'PRIMARY KEY'
); And the translation to different database systems via:
Oracle: SELECT column_name
FROM all_cons_columns
WHERE constraint_name = (
SELECT constraint_name
FROM all_constraints
WHERE table_name = 'your_table_name' AND constraint_type = 'P'
); SQLite: PRAGMA table_info(your_table_name)
WHERE pk > 0; MS SQL Server: SELECT column_name
FROM information_schema.key_column_usage
WHERE constraint_name = (
SELECT constraint_name
FROM information_schema.table_constraints
WHERE table_name = 'your_table_name' AND constraint_type = 'PRIMARY KEY'
); Note that I haven't checked them, and that the MS SQL Server value looks suspiciously similar to the PostgreSQL one. |
I think that we could probably do this at some point. It would be better to implement this as a new function, This would use a function behind the scenes called |
I've used COPY to transfer rows in a db migration; this can be 10,000s or millions. My process is:
In one table there is a composite unqiue key (column_X, column_Y) where if there is a conflict in 1 of the 5000, because it is a single tranaction and the cursor is not commited, none of rows are copied. On a conflict I would like to ignore You can also skip conflicts using Unfortunately you cannot With that in mind, if it fits your use cases too it would be great if a copy function could specify a |
skip_conflicts
and replace_conflicts
options to load
and copy_table_rows
merge
function
merge
functionskip_conflicts
option to load
function
The first task, whatever we do is to determine the primary keys of a table. The specific queries are given above. The primary key columns can be added as an extra attribute to the Line 13 in 456f0db
The current etlhelper/test/integration/test_utils.py Line 10 in 456f0db
We would need to add a |
Summary
As an ETL Helper user I want to use
skip_conficts
so that I can handle primary key violations at the database level without requiring ETL Helper's error handling.Description
A common error in a big insert is where records already exist for a given primary key. Some databases have internal methods for handling these scenarios. For example SQLite has the
ON CONFLICT
orINSERT
syntax:https://www.sqlite.org/lang_conflict.html. PostgreSQL has ON CONFLICT DO NOTHING: https://www.postgresql.org/docs/current/sql-insert.html#:~:text=ON%20CONFLICT%20DO%20NOTHING%20simply,can%20perform%20unique%20index%20inference.
There should be implementations for each database. The specific way in which they are applied will need to be configured in the DbHelper classes. The SQL statement generated for the
load
function may need changes around the INSERT keyword, or as a suffix.Acceptance critera
load
withskip_conflicts=True
means that primary key violations pass silently and rows are unchangedThe text was updated successfully, but these errors were encountered: