Add `skip_conflicts` option to `load` function #143

volcan01010 · 2022-11-07T10:15:40Z

Note: this might be a bit complicated - maybe users should just use executemany and copy_rows and write their own INSERT statements to ensure that they get exactly what they want.

Summary

As an ETL Helper user I want to use skip_conficts so that I can handle primary key violations at the database level without requiring ETL Helper's error handling.

Description

A common error in a big insert is where records already exist for a given primary key. Some databases have internal methods for handling these scenarios. For example SQLite has the ON CONFLICT or INSERT syntax:
https://www.sqlite.org/lang_conflict.html. PostgreSQL has ON CONFLICT DO NOTHING: https://www.postgresql.org/docs/current/sql-insert.html#:~:text=ON%20CONFLICT%20DO%20NOTHING%20simply,can%20perform%20unique%20index%20inference.

There should be implementations for each database. The specific way in which they are applied will need to be configured in the DbHelper classes. The SQL statement generated for the load function may need changes around the INSERT keyword, or as a suffix.

Acceptance critera

Calling load with skip_conflicts=True means that primary key violations pass silently and rows are unchanged

The text was updated successfully, but these errors were encountered:

volcan01010 · 2023-05-05T08:28:41Z

Having looked further into this, it won't be easy to implement. There are two main hurdles:

Writing a SQL command that includes solving of conflicts requires ETLHelper to know what the primary key is for a table. It turns out that this isn't straightforward to find out. We discovered this when we wrote the table_info command. For example, a primary key may be a composite of multiple columns.
The syntax for different databases is different. This is less of a problem and is what ETL Helper was designed for. But there would still be a bit of work involved.

Here is some code that ChatGPT wrote to demonstrate what the command should be for PostgreSQL. You can see that it explicitly uses the id command.

volcan01010 · 2023-05-05T12:54:07Z

ChatGPT was also able to generate equivalent commands for Oracle, SQLite and MS SQL Server. There are two things to note:

Oracle and SQL Server use MERGE, where SQLite and PostgreSQL use INSERT ... ON CONFLICT
MERGE commands also require knowlege of the primary key for the ON clause

volcan01010 · 2023-05-05T13:05:22Z

Here is ChatGPT's suggestion for:

Please write a SQL command that lists the primary key columns for a table in a PostgreSQL database.

SELECT column_name
FROM information_schema.key_column_usage
WHERE constraint_name = (
    SELECT constraint_name
    FROM information_schema.table_constraints
    WHERE table_name = 'your_table_name' AND constraint_type = 'PRIMARY KEY'
);

And the translation to different database systems via:

Can you give equivalent commands for Oracle, SQLite and MS SQL Server databases?

Oracle:

SELECT column_name
FROM all_cons_columns
WHERE constraint_name = (
    SELECT constraint_name
    FROM all_constraints
    WHERE table_name = 'your_table_name' AND constraint_type = 'P'
);

SQLite:

PRAGMA table_info(your_table_name)
WHERE pk > 0;

MS SQL Server:

SELECT column_name
FROM information_schema.key_column_usage
WHERE constraint_name = (
    SELECT constraint_name
    FROM information_schema.table_constraints
    WHERE table_name = 'your_table_name' AND constraint_type = 'PRIMARY KEY'
);

Note that I haven't checked them, and that the MS SQL Server value looks suspiciously similar to the PostgreSQL one.

volcan01010 · 2023-05-17T15:11:38Z

I think that we could probably do this at some point.

It would be better to implement this as a new function, merge, rather than updating the load function as that would better reflect the fact that the underlying mechanism is different.

This would use a function behind the scenes called generate_merge_sql instead of generate_insert_sql that is used by load.

kerberpolis · 2023-06-07T10:48:17Z

I've used COPY to transfer rows in a db migration; this can be 10,000s or millions. My process is:

I export items from a source databse
Augment the data
In batches of 5000 copy the augmented data to the target table.

In one table there is a composite unqiue key (column_X, column_Y) where if there is a conflict in 1 of the 5000, because it is a single tranaction and the cursor is not commited, none of rows are copied. On a conflict I would like to ignore ON CONFLICT (column_X, column_Y) DO NOTHING as they have already been migrated (so this is relevant for skip_conflict=True only.)

You can also skip conflicts using ON CONFLICT ON CONSTRAINT [constraint_name] DO NOTHING.

Unfortunately you cannot ON CONFLICT (id, column_X, column_Y) ... as the PK and UQ constraints are independant. Neither can you chain them ON CONFLICT (id) ... ON CONFLICT (column_X, column_Y) ...

With that in mind, if it fits your use cases too it would be great if a copy function could specify a constraint_name instead of only being a PK. For case witth more than one constraint, it would require the developer to either pre-filter their data beforehand. Or perform some kind of upsert but at that point it might be easier to supply their own SQL statement...

volcan01010 · 2023-07-20T16:12:47Z

The first task, whatever we do is to determine the primary keys of a table.

The specific queries are given above.

The primary key columns can be added as an extra attribute to the table_info response.

etlhelper/etlhelper/utils.py

Line 13 in 456f0db

def table_info(table, conn, schema=None):

The current table_info response looks like:

etlhelper/test/integration/test_utils.py

Line 10 in 456f0db

    
           def test_table_info_no_schema_no_duplicates(pgtestdb_conn, pgtestdb_test_tables):

We would need to add a primary_key_query to each DbHelper that performed a similar job to the table_info_query.
https://github.com/BritishGeologicalSurvey/etlhelper/blob/456f0db009524354d0377621067ba56e8ac441da/etlhelper/db_helpers/postgres.py#L13C4-L13C21

volcan01010 changed the title ~~Add skip_conflicts and replace_conflicts options to load and copy_table_rows~~ Create merge function Jul 20, 2023

volcan01010 changed the title ~~Create merge function~~ Add skip_conflicts option to load function Jul 20, 2023

volcan01010 mentioned this issue Jul 20, 2023

Add merge function #168

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `skip_conflicts` option to `load` function #143

Add `skip_conflicts` option to `load` function #143

volcan01010 commented Nov 7, 2022 •

edited

Loading

volcan01010 commented May 5, 2023

volcan01010 commented May 5, 2023

volcan01010 commented May 5, 2023

volcan01010 commented May 17, 2023

kerberpolis commented Jun 7, 2023

volcan01010 commented Jul 20, 2023

Add skip_conflicts option to load function #143

Add skip_conflicts option to load function #143

Comments

volcan01010 commented Nov 7, 2022 • edited Loading

Summary

Description

Acceptance critera

volcan01010 commented May 5, 2023

volcan01010 commented May 5, 2023

volcan01010 commented May 5, 2023

volcan01010 commented May 17, 2023

kerberpolis commented Jun 7, 2023

volcan01010 commented Jul 20, 2023

Add `skip_conflicts` option to `load` function #143

Add `skip_conflicts` option to `load` function #143

volcan01010 commented Nov 7, 2022 •

edited

Loading