Duplicate values in identity column when mixing redshift.copy and redshift.to_sql #2996

laderjk opened this issue Oct 15, 2024 · 2 comments

laderjk commented Oct 15, 2024

Hello everyone! We are having an issue with duplicated values in an identity column when executing redshift.to_sql() after executing a redshift.copy(). When inserting more than 1,000 records we use the redshift.copy() method, and when inserting fewer than 1,000 records we use the redshift.to_sql() method, as recommended in the docs.

This is the structure of the table:

create table if not exists dim_testing
(
    id BIGINT IDENTITY NOT NULL PRIMARY KEY,
    name varchar
);

This is the code with which we can reproduce the issue:

import random
import string

import awswrangler as wr
import pandas as pd

def generate_df(length):
    # Single-column DataFrame of random 30-character strings.
    return pd.DataFrame(
        [''.join(random.choices(string.ascii_letters + string.digits, k=30)) for _ in range(length)],
        columns=['name'],
    )

df = generate_df(1500)
# redshift.conn below is assumed to be an existing Redshift connection
# (e.g. created with wr.redshift.connect); it is not defined in this snippet.
wr.redshift.copy(
    df=df,
    con=redshift.conn,
    path=f's3://s3_bucket_path/dim_testing/',
    schema='public',
    table='dim_testing',
    use_column_names=True,
    mode='append'
)

df = generate_df(500)
wr.redshift.to_sql(
    df,
    redshift.conn,
    schema='public',
    table='dim_testing',
    use_column_names=True,
    mode='append'
)

After executing that code, if we look for duplicates on the identity column id, we find duplicated values:

select id, count(*)
from dim_testing
group by id
having count(*) > 1;

If we only use to_sql() or only copy(), the issue does not happen. It only happens when executing to_sql() after a copy().
Is this normal? Are we doing something wrong? Should we only choose one write method and not use the other?

Thanks!

laderjk added the question label on Oct 15, 2024
kukushking (Contributor) commented

Hi @laderjk, thanks for opening this - looks like a race condition. Can you try passing lock=True to both calls? This will lock the table and line up all the updates/inserts.
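
For reference, a minimal sketch of what that suggestion could look like when applied to the reproduction code above (lock is a parameter accepted by both wr.redshift.copy and wr.redshift.to_sql; redshift.conn and generate_df are the same assumed objects as in the original snippet):

# Bulk load via S3, serialized against other locked writers on the table.
wr.redshift.copy(
    df=generate_df(1500),
    con=redshift.conn,
    path='s3://s3_bucket_path/dim_testing/',
    schema='public',
    table='dim_testing',
    use_column_names=True,
    mode='append',
    lock=True,
)

# Small insert via to_sql, also requesting the table lock before writing.
wr.redshift.to_sql(
    df=generate_df(500),
    con=redshift.conn,
    schema='public',
    table='dim_testing',
    use_column_names=True,
    mode='append',
    lock=True,
)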

kukushking self-assigned this on Nov 15, 2024

laderjk commented Dec 2, 2024

Hi @kukushking, I tried passing lock=True to both calls and I get the same result. =(
