Duplicate values in identity column when mixing redshift.copy and redshift.to_sql #2996

laderjk opened this issue Oct 15, 2024 · 2 comments

laderjk commented Oct 15, 2024

Hello everyone! We are having an issue with duplicated values in an identity column when executing redshift.to_sql() after executing a redshift.copy(). When inserting more than 1,000 records we use the redshift.copy() method, and when inserting fewer than 1,000 records we use the redshift.to_sql() method, as recommended in the docs.

This is the structure of the table:

create table if not exists dim_testing
(
    id BIGINT IDENTITY NOT NULL PRIMARY KEY,
    name varchar
);

This is the code with which we can reproduce the issue:

import random
import string

import awswrangler as wr
import pandas as pd

def generate_df(length):
    # Single-column DataFrame of random 30-character strings.
    return pd.DataFrame(
        [''.join(random.choices(string.ascii_letters + string.digits, k=30)) for _ in range(length)],
        columns=['name'],
    )

df = generate_df(1500)
# redshift.conn below is assumed to be an existing Redshift connection
# (e.g. created with wr.redshift.connect); it is not defined in this snippet.
wr.redshift.copy(
    df=df,
    con=redshift.conn,
    path=f's3://s3_bucket_path/dim_testing/',
    schema='public',
    table='dim_testing',
    use_column_names=True,
    mode='append'
)

df = generate_df(500)
wr.redshift.to_sql(
    df,
    redshift.conn,
    schema='public',
    table='dim_testing',
    use_column_names=True,
    mode='append'
)

After executing that code, if we look for duplicates on the identity column id, we find duplicated values:

select id, count(*)
from dim_testing
group by id
having count(*) > 1;

If we only use to_sql() or only copy(), the issue does not happen. It only happens when executing to_sql() after a copy().
Is this normal? Are we doing something wrong? Should we only choose one write method and not use the other?

Thanks!

laderjk added the question label on Oct 15, 2024
kukushking (Contributor) commented

Hi @laderjk, thanks for opening this - looks like a race condition. Can you try passing lock=True to both calls? This will lock the table and line up all the updates/inserts.
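
For reference, a minimal sketch of what that suggestion could look like when applied to the reproduction code above (lock is a parameter accepted by both wr.redshift.copy and wr.redshift.to_sql; redshift.conn and generate_df are the same assumed objects as in the original snippet):

# Bulk load via S3, serialized against other locked writers on the table.
wr.redshift.copy(
    df=generate_df(1500),
    con=redshift.conn,
    path='s3://s3_bucket_path/dim_testing/',
    schema='public',
    table='dim_testing',
    use_column_names=True,
    mode='append',
    lock=True,
)

# Small insert via to_sql, also requesting the table lock before writing.
wr.redshift.to_sql(
    df=generate_df(500),
    con=redshift.conn,
    schema='public',
    table='dim_testing',
    use_column_names=True,
    mode='append',
    lock=True,
)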

kukushking self-assigned this on Nov 15, 2024

laderjk commented Dec 2, 2024

Hi @kukushking, I tried passing lock=True to both calls and I get the same result. =(
