Optional Postgres Insert Speed-up #79
Hi, thanks for sharing this, it looks great! As the article says, the big speed-up comes from streaming the data into PostgreSQL's COPY command via a file-like object. The method on the webpage feeds psycopg2's `copy_from` from a string iterator, and you can get the same effect with etlhelper by using `iter_rows` with a custom `row_factory` that turns each row into a delimited text line:

```python
from etlhelper import iter_rows, DbParams

SRC_DB = DbParams(dbtype="ORACLE", ...)
DEST_DB = DbParams(dbtype="PG", ...)

select_sql = "SELECT id, value FROM my_table"


def csv_string_row_factory(cursor):
    """Replace default tuple output with a delimited text string e.g. csv."""
    def create_row(row):
        # Trailing newline separates rows in the COPY stream
        return "{}|{}\n".format(*row)
    return create_row


# Do the transformation
with SRC_DB.connect('PASSWORD_VARIABLE') as src_conn:
    csv_string_generator = iter_rows(select_sql, src_conn,
                                     row_factory=csv_string_row_factory)
    # StringIteratorIO wraps the generator as a file-like object (see below)
    dummy_csv_file = StringIteratorIO(csv_string_generator)
    with DEST_DB.connect('ANOTHER_PASSWORD_VARIABLE') as dest_conn:
        with dest_conn.cursor() as cursor:
            cursor.copy_from(dummy_csv_file, 'target_table', sep='|', size=5000)
```

Have a go and let me know how you get on.
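For reference, `StringIteratorIO` is not part of etlhelper or psycopg2; it is the file-like wrapper described in the linked article. A minimal sketch of the idea (written from scratch here, so treat the details as illustrative rather than the article's exact code) could look like this:

```python
import io
from typing import Iterator, Optional


class StringIteratorIO(io.TextIOBase):
    """Wrap an iterator of strings as a read-only file-like object.

    copy_from reads from it in chunks, so the full CSV text never has
    to be held in memory at once.
    """

    def __init__(self, string_iterator: Iterator[str]):
        self._iterator = string_iterator
        self._buffer = ""

    def readable(self) -> bool:
        return True

    def read(self, size: Optional[int] = -1) -> str:
        # Pull strings from the iterator until `size` characters are
        # available (or the iterator is exhausted); size < 0 reads everything.
        chunks = []
        remaining = None if size is None or size < 0 else size
        while remaining is None or remaining > 0:
            if not self._buffer:
                try:
                    self._buffer = next(self._iterator)
                except StopIteration:
                    break
                continue  # re-check in case the iterator yielded an empty string
            take = self._buffer if remaining is None else self._buffer[:remaining]
            chunks.append(take)
            self._buffer = self._buffer[len(take):]
            if remaining is not None:
                remaining -= len(take)
        return "".join(chunks)
```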
It may be possible to write a more generic, csv-based `row_factory` for this. This would have the benefit of taking care of arbitrary row lengths and quoting of text strings etc. We could include that in the main etlhelper code; a sketch of what it might look like is below.
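One hedged sketch of such a generic row factory, using the standard library's `csv` module for quoting (the name `csv_row_factory` is just illustrative; with full CSV quoting you would pair it with `copy_expert("COPY target_table FROM STDIN WITH (FORMAT csv)", ...)` rather than the plain-text `copy_from`):

```python
import csv
import io


def csv_row_factory(cursor):
    """Return a function that converts any row tuple into one CSV line.

    csv.writer handles arbitrary row lengths and quotes text values, so
    embedded delimiters, quotes and newlines don't break the COPY stream.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")

    def create_row(row):
        buffer.seek(0)
        buffer.truncate(0)
        writer.writerow(row)  # quoting and escaping handled by the csv module
        return buffer.getvalue()

    return create_row
```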
StackOverflow code is CC-BY-SA-4.0. But a StringIteratorIO could be in a module of its own and the different licence acknowledged there.
Oh nice! Thanks for the detailed explanation :) Good to note about the license as well. I'll give it a try today and keep you posted on how it goes!
Managed to get it working using this approach. Still testing speed with some ETL trials.
As an update, I've been seeing the best results (i.e. fewest errors) with this approach. Not sure if it's something that would need to be integrated into etlhelper at this point, but figured I'd post it here in case it helps anyone else. I'm also using primary-key sorting on the Oracle side to allow for pagination of the source table, so that a new connection is created for every x million rows (thereby avoiding timeout issues); a rough sketch of that pagination idea is below.
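A minimal sketch of that keyset-style pagination, reusing the names from the snippet earlier in the thread (the batch size, column names and bind-variable syntax are assumptions, and Oracle's `FETCH FIRST` clause needs 12c or later):

```python
BATCH_SIZE = 1_000_000  # stand-in for "every x million rows"

paged_sql = """
    SELECT id, value FROM my_table
    WHERE id > :last_id
    ORDER BY id
    FETCH FIRST {batch} ROWS ONLY
""".format(batch=BATCH_SIZE)

last_id = 0
while True:
    # A fresh source connection per batch avoids long-lived connections timing out
    with SRC_DB.connect('PASSWORD_VARIABLE') as src_conn:
        rows = list(iter_rows(paged_sql, src_conn,
                              parameters={"last_id": last_id},
                              row_factory=csv_string_row_factory))
    if not rows:
        break

    with DEST_DB.connect('ANOTHER_PASSWORD_VARIABLE') as dest_conn:
        with dest_conn.cursor() as cursor:
            cursor.copy_from(StringIteratorIO(iter(rows)), 'target_table', sep='|')

    # Rows are "id|value\n" strings sorted by id, so the last one gives the cursor
    last_id = int(rows[-1].split('|', 1)[0])
```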
Great! Thanks for the update. Keep us updated if you find any other tweaks. I think it is worth adding as a recipe in the README at some point at least. There is a ticket for that. I also wonder if a generic `copy_from`-based loader could be built into etlhelper itself, as discussed above.
Hi @rwolniak,

This issue has been around a while, but I've kept it open because I'd still like to create a "Recipe" for this once we move to a ReadTheDocs-style documentation page.

In the meantime, I was working on an Oracle to PostgreSQL ETL and I used pyinstrument to try to see where I could speed it up. It turned out that reading the CLOB columns was taking most of the time. I found there is another way to read CLOB objects directly as strings. When I applied it, I had a 10x speed-up. I plan to make this the default behaviour in a future release: #110 (comment)

If your pipeline involves CLOBs or BLOBs it might benefit from this approach.

Cheers
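For anyone reading along, the underlying mechanism is cx_Oracle's output type handler, which tells the driver to fetch LOB columns as plain strings/bytes instead of LOB locators. This is the standard cx_Oracle recipe, not necessarily the exact code that went into #110:

```python
import cx_Oracle


def output_type_handler(cursor, name, default_type, size, precision, scale):
    """Fetch CLOBs as strings and BLOBs as bytes instead of LOB locators.

    Reading a LOB locator costs an extra round trip per value, which is
    what makes CLOB-heavy extracts slow.
    """
    if default_type == cx_Oracle.DB_TYPE_CLOB:
        return cursor.var(cx_Oracle.DB_TYPE_LONG, arraysize=cursor.arraysize)
    if default_type == cx_Oracle.DB_TYPE_BLOB:
        return cursor.var(cx_Oracle.DB_TYPE_LONG_RAW, arraysize=cursor.arraysize)


# Applied to the source connection before running the extract query, e.g.:
# src_conn.outputtypehandler = output_type_handler
```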
I've been doing some research into speeding up my ETLs for Oracle -> Postgres using etlhelper, and came across this article for loading data into PostgreSQL. It looks like using psycopg2's `copy_from` in combination with `io.StringIO` could result in up to 5x performance improvements on the Postgres side. Is there a way to leverage this for etlhelper? Maybe an optional flag for the postgres db_helper's `executemany` function to use it? Would be amazing to be able to cut my ETL time down for multi-million-row tables.
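For context, a minimal sketch of the pattern the article describes (table, column names and connection details here are made up): build the rows into an in-memory buffer and hand it to `copy_from` in one go instead of running `executemany`:

```python
import io

import psycopg2

rows = [(1, "alpha"), (2, "beta")]  # stand-in for rows fetched from Oracle

# Build a file-like object containing one delimited line per row
buffer = io.StringIO()
for row in rows:
    buffer.write("|".join(str(value) for value in row) + "\n")
buffer.seek(0)

conn = psycopg2.connect("dbname=target")  # connection details assumed
with conn, conn.cursor() as cursor:
    # COPY loads the whole buffer in one round trip, unlike row-by-row INSERTs
    cursor.copy_from(buffer, "target_table", sep="|", columns=("id", "value"))
```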