-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added Database class #97
base: main
Are you sure you want to change the base?
Conversation
Thanks - this is very interesting. I'll think about this and reply properly later. |
Check my second commit. I've implemented connection pooling and MySQL DbHelper. There's no backward incompatible changes, so module can still be used as before, but now there's more :D with oracle:
with sqlite:
rows = oracle.iter_rows(select)
sqlite.executemany(selectlite, rows) There're some caveats though:
|
Thanks again for all this. There is a lot of good code in here that would improve The current merge request covers many different concepts and it may be easier to incorporate if it was divided into smaller requests (MySQL, connection pooling, Database class). Each of these should have tests, too. MySQL connectionThe simplest one is the MySQL DbHelper. We would be very grateful for this. Please have a look at the open merge request for the Informix driver (#25) for an example of what is required. That request stalled because I couldn't get the Docker container working to run the tests. If you know how to set up a MySQL Docker container and can add instructions to CONTRIBUTING.md, that would be excellent. Note that most of our integration test suite can be run locally with just a PostgreSQL docker container. Only the Oracle and MSSQL Server tests require those databases. We run those inside BGS against our own servers. See CONTRIBUTING.md for details. Connection poolingWe have an open ticket on connection pooling (#94). The ticket isn't fully defined yet and we haven't set acceptance criteria. We would welcome your thoughts this. Database classThe code in this merge request represents quite a big change to On the other hand, this Database class represents the logical next step from when we added the The Database class, as defined here, repeats a lot of code from etlhelper/etlhelper/db_params.py Line 125 in bd2d693
I like that you store the password_variable on the class, so that you don't to add it again. That's a good call, but I think it's cleaner to add it separately rather than modify existing functions. How about something like this? class Database:
def __init__(self, **kwargs):
self.db_params = DbParams(**kwargs)
self.connection_args = {}
self._password_variable = None
@property
def password_variable(self):
if not self._password_varible:
raise SomeException("explain what this is and why you need it")
return self._password_variable
@password_variable.setter
def password_variable(self, value):
self._password_variable = value
def connect(self):
return self.db_params.connect(self.password_variable, **self.connection_args)
def get_rows(sql, ...):
with self.connect() as conn:
return get_rows(sql, conn, ...) This can be used as: oracle = Database(dbtype="ORACLE", ...)
oracle.password_variable = "ORACLE_PASSWORD"
oracle.connection_args = {"encoding": "utf8"}
sql = "SELECT ..."
rows = oracle.get_rows(sql) This provides similar functionality without having to modify existing code. It keeps things simple by not using persistent connections and also provides a way to get a connection if you want to do something manual with the cursors.
I also don't have a good sense of how Thanks again for your contributions so far. It is a very nice surprise when these helpful fixes appear out of nowhere. How did you find out about |
I'm currently working on a project where I need to periodically transfer data from Oracle to Postgre. Lucky for us we have full control over Postgre server, so obvious choice was Foreign Data Wrapper for Oracle. However it requires manual compilation and installation - luxury we do not always have in our projects. Also project is quite big and a need may arise to integrate with other DB types. So I decided to seek for other ETL solutions, preferably written in Python. Pandas was an option, but it's unable to "upsert" data, I solved the issue with pangres. However performance was not quite good: it worked 5 times slower than FDW solution. So after googling a bit I encountered pyetl which looks interesting, but does not work with Postgres and also ETLHelper, which worked flawlessly and worked 2 times faster than pandas solution (2.5 times slower than FDW) which is I think best we can get while transferring data through Python. So currently I do not use ETLHelper much, but I may start using it in near future. As there's always need to transfer data from one place to another. That's why I decided to improve it. It's always nice to have ready to use tool which simply does the job. |
That's doable. |
Unfortunately we cannot get much flexibility here. Cause for example in case of psycopg2 *args and **kwargs are passed to the connect() function. This means that even if you connect to the same database, but with different credentials you'll need to create new connection pool. Same stuff happens with other DB types. That's why my current implementation creates pools based on DbParams(host, port etc), it uses |
Actually that's why
Well actually it leads to barely maintainable code. I did not like the way etlhelper/etlhelper/db_params.py Lines 59 to 61 in bd2d693
Where new DbHelper instantiated just to get it's required_params . Or with already mentioned connect where class method simply calls external function:etlhelper/etlhelper/db_params.py Lines 125 to 133 in bd2d693
I understand that you try to keep backward compatibility, but sometimes it's better to refactor completely, before project turns into development hell :D |
Yep, there're places that need rework. Now I think that |
Yep, additional methods for dealing with |
Actually in current version it never closes unless you use with oracle:
do_something() |
Overall: as I said there are still places to improve the code, which are not here yet cause I try not to change existing source code at all to avoid breaking anything, while try to reuse parts of it and keep familiar look. This database class is not the final one I think cause I have even more ideas how to improve the module. For example I'm thinking about adding something like |
Hi, I had another chance to look through this. I do like the idea of the Database class providing Our aim with I still think the best way to implement a Database would be to re-use the existing functions. An updated implementation (taking into account your suggestions on adding passwords) is below. This version solves the issue of closing connections: class Database:
def __init__(self, password_variable=None, connection_args={}, **kwargs):
self.password_variable = password_variable
self.connection_args = connection_args
self.db_params = DbParams(**kwargs)
@classmethod
def from_db_params(cls, db_params, password_variable=None, connection_args={}):
return cls(password_variable=password, connection_args=connection_args, **db_params)
def connect(self):
return self.db_params.connect(self.password_variable, **self.connection_args)
def get_rows(sql, ...):
with self.connect() as conn:
rows = get_rows(sql, conn, ...)
return rows
def iter_rows(sql, ...):
try:
conn = self.connect()
iterator = iter_rows(sql, conn, ...)
# Connection will be closed when iterator is exhausted
rows = ConnectionClosingIterator(iterator, conn)
finally:
# close connection if something goes wrong while preparing iterator
conn.close()
return rows
class ConnectionClosingIterator():
# This class-level variable is shared by all instances
_active_connections = set()
def __init__(self, iterator, conn):
self.iterator = iterator
if conn in self._active_connections:
# If same connection is used by two iterators then the
# second will break when the first closes the connection
raise Exception("Can't use same connection for two iterators")
self.conn = conn
self._active_connections.add(conn)
def __iter__(self):
try:
yield from self.iterator
finally:
self.conn.close()
self._active_connections.remove(self.conn)
def __next__(self):
try:
return next(self.iterator)
except StopIteration:
self.conn.close()
self._active_connections.remove(self.conn)
raise For connection pooling, I'd like to add the incrementally, too. If we can get it working outside a Database class, it will be easy to add it in afterwards. With an implementation based on the above, connecting to a database as a different user would count as a different database and so would have a pool of its own. def __init__(self, ...):
...
self.session = None
def start_session(self, **kwargs):
self.session = ...
def end_session(self, **kwargs):
self.session.close(...)
def connect(self):
if self.session:
conn = self.db_params.session_connect(self.session)
else:
self.db_params.connect(self.password_variable, **self.connection_args)
return conn It is likely to be next year before we can do significant work on this, but I'm making the notes here while they are fresh in my mind, particularly @spenny-liam's iterator that closes the connection. |
Hey! Check this out: I've created
Database
class which aggregates module functionality in single place, creating more maintainable code. It rendersdb_params.py
,etl.py
and some parts ofdb_helper_factory.py
obsolete. There is still some place for improvement, like refactoringDbHelper
class, but for now this is enough I guess.Module is now easier to use, like: