(PPS-107): Feat/single table driver #376
Conversation
- Implemented migrate_db for single_table_alchemy
- Implemented session for single_table_alchemy
This isn't a full review, but I figured I'd share the comments I have so far.
indexd/default_settings.py (Outdated)

```diff
@@ -17,7 +17,7 @@
 # will be created as "<PREFIX><PREFIX><GUID>".
 CONFIG["INDEX"] = {
     "driver": SQLAlchemyIndexDriver(
-        "sqlite:///index.sq3",
+        "postgres://postgres:postgres@localhost:5432/indexd_tests",  # pragma: allowlist secret
```
What's the idea behind this change? Did SQLite not work easily with your changes?
To be clear, I think it's fine (it's more consistent to default to postgres)
Yeah, SQLite doesn't handle some of the newer data types. I moved all the tests to use postgres instead of sqlite too, just to stay consistent; I saw there was a mix.
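For example (a hypothetical model to illustrate the point, not code from this PR): postgres-specific SQLAlchemy column types such as JSONB have no SQLite rendering, so a schema that uses them only works against postgres:

```python
from sqlalchemy import Column, String, create_engine
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Record(Base):  # hypothetical single-table model
    __tablename__ = "record"
    guid = Column(String, primary_key=True)
    url_metadata = Column(JSONB)  # postgres-only type

# Fine on postgres, but compiling JSONB for the sqlite dialect raises an
# UnsupportedCompilationError:
# Base.metadata.create_all(create_engine("sqlite://"))
```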
```diff
@@ -4,7 +4,7 @@
 from indexd.index.drivers.alchemy import SQLAlchemyIndexDriver
```
isn't this an unused import now?
```diff
@@ -80,7 +80,6 @@ def query_urls(
         query = query.having(
             ~q_func["string_agg"](IndexRecordUrl.url, ",").contains(exclude)
         )
-        print(query)
```
Maybe let's debug-log instead of removing it entirely? I'm not sure whether people expect this output to exist.
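Something like this, as a sketch (assuming the module grabs a standard library logger; the logger name is an assumption):

```python
import logging

logger = logging.getLogger(__name__)

# ... inside query_urls, after the query is assembled:
logger.debug("query_urls generated SQL: %s", query)  # instead of print(query)
```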
I had some comments before I left for PTO that I forgot to send, apologies. Here they are now.
```python
    baseid = Column(String, index=True)
    rev = Column(String)
    form = Column(String)
    size = Column(BigInteger, index=True)
```
why are we indexing on size, file_name, version, uploader?
Removing it. I can't remember why I did that; it's been a while.
""" | ||
Get the full index document | ||
""" | ||
# TODO: some of these fields may not need to be a variable and could directly go to the return object -Binam |
could you do this?
```python
        Base.metadata.bind = self.engine
        self.Session = sessionmaker(bind=self.engine)

    def migrate_index_database(self):
```
We should just remove this for the new driver if we can, and rely on Alembic.
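For instance, a sketch of delegating to Alembic via its Python API (the alembic.ini path and setup are assumptions):

```python
from alembic import command
from alembic.config import Config

def migrate_index_database(self):
    # Let Alembic own the schema instead of hand-rolled migration code.
    alembic_cfg = Config("alembic.ini")  # assumed config location
    command.upgrade(alembic_cfg, "head")
```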
""" | ||
session = self.Session() | ||
|
||
try: |
I think we can simplify all this by following the guide here: https://docs.sqlalchemy.org/en/14/orm/session_basics.html#framing-out-a-begin-commit-rollback-block and using Session.begin(). I think you can just return that, since it's a context manager.
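Roughly (a sketch, not the final code; note that calling .begin() on a sessionmaker needs SQLAlchemy 1.4.24+, and self.Session here is the sessionmaker from __init__):

```python
def session(self):
    # Session.begin() returns a context manager that commits on success
    # and rolls back on any exception, replacing the manual
    # try/commit/rollback framing.
    return self.Session.begin()
```

Usage would then just be `with driver.session() as session: ...`.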
bin/migrate_to_single_table.py (Outdated)

```diff
@@ -0,0 +1,268 @@
+"""
+
```
Can you add a quick description here of how to use this?
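Something like this hypothetical module docstring could work (the --creds-path flag is inferred from the args.creds_path usage below; the wording is illustrative):

```python
"""
Migrate records from the legacy multi-table indexd schema into the new
single-table "record" schema.

Usage:
    python bin/migrate_to_single_table.py --creds-path /path/to/creds.json
"""
```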
bin/migrate_to_single_table.py (Outdated)

```python
    args = parser.parse_args()
    migrator = IndexRecordMigrator(conf_data=args.creds_path)
    migrator.index_record_to_new_table()
    # cProfile.run("migrator.index_record_to_new_table()", filename="profile_results.txt")
```
Dead code; remove it if it's not being used.
bin/migrate_to_single_table.py (Outdated)

```python
        self.session = Session()

    def index_record_to_new_table(self, batch_size=1000, retry_limit=4):
```
I imagine this batch_size could be much larger. Where's the job that will run this? How much RAM do you plan on providing? Indexd record text is pretty darn small, maybe 1000 characters on average. At 1 byte per ASCII char, that's ~1 KB per record, so 1000 records ≈ 1 MB. If we have 1 GB of usable RAM, we could probably go up to something like 10000 records.
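The back-of-the-envelope math, as a runnable sketch (every number here is an assumption, not a measurement):

```python
AVG_RECORD_BYTES = 1_000        # ~1000 ASCII chars per record (assumed)
USABLE_RAM_BYTES = 1 * 1024**3  # 1 GiB of headroom (assumed)

def batch_memory(batch_size):
    """Rough in-memory footprint of one batch, in bytes."""
    return batch_size * AVG_RECORD_BYTES

assert batch_memory(1_000) == 1_000_000         # ~1 MB per 1000 records
assert batch_memory(10_000) < USABLE_RAM_BYTES  # ~10 MB, comfortably small
```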
I'm not sure we should do that, but if we can do it reliably, it might be faster (reducing the total number of queries and commits to the db). Would outputting to a file and using \COPY be faster? Don't you guys use \COPY for something already? Did you consider this?
I did consider this. I think there are a couple of things here:
- We had a conversation about whether to use a SQL + bash script vs. Python. We went with Python because it would be easier for any Gen3 user to run.
- Should we do Python + \COPY? My concern is that this would blow up for something like DCF, where we have about 56 million records; I think we'd use up a lot of memory. We could do it in batches though 🤔
Yeah, it's been a while since we had the design conversations around the options here. Python makes sense; we're already so close with this. I was just thinking out loud and not remembering previous conversations. Let's not shift gears at this point. I am curious whether you were able to tune these numbers and avoid using OFFSET, and whether that helped.
bin/migrate_to_single_table.py (Outdated)

```python
        for offset in range(0, total_records, batch_size):
            stmt = (
                self.session.query(IndexRecord)
                .offset(offset)
```
This OFFSET is likely costing us quite a bit once we get to large numbers. There are alternative pagination methods, but they rely on some ordering. I'm also concerned that we're not using ORDER BY here; I don't think postgres guarantees row order unless you use it (https://www.postgresql.org/docs/current/queries-order.html). And if we end up using ORDER BY, then I think you can change to NOT use OFFSET and instead use a WHERE + LIMIT:
```sql
SELECT *
FROM index_records
WHERE guid > {{last_seen_guid}}
ORDER BY guid
LIMIT {{batch_size}}
```

And then be sure to save the last_seen_guid after processing the last record in the batch.
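In the ORM, that keyset approach could look roughly like this (a sketch assuming the GUID column is IndexRecord.did; adjust to the real column name):

```python
last_seen_guid = ""
while True:
    batch = (
        self.session.query(IndexRecord)
        .filter(IndexRecord.did > last_seen_guid)  # WHERE guid > last seen
        .order_by(IndexRecord.did)                 # deterministic order
        .limit(batch_size)
        .all()
    )
    if not batch:
        break
    for record in batch:
        ...  # copy the record into the single-table schema
    last_seen_guid = batch[-1].did  # save after processing the batch
```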
Jira Ticket: PXP-xxxx

Coverage went down by 2% because we removed `test_setup.py`. The coverage went down for the `utils.py` file, which contains the old, deprecated method of database migration that uses `sqlite3`.

New Features
- New single-table driver that stores index records in a single `record` table (configurable and requires migration)

Breaking Changes

Bug Fixes

Improvements
- Removed `sqlite` from unit tests. Unit tests now only use `postgres`.

Dependency updates
- `sqlalchemy` ~1.3.3 -> ^1.4.0

Deployment changes