
(PPS-107): Feat/single table driver #376

Merged: 49 commits, merged Sep 10, 2024

Conversation

Contributor

@BinamB BinamB commented Feb 5, 2024

Jira Ticket: PXP-xxxx

Coverage went down by 2% because we removed test_setup.py; the drop is in utils.py, which contains the old, deprecated sqlite3-based database migration method.

New Features

  • Implemented a single-table architecture for IndexD backed by a single new record table (configurable, and requires a migration)

Breaking Changes

Bug Fixes

Improvements

  • Removed SQLite from unit tests; unit tests now use Postgres only

Dependency updates

  • sqlalchemy ~1.3.3 -> ^1.4.0

Deployment changes

Contributor

@Avantol13 Avantol13 left a comment

This isn't a full review, but I figured I'd share the comments I have so far.

@@ -17,7 +17,7 @@
 # will be created as "<PREFIX><PREFIX><GUID>".
 CONFIG["INDEX"] = {
     "driver": SQLAlchemyIndexDriver(
-        "sqlite:///index.sq3",
+        "postgres://postgres:postgres@localhost:5432/indexd_tests",  # pragma: allowlist secret
Contributor

What's the idea behind this change? Did SQLite not work easily with your changes?

To be clear, I think it's fine (it's more consistent to default to Postgres).

Contributor Author

Yeah, SQLite doesn't handle some of the newer data types. I moved all the tests to use Postgres instead of SQLite too, just to stay consistent; I saw there was a mix.

@@ -4,7 +4,7 @@
from indexd.index.drivers.alchemy import SQLAlchemyIndexDriver
Contributor

Isn't this an unused import now?

@@ -80,7 +80,6 @@ def query_urls(
         query = query.having(
             ~q_func["string_agg"](IndexRecordUrl.url, ",").contains(exclude)
         )
-        print(query)
Contributor

Maybe let's log this at debug level instead of removing it entirely? I'm not sure if people expect this output to exist.
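
For reference, a minimal sketch of that suggestion, assuming the standard logging module (indexd's real logger setup may differ):

import logging

logger = logging.getLogger(__name__)

# inside query_urls, in place of the removed print(query):
# logger.debug("query_urls generated query: %s", query)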

Contributor

@Avantol13 Avantol13 left a comment

I had some comments from before I left for PTO that I forgot to send, apologies. Here they are now.

@paulineribeyre paulineribeyre mentioned this pull request May 29, 2024
baseid = Column(String, index=True)
rev = Column(String)
form = Column(String)
size = Column(BigInteger, index=True)
Contributor

Why are we indexing on size, file_name, version, and uploader?

Contributor Author

Removing it. I can't remember why I did that; it's been a while.

"""
Get the full index document
"""
# TODO: some of these fields may not need to be a variable and could directly go to the return object -Binam
Contributor

Could you do this?

Base.metadata.bind = self.engine
self.Session = sessionmaker(bind=self.engine)

def migrate_index_database(self):
Contributor

We should just remove this for the new driver if we can, and rely on Alembic instead.

"""
session = self.Session()

try:
Contributor

I think we can simplify all of this by following the guide here (https://docs.sqlalchemy.org/en/14/orm/session_basics.html#framing-out-a-begin-commit-rollback-block) and using Session.begin().

I think you can just return that, since it's a context manager.
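
A minimal sketch of that begin/commit/rollback pattern in SQLAlchemy 1.4 (the connection string and query are illustrative assumptions, not the PR's actual code):

from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

engine = create_engine("postgresql://postgres:postgres@localhost:5432/indexd_tests")
Session = sessionmaker(bind=engine)

# Session.begin() returns a context manager: it commits on success,
# rolls back on exception, and closes the session either way.
with Session.begin() as session:
    session.execute(text("SELECT 1"))  # placeholder for the driver's real queries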

@@ -0,0 +1,268 @@
"""

Contributor

Can you add a quick description here of how to use this?

args = parser.parse_args()
migrator = IndexRecordMigrator(conf_data=args.creds_path)
migrator.index_record_to_new_table()
# cProfile.run("migrator.index_record_to_new_table()", filename="profile_results.txt")
Contributor

Dead code; remove it if it's not being used.


self.session = Session()

def index_record_to_new_table(self, batch_size=1000, retry_limit=4):
Contributor

I imagine this batch_size could be much larger. Where's the job that will run this? How much RAM do you plan on providing? Indexd record text is pretty darn small, maybe 1000 characters on average. At 1000 chars per record and 1 byte per ASCII char, that's roughly 1 KB per record, so 1000 records is about 1 MB. If we have 1 GB of usable RAM, we could probably go up to something like 10000 records.
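
As a quick back-of-envelope check in code (the per-record size is the rough estimate above, not a measurement):

avg_record_bytes = 1000  # ~1000 ASCII chars per record, ~1 byte each
batch_size = 10_000
batch_mb = batch_size * avg_record_bytes / 1_000_000
print(f"{batch_mb:.0f} MB per batch")  # ~10 MB, small compared to 1 GB of usable RAM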

Contributor

I'm not sure we should do that, but if we can do it reliably, it might be faster (reducing the total number of queries and commits to the db). Would outputting to a file and using /COPY be faster? Don't you guys use /COPY for something already? Did you consider this?
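
For context, a rough sketch of what a COPY-based approach could look like with psycopg2; the connection string and table names here are illustrative assumptions, not the PR's actual schema:

import io

import psycopg2

conn = psycopg2.connect(
    "dbname=indexd_tests user=postgres password=postgres host=localhost"
)
buf = io.StringIO()
with conn.cursor() as cur:
    # stream the old table out as CSV (held entirely in memory here, which is
    # the memory concern raised in the reply below for very large tables)...
    cur.copy_expert("COPY index_record TO STDOUT WITH CSV", buf)
    buf.seek(0)
    # ...then bulk-load it into the new single-record table, assuming the
    # target columns line up with the exported CSV
    cur.copy_expert("COPY record FROM STDIN WITH CSV", buf)
conn.commit()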

Contributor Author

I did consider this. I think there are a couple of things here:

  1. We had a conversation about whether to use a SQL + bash script vs. Python. We went with Python because it would be easier for any Gen3 user to run.
  2. Should we do Python + /copy? My concern is that this would blow up for something like DCF, where we have about 56 million records; I think we'd use up a lot of memory. We could do it in batches though 🤔

Contributor

Yeah, it's been a while since we had the design conversations around the options here. Python makes sense; we're already so close with this. I was just thinking out loud and not remembering previous conversations. Let's not shift gears at this point. I am curious whether you were able to tune these numbers and avoid using OFFSET, and whether that helped.

for offset in range(0, total_records, batch_size):
    stmt = (
        self.session.query(IndexRecord)
        .offset(offset)
Contributor

This OFFSET is likely costing us quite a bit once we get to large numbers. There are some alternative pagination methods, but they rely on some ordering. I'm also concerned that we're not using ORDER BY here; I don't think Postgres guarantees row order unless you use it (https://www.postgresql.org/docs/current/queries-order.html). And if we end up using ORDER BY, then I think you can change to NOT use OFFSET and instead use a WHERE + LIMIT:

SELECT *
FROM index_records
WHERE guid > {{last_seen_guid}}
ORDER BY guid
LIMIT {{batch_size}}

And then be sure to save last_seen_guid after processing the last record in the batch.
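
A minimal SQLAlchemy sketch of that keyset approach, assuming IndexRecord.did is the GUID primary key (as in indexd's alchemy driver); the helper name and loop structure are illustrative, not the PR's actual code:

from indexd.index.drivers.alchemy import IndexRecord


def iter_index_record_batches(session, batch_size=10000):
    """Yield IndexRecord rows in keyset-paginated batches, without OFFSET."""
    last_seen_guid = None
    while True:
        query = session.query(IndexRecord).order_by(IndexRecord.did)
        if last_seen_guid is not None:
            query = query.filter(IndexRecord.did > last_seen_guid)
        batch = query.limit(batch_size).all()
        if not batch:
            return
        yield batch
        # save the key of the last record so the next query picks up after it
        last_seen_guid = batch[-1].did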
