
(PPS-107): Feat/single table driver #376

Merged: 49 commits, merged Sep 10, 2024

Conversation

Contributor

@BinamB BinamB commented Feb 5, 2024

Jira Ticket: PXP-xxxx

Coverage went down by 2% because we removed test_setup.py; the drop is in utils.py, which contains the old, deprecated sqlite3-based database migration method.

New Features

  • Implemented a single-table architecture for IndexD backed by a single new record table (configurable, and requires a migration)

Breaking Changes

Bug Fixes

Improvements

  • Removed SQLite from unit tests; unit tests now use Postgres only

Dependency updates

  • sqlalchemy ~1.3.3 -> ^1.4.0

Deployment changes

Contributor

@Avantol13 Avantol13 left a comment

This isn't a full review, but I figured I'd share the comments I have so far.

@@ -17,7 +17,7 @@
 # will be created as "<PREFIX><PREFIX><GUID>".
 CONFIG["INDEX"] = {
     "driver": SQLAlchemyIndexDriver(
-        "sqlite:///index.sq3",
+        "postgres://postgres:postgres@localhost:5432/indexd_tests",  # pragma: allowlist secret
Contributor

What's the idea behind this change? Did SQLite not work easily with your changes?

To be clear, I think it's fine (it's more consistent to default to Postgres).

Contributor Author

Yeah, SQLite doesn't handle some of the newer data types. I moved all the tests to use Postgres instead of SQLite too, just to stay consistent; I saw there was a mix.

@@ -4,7 +4,7 @@
from indexd.index.drivers.alchemy import SQLAlchemyIndexDriver
Contributor

Isn't this an unused import now?

@@ -80,7 +80,6 @@ def query_urls(
         query = query.having(
             ~q_func["string_agg"](IndexRecordUrl.url, ",").contains(exclude)
         )
-        print(query)
Contributor

Maybe let's log this at debug level instead of removing it entirely? I'm not sure if people expect this output to exist.
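
For reference, a minimal sketch of that suggestion, assuming the standard logging module (indexd's real logger setup may differ):

import logging

logger = logging.getLogger(__name__)

# inside query_urls, in place of the removed print(query):
# logger.debug("query_urls generated query: %s", query)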

Contributor

@Avantol13 Avantol13 left a comment

I had some comments from before I left for PTO that I forgot to send, apologies. Here they are now.

@paulineribeyre paulineribeyre mentioned this pull request May 29, 2024
baseid = Column(String, index=True)
rev = Column(String)
form = Column(String)
size = Column(BigInteger, index=True)
Contributor

Why are we indexing on size, file_name, version, and uploader?

Contributor Author

Removing it. I can't remember why I did that; it's been a while.

"""
Get the full index document
"""
# TODO: some of these fields may not need to be a variable and could directly go to the return object -Binam
Contributor

Could you do this?

Base.metadata.bind = self.engine
self.Session = sessionmaker(bind=self.engine)

def migrate_index_database(self):
Contributor

We should just remove this for the new driver if we can, and rely on Alembic instead.

"""
session = self.Session()

try:
Contributor

I think we can simplify all of this by following the guide here (https://docs.sqlalchemy.org/en/14/orm/session_basics.html#framing-out-a-begin-commit-rollback-block) and using Session.begin().

I think you can just return that, since it's a context manager.
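
A minimal sketch of that begin/commit/rollback pattern in SQLAlchemy 1.4 (the connection string and query are illustrative assumptions, not the PR's actual code):

from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

engine = create_engine("postgresql://postgres:postgres@localhost:5432/indexd_tests")
Session = sessionmaker(bind=engine)

# Session.begin() returns a context manager: it commits on success,
# rolls back on exception, and closes the session either way.
with Session.begin() as session:
    session.execute(text("SELECT 1"))  # placeholder for the driver's real queries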

@@ -0,0 +1,268 @@
"""

Contributor

Can you add a quick description here of how to use this?

args = parser.parse_args()
migrator = IndexRecordMigrator(conf_data=args.creds_path)
migrator.index_record_to_new_table()
# cProfile.run("migrator.index_record_to_new_table()", filename="profile_results.txt")
Contributor

Dead code; remove it if it's not being used.


self.session = Session()

def index_record_to_new_table(self, batch_size=1000, retry_limit=4):
Contributor

I imagine this batch_size could be much larger. Where's the job that will run this? How much RAM do you plan on providing? Indexd record text is pretty darn small, maybe 1000 characters on average. At 1000 chars per record and 1 byte per ASCII char, that's roughly 1 KB per record, so 1000 records is about 1 MB. If we have 1 GB of usable RAM, we could probably go up to something like 10000 records.
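
As a quick back-of-envelope check in code (the per-record size is the rough estimate above, not a measurement):

avg_record_bytes = 1000  # ~1000 ASCII chars per record, ~1 byte each
batch_size = 10_000
batch_mb = batch_size * avg_record_bytes / 1_000_000
print(f"{batch_mb:.0f} MB per batch")  # ~10 MB, small compared to 1 GB of usable RAM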

Contributor

I'm not sure we should do that, but if we can do it reliably, it might be faster (reducing the total number of queries and commits to the db). Would outputting to a file and using /COPY be faster? Don't you guys use /COPY for something already? Did you consider this?
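
For context, a rough sketch of what a COPY-based approach could look like with psycopg2; the connection string and table names here are illustrative assumptions, not the PR's actual schema:

import io

import psycopg2

conn = psycopg2.connect(
    "dbname=indexd_tests user=postgres password=postgres host=localhost"
)
buf = io.StringIO()
with conn.cursor() as cur:
    # stream the old table out as CSV (held entirely in memory here, which is
    # the memory concern raised in the reply below for very large tables)...
    cur.copy_expert("COPY index_record TO STDOUT WITH CSV", buf)
    buf.seek(0)
    # ...then bulk-load it into the new single-record table, assuming the
    # target columns line up with the exported CSV
    cur.copy_expert("COPY record FROM STDIN WITH CSV", buf)
conn.commit()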

Contributor Author

I did consider this. I think there are a couple of things here:

  1. We had a conversation about whether to use a SQL + bash script vs. Python. We went with Python because it would be easier for any Gen3 user to run.
  2. Should we do Python + /copy? My concern is that this would blow up for something like DCF, where we have about 56 million records; I think we'd use up a lot of memory. We could do it in batches though 🤔

Contributor

Yeah, it's been a while since we had the design conversations around the options here. Python makes sense; we're already so close with this. I was just thinking out loud and not remembering previous conversations. Let's not shift gears at this point. I am curious whether you were able to tune these numbers and avoid using OFFSET, and whether that helped.

for offset in range(0, total_records, batch_size):
    stmt = (
        self.session.query(IndexRecord)
        .offset(offset)
Contributor

This OFFSET is likely costing us quite a bit once we get to large numbers. There are some alternative pagination methods, but they rely on some ordering. I'm also concerned that we're not using ORDER BY here; I don't think Postgres guarantees row order unless you use it (https://www.postgresql.org/docs/current/queries-order.html). And if we end up using ORDER BY, then I think you can change to NOT use OFFSET and instead use a WHERE + LIMIT:

SELECT *
FROM index_records
WHERE guid > {{last_seen_guid}}
ORDER BY guid
LIMIT {{batch_size}}

And then be sure to save last_seen_guid after processing the last record in the batch.
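
A minimal SQLAlchemy sketch of that keyset approach, assuming IndexRecord.did is the GUID primary key (as in indexd's alchemy driver); the helper name and loop structure are illustrative, not the PR's actual code:

from indexd.index.drivers.alchemy import IndexRecord


def iter_index_record_batches(session, batch_size=10000):
    """Yield IndexRecord rows in keyset-paginated batches, without OFFSET."""
    last_seen_guid = None
    while True:
        query = session.query(IndexRecord).order_by(IndexRecord.did)
        if last_seen_guid is not None:
            query = query.filter(IndexRecord.did > last_seen_guid)
        batch = query.limit(batch_size).all()
        if not batch:
            return
        yield batch
        # save the key of the last record so the next query picks up after it
        last_seen_guid = batch[-1].did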
