Replace SQLAlchemy ORM with psycopg db driver plus optional query builder #254

robsavoye · 2023-03-24T06:20:35Z

Sqlalchemy is a high level framework to work with multiple database type, like mysql, postgresql, etc... It's overkill for this project as we will only use postgresql with postgis. For postgresql access, sqlalchemy uses psycopg2. We should simplify the code base to use psycopg2 directly. This will make it much easier to change the database schema, and simplifying the code base will make it easier to maintain.

spwoodcock · 2023-03-24T08:03:50Z

SQLAlchemy is a great package. While there is a small performance overhead compared to using the driver directly, I think the trade off is worth it.

If we use psycopg directly, I'm not sure how we could manage the database models in a maintainable way. We currently also rely in automatic database migration (creation) on the first start of FMTM (e.g. for local dev servers), and this relies on SQLAlchemy.

robsavoye · 2023-03-24T08:08:50Z

I think we'll be doing this in a much simpler way, and just load an sql file that's a dump of the schema. Other than using sqlalchemy to create the database, I was using the lower level function to execute SQL anyway. This will also make it easier to create test cases, since they'll all need to create a database with test data. It's not a performance issue, mostly just reducing complexity since pretty soon we'll be doing major triage on the schema.

spwoodcock · 2023-03-24T08:24:57Z

There are advantages to both approaches - I agree an SQL file is very simple given basic knowledge of SQL.

Do you propose we change to psycopg prior to the schema edits for role types?

robsavoye · 2023-03-24T08:31:58Z

Yes, as it'll make refactoring the schema less messy. At this point we haven't figured out where to add the roles, so that refactoring is probably a week or two away.

robsavoye · 2023-03-24T13:28:59Z

I'll just note this is not a high priority, nor in the critical path for now. :-) But at least it's worth discussing it.

spwoodcock · 2023-05-03T08:00:09Z

We need to make a decision if we want to move ahead with this, or continue with an ORM.

If we continue with SQLAlchemy, then we need a way to migrate the database after updates (e.g. alembic).

robsavoye · 2023-05-04T14:13:29Z

I personally find psycopg2 simpler since we'll never talk to anything but a postgresql database with postgis support. In other projects rather than use sqlalchemy to setup the database schema, I just load a an SQL file, which makes it very easy to update the schema. Sqlalchemy is great if you need to talk to different databases, but FMTM doesn't do that. I do know that in most of the queries I added to the backend I use the lower level API anyway using execute().

The classic question is how much time to refactor ? I'd guess at least a week ? Not sure if I want to add alembic as another dependency.

spwoodcock · 2023-05-13T16:12:19Z

This idea is growing on me.

Work required

I estimate we currently have about 10 database queries to re-write in pure SQL, to be able to remove SQLAlchemy - I could work on this after finishing the docker config I am working on.

Additional benefits

Obviously this would make database queries marginally quicker / reduce overhead.
We could also shift driver to use asyncpg or psycopg3 (fully async, also faster, fits better with FastAPI).
A side advantage is that we can make Postgis queries directly on the DB, instead of using GeoAlchemy (which I have always found to be lacklustre, plus another dependency).

For consideration

I would actually propose we strip out the geo functionality from the FMTM API completely, and instead use: https://github.com/CrunchyData/pg_featureserv.
- pg_featureserv is a tiny http wrapper around Postgis, allowing us to make complex queries using the OGC Features spec.
- It's also written in Golang and should be much faster than any Python queries we write.
- It also helps us use a more micro-service based approach, reducing the complexity of the FMTM API itself.

robsavoye · 2023-05-14T19:47:27Z

I think stripping out the geo functionality is a separate issue. I know all the SQL queries I added to the backend were using the lower level support in SqlAlchemy to just call execute(), those would be trivial to convert. I think the biggest difference is that without SqlAlchemy we'd have to have docker load an SQL file to create all the tables. Personally I find it easier to edit an SQL text file than change what SqlAlchemy uses. I like the idea of going all the way and using psycopg3. In the long run, I think dropping SqlAlchemy will make the code simpler and more maintainable.

spwoodcock · 2023-06-01T23:20:43Z

Is anyone working on this?

I made a start this evening using psycopg3.

I slightly underestimated the task - it's a pretty big refactor, but I do think the end result is easy to understand.

The downside is that we lose the overview of what tables and fields are present in the DB. Plus the ORM provides centralisation of tables and fields that we don't get with raw SQL (updating or adding a new field requires updating in multiple places).

We would also have to be wary of SQL injection on raw queries. This risk is reduced by passing in params instead of substituting variables directly into the string.

Note 1: We can use pooling of connections with psycopg and probably even use async queries.

Note 2: I stripped out a lot of unused dependencies and updated others (I had never actually gone through the package list we inherited before).

spwoodcock · 2023-06-04T20:19:12Z

After discussion in PR #441, we decided to use asyncpg over psycopg3, and try out a lightweight query generator to use alongside raw SQL.

spwoodcock · 2024-08-03T22:47:09Z

Back and forth a bit on this topic!

While asyncpg may be more performant and good for many use cases, I actually think that psycopg has an edge for our application:

Type safety when used with Pydantic.
Easier static type checking via tools like Pyright / mypy (+ code completion).
Usage of standard libpq official Postgres driver, which in theory should have better longevity (asyncpg uses a custom implementation that would need to be maintained...).

Static Type Checking / Pydantic Integration

Both asyncpg and psycopg would use raw SQL and return Record objects.
We need a way to convert the Record to a Pydantic model easily, for usage in FastAPI etc.
Asyncpg supports this via a workaround: https://github.com/pydantic/pydantic/issues/9406#issuecomment-2104224328
Psycopg supports this in a very simple way using row_factory on the database call: https://www.psycopg.org/psycopg3/docs/advanced/typing.html#example-returning-records-as-pydantic-models

Example of using psycopg with Pydantic (to seralise a returned user from the db):

from pydantic import BaseModel
from psycopg.rows import class_row

class User(BaseModel):
    id: int
    first_name: str
    last_name: str
    dob: Optional[date]
    
async def get_user_by_id(db: Connection, id: int):
    async with conn.cursor(row_factory=class_row(User)) as cur:
        await cur.execute(
            '''
            SELECT id, first_name, last_name, dob
            FROM (VALUES
                (1, 'John', 'Doe', '2000-01-01'::date),
                (2, 'Jane', 'White', NULL)
            ) AS data (id, first_name, last_name, dob)
            WHERE id = %(id)s;
            ''',
            {"id": id},
        )
        obj = await cur.fetchone()

        # reveal_type(obj) would return 'Optional[User]' here

        if not obj:
            raise KeyError(f"user {id} not found")

        # reveal_type(obj) would return 'User' here

        return obj

We could even include some basic function like get_by_id as part of the model:

class User(BaseModel):
    id: int
    first_name: str
    last_name: str
    dob: Optional[date]
    
    async def by_id(self, db: Connection, id: int):
        ... logic here

# Usage
user = await User.by_id(conn, id=10)

Note

All other ORMs would require creating separate models for the tables that could potentially get out of sync.

Only SQLModel is a variable option I will describe below (I don't think we should use it).

Using the DB driver directly, plus Pydantic models for type safety seems like the best of both worlds.

psycopg also takes care of SQL injection risk, so that benefit of ORMs is gone too.

spwoodcock · 2024-08-06T10:17:50Z

Work towards this - removed SQLAlchemy-Utils package:
c309d57
4f7de9a

spwoodcock · 2024-08-06T10:48:36Z

It is possible to use the SQLModel package as a query building for Pydantic:

It uses SQLAlchemy under the hood.
I worry about the longevity of SQLModel, as it's tied to FastAPI. It does have a good community behind it, but again suffers from the single dev behind the project instead of a team.
I don't think we need to include a query builder unnecessarily, unless specifically requested. Sure it makes onboarding easier not having to know SQL, but it's a great skill to learn that isn't going anywhere soon. If required, it's also really easy to write simple queries using LLM these days.

In case we revisit this & require a query building, I will include some examples / proposals below.

Example creating table SQL from Pydantic models (not as useful for us):

Warning

Untested code

from sqlmodel import Field, SQLModel
from sqlalchemy.dialects import postgresql
from sqlalchemy.schema import CreateTable

class User(SQLModel, table=True):
    id: int = Field(default=None, primary_key=True)
    name: str
    email: str

print(CreateTable(User.__table__).compile(dialect=postgresql.dialect()))

Example creating query SQL from Pydantic models:

Warning

Untested code

from pydantic import BaseModel
from sqlmodel import SQLModel
from sqlalchemy.dialects import postgresql

class User(BaseModel):
    id: int
    name: str

class DbUser(User, table=True):
    # NOTE we override the fields that require SQLModel.Field to apply constraints
    id: int = Field(default=None, primary_key=True)

# NOTE db is the db session here
query = db.query(User).distinct(User.id).order_by(User.id)
sql = q.statement.compile(dialect=postgresql.dialect())

robsavoye added enhancement New feature or request priority:high Should be addressed as a priority backend Related to backend code labels Mar 24, 2023

robsavoye assigned nrjadkry Mar 24, 2023

robsavoye added priority:low Backlog of tasks that will be addressed in time and removed priority:high Should be addressed as a priority labels Mar 24, 2023

spwoodcock mentioned this issue Jun 4, 2023

Setup alembic #441

Closed

spwoodcock changed the title ~~Refactor backend, replace sqlalchemy with psycopg2~~ Refactor backend, replace sqlalchemy with asyncpg Jun 4, 2023

spwoodcock mentioned this issue Oct 5, 2023

Add SQL-based migration script, with migration dir #878

Merged

spwoodcock mentioned this issue Jan 23, 2024

Replace synchronous database driver with async (sqalchemy) #1131

Closed

15 tasks

spwoodcock changed the title ~~Refactor backend, replace sqlalchemy with asyncpg~~ Replace SQLAlchemy ORM with asyncpg db driver plus optional query builder Jan 26, 2024

spwoodcock added the effort:medium Likely a day or two label Aug 3, 2024

spwoodcock added this to Field Mapping Tasking Manager (FMTM) Aug 3, 2024

github-project-automation bot moved this to Tasks in Field Mapping Tasking Manager (FMTM) Aug 3, 2024

spwoodcock unassigned nrjadkry Aug 3, 2024

spwoodcock changed the title ~~Replace SQLAlchemy ORM with asyncpg db driver plus optional query builder~~ Replace SQLAlchemy ORM with psycopg db driver plus optional query builder Aug 6, 2024

spwoodcock mentioned this issue Aug 7, 2024

Replace encode/databases with psycopg & pydantic model validation hotosm/drone-tm#128

Merged

5 tasks

spwoodcock mentioned this issue Oct 24, 2024

Remove SQLAlchemy and replace with async psycopg db driver #1834

Merged

9 tasks

spwoodcock closed this as completed in #1834 Oct 25, 2024

github-project-automation bot moved this from Tasks to Done in Field Mapping Tasking Manager (FMTM) Oct 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace SQLAlchemy ORM with psycopg db driver plus optional query builder #254

Replace SQLAlchemy ORM with psycopg db driver plus optional query builder #254

robsavoye commented Mar 24, 2023

spwoodcock commented Mar 24, 2023 •

edited

Loading

robsavoye commented Mar 24, 2023

spwoodcock commented Mar 24, 2023

robsavoye commented Mar 24, 2023

robsavoye commented Mar 24, 2023

spwoodcock commented May 3, 2023

robsavoye commented May 4, 2023

spwoodcock commented May 13, 2023

robsavoye commented May 14, 2023

spwoodcock commented Jun 1, 2023 •

edited

Loading

spwoodcock commented Jun 4, 2023 •

edited

Loading

spwoodcock commented Aug 3, 2024 •

edited

Loading

spwoodcock commented Aug 6, 2024

spwoodcock commented Aug 6, 2024 •

edited

Loading

Replace SQLAlchemy ORM with psycopg db driver plus optional query builder #254

Replace SQLAlchemy ORM with psycopg db driver plus optional query builder #254

Comments

robsavoye commented Mar 24, 2023

spwoodcock commented Mar 24, 2023 • edited Loading

robsavoye commented Mar 24, 2023

spwoodcock commented Mar 24, 2023

robsavoye commented Mar 24, 2023

robsavoye commented Mar 24, 2023

spwoodcock commented May 3, 2023

robsavoye commented May 4, 2023

spwoodcock commented May 13, 2023

Work required

Additional benefits

For consideration

robsavoye commented May 14, 2023

spwoodcock commented Jun 1, 2023 • edited Loading

spwoodcock commented Jun 4, 2023 • edited Loading

spwoodcock commented Aug 3, 2024 • edited Loading

Static Type Checking / Pydantic Integration

spwoodcock commented Aug 6, 2024

spwoodcock commented Aug 6, 2024 • edited Loading

spwoodcock commented Mar 24, 2023 •

edited

Loading

spwoodcock commented Jun 1, 2023 •

edited

Loading

spwoodcock commented Jun 4, 2023 •

edited

Loading

spwoodcock commented Aug 3, 2024 •

edited

Loading

spwoodcock commented Aug 6, 2024 •

edited

Loading