SQLAlchemy: Polyfill for `AUTOINCREMENT` columns #77

amotl · 2023-09-14T13:00:51Z

About

CrateDB does not support the notion of autoincrement columns, because it is a distributed database. For supporting certain applications, specific workarounds have been showing a successful outcome. Hereby, we would like to evaluate how a corresponding patch could be generalized, to be optionally enabled on the CrateDB SQLAlchemy dialect.

Reference

Optimize single primary key usage patterns crate#11020

Details

CrateDB can generate unique values server-side by using the GEN_RANDOM_TEXT_UUID() function, which is applicable for TEXT types.

"columnname" TEXT DEFAULT GEN_RANDOM_TEXT_UUID() NOT NULL

While working on mlflow-cratedb, we had the need to create unique Integer-based values, so we added a corresponding monkeypatch, which also has a remark:

TODO: Submit patch to crate-python, to be enabled by a dialect parameter crate_polyfill_autoincrement or such.

Proposal

So, the idea here is to add a dialect parameter crate_polyfill_autoincrement, which will, under the hood, transparently augment SQLAlchemy models to swap in a different procedure, depending on its column type. For text-based columns, the server-side DEFAULT GEN_RANDOM_TEXT_UUID() may be applicable, while for integer-based columns, the patching would need to resort to a timestamp ¹².

Followup

If that turns out well, make sure to report the outcome back to the original ticket crate/crate#11020, and also write a "best practice" community post about it.

I don't know yet how or whether the timestamp resolution should be configurable at all. It would probably be best to always use nanosecond resolution, in order to reduce the chance of collisions in high-traffic/-volume/-performance environments. If that is not an issue, millisecond resolution might be enough, but making it configurable would probably be a mess. ↩
For the column to be able to store large integer numbers like Unix time since Epoch in nanoseconds, it would need to be converged into a BIGINT, if it was defined as an INT only. ↩

The text was updated successfully, but these errors were encountered:

hlcianfagna · 2023-09-14T13:16:46Z

Another option for INT auto-increments which could be useful in some cases, as something that a developer would need to consciously enable, could be for the driver to generate the values working on the knowledge that it is the only client adding rows. So it could get current max at the first request and then manage a sequence from then on.

amotl · 2023-09-14T13:41:50Z

Hi Hernan. It always sounds nice to gain "kind of autoincrement" features, but, at the same time, always falls short, as there are too many ambiguities and spots where things can go wrong. In this manner, the disadvantages outweigh the advantages. Just to enumerate a few examples:

Even if the driver would know it is the only one, it would not know if it is used concurrently.
The driver would need to keep sequence state per connection, and can't communicate that state to other connections, neither in-process nor ex-process.
If an INSERT goes wrong after generating an identifier, what to do?

This issue was primarily raised to track carrying over the code which has been conceived at mlflow-cratedb, which is validated by a real-world SQLAlchemy application and corresponding test cases now, so I think it is valuable to think about generalizing it.

I am not too fancy to think about different solutions which are even constrained so much, and offer plenty of chances to use wrong. As we are handling primary keys at this spot, the "what could possibly go wrong" aspect is straight on the table here.

However, when using client-side identifier generation anyway, I am open for any suggestions whether to offer different generation algorithms, if that would not complicate the implementation too much, and would satisfy the uniqueness constraints.

amotl · 2023-09-17T18:50:29Z

Following up on crate/crate-python#580 by @surister (thanks!), we may think about accompanying this with an alternative, yet different, to emulate autoincrement column constraints based on integer values. This is what we are currently using to compensate one spot at crate-workbench/langchain#1.

id = sa.Column(sa.BigInteger, primary_key=True, default=generate_autoincrement_identifier)

import datetime as dt

def generate_autoincrement_identifier() -> int:
    """
    Generate auto-incrementing integer values,
    fitting into the BIGINT data type for a while.

    Based on Nagamani23, it returns the number of
    microseconds elapsed since the beginning of 2023.
    
    The property that the value is increasing, to resemble
    classical autoincrement behavior, is important for the
    corresponding use cases.
    """
    naga23_delta = dt.datetime.utcnow() - dt.datetime(2023, 1, 1)
    naga23_microseconds = int(naga23_delta.total_seconds() * 1_000_000)
    return naga23_microseconds

seut · 2023-09-18T09:13:55Z

id = sa.Column(sa.BigInteger, primary_key=True, default=generate_autoincrement_identifier)

Why not use now() server_default instead? This would at least use server logic (use server's date/time env instead of possible different clients one).
Also the microsecond calculation looks flawn to me as the result is only kinda unique in the seconds range. The server's now() function would be in milisecond range. (https://crate.io/docs/crate/reference/en/5.4/general/builtins/scalar-functions.html#now)

id = sa.Column(sa.BigInteger, primary_key=True, default="now()")

amotl · 2023-09-18T10:13:36Z

a) It apparently does not work out of the box because NOW() seems to return a string type. Is there a way to the cast the value transparently, or use a different function?

id = sa.Column(sa.BigInteger, primary_key=True, server_default="NOW()")

SQLParseException[Cannot cast `'NOW()'` of type `text` to type `bigint`]

b) Microsecond resolution works reasonably well. Millisecond resolution might be too low in high performance or high concurrency situtations, which a typical test suite will produce on its own, or simulate on purpose.

>>> (dt.datetime.utcnow() - dt.datetime(2023, 1, 1)).total_seconds()
22500672.265006
>>> (dt.datetime.utcnow() - dt.datetime(2023, 1, 1)).total_seconds()
22500674.232871

surister · 2023-09-18T10:24:14Z

a) It apparently does not work out of the box because NOW() seems to return a string type. Is there a way to the cast the value transparently, or use a different function?
id = sa.Column(sa.BigInteger, primary_key=True, server_default="NOW()")
SQLParseException[Cannot cast `'NOW()'` of type `text` to type `bigint`]
b) Microsecond resolution works reasonably well. Millisecond resolution might be too low in high performance or high concurrency situtations, which a typical test suite will produce on its own, or simulate on purpose.
>>> (dt.datetime.utcnow() - dt.datetime(2023, 1, 1)).total_seconds()
22500672.265006
>>> (dt.datetime.utcnow() - dt.datetime(2023, 1, 1)).total_seconds()
22500674.232871

How about:

https://docs.python.org/3/library/time.html#time.perf_counter

>>> time.perf_counter_ms()
9720976934550

amotl · 2023-09-18T10:27:46Z

Thanks for the suggestion, but it has a drawback:

The reference point of the returned value is undefined, so that only the difference between the results of two calls is valid.

hlcianfagna · 2023-09-18T10:53:04Z

NOW() seems to return a string type. Is there a way to the cast the value transparently

now()::BIGINT

or

(extract(EPOCH FROM now()) * 1000)

?

amotl · 2023-09-18T11:05:37Z

With SQL DDL, it works well, even without a cast.

create table testdrive (foo bigint default now(), bar text);
insert into testdrive (bar) values ('baz');

cr> select * from testdrive;
+---------------+-----+
|           foo | bar |
+---------------+-----+
| 1695034699106 | baz |
+---------------+-----+
SELECT 1 row in set (0.881 sec)

And when using this SQLAlchemy definition (which I got wrong before), it also makes a corresponding LangChain test case succeed.

sa.Column(sa.BigInteger, primary_key=True, server_default=sa.func.now())

Excellent, I will change the corresponding code, to check if it also works well on MLflow, and will also add relevant documentation improvements.

amotl · 2023-09-18T11:34:26Z

@seut: Your suggestion worked well to satisfy both test suites, see ¹². Thank you very much.

amotl · 2024-06-22T19:44:24Z

There is a patch now, which includes the corresponding improvement.

Database: Add autoincrement, uniqueness, and sync-writes polyfills #28

amotl changed the title ~~SQLAlchemy: Polyfill for autoincrement columns~~ SQLAlchemy: Polyfill for AUTOINCREMENT columns Sep 14, 2023

amotl mentioned this issue Sep 18, 2023

SQLAlchemy: Add support for 'autogenerate_uuid' in column creation crate/crate-python#580

Closed

3 tasks

amotl mentioned this issue Sep 18, 2023

SQLAlchemy: Use server-side now() function for "autoincrement" columns crate/mlflow-cratedb#24

Merged

amotl added the question Further information is requested label Oct 6, 2023

amotl transferred this issue from crate/crate-python Jun 16, 2024

amotl mentioned this issue Jun 22, 2024

Database: Add autoincrement, uniqueness, and sync-writes polyfills #28

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SQLAlchemy: Polyfill for `AUTOINCREMENT` columns #77

SQLAlchemy: Polyfill for `AUTOINCREMENT` columns #77

amotl commented Sep 14, 2023 •

edited

Loading

hlcianfagna commented Sep 14, 2023 •

edited

Loading

amotl commented Sep 14, 2023 •

edited

Loading

amotl commented Sep 17, 2023 •

edited

Loading

seut commented Sep 18, 2023

amotl commented Sep 18, 2023 •

edited

Loading

surister commented Sep 18, 2023

amotl commented Sep 18, 2023 •

edited

Loading

hlcianfagna commented Sep 18, 2023

amotl commented Sep 18, 2023 •

edited

Loading

amotl commented Sep 18, 2023 •

edited

Loading

amotl commented Jun 22, 2024

SQLAlchemy: Polyfill for AUTOINCREMENT columns #77

SQLAlchemy: Polyfill for AUTOINCREMENT columns #77

Comments

amotl commented Sep 14, 2023 • edited Loading

About

Reference

Details

Proposal

Followup

Footnotes

hlcianfagna commented Sep 14, 2023 • edited Loading

amotl commented Sep 14, 2023 • edited Loading

amotl commented Sep 17, 2023 • edited Loading

seut commented Sep 18, 2023

amotl commented Sep 18, 2023 • edited Loading

surister commented Sep 18, 2023

amotl commented Sep 18, 2023 • edited Loading

hlcianfagna commented Sep 18, 2023

amotl commented Sep 18, 2023 • edited Loading

amotl commented Sep 18, 2023 • edited Loading

Footnotes

amotl commented Jun 22, 2024

SQLAlchemy: Polyfill for `AUTOINCREMENT` columns #77

SQLAlchemy: Polyfill for `AUTOINCREMENT` columns #77

amotl commented Sep 14, 2023 •

edited

Loading

hlcianfagna commented Sep 14, 2023 •

edited

Loading

amotl commented Sep 14, 2023 •

edited

Loading

amotl commented Sep 17, 2023 •

edited

Loading

amotl commented Sep 18, 2023 •

edited

Loading

amotl commented Sep 18, 2023 •

edited

Loading

amotl commented Sep 18, 2023 •

edited

Loading

amotl commented Sep 18, 2023 •

edited

Loading