SQLAlchemy: Polyfill for AUTOINCREMENT columns #77

Comments
Another option for INT auto-increments, which could be useful in some cases, would be for the driver to generate the values itself, working on the knowledge that it is the only client adding rows: it could fetch the current maximum at the first request, and then manage a sequence from there on. This would be something a developer needs to consciously enable; see the sketch below.
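A minimal sketch of that idea, assuming SQLAlchemy as the client library and that this process is the single writer; the helper name `make_driver_side_sequence` is hypothetical:

```python
import itertools

import sqlalchemy as sa


def make_driver_side_sequence(engine: sa.Engine, table: str, column: str = "id"):
    """
    Hypothetical driver-side sequence: fetch the current maximum once,
    then hand out consecutive values. Only safe under the assumption
    that this process is the single client adding rows to the table.
    """
    with engine.connect() as conn:
        current_max = conn.execute(
            sa.text(f"SELECT MAX({column}) FROM {table}")
        ).scalar()
    return itertools.count((current_max or 0) + 1)


# Usage: draw the next identifier with `next(sequence)`.
# sequence = make_driver_side_sequence(engine, "testdrive")
```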
Hi Hernan. Gaining "kind of autoincrement" features always sounds nice, but at the same time it always falls short, as there are too many ambiguities and spots where things can go wrong. In this manner, the disadvantages outweigh the advantages. Just to enumerate a few examples:
This issue was primarily raised to track carrying over the code conceived at mlflow-cratedb, which is by now validated by a real-world SQLAlchemy application and corresponding test cases, so I think it is valuable to think about generalizing it. I am not too keen on entertaining different solutions which are constrained so much and offer plenty of chances to be used incorrectly. As we are handling primary keys at this spot, the "what could possibly go wrong" aspect is squarely on the table here. However, when using client-side identifier generation anyway, I am open to suggestions about offering different generation algorithms, provided they do not complicate the implementation too much and satisfy the uniqueness constraints.
Following up on crate/crate-python#580 by @surister (thanks!), we may think about accompanying this with an alternative, but different, way to emulate autoincrement column constraints based on integer values. This is what we are currently using to compensate for one spot at crate-workbench/langchain#1:

```python
id = sa.Column(sa.BigInteger, primary_key=True, default=generate_autoincrement_identifier)
```

```python
import datetime as dt


def generate_autoincrement_identifier() -> int:
    """
    Generate auto-incrementing integer values,
    fitting into the BIGINT data type for a while.

    Based on Nagamani23, it returns the number of
    microseconds elapsed since the beginning of 2023.

    The property that the value is increasing, to resemble
    classical autoincrement behavior, is important for the
    corresponding use cases.
    """
    naga23_delta = dt.datetime.utcnow() - dt.datetime(2023, 1, 1)
    naga23_microseconds = int(naga23_delta.total_seconds() * 1_000_000)
    return naga23_microseconds
```
Why not use `NOW()`?
a) It apparently does not work out of the box with this definition:

```python
id = sa.Column(sa.BigInteger, primary_key=True, server_default="NOW()")
```

b) Microsecond resolution works reasonably well. Millisecond resolution might be too low in high-performance or high-concurrency situations, which a typical test suite will produce on its own, or simulate on purpose.

```python
>>> (dt.datetime.utcnow() - dt.datetime(2023, 1, 1)).total_seconds()
22500672.265006
>>> (dt.datetime.utcnow() - dt.datetime(2023, 1, 1)).total_seconds()
22500674.232871
```
How about: https://docs.python.org/3/library/time.html#time.perf_counter

```python
>>> time.perf_counter_ns()
9720976934550
```
Thanks for the suggestion, but it has a drawback: the reference point of `time.perf_counter()` is undefined, so returned values are only meaningful within a single process, and would not stay unique across restarts.
`now()::BIGINT` or `(extract(EPOCH FROM now()) * 1000)`?
With SQL DDL, it works well, even without a cast:

```sql
create table testdrive (foo bigint default now(), bar text);
insert into testdrive (bar) values ('baz');
```

```text
cr> select * from testdrive;
+---------------+-----+
| foo           | bar |
+---------------+-----+
| 1695034699106 | baz |
+---------------+-----+
SELECT 1 row in set (0.881 sec)
```

And when using this SQLAlchemy definition (which I got wrong before), it also makes a corresponding LangChain test case succeed:

```python
sa.Column(sa.BigInteger, primary_key=True, server_default=sa.func.now())
```

Excellent, I will change the corresponding code, check whether it also works well on MLflow, and add relevant documentation improvements.
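For reference, a minimal self-contained sketch of how such a model could look; the model and table names are hypothetical:

```python
import sqlalchemy as sa
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Item(Base):
    """
    Minimal sketch: on CrateDB, `now()` evaluates to the epoch timestamp
    in milliseconds, which fits into a BIGINT primary key column.
    """
    __tablename__ = "testdrive_item"  # hypothetical table name

    id = sa.Column(sa.BigInteger, primary_key=True, server_default=sa.func.now())
    name = sa.Column(sa.Text)
```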
There is a patch now, which includes the corresponding improvement.
About

CrateDB does not support the notion of `autoincrement` columns, because it is a distributed database. For supporting certain applications, specific workarounds have shown successful outcomes. Hereby, we would like to evaluate how a corresponding patch could be generalized, to be optionally enabled on the CrateDB SQLAlchemy dialect.

Reference
Details

CrateDB can generate unique values server-side by using the GEN_RANDOM_TEXT_UUID() function, which is applicable for `TEXT` types. While working on mlflow-cratedb, we had the need to create unique integer-based values, so we added a corresponding monkeypatch, which also carries a remark about this limitation.
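As a sketch of the text-based variant, assuming a column defined through the SQLAlchemy core API:

```python
import sqlalchemy as sa

# Minimal sketch: let CrateDB generate a unique identifier server-side.
# GEN_RANDOM_TEXT_UUID() is only applicable to TEXT columns.
id_column = sa.Column(
    sa.Text,
    primary_key=True,
    server_default=sa.text("GEN_RANDOM_TEXT_UUID()"),
)
```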
Proposal

So, the idea here is to add a dialect parameter `crate_polyfill_autoincrement`, which will, under the hood, transparently augment SQLAlchemy models to swap in a different procedure, depending on the column type. For text-based columns, the server-side `DEFAULT GEN_RANDOM_TEXT_UUID()` may be applicable, while for integer-based columns, the patching would need to resort to a timestamp [1] [2].
Followup

If that turns out well, make sure to report the outcome back to the original ticket crate/crate#11020, and also write a "best practice" community post about it.
Footnotes

1. I don't know yet how or whether the timestamp resolution should be configurable at all. It would probably be best to always use nanosecond resolution, in order to reduce the chance of collisions in high-traffic/-volume/-performance environments. If that is not an issue, millisecond resolution might be enough, but making it configurable would probably be a mess. See the sketch after these footnotes.

2. For the column to be able to store large integer numbers like Unix time since epoch in nanoseconds, it would need to be converted into a BIGINT, if it was defined as an INT only.
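Regarding footnote 1, a nanosecond-resolution variant could be as simple as the following sketch; the function name is hypothetical:

```python
import time


def generate_autoincrement_identifier_ns() -> int:
    """
    Hypothetical nanosecond-resolution variant: nanoseconds elapsed
    since the Unix epoch. Values fit into a BIGINT column until the
    year 2262, and collisions are far less likely than with
    millisecond or microsecond resolution.
    """
    return time.time_ns()
```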