Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQLAlchemy: Polyfill for AUTOINCREMENT columns #77

Open
amotl opened this issue Sep 14, 2023 · 11 comments
Open

SQLAlchemy: Polyfill for AUTOINCREMENT columns #77

amotl opened this issue Sep 14, 2023 · 11 comments
Labels
question Further information is requested

Comments

@amotl
Copy link
Member

amotl commented Sep 14, 2023

About

CrateDB does not support the notion of autoincrement columns, because it is a distributed database. For supporting certain applications, specific workarounds have been showing a successful outcome. Hereby, we would like to evaluate how a corresponding patch could be generalized, to be optionally enabled on the CrateDB SQLAlchemy dialect.

Reference

Details

CrateDB can generate unique values server-side by using the GEN_RANDOM_TEXT_UUID() function, which is applicable for TEXT types.

"columnname" TEXT DEFAULT GEN_RANDOM_TEXT_UUID() NOT NULL

While working on mlflow-cratedb, we had the need to create unique Integer-based values, so we added a corresponding monkeypatch, which also has a remark:

TODO: Submit patch to crate-python, to be enabled by a dialect parameter crate_polyfill_autoincrement or such.

Proposal

So, the idea here is to add a dialect parameter crate_polyfill_autoincrement, which will, under the hood, transparently augment SQLAlchemy models to swap in a different procedure, depending on its column type. For text-based columns, the server-side DEFAULT GEN_RANDOM_TEXT_UUID() may be applicable, while for integer-based columns, the patching would need to resort to a timestamp 12.

Followup

If that turns out well, make sure to report the outcome back to the original ticket crate/crate#11020, and also write a "best practice" community post about it.

Footnotes

  1. I don't know yet how or whether the timestamp resolution should be configurable at all. It would probably be best to always use nanosecond resolution, in order to reduce the chance of collisions in high-traffic/-volume/-performance environments. If that is not an issue, millisecond resolution might be enough, but making it configurable would probably be a mess.

  2. For the column to be able to store large integer numbers like Unix time since Epoch in nanoseconds, it would need to be converged into a BIGINT, if it was defined as an INT only.

@amotl amotl changed the title SQLAlchemy: Polyfill for autoincrement columns SQLAlchemy: Polyfill for AUTOINCREMENT columns Sep 14, 2023
@hlcianfagna
Copy link
Contributor

hlcianfagna commented Sep 14, 2023

Another option for INT auto-increments which could be useful in some cases, as something that a developer would need to consciously enable, could be for the driver to generate the values working on the knowledge that it is the only client adding rows. So it could get current max at the first request and then manage a sequence from then on.

@amotl
Copy link
Member Author

amotl commented Sep 14, 2023

Hi Hernan. It always sounds nice to gain "kind of autoincrement" features, but, at the same time, always falls short, as there are too many ambiguities and spots where things can go wrong. In this manner, the disadvantages outweigh the advantages. Just to enumerate a few examples:

  • Even if the driver would know it is the only one, it would not know if it is used concurrently.
  • The driver would need to keep sequence state per connection, and can't communicate that state to other connections, neither in-process nor ex-process.
  • If an INSERT goes wrong after generating an identifier, what to do?

This issue was primarily raised to track carrying over the code which has been conceived at mlflow-cratedb, which is validated by a real-world SQLAlchemy application and corresponding test cases now, so I think it is valuable to think about generalizing it.

I am not too fancy to think about different solutions which are even constrained so much, and offer plenty of chances to use wrong. As we are handling primary keys at this spot, the "what could possibly go wrong" aspect is straight on the table here.

However, when using client-side identifier generation anyway, I am open for any suggestions whether to offer different generation algorithms, if that would not complicate the implementation too much, and would satisfy the uniqueness constraints.

@amotl
Copy link
Member Author

amotl commented Sep 17, 2023

Following up on crate/crate-python#580 by @surister (thanks!), we may think about accompanying this with an alternative, yet different, to emulate autoincrement column constraints based on integer values. This is what we are currently using to compensate one spot at crate-workbench/langchain#1.

id = sa.Column(sa.BigInteger, primary_key=True, default=generate_autoincrement_identifier)
import datetime as dt

def generate_autoincrement_identifier() -> int:
    """
    Generate auto-incrementing integer values,
    fitting into the BIGINT data type for a while.

    Based on Nagamani23, it returns the number of
    microseconds elapsed since the beginning of 2023.
    
    The property that the value is increasing, to resemble
    classical autoincrement behavior, is important for the
    corresponding use cases.
    """
    naga23_delta = dt.datetime.utcnow() - dt.datetime(2023, 1, 1)
    naga23_microseconds = int(naga23_delta.total_seconds() * 1_000_000)
    return naga23_microseconds

@seut
Copy link
Member

seut commented Sep 18, 2023

id = sa.Column(sa.BigInteger, primary_key=True, default=generate_autoincrement_identifier)

Why not use now() server_default instead? This would at least use server logic (use server's date/time env instead of possible different clients one).
Also the microsecond calculation looks flawn to me as the result is only kinda unique in the seconds range. The server's now() function would be in milisecond range. (https://crate.io/docs/crate/reference/en/5.4/general/builtins/scalar-functions.html#now)

id = sa.Column(sa.BigInteger, primary_key=True, default="now()")

@amotl
Copy link
Member Author

amotl commented Sep 18, 2023

a) It apparently does not work out of the box because NOW() seems to return a string type. Is there a way to the cast the value transparently, or use a different function?

id = sa.Column(sa.BigInteger, primary_key=True, server_default="NOW()")
SQLParseException[Cannot cast `'NOW()'` of type `text` to type `bigint`]

b) Microsecond resolution works reasonably well. Millisecond resolution might be too low in high performance or high concurrency situtations, which a typical test suite will produce on its own, or simulate on purpose.

>>> (dt.datetime.utcnow() - dt.datetime(2023, 1, 1)).total_seconds()
22500672.265006
>>> (dt.datetime.utcnow() - dt.datetime(2023, 1, 1)).total_seconds()
22500674.232871

@surister
Copy link
Contributor

a) It apparently does not work out of the box because NOW() seems to return a string type. Is there a way to the cast the value transparently, or use a different function?

id = sa.Column(sa.BigInteger, primary_key=True, server_default="NOW()")
SQLParseException[Cannot cast `'NOW()'` of type `text` to type `bigint`]

b) Microsecond resolution works reasonably well. Millisecond resolution might be too low in high performance or high concurrency situtations, which a typical test suite will produce on its own, or simulate on purpose.

>>> (dt.datetime.utcnow() - dt.datetime(2023, 1, 1)).total_seconds()
22500672.265006
>>> (dt.datetime.utcnow() - dt.datetime(2023, 1, 1)).total_seconds()
22500674.232871

How about:

https://docs.python.org/3/library/time.html#time.perf_counter

>>> time.perf_counter_ms()
9720976934550

@amotl
Copy link
Member Author

amotl commented Sep 18, 2023

Thanks for the suggestion, but it has a drawback:

The reference point of the returned value is undefined, so that only the difference between the results of two calls is valid.

@hlcianfagna
Copy link
Contributor

NOW() seems to return a string type. Is there a way to the cast the value transparently

now()::BIGINT

or

(extract(EPOCH FROM now()) * 1000)

?

@amotl
Copy link
Member Author

amotl commented Sep 18, 2023

With SQL DDL, it works well, even without a cast.

create table testdrive (foo bigint default now(), bar text);
insert into testdrive (bar) values ('baz');
cr> select * from testdrive;
+---------------+-----+
|           foo | bar |
+---------------+-----+
| 1695034699106 | baz |
+---------------+-----+
SELECT 1 row in set (0.881 sec)

And when using this SQLAlchemy definition (which I got wrong before), it also makes a corresponding LangChain test case succeed.

sa.Column(sa.BigInteger, primary_key=True, server_default=sa.func.now())

Excellent, I will change the corresponding code, to check if it also works well on MLflow, and will also add relevant documentation improvements.

@amotl
Copy link
Member Author

amotl commented Sep 18, 2023

@amotl amotl added the question Further information is requested label Oct 6, 2023
@amotl amotl transferred this issue from crate/crate-python Jun 16, 2024
@amotl
Copy link
Member Author

amotl commented Jun 22, 2024

There is a patch now, which includes the corresponding improvement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

4 participants