Why UUIDs? #519

janosh · 2024-01-01T16:08:52Z

I was looking at a wall of job IDs like this earlier and wondering if jobflow actually requires full 36-character UUIDs?

PhononUUIDs(optimization_run_uuid='cc10cd4d-944d-43b1-9651-2e168765c02e', displacements_uuids=['c29abb65-a8e5-47ff-a6d6-6776cd8e4095', 'e2e07788-daed-4201-921b-ec2e57c802c4', 'bf443b3d-6334-4b04-8f8a-29fabc08c37e', '23c36a18-4066-4047-a553-aa9fe2afa714', '81134dc7-a88f-486a-af3b-ddd9c1ba3eb1', '1009897b-a97c-4d4a-9cac-e952766b9608', '15015ddf-7967-410d-8fef-13642d7bd9e5', '1899803f-76c1-4768-bf23-c4d61fb51691', 'd39ce806-7403-495e-89a4-a4de961590bf', '35bb27e8-7f31-44ec-b684-0f9391b4228e', 'c6f188fb-2d26-47ff-b0f8-8438fae8ff58', '06b2249c-e7c6-4cf8-91b0-f572fccc209c', '18bbc0be-be4c-4131-8ef6-f6742c899acd', '8303eb32-162a-4a65-b75f-d8586aa23636', '392282e2-cbe6-447e-b7a9-e20a02e57cbe', '835b35f9-413c-479c-944d-14d9d4649b71', 'c59f17e4-1f9b-4e65-a9f2-dc9fc7314283', '95d9e094-802b-45c5-8630-3b0e8e6c3b64', 'dea7a13c-6921-42f5-adf5-7a4670319a7b', '6faa0b95-0058-4476-bd55-109d3746a334', '6a6fbce4-41b3-4b7a-8d69-0b6ffab7716c', '0ef5726d-3b62-4464-a5e9-f33237f59d83', '3de16018-90ad-4da7-9885-86a783d30c1c', '5c779412-0c29-423b-ae7b-c6eade5124bc', 'adc6ef3c-fc70-43cf-a819-741b45d0ec7f', '5b47a7ff-bb6d-45f4-b442-1183db8a58df', 'e01a19d7-3685-465a-a93b-bfa3e876081c', '79addbaa-32fb-4c5c-82c2-23c01dd9cd89', '6232ed4f-8b57-40a9-9fdf-0956913a84f6', '772a84ea-9f22-4167-8887-dc8ea46b2ad7', 'e74100ac-f17f-4fdc-81b9-18b2e3c69064', '7d0ba392-0eef-4da6-8d31-6eb84223dcc3', '98c688d5-5aa2-49cb-ac77-847b04abd145', '3e4a0fd0-d383-4b1a-b20c-56794fd9a096', 'ed466545-1988-4d52-8dec-835075a6049d', 'bb829a61-6e91-481f-b90e-ffa1baa0e876', '6e49c916-2197-4ff8-a582-52e751d20837', '6a5a7a71-3721-44e3-bd93-08c8128ef52c', '6dcba672-d54e-4900-b807-737f6adfc6e3', '6f13d708-a1cf-4884-b748-dcc932e753b9', '5721c7af-932b-4250-81f9-ab88f219a98f', '6f754a3f-81f6-46be-9160-95ffd0af6f9b', '723b9f8c-11ea-4261-801f-f02e1b4aa518', '90889945-059e-4777-9b94-a473ddacb185', '27066b5b-5c80-408b-a8aa-cae9b0304a26', '3389b0c1-e39d-4fba-b175-986b12c2baeb', '8185facc-f20b-4261-b8d2-cf584f50c151', '23251773-7ea8-42d5-a791-303240af896c', '5605e378-5648-44ab-b2dd-1caf44267c00', '6df6ccb8-d26e-4bbf-8285-18fdfe1ae64e', '2af8cf20-c1ee-48ae-bc52-dacfb60d9bf2', '9725444a-b59f-499b-9293-51641a65f689', '9c262ca0-15d3-4a78-8e53-a8643c0e46da', '39ebce16-b68c-4edc-8e2b-00d4cf677931', '267b87e0-ff6d-4832-a703-6d3cbc127e24', '186e0ac2-57a7-424f-8e95-09ddf6e25a86', '16a2f8cc-73ab-46ea-bc33-b6304250376e', '8fb0d733-8c8c-4df3-97f9-68937ea92b73', '788bc502-840a-4283-995b-6edf6df57458', 'f390865a-da66-4241-9bee-763279e050f9', 'c433c10b-64e1-4ace-aa07-13af868fbbc7', '23384eb4-21e3-4819-86d7-1f60a579d839', 'aa346ded-5d7a-4aff-a16a-413250513b40', 'b5a64bbd-ae68-49e2-8b60-f04e3e80158d', '5d392703-1fcd-42e1-a87f-8da4c2be82df', '6b3c4b43-2968-4a95-b32b-bf9be92aea6b'], static_run_uuid='4afd1757-09f3-44c4-ad15-93d099e989d0', born_run_uuid=None))

YouTube for example uses 11-character IDs?

https://youtu.be/R3N6BocbknM

It's case-sensitive alpha-numeric plus hyphen and underscore giving $64^{11} \approx 7.4 \cdot 10^{19}$ combinations. I think there would be readability and "pastibility" improvements.

The text was updated successfully, but these errors were encountered:

utf · 2024-01-07T13:06:03Z

I'm not against using shorter IDs, especially since they're currently encoded as string not bson uuids. Is there a way to generate those short IDs without an additional dependency?

janosh · 2024-01-07T15:11:15Z

Yes, we could do

import random
import string


def get_random_id():
    """Return a random 11-character YouTube-like ID."""
    chars = random.choices(string.ascii_letters + string.digits + "-_", k=11)
    return "".join(chars)

Maybe pastibility would be better if we drop the dash and underscore.

def get_random_id():
    """Return a random 11-character alpha-numeric ID."""
    chars = random.choices(string.ascii_letters + string.digits, k=11)
    return "".join(chars)

gpetretto · 2024-01-08T12:21:58Z

I think that making the uuid shorter and easier for copy/paste would indeed be beneficial, but I have a few notes that may be considered before proceeding:

in jobflow-remote, in the CLI, I am using the structure of the UUID to distinguish if a user passes a uuid or a db_id. This is not a blocking issue, but I thought it was worth mentioning it. Maybe the new id structure could be made so to be clearly identifiable?
in the databases and in the code the UUIDs are always referred as uuid. Changing that would require some refactoring and maybe backward incompatible changes, not changing means that a uuid attribute contains something that is not a proper uuid

janosh · 2024-01-08T12:37:59Z

@gpetretto Great points both!

We could prefix the IDs with jf- which would allow identifying them via id.startwith('jf-') and would even allow future changes of the ID format without breaking these checks.
If we go ahead with this change, I vote to rename from uuid to id (or jf_id) since keeping uuid would be misleading.

gpetretto · 2024-01-08T16:08:17Z

Thanks for considering these points. I have a few more comments.

There is an error in the estimate of the number of possible combinations. It is not 11^64 but 64^11 ~ 7.37E19 combinations, which is considerably smaller than the number of different UUIDs (5.3E36). I suppose the possibility of collision is still rather low, but if this option is chosen, it may be convenient to increase the size of the string.
If changing the id format is an option, it may also be convenient to consider other algorithms and their features. For example something sortable. Here are a few examples I have come across:
- ULID
- Snowflake_id
- NanoID
- CUID
python implementations are typically available and they probably won't require further dependencies, but I suppose it would also be easy to add a simple reimplementation inside jobflow.

janosh · 2024-01-08T16:19:15Z

I was looking at NanoID when opening this issue but ULID looks even better. Only thing is 26 vs 36 is not that much shorter.

It is not 11^64 but 64^11 ~ 7.37E19

Oops, rookie mistake 😅

Andrew-S-Rosen · 2024-01-08T23:08:44Z

Maybe I'm not thinking hard enough about it, but is there any reason not to have the unique ID simply be the datetime + a small identifier?

time_now = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H-%M-%S-%f")
f"{time_now}-{SMALL_ID_GOES_HERE}"

Then it would be interpretable but also with low probability of clashing even if SMALL_ID_GOES_HERE is not particularly diverse.

I guess mine isn't shorter though, so I take it back. 😅

jmmshn · 2024-01-18T17:41:53Z

The sortable point from @gpetretto is pretty important.
Some of the more complex workflows in emmet require the IDs to be sorted (ideally based on creating time). So if we move away from UUID4 (which I think is random), I think this is actually a requirement.

Otherwise, we run into problems like you see here:
materialsproject/atomate2#655 (comment)

utf · 2024-01-18T18:00:17Z

Agreed that given the emmet limitations, having a sortable id is essential.

Of the sortable options mentioned

UUID1 is 36 characters
ULID is 26 characters
Snowflake-id is 19 characters but all numbers and quite difficult to tell the difference between ids.
NanoID and CUID are not sortable as far as I can tell.

Are there any other options/suggestions on what to use?

Andrew-S-Rosen · 2024-01-18T18:02:50Z

Just chiming in to say that, while it doesn't help now, looks like uuidv7 might address some of these needs in the future (e.g. for other projects). https://buildkite.com/blog/goodbye-integers-hello-uuids

janosh · 2024-01-18T18:12:26Z

Cool! uuidv7 support in std lib is tracked in python/cpython#89083 with a planned release in Python 3.13.

We could install uuid7 in the meantime or copy-paste this function into jobflow

Andrew-S-Rosen · 2024-01-18T18:33:00Z

@janosh just a note: stevesimmons/uuid7#1 (comment)

jmmshn · 2024-01-18T18:57:00Z

Maybe just use UUID1 for now, and when we add the support for UUIDS in emmet we can write it with both UUID1 and UUID7 in mind.

Andrew-S-Rosen · 2024-01-18T18:58:40Z

I think that makes the most sense of the available options.

utf · 2024-01-18T18:59:35Z

Ok, happy to go with that. @jmmshn would you be able to submit a PR? I will then release a new version of jobflow.

jmmshn · 2024-01-18T19:04:10Z

OK!

gpetretto · 2024-01-18T19:07:50Z

I think that also the initial points raised by @janosh in the first message are worth considering, i.e. the ability to quickly copy/paste and having a shorter ID. uuid1 and uuid7 address the sortability, but not the other two issues. ULID would address all of them (10 characters shorter, no hyphens, sortable). Why not considering that instead?

utf · 2024-01-18T19:10:46Z

I would also be happy with ULID. The only drawback is that we would need to either:

Add ULID as a dependency
Copy the code into jobflow.

What I'm not clear about is whether we'd also need to do the same for emmet in order to use the sorting features.

jmmshn · 2024-01-18T19:13:43Z

Ditto @janosh and @gpetretto on the size thing.
since the UUID bit of code in jobflow is so small can't we just move it all to emmet core?

utf · 2024-01-18T19:15:45Z

Jobflow doesn't depend on emmet though (only atomate2 does).

From some quick reading, it seems like the sorting should be trivial without needing ULID as a dependency.

davidwaroquiers · 2024-01-18T19:20:34Z

Maybe in monty? Both emmet and joblow depend on monty.

jmmshn · 2024-01-18T20:12:39Z

I think we can just geta PR going with UUID1 for now and add others later.

gpetretto · 2024-01-18T21:50:47Z

I would like to point out two other smaller issues with uuid1:

it is not really sortable. The order needs to be reconstructed at a later time. Which is definitely a downside as it could be useful in DB queries.
For a single user the number of machines used to generate a new uuid are likely limited. New workflows are typically generated on the same machine and then added to the DB. So for a single user uuid1 can be seen as a very long id that just contains the timestamp.

As a general comment, I would avoid changing multiple times the algorithm for the id generation, as this may be confusing or leading to unexpected issues. Even changing it once may be a reason for concern. According to the website linked above, the definition of the uuid7 standard should be in its final stage. So I am wondering if it would not be better to wait for its release in order to minimize the number of changes.

davidwaroquiers · 2024-01-19T08:38:04Z

I would like to point out two other smaller issues with uuid1:

it is not really sortable. The order needs to be reconstructed at a later time. Which is definitely a downside as it could be useful in DB queries.

For a single user the number of machines used to generate a new uuid are likely limited. New workflows are typically generated on the same machine and then added to the DB. So for a single user uuid1 can be seen as a very long id that just contains the timestamp.

As a general comment, I would avoid changing multiple times the algorithm for the id generation, as this may be confusing or leading to unexpected issues. Even changing it once may be a reason for concern. According to the website linked above, the definition of the uuid7 standard should be in its final stage. So I am wondering if it would not be better to wait for its release in order to minimize the number of changes.

I also think it might be better to not change things too many times as jobflow (and atomate2) are already used in production by many people. One thing that was discussed above was also the renaming of the uuid attribute (to id or jf_id). This might also be something to keep in mind (as @gpetretto mentioned, having a uuid attribute which is not a "proper" uuid could make things confusing). Such a change could be done earlier with a deprecation. Not sure I would call it id nor jf_id though. Maybe uid (for unique id) ?

jmmshn · 2024-01-19T17:57:45Z

I agree with the naming issue, and a change to uid is probably needed.
I think we can enable support for multiple id types, keep it at uuid4 as default for now, and swap it over to ulid or uuid7 at a later time.
This will give me some kind of sortable id immediately so I can proceed with my stuff (just have a hack in a comparator for the uuid1's) then we can wait a while to swap it over to ulid as default after we have all used it in testing for a while.

Also, if someone is already committed to using uuid4 they can just tweak their jobflow.yaml and not disrupt anything even after the switch.

mkhorton · 2024-01-23T20:15:36Z

One important factor here is that MongoDB does have a native UUID datatype, which reduces the number of bytes to store it compared to a string representation. Since ULID is bit-compatible with UUID, it’s probably possible to switch away from using a string too.

mkhorton · 2024-01-23T20:21:01Z

That said, I’m not convinced about the argument against uuid7 if it’s a standard and implementations are imminent. If the size on disk is the same, I’m not sure it matters if the string representation has hyphens or not — and indeed, the hyphens help with readability.

dpldgr · 2024-02-01T07:31:29Z

Maybe just use UUID1 for now, and when we add the support for UUIDS in emmet we can write it with both UUID1 and UUID7 in mind.

A few comments on UUIDv1 and UUIDv7:

UUIDv1 is not easily sortable in the same way that UUIDv7 is because of it's bit layout, so I would recommend against it's use if performant sorting by timestamp is important.
UUIDv7 is close to being made a standard but has not been finalised yet. It's close enough that you could with high probability rely on the bit layout not to change: https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122bis/14/
A 3rd option would be to use UUIDv8 which has a much more flexible layout: just about any bit layout is valid as long as the version and variant fields are set correctly.

mkhorton · 2024-02-01T18:57:50Z

ULID was already merged into emmet, but I agree with your points @dpldgr. I’d much rather we try to stick with a standard, it seems safe to adopt UUIDv7 given how close it is to being finalized.

jmmshn · 2024-02-01T19:41:37Z

So things on both emmet and jobflow are kept flexible for the time being and the default behavior is basically "nothing changes" but if you want to use ULID or UUIDv1 for any reason right now you can.

It might make sense to have a couple of full builds of MP with atomate2 data first before we fully commit to a convention change. But one of the MP builders will have to chime in and participate on that.

Andrew-S-Rosen · 2024-02-01T20:12:26Z

@jmmshn --- out of curiosity and somewhat independent of integration in jobflow, do you have a recommended library to use UUIDv7 today (acknowledging it's possible, but unlikely, it'll substantially change in the future)? https://github.com/oittaa/uuid6-python?

mkhorton · 2024-02-02T01:40:53Z

This library looks good:

https://github.com/aminalaee/uuid-utils

I would avoid the “uuid7” library since apparently it’s non-compliant.

Discussion on uuid7 making it into the stdlib here:

https://discuss.python.org/t/add-uuid7-in-uuid-module-in-standard-library/44390/5

dpldgr · 2024-02-02T07:08:48Z

So things on both emmet and jobflow are kept flexible for the time being and the default behavior is basically "nothing changes" but if you want to use ULID or UUIDv1 for any reason right now you can.

From what I can tell, ULID is very very similiar to UUIDv7 in bit layout. It uses the same bit layout for the timestamp, but omits the variant and version fields that all UUIDs have (they have random values instead). So you could trivially make a ULID into a UUIDv7 by setting the variant field (bits 64-65) to binary 10 and the version field (bits 48-51) to binary 0111 . That would make for a much easier potential migration path than converting from UUIDv1 to UUIDv7.

I would avoid the “uuid7” library since apparently it’s non-compliant.

It is compliant, but with an old version of the draft. It should either be updated or removed as a package because there's obviously plenty of scope for confusion and unexpected results.

dpldgr · 2024-02-02T08:00:19Z

@jmmshn --- out of curiosity and somewhat independent of integration in jobflow, do you have a recommended library to use UUIDv7 today (acknowledging it's possible, but unlikely, it'll substantially change in the future)? https://github.com/oittaa/uuid6-python?

That one looks like a good choice to me. It's an active repo that has had plenty of releases, so you're much more likely to get support if it's needed. The 'uuid7' repo in comparison has been stagnant for quite some time.

janosh added question Further information is requested ux User experience labels Jan 1, 2024

jmmshn mentioned this issue Jan 18, 2024

Electrode Workflow materialsproject/atomate2#655

Merged

1 task

jmmshn mentioned this issue Jan 18, 2024

allow different UID types #529

Merged

jmmshn mentioned this issue Jan 23, 2024

Allow ULID for MPIDS materialsproject/emmet#928

Merged

kjappelbaum mentioned this issue May 31, 2024

generate more meaningful UUIDs for report directories lamalab-org/chem-bench#360

Open

jmmshn mentioned this issue Aug 15, 2024

Fix MPID assignment in electrode workflow materialsproject/atomate2#956

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Why UUIDs? #519

Why UUIDs? #519

janosh commented Jan 1, 2024 •

edited

Loading

utf commented Jan 7, 2024

janosh commented Jan 7, 2024

gpetretto commented Jan 8, 2024

janosh commented Jan 8, 2024

gpetretto commented Jan 8, 2024

janosh commented Jan 8, 2024

Andrew-S-Rosen commented Jan 8, 2024 •

edited

Loading

jmmshn commented Jan 18, 2024

utf commented Jan 18, 2024

Andrew-S-Rosen commented Jan 18, 2024

janosh commented Jan 18, 2024

Andrew-S-Rosen commented Jan 18, 2024

jmmshn commented Jan 18, 2024

Andrew-S-Rosen commented Jan 18, 2024

utf commented Jan 18, 2024

jmmshn commented Jan 18, 2024

gpetretto commented Jan 18, 2024

utf commented Jan 18, 2024

jmmshn commented Jan 18, 2024 •

edited

Loading

utf commented Jan 18, 2024

davidwaroquiers commented Jan 18, 2024

jmmshn commented Jan 18, 2024

gpetretto commented Jan 18, 2024

davidwaroquiers commented Jan 19, 2024

jmmshn commented Jan 19, 2024 •

edited

Loading

mkhorton commented Jan 23, 2024

mkhorton commented Jan 23, 2024 •

edited

Loading

dpldgr commented Feb 1, 2024

mkhorton commented Feb 1, 2024

jmmshn commented Feb 1, 2024

Andrew-S-Rosen commented Feb 1, 2024

mkhorton commented Feb 2, 2024

dpldgr commented Feb 2, 2024

dpldgr commented Feb 2, 2024

Why UUIDs? #519

Why UUIDs? #519

Comments

janosh commented Jan 1, 2024 • edited Loading

utf commented Jan 7, 2024

janosh commented Jan 7, 2024

gpetretto commented Jan 8, 2024

janosh commented Jan 8, 2024

gpetretto commented Jan 8, 2024

janosh commented Jan 8, 2024

Andrew-S-Rosen commented Jan 8, 2024 • edited Loading

jmmshn commented Jan 18, 2024

utf commented Jan 18, 2024

Andrew-S-Rosen commented Jan 18, 2024

janosh commented Jan 18, 2024

Andrew-S-Rosen commented Jan 18, 2024

jmmshn commented Jan 18, 2024

Andrew-S-Rosen commented Jan 18, 2024

utf commented Jan 18, 2024

jmmshn commented Jan 18, 2024

gpetretto commented Jan 18, 2024

utf commented Jan 18, 2024

jmmshn commented Jan 18, 2024 • edited Loading

utf commented Jan 18, 2024

davidwaroquiers commented Jan 18, 2024

jmmshn commented Jan 18, 2024

gpetretto commented Jan 18, 2024

davidwaroquiers commented Jan 19, 2024

jmmshn commented Jan 19, 2024 • edited Loading

mkhorton commented Jan 23, 2024

mkhorton commented Jan 23, 2024 • edited Loading

dpldgr commented Feb 1, 2024

mkhorton commented Feb 1, 2024

jmmshn commented Feb 1, 2024

Andrew-S-Rosen commented Feb 1, 2024

mkhorton commented Feb 2, 2024

dpldgr commented Feb 2, 2024

dpldgr commented Feb 2, 2024

janosh commented Jan 1, 2024 •

edited

Loading

Andrew-S-Rosen commented Jan 8, 2024 •

edited

Loading

jmmshn commented Jan 18, 2024 •

edited

Loading

jmmshn commented Jan 19, 2024 •

edited

Loading

mkhorton commented Jan 23, 2024 •

edited

Loading