-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Why UUIDs? #519
Comments
I'm not against using shorter IDs, especially since they're currently encoded as string not bson uuids. Is there a way to generate those short IDs without an additional dependency? |
Yes, we could do import random
import string
def get_random_id():
"""Return a random 11-character YouTube-like ID."""
chars = random.choices(string.ascii_letters + string.digits + "-_", k=11)
return "".join(chars) Maybe pastibility would be better if we drop the dash and underscore. def get_random_id():
"""Return a random 11-character alpha-numeric ID."""
chars = random.choices(string.ascii_letters + string.digits, k=11)
return "".join(chars) |
I think that making the uuid shorter and easier for copy/paste would indeed be beneficial, but I have a few notes that may be considered before proceeding:
|
@gpetretto Great points both!
|
Thanks for considering these points. I have a few more comments.
|
I was looking at NanoID when opening this issue but ULID looks even better. Only thing is 26 vs 36 is not that much shorter.
Oops, rookie mistake 😅 |
Maybe I'm not thinking hard enough about it, but is there any reason not to have the unique ID simply be the datetime + a small identifier?
Then it would be interpretable but also with low probability of clashing even if I guess mine isn't shorter though, so I take it back. 😅 |
The sortable point from @gpetretto is pretty important. Otherwise, we run into problems like you see here: |
Agreed that given the emmet limitations, having a sortable id is essential. Of the sortable options mentioned
Are there any other options/suggestions on what to use? |
Just chiming in to say that, while it doesn't help now, looks like uuidv7 might address some of these needs in the future (e.g. for other projects). https://buildkite.com/blog/goodbye-integers-hello-uuids |
Cool! uuidv7 support in std lib is tracked in python/cpython#89083 with a planned release in Python 3.13. We could install |
@janosh just a note: stevesimmons/uuid7#1 (comment) |
Maybe just use UUID1 for now, and when we add the support for UUIDS in emmet we can write it with both UUID1 and UUID7 in mind. |
I think that makes the most sense of the available options. |
Ok, happy to go with that. @jmmshn would you be able to submit a PR? I will then release a new version of jobflow. |
OK! |
I think that also the initial points raised by @janosh in the first message are worth considering, i.e. the ability to quickly copy/paste and having a shorter ID. uuid1 and uuid7 address the sortability, but not the other two issues. ULID would address all of them (10 characters shorter, no hyphens, sortable). Why not considering that instead? |
I would also be happy with ULID. The only drawback is that we would need to either:
What I'm not clear about is whether we'd also need to do the same for emmet in order to use the sorting features. |
Ditto @janosh and @gpetretto on the size thing. |
Jobflow doesn't depend on emmet though (only atomate2 does). From some quick reading, it seems like the sorting should be trivial without needing ULID as a dependency. |
Maybe in monty? Both emmet and joblow depend on monty. |
I think we can just geta PR going with UUID1 for now and add others later. |
I would like to point out two other smaller issues with uuid1:
As a general comment, I would avoid changing multiple times the algorithm for the id generation, as this may be confusing or leading to unexpected issues. Even changing it once may be a reason for concern. According to the website linked above, the definition of the uuid7 standard should be in its final stage. So I am wondering if it would not be better to wait for its release in order to minimize the number of changes. |
I also think it might be better to not change things too many times as jobflow (and atomate2) are already used in production by many people. One thing that was discussed above was also the renaming of the uuid attribute (to id or jf_id). This might also be something to keep in mind (as @gpetretto mentioned, having a uuid attribute which is not a "proper" uuid could make things confusing). Such a change could be done earlier with a deprecation. Not sure I would call it id nor jf_id though. Maybe uid (for unique id) ? |
I agree with the naming issue, and a change to Also, if someone is already committed to using uuid4 they can just tweak their |
One important factor here is that MongoDB does have a native UUID datatype, which reduces the number of bytes to store it compared to a string representation. Since ULID is bit-compatible with UUID, it’s probably possible to switch away from using a string too. |
That said, I’m not convinced about the argument against uuid7 if it’s a standard and implementations are imminent. If the size on disk is the same, I’m not sure it matters if the string representation has hyphens or not — and indeed, the hyphens help with readability. |
A few comments on UUIDv1 and UUIDv7:
|
ULID was already merged into emmet, but I agree with your points @dpldgr. I’d much rather we try to stick with a standard, it seems safe to adopt UUIDv7 given how close it is to being finalized. |
So things on both emmet and jobflow are kept flexible for the time being and the default behavior is basically "nothing changes" but if you want to use ULID or UUIDv1 for any reason right now you can. It might make sense to have a couple of full builds of MP with atomate2 data first before we fully commit to a convention change. But one of the MP builders will have to chime in and participate on that. |
@jmmshn --- out of curiosity and somewhat independent of integration in jobflow, do you have a recommended library to use UUIDv7 today (acknowledging it's possible, but unlikely, it'll substantially change in the future)? https://github.com/oittaa/uuid6-python? |
This library looks good: https://github.com/aminalaee/uuid-utils I would avoid the “uuid7” library since apparently it’s non-compliant. Discussion on uuid7 making it into the stdlib here: https://discuss.python.org/t/add-uuid7-in-uuid-module-in-standard-library/44390/5 |
From what I can tell, ULID is very very similiar to UUIDv7 in bit layout. It uses the same bit layout for the timestamp, but omits the variant and version fields that all UUIDs have (they have random values instead). So you could trivially make a ULID into a UUIDv7 by setting the variant field (bits 64-65) to binary 10 and the version field (bits 48-51) to binary 0111 . That would make for a much easier potential migration path than converting from UUIDv1 to UUIDv7.
It is compliant, but with an old version of the draft. It should either be updated or removed as a package because there's obviously plenty of scope for confusion and unexpected results. |
That one looks like a good choice to me. It's an active repo that has had plenty of releases, so you're much more likely to get support if it's needed. The 'uuid7' repo in comparison has been stagnant for quite some time. |
I was looking at a wall of job IDs like this earlier and wondering if
jobflow
actually requires full 36-character UUIDs?YouTube for example uses 11-character IDs?
It's case-sensitive alpha-numeric plus hyphen and underscore giving$64^{11} \approx 7.4 \cdot 10^{19}$ combinations. I think there would be readability and "pastibility" improvements.
The text was updated successfully, but these errors were encountered: