Home
It's quite an old idea.
TSID Creator is just one implementation of that idea.
Because at the time I couldn't think of a better name. I don't even know if "sortable" is a dictionary word. So it's my fault. :)
This could be called, for example, Time-Sorted Unique Identifier, Twitter Snowflake Identifier, Time Sequence Identifier, etc.
You can call it whatever you see fit, as long as people know what you're talking about.
In the end, it's just a name.
Because the TSID Creator was implemented that year and because there needed to be an easy-to-remember date.
You can use whatever start date you want in your application. You just need to be consistent: if you decide to use the date 2022-12-22, you have to use it throughout the application, always.
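To illustrate why the start date has to stay fixed, here is a minimal sketch that does not use the TSID Creator API. The class `CustomEpoch` and its methods are hypothetical names made up for this example, and the date is the one mentioned above, assumed to be midnight UTC. The timestamp component is simply the number of milliseconds since the chosen date, so decoding with a different date would give a different instant.

```java
import java.time.Instant;

// Hypothetical helper, not part of TSID Creator: it only illustrates
// why one fixed start date must be shared by the whole application.
final class CustomEpoch {

    // Assumed start date for this example: 2022-12-22 at midnight UTC.
    static final long EPOCH_MILLIS = Instant.parse("2022-12-22T00:00:00Z").toEpochMilli();

    // Milliseconds elapsed since the custom start date.
    static long toTsidTime(Instant instant) {
        return instant.toEpochMilli() - EPOCH_MILLIS;
    }

    // Inverse operation: recover the wall-clock instant from the stored value.
    static Instant fromTsidTime(long tsidTime) {
        return Instant.ofEpochMilli(tsidTime + EPOCH_MILLIS);
    }
}
```

If one part of the application wrote values with this start date and another part read them with a different one, every recovered instant would be shifted by the difference between the two dates.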
Because those two are the durations that fit in 41 or 42 bits.
If your programming language or database only supports signed integer data types, the first bit of the identifier will be used as a sign bit. This leaves 41 bits for the timestamp, which means that the limit will be reached in about 69 years.
In languages and databases that support unsigned integers, the limit is 139 years.
If the identifier is stored in string format or in byte array format, which might not be common for a 64-bit value, the limit is also 139 years.
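The 69 and 139 year figures can be checked with a little arithmetic, assuming the timestamp counts milliseconds and a year averages 365.25 days. The class name `TimestampLifetime` is just an illustrative name for this sketch.

```java
// Back-of-the-envelope arithmetic; class and method names are illustrative only.
final class TimestampLifetime {

    // Average year length in milliseconds (365.25 days), an approximation.
    static final double MILLIS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000;

    // Years until a millisecond timestamp with the given number of bits runs out.
    static double yearsFor(int timestampBits) {
        return Math.pow(2, timestampBits) / MILLIS_PER_YEAR;
    }
}
```

With these assumptions, `yearsFor(41)` is roughly 69.7 and `yearsFor(42)` is roughly 139.4.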
Honestly, I don't know what will happen in 69 or 139 years.
It is a way to identify the ID generator.
The function of the node identifier is to prevent collisions between IDs produced by more than one ID generator.
When you have more than one process generating IDs, there is a relatively high probability of a collision in a 64-bit ID. But when you say that each generator will have a unique node identifier in your application, you eliminate that collision probability.
This can be a virtual machine ID, a container ID, a running process ID, etc. You are the one who says what it means within the context of your application.
If you have only one process generating IDs, you don't have to worry about collisions within your application.
It's just a bunch of bits that are incremented each time a new identifier is generated. And when the timestamp changes, these bits are randomly reset.
The function of these bits is to ensure that the identifiers are always monotonic, that is, each identifier is always greater than the previous one.
This prevents collisions between identifiers created by the same generator. Since identifiers never go backwards, there is no risk of collision with identifiers that were previously created by a single generator.
Of course, the system clock can go backwards, causing the risk of identifier collisions. But we can't do much about it.
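The counter behavior described above can be sketched roughly as follows. This is not the TSID Creator source code: `CounterSketch` is a hypothetical class, the node bits are left out for brevity, and the 12-bit counter width is an assumption. The handling of a stalled or backward clock and of counter overflow is a simplification chosen for this example.

```java
import java.security.SecureRandom;

// Hypothetical sketch of an "increment, then randomly reset" counter.
// Node bits are omitted; the value returned is (time << 12) | counter.
final class CounterSketch {

    static final int COUNTER_BITS = 12;                       // assumed width
    static final int COUNTER_MASK = (1 << COUNTER_BITS) - 1;

    private final SecureRandom random = new SecureRandom();
    private long lastTime = -1;
    private int counter;

    synchronized long next(long timeMillis) {
        if (timeMillis > lastTime) {
            // The timestamp changed: reset the counter to a random value.
            lastTime = timeMillis;
            counter = random.nextInt(COUNTER_MASK + 1);
        } else {
            // Same millisecond (or the clock did not advance): just increment.
            counter++;
            if (counter > COUNTER_MASK) {
                // Counter exhausted: in this sketch we borrow the next millisecond.
                lastTime++;
                counter = 0;
            }
        }
        return (lastTime << COUNTER_BITS) | counter;
    }
}
```

Because the time part never decreases and the counter only grows within one time value, each call returns a value greater than the previous one from the same generator.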
I know it's a little confusing. When I first implemented TSID Creator, the last 22 bits were completely random. Later I decided to split this chunk of bits into two subcomponents. These 22 bits can be called the "tail" or something like that. I keep calling it "random" because that's how it's implemented and because these subcomponents are still initialized randomly.
Because it's ULID encoding and because it's very efficient.
Nothing prevents you from encoding the TSID in an encoding of your choice, like base-62 for example.
In the current implementation of TSID Creator, this is the only encoding.
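For reference, ULID uses Crockford's base 32 alphabet, which leaves out the letters I, L, O and U. The sketch below is a hand-written illustration, not the library's code: `CrockfordBase32` is a hypothetical class, and it assumes a fixed 13-character uppercase form, where the first character carries only the top 4 bits of the 64-bit value.

```java
// Hypothetical, minimal Crockford base 32 codec for 64-bit values.
final class CrockfordBase32 {

    private static final String ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

    // Encodes the value as 13 characters, 5 bits per character, most significant first.
    static String encode(long value) {
        char[] chars = new char[13];
        for (int i = 12; i >= 0; i--) {
            chars[i] = ALPHABET.charAt((int) (value & 0x1F));
            value >>>= 5; // unsigned shift, so the sign bit is treated as data
        }
        return new String(chars);
    }

    // Decodes 13 uppercase characters back into a 64-bit value.
    // Note: a first character above 'F' would not fit in 64 bits; this sketch does not check for it.
    static long decode(String text) {
        if (text.length() != 13) {
            throw new IllegalArgumentException("expected 13 characters");
        }
        long value = 0;
        for (int i = 0; i < 13; i++) {
            int digit = ALPHABET.indexOf(text.charAt(i));
            if (digit < 0) {
                throw new IllegalArgumentException("invalid character: " + text.charAt(i));
            }
            value = (value << 5) | digit;
        }
        return value;
    }
}
```

Since the alphabet is in ascending character order and the width is fixed, sorting the encoded strings gives the same order as sorting the values as unsigned integers.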
This too can seem confusing.
But, in fact, there is only one type of identifier implemented by TSID Creator.
What changes is the number of bits reserved for the node identifier and for the counter.
I could have just implemented the 1024 node variant, as that's what was used in Twitter Snowflake. I split it into 3 because I thought it would be convenient.
Today I realize that the 256 node variant is the most used.
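The difference between the variants can be sketched with plain bit operations. This is only an illustration under assumptions: `TsidLayout` is a hypothetical class, not the library's API, and it assumes the node bits sit directly above the counter bits inside the last 22 bits, as in Twitter Snowflake. With 8 node bits you get 256 nodes and a 14-bit counter; with 10 node bits you get 1024 nodes and a 12-bit counter.

```java
// Hypothetical helper showing how the 22 trailing bits can be shared
// between the node identifier and the counter.
final class TsidLayout {

    static final int RANDOM_BITS = 22; // node bits + counter bits

    // Packs the three components into one 64-bit value.
    static long compose(long time, int node, int counter, int nodeBits) {
        int counterBits = RANDOM_BITS - nodeBits;
        long nodeMask = (1L << nodeBits) - 1;
        long counterMask = (1L << counterBits) - 1;
        return (time << RANDOM_BITS)
                | ((node & nodeMask) << counterBits)
                | (counter & counterMask);
    }

    // Extracts the timestamp component (the bits above the last 22).
    static long time(long tsid) {
        return tsid >>> RANDOM_BITS;
    }

    // Extracts the node identifier; the caller must know how many node bits were used.
    static int node(long tsid, int nodeBits) {
        return (int) ((tsid >>> (RANDOM_BITS - nodeBits)) & ((1L << nodeBits) - 1));
    }

    // Extracts the counter, which occupies whatever the node bits leave free.
    static int counter(long tsid, int nodeBits) {
        return (int) (tsid & ((1L << (RANDOM_BITS - nodeBits)) - 1));
    }
}
```

Note that the number of node bits is not stored in the identifier itself, so the same value reads differently if it is decoded with a different split.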
The `TsidCreator` class is the easiest way to generate TSIDs.
The `TsidFactory` class is the class that actually creates the TSIDs. This class can be configured to create TSIDs however you see fit. For example, you can change the number of bits reserved for the node identifier, you can change the start date of the timestamp, you can change the random number generator, etc.
The `Tsid` class is a value object. In some applications it may be more convenient to use a value object than a basic data type like `long` or `String`.
Because it was the default in Snowflake Twitter IDs.
In fact, the Twitter Snowflake timestamp is 41 bits long. I added 1 bit to turn the TSID into an unsigned integer, doubling the lifetime of TSIDs. It also made integer format sorting consistent with string format sorting.
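In Java, `long` is always signed, so an identifier whose first bit is set shows up as a negative number. A small sketch of how such values can still be compared and printed as unsigned integers, using the standard `Long.compareUnsigned` and `Long.toUnsignedString` methods. The class `UnsignedOrder` is a hypothetical name for this example.

```java
// Hypothetical helper: compares two 64-bit identifiers as unsigned integers.
final class UnsignedOrder {

    // True when a sorts before b if both are read as unsigned 64-bit values.
    static boolean isBefore(long a, long b) {
        return Long.compareUnsigned(a, b) < 0;
    }

    // Decimal text of the value read as an unsigned 64-bit integer.
    static String asUnsignedText(long value) {
        return Long.toUnsignedString(value);
    }
}
```

For example, a value with only the first bit set (`1L << 63`) is negative as a signed `long`, but `isBefore(1L << 22, 1L << 63)` is true because the comparison is unsigned.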
Some implementations of the concept have different bit counts for timestamps, for example the Mastodon timestamp is 48 bits long.
In the current implementation of TSID Creator, this number of bits cannot be changed.
Because it was the default in Snowflake Twitter IDs.
If you want, you can use any number of bits between 0 and 20. But if you do that, you're also changing the number of bits in the counter.
In Twitter Snowflake, this node idea actually consisted of two parts: Datacenter ID and Worker ID. These two things added together give 10 bits.
Because it was the default in Snowflake Twitter IDs.
If you want, you can use any number of bits between 2 and 22. But to do that, you also have to change the number of bits of the node identifier, since the two together always add up to 22 bits.
Nobody asked me any of this. I use this text format to try to better explain the decisions I had to make during the implementation of the TSID Creator. The questions I've included here are ones I think I would ask myself if I saw this project for the first time. Hope this is helpful.