Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proquint() generates duplicate ids with relatively small n #8

Open
Raesu opened this issue Aug 25, 2021 · 2 comments
Open

proquint() generates duplicate ids with relatively small n #8

Raesu opened this issue Aug 25, 2021 · 2 comments

Comments

@Raesu
Copy link

Raesu commented Aug 25, 2021

If I'm not mistaken, running proquint with the default argument n_words=2L should be able to generate 2^32 (4,294,967,296) unique identifiers.

When I run proquint with just n=121969, I consistently get duplicates. You can test this with the below:

proquint(121969) |> unique() |> length() which in my case is always 2 to 3 less than 121,969. Am I missing something with how this should work?

@Raesu Raesu changed the title proquint() generates duplicate ids proquint() generates duplicate ids with relatively small n Aug 25, 2021
@richfitz
Copy link
Member

richfitz commented Aug 25, 2021

This was a bit alarming, but it turns out that it's correct. The proquint method runs by selecting n_words * n 16 bit integers, then converting them into "words". This should give a keyspace of 2^32 (~4 billion) as you say above.

Skipping out any code in the package, and looking at selecting ~121k 32 bit integers, and looking for duplicates:

set.seed(1)
n <- 121969
y <- replicate(100, sum(duplicated(sample(2^32, n, replace = TRUE))))
table(y)
#  0  1  2  3  4  5  6  7 
# 16 35 23 18  4  2  1  1 

which is consistent with what I saw from proquint based on your example.

Using the "birthday paradox distribution", we only need to draw 77k samples to have a 50% chance of a collision:

qbirthday(classes = 2^32)
# [1] 77164

@r-ash found another way of looking at this problem here - the expected number of collisions can be computed by the probability that the i'th term appears twice (ignoring triple-collisions)

2^32 * dbinom(2, n, 1 / 2^32)
# [1] 1.731782
mean(y)
# [1] 1.74

Typically when I've needed large keyspaces I've used ids::random_id which uses by default 16 bytes (so 2^128). That breaks qbirthday but gets us into healthy territory (based on log-log extrapolation I make it that we'd need to draw ~10^19 samples to have a 50% chance of duplicate).

@Raesu
Copy link
Author

Raesu commented Aug 25, 2021

Ok, good to know this isn't a bug. It would be nice if ids::proquint had an argument unique where it resamples any duplicates before returning the vector. For now I solved my problem by using n=3. I would use ids::random_id but find proquints to be fun.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants