ids of adjective_animal not unique #20

Open
S-AQ opened this issue Nov 23, 2023 · 2 comments

S-AQ commented Nov 23, 2023

Hi, thank you for the package.

I discovered that the ids resulting from adjective_animal are not unique:

library(ids)
library(dplyr)
set.seed(7040)
animal_ids <- adjective_animal(n = 10000, n_adjectives = 1, max_len = 6)
animal_ids %>% unique() %>% length()
# 9937

Perhaps this might result from homonyms getting listed as separate entries in the word list, but I haven't checked. Anyway, I thought it was important to raise an issue.

richfitz (Member) commented Nov 24, 2023

This is because we sample with replacement; the ids are drawn from a pool of finite size, so collisions are very possible. We discuss this in the vignette; see the description of the "pool size":

There are 1748 animal names and 8946 adjectives so each one you add increases the identifier space by a factor of 8946. So for 1, 2, and 3 adjectives there are about 15.6 million, 140 billion and 1250 trillion possible combinations.

With 1 adjective you have 15.6 million possibilities, and if you draw 10000 from this your chance of seeing a collision is quite high:

> pbirthday(10000, 1748 * 8946)
[1] 0.9591473

(This is the probability of at least one collision within the set of 10000 samples; put another way, there is only a 4% chance of not seeing a collision.) If collision avoidance is important and you need a large set of identifiers, you should either add more adjectives (adding two will reduce the collision probability to about 1/2,500 for 10,000 samples), or use one of the more boring identifiers.

If you set a max_len argument you deplete the pool even more, making collisions very likely, but increasing the readability of the identifiers you create.
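
As a rough sketch of the arithmetic above, the pool sizes and collision probabilities can be reproduced in base R. It assumes the word counts quoted from the vignette (1748 animals, 8946 adjectives) and ignores any `max_len` filtering, which would shrink the pool further. The variable names are illustrative only.

```r
# Word counts as quoted from the vignette text above.
n_animals <- 1748
n_adjectives <- 8946

# Number of possible ids when using 1, 2 and 3 adjectives.
pool <- n_animals * n_adjectives^(1:3)

# Probability of at least one collision among 10,000 draws (with
# replacement) from each pool. pbirthday() ships with the 'stats' package.
p_collision <- sapply(pool, function(size) pbirthday(10000, classes = size))

data.frame(adjectives = 1:3, pool = pool, p_collision = p_collision)
```

The first row should match the `pbirthday(10000, 1748 * 8946)` result shown earlier; the later rows show how quickly the probability drops as adjectives are added.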

@alexpghayes

Given that the primary purpose of an id is often to be a unique identifier, it could be valuable to:

  • update the function documentation so that it clarifies whether generated ids are guaranteed to be unique vs unique with high probability vs might not be unique
  • possibly check if generated ids are unique and warn the user if they are not (personally it seems like it would be valuable to default to warning the user when they are not unique, or possibly messaging to alert the user that no check for uniqueness has occurred if the check itself is too expensive to turn on by default)
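
A minimal sketch of what such a check might look like on the user side. `check_unique()` is a hypothetical helper written for illustration; it is not part of the ids package and says nothing about how the maintainers would implement it.

```r
# Hypothetical helper (not part of the ids package): warns when a vector
# of ids contains duplicates, then returns the vector unchanged.
check_unique <- function(x) {
  n_dup <- sum(duplicated(x))
  if (n_dup > 0) {
    warning(sprintf("%d of %d ids are duplicates", n_dup, length(x)),
            call. = FALSE)
  }
  invisible(x)
}

# Usage with the ids package (assumes it is installed):
# animal_ids <- check_unique(
#   ids::adjective_animal(n = 10000, n_adjectives = 1, max_len = 6)
# )
```

`duplicated()` is a single pass over the vector, so the check is cheap relative to generating the ids; whether it belongs on by default is a design decision for the package.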
