ids of `adjective_animal` not unique #20

S-AQ · 2023-11-23T09:04:53Z

Hi, thank you for the package.

I discovered that the ids resulting from adjective_animal are not unique:

library(ids)    # provides adjective_animal()
library(dplyr)  # provides the %>% pipe
set.seed(7040)
animal_ids <- adjective_animal(n = 10000, n_adjectives = 1, max_len = 6)
animal_ids %>% unique() %>% length()
# 9937

Perhaps this might result from homonyms getting listed as separate entries in the word list, but I haven't checked. Anyway, I thought it was important to raise an issue.

richfitz · 2023-11-24T09:51:52Z

This is because we sample with replacement: the ids come from a pool of finite size, so collisions are entirely possible. We discuss this in the vignette; see the description of the "pool size":

There are 1748 animal names and 8946 adjectives so each one you add increases the identifier space by a factor of 8946. So for 1, 2, and 3 adjectives there are about 15.6 million, 140 billion and 1250 trillion possible combinations.

With 1 adjective you have 15.6 million possibilities, and if you draw 10000 from this your chance of seeing a collision is quite high:

> pbirthday(10000, 1748 * 8946)
[1] 0.9591473

(This is the probability of at least one collision within the set of 10000 samples; put another way, there is only about a 4% chance of seeing no collision.) If collision avoidance is important and you need a large set of identifiers, you should either add more adjectives (using two reduces the collision probability to about 1/2,500 for 10,000 samples) or use one of the more boring identifiers.
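As a rough sketch of how these figures can be reproduced, the snippet below computes the collision probability for 1, 2, and 3 adjectives. It assumes ids are drawn uniformly and independently from the full pool (no `max_len`), and uses the word counts quoted above. The variable names are illustrative only.

```r
# Word counts quoted from the vignette
n_animals    <- 1748
n_adjectives <- 8946
n_ids        <- 10000

# Pool size for 1, 2 and 3 adjectives
pool <- n_animals * n_adjectives^(1:3)

# Probability of at least one collision, via stats::pbirthday()
p_exact <- sapply(pool, function(N) pbirthday(n_ids, classes = N))

# Standard birthday approximation: 1 - exp(-n (n - 1) / (2 N))
p_approx <- -expm1(-n_ids * (n_ids - 1) / (2 * pool))

data.frame(adjectives = 1:3, pool = pool, p_exact = p_exact, p_approx = p_approx)
```

The approximation is handy for back-of-envelope checks: the collision probability grows roughly with the square of the number of ids and shrinks linearly with the pool size.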

If you set a max_len argument you deplete the pool even more, making collisions very likely, but increasing the readability of the identifiers you create.
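The report at the top of this thread gives a rough way to gauge how much `max_len = 6` shrinks the pool. The sketch below assumes uniform draws with replacement and backs out the pool size that would, on average, yield 9937 distinct ids from 10000 draws. It rests on a single seed, so treat the result as an order-of-magnitude estimate, not a property of the package.

```r
n_drawn       <- 10000
n_unique_seen <- 9937          # reported in the opening post (one seed only)
full_pool     <- 1748 * 8946   # one adjective, no max_len

# Expected number of distinct values after n uniform draws with replacement:
# pool * (1 - (1 - 1 / pool)^n), written with log1p/expm1 for accuracy
expected_unique <- function(pool, n) {
  -pool * expm1(n * log1p(-1 / pool))
}

# Expected duplicates if the full one-adjective pool were available
n_drawn - expected_unique(full_pool, n_drawn)

# Pool size whose expected distinct count matches the observation
fit <- uniroot(function(pool) expected_unique(pool, n_drawn) - n_unique_seen,
               interval = c(n_drawn, full_pool))
pool_estimate <- fit$root
pool_estimate
```

If the estimate comes out far below the full pool size, that is consistent with `max_len` removing most of the longer words.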

alexpghayes · 2024-12-19T20:35:04Z

Given that the primary purpose of an id is often to be a unique identifier, it could be valuable to:

update the function documentation so that it clarifies whether generated ids are guaranteed to be unique, unique with high probability, or possibly not unique
possibly check whether generated ids are unique and warn the user if they are not (personally, I think it would be valuable to warn by default when they are not unique, or, if the check itself is too expensive to turn on by default, to emit a message alerting the user that no uniqueness check has occurred)
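A minimal sketch of what such a check could look like on the user side is given below. Both `check_unique()` and `draw_unique()` are hypothetical helpers, not part of the ids package, and `toy_generator()` is a stand-in so the example runs without the package installed.

```r
# Hypothetical helper: warn if a vector of ids contains duplicates
check_unique <- function(x) {
  n_dup <- sum(duplicated(x))
  if (n_dup > 0) {
    warning(sprintf("%d of %d ids are duplicates", n_dup, length(x)),
            call. = FALSE)
  }
  invisible(x)
}

# Hypothetical helper: keep drawing until n distinct ids have been collected.
# `generator` is any function taking a count and returning that many ids.
draw_unique <- function(generator, n, max_rounds = 20) {
  out <- character(0)
  for (i in seq_len(max_rounds)) {
    out <- unique(c(out, generator(n - length(out))))
    if (length(out) >= n) {
      return(out)
    }
  }
  stop("could not collect ", n, " unique ids in ", max_rounds, " rounds")
}

# Stand-in generator drawing with replacement from a small pool
toy_generator <- function(k) {
  sample(sprintf("id%03d", 1:500), k, replace = TRUE)
}

set.seed(1)
res <- draw_unique(toy_generator, 200)
check_unique(res)  # silent, because res has no duplicates

# With the ids package installed, the generator could be, for example:
# function(k) ids::adjective_animal(n = k, n_adjectives = 1, max_len = 6)
```

Redrawing only the missing ids keeps the cost low when collisions are rare, while `max_rounds` guards against looping forever when the pool is smaller than the number of ids requested.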
