Hi there, in the following code, I fail to understand how the hash function is working here and how would it make sure that the test set is randomly selected. Can you shed some light on this please? Thank you.
Sure. The CRC-32 algorithm takes the identifier and converts it to a (somewhat) pseudo-random 32-bit number. In other words, we're using the crc32() function as a hash function: the same identifier will always produce the same CRC-32 "hash", but two different identifiers will generally produce very different (seemingly random) hashes.
The & 0xffffffff can be removed: it doesn't hurt, but it was only needed in Python 2 to convert the output of crc32() from a signed 32-bit integer to an unsigned 32-bit integer.
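To make the two points above concrete, here is a small sketch using Python 3's standard zlib module. The identifier byte strings are made up for illustration; they are not taken from the original code.

```python
from zlib import crc32

# Made-up identifiers, encoded as bytes (crc32 requires bytes-like input).
id_a = b"instance-1"
id_b = b"instance-2"

hash_a_first = crc32(id_a)
hash_a_second = crc32(id_a)
hash_b = crc32(id_b)

# Same identifier, same hash, every time.
same_input_same_hash = hash_a_first == hash_a_second

# In Python 3 the result is already an unsigned 32-bit integer,
# so masking with 0xffffffff leaves it unchanged.
in_unsigned_range = 0 <= hash_a_first < 2**32
mask_is_noop = (hash_a_first & 0xffffffff) == hash_a_first

print(hex(hash_a_first), hex(hash_b))
print(same_input_same_hash, in_unsigned_range, mask_is_noop)
```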
So now we have an unsigned 32-bit integer. We check whether it's smaller than test_ratio * 2**32. The largest possible CRC-32 value is 0xffffffff, which is 2**32 - 1. So if test_ratio is equal to, say, 0.2, then the pseudo-random CRC-32 will be smaller than 0.2 * 2**32 about 20% of the time, on average over all possible identifiers.
In short, this function takes an identifier and a test_ratio, and it computes a CRC-32 of the identifier to determine whether or not it belongs to the test set. Each identifier has a probability equal to the test_ratio of being assigned to the test set.
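As a rough illustration, a function of this kind might look like the sketch below. The name is_in_test_set and the choice to encode integer identifiers as 8 little-endian bytes are assumptions made for this example, not necessarily what the code in question does.

```python
from zlib import crc32


def is_in_test_set(identifier: int, test_ratio: float) -> bool:
    """Hypothetical helper: decide test-set membership from a CRC-32 hash."""
    # Assumption: encode the integer identifier as 8 little-endian bytes,
    # since crc32() needs bytes-like input.
    id_bytes = identifier.to_bytes(8, "little", signed=True)
    # crc32() returns a value in [0, 2**32 - 1]; roughly a test_ratio
    # share of those values fall below the threshold.
    return crc32(id_bytes) < test_ratio * 2**32


# Measure the fraction of identifiers that land in the test set.
ids = range(100_000)
test_ids = [i for i in ids if is_in_test_set(i, 0.2)]
fraction = len(test_ids) / len(ids)
print(f"fraction in test set: {fraction:.3f}")
```

The printed fraction should come out close to 0.2 but usually not exactly 0.2. Because the decision depends only on the identifier, an instance keeps its assignment from one run to the next.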
Since each identifier is assigned pseudo-randomly with probability test_ratio, it is possible that the actual fraction of instances in the test set will not be exactly equal to test_ratio, especially for small datasets. If that's a problem, then you can use another approach, such as shuffling the dataset and using the first 20% for testing (for example).
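For comparison, the shuffling approach could be sketched as follows. The function name shuffle_split and the fixed seed are assumptions made for this example.

```python
import random


def shuffle_split(items, test_ratio, seed=42):
    """Hypothetical helper: shuffle, then take the first test_ratio share as the test set."""
    shuffled = list(items)
    # A dedicated, seeded generator keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    test_size = int(len(shuffled) * test_ratio)
    return shuffled[test_size:], shuffled[:test_size]


train_items, test_items = shuffle_split(range(10), 0.2)
print(len(train_items), len(test_items))
```

This gives a test set of exactly int(len(items) * test_ratio) items. The trade-off is that membership depends on the shuffle rather than on the identifier, so an instance may move between the two sets if the dataset changes.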