Need help understanding crc hash used to explain test train split in Chapter 2 #655

paliwalvarad · 2022-05-01T22:51:15Z

Hi there, in the following code, I fail to understand how the hash function is working here and how would it make sure that the test set is randomly selected. Can you shed some light on this please? Thank you.

def test_set_check(identifier,test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio *2**32)

ageron · 2022-05-05T09:02:47Z

Hi @paliwalvarad ,

Sure. The CRC-32 algorithm takes the identifier and converts it to a (somewhat) pseudo-random 32-bit number. In other words, we're using the crc32() function as a hash function: the same identifier will always produce the same CRC-32 "hash", but two different identifiers will generally produce very different (seemingly random) hashes.

The & 0xffffffff can be removed: it doesn't hurt, but it was only needed in Python 2 to convert the output of crc32() from a signed 32-bit integer to an unsigned 32-bit integer.

So now we have a 32-bit signed integer. We check whether it's smaller than test_ratio * 2**32. The largest possible CRC-32 value is 0xffffffff, which is 232 - 1. So if test_ratio is equal to, say, 0.2, then the pseudo-random CRC-32 will only be smaller than 232 about 20% of the time, on average over all possible identifiers.

In short, this function takes an identifier and a test_ratio, and it computes a CRC-32 of the identifier to determine whether or not it belongs to the test set. Each identifier has a probability equal to the test_ratio of being assigned to the test set.

Since each identifier is assigned randomly with probability test_ratio, it is possible that the actual test set size will not be exactly equal to the test_ratio, especially for small datasets. If that's a problem, then you can choose to use the other approaches, such as shuffling the dataset and using the first 20% for testing (for example).

Hope this helps.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Need help understanding crc hash used to explain test train split in Chapter 2 #655

Need help understanding crc hash used to explain test train split in Chapter 2 #655

paliwalvarad commented May 1, 2022 •

edited by ageron

Loading

ageron commented May 5, 2022

Need help understanding crc hash used to explain test train split in Chapter 2 #655

Need help understanding crc hash used to explain test train split in Chapter 2 #655

Comments

paliwalvarad commented May 1, 2022 • edited by ageron Loading

ageron commented May 5, 2022

paliwalvarad commented May 1, 2022 •

edited by ageron

Loading