[fix]: fixes balanced subsampling bug in data/emnist.py #84

Open
wants to merge 1 commit into
base: main

@mariovas3 commented Mar 20, 2024

Account for the y labels being offset by NUM_SPECIAL_TOKENS when calling np.bincount in the EMNIST balanced subsampling.

The offsetting is found here:

y_train = data["dataset"]["train"][0, 0]["labels"][0, 0] + NUM_SPECIAL_TOKENS

and here:

y_test = data["dataset"]["test"][0, 0]["labels"][0, 0] + NUM_SPECIAL_TOKENS

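np.bincount returns a count for every integer from 0 up to the largest label, so the values 0 to y_min_element-1, which never occur once the labels are offset, show up as leading zeros. If these zeros are not excluded, they bias the mean of the counts downward, which results in fewer samples in the balanced dataset.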

Example bug:

>>> import numpy as np
>>> y = np.array([0, 1, 0, 2, 1])
>>> np.bincount(y)
array([2, 2, 1])
>>> NUM_SPECIAL_TOKENS = 4
>>> np.bincount(y + NUM_SPECIAL_TOKENS)
array([0, 0, 0, 0, 2, 2, 1])
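
To make the effect on the mean concrete, here is a minimal sketch that reuses the toy labels above. The helper `mean_class_count` and its `offset` parameter are hypothetical names introduced only for this illustration; they are not taken from data/emnist.py, and dropping the leading bins is just one possible way to correct for the offset, not necessarily the change made in this PR.

```python
import numpy as np

NUM_SPECIAL_TOKENS = 4  # same value as in the example above


def mean_class_count(y, offset=0):
    """Mean samples per class, skipping the first `offset` bins.

    Hypothetical helper for illustration; not part of data/emnist.py.
    """
    counts = np.bincount(y)[offset:]
    return counts.mean()


# Labels shifted by NUM_SPECIAL_TOKENS, as in the offset y_train / y_test.
y = np.array([0, 1, 0, 2, 1]) + NUM_SPECIAL_TOKENS

# The four leading zero bins are included in the average.
biased = mean_class_count(y)

# Only the bins of labels that can actually occur are averaged.
corrected = mean_class_count(y, offset=NUM_SPECIAL_TOKENS)

print(f"biased mean: {biased:.3f}, corrected mean: {corrected:.3f}")
```

With the leading zeros included, the five samples are averaged over seven bins instead of three, so any per-class sample budget derived from that mean shrinks accordingly.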
