### Context & Description

When we run k-fold cross validation, we use `n_splits=3` (set in `morphoclass/dvc/training/configs/splitter-stratifKFold.yaml`, line 4 in `021c632`). But for layer `L4` and layer `L6` of the dataset `interneurons`, we have the classes `L4_BP` and `L6_DBC` with only 2 < 3 samples. This situation generates the following Python warning when iterating over `StratifiedKFold.split(X, y)`:
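This is presumably scikit-learn's usual least-populated-class warning; the exact wording varies across versions:

```
UserWarning: The least populated class in y has only 2 members, which is less than n_splits=3.
```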
### What happens?

If a class has fewer members than `n_splits`, then for some splits of `StratifiedKFold` we will have no representatives of that class in the validation set or in the training set. For instance, splitting `[0] * 7 + [1] * 2` with `StratifiedKFold(n_splits=3)` yields training and validation sets where one of the splits has no sample of class `1` in the validation set at all (see the sketch below)!
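A minimal sketch to inspect the fold composition (the exact assignment of samples to folds may differ across scikit-learn versions):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy dataset with a 2-member minority class, mirroring the
# L4_BP / L6_DBC situation described above.
y = np.array([0] * 7 + [1] * 2)
X = np.zeros((len(y), 1))  # features are irrelevant to the split itself

# Iterating also emits the UserWarning quoted above.
for i, (train_idx, val_idx) in enumerate(
    StratifiedKFold(n_splits=3).split(X, y), start=1
):
    print(f"split {i}: train y = {y[train_idx]}, val y = {y[val_idx]}")
# One of the validation sets contains no sample of class 1 at all.
```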
### Why may this be an issue?

**Some metrics may not be computed and/or return wrong values.**

Metrics such as `precision_score`, `recall_score`, and `f1_score` cannot be computed if there is no sample for a given class. For instance, in the example above, using the validation set of the split that lacks class `1` to compute `f1_score` for `y_pred = [0, 0, 0]` will return `0.0` (despite `y_pred` matching `y_true` perfectly!) and raise an `UndefinedMetricWarning`, as shown in the sketch below.

However, in our case we do not compute metrics per split and then average across all splits; instead, we take all the out-of-sample predictions (generated during the various splits) and compute the metric once using all samples. Therefore, no class can ever have `0` samples during evaluation.
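A sketch of both behaviours, with a simple majority-class predictor standing in for our actual models:

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 7 + [1] * 2)
X = np.zeros((len(y), 1))

# Per-split scoring on the fold whose validation set contains only class 0.
y_true = np.array([0, 0, 0])
y_pred = np.array([0, 0, 0])  # a perfect prediction
# Emits UndefinedMetricWarning ("F-score is ill-defined and being set to
# 0.0 ...") because there is no class-1 sample to score against.
print(f1_score(y_true, y_pred))  # 0.0, despite y_pred == y_true

# Pooled scoring, as in our pipeline: gather the out-of-sample predictions
# from every split, then compute the metric once over all samples.
oos_pred = np.empty_like(y)
for train_idx, val_idx in StratifiedKFold(n_splits=3).split(X, y):
    # Stand-in for the real model: predict the training fold's majority class.
    oos_pred[val_idx] = np.bincount(y[train_idx]).argmax()
# Both classes keep all of their samples in this single evaluation.
print(f1_score(y, oos_pred, average="weighted", zero_division=0))
```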
**A class in the validation set may not be present in the training set.**

This would be dramatic, because after training the model would not even be aware of the existence of a class that is nevertheless present in the validation set. So it is guaranteed that the model will never predict that class. However, I have never observed this happening on our data, and I am not even sure it is possible.
**Evaluating on classes with few samples may not be very meaningful.**

Does it really make sense to take into account the model's performance on a class that has only 1 member in the training set or in the validation set? However, as long as we look at `micro` or `weighted` averages, the impact of (potentially awful) performance on classes with 1 or 2 samples is limited. But this could be bad if we want to look at `macro` averages (see the sketch below):

https://github.com/scikit-learn/scikit-learn/blob/baf828ca126bcb2c0ad813226963621cafe38adb/sklearn/metrics/_classification.py#L1049-L1062
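A toy illustration with hypothetical numbers (not our dataset) of how much more sensitive `macro` averaging is to a missed rare class:

```python
import numpy as np
from sklearn.metrics import f1_score

# The model is good on the large class (0) but misses the single
# sample of the rare class (1) entirely.
y_true = np.array([0] * 99 + [1])
y_pred = np.array([0] * 100)

# micro/weighted: the rare class carries weight 1/100, so the damage is small.
print(f1_score(y_true, y_pred, average="micro"))                      # 0.99
print(f1_score(y_true, y_pred, average="weighted", zero_division=0))  # ~0.98
# macro: every class counts equally, so one missed rare class halves the score.
print(f1_score(y_true, y_pred, average="macro", zero_division=0))     # ~0.50
```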
### How do we solve this?

We could remove or merge classes with fewer than 3 samples, e.g. with a helper like the sketch below. But this should be discussed with the scientists: maybe those classes with few samples are very important and well defined, and must be kept anyway?
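One possible sketch of the "remove" option; `drop_rare_classes` is a hypothetical helper, not existing morphoclass code:

```python
import numpy as np

def drop_rare_classes(X, y, min_samples=3):
    """Drop all samples whose class has fewer than min_samples members."""
    classes, counts = np.unique(y, return_counts=True)
    keep = np.isin(y, classes[counts >= min_samples])
    return X[keep], y[keep]

# Example with hypothetical labels: the two L4_BP samples are dropped
# before the data ever reaches StratifiedKFold.
y = np.array(["L4_BP"] * 2 + ["L4_LBC"] * 7)
X = np.zeros((len(y), 1))
X_kept, y_kept = drop_rare_classes(X, y)
print(np.unique(y_kept))  # ['L4_LBC']
```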