Hi, thanks for sharing your code; it's really interesting work!
I have a question about the greedy selection strategy implemented in the code, particularly in greedy_selection.py:L138.
It appears that the current implementation records only the best result from each run of the greedy selection, rather than aggregating the results across all 10 cross-validation folds. I might be missing something here, but in cross-validation we usually expect performance metrics (such as TPR@1%FPR) to be aggregated over all folds to get a more reliable measure of the model's generalization. In your implementation, if a fold's TPR@1%FPR is zero, it is not aggregated into the final results.
Could you clarify whether this approach is intentional? By recording only the best result, the evaluation might be biased toward specific data splits, and we wouldn't get a good indication of how well the greedy selection strategy works on average across different data partitions.
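For illustration, here is a minimal sketch of the kind of aggregation I had in mind, assuming per-fold labels and scores are available (the names `fold_labels`, `fold_scores`, and `tpr_at_fpr` are hypothetical and not taken from your code):

```python
import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(y_true, y_score, target_fpr=0.01):
    # Interpolate the ROC curve to read off TPR at the target FPR (here 1%).
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return float(np.interp(target_fpr, fpr, tpr))

# fold_labels / fold_scores: hypothetical per-fold ground truth and scores
# produced by the 10-fold cross-validation.
fold_tprs = [tpr_at_fpr(y, s) for y, s in zip(fold_labels, fold_scores)]

# Aggregate over all folds instead of keeping only the best one,
# including folds where TPR@1%FPR happens to be zero.
print(f"TPR@1%FPR: {np.mean(fold_tprs):.4f} +/- {np.std(fold_tprs):.4f}")
```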
Thanks in advance for your response!