Early stopping #2125

vdbergh · 2024-08-01T08:40:40Z

Lots of tests are stopped early these days, even at LLR=-0.5. One should be aware that this drastically alters the error probabilities of the SPRT compared to their design values of 0.05.

The text was updated successfully, but these errors were encountered:

peregrineshahin · 2024-08-01T09:06:09Z

Stopping happens for too many reasons.
if the test is bugged or the developer wants to submit other tests, ideas that they think are more interesting/promising instead, or
if the submitter believes the test will be a waste of time anyway, or if they want to save resources for other promising ideas running.
The idea that tests should not be early-stopped is exactly to me identical to the idea that we should make a random monkey patch generator and run random 30 tests that compile each day, and they should not be stopped.
I'm not sure why would anyone believe in a patch that the authors themselves don't, or anyone doesn't believe in. Thanks.

vdbergh · 2024-08-01T09:10:58Z

Is a developer does not believe in a test, why then submit it to Fishtest? I can imagine that a developer sometimes changes their mind during the running of a test but this should be quite rare.

Besides the stopping of tests is strongly correlated with the LLR value. So the latter is the main driver.

vdbergh · 2024-08-01T09:14:34Z

Rather than stopping early the developers should be able to indicate the beta value they are happy with. For example if a developer routinely stops a test at LLR=-1.0 this means they are happy with beta=0.37. So then they should submit tests with alpha=0.05, beta=0.37. That would make the trade off clear...

peregrineshahin · 2024-08-01T09:51:11Z

well, the way I look at it is that most users submit tests that are related at once, the idea of the patch is the same but the parameters get tweaked in different directions, there is a relation between the tests one submits when they touch the same code,
in such cases, they assume the effects of the parameters will be continuous which is probably a valid assumption, and make the judgment based on the LLR of such direction setting at a negative value for similar tweaks especially when the other direction is already looking good, so the tests alone might be indeed statistically not valid, but the many of them are pointing to the fail, and the other direction is behaving differently, so it's not a judgment based on the LLR of that specific test alone.

peregrineshahin · 2024-08-01T09:53:21Z

Note that users are limited to 6 active tests per user, this also leads to many early stopping to not get less TP, do you think such a limitation is a bad thing?

vdbergh · 2024-08-01T10:09:44Z

I think we should remain focused. The facts are that many tests get stopped at LLR~-1.0. All I did in the OP is to indicate that stopping early drastically changes beta.

We had this discussion about beta before but then Vondele noted that instead of changing beta one may equivalently change the Elo bounds.

This is mathematically true, but now I realize that there is a psychological difference. With an SPRT[-2.94, 2.94] people will always be tempted to stop a test at LLR=-1.0, no matter what the Elo bounds are.

With an asymmetric SPRT (e.g. SPRT[-0.983, 2.557] which has alpha=0.05 and the same beta value as stopping an SPRT[-2.94, 2.94] at LLR=-1.0) there would be less temptation - I think....

vdbergh · 2024-08-01T10:21:34Z

BTW I think the idea of picking a "promising" patch among related patches depending on how the LLR evolves is strongly flawed. The LLR is a much too noisy metric for that.

peregrineshahin · 2024-08-01T10:23:43Z

If LOS is too noisy and LLR is too noisy, maybe we can fix one, or introduce a one that is not noisy?

vdbergh · 2024-08-01T10:45:40Z

If LOS is too noisy and LLR is too noisy, maybe we can fix one, or introduce a one that is not noisy?

As I see it, noisy optimization is intrinsically difficult.

dubslow · 2024-08-01T11:06:49Z

I complained about this practice on Discord a couple days ago, and was greeted with mostly silence, and one comment that running tests to -2.94 LLR is too dogmatic (and implicitly corroborated by pere's first comment here).

I'm of the opinion that stopping early too frequently is actually a more effective waste of computation than letting them run to the bounds. I agree that should devs really intend to have a higher false negative rate, then they should indeed use an alternate beta before the test starts (rather than spamming the tests page with inconclusive noise as present). Choosing the stop rule during the test is extra bias over choosing the stop rule before the test and sticking to it

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Early stopping #2125

Early stopping #2125

vdbergh commented Aug 1, 2024

peregrineshahin commented Aug 1, 2024

vdbergh commented Aug 1, 2024

vdbergh commented Aug 1, 2024 •

edited

Loading

peregrineshahin commented Aug 1, 2024

peregrineshahin commented Aug 1, 2024

vdbergh commented Aug 1, 2024 •

edited

Loading

vdbergh commented Aug 1, 2024

peregrineshahin commented Aug 1, 2024

vdbergh commented Aug 1, 2024 •

edited

Loading

dubslow commented Aug 1, 2024

Early stopping #2125

Early stopping #2125

Comments

vdbergh commented Aug 1, 2024

peregrineshahin commented Aug 1, 2024

vdbergh commented Aug 1, 2024

vdbergh commented Aug 1, 2024 • edited Loading

peregrineshahin commented Aug 1, 2024

peregrineshahin commented Aug 1, 2024

vdbergh commented Aug 1, 2024 • edited Loading

vdbergh commented Aug 1, 2024

peregrineshahin commented Aug 1, 2024

vdbergh commented Aug 1, 2024 • edited Loading

dubslow commented Aug 1, 2024

vdbergh commented Aug 1, 2024 •

edited

Loading

vdbergh commented Aug 1, 2024 •

edited

Loading

vdbergh commented Aug 1, 2024 •

edited

Loading