BUG: asv seems to report some measurements 50% lower than the actual time #33
Comments
Could you find a smaller reproducer? Also, I didn't want to hijack the issue, but the refactor into the runner was just that, a refactor; sampling methods and other details were not changed at all.
Also, sorry if this was mentioned in the issue, but is this a bug in the sense that it was previously reporting different times? It wasn't immediately clear to me from the graph / issue / report.
We managed to get quite a stable environment (results seem consistent even for a single run). I'm running the benchmarks now with the following attributes:

If, instead of forcing a single execution, I leave the default attributes (I'm not sure exactly what they are, but it seems this benchmark ends up being executed 10 times), then we get these results:

The server configuration at the beginning of the recordings was actually not as stable, but you can see the last dip with the hover; that one happened with the exact same configuration as the first chart.

What I would expect to see in the second chart is a line even flatter than in the first chart, since we'd be taking 10 samples and averaging them, so the little spikes wouldn't be there. Instead, we get those measurements: some of them seem correct, around 7ms, but the smaller ones at 5ms and 2.75ms surely aren't; our function can't run that fast, as you can see in the first chart. I'm not sure what the exact bug is, but it seems like something in the way results are aggregated after many samples are obtained.

We have a different server, running an older asv version as far as I know (@rhshadrach can you confirm please), and we don't seem to have these incorrect measurements:

There is a bit of variance since the core isn't isolated, and CPU TurboBoost and frequency scaling are enabled, but all the measurements seem reasonable. The hardware is faster and the measurements are around 5.25ms instead of 7ms, but we don't have any point at half of that as on the new server.

It could be that I'm wrong, the bug already existed in the older version, and it's happening not when aggregating the values but in the measurement itself. In that case, since this other server with the older version takes the mean of around 100 samples, we wouldn't see the very low measurements either. But I would be surprised if that's the case.

Regarding the changes in the logic in the refactoring: I don't know what the logic was in the old code, and as you surely know better than me, the old asv code was very difficult to read. But from our results, I think the older asv version used to run many benchmarks 100 times, and the refactored code doesn't seem to run them more than 10 times in any case I've seen. Maybe that's the only change, but I assumed some things were changed. Or maybe that was changed before the refactoring, I don't really know. But that's what I had in mind when I said the logic changed.
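For reference, this is roughly what I mean by forcing a single execution via benchmark attributes; the class name and exact values below are only an illustrative sketch, not our actual server configuration:

```python
# Sketch of an asv timing benchmark pinned to a single timed execution.
# The class name and attribute values are illustrative assumptions, not
# the real configuration used on the pandas benchmark server.
import pandas as pd
from pandas.core.construction import extract_array


class TimeExtractArray:
    repeat = 1       # one sample per round
    number = 1       # one call of the timed function per sample
    warmup_time = 0  # skip asv's warm-up phase

    def setup(self):
        self.series = pd.Series([1, 2, 3])

    def time_extract_array(self):
        extract_array(self.series, extract_numpy=True)
```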
I did a test with asv 0.5.1 and the parameters set to:

In this chart, the first commit is with the latest asv and some repetitions (I'm not sure how many, I think 2), and the last commit is with asv 0.5.1 and the parameters above:

So, it doesn't seem related to the refactor. I assumed that was the problem, expecting that we would have detected it earlier if it had been around for a long time, but it seems it's been there unnoticed for a while.
Correct - it is ASV 0.5.1
Glad to know it isn't a regression. I'm actually happy to see a resurgence of interest in ASV after the refactor. Will try for a simpler reproducer and fix :)
I managed to get a bit more clarity on the problem. For a given pandas benchmark, these are the samples asv collects:

```
{'samples': [2.8450973331928253e-05, 3.4871045500040054e-06, 2.968008629977703e-06], 'number': 1}
```

The first sample is consistent with the times the benchmark takes to run, but the others are one order of magnitude faster. I checked the function, to see if something is getting cached in it, but I don't get an order of magnitude speedup in the second or third iteration:

```
In [1]: import pandas as pd

In [2]: from pandas.core.construction import extract_array

In [3]: series = pd.Series([1, 2, 3])

In [4]: %time extract_array(series, extract_numpy=True)
CPU times: user 31 µs, sys: 7 µs, total: 38 µs
Wall time: 42.7 µs
Out[4]: array([1, 2, 3])

In [5]: %time extract_array(series, extract_numpy=True)
CPU times: user 22 µs, sys: 5 µs, total: 27 µs
Wall time: 31.5 µs
Out[5]: array([1, 2, 3])

In [6]: %time extract_array(series, extract_numpy=True)
CPU times: user 17 µs, sys: 0 ns, total: 17 µs
Wall time: 19.3 µs
Out[6]: array([1, 2, 3])

In [7]: %time extract_array(series, extract_numpy=True)
CPU times: user 17 µs, sys: 0 ns, total: 17 µs
Wall time: 20 µs
Out[7]: array([1, 2, 3])

In [8]: %time extract_array(series, extract_numpy=True)
CPU times: user 20 µs, sys: 5 µs, total: 25 µs
Wall time: 29.6 µs
Out[8]: array([1, 2, 3])

In [9]: %time extract_array(series, extract_numpy=True)
CPU times: user 26 µs, sys: 0 ns, total: 26 µs
Wall time: 31 µs
Out[9]: array([1, 2, 3])
```

I'm not sure exactly where the problem is, but I don't see why a second execution of the benchmark should be so much faster. In asv it seems to be consistent. Different calls to
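For reference, a similar set of single-call samples can be collected outside of asv with plain `timeit.repeat`; this is only a sketch of that sanity check, not asv's actual timing code:

```python
# Sketch: collect several single-call samples of extract_array, similar in
# spirit to asv's {'samples': [...], 'number': 1} output shown above.
# This uses the standard library timeit, not asv's own timing machinery.
import timeit

import pandas as pd
from pandas.core.construction import extract_array

series = pd.Series([1, 2, 3])

# repeat=3 samples, each timing a single call (number=1)
samples = timeit.repeat(
    lambda: extract_array(series, extract_numpy=True),
    number=1,
    repeat=3,
)
print(samples)  # expected: three values in the same tens-of-microseconds range
```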
I need to do some more research, but it seems that in some cases the time asv reports for a particular benchmark is exactly 50% of what it should be. You can see the details of the issue in this comment in the pandas repo: pandas-dev/pandas#55007 (comment)
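As a purely hypothetical illustration of how an exact 50% could show up (a guess, not a confirmed diagnosis of asv): if a sample's elapsed time were divided by a `number` twice as large as the count of calls actually executed, the reported per-call time would come out at exactly half the real one.

```python
# Hypothetical arithmetic only: per-call time is normally
# elapsed_time / number, so a divisor twice the number of calls actually
# executed halves the reported result. Values are illustrative.
real_per_call = 7e-3       # ~7 ms, what the stable chart shows
calls_executed = 1         # the timed loop runs the function once
elapsed_time = real_per_call * calls_executed
mismatched_number = 2      # hypothetical divisor applied to the sample
reported = elapsed_time / mismatched_number
print(reported)            # 0.0035 -> exactly 50% of the real time
```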