
BUG: asv seems to report some measurements 50% lower than the actual time #33

Open
datapythonista opened this issue Oct 25, 2023 · 7 comments

Comments

datapythonista (Member)

I need to do some more research, but it seems like in some cases the time asv reports for a particular benchmark is exactly 50% of what it should be. You can see the details of the issue in this comment in the pandas repo: pandas-dev/pandas#55007 (comment)

HaoZeke (Member) commented Oct 27, 2023

> I need to do some more research, but it seems like in some cases the time asv reports for a particular benchmark is exactly 50% of what it should be. You can see the details of the issue in this comment in the pandas repo: pandas-dev/pandas#55007 (comment)

Could you find a smaller reproducer? Also, I didn't want to hijack the issue, but the refactor into the runner was just that, a refactor; the sampling methods and other details were not changed at all.

HaoZeke (Member) commented Oct 27, 2023

Also, sorry if this was mentioned in the issue, but is this a bug in the sense that asv was previously reporting different times? It wasn't immediately clear to me from the graph / issue / report.

datapythonista (Member, Author)

We managed to get a quite stable environment (results seem consistent even for a single run). I'm running the benchmarks now with the following attributes: warmup_time=0, repeat=1, number=1, rounds=1, and you can see our results for this particular benchmark:

Screenshot_20231027_133233
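For reference, a minimal sketch of how those attributes would be set on an asv timing benchmark; the class name and setup here are illustrative, not the actual pandas benchmark code:

```python
# Hypothetical asv benchmark with the single-sample settings described above
# (warmup_time=0, repeat=1, number=1, rounds=1). Names are illustrative only.
import pandas as pd
from pandas.core.construction import extract_array


class TimeExtractArray:
    warmup_time = 0  # skip the warm-up phase
    repeat = 1       # one repeat per round
    number = 1       # one call per sample
    rounds = 1       # one round per run

    def setup(self):
        self.series = pd.Series([1, 2, 3])

    def time_extract_array_numpy(self):
        extract_array(self.series, extract_numpy=True)
```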

If instead of forcing a single execution I leave the default attributes (not sure exactly what they are, but it seems this benchmark ends up being executed 10 times), then we get these results:

Screenshot_20231025_132902

The server configuration at the beginning of the recordings was actually not as stable, but you can see the last dip highlighted by the hover tooltip; that one happened with the exact same configuration as the first chart.

What I would expect to see in the second chart is a line even flatter than in the first chart, since we'd be taking 10 samples and averaging them, so the little spikes wouldn't be there. Instead, we get those measurements: some of them seem correct, around 7ms, but the smaller ones at 5ms and 2.75ms surely aren't; our function can't run that fast, as you can see in the first chart. I'm not sure what the exact bug is, but it seems like something in the way results are aggregated after many samples are obtained.
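As a rough illustration of that aggregation hypothesis (hypothetical sample values, not asv's actual summarisation code): if a few individual samples are spuriously fast, a mean-style summary drops well below the real ~7ms run time, while a median stays close to it:

```python
# Hypothetical samples (in ms) mixing plausible ~7ms runs with the
# implausibly fast 5ms / 2.75ms measurements seen in the chart.
import statistics

samples_ms = [7.1, 7.0, 6.9, 7.2, 2.75, 2.8, 7.0, 5.0, 2.9, 7.1]

print(statistics.mean(samples_ms))    # ~5.6 ms, pulled down by the fast samples
print(statistics.median(samples_ms))  # 6.95 ms, close to the expected value
```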

We have a different server, running an older asv version as far as I know (@rhshadrach, can you confirm please?), and there we don't seem to have these incorrect measurements:

Screenshot_20231027_134611

There is a bit of variance since the core isn't isolated and CPU TurboBoost and frequency scaling are enabled, but all the measurements seem reasonable. The hardware is faster and the measurements are around 5.25ms instead of 7ms, but we don't have any point at half of that, as on the new server. It could be that I'm wrong and the bug already existed in the older version, and that it happens not when aggregating the values but in the measurement itself. In that case, since this other server with the older version is taking the mean of around 100 samples, we wouldn't see the very low measurements either. But I would be surprised if that's the case.

Regarding the changes in the logic in the refactoring, I don't know what the logic was in the old code; as you surely know better than me, the old asv code was very difficult to read. But from our results, I think the older asv version used to run many benchmarks 100 times, and the refactored code doesn't seem to run any benchmark more than 10 times in the cases I've seen. Maybe that's the only change, but I assumed some things were changed. Or maybe that was changed before the refactoring, I don't really know. But that's what I had in mind when I said the logic changed.

datapythonista (Member, Author)

I did a test with asv 0.5.1 and the parameters set to warmup_time=0, repeat=3, number=1, rounds=1, and I also see the problem with unreasonable measurements.

In this chart, the first commit is with the latest asv and some repetitions (not sure how many, I think 2), and the last commit is with asv 0.5.1 and the parameters above:

Screenshot_20231030_235028

So, it doesn't seem related to the refactor. I assumed that was the cause, expecting that we would have detected the problem earlier if it had been there for a long time, but it seems like it's been there unnoticed for a while.

rhshadrach

> We have a different server, running an older asv version as far as I know (@rhshadrach, can you confirm please?), and there we don't seem to have these incorrect measurements:

Correct - it is ASV 0.5.1

HaoZeke (Member) commented Oct 31, 2023

> I did a test with asv 0.5.1 and the parameters set to warmup_time=0, repeat=3, number=1, rounds=1, and I also see the problem with unreasonable measurements.
>
> In this chart, the first commit is with the latest asv and some repetitions (not sure how many, I think 2), and the last commit is with asv 0.5.1 and the parameters above:
>
> Screenshot_20231030_235028
>
> So, it doesn't seem related to the refactor. I assumed that was the cause, expecting that we would have detected the problem earlier if it had been there for a long time, but it seems like it's been there unnoticed for a while.

Glad to know it isn't a regression. I'm actually happy to see a resurgence of interest in asv after the refactor. Will try for a simpler reproducer and fix :)

datapythonista (Member, Author) commented Nov 1, 2023

I managed to get a bit more clarity on the problem. For a given pandas benchmark, attrs_caching.SeriesArrayAttribute.time_extract_array_numpy, these are the samples with repeat=3:

{'samples': [2.8450973331928253e-05, 3.4871045500040054e-06, 2.968008629977703e-06], 'number': 1}

The first sample is consistent with the times the benchmark usually runs in, but the other two are roughly an order of magnitude faster.
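For comparison, the ratio between the first sample and the later ones, computed from the numbers above:

```python
# Samples reported by asv for this benchmark (seconds)
samples = [2.8450973331928253e-05, 3.4871045500040054e-06, 2.968008629977703e-06]

print(samples[0] / samples[1])  # ~8.2x faster than the first sample
print(samples[0] / samples[2])  # ~9.6x faster than the first sample
```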

I checked the function to see if something is getting cached in it, but I don't see an order-of-magnitude speedup in the second or third call:

In [1]: import pandas as pd

In [2]: from pandas.core.construction import extract_array

In [3]: series = pd.Series([1, 2, 3])

In [4]: %time extract_array(series, extract_numpy=True)
CPU times: user 31 µs, sys: 7 µs, total: 38 µs
Wall time: 42.7 µs
Out[4]: array([1, 2, 3])

In [5]: %time extract_array(series, extract_numpy=True)
CPU times: user 22 µs, sys: 5 µs, total: 27 µs
Wall time: 31.5 µs
Out[5]: array([1, 2, 3])

In [6]: %time extract_array(series, extract_numpy=True)
CPU times: user 17 µs, sys: 0 ns, total: 17 µs
Wall time: 19.3 µs
Out[6]: array([1, 2, 3])

In [7]: %time extract_array(series, extract_numpy=True)
CPU times: user 17 µs, sys: 0 ns, total: 17 µs
Wall time: 20 µs
Out[7]: array([1, 2, 3])

In [8]: %time extract_array(series, extract_numpy=True)
CPU times: user 20 µs, sys: 5 µs, total: 25 µs
Wall time: 29.6 µs
Out[8]: array([1, 2, 3])

In [9]: %time extract_array(series, extract_numpy=True)
CPU times: user 26 µs, sys: 0 ns, total: 26 µs
Wall time: 31 µs
Out[9]: array([1, 2, 3])

I'm not sure exactly where the problem is, but I don't see why a second execution of the benchmark should be so much faster. In asv the pattern is consistent: different calls to asv run give similar results (a slower first execution, followed by faster ones).
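A minimal standalone attempt at reproducing the pattern outside of asv (a rough approximation of repeat=3, number=1 sampling, not asv's actual timing loop) would be something like:

```python
# Take three independent samples of a single call each, mimicking repeat=3, number=1.
import timeit

import pandas as pd
from pandas.core.construction import extract_array

series = pd.Series([1, 2, 3])
timer = timeit.Timer(lambda: extract_array(series, extract_numpy=True))

samples = [timer.timeit(number=1) for _ in range(3)]
print(samples)  # is the first sample much slower than the later ones?
```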
