
BUG: asv seems to report some measurements 50% lower than the actual time #33

Open
datapythonista opened this issue Oct 25, 2023 · 7 comments

Comments

datapythonista (Member)

I need to do some more research, but it seems like in some cases the time asv reports for a particular benchmark is exactly 50% of what it should be. You can see the details of the issue in this comment in the pandas repo: pandas-dev/pandas#55007 (comment)

HaoZeke (Member) commented Oct 27, 2023

> I need to do some more research, but it seems like in some cases the time asv reports for a particular benchmark is exactly 50% of what it should be. You can see the details of the issue in this comment in the pandas repo: pandas-dev/pandas#55007 (comment)

Could you find a smaller reproducer? Also, I didn't want to hijack the issue, but the refactor into the runner was just that, a refactor; the sampling methods and other details were not changed at all.

HaoZeke (Member) commented Oct 27, 2023

Also, sorry if this was mentioned in the issue, but is this a bug in the sense that asv was previously reporting different times? It wasn't immediately clear to me from the graph / issue / report.

datapythonista (Member, Author)

We managed to get a quite stable environment (results seem consistent even for a single run). I'm running the benchmarks now with the following attributes: warmup_time=0, repeat=1, number=1, rounds=1, and you can see our results for this particular benchmark:

Screenshot_20231027_133233
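For reference, a minimal sketch of how those attributes would be set on an asv timing benchmark; the class name and setup here are illustrative, not the actual pandas benchmark code:

```python
# Hypothetical asv benchmark with the single-sample settings described above
# (warmup_time=0, repeat=1, number=1, rounds=1). Names are illustrative only.
import pandas as pd
from pandas.core.construction import extract_array


class TimeExtractArray:
    warmup_time = 0  # skip the warm-up phase
    repeat = 1       # one repeat per round
    number = 1       # one call per sample
    rounds = 1       # one round per run

    def setup(self):
        self.series = pd.Series([1, 2, 3])

    def time_extract_array_numpy(self):
        extract_array(self.series, extract_numpy=True)
```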

If instead of forcing a single execution I leave the default attributes (not sure exactly what they are, but it seems this benchmark ends up being executed 10 times), then we get these results:

Screenshot_20231025_132902

The server configuration at the beginning of the recordings was actually not as stable, but you can see the last dip highlighted by the hover tooltip; that one happened with the exact same configuration as the first chart.

What I would expect to see in the second chart is a line even flatter than in the first chart, since we'd be taking 10 samples and averaging them, so the little spikes wouldn't be there. Instead, we get those measurements: some of them seem correct, around 7ms, but the smaller ones at 5ms and 2.75ms surely aren't; our function can't run that fast, as you can see in the first chart. I'm not sure what the exact bug is, but it seems like something in the way results are aggregated after many samples are obtained.
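As a rough illustration of that aggregation hypothesis (hypothetical sample values, not asv's actual summarisation code): if a few individual samples are spuriously fast, a mean-style summary drops well below the real ~7ms run time, while a median stays close to it:

```python
# Hypothetical samples (in ms) mixing plausible ~7ms runs with the
# implausibly fast 5ms / 2.75ms measurements seen in the chart.
import statistics

samples_ms = [7.1, 7.0, 6.9, 7.2, 2.75, 2.8, 7.0, 5.0, 2.9, 7.1]

print(statistics.mean(samples_ms))    # ~5.6 ms, pulled down by the fast samples
print(statistics.median(samples_ms))  # 6.95 ms, close to the expected value
```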

We have a different server, running an older asv version as far as I know (@rhshadrach, can you confirm please?), and there we don't seem to have these incorrect measurements:

Screenshot_20231027_134611

There is a bit of variance since the core isn't isolated and CPU TurboBoost and frequency scaling are enabled, but all the measurements seem reasonable. The hardware is faster and the measurements are around 5.25ms instead of 7ms, but we don't have any point at half of that, as on the new server. It could be that I'm wrong and the bug already existed in the older version, and that it happens not when aggregating the values but in the measurement itself. In that case, since this other server with the older version is taking the mean of around 100 samples, we wouldn't see the very low measurements either. But I would be surprised if that's the case.

Regarding the changes in the logic in the refactoring, I don't know what the logic was in the old code; as you surely know better than me, the old asv code was very difficult to read. But from our results, I think the older asv version used to run many benchmarks 100 times, and the refactored code doesn't seem to run any benchmark more than 10 times in the cases I've seen. Maybe that's the only change, but I assumed some things were changed. Or maybe that was changed before the refactoring, I don't really know. But that's what I had in mind when I said the logic changed.

datapythonista (Member, Author)

I did a test with asv 0.5.1 and the parameters set to warmup_time=0, repeat=3, number=1, rounds=1, and I also see the problem with unreasonable measurements.

In this chart, the first commit is with the latest asv and some repetitions (not sure how many, I think 2), and the last commit is with asv 0.5.1 and the parameters above:

Screenshot_20231030_235028

So, it doesn't seem related to the refactor. I assumed that was the cause, expecting that we would have detected the problem earlier if it had been there for a long time, but it seems like it's been there unnoticed for a while.

rhshadrach

> We have a different server, running an older asv version as far as I know (@rhshadrach, can you confirm please?), and there we don't seem to have these incorrect measurements:

Correct - it is ASV 0.5.1

HaoZeke (Member) commented Oct 31, 2023

> I did a test with asv 0.5.1 and the parameters set to warmup_time=0, repeat=3, number=1, rounds=1, and I also see the problem with unreasonable measurements.
>
> In this chart, the first commit is with the latest asv and some repetitions (not sure how many, I think 2), and the last commit is with asv 0.5.1 and the parameters above:
>
> Screenshot_20231030_235028
>
> So, it doesn't seem related to the refactor. I assumed that was the cause, expecting that we would have detected the problem earlier if it had been there for a long time, but it seems like it's been there unnoticed for a while.

Glad to know it isn't a regression. I'm actually happy to see a resurgence of interest in asv after the refactor. Will try for a simpler reproducer and fix :)

datapythonista (Member, Author) commented Nov 1, 2023

I managed to get a bit more clarity on the problem. For a given pandas benchmark, attrs_caching.SeriesArrayAttribute.time_extract_array_numpy, these are the samples with repeat=3:

{'samples': [2.8450973331928253e-05, 3.4871045500040054e-06, 2.968008629977703e-06], 'number': 1}

The first sample is consistent with the times the benchmark usually runs in, but the other two are roughly an order of magnitude faster.
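For comparison, the ratio between the first sample and the later ones, computed from the numbers above:

```python
# Samples reported by asv for this benchmark (seconds)
samples = [2.8450973331928253e-05, 3.4871045500040054e-06, 2.968008629977703e-06]

print(samples[0] / samples[1])  # ~8.2x faster than the first sample
print(samples[0] / samples[2])  # ~9.6x faster than the first sample
```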

I checked the function to see if something is getting cached in it, but I don't see an order-of-magnitude speedup in the second or third call:

In [1]: import pandas as pd

In [2]: from pandas.core.construction import extract_array

In [3]: series = pd.Series([1, 2, 3])

In [4]: %time extract_array(series, extract_numpy=True)
CPU times: user 31 µs, sys: 7 µs, total: 38 µs
Wall time: 42.7 µs
Out[4]: array([1, 2, 3])

In [5]: %time extract_array(series, extract_numpy=True)
CPU times: user 22 µs, sys: 5 µs, total: 27 µs
Wall time: 31.5 µs
Out[5]: array([1, 2, 3])

In [6]: %time extract_array(series, extract_numpy=True)
CPU times: user 17 µs, sys: 0 ns, total: 17 µs
Wall time: 19.3 µs
Out[6]: array([1, 2, 3])

In [7]: %time extract_array(series, extract_numpy=True)
CPU times: user 17 µs, sys: 0 ns, total: 17 µs
Wall time: 20 µs
Out[7]: array([1, 2, 3])

In [8]: %time extract_array(series, extract_numpy=True)
CPU times: user 20 µs, sys: 5 µs, total: 25 µs
Wall time: 29.6 µs
Out[8]: array([1, 2, 3])

In [9]: %time extract_array(series, extract_numpy=True)
CPU times: user 26 µs, sys: 0 ns, total: 26 µs
Wall time: 31 µs
Out[9]: array([1, 2, 3])

I'm not sure exactly where the problem is, but I don't see why a second execution of the benchmark should be so much faster. In asv the pattern is consistent: different calls to asv run give similar results (a slower first execution, followed by faster ones).
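A minimal standalone attempt at reproducing the pattern outside of asv (a rough approximation of repeat=3, number=1 sampling, not asv's actual timing loop) would be something like:

```python
# Take three independent samples of a single call each, mimicking repeat=3, number=1.
import timeit

import pandas as pd
from pandas.core.construction import extract_array

series = pd.Series([1, 2, 3])
timer = timeit.Timer(lambda: extract_array(series, extract_numpy=True))

samples = [timer.timeit(number=1) for _ in range(3)]
print(samples)  # is the first sample much slower than the later ones?
```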
