Your test duration measurement is inaccurate #143
That's why the API:
There's no one-size-fits-all benchmarking design in the JS ecosystem, only APIs that let you refine the benchmark accordingly while offering sane defaults. What is actually missing in the tinybench API to adapt it to browsers? |
@jerome-benoit I didn't realize you also used confidence - that means the jitter isn't of much concern. But for the rest, in short,
let start = now()
fn()
let end = now()
let duration = end - start
The concrete fix is this:
Changing the |
On a related note, maybe they should've called it |
Benchmark warmup at
Only the latency of the benchmarking function execution with JIT deoptimization is measured in tinybench.
A correct benchmark methodology means not modifying the experiment being timed. Timing a runner over 500m is not done by repeatedly measuring the time and distance of one step and then using the average of those measurements as the basis for timing the 500m course. It's utterly wrong in so many ways ... I've seen benchmarking tools such as mitata use a similar, totally flawed methodology. Tinybench will never go down that path, as we care about using an unbiased measurement methodology. That's why I forked mitata into tatami-ng: the maintainer was not inclined to accept external contributions about it. I am now pushing the relevant bits of that fork to tinybench, and they will show up in version 3.x.x.
Tinybench is meant to be a lean library using state-of-the-art benchmarking methods and advanced statistics. Analyzing those statistics, such as determining whether the margin of error or the median absolute deviation is acceptable and, more globally, whether the result is statistically significant, will not be part of tinybench. It's up to the user to analyze them and eventually automate the detection of anomalies in the measurement.
The analysis of the result is meant to tell whether a measurement is correct or not: for example, the presence of many zero measurements will make the margin of error for latency go high => the results cannot be trusted. And using a totally flawed benchmarking methodology (and opening a wide door to the premature optimization disease) as a workaround for too coarse a timestamp resolution in the JS runtime is not an acceptable solution. The root cause must be fixed: not offering an optional mode with a high resolution timer in a JS runtime is considered a bug nowadays. And browsers can be started with a high resolution timer for benchmarking purposes. So I repeat: what is actually missing in tinybench to run accurate benchmarks using state-of-the-art methodology in browsers? |
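The point about zero measurements inflating the margin of error can be illustrated with a minimal sketch. The function name `summarize` and the sample values are hypothetical, and the sketch assumes a normal approximation (z = 1.96) for a 95% interval. It is not tinybench's implementation.

```javascript
// Summary statistics for a latency sample (values in milliseconds).
// Assumption: z = 1.96 (normal approximation for a 95% interval). A
// Student's t critical value would be more appropriate for small samples.
function summarize(samples) {
  const n = samples.length;
  if (n < 2) throw new RangeError("need at least two samples");
  const mean = samples.reduce((sum, x) => sum + x, 0) / n;
  // Unbiased sample variance (divides by n - 1).
  const variance =
    samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (n - 1);
  const sem = Math.sqrt(variance / n); // standard error of the mean
  const moe = 1.96 * sem; // absolute margin of error
  const rme = mean === 0 ? Infinity : (moe / mean) * 100; // relative, in %
  return { mean, moe, rme };
}

// Hypothetical samples: one where the timer resolves most runs,
// one where most runs are reported as zero.
const steady = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2];
const mostlyZero = [0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.1];

console.log(summarize(steady).rme, summarize(mostlyZero).rme);
```

With these made-up samples, the mostly-zero sample yields a much larger relative margin of error than the steady one, which is the signal a user would look for before trusting a result.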
@jerome-benoit Just letting you know I plan to respond with precise numbers. I've just got a bunch of math (mostly stats and calculus) to work out. May turn this into a whitepaper later for others to reference. |
Precise numbers on how a flawed benchmarking methodology can give unbiased results? ;-)
If you plan to write down that violating these two points will make a benchmark more accurate, the proof to the contrary has been settled for years: see the Kolmogorov-Nagumo publications on the generalization of the averaging concept. The f-mean has the correct properties if and only if f is injective -> a mean (weighted or not) is not injective -> using a mean as f-mean does not behave correctly. |
The hard numbers I'm working on are around why your benchmarking methodology is flawed. Mine is not any more perfect, but I'd have to do (a lot) more research to figure out how to correctly account for measurement error in general.
These aren't simple averages. They're population-weighted averages. https://en.wikipedia.org/wiki/Weighted_arithmetic_mean lists precisely the method I'm using as a valid use, just in a different context (average test grade). And to be clear, when it comes to performance, there are three main statistics that matter most: the arithmetic mean, the upper end of the confidence interval, and (for real-time cases) the max value. In the context I'm currently working with, I'm dealing with a soft real-time process where a certain function call must complete in under 5 microseconds on average and preferably under 2 microseconds. Yes, microseconds, not milliseconds. And yes, this obviously sits well below even the minimum
Obviously, that is a concern. In my current (not-yet-public) benchmarking code, I prepend a null test for this exact reason and make sure to print both the raw stats including it and the stats adjusted using numbers from the null test. The adjusted stats I display are admittedly probably biased and obviously not perfect, but fixing that would require a bunch of math that I currently don't plan to do in the near term. (I have more pressing priorities.)
The ultimate mean I'm returning isn't of the samples themselves, but of the (potentially unmeasurable) durations of each iteration plus the small but statistically significant measurement overhead. Thing is, you don't need to know all values to have a mean. Simply knowing how many runs and how long all runs collectively took is sufficient. As wait times add just through the passage of time, you only need to measure the whole span to know how long all runs collectively took. And non-weighted means (what these intermediate means are) with known population sizes can be correctly merged using population-weighted means. (You obviously can't do this with unknown populations. But the population size here is known.) There are caveats, like the lack of a true min/max and a lack of a true median. But that goes without saying.
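The arithmetic claim in the paragraph above can be sketched as follows. The batch shape `{ iterations, totalTime }` and the function names are hypothetical, chosen only for illustration. The sketch shows the arithmetic identity only; whether such a mean is an acceptable benchmark statistic is the point under dispute in this thread.

```javascript
// Each batch records how many iterations ran and how long the whole
// batch took (hypothetical shape: { iterations, totalTime }).
function batchMean(batch) {
  return batch.totalTime / batch.iterations;
}

// Merge per-batch means into one overall mean, weighting each batch
// mean by its known population size (its iteration count).
function mergeBatchMeans(batches) {
  let totalIterations = 0;
  let weightedSum = 0;
  for (const batch of batches) {
    weightedSum += batchMean(batch) * batch.iterations;
    totalIterations += batch.iterations;
  }
  return weightedSum / totalIterations;
}

// Made-up numbers: 100 iterations in 200 time units, 300 in 900.
const batches = [
  { iterations: 100, totalTime: 200 },
  { iterations: 300, totalTime: 900 },
];
const merged = mergeBatchMeans(batches);
// Same quantity computed directly: total time over total iterations.
const direct = (200 + 900) / (100 + 300);

console.log(merged, direct);
```

As the comment itself notes, this recovers only the mean. The per-iteration minimum, maximum, and median are not recoverable from batch totals.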
This is the part I'm working on, so I won't address it here right now. |
You still do not seem to really understand why it's plain wrong as a measurement method, and why any weighted average used as the primary source of a measurement sample is biased by definition and can't be used as a trusted source. Basically, you're building a sample made only of values imputed using the mean. It has been proven for years to be non-anecdotally biased: Rubin's law introduction, methods for comparing sample distributions, ...
No, you can't. Or you have to prove that Kolmogorov-Nagumo were wrong on the generalization of the concept of averaging. Since the beginning you have been making incorrect statements on:
And you continue with plain wrong statements about well-known, proven mathematical results in the field of statistics. I see no point in continuing such a discussion. |
Okay, closing. I'll agree to disagree, in particular on these two key points.
If I write up a whitepaper about this and get it peer reviewed, I may consider sharing it here. But aside from that, I'm ending the discussion. I doubt that even providing hard stats on how naively using this code provides skewed statistics or providing mathematical proofs that either of the two above bullets are false will change your opinions on the matter, so continuing it will be fruitless.
let start = performance.now()
// do stuff
let end = performance.now()
let duration = end - start
My calculations came out to about a 70% chance of undercounting a 20us interval and 30% chance of counting it as zero given Chrome's 100us granularity + 100us random jitter, by the way. I just hadn't yet either 1. calculated the expected mean value (specifically I still plan to complete these calculations anyway, as you weren't the only considered audience for them. |
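To make the granularity effect concrete, here is a minimal sketch under a deliberately simplified model: timestamps are floored to a 100us grid with no random jitter. This is an assumption for illustration, not Chrome's actual behavior, so the counts it produces are not expected to match the percentages quoted above. All names are hypothetical.

```javascript
const GRANULARITY_US = 100; // assumed timer granularity, in microseconds

// Hypothetical coarse clock: floors a true timestamp to the grid.
// Assumption: pure flooring, no random jitter.
function coarsen(trueTimeUs) {
  return Math.floor(trueTimeUs / GRANULARITY_US) * GRANULARITY_US;
}

// What the start/end pattern would report for an interval of
// durationUs that truly begins at startUs.
function measuredDuration(startUs, durationUs) {
  return coarsen(startUs + durationUs) - coarsen(startUs);
}

// Sweep the start offset across one grid cell in 1us steps and count
// how often each reported duration occurs.
function sweep(durationUs) {
  const counts = new Map();
  for (let offset = 0; offset < GRANULARITY_US; offset++) {
    const measured = measuredDuration(offset, durationUs);
    counts.set(measured, (counts.get(measured) ?? 0) + 1);
  }
  return counts;
}

const counts = sweep(20);
console.log(counts);
```

Under this simplified model, a single reading of a 20us interval is either 0 or 100us, never 20us. Only the spread over many readings carries information about the true duration.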
And the reason is simple: no mathematical proof of their fallacy exists, and no counterexamples exist. And if you had one to share, it would probably end up being a candidate for the Fields Medal ... |
Performance measurement is unfortunately not as simple as performance.now() in browsers. Further, operating systems can and do sometimes have resolution limits of their own.
Here are some considerations that need to be addressed, in general:
performance.now() tick during sleep. In other platforms, they don't, in violation of the spec: Suggestions: ticking during sleep, comparison across contexts, time origin + now semantics, and skew definition w3c/hr-time#115 (comment)
The benchmarks currently naively use start and end performance.now()/process.hrtime.bigint() calls. The precision issues can give you bad data, but it's not insurmountable:
It doesn't appear your benchmark execution code currently takes any of this into account, hence this issue.
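One way to check the resolution limits mentioned above is to probe the clock before trusting individual readings. The sketch below is an illustration only: `estimateGranularity` and `makeCoarseClock` are hypothetical names, the clock is injected so the probe can be exercised against a fake, and a jittered clock would give only an approximate answer.

```javascript
// Estimate the smallest observable clock step: spin until the reading
// changes, repeat several times, and keep the smallest step seen.
function estimateGranularity(now, samples = 50) {
  let smallest = Infinity;
  for (let i = 0; i < samples; i++) {
    const start = now();
    let next = now();
    while (next === start) next = now(); // busy-wait for the next tick
    smallest = Math.min(smallest, next - start);
  }
  return smallest;
}

// Hypothetical fake clock for demonstration: true time advances by
// tickPerCall on every call, but readings are floored to multiples
// of step. Integer units keep the arithmetic exact.
function makeCoarseClock(step, tickPerCall) {
  let calls = 0;
  return () => {
    calls += 1;
    return Math.floor((calls * tickPerCall) / step) * step;
  };
}

// Fake clock with 100-unit granularity, advancing 10 units per call.
const estimated = estimateGranularity(makeCoarseClock(100, 10));
console.log(estimated);
```

Against a real clock the same probe could be called as `estimateGranularity(() => performance.now())`. The result depends on the runtime and on any clamping or jitter it applies.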