SearchPerfTest should track / aggregate total true CPU cycles spent running a query #277
I'm trying to find a simple/clean way to do this accounting and it does not look simple. The … We might also make a … Maybe @jpountz has some simple idea :)
Would it really be so terrible to add some accounting to Lucene's executor implementation? I thought about using the Collector API (which does have a start, when the LeafCollector gets created, and then a finish() later), but this doesn't capture all the work we do.
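For illustration, the Collector-based accounting mentioned above might look roughly like this; a minimal sketch assuming Lucene's FilterCollector/FilterLeafCollector delegators and java.lang.management.ThreadMXBean, with the class name CpuTimingCollector being hypothetical, not anything in luceneutil:

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.FilterCollector;
import org.apache.lucene.search.FilterLeafCollector;
import org.apache.lucene.search.LeafCollector;

// Hypothetical wrapper: accumulates CPU time between getLeafCollector() and finish().
public class CpuTimingCollector extends FilterCollector {
  private static final ThreadMXBean BEAN = ManagementFactory.getThreadMXBean();
  private final AtomicLong cpuNanos = new AtomicLong();

  public CpuTimingCollector(Collector in) {
    super(in);
  }

  @Override
  public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
    final long start = BEAN.getCurrentThreadCpuTime();
    return new FilterLeafCollector(super.getLeafCollector(context)) {
      @Override
      public void finish() throws IOException {
        in.finish();
        // Only valid if finish() runs on the same thread that created this leaf collector.
        cpuNanos.addAndGet(BEAN.getCurrentThreadCpuTime() - start);
      }
    };
  }

  public long getCpuNanos() {
    return cpuNanos.get();
  }
}
```

As the comment says, this only captures CPU spent inside collection for each leaf; work such as query rewriting and Weight/Scorer construction falls outside it.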
I understand why you are trying to measure CPU time, but it would have other issues. E.g. it would not report issues if … Maybe we should rather look into running search tasks twice: once with an executor of size 1, which should be an OK proxy for throughput, and once with an executor of size N>1 (e.g. N=8 like you suggested) to measure Lucene's ability to take advantage of "free" CPUs to improve latency.
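A rough illustration of this run-it-twice idea (the pool sizes, single search call, and class name are illustrative only, not from the issue):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class TwoExecutorComparison {
  // Times one query on a single-threaded searcher (throughput proxy) and on a
  // multi-threaded searcher (ability to use "free" CPUs to improve latency).
  public static void compare(DirectoryReader reader, Query query) throws Exception {
    ExecutorService single = Executors.newFixedThreadPool(1);
    ExecutorService pooled = Executors.newFixedThreadPool(8);
    try {
      IndexSearcher serial = new IndexSearcher(reader, single);
      IndexSearcher concurrent = new IndexSearcher(reader, pooled);

      long t0 = System.nanoTime();
      serial.search(query, 10);
      long serialNanos = System.nanoTime() - t0;

      t0 = System.nanoTime();
      concurrent.search(query, 10);
      long concurrentNanos = System.nanoTime() - t0;

      System.out.printf("serial=%.1fms concurrent=%.1fms speedup=%.2fx%n",
          serialNanos / 1e6, concurrentNanos / 1e6, (double) serialNanos / concurrentNanos);
    } finally {
      single.shutdown();
      pooled.shutdown();
    }
  }
}
```

A real benchmark would of course warm up and repeat each query many times rather than timing a single call.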
Yeah, +1 to also measuring the speedup we see when concurrency is "working" well. But using a single thread as a proxy for throughput wouldn't measure cases where Lucene can save CPU by better cross-segment comparisons? Or, perhaps, those improvements might come through (ish?) by measuring effective QPS using intra-query concurrency? But not always: if the long-pole slice for a segment still takes X CPU, but thanks to an innovation the other slices were able to terminate sooner / skip better, the QPS would not change even though total CPU improved. Maybe the best overall solution for measuring "Lucene CPU improvements" is to build a true red-line test ... why try to build tricky proxy metrics heh :)
I've made some progress here ... I added an "actual QPS" measure, recorded by … It looks like this: …
In this case, … Note that these numbers are not necessarily a red-line QPS measurement, since it is up to the test runner to configure concurrency sufficiently to saturate CPU cores, IO, or some other bottleneck. It is simply the actual QPS that the test achieved ... when testing for query latency (the per-task "effective QPS" we normally report), one should not run near red-line if one wants an accurate approximation of the total CPU cost of each query. Next step: I'll try to add a red-line QPS to nightly benchmarks ...
I ran this same … Note that … The top-level … vs … This is on a somewhat degenerate index that has many slices (…).
Today luceneutil reports the effective QPS as 1.0 / wall-clock-time. But when using intra-query concurrency, that's a lie (it will be too high) since multiple cores are running at once. Let's change that to use the JMX bean to measure per-thread actual CPU cycles, and carefully aggregate across all threads that run for the query, to compute the aggregated CPU time spent on the query, and translate to an effective QPS of 1.0 / aggregated-cpu-time.
This should be more accurate, and remove the false results we see when using intra-query concurrency, assuming the JMX bean is accurate. And it'd mean we can go back to running both inter- and intra-query concurrency at the same time, and our benchmarking runs will finish quicker.
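The "JMX bean" here is presumably java.lang.management.ThreadMXBean; a minimal sketch of reading per-thread CPU time (the CpuClock helper name is just for illustration):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuClock {
  private static final ThreadMXBean BEAN = ManagementFactory.getThreadMXBean();

  static {
    // CPU time measurement may be disabled by default on some JVMs.
    if (BEAN.isThreadCpuTimeSupported() && !BEAN.isThreadCpuTimeEnabled()) {
      BEAN.setThreadCpuTimeEnabled(true);
    }
  }

  /** CPU nanoseconds consumed so far by the calling thread. */
  public static long currentThreadCpuNanos() {
    return BEAN.getCurrentThreadCpuTime();
  }
}
```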
To implement this, I think we must make a wrapped ThreadPoolExecutor to pass to IndexSearcher that tracks which query each work unit (slice + query) corresponds to, and aggregates CPU time across the N threads per query accordingly...