Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

APC engine very slow when APCu contains many keys - O(N^2) performance #36

Closed
thedeacon opened this issue Feb 4, 2021 · 5 comments
Closed
Labels
bug Something isn't working enhancement New feature or request

Comments

@thedeacon
Copy link
Contributor

I am running the prometheus_client_php code on servers with several million entries in APCu. The use of nested calls to APCUIterator in collectCounters(), collectGauges(), and collectHistograms() is O(N^2), with N being the number of keys stored in APCu. When N is large -- several millions of keys -- this slows the code to a crawl and makes the APC storage engine unusable.

In my staging environment with approximately 2.5 million APCu keys, I tested a call to getMetricFamilySamples() taking approximately 9 seconds to return, due to the above issue, for example. During that time, one CPU core is pegged by the code running this nested pair of loops across all the APCu keys!

I have forked the code and re-implemented portions of the APC storage engine, to use a tree-based storage pattern, essentially a directed acyclic graph, where starting at the root of the tree leads to discovery of all APCu keys that the storage engine is using. This presents a massive speed-up in our environment, since performance is now O(N), where N is the number of (metrics x labels x buckets) being tracked, and is unrelated to the number of keys stored in APCu. Usually N in this case is a couple dozen keys to maybe a few hundred max, so relatively small.

Performance scales MUCH better with this change. In my same staging environment with 2.5M APCu keys, a test call to getMetricFamilySamples() decreased from 9 seconds to 33ms(!). This is fast enough that I can have Prometheus poll metrics every second, if I want.

Before dropping a PR here, I figured I'd open this issue and see if there is any interest in discussing the solution, or if you prefer to see the code right away. The current code passes all phpstan and phpunit tests. Feel free to reply, or if you just want to see the code, let me know and I'll drop a PR.

@LKaemmerling
Copy link
Member

Hey @thedeacon,

this sounds interesting! The only problem i see is that we would need to be backward compatible, so maybe using a new Storage Class (APCv2) would be a better idea. Feel free to open a MR!

Thank you!

  • Lukas

@LKaemmerling LKaemmerling added bug Something isn't working enhancement New feature or request labels Feb 4, 2021
@thedeacon
Copy link
Contributor Author

@LKaemmerling,

Cool, I'll do that. Just FYI, I implemented the code to be drop-in compatible with the existing APC storage class from an API perspective. All the public methods and parameters are unchanged. The modular design of the storage layer vs the rest of the code was really helpful there!

I'm pretty sure that's what you mean when you refer to being backward-compatible -- I can swap out the original APC.php and my modified one and all the code that calls into it works just fine. The only difference is the data stored in APC changes, of course. So the API is backward-compatible but the data storage format isn't.

It looks like I also addressed an issue brought up in #27 where APCu can get stuck and hang either if the module isn't present or in rare cases where there's a LOT of traffic to a particular key (or something is broken on the server) and apcu_cas() could spin forever. Mostly theoretical but I'd hate to have our metrics-collecting code wedge in a tight loop and bring the server to its knees.

I'll just open an MR and let you comment, instead of talking your ear off about it. Here we go: #37

-Paul

@thedeacon
Copy link
Contributor Author

@LKaemmerling any feedback, questions, comments on the current MR #37 ?

@LKaemmerling
Copy link
Member

Hey @thedeacon,

thank you for the ping! I didn't had time to look that deep into it like the change request it. I guess that I will have the time within this week.

@thedeacon
Copy link
Contributor Author

Quick update here, after being gone quite a while, I have returned with a new PR #72 which addresses the feedback on #37 (and incorporates around eight months of other changes).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants