-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
APCng storage - removed APCUIterator usage while collecting metrics, removed metrics collecting lag #96
APCng storage - removed APCUIterator usage while collecting metrics, removed metrics collecting lag #96
Conversation
…removed metrics collecting lag Signed-off-by: Rastusik <[email protected]>
05bfade
to
22f514f
Compare
Hey @Rastusik, first of all, thank you, but what do you want to achieve with this change? So basically what is the intention behind it? |
hi @LKaemmerling , I just wanted to get rid of the last apcuiterator occurence in the metrics collecting routine. It doesn't have to run ever, and the price for it is small. And as a bonus, there is no more need for the caching TTL for the meta data, because the meta data cache will be repopulated effectively after addition of new metric (during the next collecting routine run) |
@LKaemmerling what do you think? Is this contribution acceptable? Shall I fix the tests? Although they are not in any way connected to the changes |
@Rastusik of course every contribution is welcome, but I still don’t understand what is the exact benefit from this change? Is it Performances reason? If so please provide benchmarks |
@LKaemmerling the primary reason is that I got rid of this code: private function scanAndBuildMetainfoCache(): array
{
$arr = [];
$matchAllMeta = sprintf('/^%s:.*:meta/', $this->prometheusPrefix);
foreach (new APCUIterator($matchAllMeta) as $apc_record) {
$arr[] = $apc_record['key'];
}
if ($apc_ttl >= 1) {
apcu_store($this->metainfoCacheKey, $arr, $apc_ttl);
} The new algorithm makes it possible to build metainfo cache without scanning the whole apcu content. Other improvement is that there is no reason any more to periodically regenerate metainfo cache after some TTL period, the metrics are all regenerated during the first metrics collection run after each metric addition. It is possible because the new metainfo cache generation algorithm has complexity O(N), where N is the metrics count. The old algorithm has also O(N) complexity, but N is the count of apcu entries. This is a problem for one of our projects which stores 1GB of entries in APCU. AcpuIterator creates unnecessary locks for our use case, which means that several web requests take 5 seconds during first cache generation run. I know that the general case has already been solved in the APCng adapter, this is just a small improvement to get rid of all the apcu scanning. I don't have any benchmarks, because this code just handles the special case of the first generation of metainfo cache. Basically the benchmarks would be the same as before with APCng, this is just an improvement of the first metrics collection run. Feel free to ask any question, I'm not sure if I was able to explain the improvement good enough. |
…nfo generation Signed-off-by: Rastusik <[email protected]>
…e can work again Signed-off-by: Rastusik <[email protected]>
@LKaemmerling I added some other minor improvements and fixed the tests regarding the intent behind all these changes: I'm working on projects which are built on Swoole. the apps collect metrics per app instance in APCu. These apps are relatively highly performant (10 swoole worker processes, 20 reqs/s), and they store around 150 metrics in each request, and there is potential for more metrics and more concurrent processes. that is why I needed to get rid of as much blocking as possible. this code already is deployed in production and works flawlessly. what do you think about this? |
511b905
to
962dfa9
Compare
…g. swoole), better concurrent data updates, fixed tests so they work with floats correctly Signed-off-by: Rastusik <[email protected]>
962dfa9
to
51f39f4
Compare
@Rastusik generally said I like the idea! I'm currently quite sick, so I will look into it when I'm fully recovered hopefully at the end of next week/the week after. |
Signed-off-by: Rastusik <[email protected]>
@LKaemmerling sorry for disturbing, I was just expecting some/any feedback, thank you for clarification. Get well soon ;) |
…ting can make the metrics expire Signed-off-by: Rastusik <[email protected]>
7746689
to
d58d937
Compare
@Rastusik thank you! This looks fine from my side even if I'm not experienced with APC(u), also the tests pass (with changes that are understandable). |
@ashandi I can't remember the details, because this work was done long ago, but I believe that the problem is with your APCu php settings. |
Hey @Rastusik, thank you for the response! You are right, |
hey @ashandi , I don't really remember what our problem was in details, I just know that there were some apcu deadlocks when we were losing data from apcu (when they were expunged). The point was that when one php process found out that some key is missing in the cache, because it was expunged, it wanted to add it... but if the php process hit the point in time when a different php process was running the expunge algorithm in apcu, there were deadlocks. We had a performance intensive application with cca 80 requests/s with 10 php processes on swoole and the deadlocks brought the application pod down, so we needed to make sure that the expunge algorithm would never be triggered. that's why we were setting the ttl (and maybe there were 2 different ttls) to a very high non-zero value, probably some months into to future. The root cause of the problem seems to be the apcu expunge algorithm and this was our workaround. I'm not really able to tell you if the changes from your pull request will break something or not. According to what I just wrote it seems like with proper apcu settings with no expunges triggered, the condition that you are describing shouldn't occur. If you want to have some custom apcu settings and it's improbable that you will hit the deadlocks, maybe your changes will make sense - it looks like the data counters will just be reset. maybe prometheus knows how to handle those resets already. From what I remember, I wanted to avoid the expunge of the metrics data from apcu, but I can't really remember why it was the case and have no time to work it out. I would recommend you to try to use your implementation in production and see if it works, because this fix can be problematic and other fixes will need to follow it, so it would be better to just test and improve the whole batch of fixes before releasing them. And I also recommend you to check the apcu extension C code, it's relatively simple and you will get the idea what's happening underneath the surface. |
@Rastusik, I don't think that I can manage |
The algorithm is simple - it uses an atomic counter to count all stored metrics metadata keys, and stores the keys into a counted list (multiple data keys with a counter pattern)