Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

APCng storage - removed APCUIterator usage while collecting metrics, removed metrics collecting lag #96

Merged

Conversation

Rastusik
Copy link
Contributor

@Rastusik Rastusik commented Nov 9, 2022

The algorithm is simple - it uses an atomic counter to count all stored metrics metadata keys, and stores the keys into a counted list (multiple data keys with a counter pattern)

…removed metrics collecting lag

Signed-off-by: Rastusik <[email protected]>
@Rastusik Rastusik force-pushed the apcng_collect_apcuiterator_removal branch from 05bfade to 22f514f Compare November 9, 2022 17:04
@LKaemmerling
Copy link
Member

Hey @Rastusik,

first of all, thank you, but what do you want to achieve with this change? So basically what is the intention behind it?

@Rastusik
Copy link
Contributor Author

Rastusik commented Nov 9, 2022

hi @LKaemmerling , I just wanted to get rid of the last apcuiterator occurence in the metrics collecting routine. It doesn't have to run ever, and the price for it is small. And as a bonus, there is no more need for the caching TTL for the meta data, because the meta data cache will be repopulated effectively after addition of new metric (during the next collecting routine run)

@Rastusik
Copy link
Contributor Author

@LKaemmerling what do you think? Is this contribution acceptable? Shall I fix the tests? Although they are not in any way connected to the changes

@LKaemmerling
Copy link
Member

@Rastusik of course every contribution is welcome, but I still don’t understand what is the exact benefit from this change? Is it Performances reason? If so please provide benchmarks

@Rastusik
Copy link
Contributor Author

@LKaemmerling the primary reason is that I got rid of this code:

private function scanAndBuildMetainfoCache(): array
    {
        $arr = [];
        $matchAllMeta = sprintf('/^%s:.*:meta/', $this->prometheusPrefix);
        foreach (new APCUIterator($matchAllMeta) as $apc_record) {
            $arr[] = $apc_record['key'];
        }
        if ($apc_ttl >= 1) {
            apcu_store($this->metainfoCacheKey, $arr, $apc_ttl);
        }

The new algorithm makes it possible to build metainfo cache without scanning the whole apcu content. Other improvement is that there is no reason any more to periodically regenerate metainfo cache after some TTL period, the metrics are all regenerated during the first metrics collection run after each metric addition. It is possible because the new metainfo cache generation algorithm has complexity O(N), where N is the metrics count. The old algorithm has also O(N) complexity, but N is the count of apcu entries. This is a problem for one of our projects which stores 1GB of entries in APCU. AcpuIterator creates unnecessary locks for our use case, which means that several web requests take 5 seconds during first cache generation run. I know that the general case has already been solved in the APCng adapter, this is just a small improvement to get rid of all the apcu scanning.

I don't have any benchmarks, because this code just handles the special case of the first generation of metainfo cache. Basically the benchmarks would be the same as before with APCng, this is just an improvement of the first metrics collection run.

Feel free to ask any question, I'm not sure if I was able to explain the improvement good enough.

@Rastusik
Copy link
Contributor Author

@LKaemmerling I added some other minor improvements and fixed the tests

regarding the intent behind all these changes: I'm working on projects which are built on Swoole. the apps collect metrics per app instance in APCu. These apps are relatively highly performant (10 swoole worker processes, 20 reqs/s), and they store around 150 metrics in each request, and there is potential for more metrics and more concurrent processes. that is why I needed to get rid of as much blocking as possible. this code already is deployed in production and works flawlessly.

what do you think about this?

@Rastusik Rastusik force-pushed the apcng_collect_apcuiterator_removal branch 2 times, most recently from 511b905 to 962dfa9 Compare November 18, 2022 13:50
…g. swoole), better concurrent data updates, fixed tests so they work with floats correctly

Signed-off-by: Rastusik <[email protected]>
@Rastusik Rastusik force-pushed the apcng_collect_apcuiterator_removal branch from 962dfa9 to 51f39f4 Compare November 18, 2022 14:44
@LKaemmerling
Copy link
Member

@Rastusik generally said I like the idea! I'm currently quite sick, so I will look into it when I'm fully recovered hopefully at the end of next week/the week after.

@Rastusik
Copy link
Contributor Author

Rastusik commented Nov 18, 2022

@LKaemmerling sorry for disturbing, I was just expecting some/any feedback, thank you for clarification. Get well soon ;)

…ting can make the metrics expire

Signed-off-by: Rastusik <[email protected]>
@Rastusik Rastusik force-pushed the apcng_collect_apcuiterator_removal branch from 7746689 to d58d937 Compare November 28, 2022 11:05
@LKaemmerling
Copy link
Member

@Rastusik thank you! This looks fine from my side even if I'm not experienced with APC(u), also the tests pass (with changes that are understandable).

@LKaemmerling LKaemmerling merged commit c08814f into PromPHP:main Nov 28, 2022
@ashandi
Copy link

ashandi commented Apr 16, 2024

Hello guys!

Seems like these changes caused the following issues :
#124
#126

@Rastusik, do you have any idea why some of the keys can not be found in the cache during work o the scanAndBuildMetainfoCache function ?

@Rastusik
Copy link
Contributor Author

@ashandi I can't remember the details, because this work was done long ago, but I believe that the problem is with your APCu php settings. ttl=0 doesn't really mean anything, when the APCu size hits the configured limit, all the data is expunged (I believe this is the term used in apcu source code). This code expects that data won't be expunged from the cache and the expectation is to configure enough RAM for APCu if ttl=0 is used, or use shorter a defined ttl so the cache size will never be greater than configured limit. The old algorithm worked well, but if the cache needed to be regenerated, we expected 1+ second delays in PHP processing, that's why this change was introduced. I can't really remember the proper apcu configuration because I no longer work for the same company, but it should be straightforward to check the apcu source code and get the idea.

@ashandi
Copy link

ashandi commented May 2, 2024

Hey @Rastusik, thank you for the response!

You are right, ttl=0 means that the value will persist until it is removed from the cache by the system (restart or memory limit exceeded). So we caught the exception because we probably have too small an APCu memory size. My main point here is the following: this library shouldn't raise an exception if one specific metric is missed. As I can see, all the other adapters like Redis just ignore the situation when a metaKey is missing and continue to collect the other metrics. It would be very kind of you to take a look at my pull request and give it a thumbs-up, or explain why it might not be the best solution. Thank you.

@Rastusik
Copy link
Contributor Author

Rastusik commented May 5, 2024

hey @ashandi , I don't really remember what our problem was in details, I just know that there were some apcu deadlocks when we were losing data from apcu (when they were expunged). The point was that when one php process found out that some key is missing in the cache, because it was expunged, it wanted to add it... but if the php process hit the point in time when a different php process was running the expunge algorithm in apcu, there were deadlocks. We had a performance intensive application with cca 80 requests/s with 10 php processes on swoole and the deadlocks brought the application pod down, so we needed to make sure that the expunge algorithm would never be triggered. that's why we were setting the ttl (and maybe there were 2 different ttls) to a very high non-zero value, probably some months into to future. The root cause of the problem seems to be the apcu expunge algorithm and this was our workaround.

I'm not really able to tell you if the changes from your pull request will break something or not. According to what I just wrote it seems like with proper apcu settings with no expunges triggered, the condition that you are describing shouldn't occur. If you want to have some custom apcu settings and it's improbable that you will hit the deadlocks, maybe your changes will make sense - it looks like the data counters will just be reset. maybe prometheus knows how to handle those resets already.

From what I remember, I wanted to avoid the expunge of the metrics data from apcu, but I can't really remember why it was the case and have no time to work it out.

I would recommend you to try to use your implementation in production and see if it works, because this fix can be problematic and other fixes will need to follow it, so it would be better to just test and improve the whole batch of fixes before releasing them. And I also recommend you to check the apcu extension C code, it's relatively simple and you will get the idea what's happening underneath the surface.

@ashandi
Copy link

ashandi commented May 9, 2024

@Rastusik, I don't think that I can manage ttl settings at all because all the apcu_add, apcu_store functions are used with ttl=0 that overwrites the setting from php.ini.
However, it's true that I shouldn't catch this exception if I set a necessary size of memory for APCu because the number of metrics is limited and shouldn't grow with time. Anyway, thank you for your explanation, I'll try to increase my apc.shm_size and see what will happen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants