Riak_kv K8s memory consumption #1889
What version of Riak? How many keys overall? What is the average size of the keys (including bucket and type names)? What is the per-pod memory limit?

Enabling the leveldb limits doesn't alter the leveled backend's demand for memory, and enabling tictacaae will enable the leveled backend. There has been a lot of work on memory efficiency and AAE efficiency recently, but much of that has yet to make it through to the basho repos. Some specific operations can be memory intensive (like AAE store rebuilds) - is there a pattern of the memory suddenly spiking?

If memory is very tight, you probably have more control over memory if you use the legacy anti_entropy rather than tictacaae. Tictacaae is designed to work in native mode (when the storage_backend is leveled); running it in parallel mode (i.e. with another backend), as here, has had far less non-functional testing.

In terms of recovery, AAE is used to discover differences, and those differences are then resolved by AAE prompting read repair. Read repair will read the values for a suspect key from all N vnodes that hold them, and over-write any out-of-date values (using the vector clocks to determine what is out of date). AAE discovers the differences by exchanging cached trees between pairs of vnodes, comparing them to find mismatched segments, fetching and comparing the clocks of the keys in those segments, and passing any keys whose clocks differ on to read repair.
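For reference, a minimal riak.conf sketch of the settings discussed above; the setting names follow the riak_kv schema, but the values shown are illustrative and should be checked against the schema shipped with your release.

```
## Backend: tictacaae only runs in "native" mode when the backend is leveled;
## with leveldb (as here) it runs in "parallel" mode with its own leveled store.
storage_backend = leveldb

## Tictac AAE (parallel mode when the backend is leveldb)
tictacaae_active = active

## Legacy hashtree-based AAE - the alternative discussed above
anti_entropy = passive
```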
There are some logs in the riak_kv_vnode module that summarise what happens in each cycle, as well as some stats in riak stats. If you're getting memory spikes you might want to look for logs about rebuilds starting/completing and see if they correlate.
Thank you for the quick response!
All context about the behaviour and the search for the corrupted keys can be found in our previous issue from my colleague. So we tried to "force" the recovery of these keys manually, by making requests for them or by using aae_fold queries.
The underlying issue of corrupting objects is fixed in this PR #1882. This is in the nhs-riak 3.2.2 release, but hasn't been accepted into the basho repo yet.

The key amnesia logs are, I assume, from 3 different vnodes. The Inbound counter on one of them is showing nearly 8m updates for that vnode alone. Is that expected through normal activity? I assume not, and that in fact these keys are stuck in some continuous AAE exchange/repair key churn. Do you think this is the case? What is the output of this log - https://github.com/basho/riak_kv/blob/develop-3.0/src/riak_kv_vnode.erl#L1215-L1226

If you are in some continuous exchange churn, you might be able to improve your symptoms by increasing your exchange tick to reduce the frequency of activity - https://github.com/basho/riak_kv/blob/develop-3.0/priv/riak_kv.schema#L150-L162. There are a bunch of improvements to exchange efficiency in the latest nhs-riak 3.2.2 release that would help too. However, I have no way of getting these upstream into the basho repo at present. You may wish to consider the relative risks of building your own release from nhs-riak if you need these changes.
The output for this log: exchanges_expected and exchanges_completed are always 6, for different partitions and in the output from different nodes generally.
AAE is working very hard. You may be able to ease your symptoms by increasing the exchangetick at runtime, as in the sketch below.
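A hedged sketch of a runtime change made from riak remote_console; the application environment key tictacaae_exchangetick (in milliseconds) is assumed from the riak_kv schema linked below, and the value (one hour) is purely illustrative.

```erlang
%% Attach with `riak remote_console`, then raise the exchange tick on this node.
%% Assumes the riak_kv env key is tictacaae_exchangetick and is read in milliseconds.
application:set_env(riak_kv, tictacaae_exchangetick, 3600000).
```

A change made this way only affects the node you are attached to and is lost on restart.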
The change can be made permanent by changing the setting in riak.conf - https://github.com/nhs-riak/riak_kv/blob/nhse-develop/priv/riak_kv.schema#L150-L162. The long-term answer is to upgrade to a version with the necessary fixes and performance improvements (i.e. Riak 3.2.2 with OTP 22), but that would require you to build your own packages from the nhs-riak repo, given the current delays bringing the basho repo up to date.
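For the permanent change, a minimal riak.conf sketch, again assuming the setting name and millisecond units from the linked schema; the value is illustrative only.

```
## Reduce the frequency of tictac AAE exchanges by increasing the exchange tick (ms).
tictacaae_exchangetick = 3600000
```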
Thanks for the advice, we are going to try to build a new image with the new version. Just for clarification: I found at least 8 vnodes in the cluster that output such entries. And could you please explain in more detail what exactly the parameters mean in the output of this log?
There's an explanation of incrementing key epochs here. The linked issue (KV679) and related PRs contain more information.

"Local" refers to the vector clock of the object being replaced in the vnode; "In" is the new PUT arriving at the vnode. If you look at the log_key_amnesia function in riak_kv_vnode, you can see it is called in two cases. If the object is not_found, then the Local values are assumed to be 0/0. So in this case, a PUT arrived for an object that, according to the incoming vector clock, this vnode has coordinated 41 changes to ... but instead the vnode found the object wasn't present.

The KV679 issue that this is there to solve is complicated. It covers various corner cases, which include data being deleted due to backend corruption/failure, but also the effect of deleting keys (when delete_mode =/= keep) and then writing new values to the object, only for "older" tombstones to be resurrected (known as the dombstone scenario).

If you've highlighted 8 vnodes, it might be worth looking at their relative location in the ring, i.e. which preflists they share. If they don't share preflists, you might have potential solutions involving partition repair. If they share preflists, then there might be some further clues as to what is happening.
After a couple of weeks of testing, we tried several scenarios.
We have changed the version of Riak, as you suggested, to the 3.2.2 version from your NHS repo. It works, but we don't see any visible effect on our data corruption case; however, we noticed that CPU consumption decreased.
Replica cluster for test after adding and leave/join procedure:
After deleting the directory and rebuilding:
Is it normal that our tictac_aae directories are, in my opinion, so big in comparison with the leveldb data directories?
This still does not affect the keys, but it feels like memory consumption has decreased.
I would expect tictac_aae folders to be much smaller than leveldb folders, unless your object sizes are very small (i.e. much less than 1KB). If you're not using store_heads, it is just keys and clocks I think (and a very small amount of additional metadata).
I don't understand what is happening. There is an environment variable in riak_kv of log_readrepair - you can set that to true, and it will log out the actual keys that it is repairing. You may then be able to get some clues - are they all from one bucket, are they from a specific set of keys that have different activity patterns, etc. It also logs the details of the vector clocks that differ, which may offer some further clues.
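A minimal sketch of enabling this from riak remote_console, assuming log_readrepair is a riak_kv application environment variable read at runtime as described above; it applies only to the node you attach to and does not persist across restarts.

```erlang
%% Attach with `riak remote_console` on each node, then:
application:set_env(riak_kv, log_readrepair, true).
%% ... and later, to turn the extra logging back off:
application:set_env(riak_kv, log_readrepair, false).
```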
Does every log appearance for each unique key mean that this key is corrupted and will be passed to the recovery process?
A clock of this form indicates a single update. However, the timestamp in the clock is not as expected: this is a key with a single update (the initial write), made on 26th November 2023 at 12:06:52 - that somehow isn't in the other vnode. Might the key have been deleted recently? Do you expect the key to exist? Did something odd happen in late November (most of these timestamps are from similar dates/times)?
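For reference, a hedged sketch of how such a date can be derived from a vector clock timestamp; it assumes the clock entries carry Gregorian-second timestamps (as produced by vclock:timestamp/0 in riak_core), and the round-trip value here is purely illustrative.

```erlang
%% In `riak remote_console`: vclock timestamps are assumed to be Gregorian seconds
%% (seconds counted from year 0), not Unix epoch seconds.
Ts = calendar:datetime_to_gregorian_seconds({{2023, 11, 26}, {12, 6, 52}}),
calendar:gregorian_seconds_to_datetime(Ts).
%% -> {{2023,11,26},{12,6,52}}
```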
Sorry, I ignored your question about the recovery process. The recovery process is read repair, and this is logged at the start of that recovery process - so AAE has discovered this difference between the clocks, and it will now prompt a GET of the object (internally, but using the same GET_FSM as an external fetch). It is expected that the GET_FSM will discover the same difference, and at the end of that process it will prompt a repair of any vnode that doesn't have the object (i.e. the one with the missing clock). There are some stats in riak stats that count these read repairs. Obviously if something is broken with the repair process related to the state of these objects, then if you wait you will see a repeating pattern of the same keys being passed to repair.
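A hedged sketch of checking those counters from the shell; it assumes the relevant counters contain read_repair in their names in the riak admin status output, which may vary between releases.

```sh
# List the read-repair related counters on a node (counter names may differ by release).
riak admin status | grep read_repair
```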
We never delete keys, so it should be present in the database. I don't know if specific dates are a clue, since I found a lot of different timestamps.
Stats from
As for something odd or abnormal, I can remember a situation when one of the nodes couldn't successfully start after a rollout restart. That situation was not the only one in the lifetime of the cluster.
As a solution, we decided to remove the broken index from the data directory and shift the responsibility for index recovery to the anti-entropy processes. When I say "remove", I mean deleting the directory with a command.
The really odd thing is that the fallback count > 0. This implies that when you're doing the read repair, there are two primary nodes responding and one fallback. In a healthy cluster we would expect only primaries to respond. What does
Same output for all 6 nodes.
So none of the read repair counters are increasing? i.e. AAE is prompting a repair, but then a repair doesn't happen? There is a second environment variable - riak_kv read_repair_log. If you set this to true, this will provide an additional log in the read repair process - https://github.com/nhs-riak/riak_kv/blob/nhse-develop-3.4/src/riak_kv_get_fsm.erl#L713-L717. If you turn that on, and then don't see any such logs, this will confirm that the prompt of read repair doesn't then trigger read repair.
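As with log_readrepair above, a minimal sketch of enabling this from riak remote_console, assuming read_repair_log is a riak_kv application environment variable read at runtime.

```erlang
%% Attach with `riak remote_console` on each node, then:
application:set_env(riak_kv, read_repair_log, true).
```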
I'm sorry for the confusion, I meant that only the primary counters are increasing right now. Fallbacks are still the same.
The new log seems normal. Many keys in different vnodes.
So one possibility is that repair is working, and it is just taking time to work through all the differences - as opposed to just churning through the same keys over and over again. So do you see the same keys repeatedly appearing in the logs, or is each key returned for repair once and only once? There is a log EX003 which will be in the erlang.log file (we're interested in the case where in_sync is false). If we can establish that this is working, but just taking time, the time can be reduced in a couple of ways (most notably by triggering range_repair). However, if it isn't working (it is just the same keys recycling), then this is a different scenario to troubleshoot.
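A hedged sketch of one way to check for recurring keys, assuming the read-repair log entries in erlang.log contain the bucket and key on each line; the log path and the Key= field name are placeholders and will differ with your installation and log format.

```sh
# Count how often each repaired key appears: a few keys with high counts suggests churn,
# while a long tail of count=1 suggests repair is simply working through a backlog.
grep read_repair /var/log/riak/erlang.log.* \
  | grep -o 'Key=[^ ]*' | sort | uniq -c | sort -rn | head -20
```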
If you want to correlate the mismatched_segments to a given vnode exchange using the exchange ID present in the EX003 log, there is a debug log which can be enabled - https://github.com/nhs-riak/riak_kv/blob/nhse-develop-3.4/src/riak_kv_tictacaae_repairs.erl#L108-L110.
I enabled the log you mentioned above. However, I am not sure that I interpreted the information correctly.
Then I filtered for them in erlang.log:

```
root@riak-0:/var/lib/riak# grep -F -f 123 ../../log/riak/erlang.log.1 | grep in_sync=false | awk '{print $2, $13, $16, $19, $22}'
2024-07-09T08:06:39.029 pending_state=clock_compare id=a2662ac3-5c2a-4d7a-9ed6-ba55696a64ed mismatched_segments=985364 keys_passed_for_repair=411
2024-07-09T08:07:59.188 pending_state=clock_compare id=462e0d2f-fe04-4ad5-b7db-843bd84e38cd mismatched_segments=987172 keys_passed_for_repair=0
2024-07-09T08:54:37.097 pending_state=clock_compare id=e6fedabf-52b8-42a1-a0e9-4c3d18623d38 mismatched_segments=963155 keys_passed_for_repair=304
2024-07-09T08:54:53.323 pending_state=clock_compare id=feb7e4ad-bd47-46c4-a3d3-b55e62729d87 mismatched_segments=1039457 keys_passed_for_repair=104
2024-07-09T08:55:08.056 pending_state=clock_compare id=2d1f2ffe-958f-4b5a-9d1b-3a230ef57818 mismatched_segments=990598 keys_passed_for_repair=97
2024-07-09T08:55:22.424 pending_state=clock_compare id=31a01a5d-b1fd-4488-a6fd-276d35920977 mismatched_segments=1039457 keys_passed_for_repair=118
2024-07-09T08:55:34.365 pending_state=clock_compare id=db90f922-2f2d-4449-9261-11ab1e3fda5d mismatched_segments=1039457 keys_passed_for_repair=94
2024-07-09T09:58:37.629 pending_state=clock_compare id=365f3e05-4482-4368-b200-228e9c8de43c mismatched_segments=854269 keys_passed_for_repair=553
2024-07-09T10:00:07.307 pending_state=clock_compare id=b7a27c6d-b3be-4ab7-a9d7-6aeefc7d64d8 mismatched_segments=761469 keys_passed_for_repair=352
2024-07-09T10:01:01.638 pending_state=clock_compare id=b268bfed-1796-4094-a736-0d9b41cdea0f mismatched_segments=917088 keys_passed_for_repair=256
2024-07-09T10:01:40.534 pending_state=clock_compare id=e5fae04f-dde5-455a-9948-5f388c633528 mismatched_segments=900336 keys_passed_for_repair=359
2024-07-09T10:02:38.947 pending_state=clock_compare id=bb9be016-13ec-45f2-a308-b94ea4b96836 mismatched_segments=1039457 keys_passed_for_repair=346
2024-07-09T10:30:35.603 pending_state=clock_compare id=ddafb2dc-8a67-4133-86f2-e1679be1cd16 mismatched_segments=906936 keys_passed_for_repair=508
2024-07-09T10:32:21.332 pending_state=clock_compare id=83b99b7a-bd64-4977-b63c-e04681f63555 mismatched_segments=828127 keys_passed_for_repair=230
2024-07-09T10:33:14.240 pending_state=clock_compare id=039793dc-1b6b-4a16-8a62-d79820349ad1 mismatched_segments=1039457 keys_passed_for_repair=360
2024-07-09T10:33:48.976 pending_state=clock_compare id=76c8200c-b30f-47ff-9cc0-68d448f83885 mismatched_segments=1039299 keys_passed_for_repair=385
2024-07-09T10:34:18.032 pending_state=clock_compare id=1eaed470-7731-49cf-9aa6-6b430a75218c mismatched_segments=690370 keys_passed_for_repair=66
2024-07-09T11:18:36.318 pending_state=clock_compare id=1756c0e4-e07b-4887-9b7e-e4c552e0d531 mismatched_segments=837962 keys_passed_for_repair=551
2024-07-09T11:19:13.093 pending_state=clock_compare id=409c02cc-cccd-4bc0-a12b-2c480b0e126f mismatched_segments=877715 keys_passed_for_repair=131
2024-07-09T11:19:47.503 pending_state=clock_compare id=d6bccf7a-698a-4cf4-9aea-7b99c05d6307 mismatched_segments=1013600 keys_passed_for_repair=301
2024-07-09T11:20:06.434 pending_state=clock_compare id=4cc2103b-ec04-44f2-93d0-51f52f1a2e2f mismatched_segments=919943 keys_passed_for_repair=38
2024-07-09T12:22:41.647 pending_state=clock_compare id=c336dbc9-bb64-4bdc-b0c0-052186888870 mismatched_segments=944373 keys_passed_for_repair=277
2024-07-09T12:23:58.234 pending_state=clock_compare id=6dff7121-e094-499d-85b0-6a7ec4bb104a mismatched_segments=1039330 keys_passed_for_repair=117
2024-07-09T12:25:10.992 pending_state=clock_compare id=fc4b9070-1f7b-4b33-87b0-a3731b1be797 mismatched_segments=1039426 keys_passed_for_repair=89
2024-07-09T12:25:24.430 pending_state=clock_compare id=2894df2b-b899-47d5-a0bf-69413874edad mismatched_segments=784947 keys_passed_for_repair=95
2024-07-09T12:25:54.977 pending_state=clock_compare id=34de1a02-125e-488e-bbe5-da6d2387159a mismatched_segments=1039426 keys_passed_for_repair=93
2024-07-09T13:34:33.017 pending_state=clock_compare id=2110f771-2dbb-4347-9b51-3652af1bab33 mismatched_segments=591954 keys_passed_for_repair=285
2024-07-09T13:34:48.154 pending_state=clock_compare id=b31e5d4e-d7fc-48f9-88a1-548ab2181501 mismatched_segments=718162 keys_passed_for_repair=118
2024-07-09T13:35:02.519 pending_state=clock_compare id=7ae99cc5-7eb2-42dd-97e1-7f567e15adc8 mismatched_segments=1039204 keys_passed_for_repair=86
2024-07-09T13:35:12.635 pending_state=clock_compare id=2595be80-9f01-4c35-aeee-aa3c2f6bbc10 mismatched_segments=798716 keys_passed_for_repair=53
2024-07-09T13:37:18.388 pending_state=clock_compare id=36e0f139-96d4-42b7-a4bb-c56b78cfcecc mismatched_segments=968263 keys_passed_for_repair=1035
2024-07-09T14:06:43.972 pending_state=clock_compare id=670facbf-66c4-4536-838e-0dcc5ebb676c mismatched_segments=751539 keys_passed_for_repair=620
2024-07-09T14:08:14.046 pending_state=clock_compare id=6e741e4d-c905-4912-8566-edb35a929ac9 mismatched_segments=775683 keys_passed_for_repair=370
2024-07-09T14:08:53.028 pending_state=clock_compare id=2031ed35-9222-44fc-b922-2413f80651ae mismatched_segments=1037363 keys_passed_for_repair=384
2024-07-09T14:09:27.846 pending_state=clock_compare id=6e950f51-c3e7-4aa9-acae-bf742f8923a2 mismatched_segments=1037395 keys_passed_for_repair=375
2024-07-09T14:09:56.957 pending_state=clock_compare id=9e93291e-7f0f-4147-8b4f-724b5387de44 mismatched_segments=769211 keys_passed_for_repair=85
```

Do I understand correctly that the mismatched_segments value should change more linearly and not have these "bounces"? Or is this OK?
I would expect it to be more linear than this, but mismatched_segments is an estimate, so it can't be guaranteed. However, this doesn't look particularly healthy. Do you have multiple pairs of vnodes with the same behaviour on exchanges?
For the read repairs prompted by this exchange (i.e. vnode pair), is the not_found always the same vnode of the two? Is that a vnode you expect to be being repaired by AAE?
If we assume that this is just the AAE repair taking a long time, you can accelerate repair for a vnode that you deleted using partition repair - https://docs.riak.com/riak/kv/latest/using/repair-recovery/repairs/index.html#repairing-a-single-partition (use riak remote_console to run it, as sketched below).
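A minimal sketch of a single-partition repair from riak remote_console, following the linked documentation; the long integer is a placeholder partition index - substitute the index of the partition you want to repair.

```erlang
%% In `riak remote_console` on the node that owns the affected partition:
riak_kv_vnode:repair(251195593916248939066258330623111144003363405824).
```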
@martinsumner Thanks for your advice. We finally fixed the data corruption in our cluster. As you suggested above, we manually ran the partition repair function on several of the affected partitions. After that, we deleted the old tictacaae directories to force a rebuild of the AAE trees and to free some disk space. After the successful rebuild was completed, the remaining delta was eliminated within 3 days. All exchanges are now in the in_sync state.
Hello, I would like to ask for some advice. We have a 5-node Riak KV ring installed in a Kubernetes cluster. About 500 GB of data is stored on each node, but there is also a bunch of corrupted keys. We use "leveldb" as the backend and have enabled tictacaae.
Thanks in advance, I am ready to provide any additional information.