How far ahead in a partition will KEY ordering go to look for a different KEY? #829
-
We have a use case where a Kafka topic is used to sync opt-out status with external services such as Shopify, and a few times a month we get random bulk requests on this topic, ranging from 0.7 million to as high as 2 million messages for a single client. This overwhelms one of the 10 partitions (we use client id as the KEY). I was looking at Parallel Consumer as a possible solution so that, despite that load, the regular traffic (with a mix of client ids) can keep on processing. I created test data with different client ids mixed with a large block of requests from a single client id to simulate this behaviour. What I noticed was that the keys separated by the large block of requests from a single customer were not picked up for processing until processing reached about 80% of the block (block size of 0.1 million), with a max concurrency setting of 20. I then changed to the following settings.
Now, with these settings, the first lookup happened very quickly and within the first block different keys were picked up very quickly, but after that they started lagging again. Is this due to the buffer being dominated by blocks of requests from the single client id that are still waiting to be processed under very high load? Can this be configured somehow to ensure that at least a small block of requests comes from different client ids/keys?
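For reference, a minimal sketch of the kind of KEY-ordered Parallel Consumer setup being described. This is illustrative only, not necessarily the exact settings used above; the topic name, group id and `handleOptOut` handler are made up for the example.

```java
import io.confluent.parallelconsumer.ParallelConsumerOptions;
import io.confluent.parallelconsumer.ParallelConsumerOptions.ProcessingOrder;
import io.confluent.parallelconsumer.ParallelStreamProcessor;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.List;
import java.util.Properties;

public class OptOutSyncConsumer {

    public static void main(String[] args) {
        // Plain KafkaConsumer; Parallel Consumer requires auto-commit to be disabled.
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "optout-sync");          // hypothetical group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
        KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(props);

        ParallelConsumerOptions<String, String> options = ParallelConsumerOptions.<String, String>builder()
                .ordering(ProcessingOrder.KEY) // key = client id, so per-client ordering is preserved
                .maxConcurrency(20)            // the concurrency level mentioned above
                .consumer(kafkaConsumer)
                .build();

        ParallelStreamProcessor<String, String> processor =
                ParallelStreamProcessor.createEosStreamProcessor(options);
        processor.subscribe(List.of("optout-events"));                     // hypothetical topic name

        // Different keys are processed concurrently, up to maxConcurrency.
        processor.poll(context -> handleOptOut(context.getSingleConsumerRecord()));
    }

    private static void handleOptOut(ConsumerRecord<String, String> record) {
        // hypothetical handler - push the opt-out status to the external service here
        System.out.printf("client=%s payload=%s%n", record.key(), record.value());
    }
}
```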
-
Hi @yogeshwarpst - so the way it works: Parallel Consumer consumes messages from the partition in offset order into the buffer, then the messages from the buffer are parallelised by key. So you are still limited by how much it can read into memory from Kafka before reaching the buffer limit and pausing consumption to prevent OOM. Other than that, I can't see any other approach that could help from Parallel Consumer alone - you could of course use KSQL / Flink / Kafka Streams to filter or repartition the topic to make sure that bulk requests land on a separate partition or separate topic, but that is more of an architecture / data-flow change.
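To make that last suggestion concrete, a minimal Kafka Streams sketch (outside Parallel Consumer) that routes known bulk senders to their own topic so regular traffic keeps its partitions to itself. The topic names and client ids here are made up for illustration.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Branched;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;
import java.util.Set;

public class BulkClientSplitter {
    public static void main(String[] args) {
        // Clients known to send bulk dumps (hypothetical ids).
        Set<String> bulkClientIds = Set.of("client-123", "client-456");

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("optout-events", Consumed.with(Serdes.String(), Serdes.String()))
               .split()
               // Bulk senders go to a dedicated topic, consumed at its own pace.
               .branch((clientId, value) -> bulkClientIds.contains(clientId),
                       Branched.withConsumer(ks -> ks.to("optout-events-bulk",
                               Produced.with(Serdes.String(), Serdes.String()))))
               // Everything else stays on the "regular" path.
               .defaultBranch(Branched.withConsumer(ks -> ks.to("optout-events-regular",
                       Produced.with(Serdes.String(), Serdes.String()))));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "optout-splitter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```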
-
Thanks for the detailed reply. The message size is quite small, roughly 300 characters per event. I'll do more thorough testing based on this input and will update here.
-
During testing, a load of up to 2.5 million messages was run with a 2 GB heap and it worked fine. The memory overhead of the buffer seems to be in the range of 650 MB per million messages for the message size applicable to our use case. One observation: setting a smaller initial load factor and a higher max load factor did not dynamically adjust the buffer, but directly setting maxBufferSize, or setting the needed value in the initial load factor (along with max concurrency as suggested above), works. Are there any examples of initial vs max load factors and of how the buffer size is dynamically increased?
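Not an authoritative answer, but roughly how the two approaches discussed above might look side by side. The builder method names used here (messageBufferSize, initialLoadFactor, maximumLoadFactor) are assumptions based on the options mentioned in this thread and may differ between Parallel Consumer versions, so verify them against the ParallelConsumerOptions javadoc for your release.

```java
// Reuses the kafkaConsumer from the earlier sketch.

// Option A: size the buffer explicitly. At the ~650 MB per million messages observed above,
// a 200_000-message buffer is roughly 130 MB of heap.
ParallelConsumerOptions<String, String> explicitBuffer = ParallelConsumerOptions.<String, String>builder()
        .ordering(ProcessingOrder.KEY)
        .maxConcurrency(20)
        .messageBufferSize(200_000)     // placeholder name - verify against your version
        .consumer(kafkaConsumer)
        .build();

// Option B: rely on the dynamic load factor. As I understand it, the in-flight target is
// roughly maxConcurrency * loadFactor, and the factor only steps up from the initial value
// towards the maximum when workers are starved of work, so the buffer grows gradually
// rather than jumping straight to the maximum.
ParallelConsumerOptions<String, String> dynamicBuffer = ParallelConsumerOptions.<String, String>builder()
        .ordering(ProcessingOrder.KEY)
        .maxConcurrency(20)
        .initialLoadFactor(10)          // placeholder name - verify against your version
        .maximumLoadFactor(1_000)       // placeholder name - verify against your version
        .consumer(kafkaConsumer)
        .build();
```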
Hi @yogeshwarpst - if the messages are small and you have enough RAM, you can increase the buffer size to be really large (1M of 1KB messages is 1GB; with overheads a bit more, but still within reason). So even taking into account object serialization / copies, on something like an 8GB or 16GB VM you can try setting the buffer to 1-2M messages and give it a shot.
Of course first validate …