
SinkGroupSpec, DeploymentUnit, MaxBytesPerBatch #172

Merged
49 commits merged on Apr 1, 2021

Conversation

@alok87 alok87 commented Mar 21, 2021

Brings up a new specification to define sinkGroup-based batcher and loader configuration. Introduces the concept of a deployment unit, which the user can specify to solve the scaling issues described in #167. maxBytesPerBatch replaces maxSize: it makes all the topics operating in the same pod take the same amount of resources, paving the way for scaling, since scaling requires homogeneous resource consumption across multiple topics.
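A rough sketch of what the new spec shape could look like (field names here are illustrative, taken only from the terms in this PR; check the repository's CRD reference for the exact schema):

```yaml
# Hypothetical sketch, not the exact CRD schema.
sinkGroup:
  all:
    maxBytesPerBatch: 10485760   # 10 MiB; replaces maxSize
    maxWaitSeconds: 30
    deploymentUnit:
      podTemplate:
        resources:
          limits:
            memory: 1Gi          # every unit sized identically
```

Since every topic in a pod is capped at the same byte budget per batch, units become interchangeable and horizontal scaling is a matter of adding identical units.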

alok87 added 2 commits March 21, 2021 21:31
MaxSize gets deprecated in favour of MaxBytesPerBatch. This config makes all fat and lean tables behave the same way in the Redshift batcher: they all take the same amount of memory, which makes scaling easier.
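The difference can be sketched with a toy batching loop (illustrative only; flushByBytes is not the batcher's real API). A byte threshold flushes fat tables after a few rows and lean tables after many, so every batch occupies roughly the same memory:

```go
package main

import "fmt"

// flushByBytes is an illustrative sketch (not the batcher's real API): a
// batch flushes once its accumulated byte size crosses maxBytesPerBatch,
// regardless of how many messages it holds. Fat tables flush after a few
// rows, lean tables after many, but both occupy the same memory per batch.
func flushByBytes(msgSizes []int, maxBytesPerBatch int) [][]int {
	var batches [][]int
	var cur []int
	bytes := 0
	for _, s := range msgSizes {
		cur = append(cur, s)
		bytes += s
		if bytes >= maxBytesPerBatch {
			batches = append(batches, cur)
			cur, bytes = nil, 0
		}
	}
	if len(cur) > 0 { // flush the final partial batch
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	fat := flushByBytes([]int{500, 500, 500}, 1000)   // large rows: few per batch
	lean := flushByBytes([]int{10, 10, 10, 10}, 1000) // small rows: many per batch
	fmt.Println(len(fat), len(lean)) // 2 1
}
```

With a count-based maxSize instead, the fat table's batches would be maxSize times a large row size while the lean table's stay tiny, so pods hosting fat tables would need far more memory.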

Related #136
#167
@alok87 alok87 changed the title MaxBytesPerBatch is better than maxSize SinkGroupSpec, DeploymentUnit, MaxBytesPerBatch Mar 22, 2021
alok87 commented Mar 24, 2021

Live with backward compatibility

alok87 commented Mar 24, 2021

Old spec is compatible, tested OK.

alok87 commented Mar 24, 2021

The metrics are a problem: the values are Gauges, and since they are not set in some conditions, they never go down to 0.

alok87 commented Mar 25, 2021

Two problems have come out after testing with 300 tables:

  1. The distribution of topics across units is not uniform: big topics land in the same unit, making some units slow. Topics should be distributed uniformly across units.
  2. The batcher units keep running and waiting even after the batcher lag has come down, because they are waiting for the loader lag to come down.
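A common fix for problem 1 is a greedy least-loaded allocation. This sketch (hypothetical names, not the operator's code) sorts topics by descending lag and assigns each to the unit with the smallest total lag so far:

```go
package main

import (
	"fmt"
	"sort"
)

// topicLag pairs a topic with its lag (hypothetical type, not the
// operator's actual struct).
type topicLag struct {
	name string
	lag  int64
}

// allocateUnits sketches a greedy least-loaded allocator: topics are
// sorted by descending lag, and each topic goes to the unit with the
// smallest total lag so far, so big topics do not pile into one unit.
func allocateUnits(topics []topicLag, units int) [][]string {
	sort.Slice(topics, func(i, j int) bool { return topics[i].lag > topics[j].lag })
	alloc := make([][]string, units)
	load := make([]int64, units)
	for _, t := range topics {
		min := 0
		for u := 1; u < units; u++ {
			if load[u] < load[min] {
				min = u
			}
		}
		alloc[min] = append(alloc[min], t.name)
		load[min] += t.lag
	}
	return alloc
}

func main() {
	topics := []topicLag{{"big1", 900}, {"big2", 800}, {"small1", 10}, {"small2", 5}}
	fmt.Println(allocateUnits(topics, 2)) // big1 and big2 end up in different units
}
```

Greedy longest-processing-time placement like this keeps the per-unit load within a small factor of optimal, which is enough to stop the skew described above.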

alok87 commented Mar 25, 2021

Things are not auto-recovering from:

E0325 12:03:27.620298       1 batch_processor.go:498] ts.fabric.vn_call_forwardings, error(s) occured in processing (sending err)
I0325 12:03:27.620305       1 batch_processor.go:176] ts.fabric.vn_call_forwardings: batch processing gracefully shutingdown

alok87 commented Mar 25, 2021

Releases are getting stuck.

alok87 commented Mar 26, 2021

CPU profiling attached: cpu_pprof.pdf

Update: opened a separate issue to solve this #173

alok87 added 3 commits March 26, 2021 16:48
Separate out realtime calculation and sinkgroup (separation of concerns).
Main reason: batcher and loader lag are needed to allocateDeploymentUnits.
Also easier to debug and easier to delete this way.
alok87 added 6 commits March 31, 2021 12:18
The deadlock: current cannot be populated until the reload pods are there, and reload cannot be done until current is populated.
This is so that we only operate on the topics which are reloading, not the realtime ones. Not doing this makes the allocator generate duplicates, since it would also operate on the realtime topics while the current status still lists them. So whenever realtime updates happen, always fix the state of the batcher's reloading topics.
alok87 commented Mar 31, 2021

If maxReloadingTopics is reduced, it does not take effect.

alok87 commented Mar 31, 2021

Sometimes the batcher is stuck in a session-closure loop, mostly related to MaxProcessingTime; needs to be checked.
Looks like a dupe of #172 (comment).

Error while closing connection to broker: broken pipe

Update: trying internal Kafka listeners (Strimzi) strimzi/strimzi-kafka-operator#4688

This was fixed after switching to internal routing.

alok87 commented Mar 31, 2021

Loader optimizations

Time

  • maxSizePerBatch: a very low value causes slowness due to repeated merges.
  • maxWaitSeconds: if the batch size in MBs is small because maxWait was hit (due to First schema creation un-necessarily retrys #129), then very small batches are loaded, causing slowness.
    Update: maxWait should reset after processing is done, so that the next batch is not small.

Dividing loader into multiple pods

Group the loader into pods based on each topic's lag, so that there is minimal shuffling.
(Not required: small batches operate at the same speed.)

(Screenshot attached: 2021-04-01, 2:37 PM)

alok87 added 3 commits April 1, 2021 08:44
This is required so that big batches are made at the time of a full sink.
Solves the Time part of #172 (comment)
@alok87 alok87 merged commit 5952d7a into master Apr 1, 2021
alok87 added a commit that referenced this pull request Apr 14, 2021
@alok87 alok87 deleted the api-sinkgroups-maxBytes branch May 31, 2021 07:21