
Feat/batch #1

Closed
wants to merge 23 commits into from

Conversation

@aaronchar commented Mar 25, 2022

This pull request is directly related to the following bug: Junos Terraform bug. In its current state, it works for the test case linked in that issue. It needs more testing, but I wanted to get some dialogue flowing around the approach.

When I started looking into the reported bug, my first thought was to set up a connection pool and simply reuse SSH clients. This does work to limit the number of concurrent connections, but when I started playing with scaling it up I noticed two things:

  1. Performance was still extremely poor for large infrastructure deployments. This can be overcome by removing the locking on the functions and instead abstracting out the driver connection; however, that leads directly to my second point.

  2. The NETCONF subsystem can't handle extreme concurrency (outside of commit operations); it simply crashes with "subsystem request for netconf by user automation failed, subsystem not found".

So this pushed me toward the current approach, which seems to function correctly within the current framework without many changes.

I created a process in which we can batch requests. Since almost everything depends directly on the final commit action or tainting, simply building up a list of changes and applying them right before that operation seemed like the simplest solution.

The one big difference is the read operation. I am doing something similar there, but instead of requesting each resource we need to read, I read the configuration once and store the corresponding XML output in a local cache, which is used as a lookup from that point on to find matching elements. On top of this, when running in batch mode we do a commit check before applying the configuration.
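As a rough sketch of that read cache (illustrative only; fetchAllGroups is a hypothetical helper returning apply-group name to raw XML, while the PR itself keys its sync maps by apply-group name):

import "sync"

// Illustrative read cache: hydrate once from a single NETCONF read, then
// serve every later refresh from memory. fetchAllGroups is a hypothetical
// stand-in for the real fetch.
type readCache struct {
    groups sync.Map // apply-group name -> raw XML for that group
}

func (c *readCache) hydrate() error {
    all, err := fetchAllGroups() // one <get-config> round-trip for the whole device
    if err != nil {
        return err
    }
    for name, xml := range all {
        c.groups.Store(name, xml)
    }
    return nil
}

func (c *readCache) lookup(group string) (string, bool) {
    v, ok := c.groups.Load(group)
    if !ok {
        return "", false
    }
    return v.(string), true
}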

I have also abstracted the junos_helpers library out into an interface and made a slight modification so the interface is returned to the consumer/provider. This lets us simply switch between serial and batch mode. My thinking is that the choice can be left up to a provider flag, which determines which mode to use by calling either NewSerialClient or NewBatchClient.
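A minimal sketch of how that switch could look from the provider side; only the NewSerialClient signature below comes from this diff, the NCClient method set is purely illustrative, and NewBatchClient is assumed to take the same arguments:

// Sketch only: the method set shown on NCClient is illustrative, not the real
// junos_helpers interface. NewSerialClient matches the signature in this PR;
// NewBatchClient is assumed to mirror it.
type NCClient interface {
    MarshalGroup(id string, obj interface{}) error  // hypothetical
    DeleteConfig(applyGroup string) (string, error) // hypothetical
    Close() error                                   // hypothetical
}

// The provider could choose the implementation from a flag such as batch_mode.
func newNCClient(batch bool, username, password, sshkey, address string, port int) (NCClient, error) {
    if batch {
        return NewBatchClient(username, password, sshkey, address, port)
    }
    return NewSerialClient(username, password, sshkey, address, port)
}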

recursiveillusion and others added 21 commits March 25, 2022 07:20
this will prevent this change from being a breaking change
updated the MarshalGroup read order so we always read from the write cache before we read from the remove cache

this resolves an issue with stale state being read back into Terraform after an update operation
…bit.

I also moved more of the batch code into custom functions so it is easier to modify/update
…rted everything from using a simple string to using sync maps; each map is based off the apply-group name. My intention is to expand this so that we have the ability to nest multiple objects under the same apply-group and actually allow editing of elements inside those apply maps
…urce name

this makes it cleaner in my opinion
…ability to reduce multiple deletes for the same element within a commit op
@aaronchar (Author)

@davedotdev

Just looking for some feedback on this. Thanks.

@davedotdev (Owner)

Somehow missed this, sorry. Will start the feedback process.

@davedotdev (Owner) left a comment


Other than a clarity thing on Serial client, everything looks fine.

What I'm not clear about is your overall intention. Not to be pedantic on this, but one of the earliest experiments that was done was the concept of a worker pool. Ideally, JTAF would invoke a MAX number of connections and they would be reused. Ultimately they would block until free and then the provider at whatever stage it was at would proceed to the next resource call.

Is batching simpler? Easier to reason about? With a worker pool of sockets, the issue of overloading NETCONF can be handled at the cost of slowing down the TF provider, but it also provides an easy mental model to reason about. Currently (this is my issue, not yours) I can't build a clear enough mental model to couple batched operations to Terraform's logic. For instance, if I create, say, ten resources that have relationships using keys, or a depends_on relationship, how will batching help here? It's not a challenge for the sake of being awkward; I'm just trying to understand whether there's an easier way of mending this issue or whether I'm being a turnip and misunderstanding it.
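For reference, a minimal sketch of that worker-pool idea (not JTAF code; runWithPool and maxSessions are made-up names): a buffered channel caps how many NETCONF sessions are in flight, and each resource call blocks until a slot frees up.

import "sync"

// Minimal worker-pool sketch: at most maxSessions resource calls run
// concurrently; the rest block until a slot is released.
func runWithPool(maxSessions int, calls []func() error) []error {
    sem := make(chan struct{}, maxSessions)
    errs := make([]error, len(calls))
    var wg sync.WaitGroup
    for i, call := range calls {
        wg.Add(1)
        go func(i int, call func() error) {
            defer wg.Done()
            sem <- struct{}{}        // acquire a connection slot
            defer func() { <-sem }() // release it when done
            errs[i] = call()
        }(i, call)
    }
    wg.Wait()
    return errs
}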

Either way, thank you for the effort and apologies on the slow response.

Best,
D

// NewClient returns gonetconf new client driver
func NewClient(username string, password string, sshkey string, address string, port int) (*GoNCClient, error) {
// NewSerialClient returns gonetconf new client driver
func NewSerialClient(username string, password string, sshkey string, address string, port int) (NCClient, error) {
@davedotdev (Owner)

By NewSerialClient, what do you mean exactly? Serial as in sequential?

@aaronchar (Author) commented May 5, 2022


Perhaps the naming isn't great, but since the current flow uses mutex locking and unlocking, only one action is performed at a given time, so I went with the name serial.
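As a rough illustration (not the actual junos_helpers code): a single mutex wrapping every operation means calls are effectively serialized, no matter how many goroutines Terraform spins up.

import "sync"

// Illustrative only: one mutex around every operation serializes all
// NETCONF actions, which is where the "serial" name comes from.
type serialClient struct {
    mu sync.Mutex
    // underlying SSH/NETCONF driver fields would live here
}

func (c *serialClient) do(op func() error) error {
    c.mu.Lock()
    defer c.mu.Unlock()
    return op()
}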

@aaronchar commented May 5, 2022

@davedotdev

Thanks for the response.

So we have spent some time internally playing around with the most performant way to make this work. We used worker pools, and they do increase performance to an extent, but you still run into concurrency issues with NETCONF; it really doesn't like it, and everything is still effectively single-threaded because of the locking that is going on.

The next method we tried was to use a single TCP socket and multiple sessions over that one socket. This also works and increased performance; however, it had the unintended side effect of actually crashing the NETCONF daemon due to concurrency. If I recall, it was at around 10-20 concurrent actions.

The purpose of this batch mode is to stop us from doing a full send/receive round-trip for every resource we are defining. Instead we build up internal sync maps for:

  • read
  • write
  • delete

Read - This is really only updated once, when the provider starts. We pull all the groups from the device and hydrate a sync map, which is then referenced during the refresh stage. So instead of having to create 200 connections for 200 resources, it's a single initial connection and a simple internal map lookup after that.

Write - This one is used any time a create/update action is taken. Instead of sending each change off over NETCONF as its own action, we simply build up an internal map of all the changes we are making (indexed by resource_name). This map is referenced during the post-create read operation and pushed back into the Terraform model.

Delete - Similar flow to the above.

Perhaps batching isn't the correct name for this, but what ends up happening is that we can push a single NETCONF config containing all our adds and deletes at the end (during the commit stage). We then perform a commit check operation before we actually apply the commit.
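Roughly, the end-of-run flush looks something like this sketch, assuming sync.Map write/delete caches keyed by resource name, with buildCandidateConfig, sendEditConfig, commitCheck, and commit as hypothetical stand-ins for the real NETCONF calls:

import "sync"

// Hypothetical sketch of the end-of-run flush: merge the pending writes and
// deletes into one candidate config, commit check it, then commit.
type batchClient struct {
    writeCache  sync.Map // resource name -> XML to add/update
    deleteCache sync.Map // resource name -> XML to delete
}

func (c *batchClient) flush() error {
    var adds, dels []string
    c.writeCache.Range(func(_, v interface{}) bool {
        adds = append(adds, v.(string))
        return true
    })
    c.deleteCache.Range(func(_, v interface{}) bool {
        dels = append(dels, v.(string))
        return true
    })

    candidate := buildCandidateConfig(adds, dels) // one XML payload for everything
    if err := c.sendEditConfig(candidate); err != nil {
        return err
    }
    if err := c.commitCheck(); err != nil { // validate before committing
        return err
    }
    return c.commit()
}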

Our current testing shows this takes about 90 seconds for around 1000 elements, and 30 seconds of that is the commit check action.

@aaronchar commented May 5, 2022

My intention with this is essentially to allow the Terraform provider to opt into this mode. By default it would use the existing method, which works fine for small deployments, but when you start looking at 1000+ resources you are talking about 30+ minutes per device.

That doesn't include the fact that after the initial commit you also need to refresh all those elements, so the time doubles. So if we want to deploy a change to our fabric, which is multiple leafs and spines, we are looking at hours just to apply a configuration change across them.

recursiveillusion added 2 commits May 5, 2022 12:17
…onfig method, this is meant for sending raw commands that do not require a commit, such as getting rollback information
@aaronchar marked this pull request as draft May 11, 2022 03:17
@aaronchar (Author)

@davedotdev - not sure if you have given this any more thought. I have changed it to a draft, which it should have been initially. Some things still need to be cleaned up.

Just wanted to get some feedback to see if this is something that would even be considered or not before I clean it up.

@aaronchar (Author)

I am closing this in favour of some alternate PRs directly against the JTAF library that move the helper functions into that system.

@aaronchar closed this Jul 11, 2022