Feat/batch #1
Conversation
just makes more sense
this will prevent this change from being a breaking change
updated the MarshalGroup read order so we always read from the write cache before we read from the remove cache; this resolves an issue with stale state being read back into Terraform after an update operation
…bit. I also moved more of the batch code into custom functions so it is easier to modify/update
…rted everything from using a simple string to using sync maps; each map is keyed off the apply-group name. My intention is to expand this so that we have the ability to nest multiple objects under the same apply-group and actually allow editing of elements inside those apply maps
…urce name; this makes it cleaner in my opinion
…ability to reduce multiple deletes for the same element within a commit op
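The per-apply-group sync maps mentioned in the commits above could be sketched roughly like this. This is a minimal illustration, not the PR's actual code: the type and method names here are invented, and the real implementation keys and payloads may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// groupCache holds one sync.Map per apply-group name, so concurrent
// resource operations can record changes without a single global lock.
// All names here are hypothetical stand-ins for the PR's real structures.
type groupCache struct {
	groups sync.Map // apply-group name -> *sync.Map (element name -> XML payload)
}

// put records an element's payload under its apply-group.
func (c *groupCache) put(group, element, payload string) {
	m, _ := c.groups.LoadOrStore(group, &sync.Map{})
	m.(*sync.Map).Store(element, payload)
}

// get looks an element up within its apply-group.
func (c *groupCache) get(group, element string) (string, bool) {
	m, ok := c.groups.Load(group)
	if !ok {
		return "", false
	}
	v, ok := m.(*sync.Map).Load(element)
	if !ok {
		return "", false
	}
	return v.(string), true
}

func main() {
	var c groupCache
	c.put("edge-ports", "ge-0/0/1", "<unit>0</unit>")
	if v, ok := c.get("edge-ports", "ge-0/0/1"); ok {
		fmt.Println(v)
	}
}
```

Nesting maps this way is what would later allow editing individual elements inside an apply-group rather than replacing the group wholesale.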
Just looking for some feedback on this. Thanks
Somehow missed this, sorry. Will start the feedback process.
Other than a clarity thing on Serial client, everything looks fine.
What I'm not clear about is your overall intention. Not to be pedantic on this, but one of the earliest experiments that was done was the concept of a worker pool. Ideally, JTAF would invoke a MAX number of connections and they would be reused. Ultimately they would block until free and then the provider at whatever stage it was at would proceed to the next resource call.
Is batching simpler? Easier to reason about? With a worker pool of sockets, the issue of overloading NETCONF can be handled at the cost of slowing down the TF provider, but it also provides an easy mental model to reason about. Currently (and this is my issue, not yours), I can't build a clear enough mental model to couple batched operations to Terraform's logic. For instance, if I create say ten resources that have relationships using keys, or a depends_on relationship, how will batching help here? I'm not trying to be awkward; I'm just trying to understand whether there's an easier way of mending this issue or whether I'm being a turnip and understanding this incorrectly.
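The worker-pool idea described above can be sketched as a fixed set of NETCONF sessions handed out through a buffered channel, a common Go pattern. The session type here is a stand-in, not JTAF's real client; the point is only that callers block until a connection frees up.

```go
package main

import "fmt"

// session stands in for a real NETCONF connection; hypothetical type.
type session struct{ id int }

// pool hands out at most max sessions; acquire blocks until one is free,
// which is how NETCONF overload would be capped at the cost of speed.
type pool struct{ ch chan *session }

func newPool(max int) *pool {
	p := &pool{ch: make(chan *session, max)}
	for i := 0; i < max; i++ {
		p.ch <- &session{id: i}
	}
	return p
}

func (p *pool) acquire() *session  { return <-p.ch } // blocks when pool is empty
func (p *pool) release(s *session) { p.ch <- s }

func main() {
	p := newPool(2)
	s := p.acquire()
	fmt.Println("got session", s.id)
	p.release(s)
}
```

Each resource call would acquire a session, do its round-trip, and release it, giving the simple mental model described above.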
Either way, thank you for the effort and apologies on the slow response.
Best,
D
// NewClient returns gonetconf new client driver
func NewClient(username string, password string, sshkey string, address string, port int) (*GoNCClient, error) {

// NewSerialClient returns gonetconf new client driver
func NewSerialClient(username string, password string, sshkey string, address string, port int) (NCClient, error) {
By NewSerialClient, what do you mean exactly? Serial as in sequential?
Perhaps the naming isn't great, but since the current flow uses mutex locking and unlocking, only one action is performed at any given time, so I went with the name serial.
Thanks for the response. We have spent some time internally playing around with the most performant method to make this work. We used worker pools, and they do increase performance to an extent, but you still run into concurrency issues with NETCONF; it really doesn't like it, and everything remains effectively single-threaded because of the locking that is going on. The next method we tried was a single TCP socket with multiple sessions over that one socket. This also worked and increased performance, however it had the unintended side effect of actually crashing the NETCONF daemon due to concurrency; if I recall it was around 10-20 concurrent actions.

The purpose of this batch mode is to stop us from doing a full send/receive round-trip for every resource we are defining. Instead, we build up internal sync maps for:

Read - this is really only updated once, when the provider starts; we read all the groups from the device and hydrate a sync map which is then referenced during later lookups.
Write - this one is used any time a create/update action is taken; instead of sending each change off over the wire, it is recorded in the write map.
Delete - similar flow as above, but for removals.

Perhaps batching isn't the correct name for this, but what ends up happening is we can push a single change set at commit time. Our current testing shows this takes about 90 seconds to do around 1000 elements, and 30 seconds of that is the commit check action.
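The write-before-remove read order mentioned in the earlier commit (so stale state is not read back into Terraform after an update) could look roughly like this. All names are illustrative; the real MarshalGroup code in the PR will differ.

```go
package main

import (
	"fmt"
	"sync"
)

// Per-apply-group caches; hypothetical structure for illustration.
var (
	writeCache  sync.Map // element -> latest XML written this plan
	removeCache sync.Map // element -> XML queued for deletion
)

// lookup resolves an element, consulting the write cache BEFORE the
// remove cache so a fresh update is never masked by an older delete entry.
func lookup(element string) (string, bool) {
	if v, ok := writeCache.Load(element); ok {
		return v.(string), true // pending write wins: freshest state
	}
	if _, ok := removeCache.Load(element); ok {
		return "", false // queued for deletion: treat as absent
	}
	return "", false // never seen in this batch
}

func main() {
	// An element that was deleted and then re-created/updated in one plan:
	removeCache.Store("ge-0/0/1", "<old/>")
	writeCache.Store("ge-0/0/1", "<new/>")
	v, ok := lookup("ge-0/0/1")
	fmt.Println(v, ok)
}
```

If the remove cache were consulted first, the element above would wrongly read as deleted, which is exactly the stale-state symptom the commit describes fixing.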
My intention with this is essentially to allow the Terraform provider to opt into this mode. By default it would use the existing method, which works fine for small deployments, but when you start looking at 1000+ resources you are talking about 30+ minutes per device. That doesn't include the fact that after the initial commit you also need to refresh all those elements, so the time doubles. So if we want to deploy a change across our fabric, which is multiple leafs and spines, we are looking at hours of just applying a configuration change across them.
…onfig method; this is meant for sending raw commands that do not require a commit, such as getting rollback information
@davedotdev - not sure if you have given this any more thought. I have changed it to a draft, which it should have been initially. Some things still need to be cleaned up. Just wanted to get some feedback to see if this is something that would even be considered before I clean it up.
I am closing this in favour of some alternate PRs right on the
This pull request is directly related to the following Junos Terraform bug. In its current state, it functions for the test case linked in that issue. It needs more testing, but I just wanted to get some dialogue flowing around this.
When I started looking into the reported bug, my first thought was to set up a connection pool and simply reuse SSH clients. This does work to limit the concurrent connections; however, when I started playing with the scaling of this I noticed two things:

The performance was still extremely poor for large infrastructure deployments. This can be overcome by removing the locking on functions and instead abstracting out the driver connection; however, that leads directly to my second point.
The netconf subsystem can't handle extreme concurrency (outside of commit operations); it simply crashes with: "subsystem request for netconf by user automation failed, subsystem not found"
So this pushed me toward the current approach, which seems to function correctly within the current framework without many changes.
I created a process in which we can batch requests. Since almost everything depends directly upon the final commit action or tainting, simply building up a list of changes and applying them right before that operation seemed like the simplest solution.
The one big difference is the read operation, where I am doing something similar; however, instead of requesting each resource we need to read, I simply read once and store the corresponding XML output in a local cache, which is then used as a lookup to find matching elements. On top of this, when running in batch mode we do a commit check prior to applying the configuration.
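The read-once-and-cache approach described above might be sketched like this. The fetch function is a stand-in for the single NETCONF call that pulls every group's XML; only the hydrate-once caching shape is the point, and all names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	readCache sync.Map
	readOnce  sync.Once
)

// fetchAllGroups stands in for the single device round-trip that returns
// every apply-group's XML output; hypothetical function for illustration.
func fetchAllGroups() map[string]string {
	return map[string]string{"edge-ports": "<interfaces/>"}
}

// readGroup hydrates the cache exactly once, then serves every subsequent
// resource read from the local copy instead of a per-resource round-trip.
func readGroup(name string) (string, bool) {
	readOnce.Do(func() {
		for k, v := range fetchAllGroups() {
			readCache.Store(k, v)
		}
	})
	v, ok := readCache.Load(name)
	if !ok {
		return "", false
	}
	return v.(string), true
}

func main() {
	v, _ := readGroup("edge-ports")
	fmt.Println(v)
}
```

This is what turns N read round-trips into one, which is where most of the refresh-time savings described above would come from.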
I have also abstracted the junos_helpers library out into an interface and made a slight modification so the interface is returned to the consumer/provider. This lets us simply switch between serial and batch mode. My thinking behind this is that the choice can be left up to a provider flag, which determines which mode to use simply by calling either NewSerialClient or NewBatchClient.
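The interface-based switch described here can be sketched minimally as below. The method set is invented for illustration and is not the actual junos_helpers surface; only the shape (one interface, two constructors, a flag deciding between them) matches the description.

```go
package main

import "fmt"

// NCClient is a stand-in for the junos_helpers interface;
// the real method set in the PR will differ.
type NCClient interface {
	SendCommit() error
}

// serialClient performs one locked round-trip per operation.
type serialClient struct{}

func (serialClient) SendCommit() error { fmt.Println("serial commit"); return nil }

// batchClient accumulates changes and pushes one change set at commit time.
type batchClient struct{}

func (batchClient) SendCommit() error { fmt.Println("batched commit"); return nil }

// newClient lets the provider pick the mode with a single flag,
// mirroring the NewSerialClient / NewBatchClient split described above.
func newClient(batch bool) NCClient {
	if batch {
		return batchClient{}
	}
	return serialClient{}
}

func main() {
	c := newClient(true)
	_ = c.SendCommit()
}
```

Because both constructors return the same interface, the rest of the provider code stays identical regardless of which mode the flag selects.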