
Feat/batch #1

Closed
wants to merge 23 commits into from

Conversation

@aaronchar commented Mar 25, 2022

This pull request is directly related to the following bug: Junos Terraform bug. In its current state, it works for the test case linked in that issue. It needs more testing, but I wanted to get some dialogue flowing around the approach.

When I started looking into the reported bug, my first thought was to set up a connection pool and simply reuse SSH clients. This does work to limit the number of concurrent connections, but when I started playing with scaling it up I noticed two things:

  1. Performance was still extremely poor for large infrastructure deployments. This can be overcome by removing the locking on the functions and instead abstracting out the driver connection; however, that leads directly to my second point.

  2. The NETCONF subsystem can't handle extreme concurrency (outside of commit operations); it simply crashes with "subsystem request for netconf by user automation failed, subsystem not found".

So this pushed me toward the current approach, which seems to function correctly within the current framework without many changes.

I created a process in which we can batch requests. Since almost everything depends directly on the final commit action or tainting, simply building up a list of changes and applying them right before that operation seemed like the simplest solution.

The one big difference is the read operation. I am doing something similar there, but instead of requesting each resource we need to read, I read the configuration once and store the corresponding XML output in a local cache, which is used as a lookup from that point on to find matching elements. On top of this, when running in batch mode we do a commit check before applying the configuration.
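As a rough sketch of that read cache (illustrative only; fetchAllGroups is a hypothetical helper returning apply-group name to raw XML, while the PR itself keys its sync maps by apply-group name):

import "sync"

// Illustrative read cache: hydrate once from a single NETCONF read, then
// serve every later refresh from memory. fetchAllGroups is a hypothetical
// stand-in for the real fetch.
type readCache struct {
    groups sync.Map // apply-group name -> raw XML for that group
}

func (c *readCache) hydrate() error {
    all, err := fetchAllGroups() // one <get-config> round-trip for the whole device
    if err != nil {
        return err
    }
    for name, xml := range all {
        c.groups.Store(name, xml)
    }
    return nil
}

func (c *readCache) lookup(group string) (string, bool) {
    v, ok := c.groups.Load(group)
    if !ok {
        return "", false
    }
    return v.(string), true
}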

I have also abstracted the junos_helpers library out into an interface and made a slight modification so the interface is returned to the consumer/provider. This lets us simply switch between serial and batch mode. My thinking is that the choice can be left up to a provider flag, which determines which mode to use by calling either NewSerialClient or NewBatchClient.
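A minimal sketch of how that switch could look from the provider side; only the NewSerialClient signature below comes from this diff, the NCClient method set is purely illustrative, and NewBatchClient is assumed to take the same arguments:

// Sketch only: the method set shown on NCClient is illustrative, not the real
// junos_helpers interface. NewSerialClient matches the signature in this PR;
// NewBatchClient is assumed to mirror it.
type NCClient interface {
    MarshalGroup(id string, obj interface{}) error  // hypothetical
    DeleteConfig(applyGroup string) (string, error) // hypothetical
    Close() error                                   // hypothetical
}

// The provider could choose the implementation from a flag such as batch_mode.
func newNCClient(batch bool, username, password, sshkey, address string, port int) (NCClient, error) {
    if batch {
        return NewBatchClient(username, password, sshkey, address, port)
    }
    return NewSerialClient(username, password, sshkey, address, port)
}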

recursiveillusion and others added 21 commits March 25, 2022 07:20
this will prevent this change from being a breaking change
updated the MarshalGroup read order so we always read from the write cache before we read from the remove cache

this resolves an issue with stale state being read back into Terraform after an update operation
…bit.

I also moved more of the batch code into custom functions so it is easier to modify/update
…rted everything from using a simple string to using sync maps; each map is based off the apply-group name. My intention is to expand this so that we have the ability to nest multiple objects under the same apply-group and actually allow editing of elements inside those apply maps
…urce name

this makes it cleaner in my opinion
…ability to reduce multiple deletes for the same element within a commit op
@aaronchar (Author)

@davedotdev

Just looking for some feedback on this. Thanks.

@davedotdev (Owner)

Somehow missed this, sorry. Will start the feedback process.

@davedotdev (Owner) left a comment


Other than a clarity thing on Serial client, everything looks fine.

What I'm not clear about is your overall intention. Not to be pedantic on this, but one of the earliest experiments that was done was the concept of a worker pool. Ideally, JTAF would invoke a MAX number of connections and they would be reused. Ultimately they would block until free and then the provider at whatever stage it was at would proceed to the next resource call.

Is batching simpler? Easier to reason about? With a worker pool of sockets, the issue of overloading NETCONF can be handled at the cost of slowing down the TF provider, but it also provides an easy mental model to reason about. Currently (this is my issue, not yours) I can't build a clear enough mental model to couple batched operations to Terraform's logic. For instance, if I create, say, ten resources that have relationships using keys, or a depends_on relationship, how will batching help here? It's not a challenge for the sake of being awkward; I'm just trying to understand whether there's an easier way of mending this issue or whether I'm being a turnip and misunderstanding it.
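For reference, a minimal sketch of that worker-pool idea (not JTAF code; runWithPool and maxSessions are made-up names): a buffered channel caps how many NETCONF sessions are in flight, and each resource call blocks until a slot frees up.

import "sync"

// Minimal worker-pool sketch: at most maxSessions resource calls run
// concurrently; the rest block until a slot is released.
func runWithPool(maxSessions int, calls []func() error) []error {
    sem := make(chan struct{}, maxSessions)
    errs := make([]error, len(calls))
    var wg sync.WaitGroup
    for i, call := range calls {
        wg.Add(1)
        go func(i int, call func() error) {
            defer wg.Done()
            sem <- struct{}{}        // acquire a connection slot
            defer func() { <-sem }() // release it when done
            errs[i] = call()
        }(i, call)
    }
    wg.Wait()
    return errs
}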

Either way, thank you for the effort and apologies on the slow response.

Best,
D

// NewClient returns gonetconf new client driver
func NewClient(username string, password string, sshkey string, address string, port int) (*GoNCClient, error) {
// NewSerialClient returns gonetconf new client driver
func NewSerialClient(username string, password string, sshkey string, address string, port int) (NCClient, error) {
@davedotdev (Owner)

By NewSerialClient, what do you mean exactly? Serial as in sequential?

@aaronchar (Author) commented May 5, 2022


Perhaps the naming isn't great, but since the current flow uses mutex locking and unlocking, only one action is performed at a given time, so I went with the name serial.
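As a rough illustration (not the actual junos_helpers code): a single mutex wrapping every operation means calls are effectively serialized, no matter how many goroutines Terraform spins up.

import "sync"

// Illustrative only: one mutex around every operation serializes all
// NETCONF actions, which is where the "serial" name comes from.
type serialClient struct {
    mu sync.Mutex
    // underlying SSH/NETCONF driver fields would live here
}

func (c *serialClient) do(op func() error) error {
    c.mu.Lock()
    defer c.mu.Unlock()
    return op()
}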

@aaronchar commented May 5, 2022

@davedotdev

Thanks for the response.

So we have spent some time internally playing around with the most performant way to make this work. We used worker pools, and they do increase performance to an extent, but you still run into concurrency issues with NETCONF; it really doesn't like it, and everything is still effectively single-threaded because of the locking that is going on.

The next method we tried was to use a single TCP socket and multiple sessions over that one socket. This also works and increased performance; however, it had the unintended side effect of actually crashing the NETCONF daemon due to concurrency. If I recall, it was at around 10-20 concurrent actions.

The purpose of this batch mode is to stop us from doing a full send/receive round-trip for every resource we are defining. Instead we build up internal sync maps for:

  • read
  • write
  • delete

Read - This is really only updated once, when the provider starts. We pull all the groups from the device and hydrate a sync map, which is then referenced during the refresh stage. So instead of having to create 200 connections for 200 resources, it's a single initial connection and a simple internal map lookup after that.

Write - This one is used any time a create/update action is taken. Instead of sending each change off over NETCONF as its own action, we simply build up an internal map of all the changes we are making (indexed by resource_name). This map is referenced during the post-create read operation and pushed back into the Terraform model.

Delete - Similar flow to the above.

Perhaps batching isn't the correct name for this, but what ends up happening is that we can push a single NETCONF config containing all our adds and deletes at the end (during the commit stage). We then perform a commit check operation before we actually apply the commit.
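Roughly, the end-of-run flush looks something like this sketch, assuming sync.Map write/delete caches keyed by resource name, with buildCandidateConfig, sendEditConfig, commitCheck, and commit as hypothetical stand-ins for the real NETCONF calls:

import "sync"

// Hypothetical sketch of the end-of-run flush: merge the pending writes and
// deletes into one candidate config, commit check it, then commit.
type batchClient struct {
    writeCache  sync.Map // resource name -> XML to add/update
    deleteCache sync.Map // resource name -> XML to delete
}

func (c *batchClient) flush() error {
    var adds, dels []string
    c.writeCache.Range(func(_, v interface{}) bool {
        adds = append(adds, v.(string))
        return true
    })
    c.deleteCache.Range(func(_, v interface{}) bool {
        dels = append(dels, v.(string))
        return true
    })

    candidate := buildCandidateConfig(adds, dels) // one XML payload for everything
    if err := c.sendEditConfig(candidate); err != nil {
        return err
    }
    if err := c.commitCheck(); err != nil { // validate before committing
        return err
    }
    return c.commit()
}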

Our current testing shows this takes about 90 seconds for around 1000 elements, and 30 seconds of that is the commit check action.

@aaronchar commented May 5, 2022

My intention with this is essentially to allow the Terraform provider to opt into this mode. By default it would use the existing method, which works fine for small deployments, but when you start looking at 1000+ resources you are talking about 30+ minutes per device.

That doesn't include the fact that after the initial commit you also need to refresh all those elements, so the time doubles. So if we want to deploy a change to our fabric, which is multiple leafs and spines, we are looking at hours just to apply a configuration change across them.

recursiveillusion added 2 commits May 5, 2022 12:17
…onfig method, this is meant for sending raw commands that do not require a commit, such as getting rollback information
@aaronchar marked this pull request as draft May 11, 2022 03:17
@aaronchar (Author)

@davedotdev - not sure if you have given this any more thought. I have changed it to a draft, which it should have been initially. Some things still need to be cleaned up.

Just wanted to get some feedback to see if this is something that would even be considered or not before I clean it up.

@aaronchar (Author)

I am closing this in favour of some alternate PRs directly against the JTAF library that move the helper functions into that system.

@aaronchar closed this Jul 11, 2022