Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Topology start recording action should not silently restart all recordings #1035

Open
tthvo opened this issue May 16, 2023 · 14 comments
Open
Labels
bug Something isn't working good first issue Good for newcomers

Comments

@tthvo
Copy link
Member

tthvo commented May 16, 2023

Current Behavior:

When I start a recording in the topology view it appears to overwrite the previous recording from the Topolgy view.
It is not a good idea to just assume overwriting is fine. This ultimately leads to data loss to the un-informed customer. Data Loss is never good in IT.

Commented by Scott.

Expected Behavior:

I think the best thing to do here is actually just ignore the request on any targets that already have such a recording running, since starting a new parallel active recording in the same running state doesn't actually capture any additional data. If the recording doesn't exist, it should be created and started. If it exists but has been stopped, it should be restarted. If it is already running just do nothing.

And maybe in the UI we can disable the action to start a recording if all of the child nodes already have the recording in the running state

Suggested by Andrew.

Anything else:

Reference: #906

Depends on: https://github.com/cryostatio/cryostat/issues/1492

@tthvo tthvo added bug Something isn't working good first issue Good for newcomers labels May 16, 2023
@tthvo tthvo moved this to Todo in 2.4.0 release May 16, 2023
@andrewazores
Copy link
Member

And maybe in the UI we can disable the action to start a recording if all of the child nodes already have the recording in the running state

This would be a -web feature since it involves a UI component state,

I think the best thing to do here is actually just ignore the request on any targets that already have such a recording running, since starting a new parallel active recording in the same running state doesn't actually capture any additional data. If the recording doesn't exist, it should be created and started. If it exists but has been stopped, it should be restarted. If it is already running just do nothing.

This I think probably can and should be done server-side.

https://github.com/cryostatio/cryostat/blob/2b0cd5adacf8f928f8666f36663b0a5b31e0e76b/src/main/java/io/cryostat/net/web/http/api/v2/graph/StartRecordingOnTargetMutator.java#L181

https://github.com/cryostatio/cryostat/blob/2b0cd5adacf8f928f8666f36663b0a5b31e0e76b/src/main/java/io/cryostat/recordings/RecordingTargetHelper.java#L157

Currently, when requesting to start a recording, the client can use a query parameter or GraphQL argument restart to ask the server to close and re-open any recordings that already exist with the same name as in the new request. If this parameter is not provided or is false then the server will either create the new recording or will fail with an exception. If the parameter is true then the server will either create the new recording, or will stop the existing recording and start a new one with the other request parameters (event template, maxAge/maxSize settings, etc.).

I think this behaviour makes sense to retain, and it is already part of published API so it would be nice to be able to enhance this while keeping the same request format. I would propose that the query parameter ?restart=[true|false] be kept but documented as deprecated, and a new replacement like ?replace=[always,stopped,never] introduced. ?replace=always would behave the same as ?restart=true, ?replace=never would behave the same as ?restart=false, and ?replace=stopped would implement the quoted behaviour. If replace is not specified then it should behave like ?replace=never, so that this also maintains consistency and compatibility with the existing ?restart=false default behaviour.

@tthvo
Copy link
Member Author

tthvo commented May 16, 2023

That would be a much better idea! Open backend issue: https://github.com/cryostatio/cryostat/issues/1492

@andrewazores andrewazores moved this from Todo to Backlog in 2.4.0 release May 23, 2023
@aali309
Copy link
Contributor

aali309 commented Oct 5, 2023

@andrewazores @tthvo I think this has been solved? I tried reproducing and all seems good. Recording with the same name wont start and it will produce and error saying recording with than name exists. if restart is always it restarts the recording in stopped or running mode. ?? Any suggestions?

@andrewazores
Copy link
Member

I think this probably is solved - your https://github.com/cryostatio/cryostat/pull/1587 would have been the backend fix.

@andrewazores
Copy link
Member

To clarify though, it sounds like the workflow you're describing is based on the Recordings > Create Recording view? This issue refers to the bulk creation action on the Topology view:

Screenshot_2023-10-05_12-32-36

@tthvo
Copy link
Member Author

tthvo commented Oct 6, 2023

Can we do something like:

If there is at least 1 target that has the active recording in RUNNING state, open a modal to ask the user if they want to continue the action.

For this action, we can reuse the following with additional filter (i.e. state == RUNNING).

export const isQuickRecordingExist = (group: EnvironmentNode, { services }: ActionUtils) => {

Of course, if allowed, we can do the same for Delete Recording just to keep it consistent with the rest of the "dangerous" deletion actions elsewhere.

@andrewazores
Copy link
Member

If there is at least 1 target that has the active recording in RUNNING state, open a modal to ask the user if they want to continue the action.

And not prompt if all the targets have no recordings... ? I think if a prompt is shown at all on this action, it should always be shown (until the user checks the "don't ask again" box), and the request to the backend should use ?replace=stopped. This way it is a non-destructive action and will never cause JFR data loss.

If we're displaying a modal we could also ask the user which event template they want to use...

Of course, if allowed, we can do the same for Delete Recording just to keep it consistent with the rest of the "dangerous" deletion actions elsewhere.

Yes, I think the Stop and Delete actions here should have a confirmation modal.

@tthvo
Copy link
Member Author

tthvo commented Oct 6, 2023

And not prompt if all the targets have no recordings... ? I think if a prompt is shown at all on this action, it should always be shown (until the user checks the "don't ask again" box), and the request to the backend should use ?replace=stopped. This way it is a non-destructive action and will never cause JFR data loss.

Oh yess this sounds good!! Was just concerned about a bunch of error notifications when recordings exist.

I am thinking if we go in this direction, why not make it into a modal for interactive recording creation. Basically, CreateRecording view, so users can create any recording, not just a special predefine one. As for other actions, there can be a list of recordings to choose from. All should be supported neatly with GraphQl?

It should fit our goal for multi target design later on. Wdyt?

@andrewazores
Copy link
Member

Oh yess this sounds good!! Was just concerned about a bunch of error notifications when recordings exist.

I don't think there should be error notifications if we use replace=stopped.

I am thinking if we go in this direction, why not make it into a modal for interactive recording creation. Basically, CreateRecording view, so users can create any recording, not just a special predefine one.

See last point below for more detail - but I think this could be a good idea, with some refinement. My first thought is that this might just feel like it gets in the way of achieving the user's goal of just getting some JFR data from their applications. Maybe instead of launching a full custom dialog every time, this uses a configuration that is tucked away in the client-side settings, similar to what the Automated Analysis Dashboard card does.

As for other actions, there can be a list of recordings to choose from. All should be supported neatly with GraphQl?

You mean for the stop/delete/archive actions? I think just always (re)starting a recording with a common name/label and then doing the stop/delete/archive actions based on that name/label is good enough for this case. Otherwise, there's a whole lot of extra complexity about which recordings show up in that list since it has to reflect the various options available across all of the targets in the selection, and how those actions behave depending on which recordings we display and if the chosen recording is actually present in all of the targets, etc. I think this is already a bit of a pain point in our Automated Rules UX when the user has to enter a matchExpression that matches a set of targets in order to select a Target-provided Event Template, but it isn't necessarily clear what is happening there, and JMX auth/SSL misconfigurations also get in the way, etc.

It should fit our goal for multi target design later on. Wdyt?

I think so, but we need to be careful about our designs so that the UX isn't just throwing more and more endless possibilities and options at the user. It would probably be better to have two or three different simpler workflows all built around this idea and on top of these backend capabilities, rather than to build one giant generic interface that lets the user do everything but requires them to deeply study every option to figure out what makes sense.

@tthvo
Copy link
Member Author

tthvo commented Oct 6, 2023

I don't think there should be error notifications if we use replace=stopped.

Ah sorry I was looking at this:

https://github.com/cryostatio/cryostat/blob/e9ddcaf6503df2e7bc0ef35cc418d0cb01730c7e/src/main/java/io/cryostat/recordings/RecordingTargetHelper.java#L207

Looks like it errors out if return false on line 151. And error will be extracted and notifications will send as here.

export const notifyGroupActionErrors = (

The function can just be refined to recognize "recording exists" error and silent it. What do u think @andrewazores?

I think so, but we need to be careful about our designs so that the UX isn't just throwing more and more endless possibilities and options at the user. It would probably be better to have two or three different simpler workflows all built around this idea and on top of these backend capabilities, rather than to build one giant generic interface that lets the user do everything but requires them to deeply study every option to figure out what makes sense.

Right! I didn't think through enough. Looks likes the idea complicated things even more, especially the archive/stop/delete action.

Maybe instead of launching a full custom dialog every time, this uses a configuration that is tucked away in the client-side settings, similar to what the Automated Analysis Dashboard card does.

The idea with AA Card sounds fit for this case.

@tthvo
Copy link
Member Author

tthvo commented Oct 6, 2023

See last point below for more detail - but I think this could be a good idea, with some refinement.

Another simpler way to achieve this could be after we have a design for multi-target UI: The group action can just redirect to corresponding view (i.e. create recording) with descendant targets pre-selected.

@andrewazores
Copy link
Member

Ah sorry I was looking at this:

https://github.com/cryostatio/cryostat/blob/e9ddcaf6503df2e7bc0ef35cc418d0cb01730c7e/src/main/java/io/cryostat/recordings/RecordingTargetHelper.java#L207

Looks like it errors out if return false on line 151. And error will be extracted and notifications will send as here.

cryostat-web/src/app/Topology/Actions/utils.tsx

Line 296 in ed7c201
export const notifyGroupActionErrors = (

The function can just be refined to recognize "recording exists" error and silent it. What do u think @andrewazores?

Better to have the server return some distinct status code or perhaps use a response header to indicate this particular "failure" case then, and check those conditions on the frontend rather than relying on parsing the message string.

@tthvo
Copy link
Member Author

tthvo commented Oct 10, 2023

Sounds good! Tho, we are using GraphQL in this case so status code will still be 200? Looks like we can refine the error field of the GraphQL response to include failure case, status, and resolution steps?

@andrewazores
Copy link
Member

Hmm. Yes, something like that makes sense, but I'd need to look into it more deeply. Could you or @aali309 do some researching on GraphQL error statuses and see if there is a best practice around this before we dive into code changes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers
Projects
No open projects
Status: Backlog
Development

No branches or pull requests

3 participants