Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TSPS-181 Add stairway hooks for failed flights - update metrics and status #141

Merged
merged 17 commits into from
Oct 11, 2024

Conversation

mmorgantaylor
Copy link
Collaborator

@mmorgantaylor mmorgantaylor commented Oct 3, 2024

Description

We have two actions we want to take if a pipeline run flight fails:

  1. Set the pipelineRun status in our db to FAILED
  2. Increment the pipeline failed metric counter for that pipeline type

We had put these into the undo step of the first step of a flight, but this was hacky and wouldn't get executed in cases of dismal failures. The better thing to do is to create Stairway hooks that are called at the end of each flight.

Because these hooks rely on specific fields being present in the working map, they would not necessarily work with any arbitrary flight. Therefore, we add logic to these two hooks to check for the presence of an input parameter that specifies whether to execute the task in the hook.

This PR also includes a refactor of JobMapKeys from an enum to a class so it behaves more like the (newly named) ImputationJobMapKeys (we weren't gaining anything from its being an enum).

Jira Ticket

https://broadworkbench.atlassian.net/browse/TSPS-181

@mmorgantaylor mmorgantaylor marked this pull request as ready for review October 4, 2024 19:33
Copy link
Collaborator

@jsotobroad jsotobroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some small nitpicks around the tests and some questions. Overall looks great and this is so much nicer!

@@ -158,4 +158,19 @@ public static boolean flightComplete(FlightState flightState) {
|| flightState.getFlightStatus() == FlightStatus.FATAL
|| flightState.getFlightStatus() == FlightStatus.SUCCESS);
}

public static boolean isContextInvalid(FlightContext context) {
if (context == null || context.getWorkingMap() == null) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what was the scenario this method was meant to catch? Is this something we see in practice? I feel like this should at least be an error in the logs if not throw a full on exception.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i just copied this from WSM's hook implementation. i assumed it helped in testing scenarios when you don't want to bother with this, but i didn't test without it - just now did, and everything runs fine, so we can take it out for now.

/** Check whether the inputParameters contain a particular key and whether its value is true */
public static boolean inputParametersContainTrue(FlightMap inputParameters, String key) {
return inputParameters.containsKey(key)
&& Boolean.TRUE.equals(inputParameters.get(key, boolean.class));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be Boolean.class or can you compare to just && inputParameters.get(key, boolean.class)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, the way Stairway pulls things out, you can't get just a boolean. I did it this way to try to ward you off from complaining about Boolean.TRUE.equals, but you're right it should be Boolean.class.

import org.springframework.stereotype.Component;

/**
* A {@link StairwayHook} that logs a pipeline failure to Prometheus metrics upon flight failure.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add some more documentation around what key to use to enable this functionality on a flight

@@ -1,70 +0,0 @@
package bio.terra.pipelines.common.utils;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome, cleaning up stuff!!

import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

/** A {@link StairwayHook} that updates the PipelineRun status to FAILED on flight failure. */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here about adding some notes about what key(s) the hook expects

@@ -14,6 +14,12 @@ public abstract class RunImputationJobFlightMapKeys {
public static final String WDL_METHOD_VERSION = "wdl_method_version";
public static final String PIPELINE_RUN_OUTPUTS = "pipeline_run_outputs";

// keys to determine which Stairway hooks to run
public static final String DO_SET_PIPELINE_RUN_STATUS_FAILED_HOOK =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would consider moving these flight wide keys to their own class. Also do you know why these keys are final Strings and the other key class is an enum?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah we can move to the generic JobMapKeys. JobMapKeys is something Stairway expects, but dealing with the enum ends up being uglier than this. we copied this pattern from other services that use Stairway.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea i dont really have a preference as to which id just rather be consistent if we can. I just noticed that we have two ways of doing it and didnt know if there was a good reason for it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my understanding is the class with public static final String key definitions (i.e. RunImputationJobFlightMapKeys) is nicer to work with, but Stairway expects JobMapKeys to be an enum, so that's why we have both. if you feel strongly enough we can convert RunImputationJobFlightMapKeys to enum to match JobMapKeys and add .getKeyName() every time we want to access a key name

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wrong about Stairway expecting JobMapKeys to be an enum - converted to a class and it works fine.

@@ -133,25 +127,6 @@ public StepResult doStep(FlightContext flightContext) {

@Override
public StepResult undoStep(FlightContext flightContext) {
// this is the first step in RunImputationGcpJobFlight and RunImputationAzureJobFlight.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yay clean steps!

}

@Test
void inputParametersContainTrue_true() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am not loving the mix of camel and snake case. I think you its just as understandable as just inputParametersContainsTrue and the one after this as inputParametersContainsFalse and inputParametersContainsNoKey or just probably better yet just one function that tests all three scenarios since they're not very complicated with a comment or two

}

@Test
void endFlight_notPipelineRunTypeFlight_success() throws InterruptedException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment about the camel casing and snake casing and smooshing these tests down into a smaller number. These are a little more complicated but with some refactoring it can probably be made more manageable. @ParameterizedTest might be useful here?

}

@Test
void endFlight_notPipelineRunTypeFlight_success() throws InterruptedException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar nitpick 🕊️

Copy link
Collaborator

@jsotobroad jsotobroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome thanks for the updates!

@okotsopoulos okotsopoulos removed their request for review October 10, 2024 13:32
public enum JobMapKeys {
public class JobMapKeys {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting - we follow the same enum/getKeyName() pattern in TDR. We don't ever reference the enum without the keyName. I like this new approach - it feels much less bulky. Is there something driving this change besides that it is cleaner?

Has this been accepted as the new standard? I'll have to keep this in mind for incremental changes to TDR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is any new standard - when we copied things from WSM, they had the enum JobMapKeys and a class for custom keys. We realized that there wasn't really a reason to have both (and it was confusing to have both), and the class version is less bulky, like you said. so we're going for it!

Comment on lines 164 to 165
return inputParameters.containsKey(key)
&& Boolean.TRUE.equals(inputParameters.get(key, Boolean.class));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return inputParameters.containsKey(key)
&& Boolean.TRUE.equals(inputParameters.get(key, Boolean.class));
return inputParameters.containsKey(key) && inputParameters.get(key, Boolean.class);

I'm surprised you need the Boolean.TRUE.equals bit - can you directly use the boolean value from the get call?

We have a few examples across TDR - here is one.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although, sometime stairway gets picky about actually defining variables. Something like this might work instead:

Suggested change
return inputParameters.containsKey(key)
&& Boolean.TRUE.equals(inputParameters.get(key, Boolean.class));
Boolean booleanKey = inputParameters.get(key, Boolean.class);
return inputParameters.containsKey(key) && booleanKey;

You wouldn't think that pre-defining it in a variable would help, but I've come across a few cases where it is needed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ugh yeah this is super annoying, I don't know if TDR uses SonarQube, but it complains about this approach:
image
because a Boolean can be null. I couldn't figure out how to extract a boolean from the inputParameters.

I think that we could get around this by saying if it's null, assume that's false... I'll do some poking... we hate the Boolean.TRUE.equals() thing too 😭

Copy link

@snf2ye snf2ye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more questions, but LGTM

Copy link

sonarcloud bot commented Oct 11, 2024

@mmorgantaylor mmorgantaylor merged commit 69441a7 into main Oct 11, 2024
12 checks passed
@mmorgantaylor mmorgantaylor deleted the TSPS-181_mma_stairway_hooks branch October 11, 2024 16:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants