Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 49822: Pipeline does not show a job as failed if DB fails #5363

Merged
merged 4 commits into from
Mar 27, 2024

Conversation

labkey-jeckels
Copy link
Contributor

Rationale

If a pipeline job is running or queued to be run when the database goes down, we fail to update its state in the DB. If the DB later comes back online, the job looks like it's still running or queued. That's inaccurate and can lead to ETLs not being queued on schedule, etc.

Changes

  • Stash the list of updates that failed
  • Check once a minute for failed updates and retry them

{
Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

// Get configured quartz Trigger from subclass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not understanding how the comment aligns with the code. This doesn't seem to be getting anything from the subclass.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I failed to clean up the comments from the message digest code that I started with. Fixed.

// Attempt the updates to the DB rows again in the hopes that the DB is back online
synchronized (_queuedUpdates)
{
if (!_queuedUpdates.isEmpty())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if statement doesn't seem necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I restructured the code a bit in my latest comment and the if check is a little more useful now. Not vital, but it saves a tiny bit of work

// Then try based on file path
sfExist = getStatusFile(job.getContainer(), job.getLogFilePath());
}
PipelineStatusFileImpl sfExist = getJobStatusFile(job.getJobGUID());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this was existing code and likely not worth changing now, but the name sfExist here is odd to me. existingFile would read better to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree but am opting to keep the diff smaller instead of doing the rename

// This will happen if job was previously run from a cloud root.
sfExist = getStatusFile(job.getContainer(), job.getRemoteLogPath());
}
PipelineStatusFileImpl sfSet = new PipelineStatusFileImpl(job, status, info);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sfSet is also a strange name to me, to the extent that I don't know what a better name is.

else if (PipelineJob.TaskStatus.complete.matches(status))
{
// Make sure the Enterprise Pipeline recognizes this as a completed
// job, even if did it not have a TaskPipeline.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// job, even if did it not have a TaskPipeline.
// job, even if it did not have a TaskPipeline.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@labkey-danield labkey-danield added epic and removed epic labels Mar 26, 2024
@labkey-jeckels labkey-jeckels merged commit a8b4fb6 into release24.3-SNAPSHOT Mar 27, 2024
3 of 4 checks passed
@labkey-jeckels labkey-jeckels deleted the 24.3_fb_49822_jobStatus branch March 27, 2024 22:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[LKCRI] Issue 49822: Pipeline does not show a job as failed for if DB fails over to secondary server
3 participants