[improve][broker] If there is a deadlock in the service, the probe should return a failure because the service may be unavailable #23634
…return a failure because the service may be unavailable
@yyj8 Please add the following content to your PR description and select a checkbox:
There's already a deadlock check in the health check:
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/admin/impl/BrokersBase.java
Lines 366 to 408 in c9de1bb
    @GET
    @Path("/health")
    @ApiOperation(value = "Run a healthCheck against the broker")
    @ApiResponses(value = {
            @ApiResponse(code = 200, message = "Everything is OK"),
            @ApiResponse(code = 403, message = "Don't have admin permission"),
            @ApiResponse(code = 404, message = "Cluster doesn't exist"),
            @ApiResponse(code = 500, message = "Internal server error")})
    public void healthCheck(@Suspended AsyncResponse asyncResponse,
                            @ApiParam(value = "Topic Version")
                            @QueryParam("topicVersion") TopicVersion topicVersion) {
        validateSuperUserAccessAsync()
                .thenAccept(__ -> checkDeadlockedThreads())
                .thenCompose(__ -> internalRunHealthCheck(topicVersion))
                .thenAccept(__ -> {
                    LOG.info("[{}] Successfully run health check.", clientAppId());
                    asyncResponse.resume(Response.ok("ok").build());
                }).exceptionally(ex -> {
                    LOG.error("[{}] Fail to run health check.", clientAppId(), ex);
                    resumeAsyncResponseExceptionally(asyncResponse, ex);
                    return null;
                });
    }

    private void checkDeadlockedThreads() {
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        long[] threadIds = threadBean.findDeadlockedThreads();
        if (threadIds != null && threadIds.length > 0) {
            ThreadInfo[] threadInfos = threadBean.getThreadInfo(threadIds, false, false);
            String threadNames = Arrays.stream(threadInfos)
                    .map(threadInfo -> threadInfo.getThreadName() + "(tid=" + threadInfo.getThreadId() + ")").collect(
                            Collectors.joining(", "));
            if (System.currentTimeMillis() - threadDumpLoggedTimestamp
                    > LOG_THREADDUMP_INTERVAL_WHEN_DEADLOCK_DETECTED) {
                threadDumpLoggedTimestamp = System.currentTimeMillis();
                LOG.error("Deadlocked threads detected. {}\n{}", threadNames,
                        ThreadDumpUtil.buildThreadDiagnosticString());
            } else {
                LOG.error("Deadlocked threads detected. {}", threadNames);
            }
            throw new IllegalStateException("Deadlocked threads detected. " + threadNames);
        }
    }
It also contains an example of how to check deadlocks.
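The quoted method logs a full thread dump at most once per interval and a short message otherwise. That throttling can be looked at in isolation. The following is a minimal sketch, not Pulsar's code: the class `DumpThrottle`, the method `shouldLogFullDump`, and the explicit `nowMillis` parameter are hypothetical names, introduced so the logic can be exercised without a logger or a real clock. It uses `compareAndSet` so that only one of several concurrent callers logs the full dump per interval, which is a design choice of this sketch and not something the quoted lines show.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper: permits a full thread dump at most once per interval. */
final class DumpThrottle {
    private static final long NEVER = Long.MIN_VALUE;

    private final long intervalMillis;
    // Time of the last full dump, or NEVER if none has been logged yet.
    private final AtomicLong lastDumpMillis = new AtomicLong(NEVER);

    DumpThrottle(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /**
     * Returns true if the caller should log the full thread dump now.
     * Uses the same strict "greater than interval" comparison as the quoted code.
     */
    boolean shouldLogFullDump(long nowMillis) {
        long last = lastDumpMillis.get();
        // The NEVER check comes first so the subtraction cannot overflow.
        boolean due = last == NEVER || nowMillis - last > intervalMillis;
        // compareAndSet lets exactly one concurrent caller claim this interval.
        return due && lastDumpMillis.compareAndSet(last, nowMillis);
    }
}
```

A caller would pass `System.currentTimeMillis()` and fall back to the short log line whenever the method returns false.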
Good work @yyj8. Some suggestions for field naming and simplifying the code comment.
@yyj8 By the way, when you add commits to the PR, it's useful to make each commit title describe the change rather than copying the PR title into the follow-up commits. When the PR is merged, all commits are squashed, so the individual messages won't end up in the final merged commit. The benefit of descriptive commit messages in the PR is that the reviewer can follow the changes.
…e should return a failure because the service may be unavailable. Add lastPrintThreadDumpTimestamp field to control the interval time for printing complete thread stack information.
…e should return a failure because the service may be unavailable. Add unit testing code.
Very good suggestion. Future commits will include a detailed description of the code changes.
…e should return a failure because the service may be unavailable. Add unit testing code, shutdown deadlock thread.
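A test for deadlock detection needs threads that really deadlock and can still be stopped afterwards. The following is a minimal sketch of one way to do that, not the test code from this PR: the class `InterruptibleDeadlock` and its method names are hypothetical. It relies on `ReentrantLock.lockInterruptibly()`, so the blocked threads can be released with an interrupt, and on `ThreadMXBean.findDeadlockedThreads()`, which also reports cycles over ownable synchronizers such as `ReentrantLock` on JVMs that support that monitoring.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical test helper: creates a deadlock that can be broken by interrupting. */
final class InterruptibleDeadlock {
    private final ReentrantLock lockA = new ReentrantLock();
    private final ReentrantLock lockB = new ReentrantLock();
    // Ensures each thread holds its first lock before either tries the second.
    private final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
    private final Thread t1 = new Thread(() -> acquire(lockA, lockB), "deadlock-demo-1");
    private final Thread t2 = new Thread(() -> acquire(lockB, lockA), "deadlock-demo-2");

    private void acquire(ReentrantLock first, ReentrantLock second) {
        first.lock();
        try {
            bothHoldFirstLock.countDown();
            bothHoldFirstLock.await();
            second.lockInterruptibly(); // blocks until interrupted or the lock is freed
            second.unlock();
        } catch (InterruptedException e) {
            // Expected during shutdown; fall through and release the first lock.
        } finally {
            first.unlock();
        }
    }

    void start() {
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();
    }

    /** Polls until the JVM reports a deadlock or the timeout elapses. */
    static boolean awaitDeadlock(long timeoutMillis) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long deadline = System.currentTimeMillis() + timeoutMillis;
        try {
            while (System.currentTimeMillis() < deadline) {
                if (bean.findDeadlockedThreads() != null) {
                    return true;
                }
                Thread.sleep(10);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return false;
    }

    /** Interrupts both threads and returns true if both have terminated. */
    boolean shutdown() {
        t1.interrupt();
        t2.interrupt();
        try {
            t1.join(5_000);
            t2.join(5_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !t1.isAlive() && !t2.isAlive();
    }
}
```

Calling `shutdown()` in a `finally` block keeps a failed assertion from leaving deadlocked threads behind for later tests in the same JVM.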
…e should return a failure because the service may be unavailable. Modify unit testing code, use org.mockito.Mockito replaces MockServletContext.
LGTM, good work!
Checkstyle failed due to incorrect import order. I pushed a fix for that problem. @yyj8 If you are using IntelliJ/IDEA, please follow the instructions at https://pulsar.apache.org/contribute/setup-ide/#configure-code-style to set up the IDE.
Okay, thank you very much for your help.
Closing and reopening to run a fresh CI build with the latest changes from the master branch.
    String statusFilePath = "/tmp/status.html";
    File file = new File(statusFilePath);
    file.createNewFile();
It's better to use something like org.assertj.core.util.Files#newTemporaryFile for creating the temporary file.
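The idea behind this suggestion is to let the runtime pick a unique path instead of hard-coding `/tmp/status.html`, which can collide between parallel test runs and does not exist on every platform. The sketch below illustrates that with the JDK's own `java.nio.file.Files.createTempFile` rather than the AssertJ helper named above, so it runs without test dependencies. The class `TempStatusFile`, its method names, and the `status` / `.html` prefix and suffix are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical test helper: a uniquely named status file in the temp directory. */
final class TempStatusFile {
    /** Creates an empty file with a unique name under the default temp directory. */
    static Path create() {
        try {
            return Files.createTempFile("status", ".html");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Deletes the file explicitly; returns true if it existed. */
    static boolean cleanup(Path file) {
        try {
            return Files.deleteIfExists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A test would call `create()` in its setup, pass `path.toString()` wherever the status file path is configured, and call `cleanup(path)` in its teardown.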
    String statusFilePath = "/tmp/status.html";
    File file = new File(statusFilePath);
    file.deleteOnExit();
this doesn't make sense
Fixes #23635
Motivation
In some special scenarios, when the broker service has a deadlock, it needs to be able to automatically recover instead of requiring manual intervention. For example, when the service is deployed in a customer environment, we cannot directly manage it. If the service has a deadlock, the k8s probe should return a failure because the service may be unavailable. The probe failure triggers a broker pod restart to resolve the deadlock.
Modifications
Add deadlock detection in the probe. If a deadlock exists, print the thread stack and return a service unavailable exception.
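The modification described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code of this PR: the class `DeadlockProbe` and its methods are hypothetical, and mapping "service unavailable" to HTTP status 503 is an assumption based on the wording above. Only `ThreadMXBean.findDeadlockedThreads()` and `getThreadInfo(long[])` are real JDK calls.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;
import java.util.Objects;
import java.util.stream.Collectors;

/** Hypothetical probe logic: reports failure while any threads are deadlocked. */
final class DeadlockProbe {
    static final int OK = 200;
    static final int SERVICE_UNAVAILABLE = 503; // assumed mapping for a failed probe

    /** Returns the ids of deadlocked threads, or an empty array if there are none. */
    static long[] deadlockedThreadIds() {
        long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        return ids == null ? new long[0] : ids;
    }

    /** The probe fails as soon as at least one deadlocked thread is reported. */
    static int statusFor(long[] deadlockedIds) {
        return deadlockedIds.length == 0 ? OK : SERVICE_UNAVAILABLE;
    }

    /** Builds a short "name(tid=N)" list for the log line or the error message. */
    static String describe(long[] deadlockedIds) {
        if (deadlockedIds.length == 0) {
            return "";
        }
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        ThreadInfo[] infos = bean.getThreadInfo(deadlockedIds);
        return Arrays.stream(infos)
                .filter(Objects::nonNull) // entries are null for threads that have exited
                .map(info -> info.getThreadName() + "(tid=" + info.getThreadId() + ")")
                .collect(Collectors.joining(", "));
    }

    /** Convenience entry point: the status a probe endpoint would return right now. */
    static int probeStatus() {
        return statusFor(deadlockedThreadIds());
    }
}
```

A probe endpoint would call `probeStatus()` on each request. With a Kubernetes liveness probe pointed at that endpoint, repeated non-2xx responses lead to the container being restarted, which is the recovery path the Motivation section describes.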
Matching PR in forked repository
PR in forked repository: yyj8#10