Ignore pop from an empty external ids stack in RoctracerLogger to avoid crash. (#1006)

Summary:
Pull Request resolved: #1006

See D62090845 for the context.
This diff is trying to mimic nvidia side behavior.
For a workload/application similar to the one where dyno trace crashes on MI300x, the dyno trace on H100 looks like P1666484898. Searching it for the keyword `CUPTI_ERROR_QUEUE_EMPTY` and referring to
[nvidia's doc](https://l.facebook.com/l.php?u=https%3A%2F%2Fdocs.nvidia.com%2Fcuda%2Farchive%2F9.2%2Fcupti%2Fgroup__CUPTI__ACTIVITY__API.html%23group__CUPTI__ACTIVITY__API_1g47395bf12ff55f30822d408b940567e3&h=AT1GbJqjqyEYga1oPxXkXPwznRcRGKnHtSlUt_708U3wxjzTel6MJbF2-o7f5yp7pdDKJ5Y_ASuojzFRECp-un81L7PU6GvesQfQ10v7419Eaqm3laLWGZIZldZpczkg37FlbFbI6zC59n6xtOdrscxX-bA) suggests that the suspected migrated fiber thread's attempts to dequeue from nvidia's thread_local queue fail, just like what we saw on the AMD side.

Reviewed By: davidberard98

Differential Revision: D64974651

fbshipit-source-id: 56b36b0d85361ef8839225663eb6aa314a0897e2
sraikund16 authored and facebook-github-bot committed Oct 28, 2024
1 parent e6f2675 commit 5f5dc26
Showing 1 changed file with 6 additions and 1 deletion.
7 changes: 6 additions & 1 deletion libkineto/src/RoctracerLogger.cpp
@@ -62,7 +62,12 @@ void RoctracerLogger::popCorrelationID(CorrelationDomain type) {
   if (!singleton().externalCorrelationEnabled_) {
     return;
   }
-  t_externalIds[type].pop_back();
+  if (!t_externalIds[type].empty()) {
+    t_externalIds[type].pop_back();
+  } else {
+    LOG(ERROR)
+        << "Attempt to popCorrelationID from an empty external Ids stack";
+  }
 }
 
 void RoctracerLogger::clearLogs() {
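
The guard in this change can be illustrated in isolation. Calling `pop_back()` on an empty standard container is undefined behavior, which is consistent with the crash described in the summary. The following is a minimal sketch only, not the libkineto code: `t_ids`, `pushId`, `popId`, and the enum values are hypothetical names, and the `std::deque` container is an assumption, since the diff does not show how `t_externalIds` is declared.

```cpp
#include <cstdint>
#include <deque>
#include <iostream>

// Hypothetical stand-in for the logger's correlation domains.
enum CorrelationDomain { Default = 0, User = 1, DomainCount = 2 };

// One stack of external ids per domain, private to each thread.
// (Assumed layout; the real t_externalIds declaration is not in the diff.)
thread_local std::deque<uint64_t> t_ids[DomainCount];

void pushId(CorrelationDomain type, uint64_t id) {
  t_ids[type].push_back(id);
}

// Returns true if an id was popped, false if the stack was already empty.
// The empty case is reported and ignored instead of calling pop_back(),
// which would be undefined behavior on an empty container.
bool popId(CorrelationDomain type) {
  if (t_ids[type].empty()) {
    std::cerr << "Attempt to pop from an empty external ids stack\n";
    return false;
  }
  t_ids[type].pop_back();
  return true;
}
```

Because the stacks are `thread_local`, a push made on one thread is invisible to a pop issued from another thread. If a fiber migrates between the two calls, the pop lands on a stack that never received the push, which is the empty-stack case the guard handles.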