
Integration Tests Randomly Fail #3089

Closed
LeStarch opened this issue Dec 20, 2024 · 2 comments · Fixed by #3171
LeStarch commented Dec 20, 2024

F´ Version
Affected Component

Problem Description

The CI system runs integration tests on the 'Ref' application. Lately, these runs have been failing intermittently. The failures are not consistent, but they affect the project because they can mask real errors.

We need to fix this instability.

Details:

  1. Occurs only on Linux Integration Test runs
  2. Occurs randomly. A rerun almost always fixes the problem.
  3. Sample failure: https://github.com/nasa/fprime/actions/runs/12267510122/job/34227736071

We need to:

  1. Attempt to reproduce on a local Linux machine
  2. Analyze the log and note details about the failures (i.e. the failure's signature).
  3. Determine likely causes of this failure driven by the details.
  4. Propose a fix
  5. Test fix locally or in CI

Steps to Reproduce

  1. Log into a Linux machine. These errors only occur on Linux.
  2. Clone fprime (git clone https://github.com/nasa/fprime)
  3. Checkout devel branch (cd fprime; git checkout devel)
  4. Set up a virtual environment (`python3 -m venv fprime-venv; . fprime-venv/bin/activate; pip install -r requirements.txt`)
  5. Build Ref (cd Ref; fprime-util generate; fprime-util build)
  6. Run GDS (fprime-gds)
  7. Run integration tests (in new window) (cd fprime/Ref; pytest)

Repeat steps 6 and 7 multiple times to see whether a failure occurs.
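The repeat-until-failure loop from steps 6 and 7 can be scripted. A minimal sketch (the command and run count are illustrative; in practice the command would be `pytest` run from `fprime/Ref`, with `fprime-gds` restarted between iterations):

```python
import subprocess
import sys

def run_until_failure(cmd, max_runs=20):
    """Run cmd repeatedly; return the 1-based index of the first failing
    run, or None if every run passes."""
    for run in range(1, max_runs + 1):
        if subprocess.run(cmd).returncode != 0:
            return run
    return None

# Stand-in command that always succeeds; substitute the real pytest invocation.
print(run_until_failure([sys.executable, "-c", "pass"], max_runs=3))  # None
```

Logging the run number of the first failure gives a rough failure rate, which is useful when judging whether a proposed fix actually helped.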

LeStarch added the bug label Dec 20, 2024
@Aksharma127

Hi there!

It looks like the integration tests for your fprime project are running into some issues. Specifically, the tests are timing out while waiting for telemetry or expected event sequences. Here’s a quick guide to help you troubleshoot:

  1. Check Your Setup:
    Ensure all dependencies are installed and the GDS (Ground Data System) is running smoothly. Double-check that paths like build-artifacts and settings.ini are configured correctly.

  2. Increase Timeouts:
    The tests are failing due to timeouts. Try increasing the timeout values in your test scripts to give the system more time to process telemetry or events.

  3. Verify Commands:
    Make sure commands like Ref.cmdDisp.CMD_NO_OP are correctly implemented and mapped in your flight software. Use logs or debugging tools to confirm they’re being dispatched properly.

  4. Inspect Events:
    Check that the events (OpCodeDispatched, NoOpReceived, etc.) are being emitted as expected by the Ref.cmdDisp components. Look into the event handlers to ensure they’re functioning as intended.

  5. Run GDS in Debug Mode:
    Start the GDS in debug mode to inspect the flow of commands, telemetry, and events. This can help pinpoint delays or missing data.

  6. Simplify Tests:
    Isolate individual commands or telemetry checks and run them manually to identify where things might be breaking down.
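On the timeout point (item 2), simply raising a fixed timeout can hide the underlying timing problem. A polling helper with a deadline is often a better pattern, since it succeeds as soon as the condition holds and only fails after the full budget. A generic sketch (not the fprime test API):

```python
import time

def wait_for(predicate, timeout_s=5.0, poll_s=0.1):
    """Poll predicate() until it returns True or the deadline passes.
    Returns True on success, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_s)
    return predicate()  # one final check at the deadline

# Example: a condition that becomes true after roughly 0.3 seconds
start = time.monotonic()
assert wait_for(lambda: time.monotonic() - start > 0.3, timeout_s=2.0)
```

The same shape applies to waiting on telemetry or event sequences: poll the received set rather than sleeping once and asserting.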

@matt392code
Contributor

Likely Root Causes:

a) Race condition:
The port-opening failures and ECONNREFUSED errors suggest a timing issue: one component may be trying to connect before another is ready to accept. The successful reruns support this theory, since timing varies between runs.

b) Resource cleanup:
Previous test runs might not be cleaning up ports properly, which could cause port conflicts or failed bindings.

Proposed Fix Approach:

```cpp
#include <chrono>
#include <thread>

// Existing project-specific port helpers (declared here for completeness)
int openPort(int port);
void closeAllPorts();

/**
 * Retry mechanism for port connections with exponential backoff.
 *
 * @param port Port number to connect to
 * @param maxAttempts Maximum number of connection attempts
 * @param baseDelayMs Base delay between attempts (doubles each retry)
 * @return 0 on success, -1 on failure
 */
int retryConnect(int port, int maxAttempts = 3, int baseDelayMs = 100) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
        int status = openPort(port);
        if (status == 0) {
            return 0;  // Success
        }
        if (attempt < maxAttempts - 1) {
            // Exponential backoff: delay doubles on each retry
            int delayMs = baseDelayMs * (1 << attempt);
            std::this_thread::sleep_for(std::chrono::milliseconds(delayMs));
        }
    }
    return -1;  // Failed after all attempts
}

/**
 * Cleanup routine for port resources.
 * Ensures proper teardown between test runs.
 */
void cleanupPorts() {
    // Force close any lingering connections
    closeAllPorts();

    // Allow time for the OS to release resources
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
}
```

Implementation Steps:

  1. Add connection retry logic with exponential backoff
  2. Enhance port cleanup in test teardown
  3. Add logging to capture the timing of port operations
  4. Consider adding mutex protection for port operations

Testing Plan:

  1. Create a test that intentionally introduces delays to verify the retry mechanism
  2. Run multiple parallel test instances to stress test
  3. Add a CI test that runs the integration test multiple times in sequence

Logging enhancements:

```cpp
#include <cerrno>
#include <chrono>
#include <ctime>
#include <iostream>

/**
 * Enhanced port operation logging.
 */
class PortLogger {
  public:
    static void logPortOperation(const char* operation, int port, int status) {
        auto now = std::chrono::system_clock::now();
        auto timestamp = std::chrono::system_clock::to_time_t(now);

        std::cerr << "[" << std::ctime(&timestamp)
                  << "] Port " << port
                  << " " << operation
                  << " Status: " << status
                  << " (errno: " << errno << ")"
                  << std::endl;
    }
};
```

Actual test code:

```cpp
TEST_F(IntegrationTest, PortConnectionStabilityTest) {
    // Setup
    cleanupPorts();

    // Test multiple connection attempts
    for (int i = 0; i < 10; i++) {
        int status = retryConnect(BASE_PORT + i);
        ASSERT_EQ(status, 0) << "Failed to connect to port " << (BASE_PORT + i);

        // Simulate some work
        std::this_thread::sleep_for(std::chrono::milliseconds(10));

        // Cleanup after each iteration
        cleanupPorts();
    }
}
```

This code includes:

  1. Proper error handling and logging
  2. Exponential backoff for retries
  3. Resource cleanup
  4. A stress test case
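Since the GDS side of the integration tests is Python, the same backoff idea can be sketched there as well. This is a socket-level illustration, not fprime's actual connection code:

```python
import socket
import time

def retry_connect(host, port, max_attempts=3, base_delay_s=0.1):
    """Try to open a TCP connection with exponential backoff.
    Returns the connected socket, or None if every attempt fails."""
    for attempt in range(max_attempts):
        try:
            return socket.create_connection((host, port), timeout=1.0)
        except OSError:  # e.g. ECONNREFUSED while the server is still starting
            if attempt < max_attempts - 1:
                # 0.1 s, 0.2 s, 0.4 s, ... between attempts
                time.sleep(base_delay_s * (1 << attempt))
    return None
```

Wrapping the test harness's initial connection in something like this would tolerate the window where the GDS process has started but is not yet accepting connections, which matches the ECONNREFUSED signature described above.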
