
Integration Tests Randomly Fail #3089

Closed
LeStarch opened this issue Dec 20, 2024 · 2 comments · Fixed by #3171
LeStarch commented Dec 20, 2024

F´ Version
Affected Component

Problem Description

The CI system runs integration tests on the 'Ref' application. Lately, these runs have been failing intermittently. The failures are not consistent, but they affect the project because they can mask real errors.

We need to fix this instability.

Details:

  1. Occurs only on Linux Integration Test runs
  2. Occurs randomly. A rerun almost always fixes the problem.
  3. Sample failure: https://github.com/nasa/fprime/actions/runs/12267510122/job/34227736071

We need to:

  1. Attempt to reproduce on a local Linux machine
  2. Analyze the log and note details about the failures (i.e. the failure's signature).
  3. Determine likely causes of this failure driven by the details.
  4. Propose a fix
  5. Test fix locally or in CI

Steps to Reproduce

  1. Log into a Linux machine. These errors only occur on Linux.
  2. Clone fprime (git clone https://github.com/nasa/fprime)
  3. Checkout devel branch (cd fprime; git checkout devel)
  4. Set up a virtual environment (`python3 -m venv fprime-venv; . fprime-venv/bin/activate; pip install -r requirements.txt`)
  5. Build Ref (cd Ref; fprime-util generate; fprime-util build)
  6. Run GDS (fprime-gds)
  7. Run integration tests (in new window) (cd fprime/Ref; pytest)

Repeat steps 6 and 7 multiple times to see whether a failure occurs.
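The repeat-until-failure loop from steps 6 and 7 can be scripted. A minimal sketch (the command and run count are illustrative; in practice the command would be `pytest` run from `fprime/Ref`, with `fprime-gds` restarted between iterations):

```python
import subprocess
import sys

def run_until_failure(cmd, max_runs=20):
    """Run cmd repeatedly; return the 1-based index of the first failing
    run, or None if every run passes."""
    for run in range(1, max_runs + 1):
        if subprocess.run(cmd).returncode != 0:
            return run
    return None

# Stand-in command that always succeeds; substitute the real pytest invocation.
print(run_until_failure([sys.executable, "-c", "pass"], max_runs=3))  # None
```

Logging the run number of the first failure gives a rough failure rate, which is useful when judging whether a proposed fix actually helped.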

LeStarch added the bug label Dec 20, 2024
@Aksharma127

Hi there!

It looks like the integration tests for your fprime project are running into some issues. Specifically, the tests are timing out while waiting for telemetry or expected event sequences. Here’s a quick guide to help you troubleshoot:

  1. Check Your Setup:
    Ensure all dependencies are installed and the GDS (Ground Data System) is running smoothly. Double-check that paths like build-artifacts and settings.ini are configured correctly.

  2. Increase Timeouts:
    The tests are failing due to timeouts. Try increasing the timeout values in your test scripts to give the system more time to process telemetry or events.

  3. Verify Commands:
    Make sure commands like Ref.cmdDisp.CMD_NO_OP are correctly implemented and mapped in your flight software. Use logs or debugging tools to confirm they’re being dispatched properly.

  4. Inspect Events:
    Check that the events (OpCodeDispatched, NoOpReceived, etc.) are being emitted as expected by the Ref.cmdDisp components. Look into the event handlers to ensure they’re functioning as intended.

  5. Run GDS in Debug Mode:
    Start the GDS in debug mode to inspect the flow of commands, telemetry, and events. This can help pinpoint delays or missing data.

  6. Simplify Tests:
    Isolate individual commands or telemetry checks and run them manually to identify where things might be breaking down.
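On the timeout point (item 2), simply raising a fixed timeout can hide the underlying timing problem. A polling helper with a deadline is often a better pattern, since it succeeds as soon as the condition holds and only fails after the full budget. A generic sketch (not the fprime test API):

```python
import time

def wait_for(predicate, timeout_s=5.0, poll_s=0.1):
    """Poll predicate() until it returns True or the deadline passes.
    Returns True on success, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_s)
    return predicate()  # one final check at the deadline

# Example: a condition that becomes true after roughly 0.3 seconds
start = time.monotonic()
assert wait_for(lambda: time.monotonic() - start > 0.3, timeout_s=2.0)
```

The same shape applies to waiting on telemetry or event sequences: poll the received set rather than sleeping once and asserting.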

@matt392code
Contributor

Likely Root Causes:

a) Race condition:
The port-opening failures and ECONNREFUSED errors suggest a timing issue: one component may be trying to connect before another is ready to accept. The successful reruns support this theory, since timing varies between runs.

b) Resource cleanup:
Previous test runs might not be cleaning up ports properly, which could cause port conflicts or failed bindings.

Proposed Fix Approach:

```cpp
#include <chrono>
#include <thread>

// Existing project-specific port helpers (declared here for completeness)
int openPort(int port);
void closeAllPorts();

/**
 * Retry mechanism for port connections with exponential backoff.
 *
 * @param port Port number to connect to
 * @param maxAttempts Maximum number of connection attempts
 * @param baseDelayMs Base delay between attempts (doubles each retry)
 * @return 0 on success, -1 on failure
 */
int retryConnect(int port, int maxAttempts = 3, int baseDelayMs = 100) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
        int status = openPort(port);
        if (status == 0) {
            return 0;  // Success
        }
        if (attempt < maxAttempts - 1) {
            // Exponential backoff: delay doubles on each retry
            int delayMs = baseDelayMs * (1 << attempt);
            std::this_thread::sleep_for(std::chrono::milliseconds(delayMs));
        }
    }
    return -1;  // Failed after all attempts
}

/**
 * Cleanup routine for port resources.
 * Ensures proper teardown between test runs.
 */
void cleanupPorts() {
    // Force close any lingering connections
    closeAllPorts();

    // Allow time for the OS to release resources
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
}
```

Implementation Steps:

  1. Add connection retry logic with exponential backoff
  2. Enhance port cleanup in test teardown
  3. Add logging to capture the timing of port operations
  4. Consider adding mutex protection for port operations

Testing Plan:

  1. Create a test that intentionally introduces delays to verify the retry mechanism
  2. Run multiple parallel test instances to stress test
  3. Add a CI test that runs the integration test multiple times in sequence

Logging enhancements:

```cpp
#include <cerrno>
#include <chrono>
#include <ctime>
#include <iostream>

/**
 * Enhanced port operation logging.
 */
class PortLogger {
  public:
    static void logPortOperation(const char* operation, int port, int status) {
        auto now = std::chrono::system_clock::now();
        auto timestamp = std::chrono::system_clock::to_time_t(now);

        std::cerr << "[" << std::ctime(&timestamp)
                  << "] Port " << port
                  << " " << operation
                  << " Status: " << status
                  << " (errno: " << errno << ")"
                  << std::endl;
    }
};
```

Actual test code:

```cpp
TEST_F(IntegrationTest, PortConnectionStabilityTest) {
    // Setup
    cleanupPorts();

    // Test multiple connection attempts
    for (int i = 0; i < 10; i++) {
        int status = retryConnect(BASE_PORT + i);
        ASSERT_EQ(status, 0) << "Failed to connect to port " << (BASE_PORT + i);

        // Simulate some work
        std::this_thread::sleep_for(std::chrono::milliseconds(10));

        // Cleanup after each iteration
        cleanupPorts();
    }
}
```

This code includes:

  1. Proper error handling and logging
  2. Exponential backoff for retries
  3. Resource cleanup
  4. A stress test case
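Since the GDS side of the integration tests is Python, the same backoff idea can be sketched there as well. This is a socket-level illustration, not fprime's actual connection code:

```python
import socket
import time

def retry_connect(host, port, max_attempts=3, base_delay_s=0.1):
    """Try to open a TCP connection with exponential backoff.
    Returns the connected socket, or None if every attempt fails."""
    for attempt in range(max_attempts):
        try:
            return socket.create_connection((host, port), timeout=1.0)
        except OSError:  # e.g. ECONNREFUSED while the server is still starting
            if attempt < max_attempts - 1:
                # 0.1 s, 0.2 s, 0.4 s, ... between attempts
                time.sleep(base_delay_s * (1 << attempt))
    return None
```

Wrapping the test harness's initial connection in something like this would tolerate the window where the GDS process has started but is not yet accepting connections, which matches the ECONNREFUSED signature described above.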
