Getting master going offline error #1

Open
weithegreat opened this issue Dec 4, 2024 · 18 comments

Comments

@weithegreat

I am running eRobTest on RTLinux.
It connects to the slaves fine, but very quickly it hits a "master goes offline" error.
Is there an error tolerance I can set on the devices so they do not report this error?

@ZeroErrControl
Owner

I am running eRobTest on RTLinux. It connects to the slaves fine, but very quickly it hits a "master goes offline" error. Is there an error tolerance I can set on the devices so they do not report this error?

Thank you for reporting this issue.

The problem you described is likely caused by system latency or insufficient real-time performance, which affects stable communication between the master and slaves.

Even with RTLinux, we recommend running the following command to check whether the maximum scheduling latency reaches the millisecond range:

sudo cyclictest -m -p99 -t1 -i100 -a3

In a similar case with Jetson Nano (without RTLinux), we achieved 4 hours of stable testing through the following optimizations:

  • Reducing system latency: Disable unnecessary tasks to lower system load.
  • Maximizing system performance: Ensure all cores are running at maximum frequency.
  • Kernel isolation: Bind EtherCAT tasks to specific CPU cores to avoid interference.

Regarding error tolerance, EtherCAT's strict real-time requirements mean that disconnections caused by system latency cannot be resolved by adjusting error thresholds. Instead, we recommend focusing on optimizing system performance and resource allocation.
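As an illustration of the core-binding suggestion, here is a minimal sketch for Linux. It is not taken from eRob_test.cpp; the function names are hypothetical, and the priority you pass is an assumption to tune for your system. It pins the calling thread to one core with sched_setaffinity() and requests SCHED_FIFO with sched_setscheduler(), which usually needs root or CAP_SYS_NICE.

```cpp
#include <cerrno>
#include <sched.h>

// Lowest-numbered CPU the calling thread may currently run on, or -1 on error.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set)) return cpu;
    return -1;
}

// Pin the calling thread to a single core. Returns 0 on success, else an errno value.
int pin_current_thread(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) return EINVAL;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0 ? 0 : errno;
}

// True if the calling thread is restricted to exactly this one core.
bool current_thread_pinned_to(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return false;
    return CPU_COUNT(&set) == 1 && CPU_ISSET(cpu, &set);
}

// Request SCHED_FIFO for the calling thread (valid priorities on Linux: 1..99).
// Returns 0 on success, else an errno value such as EPERM.
int make_current_thread_fifo(int priority) {
    sched_param sp{};
    sp.sched_priority = priority;
    return sched_setscheduler(0, SCHED_FIFO, &sp) == 0 ? 0 : errno;
}
```

Call pin_current_thread() and make_current_thread_fifo() at the start of the EtherCAT cyclic thread. Pinning alone does not keep other tasks off that core; that part is a kernel boot or cpuset configuration and is outside this sketch.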

Additionally:

  • Check slave AL status codes to identify the disconnection cause.
  • Verify that Distributed Clock (DC) synchronization is stable, as desynchronization may exacerbate the issue.

Let us know if you need further assistance or have additional logs to share.
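To make the AL status check concrete, the sketch below maps the timing-related AL status codes from the EtherCAT specification to short descriptions. The function names are hypothetical and the table is deliberately a subset. In SOEM 1.x the raw code is available in ec_slave[i].ALstatuscode after ec_readstate(), and SOEM also ships ec_ALstatuscode2string() for the full list. Whether the eRob raises any of these codes in this scenario is not confirmed in this thread.

```cpp
#include <cstdint>

// Description for AL status codes that point at timing or synchronization
// problems; returns nullptr for codes outside this subset.
const char *describe_timing_al_status(uint16_t code) {
    switch (code) {
        case 0x0000: return "No error";
        case 0x001A: return "Synchronization error";
        case 0x001B: return "Sync manager watchdog";
        case 0x002C: return "Fatal sync error";
        case 0x002D: return "No sync error";
        case 0x0032: return "PLL error";
        case 0x0033: return "DC sync IO error";
        case 0x0034: return "DC sync timeout error";
        default:     return nullptr;
    }
}

// True when the code is a non-zero entry of the subset above.
bool is_timing_related(uint16_t code) {
    return code != 0 && describe_timing_al_status(code) != nullptr;
}
```

A code such as 0x001B would suggest the slave stopped seeing process data in time, while 0x0032 to 0x0034 would point at DC synchronization instead.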

@weithegreat
Author

  1. I noticed you commented out ec_configdc() compared to simple test. Why is it not needed?
  2. What is the right sequence for calling
    ecx_dcsync0();
    ec_configdc();
    ec_config_map(&IOmap_);

and do they all have to be called in the PRE_OP state, or in the SAFE_OP state?

@weithegreat
Author

Also, I noticed you're calling ec_send_processdata immediately after ec_receive_processdata.
Is it possible to do some data processing after ec_receive_processdata and only then call ec_send_processdata? Would that cause a problem?

@ZeroErrControl
Owner

ZeroErrControl commented Dec 4, 2024

Also, I noticed you're calling ec_send_processdata immediately after ec_receive_processdata.
Is it possible to do some data processing after ec_receive_processdata and only then call ec_send_processdata? Would that cause a problem?

This is an excellent question. Here's what we found:

When using ec_configdc(), we noticed that the slave devices were unable to enter DC mode, which in turn prevented them from reaching the OP state. To investigate this, we analyzed the process of initializing eRob with the TwinCAT master (a standard EtherCAT master) by capturing network packets. During this analysis, we observed that the TwinCAT master writes the value 3 to the register at address 0x0981, and the value of object dictionary 1C32 is set to 2 to indicate the slave has correctly entered DC mode.

Based on these observations, we found that calling ecx_dcsync0() in the PRE_OP state, before ec_config_map(), satisfies these conditions. While ec_configdc() is theoretically meant to be called in the SAFE_OP state to calculate the master’s reference clock, we haven’t identified any significant differences when calling it in PRE_OP or SAFE_OP. You might want to try both and see how it works for your setup.

Thus, the correct sequence should be:

  • ecx_dcsync0() -> PRE_OP
  • ec_config_map() -> PRE_OP
  • ec_configdc() -> SAFE_OP

Let us know if you have any further questions or issues.
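The ordering above can be written down as a small step table. This is only a sketch: BringUpStep, erob_bring_up and run_bring_up are hypothetical names, and the hooks stand in for the real SOEM calls so the order can be checked without hardware. With SOEM 1.x the three hooks would wrap something like ecx_dcsync0(&ecx_context, slave, TRUE, cycle_ns, shift_ns) for each slave, ec_config_map(&IOmap), and ec_configdc().

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

struct BringUpStep {
    std::string call;           // SOEM call named in the thread
    std::string state;          // slave state the thread pairs it with
    std::function<bool()> run;  // hypothetical hook wrapping the real call
};

// Build the three steps in the order given in the thread.
std::vector<BringUpStep> erob_bring_up(std::function<bool()> enable_sync0,
                                       std::function<bool()> map_process_data,
                                       std::function<bool()> configure_dc) {
    return {
        {"ecx_dcsync0()",   "PRE_OP",  std::move(enable_sync0)},
        {"ec_config_map()", "PRE_OP",  std::move(map_process_data)},
        {"ec_configdc()",   "SAFE_OP", std::move(configure_dc)},
    };
}

// Run the steps in order and stop at the first one that fails or has no hook.
// Returns how many steps succeeded.
std::size_t run_bring_up(const std::vector<BringUpStep> &steps) {
    std::size_t done = 0;
    for (const auto &step : steps) {
        if (!step.run || !step.run()) break;
        ++done;
    }
    return done;
}
```

A return value below 3 shows which step failed. The wait for SAFE_OP between the second and third step (ec_statecheck() in SOEM 1.x) is not modeled here and would belong inside the hooks.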

@ZeroErrControl
Owner

Also, I noticed you're calling ec_send_processdata immediately after ec_receive_processdata.
Is it possible to do some data processing after ec_receive_processdata and only then call ec_send_processdata? Would that cause a problem?

We haven't done much testing on this yet. This example is primarily designed to address the issue where the SOEM master cannot bring eRob into the OP state, with simple extensions for enabling and motion functionality.

We recommend experimenting with your own setup, but first get the master running stably and then build further development and optimization on top of that.
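For anyone who wants to try the receive, process, send ordering asked about above, here is a minimal sketch of a fixed-period loop. It is not the eRob_test.cpp loop, and run_cycles, add_ns and is_after are hypothetical names. The three callables stand in for ec_receive_processdata(EC_TIMEOUTRET), your own processing, and ec_send_processdata(). The loop sleeps to absolute deadlines with clock_nanosleep() so delays do not accumulate, and it counts cycles whose work finished after the deadline.

```cpp
#include <cstdint>
#include <time.h>

// Advance an absolute timestamp by ns (ns >= 0), keeping tv_nsec below one second.
void add_ns(timespec &t, int64_t ns) {
    int64_t total = static_cast<int64_t>(t.tv_nsec) + ns;
    t.tv_sec += total / 1000000000LL;
    t.tv_nsec = total % 1000000000LL;
}

// True if a is strictly later than b.
bool is_after(const timespec &a, const timespec &b) {
    return a.tv_sec > b.tv_sec || (a.tv_sec == b.tv_sec && a.tv_nsec > b.tv_nsec);
}

// Run `cycles` iterations of receive -> process -> send at a fixed period.
// Returns the number of cycles whose work ended after that cycle's deadline.
template <class Receive, class Process, class Send>
int run_cycles(Receive receive, Process process, Send send,
               int cycles, int64_t period_ns) {
    timespec deadline;
    clock_gettime(CLOCK_MONOTONIC, &deadline);
    int overruns = 0;
    for (int i = 0; i < cycles; ++i) {
        add_ns(deadline, period_ns);
        receive();   // e.g. ec_receive_processdata(EC_TIMEOUTRET)
        process();   // application logic between receive and send
        send();      // e.g. ec_send_processdata()
        timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);
        if (is_after(now, deadline)) ++overruns;
        // Absolute sleep: returns immediately if the deadline has already passed.
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, nullptr);
    }
    return overruns;
}
```

If the overrun count stays at zero with your processing in place, the extra work fits inside the cycle. Whether the eRob tolerates the shifted send time is something the maintainers say they have not tested.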

@weithegreat
Author

weithegreat commented Dec 4, 2024 via email

@ZeroErrControl
Owner

If I set the EtherCAT DC cycle time to 1 ms, what is the maximum delay before an eRob unit reports "master go offline"? Can I set the cycle time to 2 ms but still send and receive data every 1 ms to avoid the "master go offline" error? We have 20 devices on the EtherCAT line, including 15 eRob actuators, but only the eRob units report "master go offline". There has got to be some threshold value I can set to avoid this error. It is catastrophic for our system: we can tolerate delay in communication but cannot tolerate the "master go offline" error.

This is a highly technical question. We will conduct experiments and tests based on your issue to try and resolve it. However, we currently do not have a specific solution. Here are our suggestions:

  • Try adjusting the watchdog timeout settings in SOEM.
  • Configure the DC mode offset time properly.
  • Capture the error codes reported when the master goes offline to facilitate further troubleshooting.
  • Set both the cycle time and the data transmission time to 2ms for consistency.

Additionally, setting the DC cycle time to 2ms while maintaining a transmission interval of 1ms can also result in the master going offline.

@weithegreat
Author

weithegreat commented Dec 4, 2024 via email

@weithegreat
Author

The "master station goes offline" problem is a real headache for us, and only the eRob is reporting it.

@ZeroErrControl
Owner

How can I adjust the watchdog timeout settings in SOEM?

You can refer to the development documentation of the SOEM master station.
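For orientation, the sketch below shows how the standard watchdog registers of an EtherCAT slave controller (ESC) translate into a timeout. It assumes the eRob follows the usual ESC register layout: the watchdog divider at 0x0400 and the process data watchdog time at 0x0420, with a base increment of (divider + 2) × 40 ns. The thread does not confirm that layout or that this watchdog is what raises the "master offline" fault, so check the eRob documentation before changing anything. The helper names are hypothetical.

```cpp
#include <cstdint>

// Standard ESC register addresses (assumed to apply to the eRob).
constexpr uint16_t kRegWatchdogDivider = 0x0400;
constexpr uint16_t kRegWatchdogTimeProcessData = 0x0420;

// Base watchdog increment: (divider + 2) ticks of the 25 MHz clock, 40 ns each.
constexpr uint64_t watchdog_increment_ns(uint16_t divider) {
    return (static_cast<uint64_t>(divider) + 2) * 40;
}

// Timeout for a given divider and watchdog time count.
constexpr uint64_t watchdog_timeout_ns(uint16_t divider, uint16_t count) {
    return watchdog_increment_ns(divider) * count;
}

// Smallest count whose timeout is at least wanted_ns, clamped to 16 bits.
inline uint16_t watchdog_count_for(uint16_t divider, uint64_t wanted_ns) {
    const uint64_t inc = watchdog_increment_ns(divider);
    const uint64_t count = (wanted_ns + inc - 1) / inc;  // round up
    return count > 0xFFFF ? 0xFFFF : static_cast<uint16_t>(count);
}
```

With the usual ESC defaults (divider 2498, count 1000) this gives 100 µs × 1000 = 100 ms. In SOEM 1.x a new count could be written in PRE_OP with a register write such as ec_FPWR(ec_slave[i].configadr, kRegWatchdogTimeProcessData, sizeof(count), &count, EC_TIMEOUTRET), again assuming the slave accepts it.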

@ZeroErrControl
Owner

The "master station goes offline" problem is a real headache for us, and only the eRob is reporting it.

We plan to use the SOEM master to reproduce the issue you mentioned. If there are any results, we will notify you at the earliest opportunity.

@ZeroErrControl
Owner

I am running eRobTest on RTLinux. It connects to the slaves fine, but very quickly it hits a "master goes offline" error. Is there an error tolerance I can set on the devices so they do not report this error?

We have removed the CPU affinity binding for Thread 1 and Thread 2 in the new program (eRob_test.cpp) to avoid unpredictable scheduling delays. With this change, the program drove six eRob units on the RT Linux system and ran stably for over one hour. We recommend first testing with the eRob enabled over a long period. If the slaves still drop out of OP, I will optimize the master program further.

@weithegreat
Author

Did you reproduce the "master go offline" problem? I don't care about dropping out of OP; "master go offline" is the fatal issue for us.
We're using an RT system and have assigned a dedicated core to the EtherCAT update thread, but the actuator drops out immediately if there is even a slight glitch in timing. Once "master station goes offline" is raised, there's no way to clear the fault; the actuator stays in fault and won't work.

@ZeroErrControl
Owner

Did you reproduce the "master go offline" problem? I don't care about dropping out of OP; "master go offline" is the fatal issue for us. We're using an RT system and have assigned a dedicated core to the EtherCAT update thread, but the actuator drops out immediately if there is even a slight glitch in timing. Once "master station goes offline" is raised, there's no way to clear the fault; the actuator stays in fault and won't work.

To be precise, I haven't fully understood what you mean by "master going offline." Could you provide the specific scenarios, the messages printed by the master, and the EtherCAT slave messages during the disconnection? This would help me better reproduce the issue. Previously, I interpreted the master going offline as the same as dropping out of OP state.

@weithegreat
Author

weithegreat commented Dec 6, 2024 via email

@Rohithossain007

The EtherCAT device sends error code 0xA000 and enters a fault state.

Hey, do you have a screenshot or anything else showing the error you are facing?

@ZeroErrControl
Owner

I am running eRobTest on RTLinux. It connects to the slaves fine, but very quickly it hits a "master goes offline" error. Is there an error tolerance I can set on the devices so they do not report this error?

Through my attempts with the SOEM master, I identified several key points for optimization:

  • Thread Isolation and CPU Affinity: Bind the EtherCAT thread to a specific CPU core and isolate the core to reduce interference from network management tasks.
  • State Monitoring and Automatic Recovery: Add state monitoring and automatic recovery mechanisms to ensure quick recovery in case of exceptions.
  • Thread CPU Affinity: Set CPU affinity for critical threads to minimize scheduling delays.
  • Improve System Real-Time Performance: Optimize the real-time kernel configuration and thread priorities to ensure timely execution of cyclic tasks.
  • Communication Timeout Handling: Implement timeout detection to prevent system stalls due to communication issues.

These improvements significantly enhance the stability and real-time performance of the SOEM master.
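One way to picture the state monitoring and timeout handling points is a working counter (WKC) check in the cyclic loop. The sketch below is an assumption about how such a check could look, not the code in the uploaded project; WkcMonitor is a hypothetical helper. It counts consecutive cycles in which the value returned by ec_receive_processdata() falls below the expected WKC and signals when a chosen limit is reached.

```cpp
// Tracks consecutive cycles with a bad working counter.
struct WkcMonitor {
    int expected_wkc;            // WKC of a healthy cycle
    int max_consecutive_misses;  // limit before recovery is requested
    int misses;                  // current run of bad cycles

    WkcMonitor(int expected, int limit)
        : expected_wkc(expected), max_consecutive_misses(limit), misses(0) {}

    // Feed the WKC of one cycle. Returns true when recovery should start.
    // A negative value (for example a receive timeout) also counts as a miss.
    bool update(int wkc) {
        if (wkc >= expected_wkc) {
            misses = 0;
            return false;
        }
        ++misses;
        return misses >= max_consecutive_misses;
    }
};
```

In SOEM's simple_test the expected value is computed as (ec_group[0].outputsWKC * 2) + ec_group[0].inputsWKC. What recovery means for the eRob once update() returns true, and whether its fault can be cleared at all, remains the open question in this thread.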

@ZeroErrControl
Owner

I am running eRobTest on RTLinux. It connects to the slaves fine, but very quickly it hits a "master goes offline" error. Is there an error tolerance I can set on the devices so they do not report this error?

In the latest upload, I have included the optimized PP mode master project, which has undergone multiple one-hour stability tests. Additional considerations have already been mentioned in my previous responses and will not be repeated here. The eRob_eCoder project can be used for testing purposes, but it should not be directly applied to real-world applications to avoid potential risks or unforeseen losses.
