We have noticed problems with setting up topics between nodes on our noetic system running on a Jetson TX2. It can take a very long time for publishers to be registered with subscribers in certain nodes, which results in delays of up to 1 minute when starting the system. During this time, the CPU load for the nodes in question increases to 100% of a single core. We have only seen this issue since we recently upgraded from kinetic to noetic (which also meant an OS update). We have also only been able to reproduce it on an ARM64 Jetson TX2 running BalenaOS, which hosts a Docker container with the ROS system. The setup is slightly exotic since we built noetic ourselves to run on Ubuntu 18.04, so it is very possible that this is the main problem. Still, it would be great to hear your thoughts on what could cause a problem like this, so we could at least take that information on to Balena or elsewhere. We have not been able to reproduce it on a Jetson TX2 that runs an Ubuntu 18.04 desktop OS, so it is likely caused by the operating system, the container environment, or a combination of the two.
Setup
ARM64 Jetson TX2
BalenaOS
Ubuntu 18.04 Docker container
ROS noetic
The problem
While troubleshooting, we narrowed the problem down to the publisherUpdate call, which can sometimes take up to 30 seconds for a single topic. It is mostly isolated to nodes that publish and subscribe to topics in both directions. We noticed it because it could sometimes take almost a minute before certain callbacks were triggered, even though a rostopic subscriber to the same topic from the terminal would receive all the publications. This makes sense if the subscriber in the node has not yet connected to the publisher.
I have recreated the problem with two simple nodes that just publish and subscribe to topics between each other.
When looking at the master.log for the session, we can see delays in the publisherUpdate calls.
Running the same nodes on other setups does not produce the same delays. It's at most 0.05s for a single topic. We have tried the following setups:
x86_64 ubuntu16.04 kinetic
x86_64 ubuntu18.04 noetic
x86_64 ubuntu18.04 noetic in docker container
aarch64 ubuntu18.04 noetic
rosnode ping
There also seems to be a lot of latency when you try to ping the nodes. This does not improve even when all the topics have been set up. While the ping is trying to contact a node, the CPU usage also spikes, as it did with the publisherUpdate.
$ rosnode ping -a
Will ping the following nodes:
* /nodeY
* /nodeX
* /rosout
pinging /nodeY with a timeout of 3.0s
xmlrpc reply from http://5b282d4:50921/ time=2457.330704ms
pinging /nodeX with a timeout of 3.0s
xmlrpc reply from http://5b282d4:49871/ time=2443.327188ms
pinging /rosout with a timeout of 3.0s
xmlrpc reply from http://5b282d4:49927/ time=2008.521795ms
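To tell whether plain XML-RPC traffic is already slow inside the container, independent of ROS, one option is to time a trivial XML-RPC call against a throwaway local server. The sketch below uses only the Python standard library; the function names, the `ping` method, and the `127.0.0.1` host are assumptions made for illustration and are not part of rosnode or the rosmaster API.

```python
import threading
import time
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer


def start_echo_server(host="127.0.0.1"):
    """Start a throwaway XML-RPC server on a free port; return (server, url)."""
    # Port 0 lets the OS pick a free port; logRequests=False keeps stderr quiet.
    server = SimpleXMLRPCServer((host, 0), logRequests=False)
    # Hypothetical no-op method, only used to measure the round trip.
    server.register_function(lambda: 1, "ping")
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    port = server.server_address[1]
    return server, "http://%s:%d/" % (host, port)


def measure_round_trips(url, count=5):
    """Return the round-trip time in milliseconds for `count` ping calls."""
    proxy = xmlrpc.client.ServerProxy(url)
    times_ms = []
    for _ in range(count):
        start = time.perf_counter()
        proxy.ping()
        times_ms.append((time.perf_counter() - start) * 1000.0)
    return times_ms


if __name__ == "__main__":
    server, url = start_echo_server()
    try:
        for ms in measure_round_trips(url):
            print("xmlrpc reply from %s time=%.3fms" % (url, ms))
    finally:
        server.shutdown()
        server.server_close()
```

If this baseline already shows round trips in the range of seconds, the delay is probably not specific to the ROS nodes. Running it once with the container hostname instead of `127.0.0.1` may also help separate name resolution from the XML-RPC handling itself.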
Final thoughts
There seems to be a lot of latency when communicating with the rosmaster. I think it is probably due to some issue with the underlying network configuration in the BalenaOS container environment. Still, I would be curious to hear if you have better insight into how the rosmaster XMLRPC communication can be affected in this way, and whether there is anything we can do to mitigate it. The CPU load increasing when pinging or setting up new topic connections might also be interesting to you, since it could point to an inefficient waiting loop or something similar.
I had a similar problem when running ROS nodes in docker. What helped was reducing the max file handles to 1024 again (as it is normally set without docker).
You can set this with ulimit -n 1024 in the container, or with the ulimits option in docker-compose.
This is actually what I ended up doing to solve the problem. It seems to be somehow related to how the XMLRPC library deals with file descriptors inside Docker containers. I found issue #1927, which details a very similar problem with the same proposed mitigation of limiting the number of allowed file descriptors. The actual fix that came from it does not help in our case, since our version of noetic already includes it. But changing the ulimit has been good enough for us for the time being.
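To check which file descriptor limit a process actually sees inside the container, and to lower it for that process, something like the following could be used. It is a minimal sketch built on Python's standard `resource` module; the helper names and the default of 1024 are taken from this thread or made up for illustration. Whether lowering the limit from inside a node helps is an assumption; the thread only confirms that `ulimit -n 1024` in the container did.

```python
import resource


def get_nofile_limit():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def lower_nofile_soft_limit(target=1024):
    """Lower the soft limit to at most `target`; never raise it.

    Returns the (soft, hard) limits in effect afterwards.
    """
    soft, hard = get_nofile_limit()
    # An unlimited soft limit also counts as "above target".
    if soft == resource.RLIM_INFINITY or soft > target:
        # Only the soft limit changes; the hard limit is passed through as-is.
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return get_nofile_limit()


if __name__ == "__main__":
    print("before:", get_nofile_limit())
    print("after: ", lower_nofile_soft_limit(1024))
```

The change applies to the calling process and to children it starts afterwards, so it would have to run before any ROS communication is initialised. Printing the limit before changing it is a quick way to confirm whether the container really starts with a much larger value than 1024.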
I might give the proposed fix from #2208 a try out of curiosity, so thanks for the suggestion. I will let you know if I learn anything.