XmlRpcServer::countFreeFDs() takes too long inside docker container #1927

Closed
wenbin1989 opened this issue Apr 9, 2020 · 18 comments · Fixed by #1929

@wenbin1989

Hi there,

We recently upgraded from ROS Kinetic to Melodic and found that node connections take too long to establish, sometimes over 2 seconds.
Our application is very time-sensitive, so this affects us a lot.

After some debugging, I found that the issue is caused by the countFreeFDs() function in xmlrpcpp.

Its comment says:

int XmlRpcServer::countFreeFDs() {
  // NOTE(austin): this function is not free, but in a few small tests it only
  // takes about 1.2mS when querying 50k file descriptors.

But inside a Docker container, the file descriptor limit is usually 1048576:

$ python -c "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE))"
(1048576, 1048576)

which takes a very long time to query.
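For context, here is a standalone sketch of what the function roughly does: poll every fd up to the RLIMIT_NOFILE soft limit with no requested events and count the descriptors reported as POLLNVAL (i.e. not open). This is only an illustration of the approach, not the exact xmlrpcpp code, but it shows why the cost scales with the limit:

// Illustrative sketch only, not the xmlrpcpp implementation: poll every fd up
// to the soft RLIMIT_NOFILE limit and count the ones reported as POLLNVAL
// (not open, therefore free). poll() has to walk the whole array, so the cost
// grows with the limit (~1 million entries in a default Docker container).
#include <poll.h>
#include <sys/resource.h>
#include <cstdio>
#include <vector>

static int count_free_fds()
{
  struct rlimit limit;
  if (getrlimit(RLIMIT_NOFILE, &limit) != 0)
    return 0;

  std::vector<struct pollfd> fds(limit.rlim_cur);
  for (rlim_t i = 0; i < limit.rlim_cur; ++i) {
    fds[i].fd = static_cast<int>(i);
    fds[i].events = 0;   // no events requested; we only care about POLLNVAL
    fds[i].revents = 0;
  }

  int free_count = 0;
  // Timeout 0: report the current state of every descriptor without waiting.
  if (poll(fds.data(), fds.size(), 0) >= 0) {
    for (const struct pollfd &p : fds) {
      if (p.revents & POLLNVAL)  // fd is not open, so it counts as free
        ++free_count;
    }
  }
  return free_count;
}

int main()
{
  std::printf("free fds: %d\n", count_free_fds());
  return 0;
}

Running this under different ulimit -n values shows how the time grows with the limit.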

The Docker version we are using is:

$ docker --version
Docker version 19.03.5, build 633a0ea838

Any idea how to fix this? Thanks

@dirk-thomas added the bug label on Apr 9, 2020
@dirk-thomas
Member

@trainman419 FYI since you added this logic in #1243.

@harlowja

harlowja commented Apr 9, 2020

An example from a bare Ubuntu 18.04 container:

josh@simula:~$ docker run -it ubuntu:18.04
root@a4ed19d3ed43:/# apt-get update > /dev/null
root@a4ed19d3ed43:/# apt-get install -y python > /dev/null
debconf: delaying package configuration, since apt-utils is not installed
root@a4ed19d3ed43:/# python -c "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE))"
(1048576, 1048576)

@harlowja

harlowja commented Apr 9, 2020

$ docker --version
Docker version 19.03.2, build 6a30dfc

@trainman419
Contributor

As a workaround, you could run ulimit -n 50000 before starting your nodes to lower the file descriptor limit.

Most of the time in this function is probably spent in the poll() call because of the huge number of file descriptors being queried. This could probably be reduced by querying file descriptors in smaller batches (maybe 1024 or 16384) and terminating early once we've determined that there are adequate free file descriptors (see the sketch below).

There may also be a faster linux system call to query the number of used or free file descriptors for a process; I did a quick search and didn't find anything, but this is the kind of function that is probably out there somewhere.
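A minimal sketch of the batching idea above, assuming a hypothetical batch size and a caller-supplied "enough free" threshold (this is not code from any of the linked PRs, just an illustration):

// Illustrative sketch of the batched approach described above: poll the fd
// range in fixed-size chunks and stop early once enough free descriptors have
// been found. The batch size and threshold are hypothetical, not values taken
// from xmlrpcpp or the linked PRs.
#include <poll.h>
#include <sys/resource.h>
#include <algorithm>
#include <vector>

static int count_free_fds_batched(int enough_free)
{
  struct rlimit limit;
  if (getrlimit(RLIMIT_NOFILE, &limit) != 0)
    return 0;

  const rlim_t kBatchSize = 1024;  // hypothetical batch size
  std::vector<struct pollfd> batch(kBatchSize);

  int free_count = 0;
  for (rlim_t start = 0; start < limit.rlim_cur; start += kBatchSize) {
    const rlim_t n = std::min(kBatchSize, limit.rlim_cur - start);
    for (rlim_t i = 0; i < n; ++i) {
      batch[i].fd = static_cast<int>(start + i);
      batch[i].events = 0;
      batch[i].revents = 0;
    }
    // Timeout 0: just check the current state of this chunk.
    if (poll(batch.data(), n, 0) < 0)
      break;
    for (rlim_t i = 0; i < n; ++i) {
      if (batch[i].revents & POLLNVAL)
        ++free_count;
    }
    if (free_count >= enough_free)
      break;  // terminate early: we already know there is enough headroom
  }
  return free_count;
}

If the caller only needs to know that some minimum number of descriptors is still available, this would normally finish after the first chunk or two instead of scanning all 1048576 entries.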

@harlowja

harlowja commented Apr 9, 2020

A general question: is there really a reason for this whole change? Was someone actually running out of file descriptors (and unable to increase the limit via ulimit)?

@wenbin1989
Author

I think querying file descriptors in smaller batches and terminating early might be a proper fix?

@fujitatomoya
Contributor

I think this should be taken care of by the system (it is the system's responsibility), not userland. Via accept(), we can detect when the process hits the fd limit by checking for errno EMFILE, and then notify the user with an error telling them they can increase the number of fds.
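As an illustration only (the helper name and the error message are placeholders, not xmlrpcpp API), checking EMFILE around accept() could look roughly like this:

// Sketch of the EMFILE-based idea: instead of pre-counting free fds, let
// accept() fail when the process runs out of descriptors and surface a clear
// error so the user can raise the limit. accept_client() and the message are
// placeholders, not anything from xmlrpcpp.
#include <sys/socket.h>
#include <cerrno>
#include <cstdio>

static int accept_client(int listen_fd)
{
  int client = accept(listen_fd, nullptr, nullptr);
  if (client < 0) {
    if (errno == EMFILE || errno == ENFILE) {
      // Per-process or system-wide fd limit reached: report it instead of
      // silently retrying, so the user knows to raise the limit (ulimit -n).
      std::fprintf(stderr,
                   "accept() failed: out of file descriptors; "
                   "consider raising the limit with ulimit -n\n");
    }
    return -1;
  }
  return client;
}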

@wenbin1989
Author

I think this should be taken care of by the system (it is the system's responsibility), not userland. Via accept(), we can detect when the process hits the fd limit by checking for errno EMFILE, and then notify the user with an error telling them they can increase the number of fds.

Sounds reasonable to me.

@harlowja

Would it be possible to merge some kind of option to turn the check off until this gets fixed overall?

@trainman419
Contributor

A general question: is there really a reason for this whole change? Was someone actually running out of file descriptors (and unable to increase the limit via ulimit)?

Yes; there were bugs in XmlRpc where it would leak file descriptors and then go into an infinite loop when it ran out of them; see https://answers.ros.org/question/250393/rosout-leaks-file-descriptors/, #914, and the seven PRs with tests and bug fixes listed in #1214.

@harlowja

A general question: is there really a reason for this whole change? Was someone actually running out of file descriptors (and unable to increase the limit via ulimit)?

Yes; there were bugs in XmlRpc where it would leak file descriptors and then go into an infinite loop when it ran out of them; see https://answers.ros.org/question/250393/rosout-leaks-file-descriptors/, #914, and the seven PRs with tests and bug fixes listed in #1214.

Why was this codebase using that library then? (Was the upstream XmlRpc library/whatever fixed?)

@trainman419
Contributor

Why was this codebase using that library then? (Was the upstream XmlRpc library/whatever fixed?)

This is a really, really insensitive question to ask.

ROS is over 10 years old, and these bugs were not known when the original authors of ROS chose to use XmlRpc. If you took a few minutes to review the PRs and look at the upstream version of XmlRpc, you'd see that it appears abandoned and has not gotten any updates in many years. ROS has chosen to continue doing maintenance fixes on XmlRpc instead of trying to find a complete replacement of it.

I'm here as a volunteer and because I feel some responsibility to the community. I'm happy to help with solutions, but if you're going to focus on placing blame then you're on your own to fix this.

@harlowja

Nope, not trying to place blame, just asking questions out of curiosity so I can understand.

@fujitatomoya
Contributor

@wenbin1989 @trainman419 @harlowja

Is anybody willing to contribute code for this, or has anyone already started?

@fujitatomoya
Contributor

@wenbin1989 @trainman419 @harlowja

Candidate #1: #1928

Candidate #2: #1929

I'd like to go with Candidate #2, since we do not want to create unexpected problems with this minor fix.

@wenbin1989
If possible, would you try #1929 to check whether it fixes your problem? Thanks in advance.

@wenbin1989
Author

@fujitatomoya will do, thanks!

@wenbin1989
Author

@fujitatomoya I commented in the PR, thanks.

@harlowja

Sweet, will take a look soon.
