XmlRpcServer::countFreeFDs() takes too long inside docker container #1927

Closed
wenbin1989 opened this issue Apr 9, 2020 · 18 comments · Fixed by #1929

@wenbin1989

Hi there,

We recently upgraded from ROS Kinetic to Melodic and found that node connections take too long to establish, sometimes over 2 seconds.
Our application is very time-sensitive, so this affects us a lot.

After some debugging, I found that the issue is caused by the countFreeFDs() function in xmlrpcpp.

Its comment says:

int XmlRpcServer::countFreeFDs() {
  // NOTE(austin): this function is not free, but in a few small tests it only
  // takes about 1.2mS when querying 50k file descriptors.

But inside a Docker container, the file descriptor limit is usually 1048576:

$ python -c "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE))"
(1048576, 1048576)

which takes a very long time to query.
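For context, here is a standalone sketch of what the function roughly does: poll every fd up to the RLIMIT_NOFILE soft limit with no requested events and count the descriptors reported as POLLNVAL (i.e. not open). This is only an illustration of the approach, not the exact xmlrpcpp code, but it shows why the cost scales with the limit:

// Illustrative sketch only, not the xmlrpcpp implementation: poll every fd up
// to the soft RLIMIT_NOFILE limit and count the ones reported as POLLNVAL
// (not open, therefore free). poll() has to walk the whole array, so the cost
// grows with the limit (~1 million entries in a default Docker container).
#include <poll.h>
#include <sys/resource.h>
#include <cstdio>
#include <vector>

static int count_free_fds()
{
  struct rlimit limit;
  if (getrlimit(RLIMIT_NOFILE, &limit) != 0)
    return 0;

  std::vector<struct pollfd> fds(limit.rlim_cur);
  for (rlim_t i = 0; i < limit.rlim_cur; ++i) {
    fds[i].fd = static_cast<int>(i);
    fds[i].events = 0;   // no events requested; we only care about POLLNVAL
    fds[i].revents = 0;
  }

  int free_count = 0;
  // Timeout 0: report the current state of every descriptor without waiting.
  if (poll(fds.data(), fds.size(), 0) >= 0) {
    for (const struct pollfd &p : fds) {
      if (p.revents & POLLNVAL)  // fd is not open, so it counts as free
        ++free_count;
    }
  }
  return free_count;
}

int main()
{
  std::printf("free fds: %d\n", count_free_fds());
  return 0;
}

Running this under different ulimit -n values shows how the time grows with the limit.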

The Docker version we are using is:

$ docker --version
Docker version 19.03.5, build 633a0ea838

Any idea how to fix this? Thanks

@dirk-thomas added the bug label on Apr 9, 2020
@dirk-thomas
Member

@trainman419 FYI since you added this logic in #1243.

@harlowja

harlowja commented Apr 9, 2020

An example from a bare Ubuntu 18.04 container:

josh@simula:~$ docker run -it ubuntu:18.04
root@a4ed19d3ed43:/# apt-get update > /dev/null
root@a4ed19d3ed43:/# apt-get install -y python > /dev/null
debconf: delaying package configuration, since apt-utils is not installed
root@a4ed19d3ed43:/# python -c "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE))"
(1048576, 1048576)

@harlowja

harlowja commented Apr 9, 2020

$ docker --version
Docker version 19.03.2, build 6a30dfc

@trainman419
Contributor

As a workaround, you could run ulimit -n 50000 before starting your nodes to lower the file descriptor limit.

Most of the time in this function is probably spent in the poll() call because of the huge number of file descriptors being queried. This could probably be reduced by querying file descriptors in smaller batches (maybe 1024 or 16384) and terminating early once we've determined that there are adequate free file descriptors (see the sketch below).

There may also be a faster linux system call to query the number of used or free file descriptors for a process; I did a quick search and didn't find anything, but this is the kind of function that is probably out there somewhere.
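A minimal sketch of the batching idea above, assuming a hypothetical batch size and a caller-supplied "enough free" threshold (this is not code from any of the linked PRs, just an illustration):

// Illustrative sketch of the batched approach described above: poll the fd
// range in fixed-size chunks and stop early once enough free descriptors have
// been found. The batch size and threshold are hypothetical, not values taken
// from xmlrpcpp or the linked PRs.
#include <poll.h>
#include <sys/resource.h>
#include <algorithm>
#include <vector>

static int count_free_fds_batched(int enough_free)
{
  struct rlimit limit;
  if (getrlimit(RLIMIT_NOFILE, &limit) != 0)
    return 0;

  const rlim_t kBatchSize = 1024;  // hypothetical batch size
  std::vector<struct pollfd> batch(kBatchSize);

  int free_count = 0;
  for (rlim_t start = 0; start < limit.rlim_cur; start += kBatchSize) {
    const rlim_t n = std::min(kBatchSize, limit.rlim_cur - start);
    for (rlim_t i = 0; i < n; ++i) {
      batch[i].fd = static_cast<int>(start + i);
      batch[i].events = 0;
      batch[i].revents = 0;
    }
    // Timeout 0: just check the current state of this chunk.
    if (poll(batch.data(), n, 0) < 0)
      break;
    for (rlim_t i = 0; i < n; ++i) {
      if (batch[i].revents & POLLNVAL)
        ++free_count;
    }
    if (free_count >= enough_free)
      break;  // terminate early: we already know there is enough headroom
  }
  return free_count;
}

If the caller only needs to know that some minimum number of descriptors is still available, this would normally finish after the first chunk or two instead of scanning all 1048576 entries.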

@harlowja

harlowja commented Apr 9, 2020

A general question: is there really a reason for this whole change? Was someone actually running out of file descriptors (and unable to increase the limit via ulimit)?

@wenbin1989
Author

I think querying file descriptors in smaller batches and terminating early might be a proper fix?

@fujitatomoya
Contributor

I think this should be taken care of by the system (it is the system's responsibility), not userland. Via accept(), we can detect when the process hits the fd limit by checking for errno EMFILE, and then notify the user with an error telling them they can increase the number of fds.
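As an illustration only (the helper name and the error message are placeholders, not xmlrpcpp API), checking EMFILE around accept() could look roughly like this:

// Sketch of the EMFILE-based idea: instead of pre-counting free fds, let
// accept() fail when the process runs out of descriptors and surface a clear
// error so the user can raise the limit. accept_client() and the message are
// placeholders, not anything from xmlrpcpp.
#include <sys/socket.h>
#include <cerrno>
#include <cstdio>

static int accept_client(int listen_fd)
{
  int client = accept(listen_fd, nullptr, nullptr);
  if (client < 0) {
    if (errno == EMFILE || errno == ENFILE) {
      // Per-process or system-wide fd limit reached: report it instead of
      // silently retrying, so the user knows to raise the limit (ulimit -n).
      std::fprintf(stderr,
                   "accept() failed: out of file descriptors; "
                   "consider raising the limit with ulimit -n\n");
    }
    return -1;
  }
  return client;
}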

@wenbin1989
Author

I think this should be taken care of by the system (it is the system's responsibility), not userland. Via accept(), we can detect when the process hits the fd limit by checking for errno EMFILE, and then notify the user with an error telling them they can increase the number of fds.

Sounds reasonable to me.

@harlowja

Would it be possible to merge some kind of option to turn the check off until this gets fixed overall?

@trainman419
Contributor

A general question: is there really a reason for this whole change? Was someone actually running out of file descriptors (and unable to increase the limit via ulimit)?

Yes; there were bugs in XmlRpc where it would leak file descriptors and then go into an infinite loop when it ran out of them; see https://answers.ros.org/question/250393/rosout-leaks-file-descriptors/, #914, and the seven PRs with tests and bug fixes listed in #1214.

@harlowja

A general question: is there really a reason for this whole change? Was someone actually running out of file descriptors (and unable to increase the limit via ulimit)?

Yes; there were bugs in XmlRpc where it would leak file descriptors and then go into an infinite loop when it ran out of them; see https://answers.ros.org/question/250393/rosout-leaks-file-descriptors/, #914, and the seven PRs with tests and bug fixes listed in #1214.

Why was this codebase using that library then? (Was the upstream XmlRpc library/whatever fixed?)

@trainman419
Contributor

Why was this codebase using that library then? (Was the upstream XmlRpc library/whatever fixed?)

This is a really, really insensitive question to ask.

ROS is over 10 years old, and these bugs were not known when the original authors of ROS chose to use XmlRpc. If you took a few minutes to review the PRs and look at the upstream version of XmlRpc, you'd see that it appears abandoned and has not gotten any updates in many years. ROS has chosen to continue doing maintenance fixes on XmlRpc instead of trying to find a complete replacement of it.

I'm here as a volunteer and because I feel some responsibility to the community. I'm happy to help with solutions, but if you're going to focus on placing blame then you're on your own to fix this.

@harlowja

Nope, not trying to place blame, just asking questions out of curiosity so I can understand.

@fujitatomoya
Contributor

@wenbin1989 @trainman419 @harlowja

Is anybody willing to contribute code for this, or has anyone already started?

@fujitatomoya
Contributor

@wenbin1989 @trainman419 @harlowja

Candidate #1: #1928

Candidate #2: #1929

I'd like to go with Candidate #2, since we do not want to create unexpected problems with this minor fix.

@wenbin1989
If possible, would you try #1929 to check whether it fixes your problem? Thanks in advance.

@wenbin1989
Author

@fujitatomoya will do, thanks!

@wenbin1989
Author

@fujitatomoya I commented in the PR, thanks.

@harlowja

Sweet, will take a look soon.
