Reproduction of 80K/sec throughput #14
Comments
The model in the code is the larger model, so if you haven't, try with the small model. On a P100 I get 30+k FPS with the big model. Are you limited by the speed of the learner or the actors?
The memory bandwidth of the P40 is significantly lower than that of the P100, which could explain the difference between 22k and 30+k.
@lespeholt Thanks for the super quick response!
What can cause the CPU bottleneck, e.g. a slow CPU? Also, what kind of network connection between the actors and the learner was used for the throughput reported in the paper? Did you rely on, say, fast Ethernet or InfiniBand? In the run I mentioned, the network traffic goes over the public internet. Can that be a bottleneck, or is it not an important factor?
You need to be able to transfer around 2-3 GB per second in total. Increasing the unroll length decreases the bandwidth requirements, so you can try that.
Thanks! How is the 2-3 GB/sec figure derived (e.g., batch_size * width * height * rollout_len * BytesOfFloat, etc.)? I'm still reading the tf.FIFOQueue code (with capacity=1) and struggling to understand the sync mechanism. I guess answering this question may help me (and others) understand how the Actor code works :) Also, I just asked around and found that I can't get access to a P100; the best GPU available to me is a P40. So please feel free to close the issue.
Actually it's much less for the big model (roughly the same number of parameters as the small model). Per unroll: 30,000 FPS / 400 (unroll length 100 * action repeat 4) * 8.5 MB ~= 650 MB/s, excluding overhead.
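For readers following the arithmetic above, here is a minimal sketch of the estimate in Python. The function name and parameter names are hypothetical, and the 8.5 MB per-unroll figure is simply the number quoted in the reply. The second call illustrates the earlier suggestion that a longer unroll lowers the bandwidth requirement, assuming the per-unroll transfer size stays the same.

```python
def estimated_bandwidth_mb_per_s(env_fps, unroll_length, action_repeat, mb_per_unroll):
    """Rough transfer rate in MB/s, excluding overhead (hypothetical helper)."""
    # Each unroll covers unroll_length agent steps, each repeated action_repeat times.
    env_frames_per_unroll = unroll_length * action_repeat
    unrolls_per_second = env_fps / env_frames_per_unroll
    return unrolls_per_second * mb_per_unroll


# Numbers quoted in the thread: 30,000 FPS, unroll length 100, action repeat 4, 8.5 MB.
baseline = estimated_bandwidth_mb_per_s(30_000, 100, 4, 8.5)
# Assumption: doubling the unroll length leaves the per-unroll size unchanged.
longer_unroll = estimated_bandwidth_mb_per_s(30_000, 200, 4, 8.5)

print(baseline)       # 637.5, i.e. roughly the 650 MB/s quoted above
print(longer_unroll)  # 318.75
```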
I see, that's very clear! Thanks so much!
Hi, some updates. We tried the smaller net from the paper (by modifying the Agent code). The throughput is still around 22K. We also tried several combinations of arguments to see how they affect throughput:
Could be your network in that case. I suggest you take a look at TensorFlow performance timelines.
Hi, Also, am I correct that for distributed execution I will need to modify the cluster spec in the code and pass my own? I'm currently doing that, but I was wondering if I'm missing something there. My current setup:
I'm using Python 3.6 (only had to make minor adjustments like replacing
It's correct that the cluster spec needs to be modified depending on your setup. The advantage of having several machines can be that the network is less saturated. However, your setup should get much higher speeds. The speeds you get are what you would expect when running only on CPU. One thing to note, though: the network in experiment.py is the bigger network described in the paper, so the target speed should be 30+k FPS. Can you verify that you actually run on GPU? I find that the best way to debug performance issues in TensorFlow is to look at the performance timelines. On them you can see whether the learner is waiting on data from the actors, which operations are slow, and which device they run on.
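As an illustration of adapting the cluster spec mentioned above to a multi-machine setup, here is a minimal sketch that builds the job-to-address mapping in plain Python. The hostnames, ports, helper name, and the job names `learner` and `actor` are assumptions made for the example, not values taken from this thread; check them against your copy of experiment.py.

```python
def build_cluster_dict(learner_host, actor_hosts, learner_port=8000, actor_port=8001):
    """Map job names to 'host:port' task addresses (hypothetical helper)."""
    return {
        # Assumed job names; one learner task and one task per actor machine.
        'learner': ['%s:%d' % (learner_host, learner_port)],
        'actor': ['%s:%d' % (host, actor_port) for host in actor_hosts],
    }


# Placeholder hostnames for 150 actor machines and one learner machine.
actor_hosts = ['actor-%d.example.com' % i for i in range(150)]
cluster = build_cluster_dict('learner.example.com', actor_hosts)

print(len(cluster['actor']))  # 150
print(cluster['learner'])     # ['learner.example.com:8000']
```

With TensorFlow 1.x, a dictionary of this shape can be passed to `tf.train.ClusterSpec(cluster)`.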
Thanks for your quick response. Without GPU, we only get about 500 env frames/sec. I made a plot of the GPU utilization using nvidia-smi at 1-second intervals: Should we expect to see permanently high utilization, or is this normal? The results are similar when using a single node and no dynamic batching. With dynamic batching, we see a constant utilization of about 20%. I also reproduced the issue (both single node and distributed) on a fresh install following the Dockerfile and with the following specs: Head node:
2 Child nodes:
This setup also yields only about 5.5k environment frames/sec. I will try to look into the timelines tonight. Is there any other reason you can think of that hurts GPU utilization? Should we try 16 or 32 machines with a few workers each?
In a distributed setup, the utilization should be constantly high. In a single-machine setup, it may be somewhat low since producing the frames will slow it down. With the setup you mention, you should definitely see speeds similar or close to those in the paper. Timelines for both the learner and the actors would be helpful.
Yes, we used 1 CPU per actor. Can you try 150 actors with 1 CPU each? It's a bit hard to interpret the timelines without interacting with them. Since dequeuemany is taking that much time on the learner, it looks like the learner is bottlenecked by the actors or the bandwidth to them. Not sure why there is a gap between the actor steps. If they wait on enqueuing, then it suggests a bottleneck in the learner or the bandwidth. In this case it would then be the network. Can you try creating new variables for each actor, i.e. no sharing of variables? If that is significantly faster, it's network bandwidth.
Hi, I tried to reproduce the 80K/sec throughput reported in the paper, but only got around 22K/sec.
I ran the single learner on a GPU machine (the GPU is P40):
and ran 150 actors, each on a CPU machine (each one is actually a remote Docker machine allocated by a cloud service):
where i denotes the i-th actor.
Could you give some hints on how to reproduce the throughput? Did you require a proprietary intranet connection?