Sequence Buffer Sampling Performance #180
Comments
Oooh that's not ideal. Let's see if we can speed this up. Did you profile with Is it mostly extracting the observations that is slow? I have also noticed that this function can be time consuming, it's just a lot of data to copy if you're dealing with images. In that case I would think the really advanced thing to do would be to make some parallel & pipelined data loader, like people use in supervised learning. Would need to use the read-write lock as in the asynchronous mode. Might get complicated!
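For readers wondering what a pipelined loader could look like in its simplest form, here is a minimal sketch using only the Python standard library. The names `prefetch`, `sample_fn`, and `max_prefetch` are hypothetical and not part of rlpyt. It assumes `sample_fn` is safe to call from a background thread, so it leaves out the read-write lock that a replay buffer written concurrently would need. Whether the copy actually overlaps with optimization depends on the sampling code releasing the GIL; otherwise a process-based variant would be required.

```python
import queue
import threading


def prefetch(sample_fn, num_batches, max_prefetch=2):
    """Yield `num_batches` results of `sample_fn()`, produced by a background thread.

    Hypothetical helper: at most `max_prefetch` batches are held in the queue,
    so the producer stays only slightly ahead of the consumer.
    """
    q = queue.Queue(maxsize=max_prefetch)
    sentinel = object()  # marks the end of the stream

    def producer():
        try:
            for _ in range(num_batches):
                q.put(sample_fn())  # blocks while the queue is full
        finally:
            q.put(sentinel)  # always signal completion, even after an error

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = q.get()  # blocks until the next batch is ready
        if item is sentinel:
            return
        yield item
```

The optimizer loop would then iterate `for batch in prefetch(sample_fn, n): ...`, so the next batch is being extracted while the current one is used.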
CUDA_LAUNCH_BLOCKING doesn't make a difference. I have tested it with just reading one large contiguous part of the buffer and reshaping it. This is much faster (about 20x). So I think the problem is not the amount of data but the scattered positions. I have also tried creating a list with all the indexes that should be read, and then using one torch call to read all of them. But this is just as slow as your implementation with a loop. A parallel data loader would probably be the most elegant solution, but I have given up on it because it got too complicated.
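For reference, the two variants compared above can be sketched like this. This is an illustration in NumPy rather than the actual rlpyt code: the function names are hypothetical, the buffer is assumed to have leading dimensions [T, B], and wrap-around at the end of the buffer is ignored. It only shows that the loop and the single indexing call return the same data; as noted above, the single call is not necessarily faster, since both gather from scattered positions.

```python
import numpy as np


def extract_sequences_loop(buf, t_idxs, b_idxs, seq_len):
    """Copy one length-`seq_len` sequence per (t, b) start index with a Python loop."""
    out = np.empty((seq_len, len(b_idxs)) + buf.shape[2:], dtype=buf.dtype)
    for i, (t, b) in enumerate(zip(t_idxs, b_idxs)):
        out[:, i] = buf[t:t + seq_len, b]  # assumes t + seq_len <= len(buf)
    return out


def extract_sequences_gather(buf, t_idxs, b_idxs, seq_len):
    """Same result with a single advanced-indexing call."""
    t_idxs = np.asarray(t_idxs)
    b_idxs = np.asarray(b_idxs)
    # Time indexes for every sequence, shape [seq_len, n].
    t_mat = t_idxs[None, :] + np.arange(seq_len)[:, None]
    # b_idxs broadcasts along the time axis; result is [seq_len, n, ...].
    return buf[t_mat, b_idxs[None, :]]
```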
Dang, bummer to hear :( An intermediate solution could be to use the asynchronous runner, so that the sampling runs continuously in one process while optimization runs continuously in another. If sampling is the slower part anyway, then this would hide the memory copy time. Does it make it so you can't run the experiment?
I have some performance issues with the sequence buffers. I have traced it to the extract_sequence function in rlpyt/utils/misc.py. It's implemented with a loop over all batch elements. This seems to be quite slow. But I wasn't able to find torch functions that could replace the Python loop.
When I run my RL algo on a V100 the optimization loop spends about 50% of the time in the extract_batch() function.
Has anyone else encountered this problem before and has a solution?