PTDS Benchmarks #517
We've also used adding a matrix with its transpose as a communication benchmark. It could be reasonable here to show that the communication overhead goes to zero while still computing work in parallel.

That's a good idea @jakirkham. I updated the test operations.
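For reference, a minimal sketch of the transpose-sum operation as a Dask/CuPy computation (array and chunk sizes here are illustrative, not the benchmark defaults):

```python
import cupy
import dask.array as da

# CuPy-backed Dask array; adding its transpose forces communication between
# workers that hold the mirrored chunks, while the sum keeps the GPUs busy.
rs = da.random.RandomState(RandomState=cupy.random.RandomState)
x = rs.random_sample((10_000, 10_000), chunks=(2_500, 2_500))

result = (x + x.T).sum()
print(result.compute())
```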
Thanks for writing this up @quasiben! Some questions:
Generally, yes! Some of these operations are already built in here:
Probably. Extending pybench seems like a reasonable way to automate running many of these operations. Many of the tests in the dask-cuda benchmark suite have nice output for testing a set of parameters, but we don't have additional tooling for varying those parameters automatically. This would be very beneficial, especially if plotting was also automated! I also think doing a bit of a dive on some of these operations with a manual setup would be useful. We could use tools like pynvtx to confirm kernels are launching on multiple threads.
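One way to do that manual check is to wrap each thread's work in NVTX ranges and look for them in an Nsight Systems timeline; a sketch (the `nvtx` package usage and range names here are my own, not part of the existing benchmarks):

```python
import threading

import cupy
import nvtx  # pip install nvtx

def worker(tid):
    # Name this thread's work so it shows up as a labeled range in nsys.
    with nvtx.annotate(f"thread-{tid}-sum", color="green"):
        x = cupy.random.random((5_000, 5_000))
        x.sum()
        cupy.cuda.Device().synchronize()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Running this under `nsys profile` should show, with PTDS enabled, kernels from different threads landing on different streams.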
After touching base with @pentschev, I think it may be easier to extend the existing dask-cuda benchmarks. For plotting, pybench already has some tools, although @pentschev mentioned the process of generating the plots in his example notebook took some toil, so I'm not sure how easy it would be to automate for wider use cases. I think a good start for now would be outputting benchmarking results in JSON that could be parsed by pybench's plotting utils, which could either be done:
Made a fork with some additional operations for testing and an argument to output JSON of the results:

Some more thought may need to go into device sync and warm-up, as in pybench, but for now I'll try this out and see if there are noticeable results.
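The sync/warm-up pattern in question is roughly the following (a sketch of the general idea, not the fork's actual code):

```python
import time

import cupy

def timed(op, *args, warmup=1, iterations=3):
    """Time `op`, with warm-up runs and device sync around the timed region."""
    for _ in range(warmup):
        op(*args)                      # warm up kernels, plans, and the memory pool
    cupy.cuda.Device().synchronize()   # make sure warm-up work has finished
    start = time.perf_counter()
    for _ in range(iterations):
        op(*args)
    cupy.cuda.Device().synchronize()   # don't stop the clock before kernels finish
    return (time.perf_counter() - start) / iterations

x = cupy.random.random((4_000, 4_000))
print(f"{timed(lambda a: (a + a.T).sum(), x):.4f} s per iteration")
```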
Looks great! When you're ready, be sure to open a PR! 😄

Done! #524

From benchmarks using 4 GPUs and 1, 2, 4, and 8 threads, there wasn't a remarkable difference in performance; a plot of these results:

I'm going to scale back the number of GPUs and check out what the Nsys profile and Dask performance reports look like.

Interesting. It might be worth comparing notes with Peter to make sure things are configured correctly. A couple of questions: were you able to see multiple streams in use (cupy/cupy#4322 (comment))? Also, are you using the changes from PR cupy/cupy#4322 (note this is not in a CuPy release yet)?

IIRC, I was only able to see a significant difference on transpose sum, and besides the CuPy PTDS support, I also needed rapidsai/rmm#633 and UCX. Could you try those out as well @charlesbluca?

Sure! I'll run the benchmarks again and see how it looks.
This PR adds the following operations to the local CuPy array benchmark:

- sum
- mean
- array slicing

This also adds an additional special argument, `--benchmark-json`, which takes an optional path to dump the results of the benchmark in JSON format. This would allow us to generate plots using the output, as discussed in #517.

Some thoughts:

- Should there be an additional argument to specify the array slicing interval (which is currently fixed at 3)?
- Could the JSON output be cleaned up? Currently, a (truncated) sample output file looks like:

```json
{
    "operation": "transpose_sum",
    "size": 10000,
    "second_size": 1000,
    "chunk_size": 2500,
    "compute_size": [10000, 10000],
    "compute_chunk_size": [2500, 2500],
    "ignore_size": "1.05 MB",
    "protocol": "tcp",
    "devs": "0,1,2,3",
    "threads_per_worker": 1,
    "times": [
        {
            "wall_clock": 1.4910394318867475,
            "npartitions": 16
        }
    ],
    "bandwidths": {
        "(00,01)": {
            "25%": "136.34 MB/s",
            "50%": "156.67 MB/s",
            "75%": "163.32 MB/s",
            "total_nbytes": "150.00 MB"
        }
    }
}
```

Authors:
- Charles Blackmon-Luca (@charlesbluca)

Approvers:
- Mads R. B. Kristensen (@madsbk)
- Peter Andreas Entschev (@pentschev)

URL: #524
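With that output format, results from multiple runs could be collected for plotting along these lines (a sketch assuming one JSON file per run in a `results/` directory; the field names come from the sample above):

```python
import glob
import json

import pandas as pd

records = []
for path in glob.glob("results/*.json"):
    with open(path) as f:
        run = json.load(f)
    records.append(
        {
            "operation": run["operation"],
            "size": run["size"],
            "chunk_size": run["chunk_size"],
            "threads_per_worker": run["threads_per_worker"],
            # average wall clock across the recorded runs
            "wall_clock": sum(t["wall_clock"] for t in run["times"]) / len(run["times"]),
        }
    )

df = pd.DataFrame(records)
print(df.groupby(["operation", "threads_per_worker"])["wall_clock"].mean())
```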
It turns out the initial runs I did on TCP didn't have

In particular, it looks like mean, sum, and transpose sum have some noticeable speed-up. Some observations:
It might be worth creating larger data and/or using smaller chunks to see if this continues to hold up.

I forgot to mention, for UCX you should also pass
Sure! Running them now; I'll update the notebook once it finishes. I'll also play around with array/chunk size. Any insight on how the size of the second array in SVD should scale? Also, are there any good examples of comparative bar plots that have more than one independent variable? I'm trying to think about the best way to show the impact of changing array size and thread count in one plot.
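On the bar plot question, one common approach is to put one variable on the x-axis and encode the other as grouped bars, e.g. via a pandas pivot (a sketch with made-up numbers, not actual benchmark results):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical wall-clock times (seconds) for a single operation.
df = pd.DataFrame(
    {
        "array_size": [10000, 10000, 20000, 20000],
        "threads_per_worker": [1, 4, 1, 4],
        "wall_clock": [1.5, 1.2, 5.8, 4.9],
    }
)

# Rows become x-axis groups (array size), columns become the grouped bars (threads).
pivot = df.pivot(index="array_size", columns="threads_per_worker", values="wall_clock")
pivot.plot(kind="bar", rot=0)
plt.ylabel("wall clock (s)")
plt.legend(title="threads per worker")
plt.tight_layout()
plt.savefig("benchmark_comparison.png")
```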
Updated the notebook; looks like the main difference with

@charlesbluca as we discussed in our standup meeting last Friday, we also need to ensure UCX synchronizes on PTDS only, rather than on the default stream. Sorry for forgetting about this earlier, but could you try running the UCX benchmarks again with https://github.com/pentschev/distributed/tree/ptds-support ?

Sure! Do I have to specify any options in the Dask configuration?

No, right now it will rely on your environment variables: https://github.com/pentschev/distributed/blob/c1cf54c6fc584efbc423ad0b49608817deb08d88/distributed/comm/ucx.py#L46-L49. We may change that in the future, though.
Thanks for the clarification! Right now, I'm trying out some different parameters for array/chunk size; currently, I'm thinking that for a holistic test of any single operation, we should try:

Array sizes (may need some smaller sizes to run for array multiplication):
Chunk sizes:
Second size (SVD only):
We could also play around with GPU count to see if that has a discernible impact; a sketch of how such a sweep could be automated is below.
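One way to drive a sweep like this is to shell out to the benchmark script over a parameter grid (a sketch; the module path and long option names are assumptions based on the options reflected in the JSON output above, so check them against `--help` before use):

```python
import itertools
import subprocess

# Hypothetical grid; fill in the array/chunk sizes discussed above.
array_sizes = [10_000, 20_000, 40_000]
chunk_sizes = [2_500, 5_000, 10_000]
thread_counts = [1, 2, 4, 8]

for size, chunk, threads in itertools.product(array_sizes, chunk_sizes, thread_counts):
    out = f"results/sum_{size}_{chunk}_{threads}.json"
    subprocess.run(
        [
            "python", "-m", "dask_cuda.benchmarks.local_cupy",
            "--operation", "sum",
            "--size", str(size),
            "--chunk-size", str(chunk),
            "--threads-per-worker", str(threads),
            "--benchmark-json", out,
        ],
        check=True,
    )
```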
I ended up using this testing suite for one operation (sum) and got some pretty strange results. In particular, we can see that the results aren't consistent with what was observed before at array/chunk size 40,000/10,000; I'm not sure if this is related to running all of these tests in short succession. For reference, here are the Dask profiles for this operation with 1 and 2 threads per worker (2 threads should be faster, but was not):

It looks like the results from before had something to do with generating the Dask performance report; removing that argument gave me results similar to the earliest benchmarks. Playing around with array/chunk size, some interesting observations:

I'm interested in whether the smaller chunk size / larger thread count would continue scaling, although it might be impractical to have more than 32 threads per worker.
I think the different results when generating performance reports could have something to do with device synchronization not being done: see dask-cuda/dask_cuda/benchmarks/local_cupy.py, lines 86 to 98 (at d78c668).

Is there any reason device sync isn't being done when we opt for a performance report?
I don't think so. Sounds like a bug (thanks for finding that, btw 😄). Would you like to submit a PR?

As John mentioned, this is indeed a bug. Could you submit a PR for that when you have a chance?
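For reference, the fix being discussed amounts to synchronizing the workers' devices before stopping the clock in the performance-report branch as well; a simplified sketch of the intended shape (not the actual benchmark code at those lines):

```python
from time import perf_counter

import cupy
from distributed import performance_report

async def run_once(client, computation, report_path=None):
    start = perf_counter()
    if report_path is not None:
        async with performance_report(filename=report_path):
            await client.compute(computation)
            # Sync here too; otherwise the timing stops before the GPU
            # kernels launched by the tasks have actually finished.
            await client.run(lambda: cupy.cuda.Device().synchronize())
    else:
        await client.compute(computation)
        await client.run(lambda: cupy.cuda.Device().synchronize())
    return perf_counter() - start
```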
We actually replace CuPy's memory pool for these benchmarks: see dask-cuda/dask_cuda/benchmarks/utils.py, line 251 (at e72d776).

Using RMM's default is a good balance, so I'd say yes. You can specify

EDIT: If you're asking about these benchmarks in particular, you can specify 90-95% of the GPU's total memory to prevent any further allocations.
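Outside the benchmark scripts, that kind of setup looks roughly like this (API names from the RMM/CuPy versions current at the time; the pool size is illustrative):

```python
import cupy
import rmm

# Back CuPy allocations with an RMM pool so most of the device memory is
# reserved up front and further allocations come out of the pool.
rmm.reinitialize(pool_allocator=True, initial_pool_size=30 * 2**30)  # ~30 GiB
cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)

x = cupy.random.random((10_000, 10_000))  # now served from the RMM pool
```

With dask-cuda, the corresponding knob is `rmm_pool_size` on `LocalCUDACluster` (or `--rmm-pool-size` on `dask-cuda-worker`).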
What I meant to say is that in the benchmark code itself, it looks like

Sounds like a bug. Would you like to do a PR to fix that?

Sure! Is it okay if I roll this into a larger PR with the additional operations + NVTX traces?

Sure, I don't think there's much rush for this right now. And thanks for catching that @charlesbluca!

No problem! Opened up PR #548.
Ran some benchmarks with column sum and gather; in general, it looks like performance is better with a larger chunk size, though increased threads don't seem to have a consistent impact.

For sum, there is a spike in performance at the largest chunk size. This is surprising to me, as it seems from the Nsys profiles that there aren't too many kernel calls happening here, mostly UCX communication:
Sure, what would the code look like running this without a mask?

Sorry, scratch that. What are the chunk sizes here? How many chunks are we using?

Chunk sizes are 31250, 62500, 125000, and 250000, so 16, 8, 4, and 2 chunks on the sliced array and 32, 16, 8, 4 chunks on the index array, respectively.
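For context, the sum and gather operations at those sizes are along these lines (a sketch; the exact benchmark code may differ):

```python
import cupy
import dask.array as da

rs = da.random.RandomState(RandomState=cupy.random.RandomState)

# Sliced array: 500k elements; index array: 1M elements, so at the same chunk
# size the index array has twice as many chunks, matching the counts above.
x = rs.random_sample((500_000,), chunks=(125_000,))
idx = rs.randint(0, x.shape[0], (1_000_000,), chunks=(125_000,))

total = x.sum()     # sum: reduce the whole array
gathered = x[idx]   # gather: fancy-index with a (dask) integer array
```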
Thanks for #553 @jakirkham! Column masking behaves similarly to column sum.

The main difference in the profiles (125000 vs 250000) is that there's a lot more space between the kernel calls and the UCX communication. It also looks like there are some PtoP copies happening only for the larger chunk size? I've highlighted where that's happening.
As I wrote in #536 (comment), this is probably due to
This issue has been labeled

Still active
With cupy/cupy#4322 merged in, we can begin profiling/benchmarking the performance of more workers per GPU.
Setup
python -m pip install .
`nthreads`: 1, 2, 4, 8, 32

Test Operations
Note @pentschev previously tested these operations on a single GPU:
Example of CuPy with Dask
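A minimal sketch of what this looks like on a multi-GPU node with more than one thread per worker (cluster options here are illustrative):

```python
import cupy
import dask.array as da
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    # One worker per visible GPU; PTDS becomes interesting once
    # threads_per_worker > 1, since each thread gets its own default stream.
    cluster = LocalCUDACluster(threads_per_worker=4)
    client = Client(cluster)

    rs = da.random.RandomState(RandomState=cupy.random.RandomState)
    x = rs.random_sample((40_000, 40_000), chunks=(10_000, 10_000)).persist()

    print((x + x.T).sum().compute())

    client.close()
    cluster.close()
```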
cc @charlesbluca