[QST]: How to run multi GPU cugraph leiden for a large graph #4884
Comments
"I think you're going to need a bigger boat" :-) I don't have exact measurements here, I'm going to do some hand-wavy math. We did scale testing of Louvain (which Leiden builds upon) from C++ (without the python dask/CUDF overhead). During that scale testing on A100s with 80 GB of memory, we needed 8 GPUs to first create the graph and then execute the algorithm. Leiden adds some additional memory pressure, but I have not measured that at scale. Let's assume it's an additional 10% to have memory to compute the refinement and to store both Leiden and Louvain clustering results during intermediate processing. That pushes us to a hand-wavy 9 GPUs. When we run this from python, as you are above, you have dask/cudf and the dask/cugraph layers also using GPU memory. The python approach above will use GPU memory when creating the persisted DataFrame, then it will use GPU memory in the DataFrame itself. cugraph is not allowed, when creating the graph, to delete the DataFrame memory, so this adds at least 70GB of GPU memory to the required footprint. Again, some hand-wavy math... I would think you would need a minimum of 10 GPUs, 12 GPUs would be safer, 16 GPUs would give you more margin for error in my hand-waviness and slightly better performance because you'd have more balance across the nodes. If I'm off in my projections, it's probably that I've forgotten something... so more GPUs is probably better than fewer. I suspect that if you push to 6 or 8 GPUs you'll be able to create the graph but will fail in Leiden. We don't currently have a memory spilling capability within cugraph. Generally, graph algorithms (due to the nature of the memory accesses having poor locality) perform poorly with data stored outside of the GPU memory. We have a number of options we are pursuing in the long run, but at the moment if you run out of memory you need to run with more GPUs. Using something like managed memory (which is frequently a linear slowdown for applications with better memory access patterns) typically results in terrible thrashing of the memory system. Specifically, in managed memory the system would have to bring an entire page of memory from host memory to GPU memory and we might only access one 8 byte value from that page before it gets ejected. Because of that I don't really have a better recommendation. |
Thanks @ChuckHastings, that's what I thought. But realistically, with the advances in single-cell sequencing and spatial transcriptomics, the problem is only going to get worse. Right now we have a 120-million-cell dataset already sequenced, and we cannot possibly run the standard pipelines without an insanely expensive GPU node with 50-100 GPUs based on what you suggested above. The dataset above, with 3.7 billion edges, was a KNN graph of only 50 million cells. Since the field relies heavily on graph-based clustering algorithms, can we make it a priority to get these algorithms to scale on a reasonable amount of resources (4-8 A100s)? As a matter of fact, even that is out of reach for many biomedical institutions that rely on grant funding.
We will consider this as we identify our priorities. We meet in March to discuss our priorities for the next year; I can update the issue after we have that discussion.
We do have the ability to enable spilling for cudf:

```python
def enable_spilling():
    """
    Enable spilling to host memory for cudf APIs in the process calling this
    function.
    """
    import cudf
    cudf.set_option("spill", True)

...

if __name__ == "__main__":
    ...
    print(f"Number of workers started are {len(client.has_what().keys())}")

    # Enable spilling on the client and all workers.
    #
    # This allows computations that require large amounts of working memory to
    # temporarily move GPU data not needed by the current computation into host
    # memory to prevent GPU out-of-memory errors. Computations that still
    # require more GPU memory than available even after spilling non-essential
    # GPU data will still result in an out-of-memory error. Spilling is done
    # only if necessary, and there's no performance impact if GPU data and GPU
    # working memory fit in the total GPU memory, so it's recommended this be
    # enabled for most workflows.
    enable_spilling()
    client.run(enable_spilling)
    ...
```
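For context, here is a self-contained sketch of how that spilling helper might be wired into a multi-GPU Leiden run. This is not the original poster's code; the cluster arguments, file path, column names, and the exact return shape of the Leiden call are assumptions based on cugraph's dask API:

```python
# A minimal sketch (not the original poster's code) of wiring cudf spilling
# into a multi-GPU cugraph Leiden run. The file path, column names, cluster
# arguments, and the exact return shape of leiden() are assumptions.

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf
import cugraph
import cugraph.dask as dask_cugraph
from cugraph.dask.comms import comms as Comms


def enable_spilling():
    """Enable spilling to host memory for cudf APIs in the calling process."""
    import cudf
    cudf.set_option("spill", True)


if __name__ == "__main__":
    # One worker per visible GPU (4 x A100 in the original question).
    cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1,2,3")
    client = Client(cluster)
    Comms.initialize(p2p=True)

    # Enable cudf spilling on the client process and on every worker.
    enable_spilling()
    client.run(enable_spilling)

    # Distributed edge list; "edges.parquet" and the column names are placeholders.
    edgelist = dask_cudf.read_parquet("edges.parquet")

    G = cugraph.Graph(directed=False)
    G.from_dask_cudf_edgelist(
        edgelist, source="src", destination="dst", edge_attr="weight"
    )

    # Multi-GPU Leiden; recent cugraph releases return the per-vertex
    # partition assignments along with the modularity score.
    parts, modularity = dask_cugraph.leiden(G)
    print(parts.head())
    print(f"modularity: {modularity}")

    Comms.destroy()
    client.close()
    cluster.close()
```

The key detail is that enable_spilling runs both on the client process (for any local cudf work) and, via client.run, on every dask worker.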
Hey @rlratzel, is it different than setting
Ah, yes, that may have the same effect. @VibhuJawa, can you provide some insight on this?
Yup, they do have the same effect.
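For reference, cudf spilling can also be enabled process-wide via the CUDF_SPILL environment variable rather than in code (this may or may not be the setting referenced above); a small sketch:

```python
# Sketch of the environment-variable route described in the cudf docs.
# The variable must be set before cudf is imported in the process, e.g.
#
#   CUDF_SPILL=on python my_script.py
#
# or, equivalently, from Python before the first cudf import:
import os
os.environ["CUDF_SPILL"] = "on"

import cudf
assert cudf.get_option("spill")   # spilling is now enabled in this process
```

On a dask cluster the variable has to be present in every worker process as well, which is why the client.run(enable_spilling) pattern above is a convenient alternative.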
Thanks @VibhuJawa. So @abs51295, I think you're already doing the best you can with the current code. We'll discuss this in our meeting in March to prioritize tasks for the next year, and I'll update this issue once we have identified a priority.
What is your question?
Hey, I am trying to run multi-GPU Leiden clustering on a large graph with 3.7 billion edges. I have 4 NVIDIA A100 GPUs with 80 GB of VRAM each. I am running into memory issues when I try to run it, and I was wondering if you have any suggestions on how to handle such a large graph. Here's my code:
Here's the error message I get: