Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exceeded protobuf size of 2gb #51

Open
dlr2 opened this issue Dec 2, 2024 · 0 comments
Open

exceeded protobuf size of 2gb #51

dlr2 opened this issue Dec 2, 2024 · 0 comments

Comments

@dlr2
Copy link

dlr2 commented Dec 2, 2024

I'm running some tests on SQL queries that use WHERE, GROUP BY and/or ORDER BY phrases. That is no JOINS atm. I was able to create a cluster manually, with six nodes that each have 2 GPUS plus a head node. This shows up in the UI as I would expect; I also have grafana and prometheus running with it. I am using a python client. But some issues:

  • In some cases, I get the error: ray.rpc.RequestWorkerLeaseRequest exceeded maximum protobuf size of 2GB" (I think its looking for 16GB in one case)
  • To add, from what I can see all of my queries are processed in one stage. And the output shows "Forcing reduce stage concurrency from 1200 to 1" then all compute takes place on the first node of the cluster. The 1200 is about the number of cores I have across the cluster and the value I used to initialize DatafusionRayContext. Might be that the query plan deduced doesn't allow for parallelism?
    • BTW, it seems odd that after ray.init which connects to the cluster I then need DatafusionRayContext(int) to define how many CPUs I want. I would think that the connection object could be passed like DatafusionRayContext(InitializationObject) and the resulting context would then use all of the cluster. Otherwise, I don't see how to get configs from the ray.init to then set the number of CPUs dynamically -- and there is no option for GPUs from connection info -- but maybe datafusion-ray doesn't use GPUs
  • In some cases, where I can simply run datafusion directly, the datafusion-ray version (also running on one node per above) can be up to 8x slower.

Unfortunately, I can't provide examples or datasets as my tests are internal.

Thanks!

PS: I hope this is the correct place for this type of feedback. I didn't see a forum or anything. And apologies for noting more than one issue in the same post. I realize that you are only at version 0.1.0 so I hope you can view this as helpful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant