[FEA] Rapids Plugin is able to recognize the GPU device used in Executor and leave a record in Spark Eventlog #12151

Open
wjxiz1992 opened this issue Feb 17, 2025 · 2 comments
Labels
feature request New feature or request

Comments

@wjxiz1992
Collaborator

Is your feature request related to a problem? Please describe.
I would like the RAPIDS Accelerator for Apache Spark to recognize the GPU device SKU/type and write it to the Spark event log.

Rationale:

  1. Future features may start to require this information for auto-tuning.
  2. A record in the event log lets spark-rapids-tools read the GPU type and suggest configs based on it.
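
To illustrate the second point: Spark writes the event log as one JSON object per line, each with an `Event` field, so a tool could scan for a GPU record. This is only a sketch. The event name `GpuDeviceInfo` and the field `deviceCounts` are hypothetical; nothing in the plugin emits them today.

```python
import json

# Hypothetical event name; the plugin does not emit this today.
GPU_EVENT = "GpuDeviceInfo"

def find_gpu_devices(lines):
    """Return the deviceName -> count mapping from the first matching event, or None."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or corrupt lines
        if isinstance(event, dict) and event.get("Event") == GPU_EVENT:
            # "deviceCounts" is an assumed field name for this sketch.
            return event.get("deviceCounts", {})
    return None
```

A tool would call `find_gpu_devices(open(path))` on an uncompressed event log and fall back to its current behavior when the result is `None`.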

Describe the solution you'd like
No good idea yet. In the common case the driver node has no GPU installed, so nvidia-smi must be called on the executors. Maybe a SparkEvent can help here.
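
As a rough sketch of the executor-side call, assuming nvidia-smi is on the PATH of the executor host, the device names could be read like this. The helper names `parse_gpu_names` and `query_gpu_names` are made up for illustration.

```python
import subprocess

def parse_gpu_names(output):
    """Split `nvidia-smi --query-gpu=name --format=csv,noheader` output into names."""
    return [line.strip() for line in output.splitlines() if line.strip()]

def query_gpu_names():
    """Run nvidia-smi on the local host; this only works on a node with a GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_names(result.stdout)
```

Keeping the parsing separate from the subprocess call means the parsing can be tested on a machine without a GPU, such as the driver.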
Describe alternatives you've considered
/

Additional context
/

@wjxiz1992 added the "? - Needs Triage" and "feature request" labels Feb 17, 2025
@wjxiz1992 changed the title from "[FEA]" to "[FEA] Rapids Plugin is able to recognize the GPU device used in Executor and leave a record in Spark Eventlog" Feb 17, 2025
@mattahrens
Collaborator

@revans2 -- would this be feasible along with planned zero config changes for concurrentGpuTasks?

@mattahrens removed the "? - Needs Triage" label Feb 18, 2025
@gerashegalov
Collaborator

I think we should do this in JNI on the executors, using cudaDeviceProp (https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html#structcudaDeviceProp_11e26f1c6bd42f4821b7ef1a4bd3bd25c) instead of nvidia-smi, and expose it as a metric. The driver can then store an event with deviceName:count pairs.
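
The driver-side aggregation into deviceName:count pairs might look roughly like the following. It is a sketch only: it assumes each executor has already reported its device names (for example from the `name` field of `cudaDeviceProp`), and the function name and input shape are hypothetical.

```python
from collections import Counter

def aggregate_device_names(names_per_executor):
    """Count devices by name across executors.

    names_per_executor maps an executor id to the list of device names
    it reported (an assumed input shape for this sketch).
    """
    counts = Counter()
    for names in names_per_executor.values():
        counts.update(names)
    return dict(counts)
```

The resulting dictionary is what the driver would serialize into the event, so a heterogeneous cluster shows up as several deviceName entries rather than one.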
