Seeming unnecessary CGO depdenency in nvmlreceiver #180

Dylan-M · 2023-09-12T20:12:17Z

Currently, the nvmlreceiver is using a single dependency on a C library here:
https://github.com/GoogleCloudPlatform/opentelemetry-operations-collector/blob/master/receiver/nvmlreceiver/testcudakernel/test_cuda_kernel.cc

Based on my review of, but with limited understanding of the Nvidia libraries in general, this could be replaced with a native golang library.

From: https://docs.nvidia.com/cuda/cublas/index.html#id92

cublasStatus_t cublasXtDgemm(cublasXtHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k,
                           const double          *alpha,
                           const double          *A, int lda,
                           const double          *B, int ldb,
                           const double          *beta,
                           double          *C, int ldc)

From: https://pkg.go.dev/gorgonia.org/cu/blas#Standard.Dgemm

func (impl *Standard) Dgemm(tA, tB blas.Transpose, m, n, k int, alpha float64, a []float64, lda int, b []float64, ldb int, beta float64, c []float64, ldc int)

These function definitions seem to be identical, with the exception of the cublasXtHandle. That handle wouldn't be needed in the pure golang implementation; so it makes sense for it to be missing.

If I am correct, this would be a nice change. Anything that can strip a collector down to a pure golang implementation is good in my book.

If I am incorrect in my belief that these libraries are functionally the same, I'm happy to be corrected on it.

The text was updated successfully, but these errors were encountered:

braydonk · 2023-09-12T20:27:09Z

Looping in @suffiank for implementation details here.

Dylan-M · 2023-09-21T12:49:04Z

Any thoughts? This has come up in discussion again on a slack group.

braydonk · 2023-09-21T15:21:41Z

Hi @Dylan-M,

Suffian is quite busy so I took a look. I'm not the primary point of contact for GPU-related stuff so this is best effort, but this is what I found.

I think getting rid of the C code would be good. The library does look like it will do what that little C++ load generator does. This would get rid of C++ code in this codebase which would be cleaner.

However, this is unfortunately not the only CGO dependency in this receiver. The NVML client we use is simply Go bindings for libnvidia-ml. See https://github.com/NVIDIA/go-nvml for more info.
What this means is that we probably can indeed get rid of that C++ code and make our tests a bit easier to work with, but it will not unburden this receiver from CGO entirely.

As such I don't think we're going to be able to prioritize this any time soon, but we welcome a PR to make that change that gets rid of C++ in this codebase. Thank you for raising the issue!

Dylan-M · 2023-09-21T15:30:34Z

Thank you for having a look, a coworker also pointed out what you said about linking to that external library; which I had forgotten to look at external deps. Still a change that is worth making, however I fully agree with your assessment and prioritization.

quentinmit · 2023-09-29T04:00:33Z

In Ops Agent we run a demo app from the CUDA SDK: https://github.com/GoogleCloudPlatform/ops-agent/blob/master/integration_test/third_party_apps_data/applications/nvml/exercise

But as @braydonk mentions, go-nvml requires cgo for everything, so we can't get GPU metrics without it.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Seeming unnecessary CGO depdenency in nvmlreceiver #180

Seeming unnecessary CGO depdenency in nvmlreceiver #180

Dylan-M commented Sep 12, 2023

braydonk commented Sep 12, 2023

Dylan-M commented Sep 21, 2023

braydonk commented Sep 21, 2023 •

edited

Loading

Dylan-M commented Sep 21, 2023

quentinmit commented Sep 29, 2023

Seeming unnecessary CGO depdenency in nvmlreceiver #180

Seeming unnecessary CGO depdenency in nvmlreceiver #180

Comments

Dylan-M commented Sep 12, 2023

braydonk commented Sep 12, 2023

Dylan-M commented Sep 21, 2023

braydonk commented Sep 21, 2023 • edited Loading

Dylan-M commented Sep 21, 2023

quentinmit commented Sep 29, 2023

braydonk commented Sep 21, 2023 •

edited

Loading