GPU utilization in spark error #1294

Open
m15pradeep opened this issue Feb 4, 2025 · 7 comments
@m15pradeep

I'm using the attached install_gpu_driver.sh on Dataproc 2.2. The GPU is not getting recognized in Spark. Installation logs are attached for reference.

dataproc-gpu-main.txt
dataproc-initialization-script-0.log
install_gpu_driver.txt

Command:
Library: tensorflow[and-cuda]
import tensorflow as tf
print(tf.config.list_physical_devices('CPU'))
print(tf.config.list_physical_devices('GPU'))

Log:
2025-02-04 06:04:29.413125: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2025-02-04 06:04:31.327492: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1738649071.978220 80542 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1738649072.400870 80542 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-04 06:04:36.546724: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-02-04 06:04:50.776297: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:152] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2025-02-04 06:04:50.776349: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:137] retrieving CUDA diagnostic information for host: gpu-nvidia-l4-a363a292-a17a0463-m
2025-02-04 06:04:50.776359: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:144] hostname: gpu-nvidia-l4-a363a292-a17a0463-m
2025-02-04 06:04:50.776482: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:168] libcuda reported version is: 570.86.15
2025-02-04 06:04:50.776521: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:172] kernel reported version is: 570.86.15
2025-02-04 06:04:50.776531: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:259] kernel version seems to match DSO: 570.86.15
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
[]
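
A quick check that separates a kernel-driver problem from a CUDA user-space problem is to query the device directly on the node running the job; a minimal sketch using standard NVIDIA driver tooling (not specific to this init action):

# confirm the kernel driver can reach the GPU at all
nvidia-smi
# confirm the device nodes exist and are readable by the user running the job
ls -l /dev/nvidia*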

@cjac
Contributor

cjac commented Feb 13, 2025

Hello Pradeep,

Can you please tell me which GPU you are attempting to utilize?

In order to make use of P4, P100, and V100 GPUs, the current GPU installer requires that you select a Dataproc image version no later than 2.0.67-debian10, 2.1.46-debian11, or 2.2.3-debian12, due to a recent policy change in the Linux kernel.
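
For illustration, pinning the image version when attaching a V100 might look roughly like this (a sketch, not the documented procedure; the bucket, accelerator counts, and region default are placeholders to adjust for your project):

BUCKET=<your_init_actions_bucket>
CLUSTER=<cluster_name>
gcloud dataproc clusters create ${CLUSTER} \
  --image-version 2.2.3-debian12 \
  --master-accelerator type=nvidia-tesla-v100,count=1 \
  --worker-accelerator type=nvidia-tesla-v100,count=1 \
  --initialization-actions gs://${BUCKET}/install_gpu_driver.sh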

If you can tell me what GPU you have attached, and what version of CUDA/kernel driver you are targeting, I will be able to help further.

C.J.

@cjac cjac self-assigned this Feb 13, 2025
@m15pradeep
Author

Hi CJ,

My requirement is that all GPUs available in the region should work with the 2.2-debian12 image, and the default CUDA/kernel driver version is sufficient.

For example, if region us-east4 is selected, all the listed GPUs should work

[Screenshot: GPU types available in region us-east4]

Regards,
Pradeep

@cjac
Contributor

cjac commented Feb 18, 2025

Okay, I'm working on that now. Do you want a link to a pre-release version for testing, or would you like to wait until I publish it to GCS?

@m15pradeep
Author

Hi CJ,

Thank you for the update. Do you have a timeline for when this will be published to GCS?

Regards,
Pradeep

@cjac
Contributor

cjac commented Feb 18, 2025

My guess is within the next two weeks.

I'd like to bring your attention to this section of the README.md, however:

https://github.com/GoogleCloudDataproc/initialization-actions?tab=readme-ov-file#how-initialization-actions-are-used


⚠️ NOTICE: For production usage, before creating clusters, it is strongly recommended
that you copy initialization actions to your own Cloud Storage bucket to guarantee consistent use of the
same initialization action code across all Dataproc cluster nodes and to prevent unintended upgrades
from upstream in the cluster:

BUCKET=<your_init_actions_bucket>
CLUSTER=<cluster_name>
gsutil cp presto/presto.sh gs://${BUCKET}/
gcloud dataproc clusters create ${CLUSTER} --initialization-actions gs://${BUCKET}/presto.sh
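
Applied to this issue, the same pattern for the GPU initialization action might look roughly like the following (a sketch: gpu/install_gpu_driver.sh matches this repository's layout, while the region, image version, and accelerator type, a T4 here, are placeholders to adjust for your cluster):

BUCKET=<your_init_actions_bucket>
CLUSTER=<cluster_name>
REGION=<region>
gsutil cp gpu/install_gpu_driver.sh gs://${BUCKET}/
gcloud dataproc clusters create ${CLUSTER} \
  --region ${REGION} \
  --image-version 2.2-debian12 \
  --master-accelerator type=nvidia-tesla-t4,count=1 \
  --worker-accelerator type=nvidia-tesla-t4,count=1 \
  --initialization-actions gs://${BUCKET}/install_gpu_driver.sh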

@cjac
Contributor

cjac commented Feb 18, 2025

Note also that there is a new, as-yet-undocumented feature which is still likely to change before being codified in the readme.

Setting the include-pytorch[1] metadata attribute to yes will install PyTorch and TensorFlow into a conda environment. The name of the environment may be specified with the gpu-conda-env[2] metadata attribute; a usage sketch follows the code references below.

[1]

INCLUDE_PYTORCH="$(get_metadata_attribute 'include-pytorch' 'no')"

[2]
function install_pytorch() {
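
A minimal sketch of passing those attributes at cluster creation time (the attribute names come from the description above; the environment name gpu-env, bucket, and region are placeholder values):

gcloud dataproc clusters create ${CLUSTER} \
  --region ${REGION} \
  --metadata include-pytorch=yes,gpu-conda-env=gpu-env \
  --initialization-actions gs://${BUCKET}/install_gpu_driver.sh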

@cjac
Contributor

cjac commented Feb 22, 2025

This change was merged in #1302.
