test on debian12 #8928

zpcore · 2025-04-02T22:59:40Z

Since CUDA 12.8 requires Debian 12, this PR tests whether we can use Debian 12 for the base image.

zpcore · 2025-04-02T23:36:14Z

@ysiraichi, this is the error message I see for the Debian 12 CUDA 12.8 build:

Step #2 - "build_xla_docker_image":       DEBUG: /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/bazel_tools/tools/cpp/lib_cc_configure.bzl:118:10:
Step #2 - "build_xla_docker_image":       Auto-Configuration Warning: 'TMP' environment variable is not set, using 'C:\Windows\Temp' as default
Step #2 - "build_xla_docker_image":       DEBUG: /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/bazel_tools/tools/cpp/lib_cc_configure.bzl:118:10:
Step #2 - "build_xla_docker_image":       Auto-Configuration Warning: 'TMP' environment variable is not set, using 'C:\Windows\Temp' as default
Step #2 - "build_xla_docker_image":       Loading:
Step #2 - "build_xla_docker_image":       Loading: 1 packages loaded
Step #2 - "build_xla_docker_image":       Analyzing: target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (2 packages loaded, 0 targets configured)
Step #2 - "build_xla_docker_image":       WARNING: Download from https://mirror.bazel.build/github.com/bazelbuild/platforms/releases/download/0.0.9/platforms-0.0.7.tar.gz failed: class java.io.FileNotFoundException GET returned 404 Not Found
Step #2 - "build_xla_docker_image":       Analyzing: target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (36 packages loaded, 9 targets configured)
Step #2 - "build_xla_docker_image":       Analyzing: target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (36 packages loaded, 9 targets configured)
Step #2 - "build_xla_docker_image":       Analyzing: target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (36 packages loaded, 9 targets configured)
Step #2 - "build_xla_docker_image":       Analyzing: target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (104 packages loaded, 732 targets configured)
Step #2 - "build_xla_docker_image":       Analyzing: target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (204 packages loaded, 9465 targets configured)
Step #2 - "build_xla_docker_image":       Analyzing: target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (236 packages loaded, 18188 targets configured)
Step #2 - "build_xla_docker_image":       INFO: Analyzed target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (239 packages loaded, 20620 targets configured).
Step #2 - "build_xla_docker_image":       INFO: Found 1 target...
Step #2 - "build_xla_docker_image":       [0 / 85] [Prepa] BazelWorkspaceStatusAction stable-status.txt ... (4 actions, 0 running)
Step #2 - "build_xla_docker_image":       ERROR: /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/zlib/BUILD.bazel:5:11: Compiling zutil.c [for tool] failed: undeclared inclusion(s) in rule '@zlib//:zlib':
Step #2 - "build_xla_docker_image":       this rule is missing dependency declarations for the following files included by 'zutil.c':
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stddef.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/limits.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/syslimits.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stdarg.h'
Step #2 - "build_xla_docker_image":       ERROR: /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/com_google_absl/absl/base/BUILD.bazel:53:11: Compiling absl/base/log_severity.cc failed: undeclared inclusion(s) in rule '@com_google_absl//absl/base:log_severity':
Step #2 - "build_xla_docker_image":       this rule is missing dependency declarations for the following files included by 'absl/base/log_severity.cc':
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stddef.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stdarg.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stdint.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/limits.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/syslimits.h'
Step #2 - "build_xla_docker_image":       ERROR: /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/com_google_protobuf/BUILD.bazel:459:10: Compiling src/google/protobuf/compiler/main.cc [for tool] failed: undeclared inclusion(s) in rule '@com_google_protobuf//:protoc':
Step #2 - "build_xla_docker_image":       this rule is missing dependency declarations for the following files included by 'src/google/protobuf/compiler/main.cc':
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stddef.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stdarg.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stdint.h'
Step #2 - "build_xla_docker_image":       ERROR: /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/llvm-project/llvm/BUILD.bazel:225:11: Compiling llvm/lib/Demangle/MicrosoftDemangle.cpp [for tool] failed: undeclared inclusion(s) in rule '@llvm-project//llvm:Demangle':
Step #2 - "build_xla_docker_image":       this rule is missing dependency declarations for the following files included by 'llvm/lib/Demangle/MicrosoftDemangle.cpp':
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stddef.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stdarg.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stdint.h'
Step #2 - "build_xla_docker_image":       ERROR: /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/llvm-project/llvm/BUILD.bazel:225:11: Compiling llvm/lib/Demangle/Demangle.cpp [for tool] failed: undeclared inclusion(s) in rule '@llvm-project//llvm:Demangle':
Step #2 - "build_xla_docker_image":       this rule is missing dependency declarations for the following files included by 'llvm/lib/Demangle/Demangle.cpp':
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stddef.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stdarg.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stdint.h'
Step #2 - "build_xla_docker_image":       ERROR: /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/com_google_absl/absl/time/internal/cctz/BUILD.bazel:21:11: Compiling absl/time/internal/cctz/src/civil_time_detail.cc failed: undeclared inclusion(s) in rule '@com_google_absl//absl/time/internal/cctz:civil_time':
Step #2 - "build_xla_docker_image":       this rule is missing dependency declarations for the following files included by 'absl/time/internal/cctz/src/civil_time_detail.cc':
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stdint.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stddef.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stdarg.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/limits.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/syslimits.h'
Step #2 - "build_xla_docker_image":       Target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so failed to build
Step #2 - "build_xla_docker_image":       Use --verbose_failures to see the command lines of failed build steps.
Step #2 - "build_xla_docker_image":       INFO: Elapsed time: 40.157s, Critical Path: 0.41s
Step #2 - "build_xla_docker_image":       INFO: 281 processes: 8 remote cache hit, 273 internal.
Step #2 - "build_xla_docker_image":       FAILED: Build did NOT complete successfully
Step #2 - "build_xla_docker_image":       FAILED: Build did NOT complete successfully
Step #2 - "build_xla_docker_image":       INFO: Streaming build results to: https://source.cloud.google.com/results/invocations/43df6d48-28c1-44b5-85d6-af04f34cdbb5
Step #2 - "build_xla_docker_image":       Traceback (most recent call last):
Step #2 - "build_xla_docker_image":         File "/usr/local/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
Step #2 - "build_xla_docker_image":           main()
Step #2 - "build_xla_docker_image":         File "/usr/local/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
Step #2 - "build_xla_docker_image":           json_out['return_val'] = hook(**hook_input['kwargs'])
Step #2 - "build_xla_docker_image":                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Step #2 - "build_xla_docker_image":         File "/usr/local/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
Step #2 - "build_xla_docker_image":           return hook(config_settings)
Step #2 - "build_xla_docker_image":                  ^^^^^^^^^^^^^^^^^^^^^
Step #2 - "build_xla_docker_image":         File "/tmp/pip-build-env-s_v42lx3/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 334, in get_requires_for_build_wheel
Step #2 - "build_xla_docker_image":           return self._get_build_requires(config_settings, requirements=[])
Step #2 - "build_xla_docker_image":                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Step #2 - "build_xla_docker_image":         File "/tmp/pip-build-env-s_v42lx3/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 304, in _get_build_requires
Step #2 - "build_xla_docker_image":           self.run_setup()
Step #2 - "build_xla_docker_image":         File "/tmp/pip-build-env-s_v42lx3/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 320, in run_setup
Step #2 - "build_xla_docker_image":           exec(code, locals())
Step #2 - "build_xla_docker_image":         File "<string>", line 11, in <module>
Step #2 - "build_xla_docker_image":         File "/src/pytorch/xla/plugins/cuda/../../build_util.py", line 67, in bazel_build
Step #2 - "build_xla_docker_image":           subprocess.check_call(bazel_argv, stdout=sys.stdout, stderr=sys.stderr)
Step #2 - "build_xla_docker_image":         File "/usr/local/lib/python3.11/subprocess.py", line 413, in check_call
Step #2 - "build_xla_docker_image":           raise CalledProcessError(retcode, cmd)
Step #2 - "build_xla_docker_image":       subprocess.CalledProcessError: Command '['bazel', 'build', '@xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so', '--symlink_prefix=/src/pytorch/xla/plugins/cuda/bazel-', '--remote_default_exec_properties=cache-silo-key=cache-silo-amd64-cuda-17', '--config=remote_cache', '--config=cuda']' returned non-zero exit status 1.
Step #2 - "build_xla_docker_image":       error: subprocess-exited-with-error

ysiraichi · 2025-04-03T13:48:22Z

This is odd. Are we using some kind of bazel cache for building it?

zpcore · 2025-04-03T17:06:07Z

> This is odd. Are we using some kind of bazel cache for building it?

Yes, remote_cache is enabled for the bazel build. Will this impact the outcome?

ysiraichi · 2025-04-03T19:45:30Z

I'm not sure. But can we try doing it without using the remote cache?
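For reference, a minimal sketch of what "without the remote cache" could look like for the command shown in the failing log. The helper name `without_remote_cache` is hypothetical, and this assumes the cache is only enabled through the `--config=remote_cache` flag; how `build_util.py` actually adds that flag is not shown in this thread.

```python
def without_remote_cache(bazel_argv):
    """Return a copy of bazel_argv with the remote-cache config flag removed.

    Assumption: the remote cache is enabled solely via --config=remote_cache.
    """
    return [arg for arg in bazel_argv if arg != "--config=remote_cache"]


# Command line copied from the CalledProcessError in the log above.
argv = [
    "bazel", "build", "@xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so",
    "--symlink_prefix=/src/pytorch/xla/plugins/cuda/bazel-",
    "--remote_default_exec_properties=cache-silo-key=cache-silo-amd64-cuda-17",
    "--config=remote_cache",
    "--config=cuda",
]

local_argv = without_remote_cache(argv)
print(" ".join(local_argv))
```

The resulting list can be passed to `subprocess.check_call` in the same way the original command was.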

zpcore · 2025-04-05T21:08:52Z

I disabled the remote cache and rebuilt. PyTorch now builds successfully with CUDA 12.8. However, the PyTorch/XLA build fails with "No package matching 'libopenblas-base' is available". I think the build is referring to the dependency at line 38 of xla/infra/ansible/config/apt.yaml (as of cbff76f):

- libopenblas-base

ysiraichi · 2025-04-07T12:14:29Z

Maybe we can replace that with libopenblas-dev (ref)
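A rough sketch of that substitution, applied to a package list before it is handed to apt. The `REPLACEMENTS` mapping, the `update_packages` helper, and the sample list are all hypothetical; the real list lives in apt.yaml and its full contents are not shown in this thread.

```python
# Hypothetical mapping: package missing on the new base image -> replacement.
REPLACEMENTS = {"libopenblas-base": "libopenblas-dev"}


def update_packages(packages):
    """Swap unavailable packages for their replacements.

    Order is preserved, and a package is emitted only once even if the
    replacement was already present in the list.
    """
    seen = set()
    result = []
    for pkg in packages:
        pkg = REPLACEMENTS.get(pkg, pkg)
        if pkg not in seen:
            seen.add(pkg)
            result.append(pkg)
    return result


# Illustrative input only, not the actual contents of apt.yaml.
print(update_packages(["git", "libopenblas-base", "libopenblas-dev"]))
```

Deduplicating matters here because the replacement package may already be listed elsewhere in the same config.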

zpcore · 2025-04-08T00:03:34Z

> Maybe we can replace that with libopenblas-dev (ref)

Great! The 12.8 build passes now. Compilation takes ~2 hours 30 minutes without the remote cache, compared with ~1 hour with it. I will see if we can enable it again.

ysiraichi · 2025-04-08T12:27:50Z

I think the problem was that the cache had stored some gcc-10 dependencies, while we wanted to use gcc-11. That would explain the "undeclared inclusion(s)" errors pointing at /usr/lib/gcc/x86_64-linux-gnu/10/include.
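One way to check for this kind of mismatch is to scan the Bazel output for GCC include paths and compare the version directory against the compiler the image is expected to use. The sketch below is only an illustration: the function names are made up, and the expected major version (11) is taken from the comment above rather than verified against the image.

```python
import re

# Matches paths such as /usr/lib/gcc/x86_64-linux-gnu/10/include/stddef.h
# and captures the GCC major version directory ("10").
GCC_INCLUDE = re.compile(r"/usr/lib/gcc/[^/]+/(\d+)/include/")


def gcc_versions_in_log(log_text):
    """Return the set of GCC major versions referenced by include paths."""
    return set(GCC_INCLUDE.findall(log_text))


def stale_versions(log_text, expected_major):
    """Return versions seen in the log that differ from the expected compiler."""
    return gcc_versions_in_log(log_text) - {str(expected_major)}


# Two lines taken from the failing build log in this thread.
sample = """
this rule is missing dependency declarations for the following files included by 'zutil.c':
  '/usr/lib/gcc/x86_64-linux-gnu/10/include/stddef.h'
  '/usr/lib/gcc/x86_64-linux-gnu/10/include/limits.h'
"""

print(stale_versions(sample, expected_major=11))
```

A non-empty result suggests that cached actions were produced with a different toolchain than the one installed in the image.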

test on debian12

30df0eb

zpcore requested a review from ysiraichi April 2, 2025 22:59

zpcore added 2 commits April 2, 2025 23:01

update 12.8 package

fbcc1ad

nit

4b3a75a

zpcore added 2 commits April 4, 2025 21:19

test to disable remote cache

98b6e56

another attempt to disable remote cache

b438dd8

update libopenblas

7158355
