test on debian12 #8928

Draft
wants to merge 6 commits into
base: master

zpcore
Collaborator

@zpcore zpcore commented Apr 2, 2025

Since CUDA 12.8 requires Debian 12, this PR tests whether we can use Debian 12 as the base image.

@zpcore zpcore requested a review from ysiraichi April 2, 2025 22:59
@zpcore
Collaborator Author

zpcore commented Apr 2, 2025

@ysiraichi, this is the error message I see for the Debian 12 CUDA 12.8 build:

Step #2 - "build_xla_docker_image":       DEBUG: /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/bazel_tools/tools/cpp/lib_cc_configure.bzl:118:10:
Step #2 - "build_xla_docker_image":       Auto-Configuration Warning: 'TMP' environment variable is not set, using 'C:\Windows\Temp' as default
Step #2 - "build_xla_docker_image":       DEBUG: /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/bazel_tools/tools/cpp/lib_cc_configure.bzl:118:10:
Step #2 - "build_xla_docker_image":       Auto-Configuration Warning: 'TMP' environment variable is not set, using 'C:\Windows\Temp' as default
Step #2 - "build_xla_docker_image":       Loading:
Step #2 - "build_xla_docker_image":       Loading: 1 packages loaded
Step #2 - "build_xla_docker_image":       Analyzing: target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (2 packages loaded, 0 targets configured)
Step #2 - "build_xla_docker_image":       WARNING: Download from https://mirror.bazel.build/github.com/bazelbuild/platforms/releases/download/0.0.9/platforms-0.0.7.tar.gz failed: class java.io.FileNotFoundException GET returned 404 Not Found
Step #2 - "build_xla_docker_image":       Analyzing: target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (36 packages loaded, 9 targets configured)
Step #2 - "build_xla_docker_image":       Analyzing: target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (36 packages loaded, 9 targets configured)
Step #2 - "build_xla_docker_image":       Analyzing: target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (36 packages loaded, 9 targets configured)
Step #2 - "build_xla_docker_image":       Analyzing: target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (104 packages loaded, 732 targets configured)
Step #2 - "build_xla_docker_image":       Analyzing: target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (204 packages loaded, 9465 targets configured)
Step #2 - "build_xla_docker_image":       Analyzing: target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (236 packages loaded, 18188 targets configured)
Step #2 - "build_xla_docker_image":       INFO: Analyzed target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (239 packages loaded, 20620 targets configured).
Step #2 - "build_xla_docker_image":       INFO: Found 1 target...
Step #2 - "build_xla_docker_image":       [0 / 85] [Prepa] BazelWorkspaceStatusAction stable-status.txt ... (4 actions, 0 running)
Step #2 - "build_xla_docker_image":       ERROR: /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/zlib/BUILD.bazel:5:11: Compiling zutil.c [for tool] failed: undeclared inclusion(s) in rule '@zlib//:zlib':
Step #2 - "build_xla_docker_image":       this rule is missing dependency declarations for the following files included by 'zutil.c':
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stddef.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/limits.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/syslimits.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stdarg.h'
Step #2 - "build_xla_docker_image":       ERROR: /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/com_google_absl/absl/base/BUILD.bazel:53:11: Compiling absl/base/log_severity.cc failed: undeclared inclusion(s) in rule '@com_google_absl//absl/base:log_severity':
Step #2 - "build_xla_docker_image":       this rule is missing dependency declarations for the following files included by 'absl/base/log_severity.cc':
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stddef.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stdarg.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stdint.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/limits.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/syslimits.h'
Step #2 - "build_xla_docker_image":       ERROR: /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/com_google_protobuf/BUILD.bazel:459:10: Compiling src/google/protobuf/compiler/main.cc [for tool] failed: undeclared inclusion(s) in rule '@com_google_protobuf//:protoc':
Step #2 - "build_xla_docker_image":       this rule is missing dependency declarations for the following files included by 'src/google/protobuf/compiler/main.cc':
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stddef.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stdarg.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stdint.h'
Step #2 - "build_xla_docker_image":       ERROR: /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/llvm-project/llvm/BUILD.bazel:225:11: Compiling llvm/lib/Demangle/MicrosoftDemangle.cpp [for tool] failed: undeclared inclusion(s) in rule '@llvm-project//llvm:Demangle':
Step #2 - "build_xla_docker_image":       this rule is missing dependency declarations for the following files included by 'llvm/lib/Demangle/MicrosoftDemangle.cpp':
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stddef.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stdarg.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stdint.h'
Step #2 - "build_xla_docker_image":       ERROR: /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/llvm-project/llvm/BUILD.bazel:225:11: Compiling llvm/lib/Demangle/Demangle.cpp [for tool] failed: undeclared inclusion(s) in rule '@llvm-project//llvm:Demangle':
Step #2 - "build_xla_docker_image":       this rule is missing dependency declarations for the following files included by 'llvm/lib/Demangle/Demangle.cpp':
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stddef.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stdarg.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stdint.h'
Step #2 - "build_xla_docker_image":       ERROR: /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/com_google_absl/absl/time/internal/cctz/BUILD.bazel:21:11: Compiling absl/time/internal/cctz/src/civil_time_detail.cc failed: undeclared inclusion(s) in rule '@com_google_absl//absl/time/internal/cctz:civil_time':
Step #2 - "build_xla_docker_image":       this rule is missing dependency declarations for the following files included by 'absl/time/internal/cctz/src/civil_time_detail.cc':
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stdint.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stddef.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/stdarg.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/limits.h'
Step #2 - "build_xla_docker_image":         '/usr/lib/gcc/x86_64-linux-gnu/10/include/syslimits.h'
Step #2 - "build_xla_docker_image":       Target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so failed to build
Step #2 - "build_xla_docker_image":       Use --verbose_failures to see the command lines of failed build steps.
Step #2 - "build_xla_docker_image":       INFO: Elapsed time: 40.157s, Critical Path: 0.41s
Step #2 - "build_xla_docker_image":       INFO: 281 processes: 8 remote cache hit, 273 internal.
Step #2 - "build_xla_docker_image":       FAILED: Build did NOT complete successfully
Step #2 - "build_xla_docker_image":       FAILED: Build did NOT complete successfully
Step #2 - "build_xla_docker_image":       INFO: Streaming build results to: https://source.cloud.google.com/results/invocations/43df6d48-28c1-44b5-85d6-af04f34cdbb5
Step #2 - "build_xla_docker_image":       Traceback (most recent call last):
Step #2 - "build_xla_docker_image":         File "/usr/local/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
Step #2 - "build_xla_docker_image":           main()
Step #2 - "build_xla_docker_image":         File "/usr/local/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
Step #2 - "build_xla_docker_image":           json_out['return_val'] = hook(**hook_input['kwargs'])
Step #2 - "build_xla_docker_image":                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Step #2 - "build_xla_docker_image":         File "/usr/local/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
Step #2 - "build_xla_docker_image":           return hook(config_settings)
Step #2 - "build_xla_docker_image":                  ^^^^^^^^^^^^^^^^^^^^^
Step #2 - "build_xla_docker_image":         File "/tmp/pip-build-env-s_v42lx3/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 334, in get_requires_for_build_wheel
Step #2 - "build_xla_docker_image":           return self._get_build_requires(config_settings, requirements=[])
Step #2 - "build_xla_docker_image":                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Step #2 - "build_xla_docker_image":         File "/tmp/pip-build-env-s_v42lx3/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 304, in _get_build_requires
Step #2 - "build_xla_docker_image":           self.run_setup()
Step #2 - "build_xla_docker_image":         File "/tmp/pip-build-env-s_v42lx3/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 320, in run_setup
Step #2 - "build_xla_docker_image":           exec(code, locals())
Step #2 - "build_xla_docker_image":         File "<string>", line 11, in <module>
Step #2 - "build_xla_docker_image":         File "/src/pytorch/xla/plugins/cuda/../../build_util.py", line 67, in bazel_build
Step #2 - "build_xla_docker_image":           subprocess.check_call(bazel_argv, stdout=sys.stdout, stderr=sys.stderr)
Step #2 - "build_xla_docker_image":         File "/usr/local/lib/python3.11/subprocess.py", line 413, in check_call
Step #2 - "build_xla_docker_image":           raise CalledProcessError(retcode, cmd)
Step #2 - "build_xla_docker_image":       subprocess.CalledProcessError: Command '['bazel', 'build', '@xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so', '--symlink_prefix=/src/pytorch/xla/plugins/cuda/bazel-', '--remote_default_exec_properties=cache-silo-key=cache-silo-amd64-cuda-17', '--config=remote_cache', '--config=cuda']' returned non-zero exit status 1.
Step #2 - "build_xla_docker_image":       error: subprocess-exited-with-error

@ysiraichi
Collaborator

This is odd. Are we using some kind of bazel cache for building it?

@zpcore
Collaborator Author

zpcore commented Apr 3, 2025

> This is odd. Are we using some kind of bazel cache for building it?

Yes, remote_cache is enabled for the bazel build. Will this impact the outcome?

@ysiraichi
Collaborator

I'm not sure. But can we try doing it without using the remote cache?
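
For reference, a minimal sketch of what "without the remote cache" could look like for the failing command. It assumes the cache is enabled only through the `--config=remote_cache` flag that appears in the `CalledProcessError` above; the helper name `without_remote_cache` is hypothetical, and how the real build scripts assemble this argument list is not shown in this thread.

```python
def without_remote_cache(bazel_argv):
    """Return a copy of bazel_argv with the '--config=remote_cache' flag removed."""
    return [arg for arg in bazel_argv if arg != "--config=remote_cache"]


# Argument list copied from the CalledProcessError in the log above.
failing_argv = [
    "bazel", "build", "@xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so",
    "--symlink_prefix=/src/pytorch/xla/plugins/cuda/bazel-",
    "--remote_default_exec_properties=cache-silo-key=cache-silo-amd64-cuda-17",
    "--config=remote_cache",
    "--config=cuda",
]

# The remaining arguments keep their original order; only the cache config is dropped.
print(without_remote_cache(failing_argv))
```

The input list is left unmodified, so the original command can still be re-run with the cache for comparison.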

@zpcore
Collaborator Author

zpcore commented Apr 5, 2025

I disabled the remote cache and rebuilt. PyTorch now builds successfully with CUDA 12.8. However, the PyTorch/XLA build fails with "No package matching 'libopenblas-base' is available". I think the build is referring to this dependency:

- libopenblas-base

@ysiraichi
Collaborator

Maybe we can replace that with libopenblas-dev (ref)
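
As a rough illustration of that substitution, the sketch below swaps package names in a dependency list. The function `substitute_packages`, the `debian12_replacements` mapping, and the `curl` entry are hypothetical and only for illustration; the actual change would be made in the dependency file referenced above.

```python
def substitute_packages(packages, replacements):
    """Return the package list with any renamed packages swapped for their replacement."""
    return [replacements.get(name, name) for name in packages]


# "libopenblas-base" is the entry quoted above; "curl" is an illustrative extra
# entry showing that unaffected names pass through unchanged.
apt_packages = ["curl", "libopenblas-base"]

# Assumption from this thread: libopenblas-dev stands in for libopenblas-base on Debian 12.
debian12_replacements = {"libopenblas-base": "libopenblas-dev"}

print(substitute_packages(apt_packages, debian12_replacements))
# prints ['curl', 'libopenblas-dev']
```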

@zpcore
Collaborator Author

zpcore commented Apr 8, 2025

> Maybe we can replace that with libopenblas-dev (ref)

Great! The CUDA 12.8 build passes now. Compilation takes ~2 h 30 min without the remote cache, versus ~1 h with it. I will see if we can enable it again.

@ysiraichi
Collaborator

I think that the problem there was that the cache stored some gcc-10 dependencies, while we wanted to use gcc-11. That's why the error was there.
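
One way to check for this kind of mismatch is to scan the Bazel log for the GCC version embedded in the undeclared include paths and compare it with the version the image is meant to use. This is a minimal sketch, assuming the log format shown earlier in this thread; the function name `gcc_versions_in_undeclared_includes` and the `expected_gcc` value are illustrative and not part of the PyTorch/XLA tooling.

```python
import re

# Captures the GCC major version directory from quoted include paths such as
# '/usr/lib/gcc/x86_64-linux-gnu/10/include/stddef.h'.
GCC_INCLUDE_RE = re.compile(r"'/usr/lib/gcc/[^/']+/(\d+)/include/[^']+'")


def gcc_versions_in_undeclared_includes(log_text):
    """Collect GCC major versions listed under 'undeclared inclusion(s)' errors."""
    versions = set()
    in_error = False
    for line in log_text.splitlines():
        if "undeclared inclusion(s)" in line:
            in_error = True  # start of an error block
        elif in_error and "missing dependency declarations" in line:
            continue  # explanatory line that precedes the list of paths
        elif in_error:
            match = GCC_INCLUDE_RE.search(line)
            if match:
                versions.add(match.group(1))
            else:
                in_error = False  # first non-path line ends the block
    return versions


# Lines copied from the failing build above, with the "Step #2" prefix trimmed.
SAMPLE_LOG = """\
ERROR: /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/zlib/BUILD.bazel:5:11: Compiling zutil.c [for tool] failed: undeclared inclusion(s) in rule '@zlib//:zlib':
this rule is missing dependency declarations for the following files included by 'zutil.c':
  '/usr/lib/gcc/x86_64-linux-gnu/10/include/stddef.h'
  '/usr/lib/gcc/x86_64-linux-gnu/10/include/limits.h'
Target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so failed to build
"""

expected_gcc = "11"  # assumption: the GCC major version we wanted to use
found = gcc_versions_in_undeclared_includes(SAMPLE_LOG)
stale = found - {expected_gcc}
print(sorted(stale))  # prints ['10']
```

A non-empty result suggests that some actions were compiled or cached against a different toolchain than the one expected in the image.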
