Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PyTorch-1.12.0-foss-2022a-CUDA-11.7.0.eb fails to build on test-bot jsc-zen3 #19746

Closed
smoors opened this issue Jan 30, 2024 · 6 comments · Fixed by #20176
Closed

PyTorch-1.12.0-foss-2022a-CUDA-11.7.0.eb fails to build on test-bot jsc-zen3 #19746

smoors opened this issue Jan 30, 2024 · 6 comments · Fixed by #20176
Milestone

Comments

@smoors
Copy link
Contributor

smoors commented Jan 30, 2024

failing build in #19739 (comment)

i suspect this is due to jsc-zen3 being a rocky9 machine, see
facebookincubator/gloo#345

there is a fix for it (and merged) in
facebookincubator/gloo#346

a comment in the above PR claims the fix is not correct, and suggests this fix instead
facebookincubator/gloo#348
however, this PR is much bigger, and i don't feel qualified to evaluate it
maybe for now we should just follow what PyTorch does and stick to facebookincubator/gloo#346 ?

@smoors
Copy link
Contributor Author

smoors commented Jan 30, 2024

error:

FAILED: third_party/gloo/gloo/CMakeFiles/gloo.dir/common/linux.cc.o·
/project/def-maintainers/boegelbot/rocky9/zen3/software/GCCcore/11.3.0/bin/g++ -DGLOO_USE_MPI=1 -I/project/def-maintainers/boegelbot/rocky9/zen3/software/FFTW.MPI/3.3.10-gompi-2022a/include -I/tmp/boegelbot/PyTorch/1.12.0/foss-2022a-CUDA-11.7.0/pytorch-v1.12.0/cmake/../third_party/benchmark/include -I/tmp/boegelbot/PyTorch/1.12.0/foss-2022a-CUDA-11.7.0/pytorch-v1.12.0/cmake/../third_party/cudnn_frontend/include -I/tmp/boegelbot/PyTorch/1.12.0/foss-2022a-CUDA-11.7.0/pytorch-v1.12.0/third_party/gloo -I/tmp/boegelbot/PyTorch/1.12.0/foss-2022a-CUDA-11.7.0/pytorch-v1.12.0/build/third_party/gloo -isystem /tmp/boegelbot/PyTorch/1.12.0/foss-2022a-CUDA-11.7.0/pytorch-v1.12.0/cmake/../third_party/googletest/googlemock/include -isystem /tmp/boegelbot/PyTorch/1.12.0/foss-2022a-CUDA-11.7.0/pytorch-v1.12.0/cmake/../third_party/googletest/googletest/include -isystem /project/def-maintainers/boegelbot/rocky9/zen3/software/protobuf/3.19.4-GCCcore-11.3.0/include -isystem /project/def-maintainers/boegelbot/rocky9/zen3/software/OpenBLAS/0.3.20-GCC-11.3.0/include -isystem /tmp/boegelbot/PyTorch/1.12.0/foss-2022a-CUDA-11.7.0/pytorch-v1.12.0/third_party/gemmlowp -isystem /tmp/boegelbot/PyTorch/1.12.0/foss-2022a-CUDA-11.7.0/pytorch-v1.12.0/third_party/neon2sse -isystem /tmp/boegelbot/PyTorch/1.12.0/foss-2022a-CUDA-11.7.0/pytorch-v1.12.0/third_party/XNNPACK/include -isystem /project/def-maintainers/boegelbot/rocky9/zen3/software/numactl/2.0.14-GCCcore-11.3.0/include -isystem /project/def-maintainers/boegelbot/rocky9/zen3/software/FFmpeg/4.4.2-GCCcore-11.3.0/include -isystem /tmp/boegelbot/PyTorch/1.12.0/foss-2022a-CUDA-11.7.0/pytorch-v1.12.0/cmake/../third_party/eigen -isystem /project/60006/boegelbot/rocky9/zen3/software/OpenMPI/4.1.4-GCC-11.3.0/include -isystem /project/def-maintainers/boegelbot/rocky9/zen3/software/OpenMPI/4.1.4-GCC-11.3.0/include -isystem /project/def-maintainers/boegelbot/rocky9/zen3/software/CUDA/11.7.0/include -isystem /project/def-maintainers/boegelbot/rocky9/zen3/software/NCCL/2.12.12-GCCcore-11.3.0-CUDA-11.7.0/include -O2 -ftree-vectorize -march=native -fno-math-errno -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -std=c++11 -fPIC -O3 -DNDEBUG -std=gnu++14 -MD -MT third_party/gloo/gloo/CMakeFiles/gloo.dir/common/linux.cc.o -MF third_party/gloo/gloo/CMakeFiles/gloo.dir/common/linux.cc.o.d -o third_party/gloo/gloo/CMakeFiles/gloo.dir/common/linux.cc.o -c /tmp/boegelbot/PyTorch/1.12.0/foss-2022a-CUDA-11.7.0/pytorch-v1.12.0/third_party/gloo/gloo/common/linux.cc
In file included from /tmp/boegelbot/PyTorch/1.12.0/foss-2022a-CUDA-11.7.0/pytorch-v1.12.0/third_party/gloo/gloo/common/linux.cc:15:
/usr/include/linux/ethtool.h: In function int gloo::getInterfaceSpeedGLinkSettings(int, ifreq*):
/usr/include/linux/ethtool.h:2182:17: error: flexible array member ethtool_link_settings::link_mode_masks not at end of struct gloo::getInterfaceSpeedGLinkSettings(int, ifreq*)::<unnamed>
 2182 |         __u32   link_mode_masks[];
      |                 ^~~~~~~~~~~~~~~
/tmp/boegelbot/PyTorch/1.12.0/foss-2022a-CUDA-11.7.0/pytorch-v1.12.0/third_party/gloo/gloo/common/linux.cc:192:11: note: next member __u32 gloo::getInterfaceSpeedGLinkSettings(int, ifreq*)::<unnamed struct>::link_mode_data [381] declared here
  192 |     __u32 link_mode_data[link_mode_data_nwords];
      |           ^~~~~~~~~~~~~~
/tmp/boegelbot/PyTorch/1.12.0/foss-2022a-CUDA-11.7.0/pytorch-v1.12.0/third_party/gloo/gloo/common/linux.cc:190:10: note: in the definition of struct gloo::getInterfaceSpeedGLinkSettings(int, ifreq*)::<unnamed>
  190 |   struct {
      |          ^

@boegel boegel added this to the release after 4.9.0 milestone Jan 30, 2024
@boegel
Copy link
Member

boegel commented Jan 30, 2024

@Flamefire Thoughts on this one?

@Flamefire
Copy link
Contributor

@Flamefire Thoughts on this one?

facebookincubator/gloo#348 looks indeed correct. The other is very likely wrong.

@boegel
Copy link
Member

boegel commented Feb 20, 2024

@Flamefire Any idea why this only pops up for jsc-zen3 (Rocky 9.x as OS could be a reason)?

@Flamefire
Copy link
Contributor

Yes: previously the __u32 link_mode_masks[]; syntax for flexible array members (i.e. actually just simple pointers) was not used. Instead it used __u32 link_mode_masks[0]; instead and then just accessed the array "out of bounds". I still see that on my Ubuntu 22.04 system.

In newer compilers an array with zero size is/may be flagged as an error (or at least warning) as are compile-time-known out of bounds accesses. Hence the need for the new syntax.

In Gloo they used

struct {
    struct ethtool_link_settings req;
    __u32 link_mode_data[link_mode_data_nwords];
  } ecmd;

to allocate storage for such an array on the stack with the idea that link_mode_data provides the storage for the link_mode_masks flexibile array member of ethtool_link_settings which worked for the link_mode_masks[0] case as an access to e.g. link_mode_masks[1] actually accessed link_mode_data[1]

But the new syntax requires the flexible array member to be at the end of a struct as the aliasing of link_mode_masks and link_mode_data might have not been intended and be actually unexpected.

FWIW: All of that is very suspicious: The idea of the flexible array members was that the size of the array is determined via allocation, i.e. (ethtool_link_settings*) malloc(sizeof(struct ethtool_link_settings) + sizeof(__u32)*number_of_words) but here storage is passed to a function filling that struct without any information on the actual size of the (flexible) array. So this just waits for a buffer overflow to happen (if the caller gets the array size to small)...

@Flamefire
Copy link
Contributor

I'm proposing my fix in #20176

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants