ComplexF32 eigen can return NaN unexpectedly #2186

Open
kmp5VT opened this issue Nov 30, 2023 · 2 comments
Labels
bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@kmp5VT
Contributor

kmp5VT commented Nov 30, 2023

Describe the bug

The eigen function appears to be numerically unstable for ComplexF32 Hermitian matrices on the GPU. It occasionally returns NaN, which is inconsistent with the CPU decomposition of the same matrix.

To reproduce

The Minimal Working Example (MWE) for this bug:

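using CUDA, HDF5, LinearAlgebra

# Load the matrix that triggers the failure
fid = h5open("broken_eigen.h5", "r")
m = read(fid, "matrix")
m = Hermitian(m)        # CPU reference
cm = Hermitian(cu(m))   # copy of the matrix on the GPU
D, V = eigen(m)         # CPU decomposition
cuD, cuV = eigen(cm)    # GPU decomposition; this is where NaN shows up
close(fid)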

broken_eigen.h5.txt
Note: this is an .h5 file. I saved it with a .txt extension because the upload was not accepted here otherwise; just remove the .txt extension before using it.

Manifest.toml

Status `~/.julia/environments/v1.9/Project.toml`
  [052768ef] CUDA v5.1.1
  [34da2185] Compat v4.10.0
  [f67ccb44] HDF5 v0.17.1
  [33e6dc65] MKL v0.6.1

Expected behavior

I would expect cuD and cuV to be the eigenvalues and eigenvectors of the CuMatrix cm, which has values in the range [-4.6161222f-8, 0.8686561f0], with a smallest absolute value of 1.3966348f-25.
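
One way to state the expectation concretely is to check that a decomposition contains no NaN/Inf and that its relative residual ‖A·V − V·D‖ / ‖A‖ is small. The sketch below is only an illustration: check_eigen is a hypothetical helper (not part of CUDA.jl or LinearAlgebra), and it is demonstrated on a small random CPU matrix rather than on the matrix from broken_eigen.h5.

```julia
using LinearAlgebra

# Hypothetical helper, not an existing API: reports whether the factors are
# finite and how well they satisfy A * V ≈ V * Diagonal(D).
function check_eigen(A, vals, vecs)
    finite = all(isfinite, vals) && all(isfinite, vecs)
    residual = norm(A * vecs - vecs * Diagonal(vals)) / norm(A)
    return (finite = finite, residual = residual)
end

# CPU-only demonstration on a small random Hermitian ComplexF32 matrix
B = rand(ComplexF32, 4, 4)
A = Hermitian(B + B')
D, V = eigen(A)
result = check_eigen(A, D, V)
```

For the GPU result from the MWE, the factors could be copied back to the host first, e.g. check_eigen(m, Array(cuD), Array(cuV)); when the bug triggers, finite would be false and the residual would be NaN.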

Version info

Details on Julia:

Julia Version 1.9.4
Commit 8e5136fa297 (2023-11-14 08:46 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
  Threads: 1 on 32 virtual cores

Details on CUDA:

CUDA runtime 12.3, artifact installation
CUDA driver 12.3
NVIDIA driver 535.113.1, originally for CUDA 12.2

CUDA libraries: 
- CUBLAS: 12.3.2
- CURAND: 10.3.4
- CUFFT: 11.0.11
- CUSOLVER: 11.5.3
- CUSPARSE: 12.1.3
- CUPTI: 21.0.0
- NVML: 12.0.0+535.113.1

Julia packages: 
- CUDA: 5.1.0
- CUDA_Driver_jll: 0.7.0+0
- CUDA_Runtime_jll: 0.10.0+1

Toolchain:
- Julia: 1.9.4
- LLVM: 14.0.6

1 device:
  0: NVIDIA RTX A6000 (sm_86, 45.964 GiB / 47.988 GiB available)
kmp5VT added the bug label Nov 30, 2023
@maleadt
Member

maleadt commented Jan 2, 2024

I can reproduce this, but I'm not familiar with the eigen/heevd code, so I'm pinging a couple of people who were involved with it and may be able to say something useful: @albertomercurio @GVigne. It's possible that this is a bug in NVIDIA's libraries, but I want to make sure we're not doing anything wrong before filing an issue.

maleadt added the help wanted label Jan 9, 2024
@albertomercurio
Contributor

I can also reproduce the problem. With ComplexF64 everything works, so it seems to be related to the ComplexF32 path of heevd.
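
Since ComplexF64 is reported to work, a possible stopgap is to promote the matrix to ComplexF64, decompose, and convert the result back. This is a minimal sketch and not a fix for the underlying problem: eigen_via_f64 is a hypothetical helper name, it has only been exercised here on a CPU matrix, and using it with a CuArray-backed Hermitian assumes the ComplexF64 GPU path behaves as described above.

```julia
using LinearAlgebra

# Hypothetical workaround helper: decompose in double precision and
# return the eigenvalues/eigenvectors converted back to single precision.
function eigen_via_f64(A::Hermitian{ComplexF32})
    A64 = Hermitian(ComplexF64.(parent(A)), Symbol(A.uplo))  # keep the same triangle
    F = eigen(A64)
    return Float32.(F.values), ComplexF32.(F.vectors)
end

# CPU-only demonstration on a small random Hermitian ComplexF32 matrix
B = rand(ComplexF32, 4, 4)
A = Hermitian(B + B')
D, V = eigen_via_f64(A)
```

The cost is extra memory and the slower double-precision solve, so this only makes sense until the ComplexF32 behaviour is understood.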
