Sparse MatVec Is Nondeterministic? #2582

Open
rbassett3 opened this issue Dec 9, 2024 · 0 comments
Labels
bug Something isn't working

Comments

rbassett3 commented Dec 9, 2024

Describe the bug

I'm encountering what appears to be nondeterministic rounding when multiplying by a sparse matrix.

To reproduce

I tried to upload a Matrix Market file and a .npy file to GitHub, but they were too large, so I uploaded both to my website as .txt files:

A_ub file
c file

Download them and then compute their product like so:

using MatrixMarket, NPZ, SparseArrays, CUDA, CUDA.CUSPARSE

# load the matrix and vector from the saved files
A_ub = SparseMatrixCSC{Float32}(MatrixMarket.mmread("A_ub.txt"))
c = Array{Float32}(npzread("c.txt"))

# transfer both to the GPU
A_ub = CuSparseMatrixCSR{Float32}(A_ub)
c = CuArray{Float32}(c)

# now do this a few times and observe that the printed numbers change
A_ub * c

For example, on my first run I get:

julia> A_ub * c
1002000-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 495.9215
 511.35
 505.44818
 483.6953
 488.43948
 493.39288
 491.31964
 494.82507
 495.3517
 504.84515
 484.67084
 498.8777
 497.72232
 493.01538
 478.84924
 488.24252
 502.58365
 501.54272
 497.06326
 509.82855
 511.73764
 512.2767
 ⋮

and on my second run I get:

julia> A_ub * c
1002000-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 495.92154
 511.35
 505.44818
 483.6953
 488.43948
 493.39288
 491.31964
 494.82507
 495.3517
 504.84515
 484.67084
 498.8777
 497.72232
 493.01538
 478.84924
 488.24252
 502.58365
 501.54272
 497.0633
 509.82855
 511.73764
 512.2767
 ⋮

with the first and fourth-from-last displayed entries differing from the first run.
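
To quantify this beyond eyeballing the printed output, here is a minimal check (reusing the A_ub and c GPU arrays from the snippet above) that runs the product twice and counts the entries that are not bitwise identical:

# run the same sparse matvec twice and compare on the host
y1 = Array(A_ub * c)
y2 = Array(A_ub * c)

# count entries that differ between the two runs
ndiff = count(y1 .!= y2)
println("entries differing across runs: ", ndiff, " of ", length(y1))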

I know that floating-point arithmetic is non-associative, and that parallel reductions (e.g. the large sums inside a matvec) can group operations differently from one evaluation to the next. Is that what's going on here?
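
For reference, the non-associativity itself is easy to demonstrate on the CPU; this small sketch (on arbitrary random data, nothing specific to my matrices) shows that merely reordering a Float32 sum can change the result:

# sum the same Float32 values in two different orders
xs = rand(Float32, 1_000_000)
s1 = sum(xs)
s2 = sum(reverse(xs))  # same values, different accumulation order
println(s1 == s2)      # typically false: the rounding depends on grouping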

Expected behavior

Reproducible computation.

Version info

Details on Julia:

Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD Ryzen Threadripper PRO 5975WX 32-Cores
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 64 virtual cores)
Environment:

Details on CUDA:

julia> CUDA.versioninfo()
CUDA runtime 12.6, artifact installation
CUDA driver 12.5
NVIDIA driver 555.42.6


CUDA libraries: 
- CUBLAS: 12.6.4
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+555.42.6

Julia packages: 
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.4+0
- CUDA_Runtime_jll: 0.15.5+0

Toolchain:
- Julia: 1.11.1
- LLVM: 16.0.6

1 device:
  0: NVIDIA TITAN V (sm_70, 11.168 GiB / 12.000 GiB available)
