Sparse MatVec Is Nondeterministic? #2582

Open
rbassett3 opened this issue Dec 9, 2024 · 0 comments
Labels
bug Something isn't working

Comments

rbassett3 commented Dec 9, 2024

Describe the bug

I'm encountering what appears to be nondeterministic rounding when multiplying by a sparse matrix.

To reproduce

I tried to upload a Matrix Market file and a .npy file to GitHub, but they were too large, so I uploaded both to my website as .txt files:

A_ub file
c file

Download them and then compute their product like so:

using MatrixMarket, NPZ, SparseArrays, CUDA, CUDA.CUSPARSE

# load the matrix and vector from the saved files
A_ub = SparseMatrixCSC{Float32}(MatrixMarket.mmread("A_ub.txt"))
c = Array{Float32}(npzread("c.txt"))

# transfer both to the GPU
A_ub = CuSparseMatrixCSR{Float32}(A_ub)
c = CuArray{Float32}(c)

# now do this a few times and observe that the printed numbers change
A_ub * c

For example, on my first run I get:

julia> A_ub * c
1002000-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 495.9215
 511.35
 505.44818
 483.6953
 488.43948
 493.39288
 491.31964
 494.82507
 495.3517
 504.84515
 484.67084
 498.8777
 497.72232
 493.01538
 478.84924
 488.24252
 502.58365
 501.54272
 497.06326
 509.82855
 511.73764
 512.2767
 ⋮

and on my second run I get:

julia> A_ub * c
1002000-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 495.92154
 511.35
 505.44818
 483.6953
 488.43948
 493.39288
 491.31964
 494.82507
 495.3517
 504.84515
 484.67084
 498.8777
 497.72232
 493.01538
 478.84924
 488.24252
 502.58365
 501.54272
 497.0633
 509.82855
 511.73764
 512.2767
 ⋮

with the first and fourth-from-last displayed entries differing from the first run.
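
To quantify this beyond eyeballing the printed output, here is a minimal check (reusing the A_ub and c GPU arrays from the snippet above) that runs the product twice and counts the entries that are not bitwise identical:

# run the same sparse matvec twice and compare on the host
y1 = Array(A_ub * c)
y2 = Array(A_ub * c)

# count entries that differ between the two runs
ndiff = count(y1 .!= y2)
println("entries differing across runs: ", ndiff, " of ", length(y1))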

I know that floating-point arithmetic is non-associative, and that parallel reductions (e.g. the large sums inside a matvec) can group operations differently from one evaluation to the next. Is that what's going on here?
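
For reference, the non-associativity itself is easy to demonstrate on the CPU; this small sketch (on arbitrary random data, nothing specific to my matrices) shows that merely reordering a Float32 sum can change the result:

# sum the same Float32 values in two different orders
xs = rand(Float32, 1_000_000)
s1 = sum(xs)
s2 = sum(reverse(xs))  # same values, different accumulation order
println(s1 == s2)      # typically false: the rounding depends on grouping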

Expected behavior

Reproducible computation.

Version info

Details on Julia:

Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD Ryzen Threadripper PRO 5975WX 32-Cores
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 64 virtual cores)
Environment:

Details on CUDA:

julia> CUDA.versioninfo()
CUDA runtime 12.6, artifact installation
CUDA driver 12.5
NVIDIA driver 555.42.6


CUDA libraries: 
- CUBLAS: 12.6.4
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+555.42.6

Julia packages: 
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.4+0
- CUDA_Runtime_jll: 0.15.5+0

Toolchain:
- Julia: 1.11.1
- LLVM: 16.0.6

1 device:
  0: NVIDIA TITAN V (sm_70, 11.168 GiB / 12.000 GiB available)
