CUBLAS dot far slower than BLAS dot #17
Comments
It might be expected. The time to transfer data to the GPU over PCIe can be pretty substantial. If you can make your array size a power of 2 OR do multiple ops with the same data on the GPU, you should see better perf.
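The transfer-cost point can be made concrete with a rough cost model. This is only a sketch: the bandwidth and overhead constants below are assumptions for illustration, not measurements of any particular card, and the arithmetic is written in Python since only the numbers matter here.

```python
# Rough cost model for dot(x, y) on n Float64 elements.
# All constants are assumed ballpark figures, not measurements.

PCIE_BW = 6e9        # assumed effective host->device PCIe bandwidth, bytes/s
CPU_MEM_BW = 20e9    # assumed CPU memory bandwidth, bytes/s
GPU_MEM_BW = 80e9    # assumed GPU memory bandwidth (GT 650M class), bytes/s
GPU_OVERHEAD = 50e-6 # assumed fixed launch/library-call overhead, seconds

def cpu_dot_time(n, elsize=8):
    # BLAS dot is memory-bound: it streams both arrays through memory once.
    return 2 * n * elsize / CPU_MEM_BW

def gpu_dot_time(n, elsize=8, transfer=True):
    # CUBLAS dot: fixed call overhead, plus streaming both arrays on the
    # device, plus (optionally) shipping both arrays over PCIe first.
    t = GPU_OVERHEAD + 2 * n * elsize / GPU_MEM_BW
    if transfer:
        t += 2 * n * elsize / PCIE_BW
    return t

if __name__ == "__main__":
    for p in (10, 15, 20, 25):
        n = 2 ** p
        print(f"n=2^{p}: cpu={cpu_dot_time(n):.2e}s  "
              f"gpu+transfer={gpu_dot_time(n):.2e}s  "
              f"gpu resident={gpu_dot_time(n, transfer=False):.2e}s")
```

Under these assumed numbers, a transfer-inclusive GPU dot never beats the CPU (PCIe bandwidth is below CPU memory bandwidth), while a dot on data already resident on the device wins once the array is large enough to amortize the call overhead — which is the "do multiple ops with the same data" advice in a nutshell.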
I probably misunderstand how this all works, but isn't the only transfer occurring when I say
Oh derp, you're right. I think it still might be the fact that the array size is not a power of two and is a little small.
Well, ok, it starts paying off at arrays of size 2^20:
At 2^25, Julia crashes with an out-of-memory error (which is suspicious). I thought it would pay off at smaller data sizes; perhaps it's my card (GeForce GT 650M). Anyway, thanks for your help!
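A quick footprint check makes the out-of-memory error at 2^25 plausible. Assuming Float64 data (8 bytes per element, the default for Julia arrays) and that the smaller GT 650M configurations ship with 512 MB of VRAM:

```python
# Memory footprint of two Float64 vectors of length 2^25.
n = 2 ** 25
bytes_per_array = n * 8               # Float64 is 8 bytes per element
total_mb = 2 * bytes_per_array / 2**20  # two vectors, in MiB
print(total_mb)  # → 512.0
```

Two such vectors alone already total 512 MiB, before counting CUDA context overhead or anything else on the card, so exhausting a 512 MB GT 650M at this size would not be surprising.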
It could be the card, especially if you have a nice CPU.
I wrote simple functions that perform dot products on `Array`s and `CudaArray`s. I'm finding that the CUDA version is about 4x slower. Is this expected? Running this script gives:
(Bonus question: what's up with the EBADF???)
This is on OS X 10.9, Julia 0.4.1 installed from Homebrew, built against OpenBLAS, with CUDA 7.5.