This repository has been archived by the owner on Mar 12, 2021. It is now read-only.

CUBLAS dot far slower than BLAS dot #17

Open

dpo opened this issue Dec 4, 2015 · 5 comments

dpo commented Dec 4, 2015

I wrote simple functions that perform dot products on Arrays and CudaArrays. I'm finding that the CUDA version is about 4x slower. Is this expected?

using CUDArt
using CUBLAS

# Time kmax CPU dot products.
function blasdots(x::Vector{Float64}, y::Vector{Float64}; kmax::Int=100)
  for k = 1:kmax
    BLAS.dot(x, y)
  end
end

# Time kmax GPU dot products on arrays already resident on the device.
function cublasdots(d_x::CudaArray{Float64}, d_y::CudaArray{Float64}; kmax::Int=100)
  for k = 1:kmax
    CUBLAS.dot(d_x, d_y)
  end
end

n = 10000
x = rand(n); y = rand(n)
d_x = CudaArray(x); d_y = CudaArray(y)

blasdots(x, y, kmax=1)  # compile
@time blasdots(x, y)

cublasdots(d_x, d_y, kmax=1)  # compile
@time cublasdots(d_x, d_y)

Running this script gives:

$ julia time_cublas.jl 
  0.001865 seconds (431 allocations: 27.450 KB)
  0.007459 seconds (583 allocations: 28.250 KB)
jl_uv_writecb() ERROR: bad file descriptor EBADF
jl_uv_writecb() ERROR: bad file descriptor EBADF
jl_uv_writecb() ERROR: bad file descriptor EBADF
jl_uv_writecb() ERROR: bad file descriptor EBADF
jl_uv_writecb() ERROR: bad file descriptor EBADF

(Bonus question: what's up with the EBADF???)

This is on OS X 10.9 with Julia 0.4.1 installed from Homebrew, built against OpenBLAS, and CUDA 7.5.
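As an aside on methodology (my addition, not part of the original report): a single @time sample is noisy, and repeating the call while keeping the minimum filters out GC pauses and other one-off effects. A minimal CPU-only sketch; the name mintime and the trials=5 default are illustrative choices, and on modern Julia dot lives in LinearAlgebra:

```julia
using LinearAlgebra  # provides dot on Julia >= 0.7

# Run f() several times and keep the fastest sample; the minimum is a
# more stable estimate than one @time measurement. trials=5 is arbitrary.
function mintime(f; trials::Int=5)
    best = Inf
    for _ in 1:trials
        best = min(best, @elapsed f())
    end
    return best
end

x = rand(10_000); y = rand(10_000)
t = mintime(() -> dot(x, y))
```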

kshyatt (Contributor) commented Dec 4, 2015

It might be expected. The time to transfer data to the GPU over PCIe can be substantial. If you can make your array size a power of 2, or do multiple operations with the same data on the GPU, you should see better performance.
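The trade-off can be made concrete with a back-of-envelope cost model: each call pays a fixed overhead plus a bandwidth-limited term, and the GPU only wins once the vectors are long enough to amortize its larger fixed cost (kernel launch plus result readback). Every constant below is an illustrative assumption, not a measurement of this hardware:

```julia
# Rough model: t(n) = per-call overhead + bytes moved / bandwidth.
# All constants are assumed, order-of-magnitude values.
const CPU_OVERHEAD = 1e-7   # seconds per BLAS call (assumed)
const GPU_OVERHEAD = 1e-5   # launch latency + scalar readback (assumed)
const CPU_BW = 20e9         # host memory bandwidth, bytes/s (assumed)
const GPU_BW = 80e9         # device memory bandwidth, bytes/s (assumed)

# A dot product streams two Float64 vectors: 16 bytes per element.
t_cpu(n) = CPU_OVERHEAD + 16n / CPU_BW
t_gpu(n) = GPU_OVERHEAD + 16n / GPU_BW

# Smallest power-of-two length at which the GPU model wins.
crossover = first(filter(n -> t_gpu(n) < t_cpu(n), [2^k for k in 10:26]))
```

Under these guessed constants the crossover lands in the tens of thousands of elements, which is consistent with the GPU losing at n = 10000 and winning at 2^20.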

dpo (Author) commented Dec 4, 2015

I probably misunderstand how this all works, but isn't the only transfer occurring when I say d_x = CudaArray(x)? Isn't all of cublasdots() taking place on the GPU?

kshyatt (Contributor) commented Dec 4, 2015

Oh derp, you're right. I think it still might be the fact that the array size is not a power of two and is a little small.

dpo (Author) commented Dec 4, 2015

Well, ok, it starts paying off at arrays of size 2^20:

array size: 2^20
  0.892670 seconds
  0.647335 seconds (3.00 k allocations: 109.375 KB)
array size: 2^21
  1.891142 seconds
  0.839174 seconds (3.00 k allocations: 109.375 KB)
array size: 2^22
  3.775395 seconds
  1.492279 seconds (3.00 k allocations: 109.375 KB)
array size: 2^23
  7.506833 seconds
  3.100094 seconds (3.00 k allocations: 109.375 KB)
array size: 2^24
 14.739128 seconds
  5.848365 seconds (3.00 k allocations: 109.375 KB)

At 2^25, Julia crashes, saying it's out of memory (which is suspicious; htop shows my memory usage as constant; I don't get such a crash when I only use BLAS.dot).
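One hedged guess (my note, not from the thread) about the 2^25 crash: the failing allocation may be on the device rather than the host, which would explain why htop shows flat host memory. Some GT 650M configurations ship with only 512 MB of VRAM, and at n = 2^25 the two Float64 device vectors alone would fill that:

```julia
# Device memory needed just for the two input vectors at n = 2^25.
n = 2^25
bytes_per_vector = n * sizeof(Float64)   # 8 bytes per Float64
total_mb = 2 * bytes_per_vector / 2^20   # both vectors, in MiB
# total_mb == 512.0
```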

I thought it would pay off at smaller data sizes. Perhaps it's my card (GeForce GT 650M). Anyway, thanks for your help!

kshyatt (Contributor) commented Dec 4, 2015

It could be the card, especially if you have a nice CPU.
