
Training on GPU does not utilise GPU properly #836

Closed
mhmoudr opened this issue Aug 15, 2017 · 8 comments

@mhmoudr

mhmoudr commented Aug 15, 2017

The issue is mainly GPU utilisation. After building LightGBM for GPU following the described process and running it on a sample dataset, I monitored both CPU and GPU (please check the attached screenshot).
The process moves the training dataset into GPU memory and nvidia-smi recognises it as a running process, but during training GPU utilisation does not exceed 2-5%, while the CPU appears to be fully utilised.

I am not sure if this is a defect or some kind of incomplete implementation.

Environment info

Operating System: Ubuntu 16
CPU: 2 Xeon (total 48 Cores )
C++ version: latest; calling the CLI process

Error Message:

[Screenshot: CPU and GPU utilisation during training, 2017-08-16 09-38-27]

N/A

Steps to reproduce

Compile for GPU using the provided docs.
Use the following config:
data = "/path/to/libsvm/file"
num_iterations = 3000
learning_rate = 0.01
max_depth = 12
device = gpu
gpu_platform_id = 0
gpu_device_id = 0
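
(For reference, a config like this is passed to the CLI roughly as follows; the file name train.conf is a placeholder, not taken from this report:)

./lightgbm config=train.conf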

@fenqingr

Same thing here; most of the time GPU usage stays at 0%, with some spikes of ~10% to 20%. I thought my CPU (i5 2500K @ 4GHz) was bottlenecking my GPU performance, but from your case I believe that is not the case.

@guolinke
Collaborator

It is normal if your training data is small.
BTW, you are using max_depth = 12 but forgot to set num_leaves, so you are training a small model, which will also cause low GPU usage.
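
For example, adding an explicit num_leaves alongside max_depth lets the trees actually grow (the value below is purely illustrative, not a recommendation from this thread):

device = gpu
max_depth = 12
num_leaves = 255   # default is 31; without this, trees stay small no matter how large max_depth is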

@mhmoudr
Author

mhmoudr commented Aug 16, 2017

I am using a dataset that has ~14 million rows and ~1000 sparse features; the memory footprint, as you can see in nvidia-smi, was 649MB. As for num_leaves, I have set it to 12 as well and got exactly the same behavior.

@guolinke
Collaborator

@mhmoudr num_leaves = 2^max_depth.
@huanzhang12 I remember that sparse features cannot use the GPU to speed up, right?
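
To make that relation concrete (a worked example, not from the thread): with max_depth = 12 a tree can have at most 2^12 = 4096 leaves, so num_leaves = 12, or the default of 31, limits the model far more than the depth cap does.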

@mhmoudr
Author

mhmoudr commented Aug 16, 2017

As I am writing this I am running with num_leaves = 24 (depth still at 12), but GPU utilization is closer to 1% this time.
My question is: why are the CPUs in this scenario utilized at close to 80%? Doesn't this mean that all of the heavy lifting (calculation) that is supposed to happen on the GPU cores is actually happening on the CPU? In other packages, when I train on GPU, the CPUs seem to be idle (doing nearly nothing).

@guolinke
Collaborator

@mhmoudr refer to #768.
The LightGBM GPU version still needs to use the CPU for some calculations.
And sparse features cannot be sped up on the GPU at all, so the CPU usage is high in your case.

@huanzhang12
Contributor

@mhmoudr Sparse feature processing has too much irregularity and is currently not accelerated on the GPU. Try setting the sparse_threshold parameter to 1.0, or a number very close to 1.0, and see if there is any improvement. See the GPU performance tuning guide for more details.
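
A sketch of that suggestion applied to the original config (values other than sparse_threshold are carried over or illustrative):

data = "/path/to/libsvm/file"
device = gpu
gpu_platform_id = 0
gpu_device_id = 0
max_depth = 12
num_leaves = 255
sparse_threshold = 1.0   # treat (nearly) all features as dense so they can be processed on the GPU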

@Noxoomo

Noxoomo commented Oct 3, 2017

I had the same problem. Changing the num_threads option from cpu_cores (in my case 16) and reverting to revision 6cc1dd9 solved the problem. BTW, it is a bug and should be fixed.

UPD:
I have a 2-socket server with 2x8-thread Intel CPUs. 32 threads cause a significant slowdown compared to 16 threads for GPU-based learning.
LightGBM from the current trunk runs slower than revision 6cc1dd9.
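
A minimal sketch of the workaround described above (16 matches this machine; the right value elsewhere is an assumption on the reader's part, not maintainer guidance):

device = gpu
num_threads = 16   # using all 32 hardware threads slowed GPU training here; 16 was faster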

lock bot locked this issue as resolved and limited conversation to collaborators on Mar 17, 2020