
Extremely slow data loading on high dimensional dataset. #6535

Open
yinzheng-zhong opened this issue Jul 10, 2024 · 0 comments

yinzheng-zhong commented Jul 10, 2024

I am working with 4^11 (~4.2 million) features, and loading a 30 GB dataset has been stuck on a single thread for 15 hours. I can see the time is spent waiting in

# basic.py line 2146
_LIB.LGBM_DatasetCreateFromFile()

I haven't looked into all the C++ code yet, but if I work with 4^10 (~1 million) dimensional data, it takes around an hour to load, so the problem seems directly linked to the dimensionality of the dataset.

In addition, I am using Python and have tried loading the data both from libsvm format and as a dense NumPy array; both show the same result. I suppose it would eventually work with 4^12 dimensional data as well, but the loading time makes it impossible to work with. I have tried XGBoost, which only takes a few minutes to load the data and start training. It would be great if I could use LightGBM, since it uses less RAM.
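Roughly, the two loading paths look like this (the file name and array sizes below are placeholders, scaled down so the sketch actually runs; the real data has 4^11 columns):

```python
import numpy as np
import lightgbm as lgb

# Path 1: load from a libsvm/text file. Dataset.construct() is the point
# where basic.py calls into _LIB.LGBM_DatasetCreateFromFile() and stalls.
ds_file = lgb.Dataset("train.svm")  # hypothetical file name
ds_file.construct()

# Path 2: build the Dataset from a dense NumPy array instead.
# Feature count reduced here; the report uses 4 ** 11 (~4.2M) columns.
n_rows, n_features = 1_000, 4 ** 6
X = np.random.rand(n_rows, n_features).astype(np.float32)
y = np.random.randint(0, 2, size=n_rows)
ds_dense = lgb.Dataset(X, label=y)
ds_dense.construct()
```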

I saw other issues that might be relevant but are not exactly the same, e.g. #4037.
Any suggestion is appreciated. Thank you.
