
Extremely slow data loading on high dimensional dataset. #6535

Open
yinzheng-zhong opened this issue Jul 10, 2024 · 0 comments

yinzheng-zhong commented Jul 10, 2024

I am working with 4^11 (~4.2 million) features, and loading a 30 GB dataset has been stuck on a single thread for 15 hours. I can see the time is spent waiting in

# basic.py line 2146
_LIB.LGBM_DatasetCreateFromFile()

I haven't looked into all the C++ code yet, but if I work with 4^10 (~1 million) dimensional data, it takes around an hour to load, so the problem seems directly linked to the dimensionality of the dataset.

In addition, I am using Python and have tried loading the data both from libsvm format and as a dense NumPy array; both show the same result. I suppose it would eventually work with 4^12 dimensional data as well, but the loading time makes it impossible to work with. I have tried XGBoost, which only takes a few minutes to load the data and start training. It would be great if I could use LightGBM, since it uses less RAM.
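Roughly, the two loading paths look like this (the file name and array sizes below are placeholders, scaled down so the sketch actually runs; the real data has 4^11 columns):

```python
import numpy as np
import lightgbm as lgb

# Path 1: load from a libsvm/text file. Dataset.construct() is the point
# where basic.py calls into _LIB.LGBM_DatasetCreateFromFile() and stalls.
ds_file = lgb.Dataset("train.svm")  # hypothetical file name
ds_file.construct()

# Path 2: build the Dataset from a dense NumPy array instead.
# Feature count reduced here; the report uses 4 ** 11 (~4.2M) columns.
n_rows, n_features = 1_000, 4 ** 6
X = np.random.rand(n_rows, n_features).astype(np.float32)
y = np.random.randint(0, 2, size=n_rows)
ds_dense = lgb.Dataset(X, label=y)
ds_dense.construct()
```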

I saw other issues that might be relevant but are not exactly the same, e.g. #4037.
Any suggestion is appreciated. Thank you.
