Slow dataset creation #4037
Using CSV is still very slow and single-threaded.
Setting max_bin = 4 doesn't help; it still takes the same time. Is there possibly an O(N^2) operation in the dataset creation process?
Repeated samples of `where` in gdb show it is always stuck in https://github.com/microsoft/LightGBM/blob/master/src/io/dataset.cpp#L120-L187
https://github.com/microsoft/LightGBM/blob/master/src/io/dataset.cpp#L125-L134 Basically, this inner `features_in_group` loop has a size that grows in direct proportion to `fidx` (the outer loop index). So while at fidx=100000 the time is only ~100us per fidx value, by fidx=800000 it is ~1000us per fidx value. I already moved the bin_mapper lookup out of the loop, but that didn't help this slowness or the O(N^2) behavior.
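To make the scaling concrete, here is a toy sketch of the pattern (my own pseudocode, not the actual dataset.cpp code): every feature scans the list of existing groups, and when no group ever accepts a new feature, the list grows by one per feature, so the total work is 1 + 2 + ... + N = O(N^2).

```python
# Toy illustration of the reported scaling (not LightGBM's code): each
# feature scans all existing groups; on dense data no group accepts a
# new feature, so the scan length grows with the outer index.
def group_features(num_features, conflicts):
    groups = []
    for fidx in range(num_features):        # outer loop over features
        placed = False
        for group in groups:                # inner scan grows with fidx
            if not conflicts(fidx, group):
                group.append(fidx)
                placed = True
                break
        if not placed:
            groups.append([fidx])           # dense data: always a new group
    return groups

# With a dense dataset every pair conflicts, which is the worst case:
dense = lambda fidx, group: True
assert len(group_features(1000, dense)) == 1000
```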
ping @shiyu1994
I tried to use rand.Sample() to sample `features_in_group`, but it turns out rand.Sample() is just as bad: the way it is designed, it scales with the size of the set being sampled from, not the size of the sample. So such an attempt still keeps things at O(N^2). Example of random sampling without replacement that is O(k) in the sample size: https://stackoverflow.com/questions/28287138/c-randomly-sample-k-numbers-from-range-0n-1-n-k-without-replacement
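For reference, a minimal sketch of that O(k) approach from the linked answer (Robert Floyd's algorithm), written in Python here for illustration rather than as LightGBM code:

```python
# Robert Floyd's algorithm: sample k distinct integers from [0, n) with
# cost proportional to k, independent of n.
import random

def floyd_sample(n, k):
    chosen = set()
    for j in range(n - k, n):
        t = random.randint(0, j)        # uniform over [0, j]
        # If t was already picked, j itself is guaranteed to be free,
        # which keeps the result uniform over all k-subsets.
        chosen.add(j if t in chosen else t)
    return chosen

print(sorted(floyd_sample(10**6, 5)))   # fast even for huge n
```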
@pseudotensor Thanks for using LightGBM. The synthesized dataset is dense, so I think the group search in lines 125 to 134 (at commit 37e9878) won't find any available group; each feature will end up in its own group. This is a very extreme case. A possible solution would be to limit the maximum number of trials (iterations) in line 125.
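A hedged sketch of that mitigation, reusing the toy example above (my own pseudocode with an assumed cap value, not an actual patch): probe at most a fixed number of randomly chosen groups before giving up and opening a new one, which bounds the total work at O(N * MAX_TRIALS).

```python
# Bounded random probing instead of a full scan over all groups.
import random

MAX_TRIALS = 100  # assumed value; the real cap would need tuning

def group_features_capped(num_features, conflicts):
    groups = []
    for fidx in range(num_features):
        placed = False
        n_probe = min(MAX_TRIALS, len(groups))
        for group in random.sample(groups, n_probe):  # bounded probe
            if not conflicts(fidx, group):
                group.append(fidx)
                placed = True
                break
        if not placed:
            groups.append([fidx])
    return groups
```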
Do you have plans to implement this? Or should we write this into our feature requests?
@StrikerRUS We can have this in feature requests. A quick fix would be to randomly sample from the feature groups in line 125. How to sample the groups when the total number of groups is large is an open question. We have a plan to separate out the dataset construction; I think we may leave it to that part.
Closed in favor of #2302; we decided to keep all feature requests in one place. You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
LightGBM version: 3.1.1.99
Using datatable or numpy gives the same result: it gets "stuck" using 1 core for 10-20 minutes.
After the 10-20 minutes, I get a system OOM even though I have 64GB.
I read #1081, which also involves quite wide data. But in general, what recommendations are there for handling many columns and speeding things up without using too much memory?
I'm trying various things, but it seems like the way lgb uses 1 core for 20 minutes can be improved. E.g. in xgboost I/we used OpenMP for data ingestion, which speeds things up even though a lot of the operations are memory-bandwidth limited. That makes sense, since the lgb dataset construction operation is about 100x slower than generating the data itself, which is bad.
So it must be possible to parallelize the dataset construction, since the features are independent. E.g. one could even fork many jobs that each take a portion of the data and create Dataset objects, then use add_features_from to column-bind the features, as in the sketch below. Why isn't that done internally using OpenMP?
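For illustration, a minimal sketch of that column-binding idea, assuming LightGBM's Python API: Dataset.add_features_from requires both Datasets to be constructed first, and the data shapes, block count, and free_raw_data usage below are my assumptions.

```python
# Hypothetical sketch: split a wide matrix into column blocks, build a
# Dataset per block, then column-bind them with add_features_from.
import numpy as np
import lightgbm as lgb

X = np.random.rand(10_000, 1_000)
y = np.random.rand(10_000)

blocks = np.array_split(np.arange(X.shape[1]), 4)  # 4 column blocks

parts = []
for i, cols in enumerate(blocks):
    d = lgb.Dataset(X[:, cols],
                    label=y if i == 0 else None,  # label on first block only
                    free_raw_data=False)          # keep raw data for merging
    parts.append(d.construct())                   # binning happens here

merged = parts[0]
for part in parts[1:]:
    merged.add_features_from(part)                # column-bind features
```

Note this version is deliberately sequential and only demonstrates the merge step; each construct() call still pays the single-threaded binning cost, so real speedup would require running the per-block constructions in separate processes.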