Slow dataset creation #4037
Using CSV is still very slow and single-threaded.
Setting max_bin = 4 doesn't help; it still takes the same time. Is there possibly an O(N^2) operation in the dataset creation process?
Repeated samples of `where` in gdb show it is always stuck in https://github.com/microsoft/LightGBM/blob/master/src/io/dataset.cpp#L120-L187
https://github.com/microsoft/LightGBM/blob/master/src/io/dataset.cpp#L125-L134 Basically, this inner `features_in_group` loop has a size that grows in direct proportion to `fidx` (the outer loop index). So while at fidx=100000 the time is only ~100us per fidx value, by fidx=800000 it is ~1000us per fidx value. I already moved the bin_mapper lookup out of the loop, but that didn't help this slowness or the O(N^2) behavior.
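To make the scaling concrete, here is a toy sketch of the pattern (my own pseudocode, not the actual dataset.cpp code): every feature scans the list of existing groups, and when no group ever accepts a new feature, the list grows by one per feature, so the total work is 1 + 2 + ... + N = O(N^2).

```python
# Toy illustration of the reported scaling (not LightGBM's code): each
# feature scans all existing groups; on dense data no group accepts a
# new feature, so the scan length grows with the outer index.
def group_features(num_features, conflicts):
    groups = []
    for fidx in range(num_features):        # outer loop over features
        placed = False
        for group in groups:                # inner scan grows with fidx
            if not conflicts(fidx, group):
                group.append(fidx)
                placed = True
                break
        if not placed:
            groups.append([fidx])           # dense data: always a new group
    return groups

# With a dense dataset every pair conflicts, which is the worst case:
dense = lambda fidx, group: True
assert len(group_features(1000, dense)) == 1000
```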
ping @shiyu1994
I tried to use rand.Sample() to sample `features_in_group`, but it turns out rand.Sample() is just as bad: the way it is designed, it scales with the size of the set being sampled from, not the size of the sample. So such an attempt still keeps things at O(N^2). Example of random sampling without replacement that is O(k) in the sample size: https://stackoverflow.com/questions/28287138/c-randomly-sample-k-numbers-from-range-0n-1-n-k-without-replacement
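For reference, a minimal sketch of that O(k) approach from the linked answer (Robert Floyd's algorithm), written in Python here for illustration rather than as LightGBM code:

```python
# Robert Floyd's algorithm: sample k distinct integers from [0, n) with
# cost proportional to k, independent of n.
import random

def floyd_sample(n, k):
    chosen = set()
    for j in range(n - k, n):
        t = random.randint(0, j)        # uniform over [0, j]
        # If t was already picked, j itself is guaranteed to be free,
        # which keeps the result uniform over all k-subsets.
        chosen.add(j if t in chosen else t)
    return chosen

print(sorted(floyd_sample(10**6, 5)))   # fast even for huge n
```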
@pseudotensor Thanks for using LightGBM. The synthesized dataset is dense, so I think the group search in lines 125 to 134 (at commit 37e9878) won't find any available group; each feature will end up in its own group. This is a very extreme case. A possible solution would be to limit the maximum number of trials (iterations) in line 125.
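A hedged sketch of that mitigation, reusing the toy example above (my own pseudocode with an assumed cap value, not an actual patch): probe at most a fixed number of randomly chosen groups before giving up and opening a new one, which bounds the total work at O(N * MAX_TRIALS).

```python
# Bounded random probing instead of a full scan over all groups.
import random

MAX_TRIALS = 100  # assumed value; the real cap would need tuning

def group_features_capped(num_features, conflicts):
    groups = []
    for fidx in range(num_features):
        placed = False
        n_probe = min(MAX_TRIALS, len(groups))
        for group in random.sample(groups, n_probe):  # bounded probe
            if not conflicts(fidx, group):
                group.append(fidx)
                placed = True
                break
        if not placed:
            groups.append([fidx])
    return groups
```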
Do you have plans to implement this? Or should we write this into our feature requests?
@StrikerRUS We can have this in feature requests. A quick fix would be to randomly sample from the feature groups in line 125. How to sample the groups when the total number of groups is large is an open question. We have a plan to separate out the dataset construction; I think we may leave it to that part.
Closed in favor of #2302; we decided to keep all feature requests in one place. You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
LightGBM version: 3.1.1.99
Using datatable or numpy gives the same result: it gets "stuck" using 1 core for 10-20 minutes.
After the 10-20 minutes, I get a system OOM even though I have 64GB.
I read #1081, which also involves quite wide data. But in general, what recommendations are there for handling many columns and speeding things up without using too much memory?
I'm trying various things, but it seems like the way lgb uses 1 core for 20 minutes can be improved. E.g. in xgboost I/we used OpenMP for data ingestion, which speeds things up even though a lot of the operations are memory-bandwidth limited. That makes sense, since the lgb dataset construction operation is about 100x slower than generating the data itself, which is bad.
So it must be possible to parallelize the dataset construction, since the features are independent. E.g. one could even fork many jobs that each take a portion of the data and create Dataset objects, then use add_features_from to column-bind the features, as in the sketch below. Why isn't that done internally using OpenMP?
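For illustration, a minimal sketch of that column-binding idea, assuming LightGBM's Python API: Dataset.add_features_from requires both Datasets to be constructed first, and the data shapes, block count, and free_raw_data usage below are my assumptions.

```python
# Hypothetical sketch: split a wide matrix into column blocks, build a
# Dataset per block, then column-bind them with add_features_from.
import numpy as np
import lightgbm as lgb

X = np.random.rand(10_000, 1_000)
y = np.random.rand(10_000)

blocks = np.array_split(np.arange(X.shape[1]), 4)  # 4 column blocks

parts = []
for i, cols in enumerate(blocks):
    d = lgb.Dataset(X[:, cols],
                    label=y if i == 0 else None,  # label on first block only
                    free_raw_data=False)          # keep raw data for merging
    parts.append(d.construct())                   # binning happens here

merged = parts[0]
for part in parts[1:]:
    merged.add_features_from(part)                # column-bind features
```

Note this version is deliberately sequential and only demonstrates the merge step; each construct() call still pays the single-threaded binning cost, so real speedup would require running the per-block constructions in separate processes.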